Last post Aug 01, 2011 08:11 AM by rohitjpuranik
Jul 11, 2007 01:49 AM|gavinharriss|LINK
Hi, I was wondering if any SQL Server gurus out there could help me...
I have a table I'm trying to apply a full text catalog to, however no results are ever returned due to the text column being cataloged being of varbinary(max) that's being populated from a converted nvarchar(max) value - I've narrowed it down to this specifically,
populating with non nvarchar text seems to work fine.
To re-create the problem quickly...
If I populate the column via
CONVERT(varbinary(max), 'test text')
then there is no problem, I get results as expected.
However if I populate the column via
CONVERT(varbinary(max), CAST('test text' as nvarchar(max)))
no results are ever returned.
Is this a bug with SQL Server 2005 Full Text Indexing? I'm happily creating full text catalogs when an nvarchar is not getting converted into a varbinary.
I'm setting the Document Type column to '.html' (I've tried changing this to '.txt' in case it was a fault with the html ifilter but the problem persists so I believe I can rule this out).
The reason I need to convert an nvarchar to varbinary is that the table holds multi-lingual text and I'm adding a html meta tag <META NAME="MS.LOCALE" CONTENT="ES"> to the beginning in order for the full text indexing word breaker to select the correct language
to catalog the text with. The aim being to provide more relevant searches in users native languages (I've read a few articles that describe this technique, but it's the first time I've tried to apply it).
Any pointers / suggestions would be greatly appreciated. Cheers,
Jul 11, 2007 02:41 AM|M A A Mehedi Hasan|LINK
Jul 11, 2007 05:36 AM|gavinharriss|LINK
Thanks for the input - unfortunately the article you provided seems to be an overview of creating full text catalogs and not related to my specific problem. I have no problem creating full text catalogs and using them
unless the varbinary column source is being populared from a converted (CONVERT) nvarchar input?!? It seems to be a very specific problem.
Jul 11, 2007 05:04 PM|gavinharriss|LINK
-- Create test database
CREATE DATABASE FullTextTest
-- Create test data table
CREATE TABLE TestTable
pk UNIQUEIDENTIFIER NOT NULL CONSTRAINT tablePK PRIMARY KEY,
-- The below single entry WILL BE FOUND (the text source is being entered directly)
INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES (NEWID(), CONVERT(VARBINARY(MAX),'<META NAME="MS.LOCALE" CONTENT="EN">test entry 1'), '.html')
-- The bellow two entries below WILL NOT BE FOUND (the text source is taken from an NVARCHAR(MAX) value)
INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES (NEWID(), CONVERT(VARBINARY(MAX), CAST('<META NAME="MS.LOCALE" CONTENT="EN">test entry 2' AS NVARCHAR(MAX))), '.html')
INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES (NEWID(), CONVERT(VARBINARY(MAX), CAST('<META NAME="MS.LOCALE" CONTENT="EN">test entry 3' AS NVARCHAR(MAX))), '.html')
-- Create the full text catalog
CREATE FULLTEXT CATALOG TEST AS DEFAULT
CREATE FULLTEXT INDEX ON TestTable (varbinarycol TYPE COLUMN documentExtension LANGUAGE 1033)
KEY INDEX tablePK
-- NOTE: You might need to give the catalog a chance to build before running the script below.
-- Now do a search that SHOULD RETURN 3 ROWS of data, but ONLY 1 ROW IS RETURNED
SELECT CAST(varbinarycol AS NVARCHAR(MAX)) FROM TestTable WHERE CONTAINS(varbinarycol, 'test')
Jul 11, 2007 07:01 PM|Motley|LINK
Going out on a limb here, but try:
CAST(varbinarycol AS NVARCHAR(MAX))
WHERE CONTAINS(varbinarycol, N'test')
CAST(varbinarycol AS NVARCHAR(MAX))
WHERE CONTAINS(varbinarycol, CONVERT(NVARCHAR(4),'test'))
I could be completely wrong, but I'm confused as to why you've stuffed HTML into a varbinary field. It seems to me the correct field definition would be nvarchar(max). It is probably getting confused because the UTF-8/Unicode version is not the same binarily
(I made that word up -- I think) as the varchar version of it.
Jul 11, 2007 09:22 PM|gavinharriss|LINK
Cheers for the input, but it doesn't solve the problem I'm afraid :(
There's actually a good reason for using a varbinary - it's because I want to store multiple languages in the same table. In order to get the full text index to use the correct word breakers on a row by row basis the extra html meta tag is being used to
describe the language the ifilter should go and fetch the relevant word breaker for. From what I understand, this is only possible with varbinary data. By default sql server expects a single language on a table. The technique is described in this article here
http://www.simple-talk.com/sql/learn-sql-server/sql-server-full-text-search-language-features/ (look for the heading "Via language tags in the content being indexed" about a quarter of the way down the article).
Jul 12, 2007 04:08 PM|Motley|LINK
try removing the period from the document extension?
Jul 12, 2007 05:39 PM|gavinharriss|LINK
I've tried that one as well and it doesn't help either :(
If you run my code example above you'll see that the one bit of html being populated without conversion into nvarchar does get found. Just the two records that use nvarchar that don't. All 3 records are using '.html' (but I have tried with '.htm', 'html',
'htm' and still no joy).
Jul 15, 2007 03:58 PM|gavinharriss|LINK
I've also tried adding the following extra meta tag - <META http-equiv="Content-Type" content="text/html; charset=utf-16">
But this seemed to break the Full Text Index even further. If applied to the example T-SQL I give above then even the non-nvarchar input is no longer returned:
INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES (NEWID(), CONVERT(VARBINARY(MAX),'<META http-equiv="Content-Type" content="text/html; charset=utf-16"><META NAME="MS.LOCALE" CONTENT="EN">test entry 1'), '.html')
Apr 23, 2008 05:45 AM|Sindbad_nav|LINK
I want to extend this topic with my basic full text search question and glad to have a positive feedback. I am working on something similar to Gavin's project and I read lot of articles for this but still confused how to start. So my question is that I have
html pages (almost 8000 non english) in which I want to make search and populate their links for view the html file. In my database table I have one column named 'Path' with VARCHAR() datatype which contains path for the html pages i.e. 'A10987.html' which
are indexed and stored in folder outside the database and other column 'Extension' with VARCHAR() datatype which contains extension for the html file i.e '.html', the questions making me confused are:
1. How file name will be in the column "A10987.html" or "A10987" because the file extension is already saved in "Extension" column i.e. ".html".
2. What data types I have to use for file paths.
3. To search in html files we need to store the file data with all the tags in database table or save the html files in separate folder like I did and use the path to search it.
4. Or any other way to search in html pages.
Thanks for any help.
Apr 23, 2008 06:09 AM|gavinharriss|LINK
I don't think SQL Server 2005 will be able to index your external files for it's full text search functionality. I believe you'll find that the information to be indexed must be stored by SQL Server itself.
Apr 23, 2008 07:56 AM|Sindbad_nav|LINK
Thanks Gav for your quick response. I already just finished experimenting the tip you gave. I upload the html files through my web form and SQL Server indexed the files it self properly and place them in a folder with the same name of the DB table. I just
need some more tips to move further. Like I want to know that am I doing fine and what next I have to do. I also want to mention that I am not using VARBINARY() datatype it is just a VARCHAR() datatype. how far I understand that we use VARBINARY() when we
need to save the large amount of data in to our table and if we are using file system then you can use VARCHAR(). may be I am wrong.
Apr 24, 2008 07:11 PM|gavinharriss|LINK
If you're just storing text files in a database column then you should be fine using just VARCHAR(MAX). VARBINARY would only be needed if you were storing things like images and pdfs for instance.
You're probably best starting off a new thread with your problem as it's not really related to the problem discussed here. You might get better feedback from the community then.
Best of luck,
Jul 31, 2010 05:50 AM|Motlajs|LINK
probably too late, but for anybody encountering the same problem:
one have to add the UNICODE SIGNATURE - 0xFFFE - at the beginning of, if the being cast value is UNICODE (NVARCHAR(xxx) | NCHAR | NTEXT ....):
0xFFFE + CONVERT(varbinary(max), CAST('test text' as nvarchar(max)))
That solve my problem as well.
Enclosing text into <html><body> xxxxx </body></html> does not help.
Just let's imagine:
the fulltext is binary-based, that means, that fulltext code takes the VARBINARY as chunk of bytes, just converting them probably converting them to UTF-8, ANSI or something different than UNICODE.
So casting some text to UNICODE means having 0x650066006600 - every second byte is 00 - usually in PCHAR it means end of string. Without the instructing UNICODE SIGNATURE 0xFFFE fulltext engine is blind and does simple binary-base memory copy to probably
one-byte string encoding.
Aug 01, 2011 08:11 AM|rohitjpuranik|LINK
I am storing .pdf file in the varbinary(max) column. When I am executing the search query using contains or freetext on the varbinary(max) column I can find the correct row(s).
But my issue is can i get the Paragraph or the whole sentense /line where that searching word found.
Is this possible using Free Text Search ?