When I used the following code in SQL Server 2000 Query Analyzer, it worked fine and I got results:
declare @myvar nvarchar(200)
set @myvar = N'Search Term in English'
set @myvar = N'''Select Filename,PATH,rank from SCOPE() where contains(Contents, ''''' + @myvar + N''''')'' '
set @myvar = N'select * from openquery(ISRV,' + @myvar + N') order by FileName'
print @myvar
EXEC (@myvar)
Note: ISRV is a linked server to Microsoft Indexing Service.
But when I used the same code to search for Arabic text, I got no results back:
declare @myvar nvarchar(200)
set @myvar = N'كلمة بحث بالعربية'
set @myvar = N'''Select Filename,PATH,rank from SCOPE() where contains(Contents, ''''' + @myvar + N''''')'' '
set @myvar = N'select * from openquery(ISRV,' + @myvar + N') order by FileName'
print @myvar
EXEC (@myvar)
What is wrong?
How can I fix the above so that I can perform a full-text search using Arabic text?
I did a lot of testing and investigation ... I think I know what is happening...!
Under Windows 7 (on my work laptop) I recreated Indexing Service in a very simple setup, using one text file, "test.txt" (created in Notepad), with a few English and Arabic words. I also added a linked server on SQL Express (local machine) pointing to my Indexing Service catalog.
All was working fine when I searched for English keywords. But when I searched using Arabic, it did not work...
So, I opened "test.txt" and saved the file as "Unicode" ... and YES ... now searching for Arabic words works properly.
I used the following queries:
select * from openquery(IDXSRV, 'Select Filename,PATH,Characterization from SCOPE() where contains(''حادث'') ') order by FileName
select * from openquery(IDXSRV, 'Select Filename,PATH,Characterization from SCOPE() where contains(''"this is a test"'') ') order by FileName
select * from openquery(IDXSRV, 'Select Filename,PATH,Characterization from SCOPE() where contains(''جملة'') ') order by FileName
When the file was saved using ANSI encoding, the "Characterization" result of the query was shown in the "Results" panel as follows:
This is a test åÐå ÌãáÉ ÈÇáÚÑÈíÉ ÍÇÏË.
But, when I saved the same text file as Unicode, the "Characterization" was showing properly in the result window:
This is a test هذه جملة بالعربية حادث.
This means, to my understanding, that the search word I type in the query window is sent as Unicode, so the indexed data must also be stored in a Unicode encoding.
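The garbled "Characterization" text above is consistent with a code-page mismatch, and that can be checked offline. The following is a minimal Python sketch (Python is not part of the original setup; it is used here only as a scratchpad). It assumes the ANSI file was written with the Arabic code page Windows-1256 and then read as Western European Windows-1252. That assumption is inferred from the garbled output, not confirmed for Indexing Service itself.

```python
import codecs

# The Arabic part of test.txt, as shown in the correct "Characterization" output.
arabic = "هذه جملة بالعربية حادث"

# Assumption: Notepad's "ANSI" save used the Arabic code page (Windows-1256).
ansi_bytes = arabic.encode("cp1256")

# Assumption: the reader interpreted those bytes as Windows-1252 instead.
misread = ansi_bytes.decode("cp1252")

# The result matches the garbled text reported in the thread.
assert misread == "åÐå ÌãáÉ ÈÇáÚÑÈíÉ ÍÇÏË"

# Reversing the two steps recovers the original text: nothing was lost,
# the bytes were only interpreted with the wrong code page.
recovered = misread.encode("cp1252").decode("cp1256")

# Notepad's "Unicode" option writes UTF-16 little-endian with a byte order
# mark, which tells a reader unambiguously how to decode the bytes.
unicode_bytes = codecs.BOM_UTF16_LE + arabic.encode("utf-16-le")
```

If this reading is right, the index held the Western-looking characters for the ANSI file, so a Unicode Arabic search term had nothing to match against.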
Now, on the production server, I have thousands of XML and PDF files, and only searching with English words works; when I specify Arabic words, it does not work...
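There is a section in the MSDN article at http://msdn.microsoft.com/en-us/library/aa902664(v=SQL.80).aspx#sql_arabicsupport_fulltextsearch that goes into the full-text support for Arabic in SQL Server 2000 and how to set it up.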
Also, one thing to keep in mind about documents such as XML, PDF, or DOC/DOCX is that they can carry a language specification within the document. Full-text search will use the word breaker for the specified language to break the words. If you mix languages, then you have to say in the document where you are switching languages, or full-text search will use the word breaker for the language specified and may not break the words correctly.
For example, the following says that the text is in English, but Arabic is mixed in. Full-text search will load only the English word breaker, because the XML said that all characters that follow are English when in reality the text contains both English and Arabic. So the English word breaker breaks the text using its own linguistic rules instead of Arabic ones. The end result is that the words are broken incorrectly, so when you search for the Arabic phrase, the full-text index cannot find it.
<text xml:lang="en">This is an example of saying this text is in English, but I'm also mixing in Arabic
كلمة بحث بالعربية</text>
So, it is important that the documents mark Arabic text as Arabic; otherwise full-text search will use the word breaker matching the language specified in the text, or the default specified when the full-text index was created.
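One way to find documents with this kind of mismatch before indexing is to scan them for Arabic characters inside elements that declare a non-Arabic xml:lang. The following is a minimal Python sketch of that idea. The function names are hypothetical, and it only checks the xml:lang attribute set directly on an element (not inherited values) and only the Arabic block U+0600 to U+06FF.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def has_arabic(text):
    """Return True if the text contains a character from the Arabic block."""
    return any("\u0600" <= ch <= "\u06ff" for ch in text)


def mismatched_elements(xml_source):
    """List (tag, lang) pairs whose declared language is not Arabic
    but whose leading text contains Arabic characters."""
    root = ET.fromstring(xml_source)
    hits = []
    for elem in root.iter():
        lang = elem.get(XML_LANG, "")
        text = elem.text or ""
        if lang and not lang.lower().startswith("ar") and has_arabic(text):
            hits.append((elem.tag, lang))
    return hits


sample = '<text xml:lang="en">This is English mixed with كلمة بحث بالعربية</text>'
flagged = mismatched_elements(sample)  # the element is flagged with lang "en"
```

Elements flagged this way are candidates for splitting into separately tagged runs, so that each run is handed to the matching word breaker.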
One of the nice features of SQL Server 2008, 2008 R2, and 2012 is that they include a dynamic management function, sys.dm_fts_parser, to help test how word breakers, the thesaurus, etc. will break the words. There are also dynamic management views and functions, such as sys.dm_fts_index_keywords, that actually return the data from the full-text index so you can see how it actually appears in the full-text index.
If you are a heavy full-text user, I would encourage you to consider upgrading to a newer version of SQL Server just to be able to use these tools to troubleshoot scenarios like this and isolate how the text is being handled.
However, there is one major thing not covered in the article you referenced: the data source here is files indexed by Microsoft Indexing Service on Windows Server 2003.
I have now confirmed that full-text search for Arabic works on all Office files (.DOC, .PPT, .XLS, ...) on the same server (Windows Server 2003); only XML and PDF files do not work for Arabic search.
I tried to reproduce the problem on my laptop (Windows 7), and the result is that Arabic full-text search works on all files except XML, whose content is not visible at all in any search against file content (neither English nor Arabic). It only worked when I installed an iFilter for XML; then I was able to search node contents in both English and Arabic. I was also able to define custom properties and map them to specific node names in the XML file.
On Windows Server 2003, search against XML content works out of the box, as if XML were a regular text file, but only for English text. So I can search against node names and also node contents, but the search engine does not recognize the difference.
After doing a lot of testing and investigation, I found out the following on SQL Server 2000 under Windows Server 2003 using Indexing Service:
1. Arabic text in static PDF files is stored using the Unicode values of the connected character shapes, in reverse order!
2. Arabic text in MS Office files is stored using the Unicode values of the isolated character shapes, in normal order. This is the correct way, as I expected.
I verified this when searching for the "Sick Leave" form requests. I found, by chance, the Arabic text for "Sick" inside the "Characterization" result field, and while it looks normal, when I checked the Unicode value of each letter, I figured out what was going wrong. See this:
print unicode(substring(N'ﺔﻴﺿﺮﻣ', 5,1))
print unicode(substring(N'مرضية', 1, 1))
print nchar(65251)
print nchar(1605)
--- result is --->
65251
1605
ﻣ
م
The following queries return entirely different results:
select *
from openquery(ISRV, 'Select Filename,PATH,rank,url,characterization from SCOPE() where contains(contents, ''"ﺔﻴﺿﺮﻣ"'')')
order by FileName
select *
from openquery(ISRV, 'Select Filename,PATH,rank,url,characterization from SCOPE() where contains(contents, ''"مرضية"'')')
order by FileName
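The relationship between the two search strings can be reproduced offline. The following is a minimal Python sketch (the helper name is hypothetical) that reverses the PDF-style string and folds its presentation-form characters back to the base Arabic letters with Unicode NFKC normalization. It assumes a single right-to-left word: mixed-direction text would need a real bidirectional algorithm, and NFKC also folds other compatibility characters.

```python
import unicodedata

# Text as it appears in "Characterization" for the PDF file: ﺔﻴﺿﺮﻣ
# (visual order, code points from Arabic Presentation Forms-B).
pdf_text = "\ufe94\ufef4\ufebf\ufeae\ufee3"

# Text as typed in the query window and as found in Office files: مرضية
office_text = "\u0645\u0631\u0636\u064a\u0629"


def visual_forms_to_logical(text):
    """Reverse the characters, then fold presentation forms to base letters."""
    return unicodedata.normalize("NFKC", text[::-1])


normalized = visual_forms_to_logical(pdf_text)

# Same check as the T-SQL above: the 5th character of the PDF string and the
# 1st character of the typed string are different code points for one letter.
pdf_code = ord(pdf_text[4])        # 65251
office_code = ord(office_text[0])  # 1605
```

After this conversion the PDF-style string compares equal to the typed one, which supports the observation that the two CONTAINS queries search for different code points even though the words look the same on screen.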
Now the next question is: why is the Arabic text inside the static PDF file stored in this weird format?
I have stored some Arabic text in a regular TXT file, but so far it has not been picked up by the scanning engine (after one week). Once it is scanned, I will confirm the result.
I think I need to post this question to Adobe Support.
tarekahf
Member
143 Points
272 Posts
SQL Server 2000 Full Text Search using CONTAINS to search Arabic Text is not working
Jul 28, 2012 01:01 PM|LINK
Tarek.
tarekahf
Re: SQL Server 2000 Full Text Search using CONTAINS to search Arabic Text is not working
Jul 28, 2012 01:37 PM|LINK
Looks like the solution is found here:
http://objectmix.com/inetserver/292452-arabic-search-index-service-asp-net-problem.html
But how can I set the locale identifier from SQL Server 2000 Query Analyzer?
Tarek.
tarekahf
Re: SQL Server 2000 Full Text Search using CONTAINS to search Arabic Text is not working
Jul 29, 2012 09:21 AM|LINK
How can I solve this problem?
Tarek.
tarekahf
Re: SQL Server 2000 Full Text Search using CONTAINS to search Arabic Text is not working
Aug 01, 2012 11:22 AM|LINK
Hope someone can help me with this issue.
Tarek.
cts-rbeene
Member
16 Points
3 Posts
Re: SQL Server 2000 Full Text Search using CONTAINS to search Arabic Text is not working
Aug 02, 2012 02:28 PM|LINK
Hey Tarek,
Thank you for your post. Have you taken a look at http://msdn.microsoft.com/en-us/library/aa902664(v=SQL.80).aspx#sql_arabicsupport_fulltextsearch?
One of the nice features of SQL Server 2008, 2008 R2, and 2012 is that they include a dynamic management function, sys.dm_fts_parser, to help test how word breakers, the thesaurus, etc. will break the words.
sys.dm_fts_parser('query_string', lcid, stoplist_id, accent_sensitivity)
tarekahf
Re: SQL Server 2000 Full Text Search using CONTAINS to search Arabic Text is not working
Aug 03, 2012 11:16 PM|LINK
Thanks a lot for the very informative reply.
Very strange ....
Tarek.
tarekahf
Re: SQL Server 2000 Full Text Search using CONTAINS to search Arabic Text is not working
Aug 06, 2012 10:36 PM|LINK
Tarek.