Last post Aug 19, 2009 06:49 AM by Parp
Jul 24, 2009 12:30 PM|Parp
Hi, I have around 40 Word documents of between 100 and 300 pages each. What I'd like to do is index them so that when I do a search, I get a list of each document that the search string appears in AND the pages it appears on (which could be multiple pages within the one document).
I'm toying with the idea of putting each Word doc into SQL Server, full-text indexing them and then, for each document that matches the search query, using the Word API to find each occurrence. This seems extremely inelegant though and probably very
Can any of you point me in the direction of a more sensible solution please?
Aug 19, 2009 06:49 AM|Parp
Here's the solution I used, if it's any use.
Instead of using Word docs, I used PDF docs.
Using PDFBox, I extracted the text from each page and inserted it as a row into a table, along with the corresponding page number. So if a PDF doc has 10 pages then the database has 10 rows (and 1 entry in another table with the details about the document, URL etc.).
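For anyone who wants to see the shape of that layout, here is a minimal sketch rather than my actual code. It uses Python with SQLite purely for illustration, and it assumes the text of each page has already been extracted (by PDFBox or anything else) into a list of strings. The table names, column names and the store_document helper are all hypothetical.

```python
import sqlite3


def create_schema(conn):
    # One row per document, one row per page (names are illustrative only).
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            url   TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS pages (
            document_id INTEGER NOT NULL REFERENCES documents(id),
            page_number INTEGER NOT NULL,
            body        TEXT NOT NULL,
            PRIMARY KEY (document_id, page_number)
        );
        """
    )


def store_document(conn, title, url, page_texts):
    # page_texts is assumed to hold already-extracted text, one string per page.
    cur = conn.execute(
        "INSERT INTO documents (title, url) VALUES (?, ?)", (title, url)
    )
    doc_id = cur.lastrowid
    # Page numbers start at 1 so they can be reused directly in page links.
    conn.executemany(
        "INSERT INTO pages (document_id, page_number, body) VALUES (?, ?, ?)",
        [(doc_id, n, text) for n, text in enumerate(page_texts, start=1)],
    )
    conn.commit()
    return doc_id
```

Calling store_document(conn, "Manual", "manual.pdf", texts) with a 10-item list gives 1 row in documents and 10 rows in pages, which matches the layout described above.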
Then when doing a search, the result set consists of each page in the PDF document that has an occurrence of the search term. In the results I display an extract with the matching search term in bold, and a link to the PDF file at the corresponding page (pdffile.pdf#page=1 if my memory is right).
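Roughly, building a result for each matching page could look like the sketch below. It is an illustration in Python, not the code I used: it assumes the page rows have already been fetched as (url, page_number, body) tuples, and it does a plain case-insensitive substring match in memory instead of the database search. The function names, the dictionary keys and the 40-character extract width are all made up for the example.

```python
import html
import re


def make_extract(body, term, width=40):
    # Return an HTML snippet around the FIRST match with the term wrapped
    # in <b> tags, or None if the term does not occur on this page.
    match = re.search(re.escape(term), body, re.IGNORECASE)
    if match is None:
        return None
    start = max(match.start() - width, 0)
    end = min(match.end() + width, len(body))
    before = html.escape(body[start:match.start()])
    hit = html.escape(match.group(0))
    after = html.escape(body[match.end():end])
    return f"{before}<b>{hit}</b>{after}"


def search_pages(pages, term):
    # pages: iterable of (url, page_number, body) tuples, one per PDF page.
    if not term:
        return []
    results = []
    for url, page_number, body in pages:
        extract = make_extract(body, term)
        if extract is not None:
            results.append(
                {
                    "link": f"{url}#page={page_number}",  # opens the PDF at that page
                    "page": page_number,
                    "extract": extract,
                }
            )
    return results
```

Each result carries the page link and the extract, so a document with hits on several pages shows up once per matching page.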
Extracting the text from the PDF can take a while but the rest works well. Email me if you'd like code extracts etc.