ryands
0 Points
1 Post
How to fix the encoding when extracting text from a pdf using itextsharp?
Dec 20, 2012 08:10 PM
Hi,
I am extracting text from a PDF using iTextSharp, and the encoding does not seem to work. I have two methods for extracting the text, because method 1 works for some PDFs and method 2 works for others. I want to combine both but don't understand how...
Also, with method 2 the encoding gets messed up, i.e. whitespace characters come out with ASCII code 63 for some reason. Is there a way to fix this, so that I can call IndexOf with a string containing a space and have it match the whitespace in the extracted text?
public static bool does_document_text_have_keyword(string keyword, string pdf_src)
{
    PdfReader pdfReader = null;
    try
    {
        // One reader is enough for both methods; the original opened a second one per page.
        pdfReader = new PdfReader(pdf_src);
        string currentText;
        int count = pdfReader.NumberOfPages;
        for (int page = 1; page <= count; page++)
        {
            // method_1
            currentText = PDFParser.ExtractTextFromPDFBytes(pdfReader.GetPageContent(page)) + " ";
            if (currentText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) != -1) return true;

            // method_2
            StringWriter output = new StringWriter();
            output.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, page, new SimpleTextExtractionStrategy()));
            currentText = fix_encoding(output.ToString());
            if (currentText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) != -1) return true;
        }
        return false;
    }
    catch
    {
        return false;
    }
    finally
    {
        // Close the reader on every path, including the early returns above.
        if (pdfReader != null) pdfReader.Close();
    }
}
Rajesh Sawan...
Participant
1612 Points
246 Posts
Re: How to fix the encoding when extracting text from a pdf using itextsharp?
Dec 20, 2012 08:32 PM
Hi,
You can try the link below:
http://stackoverflow.com/questions/4784385/extract-data-from-pdf-files
or use code like this:
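private string ExtractText()
{
    PdfReader reader = new PdfReader(Server.MapPath(P_InputStream3));
    try
    {
        // Extracts page 2 only; loop up to reader.NumberOfPages to cover the whole document.
        return PdfTextExtractor.GetTextFromPage(reader, 2, new LocationTextExtractionStrategy());
    }
    finally
    {
        reader.Close();
    }
}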
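For the keyword search in the question, the same call can be applied to every page rather than a single one. The following is only a sketch under assumptions: the class and method names (PdfKeywordSearch, ContainsKeyword) are made up for illustration, and it has not been run against the asker's PDFs.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfKeywordSearch
{
    // Hypothetical helper: returns true if the text extracted from any page contains the keyword.
    public static bool ContainsKeyword(string pdfPath, string keyword)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page, because a strategy instance accumulates text chunks.
                string text = PdfTextExtractor.GetTextFromPage(
                    reader, page, new LocationTextExtractionStrategy());

                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    return true;
                }
            }
            return false;
        }
        finally
        {
            // Release the file handle whether or not the keyword was found.
            reader.Close();
        }
    }
}
```

Unlike the code in the question, this version lets exceptions propagate, so an unreadable file is not silently reported as "keyword not found".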
Hope this will solve your problem.
//Happy Coding
Regards,
RajeshS.
april_123456
Participant
775 Points
246 Posts
Re: How to fix the encoding when extracting text from a pdf using itextsharp?
Dec 21, 2012 02:29 AM
Hello ryands!
To fix the encoding when extracting text from a PDF using iTextSharp, you may want to try the LocationTextExtractionStrategy.
Its documentation states: text extraction renderer that keeps track of relative position of text on page. The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.
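On the code-63 whitespace specifically: 63 is the ASCII code of '?', which is what .NET's ASCII encoding substitutes by default for any character it cannot represent. One possible cause, and this is an assumption since the fix_encoding helper is not shown, is that the PDF contains a non-ASCII space such as the non-breaking space U+00A0 and the text is converted through ASCII somewhere. A minimal sketch that replaces every Unicode whitespace character with a plain space before searching (WhitespaceNormalizer is a made-up name):

```csharp
using System;
using System.Text;

public static class WhitespaceNormalizer
{
    // Replaces every Unicode whitespace character (including U+00A0) with a plain space.
    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            sb.Append(char.IsWhiteSpace(c) ? ' ' : c);
        }
        return sb.ToString();
    }
}

public static class Demo
{
    public static void Main()
    {
        string extracted = "hello\u00A0world"; // non-breaking space between the words

        // Converting through ASCII turns the non-breaking space into '?' (code 63).
        Console.WriteLine(Encoding.ASCII.GetBytes(extracted)[5]);

        // Normalizing the string itself lets an ordinary space match.
        string normalized = WhitespaceNormalizer.Normalize(extracted);
        Console.WriteLine(normalized.IndexOf("hello world", StringComparison.OrdinalIgnoreCase));
    }
}
```

If this is what is happening, apply the normalization to the string returned by GetTextFromPage and avoid round-tripping it through an ASCII byte array.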
Hopefully this helps,
Best of Luck!
With Kind Regards,