Last post Sep 17, 2014 08:22 AM by Rion Williams
Sep 15, 2014 06:18 PM|firstname.lastname@example.org
I need to convert a PDF document to a text file, and I am using the iTextSharp DLL to do it. The whole document parses correctly except for one table that is drawn with lines. The table itself is parsed, but if a cell in the table is blank, that blank space is missing from the converted text file. Below is the format of the table:
Col1 Col2 Col3 Col4 Col5
1 Test1 2 5 Test6
2 3 Test7
3 Test6 9 Test8
The output that I see is like this:
1 Test1 2 5 Test6 <LF>
2 3 Test7<LF>
3 Test6 9 Test8<LF>
<LF> is a line feed.
Is there any way to preserve those spaces in the output? Below is the PDF parsing code:
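Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Sub ExtractTextFromPdf(path As String)
    Dim HeadLine As String
    Using reader As New PdfReader(path)
        Dim str As New StringBuilder()
        For i As Integer = 1 To reader.NumberOfPages
            ' Use a fresh strategy per page; a reused instance keeps the text of earlier pages.
            Dim its As ITextExtractionStrategy = New LocationTextExtractionStrategy()
            Dim thePage As String = PdfTextExtractor.GetTextFromPage(reader, i, its)
            Dim pdf31460Lines As String() = thePage.Split(ControlChars.Lf)
            For Each EachLine As String In pdf31460Lines
                If EachLine.Contains("SNEW") Then
                    HeadLine = EachLine
                    ' (the rest of the method was not included in the post)
                End If
            Next
        Next
    End Using
End Sub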
I have been searching for this for 3-4 days and couldn't find the right answer. I am using Visual Studio 2010, and help in either C# or VB.NET would be greatly appreciated.
Sep 17, 2014 08:20 AM|anuj_koundal
Try adding inline styles to your table:
<table style="border:1px solid black;">
Sep 17, 2014 08:22 AM|Rion Williams
Parsing PDFs as Text
The most reliable way to handle this would be to use an Optical Character Recognition (OCR) library, which attempts to "read" the contents of a specific object (such as a PDF or an image) and provide you with the actual text.
Tesseract is one of the best-known open-source OCR libraries and should be fairly simple to integrate into your project. Tessnet2 is also available; it is basically a .NET wrapper that lets you use Tesseract directly from .NET code.
You may also want to look into this Stack Overflow discussion, which covers several different techniques: one uses iTextSharp to read the content of a PDF, and another uses the PdfBox library to accomplish the same thing.
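If you stay with iTextSharp, one way to keep blank cells is to work from the position of each text chunk instead of the flattened string. The sketch below only illustrates that idea and is not code from the linked discussion: `PositionedText`, `LayoutRow`, and the `unitsPerChar` scale are hypothetical names, and the X values are made up. It assumes you have already collected the left X coordinate of each chunk in one row, for example in a custom render listener (in iTextSharp 5.x, `TextRenderInfo.GetBaseline().GetStartPoint()` should give you that point, but that part is not shown here).

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text

Module TableLayoutSketch

    ' Hypothetical container: one piece of text and its left edge on the page.
    Public Class PositionedText
        Public Property X As Single
        Public Property Text As String
    End Class

    ' Rebuilds one table row as fixed-width text. Each chunk is written at the
    ' character column X / unitsPerChar, so an empty cell shows up as spaces.
    Public Function LayoutRow(chunks As IEnumerable(Of PositionedText),
                              unitsPerChar As Single) As String
        Dim line As New StringBuilder()
        For Each chunk As PositionedText In chunks.OrderBy(Function(c) c.X)
            Dim column As Integer = Math.Max(0, CInt(Math.Floor(chunk.X / unitsPerChar)))
            If line.Length > 0 AndAlso column <= line.Length Then
                ' Keep at least one space between neighbouring chunks.
                column = line.Length + 1
            End If
            line.Append(" "c, column - line.Length)
            line.Append(chunk.Text)
        Next
        Return line.ToString()
    End Function

    Sub Main()
        ' Made-up row in which the second and fourth cells are empty.
        Dim row As New List(Of PositionedText) From {
            New PositionedText With {.X = 0, .Text = "2"},
            New PositionedText With {.X = 100, .Text = "3"},
            New PositionedText With {.X = 200, .Text = "Test7"}
        }
        Console.WriteLine(LayoutRow(row, 10))
    End Sub

End Module
```

With real data you would first need to group chunks into rows by their Y coordinate, and pick `unitsPerChar` so that the narrowest column still maps to at least one character.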
You can see a few more related methods of handling this below: