Imports System
Imports System.Text
Imports System.Text.RegularExpressions
Imports GemBox.Document

Module Program
    Sub Main()
        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim document As DocumentModel = DocumentModel.Load("CustomInvoice.pdf")
        Dim sb As New StringBuilder()

        ' Read PDF file's document properties.
        sb.AppendFormat("Author: {0}", document.DocumentProperties.BuiltIn(BuiltInDocumentProperty.Author)).AppendLine()
        sb.AppendFormat("DateContentCreated: {0}", document.DocumentProperties.BuiltIn(BuiltInDocumentProperty.DateLastSaved)).AppendLine()

        ' Sample's input parameter.
        Dim pattern As String = "(?<WorkHours>\d+)\s+(?<UnitPrice>\d+\.\d{2})\s+(?<Total>\d+\.\d{2})"
        Dim regex As Regex = New Regex(pattern)

        Dim row As Integer = 0
        Dim line As StringBuilder = New StringBuilder()

        ' Read PDF file's text content and match a specified regular expression.
        For Each match As Match In regex.Matches(document.Content.ToString())
            line.Length = 0

            ' VB has no ++ operator ("++row" is just unary plus applied twice),
            ' so increment the counter explicitly.
            row += 1
            line.AppendFormat("Result: {0}: ", row)

            ' Either write only successfully matched named groups or entire match.
            Dim hasAny As Boolean = False
            For i As Integer = 1 To match.Groups.Count - 1
                Dim groupName As String = regex.GroupNameFromNumber(i)
                Dim matchGroup As Group = match.Groups(i)
                ' Unnamed groups are named after their number, so skip those.
                If matchGroup.Success AndAlso groupName <> i.ToString() Then
                    line.AppendFormat("{0}= {1}, ", groupName, matchGroup.Value)
                    hasAny = True
                End If
            Next

            If hasAny Then
                ' Drop the trailing ", " separator.
                line.Length -= 2
            Else
                line.Append(match.Value)
            End If

            sb.AppendLine(line.ToString())
        Next

        Console.WriteLine(sb.ToString())
    End Sub
End Module
How to read and extract data from pdf file in vb
Jan 10, 2018 06:05 PM | MikeT89
Hi all,
When I open the PDF file in a viewer everything looks fine, but when I read and parse that same file in code, a bunch of extra characters or tags show up. As a result, when my code looks for a specific string, it doesn't find it.
For example:
When I open the PDF file I see this:
Membership ID: 1111111
But when I open and parse each line I get this:
MembershipMembership ID:ID: <<MemberId>>1111111
Can someone explain why those extra characters or tags are there, and how I can get rid of them or account for them in my code when reading and parsing PDF files?
I'm currently using the Aspose.PDF library.
Thank you
Re: How to read and extract data from pdf file in vb
Jan 11, 2018 07:56 AM | Eric Du
Hi MikeT89,
Based on your description, please check the following tutorials on using iTextSharp or another library to extract data from a PDF. They include example code you can test:
Read and Extract PDF Text in C# and VB.NET:
https://www.gemboxsoftware.com/document/examples/c-sharp-read-pdf/305
How to read PDF file using iTextSharp in ASP.NET:
http://www.devasp.net/net/articles/display/1447.html
Best Regards,
Eric Du
Please remember to click "Mark as Answer" the responses that resolved your issue.
If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.
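
Independently of which library does the extraction, one way to account for the extra text in code is to match the value with a tolerant regular expression instead of searching for the exact string. The following is only a minimal sketch, not taken from any of the linked tutorials. The module name, the ExtractMemberId helper, and the pattern are hypothetical, and it assumes the noise always has the shape shown in the question: a repeated label and an optional <<...>> placeholder in front of the digits.

```vb
Imports System
Imports System.Text.RegularExpressions

Module MemberIdSketch
    ' Hypothetical helper: returns the digits that follow "ID:" on a line,
    ' skipping an optional <<...>> placeholder, or Nothing if nothing matches.
    Function ExtractMemberId(rawLine As String) As String
        ' ID:             literal label (duplicates before it are simply skipped)
        ' \s*             optional whitespace
        ' (?:<<[^>]*>>)?  optional placeholder such as <<MemberId>>
        ' (?<Id>\d+)      the number we actually want
        Dim pattern As String = "ID:\s*(?:<<[^>]*>>)?\s*(?<Id>\d+)"
        Dim m As Match = Regex.Match(rawLine, pattern)
        Return If(m.Success, m.Groups("Id").Value, Nothing)
    End Function

    Sub Main()
        ' The noisy line from the question and the clean line from the viewer.
        Console.WriteLine(ExtractMemberId("MembershipMembership ID:ID: <<MemberId>>1111111"))
        Console.WriteLine(ExtractMemberId("Membership ID: 1111111"))
    End Sub
End Module
```

Both calls print 1111111. If the extracted text turns out to contain other kinds of noise, the pattern would need to be adjusted accordingly.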
Re: How to read and extract data from pdf file in vb
Dec 19, 2018 09:11 AM | Rahil Saxena
You may try this code: