Complex Images, PDF, text, OCR question

Last post 10-29-2009 8:06 AM by raja.s. 7 replies.

  • Complex Images, PDF, text, OCR question

    01-03-2009, 9:50 PM
    • Member
      53 point Member
    • Xcraft
    • Member since 08-14-2002, 12:16 PM
    • Posts 57

     Hi all,

     

    I have been looking around for a solution to this problem:

    I have a PDF file that contains scans of pages from a book (in other words, the PDF contains images of text).

    My question: is there a way to read/extract the text from the images, with OCR or something?

    Are there components or software for this?

    Thanks a lot,

    Mike

     

  • Re: Complex Images, PDF, text, OCR question

    01-04-2009, 12:52 AM
    Answer
    • Star
      10,633 point Star
    • ps2goat
    • Member since 11-17-2006, 10:43 PM
    • Posts 1,942

    Pegasus (http://www.pegasusimaging.com/) has stuff for OCR. Many other companies do, too. Licensing for some, including Pegasus, is quite expensive for small operations.

    Just do a search for .NET OCR component.

    ---------------------------------------
    MCP - Web Based Client Development .NET 2.0
  • Re: Complex Images, PDF, text, OCR question

    01-05-2009, 12:22 PM
    Answer
    • Member
      750 point Member
    • doknek
    • Member since 05-07-2007, 7:44 PM
    • Posts 143

    Why not go with a really good open source OCR for .NET...

    Tesseract OCR: http://code.google.com/p/tesseract-ocr/ (forum is very useful)

    .NET DLL here: http://www.pixel-technology.com/freeware/tessnet2/

     

    We've used it for extracting text from TIFF files, and I am sure it would also work with PDF files once the pages are converted to images. It is one of the best OCR engines I've found. I tried the Pegasus one, which only gave us 30% accuracy, but Tessnet2 gave us around 70-80% accuracy on old documents.
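
    Rendering the PDF pages to images and then running OCR on each one can be sketched roughly as follows. This is only an illustration under assumptions, not the Tessnet2 API: it is written in Python rather than .NET, it shells out to the tesseract command-line program instead of the DLL, and it assumes Poppler's pdftoppm and a reasonably current tesseract are installed and on PATH. The function names (rasterize_command, ocr_command, ocr_pdf) are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm (Poppler) command that renders each PDF page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, out_base, lang="eng"):
    """Build a tesseract command; the recognized text goes to <out_base>.txt."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir):
    """Render a scanned PDF to page images, OCR each page, return the joined text."""
    # Assumption: both external tools are installed; fail early if they are not.
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")

    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)

    # Step 1: one PNG per page, named page-<n>.png inside work_dir.
    subprocess.run(rasterize_command(pdf_path, work / "page"), check=True)

    # Step 2: OCR every page image in page order and collect the text files.
    pages = []
    for image in sorted(work.glob("page-*.png")):
        out_base = image.with_suffix("")  # e.g. work_dir/page-01
        subprocess.run(ocr_command(image, out_base), check=True)
        pages.append(Path(f"{out_base}.txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

    Calling ocr_pdf("book.pdf", "ocr_work") would leave the page images and per-page .txt files in ocr_work and return the combined text. The 300 dpi default is just a common starting point for scanned text; how accurate the result is depends heavily on the scan quality.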

     

  • Re: Complex Images, PDF, text, OCR question

    01-05-2009, 2:01 PM
    • Star
      10,633 point Star
    • ps2goat
    • Member since 11-17-2006, 10:43 PM
    • Posts 1,942

    I hate open source licenses =/.   Too wordy; either say I can reuse it and make money or I can't.

    I've never used Pegasus, I've just seen it in places and knew it was available.  Mostly, I just wanted to give the asker a term to search for.

    ---------------------------------------
    MCP - Web Based Client Development .NET 2.0
  • Re: Complex Images, PDF, text, OCR question

    01-06-2009, 3:38 PM
    • Member
      53 point Member
    • Xcraft
    • Member since 08-14-2002, 12:16 PM
    • Posts 57

    I found that Acrobat OCRs the TIFF files perfectly.

  • Re: Complex Images, PDF, text, OCR question

    03-11-2009, 12:43 PM
    • Contributor
      5,018 point Contributor
    • mkamoski
    • Member since 07-04-2002, 8:05 PM
    • ZULU-0500
    • Posts 1,376

    ps2goat:

    I hate open source licenses =/.   Too wordy; either say I can reuse it and make money or I can't.

    Me too.

    However, if one reads them even a little, it is quick to see whether a given license is good or not.

    For example, the Apache license, now used for Tesseract, is, IMHO, very good.

    Others, the FSF's GPL for example, are very bad ("must release your project as open source"), last I checked. Not good at all.

    That said-- I agree with you-- make it free or not, period.

    Thank you for bringing it up.

    Thank you.

    -- Mark Kamoski

  • Re: Complex Images, PDF, text, OCR question

    03-11-2009, 1:21 PM
    • Contributor
      5,018 point Contributor
    • mkamoski
    • Member since 07-04-2002, 8:05 PM
    • ZULU-0500
    • Posts 1,376

    FWIW, as of now... IMHO...

    GNU Lesser GPL = good

    GNU GPL = bad

    As noted here...

    http://www.gnu.org/licenses/why-not-lgpl.html

    The GNU Project has two principal licenses to use for libraries. One is the GNU Lesser GPL; the other is the ordinary GNU GPL. The choice of license makes a big difference: using the Lesser GPL permits use of the library in proprietary programs; using the ordinary GPL for a library makes it available only for free programs.

    ...so that's just part of the puzzle, I think.

    HTH.

    Thank you.

    -- Mark Kamoski

  • Re: Complex Images, PDF, text, OCR question

    10-29-2009, 8:06 AM
    • Member
      12 point Member
    • raja.s
    • Member since 10-28-2009, 8:01 AM
    • Chennai
    • Posts 10

    Hi...

    "Is there a way to read/extract the text from the images?"

    If you want to extract text from that scanned image, you can use Tessnet2.

    It's an open source API.

    http://www.pixel-technology.com/freeware/tessnet2/



    Thanks and Regards,
    Raja Subramanian.