Issue with Extracting Tables from Scanned PDFs Using Aspose.OCR

williamclark · May 19, 2025, 5:30am

Hi all,

I am working on a project to extract tabular data from scanned PDF documents. I have read through the Aspose handbook and run text recognition with Aspose.OCR on several scanned pages, and the recognition seems reasonably accurate. The difficulty I am having is retaining the tabular structure.

At this stage, only the text is produced; the row/column structure is missing or barely present.

I came across this thread: https://forum.aspose.app/t/word-with-Devops-tutorial-table-inside-convert-to-txt/67824 but I am still facing the issue.

Has anyone here who works for Aspose, or who has used Aspose.OCR or any other Aspose components, had reliable success extracting tables from scanned PDFs? More specifically, I need output in a structured format that preserves the table layout (such as CSV or JSON) for later programmatic use. Are there parameter settings within Aspose.OCR, or preprocessing steps, that you recommend?

Any and all recommendations are welcome, including strategies that use other tools alongside Aspose components. To reiterate, accuracy is critical to my use case, so if you have any sample code, I would be extremely appreciative.

Thanks in advance for any assistance or advice.
Best,
williamclark

atir.tahir · May 19, 2025, 8:23am

@williamclark

To better assist you with the issue you’re experiencing regarding table structure retention, could you please clarify a few things:

Are you using our standalone/back-end API (Aspose.OCR for .NET/Java)?
If yes, could you share which version of the API you are using?
Would it be possible for you to share a sample scanned PDF document (or a representative page) that demonstrates the issue? This will help us better understand the structure and provide more tailored guidance or a workaround.

Once we have these details, we’ll be in a much better position to assist you with table detection and extraction in a structured format like CSV or JSON.
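
While those details are pending, here is a rough, library-agnostic sketch of one common way to rebuild rows and columns from OCR output that carries word positions. It does not use any Aspose.OCR API. The input format (a list of `(text, x_left, y_top, width)` tuples), the function names, and the pixel tolerances are all assumptions made for illustration; they would need adapting to whatever word or line coordinates your OCR engine actually returns.

```python
import csv
import io


def group_rows(words, row_tol=10):
    """Cluster words into rows by comparing each word's top y to the row's first word."""
    rows = []
    for word in sorted(words, key=lambda w: w[2]):
        if rows and abs(word[2] - rows[-1][0][2]) <= row_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    return rows


def merge_cells(row, col_gap=30):
    """Merge horizontally close words in one row into (x_left, text) cells."""
    cells = []
    right = None  # right edge of the previous word
    for text, x, _y, w in sorted(row, key=lambda w: w[1]):
        if cells and x - right <= col_gap:
            cells[-1] = (cells[-1][0], cells[-1][1] + " " + text)
        else:
            cells.append((x, text))
        right = x + w
    return cells


def column_anchors(rows_of_cells, col_tol=20):
    """Cluster cell left edges (across all rows) into column start positions."""
    anchors = []
    for x in sorted(x for row in rows_of_cells for x, _ in row):
        if not anchors or x - anchors[-1] > col_tol:
            anchors.append(x)
    return anchors


def words_to_table(words, row_tol=10, col_gap=30, col_tol=20):
    """Return a list of rows, each a list of cell strings aligned to shared columns."""
    rows = [merge_cells(r, col_gap) for r in group_rows(words, row_tol)]
    anchors = column_anchors(rows, col_tol)
    table = []
    for row in rows:
        line = [""] * len(anchors)
        for x, text in row:
            # Place the cell in the column whose start is nearest to its left edge.
            idx = min(range(len(anchors)), key=lambda i: abs(anchors[i] - x))
            line[idx] = (line[idx] + " " + text).strip()
        table.append(line)
    return table


def table_to_csv(table):
    """Serialize the table to a CSV string."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical word boxes: (text, x_left, y_top, width) in pixels.
    sample = [
        ("Name", 10, 5, 40), ("Qty", 200, 6, 30),
        ("Blue", 12, 40, 35), ("pen", 52, 41, 25), ("3", 201, 40, 10),
    ]
    print(table_to_csv(words_to_table(sample)))
```

This approach aligns columns by left edge, so right-aligned numeric columns, merged cells, and skewed scans would likely need extra handling (for example deskewing first, or clustering on cell centers instead).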