Answered

Data extraction from Scanned PDF document

Posted by dc23

I have scanned invoices that were converted to PDF format. I want to extract data from the invoices and store it in an Excel sheet.

I have tried AI Builder, but not all of the data that needs to be extracted is being analysed.

Please provide some suggestions.

Thanks!

Categories:

Using flows

All responses (8)

abm 32,985 Most Valuable Professional

Hi @dc23

Did you look into this?

https://powerusers.microsoft.com/t5/Power-Automate-Community-Blog/Extract-data-from-documents-with-Microsoft-Flow/ba-p/370422

@Jay-Encodian could help you with this.

Thanks

dc23 62

Yes, I have gone through the solution provided. It worked when I had to pick up a single element from the invoice. However, when the invoice has multiple rows for Quantity, Item, etc., I'm not able to get the desired result.

Thanks!

abm 32,985 Most Valuable Professional

Hi @dc23

Thanks for your quick reply. That might be a product limitation. @Jay-Encodian could clarify this further.

Thanks

Jay-Encodian 2,920

Thanks @abm

Hey @dc23

The 'Extract Text from Regions' action is designed to extract text from pre-defined regions; if those regions are dynamic, it is more difficult. You have a few options to consider:

1. Persist with the AI Builder approach
2. Use the Encodian action but add regions where data might exist, and then handle null values in your Flow
3. Use the Encodian 'Get PDF Text Layer' action combined with our 'Search Text - Regex' (Preview) action

I'd recommend you go with option #1; this is the right tool for this scenario, especially where you have multiple differing invoice layouts. It's absolutely possible with the Encodian action, but this is a more basic tool (by design) and you will have to do more work up front to get it to work.

If you want to try the Encodian approach let me know and I'll guide you through.
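
To give a rough idea of what the third option above involves (a regex over the extracted text layer), here is a minimal Python sketch. It is an illustration only and does not call the Encodian actions: the sample text, the column layout, the patterns, and the function name `extract_invoice` are all assumptions. It shows how one pattern can pick up every Quantity/Item row rather than a single element, and how a missing field can be returned as a null value.

```python
import re

# Hypothetical text layer; a real scanned invoice will look different.
SAMPLE_TEXT = """\
Invoice No: INV-1042
Item        Quantity   Unit Price
Widget A    3          10.50
Widget B    12         2.25
Total: 58.50
"""

# Assumed row layout: item name, 2+ spaces, integer quantity, decimal price.
LINE_ITEM = re.compile(
    r"^(?P<item>.+?)[ \t]{2,}(?P<qty>\d+)[ \t]+(?P<price>\d+\.\d{2})[ \t]*$",
    re.MULTILINE,
)
INVOICE_NO = re.compile(r"Invoice No:\s*(\S+)")


def extract_invoice(text):
    """Return the invoice number (or None) and every matching line-item row."""
    number = INVOICE_NO.search(text)
    rows = [
        {
            "item": m["item"].strip(),
            "quantity": int(m["qty"]),
            "unit_price": float(m["price"]),
        }
        for m in LINE_ITEM.finditer(text)
    ]
    # A missing invoice number is returned as None so the caller can handle it.
    return {"invoice_no": number.group(1) if number else None, "line_items": rows}


if __name__ == "__main__":
    print(extract_invoice(SAMPLE_TEXT))
```

Each dictionary in `line_items` corresponds to one row that could be written to a spreadsheet. The pattern would need adjusting for every distinct invoice layout, which is the extra up-front work mentioned above.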

HTH

Jay

dc23 62

Hi Jay,

Thanks for the response. However, before moving on to Encodian I did try the AI Builder approach. The issue I encountered was that the data was fetched for some of the invoices, but for others with the same format the data was not analysed.

Also, I'm not sure whether AI Builder allows us to select the data of our choice rather than just suggesting the fields we can select from.
Do you have any suggestions for this?

Thanks!

Verified answer

Jay-Encodian 2,920

Hi @dc23

In my experience, you need to provide enough sample documents that are very similar but contain different data so that the fields are recognized for selection; see https://docs.microsoft.com/en-us/ai-builder/form-processing-sample-data

I believe you need the AI model to detect the fields; you can't just select them.

HTH

Jay

CFernandes 8,482 Most Valuable Professional

You can use the Muhimbi PDF Converter Power Automate action to extract data from scanned PDF documents.

Muhimbi PDF Converter supports a number of OCR (Optical Character Recognition) facilities, including the ability to make image-based PDFs (scans, faxes) fully searchable and indexable. In addition, it supports extracting this text so that invoice numbers, purchase order numbers, or other identifying information can be retrieved.

You can find details here:

Use Flow to extract text from scans and faxes using OCR.
Convert image-based SharePoint content to OCRed PDFs.
Extract text from image-based files in SharePoint - and set coordinates.

I hope this helps.

takolota1 4,980 Moderator

If anyone wants to extract data from a PDF or image without training a model for select documents, try this new GPT data extraction method: https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Extract-Data-From-PDFs-and-Images-With-GPT/td-p/2201345

It doesn’t require specifying certain document areas, wordings, styles, etc. It just OCRs the file, converts it to a replica text (txt), and passes it to a GPT prompt where you can ask GPT to do whatever you want with the document data.
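
As a rough illustration of that last step (passing OCR text to a prompt and reading back structured data), here is a small Python sketch. It is not the linked flow and makes no model call: the helper names `build_prompt` and `parse_reply`, the prompt wording, and the JSON reply format are assumptions chosen for the example.

```python
import json


def build_prompt(document_text, fields):
    """Assemble a prompt asking for the given fields as a JSON object (assumed format)."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the invoice text below and reply "
        f"with a single JSON object using exactly these keys: {field_list}. "
        "Use null for any field that is not present.\n\n"
        f"--- INVOICE TEXT ---\n{document_text}"
    )


def parse_reply(reply, fields):
    """Parse the model's reply; fields that are missing or unparseable become None."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {name: data.get(name) for name in fields}


if __name__ == "__main__":
    wanted = ["invoice_no", "total"]
    print(build_prompt("Invoice No: INV-1042 Total: 58.50", wanted))
    # A hypothetical reply, written by hand here rather than returned by a model.
    print(parse_reply('{"invoice_no": "INV-1042", "total": "58.50"}', wanted))
```

Returning None for anything the reply omits keeps the later spreadsheet step from failing on invoices where a field could not be read.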
