Extract Data From PDFs and Images With GPT

Posted by takolota1


 


This template uses AI Builder's OCR for PDFs & images to extract the text in a file, rebuilds the file as a plain-text (txt) replica, then passes that replica to a GPT prompt action for tasks like data extraction.

 

It seems to return the requested data fields with 85% or greater reliability on most PDFs. That is likely good enough for more direct data entry in some use-cases with well-formatted, clean PDFs. In many other cases it is good at doing a 1st pass on a file & providing a default / pre-fill value for fields before a person checks & completes the data.
It does not require training on different formats, styles, wording, etc., and it works on multiple pages at once. You can always adjust the prompt to extract different data from different documents & to change how the data is represented in the output.

 

It also...
-Runs in less than a minute, usually 10-35 seconds, so it can respond in time for a Power Apps call.
-Handles 10-20 document pages at a time, given the recent Create text with GPT update to a 16k model.
-Does not use additional 3rd party services, which helps maintain data privacy.

 

Full Flow:


 

The AI Builder Recognize text action returns a JSON array of each piece of text found in the PDF or image.

The Convert to txt loop goes through each horizontal line in the PDF or image & builds a line of text that approximates both the text & the spacing between pieces of text on that line.

The lines are then stacked top to bottom into a single block of text, like one big txt file, in the final Compose action, which is then passed to GPT through the AI Builder Create text action.
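
To make the idea concrete, here is a minimal Python sketch of building a text replica from positioned OCR pieces. It is not the flow's actual expressions. It assumes each piece is a dict with the hypothetical keys text, x, and y, already expressed in character-grid units, whereas the real Recognize text output uses bounding boxes and its own schema.

```python
from collections import defaultdict

def build_text_replica(pieces):
    """Place each piece of text at its (x, y) position on a character grid."""
    # Group pieces that share the same row (y value).
    rows = defaultdict(list)
    for piece in pieces:
        rows[piece["y"]].append(piece)

    lines = []
    for y in sorted(rows):
        line = ""
        for piece in sorted(rows[y], key=lambda p: p["x"]):
            # Pad with spaces up to the piece's column, keeping at least one
            # space between neighbours so words never run together.
            pad = piece["x"] - len(line)
            if line and pad < 1:
                pad = 1
            line += " " * max(pad, 0) + piece["text"]
        lines.append(line)
    # Stack the rows top to bottom into one block of text.
    return "\n".join(lines)

# Hypothetical sample data, not taken from a real OCR response.
sample = [
    {"text": "Invoice", "x": 0, "y": 0},
    {"text": "PO10022556", "x": 20, "y": 0},
    {"text": "Total", "x": 0, "y": 1},
    {"text": "10,437.00", "x": 20, "y": 1},
]
replica = build_text_replica(sample)
print(replica)
```

Because both values start at column 20, they stay vertically aligned in the replica, which is the spacing information the GPT prompt can then use.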

 

 

Example

 

Demonstration Invoice Example...


 

The AI Builder action uses optical character recognition (OCR) on this invoice PDF to return each piece of text & its associated x, y coordinates.

Then the Convert to txt loop produces this output shown in the final Compose...



And if we copy that output over to a text (txt) notebook, then this is what it looks like...


 

That is then fed into this GPT action prompt...


 

Which produced this output...

 

{
  "Purchase Order Date": "2022-09-20",
  "Purchase Order (PO) Number": "PO10022556",
  "Requisition Order (RO) Number": "NULL",
  "Incoterms": "DAP",
  "Payment Terms": "NULL",
  "Supplier": "[Redacted]",
  "Chemonics Address": "[Redacted]",
  "Consignee Address": "[Redacted]",
  "Delivery Or Ship To Address": "[Redacted]",
  "Mode Of Shipment": "Air",
  "Description Of Goods": "NULL",
  "Sum Of Additional Charges": "NULL",
  "Product Lines": [
    {
      "Product Name": "KIT COBAS 58/68/8800 LYS",
      "Product Quantity": "49",
      "Product Unit Price": "213.00",
      "Product Line Total or Amount": "10,437.00",
      "Batch Or Lot Number": "J11051",
      "Manufacture Date (MFG)": "2024-05-31",
      "Expiration Date": "2024-05-31"
    },
    {
      "Product Name": "KIT COBAS 58/68/8800 MGP",
      "Product Quantity": "165",
      "Product Unit Price": "50.00",
      "Product Line Total or Amount": "8,250.00",
      "Batch Or Lot Number": "J07927",
      "Manufacture Date (MFG)": "2024-03-31",
      "Expiration Date": "2024-03-31"
    },
    {
      "Product Name": "KIT COBAS 6800/8800 HIV 96T",
      "Product Quantity": "5",
      "Product Unit Price": "838.95",
      "Product Line Total or Amount": "4,194.75",
      "Batch Or Lot Number": "H27735",
      "Manufacture Date (MFG)": "2023-09-30",
      "Expiration Date": "2023-09-30"
    },
    {
      "Product Name": "KIT COBAS 6800/8800 HIV 96T",
      "Product Quantity": "313",
      "Product Unit Price": "838.95",
      "Product Line Total or Amount": "262,591.35",
      "Batch Or Lot Number": "H27745",
      "Manufacture Date (MFG)": "2023-09-30",
      "Expiration Date": "2023-09-30"
    },
    {
      "Product Name": "KIT COBAS 6800/8800 HIV 96T",
      "Product Quantity": "65",
      "Product Unit Price": "838.95",
      "Product Line Total or Amount": "54,531.75",
      "Batch Or Lot Number": "H33673",
      "Manufacture Date (MFG)": "2023-09-30",
      "Expiration Date": "2023-09-30"
    },
    {
      "Product Name": "KIT COBAS HBV/HCV/HIV-1",
      "Product Quantity": "72",
      "Product Unit Price": "290.00",
      "Product Line Total or Amount": "20,880.00",
      "Batch Or Lot Number": "H35037",
      "Manufacture Date (MFG)": "2024-01-31",
      "Expiration Date": "2024-01-31"
    }
  ]
}

 

 

And remember, you can always adjust the prompt to extract different data from different documents & to change how the data is represented in the output. You can also often improve the output by adding data specifications, like "A PO number is always 2 letters followed by 8 digits. Only return those 2 letters & 8 digits."
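
If you also want to double-check a field after GPT returns it, the same kind of rule can be applied to the output. This is a minimal Python sketch of that idea, assuming the 2 letters & 8 digits rule from the example prompt text. The post-check itself is not part of the template, and clean_po_number is a hypothetical helper name.

```python
import re

# 2 letters followed by 8 digits, per the example specification above.
PO_PATTERN = re.compile(r"[A-Za-z]{2}\d{8}")

def clean_po_number(raw_value):
    """Return the first 2-letters-plus-8-digits match, or None if absent."""
    match = PO_PATTERN.search(raw_value or "")
    return match.group(0) if match else None

print(clean_po_number("PO10022556"))
print(clean_po_number("PO Number: PO10022556 (rev 2)"))
print(clean_po_number("NULL"))
```

A None result is a cheap signal that a person should review that field before it is used for data entry.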

 

Also, if you are working with Word/.docx files, there are built-in OneDrive actions to convert them to .pdf files, so you should be able to process PDF, image, and/or Word documents with the same type of set-up.

Also, if you need something that can handle much larger files, with a better page text filter/search set-up & a larger GPT context window, check out the Query Large PDFs With GPT RAG template.

 

Remember, you may need AI Builder credits for the OCR & GPT actions in the flow to work. Each Power Automate premium license already comes with 5000 credits that can be assigned to your environment. Depending on your license & organization, you may already have some credits assigned to the environment.

If you are new, you can get a trial license to test things out: https://learn.microsoft.com/en-us/ai-builder/administer-licensing 

 



Version 1.8 adds a PageNumbers Compose action that lets you input specific pages of a PDF or image packet to pass on to the text conversion & GPT prompt. This is useful when the relevant data is always on the first couple of pages, or when you must filter down to only the relevant pages/images because the full packet of PDF page data or image data would exceed the GPT prompt token / character limit.
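
The page filter can be pictured like this minimal Python sketch. It assumes each OCR page is a dict with a hypothetical 1-based page key, and it treats an empty page list as "keep everything", which is an assumption of the sketch rather than confirmed behavior of the PageNumbers action.

```python
def select_pages(pages, page_numbers):
    """Keep only the pages whose 1-based number was requested.

    An empty page_numbers list keeps every page (an assumption of this sketch).
    """
    if not page_numbers:
        return list(pages)
    wanted = set(page_numbers)
    return [page for page in pages if page["page"] in wanted]

# Hypothetical OCR result for a 3-page packet.
ocr_pages = [
    {"page": 1, "lines": ["Purchase Order", "PO10022556"]},
    {"page": 2, "lines": ["Terms and conditions"]},
    {"page": 3, "lines": ["Packing list"]},
]
selected = select_pages(ocr_pages, [1, 3])
print([page["page"] for page in selected])
```

Only the selected pages would then go through the text conversion, so fewer characters reach the GPT prompt.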

Version 2 redesigns the Convert to txt section of the flow to use several Select actions & expressions that avoid an additional level of Apply to each looping. So for an example 3-page document with 50 lines per page, creating the text replica takes 1 second and 21 action calls instead of 15-20 seconds and 156 action calls.
This makes the entire flow about 2X faster (15 seconds vs. 30 seconds) and about 7X more efficient for daily action limits.
This makes use-cases like real-time processing on a Power Apps document upload, or processing larger batches of documents each day, much more viable.
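
As a quick check on those ratios, here is a small Python sketch using the action counts and timings quoted above. The daily action budget is a made-up number for illustration only; check your own license limits.

```python
def action_ratio(old_actions, new_actions):
    """How many times fewer action calls the new design uses."""
    return old_actions / new_actions

def runs_within_budget(daily_action_budget, actions_per_run):
    """Whole runs that fit into a daily action budget."""
    return daily_action_budget // actions_per_run

efficiency = round(action_ratio(156, 21), 1)   # action calls, old vs. new
speedup = round(30 / 15, 1)                    # seconds, old vs. new

# Hypothetical budget, not an actual Power Automate limit.
HYPOTHETICAL_DAILY_BUDGET = 10_000
old_runs = runs_within_budget(HYPOTHETICAL_DAILY_BUDGET, 156)
new_runs = runs_within_budget(HYPOTHETICAL_DAILY_BUDGET, 21)
print(efficiency, speedup, old_runs, new_runs)
```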

Version 2.5 makes more changes to the Convert to txt component to create slightly more accurate text replicas, and changes the placeholder prompt to make the message a little more concise & accurate. It also moves the spaces & line-break into a single Compose called StaticVariables & changes the variable name to the now more accurate EachPage.
The Convert to txt piece now calculates the minimum X coordinate so it can subtract that number from all X coordinates & thus remove extra spaces on the left margin, which reduces the characters fed to the GPT prompt.
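
The left-margin trim amounts to something like this minimal Python sketch. The text and x keys are hypothetical, and x is assumed to already be a column position.

```python
def trim_left_margin(pieces):
    """Shift every x coordinate left by the smallest x found on the page."""
    if not pieces:
        return []
    min_x = min(piece["x"] for piece in pieces)
    # Copy each piece so the original OCR data is left untouched.
    return [{**piece, "x": piece["x"] - min_x} for piece in pieces]

# Hypothetical pieces with a 12-column left margin.
page = [{"text": "Invoice", "x": 12}, {"text": "Total", "x": 30}]
trimmed = trim_left_margin(page)
print([piece["x"] for piece in trimmed])
```

Every line loses the same number of leading spaces, so the relative alignment between pieces of text is unchanged.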

The Convert to txt piece also now has a ZoomX parameter in the StaticPageVariables action, which sets the spaces multiple, i.e. the number of spaces per coordinate point. For example, 200 = more accurate text alignment, but 100 = fewer GPT tokens, so there are trade-offs here. (The Recognize text bounding box coordinates around longer pieces of text seem to be disproportionately larger than those around shorter pieces of text, which messes up the text alignment for rows/lines with multiple boxes / text entries.)
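
To illustrate the ZoomX trade-off, here is a minimal Python sketch. It assumes X coordinates are normalized to a 0 to 1 range and that the column is simply the coordinate times ZoomX. Both are assumptions for illustration, and column_for and render_line are hypothetical names.

```python
def column_for(x_normalized, zoom_x):
    """Map a 0..1 horizontal position to a character column."""
    return round(x_normalized * zoom_x)

def render_line(pieces, zoom_x):
    """Render (text, x) pairs as one line, padding with spaces to each column."""
    line = ""
    for text, x in sorted(pieces, key=lambda pair: pair[1]):
        col = column_for(x, zoom_x)
        # Keep at least one space between neighbouring pieces.
        gap = max(col - len(line), 1 if line else 0)
        line += " " * gap + text
    return line

# Hypothetical row of a table.
row = [("Qty", 0.10), ("49", 0.40), ("213.00", 0.70)]
wide = render_line(row, 200)
narrow = render_line(row, 100)
print(len(wide), len(narrow))
```

The higher setting leaves more room to separate nearby columns, while the lower setting produces roughly half as many characters for the same row.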

In addition, the Convert to txt piece will now include line-breaks for blank Y coordinate rows/lines to more accurately replicate the vertical spacing between pieces of text. I figured that since each blank line is just a line-break character, it shouldn't add much to the character / token count for the GPT prompt.

So overall, 2.5 adds options for either increased extraction accuracy or fewer characters/tokens per page & thus slightly larger file capacity.

Version 2.7 Another adjustment to the conversion from OCR coordinates to the text (txt) replica.
It now calculates the X coordinate of a piece of text from the mid-point between X coordinates 0 & 1. So along with the Y coordinate, which was already being calculated from the mid-point between Y coordinates 0 & 3, this now registers the position of each piece of text from the center point of its bounding box.
I also set it to use an estimate of the text's character length, instead of the width of the overall bounding box, to calculate the whitespace / number of spaces between each piece of text.
Overall this makes the set-up even more accurate: it improves text alignment, improves performance on more tilted pages, & adjusts the spacing/alignment for different font / text sizes on the same line.
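
Here is a minimal Python sketch of the center-point idea. It assumes the four corners are ordered top-left, top-right, bottom-right, bottom-left (so corners 0 & 1 span the width and corners 0 & 3 span the height) and that coordinates are normalized to 0 to 1. The way start_column uses the character count is my own illustration of the description above, not the flow's exact expression.

```python
def box_center(corners):
    """Return the (x, y) center of a box given as four (x, y) corner points.

    Assumed corner order: top-left, top-right, bottom-right, bottom-left.
    """
    center_x = (corners[0][0] + corners[1][0]) / 2   # mid-point of X 0 & 1
    center_y = (corners[0][1] + corners[3][1]) / 2   # mid-point of Y 0 & 3
    return center_x, center_y

def start_column(corners, text, zoom_x):
    """Estimate where the text starts, from its center and character count."""
    center_x, _ = box_center(corners)
    # Center column minus half the text length, never left of column 0.
    return max(round(center_x * zoom_x - len(text) / 2), 0)

# Hypothetical bounding box for a 4-character piece of text.
box = [(0.2, 0.10), (0.4, 0.10), (0.4, 0.14), (0.2, 0.14)]
cx, cy = box_center(box)
col = start_column(box, "ABCD", 100)
print(round(cx, 2), round(cy, 2), col)
```

Using the center point means a box that is wider than its text shifts the text less than it would if the left edge were used.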

Microsoft started requiring approval actions after every GPT action. If you want to get around this requirement, see this post on setting the approval step to automatically succeed & move to the next action.

Version 2.9 Adjustment For New MS Approval Requirement & Adjust Retry Policy

I added the automatic approval step to get around the new MS approval action requirement. I also set the retry policy on the GPT action to retry every 5 seconds, up to 7 times, so it fails less often when spurious 429 (Too Many Requests) errors occur.
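
The retry setting can be pictured as a fixed-interval retry loop. This is a minimal Python sketch of that behavior, not how Power Automate implements its retry policy. TooManyRequests, call_with_fixed_retry, and the injectable sleep argument are hypothetical names used only for illustration.

```python
import time

class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 response."""

def call_with_fixed_retry(action, retries=7, interval_seconds=5, sleep=time.sleep):
    """Call action(); on a 429, wait a fixed interval and retry up to `retries` times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return action()
        except TooManyRequests:
            if attempts > retries:
                raise  # All retries used up, surface the error.
            sleep(interval_seconds)

# Demo: an action that fails twice with a 429, then succeeds.
calls = {"count": 0}

def flaky_gpt_action():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TooManyRequests()
    return "ok"

waits = []  # Record the waits instead of actually sleeping.
result = call_with_fixed_retry(flaky_gpt_action, sleep=waits.append)
print(result, waits)
```

With 7 retries at 5 seconds each, the worst case adds about 35 seconds of waiting before the action finally fails.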

Microsoft is deprecating the original Create text with GPT action this template relies on.
Users may need to use the new “Create text with GPT using a prompt” action & create a custom prompt on that action instead.

https://learn.microsoft.com/en-us/ai-builder/use-a-custom-prompt-in-flow
 
The ExtractPDFImageDataWithGPT_1_0_0_x Power Apps solution package contains a version of the flow with the new action.

 

Version 3.1 Solution Import

Solutions can now include data models / prompt models so I was able to update the flow in the Solution import to use the new prompt action.

I also was able to adjust the EachPage variable set-up so if anyone needs to put this workflow inside a loop to process multiple files at once, they can now turn concurrency on & process them in parallel.


Thanks for any feedback,

Please subscribe to my YouTube channel (https://youtube.com/@tylerkolota?si=uEGKko1U8D29CJ86).

And reach out on LinkedIn (https://www.linkedin.com/in/kolota/) if you want to hire me to consult or build more custom Microsoft solutions for you.




Legacy Power Automate Import: https://drive.google.com/file/d/1gyC6AK3ur0rcE1UwDK85qInwTRTeoAFU/view?usp=sharing (Not Recommended)

Solution Zip Download Link: https://drive.google.com/file/d/1XWJruaSEvIPls2AQS7btxg_r89lMz0XQ/view?usp=sharing (Recommended. Should import the AI Hub Prompt that uses one of the most recent LLM models.)
Go to the Power Apps home page (https://make.powerapps.com/). Select Solutions on the left-side menu, select Import solution, then Browse your files & select the ExtractPDFImageDataWithGPT_1_0_0_x.zip file you just downloaded. Then select Next & follow the menu prompts to apply or create the required connections for the solution flows, and finish importing the solution.

Once the solution is done importing, select the solution name in the list at the center of the screen. Once inside the solution, click the 3 vertical dots next to the flow name of the latest version & select Edit to enter the flow.


Categories:

AI Builder

Comments

  • FR-24111020-0
     
    Great work and many thanks. I am trying to use fragments of your flow, but on import it gives:
    The solution file is invalid. The compressed file must contain the following files at its root: solution.xml, customizations.xml, and [Content_Types].xml. Customization files exported from previous versions of Microsoft Dynamics 365 are not supported.
     
    ... after research: "If this solution was exported from an older version of Dynamics 365, you may need to re-export it from a supported version or convert it to a compatible format."
     
    Can you give some guidance on how to solve this?
     
    Thanks in advance x3
  • takolota1
    GPT4o Mini is now the default model when using the Solution import set-up, so there is a much longer context window & lower costs.

    It should take 1 credit per 1000 input tokens (~4000 characters) & 3 credits per 1000 output tokens. The OCR action takes 1-3 credits. So a single premium Power Automate per-user license allowance of 5000 credits per month should allow you to process roughly 500-1250 pages of PDFs per month.
  • takolota1
    Hello Devangchheda,

    There should be some improved accuracy after encoding the text location/spacing data in the replicas, especially if you have instances where a label is actually above its value instead of next to it, like...
    Date: 08/24/2024       Supplier ABC      Invoice Number
                                                                  122567
     
    Because without the location/spacing information represented in the replica that 122567 number may get lost below different labels.
     
     
    Then for GPT4o I have another template that converts PDFs to arrays of images so we can use the built-in image recognition system of GPT4o: https://community.powerplatform.com/galleries/gallery-posts/?postid=73cdb790-11c9-45b7-80d0-b991d1f43f34
    However, I have been disappointed at times with the accuracy of the built-in image recognition, as it would, for example, confuse 8s with 3s in order numbers. So I've found the absolute best accuracy by combining the two GPT templates. I generate a text replica from the extracted text & convert the PDF to a series of images, then I use an HTTP call to GPT4o, feeding the PDF images to the image recognizer & including the text replica in the prompt text. I tell it in the prompt that it is receiving both images & text of the same document & that it should check the OCR document text if it needs to confirm or clarify text in the images.
  • devangchheda
    Hey @takolota1 , thanks for the awesome approaches. Had a couple of questions:
     
    Did you experience a drastic increase in accuracy after shaping the text based on coordinates obtained from the AI Builder OCR and then passing it to GPT, rather than passing the raw OCR output directly to the GPT prompt?
     
    Also, would it give better accuracy to use the shaped OCR output with a GPT-4o HTTP call to an Azure AI instance instead of the AI Builder "Create Text with GPT prompt" action? With the HTTP call we can use any of the latest models, but we don't know what model is being used with "Create Text with GPT prompt".
     
    Thanks in advance
     
     
  • takolota1
    We finally got a way to add AI Models to Solution packages so you all can import this without extra set-up steps.
    I also was able to adjust the EachPage variable set-up so if anyone needs to put this workflow inside a loop to process multiple files at once, they can now turn concurrency on & process them in parallel.

    V3.1 Solution Package Import:  https://drive.google.com/file/d/1mQGBl8GakIL5BkcW2pqsXaAF5nyUI4t2/view?usp=sharing
  • takolota1

    I'm helping someone set things up with the new GPT prompt action and thought it may be helpful to share how I am using it...

     

    I go to the AI Hub on the Power Apps page & create a prompt like this...
    Create prompt instructions: https://learn.microsoft.com/en-us/ai-builder/create-a-custom-prompt

    GPT Extract Example2.png

     

    Then I use that prompt in the flow GPT action & insert my instructions & document text dynamic content

    GPT Extract Example1.png

     

  • takolota1

    Bug Fix

     

    I found the Convert to txt loop inside the Convert to txt scope would error if it was passed a page that the Recognize text action found no text on (so the lines parameter was an empty array [ ]).
    I changed the "Filter array RemoveUnselectedPageBlanks" action logic to...

     

    @And(greater(length(string(item())), 0),not(equals(empty(item()?['lines']), true)))

     

     to remove any blank pages & avoid this error.

  • takolota1

    Extract PDF Data With GPT4o

     

    Here is another method to extract PDF data using the Vision component of the most recent GPT4o model. This method is more expensive, but it will maintain the PDF formatting & things like signatures or stamps while analyzing it.

     

    Download & Set-Up Page: https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Extract-PDF-Data-With-GPT4o/td-p/2805514

  • takolota1

    @SonnyGallmann @twidd 

    I just completed a prototype set-up with GPT4o with its image analysis features that I’ll work to make available. Should be significantly more accurate & flexible.

    As long as you have access to an Azure account, VS Code, & Python you should be able to set it up.

  • takolota1

    @SonnyGallmann What does the text (txt) replica look like when it fails to get the right values from the table?

    Depending on the file & how organized it is, this can reach around 90-95% accuracy.

    However, if you need more accuracy, especially with tables & other pieces that depend on non-text file elements like table lines, then you could try GPT4o.
    That takes much more set-up & runs much slower, though, because processing PDFs with its vision component requires a 3rd party service like Encodian or Adobe to split a PDF by page & convert each page to an image, before an array of all the images is fed to GPT4o in a custom HTTP call.
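
The credit rates quoted in the comments (1 credit per 1000 input tokens, 3 credits per 1000 output tokens, 1-3 credits for OCR, 5000 credits per month) can be turned into a rough page budget. This is a minimal Python sketch of that arithmetic. The per-page token counts in the example are made-up numbers, so substitute figures from your own documents.

```python
def credits_per_page(input_tokens, output_tokens, ocr_credits):
    """AI Builder credits for one page, using the rates quoted in the comments."""
    gpt_credits = input_tokens / 1000 * 1 + output_tokens / 1000 * 3
    return gpt_credits + ocr_credits

def pages_per_month(monthly_credits, per_page):
    """Whole pages that fit into a monthly credit allowance."""
    return int(monthly_credits // per_page)

# Hypothetical page: 2000 input tokens, 1000 output tokens, 3 OCR credits.
per_page = credits_per_page(input_tokens=2000, output_tokens=1000, ocr_credits=3)
monthly_pages = pages_per_month(5000, per_page)
print(per_page, monthly_pages)
```

The quoted range of 500-1250 pages per month corresponds to roughly 4-10 credits per page, since 5000 / 1250 = 4 and 5000 / 500 = 10.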