Answered

Need Help Automating PDF Text Extraction

Posted on by Community Power Pla... Microsoft Employee

This may not be the correct board. If so, please let me know and I'll make adjustments as necessary.

I work in an office that receives monthly invoices for our company's billing. The invoices arrive as PDFs formatted as continuous tables. The data is laid out roughly like this:

-member ID#, NAME - new line

-Claim number, member ID# again - new line

-Date of claim twice, claim type code, claim type code again, claim type code again again, ID number, number of units for this claim, dollar amount requested, dollar amount authorized, dollar amount deducted, dollar amount paid - new line...

Currently I'm reviewing these documents by hand with my little human eyeballs and fingers. It's not terribly slow, but I have literally hundreds of these and each can contain 50 or more claims.

Specifically, my goal is this: (somehow) scan the PDF document with RegEx (or something to similar effect) and extract the text from the three kinds of lines shown above while filtering out the "noise" (everything else). The desired data range always starts on page 2 and ends three pages before the last page of the PDF.
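A minimal sketch of the filtering idea in Python (outside PAD), assuming the text has already been extracted from the PDF. The sample text, the line formats, and all three patterns are hypothetical, since the real invoices can't be shared; real patterns would need tuning against the actual extracted text.

```python
import re

# Hypothetical sample of extracted text; the real layout may differ.
SAMPLE = """\
Invoice summary - page 2
123456 SMITTY MCGEE
C-0001 123456
01/23/2045 01/23/2045 A1 A1 A1 987 2 100.98 90.00 10.00 80.00
Some footer noise
"""

# Assumed patterns, one per line type described in the post.
MEMBER_RE = re.compile(r"^(?P<member_id>\d{6}) (?P<name>[A-Z][A-Z' -]+)$")
CLAIM_RE = re.compile(r"^(?P<claim_no>C-\d+) (?P<member_id>\d{6})$")
DETAIL_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) \d{2}/\d{2}/\d{4} .* (?P<paid>\d+\.\d{2})$"
)

PATTERNS = (("member", MEMBER_RE), ("claim", CLAIM_RE), ("detail", DETAIL_RE))


def extract(text):
    """Return (kind, fields) tuples for matching lines; other lines are dropped as noise."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        for kind, pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                hits.append((kind, match.groupdict()))
                break  # a line belongs to at most one kind
    return hits


records = extract(SAMPLE)
```

Matching line by line keeps each pattern short and makes it easy to see which kind of line a pattern failed on.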

I'm using FileCenter to manage my documents. FileCenter Pro provides a neat tool for OCR text extraction similar to what I want. I managed to set up a little "demo" module to prove the concept, and it worked in microcosm, but I couldn't get it to export the data into a file or into another program.

Ultimately these extracted data need to be placed into an Excel document so I can do some other data related operations on them.

The target PDFs contain sensitive data so I can't share them.

Any ideas how to go from selecting a PDF to dropping specific text from the PDF into an Excel doc while pruning all the unnecessary stuff? (or a list object in PAD?)

Categories:

Power Automate Desktop

Henrik_M 2,021 Super User 2024 Season 1 on at

That sounds like a job for the AI Builder:

AI Builder— Intelligent Automation | Microsoft Power Automate

Community Power Pla... Microsoft Employee on at

I'll give it a swing and see what I can make happen with that.

Thank you very much, @Henrik_M !

UK_Mike on at

Not sure how complex the pdfs are but...

Community Power Pla... Microsoft Employee on at

I got some RegEx expressions built to capture the desired data.

While investigating @Henrik_M's proposed solution I discovered that AI Builder is a premium feature and that there's some setup required to get it off the ground and flying. I've already familiarized myself with PAD, so I'll continue to work on my solution based in PAD.

I've created a flow that targets a folder, grabs all the .pdf files, searches the content of each one within a page range supplied through a text input, then collects all the matches into two running lists.

I'm hoping you can aid me here on this one,
I need to now pair up the Match 1 (name) result with the Match 2 (date) result, I.E.: "Smitty McGee", "1/23/45"
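For illustration, here is what pairing two match lists by position looks like in Python rather than PAD. It is only a sketch: it assumes both lists come back in the same order and with the same length, and the second name and date are made-up sample values.

```python
# Two running lists of matches, as produced by the two RegEx searches.
# "Jane Doe" and "2/14/45" are hypothetical sample values.
names = ["Smitty McGee", "Jane Doe"]
dates = ["1/23/45", "2/14/45"]

# If the counts differ, at least one match was missed and pairing by
# position would silently misalign everything after it.
if len(names) != len(dates):
    raise ValueError(f"{len(names)} names but {len(dates)} dates")

# Item 0 goes with item 0, item 1 with item 1, and so on.
pairs = list(zip(names, dates))
```

The length check matters more than the pairing itself: one missed match shifts every later pair.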

I'm really new to manipulating strings and lists of text. Arrays have always baffled me, but I'm not completely unfamiliar with some of the concepts.

@UK_Mike , thank you for the encouragement. And the jokes! I needed that laugh really bad.

UK_Mike on at

"
-member ID#, NAME - new line
-Claim number, member ID# again - new line
-Date of claim twice, claim type code, claim type code again, claim type code again again, ID number, number of units for this claim, dollar amount requested, dollar amount authorized, dollar amount deducted, dollar amount paid - new line...
"

Probably best to type out what these actually look like, such as...

Member ID: Mike
Claim type code: abc123
Date of claim: 9/4/22
Dollar amount requested: $100.98
ETC...

"
I'm hoping you can aid me here on this one,
I need to now pair up the Match 1 (name) result with the Match 2 (date) result, I.E.: "Smitty McGee", "1/23/45"
"

Each of these will be an individual var for us to write to Excel inside a "For Each" loop over the PDF folder.
That is, each pass of the loop pulls 5, 6, 7, etc. separate vars holding the values from the current item (PDF).
Basically they are already matched...

var 1 = Mike
var 2 = abc123
var 3 = 9/4/22
var 4 = $100.98

Verified answer

Community Power Pla... Microsoft Employee on at

Update:

I found the solution through RegEx and some highly specialized RegEx patterns developed by a friend for this particular case. For anyone with a similar need to take apart the contents of a PDF and then build values by reading each list of matches at the same index, please continue reading.

Step 1. Get all files in the designated target folder.

Step 1a. Make sure your page ranges are good. I had to set mine to start 1 page after page 1 (target page 2) and to end 3 pages before the last page (target page n-3, so the last three pages are excluded).
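The page-range arithmetic from Step 1a can be sketched as a small Python function. The function name and parameters are hypothetical; it assumes 1-based page numbers and returns an inclusive range.

```python
def data_page_range(total_pages, skip_front=1, skip_back=3):
    """Return the (first, last) 1-based pages that hold data, inclusive.

    skip_front=1 starts at page 2; skip_back=3 ends at page n-3,
    so the last three pages are left out.
    """
    first = 1 + skip_front
    last = total_pages - skip_back
    if last < first:
        raise ValueError(f"a {total_pages}-page PDF has no data pages")
    return first, last


first_page, last_page = data_page_range(10)
```

For a 10-page PDF this gives pages 2 through 7. The explicit check catches short PDFs, where the range would otherwise come out empty or reversed.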

Step 2. Create a list of the top level unique variable, mine is the member's name.
->Create new List: %YourListVar%

Step 3. Start a For Each loop, For %CurrentItem% in %Files%

Step 3a. Assign the PageStart# and PageEnd# to your page range variables. This is now the entire range of your data.

Step 4. Extract text from PDF inside the For Each loop from %PageStart#% to %PageEnd#% into %YourPDFTextVar%

Step 5. Use Parse text (%YourPDFTextVar%) and RegEx (YourRegExPattern) to find the text you want. Store the RegEx matches into a var of your choice, my first RegEx match is the member name, so mine is %Names%

Step 5a. For each item in %Names%, strip out any junk data and extra string content, trim the resulting string %CurrentNames%, and add %CurrentNames% to %YourListVar%.

Step 6. Repeat steps 4 and 5 or 5-5a until your data is satisfactory.

Step 7. Use a Loop from index 0 to %YourListVar.Count - 1% (this will always be the highest valid index).

Step 8. Inside the Loop, perform an operation. In my case I am writing these values to an Excel workbook one at a time.
Step 8a (write to Excel workbook). Write to Excel worksheet: write %YourListVar[ListIndex]% (or whichever list variable you want to cycle through) to column A, row %FirstFreeRow% of %ExcelInstance%.

For any other use, enter %YourListVar[ListIndex]% wherever a PAD action asks for a value.
Example of iterating your list with Display message: "Message to display: %YourListVar[ListIndex]%"

You will now see your Excel doc fill up with all the values from %YourListVar[ListIndex]%.

The purpose of the Loop is to get the index number of the item in question. The index must always refer to an item that exists in the target list. A list of 2 items (indexes 0 and 1) cannot be read at index 2, because no item exists there.
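The same index rule, sketched in Python for anyone who finds it easier to read than PAD actions. The list contents and the row numbering are hypothetical; the row calculation assumes an empty sheet with no header row.

```python
names = ["Smitty McGee", "Jane Doe"]  # hypothetical list contents

rows = []
for list_index in range(len(names)):   # runs 0 .. count - 1, like the PAD Loop
    first_free_row = list_index + 1    # assumes an empty sheet with no header row
    rows.append(("A", first_free_row, names[list_index]))

# Asking for index == count fails, because the last valid index is count - 1.
try:
    names[len(names)]
    out_of_range = False
except IndexError:
    out_of_range = True
```

Python raises an error for the out-of-range read rather than returning an empty value, which is why the loop's upper bound has to stop at count - 1.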

%Item[RowNumber]%

Hope that helps anyone stuck with a similar issue.

UK_Mike on at

Slightly different on my end.
After each loop I write to Excel rather than holding the values in a list for a later Excel write.
Let's say I'm after 10 values from each loop: if ALL values are populated to their respective variables, they get written at the end of that loop and the current PDF gets moved to a new folder, "Processed".
If just one value isn't found, the current loop is skipped and that particular PDF gets moved to a new folder, "Unprocessed".
At the end of the flow, if the Unprocessed folder's file count is >= 1, then an email is sent to me calling me really really bad names 🙄
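A rough Python sketch of that all-or-nothing routing, with the file moves and the email left out. The field names, file names, and values are hypothetical; only the decision logic is shown.

```python
# Hypothetical set of values wanted from every PDF.
REQUIRED = ("member_id", "name", "claim_no", "date", "paid")


def route(values, required=REQUIRED):
    """Return ('Processed', []) if every field was found, else ('Unprocessed', missing)."""
    missing = [field for field in required if not values.get(field)]
    return ("Unprocessed", missing) if missing else ("Processed", [])


# Hypothetical extraction results, keyed by file name.
extracted = {
    "a.pdf": {"member_id": "123456", "name": "Mike", "claim_no": "abc123",
              "date": "9/4/22", "paid": "$100.98"},
    "b.pdf": {"member_id": "654321", "name": "Smitty McGee", "claim_no": "xyz789",
              "date": "1/23/45", "paid": ""},  # one value not found
}

rows, folders = [], {}
for pdf, values in extracted.items():
    folder, missing = route(values)
    folders[pdf] = folder
    if folder == "Processed":
        rows.append([values[field] for field in REQUIRED])  # written right away

# Anything listed here would trigger the email at the end of the flow.
needs_attention = [pdf for pdf, folder in folders.items() if folder == "Unprocessed"]
```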

Nice write up though, well done 👏

JunZ 2 on at

We receive claims (invoices) as part of a dispute with Walmart, at the million-dollar level, and they will not stop. In the invoice PDF layout, each claim has only the total invoice amount plus a long text list of BOL shipment IDs, which might run to 180 rows across 5-6 pages. After extracting the BOL numbers, the next step will make an HTTP call to the Walmart website and download item details, such as each item's refund amount, item ID, date, etc.

I'm almost there: an email triggers the flow, the PDF is copied into SharePoint, and the PDF is converted to an extracted structured JSON file (an object, not an array).
Parsing the JSON succeeds. Now I need to run a "For each" loop to get each BOL number.

I am not a Logic Apps expert and am still learning the f(x) functions, and the project is very urgent, so I came here to ask for help.

I need "text" and "path". The JSON file is structured as object > elements (array) > attributes. item() is not an array, so how do I use the elements[] array from the previous step's output as the input to the For each loop? Or is my direction wrong?
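To make the shape of the problem concrete, here is a small Python sketch of looping over the nested array instead of the top-level object. The sample JSON, its key names, and its values are assumptions based on the description above (object > elements > attributes); the real extracted file may use different names.

```python
import json

# Hypothetical extract output: a top-level object with an "elements" array.
RAW = """
{
  "elements": [
    {"path": "//Document/P[1]", "text": "BOL 0001"},
    {"path": "//Document/Figure"},
    {"path": "//Document/P[2]", "text": "BOL 0002"}
  ]
}
"""

doc = json.loads(RAW)

# The loop target is the array inside the object, not the object itself.
elements = doc.get("elements", [])

# Keep only entries that carry both attributes; some may have no text.
pairs = [(e["path"], e["text"]) for e in elements if "path" in e and "text" in e]
```

The analogous idea in a flow would be to point the loop at the elements property of the parsed output rather than at the whole object.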

Thanks in advance if anyone can help.

takolota1 4,980 Moderator on at

You can also now use this template to extract data from PDFs without any Regex using GPT:

https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Extract-Data-From-PDFs-and-Images-With-GPT/td-p/2201345
