Power Automate
Unanswered

Extract Text From Structured PDF


Hi RPA Community,

I have a PDF file from which I want to extract the text. The PDF has a structured layout, and the output text file should follow the same structure. Can someone advise which approach I should use to get the correct output?

 

Here is the sample PDF and the desired text format once extracted:

https://pktgroup-my.sharepoint.com/:u:/p/nael_rashid/EfGkG-1n0KRAuHfFlbuV0q8BgPINNKLcVgNOfzcUqXk2lw?e=PdvHLd 

 

Appreciate your time and assistance,

Thanks and Regards,

Nael

 

  • yoko2020

    @naelaiman 

    Choose whichever fits your budget.

     

    AI Builder
    https://docs.microsoft.com/en-us/power-automate/use-ai-builder

     

    ABBYY FlexiCapture
    https://www.abbyy.com/flexicapture/

     

    Chronoscan
    https://www.chronoscan.org/

     

  • naelaiman

    Hi @yoko2020 ,

    Thank you for your suggestion. Does this mean I need to rely on AI Builder or another third-party software service to get the extracted text in my situation?

    I was hoping there is a way to get my desired output using the actions that are available in PAD.

     

    Thanks and Regards,

    Nael

  • yoko2020

    I never use the parse/regex action or the Extract text from PDF action in PAD when dealing with invoice, sales order, or custom form document (PDF/image) extraction. I always use third-party software specialized for this purpose.

     

    Things to consider when dealing with this kind of work:

    1. Does the document always come as a text PDF?

    2. What happens if the document comes as an image PDF?

    3. Are we dealing with 1,000 or more documents per month, or just 10 documents per month?

    4. What if one document contains multiple invoices that need to be separated?

        See this video for what I mean by document separation/invoice splitting:

        https://www.youtube.com/watch?v=9fFjQn_E8dI

    5. Sometimes an invoice spans multiple pages, so we face a variable number of invoice pages that need to be processed.

    Most of this software can handle invoice splitting, except Power Automate AI Builder.

    If you only process a small quantity, you can try the built-in PAD actions, but take note of those 5 points or your project will get stuck in the future.


  • Riyaz_riz11

    Hi @yoko2020 

    If the PDF has a constant header, you can use regex directly.

    First, use the Extract text from PDF action.

    Then use Parse text with a regex for the data you need.
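
    To illustrate the extract-then-regex idea outside PAD, here is a minimal Python sketch. It assumes the text has already been extracted, and the labels "Invoice No" and "Invoice Date" are hypothetical, since the real invoice labels are not shown in this thread. The same patterns could in principle be pasted into a Parse text action with regular expressions enabled.

```python
import re

# Hypothetical field labels; replace with the constant labels on the real invoice.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*No\s*[:.]?\s*(\S+)", re.IGNORECASE),
    "invoice_date": re.compile(
        r"Invoice\s*Date\s*[:.]?\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE
    ),
}


def parse_fields(text):
    """Return the first match for each labelled field, or None if it is absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Made-up sample standing in for the output of the extraction step.
sample = "ACME SDN BHD\nInvoice No : INV-00123\nInvoice Date: 29/03/2022\n"
fields = parse_fields(sample)
```

    This only works when each value sits right after its label in the extracted text, which is the "constant header" condition above.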

     

    Regards

    Ahammad Riyaz


  • UK_Mike

    "The PDF will be in a structured form and the output text file should follow the structure accordingly"

    This makes no sense. Surely you're just pulling particular values rather than the whole PDF text?

     

    If particular values, yes it can be done wholly in PAD...

     

    Screenshot 2022-03-29 140513.png

  • naelaiman

    Hi @yoko2020 ,

    I previously used regex and string manipulation to extract data from this PDF format. However, at that time I was using Automation Anywhere (AA), which can extract text in a structured format, so it was easy to extract the data line by line with string conditions. I have now had to migrate to PAD, and its text extraction does not behave the same as AA's, so the result is different from what I expected. Here is the output I got from the PAD Extract text from PDF action:

    https://pktgroup-my.sharepoint.com/:t:/p/nael_rashid/EZ0SpMbDJKhNuXr_s5REAQMB5J-syMWQw4s519g602_dRw?e=CVCMjm

    Let me answer your questions:

    1- Yes, this file will always come as a readable PDF, as it is generated by a system.

    2- If an image PDF is encountered, the extraction will return nothing, so error handling should be able to deal with that.

    3- Yes, we are dealing with 1000+ documents per month.

    4- No, there won't be any invoices combined together, as the system generates 1 invoice per order.

    5- If there are multiple pages, I should still be able to extract all the necessary information, provided the text output is in a structured format.
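
    Points 2 and 5 can be sketched in a few lines of Python. This is only an illustration under assumptions that are not confirmed in the thread: that each page's text is available as a separate string, and that the repeated header occupies a fixed number of lines at the top of every page. The function name merge_pages and the parameter header_lines are made up for this example.

```python
def merge_pages(pages, header_lines):
    """Join per-page text into one list of lines, keeping the header only once.

    pages        -- list of strings, one per PDF page (assumed input shape)
    header_lines -- number of lines the repeated header is assumed to occupy
    """
    # Point 2: an image-only PDF yields no text, so fail loudly instead of
    # passing empty data to later steps.
    if not any(page.strip() for page in pages):
        raise ValueError("no text extracted; the PDF may be image-only")

    first_header = pages[0].splitlines()[:header_lines]
    merged = []
    for index, page in enumerate(pages):
        lines = page.splitlines()
        # Point 5: drop the header on later pages only when it repeats exactly.
        if index > 0 and lines[:header_lines] == first_header:
            lines = lines[header_lines:]
        merged.extend(lines)
    return merged


# Made-up two-page example with a two-line header.
example = merge_pages(["HDR1\nHDR2\nitem A", "HDR1\nHDR2\nitem B"], 2)
```

    In a flow, the ValueError branch would correspond to whatever error handling moves the file aside for manual review.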

     

    Thanks and Regards,

    Nael

  • naelaiman

    Hi @Ahammad_Riyaz ,

    Yes, the PDF has a constant header, which repeats if the invoice has multiple pages. I tried this approach, but the result I get from "Extract PDF to Text" makes it hard to apply regex or string operations. This is probably due to the format of the PDF, which leaves the text result cluttered and not in pairs. Here is the output I got from the PAD Extract PDF to Text action: https://pktgroup-my.sharepoint.com/:t:/p/nael_rashid/EZ0SpMbDJKhNuXr_s5REAQMB5J-syMWQw4s519g602_dRw?e=D1RacB

    Is there any way for me to extract the text without using third-party applications or AI Builder?

     

    Thanks and Regards,

    Nael

  • naelaiman

    Hi @UK_Mike ,

    Yes, I want to pull particular values, but the result I get from Extract PDF to Text is not organized, and applying regex or string manipulation is difficult because the values don't seem to come in pairs for this PDF file. If the extracted text followed the same layout as the PDF, I could extract the invoice details as well as the item details. Here is the text output I get from PAD. As you can see, each value is hard to differentiate. https://pktgroup-my.sharepoint.com/:t:/p/nael_rashid/EZ0SpMbDJKhNuXr_s5REAQMB5J-syMWQw4s519g602_dRw?e=jGIYHG

    There are plenty of fields that I want to extract and evaluate, and if the extracted text comes in this format it can be quite troublesome to extract the details. I was hoping to find a solution without relying on third-party applications or AI Builder, as that would incur additional cost and there are over 1,000 PDF invoices to be processed per month.
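
    For comparison, this is roughly what line-by-line extraction of the item details looks like when the text does keep the PDF layout. It is a minimal Python sketch under loud assumptions: the column header word "Description", the sample rows, and the rule that columns are separated by two or more spaces are all hypothetical and not taken from the real invoice.

```python
import re


def parse_item_rows(layout_text, header_marker="Description"):
    """Collect the table rows that follow the line containing header_marker.

    Assumes columns are separated by runs of two or more spaces and that
    the table ends at the first blank line.
    """
    rows = []
    in_table = False
    for line in layout_text.splitlines():
        if not in_table:
            # Skip everything up to and including the column header line.
            in_table = header_marker in line
            continue
        if not line.strip():
            break  # a blank line marks the end of the item table
        rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


# Made-up layout-preserved text.
sample = (
    "Invoice No : INV-00123\n"
    "No  Description        Qty   Price\n"
    "1   Steel bolt M8     10    2.50\n"
    "2   Washer             40    0.10\n"
    "\n"
    "Total   29.00\n"
)
rows = parse_item_rows(sample)
```

    With text that has lost its indentation, the split on repeated spaces has nothing to work with, which is why the extracted output linked above is hard to parse this way.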

     

    Thanks and Regards,

    Nael

  • yoko2020

    @Ahammad_Riyaz 

     

    Yes, I know that technique very well.

    But I never use that method. It wastes time and means double work in the future when dealing with >2000 documents and 100 vendors (document layouts).

  • yoko2020

    @naelaiman 

     

    Which PAD version do you use? It looks like the Extract text from PDF action has a bug: it does not keep indentation.
