Power Automate - Power Automate Desktop
Answered

PDF has multiple tables, need to extract correct ones

Posted on 25 Sep 2023 22:30:05 by 23

Hi, I have a large PDF file which contains different types of tables. Each type of table appears in the PDF multiple times. I would like to extract all occurrences of only one type of table and write them to an Excel sheet.

For example, I have a table of outstanding invoices for customers. Each table contains only the invoices for one customer. We have 30 customers, so there are 30 of these tables (table 1) in the PDF.

Within the same PDF there are tables for invoices we owe to our vendors. We have 5 vendors, so there are 5 of these tables (table 2) in the PDF.

I would like to extract only table 2 information. Above each of these tables is text stating "Invoice info for Vendor xxx" where xxx is a vendor number. I can Parse Text to find "Invoice info for Vendor", then retrieve the vendor number with Get Subtext. How do I then extract the table underneath the text? I have tried Extract Tables from PDF, but when I parse the text, I do not know what the index of the associated table is.  

  • StanM
    23 on 27 Sep 2023 at 13:43:23
    Re: PDF has multiple tables, need to extract correct ones

    Thank you for the replies.

    I took your advice, Agnius, and split the PDF into multiple 1-page PDFs and processed that way. If the heading of a particular page was what I was looking for, I then extracted that page's table. 
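
The page-by-page check described above can be sketched outside Power Automate Desktop as follows. This is only an illustration of the logic, not the flow that was actually built: it assumes the text of each 1-page PDF has already been extracted into a string, that the vendor number is a single word-like token, and the sample page texts (including the customer heading) are hypothetical.

```python
import re

# Heading text quoted in the original question. The vendor-number format
# (\w+) is an assumption; adjust it to the real document.
VENDOR_HEADING = re.compile(r"Invoice info for Vendor\s+(\w+)")


def vendor_pages(page_texts):
    """Return (page_index, vendor_number) for each page containing the vendor heading."""
    matches = []
    for index, text in enumerate(page_texts):
        found = VENDOR_HEADING.search(text)
        if found:
            matches.append((index, found.group(1)))
    return matches


# Hypothetical page texts standing in for the text of each 1-page PDF.
pages = [
    "Outstanding invoices for Customer 1001\nINV-1 100.00",
    "Invoice info for Vendor 501\nV-77 250.00",
    "Outstanding invoices for Customer 1002\nINV-2 75.50",
    "Invoice info for Vendor 502\nV-78 40.00",
]

print(vendor_pages(pages))
```

Only the pages returned by `vendor_pages` would then be passed to the table-extraction step, which sidesteps the problem of not knowing the table index within the full document.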

  • Verified answer
    Agnius Bartninkas
    10,045 Most Valuable Professional on 26 Sep 2023 at 11:03:26
    Re: PDF has multiple tables, need to extract correct ones

    If you do not have a license for AI Builder and don't want to purchase one, I can think of a couple of alternatives for doing this:

    1. You could split the document into a separate file per page and then process each resulting page as a separate document. This works best when:
      1. Your tables do not span several pages,
      2. Each table starts on a new page, so no page contains several tables.
    2. You could use Parse text with regular expressions to retrieve the relevant values from your PDF using certain keywords and patterns.

    Regex is not subject to the two limitations of option 1, so it is the more flexible option, but it is arguably more complex to build, especially if you are not familiar with regex at all. Since the patterns are custom to each document, we would need sample data to provide concrete guidance.
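
To give a rough idea of what the regex route might look like, here is a minimal sketch. Since no sample data was shared, everything about the layout is an assumption: the row format (invoice id, ISO date, amount separated by spaces), the rule that a table ends at the first blank line, and the sample text itself are all hypothetical. Only the "Invoice info for Vendor" heading comes from the question.

```python
import re

# Heading quoted in the question; the vendor-number format is an assumption.
HEADING = re.compile(r"Invoice info for Vendor\s+(\w+)")
# Assumed row layout: invoice id, ISO date, amount. Adjust to the real PDF text.
ROW = re.compile(r"^(\S+)\s+(\d{4}-\d{2}-\d{2})\s+(\d+\.\d{2})$", re.MULTILINE)


def extract_vendor_rows(text):
    """Collect (vendor, invoice, date, amount) rows found under each vendor heading."""
    rows = []
    for heading in HEADING.finditer(text):
        vendor = heading.group(1)
        # Assumption: the table runs from the heading to the first blank line.
        section = text[heading.end():].split("\n\n", 1)[0]
        for invoice, date, amount in ROW.findall(section):
            rows.append((vendor, invoice, date, amount))
    return rows


# Hypothetical text standing in for the text extracted from the whole PDF.
sample = (
    "Outstanding invoices for Customer 1001\n"
    "INV-1 2023-09-01 100.00\n"
    "\n"
    "Invoice info for Vendor 501\n"
    "V-77 2023-09-03 250.00\n"
    "V-78 2023-09-10 40.00\n"
    "\n"
    "Invoice info for Vendor 502\n"
    "V-90 2023-09-12 12.50\n"
)

for row in extract_vendor_rows(sample):
    print(row)
```

Because the heading anchors each section, this does not depend on tables starting on a new page or fitting on one page, though a real document will almost certainly need different row and end-of-table patterns.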

     


  • OkanMTL
    703 Super User 2024 Season 1 on 26 Sep 2023 at 07:42:37
    Re: PDF has multiple tables, need to extract correct ones

    Have you tried Document Processing offered by Microsoft? It suits your problem better, as it uses AI to recognize tables in a file. Once your Document Processing model is built, you can run it from the cloud or from PAD, as you like.

     

    Good luck
