Answered

Not all tables are extracted correctly in PAD: how can I solve it?


Posted on by LPad

I have a multi-page PDF and I want to extract all of its tables. Only the table marked in green (in the attached image) is extracted correctly; the table marked in red is not. It is missing the header, the first couple of rows, and the first column. I have attached my flow here along with the PDF. Please suggest how I can extract the red-marked table exactly as it is. It is not a problem if the column header is missing from the extraction. I need a solution that also works for any future PDF in this format.

pad_pdf_file.pdf

flow_file.docx

Categories:

Power Automate Desktop


All responses (7)

Answers (2)


mmonline 169 on at


Well... the Extract tables from PDF action is fairly limited.

If the document is always formatted like the one you uploaded, you can extract it to text and then pass that to something like JavaScript, Python, or VBScript. I am looking at it now.

There are definitely pattern-esque elements. I am guessing this would require a bit of work to nail down and make work properly.

The tables returned are inconsistent, even between the first non-standard table and the next.
The above does not include the Index No. column.

The next one:
Closer but pretty ugly.

I have not evaluated any of the other tables.

Sorry I could not provide a better answer.

Verified answer

Agnius Bartninkas Most Valuable Professional on at

A table like that will not be extracted nicely via the Extract tables from PDF action. You will need to use the Extract text from PDF action and then parse the text to create a table variable.

Here's a sample flow that could handle the table you marked as red in your PDF file:

What it does is as follows:
1. Creates a new table and deletes the empty row (as it is impossible to create a table without the empty row)
2. Extracts the text from your PDF file
3. Parses the text with regex to retrieve the relevant lines
4. Splits the retrieved text into a list with an item for each line
5. Loops through the lines and splits it by space to get each item separately
6. If the line contains 5 items, inserts it directly into the table created in step 1, else, inserts it with some blank values in between.
Here's a snippet you can copy and paste directly into PAD to have the actions created automatically for you:
Variables.CreateNewDatatable InputTable: { ^['Index No.', 'Unit', 'Pack Breakup', 'Pack', 'Quantity'], [$'''''', $'''''', $'''''', $'''''', $''''''] } DataTable=> DataTable
Variables.DeleteRowFromDataTable DataTable: DataTable RowIndex: 0
Pdf.ExtractTextFromPDF.ExtractText PDFFile: $'''C:\\RPA\\pad_pdf_file.pdf''' DetectLayout: False ExtractedText=> ExtractedPDFText
Text.ParseText.RegexParseForFirstOccurrence Text: ExtractedPDFText TextToFind: $'''(?<=MULTIPACK\\r\\nQUANTITY\\r\\nIndex No..+\\r\\n)(.+\\r\\n)+?.+(?=\\r\\n\\d+\\s\\d+\\r\\n.+PREPACK)''' StartingPosition: 0 IgnoreCase: False OccurrencePosition=> Position Match=> Match
Text.SplitText.Split Text: Match StandardDelimiter: Text.StandardDelimiter.NewLine DelimiterTimes: 1 Result=> TextList
LOOP FOREACH TextLine IN TextList
    Text.SplitText.Split Text: TextLine StandardDelimiter: Text.StandardDelimiter.Space DelimiterTimes: 1 Result=> TextLineList
    IF TextLineList.Count = 5 THEN
        Variables.AddRowToDataTable.AppendRowToDataTable DataTable: DataTable RowToAdd: TextLineList
    ELSE
        Variables.AddRowToDataTable.AppendRowToDataTable DataTable: DataTable RowToAdd: [TextLineList[0], '', TextLineList[1], '', TextLineList[2]]
    END
END

Note you will need to change the file path in the Extract text from PDF action for this to work.
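For anyone who would rather do the parsing step outside PAD (Python was mentioned earlier in the thread), here is a rough sketch of the same logic. It is not part of the flow above. The sample text is made up, since the real PDF content is not shown in this thread, and the names `SAMPLE_TEXT`, `TABLE_BODY` and `parse_rows` are hypothetical. One known difference: Python's `re` module does not accept a variable-width lookbehind, so the header is matched as ordinary text and the table body is taken from a capture group instead.

```python
import re

# Made-up stand-in for the text returned by "Extract text from PDF".
# Every value here is hypothetical; the real document will differ.
SAMPLE_TEXT = (
    "MULTIPACK\r\n"
    "QUANTITY\r\n"
    "Index No. Unit Pack Breakup Pack Quantity\r\n"
    "1001 EA 6x2 12 120\r\n"
    "1002 4x3 48\r\n"
    "1003 EA 2x5 10 30\r\n"
    "3 198\r\n"
    "Totals PREPACK\r\n"
)

# Adapted from the PAD regex: header lines, then the table body (lazy),
# then a lookahead for the totals line followed by a line containing PREPACK.
TABLE_BODY = re.compile(
    r"MULTIPACK\r\nQUANTITY\r\nIndex No\..+\r\n"
    r"((?:.+\r\n)+?.+)"
    r"(?=\r\n\d+\s\d+\r\n.+PREPACK)"
)


def parse_rows(text):
    """Return the table body as a list of 5-item rows, or [] if not found."""
    match = TABLE_BODY.search(text)
    if match is None:
        return []
    rows = []
    for line in match.group(1).split("\r\n"):
        items = line.split(" ")
        if len(items) == 5:
            rows.append(items)
        else:
            # Same assumption as the ELSE branch of the flow: the line has
            # three items, and the Unit and Pack columns are blank.
            rows.append([items[0], "", items[1], "", items[2]])
    return rows


for row in parse_rows(SAMPLE_TEXT):
    print(row)
```

Like the flow, this assumes that any line without 5 items has exactly 3, so a line with fewer than 3 items would raise an error and would need its own handling.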

An important note here: this flow will not work for a document where the table is split over two separate pages. That's because it parses the text using the headers before the table and the totals after it. Since the headers are repeated on each page and the totals only appear on the last page, where the table ends, a table split over two pages would have the repeated headers included as text lines, so you would need to handle that.

In order to handle it, you could add some extra conditions into the loop like this:

This will skip the part of the loop that splits the line and inserts it into the table, if the line contains some of the text that should be in the headers.

Here's a snippet with the full flow:
Variables.CreateNewDatatable InputTable: { ^['Index No.', 'Unit', 'Pack Breakup', 'Pack', 'Quantity'], [$'''''', $'''''', $'''''', $'''''', $''''''] } DataTable=> DataTable
Variables.DeleteRowFromDataTable DataTable: DataTable RowIndex: 0
Pdf.ExtractTextFromPDF.ExtractText PDFFile: $'''C:\\RPA\\pad_pdf_file.pdf''' DetectLayout: False ExtractedText=> ExtractedPDFText
Text.ParseText.RegexParseForFirstOccurrence Text: ExtractedPDFText TextToFind: $'''(?<=MULTIPACK\\r\\nQUANTITY\\r\\nIndex No..+\\r\\n)(.+\\r\\n)+?.+(?=\\r\\n\\d+\\s\\d+\\r\\n.+PREPACK)''' StartingPosition: 0 IgnoreCase: False OccurrencePosition=> Position Match=> Match
Text.SplitText.Split Text: Match StandardDelimiter: Text.StandardDelimiter.NewLine DelimiterTimes: 1 Result=> TextList
LOOP FOREACH TextLine IN TextList
    IF (Contains(TextLine, 'MULTIPACK', False) OR Contains(TextLine, 'QUANTITY', False) OR Contains(TextLine, 'Index No', False)) = True THEN
        NEXT LOOP
    END
    Text.SplitText.Split Text: TextLine StandardDelimiter: Text.StandardDelimiter.Space DelimiterTimes: 1 Result=> TextLineList
    IF TextLineList.Count = 5 THEN
        Variables.AddRowToDataTable.AppendRowToDataTable DataTable: DataTable RowToAdd: TextLineList
    ELSE
        Variables.AddRowToDataTable.AppendRowToDataTable DataTable: DataTable RowToAdd: [TextLineList[0], '', TextLineList[1], '', TextLineList[2]]
    END
END
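If you handle the lines outside PAD, the header-skipping check could look roughly like this in Python. This is only a sketch: the sample lines are invented to imitate a table that continues onto a second page, and `HEADER_MARKERS`, `is_header_line` and `rows_from_lines` are hypothetical names. The check is a case-sensitive substring test on the same three marker strings used in the flow's IF condition.

```python
# Marker strings copied from the flow's IF condition.
HEADER_MARKERS = ("MULTIPACK", "QUANTITY", "Index No")


def is_header_line(line):
    """True if the line contains any of the repeated header strings."""
    return any(marker in line for marker in HEADER_MARKERS)


def rows_from_lines(lines):
    """Build 5-item rows, skipping header lines repeated on a new page."""
    rows = []
    for line in lines:
        if is_header_line(line):
            continue  # plays the role of NEXT LOOP in the flow
        items = line.split(" ")
        if len(items) == 5:
            rows.append(items)
        else:
            # Assumes a three-item line, as the flow's ELSE branch does.
            rows.append([items[0], "", items[1], "", items[2]])
    return rows


# Invented lines: one row, the headers repeated after a page break, one row.
PAGE_SPANNING_LINES = [
    "1001 EA 6x2 12 120",
    "MULTIPACK",
    "QUANTITY",
    "Index No. Unit Pack Breakup Pack Quantity",
    "1002 4x3 48",
]

for row in rows_from_lines(PAGE_SPANNING_LINES):
    print(row)
```

Anything else that gets repeated at a page break, such as page numbers or footers, is not covered by these three markers and would need its own condition.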
-------------------------------------------------------------------------
If I have answered your question, please mark it as the preferred solution.
If you like my response, please give it a Thumbs Up.
If you are interested in Power Automate, you might want to follow me on LinkedIn at https://www.linkedin.com/in/agnius-bartninkas/

Verified answer

Agnius Bartninkas Most Valuable Professional on at

P.S. You actually can create a data table with headers but no rows if you paste the code as
Variables.CreateNewDatatable InputTable: { ^['Index No.', 'Unit', 'Pack Breakup', 'Pack', 'Quantity'] } DataTable=> DataTable
In which case you would not need the Delete row from data table action.
But this is not something you can actually do via the UI of PAD at all. The only way to do it is to create the action first, then copy it to a text editor, modify it to remove the empty row and then paste it back to PAD. So, the more natural solution is the one I posted above.

LPad 14 on at


@mmonline Thanks for trying it out. I have tested the text version, but it's quite a lot of work.

Agnius Bartninkas Most Valuable Professional on at


Did you notice that I provided the entire script in my reply above?
LPad 14 on at


Thanks @Agnius. I don't want to stop at "PREPACK", though; I want all the tables between one order and the next. This flow stops before PREPACK, even though the PREPACK table has the same structure. I found a solution for that, so it now stops before the next order instead. It really is a great solution. Thanks again for helping me get there. 👍✌

Agnius Bartninkas Most Valuable Professional on at


Glad I could help. You should be able to take it from here and handle other tables in the file by adjusting the starting and ending keywords around the pattern.
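To illustrate what adjusting the starting and ending keywords can mean, here is a small Python sketch that collects the lines between any start keyword and the next end keyword. It uses a plain line scan instead of the regex from the flow, and everything in it is an assumption: the function name `lines_between`, the sample text, and the "ORDER" keyword are invented, because the real PDF text is not shown in this thread.

```python
def lines_between(text, start_keyword, end_keyword):
    """Return one list of lines per region that follows a line containing
    start_keyword and ends before the next line containing end_keyword."""
    blocks = []
    current = None  # None means we are not inside a region
    for line in text.splitlines():
        if current is None:
            if start_keyword in line:
                current = []  # region starts on the next line
        elif end_keyword in line:
            blocks.append(current)
            current = None
        else:
            current.append(line)
    return blocks  # a region with no closing keyword is dropped


# Invented text with two tables followed by a hypothetical next-order line.
SAMPLE = (
    "MULTIPACK\r\n"
    "Index No. Unit\r\n"
    "1001 EA 6x2 12 120\r\n"
    "Totals PREPACK\r\n"
    "Index No. Unit\r\n"
    "2001 EA 1x1 1 5\r\n"
    "ORDER 2\r\n"
)

# Stop at PREPACK: only the first table body is returned.
print(lines_between(SAMPLE, "Index No", "PREPACK"))

# Stop at the next order instead: both tables come back in one block,
# together with the totals and header lines that sit between them.
print(lines_between(SAMPLE, "Index No", "ORDER"))
```

With the wider end keyword, the block still contains the totals and repeated header lines, so those would need to be filtered out before the rows go into the table.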
