I need to extract information in tabular format from order confirmation PDFs received from suppliers. Each PDF has multiple items, and each item has a code, description, quantity and delivery date. So the table will have four columns: Code, Description, Quantity, Delivery Date, with each row representing an item.
The problem arises when some details for an item are present at the bottom of one page and the remaining details are on the next page.
For example: if this is the pdf
-----some text--------------------------------------------
-----some text---------------------------------------------
code: 101
description: this is first item
quantity: 56
delivery date: 22.08.2023
code: 102
description: this is second item
quantity: 65
delivery date: 23.08.2023
code: 103
description: this is third item
-------page 1 ends here---------
-------page 2 begins here--------
quantity: 72
delivery date: 24.08.2023
code: 104
description: this is fourth item
quantity: 80
delivery date: 23.08.2023
code: 105
description: this is fifth item
quantity: 60
delivery date: 21.08.2023
---------some text here--------------------------------
------------------------------page 2 ends----------------------
------------------------------pdf ends----------------------------
The document cannot be tagged correctly and the accuracy of the model is pretty bad. For the above document, the tagged tables look like this:
Code | Description | Quantity | Delivery Date |
101 | this is first item | 56 | 22.08.2023 |
102 | this is second item | 65 | 23.08.2023 |
103 | this is third item |
Code | Description | Quantity | Delivery Date |
72 | 24.08.2023 | ||
104 | this is fourth item | 60 | 21.08.2023 |
105 | this is fifth item | 80 | 23.08.2023 |
Let me know if there is any solution for this. I have already tried the following things:
- not tagging these rows in multipage documents
- tagging one page documents only for training and then using multi page ones during testing
- tagging the tables like shown above
Every time, the accuracy is bad. I also increased the number of documents used for training, but to no avail.
Text spanning across pages is unfortunately not yet supported.
Post processing could be a way, but it would be difficult to get a reliable behavior in all cases.
The only option is to manage this with manual edit after extraction which will make this process semi-automated.
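As a rough illustration of what such a post-processing step might look like, here is a minimal Python sketch that stitches the two per-page tables from the example back together. It assumes the extracted rows are available as lists of cell strings, and that the continuation row on the second page has its values shifted into the leftmost columns, as in the tagged output shown earlier. The function name `stitch_tables` and the "cell counts add up" heuristic are illustrative assumptions, not anything AI Builder provides, and the heuristic would not be reliable for every layout.

```python
COLUMNS = ["Code", "Description", "Quantity", "Delivery Date"]


def non_empty(row):
    """Return the stripped, non-empty cells of a row."""
    return [cell.strip() for cell in row if cell.strip()]


def stitch_tables(page1_rows, page2_rows):
    """Merge two per-page tables when the last row of page 1 continues on page 2.

    Heuristic (an assumption): the rows are joined only if the last row of
    page 1 is incomplete and its filled cells plus the filled cells of the
    first row of page 2 add up to exactly one full row.
    """
    first = [list(row) for row in page1_rows]
    second = [list(row) for row in page2_rows]
    if first and second:
        head = non_empty(first[-1])
        tail = non_empty(second[0])
        if len(head) < len(COLUMNS) and len(head) + len(tail) == len(COLUMNS):
            first[-1] = head + tail  # rebuild the split row
            second.pop(0)            # drop the continuation fragment
    return [dict(zip(COLUMNS, row)) for row in first + second]
```

Called with the two tables from the example, this would return five rows, with item 103 carrying quantity 72 and delivery date 24.08.2023. It only repairs the page break itself; it cannot correct rows the model mis-predicted elsewhere.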
Unfortunately, nothing worked with AI Builder. I tried to do some post-processing, but it was a huge effort for not-so-good results. The amount of effort I had to put in made no sense.
Furthermore, the model accuracy was drastically lower when dealing with new PDFs where some details for a row were split over two pages.
I am not using AI Builder anymore. I simply wrote some Python code that captures the incoming emails, downloads the PDF attachments to a folder, reads and extracts the relevant information from the PDFs and writes that information to an Excel file. The only drawback is that I had to write one Python script per document type/layout.
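For readers wondering how a script can avoid the page-break problem, here is a minimal sketch of the parsing step only, not the actual script described above. It assumes the text of each page has already been extracted into a list of strings (for example with a PDF library such as pypdf, which is an assumption, since no library is named here) and that the layout uses `label: value` lines as in the example document. The name `parse_items` is hypothetical.

```python
import re

# Matches lines such as "code: 101" or "delivery date: 22.08.2023"
LINE_RE = re.compile(
    r"^\s*(code|description|quantity|delivery date)\s*:\s*(.*?)\s*$",
    re.IGNORECASE,
)


def parse_items(page_texts):
    """Parse 'label: value' lines from all pages as one continuous stream.

    The current item is kept across page boundaries, so an item whose
    fields are split over two pages still ends up in a single row.
    """
    items = []
    current = None
    for text in page_texts:
        for line in text.splitlines():
            match = LINE_RE.match(line)
            if not match:
                continue  # skip headers, footers and other free text
            label, value = match.group(1).lower(), match.group(2)
            if label == "code":
                current = {"code": value}  # a new item starts at each code
                items.append(current)
            elif current is not None:
                current[label] = value
    return items
```

Because the pages are read as one stream, the page break between "description" and "quantity" for item 103 has no effect on the result. The regular expression is specific to this one layout, which matches the drawback mentioned above: each document type needs its own parsing rules.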
AI Builder is good only for handling one-page documents, or multi-page documents where the format is very clean.
I am also working on the same requirement. Has anyone found a solution for this?
Maybe AI Builder's Create text with GPT can help with the post-processing? 🙂
Give the whole table content to GPT and have it break it down into separate rows.
I could have considered a post-processing step if the model had accurately predicted all the other rows that do not break. But because of the break, the accuracy drops drastically and the model starts to make mistakes even for rows that have all their details on the same page.
We were thinking of extending the Power Platform across our organisation in all countries, but unfortunately, if this is the case, it doesn't make sense to do so.
We are better off parsing the whole documents using Python and writing our own logic to extract the relevant information. Thanks.
Ah, sorry, I didn't notice the row for item 103. Rows that break from one page to another are not supported.
It's a bit challenging. Perhaps you could consider consolidating this extracted data during a post-processing step?
Regards,
Hi @plarrue ,
Yes, after tagging the first page and getting the table
Code | Description | Quantity | Delivery Date |
101 | this is first item | 56 | 22.08.2023 |
102 | this is second item | 65 | 23.08.2023 |
103 | this is third item |
I selected "This table continues on next page", tagged the content on the second page and got the table below
Code | Description | Quantity | Delivery Date |
72 | 24.08.2023 | ||
104 | this is fourth item | 60 | 21.08.2023 |
105 | this is fifth item | 80 | 23.08.2023 |
The problem is with the third item, as its code and description are on the first page and the values for quantity and delivery date are on the second page.
Hi @Charanjit ,
Thanks for reaching out.
"The problem arises when some details for an item are present at the bottom of one page and the remaining details are on the next page." - Did you select "This table continues on next page" and continue tagging the table on the following page?
Thanks,
Regards