I need to extract tabular information from order confirmation PDFs received from suppliers. Each PDF contains multiple items, and each item has a code, description, quantity and delivery date. The table therefore has four columns (Code, Description, Quantity, Delivery Date), with each row representing one item.
The problem arises when some details of an item are at the bottom of one page and the remaining details are on the next page.
For example, suppose this is the PDF:
-----some text--------------------------------------------
-----some text---------------------------------------------
code: 101
description: this is first item
quantity: 56
delivery date: 22.08.2023
code: 102
description: this is second item
quantity: 65
delivery date: 23.08.2023
code: 103
description: this is third item
-------page 1 ends here---------
-------page 2 begins here--------
quantity: 72
delivery date: 24.08.2023
code: 104
description: this is fourth item
quantity: 80
delivery date: 23.08.2023
code: 105
description: this is fifth item
quantity: 60
delivery date: 21.08.2023
---------some text here--------------------------------
------------------------------page 2 ends----------------------
------------------------------pdf ends----------------------------
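To make the target concrete, here is a minimal sketch of how the expected five-row table follows from the sample once page boundaries are ignored. It assumes the text of each page has already been extracted into a string with the `key: value` layout shown above; `parse_items` and `FIELDS` are hypothetical names used only for illustration, not part of any tagging tool.

```python
FIELDS = ("code", "description", "quantity", "delivery date")

def parse_items(page_texts):
    """Join all pages, then group `key: value` lines into items."""
    items, current = [], {}
    for line in "\n".join(page_texts).splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if not sep or key not in FIELDS:
            continue  # skip headers, footers and other free text
        if key == "code" and current:
            items.append(current)  # a new code starts the next item
            current = {}
        current[key] = value.strip()
    if current:
        items.append(current)
    return items

# Sample pages from the example above (assumed already extracted as text).
page1 = """-----some text-----
code: 101
description: this is first item
quantity: 56
delivery date: 22.08.2023
code: 102
description: this is second item
quantity: 65
delivery date: 23.08.2023
code: 103
description: this is third item"""
page2 = """quantity: 72
delivery date: 24.08.2023
code: 104
description: this is fourth item
quantity: 80
delivery date: 23.08.2023
code: 105
description: this is fifth item
quantity: 60
delivery date: 21.08.2023
-----some text here-----"""

items = parse_items([page1, page2])
for item in items:
    print(" | ".join(item.get(f, "") for f in FIELDS))
```

Because the pages are joined before grouping, item 103 picks up its quantity and delivery date from page 2 instead of being cut off at the page break.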
Such a document cannot be tagged correctly, and the accuracy of the model is poor. For the document above, the tagged tables look like this:
Code | Description | Quantity | Delivery Date |
101 | this is first item | 56 | 22.08.2023 |
102 | this is second item | 65 | 23.08.2023 |
103 | this is third item |
Code | Description | Quantity | Delivery Date |
72 | 24.08.2023 | ||
104 | this is fourth item | 80 | 23.08.2023 |
105 | this is fifth item | 60 | 21.08.2023 |
Is there any solution for this? I have already tried the following:
- not tagging these rows in multi-page documents
- tagging only single-page documents for training and then using multi-page ones for testing
- tagging the tables as shown above
Every time, the accuracy is bad. I also increased the number of training documents, but to no avail.
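One direction not in the list above would be to leave the per-page tagging as it is and repair the split row after extraction. The following is only a minimal sketch under stated assumptions: the model returns one list of rows per page, and a split item shows up as a short last row on one page followed by a short first row on the next. `merge_page_tables` and `COLUMNS` are hypothetical names, not part of any tool's API.

```python
COLUMNS = ["Code", "Description", "Quantity", "Delivery Date"]

def merge_page_tables(pages):
    """Concatenate per-page row lists, joining a row split by a page break."""
    merged = []
    for rows in pages:
        for i, row in enumerate(rows):
            cells = [c.strip() for c in row if c.strip()]  # drop empty cells
            # Assumption: only the first row of a page can continue the
            # previous row, and both halves together fill every column.
            is_continuation = (
                i == 0
                and bool(merged)
                and len(merged[-1]) < len(COLUMNS)
                and len(merged[-1]) + len(cells) == len(COLUMNS)
            )
            if is_continuation:
                merged[-1].extend(cells)
            else:
                merged.append(cells)
    return merged

# Rows per page, following the sample document above.
page1 = [
    ["101", "this is first item", "56", "22.08.2023"],
    ["102", "this is second item", "65", "23.08.2023"],
    ["103", "this is third item"],
]
page2 = [
    ["72", "24.08.2023", "", ""],
    ["104", "this is fourth item", "80", "23.08.2023"],
    ["105", "this is fifth item", "60", "21.08.2023"],
]

table = merge_page_tables([page1, page2])
for row in table:
    print(" | ".join(row))
```

The column-count check is a deliberately simple heuristic; it would not cover an item split at a different point in every document, or a row that is short for other reasons.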