I need to extract information in tabular format from the order confirmation PDFs I receive. Each PDF has multiple items, and each item has a Name (shown as Code in the example below), Description, Vendor, and Delivery Date.
So the table will have four columns: Name, Description, Vendor, and Delivery Date, with each row representing one item.
The problem arises when some details for an item are at the bottom of one page and the remaining details are on the next page. For example, when a Description cell in the table continues onto page 2, I am unable to tag the two parts as a single row spanning both pages.
For example, suppose this is the PDF:
-----some text--------------------------------------------
-----some text---------------------------------------------
code: 101
description: this is first item
Vendor: XYZ1
delivery date: 12.01.2024
code: 102
description: this is second item
Vendor: XYZ2
delivery date: 13.01.2024
code: 103
description: this is third item
-------page 1 ends here---------
-------page 2 begins here--------
description (continuing from page 1): this is third item continuing
Vendor: XYZ3
delivery date: 14.01.2024
code: 104
description: this is fourth item
Vendor: XYZ4
delivery date: 15.01.2024
code: 105
description: this is fifth item
Vendor: XYZ5
delivery date: 16.01.2024
---------some text here--------------------------------
------------------------------page 2 ends----------------------
------------------------------pdf ends----------------------------
The document cannot be tagged correctly with a custom model when the Description from page 1 continues on page 2. For the above document, the tagged tables look like this:
| Code | Description | Vendor | Delivery Date |
| 101 | this is first item | XYZ1 | 12.01.2024 |
| 102 | this is second item | XYZ2 | 13.01.2024 |
| 103 | this is third item | XYZ3 | 14.01.2024 |

| Code | Description | Vendor | Delivery Date |
| | Some text continuing from page 1 | | |
| 104 | this is fourth item | XYZ4 | 15.01.2024 |
| 105 | this is fifth item | XYZ5 | 16.01.2024 |
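
One possible workaround, if the split cannot be fixed at tagging time, is to merge the per-page tables in a post-processing step. The sketch below is only an illustration under stated assumptions: it assumes each page's table has already been converted into a list of dicts keyed by column name (that conversion is not shown and depends on the extraction tool), and that a row with an empty Code is always a continuation of the last row before it. The function name `merge_continuation_rows` and the dict layout are hypothetical and not part of any service API.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_continuation_rows(page_tables: List[List[Row]], key: str = "Code") -> List[Row]:
    """Flatten per-page tables into one list of rows.

    Assumption: a row whose `key` column is empty continues the previous row,
    so its non-empty cells are appended to that row's cells.
    """
    merged: List[Row] = []
    for table in page_tables:
        for row in table:
            is_continuation = not row.get(key, "").strip()
            if is_continuation and merged:
                previous = merged[-1]
                for column, value in row.items():
                    value = value.strip()
                    if not value:
                        continue  # nothing to carry over for this column
                    existing = previous.get(column, "").strip()
                    previous[column] = f"{existing} {value}" if existing else value
            else:
                # Copy so the caller's input rows are never modified.
                merged.append(dict(row))
    return merged


# Rows shaped like the tagged tables above (shortened to the relevant items).
page_1 = [
    {"Code": "102", "Description": "this is second item", "Vendor": "XYZ2", "Delivery Date": "13.01.2024"},
    {"Code": "103", "Description": "this is third item", "Vendor": "XYZ3", "Delivery Date": "14.01.2024"},
]
page_2 = [
    {"Code": "", "Description": "Some text continuing from page 1", "Vendor": "", "Delivery Date": ""},
    {"Code": "104", "Description": "this is fourth item", "Vendor": "XYZ4", "Delivery Date": "15.01.2024"},
]

rows = merge_continuation_rows([page_1, page_2])
for row in rows:
    print(row)
```

With this input, the row for item 103 ends up with both description fragments joined by a space, and the empty-Code row disappears from the result. The rule is deliberately simple; a real document may need a stricter check (for example, only treating the first row of a page as a possible continuation).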