Hello colleagues! I'm attempting to customize a Document Processing model, training it to read a monthly invoice from our cell carrier. The document consists of up to 30 pages of a single table with 15 columns and 12-15 rows per page. There are various data types, but I've assigned them all "text". The table has no headers useful for training. Because I only have two paper examples of the invoice to work with, and the model requires a minimum of five training documents, I've split the two documents into 11 PDFs of different page counts, each sharing the same first and last pages, to simulate the needed number of training docs. Maybe 90 pages total. The tables are built exactly the same way on all pages, except page 1, which has some minor differences. I've created a table with headers in the model and have spent hours tagging the table on each page, selecting the "Table continues on next page" option during the tagging process. I've checked each document to make sure the table is being recognized as I expected. Everything looks good.
After three attempts and much cursing, the results have been disappointing. Using the "Quick Test," the model appears to correctly identify the table boundaries, columns, and rows on the first couple of pages. Starting on page three or four, though, it starts to lose the thread: misaligning the table boundaries, failing to read cell data in whole or in part, dropping columns, adding rows, and so on.
Any explanation for this behavior? Do I need more training docs? A different approach altogether? I've thought of building a separate table for each page, but that's complex and unworkable because of the table's varying length. Ultimately, I plan to build a PA Flow that reads the data into an Excel file for later analysis. (I've done this once with the Invoice Model. My new model isn't reliable enough at the moment to proceed.)
Thanks in advance for your help.
Wannabe AI Fanboi