Hello colleagues! I'm attempting to customize a Document Processing model, training it to read a monthly invoice from our cell carrier. The document consists of up to 30 pages of a single table with 15 columns and 12-15 rows per page. There are various data types, but I've assigned them all "text". The table has no headers useful for training. Because I only have two paper examples of the invoice to work with, and the model requires a minimum of five training documents, I've split the two documents into 11 PDFs of different page counts, each sharing the same first and last pages, to simulate the needed number of training docs. Maybe 90 pages total. The tables are built exactly the same way on all pages, except page 1, which has some minor differences. I've created a table with headers in the model and have spent hours tagging the table on each page, selecting the "Table continues on next page" option during the tagging process. I've checked each document to make sure the table is being recognized as I expected. Everything looks good.
After three attempts and much cursing, the results have been disappointing. Using the "Quick Test," the model appears to correctly identify the table boundaries, columns, and rows on the first couple of pages. Starting on page three or four, though, it starts to lose the thread: misaligning the table boundaries, failing to read cell data in whole or in part, dropping columns, adding rows, and so on.
Any explanation for this behavior? Do I need more training docs? A different approach altogether? I've thought of building a separate table for each page, but that's complex and unworkable because of the table's varying length. Ultimately, I plan to build a PA Flow that reads the data into an Excel file for later analysis. (I've done this once with the Invoice Model. My new model isn't reliable enough at the moment to proceed.)
Thanks in advance for your help.
Wannabe AI Fanboi