We have implemented AI Builder Invoice Processing with custom-trained collections at one of our customers, and we keep running into inconsistent, unexplainable behavior.
This is what we did
We have customized the pre-built Invoice Processing model by training 75 collections (one for each of 75 invoice types), with 10-40 invoices per collection.
We are extracting five fields. Four are standard fields (Invoice ID, Invoice Date, Invoice Total, and Supplier Tax ID), which we re-tagged ourselves on all invoices in all collections. The fifth is a custom text field.
Many invoices have multiple pages, and Invoice ID and Invoice Date often appear on every page.
We do not extract any line items.
We have implemented the Document Automation Toolkit (a Power Platform reference canvas app from Microsoft) for validating and correcting invoices with low confidence scores. We have set the confidence threshold for all five fields at 80%.
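The routing decision amounts to the following (a minimal Python sketch for clarity; the actual implementation is the toolkit's Power Automate flow, and the field names and prediction shape here are illustrative assumptions, not the AI Builder schema):

```python
# Minimal sketch of our low-confidence routing (illustrative only; the real
# implementation is the Document Automation Toolkit's Power Automate flow,
# and this prediction shape is an assumption, not the AI Builder schema).

CONFIDENCE_THRESHOLD = 0.80  # same threshold for all five fields

REQUIRED_FIELDS = [
    "InvoiceId",
    "InvoiceDate",
    "InvoiceTotal",
    "SupplierTaxId",
    "CustomField",  # hypothetical name for our fifth, custom text field
]

def needs_manual_validation(prediction: dict) -> bool:
    """True if any field is missing or scored below the 80% threshold.

    `prediction` is assumed to map field names to
    {"value": ..., "confidence": float} dicts.
    """
    for field in REQUIRED_FIELDS:
        result = prediction.get(field)
        if result is None or result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
            return True  # route to the validation canvas app
    return False  # straight-through processing
```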
We released the model to production at the end of May 2024. Since then, around 700 invoices per day have been processed.
We do not retrain the model on a regular or scheduled basis; we only retrain after we have made changes to the collections (e.g., when an invoice template has changed). We have done this 8 times since the release at the end of May.
This is what we are facing
For some invoice types we have seen a sudden, large drop in the confidence levels for Invoice ID and Invoice Date.
The related collections each contain at least 20 invoices, and all five fields are consistently and correctly tagged on all of them.
The Invoice ID and Invoice Date fields are clearly readable and recognizable on the invoices, and there are no other IDs or dates nearby that the model could be confused by. So visually there is no reason for the model to have low confidence.
When we compare invoices with low confidence against invoices with high confidence, we see no differences between them. The only pattern is some correlation with the number of pages per invoice: invoices with 4-7 pages seem to do better than those with 2-3 pages.
For another invoice type, which always comes as multi-page invoices, we keep having difficulty getting the model to pick the correct Invoice Total.
Despite the 40 consistently and correctly tagged invoices now in the related collection, the model often still has low confidence on Invoice Total, and it sometimes picks the subtotal on the first page as the Invoice Total, occasionally even with very high confidence.
When we compare invoices with low confidence against invoices with high confidence, we again see no differences between them, except for the same correlation with page count: invoices with 4-7 pages and high Invoice Totals seem to do better than those with 2-3 pages and low Invoice Totals. A rough way to check that correlation is sketched below.
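To make the page-count pattern concrete, this is roughly how we check the correlation from our logs (a sketch with made-up example values; the export format is an assumption):

```python
# Rough sketch of how we check the page-count/confidence correlation
# (the records below are made-up example values; in practice we export
# page counts and confidence scores from our processing logs).
from statistics import correlation  # Python 3.10+

# Each record: (number of pages, confidence for the affected field)
records = [
    (2, 0.41), (3, 0.55), (4, 0.88), (5, 0.91), (7, 0.93),
]

pages = [p for p, _ in records]
confidences = [c for _, c in records]

# Pearson correlation coefficient; a strongly positive value supports
# the "more pages -> higher confidence" pattern we keep observing.
print(correlation(pages, confidences))
```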
In a recent test with one of our non-prod models, we compared two runs, one week apart, of exactly the same set of test invoices against exactly the same model, and saw different confidence levels. The test set consisted only of invoices for which specific collections had been trained in the model.
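Concretely, the comparison we ran amounts to this (a sketch assuming each run's output can be exported as a per-invoice, per-field confidence map):

```python
# Sketch of how we compared the two runs of the same test set against the
# same model (assumed export format: {invoice_id: {field: confidence}}).

def diff_runs(run_a: dict, run_b: dict, tolerance: float = 0.01) -> list:
    """List (invoice, field, conf_a, conf_b) where confidence drifted
    by more than `tolerance` between the two runs."""
    drifts = []
    for invoice_id, fields_a in run_a.items():
        fields_b = run_b.get(invoice_id, {})
        for field, conf_a in fields_a.items():
            conf_b = fields_b.get(field, 0.0)
            if abs(conf_a - conf_b) > tolerance:
                drifts.append((invoice_id, field, conf_a, conf_b))
    return drifts

# With a deterministic model and unchanged inputs we would expect an
# empty list; instead we saw non-trivial drift on unchanged invoices.
```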
This is what we've tried in order to find root causes and solve the issues
We have checked and confirmed that templates, file types and PDF structures did not change.
Over the past three months we have run many experiments and tests, with isolated non-prod single-collection models as well as with the actual full model that is in production, trying to find patterns leading to the root cause(s) (our basic harness for comparing iterations is sketched after this list):
We have re-tagged invoices.
We have rebuilt collections from scratch.
We have ‘counter-tagged’ additional fields that might create confusion for the model.
We have iterated with re-tagging and not re-tagging standard fields.
We have iterated with removing and not removing unused standard fields.
We have iterated with the number of invoices in a collection.
We have iterated with the mix of invoices in a collection (e.g., few pages vs. many pages, old vs. recent invoices).
Some of these measures did change the results, sometimes in one direction, sometimes in another, but the changes were never consistent or reproducible between successive iterations.
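The harness behind these iterations was essentially the loop below (a sketch; `score_collection` is a hypothetical stand-in for whatever runs the non-prod model over the holdout invoices):

```python
# Sketch of the loop we used across iterations (hypothetical helper names;
# the actual scoring runs the non-prod AI Builder model over the same
# fixed holdout set of invoices every time, so iterations are comparable).
from statistics import mean

def run_experiment(variants: dict, score_collection) -> dict:
    """For each collection variant, record the mean confidence per field
    on the fixed holdout set."""
    results = {}
    for name, collection_config in variants.items():
        # score_collection is assumed to return {field: [confidences]}
        scores = score_collection(collection_config)
        results[name] = {field: round(mean(confs), 3)
                         for field, confs in scores.items()}
    return results
```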
Overall, this has taken a tremendous amount of time and has put us, now three months after release, in a difficult and tense situation with our customer. We replaced a UiPath-based Document Understanding solution (which we implemented ourselves 3.5 years ago) with this AI Builder solution, with the promise that it would be cheaper and perform better. Cheaper it is; better it for sure isn't so far.
Does anyone have any knowledge on what might be going on here and/or does anyone have any advice on how to solve this situation?