Hi everyone!
I'm training a document processing model with about 120 PDF documents and 40 fields. The PDFs are split across 3 collections (3 different layouts): the first with 70 documents, the second with 34, and the third with 16.
Mainly in the first collection, I have this problem:
Some of the documents are native PDFs, like the example above, and tagging them is no problem because the detected word is exactly the one in the document.
On the other hand, some are scanned or photocopied documents whose poor quality makes it difficult to detect the values 100% correctly. For example, in one case the detected value was "22:18" instead of "22.18".
I have other examples where the detected values look like "22 18", "22.18.", or "*22.18". On poor-quality scanned documents, this happens for about 4 of the 40 fields.
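As a side note, I could probably normalize these particular variants in post-processing. A minimal sketch, where the cleanup rules are just guesses based on the variants above and would need tuning against real output:

```python
import re

def normalize_number(raw):
    """Map noisy OCR output like '*22.18.' or '22 18' back to '22.18'."""
    cleaned = raw.strip().strip("*")         # drop stray asterisks at the edges
    cleaned = cleaned.rstrip(".")            # drop a trailing period
    cleaned = re.sub(r"[ :]", ".", cleaned)  # space/colon misread as the separator
    return cleaned

for raw in ["22:18", "22 18", "22.18.", "*22.18"]:
    print(normalize_number(raw))  # each prints 22.18
```

This only repairs the output after extraction, though; it does not answer the tagging question below.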
So my question is: what do you recommend I do in these cases?
- Tag the word, even knowing that it is not exactly the correct value.
- Choose the "Not available in document" option, even though the word is present in the document and it is only the detected value that is not exactly correct.
- Remove that PDF from the training collection (not my favorite, because the split between native and scanned PDFs in my real use case is roughly 50/50, so I want to include this kind of case in the training).
Please base your answer on what is best for training the model; I want the performance to be reliably good. Also, feel free to suggest another alternative that I may not have considered. Thanks in advance for your help!! 😊
Hi @plarrue, thank you very much for your reply! I have not finished training the model, but I am going to use the option you recommend.
As I understand your answer, once the model is trained and in use, values like "22 18", "22.18.", "*22.18" (coming from poor-quality documents) will have a significantly lower confidence score than a correct value like "22.18", so I will be able to distinguish these cases based on that score. Am I correct?
Hi @dcortes187
We would recommend the first option: tag the word, even knowing that it is not exactly the correct value.
This teaches the model about different document quality levels, which is good.
When the documents are processed, we then expose a confidence score for each field, which can be used to flag documents that need manual review.
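As a rough illustration of using that score to route documents (the field names, score values, and the 0.8 cutoff below are made-up placeholders, not AI Builder defaults — pick a threshold by checking it against documents you have verified by hand):

```python
REVIEW_THRESHOLD = 0.8  # assumed cutoff; tune on your own validated documents

def needs_review(fields, threshold=REVIEW_THRESHOLD):
    """Return the names of fields whose confidence falls below the threshold."""
    return [name for name, (value, score) in fields.items() if score < threshold]

# Hypothetical prediction for one document: field -> (extracted value, confidence)
prediction = {
    "amount": ("22.18", 0.97),  # clean native PDF: high confidence
    "time":   ("22:18", 0.41),  # poor scan: low confidence, flag it
}

flagged = needs_review(prediction)
print(flagged)  # fields to send to manual review
```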
Improve the performance of your document processing model - AI Builder | Microsoft Learn
Hope it helps.