Answered

Character recognition poor in 1930s vintage forms

Posted on by 55552

I attempted to train a model on some old typewritten oil well permits from the 1930s. They are scanned PDF forms from a government agency, with typed answers generally in similar locations, so field location is OK. However, many of the answers were typed almost directly on the dotted lines of the forms instead of slightly above them. That seemed to cause poor character recognition and a poor overall document processing model training score of 46%.

Any suggestions for improving this, such as removing all horizontal dotted lines from the PDF document? Or are there simply some applications that need a person to extract the information?

I tried using "Enhance" in Adobe Pro, but that did not help.

Any suggestions are appreciated.

Categories:

AI Builder

JoeF-MSFT Microsoft Employee on at

Hi @Runner55552 - this seems to be a fascinating project, bringing AI to documents from the 1930s! 🙂

I'd recommend training a new model and selecting Unstructured and free-form documents as the document type. It may perform better on these forms, and it uses a more up-to-date version of our OCR engine, which will also be available for structured and semi-structured documents by September.

With the sample screenshot you provided, I can already see some improvements when selecting Unstructured and free-form documents.

When selecting Unstructured and free-form documents: [screenshot]

When selecting structured and semi-structured documents: [screenshot]

Verified answer

55552 682 on at

Thanks for testing and suggesting this. I will give this a try. We do have some tables in our files also, and I see the unstructured option does not yet have the ability to model tables. However, I think I can work around that for now, since most tables only have up to 4 rows, so I can "flatten" the records and consider them as individual fields for now.
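
For anyone trying the same workaround, here is a minimal Python sketch of the "flatten" idea, only to illustrate the naming scheme. The column names, the `casing` prefix, and the `flatten_table` helper are all hypothetical and are not part of AI Builder; the row limit of 4 comes from the table sizes mentioned above.

```python
from typing import Dict, List


def flatten_table(
    rows: List[Dict[str, str]],
    columns: List[str],
    prefix: str,
    max_rows: int = 4,
) -> Dict[str, str]:
    """Turn up to max_rows table rows into individually named fields.

    Every possible row/column slot gets a field name, so the field list
    is identical for every document; unused slots are empty strings.
    """
    if len(rows) > max_rows:
        raise ValueError(f"table has {len(rows)} rows, limit is {max_rows}")

    flat: Dict[str, str] = {}
    for index in range(max_rows):
        # Rows beyond the ones actually present are treated as empty.
        row = rows[index] if index < len(rows) else {}
        for column in columns:
            flat[f"{prefix}_row{index + 1}_{column}"] = row.get(column, "")
    return flat


if __name__ == "__main__":
    # Hypothetical two-column table with a single filled row.
    example = flatten_table(
        [{"depth": "100", "size": "8"}], ["depth", "size"], "casing", max_rows=2
    )
    for name, value in example.items():
        print(name, "=", repr(value))
```

Each generated name (for example `casing_row1_depth`) would then be tagged as its own field in the model.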

55552 682 on at

One other item I noticed while selecting/tagging text, in both the structured and unstructured model setup: even when highlighting only the desired text with the mouse, the text that appears as the value is the word/phrase that has already been recognized as text by the model. So I cannot remove extraneous text, tick marks, etc. at the end of the word by making the selection box smaller. Will this be a feature in a future enhancement?

55552 682 on at

Update: I re-trained the model using the unstructured document option. The accuracy of FINDING the location of the fields improved greatly. However, there were still many issues with actual character recognition. The biggest problems were:
1. Typewritten data was on or near a dotted line, producing phantom dots that were interpreted as periods or various other characters.
2. The ' and " tick marks used in the form for feet and inches were often misinterpreted as ! or 1 or other characters.
3. Many extra spaces were inserted in various locations, although I could probably use Power Automate to remove the white space.
4. Number values were often wrong, again due in part to the dotted lines or other marks.

If there were the ability to manually train by telling the model during training that Oil equals Oil instead of 011, that would be very useful.
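
Until something like that exists, a field with a small set of expected values can be corrected after extraction. The Python sketch below shows one possible approach and is not an AI Builder feature: it maps look-alike characters (O/0, i/l/1/!) to a shared canonical form and then looks the result up in a known vocabulary. The vocabulary (`Oil`, `Gas`, `Water`) and the function names are assumptions for illustration.

```python
from typing import Dict, Iterable

# Characters that OCR commonly confuses, mapped to one canonical symbol.
CONFUSABLE = {
    "O": "0", "o": "0", "0": "0",
    "I": "1", "i": "1", "l": "1", "1": "1", "!": "1", "|": "1",
}


def canonical(text: str) -> str:
    """Replace confusable characters with their canonical symbol; lowercase the rest."""
    return "".join(CONFUSABLE.get(ch, ch.lower()) for ch in text)


def build_lookup(vocabulary: Iterable[str]) -> Dict[str, str]:
    """Index each expected value by its canonical form.

    If two vocabulary words share a canonical form, the later one wins,
    so this only suits small, clearly distinct vocabularies.
    """
    return {canonical(word): word for word in vocabulary}


def correct(value: str, lookup: Dict[str, str]) -> str:
    """Return the matching vocabulary word, or the value unchanged if none matches."""
    return lookup.get(canonical(value.strip()), value)


if __name__ == "__main__":
    lookup = build_lookup(["Oil", "Gas", "Water"])  # hypothetical field values
    for raw in ["011", "0il", "GAS", "1200"]:
        print(repr(raw), "->", repr(correct(raw, lookup)))
```

This only helps for fields whose valid values are known in advance; free-text and numeric fields would pass through unchanged.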

Thanks again for the helpful suggestions.

JoeF-MSFT Microsoft Employee on at

Hi @Runner55552, thanks a lot for taking the time to provide this update! Great to hear that field location improved greatly when using the unstructured documents option.

Today there is no option to adjust the text selection at the character level, nor an option to provide feedback during tagging to correct wrongly detected characters. As you suggested, one option is to do post-processing in a Power Automate cloud flow, using expressions to remove white space, dots, and tick marks.
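
As a rough illustration of that post-processing logic, here is a Python sketch of the cleanup rules. It is not Power Automate expression syntax, and the equivalent expressions in a cloud flow would need to be written separately. The function names and the specific rules are assumptions: runs of two or more dots are treated as dotted-line residue, and a `!` directly after a digit is assumed to be a misread feet mark.

```python
import re


def clean_text(value: str) -> str:
    """Collapse extra white space and drop phantom dots from dotted lines."""
    # Collapse any run of white space into a single space.
    value = re.sub(r"\s+", " ", value).strip()
    # Two or more dots in a row (optionally spaced) are assumed to be
    # residue from the form's dotted line, not real punctuation.
    value = re.sub(r"(?:\.\s*){2,}", " ", value)
    # Remove stray dots and spaces left at either end.
    value = value.strip(" .")
    return re.sub(r"\s+", " ", value)


def normalize_ticks(value: str) -> str:
    """Restore a feet mark that was read as '!' directly after a digit."""
    return re.sub(r"(?<=\d)!", "'", value)


if __name__ == "__main__":
    print(repr(clean_text("  Smith   No. 1 ....  ")))
    print(repr(normalize_ticks('1250! 6"')))
```

A single period followed by text, as in "No. 1", and decimals such as "12.5" are left alone. A tick mark misread as the digit 1 cannot be told apart from a real 1 by rules like these, so numeric fields would still need a manual check.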
