Character recognition poor in 1930s vintage forms


I attempted to train a model on some old, typewritten oil well permits from the 1930s. They are scanned PDF forms from a government agency, with typed answers in generally similar locations, so field location is OK. However, many of the answers were typed almost directly on the dotted lines of the forms instead of slightly above them. That seemed to cause poor character recognition and a poor overall document processing model training score of 46%.

Any suggestions for improving this, such as removing all horizontal dotted lines within the PDF document? Or are there simply some applications where a person needs to extract the information?

I tried using "Enhance" in Adobe Pro, but that did not help.

Any suggestions are appreciated.

Runner55552_0-1657569238873.png

 

  • JoeF-MSFT

    Hi @Runner55552 - this seems to be a fascinating project, bringing AI to documents from the 1930s! 🙂

     

    I'd recommend training a new model and selecting Unstructured and free-form documents as the document type. It might perform better with this kind of document, and it uses a more up-to-date version of our OCR engine, which will also be available for structured and semi-structured documents by September.

     

    JoeFMSFT_0-1657572363299.png

     

    With the sample screenshot you provided, I can already see some improvements when selecting Unstructured and free-form documents.

     

    When selecting Unstructured and free-form documents:

    JoeFMSFT_3-1657572565475.png

     

    When selecting structured and semi-structured documents:

    JoeFMSFT_2-1657572549588.png

     

     

  • Verified answer
    55552

    Thanks for testing and suggesting this. I will give it a try. We also have some tables in our files, and I see the unstructured option cannot yet model tables. However, I think I can work around that for now: most tables have at most 4 rows, so I can "flatten" the records and treat each cell as an individual field.
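As a rough illustration of the flattening workaround described above, here is a minimal Python sketch that regroups flattened fields back into table rows after extraction. The `<column>_row<N>` naming scheme and the field names in the usage note are hypothetical; they are not something AI Builder prescribes.

```python
import re

def regroup_fields(flat_fields):
    """Rebuild table rows from fields named '<column>_row<N>' (assumed naming scheme)."""
    rows = {}
    for name, value in flat_fields.items():
        match = re.fullmatch(r"(.+)_row(\d+)", name)
        if match is None:
            continue  # not a flattened table cell, e.g. a regular form field
        column, index = match.group(1), int(match.group(2))
        rows.setdefault(index, {})[column] = value
    # Return the rows ordered by their row number.
    return [rows[index] for index in sorted(rows)]
```

For example, the hypothetical fields `casing_size_row1`, `depth_row1`, `casing_size_row2`, and `depth_row2` would come back as two row dictionaries, each with a `casing_size` and a `depth` key, while an ordinary field such as `operator` is ignored.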

  • 55552

    One other thing I noticed while selecting and tagging text, in both the structured and unstructured model setup: even when I highlight only the desired text with the mouse, the value that appears is the whole word or phrase the model has already recognized as text. So I cannot remove extraneous text, tick marks, etc. at the end of a word by making the selection box smaller. Is this planned as a future enhancement?

  • 55552

    Update: I re-trained the model using the unstructured document option. The accuracy of FINDING the location of the fields improved greatly. However, there were still many issues with the actual character recognition. The biggest problems were:

    1. Typewritten data was on or near a dotted line, producing phantom dots that were interpreted as periods or various other characters.

    2. The ' and " tick marks used in the form for feet and inches were often misinterpreted as ! or 1 or other characters.

    3. Many extra spaces were inserted in various locations, although I could probably use Power Automate to remove the white space.

    4. Number values were often wrong, again due in part to the dotted lines or other marks.

     

    If there were the ability to manually train by telling the model during training that Oil equals Oil instead of 011, that would be very useful.
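Until something like that exists in training, a correction of this kind can be approximated after extraction. The sketch below is only an assumption about how one might do it, not a feature of the model: it reduces look-alike characters to a common "skeleton" and compares the OCR value against a small list of expected words. The character mapping and the vocabulary are illustrative.

```python
# Look-alike characters mapped to one canonical form (an assumed, incomplete list).
_SKELETON = str.maketrans({"o": "0", "i": "1", "l": "1", "!": "1", "s": "5"})

def skeleton(text):
    """Reduce text to a form in which common OCR look-alikes compare equal."""
    return text.lower().translate(_SKELETON)

def correct_word(ocr_value, vocabulary):
    """Return the single vocabulary word that looks like ocr_value, else ocr_value."""
    target = skeleton(ocr_value.strip())
    matches = [word for word in vocabulary if skeleton(word) == target]
    return matches[0] if len(matches) == 1 else ocr_value

def correct_number(ocr_value):
    """For fields known to be whole numbers, turn look-alike letters into digits."""
    candidate = skeleton(ocr_value).replace(" ", "")
    return candidate if candidate.isdigit() else ocr_value
```

With a vocabulary such as `["Oil", "Gas", "Dry"]`, `correct_word("011", ...)` returns `"Oil"`, and `correct_number("1O5O")` returns `"1050"`. Values that match nothing, or match more than one word, are returned unchanged so they can be reviewed by a person.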

     

    Thanks again for the helpful suggestions.

  • JoeF-MSFT

    Hi @Runner55552, thanks a lot for taking the time to share this update! Great to hear that field location improved greatly with the unstructured documents option.

     

    Today there is no option to adjust a text selection down to the character level, nor an option to give feedback during tagging to correct wrongly detected characters. As you suggested, one option is to post-process the output in a Power Automate cloud flow, using expressions to remove white space, dots, and tick marks.
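To make the post-processing idea concrete, here is a minimal sketch of the cleanup logic in Python. In a cloud flow the same steps would be built from expressions such as `replace` and `trim`; Python is used here only to show the logic. The rules are assumptions about what the noise looks like (runs of dots from the dotted line, `!` read in place of a foot mark) and would need tuning against real extractions.

```python
import re

def clean_field(value):
    """Remove extra white space and runs of phantom dots from an extracted value."""
    # Collapse any run of white space into a single space.
    cleaned = re.sub(r"\s+", " ", value).strip()
    # Replace two or more dots in a row (optionally space-separated) with a space.
    cleaned = re.sub(r"(?:\.\s*){2,}", " ", cleaned)
    # Trim stray dots and spaces left at either end, then collapse spaces again.
    cleaned = cleaned.strip(" .")
    return re.sub(r"\s+", " ", cleaned)

def fix_feet_tick(value):
    """Assume a '!' directly after a digit is a misread foot mark (')."""
    return re.sub(r"(?<=\d)!", "'", value)
```

A single dot inside a value, as in `12.5`, is kept, but a trailing period (for example in `Co.`) is stripped, so this is best applied per field rather than to every value. A foot mark misread as `1` cannot be told apart from a real digit by a rule like this and would still need manual review.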
