I attempted to train a model on some typewritten oil well permits from the 1930s. They are scanned PDF forms from a government agency, with typed answers generally in similar locations, so locating the fields works fine. However, many of the answers were typed almost directly on the dotted lines of the forms instead of slightly above them. That seems to cause poor character recognition, and the overall document processing model training score was only 46%.
Any suggestions for improving this, such as removing all horizontal dotted lines from the PDF document? Or are there simply some applications where a person is needed to extract the information?
I tried using "Enhance" in Adobe Pro, but that did not help.
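For reference, here is a rough sketch of the kind of dotted-line removal I have in mind. It is untested on my real permits and rests on several assumptions: each page has already been rasterized and thresholded into a 2-D boolean NumPy array (True = ink), the dotted lines run perfectly horizontally across much of the page, and a row counts as part of a line when its ink coverage exceeds a threshold. The function name `remove_dotted_rows` and the `min_coverage` parameter are made up for illustration.

```python
import numpy as np

def remove_dotted_rows(binary, min_coverage=0.3):
    """Erase horizontal dotted/ruled rows from a boolean page image.

    binary: 2-D bool array, True = ink (assumed already thresholded).
    min_coverage: hypothetical cutoff; rows with at least this fraction
    of ink pixels are treated as part of a form line.
    """
    cleaned = binary.copy()
    height, width = binary.shape

    # Fraction of ink in each row; form lines give much higher values
    # than rows that only contain typed characters (an assumption).
    coverage = binary.mean(axis=1)
    line_rows = np.flatnonzero(coverage >= min_coverage)
    if line_rows.size == 0:
        return cleaned

    # Group consecutive line rows into bands (a line may be several pixels thick).
    breaks = np.flatnonzero(np.diff(line_rows) > 1) + 1
    for band in np.split(line_rows, breaks):
        top, bottom = band[0], band[-1]
        above = binary[top - 1] if top > 0 else np.zeros(width, dtype=bool)
        below = binary[bottom + 1] if bottom + 1 < height else np.zeros(width, dtype=bool)
        # Keep ink inside the band only in columns where ink continues just
        # above or below it, so character strokes touching the line survive.
        keep = above | below
        cleaned[top:bottom + 1] &= keep
    return cleaned

# Tiny synthetic page: a dotted line on row 10 and a vertical stroke in column 5.
page = np.zeros((20, 40), dtype=bool)
page[10, ::2] = True      # dotted line: every other pixel
page[5:13, 5] = True      # character stroke crossing the line
cleaned = remove_dotted_rows(page)
```

Skewed scans would need deskewing first, and the threshold would have to be tuned per form, so I do not know whether this is good enough in practice.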
Any suggestions are appreciated.
