This may not be the correct board. If so, please let me know and I'll make adjustments as necessary.
I work in an office that receives monthly invoices for our company's billing. The invoices arrive as PDFs formatted as continuous tables. The data is laid out roughly like this:
-member ID#, NAME - new line
-Claim number, member ID# again - new line
-Date of claim twice, claim type code, claim type code again, claim type code again again, ID number, number of units for this claim, dollar amount requested, dollar amount authorized, dollar amount deducted, dollar amount paid - new line...
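To make the layout concrete, here is a rough Python sketch of regular expressions that might recognise the three line types. Everything about the formats is an assumption on my part for illustration: numeric IDs, a comma between member ID and name, MM/DD/YYYY dates, and amounts with two decimals. The names `classify`, `MEMBER_RE`, `CLAIM_RE` and `DETAIL_RE` are hypothetical, and the sample values in the usage note are made up, not real data.

```python
import re

# Assumed formats only; adjust each pattern to what the OCR text really looks like.
MEMBER_RE = re.compile(r"^(?P<member_id>\d+),\s*(?P<name>[A-Za-z][A-Za-z .,'-]*)$")
CLAIM_RE = re.compile(r"^(?P<claim_no>\d+)\s+(?P<member_id>\d+)$")
DETAIL_RE = re.compile(
    r"^(?P<date_from>\d{2}/\d{2}/\d{4})\s+(?P<date_to>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<code1>\S+)\s+(?P<code2>\S+)\s+(?P<code3>\S+)\s+"
    r"(?P<id_number>\S+)\s+(?P<units>\d+)\s+"
    r"(?P<requested>[\d,]+\.\d{2})\s+(?P<authorized>[\d,]+\.\d{2})\s+"
    r"(?P<deducted>[\d,]+\.\d{2})\s+(?P<paid>[\d,]+\.\d{2})$"
)


def classify(line):
    """Return (kind, fields) for a recognised line, or None for noise."""
    text = line.strip()
    # Try the most specific pattern first so a detail line is never
    # mistaken for one of the shorter line types.
    for kind, pattern in (("detail", DETAIL_RE), ("claim", CLAIM_RE), ("member", MEMBER_RE)):
        match = pattern.match(text)
        if match:
            return kind, match.groupdict()
    return None
```

With these assumed patterns, `classify("1234567, DOE JOHN")` would be tagged as a member line, `classify("900012345 1234567")` as a claim line, and anything that matches none of the three (page headers, totals, and so on) comes back as `None` and can be dropped.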
Currently I'm reviewing these documents by hand with my little human eyeballs and fingers. It's not terribly slow, but I have literally hundreds of these and each can contain 50 or more claims.
Specifically, my goal is this: (somehow) scan the PDF document with RegEx (or something to similar effect) and extract the three lines of text described above while filtering out the "noise" (everything else). The desired data range always starts on page 2 and ends three pages before the last page of the PDF.
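The page range could be handled separately from the pattern matching. Below is a minimal Python sketch that assumes the text of each page is already available as a list of strings, one per page. How that text is obtained is left open (OCR output, or a PDF library such as pypdf if the files have a text layer). The function names and the `skip_first` / `skip_last` parameters are hypothetical, and I'm reading "3 pages from the last page" as "drop the last three pages".

```python
def select_data_pages(page_texts, skip_first=1, skip_last=3):
    """Keep pages 2 through N-3 from a list holding one text string per page."""
    end = len(page_texts) - skip_last
    # Documents too short to contain any data pages yield an empty list.
    return page_texts[skip_first:end] if end > skip_first else []


def iter_data_lines(page_texts):
    """Yield the non-blank, stripped lines from the data pages only."""
    for text in select_data_pages(page_texts):
        for line in text.splitlines():
            if line.strip():
                yield line.strip()
```

For a 10-page document this keeps pages 2 through 7; each yielded line could then be tested against whatever patterns describe the three line types.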
I'm using FileCenter to manage my documents. FileCenter Pro provides a neat tool for OCR text extraction that does something similar to what I want. I managed to set up a little "demo" module to prove the concept, and it worked in microcosm, but I couldn't get it to export the data to a file or into another program.
Ultimately, the extracted data needs to go into an Excel document so I can do some other data-related operations on it.
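One low-tech route into Excel would be a CSV file, which Excel opens directly. Here is a small Python sketch using only the standard `csv` module. It assumes each claim has already been collected into a dictionary; the column names in `FIELDS` and the function name `write_claims_csv` are hypothetical placeholders, not anything taken from the invoices.

```python
import csv

# Hypothetical column names; rename or reorder to suit the spreadsheet.
FIELDS = ["member_id", "name", "claim_no", "date_from", "units",
          "requested", "authorized", "deducted", "paid"]


def write_claims_csv(rows, out_file):
    """Write one CSV row per claim dictionary to an open text file."""
    # Keys not listed in FIELDS are ignored; missing keys become empty cells.
    writer = csv.DictWriter(out_file, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
```

Usage would look something like `with open("claims.csv", "w", newline="", encoding="utf-8-sig") as f: write_claims_csv(rows, f)`. The `utf-8-sig` encoding adds a byte-order mark, which helps Excel read accented names correctly.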
The target PDFs contain sensitive data so I can't share them.
Any ideas how to go from selecting a PDF to dropping specific text from the PDF into an Excel doc (or a list object in PAD) while pruning all the unnecessary stuff?