Thank you very much SwatiSTW.
1. Regarding your suggestion that I’ll need a large, well-labeled training dataset (50–100 samples per document type) covering all layout variations:
May I know where I should select this option in AI Builder?
2. Scanned PDFs should be preprocessed using OCR, while digital PDFs can have text extracted directly.
I’m currently using Document Processing > Extract custom information from documents (Custom Model), which lets us select keywords by placing the cursor on them (for both digital and scanned PDFs).
3. Because filenames are inconsistent, the model must rely on document content, not naming.
Yes, I fully agree. That’s why I have the idea below. What is your suggestion?
My approach is to train a custom AI Builder model to extract keywords (such as ‘Invoice’, ‘Customer’, ‘Supplier’) from the document content, then use Power Automate conditions such as “If extracted text contains ‘Invoice’ AND ‘Customer’ AND ‘Supplier’ → classify as Invoice” to determine the document type, and finally rename the file based on a mapping table.
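To illustrate the classification step: in Power Automate this would be a Condition action over the model’s extracted text, but the same logic can be sketched in Python. The keyword list and the “Unknown” fallback here are my assumptions, not part of the original flow.

```python
# Hypothetical sketch of the keyword-based classification condition.
# All keywords must appear in the extracted text to match a type.

def classify(extracted_text: str) -> str:
    """Return a document type based on which keywords appear."""
    text = extracted_text.lower()
    # Mirrors: If text contains 'Invoice' AND 'Customer' AND 'Supplier'
    if all(kw in text for kw in ("invoice", "customer", "supplier")):
        return "Invoice"
    # Anything unmatched goes to an explicit "Unknown" bucket
    # rather than being misclassified silently.
    return "Unknown"

print(classify("Tax Invoice - Customer: ACME, Supplier: Foo Ltd"))
```

One design note: requiring all keywords (AND) reduces false positives but can miss documents where OCR drops a word, so an OR rule or a keyword count threshold may be worth testing.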
4. You’ll need a mapping table to convert document types to standard short codes (e.g., Invoice → Inv).
Yes, a mapping table is definitely needed.
5. Build logic to handle low-confidence predictions (e.g., manual review if confidence < 80%).
Since the number of folders is large, I will also need to check for false positives as well as false negatives going forward.
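The low-confidence routing from point 5 could be sketched like this; the 0.80 threshold mirrors the “confidence < 80%” suggestion, and the field name `confidence` is an assumption about the prediction payload, not the actual AI Builder schema.

```python
# Hypothetical routing of predictions by confidence score.
REVIEW_THRESHOLD = 0.80  # mirrors the suggested 80% cutoff

def route(prediction: dict) -> str:
    """Send low-confidence predictions to manual review."""
    if prediction["confidence"] < REVIEW_THRESHOLD:
        return "manual_review"   # e.g. move file to a review folder
    return "auto_process"        # rename and file automatically

print(route({"type": "Invoice", "confidence": 0.72}))  # manual_review
print(route({"type": "Invoice", "confidence": 0.95}))  # auto_process
```

Logging the routed-for-review cases would also give a natural sample set for the false-positive/false-negative checks mentioned above.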
6. Be mindful of AI Builder licensing and credit costs, especially at scale.
A hybrid model (AI + rule-based logic) can improve reliability in complex cases.
Yes, I agree, especially when deploying a custom model. I do have some document types with fixed formats (though not many). I think you’re suggesting using a text-extraction engine such as Adobe’s for simple text extraction, right? But the first step, identifying the document type, is still critical. Any suggestions are welcome.