Extract text from image based SharePoint files
The purpose of this post is to show how to create a Power Automate (Flow) solution that extracts text from image based SharePoint content and updates a field with the extracted text.
Prerequisites
Before you begin, please make sure the following prerequisites are in place:
- An Office 365 subscription with access to Power Automate (Flow).
- Muhimbi PDF Converter Services Online full subscription with OCR capability or trial subscription (Start trial). Do not use the Free subscription as it doesn’t support OCR; the free Trial subscription works fine.
- Appropriate privileges to create Flows.
- Working knowledge of Power Automate (Flow).
Extract image based text from the uploaded file using Muhimbi’s ‘Extract text using OCR’ action.
Let’s first see how the basic structure of our Power Automate (Flow) looks:
Step 1 – Trigger
- The trigger to be used is ‘When a file is created in a folder’ in SharePoint. In this example it is important that we do not trigger when the file is updated, as that would result in an infinite loop when the last step in this flow updates the item.
- Whenever a file gets uploaded to the selected folder, the Power Automate (Flow) will get triggered automatically.
- For the ‘Site Address’ in the image below, choose the correct site address from the drop down menu.
- For the ‘Folder Id’ in the image below, select the source folder.
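To see why the choice of trigger matters, here is a minimal sketch in Python that simulates the event chain. It is not Power Automate code: the function name `run_flow`, the event names and the safety cap `max_runs` are all made up for illustration. The only behaviour taken from this post is that the last step of the flow updates the item.

```python
def run_flow(trigger_on_update: bool, max_runs: int = 5) -> int:
    """Count how often the flow runs after one file upload (toy simulation)."""
    pending_events = ["created"]  # a single file is uploaded
    runs = 0
    while pending_events and runs < max_runs:
        event = pending_events.pop(0)
        fires = event == "created" or (event == "updated" and trigger_on_update)
        if fires:
            runs += 1
            # The final 'Update file properties' step modifies the item,
            # which raises a new 'updated' event.
            pending_events.append("updated")
    return runs


print(run_flow(trigger_on_update=False))  # 1
print(run_flow(trigger_on_update=True))   # 5 (only stopped by the safety cap)
```

With a 'created only' trigger the flow runs once per upload. If it also fired on updates, it would keep re-triggering itself until something stopped it.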
Step 2 – Get file content
- For the ‘Site Address’ in the image below, specify the same address as used in the Trigger in step 1.
- In the ‘File identifier’ field, click ‘Add Dynamic content’ and choose the ‘File identifier’ option listed under the ‘When a file is created in a folder’ trigger.
Step 3 – Extract text using OCR
- Source file name: Select “File name” returned by the trigger.
- Source file content: Select “File Content” returned by the ‘Get file content’ action.
- Language: This is the language the source document is written in. It defaults to English, but there is also support for Arabic, Danish, German, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
- Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
- White list / Blacklist: Control which characters are recognized. For example, limit recognition to numbers by white listing 1234567890. This prevents, for example, a 0 (zero) from being recognized as the letter o or O.
- Use Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
- X coordinate, Y Coordinate, Width and Height parameters: Specify the exact area to extract text from. The unit of measure (UOM) is 1/72nd of an inch.
Note – Please refer to the ‘Specifying Coordinates’ section near the bottom of this post for details about calculating coordinates.
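If you fill in these fields from another system, it can help to sanity check the values first. The sketch below is in Python and is purely illustrative: `OcrSettings` and its field names are hypothetical and are not part of any Muhimbi API. The language list is the one given in this post; the other checks (no character in both lists, non-negative coordinates, positive width and height) are my own assumptions about sensible input, not documented rules.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Languages listed in this post for the 'Language' field.
SUPPORTED_LANGUAGES = {
    "Arabic", "Danish", "German", "English", "Dutch", "Finnish", "French",
    "Hebrew", "Hungarian", "Italian", "Norwegian", "Portuguese", "Spanish",
    "Swedish",
}


@dataclass
class OcrSettings:
    """Hypothetical container mirroring the action's fields."""
    language: str = "English"
    whitelist: str = ""
    blacklist: str = ""
    # (x, y, width, height) in units of 1/72nd of an inch; None = whole page
    region: Optional[Tuple[float, float, float, float]] = None

    def problems(self) -> List[str]:
        """Return a list of human readable issues (empty when all is fine)."""
        issues: List[str] = []
        if self.language not in SUPPORTED_LANGUAGES:
            issues.append(f"unsupported language: {self.language}")
        clash = sorted(set(self.whitelist) & set(self.blacklist))
        if clash:
            issues.append("characters in both lists: " + "".join(clash))
        if self.region is not None:
            x, y, width, height = self.region
            if x < 0 or y < 0:
                issues.append("X and Y must not be negative")
            if width <= 0 or height <= 0:
                issues.append("Width and Height must be positive")
        return issues


digits_only = OcrSettings(whitelist="1234567890", region=(150, 368, 92, 80))
print(digits_only.problems())                      # []
print(OcrSettings(language="Klingon").problems())  # ['unsupported language: Klingon']
```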
Step 4 – Update file properties
Once the text has been extracted, we want to update a column in the source file. To facilitate this, I have created a column named “Extract Output” in my library.
The steps are as follows.
- Specify the same site address and library used in the trigger.
- In the Id field, specify the “ItemId” returned by the Trigger.
- Specify the “Out text”, as returned by the “Extract text using OCR” action, in the ‘Extract Output’ field.
Specifying Coordinates
Configuring the coordinates for the region to extract text from can seem a bit daunting at first. This is a limitation of the Power Automate user interface, which does not allow rich facilities such as coordinate pickers.
Let’s take an example which will make it clear how to specify the coordinates to extract the text from.
This example is really quite simple, but you may find yourself transported back to your school days as we go over some basic coordinate geometry concepts.
Let’s take the document shown above, from which I need to extract the text highlighted in blue: the ‘Adjust your computer’s settings’ region. Naturally, in the real world we’d use something more useful, like the location of the invoice number in a generic template.
- Whenever a document is sent to the ‘Extract text using OCR’ action of the Muhimbi PDF Converter, the Converter renders the page’s body in the traditional X-Coordinate and Y-Coordinate system.
- The top left corner of the page is the Point of Origin (0,0), the reference point from which the X-Axis runs horizontally and the Y-Axis runs vertically.
- The unit of measure (UOM) is 1/72nd of an inch.
- The method is simple. First you need to understand that as you keep moving vertically downwards, the value on the Y-Axis increases. This simply means that when you select a region located towards the bottom of the page, you will need to enter a higher value in the Y Coordinate field.
- Similarly, when you select a region located towards the right-hand side of the page, you will need to enter a higher value in the X Coordinate field.
- If you take a look at the ‘Extract text using OCR’ action, I have entered the value ‘368’ in the Y Coordinate, since the text that I have selected to extract is located significantly lower than the Point of Origin.
- I have mentioned the value in points, inches and cm on purpose, so you get a feel for the distance in the unit of measure you are most comfortable with.
- Since the text to extract is located only slightly to the right of the Point of Origin, I have entered the value ‘150’ in the X Coordinate. This is considerably less than the Y Coordinate, since the text is much closer to the Point of Origin along that axis.
- Now this gives you a point of intersection and half your work is already done!
- Next we need to determine how much area we need to cover in order to capture the entire text for extraction. For this you need to make use of Width and Height parameters.
- To capture multiple lines, increase the Height. Similarly, to capture the text of an entire line, increase the Width.
- In the example below, I wanted to capture a line with only 4 words and that is why my Width is 92 and my Height is 80.
- As you can see, by understanding how the page gets rendered by the Muhimbi Converter for the ‘Extract text using OCR’ action, it becomes easier to visualize how your parameters need to be populated in order to extract the text from the exact region you intend.
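If you prefer to measure the region with a ruler or in a PDF viewer, the arithmetic above can be sketched in a few lines of Python. The helper names below are made up for this illustration; the only facts used are that the unit of measure is 1/72nd of an inch, that the origin is the top left corner with Y increasing downwards, and the example values 150, 368, 92 and 80 from this post.

```python
POINTS_PER_INCH = 72.0  # the action's unit of measure is 1/72nd of an inch
CM_PER_INCH = 2.54


def points_to_inches(points: float) -> float:
    return points / POINTS_PER_INCH


def points_to_cm(points: float) -> float:
    return points / POINTS_PER_INCH * CM_PER_INCH


def inches_to_points(inches: float) -> float:
    return inches * POINTS_PER_INCH


def cm_to_points(cm: float) -> float:
    return cm / CM_PER_INCH * POINTS_PER_INCH


def region_corners(x: float, y: float, width: float, height: float):
    """Return the (top-left, bottom-right) corners of the region.

    The origin is the top left of the page, so adding Height moves DOWN.
    """
    return (x, y), (x + width, y + height)


# The example values used in this post.
top_left, bottom_right = region_corners(150, 368, 92, 80)
print(top_left, bottom_right)            # (150, 368) (242, 448)
print(round(points_to_inches(368), 2))   # 5.11
print(round(points_to_cm(150), 2))       # 5.29

# Going the other way: a region that starts 5 cm from the left edge.
print(round(cm_to_points(5), 1))         # 141.7
```

In other words, the example region starts about 5.3 cm from the left edge and about 5.1 inches from the top of the page, and it ends at the point (242, 448).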
Final Output
Let’s go ahead and run the Power Automate (Flow) solution, upload the test file, and check if we get the intended results. Naturally you will need to use your own files and coordinates.
That’s it! The exact text we expected to be returned has indeed been returned and placed in the ‘Extract Output’ column.
Keep checking this blog for exciting new articles about Power Automate, SharePoint Online, Power Apps and document conversion and manipulation.
Comments
If anyone wants to extract data from a PDF or image without training a model for select documents, try this new GPT data extraction method: https://powerusers.microsoft.com/t5/Power-Automate-Cookbook/Extract-Data-From-PDFs-and-Images-With-GPT/td-p/2201345
It doesn’t require specifying certain document areas, wordings, styles, etc. It just OCRs the file, converts it to a replica text (txt), and passes it to a GPT prompt where you can ask GPT to do whatever you want with the document data.