Guide: Auto tagging documents in SharePoint using Microsoft Cognitive Services/Text Analytics
Hey Flow Fans,
As a long-time SharePoint and information management architect (20 years!)… I've worked on many projects in many different verticals and held many debates with records managers, information managers, end users, legal, etc. about metadata and how much metadata should be associated with a document within SharePoint.
Metadata is a fundamental building block for structured document management, enabling management reporting, document classification, retention schedules, disposition execution, and so forth… so let's create lots of metadata! However... few users are happy to complete lots of metadata fields or can see the value in doing so, and worse still, lots of users will click any old option just to get a document 'correctly' loaded. This typically has a hugely detrimental impact on the management of an information repository and often introduces significant regulatory risk.
Considering the desire to have both rich metadata and a usable solution, I have been asked many, many, many times... can we automate metadata selection based on the content of the file?
Yes! Better still this can be achieved quickly and simply with Microsoft Flow and a few extra components... in this post I'll outline how you can utilise Microsoft Cognitive Services and the Text Analytics API to perform key phrase extraction which we can use later to tag a document within SharePoint.
Flow Creation - Text Guide
Please note: You will require an Azure subscription and a Cognitive Services account to utilise the Flow Text Analytics connector. You can create a free Azure account using the sign-up link listed under References.
1. Create a new Flow from a blank template
2. Add the 'When a file is created or modified (Properties Only)' SharePoint trigger and configure it to point to the library / folder where the Flow should be triggered from.
3. Add an 'Initialise variable' action
3a. Name: Set to 'KeyPhrases'
3b. Type: Select 'String'
NOTE: This Flow is triggered when a new document is added or an existing document is updated, and the Flow then updates that same document. Left unchecked, this causes an infinite loop (a recursive event). To protect against this, we recommend using a service account identity for the SharePoint connection, which ensures that any updates the Flow makes to the document are executed under that identity. We will then add a condition to the Flow to check for and ignore any Flow runs triggered by an update made by the service account identity.
4. Add a 'Condition' action
4a. Click 'Choose a value', insert the 'Modified By Email' parameter from the 'When a file is created or modified (properties only)' trigger
4b. Set the operator to "Is not equal to"
4c. Set the value to the email address of the SharePoint connection's identity
4d. If you are unsure of the identity or wish to create a new connection, go to 'Settings > Connections'
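The logic of the step 4 condition can be sketched in Python as follows. This is only an illustration of what the condition checks, not something the Flow itself runs. The function name and the service account address are hypothetical placeholders.

```python
# Hypothetical address of the identity used by the SharePoint connection.
SERVICE_ACCOUNT_EMAIL = "flow.service@contoso.com"


def should_process(modified_by_email: str) -> bool:
    """Mirror the step 4 condition: 'Modified By Email' is not equal to the service account."""
    return modified_by_email != SERVICE_ACCOUNT_EMAIL


# An edit by a normal user continues down the 'Yes' branch.
print(should_process("jane.doe@contoso.com"))  # True
# The Flow's own update (made as the service account) is ignored, which breaks the loop.
print(should_process(SERVICE_ACCOUNT_EMAIL))   # False
```

If the connection runs under a normal user's identity instead, that user's own uploads would also be skipped, which is why a dedicated service account is suggested.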
5. Add a 'Get File Content' SharePoint action inside the 'Yes' branch of the condition
5a. Site Address: Set as per the 'Site Address' value of step #2.
5b. File Identifier: Insert the 'Identifier' parameter from the 'When a file is created or modified (properties only)' action result
6. Add an Encodian 'Convert to PDF' action
6a. File Content: Insert the 'File Content' parameter from the 'Get file content' action result
6b. PDF Filename: Insert the 'File name with extension' parameter from the 'When a file is created or modified (properties only)' action result
Note: The Encodian 'Convert to PDF' action will automatically check the 'PDF Filename' value and change the file extension provided to '.pdf' if required.
6c. Filename: Insert the 'File name with extension' parameter from the 'When a file is created or modified (properties only)' action result
7. Add an Encodian 'Get PDF Text Layer' action
7a. Filename: Insert the 'Filename' parameter from the 'Convert to PDF' action result
7b. File Content: Insert the 'File Content' parameter from the 'Convert to PDF' action result
8. Checkpoint: Your new Flow should look similar to the following:
9. Add a Text Analytics 'Key Phrases' action
NOTE: If you have not already created a connection, you will be prompted to create a new Text Analytics connection utilising a Cognitive Services account hosted within an Azure subscription. You can create a free Azure account using the sign-up link listed under References.
If you need to create a new connection please follow these additional steps:
9a. Connection Name: Enter a name for your connection
9b. Account Key: Enter the key obtained from your Cognitive Services account
9c. Site Url: Enter the endpoint obtained from your Cognitive Services account
9d. Click 'Create'
Once your connection is created, or if your connection was previously created, follow these steps:
9e. Text: Insert the 'Text Layer' parameter from the 'Get PDF Text Layer' action result
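For readers curious about what sits behind the 'Key Phrases' action, here is a rough Python sketch of a key phrase request and response. It assumes the v3.0 REST shape of the Text Analytics API (a 'documents' array in, a 'keyPhrases' array per document out). The connector may well call a different API version, so treat the path and field names as assumptions. The function names, the endpoint and the sample response are made up for illustration, and nothing is sent over the network.

```python
import json


def build_key_phrases_request(endpoint: str, account_key: str, text: str) -> dict:
    """Assemble (but do not send) a key phrase request in the assumed v3.0 shape."""
    return {
        "url": endpoint.rstrip("/") + "/text/analytics/v3.0/keyPhrases",
        "headers": {
            "Ocp-Apim-Subscription-Key": account_key,  # the 'Account Key' from step 9b
            "Content-Type": "application/json",
        },
        "body": json.dumps(
            {"documents": [{"id": "1", "language": "en", "text": text}]}
        ),
    }


def extract_key_phrases(response: dict) -> list[str]:
    """Collect every 'keyPhrases' entry from a response in the assumed shape."""
    phrases = []
    for document in response.get("documents", []):
        phrases.extend(document.get("keyPhrases", []))
    return phrases


# Placeholder endpoint and key; real values come from your Cognitive Services account.
request = build_key_phrases_request(
    "https://example.cognitiveservices.azure.com/", "<account-key>", "Sample text layer"
)
print(request["url"])

# Made-up response, only to show where the phrases live.
sample_response = {"documents": [{"id": "1", "keyPhrases": ["retention schedule", "metadata"]}]}
print(extract_key_phrases(sample_response))  # ['retention schedule', 'metadata']
```

In the Flow, the connector handles all of this for you; the 'keyPhrases' array is what step 10 loops over.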
10. Add an 'Append to string variable' action
10a. Name: Set to 'KeyPhrases'
10b. Value: Insert the 'keyPhrases - Item' parameter from the 'Key Phrases' action result
10c. This will dynamically insert an 'Apply to each' loop action
10d. To correctly format the results, remove the default value and add the following expression to the 'Value' parameter.
concat(items('Apply_to_each'), ', ')
10e. Click 'OK'
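To make the effect of step 10 concrete, the 'Apply to each' loop and the concat expression behave roughly like the Python sketch below. The function name and the sample phrases are made up for illustration.

```python
def build_key_phrases(phrases: list[str]) -> str:
    """Append each phrase followed by ', ', as the 'Apply to each' loop does."""
    key_phrases = ""  # the 'KeyPhrases' string variable initialised in step 3
    for phrase in phrases:
        # Equivalent of: concat(items('Apply_to_each'), ', ')
        key_phrases += phrase + ", "
    return key_phrases


sample = ["records management", "retention schedule", "SharePoint"]
print(repr(build_key_phrases(sample)))
# 'records management, retention schedule, SharePoint, '
```

Note that this approach leaves a trailing ', ' after the last phrase. That is usually harmless in a text column, but it is worth knowing if you later split or compare the stored value.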
11. Add an 'Update File Properties' SharePoint action
11a. Site Address: Set as per the 'Site Address' value of step #2.
11b. Library Name: Set as per the 'Library Name' value of step #2.
11c. Id: Insert the 'ID' parameter from the 'When a file is created or modified (properties only)' action result
The next step is to utilise the data returned from the 'Text Analytics' action and write to a metadata field associated with the source item. We have added a 'Key Phrases' column to the library to store the data.
11d. Key Phrases: Insert the 'KeyPhrases' variable
11e. Check and update the SharePoint connection and ensure the service account identity is used, see step 4.
12. Completed: Your flow should appear as follows
13. Test the flow
14. Validate the results
Please note: The 'Text Analytics' action is limited to processing 5120 characters per request. It is likely that you will exceed this limit by sending an entire document; however, the Encodian 'Get PDF Text Layer' action allows you to target specific pages, which can help you keep within this limit.
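If you ever pre-process text outside of Flow, one way to respect the 5120 character limit is to split the text into pieces before sending each one. The Python sketch below shows that idea. It is an alternative to targeting specific pages rather than something the Flow above does, and the function name is hypothetical. It counts characters with Python's len(), which may not match exactly how the service counts them.

```python
MAX_CHARS = 5120  # per-request limit stated in the note above


def split_for_text_analytics(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters, preferring breaks at spaces."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space that keeps the piece within the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space available, so fall back to a hard cut
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


long_text = "word " * 3000  # roughly 15,000 characters of filler
pieces = split_for_text_analytics(long_text)
print(len(pieces), max(len(piece) for piece in pieces))
```

Each piece could then be sent as its own request and the resulting key phrases combined, at the cost of extra calls against your Cognitive Services account.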
References:
- Microsoft Cognitive Services - https://azure.microsoft.com/en-gb/services/cognitive-services/
- Microsoft Cognitive Services Documentation - https://docs.microsoft.com/en-gb/azure/cognitive-services/
- Azure Account Sign-Up - https://azure.microsoft.com/en-us/free/
Comments
Hi Jay- Thanks for the great post, it worked perfectly well.