How do I convert HTML to an XML object?

Posted by binki-dcx

I have HTML emails from which I want to extract information. My emails are well-formed HTML. Normally, in other environments, to handle this, I would use an XML library’s HTML parse mode to get an XML DOM to look at it.

It looks like the xpath() and xml() functions exist. Those are pretty powerful and would let me access the data much as I would through an XML DOM.

However, I cannot figure out how to parse HTML as XML. When I pass my HTML to xml(), I get the error “The provided value cannot be converted to XML: 'The 'meta' start tag on line 2 position 161 does not match the end tag of 'head'. Line 214, position 11.'. Please see https://aka.ms/logicexpressions#xml for usage details.'” This error makes sense and is how this should work. However, I cannot figure out how to get an XML object from HTML like I can in other libraries.

Is there any equivalent to an HTML DOM library with XPath support in Power Automate? Or something that can process HTML into XML, similar to xmllint --html - or DOMDocument::loadHTML()?
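
For comparison, here is a minimal sketch of the kind of HTML-to-XML step the question asks for, written in Python with only the standard library. It is not something Power Automate provides, so it would have to run outside the flow. The names HtmlToXml and html_to_xml, the synthetic "document" wrapper element, and the sample markup are all assumptions made for illustration. It also does far less error recovery than libxml2's HTML parser.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Builds an ElementTree from HTML, treating void elements as closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper root
        self._stack = [self.root]           # currently open elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") arrive as None; XML needs a string.
        attributes = {k: (v if v is not None else "") for k, v in attrs}
        element = ET.SubElement(self._stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self._stack.append(element)

    def handle_endtag(self, tag):
        # Ignore end tags that match no open element (this covers void tags).
        if tag not in [e.tag for e in self._stack[1:]]:
            return
        # Implicitly close anything left open inside the element being closed.
        while self._stack[-1].tag != tag:
            self._stack.pop()
        self._stack.pop()

    def handle_data(self, data):
        parent = self._stack[-1]
        if len(parent):  # text after a child element belongs to that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html: str) -> ET.Element:
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.root


html = """<html><head><meta charset="utf-8"><title>Order</title></head>
<body><p>Hello<br>world</p><a class="track" href="https://example.com/t/1">Track</a></body></html>"""

root = html_to_xml(html)
# ElementTree supports a limited XPath subset, enough to select by attribute.
links = [a.get("href") for a in root.findall(".//a[@class='track']")]
xml_text = ET.tostring(root, encoding="unicode")
print(links)  # → ['https://example.com/t/1']
```

The serialized xml_text is well-formed, so a string like it is what xml() and xpath() would need as input.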

Categories:

Building flows

tom_riha (Most Valuable Professional)

Hello @binki-dcx ,

Power Automate doesn't have anything to pre-process HTML. The only way to handle it would be as a string, with some combination of split(...), replace(...), concat(...), etc., until you get clean HTML that can be converted into XML.

But even if you remove the header and keep only the body, it'll still have problems with tags that are never closed, e.g. <img ...>, <br>.
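
The point about unclosed tags can be checked with any strict XML parser. The sketch below uses Python's standard library as a stand-in for what xml() does. The helper name is_well_formed and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


unclosed = '<head><meta charset="utf-8"></head>'        # valid HTML, not XML
self_closed = '<head><meta charset="utf-8" /></head>'   # valid as XML
line_break = '<p>first<br>second</p>'                   # <br> is never closed

print(is_well_formed(unclosed))     # → False
print(is_well_formed(self_closed))  # → True
print(is_well_formed(line_break))   # → False
```

The first case is the same kind of mismatch the xml() error in the question reports for the 'meta' start tag.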

In the end you might be better off with a solution like the one shown here: https://www.youtube.com/watch?v=7tZ6bRtco3Y. Get rid of all the HTML tags and parse the plain text.

binki-dcx

The problem is that I want to do this “properly” and I need to use data from the DOM, such as attributes, to correctly identify the information I want to load and to extract data. The text content/text rendering of the HTML loses all semantics. An HTML parser outputting XML does exactly what I need. Just this component seems to be missing from the entire Microsoft ecosystem (even .NET’s XmlDocument supports serializing in HTML format but not deserializing, and that is probably why Power Automate has no html() function).
A more proper, but less performant, solution would be for me to write a custom connector that simply passes the data to `xmllint --html -`.
Right now, I am using an improper string-based solution because the HTML I get is clean/self-consistent enough that I can know that splitting on the double-quote character, finding all non-spacey strings starting with https://, and filtering down the URIs I have identified using a substring works. But that is only because I am lucky with the contents of the HTML documents I am working with.
This shouldn’t be the case. Microsoft should add an HTML parser to .NET akin to libxml2’s HTML parser.
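
The string-based workaround described above can be sketched roughly as follows, in Python rather than flow expressions. The function name extract_links, the must_contain parameter, and the sample markup are assumptions for illustration and are not taken from the actual flow.

```python
def extract_links(html: str, must_contain: str) -> list[str]:
    """Split on double quotes and keep whitespace-free https:// pieces."""
    pieces = html.split('"')
    return [
        piece for piece in pieces
        if piece.startswith("https://")
        and not any(ch.isspace() for ch in piece)  # "non-spacey" strings only
        and must_contain in piece                  # narrow down by substring
    ]


sample = (
    '<a href="https://example.com/track/42">Track</a> '
    '<img src="https://cdn.example.com/logo.png"> '
    '<p title="https://example.com/track/ not a link">x</p>'
)

print(extract_links(sample, "/track/"))
# → ['https://example.com/track/42']

# The approach silently misses attributes written with single quotes.
print(extract_links("<a href='https://example.com/track/7'>x</a>", "/track/"))
# → []
```

The second call shows why this only works when the incoming HTML happens to be consistent: nothing in the string handling knows what an attribute is.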

tom_riha (Most Valuable Professional)

You can submit it as an idea to the ideas forum, but I'm afraid that's all that can be done at this moment: Power Automate · Community

binki-dcx

I have submitted it as an idea here.
