10-16-2012 05:39 PM
Have you guys come across the requirement of parsing an HTML message, let's say an email, and converting it into text?
The most common use case would be email parsing: remove all the HTML tags and use the result as a string.
It has to remove entire tags, including their attributes. For example, tables with TRs and TDs that carry properties must be removed, BR converted into a carriage return, etc.
Cheers,
Renato Fichmann
10-17-2012 04:49 AM
This is pretty easy to do using an XSL Transform activity. If you provide a small sample HTML and what you want the result to look like, I (or any number of people on this forum) can help you create an appropriate stylesheet.
11-26-2012 07:17 PM
Derek,
The problem is when the HTML in question is only HTML4 and not well-formed XML. Then it gets hard and you have to do it with regexes.
Any thoughts on that?
Ryan
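For illustration, the regex route mentioned above might look roughly like the following. This is only a minimal sketch in Python, run outside the platform; the function name html_to_text_regex and the choice of which tags become line breaks are assumptions, not anything the product provides. It will also misbehave on input such as a ">" inside an attribute value, which is part of why regexes get hard here.

```python
import html
import re


def html_to_text_regex(markup: str) -> str:
    """Rough HTML-to-text conversion using regexes only (hypothetical helper)."""
    # Drop script/style blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", markup)
    # <br>, <BR>, <br/> and <br /> become line breaks.
    text = re.sub(r"(?i)<br\s*/?>", "\n", text)
    # Opening or closing p, div and tr tags also start a new line.
    text = re.sub(r"(?i)</?(p|div|tr)\b[^>]*>", "\n", text)
    # Keep neighbouring table cells apart with a space.
    text = re.sub(r"(?i)</(td|th)\s*>", " ", text)
    # Remove every remaining tag, attributes included.
    text = re.sub(r"(?s)<[^>]+>", "", text)
    # Decode entities such as &amp; and &nbsp;.
    text = html.unescape(text)
    # Collapse runs of spaces/tabs, trim each line, drop empty lines.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

Note that the tags do not need to be closed or well formed for this to work, since each pattern looks at one tag at a time.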
11-27-2012 04:36 AM
True, HTML 4 allows some open-ended tags like
You might look into a tool like TIDY to clean up the HTML into a well-formed XML document so it is useable by XSLT. There are other tools available as well that do similar cleanup.
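If the goal is only plain text rather than XSLT input, a different option is to skip the cleanup step and use a parser that tolerates unclosed tags. The sketch below is not Tidy and not an XSL Transform; it swaps in Python's standard html.parser module as an example of such a tolerant parser, and the names TextExtractor and html_to_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from tag soup; hypothetical helper, not a platform feature."""

    BLOCK_TAGS = {"p", "div", "tr", "li", "table"}  # assumed to imply line breaks
    SKIP_TAGS = {"script", "style"}                 # contents are discarded

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are simply ignored.
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br" or tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")
        elif tag in ("td", "th"):
            self.parts.append(" ")  # keep neighbouring cells apart

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    text = "".join(parser.parts)
    # Normalise whitespace per line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

Because the parser never checks that tags are balanced, an unclosed tag in the email body does not stop the extraction.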
11-27-2012 02:38 PM
Derek,
In my situation, I'm triggering a process from an email. I need to extract information out of this email but it's not formatted as XHTML. I don't know if there's an easy way to preprocess the email first before trying to dissect it.
I'm working on an alternative solution but that relies on the people sending me the email granting direct access to their systems.
Hopefully there'll be some funky string functions like this in upcoming releases?
Ryan