HTML tag strip (email parsing use case)
10-16-2012 05:39 PM
Have you come across a requirement to parse an HTML message, say an email, and convert it to plain text?
The most common use case is email parsing: remove all the HTML tags and use the result as a string.
It has to remove entire tags, attributes included. For example, tables with TRs and TDs that carry properties must be removed, BR must be converted into a carriage return, and so on.
Cheers,
Renato Fichmann
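For anyone who can run a script outside the product, the requirement above can be sketched with Python's standard-library `html.parser`, which tolerates unclosed tags. This is only an illustration, not a Process Orchestrator feature: the names `TagStripper` and `html_to_text` are hypothetical, and the choice of which tags become line breaks is an assumption.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content, dropping every tag together with its attributes."""

    # Assumption: these tags should start a new line in the plain-text output.
    LINE_BREAK_TAGS = {"br", "p", "div", "tr", "li"}
    # Assumption: table cells are separated by a single space.
    CELL_TAGS = {"td", "th"}
    # The contents of these elements are never shown to the reader, so skip them.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # attrs is ignored on purpose: attributes never reach the output.
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.LINE_BREAK_TAGS:
            self.parts.append("\n")
        elif tag in self.CELL_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one non-empty line per row."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Calling `html_to_text('<tr><td class="x">Status: down<br>Since 05:39</td></tr>')` gives the two lines `Status: down` and `Since 05:39`, with the tags and the `class` attribute gone.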
Labels: Cisco Process Orchestrator
10-17-2012 04:49 AM
This is pretty easy to do using an XSL Transform activity. If you provide a small sample HTML and what you want the result to look like, I (or any number of people on this forum) can help you create an appropriate stylesheet.

11-26-2012 07:17 PM
Derek,
The problem is when the HTML in question is only HTML 4 and not well-formed XML. Then it gets hard and you have to do it with regexes.
Any thoughts on that?
Ryan
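To make the point above concrete, here is a rough Python sketch of both halves: an XML parser rejecting an HTML 4 fragment with an unclosed BR, and a regex fallback. The helper names `is_well_formed_xml` and `regex_strip` are made up for illustration, and the regexes are deliberately naive: they do not decode entities and will misfire on a `>` inside an attribute value or a comment.

```python
import re
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True only if the markup parses as XML (what an XSLT step needs)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# <br>, <BR>, <br/> and <br /> all become a newline.
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
# Anything else between < and > is treated as a tag and dropped.
TAG_RE = re.compile(r"<[^>]+>")


def regex_strip(markup):
    """Naive tag removal: convert BR to newlines, then delete remaining tags."""
    text = BR_RE.sub("\n", markup)
    text = TAG_RE.sub("", text)
    return text.strip()
```

With these, `is_well_formed_xml("<p>one<br>two</p>")` is false while the same fragment with `<br/>` passes, and `regex_strip` handles either form.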
11-27-2012 04:36 AM
True, HTML 4 allows some tags to be left unclosed.
You might look into a tool like HTML Tidy to clean up the HTML into a well-formed XML document so that XSLT can process it. There are other tools available that do similar cleanup.

11-27-2012 02:38 PM
Derek,
In my situation, I'm triggering a process from an email. I need to extract information out of this email, but it's not formatted as XHTML. I don't know if there's an easy way to preprocess the email before trying to dissect it.
I'm working on an alternative solution, but that relies on the people sending me the email granting direct access to their systems.
Hopefully there'll be some funky string functions like this in upcoming releases?
Ryan
