
Renato Fichmann
Cisco Employee

HTML tag strip (email parsing use case)

Have you guys come across the requirement of parsing an HTML message, let's say an email, and converting it into text?

The most common use case would be email parsing: remove all the HTML tags and use the result as a string.

It has to remove entire tags, including their attributes. For example, tables with TRs and TDs that carry properties must be removed, BR tags must be converted into carriage returns, and so on.

Cheers,

Renato Fichmann

4 REPLIES
derevan
Enthusiast

This is pretty easy to do using an XSL Transform activity. If you provide a small sample HTML and what you want the result to look like, I (or any number of people on this forum) can help you create an appropriate stylesheet.

Derek,

The problem is when the HTML in question is only HTML 4 and not well-formed XML. Then it gets hard and you have to do it with regexes.

Any thoughts on that?

Ryan
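
For readers weighing the regex route mentioned above, here is a minimal sketch in Python, assuming the email body can be handed to an external script before the process consumes it (that hand-off is an assumption, not something this thread confirms). The helper name `strip_tags` and the list of block-level tags are illustrative choices. Regexes like these are known to break on comments, on attribute values that contain `>`, and on other edge cases, so treat this as a rough first pass rather than a robust parser.

```python
import html
import re

# All names below are hypothetical helpers for illustration only.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_BR = re.compile(r"<br\b[^>]*>", re.IGNORECASE)
_BLOCK_END = re.compile(r"</(?:p|div|tr|li|table|h[1-6])\s*>", re.IGNORECASE)
_CELL_END = re.compile(r"</t[dh]\s*>", re.IGNORECASE)
_TAG = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Roughly convert HTML to plain text using regular expressions."""
    text = _SCRIPT_STYLE.sub("", markup)   # drop script/style blocks entirely
    text = _BR.sub("\n", text)             # <br>, <BR/>, <br clear="all"> become newlines
    text = _BLOCK_END.sub("\n", text)      # closing block tags end a line
    text = _CELL_END.sub(" ", text)        # keep adjacent table cells apart
    text = _TAG.sub("", text)              # remove every remaining tag with its attributes
    text = html.unescape(text)             # &amp; -> &, &nbsp; -> non-breaking space, etc.
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = (
    '<table border="1"><tr><td class="a">Name:</td><td>Bob &amp; Co</td></tr></table>'
    "Line one<BR>Line two"
)
print(strip_tags(sample))
```

Entities are decoded only after the tags are gone, so an escaped `&lt;b&gt;` in the message survives as literal text instead of being stripped as a tag.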

True, HTML 4 allows some open-ended tags like



You might look into a tool like TIDY to clean up the HTML into a well-formed XML document so it is useable by XSLT. There are other tools available as well that do similar cleanup.

Derek,

In my situation, I'm triggering a process from an email. I need to extract information out of this email, but it's not formatted as XHTML. I don't know if there's an easy way to preprocess the email before trying to dissect it.

I'm working on an alternative solution, but that relies on the people sending me the email granting direct access to their systems.

Hopefully there'll be some funky string functions like this in upcoming releases??

Ryan
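
One possible way to preprocess such an email without first converting it to XHTML is a lenient HTML parser. The sketch below uses Python's standard-library `html.parser`, which tolerates unclosed tags; it assumes the email body is available to a script outside the process (an assumption, not something confirmed in this thread), and the names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content; unclosed tags such as <br> or <p> are tolerated."""

    # Illustrative tag sets, not an exhaustive list.
    BLOCK_TAGS = {"p", "div", "table", "tr", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
    CELL_TAGS = {"td", "th"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored, so they never reach the output.
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "br" or tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")
        elif tag in self.CELL_TAGS:
            self._chunks.append(" ")  # keep adjacent table cells apart

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        raw = "".join(self._chunks)
        # Collapse whitespace inside each line and drop empty lines.
        lines = [" ".join(line.split()) for line in raw.splitlines()]
        return "\n".join(line for line in lines if line)


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(html_to_text("<p>Order: 1234<br>Status: <b>open</b><p>Thanks"))
```

Because the parser reports tags as events instead of requiring a well-formed tree, the unclosed `<p>` and `<br>` in the usage line are handled without a cleanup pass.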