Thursday, August 09, 2007

Regular expressions for stripping HTML tags and attributes

Most web developers have, at some point, needed to strip some or all HTML tags from content. You may already know this, but it may help some people who are just getting started.

In particular, you will usually want to remove the cr&p that Microsoft Word (et al.) puts in, which can increase the size of the HTML by some 200%.

NB: The examples below are in ColdFusion, but the regular expressions should work, with little or no change, in whatever language you care to use.

 

The following code strips out the attributes of certain HTML tags:

<cfset theString = reReplaceNoCase(
    theString,
    "<\s*(p|span|tr|th|div|li|ul|ol)\s+.*?>",
    "<\1>",
    "ALL"
)>
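
As a rough illustration of the same expression outside ColdFusion, here is a minimal Python sketch. The function name `strip_attributes` and the sample string are made up for this example; the pattern is the one used above, with `re.IGNORECASE` standing in for the case-insensitive `reReplaceNoCase`.

```python
import re

# Pattern copied from the ColdFusion example above.
# re.IGNORECASE plays the role of reReplaceNoCase.
ATTR_PATTERN = re.compile(
    r"<\s*(p|span|tr|th|div|li|ul|ol)\s+.*?>",
    re.IGNORECASE,
)


def strip_attributes(html: str) -> str:
    """Replace e.g. <p class="x"> with <p> for the listed tags only.

    strip_attributes is a hypothetical helper name, not part of any library.
    """
    return ATTR_PATTERN.sub(r"<\1>", html)


# Sample input invented for this sketch.
sample = '<P class="MsoNormal" style="margin:0">Hello <a href="#">link</a></P>'
print(strip_attributes(sample))
```

Tags that are not in the list, such as the `<a>` in the sample, keep their attributes, and closing tags are left alone because the pattern only matches an opening tag that has whitespace after its name.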


The following code strips out a list of specific tags altogether (the opening and closing tags are removed; the text between them is kept):



<!--- in a cffunction, this would be var local --->
<cfset local = structNew()>
<cfset local.stripTags = "span,blockquote">
<cfloop list="#local.stripTags#" index="local.tag">
    <cfset local.theString = reReplaceNoCase(
        local.theString,
        "</?#local.tag#[^>]*>",
        "", "ALL"
    )>
</cfloop>
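
The loop above could be sketched in Python along these lines. The helper name `strip_tags` and its default tag list are assumptions for illustration; `re.escape` is an extra precaution that the ColdFusion version does not have.

```python
import re


def strip_tags(html: str, tags=("span", "blockquote")) -> str:
    """Remove the opening and closing forms of each listed tag.

    strip_tags is a hypothetical helper; the text between the tags is kept.
    """
    for tag in tags:
        # Same pattern as the ColdFusion loop: optional slash, tag name,
        # then anything up to the closing angle bracket.
        # re.escape is an addition, in case a name contains regex metacharacters.
        pattern = r"</?" + re.escape(tag) + r"[^>]*>"
        html = re.sub(pattern, "", html, flags=re.IGNORECASE)
    return html


# Sample input invented for this sketch.
print(strip_tags('<blockquote><span class="x">Hi</span> there</blockquote>'))
```

One thing to watch with this pattern: nothing stops the tag name from being a prefix of a longer name, so passing `b` would also remove `<br>` and `<blockquote>`. It behaves as intended for names like `span` and `blockquote` that are not prefixes of other tags.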


The following regular expression does a reasonable job of removing all HTML tags:



<[^>]*>
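
A minimal Python sketch of applying this expression, assuming a made-up helper name `strip_all_tags`:

```python
import re

# The expression from the post: "<", anything that is not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_all_tags(html: str) -> str:
    """Remove everything that looks like a tag (hypothetical helper name)."""
    return TAG_PATTERN.sub("", html)


# Sample input invented for this sketch.
print(strip_all_tags("<p>Hello <b>world</b></p>"))
```

"Reasonable" is the right word: a `>` inside an attribute value ends the match early, and only the tags are removed, so the contents of `<script>` or `<style>` elements stay in the output as text.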


The following code strips out the style and class (specific classes, anyway) attributes from an HTML string. Note the use of single quotes around the patterns, so that the double quotes inside them do not need escaping:


<cfset local = structNew()>
<cfset local.theString = reReplaceNoCase(
    local.theString,
    'class\s*=?\s*"?(MsoNormal|MsoTableGrid)"?',
    "", "ALL"
)>
<cfset local.theString = reReplaceNoCase(
    local.theString,
    'style\s*=?\s*"?[^"]*?"',
    "", "ALL"
)>
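
For comparison, the same two replacements might look like this in Python. The function name `strip_word_attributes` and the sample strings are invented for this sketch; the two patterns are the ones from the ColdFusion code.

```python
import re

# Patterns copied from the ColdFusion example above.
CLASS_PATTERN = re.compile(
    r'class\s*=?\s*"?(MsoNormal|MsoTableGrid)"?', re.IGNORECASE
)
STYLE_PATTERN = re.compile(r'style\s*=?\s*"?[^"]*?"', re.IGNORECASE)


def strip_word_attributes(html: str) -> str:
    """Drop the listed Word classes, then any style attribute.

    strip_word_attributes is a hypothetical helper name.
    """
    html = CLASS_PATTERN.sub("", html)
    return STYLE_PATTERN.sub("", html)


# Sample input invented for this sketch.
print(strip_word_attributes('<p class="MsoNormal" style="margin:0cm">Text</p>'))
# prints: <p  >Text</p>
```

The attributes are removed but the whitespace around them is not, which is why the result is `<p  >` rather than `<p>`. Because the `=` is optional in both patterns, they can also match ordinary text: `My style is "bold"` comes out as `My bold"`. If that matters for your content, making the `=` mandatory is one possible tightening.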
