HTML tag regular expression

Posted on June 16th, 2005

We found this regular expression for finding HTML tags in a string online at haacked.com. I used it in the new Gustavus eCard program that I have been working on (stay tuned) as a way to get the text of captions that contain hyperlinks. The expression was written by Phil Haack.

</?\w+(((\s|\n)+\w+((\s|\n)*=(\s|\n)*(?:".*?"|'.*?'|[^'">\s]+))?)+(\s|\n)*|(\s|\n)*)/?>

This expression is smart enough to account for things like newline characters inside a tag and angle brackets that happen to appear in quoted attribute values.
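As a rough illustration of that claim, here is a minimal Python sketch that runs the pattern against a made-up caption. The sample string and the variable names are assumptions for the demo and are not taken from the eCard program.

```python
import re

# The pattern from this post, written with its backslashes (\w, \s, \n).
ORIGINAL = re.compile(
    r"""</?\w+(((\s|\n)+\w+((\s|\n)*=(\s|\n)*(?:".*?"|'.*?'|[^'">\s]+))?)+(\s|\n)*|(\s|\n)*)/?>"""
)

# Hypothetical sample: a newline between two attributes and a ">"
# inside a quoted attribute value.
sample = 'See <a href="page.html"\n   title="a > b">this page</a>.'

# Collect the full text of every tag the pattern finds.
tags = [m.group(0) for m in ORIGINAL.finditer(sample)]
print(tags)
```

The opening tag is matched as a single unit, including the line break and the quoted ">", and the closing tag is matched separately.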

Update: As Haacked pointed out, (\s|\n) is redundant because \s already matches newline characters, so the updated regular expression should be as follows:

</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>
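One way to check the simplification is to run both versions over a few inputs and compare what they match. The following is only a sketch in Python: the sample strings and the helper names (matches, strip_tags) are made up for illustration, and a handful of samples is a spot check, not a proof.

```python
import re

# Both patterns from this post, with backslashes written out.
ORIGINAL = re.compile(
    r"""</?\w+(((\s|\n)+\w+((\s|\n)*=(\s|\n)*(?:".*?"|'.*?'|[^'">\s]+))?)+(\s|\n)*|(\s|\n)*)/?>"""
)
SIMPLIFIED = re.compile(
    r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""
)

# Hypothetical test inputs, not taken from the eCard program.
samples = [
    "<br/>",
    "<img src='a.png' alt=logo />",
    '<a href="x.html"\n   title="a > b">link</a>',
    "plain text with 1 < 2 and no tags",
]

def matches(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]

# The two versions should find exactly the same tags in each sample.
for text in samples:
    assert matches(ORIGINAL, text) == matches(SIMPLIFIED, text)

def strip_tags(text):
    """Remove every tag the simplified pattern matches."""
    return SIMPLIFIED.sub("", text)

print(strip_tags('Photo by <a href="x.html">Jane</a>'))
```

The strip_tags helper shows the kind of use described earlier in the post: removing the tags from a caption that contains a hyperlink so only its text remains.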

2 Comments

  1. Ryan Rud says:

    As far as I can tell, this simplified version works just as well at detecting newlines.

  2. Haacked says:

    I think that expression can be simplified just a bit. Anywhere you see (\s|\n) should be reducible to just \s. I need to test it just to be sure. If you test it, let me know if it works for you.