I recently needed to write a regular expression to find HTML and XHTML links with title attributes (<a href="http://www.gustavus.edu" title="Visit the Gustavus Adolphus College homepage">Gustavus</a>). Here it is:
<a[[:print:]]*title=('|"")?(.*?(?=\1))\1?[[:print:]]*>([[:print:]]*)</a>
Note that the [[:print:]] parts are POSIX character classes, which ColdFusion regular expressions support; they would have to be changed to something else if your regular expression engine doesn't support them. Additionally, the ('|"") part uses the doubled double quote ("") escape syntax because the regular expression is passed as a double-quoted string.
I was at a major roadblock when writing this regular expression until I found out about the non-greedy wildcard quantifier and combined it with a positive lookahead on a backreference. The critical section is the part that says (.*?(?=\1)). This means: grab as few characters as possible, of any kind, up to the next occurrence of whatever the first backreference matched (in this case either ' or "). Brilliant.
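For anyone not using ColdFusion, here is a rough Python sketch of the same lazy-quantifier-plus-lookahead trick. It is an adaptation under a few assumptions, not the pattern above verbatim: Python's re module has no [[:print:]] class, so [^>]* and [^<]* stand in for it, and the opening quote is required rather than optional, because Python fails a backreference to a group that did not take part in the match. The names TITLED_LINK and find_titled_links are made up for this example.

```python
import re

# Group 1: the opening quote (' or ").
# Group 2: the title text, matched lazily up to the next copy of that quote.
# Group 3: the link text.
# [^>]* and [^<]* replace ColdFusion's [[:print:]]* (an assumption for Python).
TITLED_LINK = re.compile(r"""<a[^>]*title=('|")(.*?(?=\1))\1[^>]*>([^<]*)</a>""")


def find_titled_links(html):
    """Return a (title, link text) tuple for every link with a quoted title."""
    return [(m.group(2), m.group(3)) for m in TITLED_LINK.finditer(html)]


if __name__ == "__main__":
    sample = (
        '<a href="http://www.gustavus.edu" '
        'title="Visit the Gustavus Adolphus College homepage">Gustavus</a>'
    )
    for title, text in find_titled_links(sample):
        print(title, "|", text)
```

Because the lookahead checks for whichever quote opened the attribute, a single-quoted title can contain double quotes (and vice versa) without ending the match early.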