This post is not of general interest, but perhaps it will help some unfortunate person, like me, who finds himself struggling to tame regular expressions.
I was faced with the following text, from which I wanted to clean out the Microsoft artifacts, leaving only the pure HTML and English:
<HTML xmlns:o= "urn:schemas-microsoft-com:office:office"><BODY><P class=normal style="MARGIN: auto 0cm">Text not to mention other stuff that I need to clean up</HTML>
To do this I meant to use Perl, namely the substitution operator ($x =~ s///;).
The problem is that the quantifiers in Perl's regular expressions are greedy by default. A substitution like s/<HTML.*>/<HTML>/ (which says, "0 up to infinity of anything until you reach >") turns the entire block into "<HTML>" because the regular expression matches everything from the first < all the way up to the last >.
"Why don't you just convert the &gt;s to >s and the &lt;s to <s and use this, s/<HTML[^>]+>/<HTML>/;?" you ask.
([^>]+> means "one or more of anything that is not a >, followed by a >" and is one common way to avoid greedy behavior.)
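Here is a minimal sketch of the difference between the greedy pattern and the negated character class. It assumes the markup contains literal angle brackets, and the sample string is a shortened stand-in for the text above, not the real file.

```perl
use strict;
use warnings;

# Shortened stand-in for the sample text (an assumption for illustration).
my $text = '<HTML xmlns:o= "urn:schemas-microsoft-com:office:office"><BODY>Text</BODY></HTML>';

# Greedy: .* runs to the end of the string, then backs up to the last >.
(my $greedy = $text) =~ s/<HTML.*>/<HTML>/;

# Negated class: [^>]+ cannot cross a >, so the match ends at the first one.
(my $negated = $text) =~ s/<HTML[^>]+>/<HTML>/;

print "$greedy\n";    # prints: <HTML>
print "$negated\n";   # prints: <HTML><BODY>Text</BODY></HTML>
```

Note that the + requires at least one character between <HTML and >, so a tag that is already a bare <HTML> is left alone.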
There are two reasons: first, < and > are used elsewhere for legitimate purposes, so wherever I convert these characters I will have to convert them back; and second, I want a simple solution that I don't have to go cross-eyed over.
So, I tried s/<HTML[^(>)]+>/<HTML>/; which doesn't work for reasons unknown to me.
I found the answer, at last, somewhere on the Internet.
This does work for reasons unknown to me: s/<HTML.*?>/<HTML>/;.
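In English it says, "as few as possible of anything (0 up to infinity) until you reach the first >". The ? here does not mean "0 or 1"; placed after another quantifier such as *, it makes that quantifier non-greedy, so it stops at the earliest point where the rest of the pattern can match.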
The key to curbing greedy behavior is .*? followed by the closing string.
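A minimal sketch of the .*? approach follows. The first string is a shortened stand-in for the sample text. The second is a hypothetical variant in which the angle brackets arrive as the entities &lt; and &gt;, included only to show that the closing string may be longer than one character, which a negated character class cannot express.

```perl
use strict;
use warnings;

# Shortened stand-in for the sample text (an assumption for illustration).
my $text = '<HTML xmlns:o= "urn:schemas-microsoft-com:office:office"><BODY>Text</BODY></HTML>';

# Non-greedy: .*? takes as little as possible, so it stops at the first >.
(my $lazy = $text) =~ s/<HTML.*?>/<HTML>/;
print "$lazy\n";           # prints: <HTML><BODY>Text</BODY></HTML>

# Hypothetical variant: the same markup with entity-encoded brackets.
my $escaped = '&lt;HTML xmlns:o= "urn:schemas-microsoft-com:office:office"&gt;&lt;BODY&gt;Text&lt;/BODY&gt;&lt;/HTML&gt;';

# The closing string is now the four characters &gt; and .*? still stops at the first one.
(my $lazy_escaped = $escaped) =~ s/&lt;HTML.*?&gt;/&lt;HTML&gt;/;
print "$lazy_escaped\n";   # prints: &lt;HTML&gt;&lt;BODY&gt;Text&lt;/BODY&gt;&lt;/HTML&gt;
```

One caveat: by default . does not match a newline, so if a tag can span several lines the /s modifier would be needed as well.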
If you are wondering why you bothered to read this, I can only say that you had fair warning that it was not of general interest.
Labels: Amature, Perl, Technology