Wednesday, July 7, 2010

RegEx match open tags except XHTML self-contained tags

246
494

I need to match all of these opening tags:


    <p>
    <a href="foo">

But not these:


    <br />
    <hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.


    <([a-z]+) *[^/]*?>

I believe it says:

  • Find a less-than, then
  • Find (and capture) a-z one or more times, then
  • Find zero or more spaces, then
  • Find any character zero or more times, greedy, except /, then
  • Find a greater-than

Do I have that right? And more importantly, what do you think? =)
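One quick way to check the reading above is to run the pattern against the four samples. This is only a sketch in Python (the question does not say which regex engine is in use, so that is an assumption), and the helper name `captured_name` is made up for illustration. The last sample is the tag from Kobi's comment below.

```python
import re

# The pattern from the question, unchanged.
OPEN_TAG = re.compile(r"<([a-z]+) *[^/]*?>")

def captured_name(fragment):
    """Return the captured tag name, or None if the pattern finds no match."""
    match = OPEN_TAG.search(fragment)
    return match.group(1) if match else None

samples = [
    '<p>',
    '<a href="foo">',
    '<br />',
    '<hr class="foo" />',
    # From Kobi's comment: slashes inside attribute values.
    "<div style=\"background:url('/path/image.jpg');\" title='Yes/No'>",
]

for fragment in samples:
    print(f"{fragment!r:70} -> {captured_name(fragment)!r}")
```

The four samples from the question behave as intended, but the `<div>` is rejected too: `[^/]` refuses every slash, including the ones inside attribute values.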

EDIT:

Hmm, which answer to mark as correct? For the record, ALL the answers are appreciated. Many thanks!

142
I think that using a regular expression to process XML when there are plenty of good XML parsers out there is a bad idea. - David Dorward Nov 13 '09 at 22:42
48
<div style="background:url('/path/image.jpg');" title='Yes/No'> - Kobi Nov 13 '09 at 22:44
18
By this point, I really think it should be axiomatic that, if you're trying to parse arbitrary XML/HTML, regular expressions are not the tool to use. - jamesnvc Nov 14 '09 at 4:44
4
If your input set is small/homogeneous enough that you don't need a parser, then you can just try it out and check manually whether the regexp hack matches what you want it to match. If you can't then you should rethink your need for a parser... - David Schmitt Nov 14 '09 at 10:37
5
@Claudiu, because Jeff Atwood (codinghorror.com/blog/archives/001311.html) wrote a post about parsing Html with Regex. - Arnis L. Nov 17 '09 at 8:22

30 Answers

3579

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsingwill exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T A LL I​S LOST the pon̷y he comes he c̶̮omes he co mes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼ O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ


Have you tried using an XML parser instead?

166
Is everything ok there? Is this a cry for help? :) - Kobi Nov 13 '09 at 23:07
145
Kobi: I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death. - bobince Nov 13 '09 at 23:18
172
++ for "The <center> cannot hold" - Horace Loeb Nov 13 '09 at 23:27
570
Chuck Norris can parse HTML with regex. - user93422 Nov 14 '09 at 0:03
314
A true work of art; I weep at the poetic beauty. - Marc Gravell Nov 14 '09 at 0:29

265

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's Web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.

180
+1 for incorporating Paris Hilton in your answer. - Andrew Song Nov 14 '09 at 18:49
20
So Paris Hilton did write an OS after all? - Amarghosh Nov 15 '09 at 15:16
98
Great, we're now debating the possibility of chuck norris parsing HTML with regular expressions .. and paris hilton writing an operating system. Jon Skeet, however, can do both AND paris hilton. - Tim Post Nov 16 '09 at 5:12
35
Has anyone ever actually seen Linus Torvalds and Paris Hilton in the same room at the same time? Hmmmm.... - Graeme Perrow Nov 16 '09 at 18:06
56
Can Paris Hilton even spell OS? - David M Nov 17 '09 at 11:30
136

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:


                            <(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:


                            <(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

or just combine if and if not.
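The "combine if and if not" route might look like the sketch below. It is written in Python as an assumption (the answer does not name its engine, and Python's `re` does not accept the variable-length look-behind shown above), so the self-closing check is done as a second, separate test. The names `TAG`, `SELF_CLOSED`, and `opening_tags` are hypothetical, and skipping closing tags is an extra step added here for the question's use case.

```python
import re

# The tag-matching pattern from this answer, unchanged.
TAG = re.compile(r"""<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>""")

# Second check ("if not"): the tag ends with "/>" (optional whitespace between).
SELF_CLOSED = re.compile(r"/\s*>$")

def opening_tags(html):
    """Return tags that are neither closing tags nor self-closed."""
    return [
        tag
        for tag in TAG.findall(html)
        if not tag.startswith("</") and not SELF_CLOSED.search(tag)
    ]

html = (
    '<p><a href="foo" title="5>3">x</a><br /><hr class="foo" />'
    '<a name="badgenerator""></p>'
)
print(opening_tags(html))
```

Because quoted attribute values are consumed as a unit, the `>` inside `title="5>3"` does not end the tag early.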

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...

1
Nothing to complain about, just three down votes. I'm at -5, probably for not adding a warning not to use my code :) - Kobi Nov 16 '09 at 14:10
2
Up for karma. And this makes 15 characters. - Jeff Nov 16 '09 at 14:31
1
I got a couple of anonymous down votes that were really about minor differences in opinion. I didn't like it. I mean, you put a disclaimer right at the front, right? One up for karma. - Stephen Harmon Mar 30 at 14:53
1
111 up for karma - zildjohn01 May 28 at 10:11
1
+1 This is definitely a helpful answer given all the caveats. - Christian Hayter Jun 14 at 8:57
117

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.
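A small illustration of the point, not a proof: with nested elements, neither a lazy nor a greedy pattern pairs each opening tag with its own closing tag, while a plain counter (the simplest stand-in for a parser's stack) tracks nesting without trouble. Python and all names here are assumptions chosen for the sketch.

```python
import re

nested = "<div><div>inner</div>outer</div>"
siblings = "<div>a</div><div>b</div>"

LAZY = re.compile(r"<div>(.*?)</div>")
GREEDY = re.compile(r"<div>(.*)</div>")

def first_body(pattern, text):
    """Return what the pattern believes is the body of the first <div>."""
    return pattern.search(text).group(1)

def max_depth(text):
    """Track nesting with a counter, which a plain regular expression cannot do."""
    depth = deepest = 0
    for token in re.findall(r"</?div>", text):
        depth += -1 if token.startswith("</") else 1
        deepest = max(deepest, depth)
    return deepest

print(first_body(LAZY, nested))      # stops at the inner </div>
print(first_body(GREEDY, siblings))  # runs on into the second <div>
print(max_depth(nested), max_depth(siblings))
```

The lazy version is wrong on the nested input and the greedy version is wrong on the sibling input, so no single choice of quantifier fixes both.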

15
this is a very good answer - Paul Nathan Jan 4 at 18:25
Short and informative, I like it :) - Sune Rievers yesterday
+1 for science level - mico yesterday
2
This is not actually the case. RegEx in most programming languages is actually context-free, due to the fact that it has look-backs, etc. - michaelfairley yesterday
1
@michaelfairley Look ahead/behind/around features provide a richer syntax for expressing certain classes of regular expression. I do not believe these features provide fundamentally any more expressive power than a Chomsky type 3 grammar is capable of. One might argue that HTML is a visibly pushdown language (VPL) so may be parsed using techniques less powerful than required for a full blown context free grammar, however, I am unaware of any RegEx engine that supports VPLs either. - NealB 23 hours ago
41

Perhaps http://www.crummy.com/software/BeautifulSoup/

2
Yes, especially given this comment "I'm parsing a block of XHTML, truncating it, then closing any tags that are left open after it's been truncated. The DOM XML stuff doesn't work because it's not properly formed XML." Use BeautifulSoup to truncate and prettify. - Mark Nov 15 '09 at 18:51
10

Someone wrote a full html parser for PHP: http://htmlpurifier.org/

Why is this -1?? This is also a good answer!!! - AntonioCS Dec 28 '09 at 23:01
This is a great parser +1 - alex Feb 12 at 3:26
9

I suggest using QueryPath (http://querypath.org/) for parsing XML and HTML. It's basically much the same syntax as jQuery, only it's on the server side.

9

I find this small PHP library incredibly useful for parsing HTML tags: http://simplehtmldom.sourceforge.net/.

Yep, this is the usual thing for HTML, when it's not well-formed XHTML anyway. - bobince Nov 13 '09 at 23:34
6

I know Java isn't cool anymore, but if you want to use a really good library in Java, you might check into Tag soup which is built on top of Xerces. http://home.ccil.org/~cowan/XML/tagsoup/

20
Java was never cool ;) - notandy Dec 8 '09 at 19:54
1
Java is cool as a platform for better languages, but i'm off topic. - rplevy Mar 24 at 23:53
6

Try:


                            <([^\s]+)(\s[^>]*?)?(?<!/)>

It is similar to yours, but the last > must not be after a slash, and also accepts h1.

12
<a href="foo" title="5>3"> Oops </a> - Gareth Nov 13 '09 at 23:11
1
That is very true, and I did think about it, but I assumed the > symbol is properly escaped to &gt;. - Kobi Nov 13 '09 at 23:16
17
> is valid in an attribute value. Indeed, in the 'canonical XML' serialisation you must not use &gt;. (Which isn't entirely relevant, except to emphasise that > in an attribute value is not at all an unusual thing.) - bobince Nov 14 '09 at 0:15
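The exchange above is easy to reproduce. A sketch in Python (an assumption; the fixed-width look-behind `(?<!/)` happens to be supported there), with `PATTERN` and `first_match` as made-up names:

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r"<([^\s]+)(\s[^>]*?)?(?<!/)>")

def first_match(text):
    """Return the first substring the pattern matches, or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None

print(first_match('<h1>'))
print(first_match('<br/>'))
print(first_match('<br />'))
# Gareth's counterexample: the match ends at the ">" inside the attribute.
print(first_match('<a href="foo" title="5>3"> Oops </a>'))
```

The self-closing tags are rejected and `<h1>` is accepted as promised, but the last match is cut off in the middle of the `title` value.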
5

    <?php
    $selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

    $html = '
    <p><a href="#">foo</a></p>
    <hr/>
    <br/>
    <div>name</div>';

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $els = $dom->getElementsByTagName('*');
    foreach ($els as $el) {
        $nodeName = strtolower($el->nodeName);
        if (!in_array($nodeName, $selfClosing)) {
            var_dump($nodeName);
        }
    }

Output:


    string(4) "html"
    string(4) "body"
    string(1) "p"
    string(1) "a"
    string(3) "div"

Basically, just define the element node names that are self-closing, load the whole HTML string into a DOM library, grab all the elements, loop through them, filter out the ones that are self-closing, and operate on the rest.
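For readers not on PHP, the same idea can be sketched with Python's standard-library `html.parser`. This is an assumed equivalent, not a port of the code above: the class and function names are hypothetical, and unlike `DOMDocument::loadHTML` this parser does not add implied `html` and `body` elements, so those two names will not appear.

```python
from html.parser import HTMLParser

# Same list of self-closing element names as in the PHP snippet.
SELF_CLOSING = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param", "embed",
}

class OpenTagCollector(HTMLParser):
    """Collect the names of opening tags that are not self-closing."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Covers tags like <hr> written without the trailing slash.
        if tag not in SELF_CLOSING:
            self.names.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style tags such as <br/>; ignore them.
        pass

def open_tag_names(html):
    collector = OpenTagCollector()
    collector.feed(html)
    collector.close()
    return collector.names

html = '<p><a href="#">foo</a></p>\n<hr/>\n<br/>\n<div>name</div>'
print(open_tag_names(html))
```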

I'm sure you already know by now that you shouldn't use regex for this purpose.

If you're dealing with real XHTML then append getElementsByTagName with NS and specify the namespace. - meder Nov 15 '09 at 14:39
seems odd that every answer above mine isn't a real solution, just a recommendation to use some sort of parser. OP - did you try my answer? :p - meder Jan 9 at 5:11
3

XPath Luke, is your father.

3

I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack?

Excerpt:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML.

2

You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind.

However, a naive implementation of that will end up matching <bar/></foo> in this example document


    <foo><bar/></foo>

Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically?

Yep, I sure am. Determining all the tags that are currently open, then compare that against the closed tags in a separate array. RegEx hurts my brain. - Jeff Nov 13 '09 at 23:04
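For that bookkeeping (which tags are still open at the point where the markup stops), a stack tends to be simpler than comparing two arrays. A rough sketch using Python's standard-library `html.parser`, assuming Python is an option; the class and function names are hypothetical and the list of void elements is borrowed from another answer on this page.

```python
from html.parser import HTMLParser

VOID = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param", "embed",
}

class OpenTagTracker(HTMLParser):
    """Keep a stack of elements that have been opened but not yet closed."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # <br/> and friends never stay open

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching opening tag.
            while self.stack.pop() != tag:
                pass

def closing_tags_for(fragment):
    """Return the closing tags needed to balance a truncated fragment."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    return "".join(f"</{tag}>" for tag in reversed(tracker.stack))

print(closing_tags_for("<div><p>one</p><p><a href='foo'>two"))
```

A fragment cut off in the middle of a tag would still need extra handling; this sketch only covers truncation between tags.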
2

You should check PHP DOM Functions. Very handy once you study this tutorial: http://php.net/manual/en/book.dom.php

This is actually a very good answer! - AntonioCS Dec 28 '09 at 22:11
Thnx man. PHP DOM saved me many times :) - Fotis Jan 9 at 14:59
2

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:


    <([a-z]+) *[^/]*?>

If you add something to the regex, backtracking can force it to match silly things like <a >>; [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be


                            <([a-z]+)[^>]*(?<!/)>

Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ >, so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
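A side-by-side check of the two patterns, sketched in Python (an assumption; the answer talks about Perl regexes, and the fixed-width look-behind behaves the same way here). `fullmatch` stands in for "adding something to the regex" that forces the whole input to be consumed.

```python
import re

ORIGINAL = re.compile(r"<([a-z]+) *[^/]*?>")
SUGGESTED = re.compile(r"<([a-z]+)[^>]*(?<!/)>")

def tag_name(pattern, text):
    """Return the captured tag name, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

for sample in ['<p>', '<a href="foo">', '<br />', '<br/>', '<hr class="foo" />']:
    print(f"{sample!r:22} -> {tag_name(SUGGESTED, sample)!r}")

# Forced to consume the whole input, the original accepts "<a >>"
# because [^/] is happy to match ">"; the suggestion does not.
print(ORIGINAL.fullmatch('<a >>') is not None)
print(SUGGESTED.fullmatch('<a >>') is not None)

# The loophole mentioned above: a slash that is not directly before ">".
print(tag_name(SUGGESTED, '<a/ >'))
```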

2

Although regular expressions are not a suitable or effective tool for this purpose, they sometimes provide quick solutions to simple matching problems, and in my view it's not that horrible to use regular expressions for trivial work.

There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.

2

You can parse html in sed though.

  1. Turing.sed
  2. Write html parser (homework)
  3. ???
  4. Profit!
1

The W3C explains parsing in a pseudo regexp form:
http://www.w3.org/TR/REC-xml-names/#ns-using

Follow the var links for QName, S, and Attribute to get a clearer picture.
Based on that you can create a pretty good regexp to handle things like stripping tags.

1

It seems to me you're trying to match tags without a "/" at the end. Try this:


                            <([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>
1

I used an open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into a tree of nodes, and you can easily use its API to get attributes out of a node. Check it out and see if this can help you.

1

If you need this for PHP:

The PHP dom functions won't work properly unless it is properly formatted XML. No matter how much better their use is for the rest of mankind.

simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy [will crash on large pages].

I have never used querypath, so can't comment on its usefulness.

Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.

1

Whenever I need to quickly extract something from an HTML document, I use tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this: //p/a[@href='foo']
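Once the markup is well-formed, the XPath step is short. A sketch with Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath; the input string below is a made-up stand-in for tidy's output, and real tidy output usually carries an XHTML namespace that would have to be included in the element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed input (what a tidy pass is assumed to produce).
tidied = (
    "<html><body>"
    "<p><a href='foo'>match</a> <a href='bar'>skip</a></p>"
    "<hr class='foo'/>"
    "</body></html>"
)

root = ET.fromstring(tidied)

# ElementTree paths are relative to the element they are called on, hence the leading ".".
links = root.findall(".//p/a[@href='foo']")
texts = [link.text for link in links]
print(texts)
```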

1

You can use the nekohtml library to parse HTML. Dude, don't sweat it, just use nekohtml: http://nekohtml.sourceforge.net/

1

If it were not for @bobince's answer, I would say you should develop your regexes in a test-driven manner.

Thank God you didn't

But next time use TDD.

1

As many people have already pointed out, HTML is not a regular language which can make it very difficult to parse. My solution to this is to turn it into a regular language using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written using Java with the jtidy library to turn the HTML into XML and then Jaxen to xpath into the result.

1

I recently wrote an HTML sanitizer in Java. It is based on a mixed approach of regular expressions and Java code. Personally I hate regular expressions and their follies (readability, maintainability, etc.), but if you reduce the scope of their application they may fit your needs. Anyway, my sanitizer uses a white list for HTML tags and a black list for some style attributes. For your convenience I have set up a playground so you can test if the code matches your requirements: playground and Java code. Your feedback will be appreciated.

There is a small article describing this work on my blog: roberto.open-lab.com

1

This may do:


                            <.*?[^/]>

Or without the ending tags:


                            <[^/].*?[^/]>

What's with the flame wars on HTML parsers? An HTML parser must parse (and rebuild!) the entire document before it can categorize your search. Regular expressions may be faster and more elegant in certain circumstances. My 2 cents...

0

There are some nice regexes for replacing HTML with BBCode here: http://www.garyshood.com/htmltobb/source.txt. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

-8

On the question of using RegExp methods to parse (x)HTML, the answer to everyone who spoke about limits is: you have not been trained enough to rule the force of this powerful weapon, since NOBODY here spoke about recursion.

A RegExp-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.

After reading some posts, the first thing I did was look for the "?R" string in this thread. The second was to search for "recursion".
No, holy cow, no match found.
Since nobody mentioned the main mechanism a parser is built on, I was soon aware that nobody got the point.

If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct.

The black art of RegExp is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)

Here's the magic pattern:

$pattern = "/<([\w]+)([^>]?)(([\s]\/>)|(>((([^<]?|<!--.?-->)|(?R)))<\/\1[\s]>))/s";

Just try it. It's written as a PHP string, so the "s" modifier makes classes include newlines.
Here's a sample note I wrote on the PHP manual in January:

http://php.net/manual/en/regexp.reference.recursive.php

(Take care: in that note I wrongly used the "m" modifier. It should be removed, although the RegExp engine ignores it anyway, since no ^ or $ anchors are used.)

Now, we could speak about the limits of this method from a more informed point of view:

  1. depending on the specific implementation of the RegExp engine, recursion may have a limit on the number of nested patterns parsed, but this depends on the language used
  2. although corrupted (x)HTML does not cause severe errors, it is not sanitized.

Anyhow, it is only a RegExp pattern, but it opens up the possibility of developing a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built into my framework, and performance is really great, both in execution time and in memory usage (nothing like other template engines that use the same syntax).

3
"... I was soon aware that nobody got the point ..." ... sigh. -1 - Bart K. yesterday
4
Ooooh, recursive regexes! Why didn't we think of that? - Alan Moore yesterday
6
I'll put this in the "Regex which doesn't allow greater-than in attributes" bin. Check it against <input value="is 5 > 3?" /> - Gareth yesterday
12
If you put something like that in production code, you would likely be shot by the maintainer. A jury would never convict him. - aehiilrs yesterday
1
@Emanuele, yes, it's valid. - Bart K. 22 hours ago
