시닉

Question

246

494

I need to match all of these opening tags:

But not these:

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.


                            <([a-z]+) *[^/]*?>

I believe it says:

Find a less-than, then
Find (and capture) a-z one or more times, then
Find zero or more spaces, then
Find any character zero or more times, greedy, except /, then
Find a greater-than

Do I have that right? And more importantly, what do you think? =)
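
A quick way to check that reading is to run the pattern against a few tags. This is only a sketch using Python's `re` module (the question does not name a language), and the sample tags and the `captured_name` helper are made up for illustration.

```python
import re

# The pattern from the question, unchanged.
OPEN_TAG = re.compile(r'<([a-z]+) *[^/]*?>')

# Hypothetical sample inputs, not taken from the question.
samples = ['<p>', '<a href="foo">', '<br />', '<a href="/home">']

def captured_name(tag):
    """Return the captured tag name, or None if the pattern does not match."""
    m = OPEN_TAG.fullmatch(tag)
    return m.group(1) if m else None

results = {tag: captured_name(tag) for tag in samples}
for tag, name in results.items():
    print(f'{tag!r:20} -> {name}')
```

Two things stand out. `*?` is the lazy form of `*`, not the greedy one. And `[^/]` also rejects a slash inside an attribute value, so the last sample, an ordinary opening tag, is not matched.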

EDIT:

Hmm, which answer to mark as correct? For the record, ALL the answers are appreciated. Many thanks!

edited Nov 18 '09 at 18:21

community wiki

6 revs, 3 users 100%
Jeff

142

I think that using a regular expression to process XML when there are plenty of good XML parsers out there is a bad idea. - David Dorward Nov 13 '09 at 22:42

48

<div style="background:url('/path/image.jpg');" title='Yes/No'> - Kobi Nov 13 '09 at 22:44

18

By this point, I really think it should be axiomatic that, if you're trying to parse arbitrary XML/HTML, regular expressions are not the tool to use. - jamesnvc Nov 14 '09 at 4:44

4

If your input set is small/homogeneous enough that you don't need a parser, then you can just try it out and check manually whether the regexp hack matches what you want it to match. If you can't then you should rethink your need for a parser... - David Schmitt Nov 14 '09 at 10:37

5

@Claudiu, because Jeff Atwood (codinghorror.com/blog/archives/001311.html) wrote a post about parsing Html with Regex. - Arnis L. Nov 17 '09 at 8:22


30 Answers


5 revs, 3 users 67% bobince · Answer 1 · 2010-07-05 01:03:24Z

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes~~, the pestilent sl~~ithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of re̸gular expre~~ssion parsing~~will extinguish the voices of mortal man from the sphere I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuffing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T A LL IS LOST the pon̷y he comes he c̶̮om~~es he co~~ mes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼ OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s͎a̧͈͖r̽̾̈́͒͑e not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

Have you tried using an XML parser instead?

2 revs, 2 users 80% Kaitlin Duck Sherwood · Answer 2 · 2010-06-01 09:40:34Z

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's Web site. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.
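
For a one-off job like that, the regex can lean on the page's exact layout. A minimal sketch in Python, assuming a made-up row format: the real page's markup is not shown here, so the HTML, the names, and the field order below are all hypothetical.

```python
import re

# Hypothetical markup: one table row per member, always in this exact shape.
page = """
<tr><td class="name">Jane Citizen</td><td>Example Party</td><td>Northtown</td></tr>
<tr><td class="name">John Smith</td><td>Other Party</td><td>Southport</td></tr>
"""

ROW = re.compile(
    r'<tr><td class="name">([^<]+)</td><td>([^<]+)</td><td>([^<]+)</td></tr>'
)

# Each match yields a (name, party, district) tuple.
members = ROW.findall(page)
```

This only holds while the markup keeps that exact shape. Any change in attributes or nesting means rewriting the pattern, which is acceptable for a one-time scrape and not for much else.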

3 revs itsadok · Answer 3 · 2009-11-18 07:31:43Z

Disclaimer: use a parser if you have the option. That said...

This is the regex I use (!) to match HTML tags:


                            <(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>

It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like <a name="badgenerator"">, which show up on the web.
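
To see what the pattern does with quoted attributes, here is a small sketch in Python's `re` (my choice of language, the regex itself is unchanged). The sample string is made up, apart from the stray-quote tag mentioned above.

```python
import re

# The tag-matching pattern from this answer, unchanged.
TAG = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')

# Sample input: a ">" inside a quoted attribute, then the stray-quote case.
html = '<a href="foo" title="5>3"> Oops </a><a name="badgenerator"">'

# The pattern has no capturing groups, so findall returns whole tags.
tags = TAG.findall(html)
```

Because quoted strings are consumed as a unit, the `>` inside `title="5>3"` does not end the tag early.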

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:


                            <(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+(?<!/\s*)>

or just combine if and if not.

To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...
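
A rough sketch of the "combine if and if not" route, in Python. Python's `re` rejects the variable-width look-behind `(?<!/\s*)`, so the self-closing check is done as a second test here. The `NOISE` pattern and the `open_tags` helper are my own approximations for illustration, not part of the product code mentioned above.

```python
import re

TAG = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')

# Regions whose contents should not be scanned for tags (a rough approximation).
NOISE = re.compile(
    r'<!--.*?-->|<!\[CDATA\[.*?\]\]>|<(script|style)\b.*?</\1\s*>',
    re.DOTALL | re.IGNORECASE,
)

def open_tags(html):
    """Return tags that are neither closing tags nor self-closed."""
    cleaned = NOISE.sub('', html)
    return [
        t for t in TAG.findall(cleaned)
        if not t.startswith('</')          # "if not": a closing tag
        and not re.search(r'/\s*>$', t)    # "if not": ends in "/>"
    ]

sample = '<p><!-- <b> --><br /><script>if (a<b) x();</script><a href="x">hi</a></p>'
found = open_tags(sample)
```

The stripping step is itself a regex hack and will misfire on things like a `-->` inside a script string, so the disclaimer at the top still applies.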

3 revs, 2 users 55% NealB · Answer 4 · 2010-04-23 09:31:01Z

117

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

Chomsky Hierarchy
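
One way to see the difference in practice: checking that tags nest correctly needs a memory of every tag still open, which a stack provides and a finite automaton does not. A minimal sketch in Python, assuming the tags have already been split into a list (the `is_balanced` helper is made up for illustration):

```python
def is_balanced(tags):
    """Check that a sequence like ['<div>', '<p>', '</p>', '</div>'] nests properly."""
    stack = []
    for tag in tags:
        name = tag.strip('</>')
        if tag.startswith('</'):
            # A closing tag must match the most recently opened one.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    # Anything left on the stack was opened but never closed.
    return not stack
```

The stack can grow without bound as nesting deepens, and that unbounded memory is exactly what a Type 3 grammar lacks.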


15

this is a very good answer - Paul Nathan Jan 4 at 18:25

Short and informative, I like it :) - Sune Rievers yesterday

+1 for science level - mico yesterday

2

This is not actually the case. RegEx in most programming languages is actually context-free, due to the fact that it has look-backs, etc. - michaelfairley yesterday

1

@michaelfairley Look ahead/behind/around features provide a richer syntax for expressing certain classes of regular expression. I do not believe these features provide fundamentally any more expressive power than a Chomsky type 3 grammar is capable of. One might argue that HTML is a visibly pushdown language (VPL) and so may be parsed using techniques less powerful than those required for a full-blown context-free grammar; however, I am unaware of any RegEx engine that supports VPLs either. - NealB 23 hours ago


MattK · Answer 5 · 2009-11-15 17:06:14Z

41

Perhaps http://www.crummy.com/software/BeautifulSoup/


2

Yes, especially given this comment "I'm parsing a block of XHTML, truncating it, then closing any tags that are left open after it's been truncated. The DOM XML stuff doesn't work because it's not properly formed XML." Use BeautifulSoup to truncate and prettify. - Mark Nov 15 '09 at 18:51

Jesse Mullan · Answer 6 · 2009-11-17 00:05:53Z

10

Someone wrote a full html parser for PHP: http://htmlpurifier.org/


	Why is this -1?? This is also a good answer!!! - AntonioCS Dec 28 '09 at 23:01
	This is a great parser +1 - alex Feb 12 at 3:26

John Fiala · Answer 7 · 2009-11-13 23:44:50Z

I suggest using QueryPath (http://querypath.org/) for parsing XML and HTML. It's basically much the same syntax as jQuery, only it's on the server side.

2 revs, 2 users 57% Kosso · Answer 8 · 2009-11-25 21:12:45Z

I find this small PHP library incredibly useful for parsing HTML tags: http://simplehtmldom.sourceforge.net/

DanielHonig · Answer 9 · 2009-11-14 20:24:42Z

I know Java isn't cool anymore, but if you want to use a really good library in Java, you might check into Tag soup which is built on top of Xerces. http://home.ccil.org/~cowan/XML/tagsoup/

3 revs, 2 users 71% Kobi · Answer 10 · 2009-11-25 21:12:04Z

6

Try:


                            <([^\s]+)(\s[^>]*?)?(?<!/)>

It is similar to yours, but the last > must not be after a slash, and also accepts h1.
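
A quick check of this pattern in Python's `re`, including the attribute-value case raised in the comments on this answer. The sample strings other than that one are made up, and `full_matches` is just a helper for the sketch.

```python
import re

# The pattern from this answer, unchanged.
KOBI = re.compile(r'<([^\s]+)(\s[^>]*?)?(?<!/)>')

def full_matches(html):
    """Return the full text of every match (finditer avoids findall's group tuples)."""
    return [m.group(0) for m in KOBI.finditer(html)]

simple = full_matches('<h1><br/><br /><p class="x">')
tricky = full_matches('<a href="foo" title="5>3"> Oops </a>')
```

The simple input behaves as intended: `<h1>` and `<p class="x">` match, the two `<br>` forms do not. The tricky input is cut short at the `>` inside the attribute value, and because `[^\s]+` accepts `/a`, the closing tag matches as well.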


12

<a href="foo" title="5>3"> Oops </a> - Gareth Nov 13 '09 at 23:11

1

That is very true, and I did think about it, but I assumed the > symbol is properly escaped to &gt;. - Kobi Nov 13 '09 at 23:16

17

> is valid in an attribute value. Indeed, in the 'canonical XML' serialisation you must not use &gt;. (Which isn't entirely relevant, except to emphasise that > in an attribute value is not at all an unusual thing.) - bobince Nov 14 '09 at 0:15

2 revs meder · Answer 11 · 2009-11-15 14:44:04Z


    <?php

    // Element names that are self-closing and should be skipped.
    $selfClosing = explode(',', 'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed');

    $html = '
    <p><a href="#">foo</a></p>
    <hr/>
    <br/>
    <div>name</div>';

    // Let the DOM extension parse the markup instead of a regex.
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $els = $dom->getElementsByTagName('*');

    foreach ($els as $el) {
        $nodeName = strtolower($el->nodeName);
        // Only report elements that are not in the self-closing list.
        if (!in_array($nodeName, $selfClosing)) {
            var_dump($nodeName);
        }
    }

Output:

Basically, define the element node names that are self-closing, load the whole HTML string into a DOM library, grab all elements, then loop through them, skip the self-closing ones, and operate on the rest.
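
The same idea can be sketched in Python with the standard library's `html.parser`. This is a rough analogue, not a port: it uses an event-based parser rather than a DOM, and the `OpenTagCollector` class is a name invented for this example.

```python
from html.parser import HTMLParser

# Same list of self-closing element names as the PHP snippet above.
SELF_CLOSING = set(
    'area,base,basefont,br,col,frame,hr,img,input,isindex,link,meta,param,embed'
    .split(',')
)

class OpenTagCollector(HTMLParser):
    """Collect the names of start tags that are not self-closing."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag not in SELF_CLOSING:
            self.names.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style tags such as <hr/>; skip them entirely.
        pass

html = '<p><a href="#">foo</a></p><hr/><br/><div>name</div>'
collector = OpenTagCollector()
collector.feed(html)
collector.close()
names = collector.names
```

Unlike `DOMDocument::loadHTML`, this parser does not wrap the fragment in `html` and `body` elements, so only the tags actually present in the input are reported.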

I'm sure you already know by now that you shouldn't use regex for this purpose.

Excalibur2000 · Answer 12 · 2009-11-16 20:11:44Z

3

XPath Luke, is your father.


GONeale · Answer 13 · 2009-11-16 23:15:03Z

I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack?

Excerpt:

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML.

Jherico · Answer 14 · 2009-11-13 22:47:17Z

You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind.

However, a naive implementation of that will end up matching <bar/></foo> in this example document


                            <foo><bar/></foo>

Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically?
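
To make the pitfall concrete, here is a sketch in Python. The `NAIVE` pattern is my guess at what a naive implementation might look like; it is not taken from the linked page.

```python
import re

# A deliberately naive pattern: "<", anything, then a ">" not preceded by "/".
NAIVE = re.compile(r'<.+?(?<!/)>')

matches = NAIVE.findall('<foo><bar/></foo>')
```

The look-behind rejects the `>` of `<bar/>`, so the lazy `.+?` simply keeps going until the next `>` that qualifies, and the second match swallows both `<bar/>` and `</foo>`.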

Fotis · Answer 15 · 2009-11-15 14:25:20Z

You should check the PHP DOM functions. Very handy once you study this tutorial: http://php.net/manual/en/book.dom.php

moritz · Answer 16 · 2010-01-27 12:54:35Z

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression.

The suggested regex is wrong, though:


                            <([a-z]+) *[^/]*?>

If you add something to the regex, backtracking can force it to match silly things like <a >>; [^/] is too permissive. Also note that <space>*[^/]* is redundant, because the [^/]* can also match spaces.

My suggestion would be


                            <([a-z]+)[^>]*(?<!/)>

Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".

Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
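
A small comparison of the two patterns, sketched in Python's `re`. Using `fullmatch` stands in for "adding something to the regex" that forces it to consume the whole input; the `matches` helper and the sample tags other than `<a >>` and `<a/ >` are made up.

```python
import re

ORIGINAL = re.compile(r'<([a-z]+) *[^/]*?>')       # pattern from the question
SUGGESTED = re.compile(r'<([a-z]+)[^>]*(?<!/)>')   # pattern suggested above

def matches(pattern, text):
    """True if the pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None

checks = {
    'original accepts <a >>': matches(ORIGINAL, '<a >>'),
    'suggested accepts <a >>': matches(SUGGESTED, '<a >>'),
    'suggested accepts <a href="foo">': matches(SUGGESTED, '<a href="foo">'),
    'suggested accepts <br/>': matches(SUGGESTED, '<br/>'),
    'suggested accepts <a/ >': matches(SUGGESTED, '<a/ >'),
}
```

Only the second and fourth entries come out false: the suggested pattern rejects `<a >>` and `<br/>`, but still lets `<a/ >` through, as noted above.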

eyazici · Answer 17 · 2010-02-09 03:59:27Z

Although regular expressions are not a suitable or effective tool for that purpose, they sometimes provide quick solutions to simple matching problems, and in my view it's not that horrible to use them for trivial tasks.

There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.

2 revs, 2 users 82% profjim · Answer 18 · 2010-03-03 11:23:45Z

2

You can parse html in sed though.

Turing.sed
Write html parser (homework)
???
Profit!


See also perlmonks.org/?displaytype=print;node_id=809842 - profjim Mar 3 at 12:50

John-David Dalton · Answer 19 · 2009-11-15 06:18:15Z

The W3C explains parsing in a pseudo regexp form:
http://www.w3.org/TR/REC-xml-names/#ns-using

Follow the var links for QName, S, and Attribute to get a clearer picture.
Based on that you can create a pretty good regexp to handle things like stripping tags.

manixrock · Answer 20 · 2009-11-15 17:13:19Z

It seems to me you're trying to match tags without a "/" at the end. Try this:


                            <([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>
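
The visible difference from a letters-only tag name is what gets captured for tags like `<h1>`. A sketch in Python's `re`; the `LETTERS_ONLY` pattern is included only for comparison, and `tag_name` is a helper invented for this example.

```python
import re

# Tag name restricted to a-z, for comparison.
LETTERS_ONLY = re.compile(r'<([a-z]+)[^>]*(?<!/)>')
# The pattern above: a letter, then letters or digits, in either case.
WITH_DIGITS = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)[^>]*(?<!/)>')

def tag_name(pattern, tag):
    """Return the captured tag name, or None if the tag does not match."""
    m = pattern.fullmatch(tag)
    return m.group(1) if m else None
```

With the letters-only version, `<h1>` still matches but the captured name is just `h`; the pattern above captures `h1`, and also accepts uppercase names such as `<H2 class="t">`.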

logoin · Answer 21 · 2009-11-16 18:34:50Z

I have used an open-source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into different tree nodes, and you can easily use its API to get attributes out of a node. Check it out and see if it can help you.

3 revs samgoody · Answer 22 · 2009-11-16 19:41:29Z

If you need this for PHP:

The PHP DOM functions won't work properly unless the input is well-formed XML. No matter how much better their use is for the rest of mankind.

simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy [will crash on large pages].

I have never used querypath, so can't comment on its usefulness.

Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted.

For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.

Sembiance · Answer 23 · 2009-11-18 14:50:26Z

Whenever I need to quickly extract something from an HTML document, I use tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this: //p/a[@href='foo']
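
The XPath half of that can be sketched with Python's standard library. This assumes the tidy step has already run and produced well-formed XML without namespaces (real tidy output may carry the XHTML namespace, which would need prefixes in the path); the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Assumed to be the output of a tidy step: well-formed XML, no namespaces.
xhtml = """
<html><body>
  <p><a href="foo">first</a> <a href="bar">second</a></p>
  <div><a href="foo">not inside a p</a></div>
</body></html>
""".strip()

root = ET.fromstring(xhtml)

# ElementTree supports a subset of XPath; ".//" searches all descendants of root.
links = root.findall(".//p/a[@href='foo']")
texts = [a.text for a in links]
```

Only the link that is a direct child of a `p` and has `href="foo"` is selected.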

zzulus.ya.ru · Answer 24 · 2009-12-02 14:13:20Z

You can use the nekohtml library to parse HTML. Dude, don't sweat it, just use nekohtml: http://nekohtml.sourceforge.net/

Jader Dias · Answer 25 · 2009-12-04 01:47:17Z

If it were not for @bobince's answer, I would say you should develop your regexes in a test-driven manner.

Thank God you didn't

But next time use TDD.

Corey Sanders · Answer 26 · 2010-02-04 16:22:00Z

As many people have already pointed out, HTML is not a regular language, which can make it very difficult to parse. My solution to this is to turn it into well-formed XML using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written in Java, using the jtidy library to turn the HTML into XML and then Jaxen to run XPath queries against the result.

2 revs, 2 users 74% Roberto · Answer 27 · 2010-05-13 21:01:03Z

I recently wrote an HTML sanitizer in Java. It is based on a mixed approach of regular expressions and Java code. Personally I hate regular expressions and their folly (readability, maintainability, etc.), but if you reduce the scope of their application they may fit your needs. Anyway, my sanitizer uses a white list for HTML tags and a black list for some style attributes. For your convenience I have set up a playground so you can test if the code matches your requirements: playground and Java code. Your feedback will be appreciated.

There is a small article describing this work on my blog: roberto.open-lab.com

3 revs, 3 users 55% Paul · Answer 28 · 2010-07-05 16:27:57Z

This may do:


                            <.*?[^/]>

Or without the ending tags:


                            <[^/].*?[^/]>

What's with the flame wars on HTML parsers? An HTML parser must parse (and rebuild!) the entire document before it can categorize your search. Regular expressions may be faster or more elegant in certain circumstances. My 2 cents...
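
For anyone trying the two patterns at the top of this answer, here is a quick sketch in Python's `re` of how they behave on a few made-up inputs. The variable names are mine.

```python
import re

ANY_TAG = re.compile(r'<.*?[^/]>')           # first pattern above
NO_END_TAGS = re.compile(r'<[^/].*?[^/]>')   # second pattern above

first = ANY_TAG.findall('<p><br/><b>')
second = NO_END_TAGS.findall('<p>')
third = NO_END_TAGS.findall('<br/><div>')
```

Two limits show up. A self-closing tag followed by another tag is matched as one long span (`<br/><b>`, `<br/><div>`), and the second pattern needs at least two characters between the brackets, so a lone `<p>` is missed.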

sblom · Answer 29 · 2010-04-25 16:38:42Z

There are some nice regexes for replacing HTML with BBCode here: http://www.garyshood.com/htmltobb/source.txt. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

Emanuele Del Grande · Answer 30 · 2010-07-05 14:16:50Z

On the question of using RegExp to parse (x)HTML, my answer to everyone who spoke about its limits is: you have not been trained enough to rule the force of this powerful weapon, since NOBODY here spoke about recursion.

A RegExp-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.

After reading some posts, the first thing I did was looking for the "?R" string in this thread. The second was to search about "recursion".
No, holy cow, no match found.
Since nobody mentioned the main mechanism a parser is built onto, I was soon aware that nobody got the point.

If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct.

The black art of RegExp is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :)

Here's the magic pattern:

$pattern = "/<([\w]+)([^>]?)(([\s]\/>)|(>((([^<]?|)|(?R)))<\/\1[\s]>))/s";

Just try it. It's written as a PHP string, so the "s" modifier makes classes include newlines.
Here's a sample note I wrote in the PHP manual in January:

http://php.net/manual/en/regexp.reference.recursive.php

(Take care, in that note I wrongly used the "m" modifier, it should be erased, notwithstanding it is discarded by the RegExp engine, since no ^ or $ anchorage was used).

Now, we could speak about the limits of this method from a more informed point of view:

depending on the specific implementation of the RegExp engine, recursion may have a limit on the number of nested patterns parsed, but that depends on the language used
although corrupted (x)HTML does not cause severe errors, it is not sanitized.

Anyhow, it is only a RegExp pattern, but it opens up the possibility of developing a lot of powerful implementations.
I wrote this pattern to power the recursive descent parser of a template engine I built in my framework, and performance is really great, both in execution time and in memory usage (nothing like other template engines which use the same syntax).


Wednesday, July 7, 2010

RegEx match open tags except XHTML self-contained tags
