246
494
|
I need to match all of these opening tags:
But not these:
I came up with this and wanted to make sure I've got it right. I am only capturing the
I believe it says:
Do I have that right? And more importantly, what do you think? =) EDIT: Hmm, which answer to mark as correct? For the record, ALL the answers are appreciated. Many thanks! |
|||
3579
|
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes Have you tried using an XML parser instead? |
||||||||||||||||||||
|
265
|
While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML. If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's Web site. This was a limited, one-time job. Regexes worked just fine for me, and were very fast to set up. |
||||||||||||||||||||
|
136
|
Disclaimer: use a parser if you have the option. That said... This is the regex I use (!) to match HTML tags:
It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind:
or just combine if and if not. To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML. Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex... |
||||||||||||||||||||
|
117
|
I think the flaw here is that HTML is a Chomsky Type 2 grammar (context-free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar, you can't possibly hope to make this work. But many will try, some will claim success, and others will find the fault and totally mess you up. |
||||||||||||||||
|
41
|
|
||||
|
10
|
Someone wrote a full HTML parser for PHP: http://htmlpurifier.org/ |
|||
9
|
I suggest using QueryPath (http://querypath.org/) for parsing XML and HTML. It's basically much the same syntax as jQuery, only it's on the server side. |
|||
9
|
I find this small PHP library incredibly useful for parsing HTML tags: http://simplehtmldom.sourceforge.net/. |
|||
|
6
|
I know Java isn't cool anymore, but if you want to use a really good library in Java, you might look into TagSoup, which is built on top of Xerces: http://home.ccil.org/~cowan/XML/tagsoup/ |
|||
6
|
Try:
It is similar to yours, but the last |
||||||||||||
|
5
|
Output:
Basically just define the element node names that are self closing, load the whole html string into a DOM library, grab all elements, loop through and filter out ones which aren't self closing and operate on them. I'm sure you already know by now that you shouldn't use regex for this purpose. |
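The code for this answer did not survive on this page, so here is a rough sketch of the same idea in Python rather than the answer's own code. It uses the standard library's `html.parser` instead of a full DOM library, and the names `VOID_ELEMENTS`, `OpenTagCollector` and `collect_open_tags` are made up for illustration. The list of self-closing element names is an assumption based on the HTML void elements; adjust it to your markup.

```python
from html.parser import HTMLParser

# Assumption: element names treated as self-closing (the HTML void elements).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class OpenTagCollector(HTMLParser):
    """Collects the names of opening tags that are not self-closing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for <tag ...>; skip names that never take a closing tag.
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style <tag ... />; deliberately ignored.
        pass


def collect_open_tags(html):
    parser = OpenTagCollector()
    parser.feed(html)
    parser.close()
    return parser.open_tags
```

For example, `collect_open_tags('<p>Hi <a href="foo">x</a><br /><hr class="foo" /></p>')` returns `['p', 'a']`. The parser does the tokenizing, so a `/` or `>` inside an attribute value is not a problem here.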
|||
3
|
XPath Luke, is your father. |
|||
3
|
I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack? Excerpt:
|
|||
2
|
You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind. However, a naive implementation of that will end up matching
Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically? |
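A minimal sketch of the negative lookbehind idea in Python, for illustration only. The pattern `OPEN_TAG` and the helper `find_open_tags` are hypothetical, not the asker's regex, and the failing input shown afterwards is just one way a naive pattern can go wrong, not necessarily the case this answer had in mind.

```python
import re

# Hypothetical pattern: "<", a letter, any run of characters other than
# "<" or ">", then a ">" that is not immediately preceded by "/".
OPEN_TAG = re.compile(r"<[a-zA-Z][^<>]*(?<!/)>")


def find_open_tags(html):
    """Return every substring that looks like a non-self-closed opening tag."""
    return OPEN_TAG.findall(html)
```

`find_open_tags('<p><a href="foo"><br /><hr class="foo" /></p>')` returns `['<p>', '<a href="foo">']`, and a `/` inside an attribute value does no harm. But a `>` inside an attribute value fools it: `find_open_tags('<img alt="a > b" />')` returns `['<img alt="a >']`, a fragment of a self-closed tag.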
|||
|
2
|
You should check PHP DOM Functions. Very handy once you study this tutorial: http://php.net/manual/en/book.dom.php |
|||
2
|
While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression. The suggested regex is wrong, though:
If you add something to the regex, by backtracking it can be forced to match silly things like My suggestion would be
Where Note that this allows things like |
|||
2
|
Although regular expressions are not a suitable or effective tool for that purpose, they sometimes provide quick solutions for simple matching problems, and in my view it's not that horrible to use them for trivial work. There is a definitive blog post about matching innermost HTML elements written by Steven Levithan. |
|||
2
|
You can parse html in sed though.
|
|||
|
1
|
The W3C explains parsing in a pseudo regexp form:
Follow the var links for |
|||
1
|
It seems to me you're trying to match tags without a "/" at the end. Try this:
|
|||
1
|
I used an open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into different tree nodes, and you can easily use its API to get attributes out of a node. Check it out and see if this can help you. |
|||
1
|
If you need this for PHP: The PHP DOM functions won't work properly unless the input is properly formatted XML, no matter how much better their use is for the rest of mankind. simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy [Will crash on large pages.] I have never used querypath, so can't comment on its usefulness. Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful. For Python and Java, similar links were posted. For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please. |
|||
1
|
Whenever I need to quickly extract something from an HTML document, I use tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this: //p/a[@href='foo'] |
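A small sketch of the XPath step in Python, under the assumption that the document has already been converted to well-formed XML (by tidy or a similar tool) and carries no XML namespace. The sample markup `XHTML` and the helper `links_to` are made up for illustration. The standard library's `xml.etree.ElementTree` only supports a subset of XPath, which is enough for this expression.

```python
import xml.etree.ElementTree as ET

# Assumption: output of a tidy step, i.e. well-formed XML without a namespace.
XHTML = """<html><body>
<p><a href="foo">first</a> <a href="bar">second</a></p>
<div><a href="foo">not inside a p</a></div>
</body></html>"""


def links_to(document, href):
    """Return the text of every <a> directly inside a <p> with the given href."""
    root = ET.fromstring(document)
    # Equivalent of //p/a[@href='...'] in ElementTree's XPath subset.
    # Assumes href contains no single quote.
    return [a.text for a in root.findall(f".//p/a[@href='{href}']")]
```

`links_to(XHTML, "foo")` returns `['first']`; the link inside the `div` is skipped because the path requires a `p` parent.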
|||
1
|
You can use the nekohtml library to parse HTML. Dude, don't sweat it, just use nekohtml: http://nekohtml.sourceforge.net/ |
|||
1
|
If it were not for @bobince's answer, I would say you should develop your regexes in a test-driven manner. Thank God you didn't. But next time, use TDD. |
|||
1
|
As many people have already pointed out, HTML is not a regular language, which can make it very difficult to parse. My solution to this is to turn it into well-formed XML using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written in Java, using the jtidy library to turn the HTML into XML and then Jaxen to run XPath queries against the result. |
|||
1
|
I recently wrote an HTML sanitizer in Java. It is based on a mixed approach of regular expressions and Java code. Personally I hate regular expressions and their folly (readability, maintainability, etc.), but if you reduce the scope of their application they may fit your needs. Anyway, my sanitizer uses a white list for HTML tags and a black list for some style attributes. For your convenience I have set up a playground so you can test if the code matches your requirements: playground and Java code. Your feedback will be appreciated. There is a small article describing this work on my blog: roberto.open-lab.com |
|||
1
|
This may do:
Or without the ending tags:
What's with the flame wars on HTML parsers? HTML parsers must parse (and rebuild!) the entire document before they can categorize your search. Regular expressions may be faster or more elegant in certain circumstances. My 2 cents... |
|||
0
|
There are some nice regexes for replacing HTML with BBCode here: http://www.garyshood.com/htmltobb/source.txt. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand. |
|||
-8
|
On the question of using RegExp methods to parse (x)HTML, my answer to everyone who spoke about limits is: you have not been trained enough to rule the force of this powerful weapon, since NOBODY here spoke about recursion.
A RegExp-agnostic colleague pointed me to this discussion, which is certainly not the first on the web about this old and hot topic.
After reading some posts, the first thing I did was look for the "?R" string in this thread. The second was to search for "recursion".
If an (x)HTML parser needs recursion, a RegExp parser without recursion is not enough for the purpose. It's a simple construct. The black art of RegExp is hard to master, so maybe there are further possibilities we left out while trying and testing our personal solution to capture the whole web in one hand... Well, I am sure about it :) Here's the magic pattern:
Just try it. It's written as a PHP string, so the "s" modifier makes classes include newlines.
(Take care: in that note I wrongly used the "m" modifier. It should be removed, although the RegExp engine discards it anyway, since no ^ or $ anchor was used.) Now, we could speak about the limits of this method from a more informed point of view:
Anyhow, it is only a RegExp pattern, but it opens up the possibility of developing many powerful implementations.
|
||||||||||||||||||||
|
<div style="background:url('/path/image.jpg');" title='Yes/No'>
- Kobi Nov 13 '09 at 22:44