Wednesday, July 7, 2010

Easily extracting links from a snippet of html with HtmlAgilityPack

The HtmlAgilityPack is a powerful library that makes screen scraping in ASP.NET a breeze. This is the second post in a continuing series, and in it I demonstrate a way for you to extract all the links from a snippet of html.

A little background

If you haven't heard about HtmlAgilityPack yet then you have landed on the wrong post. Head over to my introduction to the subject and then come back and see me when you have read that.

How the sample application is going to work

The sample application takes a snippet of messy html stored in a text file. We are going to load it in, parse out all the <a href=""> tags and present these links in the browser by binding them to a GridView.

Let's take a look at the html snippet that we are going to load:

~/App_Data/HtmlSnippet.txt

<table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;"><tr><td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><a href="http://news.google.com/news/url?fd=R&amp;sa=T&amp;url=http%3A%2F%2Fwww.timesonline.co.uk%2Ftol%2Fnews%2Fworld%2Fus_and_americas%2Farticle6802128.ece&amp;usg=AFQjCNGnZL4BdTSWSglpAZdprg3u_tJVhg"><img src="http://nt2.ggpht.com/news/tbn/XrArEKXhTe6dLM/6.jpg" alt="" border="1" width="80" height="80" />
<font size="-2">Times Online</font></a></font></td><td valign="top"><font style="font-size:85%;font-family:arial,sans-serif">
<div style="padding-top:0.8em;"><img alt="" height="1" width="1" /></div><div><a href="http://news.google.com/news/url?fd=R&amp;sa=T&amp;url=http%3A%2F%2Fwww.latimes.com%2Fnews%2Fnationworld%2Fnation%2Fla-na-health-coop20-2009aug20%2C0%2C4258832.story&amp;usg=AFQjCNG4LI_9w3yHg7H8ZqUBaKNwzpgiuA"><b>Healthcare co-ops emerging as viable alternative</b></a> <!-- snip -->

Well, that's not all of it, but I think you get the point. This snippet originally came from a forum question, and the html itself looks like it came from a Google News feed. I have kept it for this article because it shows that the HtmlAgilityPack can handle messy code, and also that it's not going to be tripped up by the extra URLs which are URL-encoded inside the href values.

The normal approach of using a regular expression to extract this kind of information could be tricked by this kind of code. I am not a big fan of using regular expressions for extracting this kind of information because they are too brittle (but I am a big fan of regular expressions).
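To make the brittleness concrete, here is a minimal sketch using a deliberately naive pattern of my own invention (it is not taken from the forum question, and the example.com URLs are placeholders). The pattern assumes every link is written exactly as href="..." with double quotes and no spaces:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NaiveRegexDemo
{
    // Hypothetical naive pattern: expects href="..." with double quotes and no spaces.
    private static readonly Regex NaiveHref =
        new Regex("href=\"([^\"]*)\"", RegexOptions.IgnoreCase);

    public static List<string> ExtractWithRegex(string html)
    {
        List<string> results = new List<string>();
        foreach (Match m in NaiveHref.Matches(html))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        string html =
            "<a href=\"http://example.com/one\">one</a>" +
            "<a href = 'http://example.com/two'>two</a>" +          // spaces and single quotes
            "<!-- <a href=\"http://example.com/old\">old</a> -->";  // commented-out link

        foreach (string href in ExtractWithRegex(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Run against the three links in Main, this pattern picks up the first link and the one inside the comment, and misses the link written with spaces and single quotes. A more careful pattern can handle each of these cases, but every fix makes the expression harder to read and maintain.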

The main structure of the program

Here is the code for the Page_Load method for your perusal. It should give you an idea of the main steps this program takes to complete its tasks:

// requires: using System; using System.Collections.Generic; using HtmlAgilityPack;
protected void Page_Load(object sender, EventArgs e)
{
    // load snippet
    HtmlDocument htmlSnippet = LoadHtmlSnippetFromFile();

    // extract hrefs
    List<string> hrefTags = ExtractAllAHrefTags(htmlSnippet);

    // bind to gridview
    GridViewHrefs.DataSource = hrefTags;
    GridViewHrefs.DataBind();
}

So as you can see it takes three main steps - loading the snippet of html into the system, parsing it and a final cosmetic stage of binding it to a GridView.

In the first line you see that we declare an HtmlDocument. This is a class which comes with the HtmlAgilityPack library. It is the primary class you use to hold a complete parsed html document.

This brings us nicely to LoadHtmlSnippetFromFile();

Loading the html snippet from file

The second method we are going to look at is LoadHtmlSnippetFromFile().

It is a pretty simple method which loads in the full version of that horribly messy html snippet I showed you earlier. That is not to say it has nothing to teach us, though. Let's take a look:

/// <summary>
/// Load the html snippet from the txt file
/// </summary>
private HtmlDocument LoadHtmlSnippetFromFile()
{
    TextReader reader = File.OpenText(Server.MapPath("~/App_Data/HtmlSnippet.txt"));
    HtmlDocument doc = new HtmlDocument();
    doc.Load(reader);
    reader.Close();
    return doc;
}

So as you can see, I have used File.OpenText, which returns a StreamReader (one of the many TextReader classes), to read the html snippet txt file with very little effort.

Turning this stream of html text into a queryable document is the task of the Load() method on the HtmlDocument. If you poke around with IntelliSense on that method you will find that it has 10 overloads which will let you use paths, streams or TextReaders with various encoding options.

If you already have the contents of an html document contained within a string (such as from a web service) then you can use LoadHtml() instead.
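As a minimal sketch of the string route, assuming the HtmlAgilityPack assembly is referenced (the ParseString helper, the class name and the markup are made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

public static class LoadHtmlDemo
{
    // Hypothetical helper: parse markup that is already held in a string.
    public static HtmlDocument ParseString(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    public static void Main()
    {
        // Placeholder markup; note the unclosed <p> tag.
        HtmlDocument doc = ParseString("<p>Hello <a href=\"http://example.com/\">world</a>");

        // SelectSingleNode returns null when nothing matches, so check before use.
        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a[@href]");
        if (link != null)
        {
            Console.WriteLine(link.GetAttributeValue("href", string.Empty));
        }
    }
}
```

No reader is involved here, so there is nothing to close afterwards.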

There is a third option which the HtmlAgilityPack supports and that is retrieving the page over the internet via a url. This is demonstrated in the next article which explains how you can test if a web page contains an rss or atom feed, but for now it will remain a tantalising mystery.

I shouldn't have to say it but don't forget to .Close() your stream after you have populated your HtmlDocument.
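One way to make the close hard to forget is a using block, which disposes the reader even if Load() throws. This is a sketch of an alternative, not the code in the sample application. The LoadFromFile helper and the temporary file are assumptions so that it runs outside a web project; in the page itself the path would still come from Server.MapPath.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public static class SnippetLoader
{
    // Hypothetical helper: same job as LoadHtmlSnippetFromFile, but takes a plain path.
    public static HtmlDocument LoadFromFile(string path)
    {
        HtmlDocument doc = new HtmlDocument();

        // The using block disposes (and so closes) the reader even if Load throws.
        using (TextReader reader = File.OpenText(path))
        {
            doc.Load(reader);
        }

        return doc;
    }

    public static void Main()
    {
        // Write a throwaway file so the sketch runs outside a web application.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "<a href=\"http://example.com/\">example</a>");

        HtmlDocument doc = LoadFromFile(path);
        Console.WriteLine(doc.DocumentNode.OuterHtml);

        File.Delete(path);
    }
}
```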

Extract all href tags from the document

This is the section we have all been waiting for - it is the part where the html parsing magic is done. And while this is a deceptively simple method it actually shows many of the key building blocks you will use in your screen scraping endeavours.

/// <summary>
/// Extract all anchor tags using HtmlAgilityPack
/// </summary>
/// <param name="htmlSnippet">The parsed html to search</param>
/// <returns>The href value of every anchor that has one</returns>
private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();

    // SelectNodes returns null (not an empty collection) when nothing matches
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
    {
        return hrefTags;
    }

    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}

Looking at the code several things become clear:

  • The HtmlDocument class exposes a tree of HtmlNodes through its DocumentNode property
  • These HtmlNodes can be selected with an XPath query
  • The HtmlNodes can then be interrogated attribute by attribute with the HtmlAttribute class.

Using these three elements I have extracted a list of all the hrefs in the html snippet.

The use of XPath for extracting information out of html documents is key to the power of HtmlAgilityPack. If you don't know what XPath is then it's a technology that goes alongside xml and provides a query language for extracting xml nodes out of xml documents.

XML documents must be well-formed to be queried and most html out on the web is far from it. It is littered with unclosed tags, inconsistent capitalisation and syntax errors. A normal implementation of XPath can't be used to query html unless you clean it up first. The great thing about HtmlAgilityPack is that its XPath implementation will do its best to extract the information regardless of how broken the document is.

This gives us a very expressive way to describe the information we want to extract. In fact for many projects you will find that the hardest part is figuring out the correct XPath query to describe exactly what you want.
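To illustrate the point about broken markup, here is a small sketch that feeds the same string to a strict XML parser and to HtmlAgilityPack. The markup, class and method names are invented for the example; only XmlDocument.LoadXml, HtmlDocument.LoadHtml and SelectNodes are real library calls.

```csharp
using System;
using System.Xml;
using HtmlAgilityPack;

public static class MalformedHtmlDemo
{
    // Placeholder markup with unclosed <p> and <a> tags.
    public const string Messy =
        "<div><p>First<p>Second <a href=\"http://example.com/x\">link</div>";

    // Returns true if a strict XML parser accepts the markup.
    public static bool ParsesAsXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // Counts anchors that have an href, using HtmlAgilityPack.
    public static int CountLinks(string markup)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(markup);
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        return links == null ? 0 : links.Count;
    }

    public static void Main()
    {
        Console.WriteLine("XML parser accepts it: " + ParsesAsXml(Messy));
        Console.WriteLine("Links found by HtmlAgilityPack: " + CountLinks(Messy));
    }
}
```

I would expect the XML parser to reject the string because of the mismatched tags, while HtmlAgilityPack still finds the one link.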

Our query //a[@href] means: select all a tags (html anchor tags) that have an href attribute (so don't select the anchor if it's just a named anchor).

The resulting collection is then iterated over in the foreach loop where I read the href attribute from each a tag and put it into my final collection for databinding.
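To give a feel for how the query shapes the result, here is a sketch that runs a few XPath variations through one helper. SelectAttribute is a hypothetical helper and the markup is a placeholder; the queries themselves are ordinary XPath 1.0.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical helper: collect one attribute from every node the query matches.
    public static List<string> SelectAttribute(HtmlDocument doc, string xpath, string attributeName)
    {
        List<string> values = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return values; // nothing matched
        }
        foreach (HtmlNode node in nodes)
        {
            values.Add(node.GetAttributeValue(attributeName, string.Empty));
        }
        return values;
    }

    private static void Print(string xpath, List<string> values)
    {
        Console.WriteLine(xpath + " -> " + string.Join(", ", values.ToArray()));
    }

    public static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(
            "<a name=\"top\">Top</a>" +                        // named anchor, no href
            "<a href=\"http://example.com/a\">absolute</a>" +
            "<a href=\"/relative\">relative</a>" +
            "<img src=\"pic.jpg\">");

        // Every anchor with an href (skips the named anchor).
        Print("//a[@href]", SelectAttribute(doc, "//a[@href]", "href"));

        // Only anchors whose href starts with "http".
        Print("//a[starts-with(@href, 'http')]",
            SelectAttribute(doc, "//a[starts-with(@href, 'http')]", "href"));

        // The same pattern applied to images.
        Print("//img[@src]", SelectAttribute(doc, "//img[@src]", "src"));
    }
}
```

Keeping the null check inside the helper means a query that matches nothing simply yields an empty list.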

We have covered a lot of ground in very little code which I hope further impresses on you the power of this library.

Tune in next time to find out how we will build a query engine that can detect if a webpage has an rss or atom feed associated with it!

Download the sample application

The sample application contains everything we discussed in this article including the HtmlAgilityPack, the code and the html snippet file.

More In This Series

This article is part of a series. You can find more posts in this series here:
