Wednesday, July 7, 2010

Using HTML Agility Pack and XPath in C#

Scraping web pages in C# is far easier with HTML Agility Pack and XPath

Chances are pretty high that at some point, a client is going to ask you to develop a software app that will need to scrape something.

If you're developing internet marketing apps (like I am), then scraping is a guaranteed task. Most programming languages make it painful, and C# is no exception. However, HTML Agility Pack for the .NET Framework takes most of that pain away.

For this example, I'm going to show you how to scrape the title of an AssociatedContent article.

  • First, we need a keyword. Let's assume the user enters one in a text box named txtKeyword.

We need to take the keyword and download the entire search results HTML. For this, we'll be using WebClient.

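Note that you'll need to add "using System.Net;" to the top of your source file for WebClient. The code below also needs "using System.Web;" for HttpUtility (with a reference to the System.Web assembly) and "using HtmlAgilityPack;" for the HTML Agility Pack types.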

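// Download the search results page for the keyword
WebClient wc = new WebClient();
string searchHtml = wc.DownloadString("http://www.associatedcontent.com/subject/article/" + HttpUtility.UrlEncode(txtKeyword.Text));

// Parse the downloaded HTML
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(searchHtml);

// Visit the title div of every search result
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='basic_padding white_bg']/div[@class='content_list_item']/div[@class='title']"))
{
    // The link sits between the first double quote and the "> that closes the opening tag
    int end = node.InnerHtml.IndexOf("\">");
    int start = node.InnerHtml.IndexOf("\"") + 1;
    string link = node.InnerHtml.Substring(start, end - start);
    string title = node.InnerText;
}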


This is the entire code. Let's break it down.

This downloads the search results page and loads the HTML into HTML Agility Pack:

string searchHtml = wc.DownloadString("http://www.associatedcontent.com/subject/article/" + HttpUtility.UrlEncode(txtKeyword.Text));
doc.LoadHtml(searchHtml);
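As a side note on the URL in the download step, here is a minimal sketch of building it in a small helper. BuildSearchUrl is a hypothetical name, not part of the original app. It swaps HttpUtility.UrlEncode for Uri.EscapeDataString, because UrlEncode turns spaces into "+" (a query-string convention) while the keyword here is placed in the URL path, where "%20" is the conventional encoding. Whether the site accepts one, the other, or both is an assumption you would need to check.

```csharp
using System;

public static class SearchUrlSketch
{
    // Base URL taken from the article's example.
    private const string SearchBase = "http://www.associatedcontent.com/subject/article/";

    // Hypothetical helper: builds the search URL for a user-entered keyword.
    public static string BuildSearchUrl(string keyword)
    {
        // Uri.EscapeDataString percent-encodes reserved characters,
        // so a space becomes %20 and '#' becomes %23.
        return SearchBase + Uri.EscapeDataString(keyword.Trim());
    }

    public static void Main()
    {
        // Prints: http://www.associatedcontent.com/subject/article/weight%20loss
        Console.WriteLine(BuildSearchUrl("weight loss"));
    }
}
```

In the app itself, the result would be passed straight to wc.DownloadString in place of the inline string concatenation.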


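This is the real magic. The string passed to SelectNodes is an XPath expression. The leading double slash tells XPath to look for the element anywhere in the document, and each single slash after that steps down to a direct child. The conditions in square brackets filter on attributes, for example [@class='title'] or [@id='main']. Note that @class='...' compares the whole attribute value, so it only matches when the class string is exactly the same.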

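XPath expressions can also pick an element by position. Positions are 1-based, not zero-based. Be careful, though: //div[1] selects every div that is the first div child of its parent. To select only the first div in the whole document, write (//div)[1].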
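To make the positional behaviour concrete, here is a small sketch that counts matches against a made-up HTML snippet (the markup and the CountMatches helper are invented for illustration). It assumes the HtmlAgilityPack package is referenced by the project.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathIndexSketch
{
    // Made-up markup: two top-level divs, the first one containing a nested div.
    public const string SampleHtml =
        "<html><body><div id='a'><div id='b'></div></div><div id='c'></div></body></html>";

    // Hypothetical helper: number of nodes an XPath expression matches in SampleHtml.
    public static int CountMatches(string xpath)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // "a" is the first div child of body and "b" is the first div child of "a".
        Console.WriteLine(CountMatches("//div[1]"));   // 2

        // The parentheses collect all divs first, then take the first one ("a").
        Console.WriteLine(CountMatches("(//div)[1]")); // 1
    }
}
```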

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='basic_padding white_bg']/div[@class='content_list_item']/div[@class='title']"))
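One thing to watch with this loop: SelectNodes returns null rather than an empty collection when nothing matches, so the foreach throws a NullReferenceException on a page with no results. A minimal sketch of a null-safe version follows. SelectTitles is a hypothetical helper name, the sample markup in Main is invented, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SafeSelectSketch
{
    // Hypothetical helper: returns the trimmed text of every node matching the XPath,
    // or an empty list when nothing matches.
    public static List<string> SelectTitles(string html, string xpath)
    {
        List<string> titles = new List<string>();

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return titles; // no matches: skip the loop instead of crashing
        }

        foreach (HtmlNode node in nodes)
        {
            titles.Add(node.InnerText.Trim());
        }
        return titles;
    }

    public static void Main()
    {
        string html = "<div class='title'><a href='/x'>Hello</a></div>";
        Console.WriteLine(SelectTitles(html, "//div[@class='title']").Count);   // 1
        Console.WriteLine(SelectTitles(html, "//div[@class='missing']").Count); // 0
    }
}
```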


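These next few lines extract the actual hyperlink from the result. They assume the title div's InnerHtml begins with an anchor tag whose first attribute is a double-quoted href. start is the position just after the first double quote, end is the position of the "> that closes the opening tag, and the substring between them is the URL. Hopefully this example helps.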

int end = node.InnerHtml.IndexOf("\">");
int start = node.InnerHtml.IndexOf("\"") + 1;
string link = node.InnerHtml.Substring(start, end - start);
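The string-index approach above breaks if the anchor carries another attribute before href or uses single quotes. As an alternative (not the original app's method), here is a sketch that asks HTML Agility Pack for the anchor and reads its href attribute directly. TryReadResult and the sample markup are made up for illustration, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class LinkExtractSketch
{
    // Made-up result markup; note the class attribute placed before href.
    public const string SampleHtml =
        "<div class='title'><a class='x' href='/article/1/tips.html'>Tips &amp; Tricks</a></div>";

    // Hypothetical helper: reads the link and title from a title div.
    // Returns false when the div contains no anchor with an href.
    public static bool TryReadResult(HtmlNode titleNode, out string link, out string title)
    {
        link = null;
        title = null;

        // ".//a[@href]" searches below titleNode only, not the whole document.
        HtmlNode anchor = titleNode.SelectSingleNode(".//a[@href]");
        if (anchor == null)
        {
            return false;
        }

        link = anchor.GetAttributeValue("href", "");
        // InnerText keeps entities such as &amp; as written; DeEntitize decodes them.
        title = HtmlEntity.DeEntitize(anchor.InnerText).Trim();
        return true;
    }

    public static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//div[@class='title']");

        string link, title;
        if (TryReadResult(titleNode, out link, out title))
        {
            Console.WriteLine(link + " | " + title); // /article/1/tips.html | Tips & Tricks
        }
    }
}
```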


XPath expressions can do boolean operations, string concatenation, and many other cool things. This only really scratches the surface of XPath and HTML Agility Pack. If you're developing more advanced applications, combining the WebBrowser control with HTML Agility Pack allows you to scrape while executing functions, which is often crucial for internet marketing & automation apps. If you were wondering, this code was included in a free software app called "Articles".
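As a small taste of those boolean and concatenation features, the sketch below matches divs whose class list contains the token "title" (even alongside other classes) and that also contain a link. The markup, the TitleWithLink expression, and the MatchTexts helper are invented for illustration, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathPredicateSketch
{
    // Made-up markup with four candidate divs.
    public const string SampleHtml =
        "<div class='title featured'><a href='/1'>One</a></div>" +
        "<div class='subtitle'><a href='/2'>Two</a></div>" +
        "<div class='title'>No link</div>" +
        "<div class='title'><a href='/3'>Three</a></div>";

    // concat() pads the class list with spaces so ' title ' matches a whole token
    // (and not 'subtitle'); "and a[@href]" additionally requires a child link.
    public const string TitleWithLink =
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' title ') and a[@href]]";

    // Hypothetical helper: text of every node in SampleHtml matching the XPath.
    public static List<string> MatchTexts(string xpath)
    {
        List<string> texts = new List<string>();
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return texts;

        foreach (HtmlNode node in nodes) texts.Add(node.InnerText.Trim());
        return texts;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", MatchTexts(TitleWithLink).ToArray()));                  // One, Three
        Console.WriteLine(string.Join(", ", MatchTexts("//a[@href='/1' or @href='/2']").ToArray())); // One, Two
    }
}
```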
