Wednesday, July 7, 2010

HTML Agility Pack Rocks Your Screen Scraping World

I had to do some deep data extraction from a web page today, and naturally leaned on PowerShell for some assistance. PowerShell is a great language for text munging, and web content is no different. There are tons of examples online, but here's an example from earlier in this blog: http://www.leeholmes.com/blog/PowerShellTheOracleInstantAnswersFromYourPrompt.aspx.

As I looked at the underlying HTML of this page, though, my heart sank. I cared about four pieces of data, and they were arranged without much structure on the web page. The information I cared about was in a couple of different tables, a couple of different table rows, and sometimes in different columns. You can parse your way around this, but it's simply error-prone and annoying.

At that point, I remembered something called the HTML Agility Pack that I've been meaning to experiment with for some time. The HTML Agility Pack lets you navigate an HTML document as though it were well-formed XML, even though the underlying HTML usually isn't. It doesn't leverage PowerShell's XML adapter, but the .NET objects act just like the XML classes from the .NET Framework.

On the downside, data navigation and selection in XML comes via the XPath language. Like regular expressions, XPath queries are an esoteric art and difficult to get right. Luckily, you don't need much knowledge of XPath for simple XML navigation.
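To give a feel for how little XPath is needed, here is a minimal sketch that uses PowerShell's built-in [xml] type rather than the HTML Agility Pack, so it runs without any download. The markup is a made-up, well-formed stand-in for the kind of table described above; only the table-gen class name comes from the real page.

```powershell
# Hypothetical, well-formed stand-in for the page's tables.
$xml = [xml] @'
<html>
  <body>
    <table class="table-gen">
      <tr><td>07/07/2010 10:15 PM</td></tr>
      <tr><td>Description: Incoming</td></tr>
    </table>
    <table class="other">
      <tr><td>ignored</td></tr>
    </table>
  </body>
</html>
'@

# '//' searches the whole document; [@class='...'] filters on an attribute.
$tables = $xml.SelectNodes("//table[@class='table-gen']")

# A path without a leading '/' is relative to the node it is called on.
$first = $tables[0].SelectSingleNode("tr[1]/td").InnerText

# A leading '/' restarts at the document root, so this finds nothing here.
$missing = $tables[0].SelectSingleNode("/tr[1]/td")

$first
```

Those three pieces (a document-wide search, an attribute filter, and a relative path with a 1-based index) are all the XPath the rest of this post relies on.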

This whole experience gives a great example of the "admin development model." 15 minutes after thinking about parsing the web page with the HTML Agility Pack, I had a working version. PowerShell's Get-Member cmdlet was all I used for discovery - no documentation was harmed in the making of this script. Here is the literal text of my history buffer, experimentation and all. On lines 251 and 252, I put the history into the ISE so that I can hack out the experimentation bits and keep the stuff that worked.

221 cd C:\temp\HtmlAgilityPack.1.4.0.beta2.binaries
222 dir
223 add-type -Path .\HtmlAgilityPack.dll
224 $types = add-type -Path .\HtmlAgilityPack.dll -PassThru
225 $types
226 $types | ? { $_.IsPublic }
227 $doc = new-object HtmlWeb
228 ($types | ? { $_.IsPublic })[1]
229 ($types | ? { $_.IsPublic })[1].FullName
230 $doc = New-Object HtmlAgilityPack.HtmlDocument
231 $doc
232 $doc | gm
233 $result = $doc.Load("C:\temp\texts.html")
234 $result
235 $doc
236 $doc | gm
237 $doc.DocumentNode
238 $doc.DocumentNode | gm
239 $doc.DocumentNode.SelectNodes("//h1")
240 $doc.DocumentNode.SelectNodes("//table[@class='table-gen']")
241 $doc.DocumentNode.SelectNodes("//table[@class='table-gen']/tr[2]")
242 $doc.DocumentNode.SelectNodes("//table[@class='table-gen']")
243 $texts = $doc.DocumentNode.SelectNodes("//table[@class='table-gen']")
244 $texts[0]
245 $testText = $texts[0]
246 $testText | clip
247 $testText.SelectSingleNode("/tr[1]/td")
248 $testText.SelectSingleNode("tr[1]/td")
249 $testText.SelectSingleNode("tr[1]/td").InnerTExt
250 $testText.SelectSingleNode("tr[1]/td").InnerText.Trim()
251 ise
252 h
253 $time = [DateTime] $testText.SelectSingleNode("tr[1]/td").InnerText.Trim()
254 $testText.SelectSingleNode("tr[2]/td").InnerText.Trim()
255 $testText.SelectSingleNode("tr[2]/td").InnerText.Replace('Description:','').Trim()
256 $testText.SelectSingleNode("tr[6]/td").InnerText
257 $testText.SelectSingleNode("tr[5]/td").InnerText
258 $testText.SelectSingleNode("tr[4]/td").InnerText
259 $testText.SelectSingleNode("tr[5]/td").InnerText
260 $testText.SelectSingleNode("tr[5]/td[1]")
261 $testText.SelectSingleNode("tr[5]/td[2]")
262 $time = $testText.SelectSingleNode("tr[1]/td").InnerText.Trim()
263 $inOut = $testText.SelectSingleNode("tr[2]/td").InnerText.Replace('Description:',...
264 $to = $testText.SelectSingleNode("tr[5]/td").InnerText.Replace('Number Called:','...
265 $from = $testText.SelectSingleNode("tr[5]/td[2]").InnerText.Replace('Calling Numb...
266 New-Object PsObject -Property @{ Time = $time; Type = $inOut; From = $from; To = ...
267 New-Object PsObject -Property @{ Time = $time; Type = $inOut; From = $from; To = ...
268 $texts | % {...
269 C:\temp\textparser.ps1
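The discovery loop in that history (create an object, pipe it to gm, try a method) works on any .NET type. As a rough sketch, here it is against System.Xml.XmlDocument, which ships with .NET and is used purely as a stand-in for the HtmlAgilityPack.HtmlDocument object above.

```powershell
# Stand-in object: XmlDocument is built in, so nothing needs to be downloaded.
$doc = New-Object System.Xml.XmlDocument

# Confirm what type we are holding, as with ...FullName in the history.
$doc.GetType().FullName

# Ask the object what it can do; a wildcard narrows the list to the
# selection methods. -InputObject inspects the object itself.
$selectors = Get-Member -InputObject $doc -MemberType Method -Name Select*
$selectors | Select-Object -ExpandProperty Name
```

Narrowing Get-Member with -MemberType and -Name keeps the output short enough to scan, which is most of what makes this style of exploration quick.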

The final script:

# Load the HTML Agility Pack assembly
cd C:\temp\HtmlAgilityPack.1.4.0.beta2.binaries
add-type -Path .\HtmlAgilityPack.dll

# Parse the saved page and select every table that holds a record
$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load("C:\temp\texts.html")
$texts = $doc.DocumentNode.SelectNodes("//table[@class='table-gen']")

$result = $texts | % {
    $testText = $_

    # Row 1 holds the timestamp. TrimEnd treats " CST" as a set of
    # characters to strip, not as a literal suffix.
    $time = $testText.SelectSingleNode("tr[1]/td").InnerText.Trim()
    $time = $time.TrimEnd(" CST")
    $time = ([DateTime] $time).AddHours(-2)

    # Remove the label text from each cell, keeping only the value
    $inOut = $testText.SelectSingleNode("tr[2]/td").InnerText.Replace('Description:', '').Trim()
    $to = $testText.SelectSingleNode("tr[5]/td").InnerText.Replace('Number Called:', '').Trim()
    $from = $testText.SelectSingleNode("tr[5]/td[2]").InnerText.Replace('Calling Number:', '').Trim()

    # Emit one object per table, with the properties in a fixed order
    New-Object PsObject -Property @{ Time = $time; Type = $inOut; From = $from; To = $to } |
        Select From, To, Type, Time
}

$result | Sort Time | ft -auto | out-string -width 75
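One detail of the time handling is worth a closer look: TrimEnd(" CST") strips any trailing run of the characters space, C, S, and T, rather than the literal suffix " CST". That is harmless for a value ending in "PM CST", but a regex anchored to the end of the string is stricter. The sample timestamp below is invented for illustration; the real page's format is not shown in this post.

```powershell
# Hypothetical timestamp in the shape the script appears to expect.
$raw = '07/07/2010 10:15 PM CST'

# Character-set trim: stops at 'M', the first character not in the set.
$trimmed = $raw.TrimEnd(' CST')

# Suffix-only removal: '$' anchors the match to the end of the string.
$anchored = $raw -replace ' CST$', ''

# Both leave the same text here, so the cast and offset behave identically.
$time = ([DateTime] $anchored).AddHours(-2)

$trimmed
$anchored
$time.Hour
```

The two approaches only diverge when the text before the suffix itself ends in one of those characters, so this is a robustness tweak rather than a bug fix.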

All in all, the HTML Agility Pack is a very attractive approach that I plan to start using more often.
