Wednesday, July 7, 2010

Easily extracting links from a snippet of html with HtmlAgilityPack

The HtmlAgilityPack is a powerful library that makes screen scraping in ASP.NET a breeze. This is the second post in a continuing series, and in it I demonstrate a way to extract all the links from a snippet of HTML.

A little background

If you haven't heard about HtmlAgilityPack yet then you have landed on the wrong post. Head over to my introduction to the subject and then come back and see me when you have read that.

How the sample application is going to work

The sample application is going to take a snippet of messy html stored in a text file. We are going to load it in, parse out all the <a href=""> tags and present these links in the browser by binding them to a GridView.

Let's take a look at the html snippet that we are going to load:

~/App_Data/HtmlSnippet.txt

<table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;"><tr><td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><a href="http://news.google.com/news/url?fd=R&amp;sa=T&amp;url=http%3A%2F%2Fwww.timesonline.co.uk%2Ftol%2Fnews%2Fworld%2Fus_and_americas%2Farticle6802128.ece&amp;usg=AFQjCNGnZL4BdTSWSglpAZdprg3u_tJVhg"><img src="http://nt2.ggpht.com/news/tbn/XrArEKXhTe6dLM/6.jpg" alt="" border="1" width="80" height="80" />
<font size="-2">Times Online</font></a></font></td><td valign="top"><font style="font-size:85%;font-family:arial,sans-serif">
<div style="padding-top:0.8em;"><img alt="" height="1" width="1" /></div><div><a href="http://news.google.com/news/url?fd=R&amp;sa=T&amp;url=http%3A%2F%2Fwww.latimes.com%2Fnews%2Fnationworld%2Fnation%2Fla-na-health-coop20-2009aug20%2C0%2C4258832.story&amp;usg=AFQjCNG4LI_9w3yHg7H8ZqUBaKNwzpgiuA"><b>Healthcare co-ops emerging as viable alternative</b></a> <!-- snip -->

Well that's not all of it but I think you get the point. Like I said, this snippet originally came from a forum question. The html itself came from what looks like a Google News feed. I have kept it for this article because it shows that the HtmlAgilityPack can handle messy code, and also that it's not going to be tripped up by the extra URLs which are URL-encoded into it.

The normal approach of using a regular expression to extract this kind of information could be tricked by this kind of code. I am not a big fan of using regular expressions for extracting this kind of information because they are too brittle (but I am a big fan of regular expressions).

The main structure of the program

Here is the code for the Page_Load method for your perusal. It should give you an idea of the main steps this program takes to complete its tasks:

protected void Page_Load(object sender, EventArgs e)
{
    // load snippet
    HtmlDocument htmlSnippet = new HtmlDocument();
    htmlSnippet = LoadHtmlSnippetFromFile();

    // extract hrefs
    List<string> hrefTags = new List<string>();
    hrefTags = ExtractAllAHrefTags(htmlSnippet);

    // bind to gridview
    GridViewHrefs.DataSource = hrefTags;
    GridViewHrefs.DataBind();
}

So as you can see it takes three main steps - loading the snippet of html into the system, parsing it and a final cosmetic stage of binding it to a GridView.

In the first line you see that we create an instance of the HtmlDocument class. This is a class which comes with the HtmlAgilityPack library. It is the primary class you use to hold a complete html document.

This brings us nicely to LoadHtmlSnippetFromFile();

Loading the html snippet from file

The second method we are going to look at is LoadHtmlSnippetFromFile();

It is a pretty simple method which loads in the full version of that horribly messy html snippet I showed you earlier. However this is not to say that it doesn't do anything educational. Let's take a look:

/// <summary>
/// Load the html snippet from the txt file
/// </summary>
private HtmlDocument LoadHtmlSnippetFromFile()
{
    TextReader reader = File.OpenText(Server.MapPath("~/App_Data/HtmlSnippet.txt"));

    HtmlDocument doc = new HtmlDocument();
    doc.Load(reader);

    reader.Close();

    return doc;
}

So as you can see, I have used one of the many stream reader classes to effortlessly load the html snippet txt file into memory.

Turning this stream of html text into a queryable document is the task of the Load() method on the HtmlDocument. If you poke around with intellisense on that method you will find that it has 10 overloads which will let you use paths, streams or TextReaders with various encoding options.

If you already have the contents of an html document contained within a string (such as from a web service) then you can use LoadHtml().
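As a rough sketch of the string route, assuming the HtmlAgilityPack package is referenced: the class name StringSnippetLoader, the helper LoadFromString and the sample markup are made up for illustration and are not part of the sample application.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack library is referenced

public static class StringSnippetLoader
{
    // Hypothetical helper: parse markup that is already held in memory.
    public static HtmlDocument LoadFromString(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses the string directly, no file or stream needed
        return doc;
    }

    public static void Main()
    {
        // Stand-in for text that came back from a web service call.
        string html = "<div><a href=\"http://example.com/\">Example</a></div>";
        HtmlDocument doc = LoadFromString(html);
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

The resulting HtmlDocument can be queried in exactly the same way as one loaded from a file.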

There is a third option which the HtmlAgilityPack supports and that is retrieving the page over the internet via a url. This is demonstrated in the next article which explains how you can test if a web page contains an rss or atom feed, but for now it will remain a tantalising mystery.

I shouldn't have to say it but don't forget to .Close() your stream after you have populated your HtmlDocument.
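One way to make forgetting impossible is a using block, which disposes (and so closes) the reader even if parsing throws. This is only a sketch of an alternative, not the code from the sample application; the class name SnippetFileLoader and the path parameter are assumptions made for illustration.

```csharp
using System.IO;
using HtmlAgilityPack; // assumes the HtmlAgilityPack library is referenced

public static class SnippetFileLoader
{
    // Hypothetical helper: the caller supplies a physical path, for example
    // the result of Server.MapPath("~/App_Data/HtmlSnippet.txt") in a page.
    public static HtmlDocument LoadFromFile(string path)
    {
        var doc = new HtmlDocument();

        // The reader is disposed when the block exits, with or without an exception.
        using (TextReader reader = File.OpenText(path))
        {
            doc.Load(reader);
        }

        return doc;
    }
}
```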

Extract all href tags from the document

This is the section we have all been waiting for - it is the part where the html parsing magic is done. And while this is a deceptively simple method it actually shows many of the key building blocks you will use in your screen scraping endeavours.

/// <summary>
/// Extract all anchor tags using HtmlAgilityPack
/// </summary>
/// <param name="htmlSnippet">The parsed html document to search</param>
/// <returns>The href value of every anchor that has one</returns>
private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();

    // SelectNodes returns null (not an empty collection) when nothing matches
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
    {
        return hrefTags;
    }

    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}

Looking at the code several things become clear:

  • The HtmlDocument class contains a collection of HtmlNodes
  • These HtmlNodes can be selected with an XPath query
  • The HtmlNodes can then be interrogated attribute by attribute with the HtmlAttribute class.

Using these three elements I have extracted a list of all the hrefs in the html snippet.

The use of XPath for extracting information out of html documents is key to the power of HtmlAgilityPack. If you don't know what XPath is then it's a technology that goes alongside xml and provides a query language for extracting xml nodes out of xml documents.

XML documents must be well-formed to be queried and most html out on the web is far from well-formed. It is littered with unclosed tags, capitalisations and syntax errors. A normal implementation of XPath can't be used to query html unless you clean it up first. The great thing about this being in HtmlAgilityPack is that the implementation will do its best to extract the information regardless of the validity of the document.

This gives us a very expressive way to describe the information we want to extract. In fact for many projects you will find that the hardest part is figuring out the correct XPath query to describe exactly what you want.

Our query //a[@href] means select all a tags (html anchor tags) that have a href attribute (so don't select the anchor if it's just a named anchor).

The resulting collection is then iterated over in the foreach loop where I read the href attribute from each a tag and put it into my final collection for databinding.
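To see the query's behaviour in isolation, here is a small self-contained sketch that runs it against an inline string rather than the snippet file. The class name HrefQueryDemo, the method ExtractHrefs and the sample markup are invented for illustration. It also guards against SelectNodes returning null, which is what it does when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack library is referenced

public static class HrefQueryDemo
{
    // Hypothetical helper: returns the href of every anchor that has one.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return hrefs; // no matching anchors in this document
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue falls back to the default if the attribute is missing.
            hrefs.Add(link.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    public static void Main()
    {
        // One named anchor (no href) and one real link inside an unclosed <p>.
        string html = "<p><a name=\"top\">Top</a> <a href=\"/one\">One</a>";
        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Only the second anchor should be reported, because the named anchor has no href attribute for the predicate to match.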

We have covered a lot of ground in very little code which I hope further impresses on you the power of this library.

Tune in next time to find out how we will build a query engine that can detect if a webpage has an rss or atom feed associated with it!

Download the sample application

The sample application contains everything we discussed in this article including the HtmlAgilityPack, the code and the html snippet file.


Monday, April 12, 2010

Hard Reset SB5101 / Debrick - SurfboardHacker Forums

This does NOT work with the TCNiso USBblackcat, which does not have these capabilities. Don't pollute this thread with the BlackcatUSB.

This should ONLY be attempted IF:
1) You know your Jtag works (i.e. a parajtag has worked on other modems and you're not using the blackcat software).
2) You cannot detect the flash of your modem using the SBHUsbJtag or a Parajtag running Tom's Jtag Util.
3) When you power up the modem all the lights come on and then immediately go out.

Solder some thin wire to the RH leg of the 3.3v power source, then run the other end QUICKLY over the 5 inside Jtag pins. Then connect your Jtag and enter these commands:

Method 1.
poke fffe230c 3
poke fffe2304 0
poke fffe2300 a
poke fffe2300 9
poke fffe2300 9
poke fffe2300 9
poke fffe2300 9
poke fffe2300 9
poke fffe2300 9
poke fffe2300 9
poke fffe2300 9
poke fffe2300 1c
detect (MAY HAVE TO RE-DO the POKE COMMANDS A FEW TIMES IF DETECT DOES NOT WORK)
ldram 9fc00000 (Select the 2mb dump file)
program 9fc00000 200000


Method 2
detect
ldram boot (Select boot file)
erase boot
sprogram boot
(Power off the modem)
detect
ldram 9fc00000 (Select the 2mb dump file)
program 9fc00000 200000

Textbook Flash chip reset:
Another method is shown in the pic below. 3.3v is applied to pin 12 of the flash TSOP to reset it. I would recommend soldering a fine wire to the #12 lead first, then applying power to it for about a second. You don't want to accidentally touch the pins on either side. The #27 pin is already attached to a common ground on the PCB.


TUTORIAL 5101i

Hello everyone. I'm a bit short of time, so I'll go through the steps quickly, but they will let you modify the SB5101. I'm including a link with the necessary .bin files and ambit250. Blackcat is not included because there are already many links to it.

http://rapidshare.com/files/154132565/hack_5101.rar.html


Tools and programs.
Ambit.250.
fullflash-250hack_dump_telnet.bin
Firmware-sb5101.mod.10.2.by.fercsa.bin
blackcat-v.128-programer

1-WRITING THE FULL FLASH
First check that your connection is correct. Open blackcat.v128 and look in the console tab.
If it shows a bcm3348, everything is OK.
In blackcat.v128, go to the flash tab, choose WRITE ALL and select 250hack_dump_telnet.bin.
This takes a long time, at least 2 hours.

2-FIRMWARE UPDATE
In the same flash tab, choose WRITE and select sb5101.mod.10.2.by.fercsa.bin.
In this case the write starts at 0x10000.

3-ACCESS TO THE INTERFACE OF TELNET
This serves to check whether your modem modification worked.
Enter through http://192.168.100.1
Remember to respect the capital letters.
login: Infinite password: SetValue

4-MODEM CONFIGURATION AND CHANGES BY TELNET
A-To activate the BIS.
Run cmd.exe
Type telnet and press Enter
Type o 192.168.100.1 and press Enter
login: admin password: infinite
Once inside, press Enter twice
Type cd non-vol and press Enter
Type cd docsis and press Enter
Type enable bpi true and press Enter
Type bpi_version 0 and press Enter
Type write and press Enter

5-CHANGE OF MAC
can be done by telnet and ambit250.
I set some options.
For telnet:
Stop scanning frequency:
cd /cm_hal
scan_stop
Set custom config:
cd /non-vol/docsis
force_cfgfile enable true
dhcp_settings
[Just press enter]
[Just press enter]
[Just press enter]
yes
10.10.10.10 (this is your tftp server address - a local tftp server is not allowed, this is a bug ..)
config.cfg (config name)
[Just press enter]
[Just press enter]
write
Change MAC address to 00:11:22:33:44:55
cd
2155776876 write_memory 0x00
2155776877 write_memory 0x11
2155776878 write_memory 0x22
2155776879 write_memory 0x33
2155776880 write_memory 0x44
2155776881 write_memory 0x55
cd /non-vol
write

DOWN BY SOME THAT AMBIT250.HAY necessary patches.
The only odd thing I noticed is that the lights do not work, but everything else works well. To check that the modem is functioning you have to go to the telnet interface screen.

Motorola SBG9000 with USBJTAG

Salu2. Hello to everyone: little brother who has integrated the router SBG900 is the echo I do not have as many loses kiss when you edit the router but still have a tutorial here: Programming Motorola SBG9000 with USBJTAG NT.
NOTE: This document is made for educational purposes; any other use is the sole responsibility of the user.
The cable modem must be disconnected from power when you open it and also when you connect the USBJTAG.

1.- Turn everything off (power and PC)
2.- Open the SBG900
3.- Connect the USBJTAG NT to the SBG900
4.- Connect the USBJTAG NT to the PC
5.- Connect the SBG900 to power
6.- Program

- Download the Firm_sgb900_BPI_OK_Consola flash. It has telnet enabled, so the MAC can be changed with the commands.
- Upload the full flash with the USBJTAG cable (first of all, select the SBG900 model in your USBJTAG program under "Tools" - "Settings")
detect
ldram 9fc00000
program 9fc00000 400000
cmpram 9fc00000 400000

First, how to open the SBG900 modem:


1.- It has a screw at the upper back. 2.- Peel off the label.


3.- Release the tabs, being careful not to break them.


Open the unit carefully to avoid breaking the antenna.


Once it is open, find where to connect the USBJTAG NT.


Identify pin 1 and connect the USBJTAG; red is pin 1.
The USBJTAG must not be connected to anything else when you connect it to the SBG900 (the USBJTAG should not be connected to the computer).
The SBG900 should not be connected to power when you connect the USBJTAG.


Once the unit is connected, proceed to the software.

Connect the USBJTAG to the PC (by this point it must already be connected to the SBG900).
Then connect the modem to power.
Open the usbjtag program.


Open the program and check that the USBJTAG is detected as connected, then configure it for the SBG900, which is the model we will use.


We select the model.


In the box below we type detect and press Enter (my details are blurred to avoid misunderstandings).