Wednesday, July 7, 2010

HTML Parser

Hello guys!

At school we were given a homework assignment to read some information from a web page and display it. While searching the net for ways to approach this, I came across HTML Agility Pack and decided to use it. But I have some problems with parsing the content.

Here is the part of the HTML that I have to get values from (marked in red).

<h3>marec - maj 2009</h3>
<div class="graf_table">
        <table summary="layout table">
                <tr>
                        <th>DATUM</th>
                        <td class="datum">10.03.2009</td>
                        <td class="datum">24.03.2009</td>
                        <td class="datum">07.04.2009</td>
                        <td class="datum">21.04.2009</td>
                        <td class="datum">05.05.2009</td>
                        <td class="datum">06.05.2009</td>
                </tr>
                <tr>
                        <th>Maloprodajna cena [EUR/L]</th>
                        <td>0,96000</td>
                        <td>0,97000</td>
                        <td>0,99600</td>
                        <td>1,00800</td>
                        <td>1,00800</td>
                        <td>1,01000</td>
                </tr>
                <tr>
                        <th>Maloprodajna cena [SIT/L]</th>
                        <td>230,054</td>
                        <td>232,451</td>
                        <td>238,681</td>
                        <td>241,557</td>
                        <td>241,557</td>
                        <td>242,036</td>
                </tr>
                <tr>
                        <th>Prodajna cena brez dajatev</th>
                        <td>0,33795</td>
                        <td>0,34628</td>
                        <td>0,36795</td>
                        <td>0,37795</td>
                        <td>0,37795</td>
                        <td>0,37962</td>
                </tr>
                <tr>
                        <th>Trošarina</th>
                        <td>0,46205</td>
                        <td>0,46205</td>
                        <td>0,46205</td>
                        <td>0,46205</td>
                        <td>0,46205</td>
                        <td>0,46205</td>
                </tr>
                <tr>
                        <th>DDV</th>
                        <td>0,16000</td>
                        <td>0,16167</td>
                        <td>0,16600</td>
                        <td>0,16800</td>
                        <td>0,16800</td>
                        <td>0,16833</td>
                </tr>
        </table>
</div>

So far I have managed to write this, which gives me all the values from the table.
So my question is: what should I add or change in the query so that it returns only the values from cells in the rows whose table header is DATUM or Maloprodajna cena [EUR/L]?

    using System.Linq;
    using System.Text;
    using HtmlAgilityPack;

    // Set the parsing options before Load; they are read while the document is parsed.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.OptionCheckSyntax = true;
    doc.OptionFixNestedTags = true;
    doc.OptionAutoCloseOnEnd = true;
    doc.OptionOutputAsXml = true;
    doc.OptionDefaultStreamEncoding = Encoding.Default;
    doc.Load(@"C:\Users\User\Desktop\petrol.celota.htm");

    // ".//table" is relative to the current div; a bare "//table" would
    // search the whole document again for every matching div.
    var query = from html in doc.DocumentNode.SelectNodes("//div[@class='graf_table']").Cast<HtmlNode>()
                from table in html.SelectNodes(".//table").Cast<HtmlNode>()
                from row in table.SelectNodes("tr").Cast<HtmlNode>()
                from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
                select new { Table = table.Id, CellText = cell.InnerHtml };
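
One possible way to narrow the query, offered only as a sketch: keep a row only when the text of its th matches one of the wanted headers, then read just the td cells of that row. The inline HTML string below is a hypothetical, trimmed copy of the table above so the example runs on its own, and the names wanted, Header and Value are made up for illustration. It assumes the header text matches exactly after trimming whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample: a shortened copy of the table from the post.
        const string html = @"
<div class=""graf_table"">
  <table summary=""layout table"">
    <tr><th>DATUM</th><td class=""datum"">10.03.2009</td><td class=""datum"">24.03.2009</td></tr>
    <tr><th>Maloprodajna cena [EUR/L]</th><td>0,96000</td><td>0,97000</td></tr>
    <tr><th>Maloprodajna cena [SIT/L]</th><td>230,054</td><td>232,451</td></tr>
  </table>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Headers of the rows we want to keep (assumed to match exactly).
        var wanted = new HashSet<string> { "DATUM", "Maloprodajna cena [EUR/L]" };

        // SelectNodes returns null when nothing matches, so check before querying.
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//div[@class='graf_table']//table//tr");
        if (rows == null)
        {
            Console.WriteLine("No rows found.");
            return;
        }

        var query = from row in rows.Cast<HtmlNode>()
                    let header = row.SelectSingleNode("th")
                    where header != null && wanted.Contains(header.InnerText.Trim())
                    from cell in row.Elements("td")   // data cells only, not the th
                    select new
                    {
                        Header = header.InnerText.Trim(),
                        Value = cell.InnerText.Trim()
                    };

        foreach (var item in query)
            Console.WriteLine(item.Header + ": " + item.Value);
    }
}
```

With the real file, doc.Load with the path from the original code would replace the LoadHtml call. The [SIT/L] row is skipped because its header is not in the wanted set.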
