Wednesday, July 7, 2010

HTML Parser Libraries for .NET

HTMLParser.Net

HTMLParser.Net is a .NET library built on the codebase of the popular Java-based HTMLParser available on sourceforge.net. If you are building applications that involve screen scraping of HTML pages or data extraction from web sites, then you definitely want a tool like HTMLParser.Net in your arsenal. Parsing a page is as simple as writing 4 lines of code, and you are on your way home. And if you want to be a little more creative with your parsing and querying of results, the API offers more advanced features that are easy to use.

HTML Agility Pack

HTML Agility Pack is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it, don't worry). It is a .NET code library that lets you parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams). HTML Agility Pack now also supports LINQ to Objects (via a LINQ to XML-like interface).
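As a rough illustration of the DOM-plus-XPath style described above, here is a minimal sketch that lists the links in a deliberately malformed HTML string. It assumes the HtmlAgilityPack assembly is referenced by the project; the class name LinkLister and the sample markup are made up for this example.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack assembly is referenced

class LinkLister // hypothetical name, for illustration only
{
    static void Main()
    {
        // Deliberately sloppy markup: unclosed <p> and <b> tags.
        string html =
            "<html><body><p>First <a href='/one'>one</a>" +
            "<p>Second <b><a href='/two'>two</a></body></html>";

        // Build the read/write DOM from the string.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Query with plain XPath: every <a> element that has an href attribute.
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            Console.WriteLine("No links found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue falls back to the second argument if the attribute is missing.
            string href = link.GetAttributeValue("href", "");
            Console.WriteLine("{0} -> {1}", link.InnerText, href);
        }
    }
}
```

The null check matters in practice: looping over the result of SelectNodes without it is a common source of NullReferenceException when a page has no matching nodes.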

SgmlReader

SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided, which outputs the well-formed XML result.
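Because SgmlReader is an XmlReader, it can be handed to the usual System.Xml classes. The sketch below shows one way that might look, assuming the SgmlReader assembly (namespace Sgml) is referenced; the class name SgmlDemo and the sample markup are made up for this example.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml; // assumption: the SgmlReader assembly is referenced

class SgmlDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        // HTML that is not well-formed XML: <br> and <p> are never closed.
        string html = "<html><body><p>Hello<br>world</body></html>";

        var sgml = new SgmlReader();
        sgml.DocType = "HTML";                      // use the built-in HTML DTD
        sgml.InputStream = new StringReader(html);  // InputStream takes a TextReader

        // SgmlReader derives from XmlReader, so XmlDocument can load from it directly.
        var doc = new XmlDocument();
        doc.Load(sgml);

        // Print the well-formed XML that the reader produced.
        Console.WriteLine(doc.OuterXml);
    }
}
```

Once the document is loaded, everything else is ordinary System.Xml: SelectNodes, XSLT transforms, and so on work as they would on any XML input.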

HTML Tidy with .NET Wrappers

There are many .NET wrappers for HTML Tidy:

  1. TidyForNet
  2. Tidy.NET
  3. EfTidy
  4. zetaHtmlTidy
  5. TidyATL
  6. Mark.Tidy.dll

Others

  1. Chilkat HTML Parser
  2. ANTLR with a grammar for HTML
  3. GOLD Parser with an HTML grammar
  4. HTML Parser (Java version) + IKVM.NET
