
Eoin Butler
10 Jul 2024
Scraping HTML with C# and the HtmlAgilityPack
Introduction
HtmlAgilityPack is a powerful .NET library for parsing and manipulating HTML documents. It simplifies tasks such as web scraping, HTML document modification, and validation. This guide will walk you through the basics of using HtmlAgilityPack with C#, including installation, basic usage, and some practical examples.
Installation
The easiest way to install it is through NuGet, the package manager for .NET. You can install HtmlAgilityPack using the NuGet Package Manager Console, the .NET CLI, or the Package Manager GUI in Visual Studio.
Using Package Manager Console:
Install-Package HtmlAgilityPack
Using .NET CLI:
dotnet add package HtmlAgilityPack
Basics
Once installed, you can start using HtmlAgilityPack to load, parse, and manipulate HTML documents.
Loading an HTML Document
First, include the necessary namespace:
using HtmlAgilityPack;
Then, you can load an HTML document from a URL, file, or string:
var url = "https://example.com";
var web = new HtmlWeb();
HtmlDocument document = web.Load(url);
For loading from a file or string:
// From a file
HtmlDocument fileDocument = new HtmlDocument();
fileDocument.Load("path/to/file.html");
// From a string (a separate variable name, so both snippets compile in the same scope)
HtmlDocument stringDocument = new HtmlDocument();
stringDocument.LoadHtml("<html><body><p>Hello, World!</p></body></html>");
Parsing HTML Content
Once the document is loaded, you can navigate and parse the HTML content using XPath or LINQ.
// Using XPath
HtmlNode node = document.DocumentNode.SelectSingleNode("//p");
Console.WriteLine(node.InnerText);
// Using LINQ
var nodes = document.DocumentNode.Descendants("p");
foreach (var n in nodes)
{
    Console.WriteLine(n.InnerText);
}
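The LINQ approach becomes more useful once you filter on attributes. The following is a minimal sketch, assuming a hypothetical inline HTML string and a hypothetical class name "intro"; neither comes from a real page.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical sample markup, used only for illustration
var html = "<html><body><p class=\"intro\">First</p><p>Second</p><p class=\"intro\">Third</p></body></html>";
var document = new HtmlDocument();
document.LoadHtml(html);

// Keep only <p> elements whose class attribute equals "intro"
var introTexts = document.DocumentNode
    .Descendants("p")
    .Where(p => p.GetAttributeValue("class", string.Empty) == "intro")
    .Select(p => p.InnerText.Trim())
    .ToList();

foreach (var text in introTexts)
{
    Console.WriteLine(text);
}
```

Because Descendants returns an ordinary IEnumerable of nodes, the usual LINQ operators (Where, Select, ToList) can be chained without any XPath.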
Example 1: Extracting Links
This example demonstrates how to extract all links (<a> tags) from a webpage.
var url = "https://example.com";
var web = new HtmlWeb();
HtmlDocument document = web.Load(url);
var links = document.DocumentNode.SelectNodes("//a[@href]");
// SelectNodes returns null, not an empty collection, when nothing matches
if (links != null)
{
    foreach (var link in links)
    {
        string href = link.GetAttributeValue("href", string.Empty);
        Console.WriteLine(href);
    }
}
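Extracted href values are often relative, so a scraper usually needs to resolve them against the page address. The sketch below shows one way to do that with the standard System.Uri class. The base address and the inline markup are hypothetical and stand in for a page you have already downloaded.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical base address and markup, used only for illustration
var baseUri = new Uri("https://example.com/docs/");
var html = "<a href=\"page.html\">Relative</a><a href=\"https://example.org/\">Absolute</a><a>No href</a>";
var document = new HtmlDocument();
document.LoadHtml(html);

var resolved = new List<string>();
var links = document.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (var link in links)
    {
        var href = link.GetAttributeValue("href", string.Empty);
        // Uri.TryCreate resolves relative references against the base
        // and leaves already-absolute URLs unchanged
        if (Uri.TryCreate(baseUri, href, out var absolute))
        {
            resolved.Add(absolute.AbsoluteUri);
        }
    }
}

foreach (var address in resolved)
{
    Console.WriteLine(address);
}
```

Using TryCreate rather than the Uri constructor means a malformed href is skipped instead of throwing an exception in the middle of a scrape.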
Example 2: Scraping Table Data
Here’s how to extract data from the first HTML table on a page (the code assumes the page contains at least one table):
var url = "https://example.com";
var web = new HtmlWeb();
HtmlDocument document = web.Load(url);
var table = document.DocumentNode.SelectSingleNode("//table");
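// ".//tr" also matches rows nested inside <thead> or <tbody>
foreach (var row in table.SelectNodes(".//tr"))
{
    // Elements never returns null, even for a row with no <td> cells
    foreach (var cell in row.Elements("td"))
    {
        Console.Write(cell.InnerText + "\t");
    }
    Console.WriteLine(); // end the current row
}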
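In practice you will often want the table in a data structure rather than printed to the console. The following sketch collects each row into a string array, including header cells. The sample table is hypothetical, and HtmlEntity.DeEntitize is used on the assumption that you want entities such as &amp;amp; decoded in the cell text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical sample table, used only for illustration
var html = "<table><tr><th>Name</th><th>Age</th></tr>"
         + "<tr><td>Ada</td><td>36</td></tr>"
         + "<tr><td>Alan</td><td>41</td></tr></table>";
var document = new HtmlDocument();
document.LoadHtml(html);

var rows = new List<string[]>();
var rowNodes = document.DocumentNode.SelectNodes("//table//tr");
if (rowNodes != null)
{
    foreach (var row in rowNodes)
    {
        // Take header (<th>) and data (<td>) cells in document order
        var cells = row.ChildNodes
            .Where(n => n.Name == "th" || n.Name == "td")
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .ToArray();
        if (cells.Length > 0)
        {
            rows.Add(cells);
        }
    }
}

foreach (var cells in rows)
{
    Console.WriteLine(string.Join(",", cells));
}
```

From here the rows can be written to a CSV file or mapped onto your own types.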
Example 3: Modifying HTML
This example shows how to modify an HTML document by adding a new element:
var html = "<html><body><p>Hello, World!</p></body></html>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
HtmlNode newElement = HtmlNode.CreateNode("<div><p>New Paragraph</p></div>");
body.AppendChild(newElement);
Console.WriteLine(document.DocumentNode.OuterHtml);
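Modification is not limited to appending nodes. As a rough sketch, the snippet below sets an attribute on an existing element and removes another element. The markup, the id value "old", and the target attribute are all hypothetical choices made for the illustration.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup, used only for illustration
var html = "<html><body><p id=\"old\">Remove me</p><a href=\"/about\">About</a></body></html>";
var document = new HtmlDocument();
document.LoadHtml(html);

// Add (or overwrite) an attribute on the first link, if there is one
var link = document.DocumentNode.SelectSingleNode("//a");
link?.SetAttributeValue("target", "_blank");

// Remove the paragraph with id="old" from its parent, if it exists
var obsolete = document.DocumentNode.SelectSingleNode("//p[@id='old']");
obsolete?.Remove();

Console.WriteLine(document.DocumentNode.OuterHtml);
```

The null-conditional operator (?.) guards both calls, since SelectSingleNode returns null when no node matches.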
Advanced Features
HtmlAgilityPack also supports advanced features such as:
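HTML Validation: The parser records the problems it finds in the document's ParseErrors collection, and options such as OptionFixNestedTags can correct some common errors.
Custom Parsing Rules: Adjust how individual elements are parsed, for example through the HtmlNode.ElementsFlags dictionary.
XPath Support: Extensive support for XPath to navigate and query the document, exposed through SelectSingleNode and SelectNodes.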
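As an illustration of the validation feature, the sketch below loads deliberately malformed markup and lists whatever problems the parser reports. The markup is a hypothetical example, and which problems are reported (if any) depends on the library version, so no particular output is assumed here.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var document = new HtmlDocument();
// Ask the parser to repair badly nested tags where it can
document.OptionFixNestedTags = true;

// Hypothetical malformed markup: the <span> is never closed
document.LoadHtml("<div><p>Some text<span>more text</p></div>");

var errors = document.ParseErrors.ToList();
Console.WriteLine($"Parser reported {errors.Count} problem(s).");
foreach (var error in errors)
{
    // Each entry carries the location and a human-readable reason
    Console.WriteLine($"Line {error.Line}, position {error.LinePosition}: {error.Reason}");
}
```

Checking ParseErrors after loading is a cheap way to detect pages whose markup may not match what your XPath queries expect.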
Conclusion
HtmlAgilityPack is a powerful library for handling HTML documents in C#. Whether you need to scrape web data, modify HTML content, or perform complex queries, HtmlAgilityPack provides the tools you need. By following the examples and practices outlined in this guide, you too can leverage HtmlAgilityPack in your C# projects.