HtmlAgilityPack with C#

Eoin Butler

10 Jul 2024

Scraping HTML with C# and the HtmlAgilityPack

Introduction

HtmlAgilityPack is a powerful .NET library for parsing and manipulating HTML documents. It simplifies tasks such as web scraping, HTML document modification, and validation. This guide will walk you through the basics of using HtmlAgilityPack with C#, including installation, basic usage, and some practical examples.

Installation

The easiest way to install HtmlAgilityPack is through NuGet, the package manager for .NET. You can use the NuGet Package Manager Console or the Package Manager GUI in Visual Studio, or the .NET CLI.

Using Package Manager Console:

Install-Package HtmlAgilityPack

Using .NET CLI:

dotnet add package HtmlAgilityPack

Basics

Once installed, you can start using HtmlAgilityPack to load, parse, and manipulate HTML documents.

Loading an HTML Document

First, include the necessary namespace:

using HtmlAgilityPack;

Then load an HTML document from a URL, a file, or a string. To load from a URL:

var url = "https://example.com";
var web = new HtmlWeb();
HtmlDocument document = web.Load(url);

For loading from a file or string:

// From a file
HtmlDocument fileDocument = new HtmlDocument();
fileDocument.Load("path/to/file.html");

// From a string
HtmlDocument stringDocument = new HtmlDocument();
stringDocument.LoadHtml("<html><body><p>Hello, World!</p></body></html>");

Parsing HTML Content

Once the document is loaded, you can navigate and parse the HTML content using XPath or LINQ.

// Using XPath (SelectSingleNode returns null when nothing matches)
HtmlNode node = document.DocumentNode.SelectSingleNode("//p");
if (node != null)
{
	Console.WriteLine(node.InnerText);
}

// Using LINQ
var nodes = document.DocumentNode.Descendants("p");
foreach (var n in nodes)
{
	Console.WriteLine(n.InnerText);
}
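
The LINQ approach becomes more useful once you filter and project the nodes. The following is a minimal sketch, assuming some sample markup invented for this illustration; the ParagraphFinder class and its TextByClass method are hypothetical names, not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical markup used only for this illustration.
var sampleHtml = @"<html><body>
  <p class=""intro"">First</p>
  <p>Second</p>
  <p class=""intro"">Third</p>
</body></html>";

foreach (var text in ParagraphFinder.TextByClass(sampleHtml, "intro"))
{
    Console.WriteLine(text);
}

// Hypothetical helper class, not part of the library.
static class ParagraphFinder
{
    // Returns the trimmed text of every <p> whose class attribute
    // equals cssClass exactly (multi-class attributes are not split).
    public static List<string> TextByClass(string html, string cssClass)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        return document.DocumentNode
            .Descendants("p")
            .Where(p => p.GetAttributeValue("class", string.Empty) == cssClass)
            .Select(p => p.InnerText.Trim())
            .ToList();
    }
}
```

Unlike SelectNodes, Descendants returns an empty sequence rather than null when nothing matches, so the query can be chained without a null check.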

Example 1: Extracting Links

This example demonstrates how to extract all links (<a> tags) from a webpage.

var url = "https://example.com";
var web = new HtmlWeb();
HtmlDocument document = web.Load(url);

// SelectNodes returns null (not an empty collection) when nothing matches
var links = document.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
	foreach (var link in links)
	{
		string href = link.GetAttributeValue("href", string.Empty);
		Console.WriteLine(href);
	}
}
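
Scraped href values are often relative (for example "/about"), so a common follow-up step is resolving them against the page address. This is a rough sketch under stated assumptions: the sample markup and base address are invented, and LinkResolver and AbsoluteLinks are hypothetical names. It relies on the standard Uri.TryCreate overload that combines a base Uri with a relative string.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical markup and base address used only for this illustration.
var sampleHtml = "<a href=\"/about\">About</a>"
               + "<a href=\"https://other.example/x\">Elsewhere</a>"
               + "<a>No href</a>";
var pageUri = new Uri("https://example.com/docs/");

foreach (var uri in LinkResolver.AbsoluteLinks(sampleHtml, pageUri))
{
    Console.WriteLine(uri.AbsoluteUri);
}

// Hypothetical helper class, not part of the library.
static class LinkResolver
{
    public static List<Uri> AbsoluteLinks(string html, Uri baseUri)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var result = new List<Uri>();
        var anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (var anchor in anchors)
        {
            var href = anchor.GetAttributeValue("href", string.Empty);

            // Combines relative values with baseUri; absolute values pass through.
            if (Uri.TryCreate(baseUri, href, out var absolute))
            {
                result.Add(absolute);
            }
        }
        return result;
    }
}
```

Returning an empty list instead of null keeps the calling loop free of null checks.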

Example 2: Scraping Table Data

Here’s how to extract data from an HTML table:

var url = "https://example.com";
var web = new HtmlWeb();

HtmlDocument document = web.Load(url);
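var table = document.DocumentNode.SelectSingleNode("//table");

// ".//tr" also matches rows nested inside <thead> or <tbody>.
// This assumes the page contains a table with at least one row.
foreach (var row in table.SelectNodes(".//tr"))
{
	var cells = row.SelectNodes("th|td");
	if (cells == null) continue; // SelectNodes returns null for a row without cells

	foreach (var cell in cells)
	{
		Console.Write(cell.InnerText + "\t");
	}
	Console.WriteLine(); // end of row
}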
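
If you need the table as data rather than console output, one option is to collect each row into an array of cell strings. The sketch below assumes an inline sample table invented for this illustration; TableReader and ReadRows are hypothetical names. It uses HtmlEntity.DeEntitize to turn entities such as &amp; back into plain characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical table used only for this illustration.
var sampleHtml = @"<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Apples</td><td>3</td></tr>
  <tr><td>Pears &amp; plums</td><td>5</td></tr>
</table>";

foreach (var row in TableReader.ReadRows(sampleHtml))
{
    Console.WriteLine(string.Join(" | ", row));
}

// Hypothetical helper class, not part of the library.
static class TableReader
{
    // Reads the first table in the markup into one string array per row.
    public static List<string[]> ReadRows(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var rows = new List<string[]>();
        var table = document.DocumentNode.SelectSingleNode("//table");
        if (table == null)
        {
            return rows; // no table in the document
        }

        // Descendants also picks up rows of nested tables, if there are any.
        foreach (var tr in table.Descendants("tr"))
        {
            var cells = tr.ChildNodes
                .Where(n => n.Name == "th" || n.Name == "td")
                .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
                .ToArray();
            rows.Add(cells);
        }
        return rows;
    }
}
```

Filtering ChildNodes keeps header and data cells in document order, which a separate query for th and td would not guarantee.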

Example 3: Modifying HTML

This example shows how to modify an HTML document by adding a new element:

var html = "<html><body><p>Hello, World!</p></body></html>";
HtmlDocument document = new HtmlDocument();

document.LoadHtml(html);

HtmlNode body = document.DocumentNode.SelectSingleNode("//body");

HtmlNode newElement = HtmlNode.CreateNode("<div><p>New Paragraph</p></div>");

body.AppendChild(newElement);

Console.WriteLine(document.DocumentNode.OuterHtml);
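
Modification is not limited to appending nodes. As a further sketch, the snippet below removes elements and sets an attribute; the sample markup is invented, and HtmlEditor and Clean are hypothetical names. It uses the Remove and SetAttributeValue methods of HtmlNode.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical markup used only for this illustration.
var sampleHtml = "<html><body><p id=\"greeting\">Hello, World!</p>"
               + "<script>track();</script></body></html>";

Console.WriteLine(HtmlEditor.Clean(sampleHtml));

// Hypothetical helper class, not part of the library.
static class HtmlEditor
{
    // Strips <script> elements and marks the greeting paragraph with a class.
    public static string Clean(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when nothing matches. ToList() takes a
        // snapshot so nodes are not removed from a collection being enumerated.
        var scripts = document.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            foreach (var script in scripts.ToList())
            {
                script.Remove();
            }
        }

        // Adds the attribute, or overwrites it if it already exists.
        var greeting = document.DocumentNode.SelectSingleNode("//p[@id='greeting']");
        greeting?.SetAttributeValue("class", "highlight");

        return document.DocumentNode.OuterHtml;
    }
}
```

The changes only affect the in-memory document; call document.Save or read OuterHtml, as here, to get the modified markup out.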

Advanced Features

HtmlAgilityPack also supports advanced features such as:

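HTML Validation: The parser records malformed markup in HtmlDocument.ParseErrors, and options such as OptionFixNestedTags and OptionAutoCloseOnEnd correct common errors.
Custom Parsing Rules: Adjust how specific elements are parsed through the static HtmlNode.ElementsFlags dictionary.
XPath Support: SelectNodes and SelectSingleNode accept XPath 1.0 expressions to navigate and query the document.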
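
As an illustration of the validation feature, the following sketch lists the problems the parser recorded while loading a document. The input markup is invented, HtmlChecker and DescribeParseErrors are hypothetical names, and which errors get reported may vary between library versions, so treat the result as a hint rather than a full validation report.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical input: the closing </div> has no matching opening tag.
var malformedHtml = "<html><body><p>Hello</p></div></body></html>";

foreach (var message in HtmlChecker.DescribeParseErrors(malformedHtml))
{
    Console.WriteLine(message);
}

// Hypothetical helper class, not part of the library.
static class HtmlChecker
{
    // Formats each recorded parse error as a single readable line.
    public static List<string> DescribeParseErrors(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        return document.ParseErrors
            .Select(e => $"Line {e.Line}, column {e.LinePosition}: {e.Code} ({e.Reason})")
            .ToList();
    }
}
```

The parser does not throw on malformed input; it builds the best tree it can and leaves it to you to inspect ParseErrors.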

Conclusion

HtmlAgilityPack is a powerful library for handling HTML documents in C#. Whether you need to scrape web data, modify HTML content, or perform complex queries, HtmlAgilityPack provides the tools you need. By following the examples and practices outlined in this guide, you can start using HtmlAgilityPack in your own C# projects.
