HTML Parser

This article demonstrates how to use the HTML Parser to parse HTML files.

The HTML parser takes any fragment of HTML, a URI or a stream and tries to parse it into a DOM. The parser is based on the HTML Living Standard. Once a document or fragment has been parsed, a set of extension methods over the HTML DOM elements allows you to extract information from a web page independently of the actual HTML Type Provider.

open FSharp.Data
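Parsing a fragment held in a string can be sketched as follows. This is a minimal illustration that assumes the FSharp.Data package is referenced; the inline markup and the names doc and paragraphs are made up for the example, and HtmlDocument.Parse is used so that no network access is needed.

```fsharp
// Script-only package reference; in a compiled project, add the package instead.
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical inline markup, not taken from any real page.
let doc =
    HtmlDocument.Parse(
        "<html><head><title>Demo</title></head><body><p>Hello</p><p>World</p></body></html>")

// Collect the text of every <p> element in document order.
let paragraphs =
    doc.Descendants [ "p" ]
    |> Seq.map (fun p -> p.InnerText())
    |> Seq.toList

printfn "%A" paragraphs
```

The same extension methods apply whether the document came from a string, a stream or a URL.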

The following example uses Google to search for FSharp.Data, then parses the first set of search results from the page, extracting the URL and title of each link. We use the HtmlDocument type.

To achieve this, we must first parse the web page into our DOM. We can do this using the HtmlDocument.Load method, which takes a URL and makes a synchronous web call to fetch the page. Note: an asynchronous variant, HtmlDocument.AsyncLoad, is also available.

let results = HtmlDocument.Load("http://www.google.co.uk/search?q=FSharp.Data")

val results: HtmlDocument =
  <!-- html>--><html lang="en">
  <head>
    <meta charset="UTF-8" /><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image" /><title>FSharp.Data - Google Search</title><script nonce="J_M7pddvW8DmmPApek7JIA">(function(){
document.documentElement.addEventListener("submit",function(b){var a;if(a=b.target){var c=a.getAttribute("data-submitfalse");a=c==="1"||c==="q"&&!a.elements.q.value?!0:!1}else a=!1;a&&(b.preventDefault(),b.stopPropagation())},!0);document.documentEle...
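For the asynchronous variant mentioned above, a rough sketch might look like the following. The helper name loadTitleAsync is hypothetical and not part of the library; the sketch only builds the asynchronous computation, and nothing is fetched until it is run.

```fsharp
// Script-only package reference; in a compiled project, add the package instead.
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical helper: load a page asynchronously and return the text
// of its first <title> element, if there is one.
let loadTitleAsync (url: string) : Async<string option> =
    async {
        let! doc = HtmlDocument.AsyncLoad(url)

        return
            doc.Descendants [ "title" ]
            |> Seq.tryHead
            |> Option.map (fun t -> t.InnerText())
    }
```

A caller would then start the computation with, for example, Async.RunSynchronously or Async.StartAsTask, depending on the surrounding application.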

Now that we have a loaded HTML document, we can begin to extract data from it. First, we extract all of the anchor (a) tags from the document using HtmlDocumentExtensions.Descendants, then check whether each one has an href attribute. If it does, we take the attribute's value, which in this case is the URL that the search result points to, together with the InnerText of the anchor tag, which gives the name of the web page for that search result. We keep at most the first ten links.

let links =
    results.Descendants [ "a" ]
    |> Seq.choose (fun x ->
        x.TryGetAttribute("href")
        |> Option.map (fun a -> x.InnerText(), a.Value()))
    |> Seq.truncate 10
    |> Seq.toList

val links: (string * string) list =
  [("Google", "/?sa=X&ved=0ahUKEwibrsWJzJeHAxW4m44IHTNIDokQOwgC");
   ("here",
    "/search?q=FSharp.Data&sca_esv=d7f5b94d2e0e1e85&ie=UTF-8&gbv=1"+[28 chars]);
   ("Images",
    "/search?q=FSharp.Data&sca_esv=d7f5b94d2e0e1e85&ie=UTF-8&tbm=i"+[66 chars]);
   ("Videos",
    "/search?q=FSharp.Data&sca_esv=d7f5b94d2e0e1e85&ie=UTF-8&tbm=v"+[65 chars]);
   ("News",
    "/search?q=FSharp.Data&sca_esv=d7f5b94d2e0e1e85&ie=UTF-8&tbm=n"+[65 chars]);
   ("Maps",
    "/url?q=http://maps.google.co.uk/maps%3Fq%3DFSharp.Data%26um%3"+[145 chars]);
   ("Shopping",
    "/url?q=/search%3Fq%3DFSharp.Data%26sca_esv%3Dd7f5b94d2e0e1e85"+[172 chars]);
   ("Books",
    "/search?q=FSharp.Data&sca_esv=d7f5b94d2e0e1e85&ie=UTF-8&tbm=b"+[65 chars]);
   ("Search tools", "/advanced_search")]
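Most of the href values above are relative to the site root. If absolute addresses are needed, one possible approach is to resolve each href against the address the page was loaded from using System.Uri. This is a sketch only; the names baseUri and toAbsolute are assumptions introduced for illustration.

```fsharp
open System

// Base address of the page that was loaded above.
let baseUri = Uri("http://www.google.co.uk/")

// Hypothetical helper: resolve an href against the base address.
let toAbsolute (href: string) = Uri(baseUri, href).AbsoluteUri

let absolute = toAbsolute "/advanced_search"

printfn "%s" absolute // prints http://www.google.co.uk/advanced_search
```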

Now that we have extracted the links, you will notice that many of them point to other Google services or to cached/similar results rather than to actual search results. Ideally we would filter these out, as we are probably not interested in them. At this point we simply have a list of tuples, so F# makes this straightforward using List.filter and List.map.

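let searchResults =
    links
    // Keep only redirect links, skipping the "Cached" and "Similar" entries.
    |> List.filter (fun (name, url) ->
        name <> "Cached"
        && name <> "Similar"
        && url.StartsWith("/url?"))
    // Drop the "/url?q=" prefix and everything from "&sa=" onwards.
    |> List.map (fun (name, url) ->
        // IndexOf returns -1 when "&sa=" is absent; keep the whole URL in that case.
        let cut = url.IndexOf("&sa=")
        let target = if cut >= 0 then url.Substring(0, cut) else url
        name, target.Replace("/url?q=", ""))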

val searchResults: (string * string) list =
  [("Maps",
    "http://maps.google.co.uk/maps%3Fq%3DFSharp.Data%26um%3D1%26ie"+[52 chars]);
   ("Shopping",
    "/search%3Fq%3DFSharp.Data%26sca_esv%3Dd7f5b94d2e0e1e85%26ie%3"+[79 chars])]
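The extracted targets are still percent-encoded (for example, %3F stands for ? and %3D for =). One way to decode them, sketched here under the assumption that the standard System.Uri.UnescapeDataString function is acceptable for this purpose, is shown below. The sample value is made up and only mimics the shape of searchResults.

```fsharp
open System

// Hypothetical sample shaped like the searchResults value above; the URL is illustrative.
let sample = [ "Maps", "http://maps.google.co.uk/maps%3Fq%3DFSharp.Data" ]

// Decode the percent-encoded characters in each URL, keeping the name unchanged.
let decoded =
    sample
    |> List.map (fun (name, url) -> name, Uri.UnescapeDataString url)

printfn "%A" decoded // prints [("Maps", "http://maps.google.co.uk/maps?q=FSharp.Data")]
```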
