HTML Type Provider

This article demonstrates how to use the HTML type provider to read HTML tables in a statically typed way.

The HTML Type Provider takes a sample HTML document as input and generates a type based on the data present in the columns of that sample. The column names are obtained from the first (header) row.

Introducing the provider

The type provider is located in the FSharp.Data.dll assembly. Assuming the assembly is located in the ../../../bin directory, we can reference it and open the namespace in F# Interactive as follows:

#r "../../../bin/FSharp.Data.dll"
open FSharp.Data

Parsing F1 Calendar Data

This example uses the HTML Type Provider to extract each row from a table on a Wikipedia page.

Headers in HTML tables are usually marked with the <th> tag, but this is not guaranteed, so the provider assumes that the first row contains the headers. (This behaviour is likely to get smarter in later releases.) It highlights a general problem: HTML does not enforce a strict table structure.

[<Literal>]
let F1_2017_URL =
    "https://en.wikipedia.org/wiki/2017_FIA_Formula_One_World_Championship"

type F1_2017 = HtmlProvider<F1_2017_URL>

The generated type exposes the tables that the provider managed to parse out of the given HTML document. Each table's name is derived from the id, title, name, summary or caption attributes/tags, whichever is available. If none of these exist, the table is simply named Tablexx, where xx is the table's position in the HTML document if all of the tables were flattened out into a list.

The Load method reads the data from a file or web resource. Here the sample parameter of the type provider is a web URL; a local file could have been used instead. The following sample calls the Load method with a URL that points to a live version of the same page on Wikipedia.

// Download the table for the 2017 F1 calendar from Wikipedia
let f1Calendar = F1_2017.Load(F1_2017_URL).Tables.Calendar

// Look at the top row, being the first race of the calendar
let firstRow = f1Calendar.Rows |> Seq.head
let round = firstRow.Round
let grandPrix = firstRow.``Grand Prix``
let date = firstRow.Date

// Print the round, location and date for each race, corresponding to a row
for row in f1Calendar.Rows do
    printfn "Race, round %A is hosted at %A on %A" row.Round row.``Grand Prix`` row.Date
Race, round "1" is hosted at "Australian Grand Prix" on "26 March"
Race, round "2" is hosted at "Chinese Grand Prix" on "9 April"
Race, round "3" is hosted at "Bahrain Grand Prix" on "16 April"
Race, round "4" is hosted at "Russian Grand Prix" on "30 April"
Race, round "5" is hosted at "Spanish Grand Prix" on "14 May"
Race, round "6" is hosted at "Monaco Grand Prix" on "28 May"
Race, round "7" is hosted at "Canadian Grand Prix" on "11 June"
Race, round "8" is hosted at "Azerbaijan Grand Prix" on "25 June"
Race, round "9" is hosted at "Austrian Grand Prix" on "9 July"
Race, round "10" is hosted at "British Grand Prix" on "16 July"
Race, round "11" is hosted at "Hungarian Grand Prix" on "30 July"
Race, round "12" is hosted at "Belgian Grand Prix" on "27 August"
Race, round "13" is hosted at "Italian Grand Prix" on "3 September"
Race, round "14" is hosted at "Singapore Grand Prix" on "17 September"
Race, round "15" is hosted at "Malaysian Grand Prix" on "1 October"
Race, round "16" is hosted at "Japanese Grand Prix" on "8 October"
Race, round "17" is hosted at "United States Grand Prix" on "22 October"
Race, round "18" is hosted at "Mexican Grand Prix" on "29 October"
Race, round "19" is hosted at "Brazilian Grand Prix" on "12 November"
Race, round "20" is hosted at "Abu Dhabi Grand Prix" on "26 November"
Race, round "Source:[86]" is hosted at "Source:[86]" on "Source:[86]"
val f1Calendar: HtmlProvider<...>.Calendar
val firstRow: HtmlProvider<...>.Calendar.Row =
  ("1", "Australian Grand Prix", "Albert Park Circuit, Melbourne", "26 March")
val round: string = "1"
val grandPrix: string = "Australian Grand Prix"
val date: string = "26 March"
val it: unit = ()

The generated type has a property Rows that returns the data from the HTML table as a collection of rows. We iterate over the rows using a for loop. As you can see, the (generated) row type has properties such as Grand Prix, Circuit, Round and Date that correspond to the columns in the selected HTML table.

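The type provider also infers the type of each column. When every value in a column of the sample can be parsed as a number or a date, the property gets the corresponding type (the decimal Downloads column in the NuGet example below is one such case); otherwise it falls back to string. In the output above, Round, Grand Prix and Date are all inferred as string: the trailing "Source:[86]" row puts non-numeric, non-date text in each of these columns.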
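The naming and inference rules described above can be tried without a network connection by passing a small inline HTML document as the sample. The following is a minimal sketch, not taken from the original article: the sample document, the table id "Races" and the names RacesSample, RacesPage and totalLaps are all made up for illustration, and it assumes that the table is named after its id and that the all-numeric Laps column is inferred as a numeric type.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical inline sample: one table with id "Races" and a header row
[<Literal>]
let RacesSample =
    """<html><body>
<table id="Races">
  <tr><th>Round</th><th>Grand Prix</th><th>Laps</th></tr>
  <tr><td>1</td><td>Australian Grand Prix</td><td>57</td></tr>
  <tr><td>2</td><td>Chinese Grand Prix</td><td>56</td></tr>
</table>
</body></html>"""

type RacesPage = HtmlProvider<RacesSample>

// GetSample() returns the document that was used as the sample.
// Assumption: the table is exposed as "Races" because of its id attribute.
let races = RacesPage.GetSample().Tables.Races

// %A prints each value whatever type was inferred for its column
for row in races.Rows do
    printfn "%A %A %A" row.Round row.``Grand Prix`` row.Laps

// This line compiles only if Laps was inferred as a numeric type
let totalLaps = races.Rows |> Seq.sumBy (fun row -> row.Laps)
```

Adding a non-numeric cell to the Laps column of the sample should make the last line fail to compile, which is a quick way to see inference fall back to string.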

Parsing NuGet package stats

This small sample shows how the HTML Type Provider can be used to scrape data from a website. In this example, we analyze the download counts of the FSharp.Data package on NuGet. Note that we're using the live URL as the sample, so we can just use the default constructor as the runtime data will be the same as the compile time data.

// Configure the type provider
type NugetStats = HtmlProvider<"https://www.nuget.org/packages/FSharp.Data">

// load the live package stats for FSharp.Data
let rawStats = NugetStats().Tables.``Version History of FSharp.Data``

// Helper function to extract the leading "digit.digit" part of a NuGet version string
let getMinorVersion (v: string) =
    // The dot is escaped so that it matches only a literal '.'
    System.Text.RegularExpressions.Regex(@"\d\.\d").Match(v).Value
// group by minor version and calculate the download count
let stats =
    rawStats.Rows
    |> Seq.groupBy (fun r -> getMinorVersion r.Version)
    |> Seq.map (fun (k, xs) -> k, xs |> Seq.sumBy (fun x -> x.Downloads))
    |> Seq.toArray
type NugetStats = HtmlProvider<...>
val rawStats: HtmlProvider<...>.VersionHistoryOfFSharpData
val getMinorVersion: v: string -> string
val stats: (string * decimal) array =
  [|("6.4", 274180M); ("6.3", 306928M); ("6.2", 134804M); ("6.1", 3083M);
    ("6.0", 17719M); ("5.0", 446049M); ("4.2", 857771M); ("4.1", 194761M);
    ("4.0", 117503M); ("3.3", 1230566M); ("3.2", 61150M); ("3.1", 261163M);
    ("3.0", 633128M); ("2.4", 483884M); ("2.3", 631856M); ("2.2", 371654M);
    ("2.1", 45671M); ("2.0", 165776M); ("1.1", 124506M); ("1.0", 77159M)|]
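The version helper above keeps a single digit on each side of the dot, so a version such as 10.0.1 would be grouped under "0.0". The sketch below shows one possible variant that accepts multi-digit parts; getMajorMinor and the sample version strings are hypothetical and are not part of the original article.

```fsharp
open System.Text.RegularExpressions

// Hypothetical variant of getMinorVersion: \d+ allows multi-digit parts,
// and the escaped dot matches only a literal '.'
let getMajorMinor (version: string) =
    Regex.Match(version, @"\d+\.\d+").Value

// Made-up version strings for illustration
let samples = [ "6.4.0"; "3.0.0-beta"; "10.0.1" ]

for s in samples do
    printfn "%s -> %s" s (getMajorMinor s)
// prints:
// 6.4.0 -> 6.4
// 3.0.0-beta -> 3.0
// 10.0.1 -> 10.0
```

For the versions listed in the output above the two helpers agree, so the grouping only changes once a major or minor number reaches two digits.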

Getting statistics on Doctor Who

This sample shows some more screen scraping from Wikipedia:

[<Literal>]
let DrWho =
    "https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(1963%E2%80%931989)"

let doctorWho = new HtmlProvider<DrWho>()

// Get the average number of UK viewers (in millions) per director in Season 1
// (the names below say "doctor", but the grouping key is the ``Directed by`` column)
let viewersByDoctor =
    doctorWho.Tables.``Season 1 (1963-1964)``.Rows
    |> Seq.groupBy (fun season -> season.``Directed by``)
    |> Seq.map (fun (doctor, seasons) ->
        let averaged =
            seasons
            |> Seq.averageBy (fun season -> season.``UK viewers (millions)``)

        doctor, averaged)
    |> Seq.toArray
[<Literal>]
val DrWho: string
  =
  "https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(1963%E2%80%931989)"
val doctorWho: HtmlProvider<...>
val viewersByDoctor: (string * float) array =
  [|("Waris Hussein", 8.0); ("", nan); ("Christopher Barry", 8.275);
    ("Richard Martin", 10.025); ("Frank Cox", 7.9); ("John Crockett", 8.0);
    ("John Gorrie", 9.066666667); ("Mervyn Pinfield", 6.925);
    ("Henric Hirsch", 6.733333333)|]
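The ("", nan) entry in the result above presumably comes from rows whose director cell is empty and whose viewer figure is missing, since averaging over nan yields nan. The sketch below shows one way to drop such rows before grouping. It uses a hypothetical in-memory Episode record and made-up values in place of the provider's generated row type, so it runs without network access.

```fsharp
// Hypothetical record standing in for the provider's generated row type
type Episode = { Director: string; Viewers: float }

// Made-up data; the third row mimics a row with missing values
let episodes =
    [ { Director = "A"; Viewers = 8.0 }
      { Director = "A"; Viewers = 6.0 }
      { Director = ""; Viewers = nan }
      { Director = "B"; Viewers = 7.5 } ]

let averageViewersByDirector (rows: Episode seq) =
    rows
    // Skip rows with no director or no viewer figure
    |> Seq.filter (fun e -> e.Director <> "" && not (System.Double.IsNaN e.Viewers))
    |> Seq.groupBy (fun e -> e.Director)
    |> Seq.map (fun (director, group) ->
        director, group |> Seq.averageBy (fun e -> e.Viewers))
    |> Seq.toArray

let result = averageViewersByDirector episodes
printfn "%A" result
// prints [|("A", 7.0); ("B", 7.5)|]
```

The same Seq.filter step could be placed before Seq.groupBy in the Doctor Who pipeline, using the ``Directed by`` and ``UK viewers (millions)`` properties instead of the record fields.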
