This article demonstrates how to use the HTML Type Provider to read HTML tables
in a statically typed way.
The HTML Type Provider takes a sample HTML document as input and generates a type based on the data
present in the columns of that sample. The column names are obtained from the first (header) row.
The type provider is located in the FSharp.Data.dll assembly. Assuming the assembly
is located in the ../../../bin directory, we can load it in F# Interactive as follows:
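The referencing step could look like the following sketch. The relative path is the one assumed in the text above, so adjust it to wherever the assembly actually lives on your machine.

```fsharp
// Reference the FSharp.Data assembly (path assumed from the text above)
#r "../../../bin/FSharp.Data.dll"

// Bring HtmlProvider and related types into scope
open FSharp.Data
```

When working from an F# script with a recent .NET SDK, referencing the package with `#r "nuget: FSharp.Data"` is an alternative that should give the same result.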
The Elexon - BM Reports website provides market data about the UK's current power system. For simplicity, an example of this data is shown below in CSV format.
In HTML files, headers are usually marked up with the <th> tag. In this file that is not the case, so the provider assumes that the
first row contains the headers. (This behaviour is likely to get smarter in later releases.) It also highlights a general problem with HTML: documents found in the wild rarely follow the markup conventions strictly.
The generated type exposes a type for each table that the provider has managed to parse out of the given HTML document.
Each type's name is derived from the id, title, name, summary or caption attributes/tags provided. If none of these
exist, the table is simply named Tablexx, where xx is the position of the table in the HTML document if all of the tables were flattened out into a list.
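The naming and header rules above can be tried out on a small inline sample. This is a minimal sketch under stated assumptions: the HTML snippet, the `RacesSample` literal and the `Races` type are hypothetical and made up for illustration, the package is referenced through `#r "nuget: FSharp.Data"`, and `GetSample()` is assumed to be available on the generated type (the default constructor used elsewhere in this article should behave the same way).

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical inline sample: a single table with an id attribute
// and a header row marked up with <th> cells.
[<Literal>]
let RacesSample =
    """<html><body>
         <table id="races">
           <tr><th>Round</th><th>Grand Prix</th></tr>
           <tr><td>1</td><td>Australian Grand Prix</td></tr>
           <tr><td>2</td><td>Chinese Grand Prix</td></tr>
         </table>
       </body></html>"""

type Races = HtmlProvider<RacesSample>

// The table is exposed under a name derived from its id attribute.
let races = Races.GetSample().Tables.Races

// Column names come from the header row; names with spaces need double backticks.
for row in races.Rows do
    printfn "%A: %s" row.Round row.``Grand Prix``
```

If the `id` attribute were removed and no other naming hint were present, the table would be expected to appear under a positional name of the Tablexx form described above.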
The Load method reads the data from a file or web resource. We could also have used a web URL instead of a local file as the sample parameter of the type provider.
The following sample calls the Load method with a URL that points to a live version of the same page on Wikipedia.
    let url =
        "https://en.wikipedia.org/wiki/"
        + "2017_FIA_Formula_One_World_Championship"

    // F1_2017 is the type generated by the HTML Type Provider from the sample document.
    // Download the live page and select the table named "Season calendar"
    let f1Calendar = F1_2017.Load(url).Tables.``Season calendar``

    // Look at the first row; each column is exposed as a typed property
    let firstRow = f1Calendar.Rows |> Seq.head
    let round = firstRow.Round
    let grandPrix = firstRow.``Grand Prix``
    let date = firstRow.Date

    // Print the round, Grand Prix and date for each row
    for row in f1Calendar.Rows do
        printfn "Race, round %A is hosted at %A on %A" row.Round row.``Grand Prix`` row.Date
The generated type has a property Rows that returns the data from the HTML file as a
collection of rows. We iterate over the rows using a for loop. The
(generated) type for rows has properties such as Grand Prix, Circuit, Round and Date that correspond
to the columns in the selected HTML table.
The type provider also infers the types of individual columns. A column is given a
specific type such as DateTime only when every value in the sample can be parsed as that
type; otherwise it falls back to string. In this sample, Round and Date are both inferred as string.
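Besides a URL or file path, Load also accepts a System.IO.TextReader or a System.IO.Stream. The sketch below loads a runtime document from a StringReader. Both HTML snippets, the `CalendarSample` literal and the `Calendar` type are hypothetical and exist only for illustration; the runtime document is assumed to use the same table id and columns as the compile-time sample.

```fsharp
#r "nuget: FSharp.Data"
open System.IO
open FSharp.Data

// Hypothetical compile-time sample that defines the shape of the table.
[<Literal>]
let CalendarSample =
    """<html><body><table id="calendar">
         <tr><th>Round</th><th>Grand Prix</th></tr>
         <tr><td>1</td><td>Australian Grand Prix</td></tr>
         <tr><td>2</td><td>Chinese Grand Prix</td></tr>
       </table></body></html>"""

type Calendar = HtmlProvider<CalendarSample>

// Hypothetical runtime document: same table id and columns, one more row.
let runtimeHtml =
    """<html><body><table id="calendar">
         <tr><th>Round</th><th>Grand Prix</th></tr>
         <tr><td>1</td><td>Australian Grand Prix</td></tr>
         <tr><td>2</td><td>Chinese Grand Prix</td></tr>
         <tr><td>3</td><td>Bahrain Grand Prix</td></tr>
       </table></body></html>"""

// Load the runtime document through the TextReader overload of Load.
let reader = new StringReader(runtimeHtml)
let calendar = Calendar.Load(reader)

// The rows now come from the runtime document rather than from the sample.
let rowCount = calendar.Tables.Calendar.Rows.Length
printfn "Loaded %d rows" rowCount
```

The types are fixed by the sample at compile time, so a runtime document whose table is missing or has different columns would be expected to fail when the data is accessed.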
This small sample shows how the HTML Type Provider can be used to scrape data from a website. In this example we analyze the download counts of the FSharp.Data package on NuGet.
Note that we're using the live URL as the sample, so we can just use the default constructor: the runtime data will be the same as the compile-time data.
    // Configure the type provider
    type NugetStats = HtmlProvider<"https://www.nuget.org/packages/FSharp.Data">

    // Load the live package stats for FSharp.Data
    let rawStats = NugetStats().Tables.Table4

    // Helper function to extract the "major.minor" part of a NuGet version number
    // (the dot is escaped so that it matches a literal '.')
    let getMinorVersion (v: string) =
        System.Text.RegularExpressions.Regex(@"\d\.\d").Match(v).Value

    // Group by minor version and calculate the download count
    let stats =
        rawStats.Rows
        |> Seq.groupBy (fun r -> getMinorVersion r.Version)
        |> Seq.map (fun (k, xs) -> k, xs |> Seq.sumBy (fun x -> x.Downloads))
        |> Seq.toArray
This sample shows some more screen scraping from Wikipedia:
    [<Literal>]
    let DrWho =
        "https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(1963%E2%80%931989)"

    let doctorWho = new HtmlProvider<DrWho>()

    // Get the average number of UK viewers (in millions) for the first season,
    // grouped by the value of the "Directed by" column
    let viewersByDoctor =
        doctorWho.Tables.``Season 1 (1963-1964) Edit``.Rows
        |> Seq.groupBy (fun season -> season.``Directed by``)
        |> Seq.map (fun (doctor, seasons) ->
            let averaged =
                seasons |> Seq.averageBy (fun season -> season.``UK viewers (millions)``)
            doctor, averaged)
        |> Seq.toArray