
This article demonstrates how to use the HTML Parser to parse HTML files.
The HTML parser takes any fragment of HTML, a URI or a stream and tries to parse it into a DOM.
The parser is based on the HTML Living Standard.
Once a document or fragment has been parsed, a set of extension methods over the HTML DOM elements allows you to extract information from a web page
independently of the actual HTML Type provider.
open FSharp.Data
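As a quick illustration of parsing a fragment rather than a live page, the following minimal sketch feeds an inline string to HtmlDocument.Parse and reads the page title back out. The markup and the names `sample` and `title` are made up for this example.

```fsharp
open FSharp.Data

// Hypothetical inline page, used here instead of a live URL.
let sample =
    HtmlDocument.Parse(
        """<html>
<head><title>Sample page</title></head>
<body><p>Hello, <b>parser</b>!</p></body>
</html>"""
    )

// Descendants walks the whole tree; here we keep only <title> elements
// and take the text of the first one, if any.
let title =
    sample.Descendants [ "title" ]
    |> Seq.map (fun n -> n.InnerText())
    |> Seq.tryHead
```

Because nothing is downloaded, a snippet like this is a convenient way to experiment with the extension methods before pointing them at a real site.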
The following example uses Google to search for FSharp.Data, then parses the first set of
search results from the page, extracting the URL and title of each link.
We use the HtmlDocument type.
To achieve this we must first parse the web page into our DOM. We can do this using
the HtmlDocument.Load method. This method takes a URL and makes a synchronous web call
to extract the data from the page. Note: an asynchronous variant, HtmlDocument.AsyncLoad, is also available.
let results = HtmlDocument.Load("http://www.google.co.uk/search?q=FSharp.Data")
val results: HtmlDocument =
<!-- html>--><html lang="en">
<head>
<meta charset="UTF-8" /><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image" /><title>FSharp.Data - Google Search</title><script nonce="c9CCVYsd8meH5Vxgp8X6pw">(function(){
document.documentElement.addEventListener("submit",function(b){var a;if(a=b.target){var c=a.getAttribute("data-submitfalse");a="1"===c||"q"===c&&!a.elements.q.value?!0:!1}else a=!1;a&&(b.preventDefault(),b.stopPropagation())},!0);document.documentEle...
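For the asynchronous variant mentioned above, a minimal sketch might look like the following. The helper name `fetchTitle` is hypothetical and not part of the library; the sketch assumes HtmlDocument.AsyncLoad accepts a URL string in the same way as HtmlDocument.Load.

```fsharp
open FSharp.Data

// Hypothetical helper: downloads a page without blocking the calling thread
// and returns the text of its first <title> element, if there is one.
let fetchTitle (url: string) =
    async {
        let! doc = HtmlDocument.AsyncLoad(url)

        return
            doc.Descendants [ "title" ]
            |> Seq.tryHead
            |> Option.map (fun n -> n.InnerText())
    }
```

The workflow does nothing until it is started, for example with `fetchTitle "http://www.google.co.uk/search?q=FSharp.Data" |> Async.RunSynchronously`, which performs the web call and yields a `string option`.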
Now that we have a loaded HTML document we can begin to extract data from it.
First we extract all of the anchor tags (a) out of the document using HtmlDocumentExtensions.Descendants, then
inspect each link to see whether it has an href attribute. If it does, we extract the value,
which in this case is the URL that the search result points to, and additionally the
InnerText of the anchor tag, which provides the name of the web page for the search result
we are looking at.
let links =
    results.Descendants [ "a" ]
    |> Seq.choose (fun x ->
        x.TryGetAttribute("href")
        |> Option.map (fun a -> x.InnerText(), a.Value()))
    |> Seq.truncate 10
    |> Seq.toList
val links: (string * string) list =
[("Google", "/?sa=X&ved=0ahUKEwizqKun95j_AhXulWoFHb2SAwUQOwgC");
("here", "/search?q=FSharp.Data&ie=UTF-8&gbv=1&sei=OcRzZPOnMO6rqtsPvaWOKA");
("News",
"/search?q=FSharp.Data&ie=UTF-8&source=lnms&tbm=nws&sa=X&ved=0"+[40 chars]);
("Images",
"/search?q=FSharp.Data&ie=UTF-8&source=lnms&tbm=isch&sa=X&ved="+[41 chars]);
("Videos",
"/search?q=FSharp.Data&ie=UTF-8&source=lnms&tbm=vid&sa=X&ved=0"+[40 chars]);
("Maps",
"http://maps.google.co.uk/maps?q=FSharp.Data&um=1&ie=UTF-8&sa="+[47 chars]);
("Shopping",
"/search?q=FSharp.Data&ie=UTF-8&source=lnms&tbm=shop&sa=X&ved="+[41 chars]);
("Books",
"/search?q=FSharp.Data&ie=UTF-8&source=lnms&tbm=bks&sa=X&ved=0"+[40 chars]);
("Search tools", "/advanced_search")]
Now that we have extracted our search results you will notice that there are lots of
other links to various Google services and cached/similar results. Ideally we would
like to filter these out, as we are probably not interested in them.
At this point we simply have a list of tuples, so F# makes this trivial using List.filter
and List.map.
let searchResults =
    links
    |> List.filter (fun (name, url) ->
        name <> "Cached"
        && name <> "Similar"
        && url.StartsWith("/url?"))
    |> List.map (fun (name, url) ->
        name,
        url
            .Substring(0, url.IndexOf("&sa="))
            .Replace("/url?q=", ""))
val searchResults: (string * string) list = []
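The live query above produced an empty list, so the effect of the filter is easier to see on fixed data. The sketch below applies the same filtering to a hypothetical list of (name, href) pairs; the sample values and the helper `cleanUrl` are assumptions made for illustration. Unlike the bare Substring call, `cleanUrl` also tolerates links that carry no "&sa=" marker.

```fsharp
// Hypothetical sample in the (name, href) shape produced by `links` above.
let sampleLinks =
    [ "Images", "/search?q=FSharp.Data&tbm=isch"
      "FSharp.Data home", "/url?q=https://example.org/fsharp-data&sa=U&ved=0"
      "Cached", "/url?q=https://cache.example.org/page&sa=U"
      "No tracking suffix", "/url?q=https://example.org/plain" ]

// Strip the "/url?q=" prefix and anything from "&sa=" onwards.
// If "&sa=" is absent, IndexOf returns -1 and the target is kept whole.
let cleanUrl (url: string) =
    let target = url.Replace("/url?q=", "")

    match target.IndexOf("&sa=") with
    | -1 -> target
    | i -> target.Substring(0, i)

// Keep only redirect-style result links, dropping "Cached" and "Similar".
let sampleResults =
    sampleLinks
    |> List.filter (fun (name, url) ->
        name <> "Cached"
        && name <> "Similar"
        && url.StartsWith("/url?"))
    |> List.map (fun (name, url) -> name, cleanUrl url)
```

With this sample, only the two entries whose href starts with "/url?" and whose name is neither "Cached" nor "Similar" survive, and both are reduced to their plain target URLs.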