HTML CSS selectors

This article demonstrates how to use HTML CSS selectors to browse the DOM of parsed HTML files. We use the HtmlDocument type and associated HtmlDocument module and HtmlDocumentExtensions extensions.

Using CSS selectors is a natural way to parse HTML for anyone with a web development background. The HTML CSS selectors are based on the JQuery selectors. To use them, reference the FSharp.Data package and open the FSharp.Data namespace, which automatically exposes the extension methods that implement the CSS selectors.

open FSharp.Data
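
As a minimal sketch of the setup described above, the following F# script references the package from NuGet, opens the namespace, and runs one selector. The inline HTML and the names `sample` and `intro` are made up for illustration, and the `#r "nuget: ..."` directive assumes the code is run as an F# Interactive script.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical inline document, used only for illustration.
let sample =
    HtmlDocument.Parse(
        """<html><body><p class="intro">Hello</p><p>World</p></body></html>""")

// CssSelect becomes available on HtmlDocument once FSharp.Data is opened.
let intro =
    sample.CssSelect("p.intro")
    |> List.map (fun n -> n.InnerText())
```

Parsing an inline string instead of loading a URL keeps the example independent of any live web page.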

Practice 1: Search for FSharp.Data on Google

We will parse the links returned by a Google search for FSharp.Data, as in the HTML Parser article.

let googleUrl = "http://www.google.co.uk/search?q=FSharp.Data"
let doc = HtmlDocument.Load(googleUrl)
val googleUrl: string = "http://www.google.co.uk/search?q=FSharp.Data"
val doc: HtmlDocument =
  <!-- html>--><html lang="en">
  <head>
    <meta charset="UTF-8" /><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image" /><title>FSharp.Data - Google Search</title><script nonce="QtQEcapp7Eoce2boguSNSQ">(function(){
document.documentElement.addEventListener("submit",function(b){var a;if(a=b.target){var c=a.getAttribute("data-submitfalse");a="1"===c||"q"===c&&!a.elements.q.value?!0:!1}else a=!1;a&&(b.preventDefault(),b.stopPropagation())},!0);document.documentEle...

To make sure we extract search results only, we will parse links in the <div> with id search. Then we can, for example, use the child selector (>) to select another <div> with the id ires that is a direct child of the first one. The CSS selector to do so is div#search > div#ires:

let links =
    doc.CssSelect("div#search > div#ires div.g > div.s div.kv cite")
    |> List.map (fun n ->
        match n.InnerText() with
        | t when
            (t.StartsWith("https://")
             || t.StartsWith("http://"))
            ->
            t
        | t -> "http://" + t)
val links: string list = []

The rest of the selector (written as div.g > div.s) skips the first 4 sub-results targeting GitHub pages, so we only extract proper links.
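
The difference between the child combinator (`>`) and the descendant combinator (a space) can be sketched on a small offline document. The markup below is hypothetical and only mimics the nesting discussed above; it is not Google's actual page structure.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical markup: one <cite> sits under div#ires, the other is nested elsewhere.
let nested =
    HtmlDocument.Parse(
        """<html><body>
<div id="search">
  <div id="ires"><div class="g"><cite>direct.example</cite></div></div>
  <section><div id="other"><div class="g"><cite>nested.example</cite></div></div></section>
</div>
</body></html>""")

let texts (selector: string) =
    nested.CssSelect(selector) |> List.map (fun n -> n.InnerText())

// "a > b" keeps b only when it is a direct child of a,
// so only the <cite> below div#ires is reached here.
let viaChild = texts "div#search > div#ires cite"

// "a b" keeps b at any depth below a, so both <cite> elements match.
let viaDescendant = texts "div#search cite"
```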

Now we might want the page titles associated with their URLs. To do this, we can use the List.zip function:

let searchResults =
    doc.CssSelect("div#search > div#ires div.g > h3")
    |> List.map (fun n -> n.InnerText())
    |> List.zip (links)
val searchResults: (string * string) list = []
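
Note that List.zip raises an exception when the two lists have different lengths, which can happen if one result lacks a title or a link. One possible alternative, sketched below on hypothetical markup, is to select each result container first and then read the title and link from inside it. The names `results` and `paired` are illustrative and not part of the original example.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical result list: the second entry has no <cite>.
let results =
    HtmlDocument.Parse(
        """<html><body>
<div class="g"><h3>First</h3><cite>first.example</cite></div>
<div class="g"><h3>Second</h3></div>
<div class="g"><h3>Third</h3><cite>third.example</cite></div>
</body></html>""")

// Pair link and title per container; entries missing either part are dropped.
let paired =
    results.CssSelect("div.g")
    |> List.choose (fun g ->
        let title = g.Descendants [ "h3" ] |> Seq.tryHead
        let link = g.Descendants [ "cite" ] |> Seq.tryHead
        match link, title with
        | Some l, Some t -> Some(l.InnerText(), t.InnerText())
        | _ -> None)
```

Pairing inside each container keeps a title attached to its own link even when some results are incomplete.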

Practice 2: Search F# books on Google Books

We will parse links from the Google Books web site, searching for F#. After downloading the document, we match the relevant links by their CSS classes and their position in the DOM hierarchy. For Google Books, we look for a <div> with class g, then an <h3> with CSS class r, and then all <a> elements inside it:

let fsys = "https://www.google.com/search?tbm=bks&q=F%23"
let doc2 = HtmlDocument.Load(fsys)

let books =
    doc2.CssSelect("div.g h3.r a")
    |> List.map (fun a -> a.InnerText().Trim(), a.AttributeValue("href"))
    |> List.filter (fun (title, href) -> title.Contains("F#"))
val fsys: string = "https://www.google.com/search?tbm=bks&q=F%23"
val doc2: HtmlDocument =
  <!-- html>--><html lang="en">
  <head>
    <meta charset="UTF-8" /><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image" /><title>F# - Google Search</title><script nonce="BpX-9hyTR6FAgEObGyg95Q">(function(){
document.documentElement.addEventListener("submit",function(b){var a;if(a=b.target){var c=a.getAttribute("data-submitfalse");a="1"===c||"q"===c&&!a.elements.q.value?!0:!1}else a=!1;a&&(b.preventDefault(),b.stopPropagation())},!0);document.documentElement.addE...
val books: (string * string) list = []

JQuery selectors

This section provides a quick overview of the supported CSS selectors. If you are familiar with CSS selectors in JQuery, then you will see that most of the features are the same. You can also refer to the table below for a complete list of supported selectors.

Attribute Contains Prefix Selector

Finds all links with an English hreflang attribute, that is, a value equal to "en" or starting with "en-".

let englishDoc =
    HtmlDocument.Parse(
        """
  <!doctype html>
  <html lang="en">
  <body>
    <a href="example.html" hreflang="en">Some text</a>
    <a href="example.html" hreflang="en-UK">Some other text</a>
    <a href="example.html" hreflang="english">will not be outlined</a>
  </body>
  </html>"""
    )

let englishLinks = englishDoc.CssSelect("a[hreflang|=en]")
val englishDoc: HtmlDocument =
  <!-- html>--><html lang="en">
  <body>
    <a href="example.html" hreflang="en">Some text</a><a href="example.html" hreflang="en-UK">Some other text</a><a href="example.html" hreflang="english">will not be outlined</a>
  </body>
</html>
val englishLinks: HtmlNode list =
  [<a href="example.html" hreflang="en">Some text</a>;
   <a href="example.html" hreflang="en-UK">Some other text</a>]

Attribute Contains Selector

Finds all inputs with a name containing "man". This includes results where "man" is a substring:

let manDoc =
    HtmlDocument.Parse(
        """
  <!doctype html>
  <html lang="en">
  <body>
    <input name="man-news">
    <input name="milkman">
    <input name="milk man">
    <input name="letterman2">
    <input name="newmilk">
    <input name="man">
    <input name="newsletter">
  </body>
  </html>"""
    )

let manElems = manDoc.CssSelect("input[name*='man']")
val manDoc: HtmlDocument =
  <!-- html>--><html lang="en">
  <body>
    <input name="man-news" /><input name="milkman" /><input name="milk man" /><input name="letterman2" /><input name="newmilk" /><input name="man" /><input name="newsletter" />
  </body>
</html>
val manElems: HtmlNode list =
  [<input name="man-news" />; <input name="milkman" />;
   <input name="milk man" />; <input name="letterman2" />;
   <input name="man" />]

Attribute Contains Word Selector

Finds all inputs with a name containing the word "man". The word must be delimited by whitespace (or be the whole value):

let manWordElems = manDoc.CssSelect("input[name~='man']")
val manWordElems: HtmlNode list =
  [<input name="milk man" />; <input name="man" />]

Attribute Ends With Selector

Finds all inputs with a name ending with "man".

let manEndElemes = manDoc.CssSelect("input[name$='man']")
val manEndElemes: HtmlNode list =
  [<input name="milkman" />; <input name="milk man" />; <input name="man" />]

Attribute Equals Selector

Finds all inputs with a name equal to "man".

let manEqElemes = manDoc.CssSelect("input[name='man']")
val manEqElemes: HtmlNode list = [<input name="man" />]

Attribute Not Equal Selector

Finds all inputs with a name different to "man".

let notManElems = manDoc.CssSelect("input[name!='man']")
val notManElems: HtmlNode list =
  [<input name="man-news" />; <input name="milkman" />;
   <input name="milk man" />; <input name="letterman2" />;
   <input name="newmilk" />; <input name="newsletter" />]

Attribute Starts With Selector

Finds all inputs with a name starting with "man".

let manStartElems = manDoc.CssSelect("input[name^='man']")
val manStartElems: HtmlNode list =
  [<input name="man-news" />; <input name="man" />]
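
To compare the attribute operators side by side, here is a small sketch that mimics their matching rules on plain strings. The `matchesOp` helper is hypothetical and is not how FSharp.Data implements the selectors; it only restates the rules illustrated in the examples above, using the same input names.

```fsharp
open System

// Hypothetical helper: applies an attribute operator to a plain string value.
let matchesOp (op: string) (expected: string) (actual: string) =
    match op with
    | "=" -> actual = expected
    | "!=" -> actual <> expected
    | "^=" -> actual.StartsWith(expected, StringComparison.Ordinal)
    | "$=" -> actual.EndsWith(expected, StringComparison.Ordinal)
    | "*=" -> actual.Contains(expected)
    // Word match: the value is split on whitespace and one part must be equal.
    | "~=" ->
        actual.Split([| ' '; '\t'; '\n' |], StringSplitOptions.RemoveEmptyEntries)
        |> Array.contains expected
    // Prefix match: equal, or followed directly by a hyphen.
    | "|=" -> actual = expected || actual.StartsWith(expected + "-", StringComparison.Ordinal)
    | _ -> invalidArg "op" "unknown operator"

let names =
    [ "man-news"; "milkman"; "milk man"; "letterman2"; "newmilk"; "man"; "newsletter" ]

// Filter the sample names with a given operator and the value "man".
let select op = names |> List.filter (matchesOp op "man")
```

For instance, `select "~="` keeps only "milk man" and "man", while `select "*="` also keeps "man-news", "milkman" and "letterman2".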

Forms helpers

There are some syntax shortcuts to find form controls.

let htmlForm =
    HtmlDocument.Parse(
        """
  <!doctype html>
  <html>
  <body>
  <form>
    <fieldset>
      <input type="button" value="Input Button">
      <input type="checkbox" id="check1">
      <input type="hidden" id="hidden1">
      <input type="password" id="pass1">
      <input name="email" disabled="disabled">
      <input type="radio" id="radio1">
      <input type="checkbox" id="check2" checked="checked">
      <input type="file" id="uploader1">
      <input type="reset">
      <input type="submit">
      <input type="text">
      <select><option>Option</option></select>
      <textarea class="comment box1">Type a comment here</textarea>
      <button>Go !</button>
    </fieldset>
  </form>
  </body>
  </html>"""
    )

You can use :prop to find elements with the specified value of the type attribute or with a specified form control property. This lets you easily select all buttons, checkboxes, and radio buttons, as well as hidden or disabled form elements:

// Find all buttons.
let buttons = htmlForm.CssSelect(":button")

// Find all checkboxes.
let checkboxes = htmlForm.CssSelect(":checkbox")

// Find all checked checkboxes or radio buttons.
let checkd = htmlForm.CssSelect(":checked")

// Find all disabled controls.
let disabled = htmlForm.CssSelect(":disabled")

// Find all inputs with type hidden.
let hidden = htmlForm.CssSelect(":hidden")

// Find all inputs with type radio.
let radio = htmlForm.CssSelect(":radio")

// Find all inputs with type password.
let password = htmlForm.CssSelect(":password")

// Find all file upload inputs.
let file = htmlForm.CssSelect(":file")
val buttons: HtmlNode list =
  [<button>Go !</button>; <input type="button" value="Input Button" />]
val checkboxes: HtmlNode list =
  [<input type="checkbox" id="check1" />;
   <input type="checkbox" id="check2" checked="checked" />]
val checkd: HtmlNode list =
  [<input type="checkbox" id="check2" checked="checked" />]
val disabled: HtmlNode list = [<input name="email" disabled="disabled" />]
val hidden: HtmlNode list = [<input type="hidden" id="hidden1" />]
val radio: HtmlNode list = [<input type="radio" id="radio1" />]
val password: HtmlNode list = [<input type="password" id="pass1" />]
val file: HtmlNode list = [<input type="file" id="uploader1" />]
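
In JQuery, a shortcut such as :checkbox is documented as equivalent to an attribute selector on type. The sketch below checks that reading on a small hypothetical form; the names `form`, `ids`, `byShortcut` and `byAttribute` are illustrative, and the equivalence is an assumption about FSharp.Data based on the outputs shown above.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical form with two checkboxes and one text input.
let form =
    HtmlDocument.Parse(
        """<html><body><form>
<input type="checkbox" id="a">
<input type="text" id="b">
<input type="checkbox" id="c" checked="checked">
</form></body></html>""")

// Collect the id attribute of every node matched by a selector.
let ids (selector: string) =
    form.CssSelect(selector) |> List.map (fun n -> n.AttributeValue("id"))

// Shortcut form.
let byShortcut = ids ":checkbox"

// Explicit attribute selector, assumed here to select the same nodes.
let byAttribute = ids "input[type='checkbox']"
```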

Implemented and missing features

Basic CSS selectors are implemented, but some JQuery selectors are missing.

This table lists all JQuery selectors and their status.

| Selector name | Status |
|---|---|
| All Selector | TODO |
| :animated Selector | not possible |
| Attribute Contains Prefix Selector | implemented |
| Attribute Contains Selector | implemented |
| Attribute Contains Word Selector | implemented |
| Attribute Ends With Selector | implemented |
| Attribute Equals Selector | implemented |
| Attribute Not Equal Selector | implemented |
| Attribute Starts With Selector | implemented |
| :button Selector | implemented |
| :checkbox Selector | implemented |
| :checked Selector | implemented |
| Child Selector (“parent > child”) | implemented |
| Class Selector (“.class”) | implemented |
| :contains() Selector | TODO |
| Descendant Selector (“ancestor descendant”) | implemented |
| :disabled Selector | implemented |
| Element Selector (“element”) | implemented |
| :empty Selector | implemented |
| :enabled Selector | implemented |
| :eq() Selector | TODO |
| :even Selector | implemented |
| :file Selector | implemented |
| :first-child Selector | TODO |
| :first-of-type Selector | TODO |
| :first Selector | TODO |
| :focus Selector | not possible |
| :gt() Selector | TODO |
| Has Attribute Selector [name] | implemented |
| :has() Selector | TODO |
| :header Selector | TODO |
| :hidden Selector | implemented |
| ID Selector (“#id”) | implemented |
| :image Selector | implemented |
| :input Selector | implemented |
| :lang() Selector | TODO |
| :last-child Selector | TODO |
| :last-of-type Selector | TODO |
| :last Selector | TODO |
| :lt() Selector | TODO |
| Multiple Attribute Selector [name="value"][name2="value2"] | implemented |
| Multiple Selector (“selector1, selector2, selectorN”) | TODO |
| Next Adjacent Selector (“prev + next”) | TODO |
| Next Siblings Selector (“prev ~ siblings”) | TODO |
| :not() Selector | TODO |
| :nth-child() Selector | TODO |
| :nth-last-child() Selector | TODO |
| :nth-last-of-type() Selector | TODO |
| :nth-of-type() Selector | TODO |
| :odd Selector | implemented |
| :only-child Selector | TODO |
| :only-of-type Selector | TODO |
| :parent Selector | TODO |
| :password Selector | implemented |
| :radio Selector | implemented |
| :reset Selector | not possible |
| :root Selector | useless[1] |
| :selected Selector | implemented |
| :submit Selector | implemented |
| :target Selector | not possible |
| :text Selector | implemented |
| :visible Selector | not possible |

[1] :root Selector seems to be useless in our case because with the HTML parser the root is always the html node.
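
For selectors marked TODO, ordinary list functions can sometimes stand in. The sketch below shows one possible workaround for the Multiple Selector and for :first, on a hypothetical document; it is not a feature of the library, and it assumes CssSelect returns matches in document order.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical page used only for illustration.
let page =
    HtmlDocument.Parse(
        """<html><body><h1>Title</h1><p>One</p><h2>Sub</h2><p>Two</p></body></html>""")

// Stand-in for "h1, h2": run each selector separately and concatenate the results.
// Results are grouped per selector and are not deduplicated.
let headings =
    [ "h1"; "h2" ]
    |> List.collect (fun s -> page.CssSelect(s))
    |> List.map (fun n -> n.InnerText())

// Stand-in for "p:first": take the head of the match list, if there is one.
let firstParagraph =
    page.CssSelect("p")
    |> List.tryHead
    |> Option.map (fun n -> n.InnerText())
```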
