HTML CSS selectors
This article demonstrates how to use HTML CSS selectors to browse the DOM of parsed HTML files. We use the HtmlDocument type and associated HtmlDocument module and HtmlDocumentExtensions extensions.
Using CSS selectors is a natural way to query HTML if you come from web development.
The HTML CSS selectors are based on the jQuery selectors.
To use CSS selectors, reference the FSharp.Data package. You then need to open the FSharp.Data namespace, which automatically exposes the extension methods that implement the CSS selectors.
open FSharp.Data
Practice 1: Search for FSharp.Data on Google
We will parse the links returned by a Google search for FSharp.Data, as in the HTML Parser article.
let googleUrl = "http://www.google.co.uk/search?q=FSharp.Data"
let doc = HtmlDocument.Load(googleUrl)
To make sure we extract search results only, we will parse links in the <div> with id search. Then we can, for example, use the direct descendants selector to select another <div> with the id ires. The CSS selector to do so is div#search > div#ires:
let links =
    doc.CssSelect("div#search > div#ires div.g > div.s div.kv cite")
    |> List.map (fun n ->
        match n.InnerText() with
        | t when (t.StartsWith("https://") || t.StartsWith("http://")) -> t
        | t -> "http://" + t)
The rest of the selector (written as div.g > div.s) skips the first four sub-results, which target GitHub pages, so we only extract proper links.
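The difference between the child combinator (`>`) and the descendant combinator (a space) used in the selector above can be sketched with a miniature tree. The `Node` record, the `node` helper and the sample tree are hypothetical and only model the idea; they are not FSharp.Data's `HtmlNode` or its selector engine.

```fsharp
// Hypothetical miniature DOM; FSharp.Data's HtmlNode is richer than this.
type Node = { Tag: string; Id: string; Children: Node list }

let node tag nodeId kids = { Tag = tag; Id = nodeId; Children = kids }

// Direct children only: models "parent > child".
let children (n: Node) = n.Children

// Children at any depth: models "ancestor descendant".
let rec descendants (n: Node) : Node list =
    n.Children |> List.collect (fun c -> c :: descendants c)

let tree =
    node "div" "search"
        [ node "div" "ires" [ node "div" "nested" [] ]
          node "p" "note" [] ]

let isDiv (n: Node) = n.Tag = "div"

// Models "div#search > div": only the direct child matches.
let childDivIds =
    children tree |> List.filter isDiv |> List.map (fun n -> n.Id)
// childDivIds = ["ires"]

// Models "div#search div": the nested div matches as well.
let descendantDivIds =
    descendants tree |> List.filter isDiv |> List.map (fun n -> n.Id)
// descendantDivIds = ["ires"; "nested"]
```

Mixing both combinators, as the selector above does, pins some steps to direct children while letting others match at any depth.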
Now we might want the page titles associated with their URLs. To do this, we can use the List.zip function:
let searchResults =
    doc.CssSelect("div#search > div#ires div.g > h3")
    |> List.map (fun n -> n.InnerText())
    |> List.zip (links)
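Note that `List.zip` raises an `ArgumentException` when its two lists differ in length, which can happen if the two selectors do not match the same number of nodes. A minimal sketch of a more tolerant pairing follows; `sampleLinks`, `sampleTitles` and `pairUp` are hypothetical names, and the sample data stands in for the scraped values.

```fsharp
// Hypothetical sample data standing in for the scraped links and titles.
let sampleLinks = [ "http://a.example"; "http://b.example"; "http://c.example" ]
let sampleTitles = [ "Title A"; "Title B" ]

// List.zip raises ArgumentException when the lists differ in length;
// Seq.zip instead stops at the end of the shorter input.
let pairUp (urls: string list) (titles: string list) : (string * string) list =
    Seq.zip urls titles |> List.ofSeq

let paired = pairUp sampleLinks sampleTitles
// paired = [("http://a.example", "Title A"); ("http://b.example", "Title B")]
```

Truncating hides the mismatch rather than fixing it: if a node is missing in the middle of one list, later pairs will be misaligned, so comparing the two lengths first may be the safer choice.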
Practice 2: Search F# books on Google Books
We will parse the links on the Google Books website, searching for F#. After downloading the document, we simply make sure to match the right links using their CSS classes and their position in the DOM hierarchy. For Google Books, we look for a <div> with class g, then for an <h3> with CSS class r, and then for all <a> elements:
let fsys = "https://www.google.com/search?tbm=bks&q=F%23"
let doc2 = HtmlDocument.Load(fsys)
let books =
    doc2.CssSelect("div.g h3.r a")
    |> List.map (fun a -> a.InnerText().Trim(), a.AttributeValue("href"))
    |> List.filter (fun (title, href) -> title.Contains("F#"))
jQuery selectors
This section provides a quick overview of the supported CSS selectors. If you are familiar with CSS selectors in jQuery, you will see that most of the features are the same. You can also refer to the table below for the full list of jQuery selectors and their status.
Attribute Contains Prefix Selector
Finds all links with an English hreflang attribute.
let englishDoc =
    HtmlDocument.Parse(
        """
        <!doctype html>
        <html lang="en">
        <body>
            <a href="example.html" hreflang="en">Some text</a>
            <a href="example.html" hreflang="en-UK">Some other text</a>
            <a href="example.html" hreflang="english">will not be outlined</a>
        </body>
        </html>"""
    )
let englishLinks = englishDoc.CssSelect("a[hreflang|=en]")
Attribute Contains Selector
Finds all inputs with a name containing "man". This includes results where "man" is a substring:
let manDoc =
    HtmlDocument.Parse(
        """
        <!doctype html>
        <html lang="en">
        <body>
            <input name="man-news">
            <input name="milkman">
            <input name="milk man">
            <input name="letterman2">
            <input name="newmilk">
            <input name="man">
            <input name="newsletter">
        </body>
        </html>"""
    )
let manElems = manDoc.CssSelect("input[name*='man']")
Attribute Contains Word Selector
Finds all inputs with a name containing the word "man". The word must be delimited by whitespace:
let manWordElems = manDoc.CssSelect("input[name~='man']")
Attribute Ends With Selector
Finds all inputs with a name ending with "man".
let manEndElems = manDoc.CssSelect("input[name$='man']")
Attribute Equals Selector
Finds all inputs with a name equal to "man".
let manEqElems = manDoc.CssSelect("input[name='man']")
Attribute Not Equal Selector
Finds all inputs with a name different from "man".
let notManElems = manDoc.CssSelect("input[name!='man']")
Attribute Starts With Selector
Finds all inputs with a name starting with "man".
let manStartElems = manDoc.CssSelect("input[name^='man']")
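To compare the attribute operators side by side, here is a simplified model that applies each one to the name values used in manDoc. It relies on plain string functions only; `matches`, `names` and `select` are hypothetical helpers, and the sketch illustrates the matching rules rather than how FSharp.Data implements them. It covers only attribute values that are present.

```fsharp
// A simplified model of the attribute operators; not FSharp.Data's implementation.
let matches (op: string) (expected: string) (actual: string) : bool =
    match op with
    | "=" -> actual = expected
    | "!=" -> actual <> expected
    | "*=" -> actual.Contains(expected)
    | "^=" -> actual.StartsWith(expected, System.StringComparison.Ordinal)
    | "$=" -> actual.EndsWith(expected, System.StringComparison.Ordinal)
    // Word match: the value is split on spaces and one token must be equal.
    | "~=" -> actual.Split(' ') |> Array.contains expected
    // Prefix match: the exact value, or the value followed by a hyphen.
    | "|=" ->
        actual = expected
        || actual.StartsWith(expected + "-", System.StringComparison.Ordinal)
    | _ -> invalidArg "op" ("Unsupported operator: " + op)

// The name attributes from manDoc, in document order.
let names =
    [ "man-news"; "milkman"; "milk man"; "letterman2"; "newmilk"; "man"; "newsletter" ]

let select (op: string) = names |> List.filter (matches op "man")

let containsMan = select "*=" // ["man-news"; "milkman"; "milk man"; "letterman2"; "man"]
let wordMan = select "~="     // ["milk man"; "man"]
let endsMan = select "$="     // ["milkman"; "milk man"; "man"]
let equalsMan = select "="    // ["man"]
let startsMan = select "^="   // ["man-news"; "man"]
let prefixMan = select "|="   // ["man-news"; "man"]
```

For this particular data the starts-with and prefix operators select the same names; they would differ for a value such as "manual", which starts with "man" but is neither "man" nor "man-" followed by more text.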
Forms helpers
There are some syntax shortcuts to find form controls.
let htmlForm =
    HtmlDocument.Parse(
        """
        <!doctype html>
        <html>
        <body>
        <form>
          <fieldset>
            <input type="button" value="Input Button">
            <input type="checkbox" id="check1">
            <input type="hidden" id="hidden1">
            <input type="password" id="pass1">
            <input name="email" disabled="disabled">
            <input type="radio" id="radio1">
            <input type="checkbox" id="check2" checked="checked">
            <input type="file" id="uploader1">
            <input type="reset">
            <input type="submit">
            <input type="text">
            <select><option>Option</option></select>
            <textarea class="comment box1">Type a comment here</textarea>
            <button>Go !</button>
          </fieldset>
        </form>
        </body>
        </html>"""
    )
You can use :prop to find elements with a given value of the type attribute or with a given form control property. This lets you easily select all buttons, checkboxes, and radio buttons, as well as hidden or disabled form elements:
// Find all buttons.
let buttons = htmlForm.CssSelect(":button")
// Find all checkboxes.
let checkboxes = htmlForm.CssSelect(":checkbox")
// Find all checked checkboxes or radio buttons.
let checkd = htmlForm.CssSelect(":checked")
// Find all disabled controls.
let disabled = htmlForm.CssSelect(":disabled")
// Find all inputs with type hidden.
let hidden = htmlForm.CssSelect(":hidden")
// Find all inputs with type radio.
let radio = htmlForm.CssSelect(":radio")
// Find all inputs with type password.
let password = htmlForm.CssSelect(":password")
// Find all file uploaders.
let file = htmlForm.CssSelect(":file")
Implemented and missing features
Basic CSS selectors are implemented, but some jQuery selectors are missing.
This table lists all jQuery selectors and their status.
Selector name | Status | Specification |
---|---|---|
All Selector | | |
:animated Selector | | |
Attribute Contains Prefix Selector | | |
Attribute Contains Selector | | |
Attribute Contains Word Selector | | |
Attribute Ends With Selector | | |
Attribute Equals Selector | | |
Attribute Not Equal Selector | | |
Attribute Starts With Selector | | |
:button Selector | | |
:checkbox Selector | | |
:checked Selector | | |
Child Selector ("parent > child") | | |
Class Selector (".class") | | |
:contains() Selector | | |
Descendant Selector ("ancestor descendant") | | |
:disabled Selector | | |
Element Selector ("element") | | |
:empty Selector | | |
:enabled Selector | | |
:eq() Selector | | |
:even Selector | | |
:file Selector | | |
:first-child Selector | | |
:first-of-type Selector | | |
:first Selector | | |
:focus Selector | | |
:gt() Selector | | |
Has Attribute Selector [name] | | |
:has() Selector | | |
:header Selector | | |
:hidden Selector | | |
ID Selector ("#id") | | |
:image Selector | | |
:input Selector | | |
:lang() Selector | | |
:last-child Selector | | |
:last-of-type Selector | | |
:last Selector | | |
:lt() Selector | | |
Multiple Attribute Selector [name="value"][name2="value2"] | | |
Multiple Selector ("selector1, selector2, selectorN") | | |
Next Adjacent Selector ("prev + next") | | |
Next Siblings Selector ("prev ~ siblings") | | |
:not() Selector | | |
:nth-child() Selector | | |
:nth-last-child() Selector | | |
:nth-last-of-type() Selector | | |
:nth-of-type() Selector | | |
:odd Selector | | |
:only-child Selector | | |
:only-of-type Selector | | |
:parent Selector | | |
:password Selector | | |
:radio Selector | | |
:reset Selector | | |
:root Selector | | |
:selected Selector | | |
:submit Selector | | |
:target Selector | | |
:text Selector | | |
:visible Selector | | |

[1] The :root selector seems to be useless in our case, because with the HTML parser the root is always the html node.