This article describes how to use the RefSeq Type Provider to remotely access genomic data stored in the
RefSeq database. This Type Provider collects and parses the genomic data
for a specified organism and generates a static type containing its metadata and sequence.
The RefSeq Type Provider uses .NET Bio to parse the RefSeq data files
and BioFSharp to provide utilities for manipulating genomic sequences.
To load the RefSeq Type Provider, a script can use the NuGet syntax to reference the BioProviders package, shown below.
You can optionally include the BioFSharp package. While it's not required to use the basic BioProviders functions, it can be used to explore the metadata of the provided types, as shown in a later example.
#r "nuget: BioProviders"
#r "nuget: BioFSharp"
If you are creating an F# library or application, BioProviders can be added as a package reference, either through your IDE or by running the dotnet add package BioProviders
command from the command line in your project folder.
BioProviders can then be used in your script or code with an open declaration. Opening its dependencies should not be required. BioFSharp is opened here because later examples use it.
open BioProviders
open BioFSharp
The RefSeq Type Provider will be demonstrated for this RefSeq assembly
of the Staphylococcus borealis species. To create a typed representation of the assembly, two pieces of information
must be given to the Type Provider:
- Species name
- RefSeq assembly accession
For this example, the species name is "Staphylococcus borealis" and the RefSeq assembly accession is "GCF_001224225.1".
To find this information:
You can then select the assembly's RefSeq (as well as GenBank) accession from the list that appears.
Passing this information to the Type Provider generates the Assembly Type. The genomic data can then be extracted from the
Assembly Type by invoking the Genome method. This is demonstrated below.
// Define species name and RefSeq assembly accession.
let [<Literal>] Species = "Staphylococcus borealis"
let [<Literal>] Accession = "GCF_001224225.1"
// Create RefSeq assembly type.
type Borealis = RefSeqProvider<Species, Accession>
// Extract statically-typed genome data.
let genome = Borealis.Genome()
Each genome is accompanied by metadata describing the organism and sequence recorded in the assembly. This metadata can
be extracted using the Metadata field of the Genome Type created previously. The Metadata type is largely based on the one
provided by .NET Bio, modified to be more idiomatic in F#.
Below is an example of how the raw metadata type can be retrieved and displayed:
// Extract the metadata.
let metadata = genome.Metadata
// Display the metadata type.
printf "%A" metadata
{ Locus = Some { Date = Some 4/27/2023 12:00:00 AM
DivisionCode = Some CON
MoleculeType = Some DNA
Name = Some "NZ_CUEE01000001"
SequenceLength = 563044
SequenceType = Some "bp"
Strand = None
StrandTopology = Some Linear }
Definition =
Some "Staphylococcus borealis strain 51-48, whole genome shotgun sequence."
Accession = Some { Primary = Some "NZ_CUEE01000001"
Secondary = Some ["NZ_CUEE01000000"] }
Version = Some { Accession = Some "NZ_CUEE01000001"
CompoundAccession = Some "NZ_CUEE01000001.1"
GiNumber = None
Version = Some "1" }
DbLinks =
Some
[{ Numbers = Some [" PRJNA224116"]
Type = Some BioProject }; { Numbers = Some [" SAMEA1035138"]
Type = None };
{ Numbers = Some [" GCF_001224225.1"]
Type = None }]
DbSource = None
Keywords = Some "WGS; RefSeq."
Primary = None
Source =
Some
{ CommonName = Some "Staphylococcus borealis"
Organism =
Some
{ ClassLevels =
Some
"Bacteria; Bacillota; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus."
Genus = Some "Staphylococcus"
Species = Some "borealis" } }
References =
Some
[{ Authors = Some "Informatics,Pathogen."
Consortiums = None
Journal =
Some
"Submitted (10-MAR-2015) SC, Wellcome Trust Sanger Institute, CB10 1SA, United Kingdom"
Location = None
Medline = None
Number = 1
PubMed = None
Remarks = None
Title = Some "Direct Submission" }]
Comments =
Some
["REFSEQ INFORMATION: The reference sequence is identical to
CUEE01000001.1.
The annotation was added by the NCBI Prokaryotic Genome Annotation
Pipeline (PGAP). Information about PGAP can be found here:
https://www.ncbi.nlm.nih.gov/genome/annotation_prok/
##Genome-Annotation-Data-START##
Annotation Provider :: NCBI RefSeq
Annotation Date :: 04/27/2023 01:28:26
Annotation Pipeline :: NCBI Prokaryotic Genome
Annotation Pipeline (PGAP)
Annotation Method :: Best-placed reference protein
set; GeneMarkS-2+
Annotation Software revision :: 6.5
Features Annotated :: Gene; CDS; rRNA; tRNA; ncRNA
Genes (total) :: 2,650
CDSs (total) :: 2,584
Genes (coding) :: 2,507
CDSs (with protein) :: 2,507
Genes (RNA) :: 66
rRNAs :: 2, 1, 1 (5S, 16S, 23S)
complete rRNAs :: 2, 1, 1 (5S, 16S, 23S)
tRNAs :: 58
ncRNAs :: 4
Pseudo Genes (total) :: 77
CDSs (without protein) :: 77
Pseudo Genes (ambiguous residues) :: 0 of 77
Pseudo Genes (frameshifted) :: 29 of 77
Pseudo Genes (incomplete) :: 49 of 77
Pseudo Genes (internal stop) :: 29 of 77
Pseudo Genes (multiple problems) :: 23 of 77
##Genome-Annotation-Data-END##"]
Contig = Some "join(CUEE01000001.1:1..563044)"
Segment = None
Origin = None }
The metadata type consists of many fields, though not all fields of the metadata exist for all assemblies. Therefore, they are provided as option types, on which a match expression can be used. Below are examples of accessing fields from the example assembly.
✅ Example - Accessing a field that is provided.
// Print definition if exists.
match metadata.Definition with
| Some definition -> printf "%s" definition
| None -> printf "No definition provided."
Staphylococcus borealis strain 51-48, whole genome shotgun sequence.
❌ Example - Accessing a field that is not provided.
// Print database source if exists.
match metadata.DbSource with
| Some dbsource -> printf "%s" dbsource
| None -> printf "No database source provided."
No database source provided.
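When only a fallback value is needed, the same handling can be written more compactly with Option.defaultValue from the F# core library. The sketch below is a minimal illustration, not the provider's own API: SampleMetadata is a hypothetical stand-in for the provided Metadata type, modelling only the two fields used above so that the code runs without network access.

```fsharp
// Hypothetical stand-in for the provided Metadata type (assumption: only two fields are modelled).
type SampleMetadata =
    { Definition: string option
      DbSource: string option }

// Values copied from the example assembly output shown earlier.
let sample =
    { Definition = Some "Staphylococcus borealis strain 51-48, whole genome shotgun sequence."
      DbSource = None }

// Return the field's value, or the fallback message when the field is absent.
let describe (fallback: string) (field: string option) : string =
    field |> Option.defaultValue fallback

let definitionText = describe "No definition provided." sample.Definition
let dbSourceText = describe "No database source provided." sample.DbSource

printfn "%s" definitionText
printfn "%s" dbSourceText
```

A match expression remains the better choice when the two cases need different logic rather than just a different string.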
The genomic sequence for the organism can be extracted using the Sequence field of the Genome Type created previously.
This field provides a BioFSharp BioSeq containing a series of Nucleotides. The BioFSharp documentation
describes its sequence containers in more detail.
An example of accessing and manipulating the RefSeqProvider genomic sequence using BioFSharp is provided below:
// Extract the BioFSharp BioSeq.
let sequence = genome.Sequence
// Display the sequence type.
printf "%A" sequence
// Take the complement, then transcribe and translate the coding strand.
sequence
|> BioSeq.complement
|> BioSeq.transcribeCodingStrand
|> BioSeq.translate 0
seq [Val; Leu; Val; Ter; ...]
Wildcard operators are supported in both the Species and Accession provided to the RefSeqProvider. Placing an asterisk "*"
at the end of a Species or Accession name matches every species or accession that starts with the provided pattern.
For example, we can get all Staphylococcus species starting with the letter 'c' and assembly accessions starting with
'GCF_01':
// Define species name and RefSeq assembly accession using wildcards.
let [<Literal>] SpeciesPattern = "Staphylococcus c*"
let [<Literal>] AccessionPattern = "GCF_01*"
// Create RefSeq type containing all species matching the species pattern.
type SpeciesCollection = RefSeqProvider<SpeciesPattern, AccessionPattern>
// Select the species types.
type Capitis = SpeciesCollection.``Staphylococcus capitis``
type Cohnii = SpeciesCollection.``Staphylococcus cohnii``
// Select assemblies.
type Assembly1 = Capitis.``GCF_012926605.1``
type Assembly2 = Capitis.``GCF_012926635.1``
type Assembly3 = Cohnii.``GCF_013602215.1``
type Assembly4 = Cohnii.``GCF_013602265.1``
// Extract statically-typed genome data.
let data = Assembly1.Genome()
// Show the assembly's definition.
match data.Metadata.Definition with
| Some definition -> printf "%s" definition
| None -> printf "No definition provided."
Staphylococcus capitis strain 18-857 NODE_1, whole genome shotgun sequence.
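To make the matching rule concrete, the sketch below mimics trailing-asterisk matching with a plain prefix comparison. The function matchesPattern is hypothetical and is not part of BioProviders; how the provider implements matching internally, including whether it is case-sensitive, is not stated here, so this only illustrates the behaviour described above.

```fsharp
open System

// Hypothetical helper (not part of BioProviders): a pattern ending in '*'
// matches any name that starts with the text before the asterisk; any other
// pattern must equal the name exactly. Case-sensitive comparison is an assumption.
let matchesPattern (pattern: string) (name: string) : bool =
    if pattern.EndsWith("*", StringComparison.Ordinal) then
        let prefix = pattern.Substring(0, pattern.Length - 1)
        name.StartsWith(prefix, StringComparison.Ordinal)
    else
        String.Equals(pattern, name, StringComparison.Ordinal)

// Species names and accessions taken from the examples in this article.
let species =
    [ "Staphylococcus borealis"
      "Staphylococcus capitis"
      "Staphylococcus cohnii"
      "Staphylococcus lugdunensis" ]

let matchedSpecies = species |> List.filter (matchesPattern "Staphylococcus c*")

let matchedAccession = matchesPattern "GCF_01*" "GCF_012926605.1"
let unmatchedAccession = matchesPattern "GCF_01*" "GCF_001224225.1"

printfn "%A" matchedSpecies
printfn "%b %b" matchedAccession unmatchedAccession
```

Of the four species listed, only Staphylococcus capitis and Staphylococcus cohnii pass the filter, and the borealis accession GCF_001224225.1 does not match "GCF_01*" because it begins with "GCF_00".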
The Accession parameter can also be omitted from the RefSeqProvider. In this case, all assemblies for the given species will
be matched. For example:
// Define species name.
let [<Literal>] SpeciesName = "Staphylococcus lugdunensis"
// Create RefSeq type containing all assemblies for the species.
type Assemblies = RefSeqProvider<SpeciesName>
// Select assemblies.
type Assembly = Assemblies.``GCF_001546615.1``
// Show the assembly's primary accession.
match (Assembly.Genome()).Metadata.Accession with
| Some accession ->
    // The inner match must be indented so its cases are not confused with the outer ones.
    match accession.Primary with
    | Some primary -> printf "%s" primary
    | None -> printf "No primary accession provided."
| None -> printf "No accession provided."
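Nested option fields such as the primary accession can also be flattened with Option.bind, which avoids the nested match. The sketch below is illustrative only: SampleAccession and SampleMetadata are hypothetical stand-ins that mirror the shape of the provided records so the code runs without network access, and the values are copied from the Staphylococcus borealis metadata shown earlier. The two separate fallback messages of the nested match collapse into a single one in this form.

```fsharp
// Hypothetical stand-ins mirroring the shape of the provided types (assumption).
type SampleAccession =
    { Primary: string option
      Secondary: string list option }

type SampleMetadata =
    { Accession: SampleAccession option }

// Chain through both optional layers; None at either level yields None.
let primaryAccession (metadata: SampleMetadata) : string option =
    metadata.Accession |> Option.bind (fun accession -> accession.Primary)

// Values copied from the Staphylococcus borealis metadata output.
let borealisAccession =
    { Primary = Some "NZ_CUEE01000001"
      Secondary = Some [ "NZ_CUEE01000000" ] }

let withAccession = { Accession = Some borealisAccession }
let withoutAccession = { Accession = None }

let found = primaryAccession withAccession
let missing = primaryAccession withoutAccession

printfn "%s" (found |> Option.defaultValue "No primary accession provided.")
printfn "%s" (missing |> Option.defaultValue "No primary accession provided.")
```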