
ncbi-taxonomist Documentation Release 1.2.1+8580b9b

Jan P Buchmann

2020-11-15

Contents:

1 Installation
2 Basic functions
3 Cookbook
4 Container
5 Frequently Asked Questions
6 Module references
7 Synopsis
8 Requirements and Dependencies
9 Contact
10 Indices and tables
Python Module Index
Index


1.2.1+8580b9b :: 2020-11-15

CHAPTER 1

Installation

Content

• Local pip install (no root required)
• Global pip install (root required)

ncbi-taxonomist is available on PyPI via pip. If you use a Python package manager other than pip, please consult its documentation. If you are installing ncbi-taxonomist on a non-Linux system, treat the proposed methods as guidelines and adjust them as required.

Important: If any of the proposed commands are unfamiliar to you, do not just invoke them; look them up first, e.g. in the man pages or online. If you are unfamiliar with pip, check pip -h.

Note: Python 3 vs. Python 2. Because Python 2 and Python 3 can coexist on one system, some installation commands may have to be invoked slightly differently. Development and support for Python 2 ended in January 2020, and it should not be used anymore. ncbi-taxonomist requires Python >= 3.8. Depending on your OS and/or distribution, the default pip command may install packages for either Python 2 or Python 3. Make sure you use the pip for Python 3, e.g. pip3 on Ubuntu.

1.1 Local pip install (no root required)

$: pip install ncbi-taxonomist --user


On Linux, ncbi-taxonomist will be installed to $HOME/.local/bin. If you cannot invoke ncbi-taxonomist from the command line, it's likely that $HOME/.local/bin is not in your $PATH (check with echo $PATH). In that case, choose one of the following options:

• add $HOME/.local/bin to your $PATH:
  – echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
• add an alias:
  – see man bash or https://www.tldp.org/LDP/abs/html/aliases.html
• use $HOME/.local/bin/ncbi-taxonomist explicitly
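The PATH check described above can also be scripted. The following is a minimal sketch, not part of ncbi-taxonomist; the function name dir_in_path is hypothetical and only the Python standard library is used.

```python
import os


def dir_in_path(directory, path_value):
    """Return True if directory is one of the entries of a PATH-style string."""
    target = os.path.normpath(directory)
    # Skip empty entries, which appear when PATH contains '::' or a trailing ':'.
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == target for entry in entries)


def local_bin_dir():
    """Return the per-user install directory mentioned above ($HOME/.local/bin)."""
    return os.path.join(os.path.expanduser("~"), ".local", "bin")
```

Calling dir_in_path(local_bin_dir(), os.environ.get("PATH", "")) reports whether a local pip install should be reachable from the command line.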

1.2 Global pip install (root required)

$: pip install ncbi-taxonomist

ncbi-taxonomist should now be in /usr/local/bin and in your $PATH.

CHAPTER 2

Basic functions

All ncbi-taxonomist commands have the following underlying structure:

ncbi-taxonomist

This section shows the basic usage of ncbi-taxonomist. More complex examples, including data extraction with jq, can be found in the Cookbook. The output is a single JSON object or XML tree per line for each queried taxid, name, or accession. The examples show pretty-printed single results for clarity only.

Contents

• Collect
  – Output format
    * JSON output
    * XML output
• Map
  – Taxids and names
  – Mapping accession
  – Supported access Entrez databases
  – Output format
    * JSON output
      · Single mapping result
      · Multiple mapping results
    * XML output
      · Single mapping result
      · Multiple mapping results
• Resolve
  – Taxids and names
  – Accessions
  – Output format
    * JSON output
      · Single mapping result
      · Multiple mapping results
    * XML output
      · Single mapping result
      · Multiple mapping results
• Import
  – Local database schema
  – Import taxa via collect
  – Import taxa via resolve
  – Import accessions
• Subtree
  – Collecting subtrees
    * Between two given ranks
    * Collect one specific rank
    * Collect from a given rank to root and print XML
    * Collect from a given rank to lowest rank
  – Output format
    * JSON output
    * XML output
• Group
  – Creating a group
  – Retrieve a group

2.1 Collect

The collect command fetches taxa from the Entrez database. If taxids or names share parts of the same lineage, the shared taxa are printed only once.

2.1.1 Output format

The output describes the collected taxa, one per line. A single taxon has the following structure, shown here for chimpanzee (taxid 9598):

{
  "taxid": 9598,
  "rank": "species",
  "parentid": 9596,
  "name": "Pan troglodytes",
  "names": {
    "Pan troglodytes": "scientific_name",
    "chimpanzee": "GenbankCommonName"
  }
}

Collecting taxa for chimpanzee and human:

ncbi-taxonomist collect -n chimpanzee human

JSON output

{"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":"cellular organisms"}
{"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":"scientific_name"},"parentid":131567,"name":"Eukaryota"}
{"taxid":33154,"rank":"clade","names":{"Opisthokonta":"scientific_name"},"parentid":2759,"name":"Opisthokonta"}
{"taxid":33208,"rank":"kingdom","names":{"Metazoa":"scientific_name"},"parentid":33154,"name":"Metazoa"}
{"taxid":6072,"rank":"clade","names":{"Eumetazoa":"scientific_name"},"parentid":33208,"name":"Eumetazoa"}
{"taxid":33213,"rank":"clade","names":{"Bilateria":"scientific_name"},"parentid":6072,"name":"Bilateria"}
{"taxid":33511,"rank":"clade","names":{"Deuterostomia":"scientific_name"},"parentid":33213,"name":"Deuterostomia"}
{"taxid":7711,"rank":"phylum","names":{"Chordata":"scientific_name"},"parentid":33511,"name":"Chordata"}
{"taxid":89593,"rank":"subphylum","names":{"Craniata":"scientific_name"},"parentid":7711,"name":"Craniata"}
{"taxid":7742,"rank":"clade","names":{"Vertebrata":"scientific_name"},"parentid":89593,"name":"Vertebrata"}
{"taxid":7776,"rank":"clade","names":{"Gnathostomata":"scientific_name"},"parentid":7742,"name":"Gnathostomata"}
{"taxid":117570,"rank":"clade","names":{"Teleostomi":"scientific_name"},"parentid":7776,"name":"Teleostomi"}
{"taxid":117571,"rank":"clade","names":{"Euteleostomi":"scientific_name"},"parentid":117570,"name":"Euteleostomi"}
{"taxid":8287,"rank":"superclass","names":{"Sarcopterygii":"scientific_name"},"parentid":117571,"name":"Sarcopterygii"}
{"taxid":1338369,"rank":"clade","names":{"Dipnotetrapodomorpha":"scientific_name"},"parentid":8287,"name":"Dipnotetrapodomorpha"}
{"taxid":32523,"rank":"clade","names":{"Tetrapoda":"scientific_name"},"parentid":1338369,"name":"Tetrapoda"}
{"taxid":32524,"rank":"clade","names":{"Amniota":"scientific_name"},"parentid":32523,"name":"Amniota"}
{"taxid":40674,"rank":"class","names":{"Mammalia":"scientific_name"},"parentid":32524,"name":"Mammalia"}
{"taxid":32525,"rank":"clade","names":{"Theria":"scientific_name"},"parentid":40674,"name":"Theria"}
{"taxid":9347,"rank":"clade","names":{"Eutheria":"scientific_name"},"parentid":32525,"name":"Eutheria"}
{"taxid":1437010,"rank":"clade","names":{"Boreoeutheria":"scientific_name"},"parentid":9347,"name":"Boreoeutheria"}
{"taxid":314146,"rank":"superorder","names":{"Euarchontoglires":"scientific_name"},"parentid":1437010,"name":"Euarchontoglires"}
{"taxid":9443,"rank":"order","names":{"Primates":"scientific_name"},"parentid":314146,"name":"Primates"}
{"taxid":376913,"rank":"suborder","names":{"Haplorrhini":"scientific_name"},"parentid":9443,"name":"Haplorrhini"}
{"taxid":314293,"rank":"infraorder","names":{"Simiiformes":"scientific_name"},"parentid":376913,"name":"Simiiformes"}
{"taxid":9526,"rank":"parvorder","names":{"Catarrhini":"scientific_name"},"parentid":314293,"name":"Catarrhini"}
{"taxid":314295,"rank":"superfamily","names":{"Hominoidea":"scientific_name"},"parentid":9526,"name":"Hominoidea"}
{"taxid":9604,"rank":"family","names":{"Hominidae":"scientific_name"},"parentid":314295,"name":"Hominidae"}
{"taxid":207598,"rank":"subfamily","names":{"Homininae":"scientific_name"},"parentid":9604,"name":"Homininae"}
{"taxid":9605,"rank":"genus","names":{"Homo":"scientific_name"},"parentid":207598,"name":"Homo"}
{"taxid":9606,"rank":"species","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":"CommonName"},"parentid":9605,"name":"Homo sapiens"}
{"taxid":9596,"rank":"genus","names":{"Pan":"scientific_name"},"parentid":207598,"name":"Pan"}
{"taxid":9598,"rank":"species","names":{"Pan troglodytes":"scientific_name","chimpanzee":"GenbankCommonName"},"parentid":9596,"name":"Pan troglodytes"}
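Because each taxon carries its parentid, the line-delimited JSON output of collect can be turned back into lineages. The following is a minimal sketch under that assumption; load_taxa and lineage_names are hypothetical helper names, not part of ncbi-taxonomist, and the sample is a three-line excerpt of the output above.

```python
import json


def load_taxa(lines):
    """Parse line-delimited JSON taxa into a dict keyed by taxid."""
    taxa = {}
    for line in lines:
        line = line.strip()
        if line:
            taxon = json.loads(line)
            taxa[taxon["taxid"]] = taxon
    return taxa


def lineage_names(taxa, taxid):
    """Follow parentid links upwards from taxid and return the names, leaf first."""
    names = []
    seen = set()  # guards against accidental cycles in malformed input
    while taxid is not None and taxid in taxa and taxid not in seen:
        seen.add(taxid)
        names.append(taxa[taxid]["name"])
        taxid = taxa[taxid]["parentid"]
    return names


# Excerpt of the collect output shown above.
sample = [
    '{"taxid":207598,"rank":"subfamily","names":{"Homininae":"scientific_name"},"parentid":9604,"name":"Homininae"}',
    '{"taxid":9596,"rank":"genus","names":{"Pan":"scientific_name"},"parentid":207598,"name":"Pan"}',
    '{"taxid":9598,"rank":"species","names":{"Pan troglodytes":"scientific_name","chimpanzee":"GenbankCommonName"},"parentid":9596,"name":"Pan troglodytes"}',
]
taxa = load_taxa(sample)
```

The walk stops at the first parent that is missing from the input, so a partial file yields a partial lineage instead of an error.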

XML output

[XML example not reproduced: the element markup was lost in extraction. The XML output lists the same taxa as the JSON output above, each with its taxid, rank, name, parentid, and typed names.]

2.2 Map

The map command maps taxonomic information for taxids, names, and accessions. If the -edb argument is not specified, the nucleotide Entrez database is assumed.

2.2.1 Taxids and names

Taxids and names can be mapped together. They can be separated by commas and/or spaces. However, names containing spaces need to be enclosed in single quotes ('). For example:

$: ncbi-taxonomist map -t 562, 10508 -n man 'Influenza B virus (B/Acre/121609/2012)', chimpanzee
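When such a command is assembled by a script, the quoting of names with spaces can be left to the standard library. The following is a minimal sketch; build_map_command is a hypothetical helper, and it uses only the map command and the -t and -n arguments shown above, with space-separated values.

```python
import shlex


def build_map_command(taxids, names):
    """Build the argument list for 'ncbi-taxonomist map' from taxids and names."""
    args = ["ncbi-taxonomist", "map"]
    if taxids:
        args += ["-t"] + [str(taxid) for taxid in taxids]
    if names:
        args += ["-n"] + list(names)
    return args


args = build_map_command(
    [562, 10508],
    ["man", "Influenza B virus (B/Acre/121609/2012)", "chimpanzee"],
)
# shlex.join (Python >= 3.8) adds single quotes only where the shell needs them.
command_line = shlex.join(args)
```

Passing the args list directly to subprocess.run avoids shell quoting altogether; command_line is mainly useful for logging or for pasting into a terminal.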

2.2.2 Mapping accession

The default database for mapping accessions is nucleotide. To map an accession from a different database, the database has to be specified with the --entrezdb/-edb argument.

2.2.3 Supported access Entrez databases

Entrez database   Example
assembly          ncbi-taxonomist map -edb assembly -a ASM1001476v1 ViralProj177933
bioproject        ncbi-taxonomist map -edb bioproject -a PRJNA604394
nucleotide        ncbi-taxonomist map -edb nucleotide -a MH510449.1
                  ncbi-taxonomist map -a MH510449.1
protein           ncbi-taxonomist map -a YP_009345145 -edb protein

Note: Querying the following databases does not return the queried accession in the results. Therefore, if more than one accession is requested, the results cannot show which accession corresponds to which result. To keep the one-to-one relationship, each accession from these databases needs to be queried individually and not as a batch query. Future releases will try to implement such queries.

• biosample
• biosystems
• cdd
• dbvar
• gap
• gapplus
• gene
• geoprofiles: using accessions like GDS6063 should work
• proteinclusters: commontaxonomy attribute can be used as name
• sra: Only XML results. Needs a dedicated parser
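The one-by-one querying described in the note can be prepared with a small helper. The following is a minimal sketch; single_accession_commands is a hypothetical name, and the accessions used in the example are placeholders, not real records. It only builds the argument lists and does not contact Entrez.

```python
def single_accession_commands(database, accessions):
    """Return one 'ncbi-taxonomist map' argument list per accession."""
    return [
        ["ncbi-taxonomist", "map", "-edb", database, "-a", accession]
        for accession in accessions
    ]


# Placeholder accessions for illustration only.
commands = single_accession_commands("gene", ["ACCESSION_1", "ACCESSION_2"])
```

Each argument list can then be run on its own, for example with subprocess.run, so that every result is known to belong to the single accession that was queried.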

2.2.4 Output format

The result shows the used command, query, type of result, and the corresponding taxon.

JSON output

Single mapping result

• Taxon:

{
  "mode": "mapping",
  "query": "Influenza B virus (B/Acre/121609/2012)",
  "cast": "taxon",
  "parentid": 11520,
  "name": "Influenza B virus (B/Acre/121609/2012)",
  "taxon": {
    "taxid": 1334390,
    "rank": "no rank",
    "names": {
      "Influenza B virus (B/Acre/121609/2012)": "scientific_name"
    }
  }
}

• Accession:

{
  "mode": "mapping",
  "query": "ASM1001476v1",
  "cast": "accs",
  "db": "assembly",
  "uid": 5515991,
  "accession": {
    "taxid": 1962788,
    "accessions": {
      "assemblyaccession": "GCA_010014765.1",
      "lastmajorreleaseaccession": "GCA_010014765.1",
      "assemblyname": "ASM1001476v1"
    }
  }
}

Multiple mapping results

{"mode":"mapping","query":"Influenza B virus (B/Acre/121609/2012)","cast":"taxon","taxon":{"taxid":1334390,"rank":"no rank","names":{"Influenza B virus (B/Acre/121609/2012)":"scientific_name"},"parentid":11520,"name":"Influenza B virus (B/Acre/121609/2012)"}}
{"mode":"mapping","query":"man","cast":"taxon","taxon":{"taxid":9606,"rank":"species","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":"CommonName"},"parentid":9605,"name":"Homo sapiens"}}
{"mode":"mapping","query":"562","cast":"taxon","taxon":{"taxid":562,"rank":"species","names":{"Escherichia coli":"scientific_name","Bacillus coli":"Synonym","Bacterium coli":"Synonym","Bacterium coli commune":"Synonym","Enterococcus coli":"Synonym","E. coli":"CommonName","Escherichia sp. 3_2_53FAA":"Includes","Escherichia sp. MAR":"Includes","bacterium 10a":"Includes","bacterium E3":"Includes","Escherichia/Shigella coli":"EquivalentName","ATCC 11775":"type material","ATCC:11775":"type material","BCCM/LMG:2092":"type material","CCUG 24":"type material","CCUG 29300":"type material","CCUG:24":"type material","CCUG:29300":"type material","CIP 54.8":"type material","CIP:54.8":"type material","DSM 30083":"type material","DSM:30083":"type material","IAM 12119":"type material","IAM:12119":"type material","JCM 1649":"type material","JCM:1649":"type material","LMG 2092":"type material","LMG:2092":"type material","NBRC 102203":"type material","NBRC:102203":"type material","NCCB 54008":"type material","NCCB:54008":"type material","NCTC 9001":"type material","NCTC:9001":"type material","personal::U5/41":"type material","strain U5/41":"type material"},"parentid":561,"name":"Escherichia coli"}}
{"mode":"mapping","query":"ASM1001476v1","cast":"accs","accession":{"taxid":1962788,"accessions":{"assemblyaccession":"GCA_010014765.1","lastmajorreleaseaccession":"GCA_010014765.1","assemblyname":"ASM1001476v1"},"db":"assembly","uid":5515991}}
{"mode":"mapping","query":"PRJNA604394","cast":"accs","accession":{"taxid":573,"accessions":{"project_id":604394,"project_acc":"PRJNA604394","project_name":"Klebsiella pneumoniae strain:S01"},"db":"bioproject","uid":604394}}
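Mapping results can be post-processed line by line, using the cast field to decide where the taxid is stored. The following is a minimal sketch based only on the fields visible in the multiple mapping results above; taxid_by_query is a hypothetical helper and the sample lines are shortened copies of that output.

```python
import json


def taxid_by_query(lines):
    """Map each query of a 'map' run to the taxid it was assigned."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("mode") != "mapping":
            continue
        if record["cast"] == "taxon":
            # Taxid and name queries store the taxon under "taxon".
            result[record["query"]] = record["taxon"]["taxid"]
        elif record["cast"] == "accs":
            # Accession queries store the taxid under "accession".
            result[record["query"]] = record["accession"]["taxid"]
    return result


# Shortened copies of two result lines shown above.
sample = [
    '{"mode":"mapping","query":"man","cast":"taxon","taxon":{"taxid":9606,"rank":"species","parentid":9605,"name":"Homo sapiens"}}',
    '{"mode":"mapping","query":"PRJNA604394","cast":"accs","accession":{"taxid":573,"db":"bioproject","uid":604394}}',
]
mapping = taxid_by_query(sample)
```

Records with any other mode or cast are skipped rather than treated as errors, which keeps the helper usable on mixed output.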

XML output

Single mapping result

[XML markup lost in extraction. Surviving values of the taxon example: query man, taxid 9606, rank species, name Homo sapiens, parentid 9605, names Homo sapiens, human, man. Surviving values of the accession example: query ASM1001476v1, taxid 1962788, uid 5515991, db assembly, accessions GCA_010014765.1, GCA_010014765.1, ASM1001476v1.]

Multiple mapping results

[XML markup lost in extraction. The example shows the mapping results for Influenza B virus (B/Acre/121609/2012), man, 562, PRJNA604394, and ASM1001476v1, with the same fields as the JSON output above.]

2.3 Resolve

The resolve command resolves lineages. Names and taxids can be resolved directly, while accessions need a mapping step first.

2.3.1 Taxids and names

ncbi-taxonomist resolve -n man -t2

2.3.2 Accessions

$: ncbi-taxonomist map -a QZWG01000002.1 MG831203 | ncbi-taxonomist resolve -m

2.3.3 Output format

The result shows the used command, query, type of result, and the corresponding lineage. In case of queried names or taxids, the data for the taxon used as query is shown. For accessions, the queried accession data is shown.

JSON output

Single mapping result

{
  "mode": "resolve",
  "query": "man",
  "cast": "taxon",
  "parentid": 9605,
  "name": "Homo sapiens",
  "taxon": {
    "taxid": 9606,
    "rank": "species",
    "names": {
      "Homo sapiens": "scientific_name",
      "human": "GenbankCommonName",
      "man": "CommonName"
    }
  },
  "lineage": [
    {"taxid":9606,"rank":"species","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":"CommonName"},"parentid":9605,"name":"Homo sapiens"},
    {"taxid":9605,"rank":"genus","names":{"Homo":"scientific_name"},"parentid":207598,"name":"Homo"},
    {"taxid":207598,"rank":"subfamily","names":{"Homininae":"scientific_name"},"parentid":9604,"name":"Homininae"},
    {"taxid":9604,"rank":"family","names":{"Hominidae":"scientific_name"},"parentid":314295,"name":"Hominidae"},
    {"taxid":314295,"rank":"superfamily","names":{"Hominoidea":"scientific_name"},"parentid":9526,"name":"Hominoidea"},
    {"taxid":9526,"rank":"parvorder","names":{"Catarrhini":"scientific_name"},"parentid":314293,"name":"Catarrhini"},
    {"taxid":314293,"rank":"infraorder","names":{"Simiiformes":"scientific_name"},"parentid":376913,"name":"Simiiformes"},
    {"taxid":376913,"rank":"suborder","names":{"Haplorrhini":"scientific_name"},"parentid":9443,"name":"Haplorrhini"},
    {"taxid":9443,"rank":"order","names":{"Primates":"scientific_name"},"parentid":314146,"name":"Primates"},
    {"taxid":314146,"rank":"superorder","names":{"Euarchontoglires":"scientific_name"},"parentid":1437010,"name":"Euarchontoglires"},
    {"taxid":1437010,"rank":"clade","names":{"Boreoeutheria":"scientific_name"},"parentid":9347,"name":"Boreoeutheria"},
    {"taxid":9347,"rank":"clade","names":{"Eutheria":"scientific_name"},"parentid":32525,"name":"Eutheria"},
    {"taxid":32525,"rank":"clade","names":{"Theria":"scientific_name"},"parentid":40674,"name":"Theria"},
    {"taxid":40674,"rank":"class","names":{"Mammalia":"scientific_name"},"parentid":32524,"name":"Mammalia"},
    {"taxid":32524,"rank":"clade","names":{"Amniota":"scientific_name"},"parentid":32523,"name":"Amniota"},
    {"taxid":32523,"rank":"clade","names":{"Tetrapoda":"scientific_name"},"parentid":1338369,"name":"Tetrapoda"},
    {"taxid":1338369,"rank":"clade","names":{"Dipnotetrapodomorpha":"scientific_name"},"parentid":8287,"name":"Dipnotetrapodomorpha"},
    {"taxid":8287,"rank":"superclass","names":{"Sarcopterygii":"scientific_name"},"parentid":117571,"name":"Sarcopterygii"},
    {"taxid":117571,"rank":"clade","names":{"Euteleostomi":"scientific_name"},"parentid":117570,"name":"Euteleostomi"},
    {"taxid":117570,"rank":"clade","names":{"Teleostomi":"scientific_name"},"parentid":7776,"name":"Teleostomi"},
    {"taxid":7776,"rank":"clade","names":{"Gnathostomata":"scientific_name"},"parentid":7742,"name":"Gnathostomata"},
    {"taxid":7742,"rank":"clade","names":{"Vertebrata":"scientific_name"},"parentid":89593,"name":"Vertebrata"},
    {"taxid":89593,"rank":"subphylum","names":{"Craniata":"scientific_name"},"parentid":7711,"name":"Craniata"},
    {"taxid":7711,"rank":"phylum","names":{"Chordata":"scientific_name"},"parentid":33511,"name":"Chordata"},
    {"taxid":33511,"rank":"clade","names":{"Deuterostomia":"scientific_name"},"parentid":33213,"name":"Deuterostomia"},
    {"taxid":33213,"rank":"clade","names":{"Bilateria":"scientific_name"},"parentid":6072,"name":"Bilateria"},
    {"taxid":6072,"rank":"clade","names":{"Eumetazoa":"scientific_name"},"parentid":33208,"name":"Eumetazoa"},
    {"taxid":33208,"rank":"kingdom","names":{"Metazoa":"scientific_name"},"parentid":33154,"name":"Metazoa"},
    {"taxid":33154,"rank":"clade","names":{"Opisthokonta":"scientific_name"},"parentid":2759,"name":"Opisthokonta"},
    {"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":"scientific_name"},"parentid":131567,"name":"Eukaryota"},
    {"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":"cellular organisms"}
  ]
}

Multiple mapping results

{"mode":"resolve","query":"man","cast":"taxon","taxon":{"taxid":9606,"rank":"species","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":"CommonName"},"parentid":9605,"name":"Homo sapiens"},"lineage":[{"taxid":9606,"rank":"species","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":"CommonName"},"parentid":9605,"name":"Homo sapiens"},{"taxid":9605,"rank":"genus","names":{"Homo":"scientific_name"},"parentid":207598,"name":"Homo"},{"taxid":207598,"rank":"subfamily","names":{"Homininae":"scientific_name"},"parentid":9604,"name":"Homininae"},{"taxid":9604,"rank":"family","names":{"Hominidae":"scientific_name"},"parentid":314295,"name":"Hominidae"},{"taxid":314295,"rank":"superfamily","names":{"Hominoidea":"scientific_name"},"parentid":9526,"name":"Hominoidea"},{"taxid":9526,"rank":"parvorder","names":{"Catarrhini":"scientific_name"},"parentid":314293,"name":"Catarrhini"},{"taxid":314293,"rank":"infraorder","names":{"Simiiformes":"scientific_name"},"parentid":376913,"name":"Simiiformes"},{"taxid":376913,"rank":"suborder","names":{"Haplorrhini":"scientific_name"},"parentid":9443,"name":"Haplorrhini"},{"taxid":9443,"rank":"order","names":{"Primates":"scientific_name"},"parentid":314146,"name":"Primates"},{"taxid":314146,"rank":"superorder","names":{"Euarchontoglires":"scientific_name"},"parentid":1437010,"name":"Euarchontoglires"},{"taxid":1437010,"rank":"clade","names":{"Boreoeutheria":"scientific_name"},"parentid":9347,"name":"Boreoeutheria"},{"taxid":9347,"rank":"clade","names":{"Eutheria":"scientific_name"},"parentid":32525,"name":"Eutheria"},{"taxid":32525,"rank":"clade","names":{"Theria":"scientific_name"},"parentid":40674,"name":"Theria"},{"taxid":40674,"rank":"class","names":{"Mammalia":"scientific_name"},"parentid":32524,"name":"Mammalia"},{"taxid":32524,"rank":"clade","names":{"Amniota":"scientific_name"},"parentid":32523,"name":"Amniota"},{"taxid":32523,"rank":"clade","names":{"Tetrapoda":"scientific_name"},"parentid":1338369,"name":"Tetrapoda"},{"taxid":1338369,"rank":"clade","names":{"Dipnotetrapodomorpha":"scientific_name"},"parentid":8287,"name":"Dipnotetrapodomorpha"},{"taxid":8287,"rank":"superclass","names":{"Sarcopterygii":"scientific_name"},"parentid":117571,"name":"Sarcopterygii"},{"taxid":117571,"rank":"clade","names":{"Euteleostomi":"scientific_name"},"parentid":117570,"name":"Euteleostomi"},{"taxid":117570,"rank":"clade","names":{"Teleostomi":"scientific_name"},"parentid":7776,"name":"Teleostomi"},{"taxid":7776,"rank":"clade","names":{"Gnathostomata":"scientific_name"},"parentid":7742,"name":"Gnathostomata"},{"taxid":7742,"rank":"clade","names":{"Vertebrata":"scientific_name"},"parentid":89593,"name":"Vertebrata"},{"taxid":89593,"rank":"subphylum","names":{"Craniata":"scientific_name"},"parentid":7711,"name":"Craniata"},{"taxid":7711,"rank":"phylum","names":{"Chordata":"scientific_name"},"parentid":33511,"name":"Chordata"},{"taxid":33511,"rank":"clade","names":{"Deuterostomia":"scientific_name"},"parentid":33213,"name":"Deuterostomia"},{"taxid":33213,"rank":"clade","names":{"Bilateria":"scientific_name"},"parentid":6072,"name":"Bilateria"},{"taxid":6072,"rank":"clade","names":{"Eumetazoa":"scientific_name"},"parentid":33208,"name":"Eumetazoa"},{"taxid":33208,"rank":"kingdom","names":{"Metazoa":"scientific_name"},"parentid":33154,"name":"Metazoa"},{"taxid":33154,"rank":"clade","names":{"Opisthokonta":"scientific_name"},"parentid":2759,"name":"Opisthokonta"},{"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":"scientific_name"},"parentid":131567,"name":"Eukaryota"},{"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":"cellular organisms"}]}
{"mode":"resolve","query":"2","cast":"taxon","taxon":{"taxid":2,"rank":"superkingdom","names":{"Bacteria":"scientific_name","eubacteria":"GenbankCommonName","bacteria":"BlastName","Monera":"Inpart","Procaryotae":"Inpart","Prokaryota":"Inpart","Prokaryotae":"Inpart","prokaryote":"Inpart","prokaryotes":"Inpart"},"parentid":131567,"name":"Bacteria"},"lineage":[{"taxid":2,"rank":"superkingdom","names":{"Bacteria":"scientific_name","eubacteria":"GenbankCommonName","bacteria":"BlastName","Monera":"Inpart","Procaryotae":"Inpart","Prokaryota":"Inpart","Prokaryotae":"Inpart","prokaryote":"Inpart","prokaryotes":"Inpart"},"parentid":131567,"name":"Bacteria"},{"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":"cellular organisms"}]}
{"mode":"resolve","query":"MG831203","cast":"accs","accs":{"taxid":198112,"accessions":{"accessionversion":"MG831203.1","caption":"MG831203","extra":"gi|1496532032|gb|MG831203.1|"},"db":"nucleotide","uid":1496532032},"lineage":[{"taxid":198112,"rank":"species","names":{"Deformed wing virus":"scientific_name","DWV":"GenbankAcronym"},"parentid":232799,"name":"Deformed wing virus"},{"taxid":232799,"rank":"genus","names":{"Iflavirus":"scientific_name"},"parentid":699189,"name":"Iflavirus"},{"taxid":699189,"rank":"family","names":{"Iflaviridae":"scientific_name"},"parentid":464095,"name":"Iflaviridae"},{"taxid":464095,"rank":"order","names":{"Picornavirales":"scientific_name"},"parentid":2732506,"name":"Picornavirales"},{"taxid":2732506,"rank":"class","names":{"Pisoniviricetes":"scientific_name"},"parentid":2732408,"name":"Pisoniviricetes"},{"taxid":2732408,"rank":"phylum","names":{"Pisuviricota":"scientific_name"},"parentid":2732396,"name":"Pisuviricota"},{"taxid":2732396,"rank":"kingdom","names":{"Orthornavirae":"scientific_name"},"parentid":2559587,"name":"Orthornavirae"},{"taxid":2559587,"rank":"clade","names":{"Riboviria":"scientific_name"},"parentid":10239,"name":"Riboviria"},{"taxid":10239,"rank":"superkingdom","names":{"Viruses":"scientific_name"},"parentid":null,"name":"Viruses"}]}
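A resolved lineage often contains many unranked clades. The following is a minimal sketch that reduces a resolve result to a set of major ranks; ranked_lineage and the default rank list are hypothetical choices for illustration, and the sample record is a shortened version of the MG831203 result above.

```python
import json

# Hypothetical default selection of ranks, ordered from root to leaf.
MAJOR_RANKS = ("superkingdom", "kingdom", "phylum", "class",
               "order", "family", "genus", "species")


def ranked_lineage(record, ranks=MAJOR_RANKS):
    """Return (rank, name) pairs for the requested ranks present in the lineage."""
    by_rank = {taxon["rank"]: taxon["name"] for taxon in record["lineage"]}
    return [(rank, by_rank[rank]) for rank in ranks if rank in by_rank]


# Shortened resolve result: only three of the lineage entries are kept.
record = json.loads(
    '{"mode":"resolve","query":"MG831203","cast":"accs","lineage":['
    '{"taxid":198112,"rank":"species","parentid":232799,"name":"Deformed wing virus"},'
    '{"taxid":232799,"rank":"genus","parentid":699189,"name":"Iflavirus"},'
    '{"taxid":10239,"rank":"superkingdom","parentid":null,"name":"Viruses"}]}'
)
summary = ranked_lineage(record)
```

Ranks that occur more than once in a lineage, such as clade, are not suited to this lookup because only one entry per rank is kept.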

XML output

Single mapping result

[XML markup lost in extraction. The example shows the resolved lineage for accession MG831203 (taxid 198112, uid 1496532032, db nucleotide), from Deformed wing virus up to Viruses, with the same fields as the JSON output above.]

Multiple mapping results

[XML markup lost in extraction. The example shows the resolved lineages for man, taxid 2, and accession MG831203, with the same fields as the JSON output above.]

2.4 Import

The import command imports taxa, lineages, and accessions into a local SQLite database. It also prints the results from the preceding command to standard output.

2.4.1 Local database schema

CREATE TABLE taxa
  (id INTEGER PRIMARY KEY,
   taxonid INT NOT NULL,
   rank TEXT NULL,
   parentid INT NULL,
   UNIQUE(taxonid));
CREATE UNIQUE INDEX taxa_idx ON taxa (taxonid);
CREATE TABLE names
  (id INTEGER PRIMARY KEY,
   taxonid INT,
   name TEXT,
   type TEXT NULL,
   FOREIGN KEY (taxonid) REFERENCES taxa(taxonid) ON DELETE CASCADE,
   UNIQUE(taxonid, name));
CREATE TRIGGER delete_names DELETE ON names
  BEGIN DELETE FROM names WHERE taxonid=old.taxonid; END;
CREATE UNIQUE INDEX names_idx ON names (taxonid, name);
CREATE TABLE accessions
  (id INTEGER PRIMARY KEY,
   accession TEXT NOT NULL,
   db TEXT NOT NULL,
   type TEXT NULL,
   uid INT NOT NULL,
   taxonid INT NOT NULL,
   FOREIGN KEY (taxonid) REFERENCES taxa(taxonid) ON DELETE CASCADE,
   UNIQUE(accession, uid));
CREATE TRIGGER delete_uids DELETE ON accessions
  BEGIN DELETE FROM accessions WHERE uid=old.uid; END;
CREATE UNIQUE INDEX accessions_idx ON
  accessions (accession, uid);
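Because every row in taxa stores its parentid, a lineage can be recovered from the local database with a recursive query. The following is a minimal sketch, not part of ncbi-taxonomist: it copies only the taxa table from the schema into an in-memory SQLite database, inserts three rows taken from the examples in this chapter, and walks from a taxid towards the root. The function name lineage and the in-memory database are assumptions made for illustration.

```python
import sqlite3

# Simplified copy of the taxa table from the schema (indexes, triggers and
# the other tables are omitted).
SCHEMA = """
CREATE TABLE taxa
  (id INTEGER PRIMARY KEY,
   taxonid INT NOT NULL,
   rank TEXT NULL,
   parentid INT NULL,
   UNIQUE(taxonid));
"""


def lineage(conn, taxid):
    """Return (taxonid, rank) tuples from taxid up to the last stored parent."""
    query = """
    WITH RECURSIVE walk(taxonid, rank, parentid, depth) AS (
        SELECT taxonid, rank, parentid, 0 FROM taxa WHERE taxonid = ?
        UNION ALL
        SELECT t.taxonid, t.rank, t.parentid, w.depth + 1
        FROM taxa t JOIN walk w ON t.taxonid = w.parentid
    )
    SELECT taxonid, rank FROM walk ORDER BY depth;
    """
    return conn.execute(query, (taxid,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO taxa (taxonid, rank, parentid) VALUES (?, ?, ?)",
    [(9606, "species", 9605), (9605, "genus", 207598), (207598, "subfamily", 9604)],
)
result = lineage(conn, 9606)
print(result)
```

The walk stops as soon as a parentid has no matching row, so a partially imported lineage yields a shorter result instead of an error.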

2.4.2 Import taxa via collect

ncbi-taxonomist collect -n man -t 2 | ncbi-taxonomist import --database taxa-collect.db
{"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},

˓→"parentid":null,"name":"cellular organisms"} {"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":"scientific_name"},"parentid

˓→":131567,"name":"Eukaryota"} {"taxid":33154,"rank":"clade","names":{"Opisthokonta":"scientific_name"},"parentid

˓→":2759,"name":"Opisthokonta"} {"taxid":33208,"rank":"kingdom","names":{"Metazoa":"scientific_name"},"parentid

˓→":33154,"name":"Metazoa"} {"taxid":6072,"rank":"clade","names":{"Eumetazoa":"scientific_name"},"parentid":33208,

˓→"name":"Eumetazoa"} {"taxid":33213,"rank":"clade","names":{"Bilateria":"scientific_name"},"parentid":6072,

˓→"name":"Bilateria"} {"taxid":33511,"rank":"clade","names":{"Deuterostomia":"scientific_name"},"parentid

˓→":33213,"name":"Deuterostomia"} {"taxid":7711,"rank":"phylum","names":{"Chordata":"scientific_name"},"parentid":33511,

˓→"name":"Chordata"} {"taxid":89593,"rank":"subphylum","names":{"Craniata":"scientific_name"},"parentid

˓→":7711,"name":"Craniata"} {"taxid":7742,"rank":"clade","names":{"Vertebrata":"scientific_name"},"parentid

˓→":89593,"name":"Vertebrata"} {"taxid":7776,"rank":"clade","names":{"Gnathostomata":"scientific_name"},"parentid

˓→":7742,"name":"Gnathostomata"} {"taxid":117570,"rank":"clade","names":{"Teleostomi":"scientific_name"},"parentid

˓→":7776,"name":"Teleostomi"} {"taxid":117571,"rank":"clade","names":{"Euteleostomi":"scientific_name"},"parentid

˓→":117570,"name":"Euteleostomi"} {"taxid":8287,"rank":"superclass","names":{"Sarcopterygii":"scientific_name"},

˓→"parentid":117571,"name":"Sarcopterygii"} {"taxid":1338369,"rank":"clade","names":{"Dipnotetrapodomorpha":"scientific_name"},

˓→"parentid":8287,"name":"Dipnotetrapodomorpha"} {"taxid":32523,"rank":"clade","names":{"Tetrapoda":"scientific_name"},"parentid

˓→":1338369,"name":"Tetrapoda"} {"taxid":32524,"rank":"clade","names":{"Amniota":"scientific_name"},"parentid":32523,

˓→"name":"Amniota"} {"taxid":40674,"rank":"class","names":{"Mammalia":"scientific_name"},"parentid":32524,

˓→"name":"Mammalia"} {"taxid":32525,"rank":"clade","names":{"Theria":"scientific_name"},"parentid":40674,

˓→"name":"Theria"} {"taxid":9347,"rank":"clade","names":{"Eutheria":"scientific_name"},"parentid":32525,

˓→"name":"Eutheria"} {"taxid":1437010,"rank":"clade","names":{"Boreoeutheria":"scientific_name"},"parentid

˓→":9347,"name":"Boreoeutheria"} {"taxid":314146,"rank":"superorder","names":{"Euarchontoglires":"scientific_name"},

˓→"parentid":1437010,"name":"Euarchontoglires"} {"taxid":9443,"rank":"order","names":{"Primates":"scientific_name"},"parentid":314146,

˓→"name":"Primates"} {"taxid":376913,"rank":"suborder","names":{"Haplorrhini":"scientific_name"},"parentid

˓→":9443,"name":"Haplorrhini"} {"taxid":314293,"rank":"infraorder","names":{"Simiiformes":"scientific_name"},

˓→"parentid":376913,"name":"Simiiformes"} {"taxid":9526,"rank":"parvorder","names":{"Catarrhini":"scientific_name"},"parentid

˓→":314293,"name":"Catarrhini"} {"taxid":314295,"rank":"superfamily","names":{"Hominoidea":"scientific_name"},

˓→"parentid":9526,"name":"Hominoidea"} {"taxid":9604,"rank":"family","names":{"Hominidae":"scientific_name"},"parentid

˓→":314295,"name":"Hominidae"} {"taxid":207598,"rank":"subfamily","names":{"Homininae":"scientific_name"},"parentid

˓→":9604,"name":"Homininae"} {"taxid":9605,"rank":"genus","names":{"Homo":"scientific_name"},"parentid":207598,

˓→"name":"Homo"} {"taxid":9606,"rank":"species","names":{"Homo sapiens":"scientific_name","human":

˓→"GenbankCommonName","man":"CommonName"},"parentid":9605,"name":"Homo sapiens"} {"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},

˓→"parentid":null,"name":"cellular organisms"} {"taxid":2,"rank":"superkingdom","names":{"Bacteria":"scientific_name","eubacteria":

˓→"GenbankCommonName","bacteria":"BlastName","Monera":"Inpart","Procaryotae":

˓→"Inpart","Prokaryota":"Inpart","Prokaryotae":"Inpart","prokaryote":"Inpart",

˓→"prokaryotes":"Inpart"},"parentid":131567,"name":"Bacteria"}

• Check database:

sqlite3 -header taxa-collect.db 'SELECT * FROM taxa t JOIN names n ON t.taxonid=n.taxonid;'
id|taxonid|rank|parentid|id|taxonid|name|type
1|9606|species|9605|1|9606|Homo sapiens|scientific_name
1|9606|species|9605|2|9606|human|GenbankCommonName
1|9606|species|9605|3|9606|man|CommonName
2|9605|genus|207598|4|9605|Homo|scientific_name
3|207598|subfamily|9604|5|207598|Homininae|scientific_name
4|9604|family|314295|6|9604|Hominidae|scientific_name
5|314295|superfamily|9526|7|314295|Hominoidea|scientific_name
6|9526|parvorder|314293|8|9526|Catarrhini|scientific_name
7|314293|infraorder|376913|9|314293|Simiiformes|scientific_name
8|376913|suborder|9443|10|376913|Haplorrhini|scientific_name
9|9443|order|314146|11|9443|Primates|scientific_name
10|314146|superorder|1437010|12|314146|Euarchontoglires|scientific_name
11|1437010|clade|9347|13|1437010|Boreoeutheria|scientific_name
12|9347|clade|32525|14|9347|Eutheria|scientific_name
13|32525|clade|40674|15|32525|Theria|scientific_name
14|40674|class|32524|16|40674|Mammalia|scientific_name
15|32524|clade|32523|17|32524|Amniota|scientific_name
16|32523|clade|1338369|18|32523|Tetrapoda|scientific_name
17|1338369|clade|8287|19|1338369|Dipnotetrapodomorpha|scientific_name
18|8287|superclass|117571|20|8287|Sarcopterygii|scientific_name
19|117571|clade|117570|21|117571|Euteleostomi|scientific_name
20|117570|clade|7776|22|117570|Teleostomi|scientific_name
21|7776|clade|7742|23|7776|Gnathostomata|scientific_name
22|7742|clade|89593|24|7742|Vertebrata|scientific_name
23|89593|subphylum|7711|25|89593|Craniata|scientific_name
24|7711|phylum|33511|26|7711|Chordata|scientific_name
25|33511|clade|33213|27|33511|Deuterostomia|scientific_name
26|33213|clade|6072|28|33213|Bilateria|scientific_name
27|6072|clade|33208|29|6072|Eumetazoa|scientific_name
28|33208|kingdom|33154|30|33208|Metazoa|scientific_name
29|33154|clade|2759|31|33154|Opisthokonta|scientific_name
30|2759|superkingdom|131567|32|2759|Eukaryota|scientific_name
31|131567|no rank||33|131567|cellular organisms|scientific_name
32|2|superkingdom|131567|34|2|Bacteria|scientific_name
32|2|superkingdom|131567|35|2|eubacteria|GenbankCommonName
32|2|superkingdom|131567|36|2|bacteria|BlastName
32|2|superkingdom|131567|37|2|Monera|Inpart
32|2|superkingdom|131567|38|2|Procaryotae|Inpart
32|2|superkingdom|131567|39|2|Prokaryota|Inpart
32|2|superkingdom|131567|40|2|Prokaryotae|Inpart
32|2|superkingdom|131567|41|2|prokaryote|Inpart
32|2|superkingdom|131567|42|2|prokaryotes|Inpart
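The relation between the line-based JSON printed by collect and the rows returned by the check query can be illustrated with a rough Python sketch. It is not the actual ncbi-taxonomist import code: it assumes a reduced version of the taxa and names tables, an in-memory database, and two result lines copied from the output in this section.

```python
import json
import sqlite3

# Two lines copied from the collect output shown in this section.
LINES = [
    '{"taxid":9605,"rank":"genus","names":{"Homo":"scientific_name"},'
    '"parentid":207598,"name":"Homo"}',
    '{"taxid":9606,"rank":"species","names":{"Homo sapiens":"scientific_name",'
    '"human":"GenbankCommonName","man":"CommonName"},"parentid":9605,'
    '"name":"Homo sapiens"}',
]

conn = sqlite3.connect(":memory:")
# Reduced schema: indexes, triggers and foreign keys are left out.
conn.executescript("""
CREATE TABLE taxa (id INTEGER PRIMARY KEY, taxonid INT NOT NULL, rank TEXT NULL,
                   parentid INT NULL, UNIQUE(taxonid));
CREATE TABLE names (id INTEGER PRIMARY KEY, taxonid INT, name TEXT, type TEXT NULL,
                    UNIQUE(taxonid, name));
""")

for line in LINES:
    taxon = json.loads(line)
    # INSERT OR IGNORE keeps the first row when a taxid is seen twice.
    conn.execute(
        "INSERT OR IGNORE INTO taxa (taxonid, rank, parentid) VALUES (?, ?, ?)",
        (taxon["taxid"], taxon["rank"], taxon["parentid"]),
    )
    for name, name_type in taxon["names"].items():
        conn.execute(
            "INSERT OR IGNORE INTO names (taxonid, name, type) VALUES (?, ?, ?)",
            (taxon["taxid"], name, name_type),
        )

rows = conn.execute(
    "SELECT t.taxonid, t.rank, n.name, n.type "
    "FROM taxa t JOIN names n ON t.taxonid = n.taxonid "
    "ORDER BY t.taxonid, n.id"
).fetchall()
for row in rows:
    print("|".join(str(field) for field in row))
```

Each taxon contributes one row to taxa and one row per name to names, which is why taxa with several names appear on several lines of the join.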

2.4.3 Import taxa via resolve

ncbi-taxonomist resolve -n man -t 2 | ncbi-taxonomist import -db taxa-resolve.db
{"mode":"resolve","query":"man","cast":"taxon","taxon":{"taxid":9606,"rank":"species",

˓→"names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":

˓→"CommonName"},"parentid":9605,"name":"Homo sapiens"},"lineage":[{"taxid":9606,"rank

˓→":"species","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName",

˓→"man":"CommonName"},"parentid":9605,"name":"Homo sapiens"},{"taxid":9605,"rank":

˓→"genus","names":{"Homo":"scientific_name"},"parentid":207598,"name":"Homo"},{"taxid

˓→":207598,"rank":"subfamily","names":{"Homininae":"scientific_name"},"parentid":9604,

˓→"name":"Homininae"},{"taxid":9604,"rank":"family","names":{"Hominidae":"scientific_

˓→name"},"parentid":314295,"name":"Hominidae"},{"taxid":314295,"rank":"superfamily",

˓→"names":{"Hominoidea":"scientific_name"},"parentid":9526,"name":"Hominoidea"},{

˓→"taxid":9526,"rank":"parvorder","names":{"Catarrhini":"scientific_name"},"parentid

˓→":314293,"name":"Catarrhini"},{"taxid":314293,"rank":"infraorder","names":{

˓→"Simiiformes":"scientific_name"},"parentid":376913,"name":"Simiiformes"},{"taxid

˓→":376913,"rank":"suborder","names":{"Haplorrhini":"scientific_name"},"parentid

˓→":9443,"name":"Haplorrhini"},{"taxid":9443,"rank":"order","names":{"Primates":

˓→"scientific_name"},"parentid":314146,"name":"Primates"},{"taxid":314146,"rank":

˓→"superorder","names":{"Euarchontoglires":"scientific_name"},"parentid":1437010,"name

˓→":"Euarchontoglires"},{"taxid":1437010,"rank":"clade","names":{"Boreoeutheria":

˓→"scientific_name"},"parentid":9347,"name":"Boreoeutheria"},{"taxid":9347,"rank":

˓→"clade","names":{"Eutheria":"scientific_name"},"parentid":32525,"name":"Eutheria"},{

˓→"taxid":32525,"rank":"clade","names":{"Theria":"scientific_name"},"parentid":40674,

˓→"name":"Theria"},{"taxid":40674,"rank":"class","names":{"Mammalia":"scientific_name

˓→"},"parentid":32524,"name":"Mammalia"},{"taxid":32524,"rank":"clade","names":{

˓→"Amniota":"scientific_name"},"parentid":32523,"name":"Amniota"},{"taxid":32523,"rank

˓→":"clade","names":{"Tetrapoda":"scientific_name"},"parentid":1338369,"name":

˓→"Tetrapoda"},{"taxid":1338369,"rank":"clade","names":{"Dipnotetrapodomorpha":

˓→"scientific_name"},"parentid":8287,"name":"Dipnotetrapodomorpha"},{"taxid":8287,

˓→"rank":"superclass","names":{"Sarcopterygii":"scientific_name"},"parentid":117571,

˓→"name":"Sarcopterygii"},{"taxid":117571,"rank":"clade","names":{"Euteleostomi":

˓→"scientific_name"},"parentid":117570,"name":"Euteleostomi"},{"taxid":117570,"rank":

˓→"clade","names":{"Teleostomi":"scientific_name"},"parentid":7776,"name":"Teleostomi

˓→"},{"taxid":7776,"rank":"clade","names":{"Gnathostomata":"scientific_name"},

˓→"parentid":7742,"name":"Gnathostomata"},{"taxid":7742,"rank":"clade","names":{

˓→"Vertebrata":"scientific_name"},"parentid":89593,"name":"Vertebrata"},{"taxid

˓→":89593,"rank":"subphylum","names":{"Craniata":"scientific_name"},"parentid":7711,

˓→"name":"Craniata"},{"taxid":7711,"rank":"phylum","names":{"Chordata":"scientific_

˓→name"},"parentid":33511,"name":"Chordata"},{"taxid":33511,"rank":"clade","names":{

˓→"Deuterostomia":"scientific_name"},"parentid":33213,"name":"Deuterostomia"},{"taxid

˓→":33213,"rank":"clade","names":{"Bilateria":"scientific_name"},"parentid":6072,"name

˓→":"Bilateria"},{"taxid":6072,"rank":"clade","names":{"Eumetazoa":"scientific_name"},

˓→"parentid":33208,"name":"Eumetazoa"},{"taxid":33208,"rank":"kingdom","names":{

˓→"Metazoa":"scientific_name"},"parentid":33154,"name":"Metazoa"},{"taxid":33154,"rank

˓→":"clade","names":{"Opisthokonta":"scientific_name"},"parentid":2759,"name":

˓→"Opisthokonta"},{"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":

˓→"scientific_name"},"parentid":131567,"name":"Eukaryota"},{"taxid":131567,"rank":"no

˓→rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":

˓→"cellular organisms"}]}
{"mode":"resolve","query":"2","cast":"taxon","taxon":{"taxid":2,"rank":"superkingdom",

˓→"names":{"Bacteria":"scientific_name","eubacteria":"GenbankCommonName","bacteria":

˓→"BlastName","Monera":"Inpart","Procaryotae":"Inpart","Prokaryota":"Inpart",

˓→"Prokaryotae":"Inpart","prokaryote":"Inpart","prokaryotes":"Inpart"},"parentid

˓→":131567,"name":"Bacteria"},"lineage":[{"taxid":2,"rank":"superkingdom","names":{

˓→"Bacteria":"scientific_name","eubacteria":"GenbankCommonName","bacteria":"BlastName

˓→","Monera":"Inpart","Procaryotae":"Inpart","Prokaryota":"Inpart","Prokaryotae":

˓→"Inpart","prokaryote":"Inpart","prokaryotes":"Inpart"},"parentid":131567,"name":

˓→"Bacteria"},{"taxid":131567,"rank":"no rank","names":{"cellular organisms":

˓→"scientific_name"},"parentid":null,"name":"cellular organisms"}]}

• Check database: The database should be identical to the database created with the collect command above.

sqlite3 taxa-resolve.db 'SELECT * FROM taxa t JOIN names n ON t.taxonid=n.taxonid;'

2.4.4 Import accessions

Importing accessions imports only the taxid for the accession, not any other taxon metadata.

ncbi-taxonomist map --entrezdb protein --accessions AFR11853 AIA66128.1 | ncbi-taxonomist import -db taxa.db

• Check database:

sqlite3 -header taxa.db 'SELECT * FROM accessions a JOIN taxa t ON a.taxonid==t.taxonid;'
id|accession|db|type|uid|taxonid|id|taxonid|rank|parentid
1|AIA66128.1|protein|accessionversion|641483259|1239567|33|1239567||
2|AIA66128|protein|caption|641483259|1239567|33|1239567||
3|gi|641483259|gb|AIA66128.1||protein|extra|641483259|1239567|33|1239567||
4|AFR11853.1|protein|accessionversion|403044789|1224525|34|1224525||
5|AFR11853|protein|caption|403044789|1224525|34|1224525||
6|gi|403044789|gb|AFR11853.1||protein|extra|403044789|1224525|34|1224525||

To add the missing information, see Importing accessions in the Cookbook for an extended command that accomplishes this. The following example shows the database after adding the missing data:

sqlite3 -header taxa.db 'SELECT * FROM accessions a JOIN taxa t ON a.taxonid==t.taxonid;'
id|accession|db|type|uid|taxonid|id|taxonid|rank|parentid
1|AIA66128.1|protein|accessionversion|641483259|1239567|33|1239567|species|249588
2|AIA66128|protein|caption|641483259|1239567|33|1239567|species|249588
3|gi|641483259|gb|AIA66128.1||protein|extra|641483259|1239567|33|1239567|species|249588
4|AFR11853.1|protein|accessionversion|403044789|1224525|34|1224525|species|35278
5|AFR11853|protein|caption|403044789|1224525|34|1224525|species|35278
6|gi|403044789|gb|AFR11853.1||protein|extra|403044789|1224525|34|1224525|species|35278
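To find out which taxids still lack metadata after an accession import, the accessions and taxa tables can be joined and filtered on an empty rank. The sketch below is an illustration only: it assumes that missing fields are stored as NULL or as an empty string, uses a reduced schema in an in-memory database, and fills it with values taken from the example output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced schema: indexes, triggers and foreign keys are left out.
conn.executescript("""
CREATE TABLE taxa (id INTEGER PRIMARY KEY, taxonid INT NOT NULL, rank TEXT NULL,
                   parentid INT NULL, UNIQUE(taxonid));
CREATE TABLE accessions (id INTEGER PRIMARY KEY, accession TEXT NOT NULL,
                         db TEXT NOT NULL, type TEXT NULL, uid INT NOT NULL,
                         taxonid INT NOT NULL, UNIQUE(accession, uid));
""")
# One taxon without metadata (accession import only) and one complete taxon.
conn.executemany(
    "INSERT INTO taxa (taxonid, rank, parentid) VALUES (?, ?, ?)",
    [(1239567, None, None), (1224525, "species", 35278)],
)
conn.executemany(
    "INSERT INTO accessions (accession, db, type, uid, taxonid) VALUES (?, ?, ?, ?, ?)",
    [
        ("AIA66128.1", "protein", "accessionversion", 641483259, 1239567),
        ("AFR11853.1", "protein", "accessionversion", 403044789, 1224525),
    ],
)

missing = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT a.taxonid FROM accessions a "
        "JOIN taxa t ON a.taxonid = t.taxonid "
        "WHERE t.rank IS NULL OR t.rank = '' "
        "ORDER BY a.taxonid"
    )
]
print(missing)
```

The resulting taxids are the ones that would need to be passed to collect and imported again.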

2.5 Subtree

ncbi-taxonomist subtree collects taxonomic subsamples for taxids or names in a local database.

Note: Fetching subtrees remotely from Entrez is in development.

A local database is required, for example:

$: ncbi-taxonomist collect -t 142786 9606 | ncbi-taxonomist import -db test.db

2.5.1 Collecting subtrees

Between two given ranks

$: ncbi-taxonomist subtree -db test.db -t 142786 9606 --lrank order --hrank phylum
{"mode":"subtree","query":9606,"subtree":[{"taxid":9443,"rank":"order","names":{

˓→"Primates":"scientific_name"},"parentid":314146,"name":"Primates"},{"taxid":314146,

˓→"rank":"superorder","names":{"Euarchontoglires":"scientific_name"},"parentid

˓→":1437010,"name":"Euarchontoglires"},{"taxid":1437010,"rank":"clade","names":{

˓→"Boreoeutheria":"scientific_name"},"parentid":9347,"name":"Boreoeutheria"},{"taxid

˓→":9347,"rank":"clade","names":{"Eutheria":"scientific_name"},"parentid":32525,"name

˓→":"Eutheria"},{"taxid":32525,"rank":"clade","names":{"Theria":"scientific_name"},

˓→"parentid":40674,"name":"Theria"},{"taxid":40674,"rank":"class","names":{"Mammalia":

˓→"scientific_name"},"parentid":32524,"name":"Mammalia"},{"taxid":32524,"rank":"clade

˓→","names":{"Amniota":"scientific_name"},"parentid":32523,"name":"Amniota"},{"taxid

˓→":32523,"rank":"clade","names":{"Tetrapoda":"scientific_name"},"parentid":1338369,

˓→"name":"Tetrapoda"},{"taxid":1338369,"rank":"clade","names":{"Dipnotetrapodomorpha":

˓→"scientific_name"},"parentid":8287,"name":"Dipnotetrapodomorpha"},{"taxid":8287,

˓→"rank":"superclass","names":{"Sarcopterygii":"scientific_name"},"parentid":117571,

˓→"name":"Sarcopterygii"},{"taxid":117571,"rank":"clade","names":{"Euteleostomi":

˓→"scientific_name"},"parentid":117570,"name":"Euteleostomi"},{"taxid":117570,"rank":

˓→"clade","names":{"Teleostomi":"scientific_name"},"parentid":7776,"name":"Teleostomi

˓→"},{"taxid":7776,"rank":"clade","names":{"Gnathostomata":"scientific_name"},

˓→"parentid":7742,"name":"Gnathostomata"},{"taxid":7742,"rank":"clade","names":{

˓→"Vertebrata":"scientific_name"},"parentid":89593,"name":"Vertebrata"},{"taxid

˓→":89593,"rank":"subphylum","names":{"Craniata":"scientific_name"},"parentid":7711,

˓→"name":"Craniata"},{"taxid":7711,"rank":"phylum","names":{"Chordata":"scientific_

˓→name"},"parentid":33511,"name":"Chordata"}]} {"mode":"subtree","query":142786,"subtree":[{"taxid":464095,"rank":"order","names":{

˓→"Picornavirales":"scientific_name"},"parentid":2732506,"name":"Picornavirales"},{

˓→"taxid":2732506,"rank":"class","names":{"Pisoniviricetes":"scientific_name"},

˓→"parentid":2732408,"name":"Pisoniviricetes"},{"taxid":2732408,"rank":"phylum","names

˓→":{"Pisuviricota":"scientific_name"},"parentid":2732396,"name":"Pisuviricota"}]}

Collect one specific rank

$: ncbi-taxonomist subtree -db test.db -t 142786 9606 --rank order
{"mode":"subtree","query":9606,"subtree":[{"taxid":9443,"rank":"order","names":{

˓→"Primates":"scientific_name"},"parentid":314146,"name":"Primates"}]}
{"mode":"subtree","query":142786,"subtree":[{"taxid":464095,"rank":"order","names":{

˓→"Picornavirales":"scientific_name"},"parentid":2732506,"name":"Picornavirales"}]}


Collect from a given rank to root and print XML

$: ncbi-taxonomist subtree -x -db test.db -t 142786 9606 --lrank order
9443

˓→orderPrimates314146

˓→"scientific_name">Primates314146

˓→superorderEuarchontoglires1437010

˓→Euarchontoglires

˓→1437010cladeBoreoeutheria9347

˓→Boreoeutheria

˓→9347cladeEutheria32525

˓→Eutheria

˓→32525cladeTheria40674

˓→Theria40674

˓→taxid>classMammalia32524

˓→type="scientific_name">Mammalia32524

˓→cladeAmniota32523

˓→"scientific_name">Amniota32523

˓→cladeTetrapoda1338369

˓→"scientific_name">Tetrapoda1338369

˓→cladeDipnotetrapodomorpha8287

˓→Dipnotetrapodomorpha

˓→8287superclassSarcopterygii117571

˓→Sarcopterygii

˓→117571cladeEuteleostomi

˓→117570Euteleostomi

˓→taxon>117570cladeTeleostomi

˓→7776Teleostomi

˓→names>7776cladeGnathostomata

˓→7742Gnathostomata

˓→names>7742cladeVertebrata

˓→89593Vertebrata

˓→names>89593subphylumCraniata

˓→7711Craniata

˓→7711phylumChordata

˓→33511Chordata

˓→33511cladeDeuterostomia

˓→33213Deuterostomia

˓→taxon>33213cladeBilateria

˓→6072Bilateria

˓→6072cladeEumetazoa33208

˓→parentid>Eumetazoa

˓→33208kingdomMetazoa33154

˓→parentid>Metazoa

˓→33154cladeOpisthokonta2759

˓→parentid>Opisthokonta

˓→2759superkingdomEukaryota

˓→131567Eukaryota

˓→taxon>131567no rankcellular organisms

˓→name>Nonecellular organisms

˓→ 464095

˓→orderPicornavirales2732506

˓→type="scientific_name">Picornavirales2732506

˓→taxid>classPisoniviricetes2732408

˓→Pisoniviricetes

˓→2732408phylumPisuviricota2732396

˓→Pisuviricota

˓→2732396kingdomOrthornavirae

˓→2559587Orthornavirae

˓→2559587cladeRiboviria10239Riboviria

˓→10239superkingdomViruses

˓→name>NoneViruses

˓→names>

Collect from a given rank to lowest rank

$: ncbi-taxonomist subtree -db test.db -t 142786 9606 --hrank order
{"mode":"subtree","query":9606,"subtree":[{"taxid":9606,"rank":"species","names":{

˓→"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":"CommonName"},

˓→"parentid":9605,"name":"Homo sapiens"},{"taxid":9605,"rank":"genus","names":{"Homo":

˓→"scientific_name"},"parentid":207598,"name":"Homo"},{"taxid":207598,"rank":

˓→"subfamily","names":{"Homininae":"scientific_name"},"parentid":9604,"name":

˓→"Homininae"},{"taxid":9604,"rank":"family","names":{"Hominidae":"scientific_name"},

˓→"parentid":314295,"name":"Hominidae"},{"taxid":314295,"rank":"superfamily","names":{

˓→"Hominoidea":"scientific_name"},"parentid":9526,"name":"Hominoidea"},{"taxid":9526,

˓→"rank":"parvorder","names":{"Catarrhini":"scientific_name"},"parentid":314293,"name

˓→":"Catarrhini"},{"taxid":314293,"rank":"infraorder","names":{"Simiiformes":

˓→"scientific_name"},"parentid":376913,"name":"Simiiformes"},{"taxid":376913,"rank":

˓→"suborder","names":{"Haplorrhini":"scientific_name"},"parentid":9443,"name":

˓→"Haplorrhini"},{"taxid":9443,"rank":"order","names":{"Primates":"scientific_name"},

˓→"parentid":314146,"name":"Primates"}]} {"mode":"subtree","query":142786,"subtree":[{"taxid":142786,"rank":"genus","names":{

˓→"":"scientific_name","Norwalk-like viruses":"EquivalentName"},"parentid

˓→":11974,"name":"Norovirus"},{"taxid":11974,"rank":"family","names":{"":

˓→"scientific_name"},"parentid":464095,"name":"Caliciviridae"},{"taxid":464095,"rank":

˓→"order","names":{"Picornavirales":"scientific_name"},"parentid":2732506,"name":

˓→"Picornavirales"}]}
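Judging from the four examples above, --lrank and --hrank select an inclusive slice of the lineage: a missing --lrank starts at the queried taxon and a missing --hrank continues to the root. The following Python sketch reproduces that selection on a shortened lineage. It is an interpretation of the example output, not the tool's implementation, and the function name slice_lineage is hypothetical. Ranks that occur several times in a lineage (for example clade) are matched at their first occurrence only.

```python
def slice_lineage(lineage, lrank=None, hrank=None):
    """Return the part of a lineage between two ranks, both inclusive.

    lineage is ordered from the queried taxon towards the root, as in the
    resolve output. A missing lrank starts at the queried taxon, a missing
    hrank continues to the root. Raises ValueError for an unknown rank.
    """
    ranks = [taxon["rank"] for taxon in lineage]
    start = ranks.index(lrank) if lrank is not None else 0
    stop = ranks.index(hrank) if hrank is not None else len(lineage) - 1
    return lineage[start:stop + 1]


# Shortened lineage of Homo sapiens (taxid 9606) using values from the examples.
LINEAGE = [
    {"taxid": 9606, "rank": "species", "name": "Homo sapiens"},
    {"taxid": 9605, "rank": "genus", "name": "Homo"},
    {"taxid": 9604, "rank": "family", "name": "Hominidae"},
    {"taxid": 9443, "rank": "order", "name": "Primates"},
    {"taxid": 40674, "rank": "class", "name": "Mammalia"},
    {"taxid": 7711, "rank": "phylum", "name": "Chordata"},
    {"taxid": 2759, "rank": "superkingdom", "name": "Eukaryota"},
]

between = [t["name"] for t in slice_lineage(LINEAGE, lrank="order", hrank="phylum")]
to_root = [t["name"] for t in slice_lineage(LINEAGE, lrank="order")]
to_lowest = [t["name"] for t in slice_lineage(LINEAGE, hrank="order")]
print(between, to_root, to_lowest, sep="\n")
```

Selecting a single rank, as with --rank order, corresponds to passing the same value for both bounds.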

2.5.2 Output format

JSON output

{
  "mode": "subtree",
  "query": 9606,
  "subtree": [
    {"taxid": 9443, "rank": "order", "names": {"Primates": "scientific_name"}, "parentid": 314146, "name": "Primates"},
    {"taxid": 314146, "rank": "superorder", "names": {"Euarchontoglires": "scientific_name"}, "parentid": 1437010, "name": "Euarchontoglires"},
    {"taxid": 1437010, "rank": "clade", "names": {"Boreoeutheria": "scientific_name"}, "parentid": 9347, "name": "Boreoeutheria"},
    {"taxid": 9347, "rank": "clade", "names": {"Eutheria": "scientific_name"}, "parentid": 32525, "name": "Eutheria"},
    {"taxid": 32525, "rank": "clade", "names": {"Theria": "scientific_name"}, "parentid": 40674, "name": "Theria"},
    {"taxid": 40674, "rank": "class", "names": {"Mammalia": "scientific_name"}, "parentid": 32524, "name": "Mammalia"},
    {"taxid": 32524, "rank": "clade", "names": {"Amniota": "scientific_name"}, "parentid": 32523, "name": "Amniota"},
    {"taxid": 32523, "rank": "clade", "names": {"Tetrapoda": "scientific_name"}, "parentid": 1338369, "name": "Tetrapoda"},
    {"taxid": 1338369, "rank": "clade", "names": {"Dipnotetrapodomorpha": "scientific_name"}, "parentid": 8287, "name": "Dipnotetrapodomorpha"},
    {"taxid": 8287, "rank": "superclass", "names": {"Sarcopterygii": "scientific_name"}, "parentid": 117571, "name": "Sarcopterygii"},
    {"taxid": 117571, "rank": "clade", "names": {"Euteleostomi": "scientific_name"}, "parentid": 117570, "name": "Euteleostomi"},
    {"taxid": 117570, "rank": "clade", "names": {"Teleostomi": "scientific_name"}, "parentid": 7776, "name": "Teleostomi"},
    {"taxid": 7776, "rank": "clade", "names": {"Gnathostomata": "scientific_name"}, "parentid": 7742, "name": "Gnathostomata"},
    {"taxid": 7742, "rank": "clade", "names": {"Vertebrata": "scientific_name"}, "parentid": 89593, "name": "Vertebrata"},
    {"taxid": 89593, "rank": "subphylum", "names": {"Craniata": "scientific_name"}, "parentid": 7711, "name": "Craniata"},
    {"taxid": 7711, "rank": "phylum", "names": {"Chordata": "scientific_name"}, "parentid": 33511, "name": "Chordata"}
  ]
}
{
  "mode": "subtree",
  "query": 142786,
  "subtree": [
    {"taxid": 464095, "rank": "order", "names": {"Picornavirales": "scientific_name"}, "parentid": 2732506, "name": "Picornavirales"},
    {"taxid": 2732506, "rank": "class", "names": {"Pisoniviricetes": "scientific_name"}, "parentid": 2732408, "name": "Pisoniviricetes"},
    {"taxid": 2732408, "rank": "phylum", "names": {"Pisuviricota": "scientific_name"}, "parentid": 2732396, "name": "Pisuviricota"}
  ]
}

XML output

9443 order Primates 314146 Primates
314146 superorder Euarchontoglires 1437010 Euarchontoglires
1437010 clade Boreoeutheria 9347 Boreoeutheria
9347 clade Eutheria 32525 Eutheria
32525 clade Theria 40674 Theria
40674 class Mammalia 32524 Mammalia
32524 clade Amniota 32523 Amniota
32523 clade Tetrapoda 1338369 Tetrapoda
1338369 clade Dipnotetrapodomorpha 8287 Dipnotetrapodomorpha
8287 superclass Sarcopterygii 117571 Sarcopterygii
117571 clade Euteleostomi 117570 Euteleostomi
117570 clade Teleostomi 7776 Teleostomi
7776 clade Gnathostomata 7742 Gnathostomata
7742 clade Vertebrata 89593 Vertebrata
89593 subphylum Craniata 7711 Craniata
7711 phylum Chordata 33511 Chordata

2.6 Group

ncbi-taxonomist group creates and lists taxonomic groups in a local ncbi-taxonomist database.

2.6.1 Creating a group

$: ncbi-taxonomist collect -n 'Black willow' 'Black hickory' | \
   ncbi-taxonomist import -db taxa.db | \
   ncbi-taxonomist group --add tree -db taxa.db

2.6.2 Retrieve a group

Groups can be retrieved as taxids and processed, e.g. with jq, and reused.

$: ncbi-taxonomist group --get tree -db taxa.db | \
   jq '.taxa[]' | \
   ncbi-taxonomist map -t -db taxa.db

CHAPTER 3

Cookbook

Contents

• Reformatting results
  – Convert accession lineages into TSV
  – Convert a lineage into a table
• Importing accessions
  – Map accessions and collect corresponding taxa
• Creating a valid XML file from line based XML output

3.1 Reformatting results

Examples of how to use jq to reformat JSON output. For more jq help, please refer to:

• jq manual
• Reshaping JSON with jq

3.1.1 Convert accession lineages into TSV

Convert the lineages of several nucleotide accessions into tab-separated output. The queried accession is printed in the first field. Substituting @tsv with @csv in the example results in CSV output.

ncbi-taxonomist map -a QZWG01000002.1 MG831203 | ncbi-taxonomist resolve --mapping | \
jq -r '[.query, .lineage[].name]|@tsv'
MG831203	Deformed wing virus	Iflavirus	Iflaviridae	Picornavirales	Pisoniviricetes	Pisuviricota	Orthornavirae	Riboviria	Viruses
QZWG01000002.1	Glycine soja	Glycine subgen. Soja	Glycine	Phaseoleae	indigoferoid/millettioid clade	NPAAA clade	50 kb inversion clade	Papilionoideae	Fabaceae	Fabales	fabids	rosids	Pentapetalae	Gunneridae	eudicotyledons	Mesangiospermae	Magnoliopsida	Spermatophyta	Euphyllophyta	Tracheophyta	Embryophyta	Streptophytina	Streptophyta	Viridiplantae	Eukaryota	cellular organisms
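Where jq is not available, the same reshaping can be done with a few lines of Python. The sketch below mirrors the jq filter for a single result line and assumes the JSON layout shown in the resolve examples (query and lineage[].name). The function name lineage_to_tsv and the shortened input line are made up for illustration.

```python
import json


def lineage_to_tsv(line):
    """Mirror the jq filter '[.query, .lineage[].name]|@tsv' for one result line."""
    result = json.loads(line)
    fields = [str(result["query"])] + [taxon["name"] for taxon in result["lineage"]]
    return "\t".join(fields)


# Shortened resolve result; a real lineage continues up to the root.
LINE = (
    '{"mode":"resolve","query":"man","lineage":['
    '{"taxid":9606,"rank":"species","name":"Homo sapiens"},'
    '{"taxid":9605,"rank":"genus","name":"Homo"}]}'
)

tsv = lineage_to_tsv(LINE)
print(tsv)
```

Unlike @tsv, the plain join does not escape tabs or newlines inside a field, which is adequate for taxon names but worth keeping in mind for other data.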

3.1.2 Convert a lineage into a table

Convert the lineage into a table with the tab-separated columns taxid, name, rank, and parentid.

ncbi-taxonomist resolve -t 9606 | \
jq -r '.lineage[]|"\(.taxid)\t\(.name)\t\(.rank)\t\(.parentid)"'
9606	Homo sapiens	species	9605
9605	Homo	genus	207598
207598	Homininae	subfamily	9604
9604	Hominidae	family	314295
314295	Hominoidea	superfamily	9526
9526	Catarrhini	parvorder	314293
314293	Simiiformes	infraorder	376913
376913	Haplorrhini	suborder	9443
9443	Primates	order	314146
314146	Euarchontoglires	superorder	1437010
1437010	Boreoeutheria	clade	9347
9347	Eutheria	clade	32525
32525	Theria	clade	40674
40674	Mammalia	class	32524
32524	Amniota	clade	32523
32523	Tetrapoda	clade	1338369
1338369	Dipnotetrapodomorpha	clade	8287
8287	Sarcopterygii	superclass	117571
117571	Euteleostomi	clade	117570
117570	Teleostomi	clade	7776
7776	Gnathostomata	clade	7742
7742	Vertebrata	clade	89593
89593	Craniata	subphylum	7711
7711	Chordata	phylum	33511
33511	Deuterostomia	clade	33213
33213	Bilateria	clade	6072
6072	Eumetazoa	clade	33208
33208	Metazoa	kingdom	33154
33154	Opisthokonta	clade	2759
2759	Eukaryota	superkingdom	131567
131567	cellular organisms	no rank	null

3.2 Importing accessions

Mapping accessions fetches only the corresponding taxid, not the remaining taxon metadata.

3.2.1 Map accessions and collect corresponding taxa

ncbi-taxonomist map --entrezdb protein --accessions AFR11853 AIA66128.1 | \
ncbi-taxonomist import -db taxa.db | \
jq '.accession.taxid' | \
ncbi-taxonomist collect -t | \
ncbi-taxonomist import -db taxa.db
{"taxid":10239,"rank":"superkingdom","names":{"Viruses":"scientific_name"},"parentid":null,"name":"Viruses"}
{"taxid":2559587,"rank":"clade","names":{"Riboviria":"scientific_name"},"parentid":10239,"name":"Riboviria"}
{"taxid":2732396,"rank":"kingdom","names":{"Orthornavirae":"scientific_name"},"parentid":2559587,"name":"Orthornavirae"}
{"taxid":2732408,"rank":"phylum","names":{"Pisuviricota":"scientific_name"},"parentid":2732396,"name":"Pisuviricota"}
{"taxid":2732507,"rank":"class","names":{"Stelpaviricetes":"scientific_name"},"parentid":2732408,"name":"Stelpaviricetes"}
{"taxid":2732551,"rank":"order","names":{"Stellavirales":"scientific_name"},"parentid":2732507,"name":"Stellavirales"}
{"taxid":39733,"rank":"family","names":{"Astroviridae":"scientific_name"},"parentid":2732551,"name":"Astroviridae"}
{"taxid":249588,"rank":"genus","names":{"":"scientific_name"},"parentid":39733,"name":"Mamastrovirus"}
{"taxid":1239567,"rank":"species","names":{"Mamastrovirus 3":"scientific_name","Porcine ":"EquivalentName"},"parentid":249588,"name":"Mamastrovirus 3"}
{"taxid":2585030,"rank":"no rank","names":{"unclassified Riboviria":"scientific_name"},"parentid":2559587,"name":"unclassified Riboviria"}
{"taxid":439490,"rank":"no rank","names":{"unclassified ssRNA viruses":"scientific_name"},"parentid":2585030,"name":"unclassified ssRNA viruses"}
{"taxid":35278,"rank":"clade","names":{"unclassified ssRNA positive-strand viruses":"scientific_name"},"parentid":439490,"name":"unclassified ssRNA positive-strand viruses"}
{"taxid":1224525,"rank":"species","names":{"Cadicistrovirus":"scientific_name"},"parentid":35278,"name":"Cadicistrovirus"}

3.3 Creating a valid XML file from line based XML output

To create a valid XML document from the line based output, the output has to be enclosed between an opening and a closing root XML tag. On Linux, this can be achieved by grouping commands in a subshell, as shown in Listing 3.1.

Listing 3.1: Creating valid XML from line based output. Line 3 shows the command to create a valid XML output. The xmllint command on line 4 is not required but demonstrates the validity of the created XML output.

1 $: ncbi-taxonomist map --accessions QZWG01000002.1 MG831203 | \

2 ncbi-taxonomist resolve --xml --mapping | \

3 (echo "" && cat && echo "") | \

4 xmllint --pretty 1 -

5

6

7

8

9

10 198112

11

12

13
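The same wrapping can be done programmatically. The sketch below joins line-based XML fragments under a single root element and parses the result with the Python standard library. The element names taxon, taxid and rank, the root tag results, and the two input lines are hypothetical placeholders, not the tags emitted by ncbi-taxonomist.

```python
import xml.etree.ElementTree as ET


def wrap_lines(lines, root_tag="results"):
    """Join line-based XML fragments under one root element and parse them."""
    document = "<{0}>{1}</{0}>".format(root_tag, "".join(lines))
    return ET.fromstring(document)


# Hypothetical fragments: one self-contained XML element per output line.
LINES = [
    "<taxon><taxid>9606</taxid><rank>species</rank></taxon>",
    "<taxon><taxid>198112</taxid><rank>species</rank></taxon>",
]

root = wrap_lines(LINES)
taxids = [element.findtext("taxid") for element in root.findall("taxon")]
print(taxids)
```

ET.fromstring raises a ParseError if the wrapped text is not well-formed, which serves the same purpose as the xmllint check in Listing 3.1.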

CHAPTER 4

Container

ncbi-taxonomist comes with a Docker container and Singularity image. Both include jq to facilitate JSON handling. Both containers have the /dbs mountpoint to mount host directories, e.g. to use local databases.

Content

• Docker
  – Install
  – Test
  – Basic usage
    * Mapping
    * Resolving
    * Pipelines
    * Local database
    * Docker ncbi-taxonomist and jq
• Singularity
  – Install
    * Build
  – Test
  – Basic usage
    * Mapping
    * Resolving
    * Pipelines
    * Local database
    * Singularity ncbi-taxonomist and jq

Note: The commands shown here assume a current Linux system. Please adjust the commands to your system, accordingly.

4.1 Docker

The Docker container can be found at https://gitlab.com/janpb/ncbi-taxonomist/container_registry/. Please check the Docker Docs if some commands are unclear.

• The Docker image creates a user named user inside the container to run all commands
• The container has the mountpoint /dbs to bind host paths

4.1.1 Install

The latest ncbi-taxonomist Docker image can be pulled from registry.gitlab.com/janpb/ncbi-taxonomist:latest. It can be run with the command docker run registry.gitlab.com/janpb/ncbi-taxonomist. If desired, the image can be given a more concise name using docker tag registry.gitlab.com/janpb/ncbi-taxonomist ncbi-taxonomist.

1 $: docker pull registry.gitlab.com/janpb/ncbi-taxonomist:latest

2 latest: Pulling from janpb/ncbi-taxonomist

3 cbdbe7a5bc2a: Pull complete

4 50d9a3e26028: Pull complete

5 a0e2567dead0: Pull complete

6 #cut

7 $: docker tag registry.gitlab.com/janpb/ncbi-taxonomist:latest ncbi-taxonomist

8 $: docker images

9 ncbi-taxonomist latest f957b80d1034 22 hours ago 68.3MB

10 registry.gitlab.com/janpb/ncbi-taxonomist latest f957b80d1034 22 hours ago 68.3MB

Line 6 indicates cut output; the output on lines 3-5 and 9-10 will likely look different.

4.1.2 Test

Assuming the image is tagged ncbi-taxonomist, the following command should print the basic usage:

1 $: docker run --rm -it ncbi-taxonomist

2 usage: ncbi-taxonomist[--version][-v][--apikey APIKEY]{map,resolve,import,collect,

˓→subtree,group} ...

3

4 commands:

5 {map,resolve,import,collect,subtree,group}

6 map Map taxid to names and vice-versa

7 #cut


4.1.3 Basic usage

The examples assume the image has been tagged ncbi-taxonomist and show representative commands.

Mapping

1 $: docker run --rm -it ncbi-taxonomist map -t 9606

2 {"mode":"mapping","query":"9606","cast":"taxon","taxon":{"taxid":9606,"rank":"species

˓→","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":

˓→"CommonName"},"parentid":9605,"name":"Homo sapiens"}}
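Each result is a single JSON object per line, so it can be processed with any JSON parser. The following sketch groups the names of a mapping result by name type. The function name names_by_type is made up for this example; the input line is the mapping result shown above.

```python
import json
from collections import defaultdict

# The mapping result from the listing above, as one line of JSON.
line = (
    '{"mode":"mapping","query":"9606","cast":"taxon","taxon":{"taxid":9606,'
    '"rank":"species","names":{"Homo sapiens":"scientific_name",'
    '"human":"GenbankCommonName","man":"CommonName"},"parentid":9605,'
    '"name":"Homo sapiens"}}'
)


def names_by_type(result_line):
    """Return the taxid and a mapping of name type to the names of that type."""
    record = json.loads(result_line)
    grouped = defaultdict(list)
    for name, name_type in record["taxon"]["names"].items():
        grouped[name_type].append(name)
    return record["taxon"]["taxid"], dict(grouped)


taxid, grouped = names_by_type(line)
print(taxid, grouped)
```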

Resolving

1 $: docker run --rm -it ncbi-taxonomist resolve -t2 -n 'Arabidopsis'

2 {"mode":"resolve","query":"Arabidopsis","cast":"taxon","taxon":{"taxid":3701,"rank":

˓→"genus","names":{"Arabidopsis":"scientific_name","Cardaminopsis":"Synonym"},

˓→"parentid":980083,"name":"Arabidopsis"},"lineage":[{"taxid":3701,"rank":"genus",

˓→"names":{"Arabidopsis":"scientific_name","Cardaminopsis":"Synonym"},"parentid

˓→":980083,"name":"Arabidopsis"},{"taxid":980083,"rank":"tribe","names":{"Camelineae":

˓→"scientific_name"},"parentid":3700,"name":"Camelineae"},{"taxid":3700,"rank":"family

˓→","names":{"Brassicaceae":"scientific_name"},"parentid":3699,"name":"Brassicaceae"},

˓→{"taxid":3699,"rank":"order","names":{"Brassicales":"scientific_name"},"parentid

˓→":91836,"name":"Brassicales"},{"taxid":91836,"rank":"clade","names":{"malvids":

˓→"scientific_name"},"parentid":71275,"name":"malvids"},{"taxid":71275,"rank":"clade",

˓→"names":{"rosids":"scientific_name"},"parentid":1437201,"name":"rosids"},{"taxid

˓→":1437201,"rank":"clade","names":{"Pentapetalae":"scientific_name"},"parentid

˓→":91827,"name":"Pentapetalae"},{"taxid":91827,"rank":"clade","names":{"Gunneridae":

˓→"scientific_name"},"parentid":71240,"name":"Gunneridae"},{"taxid":71240,"rank":

˓→"clade","names":{"eudicotyledons":"scientific_name"},"parentid":1437183,"name":

˓→"eudicotyledons"},{"taxid":1437183,"rank":"clade","names":{"Mesangiospermae":

˓→"scientific_name"},"parentid":3398,"name":"Mesangiospermae"},{"taxid":3398,"rank":

˓→"class","names":{"Magnoliopsida":"scientific_name"},"parentid":58024,"name":

˓→"Magnoliopsida"},{"taxid":58024,"rank":"clade","names":{"Spermatophyta":"scientific_

˓→name"},"parentid":78536,"name":"Spermatophyta"},{"taxid":78536,"rank":"clade","names

˓→":{"Euphyllophyta":"scientific_name"},"parentid":58023,"name":"Euphyllophyta"},{

˓→"taxid":58023,"rank":"clade","names":{"Tracheophyta":"scientific_name"},"parentid

˓→":3193,"name":"Tracheophyta"},{"taxid":3193,"rank":"clade","names":{"Embryophyta":

˓→"scientific_name"},"parentid":131221,"name":"Embryophyta"},{"taxid":131221,"rank":

˓→"subphylum","names":{"Streptophytina":"scientific_name"},"parentid":35493,"name":

˓→"Streptophytina"},{"taxid":35493,"rank":"phylum","names":{"Streptophyta":

˓→"scientific_name"},"parentid":33090,"name":"Streptophyta"},{"taxid":33090,"rank":

˓→"kingdom","names":{"Viridiplantae":"scientific_name"},"parentid":2759,"name":

˓→"Viridiplantae"},{"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":

˓→"scientific_name"},"parentid":131567,"name":"Eukaryota"},{"taxid":131567,"rank":"no

˓→rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":

˓→"cellular organisms"}]}

3 {"mode":"resolve","query":"2","cast":"taxon","taxon":{"taxid":2,"rank":"superkingdom",

˓→"names":{"Bacteria":"scientific_name","eubacteria":"GenbankCommonName","bacteria":

˓→"BlastName","Monera":"Inpart","Procaryotae":"Inpart","Prokaryota":"Inpart",

˓→"Prokaryotae":"Inpart","prokaryote":"Inpart","prokaryotes":"Inpart"},"parentid

˓→":131567,"name":"Bacteria"},"lineage":[{"taxid":2,"rank":"superkingdom","names":{

˓→"Bacteria":"scientific_name","eubacteria":"GenbankCommonName","bacteria":"BlastName

˓→","Monera":"Inpart","Procaryotae":"Inpart","Prokaryota":"Inpart","Prokaryotae":

˓→"Inpart","prokaryote":"Inpart","prokaryotes":"Inpart"},"parentid":131567,"name":

˓→"Bacteria"},{"taxid":131567,"rank":"no rank","names":{"cellular organisms":

˓→"scientific_name"},"parentid":null,"name":"cellular organisms"}]}

Pipelines

1 $: docker run --rm -i ncbi-taxonomist map -edb bioproject -a PRJNA604394 | \

2 docker run --rm -i ncbi-taxonomist resolve -m

3 {"mode":"resolve","query":"PRJNA604394","cast":"accs","accs":{"taxid":573,"accessions

˓→":{"project_id":604394,"project_acc":"PRJNA604394","project_name":"Klebsiella

˓→pneumoniae strain:S01"},"db":"bioproject","uid":604394},"lineage":[{"taxid":573,

˓→"rank":"species","names":{"Klebsiella pneumoniae":"scientific_name","'Klebsiella

˓→aerogenes' (Kruse) Taylor et al. 1956":"Synonym","Bacillus pneumoniae":"Synonym",

˓→"Bacterium pneumoniae crouposae":"Synonym","Hyalococcus pneumoniae":"Synonym",

˓→"Klebsiella pneumoniae aerogenes":"Synonym","Klebsiella sp. 2N3":"Includes",

˓→"Klebsiella sp. C1(2016)":"Includes","Klebsiella sp. M-AI-2":"Includes","Klebsiella

˓→sp. PB12":"Includes","Klebsiella sp. RCE-7":"Includes","ATCC 13883":"type material",

˓→"ATCC:13883":"type material","BCCM/LMG:2095":"type material","CCUG 225":"type

˓→material","CCUG:225":"type material","CDC 298-53":"type material","CDC:298-53":

˓→"type material","CIP 82.91":"type material","CIP:82.91":"type material","DSM 30104":

˓→"type material","DSM:30104":"type material","HAMBI 450":"type material","HAMBI:450":

˓→"type material","IAM 14200":"type material","IAM:14200":"type material","IFO 14940":

˓→"type material","IFO:14940":"type material","JCM 1662":"type material","JCM:1662":

˓→"type material","LMG 2095":"type material","LMG:2095":"type material","NBRC 14940":

˓→"type material","NBRC:14940":"type material","NCTC 9633":"type material","NCTC:9633

˓→":"type material"},"parentid":570,"name":"Klebsiella pneumoniae"},{"taxid":570,"rank

˓→":"genus","names":{"Klebsiella":"scientific_name"},"parentid":543,"name":"Klebsiella

˓→"},{"taxid":543,"rank":"family","names":{"Enterobacteriaceae":"scientific_name"},

˓→"parentid":91347,"name":"Enterobacteriaceae"},{"taxid":91347,"rank":"order","names":

˓→{"Enterobacterales":"scientific_name"},"parentid":1236,"name":"Enterobacterales"},{

˓→"taxid":1236,"rank":"class","names":{"Gammaproteobacteria":"scientific_name"},

˓→"parentid":1224,"name":"Gammaproteobacteria"},{"taxid":1224,"rank":"phylum","names":

˓→{"Proteobacteria":"scientific_name"},"parentid":2,"name":"Proteobacteria"},{"taxid

˓→":2,"rank":"superkingdom","names":{"Bacteria":"scientific_name"},"parentid":131567,

˓→"name":"Bacteria"},{"taxid":131567,"rank":"no rank","names":{"cellular organisms":

˓→"scientific_name"},"parentid":null,"name":"cellular organisms"}]}
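A resolve result lists the lineage from the queried taxon up to the root. The sketch below reduces such a result to a rank to name mapping. The sample is a trimmed copy of the result above (most lineage entries and all synonyms are omitted), and the function name ranked_lineage is an assumption made for this example.

```python
import json

# Trimmed copy of the resolve result above; only a few lineage entries kept.
record_line = json.dumps({
    "mode": "resolve",
    "query": "PRJNA604394",
    "lineage": [
        {"taxid": 573, "rank": "species", "name": "Klebsiella pneumoniae"},
        {"taxid": 570, "rank": "genus", "name": "Klebsiella"},
        {"taxid": 543, "rank": "family", "name": "Enterobacteriaceae"},
        {"taxid": 131567, "rank": "no rank", "name": "cellular organisms"},
    ],
})


def ranked_lineage(result_line, skip=("no rank", "clade")):
    """Map each named rank in the lineage to its taxon name.

    Ranks listed in skip can occur several times in one lineage and
    would overwrite each other in a dictionary, so they are left out.
    """
    record = json.loads(result_line)
    return {
        taxon["rank"]: taxon["name"]
        for taxon in record["lineage"]
        if taxon["rank"] not in skip
    }


ranks = ranked_lineage(record_line)
print(ranks)
```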

Local database

To use local databases with the ncbi-taxonomist Docker container, the path on the host machine needs to be bound to the container’s internal mountpoint /dbs. To have the proper permissions, the --user argument needs to be set when writing to a local database. On Linux, this can be done via the id command (Listing 4.1).

Listing 4.1: Populating a local database using the ncbi-taxonomist Docker container. Line 4 shows how to run the container as current user.

1 $: ls ${PWD}

2 #empty

3 $: docker run --rm -i ncbi-taxonomist collect -t 9606 | \

4 docker run --rm -i --user $(id -u):$(id -g) -v ${PWD}:/dbs ncbi-taxonomist import -db /dbs/dockertaxa.db

5 {"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},

˓→"parentid":null,"name":"cellular organisms"}

6 {"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":"scientific_name"},"parentid

˓→":131567,"name":"Eukaryota"}

7 {"taxid":33154,"rank":"clade","names":{"Opisthokonta":"scientific_name"},"parentid

˓→":2759,"name":"Opisthokonta"}

8 {"taxid":33208,"rank":"kingdom","names":{"Metazoa":"scientific_name"},"parentid

˓→":33154,"name":"Metazoa"}

9 {"taxid":6072,"rank":"clade","names":{"Eumetazoa":"scientific_name"},"parentid":33208,

˓→"name":"Eumetazoa"}

10 {"taxid":33213,"rank":"clade","names":{"Bilateria":"scientific_name"},"parentid":6072,

˓→"name":"Bilateria"}

11 {"taxid":33511,"rank":"clade","names":{"Deuterostomia":"scientific_name"},"parentid

˓→":33213,"name":"Deuterostomia"}

12 {"taxid":7711,"rank":"phylum","names":{"Chordata":"scientific_name"},"parentid":33511,

˓→"name":"Chordata"}

13 {"taxid":89593,"rank":"subphylum","names":{"Craniata":"scientific_name"},"parentid

˓→":7711,"name":"Craniata"}

14 #cut

15 $: ls ${PWD}

16 dockertaxa.db

17 $: docker run --rm -i -v ${PWD}:/dbs ncbi-taxonomist resolve -t 9606 -db /dbs/

˓→dockertaxa.db

18 {"mode":"resolve","query":"9606","cast":"taxon","taxon":{"taxid":9606,"rank":"species

˓→","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":

˓→"CommonName"},"parentid":9605,"name":"Homo sapiens"},"lineage":[{"taxid":9606,"rank

˓→":"species","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName",

˓→"man":"CommonName"},"parentid":9605,"name":"Homo sapiens"},{"taxid":9605,"rank":

˓→"genus","names":{"Homo":"scientific_name"},"parentid":207598,"name":"Homo"},{"taxid

˓→":207598,"rank":"subfamily","names":{"Homininae":"scientific_name"},"parentid":9604,

˓→"name":"Homininae"},{"taxid":9604,"rank":"family","names":{"Hominidae":"scientific_

˓→name"},"parentid":314295,"name":"Hominidae"},{"taxid":314295,"rank":"superfamily",

˓→"names":{"Hominoidea":"scientific_name"},"parentid":9526,"name":"Hominoidea"},{

˓→"taxid":9526,"rank":"parvorder","names":{"Catarrhini":"scientific_name"},"parentid

˓→":314293,"name":"Catarrhini"},{"taxid":314293,"rank":"infraorder","names":{

˓→"Simiiformes":"scientific_name"},"parentid":376913,"name":"Simiiformes"},{"taxid

˓→":376913,"rank":"suborder","names":{"Haplorrhini":"scientific_name"},"parentid

˓→":9443,"name":"Haplorrhini"},{"taxid":9443,"rank":"order","names":{"Primates":

˓→"scientific_name"},"parentid":314146,"name":"Primates"},{"taxid":314146,"rank":

˓→"superorder","names":{"Euarchontoglires":"scientific_name"},"parentid":1437010,"name

˓→":"Euarchontoglires"},{"taxid":1437010,"rank":"clade","names":{"Boreoeutheria":

˓→"scientific_name"},"parentid":9347,"name":"Boreoeutheria"},{"taxid":9347,"rank":

˓→"clade","names":{"Eutheria":"scientific_name"},"parentid":32525,"name":"Eutheria"},{

˓→"taxid":32525,"rank":"clade","names":{"Theria":"scientific_name"},"parentid":40674,

˓→"name":"Theria"},{"taxid":40674,"rank":"class","names":{"Mammalia":"scientific_name

˓→"},"parentid":32524,"name":"Mammalia"},{"taxid":32524,"rank":"clade","names":{

˓→"Amniota":"scientific_name"},"parentid":32523,"name":"Amniota"},{"taxid":32523,"rank

˓→":"clade","names":{"Tetrapoda":"scientific_name"},"parentid":1338369,"name":

˓→"Tetrapoda"},{"taxid":1338369,"rank":"clade","names":{"Dipnotetrapodomorpha":

˓→"scientific_name"},"parentid":8287,"name":"Dipnotetrapodomorpha"},{"taxid":8287,

˓→"rank":"superclass","names":{"Sarcopterygii":"scientific_name"},"parentid":117571,

˓→"name":"Sarcopterygii"},{"taxid":117571,"rank":"clade","names":{"Euteleostomi":

˓→"scientific_name"},"parentid":117570,"name":"Euteleostomi"},{"taxid":117570,"rank":

˓→"clade","names":{"Teleostomi":"scientific_name"},"parentid":7776,"name":"Teleostomi

˓→"},{"taxid":7776,"rank":"clade","names":{"Gnathostomata":"scientific_name"},

˓→"parentid":7742,"name":"Gnathostomata"},{"taxid":7742,"rank":"clade","names":{

˓→"Vertebrata":"scientific_name"},"parentid":89593,"name":"Vertebrata"},{"taxid

˓→":89593,"rank":"subphylum","names":{"Craniata":"scientific_name"},"parentid":7711,

˓→"name":"Craniata"},{"taxid":7711,"rank":"phylum","names":{"Chordata":"scientific_

˓→name"},"parentid":33511,"name":"Chordata"},{"taxid":33511,"rank":"clade","names":{

˓→"Deuterostomia":"scientific_name"},"parentid":33213,"name":"Deuterostomia"},{"taxid

˓→":33213,"rank":"clade","names":{"Bilateria":"scientific_name"},"parentid":6072,"name

˓→":"Bilateria"},{"taxid":6072,"rank":"clade","names":{"Eumetazoa":"scientific_name"},

˓→"parentid":33208,"name":"Eumetazoa"},{"taxid":33208,"rank":"kingdom","names":{

˓→"Metazoa":"scientific_name"},"parentid":33154,"name":"Metazoa"},{"taxid":33154,"rank

˓→":"clade","names":{"Opisthokonta":"scientific_name"},"parentid":2759,"name":

˓→"Opisthokonta"},{"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":

˓→"scientific_name"},"parentid":131567,"name":"Eukaryota"},{"taxid":131567,"rank":"no

˓→rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":

˓→"cellular organisms"}]}
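When the container is started from a script, the --user and -v arguments of Listing 4.1 can be assembled programmatically. The sketch below only builds and prints the command and does not run Docker. The function name docker_import_command is made up for this example, the image is assumed to be tagged ncbi-taxonomist, and os.getuid() and os.getgid() are only available on Unix-like systems.

```python
import os
import shlex


def docker_import_command(host_dir, db_name, image="ncbi-taxonomist"):
    """Build the docker run command that imports into a database below /dbs."""
    uid, gid = os.getuid(), os.getgid()  # same values as id -u and id -g
    return [
        "docker", "run", "--rm", "-i",
        "--user", f"{uid}:{gid}",                   # write as the current user
        "-v", f"{os.path.abspath(host_dir)}:/dbs",  # bind host path to /dbs
        image, "import", "-db", f"/dbs/{db_name}",
    ]


cmd = docker_import_command(".", "dockertaxa.db")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, with the output of a collect command as standard input.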

Docker ncbi-taxonomist and jq

To use the included jq, Docker’s run command has to be adjusted with the --entrypoint argument (Listing 4.2).

Listing 4.2: ncbi-taxonomist and jq together in the Docker container. Line 3 shows how to modify the Docker run command for jq.

1 $: docker run --rm -i ncbi-taxonomist map -a QZWG01000002.1 MG831203 | \

2 docker run --rm -i ncbi-taxonomist resolve --mapping | \

3 docker run --rm -i --entrypoint 'jq' ncbi-taxonomist -r '[.query, .lineage[].

˓→name]|@tsv'

4 MG831203 Deformed wing virus Iflavirus Iflaviridae

˓→Picornavirales Pisoniviricetes Pisuviricota Orthornavirae Riboviria

˓→Viruses

5 QZWG01000002.1 Glycine soja Glycine subgen. Soja Glycine Phaseoleae

˓→indigoferoid/millettioid clade NPAAA clade 50 kb inversion clade

˓→Papilionoideae Fabaceae Fabales fabids rosids Pentapetalae Gunneridae

˓→ eudicotyledons Mesangiospermae Magnoliopsida Spermatophyta Euphyllophyta

˓→Tracheophyta Embryophyta Streptophytina Streptophyta Viridiplantae

˓→Eukaryota cellular organisms
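For readers who prefer Python over jq, the filter [.query, .lineage[].name]|@tsv can be approximated as follows. This is a sketch under two assumptions: the function name query_lineage_tsv is made up, and the sample record is a shortened stand-in for a real resolve result. Unlike jq's @tsv, the sketch does not escape tabs or newlines inside values.

```python
import json


def query_lineage_tsv(result_line):
    """Return the query followed by all lineage names, separated by tabs."""
    record = json.loads(result_line)
    fields = [record["query"]]
    fields.extend(taxon["name"] for taxon in record["lineage"])
    return "\t".join(str(field) for field in fields)


# Shortened stand-in for one line of resolve output.
sample = json.dumps({
    "query": "MG831203",
    "lineage": [
        {"name": "Deformed wing virus"},
        {"name": "Iflavirus"},
        {"name": "Iflaviridae"},
    ],
})
row = query_lineage_tsv(sample)
print(row)
```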

4.2 Singularity

The Singularity container can be found at https://cloud.sylabs.io/library/jpb/ncbi-taxonomist/ncbi-taxonomist. Please check the Singularity Docs if some commands are unclear.

• The Singularity image creates a user named user inside the container to run all commands
• The container has the mountpoint /dbs to bind host paths

4.2.1 Install

The latest ncbi-taxonomist Singularity image can be pulled from https://cloud.sylabs.io/library/jpb/ncbi-taxonomist/ncbi-taxonomist using the command singularity pull library://jpb/ncbi-taxonomist/ncbi-taxonomist. If desired, the image can be renamed to a more concise name.

1 $: singularity pull library://jpb/ncbi-taxonomist/ncbi-taxonomist

2 INFO: Downloading library image

3 23.7MiB / 23.7MiB

˓→[======]

˓→100% 545.9 KiB/s 0s

4 $: mv ncbi-taxonomist_latest.sif ncbi-taxonomist.sif

Line 3 will likely look different.

Build

The Singularity container can be built using the definition file container/SINGULARITY.def present in the repository.


For more Singularity build options, check the corresponding man page (man singularity build) or the documentation. To build locally, you need root permissions or have to use the --remote option of the build command (Listing 4.3):

Listing 4.3: Building the ncbi-taxonomist Singularity container locally. The command on line 1 requires root permissions while the command on line 2 uses the --remote build option without root permissions.

1 $: singularity build ncbi-taxonomist.sif SINGULARITY.def

2 $: singularity build --remote ncbi-taxonomist.sif SINGULARITY.def

4.2.2 Test

Assuming the image is named ncbi-taxonomist.sif, invoking it without arguments shows the basic usage, indicating a successful install (Listing 4.4):

Listing 4.4: ncbi-taxonomist usage

1 $: ./ncbi-taxonomist.sif

2 usage: ncbi-taxonomist[--version][-v][--apikey APIKEY]{map,resolve,import,collect,

˓→subtree,group} ...

3

4 commands:

5 {map,resolve,import,collect,subtree,group}

6 map Map taxid to names and vice-versa

7 #cut

4.2.3 Basic usage

The examples assume the image is named ncbi-taxonomist.sif and show representative commands. The image can be used as an executable, i.e. it can be invoked as ./ncbi-taxonomist.sif. This corresponds to the command singularity run ncbi-taxonomist.sif. Listing 4.5 shows how to use both commands.

Mapping

1 $: ./ncbi-taxonomist.sif map -t 9606

2 {"mode":"mapping","query":"9606","cast":"taxon","taxon":{"taxid":9606,"rank":"species

˓→","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":

˓→"CommonName"},"parentid":9605,"name":"Homo sapiens"}}

Resolving

1 $: ./ncbi-taxonomist.sif resolve -t2 -n 'Arabidopsis'

2 {"mode":"resolve","query":"Arabidopsis","cast":"taxon","taxon":{"taxid":3701,"rank":

˓→"genus","names":{"Arabidopsis":"scientific_name","Cardaminopsis":"Synonym"},

˓→"parentid":980083,"name":"Arabidopsis"},"lineage":[{"taxid":3701,"rank":"genus",

˓→"names":{"Arabidopsis":"scientific_name","Cardaminopsis":"Synonym"},"parentid

˓→":980083,"name":"Arabidopsis"},{"taxid":980083,"rank":"tribe","names":{"Camelineae":

˓→"scientific_name"},"parentid":3700,"name":"Camelineae"},{"taxid":3700,"rank":"family

˓→","names":{"Brassicaceae":"scientific_name"},"parentid":3699,"name":"Brassicaceae"},

˓→{"taxid":3699,"rank":"order","names":{"Brassicales":"scientific_name"},"parentid

˓→":91836,"name":"Brassicales"},{"taxid":91836,"rank":"clade","names":{"malvids":

˓→"scientific_name"},"parentid":71275,"name":"malvids"},{"taxid":71275,"rank":"clade",

˓→"names":{"rosids":"scientific_name"},"parentid":1437201,"name":"rosids"},{"taxid

˓→":1437201,"rank":"clade","names":{"Pentapetalae":"scientific_name"},"parentid

˓→":91827,"name":"Pentapetalae"},{"taxid":91827,"rank":"clade","names":{"Gunneridae":

˓→"scientific_name"},"parentid":71240,"name":"Gunneridae"},{"taxid":71240,"rank":

˓→"clade","names":{"eudicotyledons":"scientific_name"},"parentid":1437183,"name":

˓→"eudicotyledons"},{"taxid":1437183,"rank":"clade","names":{"Mesangiospermae":

˓→"scientific_name"},"parentid":3398,"name":"Mesangiospermae"},{"taxid":3398,"rank":

˓→"class","names":{"Magnoliopsida":"scientific_name"},"parentid":58024,"name":

˓→"Magnoliopsida"},{"taxid":58024,"rank":"clade","names":{"Spermatophyta":"scientific_

˓→name"},"parentid":78536,"name":"Spermatophyta"},{"taxid":78536,"rank":"clade","names

˓→":{"Euphyllophyta":"scientific_name"},"parentid":58023,"name":"Euphyllophyta"},{

˓→"taxid":58023,"rank":"clade","names":{"Tracheophyta":"scientific_name"},"parentid

˓→":3193,"name":"Tracheophyta"},{"taxid":3193,"rank":"clade","names":{"Embryophyta":

˓→"scientific_name"},"parentid":131221,"name":"Embryophyta"},{"taxid":131221,"rank":

˓→"subphylum","names":{"Streptophytina":"scientific_name"},"parentid":35493,"name":

˓→"Streptophytina"},{"taxid":35493,"rank":"phylum","names":{"Streptophyta":

˓→"scientific_name"},"parentid":33090,"name":"Streptophyta"},{"taxid":33090,"rank":

˓→"kingdom","names":{"Viridiplantae":"scientific_name"},"parentid":2759,"name":

˓→"Viridiplantae"},{"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":

˓→"scientific_name"},"parentid":131567,"name":"Eukaryota"},{"taxid":131567,"rank":"no

˓→rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":

˓→"cellular organisms"}]}

3 {"mode":"resolve","query":"2","cast":"taxon","taxon":{"taxid":2,"rank":"superkingdom",

˓→"names":{"Bacteria":"scientific_name","eubacteria":"GenbankCommonName","bacteria":

˓→"BlastName","Monera":"Inpart","Procaryotae":"Inpart","Prokaryota":"Inpart",

˓→"Prokaryotae":"Inpart","prokaryote":"Inpart","prokaryotes":"Inpart"},"parentid

˓→":131567,"name":"Bacteria"},"lineage":[{"taxid":2,"rank":"superkingdom","names":{

˓→"Bacteria":"scientific_name","eubacteria":"GenbankCommonName","bacteria":"BlastName

˓→","Monera":"Inpart","Procaryotae":"Inpart","Prokaryota":"Inpart","Prokaryotae":

˓→"Inpart","prokaryote":"Inpart","prokaryotes":"Inpart"},"parentid":131567,"name":

˓→"Bacteria"},{"taxid":131567,"rank":"no rank","names":{"cellular organisms":

˓→"scientific_name"},"parentid":null,"name":"cellular organisms"}]}

Pipelines

1 $: ./ncbi-taxonomist.sif map -edb bioproject -a PRJNA604394 | \

2 ./ncbi-taxonomist.sif resolve -m

3 {"mode":"resolve","query":"PRJNA604394","cast":"accs","accs":{"taxid":573,"accessions

˓→":{"project_id":604394,"project_acc":"PRJNA604394","project_name":"Klebsiella

˓→pneumoniae strain:S01"},"db":"bioproject","uid":604394},"lineage":[{"taxid":573,

˓→"rank":"species","names":{"Klebsiella pneumoniae":"scientific_name","'Klebsiella

˓→aerogenes' (Kruse) Taylor et al. 1956":"Synonym","Bacillus pneumoniae":"Synonym",

˓→"Bacterium pneumoniae crouposae":"Synonym","Hyalococcus pneumoniae":"Synonym",

˓→"Klebsiella pneumoniae aerogenes":"Synonym","Klebsiella sp. 2N3":"Includes",

˓→"Klebsiella sp. C1(2016)":"Includes","Klebsiella sp. M-AI-2":"Includes","Klebsiella

˓→sp. PB12":"Includes","Klebsiella sp. RCE-7":"Includes","ATCC 13883":"type material",

˓→"ATCC:13883":"type material","BCCM/LMG:2095":"type material","CCUG 225":"type

˓→material","CCUG:225":"type material","CDC 298-53":"type material","CDC:298-53":

˓→"type material","CIP 82.91":"type material","CIP:82.91":"type material","DSM 30104":

˓→"type material","DSM:30104":"type material","HAMBI 450":"type material","HAMBI:450":

˓→"type material","IAM 14200":"type material","IAM:14200":"type material","IFO 14940":

˓→"type material","IFO:14940":"type material","JCM 1662":"type material","JCM:1662":

˓→"type material","LMG 2095":"type material","LMG:2095":"type material","NBRC 14940":

˓→"type material","NBRC:14940":"type material","NCTC 9633":"type material","NCTC:9633

˓→":"type material"},"parentid":570,"name":"Klebsiella pneumoniae"},{"taxid":570,"rank

˓→":"genus","names":{"Klebsiella":"scientific_name"},"parentid":543,"name":"Klebsiella

˓→"},{"taxid":543,"rank":"family","names":{"Enterobacteriaceae":"scientific_name"},

˓→"parentid":91347,"name":"Enterobacteriaceae"},{"taxid":91347,"rank":"order","names":

˓→{"Enterobacterales":"scientific_name"},"parentid":1236,"name":"Enterobacterales"},{

˓→"taxid":1236,"rank":"class","names":{"Gammaproteobacteria":"scientific_name"},

˓→"parentid":1224,"name":"Gammaproteobacteria"},{"taxid":1224,"rank":"phylum","names":

˓→{"Proteobacteria":"scientific_name"},"parentid":2,"name":"Proteobacteria"},{"taxid

˓→":2,"rank":"superkingdom","names":{"Bacteria":"scientific_name"},"parentid":131567,

˓→"name":"Bacteria"},{"taxid":131567,"rank":"no rank","names":{"cellular organisms":

˓→"scientific_name"},"parentid":null,"name":"cellular organisms"}]}

Local database

To use local databases with the ncbi-taxonomist Singularity container, the path on the host machine needs to be bound to the container’s internal mountpoint /dbs via the --bind option, which cannot be used with the executable form (Listing 4.5). However, the bind options can be stored in the environment variable SINGULARITY_BIND (Listing 4.6).


Listing 4.5: Populating a local database using the ncbi-taxonomist Singularity container. Lines 4 and 17 show how to bind the current working directory to the container. #cut indicates shortened output.

1 $: ls ${PWD}

2 #empty

3 $: ./ncbi-taxonomist.sif collect -t 9606 | \

4 singularity run --bind ${PWD}:/dbs ncbi-taxonomist.sif import -db /dbs/simgtaxa.db

5 {"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},

˓→"parentid":null,"name":"cellular organisms"}

6 {"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":"scientific_name"},"parentid

˓→":131567,"name":"Eukaryota"}

7 {"taxid":33154,"rank":"clade","names":{"Opisthokonta":"scientific_name"},"parentid

˓→":2759,"name":"Opisthokonta"}

8 {"taxid":33208,"rank":"kingdom","names":{"Metazoa":"scientific_name"},"parentid

˓→":33154,"name":"Metazoa"}

9 {"taxid":6072,"rank":"clade","names":{"Eumetazoa":"scientific_name"},"parentid":33208,

˓→"name":"Eumetazoa"}

10 {"taxid":33213,"rank":"clade","names":{"Bilateria":"scientific_name"},"parentid":6072,

˓→"name":"Bilateria"}

11 {"taxid":33511,"rank":"clade","names":{"Deuterostomia":"scientific_name"},"parentid

˓→":33213,"name":"Deuterostomia"}

12 {"taxid":7711,"rank":"phylum","names":{"Chordata":"scientific_name"},"parentid":33511,

˓→"name":"Chordata"}

13 {"taxid":89593,"rank":"subphylum","names":{"Craniata":"scientific_name"},"parentid

˓→":7711,"name":"Craniata"}

14 #cut

15 $: ls ${PWD}

16 simgtaxa.db

17 $: singularity run --bind ${PWD}:/dbs ncbi-taxonomist.sif resolve -t 9606 -db /dbs/

˓→simgtaxa.db

18 {"mode":"resolve","query":"9606","cast":"taxon","taxon":{"taxid":9606,"rank":"species

˓→","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName","man":

˓→"CommonName"},"parentid":9605,"name":"Homo sapiens"},"lineage":[{"taxid":9606,"rank

˓→":"species","names":{"Homo sapiens":"scientific_name","human":"GenbankCommonName",

˓→"man":"CommonName"},"parentid":9605,"name":"Homo sapiens"},{"taxid":9605,"rank":

˓→"genus","names":{"Homo":"scientific_name"},"parentid":207598,"name":"Homo"},{"taxid

˓→":207598,"rank":"subfamily","names":{"Homininae":"scientific_name"},"parentid":9604,

˓→"name":"Homininae"},{"taxid":9604,"rank":"family","names":{"Hominidae":"scientific_

˓→name"},"parentid":314295,"name":"Hominidae"},{"taxid":314295,"rank":"superfamily",

˓→"names":{"Hominoidea":"scientific_name"},"parentid":9526,"name":"Hominoidea"},{

˓→"taxid":9526,"rank":"parvorder","names":{"Catarrhini":"scientific_name"},"parentid

˓→":314293,"name":"Catarrhini"},{"taxid":314293,"rank":"infraorder","names":{

˓→"Simiiformes":"scientific_name"},"parentid":376913,"name":"Simiiformes"},{"taxid

˓→":376913,"rank":"suborder","names":{"Haplorrhini":"scientific_name"},"parentid

˓→":9443,"name":"Haplorrhini"},{"taxid":9443,"rank":"order","names":{"Primates":

˓→"scientific_name"},"parentid":314146,"name":"Primates"},{"taxid":314146,"rank":

˓→"superorder","names":{"Euarchontoglires":"scientific_name"},"parentid":1437010,"name

˓→":"Euarchontoglires"},{"taxid":1437010,"rank":"clade","names":{"Boreoeutheria":

˓→"scientific_name"},"parentid":9347,"name":"Boreoeutheria"},{"taxid":9347,"rank":

˓→"clade","names":{"Eutheria":"scientific_name"},"parentid":32525,"name":"Eutheria"},{

˓→"taxid":32525,"rank":"clade","names":{"Theria":"scientific_name"},"parentid":40674,

˓→"name":"Theria"},{"taxid":40674,"rank":"class","names":{"Mammalia":"scientific_name

˓→"},"parentid":32524,"name":"Mammalia"},{"taxid":32524,"rank":"clade","names":{

˓→"Amniota":"scientific_name"},"parentid":32523,"name":"Amniota"},{"taxid":32523,"rank

˓→":"clade","names":{"Tetrapoda":"scientific_name"},"parentid":1338369,"name":

˓→"Tetrapoda"},{"taxid":1338369,"rank":"clade","names":{"Dipnotetrapodomorpha":

˓→"scientific_name"},"parentid":8287,"name":"Dipnotetrapodomorpha"},{"taxid":8287,

˓→"rank":"superclass","names":{"Sarcopterygii":"scientific_name"},"parentid":117571,

˓→"name":"Sarcopterygii"},{"taxid":117571,"rank":"clade","names":{"Euteleostomi":

˓→"scientific_name"},"parentid":117570,"name":"Euteleostomi"},{"taxid":117570,"rank":

˓→"clade","names":{"Teleostomi":"scientific_name"},"parentid":7776,"name":"Teleostomi

˓→"},{"taxid":7776,"rank":"clade","names":{"Gnathostomata":"scientific_name"},

˓→"parentid":7742,"name":"Gnathostomata"},{"taxid":7742,"rank":"clade","names":{

˓→"Vertebrata":"scientific_name"},"parentid":89593,"name":"Vertebrata"},{"taxid

˓→":89593,"rank":"subphylum","names":{"Craniata":"scientific_name"},"parentid":7711,

˓→"name":"Craniata"},{"taxid":7711,"rank":"phylum","names":{"Chordata":"scientific_

˓→name"},"parentid":33511,"name":"Chordata"},{"taxid":33511,"rank":"clade","names":{

˓→"Deuterostomia":"scientific_name"},"parentid":33213,"name":"Deuterostomia"},{"taxid

˓→":33213,"rank":"clade","names":{"Bilateria":"scientific_name"},"parentid":6072,"name

˓→":"Bilateria"},{"taxid":6072,"rank":"clade","names":{"Eumetazoa":"scientific_name"},

˓→"parentid":33208,"name":"Eumetazoa"},{"taxid":33208,"rank":"kingdom","names":{

˓→"Metazoa":"scientific_name"},"parentid":33154,"name":"Metazoa"},{"taxid":33154,"rank

˓→":"clade","names":{"Opisthokonta":"scientific_name"},"parentid":2759,"name":

˓→"Opisthokonta"},{"taxid":2759,"rank":"superkingdom","names":{"Eukaryota":

˓→"scientific_name"},"parentid":131567,"name":"Eukaryota"},{"taxid":131567,"rank":"no

˓→rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":

˓→"cellular organisms"}]}

Listing 4.6: Populating a local database using the ncbi-taxonomist Singularity container with the SINGULARITY_BIND environment variable. Line 1 shows how to set the environment variable; the echo command on line 2 should print your current working directory followed by :/dbs. #result indicates the same results as for the corresponding commands in Listing 4.5.

1 $: export SINGULARITY_BIND="${PWD}:/dbs"

2 $: echo $SINGULARITY_BIND

3 /path/to/your/current/working/directory:/dbs

4 $: ./ncbi-taxonomist.sif collect -t 9606 | \

5 ./ncbi-taxonomist.sif import -db /dbs/simgtaxa.db

6 #result

7 $: ls ${PWD}

8 simgtaxa.db

9 $: ./ncbi-taxonomist.sif resolve -t 9606 -db /dbs/simgtaxa.db

10 #result
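The error message in the Frequently Asked Questions suggests that the local database is an SQLite file. Assuming this, a database created as above can be inspected with Python's sqlite3 module. The sketch below only lists table names, because the schema is not described here. To stay self-contained it runs against a temporary demo database with a made-up table demo_taxa; replace the path with your own file, e.g. simgtaxa.db.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Demo database with a hypothetical table; not the ncbi-taxonomist schema.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = os.path.join(tmp, "demo.db")
    with closing(sqlite3.connect(demo_db)) as conn:
        conn.execute("CREATE TABLE demo_taxa (taxid INTEGER PRIMARY KEY)")
        conn.commit()
    tables = list_tables(demo_db)

print(tables)
```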

Singularity ncbi-taxonomist and jq

To use the included jq with the Singularity container, the run command has to be used in conjunction with the --app option (Listing 4.7).

Listing 4.7: Using ncbi-taxonomist and jq together in the Singularity container. Line 1 shows how to invoke jq to print its usage (cut for clarity). Line 5 shows the use of jq in an ncbi-taxonomist Singularity pipeline.

1 $: singularity run --app jq ncbi-taxonomist.sif

2 #jq usage

3 $: ./ncbi-taxonomist.sif map -a QZWG01000002.1 MG831203 | \

4 ./ncbi-taxonomist.sif resolve --mapping | \

5 singularity run --app jq ncbi-taxonomist.sif -r '[.query, .lineage[].name]|@tsv'

6 MG831203 Deformed wing virus Iflavirus Iflaviridae

˓→Picornavirales Pisoniviricetes Pisuviricota Orthornavirae Riboviria

˓→Viruses

7 QZWG01000002.1 Glycine soja Glycine subgen. Soja Glycine Phaseoleae

˓→indigoferoid/millettioid clade NPAAA clade 50 kb inversion clade

˓→Papilionoideae Fabaceae Fabales fabids rosids Pentapetalae Gunneridae

˓→ eudicotyledons Mesangiospermae Magnoliopsida Spermatophyta Euphyllophyta

˓→Tracheophyta Embryophyta Streptophytina Streptophyta Viridiplantae

˓→Eukaryota cellular organisms
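If jq is not at hand, the filter from Listing 4.7 can be approximated in Python. The following is a minimal sketch, assuming the resolve output is one JSON object per line with a query key and a lineage list whose entries carry a name key (the fields the jq filter accesses). The function name lineage_rows and the shortened sample record are illustrative and not part of ncbi-taxonomist.

```python
import json


def lineage_rows(lines):
    """Yield [query, name, name, ...] for each JSON line of resolve output."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        record = json.loads(line)
        yield [str(record["query"])] + [taxon["name"] for taxon in record["lineage"]]


# Shortened, illustrative record; real output carries more fields per taxon.
sample = json.dumps({
    "query": "MG831203",
    "lineage": [
        {"name": "Deformed wing virus"},
        {"name": "Iflavirus"},
        {"name": "Iflaviridae"},
    ],
})

for row in lineage_rows([sample]):
    print("\t".join(row))
```

In a pipeline, sys.stdin can be passed to lineage_rows instead of the sample list. Unlike jq's @tsv, this sketch does not escape tabs inside values.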

CHAPTER 5

Frequently Asked Questions

Content

• openSSL
  – I’m getting an SSL: CERTIFICATE_VERIFY_FAILED error
• SQLite
  – sqlite3.OperationalError: near "ON": syntax error during import
    * Possible solution 1
    * Possible solution 2

5.1 openSSL

5.1.1 I’m getting an SSL: CERTIFICATE_VERIFY_FAILED error

If you encounter an SSL error like SSL: CERTIFICATE_VERIFY_FAILED, you may need to enable the SSL module for Python or update the certificates. The fix depends on your OS or distribution.

• Mac OS: find and run Install Certificates.command, usually located in the folder where Python has been installed.

On Linux, you may need to update the certificates:

• Debian: run update-ca-certificates --fresh and export the environment variable SSL_CERT_DIR=/etc/ssl/certs.
• Arch Linux: install the ca-certificates* packages, e.g. pacman -S ca-certificates*

It is also possible to update the certificates via pip:

• run pip install --upgrade certifi


5.2 SQLite

5.2.1 sqlite3.OperationalError: near "ON": syntax error during import

ncbi-taxonomist aborts with an error message similar to the one shown below:

Traceback (most recent call last):
  File "/tools/python/3.7.4/bin/ncbi-taxonomist", line 93, in <module>
    main()
  File "/tools/python/3.7.4/bin/ncbi-taxonomist", line 58, in main
    ncbitaxonomist.db.dbimporter.import_stdin(nt.db)
  File "/tools/python/3.7.4/lib/python3.7/site-packages/ncbitaxonomist/db/dbimporter.py", line 95, in import_stdin
    commit(db, taxa, names)
  File "/tools/python/3.7.4/lib/python3.7/site-packages/ncbitaxonomist/db/dbimporter.py", line 34, in commit
    db.add_taxa(taxa)
  File "/tools/python/3.7.4/lib/python3.7/site-packages/ncbitaxonomist/db/dbmanager.py", line 69, in add_taxa
    self.taxa.insert(self.connection, values)
  File "/tools/python/3.7.4/lib/python3.7/site-packages/ncbitaxonomist/db/table/taxa.py", line 39, in insert
    connection.cursor().executemany(stmt, taxavalues)
sqlite3.OperationalError: near "ON": syntax error

Possible solution 1

The taxonomic database uses an old ncbi-taxonomist database schema. In this case, you need to rebuild the database using a current version of ncbi-taxonomist.

Possible solution 2

This has been reported earlier (issue 2). ncbi-taxonomist uses a PostgreSQL-style UPSERT, which was introduced in SQLite 3.24.0. You need a recent Python version (>= Python 3.8) and an SQLite version >= 3.24.0. You can use the available containers if you can’t update Python or SQLite.

If none of these solutions work for you, please open an issue.
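To find out which SQLite library your Python interpreter is linked against, the sqlite3 module of the standard library can be queried directly. The sketch below checks the version and, if UPSERT is supported, runs a small UPSERT against an in-memory database. The table demo and the helper names are hypothetical and unrelated to the actual ncbi-taxonomist schema.

```python
import sqlite3

MIN_UPSERT_VERSION = (3, 24, 0)  # UPSERT was added in SQLite 3.24.0


def supports_upsert():
    """Return True if the SQLite library used by Python supports UPSERT."""
    return sqlite3.sqlite_version_info >= MIN_UPSERT_VERSION


def upsert_demo():
    """Insert the same key twice; the second insert updates the existing row."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, not the ncbi-taxonomist schema
    conn.execute("CREATE TABLE demo (taxid INTEGER PRIMARY KEY, rank TEXT)")
    stmt = ("INSERT INTO demo (taxid, rank) VALUES (?, ?) "
            "ON CONFLICT(taxid) DO UPDATE SET rank=excluded.rank")
    conn.executemany(stmt, [(9606, "unknown"), (9606, "species")])
    rows = conn.execute("SELECT taxid, rank FROM demo").fetchall()
    conn.close()
    return rows


print("SQLite library version:", sqlite3.sqlite_version)
if supports_upsert():
    print(upsert_demo())
else:
    print("SQLite is older than 3.24.0: UPSERT statements fail with a syntax error")
```

Note that the version reported by the sqlite3 command line tool can differ from the library Python uses, so checking from within Python is the more reliable test.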

CHAPTER 6

Module references

Documentation of the different modules and classes used in ncbi-taxonomist.

Contents

• Module references
  – ncbi-taxonomist
  – Analyzer
    * Accession analyzer: ncbitaxonomist.analyzer.accession
    * Collection analyzer: ncbitaxonomist.analyzer.collect
    * Mapping analyzer: ncbitaxonomist.analyzer.mapping
    * Resolve analyzer: ncbitaxonomist.analyzer.resolve
  – Cache
    * Cache module: ncbitaxonomist.cache
    * Taxa cache module: ncbitaxonomist.cache.taxa
    * Accession cache module: ncbitaxonomist.cache.accession
  – Converter
    * Base converter: ncbitaxonomist.convert.converter
    * Attribute mapping: ncbitaxonomist.convert.convertermap
    * Local database accession converter: ncbitaxonomist.convert.accessiondb
    * NCBI accessions converter: ncbitaxonomist.convert.ncbiaccession
    * NCBI taxon converter: ncbitaxonomist.convert.ncbitaxon
    * Local database taxon converter: ncbitaxonomist.convert.taxadb
  – Data models
    * Basic data model: ncbitaxonomist.model.datamodel
    * Taxon model: ncbitaxonomist.model.taxon
    * Accession Data model: ncbitaxonomist.model.accession
  – Database
    * Database manager: ncbitaxonomist.db.dbmanager
    * Database importer: ncbitaxonomist.db.dbimporter
    * Database tables
      · Base table: ncbitaxonomist.db.table.basetable
      · Taxa table: ncbitaxonomist.db.table.taxa
      · Names table: ncbitaxonomist.db.table.names
      · Accession table: ncbitaxonomist.db.table.accessions
      · Accession table: ncbitaxonomist.db.table.groups
  – Entrez results
    * Accession result: ncbitaxonomist.entrezresult.accession
    * Taxa cache module: ncbitaxonomist.entrezresult.mapping
    * Accession cache module: ncbitaxonomist.entrezresult.taxonomy
  – Formatter
    * Base module: ncbitaxonomist.formatter.base
    * JSON formatter: ncbitaxonomist.formatter.jsonformatter
    * XML formatter: ncbitaxonomist.formatter.xmlformatter
  – Logging
    * Configuration: ncbitaxonomist.log.conf
    * Logger: ncbitaxonomist.log.logger
  – Mappers
    * Mapper: ncbitaxonomist.mapper
    * Remote mapper: ncbitaxonomist.analyzer.mapping
    * Remote accession mapper: ncbitaxonomist.analyzer.accession
  – Parser
    * Argument parser: ncbitaxonomist.parser.arguments
    * Group data parser: ncbitaxonomist.parser.group
    * General stdout parser: ncbitaxonomist.parser.stdout
  – Queries
    * Collect queries
      · Base query: ncbitaxonomist.query.collect.collect
      · Name query: ncbitaxonomist.query.collect.name
      · Taxid query: ncbitaxonomist.query.collect.taxid
    * Map queries
      · Base query: ncbitaxonomist.query.map.map
      · Name query: ncbitaxonomist.query.map.name
      · Taxid query: ncbitaxonomist.query.map.taxid
      · Accession query: ncbitaxonomist.query.map.accession
    * Resolve queries
      · Base query: ncbitaxonomist.query.resolve.resolve
      · Name query: ncbitaxonomist.query.resolve.name
      · Taxid query: ncbitaxonomist.query.resolve.taxid
      · Accession query: ncbitaxonomist.query.resolve.accession
    * Remote query pipelines
  – Payloads
    * Base class for payloads: ncbitaxonomist.payload.payload
    * Taxid payload
    * Names payload
    * Accessions payload
    * Accession map payload
  – Resolver
  – Lineage resolver
  – Subtrees
    * Subtree
  – Subtree analyzer
  – Utility functions used across modules
    * Utility functions: ncbitaxonomist.utils

6.1 ncbi-taxonomist

This is the entry script for ncbi-taxonomist. It runs the requested command and checks its parameters.

6.2 Analyzer

Analyzers handle remote data from Entrez and inherit from entrezpy.base.analyzer.EutilsAnalyzer.


6.2.1 Accession analyzer: ncbitaxonomist.analyzer.accession

6.2.2 Collection analyzer: ncbitaxonomist.analyzer.collect

6.2.3 Mapping analyzer: ncbitaxonomist.analyzer.mapping

6.2.4 Resolve analyzer: ncbitaxonomist.analyzer.resolve

6.3 Cache

The cache stores taxa so that already solved queries can be reused, avoiding unnecessary local or remote database lookups.

6.3.1 Cache module: ncbitaxonomist.cache

6.3.2 Taxa cache module: ncbitaxonomist.cache.taxa

6.3.3 Accession cache module: ncbitaxonomist.cache.accession

class ncbitaxonomist.cache.accession.AccessionCache
    Class to handle caching of accessions. Accessions are stored in a mapping with the accession as key and a ncbitaxonomist.model.accession.AccessionData instance as value.

    cache(acc: Type[ncbitaxonomist.model.accession.Accession])
        Caches accession

    get_accession(acc) → Type[ncbitaxonomist.model.accession.Accession]
        Returns given or all taxids in cache

    incache(name=None, taxid=None)
        Tests if given accession is in cache.

    is_empty()
        Tests if cache is empty.
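The caching idea described above can be illustrated with a small stand-in. This is a hedged sketch and not the AccessionCache implementation: the class SimpleAccessionCache, the helper lookup, and fake_remote_lookup are hypothetical names, and the cached payload is a placeholder dictionary.

```python
class SimpleAccessionCache:
    """Hypothetical stand-in: maps an accession string to a stored object."""

    def __init__(self):
        self._cache = {}

    def cache(self, accession, data):
        """Store data under the given accession."""
        self._cache[accession] = data

    def incache(self, accession):
        """Test if the accession has been cached."""
        return accession in self._cache

    def get(self, accession):
        """Return the cached object, or None if it is unknown."""
        return self._cache.get(accession)

    def is_empty(self):
        """Test if nothing has been cached yet."""
        return not self._cache


calls = []  # records every simulated remote request


def fake_remote_lookup(accession):
    """Placeholder for an expensive remote or database lookup."""
    calls.append(accession)
    return {"accession": accession}


def lookup(cache, accession):
    """Return cached data, performing the lookup only on a cache miss."""
    if not cache.incache(accession):
        cache.cache(accession, fake_remote_lookup(accession))
    return cache.get(accession)


demo_cache = SimpleAccessionCache()
lookup(demo_cache, "MG831203")
lookup(demo_cache, "MG831203")  # served from the cache
print("simulated remote lookups:", len(calls))
```

The second call is answered from the dictionary, which is the saving the cache is meant to provide.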

6.4 Converter

Converters convert between data models and plain attributes.

6.4.1 Base converter: ncbitaxonomist.convert.converter

class ncbitaxonomist.convert.converter.ModelConverter
    Base class for converters between attributes and models.

    convert_from_model(model: Type[ncbitaxonomist.model.datamodel.DataModel], outdict: Mapping[KT, VT_co] = None) → Dict[KT, VT]
        Virtual method converts model to attributes

    convert_to_model(attributes: Mapping[str, any], srcdb=None) → Type[ncbitaxonomist.model.datamodel.DataModel]
        Virtual method converts attributes to model

    map_inattributes(mattribs: Mapping[str, any], indata: Mapping[str, any], convmap: Mapping[str, str], switch: bool = False)
        Map input attributes to wanted model attributes

6.4.2 Attribute mapping: ncbitaxonomist.convert.convertermap

Maps indicating which data attributes are converted to which model attributes.
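The idea of a conversion map can be sketched as a dictionary that renames source keys to model keys. The example below is an illustration under assumptions: the function map_attributes, the source key names taxon_id and parent_id, and the meaning given to switch (applying the map in reverse) are all hypothetical and not taken from ncbitaxonomist.convert.convertermap.

```python
def map_attributes(indata, convmap, switch=False):
    """Rename the keys of indata according to convmap.

    convmap maps source attribute names to model attribute names.
    With switch=True the map is applied in reverse (model to source).
    This reading of 'switch' is an assumption made for the sketch.
    """
    if switch:
        convmap = {model_key: source_key for source_key, model_key in convmap.items()}
    return {target: indata[source] for source, target in convmap.items() if source in indata}


# Hypothetical source column names mapped to model attribute names
taxon_map = {"taxon_id": "taxid", "parent_id": "parentid", "rank": "rank"}

row = {"taxon_id": 2759, "parent_id": 131567, "rank": "superkingdom"}
model_attributes = map_attributes(row, taxon_map)
print(model_attributes)
print(map_attributes(model_attributes, taxon_map, switch=True))
```

Keeping the renaming rules in one dictionary means the same map can serve both directions of a conversion.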

6.4.3 Local database accession converter: ncbitaxonomist.convert.accessiondb

class ncbitaxonomist.convert.accessiondb.DbAccessionConverter
    Class implementing a converter for accession attributes and models

    convert_from_model(model: Type[ncbitaxonomist.model.accession.Accession], outdict=None) → Dict[str, str]
        Converts accession model to attributes

    convert_to_model(attributes: Mapping[str, any], srcdb=None) → Type[ncbitaxonomist.model.accession.Accession]
        Converts local database attributes to accession model

6.4.4 NCBI accessions converter: ncbitaxonomist.convert.ncbiaccession

class ncbitaxonomist.convert.ncbiaccession.NcbiAccessionConverter
    Convert NCBI accession data into model or model into attributes

    convert_from_model(model: Type[ncbitaxonomist.model.accession.Accession], outdict=None) → Dict[str, str]
        Converts accession model to attributes

    convert_to_model(attributes: Mapping[str, any], srcdb=None) → Type[ncbitaxonomist.model.accession.Accession]
        Converts NCBI attributes to accession model

6.4.5 NCBI taxon converter: ncbitaxonomist.convert.ncbitaxon

6.4.6 Local database taxon converter: ncbitaxonomist.convert.taxadb

class ncbitaxonomist.convert.taxadb.TaxaDbConverter
    Converts local database attributes into ncbitaxonomist.model.taxon.Taxon instances and vice versa

    convert_from_model(model: Type[ncbitaxonomist.model.taxon.Taxon], outdict=None) → Dict[str, str]
        Virtual method converts model to attributes

    convert_to_model(attributes: Mapping, srcdb=None) → Type[ncbitaxonomist.model.taxon.Taxon]
        Convert local database taxon attributes into ncbitaxonomist.model.taxon.Taxon

6.5 Data models

ncbi-taxonomist data models implement taxonomic and accession data. Models use ncbitaxonomist.model.datamodel.DataModel as base class.

6.5.1 Basic data model: ncbitaxonomist.model.datamodel

class ncbitaxonomist.model.datamodel.DataModel(cast, attributes: Mapping[KT, VT_co] = None)
    Base class for data models.

    get_attributes() → Dict[str, any]
        Return taxon attributes as dictionary.

    classmethod new(attributes: Mapping[str, any] = None) → ncbitaxonomist.model.datamodel.DataModel
        Return new instance with given attributes

    classmethod new_from_json(json_attributes: str) → ncbitaxonomist.model.datamodel.DataModel
        Return new instance with attributes encoded in JSON

    taxid()

ncbitaxonomist.model.datamodel.int_attribute(attribute)
    Enforce int for attribute

ncbitaxonomist.model.datamodel.standardize_attributes(attributes: Mapping[str, any])
    Convert None into empty dictionary. See the "Important warning" at https://docs.python.org/3/tutorial/controlflow.html#default-argument-values
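The warning referenced by standardize_attributes concerns mutable default arguments: a default such as {} is created once, when the function is defined, and is then shared by all calls. The sketch below contrasts the pitfall with the usual None default. The function names add_name_shared and add_name_safe are illustrative and not part of ncbi-taxonomist.

```python
def add_name_shared(name, names={}):
    """Pitfall: the default dictionary is created once and shared by all calls."""
    names[name] = "scientific_name"
    return names


def add_name_safe(name, names=None):
    """Use None as default and create a fresh dictionary on every call."""
    names = {} if names is None else names
    names[name] = "scientific_name"
    return names


first = add_name_shared("Eukaryota")
second = add_name_shared("Opisthokonta")
# Both calls returned the very same dictionary object
print(first is second, sorted(second))

# Each call gets its own dictionary
print(add_name_safe("Eukaryota"), add_name_safe("Opisthokonta"))
```

Converting None into an empty dictionary at the start of a function is the standard way to avoid this kind of hidden shared state.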

6.5.2 Taxon model: ncbitaxonomist.model.taxon

6.5.3 Accession Data model: ncbitaxonomist.model.accession

Inheritance diagram: ncbitaxonomist.model.datamodel.DataModel → ncbitaxonomist.model.accession.Accession

class ncbitaxonomist.model.accession.Accession(attributes: Mapping[KT, VT_co] = None)

    get_accessions() → Dict[str, str]
        Return accessions as dictionary

    get_attributes() → Dict[str, any]
        Return taxon attributes as dictionary.

    classmethod new(attributes: Mapping[str, any] = None) → ncbitaxonomist.model.datamodel.DataModel
        Return new instance with given attributes

    classmethod new_from_json(json_attributes: str) → ncbitaxonomist.model.datamodel.DataModel
        Return new instance with attributes encoded in JSON

    taxid()

    update_accessions(accession: Mapping[str, str])
        Update accessions from dictionary with structure accession:type

6.6 Database

Database modules for a local ncbi-taxonomist database

6.6.1 Database manager: ncbitaxonomist.db.dbmanager

6.6.2 Database importer: ncbitaxonomist.db.dbimporter

6.6.3 Database tables

Base table: ncbitaxonomist.db.table.basetable

class ncbitaxonomist.db.table.basetable.BaseTable(name: str, database: str)
    Implements a basic table in a taxonomist database.

    create(connection: Type[sqlite3.Connection]) → ncbitaxonomist.db.table.basetable.BaseTable
        Virtual function to create table

    create_index(connection: Type[sqlite3.Connection]) → None
        Virtual function to create table index

    insert(connection: Type[sqlite3.Connection], values: Tuple) → None
        Virtual function to insert rows

Taxa table: ncbitaxonomist.db.table.taxa

class ncbitaxonomist.db.table.taxa.TaxaTable(database: str)
    Implements taxa table for local taxonomy database.

    create(connection: Type[sqlite3.Connection]) → ncbitaxonomist.db.table.taxa.TaxaTable
        Virtual function to create table

    create_index(connection: Type[sqlite3.Connection]) → None
        Virtual function to create table index

    get_lineage(connection: Type[sqlite3.Connection], taxid: int, name_table: str) → Type[sqlite3.Cursor]
        Recursive construction of lineage from given taxid to highest parent.

    get_rows(connection: Type[sqlite3.Connection]) → Type[sqlite3.Cursor]

    get_subtree(connection: Type[sqlite3.Connection], taxid: int) → Type[sqlite3.Cursor]
        Depth first search of taxon ids to find the subtree of taxid

    get_taxids(connection: Type[sqlite3.Connection]) → Type[sqlite3.Cursor]

    insert(connection: Type[sqlite3.Connection], taxavalues: Iterable[Tuple[int, str, int]]) → None
        Virtual function to insert rows

    insert_taxids(connection: Type[sqlite3.Connection], taxids: Iterable[int]) → None
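A recursive lineage lookup of the kind get_lineage describes can be expressed in SQLite with a recursive common table expression. The sketch below shows the technique on a three-row demo table; the table layout (taxid, rank, parentid), the query, and the function lineage_of are assumptions for illustration and may differ from the statements ncbi-taxonomist actually runs.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a tiny, illustrative part of the taxonomy."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical layout: one row per taxon pointing to its parent
    conn.execute("CREATE TABLE taxa (taxid INTEGER PRIMARY KEY, rank TEXT, parentid INTEGER)")
    conn.executemany("INSERT INTO taxa VALUES (?, ?, ?)", [
        (131567, "no rank", None),        # cellular organisms
        (2759, "superkingdom", 131567),   # Eukaryota
        (33154, "clade", 2759),           # Opisthokonta
    ])
    return conn


LINEAGE_QUERY = """
WITH RECURSIVE lineage(taxid, rank, parentid, depth) AS (
    SELECT taxid, rank, parentid, 0 FROM taxa WHERE taxid = ?
    UNION ALL
    SELECT t.taxid, t.rank, t.parentid, l.depth + 1
    FROM taxa AS t JOIN lineage AS l ON t.taxid = l.parentid
)
SELECT taxid, rank, parentid FROM lineage ORDER BY depth
"""


def lineage_of(conn, taxid):
    """Walk from taxid up to the highest parent (the row with parentid NULL)."""
    return conn.execute(LINEAGE_QUERY, (taxid,)).fetchall()


conn = build_demo_db()
for taxon in lineage_of(conn, 33154):
    print(taxon)
```

The recursion stops on its own at the root, because a NULL parentid never matches a taxid in the join.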


Names table: ncbitaxonomist.db.table.names

class ncbitaxonomist.db.table.names.NameTable(database: str)
    Implements the name table in a taxonomist database.

    create(connection: Type[sqlite3.Connection]) → ncbitaxonomist.db.table.names.NameTable
        Virtual function to create table

    create_index(connection: Type[sqlite3.Connection])
        Virtual function to create table index

    get_rows(connection: Type[sqlite3.Connection]) → Type[sqlite3.Cursor]

    insert(connection: Type[sqlite3.Connection], values: Tuple[int, str, str])
        Virtual function to insert rows

    name_to_taxid(connection: Type[sqlite3.Connection], name) → Type[sqlite3.Cursor]

Accession table: ncbitaxonomist.db.table.accessions

class ncbitaxonomist.db.table.accessions.AccessionTable(database)

    create(connection: Type[sqlite3.Connection]) → ncbitaxonomist.db.table.accessions.AccessionTable
        Virtual function to create table

    create_index(connection: Type[sqlite3.Connection]) → None
        Virtual function to create table index

    get_rows(connection: Type[sqlite3.Connection]) → Type[sqlite3.Cursor]

    insert(connection: Type[sqlite3.Connection], values: Iterable[Tuple[str, str, str, int, int]]) → None
        Virtual function to insert rows

Accession table: ncbitaxonomist.db.table.groups

class ncbitaxonomist.db.table.groups.GroupTable(database: str)

    create(connection: Type[sqlite3.Connection]) → ncbitaxonomist.db.table.groups.GroupTable
        Virtual function to create table

    create_index(connection: Type[sqlite3.Connection]) → None
        Virtual function to create table index

    delete_from_group(connection: Type[sqlite3.Connection], values: Iterable[Tuple[str, int]]) → None

    delete_group(connection: Type[sqlite3.Connection], groupname: str) → None

    insert(connection: Type[sqlite3.Connection], values: Iterable[Tuple[int, str]]) → None
        Virtual function to insert rows

    retrieve_group(connection: Type[sqlite3.Connection], groupname: str)

    retrieve_names(connection: Type[sqlite3.Connection]) → Type[sqlite3.Cursor]


6.7 Entrez results

Implementations of Entrez results inherited from entrezpy.base.result.EutilsResult.

6.7.1 Accession result: ncbitaxonomist.entrezresult.accession

6.7.2 Taxa cache module: ncbitaxonomist.entrezresult.mapping

6.7.3 Accession cache module: ncbitaxonomist.entrezresult.taxonomy

6.8 Formatter

Formats JSON and XML outputs.

6.8.1 Base module: ncbitaxonomist.formatter.base

6.8.2 JSON formatter: ncbitaxonomist.formatter.jsonformatter

6.8.3 XML formatter: ncbitaxonomist.formatter.xmlformatter

6.9 Logging

Logging for ncbi-taxonomist.

6.9.1 Configuration: ncbitaxonomist.log.conf

6.9.2 Logger: ncbitaxonomist.log.logger

6.10 Mappers

Mappers handle the mapping of taxids, names, and accessions to each other. Analyzers are inherited and adjusted from entrezpy.

6.10.1 Mapper: ncbitaxonomist.mapper

6.10.2 Remote mapper: ncbitaxonomist.analyzer.mapping

6.10.3 Remote accession mapper: ncbitaxonomist.analyzer.accession

6.11 Parser

Parsers used in ncbi-taxonomist


6.11.1 Argument parser: ncbitaxonomist.parser.arguments

ncbitaxonomist.parser.arguments.parse(basename)

ncbitaxonomist.parser.arguments.version(basename)

6.11.2 Group data parser: ncbitaxonomist.parser.group

class ncbitaxonomist.parser.group.GroupParser

    parse(groupname: str)
        Parse stdin for taxonid to add into group groupname

    parse_taxa_list(taxa_list, taxids, groupname)

    parse_taxon(taxid, taxids, groupname)

6.11.3 General stdout parser: ncbitaxonomist.parser.stdout

6.12 Queries

Queries are modules implementing a specific taxonomic query, either remote or for a local database.

6.12.1 Collect queries

Queries to collect taxa remotely from Entrez.

Base query: ncbitaxonomist.query.collect.collect

Name query: ncbitaxonomist.query.collect.name

Taxid query: ncbitaxonomist.query.collect.taxid

6.12.2 Map queries

Queries to map taxa locally or remotely from Entrez.

Base query: ncbitaxonomist.query.map.map

Name query: ncbitaxonomist.query.map.name

Taxid query: ncbitaxonomist.query.map.taxid

Accession query: ncbitaxonomist.query.map.accession

6.12.3 Resolve queries

Queries to resolve taxa and accessions locally or remotely from Entrez.


Base query: ncbitaxonomist.query.resolve.resolve

Name query: ncbitaxonomist.query.resolve.name

Taxid query: ncbitaxonomist.query.resolve.taxid

Accession query: ncbitaxonomist.query.resolve.accession

6.12.4 Remote query pipelines

entrezpy.conduit pipelines to fetch remote query data.

6.13 Payloads

Payloads implement the requested taxids, names, and accessions. They keep track of what has been successfully analyzed.

6.13.1 Base class for payloads: ncbitaxonomist.payload.payload

6.13.2 Taxid payload

6.13.3 Names payload

6.13.4 Accessions payload

6.13.5 Accession map payload

6.14 Resolver

The resolver module implements the resolving of lineages for names, taxids, and accessions.

6.15 Lineage resolver

The lineage resolver resolves whole lineages or the lineage taxa between given ranks.

6.16 Subtrees

Subtrees are selected taxa from lineages.

6.16.1 Subtree

Implements a subtree.


6.17 Subtree analyzer

The subtree analyzer manages subtrees

6.18 Utility functions used across modules

6.18.1 Utility functions: ncbitaxonomist.utils

Content

• Synopsis
  – Functions
• Requirements and Dependencies
  – Requirements
  – Dependencies
• Contact
• Indices and tables

CHAPTER 7

Synopsis

$: pip install ncbi-taxonomist --user
$: ncbi-taxonomist collect -n human

ncbi-taxonomist handles and manages phylogenetic data available in NCBI’s Entrez databases.

7.1 Functions

• Collect: collect taxa from the Entrez Taxonomy database
• Map: map taxids, names, and accessions to related taxonomic information
• Resolve: resolve lineages for taxa (taxids and names) and accessions, e.g. sequence or protein
• Import: store obtained results locally in an SQLite database
• Subtree: extract a whole lineage, a specific rank, or a range of ranks from a taxid or name
• Group: create user-defined groups for taxa, for example:
  – create a group for all taxa specific to a project
  – group taxa without a phylogenetic relationship, e.g. group all taxa representing trees into a group “trees”

The ncbi-taxonomist commands, e.g. map or import, can be chained together using pipes to form more complex tasks. For example, to populate a local database, collect fetches data remotely from Entrez and prints it to STDOUT, from where import reads it via STDIN and populates the local database:

ncbi-taxonomist collect -n human | ncbi-taxonomist import -db taxo.db
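The same pipe can be set up from a Python script with the standard subprocess module. This is a minimal sketch: the helpers build_commands and run_pipeline are hypothetical names, only the collect -n and import -db options from the example above are used, and the demonstration run pipes two harmless Python one-liners so that it works without ncbi-taxonomist or network access.

```python
import subprocess
import sys


def build_commands(name, database, executable="ncbi-taxonomist"):
    """Return the argument lists for 'collect -n NAME' and 'import -db DATABASE'."""
    collect_cmd = [executable, "collect", "-n", name]
    import_cmd = [executable, "import", "-db", database]
    return collect_cmd, import_cmd


def run_pipeline(first_cmd, second_cmd):
    """Run 'first_cmd | second_cmd' and return both exit codes."""
    first = subprocess.Popen(first_cmd, stdout=subprocess.PIPE)
    second = subprocess.Popen(second_cmd, stdin=first.stdout)
    first.stdout.close()  # the parent no longer needs its copy of the pipe
    second_rc = second.wait()
    first_rc = first.wait()
    return first_rc, second_rc


collect_cmd, import_cmd = build_commands("human", "taxo.db")
print(" ".join(collect_cmd), "|", " ".join(import_cmd))

# Demonstration with stand-in commands instead of ncbi-taxonomist
producer = [sys.executable, "-c", "print('9606')"]
consumer = [sys.executable, "-c", "import sys; sys.stdin.read()"]
print(run_pipeline(producer, consumer))
```

With ncbi-taxonomist installed, run_pipeline(collect_cmd, import_cmd) would run the pipeline shown above; checking both exit codes tells you which stage failed.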

CHAPTER 8

Requirements and Dependencies

8.1 Requirements

• Required: Python >= 3.8

  $: python --version

• Optional: To use local databases, SQLite (>= 3.24.0) has to be installed. ncbi-taxonomist works without local databases, but needs to fetch all data remotely for each query.

  $: sqlite3 --version

8.2 Dependencies

ncbi-taxonomist has one dependency:

• entrezpy: to handle remote requests to NCBI’s Entrez databases
  – https://gitlab.com/ncbipy/entrezpy.git
  – https://pypi.org/project/entrezpy/
  – https://doi.org/10.1093/bioinformatics/btz385

entrezpy is a library maintained by myself that relies solely on the Python standard library. Therefore, ncbi-taxonomist is less prone to dependency hell.

CHAPTER 9

Contact

To report bugs and/or errors, please open an issue at https://gitlab.com/ncbi-taxonomist or contact me at: [email protected]. Of course, feel free to fork the code, improve it, and/or open a pull request.

CHAPTER 10

Indices and tables

• genindex
• modindex
• search


Python Module Index

• ncbitaxonomist.cache
• ncbitaxonomist.cache.accession
• ncbitaxonomist.convert.accessiondb
• ncbitaxonomist.convert.converter
• ncbitaxonomist.convert.convertermap
• ncbitaxonomist.convert.ncbiaccession
• ncbitaxonomist.convert.taxadb
• ncbitaxonomist.db.table.accessions
• ncbitaxonomist.db.table.basetable
• ncbitaxonomist.db.table.groups
• ncbitaxonomist.db.table.names
• ncbitaxonomist.db.table.taxa
• ncbitaxonomist.log.conf
• ncbitaxonomist.model.accession
• ncbitaxonomist.model.datamodel
• ncbitaxonomist.parser.arguments
• ncbitaxonomist.parser.group
