
sam2lca Release 0.3.1

Maxime Borry

Sep 07, 2021

CONTENTS:

1 sam2lca
  1.1 Quick start
  1.2 Installation
  1.3 Documentation

2 Python API

3 Command Line Interface
  3.1 sam2lca

4 Output
  4.1 JSON
  4.2 CSV

5 Indices and tables


Homepage: github.com/maxibor/sam2lca

CHAPTER ONE

SAM2LCA

Lowest Common Ancestor from a SAM/BAM/CRAM sequence alignment file
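As an illustration of what a lowest common ancestor means for taxonomic lineages, the sketch below walks several root-to-taxon lineages in parallel and keeps the deepest node they all share. It is a minimal example, not sam2lca's actual implementation; the function name `lowest_common_ancestor` and the `(rank, name)` tuple layout are assumptions made for this illustration, and the lineages are the ones shown in the Output chapter.

```python
from typing import List, Tuple

# A lineage is a list of (rank, name) pairs ordered from the root to the taxon.
Lineage = List[Tuple[str, str]]


def lowest_common_ancestor(lineages: List[Lineage]) -> Tuple[str, str]:
    """Return the deepest (rank, name) node shared by every lineage."""
    if not lineages:
        raise ValueError("at least one lineage is required")
    lca = None
    # zip() walks all lineages level by level and stops at the shortest one.
    for nodes in zip(*lineages):
        if all(node == nodes[0] for node in nodes):
            lca = nodes[0]
        else:
            break
    if lca is None:
        raise ValueError("lineages do not share a root")
    return lca


shared = [
    ("no rank", "root"),
    ("no rank", "cellular organisms"),
    ("superkingdom", "Bacteria"),
    ("phylum", "Proteobacteria"),
    ("class", "Gammaproteobacteria"),
    ("order", "Enterobacterales"),
    ("family", "Enterobacteriaceae"),
]
shigella = shared + [
    ("genus", "Shigella"),
    ("species", "Shigella dysenteriae"),
    ("no rank", "Shigella dysenteriae Sd197"),
]
ecoli = shared + [
    ("genus", "Escherichia"),
    ("species", "Escherichia coli"),
    ("no rank", "Escherichia coli K-12"),
    ("no rank", "Escherichia coli str. K-12 substr. MG1655"),
]

print(lowest_common_ancestor([shigella, ecoli]))
# ('family', 'Enterobacteriaceae')
```

A read that aligns equally well to both strains would, under this scheme, be assigned to the family Enterobacteriaceae.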

1.1 Quick start

Quick analysis of sequencing reads aligned to a DNA database:

$ sam2lca analyze myfile.bam

See all options:

$ sam2lca --help
$ sam2lca update-db --help
$ sam2lca analyze --help

1.2 Installation

1.2.1 From source

$ git clone git@github.com:maxibor/sam2lca.git
$ conda env create -f environment.yml
$ conda activate sam2lca
$ pip install git+ssh://git@github.com/maxibor/sam2lca.git

1.2.2 From Conda

$ conda install -c conda-forge -c bioconda -c maxibor sam2lca

1.2.3 From PyPI

$ pip install sam2lca

1.3 Documentation

The documentation is available here: sam2lca.readthedocs.io

CHAPTER TWO

PYTHON API

sam2lca.main.sam2lca(sam, mappings, tree, process, identity, length, conserved, dbdir, output)

Performs LCA on a SAM/BAM/CRAM alignment file.

Parameters
• sam (str) – Path to SAM/BAM/CRAM alignment file
• mappings (str) – Type of Acc2Tax mapping
• tree (str) – Optional taxonomic tree
• process (int) – Number of processes for parallelization
• identity (float) – Minimum identity
• length (int) – Minimum alignment length
• conserved – Ignore reads mapping in ultraconserved regions
• dbdir (str) – Path to database storing directory
• output (str) – Path to sam2lca output file

sam2lca.main.update_database(mappings, dbdir, ncbi)

Downloads and prepares the mappings and taxonomy databases.

Parameters
• mappings (str) – Type of Acc2Tax mapping
• dbdir (str) – Path to database storing directory
• ncbi (bool) – Updates NCBI taxonomic tree
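The following sketch shows one way the documented `sam2lca()` signature might be called from Python. The parameter names come from the signature above and the values mirror the command line defaults. Several things are assumptions made for illustration: that `tree=None` means "no custom tree", that `conserved` is a boolean, that the database directory lives under the user's home directory, and the output file name. The helper names `build_analysis_kwargs` and `run_analysis` are hypothetical.

```python
from pathlib import Path


def build_analysis_kwargs(sam_path: str, output_path: str) -> dict:
    """Collect keyword arguments matching the documented sam2lca() signature."""
    return {
        "sam": sam_path,
        "mappings": "nucl",    # command line default for -m/--mappings
        "tree": None,          # assumption: None means "no custom tree"
        "process": 2,          # command line default for -p/--process
        "identity": 0.8,       # command line default for -i/--identity
        "length": 30,          # command line default for -l/--length
        "conserved": False,    # assumption: a boolean switch
        "dbdir": str(Path.home() / ".sam2lca"),  # assumption: per-user directory
        "output": output_path,
    }


def run_analysis(sam_path: str, output_path: str) -> None:
    """Run the analysis; requires sam2lca to be installed."""
    # Imported lazily so the rest of this sketch works without sam2lca.
    from sam2lca.main import sam2lca

    sam2lca(**build_analysis_kwargs(sam_path, output_path))


# "myfile.sam2lca" is a hypothetical output name.
kwargs = build_analysis_kwargs("myfile.bam", "myfile.sam2lca")
print(sorted(kwargs))
```

Passing every argument by keyword keeps the call readable and avoids depending on the positional order of nine parameters.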

CHAPTER THREE

COMMAND LINE INTERFACE

To access the help menu:

$ sam2lca --help

The arguments and options are detailed below.

3.1 sam2lca

sam2lca: Last Common Ancestor on SAM/BAM/CRAM alignment files
Author: Maxime Borry
Contact:
Homepage & Documentation: github.com/maxibor/sam2lca

sam2lca [OPTIONS] COMMAND [ARGS]...

Options

--version
    Show the version and exit.

-m, --mappings
    Mapping type of accession to TAXID
    Default: nucl
    Options: nucl|prot|test

-d, --dbdir
    Directory to store databases
    Default: /home/docs/.sam2lca

3.1.1 analyze

Run the sam2lca analysis

SAM: path to SAM/BAM/CRAM alignment file

sam2lca analyze [OPTIONS] SAM

Options

-i, --identity
    Minimum identity
    Default: 0.8

-l, --length
    Minimum alignment length
    Default: 30

-c, --conserved
    Ignore reads mapping in ultraconserved regions

-p, --process
    Number of processes for parallelization
    Default: 2

-t, --tree
    Optional Newick Taxonomy Tree

-o, --output
    sam2lca output file

Arguments

SAM Required argument
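When sam2lca is driven from a script, the options above can be assembled into an argument list and handed to `subprocess`. The sketch below uses only flags listed in this chapter and places the global options (`--mappings`, `--dbdir`) before the `analyze` subcommand, as the usage line suggests. The helper names `build_analyze_command` and `run_analyze` are hypothetical, and treating `--conserved` as a switch without a value is an assumption.

```python
import shlex
import subprocess
from typing import List, Optional


def build_analyze_command(
    sam: str,
    identity: float = 0.8,
    length: int = 30,
    process: int = 2,
    conserved: bool = False,
    tree: Optional[str] = None,
    output: Optional[str] = None,
    mappings: str = "nucl",
    dbdir: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `sam2lca [OPTIONS] analyze [OPTIONS] SAM`."""
    command = ["sam2lca", "--mappings", mappings]
    if dbdir is not None:
        command += ["--dbdir", dbdir]
    command += [
        "analyze",
        "--identity", str(identity),
        "--length", str(length),
        "--process", str(process),
    ]
    if conserved:
        command.append("--conserved")  # assumption: a switch that takes no value
    if tree is not None:
        command += ["--tree", tree]
    if output is not None:
        command += ["--output", output]
    command.append(sam)
    return command


def run_analyze(command: List[str]) -> None:
    """Run the command; requires sam2lca to be installed and on PATH."""
    subprocess.run(command, check=True)


command = build_analyze_command("myfile.bam", identity=0.9, conserved=True)
print(shlex.join(command))
```

Building a list instead of a shell string avoids quoting problems with file names that contain spaces.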

3.1.2 update-db

Download/prepare mappings and taxonomy databases

sam2lca update-db [OPTIONS]

Options

-n, --ncbi
    Update NCBI taxonomy tree

CHAPTER FOUR

OUTPUT

sam2lca generates two output files: one JSON file and one CSV file.

4.1 JSON

A JSON file with NCBI Taxonomy IDs as keys. Each entry has the following fields:

• name: scientific name of the taxon
• rank: taxonomic rank of the taxon
• count: number of reads mapping to the taxon
• lineage: taxonomic lineage of the taxon

Example:

{
    "543": {
        "name": "Enterobacteriaceae",
        "rank": "family",
        "count": 2152,
        "lineage": [
            {"no rank": "root"},
            {"no rank": "cellular organisms"},
            {"superkingdom": "Bacteria"},
            {"phylum": "Proteobacteria"},
            {"class": "Gammaproteobacteria"},
            {"order": "Enterobacterales"},
            {"family": "Enterobacteriaceae"}
        ]
    },
    "300267": {
        "name": "Shigella dysenteriae Sd197",
        "rank": "no rank",
        "count": 338,
        "lineage": [
            {"no rank": "root"},
            {"no rank": "cellular organisms"},
            {"superkingdom": "Bacteria"},
            {"phylum": "Proteobacteria"},
            {"class": "Gammaproteobacteria"},
            {"order": "Enterobacterales"},
            {"family": "Enterobacteriaceae"},
            {"genus": "Shigella"},
            {"species": "Shigella dysenteriae"},
            {"no rank": "Shigella dysenteriae Sd197"}
        ]
    },
    "511145": {
        "name": "Escherichia coli str. K-12 substr. MG1655",
        "rank": "no rank",
        "count": 385,
        "lineage": [
            {"no rank": "root"},
            {"no rank": "cellular organisms"},
            {"superkingdom": "Bacteria"},
            {"phylum": "Proteobacteria"},
            {"class": "Gammaproteobacteria"},
            {"order": "Enterobacterales"},
            {"family": "Enterobacteriaceae"},
            {"genus": "Escherichia"},
            {"species": "Escherichia coli"},
            {"no rank": "Escherichia coli K-12"},
            {"no rank": "Escherichia coli str. K-12 substr. MG1655"}
        ]
    }
}
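A JSON file with this layout can be read with Python's standard `json` module. The sketch below, which is only one possible way to consume the output, looks up the name a record carries at a given rank and sums read counts per name at that rank. The records are a shortened set modelled on the example (lineages trimmed for brevity), and the helper names `name_at_rank` and `reads_per_rank` are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Optional

# Shortened records modelled on the documented example.
example = """
{
  "543": {"name": "Enterobacteriaceae", "rank": "family", "count": 2152,
          "lineage": [{"no rank": "root"}, {"family": "Enterobacteriaceae"}]},
  "300267": {"name": "Shigella dysenteriae Sd197", "rank": "no rank", "count": 338,
             "lineage": [{"no rank": "root"}, {"family": "Enterobacteriaceae"},
                         {"genus": "Shigella"}]}
}
"""


def name_at_rank(record: dict, rank: str) -> Optional[str]:
    """Return the first lineage name with the given rank, or None if absent."""
    # Each lineage element is a single-key dict mapping a rank to a name.
    # "no rank" can occur several times; the first match (the root) is returned.
    for node in record["lineage"]:
        if rank in node:
            return node[rank]
    return None


def reads_per_rank(results: Dict[str, dict], rank: str) -> Dict[str, int]:
    """Sum the read counts of all records that have a name at the given rank."""
    totals: Counter = Counter()
    for record in results.values():
        name = name_at_rank(record, rank)
        if name is not None:
            totals[name] += record["count"]
    return dict(totals)


results = json.loads(example)
print(reads_per_rank(results, "family"))
print(reads_per_rank(results, "genus"))
```

For a real run, `json.loads(example)` would be replaced by `json.load()` on the opened output file.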

4.2 CSV

Rows: taxa

Columns:

• TAXID: NCBI taxonomy ID
• name: Name of the taxon
• rank: Taxonomic rank
• count: Number of reads assigned to this taxon
• lineage: Taxonomic lineage of this taxon

Example:

TAXID, name, rank, count, lineage
543, Enterobacteriaceae, family, 2242,"[{'no rank':'root'},{'no rank':'cellular organisms'},{'superkingdom':'Bacteria'},{'phylum':'Proteobacteria'},{'class':'Gammaproteobacteria'},{'order':'Enterobacterales'},{'family':'Enterobacteriaceae'}]"
511145, Escherichia coli str. K-12 substr. MG1655, no rank, 385,"[{'no rank':'root'},{'no rank':'cellular organisms'},{'superkingdom':'Bacteria'},{'phylum':'Proteobacteria'},{'class':'Gammaproteobacteria'},{'order':'Enterobacterales'},{'family':'Enterobacteriaceae'},{'genus':'Escherichia'},{'species':'Escherichia coli'},{'no rank':'Escherichia coli K-12'},{'no rank':'Escherichia coli str. K-12 substr. MG1655'}]"
300267, Shigella dysenteriae Sd197, no rank, 248,"[{'no rank':'root'},{'no rank':'cellular organisms'},{'superkingdom':'Bacteria'},{'phylum':'Proteobacteria'},{'class':'Gammaproteobacteria'},{'order':'Enterobacterales'},{'family':'Enterobacteriaceae'},{'genus':'Shigella'},{'species':'Shigella dysenteriae'},{'no rank':'Shigella dysenteriae Sd197'}]"
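The lineage column holds a Python-style list of dictionaries inside a quoted CSV field, so it needs a second parsing step after the CSV reader. The sketch below shows one way to do that with the standard library, using `ast.literal_eval` to turn the quoted text back into a list. The rows are shortened versions of the example, the helper name `read_rows` is hypothetical, and `skipinitialspace=True` is used on the assumption that the real file may or may not have a space after each delimiter.

```python
import ast
import csv
import io
from typing import List, TextIO

# Shortened rows modelled on the documented example.
example = """TAXID, name, rank, count, lineage
543, Enterobacteriaceae, family, 2242,"[{'no rank':'root'},{'family':'Enterobacteriaceae'}]"
300267, Shigella dysenteriae Sd197, no rank, 248,"[{'no rank':'root'},{'genus':'Shigella'}]"
"""


def read_rows(handle: TextIO) -> List[dict]:
    """Parse sam2lca-style CSV rows and convert the typed columns."""
    rows = []
    # skipinitialspace drops the blank that may follow each delimiter.
    for row in csv.DictReader(handle, skipinitialspace=True):
        row["TAXID"] = int(row["TAXID"])
        row["count"] = int(row["count"])
        # literal_eval only accepts Python literals, so it is safe on this field.
        row["lineage"] = ast.literal_eval(row["lineage"])
        rows.append(row)
    return rows


rows = read_rows(io.StringIO(example))
for row in rows:
    print(row["TAXID"], row["name"], row["count"], len(row["lineage"]))
```

For a real file, `io.StringIO(example)` would be replaced by `open(path, newline="")`.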

CHAPTER FIVE

INDICES AND TABLES

• genindex
• modindex
• search
