UNIX, GNU/Linux and simple tools for data manipulation

Dr Jean-Baka DOMELEVO ENTFELLNER

BecA-ILRI Hub

Basic Bioinformatics Training Workshop @ILRI Addis Ababa Wednesday December 13th 2017

Outline

1 UNIX & GNU/Linux: brief history and introduction

2 Using the Bash shell
  - Your first commands
  - Filesystems and permissions
  - Bash special characters and features
  - Quoting in Bash

3 So many tools
  - You CANNOT live without your man
  - Data manipulation commandline tools

UNIX & GNU/Linux: introduction

GNU/Linux is an operating system (OS). GNU/Linux fully belongs to a broad family of OSes, the UNIX family.

Operating system: definition
- unique interface between the machine (hardware) and the different programs (software) users run on it
- allows different programs and different users to use the same machine concurrently
- implements a filesystem, a console environment, a graphical environment, drivers for keyboard and mouse, etc.
- examples of operating systems: Windows (Microsoft), Mac OS X (Apple), Android (Google), GNU/Linux, FreeBSD, etc.

“Linux” is only the kernel of GNU/Linux systems, responsible for granting access to the resources on the machine and for time-sharing between processes.

UNIX & GNU/Linux systems: timeline

GNU/Linux: a fairly recent member of an old and huge family (see http://www.levenez.com/unix/)
- 1969: UNICS
- 1971: UNIX Time-Sharing System V1
- 1982: SunOS 1.0
- 1983: UNIX System V
- 1991: GNU project (GNU/Hurd); Linux 0.01
- 1994: Linux 1.0
- 1999: Darwin 0.1; Mac OS X 1.0
- 2008: Android 1.0 (derived from Linux 2.6.23)
- 2013: Linux 3.9

Linux distributions: different flavours of the same OS

The GNU/Linux operating system comes in different distributions. Three distributions have long been true beacons and have spawned many offspring:

1 Debian (1993) ⇒ Ubuntu, 2004 and …, 2010
2 Slackware (1993), from SLS (1992) ⇒ SuSE, 1998
3 RedHat (late 1994) ⇒ CentOS and Fedora, both 2003

For a full account, see http://futurist.se/gldt

What makes UNIX systems superior to the Windows family

- UNIX gives you more control over your computer (no hidden actions, no undesired pieces of software).
- UNIX environments are far less exposed to viruses.
- UNIX enables you to harness the full computational power of your machine.
- UNIX systems have been designed from their origin to be massively multi-user and multi-process systems.
- UNIX systems are generally more secure than Windows.

Take-home message: the true power of UNIX (and so of GNU/Linux) lies in its commandline interface.

Bash: a shell environment

Bash is the most popular shell environment on GNU/Linux systems. It stands for "Bourne Again Shell".

Shell environments are designed to:
- interact with the host filesystem (browse and create directories, see the content of files, etc.),
- interact with the installed software (install, run, etc.),
- log in to distant hosts (telnet, ssh),
- perform all of the above through automated processes ⇒ scripts.

Shells are at the same time commandline environments (run one command at a time) and scripting environments (write and run scripts).

On most GNU/Linux distributions, Bash is accessible through the "Terminal" icon.

Standard structure of a UNIX command

Synopsis of a command

For example:
- ls (only the command)
- ls -l (command plus an option)
- ls -l -h h3a* (command, two options and one object)
- ls -lh h3a* (single-letter options can be concatenated)
- cp one two (command and two objects)
- man head (command and one object)
- head -n 2 one (an option with a value)
- head --lines=2 one (same command, GNU-style long option)
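The command shapes listed above can be tried safely in a scratch directory. This is only a sketch: the directory comes from mktemp, and the contents of one and h3a_notes.txt are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"
printf 'first\nsecond\nthird\n' > one   # made-up three-line file
touch h3a_notes.txt                     # hypothetical file matching h3a*

ls                       # only the command
ls -l                    # command plus an option
ls -lh h3a*              # two concatenated options and one object
cp one two               # command and two objects
short=$(head -n 2 one)   # an option with a value
echo "$short"
# head --lines=2 one     # long form of the same option (GNU coreutils)
```

The long form is left as a comment because it is a GNU extension and may not exist on other UNIX systems.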

UNIX filesystems

Filesystems are hierarchies. The filesystem of a UNIX machine is standardized. Under the root (/) are:
- /bin → essential command binaries
- /boot → static files of the boot loader
- /dev → device files (special files to access your devices)
- /etc → host-specific system configuration files
- /home → user home directories (e.g. /home/peter, /home/sarah, etc.)
- /lib → essential shared libraries and kernel modules
- /media → mount point for removable media (e.g. CD-ROMs & flash disks)
- /mnt → “old-style” mount point for any media
- /tmp → system-wide temporary directory, writable by anyone

File permissions

Three (four) types of rights:
- right to read from a file (r)
- right to write to it (w)
- right to execute a binary file or a script (x)
- right to traverse a directory (x)

Three types of people:
- the owner of a file (u)
- the other members of the user’s group (g)
- the rest of the world, the others (o)

Typical line of output from ls -l:

-rw-r--r-- 1 jbde jbde 171104 juil. 6 12:48 awk.dvi

File permissions explained
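As a small sketch of how the permission string reads, the first ten characters of an ls -l line are the file type followed by three rwx triplets (owner, group, others). The scratch file below is hypothetical and created only for the demonstration.

```shell
# Hypothetical scratch file used only for this demonstration.
f=$(mktemp)
chmod 644 "$f"                       # rw- for the owner, r-- for the group, r-- for others
before=$(ls -l "$f" | cut -c 1-10)   # keep only the type and permission characters
chmod u+x "$f"                       # add the execute right for the owner (u)
after=$(ls -l "$f" | cut -c 1-10)
echo "$before"                       # prints: -rw-r--r--
echo "$after"                        # prints: -rwxr--r--
rm -f "$f"
```

The leading dash marks a regular file; a directory would show d in that position.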

Why it is often necessary to quote strings or escape chars

Some characters have a special meaning for the tools you use, e.g. the commandline interpreter Bash:
- spaces or tabs are logical separators between elements on the commandline: cd /tmp
- a dollar sign introduces Bash variables: echo $PATH
- a star means “all the files” (wildcard): cat *
- the “greater than” sign is interpreted as a redirection: cat * > listing.txt
- the “vertical bar” pipes the output of some command into the input of another: grep h3a long_course.htm | wc -l
- ...

⇒ escaping or quoting prevents these characters from being interpreted by the shell.
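A minimal sketch of the wildcard, redirection and pipe in action. The scratch directory and the content written into long_course.htm are made up; only the file name comes from the slide.

```shell
# Scratch directory with a made-up long_course.htm.
workdir=$(mktemp -d)
cd "$workdir"
printf 'h3a intro\nsomething else\nh3a outro\n' > long_course.htm

cat *.htm > listing.txt                       # wildcard expanded by the shell, output redirected
matches=$(grep h3a long_course.htm | wc -l)   # grep output piped into wc
echo "$matches"                               # number of lines containing h3a
literal=$(echo '*')                           # quoted star: no wildcard expansion
echo "$literal"
```

The wildcard here is *.htm rather than a bare star, so that the output file listing.txt cannot be picked up as one of its own inputs.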

Escaping a single character

In Unix, prepending a backslash (\) escapes the character following the backslash.

> echo $PATH
/home/jbde/bin:/usr/local/bin:/usr/bin:/bin
> echo \$PATH
$PATH

And if a filename contains spaces, e.g. named with spaces.txt:

> cat named with spaces.txt
cat: named: No such file or directory
cat: with: No such file or directory
cat: spaces.txt: No such file or directory
> cat named\ with\ spaces.txt

Strong quoting with single quotes

You can also quote a string to prevent the spaces it contains from being interpreted:

> cat 'named with spaces.txt'

Generally speaking, single quotes do not allow any kind of interpretation, substitution or expansion.

> echo 'Your PATH variable contains $PATH'
Your PATH variable contains $PATH

“Weak quoting” with double quotes

While preventing the spaces they contain from being interpreted, double quotes allow expansion of Bash variables:

> cat "named with spaces.txt"
> echo "Your PATH variable contains $PATH"
Your PATH variable contains /home/jbde/bin:/usr/local/bin:/usr/bin:/bin

Using Bash every day

Bash has features you should use to work efficiently:
- the history of previous commands (browse with ↑, ↓, Ctrl+R)
- autocompletion with the Tab key everywhere you can (commands, filenames, etc.)
- wildcards and regexps
- use quoting appropriately
- pipe commands into each other (|)
- redirect output (> erases previous file, >> appends)
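Two of these habits, quoting and redirection, are easy to check for yourself. The sketch below uses a hypothetical variable called name and a scratch file from mktemp.

```shell
name='world'                    # hypothetical variable for the demo
escaped=$(echo \$name)          # backslash: the dollar sign stays literal
strong=$(echo 'Hello $name')    # single quotes: no expansion at all
weak=$(echo "Hello $name")      # double quotes: the variable is expanded
echo "$escaped"                 # prints: $name
echo "$strong"                  # prints: Hello $name
echo "$weak"                    # prints: Hello world

out=$(mktemp)                   # scratch file
echo 'first'  > "$out"          # > creates or overwrites the file
echo 'second' > "$out"          # the line 'first' is gone
echo 'third' >> "$out"          # >> appends to the existing content
content=$(cat "$out")
echo "$content"                 # prints "second", then "third"
rm -f "$out"
```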

Asking for help on a command: man

This is the absolute basic command, to learn first!

man ls

To browse within the manpage:
- Space: next page
- b: previous page
- G: go to the bottom
- g: go to the beginning
- /: search for an expression (type a pattern or string and press Enter)
- q: quit and return to the commandline

Sectioning of a manpage

Manpages are all written using the same sectioning:

1 NAME: the name of the command
2 SYNOPSIS: the syntax of the command (sometimes several lines to describe several ways of using the command)
  - square brackets ([...]) indicate optional components
  - pipes (|) within a construct separate alternatives
  - an ellipsis (...) usually indicates that the previous object is repeatable
3 DESCRIPTION and OPTIONS: meaning and behaviour of the different options and objects to give on the commandline
4 EXAMPLES: the most useful section, provides real-world examples along with some explanation of what they do
5 EXIT STATUS: useful in scripts, to monitor automatically whether the command execution produced an error
6 SEE ALSO: also useful when you don’t know the exact name of a command but know a similar/sister one (sister commands are cross-referenced)
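To make the EXIT STATUS section concrete, here is a small sketch using grep, whose manpage documents status 0 when a line matches and 1 when none does. The scratch file and its two lines are made up.

```shell
f=$(mktemp)
printf 'alpha\nbeta\n' > "$f"             # made-up content

found=0;   grep -q beta  "$f" || found=$?     # stays 0: a line matched
missing=0; grep -q gamma "$f" || missing=$?   # becomes 1: no line matched
echo "$found $missing"

# A script can branch directly on the status:
if grep -q alpha "$f"; then
    echo "alpha is present"
fi
rm -f "$f"
```

The `|| var=$?` form records a non-zero status without stopping a script that runs with `set -e`.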

Reading files: cat and less

cat
- produces the full content of file(s) to the standard output
- can concatenate several files: cat FILE1 FILE2 > FILE3
- is non-interactive: prints all and quits

less
- is a pager
- produces the full content of file(s) to the standard output, one page at a time
- several files are processed one after the other: less FILE1 FILE2 and then :n (next) and :p (previous) to browse
- is fully interactive: Space for the next page, b for the previous, / to search, q to quit, etc.
- useful option: -S not to have your lines automatically wrapped (preserves column alignment on long lines)
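A short sketch of concatenation with cat. FILE1, FILE2 and FILE3 are the placeholder names from the slide, created here in a scratch directory with made-up content.

```shell
workdir=$(mktemp -d)
printf 'line A\n' > "$workdir/FILE1"
printf 'line B\n' > "$workdir/FILE2"

# Concatenate both files into a third one.
cat "$workdir/FILE1" "$workdir/FILE2" > "$workdir/FILE3"
merged=$(cat "$workdir/FILE3")
echo "$merged"

# Paging is interactive and needs a terminal, so it is left as a comment:
# less -S "$workdir/FILE3"
```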

Count the number of chars, words or lines: wc

wc stands for "word count"
- wc -l FILE → number of lines
- wc -c FILE → number of bytes (≈ chars)
- wc -w FILE → number of words
- wc -L FILE → length of longest line in file
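A quick check of these counts on a made-up two-line file. Reading from standard input (with <) makes wc print the number alone, without the file name.

```shell
f=$(mktemp)
printf 'one two three\nfour five\n' > "$f"   # made-up content

nlines=$(wc -l < "$f")   # 2 lines
nwords=$(wc -w < "$f")   # 5 words
nbytes=$(wc -c < "$f")   # 24 bytes, newlines included
echo "$nlines $nwords $nbytes"
rm -f "$f"
```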

Select columns from a file: cut

Simplified syntax:

cut -f FIELDS -d DELIMITER FILE

- be sure you quote the delimiter, e.g. ';'
- FIELDS can be a comma-separated list (ranges indicated with hyphens)

Example: select fields 2 and 5 from a semicolon-separated file

cut -f 2,5 -d ';' cut_example.csv

Example: specify output separator (GNU cut)

cut -f 1-3 -d ';' --output-delimiter=$'\t' cut_example.csv

Example: extract only the first three characters of each line

cut -c 1-3 cut_example.csv
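The three examples can be reproduced on a tiny semicolon-separated file. The data below is invented (it is not the workshop's cut_example.csv), and --output-delimiter is assumed to be available, which is the case with GNU cut.

```shell
f=$(mktemp)
printf 'id;gene;chrom;start;score\n1;h3a;chr2;1500;88\n' > "$f"   # made-up data

fields=$(cut -f 2,5 -d ';' "$f")      # fields 2 and 5, still joined by ';'
chars=$(cut -c 1-3 "$f")              # first three characters of each line
tab=$(printf '\t')                    # a literal tab, portable across shells
tabbed=$(cut -f 1-3 -d ';' --output-delimiter="$tab" "$f")   # GNU cut only
echo "$fields"
echo "$chars"
echo "$tabbed"
rm -f "$f"
```

In Bash the tab can also be written directly as $'\t', as on the slide.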

Sort a file according to some rules: sort

sort sorts text files according to the content of some fields, called keys.

Example: sorting lines alphabetically

sort cut_example.csv

But it is usually a bad idea not to control the way sort sorts.

Example: sort according to 2nd and then 3rd field (semicolon-separated fields)

sort -t ';' -k 2,3 cut_example.csv

Example: sort numerically (-n) according to 9th field only

sort -t ';' -n -k 9,9 cut_example.csv
# to check results:
sort -t ';' -n -k 9,9 cut_example.csv | cut -f 9 -d ';'
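A sketch of why controlling the key matters, on an invented three-line file. LC_ALL=C is set only to make the ordering reproducible on any machine.

```shell
f=$(mktemp)
printf 'c;pear;10\na;apple;9\nb;fig;100\n' > "$f"   # made-up data

by_name=$(LC_ALL=C sort -t ';' -k 2,2 "$f")                       # alphabetical on field 2
as_text=$(LC_ALL=C sort -t ';' -k 3,3 "$f" | cut -f 3 -d ';')     # field 3 compared as text
as_number=$(LC_ALL=C sort -t ';' -n -k 3,3 "$f" | cut -f 3 -d ';')  # field 3 compared as numbers
echo "$by_name"
echo "$as_text"     # 10, 100, 9: text order, not numeric order
echo "$as_number"   # 9, 10, 100
rm -f "$f"
```

Piping into cut, as on the slide, is a convenient way to eyeball only the column that was used as the key.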

sort, continued

-g option to sort numerical fields containing scientific notation:
    sort -k 2,2 -n with_sci_notation   # unexpected result
    sort -k 2,2 -g with_sci_notation   # GOOD!

WARNING!! sort relies heavily on your locale setting! Try:

    LC_ALL=fr_FR.utf8 sort -k 2,2 -g with_sci_notation

One-letter sorting options can be appended to a key as flags, and several keys can be specified:

Ascending order on the 5th field, descending on the 6th and then alphabetically on the 1st field
    sort -k 5,5g -k 6,6nr -k 1,1 hmmsearch_raw_output | less -S
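The difference between -n and -g can be checked on a small made-up file (sci.txt and its values are assumptions for illustration). With -n, sort only reads the digits before the "e", so 5e-10 is treated as 5; with -g the whole number is parsed. LC_ALL=C is forced here so that the locale does not interfere.

```shell
#!/usr/bin/env bash
# Hypothetical two-column file: label, value in scientific notation
tmp=$(mktemp -d)
printf '%s\n' 'a 1e-3' 'b 5e-10' 'c 2e2' > "$tmp/sci.txt"

# -n stops reading at the "e": the keys are seen as 1, 5 and 2
with_n=$(LC_ALL=C sort -k 2,2 -n "$tmp/sci.txt" | cut -d ' ' -f 1 | paste -sd ' ' -)

# -g parses the full floating-point value: 5e-10 < 1e-3 < 2e2
with_g=$(LC_ALL=C sort -k 2,2 -g "$tmp/sci.txt" | cut -d ' ' -f 1 | paste -sd ' ' -)

echo "with -n: $with_n"
echo "with -g: $with_g"
rm -r "$tmp"
```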

sort, some caveats

WARNING!! By default, sort separates fields on blank to non-blank transitions.
⇒ Careful with empty fields! One should specify the delimiter.

A precise delimiter to prevent sort from “merging” empty fields
    sort -k 11,11 -t $'\t' CDS_top_100.txt | cut -f 11 | less
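The effect of an empty field can be sketched on a toy TAB-separated file (toy.tsv and its content are made up for illustration). Line "x" has an empty 2nd field: without -t, its two consecutive TABs count as a single field boundary, so its 3rd field appears empty and is sorted as 0.

```shell
#!/usr/bin/env bash
tab=$(printf '\t')   # a literal TAB, equivalent to $'\t' in Bash
tmp=$(mktemp -d)
# Hypothetical 3-column file; line "x" has an empty 2nd field
printf 'x\t\t3\ny\tb\t1\nz\tc\t2\n' > "$tmp/toy.tsv"

# Default splitting: line "x" seems to have no 3rd field at all
default_split=$(LC_ALL=C sort -n -k 3,3 "$tmp/toy.tsv" | cut -f 1 | paste -sd ' ' -)

# Explicit TAB delimiter: the 3rd field is really the 3rd column
tab_split=$(LC_ALL=C sort -t "$tab" -n -k 3,3 "$tmp/toy.tsv" | cut -f 1 | paste -sd ' ' -)

echo "default splitting: $default_split"
echo "with -t TAB:       $tab_split"
rm -r "$tmp"
```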

join lines of two files sharing a common field

join allows you to perform the relational join operation on two files.
Example: “I want to select the lines of FILE2 whose 11th field corresponds to an entry in FILE1.”

    join -1 1 -2 11 -t $'\t' dg_top_100.txt CDS_top_100.txt

WARNING!! join expects both files to be already sorted on the join field!
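A minimal sketch of the usual "sort first, then join" routine, on two made-up files (ids.txt and annot.tsv are hypothetical names, as is their content). Both files are sorted on their join field under the same locale that join itself runs in.

```shell
#!/usr/bin/env bash
export LC_ALL=C      # same collation order for sort and for join
tab=$(printf '\t')   # a literal TAB, equivalent to $'\t' in Bash
tmp=$(mktemp -d)

# Hypothetical inputs: a list of IDs, and a table with the ID in its 2nd field
printf '%s\n' 'geneB' 'geneA' > "$tmp/ids.txt"
printf '%s\t%s\n' 'cds1' 'geneC' 'cds2' 'geneA' 'cds3' 'geneB' > "$tmp/annot.tsv"

# Sort each file on its own join field
sort "$tmp/ids.txt" > "$tmp/ids.sorted"
sort -t "$tab" -k 2,2 "$tmp/annot.tsv" > "$tmp/annot.sorted"

# Keep the lines of the table whose 2nd field appears in the ID list
joined=$(join -t "$tab" -1 1 -2 2 "$tmp/ids.sorted" "$tmp/annot.sorted")
printf '%s\n' "$joined"
rm -r "$tmp"
```

Note that join prints the join field first, followed by the remaining fields of each file.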

Produce only the n last lines of a file: tail

Convenient to cut parts you are not interested in, for instance because:
- the final lines of a log file contain the error that matters to you
- the header (first few lines) of the file is of no interest for the next tool in the pipeline
- the file is sorted and the last lines contain the samples of interest: you set a cutoff

Produce the last 30 lines of a file
    tail -n 30 input_file

or simply:

    tail -30 input_file

Produce all the lines from the 30th onwards
    tail -n +30 input_file
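Two of the use cases above, skipping a header and applying a cutoff on a sorted file, can be chained in one pipeline. This is only a sketch on a made-up file (counts.csv and its content are assumptions).

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# Hypothetical file with a one-line header
printf '%s\n' 'sample;reads' 's1;30' 's2;10' 's3;20' > "$tmp/counts.csv"

# tail -n +2 : start at line 2, i.e. drop the header
# sort -n    : ascending numeric order on the 2nd field
# tail -n 2  : keep the two samples with the most reads
top2=$(tail -n +2 "$tmp/counts.csv" | LC_ALL=C sort -t ';' -n -k 2,2 | tail -n 2)
printf '%s\n' "$top2"
rm -r "$tmp"
```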

Symmetrical to tail: head

Produce the first 30 lines of a file
    head -n 30 input_file

or simply:

    head -30 input_file

Produce all but the last 30 lines
    head -n -30 input_file
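head and tail can be combined to extract a block of lines from the middle of a file. The sketch below uses a made-up eight-line file (lines.txt is a hypothetical name). Note also that the negative count in head -n -30 is a GNU extension, so it may not work with other implementations of head.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# Hypothetical eight-line file
printf '%s\n' line1 line2 line3 line4 line5 line6 line7 line8 > "$tmp/lines.txt"

# Lines 3 to 5: keep the first 5 lines, then the last 3 of those
middle=$(head -n 5 "$tmp/lines.txt" | tail -n 3)
printf '%s\n' "$middle"
rm -r "$tmp"
```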

Translate chars with tr

tr helps you change any occurrence of a character into another (or delete it, with -d):

Translating Windows end-of-lines into UNIX ones (Windows lines end with \r\n, so the \r is simply deleted)
    cat Win_formatted_file | tr -d '\r' > UNIX_formatted_file

Warning! tr only processes its standard input!

But tr also comes in handy to change separators in a CSV file:

Translating semicolons into tabulations
    cat example_mj.txt | tr ';' '\t'
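Since tr reads only its standard input, the file can also be fed to it with the < redirection instead of cat. The sketch below chains both conversions on a made-up Windows-formatted file (win.csv and its content are assumptions): the \r of each \r\n line ending is deleted, then semicolons are turned into tabulations.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# Hypothetical semicolon-separated file with Windows (\r\n) line endings
printf 'a;b\r\nc;d\r\n' > "$tmp/win.csv"

# tr -d '\r'   : delete every carriage return
# tr ';' '\t'  : translate every semicolon into a TAB
unix_tsv=$(tr -d '\r' < "$tmp/win.csv" | tr ';' '\t')
printf '%s\n' "$unix_tsv"
rm -r "$tmp"
```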
