UNIX, GNU/Linux and simple tools for data manipulation
Dr Jean-Baka DOMELEVO ENTFELLNER
BecA-ILRI Hub
Basic Bioinformatics Training Workshop @ILRI Addis Ababa Wednesday December 13th 2017
Outline
1 UNIX & GNU/Linux: brief history and introduction
2 Using the Bash shell
  - Your first commands
  - Filesystems and permissions
  - Bash special characters and features
  - Quoting in Bash
3 So many tools
  - You CANNOT live without your man
  - Data manipulation commandline tools
UNIX & GNU/Linux: introduction
GNU/Linux is an operating system (OS). It belongs to a broad family of OSes, the UNIX family.
Operating system: definition
- the single interface between the computer (hardware) and the different programs (software) that users run on it
- allows different programs and different users to use the same machine concurrently
- implements a filesystem, a console environment, a graphical environment, drivers for keyboard and mouse, etc.
- examples of operating systems: Windows (Microsoft), Mac OS X (Apple), Android (Google), GNU/Linux, FreeBSD, etc.
“Linux” is only the kernel of GNU/Linux systems, responsible for granting access to the resources on the host and for time-sharing between processes.
UNIX & GNU/Linux systems: timeline
GNU/Linux: a fairly recent member of an old and huge family (see http://www.levenez.com/unix/)
- 1969: UNICS
- 1971: UNIX Time-Sharing System V1
- 1982: SunOS 1.0
- 1983: UNIX System V
- 1991: GNU project (GNU/Hurd); Linux 0.01
- 1994: Linux 1.0
- 1999: Darwin 0.1; Mac OS X Server 1.0
- 2008: Android 1.0 (derived from Linux 2.6.23)
- 2013: Linux 3.9
Linux distributions: different flavours of the same OS
The GNU/Linux operating system comes in different distributions. Three distributions have long served as reference points and have given rise to many descendants:
1 Debian (1993) ⇒ Ubuntu, 2004 and Linux Mint, 2010
2 Slackware (1993), from SLS (1992) ⇒ SuSE, 1998
3 RedHat (late 1994) ⇒ CentOS and Fedora, both 2003
For a full account, see http://futurist.se/gldt
What makes UNIX systems superior to the Windows family
- UNIX gives you more control over your computer (no hidden actions, no undesired pieces of software).
- UNIX environments are far less affected by viruses, although they are not immune.
- UNIX enables you to harness the full computational power of your machine.
- UNIX systems have been designed from their origin to be massively multi-user and multi-process systems.
- UNIX systems are generally regarded as more secure than Windows systems.
Take-home message: the true power of UNIX (and so of GNU/Linux) lies in its commandline interface.
Bash: a shell environment
Bash is the most popular shell environment on GNU/Linux systems. Its name stands for "Bourne Again Shell".
Shell environments are designed to:
- interact with the host filesystem (browse and create directories, see the content of files, etc.),
- interact with the installed software (install, run, etc.),
- log in to remote hosts (telnet, ssh),
- perform all of the above through automated processes ⇒ scripts.
Shells are at the same time commandline environments (run one command at a time) and scripting environments (write and run scripts).
On most GNU/Linux distributions, Bash is accessible through the "Terminal" icon.
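To illustrate the scripting side: commands that could be typed one at a time can also be stored in a file and run in one go. The following is only a minimal sketch; the script name (list_txt.sh), the file names and the throwaway directory created with mktemp are assumptions made for the illustration, not part of the course material.

```shell
# Work in a throwaway directory so nothing on the host is modified.
workdir=$(mktemp -d)
cd "$workdir"
touch one.txt two.txt notes.md

# Store a command in a script file (hypothetical name); the same
# command could just as well be typed at the prompt.
cat > list_txt.sh <<'EOF'
#!/bin/bash
# List the .txt files of the current directory.
ls *.txt
EOF

# Run the script through Bash and keep its output.
script_output=$(bash list_txt.sh)
echo "$script_output"
```

Running the script with "bash list_txt.sh" avoids having to make the file executable first.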
Standard structure of a UNIX command
Synopsis of a command
For example:
- ls (only the command)
- ls -l (command plus an option)
- ls -l -h h3a* (command, two options and one object)
- ls -lh h3a* (single-letter options can be concatenated)
- cp one two (command and two objects)
- man head (command and one object)
- head -n 2 one (an option with a value)
- head --lines=2 one (same command, GNU-style long option)
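The command forms listed above can be tried safely in a scratch directory. This is a sketch under assumptions: the file names one and two follow the examples, but their content is made up, and the long option --lines is assumed to be available (it is provided by GNU head, not by every UNIX).

```shell
# Throwaway directory; the content of the file "one" is made up.
workdir=$(mktemp -d)
cd "$workdir"
printf 'line 1\nline 2\nline 3\n' > one

cp one two                          # command and two objects
short_form=$(head -n 2 one)         # an option with a value
long_form=$(head --lines=2 two)     # same request, long option (GNU head)
echo "$short_form"
```

Both forms should return the same two lines, since two is a copy of one.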
UNIX filesystems
Filesystems are hierarchies. The filesystem of a UNIX machine is standardized. Under the root (/) are:
- /bin → essential command binaries
- /boot → static files of the boot loader
- /dev → device files (special files to access your devices)
- /etc → host-specific system configuration files
- /home → user home directories (e.g. /home/peter, /home/sarah, etc.)
- /lib → essential shared libraries and kernel modules
- /media → mount point for removable media (e.g. CD-ROMs & flash disks)
- /mnt → “old-style” mount point for any media
- /tmp → system-wide temporary folder, writable by anyone
File permissions
Three (four) types of rights:
- right to read from a file (r)
- right to write to it (w)
- right to execute a binary file or a script (x)
- right to traverse a directory (x)
Three types of people:
- the owner of the file (u)
- the other members of the file's group (g)
- the rest of the world, the others (o)
Typical line of output from ls -l:
-rw-r--r-- 1 jbde jbde 171104 juil. 6 12:48 awk.dvi
File permissions explained
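In the output of ls -l, the first character gives the file type (- for a regular file, d for a directory); the next nine characters are three rwx triplets, for the owner (u), the group (g) and the others (o), in that order. The sketch below changes permissions with chmod and reads them back; the script name hello.sh is hypothetical and everything happens in a throwaway directory.

```shell
# Throwaway directory and a hypothetical script name (hello.sh).
workdir=$(mktemp -d)
cd "$workdir"
printf '#!/bin/sh\necho hello\n' > hello.sh

# Owner may read and write; group and others may only read.
chmod u=rw,g=r,o=r hello.sh
before=$(ls -l hello.sh | cut -c 1-10)
echo "$before"

# Give the owner (u) the right to execute (x).
chmod u+x hello.sh
after=$(ls -l hello.sh | cut -c 1-10)
echo "$after"
```

Only the fourth character changes between the two listings, from - to x: it is the execute right of the owner.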
Why it is often necessary to quote strings or escape chars
Some characters have a special meaning for the tools you use, e.g. the commandline interpreter Bash:
- spaces or tabs are logical separators between elements on the commandline: cd /tmp
- a dollar sign introduces Bash variables: echo $PATH
- a star is a wildcard that, on its own, means “all the files”: cat *
- the “greater than” sign is interpreted as a redirection: cat * > listing.txt
- the “vertical bar” pipes the output of one command into the input of another: grep h3a long_course.htm | wc -l
- ...
⇒ escaping or quoting prevents these characters from being interpreted by the shell.
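A short experiment makes the effect of these special characters visible, together with what happens once they are escaped or quoted. This is a sketch only: the variable name sample and the file names are invented for the illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'alpha\n' > a.txt
printf 'beta\nh3a\n' > b.txt
sample='some value'          # hypothetical variable, for illustration only

expanded=$(echo $sample)     # the shell replaces $sample by its value
literal=$(echo \$sample)     # escaped: the dollar sign stays as it is
all_files=$(echo *)          # the star is replaced by the matching filenames
star=$(echo '*')             # quoted: the star stays a plain character

cat *.txt > listing                       # redirection: output goes to a file
n_matches=$(grep h3a listing | wc -l)     # pipe: grep output feeds wc
echo "$expanded | $literal | $all_files | $star | $n_matches"
```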
Escaping a single character
In Unix shells, a backslash (\) escapes the character that follows it:
> echo $PATH
/home/jbde/bin:/usr/local/bin:/usr/bin:/bin
> echo \$PATH
$PATH
And if a filename contains spaces, e.g. named with spaces.txt:
> cat named with spaces.txt
cat: named: No such file or directory
cat: with: No such file or directory
cat: spaces.txt: No such file or directory
> cat named\ with\ spaces.txt
Strong quoting with single quotes
You can also quote a string to prevent the spaces it contains from being interpreted:
> cat 'named with spaces.txt'
Generally speaking, single quotes do not allow any kind of interpretation/substitution/expansion.
> echo 'Your PATH variable contains $PATH'
Your PATH variable contains $PATH
“Weak quoting” with double quotes
While preventing the spaces they contain from being interpreted, double quotes allow the expansion of Bash variables:
> cat "named with spaces.txt"
Using Bash every day
Bash has nice features you should use to work efficiently:
- the history of previous commands (browse with ↑, ↓, Ctrl+R)
- autocompletion with the Tab key
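As a recap of this part, the three protection mechanisms (backslash, single quotes, double quotes) can be compared side by side. This is a minimal sketch, assuming a hypothetical variable named sample and a made-up file content; only the file name comes from the examples above.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'content\n' > 'named with spaces.txt'
sample='a value'             # hypothetical variable, for illustration only

# Three equivalent ways to pass the filename as a single argument.
with_backslash=$(cat named\ with\ spaces.txt)
with_single=$(cat 'named with spaces.txt')
with_double=$(cat "named with spaces.txt")

# Single quotes keep $sample literal; double quotes expand it.
strong=$(echo 'sample contains $sample')
weak=$(echo "sample contains $sample")
echo "$strong"
echo "$weak"
```

A reasonable habit is to use double quotes around variables that may contain spaces, and single quotes when nothing at all should be interpreted.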
Asking for help on a command: man
This is the absolute basic command, to learn first!
    man ls
To browse within the manpage:
Sectioning of a manpage
Manpages are all written using the same format/sectioning:
1 NAME: the name of the command
2 SYNOPSIS: the syntax of the command (sometimes several lines to describe several ways of using the command)
  - square brackets ([...]) indicate optional components
  - pipes (|) within a construct separate alternatives
  - an ellipsis (...) usually indicates that the previous object is repeatable
3 DESCRIPTION and OPTIONS: meaning and behaviour of the different options and objects to give on the commandline
4 EXAMPLES: the most useful section, provides real-world examples along with some explanation of what they do
5 EXIT STATUS: useful in scripts, to check automatically whether the command execution produced an error
6 SEE ALSO: also useful when you don’t know the exact name of a command but know a similar/sister one (e.g. uniq and join are cross-referenced)
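The EXIT STATUS section matters as soon as a command is used inside a script. As an illustration, grep documents a status of 0 when at least one line matched and 1 when none did. The sketch below relies on that; the file name notes.txt and its content are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'h3a sample\nother line\n' > notes.txt    # hypothetical file

# Record the exit status of grep: 0 if a line matched, 1 if none did.
status_found=0
grep -q h3a notes.txt || status_found=$?
status_missing=0
grep -q zzz notes.txt || status_missing=$?
echo "found: $status_found, missing: $status_missing"

# In a script, the status can drive a decision directly.
if grep -q h3a notes.txt; then
    verdict='pattern present'
else
    verdict='pattern absent'
fi
echo "$verdict"
```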
Reading files: cat and less
cat
- writes the full content of the file(s) to the standard output
- can concatenate several files: cat FILE1 FILE2 > FILE3
- is non-interactive: prints all and quits
less
- is a pager
- writes the full content of the file(s) to the standard output, one page at a time
- several files are processed one after the other: less FILE1 FILE2 and then :n (next) and :p (previous) to browse
- is fully interactive
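The concatenation performed by cat is easy to check on two tiny files. A minimal sketch; the file names follow the placeholders used above and their content is made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'first\n' > FILE1
printf 'second\n' > FILE2

# Concatenate both files, in the order given, into a third one.
cat FILE1 FILE2 > FILE3
combined=$(cat FILE3)
echo "$combined"
```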
Count the number of chars, words or lines: wc
wc stands for "word count"
- wc -l FILE → number of lines
- wc -c FILE → number of bytes (≈ chars)
- wc -w FILE → number of words
- wc -L FILE → length of the longest line in the file
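A quick way to get a feeling for these counters is to run them on a file small enough to count by hand. In this sketch the file name sample.txt and its content are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'one two three\nfour\n' > sample.txt

# Reading from standard input keeps the file name out of the output.
n_lines=$(wc -l < sample.txt)
n_words=$(wc -w < sample.txt)
n_bytes=$(wc -c < sample.txt)
echo "$n_lines lines, $n_words words, $n_bytes bytes"
```

The byte count includes the two newline characters, which is why it is larger than the number of visible characters.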
Select columns from a file: cut
Simplified syntax cut -f
be sure you quote the delimiter, e.g. ';'
Example: select fields 2 and 5 from a semicolon-separated file
    cut -f 2,5 -d ';' cut_example.csv
Example: specify the output separator
    cut -f 1-3 -d ';' --output-delimiter=$'\t' cut_example.csv
Example: extract only the first three characters of each line
    cut -c 1-3 cut_example.csv
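The three uses of cut above can be reproduced on a two-line file. This is a sketch with made-up data (demo.csv is not the course file); the option --output-delimiter is assumed to be available, as it is in GNU cut.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Made-up data: five semicolon-separated fields per line.
printf 'id;name;city;year;score\n1;Ada;Addis;2017;9\n' > demo.csv

fields_2_5=$(cut -f 2,5 -d ';' demo.csv)
first_chars=$(cut -c 1-3 demo.csv)
# --output-delimiter is a GNU cut option; here fields are re-joined with tabs.
tab=$(printf '\t')
retabbed=$(cut -f 1-3 -d ';' --output-delimiter="$tab" demo.csv)
echo "$fields_2_5"
echo "$first_chars"
echo "$retabbed"
```

Storing the tab in a variable with printf is a portable alternative to the Bash-specific $'\t' notation.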
Sort a file according to some rules: sort
sort sorts text files according to the content of some fields, called keys.
Example: sorting lines alphabetically
    sort cut_example.csv
But it is usually better to control explicitly the way sort sorts.
Example: sort on a key running from the 2nd to the 3rd field (semicolon-separated fields)
    sort -t ';' -k 2,3 cut_example.csv
Example: sort numerically (-n) according to the 9th field only
    sort -t ';' -n -k 9,9 cut_example.csv
    # to check results:
    sort -t ';' -n -k 9,9 cut_example.csv | cut -f 9 -d ';'
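The difference between a textual and a numerical key shows up on very small data. In this sketch the file fruits.csv and its content are made up, and LC_ALL=C is set only to make the ordering reproducible from one machine to another.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Made-up data: name;count
printf 'pear;10\napple;9\nfig;100\n' > fruits.csv

# LC_ALL=C makes the ordering independent of the local language settings.
by_name=$(LC_ALL=C sort -t ';' -k 1,1 fruits.csv | cut -f 1 -d ';')
as_text=$(LC_ALL=C sort -t ';' -k 2,2 fruits.csv | cut -f 2 -d ';')
as_number=$(LC_ALL=C sort -t ';' -n -k 2,2 fruits.csv | cut -f 2 -d ';')
echo "$as_text"
echo "$as_number"
```

Compared as text, 100 comes before 9 because the comparison proceeds character by character; with -n the values are compared as numbers.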
sort, continued
-g option to sort numerical fields containing scientific notation:
    sort -k 2,2 -n with_sci_notation   # unexpected result
    sort -k 2,2 -g with_sci_notation   # GOOD!
WARNING!! sort relies heavily on your locale setting! Try:
    LC_ALL=fr_FR.utf8 sort -k 2,2 -g with_sci_notation
One-letter sorting options can be appended to a key as flags, and several keys can be specified:
Ascending order on the 5th field, then descending on the 6th, then alphabetically on the 1st field
    sort -k 5,5g -k 6,6nr -k 1,1 hmmsearch_raw_output | less -S
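To see the difference between -n and -g, here is a small sketch on invented data. The file evalues.txt and the hit names are assumptions for illustration, not the workshop's with_sci_notation file. LC_ALL=C is set so that the decimal point is read the same way on every machine.

```shell
tmpdir=$(mktemp -d)
# Hypothetical two-column file: identifier, e-value
cat > "$tmpdir/evalues.txt" <<'EOF'
hitA 3e-10
hitB 2e-50
hitC 1.5e-3
hitD 7e-20
EOF

# -n stops reading each number at the "e": it compares 3, 2, 1.5 and 7.
with_n=$(LC_ALL=C sort -k 2,2n "$tmpdir/evalues.txt" | cut -d ' ' -f 1 | paste -sd ' ' -)
echo "with -n: $with_n"

# -g parses the whole floating-point value, exponent included.
with_g=$(LC_ALL=C sort -k 2,2g "$tmpdir/evalues.txt" | cut -d ' ' -f 1 | paste -sd ' ' -)
echo "with -g: $with_g"

rm -r "$tmpdir"
```

With -g the smallest e-value (hitB) comes first, whereas -n ranks the lines by the mantissa alone.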
sort, some caveats
WARNING!! By default, sort separates fields on blank to non-blank transitions. ⇒ Be careful with empty fields! One should specify the delimiter.
A precise delimiter to prevent sort from “merging delimiters”
    sort -k 11,11 -t $'\t' CDS_top_100.txt | cut -f 11 | less
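The effect of an empty field can be reproduced on two lines. This is only a sketch: the file gaps.tsv is invented, and the tab character is built with printf so that the snippet also runs in shells other than Bash (in Bash, $'\t' is equivalent).

```shell
tab=$(printf '\t')
tmpdir=$(mktemp -d)
# Two tab-separated lines with three fields each; the 2nd field of line "a" is empty.
printf 'a\t\tz\nb\ty\tx\n' > "$tmpdir/gaps.tsv"

# Default splitting: the two consecutive tabs of line "a" count as one separator,
# so that line has no 3rd field and its key is empty.
default_order=$(LC_ALL=C sort -k 3,3 "$tmpdir/gaps.tsv" | cut -f 1 | paste -sd ' ' -)
echo "default splitting: $default_order"

# Explicit tab delimiter: the 3rd fields are really "z" and "x".
tab_order=$(LC_ALL=C sort -t "$tab" -k 3,3 "$tmpdir/gaps.tsv" | cut -f 1 | paste -sd ' ' -)
echo "with -t TAB: $tab_order"

rm -r "$tmpdir"
```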
join lines of two files sharing a common field
join allows you to perform the relational join operation on two files.
Example: “I want to select the lines of FILE2 whose 11th field corresponds to an entry in FILE1.”
    join -1 1 -2 11 -t $'\t' dg_top_100.txt CDS_top_100.txt
WARNING!! join expects both files to be already sorted on the join field!
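A possible way to respect the sorting requirement is to sort both inputs on the join field first, with the same locale for sort and join. The sketch below uses two invented files (ids.txt and table.tsv), not the workshop's data.

```shell
tab=$(printf '\t')   # in Bash, $'\t' is equivalent
tmpdir=$(mktemp -d)
# ids.txt: identifiers of interest (hypothetical)
printf 'g3\ng1\n' > "$tmpdir/ids.txt"
# table.tsv: name<TAB>identifier (hypothetical)
printf 'alpha\tg2\nbeta\tg1\ngamma\tg3\n' > "$tmpdir/table.tsv"

# Sort each file on its own join field.
LC_ALL=C sort -k 1,1 "$tmpdir/ids.txt" > "$tmpdir/ids.sorted"
LC_ALL=C sort -t "$tab" -k 2,2 "$tmpdir/table.tsv" > "$tmpdir/table.sorted"

# Join field 1 of the first file with field 2 of the second.
joined=$(LC_ALL=C join -t "$tab" -1 1 -2 2 "$tmpdir/ids.sorted" "$tmpdir/table.sorted")
echo "$joined"

rm -r "$tmpdir"
```

join prints the join field first, followed by the remaining fields of each file, so the column order of the output differs from that of the second input.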
Produce only the n last lines of a file: tail
Convenient to cut parts you are not interested in, for instance because:
- the final lines of a log file contain the error that matters to you
- the header (first few lines) of the file is of no interest for the next tool in the pipeline
- the file is sorted and the last lines contain the samples of interest: you set a cutoff
Produce the last 30 lines of a file
    tail -n 30 input_file
or simply:
    tail -30 input_file
Produce all the lines from the 30th onwards
    tail -n +30 input_file
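As an illustration of the header case, the sketch below drops the first line of a small invented file (data.csv) with tail -n +2, and also keeps only its last two lines.

```shell
tmpdir=$(mktemp -d)
# Hypothetical file: one header line followed by three records
printf 'id;value\ns1;10\ns2;20\ns3;30\n' > "$tmpdir/data.csv"

# Keep the last 2 lines only.
last_two=$(tail -n 2 "$tmpdir/data.csv" | paste -sd ' ' -)
echo "$last_two"

# Start output at line 2, i.e. skip the one-line header.
body=$(tail -n +2 "$tmpdir/data.csv" | paste -sd ' ' -)
echo "$body"

rm -r "$tmpdir"
```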
Symmetrical to tail: head
Produce the first 30 lines of a file
    head -n 30 input_file
or simply:
    head -30 input_file
Produce all but the last 30 lines
    head -n -30 input_file
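head and tail can also be chained to extract a range of lines. The sketch below works on an invented file of ten numbered lines. It assumes GNU head, since a negative count after -n may not be accepted by other implementations.

```shell
tmpdir=$(mktemp -d)
seq 1 10 > "$tmpdir/numbers.txt"   # ten lines: 1, 2, ..., 10

# First 3 lines.
first_three=$(head -n 3 "$tmpdir/numbers.txt" | paste -sd ' ' -)
echo "$first_three"

# All but the last 3 lines (GNU head).
all_but_last_three=$(head -n -3 "$tmpdir/numbers.txt" | paste -sd ' ' -)
echo "$all_but_last_three"

# Lines 4 to 6: keep the first 6 lines, then the last 3 of those.
middle=$(head -n 6 "$tmpdir/numbers.txt" | tail -n 3 | paste -sd ' ' -)
echo "$middle"

rm -r "$tmpdir"
```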
Translate chars with tr
tr helps you change any occurrence of a character into another, or delete it with -d:
Translating Windows end-of-lines into UNIX ones
    cat Win_formatted_file | tr -d '\r' > UNIX_formatted_file
(Windows lines end with '\r\n', so translating '\r' into '\n' would add an empty line after every line; deleting '\r' is what is needed.)
Warning! tr only processes its standard input!
But tr also comes in handy to change separators in a CSV file:
Translating semicolons into tabulations
    cat example_mj.txt | tr ';' '\t'
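The sketch below, on an invented two-line file with Windows line endings (win.csv), compares translating the carriage return with deleting it, and then changes the separator. Since tr reads only its standard input, the file is fed with a redirection, which does the same job as cat file | tr.

```shell
tmpdir=$(mktemp -d)
# Hypothetical file with Windows (CR LF) line endings
printf 'a;b\r\nc;d\r\n' > "$tmpdir/win.csv"

# Translating CR into LF turns every CR LF pair into two newlines.
translated_lines=$(tr '\r' '\n' < "$tmpdir/win.csv" | wc -l | tr -d ' ')
echo "lines after tr '\\r' '\\n': $translated_lines"

# Deleting CR keeps one newline per line.
deleted_lines=$(tr -d '\r' < "$tmpdir/win.csv" | wc -l | tr -d ' ')
echo "lines after tr -d '\\r': $deleted_lines"

# Remove the CRs, then turn semicolons into tabs.
unix_tsv=$(tr -d '\r' < "$tmpdir/win.csv" | tr ';' '\t')
echo "$unix_tsv"

rm -r "$tmpdir"
```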