
● Log into the Moodle site

● Enter the “Lecture 8” area (button 9)

● At 14:00, choose “Daily Quiz 6”

● Answer the multiple-choice quiz (you have 10 min to finish)

Text manipulation: sorting, cutting, pasting, joining, subsetting, …

J. M. P. Alves

Laboratory of Genomics & Bioinformatics in Parasitology Department of Parasitology, ICB, USP Editing

● With everything seen so far, we can view the text and even change an output copy of it a little bit sometimes, but… how about real editing?

● There are lots of text editors for Linux

● There is even rivalry and much teasing between users of one editor (vi) and another (emacs) – I myself much prefer emacs

● These are very powerful editors, heavily tailored for programmers but used by many kinds of users

J.M.P. Alves 3 / 43 BMP0260 / ICB5765 / IBI5765 Editing

● For that reason, and being command-line based (no mouse or menus for you), these programs are very complex and hard to learn

● Fret not, for there is a much simpler, and often good enough, alternative: nano

nano file

● nano is a small, free, and friendly editor (from the GNU Project) that aims to replace Pico, the default editor included in the non-free Pine package

● (Pine was a command-line email client that included a text editor called Pico; since they were not free software, they could not be included in GNU and other such free systems, so nano was written)

nano

● As the SI prefixes imply, nano is much more than pico, with many more functions and capabilities

● Although it lives in the CLI, nano is sort of window-based (still little to no use for the mouse, though)

● It displays a helpful bar at the bottom of the window with the most-used commands

● nano commands are usually a combination of Ctrl and some other key

nano ~jmalves/bin/average

nano -Y perl ~jmalves/bin/average

nano

● nano is to be used like any simple text-only editor, such as Notepad (in Windows) or Gedit (in Linux)

● Therefore, there are no text decorations (such as italics, bold, different fonts etc.) available

● The main nano commands are:

Ctrl+x: exit (offering to save)
Ctrl+o: save file
Ctrl+k: cut one or more whole lines
Ctrl+u: paste lines cut by Ctrl+k
Ctrl+w: search
Ctrl+g: display the built-in help

Exploring files

● We have already seen a useful command to explore the contents of a text file: wc (for word count)

● wc will tell you some basic statistics about the text file

● Those are:

● Number of lines
● Number of bytes (or, alternatively, characters)
● Number of words

● A word is (from the man page) a non-zero-length sequence of characters delimited by white space; so fdh236 is considered a word

● By default, wc gives us lines, words, and bytes; we can change that, of course:

wc -l some_text_file
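A minimal runnable sketch of those options (the sample file and its contents are made up here, not one of the course's /data files):

```shell
# Create a small three-line sample file to count things in
printf 'one fish\ntwo fish\nred fish blue fish\n' > sample.txt

wc sample.txt      # default: lines, words, and bytes
wc -l sample.txt   # number of lines only
wc -w sample.txt   # number of words only
wc -c sample.txt   # number of bytes only
```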

A unique challenge

● Often you want to see only the unique lines of a file, omitting repeated lines

● In other cases, you want the opposite: see only what occurs more than once

● The uniq command is a simple utility designed to do exactly that

● uniq takes a sorted text file (we will learn sorting in the next lecture)

● Only one instance of the repeated lines will be shown

● uniq is also a filter, like most Unix commands: takes STDIN, sends to STDOUT
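A quick illustration of uniq as a filter, using inline sample data rather than the course files:

```shell
# Sample data with repeated lines (made up for this sketch)
printf 'apple\nbanana\napple\ncherry\nbanana\napple\n' > fruits.txt

sort fruits.txt | uniq      # one copy of each line
sort fruits.txt | uniq -d   # only lines that occur more than once
sort fruits.txt | uniq -u   # only lines that occur exactly once
sort fruits.txt | uniq -c   # each line prefixed by its count
```

Note that uniq only collapses *adjacent* repeats, which is why the input is sorted first.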

● But it can also be given file names directly (input, then output). Examples:

uniq some_file filtered_file
cat sorted_file | uniq -d

Quiz time!

Go to the course page and choose Quiz 21

Now you do it!

Go to the course site and enter Practical Exercise 22

Follow the instructions to answer the questions in the exercise

Remember: in a PE, you should do things in practice before answering the question!

Can't make heads or tails of any of this?

● The next two commands are also great ways to quickly explore text file content

● To see just the beginning of a file, use the head command

● By default, head shows us the first ten lines of the data

● head can also show us the first X bytes (or kilobytes, or megabytes etc.) of a file

● ...or all lines (or bytes) except for a certain amount of the final ones

● head is a filter, like most Unix commands: takes STDIN, sends to STDOUT
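The points above can be tried out with inline data (sample file made up here; the negative -n count is a GNU head feature):

```shell
seq 1 100 > numbers.txt   # 100 lines: 1, 2, ..., 100

head numbers.txt          # first ten lines (the default)
head -n 3 numbers.txt     # first 3 lines
head -c 8 numbers.txt     # first 8 bytes
head -n -95 numbers.txt   # everything except the last 95 lines (GNU head)
```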

● Examples:

head some_file
ls -lS /usr/bin/ | head

Can't make heads or tails of any of this?

● If head shows us the first X lines or bytes, guess what shows us the last lines or bytes...

● That would be the tail command, of course; it is very similar to head, including most options

● By default, tail shows us the last ten lines of the data

● tail can also show us the last X bytes (or kilobytes, or megabytes etc.) of a file

● ...or all lines (or bytes) except for a certain amount of the first ones

● tail is a filter, like most Unix commands: takes STDIN, sends to STDOUT
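And the mirror-image sketch for tail (same made-up sample file as for head; the `-n +K` form means "starting from line K"):

```shell
seq 1 100 > numbers.txt   # 100 lines: 1, 2, ..., 100

tail numbers.txt          # last ten lines (the default)
tail -n 3 numbers.txt     # last 3 lines
tail -n +96 numbers.txt   # from line 96 to the end, i.e., skip the first 95
```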

● Examples:

tail some_file
ls -lS /usr/bin/ | tail

Getting remote data

● We have already used a command-line tool, called wget, to download data from the Web to our local computer

● This program does non-interactive download of files, and it supports HTTP, HTTPS, and FTP protocols

● At its most basic, you just give wget the exact address of the file you want:

wget http://www.google.com

● wget can also do recursive downloading: you tell it where to start and how deep it should go, and it downloads all files it finds

● It is also possible to specify certain patterns, like prefixes or file extensions, to accept (ignoring all else) or reject (getting all else)

Getting remote data

● In other cases, we want to transfer data from (or to) another computer that is not a Web server accessible through the browser or an FTP client

● In that case, scp (for secure copy) is the main tool for the job – as long as the OpenSSH packages are installed!

● This program behaves a lot like cp, and it can even perform copies within the same computer

● Differently from wget, scp is bidirectional: you can either get data from a remote computer or send data to a remote computer

● Another difference is that with scp you need a user account on the remote computer

● Like cp, scp does not copy directories by default

Getting remote data

● Given its added complexity, scp uses a slightly different command structure:

scp user1@remote_computer:path_to_file local_path

● This command will get a file (path_to_file) from a remote computer (remote_computer), using user1 as the user name – scp will ask for this user’s password – and saving the file in the path and/or file name local_path

● To do the opposite and send a file to a remote computer, just invert the order of the arguments:

scp local_path user1@remote_computer:path_to_file

● If you just want to send local_path to your $HOME on the remote computer, the path_to_file part is not needed (just leave the command empty after the : character)

● Example: scp [email protected]:hello.c .

Quiz time!

Go to the course page and choose Quiz 23

Inverse

● Last lecture we learned about viewing text with the cat (for concatenate) command

● cat sends the contents of one or more text files (or STDIN) to STDOUT

● But what if you want to get the contents from each file starting from the end instead of the beginning?

● Then, you invert cat and use the tac command
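A tiny runnable comparison (inline sample file, made up for this sketch):

```shell
# Three lines in a known order
printf 'first\nsecond\nthird\n' > order.txt

cat order.txt   # prints first, second, third
tac order.txt   # prints third, second, first
```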

Inverse cat

cat file

Program version: 1.0.12
Date: Thu May 4
Seed: ./18s
Seed type: DNA
Database: db/dma97
Database type: fastq
Duplicate headers name: yes
Expansion direction: both
Assembler used: abyss-pe
Number of threads: 4

tac file

Number of threads: 4
Assembler used: abyss-pe
Expansion direction: both
Duplicate headers name: yes
Database type: fastq
Database: db/dma97
Seed type: DNA
Seed: ./18s
Date: Thu May 4
Program version: 1.0.12

Getting columns

● Tabular files frequently contain much more data than what we are actually interested in

● It is thus sometimes useful to be able to get at just one or a few columns of the data

● The cut command allows us to do exactly that

● cut uses the TAB character as the default column delimiter

● Of course, there is an option to change that

● Another option tells the program which column(s) to retrieve
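A minimal sketch of those two options, using inline sample tables (not the course's /data files):

```shell
# Tab-delimited table: name<TAB>count
printf 'avocado\t12\nlime\t4\napple\t5\n' > table.tsv
# The same data, comma-delimited
printf 'avocado,12\nlime,4\napple,5\n' > table.csv

cut -f 1 table.tsv        # first column; TAB is the default delimiter
cut -f 2 table.tsv        # second column
cut -d , -f 2 table.csv   # -d changes the delimiter to a comma
```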

Getting columns

● Let’s try!

● In the remote server, run: cut -f 1 /data/column_example

● Notice that only the first column (tab-delimited!) of the file was sent to STDOUT

● Now try:

cut -f 1 /data/column_example2
cut -f 1 -d , /data/column_example2

Getting columns

cut -f 1 /data/column_example2
cut -f 1 -d , /data/column_example2

● In the first instance, nothing got cut, since there was no line containing a tab character (fields are separated by commas, in that file)

● In the second instance, we changed the field delimiter to comma (,)

Quiz time!

Go to the course page and choose Quiz 24

Sorting data

● One of the most basic, and useful, things a computer is used for is sorting data – many complex data analysis algorithms depend on sorted data

● The sort command is the command-line tool that we can use for that

● sort uh… sorts the contents of STDIN or one or more files and sends the resulting data to STDOUT

● Notice that it said “one or more” files! We can merge the contents of different files while sorting

Sorting data

● By default, sorting is performed on the content of the whole line

● But there are many options that allow us to change sorting behavior

● Besides sorting, sort can also be used to just check whether some data is sorted (it will say nothing if the data is sorted, or point out the “disorder” if not)

● There are different kinds of sorting available, e.g., numerical, alphabetical

● By default, the sort command does case-sensitive lexical sorting

● Sort fields are separated by blank space (spaces or TABs), but that can be changed using an option
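The default behavior and the sortedness check can be sketched with inline data (sample file made up here):

```shell
# Deliberately unsorted sample data
printf 'banana\napple\ncherry\n' > words.txt

sort words.txt                         # case-sensitive lexical sort of whole lines
sort -r words.txt                      # same, in descending order
sort -c words.txt || echo not sorted   # -c only checks; nonzero exit (and a
                                       # "disorder" message) if out of order
```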

Sorting data

● Let’s try! In the remote server, as usual…

● First, let’s look at the contents of file /data/column_example as they are

● Then, sort the file’s contents: sort /data/column_example

● That is the simplest sort command: uses the whole line, default delimiter, default search type

Sorting data

Before:

avocado 12
lime 4
apple 5
banana 3
orange 6
date 20

After:

apple 5
avocado 12
banana 3
date 20
lime 4
orange 6

● But what if we want the results to be in descending order?

● Or ordered using different fields (columns), instead of the whole line?

● Different field delimiter than white-space?

● Or using numerical comparison?

Sorting data

● There are different kinds of sorting available, e.g., numerical, alphabetical

● Let’s try sorting numerically now (using the second column):

sort -k 2 -n /data/column_example

● Option -k determines the column(s) to use in sorting

Sorting data

● As used above, we are telling the program to use column 2 and whatever comes afterwards; but there are many other ways:

● -k 2,2 : use just column 2

● -k 3,7 : use columns 3, 4, 5, 6, and 7

● -k 4n,5d : sort by columns 4 (numerically) and 5 (dictionary order)

● -k 1,1n -k 2,2M : sort first by column 1 (numerically), then break any ties by sorting column 2 (by month name, like JAN, FEB etc.)
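A runnable sketch of per-column numeric sorting, using inline data like the slide's fruit example (file name made up here):

```shell
# name count pairs, blank-separated
printf 'avocado 12\nlime 4\napple 5\nbanana 3\n' > fruit.txt

sort -k 2,2 -n fruit.txt      # numeric sort on column 2 only
sort -k 2,2 -n -r fruit.txt   # same, in descending order
```

Without -n, column 2 would be compared lexically and 12 would sort before 3.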

Quiz time!

Go to the course page and choose Quiz 25

Now you do it!

Go to the course site and enter Practical Exercise 23

Follow the instructions to answer the questions in the exercise

Remember: in a PE, you should do things in practice before answering the question!

Shuffling data

● We have learned how to sort data to get it in order

● After that is done, there is no way to undo it, of course; the original order is lost (unless you kept the original file, obviously)

● But we can do something else to get data out of sorted order: shuffle

● That is done by the shuf command:

shuf /data/shuf_example

● shuf works generally as a “randomness generator”

Shuffling data

● For example:

● shuf -i 0-9 : print digits 0 to 9, one per line, randomly

● shuf -i 0-9 -r -n 50 : print 50 digits (0 to 9) randomly

● shuf -e heads tails -r -n 50 : simulate 50 coin tosses
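Those examples run as-is; a smaller variant (smaller counts than the slide's, just to keep the output short):

```shell
shuf -i 0-9                   # digits 0 to 9, one per line, in random order
shuf -i 0-9 -n 3              # 3 distinct digits, randomly chosen
shuf -e heads tails -r -n 5   # 5 simulated coin tosses (-r allows repetition)
```

Note that without -r each input value appears at most once, which is why simulating 50 tosses from only two values needs -r.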

Deleting parts of the data

● The Unix program colrm allows us to delete “columns” from the data (from STDIN)

● Here, a “column” is defined as a single character, not a column in the sense we are used to when dealing with tables!

● For example, the line: acbce 12345 example

(in that line, “a” is column 1, “c” is column 2, “b” is column 3, the first space is column 6, and the final “e” is column 19)

Deleting parts of the data

● The exception here is TAB: each one counts for eight columns

● colrm gets two numbers to specify the columns

● The first number is the first column to remove; the second, the last column to remove

● If only one number is given, remove everything from there to the end of the line

● For example: colrm 8 <<< "acbce 12345 example"

● Will result in: acbce 1

Deleting parts of the data

● Another example: colrm 3 10 <<< "acbce 12345 example"

● Will result in:

ac5 example

● Finally: colrm 3 30 <<< "acbce 12345 example"

● Will result in:

ac

● That is: if the line is shorter than the second number, just delete to the end

Putting things together

● We have learned how to get columns out of data with cut and colrm

● Some other commands, such as paste, combine different files into one

● Typically, each one of the input files to paste will be present as a column in the new file, creating a table

● For example:

paste /data/p1 /data/p2 /data/p3
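The same idea with inline data (file names and contents made up here, standing in for /data/p1 etc.):

```shell
# Each file holds one future column
printf 'a\nb\nc\n' > col1.txt
printf '1\n2\n3\n' > col2.txt

paste col1.txt col2.txt   # line-by-line, side by side, TAB-separated
paste -s col1.txt         # serial: the whole file becomes one TAB-separated row
```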

Putting things together

● If you look at the original contents of those files, you will see that everything that was in the first file was put in column 1 of the output, everything that was in the second one in column 2, and so on

● One interesting option to paste is -s (for serial), which basically transposes the data. Try it!

Putting things together

● Another command to join files is… join

● join works as a very primitive database function

● It joins lines from two files based on one common column containing an identifier string

● The files must be sorted based on that common column!

Putting things together

● For example: join -j 1 /data/j4 /data/j5

● If you look at the original contents of those two files, you will see that column 1 was the “join field”: whenever that was the same in both files, the lines got joined in the output

● When there is no match, by default the line is not included in the output
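A self-contained sketch of that behavior (the gene-annotation files here are invented, standing in for /data/j4 and /data/j5):

```shell
# Both files sorted on column 1, the shared identifier
printf 'gene1 kinase\ngene2 helicase\ngene3 ligase\n' > names.txt
printf 'gene1 12\ngene3 7\n' > counts.txt

# Merge lines whose column-1 identifiers match;
# gene2 has no counterpart in counts.txt, so it is dropped
join -j 1 names.txt counts.txt
```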

Recap

● The number of Unix text tools keeps on growing

● tac is an easy way to invert the lines of a file

● One of the most basic computer tasks, being at the basis of many advanced data analysis algorithms, is sorting; sort is a very flexible program to do that

● shuf provides a “randomness generator”, which can be very useful

● Large data files are often tabular in nature, and Unix has a few programs designed to deal with that kind of file

● cut allows us to retrieve one or more columns from a file

Recap

● paste and join, on the other hand, enable the building of larger, multi-column files out of smaller files containing the individual columns

● tr (translate) is a quick way of replacing or deleting certain characters

● colrm provides a way to delete certain “columns” (for example, from characters 5 to 9) from each line of a file

● split lets us break a large file into smaller ones, based on the number of lines or bytes that we would like to have in each resulting file, or the final number of files
