
Mineração de Dados Aplicada
Simple but Powerful Text-Processing Commands
Loïc Cerf
August 21st, 2019
DCC - ICEx - UFMG

Part of the Unix philosophy

Doug McIlroy (inventor of Unix pipes), in A Quarter-Century of Unix (1994):
* Write programs that do one thing and do it well.
* Write programs to work together.
* Write programs to handle text streams, because that is a universal interface.

Utilities from the 70s

The Unix operating system came with specific text-processing commands. They are still very useful today.
Since 1984, the GNU project has improved them a lot (new options, improved efficiency).
The original commands are part of the POSIX standards.
Most Free operating systems are largely POSIX-compliant: GNU/Linux, BSD, illumos, Haiku, etc. Mac OS X is too. Windows is not, but Cygwin is a good compatibility layer.

The shell

The shell interprets every command entered in a terminal or written in an executable file, whose first line indicates the shell to use: #!/bin/sh (any POSIX-compliant shell) or #!/bin/bash or #!/bin/dash or #!/bin/zsh, etc.

A 2-line shell script:

    #!/bin/sh
    # Get started with the MDA exercises
    wget dcc.ufmg.br/~lcerf/data.tar.xz -O - | tar -xJ
    mv data data-$(date +%m%d) 2> /dev/null

[Figure: Geeks and repetitive tasks]

Standard I/O

POSIX text-processing commands:
* process the input stream of text line by line;
* by default:
  * read from the standard input (the keyboard if not redirected);
  * write to the standard output (the terminal if not redirected).

< and > redirect the standard input and the standard output from and to a file.
A pipe binds an output stream to an input stream. It can bear a name (given as an argument to mkfifo) but most workflows only need the unnamed pipe, |.
| redirects the standard output of the command on its left to the standard input of the command on its right.
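The redirections and the unnamed pipe can be tried on a throwaway file. The following is only a minimal sketch: the scratch file, its three-word content and the variable names are made up for the illustration, and mktemp is assumed to be installed (it is widespread but not required by POSIX).

```shell
#!/bin/sh
# Hypothetical scratch file, created only for this illustration.
tmp=$(mktemp)

# > redirects the standard output of printf to the file.
printf 'pear\napple\nfig\n' > "$tmp"

# < redirects the standard input of sort from the file.
sorted=$(sort < "$tmp")

# | connects the standard output of sort to the standard input of head.
first=$(sort < "$tmp" | head -n 1)

printf '%s\n' "$sorted"
printf 'first: %s\n' "$first"

# Clean up the scratch file.
rm -f "$tmp"
```

Neither sort nor head is given a file name here: both only see a stream on their standard input, which is what makes them composable.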
Getting the data

    $ wget dcc.ufmg.br/~lcerf/data.tar.xz -O - | tar -xJ

wget and tar are two GNU commands. Like all GNU commands:
* the man command (e.g., man wget) gives their specifications;
* the info command (e.g., info wget) often provides more detailed explanations, examples of use, etc.;
* long options are prefixed with --, short (i.e., one-letter) options with - and can be grouped (e.g., -xJ equates to -x -J);
* options can take an argument (right after them);
* a lone - stands for the standard input (/dev/stdin) or the standard output (/dev/stdout).

Reading a large text file

Your favorite text editor (Vim or Emacs?) loads the entire file in main memory, a problem if it weighs gigabytes or more.
The solution is named less. It is the viewer for man pages.
A few commands inside less: Page-up/down, R (repaint), F (follow), [0-9]+ (scroll that many lines), / (search forwards for a regexp), ? (search backwards for a regexp), q (quit).

Exercise
Find, with less, the IP address of the first Brazilian visitor after the 100th line of DistroWatch/20100428/debian.

Outputting the first/last lines

Do not test your scripts on the whole dataset!
head outputs the first lines of the input; tail its last lines.
A few options:
* -[0-9]+ sets the number of lines (10 by default);
* -n does too, and additionally accepts a signed number: with GNU head, head -n -K outputs all lines except the last K; tail -n +K starts the output at line K, i.e., it skips the first K-1 lines.

Exercise
Print the lines 5 to 15 of one file in DistroWatch.com's logs.
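
Chaining head and tail selects a range of lines. The sketch below does not use the course data: it assumes the seq command is available (it is not part of POSIX) and uses the integers 1 to 20 as a stand-in input; the variable names are illustrative only.

```shell
#!/bin/sh
# Hypothetical sample input: the integers 1 to 20, one per line.

# Keep the first 5 lines, then the last 3 of those: lines 3 to 5.
middle=$(seq 20 | head -n 5 | tail -n 3)
printf '%s\n' "$middle"

# tail -n +18 starts the output at line 18 (it skips the first 17 lines).
ending=$(seq 20 | tail -n +18)
printf '%s\n' "$ending"
```

The same two-stage pattern works on any stream, so it can be tested on a few lines before being applied to a large log file.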