Simple but Powerful Text-Processing Commands


Mineração de Dados Aplicada (Applied Data Mining)
Loïc Cerf, DCC – ICEx – UFMG
August 21st, 2019

Part of the Unix philosophy

Doug McIlroy (inventor of Unix pipes), in A Quarter-Century of Unix (1994):

  • Write programs that do one thing and do it well.
  • Write programs to work together.
  • Write programs to handle text streams, because that is a universal interface.

Utilities from the 70s

The Unix operating system came with specific text-processing commands. They are still very useful today. Since 1984, the GNU project has improved them a lot (new options, improved efficiency). The original commands are part of the POSIX standards.

Most Free operating systems are POSIX-compliant: GNU/Linux, BSD, illumos, Haiku, etc. Mac OS X is too. Windows is not, but Cygwin is a good compatibility layer.

The shell

The shell interprets every command fired in a terminal or in an executable file, whose first line indicates the shell to use: #!/bin/sh (any POSIX-compliant shell) or #!/bin/bash or #!/bin/dash or #!/bin/zsh, etc.
A 2-line shell script:

    #!/bin/sh
    # Get started with the MDA exercises
    wget dcc.ufmg.br/~lcerf/data.tar.xz -O - | tar -xJ
    mv data data-$(date +%m%d) 2> /dev/null

[Figure: Geeks and repetitive tasks]

Standard I/O

POSIX text-processing commands process the input stream of text line by line. By default, they:

  • read from the standard input (the keyboard if not redirected);
  • write to the standard output (the terminal if not redirected).

< and > redirect the standard input and the standard output from and to a file.

A pipe binds an output stream to an input stream. It can bear a name (given as the argument of mkfifo), but most workflows only need the unnamed pipe, |, which redirects the standard output of the command on its left to the standard input of the command on its right.
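The redirections and pipes described under Standard I/O can be tried on a small file. The following is only a sketch: the file names (fruits.txt, pipe) and the sample words are made up for the illustration, and everything happens in a temporary directory.

```shell
#!/bin/sh
# Work in a throwaway directory so that no existing file is overwritten.
tmp=$(mktemp -d)

# > redirects the standard output to a file (hypothetical sample data).
printf 'banana\napple\ncherry\n' > "$tmp/fruits.txt"

# < redirects the standard input from a file.
sorted=$(sort < "$tmp/fruits.txt")
echo "$sorted"

# | connects the standard output of sort to the standard input of head.
first=$(sort < "$tmp/fruits.txt" | head -n 1)
echo "$first"    # prints: apple

# A named pipe does the same job but has a path in the file system.
mkfifo "$tmp/pipe"
sort "$tmp/fruits.txt" > "$tmp/pipe" &   # the writer waits for a reader
last=$(tail -n 1 < "$tmp/pipe")
wait
echo "$last"     # prints: cherry

rm -r "$tmp"
```

The named pipe needs the writer to run in the background because opening it blocks until the other end is opened too, which is one reason the unnamed | is usually more convenient.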
Getting the data

    $ wget dcc.ufmg.br/~lcerf/data.tar.xz -O - | tar -xJ

wget and tar are two GNU commands. Like all GNU commands:

  • the man command (e.g., man wget) gives their specifications;
  • the info command (e.g., info wget) often provides more detailed explanations, examples of use, etc.;
  • long options are prefixed with --, short (i.e., one-letter) options with -, and short options can be grouped (e.g., -xJ equates to -x -J);
  • options can take an argument (right after);
  • - means the standard input (/dev/stdin) or the standard output (/dev/stdout).
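The option conventions listed under Getting the data can be tried offline. This sketch does not touch the course archive: it uses sort on three made-up numbers and tar on a throwaway directory (the names data, copy and greeting.txt are hypothetical), and it leaves out compression so that it does not depend on xz being installed.

```shell
#!/bin/sh
# Grouped short options: -rn is the same as -r -n (reverse, numeric).
# GNU sort also accepts the long spelling: --reverse --numeric-sort.
separate=$(printf '2\n10\n1\n' | sort -r -n)
grouped=$(printf '2\n10\n1\n' | sort -rn)
echo "$grouped"

# An option can take an argument right after it: here -f takes "-".
# As the argument of -f, "-" stands for the standard output when an
# archive is created and for the standard input when it is extracted.
tmp=$(mktemp -d)
mkdir "$tmp/data" "$tmp/copy"
echo 'hello' > "$tmp/data/greeting.txt"
(cd "$tmp" && tar -cf - data) | (cd "$tmp/copy" && tar -xf -)
copied=$(cat "$tmp/copy/data/greeting.txt")
echo "$copied"

rm -r "$tmp"
```

The tar pipeline has the same shape as the wget command above: the left side writes an archive to its standard output and the right side reads it from its standard input, so nothing is stored in an intermediate file.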
Reading a large text file

Your favorite text editor (Vim or Emacs?) loads the entire file in main memory, a problem if it weighs gigabytes or more. The solution is named less. It is the viewer for man pages. A few commands inside less:

  • Page-up/down;
  • R (repaint);
  • F (follow);
  • [0-9]+ (scroll that many lines);
  • / (search forwards for a regexp);
  • ? (search backwards for a regexp);
  • q (quit).

Exercise: Find, with less, the IP address of the first Brazilian visitor after the 100th line of DistroWatch/20100428/debian.

Outputting the first/last lines

Do not test your scripts on the whole dataset! head outputs the first lines of the input; tail its last lines. A few options:

  • -[0-9]+ tunes the number of lines (10 by default);
  • -n does too, and also accepts a signed number: head -n -K outputs all lines except the last K, and tail -n +K outputs the lines from the K-th onwards (i.e., all lines except the first K-1).

Exercise: Print the lines 5 to 15 of one file in DistroWatch.com's logs.
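The way head and tail combine to cut a range of lines can be sketched on synthetic input. The example assumes seq is available (it is not a POSIX command, but GNU coreutils and the BSDs ship it) and works on the integers 1 to 20 rather than on the DistroWatch logs; the negative count given to head is a GNU extension.

```shell
#!/bin/sh
# Synthetic input: the integers 1 to 20, one per line.
numbers=$(seq 1 20)

# Lines 3 to 5: keep the first 5 lines, then the last 3 of those.
middle=$(echo "$numbers" | head -n 5 | tail -n 3)
echo "$middle"

# Same lines with tail first: start at line 3, then keep 3 lines.
middle2=$(echo "$numbers" | tail -n +3 | head -n 3)

# GNU head only: a negative count drops that many lines from the end.
all_but_last_18=$(echo "$numbers" | head -n -18)
echo "$all_but_last_18"
```

On a large file, the second form may be preferable when the range is near the end, and the first when it is near the beginning, since head stops reading as soon as it has output enough lines.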
