Unix/Regex Lab Overview


CS 341: Natural Language Processing
Heather Pon-Barry
Based on Unix For Poets (by Ken Church)

Overview
1. Setup & Unix review
2. Count words in a text
3. Sort a list of words in various ways
4. Search with grep
5. Two-minute response

1. Setup & Unix Review

Setting Up
• In your home directory, make a cs341 folder
• Make a directory called unixForPoets for today's lab activity

Unix review
• pwd
• ls
• cd <dirname>
• cd ../
• less <filename>
• head <filename>
• tail <filename>
• man <command>
• CTRL-C
• piping: >  <  |

Unix Tools
• grep: search for a pattern (regular expression)
• sort
• uniq -c (count duplicates)
• tr (translate characters)
• wc (word or line count)
• cat (send file(s) in stream)
• sed (edit string: replacement)

2. Count words in a text

Counting lines, words, characters
• wc alice.txt
    1601 27336 135029 alice.txt

Counting Words
• Input: mini-alice.txt; alice.txt
• Output: list of words with freq counts
• Algorithm
  1. Create a file with one token per line (tr -sc …)
  2. Sort (sort)
  3. Count duplicates (uniq -c)
• Practice using tr, sort, and uniq incrementally on mini-alice.txt
• Once you understand each step, run your command on alice.txt

tr command
NAME
    tr - translate or delete characters
SYNOPSIS
    tr [OPTION]... SET1 [SET2]
DESCRIPTION
    Translate, squeeze, and/or delete characters from standard input, writing to standard output.
    -c      complement of SET1
    -s      if SET2 is specified, squeezes repeated SET2 characters to a single character
    …
    --help  display this help and exit

Output
    632 a
      1 abide
      1 able
     94 about
      3 above
      1 absence
      2 absurd
      1 acceptance
      2 accident
      1 accidentally
      .
      .
      .
Solution (hidden): tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c

head and tail
• head gives you the first n lines (n=10 by default; can specify n with flag -n)
• tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c | head -n 5
    632 a
      1 abide
      1 able
     94 about
      3 above
• what do you think tail does?

3. Sort a list of words in various ways

Most Frequent Words Exercise
• Find the 50 most common words in alice.txt
• Hint: Use sort a second time, then head

4. Search with grep

grep
• Grep finds patterns specified as regular expressions
• globally search for regular expression and print
• Try this: grep cheshire alice.txt
    it s a cheshire cat said the duchess and that s why
    pig she said the last word with such sudden violence that alice
    quite jumped but she saw in another moment that it was
    addressed to the baby and not to her so she took courage and
    went on again i didn t know that cheshire cats always grinned in
    fact i didn t know that cats could grin
    …
• Next, try grepping other phrases

grep
• Make an intermediary words file:
  • tr -sc 'A-Za-z' '\n' < alice.txt > alice.words
• Finding words ending in -ing:
  • grep 'ing$' alice.words | sort | uniq -c

grep
• grep is a filter: you keep only some lines of the input
• Try these on alice.words
  • grep gh                                   keep lines containing "gh"
  • grep '^con'                               keep lines beginning with "con"
  • grep 'ing$'                               keep lines ending with "ing"
  • grep -v gh                                keep lines NOT containing "gh"
  • grep -i '[aeiou].*[aeiou]'                keep lines with two or more vowels
  • grep -i '^[^aeiou]*[aeiou][^aeiou]*$'     keep lines with exactly one vowel

Take-home Message
• Piping commands together can be simple yet powerful in Unix

5. Two-minute response

https://xkcd.com/208/

Two-minute Response
• In Piazza, post a Note to Instructor only:
  1. What is one thing you understand better after today's activity?
  2. What is something that's still unclear, or a question you have?

Extra Exercises

Sorting exercises
• Find the words in alice.txt that end in "ling" using sorting (and not using grep)
• Hint: what does this do?
  • tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq | head | rev

Exercises on grep & wc
• In alice.txt…
  • How many 4-letter words?
  • How many different words are there with no vowels?
    • What subtypes do they belong to?
  • How many "1 syllable" words are there?
    • That is, ones with exactly one vowel
• Answer these with respect to word types, not word tokens

grep
• We used the following to keep lines with exactly one vowel
  • grep -i '^[^aeiou]*[aeiou][^aeiou]*$'
• What would happen if we instead used the following command? In what contexts is this important?
  • grep -i '[^aeiou]*[aeiou][^aeiou]*'
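
Worked sketch: counting on a tiny sample

The counting algorithm from section 2 (one token per line, sort, count duplicates) and the idea of grep as a filter can be tried on a tiny inline sample before running them on alice.txt. This is only a minimal sketch: the sample sentence and the variable names (sample, tokens, counts, top, at_words) are made up for illustration and stand in for mini-alice.txt.

```shell
# Hypothetical sample text, standing in for mini-alice.txt.
sample='the cat and the hat and the bat'

# Step 1: one token per line.
# -c complements SET1 (so: every character that is NOT a letter),
# -s squeezes each run of the resulting newlines into a single newline.
tokens=$(printf '%s\n' "$sample" | tr -sc 'A-Za-z' '\n')

# Steps 2 and 3: sort so that duplicates are adjacent, then count them.
counts=$(printf '%s\n' "$tokens" | sort | uniq -c)

# Rank by frequency: sort a second time, numerically and in reverse,
# then keep only the first two lines with head.
top=$(printf '%s\n' "$counts" | sort -rn | head -n 2)

# grep as a filter: keep only the distinct tokens that end in "at".
at_words=$(printf '%s\n' "$tokens" | grep 'at$' | sort | uniq)

printf '%s\n' "$top"
printf '%s\n' "$at_words"
```

On this sample the two most frequent tokens are "the" (3 occurrences) and "and" (2), and the filter keeps bat, cat, and hat. The exact column padding printed by uniq -c differs between systems, so compare the numbers rather than the spacing.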