
Unix/Regex Lab
CS 341: Natural Language Processing
Heather Pon-Barry
Based on Unix For Poets (by Ken Church)

Overview
1. Setup & Unix review
2. Count words in a text
3. Sort a list of words in various ways
4. Search with grep
5. Two-minute response

1. Setup & Unix Review

Setting Up
• In your home directory, make a cs341 folder
• Make a directory called unixForPoets for today's lab activity

Unix Tools
pwd
ls
cd <dirname>
cd ../
less <filename>
head <filename>
tail <filename>
man <command>
piping: >  <  |
CTRL-C
• grep: search for a pattern (regular expression)
• sort
• uniq -c (count duplicates)
• tr (translate characters)
• wc (word or line count)
• cat (send file(s) in stream)
• sed (edit string: replacement)

2. Count words in a text

Counting lines, words, characters
• wc alice.txt
    1601 27336 135029 alice.txt

tr command
NAME
    tr - translate or delete characters
SYNOPSIS
    tr [OPTION]... SET1 [SET2]
DESCRIPTION
    Translate, squeeze, and/or delete characters from standard input, writing to standard output.
    -c  complement of SET1
    -s  if SET2 is specified, squeezes repeated SET2 characters to a single character
    …
    --help  display this help and exit

Counting Words
• Input: mini-alice.txt; alice.txt
• Output: list of words with freq counts
• Algorithm
    1. Create a file with one token per line (tr -sc …)
    2. Sort (sort)
    3. Count duplicates (uniq -c)
• Practice using tr, sort, and uniq incrementally on mini-alice.txt
• Once you understand each step, run your command on alice.txt

Output
    632 a
    1 abide
    1 able
    94 about
    3 above
    1 absence
    2 absurd
    1 acceptance
    2 accident
    1 accidentally
    .
    .
    .
Solution (hidden): tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c

head and tail
• head gives you the first n lines (n=10 by default; can specify n with flag -n)
• tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c | head -n 5
    632 a
    1 abide
    1 able
    94 about
    3 above
• what do you think tail does?

3. Sort a list of words in various ways

Most Frequent Words Exercise
• Find the 50 most common words in alice.txt
• Hint: Use sort a second time, then head

4. Search with grep

grep
• Grep finds patterns specified as regular expressions
• globally search for regular expression and print
• Try this: grep cheshire alice.txt
    it s a cheshire cat said the duchess and that s why
    pig she said the last word with such sudden violence that alice
    quite jumped but she saw in another moment that it was
    addressed to the baby and not to her so she took courage and
    went on again i didn t know that cheshire cats always grinned in
    fact i didn t know that cats could grin
    …
• Next, try grepping other phrases

grep
• Make an intermediary words file:
    • tr -sc 'A-Za-z' '\n' < alice.txt > alice.words
• Finding words ending in -ing:
    • grep 'ing$' alice.words | sort | uniq -c

grep
• grep is a filter: you keep only some lines of the input
• Try these on alice.words
    • grep gh                                   keep lines containing "gh"
    • grep '^con'                               keep lines beginning with "con"
    • grep 'ing$'                               keep lines ending with "ing"
    • grep -v gh                                keep lines NOT containing "gh"
    • grep -i '[aeiou].*[aeiou]'                keep lines with two or more vowels
    • grep -i '^[^aeiou]*[aeiou][^aeiou]*$'     keep lines with exactly one vowel

Take-home Message
• Piping commands together can be simple yet powerful in Unix

5. Two-minute response
https://xkcd.com/208/

Two-minute Response
• In Piazza, post a Note to Instructor only:
    1. What is one thing you understand better after today's activity?
    2. What is something that's still unclear, or a question you have?

Extra Exercises

Sorting exercises
• Find the words in alice.txt that end in "ling" using sorting (and not using grep)
• Hint: what does this do?
    • tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq | head | rev

Exercises on grep & wc
• In alice.txt…
    • How many 4-letter words?
    • How many different words are there with no vowels
        • What subtypes do they belong to?
    • How many "1 syllable" words are there
        • That is, ones with exactly one vowel
Answer these with respect to word types, not word tokens

grep
• We used the following to keep lines with exactly one vowel
    • grep -i '^[^aeiou]*[aeiou][^aeiou]*$'
• What would happen if we instead used the following command? In what contexts is this important?
    • grep -i '[^aeiou]*[aeiou][^aeiou]*'
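Before running the lab's commands on alice.txt, it can help to try them on a text small enough to check by hand. The following is a minimal sketch under stated assumptions, not part of the lab materials: the file names sample.txt and sample.words and the sample sentence are made up for illustration, and LC_ALL=C is set only so that the sort order does not depend on the locale.

```shell
#!/bin/sh
# Sketch: the lab's counting and grep steps on a tiny, made-up text.
# sample.txt and sample.words are hypothetical stand-ins for alice.txt and alice.words.
export LC_ALL=C   # byte-order sorting, so results do not depend on the locale

# Two short lowercase lines, small enough to count by hand.
printf 'the cat saw the queen\nthe queen ran\n' > sample.txt

# One token per line, sort, then count duplicates.
counts=$(tr -sc 'A-Za-z' '\n' < sample.txt | sort | uniq -c)

# Most frequent first: sort the counts numerically in reverse, keep the top 2.
top=$(printf '%s\n' "$counts" | sort -nr | head -n 2)
printf '%s\n' "$top"

# Word types (distinct words), one per line.
tr -sc 'A-Za-z' '\n' < sample.txt | sort | uniq > sample.words

# Anchored pattern: the whole line must contain exactly one vowel.
one_vowel=$(grep -ic '^[^aeiou]*[aeiou][^aeiou]*$' sample.words)

# Unanchored pattern: any line containing at least one vowel matches.
any_vowel=$(grep -ic '[^aeiou]*[aeiou][^aeiou]*' sample.words)

echo "exactly one vowel: $one_vowel; at least one vowel: $any_vowel"
```

On this sample the top two words are "the" (3) and "queen" (2), and the last line prints "exactly one vowel: 4; at least one vowel: 5". The word "queen" is what separates the two grep patterns.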