Unix/Regex Lab
CS 341: Natural Language Processing
Heather Pon-Barry
Based on Unix For Poets (by Ken Church)

Overview
1. Setup & Unix review
2. Count words in a text
3. Sort a list of words in various ways
4. Search with grep
5. Two-minute response

1. Setup & Unix Review

Setting Up
• In your home directory, make a cs341 folder
• Make a directory called unixForPoets for today's lab activity
• pwd, ls, cd

Unix Tools
• grep: search for a pattern (regular expression)
• sort
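The setup steps above, together with pwd, ls, and cd, might look like the following minimal sketch. The variable names BASE and LAB_DIR are assumptions made for this illustration; BASE defaults to your home directory.

```shell
# Assumption: BASE defaults to your home directory; set BASE to practice elsewhere.
BASE="${BASE:-$HOME}"
LAB_DIR="$BASE/cs341/unixForPoets"

mkdir -p "$LAB_DIR"   # -p creates cs341 and unixForPoets in one step
cd "$LAB_DIR"         # change into the lab directory
pwd                   # print the current working directory
ls -a                 # list its contents (only . and .. in a new directory)
```

Running mkdir -p a second time is harmless, because it does not complain when the directories already exist.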
2. Count words in a text

• wc alice.txt
    1601 27336 135029 alice.txt

tr command
NAME
    tr - translate or delete characters
SYNOPSIS
    tr [OPTION]... SET1 [SET2]
DESCRIPTION
    Translate, squeeze, and/or delete characters from standard input, writing to standard output.
    -c      complement of SET1
    -s      if SET2 is specified, squeezes repeated SET2 characters to a single character
    --help  display this help and exit

Counting Words
• Input: mini-alice.txt; alice.txt
• Output: list of words with freq counts
• Algorithm
  1. Create a file with one token per line (tr -sc …)
  2. Sort (sort)
  3. Count duplicates (uniq -c)
• Practice using tr, sort, and uniq incrementally on mini-alice.txt
• … Once you understand each step, run your command on alice.txt

Output
    632 a
    1 abide
    1 able
    94 about
    3 above
    1 absence
    2 absurd
    1 acceptance
    2 accident
    1 accidentally
    ...
Solution (hidden): tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c

head and tail
• head gives you the first n lines (n=10 by default; can specify n with flag -n)
• tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c | head -n 5
    632 a
    1 abide
    1 able
    94 about
    3 above
• what do you think tail does?
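To practice the three-step algorithm incrementally, a sketch along these lines may help. The file sample.txt and its contents are made up for this illustration (they stand in for mini-alice.txt), and LC_ALL=C is added only so the sort order is predictable across systems.

```shell
# Hypothetical sample text standing in for mini-alice.txt
printf 'the cat saw the hat\nthe cat sat\n' > sample.txt

# Step 1: one token per line. -c complements the letter set and -s squeezes
# repeats, so each run of non-letters becomes a single newline.
tr -sc 'A-Za-z' '\n' < sample.txt > sample.words

# Step 2: sort so that identical words end up on adjacent lines.
LC_ALL=C sort sample.words > sample.sorted

# Step 3: collapse adjacent duplicates and prefix each word with its count.
uniq -c sample.sorted

# The same three steps as one pipeline, keeping only the first 2 lines with head.
counts=$(tr -sc 'A-Za-z' '\n' < sample.txt | LC_ALL=C sort | uniq -c | head -n 2)
echo "$counts"
```

Inspecting sample.words and sample.sorted after each step shows what the next command receives as input.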
3. Sort a list of words in various ways

Most Frequent Words Exercise
• Find the 50 most common words in alice.txt
• Hint: Use sort a second time, then head
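One possible reading of the hint, sketched on a tiny made-up sample rather than on alice.txt: the second sort orders the counted lines numerically, largest first. The file sample.txt is an assumption for this illustration.

```shell
# Hypothetical sample text; substitute alice.txt in the lab
printf 'the cat saw the hat\nthe cat sat\n' > sample.txt

# Count words as before, then sort a second time:
# -n compares the leading counts as numbers, -r puts the largest first.
top=$(tr -sc 'A-Za-z' '\n' < sample.txt | LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr | head -n 2)
echo "$top"
```

On the real text, the number passed to head -n would be 50 instead of 2.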
4. Search with grep

• Grep finds patterns specified as regular expressions
• globally search for regular expression and print

• Try this: grep cheshire alice.txt
    it s a cheshire cat said the duchess and that s why
    pig she said the last word with such sudden violence that alice
    quite jumped but she saw in another moment that it was
    addressed to the baby and not to her so she took courage and
    went on again i didn t know that cheshire cats always grinned in
    fact i didn t know that cats could grin
    …
• Next, try grepping other phrases

• Make an intermediary words file:
  • tr -sc 'A-Za-z' '\n' < alice.txt > alice.words
• Finding words ending in -ing:
  • grep 'ing$' alice.words | sort | uniq -c
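The words-file step and the 'ing$' search can be tried on a small made-up sample first. The file names sample.txt and sample.words are assumptions for this sketch, standing in for alice.txt and alice.words.

```shell
# Hypothetical sample text standing in for alice.txt
printf 'the king was singing and the queen was sitting\nthe king kept singing\n' > sample.txt

# One word per line, saved so that later commands can reuse it
tr -sc 'A-Za-z' '\n' < sample.txt > sample.words

# Keep only lines ending in "ing", then count each distinct word
ing_counts=$(grep 'ing$' sample.words | LC_ALL=C sort | uniq -c)
echo "$ing_counts"
```

Note that "king" is kept too: the pattern only checks how the line ends and knows nothing about grammar.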
• grep is a filter – you keep only some lines of the input
• Try these on alice.words
  • grep gh                                  keep lines containing "gh"
  • grep '^con'                              keep lines beginning with "con"
  • grep 'ing$'                              keep lines ending with "ing"
  • grep -v gh                               keep lines NOT containing "gh"
  • grep -i '[aeiou].*[aeiou]'               keep lines with two or more vowels
  • grep -i '^[^aeiou]*[aeiou][^aeiou]*$'    keep lines with exactly one vowel

Take-home Message
• Piping commands together can be simple yet powerful in Unix
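As a small illustration of filtering and piping, the grep patterns listed above can be tried on a short word list. The seven words and the file name sample.words are made up for this sketch; in the lab you would use alice.words.

```shell
# Hypothetical word list, one word per line
printf '%s\n' light contain sing a rhythm tree confide > sample.words

grep gh sample.words                                  # lines containing "gh"
grep '^con' sample.words                              # lines beginning with "con"
grep 'ing$' sample.words                              # lines ending with "ing"
grep -v gh sample.words                               # lines NOT containing "gh"
grep -i '[aeiou].*[aeiou]' sample.words               # two or more vowels
grep -i '^[^aeiou]*[aeiou][^aeiou]*$' sample.words    # exactly one vowel

# Filters can be chained: drop lines with "gh", then count those beginning with "con"
piped=$(grep -v gh sample.words | grep -c '^con')
echo "$piped"
```

The -c flag makes grep print the number of matching lines instead of the lines themselves, which is handy at the end of a pipeline.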
https://xkcd.com/208/

5. Two-minute response

• In Piazza, post a Note to Instructor only:
  1. What is one thing you understand better after today's activity?
  2. What is something that's still unclear, or a question you have?

Extra Exercises

Sorting exercises
• Find the words in alice.txt that end in "ling" using sorting (and not using grep)
• Hint: what does this do?
  • tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq | head | rev

Exercises on grep & wc
• In alice.txt…
  • How many 4-letter words?
  • How many different words are there with no vowels?
    • What subtypes do they belong to?
  • How many "1 syllable" words are there?
    • That is, ones with exactly one vowel
• Answer these with respect to word types, not word tokens
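The difference between word types and word tokens can be seen on a small made-up sample. The file sample.txt and the variable names are assumptions for this sketch; tr -d ' ' only strips the padding that some versions of wc add.

```shell
# Hypothetical sample text; substitute alice.txt in the lab
printf 'the cat saw the hat\nthe cat sat\n' > sample.txt

# Word tokens: every occurrence counts
tokens=$(tr -sc 'A-Za-z' '\n' < sample.txt | wc -l | tr -d ' ')

# Word types: each distinct word counts once (sort, then uniq drops duplicates)
types=$(tr -sc 'A-Za-z' '\n' < sample.txt | LC_ALL=C sort | uniq | wc -l | tr -d ' ')

echo "tokens: $tokens, types: $types"
```

Answering "with respect to word types" therefore means applying sort | uniq before any grep or wc step.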
grep
• We used the following to keep lines with exactly one vowel
  • grep -i '^[^aeiou]*[aeiou][^aeiou]*$'
• What would happen if we instead used the following command? In what contexts is this important?
  • grep -i '[^aeiou]*[aeiou][^aeiou]*'
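One way to explore the question is to run both patterns on a short word list and compare the counts. The five words and the file name sample.words are made up for this sketch.

```shell
# Hypothetical word list, one word per line
printf '%s\n' sing tree rhythm a confide > sample.words

# Anchored with ^ and $: the whole line must be consonants, one vowel, consonants
anchored=$(grep -ci '^[^aeiou]*[aeiou][^aeiou]*$' sample.words)

# Unanchored: the pattern only has to match somewhere in the line,
# so any line that contains at least one vowel is kept
unanchored=$(grep -ci '[^aeiou]*[aeiou][^aeiou]*' sample.words)

echo "anchored: $anchored, unanchored: $unanchored"
```

Dropping -c shows which words each version keeps, which makes the effect of the anchors easier to see.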