Unix/Regex Lab
CS 341: Natural Language Processing
Heather Pon-Barry
Based on Unix For Poets (by Ken Church)

Overview

1. Setup & review
2. Count words in a text
3. Sort a list of words in various ways
4. Search with grep
5. Two-minute response

Setting Up

1. Setup & Unix Review

• In your home directory, create a cs341 folder
• Make a directory called unixForPoets for today’s lab activity

Unix Tools

• grep: search for a pattern (regular expression)
• sort
• uniq -c (count duplicates)
• tr (translate characters)
• wc (word or line count)
• cat (send file(s) in stream)
• sed (edit string: replacement)

Also for review: piping (|), cd ../, redirection to and from <filename> (>, <), man, CTRL-C

Counting lines, words, characters
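A minimal sketch of what wc reports, using a small made-up file. The file name sample.txt and its contents are assumptions for illustration, not part of the lab materials. By default wc prints lines, words, and bytes, in that order; for plain ASCII text, bytes and characters coincide.

```shell
# Work in a throwaway directory so nothing in the lab folder is touched.
workdir=$(mktemp -d)

# Hypothetical two-line sample file.
printf 'the cat sat\non the mat\n' > "$workdir/sample.txt"

# Default output: line count, word count, byte count, file name.
wc "$workdir/sample.txt"

# Each count can be requested on its own. Reading from stdin (<) keeps
# the file name out of the output; tr -d ' ' strips any padding.
lines=$(wc -l < "$workdir/sample.txt" | tr -d ' ')
words=$(wc -w < "$workdir/sample.txt" | tr -d ' ')
bytes=$(wc -c < "$workdir/sample.txt" | tr -d ' ')
echo "lines=$lines words=$words bytes=$bytes"

rm -r "$workdir"
```

The same three columns appear when wc is run on a real text; only the numbers change.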

2. Count words in a text

• wc alice.txt

  1601 27336 135029 alice.txt

tr command

NAME
    tr - translate or delete characters
SYNOPSIS
    tr [OPTION]... SET1 [SET2]
DESCRIPTION
    Translate, squeeze, and/or delete characters from standard input, writing to standard output.
    -c      complement of SET1
    -s      if SET2 is specified, squeezes repeated SET2 characters to a single character
    …
    --help  display this help and exit

Counting Words

• Input: mini-alice.txt; alice.txt
• Output: list of words with freq counts
• Algorithm
  1. Create a file with one token per line (tr -sc …)
  2. Sort (sort)
  3. Count duplicates (uniq -c)
• Practice using tr, sort, and uniq incrementally on mini-alice.txt
• Once you understand each step, run your command on alice.txt

Output

  632 a
    1 abide
    1 able
   94 about
    3 above
    1 absence
    2 absurd
    1 acceptance
    2 accident
    1 accidentally
    ...

Solution (hidden): tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c

head and tail

• head gives you the first n lines (n=10 by default; can specify n with flag -n)
• tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c | head -n 5

  632 a
    1 abide
    1 able
   94 about
    3 above

• what do you think tail does?

Frequent Words Exercise
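For this kind of task, a minimal sketch on a tiny made-up sentence (not alice.txt) may help. It counts tokens with tr, sort, and uniq -c, then sorts a second time by count and keeps the top rows. The sample text, the LC_ALL=C setting (used only to make ordering predictable), and the cutoff of 2 are assumptions for illustration.

```shell
# Make character ranges and sort order predictable across systems.
export LC_ALL=C

# Hypothetical sample text standing in for a real corpus.
sample='the cat and the dog and the bird'

# 1. tr -sc:   replace each run of non-letters with one newline (one token per line)
# 2. sort:     bring identical tokens next to each other
# 3. uniq -c:  collapse duplicates, prefixing each token with its count
# 4. sort -nr: sort again, numerically by count, largest first
# 5. head -n 2: keep only the two most frequent tokens
top=$(printf '%s\n' "$sample" \
  | tr -sc 'A-Za-z' '\n' \
  | sort \
  | uniq -c \
  | sort -nr \
  | head -n 2)

printf '%s\n' "$top"
```

Running each stage on its own, adding one pipe at a time, shows what every step contributes.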

3. Sort a list of words in various ways

• Find the 50 most common words in alice.txt
• Hint: Use sort a second time, then head

grep
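As a first look at grep, here is a minimal sketch on a small made-up file. The file name sample.txt and its three lines are assumptions for illustration. grep reads its input line by line and prints only the lines that match the pattern.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)

# Hypothetical three-line input file.
printf '%s\n' 'the cat sat' 'a dog barked' 'Cats grin' > "$workdir/sample.txt"

# Print every line containing "cat" (matching is case-sensitive by default).
matches=$(grep 'cat' "$workdir/sample.txt")

# -i ignores case, so "Cats grin" is kept as well.
matches_i=$(grep -i 'cat' "$workdir/sample.txt")

# -c prints the number of matching lines instead of the lines themselves.
count_i=$(grep -c -i 'cat' "$workdir/sample.txt")

printf '%s\n' "$matches" '---' "$matches_i" '---' "$count_i"

rm -r "$workdir"
```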

4. Search with grep

• Grep finds patterns specified as regular expressions
• globally search for regular expression and print

grep

• Try this: grep cheshire alice.txt

  it s a cheshire cat said the duchess and that s why pig she said the last word with such sudden violence that alice quite jumped but she saw in another moment that it was addressed to the baby and not to her so she took courage and went on again i didn t know that cheshire cats always grinned in fact i didn t know that cats could grin
  …

• Next, try grepping other phrases

grep

• Make an intermediary words file:
  • tr -sc 'A-Za-z' '\n' < alice.txt > alice.words
• Finding words ending in -ing:
  • grep 'ing$' alice.words | sort | uniq -c

grep

• grep is a filter: you keep only some lines of the input
• Try these on alice.words
  • grep gh                                keep lines containing "gh"
  • grep '^con'                            keep lines beginning with "con"
  • grep 'ing$'                            keep lines ending with "ing"
  • grep -v gh                             keep lines NOT containing "gh"
  • grep -i '[aeiou].*[aeiou]'             keep lines with two or more vowels
  • grep -i '^[^aeiou]*[aeiou][^aeiou]*$'  keep lines with exactly one vowel

Take-home Message

• Piping commands together can be simple yet powerful in Unix

5. Two-minute response

https://xkcd.com/208/

Two-minute Response

• In Piazza, post a Note to Instructor only:

1. What is one thing you understand better after today’s activity?

2. What is something that’s still unclear, or a question you have?

Extra Exercises

Sorting exercises

• Find the words in alice.txt that end in “ling” using sorting (and not using grep)
• Hint: what does this do?
  • tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq | head | rev

Exercises on grep & wc

• In alice.txt…
  • How many 4-letter words?
  • How many different words are there with no vowels?
    • What subtypes do they belong to?
  • How many “1 syllable” words are there?
    • That is, ones with exactly one vowel
• Answer these with respect to word types, not word tokens

grep

• We used the following to keep lines with exactly one vowel

• grep -i '^[^aeiou]*[aeiou][^aeiou]*$'

• What would happen if we instead used the following command? In what contexts is this important?

• grep -i '[^aeiou]*[aeiou][^aeiou]*'
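One way to explore this question is to run both patterns on a short made-up word list and compare what each keeps. The five words below are assumptions for illustration, standing in for alice.words. This is a sketch for experimenting, not a full answer to the exercise.

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' cat tree rhythm alice I)

# Anchored with ^ and $: the pattern must account for the whole line,
# so only lines with exactly one vowel survive.
anchored=$(printf '%s\n' "$words" | grep -i '^[^aeiou]*[aeiou][^aeiou]*$')

# Unanchored: the pattern only needs to match somewhere in the line,
# so every line containing at least one vowel survives.
unanchored=$(printf '%s\n' "$words" | grep -i '[^aeiou]*[aeiou][^aeiou]*')

echo 'anchored:';   printf '%s\n' "$anchored"
echo 'unanchored:'; printf '%s\n' "$unanchored"
```

Piping each result through wc -l gives a quick count of how many lines each version keeps.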