Unix/Regex Lab Overview

Unix/Regex Lab
CS 341: Natural Language Processing
Heather Pon-Barry
Based on Unix For Poets (by Ken Church)

Overview
  1. Setup & Unix review
  2. Count words in a text
  3. Sort a list of words in various ways
  4. Search with grep
  5. Two-minute response

1. Setup & Unix Review

Setting Up
  • In your home directory, make a cs341 folder
  • Make a directory called unixForPoets for today's lab activity

Unix Tools
  • grep: search for a pattern (regular expression)
  • sort
  • uniq -c (count duplicates)
  • tr (translate characters)
  • wc (word or line count)
  • cat (send file(s) in stream)
  • sed (edit string -- replacement)

  Navigation and viewing:
  • pwd
  • ls
  • cd <dirname>
  • cd ../
  • less <filename>
  • head <filename>
  • tail <filename>
  • man <command>
  • CTRL-C

  Piping:
  • >
  • <
  • |

2. Count words in a text

Counting lines, words, characters
  • wc alice.txt
      1601 27336 135029 alice.txt

Counting Words
  • Input: mini-alice.txt; alice.txt
  • Output: list of words with freq counts
  • Algorithm
      1. Create a file with one token per line (tr -sc …)
      2. Sort (sort)
      3. Count duplicates (uniq -c)
  • Practice using tr, sort, and uniq incrementally on mini-alice.txt
  • Once you understand each step, run your command on alice.txt

tr command
  NAME
      tr - translate or delete characters
  SYNOPSIS
      tr [OPTION]... SET1 [SET2]
  DESCRIPTION
      Translate, squeeze, and/or delete characters from standard input, writing to standard output.
      -c      complement of SET1
      -s,     if SET2 is specified, squeezes repeated SET2 characters to a single character
      …
      --help  display this help and exit

Output
      632 a
      1 abide
      1 able
      94 about
      3 above
      1 absence
      2 absurd
      1 acceptance
      2 accident
      1 accidentally
      ...

  Solution: tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c (hidden)

head and tail
  • head gives you the first n lines (n=10 by default; can specify n with flag -n)
  • tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq -c | head -n 5
      632 a
      1 abide
      1 able
      94 about
      3 above
  • what do you think tail does?

3. Sort a list of words in various ways

Most Frequent Words Exercise
  • Find the 50 most common words in alice.txt
  • Hint: Use sort a second time, then head

4. Search with grep

grep
  • Grep finds patterns specified as regular expressions
  • globally search for regular expression and print
  • Try this: grep cheshire alice.txt
      it s a cheshire cat said the duchess and that s why pig she said the last word with such sudden violence that alice quite jumped but she saw in another moment that it was addressed to the baby and not to her so she took courage and went on again i didn t know that cheshire cats always grinned in fact i didn t know that cats could grin
      …
  • Next, try grepping other phrases

grep
  • Make an intermediary words file:
      tr -sc 'A-Za-z' '\n' < alice.txt > alice.words
  • Finding words ending in -ing:
      grep 'ing$' alice.words | sort | uniq -c

grep
  • grep is a filter: you keep only some lines of the input
  • Try these on alice.words
      • grep gh                                  keep lines containing "gh"
      • grep '^con'                              keep lines beginning with "con"
      • grep 'ing$'                              keep lines ending with "ing"
      • grep -v gh                               keep lines NOT containing "gh"
      • grep -i '[aeiou].*[aeiou]'               keep lines with two or more vowels
      • grep -i '^[^aeiou]*[aeiou][^aeiou]*$'    keep lines with exactly one vowel

Take-home Message
  • Piping commands together can be simple yet powerful in Unix

5. Two-minute response

https://xkcd.com/208/

Two-minute Response
  • In Piazza, post a Note to Instructor only:
      1. What is one thing you understand better after today's activity?
      2. What is something that's still unclear on/a question you have?

Extra Exercises

Sorting exercises
  • Find the words in alice.txt that end in "ling" using sorting (and not using grep)
  • Hint: what does this do?
      • tr -sc 'A-Za-z' '\n' < alice.txt | sort | uniq | head | rev

Exercises on grep & wc
  • In alice.txt…
      • How many 4-letter words?
      • How many different words are there with no vowels?
          • What subtypes do they belong to?
      • How many "1 syllable" words are there?
          • That is, ones with exactly one vowel
  • Answer these with respect to word types, not word tokens

grep
  • We used the following to keep lines with exactly one vowel
      • grep -i '^[^aeiou]*[aeiou][^aeiou]*$'
  • What would happen if we instead used the command? In what contexts is this important?
      • grep -i '[^aeiou]*[aeiou][^aeiou]*'
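One way to explore the last grep question is to run both patterns on a handful of words and compare what each keeps. The sketch below uses a small made-up word list (hypothetical, not taken from alice.txt) and the variable names `words`, `anchored`, and `unanchored`, which are assumptions for illustration only.

```shell
# Hypothetical sample words (not from alice.txt), one per line.
words=$(printf '%s\n' cat queen sky tea the)

# Anchored: ^ and $ force the pattern to cover the whole line, so the line
# must be consonants, then exactly one vowel, then consonants.
anchored=$(printf '%s\n' "$words" | grep -i '^[^aeiou]*[aeiou][^aeiou]*$')

# Unanchored: the pattern only has to match somewhere inside the line,
# so any line that contains at least one vowel is kept.
unanchored=$(printf '%s\n' "$words" | grep -i '[^aeiou]*[aeiou][^aeiou]*')

printf 'anchored:\n%s\n\nunanchored:\n%s\n' "$anchored" "$unanchored"
```

On this sample the anchored pattern keeps only `cat` and `the`, while the unanchored one also keeps `queen` and `tea`. `sky` is dropped by both, because `y` is not in the `[aeiou]` set.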

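The counting algorithm from section 2 (tr, then sort, then uniq -c) and the hint from section 3 (sort a second time, then head) can be chained into a single pipeline. The sketch below is one possible reading of that hint, not necessarily the intended solution. It runs on a short made-up sentence rather than alice.txt, and the variable names `sample` and `top` are hypothetical.

```shell
# Hypothetical sample text standing in for alice.txt.
sample='the cat and the hat and the bat'

# 1. tr -sc:   turn every run of non-letters into one newline (one token per line)
# 2. sort:     bring identical tokens next to each other
# 3. uniq -c:  collapse adjacent duplicates and prefix each word with its count
# 4. sort -rn: sort again, this time numerically by count, largest first
# 5. head -n 2: keep the top two entries (use -n 50 for the exercise)
top=$(printf '%s\n' "$sample" \
  | tr -sc 'A-Za-z' '\n' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -n 2)

printf '%s\n' "$top"
```

The first sort is needed because uniq only merges adjacent duplicate lines. The second sort reorders by frequency instead of alphabetically. The amount of leading whitespace that uniq -c puts in front of each count varies between implementations.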