<<

CSCI 136: Fall 2017 Lab 8 Handout L8 Due 5 November 29 October

Super Lexicon!

1 Prelab

Before lab, read through this entire handout carefully. Construct a design document for the LexiconNode class as described in Section 2.3. We will briefly go over this design together at the beginning of lab on Wednesday, and we will collect your design documents. Please bring a physical copy or a PDF document.

2 Lab Program

Virtually all modern word processors contain a feature to check the spelling of words in documents. More advanced word processors also provide suggested corrections for misspelled words. For your next lab, you will be taking on the fun and challenging task of implementing such a spelling corrector. You are to implement a highly-efficient Lexicon class and go on to augment it with additional functionality to find spelling corrections for misspelled words.

The assignment has several purposes: • To gain further skill with recursion in managing a recursive tree structure and exploring it using sophisticated ; • To explore the notion of a layered abstraction where one class (Lexicon) makes use of other classes (e.g., Iterator, Vector, ) as part of its internal representation.

2.1 The Lexicon Interface Your first task is to implement a basic Lexicon class. You may recall this class from the recursion lab (Lab 3), where an optional extension used Lexicon.class to provide word completions for T-9 word cell-phones (if you want to see such a cell phone “in the wild”, Bill J has one). Now, you’ll get a chance to build a lexicon yourself to see the kind of specialized and highly-efficient data structures that are used in its implementation. This version has similar functionality to the Lexicon from the Lab 3 extension, with additional operations to remove words, iterate over words, and read words from a text file. The following methods are in the Lexicon interface. You should implement this interface in a file called LexiconTrie.. public interface Lexicon { boolean addWord(String word); boolean removeWord(String word); boolean containsWord(String word); boolean containsPrefix(String prefix); int addWordsFromFile(String filename); int numWords(); Iterator iterator(); Set suggestCorrections(String target, int maxDistance); Set matchRegex(String pattern); }

Most of the method behaviors can be inferred from their names and return types. For more information about the usage and purpose of these methods, refer to the comments in the starter code. 1 2 Implementing the Lexicon as a 2.2There Implementing are several different the Lexicon data structures as a Trie you could use to implement a lexicon— a sorted array, a Therelinked are list, several a binary different search data structures tree, a hashtable, you could use and to many implement others. a lexicon—a Each of sorted these array, offers a linkedtradeoffs list, abetween binary searchthe speed tree, and of manyword others. and prefix Each of lookup, these offers amount tradeoffs of betweenmemory the required speed of wordto store and prefixthe data lookup, structure, amount ofthe memoryease of required writing to and store debugging the , the code, the ease performance of writing and of add/remove, debugging the code,and so performance on. The implementation of add/remove, andwe so will on. Theuse implementationis a special kind we willof tree use iscalled a special a trie kind (pronounced of tree called "try"), a trie (pronounced designed for “try”), just designed this purpose. for just this purpose. AA trie trie is is a a letter-tree that that efficiently efficiently stores stores strings. strings. A node A in node a trie representsin a trie represents a letter. A a letter. through A thepath trie through traces outthe a sequencetrie traces of lettersout a thatsequence represent of aletters prefix orthat word represent in the lexicon. a prefix or word in the lexicon. Instead of just two children as in a binary tree, each trie node has potentially 26 child pointers (one for each letter ofInstead the alphabet). of just Whereas two children searching as ain binary a binary search tree, tree each eliminates trie node half thehas words potentially with a left26 orchild right pointers turn (we will(one eventuallyfor each discuss letter binaryof the searchalphabet). trees in Whereas class), a search searching in a trie a followsbinary thesearch child tree pointer eliminates for the next half letter, the whichwords narrowswith a the left search or right to just turn, words a search starting in with a trie that follows letter. For the example, child pointer from the for root, the any next words letter, that which begin withnarrows “n” canthe be search found byto followingjust words the pointerstarting to with the “n” that child letter. node. For From example, there, following from the “o” root, leads toany just words those wordsthat begin that beginwith with n can “no” be and found so on, by recursively. following If the two pointer words have to the the samen child prefix, node. they From share thatthere, initial following part of their o leads paths. to Thisjust saves those space words since that there begin are typically with no many and sharedso on prefixesrecursively. among words.If twoEach words node have has athe boolean sameisWord prefix, flagthey whichshare indicates that initial that thepart path of takentheir frompaths. the This root to saves this node space represents since there a complete are typically word. Here’s many a conceptual shared prefixes picture ofamong a small trie:words. Each node has a boolean isWord flag which indicates that the path taken from the root to this node represents a word. Here's a conceptual picture of a small trie:

Start

A N Z

R S E O E

E W T N

The thick border around a node indicates its isWord flag is true. This trie contains the words: a, are,The thickas, bordernew, aroundno, not, a node and indicates zen. that Strings its isWord suchflag as isze true. or Thisar are trie containsnot valid the words words: fora, this are, trie as,because new, the no, path not, forand thosezen strings. Strings ends such at as aze nodeor ar whereare not isWord valid words is false. for this Any trie becausepath not the drawn path for is thoseassumed strings to ends not at exist, a node so where stringsisWord such isas false. cat Anyor next path notare drawnnot valid is assumed because to not there exist, is sono strings such suchpath as in catthisor trie.astronaut are not valid because there is no such path in this trie. Like other trees, a trie is a recursive data structure. All of the children of a given trie node are themselves smaller .Like You other will betrees, making a trie good is usea recursive of your recursion data structure. skills when operatingAll of the on thechildren trie! of a given trie node are themselves smaller tries. You will be making good use of your recursion skills when operating on 2.3the trie! Managing Node Children ForManaging each node innode the trie, children you need a list of pointers to children nodes. In the sample trie drawn above, the root node hasFor three each children, node onein the each trie, for the you letters need A, a N, list and of Z. pointers One possibility to children for storing nodes. the children In the pointers sample is a trie statically- drawn sizedabove, 26-member the root array node of pointershas three to children, nodes, where onearray[0] each for isthe the letters child forA, A,N,array[1] and Z. Onerefers possibility to B, ... and for array[25]storing therefers children to Z. Whenpointers there is is a nostatically-sized child for a given 26-member letter, (such as array from Zof to pointers X) the array to entrynodes, would where be nullarray[0]. This is arrangement the child makesfor A, it array[1] trivial to findrefers the to child B, for... and a given array[25] letter—you refers simply to Z. access When the correctthere is element no child in thefor array a given by letter letter, index. (such as from Z to X) the array entry would be NULL. This arrangement makes it trivialHowever, to find for the child nodes for within a given the trie,letter, very you few simply of the 26access pointers the arecorrect needed, element so using in athe largely arraynull by letter26- memberindex. array However, is much for too most expensive nodes (i.e., within it wastes the a trie, lot ofvery space). few Better of the alternatives 26 pointers would are be needed, a dynamically-sized so using a arraylargely which NULL can grow 26-member and shrink array as needed is much (i.e., too a Vector), expensive. or a linked Better list alternatives of children pointers. would be We a leavedynamically- the final choicesized of array a space-efficient which can design grow upand to you,shrink but asyou needed, should justifya linked the list choice of children you make pointers, (briefly) inor yourleveraging program the comments. Two things you may want to consider: standard classes in our toolkit, such as a VectorList or Set, to store the children pointers. We leave the final1. there choice are at of most a space-efficient 26 children, so even design a O(N) up to you,operation but you to find should a particular justify child the is choice O(1), and you make in your2. trie program operations comments. such as iteration Two requirethings traversing you may the want words to in consider: alphabetical there order, are so keepingat most the 26 list children, of children so evenpointers a O(N) sorted operation by letter to will find be advantageous.a particular child is no big deal, and operations such as writing the

2 3 words to a file need to access the words in alphabetical order, so keeping the list of children pointersBegin sorted implementing by letter your will trie be by advantageous. constructing a LexiconNode class. LexiconNodes should be Compara- ble, so make sure to implement the Comparable interface. After implementing LexiconNode, you should create a constructor in LexiconTrie that creates an empty LexiconTrie to represent an empty word list. Be sure to incrementallySearching for compile words and and test prefixes your code. Searching the trie for words and prefixes is a matter of tracing out the path letter by letter. Let's 2.4consider Searching a few examples for Words on andthe sample Prefixes trie shown previously. To determine if the string new is a word, start at the root node and examine its children to find one pointing to n. Once found, recur on Searchingmatching the the trie remainder for words andstring prefixes ew. Find using econtainsWord among its children,and containsPrefix follow its pointer,is a and matter recur of tracing again out to thematch path w letter. Once by letter.we arrive Let’s at consider the w node, a few there examples are no on more the sample letters trie remaining shown previously. in the input, To determine so this is if the stringlast node.new is The a word, isWord start at field the root of nodethis andnode examine is true its, children indicating to find that one the pointing path to ton. this Once node found, is recurse a word on matchingcontained the in remainder the lexicon. string ew. Find e among its children, follow its pointer, and recurse again to match w. Once we arrive at the w node, there are no more letters remaining in the input, so this is the last node. The isWord field of thisAlternatively, node is true search, indicating for thatar. the The path path to this exists node and is a wordwe can contained trace inour the way lexicon. through all letters, but the isWordAlternatively, field on search the last for arnode. The is path false exists, which and we indicates can trace ourthat way this through path is all not letters, a word. but the (ItisWord is, however,field on a theprefix last nodeof other is false words, which in the indicates trie). thatSearching this path isfor not nap a word. follows (It is, n however, away from a prefix the of otherroot, wordsbut finds in the no trie). a Searchingchild leading for nap fromfollows there,n away so the from path the root,for this but findsstring no doesa child not leading exist from in the there, trie so and the pathit is for neither this string a word does notnor exist a prefix in the in trie this and trie. it is neither a word nor a prefix in this trie (containsWord and containsPrefix both return false). AllAll paths paths through through the trietrieeventually eventually lead lead to a to valid a valid node node (a LexiconNode (a node wherewhere isWordisWord hashas value value truetrue).). ThereforeTherefore determining determining whether whether a string a string is a prefix is a of prefix at least of one at least word inone the word trie is in simply the trie a matter is simply of verifying a matter that theof verifying path for the that prefix the exists. path for the prefix exists.

2.5Adding Adding words Words AddingAdding a newa new word word into the into trie the with trieaddWord is a matteris a matter of oftracing tracing out itsits path path starting starting from thefrom root, the as ifroot, searching. as if Ifsearching. any part of If the any path part does of not the exist, path the does missing not nodesexist, mustthe missing be added nodes to the trie. must Lastly, be added the isWord to the flagtrie. is Lastly, turned onthe for isWord the final node.flag is In turned some situations, on for addingthe final a new node. word willIn some necessitate situations, adding a adding new node a fornew each word letter, will for example,necessitate adding adding the word a newdot nodeto our for sample each trie letter, will for add example, three new nodes,adding one the for word each dot letter. to On our the sample other hand, trie addingwill add the three word newsnew nodes,would only one require for each adding letter. an sOnchild the to other the end hand, of existing adding path the for word new. Addingnews would the word onlydo afterrequiredot hasadding been addedan s doesn’tchild to require the end any newof existing nodes at all;path it justfor setsnew the. AddingisWord flagthe onword an existing do after node dot to true.has Herebeen is added the sample doesn't trie afterrequire those any three new words nodes have at been all, added:just turning on the flag on an existing node. Here is the sample trie after those three words have been added:

Start

A D N Z

R S O E O E

E T W T N

S

A trie is an unusual data structure in that its performance can improve as it becomes more loaded. InsteadA trie of is anslowing unusual down data structure as its get in that full, its performanceit becomes can faster improve to add as it becomeswords when more loaded.they can Instead share of slowingcommon down prefix as its nodes get full, with it becomeswords already faster to in add the words trie. when they can share common prefix nodes with words already in the trie.

2.6Removing Removing words Words The first step to removing a word is tracing out its path and turning off the isWord flag on the final Thenode. first However, step to removing your work a word is with not removeWordyet done becauseis tracing you outneed its pathto remove and turning any offpart the ofisWord the wordflag that on the is final node. However, your work is not yet done because you need to remove any part of the word that is now a dead 3 end. (This last part is left as an optional extra credit extension, described in the next paragraph. You only need4 tonow update a dead the isWord end. Allflag paths for full in credit.)the trie Allmust paths eventually in the trie lead must to eventually a word. leadIf the to word a word. being If the removed word being was removedthe only was valid the only word valid along word this along path, this path,the nodes the nodes along along that that path path must be be deleted deleted from from the triethe along trie along with the word. For example, if you removed the words zen and not from the trie shown previously, you should have the with the word. For example, if you removed the words zen and not from the trie shown previously, resultyou below.should have the result below.

Start

A D N

R S O E O

E T W

S

As a general observation, there should never be a leaf node whose isWord field is false. If a node hasOptional no children extension. and Deletingdoes not unneeded represent nodes a valid is pretty word tricky (i.e., because isWord of the is recursive false), nature then of this the node trie. Think is not aboutpart howof any we removedpath to a the valid last elementword in of the a SinglyLinkedList trie and such nodes should(Chapter be 9.4 deleted in Bailey when). We removing had to maintain a word. a pointerIn some to the cases, second removing to last element a word to update from thethepointers trie may appropriately. not require The removing same is true any in nodes. our trie. For example, if weAs were a general to remove observation, the thereword should new neverfrom bethe a leafabove node trie, whose it isWordturns offfield isWord is false. but If aall node nodes has no along children that andpath does are not still represent in use a for valid other word words. (i.e., isWord is false), then this node is not part of any path to a valid word in the trie and such nodes should be deleted when removing a word. In some cases, removing a word from the trie may notImportant require removing note: when any nodes. removing For example, a word iffrom we were the totrie, remove the only the word nodesnew thatfrom may the aboverequire trie, deallocation it turns off isWordare nodesbut allon nodesthe path along to that the path word are that still was in use removed. for other words. It would be extremely inefficient if you were to traverseImportant the note whole: when trie removing to check a word for deallocating from the trie, thenodes only nodesevery thattime may a requireword was removal removed, are nodes and on theyou pathshould to the not word use that such was removed.an inefficient It would strategy. be extremely inefficient to check additional nodes that are not on the path.

2.7Other Other trie Trieoperations Operations There are few remaining odds and ends to the trie implementation. Creating an iterator and writing Therethe words are a few to remaininga file both odds involve and ends a recursive to the trie implementation.exploration of all paths through the trie to find all of the contained1. You need words. to keep Remember track of the total that number in both of wordscases storedit is inonly the trie.words (not prefixes) that you want to operate on and that these operations need to access the words in alphabetical order. 2. You should add support for reading words from a file using a Scanner. You may find the Scanner handout Onceon you the course have webpage a working helpful. lexicon, you're ready to implement the snazzy spelling correction features.3. Creating There an iterator are two to traverseadditional the trieLexicon involves member a recursive functions,exploration one of for all suggesting paths through simple the trie corrections to find all and ofthe the second contained for words.regular Remember expressions that matching: it is only words (not prefixes) that you want to operate on and that the iterator needs to access the words in alphabetical order. You may find the approach that we used in the Set *SuggestCorrections(string target, int maxDistance); ReverseIterator.javaSetexample *MatchRegex(string from class helpful (e.g., pattern); creating a data structure that contains the elements in the desired order and returning an iterator for that data strurcture). Suggesting corrections First consider the member function SuggestCorrections. Given a (potentially misspelled) target string and a maximum distance, this function gathers the set of words from the lexicon that have a distance to the target string than or equal to the given maxDistance. We define the distance between two equal-length strings to be the total number of positions in which the strings differ. For example, "place" and "peace" have distance 1, "place" and "plank" have distance 2. The returned set contains all words in the lexicon that are the same length as the target string and are within the maximum distance.

4 3 Optional Extensions

Once you have a working lexicon and iterator, you’re ready to implement snazzy spelling correction features. There are two additional Lexicon member functions, one for suggesting simple corrections, and a second for regular expressions matching: Set suggestCorrections(String target, int maxDistance); Set matchRegex(String pattern);

Sets are basically just fancy Vectors that do not allow duplicates. Check out the javadocs on Sets and SetVectors in structure5 for more information.

3.1 Suggesting Corrections First consider the member function suggestCorrections. Given a (potentially misspelled) target string and a maximum distance, this function gathers the set of words from the lexicon that have a distance to the target string less than or equal to the given maxDistance. We define the distance between two equal-length strings to be the total number of character positions in which the strings differ. For example, “place” and “peace” have distance 1, “place” and “plank” have distance 2. The returned set contains all words in the lexicon that are the same length as the target string and are within the maximum distance. For example, consider the original sample trie containing the words a, are, as, new, no, not, and zen. If we were to call suggestCorrections with the following target string and maximum distance, here are the suggested corrections:

Target string Max distance Suggested corrections ben 1 zen nat 2 new, not

For a more rigorous test, we also provide the word file ospd2.txt, which lists all of the words in the second edition of the Official Scrabble Player’s Dictionary. Here are a few examples of suggestCorrections run on a lexicon containing all the words in ospd2.txt:

Target string Max distance Suggested corrections crw 1 caw, cow, cry zqwp 2 gawp, yawp

Finding appropriate spelling corrections will require a recursive traversal through the trie gathering those “neigh- bors” that are close to the target path. You should not find suggestions by examining each word in the lexicon and seeing if it is close enough. Instead, think about how you can generate candidate suggestions by traversing the path of the target string taking small “detours” to the neighbors that are within the maximum distance.

3.2 Matching Regular Expressions The second optional extension is to use recursion to match regular expressions. The matchRegex method takes a regular expression as input and gathers the set of lexicon words that match that regular expression. If you have not encountered them before, a regular expression describes a string-matching pattern. Ordinary alphabetic letters within the pattern indicate where a candidate word must exactly match. The pattern may also contain “wildcard” characters, which specify how and where the candidate word is allowed to vary. For now, we will consider a subset of wildcard characters that have the following meanings:

• The ‘*’ matches any sequence of zero or more characters. • The ‘?’ wildcard character matches either zero or one character. 5 For example, consider the original sample trie containing the words a, are, as, new, no, not, and zen. Here are the matches for some sample regular expression:

Regular expression Matching words from lexicon a* a, are, as a? a, as *e* are, new, zen not not z*abc?*d *o? no, not

Finding the words that match a regular expression will require applying your finest recursive skills. You should not find suggestions by examining each word in the lexicon and seeing if it is a match. Instead, think about how to generate matches by traversing the path of the pattern. For non-wildcard characters, it proceeds just as for traversing ordinary words. On wildcard characters, “fan out” the search to include all possibilities for that wildcard.

4 Suggestions

• Lexicon operations are case-insensitive. Searching for words, suggesting corrections, matching regular expres- sions, and other operations should have the same behavior for both for upper and lowercase inputs. Be sure to take that into consideration when designing your data structure and algorithms.

• Build and test incrementally. Develop your trie one function at a time and continually test as you go. We have provided a handy client program that exercises the lexicon and allows you to drive the testing interactively from the console. It is supplied in source code (Main.java) and you are encouraged to modify and extend it as needed for your purposes. • Use stubs as placeholders. The testing code we provide makes calls to all of the public member functions on the Lexicon. In order for the program to compile, you must have implementations for all the functions. However, this doesn’t suggest that you need to write all the code first and then attempt to debug it all at once. You can implement methods with placeholder “stubs” to start. If your lexicon doesn’t yet remove words, implement the remove operation to ignore its argument or raise an error. Before you have implemented regular expression match, just return an empty set from the function and so on. • Test on smaller data first. There are small.txt and small2.txt data files with just a few words that are especially helpful for testing in the early stages. You can also create word data files of your own that test specific trie configurations. The ospd2.txt word file is very large. It will be useful for stress-testing once you have the basics in place, but it can be overwhelming to try to debug using that version.

5 Thought Questions

Suppose we use an OrderedVector instead of a trie for storing our Lexicon. Discuss how the process of searching for suggested spelling corrections would differ from our trie-based implementation. Which is more efficient? Why?

6 Completing the Lab

We provide basic starter code for this assignment. To obtain it, execute the following command to copy the lexicon lab folder to your own directory: cp -r /opt/mac-cs-local/share/cs136/labs/lexicon .

The lexicon directory contains the following files:

6 • Lexicon.java: The interface that you need to implement • Main.java: Sample code that you can use to test your LexiconTrie • LexiconTrie.java: skeleton for a LexiconTrie implementation • LexiconNode.java: skeleton for the class that represents Trie nodes • small.txt, small2.txt, ospd2.txt: Data files containing sets of words for you to test with

6.1 Deliverables Create lab directory in the normal way (-lab8/) and submit it to the drop-off folder before the due date. Be sure to include: • Your well-documented source code for all Java files that you write/modify. • Your answer to the thought question in a file named README.txt. As in all labs, you will be graded on design, documentation, style, and correctness. Be sure to document your program with appropriate comments, including a general description at the top of the file, a description of each method with pre- and post-conditions where appropriate. Also use comments and descriptive variable names to clarify sections of the code which may not be clear to someone trying to understand it. Have fun!

7