Super Lexicon! 1 Prelab 2 Lab Program

CSCI 136: Fall 2017 Lab 8 Handout L8 Due 5 November 29 October Super Lexicon! 1 Prelab Before lab, read through this entire handout carefully. Construct a design document for the LexiconNode class as described in Section 2.3. We will briefly go over this design together at the beginning of lab on Wednesday, and we will collect your design documents. Please bring a physical copy or a PDF document. 2 Lab Program Virtually all modern word processors contain a feature to check the spelling of words in documents. More advanced word processors also provide suggested corrections for misspelled words. For your next lab, you will be taking on the fun and challenging task of implementing such a spelling corrector. You are to implement a highly-efficient Lexicon class and go on to augment it with additional functionality to find spelling corrections for misspelled words. The assignment has several purposes: • To gain further skill with recursion in managing a recursive tree structure and exploring it using sophisticated algorithms; • To explore the notion of a layered abstraction where one class (Lexicon) makes use of other classes (e.g., Iterator, Vector, Set) as part of its internal representation. 2.1 The Lexicon Interface Your first task is to implement a basic Lexicon class. You may recall this class from the recursion lab (Lab 3), where an optional extension used Lexicon.class to provide word completions for T-9 word cell-phones (if you want to see such a cell phone “in the wild”, Bill J has one). Now, you’ll get a chance to build a lexicon yourself to see the kind of specialized and highly-efficient data structures that are used in its implementation. This version has similar functionality to the Lexicon from the Lab 3 extension, with additional operations to remove words, iterate over words, and read words from a text file. The following methods are in the Lexicon interface. You should implement this interface in a file called LexiconTrie.java. public interface Lexicon { boolean addWord(String word); boolean removeWord(String word); boolean containsWord(String word); boolean containsPrefix(String prefix); int addWordsFromFile(String filename); int numWords(); Iterator<String> iterator(); Set<String> suggestCorrections(String target, int maxDistance); Set<String> matchRegex(String pattern); } Most of the method behaviors can be inferred from their names and return types. For more information about the usage and purpose of these methods, refer to the comments in the starter code. 1 2 Implementing the Lexicon as a trie 2.2There Implementing are several different the Lexicon data structures as a Trie you could use to implement a lexicon— a sorted array, a Therelinked are list, several a binary different search data structures tree, a hashtable, you could use and to many implement others. a lexicon—a Each of sorted these array, offers a linkedtradeoffs list, abetween binary searchthe speed tree, and of manyword others. and prefix Each of lookup, these offers amount tradeoffs of betweenmemory the required speed of wordto store and prefixthe data lookup, structure, amount ofthe memoryease of required writing to and store debugging the data structure, the code, the ease performance of writing and of add/remove, debugging the code,and so performance on. The implementation of add/remove, andwe so will on. Theuse implementationis a special kind we willof tree use iscalled a special a trie kind (pronounced of tree called "try"), a trie (pronounced designed for “try”), just designed this purpose. for just this purpose. AA trie trie is is a a letter-tree that that efficiently efficiently stores stores strings. strings. A node A in node a trie representsin a trie represents a letter. A path a letter. through A thepath trie through traces outthe a sequencetrie traces of lettersout a thatsequence represent of aletters prefix orthat word represent in the lexicon. a prefix or word in the lexicon. Instead of just two children as in a binary tree, each trie node has potentially 26 child pointers (one for each letter ofInstead the alphabet). of just Whereas two children searching as ain binary a binary search tree, tree each eliminates trie node half thehas words potentially with a left26 orchild right pointers turn (we will(one eventuallyfor each discuss letter binaryof the searchalphabet). trees in Whereas class), a search searching in a trie a followsbinary thesearch child tree pointer eliminates for the next half letter, the whichwords narrowswith a the left search or right to just turn, words a search starting in with a trie that follows letter. For the example, child pointer from the for root, the any next words letter, that which begin withnarrows “n” canthe be search found byto followingjust words the pointerstarting to with the “n” that child letter. node. For From example, there, following from the “o” root, leads toany just words those wordsthat begin that beginwith with n can “no” be and found so on, by recursively. following If the two pointer words have to the the samen child prefix, node. they From share thatthere, initial following part of their o leads paths. to Thisjust saves those space words since that there begin are typically with no many and sharedso on prefixesrecursively. among words.If twoEach words node have has athe boolean sameisWord prefix, flagthey whichshare indicates that initial that thepart path of takentheir frompaths. the This root to saves this node space represents since there a complete are typically word. Here’s many a conceptual shared prefixes picture ofamong a small trie:words. Each node has a boolean isWord flag which indicates that the path taken from the root to this node represents a word. Here's a conceptual picture of a small trie: Start A N Z R S E O E E W T N The thick border around a node indicates its isWord flag is true. This trie contains the words: a, are,The thickas, bordernew, aroundno, not, a node and indicates zen. that Strings its isWord suchflag as isze true. or Thisar are trie containsnot valid the words words: fora, this are, trie as,because new, the no, path not, forand thosezen strings. Strings ends such at as aze nodeor ar whereare not isWord valid words is false. for this Any trie becausepath not the drawn path for is thoseassumed strings to ends not at exist, a node so where stringsisWord such isas false. cat Anyor next path notare drawnnot valid is assumed because to not there exist, is sono strings such suchpath as in catthisor trie.astronaut are not valid because there is no such path in this trie. Like other trees, a trie is a recursive data structure. All of the children of a given trie node are themselves smaller tries.Like You other will betrees, making a trie good is usea recursive of your recursion data structure. skills when operatingAll of the on thechildren trie! of a given trie node are themselves smaller tries. You will be making good use of your recursion skills when operating on 2.3the trie! Managing Node Children ForManaging each node innode the trie, children you need a list of pointers to children nodes. In the sample trie drawn above, the root node hasFor three each children, node onein the each trie, for the you letters need A, a N, list and of Z. pointers One possibility to children for storing nodes. the children In the pointers sample is a trie statically- drawn sizedabove, 26-member the root array node of pointershas three to children, nodes, where onearray[0] each for isthe the letters child forA, A,N,array[1] and Z. Onerefers possibility to B, ... and for array[25]storing therefers children to Z. Whenpointers there is is a nostatically-sized child for a given 26-member letter, (such as array from Zof to pointers X) the array to entrynodes, would where be nullarray[0]. This is arrangement the child makesfor A, it array[1] trivial to findrefers the to child B, for... and a given array[25] letter—you refers simply to Z. access When the correctthere is element no child in thefor array a given by letter letter, index. (such as from Z to X) the array entry would be NULL. This arrangement makes it trivialHowever, to find for the most child nodes for within a given the trie,letter, very you few simply of the 26access pointers the arecorrect needed, element so using in athe largely arraynull by letter26- memberindex. array However, is much for too most expensive nodes (i.e., within it wastes the a trie, lot ofvery space). few Better of the alternatives 26 pointers would are be needed, a dynamically-sized so using a arraylargely which NULL can grow 26-member and shrink array as needed is much (i.e., too a Vector), expensive. or a linked Better list alternatives of children pointers. would be We a leavedynamically- the final choicesized of array a space-efficient which can design grow upand to you,shrink but asyou needed, should justifya linked the list choice of children you make pointers, (briefly) inor yourleveraging program the comments. Two things you may want to consider: standard classes in our toolkit, such as a VectorList or Set, to store the children pointers. We leave the final1. there choice are at of most a space-efficient 26 children, so even design a O(N) up to you,operation but you to find should a particular justify child the is choice O(1), and you make in your2. trie program operations comments. such as iteration Two requirethings traversing you may the want words to in consider: alphabetical there order, are so keepingat most the 26 list children, of children so evenpointers a O(N) sorted operation by letter to will find be advantageous.a particular child is no big deal, and operations such as writing the 2 3 words to a file need to access the words in alphabetical order, so keeping the list of children pointersBegin sorted implementing by letter your will trie be by advantageous.

Load more