NLP - Assignment 2

Week 2, December 27th, 2016

1. A 5-gram model is a ______ order Markov model:
   (a) Six
   (b) Five
   (c) Four
   (d) Constant
   Ans: (c) Four

2. For the following corpus C1 of 3 sentences, what is the total count of unique bigrams for which the likelihood will be estimated? Assume we do not perform any pre-processing, and we are using the corpus as given.
   (i) ice cream tastes better than any other food
   (ii) ice cream is generally served after the meal
   (iii) many of us have happy childhood memories linked to ice cream
   (a) 22
   (b) 27
   (c) 30
   (d) 34
   Ans: (b) 27

3. Arrange the words "curry, oil and tea" in descending order, based on the frequency of their occurrence in the Google Books n-grams. The Google Books n-gram viewer is available at https://books.google.com/ngrams:
   (a) tea, oil, curry
   (b) curry, oil, tea
   (c) curry, tea, oil
   (d) oil, tea, curry
   Ans: (d) oil, tea, curry

4. Given a corpus C2, the Maximum Likelihood Estimate (MLE) for the bigram "ice cream" is 0.4 and the count of occurrences of the word "ice" is 310. The likelihood of "ice cream" after applying add-one smoothing is 0.025, for the same corpus C2. What is the vocabulary size of C2?
   (a) 4390
   (b) 4690
   (c) 5270
   (d) 5550
   Ans: (b) 4690

Questions 5 to 10 require you to analyse the data given in the corpus C3, using a programming language of your choice. The data and the code snippets (in Python) can be obtained from the URL: https://github.com/krishnamrith12/NLPMOOC/raw/master/w2a.zip

5. For the string 'ceating', identify which of the following sets of strings have a Levenshtein distance of 1.
   (a) eating, hating, beating, melting
   (b) cheating, feasting, eating, healing
   (c) cheating, beating, eating, creating
   (d) None of these
   Ans: (c) cheating, beating, eating, creating

6. Assume that we modify the costs incurred for operations in calculating Levenshtein distance, such that the insertion and deletion operations incur a cost of 1 each, while substitution incurs a cost of 2. Now, for the string 'ceating', which of the following sets of strings will have an edit distance of 1?
   (a) cheating, beating, eating
   (b) cheating, eating, creating
   (c) cheating, eating, casting
   (d) None of these
   Ans: (b) cheating, eating, creating

7. Suppose a user wants to check whether the string 'hareesh' occurs in the corpus, but a search for the string yields no result. Use the Levenshtein distance (original version) to find the closest matches. Report the count of words with the minimum Levenshtein distance and also the minimum distance value.
   (a) distance:2, entries:3
   (b) distance:1, entries:5
   (c) distance:1, entries:4
   (d) distance:3, entries:12
   Ans: (a) distance:2, entries:3

8. Jaro-Winkler distance turns out to be more effective when it comes to handling the spelling variations that occur in names. The description for calculating Jaro-Winkler distance can be found at the URL: http://wikipedia.oerg/jaro-wingler. Report the highest distance score (precision of 3 decimal places) that can be obtained from the corpus C3 for the string 'hareesh' as per the Jaro-Winkler distance. Consider the scaling factor to be 0.1.
   (a) 0.865
   (b) 0.894
   (c) 0.965
   (d) 0.942
   Ans: (b) 0.894

9. Which of the following bigrams is most likely to occur as per the corpus C3?
   (a) iron hand
   (b) iron chain
   (c) iron safe
   (d) iron bars
   Ans: (c) iron safe

10. Which of the following sentence fragments is most likely to occur as per the corpus C3? Assume that you are using a bigram language model with add-one smoothing.
   (a) chandranath babu asked for betel leaves
   (b) sandip babu sang bande mataram
   (c) poor bimala went to the dressing room
   (d) all men are equal. some men are more equal
   Ans: (b) sandip babu sang bande mataram
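Worked check for Question 2. The sketch below counts the distinct bigrams in corpus C1 using plain whitespace tokenisation. The assignment does not say whether sentence boundaries are counted, so the boundary markers "<s>" and "</s>" and the helper name unique_bigrams are assumptions made for illustration. Under that assumption the count matches the stated answer of 27; without boundary markers the same corpus yields 22.

```python
from typing import Iterable, Set, Tuple


def unique_bigrams(sentences: Iterable[str], pad: bool = True) -> Set[Tuple[str, str]]:
    """Collect the distinct bigrams of a corpus.

    pad=True wraps each sentence in <s> ... </s> (an assumption, not
    something the assignment states explicitly).
    """
    seen: Set[Tuple[str, str]] = set()
    for sentence in sentences:
        tokens = sentence.split()  # whitespace split only, no other pre-processing
        if pad:
            tokens = ["<s>"] + tokens + ["</s>"]
        seen.update(zip(tokens, tokens[1:]))
    return seen


C1 = [
    "ice cream tastes better than any other food",
    "ice cream is generally served after the meal",
    "many of us have happy childhood memories linked to ice cream",
]

print(len(unique_bigrams(C1)))             # 27 with boundary markers
print(len(unique_bigrams(C1, pad=False)))  # 22 without them
```

The difference of five comes from the boundary bigrams: three sentence endings plus two distinct sentence openings ("<s> ice" is shared by the first two sentences).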
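Worked check for Question 4. The MLE gives count(ice cream) = 0.4 x 310 = 124, and add-one smoothing gives (124 + 1) / (310 + V) = 0.025, so V can be solved for directly. The function name vocab_size_from_add_one is hypothetical; it only restates this arithmetic.

```python
def vocab_size_from_add_one(mle: float, unigram_count: int, smoothed: float) -> float:
    """Solve (c_bigram + 1) / (c_unigram + V) = smoothed for V."""
    bigram_count = mle * unigram_count  # c(ice cream) = 0.4 * 310 = 124
    return (bigram_count + 1) / smoothed - unigram_count


print(round(vocab_size_from_add_one(0.4, 310, 0.025)))  # 4690
```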
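Worked check for Questions 5 to 7. The following is a minimal dynamic-programming sketch of edit distance with configurable operation costs; it is not the code shipped in the course archive, and the names edit_distance and closest_matches are assumptions. With all costs set to 1 it is the standard Levenshtein distance, and with sub_cost=2 it is the variant described in Question 6.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Weighted edit distance, computed row by row."""
    # prev[j] = cost of turning the empty prefix of source into target[:j]
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        curr = [i * del_cost]  # turning source[:i] into the empty string
        for j, t_ch in enumerate(target, start=1):
            substitute = prev[j - 1] + (0 if s_ch == t_ch else sub_cost)
            delete = prev[j] + del_cost
            insert = curr[j - 1] + ins_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def closest_matches(query, vocabulary):
    """Return (minimum distance, sorted words at that distance)."""
    distances = {word: edit_distance(query, word) for word in set(vocabulary)}
    best = min(distances.values())
    return best, sorted(w for w, d in distances.items() if d == best)


candidates = ["cheating", "beating", "eating", "creating"]
print([edit_distance("ceating", w) for w in candidates])              # [1, 1, 1, 1]
print([edit_distance("ceating", w, sub_cost=2) for w in candidates])  # [1, 2, 1, 1]
```

Only 'beating' changes under the modified costs, because it is the one candidate reached by a substitution rather than an insertion or deletion. For Question 7, closest_matches would be called with 'hareesh' and the set of word types read from corpus C3.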
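Worked sketch for Question 8. The code below follows the commonly published definition of the Jaro-Winkler measure: a Jaro score based on matching characters within a window and on transpositions, boosted by a shared prefix of up to four characters. It returns a similarity in which 1.0 means identical strings, which appears to be what the question calls the "distance score". The function names and the max_prefix parameter are assumptions, and the corpus C3 is not reproduced here, so the example uses a standard textbook pair instead.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings (1.0 = identical)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters count as matching only if they are at most `window` apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order, counted in pairs.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```

To answer the question itself, one would evaluate jaro_winkler('hareesh', w) for every word type w in C3 and report the maximum rounded to three decimal places.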
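Worked sketch for Questions 9 and 10. This is a minimal bigram language model with add-one smoothing, trained here on the three sentences of corpus C1 from Question 2 because C3 is not reproduced in this document. The class name, the "<s>" and "</s>" boundary markers, and the choice to count those markers in the vocabulary size are all assumptions; the course's own snippets may make different choices.

```python
import math
from collections import Counter


class AddOneBigramModel:
    """Bigram model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        # Assumption: the boundary markers are part of the vocabulary.
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        """P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)."""
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    def log_prob(self, fragment):
        """Sum of log bigram probabilities over a fragment (no padding)."""
        tokens = fragment.split()
        return sum(math.log(self.prob(a, b)) for a, b in zip(tokens, tokens[1:]))


model = AddOneBigramModel([
    "ice cream tastes better than any other food",
    "ice cream is generally served after the meal",
    "many of us have happy childhood memories linked to ice cream",
])

print(model.vocab_size)                                       # 25
print(model.prob("ice", "cream") > model.prob("ice", "food"))  # True
```

For Question 9, the candidate with the largest prob('iron', w) is the most likely bigram; for Question 10, the fragment with the largest log_prob is the most likely one. Summing logs rather than multiplying probabilities avoids numerical underflow on longer fragments.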
