
NLP - Assignment 2

Week 2 December 27th, 2016

1. A 5-gram model is a ______ order Markov Model:

(a) Six (b) Five (c) Four (d) Constant

Ans : c) Four

2. For the following corpus C1 of 3 sentences, what is the total count of unique bi-grams for which the likelihood will be estimated? Assume we do not perform any pre-processing, and we are using the corpus as given.

(i) ice cream tastes better than any other food
(ii) ice cream is generally served after the meal
(iii) many of us have happy childhood memories linked to ice cream

(a) 22 (b) 27 (c) 30 (d) 34

Ans : b) 27
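The question does not say whether sentence boundaries count as tokens, so the sketch below counts unique bi-grams under two readings. Counting only word pairs inside each sentence gives 22, while padding each sentence with boundary markers gives 27, which matches the listed answer. The padding and the marker names `<s>` and `</s>` are assumptions made here for illustration; the source does not state how 27 was reached.

```python
def unique_bigrams(sentences, pad=False):
    """Return the set of distinct bi-grams over whitespace-tokenised sentences."""
    seen = set()
    for sentence in sentences:
        tokens = sentence.split()
        if pad:
            # Assumption: "<s>" and "</s>" are illustrative boundary markers.
            tokens = ["<s>"] + tokens + ["</s>"]
        seen.update(zip(tokens, tokens[1:]))
    return seen


C1 = [
    "ice cream tastes better than any other food",
    "ice cream is generally served after the meal",
    "many of us have happy childhood memories linked to ice cream",
]

print(len(unique_bigrams(C1)))            # 22
print(len(unique_bigrams(C1, pad=True)))  # 27
```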

3. Arrange the words “curry”, “oil” and “tea” in descending order of their frequency of occurrence in the Google Books n-grams. The Google Books n-gram viewer is available at https://books.google.com/ngrams:

(a) tea, oil, curry (b) curry, oil, tea (c) curry, tea, oil (d) oil, tea, curry

Ans: d) oil, tea, curry

4. Given a corpus C2, the Maximum Likelihood Estimate (MLE) for the bi-gram “ice cream” is 0.4 and the count of occurrences of “ice” is 310. After applying add-one smoothing, the likelihood of “ice cream” is 0.025 for the same corpus C2. What is the vocabulary size of C2?

(a) 4390 (b) 4690 (c) 5270 (d) 5550
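The two estimates in question 4 determine the vocabulary size once they are written out. The sketch below assumes that the "MLE for ice cream" means the conditional estimate P(cream | ice) = count(ice cream) / count(ice), which is the reading consistent with being given the count of "ice". The variable names are illustrative.

```python
from fractions import Fraction

mle = Fraction("0.4")         # count(ice cream) / count(ice)
count_ice = 310
smoothed = Fraction("0.025")  # (count(ice cream) + 1) / (count(ice) + V)

# Step 1: recover the bi-gram count from the unsmoothed estimate.
count_ice_cream = mle * count_ice

# Step 2: solve the add-one formula for the vocabulary size V.
vocab_size = (count_ice_cream + 1) / smoothed - count_ice

print(count_ice_cream, vocab_size)  # 124 4690
```

Exact fractions are used instead of floats so that the division by 0.025 does not introduce rounding noise.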

Ans: b) 4690

Questions 5 to 10 require you to analyse the data given in the corpus C3, using a programming language of your choice. The data and the code snippets (in Python) can be obtained from the URL: https://github.com/krishnamrith12/NLPMOOC/raw/master/w2a.zip

5. For the string ‘ceating’, identify the set in which every string has a Levenshtein distance of 1.

(a) eating, hating, beating, melting (b) cheating, feasting, eating, healing (c) cheating, beating, eating, creating (d) None of these

Ans: c) cheating, beating, eating, creating

6. Assume that we modify the costs incurred for operations in calculating Levenshtein distance, such that the insertion and deletion operations incur a cost of 1 each, while substitution incurs a cost of 2. Now, for the string ‘ceating’, which of the following sets of strings will have an edit distance of 1?

(a) cheating, beating, eating (b) cheating, eating, creating (c) cheating, eating, casting (d) None of these

Ans: b) cheating, eating, creating
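Questions 5 and 6 differ only in the substitution cost, so both can be checked with one dynamic-programming routine that takes the operation costs as parameters. This is a minimal sketch; the function name `edit_distance` and its keyword arguments are chosen here and are not taken from the course code.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Dynamic-programming edit distance with configurable operation costs."""
    # previous[j] is the cost of turning the source prefix processed so far
    # into target[:j]; it starts as the row for the empty source prefix.
    previous = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * del_cost]
        for j, t_char in enumerate(target, start=1):
            substitute = previous[j - 1] + (0 if s_char == t_char else sub_cost)
            delete = previous[j] + del_cost
            insert = current[j - 1] + ins_cost
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


candidates = ["cheating", "beating", "eating", "creating", "casting"]
for word in candidates:
    unit = edit_distance("ceating", word)
    weighted = edit_distance("ceating", word, sub_cost=2)
    print(word, unit, weighted)
```

With unit costs, ‘cheating’, ‘beating’, ‘eating’ and ‘creating’ are each one edit away. Raising the substitution cost to 2 moves ‘beating’ to distance 2, because a substitution of ‘c’ by ‘b’ is its only single-step route.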

7. A user wants to check whether the string ‘hareesh’ is in the corpus, but a search for the string yields no result. Use the Levenshtein distance (original version) to find the closest matches. Report the minimum distance value and the count of words at that distance.

(a) distance:2, entries:3 (b) distance:1, entries:5 (c) distance:1, entries:4 (d) distance:3, entries:12

Ans: a) distance:2, entries:3
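The search in question 7 amounts to computing the distance from the query to every distinct word and keeping the words that tie for the minimum. The sketch below shows that procedure on a small hypothetical word list; it does not read C3, so its output says nothing about the answer above. The helper names are illustrative.

```python
from functools import lru_cache


def levenshtein(a, b):
    """Unit-cost Levenshtein distance via memoised recursion."""
    @lru_cache(maxsize=None)
    def dist(i, j):
        if i == 0 or j == 0:
            return i + j
        return min(
            dist(i - 1, j) + 1,                           # deletion
            dist(i, j - 1) + 1,                           # insertion
            dist(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # substitution or match
        )
    return dist(len(a), len(b))


def closest_matches(query, vocabulary):
    """Return (minimum distance, sorted words at that distance)."""
    distances = {word: levenshtein(query, word) for word in set(vocabulary)}
    best = min(distances.values())
    return best, sorted(w for w, d in distances.items() if d == best)


# Hypothetical word list for illustration; it is NOT the vocabulary of C3.
toy_vocabulary = ["harish", "hare", "harsh", "mahesh", "three"]
print(closest_matches("hareesh", toy_vocabulary))
```

Deduplicating with `set` matters when counting entries, since a word that occurs many times in the corpus should be reported once.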

8. Jaro-Winkler distance turns out to be more effective when it comes to handling the spelling variations that occur in names. The description for calculating Jaro-Winkler distance can be found at the URL: http://wikipedia.oerg/jaro-wingler. Report the highest score (to a precision of 3 decimal places) that can be obtained from the corpus C3 for the string ‘hareesh’ as per the Jaro-Winkler distance. Consider the scaling factor to be 0.1.

(a) 0.865 (b) 0.894 (c) 0.965 (d) 0.942

Ans: b) 0.894
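For reference, the Jaro-Winkler score can be sketched as follows, using the standard definition: the Jaro similarity is boosted by the length of the common prefix (capped at 4 characters) times the scaling factor. The function names are chosen here, and the two test pairs are widely used textbook examples rather than words from C3. The value computed is a similarity, so 1.0 means identical strings and the "highest score" is the closest match.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
print(round(jaro_winkler("DWAYNE", "DUANE"), 3))   # 0.84
```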

9. Which of the following bi-grams is most likely to occur as per the corpus C3:

(a) iron hand (b) iron chain (c) iron safe (d) iron bars

Ans: c) iron safe
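All four candidates in question 9 begin with ‘iron’, so their maximum likelihood estimates share the denominator count(iron), and ranking them reduces to comparing raw bi-gram counts. A minimal sketch of that comparison follows; the token sequence is hypothetical and is not drawn from C3, and the function names are illustrative.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rank_candidates(tokens, candidates):
    """Order candidate bi-grams from most to least frequent."""
    counts = bigram_counts(tokens)
    return sorted(candidates, key=lambda bigram: counts[bigram], reverse=True)


# Hypothetical text for illustration; it is NOT corpus C3.
toy_tokens = (
    "the iron safe stood near the iron bars and the iron safe was locked"
).split()
candidates = [("iron", "hand"), ("iron", "chain"), ("iron", "safe"), ("iron", "bars")]

print(rank_candidates(toy_tokens, candidates)[0])  # ('iron', 'safe')
```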

10. Which of the following sentence fragments is most likely to occur as per the corpus C3? Assume that you are using a bi-gram model with add-one smoothing.

(a) chandranath babu asked for betel leaves (b) sandip babu sang bande mataram (c) poor bimala went to the dressing room (d) all men are equal. some men are more equal

Ans : b) sandip babu sang bande mataram
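Scoring a fragment with an add-one smoothed bi-gram model means multiplying (count(prev, word) + 1) / (count(prev) + V) over its adjacent word pairs, where V is the vocabulary size. The sketch below sums logarithms instead to avoid underflow. It trains on a hypothetical three-sentence text rather than C3, and it assumes no start-of-sentence term is included, which the question does not specify.

```python
import math
from collections import Counter


def train(sentences):
    """Collect unigram and bi-gram counts from whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def add_one_log_prob(fragment, unigrams, bigrams):
    """Sum of log P(word | prev) with add-one (Laplace) smoothing."""
    vocab_size = len(unigrams)
    tokens = fragment.split()
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )


# Hypothetical training text for illustration; it is NOT corpus C3.
toy_corpus = ["sandip babu sang a song", "sandip babu went home", "bimala sang"]
unigrams, bigrams = train(toy_corpus)

seen = add_one_log_prob("sandip babu sang", unigrams, bigrams)
unseen = add_one_log_prob("bimala babu went", unigrams, bigrams)
print(seen > unseen)  # True
```

Every extra bi-gram multiplies in another factor below 1, so longer fragments tend to score lower than shorter ones under this model.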
