Dr. Eick COSC 6342 “Machine Learning” Homework 2&3 Spring 2014

Last updated: April 2, 9a
Deadline: Tu., April 15, 11p

11) K-Means and EM (Ungraded)
a) What is the meaning of b_it and h_it in the minimization process of K-means'/EM's objective function? What constraint holds with respect to h_it and b_it? How is b_it computed by EM?
Remark: b_it means h_it means
b) Why do K-means and EM employ an iterative optimization procedure rather than differentiating the objective function and setting its gradient to 0 to derive the optimal clustering? What is the disadvantage of using an iterative optimization procedure?
c) EM is called “a soft clustering algorithm”; what does this mean?
d) Summarize in natural language what computations EM performs during its M-step!
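As a hedged illustration for part a), the sketch below contrasts hard labels (b_it, k-means style) with soft memberships (h_it, EM E-step style) for one-dimensional data. The function names, the toy parameters, and the choice of univariate Gaussian components are assumptions made for illustration; they are not part of the assignment.

```python
import math

def hard_labels(x, means):
    """k-means style b_it: 1 for the nearest mean, 0 for all others."""
    dists = [abs(x - m) for m in means]
    nearest = dists.index(min(dists))
    return [1 if i == nearest else 0 for i in range(len(means))]

def soft_labels(x, means, variances, priors):
    """EM E-step style h_it: posterior probability of each component given x."""
    weighted = [
        p * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
        for m, v, p in zip(means, variances, priors)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical two-component model, used only to exercise the functions.
means, variances, priors = [0.0, 4.0], [1.0, 1.0], [0.5, 0.5]
b = hard_labels(1.0, means)
h = soft_labels(1.0, means, variances, priors)
print(b, h)
```

In both cases the labels of one example sum to 1 over the clusters; the hard labels put all of that mass on a single cluster, while the soft labels spread it.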

12) Comparing Results of K-Means and EM
Apply k-means and EM with k=5 to the Iris Flower Dataset; run each algorithm twice; in the case of EM, additionally explore the impact of alternative input parameters; compare the results; assess the differences in the results, if there are any. Based on this experiment, assess the strengths and weaknesses of each algorithm.

13) Non-Parametric Density Estimation 1 (Ungraded)
Assume we have a one-dimensional dataset containing the values {2, 3, 7, 8, 9, 12}.
i. Assume h=2 for all questions (formula 8.2); compute p(x) using equation 8.2 for x=6.5 and x=10.
ii. Now compute the same densities using Silverman’s naïve estimator (formula 8.4)!
iii. Now assume we use a Gaussian Kernel Estimator (equation 8.7); give a verbal description and a formula for how this estimator measures the density for x=10.
iv. Compare the 3 density estimation approaches; what are the main differences and advantages of each approach?
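A minimal sketch of two of the estimators named above, assuming the naive estimator counts the samples strictly inside a window of total width h centred at x and divides by N·h, and assuming the Gaussian kernel estimator averages standard-normal bumps scaled by h. Textbook editions differ on the window convention, so check these assumptions against formulas 8.4 and 8.7 before relying on the numbers; the function names are hypothetical.

```python
import math

DATA = [2, 3, 7, 8, 9, 12]  # dataset from the problem statement

def naive_estimate(x, data, h):
    """Fraction of samples strictly inside a width-h window at x, per unit length."""
    inside = sum(1 for xt in data if abs(x - xt) < h / 2)
    return inside / (len(data) * h)

def gaussian_kernel_estimate(x, data, h):
    """Average of Gaussian bumps of bandwidth h centred on each sample."""
    def kernel(u):
        return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sum(kernel((x - xt) / h) for xt in data) / (len(data) * h)

for x in (6.5, 10):
    print(x, naive_estimate(x, DATA, 2), gaussian_kernel_estimate(x, DATA, 2))
```

Under these assumptions the naive estimate depends only on which samples fall inside the window, whereas every sample contributes to the Gaussian estimate with a weight that decays with distance.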

14) Non-parametric Density Estimation 2
a) Assume a dataset X={x^t, r^t} consisting of 4 examples (0,1), (1,3), (2,7), (4,1) is given and the bin-width is 2.5; assume that x and x’ belong to the same bin if |x-x’| ≤ 2.5.
a1) Compute the values (also give the formula) of the regressogram for inputs 0.5, 1.8, and 4.4 for the mean smoother (see formula 8.19 on page 175 of the textbook).
ĝ(0.5)=
ĝ(1.8)=
ĝ(4.4)=
Now assume the bin-width is only 1. Recompute the prediction for input 1.8!
ĝ(1.8)=

a2) In general, the function obtained using the above approach has discontinuities. What could be done to obtain a continuous function?
b) What is the main difference between the Gaussian Kernel Density function approach as described in Section 8.2.2 of the textbook and the k-nearest Neighbor Density Estimator that has been described in Section 8.2.3?
c) What advantages do you see in using a non-parametric density estimation approach compared to parametric density approaches, such as using multivariate Gaussians?
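The mean smoother in part a1) averages the responses r^t of all training points whose input falls into the same bin as the query. A minimal sketch, assuming the membership test |x − x^t| ≤ h; the function name `mean_smoother` and the toy data are hypothetical and deliberately differ from the assignment's dataset.

```python
def mean_smoother(x, samples, h):
    """Average r over samples (xt, rt) with |x - xt| <= h; None if the bin is empty."""
    in_bin = [rt for xt, rt in samples if abs(x - xt) <= h]
    if not in_bin:
        return None
    return sum(in_bin) / len(in_bin)

# Hypothetical (x, r) pairs, used only to exercise the function.
toy = [(0.0, 2.0), (1.0, 4.0), (3.0, 6.0), (6.0, 0.0)]
print(mean_smoother(0.5, toy, 1.5))   # averages the two points near 0.5
print(mean_smoother(10.0, toy, 1.5))  # no point within the bin
```

The output of such a smoother changes abruptly whenever a training point enters or leaves the bin as x moves, which is the behaviour part a2) refers to.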

15) Computations in Belief Networks / D-separation [11]
Assume that the following Belief Network is given that consists of nodes A, B, C, D, and E that can take values of true and false.
a) Using the given probabilities of the probability tables of the above belief network (D|C,E; C|A,B; A; B; E), give a formula to compute P(D|A). Justify all nontrivial steps you used to obtain the formula!
b) Using the given probabilities of the probability tables of the above belief network (D|C,E; C|A,B; A; B; E), give a formula to compute P(E|A,B). Justify all nontrivial steps you used to obtain the formula!
c) Are C and E independent; is C|∅ d-separable from E|∅? Give a reason for your answer! (∅ denotes “no evidence given”.)
d) Is E|CD d-separable from A|CD? Give a reason for your answer!

16) Using Hidden Markov Model Tools
Assume the following Hidden Markov Model (HMM) is given:

a) What is the probability of the following 3 DNA sequences?
i. CTCTGTTTT
ii. CGGGGAGTT
iii. CACTCTCGG
b) What is the most likely state path for each of the above 3 sequences? Interpret the answers you obtained; do they make sense?
Remark: using any HMM tool to obtain answers to these questions is fine!
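For readers without an HMM tool at hand, the two quantities asked for can be computed with the forward algorithm (part a) and the Viterbi algorithm (part b). The sketch below is a plain-Python illustration. The two states "H" and "L" and all start, transition, and emission probabilities are hypothetical placeholders, because the model's diagram is not reproduced in this text; substitute the values of the given HMM before interpreting any output.

```python
# Hypothetical two-state model; replace with the HMM from the assignment.
STATES = ("H", "L")
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
EMIT = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def forward(seq):
    """Probability of seq, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for sym in seq[1:]:
        alpha = {
            s: EMIT[s][sym] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())

def viterbi(seq):
    """Most likely state path and its joint probability with seq."""
    best = {s: (START[s] * EMIT[s][seq[0]], [s]) for s in STATES}
    for sym in seq[1:]:
        step = {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][sym], path + [s])
        best = step
    prob, path = max(best.values(), key=lambda item: item[0])
    return "".join(path), prob

for seq in ("CTCTGTTTT", "CGGGGAGTT", "CACTCTCGG"):
    print(seq, forward(seq), viterbi(seq))
```

For long sequences the products underflow; a common remedy is to work with log-probabilities, which this sketch omits for brevity.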

17) Support Vector Machines
a) Why do most support vector machine approaches usually map examples to a higher dimensional space?
b) The support vector regression approach minimizes the objective function given below. Give a verbal description of what this objective function minimizes! What purpose does $\epsilon$ serve? What purpose does C serve?

$$\min \frac{1}{2}\|w\|^2 + C \sum_t \left(\xi_+^t + \xi_-^t\right)$$

subject to

$$r^t - (w^T x^t + w_0) \le \epsilon + \xi_+^t$$
$$(w^T x^t + w_0) - r^t \le \epsilon + \xi_-^t$$
$$\xi_+^t, \xi_-^t \ge 0$$

c) Assume you apply support vector regression to a particular problem and for the obtained hyperplane $\xi_+^t$ and $\xi_-^t$ are all 0 for the n training examples (t=1,..,n); what does this mean?
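To make the role of the slack variables concrete, the sketch below evaluates the support vector regression objective for a fixed linear model, assuming the standard formulation with two non-negative slacks per example (one for predictions that are too low, one for predictions that are too high, each measured beyond a tube of half-width eps). The function names and the toy data are hypothetical, and the slacks are set to the smallest values that satisfy the constraints.

```python
def svr_slacks(w, w0, x, r, eps):
    """Smallest feasible (xi_plus, xi_minus) for one example under a linear model."""
    prediction = sum(wi * xi for wi, xi in zip(w, x)) + w0
    xi_plus = max(0.0, r - prediction - eps)    # target above the tube
    xi_minus = max(0.0, prediction - r - eps)   # target below the tube
    return xi_plus, xi_minus

def svr_objective(w, w0, data, eps, C):
    """0.5 * ||w||^2 + C * (sum of both slacks over all (x, r) pairs)."""
    penalty = sum(sum(svr_slacks(w, w0, x, r, eps)) for x, r in data)
    return 0.5 * sum(wi * wi for wi in w) + C * penalty

# Hypothetical one-dimensional training set.
data = [([1.0], 1.2), ([2.0], 1.9), ([3.0], 4.0)]
print(svr_objective([1.0], 0.0, data, eps=0.5, C=10.0))
```

Only the third toy example lies outside the tube here, so it is the only one that contributes to the penalty term.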
