
Data Mining
CS 341, Spring 2007
Lecture 4: Data Mining Techniques (I)

Review
- Information Retrieval
  – Similarity measures
  – Evaluation metrics: precision and recall
- Question Answering
- Web Search Engine
  – An application of IR
  – Related to web mining


Data Mining Techniques Outline
- Goal: provide an overview of basic data mining techniques
- Statistical
  – Point estimation
  – Models based on summarization
  – Bayes theorem
  – Hypothesis testing
  – Regression and correlation
- Similarity measures

Point Estimation
- Point estimate: estimate a population parameter.
- May be made by calculating the parameter for a sample.
- May be used to predict a value for missing data.
- Ex:
  – R contains 100 employees
  – 99 have salary information
  – Mean salary of these is $50,000
  – Use $50,000 as the value of the remaining employee's salary. Is this a good idea?
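A minimal sketch of this point-estimate idea in Python; the salary figures are hypothetical stand-ins for the example above, not data from the course:

```python
# Hypothetical data: 99 known salaries averaging $50,000, one missing value.
salaries = [50_000.0] * 99

def point_estimate_mean(sample):
    # Point estimate of the population mean: the sample mean.
    return sum(sample) / len(sample)

# Use the sample statistic as the predicted value for the missing salary.
print(point_estimate_mean(salaries))   # 50000.0
```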


Estimation Error
- Bias: difference between the expected value and the actual value:
  Bias = E[θ̂] − θ
- Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value:
  MSE = E[(θ̂ − θ)²]
- Root Mean Squared Error (RMSE): the square root of the MSE.

Jackknife Estimate
- Jackknife estimate: an estimate of a parameter obtained by omitting one value at a time from the set of observed values.
- Named to describe a "handy and useful tool"
- Used to reduce bias
- Property: the jackknife lowers the bias from the order of 1/n to 1/n².
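To make these error measures concrete, a small simulation sketch; the setup (a standard normal population and the divide-by-n variance estimator, whose bias is of order 1/n as the jackknife slide notes) is invented for illustration:

```python
import random

random.seed(0)
true_var = 1.0                  # variance of a standard normal population
n, trials = 5, 100_000

def biased_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)   # divides by n

estimates = [biased_var([random.gauss(0, 1) for _ in range(n)])
             for _ in range(trials)]
mean_est = sum(estimates) / trials
bias = mean_est - true_var                              # E[est] - actual
mse = sum((e - true_var) ** 2 for e in estimates) / trials
print(f"bias ~ {bias:.3f} (theory: -1/n = -0.2), rmse ~ {mse ** 0.5:.3f}")
```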


Jackknife Estimate (cont'd)
- Definition:
  – Divide the sample of size n into g groups of size m each, so n = mg (often m = 1 and g = n).
  – Estimate θ̂_j by ignoring the jth group.
  – θ̄ is the average of the θ̂_j.
  – The jackknife estimator is
    » θ_Q = g θ̂ − (g − 1) θ̄
    where θ̂ is the estimator for the parameter θ computed from the full sample.

Jackknife Estimator: Example 1
- Estimate of the mean for X = {x1, x2, x3}, n = 3, g = 3, m = 1, θ = μ
- θ̂ = (x1 + x2 + x3)/3
- θ̂_1 = (x2 + x3)/2, θ̂_2 = (x1 + x3)/2, θ̂_3 = (x1 + x2)/2
- θ̄ = (θ̂_1 + θ̂_2 + θ̂_3)/3 = (x1 + x2 + x3)/3
- θ_Q = g θ̂ − (g − 1) θ̄ = 3 θ̂ − 2 θ̄ = (x1 + x2 + x3)/3
- In this case the jackknife estimator is the same as the usual estimator (see the sketch below).
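A minimal sketch of the m = 1 jackknife, with arbitrary numbers standing in for x1, x2, x3; it confirms that for the mean, θ_Q equals the usual estimator:

```python
def jackknife(estimator, xs):
    # m = 1, g = n: theta_Q = g*theta_hat - (g - 1)*theta_bar
    g = len(xs)
    theta_hat = estimator(xs)                                 # full-sample estimate
    loo = [estimator(xs[:j] + xs[j + 1:]) for j in range(g)]  # leave-one-out estimates
    theta_bar = sum(loo) / g                                  # average of the theta_hat_j
    return g * theta_hat - (g - 1) * theta_bar

mean = lambda xs: sum(xs) / len(xs)
print(jackknife(mean, [2.0, 4.0, 9.0]))   # 5.0 -- same as the plain mean
```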


Jackknife Estimator: Example 2
- Estimate of the variance for X = {1, 4, 4}, n = 3, g = 3, m = 1, θ = σ²
- θ̂ = σ̂² = ((1−3)² + (4−3)² + (4−3)²)/3 = 2
- θ̂_1 = ((4−4)² + (4−4)²)/2 = 0
- θ̂_2 = 2.25, θ̂_3 = 2.25
- θ̄ = (θ̂_1 + θ̂_2 + θ̂_3)/3 = 1.5
- θ_Q = g θ̂ − (g − 1) θ̄ = 3(2) − 2(1.5) = 3
- In this case the jackknife estimator is different from the usual estimator.

Jackknife Estimator: Example 2 (cont'd)
- In general, applying the jackknife technique to the biased estimator
  σ̂² = Σ(x_i − x̄)² / n
  yields the jackknife estimator
  s² = Σ(x_i − x̄)² / (n − 1)
  which is known to be unbiased for σ². (A numeric check follows below.)
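The same leave-one-out computation applied to the biased variance estimator reproduces the slide's numbers for X = {1, 4, 4}:

```python
def biased_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)        # divides by n (biased)

xs = [1.0, 4.0, 4.0]
g = len(xs)
theta_hat = biased_var(xs)                                    # 2.0
loo = [biased_var(xs[:j] + xs[j + 1:]) for j in range(g)]     # [0.0, 2.25, 2.25]
theta_bar = sum(loo) / g                                      # 1.5
print(g * theta_hat - (g - 1) * theta_bar)                    # 3.0

m = sum(xs) / len(xs)
print(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))          # s^2 = 3.0 as well
```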


Maximum Likelihood Estimate (MLE)
- Obtain the parameter estimates that maximize the probability that the sample data occurs for the specific model.
- The likelihood is the joint probability of observing the sample data, computed by multiplying the individual probabilities:
  L(Θ | x1, …, xn) = ∏ f(xi | Θ)
- Maximize L.

MLE Example
- A coin is tossed five times: {H, H, H, H, T}
- Assuming a fair coin with H and T equally likely, the likelihood of this sequence is:
  L = (1/2)^5 = 0.03125
- However, if the probability of an H is 0.8, then:
  L = (0.8)^4 (0.2) = 0.08192
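A quick check of both likelihood values; this computation is implied by, not shown on, the slide:

```python
def likelihood(p, tosses):
    # Joint probability of the sequence: multiply the individual probabilities.
    L = 1.0
    for t in tosses:
        L *= p if t == "H" else 1 - p
    return L

tosses = ["H", "H", "H", "H", "T"]
print(likelihood(0.5, tosses))   # 0.03125
print(likelihood(0.8, tosses))   # 0.08192 (approx.) -- p = 0.8 fits the data better
```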


MLE Example (cont'd)
- General likelihood formula for n coin tosses, writing xi = 1 for H and xi = 0 for T:
  L(p | x1, …, xn) = ∏ p^xi (1 − p)^(1−xi)
- Maximizing L gives the estimate for p: 4/5 = 0.8 (see the grid-search sketch below).

Expectation-Maximization (EM)
- Solves estimation problems with incomplete data.
- Obtain initial estimates for the parameters.
- Iteratively use the estimates for the missing data and continue until convergence.
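Returning to the MLE example above: a sketch confirming the closed-form MLE p̂ = Σxi / n by brute-force grid search over the general likelihood formula:

```python
tosses = [1, 1, 1, 1, 0]                 # H = 1, T = 0 for {H,H,H,H,T}

def L(p):
    # L(p) = product over i of p^xi * (1 - p)^(1 - xi)
    prod = 1.0
    for x in tosses:
        prod *= p ** x * (1 - p) ** (1 - x)
    return prod

best_p = max((k / 1000 for k in range(1001)), key=L)
print(best_p, sum(tosses) / len(tosses))  # 0.8 0.8 -- grid search matches k/n
```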

EM Example / EM Algorithm
[The worked example and algorithm statement on these two slides are figures that are not reproduced; a sketch of the idea follows.]
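Since the slide content is lost, here is a hedged sketch of the EM loop for a toy missing-data mean problem; all numbers are invented for illustration and are not necessarily the slide's example:

```python
# Estimate the mean mu of a sample of size 6 in which two values are missing.
observed = [1.0, 5.0, 10.0, 4.0]
n_missing = 2
mu = 0.0                                  # deliberately poor initial estimate

for step in range(100):
    filled = observed + [mu] * n_missing  # E-step: impute expected values
    new_mu = sum(filled) / len(filled)    # M-step: re-estimate the parameter
    if abs(new_mu - mu) < 1e-9:           # iterate until convergence
        break
    mu = new_mu

print(round(mu, 4))                       # converges to 5.0
```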


Models Based on Summarization
- Basic concepts that provide an abstraction and summarization of the data as a whole.
  – Statistical concepts: mean, variance, median, mode, etc.
- Visualization: display the structure of the data graphically.
  – Line graphs, pie charts, histograms, scatter plots, hierarchical graphs

Scatter Diagram
[Example scatter plot not reproduced.]
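A minimal sketch of the summarization statistics using Python's standard statistics module; the data values are hypothetical:

```python
import statistics

data = [50, 93, 67, 78, 87]               # hypothetical values
print(statistics.mean(data))              # 75
print(statistics.pvariance(data))         # 233.2 (population variance)
print(statistics.median(data))            # 78
```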


Bayes Theorem
- Posterior probability: P(h1 | xi)
- Prior probability: P(h1)
- Bayes theorem:
  P(hj | xi) = P(xi | hj) P(hj) / P(xi)
- Assign probabilities of hypotheses given a data value.

Bayes Theorem Example
- Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police
- Assign twelve data values for all combinations of credit and income:

              Income:  1     2     3     4
    Excellent          x1    x2    x3    x4
    Good               x5    x6    x7    x8
    Bad                x9    x10   x11   x12

- From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.

Bayes Example (cont'd)
- Training data:

    ID  Income  Credit     Class  xi
     1    4     Excellent   h1    x4
     2    3     Good        h1    x7
     3    2     Excellent   h1    x2
     4    3     Good        h1    x7
     5    4     Good        h1    x8
     6    2     Excellent   h1    x2
     7    3     Bad         h2    x11
     8    2     Bad         h2    x10
     9    3     Bad         h3    x11
    10    1     Bad         h4    x9

Bayes Example (cont'd)
- Calculate P(xi | hj) and P(xi).
- Ex: P(x7 | h1) = 2/6; P(x4 | h1) = 1/6; P(x2 | h1) = 2/6; P(x8 | h1) = 1/6; P(xi | h1) = 0 for all other xi.
- Predict the class for x4:
  – Calculate P(hj | x4) for all hj.
  – Place x4 in the class with the largest value.
  – Ex:
    » P(h1 | x4) = P(x4 | h1) P(h1) / P(x4) = (1/6)(0.6)/0.1 = 1.
    » x4 goes in class h1.
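A sketch that recomputes the example directly from the training table; posterior() is a hypothetical helper written for this note, not code from the course:

```python
from collections import Counter

# (class, data value) pairs from the training-data slide.
training = [("h1", "x4"), ("h1", "x7"), ("h1", "x2"), ("h1", "x7"),
            ("h1", "x8"), ("h1", "x2"), ("h2", "x11"), ("h2", "x10"),
            ("h3", "x11"), ("h4", "x9")]

n = len(training)
class_count = Counter(h for h, _ in training)
pair_count = Counter(training)
x_count = Counter(x for _, x in training)

def posterior(h, x):
    # P(h|x) = P(x|h) * P(h) / P(x), all estimated from the training data.
    p_x_given_h = pair_count[(h, x)] / class_count[h]
    p_h = class_count[h] / n
    p_x = x_count[x] / n
    return p_x_given_h * p_h / p_x

print(posterior("h1", "x4"))   # 1.0 -> x4 is placed in class h1
```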

Hypothesis Testing
- Find a model to explain behavior by creating and then testing a hypothesis about the data.
- Exact opposite of the usual DM approach.
- H0 – null hypothesis; the hypothesis to be tested.
- H1 – alternative hypothesis.

Chi-Square Test
- One technique to perform hypothesis testing.
- Used to test the association between two observed variable values and determine if a set of observed values is statistically different.
- The chi-squared statistic is defined as:
  χ² = Σ (O − E)² / E
- O – observed value
- E – expected value based on the hypothesis.


Chi-Square Test (cont'd)
- Given the average scores of five schools, determine whether the differences are statistically significant.
- Ex:
  – O = {50, 93, 67, 78, 87}
  – E = 75
  – χ² = 15.55, and therefore significant (see the numeric check below)
- Examine a chi-squared significance table:
  – With 4 degrees of freedom and a significance level of 95%, the critical value is 9.488. Thus the variance between the schools' scores and the expected value cannot be associated with pure chance.

Regression
- Predict future values based on past values.
- Fit a set of points to a curve.
- Linear regression assumes a linear relationship exists:
  y = c0 + c1 x1 + … + cn xn
  – n input variables (called regressors or predictors)
  – One output variable, called the response
  – n + 1 constants, chosen during the modeling process to match the input examples
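A numeric check of the chi-square example above; the 9.488 critical value is quoted from the slide (a standard chi-square table), not computed here:

```python
observed = [50, 93, 67, 78, 87]
expected = 75

# chi^2 = sum of (O - E)^2 / E over the observed values.
chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi2, 2))    # 15.55

# 4 degrees of freedom at the 95% level: critical value 9.488.
# 15.55 > 9.488, so the differences are statistically significant.
```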


Linear Regression with One Input Value
- With a single input, the model reduces to y = c0 + c1 x1. [Fitted-line figure not reproduced.]

Correlation
- Examine the degree to which the values for two variables behave similarly.
- Correlation coefficient r:
  – 1 = perfect correlation
  – −1 = perfect but opposite correlation
  – 0 = no correlation
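A minimal sketch of fitting the one-input model y = c0 + c1 x1 above by least squares; the data points are hypothetical:

```python
xs = [1.0, 2.0, 3.0, 4.0]                 # hypothetical points
ys = [2.1, 3.9, 6.2, 7.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
c1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))   # least-squares slope
c0 = my - c1 * mx                         # least-squares intercept
print(f"y = {c0:.2f} + {c1:.2f} x")       # y = 0.15 + 1.94 x
```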


Correlation (cont'd)
- Pearson's r:
  r = Σ(xi − X̄)(yi − Ȳ) / sqrt( Σ(xi − X̄)² · Σ(yi − Ȳ)² )
  where X̄ and Ȳ are the means of X and Y respectively.
- Suppose X = (1, 3, 5, 7, 9) and Y = (9, 7, 5, 3, 1): r = ?
- Suppose X = (1, 3, 5, 7, 9) and Y = (2, 4, 6, 8, 10): r = ?
  (Both are answered in the sketch below.)

Similarity Measures
- Determine the similarity between two objects.
- Similarity characteristics: [formal properties listed on the slide not reproduced]
- Alternatively, distance measures measure how unlike or dissimilar objects are.
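A sketch of Pearson's r that answers both questions above:

```python
def pearson_r(xs, ys):
    # r = cov(X, Y) / (std(X) * std(Y)), computed from deviations about the means.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(pearson_r([1, 3, 5, 7, 9], [9, 7, 5, 3, 1]))    # -1.0: perfect but opposite
print(pearson_r([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))   #  1.0: perfect correlation
```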


Similarity Measures (cont'd)
[Specific similarity formulas on this slide not reproduced.]

Distance Measures
- Measure the dissimilarity between objects.
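The slide's specific formulas are not reproduced; as an illustration, two standard distance measures (Euclidean and Manhattan), offered here as an assumption about what the slide covered:

```python
def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean([1, 2], [4, 6]))   # 5.0
print(manhattan([1, 2], [4, 6]))   # 7
```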


Next Lecture:
- Data Mining Techniques (II)
  – Decision trees, neural networks, and genetic algorithms
- Reading assignment: Chapter 3

