
Learning Bayesian Networks: Naïve and non-Naïve Bayes

Hypothesis Space
– fixed size
– stochastic
– continuous parameters

Learning Algorithm
– direct computation
– eager
– batch

Multivariate Gaussian Classifier

[Figure: Bayesian network with a single arrow from the class node y to the feature vector x]

The multivariate Gaussian classifier is equivalent to a simple Bayesian network. This models the joint distribution P(x, y) under the assumption that the class-conditional distributions P(x|y) are multivariate Gaussians:
– P(y): multinomial (K-sided coin)

– P(x|y): multivariate Gaussian with mean µk and covariance matrix Σk
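As a concrete illustration, here is a minimal sketch of such a classifier in numpy/scipy. The function names (fit_gaussian_classifier, predict) are my own, and the sketch assumes at least two training examples per class and nonsingular covariance matrices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classifier(X, y):
    """Estimate P(y=k) and the class-conditional Gaussians N(mu_k, Sigma_k)."""
    classes = np.unique(y)
    priors, means, covs = {}, {}, {}
    for k in classes:
        Xk = X[y == k]
        priors[k] = len(Xk) / len(X)          # P(y = k)
        means[k] = Xk.mean(axis=0)            # mu_k
        covs[k] = np.cov(Xk, rowvar=False)    # Sigma_k
    return classes, priors, means, covs

def predict(x, classes, priors, means, covs):
    """Pick the class maximizing P(x | y = k) P(y = k)."""
    scores = [multivariate_normal.pdf(x, means[k], covs[k]) * priors[k]
              for k in classes]
    return classes[int(np.argmax(scores))]
```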

Naïve Bayes Model

[Figure: Bayesian network with class node y and arrows to feature nodes x1, x2, x3, …, xn]

Each node contains a probability table – y: P(y = k)

– xj: P(xj = v | y = k)

Interpret as a generative model:
– Choose the class k according to P(y = k)

– Generate each feature independently according to P(xj = v | y = k)
– The feature values are conditionally independent given the class:

P(xi, xj | y) = P(xi | y) · P(xj | y)
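A minimal sketch of this generative story for discrete features. The probability tables p_y and p_x_given_y below are made-up illustrations, not values from the slides.

```python
import random

# Hypothetical tables: p_y[k] = P(y = k);  p_x_given_y[j][k][v] = P(x_j = v | y = k)
p_y = {0: 0.6, 1: 0.4}
p_x_given_y = {
    0: {0: {"a": 0.7, "b": 0.3}, 1: {"a": 0.2, "b": 0.8}},    # feature x_0
    1: {0: {"lo": 0.5, "hi": 0.5}, 1: {"lo": 0.1, "hi": 0.9}}  # feature x_1
}

def sample_example():
    """Generate (x, y): first draw the class, then each feature independently."""
    k = random.choices(list(p_y), weights=p_y.values())[0]
    x = [random.choices(list(p_x_given_y[j][k]),
                        weights=p_x_given_y[j][k].values())[0]
         for j in sorted(p_x_given_y)]
    return x, k
```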

Representing P(xj | y)

Many representations are possible:
– Univariate Gaussian: if xj is a continuous random variable, then we can use a univariate Gaussian and learn the mean µ and variance σ²
– Multinomial (see the estimation sketch after this list):

if xj is a discrete random variable, xj ∈ {v1, …, vm}, then we construct the conditional probability table

              y = 1                y = 2                …    y = K
xj = v1       P(xj=v1 | y=1)       P(xj=v1 | y=2)       …    P(xj=v1 | y=K)
xj = v2       P(xj=v2 | y=1)       P(xj=v2 | y=2)       …    P(xj=v2 | y=K)
…             …                    …                    …    …
xj = vm       P(xj=vm | y=1)       P(xj=vm | y=2)       …    P(xj=vm | y=K)

– Discretization: convert continuous xj into a discrete variable
– Kernel Density Estimates: apply a kind of nearest-neighbor algorithm to compute P(xj | y) in the neighborhood of the query point
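As a concrete illustration of the multinomial option above, here is a small sketch that estimates a conditional probability table by counting. The helper name estimate_cpt is an assumption, not from the slides.

```python
from collections import Counter

def estimate_cpt(xj_values, y):
    """Estimate the table P(x_j = v | y = k) from paired lists of
    feature values and class labels."""
    class_counts = Counter(y)
    pair_counts = Counter(zip(xj_values, y))
    return {(v, k): pair_counts[(v, k)] / class_counts[k]
            for (v, k) in pair_counts}

# Example: entries are keyed by (value, class)
cpt = estimate_cpt(["a", "b", "a", "a"], [1, 1, 2, 2])
print(cpt[("a", 2)])   # fraction of class-2 examples with x_j = "a"  ->  1.0
```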

Discretization via Mutual Information

Many discretization algorithms have been studied. One of the best is discretization via mutual information (decision-tree splitting):
– To discretize feature xj, grow a decision tree considering only splits on xj. Each leaf of the resulting tree will correspond to a single value of the discretized xj.
– Stopping rule (applied at each node): stop when

ΔI(xj; y) < log2(N − 1)/N + Δ/N

Δ = log2(3^K − 2) − [K·H(S) − Kl·H(Sl) − Kr·H(Sr)]

– where S is the training data in the parent node; Sl and Sr are the examples in the left and right child; K, Kl, and Kr are the corresponding numbers of classes present in these examples; I is the mutual information, H is the entropy, and N is the number of examples in the node.
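A sketch of this stopping test, under the assumption that each node is represented simply by its list of class labels; the helper names entropy and should_stop are mine, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S): entropy of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def should_stop(parent, left, right):
    """Stop splitting when the information gain Delta I(x_j; y)
    falls below log2(N-1)/N + Delta/N."""
    N = len(parent)
    gain = entropy(parent) - (len(left) / N) * entropy(left) \
                           - (len(right) / N) * entropy(right)
    K, Kl, Kr = len(set(parent)), len(set(left)), len(set(right))
    delta = math.log2(3 ** K - 2) - (K * entropy(parent)
                                     - Kl * entropy(left)
                                     - Kr * entropy(right))
    return gain < (math.log2(N - 1) + delta) / N
```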

Kernel Density Estimators

Define

K(xj, xi,j) = 1/(√(2π)·σ) · exp(−((xj − xi,j)/σ)²)

to be the Gaussian kernel with parameter σ. Estimate

P(xj | y = k) = [ Σ_{i : yi = k} K(xj, xi,j) ] / Nk

where Nk is the number of training examples in class k.
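A minimal numpy sketch of this estimator; the function name kde_estimate and the default bandwidth are assumptions, not from the slides.

```python
import numpy as np

def kde_estimate(xj, X_col, y, k, sigma=0.5):
    """Kernel density estimate of P(xj | y = k) for one feature.

    X_col: 1-D array of training values of feature x_j
    y:     array of class labels; k: the class of interest
    """
    xk = X_col[y == k]                        # training values from class k
    Nk = len(xk)
    kern = (1.0 / (np.sqrt(2 * np.pi) * sigma)) * \
           np.exp(-((xj - xk) / sigma) ** 2)  # Gaussian bump at each point
    return kern.sum() / Nk
```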

Kernel Density Estimators (2)

This is equivalent to placing a Gaussian "bump" of height 1/Nk on each training data point from class k and then adding them up.

[Figure: plot of P(xj | y) versus xj showing the summed Gaussian bumps]

Kernel Density Estimators

Resulting probability density:

[Figure: plot of P(xj | y) versus xj showing the resulting smooth density]

The value chosen for σ is critical:

[Figure: kernel density estimates with σ = 0.15 and σ = 0.50]

Naïve Bayes learns a Linear Threshold Unit

For multinomial and discretized attributes (but not Gaussian), Naïve Bayes gives a linear decision boundary.

P(x | Y = y) = P(x1 = v1 | Y = y) · P(x2 = v2 | Y = y) ··· P(xn = vn | Y = y)

Define a discriminant function for class 1 versus class K:

h(x) = P(Y=1 | X) / P(Y=K | X)
     = [P(x1=v1 | Y=1) / P(x1=v1 | Y=K)] ··· [P(xn=vn | Y=1) / P(xn=vn | Y=K)] · [P(Y=1) / P(Y=K)]

Log of Odds Ratio

P(y=1 | x) / P(y=K | x) = [P(x1=v1 | y=1) / P(x1=v1 | y=K)] ··· [P(xn=vn | y=1) / P(xn=vn | y=K)] · [P(y=1) / P(y=K)]

log [P(y=1 | x) / P(y=K | x)] = log [P(x1=v1 | y=1) / P(x1=v1 | y=K)] + … + log [P(xn=vn | y=1) / P(xn=vn | y=K)] + log [P(y=1) / P(y=K)]

Suppose each xj is binary and define

αj,0 = log [ P(xj=0 | y=1) / P(xj=0 | y=K) ]
αj,1 = log [ P(xj=1 | y=1) / P(xj=1 | y=K) ]

Log Odds (2)

Now rewrite as

log [P(y=1 | x) / P(y=K | x)] = Σj [ (αj,1 − αj,0)·xj + αj,0 ] + log [P(y=1) / P(y=K)]

log [P(y=1 | x) / P(y=K | x)] = Σj (αj,1 − αj,0)·xj + ( Σj αj,0 + log [P(y=1) / P(y=K)] )

We classify into class 1 if this is ≥ 0 and into class K otherwise.
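A numpy sketch of the resulting linear threshold unit, assuming the conditional probabilities have already been estimated (and smoothed away from 0 and 1); the array names p1 and pK are mine, not from the slides.

```python
import numpy as np

def naive_bayes_ltu(p1, pK, prior1, priorK):
    """Weights and bias of the linear unit implied by Naive Bayes
    with binary features.

    p1[j] = P(x_j = 1 | y = 1),  pK[j] = P(x_j = 1 | y = K)
    """
    alpha1 = np.log(p1 / pK)                    # alpha_{j,1}
    alpha0 = np.log((1 - p1) / (1 - pK))        # alpha_{j,0}
    w = alpha1 - alpha0                         # feature weights
    b = alpha0.sum() + np.log(prior1 / priorK)  # bias term
    return w, b

def classify(x, w, b):
    """True for class 1 (log odds >= 0), False for class K."""
    return np.dot(w, x) + b >= 0
```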

Learning the Probability Distributions by Direct Computation

P(y = k) is just the fraction of training examples belonging to class k.

For multinomial variables, P(xj = v | y = k) is the fraction of training examples in class k where xj = v.

For Gaussian variables, µ̂jk is the average value of xj for training examples in class k, and σ̂jk is the sample standard deviation of those points:

σ̂jk = √[ (1/Nk) · Σ_{i : yi = k} (xi,j − µ̂jk)² ]
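A sketch of these direct estimates for a single continuous feature, assuming numpy arrays X_col (feature values) and y (class labels); the function name direct_estimates is mine.

```python
import numpy as np

def direct_estimates(X_col, y):
    """Per-class prior, mean, and standard deviation computed directly
    from counts and averages."""
    estimates = {}
    for k in np.unique(y):
        xk = X_col[y == k]
        prior = len(xk) / len(y)                  # P(y = k)
        mu = xk.mean()                            # mu_hat_jk
        sigma = np.sqrt(((xk - mu) ** 2).mean())  # sigma_hat_jk (1/Nk version)
        estimates[k] = (prior, mu, sigma)
    return estimates
```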

Improved Probability Estimates via Laplace Corrections

When we have very little training data, direct probability computation can give probabilities of 0 or 1. Such extreme probabilities are "too strong" and cause problems.

Suppose we are estimating a probability P(z) and we have n0 examples where z is false and n1 examples where z is true. Our direct estimate is

P(z = 1) = n1 / (n0 + n1)

Laplace Estimate: add 1 to the numerator and 2 to the denominator:

P(z = 1) = (n1 + 1) / (n0 + n1 + 2)

This says that in the absence of any evidence, we expect P(z) = 0.5, but our belief is weak (equivalent to one example for each outcome).

Generalized Laplace Estimate: if z has K different outcomes, then we estimate it as

P(z = 1) = (n1 + 1) / (n0 + ⋯ + n_{K−1} + K)
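A sketch of the generalized Laplace estimate as a small helper; the function name laplace_estimate is an assumption, not from the slides.

```python
def laplace_estimate(counts, outcome):
    """Laplace-corrected estimate of P(z = outcome) from a dict of
    outcome counts: add 1 to each count and K to the total."""
    K = len(counts)
    total = sum(counts.values())
    return (counts[outcome] + 1) / (total + K)

# Example: 3 positives and 0 negatives -> 0.8 instead of the extreme 1.0
print(laplace_estimate({"true": 3, "false": 0}, "true"))
```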

Naïve Bayes Applied to Diabetes Diagnosis

Bayes nets and causality:
– Bayes nets work best when arrows follow the direction of causality: two things with a common cause are likely to be conditionally independent given the cause, and arrows in the causal direction capture this independence.
– In a Naïve Bayes network, arrows are often not in the causal direction: diabetes does not cause pregnancies, and diabetes does not cause age.
– But some arrows are correct: diabetes does cause the level of blood insulin and blood glucose.

Non-Naïve Bayes

Manually construct a graph in which all arcs are causal. Learning the probability tables is still easy. For example, P(Mass | Age, Preg) involves counting the number of patients of a given age and number of pregnancies that have a given body mass.

Classification:

P(D = d | A, P, M, I, G) = P(I | D = d) · P(G | I, D = d) · P(D = d | A, M, P) / P(I, G)
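A sketch of this classification rule; the lookup functions p_I_given_D, p_G_given_ID, and p_D_given_AMP are hypothetical stand-ins for the learned probability tables, not names from the slides. Since the denominator P(I, G) is the same for every value of d, the sketch simply normalizes the numerators.

```python
def classify_diabetes(a, p, m, i, g, p_I_given_D, p_G_given_ID, p_D_given_AMP):
    """Score each value d of the diabetes variable by the numerator
    P(I | D=d) P(G | I, D=d) P(D=d | A, M, P), then normalize."""
    scores = {}
    for d in (0, 1):
        scores[d] = (p_I_given_D(i, d)
                     * p_G_given_ID(g, i, d)
                     * p_D_given_AMP(d, a, m, p))
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}
```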

Evaluation of Naïve Bayes

Criterion                  LMS   Logistic  LDA   Trees  Nets  NNbr  SVM   NB
Mixed data                 no    no        no    yes    no    no    no    yes
Missing values             no    no        yes   yes    no    some  no    yes
Outliers                   no    yes       no    yes    yes   yes   yes   disc
Monotone transformations   no    no        no    yes    some  no    no    disc
Scalability                yes   yes       yes   yes    yes   no    no    yes
Irrelevant inputs          no    no        no    some   no    no    some  some
Linear combinations        yes   yes       yes   no     yes   some  yes   yes
Interpretable              yes   yes       yes   yes    no    no    some  yes
Accurate                   yes   yes       yes   no     yes   no    yes   yes

• Naïve Bayes is very popular, particularly in natural language processing and information retrieval, where there are many features compared to the number of examples.
• In applications with lots of data, Naïve Bayes does not usually perform as well as more sophisticated methods.

Naïve Bayes Summary

Advantages of Bayesian networks
– Produces stochastic classifiers: can be combined with utility functions to make optimal decisions
– Easy to incorporate causal knowledge: resulting probabilities are easy to interpret
– Very simple learning algorithms if all variables are observed in training data

Disadvantages of Bayesian networks
– Fixed-size hypothesis space: may underfit or overfit the data, and may not contain any good classifiers if prior knowledge is wrong
– Harder to handle continuous features