Pattern Matching Using Fuzzy Methods
David Bell and Lynn Palmer, State of California: Genetic Disease Branch
ABSTRACT

Two major methods of detecting similarities, Euclidean distance and a "fuzzy Hamming distance" presented in a paper entitled "Fuzzy Hamming Distance: A New Dissimilarity Measure" by Bookstein, Klein, and Raita, will be compared using entropy calculations to determine their abilities to detect patterns in data and match data records. We find that both means of measuring distance are useful depending on the context. While fuzzy Hamming distance performs better in supervised learning situations such as searches, the Euclidean distance measure is more useful in unsupervised pattern matching and clustering of data.

INTRODUCTION

Pattern matching is becoming more of a necessity given the need for methods that detect similarities in records, epidemiological patterns, genetics, and even in the emerging fields of criminal behavioral pattern analysis and disease outbreak analysis due to possible terrorist activity. Unfortunately, traditional methods for linking or matching records, data fields, etc. rely on exact data matches rather than looking for close matches or patterns. Of course, proximity pattern matches are often necessary when dealing with messy data, data that has inexact values, and/or data with missing key values. They are invaluable when constructing case based reasoning (CBR) or neural network based systems.

In this paper, we will compare two methods of measuring similarities of data objects: the Euclidean distance method and a new method, the fuzzy Hamming distance. We constructed a very small sample of fewer than 20 names and identifiers selected at random from a large medical database of thousands of records to see how the measurements performed in a search and a small clustering experiment. In order to protect the confidentiality of the records in the database, their names were aliased and scrambled, and other identifiers will not be presented.

PATTERN SEARCHING AND DISSIMILARITY METRICS

Measures of dissimilarity:

When attempting to match records or detect patterns in data, a reliable distance measurement or metric is crucial. The data mining methods used to cluster, match, or search the database will be somewhat undermined if a less than optimal distance measurement is used.

The purpose of this paper is to examine two popular methods for measuring object distances or "dissimilarities": the fuzzy Hamming and Euclidean distance measurements.

Euclidean Distance

The Euclidean distance metric is a classically used means of measuring the distance between 2 vectors of n elements. The formula is:

$$D_{ij} = \sum (x_i - y_j)^2$$

Listing 1 accumulates these squared differences and then takes the square root of the sum. Euclidean distance is especially useful for comparing matching whole word object elements of 2 vectors. The code in Java is:

Listing 1: Euclidean Distance

    /**
     * EuclideanDistance.java
     *
     * Created: Fri Oct 07 08:46:40 2002
     *
     * @author David Bell: DHS-GENETICS
     * @version 1.0
     */

    /**
     * Abstract Distance class
     * @param None
     * @return Distance Template
     */
    abstract class distance {
        double d() { return 0.0; }
        double wd() { return 0.0; }
        abstract String GetName();
    }

    class euclidean extends distance {
        // Constructor
        /**
         * Calculates Squared Euclidean Distances
         * @param 2 values of type double
         * @return Squared Distances between values
         */
        public euclidean(double x1, double y2) {
            double x = y2 - x1;
            distance1(x);
        }

        // Transfer values to d()
        public void distance1(double x) {
            xfactor = x;
        }

        public double d() { return xfactor * xfactor; }

        public String GetName() { return "euclidean"; }

        protected double xfactor, yfactor;
    }

    public class EuclideanDistance {
        /**
         * EuclideanDistance
         * @param double[][] Array[m][n], int nrows, int ncols
         * @return array[][]: Relative Euclidean Dist.
         */
        public double[][] EuclideanDistance(double[][] x, int i, int j) {
            int iIN, jIN, p = 0;
            double[][] xydist = new double[i + 2][j + 2];
            // perform calculations within a matrix and object
            // (indices start at 1, so row 0 and column 0 of x are not read)
            for (iIN = 1; iIN <= i; iIN++) {
                for (jIN = 1; jIN <= j; jIN++) {
                    for (p = 1; p < i; ++p) {
                        euclidean eu = new euclidean(x[p][jIN], x[iIN][jIN]);
                        xydist[iIN][jIN] += eu.d();
                    }
                    xydist[iIN][jIN] = Math.sqrt(xydist[iIN][jIN]);
                }
            }
            return (xydist);
        }
    }

Fuzzy Hamming Distance

The fuzzy Hamming distance is a more recent dissimilarity measurement that was introduced in 2001 in an article by Bookstein, Klein, and Raita in the proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching. The metric extends the familiar Hamming distance that has been used to measure edit distances in bitmaps and letters of word objects.

The drawback of the conventional Hamming distance metric is that it measures absolute differences in bit or alphanumeric strings without indicating or giving credit for "close calls." Often absolute differences are too crude when trying to measure approximate distances in bitmaps or characters. For example, let there be the following bit strings:

    Target: 11110000
    SRC1:   11101000
    SRC2:   11100100

Using the traditional Hamming distance, the distance between the Target and SRC1 strings is the same as between the Target and SRC2 strings. In both cases the distance is 2.

The fuzzy Hamming distance, however, gives credit for being close. In fuzzy Hamming terms, SRC1 is closer to the Target string than SRC2 is, in that the "1" needs to move only a single position to exchange with the "0" to make the strings identical. On account of this, the fuzzy Hamming distance is 0.25. SRC2 is a little farther away from the Target string, since the "1" must move 2 positions before it exchanges with the correct "0" to equal the Target string. It is not twice the distance of SRC1 from the Target, but it is a "little farther" relative to the length of the string. So, its fuzzy Hamming distance is 0.375. As the "1" bit wanders farther away from its position in the Target string, the distance increases, which calls for fuzzy terms like "nearer" or "farther" rather than absolute terms like "identical" versus "different." The Java code for calculating a fuzzy Hamming distance is:

Listing 2: Fuzzy Hamming Distance

    import java.util.*;
    import java.math.*;
    import java.lang.Double;

    /**
     * FHamming.java
     * Fuzzy Hamming Distances for Alpha Numeric Fields.
     * Definition of a Hamming Distance is the sum of
     * binary differences between Source and Target arrays in terms
     * of absolute values as well as cardinality.
     * Fuzzy Hamming has these features as well as
     * a proximity measure for close versus far comparisons.
     * based upon: Abraham Bookstein, Shmuel Tomi Klein, and Tim Raita,
     * Fuzzy Hamming Distance: A New Dissimilarity Measure (in
     * 12th Annual Symposium on Combinatorial Pattern Matching.
     * Jerusalem: Springer Verlag)
     * Created: Thu Feb 13 14:40:08 2003
     * hamming2.java
     *
     * @author David Bell
     * @version 1.0
     */
    public class FHamming {

        /**
         * Shift class
         * @param t target String
         * @param s_i Source char
         * @param i integer position of source character
         * @param j integer position of target character
         * @return delta shift value
         */
        public double Shift(String t, char s_i, int i, int j) {
            double delta = 0;
            int x = 0;
            for (x = j; x <= t.length() - 1; ++x) {
                if (s_i == t.charAt(x)) {
                    delta = i - (x + 1);
                    if (delta < 0) delta = delta * -1;
                    delta = delta / t.length();
                    break;
                }
            }
            return (delta);
        }

        //****************************
        // Get minimum of three values
        // USE A FILTER SORT
        //****************************
        private double Minimum(double a, double b, double c) {
            double mi;
            mi = a;
            if (b < mi) {
                mi = b;
            }
            if (c < mi) {
                mi = c;
            }
            return mi;
        }

        //*****************************
        // Compute Fuzzy Hamming distance
        //*****************************
        /**
         * hamming2 class
         * Recursively calculates hamming distance from
         * distance matrix. Inputs: Source and Target strings
         * @param s source string
         * @param t target string
         * @return double fuzzy hamming distance
         */
        public double hamming2(String s, String t) {
            double d[][]; // matrix
            int n;        // length of s
            int m;        // length of t
            int i;        // iterates through s
            int j;        // iterates through t
            char s_i;     // ith character of s
            char t_j;     // jth character of t
            double cost;  // cost

            // Step 1
            n = s.length();
            m = t.length();
            if (n == 0) {
                return m;
            }
            if (m == 0) {
                return n;
            }
            // set the dimensions of d matrix
            d = new double[n + 1][m + 1];

            // Step 2
            // set the rows and columns: rows = source; cols = target
            for (i = 0; i <= n; i++) {
                d[i][0] = i;
            }
            for (j = 0; j <= m; j++) {
                d[0][j] = j;
            }

            // NOTE: the two loop headers and the s_i / t_j assignments below
            // do not appear in the recovered listing; they are reconstructed
            // from the variable declarations above and are an assumption.
            for (i = 1; i <= n; i++) {
                s_i = s.charAt(i - 1);
                for (j = 1; j <= m; j++) {
                    t_j = t.charAt(j - 1);

                    if (s_i == t_j) {
                        cost = 0;
                    } else {
                        // HERE IS WHERE WE JAZZ IT UP
                        // LETS ADD SOME SHIFT PARMS AND RAZZLE DAZZLE
                        String tsub = t.substring(j, t.length());
                        double tcost = Shift(t, s_i, i, j);
                        if (tcost < 1 && tcost >= 0) cost = tcost;
                        else cost = 1;
                    }

                    // Step 6
                    // RECURSIVELY CALCULATE FUZZY HAMMING DISTANCE
                    d[i][j] = Minimum(d[i - 1][j] + 1,
                                      d[i][j - 1] + 1,
                                      d[i - 1][j - 1] + cost);
                }
            }
            return d[n][m];
        }

        // Use this code to access hamming distance
        public double FHamming(String target, String source) {
            // The recovered listing created a separate hamming2 object here;
            // the method is called directly so that this class compiles on its own.
            double out = hamming2(source, target);
            return out;
        }
    }

Distance Normalization:

    Groucho: "Who will give me 20 for this lot?"
    Chico: "Sure, I give you 20.."
    Groucho: "Okay.. how about 50… who will give me 50?"
    Chico: "Sure, I give you 50… I give you 100...150…550… I gotta plenty of numbers!"

    From the auction scene in Marx Bros.: The Cocoanuts

After obtaining a distance of, say, 9.112, what can we say about how close we are to a target or how similar a set of values are? Not much.
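A raw distance only becomes informative once it is set against some scale. As a minimal sketch of that idea (not the normalization procedure developed in this paper), the Java snippet below computes the conventional Hamming distance for the Target and SRC1 bit strings used earlier, and then for the same strings padded with 24 matching zeros. The class name `HammingScaleDemo`, its methods, and the choice of dividing by string length are all assumptions made for illustration.

```java
/** Illustrative sketch only; the class and method names are hypothetical. */
final class HammingScaleDemo {

    /** Conventional Hamming distance: count of positions where two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int count = 0;
        for (int k = 0; k < a.length(); k++) {
            if (a.charAt(k) != b.charAt(k)) {
                count++;
            }
        }
        return count;
    }

    /** Assumed scaling: raw distance divided by string length, giving a value in [0, 1]. */
    static double relative(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // two empty strings are treated as identical
        }
        return (double) hamming(a, b) / a.length();
    }

    public static void main(String[] args) {
        String target = "11110000";
        String src1 = "11101000";

        // The same two mismatches, embedded in strings four times as long.
        String padding = "00000000" + "00000000" + "00000000";
        String longTarget = target + padding;
        String longSrc = src1 + padding;

        System.out.println("short: raw=" + hamming(target, src1)
                + " relative=" + relative(target, src1));
        System.out.println("long:  raw=" + hamming(longTarget, longSrc)
                + " relative=" + relative(longTarget, longSrc));
    }
}
```

Both pairs have a raw distance of 2, yet the mismatches cover a quarter of the short strings and only one sixteenth of the long ones, so the raw figure by itself does not say how similar the strings are.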