Pattern Matching Using Fuzzy Methods
David Bell and Lynn Palmer, State of California: Genetic Disease Branch

ABSTRACT

Two major methods of detecting similarities, the Euclidean distance and a "fuzzy" Hamming distance presented in a paper entitled "Fuzzy Hamming Distance: A New Dissimilarity Measure" by Bookstein, Klein, and Raita, will be compared using entropy calculations to determine their abilities to detect patterns in data and to match data records. We found that both means of measuring distance are useful depending on the context. While the fuzzy Hamming distance outperforms in supervised learning situations such as searches, the Euclidean distance measure is more useful in unsupervised pattern matching and clustering of data.

INTRODUCTION

Pattern matching is becoming more of a necessity given the need for such methods in detecting similarities in records, epidemiological patterns, and genetics, and even in the emerging fields of criminal behavioral pattern analysis and disease outbreak analysis due to possible terrorist activity. Unfortunately, traditional methods for linking or matching records, data fields, etc. rely on exact data matches rather than looking for close matches or patterns. Proximity pattern matches are often necessary when dealing with messy data: data that has inexact values and/or missing key values. They are invaluable when constructing case based reasoning (CBR) or neural network based systems.

In this paper, we will compare two methods of measuring the similarity of data objects: the Euclidean distance method and a new method, the fuzzy Hamming distance. We constructed a very small sample of fewer than 20 names and identifiers selected at random from a large medical database of thousands of records to see how the measurements performed in a search experiment and a small clustering experiment. In order to protect the confidentiality of the records in the database, the names were aliased and scrambled, and other identifiers will not be presented.

PATTERN SEARCHING AND DISSIMILARITY METRICS

Measures of dissimilarity:

When attempting to match records or detect patterns in data, a reliable distance measurement or metric is crucial. The data mining methods used to cluster, match or search the database will be somewhat undermined if a less than optimal distance measurement is used.

The purpose of this paper is to examine two popular methods for measuring object distances or "dissimilarities": the fuzzy Hamming and Euclidean distance measurements.

Euclidean Distance

The Euclidean distance metric is a classically used means of measuring the distance between two vectors of n elements. The formula is:

    Dij = √( ∑ (xi - yj)² )

Euclidean distance is especially useful for comparing matching whole-word object elements of two vectors. The code in Java is:

Listing 1: Euclidean Distance

    /**
     * EuclideanDistance.java
     *
     * Created: Fri Oct 07 08:46:40 2002
     *
     * @author David Bell: DHS-GENETICS
     * @version 1.0
     */

    /** Abstract Distance class
     * @param None
     * @return Distance Template
     **/
    abstract class distance{
        double d() {return 0.0;}
        double wd() {return 0.0;}
        abstract String GetName();
    }

    class euclidean extends distance
    {
        // Constructor
        /** Calculates Squared Euclidean Distances
         * @param 2 values of type double
         * @return Squared Distances between values
         **/
        public euclidean(double x1, double y2)
        {
            double x = y2 - x1;
            distance1(x);
        }

        // Transfer values to d()
        public void distance1(double x){
            xfactor = x;
        }

        public double d()
        { return xfactor*xfactor; }

        public String GetName()
        { return "euclidean"; }

        protected double xfactor,yfactor;
    }

    public class EuclideanDistance {
        // Constructor
        /**
         * EuclideanDistance
         * @param double[][] Array[m][n],int nrows,int ncols
         * @return array[][]: Relative Euclidean Dist.
         **/
        public double[][] EuclideanDistance(double[][] x, int i, int j){
            int iIN, jIN, p;
            double[][] xydist = new double[i+2][j+2];

            // perform calculations within a matrix and object
            // NOTE: the body of the inner loop was truncated in the
            // source text; the summation below is a reconstruction.
            for(iIN = 1; iIN <= i; iIN++){
                for(jIN = 1; jIN <= j; jIN++){
                    double sum = 0.0;
                    for(p = 0; p < x[0].length; p++){
                        sum += new euclidean(x[iIN-1][p], x[jIN-1][p]).d();
                    }
                    xydist[iIN][jIN] = Math.sqrt(sum);
                }
            }
            return xydist;
        }
    }
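As a quick worked check of the formula above, the following minimal sketch computes the same squared differences that euclidean.d() produces and then takes the square root; the EuclideanCheck harness is our addition and is not part of the paper's listings:

```java
public class EuclideanCheck {
    // Dij = sqrt( sum of squared element-wise differences )
    static double dist(double[] x, double[] y) {
        double sum = 0.0;
        for (int k = 0; k < x.length; k++) {
            double d = x[k] - y[k];
            sum += d * d;   // squared difference, as in euclidean.d()
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] a = {0.0, 3.0};
        double[] b = {4.0, 0.0};
        System.out.println(dist(a, b)); // 5.0 (a 3-4-5 triangle)
    }
}
```

The abstract distance class in Listing 1 exists so that other metrics (such as the fuzzy Hamming distance below) can be swapped in behind the same d() and GetName() interface.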

Fuzzy Hamming Distance

The drawback of the conventional Hamming distance metric is that it measures absolute differences in bit or alphanumeric strings without indicating or giving credit for "close calls." Often absolute differences are too crude when trying to measure approximate distances in bitmaps or characters. For example, let there be the following bit strings:

    Target: 11110000
    SRC1:   11101000
    SRC2:   11100100

Using the traditional Hamming distance, the distance between the Target and the SRC1 string is the same as between the Target and the SRC2 string: in both cases the distance is 2.

The fuzzy Hamming distance, however, gives credit for being close. In fuzzy Hamming terms, SRC1 is closer to the Target string than is SRC2, in that its "1" needs only to move a single position to exchange with the "0" to make the strings identical. On account of this, its fuzzy Hamming distance is 0.25. SRC2 is a little farther away from the Target string, in that its "1" must move 2 positions before it exchanges with the correct "0" to equal the Target string. It is not twice the distance of SRC1 from the Target, but it is a "little farther" relative to the length of the string, so its fuzzy Hamming distance is 0.375. As the "1" bit wanders farther from the Target string, the distance increases, which begs the usage of fuzzy terms like "nearer" or "farther" rather than absolute terms like "identical" versus "different."

The Java code for calculating a fuzzy Hamming distance is:

Listing 2: Fuzzy Hamming Distance

    import java.util.*;
    import java.math.*;
    import java.lang.Double;

    /**
     * FHamming.java
     * Fuzzy Hamming Distances for Alpha Numeric Fields.
     * The Hamming distance is the sum of character
     * differences between two strings; the fuzzy version
     * discounts near misses by their shift distance.
     * (The class wrapper below is reconstructed; the original
     * class declarations were lost in extraction.)
     */
    class hamming2 {

        //****************************
        /**
         * Shift
         * @param t target String
         * @param s_i source char
         * @param i integer position of source character
         * @param j integer position of target character
         * @return delta shift value
         */
        public double Shift(String t, char s_i, int i, int j){
            double delta = 0;
            for(int x = j; x <= t.length() - 1; ++x){
                if(s_i == t.charAt(x)){
                    delta = i - (x + 1);
                    if(delta < 0) delta = delta * -1;
                    delta = delta / t.length();
                    break;
                }
            }
            return delta;
        }

        // Get minimum of three values
        private double Minimum(double a, double b, double c) {
            double mi = a;
            if (b < mi) { mi = b; }
            if (c < mi) { mi = c; }
            return mi;
        }

        //*****************************
        // Compute Fuzzy Hamming distance
        //*****************************
        /**
         * hamming2
         * Recursively calculates the fuzzy hamming distance
         * from a distance matrix. Inputs: source and target strings.
         * @param s source string
         * @param t target string
         * @return double fuzzy hamming distance
         */
        public double hamming2(String s, String t) {
            double d[][];   // matrix
            int n;          // length of s
            int m;          // length of t
            int i;          // iterates through s
            int j;          // iterates through t
            char s_i;       // ith character of s
            char t_j;       // jth character of t
            double cost;    // cost

            // Step 1
            n = s.length();
            m = t.length();
            if (n == 0) { return m; }
            if (m == 0) { return n; }

            // set the dimensions of the d matrix
            d = new double[n+1][m+1];

            // Step 2
            // set the rows and columns: rows = source; cols = target
            for (i = 0; i <= n; i++) { d[i][0] = i; }
            for (j = 0; j <= m; j++) { d[0][j] = j; }

            // Step 3
            for (i = 1; i <= n; i++) {
                s_i = s.charAt(i - 1);

                // Step 4
                // Here is where the fun begins.
                // Note cost is figured recursively
                for (j = 1; j <= m; j++) {
                    t_j = t.charAt(j - 1);

                    // Step 5
                    if (s_i == t_j) {
                        cost = 0;
                    } else {
                        // HERE IS WHERE WE JAZZ IT UP
                        // LETS ADD SOME SHIFT PARMS AND RAZZLE DAZZLE
                        double tcost = Shift(t, s_i, i, j);
                        if (tcost < 1 && tcost != 0) cost = tcost;
                        else cost = 1;
                    }

                    // Step 6
                    // RECURSIVELY CALCULATE FUZZY HAMMING DISTANCE
                    d[i][j] = Minimum(d[i-1][j] + 1,
                                      d[i][j-1] + 1,
                                      d[i-1][j-1] + cost);
                }
            }
            return d[n][m];
        }
    }

    // Use this code to access the fuzzy hamming distance
    public double FHamming(String target, String source) {
        hamming2 h = new hamming2();
        double out = h.hamming2(source, target);
        return out;
    }

Distance Normalization:

    Groucho: "Who will give me 20 for this lot?"
    Chico: "Sure, I give you 20."
    Groucho: "Okay... how about 50? Who will give me 50?"
    Chico: "Sure, I give you 50... I give you 100... 150... 550...
            I gotta plenty of numbers..."

    From the auction scene in the Marx Brothers' The Cocoanuts

After obtaining a distance of, say, 9.112, what can we say about how close we are to a target, or how similar a set of values is? Not much. Like Chico, we have plenty of numbers, but what do they mean? Raw distances do not convey similarity or closeness to the target in intuitive terms. If instead we can say that we are 99 percent close to the target, or that a set of objects is 99 percent similar, we have a much better idea of what is going on.

Two methods of achieving normalization are:

1. The normalized distance measure:

    Dij(norm) = (Dij - mindistance) / (maxdistance - mindistance)

2. The similarity distance:

    Similarity = (Dij - mindistance) / maxdistance
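The two formulas above can be sketched directly in Java. The DistanceNorm class and its method names are our illustration, not part of the paper's listings:

```java
public class DistanceNorm {

    // Formula 1: normalized distance, (Dij - min) / (max - min)
    static double normalizedDistance(double dij, double min, double max) {
        return (dij - min) / (max - min);
    }

    // Formula 2: similarity distance, (Dij - min) / max
    static double similarity(double dij, double min, double max) {
        return (dij - min) / max;
    }

    public static void main(String[] args) {
        // a raw distance of 9.112 among observed distances in [1, 20]
        System.out.println(normalizedDistance(9.112, 1.0, 20.0));
        System.out.println(similarity(9.112, 1.0, 20.0));
    }
}
```

Scaled this way, a raw distance such as 9.112 becomes a value between 0 and 1 that can be read as a percentage of closeness to the target.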

We use these measurements to indicate the nearest neighbor distances between data objects and elements, as well as the similarity of vectors derived from a combination of distance measurements.

A SMALL EXPERIMENT

In order to test the mechanics of the two dissimilarity or distance measurements, we conducted a very small experiment on a set of 15 cases selected at random from a large medical database of approximately 500,000 records. Given the Herculean task of drawing randomly selected records from this massive database, we chose to use SAS proc sql to extract the data into two smaller sample datasets: one containing approximately 2,500 samples, and another very small one containing 15. All samples were drawn from Los Angeles county. Due to its size and ethnic mix, L.A. presented an excellent representative subsample of the state.

To handle names and variant spellings, we used the Metaphone method developed by Lawrence Philips for coding names, which has been shown to be much superior to Soundex, especially when the database has a large ethnic mix, as ours does.

After recoding the names, we ran our small sample through two scenarios. The first scenario involved unsupervised clustering of data elements, as one would do to find duplicate records in a data set or to find matching data records. The other scenario involved searching the data for specific persons, as one would do using a search engine.

The Unsupervised Clustering of Data Elements

The first experiment tested the two dissimilarity measurement methods in their ability to find nearest neighbor clusters between data elements. In essence, the task is to detect correct clusters from noise or false clusters. We use the Shannon entropy measure to detect the amount of information, as per the formula:

    H(x) = ∑ p(x) log(1/p(x))
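The entropy formula above can be sketched in Java. The Entropy class is our illustration (with a base-2 logarithm assumed, since the paper does not state a base), not part of the paper's listings:

```java
public class Entropy {
    // Shannon entropy: H = sum of p * log(1/p) over a discrete
    // distribution (base-2 logarithm assumed here, so H is in bits)
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {
                h += pi * (Math.log(1.0 / pi) / Math.log(2.0));
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // four equally likely clusters carry 2 bits of information
        System.out.println(entropy(new double[]{0.25, 0.25, 0.25, 0.25})); // 2.0
    }
}
```

A higher entropy score here corresponds to more information recovered by the clustering, which is how the two distance measures are compared below.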

We found that the Euclidean distance outperformed the fuzzy Hamming distance, yielding a combined entropy measure of 1.29 vs. 0.86.

The Supervised Clustering of Data Elements

The second experiment tested the two dissimilarity measurement methods in their ability to find the nearest similarity between a target record and various source records. This task was to detect correct matching cases from noise or false matches. In this instance, we found that the fuzzy Hamming distance outperformed the Euclidean distance, with an entropy measure of 2.3 versus 1.7.

The Next Step: Further Study

The next step is to study the performance of these dissimilarity methods against the larger database of 2,500 records and then against the full data set of 500,000 records. These experiments will give us information on how the two measurements perform in larger scale tasks.

CONCLUSION

We presented two dissimilarity measures used for data clustering and pattern recognition in unsupervised and supervised clustering in a very small experimental study. Our initial findings suggest that both dissimilarity measures have strengths relative to the data mining task at hand. We found that the Euclidean distance metric performed better in unsupervised clustering, while the fuzzy Hamming metric performed better in supervised data mining tasks such as record searching.

ACKNOWLEDGMENTS

The authors wish to thank those in the Open Source community for their products that assisted in the production of this study. The authors also wish to thank William B. Brogden for providing his Metaphone code, which aided in the production of this study.

REFERENCES

Abraham Bookstein, Shmuel Tomi Klein, and Timo Raita, "Fuzzy Hamming Distance: A New Dissimilarity Measure," 12th Annual Symposium on Combinatorial Pattern Matching (Jerusalem: Springer-Verlag).

Lawrence Philips, "Hanging on the Metaphone," Computer Language, Dec. 1990, p. 39.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

David Bell
State of California: Genetic Disease Branch
850 Marina Bay Pkwy.
Richmond, CA
Work Phone: (510) 412-6211
E-mail Address: [email protected]
