Social Dating: Matching and Clustering
SangHyeon (Alex) Ahn                          Jin Yong Shin
3900 North Charles St                         3900 North Charles St
Baltimore, MD 21218, USA                      Baltimore, MD 21218, USA
[email protected]                               [email protected]

Abstract

The purpose of this paper is to apply machine learning algorithms to the social dating problem and to build tools for smarter decision making in the matching process. We used data from the Speed Dating Experiment (21 waves), which includes each candidate's demographic attributes and preferences. We selected the relevant features that provide the most informative data, and used SQL inner joins to assemble complete records without missing feature variables.

We implemented machine learning algorithms (without the use of external libraries) to perform (1) binary classification to predict a date match between a candidate and a potential partner, and (2) clustering analysis on candidates to narrow down a pool of candidates by demographic traits and partner preferences. We then combined the two models to acquire a better prediction model. For our binary label, we assigned 1 for a match between two candidates and −1 for a non-match.

We constructed binary classification models using multiple algorithms, each with varying parameters, and selected the best-performing (highest prediction accuracy) model by tuning the following parameters:

1. I: number of training iterations

2. θ: regularization term

Clustering analysis was conducted under different λ values for λ-means clustering to observe how many clusters we may acquire from the data. Each model's effectiveness and usefulness was also evaluated to verify suitability and validity, using accuracy testing and 5-fold cross-validation.

1 Introduction

Social dating is a stage of an interpersonal relationship between two individuals, with the aim of each assessing the other's suitability as a partner in a more committed intimate relationship or marriage. Today, many individuals spend a great deal of money, time, and effort searching for their true partner. Reasons for the inefficiency of this search include a limited pool of candidates, a lack of transparency in uncovering personalities, and the time-consuming nature of building relationships. Finding the right partner through a machine-driven, automated process can increase efficiency in the following respects: reduced time and effort, a larger pool, and a level of quantitative/qualitative guarantees.

A binary classification model predicting a potential match between a candidate and a partner can significantly improve the social dating process by increasing positive match outcomes. Likewise, clustering candidates who have similar demographic traits and preferences can help narrow down the pool of potential partners for a given candidate.

In this paper, we model a binary classification predictor of a potential match for a candidate, cluster candidates by similar demographic traits and preferences, and combine the two models into a single prediction model. In the final stages, we acquired a model with prediction accuracy near 0.85.

2 Feature Engineering

2.1 Data

The Speed Dating Experiment data (21 waves) comes from a Columbia University study of gender differences in mate selection by Ray Fisman and Sheena Iyengar. The full dataset has approximately 5000 instances (each a 4-minute speed date) with 150 attributes, including id, gender, match, age, race, candidate attributes, partner attributes, candidate preferences, partner preferences, interest ratings, shared interest correlations, income, and so on.

For our model, we selected only the relevant features for training. In addition to the binary label indicating a match, there are five large feature categories in our selected features:

1. Candidate Demographic

2. Candidate Attributes & Preference

3. Partner Demographic

4. Partner Attributes & Preference

5. Interests & Characteristics

Demographic features include gender, age, race, field of study, income range, and career field. Attributes include a candidate's own measures of attractiveness, sincerity, humor, intelligence, and ambitiousness. Preferences include the desired attributes of a potential partner (along the same dimensions as above) as well as the level of shared interests. Interests & Characteristics include the correlation between two candidates' preferred interests and activities, the importance of race, the importance of religious background, frequency of dates, and frequency of going out.

2.2 Refining and Sampling the Data

To create a full dataset with non-missing values and proper feature columns, we constructed a relational database in SQL and performed inner joins to compile the full database described above.

To perform the analysis, we duplicated the database for two scenarios: one for classification using the full data, and another for clustering analysis for a single candidate.

For both databases, we performed 5-fold random sampling, dividing the data into training and test sets at an 8:2 ratio (training: 8, testing: 2) for cross-validation; a sketch of this sampling step follows.
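As an illustration of the sampling step above, the following is a minimal Python sketch (not our original implementation; the function and variable names are illustrative), assuming the joined dataset is held in memory as a list of (feature vector, label) pairs:

import random

def make_folds(dataset, k=5, seed=0):
    # Shuffle once, then cut into k equal test slices; the remainder of
    # each slice is the corresponding training set (an 8:2 split when k=5).
    data = list(dataset)
    random.Random(seed).shuffle(data)
    fold_size = len(data) // k
    folds = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]               # ~2/10 held out
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]    # ~8/10 for training
        folds.append((train, test))
    return folds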
3 Algorithms

We developed two kinds of algorithms: (1) classification models and (2) clustering analysis. Minimal sketches of each algorithm described here follow at the end of this section.

1. Binary classification models explored:

We explored three different classification models for our binary prediction model, which predicts whether a candidate is a potential match for a person.

(a) Margin Perceptron

The perceptron is a mistake-driven online learning algorithm that behaves like a single neuron. The predicted label ŷ_i for an instance x_i is computed as the sign of the dot product between the weight vector w and the instance's feature vector:

    ŷ_i = sign(w · x_i)    (1)

In the training algorithm, w is updated whenever the classifier makes an incorrect prediction (y ≠ ŷ) at each iteration of the training stage. The margin perceptron, however, updates w whenever the margin (the dot product value) is not satisfied for a label prediction, even when the prediction is correct, to ensure that labels are assigned with at least a margin.

(b) SVM (Pegasos)

The Support Vector Machine is a linear classifier using the max-margin principle. An SVM classifies data by constructing a hyperplane in a high-dimensional space that segregates the data while retaining a maximum margin between the hyperplane and the data points. The margin is enforced by the objective function, solved by a QP solver (or the other optimization method described below):

    f(w) = min_w (λ/2) ||w||² + (1/N) Σ_{i=1}^{N} ℓ(w; (x_i, y_i))    (2)

    ℓ(w; (x_i, y_i)) = max(0, 1 − y_i ⟨w, x_i⟩)    (3)

where ⟨·, ·⟩ denotes the inner product of two vectors. Here the bias term is omitted, under the assumption that the hyperplane crosses the origin.

Pegasos (Primal Estimated sub-GrAdient SOlver) is a version of SVM that uses stochastic gradient descent to optimize the objective function stated above. Pegasos takes the sub-gradient of f(w; i) at iteration i, where the learning rate continuously decreases at each iteration (as a function of time) to guarantee convergence. The algorithm follows an online training method with a stochastic optimization step rather than batch optimization.

(c) K-Nearest Neighbors

K-Nearest Neighbors is a classification method that uses the labels of training instances to predict the label of a latent instance by its distances to the training instances (e.g., Euclidean distance). It is an instance-based learning method: the label is predicted from the labels of the nearest neighbors of the latent instance,

    ŷ = argmax_{y'} Σ_{i ∈ X_n} [y_i = y']    (4)

where X_n is the set of nearest training instances.

2. Clustering analysis method explored:

Clustering is performed on all candidates who participated in the Speed Dating Experiment, in order to group people with similar demographic traits and partner preferences into the same cluster.

(a) λ-means

λ-means clustering is an unsupervised learning algorithm based on the Expectation-Maximization algorithm, an iterative method that assigns each instance to its expected cluster (by closest distance) and then maximizes the (expected) likelihood by updating each estimator (cluster mean) with its maximum likelihood estimate at every iteration. λ-means clustering is a modified version of K-means clustering in which λ controls the number of clusters instead of fixing it in advance: whenever an instance's distance to every cluster mean is larger than λ, the algorithm creates a new cluster for it.

E-step:

    k = argmin_j ||x_i − μ_j||, with a new cluster created whenever min_j ||x_i − μ_j|| > λ    (5)

M-step:

    μ_k = (Σ_{i=1}^{n} r_{ik} x_i) / (Σ_{i=1}^{n} r_{ik})    (6)

where r_{ik} is an indicator equal to 1 if the i-th instance belongs to the k-th cluster and 0 otherwise.
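The sketches below are simplified pure-Python re-implementations for exposition, not our original code; all function and parameter names are illustrative. First, the margin perceptron update of (1a), assuming dense feature vectors, labels in {+1, −1}, and a fixed margin of 1.0:

def train_margin_perceptron(data, iterations=10, margin=1.0, lr=1.0):
    # data: list of (x, y) with x a list of floats and y in {+1, -1}.
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(iterations):
        for x, y in data:
            score = sum(wj * xj for wj, xj in zip(w, x))
            # Update on any prediction that fails to clear the margin,
            # including correct predictions too close to the hyperplane.
            if y * score < margin:
                for j in range(dim):
                    w[j] += lr * y * x[j]
    return w

def perceptron_predict(w, x):
    # Equation (1): the sign of the dot product w . x.
    score = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if score >= 0 else -1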
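Next, a sketch of the Pegasos sub-gradient step for objective (2)-(3), assuming labels in {+1, −1}, no bias term, a learning rate of 1/(λt), and sampling one instance per step (the optional projection step of the published Pegasos algorithm is omitted here for brevity):

import random

def train_pegasos(data, iterations=10, lam=1.0, seed=0):
    # data: list of (x, y) with y in {+1, -1}; bias omitted as in equation (2).
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(iterations):
        for _ in range(len(data)):
            t += 1
            eta = 1.0 / (lam * t)            # learning rate decays over time
            x, y = rng.choice(data)          # stochastic, single-instance step
            score = sum(wj * xj for wj, xj in zip(w, x))
            for j in range(dim):
                w[j] *= 1.0 - eta * lam      # shrinkage from the L2 regularizer
            if y * score < 1.0:              # hinge loss (3) is active
                for j in range(dim):
                    w[j] += eta * y * x[j]
    return w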
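A sketch of the K-Nearest Neighbors vote of equation (4), assuming Euclidean distance and a majority vote over the k closest training instances:

def knn_predict(train, x, k=5):
    # train: list of (x, y) with y in {+1, -1}; predict by majority vote, eq. (4).
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda inst: dist2(inst[0], x))[:k]
    votes = sum(y for _, y in nearest)
    return 1 if votes >= 0 else -1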
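Finally, a sketch of the λ-means loop of equations (5) and (6), assuming Euclidean distance; seeding with a single cluster at the first point is our illustrative choice, not necessarily the paper's initialization:

def lambda_means(points, lam, iterations=10):
    # points: list of equal-length float lists; lam: distance threshold.
    means = [list(points[0])]                 # seed with a single cluster
    assignments = [0] * len(points)
    for _ in range(iterations):
        # E-step: nearest mean, or a new cluster when all are farther than lam.
        for i, x in enumerate(points):
            dists = [sum((xj - mj) ** 2 for xj, mj in zip(x, m)) ** 0.5
                     for m in means]
            k = min(range(len(means)), key=dists.__getitem__)
            if dists[k] > lam:                # equation (5): open a new cluster
                means.append(list(x))
                k = len(means) - 1
            assignments[i] = k
        # M-step: each mean moves to the average of its members, equation (6).
        for k in range(len(means)):
            members = [x for x, a in zip(points, assignments) if a == k]
            if members:
                means[k] = [sum(col) / len(members) for col in zip(*members)]
    return means, assignments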
4 Implementation and Method Results

In this section, we discuss the methods and approaches used to select the binary classification model for predicting a potential match between a candidate and an opposing partner. We also give a detailed analysis of our reasoning behind the choice of parameters for both the classification model and the clustering analysis.

[Figure 1: Binary classification prediction accuracy using different algorithms]

[Figure 2: Binary classification prediction accuracy using different algorithms]

4.1 Binary Classification Model

To choose a binary classification model, we conducted 5-fold cross-validation on three different algorithms: (1) Margin Perceptron, (2) SVM (Pegasos), and (3) K-Nearest Neighbors.

Figures 1 and 2 show how the prediction accuracy varies across the binary classification algorithms using default parameters (training iterations = 10 and θ = 1.0). Notice that KNN performs best, while the other two models perform similarly. The plot below shows changes in prediction accuracy at different numbers of training iterations.

4.2 Clustering Analysis

To cluster candidates by demographic traits and preferences, we explored various values of λ and observed the behavior of the data distribution. Figure 4 shows the change in the number of clusters at different λ values. Figure 5 shows the change in Variance of Information (VI) at different numbers of clusters.

[Figure 4: Number of clusters at different λ]

[Figure 5: Variance of Information at different numbers of clusters]
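For reference, the Variance of Information between two clusterings A and B of the same instances is VI(A, B) = H(A) + H(B) − 2 I(A; B). A minimal sketch of this computation follows; comparing the assignments produced by two different λ values is our illustrative assumption, not necessarily the exact comparison behind Figure 5:

from collections import Counter
from math import log

def variance_of_information(a, b):
    # a, b: two cluster assignments over the same instances (lists of labels).
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    h_a = -sum((c / n) * log(c / n) for c in pa.values())     # entropy H(A)
    h_b = -sum((c / n) * log(c / n) for c in pb.values())     # entropy H(B)
    mi = sum((c / n) * log((c / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), c in pab.items())                    # mutual information
    return h_a + h_b - 2.0 * mi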