Social Dating: Matching and Clustering

SangHyeon (Alex) Ahn
3900 North Charles St
Baltimore, MD 21218, USA
[email protected]

Jin Yong Shin
3900 North Charles St
Baltimore, MD 21218, USA
[email protected]

Abstract

Social dating is a stage of an interpersonal relationship between two individuals, with the aim of each assessing the other's suitability as a partner in a more committed intimate relationship or marriage. Today, many individuals spend a great deal of money, time, and effort in the search for their true partners. Some reasons for the inefficiency of seeking partners include a limited pool of candidates, a lack of transparency in uncovering personalities, and the time consumed in building relationships. Finding the right partner through a machine-driven, automated process can increase efficiency in the following respects: reduced time and effort, a larger pool, and a level of quantitative/qualitative guarantees.

A binary classification model that predicts a potential match between a candidate and a partner can significantly improve the social dating process by increasing positive match outcomes. Also, clustering candidates who have similar demographic traits and preferences can help narrow down the pool of potential partners for a given candidate.

In this paper, we model a binary classification predictor of a potential match for a candidate, cluster candidates by similar demographic traits and preferences, and combine the two models into a single prediction model. In the final stages, we acquire a model with prediction accuracy near 0.85. We also conduct clustering analysis under different λ values for λ-means clustering to observe how many different clusters we may acquire from the data.

Each model's effectiveness and usefulness was also evaluated to verify suitability and validity, using accuracy testing and 5-fold cross-validation.

1 Introduction

The purpose of this paper is to apply machine learning algorithms to the social dating problem and to build tools for smarter decision making in the matching process. We used the data from the Speed Dating Experiment (of 21 waves), which includes each candidate's demographic attributes and preferences. We selected relevant features to provide the most informative data and used SQL inner joins to compile a data set without missing feature variables.

We implemented the algorithms (without the use of external libraries) to perform (1) binary classification to predict a date match between a candidate and a potential partner, and (2) clustering analysis on candidates to narrow down a pool of candidates by demographic traits and partner preferences. We further combined the two models to acquire a better prediction model. For our binary label, we assigned 1 for a match between two candidates and −1 for a non-match.

We constructed the binary classification model using multiple algorithms, each with varying parameters, to select the best-performing (highest prediction accuracy) model by tuning the following parameters:

1. I: number of training iterations

2. θ: regularization term

2 Feature Engineering

2.1 Data

The Speed Dating Experiment data of 21 waves is provided by the research paper on gender differences in mate selection from Columbia University, by Ray Fisman and Sheena Iyengar. The entire data set has approximately 5000 instances (each a 4-minute speed date) with 150 different attributes, including id, gender, match, age, race, candidate attributes, partner attributes, candidate preferences, partner preferences, interest ratings, shared interest correlations, income, and so on.

2.2 Refining the Data

To create and collect a full data set with non-missing values and proper feature columns, we constructed a relational database in SQL and performed inner joins to compile a full database as described above.

In order to perform the analysis, we duplicated the database for two scenarios: one for classification using the full data, and another for clustering analysis for a single candidate.

For both databases, we performed 5-fold random sampling, then divided the data into train and test sets at an 8:2 ratio (training: 8, testing: 2) for cross-validation.
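An illustrative Python sketch of this splitting step is given below. It assumes the joined table has already been loaded as a list of instances; the function and variable names are placeholders rather than the names used in our implementation.

```python
import random

def five_fold_splits(instances, train_frac=0.8, seed=0):
    """Build 5 random train/test splits of the joined data at an 8:2 ratio."""
    rng = random.Random(seed)
    splits = []
    for _ in range(5):
        shuffled = instances[:]              # copy, so the original order is kept
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        splits.append((shuffled[:cut], shuffled[cut:]))   # (train, test)
    return splits

# Usage: `data` would be the list of rows produced by the SQL inner join.
# folds = five_fold_splits(data)
```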

For our model, we selected only the relevant features for training. In addition to the binary label indicating a match, there are five large feature categories in our selected features:

1. Candidate Demographics

2. Candidate Attributes & Preferences

3. Partner Demographics

4. Partner Attributes & Preferences

5. Interests & Characteristics

Demographic features include gender, age, race, field of study, income, and career field. Attributes include a candidate's own measures of attractiveness, sincerity, humor, intelligence, and ambitiousness. Preferences include the desirable attributes of a potential partner (along the same dimensions as above) as well as the level of shared interests. Interests & Characteristics include a correlation between two candidates' preferred interests and activities, importance of race, importance of religious background, frequency of dates, and frequency of going out.
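To make the feature layout concrete, the sketch below assembles one instance's feature vector and its +1/−1 label from the five categories. The column names are hypothetical placeholders for illustration only, not the actual attribute names in the Speed Dating data.

```python
def build_instance(row):
    """Assemble a numeric feature vector and a +1/-1 label from one joined row.

    `row` is a dict keyed by column name; the keys below are illustrative only.
    """
    features = [
        # (1) candidate demographics
        row["cand_age"], row["cand_gender"], row["cand_race"],
        # (2) candidate attributes & preferences
        row["cand_attractiveness"], row["cand_pref_intelligence"],
        # (3) partner demographics
        row["part_age"], row["part_gender"], row["part_race"],
        # (4) partner attributes & preferences
        row["part_attractiveness"], row["part_pref_humor"],
        # (5) interests & characteristics
        row["shared_interest_corr"], row["importance_religion"],
    ]
    label = 1 if row["match"] == 1 else -1    # binary label used throughout
    return features, label
```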

3 Algorithms

We developed two different algorithms for (1) the classification model and (2) the clustering analysis.

1. Binary classification models explored:

We explored three different classification models for our binary prediction of whether a potential partner is a match for a candidate.

(a) Margin Perceptron

Margin Perceptron is a mistake-driven online learning algorithm that behaves like a single neuron. The predicted label ŷ_i for an instance x_i is computed as the sign of the dot product between the weight vector w and the instance's feature vector:

ŷ_i = sign(w · x_i)    (1)

In the training algorithm, w is updated whenever the classifier makes an incorrect prediction (y ≠ ŷ) at each iteration of the training stage. Margin Perceptron, however, also updates w whenever the margin (the dot product value) is not satisfied for a prediction, even when the prediction is correct, to ensure labeling with at least a fixed margin.

(b) SVM (Pegasos)

A Support Vector Machine is a linear classifier based on the max-margin principle. SVM classifies data by constructing a hyperplane in a high-dimensional space that separates the data while retaining the maximum margin between the hyperplane and the data points. The margin is enforced by the objective function, solved by a QP solver (or another optimization method, described below):

f(w) = min_w (λ/2)||w||² + (1/N) Σ_{i=1}^{N} l(w; (x_i, y_i))    (2)

l(w; (x_i, y_i)) = max(0, 1 − y_i⟨w, x_i⟩)    (3)

where ⟨·, ·⟩ refers to the inner product of two vectors. Here, the bias term is omitted under the assumption that the hyperplane crosses the origin.

Pegasos is a version of SVM, referring to the Primal Estimated sub-Gradient Solver, which uses stochastic gradient descent to optimize the objective function stated above. Pegasos takes the sub-gradient of f(w; i) at iteration i, where the learning rate continuously decreases at each iteration (as a function of time) to guarantee convergence. The algorithm follows an online-training method with stochastic optimization steps rather than batch optimization; a sketch of the resulting update is given below.
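The following Python sketch shows a minimal Pegasos-style training loop for the objective in Eqs. (2)-(3), with no external libraries. The 1/(reg·t) learning-rate schedule and all names are illustrative assumptions rather than the exact details of our implementation.

```python
import random

def pegasos_train(train_data, reg=1.0, iterations=10, seed=0):
    """Minimal Pegasos-style SGD on the hinge-loss objective of Eqs. (2)-(3).

    `train_data` is a list of (x, y) pairs with x a list of floats and y in {+1, -1};
    `reg` plays the role of the regularization constant. The bias term is omitted,
    matching the assumption that the hyperplane crosses the origin.
    """
    rng = random.Random(seed)
    dim = len(train_data[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(iterations):
        for _ in range(len(train_data)):
            t += 1
            x, y = train_data[rng.randrange(len(train_data))]
            eta = 1.0 / (reg * t)                 # decreasing learning rate
            scale = 1.0 - eta * reg               # shrinkage from the L2 term
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            if margin < 1:                        # hinge loss is active
                w = [scale * wj + eta * y * xj for wj, xj in zip(w, x)]
            else:
                w = [scale * wj for wj in w]
    return w

def predict(w, x):
    """Sign prediction, as in Eq. (1)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```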

(c) K-Nearest Neighbors

The K-Nearest Neighbors algorithm is a classification method that uses the labels of training instances to predict a latent instance's label by its distances to the training instances (e.g., Euclidean distance). K-Nearest Neighbors is an instance-based learning method which predicts a label from the labels of the nearest neighbors of the latent instance:

ŷ = arg max_{y'} Σ_{i ∈ X_n} [y_i = y']    (4)

where X_n denotes the set of nearest training instances.
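A minimal sketch of this majority-vote prediction is shown below; the default k = 5 is purely illustrative and not the value used in our experiments.

```python
def knn_predict(train_data, x, k=5):
    """Predict a label by majority vote among the k nearest training instances (Eq. 4).

    `train_data` is a list of (features, label) pairs with labels in {+1, -1};
    Euclidean distance is used, as described above. k = 5 is an illustrative default.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    neighbors = sorted(train_data, key=lambda inst: sq_dist(inst[0], x))[:k]
    votes = sum(label for _, label in neighbors)    # +1 / -1 labels sum to a vote
    return 1 if votes >= 0 else -1
```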

2. Clustering analysis method explored:

Clustering is performed on all candidates who participated in the Speed Dating Experiment, in order to group people into clusters, where each cluster gathers people with similar demographic traits and partner preferences.

(a) λ-means

λ-means clustering is an unsupervised learning algorithm based on the Expectation-Maximization algorithm, an iterative method that assigns each instance to its expected cluster (by closest distance) and then maximizes the (expected) likelihood by updating the estimator (cluster mean) with the maximum likelihood estimator at each iteration. λ-means clustering is a modified version of K-means clustering in which λ determines the number of clusters instead of fixing it in advance: whenever the distance is larger than λ, the algorithm creates another cluster to assign instances to.

E-step:

k = arg min_j ||x_i − µ^j||, provided ||x_i − µ^k|| ≤ λ    (5)

(otherwise a new cluster is created for x_i).

M-step:

µ^k = Σ_{i=1}^{n} r_ik x_i / Σ_{i=1}^{n} r_ik    (6)

where r_ik is an indicator that has the value 1 if the i-th instance belongs to the k-th cluster and 0 otherwise.
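The sketch below illustrates one possible λ-means loop following the E/M steps above. Initializing with the overall mean and running a fixed number of iterations are simplifying assumptions for illustration, not necessarily the choices made in our implementation.

```python
def lambda_means(instances, lam, iterations=10):
    """Illustrative lambda-means clustering following Eqs. (5)-(6).

    `instances` is a list of numeric feature vectors; `lam` is the distance
    threshold above which a new cluster is created.
    """
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    dim = len(instances[0])
    means = [[sum(x[j] for x in instances) / len(instances) for j in range(dim)]]

    assignments = []
    for _ in range(iterations):
        # E-step: assign each point to the nearest mean, or open a new cluster
        # when even the nearest mean is farther away than lambda.
        assignments = []
        for x in instances:
            k = min(range(len(means)), key=lambda j: dist(x, means[j]))
            if dist(x, means[k]) > lam:
                means.append(list(x))
                k = len(means) - 1
            assignments.append(k)
        # M-step: recompute each cluster mean as the average of its members.
        for k in range(len(means)):
            members = [x for x, a in zip(instances, assignments) if a == k]
            if members:
                means[k] = [sum(x[j] for x in members) / len(members) for j in range(dim)]
    return means, assignments
```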

4 Implementation and Method Results

In this section, we discuss the methods and approaches used to select a binary classification model for predicting a potential match between a candidate and an opposing partner. We also give a detailed analysis of our reasoning behind the choice of parameters for both the classification model and the clustering analysis.

Figure 1: Binary Classification prediction accuracy using different algorithms

Figure 2: Binary Classification prediction accuracy using different algorithms

Figure 4: Number of clusters at different λ

Figure 5: Variance of Information at different numbers of clusters

4.1 Binary Classification Model

For choosing the binary classification model, we conducted 5-fold cross-validation on three different algorithms: (1) Margin Perceptron, (2) SVM (Pegasos), and (3) K-Nearest Neighbors. Figure 1 and Figure 2 show the prediction accuracy of the different binary classification algorithms using default parameters (training iterations = 10 and θ = 1.0). Notice that KNN performs best, while the other two models perform similarly.
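As an illustration of how this comparison can be run, the sketch below averages each model's accuracy over the five random 8:2 splits. It reuses the earlier illustrative functions (five_fold_splits, pegasos_train, predict, knn_predict); the Margin Perceptron would be evaluated in exactly the same way.

```python
def accuracy(predict_fn, test_data):
    """Fraction of test instances whose predicted label equals the true label."""
    return sum(1 for x, y in test_data if predict_fn(x) == y) / len(test_data)

def evaluate_models(data, reg=1.0, iterations=10, k=5):
    """Average accuracy of the Pegasos and KNN sketches over the 5 random splits."""
    scores = {"pegasos": [], "knn": []}
    for train, test in five_fold_splits(data):
        w = pegasos_train(train, reg=reg, iterations=iterations)
        scores["pegasos"].append(accuracy(lambda x: predict(w, x), test))
        scores["knn"].append(accuracy(lambda x: knn_predict(train, x, k=k), test))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}
```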

The plot below shows the change in prediction accuracy at different numbers of training iterations.

Figure 3: Binary Classification at different iterations

Figure 6: Variance of Information at different λ

4.2 Clustering Analysis

For the clustering of candidates by demographic traits and preferences, we explored various values of λ to observe the behavior of the data distribution. Figure 4 shows the change in the number of clusters at different λ values. Figure 5 shows the change in Variance of Information (VI) at different cluster numbers. Figure 6 shows the change in prediction accuracy for the binary classification model using cluster IDs as features, at different cluster numbers.
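For reference, the sketch below computes the standard Variation of Information between two cluster assignments, VI = H(A) + H(B) − 2·I(A;B). It is included purely as a generic illustration of the metric; it does not state which pair of assignments each reported VI value compares.

```python
from collections import Counter
from math import log

def variation_of_information(labels_a, labels_b):
    """Standard Variation of Information between two cluster assignments.

    VI(A, B) = H(A) + H(B) - 2 * I(A; B); lower values mean closer agreement.
    Included as a generic illustration of the metric.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    h_a = -sum((c / n) * log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * log(c / n) for c in count_b.values())
    mi = sum((c / n) * log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
             for (a, b), c in joint.items())
    return h_a + h_b - 2 * mi
```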

4.3 Combined Model

Lastly, we combined the two models illustrated above to increase the performance of our prediction model. We used the cluster assignments as an additive feature dimension in the classification model, under each of the three classification algorithms.

Figure 7 shows the change in prediction accuracy when different numbers of clusters are used for the additive feature dimension. Figure 8 shows the change in prediction accuracy at different iteration numbers. Figure 9 shows the change in prediction accuracy at different regularization constants.
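A minimal sketch of this combination step is shown below: each instance's feature vector is extended with the corresponding cluster assignment before retraining the classifier. The mapping name `candidate_cluster` and the choice to append the raw cluster ID as a single extra dimension are illustrative assumptions based on the phrase "additive feature dimension".

```python
def add_cluster_feature(instances, candidate_ids, candidate_cluster):
    """Append a lambda-means cluster ID to each feature vector as one extra dimension.

    `instances` is a list of (features, label) pairs, `candidate_ids` gives the
    candidate behind each instance, and `candidate_cluster` maps candidate id to
    a cluster ID (e.g. from the lambda_means sketch). All names are illustrative.
    """
    extended = []
    for (features, label), cid in zip(instances, candidate_ids):
        extended.append((features + [float(candidate_cluster[cid])], label))
    return extended

# The extended instances are then fed back into the classifiers exactly as before,
# which is the idea behind the combined-model accuracies reported below.
```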

Figure 7: Combined Model prediction accuracy with an additive feature dimension at different numbers of clusters

Figure 8: Combined Model prediction accuracy at different iteration numbers

Figure 9: Combined Model prediction accuracy at different iterations and regularization constants

5 Evaluation and Analysis

As mentioned above, we performed 5-fold cross-validation by randomly sampling the data into 5 different sets of train and test sets (8:2 ratio) to account for overfitting issues.

From the results presented above, we can draw the following observations:

1. Binary Classification Model

At the default parameter settings, the resulting prediction accuracies are the following:

(a) KNN: 0.816

(b) Pegasos: 0.785

(c) Margin Perceptron: 0.785

We observe that KNN performs best in terms of prediction accuracy, and that the other two algorithms behave similarly. However, we also observed that KNN runs much more slowly than the other two algorithms.

2. Clustering Analysis

We observe the following properties (number of unique clusters and VI) at different λ values:

(a) λ = 500: 23 unique clusters / VI: 3.27

(b) λ = 700: 15 unique clusters / VI: 3.60

(c) λ = 900: 11 unique clusters / VI: 3.87

(d) λ = 1100: 8 unique clusters / VI: 3.94

When using cluster IDs as features for the binary classification model, we observe that KNN seems to perform best; however, this is because there are only two features for calculating distances, and because there are many more no-match labels (−1) than match labels (1), so predicting every instance as a non-match already yields good prediction accuracy.

3. Combined Model

In the combined model, we observe a noticeable contribution to the prediction model from the additive feature dimension over cluster IDs. We observe the following (best) prediction accuracies at each number of clusters, where Pegasos and Margin Perceptron outperform KNN:

(a) 23 unique clusters: 0.844 (Pegasos) | 0.832 (Margin Perceptron)

(b) 15 unique clusters: 0.830 (Pegasos) | 0.838 (Margin Perceptron)

(c) 11 unique clusters: 0.852 (Pegasos) | 0.852 (Margin Perceptron)

(d) 8 unique clusters: 0.843 (Pegasos) | 0.844 (Margin Perceptron)

Generally speaking, both algorithms perform best when the number of clusters equals 11, which occurs at λ = 900. In terms of algorithmic usage, the Pegasos algorithm, as a version of SVM, enables us to tune parameters including the regularization term, while Margin Perceptron does not. Hence, Pegasos gives us more flexibility: we can change the learning rate as well as the regularization term to increase prediction accuracy in the future.

Therefore, from our experiments and observations, we chose the Pegasos algorithm with clustering analysis at λ = 900, which gives a feature dimension of 11 different cluster ID assignments. The optimal number of iterations for this algorithm was 25. The regularization factor did not seem to greatly affect predictability. The resulting prediction accuracy is 0.852 for correctly predicting a match between a candidate and a potential partner.

6 Conclusion

In conclusion, we successfully modeled and chose a prediction model using the Pegasos Support Vector Machine (a supervised binary classification algorithm) and λ-means clustering (an unsupervised clustering algorithm), and then compiled a combined model using cluster IDs as features.

The tuning parameters were each chosen from the results presented in Section 4:

1. Pegasos: iterations = 25, regularization (θ) = 1.0

2. λ-means: λ = 900

3. Combined Model: added feature dimension on cluster assignments by λ-means clustering

Our final combined model gives a prediction accuracy of 0.852, which we believe is a strong predictor.

Acknowledgments

We thank Professor Mark Dredze of Johns Hopkins University for supervising our project and helping us formulate the project idea using machine learning algorithms.

References

Mark Dredze. 2016. Introduction to Machine Learning. Baltimore, MD.

Ray Fisman and Sheena Iyengar. 2005. Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment. New York, NY.

Appendix