Modern Applied Science; Vol. 13, No. 10; 2019
ISSN 1913-1844  E-ISSN 1913-1852
Published by Canadian Center of Science and Education

Scaled Pearson's for Evaluating Text Similarity Measures

Issa Atoum¹
¹ Faculty of Information Technology, The World Islamic Sciences and Education, Amman, Jordan
Correspondence: Issa Atoum, Faculty of Information Technology, The World Islamic Sciences and Education, Amman, Jordan. E-mail: [email protected]

Received: August 26, 2019 Accepted: September 12, 2019 Online Published: September 16, 2019 doi:10.5539/mas.v13n10p26 URL: https://doi.org/10.5539/mas.v13n10p26

Abstract
Despite the ever-increasing interest in the field of text similarity, the development of adequate text similarity methods is lagging. Some methods are good at entailment, while others are reasonable at estimating the degree to which two texts are similar. Very often, these methods are compared using Pearson's correlation; however, Pearson's correlation is sensitive to outliers that can affect the final correlation coefficient. As a result, the Pearson correlation is inadequate for finding which text similarity method is better in situations where items are very similar or unrelated. This paper borrows the scaled Pearson correlation from the finance domain and builds a metric that can evaluate the performance of similarity methods over cross-sectional datasets. Results showed that the new metric is fine-grained with respect to the benchmark dataset scores and is a promising alternative to Pearson's correlation. Moreover, extrinsic results from applying the System Usability Scale (SUS) to the scaled Pearson correlation revealed that the proposed metric is gaining attention from scholars, which supports its usage in academia.
Keywords: Pearson, semantic similarity, performance metric, correlation metrics

1. Introduction
Semantic Textual Similarity (STS) determines the degree to which two texts are similar. It is an active research area, part of which takes place in the SemEval workshop series (Agirre et al., 2012; D. M. Cer, Diab, Agirre, Lopez-Gazpio, & Specia, 2017). STS has many applications in automatic essay grading (Ratna, Luhurkinanti, Ibrahim, Husna, & Purnamasari, 2018), anti-plagiarism detection (Abdi, Idris, Alguliyev, & Aliguliyev, 2015; Meuschke, Siebeck, Schubotz, & Gipp, 2017), automated text summarization (Fang, Mu, Deng, & Wu, 2017), web page duplication detection (Manku, Jain, & Das Sarma, 2007), and other domain-specific tasks (Atoum, 2018, 2019; Ayyagari & Atoum, 2019).
Typically, an STS system computes the similarity between two texts as a score between 0 and 5, where 0 indicates dissimilarity and 5 indicates equivalence. Consequently, the relationship between human rating scores and STS system scores is used as the foundation for STS system selection, often using the Pearson correlation (e.g., Šarić, Glavaš, Karan, Šnajder, & Bašić, 2012). The Pearson correlation finds the degree of association between the STS system scores and the human scores as a value in the range of −1 to +1. When the magnitude of the value is close to 1, it implies a high correlation with the human rating; consequently, the similarity method is considered promising.
The performance of text similarity measures depends on many factors, including the underlying semantic features, which may be affected by the domain of the text, the length of the text, and the text-similarity algorithm itself. These factors and many others can inject noise into the regression line of the Pearson correlation. Under such noise conditions, extracting the correlation coefficient between two sets of stochastic variables is not trivial (Moriya, 2008). One major problem of the Pearson correlation is that outliers influence the slope of the regression line and, accordingly, the value of the correlation coefficient. The Pearson correlation uses the means and standard deviations of the scores regardless of the STS task under consideration. Therefore, it is possible to obtain a high Pearson correlation for scores even though there is no actual relationship between the system outputs and the human judgments.
This problem is consistent with Anscombe's quartet (Anscombe, 1973). According to Anscombe's quartet, the same Pearson correlation score might be obtained even when the dataset shows a visually nonlinear relationship due to outliers, as shown in Figure 1.
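As a concrete illustration of this sensitivity (a minimal sketch with invented scores, not taken from the paper's experiments), the snippet below shows how a single outlying pair can noticeably change Pearson's r between hypothetical human and system scores:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical human ratings and system scores on a 0-5 scale (illustrative only).
human = np.array([0.5, 1.0, 1.2, 1.5, 2.0, 2.2, 2.8, 3.0, 3.5, 4.0])
system = np.array([0.6, 0.9, 1.4, 1.3, 2.1, 2.0, 2.9, 3.2, 3.3, 4.1])
print(pearsonr(human, system)[0])          # close agreement -> r near 0.99

# Replace one score with an outlier: the single point drags the regression line
# and lowers the coefficient even though the other nine pairs are unchanged.
system_outlier = system.copy()
system_outlier[0] = 5.0
print(pearsonr(human, system_outlier)[0])  # noticeably lower r
```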


Figure 1. Effect of outliers on Pearson's correlation
Figure 2. Effect of similarity method on Pearson's correlation

Figure 3. LSA method on STS-65 Dataset (r=0.84)
Figure 4. Li method on STS-65 Dataset (r=0.81)

This research aims to discover an STS system that works for every group of dataset pairs. A useful benchmark dataset has adequate examples of numerous text pairs, including pairs that are dissimilar, moderately similar, and almost similar. These groups could resemble the STS tasks of web page duplication detection, automatic essay grading, and text summarization, respectively. Therefore, it is essential to map the STS task to a suitable performance metric (such as the Pearson correlation). Figure 2 shows four different systems, named a) perfect_for_low, b) perfect_for_medium, c) perfect_for_high, and d) perfect_for_all, which resemble similarity measures that work perfectly for low-similarity pairs, moderately similar pairs, highly similar pairs, and all cases, respectively. The set of human pairs was constructed for illustration purposes. Figure 2a shows the case where the low-similarity pairs match the target similarity method, i.e., the first three pairs are close to the human ratings; the same applies to Figures 2b-2c. The optimal goal is to find a method that works well on all data pairs, similar to Figure 2d. Therefore, based on the Pearson correlation alone, it is not clear whether a similarity measure (STS system) favors similar, dissimilar, or moderately similar text pairs. Relying on the Pearson correlation is thus misleading (Reimers, Beyer, & Gurevych, 2016) and is inappropriate for every semantic similarity task (Wilson & Mihalcea, 2017).
The problem of the similarity groups is further illustrated with the STSS-65 dataset, which consists of 30 text pairs (Li, McLean, Bandar, O'Shea, & Crockett, 2006). The human ratings shown in Figure 3 and Figure 4 demonstrate that the first 12 pairs are dissimilar, the 13th to 20th pairs are moderately similar, and the 21st to 30th pairs are almost similar. Therefore, some similarity measures are reasonable for measuring low-similarity texts, while others perform significantly better on high-similarity texts. The LSA measure of O'Shea et al. (2008), shown in Figure 3, reveals similarity scores of more than 0.50 for all compared text pairs. Therefore, the reported LSA method is not suitable for low-similarity texts, as the figure shows that LSA produces higher scores than expected. The reported percentage absolute relative error of LSA against the human ratings ranges from 53-80%; accordingly, the author calls the LSA method an overestimating method that gives a moderate Pearson correlation.

Figure 4 shows the behavior of the Li method (Li et al., 2006); it exhibits roughly the same symptom as the LSA method, with a percentage absolute relative error against the human ratings ranging from 61-98%. Therefore, these methods suffer from the Anscombe's quartet problem, and an alternative metric should tackle their outliers.
Although there are many alternatives to the Pearson correlation, they are domain-specific (Bogolyubova, Panicheva, Tikhonov, Ivanov, & Ledovaya, 2018; Dikbaş, 2018; Smarandache, 2009; Wu, Pang, Lin, & Pei, 2013). The Spearman rank correlation is more robust than Pearson's; however, Pearson remains the most popular correlation measure (Serin, Nijveen, Hilhorst, & Ligterink, 2016). Compared to Pearson's correlation, Kendall's Tau is robust to outliers (M. Kendall & Gibbons, 1990); however, it suits data that is ranked by nature. Several evaluation methods were combined in (Reimers et al., 2016); however, they concentrate on the STS systems themselves instead of on particular properties of the benchmark dataset.
Therefore, to solve the problem, the author borrows the scaled Pearson correlation from the finance domain (Nikolić, Mureşan, Feng, & Singer, 2012). The scaled Pearson correlation is used to reveal correlations between test and human scores in various data segments. If the dataset is arbitrarily grouped into three segments, three Pearson correlation scores are obtained. The author averages the scores over the segments, thereby obtaining a single scaled Pearson score. For the overall computation over many datasets with different distributions, Fisher z-scores (Fisher, 1921) are used to convert the scaled Pearson scores into a single Pearson correlation score, providing one quantitative value that can be used to compare datasets of different sizes. The scaled Pearson score reduces the influence of outliers, and therefore this research can evaluate an STS system over different score ranges (different STS tasks). The proposed metric can be used to identify the effectiveness of a similarity method over a specific semantic task, consequently discovering significant pitfalls of semantic similarity methods and allowing better STS method selection.
The remainder of the paper is structured as follows. Section 2 summarizes the related works. Section 3 explains the proposed metric. Section 4 evaluates the proposed metric, Section 5 discusses the implications, and Section 6 concludes the paper.

2. Related Works
This section covers the Pearson correlation as applied to the semantic similarity domain, ranking methods, error methods, and information retrieval methods.

2.1 Pearson Correlation
Pearson's correlation coefficient, r, has many formulas (Lee Rodgers & Nicewander, 1988). Equation (1) shows one of them, obtained by substituting estimates of the covariance and variances from the STS dataset. A benchmark dataset has a set of scores {g1, ..., gn} that represent the human scores for a list of text pairs {(P1, Q1), ..., (Pn, Qn)}, and another list {t1, ..., tn} containing the n scores obtained from a text similarity method.

r = \frac{\sum_{i=1}^{n}(g_i - \bar{g})(t_i - \bar{t})}{\sqrt{\sum_{i=1}^{n}(g_i - \bar{g})^2}\;\sqrt{\sum_{i=1}^{n}(t_i - \bar{t})^2}}    (1)

where n is the number of text pairs, g_i and t_i are the i-th scores of the gold standard and the test method, respectively, and \bar{g} and \bar{t} are the means of the gold-standard and test scores.

2.2 Ranking Methods
The Spearman correlation is a non-parametric measure (Spearman, 1904) that assesses the relationship between two sets of scores as a monotonic function. Although the Spearman coefficient is appropriate for both linear and non-linear relationships, it is less adopted in the semantic similarity domain. The Kendall tau rank correlation coefficient (M. G. Kendall, 1938) measures the portion of ranks that match between two data sets; however, it is suitable for data that is ranked by nature (M. Kendall & Gibbons, 1990). The normalized Cumulative Gain (nCG) can be used to evaluate the ranking quality of STS scores (Järvelin & Kekäläinen, 2000). The normalized Discounted Cumulative Gain (nDCG) applies a discount factor to the normalized Cumulative Gain (Kekäläinen, 2005): it measures the value of a document based on its position in the result list, accrued from the top of the result list to the bottom, whereas the nCG does not take the position of a result into account when judging the usefulness of a result set. An enhanced version of nCG resolves the problems of rank ties and the lower bound (Katerenchuk & Rosenberg, 2018). Since semantic similarity scores are scaled, ordinal ranking methods (e.g., Goodman & Kruskal, 1979) are not suitable for STS tasks. Moreover, rank methods suffer from ties when more than one value gets the same rank.
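To make the contrast between these correlation measures concrete, the following sketch (with illustrative scores only, not taken from the paper's experiments) computes the Pearson, Spearman, and Kendall coefficients for the same pair of score lists using SciPy:

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical gold-standard and system scores on the 0-5 STS scale.
gold = [0.2, 0.8, 1.5, 2.4, 3.1, 3.8, 4.5, 5.0]
test = [0.9, 1.1, 1.6, 2.0, 3.4, 3.5, 4.8, 4.9]

print("Pearson :", pearsonr(gold, test)[0])     # linear association
print("Spearman:", spearmanr(gold, test)[0])    # monotonic (rank-based) association
print("Kendall :", kendalltau(gold, test)[0])   # proportion of concordant rank pairs
```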


2.3 Error Methods
One of the most cited error methods is the absolute error, which calculates the magnitude of an error in an observation. An extension of the absolute error is the relative error, which expresses how large the absolute error is compared with the total size of the data. Relative errors are usually expressed as a percentage comparing observed values to actual values, such as the mean absolute percentage error (MAPE), a percentage measure of prediction accuracy. Relative error methods are easy to use, easy to understand, and domain-independent (Wang & Bovik, 2009); however, they suffer from an undefined mean and infinite variance.

2.4 Information Retrieval Methods
Traditional information retrieval methods such as accuracy, precision, and F-measure could be used for evaluating the effectiveness of STS systems; however, it is unclear how to compute them in the STS domain. The evaluation requires knowing the maximum and minimum values of the compared STS systems and the arbitrary boundaries of the grouped bins (Reimers et al., 2016).

3. Proposed Metric
The scaled Pearson correlation was initially used in the finance application (Nikolić et al., 2012). Equation (2) defines the number of segments (K) that fit into the total number of text pairs (N) for a given number of cases per segment (n). Accordingly, the scaled Pearson correlation (\bar{r}_n) over all K text segments is given by Equation (3), where r_k is Pearson's coefficient of correlation for segment k. The best number of segments K depends on the distribution of the dataset scores and the required level of confidence. If the objective is to split the dataset into three cross-sections, then K is assigned the value three, and the number of cases n is the number of test pairs in each segment or bin. Mathematically, the number of bins can grow from 2 to N − 1; however, it is essential to split the dataset based on the STS task under consideration and to provide high-confidence Pearson correlations r_k such that K is an integer value.
Equation (3) provides one correlation per dataset; therefore, a distribution-independent way to compare more than one dataset is needed, and the Fisher z-transformation (Fisher, 1921) is used. The Fisher transformation maps the distribution of Pearson's r (for each dataset) onto a scale on which it is approximately normally distributed. This research converts the scaled Pearson scores to Fisher z-scores; the z-scores are then averaged and converted back to a Pearson score to provide a quantitative, unbiased comparison between datasets of different sizes.

K = \frac{N}{n}    (2)

\bar{r}_n = \frac{1}{K}\sum_{k=1}^{K} r_k    (3)

z = \frac{1}{2}\ln\frac{1+\bar{r}_n}{1-\bar{r}_n} = \operatorname{arctanh}(\bar{r}_n)    (4)

\bar{r}_n = \tanh(z)    (5)

Equation (4) shows how the scaled Pearson correlation is converted to a z-score, where ln is the natural logarithm function and arctanh is the inverse hyperbolic tangent function. Equation (5) converts the result back to a Pearson correlation by taking the tanh of the z-score. As a result, we have one scaled Pearson value for comparison.
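For clarity, a minimal sketch of Equations (2)-(5) is given below (an illustrative implementation, not the author's code); it splits paired score lists into equal-size segments, averages the per-segment Pearson correlations, and uses the Fisher transformation to combine scaled scores across datasets:

```python
import numpy as np
from scipy.stats import pearsonr

def scaled_pearson(gold, test, n):
    """Eq. (2)-(3): split the N pairs into K = N / n consecutive segments of n cases,
    compute Pearson's r inside each segment, and return the average of the K values."""
    gold, test = np.asarray(gold, float), np.asarray(test, float)
    K = len(gold) // n                        # Eq. (2); assumes n divides N evenly
    r_k = [pearsonr(gold[k*n:(k+1)*n], test[k*n:(k+1)*n])[0] for k in range(K)]
    return float(np.mean(r_k))                # Eq. (3)

def combine_scaled_scores(r_values):
    """Eq. (4)-(5): convert each scaled score to a Fisher z-score (arctanh),
    average the z-scores, and map the mean back to a correlation with tanh."""
    z = np.arctanh(np.asarray(r_values, float))
    return float(np.tanh(z.mean()))
```

For instance, calling scaled_pearson on one dataset's human and system scores yields the per-dataset scaled score, and combine_scaled_scores can then merge the scaled scores of several datasets into the single figure used for comparison.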
4. Evaluation and Discussion
4.1 Evaluation Methodology
Figure 5. Evaluation Methodology


Figure 5 shows the evaluation methodology used in this research. The methodology consists of preprocessing, calculating similarity, ranking the similarity methods, and an extrinsic evaluation of the proposed scaled Pearson correlation. The author uses two primary datasets in this experiment: the STS benchmark (D. M. Cer et al., 2017) and the SICK (Sentences Involving Compositional Knowledge) dataset. The STS benchmark dataset comprises a selection of the English datasets used in the STS tasks organized in the context of the SemEval series between 2012 and 2017; the selection includes text from image captions, news headlines, and user forums. The SICK dataset (Marelli et al., 2014) consists of about 10,000 English sentence pairs, generated from two existing sets: the 8K ImageFlickr dataset and the SemEval 2012 STS MSR-Video Description dataset. This research selects the DEV and TEST portions of SICK (5427 sentences). The calculate-similarity step of the evaluation methodology scores the selected methods based on their underlying techniques. The methods are then ranked using the intrinsic rankings of Pearson, Mean Absolute Error, Spearman, nDCG, and RankDCG. The extrinsic ranking is accomplished based on the usability of the proposed scaled Pearson method.

4.2 Selected Text Similarity Methods
We applied the proposed metric to a set of text similarity methods that use different approaches (Arora, Liang, & Ma, 2017; Atoum & Otoom, 2016; Conneau, Kiela, Schwenk, Barrault, & Bordes, 2017; Kusner, Sun, Kolkin, & Weinberger, 2015; Li et al., 2006; Zarif et al., 1994). The Word Mover's Distance (WMD) uses the word embeddings of the words in two texts to measure the minimum amount that the words in one text need to travel in semantic space to reach the words of the other text (Kusner et al., 2015). The Smooth Inverse Frequency (SIF) method gives less weight to uninformative words: word embeddings are weighted based on the estimated relative frequency of a word in a reference corpus, and a common component removal technique is applied (Arora et al., 2017). InferSent (INF for short) is a sentence embedding trained on fastText vectors by Facebook research (Conneau et al., 2017); it is a BiLSTM with max-pooling that was trained on the SNLI dataset of 570k English sentence pairs labeled with one of three categories: entailment, contradiction, or neutral. Similar to InferSent, the Google Sentence Encoder (GSE) provides sentence embeddings based on trained deep-learning semantic vectors (D. Cer et al., 2018). Traditional semantic similarity methods calculate the semantic similarity of two sentences using information from a structured lexical database and corpus statistics (e.g., Atoum & Otoom, 2016; Atoum, Otoom, & Kulathuramaiyer, 2016; Li et al., 2006).
The experiments exclude stopwords from the datasets using the nltk list of stop words. For GSE, the paper uses the Encoder 2 from the Google TensorFlow Hub. The author uses the pre-trained GloVe word vectors (840B tokens × 300 dimensions) from Stanford and the fastText word vectors (W2V, 2 million word vectors × 300 dimensions). Table 1 shows the Pearson correlations of the compared methods, where GSE and InferSent are the leaders while the Li method (Li et al., 2006) lags behind. According to the results, the choice of GloVe or W2V pre-trained word vectors affects the performance of the SIF and WMD methods by 2-4%. The author argues that the Pearson correlation is not enough for method selection.
Therefore, this research ran and evaluated the same methods on other ranking metrics, as shown in Figure 6.
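As an illustration of the SIF-style weighting described above, the sketch below weights word vectors by a/(a + p(w)) and removes the common component from the averaged sentence vectors. It is a rough sketch under assumed inputs: the word_vec and word_count dictionaries are hypothetical stand-ins for pre-trained vectors and corpus counts, and this is not the implementation used in the paper.

```python
import numpy as np

def sif_embeddings(sentences, word_vec, word_count, a=1e-3):
    """Sketch of SIF (Arora et al., 2017): weight each word vector by a / (a + p(w)),
    average per sentence, then subtract the projection on the first principal
    component of the sentence matrix (the "common component")."""
    total = float(sum(word_count.values()))
    rows = []
    for sent in sentences:
        words = [w for w in sent.lower().split() if w in word_vec]
        weights = [a / (a + word_count.get(w, 0) / total) for w in words]
        rows.append(np.average([word_vec[w] for w in words], axis=0, weights=weights))
    X = np.vstack(rows)
    # First right singular vector of the centered matrix = direction of the common component.
    u = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)
```

With a realistic number of sentences, the cosine similarity between rows of the returned matrix would serve as the sentence similarity score; the common-component removal only becomes meaningful when many sentences are embedded at once.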


4.3 Intrinsic Evaluation

Table 1. Pearson Correlation of Compared Methods

SICK Dataset
Method       PearsonDev   PearsonTest
GSE          0.84         0.82
INF          0.77         0.76
SIF+W2V      0.73         0.73
SIF+GLOVE    0.70         0.72
WMD+GLOVE    0.64         0.64
WMD+W2V      0.64         0.64
Li           0.62         0.62

STS Dataset
GSE          0.79         0.76
INF          0.78         0.71
SIF+W2V      0.77         0.69
SIF+GLOVE    0.76         0.68
WMD+W2V      0.72         0.66
WMD+GLOVE    0.71         0.62
Li           0.70         0.59

Note. Dev = Development Dataset; Test = Test Dataset.

Figure 6. Mean Absolute Error between Pearson and other ranking methods


Table 2. Scaled Pearson Correlations for STS and SICK datasets

SICK Dataset
Method       PearsonDev   PearsonTest
GSE          0.50         0.42
INF          0.44         0.41
SIF+W2V      0.42         0.41
SIF+GLOVE    0.51         0.35
WMD+GLOVE    0.46         0.50
WMD+W2V      0.46         0.50
Li           0.46         0.40

STS Dataset
GSE          0.46         0.46
INF          0.45         0.38
SIF+W2V      0.41         0.36
SIF+GLOVE    0.41         0.37
WMD+W2V      0.39         0.35
WMD+GLOVE    0.38         0.32
Li           0.29         0.25

Note. Dev = Development Dataset; Test = Test Dataset.

Figure 6 shows the Mean Absolute Error (MAE) between the ranks of the methods presented in Table 1 (using Pearson) and the ranks of the same methods under the other ranking methods discussed in Section 2.2. The STS methods in Table 1 could receive a different rank if another evaluation metric were used. The nearest rank was obtained with the Spearman metric, and the highest error was with the RankDCG method. Moreover, the ranks differed across datasets. Therefore, Spearman could be used as an auxiliary evaluation metric to the Pearson correlation; however, it does not give enough information about STS tasks (dataset distributions).
Based on the different types of STS tasks and the current datasets, each benchmark dataset was split into three bins, where the first bin represents the set of dissimilar text pairs, the middle bin represents moderately similar text, and the last bin contains highly similar or equivalent text pairs (a short sketch of this binning follows the example pair below). For the STS-TEST and STS-DEV datasets, bin number one contains text pairs with human scores lower than 1.66, bin number two contains scores between 1.66 and 3.33, and bin number three contains scores higher than 3.33. The bin coverages for the STS-TEST and STS-DEV datasets were (0.295, 0.318, 0.387) and (0.355, 0.344, 0.301), respectively. For the SICK-TEST and SICK-DEV datasets, three bins were used according to the dataset labels (contradiction, neutral, entailment). The bin coverages for the SICK-TEST and SICK-DEV datasets were (0.148, 0.288, 0.564) and (0.146, 0.287, 0.567), respectively.
Table 2 shows the scaled Pearson correlations. All reported scores were significant (p < 0.05). The scores are almost 30% lower than the reported Pearson scores, which indicates the effectiveness of the proposed metric in removing the influence of outliers. Based on the scaled correlations, the rank of the similarity methods could be used for a general-purpose STS task; however, it is essential to use methods that complement each other on a specific STS task. Selecting the best metric is context-dependent on the similarity task.
Figure 7 and Figure 8 show the scaled Pearson of each method over the three bins for the combined test and training datasets. The author converted the scaled Pearson scores of the test and training sets to z-scores; the z-scores were then averaged and converted back to Pearson scores. Most of the compared similarity methods on the STS dataset obtain low scaled Pearson scores on the moderately similar text, due to either overestimating or underestimating the scores. A similar symptom was also observed for the SICK dataset in the first, low-similarity bin. The figures show that although GSE obtained the highest Pearson score, it gets a lower scaled score because it overestimates text pairs. To examine GSE's performance, the author ran the experiment on the SemEval 2012-2017 benchmark development dataset, a list of 1500 sentence pairs (D. M. Cer et al., 2017). This research found that 139 sentence pairs (around 9%) that were scored zero by humans are overestimated by Google's encoder (0.644 on average). A sentence pair from the benchmark dataset is shown:
"3 killed, 4 injured in Los Angeles shootings" / "Five killed in Saudi Arabia shooting."
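The bin-based computation described above can be sketched as follows (an illustrative helper with hypothetical score arrays, not the paper's code); it assigns each pair to a bin by its human score, computes Pearson's r inside each bin, and averages the bin correlations into one scaled score. The threshold values follow the bin boundaries described above:

```python
import numpy as np
from scipy.stats import pearsonr

def task_binned_scaled_pearson(gold, test, edges=(1.66, 3.33)):
    """Split pairs into dissimilar / moderate / similar bins by the human-score
    thresholds, compute Pearson's r inside each bin, and average the bin values."""
    gold, test = np.asarray(gold, float), np.asarray(test, float)
    bin_ids = np.digitize(gold, edges)      # 0: below 1.66, 1: 1.66-3.33, 2: above 3.33
    r_per_bin = [pearsonr(gold[bin_ids == b], test[bin_ids == b])[0] for b in range(3)]
    return r_per_bin, float(np.mean(r_per_bin))
```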


Figure 7. Scaled Pearson Correlations-STS
Figure 8. Scaled Pearson Correlations-SICK

Figure 9. Scaled Pearson Correlations (STS, SICK)
Figure 10. Ranked Scaled Pearson Correlation

Figure 11. Overall Scaled Pearson's Ranked Systems

The sentence-pair score is 0.89, 89% higher than the human score, due to the existence of one matching word (killed) and a related word (injured). The SIF method, however, gets a better scaled Pearson score since it depends on the existing words rather than on sentence embeddings alone. Therefore, if the STS task is related to detecting dissimilar text, then GSE or even the WMD method will do the job.
The scaled Pearson performance on the SICK dataset is relatively different. The SICK dataset involves the difficulties of active and passive sentences, the impact of negation, determiners, and other grammatical elements; therefore, it is considered more difficult than STS. Moreover, this research's experiments target similarity, not the relatedness and entailment described in the SICK dataset. Furthermore, during the experiment, this research excluded stop words with no lexical analysis. Therefore, the SICK scores are affected in the "contradiction" bin, as shown in Figure 8.
Figure 9 combines the results of the STS and SICK datasets, where the scores in Figure 7 and Figure 8 are converted to z-scores, averaged, then converted back to Pearson scores. The figure shows that it is hard for systems to identify dissimilar pairs, as shown in the first bin, while most of the time they obtained relatively good results when text pairs are similar (bin number 3).

This paper also ran the previous experiment over the scaled Pearson ranks instead of the scaled Pearson scores, to ensure that scores are not lost during the conversion from Pearson to Fisher z-scores and back. The paper reports the rank of the scaled Pearson averaged over the four datasets, as shown in Figure 10. Scores were normalized between 0 and 1 to enable a proper comparison. The figure shows relatively similar behavior to Figure 9. Figure 11 combines the scaled Pearson correlations in Figure 9 into one Pearson correlation figure per method; the scores in Figure 9 were first converted to z-scores, then averaged and converted back to a Pearson correlation. Notably, all methods obtained relatively low scaled Pearson correlations, which indicates that the methods were not doing well in all cases, as illustrated by Figures 7-9. Therefore, contrary to previous scholars' findings that such methods achieve high correlations, as shown in Table 1 and Table 2, they still need improvement to provide more accurate results in the medium-similarity region of the STS dataset and the low-similarity region of the SICK dataset. Moreover, although the WMD method did not get a high Pearson correlation, it gets a relatively high scaled Pearson correlation, which implies that the method was doing well given its underlying word embeddings. Consequently, the proposed metric is capable of detecting overestimation or underestimation of similarity in compared text pairs; therefore, scholars could replace their performance metrics with the proposed scaled Pearson metric if they want to rank similarity methods over cross-sectional datasets.

4.4 Extrinsic Evaluation
Although the scaled Pearson metric gives interpretable results, it is essential to evaluate the scaled Pearson metric when applied in the STS SemEval context. Since the scaled Pearson has not yet been adopted, it is essential to gather scholars' views of the metric. This section runs the System Usability Scale (SUS) questionnaire (Brooke, 1996) to quantify the usability of the proposed metric with input from 10 scholars who have experience in the STS domain. The SUS model provides a reliable tool for measuring the usability of products or services. It is a 10-item questionnaire with a Likert response scale. SUS scores are calculated with a simple formula: subtract one from the odd question answers, subtract the even question answers from 5, add up the total, and multiply it by 2.5 (a small computation sketch follows Table 3 below). The result is a number on a scale of 0-100 (Lewis & Sauro, 2017), where 100 is an excellent user experience, equivalent to an "A", and 68 is considered average (a "C" grade). The author is aware that the extrinsic approach is not comprehensive and is limited by many external factors; however, the goal of this experiment is to see whether scholars are interested in adopting the proposed metric in their research. The proposed scaled Pearson and the SUS questionnaire were shared with the scholars over email, with details on how to run the experiment. Table 3 shows the results of the SUS questionnaire.

Table 3. SUS Questionnaire (system = scaled Pearson correlation)

Qtn#  Question                                                                                    Total
1     I think that I would like to use this system frequently.                                    40
2     I found the system unnecessarily complex.                                                   24
3     I thought the system was easy to use.                                                       38
4     I think that I would need the support of a technical person to be able to use this system.  18
5     I found the various functions in this system were well integrated.                          33
6     I thought there was too much inconsistency in this system.                                  23
7     I would imagine that most people would learn to use this system very quickly.               40
8     I found the system very cumbersome to use.                                                  19
9     I felt very confident using the system.                                                     40
10    I needed to learn a lot of things before I could get going with this system.                40

Total (Even)   101
Total (Odd)    191
Average SUS    72.5
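The per-participant SUS computation described above can be written as a small sketch (the example answers are hypothetical, not the study's actual responses):

```python
def sus_score(answers):
    """SUS score for one participant from 10 Likert answers (1-5), in questionnaire
    order: odd items contribute (answer - 1), even items contribute (5 - answer),
    and the sum is scaled by 2.5 onto the 0-100 range."""
    assert len(answers) == 10
    odd = sum(a - 1 for a in answers[0::2])    # items 1, 3, 5, 7, 9
    even = sum(5 - a for a in answers[1::2])   # items 2, 4, 6, 8, 10
    return 2.5 * (odd + even)

# A hypothetical participant answering 4 on every odd item and 2 on every even item:
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))   # -> 75.0
```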


Table 4. Participants' SUS scores

Participant   SUS Score
p1            72.5
p2            72.5
p3            65.0
p4            75.0
p5            70.0
p6            82.5
p7            72.5
p8            77.5
p9            72.5
p10           65.0

The calculation of the SUS figures was carried out using the SUS guidelines (Lewis & Sauro, 2017). Table 3 reports a usability score of 72.5, which corresponds to an acceptable "C+ Okay" rating. The participants' SUS ratings, shown in Table 4, indicate that the scores of each participant lie within the distribution, with a standard deviation of 5.3. Therefore, the scaled Pearson was acceptable to scholars when compared to the traditional Pearson correlation. Although the SUS can work with a limited number of participants, according to a study of usability questionnaires by Tullis and Stetson (Tullis & Stetson, 2004), this evaluation remains limited because of the small number of participants, who did not see the proposed scaled Pearson metric in practice. Moreover, the proposed metric depends heavily on the similarity task under consideration.

5. Implications
Overall, the semantic similarity method depends on the semantic task (Atoum, 2016, 2018; Ayyagari & Atoum, 2019), or the bins (categories) in our case. The interpretation of these findings is that a high Pearson correlation does not always indicate a top semantic similarity measure. Similarity methods that have low scaled Pearson correlations have high absolute errors and are either overestimating or underestimating the similarity between compared text pairs. These findings draw scholars' attention to the STS task under consideration and to providing an enhanced system that matches the similarity task. Moreover, academic competitions may adopt a smarter way of comparing similarity methods aside from the traditional Pearson correlation. Although the number of bins was chosen arbitrarily to be three, other choices of the number of bins could be used to fine-tune the proposed metric, which implies that dataset designers should also take care of dataset distributions during data gathering. Although the proposed metric was used in the text-similarity domain, it could be used in any domain where a correlation between variables is measured, such as image similarity. While many related datasets exist, the proposed method was tested on individual STS datasets. One implication of the new method is that it could affect various applications that depend on Pearson's correlation alone. Moreover, the reliability of the proposed metric will ultimately depend on the researchers who would like to adopt the new measure.

6. Conclusion
This paper proposes a new similarity performance evaluation metric, the scaled Pearson correlation, which was borrowed from the finance domain. The proposed metric provides decision-making information about the compiled dataset that is not conveyed by the Pearson correlation alone. The measure could be used to replace Pearson's correlation when comparing text similarity measures over a cross-sectional benchmark dataset. The System Usability Scale showed that the proposed metric is deemed applicable under dataset restrictions. In the future, the author will formalize the measure, apply it to other datasets, and run other simulations to find the best number of bins.

References
Abdi, A., Idris, N., Alguliyev, R. M., & Aliguliyev, R. M. (2015). PDLK: Plagiarism detection using linguistic knowledge. Expert Systems with Applications, 42(22), 8936–8946. https://doi.org/10.1016/j.eswa.2015.07.048


Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., Cer, D., & Gonzalez-Agirre, A. (2012). Semeval-2012 task 6: A pilot on semantic textual similarity. Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, (3), 385–393. Retrieved from http://dl.acm.org/citation.cfm?id=2387697 Anscombe, F. J. (1973). Graphs in Statistical Analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966 Arora, S., Liang, Y., & Ma, T. (2017). A Simple but Tough-to-Beat Baseline for Sentence Embeddings. International Conference on Learning Representations. Atoum, I. (2016). Requirements Elicitation Approach for Cyber Security Systems. I-Manager's Journal on Software Engineering, 10(3), 1–5. https://doi.org/10.26634/jse.10.3.4898 Atoum, I. (2018). A Novel Framework for Measuring Software Quality-in-use based on Semantic Similarity and Sentiment Analysis of Software Reviews. Journal of King Saud University - Computer and Information Sciences. https://doi.org/10.1016/j.jksuci.2018.04.012 Atoum, I. (2019). A Scalable Operational Framework for Requirements Validation Using Semantic and Functional Models. Proceedings of the 2nd International Conference on Software Engineering and Information Management, 1–6. https://doi.org/10.1145/3305160.3305166 Atoum, I., & Otoom, A. (2016). Efficient Hybrid Semantic Text Similarity using Wordnet and a Corpus. International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 7, pp. 124–130. https://doi.org/10.14569/IJACSA.2016.070917 Atoum, I., Otoom, A., & Kulathuramaiyer, N. (2016). A Comprehensive Comparative Study of Word and Sentence Similarity Measures. International Journal of Computer Applications, Vol. 135, pp. 10–17. https://doi.org/10.5120/ijca2016908259 Ayyagari, M. R., & Atoum, I. (2019). Understanding Customer Voice of Project Portfolio Management Software. International Journal of Advanced Computer Science and Applications (IJACSA), 10(5), 51–56. https://doi.org/10.14569/IJACSA.2019.0100508 Bogolyubova, O., Panicheva, P., Tikhonov, R., Ivanov, V., & Ledovaya, Y. (2018). Dark personalities on Facebook: Harmful online behaviors and language. Computers in Human Behavior, 78, 151–159. https://doi.org/10.1016/j.chb.2017.09.032 Brooke, J. (1996). SUS-A quick and dirty usability scale. Usability Evaluation in Industry, 189(194), 4–7. Cer, D. M., Diab, M. T., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation. CoRR, abs/1708.0. https://doi.org/10.18653/v1/S17-2001 Cer, D., Yang, Y., Kong, S., Hua, N., Limtiaco, N., John, R. St., … Kurzweil, R. (2018). Universal Sentence Encoder. CoRR, abs/1803.1. https://doi.org/10.18653/v1/D18-2029 Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 670–680. https://doi.org/10.18653/v1/D17-1070 Dikbaş, F. (2018). A New Two-Dimensional Rank Correlation Coefficient. Water Resources Management, 32(5), 1539–1553. https://doi.org/10.1007/s11269-017-1886-0 Fang, C., Mu, D., Deng, Z., & Wu, Z. (2017). Word-sentence co-ranking for automatic extractive text summarization. Expert Systems with Applications, 72, 189–195.
https://doi.org/10.1016/j.eswa.2016.12.021 Fisher, R. A. (1921). On the probable error of a coefficient of correlation deduced from a small sample. Metron, 1, 3–32. Goodman, L. A., & Kruskal, W. H. (1979). Measures of association for cross classifications. In Measures of association for cross classifications (pp. 2–34). Springer. https://doi.org/10.1007/978-1-4612-9995-0_1 Järvelin, K., & Kekäläinen, J. (2000). IR evaluation methods for retrieving highly relevant documents. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 41–48. https://doi.org/10.1145/345508.345545


Katerenchuk, D., & Rosenberg, A. (2018). RankDCG: Rank-Ordering Evaluation Measure. CoRR, abs/1803.0. Retrieved from http://arxiv.org/abs/1803.00719 Kekäläinen, J. (2005). Binary and graded relevance in IR evaluations—comparison of the effects on ranking of IR systems. Information Processing & Management, 41(5), 1019–1033. https://doi.org/10.1016/j.ipm.2005.01.004 Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93. https://doi.org/10.1093/biomet/30.1-2.81 Kendall, M., & Gibbons, J. D. (1990). Rank correlation methods. Edward Arnold, A Division of Hodder & Stoughton, A Charles Griffin Title, London, 29–50. Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015). From word embeddings to document distances. International Conference on Machine Learning, 957–966. Lee Rodgers, J., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), 59–66. https://doi.org/10.1080/00031305.1988.10475524 Lewis, J. J. R., & Sauro, J. (2017). Revisiting the factor structure of the System Usability Scale. Journal of Usability Studies, 12(4), 183–192. Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138–1150. https://doi.org/10.1109/TKDE.2006.130 Manku, G. S., Jain, A., & Das Sarma, A. (2007). Detecting Near-duplicates for Web Crawling. Proceedings of the 16th International Conference on World Wide Web, 141–150. https://doi.org/10.1145/1242572.1242592 Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., Zamparelli, R., & others. (2014). A SICK cure for the evaluation of compositional distributional semantic models. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), 216–223. Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf Meuschke, N., Siebeck, N., Schubotz, M., & Gipp, B. (2017). Analyzing Semantic Concept Patterns to Detect Academic Plagiarism. Proceedings of the 6th International Workshop on Mining Scientific Publications, 46–53. https://doi.org/10.1145/3127526.3127535 Moriya, N. (2008). Noise-Related Multivariate Optimal Joint-Analysis in Longitudinal Stochastic Processes. Progress in Applied Mathematical Modeling, 223–260. Nikolić, D., Mureşan, R. C., Feng, W., & Singer, W. (2012). Scaled correlation analysis: A better way to compute a cross-correlogram. European Journal of Neuroscience, 35(5), 742–762. https://doi.org/10.1111/j.1460-9568.2011.07987.x O'Shea, J., Bandar, Z., Crockett, K., & McLean, D. (2008). A Comparative Study of Two Short Text Semantic Similarity Measures. In N. Nguyen, G. Jo, R. Howlett, & L. Jain (Eds.), Agent and Multi-Agent Systems: Technologies and Applications (pp. 172–181). https://doi.org/10.1007/978-3-540-78582-8_18 Ratna, A. A. P., Luhurkinanti, D. L., Ibrahim, I., Husna, D., & Purnamasari, P. D. (2018). Automatic Essay Grading System for Japanese Language Examination Using Winnowing Algorithm. 2018 International Seminar on Application for Technology of Information and Communication, 565–569. https://doi.org/10.1109/ISEMANTIC.2018.8549789 Reimers, N., Beyer, P., & Gurevych, I. (2016). Task-Oriented Intrinsic Evaluation of Semantic Textual Similarity. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 87–96.
Retrieved from http://aclweb.org/anthology/C16-1009 Šarić, F., Glavaš, G., Karan, M., Šnajder, J., & Bašić, B. D. (2012). Takelab: Systems for Measuring Semantic Text Similarity. First Joint Conference on Lexical and Computational Semantics (*SEM), 441–448. Retrieved from http://dl.acm.org/citation.cfm?id=2387708 Serin, E. A. R., Nijveen, H., Hilhorst, H. W. M., & Ligterink, W. (2016). Learning from Co-expression Networks: Possibilities and Challenges. Frontiers in Plant Science, 7, 444. https://doi.org/10.3389/fpls.2016.00444 Smarandache, F. (2009). Alternatives to Pearson’s and Spearman’s correlation coefficients. International Journal of Statistics & EconomicsTM, 3(S09), 47–53.


Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101. https://doi.org/10.2307/1412159 Tullis, T. S., & Stetson, J. N. (2004). A comparison of questionnaires for assessing website usability. Usability Professional Association Conference, 1. Wang, Z., & Bovik, A. C. (2009). Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine, 26(1), 98–117. https://doi.org/10.1109/MSP.2008.930649 Wilson, S., & Mihalcea, R. (2017). Measuring Semantic Relations between Human Activities. Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 664–673. Retrieved from http://aclweb.org/anthology/I17-1067 Wu, X., Pang, E., Lin, K., & Pei, Z.-M. (2013). Improving the Measurement of Semantic Similarity between Gene Ontology Terms and Gene Products: Insights from an Edge- and IC-Based Hybrid Method. PLoS ONE, 8(5), e66745. https://doi.org/10.1371/journal.pone.0066745 Zarif, L., Postel, M., Trevino, L., Riess, J. G., Valla, A., & Follana, R. (1994). Biodistribution and excretion of a mixed fluorocarbon-hydrocarbon "dowel" emulsion as determined by 19F nmr. Artificial Cells, Blood Substitutes, and Biotechnology, 22(4), 1193–1198. https://doi.org/10.3109/10731199409138815

Copyrights
Copyright for this article is retained by the author(s), with first publication rights granted to the journal. This is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).
