
ReAttach Therapy International Foundation, Voerendaal, The Netherlands. Journal for ReAttach Therapy and Developmental Diversities. 2018 Aug 15; 1(1):44-58. https://doi.org/10.26407/2018jrtdd.1.6 eISSN: 2589-7799

An Overview of the History and Methodological Aspects of Psychometrics

Luis ANUNCIAÇÃO
Federal University of Rio de Janeiro, Institute of Psychology, Department of Psychometrics, Rio de Janeiro, Brazil
E-mail: [email protected]

Received: 15-June-2018; Revised: 1-July-2018; Accepted: 12-July-2018; Online first: 15-July-2018

Scientific article

Abstract

INTRODUCTION: The use of psychometric tools such as tests or inventories comes with an agreement and acceptance that psychological characteristics, such as abilities, attitudes or traits, can be represented numerically and manipulated according to mathematical principles. Psychometrics, in its close relation with statistics, provides the scientific foundations and the standards that guide the development and use of psychological instruments, some of which are tests or inventories. The field has its own historic foundations and its particular analytical specificities and, while some of these are widely used analytical methods among social and educational researchers, the history of psychometrics is either widely unknown or only partially known by these researchers and other students.
OBJECTIVES: With that being said, this paper provides a succinct review of the history of psychometrics and its methods. From a theoretical approach, this study explores and describes the Classical Test Theory (CTT) and the Item Response Theory (IRT) frameworks and their models to deal with questions such as validity and reliability. Different aspects that gravitate around the field, in addition to recent developments, are also discussed, including Goodness-of-Fit, Differential Item Functioning and Differential Test Functioning.
CONCLUSIONS: This theoretical article helps to enhance the body of knowledge on psychometrics; it is especially addressed to social and educational researchers, and also contributes to training these researchers. To a lesser degree, the present article serves as a brief tutorial on the topic.

Keywords: Psychometrics, History, Classical Test Theory, Item Response Theory.

Citation: Anunciação, L. An Overview of the History and Methodological Aspects of Psychometrics. Journal for ReAttach Therapy and Developmental Diversities. 2018 Aug 15; 1(1):44-58. https://doi.org/10.26407/2018jrtdd.1.6

Copyright ©2018 Anunciação, L. This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)

Corresponding address: Luis Anunciação, Federal University of Rio de Janeiro, Institute of Psychology, Department of Psychometrics, Rio de Janeiro – Brazil, 22290-902. E-mail: [email protected]

Neuropsychological research

Introduction

When one decides to use a psychological instrument such as an inventory or a test, the decision comes with an inherent agreement that psychological characteristics, traits or abilities can be investigated in a systematic manner. Another agreement is made when one decides to analyse the data obtained by a psychological tool by summing up the scores or by using other mathematical methods. This latter comes with a deep epistemological acceptance that psychological traits can be cast in numerical form for the study of their underlying structure. Although these premises were already well known and documented in publications by the first psychologists, this paradigm was not entirely accepted by the scientific community until recently. Discussions about the general utility or validity of psychometrics are still present in the mainstream academic debate. Some authors argue against the utility or validity of psychometrics for answering questions about the underlying processes that guide observed behaviour (Toomela, 2010), and others say that the quantitative approach led psychology into a “rigorous …” (Townsend, 2008, p. 270). Apart from this discussion, the use of statistical and psychometric methods in psychological, social and educational research has been growing in recent years, and some concerns have been expressed because of their inadequate, superficial or misapplied use (Newbery, Petocz, & Newbery, 2010; Osborne, 2010).

The close relationship between statistics and psychology is well documented and, with the formation of the Psychometric Society in 1935 by L.L. Thurstone, psychometrics is seen as a separate science that interfaces with mathematics and psychology. In a broader sense, psychometrics is defined as the area concerned with quantifying and analysing human differences; in a narrower sense, it is concerned with evaluating the attributes of psychological tests and other measures used to assess variability in behaviour and then to link such variability to psychological phenomena and theoretical frameworks (Browne, 2000; Furr & Bacharach, 2008). More recently, psychometrics also aims to develop new methods of statistical analysis and to refine older techniques, which has been made possible by the advancements in computer and software technologies.

The two disciplines of psychometrics and statistics have at least three points in common. Firstly, they use models to simplify and study reality; secondly, they are highly dependent on data; and thirdly, both can be observed through their tools (e.g. statistical tests are provided by statistics and/or psychological instruments are provided by psychometrics) or through their theoretical frameworks, where researchers seek to build new models and paradigms through guidelines, empirical data and simulations.

Strictly speaking, psychological phenomena such as attention and extraversion are not directly observable, nor can they be measured directly. Because of that, they must be inferred from observations made on some behaviour that may be observed and is assumed to operationally represent the unobservable characteristic (or “variable”) that is of interest. There are numerous synonyms in the literature when referring to non-directly observable psychological phenomena, such as abilities, constructs, attributes, latent variables, factors or dimensions (Furr & Bacharach, 2008).

There are several avenues available when trying to assess psychological phenomena. Multimethod assessments such as interviews, direct observation and self-reporting, as well as quantitative tools such as tests and scales, are accessible to psychologists (Hilsenroth, Segal, & Hersen, 2003). However, from this group of methods, the use of tests, inventories, scales and other quantitative tools is seen as the best choice when one needs to accurately measure psychological traits (Borsboom, Mellenbergh, &


van Heerden, 2003; Craig, 2017; Marsman et al., 2018; Novick, 1980), as long as they are psychometrically adequate.

In line with this, the use of quantitative methods in psychology (and in the social sciences in general) has been increasing dramatically in the last decades (since the 1980s), despite strong criticism and concern from different groups that disagree with this quantitative view (Cousineau, 2007). Paradoxically, this quantitative trend was only partially followed by academics and other students of psychology, which has led the American Psychological Association to create a task force aiming to increase the number of quantitative psychologists and to improve the quantitative training among students.

With that being said, the aim of this article is to provide a succinct review of the history of psychometrics and its methods through the most important points of psychometrics. It is important to clarify that this review is not about examining all trends in psychometrics, so it is not exhaustive and has concentrated on describing and summarising the topics related to this thesis. Several other resources are relevant to the topic and some are listed in the references.

History of Psychometrics

The precise historical origins of psychometrics and the field of individual differences are difficult to define. The same condition is found in statistics when trying to detail when statistics was incorporated into the social sciences/humanities. However, it is possible to argue that the investigation into psychometrics has two starting points. The first one was concerned with discovering general laws and relating the physical world to observable behaviour, and the second one had the aim to explore and to test some hypotheses about the nature of individual differences by using statistical methods (Craig, 2017; Furr & Bacharach, 2008). When arranging events in their order of occurrence in time, James McKeen Cattell was the first to write about psychometrics, in 1886, with a thesis entitled “Psychometric Investigation”, in which he studied what we now know today as the Stroop effect. At this time, Cattell was Wundt’s student, but he was highly influenced by Francis Galton and his “Anthropometric Laboratory”, which opened in London in 1884. As a consequence of the interface between the two researchers, Cattell is also credited as the founder of the first laboratory developed to study psychometrics, which was established within the Cavendish Laboratory at the University of Cambridge in 1887 (Cattell, 1928; Ferguson, 1990).

With this first laboratory, the field of psychometrics could differentiate from psychophysics, and the major differences can be grouped as follows: 1) while psychophysics aimed to discover general sensory laws (i.e. psychophysical functions), psychometrics was (is) concerned with studying differences between individuals; 2) the goal of psychophysics is to explore the fundamental relations of dependency between a physical stimulus and its psychological response, but the goal of psychometrics is to measure what we call latent variables, such as intelligence, attitudes, beliefs and personality; 3) the methods in psychophysics are based on experimental designs where the same subject is observed over repeated conditions in a controlled environment, but the majority of studies in psychometrics are observational, where the measurement occurs without trying to affect the participants (Jones & Thissen, 2007).

Nowadays, graduate programs in psychometrics are found in countries such as …, and Division 5 (Quantitative and Qualitative Methods) of the American Psychological Association (APA) helps in studying measurement, statistics, and psychometrics. As can be captured in the definition of psychometrics, one of the primary strengths of psychometrics is to improve psychological science by developing instruments based on different theories and approaches and, thus, comparing their results.



However, these instruments can also be developed for other needs and areas (e.g. health sciences and business administration), which means that psychometric tools span a variety of different disciplines.

With that being said, the Classical Test Theory (CTT) and the Item Response Theory (IRT) are the primary measurement theories employed by researchers in order to construct psychological assessment instruments, and they will be described in the following section.

Classical Test Theory (CTT) and Item Response Theory (IRT)

As there is no universal unit of psychological processes (or it has not been discovered yet), such as meters (m) or seconds (s), psychologists operate on the assumption that the units are implicitly created by the instrument that is being used in research (Rakover, 2012; Rakover & Cahlon, 2001). Two consequences emerge from this: first, there are several instruments to measure (sometimes the same) psychological phenomena; second, evaluating the attributes of psychological testing is one of the greatest concerns of psychometrics.

The indirect nature of the instruments leaves much room for unknown sources of variance to contribute to participants’ results, which translates into a large measurement error and the conclusion that assessing the validity and the reliability of psychometric instruments is vital (Peters, 2014). Additionally, as the data yielded by those tests are often used to make important decisions, including awarding credentials, judging the effectiveness of interventions and making personal or business decisions, ensuring that psychometric qualities remain up to date is a central objective in psychometrics (Osborne, 2010).

There are two distinct approaches/paradigms in psychometrics used in evaluating the quality of tests: CTT and IRT. Both deal with broad concepts such as validity, reliability and usability, and provide the mathematical guidance to check test properties, as well as the epistemological background to address typical questions that emerge in psychometric research.

Validity is an extensive concept and has been widely debated since it was conceived in the 1920s. Throughout its history, at least three different approaches emerged to define it. The first authors had the understanding that validity was a test property (e.g. Giles Murrel Ruch, 1924; or Truman L. Kelley, 1927); the second conceived validity within a nomological framework (e.g. Cronbach & Meehl, 1955); and, finally, current authors state that validity must not only consider the interpretations and actions based on test scores, but also the ethical consequences and social considerations (e.g. Messick, 1989) (Borsboom, Mellenbergh, & van Heerden, 2004).

There is no difficulty in recognising that the latter approach influenced official guidelines, such as the Standards for Educational and Psychological Testing, when it defines validity as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (AERA, APA, & NCME, 2014, p. 14). However, even considering that psychometric tools always exist in a larger context and thus must be evaluated from this standpoint, this definition imposes a validation process which is hard to achieve. The absence of standard guidance for how to integrate different pieces of validity, or which evidence should be highlighted and prioritised, contributes even more to weakening the link between the theoretical understanding of validity and the practical actions performed by psychometricians to validate a tool (Wolming & Wikström, 2010).

Another effect of plural definitions is that not everyone has access to updated materials. This is pretty common in some regions, mainly in developing countries, in which only translated content is available. Moreover, the types of validity elaborated by Cronbach and Meehl (1955) are not only older than the recent definitions, which increases their chances of having been translated, but are still informative and


reported in academic books. This mix between the absence of updated knowledge about psychometrics and the multiple ways to define the same concept nurtures an environment where analyses and conclusions can be diametrically opposed from one academic group to another.

Within this traditional framework, validity can be divided into content, criterion and construct (i.e. the “tripartite” perspective). Criterion-related validity is formed by concurrent and predictive validity. Construct-related validity is formed by convergent and discriminant validity. Finally, content validity refers to the degree to which an instrument measures all of the domains that constitute the construct, and it is mainly assessed by experts in the domain. The statistical methods were developed or used focusing on some particular aspect of validity, seen as independent of one another. However, as construct validity points to the degree to which an instrument measures what it is intended to measure, this type of validity became the central issue in the study of psychometrics (Cronbach & Meehl, 1955).

Nowadays, this concept of validity is accepted by virtually all psychometricians, in addition to the agreement that validity is not all-or-none, nor a test property. Validity should be evaluated through multiple sources of evidence with respect to specific contexts and purposes. Thus, validation is a continuous process, and a test can be valid for one purpose, but not for another (Sireci, 2007).

CTT is based on the concept of the “true score”. That means the observed test score (Y_i) is composed of a True score (T_i) plus an Error (E_i), considered normally distributed with its mean taken to be 0. Mathematical formulations of CTT have been made over the years until the work of Novick (1966), which defined:

Y_i = T_i + E_i

Equation 1. Basic CTT equation

CTT assesses validity mainly by inter-item correlations and by the correlation between the measure and some external evidence (Salzberger, Sarstedt, & Diamantopoulos, 2016). CTT also understands reliability as a necessary but not sufficient condition for validity, where reliability represents the consistency and the reproducibility of the results across different test situations.

To a lesser degree than what occurs for validity, this concept also has multiple meanings. It refers to at least three different concepts, which are internal consistency, consistency across time, and equivalence. Internal consistency is also referred to as item homogeneity and attempts to check if all the items of a test are relatively similar. Consistency across time is also known as temporal stability and is checked by consecutive measures of the same group of participants. Equivalence refers to the degree to which equivalent forms of an instrument or different raters yield similar or identical scores (AERA, APA, & NCME, 2014; Borsboom et al., 2004; Sijtsma, 2013).

From a CTT perspective, reliability is the ratio of true-score variance to the total variance yielded by the measuring instrument. The variance is:

σ²_Y = σ²_T + σ²_E

Equation 2. Decomposition of the test variance



Hence, the reliability is:

ρ²_YT = σ²_T / σ²_Y = σ²_T / (σ²_T + σ²_E)

Equation 2. Decomposition of the test reliability
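This decomposition can be checked with a short simulation: when true scores and errors are generated independently, the observed variance is approximately the sum of the true-score and error variances, and the reliability is their ratio. The distributions and sample size below are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# CTT model: observed score Y = true score T + error E, with E centred at 0.
true_score = rng.normal(loc=50, scale=10, size=n)   # T, Var(T) = 100
error = rng.normal(loc=0, scale=5, size=n)          # E, Var(E) = 25
observed = true_score + error                        # Y

var_y = observed.var()
var_t = true_score.var()
var_e = error.var()

# With independent T and E, Var(Y) ~= Var(T) + Var(E) (Equation 2),
# and reliability is the ratio of true-score to total variance.
reliability = var_t / var_y
print(round(reliability, 2))  # ≈ 100 / 125 = 0.80
```

In practice the true score is unobservable, so reliability has to be estimated indirectly, for instance through internal-consistency coefficients such as Cronbach's alpha.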

As reliability is not a unitary concept, several methods were developed for its estimation, such as Cronbach’s alpha, test-retest, the Intraclass Correlation Coefficient (ICC) and the Pearson or Spearman correlation. Cronbach’s alpha is the most commonly used measure of internal consistency and has proven to be very resistant to the passage of time, despite its limitations: its values are dependent on the number of items in the scale, it assumes tau-equivalence (i.e. all factor loadings are equal, or the same true score for all test items), it is not robust against …, and it treats the items as continuous and normally distributed data (McNeish, 2017). Alternatives to Cronbach’s alpha have been proposed, and examples are McDonald’s omega, the Greatest Lower Bound (GLB) and Composite Reliability (Sijtsma, 2009).

Still within the CTT framework, a shift has occurred with Factor Analysis (FA). This method relies on a correlation matrix and depends on the items included in the test and the persons examined, but it also models a latent variable, and some of its models achieve virtually identical results to those obtained by IRT models. Therefore, these conditions allow one to consider FA from both perspectives/traditions in psychometrics (Steyer, 2001). If the framework in this article uses the “true vs …” distinction, FA will be allocated into the latent framework, such as IRT; and from a statistical/methodological standpoint it is possible to combine approaches or to understand some methods as particular cases of a general approach, such as with Confirmatory Factor Analysis (Edwards & Bagozzi, 2000; Mellenbergh, 1994).

In regards to the statistical process used to explore the constructs covered in psychometric work, there are two main ways in which this connection between constructs and observations has been construed. The first approach understands constructs as inductive summaries of attributes or behaviours as a function of the observed variables (i.e. the formative model, where latent variables are formed by their indicators). The second approach understands constructs as reflective, and the presence of the construct is assumed to be the common cause of the observed variables (Fried, 2017; Schmittmann et al., 2013). Image 1 below displays these conceptualisations.

Image 1. On the left, the PCA model (formative); on the right, the Factor model (reflective).
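The reflective conceptualisation can be illustrated with a minimal simulation: when a single latent variable is the common cause of a set of items, the items become correlated in a pattern determined by their loadings. The loadings and sample size below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Reflective model: one latent variable (eta) is the common cause of 3 items.
factor = rng.normal(size=n)                    # latent variable
loadings = np.array([0.8, 0.7, 0.6])           # hypothetical factor loadings
noise = rng.normal(size=(n, 3))                # unique (error) parts
items = factor[:, None] * loadings + noise     # Y_i = lambda_i * eta + eps_i

# Under this model, the expected correlation between items i and j is
# lambda_i * lambda_j / sqrt((lambda_i**2 + 1) * (lambda_j**2 + 1)).
r = np.corrcoef(items, rowvar=False)
expected_r12 = 0.8 * 0.7 / np.sqrt((0.8**2 + 1) * (0.7**2 + 1))
print(round(float(r[0, 1]), 2), round(float(expected_r12), 2))
```

The observed correlation matches the value implied by the model, which is precisely the pattern that factor-analytic methods exploit when recovering the latent variable.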



As the goal of PCA is data reduction, but psychometric theory wants to investigate how observable variables are related to theoretical/latent constructs, the reflective model is mostly used. Some of the statistical models associated with this model are the Common Factor Model, Item Response Theory (IRT) models, Latent Class Models, and Latent Profile Models (Edwards & Bagozzi, 2000; Marsman et al., 2018). The question of whether latent variables are continuous (therefore dimensions) or categorical (therefore typologies) will influence the choice of the model. Table 1 reports the theoretical assumptions of latent and manifest variables in reflective models.

Table 1. Properties of Latent and Observed variables

Model                    | Latent variable (ability, trait) | Observed variable (items, indicators)
Common Factor            | Continuous                       | Continuous
Item Response Theory     | Continuous                       | Categorical
Latent Class Analysis    | Categorical                      | Categorical
Latent Profile Analysis  | Categorical                      | Continuous
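Table 1 can be read as a simple decision rule. The sketch below (a hypothetical helper, not part of any psychometrics library) encodes it as a lookup from the two variable types to the implied model.

```python
# Hypothetical helper translating Table 1 into code: the combination of
# latent and observed variable types implies a reflective measurement model.
MODEL_BY_TYPES = {
    ("continuous", "continuous"): "Common Factor",
    ("continuous", "categorical"): "Item Response Theory",
    ("categorical", "categorical"): "Latent Class Analysis",
    ("categorical", "continuous"): "Latent Profile Analysis",
}

def choose_model(latent: str, observed: str) -> str:
    """Return the model of Table 1 implied by the two variable types."""
    return MODEL_BY_TYPES[(latent.lower(), observed.lower())]

print(choose_model("continuous", "categorical"))  # Item Response Theory
```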

Factor Analysis (FA) is part of this family of models; its concept is analogous to CTT, and it was developed with the work of Spearman (1904) in the context of intelligence testing. FA operates on the notion that measurable and observable variables can be reduced to fewer latent variables that share a common variance and are unobservable (Borsboom et al., 2003). The statistical purpose of factor analysis is to explain the relations among a large set of observed variables using a small number of latent/unobserved variables called factors. FA can be divided into exploratory and confirmatory and, in a broad sense, is viewed as a special case of Structural Equation Modeling (SEM) (Gunzler & Morris, 2015).

Exploratory factor analysis (EFA) explores data to determine the number or nature of the factors that account for the covariation between variables when the researcher does not have sufficient a priori evidence to establish a hypothesis regarding the number of factors underlying the data. In detail, since there is not an a priori hypothesis about how indicators are related to the underlying factors, EFA is not generally considered a member of the SEM family. In contrast, confirmatory factor analysis (CFA) is a theory-driven model and aims to see whether a particular set of factors can account for the correlations by imposing lower triangular constraints on the factor loading matrix, thus rendering identifiability to the established parameters of the model. In other words, CFA is designed to evaluate the a priori factor structure specified by researchers (Brown, 2015; Finch, 2011).

In another direction, some authors argue that there is no clear EFA-CFA distinction in most factor analysis applications and that they fall on a continuum running from exploration to confirmation. Because of this, they choose to call both techniques, from a statistical standpoint, an unrestricted model for EFA and a restricted model for CFA. An unrestricted solution does not restrict the factor space, so unrestricted solutions can be obtained by a rotation of an



arbitrary orthogonal solution, and all the unrestricted solutions will yield the same fit for the same data. On the other hand, a restricted solution imposes restrictions on the whole factor space and cannot be obtained by a rotation of an unrestricted solution (Ferrando & Lorenzo-Seva, 2000).

Leaving aside these particular questions, several high-quality resources on best practices in EFA and CFA are available and, despite some changes in mathematical notation or formula, the common factor model is a linear regression model with observed variables as outcomes (dependent variables) and factors as predictors (independent variables) (see Equation 1):

Y_i = λ_i1·η_1 + λ_i2·η_2 + … + λ_iM·η_M + ε_i

Equation 3. Common factor model

Where Y_i is the ith observed variable (item score) from a set of I observed variables, η_m is the mth of M common factors, λ_im is the regression coefficient (slope, also known as factor loading) relating factor m to Y_i, and ε_i is the error term unique to each Y_i. The variance of ε for variable i is known as the variable’s uniqueness, whereas 1 − VAR(ε_i) is that variable’s communality. This latter concept is equivalent to the regression R² and describes the proportion of the variability in the observed variable explained by the common factors. In some guidelines, the inclusion of an item intercept is made, but this parameter usually does not contribute to the covariance matrix (Furr & Bacharach, 2008).

Operationally, some assumptions must be fulfilled before an EFA, such as that the proportion of variance among variables might be common variance, and that the dependent-variable covariance matrices are not equal across the levels of the independent variables. The first assumption is tested by the Kaiser-Meyer-Olkin (KMO) test and the second with the Bartlett test. KMO values between 0.8 and 1 indicate the data is adequate for FA, and a significant Bartlett’s test (p < .05) means that the data matrix is not an identity matrix (an identity matrix would prevent factor analysis from working) (Costello & Osborne, 2011).

Next, three main questions arise when conducting an EFA: 1. the method of factor extraction; 2. how many factors to settle on for a confirmatory step; and 3. which factor rotation should be employed. All questions need to be answered by the researcher. The extraction methods reflect the analyst’s assumptions about the obtained factors. Their mathematical conceptualisation is also based on manipulations of the correlation matrix to be analysed. The number of factors to retain changes throughout the literature, and there are many rules of thumb to guide the decision. Finally, all results are often adjusted to become more interpretable.

In summary, the factor extraction methods are statistical procedures used to estimate factor loadings and are composed of techniques such as the minimum residual method, principal axis factoring, weighted least squares, generalized least squares and maximum likelihood factor analysis. The decision of how many factors will be retained relies on many recommendations, such as: 1. the rule of an eigenvalue ≥ 1; 2. the point in a scree plot where the slope of the curve is clearly leveling off; or 3. the interpretability of the factors. It is easy to recognise that these guides can provide contradictory answers, which illustrates some degree of arbitrary decision-making during this process (Nowakowska, 1983). The factor rotations are classified as either orthogonal, in which the factors are constrained to be uncorrelated (e.g.



Varimax, Quartimax, Equamax), or oblique (e.g. Oblimin, Promax, Quartimin), in which this constraint is not present (Finch, 2011).

Another approach in psychometrics, independent of the factor analysis developments and apart from CTT, is IRT. The focus of IRT modeling is on the relation between a latent trait (θ_s), the properties of each item in the instrument and the individual’s response to each item. IRT assumes that the underlying latent dimension (or dimensions) is causal to the observed responses to the items and, differently from CTT, item and person parameters are invariant, depending neither on the subset of items used nor on the distribution of latent traits in the population of respondents. In addition, the total score of a test has no special role in IRT, which is concerned with focusing on quality at the item level.

Consider a sample of n individuals that answered I items, with s = 1, …, n and i = 1, …, I. Let Y_is be the random variables associated with the response of individual s to an item i. These responses can be dichotomous (e.g. fail or pass) or polytomous (e.g. agree, partially agree, neutral). Let Y denote the set of possible values of the Y_is, assumed to be identical for each item in the test; θ_s denotes the latent trait for an individual s; and a set of item parameters will be used to model item features. The IRT models arise from different sets of possible responses and different functional forms assumed to describe the probabilities with which the Y_is assume those values, as expressed below (Le, 2014; Sijtsma & Junker, 2006; Zumbo & Hubley, 2017):

P(Y_is = y | θ_s, ξ_i), for y ∈ Y, where ξ_i denotes the item parameters

Equation 4. General formula of IRT models

The item parameters may include four distinct types: “a_i” denotes the discrimination, “b_i” the difficulty, “c_i” the guessing, and “d_i” expresses the probability of a high-ability participant failing to answer an item correctly. The common 4PL model for a dichotomous response is (Loken & Rulison, 2010):

P(Y_is = 1 | θ_s, a_i, b_i, c_i, d_i) = c_i + (d_i − c_i) · e^(a_i(θ_s − b_i)) / (1 + e^(a_i(θ_s − b_i)))

Equation 5. 4PL IRT model

Which leads to:

P(Y_is = 1 | θ_s, a_i, b_i, c_i, d_i) = c_i + (d_i − c_i) · 1 / (1 + e^(−a_i(θ_s − b_i)))

Equation 6. 4PL IRT model
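Equation 6 is straightforward to implement. The sketch below evaluates the 4PL response probability for hypothetical item parameters; at θ = b the logistic term equals 0.5, so the probability lies exactly halfway between the lower asymptote c and the upper asymptote d.

```python
import math

def p_correct(theta: float, a: float, b: float,
              c: float = 0.0, d: float = 1.0) -> float:
    """4PL probability of a correct response (Equation 6).

    a: discrimination, b: difficulty, c: lower asymptote (guessing),
    d: upper asymptote. With c = 0 and d = 1 this reduces to the 2PL,
    and additionally fixing a = 1 gives the Rasch/1PL form.
    """
    return c + (d - c) / (1 + math.exp(-a * (theta - b)))

# At theta == b, P is halfway between c = 0.2 and d = 1.0.
print(p_correct(theta=0.0, a=1.2, b=0.0, c=0.2, d=1.0))
```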

The three IRT models that precede the 4PL are seen as its constrained versions. The 3PL model constrains the upper asymptote (“d”) to 1; the 2PL model keeps the previous constraint and also constrains the lower asymptote (“c”) to 0; and the 1PL model only estimates the difficulty parameter (“b”). Some information about these models must be emphasised for a better understanding of the topic: 1. the 2PL is analogous to the congeneric measurement model in CTT; 2. both the 1PL and Rasch models assume that items do not differ in the discrimination parameter (“a”), but Rasch models set the discrimination at 1.0, whereas the 1PL can assume other values; and 3. some authors argue that Rasch models focus on fundamental measurement, trying to check how well the data fit the model, while IRT models check the degree to which the model fits the data (De Ayala, 2009, p. 19).
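This nesting can be made concrete in code: fixing d = 1 yields the 3PL, additionally fixing c = 0 yields the 2PL, and also fixing the discrimination at 1 gives the Rasch/1PL form. The parameter values below are hypothetical, chosen only to illustrate the constraints.

```python
import math

def p_4pl(theta: float, a: float, b: float,
          c: float = 0.0, d: float = 1.0) -> float:
    """4PL response probability; simpler models are constrained versions."""
    return c + (d - c) / (1 + math.exp(-a * (theta - b)))

theta, a, b, c = 0.5, 1.3, -0.2, 0.15
p_3pl = p_4pl(theta, a, b, c, d=1.0)       # 3PL: upper asymptote fixed at 1
p_2pl = p_4pl(theta, a, b, c=0.0, d=1.0)   # 2PL: guessing also fixed at 0
p_rasch = p_4pl(theta, a=1.0, b=b)         # Rasch/1PL: discrimination fixed at 1

# A positive guessing parameter always raises the probability relative to
# the 2PL for the same theta, a and b.
print(p_3pl > p_2pl)
```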



As can be seen from the equations, there is a conceptual bridge between IRT parameters and Factor Analysis, and between IRT models and other statistical models. The “a” parameter is analogous to the factor loading in traditional linear factor analysis, with the advantage that the IRT model can accommodate items with different difficulties, whereas linear factor loadings and item-total correlations will treat easy or hard items as inadequate because they have less variance than medium items. The “b” parameter in Rasch models is analogous to item difficulty in CTT, which is the probability of answering the item correctly (Schweizer & DiStefano, 2016).

Similarities also exist between IRT and logistic regression, but the explanatory (independent) variable in IRT is a latent variable, as opposed to an observed variable in logistic regression. In the IRT case, the model will recognise the person’s variability on the dimension measured in common by the items, and individual differences θ may be estimated (Wu & Zumbo, 2007).

In the origins of IRT, some assumptions (such as unidimensionality and local independence) were held, but IRT models can currently deal with a multidimensional latent structure (MIRT) and local dependence. In MIRT, an Item Characteristic Surface (ICS) represents the probability that an examinee with a given ability (θ_s) composite will correctly answer an item. To deal with local dependence, Item Splitting is a way of estimating item and person parameters (Olsbjerg & Christensen, 2015). In the same direction, comparisons between unidimensional and multidimensional models have shown that, as the number of latent traits underlying item performance increases, item and ability parameters estimated under MIRT have smaller error scores and reach more precise measurement (Kose & Demirtasli, 2012).

As previously stated, the reliability of an instrument is investigated along with its validity during a psychometric examination, and this can be performed via methods within the CTT and IRT frameworks. In IRT, reliability varies for different levels of the latent trait, such that items discriminate best around their difficulty parameter.

It should be emphasised that both CTT and IRT methods are currently seen as complementary and are frequently used to assess test validity and to respond to other research questions.

Goodness-of-fit (GoF)

As most modern measurement techniques do not measure the variable of interest directly, but indirectly derive the target variables through models, the adequacy of models must be tested by statistical techniques and experimental or empirical inspection. Goodness-of-Fit (GoF) is an important procedure to test how well a model fits a set of observations or whether the model could have generated the observed data. Both SEM and IRT provide a wide range of GoF indices focusing on the item- and/or test-level, and the guidelines in SEM are seen as reasonable for IRT models (Maydeu-Olivares & Joe, 2014).

Traditional GoF indices can be broken into absolute and relative fit indices. The absolute indices measure the discrepancy between a statistical model and the data, whereas the relative indices measure the discrepancy between two statistical models. The first group is comprised of the Chi-Square, the Goodness-of-Fit Index (GFI), the Root Mean Square Error of Approximation (RMSEA), and the Standardized Root Mean-Square Residual (SRMR). The second group is comprised of Bollen’s Incremental Fit Index (IFI), the Comparative Fit Index (CFI) and the Tucker-Lewis Index (TLI).

Mainly in Rasch-based methods, some item-level fit indices are also available to assess the degree to which an estimated item response function approximates (or does not approximate) an observed item response pattern. Finally, the information-theoretic approach is a commonly used criterion in model selection, with the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) being the most used measures to select

Journal for ReAttach Therapy and Developmental Diversities. 2018 Aug 15; 1(1):44-58. 53

from among several candidate models (Fabozzi, Focardi, Rachev, & Arshanapalli, 2014). Table 2 summarizes these indices; it may be used as a preliminary approach to these models and is based on Bentler (1990), Maydeu-Olivares (2013), Fabozzi et al. (2014), and Wright & Linacre (1994).
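Returning to the IRT quantities discussed above, the roles of the “a” (discrimination) and “b” (difficulty) parameters, and the fact that measurement precision peaks near an item’s difficulty, can be sketched numerically. This is an illustrative example rather than code from the article; it uses the standard 2PL response function P(θ) = 1 / (1 + exp(−a(θ − b))) and its Fisher information I(θ) = a²P(θ)(1 − P(θ)).

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function: probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of the item; largest where theta is near b."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# One discriminating item (a = 2.0) with difficulty b = 0.5:
for theta in (-2.0, -1.0, 0.0, 0.5, 1.0, 2.0):
    print(f"theta {theta:+.1f}: P = {p_2pl(theta, 2.0, 0.5):.3f}, "
          f"info = {item_information(theta, 2.0, 0.5):.3f}")
```

At θ = b the probability is exactly 0.5 and the information reaches its maximum (a²/4), which is why reliability in IRT is conditional on the level of the latent trait rather than a single number.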

Table 2. Commonly used measures of Goodness-of-Fit

Type                   Indices           Values                            Framework
Absolute               Chi-Square (χ²)   p value ≥ 0.05                    SEM
                       RMSEA             ≤ 0.08 (acceptable)               SEM
                       SRMR              ≤ 0.05 (ideally)                  SEM
Relative               TLI               ≥ 0.95 (ideally)                  SEM
                       CFI               ≥ 0.90 (acceptable)               SEM
                       IFI                                                 SEM
Item level             Infit, Outfit     based on the type of test;        Rasch/IRT
                                         for surveys, 0.6–1.4
Information Criteria   AIC, BIC          lowest value                      methods for comparing
                                                                           competing models
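As a brief illustration of how the indices in Table 2 are applied, the sketch below computes RMSEA from a model’s chi-square statistic and compares two candidate models with AIC and BIC. The formulas (RMSEA = √(max(χ² − df, 0) / (df(N − 1))), AIC = 2k − 2 ln L, BIC = k ln N − 2 ln L) are the standard ones; the fit statistics themselves are invented for the example, not taken from any real dataset.

```python
import math

def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation (common SEM formula)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def aic(log_lik, k):
    """Akaike information criterion: lower is better."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: penalises parameters more heavily."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical results for two competing models fitted to n = 500 cases:
n = 500
m1 = {"chi2": 95.0, "df": 54, "loglik": -4125.0, "k": 24}  # unidimensional
m2 = {"chi2": 60.0, "df": 51, "loglik": -4101.0, "k": 27}  # multidimensional

for name, m in (("M1", m1), ("M2", m2)):
    print(name,
          f"RMSEA={rmsea(m['chi2'], m['df'], n):.3f}",
          f"AIC={aic(m['loglik'], m['k']):.1f}",
          f"BIC={bic(m['loglik'], m['k'], n):.1f}")
```

Here both models clear the RMSEA ≤ 0.08 threshold, so the information criteria break the tie: the model with the lowest AIC and BIC would be preferred, mirroring the “lowest value” rule in Table 2.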

Differential item functioning (DIF) and Differential Test Functioning (DTF)

DIF refers to the change in the probability of participants within the same ability level, but from different groups, successfully answering a specific item. Therefore, assuming two individuals from different subgroups have the same ability level, their probability of endorsing the same items should not differ. When DIF is present for many items on the test, the final test scores do not represent the same measurement across groups, a situation known as DTF (Runnels, 2013).

DIF (and DTF) may reflect measurement bias and indicate a violation of the invariance assumption. Testing DIF is enabled by visual inspection and statistical testing. As the Item Characteristic Curves represent the regression of the item score (dependent variable) on examinees’ ability, different patterns emerging from groups with the same ability are the first evidence of DIF. In addition, the Mantel-Haenszel (MH) procedure, Wald statistics, and the likelihood-ratio test approach offer numerical ways to investigate DIF (De Beer, 2004).

As meaningful comparisons require that measurement equivalence holds, both DIF and DTF may influence the psychometric properties of test scores and represent a lack of fairness. Further literature about DIF and DTF is available elsewhere (Hagquist & Andrich, 2017).

Conclusions

The investigator often needs to simplify some representation of reality in order to achieve an understanding of the dominant aspects of the



system under study. This is no different in Psychology; models are built and their study allows researchers to answer focused and well-posed questions. When models are useful, their predictions are analogous to the real world. Additionally, psychometric tools are often used to inform important decisions, to judge the effectiveness of interventions, and to make personal or business decisions. Therefore, ensuring that psychometric qualities remain up to date is a central objective in psychometrics (Osborne, 2010).

The present manuscript had the goal of exploring some aspects of the history of psychometrics and of describing its main models. Theoretical studies frequently focus on only one of these two aspects. However, the integration of the methods and their history helps to better understand (and contextualise) psychometrics. The preceding pages revisited the origins of psychometrics through its models, illustrated some of the mathematical conceptualisations of these techniques, and offered academic perspectives on psychometrics.

Despite the contributions provided in this manuscript, it is not free from limitations. The present text does not cover some of the recent methods and debates, such as Bayesian psychometrics, network psychometrics and the effect of computational psychometrics on psychology. Bayesian psychometrics in particular, and Bayesian statistics in general, are seen as candidates to make a revolution in Psychology and other behavioural sciences (Kruschke, Aguinis, & Joo, 2012). Along the same lines, network models are potential candidates to change current psychometric paradigms. Different from the traditional view that understands item responses as being caused by latent variables, in network models items are hypothesised to form networks of mutually reinforcing variables (Fried, 2017). Finally, the growth in computer power and the availability of statistical packages can negatively impact psychometrics by encouraging a generation of mindless analysts if not coupled with a theoretical understanding of science and the scientific method.

Because the use of psychometric tools is becoming an important part of several sciences, understanding the concepts presented in this paper will mainly be of importance to enhance the abilities of social and educational researchers.

Acknowledgements

I deeply thank three anonymous reviewers who gave me useful and constructive comments that helped me to improve the manuscript. I would also like to thank Prof. Dr. Vladimir Trajkovski for all his support, and Prof. Dr. J. Landeira-Fernandez and Prof. Dr. Cristiano Fernandes for the fruitful discussions about psychometrics.

Conflicts of interests

The author declares no conflict of interests.

References

AERA, APA, & NCME. (2014). Standards for educational and psychological testing. American Educational Research Association.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107(2), 238–246. https://doi.org/10.1037/0033-2909.107.2.238
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203–219. https://doi.org/10.1037/0033-295X.110.2.203
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The Concept of Validity. Psychological Review, 111(4), 1061–1071. https://doi.org/10.1037/0033-295X.111.4.1061
Brown, T. (2015). Confirmatory Factor Analysis for Applied Research. Journal of Chemical Information and Modeling. https://doi.org/10.1017/CBO9781107415324.004
Browne, M. W. (2000). Psychometrics. Journal of the American Statistical Association, 95(450), 661–665. https://doi.org/10.1080/01621459.2000.10474246



Cattell, J. M. (1928). Early Psychological Laboratories. Science, 67(1744), 543–548. https://doi.org/10.1126/science.67.1744.543
Costello, A., & Osborne, J. (2011). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research & Evaluation, 10, np.
Cousineau, D. (2007). The rise of quantitative methods in psychology. Tutorials in Quantitative Methods for Psychology, 1(1), 1–3. https://doi.org/10.20982/tqmp.01.1.p001
Craig, K. (2017). The History of Psychometrics. In Psychometric Testing (pp. 1–14). Wiley. https://doi.org/10.1002/9781119183020.ch1
Cronbach, L., & Meehl, P. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0061470
De Ayala, R. J. (2009). The Theory and Practice of Item Response Theory. Methodology in the Social Sciences.
De Beer, M. (2004). Use of differential item functioning (DIF) analysis for bias analysis in test construction. SA Journal of Industrial Psychology, 30(4). https://doi.org/10.4102/sajip.v30i4.175
Edwards, J. R., & Bagozzi, R. P. (2000). On the nature and direction of relationships between constructs and measures. Psychological Methods, 5(2), 155–174. https://doi.org/10.1037/1082-989X.5.2.155
Fabozzi, F. J., Focardi, S. M., Rachev, S. T., & Arshanapalli, B. G. (2014). Multiple Linear Regression. In The Basics of Financial Econometrics (pp. 41–80). https://doi.org/10.1002/9781118856406.ch3
Ferguson, B. (1990). Modern psychometrics. The science of psychological assessment. Journal of Psychosomatic Research, 34(5), 598. https://doi.org/10.1016/0022-3999(90)90043-4
Ferrando, P. J., & Lorenzo-Seva, U. (2000). Unrestricted versus restricted factor analysis of multidimensional test items: some aspects of the problem and some suggestions. Psicológica, 21, 301–323.
Finch, W. H. (2011). A Comparison of Factor Rotation Methods for Dichotomous Data. Journal of Modern Applied Statistical Methods, 10(2), 549–570. https://doi.org/10.22237/jmasm/1320120780
Fried, E. I. (2017). What are psychological constructs? On the nature and statistical modelling of emotions, intelligence, personality traits and mental disorders. Health Psychology Review, 11(2), 130–134. https://doi.org/10.1080/17437199.2017.1306718
Furr, R. M., & Bacharach, V. R. (2008). Psychometrics: an introduction. Retrieved from http://catdir.loc.gov/catdir/enhancements/fy0808/2007016663-b.html
Gunzler, D. D., & Morris, N. (2015). A tutorial on structural equation modeling for analysis of overlapping symptoms in co-occurring conditions using MPlus. Statistics in Medicine, 34(24), 3246–3280. https://doi.org/10.1002/sim.6541
Hagquist, C., & Andrich, D. (2017). Recent advances in analysis of differential item functioning in health research using the Rasch model. Health and Quality of Life Outcomes, 15(1), 181. https://doi.org/10.1186/s12955-017-0755-0
Hilsenroth, M. J., Segal, D. L., & Hersen, M. (2003). Comprehensive handbook of psychological assessment. Assessment (Vol. 2). https://doi.org/10.1002/9780471726753
Jones, L. V., & Thissen, D. (2007). A History and Overview of Psychometrics. Handbook of Statistics, 26(6), 1–27. https://doi.org/10.1016/S0169-7161(06)26001-2
Kose, I. A., & Demirtasli, N. C. (2012). Comparison of Unidimensional and Multidimensional Models Based on Item Response Theory in Terms of Both Variables of Test Length and Sample Size. Procedia - Social and Behavioral Sciences, 46, 135–140. https://doi.org/10.1016/j.sbspro.2012.05.082
Kruschke, J. K., Aguinis, H., & Joo, H. (2012). The Time Has Come. Organizational Research Methods, 15(4), 722–752. https://doi.org/10.1177/1094428112457829
Le, D.-T. (2014). Applying item response theory modeling in educational research. Dissertation Abstracts International: Section B: The Sciences and Engineering, 75(1).
Loken, E., & Rulison, K. L. (2010). Estimation of a four-parameter item response theory model. British Journal of Mathematical and Statistical



Psychology, 63(3), 509–525. https://doi.org/10.1348/000711009X474502
Marsman, M., Borsboom, D., Kruis, J., Epskamp, S., van Bork, R., Waldorp, L. J., … Maris, G. (2018). An Introduction to Network Psychometrics: Relating Ising Network Models to Item Response Theory Models. Multivariate Behavioral Research, 53(1), 15–35. https://doi.org/10.1080/00273171.2017.1379379
Maydeu-Olivares, A. (2013). Goodness-of-Fit Assessment of Item Response Theory Models. Measurement: Interdisciplinary Research & Perspective, 11(3), 71–101. https://doi.org/10.1080/15366367.2013.831680
Maydeu-Olivares, A., & Joe, H. (2014). Assessing Approximate Fit in Categorical Data Analysis. Multivariate Behavioral Research, 49(4), 305–328. https://doi.org/10.1080/00273171.2014.911075
McNeish, D. (2017). Thanks coefficient alpha, we’ll take it from here. Psychological Methods. https://doi.org/10.1037/met0000144
Mellenbergh, G. J. (1994). Generalized linear item response theory. Psychological Bulletin, 115(2), 300–307. https://doi.org/10.1037/0033-2909.115.2.300
Newbery, G., Petocz, A., & Newbery, G. (2010). On Conceptual Analysis as the Primary Qualitative Approach to Statistics Research in Psychology. Statistics Education Research Journal, 9(2), 123–145.
Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3(1), 1–18. https://doi.org/10.1016/0022-2496(66)90002-2
Novick, M. R. (1980). Statistics as psychometrics. Psychometrika, 45(4), 411–424. https://doi.org/10.1007/BF02293605
Nowakowska, M. (1983). Chapter 2 Factor Analysis: Arbitrary Decisions within (pp. 115–182). https://doi.org/10.1016/S0166-4115(08)62370-5
Olsbjerg, M., & Christensen, K. B. (2015). Modeling local dependence in longitudinal IRT models. Behavior Research Methods, 47(4), 1413–1424. https://doi.org/10.3758/s13428-014-0553-0
Osborne, J. W. (2010). Challenges for quantitative psychology and measurement in the 21st century. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2010.00001
Peters, G.-J. Y. (2014). The alpha and the omega of scale reliability and validity. The European Health Psychologist, 16(2), 56–69. https://doi.org/10.17605/OSF.IO/TNRXV
Rakover, S. S. (2012). Psychology as an Associational Science: A Methodological Viewpoint. Open Journal of Philosophy, 2(2), 143–152. https://doi.org/10.4236/ojpp.2012.22023
Rakover, S. S., & Cahlon, B. (2001). Face Recognition (Vol. 31). Amsterdam: John Benjamins Publishing Company. https://doi.org/10.1075/aicr.31
Runnels, J. (2013). Measuring differential item and test functioning across academic disciplines. Language Testing in Asia, 3(1), 9. https://doi.org/10.1186/2229-0443-3-9
Salzberger, T., Sarstedt, M., & Diamantopoulos, A. (2016). Measurement in the social sciences: where C-OAR-SE delivers and where it does not. European Journal of Marketing, 50(11), 1942–1952. https://doi.org/10.1108/EJM-10-2016-0547
Schmittmann, V. D., Cramer, A. O. J., Waldorp, L. J., Epskamp, S., Kievit, R. A., & Borsboom, D. (2013). Deconstructing the construct: A network perspective on psychological phenomena. New Ideas in Psychology, 31(1), 43–53. https://doi.org/10.1016/j.newideapsych.2011.02.007
Schweizer, K., & DiStefano, C. (2016). Principles and Methods of Test Construction: Standards and Recent Advances. Hogrefe Publishing. https://doi.org/10.1027/00449-000
Sijtsma, K. (2009). On the Use, the Misuse, and the Very Limited Usefulness of Cronbach’s Alpha. Psychometrika, 74(1), 107–120. https://doi.org/10.1007/s11336-008-9101-0
Sijtsma, K. (2013). Theory Development as a Precursor for Test Validity (pp. 267–274). Springer New York. https://doi.org/10.1007/978-1-4614-9348-8_17
Sijtsma, K., & Junker, B. W. (2006). Item response theory: Past performance, present developments, and future expectations. Behaviormetrika, 33(1), 75–102. https://doi.org/10.2333/bhmk.33.75
Sireci, S. G. (2007). On Validity Theory and Test Validation. Educational Researcher, 36(8), 477–481.



https://doi.org/10.3102/0013189X07311609
Steyer, R. (2001). Classical (Psychometric) Test Theory. In International Encyclopedia of the Social & Behavioral Sciences (pp. 1955–1962). Elsevier. https://doi.org/10.1016/B0-08-043076-7/00721-X
Toomela, A. (2010). Quantitative methods in psychology: inevitable and useless. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2010.00029
Townsend, J. T. (2008). Mathematical psychology: Prospects for the 21st century: A guest editorial. Journal of Mathematical Psychology, 52(5), 269–280. https://doi.org/10.1016/j.jmp.2008.05.001
Wolming, S., & Wikström, C. (2010). The concept of validity in theory and practice. Assessment in Education: Principles, Policy and Practice, 17(2), 117–132.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.
Zumbo, B. D., & Hubley, A. M. (Eds.). (2017). Understanding and Investigating Response Processes in Validation Research (Vol. 69). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-56129-5
