What the Bar Examination Must Achieve: Three Perspectives
by Michael T. Kane, Ph.D.

Bar examinations are expected to meet accepted criteria for testing procedures, but they are also integral parts of the overall process of admitting candidates to the practice of law and therefore have to meet the requirements entailed by this use. In this article I will address three complementary perspectives on bar examinations: the measurement perspective, the decision-making perspective, and the candidate or test-taker perspective. The first two perspectives can be characterized as follows:

• In measurement theory, a candidate’s test score is thought of as an estimate of that candidate’s “true” score (the expected score over an infinite number of replications), and the emphasis is on the validity (or accuracy) and reliability (or precision) of this estimate.

• In evaluating examinations as tools for making decisions, the emphasis tends to be on the appropriateness and fairness of the examination process, in addition to accuracy and precision.

In designing and evaluating bar examinations, it is important to consider the measurement perspective, the more pragmatic decision-making perspective, and to the extent possible, the candidate perspective.

We can imagine a conversation between a person with a strong measurement perspective (MP) and a person with a decision-making perspective (DP):

DP: You look perplexed. What’s up?

MP: This candidate’s score is only one point above the passing score, and given this small difference, he could easily have failed.

DP: Was some mistake made?

MP: There’s no indication of any irregularity, but his score is so close to the passing score that if we repeated the testing, he might fail.

DP: Why should we repeat the testing? He passed and has no reason to test again! As a matter of fact, he is not allowed to take the test again.

MP: His score is above the passing score, but it is only an estimate of his “true” score over all possible replications.

DP: He took the test on this test date, and he passed. Concerns about what might have happened on a different test date are irrelevant.

MP: But the reliability is not perfect, and he would probably get a slightly different score if tested again.

DP: We want the reliability to be high, but the decision is based on his score on this test administration.

The point here is not that one perspective is better than the other. Rather, they are complementary, as I will explain.
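MP’s worry in the exchange above can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from any bar examination’s procedures: it assumes that an observed score equals the true score plus a normally distributed error, and the passing score (135), the true score (136), and the standard error of measurement (5 points) are hypothetical values chosen for the example.

```python
from statistics import NormalDist


def prob_below_cut_on_replication(true_score, passing_score, sem):
    """Probability that one replication yields a score below the passing score.

    Hypothetical helper. Assumes observed = true + error, with the error
    normally distributed with mean 0 and standard deviation `sem`
    (the standard error of measurement). Scores are treated as continuous.
    """
    if sem <= 0:
        raise ValueError("sem must be positive")
    return NormalDist(mu=true_score, sigma=sem).cdf(passing_score)


# All numbers are assumptions made for illustration.
p_fail = prob_below_cut_on_replication(true_score=136, passing_score=135, sem=5)
print(f"{p_fail:.2f}")  # prints 0.42
```

Under these assumed numbers, a candidate whose true score sits one point above the passing score would fall below it on roughly 42% of replications, which is MP’s point. DP’s point is that only the score actually obtained enters the decision.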

The Bar Examiner, September 2012

The Function of Licensure Examinations

Boards of bar examiners and Courts are responsible for evaluating candidates for admission to the practice of law, and in meeting this responsibility, they generally rely on standardized tests of some kind as one way to evaluate competence. The tests function as measurement instruments, but they are also integral components in administrative decision-making procedures, and they have to fit into and support the overall goals of the program.

Bar examinations and other licensure examinations are intended to enhance the quality of professional practice and thereby to protect the public by identifying candidates who are not adequately prepared for entry-level practice; the exams are not intended to rank-order the candidates or to identify the best candidates. In particular, the examinations are designed to assess candidates’ levels of competence in some target domain of knowledge, skills, and judgment (KSJs) that are considered important in entry-level practice. For licensure examinations, the focus tends to be on a fairly broad array of KSJs that would be relevant to the range of practice situations and responsibilities covered by the license. Even though most practitioners are likely to specialize to some extent even in their early years (and more so later), licensure examinations focus on the broad range of KSJs needed in entry-level practice and do not include advanced topics and esoteric content required for advanced, specialized practice.

Note that licensure examinations do not evaluate all of the attributes needed for effective practice over a career, or even in the first few years of a career. Traits like honesty, conscientiousness, and diligence are clearly essential for a candidate to practice effectively, as are adequate levels of social skills, physical health, and mental health. In assessing candidates for the bar, some of these non-cognitive traits are evaluated in character and fitness evaluations, where the emphasis also is on identifying individuals with serious limitations rather than rank-ordering candidates or identifying the best candidates.

The Three Perspectives

Each perspective from which licensure examination programs can be evaluated (the measurement perspective, the decision-making perspective, and the candidate perspective) reflects a legitimate set of goals, and an examination program has to respect all of these goals to be effective. It is necessary for the testing program to

• meet the requirements embedded in the measurement perspective in order to be considered technically adequate,

• meet the requirements for a sound decision-making procedure in order to be considered fair and effective, and

• provide a level playing field for candidates in order to be considered acceptable.

The requirements imposed by considering all three perspectives are not too burdensome, because there is a lot of overlap in the criteria valued by each perspective. The differences among the perspectives tend to be more a matter of emphasis given to certain criteria than they are about the criteria themselves.

The Measurement Perspective

As I explained earlier, in measurement theory, a candidate’s test score is thought of as an estimate of a “true” score for the candidate, and the emphasis is on the validity (or accuracy) and reliability (or precision) of this estimate. For measures of overall competence in some performance domain, the major concerns are the extent to which the test
adequately covers the domain (validity) and the extent to which the score would be stable for each candidate over possible replications of the testing procedure (reliability).

The measurement perspective assumes that the attribute being measured has a definite value for each test taker. We do not generally know what this value is, and in a sense, we can never know exactly what it is, but we can try to estimate this value as accurately as possible. Within the measurement perspective, the main concern is the accuracy of the test scores as estimates of the “true” value of the attribute being measured. For bar examinations and other licensure tests, the accuracy of the test scores is evaluated in two ways:

1. in terms of the plausibility of the claims that are based on the test scores (particularly the claim that the examination provides an accurate indication of overall mastery of the KSJ domain), and

2. by taking into account any external factors that might distort the results.

The conceptualization tends to be abstract and technical, although the underlying principles are simple and intuitive.

A second and supporting concern is the reliability (or precision) of the scores, which is evaluated in terms of the extent to which candidates’ scores would be likely to stay more or less the same if the examination were repeated at about the same time but using different questions (and in the cases of essay questions and performance tasks, with different graders). Reliability is considered a necessary (but not sufficient) condition for validity, because test scores that are not reasonably stable across different conditions (e.g., different graders or different questions covering the same content domain) cannot be interpreted in any consistent way.

The measurement perspective assumes that it is at least conceptually possible to repeat the measurement over and over again on the same person without changing the person. This is obviously not the case in practice, but like assumptions about perfectly smooth surfaces in physics, it can be a useful assumption for some purposes.

Two Methods for Measuring Validity

Criterion-Based Validity Studies and Their Problems for Licensure Exams

The notion of validity as agreement of scores with the “true” value of the attribute being measured may seem simple enough at first, but it can get complicated. Assuming that we don’t know the value of an attribute for a test taker, how can we determine how closely a test taker’s scores approximate her or his true score for the attribute? In some cases, we may have an alternate measure of the attribute that is thought to be very accurate, and in these cases, we can compare the scores on the test to those on this (presumably more accurate) criterion for some sample of test takers; if the test scores agree with those on the criterion measure, we have evidence for the validity of the exam scores as measures of the attribute. This is the approach used for employment tests, where test scores are compared to measures of performance on the job, and in college admissions tests, where test scores are related to performance in college (e.g., first-year GPA). When a reasonably good criterion is available, this criterion-based approach can be very effective, but in many contexts, it is difficult to identify or develop an acceptable criterion measure.

It is generally not possible to conduct adequate criterion-based validity studies for licensure exams
for three reasons. First, the practice of a profession like law covers a wide range of activities, clients, and settings, even in a single area of practice, and it would be very difficult (if not impossible) to evaluate performance consistently across these activities, clients, and settings in order to get an accurate evaluation of performance in practice. The general measures of professional success that are most readily available (awards, publications, income, etc.) do not focus on the protection of the public and therefore are not particularly relevant to the validation of licensure programs. The development of general measures of the quality of practice in terms of client outcomes is probably not possible for most professions.

Second, to the extent that a decent criterion measure could be developed, it would probably need to be context-specific (e.g., performance in handling criminal prosecution or defense). Ratings of performance that would be applicable to different areas of practice would need to be quite general and therefore judgmental and prone to concerns about bias of various kinds. To the extent that the evaluations of performance are limited to specific areas of practice, it would be necessary to study many separate areas of practice in order to get a comprehensive assessment of validity.

Third, and most fundamental, candidates who fail the examination are not allowed to practice, and it is therefore not possible to evaluate the relationship between passing the bar examination and effective performance in practice, even if we could develop an adequate criterion measure of performance in practice.

Content-Based Validity Studies: An Effective Choice for Licensure Exams

The Standards for Educational and Psychological Testing¹ suggest that content-based validity analyses, rather than criterion-based analyses, be used for credentialing examinations (i.e., those used for licensure and certification) mainly because “criterion measures are generally not available for those who are not granted a license” and as a result, “[v]alidation of credentialing tests depends mainly on content-related evidence, often in the form of judgments that the test adequately represents the content domain of the occupation or specialty being considered.”²

This approach is consistent with the intended purpose of licensure exams, to protect the public by excluding from practice candidates who lack a level of competence in the specified KSJ domain needed for safe and effective practice. According to the Standards, “Tests used in credentialing are designed to determine whether the essential knowledge and skills of a specified domain have been mastered by the candidate. The focus of performance standards is on levels of knowledge and performance necessary for safe and appropriate practice.”³ That is, the goal is to get a good indication of each candidate’s level of competence in the KSJ domain by asking the candidate to respond to a set of questions or tasks that are representative of the KSJ domain. The sample of questions or tasks should cover the domain pretty well and should be large enough to yield reliable results (i.e., results that will not vary much over an independent sample of the same size and representativeness).

The Use of Reliability Analyses to Address Variations in Performance

Reliability analyses evaluate the stability of the scores over replications of the testing procedure and thereby support the proposed interpretation of competence in the KSJ domain by indicating that candidates’ scores are not affected much by the particular sample of questions/tasks or the particular graders who rate the responses. Score variations across
replications of the testing program are referred to as “errors of measurement” or “random errors.”

In measurement theory, each test taker’s observed score on a particular testing date is assumed to equal the “true” value of the attribute for the test taker, plus an unknown component (the error of measurement) that reflects specific aspects of the testing situation and that varies randomly from one score to another. These “errors” are not associated with any mistakes or violations of procedure; rather, they reflect variations in performance across samples of questions/tasks and natural fluctuations in the grading of responses. Evaluations of the reliability of licensure test scores tend to focus on two especially significant sources of variation in scores on written tests.

Grader Variability

The scores given to a particular performance on the written (essay and/or performance task) portion of the bar examination can vary from one grader to another, even if the graders are well trained and calibrated, and these variations can be a significant source of error in the essay or performance task scores. This source of variation can be controlled by training the graders thoroughly and monitoring grader performance. It is also highly desirable to include multiple graders for each candidate, in order to “average out” variations among graders that remain after training and calibration. This can be done efficiently by including multiple essay questions and/or performance tasks and having each essay answer or task performance evaluated by a different grader (or set of graders).

Question Variability

A second potentially important source of “random error” is the variability that occurs across questions because a candidate knows some subject areas better than others, misreads a question, or applies a principle incorrectly. It is essential that the questions be written as clearly and accurately as possible. It is also possible to control this source of error to a significant extent by including a large number of questions.

How Standardization Supports Reliability and Validity

Standardization of testing materials and procedures tends to enhance both reliability and validity. It improves reliability by eliminating various sources of irrelevant score variability that would occur if the testing materials, procedures, and conditions were allowed to vary from one candidate to another or from one administration to another. By standardizing as many aspects of the examination program as possible, random fluctuations are reduced and reliability is increased.

Standardization also tends to improve validity by controlling the sources of random variability that are considered under the heading of reliability, and by controlling some additional potential sources of variability (test administration procedures, time limits, test site conditions) that are not usually addressed under the heading of reliability. In addition, standardization tends to promote both reliability and validity by making it possible for the standardized materials and procedures to be carefully designed so that they yield consistent results
while giving each candidate a good opportunity to demonstrate his or her level of competence in the KSJ domain.

For aspects of testing that cannot be standardized, statistical adjustments can sometimes be used to achieve some of the benefits associated with standardization. For example, in order to maintain the security of the examination and to keep test content current, it is generally necessary to change the questions from one administration to another. Nevertheless, it would be desirable from a measurement perspective to control possible changes in difficulty from one sample of questions/tasks to another by standardizing the questions/tasks. Equating procedures are designed to adjust for any differences in the statistical characteristics (in particular, the overall difficulty of the questions/tasks) from one administration of the examination to another.

The Decision-Making Perspective

Decision makers generally take a more pragmatic view of testing. Public officials and other institutional decision makers who are charged with making decisions that impact people’s lives generally operate under mandates that put a high value on objectivity and fairness.⁴ They also typically operate under tight budgets. As a result, they try to make decisions quickly, efficiently, and fairly, and they want the process to be viewed as being efficient and fair. In high-stakes testing contexts like bar examinations and other licensure tests, the measurement perspective’s concerns about validity and reliability are important, but the more general goal of making fair and defensible decisions about candidate competence is the central concern.

Tests as Objective Tools

From a decision-making perspective, the tests are tools that can be helpful in making reasonable and fair decisions in a timely way. Bar admissions programs and other licensure programs employ well-defined, systematic procedures, mainly to promote fairness by eliminating various potential sources of bias that can arise in more face-to-face assessments and to achieve the relatively high level of reliability that is made possible by standardization. In addition, standardized examination structures and procedures tend to make the decision-making process more efficient, because large numbers of candidates can be assessed at the same time.

Theodore Porter, a professor in the UCLA Department of History who specializes in the history of science, has suggested that testing is a common approach to high-stakes decision making in the public arena, and that objectivity (defined in terms of not being subjective, personal, or capricious) is highly valued in decision making because it “provides an answer to a moral demand for impartiality and fairness.”⁵ Test scores and other quantitative measures tend to provide the ultimate in objectivity and, as a result, are commonly used in making licensure and certification decisions. Objectivity and standardization are seen as effective ways to promote fairness as well as validity and reliability:

    Mechanical objectivity has been a favorite of positivist philosophers, and it has a powerful appeal to the wider public. It implies personal restraint. It means following the rules. Rules are a check on subjectivity: they should make it impossible for personal biases or preferences to affect the outcome of an investigation.⁶

The decision-making perspective values the technical qualities emphasized by the measurement perspective, because these qualities are expected to enhance the validity, reliability, and fairness of the decision. They are not ends in themselves.
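The equating adjustment described above can be illustrated with linear equating, one simple textbook method. This is a minimal sketch under stated assumptions, not a description of how any licensure program equates its scores: it assumes that the two forms were taken by groups of equivalent ability, so that differences between the score distributions reflect differences in form difficulty, and the function name and all scores are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(score_new, new_form_scores, ref_form_scores):
    """Place a score from a new form on the reference form's scale.

    Hypothetical helper. Matches the mean and standard deviation of the
    new form's scores to those of the reference form, assuming the two
    groups of test takers are equivalent in ability.
    """
    mu_new, sd_new = mean(new_form_scores), pstdev(new_form_scores)
    mu_ref, sd_ref = mean(ref_form_scores), pstdev(ref_form_scores)
    if sd_new == 0:
        raise ValueError("new-form scores must vary")
    return mu_ref + (sd_ref / sd_new) * (score_new - mu_new)


# Made-up data: the new form is harder (lower mean) and less spread out.
new_form = [40, 50, 60]
ref_form = [50, 70, 90]
print(f"{linear_equate(55, new_form, ref_form):.1f}")  # prints 80.0
```

In this made-up case a raw score of 55 on the harder new form is treated as equivalent to 80 on the reference form, so candidates are not penalized for having received the harder set of questions.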

A Focus on the Specific Test Score

From the decision-making point of view, the candidate is evaluated in terms of his or her performance on a specific test administration, without considering any other variables (e.g., the candidate’s grades in law school; the school attended; the candidate’s appearance, race, or gender). The emphasis is not on how well a candidate might do on an infinite sequence of possible replications of the testing procedure, but on how well the candidate does on the current examination. The focus is on one specific score rather than on the expected score over a hypothetical sequence of possible replications.

The decision-making process is specified in some detail, and much of it is automatic; if a candidate gets a score at or above the passing score, she or he passes, and otherwise, the candidate fails and is not admitted to practice. The procedures are designed to be as explicit as possible, thereby minimizing the need for judgment, subjectivity, and possible bias of any kind.

Judgments Inherent in the Decision-Making Perspective

However, judgment cannot be excluded from the overall decision-making process. Judgments are made at a more general level in designing the testing program and the decision rules before these procedures and rules are applied consistently to all candidates. Among the considerations are the specification of the KSJ domain to be assessed, the procedures for developing the tests based on the domain, the length of the tests, the scoring procedures, the passing score, and the procedures for appealing decisions and requesting accommodations. Many of these issues can be informed by measurement theory, but they are all basically policy issues.

Specification of the KSJ Domain

The design of the KSJ domain generally involves a series of compromises between a desire to include a broad array of KSJs needed for effective entry-level practice and the need to be able to test this KSJ domain reliably and validly within a reasonable period of time. The specification of the KSJ domain depends mainly on professional definitions of the scope of practice, but the experts’ judgments can be supported by empirical surveys of the client problems that are likely to be encountered in entry-level practice and by evaluations of the kinds of KSJs needed to handle these problems.

Test-Development Procedures

The test-development procedures generally require that content specialists develop test questions that cover the KSJ domain adequately, with the guidelines for adequate coverage set forth in test specifications that indicate the mix of questions to be included in each administration of the examination. To keep things reasonably simple, the measurement models generally assume that the test questions are sampled from the domain, but in fact, in bar examinations and other licensure examinations, the questions have to be written and edited by scholars or practitioners; the questions don’t exist until composed and edited. Like many theoretical assumptions, the sampling assumption is a fiction, but it works pretty well in estimating reliability.
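One common way of estimating reliability under the sampling assumption is coefficient alpha (Cronbach’s alpha), which treats the questions on a test as a sample from the domain and asks how consistently candidates perform across them. The article does not say which coefficient any program uses, so the sketch below is only an illustration; the function name and the tiny data set are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a candidates-by-questions table of scores.

    Hypothetical helper. `item_scores` is a list of rows, one row per
    candidate, each row holding that candidate's score on every question.
    """
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("need at least two questions")
    # Variance of each question's scores across candidates.
    columns = list(zip(*item_scores))
    item_variance_sum = sum(pvariance(col) for col in columns)
    # Variance of candidates' total scores.
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores must vary")
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


# Made-up scores for four candidates on two questions.
scores = [[1, 2], [2, 1], [3, 4], [4, 3]]
print(cronbach_alpha(scores))  # prints 0.75
```

Alpha rises toward 1 as candidates’ relative standing becomes more consistent from question to question, which is the sense in which the scores would be expected to hold up over another sample of questions.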

Length of the Test

The length of the test is also a compromise. According to measurement theory, reliability tends to get better as the test gets longer (i.e., has more questions). A longer test can also sample the KSJ domain more thoroughly and thereby enhance the case for the validity of the examination as a measure of overall competence in the domain. The decision-making perspective values these measurement characteristics and also values the thoroughness associated with a longer examination. On the downside, a longer test requires more candidate time and expense and also requires more time and money to develop the questions. The decision makers have to decide on the appropriate trade-off between cost and time on the one hand and reliability, validity, and coverage of the KSJ domain on the other.

Scoring Procedures

The scoring procedures are also basically a policy issue, with some technical implications and limitations. For bar examinations involving an objective test component such as the Multistate Bar Examination and an essay and/or performance task component, the weight to be assigned to each component is an important policy issue. Analyses of the expected reliability and validity⁷ suggest that in order to optimize these measurement characteristics, about 50% to 60% of the weight should be assigned to the objective component, but any weighting system that assigns at least 40% to the objective component works reasonably well.

The selection of a passing score is also primarily a policy issue. Measurement theory does offer some guidance on how to conduct empirical standard-setting studies, but the results of such studies serve mainly to provide those who make the policy decisions with potentially useful information.

The specification of procedures for appealing score-based decisions and for the granting of testing accommodations is guided by applicable laws and regulations, and beyond that by analyses of fairness and practicality. Testing accommodations are a particularly difficult issue from both the measurement perspective and the decision-making perspective, because both of these perspectives value standardization as a means of enhancing reliability, validity, and fairness. From both perspectives, accommodations that level the playing field are legitimate, but accommodations that yield an advantage to one group over another are not considered acceptable.

The Candidate Perspective

A third perspective, that of the candidate, is more goal oriented. In high-stakes contexts, candidates tend to view tests as opportunities or as hurdles that they want to get through successfully and without too much trouble. The candidate perspective does not pay much attention to abstract notions of validity and reliability of the examination as a measurement procedure. It is more concerned about the fairness and appropriateness of the examination, having a level playing field, and having a reasonable chance of success. Paul Holland, a statistician who has worked at Educational Testing Service and as a professor at the University of California, Berkeley, refers to this point of view as the “contest perspective”:

    The basic idea behind the measurement view is that a test measures something about the test taker. The basic idea behind the contest view is that a test that matters is a contest with winners and losers . . . . These two views can emphasize different goals. Measurement view: make a test “reliable” and “valid.” Contest view: make a test “fair” and “understandable.”⁸

Candidates are likely to engage in various test-preparation activities, including practice on item
types included in the test. This is a legitimate course of action for the candidates, for whom test results have important consequences. The candidates are encouraged to prepare themselves as well as they can, but they are not allowed to engage in any activities, like cheating, that would short-circuit the intended interpretation of the test scores as indications of overall competence in the KSJ domain. As in sports, the candidates have a right to prepare themselves for the examination using any means that are not ruled out.

The Intersection of the Three Perspectives

Bar examinations have to meet accepted standards for validity and reliability in order to be considered defensible, but they also have to meet the requirements for fair and reasonable decision making and ensure a level playing field for all candidates. If the test scores are not reasonably reliable (i.e., consistent, stable), and if they are not valid in the sense of adequately sampling a KSJ domain that is clearly relevant to the decision to be made, they will not justify the claims based on them and therefore will not be seen as providing a reasonable basis for bar admissions decisions.

As a requirement for admission to the bar, the testing program has to function effectively as a fair and objective process in order to be implemented and accepted. All of the elements in the decision-making process have to work together to yield fair and reasonable decisions. The decision-making perspective gives a lot of attention to due process and fairness and therefore provides a useful counterbalance to the measurement perspective, which focuses mainly on accuracy and precision. If critics can make the case that the testing program does not provide a level playing field, the program is not likely to survive.

It is also important that test takers see the measurement procedures and the decision rules as giving them a fair chance to succeed. Test takers who see the process as fair and reasonable are more likely to prepare for the examination in constructive ways and are less likely to try to cut corners by engaging in questionable test-preparation or test-taking activities.

From the decision-making perspective, tests are tools that can help make the decision-making process more fair and effective than it would otherwise be. Tests must be reliable and valid enough to achieve that purpose, but these criteria for good measurement are not of primary concern. The examination is expected to be clearly relevant to the purpose at hand and reasonably reliable, and testing procedures should be transparent and fair.

Measurement theory tends to be abstract and hypothetical, relying on notions like statistical sampling from a domain, replications of the sampling design, and the expected values (and variances) of a candidate’s scores over hypothetical replications. The statistical models are quite useful in sorting out the complex choices that need to be made in designing a testing program, but their results have to mesh with the need to make fair and reasonable decisions, and must do so transparently.

Again, the point is not that one perspective is better than the others but that they are complementary. The measurement perspective is especially useful in designing testing programs that will yield reliable and valid information about candidate competence, and it can also be helpful in designing the decision procedures, but it does not provide answers to the wide range of policy questions associated with high-stakes testing programs. Deciding how the score-based information will be used in the operational context of decision making requires a broader and more pragmatic perspective.
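As one illustration of how the statistical models can inform a design choice such as test length, the Spearman-Brown formula from classical test theory projects how reliability changes when a test is lengthened or shortened. The sketch below assumes that any added questions are comparable to the existing ones (an idealization) and uses a hypothetical starting reliability of 0.80; it is not drawn from the article or from any actual examination.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by `length_factor`.

    Hypothetical helper implementing the Spearman-Brown formula:
        projected = k * r / (1 + (k - 1) * r)
    where r is the current reliability and k is the length factor.
    Assumes added questions behave like the existing ones.
    """
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical starting reliability of 0.80.
print(round(spearman_brown(0.80, 2), 3))    # doubling the test: prints 0.889
print(round(spearman_brown(0.80, 0.5), 3))  # halving the test: prints 0.667
```

The projection shows diminishing returns: doubling an already reliable test buys a modest gain, while halving it costs considerably more, which is the kind of information decision makers can weigh against the added time and expense of a longer examination.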

Notes

1. American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, Standards for Educational and Psychological Testing (American Educational Research Association 1999).
2. Id. at 157.
3. American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, supra note 1, at 156.
4. Theodore M. Porter, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life (Princeton University Press 1995).
5. Id. at 8.
6. Porter, supra note 4, at 4.
7. Michael Kane and Susan M. Case, The Reliability and Validity of Weighted Composite Scores, 17 Applied Measurement in Education 221–240 (2004).
8. P.W. Holland, Measurements or Contests? Comment on Zwick, Bond, and Allen/Donogue, in Proceedings of the Social Statistics Section of the American Statistical Association 27–29 (American Statistical Association 1994).

Michael T. Kane, Ph.D., is the holder of the Samuel J. Messick Chair in Test Validity at the Educational Testing Service in Princeton, New Jersey. He was Director of Research for the National Conference of Bar Examiners from 2001 to August 2009. From 1991 to 2001, he was a professor in the School of Education at the University of Wisconsin–Madison, where he taught measurement theory and practice. Before his appointment at Wisconsin, Kane was a senior research scientist at ACT, where he supervised large-scale validity studies of licensure examinations. Kane holds an M.S. in statistics and a Ph.D. in education from Stanford University.
