
bwsTools: An R Package for Case 1 Best-Worst Scaling

Mark H. White II

National Coalition of Independent Scholars

Author’s note:

I would like to thank Anthony Marley, Geoff Hollis, Kenneth Massey, Guy Hawkins, and Geoff Soutar for their correspondence, as well as the anonymous reviewers for their helpful comments.

All code needed to reproduce analyses in this paper, as well as the source code for the package, can be found at https://osf.io/wb4c3/. Correspondence concerning this article should be addressed to Mark White, [email protected].


Abstract

Case 1 best-worst scaling, also known as best-worst scaling or MaxDiff, is a popular method for examining the relative ratings and ranks of a series of items in various disciplines in academia and industry. The method involves a survey respondent indicating the "best" and "worst" from a sample of items across a series of trials. Many methods exist for calculating scores at the individual and aggregate levels. I introduce the bwsTools package, a free and open-source set of tools for the R statistical programming language, to aid researchers and practitioners in the construction and analysis of best-worst scaling designs. This package is designed to work seamlessly with tidy data, does not require design matrices, and employs various published individual- and aggregate-level scoring methods that have yet to be employed in free software.

Keywords: Best-worst scaling, MaxDiff, choice modeling, R


1. Introduction

Important social and psychological processes require people to choose between alternatives. A high school, for example, might need new chemistry equipment and updated books—but the budget only supports one or the other. In politics, people say they are highly supportive of equality and freedom—but what about when these values come into conflict?

Affirmative action policies, for example, have been framed as promoting racial equality in academic institutions, while others contend these policies necessarily limit the freedom of universities to accept whom they would like (Lakoff, 2014).

Likert-type scales—such as seven-point scales anchored at 1 (Strongly Disagree) and 7 (Strongly Agree)—may not be appropriate measurement tools in these common situations. On a seven-point scale, a respondent can indicate that they "strongly agree" that races should be equal on one item, then also "strongly agree" that universities should be free to accept any students they want on another. The tension between the two, such as in the case of affirmative action, is obscured. Ceiling effects (a plurality responding at the highest point of the scale) and floor effects (a plurality responding at the lowest) are common in studying important issues like prejudice, values, and political ideology. In the abstract, everyone might agree—or at least follow a social norm—that racial inequality is bad and that freedom is good.

A different method to measure attitudes in these domains is to have respondents choose between a series of alternatives. Case 1 best-worst scaling is one such method. This article introduces an R package for designing and analyzing data using this method. It is meant as a tutorial and introduction; it does not explore detailed mathematical proofs for the different analysis options, but offers suggested readings for those interested.

1.1. The Best-Worst Scaling Method

Best-worst scaling is one research method to measure ratings involving trade-offs among many items. This method is also known as "MaxDiff," "case 1 best-worst scaling," or "object case best-worst scaling." I refer to this case of best-worst scaling as BWS in the article and package.

BWS involves respondents making repeated selections of the best and worst items in a series of subsets of items (Louviere, Flynn, & Marley, 2015). As a working example, I consider the question: “Of the issues below, which is the most important to you and which is the least important to you when making political decisions?”

A collection of t items ("treatments") are displayed to respondents across b trials (or "blocks"). Each block contains a subset of k items from the total list. Respondents are asked to mark which of the k items is best and which is worst, leaving the other k − 2 items unmarked. Although the terminology "best" and "worst" is used, it can be generalized to the most or least of any construct. Figure 1 shows an example block.

BWS researchers recommend structuring these series of blocks in balanced incomplete block designs (BIBD; Louviere, Lings, Islam, Gudergan, & Flynn, 2013). These designs ensure each item is shown the same number of times, r, and that each pairwise comparison of items also appears the same number of times, λ. The bwsTools package generally assumes that data are generated using a BIBD, although some functions (described below) will analyze data from a non-BIBD. Figure 2 shows an example design with t = 13 items, b = 13 blocks, and k = 4 items per block, where each item is repeated r = 4 times and each pairwise comparison occurs λ = 1 time.

This means that each respondent will yield b × k observations: b "best" choices, b "worst" choices, and b(k − 2) observations where the item was neither selected "best" nor "worst." For the design in Figure 2, each respondent yields 13 × 4 = 52 observations: 13 best choices, 13 worst choices, and 13 × 2 = 26 unmarked items. These observations can be used to calculate both aggregate ratings (across the sample) and individual ratings (for each respondent).

The motivation for this package was to provide a free, open-source alternative to existing software. bwsTools follows the principles of tidy data (Wickham, 2014), allowing BWS analyses to be integrated more seamlessly into pipelines for importing, preparing, analyzing, and visualizing data (Wickham & Grolemund, 2017). No design matrices are needed for analysis in bwsTools—only the survey responses. Detailed instructions with annotated code on how to structure the data in the required tidy format are provided in the package vignettes. For individual-level tidying, users can run vignette("tidying_data", "bwsTools"), while vignette("aggregate", "bwsTools") covers formatting data for aggregate-level analysis. All inputs and outputs for functions in bwsTools inherit the class data.frame, allowing them to be chained easily in data pipelines.

bwsTools also provides analysis options for multiple individual- and aggregate-level methods (discussed below), published by multiple researchers, that have yet to be implemented in freely-available, open-source software. Lastly, bwsTools has a publicly-available working GitHub repository,1 documenting all development. Programming best practices, such as unit tests and continuous integration tools, are used to ensure stable, reliable releases. The current bwsTools analysis functions do not exhaust the list of published methods; a public repository allows for community collaboration and feedback in adding new methods for analyzing BWS data.

1 github.com/markhwhiteii/bwsTools

2. The bwsTools R Package

bwsTools is an R package with three main purposes: generating BIBDs, calculating aggregate ratings, and calculating individual ratings. Each is discussed in turn. The package can be installed from the Comprehensive R Archive Network (CRAN) using the following code:

install.packages("bwsTools")

2.1. Generating a BIBD

The characteristics of a BIBD (t, b, k, r, and λ) follow specific properties: First, the design contains b blocks of k items; second, each of the t items appears r times and only r times; third, each of the t(t − 1)/2 pairs of items appears λ times and only λ times. An incomplete block design is balanced when λ = r(k − 1) / (t − 1) and both λ and r are integers. It can be difficult for researchers to create designs satisfying these criteria (Morris, 2011; Wu & Hamada, 2000), so textbooks often reference lists of BIBDs from which researchers can choose. The bwsTools package contains a data.frame object, bibds, showing possible values of t, b, k, r, and λ that satisfy the criteria for a BIBD. Thirty-two designs are in this object, taken from Table 11.3 of Cochran and Cox (1957); included are all possible designs where t and b are less than or equal to 20, as more than 20 trials may put cognitive strain on a survey respondent. Cochran and Cox (1957) provide many more examples of a larger size.
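These conditions are simple to verify directly. Below is a minimal sketch (not part of bwsTools; the function name is mine) that checks whether candidate parameters satisfy them:

# Check the BIBD conditions described above for candidate parameters.
is_bibd <- function(t, k, r, b, lambda) {
  b * k == t * r &&                     # total item appearances must match
    lambda == r * (k - 1) / (t - 1) &&  # balance condition
    lambda %% 1 == 0 && r %% 1 == 0     # lambda and r must be integers
}

is_bibd(t = 13, k = 4, r = 4, b = 13, lambda = 1)  # design 27: TRUE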

While planning a study, a researcher can load the bwsTools package and examine the list of 32 designs with the following code:

library(bwsTools)

bibds

Table 1 shows the first and last six rows returned when calling bibds. For example, design six shows six items (t) displayed in groups of two (k) across 15 (b) blocks, with each item appearing five times (r) and each pair of items occurring once (λ). The working example in this article follows design 27 in bibds.

To generate a BIBD using bwsTools, one can supply the design number to the make_bibd function. This call produced the design found on the right-hand side of Figure 2:

make_bibd(27, seed = 1839)

Note that the function will generate one of many designs satisfying these criteria at random. To ensure reproducibility, there is also an argument to set the seed for the random number generator. This defaults to 1839, so each call to make_bibd without a seed explicitly set will yield the same design every time, making for reproducible designs by default.


One then assigns a number to each item and finds-and-replaces each number in the design with the text of the item. Referring back to Figure 2, the first block lists 2, 6, 7, and 13. Comparing these numbers to the text in the left panel, the items in this block would be: "taxes," "crime and violence," "race relations and racism," and "gun policy." Each respondent would then be asked to indicate the most and least important issue from that set of four items, then continue on to the second block, which contains "taxes," "abortion," "drugs and drug abuse," and "bias in the media" (2, 5, 8, and 10).
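This find-and-replace can also be done in R itself. A sketch, assuming the object returned by make_bibd holds the item numbers in its columns (the exact column layout is an assumption; inspect the returned object first):

items <- c("Healthcare", "Taxes", "National security",
           "Investigating government corruption", "Abortion",
           "Crime and violence", "Race relations and racism",
           "Drugs and drug abuse", "Education", "Bias in the media",
           "The economy", "Foreign affairs and aid", "Gun policy")

design <- make_bibd(27, seed = 1839)
# Assuming every column of the returned design holds item numbers 1-13;
# drop or skip any block-index column before doing this.
design[] <- lapply(design, function(col) items[col])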

2.2. Aggregate Ratings

A researcher might have the goal of determining how the different items rank against one another across the entire sample. Such is the case, for example, if researchers wish to know the most- and least-persuasive arguments in favor of a proposed policy (e.g., Nilsen, 2019). The two options whose theoretical properties are best known are normalized difference scores and analytical best-worst scores (Marley, Islam, & Hawkins, 2016). Both can be calculated using the total number of times each item was presented to respondents, the number of times each item was chosen as best, and the number of times each was chosen as worst. These data for the running example are shown in the first four columns of Table 2.

The normalized difference score does not require bwsTools to calculate: it is the number of times an item was selected as best minus the number of times it was selected as worst, divided by the total number of times the item appeared to respondents. It is "normalized" because it is bounded between +1 (selected as best every single time it appeared) and -1 (selected as worst every single time it appeared). These are displayed in the "NDiff" column of Table 2.
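Given count data like the first four columns of Table 2 in a data.frame d0 (column names as in the ae_mnl call below), this is one line of base R:

# Normalized difference scores straight from the count columns:
d0$ndiff <- (d0$bests - d0$worsts) / d0$totals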


2.2.1. Analytical Estimation of Multinomial Logistic Regression. Under reasonable assumptions, BWS data can be modeled using multinomial logistic regression (MNL), and the simple normalized difference scores are sufficient statistics for that model, correlating very highly with its coefficients (Finn & Louviere, 1992; Marley et al., 2016; Marley & Louviere, 2005). However, this method does not yield estimates of uncertainty around the scores. Researchers benefit from measures of uncertainty, such as standard errors, whether one uses them to roughly compare items or to get a broad idea of replicability and variability in the data (Cumming & Maillardet, 2006; Rouder & Morey, 2005). The bwsTools package provides a function to calculate the coefficients and standard errors from the MNL model.

bwsTools uses the analytical estimation (i.e., closed-form solution) presented by Lipovetsky and Conklin (2014) for the MNL model. The ae_mnl function takes data of the format in the first four columns of Table 2 and returns utility coefficients, standard errors, confidence intervals, and choice probabilities using Equations 7, 10, 12, 13, and 18 from Lipovetsky and Conklin (2014). Utility coefficients are calculated by:

p_j = (N_j^total − N_j^worst + N_j^best) / (2 N_j^total)   [1]

b_j = ln(p_j / (1 − p_j))   [2]

where b_j is the coefficient for each item j, which can be used as an aggregate score, and N_j^total, N_j^worst, and N_j^best are the number of times item j appears in total, is chosen worst, and is chosen best across the entire sample of respondents. Choice probabilities are then calculated by dividing the exponential of each b_j by the sum of the exponentials of all b_j. See Lipovetsky and Conklin (2014) for the full solution.
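A minimal sketch of these point estimates (ae_mnl itself also returns standard errors and intervals; the helper name here is mine):

# Equations 1 and 2, plus the choice probabilities, from raw counts.
b_from_counts <- function(totals, bests, worsts) {
  p <- (totals - worsts + bests) / (2 * totals)  # Equation 1
  b <- log(p / (1 - p))                          # Equation 2
  data.frame(b = b, choice = exp(b) / sum(exp(b)))
}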


In the code below, let d0 represent a data.frame containing the first four columns of Table 2. The user supplies the data as well as the names of the columns containing the number of times the item was presented, chosen as best, and chosen as worst:

ae_mnl(d0, "totals", "bests", "worsts")

A z-statistic (z) can also be supplied to determine the confidence level of the upper and lower bounds; the function defaults to z = 1.96. This code returns the sixth through tenth columns of Table 2.
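For instance, wider 99% intervals could be requested by raising z (2.576 is the standard normal quantile for 99% coverage):

ae_mnl(d0, "totals", "bests", "worsts", z = 2.576)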

The "b" column contains the utility coefficients from the MNL, while the "LB" and "UB" columns show the bounds of the 95% confidence intervals (CIs) around these coefficients. The CIs help users compare the coefficients to one another. From these results, we can see that healthcare is seen as most important, followed by the economy. The "Choice" column contains choice probabilities—how likely it is that each item is chosen from the full set.

2.2.2. Elo rating. BIBDs are recommended for data collection, but sometimes they are not used. An analyst may not have designed the study but is nonetheless tasked with analyzing data that do not come from a BIBD, or there may be too many items of interest to fit into a BIBD (e.g., Hollis, 2018a). In these situations, Elo scores can be employed as an aggregate scoring method. Physics professor Arpad Elo developed the now-famous rating system for the game of chess. These ratings took on his name, and variants are now used in many sports and competitions beyond chess (Langville & Meyer, 2012). Hollis (2018a, 2018b, 2019) extended this scoring system to BWS.

The concept of Elo ratings is that two competitors each have ratings before a game starts. After the competition, the winner's rating increases, while the loser's rating decreases. How much their scores change is a function, Δ, of the difference between their ratings before they started playing. It is also a function of a constant, K, that determines how much this Δ updates the rankings; the larger the K, the more the rankings change after one competition.

For the winner, their beginning Elo score, S_w, is updated to their new score, S'_w, after the competition by:

S'_w = S_w + Δ_w   [3]

where Δ_w = K(1 − E_w)   [4]

and E_w = 10^(S_w / 400) / (10^(S_w / 400) + 10^(S_l / 400))   [5]

while the loser's score, S_l, is updated by:

S'_l = S_l + Δ_l   [6]

where Δ_l = K(0 − E_l)   [7]

and E_l = 10^(S_l / 400) / (10^(S_w / 400) + 10^(S_l / 400))   [8]

See Langville and Meyer (2012, pp. 53-56) for additional details.
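As a sketch, one update under Equations 3 through 8 (the helper name is mine, not the package's internal code):

elo_update <- function(s_w, s_l, K = 30) {
  e_w <- 10^(s_w / 400) / (10^(s_w / 400) + 10^(s_l / 400))  # expected result
  e_l <- 10^(s_l / 400) / (10^(s_w / 400) + 10^(s_l / 400))
  c(winner = s_w + K * (1 - e_w), loser = s_l + K * (0 - e_l))
}

elo_update(1000, 1000)  # evenly matched: winner moves to 1015, loser to 985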

Hollis applies Elo to BWS by conceptualizing each block for each respondent as pairwise comparisons (i.e., "competitions") between the subset of items (i.e., "players"). The item selected as best "beats" all of the other items in that block, the item selected as worst "loses" to all of the others, and all of the items not selected tie with one another. Hollis (2018a) does not consider ties—only comparisons where there is a clear "winner" and "loser." For the entire sample, a "season" of pairwise comparisons is created from every block, each with a winner and loser. Hollis adds two "dummy" items to this list of comparisons: one that loses to every other item and one that beats every other item. This prevents items always selected as best (or worst) from having too extreme of scores. Hollis sets K to 30; this is the bwsTools default.

Elo ratings are temporal in nature: the system updates ratings as a "season" goes along. Since this has no analogue in BWS, Hollis recommends running multiple iterations, each with a different randomized order of the match-ups. The resulting Elo score is the average across iterations. The default number of iterations in bwsTools is the 100 used by Hollis. Lastly, Elo ratings require a starting value, and all items start with the same score of 1,000. This value is arbitrary and does not affect the rank-ordering of the scores—it only acts as a starting point for the first calculation.

To use Elo ratings for aggregate scoring, data are still required to be in tidy, disaggregated format. Users need to arrange their data, referred to as d1 in the code examples below, such that each column is a variable and each row is an individual observation. Columns should be present that indicate which respondent generated the observation, which block it came from, which item it refers to, and the choice made by the respondent. Choices should be coded as 1 (best), -1 (worst), or 0 (not selected). bwsTools includes an object named indiv, a data.frame which contains example data that follow this tidy format; its first eight rows are displayed in Table 3. This shows that Respondent One saw the abortion, race, drugs, and education items in the first block, selecting abortion as least important and drugs as most important. In the second block, this same respondent saw the drugs, economy, foreign affairs, and guns items, selecting the economy as most important and foreign affairs as least important.
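For concreteness, the first block of Table 3 could be built by hand as follows (a toy fragment only; a real d1 stacks every block for every respondent):

d1 <- data.frame(
  id    = 1,
  block = 1,
  issue = c("Abortion", "Race", "Drugs", "Education"),
  value = c(-1, 0, 1, 0)  # 1 = best, -1 = worst, 0 = not selected
)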

A detailed how-to on formatting data is outside the scope of this paper, but it can be found in the package's data tidying vignette, accessed by running vignette("tidying_data", "bwsTools"). The data d1 below follow this format. Aggregate Elo scores can be calculated with the elo function:

elo(d1, "id", "block", "issue", "value", K = 30, iter = 100)

This code returns the "Elo" column in Table 2. This function does not assume a BIBD. The code above keeps the values of K and the number of iterations, iter, at their defaults of 30 and 100, respectively. Setting K = 30 is currently a recommendation made by Hollis; future research could investigate how this constant might be tuned to specific contexts (e.g., number of respondents, number of blocks, number of options per block, and so on).

2.3. Individual Ratings

One might have a research goal of examining correlates of these BWS scores, using the scores as predictors of other constructs, testing for group differences in scores, or using unsupervised learning approaches to cluster individuals based on their BWS scores. Each of these goals requires a researcher to calculate BWS scores at the respondent level—that is, individual ratings. Such scores have been used in studying psychological values (Lee et al., 2019), risk perception (Erdem & Rigby, 2013), corporate social responsibility (Nakano & Tsuge, 2019), wine preferences (de-Magistris, Gracia, & Albisu, 2014), and healthcare (Cheung et al., 2016).

bwsTools contains five functions for calculating these scores, each with similar syntax and each taking data in tidy format (Wickham, 2014). As with the aggregate elo function, the data need to be specified in the disaggregated form described in Section 2.2.2 and shown in Table 3.

2.3.1. Difference scoring. This is the most common individual-level metric used (e.g., Auger, Devinney, & Louviere, 2007; Cohen, 2009; Hein, Jaeger, Carr, & Delahunty, 2008; Jaeger, Jorgensen, Aaslyng, & Bredie, 2008; Kiritchenko & Mohammad, 2017; Mielby, Edelenbos, & Thybo, 2012; Yu et al., 2015). For each respondent, a researcher takes the number of times an item was selected as best and subtracts from it the number of times it was selected as worst. Potential values therefore range from -r to +r, where r refers to how many times each item appeared. This score can also be divided by r, which returns a normalized difference score for each individual.
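Because choices are coded 1, -1, and 0, difference scores are just per-respondent sums; a sketch with base R (diffscoring below adds design checks and normalization):

# bests minus worsts for each respondent-item combination:
agg <- aggregate(value ~ id + issue, data = d1, FUN = sum)
agg$normalized <- agg$value / 4  # divide by r; r = 4 in the Figure 2 design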

Let d1 refer to tidy data following the format described above. The call to generate difference scores is:

diffscoring(d1, "id", "block", "issue", "value", std = TRUE, wide = FALSE)


The arguments after the data refer to the names of the columns containing the respondent IDs, the block numbers, the name or label of the issue, and whether the item was chosen as best (1), worst (-1), or neither (0). The argument std indicates whether one wants raw difference scores (bounded between ±r) or normalized difference scores (bounded between ±1). The first five rows returned from this call are shown in the first three columns of Table 4. While it is recommended that researchers use BIBDs when using difference scoring, bwsTools will allow the user to calculate scores for data that do not come from BIBDs; however, a warning message will notify users when characteristics of BIBDs are not met.

All of the individual scoring functions have these same first five arguments. Every function also includes an argument, wide, which is a logical value indicating whether the user wants their data returned in wide format (i.e., a column for respondent ID, and a column for every variable). The default is to return it in a tidy format (each row is a score for a combination of respondent and item).

2.3.2. Empirical Bayes. A hierarchical Bayesian version of the MNL (HB-MNL) is used in some statistical packages for calculating utility coefficients at the individual level (e.g., Orme, 2005). However, this relies on estimation using Markov chain Monte Carlo (MCMC) procedures, which can be time- and computationally-inefficient when analyzing large datasets. These BWS packages also tend to be neither free nor open-source.

Instead of HB-MNL, Lipovetsky and Conklin (2015) extend their analytical estimation for aggregate utility coefficients (Lipovetsky & Conklin, 2014) to the individual level. They show that choice frequencies can be calculated at the aggregate level, p_j, and the individual level, p_ij, from simple count data using an empirical Bayesian approach. The aggregate estimate is calculated in the same way as described in Equation 1 above for the analytical estimation of the MNL.

At the individual level, however, it is plausible that a respondent chooses an item as best (or worst) every time it is shown. This causes probabilities of one (or zero) to occur, which lead to utility coefficients of positive (or negative) infinity when transformed in Equation 2. To avoid this, a precision parameter, E, is specified by the user. When individual probabilities of zero are encountered, they are replaced with E; when values of one are encountered, they are replaced with 1 − E.

The values of p_j and p_ij can be treated as the prior and likelihood, respectively, to be used in Bayes' formula. Lipovetsky and Conklin show that, under reasonable assumptions, these two values can be used to calculate individual-level posterior utility coefficients using an analytical, closed-form solution. To do so, the user must also specify a mixing parameter, α, that indicates how much weight is put on the prior relative to the likelihood; larger values indicate more weight on the prior. The individual-level score for each item is calculated by:

(α / (1 + α)) p_j + (1 / (1 + α)) p_ij   [9]
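A sketch of this blend (Equation 9), with the zero/one replacement applied first (the helper name is mine):

eb_blend <- function(p_agg, p_ind, alpha = 1, E = .1) {
  p_ind <- pmin(pmax(p_ind, E), 1 - E)  # replace 0 with E and 1 with 1 - E
  (alpha / (1 + alpha)) * p_agg + (1 / (1 + alpha)) * p_ind
}

eb_blend(p_agg = .60, p_ind = 1)  # an extreme individual value shrinks to .75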

Lipovetsky and Conklin (2015) show that the results from this fast analytical procedure correlate highly (r > .85) with methods that are far more computationally complex (e.g., estimation via MCMC). Users can calculate individual utility coefficients using this empirical Bayes technique in bwsTools:


e_bayescoring(d1, "id", "block", "issue", "value", E = .1, alpha = 1, wide = FALSE)

Both E and alpha can be arbitrarily specified by the user. However, Lipovetsky and Conklin (2015) show that scores using this method correlate highest with proprietary HB-MNL software at .1 and 1, respectively. These are thus the recommended and default values in the bwsTools package. The first five rows returned from this call are shown in columns one, two, and four of Table 4. Since this method assumes a BIBD, the function will throw an error if the data do not follow such a design.

2.3.3. Elo rating. The method for calculating Elo scores here is the same as in the aggregate case, except the calculation is done separately for each individual rather than across the entire sample. The code to run this in bwsTools is:

eloscoring(d1, "id", "block", "issue", "value", K = 30, iter = 100, wide = FALSE)

The first five rows returned from this call are shown in columns one, two, and five of Table 4. Since this method makes no assumption about design, non-BIBD data are permitted, although the user is warned when the data do not come from a balanced design.

2.3.4. Walkscoring and PageRank. The final two methods for individual BWS scores apply centrality measures on weighted graphs to BWS. They are unique to the bwsTools package and the current article. Similar to the Elo approach, these methods liken BWS to a sport or game and do not assume a BIBD.

Imagine an abbreviated version of the board game Monopoly that involves three or more players. When one player goes bankrupt, the game ends. The person who went bankrupt loses (scored -1), the person with the most wealth (i.e., money on hand, property values, etc.) at the time the game ends wins (scored +1), and everyone else neither wins nor loses (scored 0). The winner beats the loser by two points and beats the others by one point, while those who neither win nor lose tie with one another and beat the loser by one point. Applying this to BWS, each player is an item and each game is a block: The item chosen as "best" beats the item chosen as "worst" by two points and all other items by one point, while the items chosen as neither "best" nor "worst" tie with one another and beat the item chosen as "worst" by one point.

These "point differentials" can be represented in a square matrix, projected into a weighted graph, and then random walks can be taken through this graph to yield BWS scores. Network-based rating methods are common across a variety of topics, including sports and recommender systems (e.g., Bogers, 2010; Callaghan, Mucha, & Porter, 2007; Jamali & Ester, 2009; Lazova & Basnarkov, 2015; Motegi & Masuda, 2012). For a general discussion of random walks on networks, see Masuda, Porter, and Lambiotte (2017).

The detailed procedure for BWS walkscoring is in Table 5. The conceptual idea is that each item is a node in a directed network. Imagine a person starting at a random item. The person asks this item, "Which item is the best?" This item knows it was beaten by Item J by two points and by Items K and L by one point each. It answers probabilistically, such that it says Item J is best 50% of the time (2 points out of 4 total points), Item K 25% of the time (1 out of 4 points), and Item L 25% of the time (1 out of 4 points). The person walks to the item they were told was best, and the procedure repeats itself. Two networks are made: one directing the random walker toward the best item and another directing the walker toward the worst item. The proportion of walks leading to each node is its raw best or raw worst score, respectively; the worst score is then flipped and averaged with the best score to get the best-worst walkscore. Random walks are implemented using the random_walk function from the igraph R package (Csardi & Nepusz, 2006). See Langville and Meyer (2012, pp. 67-78) for additional details.
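A toy sketch of Steps 4 through 9 for a single three-item "best" network, using igraph directly (bwsTools wraps all of this; the matrix values here are hypothetical):

library(igraph)

# Cell (i, j): how many points item i lost to item j by (Steps 4-5).
m <- matrix(c(0, 2, 1,
              0, 0, 0,
              0, 1, 0), nrow = 3, byrow = TRUE)
m <- m / ifelse(rowSums(m) == 0, 1, rowSums(m))  # row-normalize (Step 6)
m[rowSums(m) == 0, ] <- 1 / ncol(m)              # all-zero rows become 1/t

g <- graph_from_adjacency_matrix(m, mode = "directed", weighted = TRUE)

# One long walk stands in for many short ones here; visit proportions
# approximate the raw "best" scores (Steps 8-9).
visits <- random_walk(g, start = 1, steps = 10000)
prop.table(table(factor(as.integer(visits), levels = 1:3)))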

Random walks are performed for each individual separately, leading to individual best-worst scores. The bwsTools function to calculate walkscores is:

walkscoring(d1, "id", "block", "issue", "value", walks = 10000, wide = FALSE)

The walks argument indicates how many random walks to perform for a given respondent; it defaults to 10,000. This makes the method less computationally-efficient than methods like empirical Bayes scoring; users can lower the number of walks when analyzing large datasets. The first five rows returned from this call are shown in columns one, two, and six of Table 4. The function does not require BIBD data but throws a warning when such data are not present.


A benefit of this approach is that ties rarely occur. With difference scoring, consider two items: Item A is never selected best or worst, leading to a difference score of zero; Item B is selected as best twice and as worst twice, which also leads to a difference score of zero. Walkscoring takes into consideration the full pattern of block results, which may score items A and B differently.

Steps 1-7 and 10 of Table 5 are the same for the PageRank method. The PageRank algorithm was designed by Google and originally used to rank search results (Brin & Page, 1998; Gleich, 2015). The primary factor differentiating it from walkscoring is a "teleportation" parameter. A random walk can sometimes get stuck traversing between a few dominant nodes, which can lead to items having extreme scores. The "teleportation" parameter is the probability that a random walker jumps to a random node instead of following one of the edges determined by a point differential; this keeps the walker from getting stuck between a few dominant nodes. bwsTools uses the page_rank function from the igraph package to calculate these scores. The code to run this method is:

prscoring(d1, "id", "block", "issue", "value", wide = FALSE)

Additional arguments can be specified, which are passed on to the page_rank function. The most pertinent is the damping argument, which controls the "teleportation" parameter: the closer the damping value is to one, the less likely the walker is to jump to a random node. It follows that setting this argument to .9999 will yield essentially the same results as walkscoring. The default is .85. The first five rows returned from this call are shown in columns one, two, and seven of Table 4. This function also does not require BIBD data but warns the user when the data are not balanced.
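For example, stronger teleportation could be requested by lowering damping (the value .70 here is purely illustrative):

prscoring(d1, "id", "block", "issue", "value", wide = FALSE, damping = .70)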

Future work can continue to extend these methods beyond BIBDs. The Elo, walkscore, and PageRank methods do not assume a specific design. This makes them useful for very large designs, where each pairwise comparison between items is not feasible; for example, Park and Newman (2005) apply a network-based ranking system to college football, where each team plays fewer than a tenth of the total teams being rated. The logic of using scoring metrics usually reserved for sports and games can also apply to rating systems beyond Elo ratings and random walks (see Barrow, Drayer, Elliott, Gaut, & Osting, 2013; Stefani, 1997; Vaziri, Dabadghao, Yih, & Morin, 2018).

3. Discussion and Conclusion

Case 1 best-worst scaling, also known as object-case BWS or MaxDiff, is a useful tool employed by researchers in academia and industry across a variety of fields and topics. It provides a way to rate and rank many items that is easier and less cognitively-taxing for respondents (Hein et al., 2008; Jaeger et al., 2008) than asking them to place all items in a rank order. It is also a subtle way to measure sensitive topics, as an item of interest might only be present in, for example, a fourth of the trials. It is a method that requires respondents to make trade-offs between items, as humans frequently must do in social life.


This article introduced the bwsTools R package to facilitate a user-friendly way to design and analyze BWS studies in ways not found in other free, open-source packages. The make_bibd function aids in finding appropriate balanced incomplete block designs for BWS studies, as these designs satisfy the assumptions of the multinomial logistic model on which many scoring methods are based. The ae_mnl function calculates utility coefficients, confidence intervals, and choice probabilities across the entire sample; it uses a closed-form solution, which provides speed for large datasets. Five different functions calculate individual best-worst scores, ranging from simple difference scoring to more complex methods involving tournament-style scoring.

Multiple functions are provided for calculating scores so that researchers can use methods aligning with their best judgment, needs, and assumptions. The bwsTools package also allows for comparison of various methods in the published literature. In the one empirical example used throughout this paper, all of the methods yield essentially the same results (all rs > .84; Figure 3). This suggests that methods which are more computationally-efficient and follow a simpler procedure should be preferred, when possible.
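Such a comparison can be run directly on the built-in indiv data. A sketch, assuming indiv uses the column names from the examples above; the names of the returned score columns are assumptions to be checked against the documentation:

d_diff <- diffscoring(indiv, "id", "block", "issue", "value", std = TRUE)
d_eb   <- e_bayescoring(indiv, "id", "block", "issue", "value")
scores <- merge(d_diff, d_eb, by = c("id", "issue"))
cor(scores[[3]], scores[[4]])  # correlate the two methods' scores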

I recommend using a survey instrument that follows a BIBD and then the ae_mnl and e_bayescoring functions for scoring at the aggregate and individual levels, respectively. These functions rely on counts and employ closed-form solutions, making them faster than the package's other functions. If a BIBD is not employed, however, the assumptions on which these methods rely may not be met; I recommend the tournament-scoring approaches of Elo, walk, and PageRank scores in these situations.


References

Aizaki, H., Nakatani, T., & Sato, K. (2014). Stated preference methods using R. Boca Raton, FL: CRC Press.

Auger, P., Devinney, T. M., & Louviere, J. J. (2007). Using best-worst scaling methodology to investigate consumer ethical beliefs across countries. Journal of Business Ethics, 70(3), 299-326. doi: 10.1007/s10551-006-9112-7

Barrow, D., Drayer, I., Elliott, P., Gaut, G., & Osting, B. (2013). Ranking rankings: An empirical comparison of the predictive power of sports ranking methods. Journal of Quantitative Analysis in Sports, 9(2), 187-202. doi: 10.1515/jqas-2013-0013

Bogers, T. (2010). Movie recommendation using random walks over the contextual graph. In CARS 2010: 2nd Workshop on Context-Aware Recommender Systems, Barcelona, Spain.

Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117. doi: 10.1016/S0169-7552(98)00110-X

Callaghan, T., Mucha, P. J., & Porter, M. A. (2007). Random walker ranking for NCAA Division I-A football. The American Mathematical Monthly, 114(9), 761-777. http://www.jstor.org/stable/27642330

Cheung, K. L., Wijnen, B. F. M., Hollin, I. L., Janssen, E. M., Bridges, J. F., Evers, S. M. A. A., & Hiligsmann, M. (2016). Using best-worst scaling to investigate preferences in health care. PharmacoEconomics, 34, 1195-1209. doi: 10.1007/s40273-016-0429-5

Cochran, W. G., & Cox, G. M. (1957). Experimental designs (2nd ed.). Oxford, England: John Wiley & Sons.

Cohen, E. (2009). Applying best-worst scaling to wine marketing. International Journal of Wine Business Research, 21(1), 8-23. doi: 10.1108/17511060910948008

Csardi, G. & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal: Complex Systems, 1695, 1-9. http://igraph.org

Cumming, G. & Maillardet, R. (2006). Confidence intervals and replication: Where will the next mean fall? Psychological Methods, 11(3), 217-227. doi: 10.1037/1082-989X.11.3.217

de-Magistris, T., Gracia, A., & Albisu, L. (2014). Wine consumers' preferences in Spain: An analysis using the best-worst scaling approach. Spanish Journal of Agricultural Research, 12(3), 529-541. doi: 10.5424/sjar/2014123-4499

Erdem, S. & Rigby, D. (2013). Investigating heterogeneity in the characterization of risks using best worst scaling. Risk Analysis, 33(9), 1728-1748. doi: 10.1111/risa.12012

Finn, A. & Louviere, J. J. (1992). Determining the appropriate response to evidence of public concern: The case of food safety. Journal of Public Policy & Marketing, 11(1), 12-25. doi: 10.1177/074391569201100202

Gleich, D. F. (2015). PageRank beyond the web. SIAM Review, 57(3), 321-363. doi: 10.1137/140976649

Hein, K. A., Jaeger, S. R., Carr, B. T., & Delahunty, C. M. (2008). Comparison of five common acceptance and preference methods. Food Quality and Preference, 19(7), 651-661. doi: 10.1016/j.foodqual.2008.06.001

Hess, S. & Palma, D. (2019). Apollo: A flexible, powerful and customisable freeware package for choice model estimation and application. Journal of Choice Modelling, 32. doi: 10.1016/j.jocm.2019.100170

Hollis, G. (2018a). Scoring best-worst data in unbalanced many-item designs, with applications to crowdsourcing semantic judgments. Behavior Research Methods, 50(2), 711-729. doi: 10.3758/s13428-017-0898-2

Hollis, G. (2018b). When is best-worst best? A comparison of best-worst scaling, numeric estimation, and rating scales for collection of semantic norms. Behavior Research Methods, 50(1), 115-133. doi: 10.3758/s13428-017-1009-0

Hollis, G. (2019). The role of number of items per trial in best-worst scaling experiments. Behavior Research Methods. Published online before print. doi: 10.3758/s13428-019-01270-w

Jaeger, S. R., Jorgensen, A. S., Aaslyng, M. D., & Bredie, W. L. P. (2008). Best-worst scaling: An introduction and initial comparison with monadic rating for preference elicitation with food products. Food Quality and Preference, 19(6), 579-588. doi: 10.1016/j.foodqual.2008.03.002

Jamali, M. & Ester, M. (2009). TrustWalker: A random walk model for combining trust-based and item-based recommendation. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, (pp. 397-406). doi: 10.1145/1557019.1557067

Kiritchenko, S. & Mohammad, S. M. (2017). Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada.

Lakoff, G. (2014). The All New Don't Think of an Elephant!: Know Your Values and Frame the Debate. White River Junction, VT: Chelsea Green Publishing.

Langville, A. N. & Meyer, C. D. (2012). Who's #1?: The Science of Rating and Ranking. Princeton, NJ: Princeton University Press.

Lazova, V. & Basnarkov, L. (2015). PageRank approach to ranking national football teams. In Proceedings of the 12th International Conference for Informatics and Information Technology, Bitola, Macedonia, (pp. 310-313).

Lee, J. A., Sneddon, J. N., Daly, T. M., Schwartz, S. H., Soutar, G. N., & Louviere, J. J. (2019). Testing and extending Schwartz Refined Value Theory using a best-worst scaling approach. Assessment, 26(2), 166-180. doi: 10.1177/1073191116683799

Lipovetsky, S. & Conklin, M. (2014). Best-worst scaling in analytical closed-form solution. Journal of Choice Modelling, 10, 60-68. doi: 10.1016/j.jocm.2014.02.001

Lipovetsky, S. & Conklin, M. (2015). MaxDiff priority estimations with and without HB-MNL. Advances in Adaptive Data Analysis, 7(1-2), 1-10. doi: 10.1142/S1793536915500028

Louviere, J. J., Flynn, T. N., & Marley, A. A. J. (2015). Best-Worst Scaling: Theory, Methods and Applications. Cambridge, UK: Cambridge University Press.

Louviere, J., Lings, I., Islam, T., Gudergan, S., & Flynn, T. (2013). An introduction to the application of (case 1) best-worst scaling in marketing research. International Journal of Research in Marketing, 30(3), 292-303. doi: 10.1016/j.ijresmar.2012.10.002

Marley, A. A. J., Islam, T., & Hawkins, G. E. (2016). A formal and empirical comparison of two score measures for best-worst scaling. Journal of Choice Modelling, 21, 15-24. doi: 10.1016/j.jocm.2016.03.002

Marley, A. A. J. & Louviere, J. J. (2005). Some probabilistic models of best, worst, and best-worst choices. Journal of Mathematical Psychology, 49(6), 464-480. doi: 10.1016/j.jmp.2005.05.003

Masuda, N., Porter, M. A., & Lambiotte, R. (2017). Random walks and diffusion on networks. Physics Reports, 716-717, 1-58. doi: 10.1016/j.physrep.2017.07.007

Mielby, L. H., Edelenbos, M., & Thybo, A. K. (2012). Comparison of rating, best-worst scaling, and adolescents' real choice of snacks. Food Quality and Preference, 25(2), 140-147. doi: 10.1016/j.foodqual.2012.02.007

Morris, M. D. (2011). Design of Experiments: An Introduction Based on Linear Models. Boca Raton, FL: Chapman & Hall/CRC.

Motegi, S. & Masuda, N. (2012). A network-based dynamical ranking system for competitive sports. Scientific Reports, 2(904), 1-7. doi: 10.1038/srep00904

Nakano, M. & Tsuge, T. (2019). Assessing the heterogeneity of consumers' preferences for corporate social responsibility using the best-worst scaling approach. Sustainability, 11(10), 1-12. doi: 10.3390/su11102995

Nilsen, E. (2019). Poll: The Green New Deal is popular in swing House districts. Vox.com. Retrieved from https://www.vox.com/policy-and-politics/2019/9/26/20883384/green-new-deal-poll-swing-districts

Park, J. & Newman, M. E. J. (2005). A network-based ranking system for US college football. Journal of Statistical Mechanics: Theory and Experiment, 10. doi: 10.1088/1742-5468/2005/10/P10014

Rouder, J. N. & Morey, R. D. (2005). Relational and arelational confidence intervals. Psychological Science, 16(1), 77-79. doi: 10.1111/j.0956-7976.2005.00783.x

Stefani, R. T. (1997). Survey of the major world sports rating systems. Journal of Applied Statistics, 24(6), 635-646. doi: 10.1080/02664769723387

Therneau, T. M. & Grambsch, P. M. (2000). Modeling survival data: Extending the Cox model. New York, NY: Springer.

Vaziri, B., Dabadghao, S., Yih, Y., & Morin, T. L. (2018). Properties of sports ranking methods. Journal of the Operational Research Society, 69(5), 776-787. doi: 10.1057/s41274-017-0266-8

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23. doi: 10.18637/jss.v059.i10

Wickham, H. & Grolemund, G. (2017). R for data science. Sebastopol, CA: O'Reilly Media, Inc.

Wu, C. F. J. & Hamada, M. S. (2000). Experiments: Planning, Analysis, and Optimization (2nd ed.). Hoboken, NJ: Wiley.

Yu, T., Holbrook, J. T., Thorne, J. E., Flynn, T. N., Van Natta, M. L., & Puhan, M. A. (2015). Outcome preference in patients with noninfectious uveitis: Results of a best-worst scaling study. Investigative Ophthalmology & Visual Science, 56(11), 6864-6872. doi: 10.1167/iovs.15-16705


Table 1

Snippet of Data Returned From Calling bibds

Design    t    k    r    b    λ
1         4    2    3    6    1
2         4    3    3    4    2
3         5    2    4   10    1
4         5    3    6   10    3
5         5    4    4    5    3
6         6    2    5   15    1
...      ...  ...  ...  ...  ...
27       13    4    4   13    1
28       13    9    9   13    6
29       15    7    7   15    3
30       15    8    8   15    4
31       19    9    9   19    4
32       19   10   10   19    5


Table 2

Example Data and Results for Aggregate Ratings

Item                                  Bests  Worsts  Totals  NDiff      b   SE     LB     UB  Choice   Elo
Healthcare                              731     125    1400    .43    .93  .04    .84   1.01     .17  1194
The economy                             634     119    1400    .37    .77  .04    .69    .85     .15  1167
Education                               467     252    1400    .15    .31  .04    .23    .38     .09  1065
National security                       364     221    1400    .10    .21  .04    .13    .28     .08  1044
Gun policy                              354     290    1400    .05    .09  .04    .02    .17     .07  1023
Taxes                                   386     350    1400    .03    .05  .04   -.02    .13     .07  1014
Crime and violence                      286     282    1400    .00    .01  .04   -.07    .08     .07  1006
Investigating government corruption     318     351    1400   -.02   -.05  .04   -.12    .03     .06   987
Abortion                                302     352    1400   -.04   -.07  .04   -.15    .00     .06   984
Race relations and racism               300     392    1400   -.07   -.13  .04   -.21   -.06     .06   975
Drugs and drug abuse                    163     463    1400   -.21   -.44  .04   -.51   -.36     .04   908
Foreign affairs and aid                 121     545    1400   -.30   -.63  .04   -.70   -.55     .04   858
Bias in the media                       124     808    1400   -.49  -1.07  .04  -1.15   -.98     .02   775


Table 3

Snippet of Data Format Required for Individual Rating Functions in bwsTools

ID  Block  Label            Value
1   1      Abortion         -1
1   1      Race              0
1   1      Drugs             1
1   1      Education         0
1   2      Drugs             0
1   2      Economy           1
1   2      Foreign Affairs  -1
1   2      Guns              0


Table 4

Example Individual Ratings from Various Functions in bwsTools

ID  Issue          Difference  Empirical Bayes   Elo   Walk  PageRank
1   Abortion            -0.75            -0.83   895  -1.03     -0.98
1   Bias in Media        0.25            -0.24  1036   0.16      0.14
1   Corruption          -0.50            -0.54   931  -0.80     -0.79
1   Crime                0.50             0.51  1070   0.28      0.31
1   Drugs                0.25             0.04  1034   0.38      0.33


Table 5

Process for Generating Best-Worst Walkscores

Step  Description

1     Score each response to each item in each block as +1 (best), -1 (worst), or 0 (not selected).

2     Treat each block of k items as a series of k(k − 1)/2 pairwise competitions between items.

3     Score each of these "pairwise competitions" by calculating the "point differentials," such that an item chosen as worst lost by two points to the item chosen as best, the items not chosen tied, the items not chosen both lost to the best item by one, and so on.

4     Create a t × t square "best" matrix where each (i, j) cell corresponds to how many points item i lost to item j by. Ties and wins are scored as zero; diagonal cells (i = j) are scored as zero.

5     Create a t × t square "worst" matrix where each (i, j) cell corresponds to how many points item i beat item j by. Ties and losses are scored as zero; diagonal cells (i = j) are scored as zero.

6     Normalize each of these matrices such that each row sums to one. If a row contains only zeros, each cell is replaced with 1/t.

7     Convert each of these adjacency matrices to weighted directed graphs. bwsTools does so using the igraph package.

8     Run an arbitrarily large number of walks around each of these graphs.

9     Treat the proportion of walks leading to each of the t items as raw best and worst walkscores.

10    Standardize (z-score) each vector of raw best and worst walkscores; multiply the standardized worst scores by -1; average these vectors together for a best-worst walkscore.


Least Important    Issue                 Most Important
                   The economy           X
                   Taxes
X                  Bias in the media
                   National security

Figure 1. Example block of a best-worst scaling design, where "bias in the media" is selected as the least important (i.e., worst) and "the economy" as the most important (i.e., best).


Item  Text                                     Block  Option 1  Option 2  Option 3  Option 4
1     Healthcare                                   1         2         6         7        13
2     Taxes                                        2         2         5         8        10
3     National security                            3         2         3         4         9
4     Investigating government corruption          4         3         7         8        11
5     Abortion                                     5         1         4         5         7
6     Crime and violence                           6         1         2        11        12
7     Race relations and racism                    7         4        10        11        13
8     Drugs and drug abuse                         8         4         6         8        12
9     Education                                    9         1         8         9        13
10    Bias in the media                           10         1         3         6        10
11    The economy                                 11         3         5        12        13
12    Foreign affairs and aid                     12         5         6         9        11
13    Gun policy                                  13         7         9        10        12

Figure 2. Example BIBD. The left panel shows the key mapping each item number to the text shown to participants; the right panel shows a b × k matrix indicating which items appear in each block (row).


Figure 3. A scatterplot depicting the relationships between all individual-level scoring methods on the dataset used throughout this article.