Running head: BWSTOOLS 1
bwsTools: An R Package for Case 1 Best-Worst Scaling
Mark H. White II
National Coalition of Independent Scholars
Author’s note:
I would like to thank Anthony Marley, Geoff Hollis, Kenneth Massey, Guy Hawkins, and Geoff
Soutar for their correspondence as well as the anonymous reviewers for their helpful comments.
All code needed to reproduce analyses in this paper, as well as the source code for the package, can be found at https://osf.io/wb4c3/. Correspondence concerning this article should be addressed to Mark White, [email protected].
Abstract
Case 1 best-worst scaling, also known as best-worst scaling or MaxDiff, is a popular method for examining the relative ratings and ranks of a series of items in various disciplines in academia and industry. The method involves a survey respondent indicating the “best” and “worst” from a sample of items across a series of trials. Many methods exist for calculating scores at the individual and aggregate levels. I introduce the bwsTools package, a free and open-source set of tools for the R statistical programming language, to aid researchers and practitioners in the construction and analysis of best-worst scaling designs. This package is designed to work seamlessly with tidy data, does not require design matrices, and employs various published individual- and aggregate-level scoring methods that have yet to be employed in free software.
Keywords: Best-worst scaling, MaxDiff, choice modeling, R
1. Introduction
Important social and psychological processes require people to choose between alternatives. A high school, for example, might need new chemistry equipment and updated books—but the budget only supports one or the other. In politics, people say they are highly supportive of equality and freedom—but what about when these values come into conflict?
Affirmative action policies, for example, have been framed as promoting racial equality in academic institutions, while others have said these policies necessarily limit the freedom for universities to accept who they would like (Lakoff, 2014).
Likert-type scales—such as seven-point scales anchored at 1 (Strongly Disagree) and 7 (Strongly Agree)—may not be appropriate measurement tools in these common situations. On a seven-point scale, a respondent can indicate that they "strongly agree" that races should be equal on one item, then also "strongly agree" that universities should be free to accept any students they want on another. The tension between the two, such as in the case of affirmative action, is obscured. Ceiling effects (a plurality responding at the highest point of the scale) and floor effects (a plurality responding at the lowest) are common in studying certain important issues like prejudice, values, and political ideology. In the abstract, everyone might agree—or at least follow a social norm—that racial inequality is bad and that freedom is good.
A different method to measure attitudes in these domains is to have respondents choose between a series of alternatives. Case 1 best-worst scaling is one such method. This article introduces an R package for designing and analyzing data using this method. It is meant as a
tutorial and introduction; this article does not explore detailed mathematical proofs for different analysis options, but offers suggested readings for those interested.
1.1. The Best-Worst Scaling Method
Best-worst scaling is one research method to measure ratings involving trade-offs among many items. This method is also known as "MaxDiff," "case 1 best-worst scaling," or "object case best-worst scaling." I refer to this case of best-worst scaling as BWS in the article and package.
BWS involves respondents making repeated selections of the best and worst items in a series of subsets of items (Louviere, Flynn, & Marley, 2015). As a working example, I consider the question: “Of the issues below, which is the most important to you and which is the least important to you when making political decisions?”
A collection of t items ("treatments") is displayed to respondents across b trials (or "blocks"). Each block contains a subset of k items from the total list. Respondents are asked to mark which of the k items is best and which is worst, while k − 2 items are left unmarked. Although the terminology "best" and "worst" is used, it can be generalized to the most or least of any construct. Figure 1 shows an example block.
BWS researchers recommend structuring these series of blocks in balanced incomplete block designs (BIBD; Louviere, Lings, Islam, Gudergan, & Flynn, 2013). These designs ensure each item is shown the same number of times r and that each pairwise comparison of items also appears the same number of times λ. The bwsTools package generally assumes that data are generated using a BIBD, although some functions (described below) will analyze data from a
non-BIBD. Figure 2 shows an example design with t = 13 items, b = 13 blocks, k = 4 items per block, each item repeated r = 4 times, and each pairwise comparison occurring λ = 1 time.
This means that each respondent will yield b × k observations: b "best" choices, b "worst" choices, and b(k − 2) observations where the item was neither selected "best" nor "worst." These observations can be used to calculate both aggregate ratings (across the sample) and individual ratings (for each respondent).
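For the Figure 2 design (b = 13 blocks of k = 4 items), this arithmetic can be made concrete with a short sketch (shown in Python purely for illustration; it is not part of bwsTools):

```python
# Observations generated by one respondent in a BWS design.
b, k = 13, 4  # blocks, and items per block (the Figure 2 design)

best = b                 # one "best" choice per block
worst = b                # one "worst" choice per block
unmarked = b * (k - 2)   # items left unmarked in each block

total = best + worst + unmarked
print(best, worst, unmarked, total)  # 13 13 26 52 (total = b * k)
```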
The motivation for this package was to provide a free, open-source alternative to existing software. bwsTools follows the principles of tidy data (Wickham, 2014), allowing BWS analyses to be integrated more seamlessly into pipelines for importing, preparing, analyzing, and visualizing data (Wickham & Grolemund, 2017). No design matrices are needed for analysis in bwsTools—only the survey responses. Detailed instructions with annotated code are provided in the package vignettes on how to structure the data in the required tidy format. For individual-level tidying, users can run vignette("tidying_data", "bwsTools"), while vignette("aggregate", "bwsTools") covers formatting data for aggregate-level analysis. All inputs and outputs for functions in bwsTools inherit the class data.frame, allowing them to be chained into data pipelines easily.
bwsTools also provides analysis options for multiple individual- and aggregate-level methods (discussed below), published by multiple researchers, that have yet to be implemented in freely-available, open-source software. Lastly, bwsTools has a publicly-available working
GitHub repository 1, documenting all development. Programming best practices, such as unit tests
1 github.com/markhwhiteii/bwsTools
BWSTOOLS 6 and continuous integration tools, are used to ensure stable, reliable releases. The current bwsTools analysis functions do not exhaust the list of published methods; a public repository allows for community collaboration and feedback in adding new methods for analyzing BWS data.
2. The bwsTools R Package
bwsTools is an R package with three main purposes: generating BIBDs, calculating aggregate ratings, and calculating individual ratings. Each is discussed in turn. The package can be installed from the Comprehensive R Archive Network (CRAN) using the following code:
install.packages("bwsTools")
2.1. Generating a BIBD
The characteristics of a BIBD (t, b, k, r, and λ) follow specific properties: First, the design contains b blocks of k items; second, each of the t items appears r times and only r times; third, each of the t! / (2!(t − 2)!) pairs of items appears λ times and only λ times. An incomplete block design is balanced when λ = r(k − 1) / (t − 1) and both λ and r are integers. It can be difficult for researchers to create designs satisfying these criteria (Morris, 2011; Wu & Hamada, 2000), so textbooks often reference lists of BIBDs from which researchers can choose. The bwsTools package contains a data.frame object showing possible values of t, b, k, r, and λ that satisfy the criteria for a BIBD. Thirty-two designs are in this object, taken from Table 11.3 of Cochran and Cox (1957); included are all possible designs where t and b are less than or equal to 20, as more than 20 trials may put cognitive strain on a survey respondent. However, Cochran and Cox (1957) provide many more examples of a larger size.
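These balance conditions are straightforward to verify mechanically. A minimal illustrative sketch (in Python; not a bwsTools function) of testing whether a candidate (t, b, k, r, λ) satisfies them:

```python
def is_bibd(t, b, k, r, lam):
    """Check the defining properties of a balanced incomplete block design."""
    # Counting item slots two ways: b blocks of k items = t items shown r times.
    if b * k != t * r:
        return False
    # Balance: every pair appears lam times, where lam = r(k - 1) / (t - 1).
    return lam * (t - 1) == r * (k - 1)

# Design 6 from Table 1: t = 6, b = 15, k = 2, r = 5, lambda = 1.
print(is_bibd(6, 15, 2, 5, 1))   # True
# The working example (design 27): t = 13, b = 13, k = 4, r = 4, lambda = 1.
print(is_bibd(13, 13, 4, 4, 1))  # True
```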
While planning a study, a researcher can load the bwsTools package and examine the list of 32 designs with the following code:
library(bwsTools)
bibds
Table 1 shows the first and last six rows that are returned when calling bibds. For example, design six shows a design where six items (t) are shown in groups of two (k) across 15 (b) blocks, with each item appearing five times (r) and each pair of items occurring once (λ). The working example in this article follows design 27 in bibds.
To generate a BIBD using bwsTools, one can supply the design number to the make_bibd function. This call produced the design found on the right-hand side of Figure 2:
make_bibd(27, seed = 1839)
Note that the function will generate at random one of many designs satisfying these criteria. To ensure reproducibility, there is also a seed argument for the random number generator. This defaults to 1839, so each call to make_bibd without a seed explicitly set will yield the same design every time, making for reproducible designs by default.
One then assigns a number to each item and then finds and replaces each number in this design with the text of the item. Referring back to Figure 2, the first block lists 2, 6, 7, and 13. Comparing these numbers to the text in the left panel, the items in this block would be: "taxes," "crime and violence," "race relations and racism," and "gun policy." Each respondent would then be asked to indicate the most and least important issue from that set of four items, then continue on to the second block that contains: "taxes," "abortion," "drugs and drug abuse," and "bias in the media" (2, 5, 8, and 10).
2.2. Aggregate Ratings
A researcher might have the goal of determining how well the different items rank against one another, across the entire sample. Such is the case, for example, if researchers wish to know the most- and least-persuasive arguments in favor of a proposed policy (e.g., Nilsen,
2019). The two options whose theoretical properties are best known are normalized difference scores and analytical best-worst scores (Marley, Islam, & Hawkins, 2016). Both can be calculated using: the total number of times each item was presented to respondents, the number of times each item was chosen as best, and the number of times each was chosen as worst. These data for the running example are shown in the first four columns of Table 2.
The normalized difference scores do not require bwsTools to calculate; it is the number of times selected as best minus the number of times selected as worst, divided by the total number of times the item appeared to respondents. It is “normalized” because it is bounded between +1
(selected as best every single time it appeared) and -1 (selected as worst every single time it appeared). These are displayed in the “NDiff” column of Table 2.
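As a concrete sketch of this calculation (in Python for illustration; the counts are hypothetical):

```python
def normalized_diff(total, best, worst):
    """(times chosen best - times chosen worst) / times the item appeared."""
    return (best - worst) / total

# A hypothetical item shown 400 times across the whole sample:
print(normalized_diff(400, 250, 50))  # 0.5
# Bounded at -1 when an item is chosen worst every time it appears:
print(normalized_diff(400, 0, 400))   # -1.0
```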
2.2.1. Analytical Estimation of Multinomial Logistic Regression. Under reasonable assumptions, BWS data can be modeled using multinomial logistic regression (MNL)—but the simple normalized difference scores are sufficient statistics for that model and correlate very highly with coefficients from that model (Finn & Louviere, 1992; Marley et al., 2016; Marley &
Louviere, 2005). However, this method does not yield estimates of uncertainty around the scores.
Researchers benefit from measures of uncertainty, such as standard errors, whether one uses them to roughly compare items or get a broad idea of replicability and variability in the data
(Cumming & Maillardet, 2006; Rouder & Morey, 2005). The bwsTools package provides a function to calculate the utility coefficients and standard errors from the MNL model.
bwsTools uses the analytical estimation (i.e., closed-form solution) presented by
Lipovetsky and Conklin (2014) for the MNL model. The ae_mnl function takes data of the format in the first four columns of Table 2 and returns utility coefficients, standard errors, confidence intervals, and choice probabilities using Equations 7, 10, 12, 13, and 18 from
Lipovetsky and Conklin (2014). Utility coefficients are calculated by:

p_j = (N_j^total − N_j^worst + N_j^best) / 2N_j^total [1]

b_j = ln(p_j / (1 − p_j)) [2]

where b_j is the coefficient for each item j, which can be used as an aggregate score. N_j^total, N_j^worst, and N_j^best are the number of times each item appears in total, is chosen worst, and is chosen best across the entire sample of respondents. Choice probabilities are then calculated by dividing the exponential function of each b_j by the sum of the exponential functions of all b_j. See Lipovetsky and Conklin (2014) for the full solution.
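Equations 1 and 2, together with the choice-probability step, can be sketched as follows (in Python for illustration, with made-up counts; ae_mnl additionally returns the standard errors and confidence intervals):

```python
import math

def ae_mnl_sketch(totals, bests, worsts):
    """Aggregate utility coefficients and choice probabilities from counts."""
    # Equation 1: p_j = (N_total - N_worst + N_best) / 2 N_total.
    ps = [(t - w + b) / (2 * t) for t, b, w in zip(totals, bests, worsts)]
    # Equation 2: b_j = ln(p_j / (1 - p_j)).
    bs = [math.log(p / (1 - p)) for p in ps]
    # Choice probabilities: exp(b_j) divided by the sum of all exp(b_j).
    denom = sum(math.exp(b) for b in bs)
    return bs, [math.exp(b) / denom for b in bs]

# Three hypothetical items, each shown 400 times across the sample.
bs, choice = ae_mnl_sketch([400, 400, 400], [300, 100, 50], [20, 120, 260])
print([round(b, 2) for b in bs])  # coefficients, highest for item one
print(round(sum(choice), 2))      # choice probabilities sum to 1.0
```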
In the code below, let d0 represent a data.frame containing the first four columns of
Table 2. The user supplies the data as well as the names of the columns containing the number of times the item was presented, chosen as best, and chosen as worst:
ae_mnl(d0, "totals", "bests", "worsts")
A z-statistic (z) can also be supplied to determine the confidence level of the upper and lower bounds; the function defaults to z = 1.96. This code returns the sixth through tenth columns of Table 2.
The “b” column contains the utility coefficients from the MNL, while the “LB” and
“UB” columns show the bounds of the 95% confidence intervals (CIs) around these coefficients.
The CIs help users compare the coefficients to one another. From these results, we can see that healthcare is seen as most important, followed by the economy. The “Choice” column contains choice probabilities—how likely it is that each item is chosen from the full set.
2.2.2. Elo rating. BIBDs are recommended for data collection, but sometimes they are not used. An analyst may not have designed the study but is nonetheless tasked with analyzing data which do not come from a BIBD, or there may be too many items of interest to fit into a BIBD (e.g., Hollis, 2018a). In these situations, Elo scores can be employed as an aggregate scoring method. Physics professor Arpad Elo developed the now-famous rating system for the game of chess. These ratings took on his name, and variants are now used in many sports and
competitions beyond chess (Langville & Meyer, 2012). Hollis (2018a, 2018b, 2019) extended this scoring system to BWS.
The concept of Elo ratings is that two competitors each have ratings before a game starts. After the competition, the winner's rating increases, while the loser's rating decreases. How much their scores change is a function, Δ, of the difference between their ratings before they started playing. It is also a function of a constant, K, that determines how much this Δ updates the rankings; the larger the K, the more the rankings change after one competition.

For the winner, their beginning Elo score, S_w, is updated to their new score, S'_w, after the competition by:

S'_w = S_w + Δ_w [3]

where Δ_w = K(1 − E_w) [4]

and E_w = 10^(S_w / 400) / (10^(S_w / 400) + 10^(S_l / 400)) [5]

while the loser's score is updated by:

S'_l = S_l + Δ_l [6]

where Δ_l = K(0 − E_l) [7]

and E_l = 10^(S_l / 400) / (10^(S_w / 400) + 10^(S_l / 400)) [8]
See Langville and Meyer (2012, pp. 53-56) for additional details.
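Equations 3 through 8 amount to a single update step per comparison; a sketch in Python (illustrative only, not the bwsTools implementation):

```python
def elo_update(s_w, s_l, K=30):
    """Update the winner's and loser's ratings after one comparison
    (Equations 3-8)."""
    e_w = 10 ** (s_w / 400) / (10 ** (s_w / 400) + 10 ** (s_l / 400))
    e_l = 10 ** (s_l / 400) / (10 ** (s_w / 400) + 10 ** (s_l / 400))
    return s_w + K * (1 - e_w), s_l + K * (0 - e_l)

# Two items that both start at 1000 have expected scores of 0.5 each,
# so the winner gains, and the loser loses, exactly K/2 = 15 points.
print(elo_update(1000, 1000))  # (1015.0, 985.0)
```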
Hollis applies Elo to BWS by conceptualizing each block for each respondent as pairwise comparisons (i.e., "competitions") between the subset of items (i.e., "players"). The item selected as best "beats" all of the other items in that block, the item selected as worst "loses" to all of the others, and all of the items not selected tie with one another. Hollis (2018a) does not consider ties—only comparisons where there is a clear "winner" and "loser." For the entire sample, a "season" of pairwise comparisons is created from every block, each with a winner and loser. Hollis adds two "dummy" items to this list of comparisons: a dummy item that loses to every other item, and a dummy item that beats every other item. This prevents items always selected as best (or worst) from having too extreme scores. Hollis sets K to 30; this is the bwsTools default.
Elo ratings are temporal in nature: the system updates ratings as a "season" goes along. Since a season has no direct analogue in BWS, Hollis recommends running multiple iterations, each with a different randomized order of the match-ups. The resulting Elo score is the average of these iterations. The default number of iterations in bwsTools is the 100 used by Hollis. Lastly, Elo ratings require a starting value, and all items start with the same score of 1000. This is arbitrary and does not affect the rank-ordering of the scores themselves—it only acts as a starting point for the first calculation.
To use Elo ratings for aggregate scoring, data are still required to be in tidy, disaggregated format. Users need to arrange their data, referred to as d1 in the code examples below, such that each column is a variable and each row is an individual observation. Columns should be present that: indicate which respondent generated the observation, which block it came from, which item it refers to, and the choice made by the respondent. Choices should be coded as 1 (best), -1
(worst), or 0 (not selected). bwsTools includes an object named indiv , a data.frame which contains example data that follow this tidy format. The first eight rows of this data.frame are displayed in Table 3. This shows that Respondent One saw the abortion, race, drugs, and
education items in the first block, selecting abortion as least important and drugs as most important. In the second block, this same respondent saw the drugs, economy, foreign affairs, and guns items, selecting the economy as most important and foreign affairs as least important.
A detailed how-to on formatting data is outside the scope of this paper, but can be found in the package’s data tidying vignette, accessed by running vignette("tidying_data",
"bwsTools"). The data d1 below follow this format. Aggregate scores using Elo ratings can be calculated with the elo function:
elo(d1, "id", "block", "issue", "value", K = 30,
iter = 100)
This code returns the "Elo" column in Table 2. This function does not assume a BIBD. The code above keeps the values of K and the number of iterations, iter, at their defaults of 30 and 100, respectively. Setting K = 30 is currently a recommendation made by Hollis. Future research could investigate how changing this constant could fit specific contexts (e.g., number of respondents, number of blocks, number of options per block, and so on).
2.3. Individual Ratings
One might have a research goal of examining correlates of these BWS scores, using the scores as predictors of other constructs, testing for group differences in scores, or using unsupervised learning approaches to cluster individuals based on their BWS scores. Each of these goals requires a researcher to calculate BWS scores at the respondent level—that is,
individual ratings. These scores are used in studying psychological values (Lee et al., 2019), risk perception (Erdem & Rigby, 2013), corporate social responsibility (Nakano & Tsuge, 2019), wine preferences (de-Magistris, Gracia, & Albisu, 2014), and healthcare (Cheung et al., 2016).
bwsTools contains five functions for calculating these scores, each with similar syntax and each taking data in tidy format (Wickham, 2014). Like with the aggregate Elo function, the data need to be specified in the disaggregated form described in Section 2.2.2 and shown in
Table 3.
2.3.1. Difference scoring. This is the most common individual-level metric used (e.g.,
Auger, Devinney, & Louviere, 2007; Cohen, 2009; Hein, Jaeger, Carr, & Delahunty, 2008;
Jaeger, Jorgensen, Aaslyng, & Bredie, 2008; Kiritchenko & Mohammad, 2017; Mielby,
Edelenbos, & Thybo, 2012; Yu et al., 2015). For each respondent, a researcher takes the number of times an item was selected as best and subtracts from it the number of times it was selected as worst. This means potential values range from −r to +r, where r refers to how many times each item appeared. This can also be divided by r, which returns a normalized difference score for each individual.
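Given one respondent's coded choices for a single item (1 = best, -1 = worst, 0 = neither) across the r blocks in which it appeared, the difference score is simply their sum; a short illustrative sketch in Python (not bwsTools code):

```python
def diff_score(choices, std=False):
    """Sum of 1 (best) / -1 (worst) / 0 (neither) codes for one item;
    std=True divides by r, the number of times the item appeared."""
    raw = sum(choices)
    return raw / len(choices) if std else raw

# An item shown r = 4 times: best twice, worst once, unmarked once.
print(diff_score([1, 1, -1, 0]))            # 1
print(diff_score([1, 1, -1, 0], std=True))  # 0.25
```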
Let d1 refer to tidy data following the format described above. The call to generate difference scores is:
diffscoring(d1, "id", "block", "issue", "value",
std = TRUE, wide = FALSE)
The arguments after the data refer to the names of the columns containing the respondent IDs, block numbers, name or label of the issue, and whether it was chosen as best (1), worst (-1), or neither (0). The argument std indicates whether one wants raw difference scores (bounded between +/− r) or normalized difference scores (bounded between +/− 1). The first five rows returned from this call are shown in the first three columns of Table 4. While it is recommended that researchers use BIBDs when using difference scoring, bwsTools will allow the user to calculate scores for data that do not come from BIBDs. However, a warning message will notify users when characteristics of BIBDs are not met.
All of the individual scoring functions have these same first five arguments. Every function also includes an argument, wide , which is a logical value indicating if the user wants their data returned in wide (i.e., a column for respondent ID, and a column for every variable) format. The default is to return it in a tidy format (each row is a score for a combination of respondent and item).
2.3.2. Empirical Bayes. A hierarchical Bayesian version of the MNL (HB-MNL) is used in some statistical packages for calculating utility coefficients at the individual level (e.g., Orme,
2005). However, this relies on estimation using Markov chain Monte Carlo (MCMC) procedures, which can be inefficient in both time and computation when analyzing large data sets. These BWS packages also tend to be neither free nor open-source.
Instead of HB-MNL, Lipovetsky and Conklin (2015) extend their analytical estimation for aggregate utility coefficients (Lipovetsky & Conklin, 2014) to the individual level. They show that choice frequencies can be calculated at the aggregate level, p_j, and individual level, p_ij, from simple count data using an empirical Bayesian approach. The aggregate estimate is calculated in the same way as described in Equation 1 above for the analytical estimation of the MNL.
But, at the individual level, it is plausible that a respondent might choose an item as best (or worst) every time it was shown. This causes probabilities of one (or zero) to occur in Equation 1, which lead to utility coefficients of infinity (or negative infinity) in Equation 2. To avoid this, a precision parameter, E, is specified by the user. When individual probabilities of zero are encountered, they are replaced with E; when values of one are encountered, they are replaced with 1 − E.
The values of p_j and p_ij can be treated as the prior and likelihood, respectively, to be used in Bayes' formula. Lipovetsky and Conklin show that, under reasonable assumptions, these two values can be used to calculate individual-level posterior utility coefficients using an analytical, closed-form solution. To do so, the user must also specify a mixing parameter, α, that indicates how much weight is put on the prior relative to the likelihood. Larger values indicate more weight on the prior. The individual-level score for each item is calculated by:

(α / (1 + α)) p_j + (1 / (1 + α)) p_ij [9]
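The clipping and mixing steps can be sketched as below (in Python, purely for illustration; the final logit step mirrors Equation 2 and is a simplification of the full published solution):

```python
import math

def eb_coefficient(p_agg, p_ind, E=0.1, alpha=1.0):
    """Mix aggregate (prior) and individual (likelihood) choice frequencies."""
    # Replace individual probabilities of 0 with E and of 1 with 1 - E.
    p_ind = min(max(p_ind, E), 1 - E)
    # Equation 9: weighted mix of prior and likelihood.
    post = (alpha / (1 + alpha)) * p_agg + (1 / (1 + alpha)) * p_ind
    # Convert the posterior probability to a coefficient (as in Equation 2).
    return math.log(post / (1 - post))

# A respondent who chose the item as best every time (p_ind of 1 is clipped
# to 0.9), against a hypothetical aggregate frequency of 0.6:
print(round(eb_coefficient(0.6, 1.0), 3))  # 1.099
```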
Lipovetsky and Conklin (2015) show that the results from this fast analytical procedure correlate highly ( r > .85) with methods that are far more computationally complex (e.g., estimation via MCMC). Users can calculate individual utility coefficients using this empirical
Bayes technique in bwsTools:
e_bayescoring(d1, "id", "block", "issue", "value",
E = .1, alpha = 1, wide = FALSE)
Both E and alpha can be arbitrarily specified by the user. However, Lipovetsky and Conklin
(2015) show that scores using this method correlate highest with proprietary HB-MNL software at .1 and 1, respectively. These are thus the recommended and default values in the bwsTools package. The first five rows returned from this call are shown in columns one, two, and four of
Table 4. Since this method assumes a BIBD, the function will throw an error if the data do not follow such a design.
2.3.3. Elo rating. The Elo scoring method here is the same as in the aggregate case, except the calculation is done separately for each individual instead of across the entire sample. The code to run this in bwsTools is:
eloscoring(d1, "id", "block", "issue", "value",
K = 30, iter = 100, wide = FALSE)
The first five rows returned from this call are shown in columns one, two, and five of Table 4.
Since this method does not make an assumption about design, non-BIBD data are permitted, although the user is warned that their data do not come from a balanced design.
2.3.4. Walkscoring and PageRank. The final two methods for individual BWS scores are applications of centrality measures in weighted graphs to BWS. These are unique to the
bwsTools package and current article. Similar to the Elo approach, these methods both liken
BWS to a sport or game and do not assume a BIBD.
Imagine an abbreviated version of the board game Monopoly that involves three or more players. When one player goes bankrupt, the game ends. The person that went bankrupt loses
(their score is -1), the person with the most wealth (i.e., money on hand, property values, etc.) at the time the game ends wins (scored +1), and everyone else neither wins nor loses (scored 0).
The winner beats the loser by two points and beats the others by one point, while those who neither win nor lose tie with one another and beat the loser by one point. Applying this to BWS, each player is an item and each game is a block: The item chosen as “best” beats the item chosen as “worst” by two points and all other items by one point, while the items chosen as neither
“best” nor “worst” tie with one another and beat the item chosen as “worst” by one point.
These “point differentials” can be represented in a square matrix, projected into a weighted graph, and then random walks can be taken through this graph to yield BWS scores.
Network-based ratings methods are common across a variety of topics, including sports and recommender systems (e.g., Bogers, 2010; Callaghan, Mucha, & Porter, 2007; Jamali & Ester,
2009; Lazova & Basnarkov, 2015; Motegi & Masuda, 2012). For a general discussion of random walks in networks, see Masuda, Porter, and Lambiotte (2017).
The detailed procedure for BWS walkscoring is in Table 5. The conceptual idea is that each item is a node in a directed network. Imagine a person starting at a random item. The person asks this item, “Which item is the best?” This item knows it was beaten by Item J by 2 points, and it was beaten by Item K and Item L by 1 point. It will decide to answer this person
probabilistically, such that it will say Item J is best 50% of the time (2 points out of 4 total points), Item K 25% of the time (1 out of 4 points), and Item L 25% of the time (1 out of 4 points). The person walks to the item that they were told was best, and this procedure repeats itself. Two networks are made: one based on directing the random walker to the best item and another based on directing the random walker to the worst item. The proportion of walks leading to each node is its raw best and raw worst score, respectively; the worst score is then flipped and averaged with the best score to get the best-worst walkscore. Random walks are implemented using the random_walk function from the igraph R package (Csardi & Nepusz, 2006). See
Langville and Meyer (2012, pp. 67-78) for additional details.
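The transition step described above can be sketched as follows (in Python rather than R, purely to illustrate the probabilities; bwsTools builds the actual graph and walks with igraph):

```python
import random

def walk_step(points, current):
    """One step of the 'best' walk: move to a neighbor with probability
    proportional to the points by which that neighbor beat the current item."""
    beats = points[current]  # {item: points it beat the current item by}
    return random.choices(list(beats), weights=list(beats.values()))[0]

# The example from the text: the current item was beaten by Item J by 2
# points and by Items K and L by 1 point each, so it should name J as
# "best" about 50% of the time and K and L about 25% each.
points = {"current": {"J": 2, "K": 1, "L": 1}}
random.seed(1839)
steps = [walk_step(points, "current") for _ in range(10_000)]
print(round(steps.count("J") / 10_000, 2))  # close to 0.50
```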
Random walks are performed for each individual separately, leading to individual best-worst scores. The bwsTools function to calculate walkscores is:
walkscoring(d1, "id", "block", "issue", "value",
walks = 10000, wide = FALSE)
The walks argument indicates how many random walks to perform for a given respondent.
This defaults to 10,000. This makes the method less computationally-efficient than methods like empirical Bayes scoring; users can lower the number of walks when using large datasets. The first five rows returned from this call are shown in columns one, two, and six of Table 4. The function does not require BIBD data but throws a warning when such data are not present.
A benefit of this approach is that ties will rarely occur. With difference scoring, consider two items: Item A is never selected best or worst, leading to a difference score of zero; Item B is selected as best twice and selected as worst twice, which also leads to a difference score of zero.
Walkscoring takes into consideration the full pattern of block results, which may score these items A and B differently.
Steps 1 - 7 and 10 of Table 5 are the same for the PageRank method. The PageRank algorithm was designed by Google and originally used to rank search results (Brin & Page, 1998;
Gleich, 2015). The primary differentiating factor from the walkscoring method is that it has a
“teleportation” parameter. Sometimes a random walk can get stuck traversing between a few dominant nodes, which can lead to items having extreme scores. The “teleportation” parameter is the probability that a random walker will jump to a random node instead of following one of the edges determined by a point differential; this keeps the random walker from getting stuck between a few dominant nodes. bwsTools uses the page_rank function from the igraph package to calculate these scores. The code to run this method is:
prscoring(d1, "id", "block", "issue", "value",
wide = FALSE)
Additional arguments can be specified, which get passed to the page_rank function. The most pertinent is the damping argument, which controls the “teleportation” parameter. The closer the damping value is to one, the less likely the walker will jump to a random node. It follows that
BWSTOOLS 21 setting this argument to .9999 will yield essentially the same results as walkscoring. The default is .85. The first five rows returned from this call are shown in columns one, two, and seven of
Table 4. This function also does not require BIBD data but warns the user when the data are not balanced.
Future work can continue to extend these methods beyond BIBDs. Elo, walkscore, and
PageRank methods do not assume a specific design. This makes them useful for very large designs, when each pairwise comparison between items is not feasible; for example, Park and
Newman (2005) apply a network-based ranking system in college football, where each team plays less than a tenth of the total teams being rated. The logic of using scoring metrics usually reserved for sports and games can also apply to rating systems beyond Elo ratings and random walks (see Barrow, Drayer, Elliott, Gaut, & Osting, 2013; Stefani, 1997; Vaziri, Dabadghao, Yih,
& Morin, 2018).
3. Discussion and Conclusion
Case 1 best-worst scaling, also known as object-case BWS or MaxDiff, is a useful tool employed by researchers in academia and industry across a variety of fields and topics. It provides a way to rate and rank many items that is easier and less cognitively taxing for respondents (Hein et al., 2008; Jaeger et al., 2008) than asking them to place all items in a rank order. It is also a subtle way to measure sensitive topics, as an item of interest might only be present in, for example, a fourth of the trials. It is a method that requires respondents to make trade-offs between items, which humans frequently must do in social life.
This article introduced the bwsTools R package, which facilitates the design and analysis of BWS studies in ways not found in alternative free, open-source packages. The make_bibd function aids in finding appropriate balanced incomplete block designs for BWS studies, as these designs satisfy the assumptions of the multinomial logistic model on which many scoring methods are based. The ae_mnl function calculates utility coefficients, confidence intervals, and choice probabilities across the entire sample; it uses a closed-form solution, which makes it fast for large data sets. Five different functions calculate individual best-worst scores, ranging from simple difference scoring to more complex methods involving tournament-style scoring.
Multiple scoring functions are provided so that researchers can use methods aligning with their best judgment, needs, and assumptions. The bwsTools package also allows for comparison of the various methods in the published literature. In the one empirical example used throughout this paper, all of the methods yield essentially the same results (all rs > .84; Figure 3). This suggests that the methods that are more computationally efficient and procedurally simpler should be preferred when possible.
I recommend using a survey instrument that follows a BIBD and then the ae_mnl and e_bayescoring functions for scoring at the aggregate and individual levels, respectively. These functions rely on counts and employ closed-form solutions, making them faster than the package’s other functions. If a BIBD is not employed, however, the assumptions on which these methods rely may not be met; in those situations, I recommend the tournament-style Elo, walk, and PageRank scoring approaches.
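Putting this recommendation together, a sketch of the workflow might look as follows. This is a hedged sketch: `dat` and `counts` are hypothetical data frames, and the argument names follow the pattern used for prscoring earlier in the paper; consult the package documentation for exact signatures.

```r
library(bwsTools)

# `dat` is assumed to be tidy respondent-level data with one row per item
# per block per respondent (id, block, issue, value), as in Table 3
indiv <- e_bayescoring(dat, "id", "block", "issue", "value")

# `counts` is assumed to hold one row per item, with columns for the total
# number of times shown, times chosen best, and times chosen worst
agg <- ae_mnl(counts, "totals", "bests", "worsts")
```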
References
Aizaki, H., Nakatani, T., & Sato, K. (2014). Stated preference methods using R. Boca Raton, FL:
CRC Press.
Auger, P., Devinney, T. M., & Louviere, J. J. (2007). Using best-worst scaling methodology to
investigate consumer ethical beliefs across countries. Journal of Business Ethics, 70(3),
299-326. doi: 10.1007/s10551-006-9112-7
Barrow, D., Drayer, I., Elliott, P., Gaut, G., & Osting, B. (2013). Ranking rankings: An empirical
comparison of the predictive power of sports ranking methods. Journal of Quantitative
Analysis in Sports, 9(2), 187-202. doi: 10.1515/jqas-2013-0013
Bogers, T. (2010). Movie recommendation using random walks over the contextual graph. In
CARS 2010: 2nd Workshop on Context-Aware Recommender Systems, Barcelona, Spain.
Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.
Computer Networks and ISDN Systems, 30(1-7), 107-117. doi:
10.1016/S0169-7552(98)00110-X
Callaghan, T., Mucha, P. J., & Porter, M. A. (2007). Random walker ranking for NCAA Division
I-A football. The American Mathematical Monthly, 114(9), 761-777.
http://www.jstor.org/stable/27642330
Cheung, K. L., Wijnen, B. F. M., Hollin, I. L., Jassen, E. M., Bridges, J. F., Evers, S. M. A. A., &
Hiligsmann, M. (2016). Using best-worst scaling to investigate preferences in health care.
PharmacoEconomics, 34, 1195-1209. doi: 10.1007/s40273-016-0429-5
Cochran, W. G., & Cox, G. M. (1957). Experimental designs (2nd ed.). Oxford, England: John
Wiley & Sons.
Cohen, E. (2009). Applying best-worst scaling to wine marketing. International Journal of Wine
Business Research, 21(1), 8-23. doi: 10.1108/17511060910948008
Csardi, G. & Nepusz, T. (2006). The igraph software package for complex network research.
InterJournal: Complex Systems, 1695, 1-9. http://igraph.org
Cumming, G. & Maillardet, R. (2006). Confidence intervals and replication: Where will the next
mean fall? Psychological Methods, 11(3), 217-227. doi: 10.1037/1082-989X.11.3.217
de-Magistris, T., Garcia, A., & Albisu, L. (2014). Wine consumers’ preferences in Spain: An
analysis using the best-worst scaling approach. Spanish Journal of Agricultural Research,
12(3), 529-541. doi: 10.5424/sjar/2014123-4499
Erdem, S. & Rigby, D. (2013). Investigating heterogeneity in the characterization of risks using
best worst scaling. Risk Analysis, 33(9), 1728-1748. doi: 10.1111/risa.12012
Finn, A. & Louviere, J. J. (1992). Determining the appropriate response to evidence of public
concern: The case of food safety. Journal of Public Policy & Marketing, 11(1), 12-25.
doi: 10.1177/074391569201100202
Gleich, D. F. (2015). PageRank beyond the web. SIAM Review, 57(3), 321-363. doi:
10.1137/140976649
Hein, K. A., Jaeger, S. R., Carr, B. T., & Delahunty, C. M. (2008). Comparison of five common
acceptance and preference methods. Food Quality and Preference, 19(7), 651-661. doi:
10.1016/j.foodqual.2008.06.001
Hess, S. & Palma, D. (2019). Apollo: A flexible, powerful and customisable freeware package
for choice model estimation and application. Journal of Choice Modelling, 32. doi:
10.1016/j.jocm.2019.100170
Hollis, G. (2018a). Scoring best-worst data in unbalanced many-item designs, with applications
to crowdsourcing semantic judgments. Behavior Research Methods, 50(2), 711-729. doi:
10.3758/s13428-017-0898-2
Hollis, G. (2018b). When is best-worst best? A comparison of best-worst scaling, numeric
estimation, and rating scales for collection of semantic norms. Behavior Research
Methods, 50(1), 115-133. doi: 10.3758/s13428-017-1009-0
Hollis, G. (2019). The role of number of items per trial in best-worst scaling experiments.
Behavior Research Methods. Advance online publication. doi:
10.3758/s13428-019-01270-w
Jaeger, S. R., Jorgensen, A. S., Aaslyng, M. D., & Bredie, W. L. P. (2008). Best-worst scaling:
An introduction and initial comparison with monadic rating for preference elicitation
with food products. Food Quality and Preference, 19(6), 579-588. doi:
10.1016/j.foodqual.2008.03.002
Jamali, M. & Ester, M. (2009). TrustWalker: A random walk model for combining trust-based
and item-based recommendation. In Proceedings of the 15th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, New York, NY, (pp. 397-406).
doi: 10.1145/1557019.1557067
Kiritchenko, S. & Mohammad, S. M. (2017). Best-worst scaling more reliable than rating scales:
A case study on sentiment intensity annotation. In Proceedings of the Annual Meeting of
the Association for Computational Linguistics (ACL), Vancouver, Canada.
Lakoff, G. (2014). The All New Don't Think of an Elephant!: Know Your Values and Frame the
Debate. White River Junction, VT: Chelsea Green Publishing.
Langville, A. N. & Meyer, C. D. (2012). Who's #1?: The Science of Rating and Ranking.
Princeton, NJ: Princeton University Press.
Lazova, V. & Basnarkov, L. (2015). PageRank approach to ranking national football teams. In
Proceedings of the 12th International Conference for Informatics and Information
Technology, Bitola, Macedonia, (pp. 310-313).
Lee, J. A., Sneddon, J. N., Daly, T. M., Schwartz, S. H., Soutar, G. N., & Louviere, J. J. (2019).
Testing and extending Schwartz Refined Value Theory using a best-worst scaling
approach. Assessment, 26(2), 166-180. doi: 10.1177/1073191116683799
Lipovetsky, S. & Conklin, M. (2014). Best-worst scaling in analytical closed-form solution.
Journal of Choice Modelling, 10, 60-68. doi: 10.1016/j.jocm.2014.02.001
Lipovetsky, S. & Conklin, M. (2015). MaxDiff priority estimations with and without HB-MNL.
Advances in Adaptive Data Analysis, 7(1-2), 1-10. doi: 10.1142/S1793536915500028
Louviere, J. J., Flynn, T. N., & Marley, A. A. J. (2015). Best-Worst Scaling: Theory, Methods and
Applications. Cambridge, UK: Cambridge University Press.
Louviere, J., Lings, I., Islam, T., Gudergan, S., & Flynn, T. (2013). An introduction to the
application of (case 1) best-worst scaling in marketing research. International Journal of
Research in Marketing, 30(3), 292-303. doi: 10.1016/j.ijresmar.2012.10.002
Marley, A. A. J., Islam, T., & Hawkins, G. E. (2016). A formal and empirical comparison of two
score measures for best-worst scaling. Journal of Choice Modelling, 21, 15-24. doi:
10.1016/j.jocm.2016.03.002
Marley, A. A. J. & Louviere, J. J. (2005). Some probabilistic models of best, worst, and
best-worst choices. Journal of Mathematical Psychology, 49(6), 464-480. doi:
10.1016/j.jmp.2005.05.003
Masuda, N., Porter, M. A., & Lambiotte, R. (2017). Random walks and diffusion on networks.
Physics Reports, 716-717, 1-58. doi: 10.1016/j.physrep.2017.07.007
Mielby, L. H., Edelenbos, M., & Thybo, A. K. (2012). Comparison of rating, best-worst scaling,
and adolescents’ real choice of snacks. Food Quality and Preference, 25(2), 140-147. doi:
10.1016/j.foodqual.2012.02.007
Morris, M. D. (2011). Design of Experiments: An Introduction Based on Linear Models. Boca
Raton, FL: Chapman & Hall/CRC.
Motegi, S. & Masuda, N. (2012). A network-based dynamical ranking system for competitive
sports. Scientific Reports, 2(904), 1-7. doi: 10.1038/srep00904
Nakano, M. & Tsuge, T. (2019). Assessing the heterogeneity of consumers’ preferences for
corporate social responsibility using the best-worst scaling approach. Sustainability,
11(10), 1-12. doi: 10.3390/su11102995
Nilsen, E. (2019). Poll: The Green New Deal is popular in swing House districts. Vox.com.
Retrieved from
https://www.vox.com/policy-and-politics/2019/9/26/20883384/green-new-deal-poll-swin
g-districts
Park, J. & Newman, M. E. J. (2005). A network-based ranking system for US college football.
Journal of Statistical Mechanics: Theory and Experiment, 10. doi:
10.1088/1742-5468/2005/10/P10014
Rouder, J. N. & Morey, R. D. (2005). Relational and arelational confidence intervals.
Psychological Science, 16(1), 77-79. doi: 10.1111/j.0956-7976.2005.00783.x
Stefani, R. T. (1997). Survey of the major world sports rating systems. Journal of Applied
Statistics, 24(6), 635-646. doi: 10.1080/02664769723387
Therneau, T. M. & Grambsch, P. M. (2000). Modeling survival data: Extending the Cox model.
New York, NY: Springer.
Vaziri, B., Dabadghao, S., Yih, Y., & Morin, T. L. (2018). Properties of sports ranking methods.
Journal of the Operational Research Society, 69(5), 776-787. doi:
10.1057/s41274-017-0266-8
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23. doi:
10.18637/jss.v059.i10
Wickham, H. & Grolemund, G. (2017). R for data science. Sebastopol, CA: O’Reilly Media, Inc.
Wu, C. F. J. & Hamada, M. S. (2000). Experiments: Planning, Analysis, and Optimization (2nd
ed.). Hoboken, NJ: Wiley.
Yu, T., Holbrook, J. T., Thorne, J. E., Flynn, T. N., Van Natta, M. L., & Puhan, M. A. (2015).
Outcome preference in patients with noninfectious uveitis: Results of a best-worst scaling
study. Investigative Ophthalmology & Visual Science, 56(11), 6864-6872. doi:
10.1167/iovs.15-16705
Table 1
Snippet of Data Returned From Calling bibds
Design t k r b λ
1 4 2 3 6 1
2 4 3 3 4 2
3 5 2 4 10 1
4 5 3 6 10 3
5 5 4 4 5 3
6 6 2 5 15 1
27 13 4 4 13 1
28 13 9 9 13 6
29 15 7 7 15 3
30 15 8 8 15 4
31 19 9 9 19 4
32 19 10 10 19 5
Table 2
Example Data and Results for Aggregate Ratings
Item Bests Worsts Totals NDiff b SE LB UB Choice Elo
Healthcare 731 125 1400 .43 .93 .04 .84 1.01 .17 1194
The economy 634 119 1400 .37 .77 .04 .69 .85 .15 1167
Education 467 252 1400 .15 .31 .04 .23 .38 .09 1065
National security 364 221 1400 .10 .21 .04 .13 .28 .08 1044
Gun policy 354 290 1400 .05 .09 .04 .02 .17 .07 1023
Taxes 386 350 1400 .03 .05 .04 -.02 .13 .07 1014
Crime and violence 286 282 1400 .00 .01 .04 -.07 .08 .07 1006
Investigating government corruption 318 351 1400 -.02 -.05 .04 -.12 .03 .06 987
Abortion 302 352 1400 -.04 -.07 .04 -.15 .00 .06 984
Race relations and racism 300 392 1400 -.07 -.13 .04 -.21 -.06 .06 975
Drugs and drug abuse 163 463 1400 -.21 -.44 .04 -.51 -.36 .04 908
Foreign affairs and aid 121 545 1400 -.30 -.63 .04 -.70 -.55 .04 858
Bias in the media 124 808 1400 -.49 -1.07 .04 -1.15 -.98 .02 775
Table 3
Snippet of Data Format Required for Individual Rating Functions in bwsTools
ID Block Label Value
1 1 Abortion -1
1 1 Race 0
1 1 Drugs 1
1 1 Education 0
1 2 Drugs 0
1 2 Economy 1
1 2 Foreign Affairs -1
1 2 Guns 0
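For concreteness, the rows of Table 3 can be assembled as an ordinary data frame; this is just an illustration of the expected tidy shape, not package code.

```r
# One row per item per block per respondent; value is coded
# +1 (best), -1 (worst), or 0 (not selected)
dat <- data.frame(
  id    = rep(1, 8),
  block = rep(c(1, 2), each = 4),
  label = c("Abortion", "Race", "Drugs", "Education",
            "Drugs", "Economy", "Foreign Affairs", "Guns"),
  value = c(-1, 0, 1, 0, 0, 1, -1, 0)
)
```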
Table 4
Example Individual Ratings from Various Functions in bwsTools
ID Issue Difference Empirical Bayes Elo Walk PageRank
1 Abortion -0.75 -0.83 895 -1.03 -0.98
1 Bias in Media 0.25 -0.24 1036 0.16 0.14
1 Corruption -0.50 -0.54 931 -0.80 -0.79
1 Crime 0.50 0.51 1070 0.28 0.31
1 Drugs 0.25 0.04 1034 0.38 0.33
Table 5
Process for Generating Best-Worst Walkscores
Step Description
1 Score each response to each item in each block as +1 (best), -1 (worst), or 0 (not selected).
2 Treat each block of k items as a series of k(k-1)/2 pairwise competitions between items.
3 Score each of these “pairwise competitions” by calculating “point differentials,” such that the item chosen as worst lost by two points to the item chosen as best, the items not chosen tied with one another, the items not chosen each lost to the best item by one, and so on.
4 Create a t × t square “best” matrix where each (i, j) cell corresponds to how many points item i lost to item j by. Ties and wins are scored as zero; diagonal cells (i = j) are scored as zero.
5 Create a t × t square “worst” matrix where each (i, j) cell corresponds to how many points item i beat item j by. Ties and losses are scored as zero; diagonal cells (i = j) are scored as zero.
6 Normalize each of these matrices such that each row sums to one. If a row contains only zeros, each cell is replaced with 1/t.
7 Convert each of these adjacency matrices to weighted directed graphs. bwsTools does so using the igraph package.
8 Run an arbitrarily large number of walks around each of these graphs.
9 Treat the proportion of walks leading to each of the t items as raw best and worst walkscores.
10 Standardize (z-score) each vector of raw best and worst walkscores; multiply the standardized worst scores by -1; average these vectors together for a best-worst walkscore.
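Steps 3 through 6 can be sketched in base R for a single block of k = 4 items; the “worst” matrix is built analogously with the roles reversed. This is an illustrative reimplementation under stated assumptions, not the package’s internal code.

```r
# A single block scored per step 1: +1 best, -1 worst, 0 otherwise
block <- c(best = 1, a = 0, b = 0, worst = -1)

# Step 3: point differentials for all k(k-1)/2 pairwise "competitions"
pairs <- t(combn(names(block), 2))
diffs <- block[pairs[, 1]] - block[pairs[, 2]]

# Step 4: "best" matrix, where cell (i, j) holds how many points item i
# lost to item j by; ties, wins, and the diagonal stay zero
t_items  <- length(block)
best_mat <- matrix(0, t_items, t_items,
                   dimnames = list(names(block), names(block)))
for (r in seq_len(nrow(pairs))) {
  i <- pairs[r, 1]; j <- pairs[r, 2]; d <- diffs[[r]]
  if (d < 0) best_mat[i, j] <- best_mat[i, j] - d  # i lost to j
  if (d > 0) best_mat[j, i] <- best_mat[j, i] + d  # j lost to i
}

# Step 6: row-normalize; rows of all zeros become 1/t in every cell
for (row in seq_len(t_items)) {
  s <- sum(best_mat[row, ])
  best_mat[row, ] <- if (s == 0) rep(1 / t_items, t_items)
                     else best_mat[row, ] / s
}
```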
Least Important    Issue                  Most Important
                   The economy            X
                   Taxes
X                  Bias in the media
                   National security
Figure 1. Example block of a best-worst scaling design, where “bias in the media” is selected as the least important (i.e., worst) and “the economy” as most (i.e., best).
Item Text Block Option 1 Option 2 Option 3 Option 4
1 Healthcare 1 2 6 7 13
2 Taxes 2 2 5 8 10
3 National security 3 2 3 4 9
4 Investigating government corruption 4 3 7 8 11
5 Abortion 5 1 4 5 7
6 Crime and violence 6 1 2 11 12
7 Race relations and racism 7 4 10 11 13
8 Drugs and drug abuse 8 4 6 8 12
9 Education 9 1 8 9 13
10 Bias in the media 10 1 3 6 10
11 The economy 11 3 5 12 13
12 Foreign affairs and aid 12 5 6 9 11
13 Gun policy 13 7 9 10 12
Figure 2. Example BIBD. The left panel shows the key mapping each item number to the text shown to participants; the right panel shows a b × k matrix indicating which items appear in each block (row).
Figure 3. A scatterplot depicting the relationships between all individual-level scoring methods on the dataset used throughout this article.