A Comparison of Usability Between Voting Methods

Kristen K. Greene, Michael D. Byrne, and Sarah P. Everett
Department of Psychology, Rice University
MS-25, Houston, TX 77005 USA
{kgreene, byrne, petersos}@rice.edu

ABSTRACT
In order to know if new voting systems truly represent a gain in usability, it is necessary to have information about the usability of more traditional voting methods. The usability of three different voting methods was evaluated using the metrics recommended by the National Institute of Standards and Technology: efficiency, effectiveness, and satisfaction. Participants voted with two different types of paper-based ballots (one open-response and one bubble ballot) and one mechanical lever machine. No significant differences in voting completion times or error rates were found between voting methods. However, large differences in satisfaction ratings indicated that participants were most satisfied with the bubble ballot and least satisfied with the lever machine. These data are an important step in addressing the need for baseline usability measures for existing voting systems.

Author Keywords
Usability, voting, paper ballots, mechanical lever machine

INTRODUCTION
Usability is a critical issue for voting systems; even if a voting system existed that was perfectly secure, accurate, reliable, and easily auditable, it could still fail to serve the needs of the voters and the democratic process. Voting systems must also be usable by all voters. A truly usable voting technology should be a walk-up-and-use system, enabling even first-time voters to cast their votes successfully. It should be accessible for voters with disabilities, and it should not disenfranchise particular groups of voters. The Help America Vote Act (HAVA) of 2002 was a landmark piece of legislation in terms of calling public attention to issues such as these.

From a human factors and usability standpoint, voting is an arena where research could and should have great practical impact. Voting is a challenging usability problem for two main reasons: characteristics of the user population and characteristics of the task itself. The user population is extremely diverse, representing a wide range of ages and varying greatly in education, socioeconomic status, familiarity with technology, etc. In addition to this extremely diverse voter population, there are subpopulations of users, such as those who are visually impaired or physically disabled, or those who do not speak and/or read English, for whom voting has historically been particularly challenging. Also important from a usability standpoint are the numerous first-time voters and the relatively infrequent nature of the task: even the most zealous voter will not have the opportunity to vote more than a few times per year. This makes it especially crucial that voting systems are usable. One of the promises of electronic voting technology is that it may help address this issue.

However, there is currently no baseline usability information for existing voting technologies to which new systems can be compared. The goal of the current study is to begin addressing this need for baseline data. It is the second in a series of experiments we are conducting that compares the usability of multiple voting technologies: paper ballots, optical scan ballots, mechanical lever machines, punchcard voting systems, and DREs.

The three usability measures used in the current study are those metrics recommended by the National Institute of Standards and Technology (NIST) (Laskowski et al., 2004): efficiency, effectiveness, and satisfaction. These are also suggested by the International Organization for Standardization (ISO 9241-11, 1998). Endorsement by NIST is important because, unlike other usability domains, decisions about acceptability are made by governmental standards bodies.

Efficiency is defined as the relationship between the level of effectiveness achieved and the amount of resources expended, and is usually measured by time on task. Efficiency is an objective usability metric that measures whether a user's goal was achieved without expending an inordinate amount of resources (Industry Usability Reporting Project, 2001). Did voting take an acceptable amount of time? Did voting take a reasonable amount of effort?

Effectiveness is the second objective usability metric recommended by NIST, and is defined as the relationship between the goal of using the system and the accuracy and completeness with which this goal is achieved (Industry Usability Reporting Project, 2001). In the context of voting, accuracy means that a vote is cast for the candidate the voter intended to vote for, without error. Completeness refers to a vote actually being finished and officially cast. Effectiveness is usually measured by collecting error rates, but may also be measured by completion rates and number of assists. If a voter asks a poll worker for help, that would be considered an assist.

Satisfaction is the lone subjective usability metric recommended by NIST, and is defined as the user's subjective response to working with the system (Industry Usability Reporting Project, 2001). Was the voter satisfied with his/her voting experience? Was the voter confident that his/her vote was recorded? Satisfaction can be measured via an external, standardized instrument, usually in the form of a Likert scale questionnaire. The System Usability Scale (SUS) (Brooke, 1996) is a common standardized instrument that has been used and verified in a multitude of domains.

Everett, Byrne, & Greene (2006) used measures of efficiency, effectiveness, and satisfaction to compare the usability of three different types of paper ballots. The current study expands upon this previous research by using the same three measures to assess the usability of mechanical lever machines, in addition to measuring usability for two of the three previously-studied paper ballots.

The three types of paper ballots evaluated by Everett, Byrne, & Greene (2006) were bubble ballots, arrow ballots, and open response ballots (see Figure 1 for examples of bubble and open response ballots). Every participant voted on all three ballot types. Each participant saw either realistic candidate names or fictional candidate names, but never both. Realistic names were those of people who had either announced their intention to run for the office in question, or who seemed most likely to run based on current politics. Fictional names were produced by a random name generator. Participants were also provided with one of three different types of information: they were given a slate that told them exactly who to vote for, given a voter guide, or not given any information about candidates at all.

The three dependent measures used in the previous study were ballot completion time, error rates, and satisfaction. Ballot completion time (efficiency) measured how long it took the participant to complete his or her vote. No significant differences were found between ballot types (bubble, arrow, or open response) or candidate types (realistic or fictional) on the ballot completion time measure. Significant differences were observed between information types; participants given a voter guide took reliably more time to complete their ballots than those who received a slate or those who did not receive any information about candidates.

Errors (our measure of effectiveness) were recorded when a participant marked an incorrect response in the slate condition or when there was a discrepancy between their responses on the three ballot types. Overall, error rates were low but not negligible, with participants making errors on slightly less than 4% of the races.

Satisfaction was measured by the participant's responses to the SUS questionnaire. The bubble ballot produced significantly higher SUS scores than the other two ballots, while the open response and arrow ballots were not significantly different from one another.

Given that we found no significant differences in ballot completion times between realistic and fictional candidate names, only fictional names were used in the current study. This decision was made to prevent advertising campaigns and media coverage from differentially affecting data from one study to the next. In addition, the continued use of realistic candidate names would have quickly rendered the study materials obsolete, making comparisons between studies with new materials difficult. Using fictitious names circumvents both these issues for the current study and future research. Since no significant differences were found between ballot completion times for the three paper ballot types in our previous study, the decision was made to examine only two of those three in the current study. Since the current study evaluated an additional voting technology, mechanical lever machines, we chose to retain two of the three previously-studied paper ballots in order to be able to define errors by majority agreement.

METHOD

Participants
Thirty-six people participated in the current study, 13 of whom were Rice University undergraduate students and 23 of whom were recruited from the larger Houston population. There were 21 female and 15 male participants ranging in age from 18 to 56. The average age was 28.61 (SD = 10.48). All had normal or corrected-to-normal vision and were US citizens fluent in English. On average, participants had voted in 3.11 national elections, ranging from zero to 12. Seventeen of the 36 participants had only voted nationally once before, and another four participants had no prior national voting experience. On average, participants had voted in 7.06 other types of elections (local, school, etc.). Rice students received credit towards a course requirement for participation, and all other participants were compensated $25 for the one-hour study.

Design
A mixed design with one between-subjects variable and one within-subjects variable was used. The between-subjects manipulation was information condition. This meant that each participant received only one of three types of information: slate, voter guide, or no information. A slate was a paper that listed exactly who the participant was to vote for, as well as whether to vote yes/no on the propositions. The voter guide used was based on those produced by the League of Women Voters. It was left completely up to the participant whether she/he chose to actually read and use the voter guide. In the control condition, participants were not given any information about the candidates or propositions. The within-subjects variable was voting method. This meant that every participant voted using each of the three following voting methods: open response ballot, bubble ballot, and mechanical lever machine. The order in which participants voted was counterbalanced, so that each voting method was used first, second, and third an equal number of times.

The three dependent variables measured in the study were voting completion time, errors, and satisfaction. Voting completion time was measured in seconds and indicated how long it took a participant to complete a ballot. Errors were recorded when a participant marked an incorrect response in the slate condition or when there was a discrepancy between their responses on the three voting methods. Satisfaction was measured by the participant's responses to the SUS.

Materials and Procedure
The same fictional candidate names used by Everett, Byrne, & Greene (2006) were used for each of the three voting methods in this study. The first voting method was the open response ballot (Figure 1A). The second voting method was the bubble ballot (Figure 1B). Each voting method used the races and propositions used by Everett, Byrne, & Greene (2006).
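Outside the slate condition, errors were defined by discrepancy, with majority agreement across a participant's three ballots standing in for the voter's intent. The scoring logic can be sketched as follows; the data layout and function name here are hypothetical illustrations, not the authors' actual analysis code:

```python
from collections import Counter

def score_ballots(responses_by_method, slate=None):
    """Count per-method errors for one participant.

    responses_by_method: dict mapping voting method -> list of choices,
    one per race (a hypothetical layout for illustration).
    With a slate, an error is any response differing from the slate.
    Without one, the intended vote for a race is the choice made on a
    majority of the three methods; a dissenting response is an error.
    """
    methods = list(responses_by_method)
    n_races = len(next(iter(responses_by_method.values())))
    errors = Counter({m: 0 for m in methods})
    for race in range(n_races):
        votes = {m: responses_by_method[m][race] for m in methods}
        if slate is not None:
            intended = slate[race]
        else:
            # Majority agreement: a choice made on at least two of the
            # three methods is taken as the voter's intent.
            choice, count = Counter(votes.values()).most_common(1)[0]
            if count < 2:      # no majority: intent is undefined, skip
                continue
            intended = choice
        for m, v in votes.items():
            if v != intended:
                errors[m] += 1
    return errors
```

For example, a participant who chose the same candidate on both paper ballots but a different one on the lever machine would be scored as one lever-machine error for that race.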
Candidate names were produced by a publicly-available random name generator (http://www.kleimo.com/random/name.cfm). The 21 offices ranged from national races to county races, and the six propositions were real propositions that had previously been voted on in other counties/states, and that could easily be realistic issues for future elections in Harris County, TX, the county in which this study was conducted.
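These materials fix the counts that the error analysis later relies on; as a quick check of the arithmetic:

```python
offices, propositions = 21, 6
races_per_method = offices + propositions   # 27 races on each ballot
voting_methods = 3
participants = 36

# Each participant votes every race on every method:
print(races_per_method * voting_methods)    # 81 error opportunities per participant
# One ballot per method per participant:
print(participants * voting_methods)        # 108 ballots collected in total
```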

Figure 1A. Open response ballot

Figure 1B. Bubble ballot

The third voting method was the mechanical lever machine (Figure 2). The lever machines used in this study were purchased at auction from Victoria County, TX.

Figure 2. Mechanical lever machine

Participants first went through the informed consent procedure and read the instructions. Those in the control condition were allowed to begin voting immediately, those in the slate condition were given their slate before beginning to vote, and those in the voter guide condition were given ample opportunity to read over the voter guide. Participants stood throughout the voting process, either at a voting station when using the paper ballots, or behind the curtains of the lever machine when using that voting method.

After voting, participants completed a survey, then were informed about the purpose of the study and paid. The survey included both general demographic questions and questions more specific to voting itself. Voting-specific questions asked about people's previous voting experience, as well as their opinions on the exact voting methods used in the study. Also part of the survey was a separate SUS for each of the three voting methods. The SUS is a Likert scale, which has people answer questions using a scale of one to five, ranging from strongly disagree to strongly agree. The SUS includes questions such as "I thought the system was easy to use" and "I felt very confident using the system." Responses to ten such questions are then combined to yield a single number; the higher the final number, the higher the degree of satisfaction a person experienced (Brooke, 1996).

RESULTS

Efficiency
Figure 3 displays the voting completion times for the three voting methods evaluated in this study. No significant differences in response times were found between the three voting methods, F(2, 66) = 0.24, p = 0.78, nor was there a significant effect of information type on response times, F(2, 33) = 2.01, p = 0.15. Finally, the interaction of voting method and information type was not significant, F(4, 66) = 0.29, p = 0.89.

Figure 3. Voting completion times by voting order and information condition

However, people were reliably slower with the first voting method they used; there was a significant effect of the order in which participants used the three voting methods, F(2, 66) = 27.81, p < 0.001. There was also a significant interaction of voting method order and information type, F(4, 66) = 5.64, p = 0.001. This effect is driven entirely by the first ballot completed by the participants. Participants in the slate condition were significantly faster on their first ballot than participants in the other two conditions. However, participants in the guide and control conditions did not reliably differ on their first ballot, and all three conditions were statistically indistinguishable on their second and third ballots.

Effectiveness
We computed error rates two ways: by race and by ballot. "By race" rates were calculated by dividing the number of errors made by the total number of opportunities for error. There were 27 races (21 offices plus six propositions) on each of the three voting methods. Therefore each participant faced 81 total opportunities for error. Error rates for each ballot type are presented in Table 1. Differences between ballot types on error rate were not significant, F(2, 66) = 0.43, p = .66.

            Open    Bubble    Lever    Overall
Error Rate  .012    .009      .007     .009

Table 1. Error rates by voting method

Similarly, participants did not make significantly more errors depending on what type of information they were given; there was no significant effect of type of information given, F(2, 35) = 0.13, p = .88. This is important because it indicates that the error rate seen in this study is not simply a memory problem whereby participants were failing to remember from one ballot to the next who they voted for.

Another way to consider error rates is by ballot: a ballot either does or does not contain one or more errors. Frequencies by ballot type are presented in Table 2. A chi-square test of association revealed no reliable effect of voting method (χ²(2) = 3.42, p = .18); that is, voting method did not appear to affect the frequency with which ballots containing errors were generated. However, a surprising 17 of the 108 ballots collected contained at least one error; this means nearly 16% of the ballots contained at least one error.

         None    At least 1    Total
Bubble   32      4             36
Open     27      9             36
Lever    32      4             36
Total    91      17            108

Table 2. Ballot error frequencies by voting method

Satisfaction
Participants were most satisfied when using the bubble ballot and least satisfied when using the lever machine, as seen in the SUS scores (Figure 4). This effect was statistically reliable, F(2, 66) = 9.37, p < 0.001. Posthocs revealed that this effect was driven almost entirely by the difference between the bubble ballot and the lever machine, which were significantly different from one another; differences between the open ballot and the other two methods were not significant.

DISCUSSION
In terms of efficiency, as measured by time taken to vote, the three voting methods used in this study seemed approximately equivalent. Effectiveness, however, is clearly a concern. While the overall rate of .009 errors per race appears low, the overall effect on ballots is quite striking. (In fact, if errors were statistically independent of one another, an error rate of .009 per race predicts that in a 27-race ballot, just over 20% of the ballots should contain at least one error.) The substantial error rate seen indicates possible room for improvement in the effectiveness of paper ballots and lever machines.

Despite these similarities between voting methods on the usability metrics of efficiency and effectiveness, sharp differences emerged when examining the usability metric of satisfaction. When Everett, Byrne, & Greene (2006) found that bubble ballots elicited the highest satisfaction ratings, it was unclear whether that was an artifact of the population from which they sampled. It may not have been surprising that they found Rice University undergraduates to be most satisfied when using bubble ballots, since these so closely resemble standardized tests, something with which they are very familiar as students. It is interesting that the bubble ballot continued to receive the highest satisfaction ratings in the current study, one in which a more diverse and representative sample was used.
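The parenthetical independence argument above is easy to verify numerically. A short check, using the overall by-race rate from Table 1 and assuming, as stated, that errors on different races are independent:

```python
per_race = 0.009            # overall by-race error rate (Table 1)
races = 27                  # 21 offices plus six propositions

# Under independence, P(a ballot contains at least one error):
p_ballot = 1 - (1 - per_race) ** races
print(round(p_ballot, 3))   # 0.217, i.e. just over 20%

# Observed in the study: 17 of 108 ballots contained an error
print(round(17 / 108, 3))   # 0.157, i.e. nearly 16%
```

That the observed ballot-level rate falls somewhat below the independence prediction suggests errors cluster within participants rather than occurring independently across races.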

Figure 4. SUS scores (mean SUS score ± 1 SEM for the bubble ballot, open ballot, and lever machine)

FUTURE DIRECTIONS
While the sample used in the current study was an improvement over previous work, it should be broadened even further in several important respects. As the oldest participant in the current study was only 56 years old, a crucial next step will be to sample from a much wider age range. Also, by testing participants who do not read or speak English, one may see interesting cultural differences emerge. Study materials are being translated into both Spanish and Chinese, and future studies will recruit from these subpopulations. Similarly, the range of voting methods should be broadened to include punch card voting, a system which was once common but which is currently on the decline. From a usability perspective, how bad (or good) are punch card systems?

Even with a more representative sample, many unanswered questions remain. For example, ballot layouts and location/clarity of instructions will always be a potential issue. In our study, the lever machines present all candidates from a particular party in a single horizontal row, which makes it easier for people who vote straight-party to complete and check their vote. Is this a visual display characteristic that is worth emulating in the layout of paper ballots or DREs as well? While DREs and lever machines can both be programmed to allow straight-party ticket voting by a single lever pull or on-screen selection, paper ballots do not have such an option. How would adding a straight-party ticket choice affect the efficiency and effectiveness of paper ballots? Would there be ramifications for absentee ballots?

The rapid adoption of DREs raises another important series of questions. Are DREs actually easier to use than their traditional counterparts? What are the implications of voter verified paper audit trails (VVPATs) for usability? Clearly, such paper trails must cost voters time if they are accurately verifying them, but how much time, and how accurate are voters when examining such audit trails?

These kinds of questions only scratch the surface of the total space of human factors issues in elections, issues which will need to be addressed to ensure confidence in any voting system, electronic or otherwise.

ACKNOWLEDGMENTS
We would like to thank Phil Kortum and Stephen Goggin for comments on and assistance with this research. This research was supported by the National Science Foundation under grant #CNS-0524211 (the ACCURATE center). The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of the NSF, the U.S. Government, or any other organization.

REFERENCES
Brooke, J. (1996). SUS: A "quick and dirty" usability scale. In P. W. Jordan, B. Thomas, B. A. Weerdmeester, & A. L. McClelland (Eds.), Usability Evaluation in Industry. London: Taylor and Francis.
Everett, S. P., Byrne, M. D., & Greene, K. K. (2006). Measuring the usability of paper ballots: Efficiency, effectiveness, and satisfaction. To appear in Proceedings of the Human Factors and Ergonomics Society 50th Annual Meeting. Santa Monica, CA: Human Factors and Ergonomics Society.
Industry Usability Reporting Project. (2001). Common industry format for usability test reports (ANSI/INCITS 354-2001). International Committee for Information Technology Standards.
ISO 9241-11. (1998). Ergonomic requirements for office work with visual display terminals (VDTs), Part 11: Guidance on usability.
Laskowski, S. J., Autry, M., Cugini, J., Killam, W., & Yen, J. (2004). Improving the usability and accessibility of voting systems and products. NIST Special Publication 500-256.