
Iterative User-Interface Design


Jakob Nielsen, Bellcore

Four studies show that redesigning user interfaces on the basis of user testing, iterating through at least three versions, can substantially improve usability.

Because even the best experts cannot design perfect user interfaces in a single attempt, interface designers should build a usability engineering life cycle around the concept of iteration.¹ Iterative development of user interfaces involves steady design refinement based on user testing and other evaluation methods. Interface designers complete a design and note the problems several test users have using it. They then fix these problems in a new iteration, which they test again to ensure that the "fixes" did indeed solve the problems and to find any new usability problems introduced by the changed design.

Normally, the design changes from one iteration to the next are local to the specific interface elements that caused user difficulties. An iterative design methodology does not involve blindly replacing interface elements with new, alternative design ideas. If designers must choose between two or more interface alternatives, they can perform comparative testing to measure which alternative is the most usable. However, such tests are usually viewed as constituting a methodology different from iterative design as such, and they are most often devised to measure rather than find usability problems. Iterative design aims specifically at refinement based on lessons learned from previous iterations.

The user tests I present in this article were rigorous, with a fairly large number of test subjects measured carefully in several different ways while performing a fixed set of tasks for each system. In many practical usability-engineering situations, however, we can gain sufficient insight into the usability problems in a design iteration with only a few test subjects, and it may not be necessary to collect quantitative measurement data. The quantitative measures emphasized in this article may be useful for management in larger projects, but they are not the main goal of usability evaluations. Instead, the principal outcome of a usability evaluation in a practical development project is a list of usability problems and suggestions for interface improvements.

After a general discussion of iteration and usability metrics, I present four examples of iterative user-interface design and measurement.

Benefits of iteration

To show the value of iteration in usability engineering, Karat⁷ analyzed a commercial development project in which the user interface to a computer security application was tested and improved through three versions. For a lower-bound estimate of the value of the user-interface improvements, she calculated the time saved by users due to quicker task completion, thus leaving out any value of the other improvements. The security application had 22,876 users. Each could be expected to save 4.67 minutes by using version 3 rather than version 1 to perform a set of 12 initial tasks (corresponding to one day's system use), for a total savings of 1,781 work hours. From this, Karat estimated saved personnel costs of $41,700, which compared very favorably with the increased development costs of $20,700 for iterative design. The true savings were likely to be considerably larger, since the improved interface was also faster after the first day.
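The arithmetic behind this lower-bound estimate is easy to verify. The sketch below is mine, not Karat's; the loaded hourly cost in the second step is implied by the published figures rather than stated in her analysis:

```python
# Reproducing Karat's lower-bound cost-benefit arithmetic (my sketch).
users = 22_876          # users of the security application
minutes_saved = 4.67    # saved per user on 12 tasks (one day's use)

hours_saved = users * minutes_saved / 60
print(f"{hours_saved:,.0f} work hours saved")        # about 1,781

savings = 41_700        # Karat's estimated personnel savings, dollars
print(f"implied loaded cost: ${savings / hours_saved:.2f}/hour")

# Benefit/cost ratio against the $20,700 spent on iterative design:
print(f"ratio: {savings / 20_700:.1f} to 1")         # about 2 to 1
```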

Figure 1 shows a conceptual graph of the relation between design iterations and interface usability. Ideally, each iteration would result in an interface better than the previous version, but as I show later, this is not always true in practice. Some changes in an interface may turn out not to be improvements. Therefore, the true usability curve for a particular product would not be as smooth as the curve in Figure 1. Even so, Figure 1 reflects the general nature of iterative design, though we do not yet have enough documented case studies to estimate the curve precisely.

The first few iterations will in general probably result in major usability gains as interface designers find and fix the true "usability catastrophes." Later iterations have progressively smaller potential for improvements because the major usability problems are eliminated, and the design may eventually become so polished that very little potential for further improvement remains. Because the number of documented cases is small, we do not yet know the point of diminishing returns in terms of number of iterations. We do not even know for sure whether there is an upper limit on the usability of an interface or whether it would be possible to improve some interfaces indefinitely, with continued substantial gains for each iteration.

Assuming that designers stay with the same basic interface and keep refining it, I feel that there is a limit to the level of usability they can achieve. At the same time, I believe that designers can often break such limits by rethinking and completely redesigning the interface as they gain more experience with it over time. For example, the task of programming a computer has been reconceptualized several times, with languages changing from octal machine code to mnemonic assembler to higher programming languages.

[Figure 1. The conceptual relation between interface usability and number of design iterations. The graph plots overall usability against iteration number, with annotations marking the early fixing of interaction bugs and a later reconceptualization of the interface.]

Figure 1 shows a reconceptualization after a long period of stable usability for a system, but we do not really know what leads to the creative insights necessary for a fundamentally novel and better interface design. Interface reconceptualizations may not always be immediately followed by the increased usability indicated in Figure 1. A fundamental redesign may introduce unexpected usability problems that designers would have to iron out with a few additional iterations.

We know of interface reconceptualizations in individual development projects mainly through anecdotal evidence. They have not been documented or had their usability impact measured. One exception is the development of an electronic white pages system by a large telephone company. The system was intended to let customers with home computers search for telephone numbers. Search terms could be matched either exactly or phonetically, and users could expand searches to a larger geographical area.

Unfortunately, even after the iterative design had been through 14 versions, test users still said they were intimidated and frustrated by the system. At this stage, the designers decided to abandon the basic interface design, which was based on a command line. From the insight that the command line made the system's structure invisible to the user, they reconceptualized the interface. The redesigned interface had a menu of services displaying all options at all times, thus making them visible and presumably easier to understand. User satisfaction on a 1-to-5 rating scale (where 5 corresponded to "like very much") increased from 3.4 for the command-line-based version to 4.2 for the menu-based version, indicating a drastic improvement in usability for the reconceptualized interface.

Usability metrics

A system's overall quality is actually a sum of many quality attributes, only one of which is usability. Additionally, a system should be socially acceptable and practically feasible with respect to cost, maintainability, and so on. The system should fit users' job needs and let them produce high-quality results, since that is the reason for having the system at all. I do not consider these issues here because they are related to utility: whether the system's functionality can do what is needed. This article focuses on usability, that is, on how well users can use that functionality. (The concept of utility is not necessarily restricted to work-oriented software. Educational software has high utility if students learn from it, and an entertainment product has high utility if it is fun.)

Despite simplified conceptual illustrations like Figure 1, usability is not really a one-dimensional property of a user interface. A system is usable if it is⁶

• easy to learn, so users can go quickly from not knowing the system to doing some work;
• efficient, letting the expert user attain a high level of productivity;
• easy to remember, so infrequent users can return after a period of inactivity without having to learn everything all over;
• relatively error-free or error-forgiving, so users do not make many errors, and so the errors they do make are not catastrophic (and are easily recovered from); and
• pleasant to use, satisfying users subjectively, so they like to use the system.

Focusing on usability as a quality goal, we can define it in terms of these five attributes. We numerically characterize each attribute with a metric. In the usability field, such metrics are commonly defined in terms of the performance of randomly chosen representative users who do representative benchmark tasks with the system.

Usability measures are empirically derived numbers resulting from one or more such metrics applied to a concrete user interface. This progression from quality goals to quality attributes and their metrics, and then to the actual measures, makes usability steadily more concrete and operationalized.

We should not necessarily pay equal attention to all usability attributes. Some are more important than others and should be the focus of development efforts throughout the iterative design. In some cases, we might even accept lower scores on certain usability metrics as long as their values remain above some minimally acceptable value.⁵ Each case study I discuss here had its own prioritized set of usability attributes, and other projects might have yet other goals. (It is important to prioritize usability attributes early in the project's usability life cycle.)

User efficiency is a measure of how fast users can use the computer system to perform their tasks. User efficiency is different from computer response time, even though faster response times often let users complete their tasks more quickly. Because of its direct bottom-line impact, user efficiency is often seen as the most important attribute of usability. For example, Nynex (the regional telephone company serving New York and New England) estimates that each one-second reduction in work time per call for its toll and assistance operators will save three million dollars per year.⁹ However, other usability attributes may be critical for applications that monitor or control dangerous industrial plants or processes, where an especially low frequency of user errors is desired.

Normally, the values of the chosen usability metrics are measured empirically through user testing. Some research projects aim at analytical methods for estimating certain usability metrics, and they have had some success in quantifying expert-user performance⁹ and the transfer of user learning from one system to another. These analytical methods are far less successful at estimating error rates, however, and they are unable to address the subjective "pleasant-to-use" dimension of usability. In any case, practical development projects should probably stay with user testing for the time being, since the analytical methods can be rather difficult to apply.¹⁰

Often, usability engineers engage about 10 test users for usability measurements, though sometimes they conduct tests with 20 or more users for tight confidence intervals on the measurement results. It is important to pick test users who represent as closely as possible the people who will actually use the system after its release. For repeated user tests, designers should use different users for each test to eliminate transfer of learning from previous iterations.

It is normally quite easy to conduct user testing with novice users to measure learnability and initial performance, error rates, and subjective satisfaction. Usability measures for expert users are unfortunately harder to come by, as are measures of experienced users' ability to return to the system after a period of absence. Such tests require users with actual experience in using the system. For some user interfaces, full expertise may take several years to acquire, and even for simpler interfaces, users normally train for several weeks or months before their performance plateaus at the expert level. Such extended test periods are infeasible during new-product development, so expert performance is much less studied than novice performance. Two ways of obtaining expert users without having to train them are to involve expert users of any prior release of the product and to use the developers themselves as test users. Of course, a developer has a much more extensive understanding of the system than even an expert user, so tests should include some real users, too.

Quantifying relative changes in usability. In this article, I use a uniform scoring of usability improvements to compare the different iterative design projects. The approach normalizes all usability measures with respect to the values measured for the initial design. Thus, the initial design has a normalized usability value of 100, and I normalize the subsequent iterations by dividing their measured values by those measured for the initial design. For example, if users made four errors while completing a task with version 1 and three errors with version 2 of an interface, the normalized usability of interface 2 with respect to user errors is 133 percent of the usability of interface 1, indicating a 33 percent improvement in usability. In the same way, the normalized usability of interface 2 would still be 133 percent if users made eight errors with interface 1 and six errors with interface 2.
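For metrics where a smaller measured value is better, such as errors or task time, this scoring reduces to dividing the baseline value by the new value. A minimal sketch of the calculation just described (the function name is mine):

```python
def normalized_usability(baseline: float, measured: float) -> float:
    """Usability of a later version relative to the initial design
    (which scores 100), for metrics where smaller is better."""
    return 100 * baseline / measured

print(normalized_usability(4, 3))  # 133.3...: a 33 percent improvement
print(normalized_usability(8, 6))  # same ratio, so the same 133.3...
```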
Converting measures of subjective satisfaction into improvement scores is more challenging, because subjective satisfaction is often measured with questionnaires and rating scales. The underlying problem is that rating scales are not ratio scales: It is theoretically impossible to arrive at a perfect formula to calculate how much better one rating is relative to another. We cannot simply take the ratio of the raw rating-scale scores. Doing so would mean, say, that a system rated 2 on a 1-to-5 scale would be twice as good as a system rated 1. However, a system rated 2 on a 1-to-7 scale would then also be twice as good as a system rated 1, even though it is clearly easier to achieve a rating of 2 on a 1-to-7 scale than on a 1-to-5 scale.

Assuming usability engineering has a purely economic goal, we might be able to gather empirical data to estimate how various subjective-satisfaction ratings translate to increased sales of a product or increased employee happiness. Unfortunately, I am not aware of any such data, so I calculated relative subjective-satisfaction scores using a method proven successful in a larger study of subjective-preference measures (see sidebar).¹¹

The case studies in this article all used a fixed set of tasks to measure usability throughout the iterative design process. This was possible because developers defined the systems' basic functionality before the start of the projects. In contrast, some development projects follow a more exploratory model, where the tasks users perform with a system may change during the development process as designers learn more about the users and their needs. For such projects, it may be more difficult to use a measurement method like the one I outline here, but the general results showing improvements from iterative design should still hold.
Calculating scores for overall usability. A single measure of overall user-interface usability is often desirable. For example, interface designers typically can choose between several alternative solutions for various design issues and want to know how each alternative affects the resulting system's overall usability. As another example, company management wanting a single spreadsheet as its corporate standard would need to know which of the many available spreadsheets was the most usable. The interface designers balance usability against other considerations such as implementation time for the design alternatives, and the corporate managers balance usability against purchase price for the spreadsheets.

Ultimately, the measure of usability should be monetary to allow comparisons with other measures such as implementation or purchase expenses. For some usability metrics, economic equivalents are fairly easy to derive, given data about the number of users, their loaded salary costs, and their average workday. (Loaded salaries include not just the money paid to employees but also any additional costs, such as social security taxes and medical and pension benefits.) For example, I recently performed a usability evaluation of a telephone company application. I estimated improvements in the user interface would reduce training time half a day per user and increase expert-user performance 10 percent. Given the number of users and their loaded salaries, these numbers translated to savings of $40,000 from reduced training costs and about $500,000 from increased expert-user performance in the first year alone. In this example, the overall value of the usability improvements is obviously dominated by expert-user performance, especially since the system will probably be used for more than one year.

Other usability metrics are harder to convert into economic measures. In principle, user-error rates can be combined with measures of the seriousness of each error and estimates of the error's impact on corporate profits. The impact of some errors is easy to estimate. For example, an error in the use of a print command causing a file to be printed on the wrong printer has a cost approximately corresponding to the time needed for the user to discover the error and walk to the wrong printer to retrieve the output. An error causing a cement plant to burn down has a cost corresponding to the plant's value plus any business lost while it is rebuilt. In other cases, error costs are harder to estimate. For example, an error causing a customer to be shipped the wrong product has direct costs corresponding to the administrative costs of handling the return plus the wasted freight charges. However, much larger indirect costs could result if the company were to receive a reputation for being an unreliable supplier, causing customers to do business elsewhere.

Subjective satisfaction is perhaps the usability attribute with the least direct economic impact in the case of software for in-house use, even though it may be one of the most important factors influencing individual purchases of shrink-wrap software. Of course, it is conceptually easy to consider the impact of increased user satisfaction on software sales or employee performance, but actual measurements are difficult and were not available in the case studies discussed here.

Because using monetary measures of all usability attributes is difficult, I chose an alternative approach here. I calculated overall usability in relative terms as the geometric mean (the nth root of the product of n values) of the normalized values of the relative improvements in the individual usability metrics. Doing so involves certain weaknesses. First, all the usability metrics measured for a given project are given equal weight, even though some may be more important than others. Also, there are no cutoff values beyond which any further improvement in a given usability metric would be of no practical importance. For example, once the error rate has been reduced to, say, one error per 10,000 user transactions, it may make no practical difference to reduce it further. Even so, reducing the error rate to one error per 20,000 transactions would double the usability score for the user-error metric as long as no cutoff value has been set for user errors. Using geometric means rather than arithmetic means (the sum of the values divided by the number of values) somewhat alleviates this problem: The geometric mean gives comparatively less weight to uncommonly large numbers, and it increases more when all the usability metrics improve a little than when a single metric improves a lot and the others are stagnant.

Transforming subjective ratings to a ratio scale

To ensure uniform treatment of all subjective rating-scale data, I applied three transformations to the raw rating-scale scores:

(1) Rescale the various rating scales linearly to map onto the interval from -1 to +1, with 0 as the midpoint and +1 as the best rating. This transformation provides a uniform treatment of scales independent of their original scale intervals.

(2) Apply the arcsine function to the rescaled scores. This transformation compensates for the fact that rating scales are terminated at each end, making it harder for an average rating to get close to one end of the scale. For example, an average rating of 4.6 on a 1-to-5 scale might result from four users rating the system 4.0 and six users rating it 5.0. Now, even if all 10 users like some other system better and would like to increase their rating by one point, the second system would get an average rating of only 5.0 (an increase of 0.4 instead of 1.0, as deserved), because six users were already at the end of the scale. Because of this phenomenon, changes in ratings toward the end of a rating scale should be given extra weight. The arcsine function achieves this by stretching the ends of the interval.

(3) Apply the exponential function eˣ to the stretched scores to achieve numbers for which ratios are meaningful. Without such a transformation, a score of 0.2 would seem twice as good as a score of 0.1, whereas a score of 0.7 would be only 17 percent better than a score of 0.6, even though the improvement was the same in both cases. After the exponential transformation, a score of 0.2 is 10.5 percent better than a score of 0.1, and a score of 0.7 is also 10.5 percent better than a score of 0.6.

Admittedly, the choice of functions for these transformations is somewhat arbitrary. For example, an exponential function with a basis other than e would achieve the same effect, while resulting in different numbers. However, the purpose of my analysis is to compare relative ratings of the various interface iterations, and I applied the same transformations to each original rating. Nowhere do I use the value of a single transformed rating. I always look at the ratio between two equally transformed ratings, and all the transformations are monotonic. Also, an analysis of a larger body of subjective-satisfaction data indicated that these transformations result in reasonable statistical characteristics.¹¹
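Taken together, the normalization, the sidebar's rating transformations, and the geometric mean define the whole scoring pipeline. The following sketch is my reconstruction of that pipeline, not code from the article, and the function names are mine; as a check, it reproduces the 242 percent overall improvement reported for version 5 of the home-banking system in Tables 2 through 4 later in this article:

```python
import math

def transform_rating(score: float, best: float = 1.0, worst: float = 5.0) -> float:
    """Sidebar transformations: rescale the rating to [-1, +1] with +1
    as the best score, stretch the ends with arcsine, exponentiate."""
    rescaled = 2 * (score - worst) / (best - worst) - 1
    return math.exp(math.asin(rescaled))

def geometric_mean(factors):
    """nth root of the product of n improvement factors."""
    return math.prod(factors) ** (1 / len(factors))

# Version 1 versus version 5 of the home-banking system (Tables 2, 3).
factors = [
    1220 / 967,                                       # total task time
    transform_rating(1.67) / transform_rating(1.92),  # satisfaction
    9.2 / 1.5,                                        # errors per user
    2.56 / 0.17,                                      # catastrophes
]
print(f"{100 * (geometric_mean(factors) - 1):.0f}")   # 242, as in Table 4
```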

Case studies

Table 1 summarizes the four case studies in iterative user-interface design that I review here. I discuss the first example in somewhat greater depth than the others to show how the various usability metrics are measured and further refined into estimates of improvements in overall usability.

Table 1. Four case studies of iterative design.

Interface       Technology                                    Versions Tested   Overall Improvement (percent)
Home banking    Personal-computer graphical user interface    5                 242
Cash register   Specialized hardware with character-based     5                 87
                interface
Security        Mainframe character-based terminal            3                 882
Hypertext       Workstation graphical user interface          3                 41

Home banking. The home-banking system was a prototype of a system to let Danish bank customers access their accounts and other bank services from a home computer with a modem. The interface was explicitly designed to explore the possibilities for such a system with customers who own a personal computer supporting a graphical user interface and a mouse.

The prototype system was developed on a single personal-computer platform under the assumption that alternative versions for other personal computers with graphical user interfaces would use similar designs and have similar usability characteristics. This assumption was not tested. Prototyping aims at generating a running user interface much faster than standard programming. To do this, developers accept some limitations on code generality. One such prototyping limitation was system implementation on a stand-alone personal computer without actual on-line access to bank computers. Instead, the user interface provided access to a limited number of accounts and other predefined database information stored on the personal computer.

1. Basic tasks operating on the customer's own accounts
   • Find out the balance for all your accounts.
   • Transfer an amount from one of your accounts to the other.
   • Investigate whether a debit-card transaction has been deducted from the account yet.

2. Money transfers to accounts owned by others
   • Order an electronic funds transfer to pay your March telephone bill.
   • Set up a series of electronic funds transfers to pay installments for a year on a purchase of a stereo set.
   • Investigate whether the telephone-bill transfer has taken place yet.

3. Foreign exchange and other rare tasks
   • Order Japanese yen corresponding to the value of 2,000 Danish kroner.
   • Order 100 Dutch guilders.
   • Order an additional account statement for your savings account.

4. Special tasks
   • Given the electronic funds transfers you have set up, what will the balance be on August 12? (Do not calculate this manually; find out through the system.)
   • You returned the stereo set to the shop, so cancel the remaining installment payments.

Figure 2. User tasks to test the prototype home-banking system.

Table 2. Absolute values of the constituent measurements of the "task time" usability metric for the home-banking system (all times in seconds).

Version   Basic Tasks (own accounts)   Transfers to Others' Accounts   Foreign Exchange   Special Tasks   Total Task Time
1         199                          595                             245                182             1,220
2         404                          442                             296                339             1,480
3         211                          329                             233                317             1,090
4         206                          344                             225                313             1,088
5         206                          323                             231                208             967


The relevant usability attributes for this application include user task performance, the users' subjective satisfaction, and the number of errors users made. Task performance is less important for this application than for many others, as the users will not be paid employees. Nevertheless, users will be running up telephone bills and taking up resources on the central computer while accessing the system, so task performance is still of interest. Subjective satisfaction may be the most important usability attribute, as usage of a home-banking system is completely discretionary. Moreover, the reputation for being "user friendly" (or rather, "customer friendly") is important for a bank. Minimizing user errors is also important, but most important of all is the avoidance of usage catastrophes, defined as situations where the user does not complete a task correctly. The worst errors are those users make without realizing what they have done, and thus without correcting them.

Each test user was asked to perform specified tasks with the system, accessing some dummy accounts set up for the prototype. In general, it is important to pick benchmark tasks that span the system's expected use, since it would be easy for a design to score well on a single, limited task while being poor for most other tasks. Figure 2 lists the four sets of tasks.

Table 2 shows the measures of task time for each of the four task types measured in user testing of the home-banking system. The best way to calculate a score for task efficiency would be to weigh the measured times for the individual subtasks relative to the frequency with which users would be expected to perform those subtasks in actual system use. To do this perfectly, we need true, contextual field studies of how consumers actually access their bank accounts and other bank information. Unfortunately, it is rarely easy to use current field data to predict frequencies of use for features in a future system, because the system's introduction changes the way the features are used.

For the home-banking example, it is likely that most users would access their own accounts more frequently than they would perform foreign-currency tasks or the special tasks. On the other hand, some small-business users of this kind of system rely extensively on a foreign-exchange feature, so a proper weighting scheme is by no means obvious. In this article, I simply assume that all four task groups are equally important and thus have the same weight. This means that task efficiency is measured by the total task time, which is simply the sum of the times for the four task groups in Table 2.
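A small sketch makes the equal-weighting choice concrete. The frequency weights in the second half are hypothetical, purely to illustrate what a field-data-driven weighting would look like:

```python
# Task-group times (seconds) for version 3, from Table 2.
times = {"own accounts": 211, "transfers to others": 329,
         "foreign exchange": 233, "special tasks": 317}

# Equal weighting, as used in the article: total task time.
print(sum(times.values()))  # 1,090 seconds, matching Table 2

# Hypothetical frequency weights, not from the article:
weights = {"own accounts": 0.6, "transfers to others": 0.25,
           "foreign exchange": 0.1, "special tasks": 0.05}
print(f"{sum(times[t] * weights[t] for t in times):.0f}")  # 248
```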

To measure regular errors and catastrophes, an experimenter counted the number of each, observed as the users performed the specified tasks. The two-question questionnaire shown in Figure 3 measured users' subjective satisfaction after they performed the tasks. The overall subjective-satisfaction score was computed as the average of the numeric scores for the user's replies to these two questions. Table 3 lists the raw data for the subjective-satisfaction and error metrics.

1. How did you like using the bank system?
   Very pleasant (1)
   Somewhat pleasant (2)
   Neutral (3)
   Somewhat unpleasant (4)
   Very unpleasant (5)

2. If you had to perform a task that could be done with this system, would you prefer using the system or would you contact the bank in person?
   Definitely use the system (1)
   Likely to use the system (2)
   Don't know (3)
   Likely to contact the bank (4)
   Definitely contact the bank (5)

Figure 3. Questionnaire to measure test users' satisfaction with the home-banking interface.

Table 3. Absolute values of the usability parameters for the home-banking system. Subjective satisfaction was measured on a 1-to-5 scale, where 1 indicated the highest satisfaction.

Version   Subjective Satisfaction (1-5 scale)   Errors Made per User   Catastrophes per User
1         1.92                                  9.2                    2.56
2         1.83                                  4.3                    1.00
3         1.78                                  2.5                    0.44
4         1.86                                  1.5                    0.43
5         1.67                                  1.5                    0.17

The usability test showed that the first version of the user interface had serious usability problems. Both error rates and catastrophes were particularly high. Even so, users expressed a fairly high degree of subjective satisfaction, perhaps due to the pleasing appearance of graphical user interfaces compared with the types of banking interfaces they were used to.

A major problem was the lack of explicit error messages (the system just beeped when users made errors), which made it hard for users to learn from their mistakes. Another major problem was the lack of a help system. Also, many dialog boxes did not have "cancel" buttons, so users were trapped once they had chosen a command.* Several menu options had names that were difficult to understand and easy to confuse (for example, the two commands "Transfer money" and "Move an amount"). There were several inconsistencies. Users had to type in account numbers as digits only (without any dashes), even though the computer displayed lists of account numbers with dashes separating groups of digits.

*An anonymous referee of this article asked whether this example was real or contrived: "Would anyone really design a dialog box without a cancel button?" I can confirm that the example is real and that the designer indeed had forgotten this crucial dialog element in the first design. Not only did this supposedly trivial problem occur in this case study, it occurred again with another designer in another case I am studying. Also, several commercial applications have been released with similar problems.

To reduce the number of catastrophes, version 2 introduced confirming dialog boxes to let users check the system's interpretation of major commands before execution. For example, one such dialog read, "Each month we will transfer 2,000.00 kroner from your savings account to the Landlord Company, Inc. The first transfer will be made April 30, 1992. OK/Cancel." Also, written error messages in alert boxes replaced the beep, and a special help menu was introduced.

Version 1 had three main menus: Functions, Accounts, and Information. The first menu included commands for ordering account statements, initiating electronic funds transfers, and viewing a list of funds transfers. The second menu included commands for viewing an account statement, transferring money between the user's own accounts, and viewing a list of the user's accounts and their current balances. The third menu included commands for showing current special offers from the bank and accessing foreign-currency exchange rates.

The test showed that users did not understand the conceptual structure of this menu design but instead had to look through all three menus every time they wanted to activate a command. Therefore, the menu structure was completely redesigned for version 2, with only two main menus: Accounts (for all commands operating on the user's accounts) and Other (for all other commands).

The user test of version 2 showed that users had significantly fewer problems with finding commands and made fewer errors. At the same time, the help system was somewhat confusingly structured, and users wasted a fair amount of time reading help information they did not need to complete their current tasks. Table 3 shows that version 2 did succeed in its major goal of reducing the very high levels of user errors and catastrophes. However, Table 2 shows this improvement was made at the cost of slower task completion for most tasks. Transfers to others' accounts were faster in version 2 due to the elimination of errors when users typed account numbers the way they were normally formatted.

Version 3 had a completely restructured help system. Also, several of the error messages introduced in version 2 were rewritten to better inform users how to correct their errors.

To further reduce error rates, version 4 identified the user's own accounts by account type (for example, checking account or savings account) instead of just account number. Also, the dialog box for transferring money was expanded with an option to show the available balance for the account where the transfer was to originate. Furthermore, a minor change to the confirmation dialog box for electronic funds transfers prevented an error that occurred fairly frequently when users did not enter the correct date for transfers scheduled for future execution. As mentioned above, the confirming dialog normally stated the date on which the transfer was to be made. With version 4, when the transfer was specified to happen immediately, the text was changed to read, "The transfer will take place today." Using the word "today" rather than displaying a date made the message easy to understand and different from the message for future transfers. This iteration thus further modified an interface change that had been introduced in version 2.

In the final version, version 5, the designer completely restructured the help system for the second time. The help system introduced in version 3 was structured according to the menu commands in the system and let users get information about any command by selecting it from a list. The revised help system presented a single conceptual diagram of system-supported tasks, linking them with a list of available menu commands and providing further levels of information through a simple hypertext access mechanism.

In addition to the changes outlined here, the designer made many smaller changes to the user interface in each iteration. Few interface elements survived unchanged from version 1 to version 5.

Table 4 shows the normalized values for the four usability metrics as well as the overall usability computed as their geometric mean. Most usability metrics improved with each iteration, but there were also cases where an iteration scored worse than its predecessor on some metric. Many of these lower scores led to further redesigns in later versions. Also, many smaller iterative changes had to be further modified after observations of user testing, even if they were not large enough to cause measurable decreases in the usability metrics.

Table 4. Normalized improvements in usability parameters for the home-banking system. (Version 1 is the baseline and has zero improvement by definition. All numbers except version numbers are percentages.)

Version   Efficiency (inverse time on task)   Subjective Satisfaction with Dialog   Correct Use (inverse error frequency)   Catastrophe Avoidance   Overall Usability Improvement
1         0                                   0                                     0                                       0                       0
2         -18                                 6                                     114                                     156                     48
3         12                                  9                                     268                                     482                     126
4         12                                  4                                     513                                     495                     155
5         26                                  17                                    513                                     1,406                   242

Cash-register system. The cash-register system was a point-of-sales application for a chain of men's clothing stores in Copenhagen. A computerized cash register with a built-in alphanumeric screen let sales clerks sell specific merchandise with payment received as cash, debit card, credit card, check, foreign currencies converted at the current exchange rate, and gift certificates, as well as combinations of these. Clerks could also accept returned goods, issue gift certificates, and discount prices by a specified percentage.


The usability metrics of interest for this application included the time needed for the users to perform 17 typical tasks, subjective satisfaction, and errors made while completing the tasks. Furthermore, an important metric was the number of times the users requested help from a supervisor while performing the tasks. This metric was especially important because the users would often be temporary staff taken on for sales or holiday shopping seasons, when the shop would be very busy and such help requests would slow the selling process and inconvenience shoppers. Table 5 shows the values measured for these usability metrics for each of the five versions of the system.

Table 5. Absolute values of usability metrics for the cash-register system. A rating of 1 indicates the best possible subjective satisfaction.

Version   Time on Task (seconds)   Subjective Satisfaction (1-5 scale)   Errors per User During 17 Tasks   Help Requests per User During 17 Tasks
1         2,482                    2.38                                  0.0                               2.7
2         2,121                    1.58                                  0.1                               1.1
3         2,496                    1.62                                  0.2                               1.2
4         2,123                    1.45                                  0.2                               0.9
5         2,027                    1.69                                  0.0                               0.4

Table 6 shows the normalized improvements in usability for each iteration. Since the error rate was zero for the first version, any occurrence of errors would, in principle, constitute an infinite degradation in usability. However, as mentioned earlier, very small error rates (such as one error per 170 tasks, which was the measured error rate for version 2) are not really disastrous for an application like the one studied here. Hence, the table represents a 0.1 increase in the error rate as a 10 percent degradation in usability. Another number might have been more reasonable, but it turned out that the error rate for the last version was also zero, so the method chosen to normalize the error rates did not affect the conclusion about the overall improvement from the total iterative design process.

The main changes across the iterations were in the wording of the field labels, which were changed several times to make them more understandable. Also, some field labels were made context sensitive to better fit the user's task. Fields not needed for the user's current task temporarily disappeared from the screen. Several shortcuts were introduced to speed common tasks, and these were also changed over subsequent iterations.

Table 6. Normalized improvements in usability metrics for the cash-register system. (Version 1 is the baseline and has zero improvement by definition. All numbers except version numbers are percentages.)

Version   Efficiency (inverse time on task)   Subjective Satisfaction with Dialog   Correct Use (inverse error frequency)   Help (inverse help requests)   Overall Usability Improvement
1         0                                   0                                     0                                       0                              0
2         17                                  61                                    -10                                     145                            43
3         -1                                  56                                    -20                                     125                            29
4         17                                  77                                    -20                                     200                            49
5         22                                  49                                    0                                       575                            87
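In code form, the zero-baseline convention might look as follows; this is my reading of the rule just described, not a formula given in the article:

```python
def error_improvement(baseline: float, measured: float) -> float:
    """Normalized improvement (percent) for the error metric. With a
    zero-error baseline a ratio is undefined, so each 0.1 error per
    user is instead scored as a 10 percent degradation."""
    if baseline == 0:
        return -100 * measured if measured else 0.0
    return 100 * (baseline / measured - 1)

print(f"{error_improvement(0.0, 0.1):.0f}")  # -10, version 2 in Table 6
print(f"{error_improvement(0.0, 0.2):.0f}")  # -20, versions 3 and 4
print(f"{error_improvement(0.0, 0.0):.0f}")  # 0, version 5
```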
Security application. The security application⁷ was the sign-on sequence for remote access to a large data-entry and inquiry mainframe application for employees at the branch offices of a major computer company. Users already used their alphanumeric terminals to access an older version of the system. A main goal in the new system development was to ensure that the transition from old to new did not disrupt the branch-office users. Therefore, the most important usability attribute was the users' ability to successfully sign on to the remote mainframe without any errors. Other relevant attributes were the time needed for users to sign on and their subjective satisfaction.

Three usability metrics were measured:

• user performance, as the time needed to complete 12 sign-ons;
• success rate, as the proportion of users who could sign on error-free after the third attempt; and
• users' subjective attitudes toward the system, as the proportion of test users who believed the product was good enough to deliver without any further changes.

Table 7 shows the results for the three versions of the interface with test users who had experience with the old system but not with the new interface. The table shows substantial improvements on all three usability metrics. A further test with an expert user showed that the optimum time to complete a sign-on for the given system was 6 seconds. Test users were able to complete their 12th sign-on attempt in 7 seconds with the third iteration of the interface, indicating little room for further improvement without changing the system's fundamental nature. Table 8 shows the normalized improvements in usability for the security application over its three versions.

Table 7. Absolute values of usability metrics for three iterations of a computer security application in a major computer company.⁷

Version   Time to Complete Tasks (minutes)   Success Rate (percent)   Subjective Satisfaction (percent)
1         5.32                               20                       0
2         2.23                               90                       60
3         0.65                               100                      100

Table 8. Normalized improvements in usability metrics for the security application. (Version 1 is the baseline and has zero improvement by definition. All numbers except version numbers are percentages.)

Version   Time to Complete Tasks   Success Rate   Subjective Satisfaction   Usability Improvement
1         0                        0              0                         0
2         139                      350            488                       298
3         718                      400            2,214                     882
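The article does not spell out how the percentage-based satisfaction scores were normalized, but Table 8 is numerically consistent with the sidebar transformation applied to the proportions, with 100 percent as the best possible rating. A quick consistency check (my inference, not a documented formula):

```python
import math

def transform_percentage(p: float) -> float:
    """Sidebar transformation for a 0-100 percent score: rescale to
    [-1, +1], stretch with arcsine, exponentiate."""
    return math.exp(math.asin(p / 50 - 1))

base = transform_percentage(0)            # version 1 scored 0 percent
for p in (60, 100):                       # versions 2 and 3
    print(f"{100 * (transform_percentage(p) / base - 1):.0f}")
# Prints 488 and 2214, matching Table 8's satisfaction column.
```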
Hypertext system. The hypertext system¹² was an interface designed to facilitate access to large amounts of text in electronic form. Typical applications include technical reference manuals, on-line documentation, and scientific journal articles converted to on-line form. The system ran on a standard workstation using multiple windows and such graphical user-interface features as one that let the user click on a word in the text to find more information about the word.

For use with technical reference texts, the most important usability attributes were the search time for users to find the information they wanted and the search accuracy (the proportion of times they found the correct information). The interface developers measured both attributes by having test users answer 32 questions about a statistics software package using a hypertext version of its reference manual. Table 9 shows the results for the three versions of the hypertext system.

A hypertext system for technical reference texts will obviously have to compete with text presented in traditional printed books. The system developers therefore conducted a comparative test: They asked test users to perform the same tasks with the printed version of the manual.¹² The search time for the printed version was 5.9 minutes with a search accuracy of 64 percent, indicating that the manual was better than the initial version of the hypertext system. This example shows that iterative design should not simply try to improve a system with reference to itself, but should also aim at better usability characteristics than those of the available competition.

Table 10 shows the normalized improvements in usability for the hypertext system. Many of the changes from version 1 to version 2 encouraged users to use a proven search strategy, by making it both easier to use and more explicitly available. Version 3 introduced several additional changes, including an important shortcut that automatically performed a second step that users almost always did after a first step in the initial tests. Both improvements show how iterative design can mold the interface according to user strategies that do not become apparent until after testing has begun. Users always find new and interesting ways to use new computer systems, so it is not enough to rely on preconceived notions to make an interface usable.

Table 9. Absolute values of usability metrics for the hypertext system.¹² (A version numbering scheme starting with version 0 has been used in other discussions of this system.)

Version   Search Time (minutes)   Search Accuracy (percent)
1         7.6                     69
2         5.4                     75
3         4.3                     78

Table 10. Normalized improvements in usability metrics for the hypertext system. (Version 1 is the baseline and has zero improvement by definition. All numbers except version numbers are percentages.)

Version   Search Time   Search Accuracy   Usability Improvement
1         0             0                 0
2         41            9                 24
3         77            13                41


The median improvement in overall usability from the first to the last version for the case studies discussed here is 165 percent. A study of 111 pairwise comparisons of user interfaces found that the median difference between two interfaces compared in the research literature was 25 percent.¹¹ Given that the mean number of iterations in the designs discussed here was three, we might expect the improvement from first to last interface to be around 95 percent, based on an average improvement of 25 percent for each iteration. (Three iterations, each improving usability by 25 percent, would give an improvement from version 1 to version 4 of 95 percent rather than 75 percent, due to a compounding effect similar to that in bank-account interest calculations.) However, the measured improvement was larger, corresponding to an average improvement of 38 percent from one version to the next.

There is no fundamental conflict between the estimate of 25 percent average usability difference between interfaces compared in the research literature and the 38 percent difference between versions in iterative design. Picking the better of two proposed interfaces to perform the same task is actually a very primitive usability engineering method that does not let designers combine appropriate features from each design. In contrast, iterative design relies much more on the usability specialist's intelligence and expertise throughout the design process. It lets designers combine features from previous versions based on accumulated evidence of what works under what circumstances. Thus, it is not surprising that iterative design might yield a larger improvement between versions than that resulting from simply picking the better of two average design alternatives.

Of course, the 38 percent improvement between versions in iterative design is only a rough estimate based on four case studies. Furthermore, there is a large variability in the magnitude of usability improvement from case to case, so we should not expect a 38 percent improvement in every case. Also, we should not expect to sustain exponential improvements in usability across iterations indefinitely. With many iterations, usability improvements would probably follow a curve somewhat like the one shown in Figure 1, since it is probably not possible to achieve arbitrary improvements in usability simply by iterating sufficiently many times. An open research question is how much usability can be improved and how good an "ultimate user interface" can get, since practical development projects always stop iterating before achieving perfection.

The median improvement from version 1 to version 2 was 45 percent, whereas the median improvement from version 2 to version 3 was "only" 34 percent. In general, we can probably expect the greatest improvements from the first few iterations, as designers discover and remove usability catastrophes. I recommend continuing beyond the initial iteration: Designers sometimes introduce new usability problems in the attempt to fix the old ones. Also, user testing may turn up new interface strategies that need refinement through further iterations.

The projects reviewed in this article include several in which at least one usability-attribute score went down from one version to the next. Also, it was often necessary to redesign a user-interface element more than once to find a usable version. Three versions (two iterations) should be the minimum in an iterative design process. Of course, as in the cash-register case study, the third version may score worse than the second version, so designers cannot always follow a plan to stop after version 3. The actual results of iterative design and testing must guide decisions on how to proceed. ■
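The compounding arithmetic above is easy to check; the second calculation shows how the 38 percent per-iteration figure follows from the 165 percent median improvement over a mean of three iterations:

```python
# Three iterations at 25 percent each compound to 95 percent overall:
print(f"{100 * (1.25 ** 3 - 1):.0f}")        # 95, not 75

# Working backward from the measured median of 165 percent overall
# (an improvement factor of 2.65) across three iterations:
print(f"{100 * (2.65 ** (1 / 3) - 1):.0f}")  # about 38 per iteration
```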
Acknowledgments

I collected data on the home-banking and cash-register projects analyzed in this article while on the faculty of the Technical University of Denmark, and these systems do not represent Bellcore products. I thank my students Jens Rasmussen and Frederik Willerup for helping collect data on these projects. I also thank Susan T. Dumais, Michael C. King, and David S. Miller, as well as several anonymous Computer referees, for helpful comments on this article.

References

1. K.F. Bury, "The Iterative Development of Usable Computer Interfaces," Proc. IFIP Interact '84 Int'l Conf. Human-Computer Interaction, IFIP, Geneva, 1984, pp. 743-748.
2. W. Buxton and R. Sniderman, "Iteration in the Design of the Human-Computer Interface," Proc. 13th Ann. Meeting Human Factors Assoc. of Canada, 1980, pp. 72-81.
3. J.D. Gould and C.H. Lewis, "Designing for Usability: Key Principles and What Designers Think," Comm. ACM, Vol. 28, No. 3, Mar. 1985, pp. 300-311.
4. L. Tesler, "Enlisting User Help in Software Design," ACM SIGCHI Bull., Vol. 14, No. 3, Jan. 1983, pp. 5-9.
5. J. Nielsen, "The Usability Engineering Life Cycle," Computer, Vol. 25, No. 3, Mar. 1992, pp. 12-22.
6. J. Nielsen, Usability Engineering, Academic Press, San Diego, Calif., 1993.
7. C.-M. Karat, "Cost-Benefit Analysis of Iterative Usability Testing," Proc. IFIP Interact '90 Third Int'l Conf. Human-Computer Interaction, IFIP, Geneva, 1990, pp. 351-356.
8. M. Good et al., "User-Derived Impact Analysis as a Tool for Usability Engineering," Proc. ACM CHI '86 Conf. Human Factors in Computing Systems, ACM Press, New York, 1986, pp. 241-246.
9. W.D. Gray et al., "GOMS Meets the Phone Company: Analytic Modeling Applied to Real-World Problems," Proc. IFIP Interact '90 Third Int'l Conf. Human-Computer Interaction, IFIP, Geneva, 1990, pp. 29-34.
10. J.M. Carroll and R.L. Campbell, "Softening up Hard Science: Reply to Newell and Card," Human-Computer Interaction, Vol. 2, No. 3, 1986, pp. 227-249.
11. J. Nielsen and J. Levy, "Subjective User Preferences versus Objective Interface Performance Measures," to be published in Comm. ACM.
12. D.E. Egan et al., "Formative Design-Evaluation of SuperBook," ACM Trans. Information Systems, Vol. 7, No. 1, Jan. 1989, pp. 30-57.

Jakob Nielsen is a member of the Computer Sciences Department in Bellcore's applied research area. His research interests include usability engineering, hypertext, and next-generation interaction paradigms, with a particular emphasis on applied methodology. Nielsen is the author of the books Hypertext and Hypermedia and Usability Engineering. His previous affiliations include the IBM User Interface Institute in Yorktown Heights, N.Y., and the Technical University of Denmark in Copenhagen. An electronic business card with further information can be retrieved by sending any email message to the server at [email protected].

Readers can contact Nielsen at Bellcore, MRE 2P-370, 445 South Street, Morristown, NJ 07960; e-mail [email protected].
