Using Test-Driven Development in the Classroom: Providing Students with Automatic, Concrete Feedback on Performance

Stephen H. EDWARDS
Dept. of Computer Science, Virginia Tech
660 McBryde Hall, Mail Stop 0106
Blacksburg, VA 24061, USA
+1 540 231 5723

[email protected]

ABSTRACT

There is a need for better ways to teach software testing skills to computer science undergraduates, who are routinely underprepared in this area. This paper proposes the use of test-driven development in the classroom, requiring students to test their own code in programming assignments. In addition, an automated grading approach is used to assess student-written code and student-written tests together. Students receive clear, immediate feedback on the effectiveness and validity of their testing. This approach has been piloted in an undergraduate computer science class. Results indicate that students scored higher on their program assignments while producing code with 45% fewer defects per thousand lines of code.

Keywords: Computer science, software testing, test-first coding, programming assignments, automated grading

1. INTRODUCTION

Many computer science educators have been looking for an effective way to improve the coverage of software testing skills that undergraduates receive [13]. Rather than adding a single course on the subject, some have proposed systematically infusing testing concerns across the curriculum [6, 7, 8, 11, 12]. However, there is no clear consensus on how this goal is best achieved.

One approach is to require students to test their own code in programming assignments, and then assess them on this task as well as on the correctness of their code solution. Two critical issues immediately arise, however:

1. What testing approach should students use? The approach must provide practical benefits that students can see, and yet be simple enough to apply across the curriculum, well before students have received advanced software engineering experience.

2. How will students be assessed on testing tasks? In particular, if students must test their own code, and then be graded on both their code and their testing, how can we avoid doubling the grading workload of faculty and teaching assistants while also providing feedback frequently enough and specifically enough for students to improve their performance?

This paper proposes the use of test-driven development in the classroom. In conjunction, an automated grading strategy is used to assess student-written code and student-written tests together, providing clear and immediate feedback to students about the effectiveness and validity of their testing.

2. BACKGROUND

The goal is to teach software testing in a way that will encourage students to practice testing skills in many classes and give them concrete feedback on their testing performance, without requiring a new course, any new faculty resources, or a significant number of lecture hours in each course where testing will be practiced [3].

2.1 Why Test-driven Development?

Test-driven development (TDD) is a code development strategy that has been popularized by extreme programming [1, 2]. In TDD, one always writes a test case (or more) before adding new code. In fact, new code is only written in response to existing test cases that fail. By constantly running all existing tests against a unit after each change, and always phrasing operational definitions of desired behavior in terms of new test cases, TDD promotes a style of incremental development where it is always clear what behavior has been correctly implemented and what remains undone.

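To make this cycle concrete, here is a minimal sketch in Java with JUnit (the environment planned for our introductory sequence; see Section 5). The IntStack class and its test are invented for illustration and are not drawn from any course assignment: the test is written first, it fails against a missing or empty implementation, and only then is just enough code added to make it pass.

    import junit.framework.TestCase;

    // Step 1: write the test first.  It fails until IntStack exists and
    // push/top/size behave as the test describes.
    public class IntStackTest extends TestCase {
        public void testPushThenTop() {
            IntStack s = new IntStack();
            s.push(42);
            assertEquals(42, s.top());
            assertEquals(1, s.size());
        }
    }

    // Step 2: add only enough code to make the failing test pass; the next
    // failing test (e.g., popping an empty stack) drives the next addition.
    class IntStack {
        private final java.util.ArrayList<Integer> items = new java.util.ArrayList<Integer>();

        public void push(int value) { items.add(value); }
        public int top()            { return items.get(items.size() - 1); }
        public int size()           { return items.size(); }
    }

Each subsequent feature begins the same way: a new failing test, then the small piece of code that satisfies it.
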
While TDD is not, strictly speaking, a testing strategy—it is a code development strategy [1]—it is a practical, concrete technique that students can practice on their own assignments. Most importantly, TDD provides visceral benefits that students experience for themselves. It is applicable on small projects with minimal training. It gives the programmer a great degree of confidence in the correctness of his or her code. It encourages students to always have a running version of what they have completed so far. Finally, it encourages students to test features and code as they are implemented. This preempts the "big bang" integration problems that students often run into when they work feverishly to write all the code for a large assignment, and only then try to run, test, and debug it.

2.2 Prior Approaches to Automated Grading

Without considering testing practices, CS educators have developed many approaches to automatically assessing student program assignments [4, 5, 9]. While such automated grading systems vary, they typically focus on compilation and execution of student programs against some form of instructor-provided test data. Virginia Tech has been using a similar automated grading system for student programs for more than six years and has seen powerful results. Virginia Tech's system, which is similar in principle to most systems that have been described, is called the Curator.

A student can log in to the Curator and submit a solution for a programming assignment. When the solution is received, the Curator compiles the student program. It then runs a test data generator provided by the instructor to create input for grading the submission. It also uses a reference implementation provided by the instructor to create the expected output. The Curator then runs the student's submission on the generated input, and grades the results by comparing against the reference implementation's output. The student then receives feedback in the form of a report that summarizes the score, and that includes the input used, the student's output, and the instructor's expected output for reference.

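The sketch below illustrates the core of such a generate-run-and-compare grading step. It is not the Curator's actual implementation; the program names, the input file, and the simple line-by-line scoring rule are placeholders chosen only for illustration.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Minimal generate-run-and-compare grading step: run a reference
    // implementation and a student program on the same generated input,
    // then score the submission by comparing their outputs line by line.
    public class DiffGrader {

        // Run one program on the given input file and capture its standard output.
        static List<String> runOn(List<String> command, Path inputFile)
                throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectInput(inputFile.toFile());
            pb.redirectError(ProcessBuilder.Redirect.INHERIT);
            Process p = pb.start();
            List<String> output = new ArrayList<>();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    output.add(line);
                }
            }
            p.waitFor();
            return output;
        }

        public static void main(String[] args) throws Exception {
            Path input = Paths.get("generated-input.txt");   // produced by an instructor-provided generator
            List<String> expected = runOn(Arrays.asList("./reference"), input);  // instructor's reference implementation
            List<String> actual   = runOn(Arrays.asList("./student"), input);    // student submission

            // Score the submission as the fraction of expected output lines reproduced in order.
            int matches = 0;
            for (int i = 0; i < expected.size(); i++) {
                if (i < actual.size() && expected.get(i).equals(actual.get(i))) {
                    matches++;
                }
            }
            double score = expected.isEmpty() ? 100.0 : 100.0 * matches / expected.size();
            System.out.printf("Output correctness: %.1f%%%n", score);
        }
    }
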
In practice, such automated grading tools have been extremely successful in classroom use. Automated grading is a vital tool in providing quality assessment of student programs as enrollments increase. Further, by automating the process of assessing program behavior, TAs and instructors can spend their grading effort on assessing design, style, and documentation issues. In addition, instructors usually allow multiple submissions for a given program. This allows a student to receive immediate feedback on the performance of his or her program, and then have an opportunity to make corrections and resubmit before the due deadline.

2.3 Challenges

Despite its classroom utility, an automatic grading strategy like the one embodied in the Curator also has a number of shortcomings:

• Students focus on output correctness first and foremost; all other considerations are a distant second at best (design, commenting, appropriate use of abstraction, testing one's own code, etc.). This is due to the fact that the most immediate feedback students receive is on output correctness, and also that the Curator will assign a score of zero to submissions that do not compile, do not produce output, or do not terminate.

• Students are not encouraged or rewarded for performing testing on their own.

• In practice, students do less testing on their own.

This last point is disturbing; in fact, many students rarely or never perform serious testing of their own programs when the Curator is used. This is understandable, since the Curator already has a test data generator for the problem and will automatically send the student the results of running tests on his or her program. Indeed, one of the biggest complaints from students has to do with the form of the feedback, which currently requires the student to do some work to figure out the source of the error(s) revealed.

3. WEB-CAT: A TOOL FOR AUTOMATICALLY ASSESSING STUDENT PROGRAMS

In order to consider classroom use of TDD practical, the challenges faced by existing automated grading systems must be addressed. Web-CAT, the Web-based Center for Automated Testing, is a new tool that grades student code and student tests together. Most importantly, the assessment approach embodied in this tool is based on the belief that a student should be given the responsibility of demonstrating the correctness of his or her own code.

3.1 Assessing TDD Assignments

In order to provide appropriate assessment of testing performance and appropriate incentive to improve, Web-CAT should do more than just give some sort of "correctness" score for the student's code. In addition, it should assess the validity and the completeness of the student's tests. Web-CAT grades assignments by measuring three scores: a test validity score, a test completeness score, and a code correctness score.

First, the test validity score measures how many of the student's tests are accurate—consistent with the problem assignment. This score is measured by running those tests against a reference implementation provided by the instructor to confirm that the student's expected output is correct for each test case.

Second, the test completeness score measures how thoroughly the student's tests cover the problem. One method to assess this aspect of performance is to use the reference implementation provided by the instructor as a surrogate representation of the problem. By instrumenting this reference implementation to measure the code coverage achieved by the student tests, a score can be measured. In our initial prototype, this strategy was used and branch coverage (basis path coverage) served as the test completeness score. Other measures are also possible.

Third, the code correctness score measures how "correct" the student's code is. To empower students in their own testing capabilities, this score is based solely on how many of the student's own tests the submitted code can pass. No separate test data is provided by the instructor or teaching assistant. The reasoning behind this decision is that, if the student's test data is both valid (according to the instructor's reference implementation) and complete (also according to the reference), then it must do a good job of exercising the features of the student program.

To combine these three measures into one score, a simple formula is used. All three measures are taken on a 0%–100% scale, and the three components are simply multiplied together. As a result, the score in each dimension becomes a "cap" for the overall score—it is not possible for a student to do poorly in one dimension but do well overall. Also, the effect of the multiplication is that a student cannot accept so-so scores across the board. Instead, near-perfect performance in at least two dimensions should become the expected norm for students.

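A minimal sketch of this multiplicative scoring rule, using invented component values, shows how a weak score in any one dimension caps the overall result:

    // Combine the three Web-CAT measures, each on a 0.0-1.0 scale, by
    // multiplication: a weak score in any dimension caps the overall score.
    public class ScoreCombiner {

        static double overallScore(double testValidity, double testCompleteness, double codeCorrectness) {
            return testValidity * testCompleteness * codeCorrectness;
        }

        public static void main(String[] args) {
            // Invented example: strong validity and correctness, weak coverage.
            double validity     = 1.00;  // every student test agrees with the reference implementation
            double completeness = 0.60;  // student tests reach 60% branch coverage of the reference
            double correctness  = 0.95;  // student code passes 95% of the student's own tests

            // 1.00 * 0.60 * 0.95 = 0.57, so incomplete testing drags the overall
            // grade down even though the code passes nearly all of its own tests.
            System.out.printf("Overall: %.0f%%%n",
                    100 * overallScore(validity, completeness, correctness));
        }
    }

Running the example prints an overall score of 57%, even though the submitted code passes 95% of the student's own tests.
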
To support the rapid cycling between writing individual tests and adding small pieces of code, the Web-CAT Grader allows unlimited submissions from students up until the assignment deadline. Students can get feedback any time, as often as they wish. However, because their score is based in part on the tests they have written, and their program's performance is assessed only by those tests, students must write test cases in order to find out more about errors in their own programs.

3.2 Providing Feedback to Students

The Web-CAT Grader uses a web interface for student submissions and for reporting. The feedback provided to students was inspired by JUnit's GUI TestRunner: "when the bar is green the code is clean" [10]. Figure 1 shows a sample screen shot of the results viewed by students after a submission. The three component scores and the final cumulative score are graphically summarized in two bars. The first bar assesses the program, showing the percentage of test cases passed. Since this bar is static, rather than the dynamic progress bar in the JUnit TestRunner, part of the bar is shown in green and part in red, based on the proportion of student-provided tests that have been passed. The second bar assesses the student's test suite. The size of the bar reflects the degree of coverage achieved by the test suite. The color of the bar indicates the validity of the test suite: green means all tests were valid, and red means at least one test's expected output disagreed with the reference implementation.

Figure 1: Web-CAT feedback on a program assignment submission.

Below the bar graph summary, the details of the test runs are printed in a format similar to the one produced by JUnit's text output TestRunner. First, the student program's test execution information is presented. Each failed test case is specifically identified with a descriptive message, so the student can determine where to go next to find the problem. Second, the reference implementation's test execution information is presented in the same format. Any "failed" test cases in this section indicate student-written test cases that have incorrect expected output.

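The rules behind the two summary bars in Figure 1 can be stated operationally, as in the sketch below. Only those rules (green/red split by the proportion of student tests passed, bar length from coverage, bar color from validity) come from the description above; the ASCII rendering is purely illustrative and is not how Web-CAT actually draws the bars.

    // Derive the two Figure 1 style summary bars from the component scores.
    // The ASCII rendering (G = green, R = red) is purely illustrative.
    public class FeedbackBars {

        // First bar: program assessment, split green/red by the fraction of
        // student-written tests that the submission passes.
        static String programBar(int testsPassed, int totalTests, int width) {
            int green = totalTests == 0 ? 0
                    : (int) Math.round((double) testsPassed / totalTests * width);
            return "[" + "G".repeat(green) + "R".repeat(width - green) + "]";
        }

        // Second bar: test-suite assessment, length from coverage, color from validity.
        static String testSuiteBar(double coverage, boolean allTestsValid, int width) {
            int length = (int) Math.round(coverage * width);
            String color = allTestsValid ? "G" : "R";  // green only if every test matched the reference
            return "[" + color.repeat(length) + " ".repeat(width - length) + "]";
        }

        public static void main(String[] args) {
            System.out.println("Program:    " + programBar(18, 20, 40));
            System.out.println("Test suite: " + testSuiteBar(0.85, true, 40));
        }
    }
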
4. PRELIMINARY EVALUATION OF WEB-CAT

To evaluate the practicality of this approach, it was tried out in an upper-division undergraduate computer science course. The course used for this evaluation was CS 3304: "Comparative Languages," a typical undergraduate programming languages course. Students in the course normally write four program assignments throughout the semester, each requiring two to three weeks to complete. The basic evaluation strategy was to provide basic instruction in TDD and employ Web-CAT for grading all programming assignments, and then compare student performance with that achieved in a past offering of the course when TDD was not used.

4.1 Method

In the Spring 2003 semester, 59 students completed the Comparative Languages course, using Web-CAT to submit all programming assignments. The Spring 2001 offering of the course was chosen for comparison, so the original programming assignments from that semester were given again in Spring 2003. In Spring 2001, students did not use TDD and instead used the older automated grading system, Virginia Tech's Curator, when submitting assignments. Fortunately, an electronic archive of all submissions made during that semester was available for detailed analysis. Web-CAT also maintained a detailed archive of new submissions for comparison. A total of 59 students completed the course in Spring 2001, for a total of 118 subjects split between the two treatments.

Unfortunately, while TDD practices are strongly supported in many object-oriented languages, students in the Comparative Languages course write programs using a number of other programming paradigms, including procedural programming, functional programming, and logic programming, in languages like Pascal, Scheme, and Prolog. Since no generally available TDD tools exist for these languages, a simple and easy-to-use infrastructure for writing and executing TDD-oriented test cases was developed. Students were given approximately 30 minutes of classroom instruction on TDD, how to write test cases, how to execute tests, and how to interpret results. Further, when example programs were developed "live" in class sessions later in the semester, the same TDD infrastructure was used by the instructor to model expected behavior properly.

4.2 Results

After assignments were turned in, the final submission of each student in both semesters was analyzed. This analysis was restricted to the first programming assignment (in Pascal) due to manpower limitations.

Table 1: Comparing program submissions between groups; all differences are significant at α = 0.05.

                     With TDD (2003)    Without TDD (2001)
    Web-CAT Score         94.0%               76.8%
    Code Coverage         93.6%               90.0%
    Defects/KSLOC           38                  70

Table 1 summarizes the results obtained when comparing the program submissions between the two groups of students. Because Web-CAT and the older Curator system use different grading approaches, the Spring 2001 submissions were also submitted through Web-CAT for scoring. In Spring 2001, however, students did not write test cases. Rather than using a fixed set of instructor-provided test data, the 2001 programs were graded using a test data generator provided by the instructor. This generator produced a random set of 40 test cases for each new submission, providing broad coverage of the entire problem. To re-score each 2001 submission using Web-CAT, the generator-produced test cases originally used to grade that submission in 2001 were submitted alongside it, as if they had been written by the student. As shown in Table 1, students in 2003 scored significantly higher than students in 2001.

This should be no surprise, since students using Web-CAT received specific feedback on the quality and thoroughness of their testing effort, information that was unavailable to the other group of students. By seeing a direct measure of the coverage produced by their test suite, and having unlimited opportunities to add test cases and try to improve, students using Web-CAT were able to increase their coverage scores. As shown in Table 1, the branch coverage scores achieved by student-written test suites in 2003 were significantly higher than the code coverage scores achieved by the random test data generator used for automated grading in 2001. Since students in 2003 did not have access to the reference implementation, they could only increase their coverage scores by creatively "guessing" what kinds of behavior or features they should test, either by looking at the assignment specification or by looking at their own solution more closely. In effect, this simple feedback mechanism implicitly encouraged students to practice and develop their skills at writing good black box test cases.

Finally, the student programs were analyzed to uncover the bugs they contained. One of the most common ways to measure bugs is to assess defect density, that is, the average number of defects (or bugs) contained in every 1000 non-commented source lines of code (KSLOC). On large projects, defect density data can often be collected by analyzing bug tracking databases. For student programs, however, measuring defects can be more difficult.

To provide a uniform treatment in this experiment, a comprehensive test suite was developed for analysis purposes. A suite that provided 100% condition/decision coverage on the instructor's reference implementation was the starting point. Then all test suites submitted by 2003 students and all randomly generated suites used to grade 2001 submissions were inspected, and all non-duplicating test cases from this collection were added to the comprehensive suite. For this experiment, two test cases are "duplicating" if each program in each of the student groups produces the same result (pass or fail) on both test cases. Non-duplicating test cases are thus "independent" for at least one program under consideration, but may provide redundant coverage for others. Once the comprehensive test suite was constructed, every program under consideration was run against it.

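A small sketch of this de-duplication rule, assuming the pass/fail verdicts have already been collected into a matrix of programs by test cases (the data layout and names are invented): a test case is kept as distinct only if at least one program under consideration distinguishes it from every test case already kept.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Keep only "non-duplicating" test cases: a candidate is dropped when some
    // already-kept test case receives the identical pass/fail verdict from every
    // program under consideration.
    public class SuiteDeduplicator {

        // results[p][t] is true when program p passes test case t.
        static List<Integer> nonDuplicatingTests(boolean[][] results) {
            int numTests = results.length == 0 ? 0 : results[0].length;
            List<Integer> kept = new ArrayList<>();
            Set<String> seenVerdictColumns = new HashSet<>();
            for (int t = 0; t < numTests; t++) {
                StringBuilder column = new StringBuilder();
                for (boolean[] programRow : results) {
                    column.append(programRow[t] ? '1' : '0');
                }
                // A new verdict pattern means at least one program distinguishes
                // this test case from every test case kept so far.
                if (seenVerdictColumns.add(column.toString())) {
                    kept.add(t);
                }
            }
            return kept;
        }

        public static void main(String[] args) {
            boolean[][] results = {
                { true, true, false },   // program 0
                { true, true, false },   // program 1
            };
            // Tests 0 and 1 get identical verdicts from both programs, so test 1
            // is dropped as a duplicate of test 0.
            System.out.println(nonDuplicatingTests(results));   // prints [0, 2]
        }
    }
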
While the resulting numbers capture the relative number of defects in programs, they do not represent defect density. To get defect density information, 18 programs were selected, 9 from each group. These programs had all comments and blank lines stripped from them. They were then debugged by hand, making the minimal changes necessary to achieve a 100% pass rate on the comprehensive test suite. The total number of lines added, changed, or removed, normalized by the program length, was then used as the defects per KSLOC measure for that program. A linear regression was performed to look for a relationship between the defects/KSLOC numbers and the raw number of test cases failed from the comprehensive test suite in this sample population. This produced a correlation significant at the 0.05 level, which was then used to estimate the defects/KSLOC for the remaining programs in the two student groups.

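As a sketch of that estimation step, with invented sample numbers and a hand-rolled least-squares fit standing in for whatever statistics tool was actually used: fit defects/KSLOC against the number of comprehensive-suite failures on the hand-debugged sample, then apply the fitted line to the remaining programs.

    // Estimate defects/KSLOC for the programs that were not hand-debugged from a
    // least-squares line fitted on the hand-debugged sample: y = defects/KSLOC,
    // x = number of comprehensive-suite test cases failed.  Sample data invented.
    public class DefectDensityEstimator {

        // Ordinary least-squares fit; returns { slope, intercept }.
        static double[] fitLine(double[] x, double[] y) {
            int n = x.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += x[i];
                sy += y[i];
                sxx += x[i] * x[i];
                sxy += x[i] * y[i];
            }
            double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            double intercept = (sy - slope * sx) / n;
            return new double[] { slope, intercept };
        }

        public static void main(String[] args) {
            // Hand-debugged sample: comprehensive-suite failures vs. measured defects/KSLOC.
            double[] failures     = { 2, 5, 9, 14, 20, 31 };
            double[] defectsKsloc = { 12, 25, 38, 55, 71, 110 };
            double[] line = fitLine(failures, defectsKsloc);

            // Predict the density for a program that was not debugged by hand.
            double predicted = line[0] * 11 + line[1];
            System.out.printf("Estimated defects/KSLOC for 11 failed tests: %.1f%n", predicted);
        }
    }
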
Table 1 summarizes the results of this analysis, which show that students who used TDD and Web-CAT submitted programs containing approximately 45% fewer defects per 1000 lines of code. While the defects/KSLOC rates shown here are far above industrial values, which are often cited as around 4 or 5 defects/KSLOC, this is to be expected for student-quality code developed with no process control and no independent testing.

While the results summarized in Table 1 indicate that students do produce higher quality code using this approach, it is also important to consider how students react to TDD and Web-CAT. The 2003 students completed an anonymous survey designed to elicit their perceptions of both the process and the prototype tool. All students in the Spring 2003 semester had used an automated grading/submission system before (the Curator).

Students expressed a strong preference for Web-CAT over their past experiences. Specifically, they found that Web-CAT was more helpful at detecting errors in their programs than the Curator (89.8% agree or strongly agree). In addition, they believed it provided excellent support for TDD (83.7% agree or strongly agree).

Students also expressed a strong preference for the benefits provided by TDD. Using TDD increases the confidence that students have in the correctness of their code (65.3% agree or strongly agree). Using TDD also increases the confidence that students have when making changes to their code (67.3% agree or strongly agree). Finally, most students would like to use Web-CAT and TDD for program assignments in future classes, even if it were not required for that course (73.5% agree or strongly agree).

5. CONCLUSIONS AND FUTURE WORK

Preliminary experience with TDD in the classroom and with automated assessment is very positive, indicating a significant potential for increasing the quality of student code. We plan to apply this technique in our introductory programming sequence, where students will program in Java and use JUnit [10]. We also plan to extend the empirical comparison of student programs to this larger group of students. Further, we plan to adapt the assessment approach used to measure test completeness so that the tool can provide specific, directed feedback to students about how and where they can improve the completeness of their testing efforts.

ACKNOWLEDGMENTS

I gratefully acknowledge the contributions to this work provided by Anuj Shah, Amit Kulkarni, and Gaurav Bhandari, who implemented many of the features of the automated grading and feedback system described here.

REFERENCES

[1] Beck, K. Aim, fire (test-first coding). IEEE Software, 18(5): 87-89, Sept./Oct. 2001.

[2] Beck, K. Test-Driven Development: By Example. Addison-Wesley, Boston, MA, 2003.

[3] Goldwasser, M.H. A gimmick to integrate software testing throughout the curriculum. In Proc. 33rd SIGCSE Technical Symp. Computer Science Education, ACM, 2002, pp. 271-275.

[4] Isong, J. Developing an automated program checker. J. Computing in Small Colleges, 16(3): 218-224.

[5] Jackson, D., and Usher, M. Grading student programs using ASSYST. In Proc. 28th SIGCSE Technical Symp. Computer Science Education, ACM, 1997, pp. 335-339.

[6] Jones, E.L. Software testing in the computer science curriculum—a holistic approach. In Proc. Australasian Computing Education Conf., ACM, 2000, pp. 153-157.

[7] Jones, E.L. Integrating testing into the curriculum—arsenic in small doses. In Proc. 32nd SIGCSE Technical Symp. Computer Science Education, ACM, 2001, pp. 337-341.

[8] Jones, E.L. An experiential approach to incorporating software testing into the computer science curriculum. In Proc. 2001 Frontiers in Education Conf. (FiE 2001), 2001, pp. F3D7-F3D11.

[9] Jones, E.L. Grading student programs—a software testing approach. J. Computing in Small Colleges, 16(2): 185-192.

[10] JUnit Home Page. Web page last accessed Mar. 21, 2003.

[11] McCauley, R., Archer, Dale, N., Mili, R., Robergé, J., and Taylor, H. The effective integration of the software engineering principles throughout the undergraduate computer science curriculum. In Proc. 26th SIGCSE Technical Symp. Computer Science Education, ACM, 1995, pp. 364-365.

[12] McCauley, R., Dale, N., Hilburn, T., Mengel, S., and Murrill, B.W. The assimilation of software engineering into the undergraduate computer science curriculum. In Proc. 31st SIGCSE Technical Symp. Computer Science Education, ACM, 2000, pp. 423-424.

[13] Shepard, T., Lamb, M., and Kelly, D. More testing should be taught. Communications of the ACM, 44(6): 103-108, June 2001.