
arXiv:2007.10744v3 [cs.SE] 8 Sep 2020

Beyond Accuracy: Assessing Software Documentation Quality

Christoph Treude, University of Adelaide, Adelaide, SA, Australia, [email protected]
Justin Middleton, North Carolina State University, Raleigh, NC, United States, [email protected]
Thushari Atapattu, University of Adelaide, Adelaide, SA, Australia, [email protected]

ABSTRACT
Good software documentation encourages good software engineering, but the meaning of "good" documentation is vaguely defined in the software engineering literature. To clarify this ambiguity, we draw on work from the data and information quality community to propose a framework that decomposes documentation quality into ten dimensions of structure, content, and style. To demonstrate its application, we recruited technical editors to apply the framework when evaluating examples from several genres of software documentation. We summarise their assessments—for example, reference documentation and README files excel in quality whereas blog articles have more problems—and we describe our vision for reasoning about software documentation quality and the potential expansion of a unified quality framework.

CCS CONCEPTS
• Software and its engineering → Documentation; • Social and professional topics → Quality assurance.

KEYWORDS
software documentation, quality

1 INTRODUCTION AND MOTIVATION
High-quality software documentation is crucial for software development, comprehension, and maintenance, but the ways that software documentation can suffer poor quality are numerous. For example, Aghajani et al. [1] have designed a taxonomy of 162 documentation issue types, covering information content, presentation, process-related matters, and tool-related matters. While most work has focused on documentation accuracy and completeness (e.g., for API documentation [19]), the fact that many developers are reluctant to carefully read documentation [20] suggests that documentation suffers from issues beyond accuracy. As such, we lack a comprehensive framework and instruments for assessing software documentation quality beyond these characteristics.

Additionally, because software documentation is written in natural language, which is inherently ambiguous, imprecise, unstructured, and complex in syntax and semantics, its quality can often only be evaluated manually [3, 6]. The assessment depends on context for some quality attributes (e.g., task orientation, usefulness), it mixes content and medium for others (e.g., retrievability, visual effectiveness), and sometimes it requires looking beyond the documentation itself (e.g., accuracy, completeness).

In this work we adapt quality frameworks from the data and information quality community to the software engineering domain. We design a survey instrument to assess software documentation quality from different sources. A pilot study with four technical editors and 41 documents related to the programming language R provides initial evidence of the strengths and weaknesses of different genres of documentation (blog articles, reference documentation, README files, Stack Overflow threads, tutorials) based on the ten dimensions of our software documentation quality framework.
The contributions of this work are:

• A ten-dimensional framework for asking questions about software documentation quality,
• A partially validated survey instrument to evaluate document quality over multiple documentation genres, and
• A vision for the expansion of a unified quality framework through further experimentation.

The most related piece of work to this paper is the seminal 1995 article "Beyond Accuracy: What Data Quality Means to Data Consumers" by Wang and Strong [17]. We follow the same approach for the domain of software documentation.

2 BACKGROUND AND RELATED WORK
Defective Software Documentation. Defect detection tools have been widely investigated at the code level, but very few studies focus on defects at the document level [22]. The existing approaches in the documentation space investigate code and documentation inconsistencies. In one of the first such attempts, Tan et al. [16] presented @tcomment for testing Javadoc comments related to null values and exceptions. DocRef by Zhong and Su [19] detects API documentation errors by seeking out mismatches between code names in natural-language documentation and code. AdDoc by Dagenais and Robillard [3] automatically discovers documentation patterns, which are defined as coherent sets of code elements that are documented together. Also aimed at inconsistencies between code and documentation, Ratol and Robillard [15] presented Fraco, a tool to detect source code comments that are fragile with respect to identifier renaming. Wen et al. [18] presented a large-scale empirical study of code-comment inconsistencies, revealing causes such as deprecation and refactoring. Zhou et al. [21] contributed a line of work on detecting defects of API documents with techniques from program comprehension and natural language processing. They presented DRONE to automatically detect directive defects and recommend solutions to fix them [22]. And in recent work, the approach proposed by Panthaplackel et al. [8] learns to correlate changes across two distinct language representations to generate a sequence of edits that are applied to existing comments to reflect code modifications.

To the best of our knowledge, little of the existing work targets the assessment or improvement of the quality of software documentation beyond accuracy and completeness; one example is Plösch and colleagues' survey on what developers value in documentation [13]. We turn to related work from the data and information quality community to further fill this gap.
Information Quality. In their seminal 1995 work "Beyond Accuracy: What Data Quality Means to Data Consumers", Wang and Strong [17] conducted a survey to generate a list of data quality attributes that capture data consumers' perspectives. These perspectives were grouped into accuracy, relevancy, representation, and accessibility, and they contained a total of 15 items such as believability, reputation, and ease of understanding. A few years later, Eppler [4] published a similar list of dimensions which contained 16 dimensions such as clarity, conciseness, and consistency.

These two sets of dimensions formed the starting point of our investigation of dimensions from the data and information quality community which might be applicable to software documentation. Owing to the fact that software documentation tends to be disseminated over the Internet, we further included the work of Knight and Burn [7], who discussed the development of a framework for assessing information quality on the World Wide Web and compiled a list of 20 common dimensions for information and data quality from previous work, including accuracy, consistency, and timeliness.

3 SOFTWARE DOCUMENTATION QUALITY
Inspired by the dimensions of information and data quality identified in related work, we designed the software documentation quality framework summarised in Table 1. The first and last author of this paper collaboratively went through the dimensions from related work cited in Section 2 to select those that (i) could apply to software documentation, (ii) do not depend on the logistics of accessing the documentation (e.g., security), and (iii) can be checked independently of other artefacts (i.e., not accuracy or completeness). As a result, we omitted dimensions such as relevancy and timeliness from Wang and Strong's work [17] since they require context beyond the documentation itself (e.g., relevancy cannot be assessed without a specific task in mind); and we omitted dimensions such as interactivity and speed from Eppler's work [4] which are characteristics of the medium rather than the content. Table 1 indicates the sources from related work for each dimension.

In addition, we formulated questions that an assessor can answer about a piece of software documentation to indicate its quality in each dimension. Our pilot study in Section 4 provides initial evidence that technical editors are able to answer these questions and shows interesting trends across genres of documentation. The questions were inspired by work on readability by Pitler and Nenkova [12], who collected readability ratings using four questions: How well-written is this article? How well does the text fit together? How easy was it to understand? How interesting is this article? Our framework contains the same questions, and complements them with one question each for the additional six dimensions.
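As an illustration, the ten dimensions and their questions can be encoded directly as a small survey instrument. The following Python sketch is illustrative only: the names are hypothetical, and it simply pairs each dimension of Table 1 with its question and validates ratings on the 1 (low) to 10 (high) scale used in the pilot study (Section 4).

```python
# Illustrative sketch only: the ten dimensions and questions of the framework
# (Table 1), encoded as a minimal survey instrument. Names are hypothetical.

QUESTIONS = {
    "Quality": "How well-written is this document (e.g., spelling, grammar)?",
    "Appeal": "How interesting is it?",
    "Readability": "How easy was it to read?",
    "Understandability": "How easy was it to understand?",
    "Structure": "How well-structured is the document?",
    "Cohesion": "How well does the text fit together?",
    "Conciseness": "How succinct is the information provided?",
    "Effectiveness": "Does the document make effective use of technical vocabulary?",
    "Consistency": "How consistent is the use of terminology?",
    "Clarity": "Does the document contain ambiguity?",
}


def record_assessment(ratings: dict) -> dict:
    """Validate one editor's assessment of one document.

    `ratings` maps each dimension to an integer on the 1 (low) to 10 (high)
    scale; all ten dimensions must be present.
    """
    missing = set(QUESTIONS) - set(ratings)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dimension, value in ratings.items():
        if dimension not in QUESTIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        if not (1 <= int(value) <= 10):
            raise ValueError(f"{dimension}: rating {value} outside 1-10 scale")
    return {dim: int(ratings[dim]) for dim in QUESTIONS}
```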
4 PILOT STUDY
To gather preliminary evidence on whether experts would be able to use the proposed framework for assessing software documentation, we conducted a pilot study with technical editors and different genres in software documentation.

Recruitment. To recruit technical editors, we posted an advertisement on Upwork (https://www.upwork.com/), a website for freelance recruitment and employment. Our posting explained our goal of evaluating software documentation on ten dimensions and identifying weaknesses in their design. Therefore, we required applicants to be qualified as technical editors with programming experience, and accordingly, we would compensate them hourly according to their established rates. From this, we recruited the first four qualified applicants with experience in technical writing and development. We refer to our participants henceforth as E1 through E4.

Data Collection. Because developers write software instructions in a variety of contexts, we use a broad definition of documentation, accepting both official, formal documentation and implicit documentation that emerges from online discussion. We focus on the genres of software documentation that have been the subject of investigation in previous work on software documentation: reference documentation [5], README files [14], tutorials [11], blogs [10], and Stack Overflow threads [2]. To better control for variation between language, we focused exclusively on resources documenting the R programming language or projects built with it. All documentation was then randomly sampled from the following sources:

• Reference Documentation (RD): from the 79 subsections of the R language manual (https://cran.r-project.org/doc/manuals/r-release/R-intro.html).
• README files (R): from the first 159 items in a list of open-source R projects curated by GitHub users (https://github.com/qinwf/awesome-R; the remaining items were not software projects).
• Tutorials (T): from the 46 R tutorials on tutorialspoint.com (https://www.tutorialspoint.com/r/index.htm).
• Articles (A): from 1,208 articles posted on R-bloggers.com (https://www.r-bloggers.com/), a blog aggregation website, within the last six months.
• Stack Overflow threads (SO): from questions tagged "R".

We asked the editors to read the documents, suggest edits, and answer the ten questions listed in Table 1 using a scale from 1 (low) to 10 (high). From a methodological standpoint, we asked for edits in addition to the ten assessments in order to encourage and evidence reflection on the specific ways the documentation failed, beyond the general impressions of effectiveness. These edits were collected through Word's Review interface. We assigned the quantity and selection of documentation according to the amount of time editors had available while trying to balance distribution across genres. Furthermore, we tried to give each individual document to at least two editors. We show the distribution in Table 2. Overall, the editors worked on 41 documents. Seven documents were assessed by only one editor, but the rest were by two; therefore, we had 75 records of document evaluations.
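The sampling and assignment procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the candidate document lists, sample sizes, and identifiers are hypothetical, and the real assignment additionally accounted for each editor's available time rather than a fixed rotation.

```python
import random

# Illustrative sketch only: random sampling per genre and assignment of each
# sampled document to two editors. Candidate lists and sizes are hypothetical.
CANDIDATES = {
    "RD": ["manual-section-01", "manual-section-02"],  # 79 subsections in the study
    "R":  ["readme-project-a", "readme-project-b"],    # 159 curated R projects
    "T":  ["tutorial-01", "tutorial-02"],               # 46 tutorialspoint tutorials
    "A":  ["blog-post-001", "blog-post-002"],           # 1,208 R-bloggers articles
    "SO": ["so-question-1", "so-question-2"],           # questions tagged "R"
}

EDITORS = ["E1", "E2", "E3", "E4"]


def sample_and_assign(per_genre: int, seed: int = 0) -> dict:
    """Sample documents from each genre and give each document to two editors."""
    rng = random.Random(seed)
    assignments = {}  # document -> list of editors
    slot = 0
    for genre, pool in CANDIDATES.items():
        for doc in rng.sample(pool, min(per_genre, len(pool))):
            first = EDITORS[slot % len(EDITORS)]
            second = EDITORS[(slot + 1) % len(EDITORS)]
            assignments[f"{genre}/{doc}"] = [first, second]
            slot += 1
    return assignments


if __name__ == "__main__":
    for doc, editors in sample_and_assign(per_genre=2).items():
        print(doc, "->", editors)
```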

Table 1: Dimensions of software documentation quality

Dimension         | Question                                                       | Source
Quality           | How well-written is this document (e.g., spelling, grammar)?  | question from [12]
Appeal            | How interesting is it?                                         | question from [12]
Readability       | How easy was it to read?                                       | 'accessibility' [4, 7]
Understandability | How easy was it to understand?                                 | 'ease of understanding' [17]; question from [12]
Structure         | How well-structured is the document?                           | 'navigation' from [7]
Cohesion          | How well does the text fit together?                           | question from [12]
Conciseness       | How succinct is the information provided?                      | 'concise', 'amount of data' [7, 17]; 'conciseness' [4]
Effectiveness     | Does the document make effective use of technical vocabulary?  | 'vocabulary' [12]
Consistency       | How consistent is the use of terminology?                      | 'consistency' [4, 7]
Clarity           | Does the document contain ambiguity?                           | 'clarity' [4]; 'understandability' [7]

Table 2: Distribution of documentation genres to editors

Editor | RD | R  | T  | A  | SO | Sum
E1     | 7  | 7  | 7  | 8  | 7  | 36
E2     | 6  | 6  | 5  | 6  | 6  | 29
E3     | 1  | 1  | 1  | 1  | 1  | 5
E4     | 1  | 1  | 1  | 1  | 1  | 5
Sum    | 15 | 15 | 14 | 16 | 15 | 75
Union  | 8  | 8  | 8  | 9  | 8  | 41

Table 3: Results of document rating by technical editors

Dimension         | RD  | R   | T   | A   | SO
Quality           | 8.1 | 7.6 | 7.5 | 6.1 | 6.7
Appeal            | 5.7 | 6.5 | 6.4 | 5.3 | 5.8
Readability       | 6.6 | 6.4 | 6.5 | 4.6 | 6.1
Understandability | 6.9 | 7.0 | 6.9 | 5.3 | 6.1
Structure         | 6.3 | 7.6 | 7.1 | 3.8 | 5.8
Cohesion          | 6.7 | 6.7 | 7.0 | 4.9 | 6.1
Conciseness       | 6.7 | 6.7 | 6.8 | 5.3 | 6.3
Effectiveness     | 8.3 | 7.9 | 8.1 | 7.1 | 7.5
Consistency       | 9.3 | 8.8 | 9.0 | 8.5 | 8.6
Clarity           | 8.6 | 8.7 | 7.6 | 6.3 | 7.7

Results. We display the average dimension rating per genre of documentation in Table 3; for each row, the highest value is in bold, and the lowest is italicised. For example, this table shows that the quality of the reference documentation, or how well-written it is according to the definition in Table 1, is rated 8.1 out of 10, on average higher than all the other genres. Meanwhile, the average blog article was rated 6.1 out of 10, the lowest. Within dimensions, the highest ratings are distributed primarily among reference documentation and README files; tutorials received the highest in cohesion and conciseness, but only by a difference of at most 0.3 of a rating point with the two aforementioned genres. Nevertheless, blog articles received the global lowest rating across every dimension, especially so in readability and structure.

Different dimensions do not seem to have equal distributions. For example, reference documentation has a 9.3 in consistency, but the lowest score is an 8.5 for blog articles. These are nevertheless high scores on average. Appeal also has a small spread, from 6.5 in README files to 5.3 in blog articles, but a lower average overall across dimensions. Structure, on the other hand, varies from 7.6 (README files) to 3.8 (blog articles), demonstrating a larger interaction with genres.
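The per-genre averages reported in Table 3 can be reproduced mechanically from the 75 evaluation records. The sketch below assumes the records live in a hypothetical CSV file (ratings.csv) with one row per editor–document assessment and one 1–10 rating column per dimension; the file and column names are not part of the study materials.

```python
import pandas as pd

# Illustrative sketch only: reproduce Table 3-style averages from the raw
# evaluation records. File name and column names are hypothetical.
DIMENSIONS = [
    "Quality", "Appeal", "Readability", "Understandability", "Structure",
    "Cohesion", "Conciseness", "Effectiveness", "Consistency", "Clarity",
]

# Expected columns: editor, document, genre (RD/R/T/A/SO), plus one 1-10
# rating column per dimension.
records = pd.read_csv("ratings.csv")

# Mean rating per genre and dimension, rounded to one decimal as in Table 3.
table3 = records.groupby("genre")[DIMENSIONS].mean().round(1).T

print(table3)                  # rows: dimensions, columns: genres
print(table3.idxmax(axis=1))   # genre with the highest average per dimension
```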
Reflection. Our findings suggest a trend in rating samples from certain genres highly while disapproving others. The relatively high rating of reference documentation and README files may be explained by their typical origins within the software product and team, whereas blog articles and Stack Overflow posts can be published more freely [9]. Nevertheless, the result that blog articles received the lowest in every category did surprise us, especially as we designed the framework to reduce the impact of context. There may be many factors involved in these results. For one, the corpora from which we randomly sampled may have different underlying distributions on our dimensions. Our reference documentation was sampled from larger related products, whereas manual inspection of the randomly sampled blog articles did evidence a broad variety of R-related topics, such as discussions of R package updates or write-ups and reflections on personal projects. Furthermore, as with any participant study, the soundness of our results depends on the accuracy with which we communicated our ideas and the participants understood and enacted them. Editors may approach different genres with different preconceptions; an 8 rating for a README file may not be the same as an 8 rating for a Stack Overflow thread.

Technical Challenges. As noted previously, we asked editors to make edits on document copies in Microsoft Word as well, hoping to gain insight into technical editing strategies for software documentation. We obtained over 4,000 edit events across all 75 documents, ranging from reformattings and single character insertions to deep paragraph revisions. However, we encountered several obstacles when attempting to analyse this data. First, when copying online documents to Microsoft Word, interactions and images in hypertext documents were distractingly reformatted for a linear document. As a result, several of the edits addressed changes to structure that did not exist when viewing the document in the browser instead. Furthermore, many edits were not recorded as intended—for example, replacing words with synonyms that do not make sense in a programming context. Although we can recreate changes to the document text through before-and-after comparison, small edits blend together and large edits confound the differencing algorithm. Because the editor's original intent is lost, any coding scheme applied over it becomes less secure. Due to these circumstances, we decided against drawing conclusions from this data.
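The before-and-after comparison mentioned above can be illustrated with Python's standard difflib; the example sentences below are invented. The sketch shows how two conceptually separate edits that happen to sit next to each other are reported as a single replacement block, which is one way the editor's individual intentions get lost in differencing.

```python
import difflib

# Illustrative sketch only (invented sentences): word-level diff between an
# original sentence and its edited copy. The two adjacent fixes (verb
# agreement and casing plus an added article) come back as one replace block.
original = "the call return NULL value on failure".split()
edited = "the call returns a null value on failure".split()

matcher = difflib.SequenceMatcher(a=original, b=edited)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, original[i1:i2], "->", edited[j1:j2])
# Prints a single block: replace ['return', 'NULL'] -> ['returns', 'a', 'null']
```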
5 IMPACT AND FUTURE WORK
Accuracy of software documentation is a necessary but not sufficient condition for its success. To empower software developers to make effective use of software documentation, it must be carefully designed in appearance, content, and more. Our vision, then, is to provide a more precise definition to better explore and transform what it means for software documentation to be of high quality, along with a research agenda enabled by such a quality framework. To that end, we have drawn from seminal work in the information and data quality communities to design a framework consisting of ten dimensions for software documentation quality.

Furthermore, our research agenda proposes future work to evaluate the impact of each dimension with end-users and improve the framework. For one, we can verify the assessments of technical editors by introducing end-users to different versions of the documents (e.g., edited by technical editors based on quality criteria and original version) and observing their use. Another direction is to explore trade-offs between quality attributes, such as whether readability outweighs structure in terms of document usability, and to further disambiguate similar dimensions (e.g., structure and cohesion). We will further revisit the edits from the technical editors to extract surface-level (e.g., word count) and deep-level (e.g., cohesion of adjacent sentences) lexical, syntactic, and discourse features to build classification models for predicting documentation quality. Such classification models can be used to assess and rank the quality of software documentation.
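As a sketch of the kind of classification model envisioned above, the following Python example trains a simple regressor on surface-level features of a document (word count, average sentence length, vocabulary size) to predict an overall quality rating. It is a minimal illustration under stated assumptions rather than our implementation: the training file (documents.csv), its columns, and the feature set are hypothetical, and the deep-level discourse features discussed above would be added analogously.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Illustrative sketch only: predict a 1-10 quality rating from surface-level
# features. File name, columns, and feature set are hypothetical.

def surface_features(text: str) -> dict:
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "vocabulary_size": len({w.lower() for w in words}),
    }

data = pd.read_csv("documents.csv")  # expected columns: text, rating (1-10)
features = pd.DataFrame([surface_features(t) for t in data["text"]])

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, features, data["rating"], cv=5,
                         scoring="neg_mean_absolute_error")
print("MAE:", -scores.mean())

# Once fitted, the same model can score and rank unseen documents by
# predicted quality.
model.fit(features, data["rating"])
```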
We believe these efforts are important because of the volume of software documentation on the web. A simple Google search related to a programming task will return documentation that matches the query without explicitly considering the quality of the material. For example, a query on 'reading CSV file in r' returns blog articles and tutorials as the top results, yet our preliminary results demonstrate that on average, blog articles are ranked worst across all ten quality dimensions. This is not to say that blog articles are essentially faulty and should be abandoned moving forward; rather, we hope to spur reflection on how end-users interact with blog articles and how each of our dimensions manifests uniquely under the genre's constraints, while nevertheless using what we currently know to emphasise more useful results. Therefore, applying the framework can influence guidelines and recommendation systems for documentation improvement as well as automatic assessing and ranking systems for navigating the large volumes of documentation and emphasising high-quality documentation.

ACKNOWLEDGEMENTS
The authors thank Emerson Murphy-Hill and Kathryn T. Stolee for their contributions to this work. This research was undertaken, in part, thanks to funding from a University of Adelaide – NC State Starter Grant and the Australian Research Council's Discovery Early Career Researcher Award (DECRA) funding scheme (DE180100153). This work was inspired by the International Workshop series on Dynamic Software Documentation, held at McGill's Bellairs Research Institute.

REFERENCES
[1] Emad Aghajani, Csaba Nagy, Olga Lucero Vega-Márquez, Mario Linares-Vásquez, Laura Moreno, Gabriele Bavota, and Michele Lanza. 2019. Software documentation issues unveiled. In Int'l. Conf. on Software Engineering. 1199–1210.
[2] Anton Barua, Stephen W Thomas, and Ahmed E Hassan. 2014. What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering 19, 3 (2014), 619–654.
[3] Barthélémy Dagenais and Martin P Robillard. 2014. Using traceability links to recommend adaptive changes for documentation evolution. IEEE Trans. on Software Engineering 40, 11 (2014), 1126–1146.
[4] Martin J Eppler. 2001. A generic framework for information quality in knowledge-intensive processes. In Int'l. Conf. on Information Quality.
[5] Davide Fucci, Alireza Mollaalizadehbahnemiri, and Walid Maalej. 2019. On using machine learning to identify knowledge in API reference documentation. In Joint Meeting on European Software Engineering Conference and Symp. on the Foundations of Software Engineering. 109–119.
[6] Ninus Khamis. 2013. Applying Computational Linguistics on API Documentation: The JavadocMiner. LAP LAMBERT Academic Publishing.
[7] Shirlee-ann Knight and Janice Burn. 2005. Developing a framework for assessing information quality on the World Wide Web. Informing Science 8 (2005).
[8] Sheena Panthaplackel, Pengyu Nie, Milos Gligoric, Junyi Jessy Li, and Raymond J Mooney. 2020. Learning to Update Natural Language Comments Based on Code Changes. arXiv preprint arXiv:2004.12169 (2020).
[9] Chris Parnin, Christoph Treude, Lars Grammel, and Margaret-Anne Storey. 2012. Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow. Technical Report. Georgia Institute of Technology.
[10] Chris Parnin, Christoph Treude, and Margaret-Anne Storey. 2013. Blogging developer knowledge: Motivations, challenges, and future directions. In Int'l. Conf. on Program Comprehension. 211–214.
[11] Gayane Petrosyan, Martin P Robillard, and Renato De Mori. 2015. Discovering information explaining API types using text classification. In Int'l. Conf. on Software Engineering. 869–879.
[12] Emily Pitler and Ani Nenkova. 2008. Revisiting readability: A unified framework for predicting text quality. In Conference on Empirical Methods in Natural Language Processing. 186–195.
[13] Reinhold Plösch, Andreas Dautovic, and Matthias Saft. 2014. The value of software documentation quality. In Int'l. Conf. on Quality Software. 333–342.
[14] Gede Artha Azriadi Prana, Christoph Treude, Ferdian Thung, Thushari Atapattu, and David Lo. 2019. Categorizing the content of GitHub README files. Empirical Software Engineering 24, 3 (2019), 1296–1327.
[15] Inderjot Kaur Ratol and Martin P Robillard. 2017. Detecting fragile comments. In Int'l. Conf. on Automated Software Engineering. 112–122.
[16] Shin Hwei Tan, Darko Marinov, Lin Tan, and Gary T Leavens. 2012. @tcomment: Testing Javadoc comments to detect comment-code inconsistencies. In Int'l. Conf. on Software Testing, Verification and Validation. 260–269.
[17] Richard Y Wang and Diane M Strong. 1996. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.
[18] Fengcai Wen, Csaba Nagy, Gabriele Bavota, and Michele Lanza. 2019. A large-scale empirical study on code-comment inconsistencies. In Int'l. Conf. on Program Comprehension. 53–64.
[19] Hao Zhong and Zhendong Su. 2013. Detecting API documentation errors. In Int'l. Conf. on Object-Oriented Programming, Systems, Languages, and Applications. 803–816.
[20] Hao Zhong, Lu Zhang, Tao Xie, and Hong Mei. 2009. Inferring resource specifications from natural language API documentation. In Int'l. Conf. on Automated Software Engineering. 307–318.
[21] Yu Zhou, Changzhi Wang, Xin Yan, Taolue Chen, Sebastiano Panichella, and Harald C Gall. 2018. Automatic detection and repair recommendation of directive defects in Java API documentation. IEEE Trans. on Software Engineering (2018).
[22] Yu Zhou, Xin Yan, Taolue Chen, Sebastiano Panichella, and Harald Gall. 2019. DRONE: a tool to detect and repair directive defects in Java APIs documentation. In Int'l. Conf. on Software Engineering: Companion Proceedings. 115–118.