0001 0002 Guidelines for Statistical Projects: Coding and 0003 0004 Typography 0005 0006 1 2 0007 Marius Hofert , Ulf Schepsmeier 0008 0009 0010 2014-10-01 0011 0012 0013 0014 0015 Abstract 0016 A 0017 Guidelines for conducting, implementing (in LTEX and R) and documenting statistical (re- 0018 search) projects are provided in order to improve readability and reduce the error rates of 0019 theses, scientific papers, reports and especially code. This is meant to save supervisors, package 0020 maintainers, students and practitioners a lot of time. It is clear, however, that such guidelines 0021 cannot be exhaustive. The given recommendations should therefore rather serve as a starting 0022 0023 point for improving your workflow and to avoid common pitfalls in statistical projects of larger 0024 scale. 0025 0026 0027 Keywords 0028 Coding, typography, LATEX, R. 0029 0030 MSC2010 0031 0032 68U15, 68U20, 68U05, 97R60. 0033 0034 0035 Contents 0036 0037 0038 0039 1 Introduction 2 0040 0041 2 General suggestions 3 0042 2.1 Forget about the Pareto principle (80–20 rule) ...... 3 0043 0044 2.2 When solving a particular problem for the first time, spend time on it ...... 4 0045 2.3 English in mathematics ...... 4 0046 2.4 Be consistent ...... 5 0047 0048 2.5 Be concise ...... 5 0049 2.6 Be structured ...... 6 0050 0051 2.7 Be self-contained ...... 7 0052 2.8 Be reproducible ...... 7 0053 2.9 Optimize communication, meetings and preparation ...... 8 0054 0055 0056 3 Editors and integrated development environments 9 0057 0058 0059 0060 0061 0062 1Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, 0063 ON, Canada N2L 3G1, [email protected] 0064 2Department of Mathematics, Technische Universität München, 85748 Garching, Germany, ulf.schepsmeier@ 0065 tum.de 0066

1

© Marius Hofert, Ulf Schepsmeier 1 Introduction

0067 4 LAT X 10 0068 E 0069 4.1 Getting started ...... 10 0070 4.2 Typographic recommendations for mathematical documents ...... 10 0071 4.3 Technical tricks to improve typography ...... 12 0072 0073 4.3.1 Citations ...... 12 0074 4.3.2 Spaces and alignment ...... 12 0075 0076 4.3.3 Figures ...... 13 0077 4.3.4 Miscellaneous ...... 14 0078 0079 5 R 14 0080 0081 5.1 Getting started ...... 14 0082 5.2 Documentation ...... 15 0083 5.2.1 Citing R and R packages ...... 15 0084 0085 5.2.2 Run time information ...... 16 0086 5.2.3 Code documentation ...... 17 0087 0088 5.3 Programming style ...... 18 0089 5.3.1 Writing correct code ...... 18 0090 5.3.2 Writing readable code ...... 18 0091 0092 5.3.3 Writing safe, fast, flexible and sophisticated functions ...... 20 0093 5.3.4 Learn from others, learn from the masters ...... 22 0094 5.3.5 Test your code ...... 23 0095 0096 5.3.6 Specific hints ...... 23 0097 5.4 Tables and graphics ...... 23 0098 0099 0100 6 Version control 25 0101 6.1 Dropbox ...... 25 0102 6.2 SVN ...... 25 0103 0104 6.2.1 Checkout ...... 26 0105 6.2.2 Add and (re)move ...... 26 0106 6.2.3 Update, commit ...... 27 0107 0108 6.2.4 Log, status, list, diff ...... 27 0109 6.2.5 Conflicts ...... 28 0110 6.3 Git ...... 28 0111 0112 0113 7 Submitting a paper 28 0114 7.1 Purpose of journals: How to find the best fitting journal for my research . . . . . 29 0115 0116 7.2 Preparations before submission ...... 30 0117 7.3 Submitting a paper ...... 31 0118 0119 References 32 0120 0121 0122 0123 1 Introduction 0124 0125 These guidelines are meant for students, professors and practitioners who would like to write, 0126 0127 participate in, or supervise a project such as a bachelor, master, Ph.D. thesis, or scientific 0128 paper in the intersection of mathematics, statistics and computer science. Besides some general 0129 recommendations, we focus on the software tools LAT X and R for conducting, implementing and 0130 E 0131 documenting the project. 0132

2

© Marius Hofert, Ulf Schepsmeier 2 General suggestions

0133 Before going into detail, some remarks are in order: 0134 0135 The science of coding Coding (in LATEX, R etc.) is like a handwriting. From the corresponding 0136 files and style of coding one can read a lot. Writing correct, readable, well-documented and 0137 0138 easy-to-maintain code is a science on its own and one that is not taught explicitly at university 0139 level unless one specifically studies computer science (but this course of studies rarely addresses 0140 LAT X and R). However, with the ever increasing complexity of statistical simulations and 0141 E 0142 projects, it becomes important to have coding guidelines – otherwise it might be difficult for 0143 others to understand what you actually want to “say” or do with your code. Besides being 0144 0145 correct, your code should be easily readable and extendable by others. Larger projects involve 0146 more and more contributors, each of which should be able to easily follow your code and adjust 0147 it if required, hence some guidelines are in order. 0148 0149 Motivation These guidelines are motivated from our own work with students and practitioners. 0150 After pointing out improvements, making code more readable, correcting common mistakes and 0151 0152 improving documents again and again, we hope that these guidelines help all parties involved 0153 in a project to avoid (what we believe are) common pitfalls and to save time. 0154 0155 Focus The guidelines reflect our personal recommendations and experience using LATEX and 0156 R (and related tools mentioned below) in our personal areas of research (which lies in the 0157 0158 intersection of mathematics, statistics and computer sciences). It is clear that such a guide 0159 cannot be exhaustive. In particular, this is not an introduction to the topics presented! If you 0160 feel that we missed an important aspect not easily found or addressed in other guidelines or 0161 0162 tutorials, or if you can improve this document, please let us know. 0163 Goal Our goal with these guidelines is not to make a document or code snippet 100% perfect. 0164 0165 There are exceptions to almost any rule and describing all of them would extend the page 0166 count of this document well beyond what you would be willing to read; the left-out exceptions 0167 0168 are the 10–20% not discussed here. Furthermore, some aspects are discussed in more detail 0169 than others (which is motivated by our work/judgement). Aspects addressing more advanced 0170 users are marked with an A . 0171 0172 Disclaimer This document does not exist to torture you(r workflow) to use a specific kind of 0173 , editor, software, etc. It should rather point out how we tackle certain 0174 0175 problems (and partly, but not always (!), why we solve them like this). You may or may not 0176 find this helpful, the principle do not like it? do not use it! applies. 0177 0178 We will constantly update this document. Therefore, there will never be a version which can 0179 be considered final. 0180 0181 The guidelines are organized as follows. In Section 2, we give general suggestions for written or 0182 coding intensive projects. Section 3 briefly addresses the importance and choice of text editors. 0183 A 0184 Section 4 and Section 5 point out recommendations when working with LTEX and R, respectively. 0185 0186 0187 2 General suggestions 0188 0189 2.1 Forget about the Pareto principle (80–20 rule) 0190 0191 Definition The Pareto principle (or 80–20 rule) says that for many events, 80% of the final 0192 0193 outcome/result/effect is achieved by 20% of the input/causes. 0194 0195 Meaning Essentially, this means that one should stop after having spent 20% of the time/effort 0196 one could spend on the project, since all additional effort would just improve the outcome by 0197 the remaining 20%. 0198

3

© Marius Hofert, Ulf Schepsmeier 2 General suggestions

0199 Why not? The Pareto principle is frequently used in many areas. However, it does not apply 0200 0201 to scientific work. If you write a research paper, for example, it will come back for revision 0202 at some time. You certainly do not want to realize a year later, that you now actually have 0203 to start over with the whole work (instead of doing just a revision). Furthermore, if a referee 0204 0205 feels that you only spent 20% of the effort on the submitted work, this most likely results in a 0206 rejection. Do your homework, work hard and exclusively on the topic and you will get a result 0207 0208 you can be happy with. Also, your supervisor is happy to learn about what comes (far) after 0209 the 80% (in contrast to hearing about well-known results). Keep that in mind at any stage. 0210 0211 0212 2.2 When solving a particular problem for the first time, spend time on it 0213 0214 In programming, there is this basic (unwritten) law (maybe applying as well to research in 0215 general): 0216 0217 1) If you have a problem, search for it. 0218 The chance that you are the first one working on this problem is small. Others may have 0219 0220 already solved the problem (in an elegant, optimal, fast and readable way). 0221 2) If you cannot find a solution, search more, search differently, search longer – but search for it! 0222 0223 3) If you still cannot find it, go back to 2). 0224 0225 4) If you are sure there exists no (good) solution, write your own. Spend a lot of time on it to 0226 ensure the solution is excellent. Then make it (publicly) available. 0227 0228 Concerning 1) and the links we provide below, always look for solutions provided by senior 0229 members of mailing lists, forums, blogs etc., as there can be significant differences in the quality 0230 0231 of the answers. 0232 In short, if there is a good solution available, learn from it. If there is not, write your own, but 0233 0234 make sure it is of good quality so that others can benefit from it when they find themselves in the 0235 same position. 0236 0237 0238 2.3 English in mathematics 0239 0240 The language science speaks is (American) English. Even if it is only a comment in a script you 0241 write, a file name, variable, or function etc., use (American) English. It will be easier for others 0242 0243 to find and understand your work (besides various other advantages). 0244 Additionally, we want to mention some basic rules for mathematical typography in English; 0245 here we follow Halmos (1970) and Higham (1993). 0246 0247 Short(er) sentences Use short sentences in your theses or project document. Long sentences are 0248 not as conventional in English as they are in German, for example. So German students, at 0249 0250 least, are advised to follow this rule. Formulate (sufficiently) simple and simple to understand 0251 sentences. This improves the readability of your text. 0252 0253 Pluralis majestatis In scientific documents one uses “we” instead of “I”, even if there is only one 0254 author – the “we” represents the author and the reader. 0255 0256 Passive mode In English it is often easier and more elegant to formulate a statement in passive 0257 mode. But do not use it too often, especially in American journals the active mode is often 0258 0259 preferred. 0260 0261 Readable text instead of operators In the English mathematical literature, words such as “there 0262 exists” or “for all” are to be preferred over their operator equivalents “∃” and “∀”; the former 0263 make the text more readable. This contrasts, for example, German mathematical typography. 0264

4

© Marius Hofert, Ulf Schepsmeier 2 General suggestions

0265 Another symbol frequently used in German but not English mathematical typography is “:=” 0266 0267 (“=:”) for defining the quantity on the left-hand (right-hand) side by the one on the right-hand 0268 (left-hand) side. 0269 0270 Comma rules in English As non-native English speaker one is often unsure if and when a comma 0271 has to be set in a sentence. In general the regulation regarding commas in English is less 0272 0273 restrictive as in German, for example, but there are some rules: 0274 0275 A nonrestrictive element, which does not limit scope but merely provides additional infor- 0276 mation, is indicated by being set off by commas. 0277 0278 A restrictive relative clause is introduced with “that” and is not set off by commas. 0279 0280 A nonrestrictive relative is introduced with “which” and is always set off by commas. 0281 Use a semicolon only where you could also use a full stop. 0282 0283 Mind commas in if-clauses: “If you knew all that I know, you would know what I mean”, 0284 0285 but “You would know what I mean if you knew all that I know”. 0286 0287 0288 2.4 Be consistent 0289 0290 Stick to (your) rules Consistently use the same notations for the same quantities throughout 0291 the text. More generally, stick to the (typographical/coding) rules you use exactly in the 0292 same way throughout the whole file (. document or .R script), from the very first to the 0293 0294 very last character in the file (even when using spaces). This will significantly help you when 0295 search-and-replace is in order (after a supervisor’s or referee’s feedback, or if you would like to 0296 make changes). 0297 0298 Say, you use a special rule for writing nested parenthesis, for example one of 0299 0300      0301 ((a7x + a6)x + a5)x + a4 x + a3 x + a2 x + a1 x + a0 (1) 0302 0303 or 0304 0305 [{([{(a x + a )x + a }x + a ]x + a )x + a }x + a ]x + a . 0306 7 6 5 4 3 2 1 0 0307 0308 It does not matter (much) which is to be preferred on first writing (and for publications in 0309 scientific journals this is often determined by the journal’s style guide), as long as you stick to 0310 0311 the very same rule throughout the whole document. 0312 0313 2.5 Be concise 0314 0315 Be precise Most importantly, be mathematically correct (for example, note the difference between 0316 0317 ∈ and ⊆, a quite common mistake). Furthermore, be concise in your descriptions, proofs, etc. 0318 In mathematics, this especially applies to assumptions made for certain statements to hold. 0319 0320 Be short Besides being precise, be short. Twenty well-written pages are much more interesting 0321 to read (besides being less to type and less to correct or grade) than one hundred sloppy and 0322 0323 boring pages. 0324 0325 Everywhere in the documentation Being concise applies to various parts of a project, for exam- 0326 ple, the documentation: 0327 0328 Headings Headings and the table of contents should provide a golden thread or structure 0329 which should be easy to grasp without even reading the text. 0330

5

© Marius Hofert, Ulf Schepsmeier 2 General suggestions

0331 Figures and tables Figures and tables, including their captions, should be easy to read and 0332 0333 understand without having to search in the text for the corresponding explanation. 0334 0335 Formulas Put important formulas in a displayed equation and check that the main ideas of 0336 your work can be followed by just looking at the displayed equations. In the same spirit, 0337 a displayed equation/formula etc. should make sense as much as possible without looking 0338 3 0339 at the text . Conversely, more complicated formulas should always be explained in verbal 0340 form in the text as well. This, together with the displayed equation/formula (do not use 0341 text here), gives the reader the chance to understand the topic on two different levels, one 0342 0343 language-based and one formula-based. Ideally, there should also be third, graphical-based 0344 level by illustrating the (complicated) formula with a graphic. 0345 0346 File, variable and function names Naming files, variables and functions (both from a mathemat- 0347 0348 ical and a programming point of view) in a meaningful way is important. Label versions of your 0349 files by starting with the date in ISO 8601 date format (such as 2013-12-31_my_project.R). 0350 This way, they are displayed in chronological order if files in the current folder are sorted by 0351 0352 name. 0353 Do not call a variable variable or var. Instead, give it a -related and self-explaining 0354 0355 name (ideally even such that the type (integer, real, etc.) of the variable is obvious from its 0356 name), such as tau for a certain value of Kendall’s tau or n for a sample size (similar to the 0357 0358 standard notation n in statistics). Choose variable names in scripts as close as possible to 0359 their mathematical equivalents. In the same spirit, do not call a function fun; note that R, for 0360 example, would not even allow the (reserved) name function. 0361 0362 Also, do not encode a certain method or outcome in numbers if it is not a number naturally. 0363 For example, colors 1, 2 and 3 are much less self-explaining than colors “blue”, “green” and 0364 0365 “red”. 0366 0367 The following basic rule typically provides compact and readable code: The more often you 0368 need a variable (this partly also applies to functions), the shorter its name should be. Often, 0369 short names can be generated by leaving out vocals, the human eye typically “interpolates” 0370 0371 correctly and directly recognizes the corresponding word (and thus the meaning of the name). 0372 In general, omit superfluous parts in function names; for this and other naming conventions 0373 more specifically in R, see Section 5.3.2. 0374 0375 0376 2.6 Be structured 0377 0378 Introduction, abstract and summary come last Do not start to write your paper or project 0379 0380 document by thinking about the introduction. The introduction, abstract and summary are 0381 the last parts you should write in your project. First concentrate on the content. At the very 0382 last, think about the introduction and the end. Write down headwords, for example for the 0383 0384 motivation of the topic and finally write out the introduction in full. Additionally, you can 0385 0386 0387 3 0388 Also, when introducing a function f for the first time, do not just write 0389 0390 f(x) = log x. 0391 Instead, make it more precise by providing its domain, so 0392 0393 f(x) = log x, x ∈ (0, ∞). 0394 0395 0396

6

© Marius Hofert, Ulf Schepsmeier 2 General suggestions

0397 also note the most important three or so words on every page of your document. This can help 0398 0399 in creating a golden thread. 0400 0401 Numbering To structure your manuscript into meaningful parts you can use chapters (but only 0402 in large manuscripts like books or theses), sections, subsections, or paragraphs. Do not use too 0403 many levels of headings. In most cases, three numbered levels are sufficient. In smaller reports 0404 0405 even two levels are typically fine. Only use subsections if you have more than one meaningful 0406 subsection. Otherwise work with paragraphs. 0407 0408 Two possible ways During our scientific career we learned two ways of starting a document and 0409 structuring it. 0410 0411 Bottom-up Collect all your ideas, write them down, and finally structure them into associated 0412 0413 parts, sections and chapters. 0414 Top-down Think about a logical way of reading/following your paper. Write down the chapters 0415 0416 and sections you have in mind and order them. Then write down your ideas and text in the 0417 corresponding sections. 0418 0419 Listings Sometimes, list of bullet points are very helpful to write down several connected state- 0420 0421 ments in a compact way. If they have an order you may use an ordered list, otherwise an 0422 unordered list. You can use different numbering styles, e.g., arabic or roman numbers, or 0423 alphabetical items. In unordered lists different bullet point styles are also available (we mainly 0424 0425 use filled squares in this document). Even minor headings are possible, see for example this 0426 guide. 0427 0428 0429 2.7 Be self-contained 0430 0431 Outsource Instead of reproducing known results, properly refer to papers, books, or code/packages 0432 your work is based on; when referring to books, always provide a page number (and mention 0433 0434 the edition of the book in the references). 0435 0436 How to cite Typically, author-year citation style (such as “paperAuthorLastName (YYYY)” or 0437 “bookAuthorLastName (YYYY, pp. 17)”) provides the most readable and memorable citations. 0438 Also, instead of just “It follows from A (2000) and B (2010) that z holds”, write “In terms of 0439 0440 our setup here, A (2000) showed that x holds. With this result, the assumption of the main 0441 theorem in B (2010) holds, which states that. . . One can therefore conclude that z holds”. In 0442 this way, the main idea can be followed without having to read “A (2000)” and “B (2010)”, 0443 0444 which makes the document more self-contained. Note that “p. 17” is used to refer to page 17 0445 directly and “pp. 17” to refer to page 17 and thereafter. 0446 0447 Do not cite the world’s literature Cite (only) the main or original reference and not a myriad 0448 of references which are more or less related to a topic/theorem/definition/statement. It makes 0449 0450 the text unnecessarily long and difficult to read (and by Section 2.5, we wanted to be concise!). 0451 0452 0453 2.8 Be reproducible 0454 0455 Meaning Make sure your results are reproducible, that is, one can repeat the experiment or 0456 simulation (or even a proof (!)) and obtains the exact same result. 0457 0458 Seed and more To obtain a reproducible statistical simulation, always set a seed! For more on 0459 this, including instructions how to conduct simulation studies in R, see (the ideas and words of 0460 0461 warning in) Hofert and Mächler (2014). 0462

7

© Marius Hofert, Ulf Schepsmeier 2 General suggestions

0463 A Sweave, Knitr For manuscripts containing R code or R results one possibility to achieve repro- 0464 A 0465 ducibility is Sweave or Knitr. These two R packages allow to combine R and LTEX code in 0466 one file (.Rnw files). Every time you change a calculation in the R code and compile the whole 0467 document, the R results are automatically updated and propagated to the file created by 0468 A 0469 LTEX (there are many more options). 0470 0471 Both packages are very useful for small projects and short reports. Some editors like RStudio 0472 (see Section 3) support Sweave and Knitr and offer buttons to easily incorporate so-called 0473 chunks – pieces of R code in LAT X. Furthermore, these tools have a good documentation and 0474 E 0475 an intuitive handling. 0476 A 0477 Tools like Sweave and Knitr also have their drawbacks. Mixing LTEX and R code does not 0478 necessarily provide all the features that either one provides (there are restrictions). Furthermore, 0479 having text mixed between different chunks of code may distract you from coding (typically, 0480 0481 text – besides comments – does not help in writing sophisticated code); navigation within 0482 the document also does not get easier. Moreover, debugging (that is, searching for errors that 0483 appear in some piece of code) is significantly more difficult when mixing R with LAT X code. 0484 E 0485 Finally, run time is longer (although intermediate results can be caught and stored, but this 0486 again makes the code longer). 0487 0488 Overall, we thus do not recommend to use Sweave or Knitr for large projects, unless 1) there is 0489 a significant amount of code to be displayed in the written companion of a project (which is 0490 0491 rarely the case); 2) the code runs sufficiently fast; and 3) unless the user’s knowledge about 0492 LATEX and R is sufficiently advanced. 0493 0494 0495 2.9 Optimize communication, meetings and preparation 0496 0497 Getting in contact If you contact a researcher/instructor etc. for the first time, start by (briefly!) 0498 saying/writing who you are (what is your status? master/Ph.D. student? practitioner?), what 0499 0500 the goal of your project is (thesis? software development?) and who you work with on this 0501 project (supervisor, colleagues, etc.). 0502 0503 Email communication Communication between you, your supervisor and a potential third party 0504 such as a tutor will often be mainly by email. We advise you to consider the following points: 0505 0506 Choose a short but meaningful subject line. Subject matters such as “Hi” or “Dear Professor” 0507 0508 are not meaningful. A concise subject would be “Problem master thesis: for-loop too slow”, 0509 for example. 0510 0511 Check your email before you send it. Is the announced attachment attached? Is the question 0512 clearly formulated? Do I have answered all the questions the supervisor asked me in her/his 0513 0514 last email? 0515 An unwritten law (at least applying to students) states that emails should be answered 0516 0517 within 24h (otherwise one could equally well send a carrier pigeon). Therefore, check (and 0518 answer) your email at least once a day. 0519 0520 Preparing meetings Prepare the questions you have and would like to ask. Send them to your 0521 0522 supervisor (in the same email in which you ask for an appointment). Make a suggestion for two 0523 possible dates for the meeting. Bring a paper and pencil to the meeting Also, be able to briefly 0524 summarize your work/problems (which should be easy since you have already formulated the 0525 0526 related questions); your supervisor is usually involved in several projects simultaneously and 0527 0528

8

© Marius Hofert, Ulf Schepsmeier 3 Editors and integrated development environments

0529 can not remember all details of your project. The more precisely you can nail down a problem, 0530 0531 the more likely you will directly get the answer you were looking for. 0532 0533 During meetings Take notes of the answers, comments, suggestions, etc. your supervisor mentions 0534 during the meeting. 0535 0536 Wrap-up Complete and structure your write-up right after the meeting. Put the points you have 0537 to act on in your files (.tex or .R) with a string TODO in front of them (this allows you to 0538 search for all such points to see whether there is anything left in the document to do). Finally, 0539 0540 go through all files again, work on the TODOs, and take notes of the questions that arise (to 0541 have them ready for the next meeting). 0542 0543 Feedback Note that your supervisor typically only corrects the first instance of a mistake in 0544 a project document. It is your responsibility to completely go through the files and make 0545 0546 the corresponding corrections everywhere (which should be easy since you follow Section 2.4 0547 above!). 0548 0549 Matter of course During meetings, be awake (!) and polite. Do not answer emails or phone calls 0550 during a meeting (yes, it happened to us!) 0551 0552 0553 0554 3 Editors and integrated development environments 0555 0556 Why to think about it It seems difficult to overestimate the importance of a good (text) editor 0557 for modern software development. Indeed, besides auxiliary programs (such as a PDF viewer, 0558 0559 for example), advanced programmers mainly work with a tool accepting command lines (the 0560 “terminal” on Unix systems), a browser and a good editor. An editor is an application which 0561 0562 allows to edit files – one of the major tasks when writing documents or software. 0563 Everybody can use his/her own favorite editor. We will not really recommend one. But we 0564 0565 will give some suggestions what a good editor should have and how the editor can support 0566 and improve our coding style. Most of the more advanced editors, which we introduce below, 0567 support automatically many of the style guides we will give in Section 4 and 5. Especially the 0568 0569 ones in Section 5.2 and 5.3. 0570 0571 Two sophisticated choices Although many pieces of software now provide their own integrated 0572 development environment (“IDE”), there are some powerful editors that can be used for various 0573 different tasks and thus provide a notion of “economies of scale” for development. Two very 0574 0575 sophisticated editors are GNU Emacs and Vim. Both editors go far beyond simple task such 0576 as syntax highlighting or navigation within files. Their rivalry is known as “editor war”. Both 0577 editors are highly customizable and can be further expanded to allow for much more advanced 0578 0579 tasks such as managing files including bookmarks or as Getting Things Done (“GTD”) software, 0580 partly even as email program or web browser. Especially for working LATEX and R, Emacs is 0581 0582 suited well, with the well-developed tools AUCTEX and Emacs Speaks Statistics (“ESS”). The 0583 customizability comes at the price of rather steep learning curve, though. Although powerful 0584 editors such as Emacs or Vim can be recommended to work with in the long run, it takes time 0585 0586 to become proficient in using them. 0587 A popular choice for Windows is the free program Notepad++. This is a powerful editor 0588 0589 which is (partly) customizable and goes beyond syntax highlighting or navigation within files. 0590 Many coding languages are supported and extensions are possible. “Find and replace” or other 0591 editing functions are well implemented and can be used in several files simultaneously. 0592 0593 0594

9

© Marius Hofert, Ulf Schepsmeier 4 LATEX

0595 Less powerful but easier to learn choices For working with LAT X and R, there are also specific 0596 E A 0597 editors and IDEs available which are comparably easy to use. For LTEX examples are or 0598 Texmaker (primarily for ), TextMate (a more general text editor) or TeXShop (for Mac), 0599 or TeXnicCenter (for Windows). 0600 0601 For R, we can recommend RStudio which is available on Linux, Mac and Windows. It combines 0602 0603 R with an editor, file directory, help pages and output windows. R packages can be easily 0604 installed/updated and loaded. Even whole projects such as packages can be managed in 0605 RStudio-projects. Furthermore, RStudio supports the easy use of Sweave and Knitr; see 0606 0607 Section 2.8. 0608 0609 0610 4 LATEX 0611 0612 0613 4.1 Getting started 0614 A 0615 Introduction We assume the reader to be familiar with basic syntax and usage of LTEX. For an 0616 introduction, see Oetiker et al. (2011). For LATEX packages and other material around TEX, see 0617 http://www.ctan.org/. 0618 0619 Help Typically very good help on more advanced topics is provided by http://tex.stackexchange. 0620 com/. 0621 0622 0623 4.2 Typographic recommendations for mathematical documents 0624 0625 Getting help Although books like Ritter (2002) can provide guidance with many good ideas not 0626 0627 mentioned here, keep in mind that (by far) not all recommendations apply equally well to 0628 mathematical or scientific documents. 0629 0630 Common careless errors Beware of mistakes (supervisor names, dates, spelling of affiliation etc.) 0631 on title pages, covers, etc., one typically does not check such pages again after they have been 0632 0633 created. 0634 Lazy eye principle To access whether a document looks good, apply the lazy eye principle: hold 0635 0636 the page a meter away from your eyes and try to “view through” (like your grandmother would 0637 do without her glasses). Check whether the page structure (including white space, figures, 0638 margins etc.) is appealing. 0639 0640 One advice which is often implied by the lazy eye principle is to use headings in heads of 0641 propositions, theorems, examples etc. to make it easier to follow the overall golden thread of 0642 0643 the document, to see which are the main results or which are only auxiliary results etc. 0644 A 0645 Character protrusion Use the LTEX package microtype for character protrusion and font expan- 0646 sion (only with pdfLATEX). By stretching lines ending with certain characters further out in 0647 the margin than others, this, for example, provides a visually more appealing justification than 0648 0649 by forcing each line to have precisely the same length. 0650 New paragraphs Use paragraph indentation (\parindent) instead of paragraph skip (\parskip). 0651 0652 The reason is that in mathematical documents with displayed equations, a paragraph skip is 0653 difficult/impossible to distinguish from a vertical space after a displayed equation (which is a 0654 problem when a paragraph ends with the latter). 0655 0656 Create a new paragraph by an empty line in your .tex file, not by using \par. 0657 0658 Furthermore, before each new ((sub)sub)section, use an empty line (except when a new 0659 (sub)subsection directly follows a new (sub)section). 0660

10

© Marius Hofert, Ulf Schepsmeier 4 LATEX

0661 Title case If at all, only use title case in the title of (larger) projects, not in section headings, 0662 0663 table headings etc. 0664 0665 Capitalization If you refer to a table/figure/theorem in your text use upper case letters, for 0666 example “In Figure 2, we illustrate. . . ” or “The proof of Theorem 3 is given in. . . ”. But if you 0667 do not refer to a numbered environment, use lower case letters, so “In the figure shown below, 0668 0669 we illustrate. . . ” or “The proof of the following theorem is given in. . . ”. 0670 Punctuation Use punctuation marks, also in displayed mathematical formulas. After all, mathe- 0671 0672 matics is also a language (the language of nature) and thus deserves proper punctuation. 0673 0674 Abbreviations The abbreviations “i.e.” (“that is”), “e.g.” (“for example”) and “c.f.” (“see”) are 0675 always preceded by a comma (unless used right after a “(” of course) and, in American English, 0676 also followed by one. 0677 0678 Footnotes Do not use footnotes. They distract from the reading flow, are rarely accepted by 0679 0680 scientific journals and can almost always be omitted anyways. 0681 Introducing new quantities If you introduce/define a new term or notion, make it visible via 0682 0683 \emph{...} and, if you have a longer document with an index (such as a thesis), refer to it in 0684 the index. 0685 0686 Always introduce definitions, figures, tables, etc. before they appear in the text. However, do 0687 not introduce them too long before they actually appear, rather right before. This is also 0688 0689 considered as good practice in programming in general. If you define a variable too early, the 0690 reader (or even yourself) might have forgotten about it by the time it is used. 0691 0692 Large numbers Use \, to visually separate numbers larger than or equal to 1 000, so write 1\,000, 0693 1\,000\,000, etc. 0694 0695 Page ranges For page ranges (such as “1–10”), compound names, or dashes, use -- (and not just 0696 -, which is reserved for hyphens!). 0697 0698 Sets The positive integers, the real numbers, the complex numbers etc. can be nicely formatted 0699 via \mathbbm{N}, \mathbbm{R}, \mathbbm{C} etc. from the LAT X package bbm. 0700 E 0701 For indices, note that i ∈ {1, . . . , n} is a more precise statement than i = 1, . . . , n. 0702 0703 Parentheses, square brackets and braces Use \bigl(, \bigr), \Bigl(, \Bigr), \biggl(, \ 0704 biggr) and \Biggl(, \Biggr) instead of \left(, \right) unless they cannot be used easily 0705 0706 or you really need large parenthesis; see http://tex.stackexchange.com/questions/12773/ 0707 or-left-parentheses and http://tex.stackexchange.com/questions/1454/what-is-the-correct-way-to-do-delimiters. 0708 0709 Also, do not use the unspecified versions \big and related commands as they create too much hor- 0710 izontal space; see http://tex.stackexchange.com/questions/1232/difference-between-big-and-bigl. 0711 0712 Size of parentheses This is a complicated topic and there exists no easy solution. We suggest to 0713 (typically) follow the rule: For two subsequent parentheses use the same size, then go to the 0714 next larger size; see (1). 0715 0716 The space after a parenthesis In displayed equations, large (typically opening) parentheses may 0717 0718 reach into the actual formula. With \, one can create some additional space; see the difference 0719 between \biggl(\sum_{i=1}^n and \biggl(\,\sum_{i=1}^n: 0720 n n 0721 X  X 0722 versus 0723 i=1 i=1 0724 0725 0726

11

© Marius Hofert, Ulf Schepsmeier 4 LATEX

0727 Labeling Only label those displayed equations etc. that you actually refer to from somewhere in 0728 0729 your document (hence a label should indicate a more important or not so easy to remember 0730 equation). Do not label every displayed equation, theorem etc. by default. If you do not want 0731 to label a certain line in a multi-line equation, use (before the line breaking ). If you 0732 \notag \\ 0733 want to change the label, use \tag{$*$}, for example (right before \label{...}). 0734 0735 Referring to equations Referring to equations can be done via \eqref{eq:label} instead of 0736 (\ref{eq:label}); the latter version bears the risk of forgetting the adjacent parentheses. 0737 0738 Vectors Vectors are column vectors, but written as a tuple X = (X1,...,Xd). Furthermore, use 0739 the command \bm{} from the LATEX package bm to create bold symbols such as vectors; this 0740 also works for greek letters. Note that a transpose sign is only used if required, for example, as 0741 > 0742 in a X; use ^{\top} to generate a transpose sign. 0743 0744 Ruler Use the package vruler with the setting \setvruler[10pt][1][1][4][1][0pt][0pt 0745 ][-30pt][\textheight] (or similarly; see the documentation) to display line numbers in 0746 your document. This greatly simplifies discussing certain parts of the document (by email). 0747 0748 Quotation marks The LATEX quotation marks in (American) English start with “ (typically 0749 obtained via the key with the tilde symbol) and end with ” (the key with the single quotation 0750 0751 marks on), not ". 0752 0753 0754 4.3 Technical tricks to improve typography 0755 0756 4.3.1 Citations 0757 0758 How-to Use BibTEX, or – even better – BibLATEX, to manage references and bibliographies in a 0759 .bib file. There are several free software tools available to organize and manage references 0760 A 0761 for BibTEX or BibLTEX, for example JabRef. Emacs’ AUCTEX and RefTEX also provide 0762 functionality for conveniently working with .bib files. 0763 0764 Where to (typically) put references References can often be nicely added at the end of a sentence 0765 via a semicolon without disturbing the reading flow; see . . . . 0766 0767 0768 4.3.2 Spaces and alignment 0769 0770 Escaping spaces after dots and to avoid line breaks If a word, title of a person, or abbreviation 0771 ends with a dot, note that LAT X cannot distinguish it from the end of a sentence. LAT X 0772 E E 0773 therefore creates a space which is larger than what you actually want. In order to get the 0774 correct spacing, you have to escape the space. This can be done using a backslash, for example 0775 0776 As Ph.D.\ student, I have... 0777 Another instance where one should escape spaces is when referring to figures or tables. In this 0778 0779 case one can use a tilde to avoid a line break between the label “Figure” or “Table” and its 0780 number: As shown in Table~1 and Figure~3... 0781 0782 Breaking terms over lines If you want to break, for example, a vector X = (X1,...,Xd) over a 0783 line, use $ $ to allow LATEX to break the line. For example, write $\bm{X}=(X_1,$ $\dots,X 0784 0785 _d)$ or $\bm{X}$ $=(X_1,\dots,X_d)$. In the same spirit, write $\bm{X}_i$, $i\in\{1,\ 0786 dots,d\}$ instead of $\bm{X}_i, i\in\{1,\dots,d\}$. First, this the former gives LATEX 0787 more freedom in nicely breaking the line and, second, it creates a more readable space between 0788 0789 $\bm{X}_i$ and $i\in\{1,\dots,d\}$: 0790 0791 Xi, i ∈ {1, . . . , d} versus Xi, i ∈ {1, . . . , d} 0792

12

© Marius Hofert, Ulf Schepsmeier 4 LATEX

0793 Watch out for bold indices Watch out for the difference between \bm{X_i} and \bm{X}_i; the 0794 0795 former creates a bold index while the latter does not. Bold indices are typically only used for 0796 vectors of indices. 0797 0798 Horizontal spaces Use \quad in displayed equations to separate formulas from text or domains 0799 from the actual equations etc. A greater separator is \qquad. 0800 0801 Use align and alignat For one-column displayed equations, one can use amsmath’s align environ- 0802 ment for both one-line or (possibly aligned) multi-line displayed equations. This has the slight 0803 0804 disadvantage of creating vertical space between the last line of text before the environment 0805 independently of how much this line is filled (furthermore, \qedhere is not correctly put when a 0806 0807 proof ends with an align environment). One can use amsmath’s equation environment instead, 0808 however, only if the displayed equation only has one line. For multi-column multi-line displayed 0809 equations, one can use amsmath’s alignat environment. For more details (including why not to 0810 0811 use variants such as $$..$$), see, e.g., http://tex.stackexchange.com/questions/40492/ 0812 what-are-the-differences-between-align-equation-and-displaymath. 0813 A 0814 A Allow page breaks in displayed equations You can use \allowdisplaybreaks to allow LTEX 0815 to break displayed equations over different pages. But this is only recommended on the very 0816 last iteration of your document preparation process. Ideally, it should not be necessary as it is 0817 0818 often more natural to separate long align environments into two ore more, with some text 0819 in-between. 0820 0821 A The powerful phantom command Use \phantom{...} to properly align follow-up lines of dis- 0822 played equations. For example, 0823 0824 1 \begin{align*} 0825 2 f(x)&=\biggl(\Bigl(\Bigl(\bigl(\bigl(((a_nx+a_{n-1})x+a_{n-2})x+a_{n-3} 0826 3 0827 \bigr)x+a_{n-4}\bigr)x+a_{n-5}\Bigr)\\ 0828 4 &\phantom{{}={}\biggl(\Bigl(}\cdot x+a_{n-6}\Bigr)x+a_{n-7}\biggr)x+\dots. 0829 5 \end{align*} 0830 0831 shows a properly vertically aligned second line: 0832 0833  0834     f(x) = ((anx + an−1)x + an−2)x + an−3 x + an−4 x + an−5 0835 0836  0837  · x + an−6 x + an−7 x + .... 0838 0839 0840 Note that the {} around the equality sign within the \phantom command represents an empty 0841 0842 math object and thus properly replicates the (larger) space around such signs in math mode; 0843 in some situations, only \phantom{={}} might be required. 0844 0845 0846 4.3.3 Figures 0847 0848 Template for including (side-by-side) figures For including two figures side-by-side, one can use 0849 a construction of the following form (for including just one figure, omit \hfill and the obvious 0850 0851 second \includegraphics command). 0852 1 0853 \begin{figure}[htbp] 0854 2 \centering 0855 3 \includegraphics[width=0.48\textwidth]{my_figure_1_without_ending}% 0856 4 \hfill 0857 5 \includegraphics[width=0.48\textwidth]{my_figure_1_without_ending}% 0858

13

© Marius Hofert, Ulf Schepsmeier 5 R

0859 6 \caption{Plot of \dots\ (left) and \dots\ (right).} 0860 0861 7 \label{fig:label} 0862 8 \end{figure} 0863 0864 0865 4.3.4 Miscellaneous 0866 0867 Short versions of commands \ldots can often be replaced by \dots, for example, X_1,\dots,X 0868 0869 _d correctly produces X1,...,Xd. Also, use \le and \ge instead of \leq and \geq, respectively. 0870 0871 Easier to read letter l Use \ell (`) instead of l (l) for the log-likelihood. 0872 Emphasize Emphasize text using the LAT X command \emph, not \textit. Do not use \ 0873 E 0874 underline. 0875 0876 0877 5 R 0878 0879 0880 R, see www.r-project.org/about.html, is a free software environment for statistical computing 0881 and graphics. This combination of focus on statistics and providing graphics is one of the 0882 many strengths of R. By being open source and providing tools for package development, many 0883 0884 people have contributed to the usage of R for virtually all statistical tasks by providing packages. 0885 Furthermore, new research results in the statistical community are often published together with 0886 new or further improved R packages. 0887 0888 This part of our guidelines covers statistical software development in R. By software development 0889 we do not mean writing R packages, but rather code snippets or scripts (.R files) in “good shape”, 0890 which could be served as a basis for packages or which could be sent to package maintainers 0891 0892 (without them getting headaches and nightmares from looking at your code). Many of the points 0893 addressed are also valid for other programming or script languages like C, C++ or MATLAB. 0894 0895 The general goal of this chapter is to help you to write code which is easy to read, efficient, not 0896 too bad to be distributed and reproducible. 0897 0898 0899 5.1 Getting started 0900 0901 Introduction We assume the reader to be familiar with basic syntax and usage of R. For an 0902 introduction, see, for example, Venables et al. (2012). For R packages and other material 0903 0904 around R, see http://cran.r-project.org/. 0905 Help There are nowadays many mailing lists, forums, blogs, etc. available for obtaining help on how 0906 0907 to use R. For general R related questions, https://stat.ethz.ch/mailman/listinfo/r-help 0908 is one of the major mailing lists. Also, http://stackoverflow.com/ with tags for R provides 0909 0910 a good contact point with useful answers typically within a short period of time. For more 0911 specific questions such as platform-dependent or topic-dependent, see the special mailing 0912 lists on http://www.r-project.org/mail.html, such as https://stat.ethz.ch/mailman/ 0913 0914 options/r-sig-hpc/ for high performance computing. Furthermore, see http://www.rseek. 0915 org/ for searching R related sites, help files, manuals, mailing list archives etc. 0916 0917 Installing packages There are various ways to install R packages, the most common are: 0918 0919 from CRAN install.packages("myPkg") installs the package myPkg from the Comprehen- 0920 sive R Archive Network (CRAN); see http://cran.r-project.org/. This is the most 0921 0922 typical way to install R packages. Note that "myPkg" can also be a vector of packages, so 0923 c("myPkg1", "myPkg1"). 0924

14

© Marius Hofert, Ulf Schepsmeier 5 R

0925 from R-Forge install.packages("myPkg", repos="http://R-Forge.R-project.org") in- 0926 0927 stalls the package myPkg from R-Forge, a central platform for the development of R packages, 0928 R-related software and further projects; see https://r-forge.r-project.org/. If a pack- 0929 age is developed on R-Forge, then the latest version is available there (uploads to CRAN are 0930 0931 typically only made every once in a while). This means that if you ask a package maintainer 0932 for a change in a package (which is developed on R-Forge; many packages are), you most 0933 0934 likely have to install the package from R-Forge to get the desired change. 0935 A from .tar.gz install.packages("~/my/folder/myPkg.tar.gz", repos=NULL) installs a pack- 0936 0937 age available as a .tar.gz file. This is source code. Windows or Mac need pre-compiled 0938 code. How to produce pre-compiled code from source see for example http://www-m4.ma. 0939 tum.de/en/teaching/theses/r-package-manual/. 0940 0941 A from GitHub The command install_github("myPkg") from the package devtools installs 0942 0943 packages from GitHub; see https://github.com/. 0944 0945 Installed packages can be updated with update.packages(ask=FALSE, checkBuilt=TRUE) 0946 and removed with remove.packages("myPkg"). 0947 0948 0949 5.2 Documentation 0950 0951 5.2.1 Citing R and R packages 0952 0953 Many volunteers have invested a lot of time and effort in creating R and R packages, please cite 0954 R and the packages you use for data analysis. Use the command to cite R or R 0955 citation() 0956 packages. To cite R itself, citation() provides a plain text references and a BibTEX entry. For 0957 R packages, use citation("pkgname"), where pkgname is the name of the R package to be cited. 0958 0959 For example 0960 2 require(VineCopula) 0961 0962 3 citation("VineCopula") 0963 gives 0964 0965 0966 Ulf Schepsmeier, Jakob Stoeber, Eike Christian Brechmann and Benedikt 0967 Graeler (2013). VineCopula: Statistical inference of vine copulas. R 0968 0969 package version 1.2-1. 0970 0971 A BibTeX entry for LaTeX users is 0972 0973 @Manual{, 0974 0975 title = {VineCopula: Statistical inference of vine copulas}, 0976 author = {Ulf Schepsmeier and Jakob Stoeber and Eike Christian Brechmann and 0977 Benedikt Graeler}, 0978 year = {2013}, 0979 note = {R package version 1.2-1}, 0980 0981 } 0982 0983 Here is an example with a list of entries: 0984 2 require(copula) 0985 0986 3 (ci ← citation("copula")) 0987 4 ci[1] # including BibTeX entry; see also toBibtex(ci) 0988 gives 0989 0990

15

© Marius Hofert, Ulf Schepsmeier 5 R

0991 0992 To cite the R package copula in publications use: 0993 0994 0995 Marius Hofert, Ivan Kojadinovic, Martin Maechler and Jun Yan (2013). 0996 copula: Multivariate Dependence with Copulas. R package version 0997 0.999-8. URL http://CRAN.R-project.org/package=copula 0998 0999 Jun Yan (2007). Enjoy the Joy of Copulas: With a Package copula. 1000 1001 Journal of Statistical Software, 21(4), 1-21. 1002 URLhttp://www.jstatsoft.org/v21/i04/. 1003 1004 Ivan Kojadinovic, Jun Yan (2010). Modeling Multivariate Distributions 1005 with Continuous Margins Using the copula R Package. Journal of 1006 1007 Statistical Software, 34(9), 1-20. URL 1008 http://www.jstatsoft.org/v34/i09/. 1009 1010 Marius Hofert, Martin Maechler (2011). Nested Archimedean Copulas 1011 Meet R: The nacopula Package. Journal of Statistical Software, 39(9), 1012 1013 1-20. URL http://www.jstatsoft.org/v39/i09/. 1014 1015 and 1016 Marius Hofert, Ivan Kojadinovic, Martin Maechler and Jun Yan (2013). 1017 copula: Multivariate Dependence with Copulas. R package version 1018 1019 0.999-8. URL http://CRAN.R-project.org/package=copula 1020 1021 A BibTeX entry for LaTeX users is 1022 1023 @Manual{, 1024 1025 title = {copula: Multivariate Dependence with Copulas}, 1026 author = {Marius Hofert and Ivan Kojadinovic and Martin Maechler and Jun Yan}, 1027 year = {2013}, 1028 note = {R package version 0.999-8.}, 1029 url = {http://CRAN.R-project.org/package=copula}, 1030 1031 } 1032 1033 1034 5.2.2 Run time information 1035 1036 In many statistical projects one compares different methods, models, algorithms or just different 1037 variations of the former. Beside statistical measures often run time informations are given. 1038 1039 Whenever you state run times of your algorithm name the software, e.g. R and R-packages, and 1040 the machine you used for your calculations. State all necessary information for a possible rerun. 1041 Also do not forget to give the time unit, usually seconds (short sec). Here an example form 1042 1043 Schepsmeier (2013): 1044 1 1045 In all of the forthcoming simulation studies we used $B=2500$ replications and the 1046 number of observations were chosen to be $n=500, n=750, n=1000$ or $n=2000$. 1047 2 As model dimension we chose $d=5$ and $d=8$ and the critical level $\alpha$ is $0.05$. 1048 3 As before all calculations are performed using the statistical software \R\ and the \R- 1049 package \textbf{VineCopula} of \cite{VineCopula}. 1050 4 1051 1052 5 ... 1053 6 1054 7 Of cause the computation time for the different proposed GOF tests is also a point of 1055 interest for practical applications. 1056

16

© Marius Hofert, Ulf Schepsmeier 5 R

1057 8 Therefore, in Table \ref{tab:Summary} the computation times in seconds for the 1058 1059 different methods run on a Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz computer for $n 1060 =1000$ are given alongside with a summary of our findings. 1061 1062 1063 5.2.3 Code documentation 1064 1065 The documentation of your code is one of the most important tasks in the software development. 1066 1067 It enables other users, maintainers or your supervisor to follow your ideas of coding and allow for 1068 easy application. Even you self will profit from a proper documentation. 1069 There are two ways to document a code - in the coding itself and externally in extra files. 1070 1071 While the first one is absolute necessary the second one is optional and depends on the scale of 1072 the project and the demands of your supervisor. External files are usually needed in R-packages 1073 and are more extensive in their description, giving for example additional explanations on the 1074 1075 statistics and simple application examples. 1076 1077 1078 1079 Internal documentation 1080 1081 Comments Writing comments (as explanations, for example, or to point out the mathematical 1082 calculations behind the scenes) is good. In R, the # symbol can be used to start a comment. 1083 The following example code shows the usage of comments (forget about the meaning of the 1084 1085 other parts, just look for the comments). 1086 2 ### fast rejection algorithm,R version ################################# 1087 3 1088 1089 4 ##’ Samplea vector of random variates St~\tilde{S}(alpha, 1, 1090 5 ##’(cos(alpha*pi/2)*V_0)^{1/alpha},V_0*I_{alpha= 1}, 1091 6 ##’h*I_{alpha 6= 1}; 1) withLS transform 1092 7 ##’ exp(-V_0((h+t)^alpha-h^alpha)) with the fast rejection 1093 8 1094 ##’ algorithm; see Nolan’s book for the parametrization 1095 9 ##’ 1096 10 ##’ @title Sampling an exponentially tilted stable distribution 1097 11 ##’ @param alpha parameter in (0,1] 1098 12 ##’ @param V0 vector of random variates 1099 13 1100 ##’ @paramh non-negative real number 1101 14 ##’ @return vector of variates St 1102 15 ##’ @author Marius Hofert, Martin Maechler 1103 16 retstableR ← function(alpha, V0, h=1) { 1104 17 stopifnot(is.numeric(alpha), length(alpha) == 1, 1105 18 ≤ ≤ 1106 0 alpha, alpha 1) # alpha>1 => cos(pi/2*alpha)<0 1107 19 n ← length(V0) 1108 20 ## case alpha ==1 1109 21 if(alpha == 1 || n == 0) return(V0) # alpha ==1 => point mass at V0 1110 22 ## else alpha 6= 1 => call fast rejection algorithm with optimalm 1111 23 ← 1112 m m.opt.retst(V0) 1113 24 mapply(retstablerej, m=m, V0=V0, alpha=alpha) 1114 25 } 1115 1116 Note the difference between inline comments (comments for a statement in a single line; starting 1117 with #), comments addressing several lines of code (on a new line right before the corresponding 1118 1119 chunk, starting with ##) and comments separating larger parts of code (starting with ###; 1120 typically only used for much larger code chunks or to visually separate different functions or 1121 other bigger parts in an R script). 1122

17

© Marius Hofert, Ulf Schepsmeier 5 R

1123 The comments starting with ##’ are part of a certain way of documenting functions called 1124 1125 Roxygen documentation. One first starts with a short explanation what the function computes. 1126 After a blank line, a one-line title (starting with @title) giving the main purpose of the function 1127 is provided. Then, explanations for all arguments of the function follow (by ), explaining 1128 @param 1129 the types of the corresponding arguments. The return value of the function is given via 1130 @return, followed by the author(s) of the function (@author); additionally, a @note may follow. 1131 1132 Roxygen documentation can directly be converted to a help file for an R package containing 1133 the corresponding function, although help files typically contain much more information (such 1134 as example calls). 1135 1136 1137 external Files (.txt, .Rd, .pdf) 1138 1139 .txt General description of the code or package. Also special dependencies on other packages 1140 or required software tools such as gsl should be explained in such files. Often naming: 1141 1142 Description.txt, README.txt, install.txt 1143 A .Rd Help files in R packages. The coding is adapted from LAT X but is different. 1144 E 1145 A .pdf Manuals generated from the help files or vignettes. 1146 1147 1148 5.3 Programming style 1149 1150 5.3.1 Writing correct code 1151 1152 Ranges of numbers Let n be an integer. It is convenient to write 1:n for the sequence of numbers 1153 1154 from 1 to n. However, use this only if you are absolutely sure that n is greater than or equal to 1. 1155 Often, for n less than 1, one would expect the empty sequence, for example in a for(i in 1:n) 1156 loop. To get this behavior, write for(i in seq_len(n)) instead. 1157 1158 if and else Note that else has to follow the closing brace of an if statement on the same line. 1159 1160 Bad: 1161 1162 2 res ← if(x > 0) { 1163 3 "positive" 1164 4 } 1165 5 else{ 1166 6 1167 "non-positive" 1168 7 } 1169 1170 Good: 1171 1172 2 res ← if(x > 0) { 1173 3 "positive" 1174 4 } else{ 1175 5 "non-positive" 1176 6 } 1177 1178 1179 Let us remark here that if() itself is a function, so we can assign its return value to a variable. 1180 1181 5.3.2 Writing readable code 1182 1183 Even before writing efficient code, it is important to write readable and structured code. This 1184 1185 significantly improves debugging but also avoids making programming errors in the first place. 1186 1187 1188

18

© Marius Hofert, Ulf Schepsmeier 5 R

1189 80 characters rule Note that lines should contain less than or equal to 80 characters, the only 1190 1191 exception being strings, which should not be broken over lines. This is the typical rule for 1192 editors, terminal emulators, printers, debuggers etc. 1193 1194 Assignment operator In variable assignments, use x <- 1 instead of x = 1, except for arguments 1195 in function calls. 1196 1197 A Omit useless code In R statements do not have to end with a semi-colon, so omit it; semi-colons 1198 are only used to separate two statements on the same line, which is rather rarely useful. 1199 1200 Another such example is return() if used in the last line of a function body. It can be omitted; 1201 1202 see Section 5.3.3. 1203 As an exception, write 0.1 instead of .1 (although the 0 can be omitted here). We agreed on 1204 1205 being as close to mathematical notation as possible (which votes in favor of 0.1) and the eye 1206 also can not accidentally read this number as 1. 1207 1208 Variable and function names As mentioned in Section 2.5, variables and functions should have 1209 meaningful names and should be shorter the more often they are needed (the latter applying 1210 1211 at least to variables). 1212 Use dots and lowercase letters for variables (for example, my.variable.name <-) and arguments 1213 1214 of functions (see, for example the arguments of pnorm()). On the contrary, in function names, 1215 dots are typically (only) used to separate the method name from the class name when declaring 1216 S3 methods; see , the default method for the function . In function 1217 plot.default() plot() 1218 names, rather use camel case, so myFunctionName <- function() {}. 1219 1220 Omit superfluous parts, for example, a function lengthOfObject(object) which determines 1221 the length of a given object is preferrably to be called just length, since, when called, length 1222 (object) immediately reveals what the return value is. 1223 1224 Functions returning a boolean can often be named starting with is, for example isPos <- 1225 1226 function(x) x > 0 determines whether a given object x is positive (note that isPos() is 1227 also vectorized!). Do not use names like isNotPos(), since it is not immediately clear what 1228 the doubly-negated !isNotPos() means. 1229 1230 Other useful prefixes are get... or find..., or also n... (for example, nPts() for determining 1231 the number of points). 1232 1233 Specify arguments When calling a function, specify the argument names for all except the first 1234 1235 argument, so use rnorm(10, mean=3, sd=2) instead of rnorm(10, 3, 2). 1236 Indentation Indent your code. If your editor does not provide automatic code indentation for R, 1237 1238 use four spaces per level of indentation as a rule of thumb. Do not use tabulators. 1239 1240 As an example, consider: 1241 2 if(a < 0) { # first level 1242 3 ← 1243 b 1 1244 4 d ← 2 1245 5 } else{ 1246 6 if(e == 0) { # second level 1247 7 b ← 0 1248 8 ← 1249 d 3 1250 9 } else{ 1251 10 b ← 2 1252 11 d ← 2 1253 12 } 1254

19

© Marius Hofert, Ulf Schepsmeier 5 R

1255 13 } 1256 1257 14 bd ← c(b, d) 1258 1259 Besides the indentation, there are various interesting points here. if() is indeed also a function. 1260 This feature can be used to write much more readable code (in complicated nested if-statements). 1261 Our output here are two values, b and d. By putting them in one vector, each if/else statement 1262 1263 has only one return value (namely the vector containing the values of b and d). For one-line 1264 functions one can omit the curly braces {}, thus we can save space. Furthermore, we can 1265 directly assigning the return value of if() to a variable. Taking these tricks together, we obtain 1266 1267 the following compact form or our code (note that we now only have a single line and the 1268 human eye directly detects the assignment at the beginning, knowing that this is the quantity 1269 1270 which is defined here – which is not obvious by looking at the above version, where each line 1271 A starts with if() or else() but which does not directly reveal which quantities are defined). 1272 1273 2 bd ← if(a < 0) c(1, 2) else if(e == 0) c(0, 3) else c(2, 2) 1274 1275 But note that you should use one-line statements only for easy statements such as here, where 1276 the terms involved are not too sophisticated and thus the risk of not immediately understanding 1277 1278 the statement is minimal. 1279 Finally, let us remark that, normally, the opening brace of a function is put at the end of the 1280 1281 line containing the function head (see the above example with if() involving braces). For 1282 longer function bodies, the opening brace can also be put on a new line in the same column as 1283 the closing brace after the function body, so that one can more easily identify code blocks. 1284 1285 Vertical alignment of code Assignments which belong semantically together can be vertically 1286 1287 aligned as follows: 1288 1 n ← 100 1289 1290 2 set.seed(271) 1291 3 stnrm ← lapply(1:10, function(i) rnorm(i)) 1292 4 t4 ← lapply(1:10, function(i) rt(i, df=4)) # vertically align the’ ←’ 1293 1294 Spaces around operators Place one space before and after binary operators (<-, +, -, etc.) and 1295 1296 after a comma. As an exception, for arguments of functions, use a space only to separate the 1297 arguments, not for the assignment. 1298 1299 Bad: 1300 1301 2 x←1 1302 3 optimize(function(x) x*x,interval = c(1, 10),maximum = TRUE) 1303 1304 Good: 1305 1306 2 x ← 1 1307 3 optimize(function(x) x*x, interval=c(1, 10), maximum=TRUE) 1308 1309 1310 5.3.3 Writing safe, fast, flexible and sophisticated functions 1311 1312 Write (auxiliary) functions R is about writing functions. Everything is an object or a function 1313 1314 in R. If a sufficiently large block of lines appears several times in your code, outsource it, 1315 write a function which calculates this block of code. If, at some point, you are required to 1316 improve/optimize that part of the code, you do not have to do it several times. Furthermore, 1317 1318 having a separate “black box” in terms of a function improves readability of the code. Moreover, 1319 the function can be tested individually for correctness or debugged in case of an error. 1320

20

© Marius Hofert, Ulf Schepsmeier 5 R

1321 Check user input for errors At least when writing larger, important functions with several argu- 1322 1323 ments, check user input for validity (rule of thumb: the larger the run time of the function, 1324 the more time can be spend on checking inputs). This can often be done with the functions 1325 or which may return useful error information to the user in case an 1326 stopifnot() stop() 1327 argument is not valid. Some books suggest here to imagine the most incompetent user, but 1328 it is hardly possible to check all possible user inputs for correctness. Try to catch the most 1329 1330 important ones which may lead to a crashes or wrong return values of your function. 1331 Warnings Use warning() if an input is correct but not very meaningful (or if a function is 1332 1333 deprecated etc.). Warnings can also be useful for accompanying outputs of functions, for 1334 example if an algorithm does not converge but stops after a maximum number of iteration 1335 steps it is advisable to inform the user (which could, of course, also be done by returning a 1336 1337 corresponding status). 1338 1339 Note that warnings are not errors! A warning is nothing bad! 1340 NA, NaN, Inf and other limiting cases R functions typically deal well with “limiting cases” such 1341 1342 as NA, NaN, Inf etc. If possible, make sure that the functions you write can also correctly 1343 handle such limiting cases. To find NAs do not use x == NA but is.na(x). 1344 1345 Never (ever) truncate the return value of a function at an arbitrary large (hard-coded) value 1346 such as 10100, just that the function returns something finite. This is never good enough for all 1347 1348 inputs. Rather think about why the return value is so large, does it naturally tend to infinity 1349 for certain inputs, can we find a good approximation near the limit, or can we implement a 1350 proper logarithm (not just the logarithmic value of the function – all accuracy would be lost!) 1351 1352 to be able to do calculations on a more moderate scale? 1353 1354 A Parallelize, Matricize If you (easily) can, write functions which do not only operate on single 1355 numbers, but also on vectors or matrices. This way, such functions will typically be much faster 1356 when called, for example, with a vector instead of calling it for each element of the vector. 1357 1358 Avoid for, while and repeat loops if possible If possible, avoid loops (for, while, repeat), as 1359 they are comparably slow in R. The main point is that they are very rarely needed (one example 1360 1361 being a rejection algorithm, but if speed really matters one can also implement it in C). Rather 1362 work with the functions sapply(), lapply(), vapply(), apply(), mapply(), or tapply(). It 1363 1364 is by no means possible to explain all of them here, but here is a basic example how a for loop 1365 can be avoided: 1366 1367 Bad: 1368 2 ← 1369 x numeric(100) 1370 3 for(i in 1:100) x[i] ← i*i # first creating and then filling an object 1371 1372 Good: 1373 1374 2 x ← sapply(1:100, function(i) i*i) # directly create the correct object 1375 3 x ← unlist(lapply(1:100, function(i) i*i)) # typically faster 1376 4 x ← vapply(1:100, function(i) i*i, NA_real_) # typically fastest 1377 1378 Return all (useful) results Computations are expensive, do not throw away other calculated 1379 results just because you are not interested in them at the moment. In contrast, return all 1380 1381 (useful) results/quantities computed, especially if the computations are time consuming. 1382 1383 For example, if your function computes an optimum via optim(), do not just return the 1384 optimum, return all related computed values (for example, the convergence status). Note that 1385 optim() itself follows this paradigm. You never know if you or a user may need it. 1386

21

© Marius Hofert, Ulf Schepsmeier 5 R

1387 Bad: 1388 1389 2 myOpt ← function(x) 1390 3 optim(c(-1.2, 1), fn=function(z) x * (z[2]-z[1]^2)^2 + (1-z[1])^2)$par 1391 1392 Good: 1393 1394 2 myOpt ← function(x) 1395 3 optim(c(-1.2, 1), fn=function(z) x * (z[2]-z[1]^2)^2 + (1-z[1])^2) 1396 1397 As another example, if you are interested in the behavior of an estimator, do not just return 1398 1399 the average over all estimators computed in your simulation study, return the estimators 1400 themselves (and actually even more results such as whether there were warnings or errors 1401 during the computation, run time etc.; see Hofert and Mächler (2014)). This allows you to 1402 1403 look at histograms, box plots etc., which provides much more information than just the mean. 1404 1405 Arguments with defaults We can make the function myOpt() above more flexible. 1406 2 myOpt ← function(x, init=c(-1.2, 1)) 1407 3 optim(init, fn=function(z) x * (z[2]-z[1]^2)^2 + (1-z[1])^2) 1408 1409 1410 Instead of using the vector c(-1.2, 1) “hard-coded” in the function body, we make it an 1411 additional parameter, with default value c(-1.2, 1). This way, myOpt() can be called with 1412 one argument x as before, but one has the flexibility of using a different vector init if required. 1413 1414 Note that it is sometimes advisable not to give a formal argument a default value. This way, a 1415 user is forced to think about a reasonable value. 1416 1417 Note that arguments with default values should come after arguments without default values. 1418 Ellipsis argument Sometimes, you want to write a function, such as myOpt() above, which calls 1419 A 1420 another function, such as optim() above, which itself has many arguments. One way of passing 1421 arguments from myOpt() to optim() is via the ellipsis argument .... Incorporating it in the 1422 1423 above example, we obtain: 1424 2 myOpt ← function(x, init=c(-1.2, 1), ...) 1425 3 1426 optim(init, fn=function(z) x * (z[2]-z[1]^2)^2 + (1-z[1])^2, ...) 1427 1428 We can now call myOpt() with, for example, method="BFGS", which is then passed to optim(), 1429 so myOpt(100, method="BFGS") computes the optimum with the "BFGS". 1430 1431 Return value A function should always return an object of the same type, no matter what the 1432 input is (so, for example, do not write a function which returns either a character string or a 1433 number depending on the input). 1434 1435 Furthermore, separate computations from graphics. A function should either compute some 1436 values or create/plot a graphic, do not plot something and also return the plotted values in the 1437 1438 same function. Note that a function which produces a base graphic (such as obtained with 1439 plot()) typically has invisible() as last line, that is, return value. 1440 1441 The last statement/line of a function is used as return value by default, you do not have to 1442 write return(res), just res will return the result res already. return() is typically only used 1443 1444 when a function should be exited before the last line. 1445 1446 1447 5.3.4 Learn from others, learn from the masters 1448 1449 As we already announced in Section 2.2 search for already existing solutions for your problem. 1450 For R-programming this means look for R-packages providing your wanted functionality. Use it! 1451 Use their knowledge, use their programming skills and learn from it. 1452

22

© Marius Hofert, Ulf Schepsmeier 5 R

1453 Learn from others, learn from the masters and read the source code of packages. This is the 1454 1455 major advantage of open source. The code is publicly available. Therefore, download the source 1456 package from CRAN, R-forge or github. Only in the source package (.tar.gz) the functions are 1457 not pre-compiled. Read the code, R, C, C++ and/or Fortran, and find your specific function. 1458 1459 Read the function(s) closely and reveal the hidden functionality or special treatment of critical 1460 input parameters. Experiment with the function(s), run the auxiliary functions separately and 1461 1462 most important learn form it! Learn how to structure functions, learn how to write code, learn 1463 how others provide solutions. 1464 Furthermore, you can reveal what the software does exactly. And you can change some code to 1465 1466 implement a modification of an algorithm or fix a bug. For further reading, see Ligges (2006). 1467 Note: The R source of an R-package (in source state) is inside /R/*.R, and not what 1468 you get when you display the function in R (by typing its name). 1469 1470 1471 5.3.5 Test your code 1472 1473 Write (small) testing examples to verify your code. Write examples for each sub-function and 1474 1475 each auxiliary function. Test different input parameters, test critical parameters, test false user 1476 input. See treatment of missing values and the section about error handling and warnings. 1477 For advanced users: Create your own package and use the auto-testing functionality of R (via 1478 1479 R CMD check). An R-package can be created and checked easily in the R-environment software 1480 RStudio or by hand, see R Development Core Team (2006). 1481 There are also some R-packages available for unit testing, for example , . 1482 RUnit testthat 1483 1484 5.3.6 Specific hints 1485 1486 Do not grow objects Define an object in advance and specify its length/dimension. Do not let 1487 1488 an object grow if it is not necessary. For example, if you can not avoid a for loop, replace 1489 1490 1 rmat ← NULL 1491 2 for( i in 1:n) { 1492 3 rmat ← rbind(rmat, some.further.computation.depending.on.i) 1493 4 } 1494 1495 1496 by 1497 1 rmat ← matrix(0., n, k) 1498 1499 2 for( i in 1:n) { 1500 3 rmat[i, ] ← some.further.computation.depending.on.i 1501 4 } 1502 1503 Use TRUE and FALSE, not ’T’ and ’F’ It can cause conflicts. For example if you have a variable 1504 1505 T, which has the value 0 (T <- 0) (e.g. a test statistic) and call a function by fct.a(data 1506 =data, log=T) instead of fct.a(data=data, log=TRUE). Since T is zero the function will 1507 return a result with log=FALSE (0 ˆ= FALSE, 1 ˆ= TRUE). 1508 1509 1510 5.4 Tables and graphics 1511 1512 Tables and graphics are a topic of their own and we could already fill a whole script with guidelines 1513 1514 and tips with them. Both are used in manuscripts like theses or scientific papers to illustrate 1515 results and findings. In what follows we collect some important points to consider when creating 1516 1517 tables or graphics in R. 1518

23

© Marius Hofert, Ulf Schepsmeier 5 R

1519 Graphics instead of tables Instead of tables containing a myriad of numbers, use plots to graph- 1520 1521 ically display your results; see also Hofert and Mächler (2014). The reason is that the human 1522 eye is not able to compare more than two or three numbers at a time. Graphics typically 1523 reveal much more information about the underlying laws and make it easier to see “structure”. 1524 1525 In most cases they even allow to save space in comparison to tables. Also, when preparing a 1526 presentation, graphics are much easier to grasp within a short period of time than tables (we 1527 1528 have all seen slides with a myriad of numbers on them, accompanied with the speaker saying 1529 “...as it becomes clear from these results”, switching to the next slide before one even has a 1530 chance to look at more than three numbers!). 1531 1532 It is clear that a table is significantly easier to create, but a graphic has another advantage. A 1533 plot has the potential to reveal numerical problems as well, something often overlooked when 1534 1535 the results are only displayed in tables. 1536 1537 Rounding If you still decide for presenting your results in a table, carefully think about how 1538 to display the numbers, for example, how many digits to use. Usually few digits are enough. 1539 Using too many digits gives the impression of high precision which might not be adequate 1540 1541 because of numerical issues, see Wainer (1993). 1542 1543 If we wish to report results with two digits we need the standard error of this estimated 1544 proposition to be ≤ 0.005 (1.96 × 0.005 ≈ 0.01). Thus the standard error of a reported result 1545 should be reported too, so a reader can gauge the accuracy. 1546 1547 Also, at least in every column, use the same number of digits and align the numbers according 1548 to their decimal point. Important results can be highlighted, for example in bold. To export 1549 A A 1550 tables from R to LTEX one can use the R package xtable, which creates LTEX code from R; 1551 see also Hofert and Mächler (2014). 1552 1553 Detecting and distinguishing different lines/points In graphics, do not use too light colors, as 1554 they are typically difficult to detect, especially in presentations. Ideally, create a graphic in 1555 1556 such a way that the results are still readable when printed in black/white. Also, use colors 1557 suitable for color-blind people. 1558 1559 Other options to distinguish different lines (or points) are by using different line types (see 1560 lty), line widths (see lwd), or different symbols (see pch). Since each illustration should be 1561 1562 self-explanatory, use legends and, of course, label axes, sometimes also a title or sub-title can 1563 be useful (if not, a suitable caption can be given in LATEX to place such pieces of information). 1564 1565 Label sizes Use sufficiently large plot axis/legend labels or titles. By default, they are often too 1566 small in R. When plotting to a .pdf file, this can typically be solved by choosing a smaller default 1567 width and/or height parameter when opening the PDF device via pdf(...); for example, pdf 1568 1569 (..., width=6, height=6) for square plotting regions and pdf(..., width=10, height=6) 1570 for rectangular ones. 1571 1572 A Cropping white space around figures Crop the white margins of PDF files before putting them 1573 into your .tex document. This is important for their correct alignment in LAT X and such that 1574 E 1575 no space is wasted and a large area is covered with the graphic of interest. 1576 The function of the R package may help in cropping white space. 1577 dev.off.pdf() simsalapar 1578 It is based on the Unix tool pdfcrop. You can also use the latter manually in the shell or use 1579 any other program of your choice to crop white space around the margins of a plot (note that 1580 1581 there are also settings in R to do that, but one can not crop the white space perfectly/maximally, 1582 at least not with a trial-and-error procedure for each plot individually). 1583 1584

24

© Marius Hofert, Ulf Schepsmeier 6 Version control

1585 A Base, lattice, ggplot2, grid graphics Besides base graphics, there are some other options, in- 1586 1587 cluding lattice graphics (based on the R package lattice), ggplot2 graphics (based on the R 1588 packages ggplot2), or, the more low-level grid graphics engine for designing your own graphics 1589 functions or modifying existing ones. We recommend to work with base graphics and, for very 1590 1591 special purposes (such as three-dimensional wire-frame or cloud plots), use lattice. 1592 1593 1594 6 Version control 1595 1596 1597 A When working on the same project, it is necessary to exchange and “merge” individual contributions 1598 at certain points in time. One approach would be to exchange the corresponding files by email. 1599 This seems fine in projects where only two people are involved, as long as not both work on the 1600 1601 same parts of the file simultaneously. To make sure that one does not accidentally take over an 1602 outdated part of the file, both participants have to use a “diff tool” and, every time they receive a 1603 1604 file, compare it to their “local version” (to find the “difference” so-to-speak) before taking over 1605 newly added parts. Possible overlaps have to be fixed by hand. Needless to say, even if only two 1606 people are involved and files are not exchanged very often, this is tedious and prone to errors; 1607 1608 especially if the project involves several files. Furthermore, it would be advisable to keep older 1609 versions of the files in case a result has been erroneously deleted previously, for example. But 1610 how do you know in which of the backup files the latest version of the result resides? There are 1611 1612 endless of such problems and thus people have automatized these procedures. Below we briefly 1613 mention some approaches of version control systems, in increasing sophistication. 1614 1615 1616 6.1 Dropbox 1617 1618 Dropbox is not a version control system as you may still face the problem of tripping over changes 1619 of other authors. We only mention it here since it is more sophisticated than “sending around 1620 1621 files”, simplifies the above process (at least provides versions of files for up to 30 days) and does 1622 not have a learning curve as steep as for the tools below. Note that one can also combine Dropbox 1623 with the tools described below, but we omit further details in this introduction here. 1624 1625 1626 6.2 SVN 1627 1628 Apache Subversion (SVN) is a widely used version control system. It is a server-based version 1629 1630 control system meaning that there is a copy of your collection of files, typically a folder, the 1631 so-called repository, stored on a remote server (the SVN server). There are free SVN servers 1632 1633 available, they often come at the price of projects being either public(ly available) or private which 1634 is fine in most cases; an example of a free private hosting service (also allowing Git access, see 1635 Section 6.3 below) is https://cloudforge.com/, for example. 1636 1637 Once the server has been set up, one can checkout the repository, which creates a local working 1638 copy. One can then add files to the working copy, commit changes to the (remote) repository, 1639 update ones working copy in case a project member has committed changes to the repository, 1640 1641 display a log of changes to the repository (including commit messages of the various file versions), 1642 display the status of a file (is the file under version control or not) and diff erences between 1643 the working copy and the version in the repository on the server. In what follows, we describe 1644 1645 these processes with some basic example commands often used (GUIs are widely found on the 1646 internet, the workflow remains the same). For more information use svn help ...; for example, 1647 1648 svn help commit for the commit command. 1649 1650

25

© Marius Hofert, Ulf Schepsmeier 6 Version control

1651 How to set up an SVN server or how to start an SVN project on hosting services like cloudForge 1652 1653 are explained in several places on the internet. Hence, we do not explain this step in more details 1654 below (also not for Git, see Section 6.3). In the following we suggest that there is already an SVN 1655 server running or an SVN project set up. Besides online resources, local IT services can typically 1656 1657 also be contacted for assistance. 1658 1659 1660 6.2.1 Checkout 1661 1662 In the beginning of the project, you want to checkout the (remote) repository, so that your local 1663 working copy is created. The working copy is stored in a subfolder .svn of your current working 1664 1665 directory. Note that unless files were manually uploaded to the server or if the repository has 1666 already been set up by a colleague, the repository (and thus the working copy) is empty. All we 1667 need for the checkout is a URL to the SVN server or project (you will be asked for your user 1668 1669 credentials as have been provided previously for setting up the repository on the server): 1670 1 svn co # check out project (create working copy from the repository) 1671 1672 As example URLs, see 1673 1674 1 svn://svn.r-forge.r-project.org/svnroot/nacopula/ 1675 2 svn://r-forge.r-project.org/svnroot/vinecopula/ 1676 1677 which are the SVN URLs for the R packages copula and VineCopula, respectively, hosted by 1678 1679 R-Forge. 1680 After we have checked out the repository (created the working copy) we can add files to the 1681 working copy (see below) or modify existing files which are already under version control (for 1682 1683 example, if they have been downloaded during the checkout process in case the repository was 1684 non-empty). 1685 1686 Clearly, the same repository can be checked out by many people (including one person on 1687 several devices, for example), which makes this tool especially interesting when a larger number 1688 of people collaborate. Furthermore, the following variants can also be useful: 1689 1690 1 svn co output_Dir_name # check out the project into a certain directory 1691 2 svn co -r110 # check out version 110 of the repository 1692 1693 If you use a GUI (like SVN Torquise on Windows) these steps, as well as all the following 1694 1695 functions, can be done via easy-to-handle menus. As mentioned before, the workflow remains the 1696 same. 1697 1698 1699 6.2.2 Add and (re)move 1700 1701 Adding a file (or folder) “foo” to the working copy can be done as follows. One first copies or 1702 moves the file to the local directory (the directory containing the working copy .svn) and then 1703 1704 adds it to the working copy (thus putting it under version control) via: 1705 1 1706 svn add foo # add directory or a file ’foo’ to working copy 1707 It is important to note that although “foo” is now under version control, the remote repository 1708 1709 (on the server) does not see “foo” yet and so your collaborators do not see “foo” yet either. For 1710 this you have to commit your changes, see Section 6.2.3, essentially propagating your changes to 1711 1712 the repository. Furthermore, let us remark that only important files should be added, for example, 1713 only add .tex files, but not the corresponding .pdf or the myriad of files generated on the way. 1714 Similar to adding files, (re)moving a file or folder from the working copy can be done via: 1715 1716

26

© Marius Hofert, Ulf Schepsmeier 6 Version control

1717 1718 1 svn mv foo bar # moves ’foo’ to ’bar’ 1719 2 svn rm foo # removes ’foo’ 1720 1721 Again, do not forget to commit afterwards. 1722 1723 1724 6.2.3 Update, commit 1725 1726 Once you have made changes to your local files, you want to commit these changes to the repository 1727 1728 on the server so that your collaborators can see these changes after they update their working 1729 copies with the server’s version. Before committing your changes, you should conduct an update 1730 of your working copy yourself so that you see changes your collaborators have committed in the 1731 1732 meanwhile (and so that you can solve possible conflicts, see Section 6.2.5); note that this update 1733 only updates your working copy but does not overwrite your local files (or file changes). You can 1734 1735 (and should) thus safely do an update before committing, even if you have already changed files. 1736 An update and commit can be done as follows: 1737 1738 1 svn up # update working copy (does not overwrite local changes) 1739 2 svn ci -m ’added file ‘foo‘ and fixed an error’ # use -m for commit messages! 1740 1741 Use short commit messages so that your collaborators can get an idea about what you changed 1742 when looking at the log file, see Section 6.2.4. 1743 1744 Note that when removing a file not using svn rm, svn up will draw the repository’s version 1745 of the file, so it will appear again. To remove it and propagate the changes, you should do 1746 svn rm foo followed by svn ci -m ’removed ‘foo‘’. 1747 1748 1749 6.2.4 Log, status, list, diff 1750 1751 After an update of your working copy, you can display the log message to get a feeling for changes 1752 1753 submitted by your collaborators (if they provided commit messages, which they should). This 1754 can be done, for example, as follows: 1755 1756 1 svn log | head -20 # displays the first 20 lines of the log message 1757 1758 If one would like to get an overview about the status of certain local files, one can use the 1759 following command: 1760 1761 1 svn st # display the status of files 1762 1763 The first columns of the output contain one character abbreviations like “A” (for “added”), “C” 1764 (for “conflicted”), “D” (for “deleted”), “M” (for “modified”), “?” (for items not under version 1765 1766 control) and “!” (for missing items, for example, if a file was removed by a non-svn command); see 1767 svn help status for more details. svn st can also be helpful to detect which files have recently 1768 been modified in case one would like to commit only some of them, for example. 1769 1770 To see which files are under version control at all, use: 1771 1 1772 svn list 1773 1774 The command svn diff can be used to display changes in (differences between) the local files 1775 and the working copy. Some possible usages are: 1776 1777 1 svn diff # display diff between local files and working copy 1778 2 svn diff -r 110:117 # diff of (older) version 110 to (newer) 117 1779 3 svn diff -r 110 foo # diff of (older) version 110 to working copy of ‘‘foo’’ 1780 1781 1782

27

© Marius Hofert, Ulf Schepsmeier 7 Submitting a paper

1783 6.2.5 Conflicts 1784 1785 It may happen that you commit a change to the file “foo”, but one of your collaborators has 1786 1787 already committed changes of “foo” to the repository. This is in general no problem if the two 1788 sets of changes do not overlap. SVN will then merge the file changes automatically. However, if 1789 1790 the changes overlap, your commit will fail. You should then use svn up (which you should have 1791 used before committing anyways). SVN will then display that a conflict has been discovered and 1792 offers the possibilities to postpone the conflict to be resolved later (“p”), make a full diff, that is, 1793 1794 to show all changes made to the merged file (“df”), edit the merged file (“e”) and to show further 1795 options (“s”). “df” reveals how much work it is to solve the conflict, if it is only a minor one, one 1796 can use “e”, for larger ones one would typically use “p” and fix the conflict by hand. Using “p” 1797 1798 creates the additional files “foo.mine” (with your version), “foo.r” (the original file you 1799 worked with; “” denotes the version number), “foo.r” (current version of your 1800 colleague), whereas the original file “foo” contains modifications of the following type: 1801 1802 1 <<<<<<< .mine 1803 2 here is what you wrote 1804 1805 3 ======1806 4 here is what your colleague wrote 1807 5 >>>>>>> .r 1808 1809 For each of these chunks, just decide which of the versions you would like to keep and delete 1810 the other (including “«««<” and “======” etc.). After that, tell SVN that the conflict 1811 1812 has been resolved via svn resolve foo; this will automatically delete the additional files and 1813 you can then update and commit again. There are other possibilities as well, for example, 1814 svn resolve --accept=theirs-full foo would directly solve the conflict by always accepting 1815 1816 your collaborator’s version. 1817 1818 1819 6.3 Git 1820 1821 Git has recently become a popular version control system. In contrast to SVN, Git is a distributed 1822 control system meaning that your local working copy is also a full repository (with its own local 1823 1824 history and branch structure) and you can commit to it. This implies that there are possibly 1825 two repositories involved, your local one and, possibly (but not necessarily), one on the server. 1826 This decentralized version control system has the advantage that if you are offline for some hours, 1827 1828 say, but still want to revert back to an older version you can do this (via your local repository). 1829 However, having two repositories involved adds another level of complexity, for example, there is 1830 not only one but two update commands, one that updates your local repository from the remote 1831 1832 repository (git fetch) and one that updates the local repository from the server and merges the 1833 changes into your actual files (git pull, essentially a git fetch followed by git merge). Git is 1834 faster than SVN and more sophisticated when it comes to branching, merging and solving conflicts. 1835 1836 An interesting discussion on strengths and weaknesses of Git and SVN can be found under http:// 1837 stackoverflow.com/questions/871/why-is-git-better-than-subversion. We omit further 1838 1839 details here. 1840 1841 1842 7 Submitting a paper 1843 1844 A Publishing your research should be the goal and final step of your scientific work. In this chapter 1845 1846 we list some points we think are useful to know when submitting a paper; they may also serve as 1847 a checklist before the submission. Here we follow Davidian (2005) who provides a good outline 1848

28

© Marius Hofert, Ulf Schepsmeier 7 Submitting a paper

1849 in her set of publicly available slides. We also updated some suggestions and added personal 1850 1851 practical experience. 1852 In the following we assume that the manuscript has already been (mostly) written and is ready 1853 for being handed in to a journal or conference proceedings. Furthermore, we assume that you 1854 1855 followed the journal’s guidelines concerning style and structure of a paper. The 3-step procedure 1856 we describe in the next subsections is the same for all paper submissions independent of the 1857 1858 journal or the conference. 1859 1860 1861 7.1 Purpose of journals: How to find the best fitting journal for my research 1862 1863 The very first and important step to publish one’s research is to find the best fitting journal. 1864 There are hundreds of different (statistical) journals available having different objectives, focus 1865 and scope. Major journals such as the Journal of American Statistical Association – Theory and 1866 1867 Methods or Annals of Statistics cover several objects and scopes. Other journals specialize in a 1868 major topic, e.g., Statistics in Medicine or Journal of Agricultural, Biological, and Environmental 1869 Statistics, or in one or two scopes like Computational Statistics and Data Analysis or Journal of 1870 1871 Computational and Graphical Statistics. 1872 Classify and rank your own work according to the points below. This will narrow down the 1873 amount of adequate journals and may assist you in finding the best fitting journal(s). 1874 1875 Objective and focus Journals can be distinguished by their major objective or focus of statistics 1876 1877 via the area of application (e.g., medicine), the method (e.g., time series analysis) or the goal 1878 (e.g., dependence modeling). 1879 1880 Scope The objective or focus of a journal is obvious in most cases whereas the scope is often not. 1881 Each journal propagates its own goals and scopes they want to cover. According to Davidian 1882 1883 (2005), there are six different scopes in statistics; below we added two more. 1884 1885 New theory Situation in which suitable methods are not available – propose such a method. 1886 1887 New theory based on existing work Methods are available, but have limitations – extend, 1888 improve, relax assumptions. 1889 1890 New theory based on existing work II Methods are available – propose a competing one and 1891 compare and illustrate. 1892 1893 Application or problem driven An important subject-matter application has specific issues – 1894 show how to handle these with existing or modified methods. 1895 1896 Extensions Properties of existing or new procedures are unknown – work out formal theory. 1897 1898 Simulation studies Properties of existing or new procedures are unknown – carry out extensive 1899 simulations. 1900 1901 Surveys Overview of existing methods and comparison. Usually a relative new method is 1902 added, too. 1903 1904 Manuals and Vignettes Software descriptions, manuals, vignettes and code snippets for public 1905 1906 available statistical functions, packages and software. They can reach from presenting 1907 theoretical methods and algorithms to details about implementation or numerical challenges. 1908 1909 Usually a journal’s scope is a mixture of several points. Classify your own work and check if it 1910 fits in the journal’s scope. In the cover letter, one should also mention how the paper fits in 1911 1912 the journal’s scope. 1913 1914

29

© Marius Hofert, Ulf Schepsmeier 7 Submitting a paper

1915 Audience The objective and scope of a journal naturally implies the corresponding audience. The 1916 1917 audience can be academics, graduate students, practicing statisticians, researchers of other 1918 disciplines etc. Identify your audience. Who would be most interested in your work? What 1919 journals tend to be read by researchers in the corresponding area? 1920 1921 Journal rankings intend (but often also fail) to reflect the journal’s impact within its field and 1922 1923 beyond. Rankings are also an indicator for the quality of a journal, but do not have to be (if you 1924 have found a journal which seems “right” for your research, go for it, do not listen to rankings – 1925 only use them to compare adequate journals which you are unsure about otherwise). Rankings 1926 1927 are facilitated by analyzing citations of scientific (and non-scientific) publications. The most 1928 common measure is the impact factor which reflects the average number of citations to articles 1929 published in science and social science journals4. But there are several other measures, too. 1930 1931 Practical hints 1932 1933 1) Have a look at the journals in your reference list. Usually they are of the same field of 1934 statistics using similar methods or have similar areas of application. 1935 1936 2) Examine papers published in a recent issue of the journal for style, level of technical detail 1937 or discussed topics. 1938 1939 3) Have a look at the list of Editors and Associate Editors of the journal. One of them will 1940 1941 handle your paper. You may get an idea who it could be by looking at their research 1942 interests. Maybe you know her/him and, as an exception in case you are absolutely unsure, 1943 can address her/him personally. 1944 1945 1946 7.2 Preparations before submission 1947 1948 Before you can submit the manuscript to a journal or conference proceedings you have to accurately 1949 1950 prepare your submission. False styles or missing material such as figures or separately listed 1951 tables may be a reason for a rejection. Thus invest a good quantity of time to prepare your 1952 submission. Also watch out for typographical errors, they certainly do not contribute to leave a 1953 1954 good impression about the quality of your work. Ideally, have a colleague (native speaker) read 1955 the paper and give you feedback. 1956 1957 First of all, read the journal’s guidelines for authors and carefully follow these instructions! 1958 The journal typically lists all necessary properties and style requirements, conventions on math, 1959 figures, tables, font size, font style, spacing, etc. For (statistical) journals, the most important 1960 1961 points are: 1962 Scope As mentioned above, check whether your manuscript fits into the scope of the journal. 1963 1964 Length Some journals limit the number of pages. Important is here the number of pages in the 1965 journals style format with correct page boundaries, spacing, font size, etc. Do not go beyond 1966 1967 the limits, rather stay well below; also, a shorter paper typically takes less time to review. 1968 1969 Style of article (including references) Most journals already demand a paper submission in the 1970 journal’s publishing style. Therefore, they offer LATEX style files defining page boundaries, 1971 spacing, font size, etc. or Microsoft Word samples and guidelines. 1972 1973 Authors and authors affiliation Watch out for double-blinded submission requirements (neither 1974 the author knows the name of the reviewer (as usual) nor the reviewer knows the names of the 1975 1976 authors). 1977 1978 4Wikipedia contributors, "Journal ranking," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/ 1979 index.php?title=Journal_ranking&oldid=608376667 (accessed June 18, 2014). 1980

30

© Marius Hofert, Ulf Schepsmeier 7 Submitting a paper

1981 Figures and tables Usually not all figure formats are accepted. Check which ones are preferred 1982 1983 by the journal. There are several open source software tools to convert one format into 1984 another (e.g., ps2pdf or epstopdf). Make sure that all graphics have a high enough resolution. 1985 Sometimes it is requested to put all figures and tables separately at the end of the manuscript. 1986 1987 Furthermore, each figure or table should have a reasonable caption. A graph or table including 1988 the caption should be self-explaining without requiring to read the article’s text. Therefore 1989 1990 captions and legends have to be accurate. 1991 The abstract is a concise summary of what you will present in the paper and provides the key 1992 1993 findings (not just the conclusions). Formulas and citations to other works should be avoided in 1994 an abstract. Some journals even prohibit formulas and citations in the abstract. The golden 1995 thread should become clear without reference to the rest of the manuscript. The abstract 1996 1997 should be no more than 200–250 words (depending on the journal) and is typically written in 1998 passive. 1999 2000 Keywords Immediately after the abstract, provide a maximum of 6 keywords. Be sparing with 2001 abbreviations: only abbreviations firmly established in the field may be eligible. These keywords 2002 2003 will be used for indexing purposes. 2004 2005 Highlights are a short collection of bullet points that convey the core findings of the article. 2006 Usually highlights are used for online publications on the journal’s web page to highlight the 2007 article’s main findings. 2008 2009 File naming Some journals require to follow special file naming conventions for additional material 2010 such as figures (to work in an automated compilation process). 2011 2012 2013 7.3 Submitting a paper 2014 2015 Nowadays most journals offer (and prefer to use) a submission management system. On submission 2016 2017 of a paper, one has to provide several details such as the work address, title, abstract, keywords, 2018 and finally one has to upload the manuscript (figures, etc.) and a cover letter. Follow it step by 2019 step. 2020 2021 Check for completeness Check all your answers in the forms, check for completeness, check if 2022 you followed the journal instructions for authors, check the cover letter (see below). Modern 2023 2024 submission management systems create a final .pdf containing all the documents handed in. 2025 Have a last look at the file before the final submission. If it is required to submit a .tex file, 2026 2027 it will be converted to .pdf. Check it carefully! Are all figures included? All mathematical 2028 symbols displayed correctly? Are references shown properly? Check on the formating, page 2029 style and spacing. 2030 2031 Cover letter Enclose a short cover letter making clear your intention (submission of a paper for 2032 publication) and note any conflict of interest (such as sponsored work). The cover letter is 2033 2034 either uploaded to the manuscript management system or sent in an accompanying email to 2035 the editor, most often the former. 2036 2037 Congratulation! You submitted your hard work offering your research to the community and 2038 the public audience. Now it is the time to wait for feedback from the journal (typically, reports 2039 2040 are sent by one to three reviewers and sometimes also the Associate Editor). This can take a 2041 frustrating amount of time, though (up to several years). In the meanwhile, most journals offer 2042 the possibility (check the journal’s policy) to put the paper either on the authors’ websites or 2043 2044 on special publication servers such as http://arxiv.org. This has the advantage of making 2045 the results available right away and also avoids the paper to be rejected by a reviewer who then 2046

31

© Marius Hofert, Ulf Schepsmeier References

2047 publishes the results on his own (before your paper is published); normally, the reviewer is an 2048 2049 expert in the same area and thus there is the potential of this situation to happen. Unfortunately, 2050 the review process in its current form bears risks of this (and other) type. 2051 2052 2053 Acknowledgements 2054 2055 We would like to thank our students and the many people which have sent us code in the past for 2056 inspiring this guide. Also, we would like to thank the following friends and colleagues for feedback 2057 2058 on these guidelines: Matthias Kirchner (ETH Zürich), Aleksey Min (Technische Universität 2059 München), Matthias Scherer (Technische Universität München). 2060 2061 2062 2063 References 2064 2065 Davidian, M. (2005), Academic publication: Journals and the journal editorial process, Department 2066 of Statistics, North Carolina State University, http://www4.stat.ncsu.edu/~davidian/ 2067 2068 st810a/journals_handout.pdf. 2069 Halmos, P. (1970), How to write mathematics, Enseign. Math. 16(2), 123–152. 2070 2071 Higham, N. (1993), Handbook of Writing for the Mathematical Sciences, SIAM. 2072 Hofert, M. and Mächler, M. (2014), Parallel and other simulations in R made easy: An end-to-end 2073 study. 2074 2075 Ligges, U. (2006), Help Desk: Accessing the sources, R News, 6(4), 43–45, http://CRAN.R- 2076 project.org/doc/Rnews/Rnews_2006-4.pdf. 2077 Oetiker, T., Partl, H., Hyna, I., and Schlegl, E. (2011), The Not So Short Introduction to 2078 A 2079 LTEX 2ε, http://mirror.switch.ch/ftp/mirror/tex/info/lshort/english/lshort.pdf 2080 (2014-01-26). 2081 R Development Core Team (2006), Writing R Extentions, R Foundation for Statistical Computing, 2082 2083 Vienna, Austria, http://www.R-project.org. 2084 Ritter, R. (2002), The Oxford Guide to Style, Oxford University Press. 2085 Schepsmeier, U. (2013), Efficient goodness-of-fit tests in multi-dimensional vine copula models, 2086 2087 submitted for publication, http://arxiv.org/abs/1309.5808. 2088 Venables, W. N., Smith, D. M., and R Core Team (2012), An Introduction to R, http://cran.r- 2089 2090 project.org/ (2013-02-03). 2091 Wainer, H. (1993), Visual Revelations: Tabular Presentation, Chance, 6(3), 52–56. 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112

32

© Marius Hofert, Ulf Schepsmeier