A comparison of three programming languages for a full-fledged next-generation sequencing tool

Costanza, Pascal∗; Herzeel, Charlotte∗; Verachtert, Wilfried
imec, ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium
∗Equal contributor

February 22, 2019

bioRxiv preprint doi: https://doi.org/10.1101/558056; this version posted February 22, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Abstract

Background: elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps, and keeps sequencing data as much as possible in main memory. Similar to other SAM/BAM tools, management of heap memory is a complex task in elPrep, and it became a serious productivity bottleneck in its original implementation language during recent further development of elPrep. We therefore investigated three alternative programming languages: Go and Java using a concurrent, parallel garbage collector on the one hand, and C++17 using reference counting on the other hand for handling large amounts of heap objects. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use.

Results: The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that concurrent, parallel garbage collection is better at managing a large heap of objects than reference counting in our case.

Conclusions: Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.

Background

The sequence alignment/map format (SAM/BAM) [1] is the de facto standard in the bioinformatics community for storing mapped sequencing data. There exists a large body of work on tools for processing SAM/BAM files for analysis [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. The SAMtools [1], Picard [2], and GATK [3] software packages developed by the Broad and Sanger institutes are considered to be reference implementations for many operations on SAM/BAM files, examples of which include sorting reads, marking PCR and optical duplicates, recalibrating base quality scores, indel realignment, and various filtering options, which typically precede variant calling. Many alternative software packages [4, 5, 6, 7, 8, 9, 10, 12, 14, 15] focus on optimizing the computations of these operations, either by providing alternative algorithms, or by using parallelization, distribution, or other optimization techniques specific to their implementation language, which is often C, C++, or Java.

We have developed elPrep [8, 16], an open-source, multi-threaded framework for processing SAM/BAM files in sequencing pipelines, especially designed for optimizing computational performance. It can be used as a drop-in replacement for many operations implemented by SAMtools, Picard, and GATK, while producing identical results [8, 16]. elPrep allows users to specify arbitrary combinations of SAM/BAM operations as a single pipeline in one command line. elPrep's unique software architecture then ensures that running such a pipeline requires only a single pass through the SAM/BAM file, no matter how many operations are specified. The framework takes care of merging and parallelizing the execution of the operations, which significantly speeds up the overall execution of a pipeline.

In contrast, related work focuses on optimizing individual SAM/BAM operations, but we have shown that our approach of merging operations outperforms this strategy [8]. For example, compared to using GATK4, elPrep executes the 4-step Broad Best Practices pipeline [17] (consisting of sorting, marking PCR and optical duplicates, and base quality score recalibration and application) up to 13x faster on whole-exome data, and up to 7.4x faster on whole-genome data, while utilizing fewer compute resources [8].

All SAM/BAM tools have in common that they need to manipulate large amounts of data, as SAM/BAM files easily take up 10-100 GB in compressed form. Some tools implement data structures that spill to disk when reaching a certain threshold on RAM use, but elPrep uses a strategy where data is split upfront into chunks that are processed entirely in memory to avoid repeated file I/O [16]. Our benchmarks show that elPrep's representation of SAM/BAM data is more efficient than, for example, GATK4, as elPrep uses less memory for loading the same number of reads from a SAM/BAM file in memory [8]. However, since elPrep does not provide data structures that spill to disk, elPrep currently requires a fixed minimum amount of RAM to process a whole-exome or whole-genome file, whereas other tools sometimes allow putting a cap on the RAM use by using disk space instead. Nonetheless, for efficiency, it is recommended to use as much RAM as available [8, 18]. This means that, in general, tools for processing SAM/BAM data need to be able to manipulate large amounts of allocated memory.

In most programming languages, there exist more or less similar ways to explicitly or implicitly allocate memory for heap objects which, unlike stack values, are not bound to the lifetimes of function or method invocations. However, programming languages strongly differ in how memory for heap objects is subsequently deallocated. A detailed discussion can be found in "The Garbage Collection Handbook" by Jones, Hosking, and Moss [19]. There are mainly three approaches:

Manual memory management: Memory has to be explicitly deallocated in the program source code (for example by calling free in C [20]).

Garbage collection: Memory is automatically managed by a separate component of the runtime library called the garbage collector. At arbitrary points in time, it traverses the object graph to determine which objects are still directly or indirectly accessible by the running program, and deallocates inaccessible objects. This ensures that object lifetimes do not have to be explicitly modelled, and that pointers can be more freely passed around in a program. Most garbage collector implementations interrupt the running program and only allow it to continue executing after garbage collection (they "stop the world" [19]) and perform object graph traversal using a sequential algorithm. However, advanced implementation techniques, as employed by Java [21] and Go [22], include traversing the object graph concurrently with the running program while limiting its interruption as far as possible; and using a multi-threaded parallel algorithm that significantly speeds up garbage collection on modern multicore processors.

Reference counting: Memory is managed by maintaining a reference count with each heap object. When pointers are assigned to each other, these reference counts are increased or decreased to keep track of how many pointers refer to each object. Whenever a reference count drops to zero, the corresponding object can be deallocated.

elPrep is an open-ended software framework that allows for arbitrary combinations of different functional steps in a pipeline, like duplicate marking, sorting reads, replacing read groups, and so on; additionally, elPrep also accommodates functional steps provided by third-party tool writers. This openness makes it difficult to precisely determine the lifetime of allocated objects during a program run. It is known that manual memory management can contribute to extremely low productivity when developing such software frameworks. See for example the IBM San Francisco project, where a transition from C++ with manual memory management to Java with garbage collection led to an estimated 300% productivity increase [33].

[...] like parallel sorting or concurrent duplicate marking, but also from the overall software architecture that organizes these steps into a single-pass, multi-threaded pipeline. Since such software-architectural aspects are not covered by the existing literature, it therefore became necessary to perform the study described in this article.

elPrep was originally, up to version 2.6, implemented in the Common Lisp programming language [23]. Most existing Common Lisp implementations use stop-the-world, sequential garbage collectors. To achieve good performance, it was therefore necessary to explicitly control how often and when the garbage collector would run to avoid needless interruptions of the main program, especially during parallel phases. As a consequence, we also had to avoid unnecessary memory allocations, and reuse already allocated memory as far as possible, to reduce the number of garbage collector runs. However, our more recent attempts to add more functionality to elPrep (like optical duplicate marking, base quality [...]
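
The reference-counting approach described in the Background can be illustrated with a minimal Go sketch. It is not taken from elPrep: the type Counted and its methods Retain, Release, and Refs are hypothetical names introduced here only to show the bookkeeping. Go itself manages memory with a garbage collector, so the "deallocation" is simulated by a callback that runs when the count reaches zero.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counted is a hypothetical wrapper that pairs a resource with a
// reference count. It only illustrates the bookkeeping; Go itself
// reclaims memory with a garbage collector.
type Counted struct {
	refs   int64  // number of live references, updated atomically
	onFree func() // stands in for deallocation; run when refs hits zero
}

// NewCounted returns an object that starts with one reference.
func NewCounted(onFree func()) *Counted {
	return &Counted{refs: 1, onFree: onFree}
}

// Retain records that one more pointer refers to the object.
func (c *Counted) Retain() *Counted {
	atomic.AddInt64(&c.refs, 1)
	return c
}

// Release drops one reference. It reports true if this call brought
// the count to zero and therefore "deallocated" the object.
func (c *Counted) Release() bool {
	if atomic.AddInt64(&c.refs, -1) == 0 {
		if c.onFree != nil {
			c.onFree()
		}
		return true
	}
	return false
}

// Refs returns the current reference count.
func (c *Counted) Refs() int64 {
	return atomic.LoadInt64(&c.refs)
}

func main() {
	freed := false
	a := NewCounted(func() { freed = true })
	b := a.Retain() // a second pointer now refers to the same object

	fmt.Println(a.Refs(), freed) // 2 false

	first := a.Release()
	fmt.Println(first, freed) // false false

	last := b.Release()
	fmt.Println(last, freed) // true true
}
```

Every pointer copy costs an atomic update of the counter, even when no memory is reclaimed. A tracing garbage collector does not need this per-assignment bookkeeping, because it finds unreachable objects by traversing the object graph.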