Compiling and installing software, manually or automatically; file comparison

J. M. P. Alves

Laboratory of Genomics & Bioinformatics in Parasitology, Department of Parasitology, ICB, USP

Software for all

● Unlike in the old days, operating systems now come with a large selection of user applications installed by default

● Those are programs that we use most of the time: office software, web browsers, image editors, music players, games, etc.

● Linux distros include most of the commonly used programs in their default download image

● However, the universe of software is huge, and very diverse; it is not possible to include even a small sample of every kind of application that exists out there

● That is where being able to install other programs comes in handy

J.M.P. Alves, BMP0260 / ICB5765 / IBI5765

Software for all

● Until the 1990s, the Internet was not as available to the general public as it is today, so software had to be distributed or copied on tape or disks (magnetic, like floppy disks, or optical, like CDs), or even printed

[Figure: 5.25”, 3.5”, and 8” floppy disks; labeled capacities: 175 kB to 1.2 MB, 360 or 720 kB, 1.44 MB]

40 years of external storage

By avaragado, at http://flickr.com/photos/89394041@N00/6960433672

The modern era

● With the Web, anyone with an Internet connection became able to find software

● Certain specialized collections of software, called repositories, have been set up and serve as centralized sources

● A classic example for biological software is the IUBio Software section (since 1989!): http://iubio.bio.indiana.edu/software/

● Many of these repositories are dedicated to a certain kind of software, e.g., software for one platform (CRAN for R, CPAN for Perl, etc.)

● One can use a specific tool to get new software from a repository, or alternatively just use a Web browser – it depends on each repository

● Repositories can add safety: the software in them is tested and examined by the maintainers

The Linux contribution

● It is very often the case that a new program that you want to run depends on some other piece(s) of software

● These dependencies may be other independent programs (which might do something that your new program needs) or shared software libraries (which are collections of code that do not operate independently, but just “help” other programs)

● In either case, we can only use our new program if the dependencies are installed

● And the dependencies themselves might (and usually do) depend on other programs or libraries…

● As you’ve probably noticed by now, this can be quite a problem

Dependency hell

● To solve this problem, systems of package management were established, around 1997 to 1998, in different Linux distributions

● A package is a program plus all the other accessory files it needs (such as any data files, man pages, specific libraries, etc.)

● One of the creators of Debian GNU/Linux, Ian Murdock, has said: “What’s the single biggest advancement Linux has brought to the industry? Package Management – or, more specifically, the ability to install and upgrade software over the network in a seamlessly integrated fashion”

● Nowadays, everyone knows something similar to Linux repositories and package management: the app stores

● Managing a package means: install, remove, update – the package itself and all of its dependencies

Managing packages

● Package management also ensures that all versions of the dependencies involved are compatible

● The program in a package is compiled and ready to run

● In the early days of Linux (remember, the kernel was written in 1991), two distros came up with their own way of packaging programs and then managing the resulting packages

● Those pioneers (still dominant to this day) were:

● RedHat, 1997, with the RPM system (RedHat Package Manager)

● Debian, 1998, with the DEB system

● Debian (which is the system installed on the remote server) is the direct ancestor of Ubuntu, one of the most used distros nowadays, so we will focus on package management using the DEB system

Taking a look at the GUI

● Every big Linux distro now has its own repository, with many gigabytes of all kinds of software; for example, see:

Debian packages: https://www.debian.org/distrib/packages

Ubuntu packages: https://packages.ubuntu.com/focal/

● On a desktop Ubuntu machine, you can use the Ubuntu Software program

● This program is the default way, in Ubuntu, for exploring the repository and managing packages – I myself only use it when the system is brand new, to install Synaptic (a much better package manager in my opinion… but that depends on taste)

● In Ubuntu Software you can see what is installed, as well as what is available for installation, which is a mouse click away

● You cannot install anything without an administrator password

Back to the command-line

● Everything that can be done in the graphical apps can also be done in the nurturing and friendly environment of the CLI, of course

● The Debian family of distros all use the DEB system

● The most popular CLI program for DEB package management is the APT system

● The system maintains a list of installed packages, as well as those available in the repositories registered (which can be several)

● The first thing to do is to update this information:

apt-get update

● This command will download all the latest software lists from the repositories

Searching for packages

● There are tens of thousands of packages out there, so searching is an important task:

apt-cache search regex

● The regular expression can be just simple words (e.g., likelihood), as always, and it will search the package names and descriptions

● Now that you have found some packages that look like they might be what you are looking for, it’s time to ask for more information:

apt-cache show ncbi-blast+

apt-cache showpkg ncbi-blast+

● These two commands will show basic information about the package, as well as more detailed information such as dependencies
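The search-then-inspect workflow above can be sketched as a small script. This is only an illustration: it assumes a Debian-family system, reuses the search term and package name from these slides (which may not exist in every repository), and just prints a notice where apt-cache is absent.

```shell
#!/bin/sh
# Sketch: search the package lists, then inspect one candidate.
# Assumption: the term and package name are the ones used in the slides.
term="likelihood"
pkg="ncbi-blast+"

if command -v apt-cache >/dev/null 2>&1; then
    # Names and short descriptions matching the regular expression
    apt-cache search "$term" | head -n 5
    # Full record: version, description, dependencies
    apt-cache show "$pkg" || echo "no record for $pkg in the local lists"
    # Only the dependency list
    apt-cache depends "$pkg" || true
    apt_status="queried"
else
    echo "apt-cache not found: this is not a Debian-family system"
    apt_status="unavailable"
fi
echo "apt-cache status: $apt_status"
```

None of these queries needs root privileges, since they only read the locally cached package lists.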

Installing packages

● Having decided that a package does what you need, now it is time to install it (as root or, in Ubuntu, using sudo if you have admin powers)

apt-get install raxml

sudo apt-get install raxml (sudo is not installed by default in Debian)

● The first command will fail if you are not root

● The second command uses a program, sudo, which allows one to run any command as another user (by default, root), using one’s own password

● Debian is very security-conscious, so it does not have sudo installed by default; Ubuntu, on the other hand, ships with the root account locked by default, so administrative commands go through sudo
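Before installing for real, it can help to preview what would happen. A minimal sketch, assuming a Debian-family system and the package name from the slides: the -s (--simulate) option of apt-get prints the planned actions without changing anything, so it can be run without root privileges.

```shell
#!/bin/sh
# Sketch: preview an installation without changing the system.
# -s (--simulate) only prints what apt-get would do.
pkg="raxml"   # package name taken from the slides

if command -v apt-get >/dev/null 2>&1; then
    apt-get -s install "$pkg" \
        || echo "simulation failed: is $pkg in the package lists?"
    sim_status="attempted"
else
    echo "apt-get not found: this is not a Debian-family system"
    sim_status="unavailable"
fi
echo "result: $sim_status"
```

The simulated run lists the dependencies that would be pulled in along with the package itself.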

Upgrading packages

● Another reason why package management is very useful is that it deals with updates as well

apt-get upgrade raxml

apt-get upgrade

● As seen, one can give the name of the package to be updated, or just ask for all updates (second command)

● If the package that is installed has no newer version available, the user is told so; e.g.:

muscle is already the newest version (1:3.8.31+dfsg-1).

● If one tries to upgrade a package that was not installed, apt-get tries to install it, after confirmation from the user:

The following NEW packages will be installed:

Removing packages

● Maybe that package whose description looked good does not really do what you wanted…

● Time to remove it:

apt-get remove raxml

apt-get purge raxml

● The second option is more drastic: it removes any specific configuration files as well as the program itself

● Leaving the configuration files could be useful in the future, especially if the system was using tailored configurations, different from the default

Out of repository?

● Although the repositories have a lot of stuff, they are far from having everything

● Also, it is often the case that the distro’s repository has a rather old version of the program; that is a problem especially for Debian, a distro that cares a lot about software security and stability

● However, package files can be created by anyone, not just repository maintainers

● Many projects distribute DEB and/or RPM files so you can install them directly; e.g.:

http://donate.libreoffice.org/pt/dl/deb-x86_64/5.3.3/pt-BR/LibreOffice_5.3.3_Linux_x86-64_deb.tar.gz

Out of repository?

● The greatest disadvantage of directly installing a package file is that there won’t be any automatic dependency resolution

● Nonetheless, the system will still check if dependencies are all available and are the correct versions – and refuse to install if not

● A more basic, “low-level” tool is used for directly installing packages:

dpkg -i *.deb

● This command will install (upgrading, if there is an older version present) all DEB files present in the current directory

● The dpkg program is used by APT tools, by the way, so when you are using apt-get, you are indirectly using the less friendly dpkg

What else?

● As you saw, installing programs in Linux is a breeze when they are in the distro’s repository

● When they are not, but there is an available package, things are still quite easy as long as there are no dependency issues: the program is already compiled and in an easy-to-install form

● What if there is neither?

Installing an executable

● There are now two main possible situations

● The program is compiled for your kind of computer

● The program is not compiled for your kind of computer

● What do I mean by “kind of computer”?

● Most importantly, two things:

● The operating system (which is Linux, no problem there)

● The CPU architecture (e.g., x86 or ARM), including its word size, i.e., the number of bits in a register

● Nowadays, nearly all desktop and server computers have 64-bit CPUs, but the OS is sometimes built as 32-bit (for compatibility reasons, although that is less and less common)

Installing an executable

● To check the architecture of your machine, use the arch command

● For 64-bit Intel/AMD machines, it will show x86_64

● For 32-bit ones (rare nowadays), it will show, usually, i386 or i686

● Whether it will be easy or hard to install the program in question… will depend on the writer of the program

● Some programs come with an installation script to place each file in its proper place

● Other programs don’t, but they have helpful documentation (e.g. a README or INSTALL file) with instructions

● Let’s try it! On the remote server…
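The architecture check described above can be sketched as a script. It uses uname -m, which prints the same machine name as arch; the list of names in the case statement is an illustrative, incomplete assumption rather than an authoritative table.

```shell
#!/bin/sh
# Sketch: report the machine name and a rough word size.
# uname -m prints the same machine name that arch does.
machine=$(uname -m)

case "$machine" in
    x86_64|aarch64|arm64|ppc64le|s390x|riscv64)
        bits=64 ;;
    i386|i486|i586|i686|armv6l|armv7l)
        bits=32 ;;
    *)
        bits=unknown ;;   # not in this (incomplete) list
esac

echo "machine: $machine, word size: $bits"
```

A downloaded executable should match the machine name reported here; an x86_64 binary, for instance, will not run on a 32-bit i686 system.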

Now you do it!

Go to the course site and enter Practical Exercise 34

Follow the instructions to answer the questions in the exercise

Remember: in a PE, you should do things in practice before answering the question!

No executable available?

● No, no reason for panic

● If no compiled executable is present, then we will have to compile the program ourselves

First, a memory refresher

There are 10 kinds* of programming languages: interpreted and compiled

* if you don’t remember the (bad) joke, try the following arithmetic expansion: echo $((2#10)) (see pg. 482-483 of The Linux Command Line book for more)

Compiled vs. interpreted

● It is common to divide computer languages into two kinds: interpreted or compiled (although the distinction can be somewhat arbitrary and fuzzy…)

● Compiling is the act of transforming source code into (usually) machine code

● A language like C++, for example, is compiled: one cannot run the program right after writing the source code file; it is first necessary to compile the program, and then run the generated executable file

● A language like Python, on the other hand, is interpreted: one runs the text file with the source code directly, without first having to generate a compiled file

● Whether you will need to compile your new program or not depends then on which language was used in its writing

Compilation example

● Last week we learned the basics of an interpreted language: the shell

● Let’s now look at some compiled examples

● On the remote server, copy the file /data/hello.c to your home area

● Since this is (probably) your first time seeing a C program’s source code, it follows the ancient tradition of the “hello, world” program

● Take a look at the contents, if you wish

Compilation example

● Then make the file executable and run it, the same way we did last week with shell scripts: ./hello.c

● What happened? An error happened!

● C is a compiled language: the source code must be transformed into machine language – that is, compiled – before it can be run

Compilation example

● The program that compiles is called a compiler

● To create the compiled file, you need the compiler to be installed

● To run the compiled file, you do not need the compiler to be installed!

● The main C compiler in Linux is gcc (originally for GNU C Compiler; the name now stands for GNU Compiler Collection)

● Let’s compile our little program: gcc hello.c

● What happened?

● Run the file created (a.out); did it greet the world as expected?

● The a.out default name is ugly; use the -o option (lowercase o) to specify an output name for your compiled program
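The compile-and-run steps above, including the -o option, can be sketched as follows. The hello.c written by this script is an assumption made for illustration; it is not necessarily identical to /data/hello.c on the course server.

```shell
#!/bin/sh
# Sketch: compile a minimal C program and name the executable with -o.
# Assumption: this hello.c is written here only for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > hello.c <<'EOF'
#include <stdio.h>

int main(void)
{
    printf("hello, world\n");
    return 0;
}
EOF

if command -v gcc >/dev/null 2>&1; then
    gcc -o hello hello.c      # produces ./hello instead of ./a.out
    greeting=$(./hello)
    echo "$greeting"
else
    echo "gcc not found: install a compiler first"
fi
```

Once ./hello exists, it can be copied to a machine of the same kind and run there without gcc being installed.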

Compilation example

● Our little program is as simple as it can get, so running gcc directly was easily done

● However, useful programs will be much larger and will comprise a large number of files that need to be processed to create the one or more executable files intended

● To illustrate the next level of complexity, copy the following archive to your home area: /data/diction-1.11.tar.gz

● Unpack the archive and enter the newly created directory

● The organization you see is the one used by the GNU projects, which is followed by many authors

Information

● Some of the files you see, the ones written in all uppercase letters, are information that can be very useful, sometimes essential

● COPYING or LICENSE

● INSTALL

● NEWS

● README

● The INSTALL and README files are usually the most important of these

● They contain general information about the program (e.g., what it is for, how to use it, who wrote it, how to cite it, dependencies, etc.) and compiling (also called building) and installation instructions

● Even if you already know how to perform these tasks, it is always a good idea to take at least a quick look at those files

Now you do it!

Go to the course site and enter Practical Exercise 35

Follow the instructions to answer the questions in the exercise

Remember: in a PE, you should do things in practice before answering the question!

Let’s compile

● Use the commands you learned during the practical exercise to configure, compile, and install (to your home area) the diction-1.11 program and its accessory files

● From inside the diction-1.11 directory, run:

./configure --prefix=/home/dummy

make

make install

● Of course, your prefix will be different, using your user name

● How many new executable files were generated, and where did they end up being installed?

Another case

● Some programs are not as well documented or do not make your life easy when you are compiling them

● We will revisit the SOAPdenovo program, but this time we will pretend there were no compiled executable files for our kind of computer

● Copy the file /data/SOAPdenovo2-src-r240.tgz to your home area

● Unpack the archive and enter the newly created directory

● Read the INSTALL file for instructions

Another case

● It all looks simple enough: they just tell us to run make

● This command is more than just running gcc directly, but less than the process we undertook for diction, with configuration before the making

● So… just do it!

● Uh oh, what happened!?

What now?

Now we're in wild territory

● Compilation might fail for a variety of reasons

● Our good old friend dependency hell is one of them: libraries or software development components (like headers and such) might be completely missing

● Another, related possibility is version incompatibility: the installed version of the compiler or of some library might not work with the software one is trying to install

Now we're in wild territory

● Finally, and the worst of them all, there might be a mistake in the source code, which you will only be able to detect (and maybe correct) if you know that programming language

● A lot of knowledge and experience might be required to find the cause of a compilation failure, and even more to fix it

● Always remember: Google is your friend! Search for the error message

Comparing file content

● It is often useful to see what the differences (or commonalities) between two text files are

● The comm program accepts two sorted files and prints out three columns:

● Lines unique to the first file

● Lines unique to the second file

● Lines common to both files

● comm can be especially useful when one has two long lists of items to compare

● Example: comm -3 file1 file2

Comparing file content

● comm -3 file1 file2

● This will output only the lines that are unique to file1 and to file2, but not those that are shared by both files

● Try it! On the remote server, run: comm -3 /data/file_comA /data/file_comB

● This will print in one column all lines that are unique to file_comA and, in the other, all that are unique to file_comB
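A self-contained sketch of the column-suppressing options may help here. The two lists are made up for this example; the numbers given after the dash name the columns to hide.

```shell
#!/bin/sh
# Sketch: compare two small sorted lists with comm.
# The file names and contents are made up for this example.
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$workdir/list1"
printf '%s\n' banana cherry date  > "$workdir/list2"

# -3 suppresses column 3 (lines common to both files)
only_one_side=$(comm -3 "$workdir/list1" "$workdir/list2")
echo "$only_one_side"

# -12 suppresses columns 1 and 2, leaving only the shared lines
shared=$(comm -12 "$workdir/list1" "$workdir/list2")
echo "$shared"

rm -r "$workdir"
```

In the first output, the line unique to list2 is indented by one tab, which is how comm marks its second column.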

Comparing file content

[Figure: example comm output, columns 1, 2, and 3]

Comparing file content

● While comm is quite useful, it is not very powerful, and not always applicable (e.g., for use on more complex files that are not supposed to be sorted)

● The diff program is used a lot to compare files, especially by programmers (e.g., comparing different source code versions) and system administrators (e.g., comparing configuration files)

● Unlike comm, diff does not need sorted files; in fact, files should not be sorted just in order to run diff!

Comparing file content

● diff is used to create the so-called diff file, which allows us to use another tool (patch) to transform one file into the other

● There are many options to control what should be taken into account when comparing files: how to treat blank space, empty lines, case sensitivity etc.

● diff can compare full directories (and their subdirectories, recursively, if told to), file by file
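The diff-then-patch round trip mentioned above can be sketched as follows. File names and contents are invented for the example, and the patch step is skipped where the patch program is not installed.

```shell
#!/bin/sh
# Sketch: record the changes between two versions, then replay them.
# File names and contents are invented for this example.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'one' 'two' 'three' > old.txt
printf '%s\n' 'one' 'TWO' 'three' 'four' > new.txt

# diff exits with status 1 when the files differ; that is not an error
diff -u old.txt new.txt > changes.diff
echo "diff exit status: $?"

if command -v patch >/dev/null 2>&1; then
    patch old.txt < changes.diff    # transforms old.txt in place
    cmp old.txt new.txt && patched="yes"
else
    echo "patch not found; changes.diff can still be read by eye"
fi
```

Only changes.diff needs to be distributed: anyone holding the old file can rebuild the new one from it.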

Comparing file content

● The main advantages of a diff file are that:

● It is very small compared to the full size of the original files

● It concisely shows the changes between the files, allowing one to quickly evaluate the differences

● diff can list the differences between two files in different ways:

● Default format, the shortest (but hardest to read)

● Context format, shows lines around the changes, for context

● Unified format, eliminates redundancies of the context format

● Let’s try! On the remote server, run: diff /data/Darwin_1 /data/Darwin_2
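To see the three output formats side by side on something tiny, a sketch like the one below can be used. The files are invented for the example, and it assumes GNU diff, where -c selects the context format, -u the unified format, and -i ignores case.

```shell
#!/bin/sh
# Sketch: the same one-line change shown in the three diff formats.
# File names and contents are invented for this example.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'alpha' 'beta' 'gamma' > v1.txt
printf '%s\n' 'alpha' 'BETA' 'gamma' > v2.txt

echo "== default =="
default_out=$(diff v1.txt v2.txt)
echo "$default_out"

echo "== context (-c) =="
diff -c v1.txt v2.txt

echo "== unified (-u) =="
diff -u v1.txt v2.txt

echo "== ignoring case (-i) =="
diff -i v1.txt v2.txt && echo "no differences when case is ignored"
```

The last command shows one of the comparison options in action: with -i the two files are considered identical, so diff prints nothing and exits with status 0.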

Quiz time!

Go to the course page and choose Quiz 32

Now you do it!

Go to the course site and enter Practical Exercise 31

Follow the instructions to answer the questions in the exercise

Remember: in a PE, you should do things in practice before answering the question!

Syncing data

● We have learned how to transfer data to and from remote computers, especially using scp

● While a great tool, scp always copies everything you tell it to

● But there is a better way: transferring only what has been updated since the previous transfer

● The rsync program is a great tool for mirroring and backing up directories

Syncing data

● rsync can sync local and remote files, and its commands take the form:

rsync options source destination

● The source and destination can be local-local, local-remote, or remote-local (but not remote-remote)

Syncing data

● A few rsync options are used frequently:

● -a : archive mode, short-hand for several options – most importantly, recursive copy and keep permissions (see man page)

● -z : compress data on-the-fly for transfer; more useful for transfers over slower networks

● -v : verbose messages, gives details of the sync and a summary when the operation has finished

Syncing data

● As usual, the options can be combined, for example:

rsync -azv local_dir computer_name:remote_dir

● It is customary to create cron jobs to periodically and automatically run rsync commands that synchronize data – between machines or between a machine and external storage, for example

● Let's sync a local directory into your home directory: time rsync -av /data/JC JC

Syncing data

● The time command is not needed! It is here just to measure how long it took to run the command that comes after it

● There was no JC directory in your area; rsync automatically creates a destination directory if it does not exist

● Since this is a transfer within the same computer, the -z option is not necessary; but does it hurt to use it?

● Try it! Now that you know the time that it took to sync the data without compression, either delete the JC directory created in your area or use a different destination (e.g., JC2), and run rsync again, this time with -z

Syncing data

● If you did everything correctly, you should have noticed that using compression took about 10 times longer than not using it

● Compressing the data takes time, so it is only worth doing if you are transferring data to a remote computer, using a slow network connection

● What happens if you now try to sync the data to a directory where you have already saved it previously?

● Do it! In our example: rsync -av /data/JC JC

● As you can see, nothing was copied this time; rsync looks for updated files (by default, comparing file sizes and modification time stamps) and only copies those
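The incremental behavior can also be observed without copying anything, using a dry run. A minimal sketch with invented directory and file names; it relies on the rsync options -n (--dry-run) and -i (--itemize-changes), and is skipped where rsync is not installed.

```shell
#!/bin/sh
# Sketch: sync a directory, then ask rsync what is still pending.
# Directory and file names are invented for this example.
workdir=$(mktemp -d)
mkdir "$workdir/src"
echo "first"  > "$workdir/src/a.txt"
echo "second" > "$workdir/src/b.txt"

if command -v rsync >/dev/null 2>&1; then
    # Trailing slash on the source: copy the CONTENTS of src into dest.
    # Without it, rsync would create dest/src instead.
    rsync -a "$workdir/src/" "$workdir/dest"

    # Dry run right after syncing: lists what would be transferred
    pending_before=$(rsync -ain "$workdir/src/" "$workdir/dest")
    echo "pending after first sync: '$pending_before'"

    # Change one file (its size changes too), then ask again
    echo "changed" > "$workdir/src/a.txt"
    pending_after=$(rsync -ain "$workdir/src/" "$workdir/dest")
    echo "pending after editing a.txt: '$pending_after'"
else
    echo "rsync not found"
fi
```

Note the trailing slash: a command such as rsync -av /data/JC JC, with no slash after the source, places the copy in JC/JC rather than directly in JC.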

Recap

● Software repositories are sites that store large (but sometimes not so large) collections of programs; the well maintained ones are secure

● Repositories can be specialized or, as is the case for distro repos, generalist

● Packages are a way of distributing software in a standardized and easy-to-install manner

● There are different formats of Linux packages, the main ones being the RPM (RedHat family) and the DEB (Debian family, including Ubuntu)

● Package management mainly consists of searching, installing, removing, and upgrading packages

Recap

● Great advantages of package management are that it automatically:

● Solves dependency problems

● Lets you know when software (any installed software) can be updated

● If the package one is interested in is not in the distro's repository, but there is a package available for download, installation is usually easy

● If there is no package, usually the next best alternative is to find a compiled version of the program that suits the OS and hardware

● Failing that, one will probably have to compile the program, which might be anything from easy to impossible without a lot of effort

Recap

● If compilation fails, pay attention to the error messages and try to decipher what they might mean; it could be a missing dependency, an incompatible library, a compiler version that does not support the version of the language in which the program is written… or an error in the source code

● Such error messages can often be quite cryptic and demand experience and programming knowledge to understand

● A little “googling” goes a long way in most cases!

Recap

● Comparing the contents of text files is an important task

● comm compares two previously sorted files and lists which lines are unique or shared

● diff is the main text file comparison tool in the Unix world, and has three output styles: default, context, and unified

● It is used a lot to distribute program source code updates, which can be applied with the patch program

Recap

● rsync is a great tool for incremental transfer of data (i.e., only transfer what has changed), and is thus used a lot for mirroring and backups

● Transfers within the same machine, i.e., disks connected to the same computer, should not use compression (-z): it is much slower in this case!
