
Work / Technology & tools

FIVE REASONS TO LOVE THE COMMAND LINE

The text interface is intimidating, but can save researchers from mundane computing tasks. Just be sure you know what you’re doing. By Jeffrey M. Perkel

Illustration by The Project Twins

Jennifer Johnson’s principal investigator had a simple request. Since 2018, their team had sequenced the DNA of some 1,300 silver fox (Vulpes vulpes) specimens, and the lab wanted to know precisely how many bases it had collected, and how well those bases aligned to the reference genome. Johnson, a bioinformatician at the University of Illinois at Urbana–Champaign, had the necessary data. But they were scattered across her hard drive. The most obvious solution was also the most painful: to open each file, find the required information, close and repeat, 1,300 times. So Johnson took another tack, using the command line.

Prebuilt into macOS and Unix systems, and available on Windows through such tools as the free ‘Windows Subsystem for Linux’ and MobaXterm, the command line (also called the shell) is a powerful text-based interface in which users issue terse instructions to create, find, sort and manipulate files, all without using the mouse. There are actually several distinct (although largely compatible) shell systems, among the most popular of which is Bash, an acronym for the ‘Bourne again shell’ (a reference to the Bourne shell, which it replaced in 1989).

Bash is both a collection of small utilities and a full-blown programming environment, ranging from ‘grep’, a powerful text-search tool, to ‘for loops’, which allow users to repeat actions across multiple files. Johnson directed her terminal to scan her hard disk for sequencing data files, extract the needed information and compile them into a tidy spreadsheet. “That took me less than 10 minutes,” she says; recomputing the data would have taken a full day.

Many computational disciplines, such as bioinformatics, rely heavily on the command line. But all researchers who use computers can benefit from it, says Jeroen Janssens, principal instructor of Data Science Workshops in Rotterdam, the Netherlands, and author of the 2014 book Data Science at the Command Line. “The mouse doesn’t scale,” Janssens explains. For instance, although it is certainly possible

Nature | Vol 590 | 4 February 2021 | 173 ©2021 Springer Nature Limited. All rights reserved.

to rename a file by pointing and clicking, that task becomes tedious when it is scaled to hundreds or thousands of files.

Here, we highlight five ways the command line can ease your computational research.

Wrangle files

Perhaps the shell’s most powerful feature is the ability to repeat simple tasks across multiple files. Researchers could, for instance, systematically rename their files to add a date stamp, or convert them from one format to another (see our example code at https://github.com/jperkel/nature_bash).

During his postdoctoral studies, Casey Greene’s adviser insisted that all images of figures that were used in presentations had a black background. Greene got pretty good, he says, at opening figures in an image editor, inverting and colour-rotating them, and repeating. “But at some point, it turns out life is too short to continue importing and colour-rotating in even a free software program that is relatively easy to use,” says Greene, who now directs the Center for Health Artificial Intelligence at the University of Colorado School of Medicine in Aurora. So, he turned to the command line, specifically the free and open-source image-manipulation tool ImageMagick, using a for loop to repeat the operation across all his files:

    for file in *.png; do convert "$file" -channel RGB -negate -modulate 100,100,200 "out_$file"; done

Handle big data

Some data sets are simply too big to handle. For a project studying digital object identifier (DOI) metadata, Elizabeth Wickes, an information scientist at the University of Illinois at Urbana–Champaign, harvested millions of XML files. But committing those files to a version-control repository overloaded her system: “GitHub Desktop and my system indexing just barfed on it,” she says. Using the ‘git’ command-line tool, however, she tackled the problem in an hour and a half.

Similarly, Lynette Strickland, an evolutionary biologist at Texas A&M University in Corpus Christi, has documented millions of genetic variants for her research on invasive lionfish (Pterois sp.). The data are too large for Microsoft Excel, which caps spreadsheets at about one million rows. So, Strickland used the shell to identify the records (spreadsheet rows) that exceeded a certain quality threshold, extract just the columns she needed, and save the data to a new file, using a command that looks something like this (assuming the quality score is in column 4, the cut-off threshold is 50, and the desired columns are 1–4):

    awk -F, '{ if ($4 > 50) print $0 }' datafile.csv | cut -d, -f1-4 > newdatafile.csv

“By just taking the specific information that I need from it using [the shell], I can really, really condense it into something that I can actually work with and visualize,” she says.

Manipulate spreadsheets

Shell commands perform seemingly simple operations. The ‘cut’ command, for instance, extracts one or more columns from a spreadsheet; ‘wc’ counts words, lines or characters; ‘grep’ filters files for lines that match a certain condition; and ‘sed’ manipulates text ‘streams’. But these simple commands can be strung together using ‘pipe’ (‘|’), a shell feature that funnels the output of one command into another, thus creating powerful bespoke workflows. “It’s very good for what we call ‘whip-it-up-itude’: getting things going quickly, prototyping,” says Tom Ryder, a systems administrator based in Palmerston North, New Zealand, and author of the 2018 book Bash Quick Guide.

The following command, for instance, combines five utilities to count the unique gene names in a gene-expression data set:

    cut -f1 GEOdataset.csv | sed -E 's/^>//' | sort | uniq | wc -l

These steps extract (cut) the first column (containing the name) from the spreadsheet; remove a leading greater-than symbol (sed); sort the list alphabetically; reduce that list to its unique values (uniq); and print the total number of lines (that is, gene names) to the screen (wc).

Parallelize your work

Christina Koch, a research computing facilitator at the University of Wisconsin–Madison, works at a computing centre that provides remote access to some 14,000 nodes and terabytes of memory. Suppose, Koch says, that a bioinformatician has a computational workflow for analysing gene-expression data sets. Each data set takes a day to process on their computer, and the researcher has 60 such data sets. “That’s two months of non-stop running,” she says. But, by sending the job to a computer cluster using the ‘secure shell’ command, ‘ssh’, which opens an encrypted portal to the remote system, the researcher can parallelize the computations across 60 computers. “Instead of two months, it takes one day.”

Even without such power, ‘ssh’ provides the ability to work remotely, an especially useful feature during a pandemic. With COVID-19 lockdowns looming in 2020, Gabriel Devenyi, a systems administrator and programmer at the Douglas Mental Health University Institute in Montreal, Canada, worked to ensure that the researchers in his facility were set up to stay productive, wherever they happened to quarantine. “Without having this in place, none of the students would have been able to do anything,” he says.

Automate

Shell commands can be stored in text files called scripts, which can be saved, shared and version-controlled, enhancing reproducibility. They can also be automated. Using the ‘cron’ command, users can schedule scripts to run when it’s convenient for them. For instance, says Wickes, some websites ask that users who plan to ‘scrape’ content do so during off-peak hours: in the case of the PubMed literature database, between 9 p.m. and 5 a.m. Eastern time. “You can have [the script] run only at the times you are allowed,” she says. Alternatively, users can download data to their primary computer daily, because many shared systems routinely delete older files.

Warning: no undo

The flip side of this power is the shell’s intimidating interface, often just a dollar sign and a cursor. “That’s scary for some people,” says Janssens. The shell provides little in the way of help. And it can be “unforgiving”, Janssens notes. An improperly positioned space can change a command’s meaning, and few commands will by default ask if you know what you’re doing. “If you have the right to do it within the system, [the shell] assumes that you meant to do it the way you said it,” Devenyi says. “And so you can do lots of various dangerous things.”

The classic example is rm -rf *, a command that deletes all files from the current directory and everything below. If executed from the wrong location, crucial work can be lost. “We’ve all done that,” says Johnson. So proceed with caution.

One simple trick is to use the ‘echo’ command to ensure that you’re specifying the files you intend. “‘Echo before execute’ is a very good rule of thumb,” says Greene. Some commands provide a ‘dry-run’ mode, which reports what they intend to do, and/or an ‘interactive’ mode, which prompts the user before making changes. Users can also set variables to prevent the computer from overwriting files, or to halt when there is an error (‘noclobber’ and ‘pipefail’, respectively). And they should avoid running commands while they have administrative privileges.

“Life comes at you fast,” says Wickes. “And science can come at you fast sometimes.” The shell makes it possible to handle the unexpected. Besides, says Koch, it’s fun. “It’s so powerful, and I feel like such a cool nerd when I’m using it,” she says. “You can feel very competent.”

Jeffrey M. Perkel is technology editor at Nature.
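
As an illustration of the habits described under ‘Warning: no undo’, the following is a minimal Bash sketch, not code from the article or its repository. It previews a date-stamp rename with ‘echo’ before running it, uses the interactive ‘mv -i’, and shows the effect of ‘noclobber’ and ‘pipefail’. The file names (sample1.csv, sample2.csv) and the scratch directory are assumptions made up for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical demo: every file name here is invented for illustration.
set -o pipefail             # a pipeline fails if any command in it fails

workdir=$(mktemp -d)        # scratch directory, so no real data are touched
cd "$workdir" || exit 1
touch sample1.csv sample2.csv

stamp=$(date +%Y-%m-%d)     # ISO-style date stamp, year-month-day

# 1. 'Echo before execute': print the commands without running them.
for file in *.csv; do
    echo mv -i -- "$file" "${stamp}_${file}"
done

# 2. If the preview looks right, run the same loop without 'echo'.
#    'mv -i' prompts before replacing a file that already exists.
for file in *.csv; do
    mv -i -- "$file" "${stamp}_${file}"
done

# 3. With 'noclobber' set, '>' refuses to overwrite an existing file.
#    It is set inside a subshell here so the option does not leak out;
#    a real script would more likely set it once at the top.
( set -o noclobber; echo "new data" > "${stamp}_sample1.csv" ) 2>/dev/null \
    || echo "noclobber blocked the overwrite"

ls
```

The glob in each loop is expanded once, before the loop body runs, so files renamed in step 2 are not picked up and renamed a second time.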
