Developerworks : Linux : Tip: Sorting Files with Sort and Tsort

Get to know your textutils

Jacek Artymiak ([email protected])
Freelance author and consultant
March 6, 2003

Save time and headaches by using sort and tsort instead of resorting to more complex solutions utilizing Perl or Awk. Jacek Artymiak explains how.

Although it is possible to write advanced sorting applications in Perl or Awk, doing so may not always be necessary, and is often a pain. Most things you'll ever need are equally possible, and a lot easier, with the sort command, which can sort lines in more than one file, merge files, and even check whether sorting them is necessary. You can specify sort keys (portions of lines used for comparisons) or not, in which case sort just compares whole lines.

So, if you want to sort your password file, you could just use the following. (Note that you cannot redirect the output straight to the input file, because the shell truncates that file before sort has read it. That's why you need to send the output to a temporary file and then rename that file to /etc/passwd, as shown below.)

Listing 1. Simple sort

    $ su -
    # sort /etc/passwd > /etc/passwd-new
    # mv /etc/passwd-new /etc/passwd

Should you want to reverse the order of sorting, use the -r option. You can also suppress printing of identical lines with the -u option.

A very practical feature of sort is its ability to sort using field keys.
A field is a string of text separated from other fields with a certain single character. For example, the fields in /etc/passwd are separated with a colon (:). So, if you wanted, you could sort /etc/passwd by the user ID, group ID, comments field, home directory, or shell. To do this, use the -t option followed by the character used as the separator, and the -k option followed by the number of the field where the sort key starts, a comma, and the number of the last field where the key ends. For example, sort -t : -k 5,5 /etc/passwd sorts the password file by the comment field, which is the place where full user names like "John Smith" are stored. But sort -t : -k 3,4 /etc/passwd sorts the same file using both the user ID and the group ID. If you omit the second number, sort will assume that the key starts at the given field and continues to the end of each line. Try this yourself, and observe the differences. (Keys are compared as text by default, so 10 sorts before 9; when numeric sorting looks wrong, add the -n option, or -g in GNU sort.) Also, note that a whitespace transition is the default separator, so if fields are already separated by blank characters, you may omit the -t option altogether. (Notice also that numbering of fields starts with 1.)

More on sort and tsort: to follow along, open the GNU manual's pages on sort operations, or view these options in your man or info pages in a new terminal window by typing man sort or man tsort at the command line.

For even finer control, you can use keys and offsets. Offsets are separated from keys with a dot, as in -k 1.3,5.7, which means that the sort key should start on the third character of the first field, and end at the seventh character of the fifth field (offsets too are numbered from 1). When would you need this? Well, I use it from time to time for sorting Apache logs; the key and offset notation lets me skip the date fields.

Another option to watch out for is -b, which tells sort to ignore leading blank characters (spaces and tabs) and treat the first non-blank character of the field as the start of the sort key.
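The field-key options described above can be tried without touching the real /etc/passwd. The following is a minimal sketch on hypothetical sample data in the same colon-separated layout; the user names, the leading blank in one comment field, and the LC_ALL=C setting (used here only to make the byte-wise ordering reproducible) are all assumptions made for illustration.

```shell
# Hypothetical sample data in /etc/passwd layout; not a real password file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
carol:x:1002:100:Carol Young:/home/carol:/bin/sh
alice:x:100:100:Alice Zane:/home/alice:/bin/bash
bob:x:25:50: Bob Adams:/home/bob:/bin/sh
EOF

# Sort by the comment field (field 5) only, then keep the login names.
by_name=$(LC_ALL=C sort -t : -k 5,5 "$tmp" | cut -d : -f 1)

# Sort by user ID (field 3): first compared as text, then numerically (-n).
uid_text=$(LC_ALL=C sort -t : -k 3,3 "$tmp" | cut -d : -f 3)
uid_num=$(LC_ALL=C sort -t : -k 3,3n "$tmp" | cut -d : -f 3)

# Same comment-field sort, but with -b so leading blanks in the key are skipped.
by_name_b=$(LC_ALL=C sort -t : -b -k 5,5 "$tmp" | cut -d : -f 1)

printf 'by comment:         %s\n' "$(echo $by_name)"
printf 'uid as text:        %s\n' "$(echo $uid_text)"
printf 'uid numeric:        %s\n' "$(echo $uid_num)"
printf 'by comment with -b: %s\n' "$(echo $by_name_b)"

rm -f "$tmp"
```

Compared as text, the user IDs come out as 100, 1002, 25; with -n they come out as 25, 100, 1002. Without -b, the leading blank in bob's comment field sorts ahead of every letter, so bob comes first; the last command adds -b, so that blank is skipped and the names come out as alice, bob, carol.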
Also, if you use the -b option, offsets will be counted from the first non-blank character (useful when the field separator is not a blank character and when the fields may contain strings starting with blank characters).

Further modifications of the sorting algorithm are possible with these options:

    -d    use only letters, digits, and blanks for sort keys
    -f    turn off case recognition and treat lowercase and uppercase characters as identical
    -i    ignore non-printing ASCII characters
    -M    sort lines using three-letter abbreviations of month names: JAN, FEB, MAR, ...
    -n    compare keys numerically: optional leading blanks, an optional minus sign, then digits, optionally with thousands separators and a decimal point

These options, as well as -b and -r, can be appended to the field numbers of a key, in which case they apply to that key only and not globally, as they do when they are used outside key definitions. As an example, consider:

    sort -t: -k 4g,4 -k 3gr,3 /etc/passwd

This will sort the passwd file by group ID and, within groups, by user ID in reverse order.

But that's not all that sort is capable of. It can also resolve ties that happen when the keys you used cannot decide which line comes first. To add hints for resolving ties, add another -k option and follow it with the field and (optional) offset, using the same notation as the one you used for defining keys; for example, sort -k 3.4,4.5 -k 7.3,9.4 /etc/passwd sorts lines using a key that begins at the fourth character of the third field and ends at the fifth character of the fourth field, and uses a second key, running from the third character of the seventh field to the fourth character of the ninth field, to resolve ties.

The last group of options deals with input, output, and temporary files. For example, the -c option, when used in sort -c < file, checks whether the input file is already sorted (you can use other options as well), and if it is not, reports an error. This is handy for making checks before processing large files that may take a long time to sort.
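Per-key options, tie-breaking keys, and the -c check can be sketched together on a small file. The data below is hypothetical, the example uses the POSIX -n modifier where the command above uses GNU's -g, and LC_ALL=C is set only as an assumption to keep the results reproducible.

```shell
# Hypothetical name:password:uid:gid records; not a real password file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
dave:x:1003:100
alice:x:1001:100
bob:x:1002:50
erin:x:1000:50
EOF

# Primary key: group ID (field 4), numeric, ascending.
# Second key: user ID (field 3), numeric and reversed; it only matters
# when two lines have the same group ID.
sorted=$(LC_ALL=C sort -t : -k 4n,4 -k 3nr,3 "$tmp")
printf '%s\n' "$sorted"

# sort -c exits with status 0 when the input is already in the requested
# order and with a non-zero status (plus a message on stderr) when it is not.
LC_ALL=C sort -c -t : -k 4n,4 -k 3nr,3 "$tmp" 2>/dev/null \
    && raw_status=0 || raw_status=$?
printf '%s\n' "$sorted" | LC_ALL=C sort -c -t : -k 4n,4 -k 3nr,3 2>/dev/null \
    && sorted_status=0 || sorted_status=$?

echo "check on raw file:      exit status $raw_status"
echo "check on sorted output: exit status $sorted_status"

rm -f "$tmp"
```

Because -c only reads the input and reports through its exit status, it fits naturally in a shell conditional that decides whether the expensive sort needs to run at all.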
When you use the -u option together with the -c option, it will be interpreted as a request to check that the input is sorted and that no two lines in it are identical. Also important when you are processing large files is the -T option, used to specify an alternative directory for temporary files (they are removed after sort finishes work) instead of the default /tmp.

You can use sort to process more than one file at a time, and there are basically two ways to do it: you can use cat to concatenate them first, as in:

    cat file1 file2 file3 | sort > outfile

Or, you could use this command:

    sort -m file1 file2 file3 > outfile

There is one condition in the second case: each input file must already be sorted before they are all sent to sort -m together. That may look like an unnecessary burden, but in fact it speeds up work and saves precious system resources. Oh, and don't forget the -m option. You can use the -u option here to suppress printing of identical lines.

If you need a more esoteric kind of sort routine, you might want to check out the tsort command, which performs a topological sort on a file. The difference between a topological and standard sort is shown in Listing 2 (you can download happybirthday.txt from Resources).

Listing 2. Difference between topological and standard sort

    $ cat happybirthday.txt
    Happy Birthday to You!
    Happy Birthday to You!
    Happy Birthday Dear Tux!
    Happy Birthday to You!

    $ sort happybirthday.txt
    Happy Birthday Dear Tux!
    Happy Birthday to You!
    Happy Birthday to You!
    Happy Birthday to You!

    $ tsort happybirthday.txt
    Dear
    Happy
    to
    Tux!
    Birthday
    You!

Of course, that isn't a very useful demonstration of what you'd use tsort for -- just an illustration of how different the output of the two commands is.
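The tsort output in Listing 2 makes more sense once you know that tsort reads its input as whitespace-separated pairs of tokens, where each pair "X Y" means that X must come before Y. Here is a minimal sketch with a hypothetical set of morning tasks; the task names are made up for illustration.

```shell
# Each pair "X Y" says: X has to happen before Y.
order=$(tsort <<'EOF'
wake shower
shower dress
dress leave
coffee leave
EOF
)

# Print one task per line, in an order that respects every pair.
printf '%s\n' "$order"
```

Several orderings satisfy these constraints (coffee may appear anywhere before leave), so the exact position of coffee is up to the implementation. What tsort guarantees is that no task is printed before something it depends on.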
tsort is generally used for solving a logic problem in which it's necessary to determine a total order from observed partial orders; for example (from the tsort info page):

    tsort <<EOF
    a b c
    d
    e f
    b c d e
    EOF

will produce the output

    a
    b
    c
    d
    e
    f

Questions or comments? I'd love to hear from you -- send mail to [email protected]. Next time, we'll delve into tr.

Resources

● Download the example file for Listing 2, happybirthday.txt.
● Find even more info on these useful tools in the GNU text utilities manual. (An expanded view of the same TOC lives at MIT, where you can also find this great list of even more useful GNU tools.)
● Windows users can find these tools in the Cygwin package.
● Mac OS X users may want to try Fink, which installs a rich UNIX environment under the sleek new Mac OS X.
● Something just not working for you? Try checking the Frequently asked questions for GNU textutils.
● Need more introductory info before delving into the tools we've covered here? Try starting with UNIXhelp for users.
● Of course, the classic work in this field is Unix Power Tools, from O'Reilly and Associates (Jerry Peek, Tim O'Reilly, and Mike Loukides: 1997; ISBN 1-56592-260-3).
● Lest we forget, the Jargon File has an amusing entry on the topic of sorting.
