Sorting your data with msort

Author: Ben Martin

msort is a tool for sorting text files. With both a command-line and graphical interface, it allows you to pick out where your sort keys are in a file and lets you select how to order those keys in a number of ways.

Compared with the GNU sort program that is installed on most Linux systems, msort offers more flexibility in defining where your sort keys are and how to order them. It also has strong internationalization support: full UTF-8 handling, the ability to sort a file using different locales for different sort keys, and support for numbers in non-Western number systems.

Packages are available for Ubuntu Gutsy, openSUSE, and Fedora 7. For this article I’ll build from source using version 8.44.

Which dependencies msort needs depends on the functionality you want in your build. msort uses the Uninum library to handle numbers that are not in Western number systems; you can compile msort without Uninum support if you are not interested in this feature. The graphical interface requires Tcl/Tk and the iwidgets library. Packages for Uninum are available for Fedora 8 and some versions of openSUSE, but not for Gutsy. The iwidgets library is available for Gutsy, openSUSE 10.3, and Fedora 8.

You can compile msort from source with the normal ./configure; make; sudo make install sequence. There are two options for handling UTF-8 text in msort: the utf8proc package or libicu. The author of msort recommends building with libicu support rather than utf8proc.

Data to be sorted is commonly broken up into records. When you sort lines of text, for example, each line may be considered a record. You then nominate a part of each record as the sort key. For example, you might want to sort a file containing people’s names by surname. When the first sort key does not fully determine the order, you can specify further sort keys to break ties. For example, if your first sort key is the extension of a file name, you might use the file name itself as a second sort key. All files with the same extension then appear together in the output, and within each extension the files are ordered by name.
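The idea of a primary and a secondary sort key can be sketched in a few lines of Python. This is only an illustration of the concept, not of how msort is implemented; the file names and the helper extension_then_name are made up for the example.

```python
import os

def extension_then_name(filename):
    """Build a two-part key: the extension first, the full name second."""
    _, ext = os.path.splitext(filename)
    return (ext, filename)

# Hypothetical file names, used only for illustration.
files = ["notes.txt", "main.c", "readme.txt", "util.c", "build.sh"]

# Tuples compare element by element, so the name only matters
# when two files share the same extension.
sorted_files = sorted(files, key=extension_then_name)
print(sorted_files)
```

Files with the same extension end up next to each other, and the second element of the key decides their order within that group.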

Imagine that you have a tag-delimited file that you want to sort. Sorting a free-format file without fixed field lengths is awkward with GNU sort, but the following msort invocation will sort that data by last name. To make the example more readable I have used the utility’s long argument names in the first invocation. The final command performs the same sort using the short options. The -l/--line option tells msort to treat each line as a record, while the -t/--tag option gives msort a regular expression to use to find the key that you want to sort with.

    $ cat tagged-data.txt
    first:Frodo last:Baggins
    first:Samwise last:Gamgee
    first:Meriadoc last:Brandybuck
    first:Peregrin last:Took
    $ msort.utf8proc --quiet --line --tag last: tagged-data.txt
    first:Frodo last:Baggins
    first:Meriadoc last:Brandybuck
    first:Samwise last:Gamgee
    first:Peregrin last:Took
    $ msort.utf8proc -qlt last: tagged-data.txt
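To make the tag idea concrete, here is a minimal Python sketch of sorting by a tagged key. It assumes the key is whatever follows the tag up to the next white space, which is a simplification and may not match msort's exact rules; tag_key is a hypothetical helper, not part of msort.

```python
import re

def tag_key(record, tag):
    """Return the text after the first match of the regex `tag`,
    up to the next white space (an assumption made for this sketch)."""
    match = re.search(f"(?:{tag})(?P<key>\\S*)", record)
    return match.group("key") if match else ""

records = [
    "first:Frodo last:Baggins",
    "first:Samwise last:Gamgee",
    "first:Meriadoc last:Brandybuck",
    "first:Peregrin last:Took",
]

# Each line is one record; the key is the value of the last: tag.
by_last_name = sorted(records, key=lambda r: tag_key(r, "last:"))
for record in by_last_name:
    print(record)
```

Because the key is located by a pattern rather than by column position, the tags could appear in any order on the line without changing the result.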

msort can handle records that span multiple lines. If your records are separated with a special character, you can use the --record-separator command-line option to tell msort to use it. Otherwise msort will take two or more newline characters as the record separator.

Shown below is the same data as above but in a format where records span multiple lines. Instead of sorting by last name, I sort on the first name and use the --block option to explicitly tell msort that records are in multiple-line format. Notice that I shuffled the order in which the first and last tags appear in each record and also added a new value for Peregrin that does not affect the correct ordering of the results.

    $ cat multiline-data.txt
    first:Frodo
    last:Baggins

    last:Gamgee
    first:Samwise

    first:Meriadoc
    last:Brandybuck

    something else
    first:Peregrin
    last:Took
    $ msort.utf8proc --quiet --block --tag first: multiline-data.txt
    first:Frodo
    last:Baggins

    first:Meriadoc
    last:Brandybuck

    something else
    first:Peregrin
    last:Took

    last:Gamgee
    first:Samwise
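The block format can be sketched in Python as well. The snippet below assumes that records are separated by two or more newline characters, as described above, and that the key is the text following the tag up to the next white space; split_blocks and tag_key are hypothetical helpers written for this illustration only.

```python
import re

text = """first:Frodo
last:Baggins

last:Gamgee
first:Samwise

first:Meriadoc
last:Brandybuck

something else
first:Peregrin
last:Took
"""

def split_blocks(data):
    """Split the input into records on runs of two or more newlines."""
    return [block for block in re.split(r"\n{2,}", data.strip("\n")) if block]

def tag_key(record, tag):
    """Return the text after the first match of `tag`, up to white space."""
    match = re.search(f"(?:{tag})(?P<key>\\S*)", record)
    return match.group("key") if match else ""

blocks = split_blocks(text)
sorted_blocks = sorted(blocks, key=lambda b: tag_key(b, "first:"))

# Print the records back out with a blank line between them.
print("\n\n".join(sorted_blocks))
```

The tag search runs over the whole block, so it does not matter on which line of the record the first: tag appears, and extra lines such as "something else" are carried along untouched.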

Below is an example where you might like to sort by more than one key. First I would like to sort based on the second column, but there are many values of 100 in that column, so having each of those records then sorted on the first column would make the output more readable. The -n/--position command-line option tells msort to sort the data using the nominated fields. By default, white space delimits fields, which means that the space between the first and second column breaks the data up correctly into two columns. You can specify multiple field separator characters using the -d/--field-separators argument. The -c n option shown in the middle of the command line tells msort to sort the previously specified key numerically. This is why the values of 20 appear at the top of the output instead of after those of 100, which would be the case using the default lexicographical comparison.

    $ cat multi-key-charnum.txt
    act 100
    there 100
    foo 100
    bar 300
    urusai 300
    small 20
    one 20
    $ msort.utf8proc --quiet --line -n 2 -c n -n 1 multi-key-charnum.txt
    one 20
    small 20
    act 100
    foo 100
    there 100
    bar 300
    urusai 300
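The difference between numeric and lexicographical comparison of the second column is easy to reproduce in Python. The sketch below uses the same data and assumes white-space-delimited fields; it illustrates the ordering rules only and is not a description of msort's internals.

```python
rows = [
    "act 100",
    "there 100",
    "foo 100",
    "bar 300",
    "urusai 300",
    "small 20",
    "one 20",
]

def numeric_then_name(line):
    """Key: second field compared as a number, then first field as text."""
    name, value = line.split()
    return (int(value), name)

def lexical_then_name(line):
    """Key: second field compared as text, then first field as text."""
    name, value = line.split()
    return (value, name)

numeric_order = sorted(rows, key=numeric_then_name)
lexical_order = sorted(rows, key=lexical_then_name)

print(numeric_order)  # 20 sorts before 100
print(lexical_order)  # "100" sorts before "20" because "1" < "2"
```

With the numeric key the rows with 20 come first, as in the msort output above; with the textual key they land between the 100 and 300 rows.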

There are many options for the sort order you can specify with -c, including dates, times, domains, email addresses, angles, hybrid, and a few numeric options. The hybrid mode breaks a field up based on repeating patterns of text and numbers. This works well for sorting IP addresses, such as those commonly found in system logs. Shown below is an abridged sample of an iptables packet log report from syslog. I sort it first by source address and then by destination address, using the -t/--tag option to pick out the IP addresses from the log and -c h to specify that each key should be sorted using the hybrid comparison mode.

$ cat hybrid-ips.txt Apr 29 20:14:58 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.3.4 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.4 DST=192.168.4.12 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.33 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth1 OUT=eth0 SRC=192.168.3.3 DST=192.168.3.33 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.4 DST=192.168.0.33 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.33 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.133 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.1.33 LEN=76... $ msort.utf8proc -ql -t SRC= -c h -t DST= -c h hybrid-ips.txt Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.33 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.33 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.0.133 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.1.33 LEN=76... Apr 29 20:14:58 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.2 DST=192.168.3.4 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth1 OUT=eth0 SRC=192.168.3.3 DST=192.168.3.33 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.4 DST=192.168.0.33 LEN=76... Apr 29 20:15:48 fots kernel: invalides IN=eth2 OUT=eth0 SRC=192.168.3.4 DST=192.168.4.12 LEN=76...
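A hybrid comparison of this kind can be approximated in Python by splitting each key into alternating runs of digits and other characters and comparing the digit runs as numbers. The sketch below is an assumption about how such a comparison could work, not msort's actual algorithm; hybrid_key and field are hypothetical helpers, and the log lines are shortened to just the two address fields.

```python
import re

def hybrid_key(text):
    """Split text into digit and non-digit runs; digit runs compare as
    integers. Each part is a tuple so numbers and strings never mix."""
    key = []
    for digits, other in re.findall(r"([0-9]+)|([^0-9]+)", text):
        key.append((0, int(digits), "") if digits else (1, 0, other))
    return key

def field(line, tag):
    """Return the text that follows the literal `tag`, up to white space."""
    match = re.search(re.escape(tag) + r"(\S+)", line)
    return match.group(1) if match else ""

log = [
    "SRC=192.168.3.2 DST=192.168.3.4",
    "SRC=192.168.3.4 DST=192.168.4.12",
    "SRC=192.168.3.2 DST=192.168.0.133",
    "SRC=192.168.3.3 DST=192.168.3.33",
    "SRC=192.168.3.2 DST=192.168.0.33",
]

# Primary key: source address; secondary key: destination address.
ordered = sorted(
    log,
    key=lambda l: (hybrid_key(field(l, "SRC=")), hybrid_key(field(l, "DST="))),
)
for line in ordered:
    print(line)
```

A plain string sort would put 192.168.0.133 ahead of 192.168.0.33, because the character "1" sorts before "3"; comparing each run of digits as a number gives the order a network administrator would expect.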

The ability to specify the keys that you want to sort using the --tag regular expression together with the selection of ways that msort can compare its keys allows you to perform a wide variety of sorting tasks directly from the command line, without having to resort to writing a Perl script. Support for foreign language number systems can be a real bonus too, depending on what data sets you are using.
