Author: M. Shuaib Khan
Strace
When an application you successfully compiled fails during run time, it usually gives you an error. On a lucky day, the error message might contain details of what went wrong, and give you clues about what to do to fix the problem. But this is not what usually happens. Often, error messages are obscure and of little help in figuring out what went wrong.
Strace can come in handy in such situations. This utility traces the system calls a program uses during its run time. A system call is a Linux kernel function that provides secure access to a system’s resources, such as memory, disk, and network.
Strace is easy to use — just pass the name of the executable you want to run as an argument to the strace application. As an example, check out what output you get when you trace the following simple “Hello, world!” program:
#include <stdio.h>

int main(void)
{
    printf("Hello, world!\n");
    return 0;
}
$gcc -o hello hello.c
$strace ./hello
execve("./hello", ["./hello"], [/* 94 vars */]) = 0
brk(0) = 0x804b000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7eff000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/i686/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls/i686/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls/i686", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/tls/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/tls", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/i686/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/i686/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/i686/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/i686", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/sse2/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib/sse2", 0xbf91d630) = -1 ENOENT (No such file or directory)
open("/opt/wx/2.8/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/opt/wx/2.8/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=186839, ...}) = 0
mmap2(NULL, 186839, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7ed1000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
. . .
write(1, "Hello, world!\n", 14Hello, world!
) = 14
exit_group(0) = ?
Process 6006 detached
In the above output, you can see that running even this simple program takes a good number of system calls: execve, brk, mmap2, access, open, stat64, close, write, and so on. Notice the large number of unsuccessful attempts to open the libc.so.6 library, each returning -1 with the error ENOENT. That’s because the run time linker looks in several places to find the library. The only successful attempt is the last one, open("/lib/libc.so.6", O_RDONLY) = 3, where the open system call returns the file descriptor 3; a non-negative return value indicates that the file was opened. If we could make the linker look in /lib first, we could save all those unsuccessful calls. And of course we can, by putting /lib at the beginning of the environment variable LD_LIBRARY_PATH, which the run time linker consults when it searches for the libraries required by the running program.
$export LD_LIBRARY_PATH=/lib:$LD_LIBRARY_PATH
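If LD_LIBRARY_PATH already has entries, one way to move /lib to the front without leaving a duplicate behind is a small helper like the sketch below. The function name prepend_dir is made up for this illustration, and the sketch assumes a POSIX shell with awk available.

```shell
# prepend_dir DIR LIST: print the colon-separated LIST with DIR moved to the front.
# Hypothetical helper for illustration; not part of any standard tool.
# Any existing copy of DIR is removed, and empty entries in LIST are dropped.
prepend_dir() {
    printf '%s\n' "$2" | awk -F: -v d="$1" '{
        out = d
        for (i = 1; i <= NF; i++)
            if ($i != d && $i != "")
                out = out ":" $i
        print out
    }'
}
```

It could then be used as export LD_LIBRARY_PATH=$(prepend_dir /lib "$LD_LIBRARY_PATH"), which keeps the remaining search directories in their original order.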
The output of strace can be quite unwieldy when it’s dumped to the console. It is common to write it to a file instead by using the command’s -o option. Another common option is -p, which takes a process ID (PID) and lets you attach to a program that is already running and watch its system calls. This is useful for long-running daemons that you cannot restart easily, or that need to be monitored only rarely.
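Once a trace has been saved with -o, ordinary text tools can summarize it. As a rough sketch, the helper below counts the open calls in a saved log that failed with ENOENT. The function name count_failed_opens and the file names used afterwards are assumptions for this example, and the pattern assumes a single-process log in which every line starts with the system call name.

```shell
# count_failed_opens LOG: print how many open()/openat() calls in a saved
# strace log failed with ENOENT ("No such file or directory").
# Hypothetical helper; assumes lines begin with the call name (no PID prefix).
count_failed_opens() {
    awk '/^(open|openat)\(/ && /ENOENT/ { n++ } END { print n + 0 }' "$1"
}
```

After running strace -o hello.trace ./hello, calling count_failed_opens hello.trace prints the number of failed library lookups. To capture a daemon that is already running, strace -o daemon.trace -p 1234 produces the same kind of log, where 1234 stands in for the real PID.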
A nice example of how useful strace can get comes from a user who had installed multimedia codecs, including libdvdcss, which allowed him to play encrypted DVDs. But when he tried to use his movie player to play DVDs, he got strange errors. On tracing the movie player with strace, he figured out that the run time linker was looking in the wrong places for the installed codecs. After searching for the required library and putting it in a directory where the linker could find it, he was able to run the movie player to play his DVDs.
ltrace
ltrace is a sister application of strace. It works just like strace, but instead of tracing the system calls executed during the run time of a program, it traces the dynamic library calls. If we ltrace the previous “Hello, world!” program, here is what we get as the output:
$ltrace ./hello
__libc_start_main(0x80483b4, 1, 0xbfacb0d4, 0x80483f0, 0x80483e0 <unfinished ...>
puts("Hello, world!"Hello, world!
) = 14
+++ exited (status 0) +++
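The output shows that the executable “hello” calls only one library function, puts, which writes the string “Hello, world!” followed by a newline to the console. The call appears as puts rather than printf because the compiler substitutes the simpler function when the string contains no format specifiers and ends in a newline.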
ltrace isn’t as commonly used as strace. It is preferred when a detailed trace of a program is required, especially when we are interested in the details of the dynamic library functions the program uses, such as malloc(), gethostbyname(), and setenv().
lsof
The lsof tool is used to list all the files open on a Linux system. Remember that in true Unix spirit, almost everything is a file. You access your hardware through files located in /dev, information about CPU, memory, and other devices is located in files under /proc, and network connections, a.k.a. sockets, are also represented through file descriptors.
lsof becomes really handy when you want to know what files a process has currently opened, or which processes are currently acting on a certain file:
$lsof
COMMAND    PID USER  FD   TYPE DEVICE    SIZE    NODE NAME
init         1 root  cwd  DIR     8,1    4096       2 /
init         1 root  rtd  DIR     8,1    4096       2 /
init         1 root  txt  REG     8,1  533224 1658100 /sbin/init
init         1 root  10u  FIFO   0,14            2941 /dev/initctl
migration    2 root  cwd  DIR     8,1    4096       2 /
migration    2 root  rtd  DIR     8,1    4096       2 /
lsof lists the name of the running command, its process ID, the user to whom the process belongs, the file descriptor of the opened file, the type of the file, the major and minor device numbers of the file, the size of the file, its inode number, and the name of the file opened or the mount point of the device being acted on.
To list files opened by processes belonging to a particular user, use:
$lsof -u user
To see a list of files opened by a particular process, use:
$lsof -p pid
Sometimes, you are unable to unmount a particular device because the system reports it as busy, even though you think it is not used by any process. To see what process is still using it, use:
$lsof /dev/mount-point
This will give you the list of processes using the device. Kill them, and you are ready to unmount the device.
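Before killing anything, it helps to see which commands are involved. The sketch below condenses lsof’s table to one line per process. The function name holders is an assumption for this example; it relies only on the column order shown in the lsof output earlier in this section (COMMAND first, PID second).

```shell
# holders: read lsof's default table on stdin and print "PID COMMAND"
# once for each process, in the order the processes first appear.
# Hypothetical helper; skips the header row and ignores the other columns.
holders() {
    awk 'NR > 1 && !seen[$2]++ { print $2, $1 }'
}
```

Piping the earlier command through it, as in lsof /dev/mount-point | holders, gives a short list of candidates to stop or kill before unmounting.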
top
Top lists the processes that are using the most resources on a system at any given time. The ranking can be by CPU usage, memory usage, and so on.
$top
top - 18:21:33 up 1:40, 4 users, load average: 0.30, 0.21, 0.27
Tasks: 155 total, 2 running, 148 sleeping, 0 stopped, 5 zombie
Cpu(s): 6.9%us, 2.7%sy, 0.0%ni, 80.5%id, 9.6%wa, 0.1%hi, 0.1%si, 0.0%st
Mem: 506908k total, 492384k used, 14524k free, 12900k buffers
Swap: 1052248k total, 39836k used, 1012412k free, 144944k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      15   0   744  124   80 S    0  0.0   0:01.37 init
    2 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/0
    4 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/1
    5 root      34  19     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1
Top can be useful when you want to know what process is consuming how much of a system’s resources. In particular, if a certain process is consuming too much memory, you can locate it through top and take appropriate measures to bring it down, if it’s not critical.
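Top can also run non-interactively: -b selects batch mode and -n 1 limits it to a single snapshot, which makes the output easy to filter. The sketch below picks out processes above a given memory percentage. The function name heavy_procs is hypothetical, and the code assumes the process table has a header row containing PID, %MEM, and COMMAND columns and that numbers use a dot as the decimal separator.

```shell
# heavy_procs LIMIT: read `top -b -n 1` output on stdin and print
# "PID %MEM COMMAND" for every process whose %MEM is above LIMIT.
# Hypothetical helper; column positions are taken from the header row.
heavy_procs() {
    awk -v limit="$1" '
        $1 == "PID" {
            # Locate the columns by name so a different layout still works.
            for (i = 1; i <= NF; i++) {
                if ($i == "%MEM") m = i
                if ($i == "COMMAND") c = i
            }
            next
        }
        m && $m + 0 > limit + 0 { print $1, $m, $c }
    '
}
```

For example, top -b -n 1 | heavy_procs 10 lists every process using more than 10 percent of memory.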
Traceroute
Traceroute is a network troubleshooting tool. For a network packet to reach a remote computer from your machine, it has to go through different routers on the network. Sometimes, even though both the local and the remote machines are functioning properly and connected to the network, they can’t communicate with each other because of a problem somewhere in between the two machines. To trace where the packet is dropped on the network, use traceroute:
$traceroute google.com
Hop  (ms)  (ms)  (ms)  IP Address      Host name
1    0     0     0     66.98.244.1     gphou-66-98-244-1.ev1servers.net
2    0     1     0     66.98.241.16    gphou-66-98-241-16.ev1servers.net
. . .
13   29    28    28    72.14.232.57    -
14   34    35    36    64.233.175.42   -
15   28    28    29    64.233.167.99   py-in-f99.google.com
The output shows that the packet took 15 hops to reach google.com. It lists the IP addresses and names (if available) of all the machines the packet went through along the way.
ping
Ping can help you figure out if a remote machine on the network is up and connected. Ping sends ICMP messages to the remote machine, and prints the details if it gets a reply. Sometimes system administrators disable ICMP messages on their machines, which means that a ping won’t get a reply from that particular machine even if it is present on the network, so be sure that the remote machine you’re interested in does reply to ICMP messages before assuming that it is down.
$ping google.com
PING google.com (72.14.207.99) 56(84) bytes of data.
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=1 ttl=238 time=265 ms
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=2 ttl=238 time=269 ms
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=3 ttl=238 time=272 ms
64 bytes from eh-in-f99.google.com (72.14.207.99): icmp_seq=4 ttl=238 time=263 ms
hexdump
The hexdump utility is useful for seeing the contents of a binary file in a human-readable format, which can be ASCII, hexadecimal, octal, or decimal. For example, to see what the contents of the executable /bin/ls look like in hex and ASCII, use:
$hexdump -C /bin/ls
00000000  7f 45 4c 46 01 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  02 00 03 00 01 00 00 00  80 9c 04 08 34 00 00 00  |............4...|
00000020  0c 5c 01 00 00 00 00 00  34 00 20 00 0a 00 28 00  |.\......4. ...(.|
00000030  1f 00 1e 00 06 00 00 00  34 00 00 00 34 80 04 08  |........4...4...|
00000040  34 80 04 08 40 01 00 00  40 01 00 00 05 00 00 00  |4...@...@.......|
. . .
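The leftmost column is the offset into the file, the columns that follow show the contents of the file in hex, 16 bytes per line, and the text between the bars is the ASCII representation of those same bytes, with a dot standing in for each non-printable byte.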
Hexdump is useful for searching text strings within an executable file for which source code might not be available. It can help you locate specific error messages and where they occur in a file.
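As a minimal sketch of such a search, the helper below prints the dump lines whose ASCII column contains a given text, so the offset on the left shows roughly where the text sits in the file. The function name find_in_dump is an assumption for this example. Because hexdump -C shows 16 bytes per line, a string that straddles two lines will be missed, so treat a negative result with caution.

```shell
# find_in_dump FILE TEXT: print the `hexdump -C` lines whose ASCII column
# (the part starting at the first "|") contains TEXT.
# Hypothetical helper; TEXT should not contain backslashes, because
# awk -v interprets escape sequences in the value it is given.
find_in_dump() {
    hexdump -C "$1" | awk -v t="$2" '{
        p = index($0, "|")
        if (p > 0 && index(substr($0, p), t) > 0)
            print
    }'
}
```

For instance, find_in_dump /bin/ls "ELF" should print at least the first line of the dump shown above, since every ELF executable starts with those bytes.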
Conclusion
Troubleshooting Linux is an art, but these tools can help you master it. You can read more usage details about these tools on their respective man pages. Remember that knowing how to use a tool is not the same as knowing when to use it. As you encounter different problems and tackle them, you’ll eventually learn the art of diagnosing trouble and fixing problems on your Linux system.