Author: Mark Sobell
This article is excerpted from the recently published book A Practical Guide to Red Hat Linux, Second Edition.
Different communications have different priorities. Information about the company picnic in two months is not as time sensitive as the fact that you are bringing the system down in 5 minutes. To meet these differing needs, Linux provides different ways of communicating. The most common methods are described and contrasted in the following list. All these methods are generally available to everyone, except for the message of the day and wall (write all), which are typically reserved for Superuser.
write
Use write to communicate with a user who is logged in on the local system. You might use it to ask a user to stop running a program that is bogging down the system. The user might reply that he or she will be done in 3 minutes. Users can also use write to ask the system administrator to mount a tape or restore a file.
talk
The talk utility performs the same function as write but is more advanced. Although talk uses a character-based interface, it has a graphical appearance, showing what each user is typing as it is being typed. Unlike write, talk lets you have a discussion with someone on another machine on the network.
wall
The wall (write all) utility effectively communicates immediately with all users who are logged in. It works similarly to write, except users cannot use wall to write back to only you. Use wall when you are about to bring the system down or are in another crisis situation. Users who are not logged in do not get the message.
Use wall while you are Superuser only in crisis situations; it interrupts anything anyone is doing.
email
Users can easily make permanent records of messages they receive via email, as opposed to messages received via write or talk, so they can keep track of important details. It would be appropriate to use email to inform users about a new, complex procedure, so each user could keep a copy of the information for reference.
Creating problems
Even experienced system administrators make mistakes; new system administrators make more mistakes. Even though you can improve your odds by carefully reading and following the documentation provided with your software, many things can still go wrong. A comprehensive list is not possible, no matter how long, as new and exciting ways to create problems are discovered every day. A few of the more common techniques are described here.
Failing to perform regular backups
Few feelings are more painful to a system administrator than realizing that important information is lost forever. If your system supports multiple users, having a recent backup may be your only protection from a public lynching. If it is a single-user system, having a recent backup certainly keeps you happier when you lose a hard disk.
Not reading and following instructions
Software developers provide documentation for a reason. Even when you have installed a software package before, you should carefully read the instructions again. They may have changed, or you may simply remember them incorrectly. Software changes quickly; look for the latest documentation online.
Deleting or mistyping a critical file
One sure way to give yourself nightmares is to execute the command
# rm -rf /etc        <-- do not do this
Perhaps no other command renders a Linux system useless so quickly. The only recourse is to reboot into rescue mode using the first installation CD and restore the missing files from a recent backup. Although this example is extreme, many files are critical to proper operation of a system. Deleting one of these files or mistyping information in one of them is almost certain to cause problems. If you directly edit /etc/passwd, for example, entering the wrong information in a field can make it impossible for one or more users to log in. Do not use rm -rf with an argument that includes wildcard characters; do pause after typing the command, and read it before you press RETURN. Check everything you do carefully, and make a copy of a critical file before you edit it.
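The advice to copy a critical file before editing it can be sketched as a small shell function. This is only an illustration: backup_file is a hypothetical helper name, and the timestamped .bak suffix is an assumed naming convention, not something the system requires.

```shell
#!/bin/sh
# backup_file FILE
# Copies FILE to FILE.<timestamp>.bak and prints the name of the copy.
# backup_file is a hypothetical helper, not a standard utility.
backup_file() {
    src=$1
    if [ ! -f "$src" ]; then
        echo "backup_file: $src is not a regular file" >&2
        return 1
    fi
    # The timestamp keeps earlier copies from being overwritten.
    dest="$src.$(date +%Y%m%d-%H%M%S).bak"
    # -p preserves the mode and timestamps (and ownership, when permitted).
    cp -p "$src" "$dest" || return 1
    echo "$dest"
}
```

As root you might run backup_file /etc/passwd before opening the file in an editor; if the edit goes wrong, the printed copy can be moved back into place.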
Solving problems
As the system administrator, it is your responsibility to keep the system secure and running smoothly. When a user is having a problem, it usually falls to the administrator to help the user get back on track. This section suggests ways to keep users happy and the system functioning at its peak.
Helping when a user cannot log in
When a user has trouble logging in on the system, the problem may be a user error or a problem with the system software or hardware. The following steps can help you determine where the problem is:
- Determine if only that one user or only that one user’s terminal/workstation has a problem or if the problem is more widespread.
- Check that the user’s Caps Lock key is not on.
- Make sure the user’s home directory exists and corresponds to that user’s entry in the /etc/passwd file. Verify that the user owns his or her home directory and startup files and that they are readable (and, in the case of the home directory, executable). Confirm that the entry for the user’s login shell in the /etc/passwd file is valid (that is, that the entry is accurate and that the shell exists as specified).
- Change the user’s password if there is a chance that he or she has forgotten the correct password.
- Check the user’s startup files (.profile, .login, .bashrc, and so on). The user may have edited one of these files and introduced a syntax error that prevents login.
- Check the terminal or monitor data cable from where it plugs into the terminal to where it plugs into the computer (or as far as you can follow it). Finally, try turning the terminal or monitor off and then turning it back on.
- When the problem appears to be widespread, check if you can log in from the system console. If you can, make sure that the system is in multiuser mode. If you cannot log in, the system may have crashed; reboot it and perform any necessary recovery steps (the system usually does quite a bit automatically).
- Check the /etc/inittab file to see that it is starting the appropriate login service (usually some form of getty, such as mingetty).
- Check the /var/log/messages file. This file accumulates system errors, messages from daemon processes, and other important information. It may indicate the cause or more symptoms of a problem. Also, check the system console. Occasionally messages about system problems that do not get written to /var/log/messages (for instance, if the disk is full) get displayed on the console.
- Use df to check for full filesystems. Sometimes, if the /tmp filesystem or the user’s home directory is full, login fails in unexpected ways. In some cases you may be able to log in to a textual environment but not a graphical one. When applications that start when the user logs in cannot create temporary files or cannot update files in the user’s home directory, the login process itself may terminate.
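Two of the checks above, the /etc/passwd entry and full filesystems, lend themselves to a short script. The following is a rough sketch, not a complete diagnostic: check_login_entry and filter_full are hypothetical names, and the 95 percent default threshold is an arbitrary assumption.

```shell
#!/bin/sh
# check_login_entry USER [PASSWD_FILE]
# Reports a missing passwd entry, a missing home directory, or a
# missing or non-executable login shell. Hypothetical helper.
check_login_entry() {
    user=$1
    passwd_file=${2:-/etc/passwd}
    entry=$(awk -F: -v u="$user" '$1 == u { print; exit }' "$passwd_file")
    if [ -z "$entry" ]; then
        echo "no entry for $user in $passwd_file"
        return 1
    fi
    home=$(printf '%s\n' "$entry" | cut -d: -f6)
    shell=$(printf '%s\n' "$entry" | cut -d: -f7)
    shell=${shell:-/bin/sh}    # an empty shell field means /bin/sh
    rc=0
    if [ ! -d "$home" ]; then
        echo "home directory $home does not exist"
        rc=1
    fi
    if [ ! -x "$shell" ]; then
        echo "login shell $shell is missing or not executable"
        rc=1
    fi
    return $rc
}

# filter_full [PERCENT]
# Reads `df -P` output on standard input and prints the mount point and
# usage of each filesystem at or above PERCENT (assumed default: 95).
# Mount points that contain spaces are not handled.
filter_full() {
    awk -v t="${1:-95}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= t) print $6, $5
    }'
}
```

For example, check_login_entry jenny inspects the entry for the user jenny, and df -P | filter_full 95 lists nearly full filesystems. The script does not check ownership or permissions of the home directory and startup files; those still need a look with ls -l.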
Speeding up the system
When the system is running slowly for no apparent reason, perhaps a process did not exit when a user logged out. Symptoms include poor response time and a system load, as shown by w or uptime, that is greater than 1.0. Use ps -ef to list all processes. The top utility is excellent for quickly finding rogue processes. One thing to look for in ps -ef output is a large number in the TIME field. For example, if you find a Netscape process that has a TIME field over 100.0, this process has likely run amok. However, if the user is doing a lot of Java work and has not logged out for a long time, this value may be normal. Look at the STIME field to see when the process was started. If the process has been running for longer than the user has been logged in, it is a good candidate to be killed.
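Scanning the TIME column by eye gets tedious on a busy system, so the search can be scripted. The sketch below assumes a ps that accepts POSIX-style -o format lists and prints TIME as [DD-]HH:MM:SS. The names TO_SECONDS_AWK, cputime_to_seconds, and busy_processes are hypothetical, and the 6000-second default is an arbitrary threshold. It prints elapsed time (etime) rather than STIME because etime is the portable way to see how long a process has been running.

```shell
#!/bin/sh
# awk function that converts a ps TIME value ([DD-]HH:MM:SS) to seconds.
TO_SECONDS_AWK='
function to_seconds(t,    d, p, n, i, days, secs) {
    days = 0
    if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
    n = split(t, p, ":")
    secs = 0
    for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
    return days * 86400 + secs
}
'

# cputime_to_seconds TIME
# Prints TIME converted to seconds.
cputime_to_seconds() {
    printf '%s\n' "$1" | awk "$TO_SECONDS_AWK"'{ print to_seconds($1) }'
}

# busy_processes [SECONDS]
# Lists PID, user, elapsed time, CPU time, and command name for every
# process whose CPU time is at least SECONDS (assumed default: 6000).
busy_processes() {
    ps -eo pid=,user=,etime=,time=,comm= |
        awk -v limit="${1:-6000}" "$TO_SECONDS_AWK"'to_seconds($4) >= limit + 0 { print }'
}
```

Running busy_processes 600 would list processes that have used ten minutes or more of CPU time. A long elapsed time alongside a large CPU time is the pattern described above, but the output is only a list of candidates; check with the owner before killing anything.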
When a user gets stuck and leaves his or her terminal unattended without notifying anyone, it is convenient to kill all processes owned by that user. If the user is running a window system, such as GNOME or KDE on the console, kill the window manager process. Processes to look for include startkde, gnome-session, or another process whose name ends in wm. Usually the window manager is either the first or the last thing to be run, and exiting from the window manager logs the user out. If killing the window manager does not work, try killing the X server process itself. This process is typically listed as /etc/X11/X. If that fails, you can kill all processes owned by a user by running kill -15 -1, or equivalently kill -TERM -1, as the user. Using -1 (one) in place of the process ID tells kill that it should send the signal to all processes that are owned by that user. For example, as root you could type
# su jenny -c 'kill -TERM -1'
If this does not kill all processes (sometimes TERM does not kill a process), you can use the KILL signal. The following line will definitely kill all processes owned by Jenny and will not be friendly about it:
# su jenny -c 'kill -KILL -1'
(If you do not use su jenny -c, the same command brings the system down.)
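The same escalation from TERM to KILL can be applied to one process instead of to everything a user owns, which is often the safer first step. The following is a minimal sketch: stop_process is a hypothetical helper name and the 5-second grace period is an assumed default.

```shell
#!/bin/sh
# stop_process PID [GRACE_SECONDS]
# Sends TERM to PID, waits up to GRACE_SECONDS (assumed default: 5) for
# it to exit, and then sends KILL if the process is still present.
# Returns 1 if the initial TERM cannot be delivered.
stop_process() {
    pid=$1
    grace=${2:-5}
    # Fails if the process does not exist or you lack permission.
    kill -TERM "$pid" 2>/dev/null || return 1
    while [ "$grace" -gt 0 ]; do
        # Signal 0 only tests whether the process can still be signaled.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done
    # KILL cannot be caught or ignored by the process.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

For example, stop_process 4312 10 gives the (hypothetical) process 4312 ten seconds to clean up before it is killed outright.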
lsof: Finds open files
The name lsof is short for ls open files; this utility locates open files. Its options let you look only at certain processes, look only at certain file descriptors of a process, or show certain network connections (network connections use file descriptors just as normal files do, and lsof can show those as well). Once you have identified a suspect process using ps -ef, run the following command:
# lsof -sp pid
Replace pid with the process ID of the suspect process; lsof displays a list of all file descriptors that process pid has open. The -s option displays the size of all open files. The size information may be helpful in determining whether the process has a very large file open. If it does, contact the owner of the process or, if necessary, kill the process. The -rn option redisplays the output of lsof every n seconds.