Author: Roderick W. Smith
This article is excerpted from the recently published book Linux Power Tools.
Recovering from filesystem corruption
A common nightmare among computer users is losing data due to filesystem corruption. This can happen because of errors introduced during a system crash, filesystem driver bugs (particularly when using non-Linux drivers to access Linux partitions), human error involving low-level disk utilities, or other factors. All Linux filesystems include disk-check tools, but they differ in many details.
Most filesystems provide a utility that scans the filesystem’s contents for internal consistency. Such a tool can detect, and often correct, errors such as mangled directories, bad time stamps, inodes that point to the wrong part of the disk, and so on. In Linux, the fsck utility serves as a front-end to filesystem-specific checking tools, which usually have names of the form fsck.filesystem, where filesystem is the filesystem name, such as jfs or ext2. If you need to check a filesystem manually, you can either call fsck, which then calls the filesystem-specific utility, or call the filesystem-specific program directly.
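For example, to check a hypothetical ext2 partition on /dev/sda5 (the device name is only an illustration), you might unmount it and then call either the front-end or the filesystem-specific checker directly; the last two commands are alternatives:
# umount /dev/sda5
# fsck /dev/sda5
# fsck.ext2 /dev/sda5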
Precisely what each fsck program does is highly filesystem-specific. These programs often perform multiple passes through the filesystem, each checking for a particular type of problem. On a badly corrupted partition, you may need to run fsck several times to fix all the problems. You may also need to answer questions posed by fsck for some types of errors. Unfortunately, these questions tend to be cryptic to all but experienced filesystem gurus. Fortunately, they usually accept yes/no responses, so you can guess (the default value generally produces acceptable results).
If your partition is badly corrupted, and if you have enough unused space on another partition or hard disk, try copying the entire partition’s data with dd as a backup prior to running fsck. For instance, typing dd if=/dev/sda5 of=/dev/hda7 backs up the /dev/sda5 partition to /dev/hda7. You can then restore that image and try other options if fsck doesn’t work properly. You can also try adding the noerror conversion (conv=noerror) to a dd command to copy a partition on a drive that’s producing physical read errors. It’s possible that fsck will then be able to recover data from the copied partition even if it fails on the original because of the read errors.
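A hedged example of such a copy, using the same partition names as above, follows. The noerror conversion tells dd to continue past read errors, and the commonly paired sync conversion pads unreadable blocks with zeros so that the data that follows stays at the correct offset:
# dd if=/dev/sda5 of=/dev/hda7 conv=noerror,sync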
During system startup, Linux examines the filesystem entries in /etc/fstab to determine whether to run fsck automatically. Specifically, the final field for a filesystem entry is a code specifying when to check the filesystem. A value of 0 means not to run fsck on startup. This setting is appropriate for non-Linux filesystems. The root (/) filesystem should normally have a value of 1, meaning that it’s checked first. All other Linux native filesystems should normally have values of 2, meaning that they’re checked after the root filesystem.
Filesystem checks at boot time can cause problems in some distributions and with some journaling filesystems, though. Specifically, the filesystem check tool may slow down the bootup process because it unnecessarily forces a thorough disk check. For this reason, these checks are sometimes disabled (with a 0 value in the final /etc/fstab field), particularly for ReiserFS, and sometimes for ext3fs and XFS. These filesystems are all capable of automatically replaying their journals when mounted, so an explicit filesystem check at boot time normally isn’t necessary.
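For instance, an /etc/fstab might contain entries like these (device names and mount points are hypothetical); the last field on each line is the check-order code just described, and the next-to-last field is used by the dump backup utility:
/dev/sda2   /         ext3      defaults   1 1
/dev/sda5   /home     ext3      defaults   1 2
/dev/sda6   /data     reiserfs  defaults   1 0
/dev/sda1   /windows  vfat      defaults   0 0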
Filesystem check options
The main fsck program supports several options, which it passes on to the filesystem-specific programs as necessary. Most of the interesting options are filesystem-specific. Some of the things you can specify for particular filesystems include the following (a few sample invocations appear after this list):
Backup superblocks
The ext2 and ext3 filesystems use superblocks, which contain many vital filesystem details. Because the superblocks are so important, these filesystems store backup superblocks on the disk. If the main superblock is damaged, fsck.ext2 can use a backup whose location you specify with -b. You can learn the location of the backup superblocks by typing mke2fs -n /dev/file, where /dev/file is the device file for the partition. Be sure to include the -n option or mke2fs will create a new filesystem, destroying the old one!
Detecting bad blocks
You can have fsck.ext2 scan for physically bad sectors on the disk and mark them as such by passing it the -c option. Doubling up this option (that is, using -cc) performs a non-destructive read/write test, which may catch problems that a read-only test will miss.
Forcing a check
Normally, most fsck utilities perform only minimal checks if the filesystem is marked as clean. You can force a check with the fsck.ext2 and fsck.jfs utilities by passing them the -f option. In reiserfsck, the --check option causes it to check for errors but not correct them, while --fix-fixable causes it to automatically fix certain types of errors.
Journal location
To tell fsck.ext2, fsck.jfs, or reiserfsck where an external journal is located, pass the location with the -j journal-location parameter.
Minimizing user interaction
The -p option to fsck.ext2 and fsck.jfs minimizes the questions they ask. The -n and -y options to fsck.ext2 cause it to assume answers of no or yes, respectively, to all questions. (A no response causes fsck to not make changes to the filesystem.) The -s option to xfs_check causes it to report only serious errors. This option can make it easier to spot a problem that prevents mounting the filesystem when the filesystem also has less important problems.
Rebuilding the superblock
The reiserfsck program supports rebuilding the superblock by using the --rebuild-sb option.
Rebuilding the filesystem tree
You can rebuild the entire tree of directories and files by using the --rebuild-tree option to reiserfsck. This option is potentially very risky, but it can fix major problems. Back up the partition using dd before using this option.
Omitting journal replay
The -o option to fsck.jfs causes the program to omit replaying the journal. This option is most useful when the journal itself has been corrupted.
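To illustrate, here are a few hedged sample invocations of the options just described (device names are hypothetical, and the backup superblock location is only an example; obtain real locations with mke2fs -n):
# fsck.ext2 -f -y /dev/sda5
# fsck.ext2 -b 32768 /dev/sda5
# fsck.ext2 -cc /dev/sda5
# reiserfsck --check /dev/sda6
# fsck.jfs -f -p /dev/sda7
The first command forces a full check of an ext2 or ext3 filesystem and assumes yes answers to all questions; the second repeats the check using a backup superblock; the third adds a non-destructive read/write scan for bad blocks; the fourth reports ReiserFS errors without correcting them; and the last forces a JFS check while minimizing the questions asked.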
You don’t normally need to run fsck manually except when the automatic check runs into problems during a reboot. In that case, Linux may drop you into a maintenance shell and advise you to run fsck manually. Of course, you can run this utility at your discretion, as well, and doing so may be advisable if your disk access is acting up, as when files are disappearing or the disk appears to have too much or too little free space.
Never run fsck on a filesystem that’s mounted for read/write access. Doing so can confuse Linux and lead to more filesystem corruption. If necessary, shut down the system and boot from an emergency system to run fsck on an unmounted filesystem, or run it on a filesystem that’s mounted for read-only access.
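If shutting down immediately is impractical, one hedged option is to remount the filesystem read-only before checking it (this will fail if any files on it are open for writing; the device name and mount point are illustrative):
# mount -o remount,ro /home
# fsck.ext2 /dev/sda5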
Journaling filesystems: Not a panacea
Many people seem to think of journaling filesystems as protection against filesystem errors. This isn’t their primary purpose, though. Journaling filesystems are designed to minimize filesystem check times after a crash or other severe error. Ext2fs has a good reputation as a reliable filesystem. Linux’s journaling filesystems also have good reputations, but you shouldn’t assume they’re any safer than ext2fs.
Because journaling filesystems minimize system startup time after power outages, some people have taken to shutting off Linux systems that use journaling filesystems without shutting Linux down properly. This practice is risky at best. Linux still caches accesses to journaling filesystems, so data from recently written files may be lost if you power off the computer without shutting it down first. Always shut down the computer with the shutdown utility (or a program that calls it, such as a GUI login screen) before turning off the power.
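For instance, from a root shell:
# shutdown -h now
Substituting -r for -h reboots the system instead of halting it.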
Recovering deleted files
Perhaps the most common type of filesystem problem is files that are accidentally deleted. Users frequently delete the wrong files or delete a file only to discover that it’s actually needed. Windows system users may be accustomed to undelete utilities, which scour the disk for recently deleted files in order to recover them. Unfortunately, such tools are rare on Linux. You can make undeletion easier by encouraging the use of special utilities that don’t really delete files, but instead place them in temporary holding areas for deletion later. If all else fails, you may need to recover files from a backup.
Trash can utilities
One of the simplest ways to recover “deleted” files is to not delete them at all. This is the idea behind a trash can — a tool or procedure to hold onto files that are to be deleted without actually deleting them. These files can be deleted automatically or manually, depending on the tool or procedure. The most familiar form of trash can utility for most users, and the one from which the name derives, is the trash can icon that exists in many popular GUI environments, including KDE and GNOME. To use a GUI trash can, you drag files you want to delete to its icon. The icon is basically just a pointer to a specific directory that’s out of the way or hidden from view, such as ~/Desktop/Trash or ~/.gnome-desktop/Trash. When you drag a file to the trash can, you’re really just moving it to that directory. If you subsequently decide you want to undelete the file, you can click or double-click the trash can icon to open a file browser on the trash directory. This enables you to drag the files you want to rescue out of the trash directory. Typically, files are only deleted from the trash directory when you say so by right-clicking the trash can icon and selecting an option called Empty Trash or something similar.
When you’re working from the command line, the rm command is the usual method of deleting files, as in rm somefile.txt. This command doesn’t use anything akin to the trash directory by default, and depending on your distribution and its default settings, rm may not even prompt you to be sure you’re deleting the files you want to delete. You can improve rm’s safety considerably by forcing it to confirm each deletion by using the -i option, as in rm -i somefile.txt. In fact, you may want to make this the default by creating an alias in your shell startup scripts. For instance, the following line in ~/.bashrc or /etc/profile will set up such an alias for bash:
alias rm='rm -i'
This configuration can become tedious if you use the -r option to delete an entire directory tree, though, or if you simply want to delete a lot of files by using wildcards. You can override the alias by specifying the complete path to rm (/bin/rm) when you type the command.
Forcing confirmation before deleting files can be a useful preventive measure, but it’s not really a way of recovering deleted files. One simple way to allow such recovery is to mimic the GUI environments’ trash cans: instead of deleting files with rm, move them to a holding directory with mv. You can then empty the holding directory whenever it’s convenient. In fact, if you use both a command shell and a GUI environment that implements a trash can, you can use the same directory for both.
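For example, using KDE’s ~/Desktop/Trash directory mentioned earlier (adjust the path to match your environment’s trash location), you might type:
mv somefile.txt ~/Desktop/Trash/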
If you or your users are already familiar with rm, you may find it difficult to switch to using mv. It’s also easy to forget how many files have been moved into the trash directory, and so disk space may fill up. One solution is to write a simple script that takes the place of rm, but that moves files to the trash directory. This script can simultaneously delete files older than a specified date or delete files if the trash directory contains more than a certain number of files. Alternatively, you could create a cron job to periodically delete files in the trash directory. An example of such a script is saferm. To use saferm or any similar script, you install it in place of the regular rm command, create an alias to call the script instead of rm, or call it by its true name. For instance, the following alias will work:
alias rm='saferm'
In the case of saferm, the script prompts before deleting files, but you can eliminate the prompt by changing the line that reads read answer so that it reads answer=A, and commenting out the immediately preceding echo lines. The script uses a trash directory in the user’s home directory, ~/.trash. When users need to recover “deleted” files, they can simply move them out of ~/.trash. This specific script doesn’t attempt to empty the trash bin, so users must do this themselves using the real rm, or you or your users can create cron jobs to do the task.
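If you would rather write your own wrapper than use saferm, a minimal sketch might look like the following (this is an illustrative example only, not the actual saferm script; it uses the same ~/.trash directory described above):
#!/bin/sh
# Illustrative trash-style replacement for rm: move the named files into
# ~/.trash instead of deleting them. Files with duplicate names overwrite
# older trashed copies.
mkdir -p "$HOME/.trash"
mv -- "$@" "$HOME/.trash/"
A cron job along the lines of find ~/.trash -type f -mtime +7 -exec rm -f {} \; could then purge files that have sat in the trash directory for more than a week.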
File recovery tools
Undelete utilities for Linux are few and far between. The Linux philosophy is that users shouldn’t delete files they really don’t want to delete, and if they do, they should be restored from backups. Nonetheless, in a pinch there are some tricks you can use to try to recover accidentally deleted files.
One of these tricks is the recover utility, available with most Linux distributions. Unfortunately, this tool has several drawbacks. The first is that it was designed for ext2fs, and so it doesn’t work with most journaling filesystems. (It may work with ext3fs, though.) Another problem is that recover takes a long time to do anything, even on small partitions. I frequently see network programs such as web browsers and mail clients crash when recover runs. Finally, in my experience, recover frequently fails to work at all; if you type recover /dev/sda4, for instance, to recover files from /dev/sda4, the program may churn for a while, consume a lot of CPU time, and return with a Terminated notice. In sum, recover isn’t a reliable tool, but you might try it if you’re desperate. If you do try to run it, I recommend shutting down unnecessary network-enabled programs first.
Another method of file recovery is to use grep to search for text contained in the file. This approach is unlikely to work on anything but text files, and even then it may return a partial file or a file surrounded by text or binary junk. To use this approach, you type a command such as the following:
# grep -a -B5 -A100 "Dear Senator Jones" /dev/sda4 > recover.txt
This command searches for the text Dear Senator Jones on /dev/sda4 and returns the five lines before (-B5) and the 100 lines after (-A100) that string. The redirection operator stores the results in the file recover.txt. Because this operation involves a scan of the entire raw disk device, it’s likely to take a while. (You can speed matters up slightly by omitting the redirection operator and instead cutting and pasting the returned lines from an xterm into a text editor; this enables you to hit Ctrl+C to cancel the operation once it’s located the file. Another option is to use script to start a new shell that copies its output to a file, so you don’t need to copy text into an editor.) This approach also works with any filesystem. If the file is fragmented, though, it will only return part of the file. If you misjudge the size of the file in lines, you’ll either get just part of the file or too much, possibly including binary data before, after, or even within the target file.
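As a hedged example of the script approach, you might capture the session like this and then search recover-session.txt at leisure (the log filename is arbitrary):
# script recover-session.txt
# grep -a -B5 -A100 "Dear Senator Jones" /dev/sda4
# exit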
Restoring files from a backup
Emergency recovery procedures — restoring most or all of a working system from a backup — are useful after a disk failure, security breach, or a seriously damaging administrative blunder. System backups can also be very useful in restoring deleted files. In this scenario, an accidentally deleted file can be restored from a backup. One drawback to this procedure is that the original file must have existed prior to the last regular system backup. If your backups are infrequent, the file might not exist. Even if you make daily backups, this procedure is unlikely to help if a user creates a file, quickly deletes it, and then wants it back immediately. A trash can utility is the best protection against that sort of damage.
As an example, suppose you create backups to tape using tar. You can recover files from this backup by using the --extract (-x) command. Typically, you also pass the --verbose (-v) option so that you know when the target file has been restored, and you use --file (-f) to point to the tape device file. You must also pass the name of the file to be restored:
# tar -xvf /dev/st0 home/al/election.txt
This command recovers the file home/al/election.txt from the /dev/st0 tape device. A few points about this command require attention:
Permissions
The user who runs the command must have read/write access to the tape device. This user must also have write permission to the restore directory (normally, the current directory). Therefore, root normally runs this command, although other users may have sufficient privileges on some systems. Ownership and permissions on the restored file may change if a user other than root runs the command.
Filename specification
The preceding command omitted the leading slash (/) in the target filename specification (home/al/election.txt). This is because tar normally strips this slash when it writes files, so when you specify files for restoration, the slash must also be missing. A few utilities and methods of creating a backup add a leading ./ to the filename. If your backups include this feature, you must include it in the filename specification to restore the file.
Restore directory
Normally, tar restores files to the current working directory. Thus, if you type the preceding command while in /root, it will create a /root/home/al/election.txt file (assuming it’s on the tape). I recommend restoring to an empty subdirectory and then moving the restored file to its intended target area. This practice minimizes the risk that you might mistype the target file specification and overwrite a newer file with an older one, or even overwrite the entire Linux installation with the backup.
Unfortunately, tar requires that you have a complete filename, including its path, ready in order to recover a file. If you don’t know the exact filename, you can try taking a directory of the tape by typing tar tvf /dev/st0 (substituting another tape device filename, if necessary). You may want to pipe the result through less or grep to help you search for the correct filename, or redirect it to a file you can search.
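For instance, either of these variants can help track down the election.txt file restored earlier:
# tar tvf /dev/st0 | less
# tar tvf /dev/st0 | grep election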
You can keep a record of files on a tape at backup time to simplify searches at restore time. Using the --verbose option and redirecting the results to a file will do the trick. Some incremental backup methods automatically store information on a backup’s contents, too. Some backup tools, such as the commercial Backup/Recover Utility, store an index of files on the tape. This index enables you to quickly scan the tape and select files for recovery from the index.
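For example, a hedged way to capture such a record when creating the backup (the log filename is arbitrary); because the archive itself goes to the tape device, the verbose listing appears on standard output and can be redirected:
# tar cvf /dev/st0 /home > /root/backup-contents.txt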
Summary
Linux, like any OS, is built on its filesystems. The ext2 filesystem has long been the standard for Linux, but over the course of development of the 2.4.x kernels, new journaling filesystems have been added as standard equipment. These filesystems give you several options that vary in subtle ways: disk space consumption by different types of files, support for ACLs, and so on. Most systems will work well with any Linux filesystem, but if disk performance is critically important to you, you may want to research the options further to pick the best one for your needs. You can also optimize filesystems in various ways, ranging from options at filesystem creation time to defragmenting and resizing filesystems. Unfortunately, filesystems don’t always work perfectly. Sometimes you may need to fix filesystem corruption, and various tools exist to help you do this. Users may also accidentally delete files, and recovering them can be a challenging task, although being prepared by using trash can utilities and performing regular backups can greatly simplify recovery operations.