Monday, October 4, 2010

Linux Troubleshooting


Linux is legendary for its stability - once set up correctly, a Linux box, left to its own devices, will run trouble-free for a very long time. Most problems arise soon after installation or major configuration changes, and are the result of misconfiguration, typographical errors or the occasional hardware failure.
However, from time to time accidents do happen, even in the best-regulated environments . . .
A Linux Troubleshooting Toolkit
The best way to minimise the impact of those unforeseeable events is to prepare for them by assembling the recovery tools in advance.
Tom's Root Boot Disk
An essential part of every Linux professional's bag of tricks, this tiny (by today's standards) package unpacks to create a 1.722 MB floppy disk that is a complete Linux distribution with a selection of recovery tools - until you see how it's done you'll find it hard to believe a single floppy can contain so much!
An alternative version comes in El Torito (bootable CD-ROM) format. You can download tomsrtbt from http://www.toms.net/rb/
Knoppix
This is a popular Linux distribution, based on Debian, which boots and runs entirely from CD-ROM. While it is popular for demonstrations, or for letting interested users get a taste of Linux without having to install a distribution on the hard drive, it is also incredibly useful as a system repair tool. You can download Knoppix from http://www.knopper.net/knoppix/index-en.html (read the notes on software patents, then click on the KNOPPIX link - it's still there).
mkbootdisk
Most Linux distributions have a command to build a bootable floppy disk which can be used to repair a system. Red Hat Linux, for example, has the mkbootdisk command. In order to use this, you only need to know the desired kernel version to write to floppy, and you can find the current kernel version with the uname -r command:
mkbootdisk 2.4.20-8

or
mkbootdisk `uname -r`

In general, mkbootdisk and similar utilities will read various configuration files, such as /etc/fstab and /boot/grub/grub.conf, in order to work out the root filesystem, any required kernel command-line arguments and the drivers which will need to be loaded from the generated ramdisk image. One useful but not widely-known option for mkbootdisk is the --iso option, which makes a bootable CD-ROM image. This can then be updated with additional utilities, etc. if required.
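As a rough sketch of the --iso option, assuming a Red Hat style system with mkbootdisk installed, and assuming the --device option is used to name the output file (check the man page on your release). The output path under /tmp is purely illustrative:

```shell
# Build a bootable CD-ROM image for the running kernel.
kernel_version="$(uname -r)"
iso_image="/tmp/rescue-${kernel_version}.iso"   # hypothetical output path

if command -v mkbootdisk >/dev/null 2>&1; then
    # --iso asks for a CD image rather than a floppy; --device names the file to write.
    mkbootdisk --iso --device "$iso_image" "$kernel_version"
else
    echo "mkbootdisk not found; this sketch only applies to Red Hat style systems" >&2
fi
```

The resulting image can then be burned to CD with whatever burning tool you normally use.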
Other Boot Disks
Most Linux distributions allow you to boot from the first installation CD in a system repair or 'rescue' mode. For Red Hat, for example, using the first CD-ROM to boot with the command 'linux rescue' will boot the system and then attempt a number of basic repairs automatically. The repair script will attempt to identify all the Linux partitions on your hard drives and mount them in the correct location. At the end of this process, you should wind up with the system completely assembled and mounted under /mnt/sysimage.
Red Hat Linux Professional boxed sets of recent vintage also include a rather neat credit-card-sized rescue CD, and similar CDs are sometimes available from Linux-related company stands at trade shows.
Problems:
Can't Boot?
Watch the system closely as it boots, and take note of any error messages that appear. If the system complains that it is unable to mount the root filesystem, for example, this can be for any of several reasons:
  • The BIOS cannot find the boot loader. This sometimes happens after you've installed Linux to dual-boot with Windows, but - out of concern to not misconfigure the system - have asked the install program to place the boot loader in the Linux root (or /boot) filesystem. The problem is that the BIOS can't see it there, unless you make that the active partition. The simplest fix is to reinstall Linux and this time, let it place the LILO or GRUB boot loader into the Master Boot Record - don't worry, the Linux boot loaders are automatically set up to let you choose Linux or Windows at boot time. It is possible to perform a more complex fix, for example by copying the Linux boot loader sector into a file, and setting up the Windows NT/2K/XP boot loader to chain to it - but that is too complex to describe here (see http://www.lesbell.com.au/Home.nsf/web/Using+the+NT+Boot+Loader+to+Boot+Linux?OpenDocument where you'll find a longer article describing how to use the NT boot loader to boot Linux).
  • The kernel doesn't have a device driver to access the hard drive (e.g. a SCSI drive). Fix this by using the mkinitrd script to build a new initrd file that contains the correct drivers, or recompile the kernel to include the driver code. This usually happens because you've built a new kernel and slightly messed up the configuration.
  • The kernel doesn't have a filesystem driver to access the root partition. For example, if the root filesystem is formatted with ext3, then you will need the ext3 and jbd modules in the initrd or compiled into the kernel. Fix as for the previous problem. Again, this usually happens after building a new kernel.
  • The partition table has been modified, for example, by the installation of another operating system. In this case, edit the kernel command line (in /etc/lilo.conf or /boot/grub/menu.lst) and the contents of /etc/fstab to contain the correct entries.
  • Filesystems are corrupted, due to a power failure or system crash. Generally, after a system crash or power outage (what? No UPS?), the system will come up and repair itself. If you are using a journalling filesystem like ext3fs, jfs, xfs or reiserfs, it will usually perform a roll-forward recovery from its journal file and carry on. Even with the older ext2fs, the system usually runs an fsck (file system check) on the various file systems and repairs them automatically. However, just occasionally manual intervention is required; you might have to answer 'Y' to a string of questions (answering 'N' will get you nowhere unless you intend to perform really low-level repairs yourself in a last-ditch attempt to avoid data loss). In the worst case, you might have to reboot from rescue media and manually run the e2fsck (or similar) command against each filesystem in turn. For example:
e2fsck -p /dev/hda7

If the program complains that the superblock - the master block that links to everything else - is corrupted, it is useful to remember that the superblock is so critical that it is duplicated at regular intervals through the filesystem (starting at block 8193 on a filesystem with 1 KB blocks; the locations differ for other block sizes) and you can tell e2fsck to use one of the backups:
e2fsck -b 8193 /dev/hda7
  • One or more filesystems cannot be found and mounted: Check the contents of /etc/fstab - in making quick alterations here, typographical errors are common. You can use the e2label command to view the label of each filesystem: some distributions set these to the mount point so you can figure out what is what.
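Since typing slips in /etc/fstab are such a common culprit, a quick sanity check can save some squinting. The following is only a sketch: check_fstab is a made-up helper name, and all it does is flag lines that don't have between four and six whitespace-separated fields (the last two fields are optional). It won't catch a wrong device name or mount point.

```shell
# Report fstab lines that do not have between four and six fields.
# Comment lines and blank lines are skipped.
check_fstab() {
    awk '
        /^[ \t]*(#|$)/ { next }
        NF < 4 || NF > 6 {
            printf "line %d: expected 4-6 fields, found %d: %s\n", NR, NF, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

check_fstab /etc/fstab && echo "field counts look sane"
```

When working from rescue media, point it at the copy on the hard drive instead, for example /mnt/sysimage/etc/fstab.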

In each case, you will need to boot from some kind of rescue media, then work at the command line to repair the damage. If you boot from tomsrtbt or Knoppix, you will have editors and other utilities available. If you boot from the Red Hat installation CD in rescue mode, you will need to change the root directory so that the various system directories and filesystems are in the correct locations:
chroot /mnt/sysimage

See the box "The chroot Command" for details of why and how this works.
Forgot root password
If you have - really have - forgotten the root password for your system, it is still possible, in many cases, to log in and fix this. On some distributions, you can boot in single-user maintenance mode (runlevel 1) by appending a '1' or 'single' on the end of the normal kernel boot command line. With the LILO boot loader, for example, you can type
linux 1

to boot this way. With GRUB, it's a little more complex: you have to choose the boot menu item you want to use, then press 'e' to edit it, move to the kernel command line and press 'e' to edit it, append the '1' at the end of the line, press Enter to terminate editing and then press 'b' to boot it.
However, some distributions will still request the root password in runlevel 1. For those, you should append the option 'init=/bin/bash' to the kernel command line, e.g.
linux init=/bin/bash

Now, instead of running the init process to kick off all the startup scripts, the kernel will simply run a bash shell. Since the startup scripts have not run, you may have to mount other filesystems manually, and you will certainly have to remount the root filesystem read-write with the command:
mount -o remount,rw /

Now, you can set about removing the root password. To do this, simply edit the /etc/shadow file and remove the encrypted password field from the file - it's usually the second field of the first line. You can now reboot, log in as root and use the passwd command to reset the password.
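If you'd rather not edit the file by hand, the same change can be sketched with awk. The helper name clear_root_password is hypothetical, and it only prints the modified content; it never touches the file it is given, so you can inspect the result before putting it in place.

```shell
# Print a copy of a shadow-format file with root's password field emptied.
# Fields are colon-separated; the password hash is the second field.
clear_root_password() {
    awk -F: 'BEGIN { OFS = ":" } $1 == "root" { $2 = "" } { print }' "$1"
}

# Suggested use (left commented out so nothing is changed by accident):
# cp -p /etc/shadow /etc/shadow.bak
# clear_root_password /etc/shadow.bak > /etc/shadow
```

Keeping the .bak copy means you can put the original back if anything looks wrong.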
Security Warning!
Now that everyone knows this tip, you should take care to set a LILO or GRUB password to stop an attacker from editing the boot command line and breaking into your system this way. Of course, an attacker could also remove the root password by booting from floppy or CD, so you should set the system to boot from hard drive first, and then password-protect the BIOS settings, too!
Can't Eject CD-ROM?
You can normally eject a CD using the eject command (and you can close the drive again later with eject -t). But what if you get a message:
eject: unable to eject, last error: Invalid argument

The problem here is that something is accessing the CD-ROM drive - but what? You can use the fuser command to find out:
fuser /dev/cdrom

will show processes that have an open file or are otherwise accessing the CD-ROM drive. The command
fuser -uik /dev/cdrom

will show you the process ID and user that "owns" the drive, and will interactively allow you to kill the process.
No sound
Sound configuration is fairly tricky unless you know exactly what type of sound hardware you have - the chipset, not the brand of card. The simplest solution is to use the distribution's own sound configuration command - for Red Hat, this is redhat-config-soundcard or sndconfig (for the older versions).
X resolution too low or too high
Try using the left Ctrl and Alt keys with the + and - keys on the numeric pad to cycle through the various resolutions available on your system. You can also manually edit the XF86Config file (look in /etc/X11/ or nearby for this, depending on your distribution), then find the relevant Modes line, and comment out or remove inappropriate modes.
For example, if my monitor couldn't cope with 1400 x 1050 resolution, I would remove that entry from the Modes line in my XF86Config file:
Section "Screen"
       Identifier "Screen0"
       Device     "Videocard0"
       Monitor    "Monitor0"
       DefaultDepth     24
       SubSection "Display"
               Depth     24
               Modes    "1400x1050" "1280x1024" "1280x960" "1024x768" "800x600" "640x480"
       EndSubSection
EndSection

Sometimes, increasing the DefaultDepth entry will reduce the maximum resolution to something that your monitor can cope with.
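Removing a mode from the Modes line can also be scripted. This is a minimal sketch, assuming the mode appears in double quotes on a line beginning with Modes, as in the example above; drop_mode is an illustrative name, and the function prints the edited file rather than changing it.

```shell
# Print the config with one mode removed from every Modes line.
# $1 is the mode (e.g. 1400x1050), $2 is the config file.
drop_mode() {
    sed "/^[[:space:]]*Modes/ s/[[:space:]]*\"$1\"//" "$2"
}
```

For example, drop_mode 1400x1050 /etc/X11/XF86Config > /tmp/XF86Config.new writes an edited copy that you can compare with the original before installing it.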
Find the Right Driver Module
You can make the system attempt to load every device driver module of any given type in turn by using the command
modprobe -t type \*

where type is the name of a directory under /lib/modules/kernelver/kernel. For example:
modprobe -t net \*

will attempt to load most network drivers, one after another.
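Before letting modprobe loose on a whole class of drivers, it can be handy to see which module files are candidates. Here is a sketch that lists them; list_modules is a hypothetical helper, the optional second argument is only there so a different module tree can be inspected, and the file suffixes (.o for 2.4 kernels, .ko or compressed .ko.* for later ones) are an assumption about how your kernel packages its modules.

```shell
# List module files that live under a directory named $1 in the module tree.
# $2 optionally overrides the module tree (defaults to the running kernel's).
list_modules() {
    module_root="${2:-/lib/modules/$(uname -r)}"
    find "$module_root" -type f \
        \( -name '*.o' -o -name '*.ko' -o -name '*.ko.*' \) \
        -path "*/$1/*" | sort
}

list_modules net
```

Note that this matches any directory called net anywhere in the tree, so it may list protocol modules as well as network card drivers.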
Trouble-shooting techniques
Use pairs of similarly-configured systems
Quick things to check:
Is a filesystem full? This can show up in lots of different ways: being unable to save files, print jobs not spooling correctly (especially on Samba print/file servers), and so on. Use the df command to see available space:
[root@freya home]# df -H
Filesystem              Size  Used Avail Use% Mounted on
/dev/Volume00/LogVol00 520MB 254MB 240MB 52%  /
/dev/hda3              128MB  21MB 101MB 17%  /boot
/dev/Volume00/LogVol03 2.2GB 134MB 1.9GB  7%  /home
/dev/Volume00/LogVol05 520MB 8.5MB 485MB  2%  /opt
none                   264MB     0 264MB  0%  /dev/shm
/dev/Volume00/LogVol02 1.1GB  36MB 969MB  4%  /tmp
/dev/Volume00/LogVol01 4.3GB 3.0GB 1.1GB 75%  /usr
/dev/Volume00/LogVol06 1.1GB 101MB 903MB 11%  /usr/local
/dev/Volume00/LogVol04 3.2GB 2.3GB 756MB 75%  /var
/dev/hda1               16GB  13GB 2.8GB 83%  /mnt/winc

Remember that a filesystem can fill up either because almost all of its data blocks are used up (some are reserved for the root user, just to get out of trouble) or because all its i-nodes (there is one of these per file) are used up.
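Both conditions can be checked with df: the default output reports data blocks, and the -i option reports i-nodes instead. The sketch below filters either report for filesystems at or above a given percentage. The helper name over_threshold and the 90% limit are arbitrary choices, and it assumes mount points contain no spaces.

```shell
# Read df -P style output on stdin and print the mount points whose
# usage percentage (fifth column) is at or above the limit given as $1.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6
    }'
}

df -P  | over_threshold 90    # short of data blocks
df -Pi | over_threshold 90    # short of i-nodes
```

A filesystem that shows up only in the second list has plenty of space but too many small files.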
If you need to make space by deleting some large files, use the command 'ls -lS' to get a directory listing that is sorted by file size. To scan an entire filesystem (e.g. /home or /var) for the largest files, change to its top-level directory and use the command (the -a option is needed because du on its own reports only directories):
du -a | sort -n

The largest files and directories will be at the end of the listing.
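If you only want regular files in the listing, and don't want the scan to wander onto other mounted filesystems, find can do the selection. This is a sketch rather than a polished tool; largest_files is a made-up name and the default of ten results is arbitrary.

```shell
# Print the N largest regular files (sizes in KB) under a directory,
# staying on one filesystem. $1 is the directory, $2 is N (default 10).
largest_files() {
    find "$1" -xdev -type f -exec du -k {} + | sort -n | tail -n "${2:-10}"
}
```

For example, largest_files /var 20 lists the twenty biggest files under /var, with the largest last.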
Adding New Drives
Sometimes the growth of a filesystem - particularly /home - means that it is necessary to find it a new home; in other words, add another physical disk and relocate the filesystem to its new home where there is room to grow.
Here is the procedure for adding another drive, with a single partition which will become the new /home filesystem (I'm assuming fdisk has already been used to partition it):
As root:
# mkdir /mnt/newhome
# mkfs -t ext2 /dev/hdb1
# mount /dev/hdb1 /mnt/newhome
# (cd /home && tar cf - .) | (cd /mnt/newhome && tar xpf -)

then
# cd /
# mv /home /home.old
# mkdir /home
# umount /mnt/newhome
# mount /dev/hdb1 /home

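To make the change permanent, update the /home entry in /etc/fstab (or add one) so that /dev/hdb1 is mounted on /home at boot. Once the new /home directory tree has been checked out, you can then safely
# cd /
# rm -rf /home.old
# rmdir /mnt/newhome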

to clean up.
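One way to check out the new tree before deleting the old one is to compare the two recursively. The sketch below wraps the tar pipeline used above and a diff -r comparison in two small functions. The names copy_tree and trees_match are illustrative, and note that diff -r compares file contents only, not ownership or permissions.

```shell
# Copy everything under $1 into the existing directory $2, preserving permissions.
copy_tree() {
    (cd "$1" && tar cf - .) | (cd "$2" && tar xpf -)
}

# Succeed only if the two directory trees have identical names and contents.
trees_match() {
    diff -r "$1" "$2" >/dev/null
}
```

After the move, something like trees_match /home.old /home && echo "copy verified" gives some reassurance before the rm is run.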
Network Problems
Use the ifconfig command to check whether an interface has been configured and is up.
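A minimal sketch of that check, assuming the interface is called eth0 (substitute your own). The two helper functions are hypothetical; they simply look for the UP flag and an IPv4 address line in whatever ifconfig prints.

```shell
# Succeed if ifconfig output on stdin shows the UP flag.
interface_is_up() {
    grep -qw 'UP'
}

# Succeed if ifconfig output on stdin shows an IPv4 address line.
interface_has_address() {
    grep -q 'inet '
}

info=$(ifconfig eth0 2>/dev/null)
if printf '%s\n' "$info" | interface_is_up; then
    echo "eth0 is up"
else
    echo "eth0 is down or does not exist"
fi
if printf '%s\n' "$info" | interface_has_address; then
    echo "eth0 has an IP address"
else
    echo "eth0 has no IP address configured"
fi
```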
Long delays while starting daemons at boot time
If the system seems to stop for 30 seconds or more while starting - particularly when starting network daemons like sendmail or NFS - then the problem is likely to be either DNS misconfiguration, a DNS outage, or no network connection at all. Check that /etc/resolv.conf contains the correct DNS addresses, check that /etc/hosts contains the correct IP address and names for this machine, and then check that the network interface is up.
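To see at a glance which DNS servers the machine is actually configured to use, the nameserver lines can be pulled out of the resolver configuration. A small sketch, with list_nameservers as an illustrative name:

```shell
# Print the DNS server addresses listed in a resolv.conf style file.
# $1 is the file to read (defaults to /etc/resolv.conf).
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

Each address it prints can then be tested by hand, for example with ping -c 1, to tell a misconfiguration from an outage.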
Troubleshooting Techniques and Skills
The first rule is: Use the log files - they are the primary source of debugging information and clues. You can examine the main log file with the command:
tail /var/log/messages

and you can watch it continuously by running the command:
tail -f /var/log/messages

in a window while you work. For security and login-related problems, check the file /var/log/secure. There are other log files and directories that relate to different subsystems in /var/log, and you should never overlook them.
If trying to resolve boot-time problems, use the command:
dmesg | less

to review the kernel ring buffer.
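When the logs are long, a crude filter helps to pick out the lines worth reading first. This is only a sketch: recent_problems is a made-up helper, and the keyword list is a guess at what tends to matter, so adjust it to suit.

```shell
# Read log text on stdin and print the last N lines (default 20)
# that mention an error, failure, warning or denial.
recent_problems() {
    grep -i -E 'error|fail|warn|denied' | tail -n "${1:-20}"
}
```

It works on any of the sources above, for example recent_problems 50 < /var/log/messages or dmesg | recent_problems.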
The next rule is to compare similarly-configured systems, if you have them. Often, you can see obvious differences in the configuration files between a working system and the broken system.
Next: if you are stumped, talk the problem over with a colleague or friend. They don't have to know the perfect solution - often, their suggestions can trigger a new line of thinking or remind you of something you have overlooked.
If you don't have someone you can talk to, then use online resources. Get to know how to perform searches at http://www.google.com/linux, and how to search the comp.os.linux and similar newsgroups at http://groups.google.com. On many occasions, I've turned up answers online after exhausting my own ideas.
Problem Avoidance Techniques
Keep a system change log. Whenever you make changes to the system, write them into the log. In general, if you never make changes to a system, it will just keep running - so that if the system breaks, the problem is usually related to recent changes.
Before making changes to critical system configuration files, make a backup copy which you can restore if everything goes pear-shaped. For example:
cp /etc/fstab /etc/fstab.good
vi /etc/fstab
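If you edit the same file more than once, a fixed name like fstab.good gets overwritten. One possible refinement, sketched here with a hypothetical backup_file helper, is to add a timestamp to each backup so that earlier copies survive:

```shell
# Copy a file to <name>.<YYYYMMDD-HHMMSS>, preserving mode and timestamps,
# and print the name of the backup that was created.
backup_file() {
    backup="$1.$(date +%Y%m%d-%H%M%S)"
    cp -p "$1" "$backup" && echo "$backup"
}
```

A typical session would be saved=$(backup_file /etc/fstab), then vi /etc/fstab, then diff "$saved" /etc/fstab to review exactly what changed.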

There is no substitute for learning as much as possible about how the system works, and the role of the various configuration files in /etc, the daemon start/stop scripts in /etc/rc.d/init.d, how the init process works, and so on.
And, of course, the most important System Administration Rule of all: Never make changes after three p.m. on a Friday!
The chroot Command
The chroot command is extremely useful for both system security and for system repair. Its basic syntax is:
chroot new-root-dir [command ...]

and its purpose is to run the specified command with the root directory changed to new-root-dir. If no command is specified, the default behaviour is to run an interactive shell (usually a bash shell). For example, the command:
chroot /var/ftp

will run a command shell in /var/ftp. However, note that the behaviour is to change the root directory first, and then try to invoke the command or shell, so that there had better be a file /var/ftp/bin/bash (which there would be, on many systems). In addition, the command will usually need to be statically linked, as otherwise it would attempt to load libraries from /lib, which is now /var/ftp/lib.
The chroot command is often used to start network daemons on servers - this is so that if an attacker manages to compromise the daemon, perhaps through a buffer overflow, he is unable to navigate around the entire system directory tree, but is instead constrained within a 'chroot jail'.
A major use of the chroot command is to change the root directory of the system after booting from a repair floppy or CD. For example, if you boot a Red Hat installation CD with the command 'linux rescue', the root file system is actually a RAM disk, and the root filesystem on your hard drive is mounted as /mnt/sysimage. Commands you give will load programs from /bin and /sbin on the RAM disk, which is obviously limited. To get access to those directories on the hard drive, you will need to change your root directory with the command
chroot /mnt/sysimage


