My EC2 server wouldn’t boot after apt-get dist-upgrade: how I fixed it

So I have this EC2 server, which is a MySQL replication slave (henceforth known as “victim”). This server was originally running Alestic’s 64-bit paravirtual AMI for Ubuntu Oneiric 11.10 (ami-a8ec6098), but had later been upgraded to Precise with do-release-upgrade.

Yesterday, I performed an apt-get dist-upgrade on victim in order to upgrade its packages and kernel. Then I rebooted the server, but it refused to boot, hanging before SSH could come online. Checking the system log, I saw that it was unable to mount its root disk:

Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[11081144.842424] EXT4-fs (xvdf): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.

[11081145.506225] random: nonblocking pool is initialized
lxcmount stop/pre-start, process 179

 * Starting configure network device security[ OK ]
 * Starting Mount network filesystems[ OK ]
 * Starting Mount filesystems on boot[ OK ]
 * Stopping Mount network filesystems[ OK ]
 * Starting Populate and link to /run filesystem[ OK ]
 * Stopping Populate and link to /run filesystem[ OK ]
 * Stopping Track if upstart is running in a container[ OK ]
 * Starting Bridge socket events into upstart[ OK ]
 * Starting configure network device[ OK ]
 * Starting Initialize or finalize resolvconf[ OK ]

The disk drive for / is not ready yet or not present.
keys:Continue to wait, or Press S to skip mounting or M for manual recovery

Since this is an Amazon server, pressing a key is not an option: there is no interactive console. Interestingly, though, telling the instance to stop causes more of the system log to be printed as it performs a clean shutdown.
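Incidentally, the system log above is just the EC2 console output; it can also be pulled from the command line with something along these lines (the instance ID is a placeholder):

# instance ID is a placeholder
aws ec2 get-console-output --instance-id i-000000 --output text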

The obvious question was, why couldn’t it find or mount the root disk?

At this point I took a snapshot of the root disk so that I didn’t mess it up further with any of my experiments. I also started up a new server, “rescue”, from a pristine Alestic Precise AMI and dist-upgraded it, which should make it nearly identical to victim. But rescue restarted fine without hanging. There must have been some critical difference between my pristine rescue server and the hanging victim server; I just had to find it.
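Both of those steps can also be done from the CLI; roughly speaking, and with placeholder volume and AMI IDs, it looks something like this:

# IDs and key name below are placeholders
aws ec2 create-snapshot --volume-id vol-000000 --description "victim root disk before surgery"
aws ec2 run-instances --image-id ami-000000 --instance-type m1.small --key-name my-key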

In the AWS control panel, I noticed that victim was using an older Amazon “kernel” (which I think is just a PV-GRUB kernel used to boot the kernel installed on the instance’s disk). So I used the AWS CLI tool to make it the same as rescue's, which should be fine since rescue is the same architecture and guest kernel version:

aws ec2 modify-instance-attribute --instance-id i-000000 --kernel "{\"Value\":\"aki-fc8f11cc\"}"

No luck! It still had the same issue.

I detached the root volume (/dev/sda1) from victim, attached it to rescue (as /dev/xvdf), and mounted it at /mnt:

root@rescue# mount /dev/xvdf /mnt
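The detach/attach dance can be done from the console or the CLI; with placeholder IDs it looks roughly like this (a device of /dev/sdf shows up inside the instance as /dev/xvdf):

# volume and instance IDs are placeholders
aws ec2 detach-volume --volume-id vol-000000
aws ec2 attach-volume --volume-id vol-000000 --instance-id i-111111 --device /dev/sdf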

One reason for not being able to mount the root disk would be that its label had changed. The /etc/fstab file specifies how the root disk should be identified, so let’s take a look:

root@rescue# cat /mnt/etc/fstab
LABEL=cloudimg-rootfs	/	 ext4	defaults	0 0

Okay, so it looks for a disk with the label “cloudimg-rootfs”. What label does the disk have?

root@rescue# e2label /dev/xvdf
cloudimg-rootfs

So the disk label is correct, rats. Just for fun, I tried replacing the “LABEL=cloudimg-rootfs” part with plain old “/dev/xvda1”, but it still wouldn’t boot.
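For reference, that experiment just meant editing the first field of the fstab entry, so the line looked roughly like this:

/dev/xvda1	/	 ext4	defaults	0 0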

Maybe the new kernel or initrd was corrupt? I checked out /boot/grub/menu.lst and both victim and rescue were trying to boot the exact same kernel version. So I just removed victim’s /boot, initrd.img and vmlinuz and replaced them with the pristine ones from rescue:

root@rescue# rm -rf /mnt/initrd.img* /mnt/vmlinuz* /mnt/boot
root@rescue# cp -a /boot /mnt/
root@rescue# cp -a /vmlinuz /mnt/
root@rescue# cp -a /initrd.img /mnt/

Now I tried booting victim, but it still hung for the same reason! So the kernel or initrd wasn’t the issue.

By now I was getting desperate, so I tried something really crazy: I deleted victim's /etc directory and replaced it with the one from rescue:

root@rescue# rm -rf /mnt/etc
root@rescue# cp -a /etc /mnt

I booted up victim with this fiddled root disk, and it worked!! It booted fine! So the reason victim wouldn’t boot was some bad configuration in /etc. But the thing I most wanted to save from victim was its custom configuration, so I couldn’t just use rescue’s out-of-the-box defaults. So I deleted the victim disk I had mangled, restored it from its snapshot (back to its broken condition), and reattached and mounted it on rescue.
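Restoring from the snapshot just means creating a fresh volume from it and attaching that in place of the mangled one; again with placeholder IDs, roughly:

# snapshot, volume and instance IDs are placeholders
aws ec2 create-volume --snapshot-id snap-000000 --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-222222 --instance-id i-111111 --device /dev/sdf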

Now I just had to find the difference between the pristine configuration and the broken one that caused it not to boot. I started by installing all the packages I could remember from victim onto rescue, in order to minimise the size of the diff between their configurations. Finally, I ran a recursive diff, ignoring any “*~” emacs backup files:

root@rescue# diff -r --exclude "*~" /mnt/etc /etc > /root/victim-diff
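As an aside, rather than relying on memory for the package list, it could probably have been read straight off victim’s disk and replayed onto rescue with something like the following (I didn’t try this at the time):

root@rescue# chroot /mnt dpkg --get-selections > /root/victim-packages
root@rescue# dpkg --set-selections < /root/victim-packages
root@rescue# apt-get dselect-upgrade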

Since the system hung before anything exciting like the network was initialised or the root filesystem was mounted, I knew I could ignore MySQL, Apache, Postfix, Nagios, etc.: they are all started too late in the boot process to be the culprit. That didn’t leave many interesting changed files. There were some changed grub settings:

diff -r /mnt/etc/default/grub /etc/default/grub
7c7
< #GRUB_HIDDEN_TIMEOUT=0
---
> GRUB_HIDDEN_TIMEOUT=0
9c9
< GRUB_TIMEOUT=5
---
> GRUB_TIMEOUT=0
11c11
< GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0 nomdmonddf nomdmonisw nomdmonddf nomdmonisw"
---
> GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"

But moving the newer grub config in didn’t fix it. There was a change to the hardware clock settings that I had seen mentioned a couple of times in the system log:

diff -r /mnt/etc/default/rcS /etc/default/rcS
27a28
> HWCLOCKACCESS=no

But porting that change over didn’t fix it either. Finally, I noticed this:

Only in /mnt/etc/init: lxcguest.conf
Only in /mnt/etc/init: lxcmount.conf

That’s odd: I didn’t remember ever using LXC on this system, and it will never be an LXC guest. One of those configuration files has “mount” in its name, and both live in the /etc/init (upstart) directory, so they could easily be related to mounting problems during init! Had I really been using LXC? I chroot’d into the old drive in order to interrogate its dpkg catalog:

root@rescue# chroot /mnt
root@rescue# dpkg -l | grep lxc
rc  lxcguest    0.7.5-0ubuntu8.6   amd64  Linux container guest package
root@rescue# exit

The “rc” at the start indicates that the lxcguest package was installed at some point and later removed, but its configuration files were left behind. Great, that means I could blow away those old files:

root@rescue# rm /mnt/etc/init/lxcguest.conf /mnt/etc/init/lxcmount.conf
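In hindsight, purging the leftover package from inside the chroot would presumably have cleaned up the same conffiles, something like this, but the direct rm did the job:

root@rescue# chroot /mnt dpkg --purge lxcguest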

And now, glory of glories, the server booted!

[64304636.707613] random: nonblocking pool is initialized
 * Starting Mount filesystems on boot[ OK ]
 * Starting Populate and link to /run filesystem[ OK ]
 * Stopping Populate and link to /run filesystem[ OK ]
 * Stopping Track if upstart is running in a container[ OK ]
[64304638.007641] EXT4-fs (xvda1): re-mounted. Opts: discard
 * Starting Initialize or finalize resolvconf[ OK ]
 * Starting Signal sysvinit that the rootfs is mounted[ OK ]
[64304638.205979] init: mounted-tmp main process (302) terminated with status 1

...

Just to check that it wasn’t fixed by some combination of the other changes I had made, I deleted victim’s fiddled root disk, restored it from the snapshot once more, and removed only those two lxc config files. It booted fine, so nothing else I had changed was needed to fix it!

I hope that these debugging steps help someone out in repairing their own EC2 server. (And yes I realise that in most cases, building a new instance with Chef or Puppet would be a better solution, but this is what I had to work with!)