Login bypass in Ubiquiti airMAX/airOS before 8.0.2, 7.2.5, 6.0.2, 5.6.15 if airControl web-UI was used

After seeing this arbitrary command execution vulnerability in Ubiquiti equipment, discovered by SEC Consult, I was intrigued. In that bug, code that would have been secure on a more recent version of PHP was rendered vulnerable because of the ancient PHP version used (2.0.1, which is nearly 20 years old). I wanted to see what other bugs might be caused by PHP that works in unexpected ways.

My friend owns a “NanoBeam AC” running firmware WA_v8.0.1, so I downloaded that firmware from Ubiquiti’s website and unpacked it with binwalk. I found a bunch of PHP scripts, a custom patched PHP 2.0.1 binary, and a custom patched Lighttpd server which handles session management and serves the files.

In those PHP scripts I saw a ton of opportunities for Ubiquiti to get things wrong because of the number of calls they made to execute external programs using the shell. It’s difficult to run commands through the shell securely, because of the potential for characters in the command to be interpreted by the shell as parameter separators or shell metacharacters. This could allow an attacker to inject unexpected parameters, overwrite unexpected files, or execute unexpected commands. Since Ubiquiti already add their own code to the PHP binary, it seems silly that they wouldn’t add their own spawn() command, which would bypass shell interpretation of the command and so solve all of those issues (including this login bypass vulnerability) in one fell swoop, but I digress.
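To make the difference concrete, here is a minimal Python sketch (not Ubiquiti's code, and not PHP 2) that contrasts handing a command string to the shell with passing an argument vector directly, which is roughly what a spawn()-style call would do. The `echo` command and the hostile input are assumptions chosen purely for the demo, and it assumes a POSIX system with /bin/sh available.

```python
import subprocess

# Hypothetical attacker-controlled value containing a shell metacharacter.
hostile = "hello; echo INJECTED"

# Unsafe: the string is re-parsed by /bin/sh, so ';' starts a second command.
via_shell = subprocess.run(
    "echo " + hostile, shell=True, capture_output=True, text=True
).stdout

# Safer: the value reaches the program as a single argv entry, with no shell involved.
via_argv = subprocess.run(
    ["echo", hostile], capture_output=True, text=True
).stdout

print(repr(via_shell))
print(repr(via_argv))
```

In the first call the shell runs two commands; in the second, the whole string is just data for one program.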

The potential vulnerabilities I saw were all gated behind authentication, so although they would be vulnerable to CSRF payloads sent to a user who was currently logged in to their router (there is no CSRF token protection), they weren’t very interesting to me, as I don’t imagine that people spend that much time logged in to their routers. (There has been a lot of work on improving this code in more recent firmware releases, so most/all of this code is fixed by now anyway.)

Which files are accessible without authentication? Well, let’s check /etc/lighttpd/lighttpd.conf:

airos.allow = (
   ...
   "jsl10n.cgi",
   "poll.cgi",
   "/login.cgi",
   "/ticket.cgi",
   ...
)

Hm, what’s ticket.cgi exactly? It turns out that this equipment has a facility for an administrator or automated system to SSH to the router and generate a login “ticket” for the device, which is a 32-character random string. This ticket can then be passed as a link to some member of your organisation’s support staff, so that they can log in to the router to update its configuration without needing to know the admin password or SSH credentials for the router. It provides a temporary-access facility.

Ubiquiti’s “airControl” administration tool creates these tickets automatically for you when you use the “Open Web-UI” feature. Without airControl, an admin could manually generate a ticket for the root account by running a command like this on the device:

/bin/ma-ticket-add /tmp/.tickets.tdb 2de09cb52c5d9aab729e155336167e03 root

This creates a new ticket database, /tmp/.tickets.tdb (this file doesn’t exist by default and is removed on reboot) and adds a single ticket to it with that specified 32-character value, which will log the user in as the specified user (here, “root”).

A user can then log in to the device as root by visiting the URL “http://192.168.1.1/ticket.cgi?ticketid=2de09cb52c5d9aab729e155336167e03”, where 192.168.1.1 is the IP address of the device. Visiting this link causes the ticket to be consumed and deleted from the ticket database.

If someone has generated and subsequently used a login ticket since the last reboot, there will be an empty database of tickets in /tmp/.tickets.tdb (since all tickets have been used and removed from the database). This is the situation that opens the vulnerability.

Let’s take a look at the vulnerable code in ticket.cgi:

if (isset($ticketid) && (strlen($ticketid) > 0)) {
   /* check ticket existence */
   $cmd = "/bin/ma-show /tmp/.tickets.tdb " + $ticketid;
   exec(EscapeShellCmd($cmd),$lines,$rc);

   if ($rc == 0) {
      $user_regexp = "[[:space:]]+user:[[:space:]]+\\'([[:print:]]+)\\'$";
      $username = "mcuser";
      $i = 0; $size = count($lines);
      while ($i < $size) {
         if (ereg($user_regexp, $lines[$i], $res)) {
            $username = $res[1];
            $i = $size;
         }
         $i++;
      }
      /* authorize the session, that brought this ticket */
      $session = get_session_id($$session_id, $AIROS_SESSIONID, $HTTP_USER_AGENT);
      $cmd = "/bin/ma-auth /tmp/.sessions.tdb " + $session + "  " + $username;
      exec(EscapeShellCmd($cmd), $lines, $rc);

We provide a non-empty ticket ID as a URL parameter ticketid. This is passed to the /bin/ma-show binary, which looks for that ticket ID in the ticket database. If that ticket is found, ma-show prints out the contents of the ticket and returns a zero exit code to indicate success (otherwise it returns a non-zero exit code). The PHP code parses the content of the ticket to find out which user the ticket is for, and finally it creates a logged-in session for that user using the ma-auth binary (the same binary that is used to create a session during a regular login).

Okay, so how can this go wrong? Well, ma-show actually has a bonus feature. If you call it with no ticket ID argument, it prints out every ticket in its database, and sets its exit code to the number of tickets that it printed out. We can trigger this by passing a single space character in as our ticket ID. Since the shell will treat it as whitespace and discard it, ma-show will only see one argument on its command line.
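A rough way to see this without a device is to model the shell's word splitting in Python with `shlex.split`. This is only a sketch: the helper name `argv_seen_by_ma_show` is made up for illustration, and it assumes (as described above) that the space survives EscapeShellCmd unchanged.

```python
import shlex

def argv_seen_by_ma_show(ticketid):
    # Mirrors the string concatenation in ticket.cgi, then splits the
    # command line into words the way a POSIX shell would.
    cmd = "/bin/ma-show /tmp/.tickets.tdb " + ticketid
    return shlex.split(cmd)

# A normal ticket ID arrives as a third word on the command line.
print(argv_seen_by_ma_show("2de09cb52c5d9aab729e155336167e03"))

# A single space passes the strlen() > 0 check but vanishes during splitting.
print(len(" ") > 0, argv_seen_by_ma_show(" "))
```

With the space as the ticket ID, only the program name and the database path are left, which is the "no ticket ID" case.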

If the ticket database is empty, ma-show’s return code ($rc) will be zero (since it’ll print zero tickets), so the PHP code will consider the ticket lookup to be successful! However, since ma-show won’t print out any tickets, the parsing to find a username in the output will fail, and at first glance it appears that we could only authenticate as “mcuser”, an account that may not actually exist on the device.

This is where the magic of PHP 2 comes in. In PHP 2, when you add a parameter to the URL, it causes that value to be set into a global variable with the same name as the parameter (what newer versions of PHP would call “register globals”). And the exec() command appends its output to whatever is already in the $lines array. So we can supply a ?lines[]= parameter in the URL to effectively add our own user grant to the otherwise-empty output of the first exec() call.
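The whole chain can be modelled in a few lines of Python. To be clear about what is assumed here: `fake_ma_show` is a hypothetical stand-in for the real binary, the exact text format that ma-show prints is a guess shaped to fit the regular expression in ticket.cgi, and the `query` dictionary stands in for PHP 2 turning URL parameters into globals. It is a sketch of the logic, not a reproduction of the firmware.

```python
import re

# Python rendering of $user_regexp ([[:print:]] approximated as printable ASCII).
USER_RE = re.compile(r"\s+user:\s+'([\x20-\x7e]+)'$")

def fake_ma_show(ticket_db, ticketid):
    """Hypothetical stand-in for /bin/ma-show: returns (output_lines, exit_code)."""
    if ticketid.strip() == "":
        # No ticket argument: print every ticket, exit code = number printed.
        lines = [f" user: '{user}'" for user in ticket_db.values()]
        return lines, len(lines)
    if ticketid in ticket_db:
        return [f" user: '{ticket_db[ticketid]}'"], 0
    return [], 1

def ticket_cgi(query, ticket_db):
    """Models ticket.cgi: returns the user that would be authorised, or None."""
    ticketid = query.get("ticketid", "")
    # Register-globals behaviour: a lines[] URL parameter pre-populates $lines.
    lines = list(query.get("lines", []))
    if len(ticketid) == 0:
        return None
    output, rc = fake_ma_show(ticket_db, ticketid)
    lines.extend(output)  # exec() appends to whatever is already in the array
    if rc != 0:
        return None
    username = "mcuser"
    for line in lines:
        match = USER_RE.search(line)
        if match:
            username = match.group(1)
            break
    return username

empty_db = {}
print(ticket_cgi({"ticketid": " "}, empty_db))
print(ticket_cgi({"ticketid": " ", "lines": [" user: 'root'"]}, empty_db))
print(ticket_cgi({"ticketid": " "}, {"2de09cb52c5d9aab729e155336167e03": "root"}))
```

The third call shows why the database has to be empty: with one unused ticket present, the model's exit code is 1 and the lookup is treated as a failure.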

The upshot

So in the situation where the ticket database is present, but empty (e.g. airControl’s “open web-UI” feature has been used since the last reboot), it is possible for an unauthenticated remote user to log in as any user, without knowing any ticket ID, just by visiting a specially-crafted URL.

I sent the reproduction instructions to my friend, and sure enough, they were able to get root on their Nanobeam using it.

The fix

I reported this vulnerability to Ubiquiti through HackerOne on 2017-03-21. The vulnerability was promptly patched in airOS v8.0.2, v7.2.5, v6.0.2 and v5.6.15 (2017-03-28) by adding this code, which ensures that the ticket ID is exactly 32 hexadecimal digits (and so cannot be empty or whitespace) before calling ma-show:

if (!ereg("^[[:xdigit:]]{32}$", $ticketid)) {
   Header("Location: $redir_url");
   exit;
}
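For comparison, the same check can be sketched in Python. The function name `is_valid_ticket_id` is made up for this example, and `[0-9A-Fa-f]` is used as the ASCII equivalent of the POSIX `[[:xdigit:]]` class.

```python
import re

# ASCII equivalent of the POSIX pattern ^[[:xdigit:]]{32}$ used in the patch.
TICKET_RE = re.compile(r"[0-9A-Fa-f]{32}")

def is_valid_ticket_id(ticketid):
    # fullmatch() requires the whole string to match, so stray whitespace,
    # extra characters and empty values are all rejected.
    return TICKET_RE.fullmatch(ticketid) is not None

print(is_valid_ticket_id("2de09cb52c5d9aab729e155336167e03"))
print(is_valid_ticket_id(" "))
```

Because the single-space ticket ID no longer gets as far as ma-show, the "print every ticket" code path can't be reached from the web.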

Ubiquiti report that this is also fixed in airGateway v1.1.9 (2017-03-28) and airFiber v3.7-rc3, v3.4.3, v3.2.3 (2017-04-07), products that I am not familiar with.

The airControl software now includes a mitigation (since 2.0.2 and 2.1-beta7) to prevent the “Open Web-UI” feature from opening up the vulnerability even when used against old firmware.

One way of achieving this mitigation would be to create two random tickets at a time, one of which will never be consumed, so that the ticket database is never emptied. If you were building your own custom authentication system on top of the ticket functionality, this is what you could do to avoid the vulnerability when used against older firmware.
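As a rough sketch of that idea (this is not airControl's actual implementation, and the function names are made up), a custom system could generate the pair of ticket IDs like this and then register each one on the device with ma-ticket-add as shown earlier:

```python
import secrets

def new_ticket_id():
    # 16 random bytes rendered as 32 hexadecimal characters.
    return secrets.token_hex(16)

def tickets_for_session():
    """Return (ticket to hand out, placeholder ticket that is never used)."""
    return new_ticket_id(), new_ticket_id()

login_ticket, placeholder_ticket = tickets_for_session()
print(login_ticket, placeholder_ticket)
```

Only the first ticket is ever given to anyone; the second one just sits in the database so that it never becomes empty.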

Timeline

2017-03-21 – Reported to Ubiquiti through HackerOne
2017-03-22 – Receipt confirmed and Ubiquiti says they’re working on a fix
2017-03-28 – Firmwares released with fix
2017-04-11 – US$1000 bounty awarded
2017-04-14 – Ubiquiti suggest a 120 day disclosure period, I agree.
2017-08-17 – Bounty upgraded to US$6000
2017-08-18 – Disclosure

Release notes for 8.0.2

airOS8 Firmware Revision History
====================================================================
Supported products
  * Rocket 5AC Lite, model: R5AC-Lite
  * Rocket 5AC PTP AirPrism, model: R5AC-PTP
  * Rocket 5AC Multi-Point AirPrism, model: R5AC-PTMP
  * PowerBeam 5AC, models: PBE-5AC-500, PBE-5AC-620, PBE-5AC-300, PBE-5AC-400
  * PowerBeam 5AC 300 ISO, model: PBE-5AC-300-ISO
  * PowerBeam 5AC 400 ISO, model: PBE-5AC-400-ISO
  * PowerBeam 5AC 500 ISO, model: PBE-5AC-500-ISO
  * NanoBeam 5AC 19dBi, model: NBE-5AC-19
  * NanoBeam 5AC 16dBi, model: NBE-5AC-16
  * LiteBeam 5AC 23dBi, model: LBE-5AC-23
  * LiteBeam AC 16 dBi 120 degrees, model: LBE-AC-16-120
  * Rocket 5AC Prism, model: R5-AC-PRISM

====================================================================
Version 8.0.2 (XC, WA) - Service Release (March 28, 2017)
--------------------------------------------------------------------

New:
- New: SNMP OIDs for CPU and Memory utilisation
- New: Update dropbear to v2016.74
- New: OpenSSL update to v1.0.2k
- New: libevent update to v2.1.8

Fixes:
- Fix: PTP mode stability and performance improvements
- Fix: Security fixes and improvements
- Fix: ATPC fast restart added
- Fix: Restore initial TX power on AP/PTP when ATPC is turned off
- Fix: Revert Device Name strictness for DHCP Client (escape only hashtag ‘#’ symbol which breaks DHCP Client operation)
- Fix: Station fails re-authentication with AP when the same SSID is used for other APs
- Fix: Stations start disassociating from AP (PTMP)
- Fix: ATPC feature enable/disable and ATPC target signal change interrupts wireless link
- Fix: Wrong distance (100km) reporting after switching from Fixed to Auto Distance in airMAX PTP mode
- Fix: Flow Control fix for WA products

WEB UI:
- WEB UI: Don't allow to remove BRIDGE0 interface containing WLAN0
- WEB UI: "(Auto)" label missing on STA's Remote statistics in case ATPC is enabled on AP (PTP mode only)
- WEB UI: Show more detailed error messages when upgrading invalid firmware
- WEB UI: Improved Station List for small screens like mobile
- WEB UI: Change status.cgi output type from text/html to application/json
- WEB UI: Password change validation fix

Passthrough more than 4 PCIe devices to Proxmox 4.4 and 5 guests

By default in Proxmox 4.4 and 5.0, you are unable to pass through more than 4 PCIe devices to the guest. If you try, you’ll get an error when attempting to start the VM which reads:

vm 100 - unable to parse value of 'hostpci4' - unknown setting 'hostpci4'

Passed-through PCIe devices are attached to the four ports called “ich9-pcie-port-{1,2,3,4}” which are defined in /usr/share/qemu-server/pve-q35.cfg. These ports occupy PCIe function numbers 0-3, leaving function numbers 4-7 unused.

It’s a simple matter to add definitions for an extra 4 ports to use up those spare function numbers in /usr/share/qemu-server/pve-q35.cfg.

Fix for macOS Sierra 10.12.4+ “don’t steal mac OS” error on boot on Proxmox 4 and 5

In Sierra 10.12.4, macOS added some extra copy protection which is able to tell that the SMC emulation that QEMU provides is not a real Mac. This causes a fatal error during boot on Proxmox 5 and earlier.

One way of fixing this would be to remove the SMC device from the virtual machine’s arguments, and use FakeSMC.kext instead, like a regular Hackintosh, but this is inelegant.

Instead, we can patch QEMU to fix the SMC support.

Accelerate IO for macOS Sierra Proxmox guests by passing through an NVMe SSD

Recently I migrated my MacBook Pro into a Proxmox virtual machine to use as my daily-driver. This made for a rather large stepdown in IO performance, since my MacBook used an SSD, and Proxmox was using a RAIDZ1 array of spinning disks. On top of the IOPS penalty for spinning disks, there are currently no macOS drivers for the virtio SCSI paravirtual device, so we have to use IDE/SATA emulation instead, which is very slow (although this may change in the near future).

One way to improve things would be to use PCIe passthrough to pass through a whole physical SATA controller to the guest. This would eliminate almost all of the performance penalty of the virtualised SATA controller. But there’s a new option for drive passthrough: NVMe SSDs.

NVMe is a new standard for operating systems to communicate with a disk controller, which has been specifically designed to extract the most speed possible from SSDs. NVMe SSDs are PCIe devices (typically x4), so we can pass them straight through to macOS. I’m using the Samsung 950 Pro. You might also consider the faster 960 Pro.

The only missing piece of the puzzle is NVMe support in macOS Sierra. Thankfully, modern Macs have begun shipping with NVMe SSDs inside, so we have an official Apple driver we can use. It just needs to be patched to accept our SSDs.

Using Clover UEFI boot with Sierra on Proxmox

My previous Proxmox post described how to install Sierra into Proxmox using the Enoch bootloader (SeaBIOS boot). Since then, I’ve been using it as my daily-use desktop, and it has generally been working out great for me. However, I had some real struggles getting the graphics card passthrough to work reliably. I managed to fix these by updating to UEFI boot with Clover.

One of the problems with legacy BIOS boot and GPU passthrough is VGA arbitration. From what I understand, the video cards in the host and guest can end up both contending to own the VGA resources, which can cause a deadlock on boot. When a Sierra guest loads its video driver during boot, my Proxmox host hangs, and the screen fills with black and white bars.

UEFI boot doesn’t suffer from this problem, since it does away with the legacy VGA interface. So if your video card’s firmware supports UEFI/EFI boot (my R9 280X already does), you can switch the guest to boot using OVMF instead. This requires us to use a macOS bootloader that supports UEFI. I chose Clover.

However, there’s an issue at the moment with Clover and QEMU which causes macOS’s detected CPU speed to be wrong. This makes window animations, the system clock, movie players, typematic repeat, etc., run much too fast or too slow.

On Proxmox 4.4, we have to patch Clover to fix this; follow the instructions in the next section.

Proxmox 5 has support for telling macOS exactly what the CPU’s frequency is, by exposing a VMWare-style interface that macOS knows how to read. This fixes the CPU speed problem. So on Proxmox 5, we can just edit the VM configuration to enable this feature, and afterwards we can install an unmodified official Clover release (I’m using r4097) using the install instructions further down this page.

Building your own copy of Clover with the QEMU CPU speed patch for Proxmox 4.4

You can either just download my prebuilt patched Clover r4061 / EDK2 r24132 installer, or follow the instructions in this section to patch and build Clover yourself.

We’ll be following the official Clover building instructions, but we’ll be modifying those slightly.

Install XCode from the App Store before you start. Run “sudo xcodebuild -license” to accept the license agreement. Run “sudo xcode-select --install” to ensure the command-line tools are installed.

Note that when the instructions say to make a directory called “src” in your home directory, you should listen! There are hardcoded paths that will look for built tools in that directory, so it’s much easier to just go with the flow here.

Fetching Clover source

Follow steps 1-3 from the section “compiling from source”, with some changes:

On the line that fetches EDK2:

svn co -r 18198 svn://svn.code.sf.net/p/edk2/code/trunk/edk2 edk2

Fetch EDK2 revision 24132 instead:

svn co -r 24132 svn://svn.code.sf.net/p/edk2/code/trunk/edk2 edk2

When it checks out the latest Clover source:

svn co svn://svn.code.sf.net/p/cloverefiboot/code Clover

Check out revision 4061 instead:

svn co -r 4061 svn://svn.code.sf.net/p/cloverefiboot/code Clover

You can skip the line that runs “./buildgcc-4.9.sh”, since we’ll be using XCode instead.

Apply the patch

User “arne ziegert” over on the Clover issue tracker came up with a patch to fix the CPU speed issue on QEMU, which we’ll apply before we build Clover.

Download this patch to “edk2/Clover”. Change into that directory and run:

svn patch clover-r4061-qemu-cpu-speed-patch.diff

Build Clover

Change into the “edk2/Clover” directory, and run:

./ebuild.sh

The default options, which use XCode to build an X64 bootloader, are perfect for us.

After that completes, run “cd CloverPackage; ./makepkg”. This will produce an installable package for us in “edk2/Clover/CloverPackage/sym/Clover_v2.4k_r4061.pkg”.

Proxmox 5: Enabling vmware-cpuid-freq support

In Proxmox 5, we don’t need to patch Clover; we just need to enable the vmware-cpuid-freq feature on the CPU in our VM’s configuration.

You should currently have an “args:” option in your VM configuration that contains:

-cpu Penryn,kvm=off,vendor=GenuineIntel

We need to edit that to add +invtsc to enable invariant timestamp counter support, add the vmware-cpuid-freq option, and turn kvm back on (exposing the fact that this is a virtual machine to macOS):

-cpu Penryn,kvm=on,vendor=GenuineIntel,+invtsc,vmware-cpuid-freq=on
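If you manage several VM configurations, the same edit can be scripted. This is just an illustrative Python sketch that rewrites the value of the -cpu option; the helper name `enable_cpuid_freq` is made up, and it only handles the comma-separated flag list shown above.

```python
def enable_cpuid_freq(cpu_value):
    """Rewrite a -cpu value so kvm is on and the two extra flags are present."""
    flags = cpu_value.split(",")
    # Turn kvm back on if it was explicitly disabled.
    flags = ["kvm=on" if flag == "kvm=off" else flag for flag in flags]
    # Append the extra flags only if they are not already there.
    for extra in ("+invtsc", "vmware-cpuid-freq=on"):
        if extra not in flags:
            flags.append(extra)
    return ",".join(flags)

print(enable_cpuid_freq("Penryn,kvm=off,vendor=GenuineIntel"))
```

Running it a second time on its own output changes nothing, so it's safe to apply to a config that has already been edited.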

Install Clover to the EFI partition

At this point you might want to take a snapshot of your Sierra install, so you can roll things back if it goes wrong. Though note that we will still be able to boot with Enoch/SeaBIOS even after we’re done, so if you mess up Clover/OVMF, you should be able to switch right back to SeaBIOS in your VM options to fix things.

Run the Clover.pkg installer in your Sierra guest:

The Clover installer should leave the EFI partition mounted for us. Open that up in Finder.

Replace the EFI/CLOVER/config.plist file with this one, which I got from Spaceinvader One’s unRAID tutorial.

Put this q35-acpi-dsdt.aml file from QEMU into “EFI/CLOVER/ACPI/origin”. (This file is no longer part of the latest QEMU revision, however the last revision which contained it can be browsed here.)

Configure Proxmox to use OVMF/UEFI

We’re nearly done! Just switch over to OVMF in your VM’s settings:

Now fire it up!

Editing your Clover/EFI settings in the future

You can use the Clover Configurator tool to edit your Clover configuration. This tool should mount the EFI partition for you. If you want to mount it manually, first check the device name of the EFI partition in the terminal:

~$ diskutil list
/dev/disk0 (external):
   #:             TYPE   NAME           SIZE       IDENTIFIER
   0: GUID_partition_scheme             512.1 GB   disk0
   1:              EFI   EFI            209.7 MB   disk0s1
   2:        Apple_HFS   Main           511.8 GB   disk0s2

Then you can mount it like so:

sudo mkdir /Volumes/EFI
sudo mount -t msdos /dev/disk0s1 /Volumes/EFI

Alternative process: Dedicated Clover boot device

Rather than installing Clover by executing the .pkg on the guest, you can attach a dedicated Clover disk to your VM and just fill it with a Clover disk image that I’ve prepared.

On the hardware tab, add a new disk of size 1GB to hold Clover (if you’re already using IDE0 then add to IDE2). On the options tab, change the boot order to boot from this drive.

If you haven’t already switched your VM settings from SeaBIOS to OVMF, change the BIOS type to OVMF on the options tab, and add an EFI disk to store UEFI settings on the hardware tab.

If you have the Sierra install DVD mounted, make sure the line for that in your VM’s config contains the “media=cdrom” flag (unlike Enoch which needed that to be removed). For example:

sata0: local:iso/Install_macOS_Sierra.iso,media=cdrom,size=6074010K

Download this Clover disk image (5MB, uncompresses to 1GB), upload it to Proxmox and unpack it there with “gunzip clover-r4061-1gb.img.gz”. Now write that image onto the 1GB disk you added. For my ZFS-backed volume, that was accomplished with:

dd if=clover-r4061-1gb.img of=/dev/zvol/tank/vms/vm-104-disk-2 bs=1M

Be sure to get the device name correct so you don’t overwrite the wrong drive! Now you should be able to use this Clover boot disk to boot the Sierra installer, or an already-installed copy of Sierra.

Creating a CrashPlan container on Proxmox to back up your files

I’m migrating from FreeNAS to Proxmox 4.3. On FreeNAS, there was a built-in plugin for CrashPlan support, which I was using to back up the files that FreeNAS was serving from ZFS over the network. However, keeping this plugin running was a chore, with forced automatic CrashPlan updates frequently breaking it and requiring manual intervention to fix, and headless operation requiring an unsupported, tedious procedure to achieve, with lots of opportunities for getting it wrong.

On top of this, CrashPlan doesn’t actually support FreeBSD, and instead relies on the Linux emulation that the FreeNAS jail system provides, which puts the plugin at risk of being broken by CrashPlan relying on unsupported Linux kernel features.

By contrast, Proxmox provides the perfect environment for CrashPlan. Having a real Linux kernel available for the LXC container system to use means there’s no kernel incompatibility to worry about.

Installing macOS Sierra on Proxmox 4.4 / QEMU 2.7.1

This tutorial for installing macOS Sierra has been adapted for Proxmox 4.4 from this tutorial for Yosemite, and this GitHub project for installing into vanilla KVM.

Requirements

I’ll assume you already have Proxmox 4.4 installed. You also need a real Mac available in order to download Sierra from the App Store and build the installation ISO. Your host computer must have an Intel CPU at least as new as Penryn. I think you may need a custom Mac kernel to use an AMD CPU.

These installation instructions have been tested with Sierra 10.12.4. Although it’s been a while since I performed a fresh install, I’m currently running Sierra 10.12.6 on Proxmox 5 using a VM built with these instructions.

First step: Create an installation ISO

On a Mac machine, download the macOS Sierra installer from the App Store (this will download it into your Applications folder).

Download the contents of this repository to your mac.

From inside that directory, run “sudo ./create_install_iso.sh” to create the install CD for you:

Once that’s done, connect to your Proxmox server using Transmit (or some other SCP/SFTP client) and upload the ISO you created to /var/lib/vz/template/iso.

While you’re there, upload the enoch_rev2877_boot bootloader file from the GitHub repository to /var/lib/vz/template/qemu/enoch_rev2877_boot.

Fetch the OSK authentication key

macOS checks that it is running on real Mac hardware, and refuses to boot on third-party hardware. You can get around this by reading an authentication key out of your real Mac hardware (the OSK key). Run the first bit of C code from this page (you’ll need XCode installed) and it’ll print out the 64 character OSK for you. Make a note of it.

Create the VM

From the Proxmox web UI, create a new virtual machine as shown below.

In the Options page for the VM, change “Use tablet for pointer” to “No”.

In the Hardware page for the VM, change the Display to Standard VGA (std).

Don’t try to start the VM just yet. First, SSH into your Proxmox server so we can make some edits to the configuration files.

Edit /etc/pve/qemu-server/YOUR-VM-ID-HERE.conf (with nano or vim). Add these two lines, being sure to substitute the OSK you extracted earlier into the right place:

machine: pc-q35-2.4
args: -device isa-applesmc,osk="THE-OSK-YOU-EXTRACTED-GOES-HERE" -smbios type=2 -kernel /var/lib/vz/template/qemu/enoch_rev2877_boot -cpu Penryn,kvm=off,vendor=GenuineIntel

Find the line that specifies the ISO file, and remove the “,media=cdrom” part from the end of the line (otherwise you’ll get stuck at the bootloader).

On the net0 line, change “e1000” to “e1000-82545em”. This variant is supported by OS X.

macOS doesn’t support the PS2 keyboard and mouse that QEMU will emulate, nor does it support the tablet, so edit /usr/share/qemu-server/pve-q35.cfg and add these USB input devices to the bottom of the file instead:

[device "mouse1"]
 driver = "usb-mouse"
 bus = "ehci.0"
 port = "1"

[device "keyboard1"]
 driver = "usb-kbd"
 bus = "ehci.0"
 port = "2"

We’ve added those to the config file instead of to the VM’s args directly. If we were to add them to the VM’s args, then when Proxmox constructs its call to KVM to launch the VM, those device definitions would appear before the pve-q35.cfg file is included, which defines the USB busses. However, the device definitions must appear after the definitions of the USB bus that they refer to.

Note that this file is whitespace-sensitive, so make sure you don’t add any blank lines that have extraneous spaces on them.

Configure Proxmox

On Proxmox, run “echo 1 > /sys/module/kvm/parameters/ignore_msrs” to avoid a bootloop during macOS boot. To make this change persist across Proxmox reboots, run:

echo "options kvm ignore_msrs=Y" >>/etc/modprobe.d/kvm.conf && update-initramfs -k all -u

If you’re installing Sierra 10.12.4 or newer, you’ll also need to patch Proxmox’s copy of QEMU in order to be able to boot until this patch is merged by the upstream.

Install Sierra

Now start up your VM.

If you get an error “file system may not support O_DIRECT / Could not open iso: invalid argument” when starting the VM, you may need to edit the CD drive on the hardware tab and change its cache setting to “writeback (unsafe)”.

Go to the Console tab:

Press enter to choose the “install macOS Sierra” entry and the installer should boot up.

If you are unable to move the mouse cursor at the Welcome screen, and a beachball-of-doom appears on the host, you might be using Safari. It seems to get overwhelmed with the number of screen updates on the animated Welcome screen and become unresponsive. Try Chrome instead.

Our virtual hard drive needs to be erased/formatted before we can install to it, so go to Utilities -> Disk Utility and do that now:

Before we start installation, we have some files to copy over to the newly-formatted drive. Choose Utilities -> Terminal, and copy the /Extra directory to your main volume (/Volumes/Main, for example) using “cp -av /Extra /Volumes/Main/” like so:

Quit Terminal. Now you can begin installation to the Main drive.

After the first stage of installation, the VM should reboot itself and continue installation by booting from the hard drive. After answering the initial install questions, you’re ready to go!

Sleep management

I found that I was unable to wake Sierra from sleep using my mouse or keyboard. You can either disable system sleep in Sierra’s Energy Saver settings to avoid this, or you can manually wake the VM up from sleep from Proxmox by running:

qm monitor YOUR-VM-ID-HERE
system_wakeup
quit

USB passthrough

Using noVNC gets pretty annoying due to the Mac’s absence of tablet support for absolute cursor positioning. You can solve this by turning on the Mac’s screen sharing feature and using that instead. But I want to use this as my primary computer, so I’m using USB input devices plugged directly into Proxmox.

Proxmox has good documentation for USB passthrough. Basically, run “qm monitor YOUR-VM-ID-HERE”, then “info usbhost” to get a list of the USB devices connected to Proxmox:

qm> info usbhost
 Bus 3, Addr 12, Port 6, Speed 480 Mb/s
 Class 00: USB device 8564:1000, Mass Storage Device
 Bus 3, Addr 11, Port 5.4, Speed 12 Mb/s
 Class 00: USB device 04d9:0141, USB Keyboard
 Bus 3, Addr 10, Port 5.1.2, Speed 12 Mb/s
 Class 00: USB device 046d:c52b, USB Receiver
 Bus 3, Addr 9, Port 14.4, Speed 12 Mb/s
 Class 00: USB device 046d:c227, G15 GamePanel LCD
 Bus 3, Addr 8, Port 14.1, Speed 1.5 Mb/s
 Class 00: USB device 046d:c226, G15 Gaming Keyboard
 Bus 3, Addr 6, Port 11, Speed 12 Mb/s
 Class e0: USB device 0b05:17d0,
 Bus 3, Addr 2, Port 1, Speed 1.5 Mb/s
 Class 00: USB device 068e:00f2, CH PRO PEDALS USB

In this case I can add my keyboard and mouse to USB passthrough by quitting qm, then running:

qm set YOUR-VM-ID-HERE -usb1 host=04d9:0141
qm set YOUR-VM-ID-HERE -usb2 host=046d:c52b

This saves the devices to the VM configuration for you. It’s possible to hot-add USB devices, but I just rebooted my VM to have the new settings apply.
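If you have a lot of devices to sift through, the vendor:product IDs can be pulled out of that listing with a short script. This is only a sketch: the regular expression is based on the sample output shown above rather than on any documented format, and the function names and VM ID are made up for illustration.

```python
import re

SAMPLE = """\
 Bus 3, Addr 11, Port 5.4, Speed 12 Mb/s
 Class 00: USB device 04d9:0141, USB Keyboard
 Bus 3, Addr 10, Port 5.1.2, Speed 12 Mb/s
 Class 00: USB device 046d:c52b, USB Receiver
"""

# Matches "USB device vvvv:pppp, description" within a single line.
DEVICE_RE = re.compile(r"USB device ([0-9a-fA-F]{4}:[0-9a-fA-F]{4}),[ \t]*(.*)")

def list_usb_devices(text):
    """Return (vendor:product, description) pairs found in the listing."""
    return [(m.group(1), m.group(2).strip()) for m in DEVICE_RE.finditer(text)]

def qm_set_commands(vmid, device_ids, first_slot=1):
    """Build one 'qm set' command per device, using usb1, usb2, ... slots."""
    return [
        f"qm set {vmid} -usb{slot} host={device_id}"
        for slot, device_id in enumerate(device_ids, start=first_slot)
    ]

devices = list_usb_devices(SAMPLE)
for command in qm_set_commands("100", [device_id for device_id, _ in devices]):
    print(command)
```

The script only prints the commands; you still run them yourself on the Proxmox host after checking that they refer to the right devices.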

PCIe GPU passthrough

For native graphics performance, I wanted to pass through my graphics card for the macOS VM’s exclusive use (driving a monitor connected to Proxmox). Follow the instructions from the Proxmox manual. Use the “GPU Seabios PCI EXPRESS PASSTHROUGH” section for this installation.

Note that your CPU and motherboard need to support VT-d (be sure to enable it in your BIOS as it’s often disabled by default), and your CPU needs to support IOMMU interrupt remapping.

After following the instructions to blacklist video drivers in the Proxmox manual, I found I had to run “update-initramfs -u” in order for the blacklist to be applied.

Check that your graphics card has been reserved correctly by running “lspci -k” on Proxmox and checking which driver is assigned to the graphics card (if done correctly, it should be “vfio-pci”).

After following through all the steps in that guide, I ended up with a new “hostpci0: 01:00,pcie=1,x-vga=on” line in my VM’s configuration, and after a reboot of Proxmox, my graphics card (Radeon R9 280X) was working! Only some cards are natively supported by macOS, so check out the tonymacx86 Radeon compatibility list for your card. I also found a list of supported Nvidia cards (some using Nvidia’s Web Driver).

I have had success passing through my EVGA GeForce GTX 750Ti SC 2G, driving a 4K screen over DisplayPort and another display over HDMI. This required me to use Clover/UEFI boot, install the NVidia web drivers, and update my SMBIOS to “iMac 14,2” and enable “NvidiaWeb” in Clover Configurator.

Using Clover as a bootloader

I’ve also written up a guide on converting this VM to use Clover for booting instead of Enoch.

Raw images display corrupt in Lightroom when using FreeNAS 9.10 / afpd 3.1.8

I have so many raw Canon CR2 photos in my Lightroom library that they won’t fit on my MacBook, so I built myself a FreeNAS-based NAS to put them on. I access my NAS’s photo drive over the network from my MacBook. This worked great for many years.

However, a while back I noticed that many of my older photos displayed corrupt in Lightroom. They’d cut off halfway through and dissolve into a repeating pattern of stripes:

As you can imagine, this was pretty devastating. My newer photos all displayed fine, but a good percentage of my old photos displayed corrupted, as if the bits had rotted on the disk over time. So I restored those photos to my desktop using my Crashplan backup, and those restored photos were fine. Phew!

But how had the photos ended up getting corrupted? My FreeNAS runs ZFS, which detects and repairs/reports corruption. And regular ZFS scrubs on my drive array never turned up any problems. Corruption should be impossible.

I checked the MD5-sum of the perfect photos from the backup, and checked the MD5 of the corrupt photo on my NAS. They were identical. Huh? In fact, if I just copied the photos from my NAS to my laptop using Finder, then opened them in Lightroom, the previously corrupt-looking photos displayed perfectly.
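The same comparison is easy to script if you have many files to check. Here's a minimal Python sketch using the standard hashlib module; the function names are made up, and any paths you pass in are up to you.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until read() returns an empty bytes object at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(path_a, path_b):
    """True if both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Calling same_content() on a restored copy and the copy on the NAS tells you whether the bytes on disk actually differ, independently of how any particular application renders them.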

So the problem wasn’t the image files at all; it was the connection between my MacBook and FreeNAS. Since I’m using a Mac, I figured I should use AFP (the Apple Filing Protocol) to connect the two together. This is provided by a package on FreeNAS called “netatalk”, and it turns out that this package received an update between FreeNAS 9.2 and 9.3. Rolling back to FreeNAS 9.2 fixed the AFP corruption issue for me.

With the recent release of FreeNAS 9.10, I decided it was probably time to track down the bug and get it fixed so I could upgrade. So with the help of Ralph Boehme at netatalk, I enabled debug logging in netatalk (now at version 3.1.8) to track down the cause of the issue. In FreeNAS, this can be achieved by:

nano /usr/local/etc/afp.conf

Add these lines to the [Global] section:

log level = default:maxdebug
log file = /tmp/netatalk.log

Save and exit, then:

service netatalk stop
service netatalk start

This will start logging a ton of information to /tmp/netatalk.log (don’t let it fill up your boot drive by leaving it enabled for too long!)

I then browsed around my photos in Lightroom until a photo displayed corrupted, and noted down the filename.

By searching the log for that file, I found the problem. Lightroom would read the photos by sending an AFP_OPENFORK message to open the file, then a series of AFP_READ_EXT messages to read the file contents, then an AFP_CLOSEFORK message to finish up. This normally worked fine. However, sometimes Lightroom would want to fetch some extra attributes from the file while it was reading it by sending an AFP_LISTEXTATTR message:

<== Start AFP command: AFP_LISTEXTATTR
dirlookup(did: 1339): START
…
sys_list_eas(827C5343.CR2): file is already opened
ad_close(HF): BEGIN: {d: 1, m: 0, r: 0} [dfd: 6 (ref: 1), mfd: 6 (ref: 1), rfd: -1 (ref: 0)]
ad_close(HF): END: 0 {d: 1, m: 0, r: 0} [dfd: -1 (ref: 0), mfd: -1 (ref: 0), rfd: -1 (ref: 0)]
Finished AFP command: AFP_LISTEXTATTR -> AFP_OK

That “file is already opened” message looked suspicious to me, since it implies that AFP_LISTEXTATTR was sharing the photo’s file handle with Lightroom’s reading process, and the final part of AFP_LISTEXTATTR was calling ad_close(), hmm.

After this, every subsequent call from Lightroom to continue reading the file would fail with an error of AFPERR_EOF. In other words, fetching attributes from a file that was already opened was wrongly causing the file to be closed, sending a premature EOF to Lightroom and so cutting off the image halfway through.
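With a debug log this large, it can help to pull out every command that shows the same pattern. The sketch below is a hypothetical helper, not something from netatalk. It assumes the log lines contain the same substrings as the excerpt above ("Start AFP command:", "file is already opened", "ad_close(" and "Finished AFP command:"); real log lines carry extra prefixes, which substring matching tolerates.

```python
import re

START_RE = re.compile(r"Start AFP command: (\w+)")
END_RE = re.compile(r"Finished AFP command: (\w+)")
PAREN_RE = re.compile(r"\(([^)]*)\)")


def find_suspect_commands(lines):
    """Yield (command, filename) for each AFP command that logs
    "file is already opened" and then calls ad_close() before finishing."""
    command = None
    opened_file = None
    for line in lines:
        start = START_RE.search(line)
        if start:
            # A new command begins; forget any state from the previous one.
            command, opened_file = start.group(1), None
            continue
        if command is None:
            continue
        if "file is already opened" in line:
            # The filename is the text in the first pair of parentheses.
            match = PAREN_RE.search(line)
            opened_file = match.group(1) if match else "?"
        elif "ad_close(" in line and opened_file is not None:
            yield command, opened_file
            opened_file = None
        elif END_RE.search(line):
            command = None
```

Running `list(find_suspect_commands(open("/tmp/netatalk.log")))` gives the files whose shared handle was closed while another command was in flight.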

Ralph came up with a patch to netatalk to fix this problem, so now all I had to do was test it. This requires building FreeNAS from source, which was quite a learning experience. The build repo for FreeNAS 9.10 is here:

https://github.com/freenas/freenas-build

I installed FreeBSD 10.3 in a VirtualBox VM on my laptop, then grabbed a copy of a reasonable-looking tagged release of the build repo (some of the newer commits looked a bit experimental):

pkg install git
git clone https://github.com/freenas/freenas-build.git
cd freenas-build
git checkout 9c46f771d3467b2c2625c752bf51398903cb309b

Now follow the instructions to install the build pre-reqs:

make bootstrap-pkgs
pkg install devel/gmake

portsnap fetch extract
cd /usr/ports/textproc/py-sphinx_numfig
# Just keep hitting OK to accept the defaults for the dependencies:
make config-recursive
make install

# This package adds a hardlink to python in /usr/local/bin/python needed for build scripts:
cd /usr/ports/lang/python
make install

Now back in the freenas-build directory, have FreeNAS fetch its source (note we have to add PROFILE=freenas9 to get FreeNAS 9.10 instead of 10):

make PROFILE=freenas9 checkout

Save the patch for netatalk into freenas-build/_BE/ports/net/netatalk3/files.

Now it’s time to build FreeNAS (this takes many hours)!

make PROFILE=freenas9 release

Once done, you’ll have a FreeNAS ISO in freenas-build/_BE/release/FreeNAS-9.10-MASTER-*/x64. I downloaded my current FreeNAS configuration, then used the ISO to install a fresh new copy of FreeNAS to a new USB stick, booted it, uploaded my old configuration, and everything worked fine! The bug was gone, and all my photos displayed correctly.

You can track FreeNAS’s progress in patching this bug on their tracker, though it’ll probably already be included in the latest build by the time you read this (the patch is already in the repo!):

https://bugs.freenas.org/issues/10284

My EC2 server wouldn’t boot after apt-get dist-upgrade: how I fixed it

So I have this EC2 server, which is a MySQL replication slave (henceforth known as “victim”). This server was originally running Alestic’s 64-bit paravirtual AMI for Ubuntu Oneiric 11.10 (ami-a8ec6098), but had been previously upgraded to Precise with do-release-upgrade.

Yesterday, I performed an apt-get dist-upgrade on victim in order to upgrade its apps and the kernel. Then I rebooted the server, but it refused to boot, hanging before SSH could come online. Checking the system log, I saw that it was unable to mount its root disk:

Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[11081144.842424] EXT4-fs (xvdf): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.

[11081145.506225] random: nonblocking pool is initialized
lxcmount stop/pre-start, process 179

 * Starting configure network device security[ OK ]
 * Starting Mount network filesystems[ OK ]
 * Starting Mount filesystems on boot[ OK ]
 * Stopping Mount network filesystems[ OK ]
 * Starting Populate and link to /run filesystem[ OK ]
 * Stopping Populate and link to /run filesystem[ OK ]
 * Stopping Track if upstart is running in a container[ OK ]
 * Starting Bridge socket events into upstart[ OK ]
 * Starting configure network device[ OK ]
 * Starting Initialize or finalize resolvconf[ OK ]

The disk drive for / is not ready yet or not present.
keys:Continue to wait, or Press S to skip mounting or M for manual recovery

Since this is an Amazon server, pressing a key is not an option: there is no interactive console. Interestingly, though, telling the instance to stop causes more of the system log to be printed as it performs a clean shutdown.

The obvious question was, why couldn’t it find or mount the root disk?

At this point I took a snapshot of the root disk so that I didn’t mess it up further with any of my experiments. And I started up a new server, “rescue”, from a pristine Alestic Precise AMI, and dist-upgraded it, which should make it nearly identical to victim. But rescue restarted fine without hanging. There must be some critical difference between my pristine rescue server and the hanging victim server, I just had to find it.

In the AWS control panel, I noticed that victim was using an older Amazon “kernel” (which I think is just a PV-GRUB kernel used to boot the kernel installed on the instance’s disk). So I used the AWS CLI tool to make it the same as rescue's, which should be fine since rescue is the same architecture and guest kernel version:

aws ec2 modify-instance-attribute --instance-id i-000000 --kernel "{\"Value\":\"aki-fc8f11cc\"}"

No luck! It still had the same issue.

I detached the root disk (/dev/sda1) from victim, attached it to rescue (as /dev/xvdf) and mounted it at /mnt:

root@rescue# mount /dev/xvdf /mnt

One reason for not being able to mount the root disk would be that its label had changed. The /etc/fstab file specifies how the root disk should be identified, so let’s take a look:

root@rescue# cat /mnt/etc/fstab
LABEL=cloudimg-rootfs	/	 ext4	defaults	0 0

Okay, so it looks for a disk with the label “cloudimg-rootfs”. What label does the disk have?

root@rescue# e2label /dev/xvdf
cloudimg-rootfs

So the disk label is correct, rats. Just for fun, I tried replacing the “LABEL=cloudimg-rootfs” part with plain old “/dev/xvda1”, but it still wouldn’t boot.

Maybe the new kernel or initrd was corrupt? I checked out /boot/grub/menu.lst and both victim and rescue were trying to boot the exact same kernel version. So I just removed victim’s /boot, initrd.img and vmlinuz and replaced them with the pristine ones from rescue:

root@rescue# rm -rf /mnt/initrd.img* /mnt/vmlinuz* /mnt/boot
root@rescue# cp -a /boot /mnt/
root@rescue# cp -a /vmlinuz /mnt/
root@rescue# cp -a /initrd.img /mnt/

Now I tried booting victim, but it still hung for the same reason! So the kernel or initrd wasn’t the issue.

By now I was getting desperate, so I tried something really crazy. I deleted victim's /etc directory and replaced it with the one from rescue:

root@rescue# rm -rf /mnt/etc
root@rescue# cp -a /etc /mnt

I booted up victim with this new fiddled root disk, and it worked!! It booted fine! So the reason that victim wouldn’t boot was some bad configuration in /etc. But the thing I most wanted to save from victim was its custom configuration, so I couldn’t just use rescue’s out-of-the-box defaults. So I deleted the victim disk I had mangled, restored it from its snapshot (back to its broken condition) and reattached it to rescue.

Now I just had to find the difference between the pristine configuration and the broken one that causes it not to boot. I started by installing all the packages I could remember from victim on to rescue in order to minimise the size of the diff between their configurations. Finally, I ran a recursive diff, ignoring any “*~” emacs backup files:

root@rescue# diff -r --exclude "*~" /mnt/etc /etc > /root/victim-diff
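A full recursive diff is noisy. If you only want the files that exist on one side and not the other, something like the following sketch could narrow it down. It is an alternative to reading the diff output, not what I actually ran, and `only_in_left` is a name invented for this example.

```python
import filecmp
import fnmatch
import os


def only_in_left(left, right, ignore_pattern="*~"):
    """Recursively list paths that exist under `left` but not under `right`,
    skipping names that match `ignore_pattern` (emacs backups by default)."""
    found = []

    def walk(comparison, prefix):
        for name in sorted(comparison.left_only):
            if not fnmatch.fnmatch(name, ignore_pattern):
                found.append(os.path.join(prefix, name))
        # Descend into directories that exist on both sides.
        for name, sub in sorted(comparison.subdirs.items()):
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(left, right), "")
    return found
```

Calling `only_in_left("/mnt/etc", "/etc")` returns paths relative to the first directory, which makes leftover files from long-forgotten packages easier to spot.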

Since the system hung before anything exciting like the network or root disk was initialised, I knew I could ignore MySQL, Apache, Postfix, Nagios, etc, since they are all started too late in the boot process. That didn’t leave many interesting changed files. There were some changed grub settings:

diff -r /mnt/etc/default/grub /etc/default/grub
7c7
< #GRUB_HIDDEN_TIMEOUT=0
---
> GRUB_HIDDEN_TIMEOUT=0
9c9
< GRUB_TIMEOUT=5
---
> GRUB_TIMEOUT=0
11c11
< GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0 nomdmonddf nomdmonisw nomdmonddf nomdmonisw"
---
> GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"

But moving the newer grub config in didn’t fix it. There was a change to the hardware clock settings that I had seen mentioned a couple of times in the system log:

diff -r /mnt/etc/default/rcS /etc/default/rcS
27a28
> HWCLOCKACCESS=no

But porting that change over didn’t fix it either. Finally, I noticed this:

Only in /mnt/etc/init: lxcguest.conf
Only in /mnt/etc/init: lxcmount.conf

That’s odd, I don’t remember ever using LXC on this system, and it’ll never be an LXC guest. One of those configuration files contains “mount” in the name and both are in the “/etc/init” directory, so these files could easily be related to mounting problems during init! Was I really using LXC? I chroot’d into the old drive in order to interrogate its dpkg catalog:

root@rescue# chroot /mnt
root@rescue# dpkg -l | grep lxc
rc  lxcguest    0.7.5-0ubuntu8.6   amd64  Linux container guest package
root@rescue# exit

The “rc” at the start indicates that the lxcguest package was installed at some point, was removed, but still has configuration files left behind. Great, that means that I can blow away those old files:

root@rescue# rm /mnt/etc/init/lxcguest.conf /mnt/etc/init/lxcmount.conf
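The same “rc” check can be applied to every package at once, in case other removed packages have left configuration behind. This is a small sketch that parses the text printed by `dpkg -l`; the function name is made up, and it assumes the status is the first whitespace-separated column, as in the output above.

```python
def removed_with_config(dpkg_output):
    """Return package names whose `dpkg -l` status column is "rc"
    (package removed, configuration files still present)."""
    packages = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "rc":
            packages.append(fields[1])
    return packages
```

Inside the chroot you could feed it the output of `subprocess.run(["dpkg", "-l"], capture_output=True, text=True).stdout`. Running `dpkg --purge` on each package it reports is the packaged way to remove such leftovers, as an alternative to deleting the files by hand.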

And now, glory of glories, the server booted!

[64304636.707613] random: nonblocking pool is initialized
 * Starting Mount filesystems on boot[ OK ]
 * Starting Populate and link to /run filesystem[ OK ]
 * Stopping Populate and link to /run filesystem[ OK ]
 * Stopping Track if upstart is running in a container[ OK ]
[64304638.007641] EXT4-fs (xvda1): re-mounted. Opts: discard
 * Starting Initialize or finalize resolvconf[ OK ]
 * Starting Signal sysvinit that the rootfs is mounted[ OK ]
[64304638.205979] init: mounted-tmp main process (302) terminated with status 1

...

Just to check that it wasn’t fixed due to some combination of changes I made, I deleted victim's fiddled root disk, restored it from the snapshot, and only removed those two lxc config files. It booted fine, so nothing else I changed was required in solving it!

I hope that these debugging steps help someone out in repairing their own EC2 server. (And yes I realise that in most cases, building a new instance with Chef or Puppet would be a better solution, but this is what I had to work with!)

Solving incorrect exec_time stats for queries in MySQL’s binary log

Due to MySQL Bug #52704, if your server’s clock happens to tick backwards during a query’s execution, the Exec_time listed in the binary log will become a huge number like 4294967295 (which is -1 cast to a 32-bit unsigned quantity):

#140427 13:48:52 server id 1  end_log_pos 23475         Query   thread_id=7782750       exec_time=4294967295    error_code=0
SET TIMESTAMP=1398606532/*!*/;
INSERT INTO phpbb_privmsgs_to  (msg_id, user_id, author_id, folder_id, pm_new, pm_unread, pm_forwarded) VALUES (53165565, 65, 2, -3, 1, 1, 0)
/*!*/;
# at 23475

The clock ticking backwards is likely to be caused by your server’s time being adjusted by NTP.
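The wrap-around is easy to reproduce. The sketch below only models the symptom (a negative elapsed time stored in a 32-bit unsigned field); it is not MySQL’s actual code, and the function name is made up for illustration.

```python
def exec_time_as_logged(start_seconds, end_seconds):
    """Elapsed seconds as they would appear in a 32-bit unsigned field.
    A negative difference wraps around instead of staying negative."""
    return (end_seconds - start_seconds) & 0xFFFFFFFF


# Clock moves forwards by one second during the query:
print(exec_time_as_logged(1398606532, 1398606533))  # 1
# Clock is stepped back by one second during the query:
print(exec_time_as_logged(1398606533, 1398606532))  # 4294967295
```

Any backwards step of a few seconds therefore lands just below 2^32, far above any plausible query time.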

Since it causes those queries to have stupidly large execution times, when you’re trying to examine the binary log with pt-query-digest, it completely throws off all the statistics and makes the tool unusable.

You can solve this issue by adding a custom “filter” to your pt-query-digest call which sets the exec_time of the query to zero if it looks too large to be real:

mysqlbinlog mysql-bin.000526 | pt-query-digest --type binlog --filter '(($event->{Query_time} || 0) > 2147483648 ? $event->{Query_time} = 0 : 1) || 1'
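If you would rather clean the log before it reaches any tool, a small pre-processing step is another option. This is a hypothetical alternative to the filter above, not part of pt-query-digest. It assumes the `exec_time=N` text appears as in the sample header line, and it reuses the same 2147483648 cut-off.

```python
import re

EXEC_TIME_RE = re.compile(r"\bexec_time=(\d+)")
THRESHOLD = 2147483648  # 2**31, the same cut-off as the filter above


def clamp_exec_time(line):
    """Rewrite implausibly large exec_time values in one line to zero."""
    def fix(match):
        value = int(match.group(1))
        return "exec_time=0" if value > THRESHOLD else match.group(0)
    return EXEC_TIME_RE.sub(fix, line)


def clamp_stream(source, sink):
    """Copy lines from source to sink, clamping exec_time on the way."""
    for line in source:
        sink.write(clamp_exec_time(line))
```

Calling `clamp_stream(sys.stdin, sys.stdout)` from a small script lets it sit in the pipe between mysqlbinlog and pt-query-digest. Note that it rewrites any matching text, so a statement that happens to contain a literal `exec_time=` followed by a huge number would be altered too.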