Raw images display corrupt in Lightroom when using FreeNAS 9.10 / afpd 3.1.8

I have so many raw Canon CR2 photos in my Lightroom library that they won’t fit on my MacBook, so I built myself a FreeNAS-based NAS to put them on. I access my NAS’s photo drive over the network from my MacBook. This worked great for many years.

However, a while back I noticed that many of my older photos displayed corrupt in Lightroom. They’d cut off halfway through and dissolve into a repeating pattern of stripes:

As you can imagine, this was pretty devastating. My newer photos all displayed fine, but a good percentage of my old photos displayed corrupted, as if the bits had rotted on the disk over time. So I restored those photos to my desktop using my Crashplan backup, and those restored photos were fine. Phew!

But how had the photos ended up getting corrupted? My FreeNAS runs ZFS, which detects and repairs/reports corruption. And regular ZFS scrubs on my drive array never turned up any problems. Corruption should be impossible.

I checked the MD5-sum of the perfect photos from the backup, and checked the MD5 of the corrupt photo on my NAS. They were identical. Huh? In fact, if I just copied the photos from my NAS to my laptop using Finder, then opened them in Lightroom, the previously corrupt-looking photos displayed perfectly.

So the problem wasn’t the image files at all, the problem was the connection between my MacBook and FreeNAS. Since I’m using a Mac, I figured I should use AFP (the Apple Filing Protocol) to connect the two together. This is provided by a package on FreeNAS called “netatalk”, and it turns out that this package received an update between FreeNAS 9.2 and 9.3. Rolling back to FreeNAS 9.2 fixed the AFP corruption issue for me.

With the recent release of FreeNAS 9.10, I decided it was probably time to track down the bug and get it fixed so I could upgrade. So with the help of Ralph Boehme at netatalk, I enabled debug logging in netatalk (now at version 3.1.8) to track down the cause of the issue. In FreeNAS, this can be achieved by:

nano /usr/local/etc/afp.conf

Add these lines to the [Global] section:

log level = default:maxdebug
log file = /tmp/netatalk.log

Save and exit, then:

service netatalk stop
service netatalk start

This will start logging a ton of information to /tmp/netatalk.log (don’t let it fill up your boot drive by leaving it enabled for too long!)

I then browsed around my photos in Lightroom until a photo displayed corrupted, and noted down the filename.

By searching the log for that file, I found the problem. Lightroom would read the photos by sending a AFP_OPENFORK message to open the file, then a series of AFP_READ_EXT messages to read the file contents, then a AFP_CLOSEFORK message to finish up. This normally worked fine. However, sometimes Lightroom would want to fetch some extra attributes from the file while it was reading it by sending a AFP_LISTEXTATTR message:

<== Start AFP command: AFP_LISTEXTATTR
dirlookup(did: 1339): START
…
sys_list_eas(827C5343.CR2): file is already opened
ad_close(HF): BEGIN: {d: 1, m: 0, r: 0} [dfd: 6 (ref: 1), mfd: 6 (ref: 1), rfd: -1 (ref: 0)]
ad_close(HF): END: 0 {d: 1, m: 0, r: 0} [dfd: -1 (ref: 0), mfd: -1 (ref: 0), rfd: -1 (ref: 0)]
Finished AFP command: AFP_LISTEXTATTR -> AFP_OK

That “file is already opened” message looked suspicious to me, since it implies that AFP_LISTEXTATTR was sharing the photo’s file handle with Lightroom’s reading process, and the final part of AFP_LISTEXTATTR was calling ad_close(), hmm.

After this, every subsequent call from Lightroom to continue reading the file would fail with an error of AFPERR_EOF. In other words, fetching attributes from a file that was already opened was wrongly causing the file to be closed, sending a premature EOF to Lightroom and so cutting off the image halfway through.

Ralph came up with a patch to netatalk to fix this problem, so now all I had to do was test it. This requires building FreeNAS from source, which was quite a learning experience. The build repo for FreeNAS 9.10 is here:

https://github.com/freenas/freenas-build

I installed FreeBSD 10.3 in a VirtualBox VM on my laptop, then grabbed a copy of a reasonable-looking tagged release of the build repo (some of the newer commits looked a bit experimental):

pkg install git
git clone https://github.com/freenas/freenas-build.git
cd freenas-build
git checkout 9c46f771d3467b2c2625c752bf51398903cb309b

Now follow the instructions to install the build pre-reqs:

make bootstrap-pkgs
pkg install devel/gmake

portsnap fetch extract
cd /usr/ports/textproc/py-sphinx_numfig
# Just keep hitting OK to accept the defaults for the dependencies:
make config-recursive
make install

# This package adds a hardlink to python in /usr/local/bin/python needed for build scripts:
cd /usr/ports/lang/python
make install

Now back in the freenas-build directory, have FreeNAS fetch its source (note we have to add PROFILE=freenas9 to get FreeNAS 9.10 instead of 10):

make PROFILE=freenas9 checkout

Save the patch for netatalk into freenas-build/_BE/ports/net/netatalk3/files.

Now it’s time to build FreeNAS (this takes many hours)!

make PROFILE=freenas9 release

Once done, you’ll have a FreeNAS ISO in freenas-build/_BE/release/FreeNAS-9.10-MASTER-*/x64. I downloaded my current FreeNAS configuration, then used the ISO to install a fresh new copy of FreeNAS to a new USB stick, booted it, uploaded my old configuration, and everything worked fine! The bug was gone, and all my photos displayed correctly.

You can track FreeNAS’s progress in patching this bug on their tracker, though it’ll probably already be included in the latest build by the time you read this (the patch is already in the repo!):

https://bugs.freenas.org/issues/10284