Experiments In Owning Data

I have been working for a while to own most of the data I generate. I thought I would write down what I mean by that and how I am doing so far.

Before this effort most of my data was spread across many proprietary services, some free, some paid. I had always felt I had restricted control over it, and I had to find out about some free tier restrictions the hard way.

So this started as an effort to

  1. organize all the things that I wanted in a central (virtual) place, and
  2. have fine-grained control over who has access to the data.

However, all of the services that I looked into were built on exactly the model I wanted to avoid: a free service that monetizes my personal data. They do take my money for “upgrades”, and then my data may not be mined (or maybe it still is, there is no way to verify that). Service behavior was also sometimes opaque and confusing, even causing people to lose data.

Another thing that stood out was how inflexible these services were. They are mostly designed as big monoliths that do not play well with others. For example, Google Photos is really nice, but what if I want to run an ImageMagick script over all of the photos I have? I think there is some way to do this if you poke at the Photos API, but the friction is too much compared to just mounting them over WebDAV or FUSE. For a lot of these services Linux was a second-class citizen, and FreeBSD an undiscovered species. I understand these are not common requirements, but I wanted the system to work with the things I use and have.

Hardware

At the moment, what I call my personal data is ~500GB: all the pictures, emails, documents, code and other things that I have. Assuming a 3-fold growth (probably too low?) I decided that I need around 2.5TB of storage. Other requirements were:

  1. connected to reasonably fast internet and reliable power
  2. cheap (remember, migrating out of this system is going to be really painful)

After some consideration I decided not to host the hardware myself: I move around a lot, and the state of home internet in Germany is not where I’d like it to be.

The storage requirement made most cloud providers infeasible (3TB of EBS is ~$350/month).

I finally settled on a physical machine from the Hetzner server auction. The server auction is where they sell their older-generation machines (read: Sandy Bridge/Ivy Bridge) at a steep discount. I was able to get a Xeon E3 with 32GB ECC RAM and 2x3TB disks for 30 EUR a month.

It could have been a bit cheaper if I had gone with an i7 machine (newer CPU too) instead. But those don’t support ECC RAM; Intel is very adamant about not supporting ECC in “desktop class” processors.

Installation

Installation was a piece of cake. Hetzner allows you to boot the server into a FreeBSD rescue mode, where the server PXE boots from an mfsBSD image and lets you ssh in. From there you can start installing FreeBSD (one can follow a similar procedure for Linux distros with a Linux rescue image).

Security

Even though the main goal is to avoid mass surveillance, I also wanted to avoid data leaks because of unplanned events - me not paying bills, hardware failures etc. The solution was to encrypt the disks, so that at rest nobody can sniff data out of them.

This became a challenge because getting access to KVM in the Hetzner environment is not instant. One needs to send them a request, and a human mails you KVM access credentials for an hour (they are usually fast, though). Every time I need to reboot the server I would have to get KVM access, type in my password over KVM (I am also not sure how much of that encryption I can trust) and let the machine boot.

Two Zpools approach

However, a friend of mine had the solution: have two zpools, one unencrypted that holds the OS, and the other encrypted that holds the data.

Both zpools are in RAID1, meaning each is mirrored across the two physical disks, so as long as both disks don’t fail together, we won’t have any problems.

   disk1
   +------------------------+
   | pool1|    pool2        |
   | unenc|    enc          |
   +------------------------+
   disk2
   +------------------------+
   | pool1|    pool2        |
   | unenc|    enc          |
   +------------------------+
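
For reference, here is a rough sketch of how the encrypted half of this layout could be set up. It is not the exact procedure I followed. It assumes the partitions ada0p4 and ada1p4 already exist, and the pool name "encrypted", the 4096-byte sector size and the 64-byte random key files are illustrative choices. The script only prints the commands unless DRY_RUN=0 is set, since the real thing wipes whatever is on those partitions.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set.
# Assumptions (not confirmed by the post): pool name "encrypted",
# 4096-byte sectors, 64-byte random key files, partitions already created.

set -e

DRY_RUN="${DRY_RUN:-1}"

# Print a command in dry-run mode, otherwise execute it.
run() {
        if [ "$DRY_RUN" = "1" ]; then
                echo "+ $*"
        else
                "$@"
        fi
}

# Create a key file, then initialise and attach one geli provider.
setup_provider() {
        p="$1"
        key="/boot/keys/${p}.key"
        run dd if=/dev/random of="$key" bs=64 count=1
        run geli init -s 4096 -K "$key" "/dev/${p}"    # asks for the passphrase
        run geli attach -k "$key" "/dev/${p}"
}

# Encrypt both partitions and mirror a pool across the .eli devices.
plan_encrypted_pool() {
        run mkdir -p /boot/keys
        for part in ada0p4 ada1p4; do
                setup_provider "$part"
        done
        run zpool create encrypted mirror ada0p4.eli ada1p4.eli
}

plan_encrypted_pool
```

Note that the key files live on the unencrypted pool, so the passphrase is what actually protects the data at rest.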

Roughly, this is how it works: when the machine boots, it boots off the plain zpool and gets to geli0, a custom rc script that we installed:

#!/bin/sh
#

# PROVIDE: geli0
# BEFORE: disks
# REQUIRE: initrandom
# KEYWORD: nojail

. /etc/rc.subr

name="geli0"
start_cmd="geli0_start"
stop_cmd=":"
required_modules="geom_eli:g_eli"

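geli0_start()
{
        # Bring up just enough of the system to accept an ssh login.
        zfs mount -av
        /etc/rc.d/hostid start
        /etc/rc.d/hostname start
        /etc/rc.d/netif start
        /etc/rc.d/routing start
        /etc/rc.d/sshd start

        echo -n "Waiting for zpool:encrypted to become available, "
        echo -n "press enter to continue..."
        echo

        # Poll every 5 seconds until both geli providers are attached;
        # pressing enter on the console skips the wait.
        while true; do
                if [ -e /dev/ada0p4.eli ] && [ -e /dev/ada1p4.eli ]; then
                        break
                fi
                read -t 5 dummy && break
        done

        # Stop the early services again before the regular boot continues.
        /etc/rc.d/sshd stop
        pkill sshd
        /etc/rc.d/routing stop
        /etc/rc.d/netif stop
#       /etc/rc.d/devd stop
}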

load_rc_config $name
run_rc_command "$1"

This script pauses the boot, sets up some essential services for networking and ssh, and waits for the encrypted partitions to become available. The machine is essentially waiting for me to decrypt the disks, and I can do that by ssh-ing to the box and running decryptvol.sh (contents below)

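#!/bin/sh

#
# The passphrase for both disks is the same.
# Read it once and decrypt the disks.
#

set -e

# If interrupted at the prompt, turn terminal echo back on before exiting.
trap 'stty echo; exit 1' INT TERM

echo -n "Enter passphrase: "
stty -echo
IFS="" read -r passphrase
stty echo
echo

# Quote the variable so whitespace and glob characters in the passphrase
# reach geli unchanged; "-j -" makes geli read the passphrase from stdin.
printf '%s\n' "$passphrase" | geli attach -k /boot/keys/ada1p4.key -j - /dev/ada1p4
printf '%s\n' "$passphrase" | geli attach -k /boot/keys/ada0p4.key -j - /dev/ada0p4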

As soon as the disks are available, the geli0 script resumes the regular boot, but now with access to the encrypted data.
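
To check by hand whether the unlock worked, something like the sketch below applies the same test as the wait loop in geli0. The helper name all_present is made up for this example, and the device paths are the ones from the scripts above.

```shell
#!/bin/sh
# Sketch: the same existence check that geli0's wait loop performs.
# all_present is a hypothetical helper, not part of the original setup.

# Return 0 only if every path given as an argument exists.
all_present() {
        for dev in "$@"; do
                [ -e "$dev" ] || return 1
        done
        return 0
}

if all_present /dev/ada0p4.eli /dev/ada1p4.eli; then
        echo "both geli providers are attached"
else
        echo "still waiting for the geli providers"
fi
```

On the server itself, geli status and zpool status give more detail once the providers show up.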

Conclusion and part 2

With this setup I have a place to store my data, and it’s secure from data mining by third-party service providers. One bit that worries me is that someone could coerce Hetzner into attacking the hardware itself, but I am not sure it’s something I can solve at the moment.

However, this is only a part of the puzzle. Strictly speaking I have my data platform, so to speak, and now I need services that integrate it with the other devices that generate and consume data. This post is already longer than I anticipated, so I will write about software and other services in a follow-up.