Docker is pretty cool, but one thing that seems to be a pain point is reliable and secure networking between docker containers that span multiple hosts. Solutions are popping up all over the place, but they are typically new projects that implement an entire custom mesh networking protocol themselves, and even worse, they seem to either roll their own crypto or assume that the network itself is trusted. After finding that some people had integrated Open vSwitch with Docker, I thought I’d give it a go on CoreOS.

Introduction

Docker runs applications in containers, and each container has its own virtual network interface that appears inside the container as eth0. The other end of this interface is connected to the Docker bridge interface, docker0. By default, docker0 has an internal IP address ending in .1, and each container is given a subsequent IP address after that. Each container can talk to all other containers via this bridge, and to the internet via some iptables forwarding rules.
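
To see these pieces on a stock Docker host, you can inspect the bridge and the NAT rule directly (the interface name and addresses below are Docker’s defaults and will vary):

# The bridge itself, with its .1 address (172.17.42.1/16 by default on Docker 1.3.x)
ip addr show docker0

# The MASQUERADE rule that lets containers reach the internet
sudo iptables -t nat -nL POSTROUTING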

This is convenient on a single host, because even without Docker links, containers can communicate with each other directly. That works nicely for a microservice architecture, where you use something like Registrator to register containers with a service that provides DNS lookups (skydns or Consul, for example; a rough sketch of that pattern is below). Unfortunately, when you have multiple hosts running Docker containers, you have to expose your service ports on the public internet (or have a “trusted” internal network facility like Amazon VPC) in order for the services to communicate cross-host and still maintain some security.
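
As that rough sketch: Registrator is typically run against the local Docker socket and pointed at the registry backend. The image name and the Consul address here are illustrative and depend on your setup:

# Watch the local Docker daemon and publish running containers into Consul
# (consul://127.0.0.1:8500 is a placeholder address for this sketch)
docker run -d --name=registrator \
    -v /var/run/docker.sock:/tmp/docker.sock \
    gliderlabs/registrator consul://127.0.0.1:8500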

The solution is generally to somehow assign unique IPs to the Docker containers across all hosts (perhaps each host gets a separate, non-overlapping subnet), and to have some networking product route traffic between the hosts. For those who haven’t seen the Docker networking projects out there and want to get an idea of their design choices, here’s a quick rundown:

  • Flannel is released by CoreOS themselves, and it coordinates the mesh network through a pre-existing (and somehow secured) etcd. Packets on Docker’s interface are encapsulated inside a UDP packet and sent over the network - I couldn’t see any encryption.
  • Weave, on the other hand, seems to encrypt packets, but appears to implement its own protocol and encryption with a pre-shared key. Unfortunately it’s a bit intrusive: it requires completely wrapping the docker binary, and it reworks a container’s network interface shortly after the container is started.

Open vSwitch (OVS)

Another alternative, which has been discussed by Franck Besnard and Marek Goldmann, is the idea of somehow hooking the docker0 bridge up to Open vSwitch (aka OVS), a powerful software-defined networking switch. It supports GRE tunnelling, a proposed standard for tunnelling Ethernet, IP, and other layers over an existing IP network. It also supports a much less well documented combination called ipsec_gre, in which the tunnelled packets are encrypted and authenticated using IPsec, which can use either a pre-shared key or public/private keys. These are both standards (or nearly standards), and are supported by common network tools like Wireshark and tcpdump.
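
A nice side effect of sticking to (near-)standard encapsulations is that you can watch the tunnel traffic with stock tools. For example, running tcpdump on the host’s public interface (assumed to be eth0 here) shows exactly what crosses the wire:

# Plain GRE is IP protocol 47; the tunnelled frames are visible in the clear
sudo tcpdump -n -i eth0 'ip proto 47'

# With ipsec_gre you should only see ESP (IP protocol 50), i.e. encrypted payloads
sudo tcpdump -n -i eth0 'ip proto 50'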

Although integrating OVS and Docker has been documented already, I found that the latest version seemed to have issues performing GRE routing when the native docker0 bridge interface had the IP address assigned to it, instead of the OVS br0 interface itself (specifically, ARP replies were not being forwarded back across the tunnel).

I also wanted to run this on CoreOS, and to have some simple way of provisioning OVS onto the system. The ideal mechanism for this would be Docker itself.

coreos-ovs

My solution to this was to create a Docker image (coreos-ovs) with the latest version of Open vSwitch installed, and to write some simple cloud-config glue to set up and launch the service. This image doesn’t handle magical mesh networking; it just sets up Docker to use an Open vSwitch bridge, which can then be connected to another host using gre or ipsec_gre.

My test case was a Digital Ocean droplet running CoreOS Stable (currently 522.6.0). The following cloud-config snippet, when thrown into a User Data field, will set everything up except the tunnel itself:

#cloud-config

coreos:
  units:
    - name: docker.service
      command: start
      drop-ins:
        - name: 50-custom-bridge.conf
          content: |
            [Service]
            Environment='DOCKER_OPTS=--bip="10.0.11.0/8" --fixed-cidr="10.0.11.0/24"'
    - name: openvswitch.service
      command: start
      content: |
        [Unit]
        Description=Open vSwitch Servers
        After=docker.service
        Requires=docker.service

        [Service]
        Restart=always
        ExecStartPre=/sbin/modprobe openvswitch
        ExecStartPre=/sbin/modprobe af_key
        ExecStartPre=-/usr/bin/docker run --name=openvswitch-cfg -v /opt/ovs/etc busybox true
        ExecStartPre=-/usr/bin/docker rm -f openvswitch
        ExecStartPre=/usr/bin/docker run -d --net=host --privileged --name=openvswitch --volumes-from=openvswitch-cfg theojulienne/coreos-ovs:latest
        ExecStart=/usr/bin/docker attach openvswitch
        ExecStartPost=/usr/bin/docker exec openvswitch /scripts/docker-attach

An important detail is the Environment adjustment to Docker, which in this case configures it to use the entire 10.0.0.0/8 subnet across all of the hosts, with the local docker0 having the IP address 10.0.11.0 and the Docker containers themselves having IPs in the range 10.0.11.1-254. Using the .0 IP for the bridge interface works around a bug in CoreOS stable’s Docker 1.3.3, which hands out the entire range specified by --fixed-cidr, including the bridge’s own IP, and that causes issues.
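
For example, the content of the 50-custom-bridge.conf drop-in on a second host that should own the 10.0.42.* subnet (as coreos-02 does later in this post) would differ only in the --bip and --fixed-cidr values:

[Service]
Environment='DOCKER_OPTS=--bip="10.0.42.0/8" --fixed-cidr="10.0.42.0/24"'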

By assigning each Docker host a different 10.xxx.xxx prefix, the IPs are guaranteed not to collide, and each host can run 253 containers at once. There can be 60,000+ hosts, with the only prefix not allowed being 10.0.0 (since 10.0.0.0 would be an invalid IP for the bridge, because the first IP of the entire subnet cannot be used).

Behind the scenes, this actually starts Docker, then removes the IP address from docker0 and adds it to the OVS bridge interface br0 instead, then adds docker0 as a port to the OVS bridge, and everything works again. It’s not ideal, but it’s probably one of the least invasive ways of doing this until Docker natively supports OVS (which isn’t hard, but people seem to want to implement an entire pluggable networking architecture before doing this!)
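
For the curious, the equivalent manual steps look roughly like the following. This is a sketch of what the attach script does rather than its exact contents, using the 10.0.11.0/8 addressing from the drop-in above:

# Create the OVS bridge and move docker0's IP address onto it
ovs-vsctl --may-exist add-br br0
ip addr del 10.0.11.0/8 dev docker0
ip addr add 10.0.11.0/8 dev br0
ip link set br0 up

# Plug docker0 into br0 as an ordinary port; container traffic now flows through OVS
ovs-vsctl --may-exist add-port br0 docker0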

Setting up the GRE tunnel

Once the hosts have booted and the Docker images have downloaded and started, setting up GRE is trivial. On each host, run the following, substituting $REMOTE_IP with the IP of the remote server (and incrementing gre0 as appropriate for each subsequent tunnel):

docker exec openvswitch /opt/ovs/bin/ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=$REMOTE_IP
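
You can confirm the tunnel port was created (and inspect its configuration) with:

docker exec openvswitch /opt/ovs/bin/ovs-vsctl show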

Note that a plain gre tunnel does not encrypt or authenticate packets; they are transmitted in the clear.

Adding IPsec

Support for IPsec is also included in the Docker image, and adding it in is pretty simple, especially with a pre-shared secret (though you should definitely be using public/private keys!):

docker exec openvswitch /opt/ovs/bin/ovs-vsctl add-port br0 gre0 -- set interface gre0 type=ipsec_gre options:remote_ip=$REMOTE_IP options:psk=thisisnotagoodpsk
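
For the public/private key variant, the Open vSwitch IPsec documentation of this era describes a certificate-based setup along the following lines; the paths are placeholders, and the exact option names should be checked against the OVS version baked into the image:

# Tell OVS about this host's certificate and private key
docker exec openvswitch /opt/ovs/bin/ovs-vsctl set Open_vSwitch . \
    other_config:certificate=/path/to/host-cert.pem \
    other_config:private_key=/path/to/host-key.pem

# Create the tunnel, trusting the remote host's certificate instead of a PSK
docker exec openvswitch /opt/ovs/bin/ovs-vsctl add-port br0 gre0 -- set interface gre0 \
    type=ipsec_gre options:remote_ip=$REMOTE_IP options:peer_cert=/path/to/remote-cert.pem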

The Result

On one host, in this case coreos-01 which was assigned the subnet 10.0.11.*:

core@coreos-01 ~ $ docker run -it ubuntu bash
root@76fd9721e6cf:/# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 02:42:0a:00:0b:02
          inet addr:10.0.11.2  Bcast:0.0.0.0  Mask:255.0.0.0
          inet6 addr: fe80::42:aff:fe00:b02/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:648 (648.0 B)  TX bytes:578 (578.0 B)
root@76fd9721e6cf:/# nc -l 1234
Hello from Open vSwitch

And on the other host, coreos-02 which was assigned the subnet 10.0.42.*:

core@coreos-02 ~ $ docker run -it ubuntu bash
root@c467e191ea53:/# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 02:42:0a:00:2a:02
          inet addr:10.0.42.2  Bcast:0.0.0.0  Mask:255.0.0.0
          inet6 addr: fe80::42:aff:fe00:2a02/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:22 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1804 (1.8 KB)  TX bytes:648 (648.0 B)
root@c467e191ea53:/# nc 10.0.11.2 1234
Hello from Open vSwitch

This doesn’t perform automatic mesh networking, but it does make setting up manual secure tunnels nice and easy! Because Open vSwitch behaves just like a physical network switch, the network doesn’t need connections from every node to every other node; it just needs a single path between any two nodes, and the packets will be routed automagically. A simple automatic mesh could be built on top of this using something like etcd fairly easily, perhaps in a future post/project :)
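
To illustrate the single-path point: with three hosts A, B, and C, it’s enough to tunnel A to B and B to C, and OVS on B will happily switch traffic between A and C. Roughly, with $IP_A, $IP_B and $IP_C standing in for each host’s public IP:

# On host A
docker exec openvswitch /opt/ovs/bin/ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=$IP_B

# On host B (one tunnel to each neighbour)
docker exec openvswitch /opt/ovs/bin/ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=$IP_A
docker exec openvswitch /opt/ovs/bin/ovs-vsctl add-port br0 gre1 -- set interface gre1 type=gre options:remote_ip=$IP_C

# On host C
docker exec openvswitch /opt/ovs/bin/ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=$IP_B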