Docker is pretty cool, but one thing that seems to be a pain point is reliable and secure networking between docker containers that span multiple hosts. Solutions are popping up all over the place, but they are typically new projects that implement an entire custom mesh networking protocol themselves, and even worse, they seem to either roll their own crypto or assume that the network itself is trusted. After finding that some people had integrated Open vSwitch with Docker, I thought I’d give it a go on CoreOS.
Docker runs applications in containers, and each container has its own virtual network interface that appears inside the container as `eth0`. The other end of this interface is connected to the Docker bridge interface, `docker0`. By default, `docker0` has an internal IP address ending in `.1`, and each container is given a subsequent IP address after that. Each container can talk to all other containers via this bridge, and to the internet via some `iptables` forwarding rules.
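As a rough sketch of what those forwarding rules look like, here is the kind of NAT rule Docker installs for outbound container traffic. It is printed rather than applied, since applying it requires root on a Docker host, and the `172.17.0.0/16` subnet is an assumption (Docker's usual default) that may differ on your machine:

```shell
# Print (rather than apply) a MASQUERADE rule of the kind Docker adds, so
# that container traffic leaving via interfaces other than docker0 is NATed.
# 172.17.0.0/16 is Docker's usual default subnet -- an assumption here.
echo iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
```

Running `iptables -t nat -S POSTROUTING` on a real Docker host will show the actual rules in place.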
This is convenient on a single host, because even without Docker links, containers can communicate with each other directly. This is awesome for a microservice architecture, where you use something like Registrator to throw containers into a service that provides DNS lookups for services (skydns or consul, for example). Unfortunately, when you have multiple hosts running Docker containers, you have to have your service ports exposed on the public internet (or have a “trusted” internal network facility like Amazon VPC) in order for the services to communicate cross-host and still maintain some security.
The solution is generally to somehow assign unique IPs to each docker container across all hosts (perhaps each host has a separate non-overlapping subnet), and have some networking product that routes traffic between hosts. For those that haven’t seen the Docker networking projects out there and want to get an idea of their design choices, here’s a quick run down:
- Flannel is released by CoreOS themselves, and it coordinates the mesh network through a pre-existing (and somehow secured) etcd. Packets on Docker’s interface are encapsulated inside a UDP packet and sent over the network - I couldn’t see any encryption.
- Weave, on the other hand, does seem to encrypt packets, but appears to implement its own protocol and encryption with a pre-shared key. Unfortunately it’s a bit intrusive: it requires completely wrapping the `docker` binary, and it reworks a container’s network interface shortly after the container is started.
Open vSwitch (OVS)
Another alternative, which has been discussed by Franck Besnard and Marek Goldmann, is the idea of somehow hooking up the `docker0` bridge to Open vSwitch (aka OVS), a powerful software-defined networking switch. It supports GRE tunnelling, a proposed standard for tunnelling Ethernet, IP, and other layers over an existing IP network. It also supports a much less well documented combination called `ipsec_gre`, in which the tunnelled packets are encrypted and authenticated using IPsec, which can use either a pre-shared key or public/private keys. These are both standards (or nearly so), and are supported by common network tools like Wireshark and tcpdump.
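One practical consequence of encapsulation is MTU: each tunnelled Ethernet frame gains an outer IP header and a GRE header, leaving less room for the inner packet. A quick back-of-the-envelope calculation (assuming a 1500-byte physical MTU and a minimal 4-byte GRE header; keyed GRE and IPsec add more overhead):

```shell
# Physical MTU minus the outer IP header (20 bytes), a minimal GRE header
# (4 bytes) and the tunnelled inner Ethernet header (14 bytes) gives the
# usable MTU for the inner IP packet.
echo $(( 1500 - 20 - 4 - 14 ))
```

If large packets stall across the tunnel while small ones work fine, lowering the container MTU to around this value is a reasonable first thing to try.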
Although integrating OVS and Docker has been documented already, I found that the latest version had issues performing GRE routing when the native `docker0` bridge interface had the IP address assigned to it instead of the OVS `br0` interface itself (specifically, ARP replies were not being forwarded back across the tunnel).
I also wanted to run this on CoreOS, and have some simple way of provisioning OVS onto the system. The ideal mechanism for this would be Docker itself.
My solution to this was to create a Docker image (coreos-ovs), which has the latest version of Open vSwitch installed, and to write some simple cloud-config glue to set up and launch the service. This image doesn’t handle magical mesh networking; it just sets up Docker to use an Open vSwitch bridge. This bridge can then be connected to another host using a GRE tunnel.
My test case was a Digital Ocean droplet running CoreOS Stable (currently 522.6.0). The following cloud-config snippet when thrown in to a User Data field will set everything up, except the tunnel itself:
```yaml
#cloud-config
coreos:
  units:
    - name: docker.service
      command: start
      drop-ins:
        - name: 50-custom-bridge.conf
          content: |
            [Service]
            Environment='DOCKER_OPTS=--bip="10.0.11.0/8" --fixed-cidr="10.0.11.0/24"'
    - name: openvswitch.service
      command: start
      content: |
        [Unit]
        Description=Open vSwitch Servers
        After=docker.service
        Requires=docker.service

        [Service]
        Restart=always
        ExecStartPre=/sbin/modprobe openvswitch
        ExecStartPre=/sbin/modprobe af_key
        ExecStartPre=-/usr/bin/docker run --name=openvswitch-cfg -v /opt/ovs/etc busybox true
        ExecStartPre=-/usr/bin/docker rm -f openvswitch
        ExecStartPre=/usr/bin/docker run -d --net=host --privileged --name=openvswitch --volumes-from=openvswitch-cfg theojulienne/coreos-ovs:latest
        ExecStart=/usr/bin/docker attach openvswitch
        ExecStartPost=/usr/bin/docker exec openvswitch /scripts/docker-attach
```
An important detail is the `Environment` adjustment to Docker, which in this case is configured to use the entire `10.0.0.0/8` subnet across all of the hosts, with the local `docker0` having the IP address `10.0.11.0` and the Docker containers themselves having IPs in the range `10.0.11.1-254`. The use of the `.0` IP for the bridge interface is to work around a bug in CoreOS stable’s Docker 1.3.3, which assigns the entire range specified by `--fixed-cidr`, including the bridge’s IP itself, which causes issues.
By assigning each Docker host a different `10.xxx.xxx` prefix, the IPs are guaranteed not to collide, and each host can run 253 containers at once. There can be 60,000+ hosts, with the only disallowed prefix being `10.0.0` (since `10.0.0.0` would be an invalid IP for the bridge, because the first IP of the entire subnet cannot be used).
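To make the scheme concrete, here’s a small helper sketch (not part of the image; the function name and the host-number-to-octet mapping are my own invention) that derives the Docker options for a given unique host number:

```shell
# Map a unique host number (1..65535; 0 is reserved, since 10.0.0.0 would be
# an invalid bridge IP) onto the 10.xxx.xxx prefix scheme described above.
docker_opts_for_host() {
  local n=$1
  local b=$(( n / 256 ))  # second octet
  local c=$(( n % 256 ))  # third octet
  echo "--bip=\"10.$b.$c.0/8\" --fixed-cidr=\"10.$b.$c.0/24\""
}

docker_opts_for_host 11  # prints --bip="10.0.11.0/8" --fixed-cidr="10.0.11.0/24"
```

Host number 11 reproduces the options used in the cloud-config above; any scheme works as long as each host ends up with a unique, non-overlapping `/24`.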
Behind the scenes, this actually starts Docker, then removes the IP address from `docker0` and adds it to the OVS bridge interface `br0` instead, then adds `docker0` as a port to the OVS bridge, and everything works again. It’s not ideal, but it’s probably one of the least invasive ways of doing this until Docker natively supports OVS (which isn’t hard, but people seem to want to implement an entire pluggable networking architecture before doing this!)
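For the curious, that rework roughly corresponds to the following host-side commands. This is only a sketch of the steps just described (the real logic lives in `/scripts/docker-attach` inside the image), and the commands are printed rather than executed, since they only make sense on a host with the OVS container running:

```shell
BRIDGE_CIDR="10.0.11.0/8"  # this host's bridge address from DOCKER_OPTS above

# Move the IP from docker0 to the OVS bridge, bring br0 up, then enslave
# docker0 as a port of br0 (drop the echo to actually run these on a host).
echo ip addr del "$BRIDGE_CIDR" dev docker0
echo ip addr add "$BRIDGE_CIDR" dev br0
echo ip link set br0 up
echo ovs-vsctl add-port br0 docker0
```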
Setting up the GRE tunnel
Once the hosts are booted and the Docker images have downloaded and are running, setting up GRE is trivial. On each host run the following, substituting `$REMOTE_IP` with the IP of the remote server (and incrementing `gre0` as appropriate for each subsequent tunnel):
```shell
docker exec openvswitch /opt/ovs/bin/ovs-vsctl add-port br0 gre0 \
  -- set interface gre0 type=gre options:remote_ip=$REMOTE_IP
```
Note that this does not encrypt or authenticate packets; they are transmitted in the clear.

Also included in the Docker image is support for IPsec. Adding it in is pretty simple, especially with a pre-shared secret (though you should definitely be using public/private keys!):
```shell
docker exec openvswitch /opt/ovs/bin/ovs-vsctl add-port br0 gre0 \
  -- set interface gre0 type=ipsec_gre options:remote_ip=$REMOTE_IP options:psk=thisisnotagoodpsk
```
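The PSK above is deliberately a bad one. Assuming `openssl` is available on the host, a decent random key can be generated like this:

```shell
# Generate a 256-bit random pre-shared key, hex-encoded (64 characters).
PSK=$(openssl rand -hex 32)
echo "$PSK"
```

Remember that both ends of the tunnel must be configured with the same `options:psk` value.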
On one host, in this case coreos-01, which was assigned the subnet 10.0.11.0/24:

```
core@coreos-01 ~ $ docker run -it ubuntu bash
root@76fd9721e6cf:/# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 02:42:0a:00:0b:02
          inet addr:10.0.11.2  Bcast:0.0.0.0  Mask:255.0.0.0
          inet6 addr: fe80::42:aff:fe00:b02/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:648 (648.0 B)  TX bytes:578 (578.0 B)

root@76fd9721e6cf:/# nc -l 1234
Hello from Open vSwitch
```
And on the other host, coreos-02, which was assigned the subnet 10.0.42.0/24:

```
core@coreos-02 ~ $ docker run -it ubuntu bash
root@c467e191ea53:/# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 02:42:0a:00:2a:02
          inet addr:10.0.42.2  Bcast:0.0.0.0  Mask:255.0.0.0
          inet6 addr: fe80::42:aff:fe00:2a02/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:22 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1804 (1.8 KB)  TX bytes:648 (648.0 B)

root@c467e191ea53:/# nc 10.0.11.2 1234
Hello from Open vSwitch
```
This doesn’t perform automatic mesh networking, but it does make setting up manual secure tunnels nice and easy! Because Open vSwitch behaves just like a physical network switch, the network doesn’t need to have connections from every node to every other; it just needs a single path between any two nodes, and the packets will be routed automagically. A simple automatic mesh could be built on top of this using something like etcd fairly easily, perhaps in a future post/project :)
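As a taste of what that automatic mesh might look like, here’s a minimal sketch. Given a list of member IPs (which etcd would supply in practice; the addresses here are hypothetical), each host emits the tunnel-creation command for every peer but itself, printed rather than executed so it can be inspected first:

```shell
# Hypothetical cluster membership (in practice, fetched from etcd).
MEMBERS="10.10.0.1 10.10.0.2 10.10.0.3"
SELF="10.10.0.1"   # this host's own address

i=0
for peer in $MEMBERS; do
  [ "$peer" = "$SELF" ] && continue
  i=$(( i + 1 ))
  # Printed rather than executed; drop the echo to actually create the tunnels.
  echo docker exec openvswitch /opt/ovs/bin/ovs-vsctl add-port br0 "gre$i" \
    -- set interface "gre$i" type=gre "options:remote_ip=$peer"
done
```

A real version would watch etcd for membership changes and add or remove ports accordingly, rather than running once at boot.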