IPv6-only in the data plane

IPv6 is the future of the internet, or at least that's been the story for the last 20 years now. In reality, some parts of the internet have IPv4 connectivity, some parts have IPv6 connectivity, and there's only a partial overlap between these two parts. So, if you're only connected to the internet with one address family, and want to access (or be accessed by) hosts with only the other address family, you're out of luck, or need to use translation somewhere in order to bridge between the two parts.

At the start of this year, I did some work investigating the feasibility of operating a network which externally advertises dual-stack connectivity, but internally only uses IPv6, and uses translation mechanisms on the border to allow interoperation with external IPv4 networks. In this post, I'll describe the context which motivated this idea, and discuss the theory and implementation of the proof of concept system I built as a demonstration, with some observations from operational experience along the way.


Table of contents:

- Primer: address family translation
- Context: related work
- Background: internet chez moi
- Theory: bridging the control plane
- Practice: testbed deployment
- Conclusion: closing thoughts
- Acknowledgements
- Changelog

Primer: address family translation

There are a number of translation technologies for enabling connectivity between IPv4 and IPv6 hosts, which satisfy a couple of different use cases. The ones which I've been focusing on for the purposes of this post all make use of the fact that the IPv6 address space is considerably larger than the IPv4 address space (IPv6 addresses are 128 bits long, against IPv4's 32), and so it's possible (and quite feasible) to map the entire IPv4 address space into a cutout inside an IPv6 prefix.

RFC6052 details a number of schemes for performing this kind of mapping, but the simplest one is to prepend a fixed 96-bit prefix to a 32-bit IPv4 address, which gives a corresponding 128-bit IPv6 address. Conveniently, the RFC also provides a standard well-known IPv6 prefix for this purpose, 64:ff9b::/96; hence, one could map the IPv4 address 192.0.2.1 into the well-known prefix, to get 64:ff9b::c000:0201 (or just 64:ff9b::192.0.2.1, if you've read section 2.2 of RFC4291).
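
To make the arithmetic concrete, here's a minimal Python sketch of this embedding (purely illustrative, using only the standard library's ipaddress module; it isn't part of any of the tools discussed later):

import ipaddress

def embed_ipv4(prefix, addr):
    # OR the 32-bit IPv4 address into the low bits of the 96-bit IPv6 prefix (RFC6052)
    base = ipaddress.IPv6Network(prefix)
    return ipaddress.IPv6Address(int(base.network_address) | int(ipaddress.IPv4Address(addr)))

print(embed_ipv4("64:ff9b::/96", "192.0.2.1"))  # prints 64:ff9b::c000:201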

In particular, I've made use of these translation technologies in the testbed network:

Context: related work

There's a growing amount of prior art surrounding the use of IPv4/IPv6 translation technologies — as IPv4 addresses become ever more expensive, an increasing number of organisations are operating partially or fully IPv6-only network environments. I found the following references useful when designing the testbed system, and they may also be of interest as further background reading:

Background: internet chez moi

The IPv6-only data plane idea originated from some operational considerations from my own network — at the time of writing, I'm running my own autonomous system on the internet, which is IPv6-only, as IPv4 space is expensive and I can't presently justify the cost of leasing any. I've therefore been using NAT64 translation to access websites which don't have IPv6 connectivity, by running a translator on a machine which is both connected to my own network and has an IPv4 address from another provider. (I'm not running any public-facing services from my network, so being reachable from IPv4 hosts isn't something which I currently require.)

This certainly has some operational conveniences: hosts in my network only need a single IP address in order to connect to the internet, and I only need to maintain a single set of firewall rules (apart from on the host performing the NAT64 function, which is a special case anyway).

However, if I ever introduce IPv4 connectivity to my network in the future, I'd have to put in some effort to make that connectivity available to all the existing hosts inside my network. In the worst case, I'd need to add an IPv4 address to all of those hosts in my network, which increases the accounting overhead for internally allocated resources, and I'd need to configure (and then maintain!) a second firewall configuration on all my routers. There's also the possibility that I might need to use RFC1918 address space (if I have too many computers), which only adds to the potential complexity.

A simpler approach would be to keep my existing IPv6-only infrastructure, and then run one or more translators inside my own network with my own address space (instead of relying on another network provider for IPv4 connectivity), as this would reduce the number of dual-stacked hosts in my network to just the translators (and presumably the border routers).

Sending packets to the IPv4 internet from an IPv6-only host then takes a trip to and from the translators, and relies on existing IPv4 routing to choose a suitable egress path; similarly, ingress traffic from IPv4 hosts would need to be translated with the appropriate IPv6-only host's address as a destination. This also implies carrying some IPv4 traffic in the data plane, if nowhere else then between the translators and the border routers.

I then arrived at the idea motivating this post: is it possible to push the translation functions out onto the border routers themselves, and then make the internal data plane on such a hypothetical network purely IPv6-only, even between border routers?

The immediate thought which this raised was the issue of choosing the most suitable egress path — for a given host inside this IPv6-only network which is trying to access some remote IPv4-only host, the topologically nearest border router within the source network may not have the best egress route to the given destination, or may not have an egress route for the destination at all.

The missing piece here is knowledge of external IPv4 reachability information within the IPv6 control plane, so outbound traffic can be directed to the appropriate egress point, where it can then be translated. The natural solution seemed to be to map the entire IPv4 routing table into an RFC6052-style cutout prefix within the IPv6 routing table. As far as I know, this is a novel approach, so I decided to try building a testbed network which features this fine-grained mapping of the IPv4 routing table into the IPv6 routing table as a proof of concept.

Theory: bridging the control plane

As I wasn't aware of any (open source) routing software which supports automatically converting IPv4 prefixes into IPv6 prefixes, I devised my own scheme for this, based on the RFC6052 scheme described above.

I extended this idea to embed an IPv4 route inside an IPv6 route, by prepending a fixed 96-bit prefix to the network part of each route, and adding 96 to the prefix length. The next hop attribute of the original IPv4 route is then rewritten to an IPv6 address using a preconfigured static lookup table.

So, for example, the IPv4 route 198.51.100.0/24 via 192.0.2.1 can be translated into an IPv6 route within the RFC6052 well-known prefix. This requires the IPv4 next hop address to have an associated IPv6 address configured in the lookup table, for example 2001:db8:64:64::1. The network part of the route becomes 64:ff9b::c633:6400, and the prefix length becomes 120. The resulting IPv6 route is therefore 64:ff9b::c633:6400/120 via 2001:db8:64:64::1. This next hop address can then be routed through a suitably configured SIIT or NAT64 translator, so IPv6-only hosts can make use of the IPv6 route to access hosts in 198.51.100.0/24.


ORIGINAL ROUTE: 198.51.100.0 /     24     via  192.0.2.1
                [  prefix  ]   [ length ]     [ nexthop ]
                      |            |               |
                      +-+          +-----+         +------+
                        |                |                |
                        v                v                v
    +--------------------------------+-------+------------------------+
    |            198.  51. 100.   0  |   24  |        192.0.2.1       |
    |                   |            |   |   |            |           |
    |                   v            |   |   |            |           |
    |            (convert to hex)    |   |   |            |           |
    |                   |            |   |   |            |           |
    |                   v            |   v   |            v           |
    |           0xc6 0x33 0x64 0x00  | (+96) | (nexthop lookup table) |
    |                   |            |   |   |            |           |
    |                   v            |   |   |            |           |
    |        (prepend 96-bit prefix) |   |   |            |           |
    |                   |            |   |   |            |           |
    |                   v            |   v   |            v           |
    | 64:ff9b::   c6   33 : 64   00  |  120  |    2001:db8:64:64::1   |
    +--------------------------------+-------+------------------------+
                        |                |                |
                   +----+           +----+              +-+
                   |                |                   |
                   v                v                   v
           [     prefix     ]   [ length ]      [    nexthop    ]
NEW ROUTE: 64:ff9b::c633:6400 /    120     via  2001:db8:64:64::1
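
The same transformation, written out as a small standalone Python sketch (the lookup table below is a hypothetical stand-in for the preconfigured static table mentioned above):

import ipaddress

# hypothetical static lookup table: IPv4 next hop -> IPv6 next hop
NEXTHOP_TABLE = {"192.0.2.1": "2001:db8:64:64::1"}

def map_route(ipv4_prefix, ipv4_nexthop, base="64:ff9b::/96"):
    # prepend the 96-bit base prefix to the network part, and add 96 to the prefix length
    net = ipaddress.IPv4Network(ipv4_prefix)
    base_addr = ipaddress.IPv6Network(base).network_address
    mapped = ipaddress.IPv6Network((int(base_addr) | int(net.network_address), net.prefixlen + 96))
    return mapped, ipaddress.IPv6Address(NEXTHOP_TABLE[ipv4_nexthop])

print(map_route("198.51.100.0/24", "192.0.2.1"))
# (IPv6Network('64:ff9b::c633:6400/120'), IPv6Address('2001:db8:64:64::1'))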

Hence, I needed a BGP speaker which could import IPv4 routes from peers, and then automatically apply this transformation — the routes could then be re-exported as an IPv6 routing table, either back to peers using BGP route reflection, or directly to the kernel forwarding plane.

I initially struggled to find any library implementations of BGP which I was comfortable using. I briefly attempted to use GoBGP for this purpose — as well as providing a gRPC interface for remotely controlling the GoBGP BGP daemon, the project also provides a library interface for running an in-process BGP speaker. However, I ran into issues with prohibitive memory consumption and stuck routes, so I eventually had to abandon that approach.

I then decided to modify an existing BGP daemon to extend it with the routing table mapping functionality. I use BIRD in my network, and a friend had already written a partial patch against BIRD for this kind of table mapping, which is where I got the idea for this in the first place. Using this patch as a basis, I was able to successfully add the desired table mapping functionality to BIRD.

BIRD can be configured with multiple routing tables of the same address family (which has a myriad of uses, e.g. for separating interior routes from routes learned from other networks, or for controlling per-peer route export policy), and provides "pipes" as a means of copying routes from one table to another table. I added a modified version of this functionality which would copy routes from an IPv4 routing table to an IPv6 routing table, applying the transformation described above in the process.

Configuring a table mapping instance looks like this:


# special mapping pipe protocol
protocol pipe64 {
    table input4;         # read ipv4 routes from this table
    peer table output6;   # send converted routes to this table
    prefix 64:ff9b::/96;  # static prefix to prepend to ipv4 routes

    import none;          # we aren't converting any ipv6 routes to ipv4 routes
    export filter {
        # convert the route's nexthop attribute (assuming it was received
        # over BGP).
        if bgp_next_hop = 192.0.2.1 then {
            bgp_next_hop = 2001:db8:64:64::1;
        } else {
            reject;       # discard routes with unknown nexthops
        }
        accept;
    };
}

I did some cursory testing with a full IPv4 default-free zone routing table (via a BGP session borrowed from another friend), and the patched BIRD appeared to do the right thing, so the next task was to try putting all the theory into action.

Practice: testbed deployment

Combining the modified BIRD version (for performing the table mapping function) and the translation technologies I mentioned earlier (for interoperating with external IPv4 networks), I then tried to build a live demonstration of the IPv6-only data plane concept on the aforementioned testbed network.

Setup

Firstly, I needed some address space to number the test network and to announce to other networks (which I borrowed from another friend). Let's assume I have the IPv6 block 2001:db8:1234::/48 and the IPv4 block 192.0.2.0/24, and that I've reserved 192.0.2.254 from the IPv4 block as a NAT64 egress address. I also used another autonomous system number (instead of 207480, which I use on my current network), to keep the test and production environments separate — let's assume I'm using AS64500.

I initially set up a pair of servers to handle the routing for the test network. I decided to run the table mapping function on a separate host from the border router, so that any issues with my patched BIRD instance wouldn't directly affect peering sessions with other networks. The border router imports routes from other networks (in this case just a transit provider), while the second server operates as a route reflector: it imports IPv4 routes from the border router, converts them into IPv6 routes, and re-exports them back to the border router.

The BIRD configuration on the border router for the BGP session with the route reflector looks like this:

protocol bgp route_reflector {
    local as 64500;
    neighbor ... as 64500;

    ipv4 {
        # export all of the externally learned ipv4 routes to the route
        # reflector
        table exterior4;
        import none;
        export all;
    };

    ipv6 {
        # import all of the converted routes back from the route reflector
        table exterior6;
        import all;
        export none;
    };
}

Correspondingly, the route reflector's configuration on the other side of the BGP session looks like this:

protocol bgp border_router {
    local as 64500;
    neighbor ... as 64500;

    ipv4 {
        # feed all imported IPv4 routes into the table mapping pipe
        table input4;
        import all;
        export none;
    };

    ipv6 {
        # re-export the converted routes
        table output6;
        import none;
        export all;
    };
}

The border router and route reflector use multiprotocol BGP to exchange both IPv4 and IPv6 routes over a single session. Note also that the routing tables in the route reflector's configuration line up with the pipe64 configuration example I gave earlier.

Translators: theory

The border router needs to perform two translation functions: it needs to allow external IPv4 hosts to communicate with hosts inside the test network; and it also needs to allow internal hosts to communicate with external IPv4 hosts. I used SIIT-DC for handling ingress traffic from IPv4 hosts, and NAT64 for egress traffic towards IPv4 hosts, with the RFC6052 well-known prefix for representing external IPv4 addresses.

Using these translation technologies brings some additional addressing requirements. NAT64 requires an IPv4 address as the source address for outgoing translated packets, so I reserved 192.0.2.254 as the test network's NAT64 egress address, as noted above.

On its own, a SIIT translator will translate all IPv4 addresses it processes (both as source and destination) into its configured mapping prefix. This would result in the IPv4 address 192.0.2.42 being translated into the IPv6 address 64:ff9b::c000:022a. However, I decided to use an RFC7757 EAM (Explicit Address Mapping) to map the test network's IPv4 block into a prefix inside its IPv6 block, and reserved 2001:db8:1234:64:ffff:ffff:ffff:ff00/120 as the mapping prefix. This means that the translator will instead translate 192.0.2.42 to 2001:db8:1234:64:ffff:ffff:ffff:ff2a and vice versa (notice that the trailing eight bits of each address are the same — decimal 42 is hex 0x2a).
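
As a quick illustration of the EAM arithmetic (again purely a standalone sketch, not part of the translator configuration), the mapping simply carries an address's offset within its block across to the other address family:

import ipaddress

EAM_V4 = ipaddress.IPv4Network("192.0.2.0/24")
EAM_V6 = ipaddress.IPv6Network("2001:db8:1234:64:ffff:ffff:ffff:ff00/120")

def eam_v4_to_v6(addr):
    # carry the offset within the /24 over to the /120
    offset = int(ipaddress.IPv4Address(addr)) - int(EAM_V4.network_address)
    return ipaddress.IPv6Address(int(EAM_V6.network_address) + offset)

print(eam_v4_to_v6("192.0.2.42"))  # 2001:db8:1234:64:ffff:ffff:ffff:ff2a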

This mapping has the useful side effect of making it possible to assign a single IPv6 address to a host within the test network, such that it's reachable over both IPv4 and IPv6, which is a nice trick.

Translators: operation

All IPv4 ingress traffic for the test network's IPv4 block is handled by the SIIT translator, apart from traffic to the NAT64 exit address, which is handled by the NAT64 translator. For example, an incoming IPv4 packet sent from 203.0.113.14 to 192.0.2.68 would be translated to an IPv6 packet with source address 64:ff9b::cb00:710e and destination address 2001:db8:1234:64:ffff:ffff:ffff:ff44. Traffic handled by the NAT64 translator relies on layer 4 session state entries maintained by the translator in order to find the appropriate destination address for a given packet. IPv6 ingress is otherwise handled as normal.

Egress traffic to external IPv6 addresses is handled as normal, apart from traffic to addresses within the RFC6052 well-known prefix, which is hairpinned through the translation layers, and the destination address is translated to an IPv4 address as in RFC6052. If the source address of an outgoing packet is within 2001:db8:1234:64:ffff:ffff:ffff:ff00/120, then it's processed statelessly by SIIT, and the source address is mapped to an address within 192.0.2.0/24. Otherwise, it's processed by the NAT64 translator, which sets the source address to 192.0.2.254, and tracks the layer 4 session state.
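
To make that egress policy explicit, here's a small illustrative Python sketch of the decision (the real selection is done by routing and firewall rules, as described later; the second source address below is just a hypothetical host elsewhere in the test network's block):

import ipaddress

WKP = ipaddress.IPv6Network("64:ff9b::/96")                              # RFC6052 well-known prefix
EAM = ipaddress.IPv6Network("2001:db8:1234:64:ffff:ffff:ffff:ff00/120")  # SIIT-mapped block

def egress_handler(src, dst):
    # anything not aimed at the well-known prefix is plain IPv6 forwarding
    if ipaddress.IPv6Address(dst) not in WKP:
        return "forward"
    # SIIT-mapped sources are translated statelessly, everything else goes via NAT64
    return "SIIT" if ipaddress.IPv6Address(src) in EAM else "NAT64"

print(egress_handler("2001:db8:1234:64:ffff:ffff:ffff:ff44", "64:ff9b::cb00:710e"))  # SIIT
print(egress_handler("2001:db8:1234:1::10", "64:ff9b::cb00:710e"))                   # NAT64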

Translators: implementation

I initially implemented the translation layer using Tayga, a userspace NAT64 and SIIT translator which runs on Linux. Tayga exposes a tun device, and will translate IPv4 packets sent to that interface to IPv6 and vice versa. Tayga only implements the layer 3 parts of NAT64, in that it dynamically maps source IPv6 addresses to IPv4 addresses drawn randomly from a pool, and then expects the administrator to use the usual tools for NAT44 (e.g. iptables -j MASQUERADE on egress) for handling layer 4 state tracking. Tayga will also perform SIIT on packets whose addresses both fall within its translation prefix (as in RFC6052), and the version packaged in Debian includes feature patches which allow EAMs to be configured. Tayga will also assign itself an IPv4 and an IPv6 address, so you can ping the translator directly.

I configured Tayga on the border router with the RFC6052 well-known prefix for translating external IPv4 addresses, and added an EAM associating 192.0.2.0/24 with 2001:db8:1234:64:ffff:ffff:ffff:ff00/120. I used 10.0.0.0/8 as the dynamic IPv4 address pool (as that's the largest RFC1918 netblock), with iptables rules for performing outgoing NAT44 for translating packets originating from this range. The Tayga configuration looks like this:

# tun device and data dir setup
tun-device nat64
data-dir /var/spool/tayga

# translation prefix and dynamic ipv4 pool setup
prefix 64:ff9b::/96
dynamic-pool 10.0.0.0/8

# translator ipv4 and ipv6 addresses
ipv4-addr 10.0.0.1
ipv6-addr 2001:db8:1234:64::1

# rfc7757 EAM
map 192.0.2.0/24 2001:db8:1234:64:ffff:ffff:ffff:ff00/120
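
The NAT44 side of this is then a single masquerading rule on the border router; something along these lines, assuming eth0 is the interface facing the transit provider:

# iptables -t nat -A POSTROUTING -s 10.0.0.0/8 -o eth0 -j MASQUERADE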

I configured the table mapping route reflector to rewrite the nexthop of IPv6-converted IPv4 routes to the address assigned to Tayga, so that the border router will direct outgoing traffic to the translator. I also added a static route to the kernel so that incoming IPv4 traffic would also be directed to Tayga for translation to IPv6.
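
In concrete terms, the kernel routes pointing into Tayga's tun device look something like the following; the 192.0.2.0/24 route catches incoming traffic for the test network's IPv4 block, and the 10.0.0.0/8 route catches return traffic for the dynamic pool after the NAT44 step:

# ip route add 192.0.2.0/24 dev nat64
# ip route add 10.0.0.0/8 dev nat64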

The next bit of fun was to see if this actually all worked.

Test drive

I set up two more servers on the test network, behind the border router, to check that the SIIT and NAT64 translation worked correctly. One was a web server, so I could test the reachability of hosts within the test network, and the other was an intermediate router between the web server and the border router, so I would later be able to add more border routers without needing to manage any dynamic routing on the web server host directly.

I assigned two IPv6 addresses to the web server: one inside the 2001:db8:1234:64:ffff:ffff:ffff:ff00/120 prefix, which therefore had a corresponding IPv4 address within 192.0.2.0/24 that SIIT would map it to and from, and another from a different subnet within the test network's IPv6 block, so that packets with that source address would be processed by the NAT64. This meant that I could test both the SIIT and NAT64 translation by choosing an appropriate source address on the web server, and that the web server would be reachable by external IPv4 hosts through the SIIT translator.

Results

After some probing from both within and outside the test network, everything seemed to work:

While this was a useful test of the basic premise, it's a little unrealistic. The test network so far only had a single border router with a single upstream, while most networks on the internet are instead multihomed, using several transit providers for redundancy.

Issues with multihoming

I then added a second border router with its own upstream connectivity to the test network, which was directly connected to the first border router and the intermediate router I mentioned above, with the links between them forming a triangle. Both the border routers peered with the table mapping route reflector, which would then select the best path for each IPv4 route before translating and re-exporting them as IPv6 routes.

Then, I repeated the tests to check whether the SIIT and NAT64 still worked. The SIIT translation appeared to still work correctly (which I tested from multiple remote hosts, to cover ingress and egress paths through both border routers); however, to my surprise, traffic passing through the NAT64 translator would occasionally fail on the reverse path — packets sent through the NAT64 translator from the web server would sometimes never receive a response.

I eventually determined this was due to asymmetric routing and the statefulness of NAT64. Asymmetric routing occurs between two networks when the path which one network uses to reach the other (passing through one or more intermediate networks) is not the same path as the one used by the second to reach the first. This could mean that the egress and ingress direction of a flow (such as a TCP session) might pass through different border routers, and this is perfectly normal. This does, however, interfere with performing stateful flow tracking on border routers, as they may only see one direction of the flow, with the other direction passing through another router.

For example, in the diagram below, Hosts 1 and 2 are exchanging packets, however the A-side routers can only see one direction of the packet flow, and B-side routers can only see the opposite direction.


                                +--------+
                                | Host 1 |
                                +--------+
                                  |    ^
                                  |    |
                         +--------+    +--------+
                         |                      |
                         v                      |
                +------------------+   +------------------+
                | AS64501 router A |   | AS64501 router B |
                +------------------+   +------------------+
                         |                      ^
                         |                      |
                         v                      |
                +------------------+   +------------------+
                | AS64502 router A |   | AS64502 router B |
                +------------------+   +------------------+
                         |                      ^
                         |                      |
                         +--------+    +--------+
                                  |    |
                                  v    |
                                +--------+
                                | Host 2 |
                                +--------+

The NAT64 translators ran afoul of precisely this problem. When packets pass through the NAT64 translator on exit from the test network, the translator records state information to identify the layer 4 sessions associated with those packets, so that packets entering the network through the NAT64 translator can be matched against existing sessions, and routed to the appropriate internal endpoint. However, if incoming packets don't match a known session, then the translator simply drops them.

The test network's NAT64 translators didn't share their session state records, and otherwise operated in isolation. Hence, if the paths between the test network and some remote network were asymmetric, then this would result in one direction of the flow exiting through one border router (and creating session state entries along the way), while the return path would pass through the other border router, which would drop the packets, due to lacking the session state information recorded on the first router.

Back to the drawing board

I thought of a couple of ways to solve this issue. One is to use a distinct NAT64 egress address per border router — the egress address then identifies which border router performed the translation, so incoming traffic can internally be routed back to that router. However, this means that ingress traffic may need to be translated between IPv4 and IPv6 multiple times before it arrives at its destination.

If a packet enters the network through the first border router with that router's NAT64 address as its destination, then it can be handled by the kernel's NAT44 firewall rules and the local Tayga instance. However, if a packet were to enter the network through the first border router for the second border router's NAT64 address, then this packet would first be translated to IPv6 using SIIT in order to be forwarded to the second router. At the second router, the packet would need to be translated back to IPv4 via SIIT again in order to be handled by the kernel NAT44 firewall rule, and then be converted back to IPv6 so it could be forwarded to its end destination.

This seemed excessive and complex — a simpler solution would be to find some way to synchronise the NAT64 session states between multiple routers. The tradeoff here is that NAT64 session state tracking becomes a distributed system, which has implications of its own, but I decided it would be a worthwhile avenue of investigation. Unfortunately this isn't possible with Tayga; while it's theoretically possible to synchronise the state of the kernel's conntrack subsystem (which maintains records of session states for regular NAT44), it wasn't immediately obvious whether it was possible to synchronise Tayga's mapping of IPv6 source addresses to the dynamic pool of IPv4 addresses. I then needed a NAT64 translator which supports this kind of state synchronisation to replace Tayga, and as it happens (for better or for worse) such a thing already exists.

Translators: implementation, take 2

Jool is a kernel SIIT and NAT64 translator for Linux, which is implemented as a series of third-party kernel modules which perform the translation functions, with some accompanying user space tools. Jool includes a daemon which listens on a netlink socket for notifications from the kernel about new NAT64 session state entries being created, and then uses multicast to send these state entries to other hosts. The daemon likewise listens for session state entries announced by other hosts, and adds them to the local kernel's session state records. This functionality is primarily intended for use in high-availability scenarios, where a network administrator is operating multiple NAT64 translators for redundancy purposes, however the principle is the same for the distributed NAT64-on-the-edge I wanted to implement.

At the time of writing, Jool has two operating modes. In the first, it inspects all traffic passing through the host machine, and performs translation on all matching packets; any packet which the host machine receives which is a candidate for translation is automatically translated. This is a little coarse, as it doesn't leave much room for defining policy about how and when different packets should be translated. Jool's second mode provides an iptables target which can be used within iptables' mangle table, so it's possible to selectively hairpin incoming packets through a translator based on whether the packet satisfies some administrator-defined condition.

I opted to use the iptables mode, as this gave me finer control over translation policy, and over which packets would be sent to which translator. I created both a NAT64 and a SIIT instance, and set them up similarly to how I used Tayga. Jool's SIIT implementation supports EAMs, so I configured it with the RFC6052 well-known prefix for translating IPv4 addresses, and added a mapping between 192.0.2.0/24 and 2001:db8:1234:64:ffff:ffff:ffff:ff00/120, as before. Jool performs both the layer 3 and layer 4 functions of NAT64, so I didn't need the iptables rules for performing address masquerading on egress, and configured the translator with the egress address 192.0.2.254 directly.
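
Creating the two Jool instances looks roughly like this (a sketch from memory of Jool's command line interface, so treat the exact flags as assumptions to be checked against the Jool documentation; the EAM entry and the 192.0.2.254 egress address are then added with the jool_siit eamt add and jool pool4 add subcommands respectively):

# modprobe jool
# modprobe jool_siit
# jool instance add "default" --iptables --pool6 64:ff9b::/96
# jool_siit instance add "default" --iptables --pool6 64:ff9b::/96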

Then, I set up some iptables rules for handling translation on each border router. First, I added rules to handle incoming IPv4 packets (received on the interface facing the transit provider), ordered so that NAT64 ingress traffic is matched first, and any remaining packets fall through to the SIIT translator:

# iptables -t mangle -A PREROUTING -d 192.0.2.254/32 -i eth0 \
>       -j JOOL --instance default
# iptables -t mangle -A PREROUTING -d 192.0.2.0/24 -i eth0 \
>       -j JOOL_SIIT --instance default

Next, I set up the corresponding rules for handling outgoing IPv6 packets — this has the opposite order, as packets with SIIT-translatable source addresses are handled first. The rules below were repeated once for each interior interface on each border router:

# ip6tables -t mangle -A PREROUTING -i wg0 -s 2001:db8:1234:64:ffff:ffff:ffff:ff00/120 \
>       -d 64:ff9b::/96 \
>       -j JOOL_SIIT --instance default
# ip6tables -t mangle -A PREROUTING -i wg0 -s 2001:db8:1234::/48 -d 64:ff9b::/96 \
>       -j JOOL --instance default

I also added an iptables rule to drop any outgoing packets towards 64:ff9b::/96 on the uplink interface, to prevent any internal traffic leaking out to the transit provider.
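
That rule is just a drop in the forward path; something like this, again assuming eth0 is the uplink interface:

# ip6tables -A FORWARD -o eth0 -d 64:ff9b::/96 -j DROP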

I then set up the NAT64 state synchronisation daemon on both border routers, and configured them to exchange state information over the Wireguard interface I'd set up between them.

Finally, I configured the route reflector to rewrite the nexthops on translated routes to the border routers themselves instead of the (now deconfigured) tun interfaces provided by Tayga.

Test drive, take 2

I then repeated the tests I previously performed with Tayga. In particular I was interested in whether the NAT64 state synchronisation would work correctly, and I used Jool's user space tools to monitor the NAT64 states on both border routers during testing.

As expected, Jool's SIIT translator appeared to function correctly, as the statelessness of the translation means that it's path agnostic. However, initially I had trouble with the NAT64 states failing to synchronise properly, and I observed the same issues with asymmetric routing that I did with Tayga. This was in spite of the synchronisation daemon, which appeared to be running correctly and transmitting packets on the interface between the border routers.

It turned out that this was actually a bug in Jool which had already been reported, and with some additional information the maintainers were able to fix the issue. Once I applied this fix to the test network, the NAT64 states were finally synchronised between the border routers correctly, and asymmetric routing was no longer an issue.

With this problem solved, all the pieces fit together, and both the SIIT and NAT64 translation worked properly in both ingress and egress directions. Mission accomplished!

Conclusion: closing thoughts

The test network succeeded in its purpose: it demonstrates that it's possible to build a network which is internally IPv6-only, and which interoperates with IPv4 networks — using a combination of network-layer translation mechanisms on the border, and a fine-grained mapping of IPv4 routing information into a single, unified IPv6 control plane. I'm quite pleased that most of the building blocks are off-the-shelf tools, with the modified BIRD version being the sole exception, as this isn't too far removed from something one could deploy in a production situation.

That being said, while I'm happy with the testbed system as a tech demo, I think some of the design decisions would need to be revisited before deploying this kind of network more seriously. In particular, the distributed NAT64 introduces a number of failure modes which could lead to unwanted packet loss, such as border routers becoming partitioned and unable to synchronise session states while otherwise still being connected to the rest of the network. A more robust approach may be to only perform SIIT on the border, and then internally perform IPv6 NAT on a per-site basis, using SIIT-translatable addresses as NAT egress addresses.

I also ran into a couple of scaling issues along the way while building out the test system, which was at least partly due to the comparatively low-powered machines I was using. Handling full routing tables with BIRD was slow, and reaching convergence from a dry start took several minutes and several hundred megabytes of memory. This isn't helped by the size of the IPv4 routing table, nor the fact that the border routers had to hold this table in memory twice (once as IPv4 and once converted to IPv6).

The Linux kernel also still has some quirks with IPv6 routing, not least of which is the infamous bug which hamstrung IPv6 forwarding between kernel versions 4.16 and 5.8. Fully loaded, the kernel routing table contained nearly a million IPv6 routes (covering both address families' default-free zones), and manipulating the routing table in this state was a bit slow, in a way that suggests some performance limits were being reached, though I didn't check whether forwarding performance was affected.

This experiment has run its course for now, but I'm very enthusiastic to try applying the idea more seriously in the future!

Acknowledgements

I had help from a lot of people while I was working on this post and the test network:


Changelog