Welcome to Volume 2, Issue 16 of The Internet Security Conference Newsletter, Insight. Insight provides commentaries and educational columns, authored by some of the best minds in the security community. Many of our columnists teach and speak at The Internet Security Conference.
The editorial calendar at this time includes:
Previous issues are posted here.
TISC is about sharing clue. So is the newsletter. We promise to provide something useful each issue. If we don't, flame me.
Enjoy, and be safe,
Dave
In this issue, Radia Perlman describes methods one might employ to make internets resilient from malicious or malfunctioning switches. This column is based on her Ph. D. thesis, yet the subject is still relevant today because all these years have passed and the routing protocols we use today are at best marginally more resistant to sabotaged and incorrectly operating routing "nodes" than those we used in the 1970's.
Radia Perlman, Sun Microsystems
A routing protocol is a distributed algorithm in which a collection of switches (the term "switch" seems to have replaced the words "bridge" and "router") exchange data and cooperate to calculate compatible forwarding decisions. In all of today's deployed routing algorithms (distance vector algorithms such as RIP, link state algorithms such as IS-IS or OSPF, the spanning tree algorithm used by bridges), a single malicious switch may bring down the entire portion of the network in which the routing algorithm is running. The malicious switch might lie about who it is connected to, flood the network with bogus data, misroute or mangle packets it should be forwarding.
The usual method for people to "harden" routing protocols is to put an authentication field into routing packets, usually a keyed message digest (MD) signed with a secret shared by all the switches participating in the algorithm. This prevents a node other than an authorized switch from injecting faulty routing control traffic. But it does not do anything to defend against a trusted switch that has become malicious.
Why would a switch be malicious? It might have been sabotaged by malicious code. The admin account of the switch may have been compromised or hacked. Or in many cases, routers are flaky enough that they do not need the help of saboteurs to start acting erratically: software bugs, misconfiguration, hardware problems can cause arbitrary failures.
Such failures are known as Byzantine failures, after a famous computer science problem known as the Byzantine generals problem, in which a collection of processors, through pair-wise communication, agree on a binary value. That problem is surprisingly hard to solve. There are many papers written on the subject, mostly about how hard the problem is. For instance, this can't be solved if more than 1/3 of the processors are faulty. But surprisingly, routing resilient to Byzantine failures is not that difficult. It was the subject of my Ph. D. thesis at MIT in 1988. At the time, my goal was to graduate, but since then interest in the thesis has grown and I am often asked for on-line copies of it. Since I do not have soft copy of the thesis, and because for the thesis I was trying to make it be at least 100 pages (it seemed like a thesis ought to be at least 100 pages long), I will instead use this opportunity to write a new, succinct description of the solution.
Briefly, the solution consists of designing robust flooding, where a node is able to send a routing message to all the other nodes (at least the ones to which it has at least one correctly functioning path). The robust flooding will be used in order to disseminate the list of public keys of all the nodes (so that nodes need only be configured with one trusted node public key), and for disseminating link state information in a link state protocol similar to IS-IS (and to OSPF and to PNNI...all these protocols are based on IS-IS).
Now all nodes will have a link state database telling them about all the correctly functioning links and switches (and some extra, malicious information perhaps). Instead of doing hop-by-hop routing, where each switch independently decides the next hop, the source (I'm assuming all nodes also act as switches) selects an entire path to the destination, and sends a signed path setup routing message listing the route. Switches authenticate the signed path routing message using the public key received in the flood. When forwarding data, they must only confirm that packets arrive from the expected port. No cryptographic check needs to be made when forwarding data.
Now, to give more detail on each of these steps.
Flooding has the property that as long as there is a single correctly functioning path between source (S) and destination (D), a packet will reach D, provided there is infinite bandwidth, which of course there isn't. So the trick is to enforce fair use of the resources. Each switch will have at least one buffer for each possible source switch. In order to ensure that the routing message occupying the buffer reserved for S was generated by S, S signs the message. To ensure that old routing messages from S do not compete with the newest, each message has a sequence number.
When a switch receives a routing message from source S, it checks that it is properly signed by S. If so, the switch checks the sequence number of the received message against the one in its database. If the received message has a larger sequence number, the switch overwrites the buffer reserved for S with the newer message, and marks the message as needing to be forwarded to each neighbor except the one from which the message was received (i.e., the one that forwarded the message from S). That neighbor is marked, for S's routing message, as needing to be sent an acknowledgment.
For each routing message in memory and for each link, a switch keeps a database of "state" for that message. The state can consist of "transmit", "ack", or "OK". When a link is available, the switch goes round-robin through the routing messages. If the flag is "transmit", the switch transmits that message to that neighbor. If the flag is "ack", the switch sends an acknowledgment for that message to that neighbor, and changes the state to "OK". If the flag is "OK", the switch skips over that message to the next. To those familiar with link state routing, this is exactly the algorithm used in IS-IS to flood link state information.
The robust flooding is used for two types of information. The first type is a list of public keys of all the nodes, flooded by one or a small number of "trusted nodes". In this way switches do not need to be configured with the public keys of all the nodes: only the "trusted nodes". To add a new switch into the net, it must choose a public/private key pair, be configured with the trusted node(s) public key(s), and the trusted node(s) must be configured with the new node's public key.
The second type of information is link state information, which is generated by each switch and consists of that switch's ID and the identities of its neighbors, and cost of the link to those neighbors. The messages sent in this second type of flood are signed and sent by individual switches, using their own private keys.
The link state database gives complete information about the topology (and more, unfortunately, since some of the information may be incorrect due to malicious nodes). However, since the source will calculate the entire path, it can use any decision criteria it wants in order to select a path. It doesn't have to select the best path, but simply a path that works.
Assuming the source node gets feedback about whether the path is working properly or not (feedback from a higher layer), then if a path works, that switch can note that the links and switches along that path are likely to be trustworthy, and favor routes containing those. If a path does not work, then the source will attempt to compute a different path, avoiding as many as possible of the switches from the first path.
Assuming not many switches are malicious, it will not take long to find a path that works, if one exists. If a lot of the switches are malicious, it becomes exponentially more difficult to find a path. However, after some threshold of failures, a source could revert to flooding its data, which would work but be suboptimal.
When the source has chosen a path, it sends a special "path setup" routing message, digitally signed, with the path specified. It travels along the specified path. Each switch records the expected input and output ports for that source/destination pair.
Data packets do not need to be cryptographically protected. Forwarding data is the function that most needs to be efficient in data packet forwarding. Using this approach, the work to forward a packet is not significantly more difficult than with a conventional network The only extra work to forward a data packet is to check that it arrived from the expected neighbor.
Summary
It would not be that difficult to convert a traditional network into routing according to this algorithm. It is basically a link state routing scheme in which the link state routing message are digitally signed. Then a path setup procedure, which is already used in connection-oriented networks like ATM and is being added to IP in the form of MPLS, is also enhanced with digital signatures.
Truth in Advertising
This work documents a thesis rather than a full-blown design. It shows the feasibility in a network small enough that each switch can reserve a buffer for each source. In larger networks, there would need to be hierarchy. Although hierarchy was discussed in the thesis, there is some complexity in implementing it. Imagine a two-level network, rather like dividing a country into "states", and having intrastate and interstate routing. Within a state all the switches would know about each other. The interstate switches would also all have to know about each other. Since a source within some "state" would not be known by the interstate switches, the egress switch from the source's state would need to sign the routing message, vouching to the interstate switches that the message arrived from a valid switch. And that egress switch would have to ensure fair use of its allocated interstate traffic from among the sources it serves.
Another issue which does not have a wonderfully clean solution is selection of paths in the presence of many malicious nodes. Since there are exponentially many paths, if there were really only a single correct path, and all other paths had at least one malicious switch, it would take too long to find the functional path. Fallback in this scenario to flooding might be the best solution.