slide 1 - Title
New Techniques for Making TCP Robust to Corruption-Based Loss
Wesley Eddy, Shawn Ostermann
Ohio University
Mark Allman
BBN Technologies/NASA GRC

slide 2 - Background
This work builds on the foundations laid by work BBN did on Explicit Transmission Error Notification last year, sponsored by NASA. We're mostly interested in improving TCP throughput in bulk-transfer applications. The basic problem is that for space applications we assume packet errors will be fairly likely and unrelated to congestion, so TCP's throughput may be unnecessarily poor.

slide 3 - TCP's Response to Packet Loss
Here we have a plot showing how TCP's maximum attainable throughput falls off as the total packet loss rate increases. Both axes are logarithmic, so notice that it falls off pretty quickly, and this is an upper bound.

slide 4 - Packet Errors Considered Harmful
That packet loss rate p has two components, congestion and corruption, which we'll call c and e. IP routers pretty much only drop packets because they're too busy (congestion) or because the packets are damaged (corruption). It's clear that if the router is too busy we need to slow down. If a packet was just hit with some uncorrectable bit errors, though ... it's not so clear that slowing down helps any ... it depends on the link layer technology, and in many cases there's no correlation between an overburdened link and corruption, and thus no need to slow down for bit errors. If we have a nicely provisioned satellite connection, we'll probably get hit by more errors than congestion events, and TCP will end up going a lot slower than it has to. Clear?

slide 5 - How Much Better Can It Be?
Here's the first plot again, with a line added for a TCP that magically knows whether a loss was caused by an error or by congestion and slows down accordingly. This is once again log scale, so for high p this is actually a pretty big percentage gain.
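To put rough numbers on the two curves from slides 3 and 5, here is a small sketch. It assumes the well-known Mathis et al. approximation for the throughput upper bound (rate is at most (MSS/RTT) * sqrt(3/2) / sqrt(p)); the notes don't say which model the plots actually use. It also assumes the "magic" TCP behaves like stock TCP seeing only the congestion loss rate c = p - e, which ignores the cost of retransmitting corrupted packets. The function names are made up for illustration.

```python
import math

# Constant from the Mathis et al. model (an assumption; the slides do not name a model).
MATHIS_C = math.sqrt(3.0 / 2.0)


def throughput_bound(mss_bytes, rtt_s, loss_rate):
    """Approximate upper bound on steady-state TCP throughput, in bytes/second."""
    if not 0.0 < loss_rate <= 1.0:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes / rtt_s) * MATHIS_C / math.sqrt(loss_rate)


def ideal_gain(p, e):
    """Ratio of error-aware throughput to stock throughput under this model.

    Stock TCP backs off for every loss (rate p); the idealized TCP is assumed
    to back off only for congestion losses (rate c = p - e).
    """
    c = p - e
    if c <= 0.0:
        raise ValueError("need c = p - e > 0 for the bound to stay finite")
    return math.sqrt(p / c)


if __name__ == "__main__":
    # Example values only: 1460-byte segments, 500 ms RTT, 2% total loss.
    print(throughput_bound(1460, 0.5, 0.02))
    # If three quarters of the loss is corruption, the ideal gain is about 2x.
    print(ideal_gain(0.02, 0.015))
```

Under these assumptions the gain depends only on p/c, so it grows quickly as the corruption share of the loss approaches 1.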
slide 6 - With Perfect Knowledge
This is a plot of the TCP congestion window over time. The area underneath it (the integral) is proportional to throughput. The W_max line is the maximum rate the network resources will let us burst at without congestion. On the top we show a TCP that magically knows when it has reached this limit and lost a packet, and when a packet was just lost due to an error. On the bottom we show an unenlightened TCP that just has to assume congestion. The top one has a nice big chunk of area in that first triangle that the bottom one doesn't, so an error-aware TCP has large potential throughput gains.

slide 7 - Cumulative Explicit Transmission Error Notification
It's not hard for a router to keep track of what proportion of incoming packets it ends up detecting errors in and throwing out. If we subtract that from 1, we get a survival probability, and we can multiply the survival probabilities all the way along the path to get the total probability that a packet will propagate all the way from sender to receiver without suffering a detected uncorrectable error. This is (1-e).

slide 8 - Help Where You LEAST Expect It
We've recently developed an algorithm (called LEAST) that we can use to very accurately determine p. It operates passively on a TCP sender and is based on inferences drawn from the ACK stream. Effectively, our estimate is that the number of needed retransmits must be the same as the number of lost packets ... so we just have to subtract spurious and unneeded duplicate retransmits from the total number of retransmits. We have a paper with more detailed information if you're interested. So now we have e and we have p. We probably know enough about what's causing losses to do something intelligent to TCP congestion control that will both give us better throughput AND still play nicely under real congestion losses.

slide 9 - Modified Congestion Window Update
We've looked at two ways to modify the congestion response ...
The probabilistic way was to sort of guess at the reason each time we needed to retransmit. This was the original way that the BBN folks thought of. We noticed a big problem in that wrong guesses tend to inflate p ... e is a (long-term) static channel property, so e/p falls and our guesses don't benefit us as much in the long run, although it does work pretty well for shorter flows. We've had a little better long-term success using a cwnd multiplier that comes from the range 1/2 to 1. This allows us to be more aggressive than standard TCP, yet reasonably conservative if congestion losses are prevalent. Call this the deterministic algorithm.

slide 10 - Gain in Throughput
Here's a plot of TCP throughput at a fixed packet loss rate p of two percent. On the x-axis we vary the amount of this that's due to errors. Since stock TCP doesn't know the difference, its line is flat. The CETEN line shown is our deterministic algorithm's performance curve. Notice the significant gains toward the right, where most losses are due to errors, keeping in mind that this is plotted logarithmically.

slide 11 - Future Work
There are still a few corners we're nailing down. We want to make sure multiple flows don't choke each other and that CETEN flows don't choke normal TCP. It's no good if we're not neighborly. There also may be other algorithms that knowing e/p allows us to perform. There's room for changes in the way error rates are propagated without busting our congestion control modifications ... the two are sort of layered. Finally, there's the possibility that a receiver could claim all losses are due to corruption in order to try to cheat the sender into giving it an unfair amount of bandwidth. We have some ideas for mitigating this.
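
Backing up to slides 7 through 9, here is a rough sketch of how the pieces could fit together. It is not our implementation, and several details are assumptions the notes don't spell out: the loss estimate divides needed retransmits by total segments transmitted, and the deterministic multiplier is taken to be (1 + e/p)/2, which is just one simple mapping of e/p onto the range 1/2 to 1. All function names are hypothetical.

```python
def path_error_rate(per_hop_error_rates):
    """Cumulative corruption rate e for a path.

    Each hop reports the fraction of packets it dropped for detected errors.
    Multiply the per-hop survival probabilities to get (1 - e), then invert.
    """
    survival = 1.0
    for rate in per_hop_error_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("per-hop error rates must be in [0, 1]")
        survival *= 1.0 - rate
    return 1.0 - survival


def estimate_total_loss(segments_transmitted, retransmits, spurious, duplicate):
    """LEAST-style estimate of the total loss rate p.

    Needed retransmits = all retransmits minus spurious and unneeded duplicate
    ones. Dividing by total segments transmitted is an assumption made here.
    """
    if segments_transmitted <= 0:
        raise ValueError("segments_transmitted must be positive")
    needed = retransmits - spurious - duplicate
    return max(needed, 0) / segments_transmitted


def cwnd_multiplier(e, p):
    """Deterministic multiplier in [1/2, 1] applied to cwnd on a loss.

    Assumed form: (1 + e/p) / 2. All-congestion loss gives the standard 1/2;
    all-corruption loss gives 1 (no reduction).
    """
    if p <= 0.0:
        return 0.5  # no loss estimate yet: fall back to standard TCP
    share = min(max(e / p, 0.0), 1.0)  # clamp in case the estimates disagree
    return (1.0 + share) / 2.0


if __name__ == "__main__":
    e = path_error_rate([0.005, 0.01])             # two lossy hops (example values)
    p = estimate_total_loss(1000, 25, 3, 2)        # 20 needed retransmits in 1000
    print(e, p, cwnd_multiplier(e, p))
```

Clamping e/p matters because e and p come from different measurements, so a noisy sample can briefly put e above p. Falling back to 1/2 when there's no loss estimate keeps the sender no more aggressive than standard TCP.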