slide 1 - Title

New Techniques for Making TCP Robust to Corruption-Based Loss

Wesley Eddy, Shawn Ostermann
Ohio University

Mark Allman
BBN Technologies/NASA GRC

slide 2 - Background

This builds on the foundations laid down by some work BBN did on Explicit
Transmission Error Notification last year, sponsored by NASA.  We're mostly
interested in improving TCP throughput in bulk-transfer applications. The basic
problem here is that for space applications we assume packet errors will be
fairly likely and unrelated to congestion, so TCP's throughput may be
unnecessarily poor.

slide 3 - TCP's Response to Packet Loss

So here we have a plot showing how TCP's maximum attainable throughput falls
off as the total packet loss rate increases.  Both axes are logarithmic, so
notice it falls off pretty quickly, and this is an upper bound.
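
To get a feel for how fast the bound drops, here's a rough sketch. The notes
don't say which model the plot uses, so this assumes the well-known
square-root model, bound = (MSS/RTT) * C / sqrt(p). The constant
C = sqrt(3/2) and the MSS and RTT values are illustrative assumptions, not
numbers from the talk.

```python
import math


def tcp_throughput_bound(mss_bytes, rtt_s, p, c=math.sqrt(1.5)):
    """Upper bound on TCP throughput in bytes/second (square-root model).

    Assumption: the plot follows bound = (MSS / RTT) * C / sqrt(p).
    The default C = sqrt(3/2) is one common choice, not a value from the talk.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("loss rate p must be in (0, 1]")
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    return (mss_bytes / rtt_s) * c / math.sqrt(p)


if __name__ == "__main__":
    # Hypothetical link: 1460-byte segments, 500 ms round-trip time.
    for p in (1e-4, 1e-3, 1e-2, 1e-1):
        bound = tcp_throughput_bound(1460, 0.5, p)
        print(f"p={p:g}  bound={bound / 1e3:.1f} kB/s")
```

Under this model, every 100x increase in p costs a factor of 10 in attainable
throughput, which is the straight line you see on a log-log plot.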

slide 4 - Packet Errors Considered Harmful

That packet loss rate p has two components, congestion and corruption, which
we'll call c and e.  IP routers pretty much only drop packets because they're
too busy (congestion) or the packets are damaged (corruption).  Now it's clear
that if the router is too busy we need to slow down.  If a packet was just hit
with some uncorrectable bit errors though ... it's not so clear that slowing
down helps any ... it depends on the link layer technology and in many cases
there's no correlation between an overburdened link and corruption and thus no
need to slow down for bit errors.

It's pretty clear that if we have a nicely provisioned satellite connection,
we'll probably get hit by more errors than congestion events and TCP will end
up going a lot slower than it has to.

clear?

slide 5 - How Much Better Can It Be?

Here's the first plot again, with a line added for a TCP that magically
knows if a loss was caused by an error or congestion and slows down
accordingly.  This is once again logscale, so for high p this is actually
a pretty big percent gain.

slide 6 - With Perfect Knowledge

This is a plot of the TCP congestion window over time.  The area underneath
it (integral) is equivalent to throughput.  The W_max line is the maximum
speed the network resources will let us burst at without congestion.  On
the top we show a TCP that magically knows when it has reached this limit
and lost a packet and when a packet was just lost due to an error.  On the
bottom we show an unenlightened TCP that just has to assume congestion.  It's
clear the top one has a nice big chunk of area in that first triangle that
the bottom one doesn't, so an error-aware TCP has large potential throughput
gains.

slide 7 - Cumulative Explicit Transmission Error Notification

So it's not hard for a router to keep track of what proportion of incoming
packets it ends up detecting errors in and throwing out.  If we subtract
that from 1 it's a survival probability, and we can multiply the survival
rates all the way through the path to get the total probability a packet
will propagate all the way from sender to receiver without suffering
a detected uncorrectable error.  This is (1-e).
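
The multiplication described above can be sketched like this. The function
names and the idea of passing the per-hop rates as a plain list are
assumptions for illustration; how the rates actually travel along the path
isn't shown here.

```python
def path_survival(per_hop_error_rates):
    """Return (1 - e): the probability a packet survives every hop.

    Each entry is the fraction of packets one hop detected as corrupted
    and dropped. Survival probabilities multiply along the path.
    """
    survival = 1.0
    for e_hop in per_hop_error_rates:
        if not 0.0 <= e_hop <= 1.0:
            raise ValueError("each per-hop error rate must be in [0, 1]")
        survival *= 1.0 - e_hop
    return survival


def path_error_rate(per_hop_error_rates):
    """Return e, the end-to-end corruption loss rate for the whole path."""
    return 1.0 - path_survival(per_hop_error_rates)


if __name__ == "__main__":
    # Hypothetical three-hop path with one lossy link.
    hops = [0.001, 0.02, 0.0]
    print("survival (1-e):", path_survival(hops))
    print("e:", path_error_rate(hops))
```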

slide 8 - Help Where You LEAST Expect It

We've recently developed an algorithm (called LEAST) that we can use to
very accurately determine p.  This operates passively on a TCP sender
and is based on inferences drawn from the ACK stream.  Effectively our
estimate is that the number of needed retransmits must be the same as
the number of lost packets ... so we just have to subtract spurious and
unneeded duplicate retransmits from the total number of retransmits.  We
have a paper with more detailed information if you're interested.
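
Just the arithmetic of that estimate, as a sketch. This is not the LEAST
algorithm itself: figuring out which retransmits were spurious or unneeded
from the ACK stream is the hard part and isn't shown. The function name, the
counter names, and dividing by the total number of segments sent are all
assumptions for illustration.

```python
def estimate_loss_rate(segments_sent, retransmits, spurious, unneeded_duplicates):
    """Estimate p as needed retransmits divided by segments sent.

    needed = total retransmits - spurious - unneeded duplicates,
    and needed retransmits are taken to equal lost packets.
    Assumption: the denominator is the total count of segments sent.
    """
    if segments_sent <= 0:
        raise ValueError("segments_sent must be positive")
    needed = retransmits - spurious - unneeded_duplicates
    if needed < 0:
        raise ValueError("spurious + unneeded cannot exceed total retransmits")
    return needed / segments_sent


if __name__ == "__main__":
    # Hypothetical counters from one connection.
    p = estimate_loss_rate(segments_sent=1000, retransmits=30,
                           spurious=5, unneeded_duplicates=5)
    print("estimated p:", p)
```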

So now we have e and we have p.  We probably know enough about what's causing
losses to do something intelligent to TCP congestion control that will both
give us better throughput AND still play nicely under real congestion losses.

slide 9 - Modified Congestion Window Update

We've looked at two ways to modify the congestion response ... the
probabilistic way was to sort of guess at the reason each time we needed to
retransmit.  This was the original way that the BBN folks thought of.  We
noticed a big problem in that wrong guesses tend to inflate p ... e is a (long
term) static channel property so e/p falls and our guesses don't benefit us as
much in the long run, although it does work pretty well for shorter flows.
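
A minimal sketch of the probabilistic idea, assuming the guess is made by
treating each loss as corruption with probability e/p and otherwise doing the
standard halving. The exact rule, the function name, and the one-segment
floor on the window are assumptions for illustration.

```python
import random


def probabilistic_response(cwnd, e, p, rng=random):
    """Return the new congestion window after one loss.

    Assumption: with probability e/p we guess "corruption" and leave cwnd
    alone; otherwise we guess "congestion" and halve it like standard TCP.
    """
    # Clamp the ratio so a noisy estimate with e > p cannot exceed 1.
    ratio = min(max(e / p, 0.0), 1.0) if p > 0 else 0.0
    if rng.random() < ratio:
        return cwnd                 # guessed corruption: no reduction
    return max(cwnd / 2.0, 1.0)     # guessed congestion: multiplicative decrease


if __name__ == "__main__":
    rng = random.Random(1)  # seeded only to make the demo repeatable
    cwnd = 64.0
    for _ in range(5):
        cwnd = probabilistic_response(cwnd, e=0.015, p=0.02, rng=rng)
        print(cwnd)
```

Every wrong "corruption" guess leaves the window too big during real
congestion, which causes more loss and pushes p up while e stays put.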

We've had a little bit better long-term success with using a cwnd multiplier
that comes from the range 1/2 to 1.  This allows us to be more aggressive than
standard TCP, yet reasonably conservative if congestion losses are prevalent.
Call this the deterministic algorithm.
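
One way to get a multiplier in the range 1/2 to 1 is to interpolate linearly
on e/p, as sketched here. The notes don't give the exact formula, so the
mapping (1 + e/p) / 2, the function names, and the one-segment floor are
assumptions for illustration.

```python
def cwnd_multiplier(e, p):
    """Map the corruption share e/p onto a multiplier in [0.5, 1.0].

    Assumption: linear interpolation, (1 + e/p) / 2. With e/p = 0 this is
    standard TCP halving; with e/p = 1 the window is left unchanged.
    """
    if p <= 0:
        return 0.5  # no loss estimate yet: fall back to standard halving
    ratio = min(max(e / p, 0.0), 1.0)
    return 0.5 * (1.0 + ratio)


def deterministic_response(cwnd, e, p):
    """Return the new congestion window after one loss."""
    return max(cwnd * cwnd_multiplier(e, p), 1.0)


if __name__ == "__main__":
    # Hypothetical case: p = 2%, three quarters of it due to corruption.
    print(cwnd_multiplier(0.015, 0.02))
    print(deterministic_response(64.0, 0.015, 0.02))
```

Unlike the probabilistic version, every loss still shrinks the window a bit
unless e/p is 1, so a wrong estimate costs less under real congestion.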

slide 10 - Gain in Throughput

Here's a plot of TCP throughput at a given packet loss rate p of two percent.
On the x-axis we set the amount of this that's due to errors.  Since stock
TCP doesn't know the difference, its line is flat.  The CETEN line shown
is our deterministic algorithm's performance curve.  Notice the significant
gains towards the right where most losses are due to errors, keeping in mind
that this is plotted logarithmically.

slide 11 - Future Work

There are still a few corners we're nailing down.  We want to make sure
multiple flows don't choke themselves and that CETEN flows don't choke normal
TCP.  It's no good if we're not neighborly.

There also may be other algorithms that knowing e/p allows us to perform.

There's room for changes in the way error rates are propagated without
busting our congestion control modifications ... the two are sort of
layered.

Finally, there's the possibility that a receiver could claim all losses
are due to corruption in order to try and cheat the sender into giving
him an unfair amount of bandwidth.  We have some ideas for mitigating
this.