Packet loss response - but how?

Now that the LEDBAT discussion has died down.... it's clear to me that we've got two scenarios where we HAVE to consider packet loss as an indicator that a congestion control algorithm based on delay will "have to do something":

- Packet loss because of queues filled by TCP (high delay, but no way to reduce it)
- Packet loss because of AQM-handled congestion (low delay, but packets go AWOL anyway)

We also have a third category of loss that we should NOT consider, if we can avoid it:

- Packet loss due to stochastic events like wireless drops.

(Aside: ECN lets you see the difference between the first group and the second: ECN markings are unambiguously in the first group. But we can't assume that ECN is universally deployed any time soon.)

Now - the question is HOW the receiver responds when it sees packet loss.

Some special considerations:

- Due to our interactive target, there is no difference between a massively out-of-order packet and a lost packet. So we can regard anything that comes ~100 ms later than it "should" as lost.
- Due to the metronome-beat nature of most RTP packet streams, and the notion of at least partial unreliability, the "last packet before a pause is lost" scenario of TCP can probably be safely ignored. We can always detect packet loss by looking at the next packet.

Thoughts?

Harald
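
A minimal sketch of the receiver-side detection described here (the class name, the 20 ms nominal packet spacing and the sequence-number handling are illustrative assumptions, not anything specified in the thread): packets are marked lost either because a later sequence number has already arrived ("look at the next packet"), or because they arrive more than ~100 ms after the time the stream's metronome-like cadence says they should have.

```python
LATE_THRESHOLD_MS = 100    # the ~100 ms suggested above; could come from the application
PACKET_INTERVAL_MS = 20    # assumed nominal RTP packet spacing

class LossDetector:
    def __init__(self):
        self.base_seq = None         # first sequence number seen
        self.base_arrival_ms = None  # arrival time of that first packet
        self.highest_seq = None      # highest sequence number seen so far

    def on_packet(self, seq, arrival_ms):
        """Return the set of sequence numbers newly regarded as lost.
        (Ignores RTP sequence-number wraparound and clock drift.)"""
        lost = set()
        if self.base_seq is None:
            self.base_seq, self.base_arrival_ms = seq, arrival_ms
            self.highest_seq = seq
            return lost
        # Gap detection: the packet after a gap proves the gap.
        if seq > self.highest_seq + 1:
            lost.update(range(self.highest_seq + 1, seq))
        self.highest_seq = max(self.highest_seq, seq)
        # Lateness: anything ~100 ms later than it "should" be is as good as lost.
        expected_ms = self.base_arrival_ms + (seq - self.base_seq) * PACKET_INTERVAL_MS
        if arrival_ms - expected_ms > LATE_THRESHOLD_MS:
            lost.add(seq)
        return lost
```

Feeding it sequence numbers 1, 2, 5 would report 3 and 4 as lost as soon as 5 arrives, with no timeout involved.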

Hi Harald,

On Fri, May 4, 2012 at 10:02 AM, Harald Alvestrand <harald@alvestrand.no> wrote:
[snipped the top half of the email.]
Now - the question is HOW the receiver responds when it sees packet loss.
Some special considerations:
- Due to our interactive target, there is no difference between a massively out of order packet and a lost packet. So we can regard anything that comes ~100 ms later than it "should" as lost.
I agree that packets can be discarded by the receiver due to late arrival. 100 ms may seem like a good number, but perhaps this number should come from the application (via the constraints API)? Moreover, interactive applications usually define some kind of optimal end-to-end delay (about 150 ms?) and maximum end-to-end delay (about 400 ms?), so the 100 ms could be adjusted based on the latency of the path. The receiver also has a dejitter buffer which can compensate for *some* packet re-ordering. AFAIK the receiver does this by calculating the packet skew or drift for each packet it receives and adjusting the playout delay, usually with a sliding-window average; window sizes for drift calculation are between 100 and 250 packets.
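
As a rough illustration of that drift tracking, a hypothetical sketch (class and parameter names are invented, and real dejitter buffers do considerably more): it averages the relative transit time over a sliding window of up to 250 packets and sets the playout delay to that average plus some jitter headroom.

```python
from collections import deque

class PlayoutDelayEstimator:
    def __init__(self, window_packets=250, jitter_headroom_ms=20.0):
        self.transits = deque(maxlen=window_packets)  # sliding window of relative transit times
        self.base_transit_ms = None
        self.headroom_ms = jitter_headroom_ms
        self.playout_delay_ms = 0.0

    def on_packet(self, rtp_ts_ms, arrival_ms):
        """rtp_ts_ms: the packet's RTP timestamp already converted to milliseconds."""
        transit = arrival_ms - rtp_ts_ms
        if self.base_transit_ms is None:
            self.base_transit_ms = transit
        # Positive values mean this packet took longer than the first one did.
        self.transits.append(transit - self.base_transit_ms)
        drift = sum(self.transits) / len(self.transits)   # sliding-window average
        self.playout_delay_ms = max(0.0, drift + self.headroom_ms)
        return self.playout_delay_ms
```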
- Due to the metronome-beat nature of most RTP packet streams, and the notion of at least partial unreliability, the "last packet before a pause is lost" scenario of TCP can probably be safely ignored. We can always detect packet loss by looking at the next packet.
Thoughts?
Can you comment on what the current RRTCC does: is a discarded packet included in the rate calculation or not? And how does the value computed by RRTCC compare to the Receiver Rate, the Receiver Goodput and the sending rate? In the algorithms that I worked on using TMMBR for video, the receiver would undershoot to the goodput in intervals with losses and discards; in the case of a few bursty losses in an interval, it would go down to the receiver rate.

The receiver can calculate:

- Receiver Rate: RTP packets received in an interval (including duplicate and retransmitted packets, etc.)
- Goodput: RTP packets that were actually played out in the interval (excluding discarded packets, etc.)
- Sending Rate: the difference between the octets sent in consecutive SRs *may* give the sending rate in the interval (assuming that an SR was not lost).

Usually, Sending Rate >= Receiver Rate >= Goodput.
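
To make the arithmetic concrete, a hypothetical sketch (function and argument names are invented; the octet counts are assumed to be collected elsewhere, and everything is measured over the same reporting interval for simplicity):

```python
def interval_rates(received_octets, played_out_octets,
                   sr_prev_octets, sr_curr_octets, interval_s):
    """Per-interval rates in bits per second for one reporting interval."""
    receiver_rate_bps = 8 * received_octets / interval_s     # incl. duplicates, retx
    goodput_bps = 8 * played_out_octets / interval_s         # excl. discarded packets
    # Consecutive SR octet counters *may* give the sending rate (assuming no SR was lost).
    sending_rate_bps = 8 * (sr_curr_octets - sr_prev_octets) / interval_s
    # Usually: sending_rate_bps >= receiver_rate_bps >= goodput_bps
    return sending_rate_bps, receiver_rate_bps, goodput_bps
```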

On 05/04/2012 03:02 AM, Harald Alvestrand wrote:
Now that the LEDBAT discussion has died down....
it's clear to me that we've got two scenarios where we HAVE to consider packet loss as an indicator that a congestion control algorithm based on delay will "have to do something":
- Packet loss because of queues filled by TCP (high delay, but no way to reduce it)
- Packet loss because of AQM-handled congestion (low delay, but packets go AWOL anyway)
We also have a third category of loss that we should NOT consider, if we can avoid it:
- Packet loss due to stochastic events like wireless drops.
These may be less frequent than you might expect: in fact, we have recently had experience with the opposite problem, finding device drivers which would attempt to retransmit indefinitely (inserting unbounded delay) in the face of problems in the wireless channel. At least we've now got an upper bound on retransmission attempts in the Linux Atheros 9k driver... There are bugs everywhere....

What is "normal" is needing multiple attempts to transmit a packet in WiFi (and therefore additional jitter as those attempts take place). It turns out that it is often/usually a better strategy to *not* drop the transmission rate in the face of transmission errors, but to make multiple transmission attempts (remember, dropping the rate increases the time in which a packet may get damaged by noise). The Minstrel algorithm in Linux is quite sophisticated in this way. But such operation will generate jitter rather than loss.

- Jim

For amusement, see: http://www.rossolson.com/dwelling/2003/11/every-packet-is-sacred/
Unfortunately, some of the world takes this humour seriously, and the result has been bufferbloat...

The RTP timestamps are certainly our friends. I am setting up to run some experiments with the various common buffer management algorithms to see what conclusions can be drawn from inter-packet arrival times. I suspect that the results will vary wildly from the RED-like algorithms to the more primitive tail-drop-like algorithms. In the case of RED-like algorithms, we will hopefully not get too much delay/bloat before the drop event provides a trigger. For the tail-drop-like algorithms, we may have to use the increasing delay/bloat trend as a trigger. As I think about the LEDBAT discussions, I am concerned about the interaction between the various algorithms - but some data should be informative.

We may even be able to differentiate between error-driven loss and congestion-driven loss, particularly if the noise is on the last hop of the network and thus downstream of the congested queue (which is typically where the noise occurs). In my tiny brain, you should be able to see a gap in the time record corresponding to a packet that was dropped due to last-mile noise. A packet dropped in the queue upstream of the last-mile bottleneck would not have that type of time gap. You do need to consider cross traffic in this thought exercise, but statistical methods may be able to separate persistent congestion from persistent noise-driven loss.

TL;DR - We can probably tell that we have queues building prior to the actual loss event, particularly when we need to overcome limitations of poor buffer management algorithms.

Bill VerSteeg
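
A hypothetical sketch of that time-gap test for a single missing packet (the threshold, the tolerance and the names are invented; real use would need statistics over many loss events and a way to account for cross traffic):

```python
def classify_single_loss(prev_arrival_ms, next_arrival_ms, typical_spacing_ms,
                         tolerance=0.5):
    """Guess why exactly one packet between two received neighbours went missing."""
    gap = next_arrival_ms - prev_arrival_ms
    if gap > (2.0 - tolerance) * typical_spacing_ms:
        # The lost packet's transmission slot is visible at the receiver: it was
        # sent over the bottleneck link and then lost to last-mile noise.
        return "noise-like"
    # The next packet arrived in the lost one's slot: the packet was most likely
    # dropped in a queue upstream of the bottleneck link.
    return "congestion-like"
```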

On 5/4/2012 9:50 AM, Bill Ver Steeg (versteb) wrote:
The RTP timestamps are certainly our friends.
I am setting up to run some experiments with the various common buffer management algorithms to see what conclusions can be drawn from inter-packet arrival times. I suspect that the results will vary wildly from the RED-like algorithms to the more primitive tail-drop-like algorithms. In the case of RED-like algorithms, we will hopefully not get to much delay/bloat before the drop event provides a trigger. For the tail-drop-like algorithms, we may have to use the increasing delay/bloat trend as a trigger.
This would match my experience. As mentioned, I found that access-link congestion loss (especially when not competing with sustained TCP flows, which is pretty normal for home use, especially if no one is watching Netflix...) results in a sawtooth delay with losses at the delay drops. This also happens (with more noise and often a faster ramp) when competing, especially when competing with small numbers of flows. Not really unexpected. As this sort of drop corresponds to a full buffer, it's pretty much a 'red flag' for a realtime flow.

RED drops I found to be more useful in avoiding delay (of course). My general mechanism was to drop the transmission rate (bandwidth estimate) by an amount proportional to the drop rate; tail-queue type drops (sawtooth) cause much sharper bandwidth drops. I simply assume all drops are in some way related to congestion.
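
A hypothetical sketch of that kind of response (the factors and the floor are invented for illustration, not the actual constants used):

```python
def reduce_bandwidth(current_bps, loss_fraction, tail_drop_suspected):
    """Return a new bandwidth estimate after a loss report."""
    # Cut proportionally to the loss fraction; cut harder on full-buffer signals.
    factor = 2.0 if tail_drop_suspected else 0.5
    new_bps = current_bps * (1.0 - factor * loss_fraction)
    return max(new_bps, 0.1 * current_bps)   # never collapse the estimate outright
```

With these illustrative factors, a 5% loss rate trims the estimate by 2.5% in the RED-like case and by 10% when a full buffer is suspected.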
As I think about the LEDBAT discussions, I am concerned about the interaction between the various algorithms - but some data should be informative.
Absolutely.
We may even be able to differentiate between error-driven loss and congestion driven loss, particularly if the noise is on the last hop of the network and thus downstream of the congested queue (which is typically where the noise occurs). In my tiny brain, you should be able to see a gap in the time record corresponding to a packet that was dropped due to last-mile noise. A packet dropped in the queue upstream of the last mile bottleneck would not have that type of time gap. You do need to consider cross traffic in this thought exercise, but statistical methods may be able to separate persistent congestion from persistent noise-driven loss.
Exactly the mechanism I used to differentiate "fishy" losses from "random" ones; "fishy" losses, as mentioned, cause bigger responses. I still dropped bandwidth on "random" drops, which can be congestion drops from RED in a core router so long as the router queue isn't too long; you'd also see those from "minimal queue" tail-drop routers. For determining losses (and for my filter info) I used a jitter buffer separate from the normal jitter buffer, which being adaptive might not hold the data long enough for me. I actually kept around a second of delay/loss data on the video channel, and *if there was no loss* or large delay ramp I only reported stats every second or two.
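
A hypothetical sketch of that reporting cadence (thresholds and names are invented): send feedback immediately when the recent window shows loss or a large delay ramp, otherwise only every second or two.

```python
class FeedbackScheduler:
    def __init__(self, quiet_interval_s=1.5, ramp_threshold_ms=50.0):
        self.quiet_interval_s = quiet_interval_s
        self.ramp_threshold_ms = ramp_threshold_ms
        self.last_report_s = 0.0

    def should_report(self, now_s, losses_in_window, delay_ramp_ms):
        """losses_in_window and delay_ramp_ms summarise roughly the last second of data."""
        urgent = losses_in_window > 0 or delay_ramp_ms > self.ramp_threshold_ms
        due = (now_s - self.last_report_s) >= self.quiet_interval_s
        if urgent or due:
            self.last_report_s = now_s
            return True
        return False
```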
TL;DR - We can probably tell that we have queues building prior to the actual loss event, particularly when we need to overcome limitations of poor buffer management algorithms.
If the queues are large enough, or if the over-bandwidth is low enough, yes. If there's a heavy burst of traffic (think modern browsers minimizing pageload time with connections to sharded servers), then you may not get a chance; you may go from no delay to 200 ms of taildrop in an RTT or two - or even between two 20 or 30 ms packets. And you need to filter enough to decide if it's jitter or delay. (You can make an argument that 'jitter' is really the sum of deltas in path queues, but that doesn't help you much in deciding whether to react to it or not.) Data would be useful... :-)

Generally, for realtime media you really want to be undershooting slightly most of the time in order to make sure the queues stay at/near 0. The more uncertainty you have, the more you want to undershoot. A stable delay signal makes it fairly safe to probe for additional bandwidth because you'll get a quick response, and if the probe is a "small" step relative to current bandwidth, the time to recognize the filtered delay signal, inform the other side and have them adapt is short (roughly filter delay + RTT + encoding delay). High jitter can be the result of wireless or cross-traffic, unfortunately.

Also, especially near startup, my equivalent of slow-start was much more aggressive initially to find the safe point, but with each overshoot (and drop back below the apparent rate) in the same bandwidth range I would reduce the magnitude of the next probes until we had pretty much determined the safe rate. This is most effective in finding the channel bandwidth when there is no significant sustained competing traffic on the bottleneck link. If I believed I'd found the channel bandwidth, I would remember that, and be much less likely to probe over that limit, though I would do so occasionally to see if there had been a change. This allowed for faster recovery from short-duration competing traffic (the most common case) without overshooting the channel bandwidth. Note that the more effective your queue detection logic is, the less you need that sort of heuristic; it may have been overkill on my part.

-- Randell Jesup
randell-ietf@jesup.org
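
A hypothetical sketch of that probing behaviour (all constants and names are invented for illustration, not the actual implementation): the probe step shrinks after each overshoot, and an apparent channel ceiling is remembered and only occasionally probed above.

```python
import random

class BandwidthProber:
    def __init__(self, start_bps, initial_step_frac=0.5):
        self.rate_bps = start_bps
        self.step_frac = initial_step_frac   # aggressive at first, like slow-start
        self.channel_ceiling_bps = None      # remembered apparent channel bandwidth
        self.ceiling_probe_prob = 0.05       # occasionally re-check the ceiling

    def next_probe_rate(self):
        target = self.rate_bps * (1.0 + self.step_frac)
        if (self.channel_ceiling_bps is not None
                and target > self.channel_ceiling_bps
                and random.random() > self.ceiling_probe_prob):
            target = self.channel_ceiling_bps   # stay under the remembered ceiling
        return target

    def on_overshoot(self, probed_bps, safe_bps):
        """A probe at probed_bps built queues/loss: drop back and shrink future probes."""
        self.rate_bps = safe_bps
        self.step_frac = max(0.05, self.step_frac * 0.5)
        if self.channel_ceiling_bps is None or probed_bps < self.channel_ceiling_bps:
            self.channel_ceiling_bps = probed_bps

    def on_success(self, achieved_bps):
        """A probe completed without building queues: adopt the achieved rate."""
        self.rate_bps = achieved_bps
```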

Randell-

Yup, this all makes sense.

Regarding Netflix and other ABR flows...... I would add that the increasing prevalence of bursty Adaptive BitRate video (HTTP get-get-get of 2-10 second chunks of video) makes the detection of spare link capacity and/or cross traffic much more difficult. The ABR traffic pattern boils down to a square wave of a totally saturated last mile for a few seconds followed by an idle link for a few seconds. The square wave actually has the TCP sawtooth modulated on top of it, so there are secondary effects. Throw in a few instances of ABR video on a given last mile and things get very interesting.

The solution to this problem is not in scope for the RTCWeb/RTP work, but I sure wish that the ABR folks would find a way to smooth out their flows. We have been knocking around some ideas in this area in other discussions, so if anybody is interested in this please drop me a note.

bvs

On 05/04/2012 01:31 PM, Bill Ver Steeg (versteb) wrote:
Randell-
Yup, this all makes sense.
Regarding Netflix and other ABR flows......
I would add that the increasing prevalence of bursty Adaptive BitRate video (HTTP get-get-get of 2-10 second chunks of video) makes the detection of spare link capacity and/or cross traffic much more difficult. The ABR traffic pattern boils down to a square wave pattern of a totally saturated last mile for a few seconds followed by an idle link for a few seconds. The square wave actually has the TCP sawtooth modulated on top of it, so there are secondary effects. Throw in a few instances of ABR video on a given last mile and things get very interesting.
The solution to this problem is not in scope for the RTCWeb/RTP work, but I sure wish that the ABR folks would find a way to smooth out their flows. We have been knocking around some ideas in this area in other discussions, so if anybody is interested in this please drop me a note. I posted an example of what happens on google+ a while back.
https://plus.google.com/u/0/110299325941327120246/posts

It's pretty grim. Worse yet, since TCP's responsiveness is quadratic in the delay, these bursty elephant flows aren't going to want to get out of the way when there is no AQM present. And our current broadband edge typically has no classification: your realtime packets get stuck behind these sprinting flows.

Someone with more packet tweezing skills than I have might be able to use TCP timestamps that may be in the flows to figure out how much this is impulsing the delay.

- Jim

On 5/4/2012 1:40 PM, Jim Gettys wrote:
On 05/04/2012 01:31 PM, Bill Ver Steeg (versteb) wrote:
Randell-
Yup, this all makes sense.
Regarding Netflix and other ABR flows......
I would add that the increasing prevalence of bursty Adaptive BitRate video (HTTP get-get-get of 2-10 second chunks of video) makes the detection of spare link capacity and/or cross traffic much more difficult. The ABR traffic pattern boils down to a square wave pattern of a totally saturated last mile for a few seconds followed by an idle link for a few seconds. The square wave actually has the TCP sawtooth modulated on top of it, so there are secondary effects. Throw in a few instances of ABR video on a given last mile and things get very interesting.
The solution to this problem is not in scope for the RTCWeb/RTP work, but I sure wish that the ABR folks would find a way to smooth out their flows. We have been knocking around some ideas in this area in other discussions, so if anybody is interested in this please drop me a note. I posted an example of what happens on google+ a while back.
https://plus.google.com/u/0/110299325941327120246/posts
It's pretty grim.
Yes it is. I've been discussing similar issues with Mozilla developers working on DASH. An interesting overview of the Adobe (and likely Netflix is similar), Microsoft and Apple players, plus a representative DASH implementation: http://www.slideshare.net/christian.timmerer/an-evaluation-of-dynamic-adapti... Note that the Apple case seems different from yours - probably desktop and not iPad; but it also shows deep buffers and energy awareness. Some other adaptation/congestion-control work for DASH is at http://www.cs.tut.fi/~moncef/publications/rate-adaptation-IC-2011.pdf. I've only skimmed it...
Worse yet, since TCP's responsiveness is quadratic in the delay, these bursty elephant flows aren't going to want to get out of the way when there is no AQM present. And our current broadband edge typically has no classification: your realtime packets get stuck behind these sprinting flows.
Someone with more packet tweezing skills than I have might be able to use TCP timestamps that may be in the flows to figure out how much this is impulsing the delay.
I'm sure it's possible. Or run a VoIP call with G.711/etc. to your cellphone at the same time, and count the 'clap' or number-count delay, or read the delay out of Wireshark (especially easy with RTCP packets).

-- Randell Jesup
randell-ietf@jesup.org
participants (5)

- Bill Ver Steeg (versteb)
- Harald Alvestrand
- Jim Gettys
- Randell Jesup
- Varun Singh