
On 10/8/2011 11:29 PM, Justin Uberti wrote:
On Sat, Oct 8, 2011 at 10:39 PM, Randell Jesup <randell-ietf@jesup.org <mailto:randell-ietf@jesup.org>> wrote:
Well, I'm probably being overly worried about processing delays (and in particular differing delays for audio and video). Say audio is sampled at time X and (ignoring other processing steps) takes 1 ms to encode; it gets to the wire at X + <other steps> + 1. Let's say video is also sampled at X and (ignoring other processing steps) takes 10 ms to encode; it gets to the wire at X + <other steps> + 10. So we've added a 9 ms offset to all our A/V sync, and in this case it's in the "wrong" direction (people are more sensitive to early audio than early video). And if the "other steps" on each side don't balance (and they may not), it could be worse.
I also worry that in a browser, with no access to true RT_PRI (real-time priority) processing, the delays could be significantly variable (we get preempted by some other process/thread for 10 or 20 ms, etc.). Also, if the receiver isn't careful, it could be tricked into skipping frames it should be displaying, due to jitter in the packet-to-packet timestamps.
So perhaps I'm not being overly worried. I realize that I'm trading off accuracy in bandwidth estimation (or if you prefer, reaction speed) for ease in getting a consistent framerate and the best possible A/V sync. In a perfect world we'd record both the sampling time and the delta until the packet was submitted to sendto(), so we'd have both. (You could use a header extension to do that.)
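To make the trade-off concrete, here is a minimal sketch (my own illustration, not any particular implementation) of the skew that time-on-wire stamping introduces when audio and video share a capture instant but have the 1 ms and 10 ms encode delays from the example above:

```python
# Hypothetical numbers from the example above: audio and video both
# sampled at instant X, with different encode delays before hitting
# the wire. All values in milliseconds.
SAMPLE_TIME_MS = 0        # X, the common capture instant
AUDIO_ENCODE_MS = 1       # audio encode delay
VIDEO_ENCODE_MS = 10      # video encode delay

def time_on_wire(sample_time_ms, encode_delay_ms, other_steps_ms=0):
    """Timestamp a packet would carry if stamped when it reaches the wire."""
    return sample_time_ms + other_steps_ms + encode_delay_ms

audio_ts = time_on_wire(SAMPLE_TIME_MS, AUDIO_ENCODE_MS)
video_ts = time_on_wire(SAMPLE_TIME_MS, VIDEO_ENCODE_MS)

# With time-on-wire timestamps, the receiver sees a constant A/V offset
# even though both media were sampled at exactly the same instant.
skew_ms = video_ts - audio_ts
print(skew_ms)  # 9
```

Stamping with the sampling time instead keeps skew_ms at zero by construction; the cost is that the estimator no longer sees the sender-side queuing delay.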
There's a lot more going on here. The algorithmic delays for audio and video will often be different, and the capture delays perhaps wildly so. In addition, you won't want to just dump the video directly onto the wire - typically it will be leaked out over some interval to avoid bandwidth spikes, and the audio will have to maintain some jitter buffer to prevent underrun - so I think the encoding-delay deltas will be small compared to the other delays in the pipeline.
Sure - though you have the sampling time of the audio and video, and if you do your job right on the playback side, they'll be rock-solid synced (and that can be done even if there's static drift between the audio and video timestamp clocks). So long as you don't use time-on-wire timestamps...
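As a sketch of that playback-side argument (my own illustration, with assumed clock rates and offsets, not any particular stack's API): if each stream's sampling timestamps are mapped to local playout time through a per-stream offset, then any static drift between the audio and video timestamp clocks lands in that offset and cancels out, so frames sampled together play together:

```python
AUDIO_CLOCK_HZ = 48000   # common audio RTP clock rate (assumption)
VIDEO_CLOCK_HZ = 90000   # standard video RTP clock rate

def to_ms(rtp_ts, clock_hz):
    """Convert an RTP sampling timestamp to milliseconds."""
    return rtp_ts * 1000.0 / clock_hz

def playout_time_ms(rtp_ts, clock_hz, stream_offset_ms, target_delay_ms):
    """Map a sampling timestamp to a local playout time.

    stream_offset_ms maps the sender's timestamp clock onto our local
    clock; a static offset between the two senders' clocks is absorbed
    here, so it doesn't disturb relative A/V sync.
    """
    return to_ms(rtp_ts, clock_hz) + stream_offset_ms + target_delay_ms

# One audio frame and one video frame sampled at the same sender instant
# (1 second into each stream's timebase), played through the same offset
# and target delay:
audio_play = playout_time_ms(48000, AUDIO_CLOCK_HZ,
                             stream_offset_ms=100.0, target_delay_ms=60.0)
video_play = playout_time_ms(90000, VIDEO_CLOCK_HZ,
                             stream_offset_ms=100.0, target_delay_ms=60.0)
assert audio_play == video_play  # sampled together, played together
```

The key point is that none of this depends on when the packets hit the wire - only on the sampling timestamps.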
I think this also does illustrate why having "time-on-wire" timestamping is really useful for increasing estimation accuracy :-)
BTW, I was serious when I said you could improve on this with an RTP header extension carrying the "time-on-the-wire" delta from the sampling time. However, I don't think we need it here; since it would be entirely optional and could be safely ignored, it could be added later. -- Randell Jesup randell-ietf@jesup.org
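For what such an extension element might look like on the wire, here is a hedged sketch using the RFC 5285 one-byte-header layout (4-bit ID, 4-bit length-minus-one, then data). The extension ID and the 16-bit millisecond delta format are my assumptions for illustration - this is not a registered extension:

```python
import struct

def encode_delta_ext(ext_id, delta_ms):
    """Build a one-byte-header extension element carrying a 16-bit
    (time-on-wire minus sampling-time) delta in milliseconds."""
    assert 1 <= ext_id <= 14               # valid one-byte-header IDs
    payload = struct.pack("!H", delta_ms)  # hypothetical 16-bit format
    header = bytes([(ext_id << 4) | (len(payload) - 1)])
    return header + payload

def decode_delta_ext(data):
    """Parse the element back into (ext_id, delta_ms)."""
    ext_id = data[0] >> 4
    length = (data[0] & 0x0F) + 1
    (delta_ms,) = struct.unpack("!H", data[1:1 + length])
    return ext_id, delta_ms

elem = encode_delta_ext(ext_id=5, delta_ms=9)
assert decode_delta_ext(elem) == (5, 9)
```

A receiver that doesn't negotiate or understand the extension simply skips the element, which is what makes "add it later" safe.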