That's right, a lot still remains to be worked out
regarding protocols and so on. But IMHO
<device>, the Stream APIs, <audio> and
<video> should all be part of the puzzle!
At Mozilla we've experimented with taking the input from a webcam
for video recording and mirroring it to a canvas for display. This
makes the frames easy to manipulate from script when building
applications, taking screenshots, etc.
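
To give a rough idea of what "manipulate from script" can mean once a frame is on a canvas, here is a minimal sketch. It is not the code from the Mozilla experiment. FrameLike and toGrayscale are hypothetical names, and the sketch assumes the frame is available as RGBA pixel data of the kind a canvas 2D context's getImageData() returns.

```typescript
// Hypothetical stand-in for the browser's ImageData object.
// Assumption: pixels are stored as RGBA, 8 bits per channel.
interface FrameLike {
  readonly data: Uint8ClampedArray;
  readonly width: number;
  readonly height: number;
}

// Hypothetical helper: convert one frame to grayscale in place,
// using the Rec. 601 luma weights. Returns the same frame object.
function toGrayscale(frame: FrameLike): FrameLike {
  const px = frame.data;
  for (let i = 0; i + 3 < px.length; i += 4) {
    const luma = 0.299 * px[i] + 0.587 * px[i + 1] + 0.114 * px[i + 2];
    // Uint8ClampedArray rounds and clamps the value to 0..255 on write.
    px[i] = px[i + 1] = px[i + 2] = luma;
    // px[i + 3] (alpha) is left untouched.
  }
  return frame;
}
```

In a page, one would presumably draw the current video frame with drawImage(), read it back with getImageData(), pass the result to toGrayscale(), and write it back with putImageData(). How the webcam stream gets into the page in the first place is deliberately left out here, since that is exactly the part still under discussion.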