Subject: Feedback to the DAP group on the topic of audio/media capture needed for HTML+Speech
Resent-Date: Sat, 15 Jan 2011 05:47:45 +0000
Resent-From: public-device-apis@w3.org
Date: Sat, 15 Jan 2011 04:54:56 +0000
From: Michael Bodell <mbodell@microsoft.com>
To: public-device-apis@w3.org
CC: public-xg-htmlspeech@w3.org
On today’s Hypertext Coordination Group teleconference the issue of “Audio on the Web” was discussed (see the minutes: http://www.w3.org/2011/01/14-hcg-minutes.html), and I was given the action item of contacting the DAP group to provide feedback about audio capture. We in the HTML Speech XG (http://www.w3.org/2005/Incubator/htmlspeech/) have been discussing use cases, requirements, and some proposals around speech-enabled HTML pages, and the need for audio to be captured and recognized in real time (i.e., in a streaming fashion, not in a file-upload fashion). We recognize that there are interesting security and privacy concerns with supporting this necessary functionality.
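To make the streaming-versus-file-upload distinction concrete, here is a minimal, purely illustrative sketch; none of these names come from any W3C draft or proposal, and the generator is a stand-in for a real audio source. The point is only the shape of the data flow: the recognizer sees each audio chunk as soon as it is captured, rather than waiting for a finished recording to upload.

```typescript
// Illustrative sketch only: microphoneChunks and streamToRecognizer are
// invented names, not part of any spec under discussion.

function* microphoneChunks(): Generator<Uint8Array> {
  // Stand-in for a real audio source: yields small PCM chunks as they
  // are captured, rather than one complete recording at the end.
  for (let i = 0; i < 3; i++) {
    yield new Uint8Array(160).fill(i); // e.g. 20 ms of 8 kHz mono audio
  }
}

// Streaming consumption: processing begins before capture completes,
// which is the behavior the XG's requirements ask to be allowed.
function streamToRecognizer(
  source: Generator<Uint8Array>,
  recognize: (chunk: Uint8Array) => void
): number {
  let chunks = 0;
  for (const chunk of source) {
    recognize(chunk); // partial/intermediate results could be produced here
    chunks++;
  }
  return chunks;
}

const seen: number[] = [];
streamToRecognizer(microphoneChunks(), chunk => seen.push(chunk.length));
console.log(seen); // each chunk arrived individually, not as one blob
```

A file-upload model would instead hand the application a single completed blob after capture ends, which is why it cannot satisfy the latency and midstream-event requirements below.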
The HTML Speech XG has now finished requirements gathering, and is in the process of prioritizing those requirements. Our requirements document is at http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html. A large number of our requirements (almost half) may be of particular note to the audio capture process. I’ve tried to pull out and organize the requirements most relevant to DAP audio capture:
· Requirements about where the audio is streamed:
  o FPR12. Speech services that can be specified by web apps must include network speech services. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr12]
  o FPR32. Speech services that can be specified by web apps must include local speech services. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr32]
· Requirements about the audio streams and the fact that the audio needs to be streamed:
  o FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr33]
  o FPR25. Implementations should be allowed to start processing captured audio before the capture completes. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr25]
  o FPR26. The API to do recognition should not introduce unneeded latency. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr26]
  o FPR56. Web applications must be able to request NL interpretation based only on text input (no audio sent). [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr56]
· Requirements about what must be possible while streaming (i.e., getting midstream events in a timely fashion without cutting off the stream; being able to decide to cut off the stream mid-request; being able to reuse the stream):
  o FPR40. Web applications must be able to use barge-in (interrupting audio and TTS output when the user starts speaking). [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr40]
  o FPR21. The web app should be notified that capture starts. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr21]
  o FPR22. The web app should be notified that speech is considered to have started for the purposes of recognition. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr22]
  o FPR23. The web app should be notified that speech is considered to have ended for the purposes of recognition. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr23]
  o FPR24. The web app should be notified when recognition results are available. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr24]
  o FPR57. Web applications must be able to request recognition based on previously sent audio. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr57]
  o FPR59. While capture is happening, there must be a way for the web application to abort the capture and recognition process. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr59]
· Requirements around the UI/API/usability of speech/audio capture:
  o FPR42. It should be possible for user agents to allow hands-free speech input. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr42]
  o FPR54. Web apps should be able to customize all aspects of the user interface for speech recognition, except where such customizations conflict with security and privacy requirements in this document, or where they cause other security or privacy problems. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr54]
  o FPR13. It should be easy to assign recognition results to a single input field. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr13]
  o FPR14. It should not be required to fill an input field every time there is a recognition result. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr14]
  o FPR15. It should be possible to use recognition results with multiple input fields. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr15]
· Requirements around privacy and security concerns:
  o FPR16. User consent should be informed consent. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr16]
  o FPR20. The spec should not unnecessarily restrict the UA's choice in privacy policy. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr20]
  o FPR55. Web applications must be able to encrypt communications to a remote speech service. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr55]
  o FPR1. Web applications must not capture audio without the user's consent. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr1]
  o FPR17. While capture is happening, there must be an obvious way for the user to abort the capture and recognition process. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr17]
  o FPR18. It must be possible for the user to revoke consent. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr18]
  o FPR37. Web applications should be given captured-audio access only after explicit consent from the user. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr37]
  o FPR49. End users need a clear indication whenever the microphone is listening to the user. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr49]
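To illustrate the midstream-event requirements above, here is a minimal sketch of a capture session that fires the FPR21–FPR24 notifications in order and can be cut off mid-request (FPR17/FPR59). Everything here is hypothetical: SpeechCaptureSession, the event names, and the simulated pipeline are invented for this email and are not a proposed API.

```typescript
// Hypothetical sketch only; no name below comes from any spec or proposal.
type CaptureEvent =
  | "capturestart"   // FPR21: capture has begun
  | "speechstart"    // FPR22: endpointer decided speech started
  | "speechend"      // FPR23: endpointer decided speech ended
  | "result"         // FPR24: recognition results are available
  | "abort";         // FPR17/FPR59: user or web app cut the stream off

class SpeechCaptureSession {
  private listeners = new Map<CaptureEvent, Array<(detail?: string) => void>>();
  private aborted = false;

  on(event: CaptureEvent, cb: (detail?: string) => void): void {
    const cbs = this.listeners.get(event) ?? [];
    cbs.push(cb);
    this.listeners.set(event, cbs);
  }

  private emit(event: CaptureEvent, detail?: string): void {
    for (const cb of this.listeners.get(event) ?? []) cb(detail);
  }

  // Simulates one utterance flowing through the pipeline; a real
  // implementation would be driven by the microphone and recognizer.
  start(): void {
    if (this.aborted) return;           // an aborted session stays silent
    this.emit("capturestart");
    this.emit("speechstart");
    this.emit("speechend");
    this.emit("result", "hello world"); // placeholder transcript
  }

  abort(): void {
    this.aborted = true;
    this.emit("abort");
  }
}

const order: string[] = [];
const session = new SpeechCaptureSession();
(["capturestart", "speechstart", "speechend", "result"] as CaptureEvent[])
  .forEach(e => session.on(e, () => order.push(e)));
session.start();
console.log(order.join(" -> "));
// capturestart -> speechstart -> speechend -> result
```

The design point is simply that these notifications must arrive while the stream is still open, so the web app can react (e.g. barge-in per FPR40) without waiting for capture to finish.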
We would be happy to discuss the details and context behind any of these requirements, and we’d also appreciate any feedback on our use cases and requirements. I’m sure many of these are requirements the DAP group is already considering, but the speech use cases may well surface additional requirements not yet considered as part of the capture work.
The HTML Speech XG is also in the process of collecting proposals for our Speech API, which we plan to finish by the end of February. In our discussions to date, we have reviewed and discussed some of the DAP capture API as well as some of the work around the <device> tag proposals (we reviewed and discussed at least http://www.w3.org/TR/html-media-capture/ and http://www.w3.org/TR/media-capture-api/, and Robin provided the following links to further in-progress work on the HCG call: http://dev.w3.org/2009/dap/camera/ and http://dev.w3.org/2009/dap/camera/Overview-API.html).
In general, I’d characterize our position this way: we would be extremely happy to reuse the DAP work, and would be glad to collaborate on proposals that meet this need. In our review to date, the one large issue has been streaming: the capture API is nearly useless to us if it doesn’t support streaming. Happily, from today’s HCG call it sounds like DAP is actively working on streaming. We strongly support that direction, think it is extremely important, and look forward to seeing any and all work on it.
I’m not sure what the most productive next steps for us to take would be (email discussion back and forth, some HTML Speech XG members joining a DAP audio capture conference call, some DAP members joining a Speech XG teleconference, or something else). In general, the HTML Speech XG does most of its work over the public email alias, and we also have a schedule-as-needed Thursday teleconference slot of 90 minutes starting at noon New York time.
Thanks, and we look forward to working on this important topic with you!
Michael Bodell (Microsoft)
Co-chair HTML Speech XG