Multimodal browsers allow users to interact via a combination of modalities, for instance, speech recognition and synthesis, displays, keypads and pointing devices.
http://www.w3.org/TR/multimodal-reqs
Current Devices
Desktop systems have proven to be highly effective for accessing the World Wide Web. Their high-resolution displays, pointing devices and full-size keyboards make it easy to interact efficiently with large amounts of information. When you are on the move, however, you need a small, lightweight device that fits easily into your pocket or purse. Cell phones are extremely popular, but their small size limits the amount of information they can display, as well as the number and kinds of keys they can feature.
Mobile profiles have emerged for a number of W3C specifications (XHTML, CSS, SMIL and SVG), and mobile access to the Web is now becoming a reality. Small keypads, however, make it difficult to enter search strings or Web addresses, especially for ideographic languages with many thousands of characters. Recent years have also seen tremendous growth of interest in using speech as a means to interact with Web-based services over the telephone. W3C responded by establishing the Voice Browser Activity, which is developing requirements and specifications for the W3C Speech Interface Framework.
Spoken interfaces based upon VoiceXML prompt users with pre-recorded or synthetic speech and understand simple words or phrases. As the technology improves, we can look forward to richer natural language conversations. There is now an emerging interest in combining speech with other modes of interaction. Multimodal interaction will enable the user to speak, write and type, as well as hear and see, using a more natural user interface than today's single-mode browsers.
Multimodal Access
The different modalities may be supported on a single device or on separate devices working in tandem. For example, you could be talking into your cell phone and seeing the results on a PDA. Voice may also be offered as an adjunct to browsers with high-resolution graphical displays, providing an accessible alternative to using the keyboard or screen. This can be especially important in automobiles or other situations where hands-free and eyes-free operation is essential. As mobile devices become ever smaller, voice interaction can escape the physical limitations of keypads and displays. It is much easier to say a few words than to thumb them in on a keypad where multiple key presses may be needed for each character. Complementing speech, ink entered with a stylus or imaging device can be used for handwriting, gestures, drawings, and specific notations for mathematics, music, chemistry and other fields. Ink is expected to be popular for instant messaging.
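The cost of multi-tap text entry can be illustrated with a short sketch. It assumes the common 12-key letter layout (2 = abc through 9 = wxyz) and, as a further assumption, that a single press of 0 yields a space. The function name `multitap_presses` is made up for this example, and the pause needed between two letters on the same key is ignored.

```python
# Letters on each key of a common 12-key phone keypad (assumed layout).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
    "0": " ",  # assumption: one press of 0 yields a space
}

# Map each character to the number of presses needed to reach it.
PRESSES = {
    char: position
    for letters in KEYPAD.values()
    for position, char in enumerate(letters, start=1)
}


def multitap_presses(text: str) -> int:
    """Return the key presses needed to enter text with multi-tap.

    Only letters and spaces are supported; any other character
    raises KeyError.
    """
    return sum(PRESSES[char] for char in text.lower())
```

Under these assumptions, `multitap_presses("call home")` comes to 19 presses for nine characters, roughly two presses per character for a phrase that takes about a second to say.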
Mobile devices working in isolation generally lack the power to recognize more than a few hundred spoken commands. Storage limitations restrict the use of pre-recorded speech prompts. Small speech synthesizers are possible, but tend to produce robotic-sounding speech that many users find tiring to listen to. A solution is to perform speech recognition and synthesis remotely on more powerful platforms. The same holds for complex voice dialogs with rich natural language understanding: simple dialogs could be handled locally, but richer interaction will require coupling the device with a remote dialog engine.
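The split between local and remote processing described above might be sketched as a simple routing decision. Everything named here is hypothetical: `RecognitionTask`, `choose_engine` and the limit of 300 commands, which merely stands in for "a few hundred" and is not a measured figure.

```python
from dataclasses import dataclass


@dataclass
class RecognitionTask:
    vocabulary_size: int          # number of distinct commands to recognize
    needs_natural_language: bool  # True for rich natural language dialogs


# Assumption: a stand-in for "a few hundred" commands, not a measured limit.
LOCAL_VOCABULARY_LIMIT = 300


def choose_engine(task: RecognitionTask, network_available: bool) -> str:
    """Return "local", "remote" or "unavailable" for a task."""
    fits_locally = (
        task.vocabulary_size <= LOCAL_VOCABULARY_LIMIT
        and not task.needs_natural_language
    )
    if fits_locally:
        return "local"
    if network_available:
        return "remote"
    return "unavailable"
```

A small command menu stays on the device, while a large vocabulary or a natural language dialog is sent to the remote engine when a connection exists. A real application would probably fall back to a simpler local dialog rather than report "unavailable".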
Multimodal applications should be able to adapt to changing device capabilities, user preferences and environmental conditions. For instance, users should be able to disable speech input and output when it would distract people nearby. It should be easy for developers to build applications that adapt dynamically to such changes, making the best use of the available modes of interaction at any given time. In addition, developers should be able to create applications involving multiple devices and multiple users, augmenting human-to-computer and human-to-human interaction.
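One way to picture this kind of adaptation is as a set computation that is repeated whenever the device, the user's preferences or the environment changes. The mode names and the function `available_modes` below are illustrative assumptions, not part of any W3C specification.

```python
def available_modes(device_modes, user_disabled, environment_blocked):
    """Return the interaction modes usable right now.

    device_modes:        modes the device (or devices) can support
    user_disabled:       modes the user has switched off
    environment_blocked: modes ruled out by current conditions
    """
    return set(device_modes) - set(user_disabled) - set(environment_blocked)


# Hypothetical example: a phone whose user has muted speech in a quiet place.
phone = {"speech_in", "speech_out", "display", "keypad"}
quiet_mode = {"speech_in", "speech_out"}
usable = available_modes(phone, quiet_mode, set())
```

Here `usable` contains only `display` and `keypad`, so the application would present its prompts visually and accept keypad input until speech is enabled again.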
http://www.w3.org/2002/mmi/#status