Multimodal interaction

From Wikipedia, the free encyclopedia

This article or section relies largely or entirely upon a single source.
Please help improve this article by introducing appropriate citations of additional sources.

Multimodal interaction provides the user with multiple modes of interfacing with a system beyond the traditional keyboard and mouse input/output. The most common such interface combines a visual modality (e.g. a display, keyboard, and mouse) with a voice modality (speech recognition for input, speech synthesis and recorded audio for output). However other modalities, such as pen-based input or haptic input/output may be used. Multimodal user interfaces are a research area in human-computer interaction (HCI).

The advantage of multiple modalities is increased usability: the weaknesses of one modality are offset by the strengths of another. On a mobile device with a small visual interface and keypad, a word may be quite difficult to type but very easy to say (e.g. Poughkeepsie). Consider how you would access and search through digital media catalogs from these same devices or set top boxes. And in one real-world example, patient information in an operating room environment is accessed verbally by members of the surgical team to maintain an antiseptic environment, and presented in near realtime aurally and visually to maximize comprehension.

Multimodal user interfaces have implications for accessibility.^[1] A well-designed multimodal application can be used by people with a wide variety of impairments. Visually impaired users rely on the voice modality with some keypad input. Hearing-impaired users rely on the visual modality with some speech input. Other users will be "situationally impaired" (e.g. wearing gloves in a very noisy environment, driving, or needing to enter a credit card number in a public place) and will simply use the appropriate modalities as desired. On the other hand, a multimodal application that requires users to be able to operate all modalities is very poorly designed.

The most common form of multimodality in the market makes use of the XHTML+Voice (aka X+V) Web markup language, an open specification developed by IBM, Motorola, and Opera Software. X+V is currently under consideration by the W3C and combines several W3C Recommendations including XHTML for visual markup, VoiceXML for voice markup, and XML Events, a standard for integrating XML languages. Multimodal web browsers supporting X+V include IBM WebSphere Everyplace Multimodal Environment, Opera for Embedded Linux and Windows, and ACCESS Systems NetFront for Windows Mobile. To develop multimodal applications, software developers may use a software development kit, such as IBM WebSphere Multimodal Toolkit, based on the open source Eclipse framework, which includes an X+V debugger, editor, and simulator.

[edit] References

^ Vitense, H.S., Jacko, J.A and Emery, V.K. (2002) Multimodal feedback: establishing a performance baseline for improved access by individuals with visual impairments. In: Fifth Annual ACM Conference on Assistive Technologies, pp. 49-56