XHTML+Voice By Example
July 27th 2011: Please note that Voice only works in Opera on Windows 2000/XP, and we no longer officially support it.
This article assumes that you already have Voice installed and working on your computer. For more on what XHTML+Voice (X+V) is all about, read our Getting to Know X+V article.
Hello World!
You can make an X+V browser say "Hello World" with the 'block' element, like this:
<block>Hello World!</block>
The full web page will look like this:
<!DOCTYPE html PUBLIC "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN" "http://www.voicexml.org/specs/multimodal/x+v/12/dtd/xhtml+voice12.dtd"> [1] <html xmlns="http://www.w3.org/1999/xhtml" xmlns:ev="http://www.w3.org/2001/xml-events"> [2] <head> <title>Example 1: "Hello, World"</title> <form xmlns="http://www.w3.org/2001/vxml" id="sayHello"> [3] <block>Hello World!</block> [4] </form> </head> <body ev:event="load" ev:handler="#sayHello"> [5] <h1>"Hello World!" example</h1> <p>If your browser is voice enabled, you will hear it say "Hello, world!".</p> </body> </html>
- [1] The DOCTYPE describes the type of document this is. It isn't necessary for voice processing, but is necessary for the document to be valid. Also see DOCTYPE sniffing.
- [2] XHTML is the default namespace, and XML Events has prefix "ev".
- [3] This is the voice form, the element that contains voice code. It sets the default namespace to VoiceXML, so everything contained in the voice form is VoiceXML by default. It has an id ("sayHello") so that other elements can refer to it.
- [4] This is the code that actually says "Hello World".
- [5] This is what triggers the voice form. When the body element receives the "load" event (i.e. when the page has loaded), the handler with id sayHello is activated.
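The trigger does not have to be the load event. As a minimal sketch, assuming your voice browser dispatches XML Events for mouse clicks on ordinary elements, the body below starts the same sayHello form when the paragraph is clicked. The paragraph text is made up for illustration, and the "ev" prefix must still be declared on the html element as in the full page above.

```xml
<body>
  <h1>"Hello World!" example</h1>
  <!-- Assumption: a click on this paragraph activates the voice form
       with id "sayHello" that is defined in the head -->
  <p ev:event="click" ev:handler="#sayHello">Click here to hear the greeting.</p>
</body>
```

The voice form itself stays untouched; only the element carrying the ev:event and ev:handler attributes changes.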
Make it listen
So now you can have the browser talk to you. It is far more exciting when you can talk back to it, but this is also more challenging. Making it listen isn't too hard; making it understand is more difficult. The application will not behave more intelligently than you code it to. You need to specify what to listen for in the user's chatter, and give good hints about what is expected. The examples in this article are very literal-minded and do not give much leeway for creative user responses, but it isn't too hard to make them more flexible.
Using your options
When you want to restrict the choices in HTML, you can use a select element. To give the user the choice between "one" and "two", and nothing else, you can code:
<select name="example"> <option>One</option> <option>Two</option> </select>
You can use the mouse or the Tab key to activate the select box and the arrow keys to choose one of the options.
In a voice form the field element fulfils a similar role. To present the same choice in voice, you code:
<field name="example"> <option>One</option> <option>Two</option> </field>
The following example will ask the user the name of the best browser (Opera of course), handle mismatches, and give an inspired lecture at the end.
<field name="browser"> <prompt>What is the name of the best browser?</prompt> [1] <option>Opera</option> [2] <nomatch>Try again.</nomatch> [3] <filled>Yes, that's a fact. Opera is the best browser, full of wonderful features.</filled> [4] </field>
- [1] The spoken texts are called prompts. They are similar to the p elements in XHTML.
- [2] The options give the choices the user has. In this case the only value for the best browser is "Opera".
- [3] If the user says anything that doesn't match "Opera", they will be asked to try again (and again and again). If no nomatch is set, the standard phrase (normally "Sorry. I did not understand") will be used instead.
- [4] The filled element will be executed when a match (i.e. "Opera") has occurred.
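A single "Try again." is not much of a hint. The sketch below extends the same field with two more pieces from VoiceXML 2.0: a noinput handler for silence, and a second nomatch handler that uses the count attribute to take over from the second failed attempt onwards. Whether a given X+V browser honours all of these is an assumption here, and the wording of the messages is made up for illustration.

```xml
<field name="browser">
  <prompt>What is the name of the best browser?</prompt>
  <option>Opera</option>
  <!-- The user said nothing: say so, then ask for the prompt to be played again -->
  <noinput>I did not hear anything. <reprompt/></noinput>
  <!-- First mismatch: short retry message -->
  <nomatch>Try again.</nomatch>
  <!-- Second and later mismatches: give a stronger hint -->
  <nomatch count="2">Hint: the name starts with the letter O.</nomatch>
  <filled>Yes, that's a fact. Opera is the best browser.</filled>
</field>
```

The handler with the highest count that does not exceed the number of mismatches so far is the one that runs, so the hint replaces "Try again." once the user has failed twice.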
Going for grammar
The collection of accepted responses is called a grammar (the collection of options above is also a grammar). At its simplest, it is only a collection of alternatives, e.g. <fruit> = apple | orange | slime mold. The voice recognition version of Hello World, coffee-tea-milk, will politely ask you if you want coffee, tea, or milk, and (in this version) refuse to give it to you afterwards.
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:ev="http://www.w3.org/2001/xml-events"> <head> <title>Example 3: Drink dispenser</title> <form xmlns="http://www.w3.org/2001/vxml" id="drinkform"> [1] <field name="drink"> <prompt>Would you like coffee, tea, or milk?</prompt> <grammar><![CDATA[ [2] #JSGF V1.0; grammar drinks; public <drinks> = coffee | tea | milk; [3] ]]> [2] </grammar> <filled> [4] <block>Sorry, I'm out of <value expr="drink"/>.</block> [5] </filled> </field> </form> </head> <body ev:event="load" ev:handler="#drinkform"> [1] <h1>Example 3: Drink dispenser</h1> <p>Our drink dispenser can offer you a wide choice of refreshing drinks.</p> </body> </html>
- [1] This voice form will be triggered when the document has loaded.
- [2] Grammars can contain characters like "<" that XML normally treats as markup (in this case, the start of a tag). Content between <![CDATA[ and ]]> is taken literally and not processed as markup. If you write <![CDATA[<b>bold</b>]]> in the source code, it is displayed as the plain text "<b>bold</b>", not as bold text.
- [3] This is the grammar: you can choose between coffee, tea, and milk. The "|" vertical bar means "or".
- [4] The 'filled' element is a conditional element; it is entered when the field has been given a value.
- [5] This is where it refuses to serve you anything.
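CDATA is a convenience, not a requirement. As a sketch of the alternative, the same drinks grammar can be written with the angle brackets escaped as entities. The XML parser turns the entities back into "<" and ">" before the text reaches the grammar processor, so the grammar itself should be unchanged.

```xml
<!-- Same grammar as above, without a CDATA section:
     "<" and ">" are written as the entities &lt; and &gt; -->
<grammar>
  #JSGF V1.0;
  grammar drinks;
  public &lt;drinks&gt; = coffee | tea | milk;
</grammar>
```

With more than a couple of rules the escaped form quickly becomes hard to read, which is a good reason to prefer CDATA.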
The grammar could in this case just as well be expressed using option, like this:
<option>coffee</option> <option>tea</option> <option>milk</option>
I don't understand what you are saying, but I can pretend
The advantage of grammar over option is that you can handle more natural language this way. Here is a more advanced grammar example:
<form xmlns="http://www.w3.org/2001/vxml" id="command"> <field name="commandInterpreter"> <grammar><![CDATA[ #JSGF V1.0; grammar command; public <command> = [I want to] <action> {the_action = $action} [1] <object> {the_object = $object} [with <instrument>{the_instrument=$instrument}]; <action> = watch | shut down | surprise | control | buy | hide | ignore; [2] <object> = [the|a] tv | [the|a] phone | [the|my] neighbor | Opera | my boss; <instrument> = [the] remote control | [my famous] wit | [a] stick | [a] camera | [an] [expired] credit card | a wet blanket; ]]></grammar> <prompt>Give me a command</prompt> <nomatch>I refuse to do that. </nomatch> </field> <filled> <!-- Give feedback --> [3] Why do you want to <value expr="the_action"/> the poor <value expr="the_object"/>? </filled> </form>
- [1] Phrases in [brackets], like "I want to", are optional. <action> refers to the action rule further down. The variable the_action is set to $action, which is a special variable containing whatever the <action> rule has returned.
- [2] The <action> rule lets you pick one out of a set of verbs, much like option would.
- [3] The value element refers to the values set in [1]. These values can also be used when scripting X+V.
This grammar would accept utterances like:
- watch tv
- I want to surprise my neighbor with my famous wit
- buy Opera with an expired credit card
- I want to hide the phone
- shut down tv with remote control
- ignore my boss
The grammar can easily be modified. Try using different verbs and nouns for the <action>, <object>, and <instrument> rules.
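As one possible starting point, the sketch below swaps in a different vocabulary and adds "please" as a second optional opening. The verbs, nouns, and instruments are invented for illustration. The rule names, the variable names, and the tag syntax are kept from the example above, so the rest of the form can stay as it is.

```xml
<grammar><![CDATA[
  #JSGF V1.0;
  grammar command;
  public <command> = [I want to | please] <action> {the_action = $action}
                     <object> {the_object = $object}
                     [with <instrument> {the_instrument = $instrument}];
  <action> = feed | paint | photograph | ignore;
  <object> = [the|a] cat | [the|my] fence | [the|my] neighbor;
  <instrument> = [a] spoon | [a] brush | [a] camera;
]]></grammar>
```

This version should accept utterances such as "please feed the cat with a spoon" or "I want to photograph my neighbor".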
This example may be more advanced than you would need, but it does show some of the things that grammars allow you to do. To learn more about this see the grammar links in Getting to Know X+V.
This article is licensed under a Creative Commons Attribution, Non Commercial - Share Alike 2.5 license.