Author: Colin Beckingham
Though the tools for voice control and dictation in the open source world lag far behind those in the commercial arena, I decided to see how far I could get in querying a database by voice and having the computer respond verbally. Using a number of open source tools, I’m happy to report success.
I needed four basic components to perform the query:
- The grammar and vocabulary — a list of words that are relevant in this context, and how they are used together
- Acoustic model — a statistical representation of sounds that the engine can handle
- Voice recognition engine — does the work of deciphering what has been said
- Dialog manager — translates the interpreted instructions into commands for the system and generates the response
I used a number of popular and obscure open source tools:
- Integrated Taxonomic Information System and MySQL. The ITIS database (a scientific database of current Latin taxonomy, which I will be querying) is freely downloadable, but it comes with an SQL schema for Informix rather than MySQL. I was able to massage the schema into MySQL dialect using guidance I found on one or two sites.
- PHP and Perl for scripting
- HMM Toolkit (HTK) from Cambridge University as the mathematical engine to generate the acoustic model. (There has been some discussion as to whether this software is truly open source. You can judge yourself from the terms of the license.)
- Julius as the voice recognition engine
- Audacity to record the voice samples
- A lexicon as a guide for phoneme structure, such as British English Example Pronunciations
- Voxforge utilities to build the acoustic model using Perl and HTK
- Festival from Edinburgh University to read back the results under the control of the dialog manager
The following discussion owes much to the Voxforge site, which I recommend as a guide to the process of building an acoustic model. It took me several hours to work through the tutorial the first time, but with experience and the ability to foresee and avoid errors, the time it takes to work through the whole process, including voice recording, falls rapidly.
The grammar
We’re not building a dictation application here that listens and tries to interpret sounds into words selected from a large random vocabulary. We’re dealing with a small set of specific commands we know in advance, which can be laid out and planned for.
Here are some anticipated commands:
- COMPUTER SLEEP — tell the computer to stop responding to commands while we do other things.
- COMPUTER WAKE — back to work
- COMPUTER RESET — restore a known configuration
- COMPUTER QUIT — end the program
In a quasi-meta format this could appear as:
COMPUTER ( SLEEP | WAKE | RESET | QUIT )
which simply says that the word “computer” could be followed by any one, but only one, of the words in parentheses. Expanding the grammar a bit more, we could add:
CONNECT ( LOCALHOST | SERVER1 | SERVER2 ) USE ( ITIS | OTHERDB )
… and so on. This gives you a flavor of the grammar.
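To make this concrete, the same grammar can be written out in the notation HTK expects. The following is a minimal sketch in the standard HTK gram format; the variable names $action, $server, and $db are my own, and SENT-START/SENT-END are the silence markers used in the Voxforge tutorial. HParse later compiles a file like this into the word network the downstream tools consume.

$action = SLEEP | WAKE | RESET | QUIT;
$server = LOCALHOST | SERVER1 | SERVER2;
$db     = ITIS | OTHERDB;

( SENT-START ( COMPUTER $action | CONNECT $server USE $db ) SENT-END )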
Acoustic model
Once we have the grammar fully defined we can use the Voxforge tools, which in turn use Perl, HTK, and other commands, to set up the required files for generation of the acoustic model. Part of this process involves recording my voice saying the elements of the grammar. Applying the Voxforge tools to my recordings paints the statistical image or acoustic model of the grammar for the benefit of the speech recognition engine (SRE).
The Voxforge site offers two examples of how to prepare the acoustic model: a tutorial and a howto. The tutorial is a detailed explanation of the process and is a nice introduction to the field, particularly for beginners. The howto is a more automated version of the tutorial, which speeds up the whole process. I used the tutorial to begin with, then graduated to the howto to make fast modifications to my grammar and “print” them acoustically.
I used Audacity to record my prompts, but other utilities would do as well. I found it most efficient to record all of the prompts in one long .wav file, then use Audacity's label feature to mark each prompt and split the file with the "Export Multiple" command from the File menu.
Along the way we make the acquaintance of the required lexicon, which is a list of words together with the printed version of each word and its phoneme representation. Here is an example:
COMPUTER [COMPUTER] k ax m p y uw t ax r
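The other command words get entries in the same format. For illustration only, here are plausible entries of my own devising rather than lines copied from the lexicon (the phonemes for WAKE match the Julius output shown later):

WAKE [WAKE] w ey k
SLEEP [SLEEP] s l iy p
RESET [RESET] r iy s eh t
QUIT [QUIT] k w ih t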
At the end of this process the acoustic model is limited to the speaker’s voice only; that is, it is speaker-dependent. This could be an advantage in terms of primitive security, or a disadvantage in terms of immediate use by others.
Speaker-independent models are a whole different kettle of fish. To find out more about this issue, see the Voxforge site, which was set up with the objective of resolving this issue in the open source domain.
Voice/speech recognition engine
We can now test the effectiveness of the acoustic model with the engine:
julius -input mic -C julian.jconf

Here, Julius, the voice recognition engine, gets input from the microphone and interprets the sounds according to the acoustic model instructions in the configuration file julian.jconf. The output will appear, together with diagnostics, in the terminal window. Here is an example as I say "COMPUTER WAKE."
### read waveform input
pass1_best: <s> COMPUTER WAKE
pass1_best_wordseq: 0 9 5
pass1_best_phonemeseq: sil | k ax m p y uw t ax r | w ey k
pass1_best_score: -12243.838867
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 26 generated, 26 pushed, 5 nodes popped in 444
sentence1: <s> COMPUTER WAKE </s>
wseq1: 0 9 5 1
phseq1: sil | k ax m p y uw t ax r | w ey k | sil
cmscore1: 1.000 1.000 1.000 1.000
score1: -12466.839844

In this case the SRE has hit the right command the first time. Note that in both passes (the engine always does two passes) the SRE detected the correct sentence. The output includes score information which could be used to filter out words not recognized or otherwise doubtful. Absolute score values may differ in another model or situation.
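For reference, julian.jconf is where Julius is told which grammar and acoustic model to load. A rough sketch of the entries involved, assuming the file names produced by the Voxforge tutorial (yours may differ):

# grammar compiled by mkdfa.pl from the .grammar and .voca files
-dfa sample.dfa
-v sample.dict
# acoustic model built with HTK
-h hmmdefs
-hlist tiedlist
# take audio from the microphone
-input mic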
The final command line will look something like this when typed into the Linux box:
julius -input mic -C julian.jconf | mydialogmanager.php

When we pipe the output to the dialog manager for interpretation and action, all of the information is passed up in the pipe, but only some of it is acted upon.
Dialog manager
The kernel of information we need here is in the line that begins with "sentence1:". We can throw away all other sentences unless we are in debugging mode or need to test scores. It is the job of the dialog manager to find that sentence, strip off the junk at the beginning and end, and leave us with "COMPUTER WAKE," which is a string that it can handle. Here is the introduction to a simple PHP dialog manager:
#!/usr/local/bin/php -q
<?php
$awake  = false;
$myloop = true;
// read from STDIN, where the piped Julius output arrives
$in = defined('STDIN') ? STDIN : fopen('php://stdin', 'r');
while ($myloop) { // infinite loop
    $line = fgets($in, 1024);
    if (substr($line, 0, 8) == 'sentence') {
        // strip the leading "sentence1: <s> " and the trailing " </s>",
        // then trim any leftover whitespace or newline
        $sent = trim(substr($line, 15, -5));
        $ok = more_stuff($sent);
    }
}
?>

The code sets up a couple of status variables, makes sure that the input channel is defined to receive the output of Julius, and then goes into an infinite loop, constantly checking for more to do. The program waits until it sees an incoming string, then extracts the sentence.
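The loop could also make use of the score information mentioned earlier. Since the cmscore1 line arrives after sentence1, acting on the command would have to be deferred until the scores have been checked. A minimal sketch, with an arbitrary 0.8 threshold of my own choosing:

// inside the while loop, after $sent has been captured from the sentence1 line
if (substr($line, 0, 8) == 'cmscore1') {
    // e.g. "cmscore1: 1.000 1.000 1.000 1.000" -- one confidence value per word
    $scores = array_map('floatval', preg_split('/\s+/', trim(substr($line, 9))));
    if (min($scores) >= 0.8) {       // arbitrary threshold
        $ok = more_stuff($sent);     // act only on confident recognitions
    }
    // otherwise silently ignore the doubtful sentence
}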
What can the function more_stuff() do? One possibility is a switch statement:

// note: if this runs inside more_stuff(), $awake and $myloop must be
// declared there with "global $awake, $myloop;" so the outer loop sees them
switch ($sent) {
    case 'COMPUTER WAKE':
        $awake = true;
        break;
    case 'COMPUTER SLEEP':
        $awake = false;
        break;
    case 'COMPUTER RESET':
        // statements depending on other parts of the DM
        break;
    case 'COMPUTER QUIT':
        $myloop = false;
        break;
    default:
        if ($awake) {
            // go on and do interesting things with other statements
        } else {
            // do nothing
        }
        break;
}

During the processing of commands we want to get Festival to read out results. We can accomplish this by sending a string to the following function, which I show as printing to the monitor for debugging as well as enunciating the result:
function saytext($phrase) {
    // print to the terminal for debugging
    echo $phrase."\n";
    // hand the phrase to Festival in batch mode for text-to-speech
    exec('festival -b \'(SayText "'.$phrase.'")\'');
}
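To give a flavor of the "interesting things" the awake branch might do, here is a rough sketch of a database lookup followed by a spoken response. The connection details are placeholders, and the table and column names (taxonomic_units, complete_name) reflect my reading of the ITIS schema, so adjust to suit:

function lookup_taxon($mysqli, $name) {
    // count the taxonomic units whose Latin name starts with the spoken word
    $stmt = $mysqli->prepare(
        'SELECT COUNT(*) FROM taxonomic_units WHERE complete_name LIKE ?');
    $pattern = $name.'%';
    $stmt->bind_param('s', $pattern);
    $stmt->execute();
    $stmt->bind_result($count);
    $stmt->fetch();
    $stmt->close();
    saytext("I found $count records matching $name");
}

// called from the default branch of the switch once the system is awake, e.g.:
// $mysqli = new mysqli('localhost', 'user', 'password', 'ITIS');
// lookup_taxon($mysqli, 'FELIS');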
Conclusion
Julius interprets my voice commands correctly pretty much 100% of the time, except for the first utterance of a session, where the system has no previous utterance to use in its analysis and will sometimes make an error. It is easy to program around this by discarding the first instruction.
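One simple way to do that is a flag in the dialog manager loop that swallows the first recognized sentence; a sketch of the change to the earlier loop:

$first = true; // set once, before the while loop
// ... inside the loop, where the sentence is extracted:
if (substr($line, 0, 8) == 'sentence') {
    $sent = trim(substr($line, 15, -5));
    if ($first) {
        $first = false;          // discard the unreliable first utterance
    } else {
        $ok = more_stuff($sent);
    }
}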
Those more knowledgeable in speech recognition processes will have noted that I have made no mention of building in a "silence" part of the model to help filter out spurious and irrelevant commands. More information on this subject is available at the Voxforge site.
The grammar format we defined also permits an element to be repeated a variable number of times, in the sense that you can declare a grammar sentence like:
DIAL < ONE | TWO | THREE | FOUR >

The angle brackets indicate to the HTK toolkit that combinations such as DIAL TWO THREE, DIAL ONE, and DIAL FOUR ONE TWO TWO are all valid, but in this format my recognition accuracy fell off rapidly.
While the grammar format is necessarily a long way from natural language processing, a simple dialog manager is able to provide me with a readback from the database via Festival with few problems and with surprising accuracy.