Inside Palaver: Linux Speech Recognition that Taps Google’s Voice Technology


Despite efforts to advance Linux speech recognition, there is still no reliable, fully baked open source competitor to Nuance’s proprietary Dragon NaturallySpeaking. Lately, however, instead of trying to mimic Dragon’s technology, which is available to Linux users only by running it under Wine, some developers are taking their cues from simpler, yet in many ways more widely useful, mobile natural language engines. Last week, a De Anza College student named James McClain released a public beta of an open source GNU/Linux speech recognition program called Palaver that uses Google’s voice APIs on the back end.

Palaver is billed as being easy to use, good at interpreting different pronunciations, and customizable, letting developers add commands and functions via an app dictionary. Available initially for Ubuntu, Palaver is designed primarily for controlling computer functions, but it can also be used for transcription.

Fuzzy search lets several different spoken terms trigger a single task. For example, users can say “run,” “launch,” or “open” to start a program. Palaver can also respond to a few basic open-ended questions, such as speaking “NBA scores” to bring up results, but much more is planned along these lines.
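The idea of several spoken verbs mapping to one task can be sketched in a few lines of Python. This is only an illustration, not Palaver’s actual code: the `VERB_SYNONYMS` table, the `match_command` function, and the 0.7 cutoff are all assumptions made for the example, and the fuzzy step relies on the standard library’s `difflib` rather than whatever matching Palaver uses.

```python
import difflib

# Hypothetical synonym table: several spoken verbs map to one action name.
VERB_SYNONYMS = {
    "run": "launch",
    "launch": "launch",
    "open": "launch",
    "start": "launch",
}


def match_command(utterance, cutoff=0.7):
    """Return (action, argument) for a spoken phrase, or None if no verb matches."""
    words = utterance.lower().split()
    if not words:
        return None
    # get_close_matches tolerates small recognition errors such as "lanch".
    candidates = difflib.get_close_matches(
        words[0], list(VERB_SYNONYMS), n=1, cutoff=cutoff
    )
    if not candidates:
        return None
    return VERB_SYNONYMS[candidates[0]], " ".join(words[1:])
```

With this sketch, “open firefox,” “run firefox,” and a slightly misheard “lanch firefox” all resolve to the same action, while a phrase whose first word resembles none of the known verbs is rejected.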

“Most people using speech recognition today are using Siri or Google Now on their phone,” said McClain in an email interview with Linux.com. “In the past, they were almost certainly using Dragon, and Linux developers tried to imitate that. Palaver is much more similar to Siri or Android Voice Actions, which is what most people are looking for.”

The GPLv3-licensed Palaver supports swapping out Google’s technology for other back-end engines, said McClain. “Voice Actions is very accurate and fast, but many people understandably don’t want to give their information to Google,” he says. “Luckily, Palaver could hook up to something else. The code that calls Voice Actions is very simple and separated, so if someone wants to use an engine like PocketSphinx, nothing else has to change.”
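One common way to keep the recognition engine “simple and separated” is to hide it behind a small interface. The Python sketch below shows that pattern under stated assumptions: `SpeechBackend`, the two fake backends, and `recognize_command` are hypothetical names invented for this example, and neither fake backend talks to Google or PocketSphinx. Each one simply decodes bytes as text so the example stays runnable.

```python
from typing import Protocol


class SpeechBackend(Protocol):
    """Anything that can turn recorded audio into text (hypothetical interface)."""

    def transcribe(self, audio: bytes) -> str:
        ...


class FakeCloudBackend:
    """Stand-in for a hosted engine; it just decodes the bytes as UTF-8 text."""

    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class FakeLocalBackend:
    """Stand-in for an offline engine; same contract, different internals."""

    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8", errors="ignore").strip()


def recognize_command(backend: SpeechBackend, audio: bytes) -> str:
    """Normalize whatever the chosen backend returns; callers never see the engine."""
    return backend.transcribe(audio).strip().lower()
```

Because `recognize_command` depends only on the `transcribe` method, swapping one backend for the other requires no change to the code that interprets commands.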

McClain’s biggest challenge in developing Palaver lay in starting and stopping voice recording. “This actually kept me from writing the application for a long time,” he said. “Then someone said ‘Just make them press a hotkey to start and end speech; you can work on automatically stopping later.’ And so I did.”

More Linux Distro Support Coming

Despite some complaints about the hotkey requirement, the overall response to the private beta released late last year was quite positive. The beta testers even revealed a solution to the start/stop problem: an open source Google app called Vox-launcher. The ability to start recording speech without a hotkey is now slated for an upcoming release.

A release due next week will “rewrite some core parts,” and “allow Palaver to be installed more easily from the Software Center,” says McClain. The eventual goal is to let Palaver be easily installed on any supporting Linux distribution. The community has pitched in with offers to translate dictionaries and help package Palaver for particular distros, such as an already completed Arch Linux version.

Future releases will include an improved package manager and a repository for adding and removing functions. And McClain is looking for help in developing a configuration and installation GUI. In the meantime, a YouTube tutorial helps ease the setup process.

In addition to disposing of the hotkey, other planned features include improved debugging, support for more languages, and a feature that lets users create macros and bind them to speech commands. McClain hopes to greatly improve support for open-ended questions by connecting to natural-language knowledge systems. “Palaver can talk to Wolfram Alpha and MIT START directly via a web request or an API, so I plan to have them answer ‘What, How, Who’ questions,” says McClain.
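The planned split between local commands and “What, How, Who” questions could be handled by a simple router that inspects the first word of the recognized text. The Python sketch below is an assumption about how such routing might look, not a description of Palaver’s design: `QUESTION_WORDS`, `route`, and the two destination labels are hypothetical, and no request is actually sent to Wolfram Alpha or MIT START.

```python
# Hypothetical set of leading words that mark an open-ended question.
QUESTION_WORDS = ("what", "how", "who")


def route(utterance):
    """Return 'knowledge_engine' for What/How/Who questions, else 'local_command'."""
    words = utterance.strip().lower().split()
    if not words:
        return "local_command"
    # Treat contractions such as "what's" as their base question word.
    first = words[0].split("'")[0]
    if first in QUESTION_WORDS:
        return "knowledge_engine"
    return "local_command"
```

A caller would then forward anything routed to `knowledge_engine` to whichever question-answering service is configured and handle everything else through the command dictionary.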

McClain is also interested in crowdsourcing dictionary development by letting people suggest new commands and actions. “People would vote on what commands and actions they want, and developers would implement them,” he explains. “With enough people helping, combined with fuzzy recognition, it might be possible to say what you want done in natural language, without having to remember commands.”

VoxForge: a New Foundation for Linux Speech

Speech recognition is still a work in progress, and it continually fails to meet expectations. “Since no humans speak exactly the same, speech recognition is really hard,” says McClain. Meanwhile, Linux has trailed here, due largely to the usual market-share reasons.

Beyond Dragon NaturallySpeaking, proprietary solutions are pretty much limited to a few Linux-compatible programs such as SRI International’s DynaSpeak and Vocapia’s VoxScribe. As for the paucity of ready-to-roll, fully featured open source efforts, McClain notes that most speech databases for training recognition engines have been proprietary. “Luckily we now have VoxForge,” he adds.

The VoxForge project aims to collect and compile transcribed speech to develop a standard set of acoustic models that can be shared by open source speech recognition engines (SREs). McClain notes, however, that “it will take a while for VoxForge to match the databases that Dragon or Google have.”

Meanwhile, the models the open SREs use now are, in the words of the VoxForge website, “not at the level of quality of commercial speech recognition engines.” Initially, VoxForge is supporting four open SREs: CMU Sphinx, Julius, HTK, and ISIP. Developed at Carnegie Mellon University, Sphinx appears to have drawn the most support, especially for CMU’s embedded-oriented PocketSphinx. The Japanese-focused Julius, meanwhile, is expanding into English-language applications. The Hidden Markov Model Toolkit (HTK) and the Internet-Accessible Speech Recognition Technology Project (ISIP) are both academic, research-oriented projects.

The lack of robust databases may explain why many of the open source Linux speech programs listed on Wikipedia and on the more up-to-date Arch Linux wiki seem to have lost momentum. Some newer efforts include the PocketSphinx-based GnomeVoiceControl and Simon, which was based on Julius and HTK but recently switched to Sphinx in a 0.4 version that also added some experimental VoxForge models.

Canonical’s HUD project for Ubuntu and for the emerging, mobile-oriented Ubuntu Touch, which McClain says he will eventually support, uses PocketSphinx and Julius. Last month HUD developer Ted Gould posted a blog entry saying Julius offers better performance and results, but has an irksome “4-clause BSD license, putting it in multiverse and making it so that we can’t link to it in the Ubuntu archive version of HUD.” Gould seems to be open to another solution.

Eventually, VoxForge should rise to the occasion, and in the meantime, innovative efforts like Palaver are reimagining the user experience. Fortunately, speech recognition has “improved a great amount recently,” says McClain. “Maybe we are finally hitting the needed processing power and technologies to develop fast, accurate, untrained, speech recognition.”

How To Install Ubuntu Voice Recognition is part of the Linux Foundation’s 100 Linux Tutorials Campaign. For more Linux how-to videos, or to upload your own, go to http://video.linux.com/categories/100-linux-tutorials-campaign.

https://www.youtube.com/watch?v=pxom292XW_g