HAL 9000, Star Trek, Cerulean Blue Cars, and The State of Speech Recognition Software by a Decade-long User

“Open the pod bay doors, HAL.”

“I’m sorry, Dave, I’m afraid I can’t do that.”

The H.A.L. 9000 of “2001: A Space Odyssey” fame pretty much is the most user-unfriendly computer of all time, inasmuch as it killed off all but one member of the crew of its spaceship. Strangely enough, though, Arthur C. Clarke and Stanley Kubrick may have better caught the nature of speech-recognition software than the ever-so-reliable and pleasant talking computers on “Star Trek.”

HAL, it turned out, committed murder because of a programming bug. (Microsoft didn’t exist when the movie was made, otherwise they would have called it a programming “issue.”) Given conflicting programming instructions to always follow orders from the human crew, but not to tell them that the mission’s true objective was to search for extraterrestrial life, HAL had the ultimate Fatal Application Error. Even before that, though, HAL took instructions very literally, and accordingly, after the astronauts got past the “gee whiz” phase, they perceived their computer companion as being either stubborn or obtuse.

And stubborn or obtuse is pretty much how a typical user will regard the typical speech recognition program on the market today.

The technology has come a LONG way. I started using it nearly a decade ago — not out of enthusiasm for cutting-edge computing, but because of RSI, which made keyboarding long passages of text physically impossible. From that perspective, speech recognition was a godsend, and the evolution has been amazing.

Consider pricing: in 1991, adopting speech recognition was an $8,000+ proposition. You needed to buy a then-top-of-the-line 386 system, add a then-exotic sound board, and purchase a certified reseller’s consulting time along with the software, as the manufacturers would not let users set up their systems un-aided.

Today any computer superstore or mail-order house can sell you infinitely more advanced software for about 150 bucks, and it’s a quick install on any run-of-the-mill modern multimedia PC. (Let me not over-sell the point, though. There is a trade-off between speed and accuracy in speech recognition, so the better the hardware, typically the better the performance of the voice software.)

Operationally, the products have made a quantum leap from “discrete speech” [when… you… had… to… speak… each… word… separately… with… a… LONG… pause… between… each… word] to “continuous speech” — which is to say, everyday conversational speech. Moreover, whereas the early systems typically mis-recognized one out of every eight words (10, if you were lucky), recognition levels are now in the 95% range.
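To put the "one out of eight" and "95%" figures on the same scale, here is a rough back-of-the-envelope sketch in Python. The 1,000-word article length is an assumption chosen only for illustration, and the function names are made up for this example.

```python
# Compare the old "one word in N wrong" figures with a ~95% accuracy rate.
# ARTICLE_WORDS is an assumed length, picked only to make the numbers concrete.
ARTICLE_WORDS = 1000

def accuracy_from_one_in(n):
    """Accuracy (as a fraction) when one word in every n is mis-recognized."""
    return 1 - 1 / n

def errors_per_article(accuracy, words=ARTICLE_WORDS):
    """Expected number of mis-recognized words in an article of that length."""
    return round(words * (1 - accuracy))

for label, acc in [
    ("early system, 1 in 8", accuracy_from_one_in(8)),
    ("early system, 1 in 10", accuracy_from_one_in(10)),
    ("current system, 95%", 0.95),
]:
    print(f"{label}: {acc:.1%} accurate, "
          f"about {errors_per_article(acc)} errors per {ARTICLE_WORDS} words")
```

Even at 95%, that still leaves roughly one word in twenty to hunt down and fix.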

I am using NaturallySpeaking from Dragon Systems Inc. (see below), generally regarded as the top speech recognition system on the market today. Also popular is ViaVoice 98, IBM’s entry in the field (see below). An emerging contender is Lernout & Hauspie’s Voice Xpress. L&H acquired Kurzweil (another pioneering company), and has backing from Microsoft — and you know what that usually means. Microsoft, in fact, is quietly testing speech recognition software under its own brand name. There is some suspicion that had Windows 2000’s shipping date not slipped so much, some speech command-and-control might have been part of it.

NaturallySpeaking from Dragon Systems Inc.
http://www.dragonsys.com/

IBM’s ViaVoice 98
http://www.software.ibm.com/speech/

NaturallySpeaking, ViaVoice, and Voice Xpress are all DICTATION programs. They generally provide some kind of basic word processing window into which you can dictate, and also the capability to dictate within Microsoft Word and other applications. NaturallySpeaking, in fact, ships with the Corel WordPerfect Suite, while ViaVoice is included in IBM-owned Lotus SmartSuite. None of them have extensive capabilities to command and control applications. The big item in that market is L&H’s Kurzweil VoiceCommands, which Microsoft plugs as a Word add-on.
http://officeupdate.microsoft.com/downloaddetails/voicecommand.htm
(Beware: this URL may wrap in your email reader)

All speech recognition systems, however, suffer from one inherent accuracy problem: WE don’t have especially high accuracy ourselves, nor does the English language. Few of us have radio-announcer-like voices that allow us to clearly enunciate each word. Nor, particularly if we are writing something from scratch, do we maintain a constant cadence. Pause an extra millisecond between syllables and your dictation software, as mine just did, can put up “sole bulls” instead of “syllables.”

And even perfect diction doesn’t solve the “2-to-too-two” kind of problem. English is full of words that sound the same, and often are hard to distinguish even in context. If I say, “two blue cars,” maybe I’m talking about a pair of blue cars, but I also might have been saying “too blue cars,” as a complaint about excessive cerulean coloration.

Once there has been mis-recognition, the systems have ways to make corrections. But they are tedious, time-consuming processes. Also a chronic problem is that since speech recognition software draws from a built-in dictionary, when it makes an error, the mistake will still be a real word. And no spell checker will ever catch it if you miss it. As the editors of this e-zine can attest, odd words inevitably will creep into my writing. A Dragon techie pointed out to me that, for this reason, obscenities are excluded from the software’s dictionary (and a user adds them at his own peril). The company discovered the problem as the result of an embarrassing moment at a trade show demo.
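The real-word problem described above can be seen in a toy example. This is only a sketch: the tiny word list and the `misspelled` function are made up for illustration and do not reflect how any actual dictation product or spell checker is built.

```python
# A toy spell checker: it flags only words missing from its dictionary.
# DICTIONARY is a tiny stand-in word list, not any real product's vocabulary.
DICTIONARY = {"pause", "between", "the", "sole", "bulls", "syllables"}

def misspelled(text):
    """Return the words in text that are not in DICTIONARY."""
    return [word for word in text.lower().split() if word not in DICTIONARY]

typed = "pause between the sylables"       # a typist's slip: not a real word
dictated = "pause between the sole bulls"  # a recognizer's slip: real words

print(misspelled(typed))     # ['sylables']
print(misspelled(dictated))  # []
```

The typing error is flagged because it is not a word at all. The dictation error sails through, because every word in it came out of the dictionary in the first place.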

One of my TNPC colleagues asked about the quasi-psychological aspects of the technology. As he noted, we are used to the tactile sensation of typing and it can feel odd to write without it. I have found that seeing the words come up on screen as you say them helps some. You also adjust with time. It helps that I was a rotten typist, so that my dictation accuracy is not much worse than my hunt-and-peck performance. Another help is that I was a journalist in the pre-laptop era when dictating a story over the phone to someone in the newsroom was still a standard practice.

Still, what gives you a comfort level with your writing is a personal thing. I feel most comfortable doing my dictation in a private office. Working in a more crowded environment where everyone can hear what I am dictating is the equivalent of having the entire office looking over your shoulder while you are typing. Again, this becomes less awkward with time, as your dictation becomes less of a novelty in the office. But it still bugs me.

The bottom line is that speech recognition right now is most useful as a tool for those who need it to deal with disabilities such as RSI or for those who have been using a dictation machine. The frustrations probably make the technology unsuitable for people who are effective on the keyboard. Unless you are really dedicated, it is not the tool of choice for crash projects.

BUT…

Instead of being “just a generation away,” speech recognition now is just about there. Current high-end PCs have enough power that a continuous speech program can crunch enough data to understand the context in which a word has been used and thus improve accuracy — without paying a terrible price in dictation speed. The hardware releases scheduled for the next year, plus the usual upgrades to speech software, will mean that long before 2001 actually gets here, most users probably will be talking to their computers.

“Rotate the pod, HAL.”