Speech Recognition Technology
(Source: Gary Robson, Cheetah Systems)
Reprinted from The Circuit Rider, the magazine of the United States Court Reporters Association
The popular media often picks up tales from speech recognition technology firms of how court reporting will be an obsolete profession any day now. To those not up to speed on the technology, this is a frightening thought. Prospective students are worried about whether starting court reporting school is a good idea. Existing students are wondering whether to stay.
Just how good IS speech recognition these days?
SPEECH SYNTHESIS is the creation of speech electronically. In other words, making a computer talk. This isn't of much interest to the court reporting profession, where the objective is to turn speech to print rather than vice versa.
Now, for some observations from the show:
SPEECH RECOGNITION is computer comprehension of speech. There are several broad classifications of speech recognition, including discrete speech vs. continuous speech, speaker-dependent vs. speaker-independent, and context-sensitive vs. context-insensitive.
DISCRETE SPEECH RECOGNITION requires that each word be an individually identifiable unit. Obviously, this isn't the way we talk. During normal speech, words are run together, and even slurred (as in "gonna" for "going to"). To make speech recognition easier, many systems require a pause between words. A typical requirement is 100 milliseconds (one tenth of a second). Given that it takes about 2/10 of a second to say a typical word, each word plus its pause takes about 3/10 of a second, which puts a theoretical maximum of 200 words per minute on discrete speech recognition. Needless to say, it will never work in a courtroom!
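The 200 words-per-minute ceiling is simple arithmetic, and a few lines of Python make it easy to try other pause lengths. This is only a sketch: the function name and parameters are made up for illustration, and the inputs are the rough figures quoted above, not measurements.

```python
def max_words_per_minute(word_seconds: float, pause_seconds: float) -> float:
    """Upper bound on speaking rate when every word must be followed by a pause."""
    seconds_per_word = word_seconds + pause_seconds
    return 60.0 / seconds_per_word


# Figures from the text: about 0.2 s to say a word, plus a 0.1 s required pause.
rate = max_words_per_minute(word_seconds=0.2, pause_seconds=0.1)
print(round(rate))  # 200
```

A longer required pause lowers the ceiling quickly. With a 0.2 second pause, for instance, the same calculation gives 150 words per minute.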
CONTINUOUS SPEECH RECOGNITION is the hook everyone's hanging their hat on. This is the ability to recognize words exactly as they're spoken, slurs and all. The system uses a technology known as the "Hidden Markov Model" to separate the speech into phonemes (individual sounds), and then reassembles the phonemes into words.
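To give a feel for what a Hidden Markov Model does, here is a toy sketch in Python. It uses the Viterbi algorithm, a standard way of finding the most likely sequence of hidden states in such a model. No vendor described its internals to me, so treat this as an illustration only. The three phonemes, the two coarse sound labels ("burst" and "voiced"), and every probability below are invented for the example. A real recognizer works on far richer acoustic measurements.

```python
import math
from itertools import groupby

# Hypothetical toy model for the word "cat": hidden states are phonemes,
# observations are coarse labels for short slices of sound.
STATES = ("K", "AE", "T")
START_P = {"K": 0.9, "AE": 0.05, "T": 0.05}
TRANS_P = {
    "K": {"K": 0.3, "AE": 0.6, "T": 0.1},
    "AE": {"K": 0.05, "AE": 0.5, "T": 0.45},
    "T": {"K": 0.1, "AE": 0.1, "T": 0.8},
}
EMIT_P = {
    "K": {"burst": 0.8, "voiced": 0.2},
    "AE": {"burst": 0.1, "voiced": 0.9},
    "T": {"burst": 0.8, "voiced": 0.2},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # Each column maps state -> (log-probability of best path ending here, previous state).
    first = observations[0]
    columns = [{s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that gives the best score for arriving in state s.
            prev, score = max(
                ((p, columns[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        columns.append(column)
    # Trace back from the best final state to recover the whole path.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


path = viterbi(["burst", "voiced", "voiced", "burst"], STATES, START_P, TRANS_P, EMIT_P)
print(path)  # ['K', 'AE', 'AE', 'T']

# Collapse runs of the same state into one phoneme each (a simplification).
phonemes = [state for state, _ in groupby(path)]
print(phonemes)  # ['K', 'AE', 'T']
```

Notice that the first and last sound slices look identical ("burst"), yet the model labels one K and the other T. The transition probabilities carry that information: a K is likely to be followed by a vowel, and a vowel by a T. That is how the approach copes with sounds that are ambiguous on their own.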
SPEAKER-DEPENDENT systems are trained for a single voice. This is the technology used by stenomaskers. The system is trained to understand their pronunciations, inflections, and accents, and can run much more efficiently and accurately because it is tailored to the speaker. This is analogous to the way a CAT system is trained to a specific court reporter using dictionaries, phonics tables, theory sheets, and include files.
SPEAKER-INDEPENDENT systems are designed to deal with anyone, as long as they're speaking English. To do this, the scientists had to figure out what parts of speech are generic, and which ones vary from person to person. A spin-off of this speech recognition technology is that the speaker-dependent parts have now been programmed into security systems which respond only to a given individual's voice, as shown in movies like "Sneakers."
CONTEXT-SENSITIVE systems increase their accuracy by anticipating or limiting what can be said at any given time. For example, a speech-recognition-based hotel wake-up call system might ask you what time you'd like to be awakened. It can then assume that whatever you say will represent a time of day. If you say anything else, it will not be able to recognize it. Context-sensitive systems may actually have large vocabularies, but only a small portion of that vocabulary will be activated at a time.
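The idea of activating only part of the vocabulary can be sketched in a few lines of Python. Everything here is hypothetical: the context names and phrase lists are invented, and text similarity from the standard difflib module stands in for the acoustic matching a real recognizer would do.

```python
import difflib

# Hypothetical vocabulary, split by context. Only one list is active at a time.
ACTIVE_VOCABULARY = {
    "main_menu": ["towels", "pillows", "newspaper", "wake-up call"],
    "wake_up_time": ["six", "six thirty", "seven", "seven thirty", "eight"],
}


def recognize(heard, context):
    """Return the closest phrase in the active context, or None if nothing is close.

    String similarity is a stand-in for acoustic matching (an assumption made
    purely for illustration).
    """
    active = ACTIVE_VOCABULARY[context]
    matches = difflib.get_close_matches(heard, active, n=1, cutoff=0.6)
    return matches[0] if matches else None


# A slightly garbled time is still understood while the system expects a time.
print(recognize("sevn thirty", "wake_up_time"))  # seven thirty

# A perfectly good word from another context is not recognized at all.
print(recognize("towels", "wake_up_time"))  # None
```

The second call shows the trade-off described above. Limiting the active list makes the first match easy, but anything outside that list simply cannot be recognized until the context changes.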
CONTEXT-INSENSITIVE systems allow you to say anything, any time. Typically, they have dictionaries in the neighborhood of 20,000 words.
Tools for setting up speaker-independent, context-sensitive, voice-based operation of Windows applications abounded. They did everything from running menus to checking selections in dialog boxes. None of them allowed dynamic modification (a la Total Access Courtroom software, where the attorney can add issue codes on-the-fly). They want all the words and rules built in before you start using the program.
The quality of such tools varied widely. Overall, I wasn't impressed. At one company I visited, the fellow took over a dozen tries to get it to accept his telephone number. Understand here that it was only looking for (and accepting) digits! AT&T had a voice-mail application running with only ten commands available. One of them was "previous message." In three minutes of trying, I was unable to get it to recognize that command (although one time I said "previous message" and it thought I said "erase all messages"). This is from a company with a huge research facility and a vested interest in making this technology work!
One program for controlling Windows applications wasn't too bad. I could go through their demo dialog box and check things fairly accurately, as long as there wasn't any field input. It was a hotel demo, where you could call in for towels, pillows, wake-up calls, newspapers, etc. As long as you asked for one of the things on the menu, it was right probably 75% of the time. When you entered the time for the wake-up call, it wasn't so hot!
ARPA (the U.S. Government's Advanced Research Projects Agency) is still sponsoring a lot of research. The best system in their recent "contest" achieved 93% accuracy with a 20,000-word vocabulary. That's about one wrong word in every 14. It used a minicomputer with 100 MB of RAM! All of the words used in the test were in the dictionary, so that 7% error rate represents just a count of WRONG translations, not words outside the vocabulary. This would be like giving a court reporter a realtime test and saying that it's okay if the word isn't in your dictionary and it comes out wrong. You'll only be graded off if you write it wrong!
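For anyone who wants to check the "one wrong word in every 14" figure, or compare it with other error rates, the conversion is a one-liner. The function name below is made up for illustration. The 7% figure is the contest result quoted above, and the 4% figure is the CRR threshold mentioned at the end of this article.

```python
def words_per_error(error_percent: float) -> float:
    """Average number of words between errors at a given error rate."""
    return 100.0 / error_percent


print(round(words_per_error(7)))  # 14  (best ARPA contest system)
print(round(words_per_error(4)))  # 25  (CRR threshold of 4% or less)
```

Put that way, the best research system in the contest was making errors almost twice as often as a reporter who just barely passes the CRR.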
Some of the best technology is still batch-oriented rather than realtime. In other words, they use the dictation paradigm, where you speak into a tape recorder (or the digital equivalent thereof), and the system processes the recording all at once rather than as you speak. These systems use UNIX-based computers with DSPs (digital signal processors) to get the throughput.
Background noise is a killer to these applications. They couldn't handle a newscast where the reporter was standing in a rainstorm, like California's had the last three months, even if they COULD deal with it in the studio.
Overall, I would rate the tools available for speech-based editing as "marginally useful." I would rate full-speed realtime speaker-independent speech recognition as a long way from being able to compete with court reporting technology. I asked one vendor (who claims to be the leader in the core technology, as several of them do) when we'd be able to plug a system into a television newscast and have it create captions in realtime. He said "two years." I asked how long before it could be done on a PC-type computer (they're using $20,000 Silicon Graphics workstations). He said we'd have to wait until PCs were far more powerful than today's Pentiums and PowerPCs. I asked how long before it could get ALL of the words in the newscast, including live interviews and remotes. He said "never."
When the court reporting industry does see competition from these technologies (which it will, sooner or later), it will probably come from mask reporters first. These folks are trained to speak clearly and distinctly already, and my guess is that they will probably be able to achieve reasonable results within the next few years. By "reasonable," I'm speaking of error levels similar to what it takes to pass the CRR (4% or less).