Desktop Speech Recognition: A How-to on Establishing User Skill

Lamont Wood dives deep into the benefits and drawbacks of speech recognition. Users rarely get it right. Start dictating right after this article.

In a previous article we examined the abilities of Windows Speech Recognition (WSR). The recognition accuracy of WSR is an issue, as many people know from experience. However, an often-overlooked aspect of speech recognition software in general is user skill. A lack of it creates problems not only with WSR but with Dragon NaturallySpeaking as well. Dictating correctly will increase your productivity and comfort with both applications.


“Speak Naturally”

The most common advice that a user hears when using speech recognition software is to “speak naturally.” Two decades of practical use with many programs, for fun and work, have taught me that “speak naturally” is essentially bad advice.

Or, conversely, the advice is spot on if you naturally speak like a socially alienated BBC announcer.

Speech recognition software performs most admirably with big fifty-cent words like “democratize” and “ambiguous.” When you scale down to the two-cent words (e.g., you, and, there, it, he, she) that form the connective tissue of a sentence, the software is adrift in a fog, and the result is a word scramble. And the smaller words, when missing or misplaced, are always the hardest to find when proofing.

A flesh-and-blood stenographer is the right person to “speak naturally” with: they will fill in the connective tissue with ease, drawing on common sense and culture. A computer, however, has no common sense or culture. Thus, the responsibility for a connected, well-dictated sentence falls on the individual, not the software.

Arch Pronunciation

I have dubbed the approach that works (for me) Arch Pronunciation. Essentially, I archly dwell on every word as if I mistrusted the listener. In truth, the machine is not a reliable or trustworthy listener, so my approach focuses on speaking clearly, without particular emphasis, as if my listener had trouble understanding. That way the program hears every word, which means far fewer corrections later on.

This does not mean that you pause. Between. Words. But every word has to be distinctly enunciated in a fashion that, if used in conversation, would seem insulting. So remember—you’re not conversing. In fact, in the original sense of the word, you are not even dictating. You are controlling a machine with your voice. Go ahead, sound insulting if necessary—the machine doesn’t care.

Yes, you may end up speaking slower than you would otherwise. So here is where we paraphrase Wyatt Earp: Speed is good. Accuracy is better. (He was discussing a line of endeavor that also involved immediate feedback, albeit with more finite results.)

Textual Visualization

I also find it worthwhile to use a technique that I call Textual Visualization. You visualize the words that you want to appear on the screen and then speak them aloud using Arch Pronunciation. This extra layer of concentration will help you pronounce the words correctly and consistently, with no extraneous “dysfluencies” like “ah, well, y’know.”

Meanwhile, you will probably discover that, when it comes to writing, after years of keyboarding you now “think with your fingers.” With speech recognition the flow of thought must be grabbed and turned into words at a different (probably earlier) point in the cognitive process — that’s been my impression, anyway.

It has also been my impression that you will have to give yourself about two weeks to become comfortable with this new mode. While two weeks is not a trivial consideration, keep in mind that it pales in comparison to the effort you had to invest to learn how to type.

Now that I have achieved some comfort with it, my experience is that speech recognition roughly halves the time and effort needed to compose a first draft. There have been times when I felt like I was cheating because I was not tired at the end of the day. Using both speech and the mouse is advantageous during editing, since you don’t have to keep repositioning your hands on the keyboard: you can select what you don’t like and dictate a change. For formatting it seems to offer no advantage, and for graphics it may be a disadvantage.

But if you have accessibility issues (e.g., you can’t reach the keyboard or move the mouse), such details are irrelevant: speech recognition can change your life.

Finally, there’s the sound environment and outside noise to consider. If you use a headset microphone, office chatter will not hurt recognition accuracy. Background droning from ventilators and fans will hurt accuracy, but probably not enough to shut you down. The bigger issue may be simple self-consciousness: you may not want people listening to you while you talk to your computer. But with a little adjustment it should feel no different from talking on the phone.

For, I’m Lamont Wood.

Based in San Antonio, Texas, Lamont Wood is a senior editor at He’s been covering tech trade and mainstream publications for almost three decades now, and he’s a household name in Hong Kong and China. His tech reporting has appeared in innumerable tech journals, including the original BYTE (est. 1975). Email Lamont at or follow him @LAMONTwood.