On Voice Interfaces and Accents

Good piece by Sonia Paul for Backchannel, on how AI assistants like Siri and Alexa handle accents:

To train a machine to recognize speech, you need a lot of audio samples. First, researchers have to collect thousands of voices, speaking on a range of topics. They then manually transcribe the audio clips. This combination of data — audio clips and written transcriptions — allows machines to make associations between sound and words. The phrases that occur most frequently become a pattern for an algorithm to learn how a human speaks.

But an AI can only recognize what it’s been trained to hear. Its flexibility depends on the diversity of the accents to which it’s been introduced. Governments, academics, and smaller startups rely on collections of audio and transcriptions, called speech corpora, to bypass doing labor-intensive transcriptions themselves. The University of Pennsylvania’s Linguistic Data Consortium (LDC) is a powerhouse of these data sets, making them available under licensed agreements for companies and researchers. One of its most famous corpora is Switchboard.
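To make that pairing concrete, here's a minimal sketch, assuming a toy corpus with hypothetical clip filenames and transcriptions (nothing here comes from the article or from a real corpus like Switchboard). It just pairs each audio clip with its manual transcription and counts which phrases recur, since frequently occurring phrases are the patterns the system learns from:

```python
# Conceptual sketch only: pair audio clips with manual transcriptions,
# then count which transcribed phrases recur across the corpus.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Sample:
    audio_path: str        # path to a recorded clip (hypothetical filenames)
    transcription: str     # the manually written transcript of that clip

corpus = [
    Sample("clips/speaker_001.wav", "turn on the kitchen lights"),
    Sample("clips/speaker_002.wav", "turn on the kitchen lights"),
    Sample("clips/speaker_003.wav", "what is the weather today"),
]

# Frequent phrases are the "patterns" a recognizer would learn associations for.
phrase_counts = Counter(sample.transcription for sample in corpus)

for phrase, count in phrase_counts.most_common():
    print(f"{count}x  {phrase}")
```

A real recognizer works from acoustic features of the audio rather than string counts, of course, but the shape of the data is the point: audio in, text out, at scale, and ideally across many accents.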

If voice is the Next Big Thing in technology, as many pundits believe, then systems like Alexa and Siri need to get demonstrably better at parsing accents and, importantly for accessibility, speech impediments. Otherwise, these voice-driven interfaces will remain effectively inaccessible and unusable for many people. And that’d be a shame, because voice has so much potential to be a powerful assistive technology.