Audio-hacking your smart phone

By: William Jackson
July 8, 2016

And now we have one more thing to worry about. Researchers at Georgetown University and the University of California, Berkeley have demonstrated a method of obfuscating voice commands so that they are not recognized by humans but can be executed by voice-activated smart phones.

Depending upon the capabilities of the device, an attack could be used to post the user’s location on social media, cause denial of service by shutting down or activating airplane mode, or open a web page hosting drive-by malware. Hidden commands could be broadcast from public speakers or embedded in trending YouTube videos to reach many victims.

“Voice is a broadcast channel open to any attacker that is able to create sound within the vicinity of a device,” the authors wrote in their recent paper. “This introduces an opportunity for attackers to try to issue unauthorized voice commands to these devices.”

Like many new demonstration attacks, this one has limitations. The audio source had to be within about 10 feet of the phone (their tests used Android phones) and the volume couldn’t be too low. Large amounts of background noise would also interfere with it. And the phones would recognize and correctly execute the command a little more than half the time. Maybe most importantly, the user of the device can hear these commands. You might not understand “Agock bougaley” as “OK Google,” but if you heard someone or something talking nonsense to your phone you might check to see what is going on.

Still, more practical attacks are possible. And voice control is becoming more common, not only in mobile devices. Apple has released macOS Sierra, its latest desktop operating system, to beta testers, and it includes Siri. This is the kind of thing that could get the attention of malicious hackers.

The idea of audio hacks is not new. Other researchers have shown it can be done. The question addressed in this paper is: “Can an attacker create hidden voice commands, i.e., commands that will be executed by the device but which won’t be understood (or perhaps even noticed) by the human user?”

The answer is yes. Depending on your point of view, the vulnerability exploited here is that speech recognition systems are either too good (they recognize things too easily) or not good enough (recognition is so poor that the threshold for accepting commands is too low). The researchers ran both a "black box" test, which presumed no special understanding of the speech recognition system being attacked, and a "white box" test tailored to a specific system. Not surprisingly, the white box attack worked better, but the black box attack still worked fairly well.

A text-to-speech engine was used to generate three voice commands: "OK Google," "call 911," and "turn on airplane mode." An "audio mangler" then extracted the data used by the speech recognition system and added noise to it. "The attacker is in essence attempting to remove all audio features that are not used in the speech recognition system but which a human listener might use for comprehension."
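
For readers who want a concrete picture of what "mangling" means, here is a minimal sketch of the general idea in Python: keep only the recognizer-style features (MFCCs), perturb them, and resynthesize audio from those features alone. The library choice (librosa) and every parameter here are illustrative assumptions, not the researchers' actual tool or settings.

```python
# Sketch of the "audio mangler" idea: keep only the features a speech
# recognizer consumes (MFCCs), add a little noise, and resynthesize audio
# from those features alone. Library and parameters are illustrative.
import librosa
import numpy as np
import soundfile as sf

def mangle(in_path: str, out_path: str, sr: int = 16000, n_mfcc: int = 13) -> None:
    y, sr = librosa.load(in_path, sr=sr)

    # Extract the MFCC features a typical recognizer relies on.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Lightly perturb the features; in the attack this noise is tuned so the
    # machine still recognizes the command while human intelligibility drops.
    mfcc += np.random.normal(scale=0.5, size=mfcc.shape)

    # Resynthesize audio from the MFCCs alone. Everything the features do not
    # capture (pitch contour, fine spectral detail) is discarded, which is
    # what makes the result hard for a listener to parse.
    y_mangled = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)
    sf.write(out_path, y_mangled, sr)

# Example: produce a garbled version of a recorded wake phrase.
mangle("ok_google.wav", "ok_google_mangled.wav")
```
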

The result is gobbledygook to a human listener, but it still contains the data the speech recognition system is looking for. In the black box test, phones were able to correctly execute obfuscated commands 60 percent of the time, while humans could identify the meaning only 41 percent of the time, leaving a 19-point window for a successful attack. In the white box test against a specific speech recognition system, the system correctly recognized the command 90 percent of the time while the audio remained essentially unintelligible to humans.

The researchers also looked at defenses against audio hacks. Most had problems. Audio CAPTCHA has high user overhead and is itself subject to hacking. Biometric voice authentication, which accepts commands only in the owner's voice, takes time to train, and it is not clear that it would prevent execution of a command.

The most promising idea appears to be filtering to decrease the fidelity of the audio before speech recognition. This degrades the quality of the malicious attack enough to prevent recognition, while only slightly degrading recognition of legitimate commands.
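
As a rough sketch of what that kind of pre-filtering might look like, the snippet below low-pass filters incoming audio before it would be handed to a recognizer. The cutoff frequency and filter design are illustrative choices, not values the researchers tested.

```python
# Rough sketch of the filtering defense: slightly degrade audio fidelity
# before speech recognition. Cutoff and filter order are illustrative.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def prefilter(in_path: str, out_path: str, cutoff_hz: float = 3000.0) -> None:
    sr, audio = wavfile.read(in_path)
    audio = audio.astype(np.float64)

    # 4th-order Butterworth low-pass: strips the high-frequency detail a
    # mangled command may depend on while leaving normal speech intelligible.
    sos = butter(4, cutoff_hz, btype="low", fs=sr, output="sos")
    filtered = sosfiltfilt(sos, audio, axis=0)

    wavfile.write(out_path, sr, filtered.astype(np.int16))

# Example: clean up captured audio before passing it to the recognizer.
prefilter("incoming_command.wav", "filtered_command.wav")
```
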

This probably isn’t something we have to worry about right away, but you might want to start paying more attention to those garbled PA announcements you hear on public transit.