Snips is excited to join forces with Sonos to create even more differentiated and immersive experiences for our customers.
We’re debunking some of the most widely-held claims about voice interfaces, performance, and data security.
With 41% of US consumers now owning a voice-activated speaker, and businesses relying on voice recognition solutions to streamline processes involving confidential and sensitive data, privacy concerns around voice assistants are growing. At the same time, the measures implemented by the main voice recognition technology providers to address privacy challenges still appear insufficient, and many data-related practices remain unclear to the end user, as recent news has confirmed.
Nearly a year after the GDPR’s entry into force, we’ve dissected a few of the claims made by cloud-based voice assistant providers on four topics:
- Performance Claim: They claim that using massive amounts of end user data is necessary to improve voice assistant performance;
- Trust Claim: They claim it is acceptable to do so because end users can trust cloud-based voice assistant providers with their voice data;
- Security Claim: They claim cloud processing is the most secure architecture, given how much cloud-based providers invest in security; and
- Privacy Innovations Claim: They claim privacy-protective innovations can solve privacy issues in machine learning anyway.
By doing this, we hope to dispel a few of the commonly held misconceptions about the necessity and desirability of cloud-based voice recognition solutions, and to demonstrate that the “Snips way” provides a high-performing, on-device alternative at no privacy cost.
Before we dive in, why is privacy at a heightened risk in the voice assistant context?
What Makes Privacy So Important in the Voice Context?
All providers should have identified voice recognition technology as especially privacy-sensitive long ago. At Snips, we realised very early on that voice recognition had the potential to rapidly become a widely adopted consumer technology. We decided we could only truly advocate for the general adoption of voice interfaces to “make technology disappear” if we could guarantee that end users would not be asked to trade in their privacy for the best technology.
Voice recordings are not just any type of data — voice can reveal a lot about an individual’s emotions or state of mind. Machines can also use voice to identify an individual with a good level of certainty. This, coupled with the fact that voice cannot be altered, makes voice one of an individual’s biometric “prints” that can be used to confirm a person’s identity (similar to e.g. fingerprints, retina recognition, or vein mapping). Voice can also be copied and manipulated in ways that may allow impersonation and identity theft.
Voice recognition technology is also not any kind of technology — it is based on machine learning techniques that, by definition, require a certain amount of training data. The voice data used for training must be labelled, generally manually, so that algorithms can be trained to interpret voice inputs accurately. While a lot of research efforts are focused on automating the labelling process or relaxing the need for labels (i.e. unsupervised learning), to date, this still involves human review and input of human voice data, which necessarily creates a tension in terms of confidentiality when that voice data pertains to an actual end user.
Voice assistants are also inherently invasive due to the intimate contexts they are often used in. They are most prevalent in homes, including in bedrooms and bathrooms. Confidential information is disclosed in meeting rooms, factories, and cars, where voice assistants are becoming more and more integrated. And often their presence is not disclosed to the unaware observer.
A lot more information can also be collected through voice assistants without end users being actively aware, whether because voice might be recorded unintentionally (if there is a wake word false positive and the voice assistant is triggered by mistake), or because other information can be derived from end users’ use of their voice assistants.
The main voice assistants currently on the market systematically centralise, remember, and learn from every single interaction end users have with them. Their records include raw audio data and the outputs of all algorithms involved, attached to logs of all actions taken by the assistant. The latest research and innovations also suggest that interactions are set to become significantly smoother and more relevant based on additional information about end users’ tastes, contacts, habits, etc.
Overall, voice assistants combine the use of data with biometric potential, data-intensive technology, use cases concentrated on intimate and confidential spaces, and improvements focused on adding context and crossing more information about end users. Cumulatively, these factors give voice assistants a high potential to generate multi-angled privacy threats, unless they are designed with privacy in mind from the start.
Performance vs. Privacy: an Imaginary Paradigm
The first justification for the centralisation of end user voice data offered by the most prevalent voice assistant providers is related to performance: they claim that centralising end user data is required to maximise the performance of their cloud-based assistants. For instance, one voice assistant provider’s privacy settings state that “training [the voice assistant] with recordings from a diverse range of customers helps ensure [it] works well for everyone.” Not only does the performance justification have no technical grounding, but its use in these providers’ terms and privacy policies is misleading and, as a result, may not comply with the GDPR’s transparency requirements.
There Are No Technical Arguments To Support Collecting End User Data For Improved Performance
Though false positives and other performance issues are unavoidable, performance-based justifications for collecting end user voice data are flawed: there is no need for the scale of data being collected, and alternatives to actual end user data provide comparable results.
Past a certain point, collecting more data in fact only marginally improves performance, and systematic recording with the goal of improving models is not necessarily justified. In addition, while it is true that a diverse range of data is required for optimal performance, these companies already hold the data they need to ensure that this is the case.
Even if that turned out not to be the case, there are sources for such data other than end users’ voice queries, and issues such as wake word false positives and query misunderstandings can be solved without using end user data to enrich models. Benchmarks we performed on an open dataset show that over 90% precision can already be reached with 2,000 crowdsourced training utterances (as opposed to voice queries collected from end users) on state-of-the-art natural language understanding engines (e.g. Microsoft Luis.ai or Snips NLU).
Similarly, when it comes to spoken language understanding, further benchmarks show that combining crowdsourced query datasets with about 1,000 hours of transcribed audio yields performance similar to the Google Speech API’s on large-vocabulary use cases. Those datasets can be acquired off the shelf from a number of providers, at prices accessible to a start-up. What’s more, models yielding state-of-the-art performance can run on standard IoT hardware, typically a Raspberry Pi 3 (quad-core Cortex-A53 at 1.4GHz), and do not require powerful servers.
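To make the crowdsourcing idea concrete, here is a deliberately simplified toy intent classifier trained on a handful of crowdsourced-style utterances. This is an illustration of the principle only, not the Snips NLU engine or any benchmarked system; the utterances and intent names are made up:

```python
from collections import Counter

def featurize(utterance):
    # Bag-of-words features: lowercased token counts.
    return Counter(utterance.lower().split())

def train(examples):
    # examples: (utterance, intent) pairs, as a crowdsourcing
    # campaign would produce. Build one token profile per intent.
    profiles = {}
    for utterance, intent in examples:
        profiles.setdefault(intent, Counter()).update(featurize(utterance))
    return profiles

def classify(profiles, utterance):
    # Score each intent by token overlap with its profile.
    feats = featurize(utterance)
    def score(intent):
        return sum(min(feats[t], profiles[intent][t]) for t in feats)
    return max(profiles, key=score)

# Crowdsourced-style training data: written for training, so it
# contains no real end user's query.
examples = [
    ("turn on the kitchen lights", "lights_on"),
    ("switch the lights on please", "lights_on"),
    ("what is the weather tomorrow", "get_weather"),
    ("will it rain this afternoon", "get_weather"),
]
profiles = train(examples)
print(classify(profiles, "please turn the lights on"))  # lights_on
```

Real engines use far richer features and models, but the workflow is the same: a few thousand purpose-written utterances, never a stream of recordings from people’s homes.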
Small quantities of crowdsourced data can thus lead to tremendous improvements in performance. This shows not only that massive volumes of recordings are unnecessary, but also that high-performing voice interfaces can be built by crowdsourcing data from human workers rather than using end user data that then needs to be labelled by human workers. The volumes of data required to reach the performance levels described above can be crowdsourced in a few hours, so nothing can justify collecting millions of end user utterances every day to reach these metrics.
This approach has the advantage that any voice recordings or written queries used are created for the very purpose of training the voice recognition models, and pose no threat to anyone’s privacy (since they do not contain real-life queries). Edge-based solutions that don’t use end user data are also designed so that even a false positive has no consequences for privacy: the processing is local and no voice data are sent to the cloud.
Justifications For End User Data Centralisation Based On Performance Do Not Meet Transparency Standards
Why do cloud-based voice assistant providers make these claims if they have no technical grounding?
The performance justification is used to induce end users into accepting that their data are processed en masse in the cloud.
But more than merely for persuasion, these claims are actually used for legal effect, to comply with some of the GDPR’s lawfulness requirements, i.e. the requirements that personal data be processed “fairly and in a transparent manner” and that it be “collected for specified, explicit and legitimate purposes”. Another related requirement is that any consent relied on by a company to justify processing an end user’s personal data be “freely given, specific and informed”.
Data protection authorities have started enforcing these standards under the GDPR: in January 2019, the CNIL fined Google 50 million euros for overly vague privacy terms and improperly informed consent.
These requirements reflect the presumption that personal data should not be collected unless justified, and that the burden is on the company collecting personal data to evidence a valid purpose and legal basis for doing so. Given the lack of technical grounding for the performance justification established above, one can wonder what legal value these claims should be given against the GDPR’s standards when used for these purposes.
The purpose stated by one cloud-based voice assistant provider in its privacy settings for the use of end user voice data — the development of new features and improved performance — is hardly transparent or explicit for an end user. There is no information, for instance, about the fact that their voice recordings can be reviewed by humans. People speaking within the intimacy of their homes still have no idea what is done with their voice recordings or where they are sent. What should the standard of transparency be in that case?
Another provider states that even where a user chooses to delete their data history, their data may still be used to “improve their services”, a statement that can hardly be considered informative in and of itself.
How free is an end user’s consent if it rests on the threat that “new features may not work well” for them should they opt out of voice recordings? This leads them to believe that opting out means losing access to the best version of the voice assistant, which, as established above, is technically incorrect.
The standards are still being defined, but consent as the legal basis for such processing activities could easily be challenged under these circumstances.
Overall, wording and design choices nudge end users into allowing a voice assistant provider to use their voice recordings based on justifications that are unclear and ambiguous.
The CNIL’s innovation and foresight group (“LINC”) recently picked up on these types of “potentially deceptive design practices” in its report on Shaping Choices in the Digital World published in January 2019. While some of these practices “may comply with the GDPR”, the report states, “depending on the time, manner and data in question, they can raise ethical issues and even be non-compliant.”
In contrast, full transparency and neutral design would involve acknowledging that performance would be unaffected and that most features would still work if an end user chose to opt out, and detailing the actual expected impact if their data were deleted.
Reliance On Trust Is Not Required If Voice Is Handled On Device
As highlighted above, the general lack of transparency around voice assistants means there is no way for users to be fully aware of what is done with their voice recordings. This creates a flawed basis for trust, as opposed to transparent, on-device architectures that remove the very need for trust.
Paradoxically, the major voice assistant providers all acknowledge the importance of privacy and end user trust repeatedly, and claim that they know “[they] had to get privacy right to preserve [their] customers’ trust” and that they “put customers in control”, or that “[their] users have long entrusted [them] to be responsible with their data and [they] take that trust and responsibility very seriously”.
The fundamental disconnect appears to lie in the absence of mechanisms to ensure that trust is and continues to be fully informed and well-deserved. Transparency is key in this respect; there can be no so-called “control”, no checks and balances, and no accountability to end users, without transparency.
The GDPR provides one level of checks, and helps ensure some minimal safeguards are implemented (some level of security, some level of definition of purpose, some transparency in terms of information), but the interpretation and enforcement of its standards are still being tested. Overall, there aren’t many constraints on providers against doing what they deem necessary to fulfil their own purposes, even when this might come at a higher privacy cost for their end users than justified.
Ultimately, in the absence of binding and enforceable checks and balances, end users should avoid having to rely on trust in a third party when it comes to protecting their private lives.
The simplest, most reliable and transparent solution in this context is on-device processing. Doing everything on device removes the question of where voice recordings are sent and who has access to them or reviews them.
The deal made with end users is simple and straightforward: if they use an on-device voice assistant, their voice recordings never travel away from the device itself. The device doesn’t even need to be connected to the internet to work. That is the only way end users can be certain their privacy will be respected.
In that context, there is no need to gain or maintain end user trust — the way the system is designed simply does not give anyone the power to deviate from the asserted purpose.
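As an illustration of this design property, here is a toy sketch of such a pipeline. Every function is a stand-in stub for an embedded model (the names and logic are ours, not any actual assistant’s); the point is simply that no stage performs network I/O, so audio is either acted on or discarded on the device:

```python
def wake_word_detected(audio):
    # Stub wake word detector; a real one runs a small on-device
    # acoustic model. Here, a marker string stands in for the trigger.
    return audio.startswith("<wake>")

def speech_to_text(audio):
    # Stub on-device ASR: pretend the remaining "audio" decodes to text.
    return audio[len("<wake>"):].strip()

def parse_intent(text):
    # Stub on-device NLU: map a transcript to an intent.
    return "lights_on" if "lights" in text else "unknown"

def handle_audio(audio):
    # The full chain runs locally; nothing here opens a network
    # connection, so voice data never leaves the device.
    if not wake_word_detected(audio):
        return None  # false or absent wake word: audio goes nowhere
    return parse_intent(speech_to_text(audio))

print(handle_audio("<wake> turn on the lights"))  # lights_on
print(handle_audio("private conversation"))       # None
```

The privacy guarantee is structural rather than contractual: even a wake word false positive only reaches local code, never a server.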
Security Risks Are Smaller On Device
Beyond concerns about what personal information is shared with a voice technology provider, a solution’s security with regard to external attacks is also a major concern, given the sensitivity of the data that can transit through a voice assistant. Again, this is not a problem with on-device architectures, which do not expose any voice recordings in the cloud.
Though the main voice assistant providers invest in state-of-the-art security measures to protect their servers, leaks and successful attacks still happen. For example, Google’s strict security measures did not prevent the data of 500,000 Google Plus users from being exposed in 2018.
The question is not strictly that of “edge vs. cloud”. In any voice assistant solution, there are microphones “on the edge” (i.e. on the device) which can potentially be misused, hacked and exploited. This means the devices end users speak into constitute a source of vulnerability in any case, and that they must be carefully secured.
But processing voice data locally means the attack surface is limited to those devices only. In contrast, processing them in the cloud causes the attack surface to extend to the communication channel and to the cloud servers themselves, thereby increasing the voice data’s exposure overall.
On the whole, sticking to edge computing when it comes to processing voice data is significantly safer than processing the same data in the cloud.
New Private Machine Learning Technologies Are Not Yet Ready to Solve Privacy and AI Issues at Scale
Lastly, major actors in the artificial intelligence space such as Google, Apple and Microsoft are exploring innovative machine learning techniques such as Federated Learning, Differential Privacy and Encrypted Machine Learning to reduce the privacy cost of their artificial intelligence products. While these are interesting developments, they cannot be expected to solve AI and privacy issues at scale, at least not for the time being.
Federated Learning, which was introduced by Google and put into production on the Google Keyboard, is about keeping end user data on-device and using it to train a global model in a decentralised way. It can only be easily applied to self-supervised problems, i.e. problems in which users implicitly provide the answer they would have expected the AI to guess, so the AI can learn from them directly through their use of the device. In a keyboard suggestion engine, the word that the user ends up using is the one the engine should have suggested. In most other AI use cases, and typically in voice-related applications, expected answers are not provided by the end user, which limits the use of Federated Learning in most cases.
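The mechanics can be illustrated with a minimal federated averaging sketch. This is a toy 1-D least-squares model of our own making, not Google’s implementation, which adds secure aggregation and runs over millions of devices:

```python
def local_update(weights, data, lr=0.1):
    # Each client nudges the shared weights using only its own data:
    # one gradient step for the model y = w * x with squared loss.
    grad = sum(2 * (weights * x - y) * x for x, y in data) / len(data)
    return weights - lr * grad

def federated_round(global_w, clients):
    # Clients train locally; only updated weights travel to the
    # server, which averages them. Raw data never leaves a client.
    local_ws = [local_update(global_w, data) for data in clients]
    return sum(local_ws) / len(local_ws)

# Two clients whose private (x, y) data both follow y = 2x.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
print(round(w, 2))  # ≈ 2.0
```

Note what makes this work: each client’s data implicitly contains the expected answer (the y values), which is exactly the self-supervision that most voice use cases lack.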
Differential Privacy, which Apple has communicated about the most, has also been used for emoji prediction in a smart keyboard environment. It hinges on a configurable level of privacy protection (or loss), with the noise this introduces being compensated for when data are collected over a large enough population (typically over a hundred million users).
The notion of a level of privacy protection and/or loss cannot be generalised to all types of data easily — while it’s straightforward for counts and analytics, there is no notion of level of privacy when talking about a voice sample. What level of voice distortion brings what level of privacy protection? No one can provide an answer given the current state of research.
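A small sketch makes the contrast concrete: for a count, the Laplace mechanism gives a calibrated privacy guarantee, while no analogous recipe exists for raw audio. The code is a textbook illustration, not any provider’s implementation:

```python
import random

def private_count(true_count, epsilon):
    # Laplace mechanism: a count has sensitivity 1 (one person changes
    # it by at most 1), so adding Laplace noise of scale 1/epsilon
    # yields epsilon-differential privacy. Smaller epsilon means more
    # privacy and more noise. The difference of two exponential
    # samples with rate epsilon is a Laplace sample of scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

random.seed(0)
# Over a huge population the noise is negligible relative to the count...
print(round(private_count(1_000_000, epsilon=0.1)))
# ...but there is no analogous "add bounded noise" recipe for a voice
# sample: distorting audio has no calibrated privacy scale.
```

The guarantee is quantifiable precisely because a count’s sensitivity is known; no one can state the equivalent sensitivity of a voice recording.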
The complexity of the Differential Privacy approach also makes it hard to grasp for end users, and even for regulators and legislators.
Last, Differential Privacy is only useful to companies that have a massive user base, which limits its applicability to smaller-scale solutions.
Interesting progress has been made on Encrypted Machine Learning over the last few years. Homomorphic Encryption in particular allows a server to run operations over encrypted data without decrypting them, which means there can be no privacy leakage. Similar guarantees can be obtained through Multi-Party Computation approaches. These solutions’ general character brings tremendous potential, but the current state of the art still makes them inapplicable on real-time, computation-intensive applications like speech recognition.
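To give a flavour of the Multi-Party Computation idea, here is a toy additive secret-sharing sketch, one classic building block, heavily simplified; real protocols add authentication, dropout handling, and much more:

```python
import random

P = 2**61 - 1  # shares live in the integers modulo a large prime

def share(secret, n_parties):
    # Split a private value into random shares that sum to it mod P.
    # Any single share (or any strict subset) is uniformly random noise.
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def aggregate(all_shares):
    # Each party sums the one share it received from every user, then
    # the partial sums are combined. The total is recovered without
    # any party ever seeing an individual secret.
    partials = [sum(col) % P for col in zip(*all_shares)]
    return sum(partials) % P

secrets = [17, 5, 42]  # e.g. per-user private statistics
all_shares = [share(s, n_parties=3) for s in secrets]
print(aggregate(all_shares))  # 64, the sum of the secrets
```

The privacy here is information-theoretic, but as the article notes, extending such schemes to real-time, computation-heavy workloads like speech recognition remains out of reach today.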
In contrast, edge computing is a solution that can be applied to many machine learning problems without compromising accuracy, and that can be described without ambiguity in a way anyone can understand. As Apple put it at the last CES: “What happens on your iPhone, stays on your iPhone”. Embedded voice recognition is truly as simple as that.
Voice and Privacy: the “No-Compromise” Approach
Access to the best-performing, most trustworthy, most secure voice recognition technology should come at no privacy cost to end users, especially when used in the context of their own home or in a professional context. Instead, privacy concerns can and should be handled through privacy-respectful innovations developed for customers’ and their end users’ benefit.
This is the approach we apply at Snips. We implemented “private-by-design” voice recognition technology before it became a legal obligation, and we go above and beyond applicable standards to make sure end user privacy is never the necessary compromise for performance, trust or security.
We achieve this by designing solutions that are transparent and understandable by anyone, so that end users are truly in control. We do everything on-device — from wake word to automatic speech recognition, and through to natural language understanding — and we invest all of our efforts in making sure both performance and the end user experience are never degraded as a result.