Conventionally, users have issued instructions to computers via mice and keyboards in order to accomplish tasks. More recently, however, touchscreens have become the primary mode of interaction. Nowadays, more and more smart gadgets are emerging around us, such as smartwatches, smart speakers, and vehicle control centers. However, it is often tedious or cumbersome to physically interact with all of these small-screened gadgets and enter passwords on them. That’s where voiceprint recognition technology comes in.
SpeakIn, a Shenzhen-based, AI-oriented voiceprint recognition and ID security solution startup, spoke with us about their vision and the future of the technology. SpeakIn is headquartered in China and has a research and development department in the United States. The startup is currently focusing on the development and commercialization of its voiceprint recognition technology.
Inspired by Google Glass
Chen Haoliang, founder and CEO of SpeakIn, is an MIT dropout who studied applied mathematics and engineering. He previously worked in Google Glass’s human-machine interaction program, which proved an enlightening and formative experience. He spoke to us about SpeakIn and its voiceprint technologies.
From PCs with large screens to smart gadgets with smaller screens to devices with no screens at all, users can obtain less and less information from screens, which limits interaction between humans and machines, he said. To serve humans better and more efficiently, machines ought to understand humans more predictably and reliably.
“Under these preconditions, voiceprint will be the best choice for logging into these smart gadgets with small screens or without screens. That’s the reason we chose voiceprint to make a breakthrough,” Chen said.
SpeakIn’s core technologies span several areas: liveness detection, emotion and gender recognition, speaker diarization, and voiceprint recognition, comparison, and verification.
Liveness detection distinguishes a speaker’s live voice from a synthetic voice or from a recording or reproduction of the person’s voice. This helps eliminate the possibility of counterfeiting, in which a user’s biometric information is stolen or copied.
Emotion and gender recognition can be applied across a variety of smart gadget industries. A speaker’s emotion and gender can be recognized in real time by extracting sound patterns and running them through a recurrent neural network. This enables users to enjoy customized services while interacting with their gadgets.
Dr. Chen Dongpeng, senior scientist at SpeakIn, said that speaker diarization technology can be applied in public security scenarios, as it can recognize and separate an individual speaker’s audio from layered background audio.
Voiceprint recognition, comparison, and verification speed up user identification by running a data-processing algorithm against a large-scale index of stored voiceprints. This technology is applicable to public security, financial scenarios, social security services, and other platforms.
Technology-driven, industry-oriented applications
SpeakIn describes itself as technology-driven. Over 60% of its employees work in technology-related roles, particularly product research and development.
Core members of SpeakIn’s team have advanced degrees and experience from leading institutions such as Harvard University, MIT, Hong Kong University of Science and Technology, and Microsoft Research Asia. In addition, SpeakIn is building additional laboratories around the world in collaboration with other institutions.
SpeakIn’s team of scientists is currently collecting, analyzing, optimizing, and testing speech in different dialects from individuals of various ages. The program’s learning-oriented iVector technology uses a deep neural network and front-end audio signal processing methods to identify the speaker. With its growing database and advanced technology, SpeakIn aims to take advantage of real commercial scenarios.
“Since our establishment in 2015, we’ve worked to build a database for voiceprint recognition, which will remain an incomparable advantage in various application scenes,” said Chen.
Chen provided two examples of scenarios that are well suited to SpeakIn’s voiceprint technology:
1. Public security investigation: SpeakIn helps police monitor high-risk groups and fight fraud and terrorism. Its tailor-made voiceprint recognition system, which handles long speech and is text-independent, can also help investigate criminal cases and verify identities in order to deter and crack down on crime and build a safe environment. These services are based on a voiceprint database and an automatic voiceprint recognition system.
2. Smart devices: Current smart devices recognize only what is said, not who is speaking. Voice ID, a solution powered by SpeakIn’s voiceprint recognition technology, addresses this problem by enabling devices to tell different users apart, making personalized interaction between humans and smart devices possible.
Voiceprint is a unique and reliable form of biometric information. It carries various kinds of information, such as emotion, gender, and age. “Because of this, voiceprint recognition can be applied in more interactive scenes in the future,” Chen said.
This year is regarded by the industry as the first year of voiceprint recognition. At present, there is little difference between China and other countries in the underlying voiceprint technology; the difference lies at the application level.
Entering the voiceprint recognition industry requires substantial technical capability. At the same time, commercialization requires integrating the technology with specific industries. Collecting and processing front-end voice signals is difficult when many people are talking at the same time. In fact, “denoise” technology is in demand because voice signal collection and processing are highly susceptible to interference from surrounding audio.
Although many startups specialize in speech interaction and voice recognition, only a few specialize in voiceprint recognition technology. SpeakIn supports a richer and more complex range of languages on a larger scale. The complexity of its work lies in the difference between type 1:1 and type 1:N technologies.
Type 1:1 determines whether a given speech sample came from a target speaker by comparing the sample against that one speaker’s voiceprint. It is mainly used in securities trading, bank transactions, intelligent hardware, and related fields. For example, adding voiceprint verification to online payment processes contributes to higher security.
Type 1:N identifies the target speaker from among a group of speakers. It is widely applicable in public security, the judiciary, and military defense, specifically in areas like criminal investigation, criminal tracking, and national defense monitoring. In cases of telecommunications fraud, telephone extortion, and other crimes, public security personnel can use voiceprint recognition technology to narrow the pool of suspects by identifying the target speaker’s voiceprint.
In short, type 1:N uses the voiceprint as an ID, whereas type 1:1 uses it only as a password. The former demands a stronger technological foundation and faster response times. This is the area in which SpeakIn specializes.
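The difference between the two types can be illustrated with a minimal Python sketch. It is not SpeakIn’s implementation: it assumes each voiceprint has already been reduced to a fixed-length numeric vector, compares vectors with cosine similarity, and uses an arbitrary threshold of 0.8. The function names, the toy three-dimensional vectors, and the speaker names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify(probe, enrolled, threshold=0.8):
    """Type 1:1: does the probe match ONE claimed speaker? (voiceprint as password)"""
    return cosine_similarity(probe, enrolled) >= threshold


def identify(probe, gallery, threshold=0.8):
    """Type 1:N: which enrolled speaker, if any, does the probe match? (voiceprint as ID)"""
    best_name, best_score = None, -1.0
    for name, enrolled in gallery.items():
        score = cosine_similarity(probe, enrolled)
        if score > best_score:
            best_name, best_score = name, score
    # Reject the best candidate if even it falls below the threshold.
    return best_name if best_score >= threshold else None


# Hypothetical enrolled voiceprints (toy 3-dimensional vectors, assumed for illustration).
gallery = {
    "alice": [0.9, 0.1, 0.2],
    "bob": [0.1, 0.8, 0.3],
    "carol": [0.2, 0.2, 0.9],
}
probe = [0.85, 0.15, 0.25]    # a new sample that resembles "alice"
stranger = [1.0, 1.0, 0.0]    # a sample that resembles no one enrolled

print(verify(probe, gallery["alice"]))  # 1:1 check against one claimed identity
print(identify(probe, gallery))         # 1:N search across everyone enrolled
print(identify(stranger, gallery))      # no enrolled speaker is close enough
```

In this sketch the 1:1 check costs a single comparison, while the 1:N search grows with the number of enrolled speakers and must also decide when nobody matches, which is one way to see why the article says 1:N places higher demands on the underlying technology and on response speed.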
In order to maintain its advantage in the industry, SpeakIn’s strategy prioritizes the following three aspects, according to Chen:
First, SpeakIn’s products should follow real-world industry demands, converting academic or technological achievements into products and accumulating domain knowledge of various industries.
Second, data drives R&D of products. This includes managing data standardization and sourcing, as well as further enhancing the processing of big data.
Third, standing at the intersection of technology and humanity, SpeakIn believes human elements should integrate seamlessly with products.
SpeakIn’s advantages rest on the high technological threshold for entering the industry and on its long engagement in R&D and commercialization, he said. As a result, the company has made some key achievements in both academic and business fields. It has also attracted a group of partners to further explore applications of voiceprint recognition in different areas and scenarios.
Comparison to other biometric technologies
The chart below compares voiceprint recognition with other biometric technologies:
Which biometric technology is the most popular nowadays, especially since the iPhone X introduced Face ID?
Compared to other biometric technologies that use cameras or specialized instruments, voiceprint recognition needs only a microphone, so it can be applied to more scenarios at a lower cost. Plus, the content spoken for voiceprint verification can be changed on demand, leading to enhanced security and more potential for interaction between humans and machines.
Meanwhile, the problem with traditional voice technology is that it cannot identify the speaker, and therefore cannot provide corresponding personalized services or interaction. Compared with other biometric technologies, voiceprint recognition can be applied smoothly in real scenarios. For example, it doesn’t require gestures like facial recognition, nor does it require user-inputted content like traditional voice recognition. Additionally, in regard to safeguarding against malicious attacks, voiceprint has a distinct advantage as it is more difficult to hack. Collecting voiceprint information is difficult compared to facial information, which, by its physical nature, is exposed and can be collected more easily. Moreover, all voice data from our daily speech can be used to enrich the database and ultimately improve identification precision.
Additionally, voiceprint recognition requires only microphones for capture, which are cheaper than the hardware other methods need, such as cameras for facial recognition.
“Though the development of voiceprint recognition is not parallel to fingerprint recognition or facial recognition, we can not aim to make voiceprint recognition the only method for biological recognition,” said Dr. Chen Dongpeng. “We are developing voiceprint recognition to work with fingerprint recognition and facial recognition in many respects, so that it can support more and more businesses and services, to help users deal with more real problems.”
Challenges and future expectations
Currently, SpeakIn faces two primary challenges. First, voiceprint recognition technology is still evolving, and its precision can be easily affected by surrounding noise and multi-speaker dialogue. As such, SpeakIn will continue to invest in the R&D of voiceprint recognition technology and optimize its algorithm in order to reach high precision in complex environments. Second, public acceptance takes time, and most companies will likely remain on the sidelines while voiceprint recognition develops.
With the development of the Internet and IoT, voiceprint recognition is the most suitable interactive portal for gadgets with small screens or gadgets without screens. Moreover, voiceprint recognition also complements large-scale IoT scenes for information verification and other related services.
In May 2017, SpeakIn completed a financing round led by IDG to accelerate the commercialization of its research achievements. SpeakIn integrates its voice recognition-oriented identification solutions with specific products and services in different industries, which will likely lead to industry adoption in some areas.
Today (November 15), SpeakIn disclosed its latest financing round, a Series A2 of tens of millions of RMB (several million USD) led by Originals Capital, with all of its existing investors also taking part. The capital raised in this round will be used to invest further in the research and development of related products, to expand existing business, and to accelerate the platformization and opening up of its products.
In general, applications of voiceprint recognition can be expected in several areas in the future: integrated multi-biometric technologies, contactless data collection, and remote online identification.
On the one hand, SpeakIn will solidify its solutions in public security and expand to more areas and functionality with a higher degree of efficiency. On the other hand, it will need additional partners from a variety of industries, such as mobile gadgets, smart homes, and smart vehicles if it hopes to achieve success.
“We will find out more interesting interactions with the above information on the basis of voiceprint recognition, so that we can feel the charm of black technology more directly,” Chen said.
(Top photo from SpeakIn)