
NVIDIA AI Voice Technology: More Realistic and Expressive

August 31, 2021

Advancing Naturalness in AI Voice Synthesis

While current AI assistants like Amazon’s Alexa and Google Assistant represent a significant leap forward compared to traditional GPS voices, they still often fall short of replicating the nuances of human speech. These shortcomings include a lack of natural rhythm and intonation. NVIDIA recently presented new research and tools designed to address these limitations.

Introducing RAD-TTS

NVIDIA’s research focuses on enabling the creation of AI systems capable of capturing more human-like speech qualities. This is achieved by allowing users to train the AI using their own voice data. The company unveiled these advancements at the Interspeech 2021 conference.

The core of this improvement is a model called RAD-TTS, developed by NVIDIA’s text-to-speech research team. The model won a competition at the NAB broadcast convention for the most realistic avatar.

Key Features of RAD-TTS

RAD-TTS allows individuals to train a text-to-speech model utilizing their unique vocal characteristics. This includes elements like pacing, tonality, and timbre. The system effectively learns and replicates these qualities.

Furthermore, RAD-TTS incorporates a voice conversion feature, which allows speech originally spoken by one individual to be delivered in the voice of another. The interface provides precise frame-level control over synthesized voice parameters, including pitch, duration, and energy.
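The idea behind frame-level control can be illustrated with a minimal sketch. This is plain Python, not NVIDIA's actual RAD-TTS code: in parallel text-to-speech systems of this family, each input symbol is expanded into a number of audio frames according to a duration value, and per-frame pitch and energy values can then be edited before the waveform is generated.

```python
# Concept sketch of frame-level prosody control (NOT RAD-TTS itself):
# durations expand each symbol into frames, and pitch / energy can then
# be adjusted per frame before vocoding.

def expand_by_duration(features, durations):
    """Repeat each symbol's feature vector for its predicted frame count."""
    frames = []
    for feat, dur in zip(features, durations):
        frames.extend([list(feat)] * dur)
    return frames

def apply_frame_controls(frames, pitch_scale, energy_scale):
    """Scale each frame: element 0 stands in for pitch, element 1 for energy."""
    out = []
    for frame, p, e in zip(frames, pitch_scale, energy_scale):
        out.append([frame[0] * p, frame[1] * e])
    return out

# Two symbols: the first lasts 2 frames, the second 3.
symbol_features = [(100.0, 1.0), (200.0, 1.0)]   # (pitch_hz, energy)
durations = [2, 3]

frames = expand_by_duration(symbol_features, durations)
# Raise pitch on the last two frames to "emphasize" the second symbol.
controlled = apply_frame_controls(
    frames,
    pitch_scale=[1.0, 1.0, 1.0, 1.2, 1.2],
    energy_scale=[1.0] * 5,
)
print(len(frames))        # 5 frames total
print(controlled[-1][0])  # ~240.0 (200 Hz * 1.2)
```

Editing the `pitch_scale` and `energy_scale` lists per frame is the sketch-level analogue of "directing" the synthesized voice described later in the article.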

Applications in NVIDIA’s I Am AI Series

NVIDIA researchers leveraged this technology to create more natural-sounding voice narration for their “I Am AI” video series. Instead of relying on human voice actors, they utilized synthesized voices.

The goal was to align the narration’s tone and style with the videos’ content, a challenge that has historically been difficult to achieve with AI-generated narration. While the results aren’t entirely devoid of robotic qualities, they represent a noticeable improvement over previous AI voice outputs.

Workflow and Control

NVIDIA explained that their video producer was able to record themselves reading the video script. The AI model then converted this speech into the voice of the intended female narrator.

This baseline narration served as a starting point, allowing the producer to “direct” the AI, much like a voice actor. They could fine-tune the synthesized speech to emphasize specific words and adjust the pacing to better convey the video’s intended tone.

Availability and Resources

NVIDIA is making aspects of this research available to the public. The tools are optimized for performance on NVIDIA GPUs and are accessible through the open-source NVIDIA NeMo Python toolkit. This toolkit is designed for GPU-accelerated conversational AI and can be found on the company’s NGC hub.

Several models have been trained on extensive audio datasets (tens of thousands of hours) using NVIDIA DGX systems. Developers can fine-tune these models for their specific applications and accelerate training with mixed-precision computing on NVIDIA Tensor Core GPUs.
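The mixed-precision idea mentioned above can be sketched in a few lines. This is a toy numpy illustration, not NVIDIA's training code: the simulated forward/backward pass runs in float16, while a float32 "master" copy of the weight accumulates the updates, preserving small steps that float16 arithmetic would round away entirely.

```python
import numpy as np

# Toy illustration of mixed-precision training (concept sketch only):
# compute in float16, but apply updates to a float32 master weight.

master_w = np.float32(1.0)          # float32 master weight
lr = np.float32(1e-4)

for _ in range(100):
    w16 = np.float16(master_w)      # cast to half precision for compute
    grad16 = np.float16(2.0) * w16  # stand-in gradient from a fp16 pass
    # The update is applied in float32, so tiny steps are not lost.
    master_w = np.float32(master_w - lr * np.float32(grad16))

# For contrast: the same loop done entirely in float16 makes no progress,
# because a 2e-4 step is below half an ulp of 1.0 in float16.
w_fp16_only = np.float16(1.0)
for _ in range(100):
    g = np.float16(2.0) * w_fp16_only
    w_fp16_only = np.float16(w_fp16_only - np.float16(lr) * g)

print(float(master_w))      # ~0.980: the master weight actually decayed
print(float(w_fp16_only))   # 1.0: every fp16 update rounded away
```

Production pipelines add loss scaling and hardware-accelerated half-precision matrix math on Tensor Cores, but the master-copy mechanism sketched here is the core of why mixed precision preserves training accuracy.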

Note: This article was originally published on Engadget.

#NVIDIA #AI-voice #artificial-intelligence #voice-technology #realistic-voice #expressive-AI