Google DeepMind Unveils New Video Model to Rival Sora

DeepMind's Veo 2: A New Contender in AI Video Generation
Google DeepMind, Google's flagship AI research division, is working to surpass OpenAI in video generation, and its latest advancement may give it a lead, at least for a while.
On Monday, DeepMind announced Veo 2, a next-generation video-generating AI and the successor to the original Veo, which already powers a growing number of applications across Google’s product range.
Veo 2's Technical Specifications
Veo 2 is capable of generating video clips exceeding two minutes in length, with resolutions reaching up to 4K (4096 × 2160 pixels).
This represents a significant improvement over OpenAI’s Sora, offering roughly four times the pixel count and more than six times the maximum duration.
Currently, the advantages are largely theoretical. Within Google’s experimental platform, VideoFX – where Veo 2 is presently exclusive – video outputs are limited to 720p resolution and a maximum duration of eight seconds.
For comparison, Sora can generate clips up to 1080p in resolution and lasting up to 20 seconds.
Access to VideoFX is currently restricted through a waitlist. However, Google has indicated that it will grant access to more users over the course of this week.
Future Availability and Integration
Eli Collins, VP of Product at DeepMind, informed TechCrunch that Veo 2 will be made accessible through Google’s Vertex AI developer platform once the model is prepared for large-scale deployment.
Collins stated that ongoing development will be guided by user feedback. The company intends to integrate Veo 2’s enhanced features into various applications across the broader Google ecosystem.
Further updates regarding Veo 2 are anticipated to be shared next year, as the model continues to evolve and improve.
Enhanced Control in Video Generation
Like the original Veo, Veo 2 can generate videos from a textual description – for example, “A car racing down a freeway” – or from text combined with a reference image.
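To make the two input modes concrete, here is a minimal Python sketch of what such a request could carry; `GenerationRequest` and its fields are hypothetical stand-ins for illustration, not an actual VideoFX or Google API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerationRequest:
    """Hypothetical request shape, not a real VideoFX or Vertex AI type."""
    prompt: str                              # text description of the desired clip
    reference_image: Optional[bytes] = None  # optional image to condition the output
    resolution: str = "720p"                 # current VideoFX ceiling
    duration_seconds: int = 8                # current VideoFX ceiling

# Mode 1: text only, as in the example prompt above.
text_only = GenerationRequest(prompt="A car racing down a freeway")

# Mode 2: text plus a reference image guiding the look of the result.
text_and_image = GenerationRequest(
    prompt="A car racing down a freeway",
    reference_image=b"<png bytes of a reference frame>",  # placeholder bytes
)
```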
What advancements does Veo 2 offer? According to DeepMind, this model, capable of generating clips in diverse styles, demonstrates a heightened “understanding” of both physics and camera operation, resulting in more “defined” visual output.
By “defined,” DeepMind refers to an increased sharpness of textures and imagery within the clips, particularly noticeable during scenes involving significant movement. The improved camera controls allow Veo 2 to position the virtual “camera” with greater precision within generated videos.
Furthermore, DeepMind asserts that Veo 2 exhibits a more realistic simulation of motion, fluid dynamics – such as the pouring of coffee – and the behavior of light, including shadows and reflections. This encompasses various lenses and cinematic techniques, alongside “subtle” human expressions.
DeepMind recently shared a selection of examples generated by Veo 2 with TechCrunch. The resulting videos, for AI-generated content, appeared remarkably good – exceptionally so, in fact. Veo 2 displays a strong aptitude for rendering refraction and complex fluids, like maple syrup, and for replicating Pixar-style animation.
Despite DeepMind’s claims of reduced instances of “hallucinations” – such as extra digits or “unexpected objects” – Veo 2 doesn’t entirely overcome the uncanny valley effect.
Consider the vacant gaze of this animated, dog-like creature:
Also, observe the strangely slick surface of the road in this footage, along with the blurred pedestrians and buildings exhibiting architecturally impossible designs:
Collins acknowledged that further development is necessary.
“Areas requiring growth include coherence and consistency,” he stated. “Veo can maintain adherence to a prompt for approximately two minutes, but sustaining complex prompts over extended durations remains a challenge. Maintaining character consistency also presents difficulties. Improvements are also needed in generating intricate details, rapid and complex movements, and continually enhancing realism.”
DeepMind is actively collaborating with artists and producers to refine its video-generation models and associated tools, Collins added.
“From the outset of Veo’s development, we’ve engaged with creatives such as Donald Glover, the Weeknd, d4vd, and others to gain a thorough understanding of their creative workflows and how technology can facilitate the realization of their visions,” Collins explained. “The feedback received from creators on Veo 1 directly influenced the development of Veo 2, and we anticipate continued collaboration with trusted testers and creators to gather input on this new model.”
Safety and Training Procedures
The development of Veo 2 involved extensive training utilizing a vast collection of video content. This approach is typical for AI models, where they learn by analyzing numerous examples of data to identify underlying patterns and subsequently generate new data.
While DeepMind has not disclosed the precise sources of the videos used for Veo 2’s training, YouTube is a plausible candidate. Given Google’s ownership of YouTube and DeepMind’s previous statements to TechCrunch, it’s possible that some YouTube content was incorporated into the training process.
According to Collins, Veo 2 was trained on “high-quality video-description pairings.” These pairings consist of a video alongside a corresponding description detailing the events occurring within it.
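As a rough sketch of what one such pairing looks like in code, the snippet below defines a PyTorch-style dataset over a JSON-lines manifest of clips and captions; the manifest layout and field names are invented for illustration and say nothing about DeepMind’s actual pipeline.

```python
import json

from torch.utils.data import Dataset
from torchvision.io import read_video

class VideoCaptionDataset(Dataset):
    """Yields (frames, caption) pairs: a clip plus the text describing it."""

    def __init__(self, manifest_path: str):
        # Assumed layout: one JSON record per line, e.g.
        # {"video_path": "clip001.mp4", "caption": "A car racing down a freeway"}
        with open(manifest_path) as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        rec = self.records[idx]
        # Decode the clip into a (time, channels, height, width) tensor.
        frames, _audio, _info = read_video(rec["video_path"], output_format="TCHW")
        return frames.float() / 255.0, rec["caption"]
```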
Although Google, through DeepMind, offers tools enabling webmasters to prevent their website data from being extracted for training purposes, there is currently no mechanism for creators to have their existing works removed from the model’s training datasets. DeepMind and Google assert that utilizing publicly available data for model training constitutes fair use, implying no obligation to seek permission from data owners.
This stance is not universally accepted, particularly considering research suggesting potential disruption to tens of thousands of jobs in the film and television industries due to AI. Several AI companies, including Midjourney, the creator of a popular AI art application, are facing legal challenges alleging copyright infringement through training on content without consent.
“We are dedicated to collaborative efforts with creators and partners to achieve shared objectives,” Collins stated. “Ongoing engagement with the creative community and industry stakeholders is crucial, allowing us to gather insights and address feedback, including input from VideoFX users.”
The inherent nature of contemporary generative models presents certain risks, such as regurgitation – the reproduction of exact copies of training data. DeepMind addresses this through prompt-level filters designed to block violent, graphic, or explicit content.
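As a toy illustration of what “prompt-level” means here, the check below runs on the text of a request before any video is generated; the patterns and logic are placeholders, not DeepMind’s actual (and certainly far more sophisticated) safety system.

```python
import re

# Placeholder patterns; DeepMind's real filter criteria are not public.
BLOCKED_PATTERNS = [
    r"\bgor[ey]\b",
    r"\bgraphic violence\b",
    r"\bexplicit\b",
]

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt may proceed to video generation."""
    lowered = prompt.lower()
    return not any(re.search(pattern, lowered) for pattern in BLOCKED_PATTERNS)

print(screen_prompt("A car racing down a freeway"))  # True: allowed
print(screen_prompt("Explicit footage of a brawl"))  # False: blocked
```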
Google’s indemnity policy, which offers legal defense to customers against copyright infringement claims related to product usage, will not extend to Veo 2 until its general release, as clarified by Collins.
To lessen the potential for deepfakes, DeepMind is implementing its proprietary watermarking technology, SynthID, to embed imperceptible markers within the frames generated by Veo 2. However, it’s important to acknowledge that, like all watermarking technologies, SynthID is not entirely secure.
Imagen 3 Enhancements
Alongside the unveiling of Veo 2, Google DeepMind has announced improvements to Imagen 3, its commercially available image generation model.
Beginning Monday, a revised iteration of Imagen 3 is being deployed to users of ImageFX, Google's platform for generating images. According to DeepMind, the updated model is capable of producing images and photographs that are more vividly colored and exhibit improved composition, spanning styles such as photorealism, impressionism, and anime.
The latest upgrade ensures greater adherence to user prompts and delivers more detailed and nuanced textures, as detailed in a blog post provided to TechCrunch by DeepMind.
Further enhancements have been made to the ImageFX user interface alongside the model update.
As users input prompts, significant keywords will now transform into “chiplets,” offering a dropdown menu of related and suggested terms. This allows for iterative refinement of the prompt.
Alternatively, users can choose from a selection of automatically generated descriptors displayed directly below the prompt box.
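A minimal sketch of the chiplet behavior might look like the following; the keyword table and suggested terms are invented examples, not ImageFX’s actual vocabulary or ranking logic.

```python
# Map salient prompt keywords to related terms, mimicking the dropdown chiplets.
# Every entry here is an invented example for illustration.
SUGGESTIONS = {
    "impressionism": ["pointillism", "watercolor", "plein air"],
    "anime": ["cel shading", "manga style", "chibi"],
    "photorealism": ["35mm film", "macro shot", "golden hour"],
}

def chiplets(prompt: str) -> dict[str, list[str]]:
    """Return a dropdown of related terms for each recognized keyword."""
    words = prompt.lower().replace(",", " ").split()
    return {word: SUGGESTIONS[word] for word in words if word in SUGGESTIONS}

print(chiplets("A misty harbor, impressionism style"))
# {'impressionism': ['pointillism', 'watercolor', 'plein air']}
```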