
Facebook AI Training: Thousands of Hours of First-Person Video Collected

October 14, 2021

The Challenge of First-Person AI Perception

As artificial intelligence advances, particularly in areas like augmented reality (AR) and wearable technology, a significant hurdle emerges: enabling AIs to interpret the world from a human perspective. Many tech companies envision future AIs interacting with us through devices like AR glasses, necessitating the ability to understand what we see and experience.

The Limitations of Current AI Models

Existing object and scene recognition models are predominantly trained using third-person perspectives. Consequently, they can identify a person cooking, but only when viewed externally. Recognizing objects from a first-person viewpoint – as if seen through the eyes of the individual – presents a considerable challenge for these systems.

Facebook's Ego4D Dataset

To address this limitation, Facebook has compiled a new dataset of roughly 3,000 hours of first-person video footage, which it is making available to researchers. The initiative aims to provide the data needed to train AIs to understand everyday tasks from a human's point of view.

Data Collection and Annotation

The video data was gathered through collaborations with 13 partner universities across nine countries. Over 700 volunteers contributed footage of common activities, such as cooking, shopping, and simply relaxing. Participants retained control over their level of involvement and how their identities were handled.

A research team meticulously reviewed and edited the footage, hand-annotating it and supplementing it with recordings from controlled environments. This extensive process resulted in the final 3,000-hour dataset, detailed in a dedicated research paper.

Diverse Recording Methods

Footage was captured using various devices, including glasses-mounted cameras and GoPros. Some researchers also incorporated environmental scans and gaze tracking to give a fuller picture of the wearer's experience. This material forms part of the Ego4D dataset being made available to the research community.

The Importance of First-Person Perception

Kristen Grauman, the lead researcher, emphasized the need for a paradigm shift towards first-person perception in AI. She stated that teaching AI to understand daily life through human eyes, considering real-time motion and multisensory observations, is crucial for future AI systems.

Distinct from Ray-Ban Stories

Although both initiatives come from Facebook, the Ego4D research and the Ray-Ban Stories smart glasses are separate endeavors. Both, however, reflect the company's growing focus on first-person understanding across a range of applications.

Applications in AR and Robotics

Grauman highlighted the potential of first-person perception in augmented reality and robotics. AI assistants equipped with this capability could reduce cognitive overload by understanding the user’s world through their eyes, offering proactive assistance and relevant information.

Global Diversity in the Dataset

The dataset’s global scope was a deliberate choice. Recognizing that kitchens and daily routines vary significantly across cultures, Facebook aimed to create a diverse dataset representing a wide range of backgrounds, ethnicities, and ages. This diversity enhances the applicability of models trained on the data.

Benchmarks for Evaluating AI Performance

Alongside the dataset, Facebook is releasing a set of benchmarks to assess how effectively AI models utilize the new data. These benchmarks provide a standardized way to measure progress in first-person perception.
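
The article does not spell out how these benchmarks score a model, so the following is only a minimal sketch of one common scoring approach for localization-style tasks: temporal intersection-over-union (IoU) between a predicted time window and an annotated ground-truth window, summarized as recall at a threshold. The function names and the 0.5 threshold are illustrative assumptions, not Ego4D's actual protocol.

from typing import List, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec) of a moment within a video

def temporal_iou(pred: Window, gt: Window) -> float:
    """Overlap between two time windows divided by their union."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds: List[Window], gts: List[Window], threshold: float = 0.5) -> float:
    """Fraction of ground-truth windows matched by a prediction above the IoU threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= threshold)
    return hits / len(gts) if gts else 0.0

# Two hypothetical queries: the first is localized well, the second is missed entirely.
print(recall_at_iou([(3.0, 8.0), (20.0, 25.0)], [(4.0, 9.0), (40.0, 45.0)]))  # -> 0.5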

Beyond Simple Object Recognition

Identifying objects from a first-person perspective is only the first step. The harder challenge is understanding intentions, context, and the actions that link them. AR devices should offer more than labels for what is in view; they should supply knowledge the user doesn't already possess.

Five Key Research Tasks

Researchers have defined five tasks to evaluate AI's understanding of first-person imagery; a sketch of how such annotations might be organized follows the list:

  • Episodic memory: Remembering objects and events in time and space.
  • Forecasting: Predicting future events based on observed sequences.
  • Hand-object interaction: Analyzing how people interact with objects.
  • Audio-visual diarization: Associating sounds with events and objects.
  • Social interaction: Understanding conversations and social cues.
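
As a rough illustration of how footage annotated for these five tasks might be organized, here is a minimal sketch in Python. The class and field names are hypothetical and do not reflect the actual Ego4D schema; they only show how episodic-memory queries, forecasting labels, hand-object interactions, diarization segments, and social cues could hang off a single clip record.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ObjectSighting:
    timestamp_sec: float
    object_label: str      # e.g. "keys"
    action: Optional[str]  # hand-object interaction, e.g. "put down"

@dataclass
class SpeechSegment:
    start_sec: float
    end_sec: float
    speaker_id: str        # audio-visual diarization: who was speaking when

@dataclass
class EgoClipAnnotation:
    clip_id: str
    # Episodic memory / hand-object interaction: what was seen or handled, and when.
    sightings: List[ObjectSighting] = field(default_factory=list)
    # Forecasting: the next annotated action, if any.
    next_action: Optional[str] = None
    # Audio-visual diarization: speech segments tied to speakers.
    speech: List[SpeechSegment] = field(default_factory=list)
    # Social interaction: whether speech or gaze was directed at the camera wearer.
    addressed_to_wearer: bool = False

def last_seen(clip: EgoClipAnnotation, obj: str) -> Optional[float]:
    """Episodic-memory style query: when was this object last seen in the clip?"""
    times = [s.timestamp_sec for s in clip.sightings if s.object_label == obj]
    return max(times) if times else None

clip = EgoClipAnnotation(
    clip_id="kitchen_0001",
    sightings=[ObjectSighting(12.5, "keys", "put down")],
    next_action="open fridge",
)
print(last_seen(clip, "keys"))  # -> 12.5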

Future Development and Accessibility

While the current 3,000-hour dataset represents a significant advancement, researchers plan to expand it further and welcome new partners. The dataset will be released in the coming months, with details available on the Facebook AI Research blog.

The meticulous annotation, totaling over 250,000 researcher hours, underscores the commitment to providing a high-quality resource for the AI research community.

#Facebook AI #artificial intelligence #first-person video #AI training #machine learning #data collection