
Meta AI Supercluster: Advancing AI Research

January 24, 2022

Meta's Entry into Supercomputing: The AI Research SuperCluster

A global race is underway to build the largest and most powerful computers, and Meta, formerly known as Facebook, is joining it with its “AI Research SuperCluster,” or RSC. Once fully built out, the system is projected to rank among the world’s ten fastest supercomputers, and it will handle the heavy computation behind language and computer vision modeling.

The Need for High-Performance Computing in AI

Large AI models, with OpenAI’s GPT-3 a prominent example, cannot be built on standard personal computers. They are the product of weeks or months of continuous calculation on high-performance computing systems that dwarf even the most advanced gaming setups. The faster a model can be trained, the sooner it can be tested and an improved version produced, a critical advantage when training runs are measured in months.

RSC: Current Status and Security Measures

RSC is already operational, and Meta’s research teams are using it. The system processes user-generated data, which Meta emphasizes is encrypted prior to training, and the entire facility is isolated from the public internet.

Challenges in Building a Supercomputer

The team responsible for assembling RSC deserves recognition for completing the project largely through remote collaboration. Supercomputers are inherently physical structures, where fundamental aspects like heat dissipation, cabling, and interconnectivity significantly impact both performance and design.

Exabytes of storage may sound abstract in digital terms, but the hardware has to exist physically on-site and be accessible within microseconds. (Pure Storage has also expressed pride in the infrastructure it provided for this project.)

RSC’s Technical Specifications and Ranking

Currently, RSC comprises 760 Nvidia DGX A100 systems, totaling 6,080 GPUs. Meta anticipates this configuration will position it competitively with Perlmutter at Lawrence Berkeley National Lab. According to the longstanding Top 500 ranking, Perlmutter is presently the fifth most powerful supercomputer in operation. (Fugaku in Japan currently holds the top position.)

Future Expansion and Potential Ranking

The company intends to further expand the system’s capacity. The ultimate goal is to achieve approximately three times its current power, potentially placing it in contention for third place in global rankings.

Precision vs. Performance in Supercomputing

One caveat is that systems like Summit at Oak Ridge National Laboratory are used for research that requires high precision. When simulating complex phenomena, such as molecules in the atmosphere, calculations must be carried out to a large number of decimal places, which makes each operation more computationally expensive.
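To make the precision point concrete, here is a minimal Python sketch (an illustration only, not anything taken from Summit or RSC). It adds the same small increment a million times, once in standard 64-bit floats and once rounding every intermediate result to 32 bits. The step size, the step count, and the helper names `to_float32` and `accumulate` are all assumptions chosen for the demonstration.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(step: float, n: int, single: bool) -> float:
    """Add `step` to a running total `n` times.

    When `single` is True, the step and every intermediate total are
    rounded to 32 bits, emulating single-precision accumulation.
    """
    if single:
        step = to_float32(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_float32(total)
    return total


N = 1_000_000   # hypothetical number of simulation steps
STEP = 0.1      # hypothetical per-step increment
expected = N * STEP

drift_double = abs(accumulate(STEP, N, single=False) - expected)
drift_single = abs(accumulate(STEP, N, single=True) - expected)

print(f"64-bit drift: {drift_double:.3e}")
print(f"32-bit drift: {drift_single:.3e}")
```

In a long-running simulation, rounding errors of this kind compound step after step, which is why such workloads pay for wider number formats.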

AI Applications and Computational Efficiency

Meta clarifies that AI applications do not demand the same level of precision. Slight variations in results, such as a confidence level of 90% versus 91% for object recognition, are often inconsequential. The primary challenge lies in achieving high certainty across a vast number of objects or phrases, rather than focusing on minute accuracy improvements.
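The tolerance described here can be illustrated with a small Python sketch. It emulates a reduced-precision format in the style of Nvidia's TensorFloat-32 (the math mode this article attributes to RSC), which keeps 10 explicit mantissa bits instead of the 23 in a standard 32-bit float. The function name `truncate_to_tf32` is hypothetical, and plain truncation is a simplifying assumption; real hardware may round differently.

```python
import struct


def truncate_to_tf32(x: float) -> float:
    """Emulate TF32-style storage by truncating a float32 mantissa.

    float32 has 23 explicit mantissa bits and TF32 keeps 10, so the
    13 lowest bits are cleared. Truncation (rather than rounding) is
    an assumption made to keep the sketch short.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # zero the 13 low-order mantissa bits
    return struct.unpack("<I".replace("I", "f"), struct.pack("<I", bits))[0]


for confidence in (0.90, 0.91):
    approx = truncate_to_tf32(confidence)
    error = abs(confidence - approx)
    print(f"{confidence:.2f} -> {approx:.6f} (error {error:.1e})")
```

The error introduced by the shorter mantissa is far smaller than the one-percentage-point gap between the two confidence values, so a 90% and a 91% prediction remain clearly distinguishable.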

RSC’s Performance Metrics and Implications

This approach allows RSC, operating in TensorFloat-32 math mode, to achieve a higher rate of FLOP/s (floating point operations per second) per core than systems that prioritize precision. It currently reaches up to 1,895,000 teraFLOP/s, or roughly 1.9 exaFLOP/s, which is more than four times Fugaku’s performance.
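The unit conversion and the comparison with Fugaku can be checked with a few lines of Python. The RSC figure is the one quoted above. The Fugaku figure is an assumption not stated in this article: its Linpack result of 442,010 teraFLOP/s from the November 2021 TOP500 list. The function names are hypothetical.

```python
TERA = 10**12
EXA = 10**18

RSC_TERAFLOPS = 1_895_000    # TF32 figure quoted in the article
FUGAKU_TERAFLOPS = 442_010   # assumption: Fugaku's Linpack Rmax, Nov. 2021 TOP500


def tera_to_exa(teraflops: float) -> float:
    """Convert teraFLOP/s to exaFLOP/s (1 exa = 1,000,000 tera)."""
    return teraflops * TERA / EXA


def speed_ratio(a_teraflops: float, b_teraflops: float) -> float:
    """How many times faster system A is than system B, on paper."""
    return a_teraflops / b_teraflops


rsc_exaflops = tera_to_exa(RSC_TERAFLOPS)
ratio = speed_ratio(RSC_TERAFLOPS, FUGAKU_TERAFLOPS)

print(f"RSC: {rsc_exaflops:.3f} exaFLOP/s")
print(f"RSC / Fugaku: {ratio:.2f}x")
```

Note that this ratio sets a reduced-precision figure against a 64-bit benchmark result, so it is not a like-for-like comparison.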

The significance of this metric, and its impact on rankings, remains a topic of discussion. The Top 500 organization has been contacted for their perspective. Regardless, RSC will undoubtedly be among the world’s fastest computers, potentially the fastest operated by a private entity for internal research.
