Open Source LLMs and Europe's Digital Sovereignty

February 16, 2025

Europe's Push for Open Source Large Language Models

Large language models (LLMs) have become a central focus for Europe’s digital sovereignty efforts, following the announcement of a new initiative. This program aims to develop a suite of genuinely open source LLMs, supporting all languages used within the European Union.

The scope of this project extends beyond the current 24 official EU languages. It also encompasses languages spoken in countries actively seeking EU membership, such as Albania, signaling that the project is designed to scale as the bloc expands.

Introducing OpenEuroLLM

OpenEuroLLM represents a collaborative effort involving approximately 20 organizations. It is jointly spearheaded by Jan Hajič, a computational linguist at Charles University in Prague, and Peter Sarlin, CEO and co-founder of Silo AI – a Finnish AI lab acquired by AMD last year for $665 million.

This initiative aligns with a larger trend of Europe prioritizing digital sovereignty. The goal is to bring essential infrastructure and tools closer to the region, enhancing control and security.

Major cloud providers are investing in local infrastructure to ensure EU data remains within the region. Furthermore, OpenAI has recently introduced a service allowing data processing and storage within Europe.

The EU has also recently committed to an $11 billion investment in a sovereign satellite constellation intended to compete with systems like Elon Musk’s Starlink, part of a clear pattern of strategic investment.

Funding and Resources

The dedicated budget for building the LLMs themselves is €37.4 million. Approximately €20 million of this funding originates from the EU’s Digital Europe Programme. While substantial, this represents a relatively small investment when compared to the expenditures of major corporate AI developers.

The overall budget expands when considering funding for related work. A significant portion of the expense is allocated to computational resources. The OpenEuroLLM project leverages EuroHPC supercomputer centers located in Spain, Italy, Finland, and the Netherlands. The broader EuroHPC project boasts a budget of around €7 billion.

Challenges and Considerations

The large and diverse group of participating organizations – encompassing academic institutions, research facilities, and corporations – has prompted questions regarding the project’s feasibility. Anastasia Stasenko, co-founder of LLM company Pleias, questioned whether such a large consortium could achieve the focused efficiency of a dedicated private AI firm.

“Europe’s successes in AI are often driven by smaller, focused teams like Mistral AI and LightOn,” Stasenko noted. “These companies demonstrate clear ownership and accountability for their decisions, encompassing financial aspects, market positioning, and overall reputation.”


  • The project aims to cover all 24 official EU languages.
  • It includes languages from prospective EU member states.
  • OpenEuroLLM is a collaborative effort led by experts from academia and industry.

A Fresh Start or Continued Progress?

Depending on how you look at it, the OpenEuroLLM initiative is either starting from scratch or building on existing momentum.

Since 2022, Professor Hajič has also led the High Performance Language Technologies (HPLT) project. This project aimed to create freely available, reusable datasets, models, and workflows utilizing high-performance computing (HPC).

Scheduled for completion in late 2025, HPLT can be considered a precursor to OpenEuroLLM, as most of its partners – excluding those from the U.K. – are also involved in this new endeavor.

“OpenEuroLLM represents a wider scope of participation, but with a specific focus on generative Large Language Models (LLMs),” Hajič explained to TechCrunch. “Therefore, we aren’t commencing with absolutely nothing in terms of data, expertise, tools, or computational experience.”

He further stated that the team comprises individuals with relevant knowledge, enabling a rapid acceleration of progress.

Project Timeline and Current Status

Hajič anticipates the initial versions of the model will be available by mid-2026, with the final iterations expected by the project’s conclusion in 2028.

However, these timelines may appear ambitious given the current state of the project, whose public output so far amounts to a basic GitHub profile.

“In that sense, we are indeed starting anew – the project officially began on Saturday [February 1st],” Hajič noted. “However, preparation for the project has been underway for a year, following the opening of the tender process in February 2024.”

Collaborative Partners

The OpenEuroLLM consortium includes organizations from academia and research across Czechia, the Netherlands, Germany, Sweden, Finland, and Norway, alongside the EuroHPC centers.

Corporate involvement includes Finland’s Silo AI (owned by AMD), as well as Aleph Alpha (Germany), Ellamind (Germany), Prompsit Language Engineering (Spain), and LightOn (France).

Notable Absence

A significant omission from the partner list is Mistral, the French AI company that has established itself as an open-source alternative to companies like OpenAI.

Despite attempts to engage with Mistral, Hajič confirmed that discussions regarding their participation did not materialize.

“I initiated contact, but it did not lead to a substantial conversation about their involvement,” Hajič said.

Participation Limitations

The project may still welcome additional participants through the EU funding program, but participation will be restricted to organizations within the European Union.

Consequently, entities from the U.K. and Switzerland are ineligible to participate.

This contrasts with the Horizon R&D program, which the U.K. rejoined in 2023 after Brexit-related delays and which previously funded the HPLT project.

Foundation Models for Transparent AI in Europe

The project’s core stated objective is to develop a suite of foundation models for transparent AI in Europe.

A key aspect of these models will be the preservation of the “linguistic and cultural diversity” inherent in all languages used across the European Union – encompassing both present and future languages.

Project Deliverables

The precise nature of the project’s outputs is still under refinement. However, it is anticipated to include a central, multilingual Large Language Model (LLM) engineered for broad applications demanding a high degree of precision.

Alongside this, the project aims to produce smaller, “quantized” model iterations, potentially suited for deployment in edge computing environments where operational efficiency and rapid processing are critical.
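
To make the edge-deployment point concrete, here is a minimal, illustrative sketch of loading an open multilingual model with 4-bit quantized weights using Hugging Face Transformers and bitsandbytes. The model ID is a stand-in (an existing open European model from Silo AI); OpenEuroLLM has not published any checkpoints, and the project’s actual tooling may well differ.

# Illustrative sketch: load an open multilingual model with 4-bit quantized
# weights to shrink its memory footprint for edge-style deployment.
# Requires: transformers, accelerate, bitsandbytes, and a CUDA-capable GPU.
# The model ID below is a placeholder; OpenEuroLLM has released no checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "LumiOpen/Poro-34B"  # stand-in: an existing open European model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the arithmetic in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place layers on the available hardware
)

prompt = "Käännä englanniksi: Hyvää huomenta!"  # a Finnish example prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Quantizing weights to 4 bits trades a small amount of accuracy for a large reduction in memory, which is usually the binding constraint in edge environments.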

Quality and Public Funding

“A detailed plan is still under development,” explained Hajič. “Our intention is to create models that are both compact and of exceptional quality.”

He further emphasized the importance of a thorough approach, stating, “We are committed to avoiding the release of incomplete work, given the significant investment from the European Commission – representing substantial public funds.”

Challenges in Language Proficiency

Achieving consistent performance across all languages presents a considerable challenge, particularly for those with limited digital resources.

“While equal proficiency is the aim, the success we can achieve with languages lacking extensive digital data remains an open question,” Hajič acknowledged.

“Therefore, we are focused on establishing robust benchmarks for these languages, ensuring they accurately reflect both the language itself and the culture it represents, rather than relying on potentially biased evaluation metrics.”
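
As a rough illustration of the evaluation problem, the sketch below computes held-out perplexity per language, one crude signal of uneven coverage between high- and low-resource languages. It is not OpenEuroLLM’s benchmarking methodology, which has not been published; the model and the tiny text samples are placeholders, and a real benchmark would use curated, culturally representative corpora.

# Illustrative only: per-language perplexity on small held-out samples as a
# crude check of how evenly a causal language model covers different languages.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # placeholder; swap in any causal LM checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

# Tiny hand-picked samples; a real benchmark would use held-out corpora
# representative of each language and its cultural context.
samples = {
    "mt": "Il-lingwa Maltija hija lingwa Semitika.",
    "ga": "Is teanga Cheilteach í an Ghaeilge.",
    "fi": "Suomi on uralilainen kieli.",
}

for lang, text in samples.items():
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    print(f"{lang}: perplexity ≈ {math.exp(loss.item()):.1f}")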

Leveraging Existing Data Resources

The project will significantly benefit from the groundwork laid by the HPLT project, particularly the recent release of version 2.0 of its dataset.

This dataset was built from an extensive collection of web crawls totaling 4.5 petabytes and comprising over 20 billion documents.

Furthermore, Hajič indicated plans to supplement this with additional data sourced from Common Crawl, a publicly accessible repository of web-crawled information.

Data Sources

  • 4.5 petabytes of web crawls
  • Over 20 billion documents
  • Additional data from Common Crawl
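
For a sense of how corpora at this scale are typically consumed, the sketch below streams documents with the Hugging Face datasets library rather than downloading anything in full. The dataset identifier and configuration name are assumptions made for illustration, not confirmed project resources.

# Minimal sketch, assuming an HPLT-style cleaned corpus is mirrored on the
# Hugging Face Hub. The dataset ID and config below are assumptions, not
# confirmed OpenEuroLLM resources. Streaming avoids petabyte-scale downloads.
from datasets import load_dataset

DATASET_ID = "HPLT/HPLT2.0_cleaned"  # assumed identifier, for illustration only
CONFIG = "fin_Latn"                  # e.g., a Finnish subset

stream = load_dataset(DATASET_ID, CONFIG, split="train", streaming=True)

# Inspect a handful of documents without materializing the corpus on disk.
for i, doc in enumerate(stream):
    print(doc.get("text", "")[:200].replace("\n", " "))
    if i >= 2:
        break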

Defining Open Source

The debate over open source versus proprietary software often hinges on what “open source” actually means. The usual way to settle it is to defer to the official definition maintained by the Open Source Initiative (OSI), the recognized authority on legitimate open source licenses.

The OSI has also developed a definition for “open source AI,” a move that has generated some disagreement. Advocates of truly open AI contend that genuine transparency requires every component to be freely available: not just the models, but the datasets, the pretrained models, and the weights.

However, the OSI’s definition does not mandate the inclusion of training data, acknowledging that AI models are frequently trained using proprietary data or data subject to redistribution limitations.

The OpenEuroLLM project is grappling with similar challenges. Despite aiming to be “truly open,” compromises may be necessary to meet its “quality” standards. Hajič explained that achieving complete openness presents certain hurdles.

“Our aim is complete openness wherever feasible. However, limitations do exist,” Hajič stated. “We are committed to delivering models of the highest caliber, and European copyright law dictates what data we can utilize.”

This implies that some training data used by OpenEuroLLM might not be publicly distributed, but could be accessible for review by authorized auditors. This approach aligns with the requirements for high-risk AI systems under the EU AI Act.

“We anticipate that a significant portion of the data will be open, particularly data sourced from Common Crawl,” Hajič added. “Ideally, we would release everything completely openly, but compliance with AI regulations is paramount.”

A Case of Parallel Development

Shortly after OpenEuroLLM’s official launch, observers pointed to a strikingly similar initiative that had gotten underway in Europe several months earlier. EuroLLM, which released its first model in September and a subsequent version in December, is co-funded by the EU and run by a partnership of nine organizations.

This consortium encompasses both academic bodies, like the University of Edinburgh, and commercial entities, such as Unbabel, a company that recently secured substantial GPU time on EU-based supercomputers for model training.

The objectives of EuroLLM closely mirror those of OpenEuroLLM: “The creation of an open source European Large Language Model capable of processing 24 official European languages, alongside several other languages deemed strategically significant.”

Andre Martins, Research Lead at Unbabel, publicly addressed the parallels, pointing out that OpenEuroLLM had adopted a name that was already in use.

He expressed a desire for open collaboration and knowledge sharing between the communities, advocating against redundant efforts with each new funded project.

Hajič acknowledged the situation as “unfortunate” and expressed hope for potential cooperation.

However, he also emphasized that OpenEuroLLM’s EU funding imposes limitations on collaborations with organizations outside the European Union, specifically including universities located in the U.K.

Funding and Collaboration Restrictions

The funding source for OpenEuroLLM dictates certain constraints regarding partnerships.

  • Collaboration with non-EU entities is limited.
  • This restriction specifically impacts institutions in the U.K.

These limitations may influence the potential for synergistic efforts between OpenEuroLLM and EuroLLM.

Similarities Between the Projects

Both projects share a common goal of developing a multilingual Large Language Model for Europe.

Both initiatives are supported by EU funding and involve a consortium of partners.

The AI Funding Landscape

The emergence of DeepSeek from China, coupled with its favorable cost-to-performance metrics, has sparked optimism that artificial intelligence projects may achieve substantial results with reduced financial investment. Nevertheless, recent discussions have raised concerns regarding the actual expenses associated with the development of DeepSeek.

Peter Sarlin, the technical co-lead for the OpenEuroLLM project, expressed this sentiment to TechCrunch, stating, “Regarding DeepSeek, our understanding of the resources invested in its creation remains limited.”

Despite this uncertainty, Sarlin believes OpenEuroLLM will secure adequate funding, primarily to cover personnel costs. A significant portion of the expenses involved in constructing AI systems is related to computational resources, which are largely addressed through their collaboration with EuroHPC centers.

“It is reasonable to suggest that OpenEuroLLM benefits from a considerable budget,” Sarlin explained. “EuroHPC has already allocated billions to AI and computing infrastructure, with further billions pledged for future expansion.”

It is important to recognize that the OpenEuroLLM project is not focused on developing a consumer-facing or enterprise-level product. Its primary objective is the creation of the models themselves, which is why Sarlin anticipates their current budget will be sufficient.

“Our focus isn’t on creating a chatbot or an AI assistant – that represents a product-driven effort, as demonstrated by ChatGPT’s success,” Sarlin clarified. “Instead, we aim to provide an open-source foundation model serving as the AI infrastructure for European companies to innovate upon. We possess the knowledge required to build these models, and it doesn’t necessarily demand billions in investment.”

Since 2017, Sarlin has led the AI laboratory Silo AI, which, in collaboration with partners like the HPLT project, introduced the Poro and Viking families of open models. These models currently support several European languages, and the company is preparing the next generation, the “Europa” models, to encompass all European languages.

This aligns with the concept highlighted by Hajič – the advantage of building upon existing foundations of expertise and technology. A substantial base of knowledge and tools is already in place.

Project Focus and Budget Allocation

  • The OpenEuroLLM project prioritizes model development over product creation.
  • Funding is primarily allocated to personnel and computational resources.
  • EuroHPC’s existing and planned investments provide a strong financial foundation.

Sovereign State

As some commentators have pointed out, OpenEuroLLM is a sprawling undertaking with many moving parts, a point Hajič recognizes while maintaining an optimistic perspective.

“My experience includes participation in several collaborative endeavors, and I am convinced that this approach offers benefits compared to development by a single organization,” he stated. “Undoubtedly, significant achievements have been made by entities such as OpenAI and Mistral, but it is my hope that the synergy between academic knowledge and corporate objectives will yield innovative results.”

The project isn't necessarily focused on surpassing major technology companies or large-scale AI ventures; its primary objective is to achieve digital sovereignty. This involves creating foundation Large Language Models (LLMs) that are predominantly open-source and developed within Europe, for the benefit of Europe.

“While it’s possible we may not ultimately produce the leading model, even a ‘good’ model built entirely with European-based components would represent a success,” Hajič explained. “This outcome would be considered a positive achievement.”

#open source LLMs #digital sovereignty #Europe #AI #large language models #technology