Polyglot
Developing Open Source Foundation Models for Low-Resource Languages

Developing Open Source Foundation Models for Low-Resource Languages (TRA 6 Project)


Our research

Polyglot is an initiative to close the linguistic divide in NLP by developing efficient and accessible foundation models for low-resource languages.

While recent breakthroughs in generative AI have been driven by large-scale foundation models, these advances have largely benefited high-resource languages, leaving many underrepresented languages behind. The current deep learning paradigm—heavily reliant on massive datasets and computing power—has unintentionally widened this gap, making it harder for speakers of low-resource languages to access and shape AI technologies that reflect their linguistic and cultural identities.

Polyglot addresses this imbalance by creating tools, models, and datasets that support open, sustainable, and inclusive AI development. We aim to empower researchers and communities working with low-resource languages through high-quality open-source resources, enabling them to build and fine-tune language models tailored to their needs.

Polyglot is funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

The innovation of the Polyglot project lies in its commitment to making foundation models accessible and effective for low-resource languages, which have historically been excluded from the major advances in generative AI. Rather than scaling existing models indiscriminately, Polyglot takes a targeted, sustainable, and open-source approach by developing tailored tools, datasets, and models that can be adapted to the unique linguistic and cultural contexts of underrepresented communities. This not only democratizes access to AI technologies but also empowers local researchers and speakers to actively shape AI systems in ways that reflect their values and identities.

An interdisciplinary project is one that brings together researchers and methods from different academic fields to collaboratively address complex problems that cannot be fully understood through a single disciplinary lens. In the case of Polyglot, for instance, deep learning specialists, high-performance computing experts, linguists, and philosophers work side by side—not only to build language models for underrepresented languages, but also to ensure that these technologies are developed ethically, sustainably, and with cultural sensitivity. This kind of collaboration allows for richer insights and more responsible innovation, blending technical excellence with societal awareness.

The Polyglot project is expected to generate a diverse set of high-impact outputs that contribute both academically and practically to the NLP community. These include a comprehensive suite of datasets, monolingual large language models, and evaluation benchmarks tailored to low-resource languages, all made openly available. An open-source repository will ensure transparency and reproducibility, allowing others to build upon our work. We also aim to produce several peer-reviewed publications—at least one per target language—covering key aspects of the project, such as dataset creation and model performance. Beyond that, the project supports graduate research, offering fertile ground for Master’s and PhD theses. Finally, we plan to host an international, interdisciplinary workshop to foster dialogue around the development of LLMs for underrepresented languages.

TUCANO - Advancing Neural Text Generation for Portuguese

This study introduces a new set of resources to stimulate the future development of neural text generation in Portuguese. We document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens. Using this corpus, we trained a series of decoder-transformer models named Tucano. Our models perform on par with or better than other Portuguese and multilingual language models of similar size on several Portuguese benchmarks.

More information:
arXiv
GitHub
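
For readers who want to experiment with the released checkpoints, the sketch below shows how a Tucano model could be loaded for Portuguese text generation with the Hugging Face transformers library. The model identifier used here (TucanoBR/Tucano-1b1) is an assumption; please refer to the GitHub repository above for the exact names and sizes of the published checkpoints.

# Minimal sketch: generating Portuguese text with a Tucano checkpoint.
# Assumes the Hugging Face "transformers" library is installed and that a
# checkpoint is published under the assumed id "TucanoBR/Tucano-1b1".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TucanoBR/Tucano-1b1"  # assumed Hub id; replace with a released checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "A capital do Brasil é"
inputs = tokenizer(prompt, return_tensors="pt")

# Short greedy generation; adjust max_new_tokens or enable sampling as needed.
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))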


Team Members 

Nicholas Kluge Corrêa (Principal Investigator)

Center for Science and Thought, University of Bonn

kluge@uni-bonn.de

 

Aniket Sen (Principal Investigator)

High Performance Computing and Analytics Lab / Helmholtz-Institut für Strahlen- und Kernphysik, University of Bonn

sen@hiskp.uni-bonn.de

 

Sophia Falk

Bonn Sustainable AI Lab, Institute for Science and Ethics, University of Bonn

falk@iwe.uni-bonn.de


Shiza Fatimah

Institute for Computer Science, University of Bonn

s39sfati@uni-bonn.de

Contact


Nicholas Kluge Corrêa
