Engineering LLMs: Data pipelines, regulations, and reality

Mar 13

Training a foundational LLM from the ground up is an enormous engineering challenge. Doing it in a way that is transparent, inclusive, and compliant with the EU AI Act makes that challenge even greater. In our January technical meetup, we take a close look at the practical realities of building and operating LLMs. 

This time, in collaboration with the PyData community, we have curated two complementary perspectives that cover the full lifecycle of AI engineering: training a massive sovereign model from scratch and architecting privacy-first environments for self-hosted LLMs.

From curating massive datasets for training to architecting governance for self-hosted models, our speakers for this evening will tackle the realities of privacy and compliance.

Is this for you?

This meetup is aimed at professionals working hands-on with data and AI. You’ll find the sessions especially relevant if you are one of the following:

ML Engineer or Data Scientist interested in what it actually takes to train an LLM from scratch, including data pipelines, training infrastructure, and evaluation.
System Architect or DevOps Engineer interested in the infrastructure required to support "bring your own model" workflows and self-hosted AI solutions.
Tech Lead or CTO assessing how to deploy high-performance AI systems while meeting requirements such as the EU AI Act and GDPR.
Professional interested in Sovereign AI, specifically how we can build transparent, inclusive models that reflect Dutch language and culture.

What’s on the agenda?

GPT-NL: Developing a Dutch LLM from scratch, by Dr. rer. nat. Julio A. de Oliveira Filho, lead System Architect for GPT-NL, and Alexandru Turcu, AI Engineer at TNO

GPT-NL will be a language model for the Dutch language and culture. It will be transparent, inclusive, and compliant with Dutch and European values, with a platform where we can openly communicate about data and development. GPT-NL is an initiative of TNO, SURF, and NFI. As the training of the GPT-NL model approaches its end, this talk will share the experience and challenges of training a major LLM from scratch. Julio and Alexandru will bring insights from their development process, with details on their data curation pipeline and training framework. The session will also explore real-world applications they envision for the model and discuss how the consortium plans to responsibly disseminate GPT-NL within Dutch society.

Bring your own LLM: building GitLab Duo for privacy-first environments, by Eduardo Bonet, Principal Engineer at GitLab

Increasingly powerful open-source models make it possible for users to host their own LLMs, but the tools needed to make use of these models are often missing. In this talk, Eduardo shares a few lessons GitLab learned from developing GitLab Duo Self-hosted, a privacy-oriented, model-agnostic offering of GitLab Duo, including its architecture, the difficulties and benefits of evaluation, and governance controls.

If you care less about promises and more about pipelines, infrastructure, and governance, this evening is for you. Register now to secure your spot!

Event Info:
Date: Thursday, January 29
Time: 18:00 - 21:00

Location: House of Watt, Amsterdam

Talk 1: “GPT-NL: Developing a Dutch LLM from scratch”, by Dr. rer. nat. Julio A. de Oliveira Filho, lead System Architect for GPT-NL, and Alexandru Turcu, AI Engineer at TNO

We’re excited to welcome GPT-NL, a flagship initiative by TNO, SURF, and NFI. Rather than building on top of existing models, GPT-NL is being developed from scratch to serve the Dutch language and cultural context, designed with transparency, inclusivity, and European values at its core.

With the training phase nearing completion, the team is now ready to open the hood and share what it really takes to build such a model.

Talk 2: “Bring your own LLM: building GitLab Duo for privacy-first environments”, by Eduardo Bonet, Principal Engineer at GitLab

The growing power of Open Source models is enabling users to host their own Large Language Models (LLMs). However, the necessary tools to fully utilize these models are often lacking. Eduardo will share insights from GitLab's development of GitLab Duo Self-hosted, a privacy-focused, model-agnostic version of GitLab Duo. This will cover lessons learned regarding its architecture, the challenges and advantages of evaluation, and the implementation of effective governance controls.

Mathijs Mul
