
Building Dhavana's First Language Model From Scratch

Ihsaan Inaaz · May 16, 2026 · 6 min read

How we trained a 125-million-parameter Dhivehi-aware language model from zero — and what it taught us before scaling up to the 1B model now being deployed inside Dhavana AI.


The Gap Nobody Was Going to Fill For Us

Building AI products for Maldivian users runs into a fundamental gap: most general-purpose language models barely understand Dhivehi. Even the largest commercial models — Gemini, ChatGPT, Claude — treat Dhivehi as a tail language. They produce text that's grammatically broken, mix English freely into Dhivehi responses, or fall back to character-by-character generation when they encounter Thaana script.

This is not a complaint about those models. They're trained for the world. Dhivehi, with its ~400,000 native speakers and small internet footprint, sits at the bottom of the language-coverage pyramid for any general-purpose system. No big AI lab is going to wake up and decide to optimize specifically for Dhivehi.

For SerialTechLab, building Dhavana — our Maldivian AI platform — meant we couldn't just plug into an off-the-shelf model and expect quality results. We needed to build our own foundation, one where Dhivehi is treated as a first-class language from the very first training step.

Not Fine-Tuning. Not LoRA. From Scratch.

Most AI projects in low-resource languages take a different path. They start with an existing model — Llama, Mistral, Qwen — and adapt it through fine-tuning or parameter-efficient methods like LoRA. These approaches are sensible. They save compute. They inherit the broad knowledge of much larger models trained at staggering cost.

We chose the harder path: training a model from scratch. From randomly initialized weights. With our own tokenizer. On our own data mixture. Through every step of the modern LLM pipeline.

Why?

  • Owned weights. A model trained from scratch is ours end-to-end — no license dependencies, no upstream surprises.
  • Tokenizer-level Dhivehi. Every off-the-shelf tokenizer treats Thaana as a low-priority script. Our custom tokenizer encodes Dhivehi roughly 9× more compactly than Qwen's and 5× more compactly than Gemma's, so the same Dhivehi text takes far fewer tokens, and far less compute, to process (see the measurement sketch after this list).
  • Real understanding. A model that learned Dhivehi during its initial training, rather than having it bolted on afterward, develops different representations for the language — and in our experience, better ones.
  • Compounding knowledge. Doing this from scratch forced our team to understand every layer of the stack: data pipelines, tokenization, attention internals, optimizer states, checkpointing, evaluation. That knowledge doesn't go away after the first model.
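
That tokenizer claim is easy to sanity-check yourself. Below is a minimal sketch of the measurement, assuming the Hugging Face repo IDs shown; the Qwen and Gemma IDs are illustrative baselines rather than the exact setup behind the numbers above, and the sample sentence is just an example:

```python
# Minimal sketch: count how many tokens each tokenizer needs for the same
# Dhivehi text. Repo IDs and the sample sentence are illustrative, not the
# exact setup behind the numbers quoted above.
from transformers import AutoTokenizer

SAMPLE = "ދިވެހިރާއްޖެއަކީ ޖަޒީރާ ޤައުމެކެވެ."  # a short Dhivehi sentence in Thaana script

TOKENIZERS = {
    "dhavana": "Serialtechlab/dhavana-base-150m",  # custom 32k SentencePiece vocab
    "qwen":    "Qwen/Qwen2.5-7B",                  # multilingual BPE baseline
    "gemma":   "google/gemma-2-2b",                # multilingual baseline (gated repo)
}

for name, repo in TOKENIZERS.items():
    tok = AutoTokenizer.from_pretrained(repo)
    n_tokens = len(tok.encode(SAMPLE, add_special_tokens=False))
    # Fewer tokens for the same text means cheaper training and inference.
    print(f"{name:8s} {n_tokens:3d} tokens  ({n_tokens / len(SAMPLE):.2f} tokens per character)")
```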

Dhavana-Base-125M

The first model we trained is a 125-million-parameter decoder-only Transformer in the modern Llama-style configuration. Both the pretrained base and the instruction-tuned chat variant are publicly available on Hugging Face:

(A note on naming: the Hugging Face repositories carry the historic names dhavana-base-150m and dhavana-chat-150m from our original target parameter count. The model we actually trained came out to 125,264,640 parameters — close enough that we kept the original repo names rather than break links.)

The architectural details:

  • Parameters: 125,264,640
  • Architecture: 16 layers, hidden size 768
  • Attention: Grouped Query Attention (12 query / 4 key-value heads)
  • Feed-forward: SwiGLU
  • Normalization: RMSNorm, pre-norm
  • Position encoding: Rotary Position Embeddings (RoPE)
  • Context length: 2,048 tokens
  • Tokenizer: custom 32k SentencePiece, Dhivehi-optimized
  • Training data: ~3 billion tokens (English, Dhivehi, multilingual, math, code)
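
Expressed as a Hugging Face LlamaConfig, the same configuration looks roughly like the sketch below. The feed-forward (SwiGLU) width and RoPE base aren't listed above, so those values are assumptions for illustration:

```python
# Sketch of the 125M configuration as a Hugging Face LlamaConfig.
# intermediate_size and rope_theta are not in the list above; the values
# here are assumptions, not the published configuration.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32_000,             # custom 32k SentencePiece tokenizer
    hidden_size=768,
    num_hidden_layers=16,
    num_attention_heads=12,        # query heads (GQA)
    num_key_value_heads=4,         # key/value heads (GQA)
    intermediate_size=2048,        # SwiGLU feed-forward width (assumed)
    max_position_embeddings=2048,  # context length
    rms_norm_eps=1e-5,
    rope_theta=10_000.0,           # RoPE base (assumed)
    tie_word_embeddings=True,      # common at this scale (assumed)
)
```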

It was trained from random initialization to convergence — no warm-starting, no inherited weights — on a careful mixture designed to give it broad world capability while keeping Dhivehi visible during every step of the run.
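
Mechanically, "keeping Dhivehi visible during every step" comes down to weighted sampling across sources. Here is a minimal sketch of that kind of interleaving using the Hugging Face datasets library; the file names and probabilities are placeholders, not our actual mixture:

```python
# Minimal sketch of weighted data mixing with Hugging Face `datasets`.
# Dataset files and probabilities below are placeholders, not the real
# Dhavana mixture.
from datasets import load_dataset, interleave_datasets

english = load_dataset("json", data_files="english_corpus.jsonl", split="train", streaming=True)
dhivehi = load_dataset("json", data_files="dhivehi_corpus.jsonl", split="train", streaming=True)
code    = load_dataset("json", data_files="code_corpus.jsonl",    split="train", streaming=True)

# Sampling by probability keeps Dhivehi present throughout the run
# instead of concentrating it in one phase of training.
mixture = interleave_datasets(
    [english, dhivehi, code],
    probabilities=[0.6, 0.3, 0.1],  # placeholder weights
    seed=42,
    stopping_strategy="all_exhausted",
)
```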

What 125M Showed Us

The 125M model is small by modern standards. Today's flagship commercial models are 50× to 1,000× larger. Compared to ChatGPT, Gemini, or Claude, Dhavana-Base-125M has clear limits:

  • It does not have deep factual knowledge of the world.
  • It cannot do complex multi-step reasoning.
  • It has limited capability on coding or advanced math.

But on the question that actually mattered to us — can a from-scratch model produce real, fluent Dhivehi? — the answer came back unambiguous: yes.

After instruction tuning, the 125M model:

  • Generates fluent, grammatical Dhivehi prose across multiple registers — news, formal, narrative.
  • Correctly switches output language based on the prompt: Dhivehi questions get Dhivehi answers.
  • Composes multi-paragraph Dhivehi short stories with characters, dialogue, and a coherent arc.
  • Responds to a "reply in Dhivehi" system prompt and stays in that mode (a usage sketch follows this list).
  • Demonstrates real translation capability between English and Dhivehi for everyday content.
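
To make that concrete, here is roughly how the chat variant can be prompted to stay in Dhivehi. This sketch assumes the tokenizer on the dhavana-chat-150m repo ships a chat template; check the model card for the exact prompt format:

```python
# Sketch of querying the instruction-tuned model with a "reply in Dhivehi"
# system prompt. Assumes the tokenizer defines a chat template; see the
# model card for the exact prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Serialtechlab/dhavana-chat-150m"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

messages = [
    {"role": "system", "content": "Reply in Dhivehi."},
    {"role": "user", "content": "Tell me a short story about a fisherman."},
]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)

# Print only the newly generated tokens, not the echoed prompt.
print(tok.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```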

For a 125-million-parameter model trained from scratch on a low-resource language, this was not a small result. It gave us our first concrete proof that the approach works.

The Honest Limits

We are not going to oversell what 125M does. At this scale:

  • Factual recall is unreliable — ask it the capital of the Maldives and it might invent something plausible-sounding.
  • Long-form coherence drifts beyond a few hundred tokens.
  • Translation works for everyday phrases but struggles with idioms, technical terms, and specialized vocabulary.
  • Complex tasks like explaining a scientific concept or writing non-trivial code are beyond its capacity.

These aren't bugs — they're physics. A 125M model has a fundamental capacity ceiling, and ours met it. We always knew the 125M's job was not to be the final product. Its job was to prove the pipeline.

What That Proof Unlocked

The real value of the 125M project was the confidence it gave us. We proved:

  • Our tokenizer dramatically outperforms multilingual tokenizers on Dhivehi.
  • Our data pipeline can clean, deduplicate, and tokenize hundreds of millions of tokens of Dhivehi content reliably.
  • Our training infrastructure scales: it survives session limits, resumes cleanly from checkpoints, and produces well-behaved loss curves (see the resume sketch after this list).
  • Most importantly: scaled up, this approach produces a genuinely useful Dhivehi model.
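
The checkpointing piece is the least glamorous part of that list, but it is what lets a long pretraining run survive interrupted sessions. A minimal sketch of the pattern in PyTorch, with illustrative paths:

```python
# Minimal sketch of the save/resume pattern that lets a run survive
# session limits (e.g. preemptible GPU sessions). Paths are illustrative.
import os
import torch

CKPT = "checkpoints/latest.pt"

def save_checkpoint(model, optimizer, scheduler, step):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),  # Adam moments must be saved too,
        "scheduler": scheduler.state_dict(),  # or the loss spikes on resume
        "step": step,
    }, CKPT)

def load_checkpoint(model, optimizer, scheduler):
    if not os.path.exists(CKPT):
        return 0                              # fresh run
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["step"]                      # resume exactly where we stopped
```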

With that confidence, we moved on to the natural next step: Dhavana-Base-1B — a one-billion-parameter model trained from scratch using the same pipeline, the same tokenizer, the same data philosophy, with roughly eight times the capacity of its predecessor.

The 1B model is being deployed as the new translation engine inside Dhavana AI at dhavana.com, replacing the previous translation pipeline.

It is not as smart as Gemini, ChatGPT, or Claude — and we have no illusions about competing with those models on general intelligence. But for Dhivehi-related translation, which is what most Maldivian users actually need from an AI day-to-day, it produces results that genuinely surprised us. Translations come out coherent, register-appropriate, and meaningfully better than what frontier general-purpose models give for the same input.

What This Means for Dhavana

Every translation request inside Dhavana AI is moving onto a model built from the ground up to understand Dhivehi as a first-class language. Every token of training data was chosen, every line of training code was written, every weight was learned, with Maldivian users specifically in mind.

The Dhavana model line is just beginning. Future iterations will scale further and specialize across the platform's use cases — voice transcription, document understanding, conversational assistance — all rooted in the same principle: Dhivehi deserves AI that was actually built for it, not retrofitted to it.

That, more than benchmark scores, is what makes Dhavana different.


Dhavana AI is available at dhavana.com. Open model releases at huggingface.co/Serialtechlab. Built at SerialTechLab.

Tags

Dhivehi · LLM · Language Model · Maldives · Dhavana AI · From Scratch · Translation · Open Source AI

Ihsaan Inaaz

Founder & Lead Developer at SerialTech Lab