Fine-Tuning Google Gemini for Natural Dhivehi: A Deep Technical Dive

As AI continues to advance, one critical question remains: Can we truly adapt these powerful models to low-resource languages like Dhivehi? At SerialTech Lab, we've been exploring whether Google's Gemini models can be fine-tuned to respond purely in natural Dhivehi—without falling back to English. This isn't just a technical challenge; it's about preserving and empowering our language in the AI era.
The Challenge: Why Dhivehi is Different
Dhivehi presents unique challenges for AI models. Our language uses the Thaana script—a right-to-left writing system with mandatory vowel diacritics called fili. Unlike English, where AI models have been trained on massive datasets, Dhivehi is what researchers call a "low-resource language" (LRL). This means there's far less training data available, making it harder for AI to learn natural patterns.
The big question we're tackling: Can we fine-tune Gemini to think and respond purely in Dhivehi, maintaining natural flow without code-switching to English?
Understanding Gemini's Architecture
Google's Gemini family uses something called a "Mixture-of-Experts" (MoE) architecture. Think of it like having multiple specialized experts within one model. When you ask a question, only the relevant experts activate—making the system efficient while maintaining high performance.
Here's what we're working with:
- Gemini 2.5 Pro: The powerhouse for complex reasoning with a 1-million-token context window
- Gemini 2.5 Flash: Fast and efficient for high-throughput tasks
- Gemini 1.5 Pro: Excellent for long-document analysis
- Gemini 1.5 Flash: The speed demon for cost-effective inference
All these models were pre-trained on over 100 languages, including Dhivehi. But there's a catch—what researchers call the "curse of multilinguality." When a model tries to handle too many languages, its performance in any single low-resource language can suffer.
The "Token Tax" Problem
Here's where things get technical—and expensive. Gemini uses a tokenizer that breaks text into smaller units. For English, one token equals roughly 4 characters or 0.75 words. But for Dhivehi? The tokenization is far less efficient.
Because Thaana uses diacritics for vowels, a single character-vowel combination might get split into multiple tokens. This means:
- Higher costs: More tokens = higher API costs
- Slower processing: More tokens to generate = longer wait times
- Less context: The same semantic meaning takes up more of the model's context window
This "token tax" is one of the biggest barriers we face in making Dhivehi AI economically viable.
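A quick way to feel the scale of this tax locally, without calling a tokenizer endpoint, is to compare UTF-8 byte weight per character. This is only a proxy (real token counts come from the model's own tokenizer), but subword vocabularies tend to fall back toward byte-level pieces for underrepresented scripts, so bytes per character is a reasonable first signal:

```python
# Rough local proxy for the "token tax": Thaana letters cost 2 UTF-8 bytes
# each, while ASCII English costs 1, and byte-fallback tokenization makes
# that difference show up directly in token counts and API bills.
def byte_weight(text: str) -> float:
    """Average UTF-8 bytes per character of text."""
    return len(text.encode("utf-8")) / len(text)

english = "Hello, how are you today?"
dhivehi = "ހާލު ކިހިނެއް؟"  # roughly "How are you?" in Thaana script

print(byte_weight(english))  # 1.0 (pure ASCII)
print(byte_weight(dhivehi))  # close to 2: nearly double the bytes per character
```

The same sentence meaning can therefore consume roughly twice the raw bytes in Dhivehi, before tokenizer inefficiency compounds it further.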
Our Approach: Supervised Fine-Tuning on Vertex AI
To adapt Gemini for natural Dhivehi, we're using Supervised Fine-Tuning (SFT) on Google's Vertex AI platform. The key technique is called LoRA (Low-Rank Adaptation)—a parameter-efficient method that doesn't require retraining the entire model.
Think of it this way: Instead of teaching the model everything from scratch, we're adding a specialized "Dhivehi layer" on top of its existing knowledge. This way, it keeps its reasoning abilities while learning to express them naturally in Dhivehi.
Key Technical Specifications
When setting up a fine-tuning job on Vertex AI, here are the critical limits:
- Maximum tokens per example: 131,072 tokens
- Dataset size: Up to 1GB in JSONL format
- Validation set: Maximum 5,000 examples
- Adapter sizes: Choose from 1, 4, 8, or 16 (we recommend 8 or 16 for Dhivehi)
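As a sketch of what one training example looks like on disk, here is a helper that emits a single JSONL line. The field names ("contents", "role", "parts", "systemInstruction") follow our reading of the Vertex AI Gemini tuning dataset format, and the Dhivehi strings are illustrative placeholders, so verify the schema against the current docs before building a real dataset:

```python
import json

def make_example(user_text: str, model_text: str, system: str = "") -> str:
    """Serialize one supervised fine-tuning example as a JSONL line.
    Field names follow the Vertex AI Gemini tuning dataset format
    (verify against current documentation)."""
    record = {
        "contents": [
            {"role": "user", "parts": [{"text": user_text}]},
            {"role": "model", "parts": [{"text": model_text}]},
        ]
    }
    if system:
        record["systemInstruction"] = {"parts": [{"text": system}]}
    # ensure_ascii=False keeps Thaana characters readable in the file
    return json.dumps(record, ensure_ascii=False)

line = make_example(
    "ދިވެހިރާއްޖޭގެ ވެރިރަށަކީ ކޮބާ؟",      # "What is the capital of the Maldives?"
    "ދިވެހިރާއްޖޭގެ ވެރިރަށަކީ މާލެއެވެ.",   # "The capital of the Maldives is Malé."
)
```

One such line per example, UTF-8 encoded, is what the 1GB JSONL limit above applies to.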
Data Engineering: The Make-or-Break Factor
The quality of your training data determines everything. For Gemini to respond purely in Dhivehi, our dataset must be:
- Monolingual: No code-switching between Dhivehi and English
- Natural: Reflecting how people actually speak, not just formal text
- Diverse: Covering different registers—from news articles to casual conversation
Where We're Getting Dhivehi Data
We're pulling from several sources:
- News websites: ~300MB of formal, grammatically correct Dhivehi (great for structure, but sometimes too formal)
- Shaafiu Speech dataset: 16.5 hours of natural, narrative Dhivehi (gold for conversational flow)
- Dhivehi Wikipedia: Limited but excellent for knowledge representation
- Social media: High volume but requires heavy cleaning to remove English mixing
- Sentiment datasets: Experimental but useful for teaching tone
The challenge? Most of these sources contain code-switching. We need to carefully filter and clean the data to create truly monolingual training examples.
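A first-pass filter for that cleaning step can be sketched in a few lines, using the Thaana Unicode block (U+0780-U+07BF) as the signal. The 90% threshold and the blanket rejection of Latin letters are our own heuristics, not a standard:

```python
def is_monolingual_thaana(text: str, min_thaana_share: float = 0.9) -> bool:
    """Heuristic code-switching filter: reject any Latin letters and require
    that most letter characters fall in the Thaana block (U+0780-U+07BF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    if any("a" <= ch.lower() <= "z" for ch in letters):
        return False  # embedded English word -> drop the example
    thaana = sum(1 for ch in letters if "\u0780" <= ch <= "\u07bf")
    return thaana / len(letters) >= min_thaana_share

clean = "މިއަދު މޫސުން ވަރަށް ރީތި"   # pure Thaana sentence
mixed = "މިއަދު meeting އެއް އޮތް"    # code-switched with English
```

A filter like this is deliberately strict: it throws away borderline examples rather than risk teaching the model that code-switching is acceptable.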
Synthetic Data Generation
Given the scarcity of clean Dhivehi data, we're also using larger models to generate synthetic training data. We prompt a capable model like Gemini 1.5 Pro to "act as a native Dhivehi speaker" and generate conversations. Then we use automated metrics to filter out unnatural or incorrect outputs.
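In sketch form, that generation loop looks like the following. The prompt wording is our own, and the SDK call (Vertex AI's GenerativeModel) requires credentials, so it is defined but not executed here:

```python
def build_generation_prompt(topic: str) -> str:
    """Assemble the instruction sent to the teacher model (our own wording)."""
    return (
        "Act as a native Dhivehi speaker. Write a short, natural conversation "
        f"about {topic}, entirely in Thaana script, with no English words "
        "and no transliteration. Respond only in Dhivehi."
    )

def generate_synthetic_dialogue(topic: str) -> str:
    """Call the teacher model (needs Vertex AI credentials; not run here)."""
    from vertexai.generative_models import GenerativeModel
    model = GenerativeModel("gemini-1.5-pro")
    return model.generate_content(build_generation_prompt(topic)).text
```

Outputs from this loop would still pass through the same monolingual filters as scraped data before entering the training set.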
The Right-to-Left Challenge
Dhivehi's RTL directionality adds another layer of complexity. While Gemini handles RTL text, many development tools default to LTR display, making it difficult to validate training data manually.
The Thaana script also has unique features:
- Vowel diacritics placed above and below consonants
- Mixed directionality (RTL for text, LTR for numbers)
- Arabic extensions for certain sounds
- Hanging baseline for consonants
Each of these creates potential error modes that we need to account for in our fine-tuning process.
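One such error mode is easy to check mechanically: in standard Thaana orthography every consonant carries exactly one fili or a sukun, so a consonant with no following vowel sign usually indicates corrupted or OCR-damaged text. Here is a minimal validator under that simplified rule (real corpora have edge cases, such as prenasalised noonu conventionally written without sukun, so treat flagged positions as warnings, not hard errors):

```python
# Thaana code points: base consonants U+0780-U+07A5 (plus NAA U+07B1),
# vowel signs (fili) U+07A6-U+07AF, and sukun U+07B0.
CONSONANTS = {chr(c) for c in range(0x0780, 0x07A6)} | {"\u07b1"}
FILI = {chr(c) for c in range(0x07A6, 0x07B1)}  # vowel signs incl. sukun

def missing_fili(text: str) -> list:
    """Return indices of Thaana consonants not followed by a fili/sukun,
    a common corruption mode in scraped or OCR'd training data."""
    bad = []
    for i, ch in enumerate(text):
        if ch in CONSONANTS:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt not in FILI:
                bad.append(i)
    return bad
```

Running this over a corpus before training catches diacritic loss that is invisible when validating RTL text by eye in an LTR-default editor.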
Implementation: Our Three-Phase Strategy
Phase 1: Baseline Testing
Before fine-tuning anything, we test the base Gemini model on standard Dhivehi datasets. This gives us a performance benchmark to measure improvement against.
Phase 2: Fine-Tuning Configuration
We configure the training job with optimal hyperparameters:
- Adapter size: 8 or 16 (balances capacity with efficiency)
- Epochs: 1-3 (more can lead to overfitting)
- Learning rate: 1.0 multiplier (stable starting point)
- Training region: us-central1 or europe-west4 (where GPU/TPU resources are available)
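Wiring those hyperparameters into a tuning job might look like the following. The call signature follows our reading of the Vertex AI Python SDK's sft.train, and the project, bucket, and model version strings are placeholders; verify parameter names against the current SDK docs:

```python
# Phase 2 hyperparameters, kept in one place so the evaluation phase can
# log exactly what was trained.
HYPERPARAMS = {
    "epochs": 2,                      # 1-3; more risks overfitting
    "adapter_size": 8,                # LoRA adapter rank; 8-16 for Dhivehi
    "learning_rate_multiplier": 1.0,  # stable default
}

def launch_tuning_job():
    """Launch the SFT job (needs Vertex AI credentials and a staged
    JSONL dataset; not run here)."""
    import vertexai
    from vertexai.tuning import sft
    vertexai.init(project="your-project", location="us-central1")
    return sft.train(
        source_model="gemini-1.5-flash-002",
        train_dataset="gs://your-bucket/dhivehi_train.jsonl",
        validation_dataset="gs://your-bucket/dhivehi_val.jsonl",
        tuned_model_display_name="dhivehi-gemini",
        **HYPERPARAMS,
    )
```

The returned job object can then be polled for completion and for the tuned model's endpoint name.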
Phase 3: Purity Evaluation
After training, we test specifically for linguistic purity—can it respond entirely in Dhivehi without English code-switching? We also verify the model hasn't lost its reasoning abilities (a risk called "catastrophic forgetting").
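The purity half of that evaluation can be automated with a simple character-level metric. The 95% threshold is our own choice, and this deliberately measures only script purity; semantic quality and reasoning retention need separate evaluation:

```python
def thaana_purity(response: str) -> float:
    """Share of letter characters in the Thaana block: a crude proxy
    for 'did the model stay in Dhivehi?'."""
    letters = [ch for ch in response if ch.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0780" <= ch <= "\u07bf" for ch in letters) / len(letters)

def purity_pass_rate(responses, threshold: float = 0.95) -> float:
    """Fraction of model responses that are (nearly) pure Thaana."""
    return sum(thaana_purity(r) >= threshold for r in responses) / len(responses)
```

Tracking this rate before and after fine-tuning gives a single number for the "purely in Dhivehi" goal, alongside the quality metrics.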
Enhancing with System Instructions
Beyond fine-tuning, we can guide behavior using system instructions—permanent directives like:
"You are a professional Maldivian linguist. Respond only in natural Dhivehi using the Thaana script."
Combined with Gemini's long context window (up to 2 million tokens on Gemini 1.5 Pro), we can also provide hundreds of few-shot examples of natural Dhivehi conversations directly in the prompt. This lets the model refine its style at inference time without additional training.
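A sketch of that inference-time setup follows, with the few-shot history assembled as plain dicts and converted to the SDK's Content objects at call time. The example structure and model name are our own placeholders:

```python
SYSTEM = ("You are a professional Maldivian linguist. "
          "Respond only in natural Dhivehi using the Thaana script.")

def few_shot_history(pairs):
    """Interleave (user, model) example pairs into a chat-history structure
    (plain dicts here; converted to SDK Content objects when calling)."""
    history = []
    for user_text, model_text in pairs:
        history.append({"role": "user", "parts": [{"text": user_text}]})
        history.append({"role": "model", "parts": [{"text": model_text}]})
    return history

def ask(question, pairs):
    """Query with system instruction plus few-shot history
    (needs Vertex AI credentials; not run here)."""
    from vertexai.generative_models import GenerativeModel, Content, Part
    model = GenerativeModel("gemini-1.5-pro", system_instruction=SYSTEM)
    history = [
        Content(role=m["role"],
                parts=[Part.from_text(p["text"]) for p in m["parts"]])
        for m in few_shot_history(pairs)
    ]
    return model.start_chat(history=history).send_message(question).text
```

The same history-building helper works for evaluating the base model and the fine-tuned model side by side.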
The Safety Challenge
Here's something critical that often gets overlooked: safety guardrails that work in English might fail in Dhivehi. Recent research shows that adversarial robustness scores drop significantly in low-resource languages:
- One leading proprietary model: a 14% safety gap between English and LRLs
- Another frontier model: a 12.6% safety gap
- Some models: gaps of up to 50%
This means we need to specifically "safety tune" our Dhivehi model—exposing it to harmful prompts in Dhivehi and training appropriate responses. Without this, a linguistically "pure" model might produce harmful content that English-focused safety filters miss.
Real Results: What Fine-Tuning Achieves
Research shows that supervised fine-tuning can reduce error metrics by roughly 23-26% across different model sizes. Applying this to Dhivehi:
- 27B parameter models: ~23.5% improvement
- 12B parameter models: ~25.9% improvement
- 4B parameter models: ~23.6% improvement
This means even smaller, more efficient models can outperform larger general-purpose ones after fine-tuning—making Dhivehi AI more cost-effective.
Future Directions: Beyond Text
The next frontier? Audio and multimodal capabilities. With Gemini 2.0's real-time streaming and models like Gemini Live, we could soon fine-tune models to:
- Speak with natural Maldivian accents
- Understand spoken Dhivehi commands
- Process images of Thaana text (OCR)
- Handle mixed audio-visual content in Dhivehi
Imagine an AI that doesn't just write in Dhivehi, but truly speaks, listens, and sees in our language.
The Bottom Line: It's Possible, But Requires Dedication
Yes, we can fine-tune Gemini to respond purely in natural Dhivehi. The technology exists, the infrastructure is there, and the architectural flexibility of Gemini's MoE design makes it feasible.
But success depends on three critical factors:
- Data Quality: High-quality, monolingual Dhivehi corpora reflecting natural speech patterns
- Technical Expertise: Proper hyperparameter tuning and understanding of LRL-specific challenges
- Cultural Awareness: Ensuring the model captures not just grammar, but the cultural nuances of Maldivian communication
Optimal Configuration Summary
For teams looking to implement this, here's our recommended setup:
| Component | Setting | Why |
|---|---|---|
| Base Model | gemini-1.5-flash or gemini-2.5-flash | Best cost-performance balance |
| Method | SFT with LoRA | Prevents forgetting, reduces overhead |
| Adapter Rank | 8 or 16 | Handles complex LRL structure |
| Data Format | JSONL (UTF-8) | Required by Vertex AI, supports Thaana |
| Region | us-central1 or europe-west4 | Best GPU/TPU availability |
| Evaluation | MetricX or AutoMQM | More sensitive than ROUGE/BLEU |
| Learning Rate | 1.0 | Stable default for SFT |
Conclusion: Empowering Dhivehi in the AI Age
This isn't just a technical exercise. Fine-tuning Gemini for Dhivehi is about ensuring our language thrives in the AI era—that Maldivians can interact with cutting-edge technology in their mother tongue, naturally and authentically.
The challenges are real: the token tax, data scarcity, RTL complexity, and safety concerns. But with careful data curation, proper technical implementation, and a commitment to cultural authenticity, we can build AI that truly speaks Dhivehi.
At SerialTech Lab, we're committed to making this vision a reality. The future of Maldivian language technology starts here.
Ihsaan Inaaz
Founder & Lead Developer at SerialTech Lab