
Why Nepal's Fintech Needs Its Own LLMs

Off-the-shelf models weren't built for Nepali language, regulation, or user behavior. Here's what we learned trying to make them work — and why the answer isn't prompting harder.

The assumption everyone makes

When most teams adopt AI, the default play is: pick a frontier model, write a good prompt, call the API, ship. It works. Until it doesn't.

The failure mode is subtle. The model sounds right. It produces fluent, confident output. But it's wrong in ways you don't catch until a user does — or worse, a regulator. And in fintech, the margin for error is close to zero.

I spent the better part of a year trying to make general-purpose LLMs work for a Nepali fintech context. The core problem isn't capability — GPT-4 and its peers are extraordinarily capable. The problem is alignment to a context they've never seen.


Three things that don't translate

Language. Nepali NLP is a low-resource problem. Most frontier models were trained on datasets where Nepali is a rounding error. The result: transliterated text (Nepali written in Roman script), code-mixed Nepali-English sentences, and the colloquial terms used in mobile banking all produce degraded outputs. The model hallucinates, hedges, or switches to English mid-sentence.

Regulatory context. Nepal Rastra Bank regulations, KYC norms, and PSP compliance rules exist nowhere in the model's training data in any meaningful density. Ask a general-purpose LLM to help draft a customer notice about a transaction limit and it will produce something that sounds legally sound but isn't — for this jurisdiction.

"Sounds right" is the most dangerous failure mode in production AI. It bypasses every human review that would catch an obviously wrong answer.

User mental models. The way people think about digital wallets in Nepal is shaped by specific onboarding flows, agent networks, and mobile-first behavior. A general LLM trained on global fintech content has a very different mental model of what a "wallet top-up" or "agent cashout" means. This creates friction in anything from customer support automation to transaction categorization.


What we tried first

The first instinct — and the right one to try — is prompt engineering. You can get remarkably far with good system prompts, retrieval-augmented generation (RAG), and few-shot examples. We did all of this. It worked for simple, structured tasks.
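To make that concrete, here's a minimal sketch of the retrieve-then-prompt loop. A toy keyword-overlap scorer stands in for a real embedding index, and the helper names (`retrieve`, `build_prompt`) and sample documents are illustrative, not our production code:

```python
import re

def tokenize(text: str) -> set[str]:
    """Crude word-set tokenization; a real system would use embeddings."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by keyword overlap with the query, keep the top k."""
    scored = sorted(docs, key=lambda d: len(tokenize(d) & tokenize(query)),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Stuff the retrieved context into the prompt sent to the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using ONLY the context below. If the context is "
        "insufficient, say so.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "Wallet top-up limits are set per KYC tier.",
    "Agent cashout requires a verified mobile number.",
    "Interchange fees apply to card transactions.",
]
prompt = build_prompt("What limits apply to a wallet top-up?", docs)
```

The pattern works when the answer is literally in the retrieved text. The ceiling shows up when the model must reason across documents it doesn't fundamentally understand.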

But RAG has a ceiling. When the knowledge gap is systemic — when the model doesn't understand the domain, not just the data — retrieval only goes so far. You end up stuffing enormous context windows with regulatory docs and hoping the model reasons over them correctly. Sometimes it does. The variance is unacceptable.

The case for fine-tuning

Fine-tuning isn't a silver bullet. I want to be honest about that upfront. It's expensive, it requires good data, and it introduces new failure modes (catastrophic forgetting, distribution shift at inference time).

But it changes the nature of the problem. Instead of asking a general model to perform in a specialized domain, you're shifting the base distribution toward your domain. The model doesn't need to retrieve the right context — it starts from a position that's already closer to correct.

For us, fine-tuning on curated fintech-domain data — transaction logs, support conversations, regulatory documents, internal SOPs — produced models that were dramatically more reliable on the tasks that mattered. Not more capable in a general sense. More trustworthy in our specific context.
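Most of that curation work reduces to shaping raw material into training records. Here's a sketch of the kind of transformation involved, using the common chat-style `messages` schema; the field names, the sample Q/A pair, and the provenance tag are all illustrative, and the exact format depends on your training framework:

```python
import json

def to_training_record(question: str, answer: str, source: str) -> dict:
    """Shape one curated Q/A pair into a chat-style fine-tuning record."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
        # Provenance metadata, kept so bad records can be traced and pulled.
        "meta": {"source": source},
    }

# Illustrative pair: Romanized Nepali question ("Why hasn't money
# arrived in my wallet?") with a curated support answer.
pairs = [
    ("Mero wallet ma paisa kina aayena?",
     "Top-ups can take up to 15 minutes; check the transaction status first.",
     "support_log"),
]
jsonl = "\n".join(json.dumps(to_training_record(*p), ensure_ascii=False)
                  for p in pairs)
```

Keeping provenance on every record matters more than it looks: when a fine-tuned model starts giving a wrong answer, you need to find and remove the records that taught it.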


The data problem is the real problem

Here's the thing nobody tells you: the hard part of fine-tuning isn't the training. It's the data. Specifically, getting good labeled data in a domain where most knowledge lives in people's heads, in unstructured PDFs, or in legacy systems that predate modern data practices.

We spent more engineering time on data pipelines — collection, cleaning, annotation, quality filtering — than on model training. By a large margin. If you're planning to go down this road, budget accordingly.
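To give a flavor of that pipeline work, here's a stripped-down sketch of two typical stages: a rough script check (useful for routing transliterated vs. Devanagari text) and normalize-then-deduplicate filtering. Thresholds and function names are illustrative; real pipelines add near-duplicate detection, PII scrubbing, and annotation steps on top:

```python
import hashlib
import unicodedata

def is_devanagari_heavy(text: str, threshold: float = 0.3) -> bool:
    """Rough script check: share of Devanagari characters among letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    deva = sum(1 for c in letters if "\u0900" <= c <= "\u097F")
    return deva / len(letters) >= threshold

def clean_corpus(records: list[str], min_len: int = 10) -> list[str]:
    """Unicode-normalize, drop near-empty lines, exact-deduplicate."""
    seen, kept = set(), []
    for text in records:
        text = unicodedata.normalize("NFC", text).strip()
        if len(text) < min_len:
            continue  # too short to carry useful signal
        h = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if h in seen:
            continue  # exact duplicate of something already kept
        seen.add(h)
        kept.append(text)
    return kept
```

Unicode normalization is not optional for Nepali: the same Devanagari word can arrive in different codepoint sequences depending on the keyboard that produced it, and unnormalized text silently defeats deduplication.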

The other thing: evaluation is harder than training. You can tell when a model produces a grammatically correct sentence. You can't always tell when it produces a legally incorrect one. Domain-specific evaluation sets, built by people who understand the domain, are non-negotiable.
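One shape such an evaluation set can take: rubric checks written by domain experts, where each case lists facts the answer must state and claims it must never make. This is a deliberately crude sketch (the phrases and the sample response are made up), but it catches the "fluent and wrong" failure that fluency-style metrics miss:

```python
def score_response(response: str,
                   must_include: list[str],
                   must_avoid: list[str]) -> bool:
    """Pass only if every required fact appears and no forbidden claim does."""
    text = response.lower()
    return (all(p.lower() in text for p in must_include)
            and not any(p.lower() in text for p in must_avoid))

# Illustrative eval case for a transaction-limit question.
case = {
    "must_include": ["nepal rastra bank", "kyc"],
    "must_avoid": ["no limit"],
}

fluent_but_wrong = "There is no limit on transfers for any account."
grounded = "Per Nepal Rastra Bank KYC rules, daily limits apply to your tier."
```

A model can score perfectly on grammar and still fail every case like this — which is exactly the gap a domain eval set exists to expose.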

Where this leaves us

I'm not arguing that Nepal needs to train a frontier model from scratch — that's neither feasible nor necessary. What I'm arguing is that adaptation is not optional. Whether through fine-tuning, RLHF on domain-specific feedback, or more sophisticated RAG architectures, the work of making a model useful in a specific context is real work that can't be skipped.

The teams that figure this out first — that build the data flywheels, the evaluation infrastructure, the domain-adapted models — will have a durable advantage over those that are still prompt-engineering their way through 2026.

We're early. Nepal's AI ecosystem is thin. But that also means the surface area for impact is enormous. The models that understand how money moves here, in this language, under these regulations — those don't exist yet. That's the opportunity.

Yash Paudel
AI Researcher · Builder · Nepal