At Agora Software, we have chosen local inference to run our AI models. Why? Because this approach allows us to control our infrastructure, our data, and our costs. Discover our insights and experiences.
Why Local Inference?
Text generation, semantic analysis, entity extraction, multilingual document processing: for all of these workloads, a major technical choice led us in a radically different direction from most market players.
Instead of relying on cloud APIs (OpenAI, Anthropic, Google), we chose local inference. This means running language models directly on our own infrastructure, on our own servers.
This choice is not trivial. It requires managing:
- GPUs,
- Memory optimizations (VRAM),
- Load balancing,
- High availability.
But in return, it gives us something important: sovereignty and full control over the technical stack, right down to model inference.
In this article, we want to share our experience with this transition to give you some key insights.
Local Inference: Definition and Comparison of Solutions
Local Inference: Definition and Key Advantages
Local inference allows running language models (LLMs) directly on your servers, without relying on external APIs. It is a key solution for companies concerned with sovereignty and performance, particularly software publishers.
In practice, this means:
- Downloading models from public repositories (HuggingFace),
- Loading them into memory on GPU-equipped servers,
- Processing requests locally, without data leaving your infrastructure,
- Managing scalability, high availability, and performance yourself.
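As a rough sketch, the first two steps can be automated with the official `huggingface_hub` client. The repo id and directory layout below are illustrative, and the actual download of course requires network access:

```python
from pathlib import Path

def local_dir_for(repo_id: str, base: str = "models") -> Path:
    """Map a Hugging Face repo id (e.g. 'Qwen/Qwen2.5-7B-Instruct')
    to a local directory, replacing '/' so the name is filesystem-safe."""
    return Path(base) / repo_id.replace("/", "__")

def download_model(repo_id: str) -> Path:
    """Fetch all of a model repo's files locally (network access required)."""
    # snapshot_download pulls every file of the repo into local_dir.
    from huggingface_hub import snapshot_download
    target = local_dir_for(repo_id)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target

if __name__ == "__main__":
    # Illustrative repo id; pick whatever model fits your use case.
    print(download_model("Qwen/Qwen2.5-0.5B-Instruct"))
```

Once the files are on disk, the model never needs to touch the network again: loading and serving happen entirely inside your infrastructure.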
The fundamental difference with cloud APIs?
With local inference, your data never leaves your infrastructure. A major advantage for GDPR compliance and confidentiality.
Comparison of Local Inference Solutions
When embarking on local inference, three main families of solutions are available. Each addresses different needs.
Ollama is probably the most accessible solution to get started. Think of it as Docker for LLMs: a simple interface, one command, and your model runs locally.
Ollama handles downloading, configuration, and startup. It’s perfect for testing, development, and demos.
But as soon as you move to production, limitations become apparent:
- No fine control over memory allocation,
- Difficulty in precisely configuring inference parameters,
- Only basic multi-GPU management,
- Limited monitoring.
→ For a proof of concept, it’s ideal. For running a platform, it’s insufficient.
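For that proof-of-concept stage, Ollama exposes a local HTTP API on port 11434. A minimal sketch (the model name is illustrative, and it assumes the Ollama daemon is running and the model has already been pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint.
    stream=False requests a single JSON response instead of chunks."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """Send a prompt to a locally running Ollama daemon and return the text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires `ollama pull mistral` beforehand.
    print(generate("mistral", "Explain local inference in one sentence."))
```

One local call, no API key, no data leaving the machine: exactly what makes Ollama so convenient for demos.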
vLLM adopts a different philosophy: it’s an engine designed to serve LLMs with high concurrency.
Its PagedAttention mechanism allocates the KV cache in fixed-size blocks, much like virtual-memory paging, which reduces memory fragmentation and maximizes the number of simultaneous requests a GPU can serve.
This approach is effective in a specific context: high-end GPUs (A100, H100) with plenty of VRAM, models in fp16/bf16, and a high volume of concurrent requests to process.
→ vLLM is an excellent choice if you have high-end GPU infrastructure and aim for maximum throughput.
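To see why VRAM drives concurrency, a back-of-the-envelope KV-cache estimate helps. The figures below are for a hypothetical 7B-class model in fp16 (32 layers, 32 KV heads, head dimension 128), not any specific deployment:

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache per token: one key and one value vector
    per layer and per KV head, each of head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_concurrent_sequences(vram_free_bytes, context_len, **model):
    """Upper bound on simultaneous full-context sequences that fit in free VRAM."""
    per_seq = kv_cache_bytes_per_token(**model) * context_len
    return vram_free_bytes // per_seq

# Hypothetical 7B-class model in fp16 (2 bytes per element)
model = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2)
print(kv_cache_bytes_per_token(**model))                     # 524288 bytes ≈ 0.5 MiB/token
print(max_concurrent_sequences(40 * 2**30, 4096, **model))   # 20 sequences in 40 GiB free
```

At roughly 2 GiB of KV cache per 4096-token sequence, even an 80 GiB H100 fills up quickly, which is why vLLM's careful cache management pays off on high-end GPUs.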
llama.cpp is the low-level inference engine developed by Georgi Gerganov.
An optimized C++ project that runs quantized models with great efficiency.
You can run quantized models (Q4_K_M, Q5_K_S, etc.), adjust context size, KV cache, tensor splitting—in short, you have total control.
The result:
- Minimal VRAM footprint,
- Good performance,
- Absolute flexibility.
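A rough size estimate shows what quantization buys, and how a GGUF model can be loaded through the `llama-cpp-python` bindings. The bits-per-weight figure for Q4_K_M is an approximation, and the path and parameters are illustrative:

```python
def quantized_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk / in-VRAM size of a model's weights:
    parameters x bits per weight, converted to gigabytes."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model: fp16 vs. a ~4.5-bit quantization such as Q4_K_M
print(quantized_size_gb(7, 16))   # 14.0 GB
print(quantized_size_gb(7, 4.5))  # ~3.94 GB

def load_model(path: str):
    """Load a GGUF model via llama-cpp-python (parameters illustrative)."""
    from llama_cpp import Llama
    return Llama(
        model_path=path,     # e.g. a Q4_K_M .gguf file
        n_ctx=8192,          # context window size
        n_gpu_layers=-1,     # offload all layers to the GPU
    )
```

Dropping from 14 GB to about 4 GB of weights is what lets a 7B model run comfortably on a mid-range GPU, with VRAM left over for the KV cache.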
Our choice: take llama.cpp as a foundation and build an Agora layer on top. We call it Allama.cpp (A for Agora + llama.cpp).
Why Did We Make This Choice?
Technical mastery matters, but data sovereignty and confidentiality are what triggered our entire decision process.
Our Agora platform processes sensitive data:
- HR software containing personal information,
- Legal or financial information,
- Internal corporate communications,
- Data from local authorities.
For each of these use cases, routing information through third-party servers is simply impossible.
GDPR is clear: personal data of European citizens cannot be transferred outside the EU without appropriate safeguards. But even with “safeguards,” U.S. legal reality poses a problem.
The Case of the International Criminal Court
A recent example illustrates the risk: after U.S. sanctions announced in February 2025 against the ICC prosecutor, witnesses cited by the Associated Press indicated that Karim Khan lost access to his Microsoft email and had to migrate to Proton Mail.
More broadly, these sanctions show how actors subject to U.S. jurisdiction can be forced to adapt or suspend services, even when the organization operates in Europe.
It doesn’t matter if the servers are in Europe. It doesn’t matter what Microsoft’s commitments are regarding “digital sovereignty.” The U.S. CLOUD Act allows the U.S. government to demand data disclosure or service interruption for any company under U.S. jurisdiction.
Brad Smith, President of Microsoft, had declared a few weeks earlier: “In the unlikely event that a government orders us to suspend our cloud operations in Europe, we commit to vigorously challenging such a measure.” This commitment did not hold up under political pressure.
When questioned in the French Senate in June 2025, Microsoft France’s Director of Public Affairs was disarmingly frank: “No, I cannot guarantee” that the data of French citizens will never be transmitted to U.S. authorities without the agreement of French authorities.
This transparency is commendable. But it confirms a structural impossibility: as long as a company is under U.S. jurisdiction, it must comply with U.S. law, regardless of its promises.
Cost Control
Beyond sovereignty, there is an economic reality.
Cloud APIs charge per token. For small volumes, it’s convenient: no infrastructure to manage, you pay for what you consume. But as volumes increase, the equation changes radically.
Local inference requires an initial investment in infrastructure (servers, GPUs), but the marginal cost per request becomes acceptable once the investment is amortized.
And above all, you gain something important: budget predictability. No more unpleasant surprises at the end of the month because a client generated far more requests than expected. Your infrastructure cost is fixed and known in advance.
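The break-even point is simple arithmetic. The figures below are purely illustrative, not our actual costs:

```python
def break_even_tokens(monthly_infra_cost: float, api_price_per_mtok: float) -> float:
    """Monthly token volume above which a fixed-cost GPU server
    beats a per-token cloud API (all figures illustrative)."""
    return monthly_infra_cost / api_price_per_mtok * 1e6

# Hypothetical: 1500 EUR/month for a GPU server vs. an API at 3 EUR per million tokens
print(break_even_tokens(1500, 3))  # 500,000,000 tokens/month
```

Below that volume the API wins; above it, every additional token is effectively free on your own hardware, and the cost curve is flat instead of linear.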
Performance and Availability
Local inference brings concrete technical advantages.
Latency first. No round-trip to distant servers. No shared queue with thousands of other clients. Our GPUs process requests directly. For real-time use cases (conversational chat, etc.), this responsiveness makes a difference.
Availability next. We no longer depend on a third party for our critical service. No more “API is currently unavailable” blocking the entire platform. No more unpredictable rate limiting. We control our SLA end-to-end.
Of course, this means managing high availability, load balancing, and monitoring ourselves. But that’s precisely our expertise.
Total Technical Control
Local inference offers total technical freedom. We precisely choose which model to use. There are dozens of open-source models, each optimized for specific use cases: Llama-3, Mistral, Qwen for text; Qwen3-VL for vision; specialized NER models for entity extraction, etc.
We choose the level of quantization to maximize performance. We decide on the trade-off, depending on each use case.
We configure all parameters. We fine-tune system prompts without being limited by API constraints. We can even fine-tune models on our own data if necessary.
Additionally, we offer our clients the option to deploy models on their own GPUs if they wish.
Finally, we have complete monitoring of everything. Real-time VRAM usage. Number of active slots. Latency per request. Comprehensive logs. Performance metrics. Everything is observable, measurable, and optimizable.
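The kind of snapshot we watch can be sketched as a small data structure. The field names are hypothetical, not our actual metrics schema:

```python
from dataclasses import dataclass

@dataclass
class InferenceMetrics:
    """Point-in-time snapshot of one inference node (field names illustrative)."""
    vram_used_mb: int
    vram_total_mb: int
    active_slots: int     # requests currently being processed
    total_slots: int      # maximum concurrent requests the node accepts
    p95_latency_ms: float

    def vram_utilization(self) -> float:
        """Fraction of VRAM currently in use."""
        return self.vram_used_mb / self.vram_total_mb

    def slots_free(self) -> int:
        """Remaining capacity for new requests."""
        return self.total_slots - self.active_slots

m = InferenceMetrics(18_000, 24_000, 6, 8, 420.0)
print(f"VRAM {m.vram_utilization():.0%}, {m.slots_free()} slots free")
```

Feeding snapshots like this into a time-series dashboard is what turns "the GPU feels slow" into a measurable, fixable problem.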
How We Apply Local Inference
Local inference is at the core of all Agora platform solutions.
- Multilingual conversational chat: our application agents interact with users in French, Italian, or Spanish, on MS Teams, Google Chat, or WhatsApp.
- NER (Named Entity Recognition) for French: automatic extraction of entities in legal or business texts (names, dates, amounts, legal references).
- Semantic analysis and classification: intelligent routing of requests, intent detection, document classification.
- Content generation: drafting, reformulating, summarizing, translating.
Running an LLM on your laptop is simple. Running it in production with multiple GPUs, load balancing, and high availability is another story. In a future article, we’ll show you how we built Allama.cpp, our custom inference engine.
_____________
If you are a software publisher and wondering about the challenges of sovereignty, performance, and cost control related to AI, we hope this article has shed light on the advantages of local inference, a key approach to addressing these challenges.
At Agora Software, we develop conversational AI solutions dedicated to software publishers. We deploy multilingual, omnichannel application conversational interfaces to enhance the user experience of your applications and platforms.
Want to explore local inference for your AI projects? Discover in [Part 2] how we built Allama.cpp, our custom inference engine. Or contact us to discuss your needs.
If you enjoyed this article, you might also like: Agentic AI: Perfect Employee… or Ticking Time Bomb?
To follow our news, join us on our LinkedIn page.
Want to understand how our conversational AI optimizes productivity and user engagement by effectively complementing your applications?


