They said unquantized local AI was impossible on budget phones. We got a 2.3GB FP32 model running locally on a €120 Galaxy A25 CPU. No GPU, no NPU, uses less RAM than Chrome

The current meta in local AI is that you have to quantize. Big Tech is telling us that to run anything on the edge, we need to compress 2B+ models down to 4-bit, sacrifice the signal-to-noise ratio, and rely on flagship NPUs or Apple Silicon just to survive the memory bandwidth bottleneck.

We at Open Machine didn't buy it. So, we built a 245M parameter model from scratch, kept it in raw uncompressed 32-bit float (FP32), and ran it on the absolute worst hardware we could find: a two-year-old plastic Samsung Galaxy A25.

Attached is the raw screen recording. Airplane mode on.

The Specs:

  • Model: Open Machine 245M (Trained from scratch on 20B tokens)

  • Weights: 2.3GB pure FP32 (ONNX export)

  • Hardware: €120 Samsung A25 (Exynos 1280)

  • Compute: CPU ONLY. GPU is off. NPU is off (see the loading sketch right after this list).

  • RAM: ~4.4GB used (literally lighter than opening a few Chrome tabs).

  • Thermals: 33.3°C. Zero battery drain. No OOM crashes.
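
For anyone who wants to sanity-check the CPU-only part against their own ONNX export, the loading side looks roughly like this. This is a minimal sketch only: the file name, input shape, and the assumption that onnxruntime is used this way are mine, not the shipped APK code.

```python
# Minimal sketch: load an FP32 ONNX export and pin execution to the CPU provider.
# File name, input names, and shapes are placeholders, not the real model.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "open_machine_245m_fp32.onnx",        # hypothetical path to the 2.3GB export
    providers=["CPUExecutionProvider"],   # no GPU or NPU providers registered
)

# Feed a dummy token window just to confirm the graph runs end-to-end on CPU.
input_name = sess.get_inputs()[0].name
dummy_ids = np.zeros((1, 16), dtype=np.int64)  # shape depends on the export
logits = sess.run(None, {input_name: dummy_ids})[0]
print(logits.shape)
```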

The Elephant in the Room: 0.17 tokens/s

Yes, it is slow as hell right now. If it were running at 50 tok/s on a budget CPU in FP32, you guys would immediately (and rightfully) call BS and accuse us of hiding a 4-bit quantized lookup table or using an API.

This speed is because it's a heavily unoptimized Python loop forcing raw 32-bit math sequentially through a budget mobile CPU. We deliberately handicapped it to prove a point about physics: the memory wall is a routing problem, not a compression problem. If a budget Exynos chip can physically route a 2.3GB FP32 graph without the OS killing the process for memory or the battery melting, the architecture works. Writing a C++ kernel and dropping to FP16 will make it fly later.
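
To put 0.17 tok/s in perspective, here is a quick back-of-envelope; the bandwidth number is an assumed ballpark for a budget LPDDR4X phone, not a measured figure for the A25. The point it illustrates: the measured speed sits far below even a purely memory-bound ceiling, so the interpreter loop, not the memory wall, is the current bottleneck.

```python
# Back-of-envelope: where does 0.17 tok/s sit relative to a memory-bound ceiling?
weights_gb = 2.3               # FP32 graph that must be streamed per generated token
assumed_bandwidth_gbs = 10.0   # rough effective DRAM bandwidth for a budget SoC (assumption)

bandwidth_ceiling = assumed_bandwidth_gbs / weights_gb   # ~4.3 tok/s if purely memory-bound
measured = 0.17

print(f"memory-bound ceiling ~ {bandwidth_ceiling:.1f} tok/s")
print(f"measured             = {measured} tok/s")
print(f"gap ~ {bandwidth_ceiling / measured:.0f}x -> the Python loop, not the memory wall, is the limiter")
```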

How it fits without OOMing: We didn't compress the weights; we fixed the network topology. We're using what we call a "Synthetic Neural Engine" architecture. Instead of a vanilla dense transformer (it is still a transformer, just built our way), where you waste compute on 90% static noise, we proceduralize the weights. We store a semantic dictionary of primitives and a per-context recipe that dynamically reconstructs the exact full weight matrix W. We calculate exact attention but store only a compact state. Basically, we only compute the pure signal.
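
The post doesn't publish the actual recipe format, so the following is only a hedged sketch of the general idea: a shared dictionary of primitive blocks plus a per-layer recipe that rebuilds W on demand. Every name, shape, and number in it is hypothetical.

```python
# Hypothetical sketch of "proceduralized" weights: instead of storing every dense matrix,
# keep a small dictionary of primitive blocks plus a per-layer recipe that mixes them
# into the full W only at the moment the layer is executed, then discards it.
import numpy as np

rng = np.random.default_rng(0)
PRIMITIVES = rng.standard_normal((32, 64, 64)).astype(np.float32)  # shared dictionary

def reconstruct_weight(recipe):
    """recipe: list of (primitive_index, coefficient) pairs for one 64x64 block."""
    W = np.zeros((64, 64), dtype=np.float32)
    for idx, coeff in recipe:
        W += coeff * PRIMITIVES[idx]      # exact FP32 math, no quantization
    return W

def layer_forward(x, recipe):
    W = reconstruct_weight(recipe)        # materialized per call, then freed
    return x @ W                          # only one block's worth of W lives in RAM at once

x = rng.standard_normal((1, 64)).astype(np.float32)
y = layer_forward(x, recipe=[(3, 0.7), (17, -0.2), (25, 1.1)])
print(y.shape)
```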

The Benchmarks: Even though it was trained on only 20B tokens (a DCLM subset) for less than €1,000, this 245M model is already hitting 66% on PIQA and matching Meta's 350M model on logic benchmarks.

We built it anyway over the weekend and dropped this APK in their inbox today.

Stop letting Big Tech convince you that you need $7 Trillion and a massive server farm to solve edge logic.

Roast our Python loop, ask me about the math, and let me know what you think. I'll drop the Hugging Face benchmark links in the comments; you can download it and test it yourself.

I love it. Never say die. I myself just released the quantized Q8 and Q4 versions of an uncensored, abliterated, local, private Qwen 2.5-1.5B yesterday, iPhone compatible and currently running on my own iPhone 13, and I also built my entire security system with an autonomous AI agent loop via MCP. I released both here and on GitHub, and my iPhone-compatible models went from zero downloads on day one to 341 downloads in 24 hours, rising as we speak. And "budget" isn't even the word: I literally did all this for free, with only two months of experience in coding, architecture, and cybersecurity. Keep going, we're gonna change the world one autonomous AI agent at a time.

Interesting experiment with FP32, but 0.17 t/s is a computational dead end.

We solved the “edge logic” problem differently. Why force a mobile CPU to do 32-bit tensor math when you can use Local-First Orchestration?

My mobile node (ChatVTX) is only 6MB. It doesn't heat the battery, it doesn't crash, and it gives me full access to an 8B parameter model with sub-second latency via a direct Nitro-link.
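
For readers who haven't seen this pattern before, the thin-client side of such a setup is roughly the following. "Nitro-link" is not a public protocol, so a plain HTTP call to an OpenAI-compatible server on the LAN stands in for it here; the host, port, and model name are made up for illustration.

```python
# Hypothetical thin client: the phone only ships prompts to a model server on the
# local network and renders the reply; all tensor math happens on the remote node.
import json
import urllib.request

NODE_URL = "http://192.168.1.50:8080/v1/chat/completions"  # assumed LAN address

def ask(prompt: str) -> str:
    payload = json.dumps({
        "model": "local-8b",                                 # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(NODE_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:    # fails offline: no airplane mode
        return json.load(resp)["choices"][0]["message"]["content"]

print(ask("Summarize the thread so far."))
```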

Full project details, screenshots, and community discussion here (4PDA): [Link to your 4PDA thread]

(Note: The forum is in Russian, but the architecture and results speak for themselves. You can use a translator to check the technical logs).

Architecture beats brute force every time. While you’re celebrating 0.17 t/s, we are running full-scale OSINT swarms on the go.


Hey Uzer-namo, I checked out your 4PDA thread.

Your post explicitly states your model is: “работающая на моем сервере с RTX 3050” (running on my server with an RTX 3050).

Your 6MB app is a thin client/API wrapper sending network requests to a desktop GPU. That is cloud/remote hosting, not Edge AI. If you put your phone in Airplane Mode, your app stops working.

My video shows a 2.3GB uncompressed FP32 model running locally, on-device, in Airplane Mode, using zero network, processing exclusively on a budget mobile CPU. We are solving the physical memory bandwidth wall of mobile silicon without relying on external servers or quantization.

There is no quantization; that is the point. The model is 2.3GB and it is literally running on the phone. It is not a wrapper around some other model or anything similar. The model and the APK are one thing, not separate: when you install the APK, you get the model as well. Turn off the internet or anything else and it still works. The slow part is that we are running a heavy Python loop and unoptimized code. It's a POC, not the final product.

API wrappers are great, but bridging a network ping to a desktop GPU is not the same sport as running raw floating-point math directly on mobile silicon.

It's a nice project though… I love it!

That is a solid experiment! It actually brings back memories: about a year ago, I was running a similar setup with a full LLM deployed natively on the handset. I spent about a month optimizing it, and like you mentioned, it's impressive to see it functioning in total isolation (Airplane mode).

However, after pushing that architecture to its limits—hitting 80-90% CPU sustained loads and dealing with the inevitable thermal throttling—I realized that for high-stakes OSINT and autonomous agent loops, I needed a different paradigm.

That’s why I eventually moved toward the Local-First Orchestration I’m using now. By offloading the heavy lifting to a dedicated Nitro-node and keeping the mobile interface at a lean 6MB, I get the best of both worlds: the raw power of an 8B model and sub-second latency, without turning the phone into a pocket heater.

Always good to see others experimenting with true edge independence, though. Keep pushing the boundaries!

Yeah, the thing is, this is not completely a Transformer; it is a custom version of it. We re-wrote the whole stack, and it works more like a biological brain than a standard AI. It uses different math for the multiplications and for building the matrices, so there is a whole new architecture behind it. We will call it Post-Transformers, or the Synthetic Neural Engine. It's not the same thing.

Yeah, we also ran a 7B model on an ordinary 2GB graphics card with almost 4k tokens of context, but we are still experimenting; as a startup, we are in an early phase.

Engineering Insight: The Reality of On-Device Intelligence

"It is inspiring to see so many developers chasing the dream of a truly ‘living’ AI on a handheld device. However, as someone who has pushed mobile hardware to its absolute thermal and computational limits, I must share a hard-earned truth: Intelligence requires room to breathe.

Trying to run a high-reasoning, ‘living’ model directly on a mobile CPU is a bit like trying to run a data center on a smartphone battery. You might get it to work, but you’ll face three immediate walls:

  1. The Thermal Wall: Sustained high-logic tasks will throttle your CPU in minutes, turning your ‘intelligence’ into a slow, stuttering script.

  2. The Memory Bottleneck: To fit a model into 4GB–8GB of RAM, you have to compress it (quantize) so heavily that you lose the ‘soul’ of the reasoning—the very ‘living’ quality you are seeking.

  3. The Energy Tax: You can’t have autonomous agent loops if your device dies in two hours.

The real breakthrough isn’t in compression, it’s in orchestration.

My approach with NovBase was to stop fighting the hardware. Instead of forcing the phone to be the ‘brain,’ I turned it into the ‘eyes and ears’ (a lean 6MB interface), while the actual ‘living’ intelligence—a full-scale, uncompromised 8B model—runs on a dedicated local Nitro-node.

This is the only way to get sub-second responses and deep reasoning without sacrifices. Don’t let the marketing hype fool you: true edge intelligence isn’t about making the model smaller; it’s about making the architecture smarter."
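
Point 2 in the quoted post is easy to sanity-check with the standard bytes-per-parameter arithmetic (weights only, ignoring KV cache and runtime overhead), which is where the usual quantization pressure comes from:

```python
# Why people usually quantize: weights-only sizes for an 8B model at different precisions.
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param  # 1e9 params * bytes, expressed in GB

for label, params, bpp in [
    ("8B @ FP32", 8.0, 4.0),   # ~32 GB: hopeless on a phone
    ("8B @ FP16", 8.0, 2.0),   # ~16 GB: still far past 4-8 GB of RAM
    ("8B @ Q8",   8.0, 1.0),   # ~8 GB: borderline
    ("8B @ Q4",   8.0, 0.5),   # ~4 GB: the usual mobile target
]:
    print(f"{label}: ~{weights_gb(params, bpp):.1f} GB of weights")
```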

I completely agree that standard dense transformers hit a Thermal Wall and an Energy Tax; that is exactly why we had to throw away the vanilla transformer architecture and build the Synthetic Neural Engine.

And you are absolutely right: to fit a standard model into RAM, you have to quantize it to death. But that's exactly what my post addresses: we didn't quantize. The video proves the model is running in pure, uncompressed FP32, pulling only ~4.4GB RAM total, with the CPU sitting at a cool 33.3°C and zero battery drain.

We bypassed the thermal and memory walls not by offloading the compute to a desktop GPU over Wi-Fi, but by changing the fundamental math of the matrix multiplications so the mobile CPU only computes the pure signal.
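
The post never spells out what that changed math is, so the snippet below is only a generic illustration of the "compute only the signal" idea (a magnitude-thresholded matvec), not Open Machine's actual kernel; the matrix, threshold, and sparsity level are invented for the example.

```python
# Generic illustration only: skip contributions whose weights are effectively noise,
# so the CPU spends FLOPs on the dominant "signal" terms of each dot product.
import numpy as np

rng = np.random.default_rng(1)
# Toy weight matrix: mostly small "noise" plus ~10% large "signal" entries.
W_noise = 0.01 * rng.standard_normal((256, 256))
W_signal = np.where(rng.random((256, 256)) < 0.1, rng.standard_normal((256, 256)), 0.0)
W = (W_noise + W_signal).astype(np.float32)
x = rng.standard_normal(256).astype(np.float32)

def signal_only_matvec(W, x, threshold=0.05):
    out = np.zeros(W.shape[0], dtype=np.float32)
    for i in range(W.shape[0]):
        row = W[i]
        keep = np.abs(row) > threshold        # drop the ~90% near-zero "noise" terms
        out[i] = np.dot(row[keep], x[keep])   # exact FP32 math on the kept terms only
    return out

approx = signal_only_matvec(W, x)
exact = W @ x
print(np.max(np.abs(approx - exact)))         # small residual from the skipped noise terms
```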

Client-server orchestration ('Nitro-nodes') is a great workaround for standard models, and that's cool! But our goal isn't to work around the mobile hardware; it's to write better math so the mobile hardware can actually do the thinking itself.

"That is a very ambitious and impressive approach! It’s rare to see someone tackling the fundamental math of matrix multiplications to bypass the thermal wall on mobile hardware.

We have been working on a similar challenge with our project, but from a different architectural angle. Instead of pushing a heavy model to its limits, we’ve shifted toward a ‘Lean Core + Knowledge Swarm’ architecture. Our current setup uses a lightweight model (around 400MB) acting as an intelligent orchestrator, backed by a robust, pre-tuned framework that handles deep data extraction and synthesis from external sources in real-time.

This way, we keep the mobile CPU cool while maintaining high-level intelligence through efficient ‘Nitro-node’ logic rather than raw compute. It would be fascinating to compare notes on how your ‘Signal Math’ handles long-context reasoning compared to our ‘Swarm Search’ retrieval.

If you’d like to discuss these architectures or exchange ideas on mobile AI optimization, feel free to reach out here: @obn777bot (Telegram).

Keep pushing the boundaries — the world needs more ‘out-of-the-box’ thinking!"

Nice work.