
Why I'm Building a CPU That Runs .NET Natively

picoJava failed. Jazelle failed. Here's why CLI-CPU won't.
April 15, 2026 · Hocza József Szabolcs

The Question

What if your C# code ran directly on the silicon, with no JIT, no interpreter, no runtime? What if the CPU itself understood .NET bytecode — every ldc.i4, every call, every stloc — as its native instruction set?

This is not a thought experiment. I have been building exactly this for the past several months, and the reference implementation is already done: 48 opcodes, 259+ passing tests, a working linker, a CLI runner, and an ALU in Verilog with 41 cocotb tests green. The project is called CLI-CPU, and it is fully open source under CERN-OHL-S v2.
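To make "native CIL" concrete, here is a deliberately tiny stack-machine sketch in Python. It is purely illustrative — the real datapath is Verilog, and this toy covers only three of the 48 implemented opcodes — but the opcode names are real CIL:

```python
# Toy evaluation of a CIL-like instruction stream: what the hardware
# pipeline does in silicon, sketched in software for illustration.
# ldc.i4 pushes a constant, add pops two values and pushes their sum,
# stloc.0 pops the top of stack into local variable slot 0.

def run(program):
    stack, locals_ = [], [0]
    for op, *args in program:
        if op == "ldc.i4":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "stloc.0":
            locals_[0] = stack.pop()
        else:
            raise ValueError(f"unhandled opcode: {op}")
    return locals_[0]

# The C# statement `int x = 2 + 3;` compiles (roughly) to:
result = run([("ldc.i4", 2), ("ldc.i4", 3), ("add",), ("stloc.0",)])
print(result)  # 5
```

The point of the hardware is that no software loop like `run` exists at all: the fetch/decode/execute of each opcode is the pipeline itself.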

But before I explain what CLI-CPU is, let me tell you why everything that came before it failed — and why this time is different.

The Ghosts of Bytecode Hardware Past

In 1997, Sun Microsystems released picoJava — a processor that could execute Java bytecodes directly in hardware. The idea was compelling: skip the overhead of interpretation and JIT compilation, and run Java at native speed. Sun built two iterations (picoJava-I and picoJava-II), licensed the design, and waited for the world to adopt it.

The world did not. By 2001, HotSpot's JIT compiler was already outperforming picoJava on most workloads. The hardware bytecode execution could not keep up with a software JIT that could specialize, inline, and optimize for the actual runtime behavior of the program. Sun quietly discontinued the architecture.

ARM tried a different approach with Jazelle DBX (2001): instead of a dedicated bytecode CPU, they added a special execution mode to existing ARM cores that could interpret Java bytecodes with hardware acceleration. Jazelle was clever — it fell back to software for complex opcodes and only accelerated the common ones. It shipped in hundreds of millions of ARM9 and ARM11 chips.

And yet, Jazelle also failed. When ARM introduced Thumb-2 and later the Cortex-A series with increasingly sophisticated out-of-order pipelines, the JIT approach (Dalvik, then ART on Android) won decisively. Jazelle was deprecated, then effectively abandoned. The last ARM architecture to include it was ARMv7. ARMv8 (AArch64) dropped it entirely.

The lesson seemed clear: software JIT on general-purpose hardware will always beat dedicated bytecode hardware. The JIT can adapt; the silicon cannot. Case closed.

Except that lesson is wrong. Or rather, it is incomplete. picoJava and Jazelle failed not because bytecode-in-hardware is a bad idea — they failed because they competed on the wrong axis.

The Wrong Race

Both picoJava and Jazelle tried to beat general-purpose CPUs at single-core performance. This was always a losing battle. A modern out-of-order CPU with a sophisticated JIT compiler can specialize code paths, devirtualize calls, eliminate dead code, and perform thousands of microarchitectural tricks that a simple bytecode pipeline cannot match. The JIT sees runtime behavior; the hardware sees only static instructions.

But single-core performance stopped being the most important metric years ago. Dennard scaling ended around 2006. Clock frequencies plateaued. Power walls appeared. The entire industry pivoted to multi-core, and has been struggling with the consequences ever since.

The problem is that multi-core in the conventional sense — shared memory, cache coherence, mutex locks — does not scale. Adding cores to a shared-memory system follows Amdahl's Law at best and leads to catastrophic contention at worst. The MESI protocol (and its descendants) that keeps caches coherent across cores consumes an increasingly absurd fraction of die area, power budget, and design complexity as core counts grow. Intel's mesh interconnect on server parts, AMD's Infinity Fabric, Apple's system-level cache — these are all heroic engineering efforts to paper over a fundamental architectural flaw: shared mutable state does not scale.
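Amdahl's Law makes the ceiling quantitative. A quick sketch (the serial fraction here is an illustrative number, not a measurement of any real chip — and note that coherence contention adds overhead on top of this idealized bound):

```python
# Amdahl's Law: speedup(N) = 1 / (s + (1 - s) / N),
# where s is the serial (non-parallelizable) fraction of the work.
def amdahl_speedup(serial_fraction, cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# Even a modest 5% serial fraction caps speedup at 1/0.05 = 20x,
# no matter how many cores you add:
for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.05, n), 1))
```

A shared-nothing fabric does not repeal Amdahl's Law for inherently serial algorithms; the claim is narrower — that message-passing actors avoid the *additional* coherence and lock contention that pushes real shared-memory systems below even this theoretical curve.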

This is where CLI-CPU enters — not as a faster single-core bytecode processor, but as a fundamentally different architecture.

Shared-Nothing: The Only Way Forward

CLI-CPU does not have shared memory. Period. Each core has its own private SRAM — code, data, and stack — and no other core can access it. This is not a software convention; it is a physical property of the silicon. There are no bus lines connecting one core's memory to another.

Cores communicate exclusively through hardware mailbox FIFOs. A core writes a 32-bit message to its outgoing mailbox; the hardware delivers it to the destination core's incoming mailbox. The sending core does not block. The receiving core wakes from sleep when a message arrives, processes it, and goes back to sleep. This is the actor model in silicon.
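As a software analogy, the mailbox pattern looks like this — Python threads and queues standing in for cores and hardware FIFOs, purely for illustration:

```python
import queue
import threading

# Each "core" owns private state plus a mailbox; the only way to
# affect it is to send a message. Queues model the hardware FIFOs.
def worker_core(mailbox, results):
    total = 0                # private state: no other core can touch it
    while True:
        msg = mailbox.get()  # "sleep" until a message arrives
        if msg is None:      # shutdown sentinel (illustrative convention)
            results.put(total)
            return
        total += msg         # process, then go back to waiting

mailbox, results = queue.Queue(), queue.Queue()
core = threading.Thread(target=worker_core, args=(mailbox, results))
core.start()
for value in (1, 2, 3):
    mailbox.put(value)       # non-blocking send, like the hardware FIFO
mailbox.put(None)
core.join()
final_total = results.get()
print(final_total)  # 6
```

The difference on CLI-CPU is that the queue, the wake-up, and the isolation are not library code subject to scheduler overhead — they are wires.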

"Current software systems are built on fundamentally flawed models. We need hardware where every core is an actor."
— Joe Armstrong, creator of Erlang, 2014

The consequences of this design are profound:

- No cache coherence protocol, because there are no shared caches to keep coherent.
- Scaling that is linear in core count, because cores never contend for each other's memory.
- Near-zero idle power, because a core with an empty mailbox simply sleeps.
- Physical security isolation, because a compromised core has no bus to any other core's memory.

This is not a new idea in software. Erlang has been doing this since 1986. Akka (2009) and Akka.NET (2014) brought it to the JVM and .NET ecosystems. Orleans (2015) made it accessible to mainstream C# developers. The actor model has proven itself in production at massive scale — WhatsApp (Erlang), Discord (Elixir), Microsoft Azure (Orleans), and countless others.

But all of these run on conventional shared-memory hardware, which means the runtime must simulate message isolation through software abstractions — thread pools, mailbox queues, GC pressure, scheduler overhead. CLI-CPU removes that entire layer. The hardware is the runtime.

Why Now?

picoJava in 1997 had none of the conditions needed for this approach to work. In 2026, every single one has materialized:

- Single-core performance is no longer the axis that decides the race; Dennard scaling ended and the whole industry went multi-core.
- The actor model has been proven in production at massive scale, from Erlang to Orleans.
- The .NET instruction set is an open standard (ECMA-335), not a proprietary moving target.
- Open-source silicon is real: the Sky130 process and licenses like CERN-OHL-S v2 put tape-outs within reach of small teams.

What We Have Built So Far

CLI-CPU is not a whitepaper. It is not a "vision document" with handwaving about future possibilities. It is working code, tested code, and — increasingly — working hardware descriptions.

Phase F1.5 is complete. This includes:

- A reference simulator implementing 48 CIL opcodes, with 259+ passing tests.
- A working linker and a CLI runner.
- An ALU in Verilog, with 41 cocotb tests green.

The entire codebase — simulator, linker, runner, tests, Verilog, cocotb, documentation — is open source on GitHub.

The Cognitive Fabric One Vision

The target chip — Cognitive Fabric One — is designed for the Sky130 process at approximately 15 mm²:

| | Cognitive Fabric One | Typical RISC-V SoC |
|---|---|---|
| Cores | 6 Rich + 16 Nano + 1 Secure | 4-8 homogeneous |
| Memory model | Private SRAM per core, no shared memory | Shared L1/L2 cache hierarchy |
| Inter-core comm | Hardware mailbox FIFOs (OPI bus) | Shared memory + cache coherence |
| Coherence overhead | Zero | Significant (MESI/directory) |
| Scaling behavior | Linear with core count | Sub-linear (Amdahl's Law) |
| Idle power | Near-zero (sleep until mailbox) | Significant (cache, coherence logic) |
| Security isolation | Physical (no shared memory) | Software (TrustZone, TEE) |
| ISA | CIL (stack, compact) | RISC-V (register, wider encoding) |
| I/O | USB 1.1 FS, UART, SPI, GPIO | Varies |

The Rich cores handle supervisor logic, complex domain code, GC-managed objects, floating point, and exception handling — the full ECMA-335 CIL instruction set (~220 opcodes). The Nano cores are tiny (~10k standard cells each) and run the CIL-T0 integer subset — ideal for workers, neurons, filters, simple actors. The Secure Core is a hardened Rich core with crypto acceleration, TRNG, and PUF — the trust anchor for the entire fabric.

This heterogeneous design is analogous to ARM big.LITTLE or Apple's P-core + E-core, but applied to the .NET world. A C# developer annotates their classes with [RunsOn(CoreType.Nano)] or [RunsOn(CoreType.Rich)], and a Roslyn source generator verifies at compile time that Nano-targeted code stays within the CIL-T0 subset.
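The compile-time check amounts to a set-membership test over a method's opcodes. Here is a sketch in Python — the real check is a Roslyn source generator operating on C#, and the opcode list below is illustrative, not the actual CIL-T0 definition:

```python
# Hypothetical verifier: a Nano-targeted method may use only a
# restricted integer opcode subset. This opcode list is illustrative,
# not the real CIL-T0 specification.
CIL_T0 = {"ldc.i4", "ldloc", "stloc", "add", "sub", "mul",
          "br", "brtrue", "call", "ret"}

def verify_nano(method_name, opcodes):
    illegal = sorted(set(opcodes) - CIL_T0)
    if illegal:
        raise TypeError(
            f"{method_name} targets a Nano core but uses "
            f"non-CIL-T0 opcodes: {', '.join(illegal)}")

verify_nano("Accumulate", ["ldloc", "ldc.i4", "add", "stloc", "ret"])  # ok
try:
    verify_nano("Blend", ["ldc.r4", "mul", "ret"])  # floats are Rich-only
except TypeError as e:
    print(e)
```

Because the check runs at compile time, a method that would trap on a Nano core never reaches the silicon in the first place.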

Security: Immune by Design

Every major CPU vulnerability of the past decade — Spectre, Meltdown, L1 Terminal Fault, Rowhammer, Retbleed, Inception — exploits one of three things: speculative execution, shared caches, or shared memory. CLI-CPU has none of these.

On top of this, the CIL execution model provides hardware-enforced type safety and control flow integrity. Every memory access is bounds-checked in hardware. Every method call is verified against its declared signature. Stack overflow and underflow generate hardware traps. ROP and JOP attacks are impossible because the hardware enforces that execution can only transfer to valid method entry points.
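In software terms, it is as if every load and store were wrapped like this — illustrative Python, whereas in CLI-CPU the check is combinational logic, not code:

```python
class HardwareTrap(Exception):
    """Models the fault a core raises; software cannot suppress it."""

class PrivateSram:
    # Models one core's private memory: every access is bounds-checked.
    def __init__(self, words):
        self._mem = [0] * words

    def load(self, addr):
        if not 0 <= addr < len(self._mem):
            raise HardwareTrap(f"load from invalid address {addr}")
        return self._mem[addr]

    def store(self, addr, value):
        if not 0 <= addr < len(self._mem):
            raise HardwareTrap(f"store to invalid address {addr}")
        self._mem[addr] = value

sram = PrivateSram(words=256)
sram.store(10, 0xBEEF)
print(hex(sram.load(10)))   # 0xbeef
try:
    sram.load(4096)          # out-of-bounds read: trapped, not leaked
except HardwareTrap as e:
    print(e)
```

On a conventional CPU the equivalent check is either compiled-in (and bypassable via memory-safety bugs) or absent; here the out-of-bounds path physically does not exist.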

This is not security bolted on after the fact. This is security as a physical property of the silicon. You cannot patch around it, and you cannot exploit around it.

What Comes Next

The roadmap from here is concrete.

The entire project — hardware, software, documentation — is and will remain open source under CERN-OHL-S v2.

// join the journey

CLI-CPU is open source. The simulator runs today. The silicon path is funded and planned. If you believe that the future of computing is many small cores communicating through messages — not fewer big cores fighting over shared memory — then this project is for you.