The Question
What if your C# code ran directly on the silicon, with no JIT, no interpreter, no runtime? What if the CPU itself understood .NET bytecode — every ldc.i4, every call, every stloc — as its native instruction set?
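To make this concrete, here is a tiny C# method together with, in comments, the CIL that Roslyn emits for it: the kind of instruction stream a CIL-native CPU would fetch directly. The listing is approximate; exact opcodes vary with compiler settings.

```csharp
using System;

// A small method and (in comments) the CIL Roslyn emits for it.
// Debug-style codegen shown; Release builds may use shorter encodings.
int SumPlusTen(int a, int b)
{
    int c = a + b + 10;
    return c;
    // IL (approximate):
    //   ldarg.0      // push a
    //   ldarg.1      // push b
    //   add          // pop two, push a+b
    //   ldc.i4.s 10  // push the constant 10
    //   add
    //   stloc.0      // pop into local c
    //   ldloc.0      // push c back
    //   ret          // return top of stack
}

Console.WriteLine(SumPlusTen(3, 4));  // → 17
```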
This is not a thought experiment. I have been building exactly this for the past several months, and the reference implementation is already done: 48 opcodes, 259+ passing tests, a working linker, a CLI runner, and an ALU in Verilog with 41 cocotb tests green. The project is called CLI-CPU, and it is fully open source under CERN-OHL-S v2.
But before I explain what CLI-CPU is, let me tell you why everything that came before it failed — and why this time is different.
The Ghosts of Bytecode Hardware Past
In 1997, Sun Microsystems released picoJava — a processor that could execute Java bytecodes directly in hardware. The idea was compelling: skip the overhead of interpretation and JIT compilation, and run Java at native speed. Sun built two iterations (picoJava-I and picoJava-II), licensed the design, and waited for the world to adopt it.
The world did not. By 2001, HotSpot's JIT compiler was already outperforming picoJava on most workloads. The hardware bytecode execution could not keep up with a software JIT that could specialize, inline, and optimize for the actual runtime behavior of the program. Sun quietly discontinued the architecture.
ARM tried a different approach with Jazelle DBX (2001): instead of a dedicated bytecode CPU, they added a special execution mode to existing ARM cores that could interpret Java bytecodes with hardware acceleration. Jazelle was clever — it fell back to software for complex opcodes and only accelerated the common ones. It shipped in hundreds of millions of ARM9 and ARM11 chips.
And yet, Jazelle also failed. When ARM introduced Thumb-2 and later the Cortex-A series with increasingly sophisticated out-of-order pipelines, the JIT approach (Dalvik, then ART on Android) won decisively. Jazelle was deprecated, then effectively abandoned. The last ARM architecture to include it was ARMv7. ARMv8 (AArch64) dropped it entirely.
The lesson seemed clear: software JIT on general-purpose hardware will always beat dedicated bytecode hardware. The JIT can adapt; the silicon cannot. Case closed.
Except that lesson is wrong. Or rather, it is incomplete. picoJava and Jazelle failed not because bytecode-in-hardware is a bad idea — they failed because they competed on the wrong axis.
The Wrong Race
Both picoJava and Jazelle tried to beat general-purpose CPUs at single-core performance. This was always a losing battle. A modern out-of-order CPU with a sophisticated JIT compiler can specialize code paths, devirtualize calls, eliminate dead code, and perform thousands of microarchitectural tricks that a simple bytecode pipeline cannot match. The JIT sees runtime behavior; the hardware sees only static instructions.
But single-core performance stopped being the most important metric years ago. Dennard scaling ended around 2006. Clock frequencies plateaued. Power walls appeared. The entire industry pivoted to multi-core, and has been struggling with the consequences ever since.
The problem is that multi-core in the conventional sense — shared memory, cache coherence, mutex locks — does not scale. Adding cores to a shared-memory system follows Amdahl's Law at best and leads to catastrophic contention at worst. The MESI protocol (and its descendants) that keeps caches coherent across cores consumes an increasingly absurd fraction of die area, power budget, and design complexity as core counts grow. Intel's mesh interconnect on server parts, AMD's Infinity Fabric, Apple's system-level cache — these are all heroic engineering efforts to paper over a fundamental architectural flaw: shared mutable state does not scale.
This is where CLI-CPU enters — not as a faster single-core bytecode processor, but as a fundamentally different architecture.
Shared-Nothing: The Only Way Forward
CLI-CPU does not have shared memory. Period. Each core has its own private SRAM — code, data, and stack — and no other core can access it. This is not a software convention; it is a physical property of the silicon. There are no bus lines connecting one core's memory to another.
Cores communicate exclusively through hardware mailbox FIFOs. A core writes a 32-bit message to its outgoing mailbox; the hardware delivers it to the destination core's incoming mailbox. The sending core does not block. The receiving core wakes from sleep when a message arrives, processes it, and goes back to sleep. This is the actor model in silicon.
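In software terms, the mailbox contract above (a bounded FIFO of 32-bit words, fire-and-forget send, receiver parked until a word arrives) can be modeled with .NET channels. This is an illustrative analogy of the semantics, not project code: a real core wakes on a hardware interrupt rather than awaiting a task.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Model of one core's incoming mailbox: a bounded FIFO of 32-bit words.
var mailbox = Channel.CreateBounded<uint>(capacity: 16);

// "Receiving core": parked until a word arrives, handles it, parks again.
async Task<uint> CoreLoop(ChannelReader<uint> inbox)
{
    uint acc = 0;
    await foreach (uint msg in inbox.ReadAllAsync())
        acc += msg;                    // the per-message handler
    return acc;
}

Task<uint> core = CoreLoop(mailbox.Reader);

// "Sending core": fire-and-forget; TryWrite does not block the sender.
for (uint i = 1; i <= 4; i++)
    mailbox.Writer.TryWrite(i);
mailbox.Writer.Complete();             // demo-only: hardware FIFOs never "complete"

uint total = await core;
Console.WriteLine(total);              // → 10
```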
"Current software systems are built on fundamentally flawed models. We need hardware where every core is an actor."
— Joe Armstrong, creator of Erlang, 2014
The consequences of this design are profound:
- No cache coherence protocol. Zero. None. No MESI, no MOESI, no directory-based coherence. This alone saves enormous die area and power.
- No locks, no mutexes, no atomics. There is no shared state to protect. Data races are physically impossible.
- Linear scaling. Adding more cores adds proportionally more compute. There is no Amdahl bottleneck because there is no serial section — each core is independent.
- Deterministic execution. Each core's behavior depends only on its code and its incoming messages. No timing-dependent bugs, no heisenbugs, no memory ordering surprises.
- Event-driven power profile. Cores sleep by default. They wake on mailbox interrupt, process the message, and sleep again. Idle power approaches zero.
This is not a new idea in software. Erlang has been doing this since 1986. Akka (2009) and Akka.NET (2014) brought it to the JVM and .NET ecosystems. Orleans (2015) made it accessible to mainstream C# developers. The actor model has proven itself in production at massive scale — WhatsApp (Erlang), Discord (Elixir), Microsoft Azure (Orleans), and countless others.
But all of these run on conventional shared-memory hardware, which means the runtime must simulate message isolation through software abstractions — thread pools, mailbox queues, GC pressure, scheduler overhead. CLI-CPU removes that entire layer. The hardware is the runtime.
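To see what that software layer amounts to, here is the minimum machinery even a toy actor needs on shared-memory hardware: a message queue, a scheduler loop, and the unenforced convention that only the actor touches its own state. On CLI-CPU, the queue becomes a hardware FIFO and the convention becomes physics. Illustrative sketch only.

```csharp
using System;
using System.Collections.Generic;

// Minimal software actor: private state plus a FIFO inbox, processed one
// message at a time. On shared-memory hardware the isolation is a mere
// convention; the queue and loop below are overhead the runtime pays.
var inbox = new Queue<int>();   // the software "mailbox"
int counter = 0;                // private state: only this actor touches it

void Deliver(int msg) => inbox.Enqueue(msg);

void RunUntilIdle()             // the software "scheduler"
{
    while (inbox.Count > 0)
        counter += inbox.Dequeue();  // the handler, one message at a time
}

Deliver(5);
Deliver(7);
RunUntilIdle();
Console.WriteLine(counter);     // → 12
```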
Why Now?
picoJava in 1997 had none of the conditions needed for this approach to work. In 2026, every single one has materialized:
- Open PDKs. Google and SkyWater released the Sky130 process design kit in 2020. IHP followed with SG13G2. For the first time in history, anyone can design and fabricate a chip without paying millions in licensing fees. The entire CLI-CPU silicon path uses open tools: Yosys for synthesis, OpenLane2 for place-and-route, Magic for DRC/LVS.
- Tiny Tapeout. Matt Venn's Tiny Tapeout project has democratized chip fabrication. For approximately $150, you can get a small design onto real Sky130 silicon. CLI-CPU's first physical chip (a single Nano core + hardware mailbox) will tape out through this program.
- Affordable FPGAs. The MicroPhase A7-Lite board with a Xilinx XC7A200T — 200,000 logic cells — costs about $320. Three of these boards connected via Ethernet can host a multi-board Cognitive Fabric mesh for under $1,000. This was unthinkable a decade ago.
- The actor model is mainstream. Akka.NET, Orleans, Dapr, Proto.Actor — the .NET ecosystem has embraced message-passing concurrency. Eight million .NET developers already write code that maps naturally to CLI-CPU's execution model. They do not need to learn a new language or paradigm; they just need hardware that runs it natively.
- Dennard scaling is dead. Clock frequencies have been flat for 20 years. The only path to more performance is more cores. But shared-memory multi-core has hit diminishing returns. Shared-nothing many-core is the logical next step, and CLI-CPU is designed from the ground up for it.
What We Have Built So Far
CLI-CPU is not a whitepaper. It is not a "vision document" with handwaving about future possibilities. It is working code, tested code, and — increasingly — working hardware descriptions.
Phase F1.5 is complete. This includes:
- A reference simulator (`TCpu`) implementing all 48 CIL-T0 opcodes — the integer subset of ECMA-335 CIL, plus mailbox MMIO extensions. The simulator models the full memory map: CODE, DATA, STACK, and MMIO regions.
- A CIL-T0 linker (`TCliCpuLinker`) that takes a standard .NET assembly (a .dll compiled by Roslyn) and repackages it into a flat binary (.t0) that the hardware can boot from. No translation — the CIL opcodes in the .dll are the same ones the CPU executes.
- A CLI runner with `run` and `link` commands.
- 259+ xUnit tests covering every opcode, every trap condition, every edge case. Strict TDD — every test was written before its implementation.
- An ALU module in Verilog with a full cocotb testbench (41/41 tests passing). This is the beginning of the RTL phase (F2).
The entire codebase — simulator, linker, runner, tests, Verilog, cocotb, documentation — is open source on GitHub.
The Cognitive Fabric One Vision
The target chip — Cognitive Fabric One — is designed for the Sky130 process at approximately 15 mm²:
| | Cognitive Fabric One | Typical RISC-V SoC |
|---|---|---|
| Cores | 6 Rich + 16 Nano + 1 Secure | 4-8 homogeneous |
| Memory model | Private SRAM per core, no shared memory | Shared L1/L2 cache hierarchy |
| Inter-core comm | Hardware mailbox FIFOs (OPI bus) | Shared memory + cache coherence |
| Coherence overhead | Zero | Significant (MESI/directory) |
| Scaling behavior | Linear with core count | Sub-linear (Amdahl's Law) |
| Idle power | Near-zero (sleep until mailbox) | Significant (cache, coherence logic) |
| Security isolation | Physical (no shared memory) | Software (TrustZone, TEE) |
| ISA | CIL (stack, compact) | RISC-V (register, wider encoding) |
| I/O | USB 1.1 FS, UART, SPI, GPIO | Varies |
The Rich cores handle supervisor logic, complex domain code, GC-managed objects, floating point, and exception handling — the full ECMA-335 CIL instruction set (~220 opcodes). The Nano cores are tiny (~10k standard cells each) and run the CIL-T0 integer subset — ideal for workers, neurons, filters, simple actors. The Secure Core is a hardened Rich core with crypto acceleration, TRNG, and PUF — the trust anchor for the entire fabric.
This heterogeneous design is analogous to ARM big.LITTLE or Apple's P-core + E-core, but applied to the .NET world. A C# developer annotates their classes with [RunsOn(CoreType.Nano)] or [RunsOn(CoreType.Rich)], and a Roslyn source generator verifies at compile time that Nano-targeted code stays within the CIL-T0 subset.
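As a sketch of what such an annotation could look like in source, with the caveat that the `CoreType` and `RunsOnAttribute` definitions below are illustrative reconstructions of the attribute named above, not the project's actual code:

```csharp
using System;

// Hypothetical reconstructions of the [RunsOn] annotation described in the
// text; the project's real definitions may differ.
enum CoreType { Nano, Rich, Secure }

[AttributeUsage(AttributeTargets.Class)]
sealed class RunsOnAttribute : Attribute
{
    public CoreType Core { get; }
    public RunsOnAttribute(CoreType core) => Core = core;
}

// Integer-only worker: must stay within the CIL-T0 subset, which a Roslyn
// source generator would verify at compile time before this class is
// allowed to target a Nano core.
[RunsOn(CoreType.Nano)]
class ThresholdFilter
{
    public int Apply(int sample, int threshold)
        => sample > threshold ? sample : 0;
}
```

The value of the attribute is that the CIL-T0 constraint becomes a compile-time error rather than a runtime trap: code that uses floating point or objects simply cannot be tagged `CoreType.Nano`.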
Security: Immune by Design
Every major CPU vulnerability of the past decade — Spectre, Meltdown, L1 Terminal Fault, Rowhammer, Retbleed, Inception — exploits one of three things: speculative execution, shared caches, or shared memory. CLI-CPU has none of these.
- No speculative execution. CLI-CPU cores are simple, in-order, deterministic. There is no branch predictor to poison, no speculative load to leak data through a side channel.
- No shared cache. Each core has its own private SRAM. There is no L2, no L3, no system-level cache. No Flush+Reload, no Prime+Probe, no cache timing attacks.
- No shared memory. Cross-core information flow is limited to explicit mailbox messages. One compromised core cannot read another core's state — not through software, not through microarchitectural side channels, not through anything short of physical decapsulation.
On top of this, the CIL execution model provides hardware-enforced type safety and control flow integrity. Every memory access is bounds-checked in hardware. Every method call is verified against its declared signature. Stack overflow and underflow generate hardware traps. ROP and JOP attacks are impossible because the hardware enforces that execution can only transfer to valid method entry points.
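The bounds-checking claim has a familiar software analogue: stock .NET inserts the same check in software and throws, whereas CLI-CPU's design raises a hardware trap. Either way the out-of-range read can never silently succeed.

```csharp
using System;

// On stock .NET the runtime checks every array access in software and
// throws; the text's claim is that CLI-CPU performs the equivalent check
// in hardware and traps. In neither model can the bad read return data.
int[] buf = new int[4];
string outcome;
try
{
    int x = buf[7];            // out of bounds: checked on every access
    outcome = "read " + x;     // unreachable
}
catch (IndexOutOfRangeException)
{
    outcome = "trapped";
}
Console.WriteLine(outcome);    // → trapped
```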
This is not security bolted on after the fact. This is security as a physical property of the silicon. You cannot patch around it, and you cannot exploit around it.
What Comes Next
The roadmap from here is concrete:
- F2 — RTL: Complete the Verilog/Amaranth HDL implementation of a single Nano core. Verify every opcode against the C# simulator using cocotb golden-vector testing. Synthesize for Sky130 with Yosys.
- F2.7 — FPGA Validation: Run the single Nano core on a real FPGA (MicroPhase A7-Lite XC7A200T) before committing to silicon. Fibonacci(20) over UART on real hardware. Principle: no tape-out without FPGA validation.
- F3 — First Silicon: Tape out the FPGA-validated Nano core + hardware mailbox on Tiny Tapeout. Run Fibonacci(20) and an "echo neuron" demo on real silicon. The first physical CIL-native processor.
- F4 — Cognitive Fabric Pivot: Demonstrate 4 Nano cores on FPGA, communicating via hardware mailboxes in a shared-nothing event-driven fabric. Prove that the architecture scales.
- F5 — Rich Core: Add the full CIL instruction set — objects, GC assist, exception handling, FPU. Demonstrate heterogeneous Nano + Rich operation.
- F6 — Cognitive Fabric One: The target chip. First on FPGA (multi-board mesh), then on real silicon via ChipIgnite or IHP MPW.
- F7 — Neuron OS: An actor-based operating system designed from the ground up for shared-nothing hardware. No kernel. No syscalls. No virtual memory. Just actors communicating through messages.
The entire project — hardware, software, documentation — is and will remain open source under CERN-OHL-S v2.
// join the journey
CLI-CPU is open source. The simulator runs today. The silicon path is funded and planned. If you believe that the future of computing is many small cores communicating through messages — not fewer big cores fighting over shared memory — then this project is for you.