The Question
What if your C# code ran directly on the silicon, with no JIT, no interpreter, no runtime? What if the CPU itself understood .NET bytecode — every ldc.i4, every call, every stloc — as its native instruction set?
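To make this concrete, here is a tiny C# method together with, in comments, the CIL that Roslyn emits for it: the kind of instruction stream a CIL-native CPU would fetch directly. The listing is approximate; exact opcodes vary with compiler settings.

```csharp
using System;

// A small method and (in comments) the CIL Roslyn emits for it.
// Debug-style codegen shown; Release builds may use shorter encodings.
int SumPlusTen(int a, int b)
{
    int c = a + b + 10;
    return c;
    // IL (approximate):
    //   ldarg.0      // push a
    //   ldarg.1      // push b
    //   add          // pop two, push a+b
    //   ldc.i4.s 10  // push the constant 10
    //   add
    //   stloc.0      // pop into local c
    //   ldloc.0      // push c back
    //   ret          // return top of stack
}

Console.WriteLine(SumPlusTen(3, 4));  // → 17
```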
This is not a thought experiment. I have been building exactly this for the past several months, and the reference implementation is already done: 48 opcodes, 259+ passing tests, a working linker, a CLI runner, and an ALU in Verilog with 41 cocotb tests green. The project is called CLI-CPU, and it is fully open source under CERN-OHL-S v2.
But before I explain what CLI-CPU is, let me tell you why everything that came before it failed — and why this time is different.
The Ghosts of Bytecode Hardware Past
In 1997, Sun Microsystems released picoJava — a processor that could execute Java bytecodes directly in hardware. The idea was compelling: skip the overhead of interpretation and JIT compilation, and run Java at native speed. Sun built two iterations (picoJava-I and picoJava-II), licensed the design, and waited for the world to adopt it.
The world did not. By 2001, HotSpot's JIT compiler was already outperforming picoJava on most workloads. The hardware bytecode execution could not keep up with a software JIT that could specialize, inline, and optimize for the actual runtime behavior of the program. Sun quietly discontinued the architecture.
ARM tried a different approach with Jazelle DBX (2001): instead of a dedicated bytecode CPU, they added a special execution mode to existing ARM cores that could interpret Java bytecodes with hardware acceleration. Jazelle was clever — it fell back to software for complex opcodes and only accelerated the common ones. It shipped in hundreds of millions of ARM9 and ARM11 chips.
And yet, Jazelle also failed. When ARM introduced Thumb-2 and later the Cortex-A series with increasingly sophisticated out-of-order pipelines, the JIT approach (Dalvik, then ART on Android) won decisively. Jazelle was deprecated, then effectively abandoned. The last ARM architecture to include it was ARMv7. ARMv8 (AArch64) dropped it entirely.
The lesson seemed clear: software JIT on general-purpose hardware will always beat dedicated bytecode hardware. The JIT can adapt; the silicon cannot. Case closed.
Except that lesson is wrong. Or rather, it is incomplete. picoJava and Jazelle failed not because bytecode-in-hardware is a bad idea — they failed because they competed on the wrong axis.
The Wrong Race
Both picoJava and Jazelle tried to beat general-purpose CPUs at single-core performance. This was always a losing battle. A modern out-of-order CPU with a sophisticated JIT compiler can specialize code paths, devirtualize calls, eliminate dead code, and perform thousands of microarchitectural tricks that a simple bytecode pipeline cannot match. The JIT sees runtime behavior; the hardware sees only static instructions.
But single-core performance stopped being the most important metric years ago. Dennard scaling ended around 2006. Clock frequencies plateaued. Power walls appeared. The entire industry pivoted to multi-core, and has been struggling with the consequences ever since.
The problem is that multi-core in the conventional sense — shared memory, cache coherence, mutex locks — does not scale. Adding cores to a shared-memory system follows Amdahl's Law at best and leads to catastrophic contention at worst. The MESI protocol (and its descendants) that keeps caches coherent across cores consumes an increasingly absurd fraction of die area, power budget, and design complexity as core counts grow. Intel's mesh interconnect on server parts, AMD's Infinity Fabric, Apple's system-level cache — these are all heroic engineering efforts to paper over a fundamental architectural flaw: shared mutable state does not scale.
This is where CLI-CPU enters — not as a faster single-core bytecode processor, but as a fundamentally different architecture.
Shared-Nothing: The Only Way Forward
CLI-CPU does not have shared memory. Period. Each core has its own private SRAM — code, data, and stack — and no other core can access it. This is not a software convention; it is a physical property of the silicon. There are no bus lines connecting one core's memory to another.
Cores communicate exclusively through hardware mailbox FIFOs. A core writes a 32-bit message to its outgoing mailbox; the hardware delivers it to the destination core's incoming mailbox. The sending core does not block. The receiving core wakes from sleep when a message arrives, processes it, and goes back to sleep. This is the actor model in silicon.
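In software terms, the mailbox contract above (a bounded FIFO of 32-bit words, fire-and-forget send, receiver parked until a word arrives) can be modeled with .NET channels. This is an illustrative analogy of the semantics, not project code: a real core wakes on a hardware interrupt rather than awaiting a task.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Model of one core's incoming mailbox: a bounded FIFO of 32-bit words.
var mailbox = Channel.CreateBounded<uint>(capacity: 16);

// "Receiving core": parked until a word arrives, handles it, parks again.
async Task<uint> CoreLoop(ChannelReader<uint> inbox)
{
    uint acc = 0;
    await foreach (uint msg in inbox.ReadAllAsync())
        acc += msg;                    // the per-message handler
    return acc;
}

Task<uint> core = CoreLoop(mailbox.Reader);

// "Sending core": fire-and-forget; TryWrite does not block the sender.
for (uint i = 1; i <= 4; i++)
    mailbox.Writer.TryWrite(i);
mailbox.Writer.Complete();             // demo-only: hardware FIFOs never "complete"

uint total = await core;
Console.WriteLine(total);              // → 10
```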
"Current software systems are built on fundamentally flawed models. We need hardware where every core is an actor."
— Joe Armstrong, creator of Erlang, 2014
The consequences of this design are profound:
- No cache coherence protocol. Zero. None. No MESI, no MOESI, no directory-based coherence. This alone saves enormous die area and power.
- No locks, no mutexes, no atomics. There is no shared state to protect. Data races are physically impossible.
- Linear scaling. Adding more cores adds proportionally more compute. There is no Amdahl bottleneck because there is no serial section — each core is independent.
- Deterministic execution. Each core's behavior depends only on its code and its incoming messages. No timing-dependent bugs, no heisenbugs, no memory ordering surprises.
- Event-driven power profile. Cores sleep by default. They wake on mailbox interrupt, process the message, and sleep again. Idle power approaches zero.
This is not a new idea in software. Erlang has been doing this since 1986. Akka (2009) and Akka.NET (2014) brought it to the JVM and .NET ecosystems. Orleans (2015) made it accessible to mainstream C# developers. The actor model has proven itself in production at massive scale — WhatsApp (Erlang), Discord (Elixir), Microsoft Azure (Orleans), and countless others.
But all of these run on conventional shared-memory hardware, which means the runtime must simulate message isolation through software abstractions — thread pools, mailbox queues, GC pressure, scheduler overhead. CLI-CPU removes that entire layer. The hardware is the runtime.
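To see what that software layer amounts to, here is the minimum machinery even a toy actor needs on shared-memory hardware: a message queue, a scheduler loop, and the unenforced convention that only the actor touches its own state. On CLI-CPU, the queue becomes a hardware FIFO and the convention becomes physics. Illustrative sketch only.

```csharp
using System;
using System.Collections.Generic;

// Minimal software actor: private state plus a FIFO inbox, processed one
// message at a time. On shared-memory hardware the isolation is a mere
// convention; the queue and loop below are overhead the runtime pays.
var inbox = new Queue<int>();   // the software "mailbox"
int counter = 0;                // private state: only this actor touches it

void Deliver(int msg) => inbox.Enqueue(msg);

void RunUntilIdle()             // the software "scheduler"
{
    while (inbox.Count > 0)
        counter += inbox.Dequeue();  // the handler, one message at a time
}

Deliver(5);
Deliver(7);
RunUntilIdle();
Console.WriteLine(counter);     // → 12
```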
Why Now?
picoJava in 1997 had none of the conditions needed for this approach to work. In 2026, every single one has materialized:
- Open PDKs. Google and SkyWater released the Sky130 process design kit in 2020. IHP followed with SG13G2. For the first time in history, anyone can design and fabricate a chip without paying millions in licensing fees. The entire CLI-CPU silicon path uses open tools: Yosys for synthesis, OpenLane2 for place-and-route, Magic for DRC/LVS.
- Tiny Tapeout. Matt Venn's Tiny Tapeout project has democratized chip fabrication. For approximately $150, you can get a small design onto real Sky130 silicon. CLI-CPU's first physical chip (a single Nano core + hardware mailbox) will tape out through this program.
- Affordable FPGAs. The MicroPhase A7-Lite board with a Xilinx XC7A200T — 200,000 logic cells — costs about $320. Three of these boards connected via Ethernet can host a multi-board Cognitive Fabric mesh for under $1,000. This was unthinkable a decade ago.
- The actor model is mainstream. Akka.NET, Orleans, Dapr, Proto.Actor — the .NET ecosystem has embraced message-passing concurrency. Eight million .NET developers already write code that maps naturally to CLI-CPU's execution model. They do not need to learn a new language or paradigm; they just need hardware that runs it natively.
- Dennard scaling is dead. Clock frequencies have been flat for 20 years. The only path to more performance is more cores. But shared-memory multi-core has hit diminishing returns. Shared-nothing many-core is the logical next step, and CLI-CPU is designed from the ground up for it.
What We Have Built So Far
CLI-CPU is not a whitepaper. It is not a "vision document" with handwaving about future possibilities. It is working code, tested code, and — increasingly — working hardware descriptions.
Phase F1.5 is complete. This includes:
- A reference simulator (`TCpu`) implementing all 48 CIL-T0 opcodes — the integer subset of ECMA-335 CIL, plus mailbox MMIO extensions. The simulator models the full memory map: CODE, DATA, STACK, and MMIO regions.
- A CIL-T0 linker (`TCliCpuLinker`) that takes a standard .NET assembly (a .dll compiled by Roslyn) and repackages it into a flat binary (.t0) that the hardware can boot from. No translation — the CIL opcodes in the .dll are the same ones the CPU executes.
- A CLI runner with `run` and `link` commands.
- 259+ xUnit tests covering every opcode, every trap condition, every edge case. Strict TDD — every test was written before its implementation.
- An ALU module in Verilog with a full cocotb testbench (41/41 tests passing). This is the beginning of the RTL phase (F2).
The entire codebase — simulator, linker, runner, tests, Verilog, cocotb, documentation — is open source on GitHub.
The Cognitive Fabric One Vision
The target chip — Cognitive Fabric One — is designed for the Sky130 process at approximately 15 mm²:
| | Cognitive Fabric One | Typical RISC-V SoC |
|---|---|---|
| Cores | 6 Rich + 16 Nano + 1 Secure | 4-8 homogeneous |
| Memory model | Private SRAM per core, no shared memory | Shared L1/L2 cache hierarchy |
| Inter-core comm | Hardware mailbox FIFOs (OPI bus) | Shared memory + cache coherence |
| Coherence overhead | Zero | Significant (MESI/directory) |
| Scaling behavior | Linear with core count | Sub-linear (Amdahl's Law) |
| Idle power | Near-zero (sleep until mailbox) | Significant (cache, coherence logic) |
| Security isolation | Physical (no shared memory) | Software (TrustZone, TEE) |
| ISA | CIL (stack, compact) | RISC-V (register, wider encoding) |
| I/O | USB 1.1 FS, UART, SPI, GPIO | Varies |
The Rich cores handle supervisor logic, complex domain code, GC-managed objects, floating point, and exception handling — the full ECMA-335 CIL instruction set (~220 opcodes). The Nano cores are tiny (~10k standard cells each) and run the CIL-T0 integer subset — ideal for workers, neurons, filters, simple actors. The Secure Core is a hardened Rich core with crypto acceleration, TRNG, and PUF — the trust anchor for the entire fabric.
This heterogeneous design is analogous to ARM big.LITTLE or Apple's P-core + E-core, but applied to the .NET world. A C# developer annotates their classes with [RunsOn(CoreType.Nano)] or [RunsOn(CoreType.Rich)], and a Roslyn source generator verifies at compile time that Nano-targeted code stays within the CIL-T0 subset.
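As a sketch of what such an annotation could look like in source, with the caveat that the `CoreType` and `RunsOnAttribute` definitions below are illustrative reconstructions of the attribute named above, not the project's actual code:

```csharp
using System;

// Hypothetical reconstructions of the [RunsOn] annotation described in the
// text; the project's real definitions may differ.
enum CoreType { Nano, Rich, Secure }

[AttributeUsage(AttributeTargets.Class)]
sealed class RunsOnAttribute : Attribute
{
    public CoreType Core { get; }
    public RunsOnAttribute(CoreType core) => Core = core;
}

// Integer-only worker: must stay within the CIL-T0 subset, which a Roslyn
// source generator would verify at compile time before this class is
// allowed to target a Nano core.
[RunsOn(CoreType.Nano)]
class ThresholdFilter
{
    public int Apply(int sample, int threshold)
        => sample > threshold ? sample : 0;
}
```

The value of the attribute is that the CIL-T0 constraint becomes a compile-time error rather than a runtime trap: code that uses floating point or objects simply cannot be tagged `CoreType.Nano`.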
Security: Immune by Design
Every major CPU vulnerability of the past decade — Spectre, Meltdown, L1 Terminal Fault, Rowhammer, Retbleed, Inception — exploits one of three things: speculative execution, shared caches, or shared memory. CLI-CPU has none of these.
- No speculative execution. CLI-CPU cores are simple, in-order, deterministic. There is no branch predictor to poison, no speculative load to leak data through a side channel.
- No shared cache. Each core has its own private SRAM. There is no L2, no L3, no system-level cache. No Flush+Reload, no Prime+Probe, no cache timing attacks.
- No shared memory. Cross-core information flow is limited to explicit mailbox messages. One compromised core cannot read another core's state — not through software, not through microarchitectural side channels, not through anything short of physical decapsulation.
On top of this, the CIL execution model provides hardware-enforced type safety and control flow integrity. Every memory access is bounds-checked in hardware. Every method call is verified against its declared signature. Stack overflow and underflow generate hardware traps. ROP and JOP attacks are impossible because the hardware enforces that execution can only transfer to valid method entry points.
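The bounds-checking claim has a familiar software analogue: stock .NET inserts the same check in software and throws, whereas CLI-CPU's design raises a hardware trap. Either way the out-of-range read can never silently succeed.

```csharp
using System;

// On stock .NET the runtime checks every array access in software and
// throws; the text's claim is that CLI-CPU performs the equivalent check
// in hardware and traps. In neither model can the bad read return data.
int[] buf = new int[4];
string outcome;
try
{
    int x = buf[7];            // out of bounds: checked on every access
    outcome = "read " + x;     // unreachable
}
catch (IndexOutOfRangeException)
{
    outcome = "trapped";
}
Console.WriteLine(outcome);    // → trapped
```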
This is not security bolted on after the fact. This is security as a physical property of the silicon. You cannot patch around it, and you cannot exploit around it.
What Comes Next
The roadmap from here is concrete:
- F2 — RTL: Complete the Verilog/Amaranth HDL implementation of a single Nano core. Verify every opcode against the C# simulator using cocotb golden-vector testing. Synthesize for Sky130 with Yosys.
- F2.7 — FPGA Validation: Run the single Nano core on a real FPGA (MicroPhase A7-Lite XC7A200T) before committing to silicon. Fibonacci(20) over UART on real hardware. Principle: no tape-out without FPGA validation.
- F3 — First Silicon: Tape out the FPGA-validated Nano core + hardware mailbox on Tiny Tapeout. Run Fibonacci(20) and an "echo neuron" demo on real silicon. The first physical CIL-native processor.
- F4 — Cognitive Fabric Pivot: Demonstrate 4 Nano cores on FPGA, communicating via hardware mailboxes in a shared-nothing event-driven fabric. Prove that the architecture scales.
- F5 — Rich Core: Add the full CIL instruction set — objects, GC assist, exception handling, FPU. Demonstrate heterogeneous Nano + Rich operation.
- F6 — Cognitive Fabric One: The target chip. First on FPGA (multi-board mesh), then on real silicon via ChipIgnite or IHP MPW.
- F7 — Neuron OS: An actor-based operating system designed from the ground up for shared-nothing hardware. No kernel. No syscalls. No virtual memory. Just actors communicating through messages.
The entire project — hardware, software, documentation — is and will remain open source under CERN-OHL-S v2.
// join the journey
CLI-CPU is open source. The simulator runs today. The silicon path is funded and planned. If you believe that the future of computing is many small cores communicating through messages — not fewer big cores fighting over shared memory — then this project is for you.