A simulator always lies a little. Not out of malice — it simply overlooks things that real silicon won't. The CLI-CPU is backed by 250+ C# tests and a full cocotb regression suite: the reference simulator ran green, the Verilator RTL simulation ran green, and Fibonacci(20) returned 6765 end to end. On paper, it was done.
Then I picked up an Alinx A7-Lite board with a Xilinx XC7A200T FPGA, and found out what "done" actually means.
Simulation tells you your logic is correct. Hardware tells you your logic is real.
What I wanted to see
The goal was a single, deliberately boring thing: a function written in C# running natively, on real hardware, with no JIT, no interpreter, no operating system — and printing its result over a serial port. This is the CLI-CPU's "hello world."
C# (samples/PureMath, Math.FibonacciIterative)
↓ dotnet build (Roslyn)
.dll (CIL bytecode)
↓ TCliCpuLinker
app.t0 (40 bytes: 8 header + 32 code)
↓ openFPGALoader (config flash @ 0xC00000)
IS25LP128 flash
↓ (read by the FPGA at runtime)
cilcpu_core (Fetch → Decode → Microcode → StackCache → ALU)
↓ core_halt, return_value = 6765
decimal_printer → uart_tx (115200 8N1)
↓
"6765\r\n" on the serial port
There is not a single line of C or firmware anywhere in this chain. The C# function's CIL bytecode runs directly on silicon. That's the whole point of the vision — and it's exactly what made bring-up interesting.
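For reference, the computation at the top of that chain can be modeled in a few lines. This is a Python stand-in, not the C# source from samples/PureMath; the function name and the convention F(0) = 0, F(1) = 1 are my assumptions, chosen because they reproduce the 6765 the post reports for n = 20.

```python
def fibonacci_iterative(n: int) -> int:
    """Iterative Fibonacci, assuming F(0) = 0 and F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        # Slide the (F(k), F(k+1)) window forward by one step.
        a, b = b, a + b
    return a


print(fibonacci_iterative(20))  # 6765
```

A model like this is handy as an oracle: whatever the core prints on the serial port can be compared against it for any input.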
Three bugs stood between me and silicon
Between "done on paper" and "the LED is on, 6765 is coming through" sat three bugs. Each one is the kind simulation overlooks but a real device does not.
1. The simulator is more forgiving than synthesis
The first real Vivado synthesis spat out three Verilog violations that Verilator had tolerated for years:
- Double part-select: BOOT_ARG_COUNT[7:0][4:0], which Verilator interpreted but synthesis rejects.
- Missing begin/end in six multi-statement functions: Verilog-2001 requires it; the sim compiler never complained.
- BRAM inference: the SRAM block had to be rewritten so Vivado cleanly infers 4× RAMB36E1. A dead combinational read path, a misplaced bus multiplexer, and an async reset all had to come out.
Lesson: a green test at the sim level is necessary but not sufficient. The synthesizer is a second, stricter reviewer — and it's worth running as early as possible.
2. The divider that didn't fit in the clock
Synthesis passed, but timing didn't close: WNS = −66 ns at 50 MHz. The critical path was 329 logic levels with 287 CARRY4 cells on a single chain. The culprit: the purely combinational ALU's 32-bit signed division (DIV / REM), which the synthesizer unrolled into a 32-level subtractor chain.
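To put that slack number in perspective, here is a rough back-of-the-envelope sketch. It assumes the simple relation "critical path delay ≈ clock period − WNS" and ignores clock uncertainty and skew, so the results are approximations rather than figures from the Vivado report. The function names are mine.

```python
def critical_path_ns(clock_mhz: float, wns_ns: float) -> float:
    """Approximate worst path delay implied by a slack figure."""
    period_ns = 1000.0 / clock_mhz  # 50 MHz -> 20 ns
    return period_ns - wns_ns       # negative slack lengthens the path


def fmax_mhz(clock_mhz: float, wns_ns: float) -> float:
    """Approximate highest clock at which that path would just meet timing."""
    return 1000.0 / critical_path_ns(clock_mhz, wns_ns)


path = critical_path_ns(50.0, -66.0)
print(f"{path:.1f} ns path, about {fmax_mhz(50.0, -66.0):.1f} MHz max")
# 86.0 ns path, about 11.6 MHz max
```

In other words, the combinational divider would have needed more than four 20 ns clock periods to settle.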
There was no surface patch here — the root had to be fixed. The single-cycle combinational divider was replaced with a sequential, 1-bit-per-cycle restoring divider (~34 cycles, with a dedicated ST_DIV_WAIT state in the core, preserving C# division semantics: truncation toward zero, remainder sign matching the dividend). The result wasn't just timing closure — it was a significantly smaller chip:
| Resource | Combinational | Iterative | Change |
|---|---|---|---|
| Slice LUT | 5,722 | 3,559 | −37.8% |
| CARRY4 | 869 | 302 | −65.2% |
| Slice FF | 1,354 | 1,564 | +15.5% |
The final build closed timing at 50 MHz with WNS = +2.797 ns and zero failing endpoints. A classic hardware trade-off: a few extra flip-flops and a few cycles of latency in exchange for killing the combinational blowup.
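The restoring algorithm behind the sequential divider can be sketched as a software model. This is not the RTL: each loop iteration stands in for one clock cycle, the function names are hypothetical, and the handling of division by zero and of the most negative value divided by −1 is my assumption, since the post doesn't say what the core does in those cases.

```python
MASK32 = 0xFFFFFFFF


def restoring_divide_u32(dividend: int, divisor: int) -> tuple[int, int]:
    """Unsigned 32-bit restoring division, one quotient bit per iteration."""
    if divisor == 0:
        raise ZeroDivisionError("assumed: the model rejects a zero divisor")
    remainder = 0
    quotient = 0
    for bit in range(31, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        trial = remainder - divisor
        if trial >= 0:
            # Subtraction succeeded: keep it and set this quotient bit.
            remainder = trial
            quotient |= 1 << bit
        # Otherwise "restore": the partial remainder stays as it was.
    return quotient, remainder


def to_signed32(value: int) -> int:
    """Reinterpret the low 32 bits as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def div_rem_i32(a: int, b: int) -> tuple[int, int]:
    """Signed divide: quotient truncates toward zero, remainder follows a."""
    q_mag, r_mag = restoring_divide_u32(abs(a), abs(b))
    quotient = -q_mag if (a < 0) != (b < 0) else q_mag
    remainder = -r_mag if a < 0 else r_mag
    # Assumed: results wrap to 32 bits, so -2**31 / -1 yields -2**31 here.
    return to_signed32(quotient), to_signed32(remainder)


print(div_rem_i32(-7, 2))  # (-3, -1)
print(div_rem_i32(7, -2))  # (-3, 1)
```

The loop runs 32 times. The ~34 cycles quoted above presumably include setup and sign correction around it, though that breakdown is my guess.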
3. The flash that wouldn't talk in four lanes
The sneakiest bug saved itself for last. Bitstream loaded, flash programmed, KEY2 pressed — and the serial port showed 3\r\n. That's the TRAP_INVALID_OPCODE trap code: the core was reading garbage out of code memory.
The trail led to the flash. The core reads the program from the ISSI IS25LP128 flash using the 0x6B Quad Output Read command, which requires the flash's QE (Quad Enable) bit to be 1. Three things conspired to leave it at 0: the flash ships with QE = 0; the A7-Lite's MODE pins aren't strapped for SPIx4, so the FPGA startup ROM doesn't set it either; and openFPGALoader's --enable-quad doesn't recognize this variant. A four-lane command sent to a one-lane flash returns pure garbage.
The fix, again at the root: after reset, the cilcpu_qspi_controller itself issues a WREN (0x06) + WRSR (0x01, 0x40) sequence that persistently sets the QE bit in the flash's Status Register. I put it behind a QE_INIT_ENABLE parameter: 0 in simulation (existing test behavior stays unchanged), 1 on real hardware.
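The shape of that init sequence can be shown with a toy model. The opcodes and the 0x40 value come from the text above. FlashModel and qe_init_sequence are hypothetical names, and the model only captures the one rule that matters here: a status write is ignored unless a write-enable came first.

```python
WREN = 0x06    # Write Enable
WRSR = 0x01    # Write Status Register
QE_BIT = 0x40  # Quad Enable bit in the status register


class FlashModel:
    """Hypothetical toy model of the flash status register (not the RTL)."""

    def __init__(self) -> None:
        self.status = 0x00          # QE = 0, the factory default described above
        self.write_enabled = False

    def transfer(self, frame: bytes) -> None:
        """Handle one chip-select frame: an opcode plus optional data."""
        opcode = frame[0]
        if opcode == WREN:
            self.write_enabled = True
        elif opcode == WRSR and self.write_enabled:
            self.status = frame[1]
            self.write_enabled = False  # the latch clears after the write

    def quad_enabled(self) -> bool:
        return bool(self.status & QE_BIT)


def qe_init_sequence() -> list[bytes]:
    """The two frames sent after reset: WREN, then WRSR with QE set."""
    return [bytes([WREN]), bytes([WRSR, QE_BIT])]


flash = FlashModel()
for frame in qe_init_sequence():
    flash.transfer(frame)
print(flash.quad_enabled())  # True
```

A real flash also needs time to finish the status write before it accepts the next command. The post doesn't say whether the controller polls for completion or waits a fixed delay, so the model leaves that out.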
The missing begin/end mattered only to the synthesizer, the divider's timing only to a real clock, and the flash QE bit only to a physical chip's factory default. Simulation checks your logic; hardware checks your assumptions.
The moment
Reset (KEY1), then start (KEY2). The serial terminal (COM3, 115200 8N1) showed six bytes:
36 37 36 35 0D 0A = "6765\r\n" = FibonacciIterative(20) = 6765 ✓
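Those six bytes are easy to reproduce in software. The sketch below models what a decimal printer has to do (repeated division by 10, digits emitted most-significant first, then CR LF) and how long the frame occupies the wire at 115200 8N1. It is an illustration with hypothetical function names, not the decimal_printer RTL, and it handles only non-negative values.

```python
def decimal_bytes(value: int) -> bytes:
    """ASCII decimal digits of a non-negative value, followed by CR LF."""
    digits = []
    while True:
        value, digit = divmod(value, 10)
        digits.append(0x30 + digit)  # 0x30 is ASCII '0'
        if value == 0:
            break
    # Digits come out least-significant first, so reverse before sending.
    return bytes(reversed(digits)) + b"\r\n"


def uart_wire_time_ms(n_bytes: int, baud: int = 115200) -> float:
    """Time on the wire for 8N1 framing: start + 8 data + stop = 10 bits."""
    return n_bytes * 10 / baud * 1000.0


frame = decimal_bytes(6765)
print(frame.hex(" ").upper())                     # 36 37 36 35 0D 0A
print(f"{uart_wire_time_ms(len(frame)):.3f} ms")  # 0.521 ms
```

The wire time is plain arithmetic from the baud rate. Whether it is part of any measured run time is not something the model can tell you.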
A function written in C# was compiled to CIL bytecode, passed through my own linker, and landed on a flash chip. Then a processor I designed computed the right answer on real silicon, in ~1.1 ms. This is the CLI-CPU's first true sign of life beyond simulation.
Why the FPGA is worth it
These same three bugs could also have surfaced only after the chip was fabricated. The cost of finding them at that stage looks drastically different:
| Where the bug surfaces | Cost | Iteration time |
|---|---|---|
| Simulation | ~€0 | seconds |
| FPGA | ~€0 (board already owned) | minutes (reflash) |
| ASIC tape-out | $1300+ (MPW shuttle) | months |
This is what the F2.7 phase is about: no tape-out without FPGA validation. The flash QE bug was a few lines of RTL on an FPGA. The same bug on a fabricated chip is a dead die and a lost shuttle slot. The FPGA is where assumptions collapse for free.
One core, a bigger picture
This was a single Nano core on a single board. The CLI-CPU vision is orders of magnitude larger: many small, independent cores on one chip, wired together with hardware mailboxes — the "Cognitive Fabric." This run is the first tangible cornerstone of that vision: not a diagram, not a simulation, but working silicon.
The full toolchain works: C# → CIL-T0 → bitstream → flash → core → UART. That chain, which until now lived only in my head and in the tests, just closed on a physical device.
Open source
The RTL, the simulator, the linker, and the full test suite are publicly available. The bring-up runbook is in the repo too.
Simulation tells you that you might be right. Hardware tells you that you are. On May 23, 2026, the CLI-CPU crossed that line.