One RTL, every chip
Most processor architectures are designed for a specific manufacturing node and chip size. When the node changes, the chip is redesigned. The CFPU works differently: a single parameterizable RTL where chip size and manufacturing technology determine the core count.
The core count adapts to the chip, not the other way around.
This is possible because the CFPU's 4-level hierarchical architecture is parameterizable: the cluster size is fixed (4×4 = 16 cores, physical optimum), but the number of clusters, tiles, and regions is configurable. Same Verilog code, different parameters.
module cfpu_chip #(
    parameter CORES_PER_CLUSTER = 16,   // fixed: 4×4 mesh
    parameter CLUSTERS_PER_TILE = 12,   // node-dependent: 4-12
    parameter TILES_PER_REGION  = 12,   // node-dependent: 4-12
    parameter REGIONS           = 24,   // chip-size-dependent: 4-24
    parameter SRAM_KB_PER_CORE  = 1024  // node-dependent: 16-1024
);
    // Example parameter sets (the defaults above are the upper bounds):
    // 7nm, 800 mm²: 16 × 8 × 8 × 8  = 8,192 cores, 256 KB/core
    // 5nm, 800 mm²: 16 × 8 × 8 × 10 = 10,240 cores, 256 KB/core
endmodule
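The core counts in the comments above are simply the product of the four hierarchy parameters. A minimal Python sketch of that arithmetic follows; the function name `total_cores` is ours for illustration and is not part of the CFPU sources.

```python
def total_cores(clusters_per_tile, tiles_per_region, regions, cores_per_cluster=16):
    """Core count implied by the four hierarchy parameters."""
    return cores_per_cluster * clusters_per_tile * tiles_per_region * regions

# Parameter sets quoted in the Verilog comments
print(total_cores(8, 8, 8))    # 7nm, 800 mm²
print(total_cores(8, 8, 10))   # 5nm, 800 mm²
```

Changing only `regions` is how the same design grows with die area while the cluster stays fixed at 16 cores.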
Why does it scale?
The key: SRAM, not logic, dominates core area (55-75%). SRAM also scales worse than logic, so real core density rises more slowly than logic density alone would suggest. We can respond in two ways: grow the SRAM per core at every node step, or hold it fixed and spend the saved area on more cores.
| Node | Logic density | SRAM bit-cell area | SRAM density gain vs. 130nm |
|---|---|---|---|
| 130nm (Sky130) | ~2.5 MTr/mm² | ~1.0 μm² | 1× |
| 28nm | ~20 MTr/mm² | ~0.12 μm² | ~8× |
| 7nm | ~91 MTr/mm² | ~0.027 μm² | ~37× |
| 5nm | ~171 MTr/mm² | ~0.021 μm² | ~48× |
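The last column of the table appears to track the shrink of the SRAM bit cell rather than the growth in logic density. The sketch below recomputes both ratios against the 130nm baseline so the gap is visible; the helper name is hypothetical and the inputs are the approximate values from the table.

```python
# (node, logic density in MTr/mm², SRAM bit-cell area in µm²), from the table above
NODES = [
    ("130nm", 2.5, 1.0),
    ("28nm", 20.0, 0.12),
    ("7nm", 91.0, 0.027),
    ("5nm", 171.0, 0.021),
]

def scaling_vs_baseline(nodes):
    """Return {node: (logic_gain, sram_gain)} relative to the first entry."""
    _, base_logic, base_cell = nodes[0]
    return {
        name: (logic / base_logic, base_cell / cell)
        for name, logic, cell in nodes
    }

for name, (logic_gain, sram_gain) in scaling_vs_baseline(NODES).items():
    print(f"{name}: logic {logic_gain:.1f}x, SRAM {sram_gain:.1f}x")
```

With these inputs the two gains stay close through 7nm and diverge at 5nm, where logic improves by roughly 68× but SRAM by only about 48×.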
Core size per node
A) Growing SRAM — richer actors, fewer cores
If we increase SRAM with each technology step, each core can run more actors, larger working sets, richer programs:
| Node | Core logic | SRAM/core | SRAM area | Core total |
|---|---|---|---|---|
| 130nm | 0.80 mm² | 16 KB | 0.13 mm² | ~1.06 mm² |
| 28nm | 0.10 mm² | 64 KB | 0.06 mm² | ~0.18 mm² |
| 7nm | 0.022 mm² | 256 KB | 0.057 mm² | ~0.083 mm² |
| 5nm | 0.012 mm² | 512 KB | 0.088 mm² | ~0.103 mm² |
B) Fixed 256 KB SRAM — maximum core count
If we keep SRAM at 256 KB on every node, the freed area translates to more cores — greater parallelism, more simultaneous actors:
| Node | Core logic | SRAM/core | SRAM area | Core total |
|---|---|---|---|---|
| 130nm | 0.80 mm² | 256 KB | 2.10 mm² | ~2.93 mm² |
| 28nm | 0.10 mm² | 256 KB | 0.25 mm² | ~0.37 mm² |
| 7nm | 0.022 mm² | 256 KB | 0.057 mm² | ~0.083 mm² |
| 5nm | 0.012 mm² | 256 KB | 0.044 mm² | ~0.059 mm² |
The choice between the two strategies depends on the workload. Maximum parallelism (SNN, many small actors) favors fixed 256 KB with more cores; richer actors (ML inference, large object graphs) favor growing SRAM. The parameterizable RTL supports both: SRAM_KB_PER_CORE is a synthesis-time parameter, so the choice is fixed per chip before fabrication.
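The SRAM area column in both tables is consistent with multiplying the number of bits by the bit-cell area from the scaling table, with no allowance for sense amplifiers, decoders, or other array periphery. That is our reading of the numbers, not a stated rule of the model. A short sketch:

```python
def sram_area_mm2(kb_per_core, cell_area_um2):
    """Raw bit-cell area in mm² for kb_per_core KB of SRAM (periphery ignored)."""
    bits = kb_per_core * 1024 * 8
    return bits * cell_area_um2 / 1e6  # µm² -> mm²

print(round(sram_area_mm2(256, 0.027), 3))  # 7nm, 256 KB
print(round(sram_area_mm2(512, 0.021), 3))  # 5nm, 512 KB
print(round(sram_area_mm2(256, 1.0), 2))    # 130nm, 256 KB
```

A real SRAM macro is larger than its raw cell area, so these figures are best read as lower bounds.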
How many cores fit on a chip?
Usable area = chip size × 0.78 (subtracting: I/O pads, Seal Core, crossbars, routing channels, peripherals).
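The core counts in this section follow from that formula: usable area divided by the per-core area. The sketch below assumes the result is rounded to the nearest integer, which reproduces nearly all table entries; an occasional entry may differ by one depending on how the original was rounded. The function name is ours.

```python
USABLE_FRACTION = 0.78  # from the text: share of the die left for cores

def cores_on_die(die_mm2, core_mm2, usable=USABLE_FRACTION):
    """Estimated core count: usable die area divided by per-core area."""
    return round(die_mm2 * usable / core_mm2)

print(cores_on_die(800, 0.083))  # 7nm, 256 KB/core
print(cores_on_die(800, 0.059))  # 5nm, 256 KB/core
```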
A) Growing SRAM (16 KB → 512 KB)
| Node | SRAM/core | Core size | 100 mm² | 400 mm² | 800 mm² | 1 400 mm² |
|---|---|---|---|---|---|---|
| 130nm | 16 KB | 1.06 mm² | 74 | 294 | 588 | 1 030 |
| 28nm | 64 KB | 0.18 mm² | 433 | 1 733 | 3 467 | 6 067 |
| 7nm | 256 KB | 0.083 mm² | 940 | 3 759 | 7 518 | 13 157 |
| 5nm | 512 KB | 0.103 mm² | 757 | 3 029 | 6 058 | 10 602 |
B) Fixed 256 KB SRAM — maximum parallelism
| Node | SRAM/core | Core size | 100 mm² | 400 mm² | 800 mm² | 1 400 mm² |
|---|---|---|---|---|---|---|
| 130nm | 256 KB | 2.93 mm² | 27 | 107 | 213 | 373 |
| 28nm | 256 KB | 0.37 mm² | 211 | 843 | 1 686 | 2 951 |
| 7nm | 256 KB | 0.083 mm² | 940 | 3 759 | 7 518 | 13 157 |
| 5nm | 256 KB | 0.059 mm² | 1 322 | 5 288 | 10 576 | 18 508 |
[Figure: core count on an 800 mm² die with fixed 256 KB SRAM per core]
The hierarchy adapts
The 4-level architecture is not rigid — the number of levels follows from the core count:
| Core count | Levels | Typical use |
|---|---|---|
| < 256 | 2 (mesh + 1 crossbar) | IoT, Secure Element |
| 256 – 2,000 | 3 (mesh + 2 crossbars) | Edge, microcontroller |
| 2,000 – 20,000 | 4 (mesh + 3 crossbars) | Compute target |
| 20,000+ | 4-5 | Wafer-scale |
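The table can be read as a simple lookup from core count to hierarchy depth. The sketch below is one way to express it. The table lists 2,000 and 20,000 in two bands each, so assigning those boundary values to the higher band is our assumption, as is the function name.

```python
def hierarchy_levels(core_count):
    """Return (min_levels, max_levels) for a core count, per the table above.

    Assumption: boundary values (256, 2,000, 20,000) fall into the higher band.
    """
    if core_count < 256:
        return (2, 2)   # mesh + 1 crossbar
    if core_count < 2_000:
        return (3, 3)   # mesh + 2 crossbars
    if core_count < 20_000:
        return (4, 4)   # mesh + 3 crossbars
    return (4, 5)       # wafer-scale: 4 or 5 levels

print(hierarchy_levels(576))     # 130nm configuration
print(hierarchy_levels(10_240))  # 5nm configuration
```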
Four core types + MAC Slice
The CFPU is not built around a single core type: four programmable core types exist, and ML inference uses non-programmable MAC Slices:
| Core | ISA | FPU | GC + Obj. | SRAM | Role |
|---|---|---|---|---|---|
| Nano | CIL-T0 (48 opcodes) | None | None | 4-64 KB | Spike, sensor, simple actor |
| Actor | Full CIL | None | Yes | 64-256 KB | Actor-native: objects, GC, exceptions |
| Rich | Full CIL + FPU | Yes | Yes | 128-512 KB | FP-intensive, scientific, full .NET |
| Seal | AuthCode, hash | Crypto HW | — | ~32 KB | Code authentication — 1+ per chip |
| MAC Slice | FSM (no CIL) | — | — | 4-256 KB | ML inference — 8×8 INT8 MAC, not programmable |
Core count comparison (5nm, 800 mm²)
| Core + SRAM | Node size | Core count |
|---|---|---|
| Nano + 4 KB | 0.017 mm² | ~47,000 |
| Actor + 64 KB | 0.032 mm² | ~25,000 |
| Rich + 256 KB | 0.071 mm² | ~11,300 |
Node sizes include the recommended L0 router variant area and L1–L3 infrastructure per-core share (~0.007 mm²). The CFPU-ML chip uses MAC Slices (not programmable cores) in a chiplet architecture — see CFPU-ML-Max.
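The rounded counts in the comparison table are consistent with dividing the full 800 mm² by the node size, without the 0.78 usable-area factor, presumably because the node size already carries its share of the infrastructure. This is inferred from the numbers rather than stated by the model. A quick check:

```python
DIE_MM2 = 800  # die size used in the comparison table

def cores_from_node_size(node_mm2, die_mm2=DIE_MM2):
    """Core count when the per-node area already includes infrastructure share."""
    return die_mm2 / node_mm2

for label, node_mm2 in [("Nano + 4 KB", 0.017),
                        ("Actor + 64 KB", 0.032),
                        ("Rich + 256 KB", 0.071)]:
    print(f"{label}: ~{cores_from_node_size(node_mm2):,.0f} cores")
```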
CFPU product family
| Variant | Seal | Cores | Target market |
|---|---|---|---|
| CFPU-N | 1 Seal | Nano only | IoT, massive SNN |
| CFPU-A | 1 Seal | Actor only | Actor cluster, streaming |
| CFPU-R | 1 Seal | Rich only | ML inference, FP |
| CFPU-ML | 1 Seal | MAC Slice + Actor | ML/SNN inference (chiplet) |
| CFPU-H | 1 Seal | Actor + Nano | Heterogeneous supervisor+worker |
| CFPU-X | 1 Seal | Mixed (any combination) | Application-specific |
The RTL is the same for every variant; the CORE_TYPE parameter selects which core type is instantiated. On a heterogeneous chip, multiple types can coexist on a single die.
* Arithmetic projections with +25% design margin. Chiplet: 85 mm² tine die (5nm), 2×9 layout (18 tine + IOD, SoIC+CoWoS). No RTL simulation or silicon measurement exists yet.
Node-specific configurations
130nm (Sky130) — first silicon
L0: 16 cores, 4×4 mesh, 16 KB SRAM/core
L1: 6 clusters, 6×6 crossbar
L2: 6 tiles, 6×6 crossbar + Seal Core
16 × 6 × 6 = 576 Rich Cores
Total SRAM: 9 MB
This is the "small CFPU" — the first real silicon, where the full hierarchy and the Seal Core are already present, just at smaller scale. The RTL is identical to later chips.
28nm — commercial target
L0: 16 cores, 4×4 mesh, 64 KB SRAM/core
L1: 6 clusters, 6×6 crossbar
L2: 6 tiles, 6×6 crossbar
L3: 6 regions, 6×6 crossbar + Seal Core
16 × 6 × 6 × 6 = 3,456 Rich Cores
Total SRAM: 216 MB
7nm — high performance
L0: 16 cores, 4×4 mesh, 256 KB SRAM/core
L1: 8 clusters, 8×8 crossbar
L2: 8 tiles, 8×8 crossbar
L3: 8 regions, 8×8 crossbar + Seal Core
16 × 8 × 8 × 8 = 8,192 Rich Cores
Total SRAM: 2 GB
5nm — target configuration
L0: 16 cores, 4×4 mesh, 256 KB SRAM/core
L1: 8 clusters, 8×8 crossbar
L2: 8 tiles, 8×8 crossbar
L3: 10 regions, 10×10 crossbar + Seal Core
16 × 8 × 8 × 10 = 10,240 Rich Cores
Total SRAM: 2.5 GB
(On a 1,400 mm² die: 18 regions, 18,432 cores, 4.5/9.2 GB SRAM)
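The SRAM totals quoted for the four 800 mm² configurations are core count times SRAM per core, in binary units. A small sketch that recomputes them; the dictionary and function names are illustrative only.

```python
# (cores, SRAM per core in KB) for the 800 mm² configurations listed above
CONFIGS = {
    "130nm": (16 * 6 * 6, 16),
    "28nm": (16 * 6 * 6 * 6, 64),
    "7nm": (16 * 8 * 8 * 8, 256),
    "5nm": (16 * 8 * 8 * 10, 256),
}

def total_sram_mb(cores, kb_per_core):
    """Total on-chip SRAM in MB (binary units: 1 MB = 1024 KB)."""
    return cores * kb_per_core / 1024

for node, (cores, kb) in CONFIGS.items():
    print(f"{node}: {cores} cores, {total_sram_mb(cores, kb):g} MB")
```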
Latency stays bounded
Worst-case latency ranges from ~85 to ~171 cycles depending on the node, hierarchy depth, and payload size. At 5nm, the cross-region worst case is ~171 cycles (~342 ns @ 500 MHz, 128B payload); typical small messages (~48B) complete in ~93 cycles (~186 ns). For context: this is competitive with software actor messaging (Erlang/BEAM: ~0.5–2 µs).
| Node | Core count (800 mm²) | SRAM/core | Levels | L0 cluster size | Worst-case latency (128B) | Typical (48B) |
|---|---|---|---|---|---|---|
| 130nm | 576 | 16 KB | 2 | ~5.3 mm | ~85 cyc | ~38 cyc |
| 28nm | 3,456 | 64 KB | 3 | ~2.7 mm | ~120 cyc | ~68 cyc |
| 7nm | 8,192 | 256 KB | 4 | ~1.4 mm | ~171 cyc | ~93 cyc |
| 5nm | 10,240 | 256 KB | 4 | ~1.1 mm | ~171 cyc | ~93 cyc |
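Cycle counts convert to wall-clock time by dividing by the clock frequency. The sketch below uses the 500 MHz figure quoted for the 5nm case; applying the same clock to the other nodes would be an assumption, since the text gives a frequency only for 5nm.

```python
def cycles_to_ns(cycles, clock_mhz=500):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1000 / clock_mhz

print(cycles_to_ns(171))  # 5nm worst case, 128B payload
print(cycles_to_ns(93))   # 5nm typical, 48B payload
```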
The Seal Core always at the center
Regardless of node or core count, the Seal Core is always at the geometric center of the chip, co-located with the top-level crossbar (L3 on the four-level chips, L2 on the 130nm part). This is a communication topology decision: the star center minimizes the maximum wire length to any region, and all cross-region traffic naturally passes through it, which enables security inspection without extra routing.
Every chip variant, from 576 cores to 10,000+, uses the same network architecture: the Seal Core sees all cross-region traffic by design.
Open Source
The entire design process, the scaling model, and the parameterizable RTL are publicly available.
The CFPU is not a chip — it is a chip family. One RTL that grows with the technology, from the 130nm first silicon to the 5nm 10,000+ core target configuration.