From 130nm to 5nm — How the CFPU Scales from 600 to 10,000+ Cores

Seal Core — the security core at the center of the CFPU chip

One RTL, every chip

Most processor architectures are designed for a specific manufacturing node and chip size. When the node changes, the chip is redesigned. The CFPU works differently: a single parameterizable RTL where chip size and manufacturing technology determine the core count.

The core count adapts to the chip, not the other way around.

This is possible because the CFPU's 4-level hierarchical architecture is parameterizable: the cluster size is fixed (4×4 = 16 cores, physical optimum), but the number of clusters, tiles, and regions is configurable. Same Verilog code, different parameters.

module cfpu_chip #(
    parameter CORES_PER_CLUSTER  = 16,    // fixed: 4×4 mesh
    parameter CLUSTERS_PER_TILE  = 12,    // node-dependent: 4-12
    parameter TILES_PER_REGION   = 12,    // node-dependent: 4-12
    parameter REGIONS            = 24,    // chip-size-dependent: 4-24
    parameter SRAM_KB_PER_CORE   = 1024   // node-dependent: 16-1024
);
    // Hierarchy instantiation omitted here.
endmodule

// 7nm, 800 mm²:  16 × 8 × 8 × 8  =  8,192 cores, 256 KB/core
// 5nm, 800 mm²:  16 × 8 × 8 × 10 = 10,240 cores, 256 KB/core
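The two example configurations are the plain product of the four hierarchy parameters. A minimal Python sketch of that arithmetic follows; the function name `total_cores` and its defaults are illustrative and not part of the CFPU RTL.

```python
def total_cores(cores_per_cluster=16, clusters_per_tile=8,
                tiles_per_region=8, regions=8):
    """Core count implied by the hierarchy parameters (plain product)."""
    return cores_per_cluster * clusters_per_tile * tiles_per_region * regions


# The two configurations quoted in the comments above.
print("7nm:", total_cores(regions=8))
print("5nm:", total_cores(regions=10))
```

Changing only `regions` from 8 to 10 is what separates the 7nm and 5nm examples; the cluster size stays at 16 in both.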

Why does it scale?

The key: SRAM, not logic, dominates core area (55-75%). SRAM scales worse than logic, which is why the actual core density increases more slowly than logic density alone would suggest. One response, covered as strategy A below, is to grow the SRAM per core at every node step.

Node	Logic density	SRAM cell area	SRAM scaling factor (vs. 130nm)
130nm (Sky130)	~2.5 MTr/mm²	~1.0 μm²	1×
28nm	~20 MTr/mm²	~0.12 μm²	~8×
7nm	~91 MTr/mm²	~0.027 μm²	~37×
5nm	~171 MTr/mm²	~0.021 μm²	~48×

Core size per node

A) Growing SRAM — richer actors, fewer cores

If we increase SRAM with each technology step, each core can run more actors, larger working sets, richer programs:

Node	Core logic	SRAM/core	SRAM area	Core total
130nm	0.80 mm²	16 KB	0.13 mm²	~1.06 mm²
28nm	0.10 mm²	64 KB	0.06 mm²	~0.18 mm²
7nm	0.022 mm²	256 KB	0.057 mm²	~0.083 mm²
5nm	0.012 mm²	512 KB	0.088 mm²	~0.103 mm²
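The SRAM area column appears to be the bit count multiplied by the SRAM cell area from the scaling table, with no allowance for array overhead such as decoders or sense amplifiers. That reading is an assumption inferred from the numbers, and the helper name below is illustrative.

```python
def sram_area_mm2(sram_kb, cell_um2):
    """Raw bit-cell area in mm²: bits × cell area, no array overhead."""
    bits = sram_kb * 1024 * 8
    return bits * cell_um2 / 1e6  # 1 mm² = 1e6 μm²


# (node, SRAM per core in KB, cell area in μm²) taken from the tables above.
for node, kb, cell in [("130nm", 16, 1.0), ("28nm", 64, 0.12),
                       ("7nm", 256, 0.027), ("5nm", 512, 0.021)]:
    print(f"{node}: {sram_area_mm2(kb, cell):.3f} mm²")
```

The same helper with 256 KB at each node gives the SRAM areas used in the fixed-SRAM strategy.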

B) Fixed 256 KB SRAM — maximum core count

If we keep SRAM at 256 KB on every node, the freed area translates to more cores — greater parallelism, more simultaneous actors:

Node	Core logic	SRAM/core	SRAM area	Core total
130nm	0.80 mm²	256 KB	2.10 mm²	~2.93 mm²
28nm	0.10 mm²	256 KB	0.25 mm²	~0.37 mm²
7nm	0.022 mm²	256 KB	0.057 mm²	~0.083 mm²
5nm	0.012 mm²	256 KB	0.044 mm²	~0.059 mm²

The choice between the two strategies depends on the workload: maximum parallelism (SNN, many small actors) → fixed 256 KB with more cores. Richer actors (ML inference, large object graphs) → growing SRAM. The parameterizable RTL supports both: SRAM_KB_PER_CORE is a build parameter, fixed for each chip when the RTL is synthesized, before fabrication.

How many cores fit on a chip?

Usable area = chip size × 0.78 (subtracting: I/O pads, Seal Core, crossbars, routing channels, peripherals).

A) Growing SRAM (16 KB → 512 KB)

Node	SRAM/core	Core size	100 mm²	400 mm²	800 mm²	1,400 mm²
130nm	16 KB	1.06 mm²	74	294	588	1,030
28nm	64 KB	0.18 mm²	433	1,733	3,467	6,067
7nm	256 KB	0.083 mm²	940	3,759	7,518	13,157
5nm	512 KB	0.103 mm²	757	3,029	6,058	10,602

B) Fixed 256 KB SRAM — maximum parallelism

Node	SRAM/core	Core size	100 mm²	400 mm²	800 mm²	1,400 mm²
130nm	256 KB	2.93 mm²	27	107	213	373
28nm	256 KB	0.37 mm²	211	843	1,686	2,951
7nm	256 KB	0.083 mm²	940	3,759	7,518	13,157
5nm	256 KB	0.059 mm²	1,322	5,288	10,576	18,508
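The core counts in both tables follow from the usable-area rule: die area × 0.78, divided by the core size. A short Python sketch of that calculation is below. The function name is illustrative, and rounding to the nearest integer is an assumption. Because the core sizes in the tables are themselves rounded, a few cells may differ from this sketch by one core.

```python
USABLE_FRACTION = 0.78  # share of the die left for cores, per the text


def cores_on_die(die_mm2, core_mm2, usable=USABLE_FRACTION):
    """Estimated core count: usable area divided by per-core area."""
    return round(die_mm2 * usable / core_mm2)


# 800 mm² die at 5nm: fixed 256 KB versus growing to 512 KB per core.
print("5nm, 256 KB:", cores_on_die(800, 0.059))
print("5nm, 512 KB:", cores_on_die(800, 0.103))
```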

Core count (800 mm² die, fixed 256 KB):

Node	Core count
130nm	213
28nm	1,686
7nm	7,518
5nm	10,576

At 5nm, the 256 KB variant yields ~75% more cores than the 512 KB variant (10,576 vs 6,058). The decision depends on the workload: maximum parallelism (SNN, many small actors) → 256 KB. Richer actors (ML, large objects) → 512 KB. The RTL is parameterizable.

The hierarchy adapts

The 4-level architecture is not rigid — the number of levels follows from the core count:

Core count	Levels	Typical use
< 256	2 (mesh + 1 crossbar)	IoT, Secure Element
256 – 2,000	3 (mesh + 2 crossbars)	Edge, microcontroller
2,000 – 20,000	4 (mesh + 3 crossbars)	Compute target
20,000+	4-5	Wafer-scale
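The table can be read as a simple threshold rule. The sketch below encodes it in Python under two assumptions that the table leaves open: boundary values (256, 2,000, 20,000) fall into the higher band, and the "4-5" wafer-scale entry is represented by 5. The function name is illustrative.

```python
def hierarchy_levels(core_count):
    """Hierarchy depth for a given core count, following the table thresholds."""
    if core_count < 256:
        return 2  # mesh + 1 crossbar
    if core_count < 2_000:
        return 3  # mesh + 2 crossbars
    if core_count < 20_000:
        return 4  # mesh + 3 crossbars
    return 5      # table says 4-5; this sketch picks the upper bound


for cores in (100, 1_000, 10_240, 47_000):
    print(cores, "cores ->", hierarchy_levels(cores), "levels")
```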

Four core types + MAC Slice

The CFPU is not built around a single core type. Four programmable core types exist, and ML inference uses non-programmable MAC Slices:

Core	ISA	FPU	GC + Obj.	SRAM	Role
Nano	CIL-T0 (48 opcodes)	None	None	4-64 KB	Spike, sensor, simple actor
Actor	Full CIL	None	Yes	64-256 KB	Actor-native: objects, GC, exceptions
Rich	Full CIL + FPU	Yes	Yes	128-512 KB	FP-intensive, scientific, full .NET
Seal	AuthCode, hash	Crypto HW	—	~32 KB	Code authentication — 1+ per chip
MAC Slice	FSM (no CIL)	—	—	4-256 KB	ML inference — 8×8 INT8 MAC, not programmable

Core count comparison (5nm, 800 mm²)

Core + SRAM	Area per core	Core count
Nano + 4 KB	0.017 mm²	~47,000
Actor + 64 KB	0.032 mm²	~25,000
Rich + 256 KB	0.071 mm²	~11,300

The per-core areas include the recommended L0 router variant area and the per-core share of L1–L3 infrastructure (~0.007 mm²). The CFPU-ML chip uses MAC Slices (not programmable cores) in a chiplet architecture; see CFPU-ML-Max.
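The core counts in this comparison are consistent with dividing the full 800 mm² die by the per-core area, without the 0.78 usable-area factor, presumably because the interconnect share is already folded into the per-core figure. That interpretation is inferred from the numbers rather than stated in the text. A small Python check, with an illustrative function name:

```python
def cores_from_per_core_area(die_mm2, per_core_mm2):
    """Core count when infrastructure is already included in the per-core area."""
    return round(die_mm2 / per_core_mm2)


# Per-core areas from the comparison table (5nm, 800 mm² die).
for name, area in [("Nano + 4 KB", 0.017),
                   ("Actor + 64 KB", 0.032),
                   ("Rich + 256 KB", 0.071)]:
    print(f"{name}: {cores_from_per_core_area(800, area):,} cores")
```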

CFPU product family

Variant	Seal	Cores	Target market
CFPU-N	1 Seal	Nano only	IoT, massive SNN
CFPU-A	1 Seal	Actor only	Actor cluster, streaming
CFPU-R	1 Seal	Rich only	ML inference, FP
CFPU-ML	1 Seal	MAC Slice + Actor	ML/SNN inference (chiplet)
CFPU-H	1 Seal	Actor + Nano	Heterogeneous supervisor+worker
CFPU-X	1 Seal	Mixed (any combination)	Application-specific

The RTL is the same for every variant; the CORE_TYPE parameter selects which core type is built. On a heterogeneous chip, multiple types can coexist on a single die.

CFPU-ML-Max (5nm, chiplet, 18 tines): ~94,752 MAC Slices (8×8 INT8, M SKU) + ~150 Actor Cores (FP16, on IOD) = ~6,066 peak TOPS INT8 @ 500 MHz, 740 MB on-chip SRAM*. Details: Core Types blog, CFPU-ML-Max blog.

* Arithmetic projections with +25% design margin. Chiplet: 85 mm² tine die (5nm), 2×9 layout (18 tines + IOD, SoIC+CoWoS). No RTL simulation or silicon measurement exists yet.
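The peak TOPS figure can be approximated from the MAC Slice count alone. The sketch below assumes each 8×8 slice performs 64 multiply-accumulates per cycle and counts each MAC as two operations, which is the usual convention but is not stated in the text. It covers only the MAC Slices, so it lands slightly under the quoted total. The function name is illustrative.

```python
def peak_tops_int8(mac_slices, macs_per_slice=64, clock_hz=500e6, ops_per_mac=2):
    """Peak INT8 throughput in TOPS for the MAC Slices only.

    Assumes every MAC unit is busy every cycle and that one MAC
    counts as two operations (multiply + add).
    """
    return mac_slices * macs_per_slice * ops_per_mac * clock_hz / 1e12


print(f"MAC Slices only: ~{peak_tops_int8(94_752):,.0f} TOPS")
```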

Node-specific configurations

130nm (Sky130) — first silicon

L0: 16 cores, 4×4 mesh, 16 KB SRAM/core
L1: 6 clusters, 6×6 crossbar
L2: 6 tiles, 6×6 crossbar + Seal Core

16 × 6 × 6 = 576 Rich Cores
Total SRAM: 9 MB

This is the "small CFPU" — the first real silicon, where the full hierarchy and the Seal Core are already present, just at smaller scale. The RTL is identical to later chips.

28nm — commercial target

L0: 16 cores, 4×4 mesh, 64 KB SRAM/core
L1: 6 clusters, 6×6 crossbar
L2: 6 tiles, 6×6 crossbar
L3: 6 regions, 6×6 crossbar + Seal Core

16 × 6 × 6 × 6 = 3,456 Rich Cores
Total SRAM: 216 MB

7nm — high performance

L0: 16 cores, 4×4 mesh, 256 KB SRAM/core
L1: 8 clusters, 8×8 crossbar
L2: 8 tiles, 8×8 crossbar
L3: 8 regions, 8×8 crossbar + Seal Core

16 × 8 × 8 × 8 = 8,192 Rich Cores
Total SRAM: 2 GB

5nm — target configuration

L0: 16 cores, 4×4 mesh, 256 KB SRAM/core
L1: 8 clusters, 8×8 crossbar
L2: 8 tiles, 8×8 crossbar
L3: 10 regions, 10×10 crossbar + Seal Core

16 × 8 × 8 × 10 = 10,240 Rich Cores
Total SRAM: 2.5 GB

(On a 1,400 mm² die: 18 regions, 18,432 cores, 4.5 GB SRAM at 256 KB/core or 9 GB at 512 KB/core)
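The total SRAM figures for each node are the core count multiplied by the SRAM per core, in binary units (1 MB = 1,024 KB). A quick Python check of the four configurations, using an illustrative helper name:

```python
def total_sram_mb(cores, kb_per_core):
    """Total on-chip SRAM in MB (1 MB = 1024 KB)."""
    return cores * kb_per_core / 1024


# (core count, SRAM per core in KB) for each node-specific configuration.
configs = {"130nm": (576, 16), "28nm": (3_456, 64),
           "7nm": (8_192, 256), "5nm": (10_240, 256)}

for node, (cores, kb) in configs.items():
    print(f"{node}: {total_sram_mb(cores, kb):,.0f} MB")
```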

Latency stays bounded

Worst-case latency ranges from ~85 to ~171 cycles depending on the node, hierarchy depth, and payload size. At 5nm, the cross-region worst case is ~171 cycles (~342 ns @ 500 MHz, 128B payload); typical small messages (~48B) complete in ~93 cycles (~186 ns). For context: this is competitive with software actor messaging (Erlang/BEAM: ~0.5–2 µs).

Node	Core count (800 mm²)	SRAM/core	Levels	L0 cluster size	Worst-case latency (128B)	Typical (48B)
130nm	576	16 KB	3	~5.3 mm	~85 cyc	~38 cyc
28nm	3,456	64 KB	4	~2.7 mm	~120 cyc	~68 cyc
7nm	8,192	256 KB	4	~1.4 mm	~171 cyc	~93 cyc
5nm	10,240	256 KB	4	~1.1 mm	~171 cyc	~93 cyc
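The nanosecond figures quoted with the cycle counts follow from the 500 MHz clock, where one cycle is 2 ns. A minimal Python conversion, with an illustrative function name and the clock frequency taken from the text:

```python
def cycles_to_ns(cycles, clock_hz=500e6):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1e9 / clock_hz


for label, cycles in [("worst case, 128B", 171), ("typical, 48B", 93)]:
    print(f"{label}: {cycles} cycles = {cycles_to_ns(cycles):.0f} ns")
```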

Same RTL, from 130nm to 5nm. The 130nm 576-core first silicon runs exactly the same code as the 5nm chip with 10,000+ cores; only the parameters differ. The router design validated at F3 remains unchanged through F7. Worst-case cross-region latency is ~171 cycles (~342 ns @ 500 MHz, 128B payload, v3.2), competitive with software actor frameworks.

The Seal Core always at the center

Regardless of node or core count, the Seal Core is always at the geometric center of the chip — co-located with the L3 crossbar. This is a communication topology decision: the star center minimizes maximum wire length to any region, and all cross-region traffic naturally passes through it — enabling security inspection without extra routing.

Every chip variant, from 576 cores to 10,000+, uses the same network architecture: the Seal Core sees all cross-region traffic by design.

Open Source

The entire design process, the scaling model, and the parameterizable RTL are publicly available.

GitHub → CLI-CPU
Blog → Internet on a Chip
clicpu.org

The CFPU is not a chip — it is a chip family. One RTL that grows with the technology, from the 130nm first silicon to the 5nm 10,000+ core target configuration.