
From 130nm to 5nm

How the CFPU scales from 600 to 10,000+ cores
April 18, 2026 · Hocza József Szabolcs
Seal Core — the security core at the center of the CFPU chip

One RTL, every chip

Most processor architectures are designed for a specific manufacturing node and chip size. When the node changes, the chip is redesigned. The CFPU works differently: a single parameterizable RTL where chip size and manufacturing technology determine the core count.

The core count adapts to the chip, not the other way around.

This is possible because the CFPU's 4-level hierarchical architecture is parameterizable: the cluster size is fixed (4×4 = 16 cores, physical optimum), but the number of clusters, tiles, and regions is configurable. Same Verilog code, different parameters.

module cfpu_chip #(
    parameter CORES_PER_CLUSTER  = 16,    // fixed: 4×4 mesh
    parameter CLUSTERS_PER_TILE  = 12,    // node-dependent: 4-12
    parameter TILES_PER_REGION   = 12,    // node-dependent: 4-12
    parameter REGIONS            = 24,    // chip-size-dependent: 4-24
    parameter SRAM_KB_PER_CORE   = 1024   // node-dependent: 16-1024
);

    // 7nm, 800 mm²:  16 × 8 × 8 × 8  =  8,192 cores, 256 KB/core
    // 5nm, 800 mm²:  16 × 8 × 8 × 10 = 10,240 cores, 256 KB/core
endmodule

Why does it scale?

The key: SRAM, not logic, dominates core area (55-75%). SRAM scales worse than logic, which is why actual core density grows more slowly than logic density alone would suggest. To compensate, we increase SRAM at every node step.

| Node | Logic density | SRAM cell | Scaling factor |
|---|---|---|---|
| 130nm (Sky130) | ~2.5 MTr/mm² | ~1.0 μm² | |
| 28nm | ~20 MTr/mm² | ~0.12 μm² | ~8× |
| 7nm | ~91 MTr/mm² | ~0.027 μm² | ~37× |
| 5nm | ~171 MTr/mm² | ~0.021 μm² | ~48× |
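
The scaling factors in the table match the ratio of the 130nm SRAM cell area to each node's cell area, to within rounding. A quick check, using only the table's figures (the dictionary and function names are illustrative, not taken from the CFPU sources):

```python
# SRAM bit-cell areas in μm², copied from the table above.
SRAM_CELL_UM2 = {"130nm": 1.0, "28nm": 0.12, "7nm": 0.027, "5nm": 0.021}

def sram_scaling_factor(node: str, baseline: str = "130nm") -> float:
    """How many times smaller the SRAM cell is than at the baseline node."""
    return SRAM_CELL_UM2[baseline] / SRAM_CELL_UM2[node]

for node in ("28nm", "7nm", "5nm"):
    print(f"{node}: ~{sram_scaling_factor(node):.0f}x")
```

By the same table, logic density improves about 68× from 130nm to 5nm (171 / 2.5), against about 48× for the SRAM cell.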

Core size per node

A) Growing SRAM — richer actors, fewer cores

If we increase SRAM with each technology step, each core can run more actors, larger working sets, richer programs:

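| Node | Core logic | SRAM/core | SRAM area | Core total |
|---|---|---|---|---|
| 130nm | 0.80 mm² | 16 KB | 0.13 mm² | ~1.06 mm² |
| 28nm | 0.10 mm² | 64 KB | 0.06 mm² | ~0.18 mm² |
| 7nm | 0.022 mm² | 256 KB | 0.057 mm² | ~0.083 mm² |
| 5nm | 0.012 mm² | 512 KB | 0.088 mm² | ~0.103 mm² |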

B) Fixed 256 KB SRAM — maximum core count

If we keep SRAM at 256 KB on every node, the freed area translates to more cores — greater parallelism, more simultaneous actors:

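| Node | Core logic | SRAM/core | SRAM area | Core total |
|---|---|---|---|---|
| 130nm | 0.80 mm² | 256 KB | 2.10 mm² | ~2.93 mm² |
| 28nm | 0.10 mm² | 256 KB | 0.25 mm² | ~0.37 mm² |
| 7nm | 0.022 mm² | 256 KB | 0.057 mm² | ~0.083 mm² |
| 5nm | 0.012 mm² | 256 KB | 0.044 mm² | ~0.059 mm² |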
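
The SRAM area column in both tables is consistent with a simple model: number of bits times the bit-cell area from the density table, with no array overhead. That model is an inference from the numbers, not a stated method, and the function name below is illustrative:

```python
def sram_area_mm2(kb_per_core: int, cell_um2: float) -> float:
    """Raw bit-cell area: bits × cell area, with no array overhead (assumption)."""
    bits = kb_per_core * 1024 * 8
    return bits * cell_um2 / 1e6  # 1 mm² = 1e6 μm²

print(round(sram_area_mm2(256, 0.027), 3))  # 7nm, 256 KB -> 0.057
print(round(sram_area_mm2(512, 0.021), 3))  # 5nm, 512 KB -> 0.088
```

The "Core total" values are slightly larger than core logic plus SRAM area. The tables do not itemize the remainder, so this sketch does not model it.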

The choice between the two strategies depends on the workload: maximum parallelism (SNN, many small actors) → fixed 256 KB with more cores. Richer actors (ML inference, large object graphs) → growing SRAM. The parameterizable RTL supports both — SRAM_KB_PER_CORE is set at fabrication time.

How many cores fit on a chip?

Usable area = chip size × 0.78 (subtracting: I/O pads, Seal Core, crossbars, routing channels, peripherals).

A) Growing SRAM (16 KB → 512 KB)

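| Node | SRAM/core | Core size | 100 mm² | 400 mm² | 800 mm² | 1,400 mm² |
|---|---|---|---|---|---|---|
| 130nm | 16 KB | 1.06 mm² | 74 | 294 | 588 | 1,030 |
| 28nm | 64 KB | 0.18 mm² | 433 | 1,733 | 3,467 | 6,067 |
| 7nm | 256 KB | 0.083 mm² | 940 | 3,759 | 7,518 | 13,157 |
| 5nm | 512 KB | 0.103 mm² | 757 | 3,029 | 6,058 | 10,602 |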

B) Fixed 256 KB SRAM — maximum parallelism

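| Node | SRAM/core | Core size | 100 mm² | 400 mm² | 800 mm² | 1,400 mm² |
|---|---|---|---|---|---|---|
| 130nm | 256 KB | 2.93 mm² | 27 | 107 | 213 | 373 |
| 28nm | 256 KB | 0.37 mm² | 211 | 843 | 1,686 | 2,951 |
| 7nm | 256 KB | 0.083 mm² | 940 | 3,759 | 7,518 | 13,157 |
| 5nm | 256 KB | 0.059 mm² | 1,322 | 5,288 | 10,576 | 18,508 |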
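
The core counts in both tables follow from the usable-area rule: die area × 0.78, divided by the core size. A minimal sketch, assuming the result is rounded to the nearest whole core (the function name is illustrative):

```python
USABLE_FRACTION = 0.78  # from the text: share of the die left for cores

def cores_per_die(die_mm2: float, core_mm2: float) -> int:
    """Cores that fit on a die, rounded to the nearest whole core."""
    return round(die_mm2 * USABLE_FRACTION / core_mm2)

# 7nm (0.083 mm²) and 5nm with fixed 256 KB (0.059 mm²) across the four die sizes.
for die in (100, 400, 800, 1400):
    print(die, cores_per_die(die, 0.083), cores_per_die(die, 0.059))
```

A few 130nm entries differ from this formula by one core, which would be consistent with the tables having been computed from unrounded core sizes.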

Core count (800 mm² die, fixed 256 KB):

130nm: 213
28nm: 1,686
7nm: 7,518
5nm: 10,576

At 5nm, the 256 KB variant yields ~75% more cores than the 512 KB variant (10,576 vs 6,058). Which to choose depends on the workload: maximum parallelism (SNN, many small actors) favors 256 KB; richer actors (ML, large objects) favor 512 KB. The parameterizable RTL supports either.

The hierarchy adapts

The 4-level architecture is not rigid — the number of levels follows from the core count:

| Core count | Levels | Configuration |
|---|---|---|
| < 256 | 2 (mesh + 1 crossbar) | IoT, Secure Element |
| 256 – 2,000 | 3 (mesh + 2 crossbars) | Edge, microcontroller |
| 2,000 – 20,000 | 4 (mesh + 3 crossbars) | Compute target |
| 20,000+ | 4-5 | Wafer-scale |
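
The table can be read as a simple lookup from core count to hierarchy depth. A sketch of that lookup; the function name is hypothetical, and because the table's ranges share their boundaries at 2,000 and 20,000, each upper bound is assumed to be exclusive here:

```python
def hierarchy_levels(core_count: int) -> str:
    """Map a core count to the level count listed in the table above.

    Assumption: each range's upper bound is exclusive.
    """
    if core_count < 256:
        return "2"
    if core_count < 2_000:
        return "3"
    if core_count < 20_000:
        return "4"
    return "4-5"

print(hierarchy_levels(10_240))  # 5nm target configuration
```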

Four core types + MAC Slice

The CFPU is not limited to a single core type. Four programmable core types exist, and ML inference uses non-programmable MAC Slices:

| Core | ISA | FPU | GC + Obj. | SRAM | Role |
|---|---|---|---|---|---|
| Nano | CIL-T0 (48 opcodes) | None | None | 4-64 KB | Spike, sensor, simple actor |
| Actor | Full CIL | None | Yes | 64-256 KB | Actor-native: objects, GC, exceptions |
| Rich | Full CIL + FPU | Yes | Yes | 128-512 KB | FP-intensive, scientific, full .NET |
| Seal | AuthCode, hash | Crypto HW | | ~32 KB | Code authentication, 1+ per chip |
| MAC Slice | FSM (no CIL) | | | 4-256 KB | ML inference: 8×8 INT8 MAC, not programmable |

Core count comparison (5nm, 800 mm²)

| Core + SRAM | Node size | Core count |
|---|---|---|
| Nano + 4 KB | 0.017 mm² | ~47,000 |
| Actor + 64 KB | 0.032 mm² | ~25,000 |
| Rich + 256 KB | 0.071 mm² | ~11,300 |

Node sizes include the recommended L0 router variant area and L1–L3 infrastructure per-core share (~0.007 mm²). The CFPU-ML chip uses MAC Slices (not programmable cores) in a chiplet architecture — see CFPU-ML-Max.
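
The counts in this comparison are consistent with dividing the full 800 mm² die by the node size, without the 0.78 usable-area factor used earlier. One plausible reading is that the node size already carries its infrastructure share. That is an inference from the numbers, and the names below are illustrative:

```python
DIE_MM2 = 800  # 5nm die size used in the comparison

# Per-core node sizes in mm², from the table above.
NODE_MM2 = {"Nano + 4 KB": 0.017, "Actor + 64 KB": 0.032, "Rich + 256 KB": 0.071}

def node_count(die_mm2: float, node_mm2: float) -> int:
    """Nodes that fit when each node already includes its infrastructure share."""
    return round(die_mm2 / node_mm2)

for name, size in NODE_MM2.items():
    print(f"{name}: {node_count(DIE_MM2, size):,}")
```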

CFPU product family

| Variant | Seal | Cores | Target market |
|---|---|---|---|
| CFPU-N | 1 Seal | Nano only | IoT, massive SNN |
| CFPU-A | 1 Seal | Actor only | Actor cluster, streaming |
| CFPU-R | 1 Seal | Rich only | ML inference, FP |
| CFPU-ML | 1 Seal | MAC Slice + Actor | ML/SNN inference (chiplet) |
| CFPU-H | 1 Seal | Actor + Nano | Heterogeneous supervisor+worker |
| CFPU-X | 1 Seal | Mixed (any combination) | Application-specific |

The RTL is the same; the CORE_TYPE parameter determines the fabricated core type. On a heterogeneous chip, multiple types can coexist on a single die.

CFPU-ML-Max (5nm, chiplet, 18 tine): ~94,752 MAC Slices (8×8 INT8, M SKU) + ~150 Actor Cores (FP16, on IOD) = ~6,066 peak TOPS INT8 @ 500 MHz, 740 MB on-chip SRAM*. Details: Core Types blog, CFPU-ML-Max blog.

* Arithmetic projections with +25% design margin. Chiplet: 85 mm² tine die (5nm), 2×9 layout (18 tine + IOD, SoIC+CoWoS). No RTL simulation or silicon measurement exists yet.
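
The peak TOPS figure can be roughly reproduced from the MAC Slice count alone. A back-of-the-envelope sketch, assuming each MAC contributes two operations per cycle (one multiply, one add) and every MAC is busy on every cycle; both are assumptions, not statements from the CFPU-ML-Max description:

```python
MAC_SLICES = 94_752        # from the text (approximate)
MACS_PER_SLICE = 8 * 8     # 8×8 INT8 array
OPS_PER_MAC = 2            # assumption: one multiply + one add per cycle
CLOCK_HZ = 500e6           # 500 MHz

def peak_tops(slices: int, macs: int, ops: int, clock_hz: float) -> float:
    """Peak tera-operations per second if every MAC is busy every cycle."""
    return slices * macs * ops * clock_hz / 1e12

print(f"{peak_tops(MAC_SLICES, MACS_PER_SLICE, OPS_PER_MAC, CLOCK_HZ):.0f} TOPS")
```

This gives about 6,064 TOPS, within 0.1% of the quoted ~6,066. The text does not break down the remainder.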

Node-specific configurations

130nm (Sky130) — first silicon

L0: 16 cores, 4×4 mesh, 16 KB SRAM/core
L1: 6 clusters, 6×6 crossbar
L2: 6 tiles, 6×6 crossbar + Seal Core

16 × 6 × 6 = 576 Rich Cores
Total SRAM: 9 MB

This is the "small CFPU" — the first real silicon, where the full hierarchy and the Seal Core are already present, just at smaller scale. The RTL is identical to later chips.

28nm — commercial target

L0: 16 cores, 4×4 mesh, 64 KB SRAM/core
L1: 6 clusters, 6×6 crossbar
L2: 6 tiles, 6×6 crossbar
L3: 6 regions, 6×6 crossbar + Seal Core

16 × 6 × 6 × 6 = 3,456 Rich Cores
Total SRAM: 216 MB

7nm — high performance

L0: 16 cores, 4×4 mesh, 256 KB SRAM/core
L1: 8 clusters, 8×8 crossbar
L2: 8 tiles, 8×8 crossbar
L3: 8 regions, 8×8 crossbar + Seal Core

16 × 8 × 8 × 8 = 8,192 Rich Cores
Total SRAM: 2 GB

5nm — target configuration

L0: 16 cores, 4×4 mesh, 256 KB SRAM/core
L1: 8 clusters, 8×8 crossbar
L2: 8 tiles, 8×8 crossbar
L3: 10 regions, 10×10 crossbar + Seal Core

16 × 8 × 8 × 10 = 10,240 Rich Cores
Total SRAM: 2.5 GB

(On a 1,400 mm² die: 18 regions, 18,432 cores, 4.5/9.2 GB SRAM)
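
The core counts and SRAM totals for the four node configurations follow directly from the hierarchy parameters. A small check, with the per-node values copied from the lists above (the `CONFIGS` table and `totals` function are illustrative names):

```python
from math import prod

# (clusters/tile, tiles/region, regions, SRAM KB/core) per node, from the lists above.
# The 130nm chip has no region level, so its region count is 1 here.
CONFIGS = {
    "130nm": (6, 6, 1, 16),
    "28nm": (6, 6, 6, 64),
    "7nm": (8, 8, 8, 256),
    "5nm": (8, 8, 10, 256),
}
CORES_PER_CLUSTER = 16  # fixed 4×4 mesh

def totals(node: str) -> tuple[int, float]:
    """Return (core count, total SRAM in MB) for a node configuration."""
    *hierarchy, sram_kb = CONFIGS[node]
    cores = CORES_PER_CLUSTER * prod(hierarchy)
    return cores, cores * sram_kb / 1024

for node in CONFIGS:
    cores, sram_mb = totals(node)
    print(f"{node}: {cores:,} cores, {sram_mb:,.0f} MB SRAM")
```

With 1 GB taken as 1,024 MB, the 7nm and 5nm results (2,048 MB and 2,560 MB) correspond to the 2 GB and 2.5 GB quoted above.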

Latency remains constant

Worst-case latency ranges from ~85 to ~171 cycles depending on the node, hierarchy depth, and payload size. At 5nm, the cross-region worst case is ~171 cycles (~342 ns @ 500 MHz, 128B payload); typical small messages (~48B) complete in ~93 cycles (~186 ns). For context: this is competitive with software actor messaging (Erlang/BEAM: ~0.5–2 µs).
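
The nanosecond figures are the cycle counts converted at the 500 MHz clock, that is, 2 ns per cycle. A one-function sketch of the conversion (the function name is illustrative):

```python
CLOCK_HZ = 500e6  # 500 MHz, as stated in the text

def cycles_to_ns(cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles / clock_hz * 1e9

print(cycles_to_ns(171), cycles_to_ns(93))  # worst case and typical, in ns
```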

| Node | Core count (800 mm²) | SRAM/core | Levels | L0 cluster size | Worst-case latency (128B) | Typical (48B) |
|---|---|---|---|---|---|---|
| 130nm | 576 | 16 KB | 2 | ~5.3 mm | ~85 cyc | ~38 cyc |
| 28nm | 3,456 | 64 KB | 3 | ~2.7 mm | ~120 cyc | ~68 cyc |
| 7nm | 8,192 | 256 KB | 4 | ~1.4 mm | ~171 cyc | ~93 cyc |
| 5nm | 10,240 | 256 KB | 4 | ~1.1 mm | ~171 cyc | ~93 cyc |

Same RTL, from 130nm to 5nm. The 130nm 576-core first silicon runs exactly the same code as the 5nm 10,000+-core chip; only the parameters differ. The router design validated at F3 remains unchanged through F7. Worst-case cross-region latency is ~171 cycles (~342 ns @ 500 MHz, 128B payload, v3.2), competitive with software actor frameworks.

The Seal Core always at the center

Regardless of node or core count, the Seal Core always sits at the geometric center of the chip, co-located with the top-level crossbar (L3 on the four-level chips, L2 on the 130nm part). This is a communication topology decision: the star center minimizes the maximum wire length to any region, and all cross-region traffic naturally passes through it, which enables security inspection without extra routing.

Every chip variant, from 576 cores to 10,000+, uses the same network architecture: the Seal Core sees all cross-region traffic by design.

Open Source

The entire design process, the scaling model, and the parameterizable RTL are publicly available.

The CFPU is not a chip — it is a chip family. One RTL that grows with the technology, from the 130nm first silicon to the 5nm 10,000+ core target configuration.