Industry Pioneers Unveil Revolutionary Quad-Chiplet AI Accelerator

The 2026 International Solid-State Circuits Conference (ISSCC) recently witnessed the unveiling of Rebellions’ quad-chiplet Rebel 100 AI accelerator, a groundbreaking design that uses Unified Chiplet Interconnect Express (UCIe) technology to stitch four chiplets together. This approach represents a significant milestone in the industry’s transition towards multi-chiplet designs for high-performance AI and HPC accelerators.

The Rebel 100 is a four-chiplet AI accelerator designed for large language model inference, aiming to maximize die yield and performance while striking the right balance between price and throughput. The system-in-package (SiP) comprises four 320 mm² neural processing unit (NPU) dies, each equipped with a 12-Hi, 36 GB HBM3E memory stack, interconnected with one another in a mesh topology.

Each NPU die is fabricated on Samsung’s performance-enhanced SF4X process technology and packaged using Samsung’s I-CubeS advanced packaging method. The SiP also features four integrated silicon capacitor (ISC) dies that serve mechanical purposes. The chiplets are interconnected using a UCIe-Advanced die-to-die interface running at 16 Gbps, providing an aggregate bandwidth of 4 TB/s.

The interconnect achieves roughly 11 ns of FDI-to-FDI (Flit-Aware Die-to-Die interface) latency, extending memory load-store semantics transparently across chiplets so that the SiP behaves as a single processor. This lets software developers focus on writing efficient algorithms without worrying about the complexities of multi-chiplet interconnects.
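Transparent load-store semantics of this kind boil down to a flat address space decoded into a (chiplet, local offset) pair. The sketch below illustrates the idea under assumed parameters from the article (four dies, 36 GB of HBM3E each); the field layout and function names are illustrative, not Rebellions' actual design.

```python
# Illustrative decode of a flat SiP-wide address into a target chiplet
# and a local HBM offset. Sizes come from the article (4 dies x 36 GB);
# the striping scheme itself is an assumption.

HBM_PER_CHIPLET = 36 * 2**30   # 36 GB of HBM3E behind each NPU die
NUM_CHIPLETS = 4

def decode(addr: int) -> tuple[int, int]:
    """Map a flat SiP-wide address to (chiplet_id, local_offset)."""
    assert 0 <= addr < NUM_CHIPLETS * HBM_PER_CHIPLET
    return addr // HBM_PER_CHIPLET, addr % HBM_PER_CHIPLET

def is_remote(addr: int, my_chiplet: int) -> bool:
    """A load/store that targets another die would be routed over UCIe."""
    chiplet, _ = decode(addr)
    return chiplet != my_chiplet
```

With this mapping, the same load instruction works whether the data happens to sit in local or remote HBM; only latency differs.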

The Rebel 100 connects to hosts via two PCIe 5.x x16 interfaces that support SR-IOV and peer-to-peer operation. The accelerator can deliver 2 PFLOPS of FP8 or 1 PFLOPS of FP16 performance (dense, without sparsity) at 600W, which is in line with what Nvidia’s H200 delivers at 700W.
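A quick back-of-the-envelope calculation using only the figures quoted above gives the accelerator's dense FP8 efficiency:

```python
# Efficiency from the quoted figures (dense, no sparsity, board power
# as stated in the article).

fp8_pflops = 2.0      # Rebel 100, FP8, dense
power_w = 600.0

fp8_tflops_per_w = fp8_pflops * 1000 / power_w
print(f"{fp8_tflops_per_w:.2f} FP8 TFLOPS/W")  # ≈ 3.33
```

At comparable absolute throughput but 100W less board power, the perf-per-watt advantage over a 700W part follows directly.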

A key aspect of the Rebel 100 design is its focus on data movement. Each NPU die integrates a configurable DMA subsystem with eight execution engines that can pull data from local HBM3E, remote HBM3E located on another chiplet, or from distributed shared memory. This allows for efficient data transfer and reduces latency.
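The three data sources described above can be pictured as a field in a DMA descriptor handed to one of the eight engines. The sketch below is a minimal illustration of that programming model; the class, field names, and validation are assumptions, not Rebellions' actual interface.

```python
# Illustrative DMA descriptor for a per-die DMA subsystem with eight
# engines that can source from local HBM, a remote chiplet's HBM, or
# distributed shared memory. All names are hypothetical.

from dataclasses import dataclass
from enum import Enum, auto

class Source(Enum):
    LOCAL_HBM = auto()     # this die's HBM3E stack
    REMOTE_HBM = auto()    # another chiplet's stack, reached over UCIe
    SHARED_MEM = auto()    # distributed shared memory

@dataclass
class DmaDescriptor:
    engine: int            # 0..7, one of the eight execution engines
    source: Source
    src_addr: int
    dst_addr: int
    length: int

    def __post_init__(self):
        if not 0 <= self.engine < 8:
            raise ValueError("only eight DMA engines per die")
```

Keeping the source type explicit in the descriptor lets the hardware pick the right path (local crossbar vs. UCIe link) without software involvement.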

To coordinate work across four chiplets, Rebellions implemented synchronization managers in each NPU instead of relying on a dedicated scheduler. Each chiplet integrates a dedicated hardware synchronization manager with hardwired control logic that can coordinate activity across dies, either under centralized control or in a more autonomous manner.
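In its simplest centralized form, such a synchronization manager implements barrier-like semantics across the four dies: each die signals arrival, and the phase completes once all have checked in. The sketch below is purely illustrative of that behavior, not the actual hardware logic.

```python
# Minimal model of a cross-die barrier as a hardware synchronization
# manager might implement in centralized mode; an autonomous mode would
# distribute this state. Hypothetical, for illustration only.

class SyncManager:
    def __init__(self, num_dies: int = 4):
        self.num_dies = num_dies
        self.arrived = set()

    def arrive(self, die_id: int) -> bool:
        """Record a die reaching the barrier; True once all have arrived."""
        self.arrived.add(die_id)
        if len(self.arrived) == self.num_dies:
            self.arrived.clear()   # reset for the next execution phase
            return True
        return False
```

Hardwiring this logic per chiplet avoids round-trips to a separate scheduler die on every phase boundary.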

The architecture avoids direct peer-to-peer communication between units and minimizes inter-unit dependencies to cut unnecessary traffic and coordination overhead, keeping overall utilization high across the different execution phases of LLM inference. The on-chip 2D network-on-chip (NoC) uses a straightforward XY routing scheme, with turn restrictions applied to avoid deadlocks.
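Dimension-ordered XY routing moves a packet fully along the X axis first, then along Y; forbidding the remaining turns statically breaks the dependency cycles that cause deadlock on a mesh. A minimal sketch (mesh coordinates are illustrative):

```python
# Dimension-ordered XY routing on a 2D mesh: resolve X first, then Y.
# The implied turn restriction (never turn from Y back into X) is what
# prevents cyclic channel dependencies and thus deadlock.

def xy_next_hop(cur: tuple[int, int], dst: tuple[int, int]) -> tuple[int, int]:
    """One routing step from `cur` toward `dst`, X dimension first."""
    x, y = cur
    dx, dy = dst
    if x != dx:                       # resolve X before touching Y
        return (x + (1 if dx > x else -1), y)
    if y != dy:
        return (x, y + (1 if dy > y else -1))
    return cur                        # already at the destination

def route(src, dst):
    """Full hop-by-hop path from src to dst."""
    path = [src]
    while path[-1] != dst:
        path.append(xy_next_hop(path[-1], dst))
    return path
```

The route is deterministic and minimal, which keeps router logic simple at the cost of no adaptive load balancing.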

Arbitration inside routers is handled using a weighted round-robin mechanism, which services traffic from different sources fairly but allows for adjustable priority. The quality-of-service weights can be modified at runtime to make the system favor certain traffic types depending on whether the workload is compute-heavy or memory-intensive.
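One common way to realize weighted round-robin with runtime-adjustable weights is a per-port credit budget replenished each round; a port is eligible while it has credits left. The sketch below illustrates that scheme; it is an assumption about the mechanism, not Rebellions' router RTL.

```python
# Illustrative weighted round-robin arbiter with runtime-tunable QoS
# weights: each input port gets `weight` grants per round.

from typing import Optional

class WrrArbiter:
    def __init__(self, weights: dict[str, int]):
        self.weights = dict(weights)   # adjustable at runtime
        self.credits = {}

    def set_weight(self, port: str, w: int):
        """E.g. favor memory traffic during memory-intensive phases."""
        self.weights[port] = w

    def grant(self, requesting: list[str]) -> Optional[str]:
        """Pick one requesting port according to its remaining credits."""
        if not any(self.credits.get(p, 0) > 0 for p in requesting):
            self.credits = dict(self.weights)   # start a new round
        for p in requesting:                    # fixed scan order
            if self.credits.get(p, 0) > 0:
                self.credits[p] -= 1
                return p
        return None
```

With weights {a: 2, b: 1}, port `a` receives two grants for every one that `b` gets, yet `b` is never starved.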

The 2D NoC mesh inside each chiplet logically extends over UCIe, so at the logical level the full quad-chiplet system-in-package behaves like one large mesh-connected processor, further simplifying the programming model.

To mitigate power integrity challenges, Rebellions implemented a hardware staggering technique that offsets start times of neural cores instead of activating them simultaneously. Measurements show that synchronized switching produces steep current spikes and noticeable voltage disturbance, whereas staggered activation results in gentler transitions and a more stable power rail.

Additionally, Rebellions added dedicated integrated silicon capacitor (ISC) dies that embed distributed capacitance across the VDD rails to serve both the NPU and the HBM3E PHY. This approach reinforces the design by dampening voltage oscillations and lowering impedance peaks compared to a design without ISC dies.

The Rebel 100 represents a significant breakthrough in multi-chiplet design as the first quad-chiplet AI inference accelerator built on UCIe-A interconnects. Instead of building two large reticle-size dies, Rebellions opted for a quad-chiplet design with four 320 mm² dies that are much easier to develop and yield.

To make the quad-chiplet design work seamlessly, Rebellions developed an internal 2D mesh network-on-chip that logically extends beyond each chiplet’s boundary over UCIe, which simplifies software development and enables more efficient algorithms.

Rebellions also implemented its own configurable DMA subsystem and synchronization managers, rather than adopting standard CXL-based protocols. Furthermore, to ensure power integrity, it implemented a proprietary hardware staggering technique that smooths current ramps and reduces supply noise. On top of this, the company added integrated silicon capacitor (ISC) dies to dampen voltage fluctuations and lower impedance peaks.

While not using the UCIe 1.0 specification to its full extent, the Rebel 100 represents a good example of a multi-chiplet design that relies on industry-standard interconnection while still using proprietary techniques to maximize performance and optimize power efficiency.

In conclusion, Rebellions’ Rebel 100 quad-chiplet AI accelerator showcases a pioneering approach to multi-chiplet design. Its focus on data movement, synchronization managers, and hardware staggering demonstrates innovative solutions to the challenges of high-performance computing. As the industry continues to evolve towards multi-chiplet designs, the Rebel 100 represents an exciting milestone in the pursuit of more efficient and powerful AI accelerators.
