





# NDPBridge: Enabling Cross-Bank Coordination in Near-DRAM-Bank Processing Architectures

### Boyu Tian, Yiwei Li, Li Jiang, Shuangyu Cai, Mingyu Gao

Tsinghua University; Shanghai Qi Zhi Institute; Shanghai Jiao Tong University; Huawei Technologies Co., Ltd.



ISCA 2024

# Near-Data Processing (NDP)

Near-data processing (NDP): place compute logic near data memory

- Shorter distance → lower latency and energy
- Higher bandwidth

Various memory technologies can be used to realize NDP.



### DRAM-Bank NDP Systems

Add compute logic inside or near DRAM banks

- Fine granularity, high bandwidth, high parallelism
- Thousands of units

#### Typical commercial products:

UPMEM, Samsung's HBM-PIM, SK Hynix's AiM



[Figure: UPMEM chip]



# Limitation 1: Lack of Communication Support

Different DRAM banks cannot communicate directly.

- Applications on DRAM-bank NDP follow a data-local execution paradigm.
- Communication is done through expensive host CPU forwarding.
- Adding physical links between banks is prohibitively expensive.



[Figure: Communication Overhead; axis: End-to-end Execution Time]

# Limitation 2: Lack of Load Balancing Support

- DRAM-bank NDP systems have thousands of NDP cores.
- Static task assignment does not suit applications that generate tasks dynamically.
- Dynamic load balancing is not possible without cross-bank communication.
- Data-local execution makes scheduling more complex.



Enable cross-bank communication without altering DRAM form factors.

Enable cross-bank load balancing that is compatible with the communication scheme.

1. Task-based message-passing programming model.
2. Cross-bank communication scheme using "bridges".
3. Data-transfer-aware scheduling policy.

# Task-Based Programming Model

A task is the basic unit for execution and scheduling.

- Tasks spawn child tasks dynamically.
- Each task is associated with one data element.
- Communication is done through pushing tasks by message passing instead of pulling data.
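
The bullets above can be illustrated with a minimal Python sketch. It is not NDPBridge's API: the names `Task`, `owner_of`, `run`, and `NUM_UNITS`, as well as the modulo data placement, are hypothetical and chosen only for illustration. The example sums a linked list whose nodes are spread across units; each task reads only its own unit's data and reaches the next node by pushing a child task to that node's owner.

```python
from collections import deque
from dataclasses import dataclass

NUM_UNITS = 4  # hypothetical number of NDP units


@dataclass
class Task:
    data_id: int  # the single data element this task operates on
    payload: int  # value carried with the task (here: a running sum)


def owner_of(data_id: int) -> int:
    """Hypothetical static placement: element i lives in unit i % NUM_UNITS."""
    return data_id % NUM_UNITS


def run(values, next_of, start):
    """Sum a linked list whose nodes are distributed across units.

    Returns (sum, number of tasks pushed to a different unit).
    """
    queues = [deque() for _ in range(NUM_UNITS)]
    queues[owner_of(start)].append(Task(start, 0))
    messages = 0
    result = None
    while any(queues):
        for unit, queue in enumerate(queues):
            if not queue:
                continue
            task = queue.popleft()
            total = task.payload + values[task.data_id]  # local access only
            nxt = next_of[task.data_id]
            if nxt is None:
                result = total  # end of the list
                continue
            dest = owner_of(nxt)
            if dest != unit:
                messages += 1  # the child task crosses units as a message
            queues[dest].append(Task(nxt, total))  # spawn the child task
    return result, messages
```

For the list 0 → 4 → 2 → 1 with values `[1, 2, 3, 4, 5]`, `run` visits four nodes and pushes two tasks across units, since nodes 0 and 4 share a unit under this placement.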



### NDPBridge Overview



Idea: add bridges at each level of the memory hierarchy

- Bridges gather messages from, and scatter messages to, the mailboxes of their child nodes
- Communication reuses existing physical links and DDR commands
- All modifications are confined to standalone modules
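
To make the gather/scatter idea concrete, here is a rough Python sketch of how a message might travel through a tree of bridges. The number of bridge levels, their names (`BRIDGE_LEVELS`), the `(channel, rank, chip, bank)` addressing, and the rule that a message climbs only as far as the lowest bridge covering both endpoints are all assumptions made for illustration; the slides do not specify them.

```python
# Assumed bridge levels, innermost first. Purely illustrative.
BRIDGE_LEVELS = ("chip", "rank", "channel", "root")


def route(src, dst):
    """Return the hops a message takes between two units.

    Units are addressed as (channel, rank, chip, bank) tuples. Each hop is
    (operation, level the message arrives at). Assumption: the message is
    gathered upward until it reaches a bridge whose subtree contains the
    destination, then scattered downward.
    """
    if src == dst:
        return []
    shared = 0  # length of the address prefix common to src and dst
    while src[shared] == dst[shared]:
        shared += 1
    climb = len(src) - shared  # number of bridge levels to go up
    up = [("GATHER", BRIDGE_LEVELS[i]) for i in range(climb)]
    down = [("SCATTER", BRIDGE_LEVELS[i - 1]) for i in range(climb - 1, 0, -1)]
    down.append(("SCATTER", "bank"))
    return up + down
```

Under these assumptions, two banks in the same chip need one gather and one scatter, while banks in different chips of the same rank need two of each.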



### Bridge-Based Communication



[Figure: bridge-based communication walkthrough; step label: SCHEDULE]

# Bridge-Based Load Balancing

- The bridge commands scheduling
- Units prepare tasks
- The bridge gathers tasks
- The bridge assigns and dispatches tasks

Messages: GATHER (bank → bridge) + SCATTER (bridge → bank)
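
The four steps can be sketched as one load-balancing round in Python. This is a simplified illustration, not the scheduling policy from the talk: the function name `balance_round`, the average-based quota, the per-unit `budget`, and the shortest-queue assignment rule are all assumptions.

```python
def balance_round(queues, budget=2):
    """Run one bridge-driven load-balancing round over child task queues.

    `queues` is a list of per-unit task lists, modified in place.
    Returns the number of tasks moved.
    """
    n = len(queues)
    target = sum(len(q) for q in queues) // n

    # Step 1: the bridge commands scheduling. Here it tells each unit how
    # many tasks to give up (assumed rule: surplus over average, capped).
    quotas = [min(max(len(q) - target, 0), budget) for q in queues]

    # Step 2: units prepare tasks by moving them into their mailboxes.
    mailboxes = [[q.pop() for _ in range(k)] for q, k in zip(queues, quotas)]

    # Step 3: the bridge gathers the prepared tasks (GATHER, bank -> bridge).
    gathered = [task for box in mailboxes for task in box]

    # Step 4: the bridge assigns each task to the currently shortest queue
    # and dispatches it (SCATTER, bridge -> bank).
    for task in gathered:
        dest = min(range(n), key=lambda i: len(queues[i]))
        queues[dest].append(task)
    return len(gathered)
```

The `budget` cap keeps each round small, so several rounds may be needed before the queues even out.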

## Load Balance: Data-First Scheduling Problem

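- Shared-memory architecture: any core can reach any data, so tasks can be rescheduled freely.
- NDP architecture with data-local execution paradigm: a task runs where its data resides, so rescheduling a task means moving its data as well.
- Data transfer takes time.
- We need data-transfer-aware load balancing.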

#### Hide transfer latency:

- Traditional scheduling steals tasks only when the local queue is empty
- Instead, schedule tasks in advance
- Overlap the transfer latency with the execution of the remaining local tasks
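
A small simulation can show why scheduling in advance hides latency. The model below is an assumption made for illustration, not taken from the talk: one unit executes one task per cycle, a refill batch arrives `latency` cycles after it is requested, and a request is issued once the queue length drops to `threshold`. Setting `threshold=0` models stealing only when the local queue is empty.

```python
def idle_cycles(initial, refills, batch, latency, threshold):
    """Count the idle cycles of one unit under a simple refill model.

    initial:   tasks in the local queue at cycle 0
    refills:   number of refill batches available
    batch:     tasks per refill
    latency:   cycles between a refill request and its arrival
    threshold: request a refill when the queue length is <= threshold
    """
    queue, idle, cycle = initial, 0, 0
    arrival = None  # cycle at which the in-flight batch arrives, if any
    while queue > 0 or refills > 0 or arrival is not None:
        if arrival is not None and cycle >= arrival:
            queue += batch
            arrival = None
        if arrival is None and refills > 0 and queue <= threshold:
            arrival = cycle + latency  # issue the request now
            refills -= 1
        if queue > 0:
            queue -= 1  # execute one task this cycle
        else:
            idle += 1   # nothing to run while the transfer is in flight
        cycle += 1
    return idle
```

With a threshold at least as large as the transfer latency, each batch arrives before the queue runs dry, so the unit never waits in this model.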






#### Avoid transfer congestion:

- Traditional work stealing steals half of the victim's queue at once
- Instead, use fine-grained scheduling
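
The difference in burst size can be sketched with a toy Python model. It assumes a single victim and a single thief that do not execute tasks while transfers happen, and the function name and `chunk` parameter are hypothetical. It only compares how many tasks move per transfer, not actual link congestion.

```python
def transfer_bursts(victim, thief, chunk=None):
    """Return the sizes of successive transfers until the queues are balanced.

    victim, thief: queue lengths of the two units.
    chunk=None models traditional work stealing (half of the victim's queue
    in one transfer); otherwise each transfer moves at most `chunk` tasks.
    """
    bursts = []
    while victim - thief > 1:
        gap = victim - thief
        if chunk is None:
            move = victim // 2           # steal half of the victim's queue
        else:
            move = min(chunk, gap // 2)  # fine-grained: small, bounded moves
        victim -= move
        thief += move
        bursts.append(move)
    return bursts
```

Both variants move the same total number of tasks here, but the fine-grained one spreads them over many small transfers instead of one large burst.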




#### Reduce transfer traffic:

- Traditional work stealing steals tasks from the tail of the task queue
- Scheduling the tasks of hot data instead can reduce data traffic






- We use a sketch to filter hot data
  - Similar to HeavyGuardian [KDD '18]
  - Tasks of hot data are stored separately
  - Storage overhead: 2.2 KB in SRAM
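
As an illustration of filtering hot data with a small, fixed amount of state, the Python sketch below uses the Misra-Gries frequent-items summary. This is a different, deterministic technique swapped in for simplicity; it is not HeavyGuardian and not the design from the talk. The function names and the `min_count` threshold are hypothetical.

```python
def misra_gries(accesses, k):
    """Summarize a stream of data ids with at most k counters (Misra-Gries)."""
    counters = {}
    for x in accesses:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def hot_set(accesses, k, min_count):
    """Data ids whose summarized count is at least min_count."""
    return {x for x, c in misra_gries(accesses, k).items() if c >= min_count}


def split_tasks(task_data_ids, hot):
    """Store tasks of hot data separately from the rest."""
    hot_queue = [d for d in task_data_ids if d in hot]
    cold_queue = [d for d in task_data_ids if d not in hot]
    return hot_queue, cold_queue
```

The counts are underestimates, so `min_count` acts as a conservative filter: an element that passes it was accessed at least that many times.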



# Methodology

#### Simulated platform

- 2 channels × 4 ranks/ch × 8 chips × 8 banks, 512 units in total
- Simulated using zsim

#### Workloads

- Linked list (ll)
- Hash table (ht)
- Tree traversal (tree)
- SpMV (spmv)
- PageRank (pr)
- Breadth-first search (bfs)
- Single-source shortest path (sssp)
- Weakly connected components (wcc)

#### Baselines

| Configuration          | Communication              | Load Balancing                 |
|------------------------|----------------------------|--------------------------------|
| CPU Forwarding         | CPU forwarding             | -                              |
| Bridge Communication   | Bridge-based communication | -                              |
| Bridge + Work Stealing | Bridge-based communication | Work stealing                  |
| NDPBridge              | Bridge-based communication | Data-transfer-aware scheduling |

### Experiment Results

Bridge-based communication: 1.51× speedup over CPU forwarding

- Due to reduced communication overhead (32.7% → 1.4% idle wait time)
- Still suffers from load imbalance

Bridge + Work Stealing: 1.45× speedup over bridge-based communication without scheduling

- But more communication overhead (1.4% → 18.6% idle wait time)

[Figure legend: CPU Forwarding, Bridge Communication, Bridge + Work Stealing]

NDPBridge: best performance, 2.98× speedup over CPU forwarding

- 1.35× over Bridge + Work Stealing (18.6% → 10.0% idle wait time)

[Figure legend: CPU Forwarding, Bridge Communication, Bridge + Work Stealing, NDPBridge]

### Summary

The lack of cross-bank communication and load balancing support hinders the adoption of DRAM-bank NDP architectures.

#### Our contributions:

- Bridge-based communication: supports cross-bank communication with acceptable hardware cost
- Data-transfer-aware scheduling: supports cross-bank load balancing built upon the communication scheme and with reduced data transfer overhead
- NDPBridge: promotes wider and easier adoption of DRAM-bank NDP architectures

