[컴퓨터구조] 캐시 성능의 측정 및 향상

SKKU SW/Computer Architecture

[컴퓨터구조] 캐시 성능의 측정 및 향상

효딩 2023. 6. 2. 00:42

Memory stall cycles

= Memory accesses / Program x Miss rate x Miss penalty

= Instructions / Program x Misses / Instruction x Miss penalty

명령어 캐시 실패율 I-cache miss rate 2%

(Use instruction fetch operation, read from instruction cache memory)

데이터 캐시 실패율 D-cache miss rate 4%라고 가정해보자

(load & store operation)

메모리 지연이 없을 때 CPI가 2이고 매 실패마다 실패 손실(Miss penalty)이 100사이클이다.

실패가 발생하지 않는 완벽한 캐시를 사용한다면 시스템이 얼마나 빨라지는지 계산하라.

적재와 저장 명령어의 실행 빈도는 36%라 가정하라.

Miss cycles per instruction

I-cache: 0.02 x 100 = 2

D-cache: 0.36(load & store) x 0.04 x 100 = 1.44

Actual CPI = 2 + 2 + 1.44 = 5.44

Ideal CPU is 5.44 / 2 = 2.72 x faster

Hit time is also important for performance

Average memory access time (AMAT)

considering the performance of accessing the cache memory only

AMAT = Hit time + Miss rate x Miss penalty

Example:

CPU with 2ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 6%

AMAT = 2ns x 100% + 20 x 2ns x 0.06 = 2ns + 2.4ns = 4.4ns

- Associativity

Drawback of a direct-mapped cache:

If frequently used addresses are mapped to the same cache index, many cache misses will happen

-> solution: multiple slots per index

multiple locations more than just one so that we can place that two cache lines together in the cache line

- Multi-level Cache

small cache has high miss rate

large cache has slow hit time

-> solution: use multiple levels of caches

- Software Optimization

solution: strategically design data access pattern

<더 유연한 블록 배치를 통한 캐시 실패 줄이기!>

1. Direct mapped

메모리 블록 주소를 상위 계층의 한 주소에 사상

one option to place our given cache memory

2. Set associative (n-way)

한 블록이 들어갈 수 있는 자리의 개수가 고정되어 있음.

provide multiple locations within a set of cache blocks / lines

a given memory address can be placed in one of multiple locations in a set.

It cannot be placed in other set.

Block number determines the set index

(Block number) modulo (#Sets in cache)

3. Fully associative

opposite of direct mapped

블록이 어느 곳에나 있을 수 있기 때문에 주어진 블록을 찾으려면 캐시 내의 모든 엔트리를 검색해야 함.

given cache line can be placed in anywhere in the cache memory.

large flexibility with higher overload

Direct mapped / 2-way set associative / fully associative

Block address access sequence: 0, 8, 0, 6, 8

1. Direct mapped

0 % 4 = 0

8 % 4 = 0

6 % 4 = 2

Block address	Cache index	Hit/miss	Cache content after access
Block address	Cache index	Hit/miss	0	1	2	3
0	0	miss	Mem[0]
8		miss	Mem[8]
0		miss	Mem[0]
6		miss	Mem[0]		Mem[6]
8		miss	Mem[8]		Mem[6]

실패 시 집합 내에서 교체할 엔트리의 선정을 위해서는 교체 방식을 정해야 함.

집합 연관 캐시는 대개 집합 내에서 사용한 지 가장 오래된 블록을 교체시키는 LRU (least recently used) 교체 방식 사용

가장 먼 과거에 사용된 블록이 교체됨.

2. 2-way set associative

0 % 2 = 0

8 % 2 = 0

6 % 2 = 0

-> all goes to the same set

Block address	Cache index	Hit/miss	Cache content after access
Block address	Cache index	Hit/miss	Set 0		Set 1
0	0	miss	Mem[0]
8	0	miss	Mem[0]	Mem[8]
0	0	hit	Mem[0]	Mem[8]
6	0	miss	Mem[0]	Mem[6]
8	0	miss	Mem[8]	Mem[6]

블록 8이 블록 0보다 더 오랫동안 참조되지 않았기 때문에 블록8 교체

3. Fully associative

Block address	Hit/miss	Cache content after access
Block address	Hit/miss	0	1	2	3
0	miss	Mem[0]
8	miss	Mem[0]	Mem[8]
0	hit	Mem[0]	Mem[8]
6	miss	Mem[0]	Mem[8]	Mem[6]
8	hit	Mem[0]	Mem[8]	Mem[6]

Increased associativity generally decreases miss rate

but with diminishing returns

<교체할 블록의 선택>

-Set associative

multiple possible candidates to be replaced

choose a non-valid (empty) location, if there is any.

otherwise, need to choose one of the blocks in the set as the victim

완전 연관 캐시에서는 모든 블록이 교체의 후보가 된다.

가장 널리 사용되는 방식은

LRU (least recently used)

: choose the one unused for the longest time

LRU 교체는 집합의 어떤 원소가 집합 내 다른 원소에 비해 상대적으로 언제 사용되었는지를 추적함으로써 구현 가능함.

Pseudo LRU (for 4+ ways)

Use an approximation

e.g., A tree-like structure that has an LRU bit per 2 entries

다른 방식으로는

Sequential (Round Robin)

: select the replacement victim in the circular order

Random

이 있음.

128KB data capacity, 64 bytes per block, 32bit address

2-way set associative, LRU replacement

Store policy: write-back, write-allocate

64 bytes는 2의 6승이므로 offset은 6bit

128KB / 64bytes = 2048 blocks

그런데 2-way set associative이므로 1024 sets -> 2의 10승

index는 10bit

나머지 32 - 6(offset) - 10(index) = 16bit는 tag

tag (16-bit)

index (10-bit)

offset (6-bit)

교안 예시대로 혼자 문제 풀어봤슴다 ~.~

<Multi-level Caches>

Level-2 cache handles misses from the L-1 cache

2차 캐시는 보통 마이크로프로세서와 같은 칩에 있으며, 1차 캐시에서 실패가 발생하면 접근함.

2차 캐시가 원하는 데이터를 갖고 있으면 실패 손실은 2차 캐시의 접근시간이 되며, 이 값은 메인 메모리 접근시간보다 훨씬 짧다.

1차 캐시와 2차 캐시 둘 다 데이터를 갖고 있지 않은 경우에는 메인 메모리 접근이 필요하게 되어 더 큰 실패 손실이 발생한다.

Some high-end systems include L-3 cache

Example.

CPU base CPI = 1, clock rate = 4GHz

Miss rate/instruction = 2%

Main memory access time = 100ns

Miss penalty = 100ns / 0.25ns = 400 cycles

1 x 100% + 400 x 2% = 1 + 8 = 9

Now add L-2 cache

Access time = 5ns

Global miss rate to main memory = 0.5%

penalty (miss with L-2 hit) = 5ns / 0.25ns = 20 cycles (It is much slower than L1 hit but much better than going to the main memory)

Extra penalty = 400 cycles

CPI = 1 x 100% + 20 x 2% + 400 x 0.5% = 3.4

Performance ratio = 9 / 3.4 = 2.6x faster with L2

1. Compulsory misses (cold start misses)

First access to a block

If we access a certain address / cache block for the first time, then the cache line would not be in the code (start miss)

2. Capacity misses

Due to finite cache size

A replaced block is later accessed again

3. Conflict misses (collision misses)

In a non-fully associative cache

Due to competition for entries in a set