Chapter 12: Memory
In this chapter, we introduce the memory hierarchy and cache memory. Computer system performance depends on the memory system as well as the processor microarchitecture. Today's memory systems are slower than processors, and the growing gap between processor and memory speeds demands increasingly ingenious memory systems that approximate a memory as fast as the processor. We introduce the memory hierarchy to mitigate this gap, and we also cover SRAM and DRAM technologies as well as flash and disk storage.
Objectives
By the end of this chapter you should be able to:
- Demonstrate knowledge of the memory hierarchy
- Recall how (temporal and spatial) locality makes memory access fast
- Explain the terms hit, hit rate, miss, and miss rate in the memory hierarchy
- Differentiate memory technologies: SRAM, DRAM, flash, and disk storage
- Evaluate the efficiency of a direct-mapped cache
12.1 Memory Hierarchy
Computer performance depends on processor performance and memory system performance.
Fig. 12-1. Memory Interface
As shown in the above figure, the processor frequently needs to read data from the memory. If the memory system cannot supply data as fast as the processor consumes it, overall performance is degraded by the memory system.
Processor performance was similar to memory performance in 1980, as shown below, but this has not been true since the 1980s. Technological improvements have raised processor performance steadily, while memory performance has not improved at the same rate. We call this gap the processor-memory performance gap.
Fig. 12-2. Gap between Processor and Memory Speeds
Since the memory system is not as fast as the processor, we should make it appear as fast as the processor. We can use a memory hierarchy to achieve this.
The ideal memory has the following characteristics: 1) it is fast and cheap (inexpensive); and 2) it has a large capacity for storing data. In reality, no single memory meets all of these requirements. If we choose the fastest one, it will be expensive and have limited capacity. If we choose the one with the largest capacity, it will be slow.
There are three levels in the memory hierarchy, i.e. cache, main memory, and virtual memory, as shown below:
Fig. 12-3. Memory Hierarchy Pyramid
The cache memory (SRAM) is fast, but it can only keep a small amount of data. SRAM (static RAM) needs several transistors per bit of storage, which is why it is very expensive. On the other hand, the virtual memory (SSD: solid-state drive, and HDD: hard disk drive) is very slow, but it can store very large amounts of data. The main memory (DRAM) sits between the cache memory and the virtual memory: in terms of cost, it is cheaper than the cache memory, and in terms of speed, it is faster than the virtual memory. DRAM (dynamic RAM) requires the data to be refreshed periodically in order to retain it.
Locality
The memory hierarchy exploits locality to speed up memory performance. There are two types of locality: temporal locality and spatial locality.
Temporal locality is locality in time: if data was used recently, the processor is likely to use it again soon. By keeping recently accessed data in the higher levels of the memory hierarchy, the processor can access that data quickly.
Spatial locality is locality in space: if data was used recently, the processor is likely to use neighboring data soon. When the processor accesses data, the memory system brings nearby data into the higher levels of the memory hierarchy too, so the processor can access the neighboring data quickly. The sketch below illustrates the difference.
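To make spatial locality concrete, here is a small sketch (ours, not from the text) that sums the same 2-D array in two different orders. In a language with contiguous row-major arrays such as C, the first loop touches neighboring elements and fully uses each block brought into the cache, while the second jumps across rows and wastes most of each block; Python's nested lists only approximate this layout, so treat it as a conceptual illustration.

```python
# Conceptual sketch: the same 2-D traversal written two ways.
N = 1024
matrix = [[i * N + j for j in range(N)] for i in range(N)]

# Row-major order: good spatial locality. Consecutive accesses touch
# neighboring elements, so each block brought into the cache is reused.
total = 0
for i in range(N):
    for j in range(N):
        total += matrix[i][j]

# Column-major order: poor spatial locality. Consecutive accesses are
# N elements apart, so fetched blocks are mostly evicted before reuse.
total = 0
for j in range(N):
    for i in range(N):
        total += matrix[i][j]
```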
We take advantage of the principle of locality by implementing the memory of a computer as a memory hierarchy.
Fig. 12-4. Principle of Locality
As shown in the above figure, the faster memories are more expensive per bit than the slower memories and thus are smaller. The computer system stores everything on the hard disk drive (virtual memory). The memory system copies recently accessed items into the main memory and copies the most recently accessed items into the cache memory.
When the memory system makes a copy from a lower level to a higher level of the memory hierarchy, it copies a chunk of data, not a single word. When the requested data is found in a given level of the memory hierarchy, it is called a hit. The fraction or percentage of accesses that result in a hit is called the hit rate, expressed as the following equation:

Hit rate = (number of hits) / (total number of memory accesses)
When the data is not found in that level of the memory hierarchy, it is called a miss, and the time taken to retrieve the data from the next level is called the miss penalty. The fraction or percentage of accesses that result in a miss is called the miss rate, expressed as the following equation:

Miss rate = (number of misses) / (total number of memory accesses)
It follows that the sum of the hit rate and the miss rate is equal to 1.0 (100%).
The unit of data copied from the virtual memory to the main memory is called a page. The unit of data copied from the main memory to the cache memory is called a block.
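As a quick check of these definitions, here is a minimal Python sketch (ours, not from the text); the counts are the ones used in the first exercise below.

```python
# Sketch of the hit-rate and miss-rate formulas defined above.

def hit_rate(hits: int, accesses: int) -> float:
    return hits / accesses

def miss_rate(hits: int, accesses: int) -> float:
    # Hit rate and miss rate always sum to 1.0.
    return 1.0 - hit_rate(hits, accesses)

print(hit_rate(1250, 2000))   # 0.625
print(miss_rate(1250, 2000))  # 0.375
```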
SRAM and DRAM Technologies
SRAMs are integrated circuits that are memory arrays with (usually) a single access port. SRAM has a fixed access time to any datum. It does not need to be refreshed, so the access time is very close to the cycle time. Typically, it uses six to eight transistors per bit to prevent the information from being disturbed when read, which makes it expensive.
DRAMs store data as a charge in a capacitor, and a single transistor is used to access that charge. That is why DRAM is much denser and cheaper per bit than SRAM. However, the charge cannot be kept indefinitely and must be refreshed periodically, which is why it is called 'dynamic'.
Table 12-1. DRAM Generations
Year introduced | Chip size | $ per GB | Total access time | Average column access |
---|---|---|---|---|
1980 | 64 Kbit | $1,500,000 | 250 ns | 150 ns |
1983 | 256 Kbit | $500,000 | 185 ns | 100 ns |
1985 | 1 Mbit | $200,000 | 135 ns | 40 ns |
1989 | 4 Mbit | $50,000 | 110 ns | 40 ns |
1992 | 16 Mbit | $15,000 | 90 ns | 30 ns |
1996 | 64 Mbit | $10,000 | 60 ns | 12 ns |
1998 | 128 Mbit | $4,000 | 60 ns | 10 ns |
2000 | 256 Mbit | $1,000 | 55 ns | 7 ns |
2004 | 512 Mbit | $250 | 50 ns | 5 ns |
2007 | 1 Gbit | $50 | 45 ns | 1.25 ns |
2010 | 2 Gbit | $30 | 40 ns | 1 ns |
2012 | 4 Gbit | $1 | 35 ns | 0.8 ns |
[source] Computer Organization and Design, Fifth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design)
The above table shows how DRAM technology developed from one generation to the next, starting in 1980.
Disk Storage
Disk storage (also sometimes called drive storage) is a general category of storage mechanisms where data is recorded by various electronic, magnetic, optical, or mechanical changes to a surface layer of one or more rotating disks.
Fig. 12-5. Disk Storage
In the disk storage, a sector is a subdivision of a track on a magnetic disk. Each sector stores a fixed amount of user-accessible data, traditionally 512 bytes for hard disk drives (HDDs) and 2048 bytes for CD-ROMs and DVD-ROMs. The data area contains the sync bytes, user data and an error-correcting code (ECC) that is used to check and possibly correct errors that may have been introduced into the data.
Access to a sector involves:
- Queuing delay if other accesses are pending
- Seek time: move the heads
- Rotational latency
- Data transfer
- Controller overhead
For example, disks rotate at 5400 RPM to 15,000 RPM. What is the average rotational latency at 5400 RPM?
The average rotational latency is calculated as follows: 0.5 rotation / 5400 RPM = 0.5 rotation / (5400 rotations / 60 seconds) = 30/5400 seconds ≈ 5.6 ms
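The same calculation can be written as a small sketch (the function name is ours, not from the text):

```python
# Average rotational latency: half a rotation at the given spindle speed.

def avg_rotational_latency_ms(rpm: float) -> float:
    seconds_per_rotation = 60.0 / rpm
    return 0.5 * seconds_per_rotation * 1000.0   # convert seconds to ms

print(avg_rotational_latency_ms(5400))    # ~5.56 ms
print(avg_rotational_latency_ms(15000))   # 2.0 ms
```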
Exercises
- A program has 2,000 load and store instructions. 1,250 of these data values are found in the cache, and the rest are supplied by other levels of the memory hierarchy. What are the hit and miss rates for the cache?
Answer)
- Hit_rate = 1250/2000 = 0.625
- Miss_rate = 750/2000 = 0.375 = 1 – Hit_rate
- Disk Access
Given: 512B sector, 15,000 RPM, 1 ms average seek time, 100 MB/s transfer rate,
0.2 ms controller overhead, idle disk.
What is the average read time?
Answer)
- No queuing delay because of idle disk
- Seek time: 1 ms
- Rotational latency: 0.5/(15,000/60) seconds = 2 ms
- Data transfer time: 512 B / (100 MB/s) ≈ 0.005 ms
- Controller overhead: 0.2 ms
The sum of the above items is 3.205 ms.
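The whole calculation as a sketch (all parameters come from the problem statement; the variable names are ours):

```python
# Average read time = seek + rotational latency + transfer + controller.
SECTOR_BYTES = 512
RPM = 15_000
SEEK_MS = 1.0
TRANSFER_BYTES_PER_S = 100e6   # 100 MB/s
CONTROLLER_MS = 0.2

rotational_ms = 0.5 * (60.0 / RPM) * 1000.0                 # 2.0 ms
transfer_ms = SECTOR_BYTES / TRANSFER_BYTES_PER_S * 1000.0  # ~0.005 ms
read_ms = SEEK_MS + rotational_ms + transfer_ms + CONTROLLER_MS
print(read_ms)   # ~3.205 ms
```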
12.2 Cache Memory
Cache memory is located at the highest level of the memory hierarchy. It is fast, typically taking one clock cycle to access data. Ideally, it supplies most of the data the processor needs. It holds the most recently accessed data.
Caches first appeared in research computers in the early 1960s and in production computers later in that same decade. Every general-purpose computer built today, from servers to low-power embedded processors, includes caches.
When designing a cache, the following questions are considered:
- What data is held in the cache?
- How is data found?
- What data is replaced?
Although we focus on data cache loads, the same principles apply for fetches from an instruction cache.
Ideally, the cache anticipates the data the processor will need and holds it in advance, but it is impossible to predict future demand with perfect accuracy. Instead, the cache uses past access patterns to predict future demand, exploiting temporal and spatial locality:
- Temporal locality: copy newly accessed data into cache
- Spatial locality: copy neighboring data into cache too
Before diving into the detailed descriptions, let’s look at the cache terminology:
- Capacity (C): number of data bytes in the cache
- Block size (b): bytes of data brought into the cache at once
- Number of blocks (B = C/b): number of blocks in the cache
- Degree of associativity (N): number of blocks in a set
- Number of sets (S = B/N): each memory address maps to exactly one cache set
The cache is organized into S sets, and each memory address maps to exactly one set. Caches are categorized by the number of blocks in a set:
- Direct mapped: 1 block per set
- N-way set associative: N blocks per set
- Fully associative: all cache blocks in 1 set
As an example, we use the following cache parameters:
- C = 8 words (capacity)
- b = 1 word (block size)
- So, B = C/b = 8 (# of blocks)
This cache is ridiculously small, but it is enough to illustrate the cache organizations in the next subsections; a sketch of the derived parameters follows.
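The derived parameters can be computed mechanically. Here is a minimal sketch with the example values above (the variable names are ours; the 32-bit address and 4-byte word are taken from the MIPS examples that follow):

```python
import math

# Example parameters from the text: C = 8 words, b = 1 word, direct mapped.
C = 8   # capacity in words
b = 1   # block size in words
N = 1   # blocks per set (direct mapped)

B = C // b                     # number of blocks: 8
S = B // N                     # number of sets: 8
set_bits = int(math.log2(S))   # 3 bits to select the set
byte_offset_bits = 2           # 4-byte words give 2 byte-offset bits
tag_bits = 32 - set_bits - byte_offset_bits   # 27 tag bits
print(B, S, set_bits, tag_bits)   # 8 8 3 27
```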
Direct Mapped Cache
In a direct mapped cache, each word (block) in main memory is assigned a cache location based on its address. Since there is only one choice of cache block for a given memory block, it is called ‘direct mapped’, and the block number is calculated with the following modulo operation:
- (block address) modulo (# blocks in cache)
Let’s find out where the data at addresses 0x00000004, 0x00000024,…, 0xFFFFFFE4 map to. The following figure illustrates a direct mapped cache with a capacity of eight words (C = 8 words) and a block size of one word (b = 1 word). The number of blocks in cache is a power of 2 (B = C/b = 8).
The cache has eight sets, each of which contains a one-word block. The two rightmost bits of the address are always 00 because addresses are word aligned. The next three bits indicate the set (cache index) onto which the memory address maps. Thus, the data at addresses 0x00000004, 0x00000024, …, 0xFFFFFFE4 maps to set 1. Likewise, data at addresses 0x00000010 and 0xFFFFFFF0 maps to set 4, and so forth. Each main memory address maps to exactly one set in the cache.
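A sketch of this address decomposition (the field widths follow the figure below: 2 offset bits, 3 set bits, 27 tag bits; the function name is ours):

```python
# Split a 32-bit byte address into tag, set index, and byte offset.

def decompose(addr: int):
    byte_offset = addr & 0b11          # bits [1:0]
    set_index = (addr >> 2) & 0b111    # bits [4:2]
    tag = addr >> 5                    # bits [31:5]
    return tag, set_index, byte_offset

for addr in (0x00000004, 0x00000024, 0xFFFFFFE4, 0x00000010, 0xFFFFFFF0):
    tag, s, off = decompose(addr)
    print(f"0x{addr:08X} -> tag=0x{tag:07X}, set={s}, offset={off}")
# 0x04, 0x24, and 0xFFFFFFE4 land in set 1; 0x10 and 0xFFFFFFF0 in set 4.
```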
Fig. 12-6. Mapping of Main Memory to a Direct Mapped Cache
The following figure shows the direct mapped cache hardware. The cache is constructed with an eight-entry SRAM. Each entry, or set, contains one line consisting of 1 valid bit, 27 bits of tag, and 32 bits of data, as shown on the right side of Fig. 12-7. The cache is accessed using the 32-bit memory address, which consists of the tag field (27 bits), the set bits (3 bits), and the byte offset (2 bits), as indicated in the top left of Fig. 12-7.
Fig. 12-7. Direct Mapped Cache Hardware
The set bits specify the entry, or set, in the cache; the system uses the set value as the cache index. The system then compares two values: the tag field of the memory address and the tag stored in the cache. If the two values are identical and the valid bit is set to ‘1’, the access is a hit, and the data in the cache is returned to the processor. Otherwise, the access is a miss, and the memory system must fetch the data from main memory.
The system knows whether a requested block is in the cache through the tag. The tag contains the address information required to identify whether a block (a word) in the cache corresponds to the requested block (word), and it needs to contain only the upper portion of the address. Then what if there is no data in a location? The system indicates whether an entry contains a valid address through a valid bit: if the valid bit is 1, the entry holds a valid address; otherwise, it does not. All valid bits are initially set to 0.
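Putting the set bits, tag comparison, and valid bit together, here is a minimal sketch of the check that the hardware in Fig. 12-7 performs (the list-of-tuples cache representation is our simplification; the real hardware does this combinationally):

```python
# Each of the 8 sets holds (valid, tag, data); all invalid at reset.
cache = [(0, 0, None)] * 8

def lookup(cache, addr):
    set_index = (addr >> 2) & 0b111    # 3 set bits
    tag = addr >> 5                    # 27 tag bits
    valid, stored_tag, data = cache[set_index]
    if valid and stored_tag == tag:    # valid entry with matching tag: hit
        return ("hit", data)
    return ("miss", None)              # fetch from main memory on a miss

print(lookup(cache, 0x4))   # ('miss', None): every valid bit is still 0
```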
Let’s look at how the cache memory is used when executing the following MIPS assembly code:
```asm
# MIPS assembly code
        addi $t0, $0, 5
loop:   beq  $t0, $0, done
        lw   $t1, 0x4($0)
        lw   $t2, 0xC($0)
        lw   $t3, 0x8($0)
        addi $t0, $t0, -1
        j    loop
done:
```
The program contains a loop that repeats for five iterations. Each iteration involves three memory accesses (loads), resulting in 15 total memory accesses. We assume that the cache is initially empty. The first two instructions, addi and beq, require no data memory access. The third instruction, lw $t1, 0x4($0), misses because the cache holds no data yet. Its memory address consists of the tag (00…00), the set bits (001), and the byte offset (00). The system copies the data from main memory, filling the set 1 entry with the data, the valid bit (1), and the tag value (00…00), as shown below:
Table 12-2. Temporal Locality with a Direct Mapped Cache with lw $t1, 0x4($0)
V | Tag | Data | Set
---|---|---|---
0 | | | Set 0 (000)
1 | 00…00 | Mem[0x00…04] | Set 1 (001)
0 | | | Set 2 (010)
0 | | | Set 3 (011)
0 | | | Set 4 (100)
0 | | | Set 5 (101)
0 | | | Set 6 (110)
0 | | | Set 7 (111)
The fourth instruction, lw $t2, 0xC($0), also misses because its data is not yet in the cache. Its memory address consists of the tag (00…00), the set bits (011), and the byte offset (00). The system copies the data from main memory in the same way, filling the set 3 entry with the data, the valid bit (1), and the tag value (00…00), as shown below:
Table 12-3. Temporal Locality with a Direct Mapped Cache with lw $t2, 0xC($0)
V | Tag | Data | Set
---|---|---|---
0 | | | Set 0 (000)
1 | 00…00 | Mem[0x00…04] | Set 1 (001)
0 | | | Set 2 (010)
1 | 00…00 | Mem[0x00…0C] | Set 3 (011)
0 | | | Set 4 (100)
0 | | | Set 5 (101)
0 | | | Set 6 (110)
0 | | | Set 7 (111)
The fifth instruction, lw $t3, 0x8($0), misses again. Its memory address consists of the tag (00…00), the set bits (010), and the byte offset (00). The system copies the data from main memory as before, filling the set 2 entry with the data, the valid bit (1), and the tag value (00…00), as shown below:
Table 12-4. Temporal Locality with a Direct Mapped Cache with lw $t3, 0x8($0)
V | Tag | Data | Set
---|---|---|---
0 | | | Set 0 (000)
1 | 00…00 | Mem[0x00…04] | Set 1 (001)
1 | 00…00 | Mem[0x00…08] | Set 2 (010)
1 | 00…00 | Mem[0x00…0C] | Set 3 (011)
0 | | | Set 4 (100)
0 | | | Set 5 (101)
0 | | | Set 6 (110)
0 | | | Set 7 (111)
No data memory access is required for the last two instructions, addi and j. The first time the loop executes, the cache is empty, and the data must be fetched from main memory locations 0x4, 0xC, and 0x8 into cache sets 1, 3, and 2, respectively. The processor then jumps back to loop. On the next four executions of the loop, the data is found in the cache, so the miss rate is 3/15 = 20%.
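We can confirm this with a minimal direct-mapped cache simulator (a sketch written for illustration, not code from the text):

```python
# Replay a sequence of load addresses against a direct-mapped cache.

def simulate_direct_mapped(addresses, num_sets=8):
    set_bits = num_sets.bit_length() - 1     # log2(num_sets)
    cache = [None] * num_sets                # stored tag per set; None = invalid
    misses = 0
    for addr in addresses:
        set_index = (addr >> 2) & (num_sets - 1)
        tag = addr >> (2 + set_bits)
        if cache[set_index] != tag:          # invalid entry or tag mismatch
            misses += 1
            cache[set_index] = tag           # fetch the block from main memory
    return misses / len(addresses)

loop_loads = [0x4, 0xC, 0x8] * 5             # five iterations, three loads each
print(simulate_direct_mapped(loop_loads))    # 0.2  (miss rate 3/15 = 20%)
```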
Now let’s assume that a loop repeatedly accesses the memory addresses 0x4 and 0x24, which decompose as shown below:
- 0x4: tag (00…00), set bits (001), byte offset (00)
- 0x24: tag (00…01), set bits (001), byte offset (00)
Both memory addresses map to set 1. During the first execution of the loop, data at address 0x4 is loaded into set 1 of the cache. Then data at address 0x24 is loaded into set 1, evicting the data from address 0x4. Upon the second execution of the loop, the pattern repeats: the cache must refetch the data at address 0x4, evicting the data from address 0x24. The two addresses conflict, and the miss rate is 100% in this case.
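Reusing simulate_direct_mapped from the sketch above, the conflicting pattern indeed misses on every access:

```python
# 0x4 and 0x24 share set 1 but differ in tag, so they evict each other.
conflicting = [0x4, 0x24] * 5
print(simulate_direct_mapped(conflicting))   # 1.0  (miss rate 100%)
```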
N-Way Set Associative Cache
An N-way set associative cache reduces conflicts by providing N blocks in each set where data mapping to that set might be found. The following figure shows the hardware for a C = 8-word, N = 2-way set associative cache. The cache now has only S = 4 sets rather than 8, so only log₂4 = 2 set bits are used to select the set.
Fig. 12-8. 2-way Set Associative Cache
The number of tag bits increases from 27 to 28. A 2-way set associative cache has two possible locations for each tag and its data: each way consists of a data block plus valid and tag bits. When a memory address is presented, as indicated in the top left of the above figure, the cache reads the blocks from both ways in the selected set and checks both tags and valid bits for a hit. If a hit occurs in one of the ways, a multiplexer selects the data from that way.
Let’s look at how the 2-way set associative cache is used when executing the following MIPS assembly code:
```asm
# MIPS assembly code
        addi $t0, $0, 5
loop:   beq  $t0, $0, done
        lw   $t1, 0x4($0)
        lw   $t2, 0x24($0)
        addi $t0, $t0, -1
        j    loop
done:
```
The program contains a loop that repeats for five iterations. Each iteration involves two memory accesses (loads), resulting in 10 total memory accesses. We assume that the cache is initially empty.
The first load instruction, lw $t1, 0x4($0), misses because the cache is initially empty. The memory address consists of the tag (00…00), the set bits (01), and the byte offset (00). The system copies the data from memory into the cache, filling the way 0 entry of set 1 (01) with the data, the valid bit (1), and the tag value (00…00), as shown below:
Table 12-5. 2-way Set Associative Cache with lw $t1, 0x4($0)
Way 1 V | Way 1 Tag | Way 1 Data | Way 0 V | Way 0 Tag | Way 0 Data | Set
---|---|---|---|---|---|---
0 | | | 0 | | | Set 3 (11)
0 | | | 0 | | | Set 2 (10)
0 | | | 1 | 00…00 | Mem[0x00…04] | Set 1 (01)
0 | | | 0 | | | Set 0 (00)
The second load instruction, lw $t2, 0x24($0), also misses because its data is not yet in the cache. The memory address consists of the tag (00…10), the set bits (01), and the byte offset (00). The system copies the data from memory into the cache in the same way, but this time it fills the way 1 entry of set 1 with the data, the valid bit (1), and the tag value (00…10), as shown below:
Table 12-6. 2-way Set Associative Cache with lw $t2, 0x24($0)
Way 1 V | Way 1 Tag | Way 1 Data | Way 0 V | Way 0 Tag | Way 0 Data | Set
---|---|---|---|---|---|---
0 | | | 0 | | | Set 3 (11)
0 | | | 0 | | | Set 2 (10)
1 | 00…10 | Mem[0x00…24] | 1 | 00…00 | Mem[0x00…04] | Set 1 (01)
0 | | | 0 | | | Set 0 (00)
Both memory accesses, to addresses 0x4 and 0x24, map to set 1. However, the cache has two ways, so it can accommodate data from both addresses. During the first loop iteration, the empty cache misses on both addresses and loads both words of data into the two ways of set 1. On the next four iterations, the cache hits. Hence, the miss rate is 2/10 = 20%.
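The same result can be reproduced with a small 2-way set associative sketch (ours, for illustration). The least-recently-used replacement policy is an assumption on our part, although with only two ways and two addresses it does not change this particular result:

```python
# Replay loads against a 2-way set associative cache with LRU replacement.

def simulate_two_way(addresses, num_sets=4):
    set_bits = num_sets.bit_length() - 1
    sets = [[] for _ in range(num_sets)]   # per set: up to 2 tags, LRU first
    misses = 0
    for addr in addresses:
        index = (addr >> 2) & (num_sets - 1)
        tag = addr >> (2 + set_bits)
        ways = sets[index]
        if tag in ways:
            ways.remove(tag)               # hit: move tag to most-recent spot
        else:
            misses += 1
            if len(ways) == 2:             # set full: evict least recently used
                ways.pop(0)
        ways.append(tag)
    return misses / len(addresses)

print(simulate_two_way([0x4, 0x24] * 5))   # 0.2  (miss rate 2/10 = 20%)
```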
Fully Associative Cache
A fully associative cache allows a given block to go in any cache entry. It is expensive to build because all entries must be searched at once, but it eliminates conflict misses.
Fig. 12-9. Fully Associative Cache
Exercises
- The modulo operation finds the remainder after division of one number by another. For example, 5 modulo 2, where 5 is the dividend and 2 is the divisor, evaluates to 1 because 5 divided by 2 leaves a quotient of 2 and a remainder of 1. What are the results of the following modulo operations?
- 9 modulo 3 =
- 00001₂ modulo 2³ =
- 10001₂ modulo 2³ =
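A quick way to sanity-check answers like these (a sketch, ours): modulo by a power of two keeps only the low bits of the binary number, which is exactly why caches can use bit masks instead of division.

```python
print(9 % 3)              # remainder of 9 divided by 3
print(0b00001 % 2**3)     # 00001 in binary, modulo 2^3
print(0b10001 % 2**3)     # 10001 in binary, modulo 2^3
print(0b10001 & 0b111)    # same as the line above, via masking low bits
```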