ECE 442 Project Report

Computer System and Memory Design

 

 

 

 

 

 

 

Sashisu Bajracharya

Sukhonthip Rueangvivatanakij

Team 4

Spring 2002

Dr. D. Tabak

 

 

 

 

 

 

 

 

 

 

Abstract

 

The purpose of this project is to design a computing system with an instruction set and memory and cache management. The system is superscalar, issuing 8 instructions per cycle. The instruction set comprises 55 instructions, covering general programming needs, and follows a register-to-register (Load/Store) architecture similar to RISC systems. The system supports register direct, immediate, and absolute addressing modes. A bimodal local branch prediction mechanism supports speculative execution for instruction-level parallelism. A Reorder Buffer reorders the commit of instructions executed in parallel, and rename registers remove artificial dependences (WAW and WAR data hazards) to expose full parallelism. Multiple functional units are provided, sized for a typical mix of 8 instructions at a time, and reservation stations in front of each execution unit hold instructions until all their input operands are ready. The cache and main memory are designed to the given specifications: the cache is set-associative with 4 lines (blocks) per set. Random replacement, write-through, and write-allocate policies are implemented in the VHDL model of the cache, main memory, and CPU. Simulations show data transfer between the CPU, cache, and main memory for the HIT-read, HIT-write, MISS-read, and MISS-write cases, as well as 8-issue instruction fetching and dispatching to the corresponding units.

 

 

 

Specifications

 

The specifications that are required for the project are:

System and Instruction Set Design

Memory and Cache Management Design

 

We specified additional design parameters as suitable for our project:

 

 

 

 

 

 

 

 

 

 

Description of Design

 

Instruction Set Design


We chose 3 addressing modes:


Addressing Modes

1. Register Direct

2. Immediate

3. Absolute

Since instructions other than Load/Store do not access memory and operate register to register, no addressing-mode bits are needed inside the instruction. Instead, the opcode itself indicates whether the instruction involves an absolute address, registers only, or an immediate value.

The instruction formats are shown as follows for different types of instructions:

Instruction Format

Word Size : 64 bits

Register : 32 registers = 2^5, so 5 bits are needed to refer to a register

No. of instr : 55 < 2^6 = 64, so a 6-bit opcode suffices

3-operand Register to Register

2-Operand Register to Register

1-Operand

LOAD/STORE

 

IMMEDIATE

JUMP

BRANCH ON CONDITION
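Given the 6-bit opcode and 5-bit register fields above, a 3-operand register-to-register instruction can be packed into the 64-bit word as in the sketch below. The exact field positions (opcode in the top 6 bits, then three 5-bit register fields) are our assumption for illustration only; the report's format diagrams fix the actual layout.

```python
# Hypothetical packing of a 3-operand register instruction into the 64-bit word.
# Field positions are an illustrative assumption, not the report's definitive layout.

def encode_rrr(opcode: int, rx: int, ry: int, rz: int) -> int:
    assert 0 <= opcode < 2**6                      # 6-bit opcode
    assert all(0 <= r < 2**5 for r in (rx, ry, rz))  # 5-bit register numbers
    return (opcode << 58) | (rx << 53) | (ry << 48) | (rz << 43)

word = encode_rrr(0b000100, 1, 5, 6)   # ADD R1, R5, R6
```

The remaining low-order bits of the 64-bit word are unused in this format and would carry the immediate or absolute address in the other formats.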

 

 

Instruction Set

Our instructions are mainly divided into 4 general groups: Arithmetic type, Logic, Control and Data Transfer type. The list of instructions is as follows:

 

    Opcode   Instruction             Operation Performed                Comment

Arithmetic
1)  000000   ADDF  Rx, Ry, Rz ;      Rx = Ry + Rz                       Floating-point addition
2)  000001   SUBF  Rx, Ry, Rz ;      Rx = Ry - Rz                       Floating-point subtraction
3)  000010   MULF  Rx, Ry, Rz ;      Rx = Ry * Rz                       Floating-point multiplication
4)  000011   DIVF  Rx, Ry, Rz ;      Rx = Ry / Rz                       Floating-point division
5)  000100   ADD   Rx, Ry, Rz ;      Rx = Ry + Rz                       Integer addition
6)  000101   SUB   Rx, Ry, Rz ;      Rx = Ry - Rz                       Integer subtraction
7)  000110   MUL   Rx, Ry, Rz ;      Rx = Ry(32) * Rz(32)               Integer multiply of lower 32 bits of sources
8)  000111   DIV   Rx, Ry, Rz ;      Rx = Ry / Rz                       Integer division; Rx gets quotient
9)  001000   MOD   Rx, Ry, Rz ;      Rx = Ry % Rz                       Rx gets remainder of division
10) 001001   DEC   Rx ;              Rx = Rx - 1                        Decrement by 1
11) 001010   INC   Rx ;              Rx = Rx + 1                        Increment by 1
12) 001011   CLR   Rx ;              Rx = 0                             Clear the register to 0
13) 001100   NEG   Rx ;              Rx = 0 - Rx                        Change to 2's complement
14) 001101   SEW   Rx ;                                                 Sign-extend 16 bits to 64 bits
15) 001110   SEL   Rx ;                                                 Sign-extend 32 bits to 64 bits
16) 001111   NOP ;                                                      No operation

Logic
17) 010000   AND   Rx, Ry, Rz ;      Rx = Ry AND Rz                     Logical AND of each bit
18) 010001   OR    Rx, Ry, Rz ;      Rx = Ry OR Rz                      Logical OR of each bit
19) 010010   NOT   Rx ;              Rx = 1's complement of Rx          Bits flipped
20) 010011   XOR   Rx, Ry, Rz ;      Rx = Ry XOR Rz                     Logical XOR of each bit
21) 010100   ROR   Rx ;                                                 Rotate right with carry
22) 010101   ROL   Rx ;                                                 Rotate left with carry
23) 010110   SHR   Rx ;                                                 Shift right Rx with carry
24) 010111   SHL   Rx ;                                                 Shift left Rx with carry
25) 011000   SAR   Rx ;                                                 Shift arithmetic right with high-order bit copied
26) 011001   TST   Rx ;                                                 Sets or clears 'Z' and 'N' flags
27) 011010   CLC ;                                                      Clear carry flag
28) 011011   CLV ;                                                      Clear overflow flag
29) 011100   SEC ;                                                      Set carry flag
30) 011101   SEV ;                                                      Set overflow flag

Control Instruction
31) 011110   JMP   X ;               PC = X                             Jump to absolute address X
32) 011111   BEQ   Rx, Ry, X ;       if Rx - Ry = 0, PC = X             If Rx = Ry, jump to address X
33) 100000   BNE   Rx, Ry, X ;       if Rx - Ry != 0, PC = X            If Rx != Ry, jump to address X
34) 100001   BGT   Rx, Ry, X ;       if Rx - Ry > 0, PC = X             If Rx > Ry, jump to address X
35) 100010   BLT   Rx, Ry, X ;       if Rx - Ry < 0, PC = X             If Rx < Ry, jump to address X
36) 100011   BGE   Rx, Ry, X ;       if Rx - Ry >= 0, PC = X            If Rx >= Ry, jump to address X
37) 100100   BLE   Rx, Ry, X ;       if Rx - Ry <= 0, PC = X            If Rx <= Ry, jump to address X
38) 100101   JSR   X ;               SP = SP - 8, (SP) = (PC), PC = X   Jump to subroutine at address X
39) 100110   RTS ;                   PC = ((SP)), SP = (SP) + 8         Return from subroutine
40) 100111   RTI ;                                                      Return from interrupt; reloads all previously saved registers, SP and PC

Data Transfer
41) 101000   PUSH  Rx ;              SP = SP - 8, (SP) = Rx             Push Rx contents to stack
42) 101001   POP   Rx ;              Rx = (SP), SP = SP + 8             Pop from stack to Rx
43) 101010   LOAD  Rx, M1 ;          Rx = M[M1]8                        Load 8 bytes from memory starting at M1 to Rx
44) 101011   STORE M1, Rx ;          M[M1]8 = Rx                        Store 8 bytes to memory starting at M1 from Rx
45) 101100   MOVE  Rx, Ry ;          Rx = Ry                            Copy the contents of Ry to Rx
46) 101101   LOADI Rx, #Im ;         Rx = #Im                           Load immediate number to Rx
47) 101110   EXCHG Rx, Ry ;          (Rx) <=> (Ry)                      Exchange values of Rx and Ry
48) 101111   TSR   Rx ;              Rx = SP                            Transfer SP address to Rx
49) 110000   TRS   Rx ;              SP = Rx                            Transfer Rx value to SP
50) 110001   LOADB Rx, M1 ;          Rx = M[M1]1                        Load 1 byte from M at M1 to Rx

 

 

 

8-issue Pipelining Operation


System Schematic

 

 

 

 

 

 

The Fetch Unit, Instruction Buffer, Branch Prediction Unit and Instruction Cache are shown in detail as follows:

 

Fetch Unit

The fetch unit forms the first stage of the pipeline and starts the multiple-issue instruction-level parallelism. In our case it fetches 8 instructions at a time from the Instruction Cache, starting from the address on the "PC_out" signal. To fetch instructions, the Fetch Unit puts the first address on "PC_out", supplies the number of instructions it wants to fetch, and sets 'load' high to notify the Instruction Cache to transfer the set of instructions. The Instruction Cache returns the instructions on the "instr" bus, which can hold up to 8 instructions. The fetch unit then checks whether any of the instructions is a control (branch) instruction. If it finds a branch instruction, it sends the lower 6 bits of that instruction's address to the Branch Prediction Unit (BPU) and sets the "predict" signal high to notify the BPU. The BPU returns its take/not-take prediction through the "take" signal. If the prediction is not-taken and every branch instruction in the set of 8 has been predicted, the fetch unit issues the set of 8 instructions to the Instruction Buffer. If the prediction is taken, the slots remaining after that branch instruction must be refilled starting from the branch target address, so the remaining instructions are requested from the Instruction Cache, which supplies the needed number of instructions to the fetch unit. The fetch unit then transfers the 8 instructions to the Instruction Buffer.
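The fetch loop described above can be sketched as follows. This is a minimal behavioral model, not the VHDL: the instruction representation, the 8-byte instruction stride, and the predictor callback are all illustrative assumptions.

```python
# Sketch of the fetch unit's handling of predicted-taken branches: fill an
# 8-wide packet, and when the predictor says "take" for a branch, continue
# filling the packet from the branch target address.

def fetch_packet(icache, predict, pc, width=8):
    packet, addr = [], pc
    while len(packet) < width:
        instr = icache[addr]             # instruction cache lookup
        packet.append(instr)
        if instr["is_branch"] and predict(addr):
            addr = instr["target"]       # redirect: keep filling from the target
        else:
            addr += 8                    # sequential: 64-bit (8-byte) instructions
    return packet
```

On a misprediction discovered later, the same routine would simply be rerun from the corrected address after the flush.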

After the set of instructions has been executed, each branch-type instruction returns its taken/not-taken outcome to the Fetch Unit and to the Reorder Buffer. The fetch unit then sends the "update" signal to the BPU to modify the corresponding branch prediction entry for the next prediction: it sets "update" high, puts the last 6 bits of the address on "branch_addr", and supplies the taken/not-taken outcome through the "taken" signal.

If a branch was mispredicted, a new set of instructions is fetched by the Fetch Unit through the same process as before, and the mispredicted instructions are flushed from the Reorder Buffer and from any Reservation Stations.

Branch Prediction Unit

The branch prediction unit contains 64 two-bit entries that indicate whether a branch should be taken. The scheme we use is bimodal branch prediction. The 64 entries are indexed by the last 6 bits of the branch instruction's address, so each entry corresponds to a branch instruction at a particular location in the program. Each two-bit entry can hold 4 possible values: 00, 01, 10, 11.

The unit acts in two situations: when the "predict" input is high and when the "update" input is high. When "predict" is high, the 6-bit input selects the branch's entry in the table and the prediction is made. The most significant bit determines the prediction: 00 and 01 predict not-taken, while 10 and 11 predict taken. The prediction is returned to the Fetch Unit through the "take" output. When "update" is high, the BPU locates the entry selected by the 6-bit "branch_addr" input and, according to the "taken" input bit, increments the entry (saturating at 11) if the branch was actually taken, or decrements it (saturating at 00) if it was not. Repeatedly taken branches are therefore predicted taken, and repeatedly not-taken branches predicted not-taken.
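The 2-bit saturating-counter behavior described above can be modeled in a few lines. This is a minimal sketch of the same scheme, with counters initialized to 00 (strong not-taken) as an assumption; the report does not state the initial values.

```python
# Minimal model of the 64-entry bimodal predictor: one 2-bit saturating
# counter per entry, indexed by the low 6 bits of the branch address.
# Counters 10 and 11 predict taken; 00 and 01 predict not-taken.

class BimodalPredictor:
    def __init__(self, entries=64):
        self.table = [0] * entries        # assumed initial state: 00
        self.mask = entries - 1           # low 6 bits select the entry

    def predict(self, branch_addr: int) -> bool:
        return self.table[branch_addr & self.mask] >= 2   # MSB of counter

    def update(self, branch_addr: int, taken: bool) -> None:
        i = branch_addr & self.mask
        if taken:
            self.table[i] = min(self.table[i] + 1, 3)     # saturate at 11
        else:
            self.table[i] = max(self.table[i] - 1, 0)     # saturate at 00
```

The hysteresis is visible directly: a single not-taken outcome after a long taken run only moves the counter from 11 to 10, so the next prediction is still "taken".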

 

Instruction Buffer

The instruction buffer has 16 entries that hold instructions and send the first 8 to the Decode, Rename & Dispatch unit. If there is a stall, it temporarily holds the instructions and stores newly fetched instructions in the last positions.

 

Decode, Rename & Dispatch Unit

 

After the set of 8 instructions is issued from the Instruction Buffer, each instruction's opcode is decoded in this unit. Depending on whether the instruction is a control instruction, integer operation, floating-point operation, or data transfer operation, the dispatch unit dispatches it to the Reservation Stations of the corresponding execution units. Since we have 4 Branch Units (BU), 8 Integer Units (IU), 4 Floating Point Units (FPU) and 6 Load/Store Units (LSU), if the set of 8 contains more instructions of a particular type than there are execution units of that type, the extra instructions are queued in the Reservation Stations.

The decode, rename and dispatch unit looks up operands in the Register File and has it send one or two operand values, depending on the instruction type, to the corresponding fields of the Reservation Stations. While examining the source and destination operands of the 8 instructions, it checks for data dependencies among these instructions and previously pipelined instructions: RAW (Read After Write), WAW (Write After Write) and WAR (Write After Read) hazards. WAW and WAR hazards are removed by assigning a new rename register to the offending operand; the unit records the architected-to-rename-register mapping in the mapping table and sends the rename register's address to the Reservation Stations through the Register File.

Moreover, each instruction gets an entry in the Reorder Buffer in issue order. The entry remains there until the instruction has executed, its dependencies are resolved, and it commits. When multiple instructions execute in parallel, possibly out of order, each result is written into that instruction's entry in the Reorder Buffer.

Reorder Buffer

The reorder buffer has one entry per active instruction, that is, per instruction that has been dispatched for execution but not yet committed. It is implemented as a FIFO queue of the active instructions in issue order. As instructions are dispatched they are given an entry in the Reorder Buffer. As instructions complete execution, possibly out of order, each result value is placed in that instruction's entry, wherever it sits in the buffer. When an instruction reaches the head of the FIFO and has completed execution with its dependencies resolved, its entry is removed from the Reorder Buffer and the result value is placed in the corresponding architected register. The typical Reorder Buffer structure is as follows:

Entry   Busy   Instruction        State          Destination   Value
1       No     Add R1, R5, R6     Commit         R1            11
2       Yes    Sub R7, R5, R1     Write result   R7
3
4

 

 

Reservation Stations

Reservation Stations (RS) provide temporary storage in front of the execution units for instructions that have been dispatched and await execution, possibly without all operands ready at dispatch time. The Reservation Station holds the decoded instruction until it is free of all dependencies and the execution unit is free to execute it. When an instruction is dispatched to the Reservation Station, any already-available operand values are read from the register file and placed in the reservation station. The RS then compares the names of its missing operands against the result operands of completing instructions that return to the RS over the feedback bus; on a match, the result value is placed in the reservation station. When an instruction in the RS has all of its operands ready, it is issued to the execution unit if that unit is free. There are separate Reservation Stations for each individual functional unit to provide an efficient flow of parallelism. The structure of a Reservation Station entry is as follows:

operation   source1   data1   valid1   source2   data2   valid2   destination

The decoded opcode is stored in the operation field. The two source operands are designated in source1 and source2, the data fields contain the operand values, and each valid bit indicates whether the data value is present in the reservation station. As an instruction completes and produces result data, its destination field is compared with the source fields of the instructions in the reservation station; on a match, the data is written to the data field and the valid bit is set. When all source operands are ready, the instruction is issued to the execution unit for execution.
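The matching step just described can be sketched with the field names from the entry layout above. This is an illustrative model, assuming entries are simple records and results are broadcast one at a time.

```python
# Sketch of reservation-station operand capture: a completing instruction
# broadcasts (destination, value); every waiting entry whose not-yet-valid
# source names that destination captures the value and sets its valid bit.

def broadcast(stations, destination, value):
    for rs in stations:
        for n in (1, 2):
            if not rs[f"valid{n}"] and rs[f"source{n}"] == destination:
                rs[f"data{n}"] = value
                rs[f"valid{n}"] = True

def ready(rs):
    # Issue to the execution unit only once both operands are valid.
    return rs["valid1"] and rs["valid2"]
```

With this structure, forwarding is implicit: a waiting instruction becomes ready the moment the producing instruction completes, without waiting for the register file write.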

 

Instruction Execution and Committing

We have 4 general types of execution units: Branch Unit, Integer Unit, Floating Point Unit, and Load/Store Unit, each with its own Reservation Stations to support instruction-level parallelism. Our design provides 4 Branch Units, 8 Integer Units, 4 Floating Point Units, and 6 Load/Store Units, chosen for an 8-issue superscalar system based on the relative average frequency of each instruction type and on cost considerations. These units execute all 55 instructions of our Instruction Set: integer arithmetic, logic instructions and some register-to-register data transfers are handled by the Integer Unit; floating-point arithmetic by the Floating Point Unit; control instructions, including jumps and branches on condition, by the Branch Unit; and Load/Store instructions and other instructions involving data transfer to memory by the Load/Store Unit.

When execution of an instruction finishes, the result operand value is sent to the corresponding entry of the Reorder Buffer, fed back to the Reservation Stations, and, for a store-type instruction, also sent to the Data Cache. A load instruction initiates a data transfer from the Data Cache and sends the result back to the Reorder Buffer. When the result reaches the Reservation Stations, it is compared with all pending operands that are waiting for it; on a match, the result value is written to the corresponding source operand field. This forwarding mechanism makes data available to waiting instructions as soon as possible, reducing the stalls that would otherwise arise from late availability of the data.

When the result value of an executed instruction is sent to the Reorder Buffer, the instruction's entry receives the destination value. If the destination is a rename register, the corresponding rename register in the Register File is updated. When the instruction also reaches the first 8 positions of the FIFO Reorder Buffer, the result value is written to the architected register corresponding to that rename register, as given by the mapping table. This is the commit stage of the instruction; it happens in order, preserving the sequential order in which the program was written.
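The in-order commit step can be sketched as follows. This is an illustrative model, assuming simple record entries, a rename-to-architected mapping table, and the 8-wide commit limit described above.

```python
# Sketch of in-order commit: entries leave only from the head of the FIFO
# Reorder Buffer, and only when finished; the value goes to the architected
# register named by the rename mapping table.

from collections import deque

def commit(rob: deque, regfile: dict, rename_map: dict, width=8):
    done = 0
    while rob and done < width and rob[0]["finished"]:
        entry = rob.popleft()
        arch = rename_map.get(entry["dest"], entry["dest"])
        regfile[arch] = entry["value"]   # architected state updated in order
        done += 1
    return done                          # number of instructions committed
```

Note that an unfinished instruction at the head blocks everything behind it, even if later entries are already finished; this is exactly what preserves the program's sequential semantics.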

 

 

 

 

Memory Design

 

 

Design of Memory and Cache Management

Line (Block) size : 64 bytes (2^6 B)

Main Memory size : 512 MB (2^29 B)

Number of lines in Memory = 2^29 / 2^6 = 2^23 lines/M

Set = 4 lines/set (2^2 lines)

Page size = 16 KB (2^14 B)

Number of pages = 2^29 / 2^14 = 2^15 pages/M

Cache size = 512 KB = 2^19 bytes

Number of lines in Cache = 2^19 / 2^6 = 2^13 lines/cache

Number of sets in Cache = 2^13 / 2^2 = 2^11 sets/cache

Number of possible lines per set = (2^23 lines/M) / (2^11 sets/cache) = 2^12 lines from M can map to each set
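The sizing arithmetic above can be checked directly from the given parameters:

```python
# Cache/memory geometry from the specification; each derived quantity
# reproduces a line of the arithmetic above.

LINE = 2**6      # 64-byte line (block)
MEMORY = 2**29   # 512 MB main memory
CACHE = 2**19    # 512 KB cache
PAGE = 2**14     # 16 KB page
WAYS = 2**2      # 4 lines per set

lines_in_memory = MEMORY // LINE                              # 2^23
pages_in_memory = MEMORY // PAGE                              # 2^15
lines_in_cache = CACHE // LINE                                # 2^13
sets_in_cache = lines_in_cache // WAYS                        # 2^11
lines_mapping_to_one_set = lines_in_memory // sets_in_cache   # 2^12
```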

 

 

 

Detailed Model of Memory and Cache Management

 

  The memory and cache management is designed in detail as above. This design has been simulated in VHDL to show the transfer of data between CPU, Cache and Main Memory in the presence of cache hit on read and write and cache miss on read and write instruction initiated by CPU.

The cache has input, output, and bidirectional signals: clock, address, data_CPU, address_M, data_M, load, store, load_to_cache, done_to_CPU, store_M and ready_M. When the CPU wants to load a value from memory, it puts the address on the 29-bit address signals and sets 'load' high to notify the cache to load the data. When the CPU wants to store data to memory, it puts the address on the address signals and sets 'store' high to notify the cache. On receiving either 'load' or 'store' high, the cache checks whether the addressed memory is available in the cache, proceeding through the steps shown in the following diagram:

The cache determines the 'set' to which the address belongs from bits 16 downto 6, which give the set number. It then looks in that set and compares each of the 4 entries' TAG values with the TAG in the address (bits 28 downto 17). On a hit, the word addressed by bits 5 downto 3 is accessed; if the instruction was a load (HIT 'load'), that word is returned to the CPU via the 'data_CPU' signal.
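The address decomposition just described can be written out explicitly. A small sketch, following the bit ranges in the paragraph above (with bits 2 downto 0 as the byte within the 64-bit word):

```python
# Decompose a 29-bit address into the fields the cache uses:
# bits 28..17 = tag, 16..6 = set index, 5..3 = word within the line,
# 2..0 = byte within the 64-bit word.

def split_address(addr: int):
    tag = (addr >> 17) & 0xFFF        # 12-bit tag
    set_index = (addr >> 6) & 0x7FF   # 11 bits -> 2^11 sets
    word = (addr >> 3) & 0x7          # 8 words of 8 bytes per 64-byte line
    byte = addr & 0x7                 # byte within the word
    return tag, set_index, word, byte
```

The field widths add up to 12 + 11 + 3 + 3 = 29 bits, matching the 512 MB (2^29 B) main memory.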

If the instruction was a store on a hit (HIT 'store'), the value of 'data_CPU' is stored in the cache at the corresponding entry. Since we use a write-through policy, the corresponding entry in main memory must also be modified: the cache puts the address on the "address_M" signal, puts the data for one "line" (64 bytes) on the "data_M" signal, and sets "store_M" high to notify main memory. Main memory, on finding "store_M" high, stores the line passed on "data_M" at the address on "address_M".

If none of the 4 entries of the set matches, a miss occurs and main memory must be accessed. On a miss for a CPU 'load' (MISS 'load'), the cache must fetch the data from main memory: it puts the address on "address_M" and sets "load_to_cache" high to notify main memory. Main memory, on receiving this signal, reads the line of data at "address_M", places it on the "data_M" signal connected to the cache, and sets "ready_M" high. The cache, on finding ready_M high, takes the line from "data_M", chooses a random entry in the set (as specified by the random replacement policy), replaces it with the newly loaded line, and returns the addressed 64-bit word (bits 5 downto 3) to the CPU through "data_CPU".

If a miss occurs on a CPU 'store' (MISS 'store'), then, since our design uses a write-allocate policy, the missing block (line) is first loaded from main memory through the same process as a 'load' miss above. The cache then modifies the addressed word with the new value of "data_CPU" and copies the modified block back to main memory through the same process as a HIT on 'store'.
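The MISS-store path under write-allocate plus write-through can be sketched end to end. This is a simplified stand-in for the VHDL signal handshakes: memory is a dict of lines (8 words each), and a set is a list of (line address, line) entries.

```python
# Sketch of the MISS 'store' flow: (1) write-allocate fetches the line from
# memory, (2) the addressed word is patched, (3) a random victim in the set
# is replaced, and (4) write-through copies the whole line back to memory.

import random

def store_miss(cache_set, memory, addr, value, line_size=64):
    line_addr = addr & ~(line_size - 1)
    line = list(memory[line_addr])              # 1. allocate: fetch the line
    line[(addr >> 3) & 0x7] = value             # 2. patch the 64-bit word
    victim = random.randrange(len(cache_set))   # 3. random replacement
    cache_set[victim] = (line_addr, line)
    memory[line_addr] = line                    # 4. write-through to memory
```

The two memory transactions in this path (the allocate read and the write-through) are what make the MISS-write case the slowest in the simulation, about 25 clock cycles.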

If the needed block (line) is not even in main memory, a "page fault" occurs. A page of size 16 KB is then loaded from secondary memory into main memory, the corresponding block is loaded into the cache, and, depending on the 'load' or 'store' instruction, the data is either loaded to the CPU or stored to the cache and main memory. Writing back to secondary memory occurs only when a page is about to be replaced by a newly loaded page.

 

 

Simulation of Memory Management:

The simulation of memory management and data transfer does the following things:

First, the instruction LOAD R1, $00000000 is sent by the testbench to the CPU, where it should produce a "HIT READ", since the entry is in the cache. The CPU passes the address to the cache, which compares the TAG with that of the address and, on a match, returns the data to the CPU. Completing this instruction takes about 2 clock cycles, since only a cache access is used.

Second, the instruction STORE $00000084, R1 is sent by the testbench to the CPU, where it should result in a "HIT WRITE", as the entry is placed in the cache before the simulation. Because our design specifies a write-through policy, the cache stores the value both to itself and to main memory, so completing this instruction takes longer, about 13 clock cycles, since both a cache access and one memory access are performed.

Third, the instruction LOAD R3, $000000a0 is sent by the testbench to the CPU, where it should result in a "MISS READ", since the entry is not placed in the cache at initialization. When the cache checks the TAG it misses in all 4 lines and requests main memory to transfer the corresponding line to the cache. Completing this instruction takes about 13-14 clock cycles, since one memory access is used.

Fourth, the instruction STORE $00000048, R3 is sent by the testbench to the CPU, where it should produce a "MISS WRITE", since the entry does not exist at initialization. When the cache checks the TAG in the corresponding set, it finds a miss. Since we decided to use a write-allocate policy, the corresponding line is first loaded from memory to cache, the new word value is stored at the particular position, and the modified line is then stored back to memory. Two memory accesses are used, so completion takes about 25 clock cycles, as shown on the simulated waveform.

 

 

Simulation of the system

 

The system's fetch unit and decode-dispatch unit were simulated in VHDL, showing 8 instructions at a time being fetched and dispatched to the corresponding units among the 4 categories of multiple units (BU, IU, FPU, LSU), one set after another. This demonstrates the superscalar pipeline of multiple instructions being fetched and dispatched to corresponding execution units. The VHDL test shows three sets of 8 instructions being fetched and issued one after another.

The actual instructions that are used for simulation are:

LOAD R1, $00000000F

ADD R3, R5, R10

Load R09, $00000089

ADDF R20, R18, R24

BEQ R3, R10, $000000A0

shl r4

store $000000B8, r21

load r11, $00000056

-- 1 BU , 3 IU, 1 FPU, 3 LSU instructions in above set of 8

AND R5, R8, R10

SUBF R18, R19, R24

LOAD R31, $00000081

ROR r3

BLE R2, R8, $000000A7

store $00000FFF, r17

MULF R11, R5, R120

BGE R2, R8, $000000A7

-- 2 BU, 2 IU, 2 FPU, 2 LSU instructions in above set of 8

ADDF R10, R19, R32

AND R3, R21, R6

SUBF R12, R11, R14

LOAD R31, $00000081

ROL r3

MULF R11, R5, R120

CLR R10

BGE R5, R3, $00000068

-- 1 BU , 3 IU, 3 FPU, 1 LSU instructions in above set of 8

 

 

 

The simulation results are shown in next page:

 

 

 

Conclusion

Overall, the project of system, instruction set, and memory and cache management design was completed successfully by our team. We learned to design RISC-type instructions and formats that fully support instruction-level parallelism. The design helped us explore the workings of a superscalar system and the purpose of components such as the Reorder Buffer, Reservation Stations, branch prediction, and rename registers in delivering instruction-level parallelism with multiple issue. We learned the design of memory management and observed the transfer of data between CPU, cache and main memory while simulating our model. We also had the opportunity to implement different cache policies, namely write-through, write-allocate and random replacement. In all, the project gave us design skills and the opportunity to implement various possibilities in building a system and its memory management.