Computer Architecture
SMU CSE 5381/7381
Fall 1998
Chapter 4
Instruction-Level Parallelism (4.1)
Introduction
- Pipelining requires independent instructions
- Ability to overlap is called instruction-level
parallelism (ILP)
- Various hardware and software techniques to find
and exploit
- However small amount of ILP in basic block
- Turn to loops for more instructions
- Vectorization is one method for getting loop-based
ILP
Instruction-Level Parallelism (cont)
Pipeline Scheduling
- Need to separate instruction that uses value from
instruction that generates it
- Limited by ILP and latencies of functional units
- Note: assume different latencies for this section
Loop Unrolling
- Replicate body of loop and decrease iterations
- Give scheduler more independent instructions
- Eliminates branches and branch penalties
- Requires more registers for instruction independence
- Important for this and other transformations to know
when it is safe to reorder instructions
Instruction-Level Parallelism (cont)
Dependencies
- Cannot reorder a pair of dependent instructions
- If instructions are independent can run in parallel
- Three types
- Data dependencies
- Name dependencies
- Control dependencies
Data Dependence
- Two instructions i and j are data dependent if i generates
a value needed by j
- Also include transitive relationship
- Implies that instructions cannot be done in parallel or
be reordered
- Dependencies are a result of the source code while dealing with
them is property of pipeline organization
Instruction-Level Parallelism (cont)
Handling Data Dependencies
- Retain dependency but avoid hazard (scheduling)
- Eliminate dependency (requires complex analysis)
- This is complicated by the fact that dependencies
can flow through registers or memory
Name Dependencies
- Instructions with the same name but no data flow (second
instruction is a write)
- Reordering still results in a wrong answer
- Two types (assume instruction i followed by j):
- Antidependence: Instruction i writes register read
by j (WAR)
- Output: Instruction i writes register written by
j (WAW)
- Both can be eliminated with register renaming
Instruction-Level Parallelism (cont)
Control Dependence
- When the execution of an instruction is dependent upon a branch
- All code except first basic block is control dependent on some
instruction
- Restrictions:
- Cannot move an instruction dependent upon a branch
outside the scope of a branch
- Cannot move an instruction outside the scope of a branch
inside the scope
Dealing With Control Dependencies
- Preserved by in-order execution and hazard detection
- Real issue is to preserve
- Exception behavior
- Data flow
- From this perspective some reordering is possible even if it
violates the restrictions
Instruction-Level Parallelism (cont)
Loop-Level Parallelism
- Analysis done at source level
- Goal is to see if body of loop can be executed in parallel
- Distinction between:
- Dependencies within loop - does not inhibit
parallelism
- Dependencies between iterations of a loop
(loop-carried) - may inhibit
- Many transformations to help eliminate
Dynamic Scheduling (4.2)
Introduction
- Have seen static scheduling where if one instruction
stalls, all subsequent stall
- With dynamic scheduling hardware rearranges instructions
to avoid stalls
- If an instruction stalls, now look at next instructions
- Requires that the ID phase be split:
- Issue: Check for structural
- Read Operands: Wait for data hazards to clear before
reading operands
- Results in out-of-order execution
- Potential for RAW, WAR and WAW hazards
Dynamic Scheduling (cont)
Scoreboard
- Named after feature in CDC 6600
- Instructions execute out of order if no structural
or data hazards
- Goal is one instruction per cycle
- Centralized control that monitors all instructions
- Will use for floating point on DLX example
Execution Steps
- Issue: If functional unit free and no other active instruction
has same destination register, then issue and update scoreboard
- Prevents WAW hazards
- If WAW or structural hazard exists, stall
Dynamic Scheduling (cont)
Execution Steps (cont)
- Read Operands: Checks availability of source operands,
reads when ready and begins execution
- Operand is available if not being written or is
destination of active instruction
- Resolves RAW hazards
- Responsible for execution out-of-order
- Execution: Executes and notifies scoreboard when complete
- Write Result: Write the results after checking for WAR
hazards. Cannot write when:
- An instruction has not read its operands
- One of its operands is the same as the destination for
this command
- The other operand is the result of an earlier
operation
Comments
- Scoreboard must monitor for structural hazards
- There is no forwarding
Dynamic Scheduling (cont)
Scoreboard Structure
- Instruction Status: Which of the four steps the instruction
is in
- Functional Unit (FU) Status: For each FU
- Busy: Indicates if FU is being used
- Op: Operation for FU to perform
- Fi, Fj, Fk: Destination and source registers
- Qj, Qk: FU's producing Fj, Fk
- Rj, Rk: Flags to indicate when Fj, Fk are ready. Reset
when source values are read
- Register Result Status: Indicates which FU will write a register,
if an active instruction has that register as a destination.
Limitations to Performance Increase
- Amount of parallelism available
- Size of scoreboard window (number of entries)
- Number and types of functional units
- Presence of name dependencies
Dynamic Scheduling (cont)
Tomasulo Algorithm
- Designed for IBM 360/91
- Attempted to compensate for few FP registers
- Based on register renaming
- Will discuss in terms of multiple functional units
Basic Structure
- Hazard and execution control distributed (via
reservation stations at each functional unit)
- Replaces register names with that of reservation station --
register renaming
- Results are passed to functional units directly using
a Common Data Bus (CDB)
Dynamic Scheduling (cont)
Steps of Execution
- Issue:
- Get an instruction from the FP operation queue
- If empty reservation station then issue and move
operands to reservation station
- If load/store then issue if available buffer
- Otherwise stall
- Execute:
- If operands are not available, then monitor CDB
until computed
- Checks for RAW hazards
- Write Result:
- Write result onto the CDB and destination registers
Comparison to Scoreboard
- No need to check for WAW or WAR
- CDB is used to broadcast results
- Loads/Stores treated as basic functional units
Dynamic Scheduling (cont)
Data Structures
- Everything (except Load/Store) has tag field, a 4-bit
quantity that specifies the functional unit that will
produce the result needed as an operand.
- Tags are basically names for extended virtual register set.
- Each reservation station has:
- Op: The operation to perform on the source operands
- Qj, Qk: The reservation station that will produce
the corresponding operand
- Vj, Vk: The value of the operands
- Busy: Indicates the reservation station and
FU are occupied
- Each register file and source buffer has
- Qi: The number of the FU that will produce a value to be
stored there
- V: The value to be stored there (store only)
Dynamic Hardware Prediction (4.3)
Branch Prediction
- Amount of ILP to exploit limited by control dependencies
- Techniques crucial to multiple-issue processors
Branch Prediction Buffer
- Small memory indexed by the lower portion of the
branch instruction address
- Contains 1 bit that indicates action previously taken
- For frequently taken branches, 1 bit scheme falls short
- Use 2 bit scheme that requires 2 similar sequential actions
to change state. Can general to n bits.
- Not immediately useful for DLX since branch target and
action are determined at the same time
Dynamic Hardware Prediction (cont)
Correlating Predictors
- Base prediction on what previous branch(es) did
- Generalize to (m, n) predictor: previous m branches
choose from 2^m predictors, each of which is n bits.
- Can keep track of previous m branches in shift register
- Previous 2-bit scheme is (0, 2) in this notation.
Branch-Target Buffer
- Determine if (undecoded) instruction is a branch and
the predicted branch target
- Yields branch penalty of 0
- If matching entry is found, fetching begins immediately
at given target
- Must match PC exactly to not degrade performance
- When incorrect, take a 2 clock-cycle penalty
Dynamic Hardware Prediction (cont)
Variations
- Store target instruction instead of target address:
- Reduces time constraint
- Allows for branch folding -- replacing
an unconditional branch with its target
- Optimize indirect jumps (e.g. procedure returns)
- Difficult to predict
- Keep stack of return addresses -- good results
if stack is not too small
Multiple Issue (4.4)
Introduction
- To reduce CPI below 1 must do multiple-issue --
issue more than one instruction per clock cycle
- Two techniques:
- Superscalar: Hardware handles issues. Can
be statically or dynamically scheduled.
- VLIW: Software solution. Fixed number of
instructions formatted as one word.
Superscalar
- Issues 2 to 8 instructions per clock cycle
- If dependencies, then only instructions up to dependent
one will issue
- Results in dynamic issue of instructions
- Static scheduling looks at the next n instructions
and checks if they can be issued simultaneously
- Dynamic scheduling uses schemes similar to scoreboard
or Tomasulo
Multiple Issue (cont)
Static Superscalar DLX
- Bundle floating point and integer
- Minimizes dependencies between the two
- Requires more functional units or pipelined units
- Problems with FP load/stores issued with FP ops
- Contention for FP register ports
- Data hazard between instructions
- Also 1 cycle stall now affects 3 instructions.
Dynamic Superscalar DLX
- Use Tomasulo's algorithm, again group integer
and floating point
- Would like to issue dependent instructions in
same clock cycle, let reservation stations serialize
- Two solutions:
- Pipeline instruction issue stage to allow
tables to be updated twice as fast
- Eliminate reservation tables and use queues
for loads stores (decoupled architecture)
- Also need to worry about loads/stores being reordered
Multiple Issue (cont)
VLIW
- Superscalar requires a lot of hardware to check
dependencies among instructions
- VLIW groups instructions into one long word
- Responsibility is on compiler to get it right
- Need to do global scheduling to feed the VLIW
- Techniques include trace scheduling
Limitations
- Limitations of ILP in programs: difficult to find
a lot of ILP to keep hardware busy.
- Building the hardware: Memory bandwidth, register-file
bandwidth, dynamic scheduling analysis
- Code size: increases for VLIW
- Stalls: Need to keep units in lock-step for VLIW
- Backward compatibility
Compiler Support for ILP (4.5)
Dependence Analysis
- Need to detect and hopefully eliminate
- Difficulty with array based dependencies
- Several tests such as GCD and Bannerjee
- Difficulties
- Pointers
- Indirect pointers
- Information only available at runtime
- Need to know specific write of a variable
Software Pipeline
- Restructure code so that each iteration operates on
multiple iterations of the loop
- Basically symbolic loop unrolling
- Benefits of loop unrolling without code growth
Compiler Support for ILP (cont)
Trace Scheduling
- Looks for instructions to schedule across branches
- Two steps:
- Trace selection: Attempts to find a likely
path through the code, including branches
- Trace compaction: Scheduling the instructions
into the smallest number of VLIWs.
- If code does not take predicted path, may have to
do extra steps
- Need to ensure that speculative execution does not introduce
exceptions
Hardware Support for Parallelism
Strategies
- Conditional or Predicated Instructions
- Speculation (static or dynamic)
Conditional Instructions
- Instructions includes condition
- If condition true, instruction executes, otherwise noop
- Most common is conditional move
- Converts control dependence into data dependence
- Also helps with speculative execution
- Must again ensure correct exception behavior
Limitations
- They still take time
- Condition should be evaluated early
- Limited use across multiple branches
- Instructions may take longer
Hardware Support for Parallelism (cont)
Compiler Speculation
- Compiler would like to move instruction across branches
- Can deal with dependencies through register renaming,
cleanup
- Exception behavior is more difficult:
- Have hardware and software cooperate to ignore
exceptions
- Add poison bits to result registers
- Hardware buffers result
- Need to distinguish between ``terminate'' and ``resume''
exceptions
Hardware-Software Cooperation
- Handle resumable instructions as normal
- Returned undefined value for termination exceptions
- Now program gets incorrect answer instead of an exception
Hardware Support for Parallelism (cont)
Poison Bits
- Set poison bit on destination register of speculative
instructions that fault
- Propagate poison bits through speculative instructions
- If a normal instructions attempts to use ``poisoned''
register, then trap
Renaming
- Push instructions past branches and flag as speculative
- Hardware buffers values until branch is hit
- If speculation was correct, then values are committed
to the registers
- Called boosting
Hardware Support for Parallelism (cont)
Hardware-Based Speculation
- Use branch prediction, speculation and dynamic scheduling
- A data-flow execution where operations execute as soon
as operands are available
- Has the major disadvantage of complex hardware, but has
following advantages
Advantages
- Memory disambiguation
- Hardware-based branch prediction
- Maintains precise exception model
- No need to compensate, bookkeep
- Does not require different code sequences for different
architectures
Hardware Support for Parallelism (cont)
Example with Tomasulo
- Extend Tomasulo's algorithm
- Separate bypassing of results from completion of instruction
- This allows other speculative instructions to continue
- When instruction is no longer speculative, update state --
called instruction commit
- Execute out of order, but commit in order without incorrectly
changing state
- Need reorder buffer to contain instructions ready to
commit
Reorder Buffer
- Subsumes the load and store buffers
- Supplies operands for other instructions
- Each entry contains the instruction type, destination
number (register or address) and value
- Use reorder buffer entries as tags
Hardware Support for Parallelism (cont)
Four Steps of Execution
- Issue: Pick an instruction from the queue and send
to reservation station if space in the station
and the reorder buffer
- Execute: Wait for operands then execute the operation
- Write Result: Write the result on the CDB and from there
to the reorder buffer and any reservations waiting for
the value.
- Commit: When an instruction (other then an incorrectly
predicted branch) is at the top of the buffer, update
state. If an incorrectly predicted branch, then flush
buffer and restart
Comments
- May want to check for branch prediction before
branch hits top
- Record exceptions in reorder buffer and handle at
commit step
- Can extend to integer instructions, multiple-issue
Studies of ILP (4.7)
Introduction
- Would like to determine how much ILP there is
- Strategy presented is to look at theoretical limit
- Do this be removing all ``artificial'' constraints
Model
- Infinite registers (for renaming)
- Perfect branch prediction
- Perfect jump prediction
- All memory addresses known
Studies of ILP (cont)
Perfect World
- Assume:
- Unlimited issue
- Look arbitrarily far ahead
- All latencies 1 cycle
- Integer: 18 - 63 instructions/cycle
- Floating point: 75 - 150 instructions/cycle
Window Size
- Restrict
- Window size: Number of instructions examined
- Issue count: Side effect of limiting window size
- Even a large window of 512 has dramatic effect
- At small window size floating point similar to integer
- Assume 2k window and 64-way issue
Studies of ILP (cont)
Branch and Jump Prediction
- Compare:
- Selective predictor
- Standard 2-bit
- Static (profile based)
- None (except jump predictors)
- Those with lots of loops perform better
- Assume ambitious predictor
Finite Registers
- Restrict number of registers for renaming
- Most dramatic on floating point
- Assume 256
Studies of ILP (cont)
Alias Analysis
- Compare:
- Global/stack perfect, heap conflict
- Inspection (addresses examined at compile time)
- None
- For Fortran not much a difference between first two
- Area where improved compiler analysis may help
Reality
- Upto 64-way issue
- Selective predictor with 16 return predictors
- Perfect disambiguity for memory references
- Additional 64 GP and FP registers
- As window size changes FP affected the most due to
loop-level parallelism
PowerPC 620 (4.8)
Introduction
- 64-bit PowerPC
- Fetch, issues and complete 4 instructions/clock
- Six execution units:
- 2 simple integer units (1 cycle)
- Complex integer (3 - 20 cycles)
- Load-store unit: effective address calculation,
load/store buffers, nonblocking (upto 4)
- Floating point unit
- Branch unit
Compare to Hypothetical
- Explicit rename registers instead of reorder buffer
- Keeps all registers in one place
- Result moved from rename to actual register at commit
PowerPC 620 (cont)
Pipeline
- Fetch
- Branch-target and branch-prediction buffer
- Stack of return addresses
- Decode
- Issue
- Sent to reservation station with available operands
- Allocate rename register and reorder buffer
- Execute
- When operands are available
- Write to result bus, rename registers
- Updates done for mispredicted branches
- Reservation station freed
- Commit
- Done when all previous have committed
- Upto 4 per cycle
- Values then moved to ``visible'' state
PowerPC 620 (cont)
What is a Stall
- With single issue either you issue or you stall
- With multiple issue you may not issue the maximum, but
may still issue some
- Easier to think of Instructions Per Cycle (IPC)
- Empty slots introduced at issue may be reordered but will
never disappear.
- However, additional stalls after issue may be hidden by
these empty slots
Fetch Performance
- Tries to keep 8-buffer queue filled with at least 4 instructions
- On average keeps 5.2
- Reasons for not keeping full
- Misprediction
- I-Cache misses
- Partial cache line access
PowerPC 620 (cont)
Issue Performance
- May not issue due to:
- Limitations due to combinations of instructions
- Lack of progress in execution and commit stages
- Reasons for these:
- No reservation stations
- No rename registers
- Full reorder buffer
- Two operations on the same functional unit
- Miscellaneous (e.g. special registers)
- These account for the most dramatic reductions in IPC
PowerPC 620 (cont)
Execute Performance
- Can fall to execute due to:
- Source operands not available
- Functional unit not available
- Out-of-order disallowed
- Serialization
- Reasons for these:
- Source operands: Not enough ILP, window too small
- Functional unit: Need more hardware
- Out-of-order/Serialization: Redesign
PowerPC 620 (cont)
Commit Performance
- Typically limited by Issue and Execute performance
- Infrequently due to hardware resource constraints
Summary
- Final IPC runs from 1 to 1.8
- Degradation due to:
- Functional unit constraints, primarily load/store
- Losses in Fetch, Issue and Execute stage
- Limited ILP and finite buffering
Fallacies and Pitfalls
Fallacies
- Processors with low CPI's will always be faster
Pitfalls
- Emphasizing a reduction in CPI by increasing issue rate
while sacrificing clock rate can lead to lower performance.
- Improving only one aspect of a multiple-issue processor and
expecting overall performance improvement
Review: Chapter 3
- Given the basic DLX pipeline, what was the ideal CPI? % 1
- Once the pipeline is full, what is the maximum number of
memory reads per cycle? memory writes? % 2/1
- Once the pipeline is full, what is the maximum number of
register reads per cycle? register writes? % 2/1
- What is the purpose of the pipeline registers?
- Give an example of some information that needs to be propagated
through the pipeline.
- T or F: All stages are used by every type of instruction. % F
- Name the three types of hazards. %
- What is a stall?
- What are the three types of data hazards?
- Describe forwarding and give an example of a case where it does
not prevent a stall.
- Describe the ``software'' solution to control hazards.
- Name some techniques for filling a branch-delay slot.
- What are precise exceptions? Why are they important?
- For the multicycle DLX discussed, what types of data hazards
are possible?
- Name a problem with out-of-order completion.
Note: These notes are based on the text
Computer Architecture: A Quantitative Approach, Second Edition
J. Hennessy and D. Patterson
Morgan Kaufman, 1990, 1996
ISBN 1-55860-329-8
and are copyright © 1998 Matthew Diaz, Southern Methodist University.
These notes may be freely copied and distributed so long as this
notice is included.