Branch prediction

Executing instructions in order

We’ve talked about how our C++ code gets translated into machine-understandable instructions in the Compiling section. But we haven’t actually talked about how our computers take those instructions and execute them under the hood. So let’s do that now!

But before we talk about concrete computers, let’s take a detour and talk about washing clothes.

Washing clothes quickly

Let’s say we want to clean our clothes. How many “major” steps do we have? At least for me, I do the following:

Steps:

  1. Wash

  2. Dry

  3. Fold

  4. Store

Let’s say each step takes approximately the same amount of time to complete (so this example can stay simple). If each step takes us 15 mins, how long does it take to fully clean our clothes?

Mins   |   15   |   15   |   15   |   15   |
Load 1 |  wash  |   dry  |  fold  |  store | 

Total: |   15   |   30   |   45   |   60   |

So at the end of one cleaning cycle, we took 60 mins. Let’s say we want to do 2 loads of our clothes. How long will that take?

Mins   |   15   |   15   |   15   |   15   |   15   |   15   |   15   |   15   |
Load 1 |  wash  |   dry  |  fold  |  store | 
Load 2                                     |  wash  |   dry  |  fold  |  store | 

Total: |   15   |   30   |   45   |   60   |   75   |   90   |   105  |   120  |

Seems like 120 mins. What about 3 loads?

Mins   |   15   |   15   |   15   |   15   |   15   |   15   |   15   |   15   |   15   |   15   |   15   |   15   |
Load 1 |  wash  |   dry  |  fold  |  store | 
Load 2                                     |  wash  |   dry  |  fold  |  store |
Load 3                                                                         |  wash  |   dry  |  fold  |  store |


Total: |   15   |   30   |   45   |   60   |   75   |   90   |   105  |   120  |   135  |   150  |   165  |   180  |

180 mins.

Can we do better? If it’s just one person doing everything, maybe not (unless you have a washing machine and a drying machine that do all your work - but even then you can’t do much about the folding or storing steps). What if you could get 3 of your friends and do your laundry together? Could you go faster?

Mins   |   15   |   15   |   15   |   15   |   15   |   15   |
Load 1 |  wash  |   dry  |  fold  |  store | 
Load 2          |  wash  |   dry  |  fold  |  store |
Load 3                   |  wash  |   dry  |  fold  |  store |


Total: |   15   |   30   |   45   |   60   |   75   |   90   |

If we had a dedicated person for each step, nobody needs to wait. The washing buddy can start washing load 2 as soon as they are done with load 1, and then move on to load 3 once they are done with that. So what previously took 180 mins now only takes 90. This is pretty much the gist of the idea called Pipelining.
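
We can sanity-check that arithmetic with a quick back-of-the-envelope calculation. Here is a minimal sketch, where the stage count and the 15-minute step time are just the numbers from our example:

#include <cstddef>
#include <iostream>

int main()
{
    constexpr std::size_t num_stages = 4;       // wash, dry, fold, store
    constexpr std::size_t mins_per_stage = 15;

    for (std::size_t num_loads = 1; num_loads <= 3; ++num_loads)
    {
        // One person doing everything: each load runs all 4 stages back to back
        const std::size_t sequential = num_loads * num_stages * mins_per_stage;
        // One buddy per stage: a new load can enter the pipeline every 15 mins
        const std::size_t pipelined = (num_stages + num_loads - 1) * mins_per_stage;
        std::cout << num_loads << " load(s): " << sequential << " mins alone vs "
                  << pipelined << " mins pipelined\n";
    }
}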

Pipelining - aka, how to avoid being idle

An instruction like mov     rbp, rsp is very similar to a human instruction like do laundry. Even though do laundry is one instruction to a human, there are multiple steps involved. Similarly with computer instructions, there are multiple stages that need to take place under the hood as well. In computer land, this is usually called the instruction cycle.

Instruction cycle

The instruction cycle process is also sometimes called the fetch–decode–execute cycle. Much like how it sounds, the high level gist of this is that there is a fetch step, a decode step, and an execute step (a step here will involve multiple substeps).

As with everything, before we can do something, we generally need to know what we need to do. Similarly, when a program starts, the first thing a computer will do is to go and fetch the next instruction that it ought to do. Say our next instruction is:

mov     rbp, rsp

Okay, great, we have the above instruction. But what the heck does that even mean?

For those that haven’t been indoctrinated into the ways of reading assembly yet, the above basically says “take the second thing (rsp) and copy it into the first thing (rbp)” - aka mov dest, src -> copy whatever is in rsp into rbp.

Just like how we had to parse and figure out what that instruction meant, a computer also has to do the same. This is the decode step. Once we have the next instruction and we know what it means, we can go about our merry way and execute the instruction. Once a computer finishes this step, it just repeats the process with the next instruction.
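
To make that loop concrete, here is a toy sketch of a fetch-decode-execute cycle for a made-up two-opcode machine. This is purely illustrative - a real CPU does this in hardware, not in a C++ loop - and the instruction format here is invented for the example:

#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>

// A made-up 2-register machine: opcode 0 loads an immediate, opcode 1 adds registers
struct Instruction { std::uint8_t opcode; std::uint8_t dest; std::uint8_t src; };

int main()
{
    const std::array<Instruction, 3> program = {{
        {0, 0, 5},  // load 5 into register 0
        {0, 1, 7},  // load 7 into register 1
        {1, 0, 1},  // add register 1 into register 0
    }};
    std::array<std::uint32_t, 2> registers = {0, 0};

    std::size_t pc = 0; // program counter - which instruction is next
    while (pc < program.size())
    {
        const Instruction instr = program[pc]; // fetch
        switch (instr.opcode)                  // decode ...
        {
            case 0: registers[instr.dest] = instr.src; break;               // ... and execute
            case 1: registers[instr.dest] += registers[instr.src]; break;
        }
        ++pc; // repeat the process with the next instruction
    }
    std::cout << "register 0 = " << registers[0] << '\n'; // prints 12
}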

Since our computers aren’t exactly humans doing laundry, the computer version of our above example looks a little closer to something like this instead:

#       |   1  |   2  |   3  |   4  |  5   |   6  |  7   |  8  |
Instr 1 |  IF  |  ID  |  EX  | MEM  |  WB  | 
Instr 2        |  IF  |  ID  |  EX  | MEM  |  WB  |
Instr 3               |  IF  |  ID  |  EX  | MEM  |  WB  |
Instr 4                      |  IF  |  ID  |  EX  | MEM  |  WB  |

where:
    IF  - Instruction Fetch
    ID  - Instruction Decode and Register Fetch
    EX  - Execution and effective address calculation
    MEM - Memory Access
    WB  - Write Back
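
To make the timing concrete, here is a small sketch that computes which cycle each instruction enters each stage, assuming one instruction enters the pipeline per cycle and nothing ever stalls:

#include <array>
#include <cstddef>
#include <iostream>

int main()
{
    constexpr std::array<const char*, 5> stages = {"IF", "ID", "EX", "MEM", "WB"};
    constexpr std::size_t num_instructions = 4;

    // With no stalls, instruction i (0-indexed) enters stage s on cycle i + s + 1
    for (std::size_t i = 0; i < num_instructions; ++i)
    {
        std::cout << "Instr " << (i + 1) << ":";
        for (std::size_t s = 0; s < stages.size(); ++s)
        {
            std::cout << " " << stages[s] << "@cycle" << (i + s + 1);
        }
        std::cout << '\n';
    }
}

Running it reproduces the staircase in the table above: each instruction sits exactly one cycle behind the previous one.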

The above pipelined picture is great if every instruction can just continue on without any hiccups and each instruction is independent of the others. Reality doesn’t look like this, however. We might have a situation where one instruction depends on another and therefore has to wait until that one completes. For example, let’s say Instr 3 depends on Instr 2 - as in the case where we are doing something like:

uint32_t get_value(); // defined elsewhere - the value is only known at runtime

constexpr uint32_t one = 10;
const uint32_t two = get_value();
const uint32_t three = two + one; // `one` is a constant so no need to wait, but `two` is dynamic - we can't compute `three` until we know what `two` is

Our instruction pipeline might instead look something like:

#       |   1  |   2  |   3  |   4  |  5   |   6  |  7   |  8   |  9   |
Instr 1 |  IF  |  ID  |  EX  | MEM  |  WB  | 
Instr 2        |  IF  |  ID  |  EX  | MEM  |  WB  |
Instr 3               |  IF  |  IF  |  ID  |  EX  | MEM  |  WB  |
Instr 4                             |  IF  |  ID  |  EX  | MEM  |  WB  |

By having to wait for Instr 2 to complete before we can perform Instr 3, we lost a step here. Code usually depends on some code prior to it, and it is easy to end up in situations like this. Sometimes we get stuck waiting on fetching data from a deeper cache or maybe even main memory, which stalls every instruction after that.

Okay, so it seems like we can go faster by pipelining the workload, but we sometimes get hit by stalls. But what does this have to do with branch prediction? And what does branch prediction even mean?

To branch, or not to branch, that is the question

So we talked about what happens when one instruction depends on a prior instruction. But what about cases where we don’t know which direction to go until after we compute something? For example:

void do_action(uint32_t some_value) {
    if (some_value <= some_threshold) {
        do_abc();
        do_def();
        do_ghi();
    } else {
        do_rst();
        do_uvw();
        do_xyz();
    }
}

In the above if block, we don’t know if some_value is actually less than or equal to some_threshold. We won’t know until we have that value and check whether it is <= or > the threshold. In the above pipelining examples, we sort of assumed that the next instruction was known each time - almost as if the code looked like:

Instr 1
Instr 2
Instr 3
Instr 4

but in our current case, it’s more like:

Instr 1         (if check)
...           /            \
...          /              \
Instr 2   do_abc           do_rst
Instr 3   do_def           do_uvw
Instr 4   do_ghi           do_xyz

So we can’t even fetch Instr 2 until after we figure out the output of Instr 1. Now that seems like quite a bummer.

Note

For terminology’s sake, it’s important to note that the juncture we are at here is known as a “branch” point. When we say “branching”, we are talking about arriving at a situation where we have one (or more) code paths we can take depending on the output of some conditional.

It is probably easier to think of this visually as a tree where the child nodes are effectively “branches” - at least that’s why computer scientists call this a “branch”.

Here is a crazy idea though. What if we just picked a direction and went down that route? What if we could say “we’ve been going left (doing abc, def, ghi) 70% of the time, so let’s just do that again”?

That’s the gist of what branch prediction is doing!

Our computers have ways of tracking how many times we’ve taken certain branches (it doesn’t really matter too much for our purposes how that’s done). Armed with this information, the computer essentially speculates and takes a branch. It tries to predict the next instruction and then runs off executing that pathway. So what started off looking something like:

#       |   1  |   2  |   3  |   4  |  5   |   6  |  7   |  8   |  9   |
Instr 1 |  IF  |  ID  |  EX  | MEM  |  WB  | 
Instr 2               |  Depending on the result of instr 1, fetch the next instruction - either do_abc or do_rst!

Now it looks like this (assuming the computer predicts that (some_value <= some_threshold) will probably be true):

#       |   1  |   2  |   3  |   4  |  5   |   6  |  7   |  8   |  9   |
Instr 1 |  IF  |  ID  |  EX  | MEM  |  WB  | 
do_abc         |  IF  |  ID  |  EX  | MEM  |  WB  |
do_def                |  IF  |  ID  |  EX  | MEM  |  WB  |
do_ghi                       |  IF  |  ID  |  EX  | MEM  |  WB  |

And off to the races we go!
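
As an aside, one classic way hardware keeps this kind of history is a small saturating counter per branch: nudge it up whenever the branch is taken, down whenever it isn’t, and predict “taken” whenever the counter sits in the upper half. Here is a toy sketch of that idea (purely illustrative - real predictors are considerably fancier):

#include <cstddef>
#include <cstdint>
#include <iostream>

// A toy 2-bit saturating counter: 0-1 predict "not taken", 2-3 predict "taken"
class two_bit_predictor
{
public:
    bool predict() const { return counter_ >= 2; }

    void update(bool taken)
    {
        if (taken && counter_ < 3) { ++counter_; }
        if (!taken && counter_ > 0) { --counter_; }
    }

private:
    std::uint8_t counter_ = 2; // start out weakly predicting "taken"
};

int main()
{
    two_bit_predictor predictor;
    std::size_t correct = 0;

    // A very predictable branch: taken 99 times, then not taken once
    for (std::size_t i = 0; i < 100; ++i)
    {
        const bool actually_taken = (i < 99);
        correct += (predictor.predict() == actually_taken);
        predictor.update(actually_taken);
    }
    std::cout << correct << "/100 predictions correct\n"; // prints 99/100
}

Notice that even this tiny predictor only mispredicts once on a branch that goes the same way 99% of the time.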

Wait a second. What if the computer was wrong? What if (some_value <= some_threshold) was actually false, and we were supposed to go down the path of doing rst, uvw, xyz?

Ah, that’s no big deal. The computer just says “Oops, we messed up”, rolls back everything until Instr 1, and then goes down the right path this time.

You might be thinking that this is a waste of time and energy if you have to undo everything and then do the right path. It does end up hurting a little when this happens. This is what is commonly known as branch misprediction. When you are trying to squeeze out as much performance as possible from your computer, you generally should take care to keep your misprediction rate down.

When computers do branch prediction, they are betting that, on average, they are right more often than they are wrong. Which means, on average, they are saving precious cycles by guessing a direction and speculatively executing down that path. In the instances where they do fail, it’s not that bad, because in the absence of branch prediction we would have been waiting several cycles for the result of the instruction anyway! So this generally ends up being a win.

In some cases, you get really big wins. For example:

for (std::size_t i = 0; i < 100; ++i) {
    if (i > 98) // Branching here!
    {
        std::cout << i << std::endl;
    }
    else
    {
        continue;
    }
}

It is obvious to you and me that the if (i > 98) will be false the first 99 times it’s checked and only true on the very last iteration. What a waste it would be if we had to wait each time until we figured out whether i was indeed greater than 98 or not. Even if the computer’s initial assumption is that the if will be true, after a few iterations it becomes more and more likely to lean false. Computers generally pick up on this and end up saving a lot of time by charging ahead with predictions.

This is the essence of branch prediction.

When possible, it is helpful to write your code in a way that doesn’t confuse the branch predictor. As with all things in computer science, the fastest way to do something is to not do that thing at all. This is why you will frequently see people telling you to avoid conditionals or other branch-incurring instructions. Well, now you can hopefully appreciate the why behind that insight.
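
One common way to follow that advice is to turn control flow into data flow so the branch disappears entirely. As a sketch, both functions below compute the same count, but the second has no branch for the predictor to get wrong (count_branchy and count_branchless are names made up for this example; compilers can sometimes do this transformation for you via conditional-move instructions, but not always):

#include <cstddef>
#include <vector>

// Branchy: the predictor has to guess flags[i] on every iteration
std::size_t count_branchy(const std::vector<bool>& flags)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < flags.size(); ++i)
    {
        if (flags[i]) { ++count; }
    }
    return count;
}

// Branchless: the bool converts to 0 or 1, so there is nothing to predict
std::size_t count_branchless(const std::vector<bool>& flags)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < flags.size(); ++i)
    {
        count += static_cast<std::size_t>(flags[i]);
    }
    return count;
}

The branchless version does a little arithmetic on every iteration instead of jumping, so it tends to win only when the branch is genuinely unpredictable.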

Like with all good things, it is important to benchmark to see the truth for yourself. So let’s do that.

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

constexpr std::size_t num_iterations = 1000;
std::size_t do_loop(const std::vector<bool>& branch_choices)
{
    std::size_t count = num_iterations;
    for (std::size_t i = 0; i < num_iterations; ++i)
    {
        if (branch_choices[i])
        {
            ++count;
        }
        else
        {
            --count;
        }
    }
    return count;
}
auto to_millis = [](const auto start, const auto stop) { return std::chrono::duration<double, std::milli>(stop - start).count(); };

int main()
{
    {
        std::vector<bool> branch_choices;
        branch_choices.reserve(num_iterations);
        for (std::size_t i = 0; i < num_iterations; ++i)
        {
            branch_choices.push_back(true); // every branch goes the same way - very predictable
        }

        auto start = std::chrono::high_resolution_clock::now();
        auto sum = do_loop(branch_choices);

        auto end = std::chrono::high_resolution_clock::now();
        std::cout << "Took " << to_millis(start, end) << " millis to run the predictable for loop. Calculated sum: " << sum << std::endl;
    }

    {
        std::vector<bool> branch_choices;
        branch_choices.reserve(num_iterations);
        for (std::size_t i = 0; i < num_iterations; ++i)
        {
            branch_choices.push_back(rand() % 2 == 0); // a coin flip per iteration - very unpredictable
        }

        auto start = std::chrono::high_resolution_clock::now();
        auto sum = do_loop(branch_choices);

        auto end = std::chrono::high_resolution_clock::now();
        std::cout << "Took " << to_millis(start, end) << " millis to run the unpredictable for loop. Calculated sum: " << sum << std::endl;
    }
}
Took 0.0156 millis to run the predictable for loop. Calculated sum: 2000
Took 0.0218 millis to run the unpredictable for loop. Calculated sum: 976

While the above is a pretty simple example with a silly benchmark, it should be enough to get the point across that there is a cost to branching - especially unpredictable branching.

Branch Hints

Most compilers provide explicit ways for you to hint at your intent. In our specific case of branch prediction, they let you hint that a specific branch is more or less likely to be taken.

On GCC and Clang, for example, the syntax is:

__builtin_expect((some_condition), true) // If you think it is very likely
__builtin_expect((some_condition), false) // If you think it is very unlikely

You’ll often see people using a macro to alias this so it’s easier to read. Something like:

#define LIKELY(cond) __builtin_expect((cond), true)
#define UNLIKELY(cond) __builtin_expect((cond), false)

What this will materially end up looking like is:

if (LIKELY(some_condition)) {
    // We expect this to be the branch we almost always want to take
} else {
    // Similarly, this means that this branch will almost never happen
}

and the inverse:

if (UNLIKELY(some_condition)) {
    // We expect this to be the branch we almost never take
} else {
    // Similarly, this means that this branch will almost always happen
}
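
As a side note, if you can use C++20 or newer, the standard now has attributes that express the same hint without compiler-specific builtins - [[likely]] and [[unlikely]]:

if (some_condition) [[likely]] {
    // We expect this to be the branch we almost always take
} else [[unlikely]] {
    // ...and this one to almost never happen
}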

Note

This is strictly a suggestion / hint. The compiler can, at its own discretion, choose to ignore said suggestion! This will also be true of all the compiler hints we discuss in this book.

Summary

  • In order to increase the throughput of instructions, computers pipeline their instructions

    • There are risks to this however, such as stalling due to the need to wait on a prior instruction

  • In order to mitigate the downside of stalling, computers try to predict the next branch and speculatively execute the instructions

    • If it was a bad prediction, the computer rolls back, and goes down the proper route.

    • Otherwise, we benefit from the increase in throughput by avoiding the stall

  • If you really know what you are doing, you can hint to the compiler about potential branches and how to best handle them with __builtin_expect.
