I greatly enjoyed this article about how to design CPU instruction sets. Lots of interesting perspectives that I hadn't properly considered before.
Probably my favorite part was a bit of straightforward arithmetic that I could have done myself but hadn't thought to: branch mispredict rate * (reorder buffer size / typical basic block length), i.e. the expected number of mispredicted branches sitting in the out-of-order window at any given moment.
With a seven-stage dual-issue pipeline, you might have 14 instructions in flight at a time. If you incorrectly predict a branch, on average half of these will be the wrong ones and will need to be rolled back, cutting your real throughput to half of your theoretical throughput. Modern high-end cores typically have around 200 in-flight instructions: at a typical basic block length of about seven instructions, that is over 28 basic blocks, so a 95% branch predictor accuracy rate gives less than a 24% probability of correctly predicting every branch in flight. Big cores really like anything that can reduce the cost of mispredictions.
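That 24% figure is easy to sanity-check. Here's a minimal sketch assuming only the numbers from the passage above (roughly 200 in-flight instructions, ~7-instruction basic blocks, 95% predictor accuracy); the extra window sizes in the loop are just illustrative:

```python
# A back-of-the-envelope check of the quoted arithmetic. All figures
# come from the passage above (200 in-flight instructions,
# ~7-instruction basic blocks, 95% predictor accuracy); nothing here
# is measured from real hardware.

rob_size = 200      # in-flight instructions (reorder buffer size)
block_len = 7       # typical basic block length, in instructions
accuracy = 0.95     # branch predictor hit rate

blocks_in_flight = rob_size / block_len        # ~28.6 basic blocks
p_window_correct = accuracy ** blocks_in_flight

print(f"basic blocks in flight: {blocks_in_flight:.1f}")
print(f"P(every branch predicted correctly): {p_window_correct:.1%}")
# -> about 23%, matching the "less than 24%" in the passage

# How quickly the odds collapse as the out-of-order window grows:
for window in (14, 50, 100, 200, 400):
    p = accuracy ** (window / block_len)
    print(f"window {window:>3}: {p:.1%}")
```

The takeaway is the exponent: doubling the window squares the odds against you, so the probability of a fully correct window falls off a cliff long before the predictor itself looks bad.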
I knew branch prediction was critical (Dan Luu's article on that is my favorite), but I hadn't internalized just how fast the numbers get bad for big out-of-order cores.