I greatly enjoyed this article about how to design CPU instruction sets. Lots of interesting perspective that I hadn't properly considered before.
Probably my favorite bit was a piece of straightforward arithmetic that I could have done myself but hadn't thought to: branch mispredict rate × (reorder buffer size / typical basic block length).
With a seven-stage dual-issue pipeline, you might have 14 instructions in flight at a time. If you incorrectly predict a branch, on average half of these will be the wrong ones and will need to be rolled back, making your real throughput only half of your theoretical throughput. Modern high-end cores typically have around 200 in-flight instructions: that is over 28 basic blocks, so a 95% branch predictor accuracy rate gives less than a 24% probability of having predicted every branch in the window correctly. Big cores really like anything that can reduce misprediction penalties.
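Here is that back-of-the-envelope in code form, using the figures above (200 in-flight instructions, ~7-instruction basic blocks, 95% predictor accuracy); the variable names are just mine for illustration:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double accuracy = 0.95;     /* branch predictor hit rate */
    double window = 200.0;      /* in-flight instructions (reorder buffer size) */
    double block_length = 7.0;  /* typical basic block length, roughly one branch per block */

    double branches_in_flight = window / block_length;         /* ~28.6 */
    double expected_mispredicts = (1.0 - accuracy) * branches_in_flight;
    double p_all_correct = pow(accuracy, branches_in_flight);  /* ~0.23 */

    printf("branches in flight:                  %.1f\n", branches_in_flight);
    printf("expected mispredicted branches:      %.2f\n", expected_mispredicts);
    printf("P(whole window correctly predicted): %.2f\n", p_all_correct);
    return 0;
}
```

Roughly one and a half mispredicted branches sitting somewhere in the window at any given moment, and less than a one-in-four chance that the whole window is on the correct path.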
I knew branch prediction was critical (Dan Luu's article on that is my favorite), but hadn't internalized just how fast the numbers get bad for big out of order cores.
Quoting somebody else: "Itanium failed because compilers that generate good code for it cannot exist".
(What is meant here is that "the average basic block length of real programs (7 instructions) is too short for VLIW to take advantage of. And smart compilers/assemblers/programmers cannot salvage that." The one exception is DSPs, which are well-suited to doing math without branching, and thus DSPs are (were?) VLIW.)
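As a sketch of my own rather than anything from the article, here is the kind of code that keeps basic blocks that short, an ordinary linear search chosen purely for illustration:

```c
/* Each iteration splits into tiny basic blocks: load and compare the
 * element, a conditional branch out on a match, then the increment and
 * the loop-back test and branch.  None of these blocks is more than a
 * handful of instructions long, and a VLIW compiler can only bundle
 * operations it can schedule within one block (short of predication,
 * speculation, or trace scheduling), so it rarely finds enough
 * independent work to fill wide issue slots. */
long find_index(const int *a, long n, int key) {
    for (long i = 0; i < n; i++) {
        if (a[i] == key)
            return i;   /* data-dependent branch ends the block */
    }
    return -1;
}
```

A DSP inner loop, by contrast, is mostly straight-line multiply-accumulate work, which is exactly the kind of thing a wide static-issue machine can keep fed.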