I remember when I was a kid, sitting there with the awesome spiral-bound reference book for Turbo Assembler, and writing an assembly program to read a text file from disk and write it to the screen.
I dutifully set up all my registers and int 21h'd through the text printing routines. When I got it to work, I sat there and watched the text scroll across the screen, like, hell yeah, I have written the fastest implementation of cat ever (although at the time, I didn't know what Unix cat was...I probably thought something more like "the fastest version of MS-DOS's TYPE command" or something).
Then, to feel even more awesome, I fired up QBasic and wrote the same program, because this was going to be so much slower than the assembly code I had just written, and that would make me feel superior. Just an OPEN of the filename, then LINE INPUT and PRINT in a loop until EOF.
A few lines of human-readable code.
I didn't watch the output scroll by because it was so instantaneous it just sort of popped to the final lines immediately.
I assume QBasic's PRINT statement writes characters directly to video memory, without going through DOS's slow-as-hell int 21h interface. Or maybe I was writing one character at a time in assembly, I don't remember.
I tell you this because I just spent hours trying to write a 5.1-to-stereo audio converter using Intel AVX instructions and could not get it to beat the handful of lines of C this takes if you do it with scalars instead of vectors. It wasn't even close.
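For a sense of scale, here is a minimal sketch of what a scalar downmix like that could look like. It is not the code from that afternoon: the function name `downmix_51_to_stereo`, the interleaved float layout (FL, FR, FC, LFE, BL, BR), the -3 dB weights on the center and surround channels, and dropping the LFE channel are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical scalar 5.1-to-stereo downmix.
 * Assumes interleaved float frames ordered FL, FR, FC, LFE, BL, BR.
 * The weights below are a common choice, not the only valid one. */
void downmix_51_to_stereo(const float *src, float *dst, size_t frames)
{
    const float center_gain = 0.70710678f;   /* about -3 dB, assumed */
    const float surround_gain = 0.70710678f; /* about -3 dB, assumed */

    for (size_t i = 0; i < frames; i++) {
        const float fl = src[0], fr = src[1], fc = src[2];
        const float bl = src[4], br = src[5]; /* src[3] (LFE) is dropped */

        dst[0] = fl + center_gain * fc + surround_gain * bl; /* left */
        dst[1] = fr + center_gain * fc + surround_gain * br; /* right */

        src += 6; /* next 5.1 frame */
        dst += 2; /* next stereo frame */
    }
}
```

Nothing here clamps the output, so loud input can exceed the [-1, 1] range. A real converter would probably scale or clip, which is left out to keep the loop short.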
So just know: sometimes you will start out on something, thinking "I'm going to write the obviously fastest thing," but there may be bogus assumptions lurking in the background, waiting to drag you down.
