In the 10th video of the Deep Dive series, you talk about loop tiling: cache blocking. This is also on page 33 of the PDF file.
The access pattern on the left shows a cache miss for i=0, j=0. That is clear. What is not clear is why i=0, j=1 on the very next line is also counted as a cache miss. When i=0, j=0 is accessed, wouldn't the entire cache line containing that element be read in (a cache line is 64 bytes)? So if b is an array of ints, with int being 4 bytes long, wouldn't the cache then hold 16 elements of b? This assumes that b is appropriately aligned.
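
To make the arithmetic behind my question concrete, here is a minimal sketch of the model I have in mind. It is only an idealized simulation, not a measurement: the function name `classify_accesses` and its parameters are hypothetical, and it assumes a 64-byte line, 4-byte ints, a cold cache, unit-stride access along j, and no evictions or prefetching. It also ignores any other arrays touched in the same loop, which may well be what the slide is accounting for.

```python
LINE_BYTES = 64  # assumed cache line size in bytes
ELEM_BYTES = 4   # assumed sizeof(int) in bytes


def classify_accesses(n_elems, base_addr=0,
                      line_bytes=LINE_BYTES, elem_bytes=ELEM_BYTES):
    """Label sequential accesses b[0], b[1], ... as 'miss' or 'hit'.

    Idealized model: the cache starts empty, nothing is ever evicted,
    and a miss brings in the whole line containing the element.
    base_addr is the (hypothetical) byte address of b[0].
    """
    cached_lines = set()
    labels = []
    for j in range(n_elems):
        # Index of the cache line that holds element j.
        line = (base_addr + j * elem_bytes) // line_bytes
        if line in cached_lines:
            labels.append("hit")
        else:
            cached_lines.add(line)
            labels.append("miss")
    return labels


def miss_indices(labels):
    """Return the indices j whose access was labeled a miss."""
    return [j for j, label in enumerate(labels) if label == "miss"]


if __name__ == "__main__":
    print("elements per line:", LINE_BYTES // ELEM_BYTES)
    print("aligned b, misses at j =", miss_indices(classify_accesses(32)))
    print("b[0] at byte 60, misses at j =",
          miss_indices(classify_accesses(32, base_addr=60)))
```

Under these assumptions, an aligned b misses only at j=0 and j=16, so j=1 would be a hit. If b[0] instead sits 60 bytes into a line, j=1 falls on the next line and does miss, which is why I added the alignment caveat.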