Slide 37 of 97
Notes:
We are finally ready to benchmark our code example.
The first load instruction takes one cycle, as does the second load. Adding the four delay cycles for the NOPs gives a total of six. The multiply and its associated NOP bring the total to eight. The add and the subtract increase the total to ten. The branch plus its delay slots bring the total to 16. The loop runs 40 times, so as the code is now written it takes 16 times 40, or 640 cycles.
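The running total above can be checked with a short tally. This is only a sketch of the arithmetic in these notes: the step labels are descriptive, not actual instruction mnemonics, and the six cycles given to the branch and its delay slots are inferred from the running totals (10 before the branch, 16 after).

```python
# Cycle tally for one iteration of the unoptimized loop, following the notes.
# Labels are illustrative only; the 6 cycles for "branch plus delay slots"
# are inferred from the running totals in the notes (10 -> 16).
STEPS = [
    ("first load", 1),
    ("second load", 1),
    ("NOPs after the loads", 4),
    ("multiply", 1),
    ("NOP after the multiply", 1),
    ("add", 1),
    ("subtract", 1),
    ("branch plus delay slots", 6),
]
ITERATIONS = 40  # loop count stated in the notes


def cycles_per_iteration(steps):
    """Sum the cycle cost of every step in one pass through the loop."""
    return sum(cycles for _, cycles in steps)


def total_cycles(steps, iterations):
    """Cycles for the whole loop: per-iteration cost times the loop count."""
    return cycles_per_iteration(steps) * iterations


if __name__ == "__main__":
    running = 0
    for name, cycles in STEPS:
        running += cycles
        print(f"{name:26s} +{cycles}  running total {running}")
    print("total cycles:", total_cycles(STEPS, ITERATIONS))
```

Keeping the steps in a list makes it easy to see how removing NOPs from the table would change the benchmark.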
DSP aficionados joke that NOP stands for “not optimized properly”. In other words, by inserting all of these delay cycles, we are not getting the maximum performance from the part. In fact, once this code is properly optimized, we will see that it is possible to reduce the benchmark from 640 clock cycles to 28.
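To put the two benchmarks side by side, the ratio can be computed directly. This sketch assumes the 28-cycle figure covers the same complete 40-iteration loop as the 640-cycle figure; the function name is hypothetical.

```python
# Compare the unoptimized and optimized benchmarks quoted in the notes.
# Assumption: both figures cover the entire 40-iteration loop.
UNOPTIMIZED_CYCLES = 640
OPTIMIZED_CYCLES = 28


def speedup(before, after):
    """Return how many times faster the 'after' cycle count is."""
    return before / after


if __name__ == "__main__":
    ratio = speedup(UNOPTIMIZED_CYCLES, OPTIMIZED_CYCLES)
    print(f"speedup: about {ratio:.1f}x")
```

Under that assumption the optimized loop is roughly 23 times faster than the version padded with NOPs.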
To accomplish this amazing performance improvement, we will use the powerful optimization capabilities of the C6x development tools.