Page 1 of 1

Loop Unrolling - Optimal Unrolling Factor?

Posted: Wed May 30, 2018 12:30 am
by walker7
When unrolling loops using the DBRA instruction and a fixed number of iterations, I think I have found the optimal unrolling factor.

Multiply the number of iterations by 10, then take the square root.

For example, if you have a loop that is executed 200 times, then you would compute:
sqrt(200 * 10) = 44.72

Now, the factors of 200 (besides 1 and itself) are 2, 4, 5, 8, 10, 20, 25, 40, 50, and 100. The closest to 44.72 is 40, so you might choose to unroll the loop 40 times (meaning the DBRA executes only 5 times). Of course, you could unroll the code less (such as 25 times) to save space, or more (such as 50 times) to save time. Everyone has their own preferences.

When I load tiles into VRAM, I like to use a MOVE.L instruction and unroll it 8 times, writing 32 bytes each iteration.

What are your loop unrolling preferences?

Re: Loop Unrolling - Optimal Unrolling Factor?

Posted: Wed May 30, 2018 7:38 am
by Stef
Well, the more you unroll, the best... it's always a trade-off between speed and code size but generally grouping 16 iterations is "enough" to make the branching almost free.

Re: Loop Unrolling - Optimal Unrolling Factor?

Posted: Wed May 30, 2018 12:18 pm
by bioloid
I don't unroll, cause I'm a lamer.

Re: Loop Unrolling - Optimal Unrolling Factor?

Posted: Wed May 30, 2018 9:34 pm
by Miquel
(iterations*10)^1/2

Why this formula is the optimal?
An educated guess, perhaps?

Re: Loop Unrolling - Optimal Unrolling Factor?

Posted: Wed May 30, 2018 10:52 pm
by walker7
For those who are interested, let me explain the derivation of the formula.

Assume that the instruction(s) take c cycles to execute. The DBRA takes 10 cycles when taken, and 14 cycles when not taken.
When the loop is completely rolled, each iteration takes c+10 cycles. The final iteration takes c+14 cycles.
The variable r represents the number of iterations.
Therefore, the total number of cycles is: C1 = r(c+10)+4.

For unrolling, consider the variable u, which is the unroll factor. That is, the looped instructions are listed u times before DBRA.
When the loop is unrolled the formula becomes uc+10 for all but the final iteration, and uc+14 for the final iteration.
But, the number of loop iterations becomes r/u, not just r.
The formula then becomes: C2 = (r/u)(uc+10)+4.

Since C2 < C1, we then calculate the time savings as C1 - C2.
The formula becomes: D = C1 - C2 = [r(c+10)+4] - [(r/u)(uc+10)+4].
It then simplifies to: D = 10r * ((u-1)/u).
The variable c has vanished, so no matter what your loop does, just go by the number of iterations needed.

If you were to graph this with D as the dependent variable and u as the independent variable (r is constant), it looks kind of like a square root or logarithm function. That is, it rises steeply at first, then rises more slowly.

The function is monotone increasing, which means that the more iterations you unroll, the more time saved. But, the more iterations unrolled, the more space used up. There could be a nice compromise where the tangent slope of the function is equal to 1.

To find the tangent slope, take the first derivative of the function with respect to u. Then, set it equal to 1 and solve for u.
Function: D = 10r * ((u-1)/u)
1st Derivative: D' = 10r / u^2

10r / u^2 = 1

u = sqrt(10r)
(u is the unrolling factor, r is the number of iterations)

And there you have it! The reason for multiplying by 10 is because of how a DBRA instruction takes 10 cycles when the branch is taken.

Re: Loop Unrolling - Optimal Unrolling Factor?

Posted: Thu May 31, 2018 2:47 am
by Miquel
Ok, but what exactly are you trying to optimize? or it's just a 'magic' formula, not matter what, a guidance oblivious of goal and circumstances?

Re: Loop Unrolling - Optimal Unrolling Factor?

Posted: Thu May 31, 2018 5:24 am
by walker7
I'm trying to optimize a trade-off between code size (space) and cycles taken (time). When programming, you often have to make this kind of decision.

Re: Loop Unrolling - Optimal Unrolling Factor?

Posted: Thu May 31, 2018 12:38 pm
by Miquel
Making speed code optimizations is the root of all devils, unless the development is finished you code is final and you have speed problems I would never recommend to do them, and only if you maintain two versions of your code. Usually it's much more rewarding and safe to do structural optimizations.

Once that said, once I ported a game on PSP and was not great on fps, so among other things I tried to unroll a critical copy loop to my absolute surprise if I unrolled from 1 instruction to 8, it increased 1fps. So I just keep adding the same sentence until it hit a barrier of 3fps at about 64 sentences, at this point no further gain was accomplished. But I was prepared to keep going much further even if only some marginal gain was there.

This kind of decisions heavily depends on the case.