免费注册 查看新帖 |

Chinaunix

  平台 论坛 博客 文库
最近访问板块 发新帖
查看: 1753 | 回复: 0
打印 上一主题 下一主题

Loop Unrolling with Duff's Device[转] [复制链接]

论坛徽章:
0
跳转到指定楼层
1 [收藏(0)] [报告]
发表于 2006-10-22 07:57 |只看该作者 |倒序浏览
转自http://www.codemaestro.com/reviews/review00000102.html
好坏就不评价了

Loop Unrolling with Duff's Device
Loop unrolling, or unwinding, is simply reducing the number of overhead instructions that the CPU has to execute in a loop, thus improving the cache hit rate and the loop's run time.

As a simple demonstration, the following example code copies 100 values from the memory pointed by from, to device, that points to some hardware device we want to send data to.

for (int i = 0; i < 100; ++i)
{
    *device = *from++;
}
Note: We assume device points to a hardware device, so its pointer isn't incremented in the example.

Without looking at the assembley code , I boldly assume almost any compiler would require at least one condition check, one index increment and one jump instruction to execute each loop cycle in the example code above. We can optimize the code dramatically by simply executing several copy instructions in every loop cycle:

for (int i = 0; i < 20; i++)
{
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
}

The improvement is easily noticeable – only fifth of the condition checks, fifth of the jump instructions and fifth of the index increments would be executed. This optimization method is called loop unrolling or loop unwinding.

What happens when the number of elements to copy varies? If loop unrolling is to be used as an optimization, it should be used very carefully. In the last example we counted on the fact that the amount of loop cycles can be divided by 5, but what if the total number of elements doesn't divide by 5? A naive implementation would probably split the logic into two separate parts:

// Handle copy as much as we can with unrolling
int loopCycles = (count + 4) / 5;
for (int i = 0; i < loopCycles; ++i)
{
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
    *device = *from++;
}

// Take care of the remainer memory to copy
for (int j=0; j < count % 5; ++j)
{
    *device = *from++;
}

The implementation above works, but is seems awkward to have two loops for just copying one buffer - by means of code size, and by means of sheer lack of style.

Duff's Device
A special C syntax hack can spare us the filth of those two consecutive loops in the naive code above. Tom Duff (link below) was the first to discover it in 1983 while working in Lucas Films, and proudly named it "Duff's Device". Duff's device implements loop unrolling by interlacing the structures of a switch and a loop. It is perhaps the most dramatic use of case label fall-through ever seen in C:

int n = (count + 7) / 8;      /* count > 0 assumed */
  
switch (count %
{
    case 0:   do {    *device = *from++;
    case 7:           *device = *from++;
    case 6:           *device = *from++;
    case 5:           *device = *from++;
    case 4:           *device = *from++;
    case 3:           *device = *from++;
    case 2:           *device = *from++;
    case 1:           *device = *from++;
              } while (--n > 0);
}

At first the code might seem weird, but it is perfectly valid ANSI C. To clear things out, here's a brief explanation of the code:

When we enter the switch, we first copy count's remainder with 8. Note that there is no break instruction in the switch, so we will 'land' on count % 8 and fall through the rest of the switch. Execution will reach the do's while condition, and jump back to the case 0: instruction. From this point onward, we are simply running a normal loop unrolling calculated in n, and the switch is ignored.

For a simple memory copy you can use a fast assembly instruction like rep movsb. However, in this review the subject is copying memory to a hardware device. Besides hardware specific instructions, Duff's device is still considered to be the fastest code to copy memory to a device for over 22 years.

References
Tom Duff's original discovery can be found at http://www.lysator.liu.se/c/duffs-device.html

Posted by Tomer Margolin | Saturday, July 30th. 2005 |
您需要登录后才可以回帖 登录 | 注册

本版积分规则 发表回复

  

北京盛拓优讯信息技术有限公司. 版权所有 京ICP备16024965号-6 北京市公安局海淀分局网监中心备案编号:11010802020122 niuxiaotong@pcpop.com 17352615567
未成年举报专区
中国互联网协会会员  联系我们:huangweiwei@itpub.net
感谢所有关心和支持过ChinaUnix的朋友们 转载本站内容请注明原作者名及出处

清除 Cookies - ChinaUnix - Archiver - WAP - TOP