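Last edited by lyjdamzwf on 2012-01-15 09:21
People keep saying that byte alignment affects performance in one way or another, and recently I suddenly wanted to measure it myself.
The first thing that came to mind was how memset in glibc handles alignment. The code is as follows:
- void *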
- memset (dstpp, c, len)
- void *dstpp;
- int c;
- size_t len;
- {
- long int dstp = (long int) dstpp;
- if (len >= 8)
- {
- size_t xlen;
- op_t cccc;
- cccc = (unsigned char) c;
- cccc |= cccc << 8;
- cccc |= cccc << 16;
- if (OPSIZ > 4)
- /* Do the shift in two steps to avoid warning if long has 32 bits. */
- cccc |= (cccc << 16) << 16;
- /* (highlighted in red in the original post: begin) */
- /* There are at least some bytes to set.
- No need to test for LEN == 0 in this alignment loop. */
- while (dstp % OPSIZ != 0)
- {
- ((byte *) dstp)[0] = c;
- dstp += 1;
- len -= 1;
- }
- /* (highlighted in red in the original post: end) */
- /* Write 8 `op_t' per iteration until less than 8 `op_t' remain. */
- xlen = len / (OPSIZ * 8);
- while (xlen > 0)
- {
- ((op_t *) dstp)[0] = cccc;
- ((op_t *) dstp)[1] = cccc;
- ((op_t *) dstp)[2] = cccc;
- ((op_t *) dstp)[3] = cccc;
- ((op_t *) dstp)[4] = cccc;
- ((op_t *) dstp)[5] = cccc;
- ((op_t *) dstp)[6] = cccc;
- ((op_t *) dstp)[7] = cccc;
- dstp += 8 * OPSIZ;
- xlen -= 1;
- }
- len %= OPSIZ * 8;
- /* Write 1 `op_t' per iteration until less than OPSIZ bytes remain. */
- xlen = len / OPSIZ;
- while (xlen > 0)
- {
- ((op_t *) dstp)[0] = cccc;
- dstp += OPSIZ;
- xlen -= 1;
- }
- len %= OPSIZ;
- }
- /* Write the last few bytes. */
- while (len > 0)
- {
- ((byte *) dstp)[0] = c;
- dstp += 1;
- len -= 1;
- }
- return dstpp;
- }
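In the part highlighted in red (the alignment loop, while (dstp % OPSIZ != 0)), memset assigns one byte at a time until the address is aligned to 4 or 8 bytes (OPSIZ is 4 or 8). Only once the address is aligned does it switch to the more efficient 4-byte/8-byte stores.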
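To make the three phases of the quoted memset concrete, here is a minimal sketch that only computes how many bytes each phase would handle for a given address and length. The names broadcast_byte, FillPlan and plan_fill are made up for this illustration and are not part of glibc; the sketch assumes a word size of 4 or 8 and lumps the unrolled 8-word loop and the single-word loop into one word count.

```cpp
#include <cstddef>
#include <cstdint>

// Replicates one byte into every byte of a machine word, in the same
// way as the shifts applied to `cccc` in the quoted memset.
unsigned long broadcast_byte(unsigned char c)
{
    unsigned long word = c;
    word |= word << 8;
    word |= word << 16;
    if (sizeof(unsigned long) > 4)
        word |= (word << 16) << 16;   // two steps, as in the quoted code
    return word;
}

// Hypothetical helper type: how a fill of `len` bytes is split up.
struct FillPlan
{
    size_t head;   // single bytes written until the address is aligned
    size_t words;  // whole aligned words written
    size_t tail;   // single bytes left over at the end
};

// Assumes word_size is 4 or 8, so head is always smaller than 8.
FillPlan plan_fill(uintptr_t addr, size_t len, size_t word_size)
{
    FillPlan plan = {0, 0, 0};
    if (len < 8)
    {
        plan.tail = len;               // short fills go byte by byte
        return plan;
    }
    plan.head = (word_size - addr % word_size) % word_size;
    size_t rest = len - plan.head;
    plan.words = rest / word_size;
    plan.tail = rest % word_size;
    return plan;
}
```

For example, with 8-byte words a 50-byte fill starting at 0x18af011 would be split into 7 head bytes, 5 words and 3 tail bytes, while the same fill starting at 0x18af010 needs no head bytes at all.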
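My understanding so far has been this: if an address is not aligned to the size of a standard type, then fetching a value of that type costs the bus extra work, because it has to read the "two blocks" that the value straddles and then merge them into one value, which lowers efficiency.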
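The "read two blocks and merge" idea can be written out in software. The sketch below is only a model of that mental picture, not a description of what any particular CPU does: it reads a 32-bit value at an arbitrary byte offset using nothing but aligned 32-bit reads. The function name load_u32_split is made up for this illustration, and the byte order is assumed to be little-endian.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the 32-bit little-endian value that starts `byte_offset` bytes
// into an array of aligned 32-bit words, using only aligned word reads.
// When the offset is not a multiple of 4, the caller must make sure that
// words[byte_offset / 4 + 1] exists.
uint32_t load_u32_split(const uint32_t* words, size_t byte_offset)
{
    size_t index = byte_offset / 4;
    unsigned shift = static_cast<unsigned>(byte_offset % 4) * 8;

    uint32_t low = words[index];
    if (shift == 0)
        return low;                    // aligned: a single read is enough

    uint32_t high = words[index + 1];  // misaligned: a second read is needed
    return (low >> shift) | (high << (32 - shift));
}
```

With the words 0x44332211 and 0x88776655, offset 0 gives 0x44332211 from one read, while offset 1 needs both words and gives 0x55443322.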
With this guess in mind, I wrote the following test program:
- #include <sys/time.h>
- #include <stdlib.h>
- #include <stdint.h>
- #include <stdio.h>
- #include <string.h>
- #include <iostream>
- using namespace std;
- #define EXEC_COUNT (1000 * 1000 * 1000)
- typedef long int op_t;
- int main(int argc_, char* argv_[])
- {
- void* ori_mem_ptr = malloc(50);
- memset(ori_mem_ptr, 0, 50); // clear the whole block; sizeof(ori_mem_ptr) is only the size of the pointer
- timeval begin_tv;
- timeval end_tv;
- for (int offset = 0; offset < 20; ++offset)
- {
- char* mem_ptr = (char*)ori_mem_ptr + offset;
- long int ptr_val = (long int)mem_ptr;
- printf("ori_mem_ptr:[%p] mem_ptr:[%p] offset:[%d]\n", ori_mem_ptr, mem_ptr, offset);
- gettimeofday(&begin_tv, NULL);
- for (uint32_t i = 0; i < EXEC_COUNT; ++i)
- {
- *(uint32_t*)mem_ptr = i;
- //! uint32_t tmp_val = *(uint32_t*)mem_ptr;
- }
- gettimeofday(&end_tv, NULL);
- printf("perf:[%ld]us\n\n", (end_tv.tv_sec * 1000 * 1000 + end_tv.tv_usec) - (begin_tv.tv_sec * 1000 * 1000 + begin_tv.tv_usec));
- }
- free(ori_mem_ptr);
- return 0;
- }
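A side note on the program above: writing through *(uint32_t*)mem_ptr when mem_ptr is not 4-byte aligned is undefined behavior in C and C++, even though it usually works on x86. A minimal sketch of a more portable variant follows. It uses memcpy for the misaligned store and a monotonic clock for the timing. The names store_u32, load_u32 and time_stores_us are made up for this illustration, and an optimizing compiler may collapse the loop, so the timings are only meaningful in an unoptimized build.

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>

// Stores a 32-bit value at any address, aligned or not.
void store_u32(void* dst, uint32_t value)
{
    std::memcpy(dst, &value, sizeof value);
}

// Loads a 32-bit value from any address, aligned or not.
uint32_t load_u32(const void* src)
{
    uint32_t value;
    std::memcpy(&value, src, sizeof value);
    return value;
}

// Times `count` stores to `dst` and returns the elapsed microseconds.
// steady_clock is used because it never jumps backwards.
long long time_stores_us(char* dst, uint32_t count)
{
    using clock = std::chrono::steady_clock;
    clock::time_point begin = clock::now();
    for (uint32_t i = 0; i < count; ++i)
        store_u32(dst, i);
    clock::time_point end = clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
}
```

Calling time_stores_us((char*)ori_mem_ptr + offset, EXEC_COUNT) for each offset would play the same role as the inner loop and the two gettimeofday calls.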
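This simple program allocates a block of memory and then points an offset pointer at different addresses inside it. The intent was to show that access is slower when the address is not aligned, but the results puzzle me.
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af010] offset:[0]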
perf:[2736881]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af011] offset:[1]
perf:[2730324]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af012] offset:[2]
perf:[2728421]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af013] offset:[3]
perf:[2704567]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af014] offset:[4]
perf:[2711222]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af015] offset:[5]
perf:[2710743]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af016] offset:[6]
perf:[2715045]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af017] offset:[7]
perf:[2720721]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af018] offset:[8]
perf:[2707865]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af019] offset:[9]
perf:[2750098]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af01a] offset:[10]
perf:[2745997]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af01b] offset:[11]
perf:[2686419]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af01c] offset:[12]
perf:[2686714]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af01d] offset:[13]
perf:[3304225]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af01e] offset:[14]
perf:[3299947]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af01f] offset:[15]
perf:[3303989]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af020] offset:[16]
perf:[2747328]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af021] offset:[17]
perf:[2687445]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af022] offset:[18]
perf:[2703295]us
ori_mem_ptr:[0x18af010] mem_ptr:[0x18af023] offset:[19]
perf:[2688810]us
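As the output shows, the dividing points are not where I expected. I wonder whether this is because these are only virtual addresses and the underlying physical addresses may not be contiguous. But if that were the case, what library functions such as memset do would puzzle me all over again. I don't know the low-level details well.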
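One thing can be checked from the printed addresses alone. The three slow runs (offsets 13, 14 and 15, roughly 3.30 s against roughly 2.7 s for the rest) are exactly the 4-byte stores that span address 0x18af020, which is a multiple of 16 and of 32. Stores that span only a multiple of 4 or 8, such as offsets 5 to 7, show no slowdown. The sketch below does this arithmetic; crosses_boundary is a name made up for this illustration, and which granularity the hardware actually cares about is an assumption that the numbers here cannot settle.

```cpp
#include <cstddef>
#include <cstdint>

// True if the `size` bytes starting at `addr` span a multiple of `boundary`,
// i.e. the first and last byte fall into different boundary-sized blocks.
bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    return addr / boundary != (addr + size - 1) / boundary;
}
```

With base address 0x18af010 and 4-byte stores, crosses_boundary reports offsets 13, 14 and 15 for a boundary of 16 or 32, offsets 5, 6, 7, 13, 14 and 15 for a boundary of 8, and no offset at all between 0 and 19 for a boundary of 64.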
I hope the experts here on CU can help clear this up.