Intel provides four cache prefetch instructions:
prefetchnta, prefetcht0, prefetcht1, and prefetcht2.
AMD adds two more:
prefetch and prefetchw, both of which are AMD's own 3DNow! instructions.
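From C, these hints can usually be reached without writing assembly. The following is only a minimal sketch, assuming GCC or Clang: `__builtin_prefetch(addr, rw, locality)` is a real compiler builtin, but the helper names and `sum_with_prefetch` are hypothetical, and which instruction each call becomes depends on the compiler and target flags. On x86, locality 3/2/1/0 is typically lowered to prefetcht0/prefetcht1/prefetcht2/prefetchnta, and `rw = 1` to prefetchw only where the target supports it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper names; only __builtin_prefetch is a compiler builtin. */
static inline void prefetch_read_keep(const void *p)   { __builtin_prefetch(p, 0, 3); } /* typically prefetcht0  */
static inline void prefetch_read_stream(const void *p) { __builtin_prefetch(p, 0, 0); } /* typically prefetchnta */
static inline void prefetch_for_write(const void *p)   { __builtin_prefetch(p, 1, 3); } /* prefetchw if enabled  */

/* Sum an array, hinting one cache line (8 doubles) at a time, 256 bytes ahead. */
double sum_with_prefetch(const double *v, size_t n)
{
    double s = 0.0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Issue one hint per 64-byte line, and never form an address past the array. */
        if ((i & 7) == 0 && i + 32 < n)
            prefetch_read_keep(&v[i + 32]);   /* 32 doubles = 256 bytes ahead */
        s += v[i];
    }
    return s;
}
```

A prefetch is only a hint: it does not change the result of the computation, so the function returns the same sum with or without it.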
The sample code below multiplies two arrays element by element.
C code:
    #define num 65536
    #define ARR_SIZE (num * 8)   /* size of each array in bytes */
    double array_a[num];
    double array_b[num];
    double array_c[num];
    int i;
    for (i = 0; i < num; i++) {
        array_a[i] = array_b[i] * array_c[i];
    }
Assembly code:
    mov edx, (-num)                 ; Use biased index (counts up to 0).
    mov eax, OFFSET array_a         ; Get address of array_a.
    mov ebx, OFFSET array_b         ; Get address of array_b.
    mov ecx, OFFSET array_c         ; Get address of array_c.
    mul_loop:                       ; "loop" is an instruction mnemonic, so use another label.
    ; The prefetch addresses include the index so that they advance each iteration.
    prefetchw [eax+edx*8+ARR_SIZE+256] ; a: four cache lines ahead (will be written)
    prefetch [ebx+edx*8+ARR_SIZE+256]  ; b: four cache lines ahead
    prefetch [ecx+edx*8+ARR_SIZE+256]  ; c: four cache lines ahead
    fld QWORD PTR [ebx+edx*8+ARR_SIZE] ; b[i]
    fmul QWORD PTR [ecx+edx*8+ARR_SIZE] ; b[i] * c[i]
    fstp QWORD PTR [eax+edx*8+ARR_SIZE] ; a[i] = b[i] * c[i]
    fld QWORD PTR [ebx+edx*8+ARR_SIZE+8] ; b[i+1]
    fmul QWORD PTR [ecx+edx*8+ARR_SIZE+8] ; b[i+1] * c[i+1]
    fstp QWORD PTR [eax+edx*8+ARR_SIZE+8] ; a[i+1] = b[i+1] * c[i+1]
    fld QWORD PTR [ebx+edx*8+ARR_SIZE+16] ; b[i+2]
    fmul QWORD PTR [ecx+edx*8+ARR_SIZE+16] ; b[i+2] * c[i+2]
    fstp QWORD PTR [eax+edx*8+ARR_SIZE+16] ; a[i+2] = b[i+2] * c[i+2]
    fld QWORD PTR [ebx+edx*8+ARR_SIZE+24] ; b[i+3]
    fmul QWORD PTR [ecx+edx*8+ARR_SIZE+24] ; b[i+3] * c[i+3]
    fstp QWORD PTR [eax+edx*8+ARR_SIZE+24] ; a[i+3] = b[i+3] * c[i+3]
    fld QWORD PTR [ebx+edx*8+ARR_SIZE+32] ; b[i+4]
    fmul QWORD PTR [ecx+edx*8+ARR_SIZE+32] ; b[i+4] * c[i+4]
    fstp QWORD PTR [eax+edx*8+ARR_SIZE+32] ; a[i+4] = b[i+4] * c[i+4]
    fld QWORD PTR [ebx+edx*8+ARR_SIZE+40] ; b[i+5]
    fmul QWORD PTR [ecx+edx*8+ARR_SIZE+40] ; b[i+5] * c[i+5]
    fstp QWORD PTR [eax+edx*8+ARR_SIZE+40] ; a[i+5] = b[i+5] * c[i+5]
    fld QWORD PTR [ebx+edx*8+ARR_SIZE+48] ; b[i+6]
    fmul QWORD PTR [ecx+edx*8+ARR_SIZE+48] ; b[i+6] * c[i+6]
    fstp QWORD PTR [eax+edx*8+ARR_SIZE+48] ; a[i+6] = b[i+6] * c[i+6]
    fld QWORD PTR [ebx+edx*8+ARR_SIZE+56] ; b[i+7]
    fmul QWORD PTR [ecx+edx*8+ARR_SIZE+56] ; b[i+7] * c[i+7]
    fstp QWORD PTR [eax+edx*8+ARR_SIZE+56] ; a[i+7] = b[i+7] * c[i+7]
    add edx, 8                      ; Compute next 8 products
    jnz mul_loop                    ; until none left.
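The code keeps data loading into the cache four lines (four sets) ahead of the current position.
Assuming a 4-way set-associative cache with 64-byte cache lines, 4 * 64 bytes = 256 bytes, hence the +256 offset in the prefetch instructions. Each loop iteration processes 8 doubles (64 bytes) per array, so one prefetch per array per iteration covers exactly one new cache line.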
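The distance arithmetic above can be written out in C. This is a rough sketch rather than a tuned implementation: it assumes GCC or Clang (for `__builtin_prefetch`) and the 64-byte line size stated above, and the macro names and `multiply_prefetched` are hypothetical. The best number of lines to run ahead depends on the processor and memory latency, so 4 is only the value used in this post.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE     64                                  /* bytes per cache line (assumed) */
#define LINES_AHEAD    4                                   /* prefetch distance in lines     */
#define PREFETCH_BYTES (LINES_AHEAD * CACHE_LINE)          /* 4 * 64 = 256 bytes             */
#define PER_LINE       (CACHE_LINE / sizeof(double))       /* 8 doubles per line             */
#define AHEAD          (PREFETCH_BYTES / sizeof(double))   /* 32 doubles ahead               */

/* a[i] = b[i] * c[i], one cache line per iteration; n must be a multiple of 8. */
void multiply_prefetched(double *a, const double *b, const double *c, size_t n)
{
    assert(n % PER_LINE == 0);
    for (size_t i = 0; i < n; i += PER_LINE) {
        if (i + AHEAD < n) {                         /* stay inside the arrays        */
            __builtin_prefetch(&a[i + AHEAD], 1, 3); /* destination: will be written  */
            __builtin_prefetch(&b[i + AHEAD], 0, 3); /* sources: read only            */
            __builtin_prefetch(&c[i + AHEAD], 0, 3);
        }
        for (size_t k = 0; k < PER_LINE; k++)        /* the 8 products of this line   */
            a[i + k] = b[i + k] * c[i + k];
    }
}
```

Unlike the assembly version, this sketch leaves the unrolling of the inner 8-element loop to the compiler and skips the prefetch near the end of the arrays.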
[ Last edited by mik on 2007-1-27 12:11 ]