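Originally posted by 星尘细雨 on 2006-12-29 10:18
A memcpy written in MMX/SSE/SSE2 assembly is certainly faster,
but it is a hassle to port to non-x86 CPUs.
As I recall, performance also depends on the CPU:
Intel CPUs execute SSE2 instructions faster than AMD CPUs do.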
Since MMX instructions came up, here is a piece of code that copies a memory block using MMX instructions:
- // block copy: copy a number of DWORDs from DWORD aligned source
- // to DWORD aligned destination using cacheable stores.
- __asm {
- MOV ESI, [src_ptr] ;pointer to src, DWORD aligned
- MOV EDI, [dst_ptr] ;pointer to dst, DWORD aligned
- MOV ECX, [blk_size] ;number of DWORDs to copy
- PREFETCH (ESI) ;prefetch first src cache line
- CMP ECX, 1 ;less than one DWORD to copy ?
- JB $copydone2_cc ;yes, must be no DWORDs to copy, done
- TEST EDI, 7 ;dst QWORD aligned?
- JZ $dstqaligned2_cc ;yes
- MOVD MM0, [ESI] ;read one DWORD from src
- MOVD [EDI], MM0 ;store one DWORD to dst
- ADD ESI, 4 ;src++
- ADD EDI, 4 ;dst++
- DEC ECX ;number of DWORDs to copy
- $dstqaligned2_cc:
- MOV EBX, ECX ;number of DWORDs to copy
- SHR ECX, 4 ;number of cache lines to copy
- JZ $copyqwords2_cc ;no whole cache lines to copy, maybe QWORDs
- PREFetchm (ESI,64) ;prefetch src cache line one ahead
- PREFetchmlong (ESI,128) ;prefetch src cache line two ahead
- ALIGN 16 ;align loop for optimal performance
- $cloop2_cc:
- PREFetchmlong (ESI, 192) ;prefetch cache line three ahead
- MOVQ MM0, [ESI] ;load first QWORD in cache line from src
- ADD EDI, 64 ;dst++
- MOVQ MM1, [ESI+8] ;load second QWORD in cache line from src
- ADD ESI, 64 ;src++
- MOVQ MM2, [ESI-48] ;load third QWORD in cache line from src
- MOVQ [EDI-64], MM0 ;store first QWORD in cache line to dst
- MOVQ MM0, [ESI-40] ;load fourth QWORD in cache line from src
- MOVQ [EDI-56], MM1 ;store second QWORD in cache line to dst
- MOVQ MM1, [ESI-32] ;load fifth QWORD in cache line from src
- MOVQ [EDI-48], MM2 ;store third QWORD in cache line to dst
- MOVQ MM2, [ESI-24] ;load sixth QWORD in cache line from src
- MOVQ [EDI-40], MM0 ;store fourth QWORD in cache line to dst
- MOVQ MM0, [ESI-16] ;load seventh QWORD in cache line from src
- MOVQ [EDI-32], MM1 ;store fifth QWORD in cache line to dst
- MOVQ MM1, [ESI-8] ;load eighth QWORD in cache line from src
- MOVQ [EDI-24], MM2 ;store sixth QWORD in cache line to dst
- MOVQ [EDI-16], MM0 ;store seventh QWORD in cache line to dst
- DEC ECX ;count--
- MOVQ [EDI-8], MM1 ;store eighth QWORD in cache line to dst
- JNZ $cloop2_cc ;until no more cache lines to copy
- $copyqwords2_cc:
- MOV ECX, EBX ;number of DWORDs to copy
- AND EBX, 0xE ;number of QWORDS left to copy * 2
- JZ $copydword2_cc ;no QWORDs left, maybe DWORD left
- ALIGN 16 ;align loop for optimal performance
- $qloop2_cc:
- MOVQ MM0, [ESI] ;read QWORD from src
- MOVQ [EDI], MM0 ;store QWORD to dst
- ADD ESI, 8 ;src++
- ADD EDI, 8 ;dst++
- SUB EBX, 2 ;count--
- JNZ $qloop2_cc ;until no more QWORDs left to copy
- $copydword2_cc:
- TEST ECX, 1 ;DWORD left to copy ?
- JZ $copydone2_cc ;nope, we’re done
- MOVD MM0, [ESI] ;read last DWORD from src
- MOVD [EDI], MM0 ;store last DWORD to dst
- $copydone2_cc:
- FEMMS ;clear MMX state (FEMMS is an AMD 3DNow! instruction; use EMMS on CPUs without 3DNow!)
- }
The code above is only worth using when the block being copied is larger than 512 bytes.
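For readers who do not read MMX assembly, here is a rough sketch in plain C of the phases the routine above goes through: an optional leading DWORD to get the destination QWORD-aligned, then whole 64-byte cache lines, then the leftover QWORDs, then a possible final DWORD. The names `block_copy`, `copy_dwords_phased` and `BLOCK_COPY_THRESHOLD_BYTES` are made up for this illustration, and the fixed-size `memcpy` calls only stand in for the unrolled MOVQ sequences. It shows the control flow and the 512-byte cutoff; it makes no performance claim.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff, taken from the 512-byte figure in the post. */
#define BLOCK_COPY_THRESHOLD_BYTES 512u

/* Copy `count` DWORDs from a DWORD-aligned src to a DWORD-aligned dst,
   going through the same phases as the MMX routine (illustration only). */
static void copy_dwords_phased(uint32_t *dst, const uint32_t *src, size_t count)
{
    if (count == 0)
        return;

    /* Phase 1: if dst is not QWORD (8-byte) aligned, copy one DWORD first. */
    if (((uintptr_t)dst & 7u) != 0) {
        *dst++ = *src++;
        count--;
    }

    /* Phase 2: whole 64-byte cache lines, 16 DWORDs each (count >> 4). */
    size_t lines = count >> 4;
    while (lines--) {
        memcpy(dst, src, 64);   /* stands in for the eight MOVQ load/store pairs */
        dst += 16;
        src += 16;
    }

    /* Phase 3: leftover QWORDs; (count & 0xE) is twice their number,
       i.e. the number of DWORDs they occupy. */
    size_t qword_dwords = count & 0xEu;
    memcpy(dst, src, qword_dwords * sizeof(uint32_t));
    dst += qword_dwords;
    src += qword_dwords;

    /* Phase 4: one final DWORD if the remaining count was odd. */
    if (count & 1u)
        *dst = *src;
}

/* Hypothetical entry point: small blocks go straight to memcpy. */
void block_copy(uint32_t *dst, const uint32_t *src, size_t count)
{
    size_t bytes = count * sizeof(uint32_t);
    if (bytes <= BLOCK_COPY_THRESHOLD_BYTES)
        memcpy(dst, src, bytes);
    else
        copy_dwords_phased(dst, src, count);
}
```

The right cutoff depends on the CPU and on how good the library `memcpy` is, so 512 should be treated as a starting point to measure against, not as a constant.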
An even more powerful technique is to use the 128-bit XMM instructions:
- movdqa xmm0, [rdx+r8*8] ; Load
- movntdq [rcx+r8*8], xmm0 ; Store
- movdqa xmm1, [rdx+r8*8+16] ; Load
- movntdq [rcx+r8*8+16], xmm1 ; Store
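The same load/store pairing can be written with SSE2 compiler intrinsics instead of raw assembly, which may be easier to drop into C code. The following is only a sketch under a few assumptions: the function name `stream_copy_aligned` is invented here, both pointers are assumed to be 16-byte aligned (MOVDQA and MOVNTDQ both require that), and the size is assumed to be a multiple of 32 bytes. `_mm_load_si128` corresponds to MOVDQA and `_mm_stream_si128` to MOVNTDQ. On builds without SSE2 the sketch simply falls back to `memcpy`.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2_COPY 1
#endif

/* Hypothetical helper: copies `bytes` bytes with non-temporal stores.
   Assumes dst and src are 16-byte aligned and bytes is a multiple of 32. */
void stream_copy_aligned(void *dst, const void *src, size_t bytes)
{
#ifdef HAVE_SSE2_COPY
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t vectors = bytes / 16;

    for (size_t i = 0; i < vectors; i += 2) {
        __m128i a = _mm_load_si128(s + i);        /* movdqa  xmm0, [src]    */
        _mm_stream_si128(d + i, a);               /* movntdq [dst], xmm0    */
        __m128i b = _mm_load_si128(s + i + 1);    /* movdqa  xmm1, [src+16] */
        _mm_stream_si128(d + i + 1, b);           /* movntdq [dst+16], xmm1 */
    }
    /* Non-temporal stores are weakly ordered; fence before the data is
       handed to other code that expects it to be visible. */
    _mm_sfence();
#else
    memcpy(dst, src, bytes);                      /* portable fallback */
#endif
}
```

Because MOVNTDQ writes around the cache, this style tends to pay off for large blocks that will not be read again soon; for small or soon-to-be-reused buffers an ordinary cached copy is probably the better choice.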