Can anyone explain the basic optimization idea behind this memcpy patch, and the CPU mechanism behind it?
Last edited by nswcfd on 2015-05-29 22:16.
Found an old thread on CSDN about a patch that optimizes memcpy.
CSDN thread: bbs.csdn.net/topics/360040485
patch: patchwork.kernel.org/patch/296282/
It looks like the original r w r w r w r w sequence is rearranged into r r r r w w w w.
Why does that give a 1.5x ~ 2x speedup?
The patch begins with an explanation, but I'm embarrassed to say I couldn't follow it...
A newer read may run ahead of an older write; otherwise it must wait until the write commits. However, the CPU does not compare every address bit, so a read can fail to be recognized as independent of a store even when the two addresses are in different pages. For example, if %rsi is 0xf004 and %rdi is 0xe008, the following sequence incurs a large latency:
1. movq (%rsi), %rax
2. movq %rax, (%rdi)
3. movq 8(%rsi), %rax
4. movq %rax, 8(%rdi)
If %rsi and %rdi really were in the same memory page, there would be a true read-after-write dependence, because instruction 2 writes at 0x008 and instruction 3 reads at 0x00c, so the two accesses partially overlap. In fact they are in different pages and there is no real conflict, but without checking every address bit the CPU may assume they are in the same page, so instruction 3 has to wait for instruction 2 to move its data from the store buffer into the cache, and then load the data back from the cache; the time the read spends is comparable to an mfence instruction. We can avoid this by reordering the operations as follows:
1. movq 8(%rsi), %rax
2. movq %rax, 8(%rdi)
3. movq (%rsi), %rax
4. movq %rax, (%rdi)
Instruction 3 reads 0x004 and instruction 2 writes 0x010, so there is no dependence at all. On Core 2 we gain a 1.83x speedup over the original instruction sequence. In this patch we first handle small sizes (less than 20 bytes), then jump to the appropriate copy mode. Based on our micro-benchmark of small copies from 1 to 127 bytes, we got up to 2x improvement, and up to 1.5x improvement for 1024 bytes, on a Core i7. (We used our own micro-benchmark and will do further testing according to your requirements.)
Here, what does "However CPU don't check each address bit, so read could fail to recognize different address even they are in different page" mean? What is "the CPU doesn't check each address bit" referring to?
Why would instruction 3 still have to wait for instruction 2 when %rsi and %rdi are in different pages? Does the CPU only look at the last 12 bits when checking dependences between instructions?
Reply to #1 nswcfd
You're studying this in real depth! I can barely read assembly. Respect; I have a lot to learn from you.
Last edited by abutter on 2015-05-31 09:46.
Reply to #1 nswcfd
My take: it should be a difference in access attributes; on x86, access attributes can be configured in the page tables.
As for the latency, it works like this: with an R W R sequence, the second R has to wait for the W to complete, unless they are not in the same cache line and there is no address alias; an R R W W ordering avoids this problem.
I don't understand it either; waiting for an expert to explain.
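abutter's R-W-R vs R-R-W-W point can be written out in C, mirroring the patch's two four-instruction sequences. This is illustration only (a real implementation uses assembly precisely because the compiler is free to reorder these statements), and the low-bit addresses in the comments assume the patch's 0xf004/0xe008 example:

```c
#include <stdint.h>

/* Ascending (original) order: with src low bits 0x004 and dst low bits
 * 0x008, the load of src[1] (low bits 0x00c) falls inside the byte range
 * just stored to dst[0] (low bits 0x008..0x00f), so the CPU may predict
 * a dependence and stall the load. */
void copy2_ascending(uint64_t *dst, const uint64_t *src)
{
    uint64_t t;
    t = src[0]; dst[0] = t;
    t = src[1]; dst[1] = t; /* this load can stall on the store above */
}

/* Descending (patched) order: the load of src[0] (low bits 0x004) sits
 * entirely below the store to dst[1] (low bits 0x010..0x017), so the low
 * 12 bits cannot overlap and no false dependence is predicted. */
void copy2_descending(uint64_t *dst, const uint64_t *src)
{
    uint64_t t;
    t = src[1]; dst[1] = t;
    t = src[0]; dst[0] = t;
}
```

Both functions produce identical results for non-overlapping buffers; only the order in which the hardware sees the loads and stores differs.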
Thanks for the reply.
Reply to #3 abutter
May I ask whether x86 CPUs judge data dependency based on virtual addresses or physical addresses?
I found this passage in Intel's optimization manual; take a look:
Memory Disambiguation
A load operation may depend on a preceding store. Many microarchitectures block loads until all
preceding store addresses are known. The memory disambiguator predicts which loads will not depend
on any previous stores. When the disambiguator predicts that a load does not have such a dependency,
the load takes its data from the L1 data cache even when the store address is unknown. This hides the
load latency. Eventually, the prediction is verified. If an actual conflict is detected, the load and all
succeeding instructions are re-executed.
The following loads are not disambiguated. The execution of these loads is stalled until addresses of all
previous stores are known.
• Loads that cross the 16-byte boundary
• 32-byte Intel AVX loads that are not 32-byte aligned.
The memory disambiguator always assumes dependency between loads and earlier stores that have the
same address bits 0:11.
The Intel manual says the L1 data cache is 32KB, 8-way:
Component                Sandy Bridge    Nehalem
Data Cache Unit (DCU)    32KB, 8 ways    32KB, 8 ways
That works out to checking exactly 12 address bits (32K = 2^15, 8 ways = 2^3, 15 - 3 = 12).
Mapping an address to a particular cache set uses the simplest modulo operation.
Reply to #6 hnwyllmm
Great~ Thanks for sharing~~
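hnwyllmm's arithmetic above (32KB / 8 ways = 4KB per way = 2^12 bytes) can be turned into a tiny predicate for the conservative check the disambiguator appears to make. The function name and the range-overlap treatment are my own; real hardware granularity differs, and wrap-around at the 4 KiB boundary is ignored for simplicity:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Conservative "might these accesses conflict?" test using only address
 * bits 0..11, per the Intel manual excerpt above. `width` is the access
 * size in bytes; wrap-around at the 4 KiB boundary is ignored. */
bool may_alias_4k(uintptr_t load_addr, uintptr_t store_addr, size_t width)
{
    uintptr_t l = load_addr & 0xfff;
    uintptr_t s = store_addr & 0xfff;
    return l < s + width && s < l + width; /* byte ranges overlap mod 4K */
}
```

Fed the patch's example with 8-byte accesses: the original sequence's load at 0xf00c against the store at 0xe008 reports a (false) conflict, while the reordered load at 0xf004 against the store at 0xe010 does not.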