Thread starter: 雨丝风片

[FreeBSD] [Repost] Optimizing the FreeBSD IP and TCP Stack

Posted on 2006-02-27 21:27
Originally posted by 雨丝风片 on 2006-2-27 20:57:


When the author describes the matching computation within a single level, it seems to be exactly the multi-core idea you were talking about:



That is a pretty good idea, and the granularity of the parallelism is rather high.
Doing the parallel processing with the execution units inside a single processor also avoids the cache-coherency problems you get across multiple processors.
Clever, very clever: efficiency gets pushed about as far as it can go without bringing in much extra overhead.

Posted on 2006-02-28 09:22
Originally posted by gvim on 2006-2-27 21:22:
Now it becomes 4 parts processed in parallel, with 8 serial steps inside each part, ...

That statement has it backwards. The author's new routing table is essentially a multi-bit trie; I sketched a simple diagram of how a multi-bit trie works:

[figure: schematic of a 4-level multi-bit trie lookup; the image is not preserved here]
In the figure, the 32-bit IPv4 address is divided into 4 segments of 8 bits each, and the trie correspondingly has 4 levels; the 8 address bits belonging to a level determine which branch is matched within that level. The first level obviously has 2^8 branches, drawn as green squares; this should be what the author calls the "tight linear array". Which of these branches is chosen is determined by bits 0-7 of the address being looked up. Each branch of the first level can point to a blue square in the second level, and each blue square again has 2^8 branches, selected by bits 8-15 of the address, and so on. So a branch in the green square must be settled first before we know which blue square to use; likewise a branch in the blue square must be settled before we know which pink square to use, and so forth. The red and blue arrows in the figure show the paths taken by two different route lookups.

This segment-by-segment, level-by-level handling of the IPv4 address is the hallmark of a trie, and it is clearly serial. The matching inside each square, on the other hand, can exploit parallel execution, because there is no dependency between the comparisons: you just take 8 bits of the IP address and find a match in a 256-entry array, and how exactly that match is computed is up to the author's implementation.
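To make the structure concrete, here is a minimal C sketch of such a 4-level, 8-bit-stride lookup. The names and the full 256-slot node layout are my own guesses for illustration only, not the actual FreeBSD code (a real "tight" array would be compressed, not a naive 256-pointer node):

/*
 * Illustrative multi-bit trie with 8-bit strides: 4 levels cover a 32-bit
 * IPv4 address.  Layout is a guess, not the real implementation.
 */
#include <stdint.h>
#include <stddef.h>

struct mtrie_node {
    struct mtrie_node *child[256];   /* one branch per possible 8-bit value        */
    const void        *nexthop[256]; /* route info attached to this branch, if any */
};

/* Walk the four strides serially: each level must be resolved before the next
 * one is even known.  Within a level, the 8 address bits simply select one
 * entry of the 256-entry ("tight linear") array. */
static const void *
mtrie_lookup(const struct mtrie_node *root, uint32_t dst)
{
    const struct mtrie_node *n = root;
    const void *best = NULL;
    int level;

    for (level = 0; level < 4 && n != NULL; level++) {
        uint8_t bits = (dst >> (24 - 8 * level)) & 0xff; /* bits 0-7, 8-15, ... */

        if (n->nexthop[bits] != NULL)
            best = n->nexthop[bits];  /* keep the most specific match so far       */
        n = n->child[bits];           /* serial: the next square depends on this one */
    }
    return best;
}

The indexing from level to level is the dependent, serial part; whatever happens inside each 256-entry block is where the author gets his instruction-level parallelism.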

Using the basic picture above to interpret the author's description:

"On each lookup it first prefetches the first stride and walks linearly over all array entries at that level, performing the match computation on each of them. On a modern CPU this is quite fast, because it can run in parallel across several integer execution units and all of the data is already in the fast cache. Once a true match is found, it is kept in a local variable."

In other words, whenever a level is reached, the whole square for that level is prefetched into the cache, so the parallel matching inside every level runs entirely out of cache.

"When a match is found in a more specific stride, the whole stride is prefetched and the same computation is performed again at that level."

That is, once the square for the next level has been located, that whole square is fetched in as well...
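My reading of that description, as a rough sketch only (the real entry format and match computation are unknown to me): a stride is scanned with independent comparisons that the CPU can overlap in its integer units, and as soon as the next square is known it is prefetched in turn. __builtin_prefetch() is the GCC hint; everything else here is made up:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical compact stride: a small array of entries scanned linearly. */
struct stride_entry {
    uint8_t                    key;      /* the 8 address bits this entry covers */
    const void                *nexthop;  /* route info for a match               */
    const struct stride_entry *child;    /* next, more specific stride           */
    size_t                     nchild;   /* number of entries in that stride     */
};

static const struct stride_entry *
stride_match(const struct stride_entry *tbl, size_t n, uint8_t bits)
{
    const struct stride_entry *hit = NULL;
    size_t i;

    /* Independent compares with no loop-carried dependency: the CPU is free
     * to overlap several of them, and tbl[] is already cache-hot. */
    for (i = 0; i < n; i++) {
        if (tbl[i].key == bits)
            hit = &tbl[i];
    }

    /* As soon as the next square is known, hint the hardware to pull it in;
     * the real code would presumably touch every cache line the stride spans. */
    if (hit != NULL && hit->child != NULL)
        __builtin_prefetch(hit->child);

    return hit;
}

The caller would then repeat stride_match(hit->child, hit->nchild, next_bits) for the following level, by which time the prefetch has hopefully completed.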

[ Last edited by 雨丝风片 on 2006-3-1 07:49 ]

Posted on 2006-02-28 09:31
I'm not familiar with IPv6 at all, but it seems its addresses are also written in segments; if each level of the trie described above isn't limited to 8 bits, say widened to 16 bits, wouldn't IPv6 compatibility stop being a problem?

Irresponsible guess; don't hit me if I'm wrong.

Posted on 2006-02-28 09:53
Originally posted by colddawn on 2006-2-28 09:31:
I'm not familiar with IPv6 at all, but it seems its addresses are also written in segments; if each level of the trie described above isn't limited to 8 bits, say widened to 16 bits, wouldn't IPv6 compatibility stop being a problem?

Irresponsible guess; don't hit me if I'm wrong.


What I'm worried about isn't compatibility, it's memory. The trie algorithm itself is of course independent of the address length, but the memory the trie occupies depends on it very much. A v4 address is only 32 bits, split into 4 segments with 2^8 possibilities each. A v6 address is 128 bits: how do you split that, and can the space efficiency still be kept that high? The key is just how "tight" the author's arrays really are.

Besides, the author has not proposed replacing the existing routing table outright; he only says the new table will first serve as a fast forwarding table for v4 and then be used in the v4 receive path, while v6 and the other protocols that use the routing table should keep using the existing radix-tree routing table.
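Some back-of-the-envelope numbers on why the stride width matters so much for memory, assuming naive, uncompressed nodes with one pointer-sized slot per entry (the actual "tight" layout is exactly the part we don't know):

/* Worst-case per-node sizes for a naive multi-bit trie, for illustration only. */
#include <stdio.h>

int main(void)
{
    size_t slot = sizeof(void *);   /* assume one pointer-sized slot per entry */

    /* IPv4 with 8-bit strides: 32/8 = 4 levels, 2^8 entries per node. */
    printf("8-bit stride : %6zu entries/node, %8zu bytes/node, 4 levels\n",
        (size_t)1 << 8, ((size_t)1 << 8) * slot);

    /* IPv6 with 16-bit strides: 128/16 = 8 levels, 2^16 entries per node. */
    printf("16-bit stride: %6zu entries/node, %8zu bytes/node, 8 levels\n",
        (size_t)1 << 16, ((size_t)1 << 16) * slot);

    return 0;
}

On a 64-bit machine that is 2 KB per uncompressed node at 8-bit strides versus 512 KB per node at 16-bit strides, before any compression, which is why the "tightness" of those arrays is the interesting question.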

论坛徽章:
0
发表于 2006-02-28 13:27 |显示全部楼层
Some material on prefetch from the Intel manuals:

IA-32 Intel Architecture Software Developer's Manual Vol.3

9.5.4 Cache Management Instructions

The PREFETCHh instructions allow a program to suggest to the processor that a cache line from a specified location in system memory be prefetched into the cache hierarchy.

9.8 EXPLICIT CACHING

The Pentium III processor introduced four new instructions, the PREFETCHh instructions, that provide software with explicit control over the caching of data. These instructions provide "hints" to the processor that the data requested by a PREFETCHh instruction should be read into cache hierarchy now or as soon as possible, in anticipation of its use. The instructions provide different variations of hint that allow selection of the cache level into which data will be read.

The PREFETCHh instructions can help reduce the long latency typically associated with reading data from memory and thus help prevent processor "stalls." However, these instructions should be used judiciously. Overuse can lead to resource conflicts and hence reduce the performance of an application. Also, these instructions should only be used to prefetch data from memory; they should not be used to prefetch instructions.


Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual

Software Data Prefetch

The prefetch instruction can hide the latency of data access in performance-critical sections of application code by allowing data to be fetched in advance of its actual usage. The prefetch instructions do not change the user-visible semantics of a program, although they may affect the program’s performance. The prefetch instructions merely provide a hint to the hardware and generally will not generate exceptions or faults.

The prefetch instructions load either non-temporal data or temporal data in the specified cache level. This data access type and the cache level are specified as a hint.  Depending on the implementation, the instruction fetches 32 or more aligned bytes, including the specified address byte, into the instruction-specified cache levels.

The prefetch instruction is implementation-specific; applications need to be tuned to each implementation to maximize performance.

NOTE. Using the prefetch instructions is recommended only if data does not fit in cache.

The prefetch instructions merely provide a hint to the hardware, and they will not generate exceptions or faults except for a few special cases. However, excessive use of prefetch instructions may waste memory bandwidth and result in performance penalty due to resource constraints.

Nevertheless, the prefetch instructions can lessen the overhead of memory transactions by preventing cache pollution and by using the caches and memory efficiently. This is particularly important for applications that share critical system resources, such as the memory bus.

The prefetch instructions are mainly designed to improve application performance by hiding memory latency in the background. If segments of an application access data in a predictable manner, for example, using arrays with known strides, then they are good candidates for using prefetch to improve performance.  

Use the prefetch instructions in:
    . predictable memory access patterns
    . time-consuming innermost loops
    . locations where the execution pipeline may stall if data is not available.
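A small, made-up illustration of these hints (not from the manual) using the _mm_prefetch() intrinsic, which maps to the PREFETCHh instructions; the loop and the prefetch distance are only an example and would need tuning per implementation, as the manual notes:

#include <xmmintrin.h>   /* _mm_prefetch(), _MM_HINT_T0/T1/T2/NTA */
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* how many elements ahead to prefetch; a tuning knob */

/* Predictable, strided access in an innermost loop: a good prefetch candidate. */
float
sum_array(const float *a, size_t n)
{
    float s = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) {
        /* A hint only: it may be dropped and normally raises no fault. */
        if (i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&a[i + PREFETCH_DISTANCE], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}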

Posted on 2006-03-01 08:42
Here is the author's plan for optimizing the BSD protocol stack, and its funding status, as posted on his home page:
http://people.freebsd.org/~andre/tcpoptimization.html

Andre Oppermann, FreeBSD Committer <*andre*@freebsd.org> (remove asterisks)

15. July 2005 - Update 28. July 2005: Target reached!!!

TCP/IP Cleanup and Optimizations

The TCP code in FreeBSD has evolved significantly since the fork from 4.4BSD-Lite2 in 1994 primarily due to new features and refinements of the TCP specifications.

The TCP code now needs a general overhaul, streamlining and cleanup to make it easily comprehensible, maintainable and extensible again. In addition there are many little optimizations that can be done during such an operation, propelling FreeBSD back at the top of the best performing TCP/IP stacks again, a position it has held for the longest time in the 90's.

This overhaul is a very involved and delicate matter and needs extensive formal and actual testing to ensure no regressions compared to the current code. The effort needed for this work is about three man-month of fully focused and dedicated time. To get it done I need funding to take time off my day job and to dedicate me to FreeBSD work much the way PHK did with his buffer cache and vnode rework projects.

I've got the opportunity to work up to three man-month exclusively full-time on FreeBSD during the second half of 2005. That means up to 720 hours of full-steam coding (at 60 hours/week)! I will work as much as the fundraise provides.

I need to raise enough money for each month from donations from the FreeBSD community to cover my fixed cost of living, office and associated overhead. These fixed cost amount to US$6,300/month (

Posted on 2006-03-01 08:47
Judging from the author's original plan, optimizing the existing routing table and introducing the new routing table are two separate tasks:

Streamlining and refactoring PATRICIA trie routing table

Implementing highly optimized IPv4 multi-bit trie routing table (more than twice as fast)

And from this it is clear that:
    1. The new routing table targets IPv4 only.
    2. The new routing table is indeed a multi-bit trie.

Posted on 2006-03-01 11:31
It's a pity there is no formal documentation or paper to get an early look at.

Posted on 2006-03-02 13:16
Originally posted by gvim on 2006-3-1 11:31:
It's a pity there is no formal documentation or paper to get an early look at.


Until we can see the code we can probe around it indirectly and take the roundabout route:
even though the basic idea is a multi-bit trie, the differences in implementation algorithms and programming tricks could still be enormous...

Posted on 2006-03-03 16:05
That's a lot,