Chinaunix

Title: On balancing interrupt load across CPUs in Linux

Author: 思一克    Time: 2007-06-29 11:04
Title: On balancing interrupt load across CPUs in Linux
On balancing interrupt load across CPUs in Linux

See this thread:
http://linux.chinaunix.net/bbs/thread-753474-1-1.html

It reports a serious interrupt imbalance across 4 CPUs. I have no setup to experiment with myself,
and the OP is no longer replying, so anyone interested is invited to join the experiments and discussion.
Author: albcamus    Time: 2007-06-29 11:10
Would adding acpi_irq_balance to the kernel boot parameters help? Let me try it first; the default is not to balance.
Author: 思一克    Time: 2007-06-29 11:19
Good. I think the problem lies in the kernel rather than in NAT (iptables), which is why I posted it here.
Author: albcamus    Time: 2007-06-29 11:24
Originally posted by 思一克 on 2007-6-29 11:19 in reply #3
Good. I think the problem lies in the kernel rather than in NAT (iptables), which is why I posted it here.


No luck.

[root@localhost 21]# cat /proc/cmdline
ro root=LABEL=/ vga=0x31B acpi_irq_balance

[root@localhost 21]# grep eth0 /proc/interrupts
21:      19341          0   IO-APIC-fasteoi   libata, eth0

The eth0 and libata interrupts are still all delivered to CPU 0; CPU 1 has not received a single one.  My machine has a dual-core Pentium D.



[root@localhost 21]# cat /proc/irq/21/smp_affinity
00000003
I wrote this value myself: both CPUs' bits in the bitmask are set to 1 to allow IRQ 21, but it is still only delivered to CPU 0.
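For reference, a quick way to watch whether the per-CPU counters actually move after writing the mask (IRQ 21 is just the number on my box):

echo 3 > /proc/irq/21/smp_affinity
watch -n1 'grep "21:" /proc/interrupts'    # both CPU columns should grow if delivery really spreads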
Author: 思一克    Time: 2007-06-29 11:27
What does cat /proc/interrupts show on your machine?
Author: albcamus    Time: 2007-06-29 11:35
Originally posted by 思一克 on 2007-6-29 11:27 in reply #5
What does cat /proc/interrupts show on your machine?


[root@localhost Documentation]# cat /proc/interrupts
           CPU0       CPU1
  0:        358          0   IO-APIC-edge      timer
  1:          2          0   IO-APIC-edge      i8042
  8:       8099          0   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
12:          4          0   IO-APIC-edge      i8042
14:         33          0   IO-APIC-edge      ide0
16:      88855          0   IO-APIC-fasteoi   HDA Intel, fglrx
17:      22918          0   IO-APIC-fasteoi   uhci_hcd:usb1, ehci_hcd:usb5
18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3, Ensoniq AudioPCI
20:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
21:      32170          0   IO-APIC-fasteoi   libata, eth0
NMI:          0          0
LOC:     234999     203989
ERR:          0
MIS:          0



The rich-poor gap here is pretty severe.
Author: zx_wing    Time: 2007-06-29 11:38
Originally posted by 思一克 on 2007-6-29 11:04 in reply #1
On balancing interrupt load across CPUs in Linux

See this thread:
http://linux.chinaunix.net/bbs/thread-753474-1-1.html

It reports a serious interrupt imbalance across 4 CPUs. I have no setup to experiment with myself,
and the OP is no longer replying, so anyone interested is invited to join the experiments ...

?? I think this is normal. By default Linux lets one CPU own the NIC, so all of your NIC interrupts are sent to that CPU.
It is also easy to change: just modify the APIC's redirection table (RT). I am not sure which interface functions Linux provides for this, but the apic.c file may offer some clues.
Author: albcamus    Time: 2007-06-29 11:39
Documentation/IRQ-affinity.txt

SMP IRQ affinity, started by Ingo Molnar <mingo@redhat.com>


/proc/irq/IRQ#/smp_affinity specifies which target CPUs are permitted
for a given IRQ source. It's a bitmask of allowed CPUs. It's not allowed
to turn off all CPUs, and if an IRQ controller does not support IRQ
affinity then the value will not change from the default 0xffffffff.

Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
the IRQ to CPU4-7 (this is an 8-CPU SMP box):

[root@moon 44]# cat smp_affinity
ffffffff
[root@moon 44]# echo 0f > smp_affinity
[root@moon 44]# cat smp_affinity
0000000f
[root@moon 44]# ping -f h
PING hell (195.4.7.3): 56 data bytes
...
--- hell ping statistics ---
6029 packets transmitted, 6027 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/0.4 ms
[root@moon 44]# cat /proc/interrupts | grep 44:
44:          0       1785       1785       1783       1783          1          1          0   IO-APIC-level  eth1
[root@moon 44]# echo f0 > smp_affinity
[root@moon 44]# ping -f h
PING hell (195.4.7.3): 56 data bytes
..
--- hell ping statistics ---
2779 packets transmitted, 2777 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.5/585.4 ms
[root@moon 44]# cat /proc/interrupts | grep 44:
44:       1068       1785       1785       1784       1784       1069       1070       1069   IO-APIC-level  eth1
[root@moon 44]#



What it describes has no effect at all on my machine: no matter how I set smp_affinity, everything is still delivered to CPU 0.
Author: albcamus    Time: 2007-06-29 11:42
Originally posted by zx_wing on 2007-6-29 11:38 in reply #7

Just modify the APIC's RT table.



You mean the IO-APIC's IRQ routing table, right?  There is nothing for it under /proc; /sys/devices/system/ has ioapic, lapic and irqrouter directories, but they are all empty inside.

My guess is that writing /proc/irq/<number>/smp_affinity is precisely what updates the IO-APIC's RT table, but for some reason it doesn't work on my machine.

[ Last edited by albcamus on 2007-6-29 11:46 ]
Author: zx_wing    Time: 2007-06-29 11:49
Originally posted by albcamus on 2007-6-29 11:42 in reply #9



You mean the IO-APIC's IRQ routing table, right?  There is nothing for it under /proc; /sys/devices/system/ has ioapic, lapic and irqrouter directories, but they are all empty inside.

My guess is that writing /proc/irq/<number>/smp_affinity is precisely what updates the IO-APIC ...

Yes, that table. I don't know what interface Linux offers to change it, but since it shows up under /sys there should be a corresponding driver.
I'm not very familiar with this part of Linux, so I can only offer a few leads.
Author: 思一克    Time: 2007-06-29 12:02
So it really is a problem, then?
Author: augustusqing    Time: 2007-06-29 12:37
A good question indeed!
Watching this closely!
Author: albcamus    Time: 2007-06-29 13:01
Originally posted by zx_wing on 2007-6-29 11:49 in reply #10

Yes, that table. I don't know what interface Linux offers to change it, but since it shows up under /sys there should be a corresponding driver.
I'm not very familiar with this part of Linux, so I can only offer a few leads.


Much appreciated already. I'm reading through io_apic.c to see whether I can find where the problem lies.
Author: wysilly    Time: 2007-06-29 13:20
Take a look at this article for leads, and perhaps an expert can explain it:

http://www.ibm.com/developerwork ... ernelint/index.html

I think the tests should also involve something called irqbalance.

[ Last edited by wysilly on 2007-6-29 13:22 ]
Author: 思一克    Time: 2007-06-29 13:24
The problem in the thread I cited is that right after iptables is restarted the load is balanced, but after a few hours it becomes severely unbalanced.


Originally posted by wysilly on 2007-6-29 13:20 in reply #14
Take a look at this article for leads, and perhaps an expert can explain it:

http://www.ibm.com/developerwork ... ernelint/index.html

I think the tests should also involve something called irqbalance.

Author: wysilly    Time: 2007-06-29 13:29
Right, right. That's exactly the problem, and it is closely tied to interrupts. To understand why it is unbalanced, don't you first need to understand how the balancing works?

With two NICs and two CPUs, each CPU conveniently handles one NIC.

With two NICs and four CPUs it doesn't seem to balance well, so you again have to look at how irqbalance is implemented.
Author: 思一克    Time: 2007-06-29 13:36
It doesn't look like one CPU per NIC.

Look: two NICs, two CPUs, eth1 has no cable plugged in, and eth0 is still basically balanced.


]# cat /proc/interrupts
           CPU0       CPU1
  0: 1218847937         40    IO-APIC-edge  timer
  2:          0          0          XT-PIC  cascade
  8:          0          1    IO-APIC-edge  rtc
14:         66          1    IO-APIC-edge  ide0
16:          0          0   IO-APIC-level  uhci_hcd
18:          0          0   IO-APIC-level  uhci_hcd
19:          0          0   IO-APIC-level  uhci_hcd
23:         13          1   IO-APIC-level  ehci_hcd
26:    9935950          1   IO-APIC-level  ioc0
48:   12783533   46127059   IO-APIC-level  eth0
49:    4903874          1   IO-APIC-level  eth1
NMI:          0          0
LOC: 1065696911 1065696910
ERR:          0
MIS:          0





Originally posted by wysilly on 2007-6-29 13:29 in reply #16
Right, right. That's exactly the problem, and it is closely tied to interrupts. To understand why it is unbalanced, don't you first need to understand how the balancing works?

With two NICs and two CPUs, each CPU conveniently handles one NIC.

With two NICs and four CPUs it doesn't seem to balance well, so you again have to look at how irqbalance is implemented.

Author: wysilly    Time: 2007-06-29 13:44
I mean you can bind one CPU to one NIC, like this:
           CPU0       CPU1      
169:  645187653          0   IO-APIC-level  eth1
177:       1186   34171661   IO-APIC-level  eth2
225: 3552116787 3976669860   IO-APIC-level  uhci_hcd:usb4, eth0
Three NICs, two CPUs: I let irqbalance balance eth0 automatically.
eth1 is bound to CPU0 and eth2 to CPU1. Pinning them like this is more efficient; letting irqbalance balance them automatically is a bit less efficient.
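A minimal sketch of the binding itself, using the IRQ numbers from my /proc/interrupts above (adjust them to your machine, and make sure irqbalance is not rewriting these masks):

echo 1 > /proc/irq/169/smp_affinity    # eth1 -> CPU0 (mask 0x1)
echo 2 > /proc/irq/177/smp_affinity    # eth2 -> CPU1 (mask 0x2)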
Author: scutan    Time: 2007-06-29 18:54
Originally posted by 思一克 on 2007-6-29 11:04 in reply #1
On balancing interrupt load across CPUs in Linux

See this thread:
http://linux.chinaunix.net/bbs/thread-753474-1-1.html

It reports a serious interrupt imbalance across 4 CPUs. I have no setup to experiment with myself,
and the OP is no longer replying, so anyone interested is invited to join the experiments ...



Load imbalance on SMP multiprocessor machines is a fairly common phenomenon. I have run into it too, and tried a few approaches, but without success.

I posted about this before. My analysis at the time was as follows:

Although irq_balance() can even out the number of interrupts across CPUs, it still cannot fully solve the CPU load-imbalance problem.
My take on the reason: the hard interrupt is delivered to one CPU, and the bottom half of a network interrupt is implemented as a softirq. Softirqs have CPU affinity: whichever CPU the NIC's hard interrupt fires on is also the CPU that runs the softirq that follows. So on a gigabit network the CPU that takes the interrupts becomes extremely busy while the other CPU stays relatively idle, and you end up with a load imbalance.
irq_balance() can only balance the interrupt counts on each CPU; it cannot balance the CPUs' actual load.
The process-level load_balance() does not solve it well either: the softirq runs in a kernel thread, and load_balance() pulls processes from a busy CPU over to the current CPU, but it cannot pull over a softirq that is already executing.

I stopped working in this area afterwards, so I haven't kept up with it. There is, however, a paper on this topic. It's fairly old, but may still be useful.
I'm sharing it here in the hope that it helps.

Linux SMP网络体系性能分析.pdf

263.79 KB, downloads: 872


Author: scutan    Time: 2007-06-29 19:29
Originally posted by albcamus on 2007-6-29 11:24 in reply #4


No luck.

# cat /proc/cmdline
ro root=LABEL=/ vga=0x31B acpi_irq_balance

# grep eth0 /proc/interrupts
21:      19341          0   IO-APIC-fasteoi   libata, eth0

The eth0 and libata interrupts are still all ...



I wrote it the same way you did, but a little while later it went back to the previous state.
It really frustrated me at the time...
Author: 思一克    Time: 2007-06-29 19:33
First, run irqbalance and make sure the interrupts are distributed roughly evenly across CPUs (check with cat /proc/interrupts); only then look into balancing the load.
Author: scutan    Time: 2007-06-29 19:52
Originally posted by 思一克 on 2007-6-29 19:33 in reply #21
First, run irqbalance and make sure the interrupts are distributed roughly evenly across CPUs (check with cat /proc/interrupts); only then look into balancing the load.



I manually set the value in /proc/irq/<ID>/smp_affinity to all 1s, and it still didn't work.
I also noticed that, apart from a few IRQ numbers, the smp_affinity files under /proc/irq/ all have only a single bit set to 1; every other bit is 0.


           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:   23432940   23447916   23446701   23445441   23445463   23445286   23445460   23445256    IO-APIC-edge  timer
  8:          0          0          1          0          0          0          1          1    IO-APIC-edge  rtc
  9:          0          0          0          0          0          0          0          0   IO-APIC-level  acpi
 14:          0          0          0          0          0          0          0          0    IO-APIC-edge  libata
 15:         18    3358877         19     839610     839918         21         24    1679740    IO-APIC-edge  ide1
 98:         15         18         21         11          5         15         15         18   IO-APIC-level  uhci_hcd:usb1, uhci_hcd:usb3, ehci_hcd:usb5
106:          0          0          0          0          0          0          0          0   IO-APIC-level  uhci_hcd:usb2, uhci_hcd:usb4
122:    2573686          0          0          0          0          0          0          0        PCI-MSI  eth1
177:       3662      50581       1355      58219      45692       1435       1328      24886   IO-APIC-level  aacraid
NMI:          0          0          0          0          0          0          0          0
LOC:  187551733  187551731  187555704  187555703  187554928  187554927  187555782  187555658
ERR:          0
MIS:          0


And at the same time the ksoftirqd thread was running on every CPU, yet some interrupts still could not be distributed evenly.
Author: 思一克    Time: 2007-06-29 21:11
So judging from your /proc/interrupts, it isn't balanced either?
Check the value of the kernel variable irqbalance_disabled: is it 0?


Originally posted by scutan on 2007-6-29 19:52 in reply #22



I manually set the value in /proc/irq/<ID>/smp_affinity to all 1s, and it still didn't work.
I also noticed that, apart from a few IRQ numbers, the smp_affinity files under /proc/irq/ all have only a single bit set to 1; every other bit is 0.



...

Author: 思一克    Time: 2007-06-29 21:19
Also, what is the value of physical_balance?

You can run gdb /boot/vmlinux /proc/kcore
and print the kernel variables to check.
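Something like the following, assuming a vmlinux image that still carries its symbols (a stripped or compressed boot image will just report "No symbol table is loaded"):

gdb /boot/vmlinux /proc/kcore
(gdb) print irqbalance_disabled
(gdb) print physical_balance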
Author: scutan    Time: 2007-06-30 14:15
Originally posted by 思一克 on 2007-6-29 21:19 in reply #24
Also, what is the value of physical_balance?

You can run gdb /boot/vmlinux /proc/kcore
and print the kernel variables to check.



In gdb /boot/vmlinux,
print physical_balance gave me the message below, as if the variable doesn't exist. The other variable is the same. Any advice would be appreciated.
No symbol table is loaded.  Use the "file" command.
Author: shdnzwy    Time: 2007-06-30 22:28
Impressive people here... learning a lot.
Author: 思一克    Time: 2007-07-02 09:51
to albcamus,

Did your IRQ (NIC) balancing experiments show any effect?
Author: 5iwww    Time: 2007-07-02 10:45
Originally posted by wysilly on 2007-6-29 13:44 in reply #18
I mean you can bind one CPU to one NIC, like this:
           CPU0       CPU1      
169:  645187653          0   IO-APIC-level  eth1
177:       1186   34171661   IO-APIC-level  eth2
225: 3552116787 397666 ...



How do you actually bind one CPU to one NIC? Please explain.

66: 3274166999          0          0          0         PCI-MSI  eth0
74: 3156380137          0          0          0         PCI-MSI  eth1

uname
Linux cn-pek1-gateway 2.6.18 #1 SMP Fri Jan 5 18:55:35 CST 2007 i686 i686 i386 GNU/Linux
Author: 思一克    Time: 2007-07-11 17:25
Preliminary conclusion:

An SMP machine cannot spread the kernel-side load of the network code evenly across multiple CPUs at every moment.

The best you can do is bind each NIC to one CPU. If there are more CPUs than NICs, the remaining two CPUs will sit idle.
Author: 思一克    Time: 2007-07-12 15:54
Progress report:

On Linux 2.6.13 SMP I've managed to make a single NIC's IRQ switch between the 2 CPUs in a balanced way; with more CPUs it works the same, always picking the least-loaded one. Multiple NICs should also work.

The modified file is arch/i386/kernel/io_apic.c.

The user-space irqbalance is not needed.
In fact, with 2 NICs and 2 CPUs the existing code can already spread them over the 2 CPUs, but the switching period is very long.

Although the interrupts can be distributed,
the network load cannot make direct use of SMP: at any given moment one NIC's interrupt still runs on one CPU. SMP should still be faster, though, because the CPU the other NIC is not currently using can take care of user programs, which indirectly improves speed.

Once my experiments are done I'll post the changes to io_apic.c.
Author: platinum    Time: 2007-07-12 20:20
seeker, you are really something~!
Author: nnnqpnnn    Time: 2007-07-12 21:14
You're all impressive. I can only watch and learn.
Author: 思一克    Time: 2007-07-13 14:28
Below is the patch to arch/i386/kernel/io_apic.c. It balances very well with 1 NIC / 2 CPUs and with 2 NICs / 2 CPUs. It should also balance well with more CPUs, but I haven't tested that.



  1. --- io_apic.c        2007-07-13 13:24:57.000000000 +0800
  2. +++ io_apic.c      2007-07-13 14:24:15.000000000 +0800
  3. @@ -46,6 +46,23 @@

  4. #include "io_ports.h"

  5. +
  6. +#define SEEKER_BALANCE_NETWORK_IRQ
  7. +/* network irq balancer testing version, by seeker. 2007.07.13
  8. + * tested on 1 NIC with 2 cpus, 2 NICs with 2 cpus. it should be working on more NICs and more CPUs
  9. + *
  10. + * tested on
  11. + * Linux yelinux 2.6.13-15-johnye #16 SMP Thu Jul 12 13:05:37 CST 2007 i686 athlon i386 GNU/Linux
  12. + *
  13. + * you are welcome to do more testing on a linux box with 3 or more cpus,
  14. + *
  15. + * you should apply this patch to testing server instead of to production server.
  16. + *
  17. + * user mode irqbalance is not needed. please don't run it on kernel with this patch.
  18. + *
  19. + */
  20. +
  21. +
  22. int (*ioapic_renumber_irq)(int ioapic, int irq);
  23. atomic_t irq_mis_count;

  24. @@ -294,6 +311,16 @@

  25. static long balanced_irq_interval = MAX_BALANCED_IRQ_INTERVAL;

  26. +#ifdef SEEKER_BALANCE_NETWORK_IRQ
  27. +struct {
  28. +       char irq;
  29. +       unsigned char count;
  30. +} cpuinfo[NR_CPUS];
  31. +
  32. +unsigned char wait[NR_IRQS];
  33. +unsigned char isnet[NR_IRQS];
  34. +#endif
  35. +
  36. static unsigned long move(int curr_cpu, cpumask_t allowed_mask,
  37.                         unsigned long now, int direction)
  38. {
  39. @@ -336,6 +363,20 @@
  40.                 irq_desc_t *desc = irq_desc + irq;
  41.                 unsigned long flags;

  42. +#ifdef SEEKER_BALANCE_NETWORK_IRQ
  43. +               if(isnet[irq]) {
  44. +                   if(cpuinfo[new_cpu].count > 0) {
  45. +                       //printk("-------- .\n");
  46. +                       return;
  47. +                   }
  48. +                   //printk("ye1: old %d new %d:  LAST_CPU_IRQ(new_cpu, irq) %d\n", cpu, new_cpu, LAST_CPU_IRQ(new_cpu, irq));
  49. +                   cpuinfo[cpu].irq = 0;  //old cpu
  50. +                   cpuinfo[cpu].count = 0;
  51. +                   cpuinfo[new_cpu].irq = irq;
  52. +                   cpuinfo[new_cpu].count++;
  53. +               }
  54. +               //printk("PEND: irq %d mask %p\n", irq, cpumask_of_cpu(new_cpu));
  55. +#endif
  56.                 spin_lock_irqsave(&desc->lock, flags);
  57.                 pending_irq_balance_cpumask[irq] = cpumask_of_cpu(new_cpu);
  58.                 spin_unlock_irqrestore(&desc->lock, flags);
  59. @@ -354,6 +395,14 @@
  60.                         if (IRQ_DELTA(CPU_TO_PACKAGEINDEX(i),j) <
  61.                                                 useful_load_threshold)
  62.                                 continue;
  63. +
  64. +#ifdef SEEKER_BALANCE_NETWORK_IRQ
  65. +                       if(isnet[j]) {
  66. +                               printk("not for net irq %d...\n", j);
  67. +                               continue;
  68. +                       }
  69. +#endif
  70. +
  71.                         balance_irq(i, j);
  72.                 }
  73.         }
  74. @@ -362,6 +411,42 @@
  75.         return;
  76. }

  77. +
  78. +#ifdef SEEKER_BALANCE_NETWORK_IRQ
  79. +void net_balance(int irq)
  80. +{
  81. +int cpu, ncpus;
  82. +unsigned long max;
  83. +int max_cpu;
  84. +
  85. +       //if(++wait[irq] < 32) return;
  86. +       wait[irq] = 0;
  87. +
  88. +       if(!isnet[irq]) return;
  89. +
  90. +       max = ncpus = 0;
  91. +       max_cpu = -1;
  92. +       for(cpu = 0; cpu < NR_CPUS; cpu++) {
  93. +               if(!cpu_online(cpu)) continue;
  94. +               ncpus++;
  95. +               //printk("YE cpu %d irq %d  ", cpu, LAST_CPU_IRQ(cpu, irq));
  96. +               //printk("IRQ_DELTA(irq %d cpu %d) %d\n", irq, cpu, IRQ_DELTA(cpu, irq));
  97. +               //
  98. +               if(max < LAST_CPU_IRQ(cpu, irq)) {
  99. +                       max = LAST_CPU_IRQ(cpu, irq);
  100. +                       max_cpu = cpu;
  101. +               }
  102. +       }
  103. +       if(ncpus < 2) return;
  104. +
  105. +        //printk("YE irq %d max_cpu %d. irq %d count %d\n", irq, max_cpu, cpuinfo[max_cpu].irq, cpuinfo[max_cpu].count);
  106. +       if(max_cpu >= 0) {
  107. +               balance_irq(max_cpu, irq);
  108. +       }
  109. +}
  110. +#endif
  111. +
  112. +
  113. static void do_irq_balance(void)
  114. {
  115.         int i, j;
  116. @@ -376,6 +461,13 @@
  117.         unsigned long imbalance = 0;
  118.         cpumask_t allowed_mask, target_cpu_mask, tmp;

  119. +
  120. +#ifdef SEEKER_BALANCE_NETWORK_IRQ
  121. +       if(irqbalance_disabled)
  122. +              printk("SEEKER: irqblance_disabled %d  physical_balance %d, %d\n", irqbalance_disabled, physical_balance, NO_BALANCE_IRQ);
  123. +       //irqbalance_disabled = 0;  //JOHNYE
  124. +#endif
  125. +
  126.         for (i = 0; i < NR_CPUS; i++) {
  127.                 int package_index;
  128.                 CPU_IRQ(i) = 0;
  129. @@ -387,6 +479,14 @@
  130.                         /* Is this an active IRQ? */
  131.                         if (!irq_desc[j].action)
  132.                                 continue;
  133. +
  134. +#ifdef SEEKER_BALANCE_NETWORK_IRQ
  135. +                       if(!strncmp(irq_desc[j].action->name, "eth", 3)) {  //network
  136. +                               isnet[j] = 1;
  137. +                               net_balance(j);
  138. +                       }
  139. +#endif
  140. +
  141.                         if ( package_index == i )
  142.                                 IRQ_DELTA(package_index,j) = 0;
  143.                         /* Determine the total count per processor per IRQ */

Author: 思一克    Time: 2007-07-13 14:43
There are still problems and things to improve.

With 2 NICs and 2 CPUs: because I want to guarantee that at no moment are two NIC interrupts on the same CPU, and the two NICs' interrupt rates differ, the balance drifts away from perfect after a while.

This is not a bug in the code but a property of the algorithm.

If several NIC interrupts are occasionally allowed on the same CPU, then it stays perfectly balanced.

What do you all think is the right trade-off?


yelinux:/home/linux/debug/_SOFTIRQ # cat /proc/interrupts
           CPU0       CPU1
  0:    4383309     691086          XT-PIC  timer
  8:          1          1    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
16:     264724      96705   IO-APIC-level  libata
17:          0          0   IO-APIC-level  libata
18:          0          0   IO-APIC-level  ohci_hcd:usb1
19:          0          0   IO-APIC-level  ehci_hcd:usb2
20:      19142      20688   IO-APIC-level  eth2
21:     521682     561369   IO-APIC-level  eth0
NMI:          0          0
LOC:    5073684    5073391
ERR:          0
MIS:          0
Author: wheelz    Time: 2007-07-15 14:37
Title: Reply to #30, 思一克's post
I don't think interrupts should be blindly balanced across CPUs, because a lot of hot-cache issues are involved, for example the conntrack entry cache. Blindly spreading interrupts over different CPUs doesn't necessarily give better efficiency.
For TCP traffic there is also the question of segment ordering; it easily leads to out-of-order packets.
Author: 思一克    Time: 2007-07-15 16:56
TO wheelz,

My view:
1) First, the interrupts can certainly be spread evenly over the CPUs, so that the total NIC interrupt count on each CPU stays balanced.
2) The balancing action should not be too frequent: say once in a long while, or only after a large imbalance has built up.
3) If the number of active NICs is <= the number of CPUs, ensure that at no moment are two interrupts on the same CPU.

Given the above:
1) The CPU load can be balanced. On a NAT box, for example, the main load is in interrupt context, so once the interrupts are balanced the load is balanced.
2) The hot-cache concern can be neglected, because the balancing happens infrequently.

Balancing the load across CPUs will not necessarily raise throughput directly (compared with one CPU per NIC), because, as I analyzed above, at any given moment one interrupt is still handled by one CPU. That is, the network code (say, one iptables matching pass) is inherently sequential and cannot be directly parallelized.

But considering the competition with user programs, there may be an indirect gain.

And after balancing there should be no loss of efficiency either.

The specifics still need experiments to settle.

I am still improving the code, but my test environment is limited.




Originally posted by wheelz on 2007-7-15 14:37
I don't think interrupts should be blindly balanced across CPUs, because a lot of hot-cache issues are involved, for example the conntrack entry cache. Blindly spreading interrupts over different CPUs doesn't necessarily give better efficiency.
For TCP traffic there is also the question of segment ordering; it easily ...

Author: albcamus    Time: 2007-07-16 09:30
Reading the code over the weekend, I noticed that the Linux kernel does not update the local APIC's TPR register on each task switch, which is exactly what the IA-32 manual expects; the manual says that otherwise interrupts may all end up delivered to the same CPU.

Could that be the problem? I think this issue is "interrupt-related" rather than "network-related".

http://www.cs.helsinki.fi/linux/linux-kernel/2002-10/1392.html

[ Last edited by albcamus on 2007-7-16 11:44 ]
Author: 思一克    Time: 2007-07-16 16:14
I'm not familiar with the APIC issue you mention.

The CPU load imbalance is an interrupt problem, not a problem of the network or of related programs (such as iptables).
Once the interrupts are balanced, the CPU load is balanced.

Originally posted by albcamus on 2007-7-16 09:30
Reading the code over the weekend, I noticed that the Linux kernel does not update the local APIC's TPR register on each task switch, which is exactly what the IA-32 manual expects; the manual says that otherwise interrupts may all end up delivered to the same CPU.

Could that be the problem? I think this issue is "interrupt-rela ...

Author: albcamus    Time: 2007-07-16 16:46
Originally posted by 思一克 on 2007-7-16 16:14
I'm not familiar with the APIC issue you mention.

The CPU load imbalance is an interrupt problem, not a problem of the network or of related programs (such as iptables).
Once the interrupts are balanced, the CPU load is balanced.



Now I'm really confused. I just read the 2.6.20 code: the local APIC's task priority register (TPR), arbitration priority register (APR), and processor priority register (PPR), Linux uses none of the three!  I can't work out how Linux does IRQ routing at all now.  Could it be related to the generic IRQ patch?

Let me look at how the 2.6.18 code does it. (Update: it has nothing to do with the generic IRQ patch; even on 2.6.12, Linux still does not use these registers.)

[ Last edited by albcamus on 2007-7-16 17:17 ]
Author: 思一克    Time: 2007-07-17 12:02
Mine is 2.6.13.

Only eth0 and eth2 do their own balancing; none of the other interrupts were touched. As you can see, those were already being distributed before, rather than all going to CPU0.


cat /proc/interrupts
           CPU0       CPU1
  0:   33584050     149794          XT-PIC  timer
  8:          1          1    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
16:     279433     873456   IO-APIC-level  libata
17:          0          0   IO-APIC-level  libata
18:          0          0   IO-APIC-level  ehci_hcd:usb1
19:          0          0   IO-APIC-level  ohci_hcd:usb2
20:    1097617    1156171   IO-APIC-level  eth0
21:      42546      12937   IO-APIC-level  eth2
NMI:          0          0
LOC:   33731906   33732698
ERR:          0
MIS:          0
Author: wheel    Time: 2007-07-17 14:21
Isn't irqbalance running?
On my machine with 3 NICs and 2 cores the interrupts are split roughly evenly..

[ Last edited by wheel on 2007-7-17 14:27 ]
Author: 思一克    Time: 2007-07-17 14:37
With 4 CPUs and 2 NICs, irqbalance doesn't seem to manage it.

See this thread; their test results say it doesn't work.
http://linux.chinaunix.net/bbs/v ... p%3Bfilter%3Ddigest

Originally posted by wheel on 2007-7-17 14:21
Isn't irqbalance running?
On my machine with 3 NICs and 2 cores the interrupts are split roughly evenly..

Author: albcamus    Time: 2007-07-17 14:37
Originally posted by wheel on 2007-7-17 14:21
Isn't irqbalance running?
On my machine with 3 NICs and 2 cores the interrupts are split roughly evenly..


It probably depends on the machine and the kernel version; starting irqbalance here makes no difference for me either.

But it's also unlikely that Linux's IRQ routing is this badly broken in general; it surely works fine on most machines.
Author: wheel    Time: 2007-07-18 13:07
[root@localhost 83627]# cat /proc/version
Linux version 2.6.22 (cqs@localhost.localdomain) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-53)) #1 SMP Tue Jul 17 21:52:55 CST 2007
Tested again: still 2 CPUs plus 3 NICs. Under 2.6.22 it's OK; with 2.6.20 it didn't work..
Author: AIXHP    Time: 2007-07-18 16:00
Originally posted by scutan on 2007-6-29 18:54



Load imbalance on SMP multiprocessor machines is a fairly common phenomenon. I have run into it too, and tried a few approaches, but without success.

I posted about this before. My analysis at the time was as follows:

Although irq_balance() can even out the number of interrupts acr ...

After an SMP system is initialized no CPU is special; they all run in the same address space. Is there something particular about Intel SMP hardware that makes different CPUs respond to interrupts differently?
Author: wheel    Time: 2007-07-19 09:53
http://people.redhat.com/mingo/cfs-scheduler/
Rebuild the kernel with sched-cfs-v2.6.22.1-v19.patch applied and you'll find things are much better.
Author: albcamus    Time: 2007-07-19 10:31
Originally posted by AIXHP on 2007-7-18 16:00

After an SMP system is initialized no CPU is special; they all run in the same address space. Is there something particular about Intel SMP hardware that makes different CPUs respond to interrupts differently?


What Intel expects of OS authors and what Linux actually implements differ in quite a few places.  Take interrupt delivery, for example: Intel expects every interrupt to have a priority, computed as

priority = vector / 16

Linux, however, does not use this: under Linux an interrupt has no priority attribute.  What Intel envisions is that each CPU's local APIC has a TPR register holding the priority of the task currently running on it (updated on every task switch), and an interrupt is delivered to a CPU only if the interrupt's priority is higher than that CPU's TPR value; in other words, the CPU running the lowest-priority task handles the interrupt.   If N > 1 CPUs tie for the lowest TPR, bus arbitration round-robins the interrupts among them.  But Linux does not do this, for the reasons explained in the link I posted above.
Author: 思一克    Time: 2007-09-17 10:18
I retract the conclusion I drew in this thread.

It is gradually becoming clear that the cause is not what I said. I'm still investigating and will report back once I have a conclusion.

[ Last edited by 思一克 on 2007-9-17 11:34 ]
Author: Solaris12    Time: 2007-09-17 10:58
Originally posted by 思一克 on 2007-9-17 10:18
I spent a solid week studying this.

I reached a rather unwelcome conclusion: the low-level network code (including iptables) fundamentally cannot take advantage of SMP.

1) With irqbalance, the interrupts can indeed be balanced across the CPUs (say, adjusted every so often), but at any moment they run on only one CPU. The CPU loa ...


If the balancing could be keyed on the IP address, or on the TCP port, would that work better?
Author: 思一克    Time: 2007-09-20 22:08
The patch is out. It targets 2.6.13-15-smp.

Save the code below to a file named seeker, put it in the root of the Linux source tree, then run patch -p1 < seeker,
rebuild the kernel and boot it.

My preliminary test measured network download speed.
On a dual-CPU machine I put 2400 IP-and-port match rules on the iptables INPUT chain (deliberately simulating
a heavy load). With the patch, the download speed can be twice what it is without it, because both CPUs are used.

This may turn out to be the best solution. With 4 CPUs it might raise the load capacity 4x (at least 2x).
irqbalance is no longer needed.

My NAT testing is very incomplete, limited by my environment.

Testing is welcome.

Later I will also provide a module version, so it can be tested without rebuilding the kernel.
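For anyone who wants to reproduce the test, here is a rough sketch of the steps described above; the rule count, addresses and paths are only illustrative:

cd /usr/src/linux                      # kernel 2.6.13-15 source tree
patch -p1 < seeker                     # apply the patch below
make && make modules_install && make install

# simulate a heavy softirq load with ~2400 dummy INPUT rules
for i in $(seq 1 2400); do
    iptables -A INPUT -s 10.$((i / 250)).$((i % 250)).1 -p tcp --dport $((10000 + i)) -j ACCEPT
done

cat /proc/sys/net/bs_enable            # switch added by the patch: 1 = on, 0 = off
cat /proc/sys/net/bs_status            # per-CPU counters added by the patch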





  1. --- old/net/ipv4/ip_input.c 2007-09-20 20:50:31.000000000 +0800
  2. +++ new/net/ipv4/ip_input.c 2007-09-21 05:52:40.000000000 +0800
  3. @@ -362,6 +362,198 @@
  4.          return NET_RX_DROP;
  5. }

  6. +
  7. +#define CONFIG_BOTTOM_SOFTIRQ_SMP
  8. +#define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
  9. +
  10. +#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP
  11. +
  12. +/*
  13. + *
  14. +Bottom Softirq Implementation. John Ye, 2007.08.27
  15. +
  16. +Why this patch:
  17. +Make kernel be able to concurrently execute softirq's net code on SMP system.
  18. +Takes full advantages of SMP to handle more packets and greatly raises NIC throughput.
  19. +The current kernel's net packet processing logic is:
  20. +1) The CPU which handles a hardirq must be executing its related softirq.
  21. +2) One softirq instance(irqs handled by 1 CPU) can't be executed on more than 2 CPUs
  22. +at the same time.
  23. +The limitation make kernel network be hard to take the advantages of SMP.
  24. +
  25. +How this patch:
  26. +It splits the current softirq code into 2 parts: the cpu-sensitive top half,
  27. +and the cpu-insensitive bottom half, then make bottom half(calld BS) be
  28. +executed on SMP concurrently.
  29. +The two parts are not equal in terms of size and load. Top part has constant code
  30. +size(mainly, in net/core/dev.c and NIC drivers), while bottom part involves
  31. +netfilter(iptables) whose load varies very much. An iptalbes with 1000 rules to match
  32. +will make the bottom part's load be very high. So, if the bottom part softirq
  33. +can be randomly distributed to processors and run concurrently on them, the network will
  34. +gain much more packet handling capacity, network throughput will be be increased
  35. +remarkably.
  36. +
  37. +Where useful:
  38. +It's useful on SMP machines that meet the following 2 conditions:
  39. +1) have high kernel network load, for example, running iptables with thousands of rules, etc).
  40. +2) have more CPUs than active NICs, e.g. a 4 CPUs machine with 2 NICs).
  41. +On these system, with the increase of softirq load, some CPUs will be idle
  42. +while others(number is equal to # of NIC) keeps busy.
  43. +IRQBALANCE will help, but it only shifts IRQ among CPUS, makes no softirq concurrency.
  44. +Balancing the load of each cpus will not remarkably increase network speed.
  45. +
  46. +Where NOT useful:
  47. +If the bottom half of softirq is too small(without running iptables), or the network
  48. +is too idle, BS patch will not be seen to have visible effect. But It has no
  49. +negative affect either.
  50. +User can turn on/off BS functionality by /proc/sys/net/bs_enable switch.
  51. +
  52. +How to test:
  53. +On a linux box, run iptables, add 2000 rules to table filter & table nat to simulate huge
  54. +softirq load. Then, open 20 ftp sessions to download big file. On another machine(who
  55. +use this test machine as gateway), open 20 more ftp download sessions. Compare the speed,
  56. +without BS enabled, and with BS enabled.
  57. +cat /proc/sys/net/bs_enable. this is a switch to turn on/off BS
  58. +cat /proc/sys/net/bs_status. this shows the usage of each CPUs
  59. +Test shown that when bottom softirq load is high, the network throughput can be nearly
  60. +doubled on 2 CPUs machine. hopefully it may be quadrupled on a 4 cpus linux box.
  61. +
  62. +Bugs:
  63. +It will NOT allow hotpug CPU.
  64. +It only allows incremental CPUs ids, starting from 0 to num_online_cpus().
  65. +for example, 0,1,2,3 is OK. 0,1,8,9 is KO.
  66. +
  67. +Some considerations in the future:
  68. +1) With BS patch, the irq balance code on arch/i386/kernel/io_apic.c seems no need any more,
  69. +at least not for network irq.
  70. +2) Softirq load will become very small. It only run the top half of old softirq, which
  71. +is much less expensive than bottom half---the netfilter program.
  72. +To let top softirq process more packets, cant these 3 network parameters be enlarged?
  73. +extern int netdev_max_backlog = 1000;
  74. +extern int netdev_budget = 300;
  75. +extern int weight_p = 64;
  76. +3) Now, BS are running on built-in keventd thread, we can create new workqueues to let it run on?
  77. +
  78. +Signed-off-by: John Ye (Seeker) <[email]johny@webizmail.com[/email]>
  79. + *
  80. + */
  81. +
  82. +#define BS_USE_PERCPU_DATA
  83. +
  84. +struct cpu_stat {
  85. + unsigned long irqs; //total irqs
  86. + unsigned long dids; //I did,
  87. + unsigned long others;
  88. + unsigned long works;
  89. +};
  90. +#define BS_CPU_STAT_DEFINED
  91. +
  92. +static int nr_cpus = 0;
  93. +
  94. +#ifdef BS_USE_PERCPU_DATA
  95. +static DEFINE_PER_CPU(struct sk_buff_head, bs_cpu_queues); // cacheline_aligned_in_smp;
  96. +static DEFINE_PER_CPU(struct work_struct, bs_works);
  97. +struct cpu_stat bs_cpu_status[NR_CPUS];
  98. +#else
  99. +#define NR_CPUS  8
  100. +static struct sk_buff_head bs_cpu_queues[NR_CPUS];
  101. +static struct work_struct  bs_works[NR_CPUS];
  102. +static struct cpu_stat    bs_cpu_status[NR_CPUS];
  103. +#endif
  104. +
  105. +int bs_enable = 1;
  106. +
  107. +static int ip_rcv1(struct sk_buff *skb, struct net_device *dev)
  108. +{
  109. + return NF_HOOK_COND(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish, nf_hook_input_cond(skb));
  110. +}
  111. +
  112. +
  113. +static void bs_func(void *data)
  114. +{
  115. + int  flags, num, cpu;
  116. + struct sk_buff *skb, *last;
  117. + struct work_struct *bs_works;
  118. + struct sk_buff_head *q;
  119. + cpu = smp_processor_id();
  120. +
  121. +
  122. +#ifdef BS_USE_PERCPU_DATA
  123. + bs_works = &per_cpu(bs_works, cpu);
  124. + q = &per_cpu(bs_cpu_queues, cpu);
  125. +#else
  126. + bs_works = &bs_works[cpu];
  127. + q = &bs_cpu_queues[cpu];
  128. +#endif
  129. +
  130. + local_bh_disable();
  131. +restart:
  132. + num = 0;
  133. + while(1) {
  134. + last = skb;
  135. + spin_lock_irqsave(&q->lock, flags);
  136. +         skb = __skb_dequeue(q);
  137. + spin_unlock_irqrestore(&q->lock, flags);
  138. + if(!skb) break;
  139. + num++;
  140. + //local_bh_disable();
  141. +         ip_rcv1(skb, skb->dev);
  142. + //__local_bh_enable(); //sub_preempt_count(SOFTIRQ_OFFSET - 1);
  143. + }
  144. +
  145. + bs_cpu_status[cpu].others += num;
  146. + if(num > 0) { goto restart; }
  147. +
  148. + __local_bh_enable(); //sub_preempt_count(SOFTIRQ_OFFSET - 1);
  149. + bs_works->func = 0;
  150. +
  151. + return;
  152. +}
  153. +
  154. +/* COPY_IN_START_FROM kernel/workqueue.c */
  155. +struct cpu_workqueue_struct {
  156. +
  157. + spinlock_t lock;
  158. +
  159. + long remove_sequence; /* Least-recently added (next to run) */
  160. + long insert_sequence; /* Next to add */
  161. +
  162. + struct list_head worklist;
  163. + wait_queue_head_t more_work;
  164. + wait_queue_head_t work_done;
  165. +
  166. + struct workqueue_struct *wq;
  167. + task_t *thread;
  168. +
  169. + int run_depth; /* Detect run_workqueue() recursion depth */
  170. +} ____cacheline_aligned;
  171. +
  172. +
  173. +struct workqueue_struct {
  174. + struct cpu_workqueue_struct cpu_wq[NR_CPUS];
  175. + const char *name;
  176. + struct list_head list; /* Empty if single thread */
  177. +};
  178. +/* COPY_IN_END_FROM kernel/worqueue.c */
  179. +
  180. +extern struct workqueue_struct *keventd_wq;
  181. +
  182. +/* Preempt must be disabled. */
  183. +static void __queue_work(struct cpu_workqueue_struct *cwq,
  184. + struct work_struct *work)
  185. +{
  186. + unsigned long flags;
  187. +
  188. + spin_lock_irqsave(&cwq->lock, flags);
  189. + work->wq_data = cwq;
  190. + list_add_tail(&work->entry, &cwq->worklist);
  191. + cwq->insert_sequence++;
  192. + wake_up(&cwq->more_work);
  193. + spin_unlock_irqrestore(&cwq->lock, flags);
  194. +}
  195. +#endif //CONFIG_BOTTOM_SOFTIRQ_SMP
  196. +
  197. +
  198. /*
  199.   * Main IP Receive routine.
  200.   */
  201. @@ -424,8 +616,73 @@
  202.   }
  203.   }

  204. +#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP
  205. + if(!nr_cpus)
  206. + nr_cpus = num_online_cpus();
  207. +
  208. +    if(bs_enable && nr_cpus > 1 && iph->protocol != IPPROTO_ICMP) {
  209. +    //if(bs_enable && iph->protocol == IPPROTO_ICMP) { //test on icmp first
  210. + unsigned int flags, cur, cpu;
  211. + struct work_struct *bs_works;
  212. + struct sk_buff_head *q;
  213. +
  214. + cur = smp_processor_id();
  215. +
  216. + bs_cpu_status[cur].irqs++;
  217. +
  218. + //random distribute
  219. + cpu = (bs_cpu_status[cur].irqs % nr_cpus);
  220. + if(cpu == cur) {
  221. + bs_cpu_status[cpu].dids++;
  222. + return ip_rcv1(skb, dev);
  223. + }
  224. +
  225. +#ifdef BS_USE_PERCPU_DATA
  226. + q = &per_cpu(bs_cpu_queues, cpu);
  227. +#else
  228. + q = &bs_cpu_queues[cpu];
  229. +#endif
  230. +
  231. + if(!q->next) { // || skb_queue_len(q) == 0 ) {
  232. + skb_queue_head_init(q);
  233. + }
  234. +
  235. +
  236. +#ifdef BS_USE_PERCPU_DATA
  237. + bs_works = &per_cpu(bs_works, cpu);
  238. +#else
  239. + bs_works = &bs_works[cpu];
  240. +#endif
  241. + /*
  242. +        local_irq_save(flags);
  243. + SKB_CB(skb)->dev = dev;
  244. + SKB_CB(skb)->ptype = pt;
  245. + */
  246. +        spin_lock_irqsave(&q->lock, flags);
  247. + __skb_queue_tail(q, skb);
  248. +        spin_unlock_irqrestore(&q->lock, flags);
  249. + //if(net_ratelimit()) printk("qlen %d\n", q->qlen);
  250. +        
  251. + //local_irq_restore(flags);
  252. +           if (!bs_works->func) {
  253. +       INIT_WORK(bs_works, bs_func, q);
  254. + bs_cpu_status[cpu].works++;
  255. + preempt_disable();
  256. + __queue_work(keventd_wq->cpu_wq + cpu, bs_works);
  257. + preempt_enable();
  258. + }
  259. + } else {
  260. + int cpu = smp_processor_id();
  261. + bs_cpu_status[cpu].irqs++;
  262. + bs_cpu_status[cpu].dids++;
  263. + return ip_rcv1(skb, dev);
  264. + }
  265. + return 0;
  266. +#else
  267.   return NF_HOOK_COND(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
  268. -                     ip_rcv_finish, nf_hook_input_cond(skb));
  269. +            ip_rcv_finish, nf_hook_input_cond(skb));
  270. +#endif //CONFIG_BOTTOM_SOFTIRQ_SMP
  271. +

  272. inhdr_error:
  273.   IP_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS);
  274. --- old/net/sysctl_net.c 2007-09-20 23:30:29.000000000 +0800
  275. +++ new/net/sysctl_net.c 2007-09-20 23:28:06.000000000 +0800
  276. @@ -30,6 +30,22 @@
  277. extern struct ctl_table tr_table[];
  278. #endif

  279. +
  280. +#define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
  281. +#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
  282. +#if !defined(BS_CPU_STAT_DEFINED)
  283. +struct cpu_stat {
  284. + unsigned long irqs; //total irqs
  285. + unsigned long dids; //I did,
  286. + unsigned long others;
  287. + unsigned long works;
  288. +};
  289. +#endif
  290. +extern struct cpu_stat bs_cpu_status[NR_CPUS];
  291. +
  292. +extern int bs_enable;
  293. +#endif
  294. +
  295. struct ctl_table net_table[] = {
  296.   {
  297.   .ctl_name = NET_CORE,
  298. @@ -61,5 +77,26 @@
  299.   .child = tr_table,
  300.   },
  301. #endif
  302. +
  303. +#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
  304. + {
  305. + .ctl_name = 99,
  306. + .procname = "bs_status",
  307. + .data = &bs_cpu_status,
  308. + .maxlen = sizeof(bs_cpu_status),
  309. + .mode = 0644,
  310. + .proc_handler = &proc_dointvec,
  311. + },
  312. +
  313. + {
  314. + .ctl_name = 99,
  315. + .procname = "bs_enable",
  316. + .data = &bs_enable,
  317. + .maxlen = sizeof(int),
  318. + .mode = 0644,
  319. + .proc_handler = &proc_dointvec,
  320. + },
  321. +#endif
  322. +
  323.   { 0 },
  324. };
  325. --- old/kernel/workqueue.c 2007-09-21 04:48:13.000000000 +0800
  326. +++ new/kernel/workqueue.c 2007-09-21 04:47:49.000000000 +0800
  327. @@ -384,7 +384,11 @@
  328.   kfree(wq);
  329. }

  330. +/*
  331. static struct workqueue_struct *keventd_wq;
  332. +*/
  333. +struct workqueue_struct *keventd_wq;
  334. +EXPORT_SYMBOL(keventd_wq);

  335. int fastcall schedule_work(struct work_struct *work)
  336. {



[ Last edited by 思一克 on 2007-9-20 22:17 ]
Author: 思一克    Time: 2007-09-22 16:43
This is the module version of the BS implementation; it only suits kernel 2.6.13-15. Testing is welcome. A rough build-and-load sketch follows the code.




  1. /*
  2. *  BOTTOM_SOFTIRQ_NET
  3. *              An implementation of bottom softirq concurrent execution on SMP
  4. *              This is implemented by splitting current net softirq into top half
  5. *              and bottom half, dispatch the bottom half to each cpu's workqueue.
  6. *              Hopefully, it can raise the throughput of NIC when running iptalbes
  7. *              with heavy softirq load on SMP machine.
  8. *               
  9. *  Version:    $Id: bs_smp.c, v 2.6.13-15 for kernel 2.6.13-15-smp
  10. *   
  11. *  Authors:    John Ye & QianYu Ye, 2007.08.27
  12. */         

  13. #include <linux/module.h>
  14. #include <linux/fs.h>
  15. #include <linux/pagemap.h>
  16. #include <linux/highmem.h>
  17. #include <linux/init.h>
  18. #include <linux/string.h>
  19. #include <linux/smp_lock.h>
  20. #include <linux/backing-dev.h>

  21. #include <asm/uaccess.h>



  22. #include <linux/module.h>
  23. #include <linux/moduleparam.h>
  24. #include <linux/types.h>
  25. #include <linux/errno.h>
  26. #include <linux/slab.h>
  27. #include <linux/romfs_fs.h>
  28. #include <linux/fs.h>
  29. #include <linux/init.h>
  30. #include <linux/pagemap.h>
  31. #include <linux/smp_lock.h>
  32. #include <linux/buffer_head.h>
  33. #include <linux/vfs.h>
  34. #include <linux/delay.h>
  35. #include <linux/bio.h>
  36. #include <linux/aio.h>
  37. #include <asm/uaccess.h>


  38. //for debug_syscalls
  39. #include <linux/kernel.h>
  40. #include <linux/sched.h>
  41. #include <linux/mm.h>
  42. #include <linux/smp.h>
  43. #include <linux/ptrace.h>
  44. #include <linux/user.h>
  45. #include <linux/security.h>
  46. #include <linux/list.h>

  47. #include <asm/pgtable.h>
  48. #include <asm/system.h>
  49. #include <asm/processor.h>
  50. #include <asm/i387.h>
  51. #include <asm/debugreg.h>
  52. #include <asm/ldt.h>
  53. #include <asm/desc.h>


  54. #include <linux/swap.h>
  55. //#include <linux/interrupt.h>
  56. #include <asm/i387.h>
  57. #include <asm/debugreg.h>
  58. #include <asm/ldt.h>
  59. #include <asm/desc.h>

  60. #include <linux/swap.h>

  61. #include <linux/init.h>
  62. #include <linux/sched.h>
  63. #include <linux/smp_lock.h>
  64. #include <linux/input.h>
  65. #include <linux/module.h>
  66. #include <linux/random.h>
  67. #include <linux/major.h>
  68. #include <linux/pm.h>
  69. #include <linux/proc_fs.h>
  70. #include <linux/kmod.h>
  71. #include <linux/interrupt.h>
  72. #include <linux/poll.h>
  73. #include <linux/device.h>
  74. #include <linux/devfs_fs_kernel.h>
  75. #include <linux/interrupt.h>
  76. #include <linux/workqueue.h>
  77. #include <linux/skbuff.h>

  78. #include <linux/config.h>
  79. #include <linux/mm.h>
  80. #include <linux/module.h>
  81. #include <linux/sysctl.h>
  82. #include <net/tcp.h>
  83. #include <net/inet_common.h>
  84. #include <linux/ipsec.h>
  85. #include <asm/unaligned.h>

  86. #include <asm/system.h>
  87. #include <linux/module.h>
  88. #include <linux/types.h>
  89. #include <linux/kernel.h>
  90. #include <linux/string.h>
  91. #include <linux/errno.h>
  92. #include <linux/config.h>

  93. #include <linux/net.h>
  94. #include <linux/socket.h>
  95. #include <linux/sockios.h>
  96. #include <linux/in.h>
  97. #include <linux/inet.h>
  98. #include <linux/netdevice.h>
  99. #include <linux/etherdevice.h>

  100. #include <net/snmp.h>
  101. #include <net/ip.h>
  102. #include <net/protocol.h>
  103. #include <net/route.h>
  104. #include <linux/skbuff.h>
  105. #include <net/sock.h>
  106. #include <net/arp.h>
  107. #include <net/icmp.h>
  108. #include <net/raw.h>
  109. #include <net/checksum.h>
  110. #include <linux/netfilter_ipv4.h>
  111. #include <net/xfrm.h>
  112. #include <linux/mroute.h>
  113. #include <linux/netlink.h>
  114. #include <net/route.h>"
  115. #include <linux/inetdevice.h>

  116. static spinlock_t *p_ptype_lock;
  117. static struct list_head *p_ptype_base; /* 16 way hashed list */

  118. int (*Pip_options_rcv_srr)(struct sk_buff *skb);
  119. int (*Pnf_rcv_postxfrm_nonlocal)(struct sk_buff *skb);
  120. struct ip_rt_acct *ip_rt_acct;
  121. struct ipv4_devconf *Pipv4_devconf;

  122. #define ipv4_devconf (*Pipv4_devconf)
  123. //#define ip_rt_acct Pip_rt_acct
  124. #define ip_options_rcv_srr Pip_options_rcv_srr
  125. #define nf_rcv_postxfrm_nonlocal Pnf_rcv_postxfrm_nonlocal
  126. //extern int nf_rcv_postxfrm_local(struct sk_buff *skb);
  127. //extern int ip_options_rcv_srr(struct sk_buff *skb);
  128. static struct workqueue_struct **Pkeventd_wq;
  129. #define keventd_wq (*Pkeventd_wq)


  130. #define INSERT_CODE_HERE   


  131. static inline int ip_rcv_finish(struct sk_buff *skb)
  132. {
  133. struct net_device *dev = skb->dev;
  134. struct iphdr *iph = skb->nh.iph;
  135. int err;

  136. /*
  137. * Initialise the virtual path cache for the packet. It describes
  138. * how the packet travels inside Linux networking.
  139. */
  140. if (skb->dst == NULL) {
  141. if ((err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))) {
  142. if (err == -EHOSTUNREACH)
  143. IP_INC_STATS_BH(IPSTATS_MIB_INADDRERRORS);
  144. goto drop;
  145. }
  146. }

  147. if (nf_xfrm_nonlocal_done(skb))
  148. return nf_rcv_postxfrm_nonlocal(skb);

  149. #ifdef CONFIG_NET_CLS_ROUTE
  150. if (skb->dst->tclassid) {
  151. struct ip_rt_acct *st = ip_rt_acct + 256*smp_processor_id();
  152. u32 idx = skb->dst->tclassid;
  153. st[idx&0xFF].o_packets++;
  154. st[idx&0xFF].o_bytes+=skb->len;
  155. st[(idx>>16)&0xFF].i_packets++;
  156. st[(idx>>16)&0xFF].i_bytes+=skb->len;
  157. }
  158. #endif

  159. if (iph->ihl > 5) {
  160. struct ip_options *opt;

  161. /* It looks as overkill, because not all
  162.    IP options require packet mangling.
  163.    But it is the easiest for now, especially taking
  164.    into account that combination of IP options
  165.    and running sniffer is extremely rare condition.
  166.                                       --ANK (980813)
  167. */

  168. if (skb_cow(skb, skb_headroom(skb))) {
  169. IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
  170. goto drop;
  171. }
  172. iph = skb->nh.iph;

  173. if (ip_options_compile(NULL, skb))
  174. goto inhdr_error;

  175. opt = &(IPCB(skb)->opt);
  176. if (opt->srr) {
  177. struct in_device *in_dev = in_dev_get(dev);
  178. if (in_dev) {
  179. if (!IN_DEV_SOURCE_ROUTE(in_dev)) {
  180. if (IN_DEV_LOG_MARTIANS(in_dev) && net_ratelimit())
  181. printk(KERN_INFO "source route option %u.%u.%u.%u -> %u.%u.%u.%u\n",
  182.        NIPQUAD(iph->saddr), NIPQUAD(iph->daddr));
  183. in_dev_put(in_dev);
  184. goto drop;
  185. }
  186. in_dev_put(in_dev);
  187. }
  188. if (ip_options_rcv_srr(skb))
  189. goto drop;
  190. }
  191. }

  192. return dst_input(skb);

  193. inhdr_error:
  194. IP_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS);
  195. drop:
  196.         kfree_skb(skb);
  197.         return NET_RX_DROP;
  198. }



  199. #define CONFIG_BOTTOM_SOFTIRQ_SMP
  200. #define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL

  201. #ifdef CONFIG_BOTTOM_SOFTIRQ_SMP

  202. #ifdef COMMENT____________
  203. /*
  204. [PATCH: 2.6.13-15-SMP 1/2] network: concurrently run softirq network code on SMP
  205. Bottom Softirq Implementation. John Ye, 2007.08.27

  206. Why this patch:
  207. Make kernel be able to concurrently execute softirq's net code on SMP system.
  208. Take full advantages of SMP to handle more packets and greatly raises NIC throughput.
  209. The current kernel's net packet processing logic is:
  210. 1) The CPU which handles a hardirq must be executing its related softirq.
  211. 2) One softirq instance(irqs handled by 1 CPU) can't be executed on more than 2 CPUs
  212. at the same time.
  213. The limitation make kernel network be hard to take the advantages of SMP.

  214. How this patch:
  215. It splits the current softirq code into 2 parts: the cpu-sensitive top half,
  216. and the cpu-insensitive bottom half, then make bottom half(calld BS) be
  217. executed on SMP concurrently.
  218. The two parts are not equal in terms of size and load. Top part has constant code
  219. size(mainly, in net/core/dev.c and NIC drivers), while bottom part involves
  220. netfilter(iptables) whose load varies very much. An iptalbes with 1000 rules to match
  221. will make the bottom part's load be very high. So, if the bottom part softirq
  222. can be randomly distributed to processors and run concurrently on them, the network will
  223. gain much more packet handling capacity, network throughput will be be increased
  224. remarkably.

  225. Where useful:
  226. It's useful on SMP machines that meet the following 2 conditions:
  227. 1) have high kernel network load, for example, running iptables with thousands of rules, etc).
  228. 2) have more CPUs than active NICs, e.g. a 4 CPUs machine with 2 NICs).
  229. On these system, with the increase of softirq load, some CPUs will be idle
  230. while others(number is equal to # of NIC) keeps busy.
  231. IRQBALANCE will help, but it only shifts IRQ among CPUS, makes no softirq concurrency.
  232. Balancing the load of each cpus will not remarkably increase network speed.

  233. Where NOT useful:
  234. If the bottom half of softirq is too small(without running iptables), or the network
  235. is too idle, BS patch will not be seen to have visible effect. But It has no
  236. negative affect either.
  237. User can turn on/off BS functionality by /proc/sys/net/bs_enable switch.

  238. How to test:
  239. On a linux box, run iptables, add 2000 rules to table filter & table nat to simulate huge
  240. softirq load. Then, open 20 ftp sessions to download big file. On another machine(who
  241. use this test machine as gateway), open 20 more ftp download sessions. Compare the speed,
  242. without BS enabled, and with BS enabled.
  243. cat /proc/sys/net/bs_enable. this is a switch to turn on/off BS
  244. cat /proc/sys/net/bs_status. this shows the usage of each CPUs
  245. Test shown that when bottom softirq load is high, the network throughput can be nearly
  246. doubled on 2 CPUs machine. hopefully it may be quadrupled on a 4 cpus linux box.

  247. Bugs:
  248. It will NOT allow hotplug CPU.
  249. It only allows incremental CPUs ids, starting from 0 to num_online_cpus().
  250. for example, 0,1,2,3 is OK. 0,1,8,9 is KO.

  251. Some considerations in the future:
  252. 1) With BS patch, the irq balance code on arch/i386/kernel/io_apic.c seems no need any more,
  253. at least not for network irq.
  254. 2) Softirq load will become very small. It only run the top half of old softirq, which
  255. is much less expensive than bottom half---the netfilter program.
  256. To let top softirq process more packets, can these 3 network parameters be given a larger value?
  257.    extern int netdev_max_backlog = 1000;
  258.    extern int netdev_budget = 300;
  259.    extern int weight_p = 64;
  260. 3) Now, BS are running on built-in keventd thread, we can create new workqueues to let it run on?

  261. Signed-off-by: John Ye (Seeker) <[email]johny@webizmail.com[/email]>
  262. */
  263. #endif

  264. #define BS_USE_PERCPU_DATA
  265. struct cpu_stat {
  266. unsigned long irqs; //total irqs
  267. unsigned long dids; //I did,
  268. unsigned long others;
  269. unsigned long works;
  270. };
  271. #define BS_CPU_STAT_DEFINED

  272. static int nr_cpus = 0;

  273. #ifdef BS_USE_PERCPU_DATA
  274. static DEFINE_PER_CPU(struct sk_buff_head, bs_cpu_queues); // cacheline_aligned_in_smp;
  275. static DEFINE_PER_CPU(struct work_struct, bs_works);
  276. //static DEFINE_PER_CPU(struct cpu_stat, bs_cpu_status);
  277. struct cpu_stat bs_cpu_status[NR_CPUS] = { {0, }, {0, }, };
  278. #else
  279. #define NR_CPUS  8
  280. static struct sk_buff_head bs_cpu_queues[NR_CPUS];
  281. static struct work_struct  bs_works[NR_CPUS];
  282. static struct cpu_stat    bs_cpu_status[NR_CPUS];
  283. #endif

  284. int bs_enable = 1;
  285. static int ip_rcv1(struct sk_buff *skb, struct net_device *dev)
  286. {
  287. return NF_HOOK_COND(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish, nf_hook_input_cond(skb));
  288. }


  289. static void bs_func(void *data)
  290. {
  291. int flags, num, cpu;
  292. struct sk_buff *skb;
  293. struct work_struct *bs_works;
  294. struct sk_buff_head *q;
  295. cpu = smp_processor_id();


  296. #ifdef BS_USE_PERCPU_DATA
  297. bs_works = &per_cpu(bs_works, cpu);
  298. q = &per_cpu(bs_cpu_queues, cpu);
  299. #else
  300. bs_works = &bs_works[cpu];
  301. q = &bs_cpu_queues[cpu];
  302. #endif

  303. local_bh_disable();
  304. restart:
  305. num = 0;
  306. while(1) {
  307. spin_lock_irqsave(&q->lock, flags);
  308.         skb = __skb_dequeue(q);
  309. spin_unlock_irqrestore(&q->lock, flags);
  310. if(!skb) break;
  311. num++;
  312. //local_bh_disable();
  313.         ip_rcv1(skb, skb->dev);
  314. //__local_bh_enable(); //sub_preempt_count(SOFTIRQ_OFFSET - 1);
  315. }

  316. bs_cpu_status[cpu].others += num;
  317. if(num > 2) printk("%d %d\n", num, cpu);
  318. if(num > 0) { goto restart; }

  319. __local_bh_enable(); //sub_preempt_count(SOFTIRQ_OFFSET - 1);
  320. bs_works->func = 0;

  321. return;
  322. }

  323. /* COPY_IN_START_FROM kernel/workqueue.c */
  324. struct cpu_workqueue_struct {

  325. spinlock_t lock;

  326. long remove_sequence; /* Least-recently added (next to run) */
  327. long insert_sequence; /* Next to add */

  328. struct list_head worklist;
  329. wait_queue_head_t more_work;
  330. wait_queue_head_t work_done;

  331. struct workqueue_struct *wq;
  332. task_t *thread;

  333. int run_depth; /* Detect run_workqueue() recursion depth */
  334. } ____cacheline_aligned;


  335. struct workqueue_struct {
  336. struct cpu_workqueue_struct cpu_wq[NR_CPUS];
  337. const char *name;
  338. struct list_head list; /* Empty if single thread */
  339. };
  340. /* COPY_IN_END_FROM kernel/worqueue.c */

  341. extern struct workqueue_struct *keventd_wq;

  342. /* Preempt must be disabled. */
  343. static void __queue_work(struct cpu_workqueue_struct *cwq,
  344. struct work_struct *work)
  345. {
  346. unsigned long flags;

  347. spin_lock_irqsave(&cwq->lock, flags);
  348. work->wq_data = cwq;
  349. list_add_tail(&work->entry, &cwq->worklist);
  350. cwq->insert_sequence++;
  351. wake_up(&cwq->more_work);
  352. spin_unlock_irqrestore(&cwq->lock, flags);
  353. }

  354. #endif //CONFIG_BOTTOM_SOFTIRQ_SMP

  355. /*
  356. * Main IP Receive routine.
  357. */
  358. /* hard irq are in CPU1, why this get called from CPU0?, __do_IRQ() did so?
  359. *
  360. */
  361. int REP_ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt)
  362. {
  363. struct iphdr *iph;

  364. /* When the interface is in promisc. mode, drop all the crap
  365. * that it receives, do not try to analyse it.
  366. */
  367. if (skb->pkt_type == PACKET_OTHERHOST)
  368. goto drop;

  369. IP_INC_STATS_BH(IPSTATS_MIB_INRECEIVES);

  370. if ((skb = skb_share_check(skb, GFP_ATOMIC)) == NULL) {
  371. IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
  372. goto out;
  373. }

  374. if (!pskb_may_pull(skb, sizeof(struct iphdr)))
  375. goto inhdr_error;

  376. iph = skb->nh.iph;

  377. /*
  378. * RFC1122: 3.1.2.2 MUST silently discard any IP frame that fails the checksum.
  379. *
  380. * Is the datagram acceptable?
  381. *
  382. * 1. Length at least the size of an ip header
  383. * 2. Version of 4
  384. * 3. Checksums correctly. [Speed optimisation for later, skip loopback checksums]
  385. * 4. Doesn't have a bogus length
  386. */

  387. if (iph->ihl < 5 || iph->version != 4)
  388. goto inhdr_error;

  389. if (!pskb_may_pull(skb, iph->ihl*4))
  390. goto inhdr_error;

  391. iph = skb->nh.iph;

  392. if (ip_fast_csum((u8 *)iph, iph->ihl) != 0)
  393. goto inhdr_error;

  394. {
  395. __u32 len = ntohs(iph->tot_len);
  396. if (skb->len < len || len < (iph->ihl<<2))
  397. goto inhdr_error;

  398. /* Our transport medium may have padded the buffer out. Now we know it
  399. * is IP we can trim to the true length of the frame.
  400. * Note this now means skb->len holds ntohs(iph->tot_len).
  401. */
  402. if (pskb_trim_rcsum(skb, len)) {
  403. IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
  404. goto drop;
  405. }
  406. }

  407. #ifdef CONFIG_BOTTOM_SOFTIRQ_SMP

  408. if(!nr_cpus)
  409. nr_cpus = num_online_cpus();

  410.     if(bs_enable && nr_cpus > 1 && iph->protocol != IPPROTO_ICMP) {
  411.     //if(bs_enable && iph->protocol == IPPROTO_ICMP) { //test on icmp first
  412. unsigned int flags, cur, cpu;
  413. struct work_struct *bs_works;
  414. struct sk_buff_head *q;

  415. cur = smp_processor_id();

  416. bs_cpu_status[cur].irqs++;

  417. if(!nr_cpus) {
  418. nr_cpus = num_online_cpus();
  419. }

  420. //random distribute
  421. cpu = (bs_cpu_status[cur].irqs % nr_cpus);
  422. if(cpu == cur) {
  423. bs_cpu_status[cpu].dids++;
  424. return ip_rcv1(skb, dev);
  425. }

  426. #ifdef BS_USE_PERCPU_DATA
  427. q = &per_cpu(bs_cpu_queues, cpu);
  428. #else
  429. q = &bs_cpu_queues[cpu];
  430. #endif

  431. if(!q->next) { // || skb_queue_len(q) == 0 ) {
  432. skb_queue_head_init(q);
  433. }


  434. #ifdef BS_USE_PERCPU_DATA
  435. bs_works = &per_cpu(bs_works, cpu);
  436. #else
  437. bs_works = &bs_works[cpu];
  438. #endif
  439.         spin_lock_irqsave(&q->lock, flags);
  440. __skb_queue_tail(q, skb);
  441.         spin_unlock_irqrestore(&q->lock, flags);
  442.         
  443.            if (!bs_works->func) {
  444.       INIT_WORK(bs_works, bs_func, q);
  445. bs_cpu_status[cpu].works++;
  446. preempt_disable();
  447. __queue_work(keventd_wq->cpu_wq + cpu, bs_works);
  448. preempt_enable();
  449. }
  450. } else {
  451. int cpu = smp_processor_id();
  452. bs_cpu_status[cpu].irqs++;
  453. bs_cpu_status[cpu].dids++;
  454. return ip_rcv1(skb, dev);
  455. }
  456. return 0;
  457. #else
  458. return NF_HOOK_COND(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish, nf_hook_input_cond(skb));
  459. #endif //CONFIG_BOTTOM_SOFTIRQ_SMP


  460. inhdr_error:
  461. IP_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS);
  462. drop:
  463.         kfree_skb(skb);
  464. out:
  465.         return NET_RX_DROP;
  466. }


  467. //for standard patch, those lines should be moved into ../../net/sysctl_net.c

  468. /* COPY_OUT_START_TO net/sysctl_net.c */
  469. #ifdef CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
  470. #if !defined(BS_CPU_STAT_DEFINED)
  471. struct cpu_stat {
  472. unsigned long irqs; //total irqs
  473. unsigned long dids; //I did,
  474. unsigned long others;
  475. unsigned long works;
  476. };
  477. #endif
  478. extern struct cpu_stat bs_cpu_status[NR_CPUS];

  479. extern int bs_enable;
  480. /* COPY_OUT_END_TO net/sysctl_net.c */

  481. static ctl_table bs_ctl_table[]={

  482. /* COPY_OUT_START_TO net/sysctl_net.c */
  483. {
  484. .ctl_name = 99,
  485. .procname = "bs_status",
  486. .data = &bs_cpu_status,
  487. .maxlen = sizeof(bs_cpu_status),
  488. .mode = 0644,
  489. .proc_handler = &proc_dointvec,
  490. },

  491. {
  492. .ctl_name = 99,
  493. .procname = "bs_enable",
  494. .data = &bs_enable,
  495. .maxlen = sizeof(int),
  496. .mode = 0644,
  497. .proc_handler = &proc_dointvec,
  498. },
  499. /* COPY_OUT_END_TO net/net_sysctl.c */

  500. { 0, },
  501. };

  502. static ctl_table bs_sysctl_root[] = {
  503. {
  504. .ctl_name = CTL_NET,
  505. .procname = "net",
  506. .mode = 0555,
  507. .child = bs_ctl_table,
  508. },
  509.   { 0, },
  510. };

  511. struct ctl_table_header *bs_sysctl_hdr;
  512. register_bs_sysctl(void)
  513. {
  514. bs_sysctl_hdr = register_sysctl_table(bs_sysctl_root, 0);
  515. return 0;
  516. }

  517. unregister_bs_sysctl(void)
  518. {
  519. unregister_sysctl_table(bs_sysctl_hdr);
  520. }

  521. #endif //CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL

  522. #if 1
  523. seeker_init()
  524. {
  525. int i;
  526. if(nr_cpus == 0)
  527. nr_cpus = num_online_cpus();
  528. register_bs_sysctl();
  529. }

  530. seeker_exit()
  531. {
  532. unregister_bs_sysctl();
  533. bs_enable = 0;
  534. msleep(1000);
  535. flush_scheduled_work();
  536. msleep(1000);
  537. printk("......exit...........\n");
  538. }
  539. #endif










  540. /*--------------------------------------------------------------------------
  541. */
  542. struct packet_type *dev_find_pack(int type)
  543. {
  544. struct list_head *head;
  545. struct packet_type *pt1;

  546. spin_lock_bh(p_ptype_lock);

  547. head = &p_ptype_base[type & 15];

  548. list_for_each_entry(pt1, head, list) {
  549. printk("pt1: %x\n", pt1->type);
  550. if (pt1->type == htons(type)) {
  551. printk("FOUND\n");
  552. goto out;
  553. }
  554. }

  555. pt1 = 0;
  556. printk( "dev_remove_pack: %p not found. type %x %x %x\n", pt1, type, ETH_P_IP, htons(ETH_P_IP));
  557. out:
  558. spin_unlock_bh(p_ptype_lock);
  559. return pt1;
  560. }


  561. static char system_map[128] = "/boot/System.map-";
  562. static unsigned long sysmap_size;
  563. static char *sysmap_buf;

  564. unsigned long sysmap_name2addr(char *name)
  565. {
  566. char *cp, *dp;
  567. unsigned long addr;
  568. int len, n;

  569.     if(!sysmap_buf) return 0;
  570.     if(!name || !name[0]) return 0;
  571.     n = strlen(name);
  572.     for(cp = sysmap_buf; ;) {
  573.     cp = strstr(cp, name);
  574.     if(!cp) return 0;

  575. for(dp = cp; *dp && *dp != '\n' && *dp != ' ' && *dp != '\t'; dp++);
  576.         
  577. len = dp - cp;
  578. if(len < n) goto cont;
  579.         if(cp > sysmap_buf && cp[-1] != ' ' && cp[-1] != '\t') {
  580. goto cont;
  581. }
  582. if(len > n) {
  583. goto cont;
  584. }
  585. break;
  586. cont:
  587. if(*dp == 0) break;
  588.         cp += (len+1);
  589.     }
  590.    
  591.     cp -= 11; /* back up over the "aaaaaaaa T " prefix: 8 hex digits, space, type char, space (32-bit System.map layout) */
  592.     if(cp > sysmap_buf && cp[-1] != '\n') {
  593. printk("_ERROR_ in name2addr cp = %p base %p\n", cp, sysmap_buf);
  594. return 0;
  595.     }
  596.     sscanf(cp, "%lx", &addr); /* addr is unsigned long, so %lx */
  597.     printk("%s -> %p\n", name, (void *)addr);

  598.     return addr;   
  599. }

  600. int kas_init(void)
  601. {
  602. struct file *fp;
  603. int i;
  604. long addr;
  605. struct kstat st;
  606. mm_segment_t old_fs;

  607.     //printk("system #%s#%s#%s#%s\n", system_utsname.sysname, system_utsname.nodename, system_utsname.release, system_utsname.version);
  608.     strcat(system_map, system_utsname.release);
  609.     printk("System.map is %s\n", system_map);

  610.     old_fs = get_fs();
  611.     set_fs(get_ds()); //systemp_map is __user variable
  612.     i = vfs_stat(system_map, &st);
  613.     set_fs(old_fs);

  614.     //sysmap_size = 1024*1024; //error
  615.     sysmap_size = st.size + 32;
  616.     fp = filp_open(system_map, O_RDONLY, FMODE_READ);
  617.    
  618.     if(IS_ERR(fp)) return 1; /* filp_open() returns an ERR_PTR, not NULL, on failure */
  619.     sysmap_buf = vmalloc(sysmap_size);
  620.     if(!sysmap_buf) return 2;
  621.     i = kernel_read(fp, 0, sysmap_buf, sysmap_size);   
  622.     if(i <= 0) {
  623. filp_close(fp, 0);
  624. vfree(sysmap_buf);
  625. sysmap_buf = 0;
  626. return 3;
  627.     }
  628.     sysmap_size = i;
  629.     *(int*)&sysmap_buf[i] = 0;
  630.     filp_close(fp, 0);
  631.     //sysmap_symbol2addr = sysmap_name2addr;
  632.    
  633.    
  634.     p_ptype_lock = sysmap_name2addr("ptype_lock");
  635.     p_ptype_base = sysmap_name2addr("ptype_base");
  636.     /*
  637.     int (*Pip_options_rcv_srr)(struct sk_buff *skb);
  638.     int (*Pnf_rcv_postxfrm_nonlocal)(struct sk_buff *skb);
  639.     struct ip_rt_acct *ip_rt_acct;
  640.     struct ipv4_devconf *Pipv4_devconf;
  641.     */
  642.     Pkeventd_wq = sysmap_name2addr("keventd_wq");
  643.     //keventd_wq = *(long *)&keventd_wq;
  644.    
  645.     Pip_options_rcv_srr = sysmap_name2addr("ip_options_rcv_srr");
  646.     Pnf_rcv_postxfrm_nonlocal = sysmap_name2addr("nf_rcv_postxfrm_nonlocal");
  647.     ip_rt_acct = sysmap_name2addr("ip_rt_acct");
  648.     Pipv4_devconf = sysmap_name2addr("ipv4_devconf");
  649.     printk("lock = %p base = %p\n", p_ptype_lock, p_ptype_base);
  650.     vfree(sysmap_buf);
  651.     return 0;
  652. }
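
#if 0   /* Editor's sketch, not part of the original module: the System.map lookup
         * pattern used above. kas_init() resolves the address of an unexported
         * kernel object and stores it in a pointer; a macro then lets the rest of
         * the code use the familiar name. Names follow the version 2 module posted
         * later in this thread; resolve_symbols() is a hypothetical wrapper shown
         * only for illustration. */
static struct workqueue_struct **Pkeventd_wq;   /* filled from System.map at load time */
#define keventd_wq (*Pkeventd_wq)               /* code reads as if the symbol were exported */

static int resolve_symbols(void)
{
        Pkeventd_wq = (struct workqueue_struct **)sysmap_name2addr("keventd_wq");
        return Pkeventd_wq ? 0 : -ENOENT;       /* refuse to load if the symbol is missing */
}
#endif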


  653. struct packet_type *ip_handler;
  654. static int  __init init()
  655. {
  656. struct packet_type *pt;
  657. kas_init();
  658. pt = dev_find_pack(ETH_P_IP);
  659. if(!pt) return -1;
  660. //printk("pt %p func ip_rcv %p should be %p\n", pt, pt->func, ip_rcv);

  661. lock_kernel();
  662. if(pt->func == ip_rcv) {
  663. pt->func = REP_ip_rcv;
  664. } else
  665. printk("no...\n");

  666. ip_handler = pt;
  667. unlock_kernel();
  668. seeker_init();
  669. return 0;
  670. }

  671. static void __exit exit(void)
  672. {
  673. seeker_exit();
  674. lock_kernel();
  675. if(ip_handler->func == REP_ip_rcv)
  676. ip_handler->func = ip_rcv;
  677. else
  678. printk("error...\n");
  679. unlock_kernel();
  680. }

  681. module_init(init)
  682. module_exit(exit)
  683. MODULE_LICENSE("GPL");


Author: sisi8408    Time: 2007-09-22 20:26
  1. [quote]
  2. local_bh_disable(); //timer lost and local cpu suck
  3. restart:
  4. num = 0;
  5. while(1) {
  6. spin_lock_irqsave(&q->lock, flags);
  7.         skb = __skb_dequeue(q);
  8. spin_unlock_irqrestore(&q->lock, flags);
  9. if(!skb) break;
  10. num++;
  11. //local_bh_disable();
  12.         ip_rcv1(skb, skb->dev);
  13. //__local_bh_enable(); //sub_preempt_count(SOFTIRQ_OFFSET - 1);
  14. }

  15. bs_cpu_status[cpu].others += num;
  16. if(num > 2) printk("%d %d\n", num, cpu);
  17. if(num > 0) { goto restart; }

  18. __local_bh_enable(); //sub_preempt_count(SOFTIRQ_OFFSET - 1);
  19. bs_works->func = 0;

  20. return;
  21. [/quote]

Author: 思一克    Time: 2007-09-22 23:18
I have been studying your correction. I have not reached a conclusion yet, so I have not replied.

The intent here is to keep the handler from being scheduled away while it is processing.
But once a pass finishes it can be scheduled away anyway,
and num becomes 0.

So that if(num > 0) goto restart

actually does not do very much.

There are certainly places that are not handled carefully (BUGs), which is why I am asking everyone to do TESTING.
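
(For reference, an editor's sketch of the loop being discussed; the structure and names follow the code quoted above, with dequeue_locked() standing in for the spin_lock_irqsave + __skb_dequeue pair. Illustration only.)

        local_bh_disable();             /* BHs stay off for the whole drain; the "too strong" part */
restart:
        num = 0;
        while ((skb = dequeue_locked(q)) != NULL) {
                num++;
                ip_rcv1(skb, skb->dev); /* bottom-half work for this packet */
        }
        if (num > 0)                    /* something was processed: take one more pass for late arrivals */
                goto restart;           /* a pass that drains nothing leaves num == 0 and falls through */
        __local_bh_enable();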
Author: sisi8408    Time: 2007-09-23 13:33
seeker did a really nice job in the BH way,
but local_bh_disable is too strong.
btw, rx_softirq is conditionally driven by a timer,
and scheduled by a kthread,
so it sounds nicer to ask linux to complete what u dream,
especially when a lot of work remains in user space.
Author: 思一克    Time: 2007-09-24 09:05
To the moderators, sisi8408, and everyone else,

Seriously: whoever has the setup for it, please help run some tests. Positive or negative results are both welcome, as long as they are real.
Please also include a simple test method and the result data.

The tests should mainly look at the effect under iptables with heavy traffic.

If this thing really turns out to be useful, I will list everyone who did the testing (name, email, date) at the top of the program header.
If it proves useless and is discarded, then consider it a service everyone rendered.

THANKS
Author: 思一克    Time: 2007-09-27 09:02
Through the discussion, the module code has been improved into a new version.

It is posted on my BLOG.

The main change is to avoid the IP reassembly problems caused by packets arriving too fast under parallel processing.

http://blog.chinaunix.net/u/12848/showart.php?id=389602
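
(How the later version avoids intra-flow reordering: every packet of a flow is hashed to the same CPU, so packets of one connection are never processed in parallel. An editor's sketch of that dispatch decision, using the field names of the BS_POL_LINK policy in the version 2 patch below; it assumes a kernel where ip_hdr() exists, as in the patch's own version checks.)

#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>

/* Pick a CPU per flow: same addresses and ports give the same hash, hence the same CPU. */
static int pick_cpu(struct sk_buff *skb, int nr_cpus)
{
        struct iphdr *iph = ip_hdr(skb);
        int seed = 0;

        if (iph->protocol == IPPROTO_TCP || iph->protocol == IPPROTO_UDP) {
                struct tcphdr *th = (struct tcphdr *)(iph + 1); /* UDP ports sit at the same offsets */
                seed = ntohs(th->source) + ntohs(th->dest);
        }
        return (iph->saddr + iph->daddr + seed) % nr_cpus;
}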
Author: AIXHP    Time: 2007-09-27 09:24
Originally posted by 思一克 on 2007-6-29 11:04
On the problem of balancing interrupt load across CPUs on LINUX

See the thread
http://linux.chinaunix.net/bbs/thread-753474-1-1.html

It says the 4 CPUs have a serious imbalance problem. Since there is no setup to experiment with,
and the OP is not replying to the thread either, anyone interested is invited to join the experiment ...

Monitor the kernel load and the number of user processes in the running state. If the kernel load is not heavy but quite light, this is normal; otherwise there may be a problem.
Author: sisi8408    Time: 2007-10-10 01:49
To the moderators, sisi8408, and everyone else,

Seriously: whoever has the setup for it, please help run some tests. Positive or negative results are both welcome, as long as they are real.
Please also include a simple test method and the result data.

The tests should mainly look at the effect under iptables with heavy traffic.

If this thing really turns out to be useful, I will list everyone who did the testing (name, email, date) at the top of the program header.
If it proves useless and is discarded, then consider it a service everyone rendered.

THANKS


Damn it, i cough.
When seeker, whom I admire, says such irresponsible things, I have to cough again.

Not confident? Such a beautiful piece of work of yours simply has not been tested yet; what are you so anxious about?
Give it up over a few flaws? That makes no sense.

When I am back in Beijing I will come and find you.
Author: 思一克    Time: 2007-10-10 09:34
Thanks for your recognition.

I have now parallelized all the NETFILTER work done in softirq: bridge, ipv6, and the rest. I will post the PATCH in a moment.


Originally posted by sisi8408 on 2007-10-10 01:49


Damn it, i cough.
When seeker, whom I admire, says such irresponsible things, I have to cough again.

Not confident? Such a beautiful piece of work of yours simply has not been tested yet; what are you so anxious about?
Give it up over a few flaws? That makes no sense.

When I am back in Beijing I will come and find you.

Author: 思一克    Time: 2007-10-10 09:37
BOTTOM SOFTIRQ version 2.

The softirq processing of every protocol is now parallelized, not just ipv4. Testing is welcome.

  1. --- linux-2.6.23-rc8/net/core/dev.c 2007-09-25 08:33:10.000000000 +0800
  2. +++ linux-2.6.23-rc8/net/core/dev.c 2007-10-10 09:30:30.000000000 +0800
  3. @@ -1919,12 +1919,269 @@
  4. }
  5. #endif

  6. +
  7. +#define CONFIG_BOTTOM_SOFTIRQ_SMP
  8. +#define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
  9. +
  10. +
  11. +#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP
  12. +
  13. +/*
  14. +[PATCH: 2.6.13-15-SMP 1/2] network: concurrently run softirq network code on SMP
  15. +Bottom Softirq Implementation. John Ye, 2007.08.27
  16. +
  17. +This is the version 2 BS patch. It parallelizes all protocols'
  18. +netfilter running in softirq: IPV4, IPV6, bridge, etc.
  19. +
  20. +Why this patch:
  21. +Make kernel be able to concurrently execute softirq's net code on SMP system.
  22. +Take full advantages of SMP to handle more packets and greatly raises NIC throughput.
  23. +The current kernel's net packet processing logic is:
  24. +1) The CPU which handles a hardirq must be executing its related softirq.
  25. +2) One softirq instance (the irqs taken by one CPU) can't be executed on two or more CPUs
  26. +at the same time.
  27. +The limitation make kernel network be hard to take the advantages of SMP.
  28. +
  29. +How this patch:
  30. +It splits the current softirq code into 2 parts: the cpu-sensitive top half,
  31. +and the cpu-insensitive bottom half, then make the bottom half (called BS) be
  32. +executed on SMP concurrently.
  33. +The two parts are not equal in terms of size and load. Top part has constant code
  34. +size(mainly, in net/core/dev.c and NIC drivers), while bottom part involves
  35. +netfilter(iptables) whose load varies very much. An iptables with 1000 rules to match
  36. +will make the bottom part's load be very high. So, if the bottom part softirq
  37. +can be distributed to processors and run concurrently on them, the network will
  38. +gain much more packet handling capacity, network throughput will be increased
  39. +remarkably.
  40. +
  41. +Where useful:
  42. +It's useful on SMP machines that meet the following 2 conditions:
  43. +1) have high kernel network load, for example, running iptables with thousands of rules, etc).
  44. +2) have more CPUs than active NICs, e.g. a 4 CPUs machine with 2 NICs).
  45. +On these system, with the increase of softirq load, some CPUs will be idle
  46. +while others(number is equal to # of NIC) keeps busy.
  47. +IRQBALANCE will help, but it only shifts IRQ among CPUS, makes no softirq concurrency.
  48. +Balancing the load of each cpus will not remarkably increase network speed.
  49. +
  50. +Where NOT useful:
  51. +If the bottom half of softirq is too small(without running iptables), or the network
  52. +is too idle, BS patch will not be seen to have visible effect. But It has no
  53. +negative affect either.
  54. +User can turn off BS functionality by set /proc/sys/net/bs_policy value to 0.
  55. +
  56. +How to test:
  57. +On a linux box, run iptables, add 2000 rules to table filter & table nat to simulate huge
  58. +softirq load. Then, open 20 ftp sessions to download big file. On another machine(who
  59. +use this test machine as gateway), open 20 more ftp download sessions. Compare the speed,
  60. +without BS enabled, and with BS enabled.
  61. +cat /proc/sys/net/bs_policy. 1 for flow dispatch, 2 random dispatch. 0 no dispatch.
  62. +cat /proc/sys/net/bs_status. this shows the usage of each CPUs
  63. +Test shown that when bottom softirq load is high, the network throughput can be nearly
  64. +doubled on 2 CPUs machine. hopefully it may be quadrupled on a 4 cpus linux box.
  65. +
  66. +Bugs:
  67. +It will NOT allow hotplug CPU.
  68. +It only allows incremental CPUs ids, starting from 0 to num_online_cpus().
  69. +for example, 0,1,2,3 is OK. 0,1,8,9 is KO.
  70. +
  71. +Some considerations in the future:
  72. +1) With BS patch, the irq balance code on arch/i386/kernel/io_apic.c seems no need any more,
  73. +at least not for network irq.
  74. +2) Softirq load will become very small. It only run the top half of old softirq, which
  75. +is much less expensive than bottom half---the netfilter program.
  76. +To let top softirq process more packets, can these 3 network parameters be given a larger value?
  77. +extern int netdev_max_backlog = 1000;
  78. +extern int netdev_budget = 300;
  79. +extern int weight_p = 64;
  80. +3) Now, BS are running on built-in keventd thread, we can create new workqueues to let it run on?
  81. +
  82. +Signed-off-by: John Ye (Seeker) <[email]johny@webizmail.com[/email]>
  83. +*/
  84. +
  85. +
  86. +#define CBPTR( skb ) (*((void **)(skb->cb)))
  87. +#define BS_USE_PERCPU_DATA
  88. +struct cpu_stat
  89. +{
  90. +        unsigned long irqs;                       //total irqs
  91. +        unsigned long dids;                       //I did,
  92. +        unsigned long works;
  93. +};
  94. +#define BS_CPU_STAT_DEFINED
  95. +
  96. +static int nr_cpus = 0;
  97. +
  98. +#define BS_POL_LINK     1
  99. +#define BS_POL_RANDOM   2
  100. +int bs_policy = BS_POL_LINK; //cpu hash. 0 will turn off BS. 1 link based, 2 random
  101. +
  102. +static DEFINE_PER_CPU(struct sk_buff_head, bs_cpu_queues);
  103. +static DEFINE_PER_CPU(struct work_struct, bs_works);
  104. +//static DEFINE_PER_CPU(struct cpu_stat, bs_cpu_status);
  105. +struct cpu_stat bs_cpu_status[NR_CPUS];
  106. +
  107. +//static int __netif_recv_skb(struct sk_buff *skb, struct net_device *odev);
  108. +static int __netif_recv_skb(struct sk_buff *skb);
  109. +
  110. +static void bs_func(struct work_struct *data)
  111. +{
  112. +        unsigned long flags; int num, cpu;     /* flags must be unsigned long for spin_lock_irqsave() */
  113. +        struct sk_buff *skb;
  114. +        struct work_struct *bs_works;
  115. +        struct sk_buff_head *q;
  116. +        cpu = smp_processor_id();
  117. +
  118. +        bs_works = &per_cpu(bs_works, cpu);
  119. +        q = &per_cpu(bs_cpu_queues, cpu);
  120. +
  121. +        //local_bh_disable();
  122. +        restart:
  123. +
  124. +        num = 0;
  125. +        while(1)
  126. +        {
  127. +                spin_lock_irqsave(&q->lock, flags);
  128. +                if(!(skb = __skb_dequeue(q))) {
  129. +                spin_unlock_irqrestore(&q->lock, flags);
  130. + break;
  131. + }
  132. +                spin_unlock_irqrestore(&q->lock, flags);
  133. +                num++;
  134. +
  135. +                local_bh_disable();
  136. +                __netif_recv_skb(skb);
  137. +                local_bh_enable();      // sub_preempt_count(SOFTIRQ_OFFSET - 1);
  138. +        }
  139. +
  140. +        bs_cpu_status[cpu].dids += num;
  141. +        //if(num > 2) printk("%d %d\n", num, cpu);
  142. +        if(num > 0)
  143. +                goto restart;
  144. +
  145. +        //__local_bh_enable();
  146. +        bs_works->func = 0;
  147. +
  148. +        return;
  149. +}
  150. +
  151. +struct cpu_workqueue_struct {
  152. +
  153. + spinlock_t lock;
  154. +
  155. + struct list_head worklist;
  156. + wait_queue_head_t more_work;
  157. + struct work_struct *current_work;
  158. +
  159. + struct workqueue_struct *wq;
  160. + struct task_struct *thread;
  161. +
  162. + int run_depth; /* Detect run_workqueue() recursion depth */
  163. +} ____cacheline_aligned;
  164. +
  165. +struct workqueue_struct {
  166. + struct cpu_workqueue_struct *cpu_wq;
  167. + struct list_head list;
  168. + const char *name;
  169. + int singlethread;
  170. + int freezeable; /* Freeze threads during suspend */
  171. +};
  172. +
  173. +#ifndef CONFIG_BOTTOM_SOFTIRQ_MODULE
  174. +extern void __queue_work(struct cpu_workqueue_struct *cwq, struct work_struct *work);
  175. +extern struct workqueue_struct *keventd_wq;
  176. +#endif
  177. +#include <linux/in.h>
  178. +#include <linux/ip.h>
  179. +#include <linux/tcp.h>
  180. +
  181. +static inline int bs_dispatch(struct sk_buff *skb)
  182. +{
  183. + struct iphdr *iph = ip_hdr(skb);
  184. +
  185. + if(!nr_cpus)
  186. + nr_cpus = num_online_cpus();
  187. +
  188. + if(bs_policy && nr_cpus > 1) { // && iph->protocol != IPPROTO_ICMP) {
  189. + //if(bs_policy && nr_cpus > 1 && iph->protocol == IPPROTO_ICMP) { //test on icmp first
  190. + unsigned long flags; unsigned int cur, cpu;  /* flags must be unsigned long for spin_lock_irqsave() */
  191. + struct work_struct *bs_works;
  192. + struct sk_buff_head *q;
  193. +
  194. + cpu = cur = smp_processor_id();
  195. +
  196. + bs_cpu_status[cur].irqs++;
  197. +
  198. + //good point for Jamal. thanks no reordering
  199. + if(bs_policy == BS_POL_LINK) {
  200. + int seed = 0;
  201. + if(iph->protocol == IPPROTO_TCP || iph->protocol == IPPROTO_UDP) {
  202. + struct tcphdr *th = (struct tcphdr*)(iph + 1);  //udp is same as tcp
  203. + seed = ntohs(th->source) + ntohs(th->dest);
  204. + }
  205. + cpu = (iph->saddr + iph->daddr + seed) % nr_cpus;
  206. +
  207. + /*
  208. + if(net_ratelimit() && iph->protocol == IPPROTO_TCP) {
  209. + struct tcphdr *th = iph + 1;
  210. +
  211. + printk("seed %u (%u %u) cpu %d. source %d dest %d\n",
  212. +                                seed, iph->saddr + iph->daddr, iph->saddr + iph->daddr + seed, cpu,
  213. + ntohs(th->source), ntohs(th->dest));
  214. + }
  215. + */
  216. + } else
  217. + //random distribute
  218. + if(bs_policy == BS_POL_RANDOM)
  219. + cpu = (bs_cpu_status[cur].irqs % nr_cpus);
  220. +
  221. + //cpu = cur;
  222. + //cpu = (cur? 0: 1);
  223. +
  224. + if(cpu == cur) {
  225. + bs_cpu_status[cpu].dids++;
  226. + return __netif_recv_skb(skb);
  227. + }
  228. +
  229. + q = &per_cpu(bs_cpu_queues, cpu);
  230. +
  231. + if(!q->next) { // || skb_queue_len(q) == 0 ) {
  232. + skb_queue_head_init(q);
  233. + }
  234. +
  235. +
  236. + bs_works = &per_cpu(bs_works, cpu);
  237. + spin_lock_irqsave(&q->lock, flags);
  238. + __skb_queue_tail(q, skb);
  239. + spin_unlock_irqrestore(&q->lock, flags);
  240. +
  241. +    if (!bs_works->func) {
  242. +                        INIT_WORK(bs_works, bs_func);
  243. +                        bs_cpu_status[cpu].works++;
  244. +                        preempt_disable();
  245. + set_bit(WORK_STRUCT_PENDING, work_data_bits(bs_works));
  246. + __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), bs_works);
  247. +                        preempt_enable();
  248. + }
  249. +
  250. + } else {
  251. +
  252. + bs_cpu_status[smp_processor_id()].dids++;
  253. + return __netif_recv_skb(skb);
  254. + }
  255. + return 0;
  256. +}
  257. +
  258. +
  259. +
  260. +#endif
  261. +
  262. +
  263. int netif_receive_skb(struct sk_buff *skb)
  264. {
  265. - struct packet_type *ptype, *pt_prev;
  266. + //struct packet_type *ptype, *pt_prev;
  267.   struct net_device *orig_dev;
  268. - int ret = NET_RX_DROP;
  269. - __be16 type;
  270. + //int ret = NET_RX_DROP;
  271. + //__be16 type;

  272.   /* if we've gotten here through NAPI, check netpoll */
  273.   if (skb->dev->poll && netpoll_rx(skb))
  274. @@ -1947,6 +2204,19 @@
  275.   skb_reset_transport_header(skb);
  276.   skb->mac_len = skb->network_header - skb->mac_header;

  277. + CBPTR(skb) = orig_dev;
  278. + return bs_dispatch(skb);
  279. +}
  280. +
  281. +int __netif_recv_skb(struct sk_buff *skb)
  282. +{
  283. + struct packet_type *ptype, *pt_prev;
  284. + struct net_device *orig_dev;
  285. + int ret = NET_RX_DROP;
  286. + __be16 type;
  287. +
  288. + orig_dev = CBPTR(skb);
  289. + CBPTR(skb) = 0;
  290.   pt_prev = NULL;

  291.   rcu_read_lock();
  292. --- linux-2.6.23-rc8/kernel/workqueue.c 2007-09-25 08:33:10.000000000 +0800
  293. +++ linux-2.6.23-rc8/kernel/workqueue.c 2007-10-10 08:52:05.000000000 +0800
  294. @@ -138,7 +138,9 @@
  295. }

  296. /* Preempt must be disabled. */
  297. -static void __queue_work(struct cpu_workqueue_struct *cwq,
  298. +//static void __queue_work(struct cpu_workqueue_struct *cwq,
  299. +// struct work_struct *work)
  300. +void __queue_work(struct cpu_workqueue_struct *cwq,
  301.   struct work_struct *work)
  302. {
  303.   unsigned long flags;
  304. @@ -515,7 +517,12 @@
  305. }
  306. EXPORT_SYMBOL(cancel_delayed_work_sync);

  307. +
  308. +/*
  309. static struct workqueue_struct *keventd_wq __read_mostly;
  310. +*/
  311. +struct workqueue_struct *keventd_wq __read_mostly;
  312. +

  313. /**
  314.   * schedule_work - put work task in global workqueue
  315. @@ -848,5 +855,6 @@
  316.   cpu_singlethread_map = cpumask_of_cpu(singlethread_cpu);
  317.   hotcpu_notifier(workqueue_cpu_callback, 0);
  318.   keventd_wq = create_workqueue("events");
  319. + printk("keventd_wq %p %p OK.\n", keventd_wq, keventd_wq->cpu_wq);
  320.   BUG_ON(!keventd_wq);
  321. }
  322. --- linux-2.6.23-rc8/net/sysctl_net.c 2007-09-25 08:33:10.000000000 +0800
  323. +++ linux-2.6.23-rc8/net/sysctl_net.c 2007-10-09 21:10:41.000000000 +0800
  324. @@ -29,6 +29,15 @@
  325. #include <linux/if_tr.h>
  326. #endif

  327. +struct cpu_stat
  328. +{
  329. +        unsigned long irqs;                       /* total irqs on me */
  330. +        unsigned long dids;                       /* I did, */
  331. +        unsigned long works;                      /* q works */
  332. +};
  333. +extern int bs_policy;
  334. +extern struct cpu_stat bs_cpu_status[NR_CPUS];
  335. +
  336. struct ctl_table net_table[] = {
  337.   {
  338.   .ctl_name = NET_CORE,
  339. @@ -36,6 +45,24 @@
  340.   .mode = 0555,
  341.   .child = core_table,
  342.   },
  343. +
  344. +        {
  345. +                .ctl_name       = 99,
  346. +                .procname       = "bs_status",
  347. +                .data           = &bs_cpu_status,
  348. +                .maxlen         = sizeof(bs_cpu_status),
  349. +                .mode           = 0644,
  350. +                .proc_handler   = &proc_dointvec,
  351. +        },
  352. +        {
  353. +                .ctl_name       = 99,
  354. +                .procname       = "bs_policy",
  355. +                .data           = &bs_policy,
  356. +                .maxlen         = sizeof(int),
  357. +                .mode           = 0644,
  358. +                .proc_handler   = &proc_dointvec,
  359. +        },
  360. +
  361. #ifdef CONFIG_INET
  362.   {
  363.   .ctl_name = NET_IPV4,



Author: 思一克    Time: 2007-10-10 09:53
to sisi8408,

Following your advice, I have narrowed the window during which bh is disabled (the "too strong" part).
Author: sisi8408    Time: 2007-10-10 16:40
Subject: Reply to 思一克's post #61
Here you go again. Why is it too strong?
u have to give a definitely clear answer, or
prove that it was perfectly fine.

i believe that is not ur style of playing linux, seeker,
u have a soundly good idea and u have already implemented it in C,
why not take another step and show CUers u r tough enough.

looking forward to reading ur answer,
and it is ur duty to respond to any advice and even criticism,
for u see, CUers are watching ur progress and are with u.
Author: 思一克    Time: 2007-10-10 16:51
to sisi8408,

By "too strong" I mean exactly what you said: "local_bh_disable is too strong".
You were right. I have since narrowed its scope.


Originally posted by sisi8408 on 2007-9-23 13:33
seeker did a really nice job in the BH way,
but local_bh_disable is too strong.
btw, rx_softirq is conditionally driven by a timer,
and scheduled by a kthread,
so it sounds nicer to ask linux comple ...

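(To make "narrowed its scope" concrete: in the first version, bs_func() held local_bh_disable() across the whole drain loop, while in the version 2 patch posted above it is held only around each __netif_recv_skb() call. An editor's before/after sketch, with dequeue_locked() standing in for the locked __skb_dequeue; illustration only.)

        /* v1, too strong: BHs are off for the entire drain */
        local_bh_disable();
        while ((skb = dequeue_locked(q)) != NULL)
                __netif_recv_skb(skb);
        local_bh_enable();

        /* v2, narrowed: BHs are off only while one skb is being processed */
        while ((skb = dequeue_locked(q)) != NULL) {
                local_bh_disable();
                __netif_recv_skb(skb);
                local_bh_enable();
        }
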
Author: sisi8408    Time: 2007-10-10 18:25
Originally posted by 思一克 on 2007-10-10 16:51
to sisi8408,

By "too strong" I mean exactly what you said: "local_bh_disable is too strong".
You were right. I have since narrowed its scope.


I am still not satisfied with that answer; it is too brief. At least give a 1, 2, 3.
The scope was narrowed, but by how much? And why is that reasonable?

Analysis is best; testing is being responsible to the users. That is the polite way to put it;
the blunt way is that it spares you some scolding, heh.

If you cannot analyze it, then for the brothers who test it: if it goes well, they hand back a report and suggestions;
if it goes badly, the cursing is guaranteed, and I do not enjoy being played for a monkey either.

If the analysis is done well, readers repay it with trust and respect, even if they do not say so.
Author: 思一克    Time: 2007-10-10 21:38
To sisi8408
Are you asking me to explain the workings of that PATCH in detail?
Author: wenaideyu    Time: 2007-12-04 15:17
Is the Enable Kernel Irq balancing option selected in your kernel config? I tested this under 2.6.20: with that option selected, and after then modifying /proc/irq/??/smp_affinity,
interrupt load balancing works. If it is not selected, the CPUs will not balance the interrupt load, no matter how you change that mask under /proc/irq/??.
Author: AIXHP    Time: 2007-12-04 16:15
Originally posted by albcamus on 2007-6-29 11:35


[root@localhost Documentation]# cat /proc/interrupts
           CPU0       CPU1
  0:        358          0   IO-APIC-edge      timer
  1:          2          0   IO-APIC-edge      i8042
  ...

Is the CPU load heavy? If not [it depends on which CPU the hardware interrupt is routed to, and whether the scheduler preferentially schedules CPU0], then this is normal; otherwise there may be a problem.

[ This post was last edited by AIXHP on 2007-12-4 16:16 ]
Author: AIXHP    Time: 2007-12-04 16:21
Originally posted by wenaideyu on 2007-12-4 15:17
Is the Enable Kernel Irq balancing option selected in your kernel config? I tested this under 2.6.20: with that option selected, and after then modifying /proc/irq/??/smp_affinity,
interrupt load balancing works. If it is not selected, the CPUs will not balance the interrupt load, no matter how you change it under /proc/irq/ ...

Does Enable Kernel Irq balancing control the working mode of the interrupt controller, taking effect when the interrupt controller is initialized?

[ This post was last edited by AIXHP on 2007-12-4 16:22 ]
Author: platinum    Time: 2007-12-18 23:13
After the NIC driver receives an Ethernet frame, it hands it to the upper layers through the interface function netif_receive_skb().
Brother seeker's approach does its work inside ip_rcv(), adding special dispatching there, but that is ineffective in bridge mode.
Which raises a question: could the dispatching be done directly in netif_receive_skb(), so everything is sorted out before the frame is handed upward?
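
(An editor's sketch of that idea, and essentially what seeker's version 2 code further down does: netif_receive_skb() is split into a thin top part that only picks a CPU and queues the skb, plus __netif_recv_skb(), which performs the old protocol delivery. pick_cpu() is the flow hash sketched earlier; queue_bs_work_on() is a hypothetical helper standing in for queuing bs_works on the chosen CPU's keventd thread. Because this runs before protocol demultiplexing, bridged frames are covered as well. Illustration only.)

int netif_receive_skb(struct sk_buff *skb)
{
        int cpu = pick_cpu(skb, num_online_cpus());

        if (cpu == smp_processor_id())
                return __netif_recv_skb(skb);            /* already on the right CPU, no hand-off */

        /* hand the skb to the chosen CPU's queue and kick its work item */
        skb_queue_tail(&per_cpu(bs_cpu_queues, cpu), skb);
        queue_bs_work_on(cpu);                           /* hypothetical: schedule bs_func() on that CPU */
        return 0;
}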
Author: 思一克    Time: 2007-12-19 12:43
I already did that long ago.

In netif_receive_skb, and not just for IP.

I will post it in a reply shortly.

Originally posted by platinum on 2007-12-18 23:13
After the NIC driver receives an Ethernet frame, it hands it to the upper layers through the interface function netif_receive_skb().
Brother seeker's approach does its work inside ip_rcv(), adding special dispatching there, but that is ineffective in bridge mode.
Which raises a question: could the dispatching be done directly in neti ...

Author: 思一克    Time: 2007-12-19 12:47
This is the second version of BS.

It supports more than just IP; netif_receive_skb itself has been made parallel.

/*
*  BOTTOM_SOFTIRQ_NET
*              An implementation of bottom softirq concurrent execution on SMP.
*              This is implemented by splitting current net softirq into top half
*              and bottom half and dispatching the bottom half to each cpu's workqueue
*              Hopefully, it can raise the throughput of the NIC when running iptables
*              on SMP machine.
*
*              This is BS version 2 (BS2); it makes SMP parallelization work for all
*              other protocols besides ipv4, for example bridge, packet raw, etc.
*
*  Version:    $Id: bs_smp.c, v1.0 for kernel versions:
*              2.6.13-15 for kernel 2.6.13-15-smp. fully tested
*              the other versions need more testing.
*
*  Authors:    John Ye & Qianyu Ye, 2007.08.27
*/

/* user must select one of the following versions. no guarantee to work.
#if LINUX_VERSION_CODE > KERNEL_VERSION(2, 6, 0), this is better.
*/

/*
#define KERNEL_VERSION_2_6_13                //2.6.13-15 OK
#define KERNEL_VERSION_2_6_16__                //2.6.16.53 OK
#define KERNEL_VERSION_2_6_17__                //2.6.17.9 #3 OK
#define KERNEL_VERSION_2_6_18__                //2.6.18.8 & 2.6.18.2-34  OK
#define KERNEL_VERSION_2_6_19__                //2.6.19 #1 OK
#define KERNEL_VERSION_2_6_20__                //2.6.20 OK
#define KERNEL_VERSION_2_6_21__                //2.6.21.1 OK
#define KERNEL_VERSION_2_6_22__                //2.6.22.5 OK
#define KERNEL_VERSION_2_6_23__                //2.6.23-rc8 OK
*/

/*
# Makefile for kernel version 2.6.x.
ifneq ($(KERNELRELEASE),)
debug-objs := bs_smp.o
obj-m := bs_smp.o
CFLAGS += -w -Wimplicit-function-declaration
else
PWD  := $(shell pwd)
KVER ?= $(shell uname -r)
KDIR := /lib/modules/$(KVER)/build
all: $(MAKE) -C $(KDIR) M=$(PWD)
clean: rm -rf .*.cmd *.o *.mod.c *.ko .tmp_versions
endif
*/
#include <linux/version.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,13) && LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,18)
#include <linux/config.h>
#endif

#include <asm/debugreg.h>
#include <asm/desc.h>
#include <asm/i387.h>
#include <asm/ldt.h>
#include <asm/pgtable.h>
#include <asm/processor.h>
#include <asm/system.h>
#include <asm/uaccess.h>
#include <asm/unaligned.h>
#include <linux/aio.h>
#include <linux/backing-dev.h>
#include <linux/bio.h>
#include <linux/buffer_head.h>
#include <linux/delay.h>
#include <linux/device.h>
#include <linux/errno.h>
#include <linux/etherdevice.h>
#include <linux/fs.h>
#include <linux/highmem.h>
#include <linux/in.h>
#include <linux/inet.h>
#include <linux/inetdevice.h>
#include <linux/init.h>
#include <linux/input.h>
#include <linux/interrupt.h>
#include <linux/ipsec.h>
#include <linux/kernel.h>
#include <linux/kmod.h>
#include <linux/list.h>
#include <linux/major.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/mroute.h>
#include <linux/net.h>
#include <linux/netdevice.h>
#include <linux/netfilter_ipv4.h>
#include <linux/netlink.h>
#include <linux/pagemap.h>
#include <linux/pm.h>
#include <linux/poll.h>
#include <linux/proc_fs.h>
#include <linux/ptrace.h>
#include <linux/random.h>
#include <linux/romfs_fs.h>
#include <linux/sched.h>
#include <linux/security.h>
#include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/smp.h>
#include <linux/smp_lock.h>
#include <linux/socket.h>
#include <linux/sockios.h>
#include <linux/string.h>
#include <linux/swap.h>
#include <linux/sysctl.h>
#include <linux/types.h>
#include <linux/user.h>
#include <linux/vfs.h>
#include <linux/workqueue.h>
#include <net/arp.h>
#include <net/checksum.h>
#include <net/icmp.h>
#include <net/inet_common.h>
#include <net/ip.h>
#include <net/protocol.h>
#include <net/raw.h>
#include <net/route.h>
#include <net/snmp.h>
#include <net/sock.h>
#include <net/tcp.h>
#include <net/xfrm.h>

//form dev.c
#include <asm/uaccess.h>
#include <asm/system.h>
#include <linux/bitops.h>
#include <linux/cpu.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/socket.h>
#include <linux/sockios.h>
#include <linux/errno.h>
#include <linux/interrupt.h>
#include <linux/if_ether.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/notifier.h>
#include <linux/skbuff.h>
#include <net/sock.h>
#include <linux/rtnetlink.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/stat.h>
#include <linux/if_bridge.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,13) && LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,19)
#include <linux/divert.h>
#endif

#include <net/dst.h>
#include <net/pkt_sched.h>
#include <net/checksum.h>
#include <linux/highmem.h>
#include <linux/init.h>
#include <linux/kmod.h>
#include <linux/module.h>
#include <linux/kallsyms.h>
//#include <linux/netpoll.h>
#include <linux/rcupdate.h>
#include <linux/delay.h>

#include <linux/ip.h>                //johnye

#ifdef CONFIG_NET_RADIO
#include <linux/wireless.h>                /* Note : will define WIRELESS_EXT */
#include <net/iw_handler.h>
#endif        /* CONFIG_NET_RADIO */
#include <asm/current.h>


#define CONFIG_BOTTOM_SOFTIRQ_MODULE
//#undef CONFIG_NET_CLS_ACT  //testing only.

#define TAPFUNC  "netif_receive_skb"

static int (*p_tapped)();

static spinlock_t *p_ptype_lock;
static struct list_head *p_ptype_base;            /* 16 way hashed list */
#define ptype_base p_ptype_base
static struct list_head *p_ptype_all;                /* Taps */
#define ptype_all (*p_ptype_all)

static struct workqueue_struct **Pkeventd_wq;  //why is this not the same as in 2.6.13?
#define keventd_wq (*Pkeventd_wq)

//this is a little tricky. __netpoll_rx lives in net/core/netpoll.c; netpoll_rx() in netpoll.h reaches it through the pointer below
static int (*p__netpoll_rx)(struct sk_buff *skb);
#define __netpoll_rx (*p__netpoll_rx)
#include <linux/netpoll.h>  //it use __netpoll_rx, __netpoll_rx is in net/core/netpoll.c

/* When > 0 there are consumers of rx skb time stamps */
static atomic_t *p_netstamp_needed; // = ATOMIC_INIT(0);
#define netstamp_needed (*p_netstamp_needed)


#ifdef CONFIG_NET_CLS_ACT
static int (*p_ing_filter)(struct sk_buff *skb);
//#define ing_filter (*p_ing_filter)
#endif

extern DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat); // = { 0, };

static void (*p__queue_work)(struct cpu_workqueue_struct *cwq, struct work_struct *work);
#define __queue_work (*p__queue_work)

void (*p_ip_rcv)();

static struct {
        void *feed;
        char *symb;
} ___vars[] = {
        { &p_ptype_lock, "ptype_lock" },
        { &p_ptype_base, "ptype_base" },
        { &p_ptype_all, "ptype_all" },
        { &Pkeventd_wq, "keventd_wq" },
        { &p__queue_work, "__queue_work" },
        { &p__netpoll_rx, "__netpoll_rx" },
        { &p_netstamp_needed, "netstamp_needed" },
#ifdef CONFIG_NET_CLS_ACT
        { &p_ing_filter, "ing_filter" },
#endif
        { 0, 0 }       
};
       
/*
        if(!(p_ptype_lock = sysmap_name2addr("ptype_lock"))) return 1;
        if(!(p_ptype_base = sysmap_name2addr("ptype_base"))) return 1;
        if(!(p_ptype_all = sysmap_name2addr("ptype_all"))) return 1;
        if(!(Pkeventd_wq = sysmap_name2addr("keventd_wq"))) return 1;
        if(!(p__queue_work = sysmap_name2addr("__queue_work"))) return 1;
        if(!(p__netpoll_rx = sysmap_name2addr("__netpoll_rx"))) return 1;
        if(!(p_netstamp_needed = sysmap_name2addr("netstamp_needed"))) return 1;
        if(!(p_ing_filter = sysmap_name2addr("ing_filter"))) return 1;
*/

#define PATCH_START net/core/dev.c
#define CONFIG_BOTTOM_SOFTIRQ_SMP
#define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL

//#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP

/*
[PATCH: 2.6.13-15-SMP 1/2] network: concurrently run softirq network code on SMP
Bottom Softirq Implementation. John Ye, 2007.08.27

Why this patch:
Make kernel be able to concurrently execute softirq's net code on SMP system.
Take full advantages of SMP to handle more packets and greatly raises NIC throughput.
The current kernel's net packet processing logic is:
1) The CPU which handles a hardirq must be executing its related softirq.
2) One softirq instance(irqs handled by 1 CPU) can't be executed on more than 2 CPUs
at the same time.
The limitation make kernel network be hard to take the advantages of SMP.

How this patch:
It splits the current softirq code into 2 parts: the cpu-sensitive top half,
and the cpu-insensitive bottom half, then make the bottom half (called BS) be
executed on SMP concurrently.
The two parts are not equal in terms of size and load. Top part has constant code
size(mainly, in net/core/dev.c and NIC drivers), while bottom part involves
netfilter(iptables) whose load varies very much. An iptables with 1000 rules to match
will make the bottom part's load be very high. So, if the bottom part softirq
can be distributed to processors and run concurrently on them, the network will
gain much more packet handling capacity, network throughput will be increased
remarkably.

Where useful:
It's useful on SMP machines that meet the following 2 conditions:
1) have high kernel network load, for example, running iptables with thousands of rules, etc).
2) have more CPUs than active NICs, e.g. a 4 CPUs machine with 2 NICs).
On these system, with the increase of softirq load, some CPUs will be idle
while others(number is equal to # of NIC) keeps busy.
IRQBALANCE will help, but it only shifts IRQ among CPUS, makes no softirq concurrency.
Balancing the load of each cpus will not remarkably increase network speed.

Where NOT useful:
If the bottom half of softirq is too small(without running iptables), or the network
is too idle, BS patch will not be seen to have visible effect. But It has no
negative affect either.
User can turn off BS functionality by set /proc/sys/net/bs_policy value to 0.

How to test:
On a linux box, run iptables, add 2000 rules to table filter & table nat to simulate huge
softirq load. Then, open 20 ftp sessions to download big file. On another machine(who
use this test machine as gateway), open 20 more ftp download sessions. Compare the speed,
without BS enabled, and with BS enabled.
cat /proc/sys/net/bs_policy. 1 for flow dispatch, 2 random dispatch. 0 no dispatch.
cat /proc/sys/net/bs_status. this shows the usage of each CPUs
Test shown that when bottom softirq load is high, the network throughput can be nearly
doubled on 2 CPUs machine. hopefully it may be quadrupled on a 4 cpus linux box.

Bugs:
It will NOT allow hotplug CPU.
It only allows incremental CPUs ids, starting from 0 to num_online_cpus().
for example, 0,1,2,3 is OK. 0,1,8,9 is KO.

Some considerations in the future:
1) With BS patch, the irq balance code on arch/i386/kernel/io_apic.c seems no need any more,
at least not for network irq.
2) Softirq load will become very small. It only run the top half of old softirq, which
is much less expensive than bottom half---the netfilter program.
To let top softirq process more packets, can these 3 network parameters be given a larger value?
extern int netdev_max_backlog = 1000;
extern int netdev_budget = 300;
extern int weight_p = 64;
3) Now, BS are running on built-in keventd thread, we can create new workqueues to let it run on?

Signed-off-by: John Ye (Seeker) <[email]johny@webizmail.com[/email]>
*/

#define CBPTR( skb ) (*((void **)(skb->cb)))
#define BS_USE_PERCPU_DATA
struct cpu_stat
{
        unsigned long irqs;                       //total irqs
        unsigned long dids;                       //I did,
        unsigned long works;
};
#define BS_CPU_STAT_DEFINED

static int nr_cpus = 0;

#define BS_POL_LINK     1
#define BS_POL_RANDOM   2
int bs_policy = BS_POL_LINK;

static DEFINE_PER_CPU(struct sk_buff_head, bs_cpu_queues);
static DEFINE_PER_CPU(struct work_struct, bs_works);
//static DEFINE_PER_CPU(struct cpu_stat, bs_cpu_status);
struct cpu_stat bs_cpu_status[NR_CPUS];

//static int __netif_recv_skb(struct sk_buff *skb, struct net_device *odev);
static int __netif_recv_skb(struct sk_buff *skb);

static void bs_func(void *data)
{
        int num, cpu;
        struct sk_buff *skb;
        struct work_struct *bs_works;
        struct sk_buff_head *q;
        cpu = smp_processor_id();

        bs_works = &per_cpu(bs_works, cpu);
        q = &per_cpu(bs_cpu_queues, cpu);

        restart:
        num = 0;
        while(1)
        {
                spin_lock(&q->lock);
                if(!(skb = __skb_dequeue(q))) {
                        spin_unlock(&q->lock);
                        break;
                }
                spin_unlock(&q->lock);
                num++;

                local_bh_disable();
                __netif_recv_skb(skb);
                local_bh_enable();      // sub_preempt_count(SOFTIRQ_OFFSET - 1);
        }

        bs_cpu_status[cpu].dids += num;
        if(num > 8) printk("%d on cpu %d\n", num, cpu);
        if(num > 0) goto restart;

        bs_works->func = 0;

        return;
}


#undef PATCH_START

#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,13)
/* COPY_IN_START_FROM kernel/workqueue.c */
struct cpu_workqueue_struct
{
        spinlock_t lock;

        long remove_sequence;                     /* Least-recently added (next to run) */
        long insert_sequence;                     /* Next to add */

        struct list_head worklist;
        wait_queue_head_t more_work;
        wait_queue_head_t work_done;

        struct workqueue_struct *wq;
        struct task_struct *thread;                //task_t if 2.6.13

        int run_depth;                            /* Detect run_workqueue() recursion depth */
} ____cacheline_aligned;

struct workqueue_struct
{
        struct cpu_workqueue_struct cpu_wq[NR_CPUS];
        const char *name;
        struct list_head list;                    /* Empty if single thread */
};
//extern struct workqueue_struct *keventd_wq;
#endif


#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,16) && LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,19)
struct cpu_workqueue_struct {

        spinlock_t lock;

        long remove_sequence;        /* Least-recently added (next to run) */
        long insert_sequence;        /* Next to add */

        struct list_head worklist;
        wait_queue_head_t more_work;
        wait_queue_head_t work_done;

        struct workqueue_struct *wq;
        struct task_struct *thread;

        int run_depth;                /* Detect run_workqueue() recursion depth */
} ____cacheline_aligned;

/*
* The externally visible workqueue abstraction is an array of
* per-CPU workqueues:
*/
struct workqueue_struct {
        struct cpu_workqueue_struct *cpu_wq;
        const char *name;
        struct list_head list;         /* Empty if single thread */
};

#endif

#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,20) || LINUX_VERSION_CODE == KERNEL_VERSION(2,6,21)
struct cpu_workqueue_struct {

        spinlock_t lock;

        long remove_sequence;        /* Least-recently added (next to run) */
        long insert_sequence;        /* Next to add */

        struct list_head worklist;
        wait_queue_head_t more_work;
        wait_queue_head_t work_done;

        struct workqueue_struct *wq;
        struct task_struct *thread;

        int run_depth;                /* Detect run_workqueue() recursion depth */

        int freezeable;                /* Freeze the thread during suspend */
} ____cacheline_aligned;

/*
* The externally visible workqueue abstraction is an array of
* per-CPU workqueues:
*/
struct workqueue_struct {
        struct cpu_workqueue_struct *cpu_wq;
        const char *name;
        struct list_head list;         /* Empty if single thread */
};
#endif

#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,22) || LINUX_VERSION_CODE == KERNEL_VERSION(2,6,23)
struct cpu_workqueue_struct {

        spinlock_t lock;

        struct list_head worklist;
        wait_queue_head_t more_work;
        struct work_struct *current_work;

        struct workqueue_struct *wq;
        struct task_struct *thread;

        int run_depth;                /* Detect run_workqueue() recursion depth */
} ____cacheline_aligned;

struct workqueue_struct {
        struct cpu_workqueue_struct *cpu_wq;
        struct list_head list;
        const char *name;
        int singlethread;
        int freezeable;                /* Freeze threads during suspend */
};

#endif


#define PATCH_START

#ifndef CONFIG_BOTTOM_SOFTIRQ_MODULE
extern void __queue_work(struct cpu_workqueue_struct *cwq, struct work_struct *work);
extern struct workqueue_struct *keventd_wq;
#endif
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>

static inline int bs_dispatch(struct sk_buff *skb)
{

#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP
       
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,22)
        struct iphdr *iph = ip_hdr(skb);
#else       
        struct iphdr *iph = skb->nh.iph;
#endif
        if(!nr_cpus)
                nr_cpus = num_online_cpus();

        /*
        struct tcphdr {
                __u16        source;
                __u16        dest;
                __u32        seq;
        };
        */
        if(bs_policy && nr_cpus > 1) { // && iph->protocol != IPPROTO_ICMP) {
        //if(bs_policy && nr_cpus > 1 && iph->protocol == IPPROTO_ICMP) { //test on icmp first
                unsigned int cur, cpu;
                struct work_struct *bs_works;
                struct sk_buff_head *q;
       
                cpu = cur = smp_processor_id();

                bs_cpu_status[cur].irqs++;

                //good point for Jamal. thanks no reordering
                if(bs_policy == BS_POL_LINK) {
                        int seed = 0;
                        if(iph->protocol == IPPROTO_TCP || iph->protocol == IPPROTO_UDP) {
                                struct tcphdr *th = (struct tcphdr*)(iph + 1);  //udp is same as tcp
                                seed = ntohs(th->source) + ntohs(th->dest);
                        }
                        cpu = (iph->saddr + iph->daddr + seed) % nr_cpus;

                        /*
                        if(net_ratelimit() && iph->protocol == IPPROTO_TCP) {
                                struct tcphdr *th = iph + 1;
                       
                                 printk("seed %u (%u %u) cpu %d. source %d dest %d\n",
                                        seed, iph->saddr + iph->daddr, iph->saddr + iph->daddr + seed, cpu,
                                        ntohs(th->source), ntohs(th->dest));
                        }
                        */
                } else
                //random distribute
                if(bs_policy == BS_POL_RANDOM)
                        cpu = (bs_cpu_status[cur].irqs % nr_cpus);

                //cpu = cur;
                //cpu = (cur? 0: 1);

                if(cpu == cur) {
                        bs_cpu_status[cpu].dids++;
                        return __netif_recv_skb(skb);
                }

                q = &per_cpu(bs_cpu_queues, cpu);
       
                if(!q->next) {
                        skb_queue_head_init(q);
                }
       
                spin_lock(&q->lock);
                __skb_queue_tail(q, skb);
                spin_unlock(&q->lock);
       
                bs_works = &per_cpu(bs_works, cpu);
                    if (!bs_works->func) {
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,16)                       
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,19)                       
                        INIT_WORK(bs_works, bs_func, 0);
#else       
                        INIT_WORK(bs_works, bs_func);
#endif
                        bs_cpu_status[cpu].works++;
                        preempt_disable();
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,20)                       
                        set_bit(WORK_STRUCT_PENDING, work_data_bits(bs_works));
#endif
                        __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), bs_works);
                        preempt_enable();
#else
                        INIT_WORK(bs_works, bs_func, q);
                        bs_cpu_status[cpu].works++;
                        preempt_disable();
                        __queue_work(keventd_wq->cpu_wq + cpu,         bs_works);
                        preempt_enable();
#endif
                               
                }

        } else {

                bs_cpu_status[smp_processor_id()].dids++;
                return __netif_recv_skb(skb);
        }       
        return 0;
#else
        return __netif_recv_skb(skb);
#endif
}
#undef PATCH_START



#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,13)

#ifdef CONFIG_NET_CLS_ACT
static int ing_filter(struct sk_buff *skb)
{
        struct Qdisc *q;
        struct net_device *dev = skb->dev;
        int result = TC_ACT_OK;
       
        if (dev->qdisc_ingress) {
                __u32 ttl = (__u32) G_TC_RTTL(skb->tc_verd);
                if (MAX_RED_LOOP < ttl++) {
                        printk("Redir loop detected Dropping packet (%s->%s)\n",
                                skb->input_dev?skb->input_dev->name:"??",skb->dev->name);
                        return TC_ACT_SHOT;
                }

                skb->tc_verd = SET_TC_RTTL(skb->tc_verd,ttl);

                skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_INGRESS);
                if (NULL == skb->input_dev) {
                        skb->input_dev = skb->dev;
                        printk("ing_filter:  fixed  %s out %s\n",skb->input_dev->name,skb->dev->name);
                }
                spin_lock(&dev->ingress_lock);
                if ((q = dev->qdisc_ingress) != NULL)
                        result = q->enqueue(skb, q);
                spin_unlock(&dev->ingress_lock);

        }

        return result;
}
#endif
static inline void net_timestamp(struct timeval *stamp)
{
        if (atomic_read(&netstamp_needed))
                do_gettimeofday(stamp);
        else {
                stamp->tv_sec = 0;
                stamp->tv_usec = 0;
        }
}

static __inline__ int deliver_skb(struct sk_buff *skb,
                                  struct packet_type *pt_prev)
{
        atomic_inc(&skb->users);
        return pt_prev->func(skb, skb->dev, pt_prev);
}
#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb);
struct net_bridge;
struct net_bridge_fdb_entry *(*br_fdb_get_hook)(struct net_bridge *br,
                                                unsigned char *addr);
void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent);

static __inline__ int handle_bridge(struct sk_buff **pskb,
                                    struct packet_type **pt_prev, int *ret)
{
        struct net_bridge_port *port;

        if ((*pskb)->pkt_type == PACKET_LOOPBACK ||
            (port = rcu_dereference((*pskb)->dev->br_port)) == NULL)
                return 0;

        if (*pt_prev) {
                *ret = deliver_skb(*pskb, *pt_prev);
                *pt_prev = NULL;
        }
       
        return br_handle_frame_hook(port, pskb);
}
#else
#define handle_bridge(skb, pt_prev, ret)        (0)
#endif

static __inline__ void skb_bond(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;

        if (dev->master) {
                skb->real_dev = skb->dev;
                skb->dev = dev->master;
        }
}

int REP_netif_receive_skb(struct sk_buff *skb)
{
        //struct packet_type *ptype, *pt_prev;
        //int ret = NET_RX_DROP;
        //unsigned short type;

        /* if we've gotten here through NAPI, check netpoll */
        if (skb->dev->poll && netpoll_rx(skb))
                return NET_RX_DROP;

        if (!skb->stamp.tv_sec)
                net_timestamp(&skb->stamp);

        skb_bond(skb);

        //__get_cpu_var(netdev_rx_stat).total++;

        skb->h.raw = skb->nh.raw = skb->data;
        skb->mac_len = skb->nh.raw - skb->mac.raw;

        return bs_dispatch(skb);
}


int __netif_recv_skb(struct sk_buff *skb)
{
        struct packet_type *ptype, *pt_prev;
        int ret = NET_RX_DROP;
        unsigned short type;
       
        pt_prev = NULL;

        rcu_read_lock();

        if(CBPTR(skb))
                printk("+\n");
       
#ifdef CONFIG_NET_CLS_ACT
        if (skb->tc_verd & TC_NCLS) {
                skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
                goto ncls;
        }
#endif

        //packet tap
        list_for_each_entry_rcu(ptype, &ptype_all, list) {
                if (!ptype->dev || ptype->dev == skb->dev) {
                        if (pt_prev) {
                                ret = deliver_skb(skb, pt_prev);
                        }
                        pt_prev = ptype;
                }
        }

#ifdef CONFIG_NET_CLS_ACT
        if (pt_prev) {
                ret = deliver_skb(skb, pt_prev);
                pt_prev = NULL; /* noone else should process this after*/
        } else {
                skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
        }

        ret = ing_filter(skb);

        if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
                kfree_skb(skb);
                goto out;
        }

        skb->tc_verd = 0;
ncls:
#endif

        handle_diverter(skb);

        if (handle_bridge(&skb, &pt_prev, &ret))
                goto out;

        type = skb->protocol;
        list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
                if (ptype->type == type &&
                    (!ptype->dev || ptype->dev == skb->dev)) {
                        if (pt_prev) {
                                ret = deliver_skb(skb, pt_prev); //increase skb->users
                        }
                        pt_prev = ptype;
                }
        }

        if (pt_prev) {
                ret = pt_prev->func(skb, skb->dev, pt_prev);
        } else {
                kfree_skb(skb);
                /* Jamal, now you will not able to escape explaining
                 * me how you were going to use this. :-)
                 */
                ret = NET_RX_DROP;
        }

out:
        rcu_read_unlock();

        return ret;
}
#endif /* LINUX_VERSION_CODE == KERNEL_VERSION(2,6,13) */


#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,16)

#ifdef CONFIG_NET_CLS_ACT
/* TODO: Maybe we should just force sch_ingress to be compiled in
* when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
* a compare and 2 stores extra right now if we dont have it on
* but have CONFIG_NET_CLS_ACT
* NOTE: This doesnt stop any functionality; if you dont have
* the ingress scheduler, you just cant add policies on ingress.
*
*/
static int ing_filter(struct sk_buff *skb)
{
        struct Qdisc *q;
        struct net_device *dev = skb->dev;
        int result = TC_ACT_OK;
       
        if (dev->qdisc_ingress) {
                __u32 ttl = (__u32) G_TC_RTTL(skb->tc_verd);
                if (MAX_RED_LOOP < ttl++) {
                        printk("Redir loop detected Dropping packet (%d->%d)\n",
                                skb->iif, skb->dev->ifindex);
                        return TC_ACT_SHOT;
                }

                skb->tc_verd = SET_TC_RTTL(skb->tc_verd,ttl);

                skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_INGRESS);

                spin_lock(&dev->queue_lock);
                if ((q = dev->qdisc_ingress) != NULL)
                        result = q->enqueue(skb, q);
                spin_unlock(&dev->queue_lock);

        }

        return result;
}
#endif
static __inline__ int deliver_skb(struct sk_buff *skb,
                                  struct packet_type *pt_prev,
                                  struct net_device *orig_dev)
{
        atomic_inc(&skb->users);
        return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}
static inline void net_timestamp(struct sk_buff *skb)
{
        if (atomic_read(&netstamp_needed))
                __net_timestamp(skb);
        else {
                skb->tstamp.off_sec = 0;
                skb->tstamp.off_usec = 0;
        }
}

#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb);
struct net_bridge;
struct net_bridge_fdb_entry *(*br_fdb_get_hook)(struct net_bridge *br,
                                                unsigned char *addr);
void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent);

static __inline__ int handle_bridge(struct sk_buff **pskb,
                                    struct packet_type **pt_prev, int *ret,
                                    struct net_device *orig_dev)
{
        struct net_bridge_port *port;

        if ((*pskb)->pkt_type == PACKET_LOOPBACK ||
            (port = rcu_dereference((*pskb)->dev->br_port)) == NULL)
                return 0;

        if (*pt_prev) {
                *ret = deliver_skb(*pskb, *pt_prev, orig_dev);
                *pt_prev = NULL;
        }
       
        return br_handle_frame_hook(port, pskb);
}
#else
#define handle_bridge(skb, pt_prev, ret, orig_dev)        (0)
#endif
static inline struct net_device *skb_bond(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;

        if (dev->master)
                skb->dev = dev->master;

        return dev;
}
int REP_netif_receive_skb(struct sk_buff *skb)
{
        struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        int ret = NET_RX_DROP;
        unsigned short type;

        /* if we've gotten here through NAPI, check netpoll */
        if (skb->dev->poll && netpoll_rx(skb))
                return NET_RX_DROP;

        if (!skb->tstamp.off_sec)
                net_timestamp(skb);

        if (!skb->iif)
                skb->iif = skb->dev->ifindex;

        orig_dev = skb_bond(skb);

        //__get_cpu_var(netdev_rx_stat).total++;

        skb->h.raw = skb->nh.raw = skb->data;
        skb->mac_len = skb->nh.raw - skb->mac.raw;

        CBPTR(skb) = orig_dev;
        return bs_dispatch(skb);
}

static int __netif_recv_skb(struct sk_buff *skb)
{
        struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        int ret = NET_RX_DROP;
        unsigned short type;

        orig_dev = CBPTR(skb);
        CBPTR(skb) = 0;
       
        pt_prev = NULL;

        rcu_read_lock();

#ifdef CONFIG_NET_CLS_ACT
        if (skb->tc_verd & TC_NCLS) {
                skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
                goto ncls;
        }
#endif

        list_for_each_entry_rcu(ptype, &ptype_all, list) {
                if (!ptype->dev || ptype->dev == skb->dev) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

#ifdef CONFIG_NET_CLS_ACT
        if (pt_prev) {
                ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = NULL; /* noone else should process this after*/
        } else {
                skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
        }

        ret = ing_filter(skb);

        if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
                kfree_skb(skb);
                goto out;
        }

        skb->tc_verd = 0;
ncls:
#endif

        handle_diverter(skb);

        if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
                goto out;

        type = skb->protocol;
        list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
                if (ptype->type == type &&
                    (!ptype->dev || ptype->dev == skb->dev)) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

        if (pt_prev) {
                ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
        } else {
                kfree_skb(skb);
                /* Jamal, now you will not able to escape explaining
                 * me how you were going to use this. :-)
                 */
                ret = NET_RX_DROP;
        }

out:
        rcu_read_unlock();
        return ret;
}
#endif


#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,17)

#ifdef CONFIG_NET_CLS_ACT
static int ing_filter(struct sk_buff *skb)
{
        struct Qdisc *q;
        struct net_device *dev = skb->dev;
        int result = TC_ACT_OK;
       
        if (dev->qdisc_ingress) {
                __u32 ttl = (__u32) G_TC_RTTL(skb->tc_verd);
                if (MAX_RED_LOOP < ttl++) {
                        printk("Redir loop detected Dropping packet (%s->%s)\n",
                                skb->input_dev->name, skb->dev->name);
                        return TC_ACT_SHOT;
                }

                skb->tc_verd = SET_TC_RTTL(skb->tc_verd,ttl);

                skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_INGRESS);

                spin_lock(&dev->ingress_lock);
                if ((q = dev->qdisc_ingress) != NULL)
                        result = q->enqueue(skb, q);
                spin_unlock(&dev->ingress_lock);

        }

        return result;
}
#endif

static inline void net_timestamp(struct sk_buff *skb)
{
        if (atomic_read(&netstamp_needed))
                __net_timestamp(skb);
        else {
                skb->tstamp.off_sec = 0;
                skb->tstamp.off_usec = 0;
        }
}
static __inline__ int deliver_skb(struct sk_buff *skb,
                                  struct packet_type *pt_prev,
                                  struct net_device *orig_dev)
{
        atomic_inc(&skb->users);
        return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb);
struct net_bridge;
struct net_bridge_fdb_entry *(*br_fdb_get_hook)(struct net_bridge *br,
                                                unsigned char *addr);
void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent);

static __inline__ int handle_bridge(struct sk_buff **pskb,
                                    struct packet_type **pt_prev, int *ret,
                                    struct net_device *orig_dev)
{
        struct net_bridge_port *port;

        if ((*pskb)->pkt_type == PACKET_LOOPBACK ||
            (port = rcu_dereference((*pskb)->dev->br_port)) == NULL)
                return 0;

        if (*pt_prev) {
                *ret = deliver_skb(*pskb, *pt_prev, orig_dev);
                *pt_prev = NULL;
        }
       
        return br_handle_frame_hook(port, pskb);
}
#else
#define handle_bridge(skb, pt_prev, ret, orig_dev)        (0)
#endif
static inline struct net_device *skb_bond(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;

        if (dev->master) {
                /*
                 * On bonding slaves other than the currently active
                 * slave, suppress duplicates except for 802.3ad
                 * ETH_P_SLOW and alb non-mcast/bcast.
                 */
                if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
                        if (dev->master->priv_flags & IFF_MASTER_ALB) {
                                if (skb->pkt_type != PACKET_BROADCAST &&
                                    skb->pkt_type != PACKET_MULTICAST)
                                        goto keep;
                        }

                        if (dev->master->priv_flags & IFF_MASTER_8023AD &&
                            skb->protocol == __constant_htons(ETH_P_SLOW))
                                goto keep;
               
                        kfree_skb(skb);
                        return NULL;
                }
keep:
                skb->dev = dev->master;
        }

        return dev;
}


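/*
 * REP_netif_receive_skb() is installed in place of the kernel receive entry
 * point named by TAPFUNC (see dorepl() below).  It keeps the early part of the
 * original function, stashes orig_dev in the skb control buffer via CBPTR(),
 * and hands the skb to bs_dispatch(), which either finishes the work on the
 * current CPU or queues the skb to another CPU where __netif_recv_skb() runs
 * the remainder.
 */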
int REP_netif_receive_skb(struct sk_buff *skb)
{
        //struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        //int ret = NET_RX_DROP;
        //unsigned short type;

        /* if we've gotten here through NAPI, check netpoll */
        if (skb->dev->poll && netpoll_rx(skb))
                return NET_RX_DROP;

        if (!skb->tstamp.off_sec)
                net_timestamp(skb);

        if (!skb->input_dev)
                skb->input_dev = skb->dev;

        orig_dev = skb_bond(skb);

        if (!orig_dev)
                return NET_RX_DROP;

        //__get_cpu_var(netdev_rx_stat).total++;

        skb->h.raw = skb->nh.raw = skb->data;
        skb->mac_len = skb->nh.raw - skb->mac.raw;

        CBPTR(skb) = orig_dev;
        return bs_dispatch(skb);
}

int __netif_recv_skb(struct sk_buff *skb)
{
        struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        int ret = NET_RX_DROP;
        unsigned short type;
       
        orig_dev = CBPTR(skb);
        CBPTR(skb) = 0;
        pt_prev = NULL;

        rcu_read_lock();

#ifdef CONFIG_NET_CLS_ACT
        if (skb->tc_verd & TC_NCLS) {
                skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
                goto ncls;
        }
#endif

        list_for_each_entry_rcu(ptype, &ptype_all, list) {
                if (!ptype->dev || ptype->dev == skb->dev) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

#ifdef CONFIG_NET_CLS_ACT
        if (pt_prev) {
                ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = NULL; /* noone else should process this after*/
        } else {
                skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
        }

        ret = ing_filter(skb);

        if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
                kfree_skb(skb);
                goto out;
        }

        skb->tc_verd = 0;
ncls:
#endif

        handle_diverter(skb);

        if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
                goto out;

        type = skb->protocol;
        list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
                if (ptype->type == type &&
                    (!ptype->dev || ptype->dev == skb->dev)) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

        if (pt_prev) {
                ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
        } else {
                kfree_skb(skb);
                /* Jamal, now you will not able to escape explaining
                 * me how you were going to use this. :-)
                 */
                ret = NET_RX_DROP;
        }

out:
        rcu_read_unlock();
        return ret;
}

#endif  /* 2.6.17 */

#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,18)

#ifdef CONFIG_NET_CLS_ACT
static int ing_filter(struct sk_buff *skb)
{
        struct Qdisc *q;
        struct net_device *dev = skb->dev;
        int result = TC_ACT_OK;
       
        if (dev->qdisc_ingress) {
                __u32 ttl = (__u32) G_TC_RTTL(skb->tc_verd);
                if (MAX_RED_LOOP < ttl++) {
                        printk(KERN_WARNING "Redir loop detected Dropping packet (%s->%s)\n",
                                skb->input_dev->name, skb->dev->name);
                        return TC_ACT_SHOT;
                }

                skb->tc_verd = SET_TC_RTTL(skb->tc_verd,ttl);

                skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_INGRESS);

                spin_lock(&dev->ingress_lock);
                if ((q = dev->qdisc_ingress) != NULL)
                        result = q->enqueue(skb, q);
                spin_unlock(&dev->ingress_lock);

        }

        return result;
}
#endif
static inline void net_timestamp(struct sk_buff *skb)
{
        if (atomic_read(&netstamp_needed))
                __net_timestamp(skb);
        else {
                skb->tstamp.off_sec = 0;
                skb->tstamp.off_usec = 0;
        }
}
static inline struct net_device *skb_bond(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;

        if (dev->master) {
                if (skb_bond_should_drop(skb)) {
                        kfree_skb(skb);
                        return NULL;
                }
                skb->dev = dev->master;
        }

        return dev;
}
static __inline__ int deliver_skb(struct sk_buff *skb,
                                  struct packet_type *pt_prev,
                                  struct net_device *orig_dev)
{
        atomic_inc(&skb->users);
        if(pt_prev->func == p_ip_rcv) {
                printk(".");
        }
        return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb);
struct net_bridge;
struct net_bridge_fdb_entry *(*br_fdb_get_hook)(struct net_bridge *br,
                                                unsigned char *addr);
void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent);

static __inline__ int handle_bridge(struct sk_buff **pskb,
                                    struct packet_type **pt_prev, int *ret,
                                    struct net_device *orig_dev)
{
        struct net_bridge_port *port;

        if ((*pskb)->pkt_type == PACKET_LOOPBACK ||
            (port = rcu_dereference((*pskb)->dev->br_port)) == NULL)
                return 0;

        if (*pt_prev) {
                *ret = deliver_skb(*pskb, *pt_prev, orig_dev);
                *pt_prev = NULL;
        }
       
        return br_handle_frame_hook(port, pskb);
}
#else
#define handle_bridge(skb, pt_prev, ret, orig_dev)        (0)
#endif


int REP_netif_receive_skb(struct sk_buff *skb)
{
        //struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        //int ret = NET_RX_DROP;
        //unsigned short type;

        /* if we've gotten here through NAPI, check netpoll */
        if (skb->dev->poll && netpoll_rx(skb))
                return NET_RX_DROP;

        if (!skb->tstamp.off_sec)
                net_timestamp(skb);

        if (!skb->input_dev)
                skb->input_dev = skb->dev;

        orig_dev = skb_bond(skb);

        if (!orig_dev)
                return NET_RX_DROP;

        //__get_cpu_var(netdev_rx_stat).total++;

        skb->h.raw = skb->nh.raw = skb->data;
        skb->mac_len = skb->nh.raw - skb->mac.raw;

        CBPTR(skb) = orig_dev;
        return bs_dispatch(skb);
}

int __netif_recv_skb(struct sk_buff *skb)
{
        struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        int ret = NET_RX_DROP;
        unsigned short type;

        orig_dev = CBPTR(skb);
        CBPTR(skb) = 0;

        //printk("+");       
        pt_prev = NULL;

        rcu_read_lock();

#ifdef CONFIG_NET_CLS_ACT
        if (skb->tc_verd & TC_NCLS) {
                skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
                goto ncls;
        }
#endif

        list_for_each_entry_rcu(ptype, &ptype_all, list) {
                if (!ptype->dev || ptype->dev == skb->dev) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

#ifdef CONFIG_NET_CLS_ACT
        if (pt_prev) {
                ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = NULL; /* noone else should process this after*/
        } else {
                skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
        }

        ret = ing_filter(skb);

        if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
                kfree_skb(skb);
                goto out;
        }

        skb->tc_verd = 0;
ncls:
#endif

        handle_diverter(skb);

        if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
                goto out;

        type = skb->protocol;
        list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
                if (ptype->type == type &&
                    (!ptype->dev || ptype->dev == skb->dev)) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

        if (pt_prev) {
                ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
                if(pt_prev->func != p_ip_rcv) {
                        //type, ipv4, arp, llc, etc
                        //printk("type %d %p \n", type, pt_prev->func);                       

                }       
        } else {
                kfree_skb(skb);
                /* Jamal, now you will not able to escape explaining
                 * me how you were going to use this. :-)
                 */
                ret = NET_RX_DROP;
        }

out:
        rcu_read_unlock();
        return ret;
}

#endif  /* 2.6.18 */

#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,19)

#ifdef CONFIG_NET_CLS_ACT
static int ing_filter(struct sk_buff *skb)
{
        struct Qdisc *q;
        struct net_device *dev = skb->dev;
        int result = TC_ACT_OK;
       
        if (dev->qdisc_ingress) {
                __u32 ttl = (__u32) G_TC_RTTL(skb->tc_verd);
                if (MAX_RED_LOOP < ttl++) {
                        printk(KERN_WARNING "Redir loop detected Dropping packet (%s->%s)\n",
                                skb->input_dev->name, skb->dev->name);
                        return TC_ACT_SHOT;
                }

                skb->tc_verd = SET_TC_RTTL(skb->tc_verd,ttl);

                skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_INGRESS);

                spin_lock(&dev->ingress_lock);
                if ((q = dev->qdisc_ingress) != NULL)
                        result = q->enqueue(skb, q);
                spin_unlock(&dev->ingress_lock);

        }

        return result;
}
#endif

static inline void net_timestamp(struct sk_buff *skb)
{
        if (atomic_read(&netstamp_needed))
                __net_timestamp(skb);
        else {
                skb->tstamp.off_sec = 0;
                skb->tstamp.off_usec = 0;
        }
}
static inline struct net_device *skb_bond(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;

        if (dev->master) {
                if (skb_bond_should_drop(skb)) {
                        kfree_skb(skb);
                        return NULL;
                }
                skb->dev = dev->master;
        }

        return dev;
}

static __inline__ int deliver_skb(struct sk_buff *skb,
                                  struct packet_type *pt_prev,
                                  struct net_device *orig_dev)
{
        atomic_inc(&skb->users);
        return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb);
struct net_bridge;
struct net_bridge_fdb_entry *(*br_fdb_get_hook)(struct net_bridge *br,
                                                unsigned char *addr);
void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent);

static __inline__ int handle_bridge(struct sk_buff **pskb,
                                    struct packet_type **pt_prev, int *ret,
                                    struct net_device *orig_dev)
{
        struct net_bridge_port *port;

        if ((*pskb)->pkt_type == PACKET_LOOPBACK ||
            (port = rcu_dereference((*pskb)->dev->br_port)) == NULL)
                return 0;

        if (*pt_prev) {
                *ret = deliver_skb(*pskb, *pt_prev, orig_dev);
                *pt_prev = NULL;
        }
       
        return br_handle_frame_hook(port, pskb);
}
#else
#define handle_bridge(skb, pt_prev, ret, orig_dev)        (0)
#endif

int REP_netif_receive_skb(struct sk_buff *skb)
{
        //struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        //int ret = NET_RX_DROP;
        //unsigned short type;

        /* if we've gotten here through NAPI, check netpoll */
        if (skb->dev->poll && netpoll_rx(skb))
                return NET_RX_DROP;

        if (!skb->tstamp.off_sec)
                net_timestamp(skb);

        if (!skb->input_dev)
                skb->input_dev = skb->dev;

        orig_dev = skb_bond(skb);

        if (!orig_dev)
                return NET_RX_DROP;

        //__get_cpu_var(netdev_rx_stat).total++;

        skb->h.raw = skb->nh.raw = skb->data;
        skb->mac_len = skb->nh.raw - skb->mac.raw;


        CBPTR(skb) = orig_dev;
        return bs_dispatch(skb);
}

int __netif_recv_skb(struct sk_buff *skb)
{
        struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        int ret = NET_RX_DROP;
        unsigned short type;

        orig_dev = CBPTR(skb);
        CBPTR(skb) = 0;
       
        pt_prev = NULL;

        rcu_read_lock();

#ifdef CONFIG_NET_CLS_ACT
        if (skb->tc_verd & TC_NCLS) {
                skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
                goto ncls;
        }
#endif

        list_for_each_entry_rcu(ptype, &ptype_all, list) {
                if (!ptype->dev || ptype->dev == skb->dev) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

#ifdef CONFIG_NET_CLS_ACT
        if (pt_prev) {
                ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = NULL; /* noone else should process this after*/
        } else {
                skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
        }

        ret = ing_filter(skb);

        if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
                kfree_skb(skb);
                goto out;
        }

        skb->tc_verd = 0;
ncls:
#endif

        handle_diverter(skb);

        if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
                goto out;

        type = skb->protocol;
        list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
                if (ptype->type == type &&
                    (!ptype->dev || ptype->dev == skb->dev)) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

        if (pt_prev) {
                ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
        } else {
                kfree_skb(skb);
                /* Jamal, now you will not able to escape explaining
                 * me how you were going to use this. :-)
                 */
                ret = NET_RX_DROP;
        }

out:
        rcu_read_unlock();
        return ret;
}

#endif  /* 2.6.19 */

#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,20)
#ifdef CONFIG_NET_CLS_ACT
static int ing_filter(struct sk_buff *skb)
{
        struct Qdisc *q;
        struct net_device *dev = skb->dev;
        int result = TC_ACT_OK;
       
        if (dev->qdisc_ingress) {
                __u32 ttl = (__u32) G_TC_RTTL(skb->tc_verd);
                if (MAX_RED_LOOP < ttl++) {
                        printk(KERN_WARNING "Redir loop detected Dropping packet (%s->%s)\n",
                                skb->input_dev->name, skb->dev->name);
                        return TC_ACT_SHOT;
                }

                skb->tc_verd = SET_TC_RTTL(skb->tc_verd,ttl);

                skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_INGRESS);

                spin_lock(&dev->ingress_lock);
                if ((q = dev->qdisc_ingress) != NULL)
                        result = q->enqueue(skb, q);
                spin_unlock(&dev->ingress_lock);

        }

        return result;
}
#endif
static inline void net_timestamp(struct sk_buff *skb)
{
        if (atomic_read(&netstamp_needed))
                __net_timestamp(skb);
        else {
                skb->tstamp.off_sec = 0;
                skb->tstamp.off_usec = 0;
        }
}
static inline struct net_device *skb_bond(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;

        if (dev->master) {
                if (skb_bond_should_drop(skb)) {
                        kfree_skb(skb);
                        return NULL;
                }
                skb->dev = dev->master;
        }

        return dev;
}
static __inline__ int deliver_skb(struct sk_buff *skb,
                                  struct packet_type *pt_prev,
                                  struct net_device *orig_dev)
{
        atomic_inc(&skb->users);
        return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb);
struct net_bridge;
struct net_bridge_fdb_entry *(*br_fdb_get_hook)(struct net_bridge *br,
                                                unsigned char *addr);
void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent);

static __inline__ int handle_bridge(struct sk_buff **pskb,
                                    struct packet_type **pt_prev, int *ret,
                                    struct net_device *orig_dev)
{
        struct net_bridge_port *port;

        if ((*pskb)->pkt_type == PACKET_LOOPBACK ||
            (port = rcu_dereference((*pskb)->dev->br_port)) == NULL)
                return 0;

        if (*pt_prev) {
                *ret = deliver_skb(*pskb, *pt_prev, orig_dev);
                *pt_prev = NULL;
        }
       
        return br_handle_frame_hook(port, pskb);
}
#else
#define handle_bridge(skb, pt_prev, ret, orig_dev)        (0)
#endif

int REP_netif_receive_skb(struct sk_buff *skb)
{
        //struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        //int ret = NET_RX_DROP;
        //__be16 type;

        /* if we've gotten here through NAPI, check netpoll */
        if (skb->dev->poll && netpoll_rx(skb))
                return NET_RX_DROP;

        if (!skb->tstamp.off_sec)
                net_timestamp(skb);

        if (!skb->input_dev)
                skb->input_dev = skb->dev;

        orig_dev = skb_bond(skb);

        if (!orig_dev)
                return NET_RX_DROP;

        //__get_cpu_var(netdev_rx_stat).total++;

        skb->h.raw = skb->nh.raw = skb->data;
        skb->mac_len = skb->nh.raw - skb->mac.raw;

        CBPTR(skb) = orig_dev;
        return bs_dispatch(skb);
}

int __netif_recv_skb(struct sk_buff *skb)
{
        struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        int ret = NET_RX_DROP;
        __be16 type;
       
        orig_dev = CBPTR(skb);
        CBPTR(skb) = 0;
       
        pt_prev = NULL;

        rcu_read_lock();

#ifdef CONFIG_NET_CLS_ACT
        if (skb->tc_verd & TC_NCLS) {
                skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
                goto ncls;
        }
#endif

        list_for_each_entry_rcu(ptype, &ptype_all, list) {
                if (!ptype->dev || ptype->dev == skb->dev) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

#ifdef CONFIG_NET_CLS_ACT
        if (pt_prev) {
                ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = NULL; /* noone else should process this after*/
        } else {
                skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
        }

        ret = ing_filter(skb);

        if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
                kfree_skb(skb);
                goto out;
        }

        skb->tc_verd = 0;
ncls:
#endif

        if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
                goto out;

        type = skb->protocol;
        list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
                if (ptype->type == type &&
                    (!ptype->dev || ptype->dev == skb->dev)) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

        if (pt_prev) {
                ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
        } else {
                kfree_skb(skb);
                /* Jamal, now you will not able to escape explaining
                 * me how you were going to use this. :-)
                 */
                ret = NET_RX_DROP;
        }

out:
        rcu_read_unlock();
        return ret;
}

#endif  /* 2.6.20 */

#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,21)

#ifdef CONFIG_NET_CLS_ACT
static int ing_filter(struct sk_buff *skb)
{
        struct Qdisc *q;
        struct net_device *dev = skb->dev;
        int result = TC_ACT_OK;

        if (dev->qdisc_ingress) {
                __u32 ttl = (__u32) G_TC_RTTL(skb->tc_verd);
                if (MAX_RED_LOOP < ttl++) {
                        printk(KERN_WARNING "Redir loop detected Dropping packet (%d->%d)\n",
                                skb->iif, skb->dev->ifindex);
                        return TC_ACT_SHOT;
                }

                skb->tc_verd = SET_TC_RTTL(skb->tc_verd,ttl);

                skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_INGRESS);

                spin_lock(&dev->queue_lock);
                if ((q = dev->qdisc_ingress) != NULL)
                        result = q->enqueue(skb, q);
                spin_unlock(&dev->queue_lock);

        }

        return result;
}
#endif

static inline void net_timestamp(struct sk_buff *skb)
{
        if (atomic_read(&netstamp_needed))
                __net_timestamp(skb);
        else {
                skb->tstamp.off_sec = 0;
                skb->tstamp.off_usec = 0;
        }
}
static inline struct net_device *skb_bond(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;

        if (dev->master) {
                if (skb_bond_should_drop(skb)) {
                        kfree_skb(skb);
                        return NULL;
                }
                skb->dev = dev->master;
        }

        return dev;
}
static __inline__ int deliver_skb(struct sk_buff *skb,
                                  struct packet_type *pt_prev,
                                  struct net_device *orig_dev)
{
        atomic_inc(&skb->users);
        return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb);
struct net_bridge;
struct net_bridge_fdb_entry *(*br_fdb_get_hook)(struct net_bridge *br,
                                                unsigned char *addr);
void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent);

static __inline__ int handle_bridge(struct sk_buff **pskb,
                                    struct packet_type **pt_prev, int *ret,
                                    struct net_device *orig_dev)
{
        struct net_bridge_port *port;

        if ((*pskb)->pkt_type == PACKET_LOOPBACK ||
            (port = rcu_dereference((*pskb)->dev->br_port)) == NULL)
                return 0;

        if (*pt_prev) {
                *ret = deliver_skb(*pskb, *pt_prev, orig_dev);
                *pt_prev = NULL;
        }

        return br_handle_frame_hook(port, pskb);
}
#else
#define handle_bridge(skb, pt_prev, ret, orig_dev)        (0)
#endif

int REP_netif_receive_skb(struct sk_buff *skb)
{
        //struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        //int ret = NET_RX_DROP;
        //__be16 type;

        /* if we've gotten here through NAPI, check netpoll */
        if (skb->dev->poll && netpoll_rx(skb))
                return NET_RX_DROP;

        if (!skb->tstamp.off_sec)
                net_timestamp(skb);

        if (!skb->iif)
                skb->iif = skb->dev->ifindex;

        orig_dev = skb_bond(skb);

        if (!orig_dev)
                return NET_RX_DROP;

        //__get_cpu_var(netdev_rx_stat).total++;

        skb->h.raw = skb->nh.raw = skb->data;
        skb->mac_len = skb->nh.raw - skb->mac.raw;

        CBPTR(skb) = orig_dev;
        return bs_dispatch(skb);
}

int __netif_recv_skb(struct sk_buff *skb)
{       
        struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        int ret = NET_RX_DROP;
        __be16 type;

        orig_dev = CBPTR(skb);
        CBPTR(skb) = 0;
       
        pt_prev = NULL;

        rcu_read_lock();

#ifdef CONFIG_NET_CLS_ACT
        if (skb->tc_verd & TC_NCLS) {
                skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
                goto ncls;
        }
#endif

        list_for_each_entry_rcu(ptype, &ptype_all, list) {
                if (!ptype->dev || ptype->dev == skb->dev) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

#ifdef CONFIG_NET_CLS_ACT
        if (pt_prev) {
                ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = NULL; /* noone else should process this after*/
        } else {
                skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
        }

        ret = ing_filter(skb);

        if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
                kfree_skb(skb);
                goto out;
        }

        skb->tc_verd = 0;
ncls:
#endif

        if (handle_bridge(&skb, &pt_prev, &ret, orig_dev))
                goto out;

        type = skb->protocol;
        list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
                if (ptype->type == type &&
                    (!ptype->dev || ptype->dev == skb->dev)) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

        if (pt_prev) {
                ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
        } else {
                kfree_skb(skb);
                /* Jamal, now you will not able to escape explaining
                 * me how you were going to use this. :-)
                 */
                ret = NET_RX_DROP;
        }

out:
        rcu_read_unlock();
        return ret;
}

#endif  /* 2.6.21 */

#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,22)
static int ing_filter(struct sk_buff *skb)
{
        struct Qdisc *q;
        struct net_device *dev = skb->dev;
        int result = TC_ACT_OK;

        if (dev->qdisc_ingress) {
                __u32 ttl = (__u32) G_TC_RTTL(skb->tc_verd);
                if (MAX_RED_LOOP < ttl++) {
                        printk(KERN_WARNING "Redir loop detected Dropping packet (%d->%d)\n",
                                skb->iif, skb->dev->ifindex);
                        return TC_ACT_SHOT;
                }

                skb->tc_verd = SET_TC_RTTL(skb->tc_verd,ttl);

                skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_INGRESS);

                spin_lock(&dev->ingress_lock);
                if ((q = dev->qdisc_ingress) != NULL)
                        result = q->enqueue(skb, q);
                spin_unlock(&dev->ingress_lock);

        }

        return result;
}

static inline void net_timestamp(struct sk_buff *skb)
{
        if (atomic_read(&netstamp_needed))
                __net_timestamp(skb);
        else
                skb->tstamp.tv64 = 0;
}
static inline struct net_device *skb_bond(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;

        if (dev->master) {
                if (skb_bond_should_drop(skb)) {
                        kfree_skb(skb);
                        return NULL;
                }
                skb->dev = dev->master;
        }

        return dev;
}

static inline int deliver_skb(struct sk_buff *skb,
                              struct packet_type *pt_prev,
                              struct net_device *orig_dev)
{
        atomic_inc(&skb->users);
        return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
/* These hooks defined here for ATM */
struct net_bridge;
struct net_bridge_fdb_entry *(*br_fdb_get_hook)(struct net_bridge *br,
                                                unsigned char *addr);
void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent) __read_mostly;

/*
* If bridge module is loaded call bridging hook.
*  returns NULL if packet was consumed.
*/
struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
                                        struct sk_buff *skb) __read_mostly;
static inline struct sk_buff *handle_bridge(struct sk_buff *skb,
                                            struct packet_type **pt_prev, int *ret,
                                            struct net_device *orig_dev)
{
        struct net_bridge_port *port;

        if (skb->pkt_type == PACKET_LOOPBACK ||
            (port = rcu_dereference(skb->dev->br_port)) == NULL)
                return skb;

        if (*pt_prev) {
                *ret = deliver_skb(skb, *pt_prev, orig_dev);
                *pt_prev = NULL;
        }

        return br_handle_frame_hook(port, skb);
}
#else
#define handle_bridge(skb, pt_prev, ret, orig_dev)        (skb)
#endif

int REP_netif_receive_skb(struct sk_buff *skb)
{
        //struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        //int ret = NET_RX_DROP;
        //__be16 type;

        /* if we've gotten here through NAPI, check netpoll */
        if (skb->dev->poll && netpoll_rx(skb))
                return NET_RX_DROP;

        if (!skb->tstamp.tv64)
                net_timestamp(skb);

        if (!skb->iif)
                skb->iif = skb->dev->ifindex;

        orig_dev = skb_bond(skb);

        if (!orig_dev)
                return NET_RX_DROP;

        //__get_cpu_var(netdev_rx_stat).total++;

        skb_reset_network_header(skb);
        skb_reset_transport_header(skb);
        skb->mac_len = skb->network_header - skb->mac_header;


        CBPTR(skb) = orig_dev;
        return bs_dispatch(skb);
}

int __netif_recv_skb(struct sk_buff *skb)
{       
        struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        int ret = NET_RX_DROP;
        __be16 type;

        orig_dev = CBPTR(skb);
        CBPTR(skb) = 0;
        pt_prev = NULL;

        rcu_read_lock();

#ifdef CONFIG_NET_CLS_ACT
        if (skb->tc_verd & TC_NCLS) {
                skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
                goto ncls;
        }
#endif

        list_for_each_entry_rcu(ptype, &ptype_all, list) {
                if (!ptype->dev || ptype->dev == skb->dev) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

#ifdef CONFIG_NET_CLS_ACT
        if (pt_prev) {
                ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = NULL; /* noone else should process this after*/
        } else {
                skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
        }

        ret = ing_filter(skb);

        if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
                kfree_skb(skb);
                goto out;
        }

        skb->tc_verd = 0;
ncls:
#endif

        skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
        if (!skb)
                goto out;

        type = skb->protocol;
        list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
                if (ptype->type == type &&
                    (!ptype->dev || ptype->dev == skb->dev)) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

        if (pt_prev) {
                ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
        } else {
                kfree_skb(skb);
                /* Jamal, now you will not able to escape explaining
                 * me how you were going to use this. :-)
                 */
                ret = NET_RX_DROP;
        }

out:
        rcu_read_unlock();
        return ret;
}
#endif  /* 2.6.22 */

#if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,23)
#ifdef CONFIG_NET_CLS_ACT
static int ing_filter(struct sk_buff *skb)
{
        struct Qdisc *q;
        struct net_device *dev = skb->dev;
        int result = TC_ACT_OK;

        if (dev->qdisc_ingress) {
                __u32 ttl = (__u32) G_TC_RTTL(skb->tc_verd);
                if (MAX_RED_LOOP < ttl++) {
                        printk(KERN_WARNING "Redir loop detected Dropping packet (%d->%d)\n",
                                skb->iif, skb->dev->ifindex);
                        return TC_ACT_SHOT;
                }

                skb->tc_verd = SET_TC_RTTL(skb->tc_verd,ttl);

                skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_INGRESS);

                spin_lock(&dev->ingress_lock);
                if ((q = dev->qdisc_ingress) != NULL)
                        result = q->enqueue(skb, q);
                spin_unlock(&dev->ingress_lock);

        }

        return result;
}
#endif

#if defined(CONFIG_MACVLAN) || defined(CONFIG_MACVLAN_MODULE)
struct sk_buff *(*macvlan_handle_frame_hook)(struct sk_buff *skb) __read_mostly;
EXPORT_SYMBOL_GPL(macvlan_handle_frame_hook);

static inline struct sk_buff *handle_macvlan(struct sk_buff *skb,
                                             struct packet_type **pt_prev,
                                             int *ret,
                                             struct net_device *orig_dev)
{
        if (skb->dev->macvlan_port == NULL)
                return skb;

        if (*pt_prev) {
                *ret = deliver_skb(skb, *pt_prev, orig_dev);
                *pt_prev = NULL;
        }
        return macvlan_handle_frame_hook(skb);
}
#else
#define handle_macvlan(skb, pt_prev, ret, orig_dev)        (skb)
#endif

static inline void net_timestamp(struct sk_buff *skb)
{
        if (atomic_read(&netstamp_needed))
                __net_timestamp(skb);
        else
                skb->tstamp.tv64 = 0;
}
static inline struct net_device *skb_bond(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;

        if (dev->master) {
                if (skb_bond_should_drop(skb)) {
                        kfree_skb(skb);
                        return NULL;
                }
                skb->dev = dev->master;
        }

        return dev;
}

static inline int deliver_skb(struct sk_buff *skb,
                              struct packet_type *pt_prev,
                              struct net_device *orig_dev)
{
        atomic_inc(&skb->users);
        return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
/* These hooks defined here for ATM */
struct net_bridge;
struct net_bridge_fdb_entry *(*br_fdb_get_hook)(struct net_bridge *br,
                                                unsigned char *addr);
void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent) __read_mostly;

/*
* If bridge module is loaded call bridging hook.
*  returns NULL if packet was consumed.
*/
struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
                                        struct sk_buff *skb) __read_mostly;
static inline struct sk_buff *handle_bridge(struct sk_buff *skb,
                                            struct packet_type **pt_prev, int *ret,
                                            struct net_device *orig_dev)
{
        struct net_bridge_port *port;

        if (skb->pkt_type == PACKET_LOOPBACK ||
            (port = rcu_dereference(skb->dev->br_port)) == NULL)
                return skb;

        if (*pt_prev) {
                *ret = deliver_skb(skb, *pt_prev, orig_dev);
                *pt_prev = NULL;
        }

        return br_handle_frame_hook(port, skb);
}
#else
#define handle_bridge(skb, pt_prev, ret, orig_dev)        (skb)
#endif

int REP_netif_receive_skb(struct sk_buff *skb)
{
        struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        int ret = NET_RX_DROP;
        __be16 type;

        /* if we've gotten here through NAPI, check netpoll */
        if (skb->dev->poll && netpoll_rx(skb))
                return NET_RX_DROP;

        if (!skb->tstamp.tv64)
                net_timestamp(skb);

        if (!skb->iif)
                skb->iif = skb->dev->ifindex;

        orig_dev = skb_bond(skb);

        if (!orig_dev)
                return NET_RX_DROP;

        //__get_cpu_var(netdev_rx_stat).total++;

        skb_reset_network_header(skb);
        skb_reset_transport_header(skb);
        skb->mac_len = skb->network_header - skb->mac_header;
#define PATCH_START net/core/dev.c
        CBPTR(skb) = orig_dev;
        return bs_dispatch(skb);
}

int __netif_recv_skb(struct sk_buff *skb)
{
        struct packet_type *ptype, *pt_prev;
        struct net_device *orig_dev;
        int ret = NET_RX_DROP;
        __be16 type;

        orig_dev = CBPTR(skb);
        CBPTR(skb) = 0;
       
#undef PATCH_START
        pt_prev = NULL;

        rcu_read_lock();

#ifdef CONFIG_NET_CLS_ACT
        if (skb->tc_verd & TC_NCLS) {
                skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
                goto ncls;
        }
#endif

        list_for_each_entry_rcu(ptype, &ptype_all, list) {
                if (!ptype->dev || ptype->dev == skb->dev) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

#ifdef CONFIG_NET_CLS_ACT
        if (pt_prev) {
                ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = NULL; /* noone else should process this after*/
        } else {
                skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
        }

        ret = ing_filter(skb);

        if (ret == TC_ACT_SHOT || (ret == TC_ACT_STOLEN)) {
                kfree_skb(skb);
                goto out;
        }

        skb->tc_verd = 0;
ncls:
#endif

        skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
        if (!skb)
                goto out;
        skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
        if (!skb)
                goto out;

        type = skb->protocol;
        list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type)&15], list) {
                if (ptype->type == type &&
                    (!ptype->dev || ptype->dev == skb->dev)) {
                        if (pt_prev)
                                ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = ptype;
                }
        }

        if (pt_prev) {
                ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
        } else {
                kfree_skb(skb);
                /* Jamal, now you will not able to escape explaining
                 * me how you were going to use this. :-)
                 */
                ret = NET_RX_DROP;
        }

out:
        rcu_read_unlock();
        return ret;
}
#endif  /* 2.6.23 */


//--------------------------------------------------------------------------------------
/*
* for standard patch, those lines should be moved into ../../net/sysctl_net.c
*/

/* COPY_OUT_START_TO net/sysctl_net.c */
#define PATCH_START net/sysctl_net.c
#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
#if !defined(BS_CPU_STAT_DEFINED)
struct cpu_stat
{
        unsigned long irqs;                       /* total irqs on me */
        unsigned long dids;                       /* I did, */
        unsigned long works;                      /* q works */
};
#endif
extern struct cpu_stat bs_cpu_status[NR_CPUS];
extern int bs_policy;
#undef PATCH_START
/* COPY_OUT_END_TO net/sysctl_net.c */

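/* sysctl knobs exported under "net": /proc/sys/net/bs_status (the per-CPU
 * irqs/dids/works counters) and /proc/sys/net/bs_policy (distribution policy). */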
static ctl_table bs_ctl_table[] =
{
#define PATCH_START net/sysctl_net.c
        /* COPY_OUT_START_TO net/sysctl_net.c */
        {
                .ctl_name       = 99,
                .procname       = "bs_status",
                .data           = &bs_cpu_status,
                .maxlen         = sizeof(bs_cpu_status),
                .mode           = 0644,
                .proc_handler   = &proc_dointvec,
        },
        {
                .ctl_name       = 99,
                .procname       = "bs_policy",
                .data           = &bs_policy,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = &proc_dointvec,
        },
#undef PATCH_START
        /* COPY_OUT_END_TO net/sysctl_net.c */

        { 0, },
};

static ctl_table bs_sysctl_root[] =
{
        {
                .ctl_name       = CTL_NET,
                .procname       = "net",
                .mode           = 0555,
                .child          = bs_ctl_table,
        },
        { 0, },
};

struct ctl_table_header *bs_sysctl_hdr;
int register_bs_sysctl(void)
{

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,21)       
        bs_sysctl_hdr = register_sysctl_table(bs_sysctl_root);
#else
        bs_sysctl_hdr = register_sysctl_table(bs_sysctl_root, 0);
#endif
        return 0;
}


void unregister_bs_sysctl(void)
{
        unregister_sysctl_table(bs_sysctl_hdr);
}
#endif                                            //CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL

void seeker_init(void)
{
        if(nr_cpus == 0)
                nr_cpus = num_online_cpus();
        register_bs_sysctl();
}


void seeker_exit(void)
{
        unsigned long now;
        unregister_bs_sysctl();
        bs_policy = 0;
        msleep(1000);
        flush_scheduled_work();
        now = jiffies;
        msleep(1000);
        printk("%lu exited.\n", jiffies - now);
}
//-------------------------------------------------------------------------

/*--------------------------------------------------------------------------
*/

#define OE_KEEP_SIZE 5
static char saved[8];
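/* dorepl(): with on_off == 1, save the first OE_KEEP_SIZE bytes of func_ptr and
 * overwrite them with a 5-byte relative jmp to func_new; with on_off == 0,
 * restore the saved bytes.  Returns 0 on failure, a non-NULL pointer otherwise. */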
char *dorepl(char *func_ptr, char *func_new, int on_off)
{
char jmp_entry[8];

    //printk("tapping: old %p new %p onoff %d replace %d\n", func_ptr, func_new, on_off, replace);
                    
    if(on_off == 0) {
        if(!saved[0]) return 0;
        lock_kernel();   
        memcpy(func_ptr, saved, OE_KEEP_SIZE);
        unlock_kernel();
        //code_dump(func_ptr, 9);
        saved[0] = 0;
        return saved;
    }

    if(1) {
        if(!func_new) return 0;
        memcpy(saved, func_ptr, OE_KEEP_SIZE);
        printk("replace: old %p new %p onoff %d\n", func_ptr, func_new, on_off);
        
        //do function replacing
        jmp_entry[0] = '\xe9';
        *(int*)&jmp_entry[1] = (int)(func_new - func_ptr - 5);   /* rel32 operand of the 0xe9 jmp */

        lock_kernel();
        memcpy((char*)func_ptr, jmp_entry, 5);
        unlock_kernel();
        printk("function (%p) is replaced.\n", func_ptr);
        return func_ptr;
    }
}

static char system_map[128] = "/boot/System.map-";
static long sysmap_size;
static char *sysmap_buf;

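/* Scan the System.map text loaded into sysmap_buf for an exact symbol name and
 * parse the address field that starts 11 characters before the match
 * (8 hex digits, a space, the type letter, a space on a 32-bit map).
 * Returns 0 if the symbol is not found. */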
unsigned long sysmap_name2addr(char *name)
{
        char *cp, *dp;
        unsigned long addr;
        int len, n;

        if(!sysmap_buf) return 0;
        if(!name || !name[0]) return 0;
        n = strlen(name);
        for(cp = sysmap_buf; ;)
        {
                cp = strstr(cp, name);
                if(!cp) {
                        printk("%s not found.\n", name);
                        return 0;
                }

                for(dp = cp; *dp && *dp != '\n' && *dp != ' ' && *dp != '\t'; dp++);

                len = dp - cp;
                if(len < n) goto cont;
                if(cp > sysmap_buf && cp[-1] != ' ' && cp[-1] != '\t')
                {
                        goto cont;
                }
                if(len > n)
                {
                        goto cont;
                }
                break;
                cont:
                if(*dp == 0) break;
                cp += (len+1);
        }

        cp -= 11;
        if(cp > sysmap_buf && cp[-1] != '\n')
        {
                printk("_ERROR_ in name2addr cp = %p base %p\n", cp, sysmap_buf);
                return 0;
        }
        sscanf(cp, "%lx", &addr);
        printk("VAR: %s %lx\n", name, addr);
        return addr;
}


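/* kas_init(): build the path /boot/System.map-<release>, read the file into
 * memory and resolve the unexported kernel symbols this module needs
 * (ptype_lock, ptype_base, ptype_all, keventd_wq, __queue_work, __netpoll_rx,
 * netstamp_needed, the tapped function named by TAPFUNC, ip_rcv).
 * Returns 0 on success, non-zero on any failure. */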
static int kas_init()
{
        struct file *fp;
        int i, val;
        long addr;
        struct kstat st;
        mm_segment_t old_fs;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,19)       
#if LINUX_VERSION_CODE > KERNEL_VERSION(2,6,21)
//#include <linux/uts.h>
//#include <linux/utsname.h>
struct new_utsname {
        char sysname[65];
        char nodename[65];
        char release[65];
        char version[65];
        char machine[65];
        char domainname[65];
};
struct uts_namespace {
        struct kref kref;
        struct new_utsname name;
};
extern struct uts_namespace init_uts_ns;
#endif

        strcat(system_map, init_uts_ns.name.release);
#else
        strcat(system_map, system_utsname.release);
#endif
        printk("uname -a %s\n", system_map);
       
       
        old_fs = get_fs();
        set_fs(get_ds());                         /* system_map is treated as a __user pointer */
        i = vfs_stat(system_map, &st);
        set_fs(old_fs);
        if(i) return 1;

        sysmap_size = st.size + 32;
        fp = filp_open(system_map, O_RDONLY, FMODE_READ);
        if(!fp) return 2;

        sysmap_buf = vmalloc(sysmap_size);
        if(!sysmap_buf)
        {
                filp_close(fp, 0);
                return 3;
        }
        i = kernel_read(fp, 0, sysmap_buf, sysmap_size);
        if(i <= 1024)
        {
                filp_close(fp, 0);
                vfree(sysmap_buf);
                sysmap_buf = 0;
                return 4;
        }
        sysmap_size = i;
        *(int*)&sysmap_buf[i] = 0;
        filp_close(fp, 0);

        if(!(p_ptype_lock = sysmap_name2addr("ptype_lock"))) return 1;
        if(!(p_ptype_base = sysmap_name2addr("ptype_base"))) return 1;
        if(!(p_ptype_all = sysmap_name2addr("ptype_all"))) return 1;
        if(!(Pkeventd_wq = sysmap_name2addr("keventd_wq"))) return 1;
        if(!(p__queue_work = sysmap_name2addr("__queue_work"))) return 1;
        if(!(p__netpoll_rx = sysmap_name2addr("__netpoll_rx"))) return 1;
        if(!(p_netstamp_needed = sysmap_name2addr("netstamp_needed"))) return 1;

#ifdef CONFIG_NET_CLS_ACT
        if(!(p_ing_filter = sysmap_name2addr("ing_filter"))) ; // return 1;
#endif       
        if(!(p_tapped = sysmap_name2addr(TAPFUNC))) return 1;
        if(!(p_ip_rcv = sysmap_name2addr("ip_rcv"))) return 1;
        vfree(sysmap_buf);

        return 0;

}

/*--------------------------------------------------------------------------
*/
static int  __init init()
{
        struct packet_type *pt;
        int r;       
        if((r = kas_init())) {
                 printk("can't resolve globals. err %d\n", r);
                 return -1;
        }

        //printk("REP_netif_receive_skb %p\n", REP_netif_receive_skb);

        if(!dorepl(p_tapped, REP_netif_receive_skb, 1))
                return -1;
       
        seeker_init();
        printk("bs_smp loaded.\n");
        return 0;
}


static void __exit exit(void)
{
        seeker_exit();
        dorepl(p_tapped, REP_netif_receive_skb, 0);
        printk("KERNEL VERSION = %d %p\n", KERNEL_VERSION(2,6,23), KERNEL_VERSION(2,6,23));       
}


module_init(init)
module_exit(exit)
MODULE_LICENSE("GPL");

[Last edited by platinum on 2007-12-19 16:10]
Author: platinum    Time: 2007-12-19 16:11
seeker, your code is too long; Discuz probably has a bug and cannot show it as a code block.
Could you upload it as an attachment instead?
Author: sisi8408    Time: 2007-12-19 21:06
Subject: Reply to post #71 by 思一克

static inline int bs_dispatch(struct sk_buff *skb)
{

#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,22)
        struct iphdr *iph = ip_hdr(skb);
#else
        struct iphdr *iph = skb->nh.iph;
#endif
        if(!nr_cpus)
                nr_cpus = num_online_cpus();

        /*
        struct tcphdr {
                __u16        source;
                __u16        dest;
                __u32        seq;
        };
        */
        if(bs_policy && nr_cpus > 1) { // && iph->protocol != IPPROTO_ICMP) {
        //if(bs_policy && nr_cpus > 1 && iph->protocol == IPPROTO_ICMP) { //test on icmp first
                unsigned int cur, cpu;
                struct work_struct *bs_works;
                struct sk_buff_head *q;

                cpu = cur = smp_processor_id();

                bs_cpu_status[cur].irqs++;

                //good point for Jamal. thanks no reordering
                if(bs_policy == BS_POL_LINK) {
                        int seed = 0;
                        if(iph->protocol == IPPROTO_TCP || iph->protocol == IPPROTO_UDP) {
                                struct tcphdr *th = (struct tcphdr*)(iph + 1);  //udp is same as tcp
                                seed = ntohs(th->source) + ntohs(th->dest);
                        }
                        cpu = (iph->saddr + iph->daddr + seed) % nr_cpus;

                        /*
                        if(net_ratelimit() && iph->protocol == IPPROTO_TCP) {
                                struct tcphdr *th = iph + 1;

                                printk("seed %u (%u %u) cpu %d. source %d dest %d\n",
                                        seed, iph->saddr + iph->daddr, iph->saddr + iph->daddr + seed, cpu,
                                        ntohs(th->source), ntohs(th->dest));
                        }
                        */
                } else
                //random distribute
                if(bs_policy == BS_POL_RANDOM)
                        cpu = (bs_cpu_status[cur].irqs % nr_cpus);

                //cpu = cur;
                //cpu = (cur? 0: 1);

                if(cpu == cur) {
                        bs_cpu_status[cpu].dids++;
                        /*
                         //////////////////////
                         in bs_func()
                         //////////////////////
                        local_bh_disable();
                        __netif_recv_skb(skb);
                        local_bh_enable();
                          ///////////////////
                          why not mask bh here??
                          and is it already bh here???
                          ///////////////////
                           * in drivers/net/tgx.c
                           * the play of work queue in bh is nice.
                           *
                           * if cpu != cur
                           * howto control the delay of
                           * bs_func execed on cpu, say 1 jiff???
                           */
                        return __netif_recv_skb(skb);
                }

                q = &per_cpu(bs_cpu_queues, cpu);

                if(!q->next) {
                        skb_queue_head_init(q);
                }

                spin_lock(&q->lock);
                __skb_queue_tail(q, skb);
                spin_unlock(&q->lock);

                bs_works = &per_cpu(bs_works, cpu);
                if (!bs_works->func) {
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,16)
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,19)
                        INIT_WORK(bs_works, bs_func, 0);
#else
                        INIT_WORK(bs_works, bs_func);
#endif
                        bs_cpu_status[cpu].works++;
                        preempt_disable();
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,20)
                        set_bit(WORK_STRUCT_PENDING, work_data_bits(bs_works));
#endif
                        __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), bs_works);
                        preempt_enable();
#else
                        INIT_WORK(bs_works, bs_func, q);
                        bs_cpu_status[cpu].works++;
                        preempt_disable();
                        __queue_work(keventd_wq->cpu_wq + cpu, bs_works);
                        preempt_enable();
#endif

                }

        } else {

                bs_cpu_status[smp_processor_id()].dids++;
                return __netif_recv_skb(skb);
        }
        return 0;
#else
        return __netif_recv_skb(skb);
#endif
}

[Last edited by sisi8408 on 2007-12-19 22:19]
Author: 思一克    Time: 2007-12-20 09:17
TO sisi8408,

I will think your questions over carefully.
Author: 思一克    Time: 2007-12-20 09:18
Subject: Reply to post #74 by 思一克
How do I attach a file? I have done it before.
Author: 思一克    Time: 2007-12-20 09:27
TO platinum,

Download it from my blog, I have put it there.

TO sisi,

No need to mask BH there, because it runs on one CPU, the same as before.
Author: du2050    Time: 2007-12-20 13:57
I have not read it very carefully, but one question:
Doesn't skb_queue impose a length limit? Under heavy load, unprocessed skbs pile up in the skb_queue until memory runs out and the system crashes.

Actually I have done something similar, except the parallelism was done with kernel threads. From a performance standpoint I think it is hard to make this module general purpose. For example, I capture packets with af_packet+mmap, which is a completely different matter from netfilter and the like; for fast capture it is unacceptable for the data in skb->data to have been read by more than one core.
Author: 思一克    Time: 2007-12-20 16:55
This does not use extra memory.

Under heavy load we need experiments to see the results. That is also why I posted it, for testing.

Originally posted by du2050 at 2007-12-20 13:57
I have not read it very carefully, but one question:
Doesn't skb_queue impose a length limit? Under heavy load, unprocessed skbs pile up in the skb_queue until memory runs out and the system crashes.

Actually I have done something similar, except the parallelism was done with kernel threads. From a performance standpoint I think it is hard to make this ...

Author: du2050    Time: 2007-12-20 17:10
Subject: Reply to post #78 by 思一克
The skb is allocated by the NIC driver and has to be freed in the netif_recv_skb processing path.
If the skb_queue is filled faster than it is drained, memory use keeps growing and the system soon collapses, so skb_queue_len should be checked before enqueueing.
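As a rough sketch of the check du2050 describes, the enqueue path in bs_dispatch() could be guarded right before the existing __skb_queue_tail() call, falling back to local processing when the target queue is already long. BS_MAX_QLEN is an assumed cap; the posted module does not define one, and the right value would need testing.

#define BS_MAX_QLEN 1000                        /* assumed per-CPU backlog cap */

                /* hypothetical guard before the existing enqueue in bs_dispatch() */
                if (skb_queue_len(q) > BS_MAX_QLEN) {
                        bs_cpu_status[cur].dids++;
                        return __netif_recv_skb(skb);   /* process on this CPU instead */
                }

                spin_lock(&q->lock);
                __skb_queue_tail(q, skb);
                spin_unlock(&q->lock);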
Author: platinum    Time: 2007-12-22 13:17
It compiles without problems, but I ran into trouble when loading it:

localhost bs_smp # insmod bs_smp.ko
insmod: error inserting 'bs_smp.ko': -1 Operation not permitted
localhost bs_smp # modinfo bs_smp.ko
filename:       bs_smp.ko
license:        GPL
depends:
vermagic:       2.6.21-gentoo-r4 SMP mod_unload PENTIUM4
localhost bs_smp # uname -a
Linux localhost 2.6.21-gentoo-r4 #7 SMP Thu Dec 20 00:54:40 CST 2007 i686 Genuine Intel(R) CPU T2080 @ 1.73GHz GenuineIntel GNU/Linux
localhost bs_smp #

Author: wheel    Time: 2009-06-22 18:29
Originally posted by 思一克 at 2007-7-13 14:28
Below is a patch for arch/i386/kernel/io_apic.c. It balances very well with one NIC and two CPUs, and with two NICs and two CPUs. It should also balance well across more CPUs, but I have not tested that.



--- io_apic.c        2007-07-13 13:24:57.000000000 + ...




Author: lmarsin    Time: 2010-05-21 11:10
Seconded.
Author: blowingwind    Time: 2010-07-21 23:09
The skb is allocated by the NIC driver and has to be freed in the netif_recv_skb processing path.
If the skb_queue is filled faster than it is drained, memory use keeps growing and the system soon collapses, so skb_queue_len should be checked before enqueueing.


How should this issue be weighed, and is it real? ...
Author: platinum    Time: 2010-07-21 23:15
The skb is allocated by the NIC driver and has to be freed in the netif_recv_skb processing path.
If the skb_queue is filled faster than it is drained, memory use keeps growing ...
Posted by blowingwind at 2010-07-21 23:09


This problem does exist, but how do you decide the skb_queue_len limit based on memory? Any good ideas?
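One possible way to tie the limit to memory rather than to a packet count, sketched only as an idea: account skb->truesize per target CPU and fall back to local processing once a byte budget is exceeded. bs_queue_bytes and BS_MAX_QUEUE_BYTES are assumed names, not part of the posted module, and the counter would also have to be decremented in bs_func() after each dequeue.

static DEFINE_PER_CPU(unsigned long, bs_queue_bytes);  /* assumed per-CPU byte counter */
#define BS_MAX_QUEUE_BYTES (4UL << 20)                  /* e.g. a 4 MB budget per target CPU */

                /* hypothetical check before enqueueing in bs_dispatch() */
                if (per_cpu(bs_queue_bytes, cpu) + skb->truesize > BS_MAX_QUEUE_BYTES) {
                        bs_cpu_status[cur].dids++;
                        return __netif_recv_skb(skb);   /* keep it on the current CPU */
                }
                per_cpu(bs_queue_bytes, cpu) += skb->truesize;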
Author: Godbach    Time: 2010-07-22 17:43
platinum, have you tested this patch? How did it go?
Author: platinum    Time: 2010-07-22 18:04
platinum, have you tested this patch? How did it go?
Posted by Godbach at 2010-07-22 17:43


I have tested it. There are two scheduling policies, one per packet and one per connection; both improve throughput somewhat, but the former causes packet reordering.
The most critical problem, however, is that it is unstable.

Just as blowingwind said, if the skb_queue fills faster than it drains, memory use keeps growing and it soon collapses: an OOPS, then a PANIC.
I do not know where the problem is, or how to fix and improve it.
Author: Godbach    Time: 2010-07-22 18:22
Many thanks. Also, in the 2.6.24 kernel the net_device structure changed compared with 2.6.23, adding quite a few NAPI structures, so this does not compile directly.
Author: Godbach    Time: 2010-07-22 18:23
platinum, when you have time please help take a look at this issue; I am still a bit puzzled:
http://linux.chinaunix.net/bbs/thread-1167974-1-1.html
Author: blowingwind    Time: 2010-07-29 17:19
Under heavy traffic the system crashes quickly.
PS: is there an effective way to capture the information from just before the crash?

I have not investigated the cause carefully yet. Could it be that this module's changes alter the kernel's default rule, namely that whichever CPU takes the interrupt also does the processing, so some unknown module's statistics or calls go wrong and the kernel crashes?
Author: gefeinic    Time: 2010-10-18 13:13
I'm a newbie, just passing by~~



