Source: http://undeadly.org/cgi?action=article&sid=20060927091645
This is the first installment in a series of three articles about PF. I originally wrote them as chapters for a book, but publication was cancelled. Luckily, the rights could be salvaged, and now you get to enjoy them as undeadly.org exclusives, in celebration of the upcoming OpenBSD 4.0 release.
Firewall Ruleset Optimization
- Goals
- The significance of packet rate
- When pf is the bottleneck
- Filter statefully
- The downside of stateful filtering
- Ruleset evaluation
- Ordering rulesets to maximize skip steps
- Use tables for address lists
- Use quick to abort ruleset evaluation when rules match
- Anchors with conditional evaluation
- Let pfctl do the work for you
- Testing Your Firewall (upcoming)
- Firewall Management (upcoming)
Firewall Ruleset Optimization
Goals
Ideally, the operation of a packet filter should not affect legitimate network traffic. Packets violating the filtering policy should be blocked, and compliant packets should pass the device as if the device wasn't there at all.
In reality, several factors limit how well a packet filter can achieve that goal. Packets have to pass through the device, adding some amount of latency between the time a packet is received and the time it is forwarded. Any device can only process a finite amount of packets per second. When packets arrive at a higher rate than the device can forward them, packets are lost.
Most protocols, like TCP, deal well with added latency. You can achieve high TCP transfer rates even over links that have several hundred milliseconds of latency. In interactive network gaming, on the other hand, even a few tens of milliseconds are usually perceived as too much. Packet loss is generally the worse problem: TCP performance deteriorates seriously when a significant number of packets are lost.
This chapter explains how to identify when pf is becoming the limiting factor in network throughput and what can be done to improve performance in this case.
The significance of packet rate
One commonly used unit to compare network performance is throughput in bytes per second. But this unit is completely inadequate to measure pf performance. The real limiting factor isn't throughput but packet rate, that is the number of packets per second the host can process. The same host that handles 100Mbps of 1500 byte packets without breaking a sweat can be brought to its knees by a mere 10Mbps of 40 byte packets. The former amounts to only 8,000 packets/second, but the latter traffic stream amounts to 32,000 packets/second, which causes roughly four times the amount of work for the host.
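As a quick cross-check of those figures, the conversion can be sketched in a few lines of shell (the helper name is mine; this assumes 8 bits per byte and ignores framing overhead):

```shell
# Packets per second implied by a given throughput and packet size.
pps() {
    # $1 = throughput in Mbit/s, $2 = packet size in bytes
    echo $(( $1 * 1000000 / 8 / $2 ))
}
pps 100 1500   # ~8333 packets/s (the "8,000" case above)
pps 10 40      # 31250 packets/s (the "32,000" case above)
```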
To understand this, let's look at how packets actually pass through the host. Packets are received from the wire by the network interface card (NIC) and read into a small memory buffer on the NIC. When that buffer is full, the NIC triggers a hardware interrupt, causing the NIC driver to copy the packets into network memory buffers (mbufs) in kernel memory. The packets are then passed through the TCP/IP stack in the form of these mbufs. Once a packet is transferred into an mbuf, most operations the TCP/IP stack performs on the packet are not dependent on the packet size, as these operations only inspect the packet headers and not the payload. This is also true for pf, which gets passed one packet at a time and makes the decision of whether to block it or pass it on. If the packet needs forwarding, the TCP/IP stack will pass it to a NIC driver, which will extract the packet from the mbuf and put it back onto the wire.
Most of these operations have a comparatively high cost per packet, but a very low cost per size of the packet. Hence, processing a large packet is only slightly more expensive than processing a small packet.
Some limits are based on hardware and software outside of pf. For instance, i386-class machines are not able to handle much more than 10,000 interrupts per second, no matter how fast the CPU is, due to architectural constraints. Some network interface cards will generate one interrupt for each packet received. Hence, the host will start to lose packets when the packet rate exceeds around 10,000 packets per second. Other NICs, like more expensive gigabit cards, have larger built-in memory buffers that allow them to bundle several packets into one interrupt. Hence, the choice of hardware can impose some limits that no optimization of pf can surpass.
When pf is the bottleneck
The kernel passes packets to pf sequentially, one after the other. While pf is being called to decide the fate of one packet, the flow of packets through the kernel is briefly suspended. During that short period of time, further packets read off the wire by NICs have to fit into memory buffers. If pf evaluations take too long, packets will quickly fill up the buffers, and further packets will be lost. The goal of optimizing pf rulesets is to reduce the amount of time pf spends for each packet.
An interesting exercise is to intentionally push the host into this overloaded state by loading a very large ruleset like this:
  $ i=0; while [ $i -lt 100 ]; do \
      printf "block from any to %d.%d.%d.%d\n" \
        `jot -r -s " " 4 1 255`; \
      let i=i+1; \
    done | pfctl -vf -
  block drop inet from any to 151.153.227.25
  block drop inet from any to 54.186.19.95
  block drop inet from any to 165.143.57.178
  ...
This represents a worst-case ruleset that defies all automatic optimizations. Because each rule contains a different random non-matching address, pf is forced to traverse the entire ruleset and evaluate each rule for every packet. Loading a ruleset that consists solely of thousands of such rules, and then generating a steady flow of packets that must be filtered, inflicts noticeable load on even the fastest machine. While the host is under load, check the interrupt rate with:

  $ vmstat -i

And watch CPU states with:

  $ top
This will give you an idea of how the host reacts to overloading, and will help you spot similar symptoms when using your own ruleset. You can use the same tools to verify effects of optimizations later on.
Then try the other extreme. Completely disable pf like:

  $ pfctl -d
Then compare the vmstat and top values.
This is a simple way to get a rough estimate and upper limit on what to realistically expect from optimization. If your host handles your traffic with pf disabled, you can aim to achieve similar performance with pf enabled. However, if the host already shows problems handling the traffic with pf disabled, optimizing pf rulesets is probably pointless, and other components should be changed first.
If you already have a working ruleset and are wondering whether you should spend time on optimizing it for speed, repeat this test with your ruleset and compare the results with both extreme cases. If running your ruleset shows effects of overloading, you can use the guidelines below to reduce those effects.
In some cases, the ruleset shows no significant amount of load on the host, yet connections through the host show unexpected problems, like delays during connection establishment, stalling connections or disappointingly low throughput. In most of these cases, the problem is not filtering performance at all, but a misconfiguration of the ruleset which causes packets to get dropped. See chapter 21 for how to identify and deal with such problems.
And finally, if your ruleset is evaluated without causing significant load and everything works as expected, the most reasonable conclusion is to leave the ruleset as it is. Often, rulesets written in a straightforward way without regard for performance are evaluated efficiently enough to cause no packet loss. Manual optimization will only make the ruleset harder to read for the human maintainer, while having an insignificant effect on performance.
Filter statefully
The amount of work done by pf mainly consists of two kinds of operations: ruleset evaluations and state table lookups.
For every packet, pf first does a state table lookup. If a matching state entry is found in the state table, the packet is immediately passed. Otherwise pf evaluates the filter ruleset to find the last matching rule for the packet which decides whether to block or pass it. If the rule passes the packet, it can optionally create a state entry using the 'keep state' option.
When filtering statelessly, without using 'keep state' to create state entries for connections, every packet causes an evaluation of the ruleset, and ruleset evaluation is the single most costly operation pf performs in this scenario. Each packet still causes a state table lookup, but since the table is empty, the cost of the lookup is basically zero.
Filtering statefully means using 'keep state' in filter rules, so packets matching those rules will create a state table entry. Further packets related to the same connections will match the state table entries and get passed automatically, without evaluations of the ruleset. In this scenario, only the first packet of each connection causes a ruleset evaluation, and subsequent packets only cause a state lookup.
Now, a state lookup is much cheaper than a ruleset evaluation. A ruleset is basically a list of rules which must be evaluated from top to bottom. The cost increases with every rule in the list, twice as many rules mean twice the amount of work. And evaluating a single rule can cause comparison of numerous values in the packet. The state table, on the other hand, is a tree. The cost of lookup increases only logarithmically with the number of entries, twice as many states mean only one additional comparison, a fraction of additional work. And comparison is needed only for a limited number of values in the packet.
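The difference is easy to quantify. Here is a small sketch of the comparison counts (my own illustration, assuming the linear scan evaluates half of the rules on average, and a balanced tree needs about log2(n) comparisons):

```shell
# Rough comparison counts: linear ruleset scan vs. balanced-tree lookup.
awk 'BEGIN {
    for (n = 1000; n <= 64000; n *= 4)
        printf "%6d entries: linear ~%5d comparisons, tree ~%2d\n",
               n, n / 2, log(n) / log(2)
}'
```

Doubling the ruleset doubles the linear cost, while doubling the state table adds only a single comparison to the tree lookup.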
There is some cost to creating and removing state entries. But assuming the state will match several subsequent packets and saves ruleset evaluation for them, the sum is much cheaper. For specific connections like DNS lookups, where each connection only consists of two packets (one request and one reply), the overhead of state creation might be worse than two ruleset evaluations. Connections that consist of more than a handful of packets, like most TCP connections, will benefit from the created state entry.
In short, you can make ruleset evaluation a per-connection cost instead of a per-packet cost. This can easily make a factor of 100 or more. For example, I see the following counters when I run:
  $ pfctl -si
  State Table                     Total             Rate
    searches                  172507978          887.4/s
    inserts                     1099936            5.7/s
    removals                    1099897            5.7/s
  Counters
    match                       6786911           34.9/s
This means pf gets called about 900 times per second. I'm filtering on multiple interfaces, so that would mean I'm forwarding about 450 packets per second, each of which gets filtered twice, once on each interface it passes through. But ruleset evaluation occurs only about 35 times per second, and state insertions and deletions only 6 times per second. With anything but a tiny ruleset, this is very well worth it.
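You can compute that ratio directly from the counters. A small sketch (the helper name is mine; the sample figures are the ones shown above, and on a live system you would pipe in `pfctl -si`):

```shell
# Ratio of state-table searches (one per packet) to ruleset matches
# (roughly one per connection), from `pfctl -si` style output on stdin.
search_match_ratio() {
    awk '$1 == "searches" { s = $2 }
         $1 == "match"    { m = $2 }
         END { printf "%.0f\n", s / m }'
}
printf 'searches 172507978 887.4/s\nmatch 6786911 34.9/s\n' | search_match_ratio
# prints 25: about 25 state lookups for every ruleset evaluation
```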
To make sure that you're really creating state for each connection, search for 'pass' rules which don't use 'keep state', like in:
  $ pfctl -sr | grep pass | grep -v 'keep state'
Make sure you have a tight 'block by default' policy, as otherwise packets might pass not because they match an explicit 'pass' rule, but because they mismatch all rules and pass by default.
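As a sketch, such a policy might look like this in pf.conf (the interface macro and the services shown are placeholders of mine, not from the article):

```
block all
pass in on $ext_if proto tcp from any to $ext_if port www keep state
pass out on $ext_if keep state
```

With the initial 'block all', any packet that matches no later rule is dropped rather than silently passed.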
The downside of stateful filtering
The only downside to stateful filtering is that state table entries need memory, around 256 bytes for each entry. When pf fails to allocate memory for a new state entry, it blocks the packet that should have created the state entry instead, and increases an out-of-memory counter shown by:
  $ pfctl -si
  Counters
    memory                            0            0.0/s
Memory for state entries is allocated from the kernel memory pool called 'pfstatepl'. You can use vmstat(8) to show various aspects of pool memory usage:
  $ vmstat -m
  Memory resource pool statistics
  Name       Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
  pfstatepl   256  1105099    0  1105062   183   114    69   127     0   625   62
The difference between 'Requests' and 'Releases' equals the number of currently allocated state table entries, which should match the counter shown by:
  $ pfctl -si
  State Table                     Total             Rate
    current entries                    36
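That difference is easy to extract with awk. A sketch (the helper name is mine; column positions are taken from the vmstat -m header above, and on a live system you would pipe in `vmstat -m`):

```shell
# Currently allocated pfstatepl entries = Requests - Releases,
# parsed from `vmstat -m` style output on stdin
# (columns: Name Size Requests Fail Releases ...).
pfstate_allocated() {
    awk '$1 == "pfstatepl" { print $3 - $5 }'
}
echo 'pfstatepl 256 1105099 0 1105062 183 114 69 127 0 625 62' | pfstate_allocated
# prints 37
```

Counters sampled at slightly different moments can differ by a few entries, so an exact match with the pfctl output is not guaranteed.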
Other counters shown by pfctl can be reset with 'pfctl -Fi'.
Not all memory of the host is available to the kernel, and the way the amount of physical RAM affects the amount available to the kernel depends on architecture and kernel options and version. As of OpenBSD 3.6, an i386 kernel can use up to 256MB of memory. Prior to 3.6, that limit was much lower for i386. You could have 8GB of RAM in your host, and still pf would fail to allocate memory beyond a small fraction of that amount.
To make matters worse, when pf really hits the limit where pool_get(9) fails, the failure is not as graceful as one might wish. Instead, the entire system becomes unstable after that point, and eventually crashes. This really isn't pf's fault, but a general problem with kernel pool memory management.
To address this, pf itself limits the number of state entries it will allocate at the same time, using pool_sethardlimit(9), also shown by vmstat -m output. The default for this limit is 10,000 entries, which is safe for any host. The limit can be printed with:
  $ pfctl -sm
  states        hard limit    10000
  src-nodes     hard limit    10000
  frags         hard limit      500
If you need more concurrent state entries, you can increase the limit in pf.conf with:

  set limit states 20000
The problem is determining a large value that is still safe enough not to trigger a pool allocation failure. This is still a sore topic, as there is no simple formula to calculate the value. Basically, you have to increase the limit and verify that the host remains stable after reaching it, by artificially creating that many entries.
On the bright side, if you have 512MB or more of RAM, you can now use 256MB for the kernel, which should be safe for at least 500,000 state entries. And most people consider that a lot of concurrent connections. Just imagine each of those connections generating just one packet every ten seconds, and you end up with a packet rate of 50,000 packets/s.
More likely, you don't expect that many states at all. But whatever your state limit is, there are cases where it will be reached, like during a denial-of-service attack. Remember, pf will fail closed, not open, when state creation fails. An attacker could create state entries until the limit is reached, just for the purpose of denying service to legitimate users.
There are several ways to deal with this problem.
You can limit the number of states created from specific rules, for instance like:
  pass in from any to $ext_if port www keep state (max 256)
This would limit the number of concurrent connections to the web server to 256, while other rules could still create state entries. Similarly, the maximum number of connections per source address can be restricted with:
  pass keep state (source-track rule, max-src-states 16)
Once a state entry is created, various timeouts define when it is removed. For instance:
  $ pfctl -st
  tcp.opening                  30s
The timeout for TCP states that are not yet fully established is set to 30 seconds. These timeouts can be lowered to remove state entries more aggressively. Individual timeout values can be set globally in pf.conf:
  set timeout tcp.opening 20
They can also be set in individual rules, applying only to states created by these rules:
  pass keep state (tcp.opening 10)
There are several pre-defined sets of global timeouts which can be selected in pf.conf:
  set optimization aggressive
Also, there's adaptive timeouts, which means these timeouts are not constants, but variables which adjust to the number of state entries allocated. For instance:
  set timeout { adaptive.start 6000, adaptive.end 12000 }
pf will use constant timeout values as long as there are less than 6,000 state entries. When there are between 6,000 and 12,000 entries, all timeout values are linearly scaled from 100% at 6,000 to 0% at 12,000 entries, i.e. with 9,000 entries all timeout values are reduced to 50%.
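The scaling rule can be sketched as a small helper (names are mine; the start/end values are the ones from the example above):

```shell
# Fraction of the configured timeout values that remains in effect,
# given the current state count and adaptive.start / adaptive.end.
adaptive_scale() {
    # $1 = current states, $2 = adaptive.start, $3 = adaptive.end
    awk -v n="$1" -v s="$2" -v e="$3" 'BEGIN {
        if (n <= s)      f = 1        # at or below start: full timeouts
        else if (n >= e) f = 0        # at or above end: zero
        else             f = (e - n) / (e - s)
        printf "%.0f%%\n", f * 100
    }'
}
adaptive_scale 9000 6000 12000   # prints 50%
```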
In summary, you probably can specify a number of maximum states you expect to support. Set this as the limit for pf. Expect the limit to get reached during certain attacks, and define a timeout strategy for this case. In the worst case, pf will drop packets when state insertion fails, and the out-of-memory counter will increase.
Ruleset evaluation
A ruleset is a linear list of individual rules, which are evaluated from top to bottom for a given packet. Each rule either does or does not match the packet, depending on the criteria in the rule and the corresponding values in the packet.
Therefore, to a first approximation, the cost of ruleset evaluation grows with the number of rules in the ruleset. This is not precisely true for reasons we'll get into soon, but the general concept is correct. A ruleset with 10,000 rules will almost certainly cause much more load on your host than one with just 100 rules. The most obvious optimization is to reduce the number of rules.
Ordering rulesets to maximize skip steps
The first reason why ruleset evaluation can be cheaper than evaluating each individual rule in the ruleset is called skip steps. This is a transparent and automatic optimization done by pf when the ruleset is loaded. It's best explained with an example. Imagine you have the following simple ruleset:
  1. block in all
  2. pass in on fxp0 proto tcp from any to 10.1.2.3 port 22 keep state
  3. pass in on fxp0 proto tcp from any to 10.1.2.3 port 25 keep state
  4. pass in on fxp0 proto tcp from any to 10.1.2.3 port 80 keep state
  5. pass in on fxp0 proto tcp from any to 10.2.3.4 port 80 keep state
A TCP packet arrives in on fxp0 to destination address 10.2.3.4 on some port.
pf will start the ruleset evaluation for this packet with the first rule, which fully matches. Evaluation continues with the second rule, which matches the criteria 'in', 'on fxp0', 'proto tcp', 'from any' but doesn't match 'to 10.1.2.3'. So the rule does not match, and evaluation should continue with the third rule.
But pf is aware that the third and fourth rule also specify the same criterion 'to 10.1.2.3' which caused the second rule to mismatch. Hence, it is absolutely certain that the third and fourth rule cannot possibly match this packet, either, and immediately jumps to the fifth rule, saving several comparisons.
Imagine the packet under inspection was UDP instead of TCP. The first rule would have matched, evaluation would have continued with the second rule. There, the criterion 'proto tcp' would have made the rule mismatch the packet. Since the subsequent rules also specify the same criterion 'proto tcp' which was found to mismatch the packet, all of them could be safely skipped, without affecting the outcome of the evaluation.