
Iptables Tutorial 1.2.2

Oskar Andreasson


Copyright © 2001-2006 Oskar Andreasson Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1; with the Invariant Sections being "Introduction" and all sub-sections, with the Front-Cover Texts being "Original Author: Oskar Andreasson", and with no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License". All scripts in this tutorial are covered by the GNU General Public License. The scripts are free source; you can redistribute them and/or modify them under the terms of the GNU General Public License as published by the Free Software Foundation, version 2 of the License. These scripts are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License within this tutorial, under the section entitled "GNU General Public License"; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA


I would like to dedicate this document to my wonderful sister, niece and brother-in-law for giving me inspiration and feedback. They are a source of joy and a ray of light when I have need of it. Thank you!

A special word should also be extended to Ninel for always encouraging my writing and for taking care of me when I needed it the most. Thank you!

Second of all, I would like to dedicate this work to all of the incredibly hard working Linux developers and maintainers. It is people like those who make this wonderful operating system possible.

About the author

The author of the iptables tutorial was born in...

No, jokes aside. At age 8 I got my first computer as a Christmas present, a Commodore 64 with a C-1541 disk drive, an 8-needle printer and some games etc. It took me several days to even bother. My father managed to put it together, and after 2 days he finally taught himself how to load a game and showed me how to do it myself. A life immersed in computers was born this day, I guess. I played mostly games at this stage, but did venture into the C-64 BASIC programming language a couple of times on and off. After some years, I got my hands on an Amiga 500, which was mainly used for games and some school work and fiddling around. Amiga 1200 was next.

Back in 1993-94 my father was clear-sighted enough to understand that Amiga was, unfortunately, not the way of the future. PC and i386 computers were. Despite my screams in vain he bought me a PC, a Compaq 486 at 50 MHz with 16 MB of RAM. This was actually one of the worst computer designs I have ever seen; everything was integrated, including speakers and CRT screen. I guess they were trying to mimic the Apple designs of the day, but failed miserably to do so. It should be noted, though, that this was the computer that got me really into computers. I started coding for real, started using the Internet and actually installed Linux on this machine.

I have for a long time been an avid Linux user and administrator. My Linux experience started in 1994 with a Slackware installation from borrowed CDs. This first installation was mostly a trial installation. I had no previous experience and it took me quite some time to get modems running et cetera, and I kept running a dual boot system. For the second installation, circa 1996, I had no media around, so I wound up downloading the whole Slackware A, AP, D and N disksets via FTP on a 28k8 modem. Since I realized I would never learn anything from using graphical interfaces, I went back to basics. Nothing but console, no X11 or graphics except for svgalib. In the end, I believe this has helped me a lot. I believe nothing teaches you how to use something like actually forcing yourself to do it, as I did at this time. I had no choice but to learn. I continued running like this for close to 2 years. After this, I finally installed XFree86 from scratch. After a 24 hour compilation, I realized that I had totally misconfigured the compilation and had to restart it from scratch. As a human, you are always bound to make errors. It simply happens and you had better get used to it. Also, this kind of build process teaches you to be patient. Let things take their time and don't force them.

In 2000-2001 I was part of a small group of people who ran a news site mainly focusing on Amiga related news, but also some Linux and general computer news. The site was called BoingWorld, located at www.boingworld.com (no longer available, unfortunately). The Linux 2.3 kernels were reaching their end of line and the 2.4 kernels were starting to pop up. At this point, I realized there was a half-new concept of firewalling inside of it. Sure, I had run into ipfwadm and ipchains before and used them to some extent, but I had never truly gone head first into them. I also realized there was embarrassingly little documentation, and I felt it might be an interesting idea to write an iptables tutorial for BoingWorld. Said and done, I wrote the first 5-10 pages of what you are currently reading. As it became a smashing hit, I continued to add material to the tutorial. The original pages are no longer anywhere to be found in this tutorial/documentation, but the concept lives on.

I have worked at several different companies during this time with Linux/network administration, written documentation and course material, and helped several hundred, if not thousand, people who emailed questions regarding iptables, netfilter and general networking. I have attended two CERTconf's and held three presentations at the same conference, and also attended the Netfilter workshop 2003. It has been a hectic and sometimes very ungrateful job to maintain and update this work, but in the end I am very happy for it and this is something I am very proud of having done. At the time of writing this, at the end of 2006, the project has been close to dead for several years, and I regret this. I hope to change this in the coming years, and that a lot of people will find this work to be of future use, possibly adding to the family of documents with other interesting documentation that might be needed.

How to read

This document can be read either as a reference or from start to end. It was originally written as a small introduction to iptables and to some extent netfilter, but this focus has changed over the years. It aims to be as complete a reference as possible to iptables and netfilter, and to give at least a basic and fast primer or repetition of the areas that you might need to understand. It should be noted that this document will not, nor will it be able to, deal with specific bugs inside or outside the scope of iptables and netfilter, nor does it really deal with how to get around bugs like this.

If you find peculiar bugs or behaviors in iptables or any of the subcomponents, you should contact the Netfilter mailing lists and tell them about the problem, and they can tell you if this is a real bug or if it has already been fixed. Security related bugs are found in iptables and Netfilter; one or two do slip by once in a while, it's inevitable. These are properly shown on the front page of the Netfilter main page, and that is where you should go to get information on such topics.

The above also implies that the rule-sets available with this tutorial are not written to deal with actual bugs inside Netfilter. The main goal of them is to simply show how to set up rules in a nice simple fashion that deals with all problems we may run into. For example, this tutorial will not cover how we would close down the HTTP port for the simple reason that Apache happens to be vulnerable in version 1.2.12 (This is covered really, though not for that reason).

This document was written to give everyone a good and simple primer at how to get started with iptables, but at the same time it was created to be as complete as possible. It does not contain any targets or matches that are in patch-o-matic for the simple reason that it would require too much effort to keep such a list updated. If you need information about the patch-o-matic updates, you should read the info that comes with it in patch-o-matic as well as the other documentations available on the Netfilter main page.

If you have any suggestions on additions or if you think you find any problems around the area of iptables and netfilter not covered in this document feel free to contact me about this. I will be more than happy to take a look at it and possibly add what might be missing.


This document requires some previous knowledge about Linux/Unix, shell scripting, as well as how to compile your own kernel, and some simple knowledge about the kernel internals.

I have tried as much as possible to eradicate all prerequisites needed before fully grasping this document, but to some extent it is simply impossible to not need some previous knowledge.

Conventions used in this document

The following conventions are used in this document when it comes to commands, files and other specific information.

• Long code excerpts and command-outputs are printed like shown below. This includes screendumps and larger examples taken from the console.

[blueflux@work1 neigh]$ ls

default eth0 lo

[blueflux@work1 neigh]$

• All commands and program names in the tutorial are shown in bold typeface. This includes all the commands that you might type, or part of the command that you type.

• All system items such as hardware, and also kernel internals or abstract system items such as the loopback interface are all shown in an italic typeface.

• computer output is formatted in this way in the text. Computer output could be summed up as all the output that the computer will give you on the console.

• filenames and paths in the file-system are shown like /usr/local/bin/iptables.

Chapter 1. Introduction

Why this document was written

Well, I found a big empty space in the HOWTO's out there lacking in information about the iptables and Netfilter functions in the new Linux 2.4.x kernels. Among other things, I'm going to try to answer questions that some might have about the new possibilities like state matching. Most of this will be illustrated with an example rc.firewall.txt file that you can use in your /etc/rc.d/ scripts. Yes, this file was originally based upon the masquerading HOWTO for those of you who recognize it.

Also, there's a small script that I wrote just in case you screw up as much as I did during the configuration available as rc.flush-iptables.txt.

How it was written

I originally wrote this as a very small tutorial for boingworld.com, which was an Amiga/Linux/General news site that a small group of people, including me, ran a couple of years back. Due to the fantastic number of readers and comments that I got from it, I continued to write on it. The original version was approximately 10-15 A4 pages in printed version and has since been growing slowly but steadily. A huge number of people have helped me out with spellchecking, bug corrections, etc. At the time of writing this, the http://iptables-tutorial.frozentux.net/ site alone has had over 600,000 unique hits.

This document was written to guide you through the setup process step by step and hopefully help you to understand some more about the iptables package. I have based most of the stuff here on the example rc.firewall file, since I found that example to be a good way to learn how to use iptables. I decided to just follow the basic chain structure and from there walk through each and every one of the chains traversed and explain how the script works. That way the tutorial is a little bit harder to follow, though this way is more logical. Whenever you find something that's hard to understand, just come back to this tutorial.

Terms used in this document

This document contains a few terms that may need more detailed explanations before you read them. This section will try to cover the most obvious ones and how I have chosen to use them within this document.

Connection - This is generally referred to in this document as a series of packets relating to each other. These packets refer to each other as an established kind of connection. A connection is, in other words, a series of exchanged packets. In TCP, this mainly means establishing a connection via the 3-way handshake, and then this is considered a connection until the release handshake.

DNAT - Destination Network Address Translation. DNAT refers to the technique of translating the Destination IP address of a packet, or simply put, changing it. This is used together with SNAT to allow several hosts to share a single Internet routable IP address, and to still provide Server Services. This is normally done by assigning different ports on an Internet routable IP address to different hosts, and then telling the Linux router where to send the traffic.

IPSEC - Internet Protocol Security is a protocol used to encrypt IPv4 packets and send them securely over the Internet. For more information on IPSEC, look in the Other resources and links appendix for other resources on the topic.

Kernel space - This is more or less the opposite of User space. This implies the actions that take place within the kernel, and not outside of the kernel.

Packet - A singular unit sent over a network, containing a header and a data portion. For example, an IP packet or a TCP packet. In the Requests For Comments (RFCs) a packet isn't so generalized; instead, IP packets are called datagrams, while TCP packets are called segments. I have chosen to call pretty much everything packets in this document for simplicity.

QoS - Quality of Service is a way of specifying how a packet should be handled and what kind of service quality it should receive while sending it. For more information on this topic, take a look in the TCP/IP repetition chapter as well as the Other resources and links appendix for external resources on the subject.

Segment - A TCP segment is pretty much the same as a packet; it is the formalized word for a TCP packet.

Stream - This term refers to a connection that sends and receives packets that are related to each other in some fashion. Basically, I have used this term for any kind of connection that sends two or more packets in both directions. In TCP this may mean a connection that sends a SYN and then replies with a SYN/ACK, but it may also mean a connection that sends a SYN and then replies with an ICMP Host unreachable. In other words, I use this term very loosely.

SNAT - Source Network Address Translation. This refers to the techniques used to translate one source address to another in a packet. This is used to make it possible for several hosts to share a single Internet routable IP address, since there is currently a shortage of available IP addresses in IPv4 (IPv6 will solve this).

State - This term refers to which state the packet is in, either according to RFC 793 - Transmission Control Protocol or according to userside states used in Netfilter/iptables. Note that the states used internally and externally do not follow the RFC 793 specification fully. The main reason is that Netfilter has to make several assumptions about the connections and packets.

User space - With this term I mean everything and anything that takes place outside the kernel. For example, invoking iptables -h takes place outside the kernel, while iptables -A FORWARD -p tcp -j ACCEPT takes place (partially) within the kernel, since a new rule is added to the ruleset.

Userland - See User space.

VPN - Virtual Private Network is a technique used to create virtually private networks over non-private networks, such as the Internet. IPSEC is one technique used to create VPN connections. OpenVPN is another.
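The SNAT and DNAT entries above can be illustrated with a small sketch. This is not how Netfilter implements NAT; it is only a toy model, assuming a router with one public address and a hand-written port table. The `Packet` class, the addresses and the `DNAT_TABLE` mapping are all hypothetical names made up for the illustration.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src: str
    sport: int
    dst: str
    dport: int

# Hypothetical Internet routable address of the Linux router.
PUBLIC_IP = "203.0.113.1"

# Hypothetical DNAT table: public port -> (internal host, internal port).
DNAT_TABLE = {80: ("192.168.1.10", 80), 25: ("192.168.1.20", 25)}

def snat(pkt: Packet) -> Packet:
    """Rewrite the source address of an outgoing packet to the public address."""
    return replace(pkt, src=PUBLIC_IP)

def dnat(pkt: Packet) -> Packet:
    """Rewrite the destination of an incoming packet according to DNAT_TABLE."""
    target = DNAT_TABLE.get(pkt.dport)
    if pkt.dst != PUBLIC_IP or target is None:
        return pkt  # not addressed to a forwarded port, leave untouched
    host, port = target
    return replace(pkt, dst=host, dport=port)

# An internal host talks to the outside: the source address is translated.
outgoing = snat(Packet("192.168.1.10", 40000, "198.51.100.7", 80))
# An outside host connects to port 25 on the router: the destination is translated.
incoming = dnat(Packet("198.51.100.7", 51000, PUBLIC_IP, 25))
print(outgoing)
print(incoming)
```

A real NAT implementation must also remember each translation so that reply packets can be translated back, which this sketch leaves out.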

What's next?

This chapter has given some small insight into why this document was written and how it was written. It also explained some common terms used throughout the document.

The next chapter will bring up a rather lengthy introduction and repetition to TCP/IP. Basically this means the IP protocol and some of its sub-protocols that are commonly used with iptables and netfilter. These are TCP, UDP, ICMP and SCTP. SCTP is a rather new standard in comparison to the other protocols, hence quite a lot of space and time has gone into describing this protocol for all of those who are still not quite familiar with it. The next chapter will also discuss some basic and more advanced routing techniques used today.

Chapter 2. TCP/IP repetition

Iptables is an extremely knowledge intensive tool. This means that it takes quite a bit of knowledge to be able to use iptables to its full extent. Among other things, you must have a very good understanding of the TCP/IP protocol.

This chapter aims at explaining the pure "must understands" of TCP/IP before you can go on and work with iptables. Among the things we will go through are the IP, TCP, UDP and ICMP protocols and their headers, and general usages of each of these protocols and how they correlate to each other. Iptables works inside Internet and Transport layers, and because of that, this chapter will focus mainly on those layers as well.

Iptables is also able to work on higher layers, such as the Application layer. However, it was not built for this task, and should not be used for that kind of usage. I will explain more about this in the IP filtering introduction chapter.

TCP/IP Layers

TCP/IP is, as already stated, multi-layered. This means that we have one functionality running at one depth, and another one at another level, etcetera. The reason that we have all of these layers is actually very simple.

The biggest reason is that the whole architecture is very extensible. We can add new functionality to the application layers, for example, without having to reimplement the whole TCP/IP stack code, or to include a complete TCP/IP stack in the actual application. In just the same way, we don't need to rewrite every single program every time that we make a new network interface card. Each layer should need to know as little as possible about the others, to keep them separated.

Note When we are talking about the programming code of TCP/IP which resides inside the kernel, we are often talking about the TCP/IP stack. The TCP/IP stack simply means all of the sublayers used, from the Network access layer and all the way up to the Application layer.

There are two basic architectures to follow when talking about layers. One of them is the OSI (Open Systems Interconnect) Reference Model, which consists of 7 layers. We will only look at it superficially here since we are more interested in the TCP/IP layers. However, from a historical point of view, this is interesting to know about, especially if you are working with lots of different types of networks. The layers are as follows in the OSI Reference Model list.

Note There is some discussion as to which of these reference models is used the most, but it seems that the OSI reference model is still the prevalent one. This might also depend on where you live; however, in most US and EU countries it seems that you can default to the OSI reference model while speaking to technicians and salespeople.

However, throughout the rest of this document, we will mainly refer to the TCP/IP reference model, unless otherwise noted.

Application layer

Presentation layer

Session layer

Transport layer

Network layer

Data Link layer

Physical layer

A packet that is sent by us goes from the top to the bottom of this list, each layer adding its own set of headers to the packet in what we call the encapsulation phase. When the packet finally reaches its destination, it goes backwards through the list and the headers are stripped off the packet, one by one, each header giving the destination host all of the information needed for the packet data to finally reach the application or program that it was destined for.

The second layering standard, and the one that we are more interested in, is the TCP/IP protocol architecture, as shown in the TCP/IP architecture list. There is no universal agreement among people on just how many layers there are in the TCP/IP architecture. However, it is generally considered that there are 3 through 5 layers available, and in most pictures and explanations, there will be 4 layers discussed. We will, for simplicity's sake, only consider those four layers that are generally discussed.

Application layer

Transport layer

Internet layer

Network Access layer

As you can see, the architecture of the TCP/IP protocol set is very much like the OSI Reference Model, yet not identical. Just as with the OSI Reference Model, we add and subtract headers for each layer that we enter or leave.

For example, let's use one of the most common analogies to modern computer networking, the snail-mail letter. Everything is done in steps, just as everything is in TCP/IP.
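The adding and stripping of headers can be sketched in a few lines of code. This is a toy model only: real headers are binary structures, while here each layer simply prepends a readable tag, and the `LAYERS` list and function names are assumptions made for the illustration.

```python
# The four TCP/IP layers discussed above, from top to bottom.
LAYERS = ["Application", "Transport", "Internet", "Network Access"]

def encapsulate(data: bytes) -> bytes:
    """Walk down the stack: every layer below Application prepends its header."""
    packet = data
    for layer in LAYERS[1:]:
        packet = f"[{layer}]".encode() + packet
    return packet

def decapsulate(packet: bytes) -> bytes:
    """Walk up the stack: strip the headers in the reverse order."""
    for layer in reversed(LAYERS[1:]):
        header = f"[{layer}]".encode()
        if not packet.startswith(header):
            raise ValueError(f"missing {layer} header")
        packet = packet[len(header):]
    return packet

wire = encapsulate(b"hello")
print(wire)               # the outermost header belongs to the lowest layer
print(decapsulate(wire))  # the receiver gets the original application data back
```

Note that the header added last (Network Access) ends up outermost, which is why the receiving host has to strip the headers from the bottom layer upwards.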

You want to send a letter to someone asking how they are, and what they are doing. To do this, you must first create the data, or questions. The actual data would be located inside the Application layer.

After this we would put the data written on a sheet of paper inside an envelope and write on it for whom the letter is destined within a specific company or household. Perhaps something like the example below:

Attn: John Doe

This is equivalent to the Transport layer, as it is known in TCP/IP. In the Transport layer, if we were dealing with TCP, this would have been equivalent to some port (e.g., port 25).

At this point we write the address on the envelope of the recipient, such as this:

Andersgardsgatan 2 41715 Gothenburg

This would in the analogy be the same as the Internet layer. The internet layer contains information telling us where to reach the recipient, or host, in a TCP/IP network, in just the same way as the recipient address on an envelope. This would be the equivalent of the IP address in other words (e.g., IP

The final step is to put the whole letter in a postbox. Doing this would be approximately equal to putting a packet into the Network Access Layer. The network access layer contains the functions and routines for accessing the actual physical network that the packet should be transported over.

When the receiver finally receives the letter, he will remove the letter from the envelope with its address etc (decapsulate it). The letter he receives may either require a reply or not. In either case, the receiver may reply to the letter by reversing the receiver and transmitter addresses on the original letter he received, so that receiver becomes transmitter, and transmitter becomes receiver.

Note It is very important to understand that iptables was and is specifically built to work on the headers of the Internet and the Transport layers. It is possible to do some very basic filtering with iptables in the Application and Network access layers as well, but it was not designed for this, nor is it very suitable for those purposes.

For example, say we use a string match and match for a specific string inside the packet, let's say get /index.html. Will that work? Normally, yes. However, if the packet size is very small, it will not. The reason is that iptables is built to work on a per packet basis, which means that if the string is split into several separate packets, iptables will not see that whole string. For this reason, you are much, much better off using a proxy of some sort for filtering in the application layer. We will discuss these problems in more detail later on in the IP filtering introduction.

As iptables and netfilter mainly operate in the Internet and Transport layers, those are the layers that we will mainly focus on in the upcoming sections of this chapter. Under the Internet layer, we will almost exclusively see the IP protocol. There are a few additions to this, such as, for example, the GRE protocol, but they are very rare on the internet. Also, iptables is (as the name implies) not focused around these protocols very well either. Because of all these factors we will mainly focus around the IP protocol of the Internet layer, and TCP, UDP and ICMP of the Transport layer.
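The per packet limitation can be demonstrated with a minimal sketch. It does not use iptables at all; it only mimics the idea of looking for a string in each packet payload separately, and the function name and the split point are assumptions chosen for the illustration.

```python
PATTERN = b"get /index.html"

def per_packet_match(packets, pattern=PATTERN):
    """Return True if any single packet payload contains the whole pattern."""
    return any(pattern in payload for payload in packets)

stream = b"get /index.html HTTP/1.0\r\n\r\n"

# The whole request fits in one packet: the per packet match sees the string.
one_packet = [stream]

# The same request split after 8 bytes: no single packet holds the full string.
split_packets = [stream[:8], stream[8:]]

print(per_packet_match(one_packet))       # True
print(per_packet_match(split_packets))    # False
# A proxy works on the reassembled stream, so it still finds the string.
print(PATTERN in b"".join(split_packets))  # True
```

The last line shows why a proxy is the better tool here: it sees the data after the stream has been put back together, not the individual packets.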

Note The ICMP protocol is actually sort of a mix between the two layers. It runs in the Internet layer, but it has the exact same headers as the IP protocol, but also a few extra headers, and then directly inside that encapsulation, the data. We will discuss this in more detail further on, in the ICMP characteristics.

IP characteristics

The IP protocol resides in the Internet layer, as we have already said. The IP protocol is the protocol in the TCP/IP stack that is responsible for letting your machine, routers, switches, etcetera, know where a specific packet is going. This protocol is the very heart of the whole TCP/IP stack, and makes up the very foundation of everything in the Internet.

The IP protocol encapsulates the Transport layer packet with information about which Transport layer protocol it came from, what host it is going to, and where it came from, and a little bit of other useful information. All of this is, of course, extremely precisely standardized, down to every single bit. The same applies to every single protocol that we will discuss in this chapter.

The IP protocol has a couple of basic functionalities that it must be able to handle. It must be able to define the datagram, which is the next building block created by the transport layer (this may in other words be TCP, UDP or ICMP for example). The IP protocol also defines the Internet addressing system that we use today. This means that the IP protocol is what defines how to reach between hosts, and this also affects how we are able to route packets, of course. The addresses we are talking about are what we generally call IP addresses. Usually when we talk about IP addresses, we talk about dotted quad numbers (e.g., 127.0.0.1). This is mostly to make the IP addresses more readable for the human eye, since the IP address is actually just a 32 bit field of 1's and 0's (127.0.0.1 would hence be read as 01111111000000000000000000000001 within the actual IP header).
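The relation between the dotted quad notation and the 32 bit field can be checked with a short sketch, here using Python's standard `ipaddress` module and taking 127.0.0.1 as an example address. The two function names are made up for this illustration.

```python
import ipaddress

def dotted_quad_to_bits(address: str) -> str:
    """Return the 32 bit binary string for a dotted quad IPv4 address."""
    return format(int(ipaddress.IPv4Address(address)), "032b")

def bits_to_dotted_quad(bits: str) -> str:
    """Return the dotted quad notation for a 32 bit binary string."""
    return str(ipaddress.IPv4Address(int(bits, 2)))

bits = dotted_quad_to_bits("127.0.0.1")
print(bits)                       # four octets of 8 bits each, most significant first
print(bits_to_dotted_quad(bits))  # back to the human readable form
```

Each of the four decimal numbers is simply one octet (8 bits) of the field, written out in base 10.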

The IP protocol has even more magic up its sleeve that it must perform. It must also be able to decapsulate and encapsulate the IP datagram (IP data) and send or receive the datagram from either the Network access layer, or the transport layer. This may seem obvious, but sometimes it is not. On top of all this, it has two big functions it must perform as well, which will be of quite some interest to the firewalling and routing community. The IP protocol is responsible for routing packets from one host to another, as well as packets that we may receive from one host destined for another. Most of the time, on a host with a single network interface, this is a very simple process. You have two different options: either the packet is destined for our locally attached network, or it is sent through a default gateway. But once you start working with firewalls or security policies together with multiple network interfaces and different routes, it may cause quite some headache for many network administrators. The last of the responsibilities of the IP protocol is that it must fragment and reassemble any datagram that has previously been fragmented, or that needs to be fragmented to fit into the packet size of the specific network hardware topology that we are connected to. If these packet fragments are sufficiently small, they may cause a horribly annoying headache for firewall administrators as well. The problem is that once they are fragmented into small enough chunks, we will start having problems reading even the headers of the packet, not to mention the actual data.
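Fragmentation and reassembly can be sketched as follows. This is a simplified model, not a real IP implementation: it only tracks the payload, the offset and a "more fragments" flag. It relies on the fact that IPv4 fragment offsets are counted in units of 8 bytes; the `Fragment` type, the function names and the default 20 byte header length are assumptions made for the illustration.

```python
from typing import List, NamedTuple

class Fragment(NamedTuple):
    offset: int            # position of this fragment's data, in 8-byte units
    more_fragments: bool   # True for every fragment except the last one
    data: bytes

def fragment(payload: bytes, mtu: int, header_len: int = 20) -> List[Fragment]:
    """Split a payload so that header plus data fits within the given MTU."""
    # The data size of every fragment but the last must be a multiple of 8.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    for start in range(0, len(payload), max_data):
        chunk = payload[start:start + max_data]
        more = start + max_data < len(payload)
        fragments.append(Fragment(start // 8, more, chunk))
    return fragments

def reassemble(fragments: List[Fragment]) -> bytes:
    """Put the fragments back in offset order, whatever order they arrived in."""
    ordered = sorted(fragments, key=lambda f: f.offset)
    return b"".join(f.data for f in ordered)

payload = bytes(range(100))
frags = fragment(payload, mtu=60)
print([(f.offset, f.more_fragments, len(f.data)) for f in frags])
print(reassemble(list(reversed(frags))) == payload)
```

With a hypothetical limit of only 8 data bytes per fragment, a 20 byte TCP header would be spread over three fragments, which is the kind of situation that makes tiny fragments hard to filter on.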

Tip As of Linux kernel 2.4 series, and iptables, this should no longer be a problem for most linux firewalls. The connection tracking system used by iptables for state matching and NAT'ing etc must be able to read the packet defragmented. Because of this, conntrack automatically defragments all packets before they reach the netfilter/iptables structure in the kernel.

The IP protocol is also a connectionless protocol, which in turn means that IP does not "negotiate" a connection. A connection-oriented protocol, on the other hand, negotiates a connection (called a handshake) and then, when all data has been sent, tears it down. TCP is an example of this kind of protocol; however, it is implemented on top of the IP protocol. The reasons for IP not being connection-oriented are several, but among others, a handshake is not required at this level, since it would add unnecessarily high overhead to other protocols that are built in such a way that if we don't get a reply, we know the packet was lost somewhere in transit anyway, and resend the original request. As you can see, sending the request and then waiting for a specified amount of time for the reply is, in this case, much preferred over first sending one packet to say that we want to open a connection, then receiving a packet letting us know it was opened, and finally acknowledging that we know that the whole connection is actually open, and then actually sending the request, and after that sending another packet to tear the connection down and waiting for another reply.

IP is also known as an unreliable protocol, or simply put, it does not know if a packet was received or not. It simply receives a packet from the transport layer and does its thing, and then passes it on to the network access layer, and there is nothing more to it. It may receive a return packet, which traverses from the network access layer to the IP protocol, which does its thing again, and then passes it on upwards to the Transport layer. However, it doesn't care if it gets a reply packet, or if the packet was received at the other end. The same reasoning applies to the unreliability of IP as to its connectionless nature, since reliability would require adding an extra reply packet to each packet that is sent. For example, let us consider a DNS lookup. As it is, we send a DNS request for servername.com. If we never receive a reply, we know something went wrong and re-request the lookup, but during normal use we would send out one request, and get one reply back. Adding reliability to this protocol would mean that the request would require two packets (one request, and one confirmation that the packet was received) and then two packets for the reply (one reply, and one reply to acknowledge the reply was received). In other words, we just doubled the number of packets that need to be sent, and almost doubled the amount of data that needs to be transmitted.

IP headers

The IP packet contains several different parts in the header, as you have understood from the previous introduction to the IP protocol. The whole header is meticulously divided into different parts, and each part of the header is allocated as small a piece as possible to do its work, just to give the protocol as little overhead as possible. You will see the exact configuration of the IP headers in the IP headers image.

Note Understand that the explanations of the different headers are very brief and that we will only discuss the absolute basics of them. For each type of header that we discuss, we will also list the proper RFC's that you should read for further understanding and technical explanations of the protocol in question. As a sidenote to this note, RFC stands for Request For Comments, but these days, they have a totally different meaning to the Internet community. They are what defines and standardises the whole Internet, compared to what they were when the researchers started writing RFC's to each other. Back then, they were simply requests for comments and a way of asking other researchers about their opinions.

The IP protocol is mainly described in RFC 791 - Internet Protocol. However, this RFC is also updated by RFC 1349 - Type of Service in the Internet Protocol Suite, which was obsoleted by RFC 2474 - Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers, and which was updated by RFC 3168 - The Addition of Explicit Congestion Notification (ECN) to IP and RFC 3260 - New Terminology and Clarifications for Diffserv.

Tip As you can see, all of these standards can get a little bit hard to follow at times. One tip for finding the different RFC's that are related to each other is to use the search functions available at RFC-editor.org. In the case of IP, consider that the RFC 791 is the basic RFC, and all of the other are simply updates and changes to that standard. We will discuss these more in detail when we get to the specific headers that are changed by these newer RFC's.

One thing to remember is that sometimes an RFC can be obsoleted (not used at all). Normally this means that the RFC has been so drastically updated that it is better to simply replace the whole thing. It may also become obsolete for other reasons. When an RFC becomes obsoleted, a field is added to the original RFC that points to the new RFC instead.

Version - bits 0-3. This is the version number of the IP protocol in binary. IPv4 is called 0100, while IPv6 is called 0110. This field is generally not used for filtering very much. The version described in RFC 791 is IPv4.

IHL (Internet Header Length) - bits 4-7. This field tells us how long the IP header is in 32 bit words. As you can see, we have split the header up in this way (32 bits per line) in the image as well. Since the Options field is of variable length, we can never be absolutely sure of how long the whole header is without this field. The minimum length of the header is 5 words.
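To make the layout of these first two fields a bit more concrete, here is a minimal Python sketch of how the first octet of an IP header could be split into the Version and IHL values. The function name split_version_ihl and the sample octet are only chosen for illustration, and are not taken from any particular tool.

```python
def split_version_ihl(first_octet):
    """Return (version, ihl_words, header_bytes) for the first header octet."""
    version = first_octet >> 4        # bits 0-3, the upper half of the octet
    ihl_words = first_octet & 0x0F    # bits 4-7, the lower half of the octet
    return version, ihl_words, ihl_words * 4   # one word is 32 bits, or 4 bytes


print(split_version_ihl(0x45))   # (4, 5, 20)
```

A first octet of 0x45 is what a plain IPv4 header without options would carry: version 4 and the minimum header length of 5 words, or 20 bytes.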

Type of Service, DSCP, ECN - bits 8-15. This is one of the most complex areas of the IP header for the simple reason that it has been updated 3 times. It has always had the same basic usage, but the implementation has changed several times. First the field was called the Type of Service field. Bits [0-2] of the field were called the Precedence field. Bit [3] was Normal/Low delay, bit [4] was Normal/High throughput, bit [5] was Normal/High reliability and bits [6-7] were reserved for future usage. This is still used in a lot of places with older hardware, and it still causes some problems for the Internet. Among other things, bits [6-7] are specified to be set to 0. In the ECN updates (RFC 3168), we start using these reserved bits and hence set other values than 0 in these bits. But a lot of old firewalls and routers have built-in checks looking at whether these bits are set to 1, and if they are, the packet is discarded. Today, this is clearly a violation of the RFC's, but there is not much you can do about it, except to complain.

The second iteration of this field was when the field was changed into the DS field as defined in RFC 2474. DS stands for Differentiated Services. According to this standard, bits [0-5] are the Differentiated Services Code Point (DSCP) and the remaining two bits [6-7] are still unused. The DSCP field is used pretty much the same way as the ToS field was used before, to mark what kind of service this packet should be given, if the router in question makes any difference between them. One big change is that a device must ignore the unused bits to be fully RFC 2474 compliant, which means we get rid of the hassle explained above, as long as the device creators follow this RFC.

The third, and almost last, change of the ToS field was when the two previously unused bits were used for ECN (Explicit Congestion Notification), as defined in RFC 3168. ECN is used to let the end nodes know about a router's congestion before it actually starts dropping packets, so that the end nodes will be able to slow down their data transmissions before the router actually needs to start dropping data. Previously, dropping data was the only way that a router had to tell that it was overloaded, and the end nodes had to do a slow restart for each dropped packet, and then slowly gather up speed again. The two bits are named ECT (ECN Capable Transport) and CE (Congestion Experienced) codepoints.

The final iteration of the whole mess is RFC 3260 which gives some new terminology and clarifications to the usage of the DiffServ system. It doesn't involve too many new updates or changes, except in the terminology. The RFC is also used to clarify some points that were discussed between developers.
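Since the same octet can be read in more than one way, a small sketch may help. The following Python functions read one octet first the RFC 2474 and RFC 3168 way (DSCP plus ECN), and then the original RFC 791 way (Precedence plus the three service bits). The function names are made up for this example, and the ECN codepoint names are the ones used in RFC 3168.

```python
# Names of the four values that bits [6-7] can take, as used in RFC 3168.
ECN_CODEPOINTS = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_ds_byte(value):
    """Read the octet as a DS field: (DSCP value, ECN codepoint name)."""
    dscp = value >> 2       # bits [0-5]
    ecn = value & 0b11      # bits [6-7]
    return dscp, ECN_CODEPOINTS[ecn]


def split_tos_byte(value):
    """Read the same octet as the original Type of Service field."""
    return {
        "precedence": value >> 5,                  # bits [0-2]
        "low_delay": bool(value & 0x10),           # bit [3]
        "high_throughput": bool(value & 0x08),     # bit [4]
        "high_reliability": bool(value & 0x04),    # bit [5]
        "reserved": value & 0x03,                  # bits [6-7]
    }


print(split_ds_byte(0xB8))    # (46, 'Not-ECT')
print(split_tos_byte(0xB8)["precedence"])   # 5
```

Note how an old device that insists on bits [6-7] being zero would see a non-zero "reserved" value as soon as any ECN codepoint other than Not-ECT is used.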

Total Length - bits 16 - 31. This field tells us how large the packet is in octets, including headers and everything. The maximum size is 65535 octets, or bytes, for a single packet. Every host must be able to receive packets of at least 576 bytes, regardless of whether the packet arrives in fragments or not. It is only recommended to send larger packets than this limit if it can be guaranteed that the host can receive them, according to RFC 791. However, these days most networks run at a 1500 byte packet size. This includes almost all ethernet connections, and most Internet connections.

Identification - bits 32 - 47. This field is used in aiding the reassembly of fragmented packets.

Flags - bits 48 - 50. This field contains a few miscellaneous flags pertaining to fragmentation. The first bit is reserved, but still not used, and must be set to 0. The second bit is set to 0 if the packet may be fragmented, and to 1 if it may not be fragmented. The third and last bit is set to 0 if this is the last fragment, and to 1 if there are more fragments of this same packet.

Fragment Offset - bits 51 - 63. The fragment offset field shows where in the datagram this fragment belongs. The offset is counted in units of 64 bits (8 octets), and the first fragment has offset zero.
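The Flags and Fragment Offset fields share one 16 bit word, so they are usually decoded together. The following is a minimal Python sketch of that decoding. The function name parse_fragment_word and the sample values are assumptions made for this example only.

```python
import struct


def parse_fragment_word(word):
    """Decode the 16 bits that hold the Flags and Fragment Offset fields.

    Returns (dont_fragment, more_fragments, offset_in_octets).
    """
    (value,) = struct.unpack("!H", word)     # network byte order, 16 bits
    dont_fragment = bool(value & 0x4000)     # second flag bit
    more_fragments = bool(value & 0x2000)    # third flag bit
    offset_octets = (value & 0x1FFF) * 8     # 13 bit offset, in units of 8 octets
    return dont_fragment, more_fragments, offset_octets


print(parse_fragment_word(b"\x40\x00"))   # (True, False, 0)
print(parse_fragment_word(b"\x20\xb9"))   # (False, True, 1480)
```

The second sample would be a fragment that starts 1480 octets into the original datagram and that has more fragments following it.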

Time to live - bits 64 - 71. The TTL field tells us how long the packet may live, or rather how many "hops" it may take over the Internet. Every host that forwards the packet must subtract one from the TTL field, and if the TTL reaches zero, the whole packet must be destroyed and discarded. This is basically used as a safety trigger so that a packet may not end up in an uncontrollable loop between one or several hosts. Upon destruction the host should return an ICMP Time exceeded message to the sender.

Protocol - bits 72 - 79. In this field the protocol of the next level layer is indicated. For example, this may be TCP, UDP or ICMP among others. All of these numbers are defined by the Internet Assigned Numbers Authority. All numbers can be found on their homepage, Internet Assigned Numbers Authority.

Header checksum - bits 80 - 95. This is a checksum of the IP header of the packet. This field is recomputed at every host that changes the header, which means pretty much every host that the packet traverses, since they most often change the packet's TTL field or some other field.
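As a rough illustration of why the checksum has to be recomputed whenever the TTL changes, here is a small Python sketch. It uses the one's complement sum over 16 bit words that RFC 791 specifies for the header checksum. The function names and the 20 octet sample header are made up for this example.

```python
def internet_checksum(data):
    """One's complement of the one's complement sum of all 16 bit words."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a 16 bit boundary
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                        # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def decrement_ttl(header):
    """Return a copy of the header with the TTL lowered by one and a fresh checksum."""
    fields = bytearray(header)
    if fields[8] == 0:
        raise ValueError("TTL is already zero, the packet should be discarded")
    fields[8] -= 1                            # the TTL is the ninth octet
    fields[10:12] = b"\x00\x00"               # zero the checksum while calculating
    fields[10:12] = internet_checksum(bytes(fields)).to_bytes(2, "big")
    return bytes(fields)


# Made-up sample header with the checksum field (octets 10-11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))   # 0xb861
```

A receiver can verify a header by running the same sum over all of it, checksum included. If nothing was corrupted, the result is zero.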

Source address - bits 96 - 127. This is the source address field. It is generally written as 4 octets, translated from binary to decimal numbers with dots in between. The field lets the receiver know where the packet came from.

Destination address - bits 128 - 159. The destination address field contains the destination address, and what a surprise, it is formatted the same way as the source address.
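The translation between the 4 octets in the header and the dotted decimal form can be sketched with the ipaddress module from the Python standard library. The helper names are made up for this example, and the address used is taken from the range reserved for documentation, so it is purely hypothetical.

```python
import ipaddress


def to_dotted(address_field):
    """Format the 4 octets of an address field as dotted decimal text."""
    return str(ipaddress.IPv4Address(address_field))


def from_dotted(text):
    """Turn dotted decimal text back into the 4 octets stored in the header."""
    return ipaddress.IPv4Address(text).packed


print(to_dotted(bytes([192, 0, 2, 1])))   # 192.0.2.1
print(from_dotted("192.0.2.1").hex())     # c0000201
```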

Options - bits 160 - 191 <> 479. The options field is not optional, as it may sound. Actually, this is one of the more complex fields in the IP header. The options field contains different optional settings within the header, such as Internet timestamps, source routing or record route options. Since these options are all optional, the Options field can have different lengths, and hence so can the whole IP header. However, since we always calculate the IP header in 32 bit words, we must always end the header on an even boundary, that is, a multiple of 32 bits. The field may contain zero or more options.

The options field starts with a brief 8 bit field that lets us know which options are used in the packet. The options are all listed in the TCP Options table, in the TCP options appendix. For more information about the different options, read the proper RFC's. For an updated listing of the IP options, check at Internet Assigned Numbers Authority.

Padding - bits variable. This is a padding field that is used to make the header end at an even 32 bit boundary. The field must always be set to zeroes straight through to the end.
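How the Options, Padding and IHL fields hang together can be shown with a little arithmetic. The sketch below, with a function name invented for this example, takes the number of option octets and works out how much padding is needed and what the IHL field would then contain. It assumes the fixed part of the header is 20 octets and that IHL cannot exceed 15 words, as described above.

```python
def ihl_for_options(option_octets):
    """Return (ihl_words, padding_octets) for a header with that many option octets."""
    if not 0 <= option_octets <= 40:
        # IHL is 4 bits wide, so the header can be at most 15 words (60 octets),
        # which leaves at most 40 octets for options after the fixed 20.
        raise ValueError("options must fit in 40 octets")
    padding = (-option_octets) % 4            # fill up to the next 32 bit boundary
    ihl_words = (20 + option_octets + padding) // 4
    return ihl_words, padding


print(ihl_for_options(0))    # (5, 0)
print(ihl_for_options(7))    # (7, 1)
```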

TCP characteristics

The TCP protocol resides on top of the IP protocol. It is a stateful protocol and has built-in functions to see that the data was received properly by the other end host. The main goals of the TCP protocol are to see that data is reliably received and sent, that the data is transported between the Internet layer and Application layer correctly, that the packet data reaches the proper program in the application layer, and that the data reaches the program in the right order. All of this is possible through the TCP headers of the packet.

The TCP protocol looks at data as a continuous data stream with a start and a stop signal. A new stream is opened with what is called the three-way handshake in TCP, which starts with one packet sent with the SYN bit set. The other end then either answers with SYN/ACK or RST to let the client know if the connection was accepted or denied, respectively. If the client receives a SYN/ACK packet, it once again replies, this time with an ACK packet. At this point, the whole connection is established and data can be sent. During this initial handshake, all of the specific options that will be used throughout the rest of the TCP connection are also negotiated, such as ECN, SACK, etcetera.

While the datastream is alive, we have further mechanisms to see that the packets are actually received properly by the other end. This is the reliability part of TCP. This is done in a simple way, using a Sequence number in the packet. Every time we send a packet, we give a new value to the Sequence number, and when the other end receives the packet, it sends an ACK packet back to the data sender. The ACK packet acknowledges that the packet was received properly. The sequence number also sees to it that the packet is inserted into the data stream in a good order.

When the connection is to be closed, this is done by sending a FIN packet from either end-point. The other end then responds by sending a FIN/ACK packet. The FIN sending end can then no longer send any data, but the other end-point can still finish sending data. Once the second end-point wishes to close the connection totally, it sends a FIN packet back to the originally closing end-point, and the other end-point replies with a FIN/ACK packet. Once this whole procedure is done, the connection is torn down properly.

As you will also later see, the TCP headers contain a checksum as well. The checksum consists of a simple hash of the packet. With this hash, we can with rather high accuracy see if a packet has been corrupted in any way during transit between the hosts.

TCP headers

The TCP headers must be able to perform all of the tasks above. We have already explained when and where some of the headers are used, but there are still other areas that we haven't touched on in any depth. Below you see an image of the complete set of TCP headers. It is formatted in 32 bit words per row, as you can see.

Source port - bit 0 - 15. This is the source port of the packet. The source port was originally bound directly to a process on the sending system. Today, we use a hash between the IP addresses, and both the destination and source ports to achieve this uniqueness that we can bind to a single application or program.

Destination port - bit 16 - 31. This is the destination port of the TCP packet. Just as with the source port, this was originally bound directly to a process on the receiving system. Today, a hash is used instead, which allows us to have more open connections at the same time. When a packet is received, the destination and source ports are reversed in the reply back to the originally sending host, so that destination port is now source port, and source port is destination port.

Sequence Number - bit 32 - 63. The sequence number field is used to set a number on each TCP packet so that the TCP stream can be properly sequenced (i.e., the packets wind up in the correct order). The Sequence number is then used in the Acknowledgment Number field of the reply to acknowledge that the packet was properly received.

Acknowledgment Number - bit 64 - 95. This field is used when we acknowledge a specific packet a host has received. For example, we receive a packet with one Sequence number set, and if everything is okay with the packet, we reply with an ACK packet with the Acknowledgment number set to the next Sequence number we expect to receive, that is, the original Sequence number plus the amount of data in the packet.
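The relation between the Sequence number and the Acknowledgment number can be sketched in a few lines of Python. This is only an illustration with a made-up function name. It assumes the usual TCP rules that the Acknowledgment number names the next octet the receiver expects, that the SYN and FIN flags each use up one sequence number, and that the 32 bit field wraps around.

```python
def next_ack(sequence_number, payload_octets, syn=False, fin=False):
    """Acknowledgment number a receiver would send back for one segment."""
    # SYN and FIN each count as one sequence number in addition to the data.
    consumed = payload_octets + int(syn) + int(fin)
    return (sequence_number + consumed) % 2**32   # the field is 32 bits wide


print(next_ack(1000, 100))             # 1100
print(next_ack(5000, 0, syn=True))     # 5001
print(next_ack(0xFFFFFFFF, 1))         # 0
```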

Data Offset - bit 96 - 99. This field indicates how long the TCP header is, and where the Data part of the packet actually starts. It is set with 4 bits, and measures the TCP header in 32 bit words. The header should always end at an even 32 bit boundary, even with different options set. This is possible thanks to the Padding field at the very end of the TCP header.

Reserved - bit 100 - 103. These bits are reserved for future usage. In RFC 793 this also included the CWR and ECE bits. According to RFC 793, bits 100-105 (i.e., this and the CWR and ECE fields) must be set to zero to be fully compliant. Later on, when we started introducing ECN, this caused a lot of trouble because a lot of Internet appliances such as firewalls and routers dropped packets with them set. This is still true as of writing this.

CWR - bit 104. This bit was added in RFC 3168 and is used by ECN. CWR stands for Congestion Window Reduced, and is used by the data sending part to inform the receiving part that the congestion window has been reduced. When the congestion window is reduced, we send less data per timeunit, to be able to cope with the total network load.

ECE - bit 105. This bit was also added with RFC 3168 and is used by ECN. ECE stands for ECN Echo. It is used by the TCP/IP stack on the receiver host to let the sending host know that it has received a CE packet. The same thing applies here as for the CWR bit: it was originally a part of the reserved field, and because of this, some networking appliances will simply drop the packet if these fields contain anything other than zeroes. This is actually still true for a lot of appliances, unfortunately.

URG - bit 106. This field tells us if we should use the Urgent Pointer field or not. If set to 0, do not use Urgent Pointer, if set to 1, do use Urgent pointer.

ACK - bit 107. This bit is set in a packet to indicate that this is in reply to another packet that we received, and that contained data. An Acknowledgment packet is always sent to indicate that we have actually received a packet, and that it contained no errors. If this bit is set, the original data sender will check the Acknowledgment Number to see which packet is actually acknowledged, and then dump it from the buffers.

PSH - bit 108. The PUSH flag is used to tell the TCP protocol on any intermediate hosts to send the data on to the actual user, including the TCP implementation on the receiving host. This will push all data through, regardless of where or how much of the TCP Window has been pushed through yet.

RST - bit 109. The RESET flag is set to tell the other end to tear down the TCP connection. This is done in a couple of different scenarios, the main reasons being that the connection has crashed for some reason, if the connection does not exist, or if the packet is wrong in some way.

SYN - bit 110. The SYN (or Synchronize sequence numbers) is used during the initial establishment of a connection. It is set in two instances of the connection, the initial packet that opens the connection, and the reply SYN/ACK packet. It should never be used outside of those instances.

FIN - bit 111. The FIN bit indicates that the host that sent the FIN bit has no more data to send. When the other end sees the FIN bit, it will reply with a FIN/ACK. Once this is done, the host that originally sent the FIN bit can no longer send any data. However, the other end can continue to send data until it is finished, and will then send a FIN packet back, and wait for the final FIN/ACK, after which the connection is sent to a CLOSED state.
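The eight flag bits from CWR down to FIN (bits 104 - 111) sit together in one octet of the TCP header, so they are easy to decode with a few bit masks. The following Python sketch shows one way of doing it. The table and function names are made up for this example.

```python
# Bit masks for the flag octet (bits 104-111 of the TCP header).
TCP_FLAGS = [
    (0x80, "CWR"), (0x40, "ECE"), (0x20, "URG"), (0x10, "ACK"),
    (0x08, "PSH"), (0x04, "RST"), (0x02, "SYN"), (0x01, "FIN"),
]


def decode_flags(flag_octet):
    """Return the names of all flags that are set in the flag octet."""
    return [name for mask, name in TCP_FLAGS if flag_octet & mask]


print(decode_flags(0x02))   # ['SYN']
print(decode_flags(0x12))   # ['ACK', 'SYN']
print(decode_flags(0x11))   # ['ACK', 'FIN']
```

The three samples correspond to the opening SYN packet, the SYN/ACK reply and a FIN/ACK packet as described above.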

Window - bit 112 - 127. The Window field is used by the receiving host to tell the sender how much data the receiver permits at the moment. This is done by sending an ACK back, which contains the Sequence number that we want to acknowledge, and the Window field then contains the maximum accepted sequence numbers that the sending host can use before it receives the next ACK packet. The next ACK packet will update the accepted Window which the sender may use.

Checksum - bit 128 - 143. This field contains the checksum of the whole TCP packet, header and data. It is a one's complement of the one's complement sum of each 16 bit word in the header and data. If the packet does not end on a 16 bit boundary, the additional bits are set to zero. While the checksum is calculated, the checksum field is set to zero. The checksum also covers a 96 bit pseudoheader containing the Destination-, Source-address, protocol, and TCP length. This is for extra security.
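To show how the pseudoheader enters the calculation, here is a minimal Python sketch. It assumes the pseudoheader layout from RFC 793 (source address, destination address, a zero octet, the protocol number 6 and the TCP length). The function names and the addresses, which come from the range reserved for documentation, are only illustrative.

```python
import ipaddress
import struct


def internet_checksum(data):
    """One's complement of the one's complement sum of all 16 bit words."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a 16 bit boundary
    total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
    while total >> 16:                        # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum(source, destination, segment):
    """Checksum over the 96 bit pseudoheader followed by the TCP header and data."""
    pseudoheader = (
        ipaddress.IPv4Address(source).packed
        + ipaddress.IPv4Address(destination).packed
        + struct.pack("!BBH", 0, 6, len(segment))   # zero, protocol 6 (TCP), length
    )
    return internet_checksum(pseudoheader + segment)


# A bare 20 octet SYN header with the checksum field (octets 16-17) set to zero.
syn = struct.pack("!HHIIBBHHH", 12345, 80, 1, 0, 5 << 4, 0x02, 65535, 0, 0)
checksum = tcp_checksum("192.0.2.1", "192.0.2.2", syn)
syn = syn[:16] + struct.pack("!H", checksum) + syn[18:]
print(tcp_checksum("192.0.2.1", "192.0.2.2", syn))   # 0
```

Running the same calculation over a packet that already carries its checksum gives zero when nothing has been corrupted, which is how a receiver would verify it.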

Urgent Pointer - bit 144 - 159. This is a pointer that points to the end of the data which is considered urgent. If the connection has important data that should be processed as soon as possible by the receiving end, the sender can set the URG flag and set the Urgent pointer to indicate where the urgent data ends.

Options - bit 160 - **. The Options field is a variable length field and contains optional headers that we may want to use. Basically, each option contains up to 3 subfields. An initial field tells us which option is used, a second field tells us the length of the option, and then we have the actual option data. A complete listing of all the TCP Options can be found in TCP options.

Padding - bit **. The padding field pads the TCP header until the whole header ends at a 32-bit boundary. This ensures that the data part of the packet begins on a 32-bit boundary, and no data is lost in the packet. The padding always consists of only zeros.
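The Data Offset, Options and Padding fields work together, which the following Python sketch tries to illustrate. It walks through the option area of a TCP header, assuming the option layout from RFC 793: one octet for the kind of option, one octet for its length, and then the option data, with kind 0 (end of option list) and kind 1 (no operation) being single octets. The function name and the sample header are made up for this example.

```python
def parse_tcp_options(header):
    """Return a list of (kind, data) tuples found in the TCP Options field."""
    data_offset = header[12] >> 4              # header length in 32 bit words
    options = header[20:data_offset * 4]       # everything after the fixed 20 octets
    found = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                          # end of option list, the rest is padding
            break
        if kind == 1:                          # no operation, a single octet
            i += 1
            continue
        if i + 1 >= len(options):
            raise ValueError("option is cut short")
        length = options[i + 1]                # counts the kind and length octets too
        if length < 2 or i + length > len(options):
            raise ValueError("bad option length")
        found.append((kind, options[i + 2:i + length]))
        i += length
    return found


# Hypothetical header: Data Offset of 6 words and one Maximum Segment Size option.
header = bytes(12) + bytes([6 << 4]) + bytes(7) + bytes([2, 4, 0x05, 0xB4])
print(parse_tcp_options(header))   # [(2, b'\x05\xb4')]
```

Option kind 2 is the Maximum Segment Size option, and the two data octets 0x05B4 correspond to 1460 octets.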

UDP characteristics

The User Datagram Protocol (UDP) is a very basic and simple protocol on top of the IP protocol. It was developed to allow for very simple data transmission without any error detection of any kind, and it is stateless. However, it is very well fit for query/response kinds of applications, such as DNS, since we know that unless we get a reply from the DNS server, the query was lost somewhere. Sometimes it may also be worth using the UDP protocol instead of TCP, such as when we want only error/loss detection but don't care about sequencing of the packets. This removes some overhead that comes from the TCP protocol. We may also do it the other way around and make our own protocol on top of UDP that only contains sequencing, but no error or loss detection.

The UDP protocol is specified in RFC 768 - User Datagram Protocol. It is a very short and brief RFC, which fits a simple protocol like this very well.

UDP headers

The UDP header can be said to contain a very basic and simplified TCP header. It contains destination-, source-ports, packet length and a checksum as seen in the image below.

Source port - bit 0-15. This is the source port of the packet, describing where a reply packet should be sent. This can actually be set to zero if it doesn't apply. For example, sometimes we don't require a reply packet, and the packet can then be set to source port zero. In most implementations, it is set to some port number.

Destination port - bit 16-31. The destination port of the packet. This is required for all packets, as opposed to the source port of a packet.

Length - bit 32-47. The length field specifies the length of the whole packet in octets, including header and data portions. The shortest possible packet can be 8 octets long.

Checksum - bit 48-63. The checksum is the same kind of checksum as used in the TCP header, except that it contains a different set of data. In other words, it is a one's complement of the one's complement sum of parts of the IP header, the whole UDP header and the UDP data, padded with zeroes at the end when necessary.
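Since the whole UDP header is just four 16 bit fields, it can be built and taken apart in a couple of lines. The Python sketch below does this with the struct module. The function names are made up, and the checksum is simply left at zero, which RFC 768 allows as a way of saying that no checksum was generated.

```python
import struct


def build_udp(source_port, destination_port, payload=b""):
    """Build a UDP packet with the checksum field left at zero (not computed)."""
    length = 8 + len(payload)                 # header is 8 octets, then the data
    header = struct.pack("!HHHH", source_port, destination_port, length, 0)
    return header + payload


def parse_udp(packet):
    """Return (source_port, destination_port, length, checksum, payload)."""
    source_port, destination_port, length, checksum = struct.unpack("!HHHH", packet[:8])
    return source_port, destination_port, length, checksum, packet[8:length]


packet = build_udp(0, 53, b"hi")              # source port zero: no reply wanted
print(parse_udp(packet))                      # (0, 53, 10, 0, b'hi')
```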

ICMP characteristics

ICMP messages are used for a basic kind of error reporting from host to host, or host to gateway. Between gateways, a protocol called Gateway to Gateway protocol (GGP) should normally be used for error reporting. As we have already discussed, the IP protocol is not designed for perfect error handling, but ICMP messages solve some parts of these problems. The big problem from one standpoint is that the headers of the ICMP messages are rather complicated, and differ a little bit from message to message. However, this will not be a big problem from a filtering standpoint most of the time.

The basic form is that the message contains the standard IP header, type, code and a checksum. All ICMP messages contain these fields. The type specifies what kind of error or reply message this packet is, such as destination unreachable, echo, echo reply, or redirect message. The code field specifies more information, if necessary. If the packet is of type destination unreachable, there are several possible values for this code field such as network unreachable, host unreachable, or port unreachable. The checksum is simply a checksum for the whole packet.

As you may have noticed, I mentioned the IP header explicitly for the ICMP packet. This was done since the actual IP header is an integral part of the ICMP packet, and the ICMP protocol lives on the same level as the IP protocol in a sense. ICMP does use the IP protocol as if it were a higher level protocol, but at the same time not. ICMP is an integral part of IP, and ICMP must be implemented in every IP implementation.

ICMP headers

As already explained, the headers differ a little bit from ICMP type to ICMP type. Most of the ICMP types are possible to group by their headers. Because of this, we will discuss the basic header form first, and then look at the specifics for each group of types that should be discussed.

All packets contain some basic values from the IP headers discussed previously in this chapter. The headers have previously been discussed at some length, so this is just a short listing of the headers, with a few notes about them.

● Version - This should always be set to 4.

● Internet Header Length - The length of the header in 32 bit words.

● Type of Service - See above. This should be set to 0, as this is the only legit setting according to RFC 792 - Internet Control Message Protocol.

● Total Length - Total length of the header and data portion of the packet, counted in octets.

● Identification, Flags and Fragment offsets - Ripped from the IP protocol.

● Time To Live - How many hops this packet will survive.

● Protocol - The protocol number of the next level protocol, which should always be 1 for ICMP.

● Header Checksum - See the IP explanation.

● Source Address - The source address from whom the packet was sent. This is not entirely true, since the packet can have another source address, than that which is located on the machine in question. The ICMP types that can have this effect will be noted if so.

● Destination Address - The destination address of the packet

There are also a couple of new headers that are used by all of the ICMP types. The new headers are as follows, this time with a few more notes about them:

Type - The type field contains the ICMP type of the packet. This is always different from ICMP type to type. For example ICMP Destination Unreachable packets will have a type 3 set to it. For a complete listing of the different ICMP types, see the ICMP types appendix. This field contains 8 bits total.

Code - All ICMP types can contain different codes as well. Some types only have a single code, while others have several codes that they can use. For example, the ICMP Destination Unreachable (type 3) can have at least code 0, 1, 2, 3, 4 or 5 set. Each code has a different meaning in that context then. For a complete listing of the different codes, see the ICMP types appendix. This field is 8 bits in length, total. We will discuss the different codes a little bit more in detail for each type later on in this section.

Checksum - The Checksum is a 16 bit field containing a one's complement of the one's complement sum of the ICMP message, starting with the ICMP type and down. While calculating the checksum, the checksum field should be set to zero.

At this point the headers for the different packets start to look different also. We will describe the most common ICMP Types one by one, with a brief discussion of its headers and different codes.

ICMP Echo Request/Reply

I have chosen to speak about both the reply and the request of the ICMP echo packets here since they are so closely related to each other. The first difference is that the echo request is type 8, while echo reply is type 0. When a host receives a type 8, it replies with a type 0.

When the reply is sent, the source and destination addresses switch places as well. After both of those changes have been made, the checksum is recomputed, and the reply is sent. There is only one code for both of these types; it is always set to 0.

Identifier - This is set in the request packet, and echoed back in the reply, to be able to keep different ping requests and replies together.

Sequence number - The sequence number for each host, generally this starts at 1 and is incremented by 1 for each packet.

The packets also contain a data part. Per default, the data part is generally empty, but it can contain a user-specified amount of random data.
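Putting the type, code, checksum, identifier, sequence number and data part together, an echo request and its matching reply could be sketched in Python like this. The function names are invented for the example, and only the ICMP part of the packet is built. The IP header, where the source and destination addresses switch places, is left out.

```python
import struct


def internet_checksum(data):
    """One's complement of the one's complement sum of all 16 bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo(icmp_type, identifier, sequence, payload=b""):
    """Build an ICMP echo message: type 8 is a request, type 0 is a reply."""
    # The checksum field is zero while the checksum is being calculated.
    unchecked = struct.pack("!BBHHH", icmp_type, 0, 0, identifier, sequence) + payload
    checksum = internet_checksum(unchecked)
    return struct.pack("!BBHHH", icmp_type, 0, checksum, identifier, sequence) + payload


def echo_reply_for(request):
    """Answer a type 8 message with a type 0 message carrying the same values."""
    icmp_type, _code, _checksum, identifier, sequence = struct.unpack("!BBHHH", request[:8])
    if icmp_type != 8:
        raise ValueError("not an echo request")
    return build_echo(0, identifier, sequence, request[8:])


request = build_echo(8, 0x1234, 1, b"abc")
reply = echo_reply_for(request)
print(reply[0], reply[4:] == request[4:])   # 0 True
```

The identifier, sequence number and data are echoed back untouched. Only the type and the recomputed checksum differ between the two messages.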

ICMP Destination Unreachable

The first three fields seen in the image are the same as previously described. The Destination Unreachable type has 16 basic codes that can be used, as seen below in the list.

● Code 0 - Network unreachable - Tells you if a specific network is currently unreachable.

● Code 1 - Host unreachable - Tells you if a specific host is currently unreachable.

● Code 2 - Protocol unreachable - This code tells you if a specific protocol (tcp, udp, etc) can not be reached at the moment.

● Code 3 - Port unreachable - If a port (ssh, http, ftp-data, etc) is not reachable, you will get this message.

● Code 4 - Fragmentation needed and DF set - If a packet needs to be fragmented to be delivered, but the Do not fragment bit is set in the packet, the gateway will return this message.

● Code 5 - Source route failed - If a source route failed for some reason, this message is returned.

● Code 6 - Destination network unknown - If there is no route to a specific network, this message is returned.

● Code 7 - Destination host unknown - If there is no route to a specific host, this message is returned.

● Code 8 - Source host isolated (obsolete) - If a host is isolated, this message should be returned. This code is obsoleted today.

● Code 9 - Destination network administratively prohibited - If a network was blocked at a gateway and your packet was unable to reach it because of this, you should get this ICMP code back.

● Code 10 - Destination host administratively prohibited - If you were unable to reach a host because it was administratively prohibited (e.g., routing administration), you will get this message back.

● Code 11 - Network unreachable for TOS - If a network was unreachable because of a bad TOS setting in your packet, this code will be generated as a return packet.

● Code 12 - Host unreachable for TOS - If your packet was unable to reach a host because of the TOS of the packet, this is the message you get back.

● Code 13 - Communication administratively prohibited by filtering - If the packet was prohibited by some kind of filtering (e.g., firewalling), we get a code 13 back.

● Code 14 - Host precedence violation - This is sent by the first hop router to notify a connected host that the used precedence is not permitted for a specific destination/source combination.

● Code 15 - Precedence cutoff in effect - The first hop router may send this message to a host if the datagram it received had a too low precedence level set in it.

On top of this, it also contains a small "data" part, which should be the whole Internet header (IP header) and 64 bits of the original IP datagram. If the next level protocol contains any ports, etc, it is assumed that the ports should be available in the extra 64 bits.
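The way the "data" part lets us find the ports of the original packet can be sketched as follows. This Python example assumes the message layout from RFC 792, where type, code and checksum are followed by 4 unused octets and then the original IP header plus 64 bits of the original datagram. The names and the sample packet are made up for illustration.

```python
import struct

# Index in the list is the ICMP code, as in the list of codes above.
UNREACHABLE_CODES = [
    "Network unreachable", "Host unreachable", "Protocol unreachable",
    "Port unreachable", "Fragmentation needed and DF set", "Source route failed",
    "Destination network unknown", "Destination host unknown",
    "Source host isolated", "Destination network administratively prohibited",
    "Destination host administratively prohibited", "Network unreachable for TOS",
    "Host unreachable for TOS", "Communication administratively prohibited by filtering",
    "Host precedence violation", "Precedence cutoff in effect",
]


def parse_unreachable(message):
    """Return (code name, original protocol, (source port, destination port) or None)."""
    icmp_type, code = message[0], message[1]
    if icmp_type != 3:
        raise ValueError("not a Destination Unreachable message")
    original = message[8:]                        # skip type, code, checksum, unused
    header_len = (original[0] & 0x0F) * 4         # IHL of the original IP header
    protocol = original[9]                        # Protocol field of the original header
    ports = None
    if protocol in (6, 17) and len(original) >= header_len + 4:   # TCP or UDP
        ports = struct.unpack("!HH", original[header_len:header_len + 4])
    name = UNREACHABLE_CODES[code] if code < len(UNREACHABLE_CODES) else "unknown code"
    return name, protocol, ports


# Hypothetical message: a UDP packet to port 53 was answered with code 3.
inner = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, 17, 0,
                    bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
udp = struct.pack("!HHHH", 40000, 53, 8, 0)
print(parse_unreachable(bytes([3, 3, 0, 0, 0, 0, 0, 0]) + inner + udp))
# ('Port unreachable', 17, (40000, 53))
```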

Source Quench

A source quench packet can be sent to tell the originating source of a packet or stream of packets to slow down when continuing to send data. Note that a gateway or destination host that the packets traverse can also be quiet and silently discard the packets, instead of sending any source quench packets.

This packet contains no extra header except the data portion, which contains the internet header plus 64 bits of the original datagram. This is used to match the source quench message to the correct process, which is currently sending data through the gateway or to the destination host.

All source quench packets have their ICMP types set to 4. They have no codes except 0.

Note Today, there are a couple of new possible ways of notifying the sending and receiving host that a gateway or destination host is overloaded. One way for example is the ECN (Explicit Congestion Notification) system.


ICMP Redirect

The ICMP Redirect type is sent in a single case. Consider this: you have a local network with several clients and hosts on it, and two gateways. One is a gateway to a second network, and the other is the default gateway to the rest of the Internet. Now consider that one of the hosts on the local network has no route set to the second network, but it has the default gateway set. It sends a packet for the second network to the default gateway, which of course knows about the second network. The default gateway can deduce that it is faster to send the packet directly to the other gateway, since the packet will enter and leave the default gateway on the same interface. The default gateway will hence send out a single ICMP Redirect packet to the host, telling it about the real gateway, and then send the packet on to that gateway. The host will now know about the closest gateway, and hopefully use it in the future.

The main header of the Redirect type is the Gateway Internet Address field. This field tells the host about the proper gateway, which should really be used. The packet also contains the IP header of the original packet, and the first 64 bits of data in the original packet, which are used to connect it to the proper process sending the data.

The Redirect type has 4 different codes as well, these are the following.

• Code 0 - Redirect for network - Only used for redirects for a whole network (e.g., the example above).

• Code 1 - Redirect for host - Only used for redirects of a specific host (e.g., a host route).

• Code 2 - Redirect for TOS and network - Only used for redirects of a specific Type of Service and to a whole network. Used as code 0, but also based on the TOS.

• Code 3 - Redirect for TOS and host - Only used for redirects of a specific Type of Service and to a specific host. Used as code 1, but also based on the TOS in other words.

TTL equals 0

The TTL equals 0 ICMP type is also known as the Time Exceeded Message. It uses type 11 and has 2 ICMP codes available. If the TTL field reaches 0 during transit through a gateway or during fragment reassembly on the destination host, the packet must be discarded. To notify the sending host of this problem, we can send a TTL equals 0 ICMP packet. The sender can then raise the TTL of outgoing packets to this destination if necessary.

Beyond the basic ICMP headers, the packet only contains the extra data portion. The data field contains the Internet header plus 64 bits of the data of the original IP packet, so that the other end may match the packet to the proper process. As previously mentioned, the TTL equals 0 type can have two codes.

• Code 0 - TTL equals 0 during transit - This is sent to the sending host if the original packet TTL reached 0 when it was forwarded by a gateway.

• Code 1 - TTL equals 0 during reassembly - This is sent if the original packet was fragmented, and TTL reached 0 during reassembly of the fragments. This code should only be sent from the destination host.

Parameter problem

The parameter problem ICMP type uses type 12 and has 2 codes that it uses as well. Parameter problem messages are used to tell the sending host that the gateway or receiving host had problems understanding parts of the IP headers, for example because they contained errors or because some required options were missing.

The parameter problem type contains one special header, which is a pointer to the field that caused the error in the original packet, if the code is 0 that is. The following codes are available:

• Code 0 - IP header bad (catchall error) - This is a catchall error message as discussed just above. Together with the pointer, this code is used to point to which part of the IP header contained an error.

• Code 1 - Required options missing - If an IP option that is required is missing, this code is used to tell about it.

Timestamp request/reply

The timestamp type is obsolete these days, but we bring it up briefly here. Both the reply and the request have a single code (0). The request is type 13 while the reply is type 14. The timestamp packets contain three 32-bit timestamps counting the milliseconds since midnight UT (Universal Time).

The first timestamp is the Originate timestamp, which contains the last time the sender touched the packet. The receive timestamp is the time that the echoing host first touched the packet and the transmit timestamp is the last timestamp set just previous to sending the packet.

Each timestamp message also contains the same identifiers and sequence numbers as the ICMP echo packets.
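As a small worked example of the arithmetic, the Python sketch below computes the milliseconds since midnight UT and packs a timestamp request. The function names are made up for this example. I also assume that the sender leaves the receive and transmit timestamps at 0, and that the usual 16-bit one's complement ICMP checksum is used.

```python
import struct
from datetime import datetime, timezone


def ms_since_midnight_ut(now: datetime) -> int:
    """Milliseconds since midnight UT, the unit used by all three timestamps."""
    now = now.astimezone(timezone.utc)
    seconds = (now.hour * 60 + now.minute) * 60 + now.second
    return seconds * 1000 + now.microsecond // 1000


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement sum over the data (the usual ICMP checksum)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_timestamp_request(ident: int, seq: int, originate_ms: int) -> bytes:
    """Type 13, code 0, then identifier, sequence number and 3 timestamps."""
    # Assumption: receive and transmit timestamps are 0 in the request.
    body = struct.pack("!HHIII", ident, seq, originate_ms, 0, 0)
    checksum = internet_checksum(struct.pack("!BBH", 13, 0, 0) + body)
    return struct.pack("!BBH", 13, 0, checksum) + body
```

One hour and half a second past midnight UT becomes (1 * 3600 + 0) * 1000 + 500 = 3600500 in the Originate timestamp.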

Information request/reply

The information request and reply types are obsolete since there are protocols on top of the IP protocol that can now take care of this when necessary (DHCP, etc). The information request generates a reply from any answering host on the network that we are attached to.

The host that wishes to receive information creates a packet with the source address set to the network we are attached to, and the destination network set to 0. The reply will contain information about our numbers (netmask and IP address).

The information request is run through ICMP type 15 while the reply is sent via type 16.

SCTP Characteristics

Stream Control Transmission Protocol (SCTP) is a relatively new protocol in the game, but since it is growing in usage and complements the TCP and UDP protocols, I have chosen to add this section about it. It has an even higher reliability than TCP, and at the same time a lower overhead from protocol headers.

SCTP has a couple of very interesting features. For those who wish to learn more about this, read the RFC 3286 - An Introduction to the Stream Control Transmission Protocol and RFC 2960 - Stream Control Transmission Protocol documents. The first document is an introduction to SCTP and should be very interesting to people who are still in need of more information. The second document is the actual specification for the protocol, which might be less interesting unless you are developing for the protocol or are really interested.

The protocol was originally developed for Telephony over IP, or Voice over IP (VoIP), and has some very interesting attributes due to this. Industry grade VoIP requires very high reliability for one, and this means that a lot of resilience has to be built into the system to handle different kinds of problems. The following is a list of the basic features of SCTP.

• Unicast with Multicast properties. This means it is a point-to-point protocol but with the ability to use several addresses at the same end host. It can in other words use different paths to reach the end host. TCP in comparison breaks if the transport path breaks, unless the IP protocol corrects it.

• Reliable transmission. It uses checksums and SACK to detect corrupted, damaged, discarded, duplicated and reordered data. It can then retransmit data as necessary. This is pretty much the same as TCP, but SCTP is more resilient when it comes to reordered data and allows for faster pickups.

• Message oriented. Each message can be framed and hence you can keep tabs on the structure and order of the datastream. TCP is byte oriented and all you get is a stream of bytes without any order between different data inside. You need an extra layer of abstraction in TCP in other words.

• Rate adaptive. It is developed to cooperate and co-exist with TCP for bandwidth. It scales up and down based on network load conditions just the same as TCP. It also has the same algorithms for slow starting when packets were lost. ECN is also supported.

• Multi-homing. As previously mentioned, it is able to set up different end nodes directly in the protocol, and hence doesn't have to rely on the IP layer for resilience.

• Multi-streaming. This allows for multiple simultaneous streams inside the same stream. Hence the name Stream Control Transmission Protocol. A single stream can for example be opened to download a single webpage, and all the images and html documents can then be downloaded within the same stream simultaneously. Or why not a database protocol which can create a separate control stream and then use several streams to receive the output from the different queries simultaneously.

• Initiation. Connections are initiated with 4 packets, where packets 3 and 4 can be used to send data. The equivalent of syncookies is implemented by default to avoid DoS attacks. INIT collision resolution is used to avoid several simultaneous SCTP connections.

This list could be made even longer, but I will not. Most of this information is gathered from the RFC 3286 - An Introduction to the Stream Control Transmission Protocol document, so read on there for more information.

Note: In SCTP we talk about chunks, not packets or windows anymore. An SCTP frame can contain several different chunks since the protocol is message oriented. A chunk can either be a control or a data chunk. Control chunks are used to control the session, and data chunks are used to send actual data.

Initialization and association

Each connection is initialized by creating an association between the two hosts that want to talk to each other. This association is initialized when a user needs it. It is later used as needed.

The initialization is done through 4 packets. First an INIT chunk is sent, which is replied to with an INIT ACK containing a cookie. After this, the connection can start sending data. However, two more packets are sent in the initialization: the cookie is replied to with a COOKIE ECHO chunk, which is finally replied to with a COOKIE ACK chunk.
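The four packet exchange can be sketched as a tiny state table, seen from the initiating side. The state names (CLOSED, COOKIE-WAIT, COOKIE-ECHOED, ESTABLISHED) are taken from the state diagram in RFC 2960 and are not used elsewhere in this text. The "associate" event and the function name are made up for this example.

```python
# Initiator side only: (state, event) -> (next state, chunk to send).
# "associate" is a hypothetical event meaning "the user asked for an association".
INITIATOR = {
    ("CLOSED", "associate"): ("COOKIE-WAIT", "INIT"),
    ("COOKIE-WAIT", "INIT ACK"): ("COOKIE-ECHOED", "COOKIE ECHO"),
    ("COOKIE-ECHOED", "COOKIE ACK"): ("ESTABLISHED", None),
}


def run_initiator(events):
    """Feed events through the table and return (final state, chunks sent)."""
    state, sent = "CLOSED", []
    for event in events:
        state, chunk = INITIATOR[(state, event)]
        if chunk is not None:
            sent.append(chunk)
    return state, sent
```

Feeding it "associate", "INIT ACK" and "COOKIE ACK" in that order ends in ESTABLISHED, having sent an INIT and a COOKIE ECHO. Timers, retransmissions and the responding side are left out on purpose.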

Data sending and control session

SCTP can at this point send data. In SCTP there are control chunks and data chunks, as previously stated. Data is sent using DATA chunks, and DATA chunks are acknowledged by sending a SACK chunk. This works practically the same as a TCP SACK. SACK chunks are control chunks.

On top of this, there are some other control chunks that can be seen. HEARTBEAT and HEARTBEAT ACK chunks for one, and ERROR chunks for another. HEARTBEATs are used to keep the connection alive, and ERROR is used to inform of different problems or errors in the connection, such as invalid stream id's or missing mandatory parameters et cetera.

Shutdown and abort

The SCTP connection is finally closed by either an ABORT chunk or by a graceful SHUTDOWN chunk. SCTP doesn't have a half-closed state like TCP does. In other words, one side cannot continue sending data while the other end has closed its sending socket.

When the user/application wants to close the SCTP socket gracefully, it tells the protocol to SHUTDOWN. SCTP then sends all the data still in its buffers, and then sends a SHUTDOWN chunk. When the other end receives the SHUTDOWN, it will stop accepting data from the application and finish sending all the data. Once it has gotten all the SACKs for the data, it will send a SHUTDOWN ACK chunk, and once the closing side has received this chunk, it will finally reply with a SHUTDOWN COMPLETE chunk. The whole session is now closed.

Another way of closing a session is to ABORT it. This is an ungraceful way of removing an SCTP association. When a connecting party wants to remove an SCTP association instantaneously, it sends an ABORT chunk with all the right values set. All data in the buffers et cetera will be discarded and the association will then be removed. The receiving end will do the same after verifying the ABORT chunk.

SCTP Headers

This will be a very brief introduction to the SCTP headers. SCTP has a lot of different types of packets, and hence I will try to follow the RFCs, and how they depict the different headers, as closely as possible, starting with a general overview of the headers applicable to all SCTP packets.

SCTP Generic header format

This is a generic overview of how a SCTP packet is laid out. Basically, you have a common header first with information describing the whole packet, and the source and destination ports etc. See more below for information on the common header.

After the common header a variable number of chunks are sent, up to the maximum possible in the MTU. All chunks can be bundled except for INIT, INIT ACK and SHUTDOWN COMPLETE, which must not be bundled. DATA chunks may be broken down to fit inside the MTU of the packets.

SCTP Common and generic headers

Every SCTP packet contains the Common header as seen above. The header contains four different fields and is set for every SCTP packet.

Source port - bit 0-15. This field gives the source port of the packet, which port it was sent from. The same as for TCP and UDP source port.

Destination port - bit 16-31. This is the destination port of the packet, i.e., the port that the packet is going to. It is the same as for the TCP and UDP destination port.

Verification Tag - bit 32-63. The verification tag is used to verify that the packet comes from the correct sender. It is always set to the same value as the value received by the other peer in the Initiate Tag during the association initialization, with a few exceptions:

• An SCTP packet containing an INIT chunk must have the Verification tag set to 0.

• A SHUTDOWN COMPLETE chunk with the T-bit set must have the verification tag copied from the verification tag of the SHUTDOWN-ACK chunk.

• Packets containing ABORT chunk may have the verification tag set to the same verification tag as the packet causing the ABORT.

Checksum - bit 64-95. A checksum calculated for the whole SCTP packet based on the Adler-32 algorithm. Read RFC 2960 - Stream Control Transmission Protocol, appendix B for more information about this algorithm. Note that RFC 3309 later replaced Adler-32 with CRC-32c for this field.
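The four fields add up to 96 bits, or 12 bytes. As a minimal sketch, they could be unpacked in Python as follows. The class and function names are made up for this example, and no checksum verification is attempted.

```python
import struct
from typing import NamedTuple


class SctpCommonHeader(NamedTuple):
    source_port: int        # bit 0-15
    destination_port: int   # bit 16-31
    verification_tag: int   # bit 32-63
    checksum: int           # bit 64-95


def parse_common_header(packet: bytes) -> SctpCommonHeader:
    """Unpack the first 12 bytes of an SCTP packet (network byte order)."""
    if len(packet) < 12:
        raise ValueError("SCTP packet shorter than the common header")
    return SctpCommonHeader(*struct.unpack("!HHII", packet[:12]))
```

Everything after the first 12 bytes is the chunk area of the packet.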

All SCTP chunks have a special layout that they all adhere to, as can be seen above. This isn't an actual header, but rather a formalized way of how they look.

Type - bit 0-7. This field specifies the chunk type of the packet, for example whether it is an INIT or a SHUTDOWN chunk. Each chunk type has a specific number, and is specified in the image below. Here is a complete list of Chunk types:

Table 2-1. SCTP Types

Chunk Number Chunk Name
0 Payload Data (DATA)
1 Initiation (INIT)
2 Initiation Acknowledgement (INIT ACK)
3 Selective Acknowledgement (SACK)
4 Heartbeat Request (HEARTBEAT)
5 Heartbeat Acknowledgement (HEARTBEAT ACK)
6 Abort (ABORT)
7 Shutdown (SHUTDOWN)
8 Shutdown Acknowledgement (SHUTDOWN ACK)
9 Operation Error (ERROR)
10 State Cookie (COOKIE ECHO)
11 Cookie Acknowledgement (COOKIE ACK)
12 Reserved for Explicit Congestion Notification Echo (ECNE)
13 Reserved for Congestion Window Reduced (CWR)
14 Shutdown Complete (SHUTDOWN COMPLETE)
15-62 Reserved for IETF
63 IETF-defined chunk extensions
64-126 Reserved for IETF
127 IETF-defined chunk extensions
128-190 Reserved for IETF
191 IETF-defined chunk extensions
192-254 Reserved for IETF
255 IETF-defined chunk extensions

Chunk Flags - bit 8-15. The chunk flags are generally not used but are set up for future usage if nothing else. They are chunk specific flags or bits of information that might be needed for the other peer. According to specifications, flags are only used in DATA, ABORT and SHUTDOWN COMPLETE packets at this moment. This may change however.

Important! A lot of times when you read an RFC, you might run into some old proven problems. The RFC 2960 - Stream Control Transmission Protocol document is one example of this, where they specifically specify that the Chunk flags should always be set to 0 and ignored unless used for something. This is written all over the place, and it begs for problems in the future. If you do firewalling or routing, watch out very carefully for this, since specifications for fields like this may change in the future and hence break at your firewall without any legit reason. This happened before with the implementation of ECN in the IP headers for example. See more in the IP headers section of this chapter.

Chunk Length - bit 16-31. This is the chunk length calculated in bytes. It includes all headers, including the chunk type, chunk flags, chunk length and chunk value. If there is no chunk value, the chunk length will be set to 4 (bytes).

Chunk Value - bit 32-n. This is specific to each chunk and may contain more flags and data pertaining to the chunk type. Sometimes it might be empty, in which case the chunk length will be set to 4.
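A rough sketch of how the chunks following the common header could be walked, using the Type, Chunk Flags and Chunk Length fields just described. One thing not mentioned above is assumed here, taken from RFC 2960: each chunk is padded to a 4 byte boundary and the padding is not counted in the Chunk Length. The function name is made up for this example.

```python
import struct


def iter_chunks(payload: bytes):
    """Yield (type, flags, value) for each chunk after the common header."""
    offset = 0
    while offset + 4 <= len(payload):
        ctype, flags, length = struct.unpack("!BBH", payload[offset:offset + 4])
        # A chunk with no value still has length 4 (type, flags, length).
        if length < 4 or offset + length > len(payload):
            raise ValueError("bad chunk length")
        yield ctype, flags, payload[offset + 4:offset + length]
        # Assumption from RFC 2960: the next chunk starts on a 4 byte
        # boundary, and the padding is not part of Chunk Length.
        offset += (length + 3) & ~3
```

A chunk with length 5 thus occupies 8 bytes in the packet, and a chunk with an empty value occupies exactly 4.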


SCTP ABORT chunk

The ABORT chunk is used to abort an association as previously described in the Shutdown and abort section of this chapter. ABORT is issued upon unrecoverable errors in the association such as bad headers or data.

Type - bit 0-7. Always set to 6 for this chunk type.

Reserved - bit 8-14. Reserved for future chunk flags but not used as of writing this. See the SCTP Common and generic headers for more information about the chunk flags field.

T-bit - bit 15. If this bit is set to 0, the sender had a TCB associated with this packet that it has destroyed. If the sender had no TCB the T-bit should be set to 1.

Length - bit 16-31. Sets the length of the chunk in bytes including error causes.


SCTP COOKIE ACK chunk

The COOKIE ACK chunk is used during the initialization of the connection and never anywhere else in the connection. It must precede all DATA and SACK chunks but may be sent in the same packet as the first of these chunks.

Type - bit 0-7. Always set to 11 for this type.

Chunk flags - bit 8-15. Not used so far. Should always be set to 0 according to RFC 2960 - Stream Control Transmission Protocol. You should always watch out for this kind of specific behaviour stated by RFC's since it might change in the future, and hence break your firewalls etc. Just the same as happened with IP and ECN. See the SCTP Common and generic headers section for more information.

Length - bit 16-31. Should always be 4 (bytes) for this chunk.


SCTP COOKIE ECHO chunk

The COOKIE ECHO chunk is used during the initialization of the SCTP connection by the initiating party to reply to the cookie sent by the responding party in the State cookie field in the INIT ACK packet. It may be sent together with DATA chunks in the same packet, but must precede the DATA chunks in such case.

Type - bit 0-7. The chunk type is always set to 10 for this chunk.

Chunk flags - bit 8-15. This field is not used today. The RFC specifies that the flags should always be set to 0, but this might cause trouble as can be seen in the SCTP Common and generic headers section above, specifically the Chunk flags explanation.

Length - bit 16-31. Specifies the length of the chunk, including type, chunk flags, length and cookie fields in bytes.

Cookie - bit 32-n. This field contains the cookie as sent in the previous INIT ACK chunk. It must be exactly the same as the cookie sent by the responding party for the other end to actually open the connection. The RFC 2960 - Stream Control Transmission Protocol specifies that the cookie should be as small as possible to ensure interoperability, which is very vague and doesn't say much.


SCTP DATA chunk

DATA chunks are used to send actual data through the stream and have rather complex headers in some ways, but not really worse than TCP headers in general. Each DATA chunk may be part of a different stream, since each SCTP connection can handle several different streams.

Type - bit 0-7. The Type field should always be set to 0 for DATA chunks.

Reserved - bit 8-12. Not used today. Might be applicable for change. See SCTP Common and generic headers for more information.

U-bit - bit 13. The U-bit is used to indicate if this is an unordered DATA chunk. If it is, the receiving host must ignore the Stream Sequence Number and send the chunk on to the upper layer without delay and without trying to re-order the DATA chunks.

B-bit - bit 14. The B-bit is used to indicate the beginning of a fragmented DATA chunk. If this bit is set and the E (ending) bit is not set, it indicates that this is the first fragment of a chunk that has been fragmented into several DATA chunks.

E-bit - bit 15. The E-bit is used to indicate the ending of a fragmented DATA chunk. If this flag is set on a chunk, it signals to the SCTP receiver that it can start reassembling the fragments and pass them on to the upper layer. If a packet has both the B- and E-bits set to 0, it signals that the chunk is a middle part of a fragmented chunk. If both the B- and E-bits are set to 1, it signals that the packet is unfragmented and requires no reassembly et cetera.

Length - bit 16-31. The length of the whole DATA chunk calculated in bytes, including the chunk type field and on until the end of the chunk.

TSN - bit 32-63. The Transmission Sequence Number (TSN) is sent in the DATA chunk, and the receiving host uses the TSN to acknowledge that the chunk got through properly by replying with a SACK chunk. This is an overall value for the whole SCTP association.

Stream Identifier - bit 64-79. The Stream Identifier is sent along with the DATA chunk to identify which stream the DATA chunk is associated with. This is used since SCTP can transport several streams within a single association.

Stream Sequence Number - bit 80-95. This is the sequence number of the chunk for the specific stream identified by the Stream Identifier. This sequence number is specific for each stream identifier. If a chunk has been fragmented, the Stream Sequence Number must be the same for all fragments of the original chunk.

Payload Protocol Identifier - bit 96-127. This value is filled in by the upper layers, or applications using the SCTP protocol as a way to identify to each other the content of the DATA chunk. The field must always be sent, including in fragments since routers and firewalls, et cetera, on the way might need the information. If the value was set to 0, the value was not set by the upper layers.

User data - bit 128-n. This is the actual data that the chunk is transporting. It can be of variable length, ending on an even octet. It is the data in the stream as specified by the stream sequence number n in the stream S.
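The DATA chunk layout above maps quite directly onto a fixed 16 byte header followed by the user data. Here is a minimal Python sketch of a decoder. The function name and the "first"/"middle"/"last" labels are made up for this example. Since bit 0 is the most significant bit of the Type field, bits 13, 14 and 15 become the masks 0x04, 0x02 and 0x01 of the flags byte.

```python
import struct

# (B-bit, E-bit) -> position of this chunk within a fragmented message.
FRAGMENT_POSITION = {
    (True, True): "unfragmented",
    (True, False): "first",
    (False, False): "middle",
    (False, True): "last",
}


def parse_data_chunk(chunk: bytes) -> dict:
    """Decode one DATA chunk, starting at its Type field."""
    ctype, flags, length, tsn, stream_id, ssn, ppid = struct.unpack(
        "!BBHIHHI", chunk[:16]
    )
    if ctype != 0:
        raise ValueError("not a DATA chunk")
    begin, end = bool(flags & 0x02), bool(flags & 0x01)
    return {
        "unordered": bool(flags & 0x04),  # U-bit
        "fragment": FRAGMENT_POSITION[(begin, end)],
        "tsn": tsn,
        "stream_id": stream_id,
        "ssn": ssn,
        "ppid": ppid,
        # Length counts from the Type field, so any padding is cut off here.
        "user_data": chunk[16:length],
    }
```

A firewall or sniffer would typically only care about the flags, stream identifier and Payload Protocol Identifier, while reassembly also needs the TSN and Stream Sequence Number.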


SCTP ERROR chunk

The ERROR chunk is sent to inform the other peer of any problems within the current stream. Each ERROR chunk can contain one or more Error Causes, which are more specifically detailed in the RFC 2960 - Stream Control Transmission Protocol document. I will not go into further detail here than the basic ERROR chunk, since it would be too much information. The ERROR chunk is not fatal in and of itself, but rather details an error that has happened. It may however be used together with an ABORT chunk to inform the peer of the error before killing the connection.

Type - bit 0-7. This value is always set to 9 for ERROR chunks.

Chunk flags - bit 8-15. Not used today. Might be applicable for change. See SCTP Common and generic headers for more information.

Length - bit 16-31. Specifies the length of the chunk in bytes, including all the Error Causes.

Error causes - bit 32-n. Each ERROR chunk may contain one or more Error Causes, which notify the opposite peer of a problem with the connection. Each Error Cause follows a specific format, as described in the RFC 2960 - Stream Control Transmission Protocol document. We will not go into them here more than to say that they all contain a Cause Code, cause length and cause specific information field. The following Error Causes are possible:

Table 2-2. Error Causes

Cause Value Chunk Code
1 Invalid Stream Identifier
2 Missing Mandatory Parameter
3 Stale Cookie Error
4 Out of Resource
5 Unresolvable Address
6 Unrecognized Chunk Type
7 Invalid Mandatory Parameter
8 Unrecognized Parameters
9 No User Data
10 Cookie Received While Shutting Down
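Since every Error Cause carries a Cause Code, a cause length and cause specific information, the value part of an ERROR chunk can be walked with a simple loop. The sketch below assumes, following RFC 2960, that the Cause Code and cause length are 16 bits each, that the length covers both of those fields, and that each cause is padded to a 4 byte boundary. The function name is made up for this example.

```python
import struct

# Cause codes from Table 2-2.
ERROR_CAUSES = {
    1: "Invalid Stream Identifier",
    2: "Missing Mandatory Parameter",
    3: "Stale Cookie Error",
    4: "Out of Resource",
    5: "Unresolvable Address",
    6: "Unrecognized Chunk Type",
    7: "Invalid Mandatory Parameter",
    8: "Unrecognized Parameters",
    9: "No User Data",
    10: "Cookie Received While Shutting Down",
}


def parse_error_causes(value: bytes):
    """Return [(cause name, cause specific info), ...] from an ERROR chunk value."""
    causes, offset = [], 0
    while offset + 4 <= len(value):
        code, length = struct.unpack("!HH", value[offset:offset + 4])
        if length < 4:
            raise ValueError("bad cause length")
        info = value[offset + 4:offset + length]
        causes.append((ERROR_CAUSES.get(code, f"unknown ({code})"), info))
        offset += (length + 3) & ~3  # assumed padding to a 4 byte boundary
    return causes
```

The cause specific information is returned as raw bytes, since its layout differs from cause to cause.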


SCTP HEARTBEAT chunk

The HEARTBEAT chunk is sent by one of the peers to probe and find out if a specific SCTP endpoint address is up. This is sent to the different addresses that were negotiated during the initialization of the association to find out if they are all up.

Type - bit 0-7. The type is always set to 4 for HEARTBEAT chunks.

Chunk flags - bit 8-15. Not used today. Might be applicable for change. See SCTP Common and generic headers for more information.

Length - bit 16-31. The length of the whole chunk, including the Heartbeat Information TLV.

Heartbeat Information TLV - bit 32-n. This is a variable-length parameter as defined inside the RFC 2960 - Stream Control Transmission Protocol document. This is a mandatory parameter for the HEARTBEAT chunks that contains 3 fields, info type = 1, info length and a sender-specific Heartbeat Information parameter. The last field should be a sender-specific information field of some kind, for example a timestamp when the heartbeat was sent and a destination IP address. This is then returned in the HEARTBEAT ACK chunk.


SCTP HEARTBEAT ACK chunk

The HEARTBEAT ACK is used to acknowledge that a HEARTBEAT was received and that the connection is working properly. The chunk is always sent to the same IP address as the request was sent from.

Type - bit 0-7. Always set to 5 for HEARTBEAT ACK chunks.

Chunk flags - bit 8-15. Not used today. Might be applicable for change. See SCTP Common and generic headers for more information.

Chunk length - bit 16-31. The length of the HEARTBEAT ACK chunk including the Heartbeat Information TLV, calculated in bytes.

Heartbeat Information TLV - bit 32-n. This field must contain the Heartbeat Information parameter that was sent in the original HEARTBEAT chunk.


SCTP INIT chunk

The INIT chunk is used to initiate a new association with a destination host, and is the first chunk to be sent by the connecting host. The INIT chunk contains several mandatory fixed length parameters, and some optional variable length parameters. The fixed length mandatory parameters are already in the above headers, and are the Initiate Tag, Advertised Receiver Window Credit, Number of Outbound Streams, Number of Inbound Streams and the Initial TSN parameters. After this come a couple of optional parameters, which are listed in the optional parameters paragraph below.

Type - bit 0-7. The type field is always set to 1 for INIT chunks.

Chunk flags - bit 8-15. Not used today. Might be applicable for change. See SCTP Common and generic headers for more information.

Chunk Length - bit 16-31. The chunk length is the length of the whole chunk, including everything in the headers, including the optional parameters.

Initiate Tag - bit 32-63. The Initiate Tag is set within the INIT chunk and must be used by the receiver to acknowledge all packets henceforth, within the Verification Tag of the established association. The Initiate Tag may take any value except 0. If the value is 0 anyway, the receiver must react with an ABORT.

Advertised Receiver Window Credit (a_rwnd) - bit 64-95. This is the minimum receiving buffer that the sender of the INIT chunk will allocate for this association, in bytes. This can then be used by the receiver of the a_rwnd, to know how much data it can send out without being SACK'ed. This window should not be decreased, but it can be changed by sending a new a_rwnd in a SACK chunk.

Number of Outbound Streams - bit 96-111. This specifies the maximum number of outbound streams that the connecting host wishes to create to the receiving host. The value must not be 0, and if it is, the receiving host should ABORT the association immediately. There is no negotiation of the minimum number of outbound or inbound streams, it is simply set to the lowest that either host has set in the header.

Number of Inbound Streams - bit 112-127. Specifies the maximum number of inbound streams that the sending peer will allow the receiving host to create in this association. This must not be set to 0, or the receiving host should ABORT the connection. There is no negotiation of the minimum number of outbound or inbound streams, it is simply set to the lowest that either host has set in the header.

Initial TSN - bit 128-159. This value sets the initial Transmission Sequence Number (TSN) that the sender will use when sending data. The field may be set to the same value as the Initiate Tag.
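The fixed part of the INIT chunk is 20 bytes including the Type, Chunk flags and Chunk Length fields. As a minimal sketch, the following Python code unpacks it and applies the "must not be 0" rules described above. The names are made up for this example, and raising ValueError simply stands in for "the receiver should answer with an ABORT".

```python
import struct
from typing import NamedTuple


class InitFixed(NamedTuple):
    initiate_tag: int
    a_rwnd: int
    outbound_streams: int
    inbound_streams: int
    initial_tsn: int


def parse_init_fixed(chunk: bytes) -> InitFixed:
    """Unpack the mandatory fixed length fields of an INIT chunk."""
    ctype, _flags, _length = struct.unpack("!BBH", chunk[:4])
    if ctype != 1:
        raise ValueError("not an INIT chunk")
    fixed = InitFixed(*struct.unpack("!IIHHI", chunk[4:20]))
    # Stand-in for ABORT: the tag and both stream counts must not be 0.
    if fixed.initiate_tag == 0:
        raise ValueError("Initiate Tag is 0")
    if fixed.outbound_streams == 0 or fixed.inbound_streams == 0:
        raise ValueError("stream count is 0")
    return fixed


def usable_streams(our_outbound: int, peer_inbound: int) -> int:
    """No negotiation takes place: the lower of the two values is used."""
    return min(our_outbound, peer_inbound)
```

If we ask for 10 outbound streams and the peer only allows 5 inbound streams, 5 streams are used in that direction.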

On top of the above mandatory fixed length headers, there are also some optional variable length parameters that might be set. Normally at least one of the IPv4, IPv6 or Hostname parameters is set, but none of them needs to be set if the sender only has one address that can be reached, which is the address the chunk is coming from. Only one Hostname may be set, and if a Hostname is set, no IPv4 or IPv6 parameters may be set. Multiple IPv4 and IPv6 parameters may be set in the same INIT chunk. These parameters are used to set up which addresses may be used to connect to the other end of the association. This is a full list of all the parameters available in the INIT chunk:

Table 2-3. INIT Variable Parameters

Parameter Name Status Type Value
IPv4 Address Optional 5
IPv6 Address Optional 6
Cookie Preservative Optional 9
Host Name Address Optional 11
Supported Address Types Optional 12
Reserved for ECN Capable Optional 32768

Below we describe the three most common Parameters used in the INIT chunk.

The IPv4 parameter is used to send an IPv4 address in the INIT chunk. The IPv4 address can be used to send data through the association. Multiple IPv4 and IPv6 addresses can be specified for a single SCTP association.

Parameter Type - bit 0-15. This is always set to 5 for IPv4 address parameters.

Length - bit 16-31. This is always set to 8 for IPv4 address parameters.

IPv4 Address - bit 32-63. This is an IPv4 address of the sending endpoint.

The IPv6 parameter is used to send IPv6 addresses in the INIT chunk. This address can then be used to contact the sending endpoint with this association.

Type - bit 0-15. Always set to 6 for the IPv6 parameters.

Length - bit 16-31. Always set to 20 for IPv6 parameters.

IPv6 address - bit 32-159. This is an IPv6 address of the sending endpoint that can be used to connect to by the receiving endpoint.

The Hostname parameter is used to send a single hostname as an address. The receiving host must then look up the hostname and use any and/or all of the addresses it receives from there. If a hostname parameter is sent, no other IPv4, IPv6 or Hostname parameters may be sent.

Type - bit 0-15. This is always set to 11 for Hostname Parameters.

Length - bit 16-31. The length of the whole parameter, including type, length and hostname field. The Hostname field is variable length. The length is counted in bytes.

Hostname - bit 32-n. A variable length parameter containing a hostname. The hostname is resolved by the receiving end to get the addresses that can be used to contact the sending endpoint.
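The three address parameters above all share the same type and length layout, so they can be collected with one loop over the variable part of the INIT chunk. This is only a sketch under a couple of assumptions taken from RFC 2960 rather than from the text above: parameters are padded to a 4 byte boundary, and the hostname is terminated by a null byte. The function name is made up for this example.

```python
import ipaddress
import struct


def parse_address_params(params: bytes):
    """Collect the addresses listed in the variable part of an INIT chunk."""
    addresses, offset = [], 0
    while offset + 4 <= len(params):
        ptype, length = struct.unpack("!HH", params[offset:offset + 4])
        if length < 4:
            raise ValueError("bad parameter length")
        value = params[offset + 4:offset + length]
        if ptype == 5 and length == 8:       # IPv4 address parameter
            addresses.append(str(ipaddress.IPv4Address(value)))
        elif ptype == 6 and length == 20:    # IPv6 address parameter
            addresses.append(str(ipaddress.IPv6Address(value)))
        elif ptype == 11:                    # Hostname parameter
            # Assumption: the hostname is null terminated.
            addresses.append(value.rstrip(b"\x00").decode("ascii"))
        # Other parameter types are skipped in this sketch.
        offset += (length + 3) & ~3  # assumed padding to a 4 byte boundary
    return addresses
```

The sketch does not enforce the rule that a Hostname parameter excludes all other address parameters, and it does not resolve the hostname.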


SCTP INIT ACK chunk

The INIT ACK chunk is sent in response to an INIT chunk and contains basically the same headers, but with values from the recipient of the original INIT chunk. In addition, it has two extra variable length parameters, the State Cookie and the Unrecognized Parameter parameters.

Type - bit 0-7. This header is always set to 2 for INIT ACK chunks.

Chunk flags - bit 8-15. Not used today. Might be applicable for change. See SCTP Common and generic headers for more information.

Chunk Length - bit 16-31. The chunk length is the length of the whole chunk, including everything in the headers, and the optional parameters.

Initiate Tag - bit 32-63. The receiver of the Initiate Tag of the INIT ACK chunk must save this value and copy it into the Verification Tag field of every packet that it sends to the sender of the INIT ACK chunk. The Initiate Tag must not be 0, and if it is, the receiver of the INIT ACK chunk must close the connection with an ABORT.

Advertised Receiver Window Credit (a_rwnd) - bit 64-95. The dedicated buffers that the sender of this chunk has allocated for traffic, counted in bytes. The dedicated buffers should never be lowered to below this value.

Number of Outbound Streams - bit 96-111. How many outbound streams that the sending host wishes to create. Must not be 0, or the receiver of the INIT ACK should ABORT the association. There is no negotiation of the minimum number of outbound or inbound streams, it is simply set to the lowest that either host has set in the header.

Number of Inbound Streams - bit 112-127. How many inbound streams that the sending endpoint is willing to accept. Must not be 0, or the receiver of the INIT ACK should ABORT the association. There is no negotiation of the minimum number of outbound or inbound streams, it is simply set to the lowest that either host has set in the header.

Initial TSN - bit 128-159. This is set to the Initial Transmission Sequence Number (I-TSN) which will be used by the sending party in the association to start with.

After this point, the INIT ACK chunk continues with optional variable-length parameters. The parameters are the same as for the INIT chunk, with the exception of the addition of the State Cookie and the Unrecognized Parameters parameters, and the deletion of the Supported Address Types and Cookie Preservative parameters, which RFC 2960 only lists for the INIT chunk. The list, in other words, looks like this:

Table 2-4. INIT ACK Variable Parameters

Parameter Name Status Type Value
IPv4 Address Optional 5
IPv6 Address Optional 6
State Cookie Mandatory 7
Unrecognized Parameters Optional 8
Host Name Address Optional 11
Reserved for ECN Capable Optional 32768

The State Cookie is used in the INIT ACK to send a cookie to the other host, and until the receiving host has replied with a COOKIE ECHO chunk, the association is not guaranteed. This is to prevent basically the same kind of attack as a SYN attack in the TCP protocol.

Type - bit 0-15. Always set to 7 for all State Cookie parameters.

Length - bit 16-31. The size of the whole parameter, including the type, length and State Cookie field in bytes.

State Cookie - bit 32-n. This parameter contains a cookie of variable length. For a description on how this cookie is created, see the RFC 2960 - Stream Control Transmission Protocol document.


SCTP SACK chunk

The SACK chunk is used to tell the sender of DATA chunks which chunks have been received and where there have been gaps in the stream, based on the received TSNs. Basically, the SACK chunk acknowledges that it has received data up to a certain point (the Cumulative TSN Ack parameter), and then adds Gap Ack Blocks for all of the data that it has received after the Cumulative TSN Ack point. A SACK chunk must not be sent more than once for every DATA chunk that is received.

Type - bit 0-7. This header is always set to 3 for SACK chunks.

Chunk flags - bit 8-15. Not used today. Might be applicable for change. See SCTP Common and generic headers for more information.

Chunk Length - bit 16-31. The chunk length is the length of the whole chunk, including everything in the headers and all the parameters.

Cumulative TSN Ack - bit 32-63. This is the Cumulative TSN Ack parameter, which is used to acknowledge data. The DATA chunk receiver will use this field to tell the sending host that it has received all data up to this point of the association. After this point, all data that has not been specifically acknowledged by the Gap Ack Blocks will, basically, be considered unaccounted for.

Advertised Receiver Window Credit (a_rwnd) - bit 64-95. The a_rwnd field is basically the same as the a_rwnd in the INIT and INIT ACK chunks, but can be used to raise or lower the a_rwnd value. Please read more in the RFC 2960 - Stream Control Transmission Protocol document about this.

Number of Gap Ack Blocks - bit 96-111. The number of Gap Ack Blocks listed in this chunk. Each Gap Ack Block takes up 32 bits in the chunk.

Number of Duplicate TSNs - bit 112-127. The number of DATA chunks that have been duplicated. Each duplicated TSN is listed after the Gap Ack Blocks in the chunk, and each TSN takes 32 bits to send.

Gap Ack Block #1 Start - bit 128-143. This is the first Gap Ack Block in the SACK chunk. If there are no gaps in the received DATA chunk TSN numbers, there will be no Gap Ack Blocks at all. However, if DATA chunks are received out of order or some DATA chunks were lost during transit to the host, there will be gaps. The gaps that have been seen will be reported with Gap Ack Blocks. The Gap Ack Block start point is calculated by adding the Gap Ack Block Start parameter to the Cumulative TSN value. The calculated value is the start of the block.

Gap Ack Block #1 End - bit 144-159. This value reports the end of the first Gap Ack Block in the stream. All the DATA chunks with a TSN between the Gap Ack Block Start and the Gap Ack Block End have been received. The Gap Ack Block End value is added to the Cumulative TSN, just as the Start parameter, to get the actual last TSN of the block of chunks to be acknowledged.

Gap Ack Block #N Start - bits variable. For every Gap Ack Block counted in the Number of Gap Ack Blocks parameter, one Gap Ack Block is added, until the final N block. I.e., if Number of Gap Ack Blocks = 2, then there will be two Gap Ack Blocks in the SACK chunk. This is simply the last one, and it contains the same type of value as the Gap Ack Block #1 Start.

Gap Ack Block #N End - bits variable. Same as for the Gap Ack Block #N Start, but for the end of the block.

Duplicate TSN #1 - bits variable. These fields report a duplicate TSN, in which case we have already received a specific chunk, but receive the same TSN several times more. This can be caused by router glitches (retransmitting already sent data), by retransmission from the sending endpoint, or by a score of other possibilities. Each instance of a duplicate TSN should be reported once. For example, if 2 duplicate TSNs have been received after acknowledging the first one, each of these duplicate TSNs should be sent in the next SACK message that is being sent. If even more duplicate TSNs should appear after this second SACK is sent, the new duplicates should be added in the next SACK, and so on.

Duplicate TSN #X - bits variable. This is the last duplicate TSN parameter, containing the same type of information as the first parameter.
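
To make the SACK layout described above a little more concrete, here is a minimal Python sketch that unpacks a SACK chunk and turns the relative Gap Ack Block offsets into absolute TSNs. The function name parse_sack and the sample values are assumptions made purely for illustration; this is not code taken from any SCTP implementation, and it does no validation beyond checking the chunk type.

```python
import struct

def parse_sack(chunk: bytes) -> dict:
    """Unpack a SACK chunk laid out as described in the field list above."""
    # Type (8 bits), flags (8), length (16), Cumulative TSN Ack (32),
    # a_rwnd (32), number of Gap Ack Blocks (16), number of duplicates (16).
    ctype, flags, length, cum_tsn, a_rwnd, n_gaps, n_dups = struct.unpack_from(
        "!BBHIIHH", chunk, 0
    )
    if ctype != 3:
        raise ValueError("not a SACK chunk")
    offset = 16
    gap_blocks = []
    for _ in range(n_gaps):
        start, end = struct.unpack_from("!HH", chunk, offset)
        # Start and End are offsets that are added to the Cumulative TSN Ack.
        # TSNs are 32-bit values, so the sum wraps around at 2**32.
        gap_blocks.append(((cum_tsn + start) % 2**32, (cum_tsn + end) % 2**32))
        offset += 4
    dup_tsns = []
    for _ in range(n_dups):
        (tsn,) = struct.unpack_from("!I", chunk, offset)
        dup_tsns.append(tsn)
        offset += 4
    return {
        "cumulative_tsn": cum_tsn,
        "a_rwnd": a_rwnd,
        "gap_blocks": gap_blocks,
        "duplicate_tsns": dup_tsns,
    }

# Hypothetical sample: everything up to TSN 100 received, plus TSNs 102-103
# and 105, and TSN 99 was seen twice.
sample = (
    struct.pack("!BBHIIHH", 3, 0, 28, 100, 65535, 2, 1)
    + struct.pack("!HHHH", 2, 3, 5, 5)
    + struct.pack("!I", 99)
)
parsed = parse_sack(sample)
print(parsed["gap_blocks"])
```

In this sample, TSNs 101 and 104 are the ones still unaccounted for, since they fall outside both the Cumulative TSN Ack and the two Gap Ack Blocks.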


The SHUTDOWN chunk is issued when one of the endpoints of a connection wants to close the current association. The sending party must empty all of its sending buffers before sending the SHUTDOWN chunk, and must not send any more DATA chunks afterwards. The receiver must also empty its sending buffers and must then send the responding SHUTDOWN ACK chunk.

Type - bit 0-7. This header is always set to 7 for SHUTDOWN chunks.

Chunk flags - bit 8-15. Not used today, but may be subject to change. See SCTP Common and generic headers for more information.

Chunk Length - bit 16-31. The chunk length is the length of the whole chunk, including the Cumulative TSN Ack parameter. The length of the SHUTDOWN chunk should always be 8.

Cumulative TSN Ack - bit 32-63. This is a Cumulative TSN Ack field, just the same as in the SACK chunk. The Cumulative TSN Ack acknowledges the last TSN received in sequence from the opposite endpoint. Neither this parameter nor the rest of the SHUTDOWN chunk can acknowledge Gap Ack Blocks. The lack of a Gap Ack Block in the SHUTDOWN chunk for data that was acknowledged before should not be interpreted as if the previously acknowledged block was lost again.


The SHUTDOWN ACK chunk is used to acknowledge a SHUTDOWN chunk that has been received. Before the SHUTDOWN ACK chunk is sent, all data in the sending buffers should be sent, but the buffers must not accept any new data from the application. SCTP does not support half-open connections as TCP does.

Type - bit 0-7. This header is always set to 8 for SHUTDOWN ACK chunks.

Chunk flags - bit 8-15. Not used today, but may be subject to change. See SCTP Common and generic headers for more information.

Chunk Length - bit 16-31. The chunk length is the length of the whole chunk. The length of the SHUTDOWN ACK chunk should always be 4.


The SHUTDOWN COMPLETE chunk is sent, by the originating host of the SHUTDOWN, in response to the SHUTDOWN ACK chunk. It is sent to acknowledge that the association is finally closed.

Type - bit 0-7. Always set to 14 for SHUTDOWN COMPLETE chunks.

Reserved - bit 8-14. Not used today, but may be subject to change. See SCTP Common and generic headers for more information.

T-bit - bit 15. If the T-bit is not set, this signals that the sending host had a Transmission Control Block (TCB) associated with this connection and that it destroyed it. If the T-bit is set, it had no TCB to destroy.

Length - bit 16-31. This is always set to 4 for SHUTDOWN COMPLETE chunks, since the chunk should never be any larger, as long as no updates to the standards are made.
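
The three chunks used to close an association are small enough to be built by hand. The following is only a sketch in Python, based on the field descriptions above; the function names are hypothetical, and a real SCTP packet would also need the common SCTP header in front of the chunk.

```python
import struct

def build_shutdown(cumulative_tsn_ack: int) -> bytes:
    # Type 7, no flags, length 8, followed by the 32-bit Cumulative TSN Ack.
    return struct.pack("!BBHI", 7, 0, 8, cumulative_tsn_ack)

def build_shutdown_ack() -> bytes:
    # Type 8, no flags, length 4. There are no further fields.
    return struct.pack("!BBH", 8, 0, 4)

def build_shutdown_complete(had_tcb: bool) -> bytes:
    # Type 14, length 4. The T-bit is bit 15, i.e. the lowest bit of the
    # flags byte. It is left unset when the sender had a TCB and destroyed it.
    flags = 0 if had_tcb else 1
    return struct.pack("!BBH", 14, flags, 4)

def t_bit_is_set(chunk: bytes) -> bool:
    ctype, flags, length = struct.unpack_from("!BBH", chunk, 0)
    if ctype != 14:
        raise ValueError("not a SHUTDOWN COMPLETE chunk")
    return bool(flags & 0x01)

print(build_shutdown(1000).hex())
print(t_bit_is_set(build_shutdown_complete(had_tcb=True)))
```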

TCP/IP destination driven routing

TCP/IP has grown quite a lot in complexity when it comes to the routing part. In the beginning, most people thought destination driven routing would be enough. Over the last few years, however, this has become more and more complex. Today, Linux can route on basically every single field or bit in the IP header, and even based on TCP, UDP or ICMP headers as well. This is called policy based routing, or advanced routing.

This is simply a brief discussion on how destination driven routing is performed. When we send a packet from a sending host, the packet is created. After this, the computer looks at the packet destination address and compares it to the routing table that it has. If the destination address is local, the packet is sent directly to that address via its hardware MAC address. If the destination is on the other side of a gateway, the packet is sent to the MAC address of the gateway. The gateway will then look at the IP headers and see the destination address of the packet. The destination address is looked up in the routing table again, and the packet is sent to the next gateway, et cetera, until the packet finally reaches the local network of the destination.
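
The lookup described above can be sketched in a few lines of Python. Note that the routing table, the addresses and the interface name are made-up values for illustration only, and that picking the most specific matching route (the longest prefix) is stated here as an assumption about how the table is searched; the text above does not go into that detail.

```python
import ipaddress

# Hypothetical routing table: (destination network, gateway, interface).
# A gateway of None means the network is directly connected.
ROUTES = [
    (ipaddress.ip_network("192.168.1.0/24"), None, "eth0"),
    (ipaddress.ip_network("10.0.0.0/8"), "192.168.1.254", "eth0"),
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.1.1", "eth0"),
]

def next_hop(destination: str):
    """Return (IP address to deliver the frame to, outgoing interface)."""
    address = ipaddress.ip_address(destination)
    candidates = [route for route in ROUTES if address in route[0]]
    if not candidates:
        raise LookupError("no route to host")
    # Assumption: the most specific route wins.
    network, gateway, interface = max(candidates, key=lambda r: r[0].prefixlen)
    # Local destination: deliver directly. Otherwise hand it to the gateway.
    return (gateway or destination, interface)

print(next_hop("192.168.1.20"))
print(next_hop("10.1.2.3"))
print(next_hop("8.8.8.8"))
```

Only the destination address is consulted here. Policy based routing would add more inputs to the decision, such as the source address or the TOS value.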

As you can see, this routing is very basic and simple. With the advanced routing and policy based routing, this gets quite a bit more complex. We can route packets differently based on their source address for example, or their TOS value, et cetera.

What's next?

This chapter has brought you up to date so that you can fully understand the subsequent chapters. The following subjects have been gone through thoroughly:

• TCP/IP structure

• IP protocol functionality and headers.

• TCP protocol functionality and headers.

• UDP protocol functionality and headers.

• ICMP protocol functionality and headers.

• TCP/IP destination driven routing.

All of this will come in very handy later on when you start to work with the actual firewall rulesets. All of this information consists of pieces that fit together, and will lead to a better firewall design.

Chapter 3. IP filtering introduction

This chapter will discuss the theoretical details about an IP filter, what it is, how it works and basic things such as where to place firewalls, policies, et cetera.

Questions for this chapter may be: where should we actually put the firewall? In most cases, this is a simple question, but in large corporate environments it may get trickier. What should the policies be? Who should have access where? What is actually an IP filter? All of these questions should be fairly well answered later on in this chapter.

What is an IP filter

It is important to fully understand what an IP filter is. Iptables is an IP filter, and if you don't fully understand this, you will run into serious problems when designing your firewalls in the future.

An IP filter operates mainly in layer 2 (the Internet layer) of the TCP/IP reference stack. Iptables, however, also has the ability to work in layer 3 (the Transport layer), as most IP filters of today actually do. But per definition an IP filter works in the second layer.

If the IP filter implementation strictly followed the definition, it would in other words only be able to filter packets based on their IP headers (source and destination address, TOS/DSCP/ECN, TTL, protocol, etc.; things that are actually in the IP header). However, since the Iptables implementation is not perfectly strict around this definition, it is also able to filter packets based on other headers that lie deeper into the packet (TCP, UDP, etc.), and shallower (MAC source address).

There is one thing, however, that iptables is rather strict about these days. It does not "follow" streams or puzzle data together. This would simply be too processor- and memory-consuming. The implications of this will be discussed a little bit more further on. It does, however, keep track of packets and sees whether they belong to the same stream (via sequence numbers, port numbers, etc.) in almost exactly the same way as the real TCP/IP stack. This is called connection tracking, and thanks to this we can do things such as Destination and Source Network Address Translation (generally called DNAT and SNAT), as well as state matching of packets.

As I implied above, iptables cannot connect data from different packets to each other (per default), and hence you can never be fully certain that you will see the complete data at all times. I am specifically mentioning this since there are constantly at least a couple of questions about this on the different mailing lists pertaining to netfilter and iptables, asking how to do things that are generally considered a really bad idea. For example, every time there is a new Windows based virus, there are a couple of different persons asking how to drop all streams containing a specific string. The reason this is a bad idea is that it is so easily circumvented. For example, if we match for something like this:

cmd.exe

Now, what happens if the virus/exploit writer is smart enough to make the packet size so small that cmd winds up in one packet, and .exe winds up in the next packet? Or what if the packet has to travel through a network that has this small a packet size on its own? Yes, since these string matching functions are unable to work across packet boundaries, the packet will get through anyway.
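
The problem is easy to demonstrate with a small Python sketch. The request line and the packet sizes below are invented for the example, and the two matching functions are only stand-ins for a per-packet string match and for a proxy that sees the whole reassembled stream; this is not how the netfilter string match is implemented.

```python
def split_into_packets(payload: bytes, size: int) -> list:
    """Cut a payload into consecutive packets of at most `size` bytes."""
    return [payload[i:i + size] for i in range(0, len(payload), size)]

def per_packet_match(packets: list, needle: bytes) -> bool:
    """Look for the string in each packet on its own, as a packet filter would."""
    return any(needle in packet for packet in packets)

def stream_match(packets: list, needle: bytes) -> bool:
    """Look for the string in the reassembled stream, as a proxy could."""
    return needle in b"".join(packets)

payload = b"GET /scripts/cmd.exe HTTP/1.0\r\n"

big = split_into_packets(payload, 1500)    # everything fits in one packet
small = split_into_packets(payload, 16)    # "cmd" and ".exe" are separated

print(per_packet_match(big, b"cmd.exe"))    # True
print(per_packet_match(small, b"cmd.exe"))  # False, the match is evaded
print(stream_match(small, b"cmd.exe"))      # True
```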

Some of you may now be asking yourselves why we don't simply make it possible for the string matches, et cetera, to read across packet boundaries. The answer is actually fairly simple. It would be too costly in processor time. Connection tracking already takes way too much processor time to be totally comforting. To add another layer of complexity to connection tracking, such as this, would probably kill more firewalls than any of us could expect. Not to mention how much memory would be used for this simple task on each machine.

There is also a second reason for this functionality not being developed. There is a technology called proxies. Proxies were developed to handle traffic in the higher layers, and are hence much better at fulfilling these requirements. Proxies were originally developed to handle downloads and often used pages and to help you get the most out of slow Internet connections. For example, Squid is a web proxy. A person who wants to download a page sends the request; the proxy either intercepts the request or receives it directly from the web browser, then connects to the webserver and downloads the file, and when it has downloaded the file or page, it sends it to the client. Now, if a second browser wants to read the same page, the file or page is already downloaded to the proxy and can be sent directly, which saves bandwidth for us.

As you may understand, proxies also have quite a lot of functionality to go in and look at the actual content of the files that they download. Because of this, they are much better at looking inside the whole streams, files, pages etc.

Now, after warning you about the inherent problems of doing layer 7 filtering in iptables and netfilter, there is actually a set of patches that has attacked these problems. This is called l7-filter (http://l7-filter.sourceforge.net/). It can be used to match on a lot of layer 7 protocols but is mainly to be used together with QoS and traffic accounting, even though it can be used for pure filtering as well. The l7-filter is still experimental and developed outside the kernel and netfilter coreteam, and hence you will not hear more about it here.

IP filtering terms and expressions

To fully understand the upcoming chapters there are a few general terms and expressions that one must understand, including a lot of details regarding the TCP/IP chapter. This is a listing of the most common terms used in IP filtering.

• Drop/Deny - When a packet is dropped or denied, it is simply deleted, and no further actions are taken. No reply is sent to tell the host that the packet was dropped, nor is the receiving host of the packet notified in any way. The packet simply disappears.

• Reject - This is basically the same as a drop or deny target or policy, except that we also send a reply to the host sending the packet that was dropped. The reply may be specified, or automatically calculated to some value. (To this date, there is unfortunately no iptables functionality to also send a packet notifying the receiving host of the rejected packet what happened (i.e., doing the reverse of the Reject target). This would be very good in certain circumstances, since the receiving host has no ability to stop Denial of Service attacks from happening.)

• State - A specific state of a packet in comparison to a whole stream of packets. For example, if the packet is the first that the firewall sees or knows about, it is considered new (the SYN packet in a TCP connection), or if it is part of an already established connection that the firewall knows about, it is considered to be established. States are known through the connection tracking system, which keeps track of all the sessions.

• Chain - A chain contains a ruleset of rules that are applied to packets that traverse the chain. Each chain has a specific purpose (e.g., which table it is connected to, which specifies what this chain is able to do), as well as a specific application area (e.g., only forwarded packets, or only packets destined for this host). In iptables, there are several different chains, which will be discussed in depth in later chapters.

• Table - Each table has a specific purpose, and in iptables there are 4 tables: the raw, nat, mangle and filter tables. For example, the filter table is specifically designed to filter packets, while the nat table is specifically designed to NAT (Network Address Translation) packets.

• Match - This word can have two different meanings when it comes to IP filtering. The first meaning would be a single match that tells a rule that this header must contain this and this information. For example, the --source match tells us that the source address must be a specific network range or host address. The second meaning is if a whole rule is a match. If the packet matches the whole rule, the jump or target instructions will be carried out (e.g., the packet will be dropped.)

• Target - There is generally a target set for each rule in a ruleset. If the rule has matched fully, the target specification tells us what to do with the packet. For example, if we should drop or accept it, or NAT it, etc. There is also something called a jump specification, for more information see the jump description in this list. As a last note, there might not be a target or jump for each rule, but there may be.

• Rule - A rule is a set of a match or several matches together with a single target in most implementations of IP filters, including the iptables implementation. There are some implementations which let you use several targets/actions per rule.

• Ruleset - A ruleset is the complete set of rules that are put into a whole IP filter implementation. In the case of iptables, this includes all of the rules set in the filter, nat, raw and mangle tables, and in all of the subsequent chains. Most of the time, they are written down in a configuration file of some sort.

• Jump - The jump instruction is closely related to a target. A jump instruction is written exactly the same as a target in iptables, with the exception that instead of writing a target name, you write the name of another chain. If the rule matches, the packet will hence be sent to this second chain and be processed as usual in that chain.

• Connection tracking - Simply put, a firewall which implements connection tracking is able to track connections/streams. This ability often comes at the cost of lots of processor and memory usage. This is unfortunately true in iptables as well, but much work has been done to improve this. However, the good side is that the firewall will be much more secure with connection tracking properly used by the implementer of the firewall policies.

• Accept - To accept a packet and to let it through the firewall rules. This is the opposite of the drop or deny targets, as well as the reject target.

• Policy - There are two kinds of policies that we speak about most of the time when implementing a firewall. First we have the chain policies, which tell the firewall implementation the default behaviour to take on a packet if there was no rule that matched it. This is the main usage of the word that we will use in this book. The second type of policy is the security policy that we may have written documentation on, for example for the whole company or for this specific network segment. Security policies are very good documents to have thought through properly and to study properly before starting to actually implement the firewall.
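
To see how several of the terms above fit together (chain, rule, match, target, jump and chain policy), consider the following toy model in Python. It is only a sketch written for this explanation: the class names, the packet fields and the example chains are all hypothetical, tables are left out entirely, and it is in no way how iptables is implemented internally.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Packet = Dict[str, Any]
Match = Callable[[Packet], bool]

@dataclass
class Rule:
    matches: List[Match]   # every match must be true for the rule to match
    target: str            # "ACCEPT", "DROP", or the name of a chain (a jump)

@dataclass
class Chain:
    rules: List[Rule] = field(default_factory=list)
    policy: Optional[str] = None   # user-defined chains have no policy

def traverse(chains: Dict[str, Chain], name: str, packet: Packet) -> Optional[str]:
    chain = chains[name]
    for rule in chain.rules:
        if all(match(packet) for match in rule.matches):
            if rule.target in ("ACCEPT", "DROP"):
                return rule.target
            # Jump: process the packet in the other chain. If that chain
            # reaches no verdict, continue with the next rule in this one.
            verdict = traverse(chains, rule.target, packet)
            if verdict is not None:
                return verdict
    # No rule gave a verdict, so the chain policy (if any) decides.
    return chain.policy

chains = {
    "INPUT": Chain(
        rules=[
            Rule([lambda p: p["state"] == "ESTABLISHED"], "ACCEPT"),
            Rule([lambda p: p["proto"] == "tcp"], "tcp_packets"),
        ],
        policy="DROP",
    ),
    "tcp_packets": Chain(
        rules=[
            Rule(
                [lambda p: p["dport"] == 22,
                 lambda p: p["src"].startswith("192.168.1.")],
                "ACCEPT",
            ),
        ],
    ),
}

ssh = {"state": "NEW", "proto": "tcp", "dport": 22, "src": "192.168.1.5"}
telnet = {"state": "NEW", "proto": "tcp", "dport": 23, "src": "192.168.1.5"}
print(traverse(chains, "INPUT", ssh))
print(traverse(chains, "INPUT", telnet))
```

The second packet matches no rule in either chain, so it ends up being handled by the DROP policy of the INPUT chain.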

How to plan an IP filter

One of the first steps to think about when planning firewalls is their placement. This should be a fairly simple step since your networks should mostly be fairly well segmented anyway. One of the first places that comes to mind is the gateway between your local network(s) and the Internet. This is a place where there should be fairly tight security. Also, in larger networks it may be a good idea to separate different divisions from each other via firewalls. For example, why should the development team have access to the human resources network, or why not protect the finance department from other networks? Simply put, you don't want an angry employee with a pink slip tampering with the salary databases.

Simply put, the above means that you should plan your networks as well as possible, and plan them to be segregated, especially if the network is medium to large in size (100 workstations or more, depending on different aspects of the network). In between these smaller networks, try to put firewalls that will only allow the kind of traffic that you would like.

It may also be a good idea to create a De-Militarized Zone (DMZ) in your network in case you have servers that are reached from the Internet. A DMZ is a small physical network with servers, which is locked down to the extreme. This lessens the risk of anyone actually getting in to the machines in the DMZ, and it lessens the risk of anyone actually getting in and downloading any trojans etc. from the outside. The reason that they are called de-militarized zones is that they must be reachable from both the inside and the outside, and hence they are a kind of grey zone.

There are a couple of ways to set up the policies and default behaviours in a firewall. This section will discuss the actual theory that you should think about before starting to implement your firewall, and will help you to think through your decisions to the fullest extent.

Before we start, you should understand that most firewalls have default behaviours. For example, if no rule in a specific chain matches, the packet can be either dropped or accepted per default. Unfortunately, there is only one policy per chain, but this is often easy to get around if we want to have different policies per network interface etc.

There are two basic policies that we normally use. Either we drop everything except that which we specify, or we accept everything except that which we specifically drop. Most of the time, we are interested in the drop policy, and then in specifically accepting everything that we want to allow. This means that the firewall is more secure per default, but it may also mean that you will have much more work in front of you to simply get the firewall to operate properly.

Your first decision to make is to simply figure out which type of firewall you should use. How big are the security concerns? What kind of applications must be able to get through the firewall? Certain applications are horrible for firewalls for the simple reason that they negotiate ports to use for data streams inside a control session. This makes it extremely hard for the firewall to know which ports to open up. The most common applications work with iptables, but the rarer ones do not work to this day, unfortunately.

Note There are also some applications that work partially, such as ICQ. Normal ICQ usage works perfectly, but not the chat or file sending functions, since they require specific code to handle the protocol. Since the ICQ protocols are not standardized (they are proprietary and may be changed at any time) most IP filters have chosen to either keep the ICQ protocol handlers out, or as patches that can be applied to the firewalls. Iptables has chosen to keep them as separate patches.

It may also be a good idea to apply layered security measures, which we have actually already partially discussed. What we mean by this is that you should use as many security measures as possible at the same time, and not rely on any one single security concept. Having this as a basic concept for your security will increase security tenfold at least. For example, let's look at this.

As you can see, in this example I have chosen to place a Cisco PIX firewall at the perimeter of all three network connections. It may NAT the internal LAN, as well as the DMZ if necessary. It may also block all outgoing traffic except http return traffic as well as ftp and ssh traffic. It can allow incoming http traffic from both the LAN and the Internet, and ftp and ssh traffic from the LAN. On top of this, we note that each webserver is based on Linux, and we can hence throw iptables and netfilter on each of the machines as well and add the same basic policies on these. This way, if someone manages to break the Cisco PIX, we can still rely on the netfilter firewalls locally on each machine, and vice versa. This allows for so called layered security.

On top of this, we may add Snort on each of the machines. Snort is an excellent open source network intrusion detection system (NIDS) which looks for signatures in the packets that it sees, and if it sees a signature of some kind of attack or break-in it can either e-mail the administrator and notify him about it, or even make active responses to the attack such as blocking the IP from which the attack originated. It should be noted that active responses should not be used lightly since Snort has a bad habit of reporting lots of false positives (i.e., reporting an attack which is not really an attack).

It could also be a good idea to throw in a proxy in front of the webservers to catch some of the bad packets as well, and the same could be done for all of the locally generated web connections. With a web proxy you can narrow down the traffic used for web browsing by your employees, as well as restrict their web usage to some extent. As for a web proxy in front of your own webservers, you can use it to block some of the most obvious bad connections from getting through. A good proxy that may be worth using is Squid.

Another precaution that one can take is to install Tripwire. This is an excellent last line of defense kind of application, and it is generally considered to be a Host Intrusion Detection System. What it does is to make checksums of all the files specified in a configuration file, and then it is run from cron once in a while to see that all of the specified files are the same as before, or have not changed in an illegitimate way. This program will in other words be able to find out if anyone has actually been able to get through and tamper with the system. A suggestion is to run this on all of the webservers.
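
The basic idea behind this kind of file integrity checking can be sketched in a few lines of Python. This is only an illustration of the principle and is not how Tripwire itself works internally; the function names and the choice of SHA-256 are assumptions, and a real tool also has to protect its own baseline database from tampering.

```python
import hashlib
from pathlib import Path

def checksum(path) -> str:
    """Return the SHA-256 checksum of a file as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def make_baseline(paths) -> dict:
    """Record a checksum for every file that should be watched."""
    return {str(path): checksum(path) for path in paths}

def changed_files(baseline: dict) -> list:
    """Return the watched files that are missing or no longer match."""
    changed = []
    for path, digest in baseline.items():
        if not Path(path).exists() or checksum(path) != digest:
            changed.append(path)
    return changed
```

A baseline would be made once on a known good system, stored somewhere safe, and then compared regularly, for example from cron, by calling changed_files and alerting the administrator if the returned list is not empty.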

One last thing to note is that it is always a good thing to follow standards, as we know. As you have already seen with the ICQ example, if you don't use standardized systems, things can go terribly wrong. For your own environments, this can be ignored to some extent, but if you are running a broadband service or modempool, it gets all the more important. People who connect through you must always be able to rely on your standardization, and you can't expect everyone to run the specific operating system of your choice. Some people want to run Windows, some want to run Linux or even VMS and so on. If you base your security on proprietary systems, you are in for some trouble.

A good example of this is certain broadband services that have popped up in Sweden and that base lots of their security on Microsoft network logon. This may sound like a great idea to begin with, but once we start considering other operating systems and so on, this is no longer such a good idea. How will someone running Linux get online? Or VAX/VMS? Or HP/UX? With Linux it could be done of course, if it wasn't for the fact that the network administrator refuses to let anyone use the broadband service if they are running Linux, by simply blocking them in that case. However, this book is not a theological discussion of what is best, so let's leave it as an example of why it is a bad idea to use non-standards.

What's next?

This chapter has gone through several of the basic IP filtering and security measures that you can take to secure your networks, workstations and servers. The following subjects have been brought up:

• IP filtering usage

• IP filtering policies

• Network planning

• Firewall planning

• Layered security techniques

• Network segmentation

In the next chapter we will take a quick look at what Network Address Translation (NAT) is, and after that we will start looking closer at Iptables and its functionality and actually start getting hands on with the beast.

Chapter 4. Network Address Translation Introduction

NAT is one of the biggest attractions of Linux and Iptables to this day, it seems. Instead of using fairly expensive third party solutions such as Cisco PIX etc., a lot of smaller companies and personal users have chosen to go with these solutions instead. One of the main reasons is that it is cheap and secure. It requires an old computer, a fairly new Linux distribution which you can download for free from the Internet, a spare network card or two and cabling.

This chapter will describe a little bit of the basic theory about NAT, what it can be used for, how it works and what you should think about before starting to work on these subjects.

What NAT is used for and basic terms and expressions

Basically, NAT allows a host or several hosts to share the same IP address in a way. For example, let's say we have a local network consisting of 5-10 clients. We set their default gateways to point through the NAT server. Normally the packet would simply be forwarded by the gateway machine, but in the case of a NAT server it is a little bit different.

NAT servers translate the source and destination addresses of packets, as we already said, to different addresses. The NAT server receives the packet, rewrites the source and/or destination address and then recalculates the checksum of the packet. One of the most common usages of NAT is the SNAT (Source Network Address Translation) function. Basically, this is used in the above example if we can't afford, or don't see any real point in, having a real public IP for each and every one of the clients. In that case, we use one of the private IP ranges for our local network, and then we turn on SNAT for our local network. SNAT will then turn all of the local addresses into its own public IP. This way, there will be 5-10 clients, or many many more, using the same shared IP address.
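
As a rough illustration of the bookkeeping behind SNAT, here is a small Python sketch. All addresses in it are assumptions chosen for the example (a private 192.168.1.x client range and public addresses from the ranges reserved for documentation), the class is hypothetical, and the port allocation is deliberately naive. It only shows the principle that the NAT server must remember each translation so that replies can find their way back to the right client.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]   # (IP address, port)

class SourceNat:
    """Toy SNAT table: many private clients share one public IP address."""

    def __init__(self, public_ip: str, first_port: int = 1024) -> None:
        self.public_ip = public_ip
        self.next_port = first_port
        self.outgoing: Dict[Endpoint, Endpoint] = {}   # private -> public
        self.incoming: Dict[Endpoint, Endpoint] = {}   # public -> private

    def outbound(self, src: Endpoint, dst: Endpoint) -> Tuple[Endpoint, Endpoint]:
        """Rewrite the source of a packet leaving the local network."""
        if src not in self.outgoing:
            mapped = (self.public_ip, self.next_port)
            self.next_port += 1
            self.outgoing[src] = mapped
            self.incoming[mapped] = src
        return self.outgoing[src], dst

    def inbound(self, src: Endpoint, dst: Endpoint) -> Optional[Tuple[Endpoint, Endpoint]]:
        """Rewrite the destination of a reply, or return None if unknown."""
        original = self.incoming.get(dst)
        if original is None:
            return None
        return src, original

nat = SourceNat("203.0.113.1")
server = ("198.51.100.7", 80)
print(nat.outbound(("192.168.1.5", 40000), server))
print(nat.outbound(("192.168.1.6", 40000), server))
print(nat.inbound(server, ("203.0.113.1", 1025)))
```

Both clients use the same source port here, and it is only the translation table that lets the NAT server tell their replies apart.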

There is also something called DNAT, which can be extremely helpful when it comes to setting up servers etc. First of all, you can help the greater good when it comes to saving IP space. Second, you can get a more or less totally impenetrable firewall in between your server and the real server in an easy fashion, or simply share one IP between several services that are spread over several physically different servers. For example, we may run a small company server farm containing a webserver and ftp server on the same machine, while there is a physically separated machine containing a couple of different chat services that the employees working from home or on the road can use to keep in touch with the employees that are on-site. We may then run all of these services on the same IP from the outside via DNAT.
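
The server farm example can be pictured as a simple lookup table keyed on protocol and destination port. The Python sketch below is only meant to show the idea; the addresses and the chat service port are invented for the example, and a protocol such as FTP needs more than a single port mapping to work through NAT.

```python
PUBLIC_IP = "203.0.113.1"   # hypothetical public address shared by all services

# (protocol, public port) -> (internal server, internal port)
DNAT_RULES = {
    ("tcp", 80): ("192.168.1.10", 80),      # webserver
    ("tcp", 21): ("192.168.1.10", 21),      # ftp server on the same machine
    ("tcp", 5222): ("192.168.1.20", 5222),  # chat service on a second machine
}

def dnat(proto: str, dst_ip: str, dst_port: int):
    """Return the destination that the packet should be rewritten to."""
    if dst_ip != PUBLIC_IP:
        return (dst_ip, dst_port)
    # Unknown ports are left untouched, to be handled by the filter rules.
    return DNAT_RULES.get((proto, dst_port), (dst_ip, dst_port))

print(dnat("tcp", PUBLIC_IP, 80))
print(dnat("tcp", PUBLIC_IP, 5222))
```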

The above example is also based on separate port NAT'ing, often called PNAT. We don't refer to this very often throughout this book, since it is covered by the DNAT and SNAT functionality in netfilter.

In Linux, there are actually two separate types of NAT that can be used, either Fast-NAT or Netfilter-NAT. Fast-NAT is implemented inside the IP routing code of the Linux kernel, while Netfilter-NAT is also implemented in the Linux kernel, but inside the netfilter code. Since this book won't touch the IP routing code too closely, we will pretty much leave it here, except for a few notes. Fast-NAT is generally called by this name since it is much faster than the netfilter NAT code. It doesn't keep track of connections, and this is both its main pro and con. Connection tracking takes a lot of processor power, and hence it is slower, which is one of the main reasons that Fast-NAT is faster than Netfilter-NAT. As we also said, the bad thing about Fast-NAT is that it doesn't track connections, which means it will not be able to do SNAT very well for whole networks, and neither will it be able to NAT complex protocols such as FTP, IRC and other protocols that Netfilter-NAT is able to handle very well. It is possible, but it will take much, much more work than would be expected from the Netfilter implementation.

There is also a final term that is basically a synonym for SNAT, which is masquerade. In Netfilter, masquerade is pretty much the same as SNAT with the exception that masquerading will automatically set the new source IP to the default IP address of the outgoing network interface.

Caveats using NAT

As we have already explained to some extent, there are quite a lot of minor caveats with using NAT. The main problem is certain protocols and applications which may not work at all. Hopefully, these applications are not too common in the networks that you administer, and in such case, it should cause no huge problems.

The second and smaller problem is applications and protocols which will only work partially. These protocols are more common than the ones that will not work at all, which is quite unfortunate, but there isn't very much we can do about it. If complex protocols continue to be built, especially ones that aren't standardized, this is a problem we will have to continue living with.

The third, and largest problem, in my point of view, is the fact that a user who sits behind a NAT server to get out on the Internet will not be able to run his own server. It could be done, of course, but it takes a lot more time and work to set up. In companies, this is probably preferred over having tons of servers run by different employees that are reachable from the Internet, without any supervision. However, when it comes to home users, this should be avoided as far as possible. As an Internet service provider, you should never NAT your customers from a private IP range to a public IP. It will cause you more trouble than it is worth, and there will always be one client or another who wants this or that protocol to work flawlessly. When it doesn't, you will be the one they call.

As one last note on the caveats of NAT, it should be mentioned that NAT is actually more or less just a hack. NAT was worked out when the IANA and other organisations noted that the Internet was growing exponentially, and that IP addresses would soon be in short supply. NAT was and is a short term solution to the IPv4 address shortage (yes, the IP we have talked about before is short for IPv4, which stands for Internet Protocol version 4). The long term solution is the IPv6 protocol, which also solves a ton of other problems. IPv6 uses 128 bits for its addresses, while IPv4 only uses 32 bits. This is an incredible increase in address space, from 2^32 (about 4.3 billion) to 2^128 (about 3.4 × 10^38) addresses. It may seem ridiculous to have that many IP addresses, but on the other hand, no one expected the IPv4 address range to be too small either.

Example NAT machine in theory

This is a small theoretical scenario where we want a NAT server between 2 different networks and an Internet connection. What we want to do is to connect 2 networks to each other, and both networks should have access to each other and the Internet. We will discuss the hardware questions you should take into consideration, as well as other theory you should think about before actually starting to implement the NAT machine.

What is needed to build a NAT machine

Before we discuss anything further, we should start by looking at what kind of hardware is needed to build a Linux machine doing NAT. For most smaller networks, this should be no problem, but if you are starting to look at larger networks, it can actually become one. The biggest problem with NAT is that it eats resources quite fast. For a small private network with possibly 1-10 users, a 486 with 32 MB of RAM will be more than enough. However, if you are getting up around 100 or more users, you should start considering what kind of hardware you need. Of course, it is also a good idea to consider bandwidth usage, and how many connections will be open at the same time. Generally, however, spare computers will do very well, and this is one of the big pros of using a Linux based firewall. You can use old scrap hardware that you have left over, and hence the firewall will be very cheap in comparison to other firewalls.

You will also need to consider network cards. How many separate networks will connect to your NAT/filter machine? Most of the time it is enough to connect one network to an Internet connection. If you connect to the Internet via ethernet, you should generally have 2 ethernet cards, and so on. It can be a good idea to choose 10/100 Mbit/s network cards of relatively good brands for scalability, but almost any kind of card will do as long as it has a driver in the Linux kernel. A note on this matter: avoid using or getting network cards that don't have drivers in the Linux kernel distribution itself. I have on several occasions found network cards/brands whose separately distributed drivers on discs work dismally. They are generally not very well maintained, and even if you get them to work on your kernel of choice to begin with, the chance that they will still work after the next major Linux kernel upgrade is very small. Most of the time this means that you will have to get slightly more costly network cards, but in the end it is worth it.

As a note, if you are going to build your firewall on really old hardware, it is suggested that you at least try to use PCI buses or better as far as possible. First of all, the network cards will hopefully still be usable in the future when you upgrade. Also, ISA buses are extremely slow and heavy on the CPU. This means that putting a lot of load onto ISA network cards can all but kill your machine.

Finally, one more thing to consider is how much memory you put into the NAT/firewall machine. It is a good idea to put in more than 64 MB of memory if possible, even though it is possible to run it on 32 MB. NAT isn't extremely heavy on memory consumption, but it may be wise to add as much as possible just in case you get more traffic than expected.

As you can see, there is quite a lot to think about when it comes to hardware. But, to be completely honest, in most cases you don't need to think about these points at all, unless you are building a NAT machine for a large network or company. Most home users need not think about this, but may more or less use whatever hardware they have handy. There are no complete comparisons and tests on this topic, but you should fare rather well with just a little bit of common sense.

Placement of NAT machines

This should look fairly simple. However, it may be harder than you originally thought in large networks. In general, the NAT machine should be placed on the perimeter of the network, just like any filtering machine out there. Most of the time this means that the NAT and filtering machines are the same machine, of course. Also worth a thought: if you have very large networks, it may be worth splitting the network into smaller networks and assigning a NAT/filtering machine to each of them. Since NAT takes quite a lot of processing power, this will definitely help keep the round trip time (RTT, the time it takes for a packet to reach a destination and the return packet to get back) down.

In our example network as described above, with two networks and an Internet connection, we should in other words look at how large the two networks are. If they are small enough, and depending on what requirements the clients have, a couple of hundred clients should be no problem for a decent NAT machine. Otherwise, we could split the load over several machines by setting public IPs on smaller NAT machines, each handling its own smaller segment of the network, and then let the traffic congregate over a specific routing-only machine. This of course assumes that you have enough public IPs for all of your NAT machines, and that they are routed through your routing machine.

How to place proxies

Proxies are unfortunately a general problem when it comes to NAT, especially transparent proxies. Normal proxies should not cause too much trouble, but a transparent proxy is hard to get working, especially on larger networks. The first problem is that proxies take quite a lot of processing power, just as NAT does. Putting both on the same machine is not advisable if you are going to handle a lot of network traffic. The second problem is that if you NAT the source IP as well as the destination IP, the proxy will not know which host to contact, i.e. which server the client is trying to reach. That information is lost during translation, since a NAT'ed packet cannot carry the original addresses as well as the new ones. Locally, this has been solved by adding the information to the internal data structures that are created for the packets, and hence proxies such as squid can get at it.

As you can see, the problem is that you don't have much of a choice if you are going to run a transparent proxy. There are, of course, possibilities, but they are not advisable really. One possibility is to create a proxy outside the firewall and create a routing entry that routes all web traffic through that machine, and then locally on the proxy machine NAT the packets to the proper ports for the proxy. This way, the information is preserved all the way to the proxy machine and is still available on it.

The second possibility is to simply create a proxy outside the firewall, and then block all web traffic except the traffic going to the proxy. This way, you will force all users to actually use the proxy. It's a crude way of doing it, but it will hopefully work.
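A minimal sketch of this second possibility could look as follows. The LAN interface, the proxy address and the proxy port are assumptions for illustration (3128 is merely a port often used by squid), and only plain http on port 80 is covered.

```shell
#!/bin/sh
# Sketch only: interface, address and port are assumptions.
IPTABLES="${IPTABLES:-iptables}"
LAN_IF="eth1"          # assumed interface facing the clients
PROXY_IP="192.0.2.10"  # assumed address of the proxy outside the firewall
PROXY_PORT="3128"      # assumed proxy port

force_proxy() {
    # Let clients reach the proxy...
    $IPTABLES -A FORWARD -i "$LAN_IF" -p tcp -d "$PROXY_IP" \
        --dport "$PROXY_PORT" -j ACCEPT
    # ...and refuse direct web traffic to anywhere else.
    $IPTABLES -A FORWARD -i "$LAN_IF" -p tcp --dport 80 \
        -j REJECT --reject-with tcp-reset
}
```

Rejecting with a TCP reset rather than dropping lets a misconfigured browser fail at once instead of hanging until it times out.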

The final stage of our NAT machine

As a final step, we should bring all of this information together, and see how we would then build the NAT machine. Let's take a look at a picture of the networks. We have decided to put a proxy just outside the NAT/filtering machine as described above, but inside the router. This area could be regarded as a DMZ in a sense, with the NAT/filter machine being a router between the DMZ and the two company networks. You can see the exact layout we are discussing in the image below.

All the normal traffic from the NAT'ed networks will be sent through the DMZ directly to the router, which will send the traffic on out to the Internet. The exception is, yes, you guessed it, web traffic, which is instead marked inside the netfilter part of the NAT machine and then routed to the proxy machine based on that mark. Let's take a look at what I am talking about. Say an http packet is seen by the NAT machine. The mangle table can then be used to mark the packet with a netfilter mark (also known as nfmark). Later, when the packet is to be routed towards our router, we can check for the nfmark within the routing tables and, based on this mark, choose to route the http packets to the proxy server instead. The proxy server will then do its work on the packets. We will touch on these subjects to some extent later on in the document, even though much of the routing based part belongs to the advanced routing topics.
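The marking and routing steps just described might be sketched like this. The interface names, the proxy address, the mark value 1 and the routing table number 100 are all assumptions picked for illustration.

```shell
#!/bin/sh
# Sketch only: interfaces, address, mark value and table number are assumptions.
IPTABLES="${IPTABLES:-iptables}"
IP="${IP:-ip}"
LAN_IF="eth1"        # assumed interface facing the NAT'ed networks
DMZ_IF="eth0"        # assumed interface facing the DMZ
PROXY_IP="10.0.0.2"  # assumed address of the proxy in the DMZ
MARK="1"             # arbitrary nfmark value
TABLE="100"          # arbitrary routing table number

mark_and_route_http() {
    # Step 1: set the nfmark on http packets in the mangle table.
    $IPTABLES -t mangle -A PREROUTING -i "$LAN_IF" -p tcp --dport 80 \
        -j MARK --set-mark "$MARK"
    # Step 2: packets carrying that mark consult a separate routing table...
    $IP rule add fwmark "$MARK" table "$TABLE"
    # Step 3: ...whose only route points at the proxy machine.
    $IP route add default via "$PROXY_IP" dev "$DMZ_IF" table "$TABLE"
}
```

Unmarked traffic never looks at table 100 and keeps following the main routing table towards the router.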

The NAT machine has a public IP available over the Internet, as do the router and any other machines that may be reachable from the Internet. All of the machines inside the NAT'ed networks will be using private IPs, hence saving both a lot of cash and Internet address space.

What's next?

In this chapter we have explained NAT and the theory around it in detail. In particular, we have discussed a couple of different angles to use, and some of the normal problems that may arise from using NAT together with proxies. This chapter has covered the following areas in detail.

• NAT usage

• NAT components

• NAT history

• Terms and words used about NAT

• Hardware discussions regarding NAT

• Problems with NAT

All of this will always be of use when you are working with netfilter and iptables. NAT is very widely used in today's networks, even though it is only an intermediary solution for a very unfortunate and unexpected problem. NAT will of course be discussed more in depth later on when we start looking at the Linux netfilter and iptables implementations in more depth.

Chapter 5. Preparations

This chapter is aimed at getting you started and helping you understand the role Netfilter and iptables play in Linux today. It should hopefully get you set up and ready to go with your experimentation and the installation of your firewall. Given time and perseverance, you'll then get it to perform exactly as you want it to.

Where to get iptables

The iptables user-space package can be downloaded from http://www.netfilter.org/. The iptables package also makes use of kernel space facilities which can be configured into the kernel during make config. The necessary steps will be discussed a bit further down in this document.

Kernel setup

To run the pure basics of iptables you need to configure the following options into the kernel while doing make config or one of its related commands:

CONFIG_PACKET - This option allows applications and utilities to work directly with various network devices. Examples of such utilities are tcpdump or snort.

Note CONFIG_PACKET is strictly speaking not needed for iptables to work, but since it has so many uses, I have chosen to include it here. If you do not want it, don't include it.

CONFIG_NETFILTER - This option is required if you're going to use your computer as a firewall or gateway to the Internet. In other words, this is most definitely required for anything in this tutorial to work at all. I assume you will want this, since you are reading this.

And of course you need to add the proper drivers for your interfaces to work properly, i.e. Ethernet adapter, PPP and SLIP interfaces. The above will only add some of the pure basics in iptables. You won't be able to do anything productive to be honest, it just adds the framework to the kernel. If you want to use the more advanced options in Iptables, you need to set up the proper configuration options in your kernel. Here we will show you the options available in a basic 2.4.9 kernel and a brief explanation:

CONFIG_IP_NF_CONNTRACK - This module is needed to do connection tracking. Connection tracking is used by, among other things, NAT and Masquerading. If you need to firewall machines on a LAN you most definitely should mark this option. For example, this module is required for the rc.firewall.txt script to work.

CONFIG_IP_NF_FTP - This module is required if you want to do connection tracking on FTP connections. Since FTP connections are quite hard to do connection tracking on in normal cases, conntrack needs a so called helper; this option compiles the helper. If you do not add this module you won't be able to FTP through a firewall or gateway properly.

CONFIG_IP_NF_IPTABLES - This option is required if you want to do any kind of filtering, masquerading or NAT. It adds the whole iptables identification framework to the kernel. Without this you won't be able to do anything at all with iptables.

CONFIG_IP_NF_MATCH_LIMIT - This module isn't exactly required but it's used in the example rc.firewall.txt. This option provides the LIMIT match, which makes it possible to control how many packets per minute a given rule will match. For example, -m limit --limit 3/minute would match a maximum of 3 packets per minute. This module can also be used to avoid certain Denial of Service attacks.

CONFIG_IP_NF_MATCH_MAC - This allows us to match packets based on MAC addresses. Every Ethernet adapter has its own MAC address. We could for instance block packets based on what MAC address is used and block a certain computer pretty well since the MAC address very seldom changes. We don't use this option in the rc.firewall.txt example or anywhere else.

CONFIG_IP_NF_MATCH_MARK - This allows us to use a MARK match. For example, if we use the target MARK we could mark a packet and then depending on if this packet is marked further on in the table, we can match based on this mark. This option is the actual match MARK, and further down we will describe the actual target MARK.

CONFIG_IP_NF_MATCH_MULTIPORT - This module allows us to match packets with a whole range of destination ports or source ports. Normally this wouldn't be possible, but with this match it is.

CONFIG_IP_NF_MATCH_TOS - With this match we can match packets based on their TOS field. TOS stands for Type Of Service. TOS can also be set by certain rules in the mangle table and via the ip/tc commands.

CONFIG_IP_NF_MATCH_TCPMSS - This option adds the possibility for us to match TCP packets based on their MSS field.

CONFIG_IP_NF_MATCH_STATE - This is one of the biggest new features in comparison to ipchains. With this module we can do stateful matching on packets. For example, if we have already seen traffic in two directions in a TCP connection, this packet will be counted as ESTABLISHED. This module is used extensively in the rc.firewall.txt example.

CONFIG_IP_NF_MATCH_UNCLEAN - This module will add the possibility for us to match IP, TCP, UDP and ICMP packets that don't conform to type or are invalid. We could for example drop these packets, but we never know if they are legitimate or not. Note that this match is still experimental and might not work perfectly in all cases.

CONFIG_IP_NF_MATCH_OWNER - This option will add the possibility for us to do matching based on the owner of a socket. For example, we can allow only the user root to have Internet access. This module was originally just written as an example on what could be done with the new iptables. Note that this match is still experimental and might not work for everyone.

CONFIG_IP_NF_FILTER - This module will add the basic filter table which will enable you to do IP filtering at all. In the filter table you'll find the INPUT, FORWARD and OUTPUT chains. This module is required if you plan to do any kind of filtering on packets that you receive and send.

CONFIG_IP_NF_TARGET_REJECT - This target allows us to specify that an ICMP error message should be sent in reply to incoming packets, instead of plainly dropping them dead to the floor. Keep in mind that TCP connections, as opposed to ICMP and UDP, are always reset or refused with a TCP RST packet.

CONFIG_IP_NF_TARGET_MIRROR - This allows packets to be bounced back to the sender of the packet. For example, if we set up a MIRROR target on destination port HTTP on our INPUT chain and someone tries to access this port, we would bounce his packets back to him and finally he would probably see his own homepage.

Warning The MIRROR target is not to be used lightly. It was originally built as a test and example module, and will most probably be very dangerous to the person setting it up (resulting in serious DDoS, among other things).

CONFIG_IP_NF_NAT - This module allows network address translation, or NAT, in its different forms. This option gives us access to the nat table in iptables. This option is required if we want to do port forwarding, masquerading, etc. Note that this option is not required for plain firewalling of a LAN, but you should have it present unless you are able to provide unique IP addresses for all hosts. Hence, this option is required for the example rc.firewall.txt script to work properly, and most definitely on your network if you do not have the ability to add unique IP addresses as specified above.

CONFIG_IP_NF_TARGET_MASQUERADE - This module adds the MASQUERADE target. For instance if we don't know what IP we have to the Internet this would be the preferred way of getting the IP instead of using DNAT or SNAT. In other words, if we use DHCP, PPP, SLIP or some other connection that assigns us an IP, we need to use this target instead of SNAT. Masquerading gives a slightly higher load on the computer than NAT, but will work without us knowing the IP address in advance.

CONFIG_IP_NF_TARGET_REDIRECT - This target is useful together with application proxies, for example. Instead of letting a packet pass right through, we remap it to go to our local box instead. In other words, we have the possibility to make a transparent proxy this way.

CONFIG_IP_NF_TARGET_LOG - This adds the LOG target and its functionality to iptables. We can use this module to log certain packets to syslogd and hence see what is happening to the packet. This is invaluable for security audits, forensics or debugging a script you are writing.

CONFIG_IP_NF_TARGET_TCPMSS - This option can be used to counter Internet Service Providers and servers who block ICMP Fragmentation Needed packets. This blocking can result in web pages not getting through, small mails getting through while larger mails don't, ssh working while scp dies after the handshake, etc. We can then use the TCPMSS target to overcome this by clamping our MSS (Maximum Segment Size) to the PMTU (Path Maximum Transmission Unit).

CONFIG_IP_NF_COMPAT_IPCHAINS - Adds a compatibility mode with the obsolete ipchains. Do not look to this as any real long term solution for solving migration from Linux 2.2 kernels to 2.4 kernels, since it may well be gone with kernel 2.6.

CONFIG_IP_NF_COMPAT_IPFWADM - Compatibility mode with obsolescent ipfwadm. Definitely don't look to this as a real long term solution.

As you can see, there is a heap of options. I have briefly explained here what kind of extra behaviors you can expect from each module. These are only the options available in a vanilla Linux 2.4.9 kernel. If you would like to take a look at more options, I suggest you look at the patch-o-matic (POM) functions in Netfilter user-land, which will add heaps of other options to the kernel. POM fixes are additions that are supposed to be added to the kernel in the future but have not quite reached it yet. This may be for various reasons, ranging from the patch not being stable yet, to Linus Torvalds being unable to keep up, or not wanting to let the patch into the mainstream kernel yet since it is still experimental.
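To make a few of the options above more concrete, here is a sketch with one standalone rule per option. The rules are independent illustrations and not a working rule-set; the ports, the MAC address and the interface name are invented, and the --dports spelling assumes a reasonably recent iptables user-land.

```shell
#!/bin/sh
# Sketch only: each rule illustrates one kernel option; values are made up.
IPTABLES="${IPTABLES:-iptables}"

example_rules() {
    # CONFIG_IP_NF_MATCH_STATE: accept packets of already known connections.
    $IPTABLES -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    # CONFIG_IP_NF_MATCH_LIMIT + CONFIG_IP_NF_TARGET_LOG: log at most 3 packets/minute.
    $IPTABLES -A INPUT -m limit --limit 3/minute -j LOG --log-prefix "INPUT: "
    # CONFIG_IP_NF_MATCH_MULTIPORT: several destination ports in one rule.
    $IPTABLES -A INPUT -p tcp -m multiport --dports 22,80,443 -j ACCEPT
    # CONFIG_IP_NF_MATCH_MAC: block one (invented) Ethernet adapter.
    $IPTABLES -A INPUT -m mac --mac-source 00:11:22:33:44:55 -j DROP
    # CONFIG_IP_NF_TARGET_TCPMSS: clamp the MSS of forwarded connections to the PMTU.
    $IPTABLES -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
    # CONFIG_IP_NF_TARGET_REDIRECT: send web traffic to a local proxy port.
    $IPTABLES -t nat -A PREROUTING -i eth1 -p tcp --dport 80 -j REDIRECT --to-ports 3128
}
```

If a rule fails to load, the matching option is usually either not compiled in or its module has not been loaded.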

You will need the following options compiled into your kernel, or as modules, for the rc.firewall.txt script to work. If you need help with the options that the other scripts need, look at the example firewall scripts section.













At the very least the above will be required for the rc.firewall.txt script. In the other example scripts I will explain what requirements they have in their respective sections. For now, let's try to stay focused on the main script which you should be studying now.
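One way to check a kernel configuration file for such options is sketched below. The function name is hypothetical, and the options in the example call are only those that the descriptions earlier in this chapter name as used by rc.firewall.txt, so treat the list as an assumption rather than a complete requirement. Later kernel series have renamed some of these options.

```shell
#!/bin/sh
# Report which of the given options are enabled (=y or =m) in a kernel .config.
# Returns 0 if all are present and 1 if at least one is missing.
check_kernel_options() {
    config_file="$1"
    shift
    missing=0
    for opt in "$@"; do
        if grep -Eq "^${opt}=(y|m)\$" "$config_file"; then
            echo "ok      $opt"
        else
            echo "MISSING $opt"
            missing=1
        fi
    done
    return "$missing"
}

# Example call; the path is the usual kernel source location assumed in this chapter:
# check_kernel_options /usr/src/linux/.config \
#     CONFIG_NETFILTER CONFIG_IP_NF_CONNTRACK CONFIG_IP_NF_IPTABLES \
#     CONFIG_IP_NF_MATCH_LIMIT CONFIG_IP_NF_MATCH_STATE CONFIG_IP_NF_NAT
```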

User-land setup

First of all, let's look at how we compile the iptables package. It's important to realize that for the most part configuration and compilation of iptables go hand in hand with the kernel configuration and compilation. Certain distributions come with the iptables package preinstalled; one of these is Red Hat. However, in old Red Hat versions it is disabled by default. We will look closer at how to enable it, and take a look at other distributions, further on in this chapter.

Compiling the user-land applications

First of all unpack the iptables package. Here, we have used the iptables 1.2.6a package and a vanilla 2.4 kernel. Unpack as usual, using bzip2 -cd iptables-1.2.6a.tar.bz2 | tar -xvf - (this can also be accomplished with tar -xjvf iptables-1.2.6a.tar.bz2, which should do pretty much the same as the first command, although it may not work with older versions of tar). The package should now be unpacked properly into a directory named iptables-1.2.6a. For more information read the iptables-1.2.6a/INSTALL file, which contains pretty good information on compiling and getting the program to run.

After this, you have the option of configuring and installing extra modules and options for the kernel. The step described here will only check and install standard patches that are pending for inclusion in the kernel. There are some even more experimental patches further along, which may only be available when you carry out other steps.

Note Some of these patches are highly experimental, and installing them may not be such a good idea. However, there are heaps of extremely interesting matches and targets in this installation step, so don't be afraid of at least looking at them.

To carry out this step we do something like this from the root of the iptables package:

make pending-patches KERNEL_DIR=/usr/src/linux/

The variable KERNEL_DIR should point to the actual place that your kernel source is located at. Normally this should be /usr/src/linux/ but this may vary, and most probably you will know yourself where the kernel source is available.

The above command only asks about certain patches that are just about to enter the kernel anyway. There might be more patches and additions that the developers of Netfilter are about to add to the kernel, but which are a bit further away from actually getting there. One way to install these is by doing the following:

make most-of-pom KERNEL_DIR=/usr/src/linux/

The above command would ask about installing parts of what in Netfilter world is called patch-o-matic, but still skip the most extreme patches that might cause havoc in your kernel. Note that we say ask, because that's what these commands actually do. They ask you before anything is changed in the kernel source. To be able to install all of the patch-o-matic stuff you will need to run the following command:

make patch-o-matic KERNEL_DIR=/usr/src/linux/

Don't forget to read the help for each patch thoroughly before doing anything. Some patches will destroy other patches while others may destroy your kernel if used together with some patches from patch-o-matic etc.

Note You may totally ignore the above steps if you don't want to patch your kernel. In other words, it is not necessary to do the above. However, there are some really interesting things in the patch-o-matic that you may want to look at, so there's nothing bad in just running the commands and seeing what they contain.

After this you are finished with the patch-o-matic parts of the installation. You may now compile a new kernel making use of the new patches that you have added to the source. Don't forget to configure the kernel again, since the new patches are probably not yet among the configured options. You may wait with the kernel compilation until after the compilation of the user-land program iptables if you feel like it, though.

Continue by compiling the iptables user-land application. To compile iptables you issue a simple command that looks like this:

make KERNEL_DIR=/usr/src/linux/

The user-land application should now compile properly. If not, you are on your own, or you could subscribe to the Netfilter mailing list, where you have the chance of asking for help with your problems. There are a few things that might go wrong with the installation of iptables, so don't panic if it won't work. Try to think logically about it and find out what's wrong, or get someone to help you.

If everything has worked smoothly, you're ready to install the binaries by now. To do this, you would issue the following command to install them:

make install KERNEL_DIR=/usr/src/linux/

Hopefully everything should work in the program now. To use any of the changes in the iptables user-land applications you should now recompile and reinstall your kernel and modules, if you hadn't done so before. For more information about installing the user-land applications from source, check the INSTALL file in the source which contains excellent information on the subject of installation.
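The compile and install steps above can be wrapped in a small helper that first checks that KERNEL_DIR really points at a kernel source tree. This is only a sketch: the function name and the MAKE variable are assumptions, and it is meant to be run from the root of the unpacked iptables package.

```shell
#!/bin/sh
# Sketch only: wraps the two make commands with a sanity check on KERNEL_DIR.
MAKE="${MAKE:-make}"

build_iptables() {
    kernel_dir="${1:-/usr/src/linux}"
    # A kernel source tree has a top-level Makefile; refuse to go on without one.
    if [ ! -f "$kernel_dir/Makefile" ]; then
        echo "no kernel source found in $kernel_dir" >&2
        return 1
    fi
    # Compile first, and install only if the compilation succeeded.
    $MAKE KERNEL_DIR="$kernel_dir" && $MAKE install KERNEL_DIR="$kernel_dir"
}
```

Calling build_iptables without an argument uses /usr/src/linux, the location assumed throughout this section.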

Installation on Red Hat 7.1

Red Hat 7.1 comes preinstalled with a 2.4.x kernel that has Netfilter and iptables compiled in. It also contains all the basic user-land programs and configuration files that are needed to run it. However, the Red Hat people have disabled the whole thing by using the backward compatible ipchains module. Annoying to say the least, and a lot of people keep asking different mailing lists why iptables doesn't work. So, let's take a brief look at how to turn the ipchains module off and how to install iptables instead.

Note The default Red Hat 7.1 installation today comes with a hopelessly old version of the user-space applications, so you might want to compile a new version of the applications as well as install a new and custom compiled kernel before fully exploiting iptables.

First of all you will need to turn off the ipchains module so it won't start in the future. To do this, you will need to change some filenames in the /etc/rc.d/ directory-structure. The following command should do it:

chkconfig --level 0123456 ipchains off

By doing this we move all the soft links that point to the /etc/rc.d/init.d/ipchains script to K92ipchains. The first letter, which per default would be S, tells the initscripts to start the script. By changing this to K we tell it to Kill the service instead, or to not run it if it was not previously started. Now the service won't be started in the future.

However, to stop the service from actually running right now we need to run another command. This is the service command which can be used to work on currently running services. We would then issue the following command to stop the ipchains service:

service ipchains stop

Finally, we want to start the iptables service. First of all, we need to know which run-levels we want it to run in. Normally this would be run-levels 2, 3 and 5. These run-levels are used for the following things:

• 2. Multiuser without NFS or the same as 3 if there is no networking.

• 3. Full multiuser mode, i.e. the normal run-level to run in.

• 5. X11. This is used if you automatically boot into Xwindows.

To make iptables run in these run-levels we would run the following command:

chkconfig --level 235 iptables on

The above command would in other words make the iptables service run in run-levels 2, 3 and 5. If you'd like the iptables service to run in some other run-level, you would have to issue the same command for that level. However, none of the other run-levels should be used, so you should not really need to activate it for them. Level 1 is for single user mode, i.e. when you need to fix a screwed-up box. Level 4 should be unused, and levels 0 and 6 are for halting and rebooting the computer.

To activate the iptables service, we just run the following command:

service iptables start
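To confirm the result of the commands above, the run-level settings of both services can be listed. The helper name and the CHKCONFIG variable are assumptions for this sketch; chkconfig --list prints, for each run-level, whether the service is on or off.

```shell
#!/bin/sh
# Sketch only: list the run-level settings of the old and the new firewall service.
CHKCONFIG="${CHKCONFIG:-chkconfig}"

show_firewall_services() {
    for svc in ipchains iptables; do
        $CHKCONFIG --list "$svc"
    done
}
```

After the steps in this section, ipchains should be off in every run-level and iptables on in run-levels 2, 3 and 5.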

There are no rules in the iptables script. There are two common ways to add rules to a Red Hat 7.1 box. Firstly, you could edit the /etc/rc.d/init.d/iptables script. This would have the undesired effect of deleting all the rules if you updated the iptables package by RPM. The other way would be to load the rule-set and then save it with the iptables-save command and then have it loaded automatically by the rc.d scripts.

First we will describe how to set up iptables by cutting and pasting into the iptables init.d script. To add rules that are to be run when the computer starts the service, you add them under the start) section, or in the start() function. Note that if you add the rules under the start) section, you must stop the start() function from being run in the start) section. Also, don't forget to edit the stop) section, which tells the script what to do when, for example, the computer is going down or we are entering a run-level that doesn't require iptables. Also, don't forget to check out the restart and condrestart sections. Note that all this work will probably be trashed if you have, for example, Red Hat Network automatically update your packages. It may also be trashed by updating from the iptables RPM package.

The second way of doing the set up would require the following: First of all, write a rule-set that meets your requirements, either in a shell script file or directly with iptables, and don't forget to experiment a bit. When you find a set up that works without problems, or without bugs as far as you can see, use the iptables-save command. You could either use it normally, i.e. iptables-save > /etc/sysconfig/iptables, which would save the rule-set to the file /etc/sysconfig/iptables. This file is automatically used by the iptables rc.d script to restore the rule-set in the future. The other way is to save the rule-set by doing service iptables save, which would save it automatically to /etc/sysconfig/iptables. The next time you reboot the computer, the iptables rc.d script will use the command iptables-restore to restore the rule-set from the save-file /etc/sysconfig/iptables. Do not intermix these two methods, since they may heavily damage each other and render your firewall configuration useless.
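The save and restore cycle described above can be sketched as two small functions. The function names and the overridable variables are assumptions; writing to a temporary file first is a design choice of this sketch, so that a failed save does not leave a truncated rule file behind.

```shell
#!/bin/sh
# Sketch only: save the running rule-set and load it back again.
IPTABLES_SAVE="${IPTABLES_SAVE:-iptables-save}"
IPTABLES_RESTORE="${IPTABLES_RESTORE:-iptables-restore}"
RULES_FILE="${RULES_FILE:-/etc/sysconfig/iptables}"

save_rules() {
    # Dump to a temporary file and move it into place only on success.
    tmp="${RULES_FILE}.new"
    $IPTABLES_SAVE > "$tmp" && mv "$tmp" "$RULES_FILE"
}

restore_rules() {
    # iptables-restore reads the saved rule-set from standard input.
    $IPTABLES_RESTORE < "$RULES_FILE"
}
```

Both functions need root privileges when used with the real commands and the default file location.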

When all of these steps are finished, you can uninstall the currently installed ipchains and iptables packages. This is because we don't want the system to mix up the new iptables user-land application with the old preinstalled iptables applications. This step is only necessary if you are going to install iptables from the source package. It's not unusual for the new and the old package to get mixed up, since the rpm based installation installs the package in non-standard places, and those files won't get overwritten by the installation of the new iptables package. To carry out the uninstallation, do as follows:

rpm -e iptables

And why keep ipchains lying around if you won't be using it any more? Removing it is done the same way as with the old iptables binaries, etc:

rpm -e ipchains

After all this has been completed, you will have finished with the update of the iptables package from source, having followed the source installation instructions. None of the old binaries, libraries or include files etc should be lying around any more.

What's next?

This chapter has discussed how to get and how to install iptables and netfilter on some common platforms. In most modern Linux distributions iptables will come with the default installation, but sometimes it might be necessary to compile your own kernel and iptables binaries to get the absolutely latest updates. This chapter should have been of some help in managing this.

The next chapter will discuss how tables and chains are traversed, and in what order this happens and so forth. This is very important to comprehend to be able to build your own working rulesets in the future. All the different tables will be discussed in some depth also since they are created for different purposes.

Chapter 6. Traversing of tables and chains

In this chapter we'll discuss how packets traverse the different chains, and in which order. We will also discuss the order in which the tables are traversed. We'll see how valuable this is later on, when we write our own specific rules. We will also look at the points at which certain other kernel-dependent components enter the picture, that is to say the different routing decisions and so on. This is especially necessary if we want to write iptables rules that could change how and why packets get routed; good examples of this are DNAT and SNAT. Not to be forgotten are, of course, the TOS bits.


When a packet first enters the firewall, it hits the hardware and then gets passed on to the proper device driver in the kernel. Then the packet starts to go through a series of steps in the kernel, before it is either sent to the correct application (locally), or forwarded to another host - or whatever happens to it.

First, let us have a look at a packet that is destined for our own local host. It would pass through the following steps before actually being delivered to our application that receives it:

Table 6-1. Destination local host (our own machine)

Step Table Chain Comment
1     On the wire (e.g., Internet)
2     Comes in on the interface (e.g., eth0)
3 raw PREROUTING This chain is used to handle packets before the connection tracking takes place. It can be used to set a specific connection not to be handled by the connection tracking code for example.
4     This is where the connection tracking code takes place, as discussed in the chapter The state machine.
5 mangle PREROUTING This chain is normally used for mangling packets, i.e., changing TOS and so on.
6 nat PREROUTING This chain is used for DNAT mainly. Avoid filtering in this chain since it will be bypassed in certain cases.
7     Routing decision, i.e., is the packet destined for our local host or to be forwarded and where.
8 mangle INPUT At this point, the mangle INPUT chain is hit. We use this chain to mangle packets, after they have been routed, but before they are actually sent to the process on the machine.
9 filter INPUT This is where we do filtering for all incoming traffic destined for our local host. Note that all incoming packets destined for this host pass through this chain, no matter what interface or in which direction they came from.
10     Local process or application (i.e., server or client program).

Note that this time the packet was passed through the INPUT chain instead of the FORWARD chain. Quite logical. This is most probably the only thing that seems really logical about the traversal of tables and chains at first, but if you keep thinking about it, it will get clearer in time.

Now we look at the outgoing packets from our own local host and what steps they go through.

Table 6-2. Source local host (our own machine)

Step Table Chain Comment
1     Local process/application (i.e., server/client program)
2     Routing decision. What source address to use, what outgoing interface to use, and other necessary information that needs to be gathered.
3 raw OUTPUT This is where you do work before the connection tracking has taken place for locally generated packets. You can mark connections so that they will not be tracked for example.
4     This is where the connection tracking takes place for locally generated packets, for example state changes et cetera. This is discussed in more detail in the chapter The state machine.
5 mangle OUTPUT This is where we mangle packets. It is suggested that you do not filter in this chain, since it can have side effects.
6 nat OUTPUT This chain can be used to NAT outgoing packets from the firewall itself.
7     Routing decision, since the previous mangle and nat changes may have changed how the packet should be routed.
8 filter OUTPUT This is where we filter packets going out from the local host.
9 mangle POSTROUTING The POSTROUTING chain in the mangle table is mainly used when we want to do mangling on packets before they leave our host, but after the actual routing decisions. This chain will be hit by both packets just traversing the firewall, as well as packets created by the firewall itself.
10 nat POSTROUTING This is where we do SNAT as described earlier. It is suggested that you don't do filtering here since it can have side effects, and certain packets might slip through even though you set a default policy of DROP.
11     Goes out on some interface (e.g., eth0)
12     On the wire (e.g., Internet)

In this example, we're assuming that the packet is destined for another host on another network. The packet goes through the different steps in the following fashion:

Table 6-3. Forwarded packets

Step Table Chain Comment
1     On the wire (e.g., Internet)
2     Comes in on the interface (e.g., eth0)
3 raw PREROUTING Here you can set a connection to not be handled by the connection tracking system.
4     This is where the connection tracking takes place for non-locally generated packets. It is discussed in more detail in the chapter The state machine.
5 mangle PREROUTING This chain is normally used for mangling packets, i.e., changing TOS and so on.
6 nat PREROUTING This chain is used for DNAT mainly. SNAT is done further on. Avoid filtering in this chain since it will be bypassed in certain cases.
7     Routing decision, i.e., is the packet destined for our local host or to be forwarded and where.
8 mangle FORWARD The packet is then sent on to the FORWARD chain of the mangle table. This can be used for very specific needs, where we want to mangle the packets after the initial routing decision, but before the last routing decision made just before the packet is sent out.
9 filter FORWARD The packet gets routed onto the FORWARD chain. Only forwarded packets go through here, and here we do all the filtering. Note that all traffic that's forwarded goes through here (not only in one direction), so you need to think about it when writing your rule-set.
10 mangle POSTROUTING This chain is used for specific types of packet mangling that we wish to take place after all kinds of routing decisions have been done, but still on this machine.
11 nat POSTROUTING This chain should first and foremost be used for SNAT. Avoid doing filtering here, since certain packets might pass by without ever hitting this chain. This is also where Masquerading is done.
12     Goes out on the outgoing interface (e.g., eth1).
13     Out on the wire again (e.g., LAN).

As you can see, there are quite a lot of steps to pass through. The packet can be stopped at any of the iptables chains, or anywhere else if it is malformed; however, we are mainly interested in the iptables aspect of this lot. Do note that there are no specific chains or tables for different interfaces or anything like that. FORWARD is always passed by all packets that are forwarded over this firewall/router.

Caution Do not use the INPUT chain to filter on in the previous scenario! INPUT is meant solely for packets to our local host that do not get routed to any other destination.

We have now seen how the different chains are traversed in three separate scenarios. If we were to figure out a good map of all this, it would look something like this:

To clarify this image, consider this. If we get a packet into the first routing decision that is not destined for the local machine itself, it will be routed through the FORWARD chain. If the packet is, on the other hand, destined for an IP address that the local machine is listening to, we would send the packet through the INPUT chain and to the local machine.

Also worth noting is the fact that packets may be destined for the local machine, but the destination address may be changed within the PREROUTING chain by doing NAT. Since this takes place before the first routing decision, the packet is looked upon after this change, so the NAT may alter the outcome of the routing decision. Do note that all packets will go through one or the other path in this image. If you DNAT a packet back to the same network that it came from, it will still travel through the rest of the chains until it is back out on the network.

Tip If you feel that you want more information, you could use the rc.test-iptables.txt script. This test script should give you the necessary rules to test how the tables and chains are traversed.
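The traversal order of Table 6-1 can also be observed with a handful of LOG rules. The following is only a minimal sketch in the same spirit as the test script mentioned above, not that script itself. The function names log_rule and trace_local_input are made up for this illustration, and the IPTABLES variable defaults to a dry run that merely prints the commands; it is assumed that the raw table is available on your kernel.

```shell
#!/bin/sh
# Dry run by default: the commands are only printed.
# Run as root with IPTABLES=iptables to really add the rules.
IPTABLES="${IPTABLES:-echo iptables}"

# Append one LOG rule for ICMP packets to the given table ($1) and chain ($2).
# log_rule is a hypothetical helper, not part of iptables.
log_rule() {
    $IPTABLES -t "$1" -A "$2" -p icmp -j LOG --log-prefix "$1 $2: "
}

# The chains passed by a packet destined for the local host,
# in the order given in Table 6-1.
trace_local_input() {
    log_rule raw PREROUTING
    log_rule mangle PREROUTING
    log_rule nat PREROUTING
    log_rule mangle INPUT
    log_rule filter INPUT
}

trace_local_input
```

With the rules really loaded, pinging the machine from another host should produce kernel log lines whose prefixes appear in the order of the table. Keep in mind that only the first packet of a connection passes the nat table, so its prefix will show up less often than the others.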

Mangle table

This table should as we've already noted mainly be used for mangling packets. In other words, you may freely use the mangle targets within this table, to change TOS (Type Of Service) fields and the like.

Caution You are strongly advised not to use this table for any filtering; nor will any DNAT, SNAT or Masquerading work in this table.

The following targets are only valid in the mangle table. They can not be used outside the mangle table.

TOS
TTL
MARK
SECMARK
CONNSECMARK

The TOS target is used to set and/or change the Type of Service field in the packet. This could be used for setting up policies on the network regarding how a packet should be routed and so on. Note that this has not been perfected and is not really implemented on the Internet. Most routers don't care about the value in this field, and some of them act faultily on what they get. In other words, don't set this for packets going to the Internet unless you want to make routing decisions on it with iproute2.

The TTL target is used to change the TTL (Time To Live) field of the packet. We could tell packets to only have a specific TTL and so on. One good reason for this could be that we don't want to give ourselves away to nosy Internet Service Providers. Some Internet Service Providers do not like users running multiple computers on one single connection, and some of them are known to look for a single host generating different TTL values, and take this as one of many signs of multiple computers connected to a single connection.

The MARK target is used to set special mark values to the packet. These marks could then be recognized by the iproute2 programs to do different routing on the packet depending on what mark they have, or if they don't have any. We could also do bandwidth limiting and Class Based Queuing based on these marks.

The SECMARK target can be used to set security context marks on single packets for usage in SELinux and other security systems that are able to handle these marks. This is then used for very fine grained security on what subsystems of the system can touch what packets et cetera. The SECMARK can also be set on a whole connection with the CONNSECMARK target.

CONNSECMARK is used to copy a security context to or from a single packet from or to the whole connection. This is then used by the SELinux and other security systems to do more fine-grained security on a connection level.
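As a rough illustration of the first three targets described above, the following sketch sets a TOS value, a fixed TTL and a packet mark. The ports, the interface name eth0, the mark value 2 and the function name mangle_examples are all assumptions chosen for the example, and the TTL target may need extra kernel support on some systems. By default the commands are only printed.

```shell
#!/bin/sh
# Dry run by default: the commands are only printed.
# Run as root with IPTABLES=iptables to really add the rules.
IPTABLES="${IPTABLES:-echo iptables}"

mangle_examples() {
    # TOS: request minimum delay (0x10) for outgoing SSH traffic.
    $IPTABLES -t mangle -A OUTPUT -p tcp --dport 22 -j TOS --set-tos 0x10
    # TTL: give every packet leaving through eth0 the same TTL of 64,
    # so that hosts behind this machine do not stand out by their TTL.
    $IPTABLES -t mangle -A POSTROUTING -o eth0 -j TTL --ttl-set 64
    # MARK: tag incoming web traffic with mark 2, which iproute2 or a
    # traffic shaper could later match on.
    $IPTABLES -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 2
}

mangle_examples
```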

Nat table

This table should only be used for NAT (Network Address Translation) on different packets. In other words, it should only be used to translate the packet's source field or destination field. Note that, as we have said before, only the first packet in a stream will hit this table. After this, the rest of the packets will automatically have the same action taken on them as the first packet. The actual targets that do these kinds of things include:

DNAT
SNAT
MASQUERADE

The DNAT target is mainly used in cases where you have a public IP and want to redirect accesses to the firewall to some other host (on a DMZ for example). In other words, we change the destination address of the packet and reroute it to the host.

SNAT is mainly used for changing the source address of packets. For the most part you'll use it to hide your local networks or DMZ, etc. A very good example would be a firewall whose outside IP address we know, but where we need to substitute our local network's IP numbers with that of the firewall. With this target the firewall will automatically SNAT and De-SNAT the packets, hence making it possible to make connections from the LAN to the Internet. If your network uses a private address range, for example, replies to untranslated packets would never get back from the Internet, because IANA has regulated these networks (among others) as private and only for use in isolated LANs.

The MASQUERADE target is used in exactly the same way as SNAT, but the MASQUERADE target takes a little bit more overhead to compute. The reason for this, is that each time that the MASQUERADE target gets hit by a packet, it automatically checks for the IP address to use, instead of doing as the SNAT target does - just using the single configured IP address. The MASQUERADE target makes it possible to work properly with Dynamic DHCP IP addresses that your ISP might provide for your PPP, PPPoE or SLIP connections to the Internet.
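The three targets above could be used roughly as follows. This is a sketch under stated assumptions: the public address 203.0.113.1 (taken from a documentation range), the DMZ web server 192.168.1.10, the LAN range 192.168.0.0/24 and the interface names are all hypothetical, as is the function name nat_examples. By default the commands are only printed.

```shell
#!/bin/sh
# Dry run by default: the commands are only printed.
# Run as root with IPTABLES=iptables to really add the rules.
IPTABLES="${IPTABLES:-echo iptables}"

# Hypothetical values, adjust to your own network.
INET_IP="203.0.113.1"       # static public address of the firewall
INET_IFACE="eth0"           # interface with the static address
PPP_IFACE="ppp0"            # interface with a dynamically assigned address
DMZ_HTTP_IP="192.168.1.10"  # web server on the DMZ
LAN_RANGE="192.168.0.0/24"  # private LAN behind the firewall

nat_examples() {
    # DNAT: redirect web accesses to the firewall on to the DMZ server.
    $IPTABLES -t nat -A PREROUTING -d "$INET_IP" -p tcp --dport 80 \
        -j DNAT --to-destination "$DMZ_HTTP_IP"
    # SNAT: hide the LAN behind the single, known public address.
    $IPTABLES -t nat -A POSTROUTING -o "$INET_IFACE" -s "$LAN_RANGE" \
        -j SNAT --to-source "$INET_IP"
    # MASQUERADE: same idea, for an interface whose address is not
    # known in advance; the address is looked up for each new connection.
    $IPTABLES -t nat -A POSTROUTING -o "$PPP_IFACE" -j MASQUERADE
}

nat_examples
```

On a real firewall you would normally pick either the SNAT rule or the MASQUERADE rule for a given outgoing interface, not both.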

Raw table

The raw table is mainly used for one thing, and that is to set a mark on packets that they should not be handled by the connection tracking system. This is done by using the NOTRACK target on the packet. If a connection is hit with the NOTRACK target, then conntrack will simply not track the connection. This was impossible to solve without adding a new table, since none of the other tables are called until after conntrack has actually been run on the packets, and they have either been added to the conntrack tables or matched against an already available connection. You can read more about this in the chapter The state machine.

This table only has the PREROUTING and OUTPUT chains. No other chains are required since these are the only places that you can deal with packets before they actually hit the connection tracking.

Note For this table to work, the iptable_raw module must be loaded. It will be loaded automatically if iptables is run with the -t raw keywords, and if the module is available.

Note The raw table is a relatively new addition to iptables and the kernel. It might not be available in early 2.6 and 2.4 kernels unless patched.
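A minimal sketch of how the NOTRACK target might be used is shown below. It assumes, purely for illustration, a busy DNS server on this host whose UDP traffic we do not want to track; the function name raw_notrack_rules is made up. Since untracked packets never become ESTABLISHED, they have to be accepted explicitly in the filter table. By default the commands are only printed.

```shell
#!/bin/sh
# Dry run by default: the commands are only printed.
# Run as root with IPTABLES=iptables to really add the rules.
IPTABLES="${IPTABLES:-echo iptables}"

raw_notrack_rules() {
    # Incoming DNS queries: skip connection tracking.
    $IPTABLES -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
    # Locally generated DNS replies: skip connection tracking as well.
    $IPTABLES -t raw -A OUTPUT -p udp --sport 53 -j NOTRACK
    # Untracked packets show up as UNTRACKED in the state match,
    # so they must be let through explicitly.
    $IPTABLES -A INPUT -p udp --dport 53 -m state --state UNTRACKED -j ACCEPT
    $IPTABLES -A OUTPUT -p udp --sport 53 -m state --state UNTRACKED -j ACCEPT
}

raw_notrack_rules
```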

Filter table

The filter table is mainly used for filtering packets. We can match packets and filter them in whatever way we want. This is the place where we actually take action against packets, look at what they contain, and DROP or ACCEPT them depending on their content. Of course we may also do prior filtering; however, this particular table is the place for which filtering was designed. Almost all targets are usable in this table. We will not go into more depth about the filter table here; for now it is enough to know that this table is the right place to do your main filtering.

User specified chains

If a packet enters a chain such as the INPUT chain in the filter table, we can specify a jump rule to a different chain within the same table. The new chain must be user specified; it may not be a built-in chain such as the INPUT or FORWARD chain. If we consider a pointer pointing at the rule in the chain to execute, the pointer will go down from rule to rule, from top to bottom, until the chain traversal is either ended by a target or the main chain (e.g., FORWARD, INPUT, et cetera) ends. Once this happens, the default policy of the built-in chain will be applied.

If one of the rules that matches points to another user specified chain in the jump specification, the pointer will jump over to this chain and then start traversing that chain from top to bottom. For example, see how the rule execution jumps from rule number 3 to chain 2 in the above image. The packet matched the matches contained in rule 3, and the jump/target specification was set to send the packet on for further examination in chain 2.

Note User specified chains can not have a default policy at the end of the chain. Only built-in chains can have this. This can be circumvented by appending a single rule at the end of the chain that has no matches, and hence it will behave as a default policy. If no rule is matched in a user specified chain, the default behaviour is to jump back to the originating chain. As seen in the image above, the rule execution jumps from chain 2 and back to chain 1 rule 4, below the rule that sent the rule execution into chain 2 to begin with.

Each and every rule in the user specified chain is traversed until either one of the rules matches -- then the target specifies if the traversing should end or continue -- or the end of the chain is reached. If the end of the user specified chain is reached, the packet is sent back to the invoking chain. The invoking chain can be either a user specified chain or a built-in chain.
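The jump into a user specified chain, and the trick of ending it with a rule without matches, could look like the following sketch. The chain name tcp_packets, the function name user_chain_example and the chosen ports are assumptions made for this example only. By default the commands are only printed.

```shell
#!/bin/sh
# Dry run by default: the commands are only printed.
# Run as root with IPTABLES=iptables to really add the rules.
IPTABLES="${IPTABLES:-echo iptables}"

user_chain_example() {
    # Create the user specified chain in the filter table.
    $IPTABLES -N tcp_packets
    # Jump rule: every TCP packet entering INPUT is examined in tcp_packets.
    $IPTABLES -A INPUT -p tcp -j tcp_packets
    # Rules inside the user specified chain, traversed top to bottom.
    $IPTABLES -A tcp_packets -p tcp --dport 22 -j ACCEPT
    $IPTABLES -A tcp_packets -p tcp --dport 80 -j ACCEPT
    # A last rule without matches behaves like a default policy.
    # Without it, unmatched packets would return to INPUT and continue
    # with the rule after the jump.
    $IPTABLES -A tcp_packets -j DROP
}

user_chain_example
```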

What's next?

In this chapter we have discussed several of the chains and tables and how they are traversed, including the standard built-in chains and user specified chains. This is a very important area to understand. It may be simple, but unless it is fully understood, fatal mistakes can be made equally easily.

The next chapter will deal in depth with the state machine of netfilter, and how states are traversed and set on packets in a connection tracking machine. The next chapter is in other words just as important as this chapter has been.

Chapter 7. The state machine

This chapter will deal with the state machine and explain it in detail. After reading through it, you should have a complete understanding of how the State machine works. We will also go through a large set of examples on how states are dealt with within the state machine itself. These should clarify everything in practice.


The state machine is a special part within iptables that should really not be called the state machine at all, since it is really a connection tracking machine. However, most people recognize it under the first name. Throughout this chapter I will use these names more or less as if they were synonymous. This should not be overly confusing. Connection tracking is done to let the Netfilter framework know the state of a specific connection. Firewalls that implement this are generally called stateful firewalls. A stateful firewall is generally much more secure than non-stateful firewalls since it allows us to write much tighter rule-sets.

Within iptables, packets can be related to tracked connections in four different so called states. These are known as NEW, ESTABLISHED, RELATED and INVALID. We will discuss each of these in more depth later. With the --state match we can easily control who or what is allowed to initiate new sessions.

All of the connection tracking is done by a special framework within the kernel called conntrack. conntrack may be loaded either as a module, or as an internal part of the kernel itself. Most of the time, we need and want more specific connection tracking than the default conntrack engine can maintain. Because of this, there are also more specific parts of conntrack that handle the TCP, UDP or ICMP protocols, among others. These modules grab specific, unique information from the packets, so that they may keep track of each stream of data. The information that conntrack gathers is then used to tell conntrack which state the stream is currently in. For example, UDP streams are, generally, uniquely identified by their destination IP address, source IP address, destination port and source port.

In previous kernels, we had the possibility to turn defragmentation on and off. However, since iptables and Netfilter were introduced, and connection tracking in particular, this option has been removed. The reason for this is that connection tracking can not work properly without defragmenting packets, and hence defragmenting has been incorporated into conntrack and is carried out automatically. It can not be turned off, except by turning off connection tracking. Defragmentation is always carried out if connection tracking is turned on.

All connection tracking is handled in the PREROUTING chain, except for locally generated packets, which are handled in the OUTPUT chain. What this means is that iptables will do all recalculation of states and so on within the PREROUTING chain. If we send the initial packet in a stream, the state gets set to NEW within the OUTPUT chain, and when we receive a return packet, the state gets changed to ESTABLISHED in the PREROUTING chain, and so on. If the first packet does not originate from us, the NEW state is of course set within the PREROUTING chain. So, all state changes and calculations are done at the PREROUTING and OUTPUT stages, after the raw table and before the mangle table, as listed in the traversal tables of the previous chapter.

The conntrack entries

Let's take a brief look at a conntrack entry and how to read them in /proc/net/ip_conntrack. This gives a list of all the current entries in your conntrack database. If you have the ip_conntrack module loaded, a cat of /proc/net/ip_conntrack might look like:

tcp 6 117 SYN_SENT src= dst= sport=32775 \
dport=22 [UNREPLIED] src= dst= sport=22 \
dport=32775 [ASSURED] use=2

This example contains all the information that the conntrack module maintains to know which state a specific connection is in. First of all, we have a protocol, which in this case is tcp. Next, the same value in normal decimal coding. After this, we see how long this conntrack entry has to live. This value is set to 117 seconds right now and is decremented regularly until we see more traffic. This value is then reset to the default value for the specific state that it is in at that relevant point of time. Next comes the actual state that this entry is in at the present point of time. In the above mentioned case we are looking at a packet that is in the SYN_SENT state. The internal value of a connection is slightly different from the ones used externally with iptables. The value SYN_SENT tells us that we are looking at a connection that has only seen a TCP SYN packet in one direction. Next, we see the source IP address, destination IP address, source port and destination port. At this point we see a specific keyword that tells us that we have seen no return traffic for this connection. Lastly, we see what we expect of return packets. The information details the source IP address and destination IP address (which are both inverted, since the packet is to be directed back to us). The same thing goes for the source port and destination port of the connection. These are the values that should be of any interest to us.
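The field layout described above can be picked apart with a few lines of awk. The following is only a sketch: parse_conntrack_entry is a made-up helper name, the sample entry is hypothetical and uses addresses from documentation ranges, and the field positions assumed here hold for TCP entries only, since other protocols do not print an internal state in the fourth field.

```shell
#!/bin/sh
# Print protocol, remaining lifetime, internal state and flags of a single
# TCP conntrack entry given as one string. Hypothetical helper for illustration.
parse_conntrack_entry() {
    printf '%s\n' "$1" | awk '{
        flags = ""
        # Flags such as [UNREPLIED] or [ASSURED] are the bracketed fields.
        for (i = 5; i <= NF; i++)
            if ($i ~ /^\[.*\]$/)
                flags = flags (flags == "" ? "" : ",") $i
        # Field 1: protocol name, field 3: seconds left, field 4: state.
        printf "proto=%s timeout=%s state=%s flags=%s\n", $1, $3, $4, flags
    }'
}

# Hypothetical sample entry, written on one line.
entry='tcp 6 117 SYN_SENT src=192.0.2.10 dst=198.51.100.20 sport=32775 dport=22 [UNREPLIED] src=198.51.100.20 dst=192.0.2.10 sport=22 dport=32775 use=2'

parse_conntrack_entry "$entry"
# prints: proto=tcp timeout=117 state=SYN_SENT flags=[UNREPLIED]
```

On a live system the same function could be fed with lines read from /proc/net/ip_conntrack, provided that file exists on your kernel.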

The connection tracking entries may take on a series of different values, all specified in the conntrack headers available in linux/include/netfilter-ipv4/ip_conntrack*.h files. These values are dependent on which sub-protocol of IP we use. TCP, UDP or ICMP protocols take specific default values as specified in linux/include/netfilter-ipv4/ip_conntrack.h. We will look closer at this when we look at each of the protocols; however, we will not use them extensively through this chapter, since they are not used outside of the conntrack internals. Also, depending on how this state changes, the default value of the time until the connection is destroyed will also change.

Note Recently there was a new patch made available in iptables patch-o-matic, called tcp-window-tracking. This patch adds, among other things, all of the above timeouts to special sysctl variables, which means that they can be changed on the fly, while the system is still running. Hence, this makes it unnecessary to recompile the kernel every time you want to change the timeouts.

These can be altered through the sysctl entries available in the /proc/sys/net/ipv4/netfilter directory. You should in particular look at the /proc/sys/net/ipv4/netfilter/ip_ct_* variables.

When a connection has seen traffic in both directions, the conntrack entry will erase the [UNREPLIED] flag, and then reset it. The entry that tells us that the connection has not seen any traffic in both directions will be replaced by the [ASSURED] flag, found close to the end of the entry. The [ASSURED] flag tells us that this connection is assured and that it will not be erased if we reach the maximum possible tracked connections. Thus, connections marked as [ASSURED] will not be erased, contrary to the non-assured connections (those not marked as [ASSURED]). How many connections the connection tracking table can hold depends upon a variable that can be set through the ip-sysctl functions in recent kernels. The default value held by this entry varies heavily depending on how much memory you have. On 128 MB of RAM you will get 8192 possible entries, and at 256 MB of RAM, you will get 16376 entries. You can read and set your settings through the /proc/sys/net/ipv4/ip_conntrack_max setting.

A different and more efficient way of doing this is to set the hashsize option of the ip_conntrack module when it is loaded. Under normal circumstances ip_conntrack_max equals 8 * hashsize. In other words, setting the hashsize to 4096 will result in ip_conntrack_max being set to 8 * 4096 = 32768 conntrack entries. An example of this would be:

work3:/home/blueflux# modprobe ip_conntrack hashsize=4096
work3:/home/blueflux# cat /proc/sys/net/ipv4/ip_conntrack_max
32768

User-land states

As you have seen, packets may take on several different states within the kernel itself, depending on what protocol we are talking about. However, outside the kernel, we only have the 4 states as described previously. These states can mainly be used in conjunction with the state match which will then be able to match packets based on their current connection tracking state. The valid states are NEW, ESTABLISHED, RELATED and INVALID. The following table will briefly explain each possible state.

Table 7-1. User-land states

State Explanation
NEW The NEW state tells us that the packet is the first packet that we see. This means that the first packet that the conntrack module sees, within a specific connection, will be matched. For example, if we see a SYN packet and it is the first packet in a connection that we see, it will match. However, the packet may as well not be a SYN packet and still be considered NEW. This may lead to certain problems in some instances, but it may also be extremely helpful when we need to pick up lost connections from other firewalls, or when a connection has already timed out, but in reality is not closed.
ESTABLISHED The ESTABLISHED state has seen traffic in both directions and will then continuously match those packets. ESTABLISHED connections are fairly easy to understand. The only requirement to get into an ESTABLISHED state is that one host sends a packet, and that it later on gets a reply from the other host. The NEW state will upon receipt of the reply packet to or through the firewall change to the ESTABLISHED state. ICMP reply messages can also be considered as ESTABLISHED, if we created a packet that in turn generated the reply ICMP message.
RELATED The RELATED state is one of the more tricky states. A connection is considered RELATED when it is related to another already ESTABLISHED connection. What this means, is that for a connection to be considered as RELATED, we must first have a connection that is considered ESTABLISHED. The ESTABLISHED connection will then spawn a connection outside of the main connection. The newly spawned connection will then be considered RELATED, if the conntrack module is able to understand that it is RELATED. Some good examples of connections that can be considered as RELATED are the FTP-data connections that are considered RELATED to the FTP control port, and the DCC connections issued through IRC. This could be used to allow ICMP error messages, FTP transfers and DCC's to work properly through the firewall. Do note that most TCP protocols and some UDP protocols that rely on this mechanism are quite complex and send connection information within the payload of the TCP or UDP data segments, and hence require special helper modules to be correctly understood.
INVALID The INVALID state means that the packet can't be identified or that it does not have any state. This may be due to several reasons, such as the system running out of memory or ICMP error messages that do not correspond to any known connections. Generally, it is a good idea to DROP everything in this state.
UNTRACKED This is the UNTRACKED state. In brief, if a packet is marked within the raw table with the NOTRACK target, then that packet will show up as UNTRACKED in the state machine. This also means that all RELATED connections will not be seen, so some caution must be taken when dealing with the UNTRACKED connections since the state machine will not be able to see related ICMP messages et cetera.

These states can be used together with the --state match to match packets based on their connection tracking state. This is what makes the state machine so incredibly strong and efficient for our firewall. Previously, we often had to open up all ports above 1024 to let all traffic back into our local networks again. With the state machine in place this is not necessary any longer, since we can now just open up the firewall for return traffic and not for all kinds of other traffic.
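A minimal sketch of such a rule-set for a forwarding firewall is given below. The interface names eth1 (LAN side) and eth0 (Internet side) and the function name stateful_forward_rules are assumptions for the example. By default the commands are only printed.

```shell
#!/bin/sh
# Dry run by default: the commands are only printed.
# Run as root with IPTABLES=iptables to really add the rules.
IPTABLES="${IPTABLES:-echo iptables}"

# Hypothetical interface names, adjust to your own set up.
LAN_IFACE="eth1"
INET_IFACE="eth0"

stateful_forward_rules() {
    # Anything not explicitly accepted below is dropped.
    $IPTABLES -P FORWARD DROP
    # Packets that can not be identified are dropped right away.
    $IPTABLES -A FORWARD -m state --state INVALID -j DROP
    # The LAN may open new connections and continue existing ones.
    $IPTABLES -A FORWARD -i "$LAN_IFACE" \
        -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
    # From the Internet only return traffic is let back in, so there is
    # no need to open up all ports above 1024.
    $IPTABLES -A FORWARD -i "$INET_IFACE" \
        -m state --state ESTABLISHED,RELATED -j ACCEPT
}

stateful_forward_rules
```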

TCP connections

In this section and the upcoming ones, we will take a closer look at the states and how they are handled for each of the three basic protocols TCP, UDP and ICMP. Also, we will take a closer look at how connections are handled by default if they can not be classified as any of these three protocols. We have chosen to start out with the TCP protocol since it is a stateful protocol in itself, and has a lot of interesting details with regard to the state machine in iptables.

A TCP connection is always initiated with the 3-way handshake, which establishes and negotiates the actual connection over which data will be sent. The whole session is begun with a SYN packet, then a SYN/ACK packet and finally an ACK packet to acknowledge the whole session establishment. At this point the connection is established and able to start sending data. The big problem is, how does connection tracking hook up into this? Quite simply really.

As far as the user is concerned, connection tracking works basically the same for all connection types. Have a look at the picture below to see exactly what state the stream enters during the different stages of the connection. As you can see, the connection tracking code does not really follow the flow of the TCP connection from the user's viewpoint. Once it has seen one packet (the SYN), it considers the connection as NEW. Once it sees the return packet (SYN/ACK), it considers the connection as ESTABLISHED. If you think about this for a second, you will understand why. With this particular implementation, you can allow NEW and ESTABLISHED packets to leave your local network, only allow ESTABLISHED connections back, and that will work perfectly. Conversely, if the connection tracking machine were to consider the whole connection establishment as NEW, we would never really be able to stop outside connections to our local network, since we would have to allow NEW packets back in again. To make things more complicated, there are a number of other internal states that are used for TCP connections inside the kernel, but which are not available to us in user-land. Roughly, they follow the state standards specified within RFC 793 - Transmission Control Protocol on pages 21-23. We will consider these in more detail further along in this section.

As you can see, it is really quite simple, seen from the user's point of view. However, looking at the whole construction from the kernel's point of view, it's a little more difficult. Let's look at an example. Consider exactly how the connection states change in the /proc/net/ip_conntrack table. The first state is reported upon receipt of the first SYN packet in a connection.

tcp 6 117 SYN_SENT src= dst= sport=1031 \
     dport=23 [UNREPLIED] src= dst= sport=23 \
     dport=1031 use=1

As you can see from the above entry, we have a precise state in which a SYN packet has been sent, (the SYN_SENT flag is set), and to which as yet no reply has been sent (witness the [UNREPLIED] flag). The next internal state will be reached when we see another packet in the other direction.

tcp 6 57 SYN_RECV src= dst= sport=1031 \
     dport=23 src= dst= sport=23 dport=1031 \


Now we have received a corresponding SYN/ACK in return. As soon as this packet has been received, the state changes once again, this time to SYN_RECV. SYN_RECV tells us that the original SYN was delivered correctly and that the SYN/ACK return packet also got through the firewall properly. Moreover, this connection tracking entry has now seen traffic in both directions and is hence considered as having been replied to. This is not explicit, but rather assumed, as was the [UNREPLIED] flag above. The final step will be reached once we have seen the final ACK in the 3-way handshake.

tcp 6 431999 ESTABLISHED src= dst= \
     sport=1031 dport=23 src= dst= \
     sport=23 dport=1031 [ASSURED] use=1

In the last example, we have gotten the final ACK in the 3-way handshake and the connection has entered the ESTABLISHED state, as far as the internal mechanisms of iptables are aware. Normally, the stream will be ASSURED by now.
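To tie the three entries together, here is a rough sketch that reads conntrack lines and prints the internal TCP state next to the state a rule would see from user-land. It rests on a simplifying assumption made only for illustration: an entry marked [UNREPLIED] is treated as NEW and anything else as ESTABLISHED (RELATED and INVALID are ignored). The addresses in the sample lines are invented.

```shell
# Sample lines in /proc/net/ip_conntrack format; the addresses are invented.
sample_entries() {
    cat <<'EOF'
tcp 6 117 SYN_SENT src=192.168.1.5 dst=192.168.1.35 sport=1031 dport=23 [UNREPLIED] src=192.168.1.35 dst=192.168.1.5 sport=23 dport=1031 use=1
tcp 6 57 SYN_RECV src=192.168.1.5 dst=192.168.1.35 sport=1031 dport=23 src=192.168.1.35 dst=192.168.1.5 sport=23 dport=1031 use=1
tcp 6 431999 ESTABLISHED src=192.168.1.5 dst=192.168.1.35 sport=1031 dport=23 src=192.168.1.35 dst=192.168.1.5 sport=23 dport=1031 [ASSURED] use=1
EOF
}

# Print: internal state, seconds left, assumed user-land state.
userland_state() {
    awk '$1 == "tcp" {
        state = ($0 ~ /\[UNREPLIED\]/) ? "NEW" : "ESTABLISHED"
        print $4, $3, state
    }'
}

sample_entries | userland_state
```

On a machine that still exposes the table described here, the same function could be fed with userland_state < /proc/net/ip_conntrack.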

A connection may also enter the ESTABLISHED state, but not be [ASSURED]. This happens if we have connection pickup turned on (this requires the tcp-window-tracking patch, and ip_conntrack_tcp_loose to be set to 1 or higher). The default, without the tcp-window-tracking patch, is to have this behaviour, and it is not changeable.

When a TCP connection is closed down, it is done in the following way and takes the following states.

As you can see, the connection is never really closed until the last ACK is sent. Do note that this picture only describes how it is closed down under normal circumstances. A connection may also, for example, be closed by sending a RST (reset), if the connection were to be refused. In this case, the connection would be closed down immediately.

When the TCP connection has been closed down, the connection enters the TIME_WAIT state, which is by default set to 2 minutes. This is used so that all packets that have gotten out of order can still get through our rule-set, even after the connection has already closed. This is used as a kind of buffer time so that packets that have gotten stuck in one or another congested router can still get to the firewall, or to the other end of the connection.

If the connection is reset by a RST packet, the state is changed to CLOSE. This means that the connection by default has 10 seconds before the whole connection is definitely closed down. RST packets are not acknowledged in any sense, and will break the connection directly. There are also other states than the ones we have told you about so far. Here is the complete list of possible states that a TCP stream may take, and their timeout values.

Table 7-2. Internal states

State         Timeout value
NONE          30 minutes
ESTABLISHED   5 days
SYN_SENT      2 minutes
SYN_RECV      60 seconds
FIN_WAIT      2 minutes
TIME_WAIT     2 minutes
CLOSE         10 seconds
CLOSE_WAIT    12 hours
LAST_ACK      30 seconds
LISTEN        2 minutes

These values are most definitely not absolute. They may change with kernel revisions, and they may also be changed via the proc file-system in the /proc/sys/net/ipv4/netfilter/ip_ct_tcp_* variables. The default values should, however, be fairly well established in practice. These values are set in seconds. Early versions of the patch used jiffies (which was a bug).

Note: The User-land side of the state machine does not look at TCP flags (i.e., RST, ACK, and SYN are flags) set in the TCP packets. This is generally bad, since you may want to allow packets in the NEW state to get through the firewall, but when you specify the NEW state, you will in most cases mean SYN packets.

This is not what happens with the current state implementation; instead, even a packet with no bit set or an ACK flag, will count as NEW. This can be used for redundant firewalling and so on, but it is generally extremely bad on your home network, where you only have a single firewall. To get around this behavior, you could use the command explained in the State NEW packets but no SYN bit set section of the Common problems and questions appendix. Another way is to install the tcp-window-tracking extension from patch-o-matic, and set the /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_loose to zero, which will make the firewall drop all NEW packets with anything but the SYN flag set.
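As a rough illustration of the first workaround, the rules below drop TCP packets that the state machine calls NEW but that are not plain SYN packets. Treat this as a sketch rather than the exact text of the appendix: the choice of the INPUT and FORWARD chains is an assumption, and IPTABLES defaults to a dry run that only prints the commands.

```shell
# Dry run by default: each rule is printed instead of being applied.
IPTABLES="${IPTABLES:-echo iptables}"

new_not_syn_rules() {
    # "! --syn" matches TCP packets that are not a plain SYN; combined with
    # state NEW this catches stray ACKs and flagless packets starting an entry.
    $IPTABLES -A INPUT -p tcp ! --syn -m state --state NEW -j DROP
    $IPTABLES -A FORWARD -p tcp ! --syn -m state --state NEW -j DROP
}

new_not_syn_rules
```

These rules need to come before any rule that accepts NEW packets, otherwise the offending packets are let through before they can be dropped.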

UDP connections

UDP connections are in themselves not stateful connections, but rather stateless. There are several reasons why, mainly because they don't contain any connection establishment or connection closing; most of all they lack sequencing. Receiving two UDP datagrams in a specific order does not say anything about the order in which they were sent. It is, however, still possible to set states on the connections within the kernel. Let's have a look at how a connection can be tracked and how it might look in conntrack.

As you can see, the connection is brought up almost exactly in the same way as a TCP connection. That is, from the user-land point of view. Internally, conntrack information looks quite a bit different, but intrinsically the details are the same. First of all, let's have a look at the entry after the initial UDP packet has been sent.

udp 17 20 src= dst= sport=137 dport=1025 \
     [UNREPLIED] src= dst= sport=1025 \
     dport=137 use=1

As you can see from the first and second values, this is a UDP packet. The first is the protocol name, and the second is the protocol number. This is just the same as for TCP connections. The third value marks how many seconds this state entry has to live. After this, we get the values of the packet that we have seen and the future expectations of packets over this connection reaching us from the initiating packet sender. These are the source, destination, source port and destination port. At this point, the [UNREPLIED] flag tells us that there's so far been no response to the packet. Finally, we get a brief list of the expectations for returning packets. Do note that the latter entries are in reverse order to the first values. The timeout at this point is set to 30 seconds, by default.
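The field layout just described can be made concrete with a small awk sketch that splits one entry into its parts. The addresses in the sample entry are invented for the example, and the function name describe_entry is our own.

```shell
# One conntrack entry, written on a single line; the addresses are invented.
entry='udp 17 20 src=192.168.1.2 dst=192.168.1.5 sport=137 dport=1025 [UNREPLIED] src=192.168.1.5 dst=192.168.1.2 sport=1025 dport=137 use=1'

describe_entry() {
    awk '{
        print "protocol: " $1 " (number " $2 ")"
        print "expires in: " $3 " seconds"
        seen = 0; orig = ""; reply = ""; flags = ""
        for (i = 4; i <= NF; i++) {
            # Bracketed words such as [UNREPLIED] are flags.
            if ($i ~ /^\[/) { flags = flags " " $i; continue }
            # The second src= marks the start of the expected reply.
            if ($i ~ /^src=/) seen++
            if ($i ~ /^(src|dst|sport|dport)=/) {
                if (seen == 1) orig = orig " " $i
                else reply = reply " " $i
            }
        }
        print "original:" orig
        print "expected reply:" reply
        print "flags:" flags
    }'
}

printf '%s\n' "$entry" | describe_entry
```

The two printed tuples mirror each other, which is exactly the reverse ordering mentioned above.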

udp 17 170 src= dst= sport=137 \
     dport=1025 src= dst= sport=1025 \
     dport=137 [ASSURED] use=1

At this point the server has seen a reply to the first packet sent out and the connection is now considered as ESTABLISHED. This is not shown in the connection tracking, as you can see. The main difference is that the [UNREPLIED] flag has now gone. Moreover, the default timeout has changed to 180 seconds - but in this example that's by now been decremented to 170 seconds - in 10 seconds' time, it will be 160 seconds. There's one thing that's missing, though, and can change a bit, and that is the [ASSURED] flag described above. For the [ASSURED] flag to be set on a tracked connection, there must have been a legitimate reply packet to the NEW packet.

udp 17 175 src= dst= sport=1025 \
     dport=53 src= dst= sport=53 \
     dport=1025 [ASSURED] use=1

At this point, the connection has become assured. The connection looks exactly the same as the previous example. If this connection is not used for 180 seconds, it times out. 180 seconds is a comparatively low value, but should be sufficient for most use. This value is reset to its full value for each packet that matches the same entry and passes through the firewall, just the same as for all of the internal states.

ICMP connections

ICMP packets are far from a stateful stream, since they are only used for controlling and should never establish any connections. There are four ICMP types that will generate return packets however, and these have 2 different states. These ICMP messages can take the NEW and ESTABLISHED states. The ICMP types we are talking about are Echo request and reply, Timestamp request and reply, Information request and reply and finally Address mask request and reply. Out of these, the timestamp request and information request are obsolete and could most probably just be dropped. However, the Echo messages are used in several setups such as pinging hosts. Address mask requests are not used often, but could be useful at times and worth allowing. To get an idea of how this could look, have a look at the following image.

As you can see in the above picture, the host sends an echo request to the target, which is considered as NEW by the firewall. The target then responds with an echo reply which the firewall considers as state ESTABLISHED. When the first echo request has been seen, the following state entry goes into the ip_conntrack.

icmp 1 25 src= dst= type=8 code=0 \
     id=33029 [UNREPLIED] src= dst= \
     type=0 code=0 id=33029 use=1

This entry looks a little bit different from the standard states for TCP and UDP, as you can see. The protocol is there, and the timeout, as well as source and destination addresses. The problem comes after that, however. We now have 3 new fields called type, code and id. They are not special in any way: the type field contains the ICMP type and the code field contains the ICMP code. These are all available in the ICMP types appendix. The final id field contains the ICMP ID. Each ICMP packet gets an ID set to it when it is sent, and when the receiver gets the ICMP message, it sets the same ID within the new ICMP message so that the sender will recognize the reply and will be able to connect it with the correct ICMP request.

The next field, we once again recognize as the [UNREPLIED] flag, which we have seen before. Just as before, this flag tells us that we are currently looking at a connection tracking entry that has seen only traffic in one direction. Finally, we see the reply expectation for the reply ICMP packet, which is the inversion of the original source and destination IP addresses. As for the type and code, these are changed to the correct values for the return packet, so an echo request is changed to echo reply and so on. The ICMP ID is preserved from the request packet.

The reply packet is considered as being ESTABLISHED, as we have already explained. However, we can know for sure that after the ICMP reply, there will be absolutely no more legal traffic in the same connection. For this reason, the connection tracking entry is destroyed once the reply has traveled all the way through the Netfilter structure.

In each of the above cases, the request is considered as NEW, while the reply is considered as ESTABLISHED. Let's consider this more closely. When the firewall sees a request packet, it considers it as NEW. When the host sends a reply packet to the request it is considered ESTABLISHED.

Note that this means that the reply packet must match the criterion given by the connection tracking entry to be considered as established, just as with all other traffic types.

ICMP requests have a default timeout of 30 seconds, which you can change in the /proc/sys/net/ipv4/netfilter/ip_ct_icmp_timeout entry. This should in general be a good timeout value, since it will be able to catch most packets in transit.

Another hugely important part of ICMP is the fact that it is used to tell the hosts what happened to specific UDP and TCP connections or connection attempts. For this simple reason, ICMP replies will very often be recognized as RELATED to original connections or connection attempts. A simple example would be the ICMP Host unreachable or ICMP Network unreachable. These should always be spawned back to our host if it attempts an unsuccessful connection to some other host, but the network or host in question could be down, and hence the last router trying to reach the site in question will reply with an ICMP message telling us about it. In this case, the ICMP reply is considered as a RELATED packet. The following picture should explain how it would look.

In the above example, we send out a SYN packet to a specific address. This is considered as a NEW connection by the firewall. However, the network the packet is trying to reach is unreachable, so a router returns a network unreachable ICMP error to us. The connection tracking code can recognize this packet as RELATED, thanks to the already added tracking entry, so the ICMP reply is correctly sent to the client which will then hopefully abort. Meanwhile, the firewall has destroyed the connection tracking entry since it knows this was an error message.

The same behavior as above is experienced with UDP connections if they run into any problem like the above. All ICMP messages sent in reply to UDP connections are considered as RELATED. Consider the following image.

This time a UDP packet is sent to the host. This UDP connection is considered as NEW. However, the network is administratively prohibited by some firewall or router on the way over. Hence, our firewall receives an ICMP Network Prohibited in return. The firewall knows that this ICMP error message is related to the already opened UDP connection and sends it as a RELATED packet to the client. At this point, the firewall destroys the connection tracking entry, and the client receives the ICMP message and should hopefully abort.
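In rule form, the ICMP handling described in this section might look like the sketch below. The choice of the INPUT and OUTPUT chains (a single host rather than a router) is an assumption made for the example, and IPTABLES defaults to a dry run that only prints the commands.

```shell
# Dry run by default: each rule is printed instead of being applied.
IPTABLES="${IPTABLES:-echo iptables}"

icmp_state_rules() {
    # Outgoing pings create a NEW entry; the echo reply is ESTABLISHED.
    $IPTABLES -A OUTPUT -p icmp --icmp-type echo-request \
        -m state --state NEW -j ACCEPT
    # ICMP errors tied to tracked TCP or UDP traffic arrive as RELATED.
    $IPTABLES -A INPUT -p icmp -m state --state ESTABLISHED,RELATED -j ACCEPT
}

icmp_state_rules
```

With the second rule in place there is no need to list Host unreachable, Network unreachable and similar errors one by one, since the state machine already ties them to the connection that caused them.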

Default connections

In certain cases, the conntrack machine does not know how to handle a specific protocol. This happens if it does not know about that protocol in particular, or doesn't know how it works. In these cases, it goes back to a default behavior. The default behavior is used on, for example, NETBLT, MUX and EGP. This behavior looks pretty much the same as the UDP connection tracking. The first packet is considered NEW, and reply traffic and so forth is considered ESTABLISHED.

When the default behavior is used, all of these packets will attain the same default timeout value. This can be set via the /proc/sys/net/ipv4/netfilter/ip_ct_generic_timeout variable. The default value here is 600 seconds, or 10 minutes. Depending on what traffic you are trying to send over a link that uses the default connection tracking behavior, this might need changing, especially if you are bouncing traffic through satellites and such, which can take a long time.
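A small sketch for inspecting this value follows. The path is the one given in the text and will only exist on kernels that expose it, so the helper (show_timeout, a name made up for this example) checks for the file first. The value of 1200 seconds in the final comment is just an illustration.

```shell
# Path as given in the text; it may not exist on your kernel.
GENERIC_TIMEOUT="${GENERIC_TIMEOUT:-/proc/sys/net/ipv4/netfilter/ip_ct_generic_timeout}"

show_timeout() {
    # $1 = file holding a timeout value in seconds
    if [ -r "$1" ]; then
        secs=$(cat "$1")
        echo "$secs seconds ($((secs / 60)) minutes)"
    else
        echo "not available: $1"
    fi
}

show_timeout "$GENERIC_TIMEOUT"

# To raise the timeout to 20 minutes (as root):
# echo 1200 > "$GENERIC_TIMEOUT"
```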

Untracked connections and the raw table

UNTRACKED is a rather special keyword when it comes to connection tracking in Linux. Basically, it is used to match packets that have been marked in the raw table not to be tracked.

The raw table was created specifically for this reason. In this table, you set a NOTRACK mark on packets that you do not wish to track in netfilter.

Important: Notice how I say packets, not connections, since the mark is actually set for each and every packet that enters. Otherwise, we would still have to do some kind of tracking of the connection to know that it should not be tracked.

As we have already stated in this chapter, conntrack and the state machine are rather resource hungry. For this reason, it might sometimes be a good idea to turn off connection tracking and the state machine.

One example would be if you have a heavily trafficked router that you want to firewall the incoming and outgoing traffic on, but not the routed traffic. You could then set the NOTRACK mark on all packets not destined for the firewall itself by ACCEPT'ing all packets with destination your host in the raw table, and then set the NOTRACK for all other traffic. This would then allow you to have stateful matching on incoming traffic for the router itself, but at the same time save processing power from not handling all the crossing traffic.

Another example when NOTRACK can be used is if you have a highly trafficked webserver and want to do stateful tracking, but don't want to waste processing power on tracking the web traffic. You could then set up a rule that turns off tracking for port 80 on all the locally owned IP addresses, or the ones that are actually serving web traffic. You could then enjoy stateful tracking on all other services, except for web traffic, which might save some processing power on an already overloaded system.
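The webserver case could be sketched roughly as follows. It assumes the web server runs on the firewall host itself and listens on port 80, and IPTABLES defaults to a dry run that only prints the commands. Newer iptables releases spell the same target as -j CT --notrack.

```shell
# Dry run by default: each rule is printed instead of being applied.
IPTABLES="${IPTABLES:-echo iptables}"

notrack_web_rules() {
    # Incoming web requests: skip connection tracking.
    $IPTABLES -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
    # Locally generated replies from the web server: skip it as well.
    $IPTABLES -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK
    # Untracked packets never become NEW or ESTABLISHED, so they have to
    # be accepted explicitly through the UNTRACKED state.
    $IPTABLES -A INPUT -p tcp --dport 80 -m state --state UNTRACKED -j ACCEPT
    $IPTABLES -A OUTPUT -p tcp --sport 80 -m state --state UNTRACKED -j ACCEPT
}

notrack_web_rules
```

Note that the mark has to be set in both directions, since it applies to packets and not to connections.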

There are however some problems with NOTRACK that you must take into consideration. If a whole connection is set with NOTRACK, then you will not be able to track related connections either; conntrack and NAT helpers will simply not work for untracked connections, nor will related ICMP errors be recognized. You will have to open up for these manually, in other words. When it comes to complex protocols such as FTP and SCTP et cetera, this can be very hard to manage. As long as you are aware of this, you should be able to handle it.

Complex protocols and connection tracking

Certain protocols are more complex than others. What this means when it comes to connection tracking, is that such protocols may be harder to track correctly. Good examples of these are the ICQ, IRC and FTP protocols. Each and every one of these protocols carries information within the actual data payload of the packets, and hence requires special connection tracking helpers to enable it to function correctly.

This is a list of the complex protocols that have support inside the Linux kernel, and which kernel version each was introduced in.

Table 7-3. Complex protocols support

Protocol name   Kernel versions
FTP             2.3
IRC             2.3
TFTP            2.5
Amanda          2.5




Let's take the FTP protocol as the first example. The FTP protocol first opens up a single connection that is called the FTP control session. When we issue commands through this session, other ports are opened to carry the rest of the data related to that specific command. These connections can be done in two ways, either actively or passively. When a connection is done actively, the FTP client sends the server a port and IP address to connect to. After this, the FTP client opens up the port and the server connects to that specified port from its own port 20 (the FTP-data port) and sends the data over it.

The problem here is that the firewall will not know about these extra connections, since they were negotiated within the actual payload of the protocol data. Because of this, the firewall will be unable to know that it should let the server connect to the client over these specific ports.

The solution to this problem is to add a special helper to the connection tracking module which will scan through the data in the control connection for specific syntaxes and information. When it runs into the correct information, it will add that specific information as RELATED and the firewall will be able to track the connection, thanks to that RELATED entry. Consider the following picture to understand the states when the FTP server has made the connection back to the client.

Passive FTP works the opposite way. The FTP client tells the server that it wants some specific data, upon which the server replies with an IP address to connect to and at what port. The client will, upon receipt of this data, connect to that specific port from a random unprivileged port (>1024), and get the data in question. If you have an FTP server behind your firewall, you will in other words require this module in addition to your standard iptables modules to let clients on the Internet connect to the FTP server properly. The same goes if you are extremely restrictive to your users, and only want to let them reach HTTP and FTP servers on the Internet and block all other ports. Consider the following image and its bearing on Passive FTP.

Some conntrack helpers are already available within the kernel itself. More specifically, the FTP and IRC protocols have conntrack helpers as of writing this. If you can not find the conntrack helpers that you need within the kernel itself, you should have a look at the patch-o-matic tree within user-land iptables. The patch-o-matic tree may contain more conntrack helpers, such as for the ntalk or H.323 protocols. If they are not available in the patch-o-matic tree, you have a number of options. Either you can look at the CVS source of iptables, if it has recently gone into that tree, or you can contact the Netfilter-devel mailing list and ask if it is available. If it is not, and there are no plans for adding it, you are left to your own devices and would most probably want to read the Rusty Russell's Unreliable Netfilter Hacking HOW-TO which is linked from the Other resources and links appendix.

Conntrack helpers may either be statically compiled into the kernel, or built as modules. If they are compiled as modules, you can load them with the following commands:

modprobe ip_conntrack_ftp
modprobe ip_conntrack_irc
modprobe ip_conntrack_tftp
modprobe ip_conntrack_amanda

Do note that connection tracking has nothing to do with NAT, and hence you may require more modules if you are NAT'ing connections as well. For example, if you wanted to NAT and track FTP connections, you would need the NAT module as well. All NAT helpers start with ip_nat_ and follow that naming convention; so for example the FTP NAT helper would be named ip_nat_ftp and the IRC module would be named ip_nat_irc. The conntrack helpers follow the same naming convention, and hence the IRC conntrack helper would be named ip_conntrack_irc, while the FTP conntrack helper would be named ip_conntrack_ftp.
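Putting the helper modules and the RELATED state together, a NAT'ing firewall that lets local users reach FTP servers might be set up along these lines. This is a sketch under assumptions: eth1 is taken to be the local network interface, and both IPTABLES and MODPROBE default to a dry run that only prints the commands.

```shell
# Dry run by default: commands are printed instead of being executed.
IPTABLES="${IPTABLES:-echo iptables}"
MODPROBE="${MODPROBE:-echo modprobe}"

ftp_helper_rules() {
    # Load the helpers named in the text (needed only if built as modules).
    $MODPROBE ip_conntrack_ftp
    $MODPROBE ip_nat_ftp
    # Control connection from the local network to any FTP server.
    $IPTABLES -A FORWARD -i eth1 -p tcp --dport 21 \
        -m state --state NEW,ESTABLISHED -j ACCEPT
    # Data connections spotted by the helper show up as RELATED.
    $IPTABLES -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
}

ftp_helper_rules
```

Without the helper loaded, the second rule would never see the data connection as RELATED, and both active and passive transfers would have to be opened up by hand.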

What's next?

This chapter has discussed how the state machine in netfilter works and how it keeps state of different connections. The chapter has also discussed how it is represented toward you, the end user and what you can do to alter its behavior, as well as different protocols that are more complex to do connection tracking on, and how the different conntrack helpers come into the picture.

The next chapter will discuss how to save and restore rulesets using the iptables-save and iptables-restore programs distributed with the iptables applications. This has both pros and cons, and the chapter will discuss it in detail.

Chapter 8. Saving and restoring large rule-sets

The iptables package comes with two more tools that are very useful, especially if you are dealing with larger rule-sets. These two tools are called iptables-save and iptables-restore and are used to save and restore rule-sets to a specific file-format that looks quite a bit different from the standard shell code that you will see in the rest of this tutorial.

Tip: iptables-restore can be used together with scripting languages. The big problem is that you will need to output the results into the stdin of iptables-restore. If you are creating a very big ruleset (several thousand rules) this might be a very good idea, since it will be much faster to insert all the new rules. For example, you would then run make_rules.sh | iptables-restore.

Speed considerations

One of the largest reasons for using the iptables-save and iptables-restore commands is that they will speed up the loading and saving of larger rule-sets considerably. The main problem with running a shell script that contains iptables rules is that each invocation of iptables within the script will first extract the whole rule-set from the Netfilter kernel space, and after this, it will insert or append rules, or do whatever change to the rule-set that is needed by this specific command. Finally, it will insert the new rule-set from its own memory into kernel space. Using a shell script, this is done for each and every rule that we want to insert, and for each time we do this, it takes more time to extract and insert the rule-set.

To solve this problem, there are the iptables-save and iptables-restore commands. The iptables-save command is used to save the rule-set into a specially formatted text-file, and the iptables-restore command is used to load this text-file into kernel again. The best part of these commands is that they will load and save the rule-set in one single request. iptables-save will grab the whole rule-set from kernel and save it to a file in one single movement. iptables-restore will upload that specific rule-set to kernel in a single movement for each table. In other words, instead of dropping the rule-set out of kernel some 30,000 times, for really large rule-sets, and then uploading it to kernel again that many times, we can now save the whole thing into a file in one movement and then upload the whole thing in as little as three movements depending on how many tables you use.

As you can understand, these tools are definitely something for you if you are working on a huge set of rules that needs to be inserted. However, they do have drawbacks that we will discuss more in the next section.

Drawbacks with restore

As you may have already wondered, can iptables-restore handle any kind of scripting? So far, no, it cannot and it will most probably never be able to. This is the main flaw in using iptables-restore since you will not be able to do a huge set of things with these files. For example, what if you have a connection that has a dynamically assigned IP address and you want to grab this dynamic IP every time the computer boots up and then use that value within your scripts? With iptables-restore, this is more or less impossible.

One possibility to get around this is to make a small script which grabs the values you would like to use in the script, then sed the iptables-restore file for specific keywords and replace them with the values collected via the small script. At this point, you could save it to a temporary file, and then use iptables-restore to load the new values. This causes a lot of problems however, and you will be unable to use iptables-save properly since it would probably erase your manually added keywords in the restore script. It is, in other words, a clumsy solution.
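The keyword approach can be sketched in a few lines. Everything specific here is hypothetical: the @INET_IP@ keyword, the one-rule template and the fixed address 192.0.2.1 (a documentation address) merely stand in for your own restore file and for whatever lookup your system would do at boot.

```shell
# Hypothetical template in iptables-restore format; @INET_IP@ is a
# placeholder keyword chosen for this example.
template='*nat
-A POSTROUTING -o eth0 -j SNAT --to-source @INET_IP@
COMMIT'

fill_template() {
    # $1 = address to substitute for the placeholder keyword
    printf '%s\n' "$template" | sed "s/@INET_IP@/$1/g"
}

# In real use the address would be looked up at boot; here it is fixed.
fill_template 192.0.2.1

# To load the result (as root): fill_template "$INET_IP" | iptables-restore
```

Piping the filled-in text straight into iptables-restore avoids the temporary file, but the drawback stays the same: a later iptables-save writes out the real address and not the keyword.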

A second possibility is to do as previously described. Make a script that outputs rules in iptables-restore format, and then feed them on standard input of iptables-restore. For very large rulesets this would be preferable to running iptables itself, since it has a bad habit of taking a lot of processing power on very large rulesets as previously described in this chapter.
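A generator of this kind can be very small. The sketch below is in the spirit of the make_rules.sh mentioned earlier, but its contents are our own invention: the chain policies and the list of ports are assumptions chosen only to show the shape of the output.

```shell
# Write a filter table in iptables-restore format to standard output.
make_rules() {
    printf '%s\n' '*filter'
    printf '%s\n' ':INPUT DROP [0:0]'
    printf '%s\n' ':FORWARD DROP [0:0]'
    printf '%s\n' ':OUTPUT ACCEPT [0:0]'
    printf '%s\n' '-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT'
    # One rule per port; a real list could hold thousands of entries.
    for port in 22 25 80 443; do
        printf '%s\n' "-A INPUT -p tcp --dport $port -m state --state NEW -j ACCEPT"
    done
    printf '%s\n' 'COMMIT'
}

make_rules

# To load the rules (as root): make_rules | iptables-restore
```

Since the loop runs in the shell and only text is produced, dynamic values such as addresses read at boot can be worked in freely before the whole table is handed to the kernel in a single request.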

Another solution is to load the iptables-restore scripts first, and then load a specific shell script that inserts more dynamic rules in their proper places. Of course, as you can understand, this is just as clumsy as the first solution. iptables-restore is simply not very well suited for configurations where IP addresses are dynamically assigned to your firewall or where you want different behaviors depending on configuration options and so on.

Another drawback with iptables-restore and iptables-save is that they are not fully functional as of writing this. The problem is simply that not a lot of people use them as of today and hence there are not a lot of people finding bugs, and in turn some matches and targets will simply be inserted badly, which may lead to some strange behaviors that you did not expect. Even though these problems exist, I would highly recommend using these tools, which should work extremely well for most rule-sets as long as they do not contain some of the new targets or matches that the tools do not know how to handle properly.


The iptables-save command is, as we have already explained, a tool to save the current rule-set into a file that iptables-restore can use. This command is quite simple really, and takes only two arguments. Take a look at the following example to understand the syntax of the command.

iptables-save [-c] [-t table]

The -c argument tells iptables-save to keep the values specified in the byte and packet counters. This could for example be useful if we would like to reboot our main firewall, but not lose byte and packet counters which we may use for statistical purposes. Issuing an iptables-save command with the -c argument would then make it possible for us to reboot without breaking our statistical and accounting routines. The default is, of course, to not keep the counters intact when issuing this command.

The -t argument tells the iptables-save command which table to save. Without this argument the command will automatically save all tables available into the file. The following is an example of what output you can expect from the iptables-save command if you do not have any rule-set loaded.

# Generated by iptables-save v1.2.6a on Wed Apr 24 10:19:17 2002


:INPUT ACCEPT [404:19766]


:OUTPUT ACCEPT [530:43376]


# Completed on Wed Apr 24 10:19:17 2002

# Generated by iptables-save v1.2.6a on Wed Apr 24 10:19:17 2002



:INPUT ACCEPT [451:22060]


:OUTPUT ACCEPT [594:47151]



# Completed on Wed Apr 24 10:19:17 2002

# Generated by iptables-save v1.2.6a on Wed Apr 24 10:19:17 2002






# Completed on Wed Apr 24 10:19:17 2002

This contains a few comments starting with a # sign. Each table is marked like *<table-name>, for example *mangle. Then within each table we have the chain specifications and rules. A chain specification looks like :<chain-name> <chain-policy> [<packet-counter>:<byte-counter>]. The chain-name may be for example PREROUTING, the policy is described previously and can, for example, be ACCEPT. Finally the packet-counter and byte-counters are the same counters as in the output from iptables -L -v. Finally, each table declaration ends in a COMMIT keyword. The COMMIT keyword tells us that at this point we should commit all rules currently in the pipeline to kernel.
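To make the format concrete, the following sketch walks through a saved rule-set and prints the policy and counters of every chain. The sample rule-set is hypothetical and only shaped like the format described above, and list_policies is a name made up for this example.

```shell
# Hypothetical saved rule-set, shaped like the format described above.
saved='# Generated by iptables-save
*filter
:INPUT DROP [1:229]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [10:800]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
COMMIT'

list_policies() {
    awk '
        /^#/  { next }                          # comment line
        /^\*/ { table = substr($0, 2); next }   # *<table-name>
        /^:/  {                                 # :<chain> <policy> [pkts:bytes]
            split(substr($3, 2, length($3) - 2), c, ":")
            print table, substr($1, 2), $2, "packets=" c[1], "bytes=" c[2]
        }
    '
}

printf '%s\n' "$saved" | list_policies
```

On a live system the same function could be fed directly from the tool, as in iptables-save | list_policies.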

The above example is pretty basic, and hence I believe it is only proper to show a brief example which contains a very small Iptables-save ruleset. If we were to run iptables-save on this, it would look something like this in the output:

# Generated by iptables-save v1.2.6a on Wed Apr 24 10:19:55 2002


:INPUT DROP [1:229]




-A FORWARD -i eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT




# Completed on Wed Apr 24 10:19:55 2002

# Generated by iptables-save v1.2.6a on Wed Apr 24 10:19:55 2002



:INPUT ACCEPT [658:32445]


:OUTPUT ACCEPT [891:68234]



# Completed on Wed Apr 24 10:19:55 2002

# Generated by iptables-save v1.2.6a on Wed Apr 24 10:19:55 2002





-A POSTROUTING -o eth0 -j SNAT --to-source


# Completed on Wed Apr 24 10:19:55 2002

As you can see, each command has now been prefixed with the byte and packet counters since we used the -c argument. Except for this, the command-line is quite intact from the script. The only problem now is how to save the output to a file. This is quite simple, and you should already know how to do it if you have used Linux at all before. It is only a matter of redirecting the command output to the file that you would like to save it as. This could look like the following:

iptables-save -c > /etc/iptables-save

The above command will in other words save the whole rule-set to a file called /etc/iptables-save with byte and packet counters still intact.


The iptables-restore command is used to restore the iptables rule-set that was saved with the iptables-save command. It takes all the input from standard input and can't load from files as of writing this, unfortunately. This is the command syntax for iptables-restore:

iptables-restore [-c] [-n]

The -c argument restores the byte and packet counters and must be used if you want to restore counters that were previously saved with iptables-save. This argument may also be written in its long form --counters.

The -n argument tells iptables-restore to not overwrite the previously written rules in the table, or tables, that it is writing to. The default behavior of iptables-restore is to flush and destroy all previously inserted rules. The short -n argument may also be replaced with the longer format --noflush.

To load a rule-set with the iptables-restore command, we could do this in several ways, but we will mainly look at the simplest and most common way here.

cat /etc/iptables-save | iptables-restore -c

The following will also work:

iptables-restore -c < /etc/iptables-save

The first command would cat the rule-set located within the /etc/iptables-save file and pipe it to iptables-restore, which takes the rule-set on standard input and then restores it, including byte and packet counters. The second does the same with a shell redirection instead of a pipe. It is that simple to begin with. These commands could be varied endlessly, and we could show many different piping possibilities, but that is outside the scope of this chapter, so we leave it as an exercise for the reader to experiment with.
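The same thing can be driven from a script. The following Python sketch is one way it might look, under the assumption that iptables-restore is installed and the script runs as root; the helper names restore_argv and restore_from_file are invented for this example. It opens the saved file and hands it to iptables-restore on standard input, which is exactly what the shell redirection does.

```python
import subprocess


def restore_argv(counters=False, noflush=False):
    """Build the iptables-restore argument list from the two documented flags."""
    argv = ["iptables-restore"]
    if counters:
        argv.append("-c")  # restore the saved packet and byte counters
    if noflush:
        argv.append("-n")  # keep the rules already present in the tables
    return argv


def restore_from_file(path, counters=False, noflush=False):
    """Feed a saved rule-set to iptables-restore on standard input.

    Needs root privileges and an installed iptables. Raises
    subprocess.CalledProcessError if iptables-restore exits with an error.
    """
    with open(path, "rb") as saved_rules:
        subprocess.run(restore_argv(counters, noflush),
                       stdin=saved_rules, check=True)
```

Calling restore_from_file("/etc/iptables-save", counters=True) would then correspond to the iptables-restore -c < /etc/iptables-save command line.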

The rule-set should now be loaded properly into the kernel and everything should work. If not, you may possibly have run into a bug in these commands.

What's next?

This chapter has discussed the iptables-save and iptables-restore programs to some extent and how they can be used. Both applications are distributed with the iptables package, and can be used to quickly save large rule-sets and then insert them into the kernel again.

The next chapter will take a look at the syntax of an iptables rule and how to write properly formatted rule-sets. It will also show some basic good coding styles to adhere to, as required.

Chapter 9. How a rule is built

This chapter and the upcoming three chapters will discuss at length how to build your own rules. A rule could be described as the directions the firewall will adhere to when blocking or permitting different connections and packets in a specific chain. Each line you write that's inserted in a chain should be considered a rule. We will also discuss the basic matches that are available, and how to use them, as well as the different targets and how we can construct new targets of our own (i.e., new sub chains).

This chapter will deal with the raw basics of how a rule is created, and how you write and enter it so that it will be accepted by the userspace program iptables. It also covers the different tables, as well as the commands that you can issue to iptables. After that, the next chapter will look at all the matches that are available to iptables, and we will then go into more detail on each type of target and jump.

Basics of the iptables command

As we have already explained, each rule is a line that the kernel looks at to find out what to do with a packet. If all the criteria - or matches - are met, we perform the target - or jump - instruction. Normally we would write our rules in a syntax that looks something like this:

iptables [-t table] command [match] [target/jump]

There is nothing that says that the target instruction has to be the last function in the line. However, you would usually adhere to this syntax to get the best readability. Anyway, most of the rules you'll see are written in this way. Hence, if you read someone else's script, you'll most likely recognize the syntax and easily understand the rule.

If you want to use a table other than the standard table, you could insert the table specification at the point at which [table] is specified. However, it is not necessary to state explicitly what table to use, since by default iptables uses the filter table on which to implement all commands. Neither do you have to specify the table at just this point in the rule. It could be set pretty much anywhere along the line. However, it is more or less standard to put the table specification at the beginning.

One thing to think about though: The command should always come first, or alternatively directly after the table specification. We use 'command' to tell the program what to do, for example to insert a rule or to add a rule to the end of the chain, or to delete a rule. We shall take a further look at this below.

The match is the part of the rule that we send to the kernel that details the specific character of the packet, what makes it different from all other packets. Here we could specify what IP address the packet comes from, from which network interface, the intended IP address, port, protocol or whatever. There is a heap of different matches that we can use, which we will look closer at in the next chapter.

Finally we have the target of the packet. If all the matches are met for a packet, we tell the kernel what to do with it. We could, for example, tell the kernel to send the packet to another chain that we've created ourselves, and which is part of this particular table. We could tell the kernel to drop the packet dead and do no further processing, or we could tell the kernel to send a specified reply to the sender. As with the matches, we'll look closer at the targets in a later chapter.
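The ordering described above can be summed up in a small Python sketch. The function build_rule is hypothetical and only assembles an argument list; it does not run iptables or check that the matches are valid. It simply puts the pieces in the conventional order: optional table, command, matches, and finally the target.

```python
def build_rule(command, chain, matches=(), target=None, table=None):
    """Assemble an iptables argument list in the conventional order."""
    argv = ["iptables"]
    if table is not None:
        argv += ["-t", table]   # left out: iptables falls back to the filter table
    argv += [command, chain]    # the command comes first, or right after -t
    argv += list(matches)       # zero or more match arguments
    if target is not None:
        argv += ["-j", target]  # the target/jump, conventionally written last
    return argv


if __name__ == "__main__":
    rule = build_rule("-A", "INPUT", ["-p", "tcp", "--dport", "22"], "ACCEPT")
    print(" ".join(rule))
```

Keeping to one fixed order like this is purely a readability convention, as the text notes; iptables itself is more forgiving about where the table and target appear.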


Tables

The -t option specifies which table to use. By default, the filter table is used. We may specify one of the following tables with the -t option. Do note that this is an extremely brief summary of some of the contents of the Traversing of tables and chains chapter.

Table 9-1. Tables

Table Explanation
nat The nat table is used mainly for Network Address Translation. "NAT"ed packets get their IP addresses altered, according to our rules. Packets in a stream only traverse this table once. We assume that the first packet of a stream is allowed. The rest of the packets in the same stream are automatically "NAT"ed or Masqueraded etc, and will be subject to the same actions as the first packet. These will, in other words, not go through this table again, but will nevertheless be treated like the first packet in the stream. This is the main reason why you should not do any filtering in this table, which we will discuss at greater length further on. The PREROUTING chain is used to alter packets as soon as they get in to the firewall. The OUTPUT chain is used for altering locally generated packets (i.e., on the firewall) before they get to the routing decision. Finally we have the POSTROUTING chain which is used to alter packets just as they are about to leave the firewall.
mangle This table is used mainly for mangling packets. Among other things, we can change the contents of different packets and that of their headers. Examples of this would be to change the TTL, TOS or MARK. Note that the MARK is not really a change to the packet, but a mark value for the packet is set in kernel space. Other rules or programs might use this mark further along in the firewall to filter or do advanced routing on; tc is one example. The table consists of five built in chains, the PREROUTING, POSTROUTING, OUTPUT, INPUT and FORWARD chains. PREROUTING is used for altering packets just as they enter the firewall and before they hit the routing decision. POSTROUTING is used to mangle packets just after all routing decisions have been made. OUTPUT is used for altering locally generated packets after they enter the routing decision. INPUT is used to alter packets after they have been routed to the local computer itself, but before the user space application actually sees the data. FORWARD is used to mangle packets after they have hit the first routing decision, but before they actually hit the last routing decision. Note that mangle can't be used for any kind of Network Address Translation or Masquerading, the nat table was made for these kinds of operations.
filter The filter table should be used exclusively for filtering packets. For example, we could DROP, LOG, ACCEPT or REJECT packets without problems, as we can in the other tables. There are three chains built in to this table. The first one is named FORWARD and is used on all non-locally generated packets that are not destined for our local host (the firewall, in other words). INPUT is used on all packets that are destined for our local host (the firewall) and OUTPUT is finally used for all locally generated packets.
raw The raw table and its chains are used before any other tables in netfilter. It was introduced to make use of the NOTRACK target. This table is rather new and is only available, if compiled in, with late 2.6 kernels and later. The raw table contains two chains, PREROUTING and OUTPUT, which handle packets before they hit any of the other netfilter subsystems. The PREROUTING chain can be used for all packets coming in to this machine or being forwarded, while the OUTPUT chain can be used to alter locally generated packets before they hit any of the other netfilter subsystems.

The above details should have explained the basics about the four different tables that are available. They should be used for totally different purposes, and you should know what to use each table and chain for. If you do not understand their usage, you may well dig a pit for yourself in your firewall, into which you will fall as soon as someone finds it and pushes you into it. We have already discussed the requisite tables and chains in more detail within the Traversing of tables and chains chapter. If you do not understand this fully, I advise you to go back and read through it again.
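As a quick reference, the built-in chains listed in Table 9-1 can be written down as a small Python lookup. This is only a restatement of the table for the kernels discussed here; the names BUILTIN_CHAINS and has_builtin_chain are made up for this sketch, and other kernel versions may offer a different set of chains.

```python
# Built-in chains per table, as listed in Table 9-1.
BUILTIN_CHAINS = {
    "filter": ("INPUT", "FORWARD", "OUTPUT"),
    "nat": ("PREROUTING", "OUTPUT", "POSTROUTING"),
    "mangle": ("PREROUTING", "INPUT", "FORWARD", "OUTPUT", "POSTROUTING"),
    "raw": ("PREROUTING", "OUTPUT"),
}


def has_builtin_chain(chain, table="filter"):
    """True if the table has a built-in chain of that name (filter by default)."""
    if table not in BUILTIN_CHAINS:
        raise ValueError(f"unknown table: {table}")
    return chain in BUILTIN_CHAINS[table]


if __name__ == "__main__":
    print(has_builtin_chain("FORWARD"))              # filter is the default table
    print(has_builtin_chain("FORWARD", table="nat"))
```

A check like this is a cheap way to catch the common mistake of putting a rule in a chain that the chosen table does not have.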


Commands

In this section we will cover all the different commands and what can be done with them. The command tells iptables what to do with the rest of the rule that we send to the parser. Normally we would want either to add or delete something in some table or another. The following commands are available to iptables:

Table 9-2. Commands

Command -A, --append
Example iptables -A INPUT ...
Explanation This command appends the rule to the end of the chain. The rule will in other words always be put last in the rule-set and hence be checked last, unless you append more rules later on.
Command -D, --delete
Example iptables -D INPUT -p tcp --dport 80 -j DROP, iptables -D INPUT 1
Explanation This command deletes a rule in a chain. This could be done in two ways; either by entering the whole rule to match (as in the first example), or by specifying the rule number that you want to match. If you use the first method, your entry must match the entry in the chain exactly. If you use the second method, you must match the number of the rule you want to delete. The rules are numbered from the top of each chain, starting with number 1.
Command -R, --replace
Example iptables -R INPUT 1 -s -j DROP
Explanation This command replaces the old entry at the specified line. It works in the same way as the --delete command, but instead of totally deleting the entry, it will replace it with a new entry. The main use for this might be while you're experimenting with iptables.
Command -I, --insert
Example iptables -I INPUT 1 -p tcp --dport 80 -j ACCEPT
Explanation Insert a rule somewhere in a chain. The rule is inserted as the actual number that we specify. In other words, the above example would be inserted as rule 1 in the INPUT chain, and hence from now on it would be the very first rule in the chain.
Command -L, --list
Example iptables -L INPUT
Explanation This command lists all the entries in the specified chain. In the above case, we would list all the entries in the INPUT chain. It's also legal to not specify any chain at all. In the last case, the command would list all the chains in the specified table (To specify a table, see the Tables section). The exact output is affected by other options sent to the parser, for example the -n and -v options, etc.
Command -F, --flush
Example iptables -F INPUT
Explanation This command flushes all rules from the specified chain and is equivalent to deleting each rule one by one, but is quite a bit faster. The command can be used without options, and will then delete all rules in all chains within the specified table.
Command -Z, --zero
Example iptables -Z INPUT
Explanation This command tells the program to zero all counters in a specific chain, or in all chains. If you have used the -v option with the -L command, you have probably seen the packet counter at the beginning of each line. To zero this packet counter, use the -Z option. This option works the same as -L, except that -Z won't list the rules. If -L and -Z are used together (which is legal), the chains will first be listed, and then the packet counters are zeroed.
Command -N, --new-chain
Example iptables -N allowed
Explanation This command tells the kernel to create a new chain of the specified name in the specified table. In the above example we create a chain called allowed. Note that there must not already be a chain or target of the same name.
Command -X, --delete-chain
Example iptables -X allowed
Explanation This command deletes the specified chain from the table. For this command to work, there must be no rules that refer to the chain that is to be deleted. In other words, you would have to replace or delete all rules referring to the chain before actually deleting the chain. If this command is used without any options, all chains but those built in to the specified table will be deleted.
Command -P, --policy
Example iptables -P INPUT DROP
Explanation This command tells the kernel to set a specified default target, or policy, on a chain. All packets that don't match any rule will then be forced to use the policy of the chain. Legal targets are DROP and ACCEPT (There might be more, mail me if so).
Command -E, --rename-chain
Example iptables -E allowed disallowed
Explanation The -E command tells iptables to change the first name of a chain, to the second name. In the example above we would, in other words, change the name of the chain from allowed to disallowed. Note that this will not affect the actual way the table will work. It is, in other words, just a cosmetic change to the table.
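The rule numbering used by -I, -D and -R is easy to get wrong, so here is a toy Python model of a single chain. It is only a sketch: the Chain class is invented for this example, rules are plain strings, and the assumption that -I accepts a position one past the end of the chain is mine. The point it illustrates is that rules are numbered from 1 at the top of the chain.

```python
class Chain:
    """Toy model of one chain: an ordered list of rules, numbered from 1."""

    def __init__(self):
        self.rules = []

    def append(self, rule):            # -A: always lands last
        self.rules.append(rule)

    def insert(self, number, rule):    # -I: the new rule gets this number
        self._check(number, allow_end=True)
        self.rules.insert(number - 1, rule)

    def delete(self, number):          # -D with a rule number
        self._check(number)
        del self.rules[number - 1]

    def replace(self, number, rule):   # -R: overwrite the rule at this number
        self._check(number)
        self.rules[number - 1] = rule

    def flush(self):                   # -F: remove every rule in the chain
        self.rules.clear()

    def _check(self, number, allow_end=False):
        limit = len(self.rules) + (1 if allow_end else 0)
        if not 1 <= number <= limit:
            raise IndexError(f"no rule position {number} in this chain")


if __name__ == "__main__":
    chain = Chain()
    chain.append("rule a")
    chain.append("rule b")
    chain.insert(1, "rule first")      # becomes rule 1, the others shift down
    print(chain.rules)
```

Note how inserting at position 1 renumbers every rule after it, which is why listing the chain with line numbers before deleting or replacing by number is a good habit.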

You should always enter a complete command line, unless you just want to list the built-in help for iptables or get the version of the command. To get the version, use the -V option (note the capital letter; the lowercase -v is the --verbose option), and to get the help message, use the -h option. As usual, in other words. Next come a few options that can be used with various different commands. Note that we tell you with which commands the options can be used and what effect they will have. Also note that we do not include any options here that affect rules or matches. Instead, we'll take a look at matches and targets in later chapters.

Table 9-3. Options

Option -v, --verbose
Commands used with --list, --append, --insert, --delete, --replace
Explanation This option gives verbose output and is mainly used together with the --list command. If used together with the --list command, it outputs the interface name, rule options and TOS masks. The --list command will also include a byte and packet counter for each rule if the --verbose option is set. These counters use the K (x1000), M (x1,000,000) and G (x1,000,000,000) multipliers. To override this and get exact output, you can use the -x option, described later. If this option is used with the --append, --insert, --delete or --replace commands, the program will output detailed information on how the rule was interpreted and whether it was inserted correctly, etc.
Option -x, --exact
Commands used with --list
Explanation This option expands the numerics. The output from --list will in other words not contain the K, M or G multipliers. Instead we will get an exact output from the packet and byte counters of how many packets and bytes that have matched the rule in question. Note that this option is only usable in the --list command and isn't really relevant for any of the other commands.
Option -n, --numeric
Commands used with --list
Explanation This option tells iptables to output numerical values. IP addresses and port numbers will be printed by using their numerical values and not host-names, network names or application names. This option is only applicable to the --list command. This option overrides the default of resolving all numerics to hosts and names, where this is possible.
Option --line-numbers
Commands used with --list
Explanation The --line-numbers option, together with the --list command, is used to output line numbers. Using this option, each rule is output with its number. It could be convenient to know which rule has which number when inserting rules. This option only works with the --list command.
Option -c, --set-counters
Commands used with --insert, --append, --replace
Explanation This option is used when creating a rule or modifying it in some way. We can then use the option to initialize the packet and byte counters for the rule. The syntax would be something like --set-counters 20 4000, which would tell the kernel to set the packet counter to 20 and byte counter to 4000.
Option --modprobe
Commands used with All
Explanation The --modprobe option is used to tell iptables which command to use when probing for modules or adding them to the kernel. It could be used if your modprobe command is not somewhere in the search path, for example. In such cases, it might be necessary to specify this option so the program knows what to do in case a needed module is not loaded. This option can be used with all commands.
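To make the difference between the abbreviated counters and the -x output concrete, here is a simplified Python model. It is an assumption-laden sketch and not a copy of what iptables does: the function names are invented, and the real program has its own thresholds and rounding for when it switches to K, M and G. The sketch only shows the idea of the multipliers and of the exact mode.

```python
def abbreviate(count):
    """Shorten a counter with the K, M and G multipliers (simplified model)."""
    for suffix, factor in (("G", 10**9), ("M", 10**6), ("K", 10**3)):
        if count >= factor:
            # Integer arithmetic, rounding to the nearest whole unit.
            return f"{(count + factor // 2) // factor}{suffix}"
    return str(count)


def show_counter(count, exact=False):
    """With exact=True, behave like -x and print the full number."""
    return str(count) if exact else abbreviate(count)


if __name__ == "__main__":
    for value in (891, 68234, 4200000):
        print(show_counter(value), show_counter(value, exact=True))
```

Scripts that compare counters over time should work from exact values, since the abbreviated form throws away the low-order digits.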

What's next?

This chapter has discussed some of the basic commands for iptables and, very briefly, the tables that can be used in netfilter. As you have seen, the commands make it possible to do quite a lot of different operations on the netfilter package loaded inside the kernel.

The next chapter will discuss all the available matches in iptables and netfilter. This is a very heavy and long chapter, and I humbly suggest that you don't need to actually learn every single match available in any detail, except the ones that you are going to use. A good idea might be to get a brief understanding of what each match does, and then get a better grasp on them as you need them.

Chapter 10. Iptables matches

In this chapter we'll talk a bit more about matches. I've chosen to narrow down the matches into five different subcategories. First of all we have the generic matches, which can be used in all rules. Then we have the TCP matches which can only be applied to TCP packets. We have UDP matches which can only be applied to UDP packets, and ICMP matches which can only be used on ICMP packets. Finally we have special matches, such as the state, owner and limit matches and so on. These final matches have in turn been narrowed down to even more subcategories, even though they might not necessarily be different matches at all. I hope this is a reasonable breakdown and that all people out there can understand it.

As you may already understand if you have read the previous chapters, a match is something that specifies a special condition within the packet that must be true (or false). A single rule can contain several matches of any kind. For example, we may want to match packets that come from a specific host on our local area network, and on top of that only from specific ports on that host. We could then use matches to tell the rule to only apply the target - or jump specification - on packets that have a specific source address, that come in on the interface that connects to the LAN, and that come from one of the specified ports. If any one of these matches fails (e.g., the source address isn't correct, but everything else is true), the whole rule fails and the next rule is tested on the packet. If all matches are true, however, the target specified by the rule is applied.
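The way matches combine can be sketched in a few lines of Python. This is a simplified model with made-up names: packets are plain dictionaries, each match is a function that returns True or False, and every target is treated as final, which is not true of all real targets. It shows that a rule applies only if all of its matches succeed, and that otherwise the next rule is tried.

```python
def first_matching_target(rules, packet, policy):
    """Return the target of the first rule whose matches all succeed.

    rules is a list of (matches, target) pairs, where matches is a list of
    functions taking the packet. If no rule matches, the policy is returned.
    """
    for matches, target in rules:
        if all(match(packet) for match in matches):
            return target
    return policy


if __name__ == "__main__":
    # Hypothetical rule: one host, one interface, two source ports.
    rules = [
        ([lambda p: p["src"] == "192.168.0.5",
          lambda p: p["in_iface"] == "eth1",
          lambda p: p["sport"] in (1024, 1025)], "ACCEPT"),
    ]
    packet = {"src": "192.168.0.5", "in_iface": "eth1", "sport": 1024}
    print(first_matching_target(rules, packet, "DROP"))
```

Changing any single field of the packet so that one match fails is enough to make the rule fail as a whole, and the packet then falls through to the policy.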

Generic matches

This section will deal with generic matches. A generic match is a kind of match that is always available, whatever kind of protocol we are working on, or whatever match extensions we have loaded. In other words, no special parameters at all are needed to use these matches. I have also included the --protocol match here, even though it is more specific to protocol matches. For example, if we want to use a TCP match, we need to use the --protocol match and send TCP as an option to the match. However, --protocol is also a match in itself, since it can be used to match specific protocols. The following matches are always available.

Table 10-1. Generic matches

Match -p, --protocol
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp
Explanation This match is used to check for certain protocols. Examples of protocols are TCP, UDP and ICMP. The protocol may be one of the internally specified TCP, UDP or ICMP. It may also take a name specified in the /etc/protocols file, and if it can't find the protocol there it will reply with an error. The protocol may also be an integer value. For example, the ICMP protocol is integer value 1, TCP is 6 and UDP is 17. Finally, it may also take the value ALL, which is equivalent to the integer value zero (0) and matches all protocols. This is also the default behavior if the --protocol match is not used. This match can also be inverted with the ! sign, so --protocol ! tcp would match every protocol except TCP, such as UDP and ICMP.
Match -s, --src, --source
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -s
Explanation This is the source match, which is used to match packets based on their source IP address. The main form can be used to match a single IP address. It could also be used with a netmask in a CIDR "bit" form, by specifying the number of ones (1's) on the left side of the network mask. This means that we could for example add /24 to use a 255.255.255.0 netmask. We could then match whole IP ranges, such as our local networks or network segments behind the firewall. The line would then look something like 192.168.0.0/24. This would match all packets in the 192.168.0.x range. Another way is to do it with a regular netmask in the dotted form (i.e., 192.168.0.0/255.255.255.0). We could also invert the match with an ! just as before. If we were, in other words, to use a match in the form of --source ! 192.168.0.0/24, we would match all packets with a source address not coming from within the 192.168.0.x range. The default is to match all IP addresses.
Match -d, --dst, --destination
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -d
Explanation The --destination match is used for packets based on their destination address or addresses. It works pretty much the same as the --source match and has the same syntax, except that the match is based on where the packets are going to. To match an IP range, we can add a netmask either in the exact netmask form, or as the number of ones (1's) counted from the left side of the netmask bits. Both of these forms are equivalent. We could also invert the whole match with an ! sign, just as before. --destination ! followed by an IP address would in other words match all packets except those destined to that IP address.
Match -i, --in-interface
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -i eth0
Explanation This match is used for the interface the packet came in on. Note that this option is only legal in the INPUT, FORWARD and PREROUTING chains and will return an error message when used anywhere else. The default behavior of this match, if no particular interface is specified, is to assume a string value of +. The + value is used to match a string of letters and numbers. A single + would, in other words, tell the kernel to match all packets without considering which interface it came in on. The + string can also be appended to the type of interface, so eth+ would be all Ethernet devices. We can also invert the meaning of this option with the help of the ! sign. The line would then have a syntax looking something like -i ! eth0, which would match all incoming interfaces, except eth0.
Match -o, --out-interface
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A FORWARD -o eth0
Explanation The --out-interface match is used for packets on the interface from which they are leaving. Note that this match is only available in the OUTPUT, FORWARD and POSTROUTING chains, the opposite in fact of the --in-interface match. Other than this, it works pretty much the same as the --in-interface match. The + extension is understood as matching all devices of similar type, so eth+ would match all eth devices and so on. To invert the meaning of the match, you can use the ! sign in exactly the same way as for the --in-interface match. If no --out-interface is specified, the default behavior for this match is to match all devices, regardless of where the packet is going.
Match -f, --fragment
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -f
Explanation This match is used to match the second and subsequent fragments of a fragmented packet. The reason for this is that in the case of fragmented packets, there is no way to tell the source or destination ports of the fragments, nor ICMP types, among other things. Also, fragmented packets might in rather special cases be used to compound attacks against other computers. Packet fragments like this will not be matched by other rules, and hence this match was created. This option can also be used in conjunction with the ! sign; however, in this case the ! sign must precede the match, i.e. ! -f. When this match is inverted, we match all header fragments and/or unfragmented packets. What this means is that we match all the first fragments of fragmented packets, and not the second, third, and so on. We also match all packets that have not been fragmented during transfer. Note also that there are really good defragmentation options within the kernel that you can use instead. As a secondary note, if you use connection tracking you will not see any fragmented packets, since they are dealt with before hitting any chain or table in iptables.
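Two of the generic matches above are easy to model in Python: the address range used by --source and --destination, and the + wildcard used by the interface matches. The sketch below uses the standard ipaddress module; the function names are invented for this example, and the 192.168.0.0/24 network is simply the 192.168.0.x range mentioned in the table. It also checks that the bit-count form and the dotted netmask form describe the same network.

```python
import ipaddress


def address_matches(spec, address):
    """True if address lies inside spec; a leading '!' inverts the result."""
    inverted = spec.startswith("!")
    network = ipaddress.ip_network(spec.lstrip("! "), strict=False)
    inside = ipaddress.ip_address(address) in network
    return inside != inverted


def interface_matches(pattern, name):
    """A trailing '+' is a wildcard; a bare '+' matches every interface."""
    if pattern.endswith("+"):
        return name.startswith(pattern[:-1])
    return name == pattern


if __name__ == "__main__":
    bits = ipaddress.ip_network("192.168.0.0/24")
    dotted = ipaddress.ip_network("192.168.0.0/255.255.255.0")
    print(bits == dotted)
    print(address_matches("! 192.168.0.0/24", "192.168.0.17"))
    print(interface_matches("eth+", "eth1"))
```

The inversion is nothing more than flipping the final result, which is why it behaves the same way for single addresses and for whole ranges.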

Implicit matches

This section will describe the matches that are loaded implicitly. Implicit matches are implied, taken for granted, automatic, for example when we match on --protocol tcp without any further criteria. There are currently three types of implicit matches for three different protocols. These are TCP matches, UDP matches and ICMP matches. The TCP based matches contain a set of unique criteria that are available only for TCP packets. UDP based matches contain another set of criteria that are available only for UDP packets. And the same thing goes for ICMP packets. On the other hand, there are explicit matches that have to be loaded explicitly. Explicit matches are not implied or automatic; you have to specify them specifically. For these you use the -m or --match option, which we will discuss in the next section.

TCP matches

These matches are protocol specific and are only available when working with TCP packets and streams. To use these matches, you need to specify --protocol tcp on the command line before trying to use them. Note that the --protocol tcp match must be to the left of the protocol specific matches. These matches are loaded implicitly in a sense, just as the UDP and ICMP matches are loaded implicitly. The other matches will be looked over in the continuation of this section, after the TCP match section.

Table 10-2. TCP matches

Match --sport, --source-port
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp --sport 22
Explanation The --source-port match is used to match packets based on their source port. Without it, we imply all source ports. This match can either take a service name or a port number. If you specify a service name, the service name must be in the /etc/services file, since iptables uses this file to look it up. If you specify the port by its number, the rule will load slightly faster, since iptables doesn't have to look up the service name. However, the match might be a little bit harder to read than if you use the service name. If you are writing a rule-set consisting of 200 rules or more, you should definitely use port numbers, since the difference is really noticeable. (On a slow box, this could make as much as 10 seconds' difference, if you have configured a large rule-set containing 1000 rules or so). You can also use the --source-port match to match any range of ports, --source-port 22:80 for example. This example would match all source ports between 22 and 80. If you omit the first port, port 0 is assumed (is implicit). --source-port :80 would then match port 0 through 80. And if the last port specification is omitted, port 65535 is assumed. If you were to write --source-port 22:, you would have specified a match for all ports from port 22 through port 65535. If you write the port range in reverse order, with the high port first, iptables automatically swaps the two ports back. If you write --source-port 80:22, it is simply interpreted as --source-port 22:80. You can also invert a match by adding a ! sign. For example, --source-port ! 22 means that you want to match all ports but port 22. The inversion could also be used together with a port range and would then look like --source-port ! 22:80, which in turn would mean that you want to match all ports but ports 22 through 80. Note that this match does not handle multiple separated ports and port ranges. For more information about those, look at the multiport match extension.
Match --dport, --destination-port
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp --dport 22
Explanation This match is used to match TCP packets, according to their destination port. It uses exactly the same syntax as the --source-port match. It understands port and port range specifications, as well as inversions. It also reverses high and low ports in port range specifications, as above. The match will also assume values of 0 and 65535 if the high or low port is left out in a port range specification. In other words, exactly the same as the --source-port syntax. Note that this match does not handle multiple separated ports and port ranges. For more information about those, look at the multiport match extension.
Match --tcp-flags
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -p tcp --tcp-flags SYN,FIN,ACK SYN
Explanation This match is used to match on the TCP flags in a packet. First of all, the match takes a list of flags to compare (a mask) and secondly it takes a list of flags that should be set to 1, or turned on. Both lists should be comma-delimited. The match knows about the SYN, ACK, FIN, RST, URG, PSH flags, and it also recognizes the words ALL and NONE. ALL and NONE are pretty much self-describing: ALL means to use all flags and NONE means to use no flags for the option. --tcp-flags ALL NONE would in other words mean to check all of the TCP flags and match if none of the flags are set. This option can also be inverted with the ! sign. For example, if we specify ! SYN,FIN,ACK SYN, we would match every packet that the plain form does not match, for example packets that have the ACK and FIN bits set but not the SYN bit. Also note that the comma delimitation should not include spaces. You can see the correct syntax in the example above.
Match --syn
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -p tcp --syn
Explanation The --syn match is more or less an old relic from the ipchains days and is still there for backward compatibility and to make the transition from one to the other easier. It is used to match packets if they have the SYN bit set and the ACK and RST bits unset. This match would in other words be exactly the same as the --tcp-flags SYN,RST,ACK SYN match. Such packets are mainly used to request new TCP connections from a server. If you block these packets, you should have effectively blocked all incoming connection attempts. However, you will not have blocked the outgoing connections, which a lot of exploits today use (for example, hacking a legitimate service and then installing a program or suchlike that initiates a connection from your host, instead of opening up a new port on it). This match can also be inverted with the ! sign, written as ! --syn. This would match all packets that are not connection requests, typically packets with the RST or the ACK bits set, in other words packets in an already established connection.
Match --tcp-option
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -p tcp --tcp-option 16
Explanation This match is used to match packets depending on their TCP options. A TCP Option is a specific part of the header. This part consists of 3 different fields. The first one is 8 bits long and tells us which Options are used in this stream, the second one is also 8 bits long and tells us how long the options field is, and the third holds the option data itself. The reason for this length field is that TCP options are, well, optional. To be compliant with the standards, we do not need to implement all options, but instead we can just look at what kind of option it is, and if we do not support it, we just look at the length field and can then jump over this data. This match is used to match different TCP options depending on their decimal values. It may also be inverted with the ! flag, so that the match matches all TCP options but the option given to the match. For a complete list of all options, take a closer look at the Internet Engineering Task Force who maintains a list of all the standard numbers used on the Internet.
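The port range and --tcp-flags rules in the table above are compact enough to model directly. The following Python sketch is my own illustration with invented function names, not code taken from iptables. It handles numeric ports only (no service names), swaps reversed ranges as the table describes, and treats --tcp-flags as "look only at the flags in the mask, and require exactly the listed ones to be set".

```python
ALL_FLAGS = frozenset({"SYN", "ACK", "FIN", "RST", "URG", "PSH"})


def parse_port_range(spec):
    """Turn 'low:high', ':high', 'low:' or 'port' into a (low, high) tuple."""
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
        low = int(low_text) if low_text else 0        # missing low port: 0
        high = int(high_text) if high_text else 65535  # missing high port: 65535
    else:
        low = high = int(spec)
    if low > high:                                     # 80:22 is read as 22:80
        low, high = high, low
    return low, high


def port_matches(spec, port, inverted=False):
    """True if port is inside the range; inverted=True models the ! sign."""
    low, high = parse_port_range(spec)
    return (low <= port <= high) != inverted


def _flag_set(text):
    """Expand 'ALL', 'NONE' or a comma-delimited list into a set of flags."""
    if text == "ALL":
        return set(ALL_FLAGS)
    if text == "NONE":
        return set()
    return set(text.split(","))


def tcp_flags_match(mask, comp, packet_flags, inverted=False):
    """Examine only the flags in mask; exactly those in comp must be set."""
    examined = _flag_set(mask) & set(packet_flags)
    return (examined == _flag_set(comp)) != inverted


if __name__ == "__main__":
    print(parse_port_range("80:22"))
    print(tcp_flags_match("SYN,FIN,ACK", "SYN", {"SYN", "PSH"}))
    print(tcp_flags_match("SYN,RST,ACK", "SYN", {"SYN", "ACK"}))
```

Flags outside the mask, such as PSH in the second call, are simply ignored, which is the practical reason for giving a mask at all. The third call uses the mask and flag list that --syn stands for, and shows that a SYN/ACK reply is not treated as a new connection request.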

UDP matches

This section describes matches that will only work together with UDP packets. These matches are implicitly loaded when you specify the --protocol UDP match and will be available after this specification. Note that UDP is not connection oriented, and hence there are no flags to set in the packet to indicate what the datagram is supposed to do, such as opening or closing a connection, or simply sending data. UDP packets do not require any kind of acknowledgment either. If they are lost, they are simply lost (not taking ICMP error messaging etc. into account). This means that there are far fewer matches to work with on UDP packets than there are on TCP packets. Note that the state machine will work on all kinds of packets, even though UDP and ICMP are counted as connectionless protocols. The state machine works pretty much the same on UDP packets as on TCP packets.

Table 10-3. UDP matches

Match --sport, --source-port
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -p udp --sport 53
Explanation This match works exactly the same as its TCP counterpart. It is used to perform matches on packets based on their source UDP ports. It has support for port ranges, single ports and port inversions with the same syntax. To specify a UDP port range, you could use 22:80 which would match UDP ports 22 through 80. If the first value is omitted, port 0 is assumed. If the last port is omitted, port 65535 is assumed. If the high port comes before the low port, the ports switch place with each other automatically. Single UDP port matches look as in the example above. To invert the port match, add a ! sign, --source-port ! 53. This would match all ports but port 53. The match can understand service names, as long as they are available in the /etc/services file. Note that this match does not handle multiple separated ports and port ranges. For more information about this, look at the multiport match extension.
Match --dport, --destination-port
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -p udp --dport 53
Explanation The same goes for this match as for --source-port above. It is exactly the same as for the equivalent TCP match, but here it applies to UDP packets. It matches packets based on their UDP destination port. The match handles port ranges, single ports and inversions. To match a single port you use, for example, --destination-port 53; to invert this you would use --destination-port ! 53. The first would match all UDP packets going to port 53, while the second would match all packets but those going to destination port 53. To specify a port range, you would, for example, use --destination-port 9:19. This example would match all packets destined for UDP ports 9 through 19. If the first port is omitted, port 0 is assumed. If the second port is omitted, port 65535 is assumed. If the high port is placed before the low port, they automatically switch place, so the low port winds up before the high port. Note that this match does not handle multiple ports and port ranges. For more information about this, look at the multiport match extension.
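
The port range rules described in the table above (an omitted first port means 0, an omitted last port means 65535, reversed ports switch place, and ! inverts the result) can be modelled roughly as follows. This is a sketch of the described behaviour and not iptables' own parser. The names parse_port_range and port_matches are invented for the example, and service names from /etc/services are deliberately not handled.

```python
def parse_port_range(spec: str) -> tuple:
    """Turn '53', '9:19', ':80' or '1024:' into a (low, high) pair."""
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
        low = int(low_text) if low_text else 0         # omitted first port
        high = int(high_text) if high_text else 65535  # omitted last port
    else:
        low = high = int(spec)                         # single port
    if low > high:
        low, high = high, low                          # switch places
    if low < 0 or high > 65535:
        raise ValueError("port out of range: " + spec)
    return low, high


def port_matches(port: int, spec: str, invert: bool = False) -> bool:
    """Model of '--dport spec', or '--dport ! spec' when invert is True."""
    low, high = parse_port_range(spec)
    return (low <= port <= high) != invert


print(parse_port_range("9:19"))
print(port_matches(53, "53", invert=True))
```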

ICMP matches

These are the ICMP matches. These packets are even more ephemeral, that is to say short lived, than UDP packets, in the sense that they are connectionless. The ICMP protocol is mainly used for error reporting, connection control and suchlike. ICMP is not a protocol subordinated to the IP protocol, but more of a protocol that augments the IP protocol and helps in handling errors. The headers of ICMP packets are very similar to those of the IP headers, but differ in a number of ways. The main feature of this protocol is the type header, which tells us what the packet is for. For example, if we try to access an inaccessible IP address, we would normally get an ICMP host unreachable in return. For a complete listing of ICMP types, see the ICMP types appendix. There is only one ICMP specific match available for ICMP packets, and hopefully this should suffice. This match is implicitly loaded when we use the --protocol ICMP match and we get access to it automatically. Note that all the generic matches can also be used, so that among other things we can match on the source and destination addresses.

Table 10-4. ICMP matches

Match --icmp-type
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -p icmp --icmp-type 8
Explanation This match is used to specify the ICMP type to match. ICMP types can be specified either by their numeric values or by their names. Numerical values are specified in RFC 792. To find a complete listing of the ICMP name values, do an iptables --protocol icmp --help, or check the ICMP types appendix. This match can also be inverted with the ! sign, like this: --icmp-type ! 8. Note that some ICMP types are obsolete, and others again may be "dangerous" for an unprotected host since they may, among other things, redirect packets to the wrong places. The type and code may be specified by type name, by numeric type, or as numeric type/code. For example --icmp-type network-redirect, --icmp-type 8 or --icmp-type 8/0.

SCTP matches

SCTP or Stream Control Transmission Protocol is a relatively new occurrence in the networking domain in comparison to the TCP and UDP protocols. The SCTP Characteristics chapter explains the protocol in more detail. The implicit SCTP matches are loaded through adding the -p sctp match to the command line of iptables.

The SCTP protocol was developed by some of the larger telecom and switch/network manufacturers out there, and the protocol is especially well suited for large simultaneous transactions with high reliability and high throughput.

Table 10-5. SCTP matches

Match --source-port, --sport
Kernel 2.6
Example iptables -A INPUT -p sctp --source-port 80
Explanation The --source-port match is used to match an SCTP packet based on the source port in the SCTP packet header. The port can either be a single port, as in the example above, or a range of ports specified as --source-port 20:100, or it can also be inverted with the !-sign. This looks, for example, like --source-port ! 25. The source port is an unsigned 16 bit integer, so the maximum value is 65535 and the lowest value is 0.
Match --destination-port, --dport
Kernel 2.6
Example iptables -A INPUT -p sctp --destination-port 80
Explanation This match is used for the destination port of the SCTP packets. All SCTP packets contain a destination port, just as they do a source port, in the headers. The port can be either specified as in the example above, or with a port range such as --destination-port 6660:6670. The command can also be inverted with the !-sign, for example, --destination-port ! 80. This example would match all packets but those to port 80. The same applies for destination ports as for source ports, the highest port is 65535 and the lowest is 0.
Match --chunk-types
Kernel 2.6
Example iptables -A INPUT -p sctp --chunk-types any INIT,INIT_ACK
Explanation This matches the chunk type of the SCTP packet. Currently there are a host of different chunk types available. For a complete list, see below. The match begins with the --chunk-types keyword, and then continues with a flag noting if we are to match all, any or none. After this, you specify the SCTP Chunk Types to match for. The Chunk Types are available in the separate list below.
Additionally, each chunk type can take some Chunk Flags as well. This is done for example in the form --chunk-types any DATA:Be. The flags are specific for each SCTP Chunk type and must be valid according to the separate list after this table.
If an upper case letter is used, the flag must be set to match, and if a lower case letter is used, the flag must be unset to match. The whole match can be inverted by using an ! sign just after the --chunk-types keyword. For example, --chunk-types ! any DATA:Be would match anything but this pattern.

Below is the list of chunk types that the --chunk-types match will recognize. The list is quite extensive as you can see, but the most commonly used chunks are DATA and SACK. The rest are mostly used for controlling the association.

SCTP Chunk types as used in --chunk-types

The following flags can be used with the --chunk-types match as seen above. According to RFC 2960 - Stream Control Transmission Protocol, all the rest of the flags are reserved or not in use, and must be set to 0. Iptables does currently not contain any measures to enforce this, fortunately, since doing so would invite another problem like the one previously experienced when ECN was implemented in the IP protocol.

SCTP Chunk flags as used in --chunk-types

• DATA - U or u for Unordered bit, B or b for Beginning fragment bit and E or e for Ending fragment bit.

• ABORT - T or t for TCB destroy flag.

• SHUTDOWN_COMPLETE - T or t for TCB destroyed flag.
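
To make the upper case/lower case convention a bit more concrete, here is a small Python sketch of how a flag specification such as Be could be checked against the flags byte of a DATA chunk. The bit values (U = 0x04, B = 0x02, E = 0x01) follow the DATA chunk layout in RFC 2960. The function name chunk_flags_match and the DATA_FLAG_BITS table are made up for this example and this is not how iptables itself is written.

```python
# Bit positions of the DATA chunk flags according to RFC 2960.
DATA_FLAG_BITS = {"U": 0x04, "B": 0x02, "E": 0x01}


def chunk_flags_match(flag_spec: str, chunk_flags: int, bits=DATA_FLAG_BITS) -> bool:
    """Check a spec like 'Be' against a chunk flags byte.

    Upper case letter: the flag must be set.
    Lower case letter: the flag must be unset.
    Flags not mentioned in the spec are ignored.
    """
    for letter in flag_spec:
        if letter.upper() not in bits:
            raise ValueError("unknown chunk flag: " + letter)
        is_set = bool(chunk_flags & bits[letter.upper()])
        if letter.isupper() != is_set:
            return False
    return True


# DATA:Be means "beginning fragment bit set, ending fragment bit unset".
print(chunk_flags_match("Be", 0x02))  # first fragment of a longer message
print(chunk_flags_match("Be", 0x03))  # unfragmented message, B and E both set
```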

Explicit matches

Explicit matches are those that have to be specifically loaded with the -m or --match option. State matches, for example, demand the directive -m state prior to entering the actual match that you want to use. Some of these matches may be protocol specific. Some may be unconnected with any specific protocol - for example connection states. These might be NEW (the first packet of an as yet unestablished connection), ESTABLISHED (a connection that is already registered in the kernel), RELATED (a new connection that was created by an older, established one) etc. A few may have been developed just for testing or experimental purposes, or just to illustrate what iptables is capable of. This in turn means that not all of these matches may at first sight be of any use. Nevertheless, it may well be that you personally will find a use for specific explicit matches. And there are new ones coming along all the time, with each new iptables release. Whether you find a use for them or not depends on your imagination and your needs. The difference between implicitly loaded matches and explicitly loaded ones is that the implicitly loaded matches will automatically be loaded when, for example, you match on the properties of TCP packets, while explicitly loaded matches will never be loaded automatically - it is up to you to discover and activate explicit matches.

Addrtype match

The addrtype module matches packets based on the address type. The address type is used inside the kernel to put different packets into different categories. With this match you will be able to match all packets based on their address type according to the kernel. It should be noted that the exact meaning of the different address types varies between the layer 3 protocols. I will give a brief general description here, but for more information I suggest reading Linux Advanced Routing and Traffic Control HOW-TO and Policy Routing using Linux. The available types are as follows:

Table 10-6. Address types

Type Description
ANYCAST This is a one-to-many associative connection type, where only one of the many receiver hosts actually receives the data. This is for example implemented in DNS. You have a single address to a root server, but it actually has several locations and your packet will be directed to the closest working server. Not implemented in Linux IPv4.
BLACKHOLE A blackhole address will simply delete the packet and send no reply. It works as a black hole in space basically. This is configured in the routing tables of Linux.
BROADCAST A broadcast packet is a single packet sent to everyone in a specific network in a one-to-many relation. This is for example used in ARP resolution, where a single packet is sent out requesting information on how to reach a specific IP, and then the host that is authoritative replies with the proper MAC address of that host.
LOCAL An address that is local to the host we are working on. for example.
MULTICAST A multicast packet is sent to several hosts using the shortest distance, and only one packet is sent to each waypoint, where it is split into multiple copies for each host/router subscribing to the specific multicast address. Commonly used in one way streaming media such as video or sound.
NAT An address that has been NAT'ed by the kernel.
PROHIBIT Same as blackhole except that a prohibited answer will be generated. In the IPv4 case, this means an ICMP communication prohibited (type 3, code 13) answer will be generated.
THROW Special route in the Linux kernel. If a packet is thrown in a routing table it will behave as if no route was found in the table. In normal routing, this means that the packet will behave as if it had no route. In policy routing, another route might be found in another routing table.
UNICAST A real routable address for a single address. The most common type of route.
UNREACHABLE This signals an unreachable address that we do not know how to reach. The packets will be discarded and an ICMP Host unreachable (type 3, code 1) will be generated.
UNSPEC An unspecified address that has no real meaning.
XRESOLVE This address type is used to send route lookups to userland applications which will do the lookup for the kernel. This might be wanted to send ugly lookups to the outside of the kernel, or to have an application do lookups for you. Not implemented in Linux.

The addrtype match is loaded by using the -m addrtype keyword. When this is done, the extra match options in the following table will be available for usage.

Table 10-7. Addrtype match options

Match --src-type
Kernel 2.6
Example iptables -A INPUT -m addrtype --src-type UNICAST
Explanation The --src-type match option is used to match the source address type of the packet. It can either take a single address type or several separated by commas, for example --src-type BROADCAST,MULTICAST. The match option may also be inverted by adding an exclamation sign before it, for example ! --src-type BROADCAST,MULTICAST.
Match --dst-type
Kernel 2.6
Example iptables -A INPUT -m addrtype --dst-type UNICAST
Explanation The --dst-type works exactly the same way as --src-type and has the same syntax. The only difference is that it will match packets based on their destination address type.

AH/ESP match

These matches are used for the IPSEC AH and ESP protocols. IPSEC is used to create secure tunnels over an insecure Internet connection. The AH and ESP protocols are used by IPSEC to create these secure connections. The AH and ESP matches are really two separate matches, but are both described here since they look very much alike, and both are used in the same function.

I will not go into detail to describe IPSEC here, instead look at the following pages and documents for more information:

RFC 2401 - Security Architecture for the Internet Protocol



Linux Advanced Routing and Traffic Control HOW-TO

There is also a ton more documentation on the Internet on this, but you are free to look it up as needed.

To use the AH/ESP matches, you need to use -m ah to load the AH matches, and -m esp to load the ESP matches.

Note In 2.2 and 2.4 kernels, Linux used something called FreeS/WAN for the IPSEC implementation, but as of Linux kernel 2.5.47 and up, Linux kernels have a direct implementation of IPSEC that requires no patching of the kernel. This is a total rewrite of the IPSEC implementation on Linux.

Table 10-8. AH match options

Match --ahspi
Kernel 2.5 and 2.6
Example iptables -A INPUT -p 51 -m ah --ahspi 500
Explanation This matches the AH Security Parameter Index (SPI) number of the AH packets. Please note that you must specify the protocol as well, since AH runs on a different protocol than the standard TCP, UDP or ICMP protocols. The SPI number is used in conjunction with the source and destination address and the secret keys to create a security association (SA). The SA uniquely identifies each and every one of the IPSEC tunnels to all hosts. The SPI is used to uniquely distinguish each IPSEC tunnel connected between the same two peers. Using the --ahspi match, we can match a packet based on its SPI. This match can match a whole range of SPI values by using a : sign, such as 500:520, which will match the whole range of SPIs.

Table 10-9. ESP match options

Match --espspi
Kernel 2.5 and 2.6
Example iptables -A INPUT -p 50 -m esp --espspi 500
Explanation The ESP counterpart Security Parameter Index (SPI) is used exactly the same way as the AH variant. The match looks exactly the same, apart from the esp/ah difference. Of course, this match can match a whole range of SPI numbers just as the AH variant of the SPI match can, such as --espspi 200:250, which matches the whole range of SPIs.

Comment match

The comment match is used to add comments inside the iptables ruleset and the kernel. This can make it much easier to understand your ruleset and to ease debugging. For example, you could add comments documenting which bash function added specific sets of rules to netfilter, and why. It should be noted that this isn't actually a match. The comment match is loaded using the -m comment keyword. At this point the following options will be available.

Table 10-10. Comment match options

Match --comment
Kernel 2.6
Example iptables -A INPUT -m comment --comment "A comment"
Explanation The --comment option specifies the comment to actually add to the rule in the kernel. The comment can be a maximum of 256 characters.

Connmark match

The connmark match is used very much the same way as the mark match is in the MARK/mark target and match combination. The connmark match is used to match marks that have been set on a connection with the CONNMARK target. It only takes one option.

Important To match the mark on the very packet that first creates the connection marking, you must use the connmark match after the CONNMARK target has set the mark on that first packet.

Table 10-11. Connmark match options

Match --mark
Kernel 2.6
Example iptables -A INPUT -m connmark --mark 12 -j ACCEPT
Explanation The mark option is used to match a specific mark associated with a connection. The mark match must be exact, and if you want to filter out unwanted flags from the connection mark before actually matching anything, you can specify a mask that will be ANDed with the connection mark. For example, if you have a connection mark set to 33 (100001 in binary) on a connection, and want to match the lowest bit only, you would be able to run something like --mark 1/1. The mask (000001) would be ANDed with 100001, so 100001 & 000001 equals 1, and the result is then matched against the 1.
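
The value/mask comparison described for --mark is ordinary bitwise arithmetic, and a few lines of Python show it quite well. This is just a model of the comparison, and the function name connmark_matches is an invention for this example. When no mask is given, I assume here that all 32 bits of the mark are compared.

```python
def connmark_matches(conn_mark: int, value: int,
                     mask: int = 0xFFFFFFFF, invert: bool = False) -> bool:
    """Model of '-m connmark --mark value/mask'.

    The mask is ANDed with the connection mark first, and the result
    must then be exactly equal to the given value.
    """
    return ((conn_mark & mask) == value) != invert


# Connection mark 33 is 0b100001. With --mark 1/1 only the lowest bit counts.
print(format(33, "b"))
print(connmark_matches(33, 1, mask=1))  # 0b100001 & 0b000001 == 1
print(connmark_matches(33, 1))          # without a mask, 33 != 1
```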

Conntrack match

The conntrack match is an extended version of the state match, which makes it possible to match packets in a much more granular way. It lets you look at information directly available in the connection tracking system, without any "frontend" systems, such as in the state match. For more information about the connection tracking system, take a look at The state machine chapter.

There are a number of different matches put together in the conntrack match, for several different fields in the connection tracking system. These are compiled together into the list below. To load these matches, you need to specify -m conntrack.

Table 10-12. Conntrack match options

Match --ctstate
Kernel 2.5 and 2.6
Example iptables -A INPUT -p tcp -m conntrack --ctstate RELATED
Explanation This match is used to match the state of a packet, according to the conntrack state. It is used to match pretty much the same states as in the original state match. The valid entries for this match are:
The entries can be used together with each other separated by a comma. For example, -m conntrack --ctstate ESTABLISHED,RELATED. It can also be inverted by putting a ! in front of --ctstate. For example: -m conntrack ! --ctstate ESTABLISHED,RELATED, which matches all but the ESTABLISHED and RELATED states.
Match --ctproto
Kernel 2.5 and 2.6
Example iptables -A INPUT -p tcp -m conntrack --ctproto TCP
Explanation This matches the protocol, the same as the --protocol does. It can take the same types of values, and is inverted using the ! sign. For example, -m conntrack ! --ctproto TCP matches all protocols but the TCP protocol.
Match --ctorigsrc
Kernel 2.5 and 2.6
Example iptables -A INPUT -p tcp -m conntrack --ctorigsrc
Explanation --ctorigsrc matches based on the original source IP specification of the conntrack entry that the packet is related to. The match can be inverted by using a ! between the --ctorigsrc and IP specification, such as --ctorigsrc ! It can also take a netmask of the CIDR form, such as --ctorigsrc
Match --ctorigdst
Kernel 2.5 and 2.6
Example iptables -A INPUT -p tcp -m conntrack --ctorigdst
Explanation This match is used exactly as the --ctorigsrc, except that it matches on the destination field of the conntrack entry. It has the same syntax in all other respects.
Match --ctreplsrc
Kernel 2.5 and 2.6
Example iptables -A INPUT -p tcp -m conntrack --ctreplsrc
Explanation The --ctreplsrc match is used to match based on the original conntrack reply source of the packet. Basically, this is the same as the --ctorigsrc, but instead we match the reply source expected of the upcoming packets. This match can, of course, be inverted and address a whole range of addresses, just the same as the previous matches in this class.
Match --ctrepldst
Kernel 2.5 and 2.6
Example iptables -A INPUT -p tcp -m conntrack --ctrepldst
Explanation The --ctrepldst match is the same as the --ctreplsrc match, with the exception that it matches the reply destination of the conntrack entry that matched the packet. It too can be inverted, and accept ranges, just as the --ctreplsrc match.
Match --ctstatus
Kernel 2.5 and 2.6
Example iptables -A INPUT -p tcp -m conntrack --ctstatus ASSURED
Explanation This matches the status of the connection, as described in The state machine chapter. It can match the following statuses.
• NONE - The connection has no status at all.
• EXPECTED - This connection is expected and was added by one of the expectation handlers.
• SEEN_REPLY - This connection has seen a reply but isn't assured yet.
• ASSURED - The connection is assured and will not be removed until it times out or the connection is closed by either end.
This can also be inverted by using the ! sign. For example -m conntrack ! --ctstatus ASSURED which will match all but the ASSURED status.
Match --ctexpire
Kernel 2.5 and 2.6
Example iptables -A INPUT -p tcp -m conntrack --ctexpire 100:150
Explanation This match is used to match on packets based on how long is left on the expiration timer of the conntrack entry, measured in seconds. It can either take a single value to match against, or a range such as in the example above. It can also be inverted by using the ! sign, such as this: -m conntrack ! --ctexpire 100. This will match every entry that does not have exactly 100 seconds left on its expiration timer.

Dscp match

This match is used to match on packets based on their DSCP (Differentiated Services Code Point) field. This is documented in RFC 2638 - A Two-bit Differentiated Services Architecture for the Internet. The match is explicitly loaded by specifying -m dscp. The match can take two mutually exclusive options, described below.

Table 10-13. Dscp match options

Match --dscp
Kernel 2.5 and 2.6
Example iptables -A INPUT -p tcp -m dscp --dscp 32
Explanation This option takes a DSCP value in either decimal or in hex. If the option value is in decimal, it would be written like 32 or 16, et cetera. If written in hex, it should be prefixed with 0x, like this: 0x20. It can also be inverted by using the ! character, like this: -m dscp ! --dscp 32.
Match --dscp-class
Kernel 2.5 and 2.6
Example iptables -A INPUT -p tcp -m dscp --dscp-class BE
Explanation The --dscp-class match is used to match on the DiffServ class of a packet. The values can be any of the BE, EF, AFxx or CSx classes as specified in the various RFCs. This match can be inverted just the same way as the --dscp option.

Note Please note that the --dscp and --dscp-class options are mutually exclusive and can not be used in conjunction with each other.
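
Since --dscp takes a number and --dscp-class takes a name, it can be handy to see how the two relate. The sketch below converts the class names mentioned above into their numeric DSCP values using the standard DiffServ encoding (CSx is x times 8, AFxy is 8x + 2y, EF is 46 and BE is 0). The function name dscp_class_value is hypothetical. This is not a lookup table taken from the iptables source.

```python
def dscp_class_value(name: str) -> int:
    """Return the numeric DSCP value for a class name such as CS4 or AF41."""
    name = name.upper()
    if name == "BE":                      # best effort
        return 0
    if name == "EF":                      # expedited forwarding
        return 46
    if len(name) == 3 and name.startswith("CS") and name[2] in "01234567":
        return int(name[2]) * 8           # class selector CS0..CS7
    if (len(name) == 4 and name.startswith("AF")
            and name[2] in "1234" and name[3] in "123"):
        # assured forwarding: class 1-4, drop precedence 1-3
        return int(name[2]) * 8 + int(name[3]) * 2
    raise ValueError("unknown DSCP class: " + name)


# CS4 is the same codepoint as --dscp 32, or 0x20 in hex.
print(dscp_class_value("CS4"), hex(dscp_class_value("CS4")))
print(dscp_class_value("AF41"))
```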

Ecn match

The ecn match is used to match on the different ECN fields in the TCP and IPv4 headers. ECN is described in detail in RFC 3168 - The Addition of Explicit Congestion Notification (ECN) to IP. The match is explicitly loaded by using -m ecn in the command line. The ecn match takes three different options as described below.

Table 10-14. Ecn match options

Match --ecn-tcp-cwr
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp -m ecn --ecn-tcp-cwr
Explanation This match is used to match the CWR (Congestion Window Reduced) bit, if it has been set. The CWR flag is set by an endpoint to notify the other endpoint of the connection that it has received an ECE, and that it has reacted to it. Per default this matches if the CWR bit is set, but the match may also be inverted using an exclamation point.
Match --ecn-tcp-ece
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp -m ecn --ecn-tcp-ece
Explanation This match can be used to match the ECE (ECN-Echo) bit. The ECE is set once one of the endpoints has received a packet with the CE bit set by a router. The endpoint then sets the ECE in the returning ACK packet, to notify the other endpoint that it needs to slow down. The other endpoint then sends a CWR packet as described in the --ecn-tcp-cwr explanation. This matches per default if the ECE bit is set, but may be inverted by using an exclamation point.
Match --ecn-ip-ect
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp -m ecn --ecn-ip-ect 1
Explanation The --ecn-ip-ect match is used to match the ECT (ECN Capable Transport) codepoints. The ECT codepoints have several types of usage. Mainly, they are used to negotiate if the connection is ECN capable by setting one of the two bits to 1. The ECT is also used by routers to indicate that they are experiencing congestion, by setting both ECT codepoints to 1. The ECT values are all available in the ECN Field in IP table below.
The match can be inverted using an exclamation point, for example ! --ecn-ip-ect 2 which will match all ECN values but the ECT(0) codepoint. The valid value range is 0-3 in iptables. See the table below for their values.

Table 10-15. ECN Field in IP

Iptables value ECT CE [Obsolete] RFC 2481 names for the ECN bits.
0 0 0 Not-ECT, ie. non-ECN capable connection.
1 0 1 ECT(1), New naming convention of ECT codepoints in RFC 3168.
2 1 0 ECT(0), New naming convention of ECT codepoints in RFC 3168.
3 1 1 CE (Congestion Experienced), Used to notify endpoints of congestion
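
The iptables values in the table above are simply the two lowest bits of the old TOS byte in the IPv4 header read as a number. A rough Python model of the --ecn-ip-ect comparison could look like this. The names ecn_ip_ect, ecn_ip_ect_matches and ECN_NAMES are made up for the example, and the TOS values used are hand-picked samples.

```python
# Names of the four ECN codepoints, indexed by the iptables value.
ECN_NAMES = {0: "Not-ECT", 1: "ECT(1)", 2: "ECT(0)", 3: "CE"}


def ecn_ip_ect(tos: int) -> int:
    """Extract the 2-bit ECN field from the IPv4 TOS byte."""
    return tos & 0x03


def ecn_ip_ect_matches(tos: int, wanted: int, invert: bool = False) -> bool:
    """Model of '-m ecn --ecn-ip-ect wanted' (or '! --ecn-ip-ect wanted')."""
    if not 0 <= wanted <= 3:
        raise ValueError("valid values are 0-3")
    return (ecn_ip_ect(tos) == wanted) != invert


# 0xB8 is DSCP 46 (EF) in the upper six bits with a Not-ECT ECN field.
print(ECN_NAMES[ecn_ip_ect(0xB8)])
# The same packet after a congested router has marked it.
print(ECN_NAMES[ecn_ip_ect(0xB8 | 0x03)])
```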

Hashlimit match

This is a modified version of the Limit match. Instead of just setting up a single token bucket, it sets up a hash table pointing to token buckets for each destination IP, source IP, destination port and source port tuple. For example, you can set it up so that every IP address can receive a maximum of 1000 packets per second, or you can say that every service on a specific IP address may receive a maximum of 200 packets per second. The hashlimit match is loaded by specifying the -m hashlimit keyword.

Each rule that uses the hashlimit match creates a separate hashtable which in turn has a specific max size and a maximum number of buckets. This hash table contains a hash of either a single or multiple values. The values can be any and/or all of destination IP, source IP, destination port and source port. Each entry then points to a token bucket that works like the one in the limit match.

Table 10-16. Hashlimit match options

Match --hashlimit
Kernel 2.6
Example iptables -A INPUT -p tcp --dst -m hashlimit --hashlimit 1000/sec --hashlimit-mode dstip,dstport --hashlimit-name hosts
Explanation The --hashlimit specifies the limit of each bucket. In this example the hashlimit is set to 1000. In this example, we have set up the hashlimit-mode to be dstip,dstport and destination Hence, for every port or service on the destination host, it can receive 1000 packets per second. This is the same setting as the limit option for the limit match. The limit can take a /sec, /minute, /hour or /day postfix. If no postfix is specified, the default postfix is per second.
Match --hashlimit-mode
Kernel 2.6
Example iptables -A INPUT -p tcp --dst -m hashlimit --hashlimit 1000/sec --hashlimit-mode dstip --hashlimit-name hosts
Explanation The --hashlimit-mode option specifies which values we should use as the hash values. In this example, we use only the dstip (destination IP) as the hashvalue. So, each host in the network will be limited to receiving a maximum of 1000 packets per second in this case. The possible values for the --hashlimit-mode are dstip (Destination IP), srcip (Source IP), dstport (Destination port) and srcport (Source port). All of these can also be separated by a comma sign to include more than one hashvalue, such as for example --hashlimit-mode dstip,dstport.
Match --hashlimit-name
Kernel 2.6
Example iptables -A INPUT -p tcp --dst -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts
Explanation This option specifies the name that this specific hash will be available as. It can be viewed inside the /proc/net/ipt_hashlimit directory. The example above would be viewable inside the /proc/net/ipt_hashlimit/hosts file. Only the filename should be specified.
Match --hashlimit-burst
Kernel 2.6
Example iptables -A INPUT -p tcp --dst -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts --hashlimit-burst 2000
Explanation This match is the same as the --limit-burst in that it sets the maximum size of the bucket. Each bucket will have a burst limit, which is the maximum amount of packets that can be matched during a single time unit. For an example on how a token bucket works, take a look at the Limit match.
Match --hashlimit-htable-size
Kernel 2.6
Example iptables -A INPUT -p tcp --dst -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts --hashlimit-htable-size 500
Explanation This sets the maximum available buckets to be used. In this example, it means that a maximum of 500 ports can be open and active at the same time.
Match --hashlimit-htable-max
Kernel 2.6
Example iptables -A INPUT -p tcp --dst -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts --hashlimit-htable-max 500
Explanation The --hashlimit-htable-max sets the maximum number of hashtable entries. This means all of the connections, including the inactive connections that don't require any token buckets for the moment.
Match --hashlimit-htable-gcinterval
Kernel 2.6
Example iptables -A INPUT -p tcp --dst -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts --hashlimit-htable-gcinterval 1000
Explanation This sets how often the garbage collection function should be run. Generally speaking this value should be lower than the expire value. The value is measured in milliseconds. If it is set too low it will take up unnecessary system resources and processing power, but if it is too high it can leave unused token buckets lying around for too long and make new connections impossible. In this example the garbage collector will run every second.
Match --hashlimit-htable-expire
Kernel 2.6
Example iptables -A INPUT -p tcp --dst -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts --hashlimit-htable-expire 10000
Explanation This value sets how long an idle hashtable entry may live before it expires. If a bucket has been unused for longer than this, it will be expired and the next garbage collection run will remove it from the hashtable, as well as all of the information pertaining to it.
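
To tie the options above together, here is a small Python model of the idea behind the hashlimit match: a hash table keyed on the fields chosen with --hashlimit-mode, where every entry points to its own token bucket, plus a garbage collection pass that drops idle entries. It is a simplified sketch and not the kernel implementation. The class names TokenBucket and HashLimit, the packet dictionaries and the explicit now timestamps (in seconds) are all assumptions made to keep the example deterministic.

```python
class TokenBucket:
    """A token bucket that refills at `rate` tokens per second up to `burst`."""

    def __init__(self, rate: float, burst: int, now: float):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # a new bucket starts out full
        self.last = now             # time of the last packet seen

    def allow(self, now: float) -> bool:
        # Refill for the time that has passed, never above the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class HashLimit:
    """One token bucket per distinct combination of the `mode` fields."""

    def __init__(self, rate_per_sec: float, burst: int,
                 mode=("dstip", "dstport"), expire: float = 10.0):
        self.rate = rate_per_sec
        self.burst = burst
        self.mode = mode
        self.expire = expire
        self.buckets = {}

    def allow(self, packet: dict, now: float) -> bool:
        key = tuple(packet[field] for field in self.mode)
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(self.rate, self.burst, now)
        return bucket.allow(now)

    def collect_garbage(self, now: float) -> int:
        """Remove buckets idle for longer than `expire`; return how many."""
        stale = [k for k, b in self.buckets.items() if now - b.last > self.expire]
        for key in stale:
            del self.buckets[key]
        return len(stale)


limiter = HashLimit(rate_per_sec=1, burst=2)
web = {"dstip": "192.0.2.1", "dstport": 80}   # example address, not from the text
print([limiter.allow(web, now=0.0) for _ in range(3)])
```

With a burst of 2 the third packet in the same instant is refused, while a packet to another port on the same host gets a bucket of its own and is unaffected.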

Helper match

This is a rather unorthodox match in comparison to the other matches, in the sense that it uses a slightly special syntax. The match is used to match packets based on which conntrack helper the packet is related to. For example, let's look at the FTP session. The Control session is opened up, and the ports/connection is negotiated for the Data session within the Control session. The ip_conntrack_ftp helper module will find this information, and create a related entry in the conntrack table. Now, when a packet enters, we can see which protocol it was related to, and we can match the packet in our ruleset based on which helper was used. The match is loaded by using the -m helper keyword.

Table 10-17. Helper match options

Match --helper
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp -m helper --helper ftp-21
Explanation The --helper option is used to specify a string value, telling the match which conntrack helper to match. In the basic form, it may look like --helper irc. This is where the syntax starts to differ from the normal syntax. We can also choose to match packets based on the port on which the original expectation was caught. For example, the FTP Control session is normally transferred over port 21, but it may as well be port 954 or any other port. We may then specify on which port the expectation should be caught, like --helper ftp-954.
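
As a rough illustration of the two forms described in the table, the sketch below builds one rule with only the helper name and one with a helper name and port. The function name ftp_helper_rules and the choice of the INPUT chain with an ACCEPT target are assumptions made for this example. The IPTABLES variable defaults to a dry run that only prints the commands, so nothing is loaded into the kernel unless you set it to the real iptables binary.

```shell
#!/bin/sh
# Dry run by default: every rule is only printed.
# Set IPTABLES=iptables (as root) to really load the rules.
IPTABLES="${IPTABLES:-echo iptables}"

ftp_helper_rules() {
    # Basic form: match on the helper name only.
    $IPTABLES -A INPUT -p tcp -m helper --helper ftp -j ACCEPT
    # Helper name plus port: only expectations caught on a
    # Control session running on port 954.
    $IPTABLES -A INPUT -p tcp -m helper --helper ftp-954 -j ACCEPT
}
```

Calling ftp_helper_rules in dry-run mode prints the two iptables commands, which makes it easy to review them before loading anything.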

IP range match

The IP range match is used to match IP ranges, just as the --source and --destination matches are able to do. However, this match adds a different kind of matching, in the sense that it is able to match in the manner of from IP - to IP, which the --source and --destination matches are unable to do. This may be needed in some specific network setups, and it is a bit more flexible. The IP range match is loaded by using the -m iprange keyword.

Table 10-18. IP range match options

Match --src-range
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp -m iprange --src-range
Explanation This matches a range of source IP addresses. The range is inclusive, so it covers every single IP address from the first address given to the last one. The match may also be inverted by adding an !. The rule would then be written with -m iprange ! --src-range in front of the range, and would match every single IP address except the ones specified.
Match --dst-range
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp -m iprange --dst-range
Explanation The --dst-range match works exactly the same as the --src-range match, except that it matches destination IP addresses instead of source IP addresses.
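
To make the from IP - to IP idea concrete, here is a small shell sketch of the comparison that such a range implies. It is only a model for illustration and not the code used by the iprange match; the function names ip_to_int and in_iprange, and the addresses used in the usage note, are made up for this example.

```shell
#!/bin/sh
# ip_to_int ADDRESS: print a dotted quad as one integer.
# The body runs in a subshell, so the changed IFS does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# in_iprange ADDRESS FIRST-LAST: succeed if ADDRESS lies inside the
# range, with both end points included.
in_iprange() {
    addr=$(ip_to_int "$1")
    first=$(ip_to_int "${2%-*}")
    last=$(ip_to_int "${2#*-}")
    [ "$addr" -ge "$first" ] && [ "$addr" -le "$last" ]
}
```

For example, in_iprange 192.168.1.200 192.168.1.13-192.168.2.19 succeeds, while in_iprange 192.168.2.20 192.168.1.13-192.168.2.19 fails. A range like this one cannot be written as a single network/netmask pair, which is exactly the case that the --source and --destination matches are unable to handle.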

Length match

The length match is used to match packets based on their length. It is very simple. If you want to limit packet length for some strange reason, or want to block ping-of-death-like behaviour, use the length match.

Table 10-19. Length match options

Match --length
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp -m length --length 1400:1500
Explanation The example --length will match all packets with a length between 1400 and 1500 bytes. The match may also be inverted using the ! sign, like this: -m length ! --length 1400:1500. It may also be used to match only a specific length, by removing the : sign and everything after it, like this: -m length --length 1400. The range matching is, of course, inclusive, which means that it includes the two values you specify as well as all packet lengths in between them.

Limit match

The limit match extension must be loaded explicitly with the -m limit option. This match can, for example, be used to give limited logging of specific rules. You could use it to match all packets that do not exceed a given rate, and once this rate has been exceeded, limit logging of the event in question. Think of a time limit: you could limit how many times a certain rule may be matched in a certain time frame, for example to lessen the effects of DoS SYN flood attacks. This is its main usage, but there are more usages, of course. The limit match may also be inverted by adding a ! flag in front of the limit match. It would then be expressed as -m limit ! --limit 5/s. This means that all packets will be matched after they have broken the limit.

To further explain the limit match, it is basically a token bucket filter. Consider having a leaky bucket where the bucket leaks X packets per time-unit. X is defined by how many matching packets we get, so if we get 3 packets, the bucket leaks 3 packets during that time-unit. The --limit option tells us how many tokens to refill the bucket with per time-unit, while the --limit-burst option tells us how big the bucket is in the first place. So, if we set --limit 3/minute --limit-burst 5 and then receive 5 matches, the bucket is empty. After 20 seconds, the bucket is refilled with another token, and so on until the --limit-burst is reached again or until the tokens get used.

Consider the example below for further explanation of how this may look.

We set a rule with -m limit --limit 5/second --limit-burst 10. The limit-burst token bucket is set to 10 initially. Each packet that matches the rule uses a token.

We get 10 packets that match, 1-2-3-4-5-6-7-8-9-10, all within 1/1000 of a second.

The token bucket is now empty. Once the token bucket is empty, the packets that qualify for the rule otherwise no longer match the rule and proceed to the next rule if any, or hit the chain policy.

For each 1/5 s without a matching packet, the token count goes up by 1, up to a maximum of 10. 1 second after receiving the 10 packets, we will once again have 5 tokens left.

And of course, the bucket will be emptied by 1 token for each packet it receives.
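
The walk-through above can be reproduced with a small simulation. This is a simplified model written for illustration, not the kernel's implementation. The function name limit_sim and the use of millisecond arrival times are assumptions made for this sketch. To stay within integer arithmetic, the bucket is counted in scaled units, where one token costs period_ms units and every millisecond adds rate units.

```shell
#!/bin/sh
# limit_sim BURST RATE PERIOD_MS ARRIVAL_MS...
# Models --limit RATE per PERIOD_MS with --limit-burst BURST and prints
# "<time> match" or "<time> miss" for each packet arrival time given.
limit_sim() {
    burst=$1; rate=$2; period_ms=$3
    shift 3
    cap=$(( burst * period_ms ))   # a full bucket, in scaled units
    credit=$cap                    # the bucket starts out full
    last=0
    for t in "$@"; do
        # Refill for the time that has passed, never above the cap.
        credit=$(( credit + (t - last) * rate ))
        if [ "$credit" -gt "$cap" ]; then credit=$cap; fi
        last=$t
        if [ "$credit" -ge "$period_ms" ]; then
            credit=$(( credit - period_ms ))   # the packet uses one token
            echo "$t match"
        else
            echo "$t miss"                     # bucket empty, rule not matched
        fi
    done
}
```

Running limit_sim 10 5 1000 0 0 0 0 0 0 0 0 0 0 0 200 models --limit 5/second --limit-burst 10 with eleven packets arriving at once and one more after 200 ms. The first ten packets match, the eleventh misses because the bucket is empty, and the packet at 200 ms matches again since one token has been refilled by then.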

Table 10-20. Limit match options

Match --limit
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -m limit --limit 3/hour
Explanation This sets the maximum average match rate for the limit match. You specify it with a number and an optional time unit. The following time units are currently recognized: /second /minute /hour /day. The default value here is 3 per hour, or 3/hour. This tells the limit match how many times to allow the match to occur per time unit (e.g. per minute).
Match --limit-burst
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -m limit --limit-burst 5
Explanation This is the setting for the burst limit of the limit match. It tells iptables the maximum number of tokens available in the bucket when we start, or when the bucket is full. This number gets decremented by one for every packet that matches, until the bucket is empty. The bucket is refilled at the rate specified by the --limit option. The default --limit-burst value is 5. For a simple way of checking out how this works, you can use the example Limit-match.txt one-rule-script. Using this script, you can see for yourself how the limit rule works, by simply sending ping packets at different intervals and in different burst numbers. Once the burst value has been exceeded, all echo replies will be blocked until the bucket has been refilled at the limit rate.

Mac match

The MAC (Ethernet Media Access Control) match can be used to match packets based on their MAC source address. As of writing this documentation, this match is a little bit limited, since it can only match on the source MAC address. In the future it may evolve and become more useful.

Note To use this module, we must load it explicitly with the -m mac option. The reason that I am saying this is that a lot of people wonder if it should not be -m mac-source, which it should not.

Table 10-21. Mac match options

Match --mac-source
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -m mac --mac-source 00:00:00:00:00:01
Explanation This match is used to match packets based on their MAC source address. The MAC address specified must be in the form XX:XX:XX:XX:XX:XX, else it will not be legal. The match may be reversed with an ! sign and would look like --mac-source ! 00:00:00:00:00:01. This would in other words reverse the meaning of the match, so that all packets except packets from this MAC address would be matched. Note that since MAC addresses are only used on Ethernet type networks, this match will only be possible to use for Ethernet interfaces. The MAC match is only valid in the PREROUTING, FORWARD and INPUT chains and nowhere else.

Mark match

The mark match extension is used to match packets based on the marks they have set. A mark is a special field, only maintained within the kernel, that is associated with the packets as they travel through the computer. Marks may be used by different kernel routines for such tasks as traffic shaping and filtering. As of today, there is only one way of setting a mark in Linux, namely the MARK target in iptables. This was previously done with the FWMARK target in ipchains, and this is why people still refer to FWMARK in advanced routing areas. The mark field is currently set to an unsigned integer, or 4294967296 possible values on a 32 bit system. In other words, you are probably not going to run into this limit for quite some time.

Table 10-22. Mark match options

Match --mark
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -t mangle -A INPUT -m mark --mark 1
Explanation This match is used to match packets that have previously been marked. Marks can be set with the MARK target which we will discuss in the next section. All packets traveling through Netfilter get a special mark field associated with them. Note that this mark field is not in any way propagated, within or outside the packet. It stays inside the computer that made it. If the mark field matches the mark, it is a match. The mark field is an unsigned integer, hence there can be a maximum of 4294967296 different marks. You may also use a mask with the mark. The mark specification would then look like, for example, --mark 1/1. If a mask is specified, the mark field of the packet is logically ANDed with the mask before the actual comparison with the mark specified.
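
The effect of the mask is easiest to see with a few numbers. The following sketch models the comparison in plain shell arithmetic. The function name mark_matches is made up for this example, and the default mask of 0xffffffff (all bits compared) is an assumption about how a specification without a mask behaves.

```shell
#!/bin/sh
# mark_matches PACKET_MARK VALUE[/MASK]
# Succeeds if (PACKET_MARK AND MASK) equals VALUE.
mark_matches() {
    pkt=$1
    value=${2%/*}
    case $2 in
        */*) mask=${2#*/} ;;
        *)   mask=0xffffffff ;;   # assumed default: compare all bits
    esac
    [ $(( $pkt & $mask )) -eq $(( $value )) ]
}
```

With this model, mark_matches 3 1 fails because 3 is not equal to 1, while mark_matches 3 1/1 succeeds because only the lowest bit is compared. This is what makes it possible to keep several independent flags in one mark field.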

Multiport match

The multiport match extension can be used to specify multiple source or destination ports and port ranges in a single rule. Without the possibility this match gives, you would have to use multiple rules of the same type, just to match different ports.

Note You can not use both standard port matching and multiport matching at the same time, for example you can't write: --sport 1024:63353 -m multiport --dport 21,23,80. This will simply not work. What in fact happens, if you do, is that iptables honors the first element in the rule, and ignores the multiport instruction.

Table 10-23. Multiport match options

Match --source-port
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp -m multiport --source-port 22,53,80,110
Explanation This match matches multiple source ports. A maximum of 15 separate ports may be specified. The ports must be comma delimited, as in the above example. The match may only be used in conjunction with the -p tcp or -p udp matches. It is mainly an enhanced version of the normal --source-port match.
Match --destination-port
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp -m multiport --destination-port 22,53,80,110
Explanation This match is used to match multiple destination ports. It works exactly the same way as the above mentioned source port match, except that it matches destination ports. It too has a limit of 15 ports and may only be used in conjunction with -p tcp and -p udp.
Match --port
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A INPUT -p tcp -m multiport --port 22,53,80,110
Explanation This match extension can be used to match packets based both on their destination port and their source port. It works the same way as the --source-port and --destination-port matches above. It can take a maximum of 15 ports and can only be used in conjunction with -p tcp and -p udp. Note that the --port match will only match packets coming in from and going to the same port, for example, port 80 to port 80, port 110 to port 110 and so on.
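
The following sketch shows the saving in practice by building the same policy twice, once with one rule per port and once with a single multiport rule. The function names, the ACCEPT target and the port list are assumptions made for this example. It uses --dports, the spelling listed for this option in the iptables manual page. As before, IPTABLES defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Dry run by default: every rule is only printed.
# Set IPTABLES=iptables (as root) to really load the rules.
IPTABLES="${IPTABLES:-echo iptables}"

# Without multiport: one rule for each destination port.
single_port_rules() {
    for port in 22 53 80 110; do
        $IPTABLES -A INPUT -p tcp --dport "$port" -j ACCEPT
    done
}

# With multiport: the same four ports in a single rule.
multiport_rule() {
    $IPTABLES -A INPUT -p tcp -m multiport --dports 22,53,80,110 -j ACCEPT
}
```

In dry-run mode single_port_rules prints four commands and multiport_rule prints one. Fewer rules means fewer comparisons for each packet that traverses the chain.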

Owner match

The owner match extension is used to match packets based on the identity of the process that created them. The owner can be specified as the ID of the user who issued the command in question, the ID of the group, the process or the session, or as the name of the command itself. This extension was originally written as an example of what iptables could be used for. The owner match only works within the OUTPUT chain, for obvious reasons: it is pretty much impossible to find out any information about the identity of the instance that sent a packet from the other end, or where there is an intermediate hop to the real destination. Even within the OUTPUT chain it is not very reliable, since certain packets may not have an owner. Notorious packets of that sort are (among other things) the different ICMP responses. ICMP responses will never match.

Table 10-24. Owner match options

Match --cmd-owner
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A OUTPUT -m owner --cmd-owner httpd
Explanation This is the command owner match, and is used to match based on the command name of the process that is sending the packet. In the example, httpd is matched. This match may also be inverted by using an exclamation sign, for example -m owner ! --cmd-owner ssh.
Match --uid-owner
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A OUTPUT -m owner --uid-owner 500
Explanation This packet match will match if the packet was created by the given User ID (UID). This could be used to match outgoing packets based on who created them. One possible use would be to block any other user than root from opening new connections outside your firewall. Another possible use could be to block everyone but the http user from sending packets from the HTTP port.
Match --gid-owner
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A OUTPUT -m owner --gid-owner 0
Explanation This match is used to match all packets based on their Group ID (GID). This means that we match all packets based on what group the user creating the packets is in. This could be used to block all but the users in the network group from getting out onto the Internet or, as described above, only to allow members of the http group to create packets going out from the HTTP port.
Match --pid-owner
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A OUTPUT -m owner --pid-owner 78
Explanation This match is used to match packets based on the Process ID (PID) that was responsible for them. This match is a bit harder to use, but one example would be only to allow PID 94 to send packets from the HTTP port (if the HTTP process is not threaded, of course). Alternatively we could write a small script that grabs the PID from a ps output for a specific daemon and then adds a rule for it. For an example, you could have a rule as shown in the Pid-owner.txt example.
Match --sid-owner
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A OUTPUT -m owner --sid-owner 100
Explanation This match is used to match packets based on the Session ID used by the program in question. The value of the SID, or Session ID of a process, is that of the process itself and all processes resulting from the originating process. These latter could be threads, or a child of the original process. So, for example, all of our HTTPD processes should have the same SID as their parent process (the originating HTTPD process), if our HTTPD is threaded (most HTTPDs are, Apache and Roxen for instance). To show this in example, we have created a small script called Sid-owner.txt. This script could possibly be run every hour or so together with some extra code to check if the HTTPD is actually running and start it again if necessary, then flush and re-enter our OUTPUT chain if needed.

Note The pid, sid and command matching is broken in SMP kernels, since they use different process lists for each processor. It might be fixed in the future, however.

Packet type match

The packet type match is used to match packets based on their type, that is, whether they are destined for a single host, for everyone, or for a specific group of machines or users. These three groups are generally called unicast, broadcast and multicast, as discussed in the TCP/IP repetition chapter. The match is loaded by using -m pkttype.

Table 10-25. Packet type match options

Match --pkt-type
Kernel 2.3, 2.4, 2.5 and 2.6
Example iptables -A OUTPUT -m pkttype --pkt-type unicast
Explanation The --pkt-type match is used to tell the packet type match which packet type to match. It can take either unicast, broadcast or multicast as an argument, as in the example. It can also be inverted by using a ! like this: -m pkttype --pkt-type ! broadcast, which will match all other packet types.

Realm match

The realm match is used to match packets based on the routing realm that they are part of. Routing realms are used in Linux for complex routing scenarios and setups such as when using BGP et cetera. The realm match is loaded by adding the -m realm keyword to the commandline.

A routing realm is used in Linux to classify routes into logical groups. In most dedicated routers today, the Routing Information Base (RIB) and the forwarding engine are very close to each other, inside the kernel for example. Since Linux isn't really a dedicated routing system, it has been forced to separate its RIB and Forwarding Information Base (FIB). The RIB lives in userspace and the FIB lives inside kernelspace. Because of this separation, it becomes quite resource-heavy to do quick searches in the RIB. The routing realm is the Linux solution to this, and it actually makes the system more flexible and richer.

The Linux realms can be used together with BGP and other routing protocols that deliver huge amounts of routes. The routing daemon can then sort the routes by their prefix, aspath or source, for example, and put them in different realms. The realm is numeric, but can also be named through the /etc/iproute2/rt_realms file.

Table 10-26. Realm match options

Match --realm
Kernel 2.6
Example iptables -A OUTPUT -m realm --realm 4
Explanation This option matches the realm number and optionally a mask. If the value given is not a number, the match will try to resolve the realm name through the /etc/iproute2/rt_realms file. If a named realm is used, no mask may be used. The match may also be inverted by setting an exclamation sign, for example --realm ! cosmos.
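
To give an idea of what resolving a named realm involves, here is a minimal sketch that looks a name up in a file of the rt_realms kind. It assumes that each non-comment line holds a realm number followed by its name. The function name realm_id and the realm name upstream in the usage note are made up for this example, and the sketch is not the code iptables itself uses.

```shell
#!/bin/sh
# realm_id NAME [FILE]
# Prints the numeric realm for NAME and exits non-zero if it is unknown.
# FILE defaults to /etc/iproute2/rt_realms.
realm_id() {
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }                   # skip comment lines
        $2 == name { print $1; found = 1; exit }    # first hit wins
        END { exit !found }
    ' "${2:-/etc/iproute2/rt_realms}"
}
```

With a file that contains the line "4 upstream", realm_id upstream prints 4, so --realm upstream and --realm 4 would refer to the same realm.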

Recent match

The recent match is a rather large and complex matching system, which allows us to match packets based on recent events that we have previously matched. For example, if we see an outgoing IRC connection, we could add the IP address to a list of hosts, and have another rule that allows identd requests back from the IRC server within 15 seconds of seeing the original packet.

Before we can take a closer look at the match options, let's try and explain a little bit how it works. First of all, we use several different rules to accomplish the use of the recent match. The recent match uses several different lists of recent events. The default list being used is the DEFAULT list. We create a new entry in a list with the set option, so once a rule is entirely matched (the set option is always a match), we also add an entry in the recent list specified. The list entry contains a timestamp, and the source IP address used in the packet that triggered the set option. Once this has happened, we can use a series of different recent options to match on this information, as well as update the entry's timestamp, et cetera.

Finally, if we would for some reason want to remove a list entry, we would do this using the --remove match option from the recent match. All rules using the recent match, must load the recent module (-m recent) as usual. Before we go on with an example of the recent match, let's take a look at all the options.

Table 10-27. Recent match options

Match --name
Kernel 2.4, 2.5 and 2.6
Example iptables -A OUTPUT -m recent --name examplelist
Explanation The name option gives the name of the list to use. Per default the DEFAULT list is used, which is probably not what we want if we are using more than one list.
Match --set
Kernel 2.4, 2.5 and 2.6
Example iptables -A OUTPUT -m recent --set
Explanation This creates a new list entry in the named recent list, which contains a timestamp and the source IP address of the host that triggered the rule. This match will always return success, unless it is preceded by a ! sign, in which case it will return failure.
Match --rcheck
Kernel 2.4, 2.5 and 2.6
Example iptables -A OUTPUT -m recent --name examplelist --rcheck
Explanation The --rcheck option will check if the source IP address of the packet is in the named list. If it is, the match will return true, otherwise it returns false. The option may be inverted by using the ! sign. In the latter case, it will return true if the source IP address is not in the list, and false if it is in the list.
Match --update
Kernel 2.4, 2.5 and 2.6
Example iptables -A OUTPUT -m recent --name examplelist --update
Explanation This match is true if the source address is available in the specified list, and it also updates the last-seen time in the list. This match may also be reversed by setting the ! mark in front of the match. For example, ! --update.
Match --remove
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -m recent --name example --remove
Explanation This match will try to find the source address of the packet in the list, and returns true if the address is there. It will also remove the corresponding list entry from the list. This match may also be inverted with the ! sign.
Match --seconds
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -m recent --name example --rcheck --seconds 60
Explanation This match is only valid together with the --rcheck and --update matches. The --seconds match is used to specify how long ago the "last seen" column in the recent list may have been updated. If the last seen column is older than this number of seconds, the match returns false. Other than this the recent match works as normal, so the source address must still be in the list for a true return of the match.
Match --hitcount
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -m recent --name example --rcheck --hitcount 20
Explanation The --hitcount match must be used together with the --rcheck or --update matches, and it will limit the match to only include addresses from which at least the hitcount number of packets have been seen. If this match is used together with the --seconds match, it will require the specified number of packets to have been seen within the specified timeframe. This match may also be reversed by adding a ! sign in front of the match. Together with the --seconds match, this means that at most this number of packets may have been seen during the specified timeframe. If both of the matches are inverted, then at most this number of packets may have been seen during the last minimum of seconds.
Match --rttl
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -m recent --name example --rcheck --rttl
Explanation The --rttl match is used to verify that the TTL value of the current packet is the same as the original packet that was used to set the original entry in the recent list. This can be used to verify that people are not spoofing their source address to deny others access to your servers by making use of the recent match.
Match --rsource
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -m recent --name example --rsource
Explanation The --rsource match is used to tell the recent match to save the source address in the recent list. This is the default behavior of the recent match.
Match --rdest
Kernel 2.4, 2.5 and 2.6
Example iptables -A INPUT -m recent --name example --rdest
Explanation The --rdest match is the opposite of the --rsource match, in that it tells the recent match to save the destination address to the recent list.
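
The IRC and identd scenario mentioned at the start of this section could be sketched roughly as below, using the options from the table. The list name ircservers, the IRC port 6667, the identd port 113 and the function name irc_ident_rules are assumptions made for this example, and a real ruleset would need more rules around these two. IPTABLES defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Dry run by default: every rule is only printed.
# Set IPTABLES=iptables (as root) to really load the rules.
IPTABLES="${IPTABLES:-echo iptables}"

irc_ident_rules() {
    # Outgoing IRC connection: remember the server we connect to.
    # The server is the destination here, hence --rdest.
    $IPTABLES -A OUTPUT -p tcp --dport 6667 --syn \
        -m recent --name ircservers --rdest --set -j ACCEPT
    # Incoming identd request: accept it only if its source address
    # was put in the list during the last 15 seconds.
    $IPTABLES -A INPUT -p tcp --dport 113 --syn \
        -m recent --name ircservers --rsource --rcheck --seconds 15 -j ACCEPT
}
```

The first rule uses --set, which always matches, so the ACCEPT target is reached and the list entry is created in one go. The second rule only matches while the entry is fresh, which is what limits the identd window to 15 seconds.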

I have created a small sample script of how the recent match can be used, which you can find in the Recent-match.txt section.

Briefly, this is a poor replacement for the state engine available in netfilter. This version was created with an HTTP server in mind, but will work with any TCP connection. First we have created two chains named http-recent and http-recent-final. The http-recent chain is used in the starting stages of the connection, and for the actual data transmission, while the http-recent-final chain is used for the last and final FIN/ACK, FIN handshake.

Warning! This is a very bad replacement for the built in state engine and can not handle all of the possibilities that the state engine can handle. However, it is a good example of what can be done with the recent match without being too specific. Do not use this example in a real world environment. It is slow, handles special cases badly, and should generally never be used more than as an example.

For example, it does not handle closed ports on connection, asynchronous FIN handshake (where one of the connected parties closes down, while the other continues to send data), etc.

Let's follow a packet through the example ruleset. First a packet enters the INPUT chain, and we send it to the http-recent chain.

The first packet should be a SYN packet, and should not have the ACK, FIN or RST bits set. Hence it is matched using the --tcp-flags SYN,ACK,FIN,RST SYN line. At this point we add the connection to the httplist using the -m recent --name httplist --set line. Finally we accept the packet.

After the first packet we should receive a SYN/ACK packet to acknowledge that the SYN packet was received. This can be matched using the --tcp-flags SYN,ACK,FIN,RST SYN,ACK line. FIN and RST should be illegal at this point as well. At this point we update the entry in the httplist using -m recent --name httplist --update and finally we ACCEPT the packet.

By now we should get a final ACK packet, from the original creator of the connection, to acknowledge the SYN/ACK sent by the server. SYN, FIN and RST are illegal at this point of the connection, so the line should look like --tcp-flags SYN,ACK,FIN,RST ACK. We update the list in exactly the same way as in the previous step, and ACCEPT it.

At this point the data transmission can start. The connection should never contain any SYN packet now, but it will contain ACK packets to acknowledge the data packets that are sent. Each time we see any packet like this, we update the list and ACCEPT the packets.

The transmission can be ended in two ways. The simplest is the RST packet: RST will simply reset the connection and it will die. The other is the FIN/ACK packet. With FIN/ACK, the other endpoint answers with a FIN, and this closes down the connection so that the original source of the FIN/ACK can no longer send any data. The receiver of the FIN will still be able to send data, hence we send the connection to a "final" stage chain to handle the rest.

In the http-recent-final chain we check if the packet is still in the httplist, and if so, we send it to the http-recent-final1 chain. In that chain we remove the connection from the httplist and add it to the http-recent-final list instead. If the connection has already been removed and moved over to the http-recent-final list, we send the packet to the http-recent-final2 chain.

In the final http-recent-final2 chain, we wait for the non-closed side to finish sending its data, and to close the connection from their side as well. Once this is done, the connection is completely removed.

As you can see the recent list can become quite complex, but it will give you a huge set of possibilities if need be. Still, try and remember not to reinvent the wheel. If the ability you need is already implemented, try and use it instead of trying to create your own solution.

State match

The state match extension is used in conjunction with the connection tracking code in the kernel. The state match accesses