
Linux Kernel Networking

Rami Rosen
Haifux, August 2007

Disclaimer
Everything in this lecture shall not, under any
circumstances, hold any legal liability whatsoever.
Any usage of the data and information in this document
shall be solely on the responsibility of the user.
This lecture is not given on behalf of any company
or organization.

Warning

This lecture covers design and functional description side by side with many implementation details; some knowledge of C is preferred.

General

The Linux networking kernel code (including network device drivers) is a large part of the Linux kernel code.

Scope: we will not deal with wireless, IPv6, or multicasting, nor with user space routing daemons/apps or security attacks (such as DoS and spoofing).

Understanding a packet walkthrough in the kernel is a key to understanding kernel networking. It is a must if we want to understand Netfilter or IPSec internals, and more.

There is a 10-page Linux kernel networking walkthrough document which was written at a university (see 1 in the list of links).

General - Contd.

Though it deals with 2.4.20 Linux kernel, most of it is relevant.

This lecture will concentrate on this walkthrough (design and implementation details).

References to code in this lecture are based on linux-2.6.23-rc2.

There was some serious cleanup in 2.6.23



Hierarchy of networking layers

The layers that we will deal with (based on the 7 layers model) are:
Link Layer (L2) (ethernet)
Network Layer (L3) (ip)
Transport Layer (L4) (udp,tcp...)

Networking Data Structures

The two most important structures of the Linux kernel network layer are:

sk_buff (defined in include/linux/skbuff.h)

net_device (defined in include/linux/netdevice.h)

It is better to know a bit about them before delving into the walkthrough code.

SK_BUFF

sk_buff represents data and headers.

sk_buff API (examples)

sk_buff allocation is done with alloc_skb() or dev_alloc_skb(); drivers use dev_alloc_skb(). (Freeing is done with kfree_skb() and dev_kfree_skb().)

unsigned char *data: points to the current header.

skb_pull(struct sk_buff *skb, unsigned int len) removes data from the start of a buffer by advancing data to data+len and by decreasing len.

sk_buff instances almost always appear as skb in the kernel code.
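The pointer arithmetic behind skb_pull() can be sketched in user space. This is only an illustrative model: toy_skb and toy_skb_pull() are hypothetical names, and the struct keeps just the two fields that the pull operation touches, not the real sk_buff layout.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for struct sk_buff: only the two
 * fields that a pull operation touches are modelled. */
struct toy_skb {
    unsigned char *data;  /* start of the header currently being processed */
    unsigned int len;     /* number of valid bytes starting at data */
};

/* Strip len bytes from the front of the buffer: advance data and shrink
 * len. Returns the new data pointer, or NULL (leaving the buffer
 * untouched) when fewer than len bytes are available. */
static unsigned char *toy_skb_pull(struct toy_skb *skb, unsigned int len)
{
    if (len > skb->len)
        return NULL;
    skb->len -= len;
    skb->data += len;
    return skb->data;
}
```

Calling toy_skb_pull(&skb, 14) on a freshly received Ethernet frame would leave data pointing at the L3 header, which is the effect described for eth_type_trans() later in this lecture.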

SK_BUFF - contd

sk_buff includes three header members (these were unions before 2.6.22); each corresponds to a kernel network layer:

transport_header (previously called h) for layer 4, the transport layer (can refer to a TCP header, a UDP header, an ICMP header, and more).

network_header (previously called nh) for layer 3, the network layer (can refer to an IP header, an IPv6 header, or an ARP header).

mac_header (previously called mac) for layer 2, the link layer.

skb_network_header(skb), skb_transport_header(skb) and skb_mac_header(skb) return a pointer to the corresponding header.

SK_BUFF - contd.

struct dst_entry *dst: the route for this sk_buff; this route is determined by the routing subsystem.

It has 2 important function pointers:

int (*input)(struct sk_buff*);

int (*output)(struct sk_buff*);

input() can be assigned to one of the following: ip_local_deliver, ip_forward, ip_mr_input, ip_error or dst_discard_in.

output() can be assigned to one of the following: ip_output, ip_mc_output, ip_rt_bug, or dst_discard_out.

We will deal more with dst when talking about routing.
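The role of the two function pointers can be illustrated with a small user-space model. Everything prefixed with toy_ is hypothetical and stands in for the kernel structures and handlers; the sketch only shows that the routing decision is stored as a function pointer and that the dispatch is a plain indirect call.

```c
#include <stddef.h>

struct toy_skb;

/* Hypothetical, reduced dst_entry: just the two function pointers. */
struct toy_dst_entry {
    int (*input)(struct toy_skb *);
    int (*output)(struct toy_skb *);
};

/* Hypothetical, reduced sk_buff: only the route it carries. */
struct toy_skb {
    struct toy_dst_entry *dst;
};

enum { TOY_LOCAL_DELIVER = 1, TOY_FORWARD = 2, TOY_OUTPUT = 3 };

/* Stand-ins for ip_local_deliver(), ip_forward() and ip_output(). */
static int toy_local_deliver(struct toy_skb *skb) { (void)skb; return TOY_LOCAL_DELIVER; }
static int toy_forward(struct toy_skb *skb)       { (void)skb; return TOY_FORWARD; }
static int toy_output(struct toy_skb *skb)        { (void)skb; return TOY_OUTPUT; }

/* The dispatchers are thin wrappers around the function pointers. */
static int toy_dst_input(struct toy_skb *skb)  { return skb->dst->input(skb); }
static int toy_dst_output(struct toy_skb *skb) { return skb->dst->output(skb); }
```

With this shape, the code that receives a packet never needs to know whether the packet is local or forwarded; the routing lookup has already recorded that choice in the route attached to the buffer.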



SK_BUFF - contd.

In the usual case, there is only one dst_entry for every skb.

When using IPSec, there is a linked list of dst_entries and only the last one is for routing; all other dst_entries are for IPSec transformers; these other dst_entries have the DST_NOHASH flag set.

tstamp (of type ktime_t): time stamp of receiving the packet.

net_enable_timestamp() must be called in order to get values.



net_device

net_device represents a network interface card.

There are cases when we work with virtual devices.

For example, bonding (setting the same IP for two or more NICs, for load balancing and for high availability).

Many times this is implemented using the private data of the device (the void *priv member of net_device).

In OpenSolaris there is a special pseudo driver called vnic which enables bandwidth allocation (project CrossBow).

Important members:

net_device - contd

unsigned int mtu: Maximum Transmission Unit, the maximum size of frame the device can handle.

Each protocol has an MTU of its own; the default is 1500 for Ethernet.

You can change the MTU with ifconfig, for example:

ifconfig eth0 mtu 1400

You cannot, of course, change it to values higher than 1500 on a 10Mb/s network:

ifconfig eth0 mtu 1501 will give:

SIOCSIFMTU: Invalid argument



net_device - contd

unsigned int flags (which you see or set using the ifconfig utility): for example, RUNNING or NOARP.

unsigned char dev_addr[MAX_ADDR_LEN]: the MAC address of the device (6 bytes).

int (*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);

a pointer to the device transmit method.

int promiscuity: a counter of the times a NIC is told to work in promiscuous mode; used to enable more than one sniffing client.

net_device - contd

You are likely to encounter macros starting with IN_DEV, like IN_DEV_FORWARD() or IN_DEV_RX_REDIRECTS(). How are they related to net_device? How are these macros implemented?

void *ip_ptr: IPv4 specific data. This pointer is assigned to a pointer to in_device in inetdev_init() (net/ipv4/devinet.c).

net_device - Contd.

struct in_device has a member named cnf (an instance of ipv4_devconf). Setting /proc/sys/net/ipv4/conf/all/forwarding to 1 eventually sets the forwarding member of in_device to 1. The same is true of accept_redirects and send_redirects; both are also members of cnf (ipv4_devconf).

In most distros, /proc/sys/net/ipv4/conf/all/forwarding=0.

But this is probably not so on your ADSL router.



network interface drivers

Most NICs are PCI devices; there are also some USB network devices.

The drivers for network PCI devices use the generic PCI calls, like pci_register_driver() and pci_enable_device().

For more info on NIC drivers see the article Writing Network Device Driver for Linux (link no. 9 in links) and chapter 17 in LDD3.

There are two modes in which a NIC can receive a packet.

The traditional way is interrupt-driven: each received packet is an asynchronous event which causes an interrupt.

NAPI

NAPI (New API).

The NIC works in polling mode.

For the NIC to work in polling mode, the driver should be built with the proper flag.

Most of the new drivers support this feature.

When working with NAPI under very high load, packets are lost, but this occurs before they are fed into the network stack (in a non-NAPI driver they pass into the stack).

In Solaris, polling is built into the kernel (no need to build drivers in any special way).

User Space Tools

iputils (including ping, arping, and more)

net-tools (ifconfig, netstat, route, arp and more)

IPROUTE2 (ip command with many options)

Uses rtnetlink API.

Has much wider functionality; for example, you can create tunnels with the ip command.

Note: no need for the -n flag when using IPROUTE2 (because it does not work with DNS).

Routing Subsystem

The routing table and the routing cache enable us to find the net device and the address of the host to which a packet will be sent.

Reading entries in the routing table is done by calling fib_lookup(const struct flowi *flp, struct fib_result *res).

FIB is the Forwarding Information Base.

There are two routing tables by default (non Policy Routing case):

local FIB table (ip_fib_local_table; ID 255).

main FIB table (ip_fib_main_table; ID 254).

See: include/net/ip_fib.h.

Routing Subsystem - contd.

Routes can be added into the main routing table in one of 3 ways:

By sys admin command (route add/ip route).

By routing daemons.

As a result of ICMP (REDIRECT).

A routing table is implemented by struct fib_table.



Routing Tables

fib_lookup() first searches the local FIB table (ip_fib_local_table).

In case it does not find an entry, it looks in the main FIB table
(ip_fib_main_table).

Why is it in this order?

There is one routing cache, regardless of how many routing tables there are.

You can see the routing cache by running route -C.

Alternatively, you can see it by running cat /proc/net/rt_cache.

Con: this way, the addresses are in hex format.



Routing Cache

The routing cache is built of rtable elements:

struct rtable (see: include/net/route.h)
{
    union {
        struct dst_entry dst;
    } u;
    ...
}

Routing Cache - contd

The dst_entry is the protocol-independent part.

Thus, for example, we have a dst_entry member (also called dst) in rt6_info in IPv6 (include/net/ip6_fib.h).

The key for a lookup operation in the routing cache is an IP address (whereas in the routing table the key is a subnet).

Inserting elements into the routing cache is done by rt_intern_hash().

There is an alternate mechanism for routing table lookup, called fib_trie, which is inside the kernel tree (net/ipv4/fib_trie.c).

Routing Cache - contd

It is based on extending the lookup key.

You should set: CONFIG_IP_FIB_TRIE (=y)

(instead of CONFIG_IP_FIB_HASH)

By Robert Olsson et al (see links).



Creating a Routing Cache Entry

Allocation of an rtable instance (rth) is done by dst_alloc().

dst_alloc() in fact creates and returns a pointer to a dst_entry, and we cast it to rtable (net/core/dst.c).

Setting the input and output methods of dst:

(rth->u.dst.input and rth->u.dst.output)

Setting the flowi member of the rtable (rth->fl).

The next time there is a lookup in the cache, for example in ip_route_input(), we will compare against rth->fl.

Routing Cache - Contd.

A garbage collection call deletes eligible entries from the routing cache.

Which entries are not eligible?



Policy Routing (multiple tables)

Generic routing uses destination-address based decisions.

There are cases when the destination address is not the sole parameter for deciding which route to use; Policy Routing exists to enable this.

Policy Routing (multiple tables)-contd.

Adding a routing table: by adding a line to /etc/iproute2/rt_tables.

For example: add the line 252 my_rt_table.

There can be up to 255 routing tables.

Policy routing should be enabled when building the kernel (CONFIG_IP_MULTIPLE_TABLES should be set).

Example of adding a route in this table:

ip route add default via 192.168.0.1 table my_rt_table

Show the table by:

ip route show table my_rt_table



Policy Routing (multiple tables)-contd.

You can add a rule to the routing policy database (RPDB) by ip rule add ...

The rule can be based on input interface, TOS, fwmark (from netfilter).

ip rule list shows all rules.



Policy Routing: add/delete a rule - example

ip rule add tos 0x04 table 252

This will cause packets with tos=0x04 (in the iphdr) to be routed by looking into the table we added (252).

So the default gw for this type of packet will be 192.168.0.1.

ip rule show will give:

32765: from all tos reliability lookup my_rt_table

...

Policy Routing: add/delete a rule - example

Delete a rule : ip rule del tos 0x04 table 252



Routing Lookup

(Flow diagram.) ip_route_input() (in net/ipv4/route.c) performs the cache lookup. On a hit, the packet is delivered by ip_local_deliver() or ip_forward(), according to the result. On a miss, ip_route_input_slow() (in net/ipv4/route.c) calls fib_lookup() in ip_fib_local_table and, on a miss there, fib_lookup() in ip_fib_main_table. A hit in either table leads to delivery as above; a miss in both drops the packet.

Routing Table Diagram

(Diagram.) fib_table (with the methods tb_lookup(), tb_insert() and tb_delete()) holds 33 struct fn_zone entries. Each fn_zone has fz_hash, an array of fz_divisor hlist_head buckets. Each bucket chains struct fib_node elements (fn_key, fn_alias). fn_alias leads to struct fib_alias, whose fa_info points to struct fib_info, which contains fib_nh.

Routing Tables

Breaking the fib_table into multiple data structures gives flexibility and enables a fine-grained, high level of sharing.

Suppose that 10 routes to 10 different networks have the same next hop gw.

We can have one fib_info which will be shared by 10 fib_aliases.

fz_divisor is the number of buckets.



Routing Tables - contd

Each fib_node element represents a unique subnet.

The fn_key member of fib_node is the subnet (32 bit).



Routing Tables - contd

Suppose that a device goes down or is enabled.

We need to disable/enable all routes which use this device.

But how can we know which routes use this device?

In order to know it efficiently, there is the fib_info_devhash table.

This table is indexed by the device identifier.

See fib_sync_down() and fib_sync_up() in net/ipv4/fib_semantics.c.

Routing Table lookup algorithm

LPM (Longest Prefix Match) is the lookup algorithm.

The route with the longest netmask is the one chosen.

Netmask 0, which is the shortest netmask, is for the default gateway.

What happens when there are multiple entries with netmask=0?

fib_lookup() returns the first entry it finds in the fib table where netmask length is 0.

Routing Table lookup - contd.

It may be that this is not the best choice of default gateway.

So in case the netmask is 0 (the prefixlen of the fib_result returned from fib_lookup() is 0) we call fib_select_default().

fib_select_default() will select the route with the lowest priority (metric), by comparing the fib_priority values of all default gateways.
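Longest prefix match with a metric tie-break can be sketched over a flat array. This is a simplified user-space model and not the kernel's data layout (which uses the fn_zone/fib_node structures); struct toy_route, toy_mask() and toy_lpm() are hypothetical names, addresses are in host byte order, and the metric comparison is applied to every equal-length tie rather than only to default routes.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flat route entry, used only for this illustration. */
struct toy_route {
    uint32_t prefix;     /* network address, host byte order */
    int prefixlen;       /* 0..32; 0 means a default route */
    uint32_t gateway;    /* next hop, host byte order */
    int metric;          /* lower value is preferred */
};

/* Build the netmask for a prefix length (a shift by 32 is undefined,
 * so length 0 is handled separately). */
static uint32_t toy_mask(int prefixlen)
{
    return prefixlen == 0 ? 0 : 0xFFFFFFFFu << (32 - prefixlen);
}

/* Return the best matching route or NULL. The longest prefix wins;
 * among routes of equal length (for example several default routes)
 * the lowest metric wins. */
static const struct toy_route *toy_lpm(const struct toy_route *tbl,
                                       size_t n, uint32_t daddr)
{
    const struct toy_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        const struct toy_route *r = &tbl[i];

        if ((daddr & toy_mask(r->prefixlen)) != r->prefix)
            continue;
        if (best == NULL || r->prefixlen > best->prefixlen ||
            (r->prefixlen == best->prefixlen && r->metric < best->metric))
            best = r;
    }
    return best;
}
```

A linear scan keeps the sketch short; the per-prefix-length zones described earlier let the kernel try the longest masks first and stop at the first hit.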

Receiving a packet

When working in the interrupt-driven model, the NIC driver registers an interrupt handler for the IRQ with which the device works by calling request_irq().

This interrupt handler will be called when a frame is received.

The same interrupt handler will be called when transmission of a frame is finished and under other conditions (depending on the NIC; sometimes the interrupt handler will be called when there is some error).

Receiving a packet - contd

Typically in the handler, we allocate an sk_buff by calling dev_alloc_skb(); eth_type_trans() is also called; among other things it advances the data pointer of the sk_buff to point to the IP header; this is done by calling skb_pull(skb, ETH_HLEN).

See: net/ethernet/eth.c

ETH_HLEN is 14, the size of the Ethernet header.



Receiving a packet - contd

The handler for receiving an IPv4 packet is ip_rcv() (net/ipv4/ip_input.c).

Handlers for the protocols are registered at the init phase.

Likewise, arp_rcv() is the handler for ARP packets.

First, ip_rcv() performs some sanity checks. For example:

if (iph->ihl < 5 || iph->version != 4)
    goto inhdr_error;

iph is the IP header; iph->ihl is the IP header length (4 bits), counted in 32-bit words.

The IP header must be at least 20 bytes (ihl = 5).

It can be up to 60 bytes (ihl = 15, when we use IP options).
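The version and header-length check can be tried out in user space on a raw byte buffer. The function below is a hypothetical sketch (toy_ip_header_ok() is not a kernel function); it covers only the version, ihl and length checks, and does not verify the checksum or the total length.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 when the buffer starts with a plausible IPv4 header, else 0.
 * The first byte carries the version in its high 4 bits and ihl (the
 * header length in 32-bit words) in its low 4 bits. */
static int toy_ip_header_ok(const uint8_t *pkt, size_t len)
{
    unsigned int version, ihl;

    if (len < 20)                 /* shorter than the minimal header */
        return 0;
    version = pkt[0] >> 4;
    ihl = pkt[0] & 0x0F;
    if (ihl < 5 || version != 4)  /* same test as in ip_rcv() */
        return 0;
    if (len < ihl * 4)            /* options claimed but not present */
        return 0;
    return 1;
}
```

A first byte of 0x45 (version 4, ihl 5) is the common case of a 20-byte header without options.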



Receiving a packet - contd

Then it calls ip_rcv_finish(), by:

NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);

This division of methods into two stages (where the second has the same name with the suffix finish or slow) is typical for networking kernel code.

In many cases the second method has a slow suffix instead of finish; this usually happens when the first method looks in some cache and the second method performs a lookup in a table, which is slower.

Receiving a packet - contd

ip_rcv_finish() implementation:

if (skb->dst == NULL) {
    int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
                             skb->dev);
    ...
}
...
return dst_input(skb);

Receiving a packet - contd

ip_route_input():
First performs a lookup in the routing cache to see if there is a match. If there is no match (cache miss), it calls ip_route_input_slow() to perform a lookup in the routing table. (This lookup is done by calling fib_lookup().)

fib_lookup(const struct flowi *flp, struct fib_result *res)

The results are kept in fib_result.

ip_route_input() returns 0 upon successful lookup (also when there is a cache miss but a successful lookup in the routing table).
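The cache-first, slow-path-second pattern can be modelled in a few lines. This is a user-space sketch under loud assumptions: all toy_ names are hypothetical, the cache is keyed on the destination address only (the real cache key also includes fields such as source address, input interface and TOS), and the slow path is a made-up rule standing in for the routing table lookup.

```c
#include <stdint.h>

#define TOY_CACHE_SIZE 16

/* Hypothetical cache entry keyed on the destination address only. */
struct toy_cache_entry {
    int valid;
    uint32_t daddr;
    uint32_t gateway;
};

static struct toy_cache_entry toy_cache[TOY_CACHE_SIZE];
static unsigned int toy_slow_calls;  /* counts slow-path invocations */

/* Stand-in for the routing table lookup. The rule is invented for the
 * example: 0.x.x.x is unroutable, everything else goes via 192.168.0.1. */
static int toy_route_input_slow(uint32_t daddr, uint32_t *gw)
{
    toy_slow_calls++;
    if ((daddr >> 24) == 0)
        return -1;
    *gw = 0xC0A80001u;
    return 0;
}

/* Look in the cache first; on a miss fall back to the slow path and
 * insert the result into the cache. Returns 0 on success. */
static int toy_route_input(uint32_t daddr, uint32_t *gw)
{
    struct toy_cache_entry *e = &toy_cache[daddr % TOY_CACHE_SIZE];

    if (e->valid && e->daddr == daddr) {  /* cache hit */
        *gw = e->gateway;
        return 0;
    }
    if (toy_route_input_slow(daddr, gw) != 0)
        return -1;
    e->valid = 1;                         /* remember for the next packet */
    e->daddr = daddr;
    e->gateway = *gw;
    return 0;
}
```

The second packet to the same destination is answered from the cache, so toy_slow_calls does not grow; this is the saving that motivates the two-stage naming.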

Receiving a packet - contd
According to the results of fib_lookup(), we know if the frame is for local delivery, for forwarding, or to be dropped.

If the frame is for local delivery, we will set the input() function pointer of the route to ip_local_deliver():
rth->u.dst.input = ip_local_deliver;

If the frame is to be forwarded, we will set the input() function pointer to ip_forward():
rth->u.dst.input = ip_forward;

Local Delivery

Prototype:
int ip_local_deliver(struct sk_buff *skb) (net/ipv4/ip_input.c).
- calls NF_HOOK(PF_INET, NF_IP_LOCAL_IN, skb, skb->dev, NULL, ip_local_deliver_finish);

Delivers the packet to the higher protocol layers according to its type.

Forwarding

Prototype:

int ip_forward(struct sk_buff *skb)

(net/ipv4/ip_forward.c)

Decreases the ttl in the IP header.

If the ttl is <= 1, the method sends an ICMP message (ICMP_TIME_EXCEEDED) and drops the packet.

Calls NF_HOOK(PF_INET, NF_IP_FORWARD, skb, skb->dev, rt->u.dst.dev, ip_forward_finish);
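The TTL handling can be sketched on a raw header in user space. The names below are hypothetical, and the sketch deliberately swaps in a simpler technique: it recomputes the whole header checksum after decrementing the TTL, whereas the kernel adjusts the existing checksum incrementally. The header length is assumed to be a multiple of 4, as it always is for IPv4.

```c
#include <stdint.h>
#include <stddef.h>

enum toy_verdict { TOY_DROP_TTL_EXCEEDED, TOY_FORWARDED };

/* Internet checksum over the header, read as 16-bit big-endian words.
 * Over a header whose checksum field is zero this yields the value to
 * store; over a header with a correct checksum it yields 0. */
static uint16_t toy_ip_checksum(const uint8_t *hdr, size_t hdrlen)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < hdrlen; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    while (sum >> 16)                      /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Forwarding decision on the TTL: drop when it is <= 1 (the kernel
 * would also send ICMP_TIME_EXCEEDED), otherwise decrement it and
 * refresh the checksum. TTL is byte 8, the checksum is bytes 10-11. */
static enum toy_verdict toy_ip_forward(uint8_t *hdr, size_t hdrlen)
{
    uint16_t check;

    if (hdr[8] <= 1)
        return TOY_DROP_TTL_EXCEEDED;
    hdr[8]--;
    hdr[10] = 0;
    hdr[11] = 0;
    check = toy_ip_checksum(hdr, hdrlen);
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)(check & 0xFF);
    return TOY_FORWARDED;
}
```

Because the TTL is part of the checksummed header, every hop has to update the checksum; that is why the decrement and the checksum fix travel together.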

Forwarding- Contd

ip_forward_finish(): sends the packet out by calling dst_output(skb).

dst_output(skb) is just a wrapper, which calls skb->dst->output(skb) (see include/net/dst.h).

Sending a Packet

The routing lookup for a packet being sent is done by ip_route_output_key().

We need to perform a routing lookup also in the case of transmission.

In case of a cache miss, we call ip_route_output_slow(), which looks in the routing table (by calling fib_lookup(), as is also done in ip_route_input_slow()).

If the packet is for a remote host, we set dst->output to ip_output().

Sending a Packet-contd

ip_output() will call ip_finish_output().

This is the NF_IP_POST_ROUTING point.

ip_finish_output() will eventually send the packet to a neighbour by:

dst->neighbour->output(skb)

arp_bind_neighbour() sees to it that the L2 address of the next hop will be known (net/ipv4/arp.c).

Sending a Packet - Contd.

If the packet is for the local machine:

dst->output = ip_output

dst->input = ip_local_deliver

ip_output() will send the packet on the loopback device.

Then we will go into ip_rcv() and ip_rcv_finish(), but this time dst is NOT null; so we will end in ip_local_deliver().

See: net/ipv4/route.c

Multipath routing

This feature enables the administrator to set multiple next hops for a destination.

To enable multipath routing, CONFIG_IP_ROUTE_MULTIPATH should be set when building the kernel.

There was also an option for multipath caching (by setting CONFIG_IP_ROUTE_MULTIPATH_CACHED).

It was experimental and was removed in 2.6.23. See links (6).




Netfilter

Netfilter is the kernel layer that supports applying iptables rules.

It enables:

Filtering

Changing packets (masquerading)

Connection Tracking

Netfilter rule - example

Short example:

Applying the following iptables rule:

iptables -A INPUT -p udp --dport 9999 -j DROP

This is NF_IP_LOCAL_IN rule;

The packet will go to:

ip_rcv()

and then: ip_rcv_finish()

And then ip_local_deliver()



Netfilter rule - example (contd)

but it will NOT proceed to ip_local_deliver_finish() as in the usual case, without this rule.

As a result of applying this rule it reaches nf_hook_slow() with verdict == NF_DROP (which calls kfree_skb() to free the packet).

See net/netfilter/core.c.

ICMP redirect message

The ICMP protocol is used to notify about problems.

A REDIRECT message is sent in case the route is suboptimal (inefficient).

There are in fact 4 types of REDIRECT.

Only one is used:

Redirect Host (ICMP_REDIR_HOST)

See RFC 1812 (Requirements for IP Version 4 Routers).



ICMP redirect message - contd.

To support sending ICMP redirects, the machine should be configured to send redirect messages.

/proc/sys/net/ipv4/conf/all/send_redirects should be 1.

In order that the other side will receive redirects, we should set /proc/sys/net/ipv4/conf/all/accept_redirects to 1.

ICMP redirect message - contd.

Example:

Add a suboptimal route on 192.168.0.31:

route add -net 192.168.0.10 netmask 255.255.255.255 gw 192.168.0.121

Running route now on 192.168.0.31 will show a new entry:

Destination     Gateway         Genmask         Flags Metric Ref Use Iface
192.168.0.10    192.168.0.121   255.255.255.255 UGH   0      0   0   eth0

ICMP redirect message - contd.

Send packets from 192.168.0.31 to 192.168.0.10:

ping 192.168.0.10 (from 192.168.0.31)

We will see (on 192.168.0.31):

From 192.168.0.121: icmp_seq=2 Redirect Host(New nexthop: 192.168.0.10)

Now, running on 192.168.0.121:

route -Cn | grep .10

shows that there is a new entry in the routing cache:


ICMP redirect message - contd.

192.168.0.31 192.168.0.10 192.168.0.10 ri 0 0 34 eth0

The r in the flags column means: RTCF_DOREDIRECT.

The 192.168.0.121 machine had sent a redirect by calling ip_rt_send_redirect() from ip_forward() (net/ipv4/ip_forward.c).

ICMP redirect message - contd.

And on 192.168.0.31, running route -C | grep .10 now shows a new entry in the routing cache (in case accept_redirects=1):

192.168.0.31 192.168.0.10 192.168.0.10 0 0 1 eth0

In case accept_redirects=0 (on 192.168.0.31), we will see:

192.168.0.31 192.168.0.10 192.168.0.121 0 0 0 eth0

which means that the gw is still 192.168.0.121 (which is the route that we added in the beginning).

ICMP redirect message - contd.

Adding an entry to the routing cache as a result of getting an ICMP REDIRECT is done in ip_rt_redirect(), net/ipv4/route.c.

The entry in the routing table is not deleted.



Neighboring Subsystem

Most known protocol: ARP (in IPv6: ND, neighbour discovery)

ARP table.

Ethernet header is 14 bytes long:

Destination MAC address (6 bytes).

Source MAC address (6 bytes).

Type (2 bytes).

0x0800 is the type for IP packet (ETH_P_IP)

0x0806 is the type for ARP packet (ETH_P_ARP)

see: include/linux/if_ether.h
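The 14-byte layout can be decoded from a raw frame as follows. On the wire the destination address comes first, then the source address, then the 2-byte type in network byte order. struct toy_ethhdr and toy_parse_eth() are hypothetical names for this user-space sketch, which stores the type in host order for easy comparison with values such as 0x0800.

```c
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN 6    /* bytes in a MAC address */
#define TOY_ETH_HLEN 14   /* total Ethernet header length */

/* Hypothetical decoded header (the type is converted to host order). */
struct toy_ethhdr {
    uint8_t dest[TOY_ETH_ALEN];
    uint8_t source[TOY_ETH_ALEN];
    uint16_t proto;
};

/* Decode the Ethernet header at the start of frame.
 * Returns 0 on success, -1 when the frame is too short. */
static int toy_parse_eth(const uint8_t *frame, size_t len,
                         struct toy_ethhdr *out)
{
    if (len < TOY_ETH_HLEN)
        return -1;
    memcpy(out->dest, frame, TOY_ETH_ALEN);
    memcpy(out->source, frame + TOY_ETH_ALEN, TOY_ETH_ALEN);
    /* The type field is big-endian: high byte first. */
    out->proto = (uint16_t)(frame[12] << 8 | frame[13]);
    return 0;
}
```

A frame whose bytes 12 and 13 are 0x08 0x06 decodes to proto 0x0806, the ARP type listed above.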

Neighboring Subsystem - contd

When there is no entry in the ARP cache for the destination IP address of a packet, a broadcast is sent (ARP request, ARPOP_REQUEST: who has IP address x.y.z...). This is done by a method called arp_solicit() (net/ipv4/arp.c).

You can see the contents of the ARP table by running cat /proc/net/arp or by running arp from the command line.

You can delete and add entries to the ARP table; see man arp.

Bridging Subsystem

You can define a bridge and add NICs to it (enslaving ports) using brctl (from bridge-utils).

You can have up to 1024 ports for every bridge device (BR_MAX_PORTS).

Example:

brctl addbr mybr

brctl addif mybr eth0

brctl show

Bridging Subsystem - contd.

When a NIC is configured as a bridge port, the br_port member of net_device is initialized.

(br_port is an instance of struct net_bridge_port).

When we receive a frame, netif_receive_skb() calls handle_bridge().

Bridging Subsystem - contd.

The bridging forwarding database is searched for the destination MAC address.

In case of a hit, the frame is sent to the bridge port with br_forward() (net/bridge/br_forward.c).

If there is a miss, the frame is flooded on all bridge ports using br_flood() (net/bridge/br_forward.c).

Note: this is not a broadcast!

The ebtables mechanism is the L2 parallel of L3 Netfilter.
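The hit/miss decision on the bridge forwarding database described above can be sketched with a small table. This is a user-space illustration only: struct toy_fdb and toy_bridge_decide() are hypothetical, the table is a linear array rather than the hash table a real bridge uses, and entry ageing is left out.

```c
#include <stdint.h>
#include <string.h>

#define TOY_FDB_MAX 8
#define TOY_FLOOD   (-1)   /* "send on all ports" marker */

/* Hypothetical forwarding database entry: MAC address -> port number. */
struct toy_fdb_entry {
    uint8_t mac[6];
    int port;
};

struct toy_fdb {
    struct toy_fdb_entry e[TOY_FDB_MAX];
    int n;                 /* number of entries in use */
};

/* Return the port to forward to on a hit, or TOY_FLOOD on a miss. */
static int toy_bridge_decide(const struct toy_fdb *fdb, const uint8_t dst[6])
{
    for (int i = 0; i < fdb->n; i++)
        if (memcmp(fdb->e[i].mac, dst, 6) == 0)
            return fdb->e[i].port;   /* hit: one known port */
    return TOY_FLOOD;                /* miss: flood on all ports */
}
```

Flooding keeps the destination MAC of the frame unchanged, which is why it is not the same thing as sending a broadcast frame.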



Bridging Subsystem- contd

Ebtables enables us to filter and mangle packets at the link layer (L2).

IPSec

Works at network IP layer (L3)

Used in many forms of secured networks like VPNs.

Mandatory in IPv6. (not in IPv4)

Implemented in many operating systems: Linux, Solaris, Windows, and more.

RFC2401

In 2.6 kernel : implemented by Dave Miller and Alexey Kuznetsov.

Transformation bundles.

Chain of dst entries; only the last one is for routing.



IPSec-cont.

User space tools: http://ipsec-tools.sf.net

Building VPN : http://www.openswan.org/ (Open Source).

There are also non IPSec solutions for VPN

example: pptp

struct xfrm_policy has the following member:

struct dst_entry *bundles.

__xfrm4_bundle_create() creates dst_entries (with the DST_NOHASH flag); see: net/ipv4/xfrm4_policy.c

Transport Mode and Tunnel Mode.



IPSec-contd.

Show the security policies:

ip xfrm policy show

Create RSA keys:

ipsec rsasigkey --verbose 2048 > keys.txt

ipsec showhostkey --left > left.publickey

ipsec showhostkey --right > right.publickey



IPSec-contd.
Example: Host to Host VPN (using openswan)
in /etc/ipsec.conf:
conn linux-to-linux
left=192.168.0.189
leftnexthop=%direct
leftrsasigkey=0sAQPPQ...
right=192.168.0.45
rightnexthop=%direct
rightrsasigkey=0sAQNwb...
type=tunnel
auto=start

IPSec-contd.

service ipsec start (to start the service)

ipsec verify: checks your system to see if IPsec got installed and started correctly.

ipsec auto --status

If you see IPsec SA established, this implies success.

Look for errors in /var/log/secure (Fedora Core) or in the kernel syslog.



Tips for hacking

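Documentation/networking/ip-sysctl.txt: networking kernel tunables

Example of reading a hex address:

iph->daddr == 0x0A00A8C0
means checking if the address is 192.168.0.10 (C0=192, A8=168, 00=0, 0A=10). The bytes appear reversed because daddr is stored in network byte order and is read here on a little-endian machine.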
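The byte-order effect behind this tip can be reproduced in user space. The helpers below are hypothetical: toy_load_be32_field() reads four address bytes the way a 32-bit field is read from memory, and toy_format_ipv4() prints the same bytes in dotted-quad form.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read the four address bytes as one 32-bit word, exactly as they lie
 * in memory. On a little-endian host the first byte (192) ends up in
 * the lowest 8 bits, which is why 192.168.0.10 shows as 0x0A00A8C0. */
static uint32_t toy_load_be32_field(const uint8_t bytes[4])
{
    uint32_t v;

    memcpy(&v, bytes, sizeof v);
    return v;
}

/* Format the address bytes in the usual dotted-quad notation. */
static void toy_format_ipv4(const uint8_t bytes[4], char *out, size_t outlen)
{
    snprintf(out, outlen, "%u.%u.%u.%u",
             bytes[0], bytes[1], bytes[2], bytes[3]);
}
```

Formatting byte by byte is independent of the host byte order, which is the safe way to print an address while debugging.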

Tips for hacking - Contd.

Disable ping reply:

echo 1 >/proc/sys/net/ipv4/icmp_echo_ignore_all

Disable arp: ip link set eth0 arp off (the NOARP flag will be set)

Also ifconfig eth0 -arp has the same effect.

How can you get the Path MTU to a destination (PMTU)?

Use tracepath (see man tracepath).

Tracepath is from iputils.



Tips for hacking - Contd.

Keep the iphdr struct handy (printout), from linux/ip.h (the little-endian bitfield order is shown):

struct iphdr {
    __u8    ihl:4,
            version:4;
    __u8    tos;
    __be16  tot_len;
    __be16  id;
    __be16  frag_off;
    __u8    ttl;
    __u8    protocol;
    __sum16 check;
    __be32  saddr;
    __be32  daddr;
    /* The options start here. */
};

Tips for hacking - Contd.

NIPQUAD(): macro for printing an IPv4 address in dotted-quad form.

CONFIG_NET_DMA is for offloading TCP receive copies to a DMA engine.

When you encounter xfrm / CONFIG_XFRM, this has to do with IPSec (transformers).

New and future trends

I/OAT.

NetChannels (Van Jacobson and Evgeniy Polyakov).

TCP Offloading.

RDMA.

Multiqueue: some new NICs, like e1000 and IPW2200, allow two or more hardware Tx queues. There are already patches to enable this.

New and future trends - contd.

See: Enabling Linux Network Support of Hardware Multiqueue Devices, OLS 2007.

Some more info in: Documentation/networking/multiqueue.txt in recent Linux kernels.

Devices with multiple TX/RX queues will have the NETIF_F_MULTI_QUEUE feature (include/linux/netdevice.h).

MQ NIC drivers will call alloc_etherdev_mq() or alloc_netdev_mq() instead of alloc_etherdev() or alloc_netdev().

Links and more info
1) Linux Network Stack Walkthrough (2.4.20):
http://gicl.cs.drexel.edu/people/sevy/network/Linux_network_stack_walkthrough.html
2) Understanding the Linux Kernel, Second Edition, by Daniel P. Bovet and Marco Cesati, December 2002; chapter 18: networking.
- Understanding Linux Network Internals, by Christian Benvenuti, O'Reilly, First Edition.
3) Linux Device Drivers, Third Edition, by Jonathan Corbet, Alessandro Rubini and Greg Kroah-Hartman, February 2005; chapter 17, Network Drivers.
4) Linux networking (a lot of docs about specific networking topics):
http://linux-net.osdl.org/index.php/Main_Page
5) netdev mailing list: http://www.spinics.net/lists/netdev/
6) Removal of multipath routing cache from kernel code:
http://lists.openwall.net/netdev/2007/03/12/76
http://lwn.net/Articles/241465/
7) Linux Advanced Routing & Traffic Control:
http://lartc.org/
8) ebtables, a filtering tool for bridging:
http://ebtables.sourceforge.net/
9) Writing Network Device Driver for Linux (article):
http://app.linux.org.mt/article/writing-netdrivers?locale=en
10) Netconf, a yearly networking conference; the first was in 2004.
http://vger.kernel.org/netconf2004.html
http://vger.kernel.org/netconf2005.html
http://vger.kernel.org/netconf2006.html
Next one: Linux Conf Australia, January 2008, Melbourne
David S. Miller, James Morris, Rusty Russell, Jamal Hadi Salim, Stephen Hemminger, Harald Welte, Hideaki YOSHIFUJI, Herbert Xu, Thomas Graf, Robert Olsson, Arnaldo Carvalho de Melo and others
11) Policy Routing With Linux - Online Book Edition, by Matthew G. Marsh (Sams).
http://www.policyrouting.org/PolicyRoutingBook/
12) TRASH - A dynamic LC-trie and hash data structure, by Robert Olsson and Stefan Nilsson, August 2006:
http://www.csc.kth.se/~snilsson/public/papers/trash/trash.pdf
13) IPSec howto:
http://www.ipsec-howto.org/t1.html
14) Openswan: Building and Integrating Virtual Private Networks, by Paul Wouters and Ken Bantoft, Packt Publishing:
http://www.packtpub.com/book/openswan/mid/061205jqdnh2by


Linux Kernel Networking - advanced topics: Neighboring and IPsec

Rami Rosen
Haifux, January 2008
www.haifux.org

Contents

Short rehearsal (4 slides)

Neighboring Subsystem

struct neighbour

arp

arp_bind_neighbour() method

Duplicate Address Detection (DAD)

LVS (Linux Virtual Server)

ARPD: ARP user space daemon

Neighbour states

Change of IP address/Mac address

IPsec

Scope

We will not deal with multicast, IPv6, or wireless.

The L3 network protocol we deal with is IPv4, and the L2 Link Layer protocol is Ethernet.

Neighboring Subsystem

All code in this lecture is taken from linux-2.6.24-rc4 (04-Dec-2007).

It can be obtained from http://www.kernel.org/pub/linux/kernel/v2.6/testing/ (and mirrors).

Short rehearsal (4 slides)

The layers that we will deal with (based on the 7 layers model)
are:
Transport Layer (L4) (udp,tcp...)
Network Layer (L3) (ip)
Link Layer (L2) (ethernet)

Short rehearsal (4 slides)

The two most important data structures: sk_buff and net_device.

sk_buff:

dst is an instance of dst_entry; dst is a member of sk_buff.

The lookup in the routing subsystem constructs dst.

It decides how the packet will continue its traversal.

This is done by assigning methods to its input()/output() functions.

Each dst_entry has a neighbour member (with IPSec it is NULL).

When working with IPSec, the dst in fact represents a linked list of dst_entries. Only the last one is for routing; all previous dst_entries are for IPSec transformers.

Short rehearsal (4 slides)
net_device

net_device represents a Network Interface Card.

net_device has members like mtu, dev_addr (device MAC address), promiscuity, name of device (eth0, eth1, lo, etc.), and more.

An important member of net_device is flags.

You can disable ARP replies on a NIC by setting IFF_NOARP flag:

ifconfig eth0 -arp

ifconfig eth0 will show:

UP BROADCAST RUNNING NOARP MULTICAST ...

Enabling ARP again is done by: ifconfig eth0 arp.



Short rehearsal (4 slides)

ip_route_input() method: performs a lookup in the routing subsystem for each incoming packet. Looks first in the routing cache; in case there is a cache miss, looks into the routing table and inserts an entry into the routing cache. Calls arp_bind_neighbour() for UNICAST packets only. Returns 0 upon success.

dev_queue_xmit(struct sk_buff *skb) is called to transmit the packet when it is ready (has an L2 destination address) (net/core/dev.c).

dev_queue_xmit() passes the packet to the NIC device driver for transmission using the device driver's hard_start_xmit() method.

Neighboring Subsystem

Goals: what is the neighboring subsystem for?

"The world is a jungle in general, and the networking game contributes many animals." (from RFC 826, ARP, 1982)

In IPv4 implemented by ARP; in IPv6: ND, neighbour discovery.

Ethernet header is 14 bytes long:

Destination MAC address and source MAC address - 6 bytes each.

Type (2 bytes). For example (include/linux/if_ether.h):

0x0800 is the type for IP packet (ETH_P_IP)

0x0806 is the type for ARP packet (ETH_P_ARP)

0x8035 is the type for RARP packet (ETH_P_RARP)



Neighboring Subsystem struct neighbour

neighbour (an instance of struct neighbour) is referenced from dst, which in turn is referenced from sk_buff:

sk_buff -> dst -> neighbour (ha, primary_key, ...)

Neighboring Subsystem struct neighbour

Implementation - important data structures

struct neighbour (include/net/neighbour.h)

ha - the hardware address (MAC address when dealing with Ethernet) of the neighbour. This field is filled when an ARP response arrives.

primary_key - the IP address (L3) of the neighbour.

Lookup in the ARP table is done with the primary_key.

nud_state represents the Neighbour Unreachability Detection state of the neighbour (for example, NUD_REACHABLE).

Neighboring Subsystem struct neighbour
contd

A neighbour can change its state to NUD_REACHABLE in one of three ways:

L4 confirmation.

Receiving an ARP reply for the first time, or receiving an ARP reply
in response to an ARP request when in NUD_PROBE state.

Confirmation by a sysadmin command (this is rare).

Neighboring Subsystem struct neighbour
contd

int (*output)(struct sk_buff *skb);

output() can be assigned to different methods according to the
state of the neighbour, for example neigh_resolve_output() and
neigh_connected_output(). Initially, it is neigh_blackhole().

When the state changes, the output function may also be
assigned to a different function.

refcnt - incremented by neigh_hold(); decremented by
neigh_release(). We don't free a neighbour when the refcnt
is higher than 1; instead, we set dead (a member of neighbour)
to 1.

Neighboring Subsystem struct neighbour
contd

timer (The callback method is neigh_timer_handler()).

struct hh_cache *hh (defined in include/linux/netdevice.h)

confirmed - confirmation timestamp.

Confirmation can be done from L4 (transport layer).

For example, dst_confirm() calls neigh_confirm().

dst_confirm() is called from tcp_ack() (net/ipv4/tcp_input.c),
by udp_sendmsg() (net/ipv4/udp.c), and more.

neigh_confirm() does NOT change the state; that is the job
of neigh_timer_handler().

Neighboring Subsystem struct neighbour
contd

dev (net_device from which we send packets to the neighbour).

struct neigh_parms *parms;

parms includes mostly timer tunables, the net structure (network
namespaces), etc.

Network namespaces present multiple instances of the network
stack to user space.

A network device belongs to exactly one network namespace.

Set CONFIG_NET_NS when building the kernel.



Neighboring Subsystem struct neighbour
contd

arp_queue

Every neighbour has a small arp queue of its own.

There can be only 3 elements by default in an arp_queue.

This is configurable: /proc/sys/net/ipv4/neigh/default/unres_qlen

struct neigh_table

struct neigh_table represents a neighboring table
(include/net/neighbour.h).

The arp table (arp_tbl) is a neigh_table (include/net/arp.h).

In IPv6, nd_tbl (Neighbor Discovery table) is also a neigh_table
(include/net/ndisc.h).

There is also dn_neigh_table (DECnet) (net/decnet/dn_neigh.c)
and clip_tbl (for ATM) (net/atm/clip.c).

gc_timer: neigh_periodic_timer() is the callback for garbage
collection.

neigh_periodic_timer() deletes FAILED entries from the ARP table.

Neighboring Subsystem - arp

When there is no entry in the ARP cache for the destination IP
address of a packet, a broadcast is sent (ARP request,
ARPOP_REQUEST: who has IP address x.y.z...). This is done by
a method called arp_solicit() (net/ipv4/arp.c).

In IPv6, the parallel mechanism is called ND (Neighbor
discovery) and is implemented as part of ICMPv6.

A multicast is sent in IPv6 (and not a broadcast).

If there is no answer in time to this arp request, we end up
sending back an ICMP error (Destination Host Unreachable).

This is done by arp_error_report(), which indirectly calls
ipv4_link_failure(); see net/ipv4/route.c.

[Figure: the ARP table as a chain of neighbour entries linked by next;
each neighbour holds ha and an hh pointer to an hh_cache, whose
hh_data contains SA, DA, and TYPE.]

Neighboring Subsystem - arp

You can see the contents of the arp table by running
cat /proc/net/arp or by running arp from the command line.

ip neigh show is the new method to show arp (from IPROUTE2).

You can delete and add entries to the arp table; see man arp and
man ip.

When using ip neigh add you can specify the state of the entry
which you are adding (like permanent, stale, reachable, etc.).

Neighboring Subsystem arp table

The arp command does not show reachability states except the
incomplete state and the permanent state.
Permanent entries are marked with M in Flags.
Example: arp output
Address      HWtype  HWaddress          Flags Mask  Iface
10.0.0.2             (incomplete)                   eth0
10.0.0.3     ether   00:01:02:03:04:05  CM          eth0
10.0.0.138   ether   00:20:8F:0C:68:03  C           eth0

Neighboring Subsystem ip neigh show

We can see the current neighbour states:

Example :

ip neigh show
192.168.0.254 dev eth0 lladdr 00:03:27:f1:a1:31 REACHABLE
192.168.0.152 dev eth0 lladdr 00:00:00:cc:bb:aa STALE
192.168.0.121 dev eth0 lladdr 00:10:18:1b:1c:14 PERMANENT
192.168.0.54 dev eth0 lladdr aa:ab:ac:ad:ae:af STALE
192.168.0.98 dev eth0 INCOMPLETE

Neighboring Subsystem arp

arp_process() handles both ARP requests and ARP responses.

net/ipv4/arp.c

If the target IP address (tip) in the arp header is a loopback
(or multicast) address, arp_process() drops the packet, since
such an address does not need ARP.
...
if (LOOPBACK(tip) || MULTICAST(tip))
    goto out;
...
out:
    ...
    kfree_skb(skb);
    return 0;

Neighboring Subsystem - arp
(see: #define LOOPBACK(x) (((x) & htonl(0xff000000)) == htonl(0x7f000000))
in include/linux/in.h)

If it is an ARP request (ARPOP_REQUEST), we call ip_route_input().

Why ?

In case it is for us (RTN_LOCAL), we send an ARP reply:

arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip, sha,
         dev->dev_addr, sha);

We also update our arp table with the sender entry (ip/mac).

Special case: ARP proxy server.



Neighboring Subsystem - arp

In case we receive an ARP reply (ARPOP_REPLY):

We perform a lookup in the arp table (by calling __neigh_lookup()).

If we find an entry, we update the arp table by neigh_update().

Neighboring Subsystem - arp

If there is no entry and there is NO support for unsolicited ARP, we
don't create an entry in the arp table.

Support for unsolicited ARP is enabled by
setting /proc/sys/net/ipv4/conf/all/arp_accept to 1.

The corresponding macro is: IPV4_DEVCONF_ALL(ARP_ACCEPT)

In older kernels, support for unsolicited ARP was enabled by:

CONFIG_IP_ACCEPT_UNSOLICITED_ARP

Neighboring Subsystem lookup

Lookup in the neighboring subsystem is done via neigh_lookup().
Parameters:

neigh_table (arp_tbl)

pkey (ip address, the primary_key of neighbour struct)

dev (net_device)

There are 2 wrappers:

__neigh_lookup(), which takes just one more parameter: creat (a flag:
whether to create a neighbour by neigh_create() or not)

and __neigh_lookup_errno()

Neighboring Subsystem static entries

Adding a static entry is done by arp -s ipAddress MacAddress

Alternatively, this can be done by:


ip neigh add ipAddress dev eth0 lladdr MacAddress nud permanent

The state (nud_state) of this entry will be NUD_PERMANENT

ip neigh show will show it as PERMANENT.

Why do we need PERMANENT entries ?



arp_bind_neighbour() method

Suppose we are sending a packet to a host for the first time.

A dst_entry is added to the routing cache by rt_intern_hash().

We should know the L2 address of that host,

so rt_intern_hash() calls arp_bind_neighbour(),

but only for RTN_UNICAST (not for multicast/broadcast).

arp_bind_neighbour(): net/ipv4/arp.c

dst->neighbour is NULL, so it calls __neigh_lookup_errno().

There is no such entry in the arp table,

so we create a neighbour with neigh_create() and add
it to the arp table.

arp_bind_neighbour() method

neigh_create() creates a neighbour with NUD_NONE state

setting nud_state to NUD_NONE is done in neigh_alloc()



Neighboring Subsystem IFF_NOARP flag

Disabling and enabling arp

ifconfig eth1 -arp

You will see the NOARP flag now in ifconfig -a

ifconfig eth1 arp (to enable arp of the device).

In fact, this sets (or clears) the IFF_NOARP flag of net_device.

There are cases where the interface has the IFF_NOARP flag set
by default (for example, a ppp interface;
see ppp_setup() in drivers/net/ppp_generic.c).

Changing IP address

Suppose we try to set eth1 to an IP address of a different
machine on the LAN:

First, we set an IP for eth1 (in FC8, for example) in

/etc/sysconfig/network-scripts/ifcfg-eth1
...
IPADDR=192.168.0.122
...
and then run:

ifup eth1

Changing IP address - contd.

we will get:
Error, some other host already uses address
192.168.0.122.

But:

ifconfig eth0 192.168.0.122

works ok !

Why is it so ?

ifup is from the initscripts package.



Duplicate Address Detection (DAD)

Duplicate Address Detection mode (DAD)

arping -I eth0 -D 192.168.0.10

sends a broadcast packet whose source address is 0.0.0.0.

0.0.0.0 is not a valid IP address (for example, you cannot
set an IP address to 0.0.0.0 with ifconfig).

The MAC address of the sender is the real one.

The -D flag is for Duplicate Address Detection mode.




Duplicate Address Detection -contd
Code: (from arp_process(); see net/ipv4/arp.c)
/* Special case: IPv4 duplicate address detection packet (RFC2131) */
if (sip == 0) {
    if (arp->ar_op == htons(ARPOP_REQUEST) &&
        inet_addr_type(tip) == RTN_LOCAL &&
        !arp_ignore(in_dev, dev, sip, tip))
        arp_send(ARPOP_REPLY, ETH_P_ARP, tip, dev, tip, sha,
                 dev->dev_addr, dev->dev_addr);
    goto out;
}

Neighboring Subsystem Garbage
Collection

Garbage Collection

neigh_periodic_timer()

neigh_timer_handler()

neigh_periodic_timer() removes entries which are in
NUD_FAILED state. This is done by setting dead to 1 and
calling neigh_release(). The refcnt must be 1 to ensure no one
else uses this neighbour. Expired entries are also removed.

NUD_FAILED entries don't have a MAC address (see the ip neigh
show example above).

Neighboring Subsystem Synchronous
Garbage Collection

neigh_forced_gc() performs synchronous garbage collection.

It is called from neigh_alloc() when the number of entries
in the arp table exceeds a configurable limit.

The limit is set by gc_thresh2 and gc_thresh3:

/proc/sys/net/ipv4/neigh/default/gc_thresh2
/proc/sys/net/ipv4/neigh/default/gc_thresh3

The default for gc_thresh3 is 1024.

Candidates for cleanup: entries whose reference
count is 1 and whose state is NOT permanent.

Neighboring Subsystem Garbage
Collection

Changing the neighbour state is done only in neigh_timer_handler().

LVS (Linux Virtual Server)

http://www.linuxvirtualserver.org/

Integrated into the Linux kernel (in 2.4 kernel it was a patch).

Located in: net/ipv4/ipvs in the kernel tree. No IPV6 support.

LVS has eight scheduling algorithms.

LVS/DR is LVS with direct routing (a load balancing solution).

ipvsadm is the user space management tool (available in
most distros).

Direct Routing is the packet-forwarding-method.

-g, --gatewaying => Use gatewaying (direct routing)

see man ipvsadm.



LVS/DR

Example: 3 Real Servers and the Director all have the same
Virtual IP (VIP).
[Figure: clients reach the Linux Director through the VIP (Virtual IP);
the Director and Real Servers 1, 2, and 3 are each configured with the
same VIP.]

LVS and ARP

There is an ARP problem in this configuration.

When you send an ARP broadcast, and the receiving
machine has two or more NICs, each of them responds to
this ARP request.

Example: a machine with two NICs;
eth0 is 192.168.0.151 and eth1 is 192.168.0.152.



LVS and ARP - example:

LVS and ARP

Solutions
1) Set ARP_IGNORE to 1:

echo 1 > /proc/sys/net/ipv4/conf/eth0/arp_ignore

echo 1 > /proc/sys/net/ipv4/conf/eth1/arp_ignore


2) Use arptables.

There are 3 points in the arp walkthrough
(include/linux/netfilter_arp.h):

NF_ARP_IN (in arp_rcv(), net/ipv4/arp.c)

NF_ARP_OUT (in arp_xmit(), net/ipv4/arp.c)

NF_ARP_FORWARD (in br_nf_forward_arp(),
net/bridge/br_netfilter.c)

LVS and ARP

http://ebtables.sourceforge.net/download.html

Ebtables is in fact the parallel of netfilter but in L2.



LVS example (ipvsadm)

An example for setting LVS/DR on TCP port 80 with three
real servers:

ipvsadm -C // clear the LVS table

ipvsadm -A -t DirectorIPAddress:80

ipvsadm -a -t DirectorIPAddress:80 -r RealServer1 -g

ipvsadm -a -t DirectorIPAddress:80 -r RealServer2 -g

ipvsadm -a -t DirectorIPAddress:80 -r RealServer3 -g

This example deals with tcp connections (for udp
connections we should use -u instead of -t in the last 3 lines).

LVS example:

ipvsadm -Ln // list the LVS table

/proc/sys/net/ipv4/ip_forward should be set to 1

In this example, packets sent to the VIP will be sent to the load
balancer; it will delegate them to a real server according
to its scheduler. The destination MAC address in the L2 header
will be the MAC address of the real server to which the packet
is sent. The destination address in the IP header will be the VIP.

This is done with NF_IP_LOCAL_IN.



ARPD arp user space daemon

ARPD is a user space daemon; it can be used if we want to
remove some work from the kernel.

The user space daemon is part of iproute2 (misc/arpd.c).

ARPD has support for negative entries and for dead hosts.

The kernel arp code does NOT support these types of entries!

The kernel by default is not compiled with ARPD support; we
should set CONFIG_ARPD for using it:

Networking Support->Networking Options->IP: ARP daemon
support. (It is considered Experimental.)

see: /usr/share/doc/iproute-2.6.22/arpd.ps (Alexey Kuznetsov).



ARPD

We should also set app_probes to a value greater than 0 by
setting

/proc/sys/net/ipv4/neigh/eth0/app_solicit

This can be done also by the -a (active_probes) parameter.

The value of this parameter tells how many ARP requests to
send before that neighbour is considered dead.

The -k parameter tells the kernel not to send ARP broadcasts; in
such a case, the arpd daemon not only listens to ARP requests,
but also sends ARP broadcasts.

We can tune kernel parameters as we like; in fact, we can tune them
so that arp requests will be sent only from the daemon and not
from the kernel at all.

ARPD

Activation:

arpd -a 1 -k eth0 &

On some distros, you will get the error "db_open: No such file
or directory" unless you first run mkdir /var/lib/arpd/
(for the arpd.db file).

Pay attention: you can start the arpd daemon even when there is no
support in the kernel (CONFIG_ARPD is not set).

In this case, arp packets are still caught by the arpd daemon in
get_arp_pkt() (misc/arpd.c),

but you don't get messages from the kernel:

get_kern_msg() is not called (misc/arpd.c).



ARPD

Tip: to check if CONFIG_ARPD is set, simply see if there are
any results from

cat /proc/kallsyms | grep neigh_app



Mac addresses

MAC address (Media Access Control)

According to specs, MAC address should be unique.

The first 3 bytes identify the hardware manufacturer of the card.

They are allocated by the IEEE.

There are exceptions to this rule.

Technion (?)

Ethernet HWaddr 00:16:3E:3F:6E:5D



ARPwatch (detect ARP cache
poisoning)

A change of MAC address can be the result of a security
attack (ARP cache poisoning, ARP spoofing).

Arpwatch is an open source tool which helps to detect such attacks.

Activation: arpwatch -d -i eth0 (output to stderr)

Arpwatch keeps a table of ip/mac addresses and senses
when there is a change.

-d is for redirecting the log to stderr (no syslog, no mail).

In case someone changed a MAC address on the same
network, you will get a message like this:

ARPwatch - Example
From: root (Arpwatch)
To: root
Subject: changed ethernet address (jupiter)
hostname: jupiter
ip address: 192.168.0.54
ethernet address: aa:bb:cc:dd:ee:ff
ethernet vendor: <unknown>
old ethernet address: 0:20:18:61:e5:e0
old ethernet vendor: ...

Change of IP address/Mac address

A change of IP address does not trigger notifying the
neighbours.

A change of MAC address (NETDEV_CHANGEADDR) also does
not trigger notifying the neighbours.

It does update the local arp table by neigh_changeaddr().

An exception to this is irlan eth:
irlan_eth_send_gratuitous_arp()
(net/irda/irlan/irlan_eth.c)

Some NICs don't permit changing the MAC address; you get:
SIOCSIFHWADDR: Device or resource busy

Sometimes you only need to bring the NIC down first.



Flushing the arp table

Flushing the arp:

ip -statistics neigh flush dev eth0

*** Round 1, deleting 7 entries ***

*** Flush is complete after 1 round ***



Flushing the arp table -contd

Specifying -statistics twice will also show which entries were
deleted, their mac addresses, etc.

ip -statistics -statistics neigh flush dev eth0

192.168.0.254 lladdr 00:04:27:fd:ad:30 ref 17 used 0/0/0 REACHABLE

*** Round 1, deleting 1 entries ***

*** Flush is complete after 1 round ***

calls neigh_delete() in net/core/neighbour.c

Changes the state to NUD_FAILED



Neighbour states

[Figure: neighbour state diagram; neigh_alloc() creates the entry in
state None, and the other states shown are Incomplete, Reachable,
Stale, Delay, and Probe.]

Neighboring Subsystem states

NUD states

NUD_NONE

NUD_REACHABLE

NUD_STALE

NUD_DELAY

NUD_PROBE

NUD_FAILED

NUD_INCOMPLETE

Neighboring Subsystem states

From the beginning of net/core/neighbour.c:

Is it a (latent) bug ?
if (!(state & NUD_IN_TIMER)) {
#ifndef CONFIG_SMP
    printk(KERN_WARNING "neigh: timer & !nud_in_timer\n");
#endif
    goto out;
}

Neighboring Subsystem states

Special states:

NUD_NOARP

NUD_PERMANENT

No state transitions are allowed from these states to another state.

Neighboring Subsystem states

NUD state combinations:

NUD_IN_TIMER (NUD_INCOMPLETE|NUD_REACHABLE|
NUD_DELAY|NUD_PROBE)

When removing a neighbour, we stop the timer (call
del_timer()) only if the state is one of the NUD_IN_TIMER states.

NUD_VALID (NUD_PERMANENT|NUD_NOARP|
NUD_REACHABLE|NUD_PROBE|NUD_STALE|NUD_DELAY)

NUD_CONNECTED (NUD_PERMANENT|NUD_NOARP|
NUD_REACHABLE)

Neighbour states

When a neighbour is in the STALE state, it remains in this
state until one of two things occurs:

A packet is sent to this neighbour.

Its state changes to FAILED.

neigh_resolve_output() and neigh_connected_output():
net/core/neighbour.c

A neighbour in INCOMPLETE state does not have its MAC address
set yet (the ha member of neighbour).

So when neigh_resolve_output() is called, the neighbour state
is changed to INCOMPLETE.

Neighbour states

When neigh_connected_output() is called, the MAC address of the
neighbour is known, so we end up calling dev_queue_xmit(),
which calls the hard_start_xmit() method of the NIC device driver.

The hard_start_xmit() method actually puts the frame on the wire.



IPSec

Works at network IP layer (L3)

Used in many forms of secured networks like VPNs.

Mandatory in IPv6. (not in IPv4)

Implemented in many operating systems: Linux, Solaris, Windows,


and more.

In 2.6 kernel : implemented by Dave Miller and Alexey Kuznetsov.

Transformation bundles.

Chain of dst entries; only the last one is for routing.

The dst entries in the chain (except the last one) have a NULL
neighbour as a member.



IPSec-cont.

RFC2401

IPSec-cont.

User space tools: http://ipsec-tools.sf.net

Building VPN : http://www.openswan.org/ (Open Source).

There are also non IPSec solutions for VPN

OpenVPN uses ssl/tls.

example: pptp

struct xfrm_policy has the following member:

struct dst_entry *bundles.

__xfrm4_bundle_create() creates dst_entries (with the
DST_NOHASH flag); see net/ipv4/xfrm4_policy.c.

Transport Mode and Tunnel Mode.



IPSec-contd.

Show the security policies:

ip xfrm policy show

Create RSA keys:

ipsec rsasigkey --verbose 2048 > keys.txt

ipsec showhostkey --left > left.publickey

ipsec showhostkey --right > right.publickey



IPSec-contd.
Example: Host to Host VPN (using openswan)
in /etc/ipsec.conf:
conn linux-to-linux
    left=192.168.0.189
    leftnexthop=%direct
    leftrsasigkey=0sAQPPQ...
    right=192.168.0.45
    rightnexthop=%direct
    rightrsasigkey=0sAQNwb...
    type=tunnel
    auto=start

IPSec-contd.

service ipsec start (to start the service)

ipsec verify - checks your system to see if IPsec got installed and
started correctly.

ipsec auto --status

If you see "IPsec SA established", this implies success.

Look for errors in /var/log/secure (Fedora Core) or in the kernel syslog.



Tips for hacking

Documentation/networking/ip-sysctl.txt: networking kernel tunables

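Example of reading a hex address:

iph->daddr == 0x0A00A8C0
means checking if the address is 192.168.0.10 (C0=192, A8=168, 00=0, 0A=10).
The address is stored in network byte order, so on a little-endian
machine the bytes read right to left.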

A shell script for converting an IP address to hex: (ipToHex.sh)

#!/bin/sh
# Print each dotted-decimal octet of the address as two hex digits.
IP_ADDR=$1
for I in $(echo "${IP_ADDR}" | sed -e "s/\./ /g"); do
    printf '%02X' "$I"
done
echo

usage example: ./ipToHex.sh 192.168.0.1 => C0A80001

Tips for hacking - Contd.

Disable ping reply:

echo 1 >/proc/sys/net/ipv4/icmp_echo_ignore_all

Disable arp: ip link set eth0 arp off (the NOARP flag will be set)

Also ifconfig eth0 -arp has the same effect.

How can you get the Path MTU to a destination (PMTU)?

Use tracepath (see man tracepath).

Tracepath is from iputils.



Tips for hacking - Contd.

inet_addr_type() method: returns the address type; the input to this
method is the IP address. The return value can be RTN_LOCAL,
RTN_UNICAST, RTN_BROADCAST, RTN_MULTICAST, etc.
See: net/ipv4/fib_frontend.c

Tips for hacking - Contd.

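In case you want to send a packet from a user space application
through a specified device without altering any routing tables:

#include <string.h>      /* memset, strncpy */
#include <net/if.h>      /* struct ifreq, IFNAMSIZ */
#include <sys/socket.h>  /* setsockopt, SO_BINDTODEVICE */

struct ifreq interface;
memset(&interface, 0, sizeof(interface));
strncpy(interface.ifr_name, "eth1", IFNAMSIZ - 1);
/* s is an already opened socket */
if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE,
               (char *)&interface, sizeof(interface)) < 0) {
    perror("error setting SO_BINDTODEVICE");
    exit(1);
}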

Tips for hacking - Contd.

Keep iphdr struct handy (printout): (from linux/ip.h)


struct iphdr {
    __u8    ihl:4,
            version:4;
    __u8    tos;
    __be16  tot_len;
    __be16  id;
    __be16  frag_off;
    __u8    ttl;
    __u8    protocol;
    __sum16 check;
    __be32  saddr;
    __be32  daddr;
    /* The options start here. */
};

Tips for hacking - Contd.

NIPQUAD(): macro for printing IPv4 addresses in dotted-decimal form

Printing mac address (from net_device):

printk("sk_buff->dev =%02x:%02x:%02x:%02x:%02x:%02x\n",
       ((skb)->dev)->dev_addr[0], ((skb)->dev)->dev_addr[1],
       ((skb)->dev)->dev_addr[2], ((skb)->dev)->dev_addr[3],
       ((skb)->dev)->dev_addr[4], ((skb)->dev)->dev_addr[5]);

Printing IP address (primary_key) of a neighbour (in hex format):

printk("neigh->primary_key =%02x.%02x.%02x.%02x\n",
       neigh->primary_key[0], neigh->primary_key[1],
       neigh->primary_key[2], neigh->primary_key[3]);

Tips for hacking - Contd.

Or:
printk("***neigh->primary_key= %u.%u.%u.%u\n",
NIPQUAD(*(u32*)neigh->primary_key));

CONFIG_NET_DMA is for TCP/IP offload.

When you encounter xfrm / CONFIG_XFRM, this has to do with
IPSEC (transformers).

Tips for hacking - Contd.

Showing arp statistics:

cat /proc/net/stat/arp_cache
entries allocs destroys hash_grows lookups hits res_failed rcv_probes_mcast rcv_probes_ucast periodic_gc_runs forced_gc_runs

periodic_gc_runs: how many times neigh_periodic_timer() was called.

Links and more info
1) Linux Network Stack Walkthrough (2.4.20):
http://gicl.cs.drexel.edu/people/sevy/network/Linux_network_stack_walkthrough.html
2) Understanding the Linux Kernel, Second Edition
By Daniel P. Bovet, Marco Cesati
Second Edition December 2002
chapter 18: networking.
- Understanding Linux Network Internals, Christian Benvenuti
O'Reilly, First Edition.

Links and more info
3) Linux Device Driver, by Jonathan Corbet, Alessandro Rubini, Greg
Kroah-Hartman
Third Edition February 2005.

Chapter 17, Network Drivers


4) Linux networking: (a lot of docs about specific networking topics)

http://linux-net.osdl.org/index.php/Main_Page
5) netdev mailing list: http://www.spinics.net/lists/netdev/

Links and more info
6) Removal of multipath routing cache from kernel code:
http://lists.openwall.net/netdev/2007/03/12/76
http://lwn.net/Articles/241465/
7) Linux Advanced Routing & Traffic Control :
http://lartc.org/
8) ebtables a filtering tool for a bridging:
http://ebtables.sourceforge.net/

Links and more info
9) Writing Network Device Driver for Linux: (article)

http://app.linux.org.mt/article/writing-netdrivers?locale=en

Links and more info
10) Netconf a yearly networking conference; first was in 2004.

http://vger.kernel.org/netconf2004.html

http://vger.kernel.org/netconf2005.html

http://vger.kernel.org/netconf2006.html

Next one: Linux Conf Australia, January 2008,Melbourne

David S. Miller, James Morris , Rusty Russell , Jamal Hadi Salim ,Stephen
Hemminger , Harald Welte, Hideaki YOSHIFUJI, Herbert Xu ,Thomas Graf ,Robert
Olsson ,Arnaldo Carvalho de Melo and others

Links and more info
11) Policy Routing With Linux - Online Book Edition

by Matthew G. Marsh (Sams).

http://www.policyrouting.org/PolicyRoutingBook/
12) THRASH - A dynamic LC-trie and hash data structure:
Robert Olsson Stefan Nilsson, August 2006
http://www.csc.kth.se/~snilsson/public/papers/trash/trash.pdf
13) IPSec howto:
http://www.ipsec-howto.org/t1.html

Links and more info
14) Openswan: Building and Integrating Virtual Private
Networks , by Paul Wouters, Ken Bantoft
http://www.packtpub.com/book/openswan/mid/061205jqdnh2by
publisher: Packt Publishing.
15) a book including chapters about LVS:
The Linux Enterprise Cluster- Build a Highly Available Cluster
with Commodity Hardware and Free Software, By Karl
Kopper.
http://www.nostarch.com/frameset.php?startat=cluster
15) http://www.vyatta.com - Open-Source Networking


Links and more info
16) Address Resolution Protocol (ARP)

http://linux-ip.net/html/ether-arp.html
17) ARPWatch a tool for monitor incoming ARP traffic.
Lawrence Berkeley National Laboratory -
ftp://ftp.ee.lbl.gov/arpwatch.tar.gz.
18) arptables:
http://ebtables.sourceforge.net/download.html
19) TCP/IP Illustrated, Volume 1: The Protocols
By W. Richard Stevens
http://www.informit.com/store/product.aspx?isbn=0201633469

Links and more info
20) Unix Network Programming, Volume 1: The Sockets
Networking API (3rd Edition) (Addison-Wesley Professional
Computing Series) (Hardcover)
by W. Richard Stevens (Author), Bill Fenner (Author), Andrew M.
Rudoff (Author)

Questions

Questions ?

Thank You !

IPV6
Linux Kernel Networking (3)-
advanced topics
Rami Rosen
[email protected]
Haifux, April 2008
www.haifux.org

Linux Kernel Networking (3)-
advanced topics

Note:

This lecture is a sequel to the following two lectures I gave:

Linux Kernel Networking lecture

http://www.haifux.org/lectures/172/

slides:http://www.haifux.org/lectures/172/netLec.pdf

Advanced Linux Kernel Networking - Neighboring


Subsystem and IPSec lecture

http://www.haifux.org/lectures/180/

slides:http://www.haifux.org/lectures/180/netLec2.pdf

Contents

IPV6

General

ICMPV6

Radvd

Autoconfiguration

Network Namespaces

Bridging Subsystem

Pktgen kernel module.

Tips

Links and more info



Scope

We will not deal with wireless.

The L3 network protocol we deal with is ipv4/ipv6, and the


L2 Link Layer protocol is Ethernet.

IPV6 -General

Discussions started at IETF in 1992 (IPng).

First Specification: RFC1883 (1995).

Was obsoleted by RFC2460 (1998).

Main reason for IPV6: shortage of IPv4 addresses.

The address space is enlarged in IPV6 from
2^32 to 2^128 (a factor of 2^96).

Secondary reason: improvements over IPV4.

For example: using ICMPV6 as a neighbour protocol
instead of ARP.

Fixed-size IP header (40 bytes); in IPV4 it is 20-60 bytes.



IPV6 -General

Usually in IPV4, mobile devices are behind NAT.

Using mobile IPV6 devices which are not behind a NAT can
avoid the need to send Keep-Alive.

Growing market of mobile devices.

Some say the number of mobile devices will exceed 4 billion
by the end of 2008.

IPSec is mandatory in IPV6 and optional in IPV4,
though most operating systems implemented IPSec also
in IPv4.

IPV6 - history

At the end of 1997, IBM's AIX 4.3 was the first commercial
platform that supported IPv6.

Sun Solaris has had IPv6 support since Solaris 8, in February 2000.

2007: Microsoft Windows Vista has IPv6 supported
and enabled by default.

February 2008: IANA added DNS records for the IPv6
addresses of six of the thirteen root name servers to
enable Internet hosts to communicate by IPV6.

Around the world

There was a big IPV6 experiment in China

USA, Japan, Korea, France: sites which operate


with IPV6.

Israel: experiment in Internet Zahav

http://www.m6bone.net/

Freenet6: http://go6.net/4105/freenet.asp

IPV6 in the Linux Kernel

The IPV6 kernel part was started long ago, in 1996 (by Pedro
Roque), based on the BSD API; it was Linux kernel 2.1.8.

When adding IPV6 support to the Linux kernel, almost only
the Network layer (L3) is changed.

Almost no other layer needs to be changed,
because each layer operates independently.

IPV6 in the Linux Kernel-contd.

USAGI Project was founded in 2000 in Japan.

Universal Playground for Ipv6.

Run by volunteers, mostly from Japan. USAGI aimed
at providing missing implementation according to the new
IPV6 RFCs.

Awarded the IPV6 Ready logo.

Yoshifuji Hideaki - member of the USAGI Project; Keio University;
co-maintainer of IPV6 in the Linux Kernel (also maintainer of
iputils).

IPV6 -General

Yoshfuji git tree.

From time to time, the main networking tree pulls this git tree.

git-clone git://git.linux-ipv6.org/gitroot/yoshfuji/linux-2.6-dev.git inet-2.6.26

This git tree supports IPV6 Multicast Routing (with the pim6sd
daemon, ported from KAME; PIM-SM stands for Protocol
Independent Multicast - Sparse Mode).

Based on Mickael Hoerdt IPv6 Multicast Forwarding Patch.

http://clarinet.u-strasbg.fr/~hoerdt/

Hoerdt's patch is partially based on mrouted.

Many patches in 2.6.* kernel are from the USAGI project.


There was also the KAME project in Japan, sponsored by six
large companies.

It aimed at providing a free stack of IPv6, IPsec, and Mobile
IPv6 for BSD variants.

http://www.kame.net/

Sponsored by Hitachi and Toshiba and others.

Mobile IPV6:

HUT - Helsinki University of Technology

http://www.mobile-ipv6.org/

IPV6 -General

Many router vendors support IPV6:

Cisco supports IPv6 since IOS 12.2(2)T

Hitachi

Nortel Networks

Juniper Networks, others.

http://www.ietf.org/IESG/Implementations/ipv6-implementations.txt

Drawbacks of IPV6

Currently LVS is not implemented in IPV6.

It takes effort to port existing IPV4 applications.

Tunnels, transitions (IPv6 and IPv4)



IPV6 Addresses

RFC 4291, IP Version 6 Addressing Architecture

Format of IPV6 address:

8 blocks of 16 bits => 128 bits.

xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx

Where x is a hexadecimal digit.

A run of consecutive all-zero blocks can be replaced with "::", but only once per address.

Localhost address:

0000:0000:0000:0000:0000:0000:0000:0001

Or, in compressed form:

::1

IPV6 Addresses - contd

No broadcast address in IPV6 as in IPV4.

Global addresses: 2001:

There are more.

Link Local : FE80:



IPV6 -General

Caveat:

Sometimes ipv6 is configured as a module.

You cannot rmmod the ipv6 module.

How can I know if my kernel has support for IPV6?

Run: ls /proc/net/if_inet6

Managing IP address:

ifconfig eth0 inet6 add 2001:db8:0:1:240:95ff:fe30:b0a3/64

ifconfig eth0 inet6 del 2001:db8:0:1:240:95ff:fe30:b0a3/64



IPV6 -General

Can be done also by ip command (IPROUTE2).

ip -6 addr

Using tcpdump to monitor ipv6 traffic:

tcpdump ip6

or , for example:

tcpdump ip6 and udp port 9999.

For wireshark fans:

tethereal -R ipv6

IPV6 -General

To show the Kernel IPv6 routing table :

route -A inet6

Also: ip -6 route show

ssh usage: ssh -6 2001:db8:0:1:230:48ff:fe61:e5e0

traceroute6 -i eth0 fe80::20d:60ff:fe9a:26d2

netstat -A inet6

An ip6tables solution exists for IPV6.



IPV6 -General

tracepath6 finds the PMTU (path MTU).

This is done using the IPV6_MTU_DISCOVER and
IPV6_PMTUDISC_PROBE socket options on a UDP socket.



ICMPV6

In IPV6, the neighboring subsystem uses ICMPV6 for


Neighboring messages (instead of ARP in IPV4).

There are 5 ICMPV6 message types for neighbour discovery:

Message                  ICMPV6 type
ROUTER SOLICITATION      133
ROUTER ADVERTISEMENT     134   (see sniff below)
NEIGHBOUR SOLICITATION   135   (parallel to ARP request in IPV4)
NEIGHBOUR ADVERTISEMENT  136   (parallel to ARP reply in IPV4)
REDIRECT                 137

ROUTER ADVERTISEMENT can be periodic or on demand.

When a ROUTER ADVERTISEMENT is sent as a reply to a
ROUTER SOLICITATION, the destination address is unicast.
When it is periodic, the destination address is a multicast (all
hosts).

Statefull and Stateless config

There are two ways to configure IPV6 addresses on hosts
(apart from configuring them manually):

Stateful: DHCPV6 on a server.

RFC3315, Dynamic Host Configuration Protocol for IPv6
(DHCPv6).

Stateless: for example, RADVD or Quagga on a server.

RFC 4862 - IPv6 Stateless Address Autoconfiguration
(SLAAC) from 2007; obsoletes RFC 2462 (1998).

In RADVD, you declare a prefix that only hosts (not routers)
use. You can define more than one prefix.

Special Addresses:

All nodes (or : All hosts) address: FF02::1

ipv6_addr_all_nodes() sets address to FF02::1

All Routers address: FF02::2

ipv6_addr_all_routers() sets address to FF02::2


Both in include/net/addrconf.h

IPV6: All addresses starting with FF are multicast addresses.

IPV4: Addresses in the range 224.0.0.0 - 239.255.255.255
are multicast addresses (class D).
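
Both rules are simple prefix checks. A minimal user-space sketch, assuming POSIX inet_pton(); the function names is_multicast6() and is_multicast4() are made up for this illustration:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>

/* Hypothetical helpers. Both return 1 for a multicast address,
 * 0 for any other valid address, and -1 if the text cannot be parsed. */
int is_multicast6(const char *text)
{
        struct in6_addr a;

        assert(text != NULL);
        if (inet_pton(AF_INET6, text, &a) != 1)
                return -1;
        return a.s6_addr[0] == 0xff;            /* FF00::/8 */
}

int is_multicast4(const char *text)
{
        struct in_addr a;

        assert(text != NULL);
        if (inet_pton(AF_INET, text, &a) != 1)
                return -1;
        /* class D: the top four bits are 1110 (224.0.0.0 - 239.255.255.255) */
        return (ntohl(a.s_addr) >> 28) == 0xe;
}
```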

see http://www.iana.org/assignments/ipv6-address-space

ping6 -I eth0 FF02::2 or ping6 -I eth0 ff02:0:0:0:0:0:0:2


will cause all the routers to reply.

This means that all machines on which


/proc/sys/net/ipv6/conf/eth*/forwarding is 1 will reply.

RADVD

RADVD stands for ROUTER ADVERTISEMENT daemon.

Maintainer: Pekka Savola

http://www.litech.org/radvd/

Sends ROUTER ADVERTISEMENT messages.

The handler for all neighboring messages is ndisc_rcv().


(net/ipv6/ndisc.c)

When an NDISC_ROUTER_ADVERTISEMENT message
arrives from radvd, ndisc_rcv() calls ndisc_router_discovery().
(net/ipv6/ndisc.c)

RADVD - contd

If the receiving machine is a router (or is configured not to
accept router advertisements), the Router Advertisement is
not handled: see this code fragment from
ndisc_router_discovery() (net/ipv6/ndisc.c)
if (in6_dev->cnf.forwarding || !in6_dev->cnf.accept_ra) {
        in6_dev_put(in6_dev);
        return;
}

addrconf_prefix_rcv() eventually tries to create an address
using the prefix received from radvd and the mac address of
the machine (net/ipv6/addrconf.c).
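
How a prefix and a mac address combine can be sketched in user space. This is an illustration of the modified EUI-64 scheme, not the kernel code; the function name slaac_address() is made up, and a /64 prefix is assumed:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: builds prefix + interface identifier, where the
 * identifier is the mac address with FF:FE inserted in the middle and
 * the universal/local bit of the first byte flipped.
 * Returns 0 on success, -1 on failure. */
int slaac_address(const char *prefix_text, const uint8_t mac[6],
                  char *out, size_t outlen)
{
        struct in6_addr addr;

        assert(prefix_text != NULL && mac != NULL && out != NULL);
        if (inet_pton(AF_INET6, prefix_text, &addr) != 1)
                return -1;
        memcpy(&addr.s6_addr[8], mac, 3);       /* first half of the mac */
        addr.s6_addr[8] ^= 0x02;                /* flip universal/local bit */
        addr.s6_addr[11] = 0xff;
        addr.s6_addr[12] = 0xfe;
        memcpy(&addr.s6_addr[13], mac + 3, 3);  /* second half of the mac */
        if (inet_ntop(AF_INET6, &addr, out, (socklen_t)outlen) == NULL)
                return -1;
        return 0;
}
```

With the prefix 2001:db8:0:1:: and the mac address 00:40:95:30:b0:a3, this gives 2001:db8:0:1:240:95ff:fe30:b0a3.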

RADVD - contd

Adding the IPV6 address is done in ipv6_add_addr()

How can we be sure that no other host on the LAN
uses the same address?

We can't!

Therefore we first set this address to be tentative.

In ipv6_add_addr():

ifa->flags = flags | IFA_F_TENTATIVE;

This means that initially this address cannot be used to
communicate with other hosts (except for neighboring
messages).

RADVD - contd

Then we start DAD (by calling addrconf_dad_start())

DAD is Duplicate Address Detection.

Upon successful completion of DAD, the IFA_F_TENTATIVE
flag is removed and the host can communicate with other
hosts on the LAN. The flag is set to be IFA_F_PERMANENT.

Upon failure of DAD, the address is deleted.

You see a message like this in the kernel log:

eth0: duplicate address detected!



RADVD - contd

Caveat:

When using radvd official FC8 rpm, you will see,


in /var/log/messages, the following message after starting the
daemon:
...
radvd[2614]: version 1.0 started
radvd[2615]: failed to set CurHopLimit (64) for eth0

You may ignore this message.

This happens because the daemon runs as user radvd.

This was fixed in radvd 1.1 (still no Fedora official rpm).



RADVD - contd

radvd: sending router advertisement


// from radvd-1.0/send.c
send_ra()
{
...
addr.sin6_family = AF_INET6;
addr.sin6_port = htons(IPPROTO_ICMPV6);
memcpy(&addr.sin6_addr, dest, sizeof(struct in6_addr));
memset(&buff, 0, sizeof(buff));

radvert = (struct nd_router_advert *) buff;
radvert->nd_ra_type = ND_ROUTER_ADVERT;
...

and we have in /usr/include/netinet/icmp6.h:

#define ND_ROUTER_SOLICIT 133

#define ND_ROUTER_ADVERT 134

This Router Advertisement is sent to all hosts address:

FF02::1

The nd_router_advert structure is declared
in /usr/include/netinet/icmp6.h.

It includes the icmpv6 header.

Is there any protection against malicious Router
Advertisements?

There is nothing like the rejection of unsolicited ARP replies
in IPV4 (which is the default behaviour, in order to prevent
ARP cache poisoning).

radvd.conf - example:
interface eth0 #see man radvd.conf
{
AdvSendAdvert on;
MaxRtrAdvInterval 30;
prefix 2002:db8:0:1::/64
{
AdvOnLink on;
AdvAutonomous on;
};
};

radvd.conf example (contd)

The prefix length MUST be 64.

See RFC 2464, Transmission of IPv6 Packets over Ethernet
Networks, and RFC 4291, IP Version 6 Addressing Architecture.

Caveat:

If the prefix length is not 64, the prefix in the Router
Advertisement will not be used for address autoconfiguration.

Caveat: You will not notice it, unless your syslog prints
KERN_DEBUG messages (see man syslog.conf)

If the syslog is configured to print kernel debug
messages, you will see this message in the kernel log:
IPv6 addrconf: prefix with wrong length

radvd.conf example (contd)

Caveat 2:

You cannot start the radvd service if there is no link local address
configured on your machine. Trying to do so will result in:
radvd: no linklocal address configured for (null)
radvd: error parsing or activating the config file:
[FAILED]

Even if it were possible, the kernel would reject Router
Advertisements originating from machines without a link local
IPV6 address.

radvd -d 5 -m stderr (for starting with many debug messages)



Router Advertisement with Prefix
Information option sniff

valid_lft - how long this prefix is valid, in seconds.

preferred_lft - how long this address can be in preferred
state, in seconds.

ip -6 addr (note: ifconfig does not show these time values)

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
    inet6 2004:db8:0:1:240:95ff:fe30:b0a3/64 scope global dynamic
       valid_lft 279sec preferred_lft 99sec
    inet6 fe80::240:95ff:fe30:b0a3/64 scope link
       valid_lft forever preferred_lft forever

When the preferred time is over, the address becomes deprecated:
it still works, but it should no longer be chosen as the source
address of new connections.

When the valid time is over, the IPV6 address is removed.

ipv6_del_addr() in net/ipv6/addrconf.c is responsible for
deleting non valid addresses (called from addrconf_verify())
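
The two lifetimes split the life of an autoconfigured address into three phases. A minimal sketch of that classification; the enum and the function name are made up for this illustration (the kernel itself tracks this with flags such as IFA_F_DEPRECATED), and the exact boundary handling is an assumption:

```c
#include <assert.h>

/* Hypothetical state names for this sketch only. */
enum addr_state { ADDR_PREFERRED, ADDR_DEPRECATED, ADDR_INVALID };

/* age, preferred_lft and valid_lft are all in seconds;
 * age is the time since the prefix was last advertised. */
enum addr_state addr_state(unsigned long age, unsigned long preferred_lft,
                           unsigned long valid_lft)
{
        assert(preferred_lft <= valid_lft);
        if (age >= valid_lft)
                return ADDR_INVALID;    /* candidate for removal */
        if (age >= preferred_lft)
                return ADDR_DEPRECATED; /* past the preferred lifetime */
        return ADDR_PREFERRED;
}
```

Sending a new Router Advertisement with fresh lifetimes resets the age, which is what makes renumbering possible.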

This is useful for renumbering.

RFC 2894 - Router Renumbering for IPv6


Radvd also sends its own mac address as part of the options.

This enables the receiving host to add/update its neighbour table
accordingly with the mac address of the router:

From ndisc_router_discovery(): (net/ipv6/ndisc.c)


...
lladdr=ndisc_opt_addr_data(ndopts.nd_opts_src_lladdr,skb->dev);
...
neigh_update(neigh, lladdr, ...)
...

lladdr is the mac address of the router, passed in the options
of the Router Advertisement.

Router Advertisement with Link
Layer option sniff

radvd.conf and a default router

Specifying AdvDefaultLifetime in radvd.conf will cause the


host to add the radvd router as a default router.

Unless /proc/sys/net/ipv6/conf/eth0/accept_ra_defrtr is 0.

This default router has a limited lifetime. It will expire after the
value specified for AdvDefaultLifetime.

Maximum AdvDefaultLifetime value is 18.2 hours.


Example (after setting AdvDefaultLifetime to 8000)

ip -6 route show default

default via fe80::215:58ff:fe95:5be6 dev eth0 proto kernel


metric 1024 expires 7996sec mtu 1420 advmss 1360
hoplimit 64

radvd.conf and default router-cont.

When we stop the radvd daemon, it sends a Router
Advertisement with Router Lifetime set to 0.

This will cause the hosts which receive this message to
delete the default router.

Implemented by: ip6_del_rt() called from


ndisc_router_discovery() in net/ipv6/ndisc.c

You can also set MTU in radvd.conf

This is not the MTU you see in ifconfig, but you see it
in /proc/sys/net/ipv6/conf/eth0/mtu.

Radvd builtin util: radvdump prints out the contents of


incoming router advertisements sent by radvd.

IPV6: Neighbour Solicitation sniff

Quagga

Quagga replaces Zebra

http://www.quagga.net/

Many routing protocols (BGP, OSPF, RIP, others)

Supports IPV6

Supports sending Router Advertisements.


interface eth0
ipv6 nd send-ra
ipv6 nd prefix-advertisement 2001:0db8:0005:0006::/64

Privacy Extensions

Since the address is built using a prefix and a MAC address,
the identity of the machine can be found.

To avoid this, you can use Privacy Extensions.

This adds randomness to the IPV6 address


creation process. (calling get_random_bytes() for
example).

RFC 3041 - Privacy Extensions for Stateless Address


Autoconfiguration in IPv6.

You need CONFIG_IPV6_PRIVACY to be set when building


the kernel.

Hosts can disable receiving Router Advertisements by setting


/proc/sys/net/ipv6/conf/all/accept_ra to 0.

Hosts can request Router Advertisements by sending


a Router Solicitation message.

Autoconfiguration

When a host boots, (and its cable is connected) it first


creates a Link Local Address.

A Link Local address starts with FE80.

This address is tentative (only works with ND messages).

The host sends a Neighbour Solicitation message.

The target is its tentative address, the source is all zeros.

This is DAD (Duplicate Address Detection).

If there is no answer in due time, the state is changed to


permanent. (IFA_F_PERMANENT)

Autoconfiguration - contd.

Then the host sends a Router Solicitation.

The destination address of the Router Solicitation
message is the All Routers multicast address
FF02::2

All the routers reply with a Router Advertisement


message.

The host sets address/addresses according to


the prefix/prefixes received and starts the DAD
process as before.

Autoconfiguration - contd.

At the end of the process, the host will have two (or more)
IPv6 addresses:

Link Local IPV6 address.

The IPV6 address/addresses which was built


using the prefix. (in case that there is one or more
routers sending RAs).

There are three trials by default for sending Router


Solicitation.

It can be configured by:

/proc/sys/net/ipv6/conf/eth0/router_solicitations

If a host boots when its cable is disconnected it will not get an


IPV6 address.

Connecting the cable will trigger an event


(NETDEV_CHANGE) in addrconf_notify() and will result in
sending Router Solicits (calling ndisc_send_rs) and
eventually autoconfiguration will set an IPV6 address to the
host.

Optimistic DAD

Do not wait till DAD is completed, and allow hosts to


communicate with peers before DAD has finished
successfully

Target: to reduce latencies in the DAD process.

The kernel should be built with: CONFIG_IPV6_OPTIMISTIC_DAD.

Very few apps need Optimistic DAD; usually the DAD
process takes less than 2 seconds.

RFC 4429 , Optimistic Duplicate Address Detection (DAD) for


IPv6.

IPV6 Fragmentation

Unlike in IPV4, routers do not fragment packets in IPV6.

The minimum MTU in IPV6 is 1280 bytes.

It is the responsibility of the host (sender) to fragment


packets.

Path MTU discovery is done by ICMPV6

ICMPV6_PKT_TOOBIG messages.

RFC 1981, Path MTU Discovery for IP version 6.


Lookup in the IPV6 routing tables is done by fib6_lookup()

(net/ipv6/ip6_fib.c)

The parameters for the lookup are the root of the table and
the source and destination IPV6 address. (struct in6_addr)

The result of the lookup is saved in rt6_info.

rt6_info is the parallel of rtable in IPV4.


// from include/net/ip6_fib.h

struct rt6_info
{
        union {
                struct dst_entry dst;
        } u;
        ...
};

IPV6 - contd

Enable forwarding:

echo "1" > /proc/sys/net/ipv6/conf/all/forwarding

For Multicast Routing forwarding, there will be in the future:

/proc/sys/net/ipv6/conf/all/mc_forwarding

IPV6 header 40 bytes

include/linux/ipv6.h

Version (4) | Priority/Traffic Class (4) | Flow Label (24)
Payload Length (16) | Next Header (8) | Hop Limit (8)
Source Address (128 bits => 16 bytes)
Destination Address (128 bits => 16 bytes)

IPV6 header - contd.

The IPV6 header length is fixed: 40 bytes.

Therefore there is no header length field as in IPV4.

In IPV4 the ip header is of variable size: 20 - 60 bytes, so we
need the header length field. Options can be added to the base
ip header in multiples of 4 bytes, up to 60 bytes.

Extension headers in ipv6.
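
The fixed 40-byte size can be verified at compile time. A minimal user-space sketch; struct ipv6_fixed_hdr and ipv6_version() are made-up names for this illustration (the kernel's own definition is struct ipv6hdr), and the field split follows the layout shown in the header slide:

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the fixed IPv6 header. */
struct ipv6_fixed_hdr {
        uint8_t         ver_prio;       /* version (high 4 bits) + priority */
        uint8_t         flow_lbl[3];    /* flow label */
        uint16_t        payload_len;    /* network byte order */
        uint8_t         nexthdr;
        uint8_t         hop_limit;
        struct in6_addr saddr;          /* 16 bytes */
        struct in6_addr daddr;          /* 16 bytes */
};

/* No header length field is needed: the size never changes. */
_Static_assert(sizeof(struct ipv6_fixed_hdr) == 40,
               "the fixed IPv6 header must be 40 bytes");

/* Returns the IP version stored in the first byte of a packet. */
unsigned int ipv6_version(const uint8_t *pkt)
{
        assert(pkt != NULL);
        return pkt[0] >> 4;
}
```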



IPV6 header Hop Limit

The hop limit is by default 64.

IPV6_DEFAULT_HOPLIMIT is 64.

This is the parallel of ttl field in ip header.

ip6_forward() checks the hop_limit; when it is 1 or less,
it sends an ICMPV6 message
(ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT)
and the packet is dropped.
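
The forwarding decision can be sketched as follows. This is an illustration, not the ip6_forward() code; the names are made up, and the sketch assumes a packet is dropped when its received hop limit is 1 or less, since it would reach 0 after the decrement:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts for this sketch only. */
enum fwd_verdict { FWD_OK, FWD_DROP_TIME_EXCEEDED };

/* Decrements the hop limit when the packet may be forwarded.
 * On FWD_DROP_TIME_EXCEEDED the caller would send the ICMPV6
 * time exceeded message and drop the packet. */
enum fwd_verdict forward_hop_limit(uint8_t *hop_limit)
{
        assert(hop_limit != NULL);
        if (*hop_limit <= 1)
                return FWD_DROP_TIME_EXCEEDED;
        (*hop_limit)--;
        return FWD_OK;
}
```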

Note: there is NO checksum field (unlike in IPV4).



Extension Headers

Hop-by-hop options (IPPROTO_HOPOPTS)

Routing packet header extension (IPPROTO_ROUTING)

Fragment packet header extension (IPPROTO_FRAGMENT)

ICMPV6 options (IPPROTO_ICMPV6)

No next header (IPPROTO_NONE)

Destination options (IPPROTO_DSTOPTS)

Mobility options (IPPROTO_MH)

Other Protocols (TCP,UDP,...)

See include/linux/in6.h

Some Next Header values denote headers which do not have a Next Header
field themselves, so they end the chain. For example: ICMPV6, TCP, UDP,
no next header (IPPROTO_NONE).
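
Walking the chain of extension headers can be sketched like this. It is a simplified user-space illustration, not the kernel parser: the names are made up, only the hop-by-hop, routing, destination options and fragment headers are followed, and every other value (including the mobility header) is treated as the end of the chain:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Protocol numbers, same values as the IPPROTO_* constants. */
enum { NH_HOPOPTS = 0, NH_ROUTING = 43, NH_FRAGMENT = 44, NH_DSTOPTS = 60 };

/* first_nh: Next Header value found in the fixed IPv6 header.
 * buf/len:  the bytes that follow the fixed header.
 * Returns the first Next Header value that is not a followed
 * extension header and stores its offset in *off,
 * or returns -1 if the buffer is truncated. */
int skip_ext_headers(uint8_t first_nh, const uint8_t *buf, size_t len,
                     size_t *off)
{
        uint8_t nh = first_nh;
        size_t pos = 0;

        assert(buf != NULL && off != NULL);
        for (;;) {
                size_t hdrlen;

                switch (nh) {
                case NH_HOPOPTS:
                case NH_ROUTING:
                case NH_DSTOPTS:
                        if (len - pos < 2)
                                return -1;
                        /* length is in 8-byte units, not counting the first 8 */
                        hdrlen = ((size_t)buf[pos + 1] + 1) * 8;
                        break;
                case NH_FRAGMENT:
                        hdrlen = 8;     /* fixed size */
                        break;
                default:
                        *off = pos;     /* TCP, UDP, ICMPV6, none, ... */
                        return nh;
                }
                if (len - pos < hdrlen)
                        return -1;
                nh = buf[pos];          /* first byte is the Next Header field */
                pos += hdrlen;
        }
}
```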

Extension Headers - contd.

All these protocols are registered by inet6_add_protocol()

If a host tries to parse an extension header which it does not
recognize, then an ICMP error will be sent, notifying about a
parameter problem (type: ICMPV6_PARAMPROB, code:
ICMPV6_UNK_NEXTHDR), and the packet will be dropped.

For example, in ip6_input_finish() (in net/ipv6/ip6_input.c)

// next header does not specify a registered protocol
...
icmpv6_send(skb, ICMPV6_PARAMPROB,
            ICMPV6_UNK_NEXTHDR, nhoff, skb->dev);

Extension Headers - contd

Router Alert is a subtype of Hop-by-hop option, and it tells


the router to process the packet besides forwarding it. It is
used in multicasting.

DHCPv6

The DHCPv6 client runs on UDP port 546.

The DHCPv6 server runs on UDP port 547.

Projects:

Dibbler:

http://klub.com.pl/dhcpv6/

Linux and windows

https://fedorahosted.org/dhcpv6/

Maintained by David Cantrell (Red Hat)

No mailing list...

WIDE-DHCPv6

originally developed in KAME project,

for BSD and Linux

DHCPV6 clients send SOLICIT requests in order to find


DHCPV6 servers.

DHCPV6 solicit messages are sent to the all-DHCPv6 servers
and relay agents multicast address (FF02::1:2).

DHCPv6 Servers reply with advertisements.



Socket API

By default, the port space is shared between IPV6 and
IPV4.

Simple example for creating TCP server with IPV6:


unsigned short port = 9999;
int sock;
int *newsock;
struct sockaddr_in6 server;
struct sockaddr_in6 from;
socklen_t fromlen;

sock = socket(AF_INET6, SOCK_STREAM, 0);
memset(&server, 0, sizeof(server));
server.sin6_family = AF_INET6;
server.sin6_addr = in6addr_any;

Socket API - contd.
server.sin6_port = htons(port);
bind(sock, (struct sockaddr *)&server, sizeof(server));
fromlen = sizeof(from);
if (listen(sock, 5) < 0)
        printf("error listening\n");
while (1)
{
        newsock = (int *)malloc(sizeof(int));
        *newsock = accept(sock, (struct sockaddr *)&from, &fromlen);
        ...
}

Socket API - contd.

When trying to run a similar application in IPV4 on the same
port simultaneously, you will get the following error:
binding socket error
bind: Address already in use

It will succeed if you set the IPV6_V6ONLY option on the IPV6
socket:
int on = 1;
if (setsockopt(sock, IPPROTO_IPV6, IPV6_V6ONLY,
               (char *)&on, sizeof(on)) == -1)
        perror("setsockopt IPV6_V6ONLY");

Receiving an IPV6 packet

ipv6_rcv() is the handler for IPV6 packet (net/ipv6/ip6_input.c)

Performs some sanity checks and then calls:

return NF_HOOK(PF_INET6, NF_IP6_PRE_ROUTING, skb,


dev, NULL, ip6_rcv_finish);

ip6_rcv_finish() performs a lookup in the routing subsystem


by calling ip6_route_input(skb) in order to construct skb->dst.

IPV6 address types

Unicast

The target is a single interface;

packet is delivered to a single interface.

Anycast (new ! Does not exist in IPV4).

The target is a set of interfaces;

packet is delivered to a single interface.

Multicast

The target is a set of interfaces;

packet is delivered to all the interfaces in this set.


ip6_mc_input() currently only verifies that the packet is
indeed for a multicast address of which the net device
is a member, and calls ip6_input()

The Multicast Routing patch (CONFIG_IPV6_MROUTE)


adds a call to ip6_mr_input().

net/ipv6/ip6_input.c

There is a user space daemon (pim6sd) which works in


conjunction with multicast routing.

Configuration file: /usr/local/etc/pim6sd.conf

pim6sd is part of mcast-tools



MLD

MLD - Multicast Listener Discovery

also known as Multicast Group Management.

MLD is similar to IGMP in IPV4 but uses ICMPv6 messages

MLD messages are sent via ICMPV6.

MLDV2: RFC 3810 (added filtering abilities).

The MLD is used by routers to discover the presence of


Multicast listeners.

MLDV2 is based on IGMPv3.



MLD - contd.

A host can belong to more than one multicast group.

The destination MAC address of an Ethernet frame which carries
an IPV6 multicast packet starts with 33:33

In IPV4, in multicast addresses, the first transmitted bit
(the group bit) of the Ethernet destination address is 1
(this covers half of the MAC address space!).
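
The mapping from an IPV6 multicast address to its Ethernet address can be sketched in user space: the MAC address is 33:33 followed by the last four bytes of the IPV6 address. The function name ipv6_mcast_to_mac() is made up for this illustration:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper. Returns 0 on success, -1 if text is not
 * a valid IPV6 multicast address. */
int ipv6_mcast_to_mac(const char *text, uint8_t mac[6])
{
        struct in6_addr a;

        assert(text != NULL && mac != NULL);
        if (inet_pton(AF_INET6, text, &a) != 1 || a.s6_addr[0] != 0xff)
                return -1;
        mac[0] = 0x33;
        mac[1] = 0x33;
        memcpy(&mac[2], &a.s6_addr[12], 4);     /* low 32 bits of the address */
        return 0;
}
```

For example, the all nodes address FF02::1 maps to 33:33:00:00:00:01.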

The hop limit is always 1 in MLD messages so that a router


will not forward them.

netstat -g -n : show IPv6/IPv4 Group Memberships.

ip -6 maddr show

mcjoin: a util for joining an IPv6 Multicast Group

http://www.benedikt-stockebrand.net/hacks_e.html

MLD - contd.

When a host boots, it first sends an MLDV2 message in
ICMPV6. This is a Type 143 message (ICMPV6 type), and it is a
Multicast Listener Discovery 2 Report Message. The report
message tells routers and multicast-aware switches that the
host wants to receive messages sent to the multicast address
of the group it joined. This message has a hop limit of 1, so
that it won't be forwarded outside. It is sent to FF02::16 (a
multicast address, which represents the all MLDv2-capable
routers multicast group).

MLD - contd.

addrconf_add_linklocal() calls ipv6_add_addr() which


eventually calls igmp6_group_added(), mld_newpack() ,
setting type to ICMPV6_MLD2_REPORT and send an ICMP
message.

(net/ipv6/mcast.c)

The source address of the MLD message can be a Link Local


address or the unspecified address (::)

When a host boots, it has a tentative address (until DAD is


finished) and it sends an MLD report message to join the
solicited node multicast group.

addrconf_join_solict() calls ipv6_dev_mc_inc() net/ipv6/addrconf.c
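
The solicited node multicast address is derived from the unicast address: the prefix FF02::1:FF00:0/104 followed by the low 24 bits of the unicast address. A minimal user-space sketch; the function name solicited_node() is made up for this illustration:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. Writes the solicited node multicast address
 * of the given unicast address into out.
 * Returns 0 on success, -1 on failure. */
int solicited_node(const char *unicast, char *out, size_t outlen)
{
        struct in6_addr u, m;

        assert(unicast != NULL && out != NULL);
        if (inet_pton(AF_INET6, unicast, &u) != 1)
                return -1;
        memset(&m, 0, sizeof(m));
        m.s6_addr[0] = 0xff;            /* FF02:0:0:0:0:1:FFxx:xxxx */
        m.s6_addr[1] = 0x02;
        m.s6_addr[11] = 0x01;
        m.s6_addr[12] = 0xff;
        memcpy(&m.s6_addr[13], &u.s6_addr[13], 3);      /* low 24 bits */
        if (inet_ntop(AF_INET6, &m, out, (socklen_t)outlen) == NULL)
                return -1;
        return 0;
}
```

For the link local address fe80::240:95ff:fe30:b0a3 this gives ff02::1:ff30:b0a3.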



MLD - contd.

In this case , the source address of the MLD messages is the


unspecified address (::)

"Change to Exclude" in MLDV2 report.

When a host leaves a group, it sends an MLDv2


ICMPV6_MGM_REDUCTION message

(igmp6_leave_group() in /net/ipv6/mcast.c)

This message is sent to the all routers multicast address


(note the difference against REPORT, which is sent to
FF02::16).

MLD sniff

MLD - contd.

A router will recognize this MLD message by the hop-by-hop


option in the extended header.

NEXTHDR_HOP in include/net/ipv6.h

RFC 2711 - IPv6 Router Alert Option.



MLD - contd.

There are two types of messages in MLDV2:

Query (130) (ICMPV6_MGM_QUERY)

In icmp6 header, icmp6_type = 130

Report (143) ICMPV6_MLD2_REPORT

In icmp6 header, icmp6_type = 143

Reports are sent by MLDV2 with destination


address of FF02::16 (FF02:0:0:0:0:0:0:16)

Network Namespaces

Two types of virtualization in the Linux Kernel

OS virtualization.

process/container virtualization. (Like solaris zones).

OS virtualization:

Xen

Kvm (hardware virtualization)

Lguest (Only 32 bit; there is a RedHat trial to


write 64 bit version).

Network Namespaces - contd.

OpenVZ project (http://openvz.org/).

Currently only for Linux (the FreeBSD port was


dropped).

There is a serious effort to integrate it into


mainline Linux kernel.

Many patches recently to netdev kernel mailing list.

struct net (include/net/net_namespace.h)

Adding support for namespaces in IPV4 is finished.

Currently work is being done on adding support


for namespaces in IPV6.

Packet Generator

Pktgen kernel module (Robert Olsson)

Works also with IPV6.



Bridging Subsystem

You can define a bridge and add NICs to it (enslaving


ports) using brctl (from bridge-utils).

You can have up to 1024 ports for every bridge device
(BR_MAX_PORTS).

Example:

brctl addbr mybr

brctl addif mybr eth0 #adding interface to a bridge

brctl show

Simple example

In this simple example, you can connect a PC to a bridge


without any configuration on the PC.
PC
Bridge
LAN

Bridging Subsystem-contd

There are devices which you cannot add to a bridge (by


addif); like another bridge or a loopback device or a tunnel
device or any other device which has no HW address.

You can add a tap device (but not a tun device) (?)

When a NIC is configured as a bridge port, the br_port


member of net_device is initialized.

(br_port is an instance of struct net_bridge_port).

When we receive a frame, netif_receive_skb() calls


handle_bridge().

Bridging Subsystem-contd

br_handle_frame() is invoked (net/bridge/br_input.c)

NF_HOOK(PF_BRIDGE, NF_BR_PRE_ROUTING, skb,


skb->dev, NULL, br_handle_frame_finish);

br_handle_frame_finish() checks the MAC destination of the


packet.

If the packet is for the local machine, we do not


forward the packet but call br_pass_frame_up().

br_pass_frame_up() calls:

NF_HOOK(PF_BRIDGE, NF_BR_LOCAL_IN,
skb, indev, NULL,netif_receive_skb);

Bridging Subsystem-contd

If the packet is not for the local machine, we forward the packet:

by br_forward() if the address is in the
forwarding DB.

by br_flood_forward() if the address is not in the
forwarding DB.

Bridging Subsystem-contd

The bridging forwarding database is searched for the

destination MAC address.

In case of a hit, the frame is sent to the bridge port with


br_forward() (net/bridge/br_forward.c).

If there is a miss, the frame is flooded on all

bridge ports using br_flood() (net/bridge/br_forward.c).

Note: this is not a broadcast !
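
The hit/miss decision can be sketched as a table lookup. This is an illustration only: the struct and function names are made up, and a linear table is used for brevity, which is not how the kernel stores its forwarding database:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FLOOD_ALL_PORTS (-1)    /* hypothetical marker for a miss */

/* Hypothetical forwarding database entry: MAC address -> bridge port. */
struct fdb_entry {
        uint8_t mac[6];
        int     port;
};

/* Returns the port on which to send the frame (hit),
 * or FLOOD_ALL_PORTS when the destination is unknown (miss). */
int fdb_output_port(const struct fdb_entry *fdb, size_t n,
                    const uint8_t dst[6])
{
        size_t i;

        assert(fdb != NULL || n == 0);
        for (i = 0; i < n; i++)
                if (memcmp(fdb[i].mac, dst, 6) == 0)
                        return fdb[i].port;
        return FLOOD_ALL_PORTS;
}
```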

The ebtables mechanism is the L2 parallel of L3 Netfilter.



Bridging Subsystem-contd

Ebtables enables us to filter and mangle packets


at the link layer (L2).

Tips for hacking

Documentation/networking/ip-sysctl.txt: networking kernel tunables

Example of reading a hex address:

iph->daddr == 0x0A00A8C0
means checking if the address is 192.168.0.10 (C0=192, A8=168, 00=0, 0A=10);
on a little-endian machine the bytes are read in reverse order.
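
The same decoding can be done in code. A minimal sketch, assuming the 32-bit value is the one seen on a little-endian machine (as in the example above), so that the lowest byte is the first octet; the function name daddr_to_octets() is made up for this illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: splits a daddr value, as read on a
 * little-endian machine, into its four dotted-quad octets. */
void daddr_to_octets(uint32_t daddr, unsigned int octets[4])
{
        assert(octets != NULL);
        octets[0] = daddr & 0xff;               /* C0 = 192 */
        octets[1] = (daddr >> 8) & 0xff;        /* A8 = 168 */
        octets[2] = (daddr >> 16) & 0xff;       /* 00 = 0   */
        octets[3] = (daddr >> 24) & 0xff;       /* 0A = 10  */
}
```

The comments show the values for 0x0A00A8C0.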

A shell script for converting an IP address to hex: (ipToHex.sh)

#!/bin/sh
IP_ADDR=$1
for I in $(echo ${IP_ADDR} | sed -e "s/\./ /g"); do
    printf '%02X' $I
done
echo
usage example: ./ipToHex.sh 192.168.0.1 => C0A80001

Tips for hacking - Contd.

Disable ping reply:

echo 1 >/proc/sys/net/ipv4/icmp_echo_ignore_all

Disable arp: ip link set eth0 arp off (the NOARP flag will be set)

Also ifconfig eth0 -arp has the same effect.

How can you get the Path MTU to a destination (PMTU)?

Use tracepath (see man tracepath).

Tracepath is from iputils.



Tips for hacking - Contd.

inet_addr_type() method: returns the address type; the input to this
method is the IP address. The return value can be RTN_LOCAL,
RTN_UNICAST, RTN_BROADCAST, RTN_MULTICAST etc.
See: net/ipv4/fib_frontend.c

Tips for hacking - Contd.

In case you want to send a packet from a user space application


through a specified device without altering any routing tables:
struct ifreq interface;
strncpy(interface.ifr_ifrn.ifrn_name, "eth1",IFNAMSIZ);
if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, (char
*)&interface, sizeof(interface)) < 0)
{
printf("error setting SO_BINDTODEVICE");
exit(1);
}

Tips for hacking - Contd.

Keep iphdr struct handy (printout): (from linux/ip.h)


struct iphdr {
__u8 ihl:4,
version:4;
__u8 tos;
__be16 tot_len;
__be16 id;
__be16 frag_off;
__u8 ttl;
__u8 protocol;
__sum16 check;
__be32 saddr;
__be32 daddr;
/*The options start here. */
};

Tips for hacking - Contd.

NIPQUAD(): macro for printing IPv4 addresses in dotted-decimal format

Printing mac address (from net_device):


printk("sk_buff->dev =%02x:%02x:%02x:%02x:%02x:%02x\n",
((skb)->dev)->dev_addr[0], ((skb)->dev)->dev_addr[1],
((skb)->dev)->dev_addr[2],((skb)->dev)->dev_addr[3],
((skb)->dev)->dev_addr[4], ((skb)->dev)->dev_addr[5]);

Printing IP address (primary_key) of a neighbour (in hex format):


printk("neigh->primary_key =%02x.%02x.%02x.%02x\n",
neigh->primary_key[0], neigh->primary_key[1],
neigh->primary_key[2],neigh->primary_key[3]);

Tips for hacking - Contd.

Or:
printk("***neigh->primary_key= %u.%u.%u.%u\n",
NIPQUAD(*(u32*)neigh->primary_key));

CONFIG_NET_DMA is for TCP/IP offload.

When you encounter xfrm / CONFIG_XFRM: this has to do with
IPSEC (transformers).

Tips for hacking - Contd.

Showing arp statistics by:

cat /proc/net/stat/arp_cache
entries allocs destroys hash_grows lookups hits res_failed rcv_probes_mcast rcv_probes_ucast periodic_gc_runs forced_gc_runs
periodic_gc_runs: statistics of how many times the
neigh_periodic_timer() is called.

Links and more info

IPV6 howto (Peter Bieringer) :


http://www.ibiblio.org/pub/Linux/docs/HOWTO/other-formats/pdf/Linux+IPv6-HOWTO.pdf

USAGI Project - Linux IPv6 Development Project

http://www.linux-ipv6.org/

Porting applications to IPv6 HowTo BY Eva M. Castro:

http://gsyc.es/~eva/IPv6-web/ipv6.html

RFC 3493: Basic Socket Interface Extensions for IPv6.

RFC 3542: Advanced Sockets Application Program Interface


(API) for IPv6.

Links and more info

Books:

IPv6 Essentials, Second Edition (OReilly)

A book By Silvia Hagen

Second Edition May 2006

Pages: 436

ISBN 10: 0-596-10058-2 | ISBN 13: 9780596100582



Links and more info

IPv6 in Practice: A Unixer's Guide to the Next Generation


Internet

by Benedikt Stockebrand (Author) ; Springer; 1 edition,


2006.

Talks about the implementation of IPv6 in Linux, Solaris, BSD.

http://www.benedikt-stockebrand.net/books_e.html

Links and more info
1) IPv6 Advanced Protocols Implementation (2007)
2) IPv6 Core Protocols Implementation (2006)

Both books were written by Qing Li, Tatuya Jinmei and Keiichi
Shima
- published by Morgan Kaufmann Series in Networking.

Both books discuss the Kame implementation of


IPV6. (in BSD).

Links and more info

IPv6 Information Page!

http://www.ipv6.org/

What's up in the Linux IPv6 Stack

Lecture slides by Hideaki YOSHIFUJI from lca2008.

Keio University

USAGI/WIDE Project

http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/131-200801-LCA2008-LinuxIPv6.pdf

Html: http://www.linux-ipv6.org/materials/200801-LCA2008/

Links and more info
Linux Network Stack Walkthrough (2.4.20):
http://gicl.cs.drexel.edu/people/sevy/network/Linux_network_stack_walkthrough.html
Understanding the Linux Kernel, Second Edition
By Daniel P. Bovet, Marco Cesati
Second Edition December 2002
chapter 18: networking.
- Understanding Linux Network Internals, Christian benvenuti
Oreilly , First Edition.

Links and more info
Linux Device Driver, by Jonathan Corbet, Alessandro Rubini, Greg
Kroah-Hartman
Third Edition February 2005.

Chapter 17, Network Drivers


Linux networking: (a lot of docs about specific networking topics)

http://www.linux-foundation.org/en/Net:Main_Page

netdev mailing list: http://www.spinics.net/lists/netdev/



Links and more info
Removal of multipath routing cache from kernel code:
http://lists.openwall.net/netdev/2007/03/12/76
http://lwn.net/Articles/241465/
Linux Advanced Routing & Traffic Control :
http://lartc.org/
ebtables - a filtering tool for a bridge:
http://ebtables.sourceforge.net/

Links and more info
Writing Network Device Driver for Linux: (article)

http://app.linux.org.mt/article/writing-netdrivers?locale=en

Links and more info
Netconf a yearly networking conference; first was in 2004.

http://vger.kernel.org/netconf2004.html

http://vger.kernel.org/netconf2005.html

http://vger.kernel.org/netconf2006.html

Next one: Linux Conf Australia, January 2008,Melbourne

David S. Miller, James Morris , Rusty Russell , Jamal Hadi Salim ,Stephen
Hemminger , Harald Welte, Hideaki YOSHIFUJI, Herbert Xu ,Thomas Graf ,Robert
Olsson ,Arnaldo Carvalho de Melo and others

Links and more info
Policy Routing With Linux - Online Book Edition

by Matthew G. Marsh (Sams).

http://www.policyrouting.org/PolicyRoutingBook/
THRASH - A dynamic LC-trie and hash data structure:
Robert Olsson Stefan Nilsson, August 2006
http://www.csc.kth.se/~snilsson/public/papers/trash/trash.pdf
IPSec howto:
http://www.ipsec-howto.org/t1.html

Links and more info
Openswan: Building and Integrating Virtual Private Networks ,
by Paul Wouters, Ken Bantoft
http://www.packtpub.com/book/openswan/mid/061205jqdnh2by
publisher: Packt Publishing.
a book including chapters about LVS:
The Linux Enterprise Cluster- Build a Highly Available Cluster
with Commodity Hardware and Free Software, By Karl
Kopper.
http://www.nostarch.com/frameset.php?startat=cluster
http://www.vyatta.com - Open-Source Networking


Links and more info
Address Resolution Protocol (ARP)

http://linux-ip.net/html/ether-arp.html
ARPWatch - a tool for monitoring incoming ARP traffic.
Lawrence Berkeley National Laboratory -
ftp://ftp.ee.lbl.gov/arpwatch.tar.gz.
arptables:
http://ebtables.sourceforge.net/download.html
TCP/IP Illustrated, Volume 1: The Protocols
By W. Richard Stevens
http://www.informit.com/store/product.aspx?isbn=0201633469

Links and more info
Unix Network Programming, Volume 1: The Sockets Networking
API (3rd Edition) (Addison-Wesley Professional Computing Series)
(Hardcover)
by W. Richard Stevens (Author), Bill Fenner (Author), Andrew M.
Rudoff (Author)
Linux Ethernet Bridging mailing list:
http://www.spinics.net/lists/linux-ethernet-bridging/

Questions

Questions ?

Thank You !

Linux Wireless -
Linux Kernel Networking (4)-
advanced topics
Rami Rosen
[email protected]
Haifux, March 2009
www.haifux.org

Linux Kernel Networking (4)-
advanced topics

Note:

This lecture is a sequel to the following 3
lectures I gave:
1) Linux Kernel Networking lecture

http://www.haifux.org/lectures/172/

slides:http://www.haifux.org/lectures/172/netLec.pdf
2) Advanced Linux Kernel Networking -
Neighboring Subsystem and IPSec lecture

http://www.haifux.org/lectures/180/

slides:http://www.haifux.org/lectures/180/netLec2.pdf

Linux Kernel Networking (4)-
advanced topics
3) Advanced Linux Kernel Networking -
IPv6 in the Linux Kernel lecture

http://www.haifux.org/lectures/187/

Slides: http://www.haifux.org/lectures/187/netLec3.pdf

Contents:

General.

IEEE 802.11 specs.

SoftMAC and FullMAC; mac80211.

Modes: (802.11 Topologies)

Infrastructure mode.

Association.

Scanning.

Hostapd

Power save in Infrastructure mode.

IBSS (Ad Hoc mode).

Mesh mode (802.11s).


802.11 Physical Modes.

Appendix: mac80211- implementation details.

Tips.

Glossary.

Links.

Images

Beacon filter: Wireshark sniff.

edimax router user manual page (BR-6504N).

Note: we will not deal with security/encryption,
regulation, or fragmentation in the linux wireless
stack, nor with tools (NetworkManager,
kwifimanager, etc.) or with billing (Radius,
etc.).

You might find help on these topics in two Haifux lectures:

Wireless management (WiFi (802.11) in GNU/Linux by Ohad
Lutzky):

http://www.haifux.org/lectures/138/

Wireless security (Firewall Piercing, by Alon Altman):

http://www.haifux.org/lectures/124/

Note: We will not delve into hardware features.



General

Wireless networks market grows constantly

Two items from a recent month's newspaper:
(ynet.co.il)

Over 12,000 wireless hotel rooms in Israel.

Over 50,000 wireless networks in Europe.

In the late nineties there were discussions in
IEEE committees regarding the 802.11 protocol.

1999: The first spec (about 500 pages).

(see no 1 in the list of links below).

2007: A second spec (1232 pages); and there
were some amendments since then.

SoftMAC and FullMAC

In 2000-2001, the market came to abound with
laptops with wireless nics.

It was important to produce wireless driver and
wireless stack Linux solutions in time.

The goal was then, as Jeff Garzik (the previous
wireless Maintainer) put it: "They just want their
hardware to work...".

mac80211 - new Linux softmac layer.

(formerly called d80211 of Devicescape)

Current mac80211 maintainer: Johannes Berg
from sipsolutions.

Mac80211 merged into Kernel mainstream
(upstream) starting 2.6.22, July 2007.

Drivers were adjusted to use mac80211
afterwards.

Devicescape is a wireless networking company.

http://devicescape.com/pub/home.do

Location in the kernel tree: net/mac80211.

A kernel module named mac80211.ko.

Most wireless drivers were ported to use
mac80211.

There are a few exceptions, though.

Libertas (Marvell) does not work with
mac80211.

libertas_tf (Marvell) uses thin firmware; so it
does use mac80211.

libertas_tf supports Access Point and Mesh Point.

Both are in the OLPC project.

When starting development of a new driver,
chances are that it will use the mac80211 API.

Modes: Infrastructure BSS

Classic ESS (Extended Service Set)
ESS = two or more BSSs.

What is an Access Point ?

Edimax MIMO nMax BR-6504n Router

Linksys WRT54GL 54Mbps Router

NOTE: Infrastructure BSS != IBSS

IBSS = Independent BSS. (Ad-Hoc mode)

Access Point: A wireless device acting in master mode with some hw enhancements and a management software (like hostapd).

A wireless device in master mode cannot scan (as opposed to other modes).

Also a wireless device in monitor mode cannot scan.

Master Mode is one of 7 modes in which a wireless card can be configured.

All stations must authenticate and associate with the Access Point prior to communicating.

Stations sometimes perform scanning prior to authentication and association in order to get details about the Access Point (like mac address, essid, and more).

Scanning

Scanning can be:

Active (send broadcast Probe request) scanning.

Passive (Listening for beacons) scanning.

Some drivers support passive scanning. (see the IEEE80211_CHAN_PASSIVE_SCAN flag).

Passive scanning is needed in some higher 802.11A frequency bands, as you're not allowed to transmit anything at all until you've heard an AP beacon.

Scanning with "iwlist wlan0 scan" is in fact sending an IOCTL (SIOCSIWSCAN).

Scanning-contd.

It is handled by ieee80211_ioctl_siwscan().

This is part of the Wireless-Extensions mechanism (aka WE).

Also other operations like setting the mode to Ad-Hoc or Managed can be done via IOCTLs.

The Wireless Extensions module; see: net/mac80211/wext.c

Eventually, the scanning starts by calling the ieee80211_sta_start_scan() method, in net/mac80211/mlme.c.

MLME = MAC Layer Management Entity.



Scanning-contd.

Active Scanning is performed by sending Probe Requests on all the channels which are supported by the station.

One station in each BSS will respond to a Probe Request.

That station is the one which transmitted the last beacon in that BSS.

In infrastructure BSS, this station is the Access Point.

Simply because there are no other stations in the BSS which send beacons.

In IBSS, the station which sent the last beacon can change over time.

Scanning-contd.

You can also sometimes scan for a specific BSS:

iwlist wlan1 scan essid homeNet.

Also in this case, a broadcast is sent.

(sometimes, this will return homeNet1 also and homeNet2).

Example of scan results
iwlist wlan2 scan
wlan2     Scan completed :
          Cell 01 - Address: 00:16:E3:F0:FB:39
                    ESSID:"SIEMENS-F0FB39"
                    Mode:Master
                    Channel:6
                    Frequency:2.437 GHz (Channel 6)
                    Quality=5/100  Signal level:25/100
                    Encryption key:on
                    IE: Unknown: 000E5349454D454E532D463046423339
                    IE: Unknown: 010882848B962430486C
                    IE: Unknown: 030106
                    IE: Unknown: 2A0100
                    IE: Unknown: 32040C121860
                    IE: Unknown: DD06001018020000
                    Bit Rates:1 Mb/s; 2 Mb/s; 5.5 Mb/s; 11 Mb/s; 18 Mb/s
                              24 Mb/s; 36 Mb/s; 54 Mb/s; 6 Mb/s; 9 Mb/s
                              12 Mb/s; 48 Mb/s
                    Extra:tsf=00000063cbf32479
                    Extra: Last beacon: 470ms ago
          Cell 02 - Address: 00:13:46:73:D4:F1
                    ESSID:"D-Link"
                    Mode:Master
                    Channel:6
                    Frequency:2.437 GHz (Channel 6)

Authentication

Open-system authentication (WLAN_AUTH_OPEN) is the only mandatory authentication method required by 802.11.

The AP does not check the identity of the station.

Authentication Algorithm Identification = 0.

Authentication frames are management frames.



Association

At a given moment, a station may be associated with no more than one AP.

A Station ("STA") can select a BSS and authenticate and associate to it.

(In Ad-Hoc: authentication is not defined).



Association-contd.

Trying this:

iwconfig wlan0 essid AP1 ap macAddress1

iwconfig wlan0 essid AP2 ap macAddress2

Will cause first associating to AP1, and then disassociating from AP1 and associating to AP2.

The AP will not receive any data frames from a station before it is associated with the AP.

Association-contd.

An Access Point which receives an association request will check whether the mobile station parameters match the Access Point parameters.

These parameters are SSID, Supported Rates and capability information. The Access Point also defines a Listen Interval.

When a station associates to an Access Point, it gets an ASSOCIATION ID (AID) in the range 1-2007.

Association-contd.

Trying unsuccessfully to associate more than 3 times results in this message in the kernel log:

"apDeviceName: association with AP apMacAddress timed out" and the state is changed to IEEE80211_STA_MLME_DISABLED.

Also, if the station does not match the security requirements, the state becomes IEEE80211_STA_MLME_DISABLED.

Hostapd

hostapd is a user space daemon implementing access point functionality (and authentication servers). It supports Linux and FreeBSD.

http://hostap.epitest.fi/hostapd/

Developed by Jouni Malinen.

hostapd.conf is the configuration file.

Example of a very simple hostapd.conf file:

interface=wlan0
driver=nl80211
hw_mode=g
channel=1
ssid=homeNet

Hostapd-cont.

Launching hostapd:

./hostapd hostapd.conf

(add -dd for getting more verbose debug messages)

Certain devices, which support Master Mode, can be operated as Access Points by running the hostapd daemon.

Hostapd implements part of the MLME AP code which is not in the kernel and probably will not be in the near future.

For example: handling association requests which are received from wireless clients.

Hostapd-cont.

Hostapd uses the nl80211 API (netlink socket based, as opposed to ioctl based).

Hostapd-cont.

The hostapd starts the device in monitor mode:

drv->monitor_ifidx =
    nl80211_create_iface(drv, buf, NL80211_IFTYPE_MONITOR, NULL);

The hostapd opens a raw socket with this device:

drv->monitor_sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

(hostapd/driver_nl80211.c)

The packets which arrive at this socket are handled by the AP.

Receiving in monitor mode means that a special header (RADIOTAP) is added to the received packet.

The hostapd changes management and control packets.

The packet is sent by the sendmsg() system call:

sendmsg(drv->monitor_sock, &msg, flags);



Hostapd-cont.

This means sending directly from the raw socket (PF_PACKET) and putting on the transmit queue (by dev_queue_xmit()), without going through the 802.11 stack and without the driver.

When the packet is transmitted, an "INJECTED" flag is added. This tells the other side, which will receive the packet, to remove the radiotap header. (IEEE80211_TX_CTL_INJECTED)

Hostapd-cont.

Hostapd manages:

Association/Disassociation requests.

Authentication/deauthentication requests.

Hostapd keeps an array of stations; when an association request of a new station arrives at the AP, a new station is added to this array.

Hostapd-cont.

There are three types of IEEE80211 packets:

The type and subtype of the packet are represented by the frame control field in the 802.11 header.

Management (IEEE80211_FTYPE_MGMT)

Each management frame contains information elements (IEs). For example, a beacon has the ssid (network name), ESS/IBSS bits (10=AP, 01=IBSS), and more.

(WLAN_CAPABILITY_ESS/WLAN_CAPABILITY_IBSS in ieee80211.h.)

There are 47 types of information elements (IEs) in the current implementation.

All in /include/linux/ieee80211.h.

Association and Authentication are management packets.

Beacons are also management frames.

IEEE80211_STYPE_BEACON

Hostapd-cont.

Control (IEEE80211_FTYPE_CTL)

For example, PSPOLL

IEEE80211_STYPE_PSPOLL

Also ACK, RTS/CTS.

Data (IEEE80211_FTYPE_DATA)

See: include/linux/ieee80211.h

The hostapd daemon sends special management packets called beacons (Access Points usually send 10 beacons in a second; this can be configured (see the router manual page at the bottom)).

The area in which these beacons appear defines the basic service area.

From /net/mac80211/rx.c (with remarks)
* IEEE 802.11 address fields:
ToDS  FromDS  Addr1  Addr2  Addr3  Addr4
  0     0     DA     SA     BSSID  n/a    AdHoc
  0     1     DA     BSSID  SA     n/a    Infra (From AP)
  1     0     BSSID  SA     DA     n/a    To AP (Infra)
  1     1     RA     TA     DA     SA     WDS (Bridge)
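
The table above can be turned into a small lookup. The following is only a userspace sketch, not code from mac80211: the names FCTL_TODS, FCTL_FROMDS, struct addr_roles and addr_roles() are made up for illustration, although the two bit values are the ones the kernel uses for IEEE80211_FCTL_TODS and IEEE80211_FCTL_FROMDS.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Frame control bits; same values as IEEE80211_FCTL_TODS / IEEE80211_FCTL_FROMDS. */
#define FCTL_TODS   0x0100
#define FCTL_FROMDS 0x0200

/* Hypothetical helper type: the meaning of each address field for one frame. */
struct addr_roles {
    const char *addr1, *addr2, *addr3, *addr4, *usage;
};

/* Map the ToDS/FromDS bits of a frame control word to the address roles. */
static struct addr_roles addr_roles(uint16_t fc)
{
    static const struct addr_roles table[4] = {
        { "DA",    "SA",    "BSSID", "n/a", "AdHoc" },           /* ToDS=0 FromDS=0 */
        { "DA",    "BSSID", "SA",    "n/a", "Infra (From AP)" }, /* ToDS=0 FromDS=1 */
        { "BSSID", "SA",    "DA",    "n/a", "To AP (Infra)" },   /* ToDS=1 FromDS=0 */
        { "RA",    "TA",    "DA",    "SA",  "WDS (Bridge)" },    /* ToDS=1 FromDS=1 */
    };
    unsigned tods   = (fc & FCTL_TODS)   ? 1 : 0;
    unsigned fromds = (fc & FCTL_FROMDS) ? 1 : 0;

    return table[(tods << 1) | fromds];
}
```

For a frame sent by a station to its AP (ToDS set, FromDS clear), addr_roles(0x0100).addr1 is "BSSID", which matches the third row of the table.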

My laptop as an access point

There is an Israeli Start Up company which develops free access point Windows sw which enables your laptop to be an access point.

http://www.bzeek.com/static/index.html

Currently it is for Intel PRO/Wireless 3945.

In the future: Intel PRO/Wireless 4965.



Power Save in Infrastructure Mode

Power Save is a hot subject.

Intel linux Power Save site:

http://www.lesswatts.org/

PowerTOP util:

PowerTOP is a tool that helps you find which software is using the most power.

Power Save in Infrastructure Mode - cont

Usual case (Infrastructure BSS).

[Diagram: STA1 and STA2 connected to an AP]

Mobile devices are usually battery powered most of the time.

A station may be in one of two different modes:

Awake (fully powered)

Asleep (also termed "dozed" in the specs)

Access points never enter power save mode and do not transmit Null packets.

In power save mode, the station is not able to transmit or receive and consumes very low power.

Until recently, power management worked only with devices which handled power save in firmware.

From time to time, a station enters power save mode.

This is done by:

firmware, or

by using the mac80211 API

Dynamic power management patches were recently sent by Kalle Valo (Nokia).

How do we initiate power save?

iwconfig wlan0 power timeout 7

Sets the timeout to 7 seconds.

Note: this can be done only with the beta version of Wireless Tools (version 30 pre-release (beta)):

http://www.hpl.hp.com/personal/Jean_Tourrilhes/Linux/Tools.html

In case the firmware has support for power save, drivers can disable this feature by setting the IEEE80211_HW_NO_STACK_DYNAMIC_PS flag in the driver configuration.

The Access Point is notified about it by a null frame which is sent from the client (which calls ieee80211_send_nullfunc()). The PM bit is set in this packet (Power Management).

When STA2 is in power saving mode:

The AP has two buffers (each a doubly linked list of sk_buff structures, sk_buff_head):

For unicast frames (ps_tx_buf in sta; one queue for each station).

For multicast/broadcast frames (ps_bc_buf, one for the AP).

Each AP has an array of its associated stations inside (sta_info objects). Each one has a ps_tx_buf queue inside (for unicasts); the AP has ps_bc_buf (for multicast/broadcasts).

[Diagram: STA1 and STA2 connected to an AP; each STA_INFO holds ps_tx_buf, the AP holds ps_bc_buf]

The size of ps_tx_buf and of ps_bc_buf is 128 packets

#define STA_MAX_TX_BUFFER 128 in net/mac80211/sta_info.h

#define AP_MAX_BC_BUFFER 128 in net/mac80211/ieee80211_i.h

Adding to the queue: done by skb_queue_tail().

There is however, a common counter (total_ps_buffered) which sums both buffered unicasts and multicasts.

When a station enters PS mode it turns off its RF. From time to time it turns the RF on, but only for receiving beacons.

When buffering in the AP, every packet (unicast and multicast) is saved in the corresponding queue.

The only exception is when strict ordering between unicast and multicast is enforced. This is a service which the MAC layer supplies. However, it is rarely in use.

From net/mac80211/tx.c:
ieee80211_tx_h_multicast_ps_buf() {
    ...
    /* no buffering for ordered frames */
    if (ieee80211_has_order(hdr->frame_control))
        return TX_CONTINUE;

The AP sends a TIM (Traffic Indication Map) with each beacon.

Beacons are sent periodically from the AP.

TIM[i]=1 => The AP has buffered traffic for a station with Association ID=i.

In fact, a partial virtual bitmap is sent, which is a smaller data structure in most cases.

The STA sends a PS-POLL packet (Power Saving Poll) to tell the AP that it is awake.

The AP sends the buffered frame.
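
To make the TIM idea concrete, here is a simplified userspace sketch of a traffic indication bitmap indexed by Association ID. It is not the mac80211 implementation: struct tim_bitmap, tim_set(), tim_test() and tim_partial() are hypothetical names, and the partial bitmap computed here only trims leading and trailing zero octets (keeping the starting offset even), which is an approximation of what the spec describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_AID 2007   /* association IDs are in the range 1-2007 */

/* One bit per AID; bit i set means "the AP has buffered traffic for AID i". */
struct tim_bitmap {
    uint8_t bits[MAX_AID / 8 + 1];   /* 251 octets */
};

static void tim_set(struct tim_bitmap *t, unsigned aid, int buffered)
{
    assert(aid >= 1 && aid <= MAX_AID);
    if (buffered)
        t->bits[aid / 8] |= (uint8_t)(1u << (aid % 8));
    else
        t->bits[aid / 8] &= (uint8_t)~(1u << (aid % 8));
}

static int tim_test(const struct tim_bitmap *t, unsigned aid)
{
    assert(aid >= 1 && aid <= MAX_AID);
    return (t->bits[aid / 8] >> (aid % 8)) & 1;
}

/*
 * Simplified "partial virtual bitmap": returns the number of octets that
 * would be sent and stores the index of the first one in *first.
 * An all-zero bitmap is sent as a single zero octet.
 */
static size_t tim_partial(const struct tim_bitmap *t, size_t *first)
{
    size_t lo = 0, hi = 0;
    int found = 0;

    for (size_t i = 0; i < sizeof(t->bits); i++) {
        if (t->bits[i]) {
            if (!found) {
                lo = i;
                found = 1;
            }
            hi = i;
        }
    }
    if (!found) {
        *first = 0;
        return 1;
    }
    lo &= ~(size_t)1;   /* keep the starting offset even */
    *first = lo;
    return hi - lo + 1;
}
```

With only AID 100 marked as having buffered traffic, the partial bitmap is a single octet starting at offset 12 instead of the full 251 octets, which is why the partial form is usually much smaller.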



pspoll diagram

IBSS Mode

IBSS - without an access point.

IBSS Mode - contd

An IBSS network is often formed without pre-planning, for only as long as the LAN is needed.

This type of operation is often referred to as an Ad Hoc network.

Also sometimes called "Peer To Peer" network.

Creating an Ad-Hoc network:

iwconfig wlan0 mode ad-hoc

(note: if the nic is running, you should run before this: ifconfig wlan0 down)

iwconfig wlan0 essid myEssid

The essid has to be distributed manually (or otherwise) to everyone who wishes to connect to the Ad-Hoc network.

The BSSID is a random MAC address.

(in fact, 46 bits of it are random).

"iwconfig wlan0 essid myEssid" triggers ibss creation by calling ieee80211_sta_create_ibss()

net/mac80211/mlme.c

Joining an IBSS:

All members of the IBSS participate in beacon generation.

The members are synchronized (TSF).

The beacon interval within an IBSS is established by the STA that instantiates the IBSS.

ieee80211_sta_create_ibss() (mlme.c)

The bssid of the ibss is a random address (based on mixing get_random_bytes() and the MAC address).

Mesh Mode (80211s)
Full Mesh: In the full mesh topology, each node is connected directly to each of the others.

Mesh Mode (80211s)
Partial Mesh: nodes are connected to only some, not all.

802.11s (Mesh)

802.11s started as a Study Group of IEEE 802.11 in September 2003, and became a TG (Task Group) in 2004. (name: TGs)

In 2006, two proposals, out of 15, (the "SEE-Mesh" and "Wi-Mesh" proposals) were merged into one. This is draft D0.01.

Wireless Mesh Networks are also called WMN.

Wireless mesh networks forward data packets over multiple wireless hops. Each mesh node acts as a relay point/router for other mesh nodes.

In 2.6.26, the network stack added support for the draft of wireless mesh networking (802.11s), thanks to the open80211s project (http://www.open80211s.org/).

There is still no final spec.

There are currently five drivers in linux with support for mesh networking (ath5k, b43, libertas_tf, p54, zd1211rw), and one is under development (rt2x00).

Open80211s

Goal: To create the first open implementation of 802.11s.

Sponsors:

OLPC project.

Cozybit (http://www.cozybit.com/), the company that developed the mesh software on the OLPC Laptop.

Luis Carlos Cobo and Javier Cardona (both from Cozybit) developed the Linux mac80211 mesh code.

Nortel

802.11s defines a default routing protocol called HWMP (Hybrid Wireless Mesh Protocol)

Based on: Ad Hoc Demand Distance Vector (AODV) routing (C. Perkins); rfc3561.

The HWMP protocol works with layer 2 (MAC addresses).

The 80211 header was extended:

A ttl field was added to avoid loops.

The current implementation uses on demand path selection.

The draft also talks about proactive path selection.

This is not implemented yet in the Linux Kernel.

Uses Root Announcement (RANN) messages and Mesh Portal as a root.

As with IPV4 static routes, you can force a specific next hop for a mesh station (MESH_PATH_FIXED flag)

(mesh_path_fix_nexthop() in mesh_pathtbl.c)

Every station is called an MP (Mesh Point).

MPP is a Mesh Portal. (For example, when an MP is used to connect to an external network, like the Internet).

Each station holds a routing table (struct mesh_table), which helps to decide which route to take.

In the initial state, when a packet is sent to another station, there is first a lookup in the mesh table; there is no hit, so a PREQ (Path Request) is sent as a broadcast.

When the PREQ is received on any station except the final destination, it is forwarded.

When the PREQ is received on the final station, a PREP (Path Reply) is sent.

If there is some failure on the way, a PERR (Path Error) is sent.

Handled by mesh_path_error_tx(), mesh_hwmp.c

The route takes into consideration an airtime metric

Calculated in airtime_link_metric_get() (based on rate and other hw parameters).

POWER SAVING in the MESH spec is optional.

Advantages:

Rapid deployment.

Minimal configuration; inexpensive.

Easy to deploy in hard-to-wire environments.

Disadvantage:

Many broadcasts limit network performance

You can set a wireless device to work in mesh mode only with the iw command (you cannot perform this with the wireless tools).

Example: setting a wireless nic to work in mesh mode:

iw dev wlan1 interface add mesh type mp mesh_id 1

(type = mp => Mesh Point)



802.11 Physical Modes

802.11 (WiFi) is a set of standards for wireless networking, which were defined in 1997 but started to become popular in the market around 2001.

802.11a (1999) at 5 GHz, 54Mbit maximum speed; range about 30m.

802.11b (1999) at 2.4GHz, 11Mbit maximum speed, range about 30m.

802.11g (2003) at 2.4GHz, 54Mbit maximum speed, range about 30m.

802.11n (2008) at 2.4GHz/5GHz, 200 Mbit (typical), range about 50m.

is planned to support up to about 540Mbit/600 Mbit.

Improves the previous 802.11 standards by adding multiple-input multiple-output (MIMO)

multiple antennas.

High Throughput (HT).

Uses packet aggregation

The ability to send several packets together at one time.

Still is considered a proposal.

Expected to be approved only in December 2009 or later.

iwlagn and ath9k are the only drivers that support 802.11n in the Linux kernel at the moment.

Tip: how can I know whether my wireless nic supports 802.11n?

Run: iwconfig

You should see: "IEEE 802.11abgn" or somesuch.



Appendix: mac80211 implementation details

BSSID = Basic Service Set Identification.

Each BSS has a BSSID.

A BSSID is a 48 bit number (like a MAC address).

This avoids getting broadcasts from other networks which may be physically overlapping.

In infrastructure BSS, the BSSID is the MAC address of the Access Point which created the BSS.

In IBSS, the BSSID is generated from calling a random function (generating 46 random bits; the other 2 are fixed).
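
The "46 random bits, 2 fixed bits" rule can be sketched in a few lines. This is a userspace illustration under stated assumptions, not the kernel code: the function name ibss_bssid_from_random() is hypothetical and the caller is assumed to supply the six random octets (the kernel mixes get_random_bytes() with the MAC address). The two fixed bits are the group (multicast) bit, which is cleared, and the locally administered bit, which is set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build an IBSS BSSID from six caller-supplied random octets:
 * 46 bits stay random, the two low bits of the first octet are fixed.
 */
static void ibss_bssid_from_random(const uint8_t rnd[6], uint8_t bssid[6])
{
    memcpy(bssid, rnd, 6);
    bssid[0] &= (uint8_t)~0x01u;  /* clear the group (multicast) bit */
    bssid[0] |= 0x02u;            /* set the locally administered bit */
}
```

Whatever the random input is, the resulting first octet always has its two lowest bits equal to binary 10, so the BSSID can never collide with a globally assigned unicast MAC address.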

Modes of operation

A wireless interface always operates in one of the following modes:

Infrastructure mode: with an AccessPoint (AP)

The access point holds a list of associated stations.

(also called managed)

IBSS (Independent BSS, Ad-Hoc) mode

When using ad-hoc, an access point is not needed.

Monitor mode

WDS (Wireless Distribution System)

Modes of operation - contd.

Wireless Distribution System (WDS) - allows access points to talk to other access points.

Mesh

see: include/linux/nl80211.h:
enum nl80211_iftype {
    NL80211_IFTYPE_UNSPECIFIED,
    NL80211_IFTYPE_ADHOC,
    NL80211_IFTYPE_STATION,
    NL80211_IFTYPE_AP,
    NL80211_IFTYPE_AP_VLAN,
    NL80211_IFTYPE_WDS,
    NL80211_IFTYPE_MONITOR,
    NL80211_IFTYPE_MESH_POINT,
}

cfg80211 and nl80211

Wireless-Extensions has a new replacement;

It is cfg80211 and nl80211 (a message-based mechanism, using the netlink interface).

iw uses the nl80211 interface.

You can compare it to the old ioctl-based net-tools versus the new rtnetlink IPROUTE2 set of tools.

You cannot set master mode with iw.

You cannot change the channel with iw.


Wireless git trees:

Wireless-testing

Was started on February 14, 2008 by John Linville.

primary development target.

the bleeding edge Linux wireless developments.

wireless-next-2.6

Wireless-2.6

Daily compat-wireless tar ball in:

http://www.orbit-lab.org/kernel/compat-wireless-2.6/

The compat-wireless tar ball includes only part of the kernel

(Essentially it includes wireless drivers and wireless stack)

Fedora kernels are usually up-to-date with the wireless-testing git tree.

There is usually at least one pull request (or more) in a week, to the netdev mailing list (main Linux kernel networking mailing list).

The Maintainer of the wireless (802.11) in the Linux kernel is John Linville (RedHat), starting from January 2006.

A little help for delving into the mac80211 code.

Important data structures:

struct ieee80211_hw - represents hardware information and state (include/net/mac80211.h).

Important member: void *priv (pointer to private area).

Most drivers define a struct for this private area, like lbtf_private (Marvell) or iwl_priv (iwlwifi of Intel) or mac80211_hwsim_data in mac80211_hwsim.

Every driver allocates it by ieee80211_alloc_hw()

A pointer to ieee80211_ops (see later) is passed as a parameter to ieee80211_alloc_hw().

Every driver calls ieee80211_register_hw() to create wlan0 and wmaster0 and for various initializations.

You set the machine mode prior to calling ieee80211_register_hw() by assigning flags to the interface_modes member of wiphy

wiphy itself is a member of the ieee80211_hw structure.

For example,
hw->wiphy->interface_modes =
    BIT(NL80211_IFTYPE_STATION) |
    BIT(NL80211_IFTYPE_AP);

This declares that the device supports the Station and Access Point modes.
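
The interface_modes field is a plain bitmask with one bit per interface type. The sketch below reproduces the idea in userspace so it can be compiled on its own; the IFTYPE_* enum mirrors the order of enum nl80211_iftype shown earlier but uses shortened, hypothetical names, and mode_supported() is a made-up helper, not a cfg80211 function.

```c
#include <assert.h>

/* Shortened stand-ins for enum nl80211_iftype (same order, hypothetical names). */
enum iftype {
    IFTYPE_UNSPECIFIED,
    IFTYPE_ADHOC,
    IFTYPE_STATION,
    IFTYPE_AP,
    IFTYPE_AP_VLAN,
    IFTYPE_WDS,
    IFTYPE_MONITOR,
    IFTYPE_MESH_POINT,
};

#define BIT(n) (1u << (n))

/* Return non-zero when the given interface type is present in the mask. */
static int mode_supported(unsigned modes, enum iftype type)
{
    return (modes & BIT(type)) != 0;
}
```

With modes = BIT(IFTYPE_STATION) | BIT(IFTYPE_AP) the mask is 0x0c: mode_supported() is true for station and AP and false for monitor or mesh point.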


struct ieee80211_if_ap - represents an access point. (see ieee80211_i.h)

Power saving members of ieee80211_if_ap:

ps_bc_buf (multicast/broadcast buffer).

num_sta_ps (number of stations in PS mode).

struct ieee80211_ops - The drivers use its members. (include/net/mac80211.h).

For example, config (to change a channel) or config_interface to change the bssid.

Some drivers upload firmware in the start() method, like lbtf_op_start() in the libertas_tf driver or zd_op_start() in the zd1211rw driver.

All methods of this struct get a pointer to struct ieee80211_hw as a first parameter.

There are 24 methods in this struct.

Seven of them are mandatory: tx, start, stop, add_interface, remove_interface, config and configure_filter.

(If any one of them is missing, we end in BUG_ON())


Receiving a packet is done by calling ieee80211_rx_irqsafe() from the low level driver. Eventually, the packet is handled by __ieee80211_rx():

__ieee80211_rx(struct ieee80211_hw *hw,
               struct sk_buff *skb,
               struct ieee80211_rx_status *status);

ieee80211_rx_irqsafe() can be called from interrupt context.

There is only one more mac80211 method which can be called from interrupt context:

ieee80211_tx_status_irqsafe()

Data frames

Addr1 - destination (receiver MAC address).

Addr2 - source (transmitter MAC address).

Addr3 - DS info

Addr4 - for WDS.

Management frames

Addr1 - destination (receiver MAC address).

Addr2 - source (transmitter MAC address).

Addr3 - DS info

Firmware

Most wireless drivers load firmware in the probe method (by calling request_firmware())

Usually the firmware is not open source.

Open FirmWare for WiFi networks site:

http://www.ing.unibs.it/openfwwf/

Written in assembler.

B43 firmware will be replaced by open source firmware.

The ath5k/ath9k drivers don't load firmware. (their fw is burnt into an on-chip ROM)

Wireless Future trends (WiMax)

WiMax - IEEE 802.16.

There are already laptops which are sold with WiMax chips (Toshiba, Lenovo).

WiMax and Linux:

http://linuxwimax.org/

Inaky Perez-Gonzalez from Intel

(formerly a kernel USB developer)

Location in the kernel tree: drivers/net/wimax.

Wireless Future trends (WiMax) - contd

Two parts:

Kernel module driver

User space management stack, WIMAX Network Service.

A request to merge the linux-wimax GIT tree with the netdev GIT tree was sent on 26.11.08

http://www.spinics.net/lists/netdev/msg81902.html

There is also an initiative from Nokia for a WiMax stack for Linux.

Tips

How can I know if my wireless nic was configured to support power management?

Look in iwconfig for the "Power Management" entry.

How do I know if my USB nic has support in Linux?

http://www.qbik.ch/usb/devices/

How do I know which Wireless Extensions version my kernel uses?

Grep for #define WIRELESS_EXT in include/linux/wireless.h in your kernel tree.

How can I know the channel number from a sniff?

Look at the radiotap header in the sniffer output; channel frequency translates to a channel number (1 to 11)

See also Table 15-7 (DSSS PHY frequency channel plan) in the 2007 80211 spec.

Often, the channel number appears in square brackets. Like:

channel frequency 2437 [BG 6]

BG stands for 802.11B/802.11G, respectively

Channel 14 for example would show as B, because you're not allowed to transmit 802.11G on it.
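
The frequency-to-channel translation in the 2.4 GHz band is simple arithmetic: channels 1 to 13 are spaced 5 MHz apart starting at 2412 MHz, and channel 14 sits apart at 2484 MHz. A minimal sketch follows; the function name freq_to_channel_24ghz() is made up for this example and it deliberately ignores the 5 GHz band.

```c
#include <assert.h>

/*
 * Translate a 2.4 GHz center frequency (in MHz) to a channel number.
 * Returns -1 for frequencies that are not a 2.4 GHz channel center.
 */
static int freq_to_channel_24ghz(int mhz)
{
    if (mhz == 2484)
        return 14;                      /* channel 14 is a special case */
    if (mhz >= 2412 && mhz <= 2472 && (mhz - 2412) % 5 == 0)
        return (mhz - 2412) / 5 + 1;    /* channels 1..13, 5 MHz apart */
    return -1;
}
```

For the radiotap example above, 2437 MHz gives (2437 - 2412) / 5 + 1 = 6, which is the channel shown in the square brackets.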

Israel regdomain:

http://wireless.kernel.org/en/developers/Regulatory/Database?alpha2=IL

IL is in the range 1-13.

With a US configuration, only channels 1 to 11 are selectable. Not 12, 13.

Many APs are shipped with a US configuration.

What is the MAC address of my nic?

cat /sys/class/ieee80211/phy*/macaddress

Common Filters for wireshark sniffer:

Management Frames                              wlan.fc.type eq 0
Control Frames                                 wlan.fc.type eq 1
Data Frames                                    wlan.fc.type eq 2
Association Request                            wlan.fc.type_subtype eq 0
Association response                           wlan.fc.type_subtype eq 1
Reassociation Request                          wlan.fc.type_subtype eq 2
Reassociation Response                         wlan.fc.type_subtype eq 3
Probe Request                                  wlan.fc.type_subtype eq 4
Probe Response                                 wlan.fc.type_subtype eq 5
Beacon                                         wlan.fc.type_subtype eq 8
Announcement Traffic Indication Map (ATIM)     wlan.fc.type_subtype eq 9
Disassociate                                   wlan.fc.type_subtype eq 10
Authentication                                 wlan.fc.type_subtype eq 11
Deauthentication                               wlan.fc.type_subtype eq 12
Action Frames                                  wlan.fc.type_subtype eq 13
Block Acknowledgement (ACK) Request            wlan.fc.type_subtype eq 24
Block ACK                                      wlan.fc.type_subtype eq 25
Power-Save Poll                                wlan.fc.type_subtype eq 26
Request to Send                                wlan.fc.type_subtype eq 27
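
The numbers in the type_subtype filters follow from the layout of the frame control field: the 2-bit type sits in bits 2-3 and the 4-bit subtype in bits 4-7, and the combined value is type * 16 + subtype. The sketch below shows the arithmetic in userspace; the helper names are hypothetical, while the two masks have the same values as IEEE80211_FCTL_FTYPE and IEEE80211_FCTL_STYPE in include/linux/ieee80211.h.

```c
#include <assert.h>
#include <stdint.h>

#define FCTL_FTYPE 0x000c   /* type bits, same value as IEEE80211_FCTL_FTYPE */
#define FCTL_STYPE 0x00f0   /* subtype bits, same value as IEEE80211_FCTL_STYPE */

/* 0 = management, 1 = control, 2 = data */
static unsigned fc_type(uint16_t fc)
{
    return (fc & FCTL_FTYPE) >> 2;
}

static unsigned fc_subtype(uint16_t fc)
{
    return (fc & FCTL_STYPE) >> 4;
}

/* The combined value used by the wlan.fc.type_subtype display filter. */
static unsigned fc_type_subtype(uint16_t fc)
{
    return (fc_type(fc) << 4) | fc_subtype(fc);
}
```

A beacon (frame control 0x0080) yields 8, and a PS-Poll (control type 0x0004 plus subtype 0x00a0) yields 26, matching the table above.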

Sniffing a WLAN

You could sniff with wireshark

Sometimes you can't put the wireless interface into promiscuous mode (or it is not enough). You should set the interface to work in monitor mode (For example: iwconfig wlan0 mode monitor).

If you want to capture traffic on networks other than the one with which you're associated, you will have to capture in monitor mode.

Sniffing a WLAN - contd.

See the following wireshark wiki page, talking about various wireless cards and sniffing in Linux;

WLAN (IEEE 802.11) capture setup:

http://wiki.wireshark.org/CaptureSetup/WLAN#head-bb8373ef4903fe9da2b8375331726541fb1ad32d

Using a filter from command line:

tshark -R wlan -i wlan0

tethereal -R wlan -i wlan0 -w wlan.eth

You will see this message in the kernel log:

"device wlan0 entered promiscuous mode"



Sniffing a WLAN - contd.

Sometimes you will have to set a different channel than the default one in order to see beacon frames (try channels 1, 6, 11)

iwconfig wlan1 channel 11

Tip: useful wireshark display filter:

For showing only beacons:

wlan.fc.type_subtype eq 8

For tshark command line:

tshark -R "wlan.fc.type_subtype eq 8" -i wlan0

(this will sniff for beacons).



Glossary

AMPDU = Aggregated MAC Protocol Data Unit.

CRDA = Central Regulatory Domain Agent

CSMA/CA = Carrier Sense Multiple Access with Collision Avoidance

CSMA/CD = Carrier Sense Multiple Access with Collision Detection

DS = Distribution System

EAP = The Extensible Authentication Protocol

ERP = extended rate PHY

HWMP = Hybrid Wireless Mesh Protocol

MPDU = MAC Protocol Data Unit

MIMO = Multiple-Input/Multiple-Output

PSAP = Power Saving Access Points

PS = Power Saving.

RSSI = Receive signal strength indicator.

TIM = Traffic Indication Map

WPA = Wi-Fi Protected Access

WME = Wireless Multimedia Extensions




Links

1) IEEE 80211 specs:

http://standards.ieee.org/getieee802/802.11.html

2) Linux wireless status, June 2008

http://www.kernel.org/pub/linux/kernel/people/mcgrof/presentations/linux-wireless-status.pdf

3) Official Linux Wireless wiki hosted by Johannes Berg.

http://wireless.kernel.org/

or http://linuxwireless.org/

4) A book:

802.11 Wireless Networks: The Definitive Guide

by Matthew Gast

Publisher: O'Reilly

5) Wireless Sniffing with Wireshark - Chapter 6 of Syngress Wireshark and Ethereal Network Protocol Analyzer Toolkit.

6) http://www.lesswatts.org/

Saving power with Linux (an Intel site)

7) A book: Wireless Mesh Networking: Architectures, Protocols And Standards
by Yan Zhang, Jijun Luo, Honglin Hu (Hardcover, 2006)
Auerbach Publications

8) http://www.radiotap.org/

Images

Beacon wireshark filter:

wlan.fc.type_subtype eq 8

shows only beacons.

Beacon filter - sniff

Beacon interval and DTIM period in edimax router (BR-6504N) (From the manual)

Thank You !

Linux Kernel Networking
advanced topics (5)
Sockets in the kernel
Rami Rosen
[email protected]
Haifux, August 2009
www.haifux.org
All rights reserved.

Linux Kernel Networking (5) - advanced topics

Note:

This lecture is a sequel to the following 4 lectures I gave in Haifux:

1) Linux Kernel Networking lecture

http://www.haifux.org/lectures/172/

slides: http://www.haifux.org/lectures/172/netLec.pdf

2) Advanced Linux Kernel Networking - Neighboring Subsystem and IPSec lecture

http://www.haifux.org/lectures/180/

slides: http://www.haifux.org/lectures/180/netLec2.pdf

3) Advanced Linux Kernel Networking - IPv6 in the Linux Kernel lecture

http://www.haifux.org/lectures/187/

slides: http://www.haifux.org/lectures/187/netLec3.pdf

4) Wireless in Linux

http://www.haifux.org/lectures/206/

slides: http://www.haifux.org/lectures/206/wirelessLec.pdf

Table of contents:

The socket() system call

UDP protocol

Control Messages

Appendixes

Note: All code examples in this lecture refer to the recent 2.6.30 version of the Linux kernel.


[Diagram, bottom to top:]
kernel: Layer 2 (MAC layer)
kernel: Layer 3 (Network layer: IPV4/IPV6)
kernel: Layer 4 (TCP, UDP, SCTP, ...)
Userspace: TCP socket, UDP socket

In user space, we have application, session and presentation layers (tcp/ip refers to all 3 as application layer).

Creating a socket from user space is done by the socket() system call:

int socket(int family, int type, int protocol);

From man 2 socket:

RETURN VALUE

On success, a file descriptor for the new socket is returned.

For the open() system call (for files), we also get a file descriptor as the return value.

"Everything is a file" Unix paradigm.
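
As a small illustration of the call described above, the following userspace sketch opens a UDP socket and reads back its type through the returned file descriptor. The helper names open_udp_socket() and socket_type() are made up for this example; socket(), getsockopt() and close() are the standard calls.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create an IPv4 datagram (UDP) socket; returns a file descriptor or -1. */
static int open_udp_socket(void)
{
    return socket(AF_INET, SOCK_DGRAM, 0);
}

/* Ask the kernel which type a socket has (SOCK_STREAM, SOCK_DGRAM, ...). */
static int socket_type(int fd)
{
    int type = -1;
    socklen_t len = sizeof(type);

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        return -1;
    return type;
}
```

Because the result is an ordinary file descriptor, it is released with close() just like a descriptor obtained from open().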

The first parameter, family, is also sometimes referred to as "domain".

The family is PF_INET for IPV4 or PF_INET6 for IPV6.

The family is PF_PACKET for Packet sockets, which operate at the device driver layer (Layer 2).

The pcap library for Linux uses PF_PACKET sockets:

the pcap library is in use by sniffers such as tcpdump.

Also hostapd uses PF_PACKET sockets:

(hostapd is a wireless access point management project)

From hostapd:

drv->monitor_sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

Type:

SOCK_STREAM and SOCK_DGRAM are the most used types.

SOCK_STREAM for TCP, SCTP, BLUETOOTH.

SOCK_DGRAM for UDP.

SOCK_RAW for RAW sockets.

There are cases where the type can be either SOCK_STREAM or SOCK_DGRAM; for example, a Unix domain socket (AF_UNIX).

Protocol: usually 0 (IPPROTO_IP is 0, see: include/linux/in.h).

For SCTP, the protocol is IPPROTO_SCTP:

sockfd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);

For bluetooth/RFCOMM:

socket(AF_BLUETOOTH, SOCK_STREAM, BTPROTO_RFCOMM);

SCTP: Stream Control Transmission Protocol.

For every socket which is created by a userspace application, there is a corresponding socket struct and sock struct in the kernel.

This system call eventually invokes the sock_create() method in the kernel.

An instance of struct socket is created (include/linux/net.h)

struct socket has only 8 members; struct sock has more than 20, and is one of the biggest structures in the networking stack. You can easily be confused between them. So the convention is this:

sock always refers to struct socket.

sk always refers to struct sock.



struct sock: (include/net/sock.h)
struct sock {
    ...
    struct socket *sk_socket;
}
struct socket (include/linux/net.h)
struct socket {
    socket_state state;
    short type;
    unsigned long flags;
    struct fasync_struct *fasync_list;
    wait_queue_head_t wait;
    struct file *file;
    struct sock *sk;
    const struct proto_ops *ops;
};

The state can be

SS_FREE

SS_UNCONNECTED

SS_CONNECTING

SS_CONNECTED

SS_DISCONNECTING

These states are not layer 4 states (like TCP_ESTABLISHED or TCP_CLOSE).

The sk_protocol member of struct sock equals the third parameter (protocol) of the socket() system call.

struct proto_ops (interface of struct socket)

member       inet_stream_ops         inet_dgram_ops          inet_sockraw_ops
             (i.e., TCP sockets)     (i.e., UDP sockets)     (i.e., RAW sockets)
family       PF_INET                 PF_INET                 PF_INET
owner        THIS_MODULE             THIS_MODULE             THIS_MODULE
release      inet_release            inet_release            inet_release
bind         inet_bind               inet_bind               inet_bind
connect      inet_stream_connect     inet_dgram_connect      inet_dgram_connect
socketpair   sock_no_socketpair      sock_no_socketpair      sock_no_socketpair
accept       inet_accept             sock_no_accept          sock_no_accept
getname      inet_getname            inet_getname            inet_getname
poll         tcp_poll                udp_poll                datagram_poll
ioctl        inet_ioctl              inet_ioctl              inet_ioctl
listen       inet_listen             sock_no_listen          sock_no_listen
shutdown     inet_shutdown           inet_shutdown           inet_shutdown
setsockopt   sock_common_setsockopt  sock_common_setsockopt  sock_common_setsockopt
getsockopt   sock_common_getsockopt  sock_common_getsockopt  sock_common_getsockopt
sendmsg      tcp_sendmsg             inet_sendmsg            inet_sendmsg
recvmsg      sock_common_recvmsg     sock_common_recvmsg     sock_common_recvmsg
mmap         sock_no_mmap            sock_no_mmap            sock_no_mmap
sendpage     tcp_sendpage            inet_sendpage           inet_sendpage
splice_read  tcp_splice_read         -                       -

Note: The inet_dgram_ops and inet_sockraw_ops differ only in the poll member:

in inet_dgram_ops it is udp_poll()

in inet_sockraw_ops, it is datagram_poll().

Diagram: struct inet_sock contains struct sock (sk):

    struct inet_sock
        struct sock (sk)
        struct ip_options *opt;
        __u8 tos;
        __u8 recverr:1;
        __u8 hdrincl:1;

inet_sk(sock *sk) => returns the inet_sock which contains sk.

struct sock has three queues: rx, tx and err.

    sk_receive_queue:  sk_buff  sk_buff  sk_buff
    sk_write_queue:    sk_buff  sk_buff  sk_buff
    sk_error_queue:    sk_buff  sk_buff  sk_buff

Each queue has a lock (spinlock).

skb_queue_tail(): adding to the queue.

skb_dequeue(): removing from the queue.

With MSG_PEEK, this is done in two stages:

    skb_peek()

    __skb_unlink() (to remove the sk_buff from the queue).

For the error queue: sock_queue_err_skb() adds
to its tail (include/net/sock.h). Eventually, it also calls
skb_queue_tail().

Errors can be ICMP errors or EMSGSIZE errors.

For more about errors, see APPENDIX F: UDP errors.



UDP and TCP

No explicit connection setup is done with UDP.

In TCP there is a preliminary connection setup.

Packets can be lost in UDP (there is no
retransmission mechanism in the kernel). TCP
on the other hand is reliable (there is a
retransmission mechanism).

Most of the Internet traffic is TCP (like http,
ssh).

UDP is for audio/video (RTP)/streaming.

Note: streaming with VLC is by UDP (RTP).

Streaming via YouTube is tcp (http).



The udp header

There are very few UDP-based servers, like DNS, NTP,
DHCP, TFTP and more.

For DHCP, it is quite natural to be UDP (since many times with
DHCP, you don't have a source address, which is a must for TCP).

TCP implementation is much more complex.

The TCP header is much bigger than the UDP header.

The udp header: include/linux/udp.h

    struct udphdr {
        __be16 source;
        __be16 dest;
        __be16 len;
        __sum16 check;
    };

UDP packet = UDP header + payload

All members are 2 bytes (16 bits):

    | source port | dest port |
    | len         | checksum  |
    | Payload                 |
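
To make the layout concrete, here is a minimal user-space sketch that serializes such a header in front of a payload. It is only an illustration: struct udp_header and build_udp_packet() are names invented here (the kernel uses struct udphdr), and the checksum is simply left at 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Illustrative layout only; mirrors the four 16-bit fields of struct udphdr. */
struct udp_header {
    uint16_t source;   /* source port, network byte order */
    uint16_t dest;     /* destination port, network byte order */
    uint16_t len;      /* header + payload length, network byte order */
    uint16_t check;    /* checksum; left as 0 in this sketch */
};

/* Writes header + payload into out.
 * Returns the total number of bytes, or 0 if it does not fit. */
size_t build_udp_packet(uint8_t *out, size_t out_size,
                        uint16_t sport, uint16_t dport,
                        const void *payload, size_t payload_len)
{
    struct udp_header hdr;
    size_t total = sizeof(hdr) + payload_len;

    if (total > out_size || total > 0xffff)
        return 0;
    hdr.source = htons(sport);
    hdr.dest = htons(dport);
    hdr.len = htons((uint16_t)total);
    hdr.check = 0;
    memcpy(out, &hdr, sizeof(hdr));
    memcpy(out + sizeof(hdr), payload, payload_len);
    return total;
}
```

Note that the len field counts the 8 header bytes as well as the payload, so a 5-byte payload gives len = 13.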

Receiving packets in UDP from kernel

UDP kernel sockets can get traffic either from userspace or
from kernel.

[Diagram: receive path from the kernel]

    USER SPACE
        UDP sockets
    KERNEL
        UDP - layer 4:   sock_queue_rcv_skb()
        IPv4 - layer 3:  ip_local_deliver_finish() calls udp_rcv()
                         NF_INET_LOCAL_IN hook
        Layer 2 (Ethernet)

From user space, you can receive udp traffic with
three system calls:

    recv() (when the socket is connected)

    recvfrom()

    recvmsg()

All three are handled by udp_recvmsg() in the kernel.

Note that all three methods take a flags parameter;
however, this parameter is NOT changed upon return. If you are
interested in returned flags, you must use only recvmsg(), and
retrieve the msg.msg_flags member.

For example, suppose you have a client-server udp
application, and the sender sends a packet which is longer
than what the client had allocated for its input buffer. The kernel
then truncates the packet, and sets the MSG_TRUNC flag. In
order to retrieve it, you should use something like:

    recvmsg(udpSocket, &msg, flags);
    if (msg.msg_flags & MSG_TRUNC)
        printf("MSG_TRUNC\n");
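
The truncation case can be reproduced on a single machine. The following sketch assumes the loopback interface is usable; the function name recv_flags_for() is invented for this example. It sends a datagram to its own socket and reports the msg_flags that recvmsg() returned.

```c
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sends send_len bytes to ourselves over loopback and receives them into a
 * buffer of recv_len bytes. Returns msg_flags from recvmsg(), or -1 on error. */
int recv_flags_for(size_t send_len, size_t recv_len)
{
    char out[2048], in[2048];
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    struct iovec iov;
    struct msghdr msg;
    int flags = -1;
    int s;

    if (send_len > sizeof(out) || recv_len > sizeof(in))
        return -1;
    s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                    /* let the kernel pick a port */
    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(s, (struct sockaddr *)&addr, &alen) < 0)
        goto done;

    memset(out, 'x', send_len);
    if (sendto(s, out, send_len, 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = in;
    iov.iov_len = recv_len;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    if (recvmsg(s, &msg, 0) >= 0)
        flags = msg.msg_flags;            /* only recvmsg() reports this */
done:
    close(s);
    return flags;
}
```

Calling recv_flags_for(100, 10) should return a value with MSG_TRUNC set, while recv_flags_for(10, 100) should not.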

There was a new suggestion recently for a
recvmmsg() system call for receiving multiple
messages (by Arnaldo Carvalho de Melo).

recvmmsg() will reduce the overhead caused by multiple
system calls of recvmsg() in the usual case.

Receiving packets in UDP from user space

UDP kernel sockets can get traffic either from userspace or
from kernel.

[Diagram: receive path from user space]

    USER SPACE
        UDP sockets
        recv() syscall, recvfrom() system call, recvmsg() syscall
    KERNEL
        UDP - layer 4:   udp_recvmsg()
                         __skb_recv_datagram(): reads from sk->sk_receive_queue
        IPv4 - layer 3
        Layer 2 (Ethernet)

Receiving packets - udp_rcv()

udp_rcv() is the handler for all UDP packets
from the IP layer. It handles all incoming
packets in which the protocol field in the ip
header is IPPROTO_UDP (17), after the ip layer
finished with them.

See the udp_protocol definition: (net/ipv4/af_inet.c)

    struct net_protocol udp_protocol = {
        .handler = udp_rcv,
        .err_handler = udp_err,
        ...
    };

In the same way we have:

    raw_rcv() as a handler for raw packets.

    tcp_v4_rcv() as a handler for TCP packets.

    icmp_rcv() as a handler for ICMP packets.

Kernel implementation: the proto_register()
method registers a protocol handler
(net/core/sock.c).

udp_rcv() implementation:

For broadcasts and multicast there is a special
treatment:

    if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
        return __udp4_lib_mcast_deliver(net, skb, uh,
                                        saddr, daddr, udptable);

Then perform a lookup in a hashtable of struct sock.

The hash key is created from the destination port in the udp header.

If there is no entry in the hashtable, then there is no sock
listening on this UDP destination port => so send ICMP
back (of port unreachable):

    icmp_send(skb, ICMP_DEST_UNREACH,
              ICMP_PORT_UNREACH, 0);

udp_rcv()

In this case, a corresponding SNMP MIB
counter is incremented
(UDP_MIB_NOPORTS):

    UDP_INC_STATS_BH(net, UDP_MIB_NOPORTS, proto ==
                     IPPROTO_UDPLITE);

You can see it by:

    netstat -s
    .....
    Udp:
    ...
    35 packets to unknown port received.

udp_rcv() - contd

Or, by:

    cat /proc/net/snmp | grep Udp:

    Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
    Udp: 14 35 0 30 0 0
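
The same counter can be read from a program. The sketch below is one possible way to do it, assuming the /proc/net/snmp layout shown above (a "Udp:" header line followed by a "Udp:" value line); the function names parse_udp_noports() and read_udp_noports() are invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* Parses the text of /proc/net/snmp and stores the Udp NoPorts counter in
 * *noports. Returns 0 on success, -1 if the Udp value line was not found. */
int parse_udp_noports(const char *text, unsigned long *noports)
{
    const char *line = text;

    while (line && *line) {
        unsigned long in_datagrams, no_ports;

        /* The header line ("Udp: InDatagrams ...") fails this sscanf();
         * only the line holding the numeric values matches. */
        if (strncmp(line, "Udp:", 4) == 0 &&
            sscanf(line + 4, "%lu %lu", &in_datagrams, &no_ports) == 2) {
            *noports = no_ports;
            return 0;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Reads /proc/net/snmp (Linux only) and parses it with the function above. */
int read_udp_noports(unsigned long *noports)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/net/snmp", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_udp_noports(buf, noports);
}
```

With the sample output above, parse_udp_noports() stores 35, the second value of the Udp line.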

If there is a sock listening on the destination
port, call udp_queue_rcv_skb().

It eventually calls sock_queue_rcv_skb(),

which adds the packet to the sk_receive_queue by
skb_queue_tail().

udp_rcv() diagram

    udp_rcv()
        __udp4_lib_rcv
            Multicast:
                __udp4_lib_mcast_deliver
            Unicast:
                __udp4_lib_lookup_skb
                    Find a sock in udptable:
                        udp_queue_rcv_skb -> sock_queue_rcv_skb
                    Don't find a sock:
                        icmp_send()
                            ICMP_DEST_UNREACH,
                            ICMP_PORT_UNREACH

udp_recvmsg():

Calls __skb_recv_datagram(), for receiving
one sk_buff.

The __skb_recv_datagram() may block.

Eventually, what __skb_recv_datagram() does is
read one sk_buff from the sk_receive_queue
queue.

memcpy_toiovec() performs the actual copy to
user space by invoking copy_to_user().

One of the parameters of udp_recvmsg() is a
pointer to struct msghdr. Let's take a look:

MSGHDR

From include/linux/socket.h:

    struct msghdr {
        void *msg_name;                 /* Socket name */
        int msg_namelen;                /* Length of name */
        struct iovec *msg_iov;          /* Data blocks */
        __kernel_size_t msg_iovlen;     /* Number of blocks */
        void *msg_control;
        __kernel_size_t msg_controllen; /* Length of cmsg list */
        unsigned msg_flags;
    };

Control messages (ancillary messages)

The msg_control member of msghdr represents a control
message.

Sometimes you need to perform some special
things. For example, getting to know what was the
destination address of a received packet.

Sometimes there is more than one address on a machine
(and also you can have multiple addresses on the same
nic).

How can we know the destination address of the ip
header in the application?

struct cmsghdr (/usr/include/bits/socket.h)
represents a control message.

cmsghdr members can mean different things based on the type
of socket.

There is a set of macros for handling cmsghdr, like
CMSG_FIRSTHDR(), CMSG_NXTHDR(), CMSG_DATA(),
CMSG_LEN() and more.

There are no control messages for TCP sockets.



Socket options:

In order to tell the socket to get the information about the packet
destination, we should call setsockopt().

setsockopt() and getsockopt() - set and get options on a socket.

Both methods return 0 on success and -1 on error.

Prototype: int setsockopt(int sockfd, int level, int optname, ...

There are two levels of socket options:

    To manipulate options at the sockets API level: SOL_SOCKET.

    To manipulate options at a protocol level, that protocol number
    should be used;

    for example, for UDP it is IPPROTO_UDP or SOL_UDP
    (both are equal to 17); see include/linux/in.h and
    include/linux/socket.h.

SOL_IP is 0.

There are currently 19 Linux socket options at the SOL_IP level, and
one more option for BSD compatibility.
See Appendix B for a full list of socket options.

There is an option called IP_PKTINFO.

We will set the IP_PKTINFO option on a socket in the
following example.

    // from /usr/include/bits/in.h
    #define IP_PKTINFO 8 /* bool */
    /* Structure used for IP_PKTINFO. */
    struct in_pktinfo
    {
        int ipi_ifindex;              /* Interface index */
        struct in_addr ipi_spec_dst;  /* Routing destination address */
        struct in_addr ipi_addr;      /* Header destination address */
    };

    const int on = 1;
    sockfd = socket(AF_INET, SOCK_DGRAM, 0);
    if (setsockopt(sockfd, SOL_IP, IP_PKTINFO, &on,
                   sizeof(on)) < 0)
        perror("setsockopt");

When calling recvmsg(), we will parse the msghdr like this:

    for (cmptr = CMSG_FIRSTHDR(&msg); cmptr != NULL;
         cmptr = CMSG_NXTHDR(&msg, cmptr))
    {
        if (cmptr->cmsg_level == SOL_IP && cmptr->cmsg_type ==
            IP_PKTINFO)
        {
            pktinfo = (struct in_pktinfo*)CMSG_DATA(cmptr);
            printf("destination=%s\n", inet_ntop(AF_INET, &pktinfo->ipi_addr,
                   str, sizeof(str)));
        }
    }
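
The two fragments above can be combined into one self-contained function. This is a sketch under a few assumptions: it runs on Linux with glibc, it sends a datagram to its own socket over loopback so that there is something to receive, and the name loopback_pktinfo_dst() is invented for the example.

```c
#define _GNU_SOURCE               /* for struct in_pktinfo and SOL_IP */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sends one datagram to ourselves on 127.0.0.1 and writes the destination
 * address reported through IP_PKTINFO into dst. Returns 0 on success. */
int loopback_pktinfo_dst(char *dst, size_t dst_len)
{
    const int on = 1;
    char data[16];
    union { struct cmsghdr align; char buf[256]; } ctl; /* aligned cmsg buffer */
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    struct iovec iov = { data, sizeof(data) };
    struct msghdr msg;
    struct cmsghdr *cmptr;
    int rc = -1;
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    if (s < 0)
        return -1;
    if (setsockopt(s, SOL_IP, IP_PKTINFO, &on, sizeof(on)) < 0)
        goto done;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(s, (struct sockaddr *)&addr, &alen) < 0 ||
        sendto(s, "ping", 4, 0, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    if (recvmsg(s, &msg, 0) < 0)
        goto done;

    for (cmptr = CMSG_FIRSTHDR(&msg); cmptr != NULL;
         cmptr = CMSG_NXTHDR(&msg, cmptr)) {
        if (cmptr->cmsg_level == SOL_IP && cmptr->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo info;

            memcpy(&info, CMSG_DATA(cmptr), sizeof(info));
            if (inet_ntop(AF_INET, &info.ipi_addr, dst, dst_len))
                rc = 0;
        }
    }
done:
    close(s);
    return rc;
}
```

Since the datagram was sent to 127.0.0.1, that is the address the function is expected to report. On a multi-homed server the same loop tells you which of the local addresses the client actually used.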

In the kernel, this calls ip_cmsg_recv() in
net/ipv4/ip_sockglue.c (which eventually calls
ip_cmsg_recv_pktinfo()).

You can in this way retrieve other fields of the ip
header:

For getting the TTL:

    setsockopt(sockfd, SOL_IP, IP_RECVTTL, &on,
               sizeof(on))

    But: cmsg_type == IP_TTL.

For getting ip_options:

    setsockopt() with IP_OPTIONS.

Note: you cannot get/set ip_options in a Java
app.

Sending packets in UDP

From user space, you can send udp traffic with
three system calls:

    send() (when the socket is connected).

    sendto()

    sendmsg()

All three are handled by udp_sendmsg() in the kernel.

udp_sendmsg() is much simpler than the tcp parallel
method, tcp_sendmsg().

udp_sendpage() is called when user space calls
sendfile() (to copy a file into a udp socket).

sendfile() can be used also to copy data between one file
descriptor and another.

udp_sendpage() invokes udp_sendmsg().

udp_sendpage() will work only if the nic supports
Scatter/Gather (NETIF_F_SG feature is supported).

Example udp client

    #include <stdio.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <string.h>
    int main()
    {
        int s;
        struct sockaddr_in target;
        int res;
        char buf[16];   /* room for "message 1:" plus the terminating NUL */

        memset(&target, 0, sizeof(target));
        target.sin_family = AF_INET;
        target.sin_port = htons(999);
        inet_aton("192.168.0.121", &target.sin_addr);
        strcpy(buf, "message 1:");
        s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) {
            perror("socket");
            return 1;
        }
        res = sendto(s, buf, strlen(buf) + 1, 0, (struct sockaddr*)&target,
                     sizeof(struct sockaddr_in));
        if (res < 0)
            perror("sendto");
        else
            printf("%d bytes were sent\n", res);
        return 0;
    }

For comparison, there is a tcp client in appendix C.

The source port of the UDP packet here is
chosen randomly in the kernel.

What if I want to send from a specified port?

You can bind to a specific source port (888 in this example) by
adding:

    source.sin_family = AF_INET;
    source.sin_port = htons(888);
    source.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(s, (struct sockaddr*)&source, sizeof(struct
             sockaddr_in)) == -1)
        perror("bind");

You cannot bind to privileged ports (ports lower
than 1024) when you are not root!

Trying to do this will give:

    "Permission denied" (EACCES)

You can enable non root binding on a privileged port
by running, as root: (You will need at least a 2.6.24
kernel)

    setcap 'cap_net_bind_service=+ep' udpclient

This sets the CAP_NET_BIND_SERVICE
capability.

You cannot bind on a port which is already
bound.

Trying to do this will give:

    "Address already in use" (EADDRINUSE)

You cannot bind twice or more with the same
UDP socket (even if you change the port).

You will get a "bind: Invalid argument" error in such a
case (EINVAL).

If you try connect() on an unbound UDP socket
and then bind(), you will also get the EINVAL
error. The reason is that connecting an
unbound socket will call inet_autobind() to
automatically bind the socket (on a
random port). So after connect(), the socket is
bound, and calling bind() again will fail
with EINVAL (since the socket is already
bound).
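
The connect()-then-bind() case is easy to check from user space. The sketch below assumes a Linux host with a working loopback interface; the peer address 127.0.0.1:9 is arbitrary (nothing needs to listen there, since UDP connect() sends no packets), and errno_of_bind_after_connect() is a name invented for this example.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* connect()s an unbound UDP socket to 127.0.0.1:9, then tries to bind() it.
 * Returns the errno left by bind(), 0 if bind() unexpectedly succeeded,
 * or -1 if the setup itself failed. */
int errno_of_bind_after_connect(void)
{
    struct sockaddr_in peer, local;
    int result = -1;
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    if (s < 0)
        return -1;

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(9);
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (connect(s, (struct sockaddr *)&peer, sizeof(peer)) < 0)
        goto done;                       /* connect() autobinds the socket */

    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_port = htons(0);           /* any port; the socket is bound anyway */
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(s, (struct sockaddr *)&local, sizeof(local)) < 0)
        result = errno;
    else
        result = 0;
done:
    close(s);
    return result;
}
```

On Linux the function is expected to return EINVAL, matching the explanation above.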

Binding in the kernel for UDP is implemented in
inet_bind() and inet_autobind()

(in IPv6: inet6_bind()).



Non local bind

What happens if we try to bind on a non local address? (A non
local address can be, for example, an address of an interface which
is temporarily down.)

We get an EADDRNOTAVAIL error:

    "bind: Cannot assign requested address"

However, if we set
/proc/sys/net/ipv4/ip_nonlocal_bind to 1, by

    echo "1" > /proc/sys/net/ipv4/ip_nonlocal_bind

or by adding in /etc/sysctl.conf:

    net.ipv4.ip_nonlocal_bind=1

the bind() will succeed, but it may sometimes break
applications.

What will happen if, in the above udp client example, we try
setting a broadcast address as the destination (instead of
192.168.0.121), thus:

    inet_aton("255.255.255.255", &target.sin_addr);

We will get an EACCES error ("Permission denied") for sendto().

In order for UDP broadcast to work, we have to add:

    int flag = 1;
    if (setsockopt(s, SOL_SOCKET, SO_BROADCAST, &flag,
                   sizeof(flag)) < 0)
        perror("setsockopt");

UDP socket options

For the IPPROTO_UDP/SOL_UDP level, we have
two socket options:

UDP_CORK socket option.

    Added in Linux kernel 2.5.44.

    int state = 1;
    setsockopt(s, IPPROTO_UDP, UDP_CORK, &state,
               sizeof(state));
    for (j = 0; j < 1000; j++)
        sendto(s, buf1, ...)
    state = 0;
    setsockopt(s, IPPROTO_UDP, UDP_CORK, &state,
               sizeof(state));

The above code fragment will call
udp_sendmsg() 1000 times without actually
sending anything on the wire (in the usual case,
without setsockopt() with UDP_CORK,
1000 packets will be sent).

Only after the second setsockopt() is called,
with UDP_CORK and state=0, one packet is
sent on the wire.
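
A small loopback experiment can show the corking effect. The sketch below is an illustration only: it assumes Linux, uses two sockets of its own on 127.0.0.1, and corked_datagram_size() is a name invented here. If corking works as described, the pieces arrive as one datagram whose size is the sum of the pieces.

```c
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <sys/socket.h>

#ifndef UDP_CORK
#define UDP_CORK 1   /* fallback for headers that do not define it */
#endif

/* Sends `pieces` copies of `piece` with UDP_CORK set, then uncorks.
 * Returns the size of the first datagram received on loopback, or -1. */
long corked_datagram_size(const char *piece, int pieces)
{
    char in[4096];
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    int state = 1, i;
    long got = -1;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    if (rx < 0 || tx < 0)
        goto done;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &alen) < 0)
        goto done;

    if (setsockopt(tx, IPPROTO_UDP, UDP_CORK, &state, sizeof(state)) < 0)
        goto done;
    for (i = 0; i < pieces; i++)         /* nothing goes on the wire yet */
        if (sendto(tx, piece, strlen(piece), 0,
                   (struct sockaddr *)&addr, sizeof(addr)) < 0)
            goto done;
    state = 0;                           /* uncork: the pending data is sent */
    if (setsockopt(tx, IPPROTO_UDP, UDP_CORK, &state, sizeof(state)) < 0)
        goto done;

    got = recv(rx, in, sizeof(in), 0);
done:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return got;
}
```

For instance, corked_datagram_size("ab", 3) is expected to return 6: three 2-byte sends merged into a single 6-byte datagram.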

Kernel implementation: when using
UDP_CORK, udp_sendmsg() passes
MSG_MORE to ip_append_data().

Implementation detail: UDP_CORK is not in the glibc header
(/usr/include/netinet/udp.h); you need to add in your
program:

    #define UDP_CORK 1

UDP_ENCAP socket option.

    For usage with IPSEC.

    Used, for example, in ipsec-tools.

Note: UDP_ENCAP does not appear yet in the man page
of udp (UDP_CORK does appear).

Note that there are other socket options at the
SOL_SOCKET level which you can get/set on
UDP sockets: for example, SO_NO_CHECK (to
disable checksum on UDP receive) (see Appendix E).

SO_DONTROUTE (equivalent to MSG_DONTROUTE in send()).

The SO_DONTROUTE option tells "don't send via a gateway,
only send to directly connected hosts".

Adding:

    setsockopt(s, SOL_SOCKET, SO_DONTROUTE, &one,
               sizeof(one))

and sending the packet to a host on a different network will
cause a "Network is unreachable" error to be received
(ENETUNREACH).

The same will happen when the MSG_DONTROUTE flag is set
in sendto().

SO_SNDBUF:

    getsockopt(s, SOL_SOCKET, SO_SNDBUF, (void *) &sndbuf, &len)


Suppose we want to receive ICMP errors with the UDP client
example (like ICMP destination unreachable/port unreachable).

How can we achieve this?

First, we should set this socket option:

    int val = 1;

    setsockopt(s, SOL_IP, IP_RECVERR, (char*)&val, sizeof(val));

Then, we should add a call to a method like this
for receiving error messages:

    int recv_err(int s)
    {
        int res;
        char cbuf[512];
        struct iovec iov;
        struct msghdr msg;
        struct cmsghdr *cmsg;
        struct sock_extended_err *e;
        struct icmphdr icmph;
        struct sockaddr_in target;

        for (;;)
        {
            iov.iov_base = &icmph;
            iov.iov_len = sizeof(icmph);
            msg.msg_name = (void*)&target;
            msg.msg_namelen = sizeof(target);
            msg.msg_iov = &iov;
            msg.msg_iovlen = 1;
            msg.msg_flags = 0;
            msg.msg_control = cbuf;
            msg.msg_controllen = sizeof(cbuf);
            res = recvmsg(s, &msg, MSG_ERRQUEUE | MSG_WAITALL);

            if (res < 0)
                continue;
            for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg))
            {
                if (cmsg->cmsg_level == SOL_IP)
                    if (cmsg->cmsg_type == IP_RECVERR)
                    {
                        printf("got IP_RECVERR message\n");
                        e = (struct sock_extended_err*)CMSG_DATA(cmsg);
                        if (e)
                            if (e->ee_origin == SO_EE_ORIGIN_ICMP) {
                                struct sockaddr_in *sin = (struct sockaddr_in *)(e+1);

                                if ((e->ee_type == ICMP_DEST_UNREACH) && (e->ee_code ==
                                     ICMP_PORT_UNREACH))
                                    printf("Destination port unreachable\n");
                            }
                    }
            }
        }
    }

udp_sendmsg()

udp_sendmsg(struct kiocb *iocb, struct sock
*sk, struct msghdr *msg, size_t len)

Sanity checks in udp_sendmsg():

The destination UDP port must not be 0.

    If we try a destination port of 0 we get an EINVAL
    error as a return value of udp_sendmsg().

    The destination UDP port is embedded inside the
    msghdr parameter (in fact, msg->msg_name
    represents a sockaddr_in; sin_port in sockaddr_in
    is the destination port number).

MSG_OOB is the only illegal flag for UDP.
Returns an EOPNOTSUPP error if such a flag is
passed (it is only permitted for SOCK_STREAM).

MSG_OOB is also illegal in AF_UNIX.

OOB stands for "Out Of Band data".

The MSG_OOB flag is permitted in TCP.

    It enables sending one byte of data in urgent mode

    (telnet, "ctrl/c" for example).

The destination must be either:

    specified in the msghdr (the name field in msghdr),

    or the socket is connected:

        sk->sk_state == TCP_ESTABLISHED

    Notice that though this is UDP, we use TCP semantics here.

Sending packets in UDP (contd)

In case the socket is not connected, we should
find a route to it; this is done by calling
ip_route_output_flow().

In case it is connected, we use the route from
the sock (sk_dst_cache member of sk, which is
an instance of dst_entry).

    When the connect() system call was invoked,
    ip4_datagram_connect() found the route by
    ip_route_connect() and set sk->sk_dst_cache in
    sk_dst_set().

Moving the packet to Layer 3 (IP layer) is done
by ip_append_data().

In TCP, moving the packet to Layer 3 is done
with ip_queue_xmit().

What's the difference?

UDP does not handle fragmentation;
ip_append_data() does handle fragmentation.

TCP handles fragmentation in layer 4. So there is no need
for ip_append_data().

ip_queue_xmit() is (naturally) a simpler method.

Basically what the udp_sendmsg() method
does is:

    Finds the route for the packet by
    ip_route_output_flow().

    Sends the packet with
    ip_local_out(skb).

Asynchronous I/O

There is support for Asynchronous I/O in UDP
sockets.

This means that instead of polling to know if
there is data (by select(), for example), the
kernel sends a SIGIO signal in such a case.

Using Asynchronous I/O UDP in a user space
application is done in three stages:

1) Adding a SIGIO signal handler by calling the
sigaction() system call.

2) Calling fcntl() with F_SETOWN and the pid of our
process to tell the process that it is the owner of the
socket (so that SIGIO signals will be delivered to it).
Several processes can access a socket. If we do not call
fcntl() with F_SETOWN, there can be ambiguity as to which
process will get the SIGIO signal. For example, if we call
fork(), the owner of the SIGIO is the parent; but we can call,
in the son, fcntl(s, F_SETOWN, getpid()).

3) Setting flags: calling fcntl() with F_SETFL and
O_NONBLOCK | FASYNC.

In the SIGIO handler, we call recvfrom().

Example:

    struct sockaddr_in source;
    struct sigaction handler;
    source.sin_family = AF_INET;
    source.sin_port = htons(888);
    source.sin_addr.s_addr = htonl(INADDR_ANY);
    servSocket = socket(AF_INET, SOCK_DGRAM, 0);
    bind(servSocket, (struct sockaddr*)&source, sizeof(struct
         sockaddr_in));

    handler.sa_handler = SIGIOHandler;
    sigfillset(&handler.sa_mask);
    handler.sa_flags = 0;
    sigaction(SIGIO, &handler, 0);
    fcntl(servSocket, F_SETOWN, getpid());
    fcntl(servSocket, F_SETFL, O_NONBLOCK | FASYNC);

The fcntl() which sets the O_NONBLOCK | FASYNC flags
invokes sock_fasync() in net/socket.c to add the socket.

The SIGIOHandler() method will be called when there is
data (since a SIGIO signal was generated); it should call
recvmsg().

Appendix B: Socket options

Socket options by protocol:

IP protocol (SOL_IP), 19 socket options:

    IP_TOS              IP_TTL
    IP_HDRINCL          IP_OPTIONS
    IP_ROUTER_ALERT     IP_RECVOPTS
    IP_RETOPTS          IP_PKTINFO
    IP_PKTOPTIONS       IP_MTU_DISCOVER
    IP_RECVERR          IP_RECVTTL
    IP_RECVTOS          IP_MTU
    IP_FREEBIND         IP_IPSEC_POLICY
    IP_XFRM_POLICY      IP_PASSSEC
    IP_TRANSPARENT

Note: For BSD compatibility there is IP_RECVRETOPTS (which is identical to
IP_RETOPTS).

AF_UNIX:

    SO_PASSCRED for AF_UNIX sockets.

    Note: For historical reasons these socket options are specified with a
    SOL_SOCKET type even though they are PF_UNIX specific.

UDP:

    UDP_CORK (IPPROTO_UDP level).

RAW:

    ICMP_FILTER

TCP:

    TCP_CORK
    TCP_DEFER_ACCEPT
    TCP_INFO
    TCP_KEEPCNT
    TCP_KEEPIDLE
    TCP_KEEPINTVL
    TCP_LINGER2
    TCP_MAXSEG
    TCP_NODELAY
    TCP_QUICKACK
    TCP_SYNCNT
    TCP_WINDOW_CLAMP

AF_PACKET:

    PACKET_ADD_MEMBERSHIP
    PACKET_DROP_MEMBERSHIP

Socket options for socket level:

    SO_DEBUG
    SO_REUSEADDR
    SO_TYPE
    SO_ERROR
    SO_DONTROUTE
    SO_BROADCAST
    SO_SNDBUF
    SO_RCVBUF
    SO_SNDBUFFORCE
    SO_RCVBUFFORCE
    SO_KEEPALIVE
    SO_OOBINLINE
    SO_NO_CHECK
    SO_PRIORITY
    SO_LINGER
    SO_BSDCOMPAT

Appendix C: tcp client

    #include <fcntl.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    int main()
    {
        struct sockaddr_in sa;
        int sd = socket(PF_INET, SOCK_STREAM, 0);
        if (sd < 0)
            printf("error");
        memset(&sa, 0, sizeof(struct sockaddr_in));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(853);
        inet_aton("192.168.0.121", &sa.sin_addr);
        if (connect(sd, (struct sockaddr*)&sa, sizeof(sa)) < 0) {
            perror("connect");
            exit(0);
        }
        close(sd);
    }

tcp client - contd

If on the other side (192.168.0.121 in this example) there is no
TCP server listening on this port (853), you will get this error for
the connect() system call:

    connect: Connection refused

You can send data on this socket by adding, for example:

    const char *message = "mymessage";
    int length;
    length = strlen(message) + 1;
    res = write(sd, message, length);

write() is the same as send(), but with no flags.

Appendix D: ICMP options

These are ICMP options you can set with
setsockopt on a RAW ICMP socket: (see
/usr/include/netinet/ip_icmp.h)

    ICMP_ECHOREPLY
    ICMP_DEST_UNREACH
    ICMP_SOURCE_QUENCH
    ICMP_REDIRECT
    ICMP_ECHO
    ICMP_TIME_EXCEEDED
    ICMP_PARAMETERPROB
    ICMP_TIMESTAMP
    ICMP_TIMESTAMPREPLY
    ICMP_INFO_REQUEST
    ICMP_INFO_REPLY
    ICMP_ADDRESS
    ICMP_ADDRESSREPLY

APPENDIX E: flags for send/receive

    MSG_OOB
    MSG_PEEK
    MSG_DONTROUTE
    MSG_TRYHARD - Synonym for MSG_DONTROUTE for DECnet
    MSG_CTRUNC
    MSG_PROBE - Do not send. Only probe path, e.g. for MTU
    MSG_TRUNC
    MSG_DONTWAIT - Nonblocking io
    MSG_EOR - End of record
    MSG_WAITALL - Wait for a full request
    MSG_FIN
    MSG_SYN
    MSG_CONFIRM - Confirm path validity
    MSG_RST
    MSG_ERRQUEUE - Fetch message from error queue
    MSG_NOSIGNAL - Do not generate SIGPIPE
    MSG_MORE 0x8000 - Sender will send more

Example: set and get an option

This simple example demonstrates how to set and get an IP layer option:

    #include <stdio.h>
    #include <arpa/inet.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <string.h>
    int main()
    {
        int s;
        int opt;
        int res;
        int one = 1;
        int size = sizeof(opt);

        s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0)
            perror("socket");
        res = setsockopt(s, SOL_IP, IP_RECVERR, &one, sizeof(one));
        if (res == -1)
            perror("setsockopt");
        res = getsockopt(s, SOL_IP, IP_RECVERR, &opt, &size);
        if (res == -1)
            perror("getsockopt");
        printf("opt = %d\n", opt);
        close(s);
    }

Example: record route option

This example shows how to send a record route
option.

    #define NROUTES 9
    int main()
    {
        int s;
        int optlen = 0;
        struct sockaddr_in target;
        int res;

        char rspace[3+4*NROUTES+1];
        char buf[16];   /* room for "message 1:" plus the terminating NUL */
        target.sin_family = AF_INET;
        target.sin_port = htons(999);
        inet_aton("194.90.1.5", &target.sin_addr);
        strcpy(buf, "message 1:");
        s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0)
            perror("socket");
        memset(rspace, 0, sizeof(rspace));
        rspace[0] = IPOPT_NOP;
        rspace[1+IPOPT_OPTVAL] = IPOPT_RR;
        rspace[1+IPOPT_OLEN] = sizeof(rspace)-1;
        rspace[1+IPOPT_OFFSET] = IPOPT_MINOFF;
        optlen = 40;
        if (setsockopt(s, IPPROTO_IP, IP_OPTIONS, rspace,
                       sizeof(rspace)) < 0)
        {
            perror("record route\n");
            exit(2);
        }
        ...
    }

APPENDIX F: UDP errors

Running:

    cat /proc/net/snmp | grep Udp:

will give something like:

    Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
    Udp: 2625 1 0 2100 0 0

InErrors - (UDP_MIB_INERRORS)

RcvbufErrors - (UDP_MIB_RCVBUFERRORS):

    Incremented in __udp_queue_rcv_skb() (net/ipv4/udp.c).

SndbufErrors - (UDP_MIB_SNDBUFERRORS):

    Incremented in udp_sendmsg().

Another metric:

    cat /proc/net/udp

    The last column is: drops

    It represents sk->sk_drops.

    Incremented in __udp_queue_rcv_skb()
    (net/ipv4/udp.c).

When do RcvbufErrors occur?

The total number of bytes queued in the sk_receive_queue
queue of a socket is sk->sk_rmem_alloc.

The total allowed memory of a socket is sk->sk_rcvbuf.

It can be retrieved with getsockopt() using SO_RCVBUF.
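
A minimal sketch of reading and changing this limit from user space follows. The helper names get_rcvbuf() and set_rcvbuf() are invented for this example. As far as I know, Linux doubles the value passed to setsockopt() (to leave room for bookkeeping overhead) and caps the request at the rmem_max sysctl, so the value read back is usually not the value that was requested.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Returns the receive buffer limit of fd as reported by the kernel,
 * or -1 on error. */
int get_rcvbuf(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) < 0)
        return -1;
    return size;
}

/* Requests a new receive buffer size and returns what the kernel
 * actually set, or -1 on error. */
int set_rcvbuf(int fd, int requested)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;
    return get_rcvbuf(fd);
}
```

A server that sees RcvbufErrors growing can call set_rcvbuf() with a larger value right after creating the socket, and compare the returned value with what it asked for.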


Each time a packet is received, sk->sk_rmem_alloc
is incremented by skb->truesize:

    skb->truesize is the size (in bytes) allocated for the data of
    the skb plus the size of the sk_buff structure itself.

    This incrementation is done in skb_set_owner_r():

        atomic_add(skb->truesize, &sk->sk_rmem_alloc);

    see: include/net/sock.h

When the packet is freed by kfree_skb(), we decrement
sk->sk_rmem_alloc by skb->truesize; this is done in
sock_rfree():

    sock_rfree()
        atomic_sub(skb->truesize, &sk->sk_rmem_alloc);

Immediately in the beginning of sock_queue_rcv_skb(), we
have this check:

    if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
        (unsigned)sk->sk_rcvbuf) {
        err = -ENOMEM;

When returning -ENOMEM, this notifies the
caller to drop the packet.

This is done in the __udp_queue_rcv_skb() method:

    static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
    {
        ...
        if ((rc = sock_queue_rcv_skb(sk, skb)) < 0) {
            /* Note that an ENOMEM error is charged twice */
            if (rc == -ENOMEM) {
                UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,
                                 is_udplite);
                atomic_inc(&sk->sk_drops);

The default size of sk->sk_rcvbuf is SK_RMEM_MAX
(the initial value of sysctl_rmem_default).

It equals

    (sizeof(struct sk_buff) + 256) * 256

See: SK_RMEM_MAX definition in
net/core/sock.c

This can be viewed and modified by:

    the /proc/sys/net/core/rmem_default entry.

    getsockopt()/setsockopt() with SO_RCVBUF.

For the send queue (sk_write_queue), we have in
ip_append_data() a call to sock_alloc_send_skb(), which
eventually invokes sock_alloc_send_pskb().

In sock_alloc_send_pskb(), we perform this check:

    ...
    if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf)
    ...

If it is true, everything is fine.

If not, we end with setting the SOCK_ASYNC_NOSPACE and
SOCK_NOSPACE flags of the socket:

    set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
    set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);

In udp_sendmsg(), we check the SOCK_NOSPACE flag. If it is
set, we increment the UDP_MIB_SNDBUFERRORS counter.

sock_alloc_send_pskb() calls skb_set_owner_w().

In skb_set_owner_w(), we have:

    ...
    atomic_add(skb->truesize, &sk->sk_wmem_alloc);
    ...

When the packet is freed by kfree_skb(), we decrement
sk_wmem_alloc, in the sock_wfree() method:

    sock_wfree()
    ...
    atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
    ...

Tips

To find out the sockets used by a process:

    ls -l /proc/[pid]/fd|grep socket|cut -d: -f3|sed 's/\[//;s/\]//'

The number returned is the inode number of the socket.

Information about these sockets can be obtained from:

    netstat -ae

After starting a process which creates a socket, you can see
that the inode cache was incremented by one by:

    more /proc/slabinfo | grep sock

    sock_inode_cache 476 485 768 5 1 : tunables 0 0
    0 : slabdata 97 97 0

The first number, 476, is the number of active objects.

END

Thank you!

ramirose@gmail.com

Linux Kernel Networking
advanced topics (6)
Sockets in the kernel
Rami Rosen
ramirose@gmail.com
Haifux, August 2009
www.haifux.org
All rights reserved.

Note:
This lecture is a sequel to the following 5 lectures
I gave in Haifux:

1) Linux Kernel Networking lecture
http://www.haifux.org/lectures/172/
slides: http://www.haifux.org/lectures/172/netLec.pdf

2) Advanced Linux Kernel Networking -
Neighboring Subsystem and IPSec lecture
http://www.haifux.org/lectures/180/
slides: http://www.haifux.org/lectures/180/netLec2.pdf

3) Advanced Linux Kernel Networking -
IPv6 in the Linux Kernel lecture
http://www.haifux.org/lectures/187/
slides: http://www.haifux.org/lectures/187/netLec3.pdf

4) Wireless in Linux
http://www.haifux.org/lectures/206/
slides: http://www.haifux.org/lectures/206/wirelessLec.pdf

5) Sockets in the Linux Kernel
http://www.haifux.org/lectures/217/
slides: http://www.haifux.org/lectures/217/netLec5.pdf

Note

Note: This is the second part of the "Sockets in
the Linux Kernel" lecture which was given in
Haifux in 2009. You may find some
background material for this lecture in its slides:
http://www.haifux.org/lectures/217/netLec5.pdf

TOC

TOC:
    RAW Sockets
    UNIX Domain Sockets
    Netlink sockets
    SCTP sockets
    Appendices

Note: All code examples in this lecture refer to
the recent 2.6.30 version of the Linux kernel.

RAW Sockets

There are cases when there is no interface to create
sockets of a certain protocol (ICMP protocol, NETLINK
protocol) => use Raw sockets.

Raw socket creation is done thus, for example:

    sd = socket(AF_INET, SOCK_RAW, 0);
    sd = socket(AF_INET, SOCK_RAW, IPPROTO_UDP);
    sd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));

ETH_P_IP tells to handle all IP packets.

When using the AF_INET family, as in the first two cases, the
socket is added to the kernel RAW sockets hash table (the hash
key is the protocol number). This is done by raw_hash_sk()
(net/ipv4/raw.c), which is invoked by inet_create(), when
creating the socket.

When using the AF_PACKET family, as in the third case, a socket
is not added to the kernel RAW sockets hash table.

See Appendix F for an example of using a packet raw socket.

Raw socket creation MUST be done as a super
user.

In case an ordinary user tries to create a raw socket,
you will get:

    "error: socket: Operation not permitted" (EPERM)

You can set the CAP_NET_RAW capability to enable
non root users to create raw sockets:

    setcap cap_net_raw=+ep rawserver

Usage of RAW socket: ping

You do not specify ports with RAW sockets; RAW
sockets do not work with ports.

When the kernel receives a raw packet, it
delivers it to all raw sockets.

Ping in fact is sending an ICMP packet.

The type of this ICMP packet is ICMP ECHO
REQUEST.

Send a ping
implementation (simplified)

    #define BUFSIZE 1500
    char sendbuf[BUFSIZE];
    struct icmp *icmp;
    int sockfd;
    struct sockaddr_in target;
    int datalen = 56;
    target.sin_family = AF_INET;
    inet_aton("192.168.0.121", &target.sin_addr);
    icmp = (struct icmp *)sendbuf;
    icmp->icmp_type = ICMP_ECHO;
    icmp->icmp_code = 0;
    icmp->icmp_id = getpid();

    memset(icmp->icmp_data, 0xa5, datalen);
    icmp->icmp_cksum = 0;
    sockfd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
    res = sendto(sockfd, sendbuf, len, 0, (struct sockaddr*)&target,
                 sizeof(struct sockaddr_in));

- Missing here are the sequence number and the checksum computation.

- The default number of data bytes to be sent is 56; the ICMP
header is 8 bytes. So we get 64 bytes (or 84 bytes, if we include
the IP header of 20 bytes).
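
The checksum step that the simplified code leaves out is the standard Internet checksum (RFC 1071): a one's complement sum of 16-bit words, complemented at the end. The following is a minimal sketch of it, not the code of any particular ping implementation; the name inet_checksum() is invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071) over len bytes. The data is read as
 * big-endian 16-bit words; the result is a host-order value that should be
 * stored in the header with htons(). Intended for packet-sized buffers. */
uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len == 1)                       /* odd trailing byte, padded with zero */
        sum += (uint32_t)(p[0] << 8);
    while (sum >> 16)                   /* fold the carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

In the ping code above it could be used, with the checksum field still set to 0, as icmp->icmp_cksum = htons(inet_checksum(sendbuf, 8 + datalen)). A receiver can run the same function over the whole ICMP message; a result of 0 means the checksum is valid.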

Receive a ping -
implementation (simplified)

    __u8 *buf;
    char addrbuf[128];
    struct iovec iov;
    struct iphdr *iphdr;
    int sockfd;
    struct icmphdr *icmphdr;
    char recvbuf[BUFSIZE];
    char controlbuf[BUFSIZE];
    struct msghdr msg;
    sockfd = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);

    iov.iov_base = recvbuf;
    iov.iov_len = sizeof(recvbuf);
    memset(&msg, 0, sizeof(msg));
    msg.msg_name = addrbuf;
    msg.msg_namelen = sizeof(addrbuf);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = controlbuf;
    msg.msg_controllen = sizeof(controlbuf);
    n = recvmsg(sockfd, &msg, 0);

    buf = msg.msg_iov->iov_base;
    iphdr = (struct iphdr*)buf;
    icmphdr = (struct icmphdr*)(buf + (iphdr->ihl*4));
    if (icmphdr->type == ICMP_ECHOREPLY)
        printf("ICMP_ECHOREPLY\n");
    if (icmphdr->type == ICMP_DEST_UNREACH)
        printf("ICMP_DEST_UNREACH\n");

*he onl6 S8LDR$: option a Raw socket can get
is -9?>DC-L*@R
*his can 7e done thus)
Hde"ine -9?>DC-L*@R /
struct icmpD"ilter R
DDu2% dataE
SE
"iltdata A /TT.*!23-S,2+N)-A./E
res A setsockopt(sock"d# S8LDR$:# .*!24L,-)#
(charM)O"ilt# siPeo"("ilt))E

Adding this code in the receive Ping application
above will prevent Destination Unreachable
ICMP messages from being received in user space
by recvmsg.
There are quite a lot more ICMP options; by
default, we do NOT filter any ICMP messages.
Among the other options you can set by
setsockopt are:
ICMP_ECHO (echo request)
ICMP_ECHOREPLY (echo reply)

ICMP_TIME_EXCEEDED
And more (see Appendix D for a full list).
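
Since the filter is a 32-bit mask with one bit per ICMP type (a set bit means "drop this type"), several types can be combined. A minimal sketch, with hypothetical helper names icmp_filter_block() and icmp_filter_drops(), and the type constants taken from netinet/ip_icmp.h:

```c
#include <assert.h>
#include <stdint.h>
#include <netinet/ip_icmp.h>

/* Add one ICMP type to a filter mask; a set bit means the type is dropped.
 * Both helper names are hypothetical. */
uint32_t icmp_filter_block(uint32_t mask, unsigned icmp_type)
{
    assert(icmp_type < 32);             /* the mask only covers types 0..31 */
    return mask | (UINT32_C(1) << icmp_type);
}

/* Return 1 if the mask drops the given type, 0 otherwise. */
int icmp_filter_drops(uint32_t mask, unsigned icmp_type)
{
    assert(icmp_type < 32);
    return (int)((mask >> icmp_type) & 1);
}
```

The resulting value would go into the data member of struct icmp_filter before calling setsockopt().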
Traceroute also uses raw sockets.
Traceroute changes the TTL field in the ip header.
This is done by IP_TTL and control messages in
current Linux traceroute implementation (Dmitry
Butskoy).
In the original traceroute (by Van Jacobson) it was
done with the IP_HDRINCL socket option:
(setsockopt(sndsock, IPPROTO_IP, IP_HDRINCL, ...)
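
Setting the TTL with IP_TTL does not need a raw socket. The sketch below is not taken from any traceroute source; the helper names udp_socket_with_ttl() and socket_ttl() are hypothetical. It sets the TTL on a UDP socket and reads it back.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP socket whose outgoing packets carry the given TTL.
 * Returns the descriptor, or -1 on error. Hypothetical helper. */
int udp_socket_with_ttl(int ttl)
{
    int fd;

    assert(ttl >= 1 && ttl <= 255);
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read the TTL back from the socket; returns -1 on error. */
int socket_ttl(int fd)
{
    int ttl = -1;
    socklen_t len = sizeof(ttl);

    if (getsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, &len) < 0)
        return -1;
    return ttl;
}
```

A traceroute-like loop would raise the TTL by one per probe and watch for ICMP_TIME_EXCEEDED replies.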

The IP_HDRINCL tells the IP layer not to prepare an IP header
when sending a packet.
IP_HDRINCL is also applicable in IPV6.
When receiving a packet, the IP header is always included in the
packet.
When sending a packet, by specifying the IP_HDRINCL
option you tell the kernel that the IP header is already included
in the packet, so no need to prepare it in the kernel.
raw_send_hdrinc() in net/ipv4/raw.c
The IP_HDRINCL option is applied only to the SOCK_RAW
type of protocol.
See Lawrence Berkeley National Laboratory traceroute:
ftp://ftp.ee.lbl.gov/traceroute.tar.gz

If a raw socket was created with protocol type of
IPPROTO_RAW, this implies enabling IP_HDRINCL:
Thus, this call from user space:
socket(AF_INET,SOCK_RAW,IPPROTO_RAW)
invokes this code in the kernel:
if (SOCK_RAW == sock->type) {
inet->num = protocol;
if (IPPROTO_RAW == protocol)
inet->hdrincl = 1;
}

(From inet_create(), net/ipv4/af_inet.c)



Spoofing attack: setting the IP address of
packets to be different than the real ones.
UDP spoofing is easier since UDP is
connectionless.
Following is an example of UDP spoofing with
raw sockets and IP_HDRINCL option:
We build an IP header.
We set the protocol field in this ip header to
IPPROTO_UDP.
We build a UDP header.
Note: when behind a NAT, this probably will not work.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/udp.h>
#include <arpa/inet.h>
unsigned short in_cksum(unsigned short *addr, int len);
int main(int argc, char **argv)
{
struct iphdr ip;
struct udphdr udp;
int sd;
const int on = 1;
struct sockaddr_in sin;
int res;
u_char *packet;
packet = (u_char *)malloc(60);

ip.ihl = 0x5;
ip.version = 0x4;
ip.tos = 0x0;
ip.tot_len = 60;
ip.id = htons(12830);
ip.frag_off = 0x0;
ip.ttl = 64;
ip.protocol = IPPROTO_UDP;
ip.check = 0x0;
ip.saddr = inet_addr("192.168.0.199"); /* the spoofed source address */
ip.daddr = inet_addr("76.125.43.103");

memcpy(packet, &ip, sizeof(ip));
udp.source = htons(45512);
udp.dest = htons(999);
udp.len = htons(10); /* UDP header (8 bytes) + 2 bytes of payload */
udp.check = 0;
memcpy(packet + 20, &udp, sizeof(udp));
memcpy(packet + 28,"ab",2);
if ((sd = socket(AF_INET, SOCK_RAW, 0)) < 0) {
perror("raw socket");
exit(1);
}

if (setsockopt(sd, IPPROTO_IP, IP_HDRINCL, &on, sizeof(on)) < 0) {
perror("setsockopt");
exit(1);
}
memset(&sin, 0, sizeof(sin));
sin.sin_family = AF_INET;
sin.sin_addr.s_addr = ip.daddr;
res=sendto(sd, packet, 60, 0, (struct sockaddr *)&sin, sizeof(struct sockaddr) );
if (res<0)
perror("sendto");
else
printf("ok %d bytes sent\n",res);
}

Note: what will happen if we specify an illegal
source address, like "255.255.255.255"?
The packet will be sent.
If we want to log such packets on the receiver side
(to detect spoofing attempts), we must set the
log_martians kernel tunable thus:
echo "1" > /proc/sys/net/ipv4/conf/all/log_martians
Then we will see in the kernel syslog messages like
this:
martian source 82.80.80.193 from 255.255.255.255, on
dev eth0
Following will be the ethernet header:
ll header:

Raw sockets and sniffers
When you activate tshark (formerly tethereal) or
wireshark or tcpdump, you call the
pcap_open_live() method of the pcap library.
This method creates a raw socket thus:
socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL))
pcap_open_live() is implemented in libpcap-0.9.8/pcap-linux.c
PF_PACKET sockets work with the network interface card.

Note:
When you open tshark thus:
tshark -i any
Then the socket is opened thus:
socket(PF_PACKET, SOCK_DGRAM,
htons(ETH_P_ALL))
This is called "cooked mode"
SLL (Socket Link Layer).
With SOCK_DGRAM, the kernel is responsible for adding
ethernet header (when sending a packet) or removing
ethernet header (when receiving a packet).

With SOCK_RAW, the application is responsible for adding an
ethernet header when sending.
Also you will get this message:
"Capturing on Pseudo-device that captures on all interfaces"
tshark: Promiscuous mode not supported on the "any" device.

Unix Domain Sockets
AF_UNIX / PF_UNIX / AF_LOCAL / PF_LOCAL
A way for interprocess communication (IPC);
the client and server are on the same host.
AF_UNIX sockets can be either SOCK_STREAM
or SOCK_DGRAM.
And, since kernel 2.6.4, also SOCK_SEQPACKET.
Usage: in rsyslogd (AF_UNIX/SOCK_DGRAM)
and udev daemons (AF_LOCAL/
SOCK_DGRAM), hald, crond, and a lot more.

Unix domain sockets do not support the
transmission of out-of-band data.
MSG_OOB is not supported at all in Unix domain
sockets.
This applies for all 3 types,
SOCK_STREAM, SOCK_DGRAM and SOCK_SEQPACKET.

Usually uses files in the local filesystem.
Abstract namespaces.
Why not extend it to use between domains in
virtualization which have access to shared filesystem?
With rsyslogd, the path is under /dev:
ls -al /dev/log
srw-rw-rw- 1 root root 0 01-07-09 13:17 /dev/log
Notice the 's' in the beginning => for socket.
ls -F /dev/log
/dev/log=
(with ls, -F is for appending indicator to entries)

Unix Domain Socket server
Example
int s;
int res;
struct sockaddr_un name;
memset(&name,0,sizeof (name));
name.sun_family = AF_LOCAL;
strcpy(name.sun_path,"/work/test_unix");
s = socket(AF_UNIX, SOCK_STREAM,0);
if (s<0)
perror("socket");
res = bind(s, (struct sockaddr*)&name, SUN_LEN(&name));

Calling bind() in the example above will create a
file named /work/test_unix
ls -al /work/test_unix
srwxr-xr-x
Notice the "s" for socket.
Notice that with DGRAM Unix domain sockets, calling sendto()
without calling bind() before will not call autobind (as
opposed to what happens in udp under the same scenario).
In this case, the receiver cannot reply (because it does not
know to whom).

lsof -U: shows Unix domain sockets.
Also: netstat --unix --all
Tip: use netstat -ax for short.
[ACC] in the third column means that the socket is
unconnected and waiting for connection
(SO_ACCEPTCON).
And also:
cat /proc/net/protocols | grep UNIX
cat /proc/net/unix
struct sockaddr_un (/usr/include/linux/un.h)

The pathname for a Unix domain socket should
be an absolute pathname.
For abstract namespaces:
address.sun_path[0] = 0
The last column of netstat --unix --all is the path.
In case of abstract namespace, it will begin with @:
netstat --unix --all | grep udevd
unix 2 [ ] DGRAM 602
@/org/kernel/udev/udevd
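
Binding to an abstract name can be sketched as below. This is an illustration only; bind_abstract() is a hypothetical helper. The name is placed after the leading zero byte of sun_path, and the address length covers exactly that many bytes, so no file is created in the filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Bind a datagram Unix socket to an abstract-namespace name.
 * Returns the descriptor or -1. bind_abstract is a hypothetical name. */
int bind_abstract(const char *name)
{
    struct sockaddr_un addr;
    size_t n = strlen(name);
    socklen_t len;
    int fd;

    assert(n > 0 && n < sizeof(addr.sun_path) - 1);
    fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path + 1, name, n);   /* sun_path[0] stays 0 */
    len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);
    if (bind(fd, (struct sockaddr *)&addr, len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A socket bound this way should show up in netstat --unix --all with a leading @, and the name disappears when the last descriptor is closed.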

Control messages in Unix domain sockets:
SCM_RIGHTS - You can pass an open file descriptor
from one process to another using Unix domain
socket and control messages (ancillary data).
SCM_CREDENTIALS - for passing process
credentials (uid and gid).
You need to set the SO_PASSCRED socket option with
setsockopt() on the receiving side.
SCM stands for: Socket Control Message, and not
Software configuration management :-)
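
Passing a descriptor with SCM_RIGHTS can be sketched as follows. The helper names send_fd() and recv_fd() are hypothetical; one data byte is sent along with the control message, since ancillary data travels together with normal data.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor fd over the Unix domain socket sock. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 'F';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } ctl;
    struct msghdr msg;
    struct cmsghdr *c;

    assert(fd >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor from sock. Returns the new descriptor or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } ctl;
    struct msghdr msg;
    struct cmsghdr *c;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
            memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

The received descriptor is a new number in the receiving process, but it refers to the same open file as the one that was sent.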

These credentials are passed via a cred struct in
a control message:
kernel: in include/linux/socket.h:
struct ucred {
__u32 pid; /* process ID of the sending process */
__u32 uid; /* user ID of the sending process */
__u32 gid; /* group ID of the sending process */
};
For user space apps, it is in /usr/include/bits/socket.h

Unix domain client example
const char* const socket_name = "/tmp/server";
int socket_fd;
int res;
struct sockaddr_un remote;
socket_fd = socket(PF_LOCAL, SOCK_STREAM, 0);
memset(&remote, 0, sizeof(remote));
remote.sun_family = AF_LOCAL;
strcpy(remote.sun_path, socket_name);
res = connect(socket_fd, (struct sockaddr*)&remote, SUN_LEN(&remote));
if (res<0)
perror("connect");
res = sendto(socket_fd,"aaa", 3, 0, (struct sockaddr*)&remote, sizeof(remote));

If we will try to call send() in a stream-oriented
socket after the stream-oriented server was
closed, we will get EPIPE error:
send: Broken pipe
The kernel also sends the user space a SIGPIPE
signal in this case.
In case the flags parameter in send() is
MSG_NOSIGNAL, the kernel does NOT send a
SIGPIPE signal.
In BSD, you can avoid signals by setsockopt()
with SO_NOSIGPIPE (SOL_SOCKET option).
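
The EPIPE case can be reproduced without a separate server by using a connected stream pair and closing one end. A minimal sketch; send_after_peer_close() is a hypothetical helper, and MSG_NOSIGNAL is used so that the process is not killed by SIGPIPE.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Close one end of a connected stream pair, then send on the other.
 * Returns the errno of the failed send(), 0 if send() unexpectedly
 * succeeded, or -1 if the pair could not be created. */
int send_after_peer_close(void)
{
    int sv[2];
    ssize_t n;
    int err;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    assert(sv[0] >= 0 && sv[1] >= 0);
    close(sv[1]);                               /* the "server" goes away */
    n = send(sv[0], "aaa", 3, MSG_NOSIGNAL);    /* no SIGPIPE, only errno */
    err = (n < 0) ? errno : 0;
    close(sv[0]);
    return err;
}
```

Without MSG_NOSIGNAL the same send() would also raise SIGPIPE, whose default action terminates the process.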

In IPV4, the only signal used is SIGURG for OOB
in tcp.
In case of datagram-oriented sockets, SIGPIPE
is not sent; we just get connection refused
error.

If, in the above example, we tried to create a
dgram client instead of stream client, thus:
socket_fd = socket(PF_LOCAL, SOCK_DGRAM, 0);
We would get:
connect: Protocol wrong type for socket (EPROTOTYPE)
see: unix_find_other()
The socketpair() system call:
Creates a pair of connected sockets.
On Linux, the only supported domain for this call is AF_UNIX
(or synonymously, AF_LOCAL).
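
A minimal socketpair() sketch: what is written to one descriptor is read from the other. The helper name socketpair_echo() is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write msg to one end of a socketpair and read it back from the other.
 * Returns the number of bytes read, or -1 on error. */
ssize_t socketpair_echo(const char *msg, char *out, size_t outlen)
{
    int sv[2];
    size_t n = strlen(msg);
    ssize_t got = -1;

    assert(out != NULL && outlen > n);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (write(sv[0], msg, n) == (ssize_t)n) {
        got = read(sv[1], out, outlen - 1);
        if (got >= 0)
            out[got] = '\0';                /* make the result a C string */
    }
    close(sv[0]);
    close(sv[1]);
    return got;
}
```

In practice the pair is usually created before fork(), so that parent and child each keep one end.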

Netlink sockets
Netlink sockets: a message mechanism from
user-space to kernel and also between kernel
ingredients.
Used widely in the kernel; mostly in networking,
but also in other subsystems.
There are other mechanisms for communication from
user space to kernel:
ioctls (drivers)
/proc or /sys entries (VFS)
And there are of course signals from kernel to user
space (like SIGIO, and more).

Creating netlink sockets is done (in the kernel) by
netlink_kernel_create().
For example, in net/core/rtnetlink.c:
static int rtnetlink_net_init(struct net *net)
{
struct sock *sk;
sk = netlink_kernel_create(net, NETLINK_ROUTE,
RTNLGRP_MAX, rtnetlink_rcv, &rtnl_mutex,
THIS_MODULE);

With generic netlink sockets, this is done using
the NETLINK_GENERIC protocol thus:
netlink_kernel_create(&init_net, NETLINK_GENERIC, 0,
genl_rcv, &genl_mutex, THIS_MODULE);
See net/netlink/genetlink.c

The second parameter, NETLINK_ROUTE, is the
protocol. (kernel 2.6.x).
There are currently 19 netlink protocols in the kernel:
NETLINK_ROUTE NETLINK_UNUSED NETLINK_USERSOCK
NETLINK_FIREWALL NETLINK_INET_DIAG NETLINK_NFLOG
NETLINK_XFRM NETLINK_SELINUX NETLINK_ISCSI
NETLINK_AUDIT NETLINK_FIB_LOOKUP NETLINK_CONNECTOR
NETLINK_NETFILTER NETLINK_IP6_FW NETLINK_DNRTMSG
NETLINK_KOBJECT_UEVENT NETLINK_GENERIC NETLINK_SCSITRANSPORT
NETLINK_ECRYPTFS
(see include/linux/netlink.h).

The fourth parameter, rtnetlink_rcv, is the handler
for netlink packets.
rtnetlink_rcv() gets a packet (sk_buff) as its parameter.
NETLINK_ROUTE messages are not confined to the
routing subsystem; they include also other types of
messages (for example, neighboring).
NETLINK_ROUTE messages can be divided into
families. Most of these families have three types of
messages (New, Del and Get).

For example:
RTM_NEWROUTE - create a new route.
Handled by inet_rtm_newroute().
RTM_DELROUTE - delete a route.
Handled by inet_rtm_delroute().
RTM_GETROUTE - retrieve information about a
route.
Handled by inet_dump_fib().
All three methods are in net/ipv4/fib_frontend.c.

Another family of NETLINK_ROUTE is the NEIGH family:
RTM_NEWNEIGH
RTM_DELNEIGH
RTM_GETNEIGH

How do these messages reach these handlers?
Registration is done by calling rtnl_register()
in ip_fib_init():
rtnl_register(PF_INET, RTM_NEWROUTE,
inet_rtm_newroute, NULL);
rtnl_register(PF_INET, RTM_DELROUTE,
inet_rtm_delroute, NULL);
rtnl_register(PF_INET, RTM_GETROUTE, NULL,
inet_dump_fib);

IPROUTE2 package is based on rtnetlink.
(IPROUTE2 is "ip" with subcommands, for
example: ip route show to show the routing
tables)
IPROUTE2 uses the libnetlink library.
See libnetlink.h (in the IPROUTE2 library).
rtnl_open() to open a socket in user space.
rtnl_send() to send a message to the kernel.

rtnl_open() calls the socket() system call to
create an rtnetlink socket:
socket(AF_NETLINK, SOCK_RAW, protocol);
rtnl_listen() starts receiving messages by calling the recvmsg()
system call.
The AF_NETLINK protocol is implemented in
net/netlink/af_netlink.c.
AF_ROUTE is a synonym of AF_NETLINK (due to BSD)
#define AF_ROUTE AF_NETLINK (include/linux/socket.h)
The kernel holds an array called nl_table; it has up to
32 elements. (MAX_LINKS).
Each element in this table corresponds to a protocol
(in fact, the protocol is the index)

Example
#include "libnetlink.h"
int accept_msg(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) {
if (n->nlmsg_type == RTM_NEWROUTE)
printf("got RTM_NEWROUTE message \n");
return 0; /* 0 tells rtnl_listen() to keep listening */
}
int main() {
int res;
struct rtnl_handle rth;
unsigned int groups = ~RTMGRP_TC | RTNLGRP_IPV4_ROUTE;
if (rtnl_open(&rth,groups) < 0) {
printf("rtnl_open() failed in %s %s\n",__FUNCTION__,__FILE__);
return -1;
}

if (rtnl_listen(&rth,accept_msg, stdout)<0) {
printf("failed in rtnl_listen()\n");
return -1;
}
}
Adding a route will be logged to stdout:
ip route add 10.0.0.10 via 10.10.10.11
will print:
got RTM_NEWROUTE message
- In this case, the rtnl_open() invokes
socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
- The example can be expanded also for RTM_DELROUTE, etc.
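
Without libnetlink, a request to the kernel is just a struct nlmsghdr followed by a family-specific header. The sketch below only builds an RTM_GETROUTE dump request in memory; struct route_dump_req and build_route_dump() are hypothetical names, and the constants come from linux/netlink.h and linux/rtnetlink.h.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* A netlink header followed by the rtnetlink route header.
 * Hypothetical struct name, used only for this illustration. */
struct route_dump_req {
    struct nlmsghdr nlh;
    struct rtmsg rtm;
};

/* Fill req with a "dump all IPv4 routes" request. */
void build_route_dump(struct route_dump_req *req, unsigned int seq)
{
    assert(req != NULL);
    memset(req, 0, sizeof(*req));
    req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
    req->nlh.nlmsg_type = RTM_GETROUTE;
    req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req->nlh.nlmsg_seq = seq;               /* echoed back in the replies */
    req->rtm.rtm_family = AF_INET;
}
```

Such a request would then be written with send() to a socket opened as socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE), and the replies read with recvmsg().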

Generic Netlink
The iw tools (wireless user space management)
use the Generic Netlink API.
This API is based on Netlink sockets.
You register handlers in nl80211_init()
net/wireless/nl80211.c

For example, for wireless interfaces we have
these messages and handlers:
NL80211_CMD_GET_INTERFACE
Handled by nl80211_dump_interface()
NL80211_CMD_SET_INTERFACE
Handled by nl80211_set_interface()
NL80211_CMD_NEW_INTERFACE
Handled by nl80211_new_interface()
NL80211_CMD_DEL_INTERFACE
Handled by nl80211_del_interface()

In the wireless subsystem there are currently 35
messages, each with its own handler.
See appendix A.

You can use the NETLINK_FIREWALL protocol
for a netlink socket to catch packets in user
space with the help of an iptables kernel
module named ip_queue.ko.
iptables -A OUTPUT -p UDP --dport 9999 -j
NFQUEUE --queue-num 0
The user space application uses
libnetfilter_queue-0.0.17 API (which replaced
the libipq lib).
Netlink sockets usage: xorp (routing daemons,
http://www.xorp.org/), iproute2, iw.

SCTP
General:
Combines features of TCP and UDP.
Reliable (like TCP).
RFC 4960 (obsoletes RFC 2960)
Target: VoIP, telecommunications.
People:
Randall Stewart (Cisco): co-inventor, FreeBSD
Peter Lei (Cisco)
Michael Tuxen (MacOS)

Linux Kernel SCTP Maintainers:
Vlad Yasevich (HP)
Sridhar Samudrala (IBM)
SCTP support in the Linux kernel tree is from
versions 2.5.36 and following.
Location in the kernel tree: net/sctp.

SCTP
There are two types of SCTP sockets:
One to one socket
socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP)
Much like TCP connection.
One to many socket
socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP)
- Much like UDP server with many clients.
See for example, here:
http://heim.ifi.uio.no/michawe/teaching/dipls/stefan_joerer.pdf

You need to have lksctp-tools to use SCTP in
userspace applications.
http://lksctp.sourceforge.net
In fedora,
lksctp-tools rpm
lksctp-tools-devel rpm (for /usr/include/netinet/sctp.h)

Future lectures
Netfilter kernel implementation:
NAT and connection tracking; dnat, snat.
MASQUERADING.
Filter and mangle tables.
Netfilter verdicts.
The new generation: nftables
Network namespaces (Containers / OpenVZ)
DCCP
Virtio

IPVS/LVS (Linux Virtual Server)
Bluetooth, RFCOMM
Multiqueues
LRO (Large Receive Offload)
Multicasting
TCP protocol

Appendix A : wireless messages
NL80211_CMD_GET_WIPHY, NL80211_CMD_SET_WIPHY,
NL80211_CMD_GET_INTERFACE, NL80211_CMD_SET_INTERFACE,
NL80211_CMD_NEW_INTERFACE, NL80211_CMD_DEL_INTERFACE,
NL80211_CMD_GET_KEY, NL80211_CMD_SET_KEY, NL80211_CMD_NEW_KEY, NL80211_CMD_DEL_KEY,
NL80211_CMD_SET_BEACON, NL80211_CMD_NEW_BEACON, NL80211_CMD_DEL_BEACON,
NL80211_CMD_GET_STATION, NL80211_CMD_SET_STATION, NL80211_CMD_NEW_STATION,
NL80211_CMD_DEL_STATION,
NL80211_CMD_GET_MPATH, NL80211_CMD_SET_MPATH, NL80211_CMD_NEW_MPATH, NL80211_CMD_DEL_MPATH,
NL80211_CMD_SET_BSS, NL80211_CMD_GET_REG,
NL80211_CMD_SET_REG, NL80211_CMD_REQ_SET_REG,
NL80211_CMD_GET_MESH_PARAMS, NL80211_CMD_SET_MESH_PARAMS,
NL80211_CMD_TRIGGER_SCAN, NL80211_CMD_GET_SCAN,
NL80211_CMD_AUTHENTICATE, NL80211_CMD_ASSOCIATE, NL80211_CMD_DEAUTHENTICATE,
NL80211_CMD_DISASSOCIATE,
NL80211_CMD_JOIN_IBSS, NL80211_CMD_LEAVE_IBSS

Appendix B : Socket options
Socket options by protocol:
IP protocol (SOL_IP) 19 socket options:
IP_TOS IP_TTL
IP_HDRINCL IP_OPTIONS
IP_ROUTER_ALERT IP_RECVOPTS
IP_RETOPTS IP_PKTINFO
IP_PKTOPTIONS IP_MTU_DISCOVER
IP_RECVERR IP_RECVTTL
IP_RECVTOS IP_MTU
IP_FREEBIND IP_IPSEC_POLICY
IP_XFRM_POLICY IP_PASSSEC
IP_TRANSPARENT
Note: For BSD compatibility there is IP_RECVRETOPTS (which is identical to
IP_RETOPTS).

AF_UNIX:
SO_PASSCRED for AF_UNIX sockets.
Note: For historical reasons these socket options are specified with a
SOL_SOCKET type even though they are PF_UNIX specific.
UDP:
UDP_CORK (IPPROTO_UDP level).
RAW:
ICMP_FILTER
TCP:
TCP_CORK
TCP_DEFER_ACCEPT
TCP_INFO
TCP_KEEPCNT

TCP_KEEPIDLE
TCP_KEEPINTVL
TCP_LINGER2
TCP_MAXSEG
TCP_NODELAY
TCP_QUICKACK
TCP_SYNCNT
TCP_WINDOW_CLAMP
AF_PACKET
PACKET_ADD_MEMBERSHIP
PACKET_DROP_MEMBERSHIP

Socket options for socket level:
SO_DEBUG
SO_REUSEADDR
SO_TYPE
SO_ERROR
SO_DONTROUTE
SO_BROADCAST
SO_SNDBUF
SO_RCVBUF
SO_SNDBUFFORCE
SO_RCVBUFFORCE
SO_KEEPALIVE
SO_OOBINLINE

SO_NO_CHECK
SO_PRIORITY
SO_LINGER
SO_BSDCOMPAT

Appendix C: tcp client
#include <fcntl.h>
#include <stdlib.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <arpa/inet.h>
int main()
{
struct sockaddr_in sa;
int sd = socket(PF_INET, SOCK_STREAM, 0);
if (sd<0)
printf("error");
memset(&sa, 0, sizeof(struct sockaddr_in));
sa.sin_family = AF_INET;
sa.sin_port = htons(853);
inet_aton("192.168.0.121",&sa.sin_addr);
if (connect(sd, (struct sockaddr*)&sa, sizeof(sa))<0) {
perror("connect");
exit(0);
}
close(sd);
}

tcp client - contd.
If on the other side (192.168.0.121 in this example) there is no
TCP server listening on this port (853) you will get this error for
the connect() system call:
connect: Connection refused
You can send data on this socket by adding, for example:
const char *message = "mymessage";
int length;
length = strlen(message)+1;
res = write(sd, message, length);
write() is the same as send(), but with no flags.

Appendix D : ICMP options
These are ICMP options you can set with
setsockopt on RAW ICMP socket: (see
/usr/include/netinet/ip_icmp.h)
ICMP_ECHOREPLY
ICMP_DEST_UNREACH
ICMP_SOURCE_QUENCH
ICMP_REDIRECT
ICMP_ECHO
ICMP_TIME_EXCEEDED
ICMP_PARAMETERPROB
ICMP_TIMESTAMP

ICMP_TIMESTAMPREPLY
ICMP_INFO_REQUEST
ICMP_INFO_REPLY
ICMP_ADDRESS
ICMP_ADDRESSREPLY

APPENDIX E: flags for send/receive
MSG_OOB
MSG_PEEK
MSG_DONTROUTE
MSG_TRYHARD - Synonym for MSG_DONTROUTE for DECnet
MSG_CTRUNC
MSG_PROBE - Do not send. Only probe path f.e. for MTU
MSG_TRUNC
MSG_DONTWAIT - Nonblocking io
MSG_EOR - End of record
MSG_WAITALL - Wait for a full request

MSG_FIN
MSG_SYN
MSG_CONFIRM - Confirm path validity
MSG_RST
MSG_ERRQUEUE - Fetch message from error queue
MSG_NOSIGNAL - Do not generate SIGPIPE
MSG_MORE 0x8000 - Sender will send more

Example: set and get an option
This simple example demonstrates how to set and get an IP layer option:
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <string.h>
int main()
{
int s;
int opt;
int res;
int one = 1;
socklen_t size = sizeof(opt); /* getsockopt() expects a socklen_t length */

s = socket(AF_INET, SOCK_DGRAM, 0);
if (s<0)
perror("socket");
res = setsockopt(s, SOL_IP, IP_RECVERR, &one, sizeof(one));
if (res==-1)
perror("setsockopt");
res = getsockopt(s, SOL_IP, IP_RECVERR,&opt,&size);
if (res==-1)
perror("getsockopt");
printf("opt = %d\n",opt);
close(s);
}

Example: record route option
This example shows how to send a record route
option.
#define NROUTES 9
int main()
{
int s;
int optlen=0;
struct sockaddr_in target;
int res;

char rspace[3+4*NROUTES+1];
char buf[16]; /* large enough for "message 1:" plus the terminating NUL */
target.sin_family = AF_INET;
target.sin_port=htons(999);
inet_aton("194.90.1.5",&target.sin_addr);
strcpy(buf,"message 1:");
s = socket(AF_INET, SOCK_DGRAM, 0);
if (s<0)
perror("socket");
memset(rspace, 0, sizeof(rspace));
rspace[0] = IPOPT_NOP;
rspace[1+IPOPT_OPTVAL] = IPOPT_RR;
rspace[1+IPOPT_OLEN] = sizeof(rspace)-1;

rspace[1+IPOPT_OFFSET] = IPOPT_MINOFF;
optlen=40;
if (setsockopt(s, IPPROTO_IP, IP_OPTIONS, rspace,
sizeof(rspace))<0)
{
perror("record route\n");
exit(2);
}

Appendix F: using packet raw
socket
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
int main()
{
int s;
int n;
char buffer[2048];
unsigned char *iphdr;
unsigned char *ethhdr;
s = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP));
while (1)
{
printf("*******\n");
n = recvfrom(s, buffer, 2048, 0, NULL, NULL);
printf("%d bytes read\n", n);

ethhdr = (unsigned char *)buffer;
/* bytes 0-5 are the destination MAC; the source MAC is in bytes 6-11 */
printf("source MAC address = %02x:%02x:%02x:%02x:%02x:%02x\n",
ethhdr[6],ethhdr[7],ethhdr[8],
ethhdr[9],ethhdr[10],ethhdr[11]);
}
}
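
The layout of the ethernet header that the packet socket delivers can be sketched without opening a socket (which needs root). struct eth_fields, parse_eth() and format_mac() below are hypothetical names; the layout is 6 bytes of destination MAC, 6 bytes of source MAC and a 2-byte big-endian type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fields of an ethernet header; hypothetical struct for illustration. */
struct eth_fields {
    uint8_t dst[6];
    uint8_t src[6];
    uint16_t ethertype;                 /* host byte order */
};

/* Split the first 14 bytes of a frame. Returns 0, or -1 if too short. */
int parse_eth(const uint8_t *frame, size_t len, struct eth_fields *out)
{
    assert(frame != NULL && out != NULL);
    if (len < 14)
        return -1;
    memcpy(out->dst, frame, 6);
    memcpy(out->src, frame + 6, 6);
    out->ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    return 0;
}

/* Format a MAC address as xx:xx:xx:xx:xx:xx (17 chars plus NUL). */
void format_mac(const uint8_t mac[6], char out[18])
{
    snprintf(out, 18, "%02x:%02x:%02x:%02x:%02x:%02x",
             mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
}
```

In the loop above, parse_eth((const uint8_t *)buffer, n, &f) could replace the manual indexing; an ethertype of 0x0800 is IPv4.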

Tips
To find out socket used by a process:
ls -l /proc/[pid]/fd|grep socket|cut -d: -f3|sed 's/\[//;s/\]//'
The number returned is the inode number of the socket.
Information about these sockets can be obtained from
netstat -ae
After starting a process which creates a socket, you can see that
the inode cache was incremented by one by:
more /proc/slabinfo | grep sock
sock_inode_cache 476 485 768 5 1 : tunables 0 0
0 : slabdata 97 97 0
The first number, 476, is the number of active objects.
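
The same lookup can be done from inside a process: each socket descriptor under /proc/self/fd is a symbolic link whose text has the form socket:[inode]. A minimal sketch, assuming /proc is mounted; socket_inode_of_fd() is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the inode number shown in /proc/self/fd/<fd> for a socket,
 * or 0 if the descriptor is not a socket or /proc is unavailable. */
unsigned long socket_inode_of_fd(int fd)
{
    char path[64];
    char link[64];
    unsigned long ino = 0;
    ssize_t n;

    assert(fd >= 0);
    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    n = readlink(path, link, sizeof(link) - 1);
    if (n < 0)
        return 0;
    link[n] = '\0';                     /* readlink() does not terminate */
    if (sscanf(link, "socket:[%lu]", &ino) != 1)
        return 0;
    return ino;
}
```

The number should be the same inode that fstat() reports for the descriptor and that netstat -ae prints.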

END
Thank you!
ramirose@gmail.com
