Linux Network Stack Walkthrough (2.4.20)
The following is a walkthrough of the routines in the Linux 2.4.20 network
stack, focusing on IP networking. I've written this up in an attempt to understand
in some detail the functioning of the Linux networking code. The walkthrough
follows the sequence of function calls that occur for received and locally-generated
network traffic. The analysis walks through each function, giving a short
description of the functionality provided by each successive clause (related
group of lines), and indicating which routines are called from within the
function. The numbers in parentheses after each clause are the corresponding line
numbers in the 2.4.20 source, which can be conveniently inspected using the Linux
Cross Reference website. This document thus provides a sort of annotated reference
to the source code. It is still a work in progress: question marks flag clauses
that need more description, and I hope eventually to add a broader discussion,
with overviews at different levels of detail.
Received traffic:
Core (layer 2, protocol-independent) receive routines:
- net/core/dev.c: netif_rx( )
Receive routine, called by the driver when a frame is received on an interface; enqueues the frame on the current processor's
packet receive queue, and schedules the receive softirq to process it
- updates stats, applies socketbuffer timestamp (1237)
- check processor's softnet_data input packet queue length; if less than maximum (1248):
- if queue length is 0 (before skb added), call netif_rx_schedule(queue->blog_dev) to schedule current processor's softnet_data backlog device to have queued received packets handled (1271)
- the poll method of the "pseudo-dev" queue->blog_dev, associated with the cpu socket queue, is process_backlog( )
- call __skb_queue_tail to enqueue socketbuffer on current processor's input packet queue (1255)
- return the queue's congestion level (queue->cng_level)(?) (1260)
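As a rough illustration, the enqueue decision described above can be modelled in user-space C. This is a toy sketch, not kernel code: the toy_* names, the return codes and the MAX_BACKLOG value are all made up for the example, and the kernel's throttling and congestion-level bookkeeping are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BACKLOG 300  /* stands in for netdev_max_backlog; value is illustrative */

enum { RX_SUCCESS = 0, RX_DROP = 1 };  /* hypothetical return codes */

struct toy_skb {
    struct toy_skb *next;
};

struct toy_softnet {
    struct toy_skb *head, *tail;  /* per-processor input packet queue */
    int qlen;
    int scheduled;                /* set when the backlog device is put on the poll list */
    int dropped;
};

/* Model of the enqueue decision in netif_rx( ). */
int toy_netif_rx(struct toy_softnet *q, struct toy_skb *skb)
{
    assert(q != NULL && skb != NULL);

    if (q->qlen >= MAX_BACKLOG) {  /* queue full: drop the frame */
        q->dropped++;
        return RX_DROP;
    }
    if (q->qlen == 0)              /* queue was empty: schedule the backlog device */
        q->scheduled = 1;          /* stands in for netif_rx_schedule( ) */

    skb->next = NULL;              /* stands in for __skb_queue_tail( ) */
    if (q->tail)
        q->tail->next = skb;
    else
        q->head = skb;
    q->tail = skb;
    q->qlen++;
    return RX_SUCCESS;
}
```

The point of the empty-queue test is that the backlog device only needs to be scheduled once; later frames simply join a queue that is already due to be polled.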
- include/linux/netdevice.h: netif_rx_schedule( )
Called to add a processor's backlog device (blog_dev) to the processor's
receive poll list, and to raise the NET_RX_SOFTIRQ softirq so that the
network code handles the received frames
- call netif_rx_schedule_prep( ) to make sure device is running and has not already been added to poll list (746); if true,
- add device to current processor's softnet_data rx poll list (733)
- adjust dev->quota (?) (734)
- raise softirq NET_RX_SOFTIRQ (738)
- net/core/dev.c: net_rx_action( )
Receive softirq handler, registered for NET_RX_SOFTIRQ. Runs through the
current processor's receive poll list, which is a list of "backlog devices"
associated with the processor's packet input queues, and calls each backlog
device's poll method. The poll method of each backlog device is
process_backlog( ), assigned during system initialization in net_dev_init( );
this routine dequeues packets and calls netif_receive_skb( ) for each.
- set budget for handling received traffic to netdev_max_backlog (1563)
- loop through current processor's softnet_data poll list, consisting
of blog_dev's of processor packet input queues (1568): for each device on
the list
- if budget for handling network receive traffic is exhausted, or 1 jiffy has elapsed, reschedule this softirq and exit (1571)
- if device quota exhausted, put device back on end of queue, continue (1578)
- calls dev->poll( ), which equals process_backlog( ) (assigned for each backlog_dev in net_dev_init( )) (1578)
- if dev->poll returns non-zero, indicating packets remaining to be handled, put device back on end of queue, adjust device quota(?), continue (1579 - 1585)
Question: why loop through the poll list? It seems that only one item could
ever be on the list, namely the processor's softnet_data blog_dev (???)
- net/core/dev.c: process_backlog( )
Assigned as the poll method of each cpu's socket queue's backlog device
(blog_dev) in net_dev_init( ); the backlog device is added to the poll list
(if not already present) whenever netif_rx( ) is called. This routine is
called from within the net_rx_action( ) receive softirq routine, and in turn
dequeues packets and passes them for further processing to netif_receive_skb( ).
- loops, dequeuing packets from processor's input queue until queue empty, quota exceeded, or 1 jiffy passes
- calls netif_receive_skb( ) for each dequeued packet (1504)
- adjust value of budget parameter passed in, and device quota, decrementing both by number of packets dequeued (1536, 1542)
- if queue empty, remove dev from poll list, return 0 (1544)
- if queue not empty, return -1 (1538)
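The quota and budget accounting above can be sketched as a small user-space model. This is only an approximation of process_backlog( ) under stated assumptions: packets are reduced to a counter, the one-jiffy time limit is omitted, and the toy_* names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

struct toy_queue {
    int qlen;          /* packets waiting on the input queue */
    int on_poll_list;  /* 1 while the backlog device is on the poll list */
    int quota;         /* per-device quota */
    int delivered;     /* packets handed to the next layer */
};

/*
 * Returns 0 when the queue was found empty (device leaves the poll list),
 * -1 when the work limit was reached first. Both *budget and q->quota are
 * reduced by the number of packets dequeued.
 */
int toy_process_backlog(struct toy_queue *q, int *budget)
{
    assert(q != NULL && budget != NULL);

    int limit = q->quota < *budget ? q->quota : *budget;
    int work = 0;

    for (;;) {
        if (q->qlen == 0) {      /* queue drained */
            q->quota -= work;
            *budget -= work;
            q->on_poll_list = 0;
            return 0;
        }
        q->qlen--;               /* stands in for dequeuing one packet */
        q->delivered++;          /* stands in for netif_receive_skb( ) */
        work++;
        if (work >= limit)       /* quota or budget used up */
            break;
    }
    q->quota -= work;
    *budget -= work;
    return -1;
}
```

Note that in this model the limit is only tested after a packet has been handled, so a call that exhausts its limit on the very last packet still returns -1 and relies on the next poll to notice the empty queue.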
- net/core/dev.c: netif_receive_skb( )
Main device receive routine, called from within NET_RX_SOFTIRQ softirq handler.
Checks the payload type, and calls any handler(s) registered for that type (or bridges
the frame if bridging is enabled on the incoming interface). For IP traffic, the registered handler is ip_rcv( ).
- updates stats (1426)
- passes socketbuffer to any handlers registered for packet type ETH_P_ALL (i.e., handlers to be executed for all packets) (1438)
- calls handle_bridge, if bridging enabled (1457)
- calls br_handle_frame_hook, returns
- passes socketbuffer to any handlers registered for specific payload type (e.g., ARP, IP handlers) (1464)
- Since handlers stored in simple hashtable, check each handler in
list to make sure it is in fact for specified payload type (1465)
Note: when searching for packet handler in ptype list, always "one behind" in search loop (pt_prev)
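The "one behind" search noted above can be illustrated with a small model. The idea, as I understand it, is that a handler is only called once the next matching handler has been found, so that every handler except the last receives an extra reference and the last one consumes the original. The sketch below assumes a single hash chain and a plain reference counter; the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_skb {
    int users;                 /* reference count */
    unsigned short protocol;   /* payload type, e.g. 0x0800 for IP */
};

struct toy_ptype {
    unsigned short type;
    int (*func)(struct toy_skb *skb);
    struct toy_ptype *next;    /* next handler in the same hash chain */
};

static int toy_deliveries;

/* Example handler: counts the delivery and drops its reference. */
static int toy_handler(struct toy_skb *skb)
{
    toy_deliveries++;
    skb->users--;
    return 0;
}

/* Walk one chain; returns the number of handlers invoked. */
int toy_deliver(struct toy_skb *skb, struct toy_ptype *chain)
{
    struct toy_ptype *pt_prev = NULL;
    int calls = 0;

    assert(skb != NULL);
    for (struct toy_ptype *pt = chain; pt != NULL; pt = pt->next) {
        if (pt->type != skb->protocol)
            continue;          /* a chain can hold several payload types */
        if (pt_prev) {
            skb->users++;      /* extra reference for the earlier handler */
            pt_prev->func(skb);
            calls++;
        }
        pt_prev = pt;          /* remember this match, call it later */
    }
    if (pt_prev) {             /* last match consumes the original reference */
        pt_prev->func(skb);
        calls++;
    }
    return calls;
}
```

If no handler matches, the model leaves the reference count untouched; the caller would then be responsible for freeing the buffer.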
IP (layer 3) receive routines:
- net/ipv4/ip_input.c: ip_rcv( )
Main IP receive routine, called from netif_receive_skb( ) when an IP packet
is received on an interface. Together with ip_rcv_finish( ), checks the packet,
passes it to the netfilter hook, gets a route for the packet and assigns it to
skb->dst (if netfilter doesn't assign one), and calls skb->dst->input( ) to
complete packet routing.
- drop IP_OTHERHOST packets (386)
- update stats (389)
- call skb_share_check( ) to clone socketbuffer if it’s shared (391)
- make sure socketbuffer is large enough to contain IP header, header size field is reasonable, version correct, header checksum
correct (394 - 418)
- check that total packet length reported in header
is OK (bigger than header and less than total socket buffer size) (423)
- trim socketbuffer to correct size (430)
- call netfilter NF_IP_PRE_ROUTING hook (437)
- call ip_rcv_finish (if netfilter hook didn't steal
packet)
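The header checks listed above can be sketched as a stand-alone function over a raw byte buffer. This is a simplified user-space illustration, not the kernel's code: it works on bytes rather than struct iphdr, uses a plain (unoptimized) Internet checksum, and the toy_* names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over len bytes; returns 0 for a valid header. */
uint16_t toy_ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    while (sum >> 16)                          /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Returns 1 if the packet passes the sanity checks, 0 if it should be dropped. */
int toy_ip_header_ok(const uint8_t *pkt, size_t buflen)
{
    assert(pkt != NULL);

    if (buflen < 20)                           /* room for a minimal header? */
        return 0;

    unsigned version = pkt[0] >> 4;
    size_t ihl = (size_t)(pkt[0] & 0x0F) * 4;  /* header length in bytes */

    if (version != 4 || ihl < 20 || buflen < ihl)
        return 0;
    if (toy_ip_checksum(pkt, ihl) != 0)        /* header checksum must verify */
        return 0;

    size_t tot_len = (size_t)(pkt[2] << 8 | pkt[3]);
    if (tot_len < ihl || tot_len > buflen)     /* total length must be plausible */
        return 0;
    return 1;
}
```

A caller that accepts the packet would then trim the buffer to the total length reported in the header before going any further.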
- net/ipv4/ip_input.c: ip_rcv_finish( )
- call ip_route_input( ) if skb->dst is NULL - i.e., if netfilter hook didn't assign its own routing-table entry (315)
- if return is error value, drop packet
- process IP header options, if any (331 - 365)
- return skb->dst->input( ) (367)
Routing (received packets):
- net/ipv4/route.c: ip_route_input( )
Find route for supplied packet (skb),
looking first in route cache hashtable and calling ip_route_input_slow( )
if not found, and assign result to skb->dst.
- get hash from supplied key value (1630)
- loop through hash entries for one matching supplied src addr, dst addr, tos, iif, and oif == 0 (1632)
- if rtable found, assign it to skb->dst, update stats, return
- handle multicast dest addr (1664)
- if multicast address is one we've registered to receive, or we're a multicast router, call ip_route_input_mc( ), return
- call ip_route_input_slow( )
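The cache lookup above amounts to hashing the key, walking one chain and comparing every key field. A minimal user-space sketch follows; the hash function, table size and toy_* names are illustrative assumptions and do not reproduce the kernel's own hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SIZE 16

struct toy_key {
    uint32_t dst, src;
    int iif, oif;
    uint8_t tos;
};

struct toy_rtable {
    struct toy_key key;
    struct toy_rtable *next;  /* hash-chain link */
    unsigned long use;        /* hit counter */
};

static struct toy_rtable *toy_cache[TOY_HASH_SIZE];

/* Illustrative mixing only; not the kernel's hash function. */
static unsigned toy_hash(uint32_t dst, uint32_t src, uint8_t tos)
{
    uint32_t h = dst ^ src ^ tos;

    h ^= h >> 16;
    h ^= h >> 8;
    return h & (TOY_HASH_SIZE - 1);
}

void toy_cache_insert(struct toy_rtable *rt)
{
    assert(rt != NULL);
    unsigned h = toy_hash(rt->key.dst, rt->key.src, rt->key.tos);

    rt->next = toy_cache[h];
    toy_cache[h] = rt;
}

/* Returns the cached entry, or NULL on a miss (caller falls back to the slow path). */
struct toy_rtable *toy_route_input(uint32_t dst, uint32_t src, uint8_t tos, int iif)
{
    unsigned h = toy_hash(dst, src, tos);

    for (struct toy_rtable *rt = toy_cache[h]; rt != NULL; rt = rt->next) {
        if (rt->key.dst == dst && rt->key.src == src &&
            rt->key.iif == iif && rt->key.oif == 0 && rt->key.tos == tos) {
            rt->use++;        /* update stats on a hit */
            return rt;
        }
    }
    return NULL;
}
```

Requiring oif == 0 in the comparison is what keeps input-route entries apart from entries created for locally generated traffic in the same table.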
- net/ipv4/route.c: ip_route_input_slow( )
Main routing routine for incoming packets when entry not in route cache
hashtable. Get appropriate dst_entry struct from "slow" routing tables, assign
it to skb->dst and add it to the route cache hashtable
- check that the supplied (incoming) interface has IP enabled, i.e., that there's an in_dev associated with it, otherwise drop
- create a new key based on the supplied dst addr, src addr, tos, iif, oif == 0, and scope RT_SCOPE_UNIVERSE (1332 - 1342)
- check for bad source, destination address (“martians”), log errors and return
- call fib_lookup( ) to get route (1366)
- uses key, assigns value to fib_result
- exits if error, i.e., no route found or input interface not configured for forwarding
- handle NAT stuff (1375 - 1398)
- if fib_result type RTN_BROADCAST (1400)
- discard if packet type not IP (1511)
- call inet_select_addr( ) to get source address if supplied src addr is 0, else call fib_validate_source( ) to make sure source address is as expected
- create and populate a dst_entry struct (1528 - 1567)
- call rt_set_nexthop to set mtu information in dst_entry from fib_result (1485)
- call rt_intern_hash( ) (1501)
- add dst_entry info to the route cache hashtable
- call arp_bind_neighbour( ) if unicast or output (oif == 0) route
- assign skb->dst to hold dst_entry
- if fib_result type RTN_LOCAL (1403)
- create and populate a dst_entry struct (1528 - 1567)
- call rt_set_nexthop to set mtu information in dst_entry from fib_result (1485)
- call rt_intern_hash( ) (1501)
- add dst_entry info to the route cache hashtable
- call arp_bind_neighbour( ) if unicast or output (oif == 0) route
- assign skb->dst to hold dst_entry
- Otherwise (packet to be unicast-forwarded)
- reject packet if input interface is not configured for forwarding, or if fib_result type is not RTN_UNICAST (1416 - 1419)
- get output device associated with fib_result (1425)
- call fib_validate_source( ) to make sure source address is as expected (1433)
- set RTCF_DOREDIRECT, RTCF_DIRECTSRC flags if needed (???) (1441)
- reject packet if socketbuffer protocol not IP (e.g., ARP) and RTCF_DNAT flag not set
- create and populate a dst_entry struct (1454 - 1487)
- call rt_set_nexthop to set mtu information in dst_entry from fib_result (1485)
- call rt_intern_hash( ) (1501)
- add dst_entry info to the route cache hashtable
- call arp_bind_neighbour( ) if unicast or output (oif == 0) route
- assign skb->dst to hold dst_entry
Local delivery:
- net/ipv4/ip_input.c: ip_local_deliver( )
Assigned as the dst->input routine for local routes in
ip_route_input_slow( ); called from ip_rcv_finish( ) after route assigned to
skb->dst
- if needed, call ip_defrag( ) to defragment packet (296)
- returns reassembled packet if complete, or NULL if not
- call netfilter hook NF_IP_LOCAL_IN (302)
- call ip_local_deliver_finish( ) if netfilter hook didn't "steal" packet
- net/ipv4/ip_input.c: ip_local_deliver_finish( )
Find level-4 protocol handler(s) for packet, call handler's receive function
- pull IP header from packet (227)
- call nf_conntrack_put to release packet from netfilter connection-tracking module (232)
- check raw socket hashtable to see if there are any potential raw sockets registered (250)
- if so, call raw_v4_input( )
- call __raw_v4_lookup( ) in loop to get successive sockets matching info (addresses, etc.)
- for each socket except last, clone the socketbuffer and call raw_rcv( ) (???)
- call sock_queue_rcv_skb( ) (???)
- return the last socket (or NULL if none)
- find protocol handlers for packet payload type (255)
- if single handler and no raw sockets, call its handler( ) function, return
- if multiple handlers, or raw sockets, call ip_run_ipprot( )
- run through handlers, cloning socket buffer for any handlers with copy field set
- return 1 if at least one handler handles packet, 0 if none handle it
- call raw_rcv( ) for last raw socket, if any (275)
- otherwise, if no protocol handler handled packet, and no raw sockets, send ICMP error message (279)
Forwarding:
- net/ipv4/ip_forward.c: ip_forward( )
Assigned as the dst->input routine for non-local routes in
ip_route_input_slow( ); called from ip_rcv_finish( ) after route assigned to
skb->dst
- check socketbuffer opt.router_alert field, return if true (?) (81)
- drop packet if type isn't PACKET_HOST (84)
- discard packet, send ICMP time-exceeded if TTL <= 1 (98)
- drop packet, send ICMP host unreachable message if strict routing indicated and next-hop inappropriate (101)
- send route-redirect message if indicated by assigned route (117)
- copy socket buffer so can mangle entries (decrease TTL, etc.) (121)
- decrement TTL (126)
- if packet size greater than MTU and don't-fragment flag is set, send ICMP destination-unreachable/fragmentation-needed message (133)
- if route uses fast NAT (indicated by rt_flags including RTCF_NAT), call ip_do_nat( ) (137)
- call netfilter hook NF_IP_FORWARD, with completion routine ip_forward_finish( ) if packet not stolen
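Three of the forwarding decisions above (the TTL test, the TTL decrement and the MTU / don't-fragment test) can be sketched in a few lines. This is a toy model under stated assumptions: the header is reduced to four fields held in host byte order, the checksum is patched incrementally rather than recomputed, and the toy_* names and verdict codes are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_verdict { FWD_OK, FWD_TTL_EXCEEDED, FWD_FRAG_NEEDED };

struct toy_iphdr {
    uint8_t ttl;
    uint16_t check;    /* header checksum, as a host-order number */
    uint16_t tot_len;  /* total packet length in bytes */
    int df;            /* nonzero if the don't-fragment flag is set */
};

enum toy_verdict toy_ip_forward(struct toy_iphdr *iph, unsigned mtu)
{
    assert(iph != NULL);

    if (iph->ttl <= 1)                  /* would expire here: ICMP time exceeded */
        return FWD_TTL_EXCEEDED;

    /*
     * TTL is the high byte of a 16-bit header word, so decrementing it
     * lowers the one's-complement sum by 0x0100 and raises the checksum
     * by 0x0100, with end-around carry.
     */
    uint32_t c = (uint32_t)iph->check + 0x0100;
    iph->check = (uint16_t)(c + (c >= 0xFFFF));
    iph->ttl--;

    if (iph->tot_len > mtu && iph->df)  /* too big and may not be fragmented */
        return FWD_FRAG_NEEDED;
    return FWD_OK;
}
```

Patching the checksum this way avoids re-summing the whole header for a one-byte change, which matters on a path that every forwarded packet takes.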
- net/ipv4/ip_forward.c: ip_forward_finish( )
- update forwarding statistics (48)
- if no IP options:
- handle fast routing if enabled (?) (51 - 65)
- return ip_send( ) (66)
- if IP options:
- call ip_forward_options( ) (69)
Output:
IP (layer 3) output routines:
- include/net/ip.h: ip_send( )
- if packet length greater than next-hop MTU
- call ip_fragment( ), with completion function ip_finish_output( ) (165)
- otherwise, call ip_finish_output( ) (167)
- net/ipv4/ip_output.c: ip_finish_output( )
- call netfilter hook NF_IP_POST_ROUTING, with completion function ip_finish_output2( ) (191)
- net/ipv4/ip_output.c: ip_finish_output2( )
- if destination entry has a cached hardware header, return its hh_output( ) method (174)
- for ARP, hh_output( ) is dev_queue_xmit( )
- if no hardware header cache, return destination's neighbour->output( ) method (176)
- for ARP, the output( ) method is neigh_resolve_output( )
Core (layer 2) output routines:
- net/core/dev.c: dev_queue_xmit( )
- linearize socket buffer, if needed (996 - 1011)
- checksum packet, if needed (1017 - 1022)
- if device has a queue (qdisc):
- enqueue packet (1029)
- call qdisc_run( ) for the output device, which calls qdisc_restart( ) if device is not stopped:
- if netdev_nit is nonzero, there are protocol handler(s) registered for ETH_P_ALL; call dev_queue_xmit_nit( ) (???)
- call dev->hard_start_xmit( )
- if device has no queue
- if netdev_nit is nonzero, there are protocol handler(s) registered for ETH_P_ALL; call dev_queue_xmit_nit( ) (???) (1057)
- call dev->hard_start_xmit( ) (1060)
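The two transmit paths above (queueing discipline versus direct hand-off) can be modelled with a small FIFO. This is a user-space sketch only: packets are plain integers, the qdisc is a fixed-size ring, locking and requeueing are ignored, and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QLEN 8

struct toy_dev {
    int has_qdisc;        /* 0 for queueless devices */
    int stopped;          /* driver has stopped its transmit queue */
    int taps;             /* stands in for netdev_nit (ETH_P_ALL listeners) */
    int queue[TOY_QLEN];  /* FIFO of packet ids */
    int head, qlen;
    int sent, tapped;
};

/* Hand one packet to the taps (if any) and then to the driver. */
static void toy_xmit_one(struct toy_dev *dev, int pkt)
{
    (void)pkt;
    if (dev->taps)
        dev->tapped++;    /* stands in for dev_queue_xmit_nit( ) */
    dev->sent++;          /* stands in for dev->hard_start_xmit( ) */
}

/* Drain the queue while the device accepts packets. */
static void toy_qdisc_run(struct toy_dev *dev)
{
    while (!dev->stopped && dev->qlen > 0) {
        int pkt = dev->queue[dev->head];

        dev->head = (dev->head + 1) % TOY_QLEN;
        dev->qlen--;
        toy_xmit_one(dev, pkt);
    }
}

/* Returns 0 on success, -1 if the queue is full and the packet is dropped. */
int toy_dev_queue_xmit(struct toy_dev *dev, int pkt)
{
    assert(dev != NULL);

    if (!dev->has_qdisc) {            /* no queue: transmit immediately */
        toy_xmit_one(dev, pkt);
        return 0;
    }
    if (dev->qlen == TOY_QLEN)
        return -1;
    dev->queue[(dev->head + dev->qlen) % TOY_QLEN] = pkt;
    dev->qlen++;
    toy_qdisc_run(dev);               /* try to send what is queued */
    return 0;
}
```

In the queued case a packet sent while the device is stopped simply waits; it goes out on a later call once the device is running again.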
Locally generated traffic:
UDP (layer 4) send routines:
- net/ipv4/udp.c: udp_sendmsg( )
- check length, flags
- set destination address and port from supplied sock or msghdr
- set source address, port from sock
- use ipc to send messages to source or dest if errors
- get TOS bits, and set RTO_LINK if sk->localroute set
- find routing information (rtable struct):
- call sk_dest_check to see if sock already has destination attached (516)
- if not, call ip_route_output (518)
- calls ip_route_output_key to get rtable
- call sk_dst_set( ) to attach rtable to socket for future use if connected
- call ip_build_xmit( ) to send packet (passing in rtable struct) (547)
IP (layer 3) send routines:
- net/ipv4/ip_output.c: ip_build_xmit( )
- call ip_build_xmit_slow( ) if fragmentation needed or packet has IP options (654)
- call ip_local_error( ) if header already included and packet size too large (657)
- call sock_alloc_send_skb( ) to allocate a socket buffer (678)
- call skb_reserve( ) to reserve space at head of socket buffer for hardware header, and skb_put to move tail pointer for space for packet data (682 - 688)
- assign skb->dst to clone of dest entry in rtable passed in
- set fields of IP header (if not supplied) (690 - 704)
- call getfrag, which is either udp_getfrag( ) or udp_getfrag_nosum( ), to copy packet data into socket buffer (705)
- call netfilter hook NF_IP_LOCAL_OUT (713)
- call output_maybe_reroute( ) if netfilter hook doesn't steal packet
- call skb->dst->output( )
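The skb_reserve( ) and skb_put( ) step above is easier to follow with the buffer pointers written out. The sketch below is a user-space toy with a fixed 256-byte buffer; it mimics the head/data/tail/end pointer arithmetic only, and adds a push operation (my own addition, to show what the reserved headroom is for). The toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_skb {
    unsigned char buf[256];
    unsigned char *head;  /* start of the buffer */
    unsigned char *data;  /* start of the packet data */
    unsigned char *tail;  /* end of the packet data */
    unsigned char *end;   /* end of the buffer */
    size_t len;           /* bytes between data and tail */
};

void toy_skb_init(struct toy_skb *skb)
{
    skb->head = skb->data = skb->tail = skb->buf;
    skb->end = skb->buf + sizeof skb->buf;
    skb->len = 0;
}

/* Leave n bytes of headroom in an empty buffer (e.g. for the hardware header). */
void toy_skb_reserve(struct toy_skb *skb, size_t n)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Extend the data area at the tail by n bytes; returns where the new bytes start. */
unsigned char *toy_skb_put(struct toy_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Extend the data area at the front by n bytes, using up headroom. */
unsigned char *toy_skb_push(struct toy_skb *skb, size_t n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}
```

Reserving the headroom first means the lower layer can later prepend its header in place, without copying the IP packet that is already in the buffer.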
Routing (locally-generated packets):
- net/ipv4/route.c: ip_route_output( )
- calls ip_route_output_key to get rtable
- call sk_dst_set( ) to attach rtable to socket for future use if connected
- net/ipv4/route.c: ip_route_output_key( )
- get hash from supplied key value
- loop through hash entries for one matching key fields src addr, dst addr, tos, oif, and iif == 0
- also need hash entry to have iif == 0; so only for locally-generated packets???
- if rtable found, update stats, return rtable in supplied pointer argument
- if none found, call ip_route_output_slow
- net/ipv4/route.c: ip_route_output_slow( )
Main routing routine for outgoing (locally generated?) packets when entry not in fast-route hashtable
- create a new key based on the supplied key
- in particular, set key.iif = loopback interface
- if supplied key has nonzero source address
- if supplied key has nonzero output interface
- if no destination address supplied
- net/ipv4/devinet.c: inet_select_addr
Given destination address, device, scope, returns one of device’s inet addresses to use as a source address
- return 0 if device has no inet addresses
- return interface local address(???) if dest matches one of device's address / mask combinations
Bridging code:
- net/core/dev.c: handle_bridge( )
- call any leftover handler for ETH_P_ALL
- call br_handle_frame_hook( ), return
- net/bridge/br_input.c: br_handle_frame( )
Hook installed at boot time as br_handle_frame_hook; handles bridging
- checks that the interface the frame was received on is configured for bridging, that the interface is up, and that the source address is not broadcast (123)
- if bridging state is forwarding or learning, calls br_fdb_insert( ) to add entry to forwarding table if not already present (139)
- handles "special" frames ??? (143)
- if bridging state is forwarding, calls netfilter hook NF_BR_PRE_ROUTING
- if frame not stolen, calls br_handle_frame_finish( )
- net/bridge/br_input.c: br_handle_frame_finish( )
- if interface in promiscuous mode, clones socketbuffer, calls br_pass_frame_up( ) (69)
- if dest address is broadcast, calls br_flood_forward( ), and br_pass_frame_up( ) if not already done (79)
- calls br_fdb_get( ) to get routing info for frame (86)
- if destination is local, calls br_pass_frame_up( ) if not already done (87)
- if destination not local, calls br_forward( ) (96)
- if destination NULL (unknown), calls br_flood_forward( ) (102)
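The decision tree above can be condensed into a function that returns the set of actions to take for a frame. This is a toy sketch: the forwarding database is a plain array searched linearly, the broadcast test is done by checking the Ethernet group bit (an assumption on my part, which also catches multicast), and the toy_* names and ACT_* flags are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ACT_PASS_UP = 1, ACT_FORWARD = 2, ACT_FLOOD = 4 };

struct toy_fdb_entry {
    unsigned char addr[6];
    int port;
    int is_local;  /* address belongs to the bridge itself */
};

/* Linear search standing in for br_fdb_get( ); returns NULL if unknown. */
const struct toy_fdb_entry *toy_fdb_get(const struct toy_fdb_entry *fdb,
                                        size_t n, const unsigned char *addr)
{
    for (size_t i = 0; i < n; i++)
        if (memcmp(fdb[i].addr, addr, 6) == 0)
            return &fdb[i];
    return NULL;
}

/* Returns a bitmask of ACT_* flags describing what to do with the frame. */
int toy_bridge_decide(const struct toy_fdb_entry *fdb, size_t n,
                      const unsigned char *dest, int promisc)
{
    int actions = 0;

    assert(dest != NULL);
    if (promisc)
        actions |= ACT_PASS_UP;             /* promiscuous: local stack sees everything */

    if (dest[0] & 1)                        /* group bit set: broadcast or multicast */
        return actions | ACT_FLOOD | ACT_PASS_UP;

    const struct toy_fdb_entry *e = toy_fdb_get(fdb, n, dest);

    if (e != NULL && e->is_local)
        return actions | ACT_PASS_UP;       /* addressed to the bridge itself */
    if (e != NULL)
        return actions | ACT_FORWARD;       /* known port: forward there only */
    return actions | ACT_FLOOD;             /* unknown destination: flood */
}
```

Because the result is a bitmask, the "if not already done" wording in the walkthrough falls out naturally: setting ACT_PASS_UP twice still passes the frame up only once.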
- net/bridge/br_input.c: br_pass_frame_up( )
Called to pass frame for local consumption
- set socketbuffer packet type to PACKET_HOST, and incoming interface to bridge interface
- calls netfilter hook NF_BR_LOCAL_IN
- if frame not stolen, calls br_pass_frame_up_finish( )
- br_pass_frame_up_finish( ) calls netif_rx( )
- net/bridge/br_forward.c: br_forward( )
Forwarding routine for packets to be forwarded on single bridge port
- call should_deliver( ) to test if frame should be forwarded (85)
- returns false if skb->dev == bridge_port->dev or state not forwarding
- calls __br_forward if true
- sets indev = skb->dev, skb->dev = bridge_port->dev
- calls netfilter hook NF_BR_FORWARD
- calls __br_forward_finish if packet not stolen
- calls netfilter hook NF_BR_POST_ROUTING
- calls __dev_queue_push_xmit if packet not stolen
- net/bridge/br_forward.c: br_deliver( )
Delivery routine for locally-originated packets to be forwarded on single bridge port
- call should_deliver( ) to test if frame should be forwarded (85)
- returns false if skb->dev == bridge_port->dev or state not forwarding
- calls __br_deliver( ) if true
- sets indev = skb->dev, skb->dev = bridge_port->dev
- calls netfilter hook NF_BR_LOCAL_OUT
- calls __br_forward_finish if packet not stolen
- calls netfilter hook NF_BR_POST_ROUTING
- calls __dev_queue_push_xmit if packet not stolen
- net/bridge/br_forward.c: br_flood_forward( )
Forwarding routine for packets to be flooded on all bridge ports
- call br_flood( ), with packet_hook = __br_forward( )
- net/bridge/br_forward.c: br_flood_deliver( )
Delivery routine for locally-originated packets to be flooded on all bridge ports
- call br_flood( ), with packet_hook = __br_deliver( )
- net/bridge/br_forward.c: br_flood( )
Routine for packets to be flooded on all bridge ports
- clone socketbuffer if this is specified
- run through port list of bridge
- call __packet_hook( ), supplied as argument
To do:
- fib_lookup( )
- route cache
- multicast:
- net/ipv4/route.c: ip_route_input_mc( ) (1235)
- validate packet (check that dev not null, source address OK, protocol is IP, call fib_validate_source( )) (1246)