diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 7cd879eba5dc..94444b152fbc 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -3,9 +3,9 @@
 --------------------------------------------------------------------------------
 
 This file documents the mmap() facility available with the PACKET
-socket interface on 2.4 and 2.6 kernels. This type of sockets is used for
-capture network traffic with utilities like tcpdump or any other that needs
-raw access to network interface.
+socket interface on 2.4/2.6/3.x kernels. This type of socket is used to
+i) capture network traffic with utilities like tcpdump, ii) transmit network
+traffic, or do anything else that needs raw access to a network interface.
 
 You can find the latest version of this document at:
     http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
@@ -21,19 +21,18 @@ Please send your comments to
 + Why use PACKET_MMAP
 --------------------------------------------------------------------------------
 
-In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very
-inefficient. It uses very limited buffers and requires one system call
-to capture each packet, it requires two if you want to get packet's
-timestamp (like libpcap always does).
+In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
+inefficient. It uses very limited buffers and requires one system call to
+capture each packet; it requires two if you want to get the packet's timestamp
+(like libpcap always does).
 
 In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
 configurable circular buffer mapped in user space that can be used to either
 send or receive packets. This way reading packets just needs to wait for them,
 most of the time there is no need to issue a single system call. Concerning
 transmission, multiple packets can be sent through one system call to get the
-highest bandwidth.
-By using a shared buffer between the kernel and the user also has the benefit
-of minimizing packet copies.
+highest bandwidth. Using a shared buffer between the kernel and the user
+also has the benefit of minimizing packet copies.
 
 It's fine to use PACKET_MMAP to improve the performance of the capture and
 transmission process, but it isn't everything. At least, if you are capturing
@@ -41,7 +40,8 @@ at high speeds (this is relative to the cpu speed), you should check if the
 device driver of your network interface card supports some sort of interrupt
 load mitigation or (even better) if it supports NAPI, also make sure it is
 enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
-supported by devices of your network.
+supported by devices of your network. CPU IRQ pinning of your network interface
+card can also be an advantage.
 
 --------------------------------------------------------------------------------
 + How to use mmap() to improve capture process
@@ -87,9 +87,7 @@ the following process:
 socket creation and destruction is straight forward, and is done
 the same way with or without PACKET_MMAP:
 
-int fd;
-
-fd= socket(PF_PACKET, mode, htons(ETH_P_ALL))
+ int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
 
 where mode is SOCK_RAW for the raw interface were link level
 information can be captured or SOCK_DGRAM for the cooked
@@ -180,7 +178,6 @@ and the PACKET_TX_HAS_OFF option.
 + PACKET_MMAP settings
 --------------------------------------------------------------------------------
 
-
 To setup PACKET_MMAP from user level code is done with a call like
 
  - Capture process
@@ -214,7 +211,6 @@ indeed, packet_set_ring checks that the following condition is true
 
     frames_per_block * tp_block_nr == tp_frame_nr
 
-
 Lets see an example, with the following values:
 
      tp_block_size= 4096
@@ -240,7 +236,6 @@ be spawned across two blocks, so there are some details you have to take into
 account when choosing the frame_size. See "Mapping and use of the circular
 buffer (ring)".
 
-
 --------------------------------------------------------------------------------
 + PACKET_MMAP setting constraints
 --------------------------------------------------------------------------------
@@ -277,7 +272,6 @@ User space programs can include /usr/include/sys/user.h and
 The pagesize can also be determined dynamically with the getpagesize (2)
 system call.
 
-
  Block number limit
 --------------------
 
@@ -297,7 +291,6 @@ called pg_vec, its size limits the number of blocks that can be allocated.
       v  block #2
      block #1
 
-
 kmalloc allocates any number of bytes of physically contiguous memory from
 a pool of pre-determined sizes. This pool of memory is maintained by the slab
 allocator which is at the end the responsible for doing the allocation and
@@ -312,7 +305,6 @@ pointers to blocks is
 
      131072/4 = 32768 blocks
 
-
  PACKET_MMAP buffer size calculator
 ------------------------------------
 
@@ -353,7 +345,6 @@ and a value for <frame size> of 2048 bytes. These parameters will yield
 and hence the buffer will have a 262144 MiB size. So it can hold
 262144 MiB / 2048 bytes = 134217728 frames
 
-
 Actually, this buffer size is not possible with an i386 architecture.
 Remember that the memory is allocated in kernel space, in the case of
 an i386 kernel's memory size is limited to 1GiB.
@@ -385,7 +376,6 @@ the following (from include/linux/if_packet.h):
    - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
    - Pad to align to TPACKET_ALIGNMENT=16
  */
-
 
 The following are conditions that are checked in packet_set_ring
 
@@ -426,7 +416,6 @@ and the following flags apply:
      #define TP_STATUS_LOSING        4
      #define TP_STATUS_CSUMNOTREADY  8
 
-
 TP_STATUS_COPY        : This flag indicates that the frame (and associated
                         meta information) has been truncated because it's
                         larger than tp_frame_size. This packet can be
@@ -475,7 +464,6 @@ packets are in the ring:
 It doesn't incur in a race condition to first check the status value and
 then poll for frames.
 
-
 ++ Transmission process
 Those defines are also used for transmission:
 
@@ -506,6 +494,196 @@ The user can also use poll() to check if a buffer is available:
     pfd.events = POLLOUT;
     retval = poll(&pfd, 1, timeout);
 
+-------------------------------------------------------------------------------
++ What TPACKET versions are available and when to use them?
+-------------------------------------------------------------------------------
+
+ int val = tpacket_version;
+ setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+ getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+
+where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
+
+TPACKET_V1:
+	- Default if not otherwise specified by setsockopt(2)
+	- RX_RING, TX_RING available
+	- VLAN metadata information available for packets
+	  (TP_STATUS_VLAN_VALID)
+
+TPACKET_V1 --> TPACKET_V2:
+	- Made 64 bit clean due to unsigned long usage in TPACKET_V1
+	  structures, thus this also works on 64 bit kernel with 32 bit
+	  userspace and the like
+	- Timestamp resolution in nanoseconds instead of microseconds
+	- RX_RING, TX_RING available
+	- How to switch to TPACKET_V2:
+		1. Replace struct tpacket_hdr by struct tpacket2_hdr
+		2. Query header len and save
+		3. Set protocol version to 2, set up ring as usual
+		4. For getting the sockaddr_ll,
+		   use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
+		   (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+TPACKET_V2 --> TPACKET_V3:
+	- Flexible buffer implementation:
+		1. Blocks can be configured with non-static frame-size
+		2. Read/poll is at a block-level (as opposed to packet-level)
+		3. Added poll timeout to avoid indefinite user-space wait
+		   on idle links
+		4. Added user-configurable knobs:
+			4.1 block::timeout
+			4.2 tpkt_hdr::sk_rxhash
+	- RX Hash data available in user space
+	- Currently only RX_RING available
+
+-------------------------------------------------------------------------------
++ AF_PACKET fanout mode
+-------------------------------------------------------------------------------
+
+In the AF_PACKET fanout mode, packet reception can be load balanced among
+processes. This also works in combination with mmap(2) on packet sockets.
+
+Minimal example code by David S. Miller (try things like "./test eth0 hash",
+"./test eth0 lb", etc.):
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#include <unistd.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+
+#include <net/if.h>
+
+static const char *device_name;
+static int fanout_type;
+static int fanout_id;
+
+#ifndef PACKET_FANOUT
+# define PACKET_FANOUT			18
+# define PACKET_FANOUT_HASH		0
+# define PACKET_FANOUT_LB		1
+#endif
+
+static int setup_socket(void)
+{
+	int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
+	struct sockaddr_ll ll;
+	struct ifreq ifr;
+	int fanout_arg;
+
+	if (fd < 0) {
+		perror("socket");
+		return -1;
+	}
+
+	memset(&ifr, 0, sizeof(ifr));
+	strncpy(ifr.ifr_name, device_name, IFNAMSIZ - 1);
+	err = ioctl(fd, SIOCGIFINDEX, &ifr);
+	if (err < 0) {
+		perror("SIOCGIFINDEX");
+		return -1;
+	}
+
+	memset(&ll, 0, sizeof(ll));
+	ll.sll_family = AF_PACKET;
+	ll.sll_ifindex = ifr.ifr_ifindex;
+	err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+	if (err < 0) {
+		perror("bind");
+		return -1;
+	}
+
+	fanout_arg = (fanout_id | (fanout_type << 16));
+	err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
+			 &fanout_arg, sizeof(fanout_arg));
+	if (err) {
+		perror("setsockopt");
+		return -1;
+	}
+
+	return fd;
+}
+
+static void fanout_thread(void)
+{
+	int fd = setup_socket();
+	int limit = 10000;
+
+	if (fd < 0)
+		exit(EXIT_FAILURE);
+
+	while (limit-- > 0) {
+		char buf[1600];
+		int err;
+
+		err = read(fd, buf, sizeof(buf));
+		if (err < 0) {
+			perror("read");
+			exit(EXIT_FAILURE);
+		}
+		if ((limit % 10) == 0)
+			fprintf(stdout, "(%d) \n", getpid());
+	}
+
+	fprintf(stdout, "%d: Received 10000 packets\n", getpid());
+
+	close(fd);
+	exit(0);
+}
+
+int main(int argc, char **argp)
+{
+	int fd, err;
+	int i;
+
+	if (argc != 3) {
+		fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
+		return EXIT_FAILURE;
+	}
+
+	if (!strcmp(argp[2], "hash"))
+		fanout_type = PACKET_FANOUT_HASH;
+	else if (!strcmp(argp[2], "lb"))
+		fanout_type = PACKET_FANOUT_LB;
+	else {
+		fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
+		exit(EXIT_FAILURE);
+	}
+
+	device_name = argp[1];
+	fanout_id = getpid() & 0xffff;
+
+	for (i = 0; i < 4; i++) {
+		pid_t pid = fork();
+
+		switch (pid) {
+		case 0:
+			fanout_thread();
+
+		case -1:
+			perror("fork");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	for (i = 0; i < 4; i++) {
+		int status;
+
+		wait(&status);
+	}
+
+	return 0;
+}
+
 -------------------------------------------------------------------------------
 + PACKET_TIMESTAMP
 -------------------------------------------------------------------------------
@@ -532,6 +710,13 @@ the networking stack is used (the behavior before this setting was added).
 See include/linux/net_tstamp.h and Documentation/networking/timestamping
 for more information on hardware timestamps.
 
+-------------------------------------------------------------------------------
++ Miscellaneous bits
+-------------------------------------------------------------------------------
+
+- Packet sockets work well together with Linux socket filters, thus you also
+  might want to have a look at Documentation/networking/filter.txt
+
 --------------------------------------------------------------------------------
 + THANKS
 --------------------------------------------------------------------------------