pigfoot’s weblog

We should forget about small efficiencies — Donald Knuth

Archive for the ‘Kernel’ Category

When we write a network server program, most of the system calls involved, such as socket(), bind(), and accept(), take parameters whose meaning is self-evident. listen() is the interesting exception. Let’s look at its prototype:

int listen(int sockfd, int backlog);

Obviously, the first parameter is the socket fd. But what does the backlog number mean? Somebody will point us to what the manpage LISTEN(2) says: “The backlog parameter defines the maximum length the queue of pending connections may grow to. If a connection request arrives with the queue full the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that retries succeed.”

From a robust server’s perspective, what is the maximum value that should be assigned? At first I passed a very large number such as 1,024 (and of course the listen system call still returned successfully). After reading the manpage LISTEN(2) on Linux, I realized I was wrong:

If the socket is of type AF_INET, and the backlog argument is greater than the constant SOMAXCONN (128 in Linux 2.0 & 2.2), it is silently truncated to SOMAXCONN.

It doesn’t mention kernel 2.6, but that’s fine. Let’s dig into the Linux kernel source code.

In Linux kernel 2.6.20.1, the listen system call is implemented in net/socket.c at line 1306. As it shows, the backlog cannot be larger than sysctl_somaxconn, which is initialized to SOMAXCONN. Furthermore, SOMAXCONN is defined as 128 in include/linux/socket.h at line 226.

In my opinion, this means that in Linux 2.0 through 2.6 the backlog cannot exceed 128 by default; a larger value is silently truncated to SOMAXCONN, just as the manpage says.
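
The truncation described above can be illustrated with a small user-space sketch. This is not the kernel source; clamp_backlog() and SKETCH_SOMAXCONN are hypothetical names made up for illustration, and the unsigned comparison is an assumption about how such a check is typically written.

```c
/* User-space illustration only; not the kernel source. */
#define SKETCH_SOMAXCONN 128 /* the default value named in the post */

/* Hypothetical helper: returns the backlog that would actually be used. */
static int clamp_backlog(int backlog, int somaxconn)
{
    /* Comparing as unsigned makes a negative backlog look huge,
     * so it is clamped to the limit as well. */
    if ((unsigned int)backlog > (unsigned int)somaxconn)
        backlog = somaxconn;
    return backlog;
}
```

Under this sketch, a request for 1,024 comes back as 128, while a request for 64 is left alone.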

How about FreeBSD? We can see the note of manpage LISTEN(2) in FreeBSD 6:

The listen() system call appeared in 4.2BSD. The ability to configure the maximum backlog at run-time, and to use a negative backlog to request the maximum allowable value, was introduced in FreeBSD 2.2.

I’m not very familiar with the FreeBSD kernel, but let me try to trace it. The starting point is sys/kern/uipc_syscalls.c under the CVS tag RELENG_6 in FreeBSD. There we can see that the listen system call invokes solisten(so, uap->backlog, td). So we go to sys/kern/uipc_socket.c to see the implementation of solisten(struct socket *so, int backlog, struct thread *td). As on Linux, the maximum value is somaxconn, which is set to SOMAXCONN by default. Finally, we can see that the value is defined in sys/sys/socket.h. It is the same as on Linux: 128.

To put it another way, if you’re writing a server program for either Linux or FreeBSD, 128 is an appropriate backlog value. On FreeBSD, however, you can also pass a negative backlog to request the maximum allowable value.
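
As a minimal sketch of that advice, the helper below opens a listening TCP socket and passes the backlog straight to listen(). The name open_listener() is hypothetical, and binding to the loopback address with port 0 (so the kernel chooses a free port) is an assumption made only to keep the example self-contained.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns a listening TCP socket on 127.0.0.1, or -1. */
static int open_listener(unsigned short port, int backlog)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port); /* 0 lets the kernel choose a free port */

    /* listen() reports success even when the kernel truncates the backlog. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling open_listener(0, SOMAXCONN) asks for the compile-time maximum from the system headers. A larger value such as 1024 also returns successfully, so the return value alone does not tell you which backlog is in effect.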

You may ask what backlog value popular modern servers use. Let’s check the source of Apache HTTP Server. As you can see in /server/listen.c, ap_listenbacklog is set to DEFAULT_LISTENBACKLOG, which is defined as 511 in /include/mpm_common.h.

Zero copy between ISR, kernel and User, from LKML.

Q:

I would like to allow the transferring of data between ISR’s, kernel and user code, without requiring copying.

I envision allocating buffers in the kernel and then mapping them so that they appear at the same addresses to all code, and never being swapped out of memory.

Is this feasible for all supported Linux architectures and is there existing code that someone could point me towards?

A:

You’re better off having the application mmap a device and then transferring the data there. Something like AF_PACKET.
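
The user-space half of that suggestion might look like the sketch below. map_shared_buffer() is a hypothetical helper, and the path is whatever device node (or, for experimenting, an ordinary file) exposes the buffer; the driver side that backs the mapping is not shown and would have to be written separately.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map len bytes of the object at path as shared memory.
 * Returns NULL on failure. */
static void *map_shared_buffer(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* MAP_SHARED: stores go to the underlying pages rather than a private copy. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

Two mappings of the same object see each other’s writes without an explicit copy, which is the property the question is after. Release the mapping with munmap() when done.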

  • Filed under: Develop, Kernel
  • POSIX Asynchronous I/O

    POSIX Asynchronous I/O, from OSNews.

    Used judiciously, asynchronous I/O (short for AIO) can provide a significant speed benefit, says David Chisnall. Perhaps enough to help your program overcome the fact that modern processors can really zoom, while hard drives still drag.

    Code using AIO on Linux must be linked with -lrt to get the POSIX real-time extensions, which suggests it may not be a kernel I/O mechanism, especially since Kernel AIO Support for Linux has not been merged into the mainline kernel yet.
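
    For readers who have not used the API, here is a minimal sketch of a single POSIX AIO read. aio_read_file() is a hypothetical helper; it submits one request with aio_read(), waits with aio_suspend(), and collects the result with aio_return(). Waiting immediately defeats the purpose of asynchronous I/O, so treat it only as a demonstration of the calls, built with -lrt as described above.

    ```c
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical helper: read up to len bytes from the start of path.
     * Returns the number of bytes read, or -1 on error. */
    static ssize_t aio_read_file(const char *path, void *buf, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = 0;

        if (aio_read(&cb) < 0) { /* queue the request */
            close(fd);
            return -1;
        }

        /* Block until the request is no longer in progress. */
        const struct aiocb *const list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);

        int err = aio_error(&cb);
        ssize_t n = aio_return(&cb); /* must be called once per request */
        close(fd);
        return err == 0 ? n : -1;
    }
    ```

    A real program would do other work between aio_read() and the completion check instead of suspending right away.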

    After studying the rt extensions in the glibc source, I found out how POSIX AIO is implemented in glibc: a thread pool plus a queue of waiting requests. In addition, helper threads are spawned to manage the requests, so it is not so trivial.

    There is something interesting about how the allocated resources are freed. See the function libc_freeres_fn() in aio_misc.c. This is also why there is no need to implement a “destroy” API.

    FreeBSD, by contrast, has an AIO family of system calls. I have no experience with AIO on FreeBSD, but it seems to have kernel support. If so, would it be more efficient than Linux for now?

  • Filed under: Develop, Kernel
  • Linux Per-Process Syscall Hooking

    Linux Per-Process Syscall Hooking, by Pluf.

    This document describes a new syscall hooking technique for Linux systems and exposes how it can be implemented as part of a virus or a backdoor in order to take full control over a userland application.

    Although there are some well-known methods for hooking functions, they are mostly based on the ELF format itself.

    This technique is focused on those pieces of code that are externally called by the main program and invoke a system call or system service.

    A simple implementation of this hooking mechanism has been developed as a result of the research and it is included with the article.

    The code provided does not have all the features you might wish for, but it includes the required ones; it is not a real backdoor but a simple proof of concept, perfect for writing your own.

  • Filed under: C/C++, Develop, Kernel
  • Van Jacobson’s network channels

    Van Jacobson’s network channels, from LWN.net.

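    Van Jacobson presented his network channels idea at linux.conf.au at the end of January this year, which set off a series of discussions. There are still quite a few obstacles to overcome, but there should be plenty to learn from the design ;-)

    Van’s slides [PDF] point out that the traditional networking stack, as implemented today, has become the “Standard Model”. In the Linux kernel the implementation looks roughly as follows (from Van’s slides):

    [Figure: net channel 1]

    When the network card receives a packet, the kernel gets an interrupt and calls the ISR, or the softirq handler registered for NET_RX_SOFTIRQ (usually the driver; this could also be a tasklet). The handler pulls the packet in according to how its hardware works, builds an skb, and then calls net/core/dev.c:netif_receive_skb(). netif_receive_skb() inspects the payload and demultiplexes it. For example, IP is handed to net/ipv4/ip_input.c:ip_rcv(). The socket layer behind it (and higher layers such as UDP/TCP) also refers to this skb, and, as everyone knows, TCP additionally has to reassemble a byte stream.

    This design may have some drawbacks. To preserve the original meaning, I quote the original text below: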

    • Passing network packets through multiple layers of the kernel.
      When a packet arrives, the network card’s interrupt handler begins the task of feeding the packet to the kernel. The remainder of the work may well be performed at software interrupt level within the driver (in a tasklet, perhaps). The core network processing happens in another software interrupt. Copying the data (an expensive operation in itself) to the application happens in kernel context. Finally the application itself does something interesting with the data. The context changes are expensive, and if any of these changes causes the work to move from one CPU to another, a big cache penalty results. Much work has been done to improve CPU locality in the networking subsystem, but much remains to be done.
    • Locking is expensive.
      Taking a lock requires a cross-system atomic operation and moves a cache line between processors. Locking costs have led to the development of lock-free techniques like seqlocks and read-copy-update (RCU), but the networking stack (like the rest of the kernel) remains full of locks.
    • The networking code makes extensive use of queues implemented with doubly-linked lists.
      These lists have poor cache behavior since they require each user to make changes (and thus move cache lines) in multiple places.

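    Therefore, to improve networking scalability, the first step is to eliminate locking and shared data. Van applies the end-to-end principle to achieve this: hand the data to the application as far as possible rather than letting it wait anywhere in the kernel. So he designed the net channel, a circular buffer (presumably a circular FIFO queue implemented with an array), to replace the skb and the queues currently used in the networking stack. For example, where a softirq used to be needed (driver -> socket), using a netchannel instead noticeably reduces the number of locks taken, which in turn improves scalability.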
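
    To make the idea concrete, here is a minimal sketch of an array-backed circular queue with one producer and one consumer, written with C11 atomics. It is not Van’s code: struct chan, chan_put(), chan_get(), and CHAN_SLOTS are hypothetical names, and the single-producer, single-consumer restriction is an assumption of this sketch. The point is that each index is written by only one side, so no lock is needed.

    ```c
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define CHAN_SLOTS 8u /* power of two, so an index can be masked */

    struct chan {
        _Atomic unsigned head; /* written only by the producer */
        _Atomic unsigned tail; /* written only by the consumer */
        void *slot[CHAN_SLOTS];
    };

    /* Producer side: returns false when the channel is full. */
    static bool chan_put(struct chan *c, void *pkt)
    {
        unsigned head = atomic_load_explicit(&c->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&c->tail, memory_order_acquire);
        if (head - tail == CHAN_SLOTS)
            return false;
        c->slot[head & (CHAN_SLOTS - 1)] = pkt;
        /* Release: the slot write becomes visible before the new head. */
        atomic_store_explicit(&c->head, head + 1, memory_order_release);
        return true;
    }

    /* Consumer side: returns NULL when the channel is empty. */
    static void *chan_get(struct chan *c)
    {
        unsigned tail = atomic_load_explicit(&c->tail, memory_order_relaxed);
        unsigned head = atomic_load_explicit(&c->head, memory_order_acquire);
        if (head == tail)
            return NULL;
        void *pkt = c->slot[tail & (CHAN_SLOTS - 1)];
        atomic_store_explicit(&c->tail, tail + 1, memory_order_release);
        return pkt;
    }
    ```

    Unlike a doubly-linked list, inserting or removing an entry here touches one array slot and one index. A common refinement, omitted from this sketch, is to pad head and tail onto separate cache lines so the two sides do not share a line.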

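    However, the first problem this approach runs into is that delivering data straight from the packet to the application effectively removes the points where netfilter used to hook in. Adding netfilter support back would wipe out the advantage gained.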

    One thing I don’t quite understand, though: why is a fairly simple circular buffer a “Cache aware, cache friendly queue”? Is it just because it is implemented with an array? :p

  • Filed under: Develop, Kernel