<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>pigfoot's weblog &#187; Kernel</title>
	<atom:link href="http://weblog.pigfoot.org/pigfoot/category/dev/kernel/feed/" rel="self" type="application/rss+xml" />
	<link>http://weblog.pigfoot.org/pigfoot</link>
	<description>We should forget about small efficiencies -- Donald Knuth</description>
	<lastBuildDate>Fri, 30 Jul 2010 02:51:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>What&#8217;s the maximum value of backlog argument in listen(2) system call?</title>
		<link>http://weblog.pigfoot.org/pigfoot/2007/07/05/whats-the-maximum-value-of-backlog-argument-in-listen2-system-call/</link>
		<comments>http://weblog.pigfoot.org/pigfoot/2007/07/05/whats-the-maximum-value-of-backlog-argument-in-listen2-system-call/#comments</comments>
		<pubDate>Thu, 05 Jul 2007 05:47:47 +0000</pubDate>
		<dc:creator>pigfoot</dc:creator>
				<category><![CDATA[C/C++]]></category>
		<category><![CDATA[Develop]]></category>
		<category><![CDATA[FreeBSD]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[Kernel]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Unix]]></category>

		<guid isPermaLink="false">http://weblog.pigfoot.org/pigfoot/2007/07/05/whats-the-maximum-value-of-backlog-argument-in-listen2-system-call/</guid>
		<description><![CDATA[When writing a network server, most of the system calls involved, such as socket(), bind(), and accept(), take parameters whose meaning is obvious. listen() is a more interesting case. Here is its prototype: int listen(int sockfd, int backlog); The first parameter is obviously the socket fd. But [...]]]></description>
			<content:encoded><![CDATA[<p>When writing a network server, most of the system calls involved, such as socket(), bind(), and accept(), take parameters whose meaning is obvious. listen() is a more interesting case. Here is its prototype:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">int listen(int sockfd, int backlog);</div></div>
<p>The first parameter is obviously the socket fd. But what does the backlog number mean? The LISTEN(2) manpage says: &#8220;<em>The backlog parameter defines the maximum length the queue of pending connections may grow to. If a connection request arrives with the queue full the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that retries succeed.</em>&#8221;</p>
<p>From a robust server&#8217;s perspective, what is the largest value that should be assigned? At first I passed a very large number such as 1,024 (the listen system call still returns successfully, of course). After reading the LISTEN(2) manpage on <a target="_blank" href="http://en.wikipedia.org/wiki/Linux">Linux</a>, I realized I was wrong:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">If the socket is of type AF_INET, and the backlog argument is greater than the constant SOMAXCONN (128 in Linux 2.0 &amp; 2.2), it is silently truncated to SOMAXCONN.</div></div>
<p>It doesn&#8217;t mention kernel 2.6, but that&#8217;s fine. Let&#8217;s look into the Linux kernel source code.</p>
<p>In Linux kernel 2.6.20.1, the listen system call is implemented in <a target="_blank" href="http://lxr.linux.no/source/net/socket.c#L1304">net/socket.c line 1306</a>. As it shows, the backlog cannot be larger than sysctl_somaxconn, which is initialized to SOMAXCONN. Furthermore, SOMAXCONN is defined as 128 in <a target="_blank" href="http://lxr.linux.no/source/include/linux/socket.h#L226">include/linux/socket.h line 226</a>.</p>
<p>As I read it, for Linux 2.0 through 2.6 this means the backlog cannot exceed 128 by default; a larger value is silently truncated to SOMAXCONN, just as the manpage says.</p>
<p>How about <a target="_blank" href="http://en.wikipedia.org/wiki/FreeBSD">FreeBSD</a>? The LISTEN(2) manpage in FreeBSD 6 has this note:</p>
<div class="codecolorer-container text default" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">The listen() system call appeared in 4.2BSD. The ability to configure the maximum backlog at run-time, and to use a negative backlog to request the maximum allowable value, was introduced in FreeBSD 2.2.</div></div>
<p>I&#8217;m not very familiar with the FreeBSD kernel, but let me try to trace it. The starting point is <a target="_blank" href="http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/uipc_syscalls.c?only_with_tag=RELENG_6">sys/kern/uipc_syscalls.c</a> at CVS tag RELENG_6. There the listen system call invokes <em>solisten(so, uap->backlog, td)</em>, so we move on to <a target="_blank" href="http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/uipc_socket.c?only_with_tag=RELENG_6">sys/kern/uipc_socket.c</a> for the implementation of <em>solisten(struct socket *so, int backlog, struct thread *td)</em>. As on Linux, the maximum value is somaxconn, which is set to SOMAXCONN by default. Finally, SOMAXCONN is defined in <a target="_blank" href="http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/sys/socket.h?only_with_tag=RELENG_6">sys/sys/socket.h</a>, and its value is 128, the same as on Linux.</p>
<p><strong>To put it another way, if you&#8217;re writing a server program for either Linux or FreeBSD, 128 is an appropriate value for the backlog. On FreeBSD, however, you can also pass a negative backlog to request the maximum allowable value.</strong></p>
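<p>The clamping traced above can be sketched as a small C function. This is only a model of the behavior described in this post, not kernel code: the name effective_backlog and its somaxconn parameter are made up for illustration, and the negative-backlog branch follows the FreeBSD manpage note quoted earlier.</p>

```c
#include <assert.h>

/* Hypothetical helper (not a kernel or libc function): models how a
 * requested backlog is limited. `somaxconn` is the system-wide limit,
 * 128 by default in the kernels discussed in this post. */
static int effective_backlog(int requested, int somaxconn)
{
    assert(somaxconn > 0);

    /* A negative value asks for the maximum (FreeBSD manpage note);
     * an oversized value is silently truncated to the limit. */
    if (requested < 0 || requested > somaxconn)
        return somaxconn;

    return requested;
}
```

<p>Under this model, a request for 1,024 with the default limit ends up as 128, while a request for 64 is kept as is.</p>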
<p>You may ask what backlog value popular modern servers use. Let&#8217;s check the source of <a target="_blank" href="http://en.wikipedia.org/wiki/Apache_HTTP_Server">Apache HTTP Server</a>. As you can see in <a target="_blank" href="http://svn.apache.org/viewvc/httpd/httpd/trunk/server/listen.c?view=markup">/server/listen.c</a>, ap_listenbacklog is set to DEFAULT_LISTENBACKLOG, which is defined as 511 in <a target="_blank" href="http://svn.apache.org/viewvc/httpd/httpd/trunk/include/mpm_common.h?view=markup">/include/mpm_common.h</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://weblog.pigfoot.org/pigfoot/2007/07/05/whats-the-maximum-value-of-backlog-argument-in-listen2-system-call/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Zero copy between ISR, kernel and User</title>
		<link>http://weblog.pigfoot.org/pigfoot/2006/09/28/zero-copy-between-isr-kernel-and-user/</link>
		<comments>http://weblog.pigfoot.org/pigfoot/2006/09/28/zero-copy-between-isr-kernel-and-user/#comments</comments>
		<pubDate>Thu, 28 Sep 2006 04:06:56 +0000</pubDate>
		<dc:creator>pigfoot</dc:creator>
				<category><![CDATA[Develop]]></category>
		<category><![CDATA[Kernel]]></category>

		<guid isPermaLink="false">http://weblog.pigfoot.org/pigfoot/2006/09/28/zero-copy-between-isr-kernel-and-user/</guid>
		<description><![CDATA[Zero copy between ISR, kernel and User, from LKML. Q: I would like to allow the transferring of data between ISR&#8217;s, kernel and user code, without requiring copying. I envision allocating buffers in the kernel and then mapping them so that they appear at the same addresses to all code, and never being swapped out [...]]]></description>
			<content:encoded><![CDATA[<p><a target="_blank" href="http://groups.google.com/group/linux.kernel/tree/browse_frm/thread/c93cd3f9b874c3a9/6eaa68a3fed0c8fa">Zero copy between ISR, kernel and User</a>, from <a target="_blank" href="http://en.wikipedia.org/wiki/LKML"><acronym title="Linux Kernel Mailing List">LKML</acronym></a>.</p>
<p><strong>Q:</strong></p>
<p>I would like to allow the transferring of data between ISR&#8217;s, kernel and user code, without requiring copying.</p>
<p>I envision allocating buffers in the kernel and then mapping them so that they appear at the same addresses to all code, and never being swapped out of memory.</p>
<p>Is this feasible for all supported Linux architectures and is there existing code that someone could point me towards?</p>
<p><strong>A:</strong></p>
<p>You&#8217;re better off having the application mmap a device, then transferring the data there. Something like AF_PACKET.</p>
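<p>A minimal sketch of the user-space side of that suggestion, assuming the driver exposes its buffer through a device node that supports mmap. The helper name map_shared_buffer is hypothetical, and no particular device path is implied; the driver side is not shown.</p>

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of the file or device at `path`
 * into this process so that data written by the other side can be read
 * without an extra copy. Returns NULL on failure. */
static void *map_shared_buffer(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* MAP_SHARED makes stores visible to whoever else maps the object. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    close(fd); /* the mapping remains valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

<p>The caller releases the region with munmap() when it is done with it.</p>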
]]></content:encoded>
			<wfw:commentRss>http://weblog.pigfoot.org/pigfoot/2006/09/28/zero-copy-between-isr-kernel-and-user/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>POSIX Asynchronous I/O</title>
		<link>http://weblog.pigfoot.org/pigfoot/2006/09/27/posix-asynchronous-io/</link>
		<comments>http://weblog.pigfoot.org/pigfoot/2006/09/27/posix-asynchronous-io/#comments</comments>
		<pubDate>Wed, 27 Sep 2006 06:35:02 +0000</pubDate>
		<dc:creator>pigfoot</dc:creator>
				<category><![CDATA[Develop]]></category>
		<category><![CDATA[Kernel]]></category>

		<guid isPermaLink="false">http://weblog.pigfoot.org/pigfoot/2006/09/27/posix-asynchronous-io/</guid>
		<description><![CDATA[POSIX Asynchronous I/O, from OSNews. Used judiciously, asynchronous I/O (short for AIO) can provide a significant speed benefit, says David Chisnall. Perhaps enough to help your program overcome the fact that modern processors can really zoom, while hard drives still drag. Code using AIO on Linux must be linked with -lrt to provide support for [...]]]></description>
			<content:encoded><![CDATA[<p><a target="_blank" href="http://www.informit.com/articles/article.asp?p=607373">POSIX Asynchronous I/O</a>, from <a target="_blank" href="http://www.osnews.com/story.php?news_id=15955">OSNews</a>.</p>
<p>Used judiciously, asynchronous I/O (short for AIO) can provide a significant speed benefit, <a target="_blank" href="http://www.informit.com/articles/article.asp?p=607373">says David Chisnall</a>. Perhaps enough to help your program overcome the fact that modern processors can really zoom, while hard drives still drag.</p>
<p>Code using AIO on Linux must be linked with <strong>-lrt</strong> to provide support for the POSIX real-time extensions, which suggests it is not a kernel I/O mechanism, especially since Kernel AIO Support for Linux has not been merged into the mainline kernel yet.</p>
<p>After studying the <a target="_blank" href="http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/rt/?cvsroot=glibc">rt extensions in glibc source</a>, I <a target="_blank" href="http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/sysdeps/pthread/?cvsroot=glibc">found</a> how POSIX AIO is implemented in glibc: a thread pool plus a request waiting queue. In addition, helper threads are spawned to manage the requests, so the implementation is not trivial.</p>
<p>There is something interesting about how allocated resources are freed. See the function <font face="Courier New">libc_freeres_fn()</font> in <a target="_blank" href="http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/sysdeps/pthread/aio_misc.c?cvsroot=glibc">aio_misc.c</a>. This is also why there is no need for a &#8220;<em>destroy</em>&#8221; API.</p>
<p><a target="_blank" href="http://www.freebsd.org/">FreeBSD</a>, by contrast, has <a target="_blank" href="http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/syscalls.c?rev=1.207&#038;content-type=text/x-cvsweb-markup">a family of AIO system calls</a>. I have no experience with AIO on FreeBSD, but it seems to have kernel support. If so, would it be more efficient than the Linux implementation as it stands today?</p>
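<p>As a rough illustration of the POSIX AIO interface discussed in this post, here is a minimal sketch in C that submits one read and waits for it. The helper name read_with_aio is made up for this example; aio_read(), aio_suspend(), aio_error(), and aio_return() are the standard POSIX calls. As noted above, it may need to be linked with -lrt.</p>

```c
#include <aio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: queue one asynchronous read, then block until it
 * completes. Returns the number of bytes read, or -1 with errno set. */
static ssize_t read_with_aio(int fd, void *buf, size_t len, off_t offset)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (aio_read(&cb) != 0) /* queues the request and returns at once */
        return -1;

    /* A real program would do other work here; this sketch just waits. */
    const struct aiocb *const list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS) {
        if (aio_suspend(list, 1, NULL) != 0 && errno != EINTR)
            return -1;
    }

    int err = aio_error(&cb);    /* 0 on success, else an errno value */
    ssize_t n = aio_return(&cb); /* collect the result exactly once */
    if (err != 0) {
        errno = err;
        return -1;
    }
    return n;
}
```

<p>Waiting right after submitting defeats the purpose of AIO, of course; the point is only to show the request lifecycle: fill in a struct aiocb, submit it, check aio_error(), then collect the result with aio_return().</p>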
]]></content:encoded>
			<wfw:commentRss>http://weblog.pigfoot.org/pigfoot/2006/09/27/posix-asynchronous-io/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Linux Per-Process Syscall Hooking</title>
		<link>http://weblog.pigfoot.org/pigfoot/2006/09/06/linux-per-process-syscall-hooking/</link>
		<comments>http://weblog.pigfoot.org/pigfoot/2006/09/06/linux-per-process-syscall-hooking/#comments</comments>
		<pubDate>Wed, 06 Sep 2006 05:27:28 +0000</pubDate>
		<dc:creator>pigfoot</dc:creator>
				<category><![CDATA[C/C++]]></category>
		<category><![CDATA[Develop]]></category>
		<category><![CDATA[Kernel]]></category>

		<guid isPermaLink="false">http://weblog.pigfoot.org/pigfoot/2006/09/06/linux-per-process-syscall-hooking/</guid>
		<description><![CDATA[Linux Per-Process Syscall Hooking, by Pluf. This document describes a new syscall hooking technique for Linux systems and exposes how it can be implemented as part of a virus or a backdoor in order to take full control over an userland application. Although there are some well-known methods for hooking functions, they are mostly based [...]]]></description>
			<content:encoded><![CDATA[<p><a target="_blank" href="http://www.7a69ezine.org/Linux_PerProcess_Syscall_Hooking.txt">Linux Per-Process Syscall Hooking</a>, by Pluf.</p>
<blockquote><p>This document describes a new syscall hooking technique for Linux systems and exposes how it can be implemented as part of a virus or a backdoor in order to take full control over an userland application.</p>
<p>Although there are some well-known methods for hooking functions, they are mostly based on the ELF format itself.</p>
<p>This technique is focused on those pieces of code that are externally called by the main program and invoke a system call or system service.</p>
<p>A simple implementation of this hooking mechanism has been developed as a result of the research and it is included with the article.</p>
<p>This code provided does not have all the features you wish but includes the required ones, is not a real backdoor but a simple proof of concept, perfect to write your own one.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://weblog.pigfoot.org/pigfoot/2006/09/06/linux-per-process-syscall-hooking/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Van Jacobson&#8217;s network channels</title>
		<link>http://weblog.pigfoot.org/pigfoot/2006/08/14/van-jacobsons-network-channels/</link>
		<comments>http://weblog.pigfoot.org/pigfoot/2006/08/14/van-jacobsons-network-channels/#comments</comments>
		<pubDate>Mon, 14 Aug 2006 11:19:55 +0000</pubDate>
		<dc:creator>pigfoot</dc:creator>
				<category><![CDATA[Develop]]></category>
		<category><![CDATA[Kernel]]></category>

		<guid isPermaLink="false">http://weblog.pigfoot.org/pigfoot/2006/08/14/van-jacobsons-network-channels/</guid>
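		<description><![CDATA[Van Jacobson&#8217;s network channels, from LWN.net. At linux.conf.au at the end of January this year, Van Jacobson presented his idea of network channels, which set off a series of discussions. Although there are quite a few obstacles to overcome, there should be plenty to learn from the design ;-) Van&#8217;s slides [PDF] point out that the traditional networking stack, as implemented today, has become the &#8220;Standard Model&#8221;. In the Linux kernel, the implementation looks roughly like this (from Van&#8217;s slides): When the network card receives a packet, the kernel gets an interrupt and calls the ISR, or the softirq handler registered for NET_RX_SOFTIRQ (usually the driver; this may also be a tasklet). The handler pulls the [...]]]></description>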
			<content:encoded><![CDATA[<p><a target="_blank" href="http://lwn.net/Articles/169961/">Van Jacobson&#8217;s network channels</a>, from <a target="_blank" href="http://lwn.net/">LWN.net</a>.</p>
<p>At linux.conf.au at the end of January this year, Van Jacobson <a target="_blank" href="http://lwn.net/Articles/169961/">presented his idea of network channels</a>, which set off a series of discussions. Although there are <a target="_blank" href="http://lwn.net/Articles/192767/">quite a few obstacles to overcome</a>, there should be plenty to learn from the design ;-)</p>
<p><a target="_blank" href="http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf">Van&#8217;s slides [PDF]</a> point out that the traditional networking stack, as implemented today, has become the &#8220;Standard Model&#8221;. In the Linux kernel, the implementation looks roughly like this (from Van&#8217;s slides):</p>
<p><img title="net channel 1" alt="net channel 1" src="http://static.flickr.com/93/214883500_02309fe89b.jpg" /></p>
<p>When the network card receives a packet, the kernel gets an interrupt and calls the <acronym title="Interrupt Service Routine, a software routine that is executed in response to an interrupt"><a target="_blank" href="http://en.wikipedia.org/wiki/Interrupt_handler">ISR</a></acronym>, or the softirq handler registered for NET_RX_SOFTIRQ (usually the driver; this may also be a tasklet). The handler pulls the packet in according to how its hardware works, assembles it into an <acronym title="Socket Buffer">skb</acronym>, and then calls net/core/dev.c:netif_receive_skb(). netif_receive_skb() inspects the payload and demultiplexes it. For example, IP packets should go to net/ipv4/ip_input.c:ip_rcv(). The socket layers further up (such as UDP/TCP) also refer to this skb and, as everyone knows, TCP additionally has to reassemble a byte stream.</p>
<p>A design like this may have some drawbacks. To preserve the original meaning, I quote the article:</p>
<ul>
<li><strong>Passing network packets through multiple layers of the kernel</strong>.<br />
When a packet arrives, the network card&#8217;s interrupt handler begins the task of feeding the packet to the kernel. The remainder of the work may well be performed at software interrupt level within the driver (in a tasklet, perhaps). The core network processing happens in another software interrupt. Copying the data (an expensive operation in itself) to the application happens in kernel context. Finally the application itself does something interesting with the data. The context changes are expensive, and if any of these changes causes the work to move from one CPU to another, a big cache penalty results. Much work has been done to improve CPU locality in the networking subsystem, but much remains to be done.</li>
<li><strong>Locking is expensive</strong>.<br />
Taking a lock requires a cross-system atomic operation and moves a cache line between processors. Locking costs have led to the development of lock-free techniques like <a target="_blank" href="http://lwn.net/Articles/22818/">seqlocks</a> and <a target="_blank" href="http://lwn.net/Articles/37889/">read-copy-update (RCU)</a>, but the networking stack (like the rest of the kernel) remains full of locks.</li>
<li><strong>The networking code makes extensive use of queues implemented with doubly-linked lists</strong>.<br />
These lists have poor cache behavior since they require each user to make changes (and thus move cache lines) in multiple places.</li>
</ul>
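<p>So, to improve networking scalability, the first priority is to eliminate locking and shared data. Van applies the end-to-end principle to get there: hand the data to the application as directly as possible instead of letting it wait anywhere in the kernel. To that end he designed the net channel, a circular buffer (presumably a circular FIFO queue implemented with an array), to replace the skb and the queues currently used in the networking stack. For example, where a softirq used to be needed (driver -> socket), switching to a netchannel noticeably reduces the amount of locking, which in turn improves scalability.</p>
<p>The first problem this approach runs into, however, is that sending data straight from the packet to the application effectively removes the points where <a target="_blank" href="http://www.netfilter.org/">netfilter</a> used to hook in. Adding netfilter support back in means <a target="_blank" href="http://lwn.net/Articles/192767/">the advantage gained is completely lost</a>.</p>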
<p>One thing I don&#8217;t understand, though: why is a fairly simple circular buffer a &#8220;<strong>Cache aware, cache friendly queue</strong>&#8221;? Is it because it is implemented with an array? :p</p>
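<p>One plausible reading of that question is sketched below in C: an array-backed circular FIFO keeps its slots contiguous, the producer only ever writes head, and the consumer only ever writes tail, so there are no list pointers to chase or update in several places. This is not Van Jacobson&#8217;s code. The names ring, ring_put, ring_get, and RING_SLOTS are invented for illustration, and the sketch is single-threaded; a real SMP version would also need memory barriers or atomics.</p>

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8u /* must be a power of two so masking can wrap */

/* Hypothetical array-backed circular FIFO with one producer index and
 * one consumer index. The indices grow without bound and are masked on
 * use, which relies on well-defined unsigned wraparound. */
struct ring {
    unsigned head; /* next slot to fill; written only by the producer */
    unsigned tail; /* next slot to drain; written only by the consumer */
    void *slot[RING_SLOTS];
};

/* Producer side: returns false when the ring is full. */
static bool ring_put(struct ring *r, void *item)
{
    if (r->head - r->tail == RING_SLOTS)
        return false;
    r->slot[r->head & (RING_SLOTS - 1u)] = item;
    r->head++;
    return true;
}

/* Consumer side: returns NULL when the ring is empty. */
static void *ring_get(struct ring *r)
{
    if (r->head == r->tail)
        return NULL;
    void *item = r->slot[r->tail & (RING_SLOTS - 1u)];
    r->tail++;
    return item;
}
```

<p>Compared with a doubly-linked list, enqueueing here touches one array slot and one index, which fits the complaint in the quoted article that list-based queues force changes in multiple places.</p>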
]]></content:encoded>
			<wfw:commentRss>http://weblog.pigfoot.org/pigfoot/2006/08/14/van-jacobsons-network-channels/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
