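akr's document on handling temporary files

A well-written piece. Read it.

http://www.ipa.go.jp/security/fy20/reports/tech1-tg/2_05.html

It gives a clear explanation of a point made in Ulrich's article on efficient directory reading code, which I excerpted here a little while ago: the part about how using openat relates to security.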

linux | 2009-09-28 (Mon) 15:03:20

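[Memo] How to use the perf command

Ingo explained on LKML how to use perf. Some of the options only work on the -tip tree, so I have not been able to try them yet; I am pasting the mail here as a memo.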


btw., if you run -tip and have these enabled:

  CONFIG_PERF_COUNTER=y 
  CONFIG_EVENT_TRACING=y

  cd tools/perf/
  make -j install

... then you can use a couple of new perfcounters features to 
measure scheduler latencies. For example:

  perf stat -e sched:sched_stat_wait -e task-clock ./hackbench 20

Will tell you how many times this workload got delayed by waiting 
for CPU time.

You can repeat the workload as well and see the statistical 
properties of those metrics:

 aldebaran:/home/mingo> perf stat --repeat 10 -e \
              sched:sched_stat_wait:r -e task-clock ./hackbench 20
 Time: 0.251
 Time: 0.214
 Time: 0.254
 Time: 0.278
 Time: 0.245
 Time: 0.308
 Time: 0.242
 Time: 0.222
 Time: 0.268
 Time: 0.244

 Performance counter stats for './hackbench 20' (10 runs):

          59826  sched:sched_stat_wait    #      0.026 M/sec   ( +-   5.540% )
    2280.099643  task-clock-msecs         #      7.525 CPUs    ( +-   1.620% )

    0.303013390  seconds time elapsed   ( +-   3.189% )

To get scheduling events, do:

 # perf list 2>&1 | grep sched:
  sched:sched_kthread_stop                   [Tracepoint event]
  sched:sched_kthread_stop_ret               [Tracepoint event]
  sched:sched_wait_task                      [Tracepoint event]
  sched:sched_wakeup                         [Tracepoint event]
  sched:sched_wakeup_new                     [Tracepoint event]
  sched:sched_switch                         [Tracepoint event]
  sched:sched_migrate_task                   [Tracepoint event]
  sched:sched_process_free                   [Tracepoint event]
  sched:sched_process_exit                   [Tracepoint event]
  sched:sched_process_wait                   [Tracepoint event]
  sched:sched_process_fork                   [Tracepoint event]
  sched:sched_signal_send                    [Tracepoint event]
  sched:sched_stat_wait                      [Tracepoint event]
  sched:sched_stat_sleep                     [Tracepoint event]
  sched:sched_stat_iowait                    [Tracepoint event]

stat_wait/sleep/iowait would be the interesting ones, for latency 
analysis.

Or, if you want to see all the specific delays and want to see 
min/max/avg, you can do:

  perf record -e sched:sched_stat_wait:r -f -R -c 1 ./hackbench 20
  perf trace

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
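
The mail mentions looking at min/max/avg. perf computes its own statistics, but the same kind of summary can be taken by hand over the hackbench "Time:" lines shown above. The following is only a minimal sketch of that idea; the struct and function names are made up for illustration and have nothing to do with perf's internals.

```c
#include <stddef.h>

/* Summary of a set of timing samples (illustrative names, not a perf API). */
struct time_summary {
    double min;
    double max;
    double avg;
};

/* Compute min/max/avg over n samples; n must be at least 1. */
struct time_summary summarize_times(const double *t, size_t n)
{
    struct time_summary s = { t[0], t[0], 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (t[i] < s.min)
            s.min = t[i];
        if (t[i] > s.max)
            s.max = t[i];
        sum += t[i];
    }
    s.avg = sum / (double)n;
    return s;
}
```

Fed the ten "Time:" values from the mail, this gives min 0.214, max 0.308 and an average of about 0.2526. Note that these are hackbench's own timings, not the "seconds time elapsed" figure that perf stat prints.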

linux | 2009-09-23 (Wed) 02:08:36

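On efficient directory reading code

Ulrich Drepper has written an entry on his blog about reading directories efficiently.
That said, his proposed improvements depend heavily on linux+glibc, so I suspect few people will actually be able to put them into practice (wry smile).

Original article: http://udrepper.livejournal.com/18555.html

Excerpts follow.

Bad code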


  DIR *dir = opendir(some_path);
  struct dirent *d;
  struct dirent d_mem;
  /* readdir_r() takes the DIR stream first and sets d to NULL at end of directory */
  while (readdir_r(dir, &d_mem, &d) == 0 && d != NULL) {
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/%s/somefile", some_path, d->d_name);
    int fd = open(path, O_RDONLY);
    if (fd != -1) {
      ... do something ...
      close (fd);
    }
  }
  closedir(dir);

Recommended

  DIR *dir = opendir(some_path);
  int dfd = dirfd(dir);
  struct dirent64 *d;
  while ((d = readdir64(dir)) != NULL) {
    if (d->d_type != DT_DIR && d->d_type != DT_UNKNOWN)
      continue;
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/somefile", d->d_name);
    int fd = openat(dfd, path, O_RDONLY);
    if (fd != -1) {
      ... do something ...
      close (fd);
    }
  }
  closedir(dir);

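Key points

readdir_r() is pointless here. It exists so that multiple threads can read the same directory stream, but dir is a local variable, so it can never conflict with another thread.
Use readdir64(), not readdir(), unless you want to cry over files larger than 2GB.
On Linux the name length limit applies to ~~individual file names~~ the length of the string passed to open() and friends, not to the absolute path (with links and the like around, the kernel could not check that anyway). So do not use PATH_MAX carelessly. What happens if some_path is very close to PATH_MAX in length? Use openat().
Prepending some_path and then calling open is racy in the first place. Another process can re-point a link while the path is being walked, which makes it a security hole. Use openat().
dirent64 has a d_type field, so you can tell whether an entry is a directory without opening it.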
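
For reference, the points above can be pulled together into something that compiles. This is a sketch of my own, not Drepper's code: it opens each subdirectory with openat() and then opens the file relative to that descriptor, so no path buffer (and no PATH_MAX) is needed at all. The function name count_subdirs_with_file and the "count the hits" behavior are assumptions made purely for illustration, and unlike the excerpt it skips "." and "..". It calls plain readdir(); on glibc, building with -D_FILE_OFFSET_BITS=64 selects the 64-bit variant.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Count the subdirectories of some_path that contain an openable file
 * called name. Returns -1 if some_path itself cannot be opened.
 * (Hypothetical helper, for illustration only.) */
int count_subdirs_with_file(const char *some_path, const char *name)
{
    DIR *dir = opendir(some_path);
    if (dir == NULL)
        return -1;

    int dfd = dirfd(dir);   /* owned by dir; closedir() releases it */
    int count = 0;
    struct dirent *d;

    while ((d = readdir(dir)) != NULL) {
        if (strcmp(d->d_name, ".") == 0 || strcmp(d->d_name, "..") == 0)
            continue;
        /* d_type lets us skip non-directories without opening them;
         * DT_UNKNOWN means the filesystem did not say, so try anyway. */
        if (d->d_type != DT_DIR && d->d_type != DT_UNKNOWN)
            continue;

        /* Open relative to dfd instead of rebuilding a path from some_path. */
        int subfd = openat(dfd, d->d_name, O_RDONLY | O_DIRECTORY);
        if (subfd == -1)
            continue;

        int fd = openat(subfd, name, O_RDONLY);
        if (fd != -1) {
            count++;        /* "... do something ..." would go here */
            close(fd);
        }
        close(subfd);
    }
    closedir(dir);
    return count;
}
```

Because every lookup starts from a descriptor that is already open, renaming or re-linking components of some_path while the loop runs does not change which directory is being scanned.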

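Addendum: I had forgotten the link to the original entry, so I added it.
Addendum 2: A knowledgeable reader pointed out that the phrase "individual file names" is misleading. They are right, so I corrected it.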

linux | 2009-09-21 (Mon) 13:54:17

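I wrote a munin plugin

Recently someone called kosaki made the claim, of doubtful veracity, that "I am one of the developers who most frequently analyzes OOM bugs on LKML. Take it from me: the current OOM output and /proc/meminfo do not have enough fields," and pushed through a patch that adds a large number of fields.

For reference, the current /proc/meminfo looks like this: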

$ cat /proc/meminfo
MemTotal: 6037184 kB
MemFree: 1229820 kB
Buffers: 252336 kB
Cached: 3673464 kB
SwapCached: 0 kB
Active: 1463432 kB
Inactive: 2772100 kB
Active(anon): 315332 kB
Inactive(anon): 16 kB
Active(file): 1148100 kB
Inactive(file): 2772084 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 10174448 kB
SwapFree: 10174448 kB
Dirty: 1092 kB
Writeback: 0 kB
AnonPages: 309728 kB
Mapped: 75640 kB
Shmem: 5620 kB
Slab: 379376 kB
SReclaimable: 339928 kB
SUnreclaim: 39448 kB
KernelStack: 2392 kB
PageTables: 33848 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 13193040 kB
Committed_AS: 788376 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 46116 kB
VmallocChunk: 34359682980 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7104 kB
DirectMap2M: 3137536 kB
DirectMap1G: 3145728 kB

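So, while I was at it, I wrote a plugin that makes munin display the newly added Shmem (shared memory and files on tmpfs) and KernelStack (the stack each process uses inside the kernel).
The source is uploaded at the URL below; anyone interested is free to use it however they like
(though I suspect you will not get much benefit from it before v2.6.32).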

http://github.com/kosaki/munin-plugins
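
To give an idea of what reading these fields involves, here is a minimal sketch in C. It is not the plugin's code (see the repository above for that); the function name meminfo_lookup_kb is invented for illustration, and it works on text that has already been read from /proc/meminfo.

```c
#include <stdio.h>
#include <string.h>

/* Look up "Key:" in meminfo-style text and return its value in kB,
 * or -1 if the key is absent. (Illustrative helper, not the plugin.) */
long meminfo_lookup_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* Require the colon right after the key so that "Active"
         * does not match "Active(anon)". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long value;
            if (sscanf(line + klen + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;   /* step past the newline to the next line */
    }
    return -1;
}
```

On a kernel that lacks the new fields, the lookup for Shmem or KernelStack simply returns -1, which is one way a caller could notice that it is running on an older kernel.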

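The differences from the standard memory usage plugin are as follows:

Removed the app field (internally it displayed whatever memory could not be accounted for from /proc/meminfo as "app", so it was meaningless).

Added an anon field in its place.
Split the slab_cache field into slab(unreclaim) and slab(reclaimable).
reclaimable is the inode cache and dentry cache; as the name says, it is cache that can be dropped when memory gets tight. unreclaim is memory the kernel uses for its internal processing and cannot be discarded.
The two behave completely differently from a system-analysis standpoint and should not be lumped together.
Split the cache field into cache and shmem.
Files on tmpfs and shared memory cannot be discarded and have to be swapped out instead, which is a major difference from regular files. When there is plenty of cache you would normally like to believe that swapping is still far off, so the two should not be lumped together.
Removed active and inactive. Looking at VM internals tells you nothing useful for analysis, and their meaning changed significantly with the Split LRU VM in 2.6.28.
Removed committed as well. Some software, such as the Java VM, reserves dozens of times more virtual memory than it actually uses, so no meaningful value can be obtained. Besides, what is the point of collecting a value that fits so poorly with the prefork architecture of software typically used on web servers?
Changed the swap field from a filled area (like the other memory fields) to a line (like mapped). Swap is not memory.
Added an mlocked field.
Added a dirty field.
Added a writeback field.
Changed the display order substantially. From the bottom of the graph up, the order is now application memory, kernel and I/O memory, then cache.

[Sample image]

By the way, I don't like the colors. Is there no way to change them?

Addendum: I also put a plugin called memory_lru in the same directory on GitHub. It displays the breakdown of the VM LRU, namely
・Active(anon)
・Inactive(anon)
・Active(file)
・Inactive(file)
・Unevictable
and it proves its worth on systems that make heavy use of mlock() or shmctl(SHM_LOCK). The Mlocked field does not tell you whether anon or file pages were mlocked, so you can no longer tell how many pages remain that are not mlocked. Looking at Active(file) + Inactive(file) in this plugin makes that obvious.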

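Addendum 2: I merged memory_ext and memory_lru into a v2. The current fields are as follows:

anon
Anonymous memory. Unlike v1, mlocked pages are excluded, so only swappable pages are counted.
page_tables
kernel_stack
swap_cache
shmem
vmalloc_used
unevictable
Introduced in place of mlocked. It counts all pinned memory: not only mlock but also things like shmctl(SHM_LOCK).
It is also now drawn stacked rather than as a line. To make that possible, anon and cache were changed to exclude pinned memory.
slab_unreclaim
slab_reclaimable
file_cache_dirty
The dirty pages in the cache.
file_cache_clean
The clean pages in the cache. Add this to free to judge whether there is enough spare memory.
free
swap_used
mapped
writeback

The key points are:

Removed the buffer field and merged it into file_cache. Buffers is only displayed because older commands such as free go haywire when the column disappears; separating the two does not enable any analysis whatsoever.
Implemented the file_cache_clean field. "file_cache_clean + free" now gives a quick read on how much headroom the system has.
Converted the Mlock field into an Unevictable field. The locked amount can now be drawn stacked like everything else.

The screen looks like this:

[Screenshot]

The red area (unevictable) grows partway through because I mlock()ed about 1GB of anonymous memory. Some people may find it odd that memory stops counting as anon once it is mlocked, but the aim is to make the system's behavior under heavy load easier to predict by classifying memory as follows:

anon: swappable memory
cache_clean: discardable memory
cache_dirty: memory that becomes discardable once writeback finishes
unevictable: non-discardable memory

Addendum 3: I reverted the v2 described in Addendum 2. Sorry, it is buggy. Unevictable pages can be dirty too, so {A|Ina}ctive(file) - Dirty (that is, Active(file) + Inactive(file) - Dirty) is not a correct formula. This will need yet another field added to the kernel. That is bound to meet opposition, so I will need a strategy.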
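
To make the problem concrete, here is a small worked example of the reverted formula. It is only a sketch: the struct and function names are invented, the first set of numbers is taken from the /proc/meminfo dump earlier in this entry, and the second set is a deliberately exaggerated hypothetical.

```c
/* The handful of /proc/meminfo values (in kB) that the v2 formula used.
 * Illustrative struct, not part of the plugin. */
struct meminfo_kb {
    long active_file;
    long inactive_file;
    long dirty;
    long mem_free;
};

/* The reverted v2 formula: assumes every dirty page sits on the
 * Active(file) or Inactive(file) list, which is not guaranteed. */
long file_cache_clean_v2(const struct meminfo_kb *m)
{
    return m->active_file + m->inactive_file - m->dirty;
}

/* "file_cache_clean + free", the headroom estimate from Addendum 2. */
long headroom_v2(const struct meminfo_kb *m)
{
    return file_cache_clean_v2(m) + m->mem_free;
}
```

With the values from the dump above (Active(file) 1148100, Inactive(file) 2772084, Dirty 1092, MemFree 1229820) the formula yields 3919092 kB of clean cache and 5148912 kB of headroom, which looks plausible. But Dirty also counts dirty pages on the unevictable list, so in a made-up case with Active(file) 100, Inactive(file) 50 and Dirty 200 the "clean cache" comes out as -50 kB, which is obviously nonsense.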

linux | 2009-09-20 (Sun) 00:01:59

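Hmm

It is about time to write the next manuscript. Depressing.
Last time I wrote the article more or less with Nyaruru-san as the target reader and got no reaction at all, so this time I will write normally. No, I ought to.
Once in a while I need to aim for something that appeals to general readers.

Now, what shall I write?

Chat | 2009-09-19 (Sat) 17:05:17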

[Memo][Look into later] Linux load visualization tools

munin, cacti, zabbix and ganglia seem to be the major ones.
I will look into them some other time.

linux | 2009-09-17 (Thu) 09:32:07

devtmpfs

After all those NAKs, it still got merged!!

linux | 2009-09-17 (Thu) 08:24:34

An easy way to make a reverse patch

From http://d.hatena.ne.jp/gunshot/20090915/p1:

interdiff original.patch /dev/null > reversed.patch

linux | 2009-09-15 (Tue) 14:25:55

INGO Why you remove set_user_nice() from kernel/kthread.c

The complaint: the nice value of kernel threads was changed from -5 to 0 without any particular discussion, and now things look like they hang.

Next patсh -
http://www.kernel.org/diff/diffview.cgi?file=%2Fpub%2Flinux%2Fkernel%2F%2Fv2.6%2Fsnapshots%2Fpatch-2.6.31-git2.bz2;z=548

This patch defines the core processes that are working with nice leve equal to
zero , as in the BFS. :)

Why?

VirtualBox, Vmware, QEMU, Firefox, Azureus, and many subsystems and
applications began working with large timeouts. In appearance similar to
hang.

Compare

2.6.31-git2 with KTHREAD_NICE_LEVEL = 0
2.6.31-git2 with KTHREAD_NICE_LEVEL = -5

Diff.

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 5fe7099..eb8751a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -16,6 +16,8 @@
#include
#include

+#define KTHREAD_NICE_LEVEL (-5)
+
static DEFINE_SPINLOCK(kthread_create_lock);
static LIST_HEAD(kthread_create_list);
struct task_struct *kthreadd_task;
@@ -143,6 +145,7 @@ struct task_struct *kthread_create(int (*threadfn)(void
*data),
* The kernel thread should not inherit these properties.
*/
sched_setscheduler_nocheck(create.result, SCHED_NORMAL,
&param);
+ set_user_nice(create.result, KTHREAD_NICE_LEVEL);
set_cpus_allowed_ptr(create.result, cpu_all_mask);
}
return create.result;
@@ -218,6 +221,7 @@ int kthreadd(void *unused)
/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
+ set_user_nice(tsk, KTHREAD_NICE_LEVEL);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
set_mems_allowed(node_possible_map);

Used benchmarks.

# cyclictest and signaltest -
http://www.osadl.org/Realtime-test-utilities-cyclictest-and-s.rt-test-cyclictest-signaltest.0.html

----------
CYCLE TEST
----------

-T: 0 ( 5263) P: 0 I:1000 C: 98345 Min: 8 Act:656014 Avg:287390 Max: 656450
-T: 1 ( 5264) P: 0 I:1500 C: 65680 Min: 7 Act:481583 Avg:236140 Max: 482343
-T: 2 ( 5265) P: 0 I:2000 C: 49358 Min: 7 Act:286071 Avg:111300 Max: 287453
-T: 3 ( 5266) P: 0 I:2500 C: 39453 Min: 7 Act:370028 Avg:116111 Max: 372481
+T: 0 ( 6634) P: 0 I:1000 C: 98888 Min: 7 Act:113011 Avg:28733 Max: 113817
+T: 1 ( 6635) P: 0 I:1500 C: 65953 Min: 8 Act:72013 Avg:25026 Max: 73110
+T: 2 ( 6636) P: 0 I:2000 C: 49468 Min: 6 Act:66076 Avg:17455 Max: 67486
+T: 3 ( 6637) P: 0 I:2500 C: 39580 Min: 7 Act:52514 Avg:12882 Max: 53256

----------
SIGNAL TEST
----------

-T: 0 ( 5285) P: 0 C: 100000 Min: 13 Act: 23 Avg: 30 Max: 9229
-T: 1 ( 5286) P: 0 C: 100000 Min: 13 Act: 99 Avg: 662 Max: 17282
-T: 2 ( 5287) P: 0 C: 100000 Min: 13 Act: 110 Avg: 662 Max: 17294
-T: 3 ( 5288) P: 0 C: 100000 Min: 13 Act: 119 Avg: 662 Max: 18645
+T: 0 ( 6698) P: 0 C: 100000 Min: 13 Act: 22 Avg: 24 Max: 7898
+T: 1 ( 6699) P: 0 C: 100000 Min: 13 Act: 104 Avg: 654 Max: 15728
+T: 2 ( 6700) P: 0 C: 100000 Min: 13 Act: 114 Avg: 654 Max: 15740
+T: 3 ( 6701) P: 0 C: 100000 Min: 13 Act: 124 Avg: 654 Max: 16102

linux | 2009-09-15 (Tue) 11:24:57

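Java's closeDescriptors()

Something kzk told me about.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6336770

When Java creates a new process it tries to close every file descriptor, but the implementation is buggy and can apparently deadlock.

Quoting from a cross-reference site I happened to find:
http://www.jiema.org/xref/openjdk/jdk7/jdk/src/solaris/native/java/lang/UNIXProcess_md.c#293

In the close-all-descriptors step of fork-and-exec it calls, of all things, opendir(). That triggers malloc(). Boom.

Why do UNIX-like OSes not have a closeall() system call? It is a real nuisance.

Addendum: Speaking of Linux only, the FDSize field of /proc/{pid}/status represents the max fd, so I am starting to think it would be fine to close everything from 3 up to FDSize. Someone please verify.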
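
The FDSize idea can be sketched roughly as follows. This is untested speculation in the same spirit as the addendum, not a verified fix: the function names are mine, it reads /proc/self/status with plain open()/read() to stay away from stdio and malloc, and I have not audited every call for async-signal-safety. For what it is worth, proc(5) describes FDSize as the number of file descriptor slots currently allocated, so it should act as an upper bound on open descriptors.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Parse "FDSize:\t<n>" out of /proc/<pid>/status-style text.
 * Returns -1 if the field is missing or malformed. */
long parse_fdsize(const char *text)
{
    const char *p = strstr(text, "FDSize:");
    if (p == NULL)
        return -1;
    p += strlen("FDSize:");
    while (*p == ' ' || *p == '\t')
        p++;
    if (*p < '0' || *p > '9')
        return -1;

    long n = 0;
    while (*p >= '0' && *p <= '9')
        n = n * 10 + (*p++ - '0');
    return n;
}

/* Close descriptors lowfd .. FDSize-1 of the calling process.
 * Returns the FDSize value that was used, or -1 on failure. */
long close_from_fdsize(int lowfd)
{
    char buf[4096];
    int fd = open("/proc/self/status", O_RDONLY);
    if (fd == -1)
        return -1;
    ssize_t len = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (len <= 0)
        return -1;
    buf[len] = '\0';

    long fdsize = parse_fdsize(buf);
    for (long i = lowfd; i < fdsize; i++)
        close((int)i);   /* EBADF for unused slots is expected and ignored */
    return fdsize;
}
```

A child process would call close_from_fdsize(3) between fork and exec. The sketch assumes the status text fits in the 4096-byte buffer and that the process name does not itself contain "FDSize:".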

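Addendum 2: While I was at it I glanced at Ruby's implementation. It only closes up to NOFILE, so on systems where the maximum file descriptor is dynamic it looks like it leaks. I will ask akr or someone for the real story sometime.

Addendum 3: glibc malloc uses pthread_atfork() to take its mutex around fork, so calling malloc() between fork and exec is safe. This looks like a problem only when an external malloc such as tcmalloc is used.

Addendum 4: Several pages recommend sysconf(_SC_OPEN_MAX), but this does not work well on Linux. It merely returns the current limit from getrlimit(), so if the maximum is lowered after files have been opened, the case is not handled properly. I agree with the objection that normal people do not do such a thing.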
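
The failure mode in Addendum 4 can be reproduced with a few lines. This is a sketch under the assumption that the C library implements sysconf(_SC_OPEN_MAX) via getrlimit(RLIMIT_NOFILE), as described above; the function name and the particular numbers in the usage note are arbitrary choices of mine.

```c
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Open a descriptor at high_fd, then lower the soft RLIMIT_NOFILE to
 * new_soft. Returns 1 if sysconf(_SC_OPEN_MAX) now reports a value that
 * no longer covers high_fd while high_fd is still open, 0 if not,
 * and -1 if the setup failed. (Illustrative helper.) */
int fd_above_lowered_limit_stays_open(int high_fd, rlim_t new_soft)
{
    struct rlimit rl;

    int fd = open("/dev/null", O_RDONLY);
    if (fd == -1)
        return -1;
    if (dup2(fd, high_fd) == -1) {
        close(fd);
        return -1;
    }
    if (fd != high_fd)
        close(fd);

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    rl.rlim_cur = new_soft;              /* lower only the soft limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    long open_max = sysconf(_SC_OPEN_MAX);
    int still_open = fcntl(high_fd, F_GETFD) != -1;
    close(high_fd);

    /* A "close 3 .. open_max-1" loop would never reach high_fd. */
    return open_max <= high_fd && still_open;
}
```

Calling it as fd_above_lowered_limit_stays_open(100, 64) should return 1 on a typical Linux system: descriptor 100 survives the limit change, yet a loop bounded by sysconf(_SC_OPEN_MAX) would stop at 63.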

Addendum 5: I wrote before about how only async-signal-safe functions are allowed between fork and exec, so see that entry too.
http://mkosaki.blog46.fc2.com/blog-entry-886.html

Addendum 6: kzk pointed out that this is my 1000th entry. That feels oddly moving.

Addendum 7: See also kzk's related entry.
http://kzk9.net/blog/2009/09/deadlock_on_process_builder.html

Programming | 2009-09-14 (Mon) 22:29:58

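In tears

Running a mainline kernel on Fedora 10, KVM for some reason did something impossible at startup: it issued I/O against the value 12 returned by the version-check ioctl. So I decided to upgrade to Fedora 11.

And then I botched the procedure and lost all my data. I'm finished \(^o^)/

Chat | 2009-09-13 (Sun) 15:41:03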

No erotic content!

An editor I happened to run into yesterday told me emphatically, "No erotic content! At least keep it to moe!"
That is not what I was going for at all.

So moe is OK, though. I have made a new discovery.

Chat | 2009-09-11 (Fri) 08:54:47

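Bookmarks on kernel watch

One of this month's bookmark comments was so amusing that I am putting it on display.

IGA-OS: An interesting read in purely technical terms

Put the other way around, this means the non-technical elements are lacking.
So what people want after all is erotic content! Is that it!?

It made me realize that people expect different things from Kernel Watch.
I do not know whether I can reflect that next month, but I will keep working at it.

Chat | 2009-09-10 (Thu) 13:24:34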

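Don't put ^------- in a ChangeLog

...is what I got scolded for. Hmph.

Strictly speaking, only ^---$ (a line containing exactly three - characters) has a special meaning,
but many people use their own buggy scripts, so drawing something that looks like a ruled line, such as

-------------------

apparently confuses them.
How was I supposed to notice that!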
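
The difference between the strict rule and a sloppy one is easy to show in code. The sloppy version below is only my guess at what such a buggy script might do; I have not seen the actual scripts.

```c
#include <string.h>

/* Strict rule from the text: the line is exactly "---"
 * (an optional trailing newline is ignored). */
int is_separator_strict(const char *line)
{
    size_t len = strcspn(line, "\r\n");
    return len == 3 && strncmp(line, "---", 3) == 0;
}

/* Hypothetical sloppy rule: anything that merely starts with "---". */
int is_separator_sloppy(const char *line)
{
    return strncmp(line, "---", 3) == 0;
}
```

A ruled line of nineteen dashes is rejected by the strict check but accepted by the sloppy one, which is exactly the confusion described above.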

linux | 2009-09-10 (Thu) 09:45:35

2.6.31

It's out.

linux | 2009-09-10 (Thu) 08:22:15

kernel watch, August edition

It has been published.
This time I wrote about nothing but SSDs. Ah, what a hopeless person I am...

http://www.atmarkit.co.jp/flinux/rensai/watch2009/watch08a.html

By the way, for all the stern reminders, the time from handing in the manuscript to publication was about the same as usual. In other words, I think they never expected me to meet the deadline in the first place and built in a large buffer.
(Light bulb!) I see, so I can be even later next time.

W-what did you say!?

(This story is fiction.)

linux | 2009-09-08 (Tue) 21:38:15

Is 7 years of RHEL support still sufficient ?

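I learned via an LWN article (http://lwn.net/Articles/351298/) that some people are unhappy with RHEL's 7-year support period.

Original blog post: http://dag.wieers.com/blog/is-7-years-of-rhel-support-still-sufficient

In summary:
・RHEL5 is currently supported only until March 2014
・The hardware replacement cycle averages 4 years (I suspect that is a US figure)
・As of 2010, RHEL6 has not shipped yet (or administrators are still evaluating it and cannot put it into production)
Taken together, 2010 is a hard year to do business on RHEL: the EOL (End of Life) of RHEL5 arrives before the hardware is due for replacement.

That seems to be the gist. It certainly is a problem.
You really want about 10 years of support. Or rather than wanting it, you need it, if you work backward from the actual interval between major releases (about 3 years).

In other words, Linux used to lack many features, so frequent major upgrades were welcome, and 7 years was enough that systems did not run into EOL. But as Linux matures the interval between major releases keeps getting longer, and there are now cases where 7 years is nowhere near enough.

Interesting.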
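
The arithmetic behind the complaint can be written down in a few lines. This is just my reading of the summary above: the 4-year hardware life and the March 2014 EOL come from the summary, while the purchase months are hypothetical examples.

```c
/* Months by which the hardware outlives OS support.
 * A positive result means the machine is still in service after EOL. */
int months_past_eol(int buy_year, int buy_month, int hw_life_years,
                    int eol_year, int eol_month)
{
    int retire = (buy_year + hw_life_years) * 12 + buy_month;
    int eol = eol_year * 12 + eol_month;
    return retire - eol;
}
```

A machine bought in June 2010 and kept for 4 years is retired in June 2014, three months after the RHEL5 EOL in March 2014; one bought in January 2010 just squeaks through, with two months to spare.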

linux | 2009-09-08 (Tue) 10:00:46

BFS vs. mainline scheduler benchmarks and measurements

Ingo has benchmarked BFS against CFS and is claiming that BFS is not fast at all.

Ingo's mail:


hi Con,

I've read your BFS announcement/FAQ with great interest:

    http://ck.kolivas.org/patches/bfs/bfs-faq.txt

First and foremost, let me say that i'm happy that you are hacking 
the Linux scheduler again. It's perhaps proof that hacking the 
scheduler is one of the most addictive things on the planet ;-)

I understand that BFS is still early code and that you are not 
targeting BFS for mainline inclusion - but BFS is an interesting 
and bold new approach, cutting a _lot_ of code out of 
kernel/sched*.c, so it raised my curiosity and interest :-)

In the announcement and on your webpage you have compared BFS to 
the mainline scheduler in various workloads - showing various 
improvements over it. I have tried and tested BFS and ran a set of 
benchmarks - this mail contains the results and my (quick) 
findings.

So ... to get to the numbers - i've tested both BFS and the tip of 
the latest upstream scheduler tree on a testbox of mine. I 
intentionally didnt test BFS on any really large box - because you 
described its upper limit like this in the announcement:

-----------------------
|
| How scalable is it?
|
| I don't own the sort of hardware that is likely to suffer from 
| using it, so I can't find the upper limit. Based on first 
| principles about the overhead of locking, and the way lookups 
| occur, I'd guess that a machine with more than 16 CPUS would 
| start to have less performance. BIG NUMA machines will probably 
| suck a lot with this because it pays no deference to locality of 
| the NUMA nodes when deciding what cpu to use. It just keeps them 
| all busy. The so-called "light NUMA" that constitutes commodity 
| hardware these days seems to really like BFS.
|
-----------------------

I generally agree with you that "light NUMA" is what a Linux 
scheduler needs to concentrate on (at most) in terms of 
scalability. Big NUMA, 4096 CPUs is not very common and we tune the 
Linux scheduler for desktop and small-server workloads mostly.

So the testbox i picked fits into the upper portion of what i 
consider a sane range of systems to tune for - and should still fit 
into BFS's design bracket as well according to your description: 
it's a dual quad core system with hyperthreading. It has twice as 
many cores as the quad you tested on but it's not excessive and 
certainly does not have 4096 CPUs ;-)

Here are the benchmark results:

  kernel build performance:
     http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg     

  pipe performance:
     http://redhat.com/~mingo/misc/bfs-vs-tip-pipe.jpg

  messaging performance (hackbench):
     http://redhat.com/~mingo/misc/bfs-vs-tip-messaging.jpg  

  OLTP performance (postgresql + sysbench)
     http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

Alas, as it can be seen in the graphs, i can not see any BFS 
performance improvements, on this box.

Here's a more detailed description of the results:

| Kernel build performance
---------------------------

  http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg     

In the kbuild test BFS is showing significant weaknesses up to 16 
CPUs. On 8 CPUs utilized (half load) it's 27.6% slower. All results 
(-j1, -j2... -j15 are slower. The peak at 100% utilization at -j16 
is slightly stronger under BFS, by 1.5%. The 'absolute best' result 
is sched-devel at -j64 with 46.65 seconds - the best BFS result is 
47.38 seconds (at -j64) - 1.5% better.

| Pipe performance
-------------------

  http://redhat.com/~mingo/misc/bfs-vs-tip-pipe.jpg

Pipe performance is a very simple test, two tasks message to each 
other via pipes. I measured 1 million such messages:

   http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test-1m.c

The pipe test ran a number of them in parallel:

   for ((i=0;i<$NR;i++)); do ~/sched-tests/pipe-test-1m & done; wait

and measured elapsed time. This tests two things: basic scheduler 
performance and also scheduler fairness. (if one of these parallel 
jobs is delayed unfairly then the test will finish later.)

[ see further below for a simpler pipe latency benchmark as well. ]

As can be seen in the graph BFS performed very poorly in this test: 
at 8 pairs of tasks it had a runtime of 45.42 seconds - while 
sched-devel finished them in 3.8 seconds.

I saw really bad interactivity in the BFS test here - the system 
was starved for as long as the test ran. I stopped the tests at 8 
loops - the system was unusable and i was getting IO timeouts due 
to the scheduling lag:

 sd 0:0:0:0: [sda] Unhandled error code
 sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
 end_request: I/O error, dev sda, sector 81949243
 Aborting journal on device sda2.
 ext3_abort called.
 EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
 Remounting filesystem read-only

I measured interactivity during this test:

   $ time ssh aldebaran /bin/true
   real  2m17.968s
   user  0m0.009s
   sys   0m0.003s

A single command took more than 2 minutes.

| Messaging performance
------------------------

  http://redhat.com/~mingo/misc/bfs-vs-tip-messaging.jpg  

Hackbench ran better - but mainline sched-devel is significantly 
faster for smaller and larger loads as well. With 20 groups 
mainline ran 61.5% faster.

| OLTP performance
--------------------

http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

As can be seen in the graph for sysbench OLTP performance 
sched-devel outperforms BFS on each of the main stages:

   single client load   (   1 client  -   6.3% faster )
   half load            (   8 clients -  57.6% faster )
   peak performance     (  16 clients - 117.6% faster )
   overload             ( 512 clients - 288.3% faster )

| Other tests
--------------

I also tested a couple of other things, such as lat_tcp:

  BFS:          TCP latency using localhost: 16.5608 microseconds
  sched-devel:  TCP latency using localhost: 13.5528 microseconds [22.1% faster]

lat_pipe:

  BFS:          Pipe latency: 4.9703 microseconds
  sched-devel:  Pipe latency: 2.6137 microseconds [90.1% faster]

General interactivity of BFS seemed good to me - except for the 
pipe test when there was significant lag over a minute. I think 
it's some starvation bug, not an inherent design property of BFS, 
so i'm looking forward to re-test it with the fix.

Test environment: i used latest BFS (205 and then i re-ran under 
208 and the numbers are all from 208), and the latest mainline 
scheduler development tree from:

  http://people.redhat.com/mingo/tip.git/README

Commit 840a065 in particular. It's on a .31-rc8 base while BFS is 
on a .30 base - will be able to test BFS on a .31 base as well once 
you release it. (but it doesnt matter much to the results - there 
werent any heavy core kernel changes impacting these workloads.)

The system had enough RAM to have the workloads cached, and i 
repeated all tests to make sure it's all representative. 
Nevertheless i'd like to encourage others to repeat these (or 
other) tests - the more testing the better.

I also tried to configure the kernel in a BFS friendly way, i used 
HZ=1000 as recommended, turned off all debug options, etc. The 
kernel config i used can be found here:

  http://redhat.com/~mingo/misc/config

( Let me know if you need any more info about any of the tests i
  conducted. )

Also, i'd like to outline that i agree with the general goals 
described by you in the BFS announcement - small desktop systems 
matter more than large systems. We find it critically important 
that the mainline Linux scheduler performs well on those systems 
too - and if you (or anyone else) can reproduce suboptimal behavior 
please let the scheduler folks know so that we can fix/improve it.

I hope to be able to work with you on this, please dont hesitate 
sending patches if you wish - and we'll also be following BFS for 
good ideas and code to adopt to mainline.

Thanks,

	Ingo
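
The mail gives "N% faster" figures without spelling out the formula. Reading it as the ratio of the two latencies minus one lands very close to the quoted lat_tcp and lat_pipe numbers, so that is presumably what was meant; the function below is my assumption, not something taken from the mail.

```c
/* How much faster the quicker result is, in percent, given two
 * latencies measured in the same unit. (Assumed formula.) */
double percent_faster(double slow, double fast)
{
    return (slow / fast - 1.0) * 100.0;
}
```

For lat_pipe, percent_faster(4.9703, 2.6137) comes out at roughly 90.2, and for lat_tcp, percent_faster(16.5608, 13.5528) at roughly 22.2, each within about 0.1 of the 90.1% and 22.1% in the mail.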

Con's reply to that. Not a pleasant tone.


2009/9/7 Ingo Molnar :
> hi Con,

Sigh..

Well hello there.

>
> I've read your BFS announcement/FAQ with great interest:
>
>     http://ck.kolivas.org/patches/bfs/bfs-faq.txt

> I understand that BFS is still early code and that you are not
> targeting BFS for mainline inclusion - but BFS is an interesting
> and bold new approach, cutting a _lot_ of code out of
> kernel/sched*.c, so it raised my curiosity and interest :-)

Hard to keep a project under wraps and get an audience at the same
time, it is. I do realise it was inevitable LKML would invade my
personal space no matter how much I didn't want it to, but it would be
rude of me to not respond.

> In the announcement and on your webpage you have compared BFS to
> the mainline scheduler in various workloads - showing various
> improvements over it. I have tried and tested BFS and ran a set of
> benchmarks - this mail contains the results and my (quick)
> findings.

/me sees Ingo run off to find the right combination of hardware and
benchmark to prove his point.

[snip lots of bullshit meaningless benchmarks showing how great cfs is
and/or how bad bfs is, along with telling people they should use these
artificial benchmarks to determine how good it is, demonstrating yet
again why benchmarks fail the desktop]

I'm not interested in a long protracted discussion about this since
I'm too busy to live linux the way full time developers do, so I'll
keep it short, and perhaps you'll understand my intent better if the
FAQ wasn't clear enough.


Do you know what a normal desktop PC looks like? No, a more realistic
question based on what you chose to benchmark to prove your point
would be: Do you know what normal people actually do on them?


Feel free to treat the question as rhetorical.

Regards,
-ck

/me checks on his distributed computing client's progress, fires up
his next H264 encode, changes music tracks and prepares to have his
arse whooped on quakelive.
linux | 2009-09-08 (Tue) 08:23:09

Con Kolivas Returns, With a Desktop-Oriented Linux Scheduler

http://linux.slashdot.org/story/09/09/06/0433209/Con-Kolivas-Returns-With-a-Desktop-Oriented-Linux-Scheduler?from=rss

Slashdot (the original US site) has picked up the news that Con Kolivas has written yet another new scheduler. I do not expect it to be merged into mainline, but seeing the same faces all the time gets boring, so someone new barging in with code in hand is very welcome.

linux | 2009-09-06 (Sun) 19:48:08

The Turtle and the Hare - A Tale of Two Kernels (Red Hat Summit 2009)

Rik van Riel apparently gave a talk at Red Hat Summit about recent kernels and related topics.

http://www.surriel.com/presentations

For no apparent reason kosaki's name appears in the text, which is how Google picked it up for me. That aside, I found it interesting that the number of patches differs by a factor of 10 between RHEL and upstream.
I will not argue here whether 1000 fixes per minor release is a lot or a little, because the proportion of new drivers for new hardware is unknown. New drivers carry zero regression risk, so unless they are counted separately I do not think you can really talk about quality control.

linux | 2009-09-04 (Fri) 09:13:53

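JLS early-bird discount period extended

I received the following mail from the Linux Foundation.

It might be a stimulating event for Security & Programming Camp participants or Mitou Youth people,
though someone who has never looked at kernel code probably will not understand the presentations at all :-)

==> International technical symposium "The 1st Japan Linux Symposium": early-bird discount period extended!

The early-bird discount period has been extended by two weeks. Those who register by September 15 get the discounted participation fee of 200 USD instead of 300 USD.
Please register soon!
(To those who have already registered: thank you. We look forward to seeing you on the day.)
http://www.linuxfoundation.jp/news-media/announcements/2009/08/jls

==> Academic discount added
To help a wider audience, students in particular, understand Linux development, we have set a special participation fee for students. Students, please take this opportunity to register.
Eligibility: students (enrolled in high school, technical college, university or junior college)
Academic discount participation fee: $50
Enter the academic discount code "JLS_50" when registering.
Please show a student ID or similar at reception on the day.

==> Few keynote-only registrations remain!
Please register soon, or register for the sessions instead.
Session registration also admits you to the keynotes.

linux | 2009-09-03 (Thu) 11:31:42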