Author:kosaki

A word about this month's Kernel Watch

Sorry. It is on break this month.

Programming Camp participants get to hear a special lecture instead, so please forgive me (although only eight people can attend).

Chat | 【2009-07-31(Fri) 10:42:31】 | Trackback:(0) | Comments:(2)

Thanks for all the fish, again

"Thanks for all the fish", which shot to fame as Con Kolivas's parting shot, has turned up yet again in a recent commit log.

I wonder how Linus felt when he wrote this.

Gitweb:     http://git.kernel.org/linus/90a09c9cf78344d18e2438c3b87363b949629fa3
Commit: 90a09c9cf78344d18e2438c3b87363b949629fa3
Parent: 658874f05d040ca96eb5ba9b1c30ce0ff287d762
Author: Linus Torvalds
AuthorDate: Thu Jul 30 16:40:37 2009 -0700
Committer: Linus Torvalds
CommitDate: Thu Jul 30 16:40:37 2009 -0700

Alan doesn't want to maintain tty code any more

Not that anybody can blame him. It's a morass. But hey, it's way
better than it _used_ to be, though, so thanks for all the fish.

Signed-off-by: Linus Torvalds
---
MAINTAINERS | 7 ++-----
1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 66a3865..79471ba 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -155,10 +155,9 @@ S: Maintained
F: drivers/net/r8169.c

8250/16?50 (AND CLONE UARTS) SERIAL DRIVER
-M: Alan Cox
L: [email protected]
W: http://serial.sourceforge.net
-S: Odd Fixes
+S: Orphan
F: drivers/serial/8250*
F: include/linux/serial_8250.h

@@ -4997,9 +4996,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial.git
S: Maintained

TTY LAYER
-M: Alan Cox
-S: Maintained
-T: stgit http://zeniv.linux.org.uk/~alan/ttydev/
+S: Orphan
F: drivers/char/tty_*
F: drivers/serial/serial_core.c
F: include/linux/serial_core.h
--
To unsubscribe from this list: send the line "unsubscribe git-commits-head" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html




linux | 【2009-07-31(Fri) 09:31:33】 | Trackback:(0) | Comments:(2)

A new GCC runtime library license snag?

http://lwn.net/Articles/343608/rss

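The story: compile a GPLv2 program with a new gcc and it can no longer be redistributed. As is well known, gcc (silently) links libgcc at compile time. libgcc contains things like division support. If I remember correctly, the helper code for C++ exceptions lives there too.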
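A rough illustration of the "division support" part, as a sketch rather than anything taken from the LWN article: on a 32-bit target with no native 64-bit divide instruction, GCC typically compiles a 64-bit division like the one below into a call to a libgcc helper such as __udivdi3. The function name div64 is made up for this example.

```c
#include <stdint.h>

/* 64-bit unsigned division. On a 32-bit x86 target GCC usually does
 * not inline this; it emits a call into libgcc (e.g. __udivdi3). */
uint64_t div64(uint64_t a, uint64_t b)
{
    return a / b;
}
```

Assuming a 32-bit capable toolchain is installed, compiling this with something like `gcc -m32 -S` and searching the assembly for __udivdi3 should show the implicit dependency.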

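Because it is linked in silently, its terms cannot be made too strict. So it is GPL, but with a special exception that allows linking it with proprietary-licensed code.

The license of this GCC runtime became GPLv3 as of gcc 4.4. Proprietary licenses are still rescued by the existing exception clause, so they are fine. GPLv2, however, forbids changing the license on redistribution, and GPLv2 and GPLv3 are incompatible, so redistribution automatically becomes a license violation.
Or so the argument seems to go.

This looks like trouble for the distributions. Throwing away every GPLv2 program is not an option for them.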


linux | 【2009-07-30(Thu) 10:08:25】 | Trackback:(0) | Comments:(5)

DRBD in linux-next

Huh?
Apparently it got merged into linux-next at some point. "It does not build" mails are flying around.

When was the merge ever discussed?

linux | 【2009-07-28(Tue) 14:25:40】 | Trackback:(0) | Comments:(0)

The Apollo 11 source code

http://d.hatena.ne.jp/KZR/20090727/p2

The Apollo 11 source code has a spot commented as "temporary", and apparently everyone is piling on with "don't fly to the moon like that". Funny.

# Page 801
CAF TWO # WCHPHASE = 2 ---> VERTICAL: P65,P66,P67
TS WCHPHOLD
TS WCHPHASE
TC BANKCALL # TEMPORARY, I HOPE HOPE HOPE
CADR STOPRATE # TEMPORARY, I HOPE HOPE HOPE
TC DOWNFLAG # PERMIT X-AXIS OVERRIDE
ADRES XOVINFLG
TC DOWNFLAG
ADRES REDFLAG
TCF VERTGUID


They went all the way to the moon with the temporary code still in place!



Chat | 【2009-07-28(Tue) 10:55:03】 | Trackback:(0) | Comments:(0)

Come to think of it

The upstream man pages get fixed right away, yet I still have no reply from JM (the Japanese manual project).
A difference in culture, I suppose.

Chat | 【2009-07-28(Tue) 10:19:19】 | Trackback:(0) | Comments:(0)

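man-pages-3.22 released

I wondered why the man-pages release mail was cc'ed to me. It seems to be because the mistake I pointed out earlier in the poll man page (it lists EBADF, but no code returns it) has been fixed.
By the way, I wish he would not paste people's mail verbatim into the change description. It is embarrassing (>_<

Addendum: come to think of it, the fork part was my fix too.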
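The behaviour behind that fix is easy to check from user space. The sketch below, with a made-up helper name poll_closed_fd, polls a descriptor that has just been closed. On Linux, poll() is expected to succeed and flag the entry with POLLNVAL in revents rather than fail with EBADF.

```c
#include <poll.h>
#include <unistd.h>

/* Poll a descriptor that is known to be closed.
 * Returns poll()'s return value and stores revents through the pointer. */
int poll_closed_fd(short *revents)
{
    int fds[2];
    struct pollfd pfd;
    int ret;

    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);
    close(fds[1]);

    pfd.fd = fds[0];          /* no longer a valid descriptor */
    pfd.events = POLLIN;
    pfd.revents = 0;

    ret = poll(&pfd, 1, 0);   /* timeout 0: return immediately */
    *revents = pfd.revents;
    return ret;
}
```

A caller would look at the POLLNVAL bit in the returned revents instead of checking errno for EBADF.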

Gidday,

The Linux man-pages maintainer proudly announces:

  man-pages-3.22.tar.gz - man pages for Linux

This release is now available for download at:

 http://www.kernel.org/pub/linux/docs/man-pages
 or ftp://ftp.kernel.org/pub/linux/docs/man-pages

The online changelog is available at
http://www.kernel.org/doc/man-pages/changelog.html
(blogged at
http://linux-man-pages.blogspot.com/2009/07/man-pages-322-is-released.html )
and the current version of the pages is browsable at
http://www.kernel.org/doc/man-pages/

You are receiving this message either because:

a) You contributed to the content of this release.

b) You are subscribed to [email protected] (*).

c) I have information (possibly inaccurate) that you are the maintainer of
a translation of the manual pages, or are the maintainer of the manual
pages set in a particular distribution, or have expressed interest in
helping with man-pages maintenance, or have otherwise expressed interest in
being notified about man-pages releases.  If you don't want to receive such
messages from me, or you know of some other translator or maintainer who
may want to receive such notifications, send me a message.

Cheers,

Michael

==================== Changes in man-pages-3.22 ====================

Released: 2009-07-25, Munich


Contributors
------------

The following people contributed notes, ideas, or patches that have
been incorporated in changes in this release:

Adrian Dewhurst
Alexander Lamaison
Bryan Østergaard
Christopher Head
Doug Goldstein
Florentin Duneau
Gokdeniz Karadag
Jeff Moyer
KOSAKI Motohiro
Lucian Adrian Grijincu
Mark Hills
Michael Kerrisk
Mike Frysinger
Petr Baudis
Reimar Döffinger
Ricardo Garcia
Rui Rlex
Shachar Shemesh
Tolga Dalman
ku roi
sobtwmxt

Apologies if I missed anyone!


Changes to individual pages
---------------------------

clone.2
   Michael Kerrisk
       Rewrite crufty text about number of args in older version of clone()
               Some bit rot had crept in regarding the discussion of the
               number of arguments in older versions of this syscall.
               Simplify the text to just say that Linux 2.4 and earlier
               didn't have ptid, tls, and ctid arguments.

               See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=533868
   Michael Kerrisk
       Fix version number for CLONE_NEWIPC
           It's 2.6.19, not 2.4.19.
   Michael Kerrisk
       Fix errors in argument names in text (ptid, ctd)

execve.2
   Mike Frysinger
       Remove erroneous statement that pending signal set is cleared
       on execve(2).

fcntl.2
   Michael Kerrisk
       The kernel source file mandatory.txt is now mandatory-locking.txt
   Michael Kerrisk
       The Documentation/* files are now in Documentation/filesystems

flock.2
   Michael Kerrisk
       Remove unneeded reference to Documentation/mandatory.txt
           Mandatory locks are only implemented by fcntl() locking
   Michael Kerrisk
       The Documentation/* files are now in Documentation/filesystems

fork.2
   Jeff Moyer
       Document fork() behaviour for the Linux native AIO io_context
           It was noted on lkml that the fork behaviour is documented
           for the POSIX AIO calls, but not for the Linux native calls.
           Here is a patch which adds a small blurb that folks will
           hopefully find useful.

           Upon fork(), the child process does not inherit the
           io_context_t data structures returned by io_setup,
           and thus cannot submit further asynchronous I/O or
           reap event completions for said contexts.

getdents.2
   Michael Kerrisk
       The d_type field is fully supported on Btrfs

mount.2
   Michael Kerrisk
       Document MS_STRICTATIME, update description of MS_RELATIME
           Starting with Linux 2.6.30, the MS_RELATIME behavior became
           the default, and MS_STRICTATIME is required to obtain the
           traditional semantics.

poll.2
   Michael Kerrisk
       Remove EBADF error from ERRORS
           As reported by Motohiro:

           "man poll" describe this error code.

           >ERRORS
           > EBADF  An invalid file descriptor was given in one of the sets.

           but current kernel implementation ignore invalid file descriptor,
           not return EBADF.
           ...

           In the other hand, SUSv3 talk about

           > POLLNVAL
           >  The specified fd value is invalid. This flag is only valid in the
           >  revents member; it shall ignored in the events member.

           and

           > If the value of fd is less than 0, events shall be ignored, and
           > ireevents shall be set to 0 in that entry on return from poll().

           but, no desribe EBADF.
           (see
http://www.opengroup.org/onlinepubs/009695399/functions/poll.html)

           So, I think the implementation is correct.

           Why don't we remove EBADF description?

sigaction.2
   Michael Kerrisk
       EWxpand description of si_utime and si_stime fields of siginfo_t

stat.2
   Michael Kerrisk
       Improve wording of ENOTDIR error

syscalls.2
   Michael Kerrisk
       Ad preadv() and pwritev(), new in kernel 2.6.30

wait.2
   Gokdeniz Karadag
       Document CLD_DUMPED and CLD_TRAPPED si_code values

daemon.3
   Michael Kerrisk
       Clarify discussion of 'noclose' and 'nochdir' arguments

ffs.3
   Petr Baudis
       SEE ALSO: add memchr(3)

fmemopen.3
   Petr Baudis
       Relocate BUGS section to correct position
   Petr Baudis
       NOTES: there is no file descriptor associated with the returned stream
           Alexander Lamaison pointed out that this is not obvious
           from the documentation, citing an example with passing the
           FILE * handle to a function that tries to fstat() its
           fileno() in order to determine the buffer size.
   Michael Kerrisk
       CONFORMING TO: remove note that these functions are GNU extensions
           That sentence is now redundant, since these functions
           are added in POSIX.1-2008.

lockf.3
   Michael Kerrisk
       Clarify relationship between fcntl() and lockf() locking

memchr.3
   Petr Baudis
       SEE ALSO: add ffs(3)

readdir.3
   Michael Kerrisk
       The d_type field is fully supported on Btrfs

setjmp.3
   Mike Frysinger
       Fix typo and clarify RETURN description
           The word "signal" was duplicated in NOTES, and the RETURN
           section refers to setjmp() and sigsetjmp(), and mentions
           longjmp(), but not siglongjmp().

strcmp.3
   Petr Baudis
       SEE ALSO: add strverscmp(3)

strcpy.3
   Mark Hills
       SEE ALSO: Add strdup(3)

complex.7
   Michael Kerrisk
       Add missing header file for example program
   Reimar Döffinger
       Fix type used in example code
       man complex (from release 3.18) contains the following code:
           complex z = cexp(I * pi);
       Reading the C99 standard, "complex" is not a valid type,
       and several compilers (Intel ICC, ARM RVCT) will refuse to compile.
       It should be
           double complex z = cexp(I * pi); instead.

environ.7
   Michael Kerrisk
       Note that last element in environ array is NULL
           See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=528628
   Michael Kerrisk
       Wording fixes

mq_overview.7
   Michael Kerrisk
       Note that mkdir and mount commands here need superuser privilege
   Michael Kerrisk
       Fix example showing contents of /dev/mqueue file

standards.7
   Michael Kerrisk
       Remove references to dated books
           Gallmeister and Lewine are rather old books. Probably,
           there are better books to consult nowadays, and anyway,
           this man page isn't intended to be a bibliography.

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Watch my Linux system programming book progress to publication!
http://blog.man7.org/







linux | 【2009-07-28(Tue) 08:47:57】 | Trackback:(0) | Comments:(0)

C++0x officially becomes C++1x

http://yebo-blog.blogspot.com/2009/07/c2010.html

Dropping concepts is said to be the cause.
Then again, I doubt anyone still expected it to come out within 2009.




Programming | 【2009-07-25(Sat) 15:22:40】 | Trackback:(0) | Comments:(0)

[patch 00/54] [Announce] Microsoft Hyper-V drivers for Linux

About the much-discussed Hyper-V contribution to Linux: it looks like GregKH is handling it on LKML. That means it is almost certain to get into mainline.
Presumably some deal was made along the MS - SUSE line. Surely.


Hi all,

I'm happy to announce, that after many months of discussions, Microsoft
has released their Hyper-V Linux drivers under the GPLv2. Following
this message, will be the patches that add the drivers to the
drivers/staging/ tree, and a whole bunch of cleanups.

It's taken a long road to get here, and I'd like to thank the following
people who made this possible:
- Steve Hemminger for the initial prodding and extreme patience
- Hank Janssen for providing the code and working with me to get it
into a workable and semi-mergable state. His involvement within
Microsoft was also invaluable.
- Sam Ramji for his push within Microsoft to make this happen in a
manner that works with the Linux community.
- Novell for sponsoring my work on the Linux Driver project, without
which, this would not have even been possible.
And there are many others both within Novell and Microsoft, who I do not
want to slight by not naming, but the list would be too long to go into.

These drivers are to enable Linux to work better when running as a guest
on top of the Hyper-V system. There is still a lot of work to do in
getting this into "proper" mergable state, and moving it out of the
staging directory, but Hank and I will be undertaking this task. See
the TODO file in the drivers/staging/hv/ directory if anyone wishes to
help out with this task.

The code should be showing up in the linux-next tree soon, as the
patches are now in my public tree.

If anyone has any questions about this code, please let me and Hank know
about it.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



linux | 【2009-07-23(Thu) 05:03:39】 | Trackback:(0) | Comments:(3)

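A digression on "Zero-day exploit code for the Linux kernel released"

http://slashdot.jp/linux/article.pl?sid=09/07/22/0121226

About this vulnerability: on Linux it was legal for a user process to mmap() address 0, so user-space daemons and the like had a gaping hole as well.
That led to a debate over why anyone needs to mmap address 0 in the first place, with arguments such as vm86 needing it.
For a while the compatibility-matters faction was about to win, but by Linus's ruling, I believe, processes can no longer mmap() address 0 unless they have a special personality.

I am writing this from hazy memory, so it may be completely wrong.


Anyway, in short, gcc's assumption was utter nonsense from a security point of view. It is that simple a problem.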
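The post does not spell out which gcc assumption it means. A commonly cited reading, assumed here, is that once a pointer has been dereferenced the compiler may treat it as non-NULL and delete a later NULL check. The sketch below shows the pattern with made-up struct and function names; it is not the actual kernel code.

```c
#include <stddef.h>

struct sock { int flags; };
struct tun  { struct sock *sk; };

/* The dereference comes before the NULL test. Dereferencing NULL is
 * undefined behaviour, so an optimizing compiler may assume tun is
 * non-NULL from here on and drop the test below. If page 0 is mapped,
 * the dereference does not crash either, and execution continues. */
int risky_poll(struct tun *tun)
{
    struct sock *sk = tun->sk;   /* dereference first */

    if (!tun)                    /* this check may be optimized away */
        return -1;
    return sk->flags;
}

/* Safer ordering: test the pointer before touching it. */
int safe_poll(struct tun *tun)
{
    if (!tun)
        return -1;
    return tun->sk->flags;
}
```

GCC has an option, -fno-delete-null-pointer-checks, that turns off this class of optimization; reordering the check as in safe_poll avoids relying on it.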


Addendum: Eric Paris was grumbling about something, so I will paste it here.
In short, I suppose the point is: how can it be right that turning SELinux ON makes security weaker?


Subject: mmap_min_addr and your local LSM (ok, just SELinux)
From Eric Paris <[email protected]>

Brad Spengler recently pointed out that the SELinux decision on how to
handle mmap_min_addr in some ways weakens system security vs on a system
without SELinux (and in other ways can be stronger). There is a trade
off and a reason I did what I did but I would like ideas and discussion
on how to get the best of both worlds.

With SELinux mapping the 0 page requires an SELinux policy permission,
mmap_zero. Without SELinux mapping the 0 page requires CAP_SYS_RAWIO.
Note that CAP_SYS_RAWIO roughly translates to uid=0 since noone really
does interesting things with capabilities.

The main problem is WINE. I'm told that WINE needs to map the 0 page to
support 16bit applications. On distros without SELinux users must
disable the mmap_min_addr protections for the ENTIRE system if they want
to run WINE.

http://wiki.winehq.org/PreloaderPageZeroProblem

I believe (from reading mailing lists) if you install WINE on ubuntu it
automatically disables these protections. Thus installing wine on
ubuntu disables ALL hardening gains of the mmap_min_addr.

On Fedora, with SELinux, we allow users to run WINE in a domain that has
the SELinux mmap_zero permission and thus other programs/domains, do not
have security weakened. Your daemons, like the web server, are still
unable to map the 0 page. This is different than distros without
SELinux, remember they have to disable protection globally.

But logged in users (by default), under SELinux, are 'unconfined' and
can by their very nature run their program in a domain that allows
mmap_zero. Trying to 'confine' the 'unconfined' user with SELinux is an
open problem which we don't currently even reasonably attempt address on
a broad scale. It's like besieging the user in a gentle mist of water
hoping they won't try to escape.

So in Fedora your web server is a harder entry point to exploit kernel
NULL pointer bugs, but you have no protections against a malicious user.
On Ubuntu if you install WINE your web server and your logged in users
have no hardening. If you do not install WINE non-root is hardened,
anything running as root is not (aka suid apps, aka pulseaudio).

So I was thinking today, wondering how to get the best (or at least
better) of both worlds on an SELinux system. I was considering adding a
second mmap_min_addr_lsm which would typically be equal to
mmap_min_addr. The purpose would be to allow the sysadmin to
individually control DAC/LSM protections. The security checks would
turn (sort of) into

if (addr < mmap_min_addr)
ret |= capable(CAP_SYS_RAWIO);
if (addr < mmap_min_addr_lsm)
ret |= [insert LSM check here]

So on a non-SELinux system users would end up with exactly what they
have today. if you want to run WINE as a normal user you have to set
mmap_min_addr = 0 and then you no longer need CAP_SYS_RAWIO. Not much
else we can do if your distro down support fine grained permissions.

On an SELinux system what this lets me do is default to a stricter
setup, one in which you have to have both CAP_SYS_RAWIO and the selinux
mmap_zero permission. You, out of the box, get protection for both your
malicious logged in user and your web server. Then if a user decides to
run WINE they would turn down mmap_min_addr. This would remove the
requirement that they are root, and leave the system vulnerable to a
malicious user, but would still allow SELinux to protect confined
domains and daemons.

Does anyone see a better way to let users continue to be users while
protecting most people? Yes SELinux is stronger in some areas than
without confining the ability to map the 0 page, but as has be rightly
pointed out it's foolish an broken that SELinux can weaken any
protections.

-Eric





The reply to that from James Morris

I haven't seen a better idea so far.

I strongly believe that we need to maintain the principle, in SELinux and
LSM generally, that the interface is restrictive, i.e. that it can only
further restrict access. It should be impossible, from a design point of
view at least, for any LSM module to authorize more privilege than
standard DAC. This has always been a specific design goal of LSM. (The
capability module is an exception, as it has a fixed security policy and
implements legacy DAC behavior; there's no way to "fix" this).

In this case, we're not dealing with a standard form of access control,
where access to a userland object is being mediated. We're trying to
mediate the ability of a subject to bypass a separate mechanism which aims
to protect the kernel itself from attack via a more fundamental system
flaw. The LSM module didn't create that vulnerability directly, but it
must not allow the vulnerability to be more easily exploited.

The security policy writer should have a guarantee that the worst mistake
they can make is to mess up their own security model; if they can mess up
the base DAC security with MAC policy, we break that guarantee. There's
also an issue of user confidence in the LSM modules, in that they should
not be any worse off security-wise if they enable an enhanced protection
mechanism.

This does not account for kernel bugs in the LSM modules themselves,
obviously, but the same can be said for any kernel code, albeit with less
irony.



The patches Eric Paris posted after that

Patch 1: always check CAP_SYS_RAWIO, regardless of whether CONFIG_SECURITY is set

Subject: [PATCH 1/2] VM/SELinux: require CAP_SYS_RAWIO for all mmap_zero operations

Currently non-SELinux systems need CAP_SYS_RAWIO for an application to mmap
the 0 page. On SELinux systems they need a specific SELinux permission,
but do not need CAP_SYS_RAWIO. This has proved to be a poor decision by
the SELinux team as, by default, SELinux users are logged in unconfined and
thus a malicious non-root has nothing stopping them from mapping the 0 page
of virtual memory.

On a non-SELinux system, a malicious non-root user is unable to do this, as
they need CAP_SYS_RAWIO.

This patch checks CAP_SYS_RAWIO for all operations which attemt to map a
page below mmap_min_addr.

Signed-off-by: Eric Paris
---

include/linux/security.h | 2 --
mm/mmap.c | 10 ++++++++++
mm/mremap.c | 8 ++++++++
mm/nommu.c | 3 +++
security/capability.c | 2 --
5 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/linux/security.h b/include/linux/security.h
index 1459091..f7d198a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -2197,8 +2197,6 @@ static inline int security_file_mmap(struct file *file, unsigned long reqprot,
unsigned long addr,
unsigned long addr_only)
{
- if ((addr < mmap_min_addr) && !capable(CAP_SYS_RAWIO))
- return -EACCES;
return 0;
}

diff --git a/mm/mmap.c b/mm/mmap.c
index 34579b2..37fdc90 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1047,6 +1047,9 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
}
}

+ if ((addr < mmap_min_addr) && !capable(CAP_SYS_RAWIO))
+ return -EACCES;
+
error = security_file_mmap(file, reqprot, prot, flags, addr, 0);
if (error)
return error;
@@ -1657,6 +1660,10 @@ static int expand_downwards(struct vm_area_struct *vma,
return -ENOMEM;

address &= PAGE_MASK;
+
+ if ((address < mmap_min_addr) && !capable(CAP_SYS_RAWIO))
+ return -EACCES;
+
error = security_file_mmap(NULL, 0, 0, 0, address, 1);
if (error)
return error;
@@ -1998,6 +2005,9 @@ unsigned long do_brk(unsigned long addr, unsigned long len)
if (is_hugepage_only_range(mm, addr, len))
return -EINVAL;

+ if ((addr < mmap_min_addr) && !capable(CAP_SYS_RAWIO))
+ return -EACCES;
+
error = security_file_mmap(NULL, 0, 0, 0, addr, 1);
if (error)
return error;
diff --git a/mm/mremap.c b/mm/mremap.c
index a39b7b9..066e73d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -299,6 +299,10 @@ unsigned long do_mremap(unsigned long addr,
if ((addr <= new_addr) && (addr+old_len) > new_addr)
goto out;

+ ret = -EACCES;
+ if ((new_addr < mmap_min_addr) && !capable(CAP_SYS_RAWIO))
+ goto out;
+
ret = security_file_mmap(NULL, 0, 0, 0, new_addr, 1);
if (ret)
goto out;
@@ -407,6 +411,10 @@ unsigned long do_mremap(unsigned long addr,
goto out;
}

+ ret = -EACCES;
+ if ((new_addr < mmap_min_addr) && !capable(CAP_SYS_RAWIO))
+ goto out;
+
ret = security_file_mmap(NULL, 0, 0, 0, new_addr, 1);
if (ret)
goto out;
diff --git a/mm/nommu.c b/mm/nommu.c
index 53cab10..c1f3eff 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -995,6 +995,9 @@ static int validate_mmap_request(struct file *file,
}

/* allow the security API to have its say */
+ if ((addr < mmap_min_addr) && !capable(CAP_SYS_RAWIO))
+ return -EACCES;
+
ret = security_file_mmap(file, reqprot, prot, flags, addr, 0);
if (ret < 0)
return ret;
diff --git a/security/capability.c b/security/capability.c
index f218dd3..a3a5d9b 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -334,8 +334,6 @@ static int cap_file_mmap(struct file *file, unsigned long reqprot,
unsigned long prot, unsigned long flags,
unsigned long addr, unsigned long addr_only)
{
- if ((addr < mmap_min_addr) && !capable(CAP_SYS_RAWIO))
- return -EACCES;
return 0;
}




Patch 2: make selinux_file_mmap ignore the mmap_min_addr tunable and always check the 0 page.
In other words, a patch that makes it impossible to disable the SELinux check with mmap_min_addr=0.


Subject: [PATCH 2/2] SELinux: selinux_file_mmap always enforce mapping the 0 page

Currently SELinux enforcement of controls on the ability to map the 0 page
is determined by the mmap_min_addr tunable. This patch causes SELinux to
ignore the tunable and to always (but ONLY) protect the 0 page.

The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
permissions will always protect the 0 page based on it's mmap_zero
permission.

This allows users who need to disable the mmap_min_addr controls (usual reason
being they run WINE as a non-root user) to do so and still have SELinux
controls preventing confined domains (like a web server) from being able to
map the 0 page.

Note: the additional SELinux restriction will now ONLY protect the 0 page.
CAP_SYS_RAWIO will protect anything between 0 and mmap_min_addr, but SELinux
will only protect between 0 and PAGE_SIZE.

Signed-off-by: Eric Paris
---

include/linux/security.h | 1 -
security/selinux/hooks.c | 2 +-
2 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/include/linux/security.h b/include/linux/security.h
index f7d198a..de774f7 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -91,7 +91,6 @@ struct seq_file;
extern int cap_netlink_send(struct sock *sk, struct sk_buff *skb);
extern int cap_netlink_recv(struct sk_buff *skb, int cap);

-extern unsigned long mmap_min_addr;
/*
* Values used in the task_security_ops calls
*/
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index e65677d..7bbac1d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3034,7 +3034,7 @@ static int selinux_file_mmap(struct file *file, unsigned long reqprot,
int rc = 0;
u32 sid = current_sid();

- if (addr < mmap_min_addr)
+ if (addr < PAGE_SIZE)
rc = avc_has_perm(sid, sid, SECCLASS_MEMPROTECT,
MEMPROTECT__MMAP_ZERO, NULL);
if (rc || addr_only)

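Taken together, the two patches appear to boil down to two independent checks. The following is a user-space model only, assuming an SELinux-enabled system and 4096-byte pages; struct task_model, may_map and MODEL_PAGE_SIZE are hypothetical names, and the two booleans merely stand in for capable(CAP_SYS_RAWIO) and the avc_has_perm() mmap_zero check.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096UL   /* assumed page size */

struct task_model {
    bool has_cap_sys_rawio;      /* stands in for capable(CAP_SYS_RAWIO) */
    bool has_selinux_mmap_zero;  /* stands in for the SELinux mmap_zero permission */
};

/* Returns 0 if the mapping at addr would be allowed, -EACCES otherwise. */
int may_map(const struct task_model *t, unsigned long addr,
            unsigned long mmap_min_addr)
{
    /* Patch 1: capability check, range controlled by the tunable. */
    if (addr < mmap_min_addr && !t->has_cap_sys_rawio)
        return -EACCES;

    /* Patch 2: SELinux check, always guards the 0 page and only that. */
    if (addr < MODEL_PAGE_SIZE && !t->has_selinux_mmap_zero)
        return -EACCES;

    return 0;
}
```

In this model, setting mmap_min_addr to 0 lets an unprivileged WINE user with the SELinux permission map page 0, while a confined daemon without mmap_zero is still refused.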




linux | 【2009-07-22(Wed) 18:47:32】 | Trackback:(1) | Comments:(0)

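On x86 Linux, the maximum disk size is 16TB

Apparently so. Even if you use ext4.
Thinking about it, the 32-bit maximum (4G page indices) x page size (4KB) = 16TB, so the page structures cannot point at anything beyond 16TB. A natural consequence.

Still, the move to 64-bit is not going all that smoothly, while 16TB disks are just around the corner. That is a problem.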
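The arithmetic can be checked with a small sketch. The names are made up, and uint32_t merely stands in for a 32-bit page index; 4096-byte pages are assumed, as in the calculation above.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12   /* 4096-byte pages */

/* Bytes addressable with a 32-bit page index: 2^32 pages * 4096 bytes. */
uint64_t max_cached_bytes(void)
{
    return (UINT64_C(1) << 32) << MODEL_PAGE_SHIFT;
}

/* Page index as a 32-bit host would store it: bits above 32 are lost,
 * so offsets at or beyond 16TB alias pages near the start of the device. */
uint32_t page_index_32(uint64_t pos)
{
    return (uint32_t)(pos >> MODEL_PAGE_SHIFT);
}
```

Here max_cached_bytes() comes out to 2^44 bytes, that is 16TB, and an offset of exactly 16TB maps to the same index as offset 0.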

Addendum: here is the mail this came from


Subject: How to handle >16TB devices on 32 bit hosts ??

Hi,
It has recently come to by attention that Linux on a 32 bit host does
not handle devices beyond 16TB particularly well.

In particular, any access that goes through the page cache for the
block device is limited to a pgoff_t number of pages.
As pgoff_t is "unsigned long" and hence 32bit, and as page size is
4096, this comes to 16TB total.

A filesystem created on a 17TB device should be able to access and
cache file data perfectly providing CONFIG_LBDAF is set.
However if the filesystem caches metadata using the block device,
then metadata beyond 16TB will be a problem.

Access to the block device (/dev/whatever) via open/read/write will
also cause problems beyond 16TB, though if O_DIRECT is used I think
it should work OK (it will probably try to flushed out completely
irrelevant parts of the page cache before allowing the IO, but that
is a benign error case I think).

With 2TB drives easily available, more people will probably try
building arrays this big and we cannot just assume they will only do
it on 64bit hosts.

So the question I wanted to ask really is: is there any point in
allowing >16TB arrays to be created on 32bit hosts, or should we just
disallow them? If we allow them, what steps should we take to make
the possible failure modes more obvious?

As I said, I think O_DIRECT largely works fine on these devices and
we could fix the few irregularities with little effort. So one step
might be to make mkfs/fsck utilities use O_DIRECT on >16TB devices on
32bit hosts.

Given that non-O_DIRECT can fail (e.g. in do_generic_file_read,
index = *ppos >> PAGE_CACHE_SHIFT
will lose data if *ppos is beyond 44 bits) we should probably fail
opens on devices larger than 16TB.... though just failing the open
doesn't help if the device can change size, as dm and md devices can.

I believe ext[234] uses the block device's page cache for metadata, so
they cannot safely be used with >16TB devices on 32bit. Is that
correct? Should they fail a mount attempt? Do they?

Are there any filesystems that do not use the block device cache and
so are not limited to 16TB on 32bit?

Even if no filesystem can use >16TB on 32bit, I suspect dm can
usefully use such a device for logical volume management, and as long
as each logical volume does not exceed 16TB, all should be happy. So
completely disallowing them might not be best.

I suppose we could add a CONFIG option to make pgoff_t be
"unsigned long long". Would the cost/benefit of that be acceptable?

Your thoughts are most welcome.

Thanks,
NeilBrown



linux | 【2009-07-21(Tue) 20:10:40】 | Trackback:(0) | Comments:(4)

The TOMOYO Linux mainline-merge commemorative study session that hinted at a new current in OSS development

http://japan.zdnet.com/news/os/story/0,2000056192,20396891,00.htm

It got written up in a ZDNet article.

linux | 【2009-07-18(Sat) 17:28:26】 | Trackback:(0) | Comments:(2)

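Fix for too many isolated pages merged

This fixes a problem where, if memory ran short while a large number of threads existed, all of them entered the memory reclaim logic at once and caused an unnecessary OOM.
Along with it, /proc/meminfo and the memory usage log printed at OOM time were greatly enhanced.

It made it safely into mmotm.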

linux | 【2009-07-17(Fri) 09:37:47】 | Trackback:(0) | Comments:(0)

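Die, prelink

...is apparently what one discussion is arguing.

http://lwn.net/Articles/341244/

・It is risky: when it fails, the system stops booting
・Library addresses change only once every two weeks, so security goes down
・On recent 32-bit x86 the VDSO is placed at a random address, so the assumption
 prelink relies on, that library load addresses can be computed in advance,
 no longer holds.
 glibc and the VDSO collide quite often
・Given that SELinux has been accepted at all, most users do not care
 that much about performance
・In the end, the only time the effect is noticeable is when using OOo. Is that
 really a reason to touch the whole system?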




linux | 【2009-07-16(Thu) 15:53:11】 | Trackback:(0) | Comments:(0)

This thread is being monitored

This week's kernel podcast
http://www.kernelpodcast.org/2009/07/14/20090709-linux-kernel-podcast/


OOM. Rik van Riel posted a patch aimed at addressing some of the recent OOM situations, as touched upon in yesterday’s podcast. Rik noticed that vmscan can get horribly confused when too many tasks go into direct reclaim, and trigger an OOM situation that is caused because too few pages are on the LRU. Instead, Rik proposes limiting the number of tasks that may enter page reclaim to allow at most half of each inactive list to be isolated at any one time. In yet another semi-related VM thread, Kosaki Motohiro posted version 2 of his patches aimed at helping to track down OOM situations with more information presented to the user in the generated kernel log messages.
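
The throttling rule in that summary can be modelled in a few lines. This is only a sketch of the idea as summarized above, not Rik's actual patch; the function name and parameters are hypothetical. It assumes isolated pages have already been taken off the inactive list, so "more isolated than remaining" means more than half of the original list is isolated.

```c
#include <stdbool.h>

/* Hypothetical model: nr_inactive is what is still on the inactive
 * list, nr_isolated is what reclaimers have pulled off it so far.
 * A task about to enter reclaim should wait while this returns true. */
bool should_throttle_reclaim(unsigned long nr_inactive,
                             unsigned long nr_isolated)
{
    return nr_isolated > nr_inactive;
}
```

With 100 pages to start with, the model lets reclaimers isolate up to 50 of them and makes later arrivals wait beyond that.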




       {    !      _,, -ェェュ、   |
ィ彡三ミヽ  `ヽ     ,ィハミミミミミミミミミヽ、|
彡'⌒ヾミヽ   `ー  /ililハilミilミliliミliliミミヾ|
     ヾ、        /iiiiイ!ヾヾミ、ミニ=ー-ミii|
  _    `ー―' i!ハ:.:.\\_::::::::::::::/:.|   This thread is
彡三ミミヽ        i! ヽ:.:.:.:冫': : :::/,,∠|
彡'   ヾ、    _ノ i!::: ̄二ー:: : ::::ソ ・ ,|   being monitored by
      `ー '    {ヘラ' ・_>シ;テツ"''''"|
 ,ィ彡三ニミヽ  __ノ ヽヘ`" 彡' 〈     |    Kernel Podcast
彡'      ` ̄       `\   ー-=ェっ |
      _  __ ノ  {ミ;ヽ、   ⌒   |
   ,ィ彡'   ̄        ヾミミミミト-- '  |
ミ三彡'        /⌒ / ̄ ̄ | : ::::::::::|
       ィニニ=- '     / i   `ー-(二つ
     ,ィ彡'         { ミi      (二⊃
   //        /  l ミii       ト、二)
 彡'       __,ノ   | ミソ     :..`ト-'
        /          | ミ{     :.:.:..:|
            ノ / ヾ\i、   :.:.:.:.:|
      ィニ=-- '"  /  ヾヾiiヽ、 :.:.:.:.::::|
    /     /  `/ ̄ ̄7ハヾヾ : .:.:.|
   ノ     _/   /   /  |:. :.:.:.:.:.:.:|
      /     /   /   |::.:.:.:.:.:.:.:|


Scary. How do they even know?

linux | 【2009-07-14(Tue) 19:49:24】 | Trackback:(0) | Comments:(0)

RFC for a new Scheduling policy/class in the Linux-kernel

Someone is proposing that we add a deadline scheduler.


Hi all!

This is a proposal for a global [1], deadline driven scheduler for
real-time tasks in the Linux kernel. I thought I should send out an RFC to
gather some feedback instead of wildly hacking away at it.

This proposed scheduler is a modified MLLF (modified Least Laxity First)
called Earliest Failure First (EFF) as it orders tasks according to when
they will miss their deadlines, not when the actual deadline is.

== Motivation and background ==

Deadlines will give the developer greater flexibility and expressiveness
when creating real-time applications. Compared to a priority scheme,
this simplifies the process considerably as it removes the need for
calculating the priorities off-line in order to find the priority-map
that will order the tasks in the correct order. Yet another important
aspect with deadlines instead of priorities, are systems too dynamic to
analyze (a large application with 100s of processes/threads or a system
running more than one rt-application).

* In very large systems, it becomes very difficult to find the correct
set of priorities, even with sophisticated tools, and a slight change
will require a re-calculation of all priorities.

* In very dynamic systems, it can be impossible to analyze the system
off-line, reducing the calculated priorities to best-effort only

* If more than one application runs at the same time, this becomes even
worse.


As a final point, a schedule of tasks with their priorities, is in
almost all scenarios, a result of all deadlines for all tasks. This also
goes for non-rt tasks, even though the concept of deadlines is a bit
more fuzzy here. The problem is that this mapping is a one-way function,
--you cannot determine the deadlines from a set of priorities.

The problem is, how to implement this efficiently in a priority-oriented
kernel, and more importantly, how to extend this to multi-core
systems. For single-core systems, EDF is optimal and can accept tasks up
to 100% utilization and still guarantee that all deadlines are
met. However, on multi-core, this breaks down and the utilization bound
must be set very low if any guarantee should be given (known as "Dhall's
effect").

== Related Work ==

Recently, I've been working away on a pfair-based scheduler (PD^2 to be
exact), but this failed for several reasons [2]. The most prominent being
the sensitivity for timer inaccuracies and very frequent task
preemption. pfair has several good qualities, as it reduces jitter,
scales to many cores and achieves high sustained utilization. However,
the benefits do not outweigh the added overhead and strict requirements
placed on the timer infrastructure.

This latter point is what haunts EDF on multi-core platforms. A global
EDF-US[1/2] cannot exceed (m+1)/2, standard EDF is much
worse. Partitioned can reach higher limits, but is very susceptible to
the bin-packing heuristics. Going fully partitioned will also introduce
other issues like the need for load balancing and more complex deadline
inheritance logic. However, one clear benefit with EDF, is that it will
minimize the number of task-switches, clearly something desirable.

== Proposed algorithm ==

So, with this in mind, my motivation was to find a way to extend a
deadline-driven scheduler to battle Dhall's effect. This can
be done if you look at time in a more general sense than just
deadlines. What you must do, is look at how the expected computation
time needed by a task with respect to the task's deadline compares to
other tasks.

=== Notation ===

- Take a set of tasks with corresponding attributes. This set and their
attributes are called the schedule, 'S' and contains *all* tasks for
the given scheduling class (i.e. all EFF-tasks).

- Consider a multi-core system with 'm' processors.

- Let the i'th task in the schedule be denoted tau_i. [3]

- Each task will run in intervals, each 'round' is called a job. A task
consists of an infinite sequence of jobs. The k'th job of tau_i is
called tau_{i,k}

- Each task has a set of (relative) attributes supplied when the task is
inserted into the scheduler (passed via syscall)
* Period T_i
* Deadline D_i
* WCET C_i

- Each job (tau_{i,k}) has absolute attributes (computed from the relative
tasks-attributes coupled with physical time).
* Release-time r_{i,k}
* Deadline d_{i,k}
* Allocated time so far for a job, C_a(t, tau_{i,k})
When C_a equals WCET, the job's budget is exhausted and it should
start a new cycle. This is tested (see below) by the scheduler.
* Remaining time for the job, C_r(t, tau_{i,k})

- The acceptance function for EFF screens new tasks on their expected
utilization. Depending on the mode and implementation, it can be based
on the period, or on the deadline. The latter will cause firmer
restraints, but may lead to wasted resources.

U = C_i / T_i For SRT (bounded deadline tardiness)
U = C_i / D_i For HRT

- A relative measure, time to failure, ttf, indicates how much time is
left before a job must be scheduled to run in order to avoid a
deadline-miss. This will decrease as time progresses and the job is
not granted CPU time. For tasks currently running on a CPU, this value
will be constant.

Take a job with a WCET of 10ms, it has been allowed to run for 4
ms so far. The deadline is 8 ms away. Then the job must be
scheduled to run within the next 4 ms, otherwise it will not be
able to finish in time.

- An absolute value, time of failure (tof) can also be computed in a
static manner. For tasks not running on a CPU, the allocated time is
static. That means you can take the absolute deadline, subtract the
allocated time and you have the absolute point in time when a given
job will fail to meet its deadline.

=== Outline of scheduler ===

Store tasks in 2 queues. One of size m, containing all the tasks
currently running on the CPUs (queue R). The other will hold all
currently active tasks waiting to execute (queue W).

Queue R is sorted based on ttf (time to failure, the relative time left
until a task will miss its deadline). As a task approaches the
absolute time of failure at the same rate as C_a increases, ttf is
constant. R is only a 'map' of tasks to the CPUs. Position 0 in R
(i.e. smallest ttf) does not imply CPU#0, as the position->CPU mapping will
be quite fluid.

queue W is sorted based on absolute time of failure (tof). Since this is
a fixed point in time, and the tasks in W are not running (C_a is
unchanged), this value is constant.

When a task is scheduled to run, a timer is set at the point in time
where it has exhausted its budget (t_now + WCET - C_a). This is to
ensure that a runaway task does not grab the CPU.

When a new task arrives, it is handled according to the following rules:
- The system has one or more CPUs not running EFF-tasks. Pick any of the
free CPUs and assign the new job there. Set a timer to

- All CPUs are busy, the new task has greater time to failure than the
head of W. The task is inserted into W at the appropriate place.

- All CPUs are busy and the new task has smaller time to failure than
the head of W. The new task is compared to the last task in R. If time
to failure is larger than the task at the tail, it is added to the
head of W.

- If all CPUs are busy, and time to failure is smaller than the tail of
R, the new task is a candidate for insertion. At this point the tasks
must be compared to see if picking one or the other will cause a
deadline-miss. If both will miss the deadline if the other is
scheduled, keep the existing running and place the new at the head of
W (as you will have a deadline-miss anyway unless the task is
picked up by another CPU soon).

- A task running on a CPU with ttf=0 should *never* be preempted with
another task. If all tasks in R have ttf=0, and a newly arrived task
has ttf=0, a deadline-miss is inevitable and switching tasks will only
waste resources.

When a task in R finishes (or is stopped due to the timer-limit), it is
removed from R, and the head of W is added to R, inserted at the
appropriate place.

There has been some discussion lately (in particular on #linux-rt) about
the bandwidth inheritance (BWI) and proxy execution protocol (PEP). It
should be possible to extend EFF to handle both. As a side note, if
anyone has some good information about PEP, I'd like a copy :)

Based on this, I think the utilization can be set as high as M
(i.e. full utilization of all CPUs), but the jitter can probably be
quite bad, so for jitter-sensitive tasks, a short period/deadline should
be used.

There are still some issues left to solve, for instance how to best
handle sporadic tasks, and whether or not deadline-miss should be allowed,
or just 'bounded deadline tardiness'. Either way, EFF should be able to
handle it. Then, there are problems concerning blocking of tasks. One
solution would be BWI or PEP, but I have not had the time to read
properly through those, but from what I've gathered a combination of BWI
and PEP looks promising (anyone with good info about BWI and PEP - feel
free to share! (-: ).



Henrik


1) Before you freeze at 'global' and get all caught up on "This won't
ever scale to X", or "He will be haunted by Y" - I do not want to
extend a global algorithm to 2000 cores. I would like to scale to a
single *chip* and then we can worry about 2-way and 4-way systems
later. For the record, I've donned my asbestos suit anyway.

2) http://austad.us/kernel/thesis_henrikau.pdf

3) Anyone want to include LaTeX-notation into an email-rfc?



--
-> henrik
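
To get a feel for the bookkeeping in the RFC above, here is a rough Python sketch of time to failure (ttf), time of failure (tof), the utilization-based acceptance test, and a heavily simplified version of the arrival rules. Everything here is an assumption for illustration: the names Job, admits and on_arrival are made up, and tof is taken to be the absolute deadline minus the remaining execution time (the usual laxity definition). With the numbers from the mail's example (WCET 10 ms, 4 ms already run, deadline 8 ms away) that definition gives 8 - (10 - 4) = 2 ms rather than the 4 ms stated in the mail, so the RFC may well intend something different.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    wcet: float             # C_i: worst-case execution time (ms)
    deadline: float         # d_{i,k}: absolute deadline (ms)
    allocated: float = 0.0  # C_a: CPU time already received (ms)

    def remaining(self) -> float:
        # C_r: execution time the job still needs.
        return max(self.wcet - self.allocated, 0.0)

    def tof(self) -> float:
        # Assumed definition: latest start time that still meets the deadline.
        return self.deadline - self.remaining()

    def ttf(self, now: float) -> float:
        # Relative slack left before the job must be on a CPU.
        return self.tof() - now


def admits(tasks, m, hard=False):
    # tasks: iterable of (C, T, D) tuples; m: number of CPUs.
    # U = C/D for hard real-time, C/T for soft real-time, as in the RFC.
    total = sum(c / (d if hard else t) for c, t, d in tasks)
    return total <= m


def on_arrival(running, waiting, job, now, m):
    """Simplified arrival rule. Returns the preempted job, or None."""
    if len(running) < m:                           # a CPU is free
        running.append(job)
        return None
    tail = max(running, key=lambda j: j.ttf(now))  # largest slack in R
    # A running job with no slack left is never preempted.
    if tail.ttf(now) > 0 and job.ttf(now) < tail.ttf(now):
        running.remove(tail)
        running.append(job)
        waiting.append(tail)
        preempted = tail
    else:
        waiting.append(job)
        preempted = None
    waiting.sort(key=Job.tof)                      # W is ordered by tof
    return preempted


example = Job("example", wcet=10.0, deadline=8.0, allocated=4.0)
print(example.ttf(now=0.0))
```

The sketch leaves out the budget timer and the "both jobs would miss their deadline" comparison, and it does not advance allocated time for running jobs, so it only illustrates the ordering of the two queues, not a working scheduler.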



linux | 【2009-07-14(Tue) 19:11:16】 | Trackback:(0) | Comments:(0)

Apparently I was mentioned on the 2009/07/05 Linux Kernel Podcast

Apparently I showed up on the Linux Kernel Mailing List (LKML) Summary Podcast that Jon Masters runs. The man is amazing. How does he keep track of all this?

http://www.kernelpodcast.org/2009/07/08/20090705-linux-kernel-podcast/

OOM. Following up to recent discussion concerning noswap related patches triggering excessive OOM kill scenarios, and simply in reaction to the general mess that is trying to figure out exactly why an OOM occurred, Kosaki Motohiro posted an “OOM analysis helper” patch series which adds a number of statistics to the output produced by the kernel on an OOM condition.




linux | 【2009-07-08(Wed) 23:29:07】 | Trackback:(0) | Comments:(4)

FS benchmark showdown

http://www.phoronix.com/scan.php?page=article&item=ext4_btrfs_nilfs2&num=1

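They are probably competing with default parameters, so I think ext3 is in writeback mode.

Summary
・The SQLite test finished in 20 seconds on ext3, but took 870 seconds on ext4 and as much as 1472 seconds on btrfs
・btrfs and XFS could not complete PostgreSQL's pgbench
・IOZone
Write: ext3:107MB/s, ext4:131MB/s, Btrfs:89MB/s
Read: ext3:202MB/s, ext4:219MB/s, Btrfs:93MB/s
Btrfs is slow
・Dbench: ext3:100MB/s, ext4:32MB/s, Btrfs:46MB/s
As expected, ordered mode is far too weak at parallel I/O. I do feel Btrfs could be tuned a bit more, though
・PostMark is a landslide win for ext4, but I don't really know what that benchmark actually does
・Btrfs does well on BlogBench (a benchmark of a web server workload)
Probably because when the application relies heavily on append writes, it no longer goes through the FS's COW path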



linux | 【2009-07-03(Fri) 03:07:14】 | Trackback:(0) | Comments:(0)

Kernel Watch, June edition


It has been published.

http://www.atmarkit.co.jp/flinux/rensai/watch2009/watch06a.html

linux | 【2009-07-01(Wed) 20:34:28】 | Trackback:(0) | Comments:(0)

/dev/ksm is gone

It looks like the sharing hint will now be given through madvise instead.
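
As a rough idea of what an madvise-based KSM interface looks like from userspace, here is a small Python sketch. It assumes the flag name MADV_MERGEABLE, which is what later landed in mainline (2.6.32); the name used in the patches at the time of this post may have differed. The helper name advise_mergeable is made up for illustration.

```python
import mmap


def advise_mergeable(pages: int = 4) -> bool:
    """Mark an anonymous private mapping as a KSM merge candidate.

    Returns True if the kernel accepted the advice, False otherwise.
    """
    advice = getattr(mmap, "MADV_MERGEABLE", None)
    if advice is None:
        # Not Linux, or this Python build does not expose the constant.
        return False
    length = pages * mmap.PAGESIZE
    # KSM only considers anonymous private pages, so avoid the default MAP_SHARED.
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        region.write(b"\x42" * length)  # identical pages are the merge candidates
        region.madvise(advice)          # requires Python 3.8 or newer
        return True
    except OSError:
        # For example EINVAL when the kernel was built without KSM support.
        return False
    finally:
        region.close()


print(advise_mergeable())
```

Accepting the advice only registers the range; actual merging happens later in the background and, on mainline kernels, only while KSM is switched on via /sys/kernel/mm/ksm/run.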

linux | 【2009-07-01(Wed) 08:52:03】 | Trackback:(0) | Comments:(0)