Linux Kernel: PSI - Pressure Stall Information: Observing Memory Stalls with /proc/pressure/memory

This is a follow-up to the entry below:

hiboma.hatenadiary.jp

This time, the topic is /proc/pressure/memory in PSI - Pressure Stall Information.

PSI lets you measure the time (as a ratio) that tasks spend stalled on CPU, memory, and IO. What I was curious about, and what I looked into here, is how the implementation actually identifies and accounts tasks that are causing a "memory stall".

The PSI documentation

www.kernel.org

Honestly, just reading the documentation, I could not tell at what points /proc/pressure/memory actually gets accounted.

Reading the source

www.kernel.org

Skimming through ./kernel/sched/psi.c, I found the functions psi_memstall_enter() and psi_memstall_leave().

psi_memstall_enter()

The mechanism seems to be that psi_memstall_enter() marks "a task about to start an operation that can stall on memory" and accounts it in PSI, and psi_memstall_leave() removes the mark.

/**
 * psi_memstall_enter - mark the beginning of a memory stall section
 * @flags: flags to handle nested sections
 *
 * Marks the calling task as being stalled due to a lack of memory,
 * such as waiting for a refault or performing reclaim.
 */
void psi_memstall_enter(unsigned long *flags)
{
    struct rq_flags rf;
    struct rq *rq;

    if (static_branch_likely(&psi_disabled))
        return;

    *flags = current->flags & PF_MEMSTALL;
    if (*flags)
        return;
    /*
     * PF_MEMSTALL setting & accounting needs to be atomic wrt
     * changes to the task's scheduling state, otherwise we can
     * race with CPU migration.
     */
    rq = this_rq_lock_irq(&rf);

    current->flags |= PF_MEMSTALL;
    psi_task_change(current, 0, TSK_MEMSTALL);

    rq_unlock_irq(rq, &rf);
}

/**
 * psi_memstall_leave - mark the end of an memory stall section
 * @flags: flags to handle nested memdelay sections
 *
 * Marks the calling task as no longer stalled due to lack of memory.
 */
void psi_memstall_leave(unsigned long *flags)
{
    struct rq_flags rf;
    struct rq *rq;

    if (static_branch_likely(&psi_disabled))
        return;

    if (*flags)
        return;
    /*
     * PF_MEMSTALL clearing & accounting needs to be atomic wrt
     * changes to the task's scheduling state, otherwise we could
     * race with CPU migration.
     */
    rq = this_rq_lock_irq(&rf);

    current->flags &= ~PF_MEMSTALL;
    psi_task_change(current, TSK_MEMSTALL, 0);

    rq_unlock_irq(rq, &rf);
}
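
Since the @flags argument records whether PF_MEMSTALL was already set, nested sections are not double-counted: the inner enter/leave pair effectively becomes a no-op. Below is a minimal sketch of that call pattern (this is my own illustration, not kernel code; outer_reclaim_path() and inner_helper() are hypothetical names).

/* Sketch only: illustrates the nesting behaviour of the @flags argument. */
#include <linux/psi.h>

static void inner_helper(void)
{
    unsigned long pflags;

    /* PF_MEMSTALL is already set by the outer section, so this call
     * just remembers that fact in pflags and returns without
     * re-accounting the task. */
    psi_memstall_enter(&pflags);

    /* ... work that may also stall on memory ... */

    /* pflags != 0, so PF_MEMSTALL stays set for the outer section. */
    psi_memstall_leave(&pflags);
}

static void outer_reclaim_path(void)
{
    unsigned long pflags;

    psi_memstall_enter(&pflags);    /* sets PF_MEMSTALL, accounting starts */
    inner_helper();                 /* nested section is effectively a no-op */
    psi_memstall_leave(&pflags);    /* clears PF_MEMSTALL, accounting stops */
}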

By tracking down where these two functions are called, you can pinpoint the kernel code paths that get accounted in PSI's /proc/pressure/memory.

It also makes the source easier to read

Incidentally, since psi_memstall_enter() and psi_memstall_leave() are written to be called as a pair, they also work as an annotation while reading the source: "ah, this is a spot that can stall on memory".

For example, the following code in kcompactd illustrates this nicely:

/*
 * The background compaction daemon, started as a kernel thread
 * from the init process.
 */
static int kcompactd(void *p)
{
    pg_data_t *pgdat = (pg_data_t*)p;
    struct task_struct *tsk = current;

    const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

    if (!cpumask_empty(cpumask))
        set_cpus_allowed_ptr(tsk, cpumask);

    set_freezable();

    pgdat->kcompactd_max_order = 0;
    pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;

    while (!kthread_should_stop()) {
        unsigned long pflags;

        trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
        wait_event_freezable(pgdat->kcompactd_wait,
                kcompactd_work_requested(pgdat));

        psi_memstall_enter(&pflags); 👈
        kcompactd_do_work(pgdat);
        psi_memstall_leave(&pflags); 👈
    }

    return 0;
}

Even when following the kernel source, it is hard to tell (or to remember) whether a given path can become memory-bound(?), so anything that improves readability like this is welcome.

It should also serve as a hint when reading older kernel versions that predate PSI.

Call sites of psi_memstall_enter()

Searching https://github.com/torvalds/linux/ (just after 5.4-rc8 came out) for calls to psi_memstall_enter() turns up 9 call sites, which is not all that many:

psi_memstall_enter 1722 block/blk-cgroup.c       psi_memstall_enter(&pflags);
psi_memstall_enter 1188 block/blk-core.c        psi_memstall_enter(&pflags);
psi_memstall_enter   22 include/linux/psi.h void psi_memstall_enter(unsigned long *flags);
psi_memstall_enter 2650 mm/compaction.c         psi_memstall_enter(&pflags);
psi_memstall_enter 1148 mm/filemap.c            psi_memstall_enter(&pflags);
psi_memstall_enter 2494 mm/memcontrol.c     psi_memstall_enter(&pflags);
psi_memstall_enter 3898 mm/page_alloc.c     psi_memstall_enter(&pflags);
psi_memstall_enter 4120 mm/page_alloc.c     psi_memstall_enter(&pflags);
psi_memstall_enter 3380 mm/vmscan.c         psi_memstall_enter(&pflags);
psi_memstall_enter 3592 mm/vmscan.c         psi_memstall_enter(&pflags);

The following functions call psi_memstall_enter():

  • blkcg_maybe_throttle_blkg()
  • submit_bio()
  • wait_on_page_bit_common()
  • mem_cgroup_handle_over_high()
  • __alloc_pages_direct_compact()
  • __perform_reclaim()
  • try_to_free_mem_cgroup_pages()
  • kcompactd() ... kcompactd
  • balance_pgdat() ... kswapd

These in turn are called from many other functions, so exhaustively mapping every path that reaches them would be a lot of work.

Experiments

Having skimmed the source, let's move on to some experiments:

  • Apply a "load" that should affect the values in PSI's /proc/pressure/memory
  • Observe whether the values actually change
  • Capture backtraces with bpftrace and read the corresponding source

That is the plan.

Since psi_memstall_enter() is embedded in the memory subsystem's functions, writing a kernel module to generate synthetic load did not seem like a good fit, so I ran the experiments with the following procedure.

Test environment

The experiments run on Ubuntu Bionic with a 5.4-rc7 kernel installed.

root@bionic:~# uname -a
Linux bionic 5.4.0-050400rc7-generic #201911102031 SMP Mon Nov 11 01:34:23 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Experiment 1: trigger an OOM

Set a limit with the cgroup memory.limit_in_bytes knob and trigger the OOM killer:

root@bionic:~# mkdir       /sys/fs/cgroup/memory/000
root@bionic:~# echo 100M > /sys/fs/cgroup/memory/000/memory.limit_in_bytes
root@bionic:~# echo $$   > /sys/fs/cgroup/memory/000/tasks 

# OOM 🔥
root@bionic:~# perl -e '@a = (); for(;;) { push @a, 1 x 1024 }' 

At the same time, watch /proc/pressure/memory:

for i in {1..100}; do cat /proc/pressure/memory | perl -nle 'printf "%s\t%s\n", scalar localtime(), $_'; sleep 1; done;

You can see the values in /proc/pressure/memory change at roughly the same time the OOM killer fires:

Mon Nov 18 16:59:18 2019 some avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:18 2019  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:19 2019  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:19 2019  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:20 2019  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:20 2019  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:21 2019  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:21 2019  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:22 2019  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:22 2019  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:25 2019  👉 some avg10=0.00 avg60=0.00 avg300=0.00 total=28846
Mon Nov 18 16:59:25 2019  👉 full avg10=0.00 avg60=0.00 avg300=0.00 total=11876
Mon Nov 18 16:59:26 2019  some avg10=1.96 avg60=0.41 avg300=0.08 total=286128
Mon Nov 18 16:59:26 2019  full avg10=1.51 avg60=0.31 avg300=0.06 total=221701
Mon Nov 18 16:59:27 2019  some avg10=1.96 avg60=0.41 avg300=0.08 total=286128
Mon Nov 18 16:59:27 2019  full avg10=1.51 avg60=0.31 avg300=0.06 total=221701
Mon Nov 18 16:59:28 2019  some avg10=1.60 avg60=0.39 avg300=0.08 total=286128
Mon Nov 18 16:59:28 2019  full avg10=1.24 avg60=0.30 avg300=0.06 total=221701
Mon Nov 18 16:59:29 2019  some avg10=1.60 avg60=0.39 avg300=0.08 total=286128
Mon Nov 18 16:59:29 2019  full avg10=1.24 avg60=0.30 avg300=0.06 total=221701
Mon Nov 18 16:59:30 2019  some avg10=1.31 avg60=0.38 avg300=0.08 total=286128
Mon Nov 18 16:59:30 2019  full avg10=1.01 avg60=0.29 avg300=0.06 total=221701
Mon Nov 18 16:59:31 2019  some avg10=1.31 avg60=0.38 avg300=0.08 total=286128
Mon Nov 18 16:59:31 2019  full avg10=1.01 avg60=0.29 avg300=0.06 total=221701
Mon Nov 18 16:59:32 2019  some avg10=1.07 avg60=0.37 avg300=0.08 total=286128
Mon Nov 18 16:59:32 2019  full avg10=0.83 avg60=0.28 avg300=0.06 total=221701
Mon Nov 18 16:59:33 2019  some avg10=1.07 avg60=0.37 avg300=0.08 total=286128
Mon Nov 18 16:59:33 2019  full avg10=0.83 avg60=0.28 avg300=0.06 total=221701
Mon Nov 18 16:59:34 2019  some avg10=0.88 avg60=0.36 avg300=0.08 total=286128
Mon Nov 18 16:59:34 2019  full avg10=0.68 avg60=0.27 avg300=0.06 total=221701
Mon Nov 18 16:59:35 2019  some avg10=0.88 avg60=0.36 avg300=0.08 total=286128
Mon Nov 18 16:59:35 2019  full avg10=0.68 avg60=0.27 avg300=0.06 total=221701
Mon Nov 18 16:59:36 2019  some avg10=0.72 avg60=0.34 avg300=0.08 total=286128
Mon Nov 18 16:59:36 2019  full avg10=0.55 avg60=0.26 avg300=0.06 total=221701

Taking backtraces with bpftrace on kprobe:psi_memstall_enter shows call paths like the following. Quite a lot of them.

vagrant@bionic:~$ sudo bpftrace -e 'kprobe:psi_memstall_enter {  @[kstack] = count(); }'
Attaching 1 probe...
^C

@[
    psi_memstall_enter+1
    ext4_mpage_readpages+1145
    ext4_readpages+58
    read_pages+113
    __do_page_cache_readahead+303
    ondemand_readahead+402
    page_cache_sync_readahead+120
    generic_file_buffered_read+1393
    generic_file_read_iter+220
    ext4_file_read_iter+79
    do_iter_readv_writev+419
    do_iter_read+226
    vfs_iter_read+25
    loop_queue_work+889
    kthread_worker_fn+121
    loop_kthread_worker_fn+30
    kthread+260
    ret_from_fork+53
]: 1
@[
    psi_memstall_enter+1
    submit_bh_wbc+389
    ll_rw_block+171
    __breadahead+68
    __ext4_get_inode_loc+945
    ext4_reserve_inode_write+84
    ext4_mark_inode_dirty+83
    ext4_dirty_inode+73
    __mark_inode_dirty+640
    generic_update_time+157
    file_update_time+235
    __generic_file_write_iter+154
    ext4_file_write_iter+185
    new_sync_write+293
    __vfs_write+41
    vfs_write+185
    ksys_write+103
    __x64_sys_write+26
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+68
]: 1
@[
    psi_memstall_enter+1
    __alloc_pages_nodemask+720
    alloc_pages_current+135
    __page_cache_alloc+114
    __do_page_cache_readahead+161
    ondemand_readahead+655
    page_cache_sync_readahead+120
    generic_file_buffered_read+1393
    generic_file_read_iter+220
    ext4_file_read_iter+79
    do_iter_readv_writev+419
    do_iter_read+226
    vfs_iter_read+25
    loop_queue_work+889
    kthread_worker_fn+121
    loop_kthread_worker_fn+30
    kthread+260
    ret_from_fork+53
]: 1
@[
    psi_memstall_enter+1
    __alloc_pages_nodemask+720
    alloc_pages_current+135
    __page_cache_alloc+114
    __do_page_cache_readahead+161
    filemap_fault+1742
    __do_fault+60
    do_fault+587
    __handle_mm_fault+1221
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 1
@[
    psi_memstall_enter+1
    __alloc_pages_nodemask+720
    alloc_pages_current+135
    __page_cache_alloc+114
    pagecache_get_page+191
    squashfs_readpage_block+275
    squashfs_readpage+537
    read_pages+294
    __do_page_cache_readahead+384
    filemap_fault+1742
    __do_fault+60
    do_fault+587
    __handle_mm_fault+1221
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 2
@[
    psi_memstall_enter+1
    ext4_mpage_readpages+1145
    ext4_readpages+58
    read_pages+113
    __do_page_cache_readahead+384
    ondemand_readahead+402
    page_cache_async_readahead+181
    generic_file_buffered_read+1964
    generic_file_read_iter+220
    ext4_file_read_iter+79
    do_iter_readv_writev+419
    do_iter_read+226
    vfs_iter_read+25
    loop_queue_work+889
    kthread_worker_fn+121
    loop_kthread_worker_fn+30
    kthread+260
    ret_from_fork+53
]: 2
@[
    psi_memstall_enter+1
    generic_file_read_iter+220
    ext4_file_read_iter+79
    do_iter_readv_writev+419
    do_iter_read+226
    vfs_iter_read+25
    loop_queue_work+889
    kthread_worker_fn+121
    loop_kthread_worker_fn+30
    kthread+260
    ret_from_fork+53
]: 3
@[
    psi_memstall_enter+1
    ext4_mpage_readpages+1145
    ext4_readpages+58
    read_pages+113
    __do_page_cache_readahead+384
    ondemand_readahead+402
    page_cache_sync_readahead+120
    generic_file_buffered_read+1393
    generic_file_read_iter+220
    ext4_file_read_iter+79
    do_iter_readv_writev+419
    do_iter_read+226
    vfs_iter_read+25
    loop_queue_work+889
    kthread_worker_fn+121
    loop_kthread_worker_fn+30
    kthread+260
    ret_from_fork+53
]: 3
@[
    psi_memstall_enter+1
    ext4_mpage_readpages+1145
    ext4_readpages+58
    read_pages+113
    __do_page_cache_readahead+303
    filemap_fault+1742
    ext4_filemap_fault+50
    __do_fault+60
    do_fault+587
    __handle_mm_fault+1221
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 4
@[
    psi_memstall_enter+1
    __alloc_pages_nodemask+720
    alloc_pages_current+135
    __page_cache_alloc+114
    pagecache_get_page+191
    __getblk_gfp+241
    squashfs_read_data+728
    squashfs_readpage_block+667
    squashfs_readpage+537
    read_pages+294
    __do_page_cache_readahead+384
    filemap_fault+1742
    __do_fault+60
    do_fault+587
    __handle_mm_fault+1221
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 4
@[
    psi_memstall_enter+1
    __alloc_pages_nodemask+720
    alloc_pages_current+135
    __page_cache_alloc+114
    __do_page_cache_readahead+161
    ondemand_readahead+402
    page_cache_async_readahead+181
    generic_file_buffered_read+1964
    generic_file_read_iter+220
    ext4_file_read_iter+79
    do_iter_readv_writev+419
    do_iter_read+226
    vfs_iter_read+25
    loop_queue_work+889
    kthread_worker_fn+121
    loop_kthread_worker_fn+30
    kthread+260
    ret_from_fork+53
]: 4
@[
    psi_memstall_enter+1
    ext4_mpage_readpages+1145
    ext4_readpages+58
    read_pages+113
    __do_page_cache_readahead+384
    filemap_fault+1742
    ext4_filemap_fault+50
    __do_fault+60
    do_fault+587
    __handle_mm_fault+1221
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 5
@[
    psi_memstall_enter+1
    __alloc_pages_nodemask+720
    alloc_pages_current+135
    __page_cache_alloc+114
    __do_page_cache_readahead+161
    ondemand_readahead+402
    page_cache_sync_readahead+120
    generic_file_buffered_read+1393
    generic_file_read_iter+220
    ext4_file_read_iter+79
    do_iter_readv_writev+419
    do_iter_read+226
    vfs_iter_read+25
    loop_queue_work+889
    kthread_worker_fn+121
    loop_kthread_worker_fn+30
    kthread+260
    ret_from_fork+53
]: 6
@[
    psi_memstall_enter+1
    __alloc_pages_nodemask+720
    alloc_pages_vma+127
    do_anonymous_page+280
    __handle_mm_fault+1888
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 8
@[
    psi_memstall_enter+1
    filemap_fault+1855
    ext4_filemap_fault+50
    __do_fault+60
    do_fault+587
    __handle_mm_fault+1221
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 13
@[
    psi_memstall_enter+1
    __alloc_pages_nodemask+720
    alloc_pages_current+135
    __page_cache_alloc+114
    __do_page_cache_readahead+161
    filemap_fault+1742
    ext4_filemap_fault+50
    __do_fault+60
    do_fault+587
    __handle_mm_fault+1221
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 17
@[
    psi_memstall_enter+1
    __alloc_pages_nodemask+720
    alloc_pages_current+135
    __page_cache_alloc+114
    pagecache_get_page+191
    filemap_fault+1778
    ext4_filemap_fault+50
    __do_fault+60
    do_fault+587
    __handle_mm_fault+1221
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 19
@[
    psi_memstall_enter+1
    submit_bh_wbc+389
    ll_rw_block+171
    squashfs_read_data+787
    squashfs_readpage_block+667
    squashfs_readpage+537
    read_pages+294
    __do_page_cache_readahead+384
    filemap_fault+1742
    __do_fault+60
    do_fault+587
    __handle_mm_fault+1221
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 25
@[
    psi_memstall_enter+1
    submit_bh_wbc+389
    ll_rw_block+171
    squashfs_read_data+787
    squashfs_readpage_block+667
    squashfs_readpage+537
    filemap_fault+287
    __do_fault+60
    do_fault+587
    __handle_mm_fault+1221
    handle_mm_fault+202
    do_user_addr_fault+505
    __do_page_fault+88
    do_page_fault+44
    page_fault+52
]: 34
@[
    psi_memstall_enter+1
    __alloc_pages_nodemask+720
    alloc_pages_current+135
    __page_cache_alloc+114
    generic_file_buffered_read+1429
    generic_file_read_iter+220
    ext4_file_read_iter+79
    do_iter_readv_writev+419
    do_iter_read+226
    vfs_iter_read+25
    loop_queue_work+889
    kthread_worker_fn+121
    loop_kthread_worker_fn+30
    kthread+260
    ret_from_fork+53
]: 77
@[
    psi_memstall_enter+1
    kswapd+511
    kthread+260
    ret_from_fork+53
]: 626

^C

Experiment 2: generate a large number of negative dentries

Generate a large number of negative dentries to forcibly trigger slab reclaim as the load:

root@bionic:~# mkdir       /sys/fs/cgroup/memory/000
root@bionic:~# echo 100M > /sys/fs/cgroup/memory/000/memory.limit_in_bytes
root@bionic:~# echo $$   > /sys/fs/cgroup/memory/000/tasks 

# slab reclaim 🔥
root@bionic:~# timeout 30 perl -e 'stat "/$_" for 1..10000000'

At the same time, watch /proc/pressure/memory:

for i in {1..100}; do cat /proc/pressure/memory | perl -nle 'printf "%s\t%s\n", scalar localtime(), $_'; sleep 1; done;

You can watch the values in /proc/pressure/memory climb gradually. Since negative dentries keep being created the whole time, the accounting presumably happens somewhere around the slab reclaim path.

Mon Nov 18 17:53:50 2019 some avg10=0.00 avg60=0.34 avg300=0.47 total=5822085
Mon Nov 18 17:53:50 2019  full avg10=0.00 avg60=0.22 avg300=0.31 total=4638505
Mon Nov 18 17:53:51 2019  some avg10=0.36 avg60=0.40 avg300=0.48 total=5863706
Mon Nov 18 17:53:51 2019  full avg10=0.18 avg60=0.25 avg300=0.32 total=4670954
Mon Nov 18 17:53:52 2019  some avg10=0.36 avg60=0.40 avg300=0.48 total=5979685
Mon Nov 18 17:53:52 2019  full avg10=0.18 avg60=0.25 avg300=0.32 total=4717815
Mon Nov 18 17:53:53 2019  some avg10=1.92 avg60=0.68 avg300=0.54 total=6048779
Mon Nov 18 17:53:53 2019  full avg10=1.05 avg60=0.40 avg300=0.35 total=4774972
Mon Nov 18 17:53:54 2019  some avg10=1.92 avg60=0.68 avg300=0.54 total=6103945
Mon Nov 18 17:53:54 2019  full avg10=1.05 avg60=0.40 avg300=0.35 total=4811085
Mon Nov 18 17:53:55 2019  some avg10=2.66 avg60=0.85 avg300=0.58 total=6189074
Mon Nov 18 17:53:55 2019  full avg10=1.40 avg60=0.49 avg300=0.37 total=4847986
Mon Nov 18 17:53:56 2019  some avg10=2.66 avg60=0.85 avg300=0.58 total=6302442
Mon Nov 18 17:53:56 2019  full avg10=1.40 avg60=0.49 avg300=0.37 total=4941928
Mon Nov 18 17:53:57 2019  some avg10=3.63 avg60=1.09 avg300=0.63 total=6352685
Mon Nov 18 17:53:57 2019  full avg10=2.23 avg60=0.67 avg300=0.41 total=4975421
Mon Nov 18 17:53:58 2019  some avg10=3.63 avg60=1.09 avg300=0.63 total=6460241
Mon Nov 18 17:53:58 2019  full avg10=2.23 avg60=0.67 avg300=0.41 total=5042527
Mon Nov 18 17:53:59 2019  some avg10=4.78 avg60=1.38 avg300=0.69 total=6556563
Mon Nov 18 17:53:59 2019  full avg10=2.73 avg60=0.81 avg300=0.44 total=5080571
Mon Nov 18 17:54:00 2019  some avg10=4.78 avg60=1.38 avg300=0.69 total=6672685
Mon Nov 18 17:54:00 2019  full avg10=2.73 avg60=0.81 avg300=0.44 total=5168872
Mon Nov 18 17:54:01 2019  some avg10=5.36 avg60=1.60 avg300=0.75 total=6733324
Mon Nov 18 17:54:01 2019  full avg10=3.33 avg60=0.98 avg300=0.48 total=5210281
Mon Nov 18 17:54:02 2019  some avg10=5.36 avg60=1.60 avg300=0.75 total=6799858
Mon Nov 18 17:54:02 2019  full avg10=3.33 avg60=0.98 avg300=0.48 total=5238881
Mon Nov 18 17:54:03 2019  some avg10=5.66 avg60=1.77 avg300=0.79 total=6872770
Mon Nov 18 17:54:03 2019  full avg10=3.45 avg60=1.08 avg300=0.50 total=5283273
Mon Nov 18 17:54:04 2019  some avg10=5.66 avg60=1.77 avg300=0.79 total=6941195
Mon Nov 18 17:54:04 2019  full avg10=3.45 avg60=1.08 avg300=0.50 total=5330769
Mon Nov 18 17:54:05 2019  some avg10=6.08 avg60=1.98 avg300=0.84 total=7045575
Mon Nov 18 17:54:05 2019  full avg10=3.73 avg60=1.21 avg300=0.53 total=5395983
Mon Nov 18 17:54:06 2019  some avg10=6.08 avg60=1.98 avg300=0.84 total=7101877
Mon Nov 18 17:54:06 2019  full avg10=3.73 avg60=1.21 avg300=0.53 total=5430315
Mon Nov 18 17:54:07 2019  some avg10=6.07 avg60=2.11 avg300=0.87 total=7175145
Mon Nov 18 17:54:07 2019  full avg10=3.78 avg60=1.30 avg300=0.55 total=5480289
Mon Nov 18 17:54:08 2019  some avg10=6.07 avg60=2.11 avg300=0.87 total=7323447
Mon Nov 18 17:54:08 2019  full avg10=3.78 avg60=1.30 avg300=0.55 total=5546810
Mon Nov 18 17:54:09 2019  some avg10=6.96 avg60=2.40 avg300=0.94 total=7412187
Mon Nov 18 17:54:09 2019  full avg10=4.00 avg60=1.42 avg300=0.59 total=5594752
Mon Nov 18 17:54:10 2019  some avg10=6.96 avg60=2.40 avg300=0.94 total=7458924
Mon Nov 18 17:54:10 2019  full avg10=4.00 avg60=1.42 avg300=0.59 total=5631377
Mon Nov 18 17:54:11 2019  some avg10=7.33 avg60=2.61 avg300=1.00 total=7619594
Mon Nov 18 17:54:11 2019  full avg10=4.18 avg60=1.54 avg300=0.62 total=5699840
Mon Nov 18 17:54:12 2019  some avg10=7.33 avg60=2.61 avg300=1.00 total=7709262
Mon Nov 18 17:54:12 2019  full avg10=4.18 avg60=1.54 avg300=0.62 total=5774328
Mon Nov 18 17:54:13 2019  some avg10=7.27 avg60=2.76 avg300=1.04 total=7777595
Mon Nov 18 17:54:13 2019  full avg10=4.33 avg60=1.65 avg300=0.65 total=5826863

Take backtraces with bpftrace on kprobe:psi_memstall_enter:

vagrant@bionic:~$ sudo bpftrace -e 'kprobe:psi_memstall_enter {  @[kstack] = count(); }'
Attaching 1 probe...
^C

@[
    psi_memstall_enter+1
    try_charge+744
    __memcg_kmem_charge_memcg+64
    alloc_slab_page+404
    allocate_slab+125
    new_slab+74
    ___slab_alloc+824
    __slab_alloc+32
    kmem_cache_alloc+516
    __d_alloc+39
    d_alloc+26
    d_alloc_parallel+90
    __lookup_slow+105
    lookup_slow+59
    walk_component+474
    path_lookupat.isra.0+128
    filename_lookup+174
    user_path_at_empty+58
    vfs_statx+125
    __do_sys_newstat+62
    __x64_sys_newstat+22
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+68
]: 141

In the bpftrace output the caller appears to be try_charge(), but reading the source, the exact call site is in try_to_free_mem_cgroup_pages() (presumably the innermost caller frame just isn't captured in the kstack):

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
                       unsigned long nr_pages,
                       gfp_t gfp_mask,
                       bool may_swap)
{
    struct zonelist *zonelist;
    unsigned long nr_reclaimed;
    unsigned long pflags;
    int nid;
    unsigned int noreclaim_flag;
    struct scan_control sc = {
        .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
        .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
                (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
        .reclaim_idx = MAX_NR_ZONES - 1,
        .target_mem_cgroup = memcg,
        .priority = DEF_PRIORITY,
        .may_writepage = !laptop_mode,
        .may_unmap = 1,
        .may_swap = may_swap,
    };

    set_task_reclaim_state(current, &sc.reclaim_state);
    /*
     * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
     * take care of from where we get pages. So the node where we start the
     * scan does not need to be the current node.
     */
    nid = mem_cgroup_select_victim_node(memcg);

    zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];

    trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);

    psi_memstall_enter(&pflags); 👈
    noreclaim_flag = memalloc_noreclaim_save();

    nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

    memalloc_noreclaim_restore(noreclaim_flag);
    psi_memstall_leave(&pflags); 👈

    trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
    set_task_reclaim_state(current, NULL);

    return nr_reclaimed;
}
#endif

Summary

  • Roughly traced the source and the functions that call psi_memstall_enter()
  • Watched /proc/pressure/memory change under an OOM and under mass creation of negative dentries
  • The memory subsystem really is complex; my head was about to burst
  • /proc/pressure/memory alone doesn't tell you where the stalls are happening, so combining it with the counters in /proc/vmstat might make it easier to work with as a metric? (see the sketch below)
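
For example, something like the following rough userspace sketch could sample the some/full lines of /proc/pressure/memory alongside a few reclaim-related /proc/vmstat counters once a second. The counter names here (pgscan_kswapd, pgscan_direct, pgsteal_kswapd, pgsteal_direct) are just examples and may differ between kernel versions.

/* Quick sketch: print PSI memory pressure next to a few /proc/vmstat
 * reclaim counters every second. Counter names are examples only. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Print the lines from @path that start with one of the @prefixes. */
static void dump_matching(const char *path, const char *const prefixes[], int n)
{
    char line[256];
    FILE *fp = fopen(path, "r");

    if (!fp)
        return;
    while (fgets(line, sizeof(line), fp)) {
        for (int i = 0; i < n; i++) {
            if (strncmp(line, prefixes[i], strlen(prefixes[i])) == 0)
                printf("  %s", line);
        }
    }
    fclose(fp);
}

int main(void)
{
    const char *const pressure[] = { "some", "full" };
    const char *const vmstat[]   = { "pgscan_kswapd", "pgscan_direct",
                                     "pgsteal_kswapd", "pgsteal_direct" };

    for (;;) {
        time_t now = time(NULL);
        printf("%s", ctime(&now));          /* ctime() appends a newline */
        dump_matching("/proc/pressure/memory", pressure, 2);
        dump_matching("/proc/vmstat", vmstat, 4);
        fflush(stdout);
        sleep(1);
    }
    return 0;
}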