Skip to content

Amazon 5.10.y/btrfs soft lockup#2

Closed
vewe-richard wants to merge 737 commits intoamazonlinux:amazon-5.10.y/masterfrom
vewe-richard:amazon-5.10.y/btrfs-soft-lockup
Closed

Amazon 5.10.y/btrfs soft lockup#2
vewe-richard wants to merge 737 commits intoamazonlinux:amazon-5.10.y/masterfrom
vewe-richard:amazon-5.10.y/btrfs-soft-lockup

Conversation

@vewe-richard
Copy link

Issue #, if available:
vewe-richard#1

Description of changes:
Do not call generic_file_buffered_read() directly, call generic_file_read_iter() which will check number of iov to read, and return immediately when the number of iov is zero.

Else, generic_file_buffered_read() will fall into an infinite loop, eat up all cpu resources, cause system lockup.

Baolin Wang and others added 30 commits August 18, 2023 22:38
The huge_ptep_set_access_flags() can not make the huge pte old according
to the discussion [1], that means we will always mornitor the young state
of the hugetlb though we stopped accessing the hugetlb, as a result DAMON
will get inaccurate accessing statistics.

So changing to use set_huge_pte_at() to make the huge pte old to fix this
issue.

[1] https://lore.kernel.org/all/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 49f4203 ("mm/damon: add access checking for hugetlb pages")
Signed-off-by: Baolin Wang <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
damon_reclaim_init() allocates a memory chunk for ctx with
damon_new_ctx().  When damon_select_ops() fails, ctx is not released,
which will lead to a memory leak.

We should release the ctx with damon_destroy_ctx() when damon_select_ops()
fails to fix the memory leak.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4d69c34 ("mm/damon/reclaim: use damon_select_ops() instead of damon_{v,p}a_set_operations()")
Signed-off-by: Jianglei Nie <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
When user tries to create a DAMON context via the DAMON debugfs interface
with a name of an already existing context, the context directory creation
fails but a new context is created and added in the internal data
structure, due to absence of the directory creation success check.  As a
result, memory could leak and DAMON cannot be turned on.  An example test
case is as below:

    # cd /sys/kernel/debug/damon/
    # echo "off" >  monitor_on
    # echo paddr > target_ids
    # echo "abc" > mk_context
    # echo "abc" > mk_context
    # echo $$ > abc/target_ids
    # echo "on" > monitor_on  <<< fails

Return value of 'debugfs_create_dir()' is expected to be ignored in
general, but this is an exceptional case as DAMON feature is depending
on the debugfs functionality and it has the potential duplicate name
issue.  This commit therefore fixes the issue by checking the directory
creation failure and immediately return the error in the case.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 75c1c2b ("mm/damon/dbgfs: support multiple contexts")
Signed-off-by: Badari Pulavarty <[email protected]>
Signed-off-by: SeongJae Park <[email protected]>
Cc: <[email protected]>	[ 5.15.x]
Signed-off-by: Andrew Morton <[email protected]>
This reverts commit 509c2c9.

Commit "1a072f13b2dc Mitigate unbalanced RETs on vmexit via serialising wrmsr"
addresses this with less performance impact.

[ Hailmo: Resolved conflicts when rebasing onto 5.10.190 and adding SRSO
and GDS support ]

Signed-off-by: Suraj Jitindar Singh <[email protected]>
…-rwx-segments"

This reverts commit 8f4f2c9.

This causes arm64 debug builds to fail with:
*** ERROR: No build ID note found in /builddir/build/BUILDROOT/kernel-5.15.63-32.131.amzn2.aarch64/usr/lib/debug/lib/modules/5.15.63-32.131.amzn2.aarch64/vmlinux

This is due to the notes section which contains the build id being
missing from the linux elf.

Revert this commit until this can be remedied.

Signed-off-by: Suraj Jitindar Singh <[email protected]>
Source: https://github.com/amzn/amzn-drivers/

Change Log:

## r2.8.0 release notes
**Notes**
* The driver is now dependent on the ptp module for loading
  See README for more details.

**New Features**
* Add support for PTP HW clock
* Add support for SRD metrics
  Feature's enablement and documentation would be in future release

**Bug Fixes**
* Fix potential sign extension issue
* Reduce memory footprint of some structs
* Fix updating rx_copybreak issue
* Fix xdp drops handling due to multibuf packets
* Handle ena_calc_io_queue_size() possible errors
* Destroy correct amount of xdp queues upon failure

**Minor Changes**
* Remove wide LLQ comment on supported versions
* Backport uapi/bpf.h inclusion
* Add a counter for driver's reset failures
* Take xdp packets stats into account in ena_get_stats64()
* Make queue stats code cleaner by removing if block
* Remove redundant empty line
* Remove confusing comment
* Remove flag reading code duplication
* Replace ENA local ENA_NAPI_BUDGET to global NAPI_POLL_WEIGHT
* Change default print level for netif_ prints
* Relocate skb_tx_timestamp() to improve time stamping accuracy
* Backport bpf_warn_invalid_xdp_action() change
* Fix incorrect indentation using spaces
* Driver now compiles with Linux kernel 5.19

Signed-off-by: Suraj Jitindar Singh <[email protected]>
When calling debugfs_lookup() the result must have dput() called on it,
otherwise the memory will leak over time.  Fix this up by properly calling
dput().

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 75c1c2b ("mm/damon/dbgfs: support multiple contexts")
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: SeongJae Park <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
When damon_sysfs_add_target couldn't find proper task, New allocated
damon_target structure isn't registered yet, So, it's impossible to free
new allocated one by damon_sysfs_destroy_targets.

By calling damon_add_target as soon as allocating new target, Fix this
possible memory leak.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: a61ea56 ("mm/damon/sysfs: link DAMON for virtual address spaces monitoring")
Signed-off-by: Levi Yun <[email protected]>
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Cc: <[email protected]>	[5.17.x]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: SeongJae Park <[email protected]>
commit 9e7a4d9 upstream.

Usage of spin locks was not allowed for tracing programs due to
insufficient preemption checks. The verifier does not currently prevent
LSM programs from using spin locks, but the helpers are not exposed
via bpf_lsm_func_proto.

Based on the discussion in [1], non-sleepable LSM programs should be
able to use bpf_spin_{lock, unlock}.

Sleepable LSM programs can be preempted which means that allowng spin
locks will need more work (disabling preemption and the verifier
ensuring that no sleepable helpers are called when a spin lock is held).

[1]: https://lore.kernel.org/bpf/[email protected]/T/#md601a053229287659071600d3483523f752cd2fb

Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Kuniyuki Iwashima <[email protected]>
commit 4cf1bc1 upstream.

Similar to bpf_local_storage for sockets and inodes add local storage
for task_struct.

The life-cycle of storage is managed with the life-cycle of the
task_struct.  i.e. the storage is destroyed along with the owning task
with a callback to the bpf_task_storage_free from the task_free LSM
hook.

The BPF LSM allocates an __rcu pointer to the bpf_local_storage in
the security blob which are now stackable and can co-exist with other
LSMs.

The userspace map operations can be done by using a pid fd as a key
passed to the lookup, update and delete operations.

Signed-off-by: KP Singh <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Instead of putting io_uring's registered files in unix_gc() we want it
to be done by io_uring itself. The trick here is to consider io_uring
registered files for cycle detection but not actually putting them down.
Because io_uring can't register other ring instances, this will remove
all refs to the ring file triggering the ->release path and clean up
with io_ring_ctx_free().

Cc: [email protected]
Fixes: 6b06314 ("io_uring: add file set registration")
Reported-and-tested-by: David Bouman <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Thadeu Lima de Souza Cascardo <[email protected]>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: Jens Axboe <[email protected]>
Since commit 0f91d13 ("mm/damon: simplify stop mechanism") delete
kdamond_stop and change to use kthread stop mechanism, these obsolete
comments should be removed accordingly.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chengming Zhou <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The kernel is in lockdown mode when secureboot is enabled and hence
debugfs cannot be used. Add support for this and other general cases
where debugfs cannot be read and communicate the same to the user before
running tests.

Signed-off-by: Gautam <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>
… due to online tuning support

Patch series "mm/damon: trivial cleanups".

This patchset contains trivial cleansups for DAMON code.

This patch (of 6):

Commit 81a8418 ("Docs/admin-guide/mm/damon/reclaim: document
'commit_inputs' parameter") has documented the 'commit_inputs' parameter
which allows online parameter update, but it didn't remove a paragraph
saying the online parameter update is impossible.  This commit removes the
obsolete paragraph.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 81a8418 ("Docs/admin-guide/mm/damon/reclaim: document 'commit_inputs' parameter")
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The function for knowing if given monitoring context's targets will have
pid or not is defined and used in dbgfs only.  However, the logic is also
needed for sysfs.  This commit moves the code to damon.h and makes both
dbgfs and sysfs to use it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
DAMON_RECLAIM's handling of 'commit_inputs' parameter is duplicated in
'after_aggregation()' and 'after_wmarks_check()' callbacks.  This commit
deduplicates the code for better maintenance.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
DAMON sysfs interface's DAMON context building and its online parameter
update have duplicated code.  This commit removes the duplicate.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
DAMON_RECLAIM's 'enabled' parameter store callback ('enabled_store()')
schedules the parameter check timer ('damon_reclaim_timer') if the
parameter is set as 'Y'.  Then, the timer schedules itself to check if
user has set the parameter as 'N'.  It's unnecessarily complex.

This commit makes it simpler by making the parameter store callback to
schedule the timer regardless of the parameter value and disabling the
timer's self scheduling.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This commit adds 'damon_reclaim_' prefix to 'enabled_store()', so that we
can distinguish it easily from the stack trace using 'faddr2line.sh' like
tools.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
…and 'damos_action' values

Patch series "Extend DAMOS for Proactive LRU-lists Sorting".

Introduction
============

In short, this patchset 1) extends DAMON-based Operation Schemes (DAMOS)
for low overhead data access pattern based LRU-lists sorting, and 2)
implements a static kernel module for easy use of conservatively-tuned
version of that using the extended DAMOS capability.

Background
----------

As page-granularity access checking overhead could be significant on huge
systems, LRU lists are normally not proactively sorted but partially and
reactively sorted for special events including specific user requests,
system calls and memory pressure.  As a result, LRU lists are sometimes
not so perfectly prepared to be used as a trustworthy access pattern
source for some situations including reclamation target pages selection
under sudden memory pressure.

DAMON-based Proactive LRU-lists Sorting
---------------------------------------

Because DAMON can identify access patterns of best-effort accuracy while
inducing only user-specified range of overhead, using DAMON for Proactive
LRU-lists Sorting (PLRUS) could be helpful for this situation.  The idea
is quite simple.  Find hot pages and cold pages using DAMON, and
prioritize hot pages while deprioritizing cold pages on their LRU-lists.

This patchset extends DAMON to support such schemes by introducing a
couple of new DAMOS actions for prioritizing and deprioritizing memory
regions of specific access patterns on their LRU-lists.  In detail, this
patchset simply uses 'mark_page_accessed()' and 'deactivate_page()'
functions for prioritization and deprioritization of pages on their LRU
lists, respectively.

To make the scheme easy to use without complex tuning for common
situations, this patchset further implements a static kernel module called
'DAMON_LRU_SORT' using the extended DAMOS functionality.  It proactively
sorts LRU-lists using DAMON with conservatively chosen default
hotness/coldness thresholds and small CPU usage quota limit.  That is, the
module under its default parameters will make no harm for common situation
but provide some level of benefit for systems having clear hot/cold access
pattern under only memory pressure while consuming only limited small
portion of CPU time.

Related Works
-------------

Proactive reclamation is well known to be helpful for reducing non-optimal
reclamation target selection caused performance drops.  However, proactive
reclamation is not a best option for some cases, because it could incur
additional I/O.  For an example, it could be prohitive for systems using
storage devices that total number of writes is limited, or cloud block
storages that charges every I/O.

Some proactive reclamation approaches[1,2] induce a level of memory
pressure using memcg files or swappiness while monitoring PSI.  As
reclamation target selection is still relying on the original LRU-lists
mechanism, using DAMON-based proactive reclamation before inducing the
proactive reclamation could allow more memory saving with same level of
performance overhead, or less performance overhead with same level of
memory saving.

[1] https://blogs.oracle.com/linux/post/anticipating-your-memory-needs
[2] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf

Evaluation
==========

In short, PLRUS achieves 10% memory PSI (some) reduction, 14% major page
faults reduction, and 3.74% speedup under memory pressure.

Setup
-----

To show the effect of PLRUS, I run PARSEC3 and SPLASH-2X benchmarks under
below variant systems and measure a few metrics including the runtime of
each workload, number of system-wide major page faults, and system-wide
memory PSI (some).

- orig: v5.18-rc4 based mm-unstable kernel + this patchset, but no DAMON scheme
        applied.
- mprs: Same to 'orig' but artificial memory pressure is induced.
- plrus: Same to 'mprs' but a radically tuned PLRUS scheme is applied to the
         entire physical address space of the system.

For the artificial memory pressure, I set 'memory.limit_in_bytes' to 75%
of the running workload's peak RSS, wait 1 second, remove the pressure by
setting it to 200% of the peak RSS, wait 10 seconds, and repeat the
procedure until the workload finishes[1].  I use zram based swap device.
The tests are automated[2].

[1] https://github.com/awslabs/damon-tests/blob/next/perf/runners/back/0009_memcg_pressure.sh
[2] https://github.com/awslabs/damon-tests/blob/next/perf/full_once_config.sh

Radically Tuned PLRUS
---------------------

To show effect of PLRUS on the PARSEC3/SPLASH-2X workloads which runs for
no long time, we use radically tuned version of PLRUS.  The version asks
DAMON to do the proactive LRU-lists sorting as below.

1. Find any memory regions shown some accesses (approximately >=20 accesses per
   100 sampling) and prioritize pages of the regions on their LRU lists using
   up to 2% CPU time.  Under the CPU time limit, prioritize regions having
   higher access frequency and kept the access frequency longer first.

2. Find any memory regions shown no access for at least >=5 seconds and
   deprioritize pages of the rgions on their LRU lists using up to 2% CPU time.
   Under the CPU time limit, deprioritize regions that not accessed for longer
   time first.

Results
-------

I repeat the tests 25 times and calculate average of the measured numbers.
The results are as below:

    metric               orig        mprs         plrus        plrus/mprs
    runtime_seconds      190.06      292.83       281.87       0.96
    pgmajfaults          852.55      8769420.00   7525040.00   0.86
    memory_psi_some_us   106911.00   6943420.00   6220920.00   0.90

The first row is for legend.  The first cell shows the metric that the
following cells of the row shows.  Second, third, and fourth cells show
the metrics under the configs shown at the first row of the cell, and the
fifth cell shows the metric under 'plrus' divided by the metric under
'mprs'.  Second row shows the averaged runtime of the workloads in
seconds.  Third row shows the number of system-wide major page faults
while the test was ongoing.  Fourth row shows the system-wide memory
pressure stall for some processes in microseconds while the test was
ongoing.

In short, PLRUS achieves 10% memory PSI (some) reduction, 14% major page
faults reduction, and 3.74% speedup under memory pressure.  We also
confirmed the CPU usage of kdamond was 2.61% of single CPU, which is below
4% as expected.

Sequence of Patches
===================

The first and second patch cleans up DAMON debugfs interface and
DAMOS_PAGEOUT handling code of physical address space monitoring
operations implementation for easier extension of the code.

The thrid and fourth patches implement a new DAMOS action called
'lru_prio', which prioritizes pages under memory regions which have a
user-specified access pattern, and document it, respectively.  The fifth
and sixth patches implement yet another new DAMOS action called
'lru_deprio', which deprioritizes pages under memory regions which have a
user-specified access pattern, and document it, respectively.

The seventh patch implements a static kernel module called
'damon_lru_sort', which utilizes the DAMON-based proactive LRU-lists
sorting under conservatively chosen default parameter.  Finally, the
eighth patch documents 'damon_lru_sort'.

This patch (of 8):

DAMON debugfs interface assumes users will write 'damos_action' value
directly to the 'schemes' file.  This makes adding new 'damos_action' in
the middle of its definition breaks the backward compatibility of DAMON
debugfs interface, as values of some 'damos_action' could be changed.  To
mitigate the situation, this commit adds mappings between the user inputs
and 'damos_action' value and makes DAMON debugfs code uses those.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This commit moves code for 'DAMOS_PAGEOUT' handling of the physical
address space monitoring operations set to a separate function so that its
caller, 'damon_pa_apply_scheme()', can be more easily extended for
additional DAMOS actions later.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This commit adds a new DAMOS action called 'LRU_PRIO' for the physical
address space.  The action prioritizes pages in the memory regions of the
user-specified target access pattern on their LRU lists.  This is hence
supposed to be used for frequently accessed (hot) memory regions so that
hot pages could be more likely protected under memory pressure.
Internally, it simply calls 'mark_page_accessed()'.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This commit documents the 'lru_prio' scheme action for DAMON sysfs
interface.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This commit adds a new DAMON-based operation scheme action called
'LRU_DEPRIO' for physical address space.  The action deprioritizes pages
in the memory area of the target access pattern on their LRU lists.  This
is hence supposed to be used for rarely accessed (cold) memory regions so
that cold pages could be more likely reclaimed first under memory
pressure.  Internally, it simply calls 'lru_deactivate()'.

Using this with 'LRU_PRIO' action for hot pages, users can proactively
sort LRU lists based on the access pattern.  That is, it can make the LRU
lists somewhat more trustworthy source of access temperature.  As a
result, efficiency of LRU-lists based mechanisms including the reclamation
target selection could be improved.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This commit documents the 'LRU_DEPRIO' scheme action for DAMON sysfs
interface.`

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Users can do data access-aware LRU-lists sorting using 'LRU_PRIO' and
'LRU_DEPRIO' DAMOS actions.  However, finding best parameters including
the hotness/coldness thresholds, CPU quota, and watermarks could be
challenging for some users.  To make the scheme easy to be used without
complex tuning for common situations, this commit implements a static
kernel module called 'DAMON_LRU_SORT' using the 'LRU_PRIO' and
'LRU_DEPRIO' DAMOS actions.

It proactively sorts LRU-lists using DAMON with conservatively chosen
default values of the parameters.  That is, the module under its default
parameters will make no harm for common situations but provide some level
of efficiency improvements for systems having clear hot/cold access
pattern under a level of memory pressure while consuming only a limited
small portion of CPU time.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This commit documents the usage of DAMON_LRU_SORT for admins.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
damon_lru_sort_init() returns an error when damon_select_ops() fails
without freeing 'ctx' which allocated before.  This commit fixes the
potential memory leak by freeing 'ctx' under the situation.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 40e983c ("mm/damon: introduce DAMON-based LRU-lists Sorting")
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The workflow example code is not working since it got the file names
wrong. So fix this.

Fixes: b184027 ("Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface")
Reviewed-by: SeongJae Park <[email protected]>
Signed-off-by: Kairui Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
shaoyingxu pushed a commit that referenced this pull request Feb 5, 2026
The arm64 stacktrace code has a few error conditions where a
WARN_ON_ONCE() is triggered before the stacktrace is terminated and an
error is returned to the caller. The conditions shouldn't be triggered
when unwinding the current task, but it is possible to trigger these
when unwinding another task which is not blocked, as the stack of that
task is concurrently modified. Kent reports that these warnings can be
triggered while running filesystem tests on bcachefs, which calls the
stacktrace code directly.

To produce a meaningful stacktrace of another task, the task in question
should be blocked, but the stacktrace code is expected to be robust to
cases where it is not blocked. Note that this is purely about not
unuduly scaring the user and/or crashing the kernel; stacktraces in such
cases are meaningless and may leak kernel secrets from the stack of the
task being unwound.

Ideally we'd pin the task in a blocked state during the unwind, as we do
for /proc/${PID}/wchan since commit:

  42a20f8 ("sched: Add wrapper for get_wchan() to keep task blocked")

... but a bunch of places don't do that, notably /proc/${PID}/stack,
where we don't pin the task in a blocked state, but do restrict the
output to privileged users since commit:

  f8a00ce ("proc: restrict kernel stack dumps to root")

... and so it's possible to trigger these warnings accidentally, e.g. by
reading /proc/*/stack (as root):

| for n in $(seq 1 10); do
|     while true; do cat /proc/*/stack > /dev/null 2>&1; done &
| done
| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370
| Modules linked in:
| CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x1c8/0x370
| lr : arch_stack_walk+0x1b0/0x370
| sp : ffff800080773890
| x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038
| x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000
| x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0
| x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
| x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818
| x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0
| x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000
| Call trace:
|  arch_stack_walk+0x1c8/0x370 (P)
|  stack_trace_save_tsk+0x8c/0x108
|  proc_pid_stack+0xb0/0x134
|  proc_single_show+0x60/0x120
|  seq_read_iter+0x104/0x438
|  seq_read+0xf8/0x140
|  vfs_read+0xc4/0x31c
|  ksys_read+0x70/0x108
|  __arm64_sys_read+0x1c/0x28
|  invoke_syscall+0x48/0x104
|  el0_svc_common.constprop.0+0x40/0xe0
|  do_el0_svc+0x1c/0x28
|  el0_svc+0x30/0xcc
|  el0t_64_sync_handler+0x10c/0x138
|  el0t_64_sync+0x198/0x19c
| ---[ end trace 0000000000000000 ]---

Fix this by only warning when unwinding the current task. When unwinding
another task the error conditions will be handled by returning an error
without producing a warning.

The two warnings in kunwind_next_frame_record_meta() were added recently
as part of commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

The warning when recovering the fgraph return address has changed form
many times, but was originally introduced back in commit:

  9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Fixes: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Signed-off-by: Mark Rutland <[email protected]>
Reported-by: Kent Overstreet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
nathan-zcgao pushed a commit that referenced this pull request Feb 5, 2026
Aishwarya reports that warnings are sometimes seen when running the
ftrace kselftests, e.g.

| WARNING: CPU: 5 PID: 2066 at arch/arm64/kernel/stacktrace.c:141 arch_stack_walk+0x4a0/0x4c0
| Modules linked in:
| CPU: 5 UID: 0 PID: 2066 Comm: ftracetest Not tainted 6.13.0-rc2 #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x4a0/0x4c0
| lr : arch_stack_walk+0x248/0x4c0
| sp : ffff800083643d20
| x29: ffff800083643dd0 x28: ffff00007b891400 x27: ffff00007b891928
| x26: 0000000000000001 x25: 00000000000000c0 x24: ffff800082f39d80
| x23: ffff80008003ee8c x22: ffff80008004baa8 x21: ffff8000800533e0
| x20: ffff800083643e10 x19: ffff80008003eec8 x18: 0000000000000000
| x17: 0000000000000000 x16: ffff800083640000 x15: 0000000000000000
| x14: 02a37a802bbb8a92 x13: 00000000000001a9 x12: 0000000000000001
| x11: ffff800082ffad60 x10: ffff800083643d20 x9 : ffff80008003eed0
| x8 : ffff80008004baa8 x7 : ffff800086f2be80 x6 : ffff0000057cf000
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800086f2b690
| x2 : ffff80008004baa8 x1 : ffff80008004baa8 x0 : ffff80008004baa8
| Call trace:
|  arch_stack_walk+0x4a0/0x4c0 (P)
|  arch_stack_walk+0x248/0x4c0 (L)
|  profile_pc+0x44/0x80
|  profile_tick+0x50/0x80 (F)
|  tick_nohz_handler+0xcc/0x160 (F)
|  __hrtimer_run_queues+0x2ac/0x340 (F)
|  hrtimer_interrupt+0xf4/0x268 (F)
|  arch_timer_handler_virt+0x34/0x60 (F)
|  handle_percpu_devid_irq+0x88/0x220 (F)
|  generic_handle_domain_irq+0x34/0x60 (F)
|  gic_handle_irq+0x54/0x140 (F)
|  call_on_irq_stack+0x24/0x58 (F)
|  do_interrupt_handler+0x88/0x98
|  el1_interrupt+0x34/0x68 (F)
|  el1h_64_irq_handler+0x18/0x28
|  el1h_64_irq+0x6c/0x70
|  queued_spin_lock_slowpath+0x78/0x460 (P)

The warning in question is:

  WARN_ON_ONCE(state->common.pc == orig_pc))

... in kunwind_recover_return_address(), which is triggered when
return_to_handler() is encountered in the trace, but
ftrace_graph_ret_addr() cannot find a corresponding original return
address on the fgraph return stack.

This happens because the stacktrace code encounters an exception
boundary where the LR was not live at the time of the exception, but the
LR happens to contain return_to_handler(); either because the task
recently returned there, or due to unfortunate usage of the LR at a
scratch register. In such cases attempts to recover the return address
via ftrace_graph_ret_addr() may fail, triggering the WARN_ON_ONCE()
above and aborting the unwind (hence the stacktrace terminating after
reporting the PC at the time of the exception).

Handling unreliable LR values in these cases is likely to require some
larger rework, so for the moment avoid this problem by restoring the old
behaviour of skipping the LR at exception boundaries, which the
stacktrace code did prior to commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

This commit is effectively a partial revert, keeping the structures and
logic to explicitly identify exception boundaries while still skipping
reporting of the LR. The logic to explicitly identify exception
boundaries is still useful for general robustness and as a building
block for future support for RELIABLE_STACKTRACE.

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Signed-off-by: Mark Rutland <[email protected]>
Reported-by: Aishwarya TCV <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
nathan-zcgao pushed a commit that referenced this pull request Feb 5, 2026
The arm64 stacktrace code has a few error conditions where a
WARN_ON_ONCE() is triggered before the stacktrace is terminated and an
error is returned to the caller. The conditions shouldn't be triggered
when unwinding the current task, but it is possible to trigger these
when unwinding another task which is not blocked, as the stack of that
task is concurrently modified. Kent reports that these warnings can be
triggered while running filesystem tests on bcachefs, which calls the
stacktrace code directly.

To produce a meaningful stacktrace of another task, the task in question
should be blocked, but the stacktrace code is expected to be robust to
cases where it is not blocked. Note that this is purely about not
unuduly scaring the user and/or crashing the kernel; stacktraces in such
cases are meaningless and may leak kernel secrets from the stack of the
task being unwound.

Ideally we'd pin the task in a blocked state during the unwind, as we do
for /proc/${PID}/wchan since commit:

  42a20f8 ("sched: Add wrapper for get_wchan() to keep task blocked")

... but a bunch of places don't do that, notably /proc/${PID}/stack,
where we don't pin the task in a blocked state, but do restrict the
output to privileged users since commit:

  f8a00ce ("proc: restrict kernel stack dumps to root")

... and so it's possible to trigger these warnings accidentally, e.g. by
reading /proc/*/stack (as root):

| for n in $(seq 1 10); do
|     while true; do cat /proc/*/stack > /dev/null 2>&1; done &
| done
| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370
| Modules linked in:
| CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x1c8/0x370
| lr : arch_stack_walk+0x1b0/0x370
| sp : ffff800080773890
| x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038
| x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000
| x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0
| x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
| x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818
| x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0
| x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000
| Call trace:
|  arch_stack_walk+0x1c8/0x370 (P)
|  stack_trace_save_tsk+0x8c/0x108
|  proc_pid_stack+0xb0/0x134
|  proc_single_show+0x60/0x120
|  seq_read_iter+0x104/0x438
|  seq_read+0xf8/0x140
|  vfs_read+0xc4/0x31c
|  ksys_read+0x70/0x108
|  __arm64_sys_read+0x1c/0x28
|  invoke_syscall+0x48/0x104
|  el0_svc_common.constprop.0+0x40/0xe0
|  do_el0_svc+0x1c/0x28
|  el0_svc+0x30/0xcc
|  el0t_64_sync_handler+0x10c/0x138
|  el0t_64_sync+0x198/0x19c
| ---[ end trace 0000000000000000 ]---

Fix this by only warning when unwinding the current task. When unwinding
another task the error conditions will be handled by returning an error
without producing a warning.

The two warnings in kunwind_next_frame_record_meta() were added recently
as part of commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

The warning when recovering the fgraph return address has changed form
many times, but was originally introduced back in commit:

  9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Fixes: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Signed-off-by: Mark Rutland <[email protected]>
Reported-by: Kent Overstreet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 6, 2026
[ Upstream commit d1bc560 ]

Add nested locking with I_MUTEX_XATTR subclass to avoid lockdep warning
while handling xattr inode on file open syscall at ext4_xattr_inode_iget.

Backtrace
EXT4-fs (loop0): Ignoring removed oldalloc option
======================================================
WARNING: possible circular locking dependency detected
5.10.0-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor543/2794 is trying to acquire lock:
ffff8880215e1a48 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}, at: inode_lock include/linux/fs.h:782 [inline]
ffff8880215e1a48 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}, at: ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425

but task is already holding lock:
ffff8880215e3278 (&ei->i_data_sem/3){++++}-{3:3}, at: ext4_setattr+0x136d/0x19c0 fs/ext4/inode.c:5559

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&ei->i_data_sem/3){++++}-{3:3}:
       lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566
       down_write+0x93/0x180 kernel/locking/rwsem.c:1564
       ext4_update_i_disksize fs/ext4/ext4.h:3267 [inline]
       ext4_xattr_inode_write fs/ext4/xattr.c:1390 [inline]
       ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1538 [inline]
       ext4_xattr_set_entry+0x331a/0x3d80 fs/ext4/xattr.c:1662
       ext4_xattr_ibody_set+0x124/0x390 fs/ext4/xattr.c:2228
       ext4_xattr_set_handle+0xc27/0x14e0 fs/ext4/xattr.c:2385
       ext4_xattr_set+0x219/0x390 fs/ext4/xattr.c:2498
       ext4_xattr_user_set+0xc9/0xf0 fs/ext4/xattr_user.c:40
       __vfs_setxattr+0x404/0x450 fs/xattr.c:177
       __vfs_setxattr_noperm+0x11d/0x4f0 fs/xattr.c:208
       __vfs_setxattr_locked+0x1f9/0x210 fs/xattr.c:266
       vfs_setxattr+0x112/0x2c0 fs/xattr.c:283
       setxattr+0x1db/0x3e0 fs/xattr.c:548
       path_setxattr+0x15a/0x240 fs/xattr.c:567
       __do_sys_setxattr fs/xattr.c:582 [inline]
       __se_sys_setxattr fs/xattr.c:578 [inline]
       __x64_sys_setxattr+0xc5/0xe0 fs/xattr.c:578
       do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62
       entry_SYSCALL_64_after_hwframe+0x61/0xcb

-> #0 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}:
       check_prev_add kernel/locking/lockdep.c:2988 [inline]
       check_prevs_add kernel/locking/lockdep.c:3113 [inline]
       validate_chain+0x1695/0x58f0 kernel/locking/lockdep.c:3729
       __lock_acquire+0x12fd/0x20d0 kernel/locking/lockdep.c:4955
       lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566
       down_write+0x93/0x180 kernel/locking/rwsem.c:1564
       inode_lock include/linux/fs.h:782 [inline]
       ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425
       ext4_xattr_inode_get+0x138/0x410 fs/ext4/xattr.c:485
       ext4_xattr_move_to_block fs/ext4/xattr.c:2580 [inline]
       ext4_xattr_make_inode_space fs/ext4/xattr.c:2682 [inline]
       ext4_expand_extra_isize_ea+0xe70/0x1bb0 fs/ext4/xattr.c:2774
       __ext4_expand_extra_isize+0x304/0x3f0 fs/ext4/inode.c:5898
       ext4_try_to_expand_extra_isize fs/ext4/inode.c:5941 [inline]
       __ext4_mark_inode_dirty+0x591/0x810 fs/ext4/inode.c:6018
       ext4_setattr+0x1400/0x19c0 fs/ext4/inode.c:5562
       notify_change+0xbb6/0xe60 fs/attr.c:435
       do_truncate+0x1de/0x2c0 fs/open.c:64
       handle_truncate fs/namei.c:2970 [inline]
       do_open fs/namei.c:3311 [inline]
       path_openat+0x29f3/0x3290 fs/namei.c:3425
       do_filp_open+0x20b/0x450 fs/namei.c:3452
       do_sys_openat2+0x124/0x460 fs/open.c:1207
       do_sys_open fs/open.c:1223 [inline]
       __do_sys_open fs/open.c:1231 [inline]
       __se_sys_open fs/open.c:1227 [inline]
       __x64_sys_open+0x221/0x270 fs/open.c:1227
       do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62
       entry_SYSCALL_64_after_hwframe+0x61/0xcb

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ei->i_data_sem/3);
                               lock(&ea_inode->i_rwsem#7/1);
                               lock(&ei->i_data_sem/3);
  lock(&ea_inode->i_rwsem#7/1);

 *** DEADLOCK ***

5 locks held by syz-executor543/2794:
 #0: ffff888026fbc448 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write+0x4a/0x2a0 fs/namespace.c:365
 #1: ffff8880215e3488 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: inode_lock include/linux/fs.h:782 [inline]
 #1: ffff8880215e3488 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: do_truncate+0x1cf/0x2c0 fs/open.c:62
 #2: ffff8880215e3310 (&ei->i_mmap_sem){++++}-{3:3}, at: ext4_setattr+0xec4/0x19c0 fs/ext4/inode.c:5519
 #3: ffff8880215e3278 (&ei->i_data_sem/3){++++}-{3:3}, at: ext4_setattr+0x136d/0x19c0 fs/ext4/inode.c:5559
 #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_write_trylock_xattr fs/ext4/xattr.h:162 [inline]
 #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_try_to_expand_extra_isize fs/ext4/inode.c:5938 [inline]
 #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: __ext4_mark_inode_dirty+0x4fb/0x810 fs/ext4/inode.c:6018

stack backtrace:
CPU: 1 PID: 2794 Comm: syz-executor543 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x177/0x211 lib/dump_stack.c:118
 print_circular_bug+0x146/0x1b0 kernel/locking/lockdep.c:2002
 check_noncircular+0x2cc/0x390 kernel/locking/lockdep.c:2123
 check_prev_add kernel/locking/lockdep.c:2988 [inline]
 check_prevs_add kernel/locking/lockdep.c:3113 [inline]
 validate_chain+0x1695/0x58f0 kernel/locking/lockdep.c:3729
 __lock_acquire+0x12fd/0x20d0 kernel/locking/lockdep.c:4955
 lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566
 down_write+0x93/0x180 kernel/locking/rwsem.c:1564
 inode_lock include/linux/fs.h:782 [inline]
 ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425
 ext4_xattr_inode_get+0x138/0x410 fs/ext4/xattr.c:485
 ext4_xattr_move_to_block fs/ext4/xattr.c:2580 [inline]
 ext4_xattr_make_inode_space fs/ext4/xattr.c:2682 [inline]
 ext4_expand_extra_isize_ea+0xe70/0x1bb0 fs/ext4/xattr.c:2774
 __ext4_expand_extra_isize+0x304/0x3f0 fs/ext4/inode.c:5898
 ext4_try_to_expand_extra_isize fs/ext4/inode.c:5941 [inline]
 __ext4_mark_inode_dirty+0x591/0x810 fs/ext4/inode.c:6018
 ext4_setattr+0x1400/0x19c0 fs/ext4/inode.c:5562
 notify_change+0xbb6/0xe60 fs/attr.c:435
 do_truncate+0x1de/0x2c0 fs/open.c:64
 handle_truncate fs/namei.c:2970 [inline]
 do_open fs/namei.c:3311 [inline]
 path_openat+0x29f3/0x3290 fs/namei.c:3425
 do_filp_open+0x20b/0x450 fs/namei.c:3452
 do_sys_openat2+0x124/0x460 fs/open.c:1207
 do_sys_open fs/open.c:1223 [inline]
 __do_sys_open fs/open.c:1231 [inline]
 __se_sys_open fs/open.c:1227 [inline]
 __x64_sys_open+0x221/0x270 fs/open.c:1227
 do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f0cde4ea229
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 21 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd81d1c978 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
RAX: ffffffffffffffda RBX: 0030656c69662f30 RCX: 00007f0cde4ea229
RDX: 0000000000000089 RSI: 00000000000a0a00 RDI: 00000000200001c0
RBP: 2f30656c69662f2e R08: 0000000000208000 R09: 0000000000208000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd81d1c9c0
R13: 00007ffd81d1ca00 R14: 0000000000080000 R15: 0000000000000003
EXT4-fs error (device loop0): ext4_expand_extra_isize_ea:2730: inode #13: comm syz-executor543: corrupted in-inode xattr

Signed-off-by: Wojciech Gładysz <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
(cherry picked from commit c0f57dd)
Signed-off-by: Vegard Nossum <[email protected]>
Signed-off-by: Harshit Mogalapalli <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 6, 2026
[ Upstream commit f9ff766 ]

Fix a kernel panic in the br_netfilter module when sending untagged
traffic via a VxLAN device.
This happens during the check for fragmentation in br_nf_dev_queue_xmit.

It is dependent on:
1) the br_netfilter module being loaded;
2) net.bridge.bridge-nf-call-iptables set to 1;
3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port;
4) untagged frames with size higher than the VxLAN MTU forwarded/flooded

When forwarding the untagged packet to the VxLAN bridge port, before
the netfilter hooks are called, br_handle_egress_vlan_tunnel is called and
changes the skb_dst to the tunnel dst. The tunnel_dst is a metadata type
of dst, i.e., skb_valid_dst(skb) is false, and metadata->dst.dev is NULL.

Then in the br_netfilter hooks, in br_nf_dev_queue_xmit, there's a check
for frames that needs to be fragmented: frames with higher MTU than the
VxLAN device end up calling br_nf_ip_fragment, which in turns call
ip_skb_dst_mtu.

The ip_dst_mtu tries to use the skb_dst(skb) as if it was a valid dst
with valid dst->dev, thus the crash.

This case was never supported in the first place, so drop the packet
instead.

PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data.
[  176.291791] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000110
[  176.292101] Mem abort info:
[  176.292184]   ESR = 0x0000000096000004
[  176.292322]   EC = 0x25: DABT (current EL), IL = 32 bits
[  176.292530]   SET = 0, FnV = 0
[  176.292709]   EA = 0, S1PTW = 0
[  176.292862]   FSC = 0x04: level 0 translation fault
[  176.293013] Data abort info:
[  176.293104]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  176.293488]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  176.293787]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000
[  176.294166] [0000000000000110] pgd=0000000000000000,
p4d=0000000000000000
[  176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth
br_netfilter bridge stp llc ipv6 crct10dif_ce
[  176.295923] CPU: 0 PID: 188 Comm: ping Not tainted
6.8.0-rc3-g5b3fbd61b9d1 #2
[  176.296314] Hardware name: linux,dummy-virt (DT)
[  176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[  176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[  176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter]
[  176.297636] sp : ffff800080003630
[  176.297743] x29: ffff800080003630 x28: 0000000000000008 x27:
ffff6828c49ad9f8
[  176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24:
00000000000003e8
[  176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21:
ffff6828c3b16d28
[  176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18:
0000000000000014
[  176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15:
0000000095744632
[  176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12:
ffffb7e137926a70
[  176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 :
0000000000000000
[  176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 :
f20e0100bebafeca
[  176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 :
0000000000000000
[  176.300586] x2 : 0000000000000000 x1 : ffff6828c3c7ad00 x0 :
ffff6828c7f918f0
[  176.300889] Call trace:
[  176.301123]  br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[  176.301411]  br_nf_post_routing+0x2a8/0x3e4 [br_netfilter]
[  176.301703]  nf_hook_slow+0x48/0x124
[  176.302060]  br_forward_finish+0xc8/0xe8 [bridge]
[  176.302371]  br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[  176.302605]  br_nf_forward_finish+0x118/0x22c [br_netfilter]
[  176.302824]  br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter]
[  176.303136]  br_nf_forward+0x2b8/0x4e0 [br_netfilter]
[  176.303359]  nf_hook_slow+0x48/0x124
[  176.303803]  __br_forward+0xc4/0x194 [bridge]
[  176.304013]  br_flood+0xd4/0x168 [bridge]
[  176.304300]  br_handle_frame_finish+0x1d4/0x5c4 [bridge]
[  176.304536]  br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[  176.304978]  br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter]
[  176.305188]  br_nf_pre_routing+0x250/0x524 [br_netfilter]
[  176.305428]  br_handle_frame+0x244/0x3cc [bridge]
[  176.305695]  __netif_receive_skb_core.constprop.0+0x33c/0xecc
[  176.306080]  __netif_receive_skb_one_core+0x40/0x8c
[  176.306197]  __netif_receive_skb+0x18/0x64
[  176.306369]  process_backlog+0x80/0x124
[  176.306540]  __napi_poll+0x38/0x17c
[  176.306636]  net_rx_action+0x124/0x26c
[  176.306758]  __do_softirq+0x100/0x26c
[  176.307051]  ____do_softirq+0x10/0x1c
[  176.307162]  call_on_irq_stack+0x24/0x4c
[  176.307289]  do_softirq_own_stack+0x1c/0x2c
[  176.307396]  do_softirq+0x54/0x6c
[  176.307485]  __local_bh_enable_ip+0x8c/0x98
[  176.307637]  __dev_queue_xmit+0x22c/0xd28
[  176.307775]  neigh_resolve_output+0xf4/0x1a0
[  176.308018]  ip_finish_output2+0x1c8/0x628
[  176.308137]  ip_do_fragment+0x5b4/0x658
[  176.308279]  ip_fragment.constprop.0+0x48/0xec
[  176.308420]  __ip_finish_output+0xa4/0x254
[  176.308593]  ip_finish_output+0x34/0x130
[  176.308814]  ip_output+0x6c/0x108
[  176.308929]  ip_send_skb+0x50/0xf0
[  176.309095]  ip_push_pending_frames+0x30/0x54
[  176.309254]  raw_sendmsg+0x758/0xaec
[  176.309568]  inet_sendmsg+0x44/0x70
[  176.309667]  __sys_sendto+0x110/0x178
[  176.309758]  __arm64_sys_sendto+0x28/0x38
[  176.309918]  invoke_syscall+0x48/0x110
[  176.310211]  el0_svc_common.constprop.0+0x40/0xe0
[  176.310353]  do_el0_svc+0x1c/0x28
[  176.310434]  el0_svc+0x34/0xb4
[  176.310551]  el0t_64_sync_handler+0x120/0x12c
[  176.310690]  el0t_64_sync+0x190/0x194
[  176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860)
[  176.315743] ---[ end trace 0000000000000000 ]---
[  176.316060] Kernel panic - not syncing: Oops: Fatal exception in
interrupt
[  176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000
[  176.316564] PHYS_OFFSET: 0xffff97d780000000
[  176.316782] CPU features: 0x0,88000203,3c020000,0100421b
[  176.317210] Memory Limit: none
[  176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal
Exception in interrupt ]---\

Fixes: 11538d0 ("bridge: vlan dst_metadata hooks in ingress and egress paths")
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: Andy Roulin <[email protected]>
Acked-by: Nikolay Aleksandrov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
(cherry picked from commit f071312)
Signed-off-by: Vegard Nossum <[email protected]>
Signed-off-by: Harshit Mogalapalli <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 6, 2026
[ Upstream commit 2aa1545 ]

Since the commit 25c6a5a ("net: phy: micrel: Dynamically control
external clock of KSZ PHY"), the clock of Micrel PHY has been enabled
by phy_driver::resume() and disabled by phy_driver::suspend(). However,
devm_clk_get_optional_enabled() is used in kszphy_probe(), so the clock
will automatically be disabled when the device is unbound from the bus.
Therefore, this could cause the clock to be disabled twice, resulting
in clk driver warnings.

For example, this issue can be reproduced on i.MX6ULL platform, and we
can see the following logs when removing the FEC MAC drivers.

$ echo 2188000.ethernet > /sys/bus/platform/drivers/fec/unbind
$ echo 20b4000.ethernet > /sys/bus/platform/drivers/fec/unbind
[  109.758207] ------------[ cut here ]------------
[  109.758240] WARNING: drivers/clk/clk.c:1188 at clk_core_disable+0xb4/0xd0, CPU#0: sh/639
[  109.771011] enet2_ref already disabled
[  109.793359] Call trace:
[  109.822006]  clk_core_disable from clk_disable+0x28/0x34
[  109.827340]  clk_disable from clk_disable_unprepare+0xc/0x18
[  109.833029]  clk_disable_unprepare from devm_clk_release+0x1c/0x28
[  109.839241]  devm_clk_release from devres_release_all+0x98/0x100
[  109.845278]  devres_release_all from device_unbind_cleanup+0xc/0x70
[  109.851571]  device_unbind_cleanup from device_release_driver_internal+0x1a4/0x1f4
[  109.859170]  device_release_driver_internal from bus_remove_device+0xbc/0xe4
[  109.866243]  bus_remove_device from device_del+0x140/0x458
[  109.871757]  device_del from phy_mdio_device_remove+0xc/0x24
[  109.877452]  phy_mdio_device_remove from mdiobus_unregister+0x40/0xac
[  109.883918]  mdiobus_unregister from fec_enet_mii_remove+0x40/0x78
[  109.890125]  fec_enet_mii_remove from fec_drv_remove+0x4c/0x158
[  109.896076]  fec_drv_remove from device_release_driver_internal+0x17c/0x1f4
[  109.962748] WARNING: drivers/clk/clk.c:1047 at clk_core_unprepare+0xfc/0x13c, CPU#0: sh/639
[  109.975805] enet2_ref already unprepared
[  110.002866] Call trace:
[  110.031758]  clk_core_unprepare from clk_unprepare+0x24/0x2c
[  110.037440]  clk_unprepare from devm_clk_release+0x1c/0x28
[  110.042957]  devm_clk_release from devres_release_all+0x98/0x100
[  110.048989]  devres_release_all from device_unbind_cleanup+0xc/0x70
[  110.055280]  device_unbind_cleanup from device_release_driver_internal+0x1a4/0x1f4
[  110.062877]  device_release_driver_internal from bus_remove_device+0xbc/0xe4
[  110.069950]  bus_remove_device from device_del+0x140/0x458
[  110.075469]  device_del from phy_mdio_device_remove+0xc/0x24
[  110.081165]  phy_mdio_device_remove from mdiobus_unregister+0x40/0xac
[  110.087632]  mdiobus_unregister from fec_enet_mii_remove+0x40/0x78
[  110.093836]  fec_enet_mii_remove from fec_drv_remove+0x4c/0x158
[  110.099782]  fec_drv_remove from device_release_driver_internal+0x17c/0x1f4

After analyzing the process of removing the FEC driver, as shown below,
it can be seen that the clock was disabled twice by the PHY driver.

fec_drv_remove()
  --> fec_enet_close()
    --> phy_stop()
      --> phy_suspend()
        --> kszphy_suspend() #1 The clock is disabled
  --> fec_enet_mii_remove()
    --> mdiobus_unregister()
      --> phy_mdio_device_remove()
        --> device_del()
          --> devm_clk_release() #2 The clock is disabled again

Therefore, devm_clk_get_optional() is used to fix the above issue. And
to avoid the issue mentioned by the commit 9853294 ("net: phy:
micrel: use devm_clk_get_optional_enabled for the rmii-ref clock"), the
clock is enabled by clk_prepare_enable() to get the correct clock rate.

Fixes: 25c6a5a ("net: phy: micrel: Dynamically control external clock of KSZ PHY")
Signed-off-by: Wei Fang <[email protected]>
Reviewed-by: Maxime Chevallier <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 6, 2026
[ Upstream commit a80c9d9 ]

A null-ptr-deref was reported in the SCTP transmit path when SCTP-AUTH key
initialization fails:

  ==================================================================
  KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
  CPU: 0 PID: 16 Comm: ksoftirqd/0 Tainted: G W 6.6.0 #2
  RIP: 0010:sctp_packet_bundle_auth net/sctp/output.c:264 [inline]
  RIP: 0010:sctp_packet_append_chunk+0xb36/0x1260 net/sctp/output.c:401
  Call Trace:

  sctp_packet_transmit_chunk+0x31/0x250 net/sctp/output.c:189
  sctp_outq_flush_data+0xa29/0x26d0 net/sctp/outqueue.c:1111
  sctp_outq_flush+0xc80/0x1240 net/sctp/outqueue.c:1217
  sctp_cmd_interpreter.isra.0+0x19a5/0x62c0 net/sctp/sm_sideeffect.c:1787
  sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline]
  sctp_do_sm+0x1a3/0x670 net/sctp/sm_sideeffect.c:1169
  sctp_assoc_bh_rcv+0x33e/0x640 net/sctp/associola.c:1052
  sctp_inq_push+0x1dd/0x280 net/sctp/inqueue.c:88
  sctp_rcv+0x11ae/0x3100 net/sctp/input.c:243
  sctp6_rcv+0x3d/0x60 net/sctp/ipv6.c:1127

The issue is triggered when sctp_auth_asoc_init_active_key() fails in
sctp_sf_do_5_1C_ack() while processing an INIT_ACK. In this case, the
command sequence is currently:

- SCTP_CMD_PEER_INIT
- SCTP_CMD_TIMER_STOP (T1_INIT)
- SCTP_CMD_TIMER_START (T1_COOKIE)
- SCTP_CMD_NEW_STATE (COOKIE_ECHOED)
- SCTP_CMD_ASSOC_SHKEY
- SCTP_CMD_GEN_COOKIE_ECHO

If SCTP_CMD_ASSOC_SHKEY fails, asoc->shkey remains NULL, while
asoc->peer.auth_capable and asoc->peer.peer_chunks have already been set by
SCTP_CMD_PEER_INIT. This allows a DATA chunk with auth = 1 and shkey = NULL
to be queued by sctp_datamsg_from_user().

Since command interpretation stops on failure, no COOKIE_ECHO should been
sent via SCTP_CMD_GEN_COOKIE_ECHO. However, the T1_COOKIE timer has already
been started, and it may enqueue a COOKIE_ECHO into the outqueue later. As
a result, the DATA chunk can be transmitted together with the COOKIE_ECHO
in sctp_outq_flush_data(), leading to the observed issue.

Similar to the other places where it calls sctp_auth_asoc_init_active_key()
right after sctp_process_init(), this patch moves the SCTP_CMD_ASSOC_SHKEY
immediately after SCTP_CMD_PEER_INIT, before stopping T1_INIT and starting
T1_COOKIE. This ensures that if shared key generation fails, authenticated
DATA cannot be sent. It also allows the T1_INIT timer to retransmit INIT,
giving the client another chance to process INIT_ACK and retry key setup.

Fixes: 730fc3d ("[SCTP]: Implete SCTP-AUTH parameter processing")
Reported-by: Zhen Chen <[email protected]>
Tested-by: Zhen Chen <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Link: https://patch.msgid.link/44881224b375aa8853f5e19b4055a1a56d895813.1768324226.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 6, 2026
…void Priority Inversion in SRIOV

[ Upstream commit dc0297f ]

RLCG Register Access is a way for virtual functions to safely access GPU
registers in a virtualized environment., including TLB flushes and
register reads. When multiple threads or VFs try to access the same
registers simultaneously, it can lead to race conditions. By using the
RLCG interface, the driver can serialize access to the registers. This
means that only one thread can access the registers at a time,
preventing conflicts and ensuring that operations are performed
correctly. Additionally, when a low-priority task holds a mutex that a
high-priority task needs, ie., If a thread holding a spinlock tries to
acquire a mutex, it can lead to priority inversion. register access in
amdgpu_virt_rlcg_reg_rw especially in a fast code path is critical.

The call stack shows that the function amdgpu_virt_rlcg_reg_rw is being
called, which attempts to acquire the mutex. This function is invoked
from amdgpu_sriov_wreg, which in turn is called from
gmc_v11_0_flush_gpu_tlb.

The [ BUG: Invalid wait context ] indicates that a thread is trying to
acquire a mutex while it is in a context that does not allow it to sleep
(like holding a spinlock).

Fixes the below:

[  253.013423] =============================
[  253.013434] [ BUG: Invalid wait context ]
[  253.013446] 6.12.0-amdstaging-drm-next-lol-050225 #14 Tainted: G     U     OE
[  253.013464] -----------------------------
[  253.013475] kworker/0:1/10 is trying to lock:
[  253.013487] ffff9f30542e3cf8 (&adev->virt.rlcg_reg_lock){+.+.}-{3:3}, at: amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.013815] other info that might help us debug this:
[  253.013827] context-{4:4}
[  253.013835] 3 locks held by kworker/0:1/10:
[  253.013847]  #0: ffff9f3040050f58 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
[  253.013877]  #1: ffffb789c008be40 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
[  253.013905]  #2: ffff9f3054281838 (&adev->gmc.invalidate_lock){+.+.}-{2:2}, at: gmc_v11_0_flush_gpu_tlb+0x198/0x4f0 [amdgpu]
[  253.014154] stack backtrace:
[  253.014164] CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Tainted: G     U     OE      6.12.0-amdstaging-drm-next-lol-050225 #14
[  253.014189] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  253.014203] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/18/2024
[  253.014224] Workqueue: events work_for_cpu_fn
[  253.014241] Call Trace:
[  253.014250]  <TASK>
[  253.014260]  dump_stack_lvl+0x9b/0xf0
[  253.014275]  dump_stack+0x10/0x20
[  253.014287]  __lock_acquire+0xa47/0x2810
[  253.014303]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.014321]  lock_acquire+0xd1/0x300
[  253.014333]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.014562]  ? __lock_acquire+0xa6b/0x2810
[  253.014578]  __mutex_lock+0x85/0xe20
[  253.014591]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.014782]  ? sched_clock_noinstr+0x9/0x10
[  253.014795]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.014808]  ? local_clock_noinstr+0xe/0xc0
[  253.014822]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.015012]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.015029]  mutex_lock_nested+0x1b/0x30
[  253.015044]  ? mutex_lock_nested+0x1b/0x30
[  253.015057]  amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.015249]  amdgpu_sriov_wreg+0xc5/0xd0 [amdgpu]
[  253.015435]  gmc_v11_0_flush_gpu_tlb+0x44b/0x4f0 [amdgpu]
[  253.015667]  gfx_v11_0_hw_init+0x499/0x29c0 [amdgpu]
[  253.015901]  ? __pfx_smu_v13_0_update_pcie_parameters+0x10/0x10 [amdgpu]
[  253.016159]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.016173]  ? smu_hw_init+0x18d/0x300 [amdgpu]
[  253.016403]  amdgpu_device_init+0x29ad/0x36a0 [amdgpu]
[  253.016614]  amdgpu_driver_load_kms+0x1a/0xc0 [amdgpu]
[  253.017057]  amdgpu_pci_probe+0x1c2/0x660 [amdgpu]
[  253.017493]  local_pci_probe+0x4b/0xb0
[  253.017746]  work_for_cpu_fn+0x1a/0x30
[  253.017995]  process_one_work+0x21e/0x680
[  253.018248]  worker_thread+0x190/0x330
[  253.018500]  ? __pfx_worker_thread+0x10/0x10
[  253.018746]  kthread+0xe7/0x120
[  253.018988]  ? __pfx_kthread+0x10/0x10
[  253.019231]  ret_from_fork+0x3c/0x60
[  253.019468]  ? __pfx_kthread+0x10/0x10
[  253.019701]  ret_from_fork_asm+0x1a/0x30
[  253.019939]  </TASK>

v2: s/spin_trylock/spin_lock_irqsave to be safe (Christian).

Fixes: e864180 ("drm/amdgpu: Add lock around VF RLCG interface")
Cc: lin cao <[email protected]>
Cc: Jingwen Chen <[email protected]>
Cc: Victor Skvortsov <[email protected]>
Cc: Zhigang Luo <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Suggested-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
[ Minor conflict resolved. ]
Signed-off-by: Li hongliang <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 6, 2026
[ Upstream commit 2aa1545 ]

Since the commit 25c6a5a ("net: phy: micrel: Dynamically control
external clock of KSZ PHY"), the clock of Micrel PHY has been enabled
by phy_driver::resume() and disabled by phy_driver::suspend(). However,
devm_clk_get_optional_enabled() is used in kszphy_probe(), so the clock
will automatically be disabled when the device is unbound from the bus.
Therefore, this could cause the clock to be disabled twice, resulting
in clk driver warnings.

For example, this issue can be reproduced on i.MX6ULL platform, and we
can see the following logs when removing the FEC MAC drivers.

$ echo 2188000.ethernet > /sys/bus/platform/drivers/fec/unbind
$ echo 20b4000.ethernet > /sys/bus/platform/drivers/fec/unbind
[  109.758207] ------------[ cut here ]------------
[  109.758240] WARNING: drivers/clk/clk.c:1188 at clk_core_disable+0xb4/0xd0, CPU#0: sh/639
[  109.771011] enet2_ref already disabled
[  109.793359] Call trace:
[  109.822006]  clk_core_disable from clk_disable+0x28/0x34
[  109.827340]  clk_disable from clk_disable_unprepare+0xc/0x18
[  109.833029]  clk_disable_unprepare from devm_clk_release+0x1c/0x28
[  109.839241]  devm_clk_release from devres_release_all+0x98/0x100
[  109.845278]  devres_release_all from device_unbind_cleanup+0xc/0x70
[  109.851571]  device_unbind_cleanup from device_release_driver_internal+0x1a4/0x1f4
[  109.859170]  device_release_driver_internal from bus_remove_device+0xbc/0xe4
[  109.866243]  bus_remove_device from device_del+0x140/0x458
[  109.871757]  device_del from phy_mdio_device_remove+0xc/0x24
[  109.877452]  phy_mdio_device_remove from mdiobus_unregister+0x40/0xac
[  109.883918]  mdiobus_unregister from fec_enet_mii_remove+0x40/0x78
[  109.890125]  fec_enet_mii_remove from fec_drv_remove+0x4c/0x158
[  109.896076]  fec_drv_remove from device_release_driver_internal+0x17c/0x1f4
[  109.962748] WARNING: drivers/clk/clk.c:1047 at clk_core_unprepare+0xfc/0x13c, CPU#0: sh/639
[  109.975805] enet2_ref already unprepared
[  110.002866] Call trace:
[  110.031758]  clk_core_unprepare from clk_unprepare+0x24/0x2c
[  110.037440]  clk_unprepare from devm_clk_release+0x1c/0x28
[  110.042957]  devm_clk_release from devres_release_all+0x98/0x100
[  110.048989]  devres_release_all from device_unbind_cleanup+0xc/0x70
[  110.055280]  device_unbind_cleanup from device_release_driver_internal+0x1a4/0x1f4
[  110.062877]  device_release_driver_internal from bus_remove_device+0xbc/0xe4
[  110.069950]  bus_remove_device from device_del+0x140/0x458
[  110.075469]  device_del from phy_mdio_device_remove+0xc/0x24
[  110.081165]  phy_mdio_device_remove from mdiobus_unregister+0x40/0xac
[  110.087632]  mdiobus_unregister from fec_enet_mii_remove+0x40/0x78
[  110.093836]  fec_enet_mii_remove from fec_drv_remove+0x4c/0x158
[  110.099782]  fec_drv_remove from device_release_driver_internal+0x17c/0x1f4

After analyzing the process of removing the FEC driver, as shown below,
it can be seen that the clock was disabled twice by the PHY driver.

fec_drv_remove()
  --> fec_enet_close()
    --> phy_stop()
      --> phy_suspend()
        --> kszphy_suspend() #1 The clock is disabled
  --> fec_enet_mii_remove()
    --> mdiobus_unregister()
      --> phy_mdio_device_remove()
        --> device_del()
          --> devm_clk_release() #2 The clock is disabled again

Therefore, devm_clk_get_optional() is used to fix the above issue. And
to avoid the issue mentioned by the commit 9853294 ("net: phy:
micrel: use devm_clk_get_optional_enabled for the rmii-ref clock"), the
clock is enabled by clk_prepare_enable() to get the correct clock rate.

Fixes: 25c6a5a ("net: phy: micrel: Dynamically control external clock of KSZ PHY")
Signed-off-by: Wei Fang <[email protected]>
Reviewed-by: Maxime Chevallier <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 6, 2026
[ Upstream commit a80c9d9 ]

A null-ptr-deref was reported in the SCTP transmit path when SCTP-AUTH key
initialization fails:

  ==================================================================
  KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
  CPU: 0 PID: 16 Comm: ksoftirqd/0 Tainted: G W 6.6.0 #2
  RIP: 0010:sctp_packet_bundle_auth net/sctp/output.c:264 [inline]
  RIP: 0010:sctp_packet_append_chunk+0xb36/0x1260 net/sctp/output.c:401
  Call Trace:

  sctp_packet_transmit_chunk+0x31/0x250 net/sctp/output.c:189
  sctp_outq_flush_data+0xa29/0x26d0 net/sctp/outqueue.c:1111
  sctp_outq_flush+0xc80/0x1240 net/sctp/outqueue.c:1217
  sctp_cmd_interpreter.isra.0+0x19a5/0x62c0 net/sctp/sm_sideeffect.c:1787
  sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline]
  sctp_do_sm+0x1a3/0x670 net/sctp/sm_sideeffect.c:1169
  sctp_assoc_bh_rcv+0x33e/0x640 net/sctp/associola.c:1052
  sctp_inq_push+0x1dd/0x280 net/sctp/inqueue.c:88
  sctp_rcv+0x11ae/0x3100 net/sctp/input.c:243
  sctp6_rcv+0x3d/0x60 net/sctp/ipv6.c:1127

The issue is triggered when sctp_auth_asoc_init_active_key() fails in
sctp_sf_do_5_1C_ack() while processing an INIT_ACK. In this case, the
command sequence is currently:

- SCTP_CMD_PEER_INIT
- SCTP_CMD_TIMER_STOP (T1_INIT)
- SCTP_CMD_TIMER_START (T1_COOKIE)
- SCTP_CMD_NEW_STATE (COOKIE_ECHOED)
- SCTP_CMD_ASSOC_SHKEY
- SCTP_CMD_GEN_COOKIE_ECHO

If SCTP_CMD_ASSOC_SHKEY fails, asoc->shkey remains NULL, while
asoc->peer.auth_capable and asoc->peer.peer_chunks have already been set by
SCTP_CMD_PEER_INIT. This allows a DATA chunk with auth = 1 and shkey = NULL
to be queued by sctp_datamsg_from_user().

Since command interpretation stops on failure, no COOKIE_ECHO should been
sent via SCTP_CMD_GEN_COOKIE_ECHO. However, the T1_COOKIE timer has already
been started, and it may enqueue a COOKIE_ECHO into the outqueue later. As
a result, the DATA chunk can be transmitted together with the COOKIE_ECHO
in sctp_outq_flush_data(), leading to the observed issue.

Similar to the other places where it calls sctp_auth_asoc_init_active_key()
right after sctp_process_init(), this patch moves the SCTP_CMD_ASSOC_SHKEY
immediately after SCTP_CMD_PEER_INIT, before stopping T1_INIT and starting
T1_COOKIE. This ensures that if shared key generation fails, authenticated
DATA cannot be sent. It also allows the T1_INIT timer to retransmit INIT,
giving the client another chance to process INIT_ACK and retry key setup.

Fixes: 730fc3d ("[SCTP]: Implete SCTP-AUTH parameter processing")
Reported-by: Zhen Chen <[email protected]>
Tested-by: Zhen Chen <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Link: https://patch.msgid.link/44881224b375aa8853f5e19b4055a1a56d895813.1768324226.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 6, 2026
[ Upstream commit a80c9d9 ]

A null-ptr-deref was reported in the SCTP transmit path when SCTP-AUTH key
initialization fails:

  ==================================================================
  KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
  CPU: 0 PID: 16 Comm: ksoftirqd/0 Tainted: G W 6.6.0 #2
  RIP: 0010:sctp_packet_bundle_auth net/sctp/output.c:264 [inline]
  RIP: 0010:sctp_packet_append_chunk+0xb36/0x1260 net/sctp/output.c:401
  Call Trace:

  sctp_packet_transmit_chunk+0x31/0x250 net/sctp/output.c:189
  sctp_outq_flush_data+0xa29/0x26d0 net/sctp/outqueue.c:1111
  sctp_outq_flush+0xc80/0x1240 net/sctp/outqueue.c:1217
  sctp_cmd_interpreter.isra.0+0x19a5/0x62c0 net/sctp/sm_sideeffect.c:1787
  sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline]
  sctp_do_sm+0x1a3/0x670 net/sctp/sm_sideeffect.c:1169
  sctp_assoc_bh_rcv+0x33e/0x640 net/sctp/associola.c:1052
  sctp_inq_push+0x1dd/0x280 net/sctp/inqueue.c:88
  sctp_rcv+0x11ae/0x3100 net/sctp/input.c:243
  sctp6_rcv+0x3d/0x60 net/sctp/ipv6.c:1127

The issue is triggered when sctp_auth_asoc_init_active_key() fails in
sctp_sf_do_5_1C_ack() while processing an INIT_ACK. In this case, the
command sequence is currently:

- SCTP_CMD_PEER_INIT
- SCTP_CMD_TIMER_STOP (T1_INIT)
- SCTP_CMD_TIMER_START (T1_COOKIE)
- SCTP_CMD_NEW_STATE (COOKIE_ECHOED)
- SCTP_CMD_ASSOC_SHKEY
- SCTP_CMD_GEN_COOKIE_ECHO

If SCTP_CMD_ASSOC_SHKEY fails, asoc->shkey remains NULL, while
asoc->peer.auth_capable and asoc->peer.peer_chunks have already been set by
SCTP_CMD_PEER_INIT. This allows a DATA chunk with auth = 1 and shkey = NULL
to be queued by sctp_datamsg_from_user().

Since command interpretation stops on failure, no COOKIE_ECHO should been
sent via SCTP_CMD_GEN_COOKIE_ECHO. However, the T1_COOKIE timer has already
been started, and it may enqueue a COOKIE_ECHO into the outqueue later. As
a result, the DATA chunk can be transmitted together with the COOKIE_ECHO
in sctp_outq_flush_data(), leading to the observed issue.

Similar to the other places where it calls sctp_auth_asoc_init_active_key()
right after sctp_process_init(), this patch moves the SCTP_CMD_ASSOC_SHKEY
immediately after SCTP_CMD_PEER_INIT, before stopping T1_INIT and starting
T1_COOKIE. This ensures that if shared key generation fails, authenticated
DATA cannot be sent. It also allows the T1_INIT timer to retransmit INIT,
giving the client another chance to process INIT_ACK and retry key setup.

Fixes: 730fc3d ("[SCTP]: Implete SCTP-AUTH parameter processing")
Reported-by: Zhen Chen <[email protected]>
Tested-by: Zhen Chen <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Link: https://patch.msgid.link/44881224b375aa8853f5e19b4055a1a56d895813.1768324226.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 6, 2026
…void Priority Inversion in SRIOV

[ Upstream commit dc0297f ]

RLCG Register Access is a way for virtual functions to safely access GPU
registers in a virtualized environment., including TLB flushes and
register reads. When multiple threads or VFs try to access the same
registers simultaneously, it can lead to race conditions. By using the
RLCG interface, the driver can serialize access to the registers. This
means that only one thread can access the registers at a time,
preventing conflicts and ensuring that operations are performed
correctly. Additionally, when a low-priority task holds a mutex that a
high-priority task needs, ie., If a thread holding a spinlock tries to
acquire a mutex, it can lead to priority inversion. register access in
amdgpu_virt_rlcg_reg_rw especially in a fast code path is critical.

The call stack shows that the function amdgpu_virt_rlcg_reg_rw is being
called, which attempts to acquire the mutex. This function is invoked
from amdgpu_sriov_wreg, which in turn is called from
gmc_v11_0_flush_gpu_tlb.

The [ BUG: Invalid wait context ] indicates that a thread is trying to
acquire a mutex while it is in a context that does not allow it to sleep
(like holding a spinlock).

Fixes the below:

[  253.013423] =============================
[  253.013434] [ BUG: Invalid wait context ]
[  253.013446] 6.12.0-amdstaging-drm-next-lol-050225 #14 Tainted: G     U     OE
[  253.013464] -----------------------------
[  253.013475] kworker/0:1/10 is trying to lock:
[  253.013487] ffff9f30542e3cf8 (&adev->virt.rlcg_reg_lock){+.+.}-{3:3}, at: amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.013815] other info that might help us debug this:
[  253.013827] context-{4:4}
[  253.013835] 3 locks held by kworker/0:1/10:
[  253.013847]  #0: ffff9f3040050f58 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
[  253.013877]  #1: ffffb789c008be40 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
[  253.013905]  #2: ffff9f3054281838 (&adev->gmc.invalidate_lock){+.+.}-{2:2}, at: gmc_v11_0_flush_gpu_tlb+0x198/0x4f0 [amdgpu]
[  253.014154] stack backtrace:
[  253.014164] CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Tainted: G     U     OE      6.12.0-amdstaging-drm-next-lol-050225 #14
[  253.014189] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  253.014203] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/18/2024
[  253.014224] Workqueue: events work_for_cpu_fn
[  253.014241] Call Trace:
[  253.014250]  <TASK>
[  253.014260]  dump_stack_lvl+0x9b/0xf0
[  253.014275]  dump_stack+0x10/0x20
[  253.014287]  __lock_acquire+0xa47/0x2810
[  253.014303]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.014321]  lock_acquire+0xd1/0x300
[  253.014333]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.014562]  ? __lock_acquire+0xa6b/0x2810
[  253.014578]  __mutex_lock+0x85/0xe20
[  253.014591]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.014782]  ? sched_clock_noinstr+0x9/0x10
[  253.014795]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.014808]  ? local_clock_noinstr+0xe/0xc0
[  253.014822]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.015012]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.015029]  mutex_lock_nested+0x1b/0x30
[  253.015044]  ? mutex_lock_nested+0x1b/0x30
[  253.015057]  amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
[  253.015249]  amdgpu_sriov_wreg+0xc5/0xd0 [amdgpu]
[  253.015435]  gmc_v11_0_flush_gpu_tlb+0x44b/0x4f0 [amdgpu]
[  253.015667]  gfx_v11_0_hw_init+0x499/0x29c0 [amdgpu]
[  253.015901]  ? __pfx_smu_v13_0_update_pcie_parameters+0x10/0x10 [amdgpu]
[  253.016159]  ? srso_alias_return_thunk+0x5/0xfbef5
[  253.016173]  ? smu_hw_init+0x18d/0x300 [amdgpu]
[  253.016403]  amdgpu_device_init+0x29ad/0x36a0 [amdgpu]
[  253.016614]  amdgpu_driver_load_kms+0x1a/0xc0 [amdgpu]
[  253.017057]  amdgpu_pci_probe+0x1c2/0x660 [amdgpu]
[  253.017493]  local_pci_probe+0x4b/0xb0
[  253.017746]  work_for_cpu_fn+0x1a/0x30
[  253.017995]  process_one_work+0x21e/0x680
[  253.018248]  worker_thread+0x190/0x330
[  253.018500]  ? __pfx_worker_thread+0x10/0x10
[  253.018746]  kthread+0xe7/0x120
[  253.018988]  ? __pfx_kthread+0x10/0x10
[  253.019231]  ret_from_fork+0x3c/0x60
[  253.019468]  ? __pfx_kthread+0x10/0x10
[  253.019701]  ret_from_fork_asm+0x1a/0x30
[  253.019939]  </TASK>

v2: s/spin_trylock/spin_lock_irqsave to be safe (Christian).

Fixes: e864180 ("drm/amdgpu: Add lock around VF RLCG interface")
Cc: lin cao <[email protected]>
Cc: Jingwen Chen <[email protected]>
Cc: Victor Skvortsov <[email protected]>
Cc: Zhigang Luo <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Suggested-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
[ Minor conflict resolved. ]
Signed-off-by: Li hongliang <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
aahmed71 pushed a commit that referenced this pull request Feb 7, 2026
As reported by syzbot and experienced by Pavel, using cpus_read_lock()
in wake_up_all_idle_cpus() generates lock inversion (against mmap_sem
and possibly others).

Instead, shrink the preempt disable region by iterating all CPUs and
checking the online status for each individual CPU while having
preemption disabled.

Fixes: 8850cb6 ("sched: Simplify wake_up_*idle*()")
Reported-by: [email protected]
Reported-by: Pavel Machek <[email protected]>
Reported-by: Qian Cai <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Qian Cai <[email protected]>
aahmed71 pushed a commit that referenced this pull request Feb 7, 2026
commit f79a609 upstream.

log_max_qp in driver's default profile #2 was set to 18, but FW actually
supports 17 at the most - a situation that led to the concerning print
when the driver is loaded:
"log_max_qp value in current profile is 18, changing to HCA capabaility
limit (17)"

The expected behavior from mlx5_profile #2 is to match the maximum FW
capability in regards to log_max_qp. Thus, log_max_qp in profile #2 is
initialized to a defined static value (0xff) - which basically means that
when loading this profile, log_max_qp value  will be what the currently
installed FW supports at most.

Signed-off-by: Maher Sanalla <[email protected]>
Reviewed-by: Maor Gottlieb <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
[v5.10: replaced prof->log_max_qp with profile[prof_sel].log_max_qp]
Signed-off-by: Shaoying Xu <[email protected]>
aahmed71 pushed a commit that referenced this pull request Feb 7, 2026
[ Upstream commit 9afb4b2 ]

To clear the flow table on flow table free, the following sequence
normally happens in order:

  1) gc_step work is stopped to disable any further stats/del requests.
  2) All flow table entries are set to teardown state.
  3) Run gc_step which will queue HW del work for each flow table entry.
  4) Waiting for the above del work to finish (flush).
  5) Run gc_step again, deleting all entries from the flow table.
  6) Flow table is freed.

But if a flow table entry already has pending HW stats or HW add work
step 3 will not queue HW del work (it will be skipped), step 4 will wait
for the pending add/stats to finish, and step 5 will queue HW del work
which might execute after freeing of the flow table.

To fix the above, this patch flushes the pending work, then it sets the
teardown flag to all flows in the flowtable and it forces a garbage
collector run to queue work to remove the flows from hardware, then it
flushes this new pending work and (finally) it forces another garbage
collector run to remove the entry from the software flowtable.

Stack trace:
[47773.882335] BUG: KASAN: use-after-free in down_read+0x99/0x460
[47773.883634] Write of size 8 at addr ffff888103b45aa8 by task kworker/u20:6/543704
[47773.885634] CPU: 3 PID: 543704 Comm: kworker/u20:6 Not tainted 5.12.0-rc7+ #2
[47773.886745] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)
[47773.888438] Workqueue: nf_ft_offload_del flow_offload_work_handler [nf_flow_table]
[47773.889727] Call Trace:
[47773.890214]  dump_stack+0xbb/0x107
[47773.890818]  print_address_description.constprop.0+0x18/0x140
[47773.892990]  kasan_report.cold+0x7c/0xd8
[47773.894459]  kasan_check_range+0x145/0x1a0
[47773.895174]  down_read+0x99/0x460
[47773.899706]  nf_flow_offload_tuple+0x24f/0x3c0 [nf_flow_table]
[47773.907137]  flow_offload_work_handler+0x72d/0xbe0 [nf_flow_table]
[47773.913372]  process_one_work+0x8ac/0x14e0
[47773.921325]
[47773.921325] Allocated by task 592159:
[47773.922031]  kasan_save_stack+0x1b/0x40
[47773.922730]  __kasan_kmalloc+0x7a/0x90
[47773.923411]  tcf_ct_flow_table_get+0x3cb/0x1230 [act_ct]
[47773.924363]  tcf_ct_init+0x71c/0x1156 [act_ct]
[47773.925207]  tcf_action_init_1+0x45b/0x700
[47773.925987]  tcf_action_init+0x453/0x6b0
[47773.926692]  tcf_exts_validate+0x3d0/0x600
[47773.927419]  fl_change+0x757/0x4a51 [cls_flower]
[47773.928227]  tc_new_tfilter+0x89a/0x2070
[47773.936652]
[47773.936652] Freed by task 543704:
[47773.937303]  kasan_save_stack+0x1b/0x40
[47773.938039]  kasan_set_track+0x1c/0x30
[47773.938731]  kasan_set_free_info+0x20/0x30
[47773.939467]  __kasan_slab_free+0xe7/0x120
[47773.940194]  slab_free_freelist_hook+0x86/0x190
[47773.941038]  kfree+0xce/0x3a0
[47773.941644]  tcf_ct_flow_table_cleanup_work

Original patch description and stack trace by Paul Blakey.

Fixes: c29f74e ("netfilter: nf_flow_table: hardware offload support")
Reported-by: Paul Blakey <[email protected]>
Tested-by: Paul Blakey <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Nathan Gao <[email protected]>
aahmed71 pushed a commit that referenced this pull request Feb 7, 2026
commit c886d70 upstream.

Dipanjan reported a syzbot splat at close time:

WARNING: CPU: 1 PID: 10818 at net/ipv4/af_inet.c:153
inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153
Modules linked in: uio_ivshmem(OE) uio(E)
CPU: 1 PID: 10818 Comm: kworker/1:16 Tainted: G           OE
5.19.0-rc6-g2eae0556bb9d #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Workqueue: events mptcp_worker
RIP: 0010:inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153
Code: 21 02 00 00 41 8b 9c 24 28 02 00 00 e9 07 ff ff ff e8 34 4d 91
f9 89 ee 4c 89 e7 e8 4a 47 60 ff e9 a6 fc ff ff e8 20 4d 91 f9 <0f> 0b
e9 84 fe ff ff e8 14 4d 91 f9 0f 0b e9 d4 fd ff ff e8 08 4d
RSP: 0018:ffffc9001b35fa78 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000002879d0 RCX: ffff8881326f3b00
RDX: 0000000000000000 RSI: ffff8881326f3b00 RDI: 0000000000000002
RBP: ffff888179662674 R08: ffffffff87e983a0 R09: 0000000000000000
R10: 0000000000000005 R11: 00000000000004ea R12: ffff888179662400
R13: ffff888179662428 R14: 0000000000000001 R15: ffff88817e38e258
FS:  0000000000000000(0000) GS:ffff8881f5f00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020007bc0 CR3: 0000000179592000 CR4: 0000000000150ee0
Call Trace:
 <TASK>
 __sk_destruct+0x4f/0x8e0 net/core/sock.c:2067
 sk_destruct+0xbd/0xe0 net/core/sock.c:2112
 __sk_free+0xef/0x3d0 net/core/sock.c:2123
 sk_free+0x78/0xa0 net/core/sock.c:2134
 sock_put include/net/sock.h:1927 [inline]
 __mptcp_close_ssk+0x50f/0x780 net/mptcp/protocol.c:2351
 __mptcp_destroy_sock+0x332/0x760 net/mptcp/protocol.c:2828
 mptcp_worker+0x5d2/0xc90 net/mptcp/protocol.c:2586
 process_one_work+0x9cc/0x1650 kernel/workqueue.c:2289
 worker_thread+0x623/0x1070 kernel/workqueue.c:2436
 kthread+0x2e9/0x3a0 kernel/kthread.c:376
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
 </TASK>

The root cause of the problem is that an mptcp-level (re)transmit can
race with mptcp_close() and the packet scheduler checks the subflow
state before acquiring the socket lock: we can try to (re)transmit on
an already closed ssk.

Fix the issue checking again the subflow socket status under the
subflow socket lock protection. Additionally add the missing check
for the fallback-to-tcp case.

Fixes: d5f4919 ("mptcp: allow picking different xmit subflows")
Reported-by: Dipanjan Das <[email protected]>
Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
[mheyne@: Fixed a couple of conflicts due to some missing commit:
 - Commit d9ca1de ("mptcp: move page frag allocation in
   mptcp_sendmsg()") reworked a couple of functions. Before this commit
   there used to be a check for EAGAIN after mptcp_sendmsg_frag. This is
   equivalent to the "continue" from this backport and therefore is
   dropped.
 Other commits that affected the context:
 - 5cf92bb ("mptcp: re-enable sndbuf autotune")
 - b713d00 ("mptcp: really share subflow snd_wnd")
 - f70cad1 ("mptcp: stop relying on tcp_tx_skb_cache")]
Signed-off-by: Maximilian Heyne <[email protected]>
simonliebold pushed a commit that referenced this pull request Feb 9, 2026
A "small" CIFS buffer is not big enough in general to hold a
setacl request for SMB2, and we end up overflowing the buffer in
send_set_info(). For instance:

 # mount.cifs //127.0.0.1/test /mnt/test -o username=test,password=test,nounix,cifsacl
 # touch /mnt/test/acltest
 # getcifsacl /mnt/test/acltest
 REVISION:0x1
 CONTROL:0x9004
 OWNER:S-1-5-21-2926364953-924364008-418108241-1000
 GROUP:S-1-22-2-1001
 ACL:S-1-5-21-2926364953-924364008-418108241-1000:ALLOWED/0x0/0x1e01ff
 ACL:S-1-22-2-1001:ALLOWED/0x0/R
 ACL:S-1-22-2-1001:ALLOWED/0x0/R
 ACL:S-1-5-21-2926364953-924364008-418108241-1000:ALLOWED/0x0/0x1e01ff
 ACL:S-1-1-0:ALLOWED/0x0/R
 # setcifsacl -a "ACL:S-1-22-2-1004:ALLOWED/0x0/R" /mnt/test/acltest

this setacl will cause the following KASAN splat:

[  330.777927] BUG: KASAN: slab-out-of-bounds in send_set_info+0x4dd/0xc20 [cifs]
[  330.779696] Write of size 696 at addr ffff88010d5e2860 by task setcifsacl/1012

[  330.781882] CPU: 1 PID: 1012 Comm: setcifsacl Not tainted 4.18.0-rc2+ #2
[  330.783140] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  330.784395] Call Trace:
[  330.784789]  dump_stack+0xc2/0x16b
[  330.786777]  print_address_description+0x6a/0x270
[  330.787520]  kasan_report+0x258/0x380
[  330.788845]  memcpy+0x34/0x50
[  330.789369]  send_set_info+0x4dd/0xc20 [cifs]
[  330.799511]  SMB2_set_acl+0x76/0xa0 [cifs]
[  330.801395]  set_smb2_acl+0x7ac/0xf30 [cifs]
[  330.830888]  cifs_xattr_set+0x963/0xe40 [cifs]
[  330.840367]  __vfs_setxattr+0x84/0xb0
[  330.842060]  __vfs_setxattr_noperm+0xe6/0x370
[  330.843848]  vfs_setxattr+0xc2/0xd0
[  330.845519]  setxattr+0x258/0x320
[  330.859211]  path_setxattr+0x15b/0x1b0
[  330.864392]  __x64_sys_setxattr+0xc0/0x160
[  330.866133]  do_syscall_64+0x14e/0x4b0
[  330.876631]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  330.878503] RIP: 0033:0x7ff2e507db0a
[  330.880151] Code: 48 8b 0d 89 93 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 bc 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 56 93 2c 00 f7 d8 64 89 01 48
[  330.885358] RSP: 002b:00007ffdc4903c18 EFLAGS: 00000246 ORIG_RAX: 00000000000000bc
[  330.887733] RAX: ffffffffffffffda RBX: 000055d1170de140 RCX: 00007ff2e507db0a
[  330.890067] RDX: 000055d1170de7d0 RSI: 000055d115b39184 RDI: 00007ffdc4904818
[  330.892410] RBP: 0000000000000001 R08: 0000000000000000 R09: 000055d1170de7e4
[  330.894785] R10: 00000000000002b8 R11: 0000000000000246 R12: 0000000000000007
[  330.897148] R13: 000055d1170de0c0 R14: 0000000000000008 R15: 000055d1170de550

[  330.901057] Allocated by task 1012:
[  330.902888]  kasan_kmalloc+0xa0/0xd0
[  330.904714]  kmem_cache_alloc+0xc8/0x1d0
[  330.906615]  mempool_alloc+0x11e/0x380
[  330.908496]  cifs_small_buf_get+0x35/0x60 [cifs]
[  330.910510]  smb2_plain_req_init+0x4a/0xd60 [cifs]
[  330.912551]  send_set_info+0x198/0xc20 [cifs]
[  330.914535]  SMB2_set_acl+0x76/0xa0 [cifs]
[  330.916465]  set_smb2_acl+0x7ac/0xf30 [cifs]
[  330.918453]  cifs_xattr_set+0x963/0xe40 [cifs]
[  330.920426]  __vfs_setxattr+0x84/0xb0
[  330.922284]  __vfs_setxattr_noperm+0xe6/0x370
[  330.924213]  vfs_setxattr+0xc2/0xd0
[  330.926008]  setxattr+0x258/0x320
[  330.927762]  path_setxattr+0x15b/0x1b0
[  330.929592]  __x64_sys_setxattr+0xc0/0x160
[  330.931459]  do_syscall_64+0x14e/0x4b0
[  330.933314]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  330.936843] Freed by task 0:
[  330.938588] (stack is not available)

[  330.941886] The buggy address belongs to the object at ffff88010d5e2800
 which belongs to the cache cifs_small_rq of size 448
[  330.946362] The buggy address is located 96 bytes inside of
 448-byte region [ffff88010d5e2800, ffff88010d5e29c0)
[  330.950722] The buggy address belongs to the page:
[  330.952789] page:ffffea0004357880 count:1 mapcount:0 mapping:ffff880108fdca80 index:0x0 compound_mapcount: 0
[  330.955665] flags: 0x17ffffc0008100(slab|head)
[  330.957760] raw: 0017ffffc0008100 dead000000000100 dead000000000200 ffff880108fdca80
[  330.960356] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[  330.963005] page dumped because: kasan: bad access detected

[  330.967039] Memory state around the buggy address:
[  330.969255]  ffff88010d5e2880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  330.971833]  ffff88010d5e2900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  330.974397] >ffff88010d5e2980: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[  330.976956]                                            ^
[  330.979226]  ffff88010d5e2a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  330.981755]  ffff88010d5e2a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  330.984225] ==================================================================

Fix this by allocating a regular CIFS buffer in
smb2_plain_req_init() if the request command is SMB2_SET_INFO.

Reported-by: Jianhong Yin <[email protected]>
Fixes: 366ed84 ("cifs: Use smb 2 - 3 and cifsacl mount options setacl function")
CC: Stable <[email protected]>
Signed-off-by: Stefano Brivio <[email protected]>
Reviewed-and-tested-by: Aurelien Aptel <[email protected]>
Signed-off-by: Steve French <[email protected]>
Signed-off-by: Boris Protopopov <[email protected]>
simonliebold pushed a commit that referenced this pull request Feb 9, 2026
[ Upstream commit 10876da ]

syzkaller reported a null-ptr-deref in sock_omalloc() while allocating
a CALIPSO option.  [0]

The NULL is of struct sock, which was fetched by sk_to_full_sk() in
calipso_req_setattr().

Since commit a1a5344 ("tcp: avoid two atomic ops for syncookies"),
reqsk->rsk_listener could be NULL when SYN Cookie is returned to its
client, as hinted by the leading SYN Cookie log.

Here are 3 options to fix the bug:

  1) Return 0 in calipso_req_setattr()
  2) Return an error in calipso_req_setattr()
  3) Alaways set rsk_listener

1) is no go as it bypasses LSM, but 2) effectively disables SYN Cookie
for CALIPSO.  3) is also no go as there have been many efforts to reduce
atomic ops and make TCP robust against DDoS.  See also commit 3b24d85
("tcp/dccp: do not touch listener sk_refcnt under synflood").

As of the blamed commit, SYN Cookie already did not need refcounting,
and no one has stumbled on the bug for 9 years, so no CALIPSO user will
care about SYN Cookie.

Let's return an error in calipso_req_setattr() and calipso_req_delattr()
in the SYN Cookie case.

This can be reproduced by [1] on Fedora and now connect() of nc times out.

[0]:
TCP: request_sock_TCPv6: Possible SYN flooding on port [::]:20002. Sending cookies.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
CPU: 3 UID: 0 PID: 12262 Comm: syz.1.2611 Not tainted 6.14.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:read_pnet include/net/net_namespace.h:406 [inline]
RIP: 0010:sock_net include/net/sock.h:655 [inline]
RIP: 0010:sock_kmalloc+0x35/0x170 net/core/sock.c:2806
Code: 89 d5 41 54 55 89 f5 53 48 89 fb e8 25 e3 c6 fd e8 f0 91 e3 00 48 8d 7b 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 26 01 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b
RSP: 0018:ffff88811af89038 EFLAGS: 00010216
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff888105266400
RDX: 0000000000000006 RSI: ffff88800c890000 RDI: 0000000000000030
RBP: 0000000000000050 R08: 0000000000000000 R09: ffff88810526640e
R10: ffffed1020a4cc81 R11: ffff88810526640f R12: 0000000000000000
R13: 0000000000000820 R14: ffff888105266400 R15: 0000000000000050
FS:  00007f0653a07640(0000) GS:ffff88811af80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f863ba096f4 CR3: 00000000163c0005 CR4: 0000000000770ef0
PKRU: 80000000
Call Trace:
 <IRQ>
 ipv6_renew_options+0x279/0x950 net/ipv6/exthdrs.c:1288
 calipso_req_setattr+0x181/0x340 net/ipv6/calipso.c:1204
 calipso_req_setattr+0x56/0x80 net/netlabel/netlabel_calipso.c:597
 netlbl_req_setattr+0x18a/0x440 net/netlabel/netlabel_kapi.c:1249
 selinux_netlbl_inet_conn_request+0x1fb/0x320 security/selinux/netlabel.c:342
 selinux_inet_conn_request+0x1eb/0x2c0 security/selinux/hooks.c:5551
 security_inet_conn_request+0x50/0xa0 security/security.c:4945
 tcp_v6_route_req+0x22c/0x550 net/ipv6/tcp_ipv6.c:825
 tcp_conn_request+0xec8/0x2b70 net/ipv4/tcp_input.c:7275
 tcp_v6_conn_request+0x1e3/0x440 net/ipv6/tcp_ipv6.c:1328
 tcp_rcv_state_process+0xafa/0x52b0 net/ipv4/tcp_input.c:6781
 tcp_v6_do_rcv+0x8a6/0x1a40 net/ipv6/tcp_ipv6.c:1667
 tcp_v6_rcv+0x505e/0x5b50 net/ipv6/tcp_ipv6.c:1904
 ip6_protocol_deliver_rcu+0x17c/0x1da0 net/ipv6/ip6_input.c:436
 ip6_input_finish+0x103/0x180 net/ipv6/ip6_input.c:480
 NF_HOOK include/linux/netfilter.h:314 [inline]
 NF_HOOK include/linux/netfilter.h:308 [inline]
 ip6_input+0x13c/0x6b0 net/ipv6/ip6_input.c:491
 dst_input include/net/dst.h:469 [inline]
 ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline]
 ip6_rcv_finish+0xb6/0x490 net/ipv6/ip6_input.c:69
 NF_HOOK include/linux/netfilter.h:314 [inline]
 NF_HOOK include/linux/netfilter.h:308 [inline]
 ipv6_rcv+0xf9/0x490 net/ipv6/ip6_input.c:309
 __netif_receive_skb_one_core+0x12e/0x1f0 net/core/dev.c:5896
 __netif_receive_skb+0x1d/0x170 net/core/dev.c:6009
 process_backlog+0x41e/0x13b0 net/core/dev.c:6357
 __napi_poll+0xbd/0x710 net/core/dev.c:7191
 napi_poll net/core/dev.c:7260 [inline]
 net_rx_action+0x9de/0xde0 net/core/dev.c:7382
 handle_softirqs+0x19a/0x770 kernel/softirq.c:561
 do_softirq.part.0+0x36/0x70 kernel/softirq.c:462
 </IRQ>
 <TASK>
 do_softirq arch/x86/include/asm/preempt.h:26 [inline]
 __local_bh_enable_ip+0xf1/0x110 kernel/softirq.c:389
 local_bh_enable include/linux/bottom_half.h:33 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:919 [inline]
 __dev_queue_xmit+0xc2a/0x3c40 net/core/dev.c:4679
 dev_queue_xmit include/linux/netdevice.h:3313 [inline]
 neigh_hh_output include/net/neighbour.h:523 [inline]
 neigh_output include/net/neighbour.h:537 [inline]
 ip6_finish_output2+0xd69/0x1f80 net/ipv6/ip6_output.c:141
 __ip6_finish_output net/ipv6/ip6_output.c:215 [inline]
 ip6_finish_output+0x5dc/0xd60 net/ipv6/ip6_output.c:226
 NF_HOOK_COND include/linux/netfilter.h:303 [inline]
 ip6_output+0x24b/0x8d0 net/ipv6/ip6_output.c:247
 dst_output include/net/dst.h:459 [inline]
 NF_HOOK include/linux/netfilter.h:314 [inline]
 NF_HOOK include/linux/netfilter.h:308 [inline]
 ip6_xmit+0xbbc/0x20d0 net/ipv6/ip6_output.c:366
 inet6_csk_xmit+0x39a/0x720 net/ipv6/inet6_connection_sock.c:135
 __tcp_transmit_skb+0x1a7b/0x3b40 net/ipv4/tcp_output.c:1471
 tcp_transmit_skb net/ipv4/tcp_output.c:1489 [inline]
 tcp_send_syn_data net/ipv4/tcp_output.c:4059 [inline]
 tcp_connect+0x1c0c/0x4510 net/ipv4/tcp_output.c:4148
 tcp_v6_connect+0x156c/0x2080 net/ipv6/tcp_ipv6.c:333
 __inet_stream_connect+0x3a7/0xed0 net/ipv4/af_inet.c:677
 tcp_sendmsg_fastopen+0x3e2/0x710 net/ipv4/tcp.c:1039
 tcp_sendmsg_locked+0x1e82/0x3570 net/ipv4/tcp.c:1091
 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1358
 inet6_sendmsg+0xb9/0x150 net/ipv6/af_inet6.c:659
 sock_sendmsg_nosec net/socket.c:718 [inline]
 __sock_sendmsg+0xf4/0x2a0 net/socket.c:733
 __sys_sendto+0x29a/0x390 net/socket.c:2187
 __do_sys_sendto net/socket.c:2194 [inline]
 __se_sys_sendto net/socket.c:2190 [inline]
 __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2190
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc3/0x1d0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f06553c47ed
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f0653a06fc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007f0655605fa0 RCX: 00007f06553c47ed
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000b
RBP: 00007f065545db38 R08: 0000200000000140 R09: 000000000000001c
R10: f7384d4ea84b01bd R11: 0000000000000246 R12: 0000000000000000
R13: 00007f0655605fac R14: 00007f0655606038 R15: 00007f06539e7000
 </TASK>
Modules linked in:

[1]:
dnf install -y selinux-policy-targeted policycoreutils netlabel_tools procps-ng nmap-ncat
mount -t selinuxfs none /sys/fs/selinux
load_policy
netlabelctl calipso add pass doi:1
netlabelctl map del default
netlabelctl map add default address:::1 protocol:calipso,1
sysctl net.ipv4.tcp_syncookies=2
nc -l ::1 80 &
nc ::1 80

Fixes: e1adea9 ("calipso: Allow request sockets to be relabelled by the lsm.")
Reported-by: syzkaller <[email protected]>
Reported-by: John Cheung <[email protected]>
Closes: https://lore.kernel.org/netdev/CAP=Rh=MvfhrGADy+-WJiftV2_WzMH4VEhEFmeT28qY+4yxNu4w@mail.gmail.com/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Paul Moore <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Mark Bundschuh <[email protected]>
Signed-off-by: Nathan Gao <[email protected]>
simonliebold pushed a commit that referenced this pull request Feb 9, 2026
Aishwarya reports that warnings are sometimes seen when running the
ftrace kselftests, e.g.

| WARNING: CPU: 5 PID: 2066 at arch/arm64/kernel/stacktrace.c:141 arch_stack_walk+0x4a0/0x4c0
| Modules linked in:
| CPU: 5 UID: 0 PID: 2066 Comm: ftracetest Not tainted 6.13.0-rc2 #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x4a0/0x4c0
| lr : arch_stack_walk+0x248/0x4c0
| sp : ffff800083643d20
| x29: ffff800083643dd0 x28: ffff00007b891400 x27: ffff00007b891928
| x26: 0000000000000001 x25: 00000000000000c0 x24: ffff800082f39d80
| x23: ffff80008003ee8c x22: ffff80008004baa8 x21: ffff8000800533e0
| x20: ffff800083643e10 x19: ffff80008003eec8 x18: 0000000000000000
| x17: 0000000000000000 x16: ffff800083640000 x15: 0000000000000000
| x14: 02a37a802bbb8a92 x13: 00000000000001a9 x12: 0000000000000001
| x11: ffff800082ffad60 x10: ffff800083643d20 x9 : ffff80008003eed0
| x8 : ffff80008004baa8 x7 : ffff800086f2be80 x6 : ffff0000057cf000
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800086f2b690
| x2 : ffff80008004baa8 x1 : ffff80008004baa8 x0 : ffff80008004baa8
| Call trace:
|  arch_stack_walk+0x4a0/0x4c0 (P)
|  arch_stack_walk+0x248/0x4c0 (L)
|  profile_pc+0x44/0x80
|  profile_tick+0x50/0x80 (F)
|  tick_nohz_handler+0xcc/0x160 (F)
|  __hrtimer_run_queues+0x2ac/0x340 (F)
|  hrtimer_interrupt+0xf4/0x268 (F)
|  arch_timer_handler_virt+0x34/0x60 (F)
|  handle_percpu_devid_irq+0x88/0x220 (F)
|  generic_handle_domain_irq+0x34/0x60 (F)
|  gic_handle_irq+0x54/0x140 (F)
|  call_on_irq_stack+0x24/0x58 (F)
|  do_interrupt_handler+0x88/0x98
|  el1_interrupt+0x34/0x68 (F)
|  el1h_64_irq_handler+0x18/0x28
|  el1h_64_irq+0x6c/0x70
|  queued_spin_lock_slowpath+0x78/0x460 (P)

The warning in question is:

  WARN_ON_ONCE(state->common.pc == orig_pc))

... in kunwind_recover_return_address(), which is triggered when
return_to_handler() is encountered in the trace, but
ftrace_graph_ret_addr() cannot find a corresponding original return
address on the fgraph return stack.

This happens because the stacktrace code encounters an exception
boundary where the LR was not live at the time of the exception, but the
LR happens to contain return_to_handler(); either because the task
recently returned there, or due to unfortunate usage of the LR at a
scratch register. In such cases attempts to recover the return address
via ftrace_graph_ret_addr() may fail, triggering the WARN_ON_ONCE()
above and aborting the unwind (hence the stacktrace terminating after
reporting the PC at the time of the exception).

Handling unreliable LR values in these cases is likely to require some
larger rework, so for the moment avoid this problem by restoring the old
behaviour of skipping the LR at exception boundaries, which the
stacktrace code did prior to commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

This commit is effectively a partial revert, keeping the structures and
logic to explicitly identify exception boundaries while still skipping
reporting of the LR. The logic to explicitly identify exception
boundaries is still useful for general robustness and as a building
block for future support for RELIABLE_STACKTRACE.

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Signed-off-by: Mark Rutland <[email protected]>
Reported-by: Aishwarya TCV <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
simonliebold pushed a commit that referenced this pull request Feb 9, 2026
The arm64 stacktrace code has a few error conditions where a
WARN_ON_ONCE() is triggered before the stacktrace is terminated and an
error is returned to the caller. The conditions shouldn't be triggered
when unwinding the current task, but it is possible to trigger these
when unwinding another task which is not blocked, as the stack of that
task is concurrently modified. Kent reports that these warnings can be
triggered while running filesystem tests on bcachefs, which calls the
stacktrace code directly.

To produce a meaningful stacktrace of another task, the task in question
should be blocked, but the stacktrace code is expected to be robust to
cases where it is not blocked. Note that this is purely about not
unuduly scaring the user and/or crashing the kernel; stacktraces in such
cases are meaningless and may leak kernel secrets from the stack of the
task being unwound.

Ideally we'd pin the task in a blocked state during the unwind, as we do
for /proc/${PID}/wchan since commit:

  42a20f8 ("sched: Add wrapper for get_wchan() to keep task blocked")

... but a bunch of places don't do that, notably /proc/${PID}/stack,
where we don't pin the task in a blocked state, but do restrict the
output to privileged users since commit:

  f8a00ce ("proc: restrict kernel stack dumps to root")

... and so it's possible to trigger these warnings accidentally, e.g. by
reading /proc/*/stack (as root):

| for n in $(seq 1 10); do
|     while true; do cat /proc/*/stack > /dev/null 2>&1; done &
| done
| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370
| Modules linked in:
| CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x1c8/0x370
| lr : arch_stack_walk+0x1b0/0x370
| sp : ffff800080773890
| x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038
| x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000
| x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0
| x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
| x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818
| x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0
| x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000
| Call trace:
|  arch_stack_walk+0x1c8/0x370 (P)
|  stack_trace_save_tsk+0x8c/0x108
|  proc_pid_stack+0xb0/0x134
|  proc_single_show+0x60/0x120
|  seq_read_iter+0x104/0x438
|  seq_read+0xf8/0x140
|  vfs_read+0xc4/0x31c
|  ksys_read+0x70/0x108
|  __arm64_sys_read+0x1c/0x28
|  invoke_syscall+0x48/0x104
|  el0_svc_common.constprop.0+0x40/0xe0
|  do_el0_svc+0x1c/0x28
|  el0_svc+0x30/0xcc
|  el0t_64_sync_handler+0x10c/0x138
|  el0t_64_sync+0x198/0x19c
| ---[ end trace 0000000000000000 ]---

Fix this by only warning when unwinding the current task. When unwinding
another task the error conditions will be handled by returning an error
without producing a warning.

The two warnings in kunwind_next_frame_record_meta() were added recently
as part of commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

The warning when recovering the fgraph return address has changed form
many times, but was originally introduced back in commit:

  9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Fixes: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Signed-off-by: Mark Rutland <[email protected]>
Reported-by: Kent Overstreet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 11, 2026
commit 7f7cfcb upstream.

In hci_cs_disconnect, we do hci_conn_del even if disconnection failed.

ISO, L2CAP and SCO connections refer to the hci_conn without
hci_conn_get, so disconn_cfm must be called so they can clean up their
conn, otherwise use-after-free occurs.

ISO:
==========================================================
iso_sock_connect:880: sk 00000000eabd6557
iso_connect_cis:356: 70:1a:b8:98:ff:a2 -> 28:3d:c2:4a:7e:da
...
iso_conn_add:140: hcon 000000001696f1fd conn 00000000b6251073
hci_dev_put:1487: hci0 orig refcnt 17
__iso_chan_add:214: conn 00000000b6251073
iso_sock_clear_timer:117: sock 00000000eabd6557 state 3
...
hci_rx_work:4085: hci0 Event packet
hci_event_packet:7601: hci0: event 0x0f
hci_cmd_status_evt:4346: hci0: opcode 0x0406
hci_cs_disconnect:2760: hci0: status 0x0c
hci_sent_cmd_data:3107: hci0 opcode 0x0406
hci_conn_del:1151: hci0 hcon 000000001696f1fd handle 2560
hci_conn_unlink:1102: hci0: hcon 000000001696f1fd
hci_conn_drop:1451: hcon 00000000d8521aaf orig refcnt 2
hci_chan_list_flush:2780: hcon 000000001696f1fd
hci_dev_put:1487: hci0 orig refcnt 21
hci_dev_put:1487: hci0 orig refcnt 20
hci_req_cmd_complete:3978: opcode 0x0406 status 0x0c
... <no iso_* activity on sk/conn> ...
iso_sock_sendmsg:1098: sock 00000000dea5e2e0, sk 00000000eabd6557
BUG: kernel NULL pointer dereference, address: 0000000000000668
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
RIP: 0010:iso_sock_sendmsg (net/bluetooth/iso.c:1112) bluetooth
==========================================================

L2CAP:
==================================================================
hci_cmd_status_evt:4359: hci0: opcode 0x0406
hci_cs_disconnect:2760: hci0: status 0x0c
hci_sent_cmd_data:3085: hci0 opcode 0x0406
hci_conn_del:1151: hci0 hcon ffff88800c999000 handle 3585
hci_conn_unlink:1102: hci0: hcon ffff88800c999000
hci_chan_list_flush:2780: hcon ffff88800c999000
hci_chan_del:2761: hci0 hcon ffff88800c999000 chan ffff888018ddd280
...
BUG: KASAN: slab-use-after-free in hci_send_acl+0x2d/0x540 [bluetooth]
Read of size 8 at addr ffff888018ddd298 by task bluetoothd/1175

CPU: 0 PID: 1175 Comm: bluetoothd Tainted: G            E      6.4.0-rc4+ #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x5b/0x90
 print_report+0xcf/0x670
 ? __virt_addr_valid+0xf8/0x180
 ? hci_send_acl+0x2d/0x540 [bluetooth]
 kasan_report+0xa8/0xe0
 ? hci_send_acl+0x2d/0x540 [bluetooth]
 hci_send_acl+0x2d/0x540 [bluetooth]
 ? __pfx___lock_acquire+0x10/0x10
 l2cap_chan_send+0x1fd/0x1300 [bluetooth]
 ? l2cap_sock_sendmsg+0xf2/0x170 [bluetooth]
 ? __pfx_l2cap_chan_send+0x10/0x10 [bluetooth]
 ? lock_release+0x1d5/0x3c0
 ? mark_held_locks+0x1a/0x90
 l2cap_sock_sendmsg+0x100/0x170 [bluetooth]
 sock_write_iter+0x275/0x280
 ? __pfx_sock_write_iter+0x10/0x10
 ? __pfx___lock_acquire+0x10/0x10
 do_iter_readv_writev+0x176/0x220
 ? __pfx_do_iter_readv_writev+0x10/0x10
 ? find_held_lock+0x83/0xa0
 ? selinux_file_permission+0x13e/0x210
 do_iter_write+0xda/0x340
 vfs_writev+0x1b4/0x400
 ? __pfx_vfs_writev+0x10/0x10
 ? __seccomp_filter+0x112/0x750
 ? populate_seccomp_data+0x182/0x220
 ? __fget_light+0xdf/0x100
 ? do_writev+0x19d/0x210
 do_writev+0x19d/0x210
 ? __pfx_do_writev+0x10/0x10
 ? mark_held_locks+0x1a/0x90
 do_syscall_64+0x60/0x90
 ? lockdep_hardirqs_on_prepare+0x149/0x210
 ? do_syscall_64+0x6c/0x90
 ? lockdep_hardirqs_on_prepare+0x149/0x210
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7ff45cb23e64
Code: 15 d1 1f 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 80 3d 9d a7 0d 00 00 74 13 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 89 54 24 1c 48 89
RSP: 002b:00007fff21ae09b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007ff45cb23e64
RDX: 0000000000000001 RSI: 00007fff21ae0aa0 RDI: 0000000000000017
RBP: 00007fff21ae0aa0 R08: 000000000095a8a0 R09: 0000607000053f40
R10: 0000000000000001 R11: 0000000000000202 R12: 00007fff21ae0ac0
R13: 00000fffe435c150 R14: 00007fff21ae0a80 R15: 000060f000000040
 </TASK>

Allocated by task 771:
 kasan_save_stack+0x33/0x60
 kasan_set_track+0x25/0x30
 __kasan_kmalloc+0xaa/0xb0
 hci_chan_create+0x67/0x1b0 [bluetooth]
 l2cap_conn_add.part.0+0x17/0x590 [bluetooth]
 l2cap_connect_cfm+0x266/0x6b0 [bluetooth]
 hci_le_remote_feat_complete_evt+0x167/0x310 [bluetooth]
 hci_event_packet+0x38d/0x800 [bluetooth]
 hci_rx_work+0x287/0xb20 [bluetooth]
 process_one_work+0x4f7/0x970
 worker_thread+0x8f/0x620
 kthread+0x17f/0x1c0
 ret_from_fork+0x2c/0x50

Freed by task 771:
 kasan_save_stack+0x33/0x60
 kasan_set_track+0x25/0x30
 kasan_save_free_info+0x2e/0x50
 ____kasan_slab_free+0x169/0x1c0
 slab_free_freelist_hook+0x9e/0x1c0
 __kmem_cache_free+0xc0/0x310
 hci_chan_list_flush+0x46/0x90 [bluetooth]
 hci_conn_cleanup+0x7d/0x330 [bluetooth]
 hci_cs_disconnect+0x35d/0x530 [bluetooth]
 hci_cmd_status_evt+0xef/0x2b0 [bluetooth]
 hci_event_packet+0x38d/0x800 [bluetooth]
 hci_rx_work+0x287/0xb20 [bluetooth]
 process_one_work+0x4f7/0x970
 worker_thread+0x8f/0x620
 kthread+0x17f/0x1c0
 ret_from_fork+0x2c/0x50
==================================================================

Fixes: b8d2905 ("Bluetooth: clean up connection in hci_cs_disconnect")
Signed-off-by: Pauli Virtanen <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
Signed-off-by: Bin Lan <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 11, 2026
During device unmapping (triggered by module unload or explicit unmap),
a refcount underflow occurs causing a use-after-free warning:

  [14747.574913] ------------[ cut here ]------------
  [14747.574916] refcount_t: underflow; use-after-free.
  [14747.574917] WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x55/0x90, CPU#9: kworker/9:1/378
  [14747.574924] Modules linked in: rnbd_client(-) rtrs_client rnbd_server rtrs_server rtrs_core ...
  [14747.574998] CPU: 9 UID: 0 PID: 378 Comm: kworker/9:1 Tainted: G           O     N  6.19.0-rc3lblk-fnext+ #42 PREEMPT(voluntary)
  [14747.575005] Workqueue: rnbd_clt_wq unmap_device_work [rnbd_client]
  [14747.575010] RIP: 0010:refcount_warn_saturate+0x55/0x90
  [14747.575037]  Call Trace:
  [14747.575038]   <TASK>
  [14747.575038]   rnbd_clt_unmap_device+0x170/0x1d0 [rnbd_client]
  [14747.575044]   process_one_work+0x211/0x600
  [14747.575052]   worker_thread+0x184/0x330
  [14747.575055]   ? __pfx_worker_thread+0x10/0x10
  [14747.575058]   kthread+0x10d/0x250
  [14747.575062]   ? __pfx_kthread+0x10/0x10
  [14747.575066]   ret_from_fork+0x319/0x390
  [14747.575069]   ? __pfx_kthread+0x10/0x10
  [14747.575072]   ret_from_fork_asm+0x1a/0x30
  [14747.575083]   </TASK>
  [14747.575096] ---[ end trace 0000000000000000 ]---

Befor this patch :-

The bug is a double kobject_put() on dev->kobj during device cleanup.

Kobject Lifecycle:
  kobject_init_and_add()  sets kobj.kref = 1  (initialization)
  kobject_put()           sets kobj.kref = 0  (should be called once)

* Before this patch:

rnbd_clt_unmap_device()
  rnbd_destroy_sysfs()
    kobject_del(&dev->kobj)                   [remove from sysfs]
    kobject_put(&dev->kobj)                   PUT #1 (WRONG!)
      kref: 1 to 0
      rnbd_dev_release()
        kfree(dev)                            [DEVICE FREED!]

  rnbd_destroy_gen_disk()                     [use-after-free!]

  rnbd_clt_put_dev()
    refcount_dec_and_test(&dev->refcount)
    kobject_put(&dev->kobj)                   PUT #2 (UNDERFLOW!)
      kref: 0 to -1                           [WARNING!]

The first kobject_put() in rnbd_destroy_sysfs() prematurely frees the
device via rnbd_dev_release(), then the second kobject_put() in
rnbd_clt_put_dev() causes refcount underflow.

* After this patch :-

Remove kobject_put() from rnbd_destroy_sysfs(). This function should
only remove sysfs visibility (kobject_del), not manage object lifetime.

Call Graph (FIXED):

rnbd_clt_unmap_device()
  rnbd_destroy_sysfs()
    kobject_del(&dev->kobj)                   [remove from sysfs only]
                                              [kref unchanged: 1]

  rnbd_destroy_gen_disk()                     [device still valid]

  rnbd_clt_put_dev()
    refcount_dec_and_test(&dev->refcount)
    kobject_put(&dev->kobj)                   ONLY PUT (CORRECT!)
      kref: 1 to 0                            [BALANCED]
      rnbd_dev_release()
        kfree(dev)                            [CLEAN DESTRUCTION]

This follows the kernel pattern where sysfs removal (kobject_del) is
separate from object destruction (kobject_put).

Fixes: 581cf83 ("block: rnbd: add .release to rnbd_dev_ktype")
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Acked-by: Jack Wang <[email protected]>
Reviewed-by: Jack Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 11, 2026
Leon Hwang says:

====================
Similar to commit 98770bd ("bpf,x86: add fsession support for x86_64"),
add fsession support on arm64.

Patch #1 adds bpf_jit_supports_fsession() to prevent fsession loading
on architectures that do not implement fsession support.

Patch #2 implements fsession support in the arm64 BPF JIT trampoline.

Patch #3 enables the relevant selftests on arm64, including get_func_ip,
and get_func_args.

All enabled tests pass on arm64:

 cd tools/testing/selftests/bpf
 ./test_progs -t fsession
 #136/1   fsession_test/fsession_test:OK
 #136/2   fsession_test/fsession_reattach:OK
 #136/3   fsession_test/fsession_cookie:OK
 #136     fsession_test:OK
 Summary: 1/3 PASSED, 0 SKIPPED, 0 FAILED

 ./test_progs -t get_func
 #138     get_func_args_test:OK
 #139     get_func_ip_test:OK
 Summary: 2/0 PASSED, 0 SKIPPED, 0 FAILED

Changes:
v4 -> v5:
* Address comment from Alexei:
  * Rename helper bpf_link_prog_session_cookie() to
    bpf_prog_calls_session_cookie().
* v4: https://lore.kernel.org/bpf/[email protected]/

v3 -> v4:
* Add a log when !bpf_jit_supports_fsession() in patch #1 (per AI).
* v3: https://lore.kernel.org/bpf/[email protected]/

v2 -> v3:
* Fix typo in subject and patch message of patch #1 (per AI and Chris).
* Collect Acked-by, and Tested-by from Puranjay, thanks.
* v2: https://lore.kernel.org/bpf/[email protected]/

v1 -> v2:
* Add bpf_jit_supports_fsession().
* v1: https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 13, 2026
Patch series "kallsyms: Prevent invalid access when showing module
buildid", v3.

We have seen nested crashes in __sprint_symbol(), see below.  They seem to
be caused by an invalid pointer to "buildid".  This patchset cleans up
kallsyms code related to module buildid and fixes this invalid access when
printing backtraces.

I made an audit of __sprint_symbol() and found several situations
when the buildid might be wrong:

  + bpf_address_lookup() does not set @modbuildid

  + ftrace_mod_address_lookup() does not set @modbuildid

  + __sprint_symbol() does not take rcu_read_lock and
    the related struct module might get removed before
    mod->build_id is printed.

This patchset solves these problems:

  + 1st, 2nd patches are preparatory
  + 3rd, 4th, 6th patches fix the above problems
  + 5th patch cleans up a suspicious initialization code.

This is the backtrace, we have seen. But it is not really important.
The problems fixed by the patchset are obvious:

  crash64> bt [62/2029]
  PID: 136151 TASK: ffff9f6c981d4000 CPU: 367 COMMAND: "btrfs"
  #0 [ffffbdb687635c28] machine_kexec at ffffffffb4c845b3
  #1 [ffffbdb687635c80] __crash_kexec at ffffffffb4d86a6a
  #2 [ffffbdb687635d08] hex_string at ffffffffb51b3b61
  #3 [ffffbdb687635d40] crash_kexec at ffffffffb4d87964
  #4 [ffffbdb687635d50] oops_end at ffffffffb4c41fc8
  #5 [ffffbdb687635d70] do_trap at ffffffffb4c3e49a
  #6 [ffffbdb687635db8] do_error_trap at ffffffffb4c3e6a4
  #7 [ffffbdb687635df8] exc_stack_segment at ffffffffb5666b33
  #8 [ffffbdb687635e20] asm_exc_stack_segment at ffffffffb5800cf9
  ...


This patch (of 7)

The function kallsyms_lookup_buildid() initializes the given @namebuf by
clearing the first and the last byte.  It is not clear why.

The 1st byte makes sense because some callers ignore the return code and
expect that the buffer contains a valid string, for example:

  - function_stat_show()
    - kallsyms_lookup()
      - kallsyms_lookup_buildid()

The initialization of the last byte does not make much sense because it
can later be overwritten.  Fortunately, it seems that all called functions
behave correctly:

  -  kallsyms_expand_symbol() explicitly adds the trailing '\0'
     at the end of the function.

  - All *__address_lookup() functions either use the safe strscpy()
    or they do not touch the buffer at all.

Document the reason for clearing the first byte.  And remove the useless
initialization of the last byte.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Reviewed-by: Aaron Tomlin <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkman <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Luis Chamberalin <[email protected]>
Cc: Marc Rutland <[email protected]>
Cc: "Masami Hiramatsu (Google)" <[email protected]>
Cc: Petr Pavlu <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Daniel Gomez <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 15, 2026
A potential circular locking dependency (ABBA deadlock) exists between
`ec_dev->lock` and the clock framework's `prepare_lock`.

The first order (A -> B) occurs when scp_ipi_send() is called while
`ec_dev->lock` is held (e.g., within cros_ec_cmd_xfer()):
1. cros_ec_cmd_xfer() acquires `ec_dev->lock` and calls scp_ipi_send().
2. scp_ipi_send() calls clk_prepare_enable(), which acquires
   `prepare_lock`.
See #0 in the following example calling trace.
(Lock Order: `ec_dev->lock` -> `prepare_lock`)

The reverse order (B -> A) is more complex and has been observed
(learned) by lockdep.  It involves the clock prepare operation
triggering power domain changes, which then propagates through sysfs
and power supply uevents, eventually calling back into the ChromeOS EC
driver and attempting to acquire `ec_dev->lock`:
1. Something calls clk_prepare(), which acquires `prepare_lock`.  It
   then triggers genpd operations like genpd_runtime_resume(), which
   takes `&genpd->mlock`.
2. Power domain changes can trigger regulator changes; regulator
   changes can then trigger device link changes; device link changes
   can then trigger sysfs changes.  Eventually, power_supply_uevent()
   is called.
3. This leads to calls like cros_usbpd_charger_get_prop(), which calls
   cros_ec_cmd_xfer_status(), which then attempts to acquire
   `ec_dev->lock`.
See #1 ~ #6 in the following example calling trace.
(Lock Order: `prepare_lock` -> `&genpd->mlock` -> ... -> `&ec_dev->lock`)

Move the clk_prepare()/clk_unprepare() operations for `scp->clk` to the
remoteproc prepare()/unprepare() callbacks.  This ensures `prepare_lock`
is only acquired in prepare()/unprepare() callbacks.  Since
`ec_dev->lock` is not involved in the callbacks, the dependency loop is
broken.

This means the clock is always "prepared" when the SCP is running.  The
prolonged "prepared time" for the clock should be acceptable as SCP is
designed to be a very power efficient processor.  The power consumption
impact can be negligible.

A simplified calling trace reported by lockdep:
> -> #6 (&ec_dev->lock)
>        cros_ec_cmd_xfer
>        cros_ec_cmd_xfer_status
>        cros_usbpd_charger_get_port_status
>        cros_usbpd_charger_get_prop
>        power_supply_get_property
>        power_supply_show_property
>        power_supply_uevent
>        dev_uevent
>        uevent_show
>        dev_attr_show
>        sysfs_kf_seq_show
>        kernfs_seq_show
> -> #5 (kn->active#2)
>        kernfs_drain
>        __kernfs_remove
>        kernfs_remove_by_name_ns
>        sysfs_remove_file_ns
>        device_del
>        __device_link_del
>        device_links_driver_bound
> -> #4 (device_links_lock)
>        device_link_remove
>        _regulator_put
>        regulator_put
> -> #3 (regulator_list_mutex)
>        regulator_lock_dependent
>        regulator_disable
>        scpsys_power_off
>        _genpd_power_off
>        genpd_power_off
> -> #2 (&genpd->mlock/1)
>        genpd_add_subdomain
>        pm_genpd_add_subdomain
>        scpsys_add_subdomain
>        scpsys_probe
> -> #1 (&genpd->mlock)
>        genpd_runtime_resume
>        __rpm_callback
>        rpm_callback
>        rpm_resume
>        __pm_runtime_resume
>        clk_core_prepare
>        clk_prepare
> -> #0 (prepare_lock)
>        clk_prepare
>        scp_ipi_send
>        scp_send_ipi
>        mtk_rpmsg_send
>        rpmsg_send
>        cros_ec_pkt_xfer_rpmsg

Signed-off-by: Tzung-Bi Shih <[email protected]>
Reviewed-by: Chen-Yu Tsai <[email protected]>
Tested-by: Chen-Yu Tsai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mathieu Poirier <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 15, 2026
Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support
for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated
by userspace), which KVM completely mishandles. When x2APIC support was
first added, KVM incorrectly advertised and "enabled" Suppress EOI
Broadcast, without fully supporting the I/O APIC side of the equation,
i.e. without adding directed EOI to KVM's in-kernel I/O APIC.

That flaw was carried over to split IRQCHIP support, i.e. KVM advertised
support for Suppress EOI Broadcasts irrespective of whether or not the
userspace I/O APIC implementation supported directed EOIs. Even worse,
KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without
support for directed EOI came to rely on the "spurious" broadcasts.

KVM "fixed" the in-kernel I/O APIC implementation by completely disabling
support for Suppress EOI Broadcasts in commit 0bcc3fb ("KVM: lapic:
stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but
didn't do anything to remedy userspace I/O APIC implementations.

KVM's bogus handling of Suppress EOI Broadcast is problematic when the
guest relies on interrupts being masked in the I/O APIC until well after
the initial local APIC EOI. E.g. Windows with Credential Guard enabled
handles interrupts in the following order:
  1. Interrupt for L2 arrives.
  2. L1 APIC EOIs the interrupt.
  3. L1 resumes L2 and injects the interrupt.
  4. L2 EOIs after servicing.
  5. L1 performs the I/O APIC EOI.

Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt
storm, e.g. if the IRQ line is still asserted and userspace reacts to the
EOI by re-injecting the IRQ, because the guest doesn't de-assert the line
until step #4, and doesn't expect the interrupt to be re-enabled until
step #5.

Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way
of knowing if the userspace I/O APIC supports directed EOIs, i.e.
suppressing EOI broadcasts would result in interrupts being stuck masked
in the userspace I/O APIC due to step #5 being ignored by userspace. And
fully disabling support for Suppress EOI Broadcast is also undesirable, as
picking up the fix would require a guest reboot, *and* more importantly
would change the virtual CPU model exposed to the guest without any buy-in
from userspace.

Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and
KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to
explicitly enable or disable support for Suppress EOI Broadcasts. This
gives userspace control over the virtual CPU model exposed to the guest,
as KVM should never have enabled support for Suppress EOI Broadcast without
userspace opt-in. Not setting either flag will result in legacy quirky
behavior for backward compatibility.

Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel
I/O APIC, as KVM's history/support is just as tragic.  E.g. it's not clear
that commit c806a6a ("KVM: x86: call irq notifiers with directed EOI")
was entirely correct, i.e. it may have simply papered over the lack of
Directed EOI emulation in the I/O APIC.

Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's
APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI
is to support Directed EOI (KVM's name) irrespective of guest CPU vendor.

Fixes: 7543a63 ("KVM: x86: Add KVM exit for IOAPIC EOIs")
Closes: https://lore.kernel.org/kvm/[email protected]
Cc: [email protected]
Suggested-by: David Woodhouse <[email protected]>
Signed-off-by: Khushit Shah <[email protected]>
Link: https://patch.msgid.link/[email protected]
[sean: clean up minor formatting goofs and fix a comment typo]
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
surajjs95 pushed a commit that referenced this pull request Feb 17, 2026
[ Upstream commit 711741f ]

Fix cifs_signal_cifsd_for_reconnect() to take the correct lock order
and prevent the following deadlock from happening

======================================================
WARNING: possible circular locking dependency detected
6.16.0-rc3-build2+ #1301 Tainted: G S      W
------------------------------------------------------
cifsd/6055 is trying to acquire lock:
ffff88810ad56038 (&tcp_ses->srv_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x134/0x200

but task is already holding lock:
ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&ret_buf->chan_lock){+.+.}-{3:3}:
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_setup_session+0x81/0x4b0
       cifs_get_smb_ses+0x771/0x900
       cifs_mount_get_session+0x7e/0x170
       cifs_mount+0x92/0x2d0
       cifs_smb3_do_mount+0x161/0x460
       smb3_get_tree+0x55/0x90
       vfs_get_tree+0x46/0x180
       do_new_mount+0x1b0/0x2e0
       path_mount+0x6ee/0x740
       do_mount+0x98/0xe0
       __do_sys_mount+0x148/0x180
       do_syscall_64+0xa4/0x260
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (&ret_buf->ses_lock){+.+.}-{3:3}:
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_match_super+0x101/0x320
       sget+0xab/0x270
       cifs_smb3_do_mount+0x1e0/0x460
       smb3_get_tree+0x55/0x90
       vfs_get_tree+0x46/0x180
       do_new_mount+0x1b0/0x2e0
       path_mount+0x6ee/0x740
       do_mount+0x98/0xe0
       __do_sys_mount+0x148/0x180
       do_syscall_64+0xa4/0x260
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #0 (&tcp_ses->srv_lock){+.+.}-{3:3}:
       check_noncircular+0x95/0xc0
       check_prev_add+0x115/0x2f0
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_signal_cifsd_for_reconnect+0x134/0x200
       __cifs_reconnect+0x8f/0x500
       cifs_handle_standard+0x112/0x280
       cifs_demultiplex_thread+0x64d/0xbc0
       kthread+0x2f7/0x310
       ret_from_fork+0x2a/0x230
       ret_from_fork_asm+0x1a/0x30

other info that might help us debug this:

Chain exists of:
  &tcp_ses->srv_lock --> &ret_buf->ses_lock --> &ret_buf->chan_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ret_buf->chan_lock);
                               lock(&ret_buf->ses_lock);
                               lock(&ret_buf->chan_lock);
  lock(&tcp_ses->srv_lock);

 *** DEADLOCK ***

3 locks held by cifsd/6055:
 #0: ffffffff857de398 (&cifs_tcp_ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x7b/0x200
 #1: ffff888119c64060 (&ret_buf->ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x9c/0x200
 #2: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200

Cc: [email protected]
Reported-by: David Howells <[email protected]>
Fixes: d7d7a66 ("cifs: avoid use of global locks for high contention data")
Reviewed-by: David Howells <[email protected]>
Tested-by: David Howells <[email protected]>
Signed-off-by: Paulo Alcantara (Red Hat) <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Steve French <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[v6.1: replaced SERVER_IS_CHAN with CIFS_SERVER_IS_CHAN; resolved
contextual conflicts in cifs_signal_cifsd_for_reconnect()]
Signed-off-by: Shaoying Xu <[email protected]>
surajjs95 pushed a commit that referenced this pull request Feb 17, 2026
Aishwarya reports that warnings are sometimes seen when running the
ftrace kselftests, e.g.

| WARNING: CPU: 5 PID: 2066 at arch/arm64/kernel/stacktrace.c:141 arch_stack_walk+0x4a0/0x4c0
| Modules linked in:
| CPU: 5 UID: 0 PID: 2066 Comm: ftracetest Not tainted 6.13.0-rc2 #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x4a0/0x4c0
| lr : arch_stack_walk+0x248/0x4c0
| sp : ffff800083643d20
| x29: ffff800083643dd0 x28: ffff00007b891400 x27: ffff00007b891928
| x26: 0000000000000001 x25: 00000000000000c0 x24: ffff800082f39d80
| x23: ffff80008003ee8c x22: ffff80008004baa8 x21: ffff8000800533e0
| x20: ffff800083643e10 x19: ffff80008003eec8 x18: 0000000000000000
| x17: 0000000000000000 x16: ffff800083640000 x15: 0000000000000000
| x14: 02a37a802bbb8a92 x13: 00000000000001a9 x12: 0000000000000001
| x11: ffff800082ffad60 x10: ffff800083643d20 x9 : ffff80008003eed0
| x8 : ffff80008004baa8 x7 : ffff800086f2be80 x6 : ffff0000057cf000
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800086f2b690
| x2 : ffff80008004baa8 x1 : ffff80008004baa8 x0 : ffff80008004baa8
| Call trace:
|  arch_stack_walk+0x4a0/0x4c0 (P)
|  arch_stack_walk+0x248/0x4c0 (L)
|  profile_pc+0x44/0x80
|  profile_tick+0x50/0x80 (F)
|  tick_nohz_handler+0xcc/0x160 (F)
|  __hrtimer_run_queues+0x2ac/0x340 (F)
|  hrtimer_interrupt+0xf4/0x268 (F)
|  arch_timer_handler_virt+0x34/0x60 (F)
|  handle_percpu_devid_irq+0x88/0x220 (F)
|  generic_handle_domain_irq+0x34/0x60 (F)
|  gic_handle_irq+0x54/0x140 (F)
|  call_on_irq_stack+0x24/0x58 (F)
|  do_interrupt_handler+0x88/0x98
|  el1_interrupt+0x34/0x68 (F)
|  el1h_64_irq_handler+0x18/0x28
|  el1h_64_irq+0x6c/0x70
|  queued_spin_lock_slowpath+0x78/0x460 (P)

The warning in question is:

  WARN_ON_ONCE(state->common.pc == orig_pc))

... in kunwind_recover_return_address(), which is triggered when
return_to_handler() is encountered in the trace, but
ftrace_graph_ret_addr() cannot find a corresponding original return
address on the fgraph return stack.

This happens because the stacktrace code encounters an exception
boundary where the LR was not live at the time of the exception, but the
LR happens to contain return_to_handler(); either because the task
recently returned there, or due to unfortunate usage of the LR at a
scratch register. In such cases attempts to recover the return address
via ftrace_graph_ret_addr() may fail, triggering the WARN_ON_ONCE()
above and aborting the unwind (hence the stacktrace terminating after
reporting the PC at the time of the exception).

Handling unreliable LR values in these cases is likely to require some
larger rework, so for the moment avoid this problem by restoring the old
behaviour of skipping the LR at exception boundaries, which the
stacktrace code did prior to commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

This commit is effectively a partial revert, keeping the structures and
logic to explicitly identify exception boundaries while still skipping
reporting of the LR. The logic to explicitly identify exception
boundaries is still useful for general robustness and as a building
block for future support for RELIABLE_STACKTRACE.

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Signed-off-by: Mark Rutland <[email protected]>
Reported-by: Aishwarya TCV <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
surajjs95 pushed a commit that referenced this pull request Feb 17, 2026
The arm64 stacktrace code has a few error conditions where a
WARN_ON_ONCE() is triggered before the stacktrace is terminated and an
error is returned to the caller. The conditions shouldn't be triggered
when unwinding the current task, but it is possible to trigger these
when unwinding another task which is not blocked, as the stack of that
task is concurrently modified. Kent reports that these warnings can be
triggered while running filesystem tests on bcachefs, which calls the
stacktrace code directly.

To produce a meaningful stacktrace of another task, the task in question
should be blocked, but the stacktrace code is expected to be robust to
cases where it is not blocked. Note that this is purely about not
unuduly scaring the user and/or crashing the kernel; stacktraces in such
cases are meaningless and may leak kernel secrets from the stack of the
task being unwound.

Ideally we'd pin the task in a blocked state during the unwind, as we do
for /proc/${PID}/wchan since commit:

  42a20f8 ("sched: Add wrapper for get_wchan() to keep task blocked")

... but a bunch of places don't do that, notably /proc/${PID}/stack,
where we don't pin the task in a blocked state, but do restrict the
output to privileged users since commit:

  f8a00ce ("proc: restrict kernel stack dumps to root")

... and so it's possible to trigger these warnings accidentally, e.g. by
reading /proc/*/stack (as root):

| for n in $(seq 1 10); do
|     while true; do cat /proc/*/stack > /dev/null 2>&1; done &
| done
| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370
| Modules linked in:
| CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x1c8/0x370
| lr : arch_stack_walk+0x1b0/0x370
| sp : ffff800080773890
| x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038
| x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000
| x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0
| x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
| x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818
| x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0
| x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000
| Call trace:
|  arch_stack_walk+0x1c8/0x370 (P)
|  stack_trace_save_tsk+0x8c/0x108
|  proc_pid_stack+0xb0/0x134
|  proc_single_show+0x60/0x120
|  seq_read_iter+0x104/0x438
|  seq_read+0xf8/0x140
|  vfs_read+0xc4/0x31c
|  ksys_read+0x70/0x108
|  __arm64_sys_read+0x1c/0x28
|  invoke_syscall+0x48/0x104
|  el0_svc_common.constprop.0+0x40/0xe0
|  do_el0_svc+0x1c/0x28
|  el0_svc+0x30/0xcc
|  el0t_64_sync_handler+0x10c/0x138
|  el0t_64_sync+0x198/0x19c
| ---[ end trace 0000000000000000 ]---

Fix this by only warning when unwinding the current task. When unwinding
another task the error conditions will be handled by returning an error
without producing a warning.

The two warnings in kunwind_next_frame_record_meta() were added recently
as part of commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

The warning when recovering the fgraph return address has changed form
many times, but was originally introduced back in commit:

  9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Fixes: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Signed-off-by: Mark Rutland <[email protected]>
Reported-by: Kent Overstreet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
paniakin-aws pushed a commit that referenced this pull request Feb 18, 2026
Let's actually check the return value of devm_apple_tunable_parse
instead of trying to check IS_ERR on a pointer to the return value which
is always going to be valid. This prevent a oops when the tunables are
invalid or when they don't exist:

[   57.664567] Unable to handle kernel paging request at virtual address fffffffffffffffe
[   57.664584] Mem abort info:
[   57.664589]   ESR = 0x0000000096000007
[   57.664595]   EC = 0x25: DABT (current EL), IL = 32 bits
[   57.664602]   SET = 0, FnV = 0
[   57.664607]   EA = 0, S1PTW = 0
[   57.664611]   FSC = 0x07: level 3 translation fault
[   57.664617] Data abort info:
[   57.664621]   ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000
[   57.664626]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[   57.664631]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   57.664640] swapper pgtable: 16k pages, 47-bit VAs, pgdp=0000000b4391c000
[   57.664647] [fffffffffffffffe] pgd=0000000000000000, p4d=0000000000000000, pud=0000000b44188403, pmd=0000000b4418c403, pte=0000000000000000
[   57.664670] Internal error: Oops: 0000000096000007 [#1]  SMP
[   57.665047] CPU: 1 UID: 0 PID: 23 Comm: kworker/1:0 Tainted: G S                  6.18.2+ #2 PREEMPTLAZY
[   57.665061] Tainted: [S]=CPU_OUT_OF_SPEC
[   57.665066] Hardware name: Apple Mac mini (M1, 2020) (DT)
[   57.665072] Workqueue: events cd321x_update_work [tps6598x]
[   57.665100] pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[   57.665111] pc : apple_tunable_apply+0x8/0x80 [apple_tunable]
[   57.665121] lr : atcphy_mux_set+0x3e0/0x1138 [phy_apple_atc]
[   57.665133] sp : ffffc000802a7c00
[   57.665138] x29: ffffc000802a7c00 x28: 0000000000000003 x27: ffff800016c84080
[   57.665151] x26: 0000000000000002 x25: ffff800016c84090 x24: ffff800016c8408f
[   57.665163] x23: 0000000000020004 x22: 0000000000000001 x21: 0000000000000006
[   57.665175] x20: ffff80000d6da9b0 x19: ffff80000d6da880 x18: 0000000000000002
[   57.665188] x17: 0000000000000000 x16: ffffe22de59e0e38 x15: 0000000000000002
[   57.665199] x14: ffffe22de76ecff8 x13: 0000000000000001 x12: ffff9dd5f90bc000
[   57.665211] x11: 00000000000000c0 x10: 048abc15ceba0919 x9 : ffffe22dbc5fde10
[   57.665223] x8 : ffff80000175e0d8 x7 : 0000000000000004 x6 : 0000000000000000
[   57.665234] x5 : 0000000000000001 x4 : 0000000d6d132db7 x3 : 00000000000155db
[   57.665246] x2 : 0000000000000000 x1 : fffffffffffffffe x0 : ffffc00082b80000
[   57.665258] Call trace:
[   57.665265]  apple_tunable_apply+0x8/0x80 [apple_tunable] (P)
[   57.665276]  typec_mux_set+0x74/0xe0 [typec]
[   57.665315]  cd321x_update_work+0x440/0x8c0 [tps6598x]
[   57.665332]  process_one_work+0x178/0x3d0
[   57.665346]  worker_thread+0x260/0x390
[   57.665354]  kthread+0x150/0x250
[   57.665369]  ret_from_fork+0x10/0x20
[   57.665386] Code: e69a0ae8 ffffe22d aa1e03e9 d503201f (f9400022)
[   57.665394] ---[ end trace 0000000000000000 ]---

Reported-by: Thomas Glanzmann <[email protected]>
Fixes: 8e98ca1 ("phy: apple: Add Apple Type-C PHY")
Signed-off-by: Sven Peter <[email protected]>
Reviewed-by: Neil Armstrong <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Vinod Koul <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Comments