Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6
84654c7f | 03-Mar-2025 | Mateusz Guzik <[email protected]>

wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
In the vast majority of cases the condition determining whether the thread can proceed is true after the first wake-up.
However, even in that case the thread ends up calling into prepare_to_wait_event() again, suffering a spurious irq + lock trip.
Then it calls into finish_wait() to unlink itself.
Note that in the case of a pending signal the work done by prepare_to_wait_event() gets ignored even without the change.
Pre-check the condition after waking up instead.
Stats gathered during a kernel build: bpftrace -e 'kprobe:prepare_to_wait_event,kprobe:finish_wait { @[probe] = count(); }'
@[kprobe:finish_wait]: 392483
@[kprobe:prepare_to_wait_event]: 778690
That is, calls to prepare_to_wait_event() almost double the calls to finish_wait(). This evens out with the patch.
Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
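For illustration, here is an open-coded wait loop in the shape this change gives ___wait_event(); it is a sketch, not the actual macro, and the waitqueue/condition names are placeholders.

#include <linux/sched.h>
#include <linux/wait.h>

/* Sketch only: the real logic lives in the ___wait_event() macro. */
static void wait_for_cond(struct wait_queue_head *my_wq, bool *my_cond)
{
        DEFINE_WAIT(wait);

        for (;;) {
                prepare_to_wait_event(my_wq, &wait, TASK_UNINTERRUPTIBLE);
                if (READ_ONCE(*my_cond))
                        break;
                schedule();
                /*
                 * Pre-check the condition after the wake-up: in the common
                 * case it already holds, so skip the extra irq-disable +
                 * waitqueue-lock trip another prepare_to_wait_event() call
                 * would cost.
                 */
                if (READ_ONCE(*my_cond))
                        break;
        }
        finish_wait(my_wq, &wait);
}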
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2
36d03cb3 | 08-Feb-2025 | Muchun Song <[email protected]>

block: introduce init_wait_func()
There is already a macro DEFINE_WAIT_FUNC() to declare a wait_queue_entry with a specified waking function, but there is no counterpart for initializing an existing wait_queue_entry with a specified waking function. So introduce init_wait_func() for this; it can also be used in iocost and rq-qos. Use default_wake_function() in rq_qos_wait() to wake up waiters, which allows removing the ->task field from rq_qos_wait_data.
Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Muchun Song <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
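A minimal sketch of the new helper in use, mirroring the rq_qos_wait() conversion described above; the waitqueue and flag below are placeholders.

#include <linux/sched.h>
#include <linux/wait.h>

static void wait_with_default_wake(struct wait_queue_head *my_wq, bool *done)
{
        struct wait_queue_entry wq_ent;

        /* On-stack counterpart of DEFINE_WAIT_FUNC(wq_ent, default_wake_function). */
        init_wait_func(&wq_ent, default_wake_function);

        prepare_to_wait_exclusive(my_wq, &wq_ent, TASK_UNINTERRUPTIBLE);
        if (!READ_ONCE(*done))
                io_schedule();
        finish_wait(my_wq, &wq_ent);
}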
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6
211647e5 | 31-Oct-2024 | Nam Cao <[email protected]>

wait: Switch to use hrtimer_setup_sleeper_on_stack()
hrtimer_setup_sleeper_on_stack() replaces hrtimer_init_sleeper_on_stack() to keep the naming convention consistent.
Convert the usage site over to it.
Signed-off-by: Nam Cao <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/fc91182375df81120a88dbe0263267e24d1bf19e.1730386209.git.namcao@linutronix.de
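A sketch of the converted pattern under the new name, assuming the usual on-stack sleeper flow; the 10 ms timeout is arbitrary.

#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/sched.h>

static void sleep_briefly(void)
{
        struct hrtimer_sleeper t;

        /* Was: hrtimer_init_sleeper_on_stack(&t, CLOCK_MONOTONIC, HRTIMER_MODE_REL); */
        hrtimer_setup_sleeper_on_stack(&t, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        hrtimer_set_expires(&t.timer, ms_to_ktime(10));

        set_current_state(TASK_INTERRUPTIBLE);
        hrtimer_sleeper_start_expires(&t, HRTIMER_MODE_REL);
        if (t.task)
                schedule();
        __set_current_state(TASK_RUNNING);

        hrtimer_cancel(&t.timer);
        destroy_hrtimer_on_stack(&t.timer);
}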
Revision tags: v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6
900bbaae | 26-Apr-2024 | Xuewen Yan <[email protected]>

epoll: Add synchronous wakeup support for ep_poll_callback
Currently, epoll only uses the wake_up() interface to wake up a task. However, some epoll users, such as the Android binder driver, want to use the synchronous wakeup flag to hint the scheduler. So add a wake_up_sync() define, and use wake_up_sync() when sync is true in ep_poll_callback().
Co-developed-by: Jing Xia <[email protected]> Signed-off-by: Jing Xia <[email protected]> Signed-off-by: Xuewen Yan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Tested-by: Brian Geffon <[email protected]> Reviewed-by: Brian Geffon <[email protected]> Reported-by: Benoit Lize <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
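A sketch of the wakeup path described above, assuming the new wake_up_sync() helper; the function below is illustrative, not the epoll code.

#include <linux/wait.h>

static void wake_waiters(struct wait_queue_head *wq, bool sync)
{
        if (sync)
                wake_up_sync(wq);   /* synchronous hint: the waker is about to sleep */
        else
                wake_up(wq);
}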
Revision tags: v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6
6dfeff09 | 11-Dec-2023 | Matthew Wilcox (Oracle) <[email protected]>

wait: Remove uapi header file from main header file
There's really no overlap between uapi/linux/wait.h and linux/wait.h. There are two files which rely on the uapi file being implicitly included, so explicitly include it there and remove it from the main header file.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Kent Overstreet <[email protected]> Reviewed-by: Christian Brauner <[email protected]>
Revision tags: v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6
37acade0 | 10-Oct-2023 | Matthew Wilcox (Oracle) <[email protected]>

sched: remove wait bookmarks
There are no users of wait bookmarks left, so simplify the wait code by removing them.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Benjamin Segall <[email protected]> Cc: Bin Lai <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Dietmar Eggemann <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vincent Guittot <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Revision tags: v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2
6f63904c | 08-Mar-2023 | Andrei Vagin <[email protected]>

sched: add a few helpers to wake up tasks on the current cpu
Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake up tasks on the current CPU.
These two helpers are useful when the task needs to make a synchronous context switch to another task. In this context, synchronous means it wakes up the target task and falls asleep right after that.
One example of such a workload is seccomp user notifications. This mechanism allows a supervisor process to handle system calls on behalf of a target process. While the supervisor is handling an intercepted system call, the target process is blocked in the kernel, waiting for a response to come back.
On-CPU context switches are much faster than regular ones.
Signed-off-by: Andrei Vagin <[email protected]> Acked-by: "Peter Zijlstra (Intel)" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
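A sketch of such a synchronous hand-off using the new complete_on_current_cpu() helper; the structure and function names are placeholders, not the seccomp code.

#include <linux/completion.h>

struct handoff {
        struct completion request_ready;   /* target -> supervisor */
        struct completion reply_ready;     /* supervisor -> target */
};

static int target_side(struct handoff *h)
{
        /* Wake the supervisor on this CPU... */
        complete_on_current_cpu(&h->request_ready);
        /* ...and block right away until it answers. */
        return wait_for_completion_interruptible(&h->reply_ready);
}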
Revision tags: v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6
ee7dc86b | 15-Nov-2022 | Gabriel Krisman Bertazi <[email protected]>

wait: Return number of exclusive waiters awaken
Sbitmap code will need to know how many waiters were actually woken for its batched wakeups implementation. Return the number of woken exclusive waiters from __wake_up() to facilitate that.
Suggested-by: Jan Kara <[email protected]> Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
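A sketch of a caller consuming the new return value; the batch accounting below is illustrative.

#include <linux/atomic.h>
#include <linux/sched.h>
#include <linux/wait.h>

static void wake_batch(struct wait_queue_head *wq, int batch, atomic_t *pending)
{
        int woken;

        /* Wake up to 'batch' exclusive waiters and learn how many really woke. */
        woken = __wake_up(wq, TASK_NORMAL, batch, NULL);

        /* Only account for wakeups that actually happened. */
        atomic_sub(woken, pending);
}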
Revision tags: v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0
5aec788a | 27-Sep-2022 | Peter Zijlstra <[email protected]>

sched: Fix TASK_state comparisons
Task state is fundamentally a bitmask; direct comparisons are probably not working as intended. Specifically, the normal wait states have a number of possible modifiers:
TASK_UNINTERRUPTIBLE: TASK_WAKEKILL, TASK_NOLOAD, TASK_FREEZABLE
TASK_INTERRUPTIBLE:   TASK_FREEZABLE
Specifically, the addition of TASK_FREEZABLE wrecked __wait_is_interruptible(). This however led to an audit of direct comparisons yielding the rest of the changes.
Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Reported-by: Christian Borntraeger <[email protected]> Debugged-by: Christian Borntraeger <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Christian Borntraeger <[email protected]>
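For illustration, a bitmask test keeps working where an equality comparison breaks once modifiers are present; the helper below is hypothetical.

#include <linux/sched.h>

static bool state_is_uninterruptible(unsigned int state)
{
        /* Broken once modifiers are set: return state == TASK_UNINTERRUPTIBLE; */

        /* Still true with TASK_WAKEKILL, TASK_NOLOAD or TASK_FREEZABLE added: */
        return state & TASK_UNINTERRUPTIBLE;
}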
Revision tags: v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3
f5d39b02 | 22-Aug-2022 | Peter Zijlstra <[email protected]>

freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler in general.
By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible.
As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!).
Specifically, the current scheme works a little like:
freezer_do_not_count(); schedule(); freezer_count();
And either the task is blocked, or it lands in try_to_freeze() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues.
However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time.
That is, thawing tries to thaw things in an explicit order: kernel threads and workqueues, then bringing SMP back, then userspace, etc. However, due to the above-mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back.
This can be a fatal problem on asymmetric ISA architectures (e.g. ARMv9) where the userspace task requires a special CPU to run.
As said, replace this with a special task state TASK_FROZEN and add the following state transitions:
TASK_FREEZABLE  -> TASK_FROZEN
__TASK_STOPPED  -> TASK_FROZEN
__TASK_TRACED   -> TASK_FROZEN
The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost).
The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl.
With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/r/[email protected]
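For illustration only, a freezable sleep under the new scheme might look like the sketch below; it is not code from the patch, and spurious wakeups are handled by re-checking the condition.

#include <linux/sched.h>
#include <linux/wait.h>

static void freezable_wait(struct wait_queue_head *wq, bool *cond)
{
        DEFINE_WAIT(wait);

        for (;;) {
                /* TASK_FREEZABLE piggybacks on a normal sleep state. */
                prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE | TASK_FREEZABLE);
                if (READ_ONCE(*cond))
                        break;
                schedule();
        }
        finish_wait(wq, &wait);
}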
3f884a10 | 22-Aug-2022 | Peter Zijlstra <[email protected]>

sched/wait: Add wait_event_state()
Allows waiting with a custom @state.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
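A minimal usage sketch, with a placeholder waitqueue and condition; the state combination is just an example.

#include <linux/sched.h>
#include <linux/wait.h>

static int wait_done_killable_freezable(struct wait_queue_head *wq, bool *done)
{
        /* Killable and freezable sleep until *done becomes true. */
        return wait_event_state(*wq, READ_ONCE(*done),
                                TASK_KILLABLE | TASK_FREEZABLE);
}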
Revision tags: v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5
cceeeb6a | 27-Jun-2022 | Juri Lelli <[email protected]>

wait: Fix __wait_event_hrtimeout for RT/DL tasks
Changes to hrtimer mode (potentially made by __hrtimer_init_sleeper on PREEMPT_RT) are not visible to hrtimer_start_range_ns, thus not accounted for by hrtimer_start_expires call paths. In particular, __wait_event_hrtimeout suffers from this problem as we have, for example:
fs/aio.c::read_events
  wait_event_interruptible_hrtimeout
    __wait_event_hrtimeout
      hrtimer_init_sleeper_on_stack  <- this might "mode |= HRTIMER_MODE_HARD"
                                        on RT if task runs at RT/DL priority
      hrtimer_start_range_ns
        WARN_ON_ONCE(!(mode & HRTIMER_MODE_HARD) ^ !timer->is_hard)
        fires since the latter doesn't see the change of mode done by init_sleeper
Fix it by making __wait_event_hrtimeout call hrtimer_sleeper_start_expires, which is aware of the special RT/DL case, instead of hrtimer_start_range_ns.
Reported-by: Bruno Goncalves <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Daniel Bristot de Oliveira <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lore.kernel.org/r/[email protected]
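The shape of the fix, sketched outside the real macro; the expiry handling is abbreviated and the function name is a placeholder.

#include <linux/hrtimer.h>
#include <linux/sched.h>

static void start_wait_timeout(struct hrtimer_sleeper *t, ktime_t timeout)
{
        hrtimer_set_expires_range_ns(&t->timer, timeout, current->timer_slack_ns);

        /* Was: hrtimer_start_range_ns(&t->timer, timeout, current->timer_slack_ns,
         *                             HRTIMER_MODE_REL);
         * The sleeper-aware helper honours the PREEMPT_RT mode adjustment
         * (HRTIMER_MODE_HARD) made by the init/setup path.
         */
        hrtimer_sleeper_start_expires(t, HRTIMER_MODE_REL);
}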
Revision tags: v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5
42288cb4 | 09-Dec-2021 | Eric Biggers <[email protected]>

wait: add wake_up_pollfree()
Several ->poll() implementations are special in that they use a waitqueue whose lifetime is the current task, rather than the struct file as is normally the case. This is okay for blocking polls, since a blocking poll occurs within one task; however, non-blocking polls require another solution. This solution is for the queue to be cleared before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
However, that has a bug: wake_up_poll() calls __wake_up() with nr_exclusive=1. Therefore, if there are multiple "exclusive" waiters, and the wakeup function for the first one returns a positive value, only that one will be called. That's *not* what's needed for POLLFREE; POLLFREE is special in that it really needs to wake up everyone.
Considering the three non-blocking poll systems:
- io_uring poll doesn't handle POLLFREE at all, so it is broken anyway.
- aio poll is unaffected, since it doesn't support exclusive waits. However, that's fragile, as someone could add this feature later.
- epoll doesn't appear to be broken by this, since its wakeup function returns 0 when it sees POLLFREE. But this is fragile.
Although there is a workaround (see epoll), it's better to define a function which always sends POLLFREE to all waiters. Add such a function. Also make it verify that the queue really becomes empty after all waiters have been woken up.
Reported-by: Linus Torvalds <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Eric Biggers <[email protected]>
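A sketch of the intended call site: an object that embeds its own waitqueue clears all pollers before it goes away. The structure and free path are placeholders.

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/wait.h>

struct my_poll_obj {
        struct wait_queue_head wait;
        struct rcu_head rcu;
};

static void my_poll_obj_destroy(struct my_poll_obj *obj)
{
        /* Send EPOLLHUP | POLLFREE to every waiter, exclusive or not. */
        wake_up_pollfree(&obj->wait);

        /* Waiters may still be unlinking themselves; free only after a grace period. */
        kfree_rcu(obj, rcu);
}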
Revision tags: v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3
9b3c4ab3 | 21-Sep-2021 | Peter Zijlstra <[email protected]>

sched,rcu: Rework try_invoke_on_locked_down_task()
Give try_invoke_on_locked_down_task() a saner name and have it return an int so that the caller might distinguish between different reasons of failure.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Vasily Gorbik <[email protected]> Tested-by: Vasily Gorbik <[email protected]> # on s390 Link: https://lkml.kernel.org/r/[email protected]
Revision tags: v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5
77eccd0d | 01-Jun-2021 | Julian Wiedmann <[email protected]>

wait: use LIST_HEAD_INIT() to initialize wait_queue_head
Replace the open-coded initialization with the right macro.
Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
11c7aa0d | 07-Jun-2021 | Jan Kara <[email protected]>

rq-qos: fix missed wake-ups in rq_qos_throttle try two
Commit 545fbd0775ba ("rq-qos: fix missed wake-ups in rq_qos_throttle") tried to fix a problem that a process could be sleeping in rq_qos_wait() without anyone to wake it up. However the fix is not complete and the following can still happen:
CPU1 (waiter1)                 CPU2 (waiter2)                 CPU3 (waker)
rq_qos_wait()                  rq_qos_wait()
  acquire_inflight_cb() -> fails
                                 acquire_inflight_cb() -> fails
                                                                completes IOs,
                                                                inflight decreased
  prepare_to_wait_exclusive()
                                 prepare_to_wait_exclusive()
  has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers
                                 has_sleeper = !wq_has_single_sleeper() -> true
  io_schedule()                  io_schedule()
Deadlock as now there's nobody to wakeup the two waiters. The logic automatically blocking when there are already sleepers is really subtle and the only way to make it work reliably is that we check whether there are some waiters in the queue when adding ourselves there. That way, we are guaranteed that at least the first process to enter the wait queue will recheck the waiting condition before going to sleep and thus guarantee forward progress.
Fixes: 545fbd0775ba ("rq-qos: fix missed wake-ups in rq_qos_throttle") CC: [email protected] Signed-off-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
Revision tags: v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2
c4d51a52 | 27-Oct-2020 | David Woodhouse <[email protected]>

sched/wait: Add add_wait_queue_priority()
This allows an exclusive wait_queue_entry to be added at the head of the queue, instead of the tail as normal. Thus, it gets to consume events first without allowing non-exclusive waiters to be woken at all.
The (first) intended use is for KVM IRQFD, which currently has inconsistent behaviour depending on whether posted interrupts are available or not. If they are, KVM will bypass the eventfd completely and deliver interrupts directly to the appropriate vCPU. If not, events are delivered through the eventfd and userspace will receive them when polling on the eventfd.
By using add_wait_queue_priority(), KVM will be able to consistently consume events within the kernel without accidentally exposing them to userspace when they're supposed to be bypassed. This, in turn, means that userspace doesn't have to jump through hoops to avoid listening on the erroneously noisy eventfd and injecting duplicate interrupts.
Signed-off-by: David Woodhouse <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
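A sketch of registering such a priority waiter; the callback and container below are placeholders, not the KVM irqfd code.

#include <linux/wait.h>

struct my_consumer {
        struct wait_queue_entry wait;
};

static int my_wakeup(struct wait_queue_entry *wq_entry, unsigned int mode,
                     int sync, void *key)
{
        /*
         * Handle the event in-kernel. Returning nonzero from an exclusive
         * entry stops a nr_exclusive=1 wakeup here, so ordinary pollers
         * never see it.
         */
        return 1;
}

static void my_consumer_register(struct my_consumer *c, struct wait_queue_head *wq)
{
        init_waitqueue_func_entry(&c->wait, my_wakeup);
        /* Queue at the head: consume events before non-exclusive waiters. */
        add_wait_queue_priority(wq, &c->wait);
}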
Revision tags: v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5
5ef64cc8 | 13-Sep-2020 | Linus Torvalds <[email protected]>

mm: allow a controlled amount of unfairness in the page lock
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made the page locking entirely fair, in that if a waiter came in while the lock was held, the lock would be transferred to the lockers strictly in order.
That was intended to finally get rid of the long-reported watchdog failures that involved the page lock under extreme load, where a process could end up waiting essentially forever, as other page lockers stole the lock from under it.
It also improved some benchmarks, but it ended up causing huge performance regressions on others, simply because fair lock behavior doesn't end up giving out the lock as aggressively, causing better worst-case latency, but potentially much worse average latencies and throughput.
Instead of reverting that change entirely, this introduces a controlled amount of unfairness, with a sysctl knob to tune it if somebody needs to. But the default value should hopefully be good for any normal load, allowing a few rounds of lock stealing, but enforcing the strict ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring may end up exposing an ABBA deadlock that is hidden by the usual optimistic lock stealing, and while the unfairness doesn't fix the fundamental issue (and I'm still looking at that), it avoids it in practice.
The amount of unfairness can be modified by writing a new value to the 'sysctl_page_lock_unfairness' variable (default value of 5, exposed through /proc/sys/vm/page_lock_unfairness), but that is hopefully something we'd use mainly for debugging rather than being necessary for any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and how contended it gets under certain locks. And the main contention doesn't really seem to be anything related to IO (which was the origin of this lock), but for things like just verifying that the page file mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/ Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1 Link: https://lore.kernel.org/linux-fsdevel/[email protected]/ Reported-and-tested-by: Michael Larabel <[email protected]> Tested-by: Matthieu Baerts <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jan Kara <[email protected]> Cc: Amir Goldstein <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Revision tags: v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2, v5.7-rc1, v5.6, v5.6-rc7, v5.6-rc6
2beaf328 | 11-Mar-2020 | Paul E. McKenney <[email protected]>

sched/core: Add function to sample state of locked-down task
A running task's state can be sampled in a consistent manner (for example, for diagnostic purposes) simply by invoking smp_call_function_single() on its CPU, which may be obtained using task_cpu(), then having the IPI handler verify that the desired task is in fact still running. However, if the task is not running, this sampling can in theory be done immediately and directly. In practice, the task might start running at any time, including during the sampling period. Gaining a consistent sample of a not-running task therefore requires that something be done to lock down the target task's state.
This commit therefore adds a try_invoke_on_locked_down_task() function that invokes a specified function if the specified task can be locked down, returning true if successful and if the specified function returns true. Otherwise this function simply returns false. Given that the function passed to try_invoke_on_locked_down_task() might be invoked with a runqueue lock held, that function had better be quite lightweight.
The function is passed the target task's task_struct pointer and the argument passed to try_invoke_on_locked_down_task(), allowing easy access to task state and to a location for further variables to be passed in and out.
Note that the specified function will be called even if the specified task is currently running. The function can use ->on_rq and task_curr() to quickly and easily determine the task's state, and can return false if this state is not to the function's liking. The caller of the try_invoke_on_locked_down_task() would then see the false return value, and could take appropriate action, for example, trying again later or sending an IPI if matters are more urgent.
It is expected that use cases such as the RCU CPU stall warning code will simply return false if the task is currently running. However, there are use cases involving nohz_full CPUs where the specified function might instead fall back to an alternative sampling scheme that relies on heavier synchronization (such as memory barriers) in the target task.
Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Dietmar Eggemann <[email protected]> Cc: Ben Segall <[email protected]> Cc: Mel Gorman <[email protected]> [ paulmck: Apply feedback from Peter Zijlstra and Steven Rostedt. ] [ paulmck: Invoke if running to handle feedback from Mathieu Desnoyers. ] Reviewed-by: Steven Rostedt (VMware) <[email protected]> Reviewed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
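A sketch of sampling a blocked task's state with the new helper; the sampling function and its argument structure are placeholders.

#include <linux/sched.h>
#include <linux/wait.h>

struct state_sample {
        int on_rq;
};

/* May be called with a runqueue lock held, so keep it lightweight. */
static bool sample_task(struct task_struct *t, void *arg)
{
        struct state_sample *s = arg;

        s->on_rq = t->on_rq;    /* consistent: the task is locked down or running */
        return true;
}

static bool try_sample(struct task_struct *t, struct state_sample *s)
{
        return try_invoke_on_locked_down_task(t, sample_task, s);
}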
Revision tags: v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3, v5.5-rc2, v5.5-rc1, v5.4, v5.4-rc8, v5.4-rc7, v5.4-rc6
7f26482a | 30-Oct-2019 | Peter Zijlstra <[email protected]>

locking/percpu-rwsem: Remove the embedded rwsem
The filesystem freezer uses percpu-rwsem in a way that is effectively write_non_owner() and achieves this with a few horrible hacks that rely on the rwsem (!percpu) implementation.
When PREEMPT_RT replaces the rwsem implementation with a PI aware variant this comes apart.
Remove the embedded rwsem and implement it using a waitqueue and an atomic_t.
- make readers_block an atomic, and use it, with the waitqueue for a blocking test-and-set write-side.
- have the read-side wait for the 'lock' state to clear.
Have the waiters use FIFO queueing and mark them (reader/writer) with a new WQ_FLAG. Use a custom wake_function to wake either a single writer or all readers until a writer.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Acked-by: Will Deacon <[email protected]> Acked-by: Waiman Long <[email protected]> Tested-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Revision tags: v5.4-rc5, v5.4-rc4, v5.4-rc3, v5.4-rc2, v5.4-rc1
f94df989 | 24-Sep-2019 | David Howells <[email protected]>

Add wake_up_interruptible_sync_poll_locked()
Add a wakeup call for a case whereby the caller already has the waitqueue spinlock held. This can be used by pipes to alter the ring buffer indices and issue a wakeup under the same spinlock.
Signed-off-by: David Howells <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
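A sketch of a producer that updates its ring and wakes readers under the one waitqueue spinlock; the ring structure is a placeholder, not the pipe code.

#include <linux/poll.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct my_ring {
        struct wait_queue_head wait;
        unsigned int head, tail;
};

static void my_ring_post(struct my_ring *r)
{
        spin_lock_irq(&r->wait.lock);
        r->head++;      /* publish the new item */
        /* Wake readers without dropping and retaking r->wait.lock. */
        wake_up_interruptible_sync_poll_locked(&r->wait, EPOLLIN | EPOLLRDNORM);
        spin_unlock_irq(&r->wait.lock);
}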
ce4dd442 | 16-Oct-2019 | David Howells <[email protected]>

Remove the nr_exclusive argument from __wake_up_sync_key()
Remove the nr_exclusive argument from __wake_up_sync_key() and derived functions as everything seems to set it to 1. Note also that if it wasn't set to 1, it would clear WF_SYNC anyway.
Signed-off-by: David Howells <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
Revision tags: v5.3, v5.3-rc8, v5.3-rc7, v5.3-rc6, v5.3-rc5, v5.3-rc4, v5.3-rc3, v5.3-rc2
dbc1625f | 26-Jul-2019 | Sebastian Andrzej Siewior <[email protected]>

hrtimer: Consolidate hrtimer_init() + hrtimer_init_sleeper() calls
hrtimer_init_sleeper() calls require prior initialisation of the hrtimer object which is embedded into the hrtimer_sleeper.
Combine the initialization and spare a function call. Fixup all call sites.
This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper specific initializations of the embedded hrtimer without modifying any of the call sites.
No functional change.
[ anna-maria: Minor cleanups ] [ tglx: Adopted to the removal of the task argument of hrtimer_init_sleeper() and trivial polishing. Folded a fix from Stephen Rothwell for the vsoc code ]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Anna-Maria Gleixner <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
b7449487 | 26-Jul-2019 | Thomas Gleixner <[email protected]>

hrtimer: Remove task argument from hrtimer_init_sleeper()
All callers hand in 'current' and that's the only task pointer which actually makes sense. Remove the task argument and set current in the function.
Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
Revision tags: v5.3-rc1
a6d81d30 | 16-Jul-2019 | Josef Bacik <[email protected]>

wait: add wq_has_single_sleeper helper
rq-qos sits in the io path so we want to take locks as sparingly as possible. To accomplish this we try not to take the waitqueue head lock unless we are sure we need to go to sleep, and we have an optimization to make sure that we don't starve out existing waiters. Since we check if there are existing waiters locklessly we need to be able to update our view of the waitqueue list after we've added ourselves to the waitqueue. Accomplish this by adding this helper to see if there is more than just ourselves on the list.
Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
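A sketch of the lockless check described above; the surrounding rq-qos logic is omitted and the names are placeholders.

#include <linux/sched.h>
#include <linux/wait.h>

static bool queue_and_check_alone(struct wait_queue_head *wq,
                                  struct wait_queue_entry *wait)
{
        prepare_to_wait_exclusive(wq, wait, TASK_UNINTERRUPTIBLE);
        /* true => only us on the list, so nobody was already waiting ahead of us */
        return wq_has_single_sleeper(wq);
}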