|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6 |
|
| #
6d536cad |
| 04-Mar-2025 |
Uros Bizjak <[email protected]> |
x86/percpu: Fix __per_cpu_hot_end marker
Make __per_cpu_hot_end marker point to the end of the percpu cache hot data, not to the end of the percpu cache hot section.
This fixes CONFIG_MPENTIUM4 case where X86_L1_CACHE_SHIFT is set to 7 (128 bytes).
Also update assert message accordingly.
Reported-by: Ingo Molnar <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Brian Gerst <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Closes: https://lore.kernel.org/lkml/[email protected]/
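A minimal linker-script sketch of the distinction described above (the __per_cpu_hot_* markers are from the commit; the alignment value, assert limit, and assert text are illustrative assumptions):

	. = ALIGN(cacheline);
	__per_cpu_hot_start = .;
	*(SORT_BY_ALIGNMENT(.data..percpu..hot.*))
	__per_cpu_hot_end = .;		/* end of the hot data itself */
	. = ALIGN(cacheline);		/* section padding now lies past the marker */

	ASSERT(__per_cpu_hot_end - __per_cpu_hot_start <= 64,
	       "percpu cache hot data too large")

With the marker placed before the trailing ALIGN(), the size check stays correct even when X86_L1_CACHE_SHIFT inflates the padding, as in the CONFIG_MPENTIUM4 case.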
|
| #
ab2bb9c0 |
| 03-Mar-2025 |
Brian Gerst <[email protected]> |
percpu: Introduce percpu hot section
Add a subsection to the percpu data for frequently accessed variables that should remain cached on each processor. These variables should not be accessed from other processors to avoid cacheline bouncing.
This will replace the pcpu_hot struct on x86, and open up similar functionality to other architectures and the kernel core.
Signed-off-by: Brian Gerst <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Uros Bizjak <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
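A hedged usage sketch in C (DEFINE_PER_CPU_CACHE_HOT follows this series' naming; the variable and helper are illustrative):

	#include <linux/percpu.h>

	/* Placed in the cache-hot percpu subsection so it stays resident
	 * in each CPU's cache; never touched cross-CPU. */
	DEFINE_PER_CPU_CACHE_HOT(unsigned long, my_hot_counter);

	static inline void bump_local_counter(void)
	{
		this_cpu_inc(my_hot_counter);	/* local-CPU access only */
	}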
|
|
Revision tags: v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4 |
|
| #
2ec01bd7 |
| 17-Dec-2024 |
Benjamin Berg <[email protected]> |
vmlinux.lds.h: Remove entry to place init_task onto init_stack
Since commit 0eb5085c3874 ("arch: remove ARCH_TASK_STRUCT_ON_STACK") there is no option that would allow placing task_struct on the stack. Remove the unused linker script entry.
Signed-off-by: Benjamin Berg <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
| #
68f3ea7e |
| 21-Feb-2025 |
Ard Biesheuvel <[email protected]> |
vmlinux.lds: Ensure that const vars with relocations are mapped R/O
In the kernel, there are architectures (x86, arm64) that perform boot-time relocation (for KASLR) without relying on PIE codegen. In this case, all const global objects are emitted into .rodata, including const objects with fields that will be fixed up by the boot-time relocation code. This implies that .rodata (and .text in some cases) need to be writable at boot, but they will usually be mapped read-only as soon as the boot completes.
When using PIE codegen, the compiler will emit const global objects into .data.rel.ro rather than .rodata if the object contains fields that need such fixups at boot-time. This permits the linker to annotate such regions as requiring read-write access only at load time, but not at execution time (in user space), while keeping .rodata truly const (in user space, this is important for reducing the CoW footprint of dynamic executables).
This distinction does not matter for the kernel, but it does imply that const data will end up in writable memory if the .data.rel.ro sections are not treated in a special way, as they will end up in the writable .data segment by default.
So emit .data.rel.ro into the .rodata segment.
Cc: [email protected]
Signed-off-by: Ard Biesheuvel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Josh Poimboeuf <[email protected]>
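A small C illustration of the placement difference (names are illustrative):

	/* No relocations in the initializer: stays in .rodata. */
	static const int table_size = 128;

	/* Under PIE codegen, the address of handler_a needs a load-time
	 * fixup, so the array is emitted into .data.rel.ro instead: */
	extern void handler_a(void);
	static void (* const handlers[])(void) = { handler_a };

With this patch, both objects end up in the read-only mapped segment of vmlinux.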
|
| #
4b00c116 |
| 23-Jan-2025 |
Brian Gerst <[email protected]> |
percpu: Remove __per_cpu_load
__per_cpu_load is now always equal to __per_cpu_start.
Signed-off-by: Brian Gerst <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
| #
e23cff68 |
| 23-Jan-2025 |
Brian Gerst <[email protected]> |
percpu: Remove PERCPU_VADDR()
x86-64 was the last user.
Signed-off-by: Brian Gerst <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
| #
95b09161 |
| 23-Jan-2025 |
Brian Gerst <[email protected]> |
percpu: Remove PER_CPU_FIRST_SECTION
x86-64 was the last user.
Signed-off-by: Brian Gerst <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
| #
4c56eb33 |
| 01-Feb-2025 |
Masahiro Yamada <[email protected]> |
kbuild: keep symbols for symbol_get() even with CONFIG_TRIM_UNUSED_KSYMS
Linus observed that the symbol_request(utf8_data_table) call fails when CONFIG_UNICODE=y and CONFIG_TRIM_UNUSED_KSYMS=y.
symbol_get() relies on the symbol data being present in the ksymtab for symbol lookups. However, EXPORT_SYMBOL_GPL(utf8_data_table) is dropped due to CONFIG_TRIM_UNUSED_KSYMS, as no module references it in this case.
Probably, this has been broken since commit dbacb0ef670d ("kconfig option for TRIM_UNUSED_KSYMS").
This commit addresses the issue by leveraging modpost. Symbol names passed to symbol_get() are recorded in the special .no_trim_symbol section, which is then parsed by modpost to forcibly keep such symbols. The .no_trim_symbol section is discarded by the linker scripts, so there is no impact on the size of the final vmlinux or modules.
This commit cannot resolve the issue for direct calls to __symbol_get() because the symbol name is not known at compile-time.
Although symbol_get() may eventually be deprecated, this workaround should be good enough meanwhile.
Reported-by: Linus Torvalds <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
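A hedged sketch of the recording step (the exact macro in <linux/module.h> may differ; __symbol_get() is the underlying lookup helper):

	/* Record the requested name in .no_trim_symbol so modpost can
	 * force-keep the export; the section itself is discarded by the
	 * linker scripts, adding nothing to vmlinux or modules. */
	#define symbol_get(x) ({					\
		static const char __notrim[] __used			\
			__section(".no_trim_symbol") = __stringify(x);	\
		(typeof(&x))__symbol_get(__stringify(x)); })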
|
|
Revision tags: v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7 |
|
| #
dbefa1f3 |
| 06-Nov-2024 |
Masahiro Yamada <[email protected]> |
Rename .data.once to .data..once to fix resetting WARN*_ONCE
Commit b1fca27d384e ("kernel debug: support resetting WARN*_ONCE") added support for clearing the state of once warnings. However, it is not functional when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION or CONFIG_LTO_CLANG is enabled, because .data.once matches the .data.[0-9a-zA-Z_]* pattern in the DATA_MAIN macro.
Commit cb87481ee89d ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured") was introduced to suppress the issue for the default CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=n case, providing a minimal fix for stable backporting. We were aware this did not address the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. The plan was to apply correct fixes and then revert cb87481ee89d. [1]
Seven years have passed since then, yet the #ifdef workaround remains in place. Meanwhile, commit b1fca27d384e introduced the .data.once section, and commit dc5723b02e52 ("kbuild: add support for Clang LTO") extended the #ifdef.
Using a ".." separator in the section name fixes the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_LTO_CLANG.
[1]: https://lore.kernel.org/linux-kbuild/CAK7LNASck6BfdLnESxXUeECYL26yUDm0cwRZuM4gmaWUkxjL5g@mail.gmail.com/
Fixes: b1fca27d384e ("kernel debug: support resetting WARN*_ONCE")
Fixes: dc5723b02e52 ("kbuild: add support for Clang LTO")
Signed-off-by: Masahiro Yamada <[email protected]>
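A hedged sketch of the glob clash (DATA_MAIN abridged; the real pattern list is longer):

	/* With CONFIG_LD_DEAD_CODE_DATA_ELIMINATION or CONFIG_LTO_CLANG,
	 * DATA_MAIN also collects compiler-generated per-symbol sections: */
	#define DATA_MAIN .data .data.[0-9a-zA-Z_]* ...

	/* ".data.once" matched .data.[0-9a-zA-Z_]*, scattering the once
	 * state; ".data..once" escapes the glob ('.' is outside the
	 * character class) and can be placed explicitly: */
	*(.data..once)

The same reasoning applies to the .data..unlikely rename in the next commit.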
|
| #
bb43a599 |
| 06-Nov-2024 |
Masahiro Yamada <[email protected]> |
Rename .data.unlikely to .data..unlikely
Commit 7ccaba5314ca ("consolidate WARN_...ONCE() static variables") was intended to collect all .data.unlikely sections into one chunk. However, this has not worked when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION or CONFIG_LTO_CLANG is enabled, because .data.unlikely matches the .data.[0-9a-zA-Z_]* pattern in the DATA_MAIN macro.
Commit cb87481ee89d ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured") was introduced to suppress the issue for the default CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=n case, providing a minimal fix for stable backporting. We were aware this did not address the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. The plan was to apply correct fixes and then revert cb87481ee89d. [1]
Seven years have passed since then, yet the #ifdef workaround remains in place.
Using a ".." separator in the section name fixes the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_LTO_CLANG.
[1]: https://lore.kernel.org/linux-kbuild/CAK7LNASck6BfdLnESxXUeECYL26yUDm0cwRZuM4gmaWUkxjL5g@mail.gmail.com/
Fixes: cb87481ee89d ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured")
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
Revision tags: v6.12-rc6 |
|
| #
d5dc9583 |
| 02-Nov-2024 |
Rong Xu <[email protected]> |
kbuild: Add Propeller configuration for kernel build
Add the build support for using Clang's Propeller optimizer. Like AutoFDO, Propeller uses hardware sampling to gather information about the frequency of execution of different code paths within a binary. This information is then used to guide the compiler's optimization decisions, resulting in a more efficient binary.
The support requires a Clang compiler LLVM 19 or later, and the create_llvm_prof tool (https://github.com/google/autofdo/releases/tag/v0.30.1). This commit is limited to x86 platforms that support PMU features like LBR on Intel machines and AMD Zen3 BRS.
Here is an example workflow for building an AutoFDO+Propeller optimized kernel:
1) Build the kernel on the host machine, with the AutoFDO and Propeller build config:
      CONFIG_AUTOFDO_CLANG=y
      CONFIG_PROPELLER_CLANG=y
   then:
      $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile>
“<autofdo_profile>” is the profile collected when doing a non-Propeller AutoFDO build. This step builds a kernel that has the same optimization level as AutoFDO, plus a metadata section that records basic block information. This kernel image runs as fast as an AutoFDO optimized kernel.
2) Install the kernel on test/production machines.
3) Run the load tests. The '-c' option in perf specifies the sample event period. We suggest using a suitable prime number, like 500009, for this purpose.
   For Intel platforms:
      $ perf record -e BR_INST_RETIRED.NEAR_TAKEN:k -a -N -b -c <count> \
        -o <perf_file> -- <loadtest>
   For AMD platforms, the supported systems are Zen3 with BRS, or Zen4 with amd_lbr_v2.
      # To see if Zen3 supports BRS:
      $ cat /proc/cpuinfo | grep " brs"
      # To see if Zen4 supports amd_lbr_v2:
      $ cat /proc/cpuinfo | grep amd_lbr_v2
   If the result is yes, then collect the profile using:
      $ perf record --pfm-events RETIRED_TAKEN_BRANCH_INSTRUCTIONS:k -a \
        -N -b -c <count> -o <perf_file> -- <loadtest>
4) (Optional) Download the raw perf file to the host machine.
5) Generate the Propeller profile:
      $ create_llvm_prof --binary=<vmlinux> --profile=<perf_file> \
        --format=propeller --propeller_output_module_name \
        --out=<propeller_profile_prefix>_cc_profile.txt \
        --propeller_symorder=<propeller_profile_prefix>_ld_profile.txt
“create_llvm_prof” is the profile conversion tool; a prebuilt Linux binary can be found at https://github.com/google/autofdo/releases/tag/v0.30.1 (it can also be built from source).
"<propeller_profile_prefix>" can be something like "/home/user/dir/any_string".
This command generates a pair of Propeller profiles: "<propeller_profile_prefix>_cc_profile.txt" and "<propeller_profile_prefix>_ld_profile.txt".
6) Rebuild the kernel using the AutoFDO and Propeller profile files:
      CONFIG_AUTOFDO_CLANG=y
      CONFIG_PROPELLER_CLANG=y
   and:
      $ make LLVM=1 CLANG_AUTOFDO_PROFILE=<autofdo_profile> \
        CLANG_PROPELLER_PROFILE_PREFIX=<propeller_profile_prefix>
Co-developed-by: Han Shen <[email protected]>
Signed-off-by: Han Shen <[email protected]>
Signed-off-by: Rong Xu <[email protected]>
Suggested-by: Sriraman Tallam <[email protected]>
Suggested-by: Krzysztof Pszeniczny <[email protected]>
Suggested-by: Nick Desaulniers <[email protected]>
Suggested-by: Stephane Eranian <[email protected]>
Tested-by: Yonghong Song <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
| #
2fd65f7a |
| 02-Nov-2024 |
Rong Xu <[email protected]> |
AutoFDO: Enable machine function split optimization for AutoFDO
Enable the machine function split optimization for AutoFDO in Clang.
Machine function split (MFS) is a pass in the Clang compiler that splits a function into hot and cold parts. The linker groups all cold blocks across functions together. This decreases hot code fragmentation and improves iCache and iTLB utilization.
MFS requires a profile so this is enabled only for the AutoFDO builds.
Co-developed-by: Han Shen <[email protected]>
Signed-off-by: Han Shen <[email protected]>
Signed-off-by: Rong Xu <[email protected]>
Suggested-by: Sriraman Tallam <[email protected]>
Suggested-by: Krzysztof Pszeniczny <[email protected]>
Tested-by: Yonghong Song <[email protected]>
Tested-by: Yabin Cui <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
| #
0847420f |
| 02-Nov-2024 |
Rong Xu <[email protected]> |
AutoFDO: Enable -ffunction-sections for the AutoFDO build
Enable -ffunction-sections by default for the AutoFDO build.
With -ffunction-sections, the compiler places each function in its own section named .text.function_name instead of placing all functions in the .text section. In the AutoFDO build, this allows the linker to utilize profile information to reorganize functions for improved utilization of iCache and iTLB.
Co-developed-by: Han Shen <[email protected]>
Signed-off-by: Han Shen <[email protected]>
Signed-off-by: Rong Xu <[email protected]>
Suggested-by: Sriraman Tallam <[email protected]>
Tested-by: Yonghong Song <[email protected]>
Tested-by: Yabin Cui <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
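A small C illustration (function names are hypothetical):

	/* With -ffunction-sections, each function gets a section named
	 * after it instead of all code landing in .text: */
	int fast_path(int x) { return x + 1; }	/* -> .text.fast_path */
	int slow_path(int x) { return x - 1; }	/* -> .text.slow_path */

The linker can then order the .text.* input sections freely, guided by the profile.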
|
| #
db0b2991 |
| 02-Nov-2024 |
Rong Xu <[email protected]> |
vmlinux.lds.h: Add markers for text_unlikely and text_hot sections
Add markers like __hot_text_start, __hot_text_end, __unlikely_text_start, and __unlikely_text_end which will be included in System.map. These markers indicate how the compiler groups functions, providing valuable information to developers about the layout and optimization of the code.
Co-developed-by: Han Shen <[email protected]>
Signed-off-by: Han Shen <[email protected]>
Signed-off-by: Rong Xu <[email protected]>
Suggested-by: Sriraman Tallam <[email protected]>
Tested-by: Yonghong Song <[email protected]>
Tested-by: Yabin Cui <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
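A hedged linker-script sketch of how such marker pairs bracket the grouped input sections (patterns abridged):

	__unlikely_text_start = .;
	*(.text.unlikely .text.unlikely.*)
	__unlikely_text_end = .;
	__hot_text_start = .;
	*(.text.hot .text.hot.*)
	__hot_text_end = .;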
|
| #
0043ecea |
| 02-Nov-2024 |
Rong Xu <[email protected]> |
vmlinux.lds.h: Adjust symbol ordering in text output section
When the -ffunction-sections compiler option is enabled, each function is placed in a separate section named .text.function_name rather than putting all functions in a single .text section.
However, using -ffunction-sections can cause problems with the linker script. The comments included in include/asm-generic/vmlinux.lds.h note these issues: “TEXT_MAIN here will match .text.fixup and .text.unlikely if dead code elimination is enabled, so these sections should be converted to use ".." first.”
It is unclear whether there is a straightforward method for converting a suffix to "..".
This patch modifies the order of subsections within the text output section. Specifically, it changes the current order:
      .text.hot, .text, .text.unlikely, .text.unknown, .text.asan
to the new order:
      .text.asan, .text.unknown, .text.unlikely, .text.hot, .text
Here is the rationale behind the new layout:
The majority of the code resides in three sections: .text.hot, .text, and .text.unlikely, with .text.unknown containing a negligible amount. .text.asan is only generated in ASAN builds.
The primary goal is to group code segments based on their execution frequency (hotness).
First, we want to place .text.hot adjacent to .text. Due to constraints with -ffunction-sections, placing .text.hot after .text is problematic, so we place .text.hot before .text.
Next comes .text.unlikely, which cannot go after .text for the same -ffunction-sections reason; therefore, we position .text.unlikely before .text.hot.
.text.unknown and .text.asan follow the same logic.
This revised ordering effectively reverses the original arrangement (for .text.unlikely, .text.unknown, and .text.asan), maintaining a similar level of affinity between sections.
It also places .text.hot section at the beginning of a page to better utilize the TLB entry.
Note that the limitation arises because the linker script employs glob patterns instead of regular expressions for string matching. While there is a method to maintain the current order using complex patterns, this significantly complicates the pattern and increases the likelihood of errors.
This patch also changes vmlinux.lds.S for the sparc64 architecture to accommodate specific symbol placement requirements.
Co-developed-by: Han Shen <[email protected]>
Signed-off-by: Han Shen <[email protected]>
Signed-off-by: Rong Xu <[email protected]>
Suggested-by: Sriraman Tallam <[email protected]>
Suggested-by: Krzysztof Pszeniczny <[email protected]>
Tested-by: Yonghong Song <[email protected]>
Tested-by: Yabin Cui <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
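A hedged linker-script sketch of the resulting order inside the .text output section (patterns abridged; the upstream macros add config-dependent variants):

	.text : {
		*(.text.asan .text.asan.*)
		*(.text.unknown .text.unknown.*)
		*(.text.unlikely .text.unlikely.*)
		*(.text.hot .text.hot.*)
		*(.text .text.fixup)
	}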
|
|
Revision tags: v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2 |
|
| #
92a10d38 |
| 30-Jul-2024 |
Jann Horn <[email protected]> |
runtime constants: move list of constants to vmlinux.lds.h
Refactor the list of constant variables into a macro. This should make it easier to add more constants in the future.
Signed-off-by: Jann Horn <[email protected]>
Reviewed-by: Alexander Gordeev <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
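A hedged sketch of the consolidated macro (the two entries shown are the d_hash() constants from the runtime-constants work; the exact upstream list may differ):

	#define RUNTIME_CONST_VARIABLES			\
		RUNTIME_CONST(shift, d_hash_shift)	\
		RUNTIME_CONST(ptr, dentry_hashtable)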
|
| #
b6547e54 |
| 03-Aug-2024 |
Linus Torvalds <[email protected]> |
runtime constants: deal with old decrepit linkers
The runtime constants linker script depended on documented linker behavior [1]:
"If an output section’s name is the same as the input section’s name and is representable as a C identifier, then the linker will automatically PROVIDE two symbols: __start_SECNAME and __stop_SECNAME, where SECNAME is the name of the section. These indicate the start address and end address of the output section respectively"
to just automatically define the symbol names for the bounds of the runtime constant arrays.
It turns out that this isn't actually something we can rely on, with old linkers not generating these automatic symbols. It looks to have been introduced in binutils-2.29 back in 2017, and we still support building with versions all the way back to binutils-2.25 (from 2015).
And yes, Oleg actually seems to be using such ancient versions of binutils.
So instead of depending on the implicit symbols from "section names match and are representable C identifiers", just do this all manually. It's not like it causes us any extra pain, we already have to do that for all the other sections that we use that often have special characters in them.
Reported-and-tested-by: Oleg Nesterov <[email protected]>
Link: https://sourceware.org/binutils/docs/ld/Input-Section-Example.html [1]
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Linus Torvalds <[email protected]>
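A hedged sketch of the manual bounds definition (layout per the description above; upstream formatting may differ):

	/* Define the __start/__stop symbols explicitly instead of relying
	 * on the linker's automatic PROVIDE for C-identifier sections: */
	#define RUNTIME_CONST(t, x)			\
		. = ALIGN(8);				\
		__start_runtime_##t##_##x = .;		\
		*(runtime_##t##_##x)			\
		__stop_runtime_##t##_##x = .;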
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4 |
|
| #
c442db3f |
| 10-Jun-2024 |
Masahiro Yamada <[email protected]> |
kbuild: remove PROVIDE() for kallsyms symbols
This reimplements commit 951bcae6c5a0 ("kallsyms: Avoid weak references for kallsyms symbols") because I am not a big fan of PROVIDE().
As an alternative solution, this commit prepends one more kallsyms step.
  KSYMS   .tmp_vmlinux.kallsyms0.S   # added
  AS      .tmp_vmlinux.kallsyms0.o   # added
  LD      .tmp_vmlinux.btf
  BTF     .btf.vmlinux.bin.o
  LD      .tmp_vmlinux.kallsyms1
  NM      .tmp_vmlinux.kallsyms1.syms
  KSYMS   .tmp_vmlinux.kallsyms1.S
  AS      .tmp_vmlinux.kallsyms1.o
  LD      .tmp_vmlinux.kallsyms2
  NM      .tmp_vmlinux.kallsyms2.syms
  KSYMS   .tmp_vmlinux.kallsyms2.S
  AS      .tmp_vmlinux.kallsyms2.o
  LD      vmlinux
Step 0 takes /dev/null as input, and generates .tmp_vmlinux.kallsyms0.o, which has a valid kallsyms format with the empty symbol list, and can be linked to vmlinux. Since it is really small, the added compile-time cost is negligible.
Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Ard Biesheuvel <[email protected]>
Reviewed-by: Nicolas Schier <[email protected]>
|
| #
73db3abd |
| 06-Jul-2024 |
Masahiro Yamada <[email protected]> |
init/modpost: conditionally check section mismatch to __meminit*
This reverts commit eb8f689046b8 ("Use separate sections for __dev/ _cpu/__mem code/data").
Check section mismatch to __meminit* only when CONFIG_MEMORY_HOTPLUG=n.
With this change, the linker script and modpost become simpler, and we can get rid of the __ref annotations from the memory hotplug code.
[[email protected]: remove MEM_KEEP from arch/powerpc/kernel/vmlinux.lds.S]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| #
1a7b7326 |
| 12-Jul-2024 |
Christophe Leroy <[email protected]> |
vmlinux.lds.h: catch .bss..L* sections into BSS
Commit 9a427556fb8e ("vmlinux.lds.h: catch compound literals into data and BSS") added catches for .data..L* and .rodata..L* but missed .bss..L*
Since commit 5431fdd2c181 ("ptrace: Convert ptrace_attach() to use lock guards") the following appears at build:
  LD      .tmp_vmlinux.kallsyms1
powerpc64-linux-ld: warning: orphan section `.bss..Lubsan_data33' from `kernel/ptrace.o' being placed in section `.bss..Lubsan_data33'
  NM      .tmp_vmlinux.kallsyms1.syms
  KSYMS   .tmp_vmlinux.kallsyms1.S
  AS      .tmp_vmlinux.kallsyms1.S
  LD      .tmp_vmlinux.kallsyms2
powerpc64-linux-ld: warning: orphan section `.bss..Lubsan_data33' from `kernel/ptrace.o' being placed in section `.bss..Lubsan_data33'
  NM      .tmp_vmlinux.kallsyms2.syms
  KSYMS   .tmp_vmlinux.kallsyms2.S
  AS      .tmp_vmlinux.kallsyms2.S
  LD      vmlinux
powerpc64-linux-ld: warning: orphan section `.bss..Lubsan_data33' from `kernel/ptrace.o' being placed in section `.bss..Lubsan_data33'
Let's add .bss..L* to the BSS_MAIN macro to catch those sections into BSS.
Fixes: 9a427556fb8e ("vmlinux.lds.h: catch compound literals into data and BSS")
Signed-off-by: Christophe Leroy <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Arnd Bergmann <[email protected]>
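A hedged sketch of the resulting pattern list (abridged):

	/* BSS_MAIN now also catches local .bss..L* symbols, such as the
	 * UBSan data objects named in the warnings above: */
	#define BSS_MAIN .bss .bss.[0-9a-zA-Z_]* .bss..L* .bss..compoundliteral*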
|
|
Revision tags: v6.10-rc3 |
|
| #
e7829855 |
| 04-Jun-2024 |
Linus Torvalds <[email protected]> |
runtime constants: add default dummy infrastructure
This adds the initial dummy support for 'runtime constants' for when an architecture doesn't actually support an implementation of fixing up said runtime constants.
This ends up being the fallback to just using the variables as regular __ro_after_init variables, and changes the dcache d_hash() function to use this model.
Signed-off-by: Linus Torvalds <[email protected]>
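A hedged sketch of the fallback model (macro names follow the runtime-constants convention; treat them as illustrative):

	/* Without arch support, a 'runtime constant' degenerates to a
	 * plain __ro_after_init variable: the accessor is a normal load
	 * and the boot-time fixup is a no-op. */
	#define runtime_const_ptr(sym)		(sym)
	#define runtime_const_init(type, sym)	do { } while (0)

	static unsigned int d_hash_shift __ro_after_init;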
|
| #
f0e1a064 |
| 18-Jun-2024 |
Tejun Heo <[email protected]> |
sched_ext: Implement BPF extensible scheduler class
Implement a new scheduler class sched_ext (SCX), which allows scheduling policies to be implemented as BPF programs to achieve the following:
1. Ease of experimentation and exploration: Enabling rapid iteration of new scheduling policies.
2. Customization: Building application-specific schedulers which implement policies that are not applicable to general-purpose schedulers.
3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling policies in production environments.
sched_ext leverages BPF’s struct_ops feature to define a structure which exports function callbacks and flags to BPF programs that wish to implement scheduling policies. The struct_ops structure exported by sched_ext is struct sched_ext_ops, and is conceptually similar to struct sched_class. The role of sched_ext is to map the complex sched_class callbacks to the simpler and more ergonomic struct sched_ext_ops callbacks.
For more detailed discussion on the motivations and overview, please refer to the cover letter.
Later patches will also add several example schedulers and documentation.
This patch implements the minimum core framework to enable implementation of BPF schedulers. Subsequent patches will gradually add functionalities including safety guarantee mechanisms, nohz and cgroup support.
include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on top, each operation should be self-explanatory. The following are worth noting:
- Both "sched_ext" and its shorthand "scx" are used. If the identifier already has "sched" in it, "ext" is used; otherwise, "scx".
- In sched_ext_ops, only .name is mandatory. Every operation is optional and if omitted a simple but functional default behavior is provided.
- A new policy constant SCHED_EXT is added and a task can select sched_ext by invoking sched_setscheduler(2) with the new policy constant. However, if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL and the task is scheduled by CFS. When the BPF scheduler is loaded, all tasks which have the SCHED_EXT policy are switched to sched_ext.
- To bridge the workflow imbalance between the scheduler core and sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for convenience and need not be used by a scheduler that doesn't require it. SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting the next task on the CPU. The BPF scheduler can manage an arbitrary number of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
- sched_ext guarantees system integrity no matter what the BPF scheduler does. To enable this, each task's ownership is tracked through p->scx.ops_state and all tasks are put on scx_tasks list. The disable path can always recover and revert all tasks back to CFS. See p->scx.ops_state and scx_tasks.
- A task is not tied to its rq while enqueued. This decouples CPU selection from queueing and allows sharing a scheduling queue across an arbitrary subset of CPUs. This adds some complexities as a task may need to be bounced between rq's right before it starts executing. See dispatch_to_local_dsq() and move_task_to_local_dsq().
- One complication that arises from the above weak association between task and rq is that synchronizing with dequeue() gets complicated as dequeue() may happen anytime while the task is enqueued and the dispatch path might need to release the rq lock to transfer the task. Solving this requires a bit of complexity. See the logic around p->scx.sticky_cpu and p->scx.ops_qseq.
- Both enable and disable paths are a bit complicated. The enable path switches all tasks without blocking to avoid issues which can arise from partially switched states (e.g. the switching task itself being starved). The disable path can't trust the BPF scheduler at all, so it also has to guarantee forward progress without blocking. See scx_ops_enable() and scx_ops_disable_workfn().
- When sched_ext is disabled, static_branches are used to shut down the entry points from hot paths.
v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab scx_ops_enable_mutex which can lead to deadlocks in the disable path. Fixed.
- Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead to use-after-free.
- Consolidated per-cpu variable usages and other cleanups.
v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple groups can be expressed. Later CPU hotplug operations are put into their own group.
- SCX_OPS_DISABLING state is replaced with the new bypass mechanism which allows temporarily putting the system into simple FIFO scheduling mode bypassing the BPF scheduler. In addition to the shut down path, this will also be used to isolate the BPF scheduler across PM events. Enabling and disabling the bypass mode requires iterating all runnable tasks. rq->scx.runnable_list addition is moved from the later watchdog patch.
- ops.prep_enable() is replaced with ops.init_task() and ops.enable/disable() are now called whenever the task enters and leaves sched_ext instead of when the task becomes schedulable on sched_ext and stops being so. A new operation - ops.exit_task() - is called when the task stops being schedulable on sched_ext.
- scx_bpf_dispatch() can now be called from ops.select_cpu() too. This removes the need for communicating local dispatch decision made by ops.select_cpu() to ops.enqueue() via per-task storage. SCX_KF_SELECT_CPU is added to support the change.
- SCX_TASK_ENQ_LOCAL which told the BPF scheduler that scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly if it finds a suitable idle CPU. If such behavior is not desired, users can use scx_bpf_select_cpu_dfl() which returns the verdict in a bool out param.
- scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up queueing many tasks on a local DSQ, which made tasks execute in order while other CPUs stayed idle and made some hackbench numbers really bad. Fixed.
- The current state of sched_ext can now be monitored through files under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is to enable monitoring on kernels which don't enable debugfs.
- sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may be NULL and a BPF scheduler which derefs the pointer without checking could crash the kernel. Tell BPF. This is currently a bit ugly. A better way to annotate this is expected in the future.
- scx_exit_info updated to carry pointers to message buffers instead of embedding them directly. This decouples buffer sizes from API so that they can be changed without breaking compatibility.
- exit_code added to scx_exit_info. This is used to indicate different exit conditions on non-error exits and will be used to handle e.g. CPU hotplugs.
- The patch "sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext" is folded in and the interface is changed so that partial switching is indicated with a new ops flag %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry and in turn SCX_KF_INIT. ops.init() is now called with SCX_KF_SLEEPABLE.
- Code reorganized so that only the parts necessary to integrate with the rest of the kernel are in the header files.
- Changes to reflect the BPF and other kernel changes including the addition of bpf_sched_ext_ops.cfi_stubs.
v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t instead of atomic64_t and scx_dsp_buf_ent.qseq which uses load_acquire/store_release is now unsigned long instead of u64.
- Fix the bug where bpf_scx_btf_struct_access() was allowing write access to arbitrary fields.
- Distinguish kfuncs which can be called from any sched_ext ops and from anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from sched_ext ops.
- Rename "type" to "kind" in scx_exit_info to make it easier to use on languages in which "type" is a reserved keyword.
- Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup"), PF_IDLE is not set on idle tasks which haven't been online yet which made scx_task_iter_next_filtered() include those idle tasks in iterations leading to oopses. Update scx_task_iter_next_filtered() to directly test p->sched_class against idle_sched_class instead of using is_idle_task() which tests PF_IDLE.
- Other updates to match upstream changes such as adding const to set_cpumask() param and renaming check_preempt_curr() to wakeup_preempt().
v4: - SCHED_CHANGE_BLOCK replaced with the previous sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is because upstream is adopting a different generic cleanup mechanism. Once that lands, the code will be adapted accordingly.
- task_on_scx() used to test whether a task should be switched into SCX, which is confusing. Renamed to task_should_scx(). task_on_scx() now tests whether a task is currently on SCX.
- scx_has_idle_cpus is barely used anymore and replaced with direct check on the idle cpumask.
- SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer fully idle cores.
- ops.enable() now sees up-to-date p->scx.weight value.
- ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF schedulers expecting ->select_cpu() call.
- Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest of the scheduler.
v3: - ops.set_weight() added to allow BPF schedulers to track weight changes without polling p->scx.weight.
- move_task_to_local_dsq() was losing SCX-specific enq_flags when enqueueing the task on the target dsq because it goes through activate_task() which loses the upper 32bit of the flags. Carry the flags through rq->scx.extra_enq_flags.
- scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running() and scx_bpf_task_cpu() now use the new KF_RCU instead of KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.
- The kfunc helper access control mechanism implemented through sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always used when invoking scx_ops operations.
v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is called from put_prev_task_scx() and pick_next_task_scx() as necessary. To determine whether balance_scx() should be called from put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the comment in put_prev_task_scx() for details.
- sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced with SCHED_CHANGE_BLOCK().
- Unused all_dsqs list removed. This was a left-over from previous iterations.
- p->scx.kf_mask is added to track and enforce which kfunc helpers are allowed. Also, init/exit sequences are updated to make some kfuncs always safe to call regardless of the current BPF scheduler state. Combined, this should make all the kfuncs safe.
- BPF now supports sleepable struct_ops operations. Hacky workaround removed and operations and kfunc helpers are tagged appropriately.
- BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask() and friends are added so that BPF schedulers can use the idle masks with the generic helpers. This replaces the hacky kfunc helpers added by a separate patch in V1.
- CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is enabled. This restriction will be removed by a later patch which adds core-sched support.
- Add MAINTAINERS entries and other misc changes.
Signed-off-by: Tejun Heo <[email protected]>
Co-authored-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Andrea Righi <[email protected]>
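A hedged sketch of the BPF side at its most minimal (only .name is mandatory, per the notes above; the SEC() annotations follow the usual struct_ops conventions):

	/* minimal_sched.bpf.c: every omitted callback falls back to the
	 * simple default behavior provided by sched_ext. */
	char _license[] SEC("license") = "GPL";

	SEC(".struct_ops.link")
	struct sched_ext_ops minimal_ops = {
		.name = "minimal",
	};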
|
|
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5 |
|
| #
951bcae6 |
| 15-Apr-2024 |
Ard Biesheuvel <[email protected]> |
kallsyms: Avoid weak references for kallsyms symbols
kallsyms is a directory of all the symbols in the vmlinux binary, and so creating it is somewhat of a chicken-and-egg problem, as its non-zero size affects the layout of the binary, and therefore the values of the symbols.
For this reason, the kernel is linked more than once, and the first pass does not include any kallsyms data at all. For the linker to accept this, the symbol declarations describing the kallsyms metadata are emitted as having weak linkage, so they can remain unsatisfied. During the subsequent passes, the weak references are satisfied by the kallsyms metadata that was constructed based on information gathered from the preceding passes.
Weak references lead to somewhat worse codegen, because taking their address may need to produce NULL (if the reference was unsatisfied), and this is not usually supported by RIP or PC relative symbol references.
Given that these references are ultimately always satisfied in the final link, let's drop the weak annotation, and instead, provide fallback definitions in the linker script that are only emitted if an unsatisfied reference exists.
While at it, drop the FRV specific annotation that these symbols reside in .rodata - FRV is long gone.
Tested-by: Nick Desaulniers <[email protected]> # Boot
Reviewed-by: Nick Desaulniers <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Link: https://lkml.kernel.org/r/20230504174320.3930345-1-ardb%40kernel.org
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
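A hedged linker-script sketch of the fallback pattern described above (symbol names illustrative; the real list covers all of the kallsyms metadata):

	/* Emitted only when the reference is still unsatisfied, i.e. on
	 * the first link pass, which carries no kallsyms data: */
	PROVIDE(kallsyms_addresses = .);
	PROVIDE(kallsyms_num_syms = .);
	PROVIDE(kallsyms_names = .);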
|
|
Revision tags: v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1 |
|
| #
22d407b1 |
| 21-Mar-2024 |
Suren Baghdasaryan <[email protected]> |
lib: add allocation tagging support for memory allocation profiling
Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily instrument memory allocators. It registers an "alloc_tags" codetag type with /proc/allocinfo interface to output allocation tag information when the feature is enabled.
CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory allocation profiling instrumentation.
Memory allocation profiling can be enabled or disabled at runtime using /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n. CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation profiling by default.
[[email protected]: Documentation/filesystems/proc.rst: fix allocinfo title]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: explicitly include irqflags.h in alloc_tag.h]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Suren Baghdasaryan <[email protected]>
Co-developed-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Klara Modin <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Alex Gaynor <[email protected]>
Cc: Alice Ryhl <[email protected]>
Cc: Andreas Hindborg <[email protected]>
Cc: Benno Lossin <[email protected]>
Cc: "Björn Roy Baron" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Gary Guo <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wedson Almeida Filho <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
| #
8f69cba0 |
| 22-Mar-2024 |
Xin Li (Intel) <[email protected]> |
x86: Rename __{start,end}_init_task to __{start,end}_init_stack
The stack of a task has been separated from the memory of a task_struct structure for a long time on x86; as a result, __{start,end}_init_task no longer mark the start and end of the init_task structure, but only its stack.
Rename __{start,end}_init_task to __{start,end}_init_stack.
Note other architectures are not affected because __{start,end}_init_task are used on x86 only.
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|