|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
84b25926 |
| 26-Feb-2025 |
Dave Jiang <[email protected]> |
acpi: numa: Add support to enumerate and store extended linear address mode
Store the address mode as part of the cache attributes. Export the mode attribute to sysfs like all other cache attributes.
Link: https://lore.kernel.org/linux-cxl/[email protected]/ Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Li Ming <[email protected]> Reviewed-by: Alison Schofield <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Dave Jiang <[email protected]>
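For illustration, a minimal sketch of how an extra cache attribute can be exported through the node cache sysfs machinery; the field/attribute name address_mode, the output format, and the node_cache_info/to_cache_info helpers are assumptions drawn from the surrounding driver, not quoted from the patch:

  #include <linux/device.h>
  #include <linux/node.h>
  #include <linux/sysfs.h>

  static ssize_t address_mode_show(struct device *dev,
                                   struct device_attribute *attr, char *buf)
  {
          struct node_cache_info *info = to_cache_info(dev);

          /* emit the enumerated address mode like the other cache attributes */
          return sysfs_emit(buf, "%#x\n", info->cache_attrs.address_mode);
  }
  static DEVICE_ATTR_RO(address_mode);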
|
|
Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12 |
|
| #
5943c0dc |
| 15-Nov-2024 |
Thomas Weißschuh <[email protected]> |
driver core: Constify bin_attribute definitions
Mark all 'struct bin_attribute' instances as const to protect against accidental and malicious modifications.
Signed-off-by: Thomas Weißschuh <[email protected]> Link: https://lore.kernel.org/r/20241115-b4-sysfs-const-bin_attr-group-v1-2-2c9bb12dfc48@weissschuh.net Signed-off-by: Greg Kroah-Hartman <[email protected]>
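A small sketch of the resulting pattern, assuming a read-only attribute named "example" (name and size are illustrative); with the series applied, the definition can be placed in read-only memory:

  #include <linux/mm.h>
  #include <linux/sysfs.h>

  /* const definition: the structure can no longer be modified at runtime */
  static const struct bin_attribute example_bin_attr = {
          .attr = { .name = "example", .mode = 0444 },
          .size = PAGE_SIZE,
          /* .read/.write callbacks would be wired up as usual */
  };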
|
|
Revision tags: v6.12-rc7, v6.12-rc6 |
|
| #
562e932a |
| 03-Nov-2024 |
Thomas Weißschuh <[email protected]> |
driver core: Constify attribute arguments of binary attributes
As preparation for the constification of struct bin_attribute, constify the arguments of the read and write callbacks.
Signed-off-by: Thomas Weißschuh <[email protected]> Acked-by: Krzysztof Wilczyński <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
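A sketch of the callback shape this series moves toward, for a hypothetical read-only attribute; the key point is that the bin_attribute pointer is received as const and cannot be modified by the callback:

  #include <linux/fs.h>
  #include <linux/kobject.h>
  #include <linux/sysfs.h>

  static ssize_t example_read(struct file *file, struct kobject *kobj,
                              const struct bin_attribute *attr,
                              char *buf, loff_t off, size_t count)
  {
          /* copy at most 'count' bytes starting at 'off' into 'buf' */
          return 0;
  }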
|
|
Revision tags: v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8 |
|
| #
debdce20 |
| 08-Mar-2024 |
Dave Jiang <[email protected]> |
cxl/region: Deal with numa nodes not enumerated by SRAT
For NUMA nodes that are not created by SRAT, no memory_target is allocated and they are not managed by the HMAT_REPORTING code. Therefore the hmat_callback() memory hotplug notifier will exit early on those NUMA nodes. The CXL memory hotplug notifier will need to call node_set_perf_attrs() directly in order to set up the access sysfs attributes.
In acpi_numa_init(), the last proximity domain (pxm) id created by SRAT is stored. Add a helper function acpi_node_backed_by_real_pxm() in order to check if a NUMA node id is defined by SRAT or created by CFMWS.
node_set_perf_attrs() symbol is exported to allow update of perf attribs for a node. The sysfs path of /sys/devices/system/node/nodeX/access0/initiators/* is created by node_set_perf_attrs() for the various attributes where nodeX is matched to the NUMA node of the CXL region.
Cc: Rafael J. Wysocki <[email protected]> Reviewed-by: Alison Schofield <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Tested-by: Jonathan Cameron <[email protected]> Signed-off-by: Dave Jiang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dan Williams <[email protected]>
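A hedged sketch of the idea (not the actual patch): publish access attributes directly for CFMWS-created nodes that SRAT does not know about. The helpers acpi_node_backed_by_real_pxm() and node_set_perf_attrs() come from the commit message; the example_ wrapper, the coordinate data, and the use of the ACCESS_COORDINATE_LOCAL class (backing the access0 directory) are assumptions:

  #include <linux/acpi.h>
  #include <linux/node.h>

  static void example_publish_region_perf(int nid, struct access_coordinate *coord)
  {
          /* SRAT-defined nodes are already handled by the HMAT_REPORTING code */
          if (acpi_node_backed_by_real_pxm(nid))
                  return;

          /* creates the nodeX/access0/initiators/ attributes for this node */
          node_set_perf_attrs(nid, coord, ACCESS_COORDINATE_LOCAL);
  }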
|
| #
11270e52 |
| 08-Mar-2024 |
Dave Jiang <[email protected]> |
base/node / ACPI: Enumerate node access class for 'struct access_coordinate'
Both generic node and HMAT handling code have been using magic numbers to indicate access classes for 'struct access_coordinate'. Introduce enums to enumerate the access0 and access1 classes shared by the two subsystems. Update the function parameters and callers as appropriate to utilize the new enum.
Access0 is named ACCESS_COORDINATE_LOCAL in order to indicate that the access class is for 'struct access_coordinate' between a target node and the nearest initiator node.
Access1 is named ACCESS_COORDINATE_CPU in order to indicate that the access class is for 'struct access_coordinate' between a target node and the nearest CPU node.
Cc: Greg Kroah-Hartman <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Tested-by: Jonathan Cameron <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Dave Jiang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dan Williams <[email protected]>
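The enum described above, using the names given in the commit message (exact header placement and the presence of a _MAX terminator are assumptions):

  enum access_coordinate_class {
          ACCESS_COORDINATE_LOCAL,        /* access0: target <-> nearest initiator */
          ACCESS_COORDINATE_CPU,          /* access1: target <-> nearest CPU */
          ACCESS_COORDINATE_MAX,
  };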
|
|
Revision tags: v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7 |
|
| #
6a954e94 |
| 21-Dec-2023 |
Dave Jiang <[email protected]> |
base/node / acpi: Change 'node_hmem_attrs' to 'access_coordinates'
Dan Williams suggested changing the struct 'node_hmem_attrs' to 'access_coordinates' [1]. The struct is a container of r/w-latency and r/w-bandwidth numbers. Moving forward, this container will also be used by CXL to store the performance characteristics of each link hop in the PCIE/CXL topology. So, where node_hmem_attrs is just the access parameters of a memory-node, access_coordinates applies more broadly to hardware topology characteristics. The observation is that it seemed like an exercise in having the application identify "where" it falls on a spectrum of bandwidth and latency needs. For the tuple of read/write-latency and read/write-bandwidth, "coordinates" is not a perfect fit. Sometimes it is just conveying values in isolation and not a "location" relative to other performance points, but in the end this data is used to identify the performance operation point of a given memory-node. [2]
Link: http://lore.kernel.org/r/[email protected]/ Link: https://lore.kernel.org/linux-cxl/[email protected]/ Suggested-by: Dan Williams <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Signed-off-by: Dave Jiang <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/170319615734.2212653.15319394025985499185.stgit@djiang5-mobl3 Signed-off-by: Dan Williams <[email protected]>
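For reference, a sketch of the renamed container; the four fields carry the read/write bandwidth and read/write latency that 'node_hmem_attrs' held (field order and units are my best understanding, not quoted from the patch):

  struct access_coordinate {
          unsigned int read_bandwidth;    /* MB/s */
          unsigned int write_bandwidth;   /* MB/s */
          unsigned int read_latency;      /* nanoseconds */
          unsigned int write_latency;     /* nanoseconds */
  };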
|
| #
580fc9c7 |
| 19-Dec-2023 |
Greg Kroah-Hartman <[email protected]> |
driver core: mark remaining local bus_type variables as const
Now that the driver core can properly handle constant struct bus_type, change the local driver core bus_type variables to be a constant structure as well, placing them into read-only memory which can not be modified at runtime.
Cc: Ira Weiny <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Kevin Hilman <[email protected]> Cc: Ulf Hansson <[email protected]> Cc: Len Brown <[email protected]> Acked-by: William Breathitt Gray <[email protected]> Acked-by: Dave Ertman <[email protected]> Link: https://lore.kernel.org/r/2023121908-paver-follow-cc21@gregkh Signed-off-by: Greg Kroah-Hartman <[email protected]>
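As an example of the pattern (assuming the node subsystem was among the converted variables), the bus_type definition simply gains const and can live in read-only memory:

  #include <linux/device/bus.h>

  static const struct bus_type node_subsys = {
          .name = "node",
          .dev_name = "node",
  };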
|
|
Revision tags: v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1 |
|
| #
48b5928e |
| 30-Oct-2023 |
Gregory Price <[email protected]> |
base/node.c: initialize the accessor list before registering
The current code registers the node as available in the node array before initializing the accessor list. This makes it so that anything which might access the accessor list as a result of allocations will cause an undefined memory access.
In one example, an extension to access hmat data during interleave caused this undefined access as a result of a bulk allocation that occurs during node initialization but before the accessor list is initialized.
Initialize the accessor list before making the node generally available to the global system.
Fixes: 08d9dbe72b1f ("node: Link memory nodes to their compute nodes") Signed-off-by: Gregory Price <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
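A hedged sketch of the ordering fix (the example_ function name is illustrative; the real change is in the node registration path of drivers/base/node.c): the access_list must be valid before the node becomes reachable, because registration can trigger allocations that walk it:

  #include <linux/node.h>
  #include <linux/slab.h>

  static int example_register_one_node(int nid)
  {
          struct node *node = kzalloc(sizeof(*node), GFP_KERNEL);

          if (!node)
                  return -ENOMEM;

          /* initialize the accessor list first ... */
          INIT_LIST_HEAD(&node->access_list);
          /* ... then publish the node and register it with the device core */
          node_devices[nid] = node;
          return register_node(node, nid);
  }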
|
|
Revision tags: v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7 |
|
| #
4b5b7850 |
| 14-Aug-2023 |
Hugh Dickins <[email protected]> |
mm,thp: fix nodeN/meminfo output alignment
Add one more space to FileHugePages and FilePmdMapped, so the output is aligned with other rows in /sys/devices/system/node/nodeN/meminfo.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.5-rc6 |
|
| #
7f0718ed |
| 10-Aug-2023 |
GUO Zihua <[email protected]> |
base/node: Remove duplicated include
Remove duplicated include of linux/hugetlb.h. Resolves checkincludes message.
Signed-off-by: GUO Zihua <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Revision tags: v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6 |
|
| #
dcdfdd40 |
| 06-Jun-2023 |
Kirill A. Shutemov <[email protected]> |
mm: Add support for unaccepted memory
UEFI Specification version 2.9 introduces the concept of memory acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, require memory to be accepted before it can be used by the guest. Accepting happens via a protocol specific to the Virtual Machine platform.
There are several ways the kernel can deal with unaccepted memory:
1. Accept all the memory during boot. It is easy to implement and it doesn't have runtime cost once the system is booted. The downside is very long boot time.
Acceptance can be parallelized across multiple CPUs to keep it manageable (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate memory bandwidth and does not scale beyond that point.
2. Accept a block of memory on the first use. It requires more infrastructure and changes in page allocator to make it work, but it provides good boot time.
On-demand memory acceptance means latency spikes every time the kernel steps onto a new memory block. The spikes will go away once the workload data set size stabilizes or all memory gets accepted.
3. Accept all memory in the background. Introduce a thread (or multiple threads) that accepts memory proactively. It minimizes the time the system experiences latency spikes on memory allocation while keeping boot time low.
This approach cannot function on its own. It is an extension of #2: background memory acceptance requires a functional scheduler, but the page allocator may need to tap into unaccepted memory before that.
The downside of the approach is that these threads also steal CPU cycles and memory bandwidth from the user's workload and may hurt user experience.
Implement #1 and #2 for now. #2 is the default. Some workloads may want to use #1 with accept_memory=eager in kernel command line. #3 can be implemented later based on user's demands.
Support of unaccepted memory requires a few changes in core-mm code:
- memblock accepts memory on allocation. It serves early boot memory allocations and doesn't limit them to pre-accepted pool of memory.
- page allocator accepts memory on the first allocation of the page. When kernel runs out of accepted memory, it accepts memory until the high watermark is reached. It helps to minimize fragmentation.
EFI code will provide two helpers if the platform supports unaccepted memory:
- accept_memory() makes a range of physical addresses accepted.
- range_contains_unaccepted_memory() checks whether anything within the range of physical addresses requires acceptance.
Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Acked-by: Mike Rapoport <[email protected]> # memblock Link: https://lore.kernel.org/r/[email protected]
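For reference, the two helper prototypes described above plus a tiny, purely illustrative caller (the example_ function is hypothetical; at this point in the series the helpers take a start/end physical range):

  #include <linux/mm.h>
  #include <linux/types.h>

  void accept_memory(phys_addr_t start, phys_addr_t end);
  bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);

  /* illustrative only: accept one page worth of memory before handing it out */
  static void example_accept_page(phys_addr_t paddr)
  {
          if (range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE))
                  accept_memory(paddr, paddr + PAGE_SIZE);
  }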
|
|
Revision tags: v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1 |
|
| #
7810f4dc |
| 05-May-2023 |
Dave Jiang <[email protected]> |
base/node: Use 'property' to identify an access parameter
Usage of 'attr' and 'name' in the context of a sysfs attribute definition are confusing because those read as being related to:
struct attribute .name
Rename 'name' to 'property' in preparation for renaming 'struct node_hmem_attr' to a more generic name that can be used in more contexts ('struct access_coordinate'), and not be confused with 'struct attribute'.
Suggested-by: Dan Williams <[email protected]> Reviewed-by: Dan Williams <[email protected]> Signed-off-by: Dave Jiang <[email protected]> Link: https://lore.kernel.org/r/168332213518.2189163.18377767521423011290.stgit@djiang5-mobl3 Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Revision tags: v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5 |
|
| #
44b8f8bf |
| 20-Jan-2023 |
Jiaqi Yan <[email protected]> |
mm: memory-failure: add memory failure stats to sysfs
Patch series "Introduce per NUMA node memory error statistics", v2.
Background ==========
In the RFC for Kernel Support of Memory Error Detection [1], one advantage of software-based scanning over hardware patrol scrubber is the ability to make statistics visible to system administrators. The statistics include 2 categories:
* Memory error statistics, for example, how many memory errors are encountered and how many of them are recovered by the kernel. Note these memory errors are non-fatal to the kernel: during machine check exception (MCE) handling, the kernel already classified the MCE's severity as not requiring a panic (but either action required or action optional).
* Scanner statistics, for example how many times the scanner has fully scanned a NUMA node and how many errors are first detected by the scanner.
The memory error statistics are useful to userspace and actually not specific to scanner detected memory errors, and are the focus of this patchset.
Motivation ==========
Memory error stats are important to userspace but insufficient in kernel today. Datacenter administrators can better monitor a machine's memory health with the visible stats. For example, while memory errors are inevitable on servers with 10+ TB memory, starting server maintenance when there are only 1~2 recovered memory errors could be overreacting; in cloud production environment maintenance usually means live migrate all the workload running on the server and this usually causes nontrivial disruption to the customer. Providing insight into the scope of memory errors on a system helps to determine the appropriate follow-up action. In addition, the kernel's existing memory error stats need to be standardized so that userspace can reliably count on their usefulness.
Today the kernel provides the following memory error info to userspace, but it is not sufficient or has disadvantages: * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, though not per-NUMA-node stats * ras:memory_failure_event: only available after being explicitly enabled * /dev/mcelog provides much useful info about the MCEs, but doesn't capture how memory_failure recovered memory MCEs * kernel logs: userspace needs to process log text
Exposing memory error stats is also a good start for the in-kernel memory error detector. Today the data source of memory error stats are either direct memory error consumption, or hardware patrol scrubber detection (either signaled as UCNA or SRAO). Once in-kernel memory scanner is implemented, it will be the main source as it is usually configured to scan memory DIMMs constantly and faster than hardware patrol scrubber.
How Implemented ===============
As Naoya pointed out [2], exposing memory error statistics to userspace is useful independent of software or hardware scanner. Therefore we implement the memory error statistics independent of the in-kernel memory error detector. It exposes the following per NUMA node memory error counters:
/sys/devices/system/node/node${X}/memory_failure/total /sys/devices/system/node/node${X}/memory_failure/recovered /sys/devices/system/node/node${X}/memory_failure/ignored /sys/devices/system/node/node${X}/memory_failure/failed /sys/devices/system/node/node${X}/memory_failure/delayed
These counters describe how many raw pages are poisoned and after the attempted recoveries by the kernel, their resolutions: how many are recovered, ignored, failed, or delayed respectively. This approach can be easier to extend for future use cases than /proc/meminfo, trace event, and log. The following math holds for the statistics:
* total = recovered + ignored + failed + delayed
These memory error stats are reset during machine boot.
The 1st commit introduces these sysfs entries. The 2nd commit populates memory error stats every time memory_failure attempts memory error recovery. The 3rd commit adds documentations for introduced stats.
[1] https://lore.kernel.org/linux-mm/[email protected]/T/#mc22959244f5388891c523882e61163c6e4d703af [2] https://lore.kernel.org/linux-mm/[email protected]/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6
This patch (of 3):
Today the kernel provides the following memory error info to userspace, but each has its own disadvantage:
* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, not per NUMA node stats though
* ras:memory_failure_event: only available after explicitly enabled
* /dev/mcelog provides many useful info about the MCEs, but doesn't capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text
Exposes per NUMA node memory error stats as sysfs entries:
/sys/devices/system/node/node${X}/memory_failure/total /sys/devices/system/node/node${X}/memory_failure/recovered /sys/devices/system/node/node${X}/memory_failure/ignored /sys/devices/system/node/node${X}/memory_failure/failed /sys/devices/system/node/node${X}/memory_failure/delayed
These counters describe how many raw pages are poisoned and after the attempted recoveries by the kernel, their resolutions: how many are recovered, ignored, failed, or delayed respectively. The following math holds for the statistics:
* total = recovered + ignored + failed + delayed
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jiaqi Yan <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
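A sketch of the per-node counter container backing the sysfs files above (field names mirror the file names; exactly where it hangs off the node structures is left out here). The documented invariant total = recovered + ignored + failed + delayed holds across the fields:

  struct memory_failure_stats {
          unsigned long total;            /* sum of the four outcomes below */
          unsigned long ignored;
          unsigned long failed;
          unsigned long delayed;
          unsigned long recovered;
  };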
|
|
Revision tags: v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6 |
|
| #
a4a00b45 |
| 14-Sep-2022 |
Muchun Song <[email protected]> |
mm: hugetlb: eliminate memory-less nodes handling
The memory-notify-based approach aims to handle memory-less nodes; however, it just adds code complexity, as pointed out by David in thread [1]. The handling of memory-less nodes was introduced by commit 4faf8d950ec4 ("hugetlb: handle memory hot-plug events"). From its commit message, we cannot find any necessity for handling this case. So, we can simply register/unregister sysfs entries in register_node/unregister_node to simplify the code.
BTW, the hotplug callback was added because in hugetlb_register_all_nodes() we register sysfs nodes only for N_MEMORY nodes; see commit 9b5e5d0fdc91, which said it was a preparation for handling memory-less nodes via memory hotplug. Since we want to remove the memory hotplug handling, make sure we only register per-node sysfs for online (N_ONLINE) nodes in hugetlb_register_all_nodes().
https://lore.kernel.org/linux-mm/[email protected]/ [1] Link: https://lkml.kernel.org/r/[email protected] Suggested-by: David Hildenbrand <[email protected]> Signed-off-by: Muchun Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
| #
b958d4d0 |
| 14-Sep-2022 |
Muchun Song <[email protected]> |
mm: hugetlb: simplify per-node sysfs creation and removal
Patch series "simplify handling of per-node sysfs creation and removal", v4.
This patch (of 2):
The following commit offloaded per-node sysfs creation and removal to a kworker and did not say why it was needed. It also said "I don't know that this is absolutely required", so it seems the author was not sure either. Since it only complicates the code, this patch reverts the changes to simplify the code.
39da08cb074c ("hugetlb: offload per node attribute registrations")
We could use memory hotplug notifier to do per-node sysfs creation and removal instead of inserting those operations to node registration and unregistration. Then, it can reduce the code coupling between node.c and hugetlb.c. Also, it can simplify the code.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Mike Kravetz <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
|
|
Revision tags: v6.0-rc5, v6.0-rc4, v6.0-rc3 |
|
| #
ebc97a52 |
| 23-Aug-2022 |
Yosry Ahmed <[email protected]> |
mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.
We keep track of several kernel memory stats (total kernel memory, page tables, stack, vmalloc, etc) on multiple levels (global, per-node, per-memcg, etc). These stats give insights to users to how much memory is used by the kernel and for what purposes.
Currently, memory used by the KVM mmu is not accounted in any of those kernel memory stats. This patch series accounts the memory pages used by KVM for page tables in those stats in a new NR_SECONDARY_PAGETABLE stat. This stat can later be extended to account for other types of secondary page tables (e.g. iommu page tables).
KVM has a decent number of large allocations that aren't for page tables, but for most of them, the number/size of those allocations scales linearly with either the number of vCPUs or the amount of memory assigned to the VM. KVM's secondary page table allocations do not scale linearly, especially when nested virtualization is in use.
From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's per-VM pages_{4k,2m,1g} stats unless the guest is doing something bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is forced to allocate a large number of page tables even though the guest isn't accessing that much memory). However, someone would need to either understand how KVM works to make that connection, or know (or be told) to go look at KVM's stats if they're running VMs to better decipher the stats.
Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE is informative. For example, when backing a VM with THP vs. HugeTLB, NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order of magnitude higher with THP. So having this stat will at the very least prove to be useful for understanding tradeoffs between VM backing types, and likely even steer folks towards potential optimizations.
The original discussion with more details about the rationale: https://lore.kernel.org/all/[email protected]
This stat will be used by subsequent patches to count KVM mmu memory usage.
Signed-off-by: Yosry Ahmed <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
|
|
Revision tags: v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7 |
|
| #
7ee951ac |
| 15-Jul-2022 |
Phil Auld <[email protected]> |
drivers/base: fix userspace break from using bin_attributes for cpumap and cpulist
Using bin_attributes with a 0 size causes fstat and friends to return that 0 size. This breaks userspace code that retrieves the size before reading the file. Rather than reverting 75bd50fa841 ("drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI") let's put in a size value at compile time.
For cpulist the maximum size is on the order of NR_CPUS * (ceil(log10(NR_CPUS)) + 1)/2
which for 8192 is 20480 (8192 * 5)/2. In order to get near that you'd need a system with every other CPU on one node. For example: (0,2,4,8, ... ). To simplify the math and support larger NR_CPUS in the future we are using (NR_CPUS * 7)/2. We also set it to a min of PAGE_SIZE to retain the older behavior for smaller NR_CPUS.
For the cpumap file, the size works out to be NR_CPUS/4 + NR_CPUS/32 - 1 (or NR_CPUS * 9/32 - 1), including the ","s.
Add a set of macros for these values to cpumask.h so they can be used in multiple places. Apply these to the handful of such files in drivers/base/topology.c as well as node.c.
As an example, on an 80 cpu 4-node system (NR_CPUS == 8192):
before:
-r--r--r--. 1 root root 0 Jul 12 14:08 system/node/node0/cpulist -r--r--r--. 1 root root 0 Jul 11 17:25 system/node/node0/cpumap
after:
-r--r--r--. 1 root root 28672 Jul 13 11:32 system/node/node0/cpulist -r--r--r--. 1 root root 4096 Jul 13 11:31 system/node/node0/cpumap
CONFIG_NR_CPUS = 16384 -r--r--r--. 1 root root 57344 Jul 13 14:03 system/node/node0/cpulist -r--r--r--. 1 root root 4607 Jul 13 14:02 system/node/node0/cpumap
The actual number of cpus doesn't matter for the reported size since they are based on NR_CPUS.
Fixes: 75bd50fa841d ("drivers/base/node.c: use bin_attribute to break the size limitation of cpumap ABI") Fixes: bb9ec13d156e ("topology: use bin_attribute to break the size limitation of cpumap ABI") Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Yury Norov <[email protected]> Cc: [email protected] Acked-by: Yury Norov <[email protected]> (for include/linux/cpumask.h) Signed-off-by: Phil Auld <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
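A sketch of the compile-time sizing, mirroring the arithmetic in the commit message (the EXAMPLE_ names are illustrative; the real macros live in include/linux/cpumask.h and may differ in exact form):

  #include <linux/cpumask.h>
  #include <linux/mm.h>

  /* sizes depend only on NR_CPUS, never on how many CPUs are actually present */
  #define EXAMPLE_CPULIST_FILE_MAX_BYTES \
          (((NR_CPUS * 7) / 2 > PAGE_SIZE) ? (NR_CPUS * 7) / 2 : PAGE_SIZE)
  #define EXAMPLE_CPUMAP_FILE_MAX_BYTES \
          (((NR_CPUS * 9) / 32 - 1 > PAGE_SIZE) ? (NR_CPUS * 9) / 32 - 1 : PAGE_SIZE)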
|
|
Revision tags: v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5 |
|
| #
da63dc84 |
| 29-Apr-2022 |
Miaohe Lin <[email protected]> |
drivers/base/node.c: fix compaction sysfs file leak
The compaction sysfs file is created via compaction_register_node() in register_node(), but we forgot to remove it in unregister_node(). Thus the compaction sysfs file is leaked. Use compaction_unregister_node() to fix this issue.
Link: https://lkml.kernel.org/r/[email protected] Fixes: ed4a6d7f0676 ("mm: compaction: add /sys trigger for per-node memory compaction") Signed-off-by: Miaohe Lin <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
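A hedged sketch of the shape of the fix (illustrative function name; other per-node teardown is elided): unregistration now mirrors the compaction_register_node() call made at registration time:

  #include <linux/compaction.h>
  #include <linux/node.h>

  static void example_unregister_node(struct node *node)
  {
          compaction_unregister_node(node);       /* the previously missing call */
          /* ... other per-node sysfs teardown elided ... */
          device_unregister(&node->dev);
  }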
|
|
Revision tags: v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1 |
|
| #
a72b6dff |
| 01-Apr-2022 |
Miaohe Lin <[email protected]> |
drivers/base/node.c: fix compaction sysfs file leak
The compaction sysfs file is created via compaction_register_node() in register_node(), but we forgot to remove it in unregister_node(). Thus the compaction sysfs file is leaked. Use compaction_unregister_node() to fix this issue.
Fixes: ed4a6d7f0676 ("mm: compaction: add /sys trigger for per-node memory compaction") Signed-off-by: Miaohe Lin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
| #
395f6081 |
| 22-Mar-2022 |
David Hildenbrand <[email protected]> |
drivers/base/memory: determine and store zone for single-zone memory blocks
test_pages_in_a_zone() is just another nasty PFN walker that can easily stumble over ZONE_DEVICE memory ranges falling into the same memory block as ordinary system RAM: the memmap of parts of these ranges might possibly be uninitialized. In fact, we observed (on an older kernel) with UBSAN:
UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
index 7 is out of range for type 'zone [5]'
CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
Call Trace:
 dump_stack+0x9a/0xf0
 ubsan_epilogue+0x9/0x7a
 __ubsan_handle_out_of_bounds+0x13a/0x181
 test_pages_in_a_zone+0x3c4/0x500
 show_valid_zones+0x1fa/0x380
 dev_attr_show+0x43/0xb0
 sysfs_kf_seq_show+0x1c5/0x440
 seq_read+0x49d/0x1190
 vfs_read+0xff/0x300
 ksys_read+0xb8/0x170
 do_syscall_64+0xa5/0x4b0
 entry_SYSCALL_64_after_hwframe+0x6a/0xdf
RIP: 0033:0x7f01f4439b52
We seem to stumble over a memmap that contains a garbage zone id. While we could try inserting pfn_to_online_page() calls, it will just make memory offlining slower, because we use test_pages_in_a_zone() to make sure we're offlining pages that all belong to the same zone.
Let's just get rid of this PFN walker and determine the single zone of a memory block -- if any -- for early memory blocks during boot. For memory onlining, we know the single zone already. Let's avoid any additional memmap scanning and just rely on the zone information available during boot.
For memory hot(un)plug, we only really care about memory blocks that: * span a single zone (and, thereby, a single node) * are completely System RAM (IOW, no holes, no ZONE_DEVICE) If one of these conditions is not met, we reject memory offlining. Hotplugged memory blocks (starting out offline), always meet both conditions.
There are three scenarios to handle:
(1) Memory hot(un)plug
A memory block with zone == NULL cannot be offlined, corresponding to our previous test_pages_in_a_zone() check.
After successful memory onlining/offlining, we simply set the zone accordingly. * Memory onlining: set the zone we just used for onlining * Memory offlining: set zone = NULL
So a hotplugged memory block starts with zone = NULL. Once memory onlining is done, we set the proper zone.
(2) Boot memory with !CONFIG_NUMA
We know that there is just a single pgdat, so we simply scan all zones of that pgdat for an intersection with our memory block PFN range when adding the memory block. If more than one zone intersects (e.g., DMA and DMA32 on x86 for the first memory block) we set zone = NULL and consequently mimic what test_pages_in_a_zone() used to do.
(3) Boot memory with CONFIG_NUMA
At the point in time we create the memory block devices during boot, we don't know yet which nodes *actually* span a memory block. While we could scan all zones of all nodes for intersections, overlapping nodes complicate the situation and scanning all nodes is possibly expensive. But that problem has already been solved by the code that sets the node of a memory block and creates the link in the sysfs -- do_register_memory_block_under_node().
So, we hook into the code that sets the node id for a memory block. If we already have a different node id set for the memory block, we know that multiple nodes *actually* have PFNs falling into our memory block: we set zone = NULL and consequently mimic what test_pages_in_a_zone() used to do. If there is no node id set, we do the same as (2) for the given node.
Note that the call order in driver_init() is: -> memory_dev_init(): create memory block devices -> node_dev_init(): link memory block devices to the node and set the node id
So in summary, we detect if there is a single zone responsible for this memory block and we consequently store the zone in that case in the memory block, updating it during memory onlining/offlining.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reported-by: Rafael Parra <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rafael Parra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
| #
cc651559 |
| 22-Mar-2022 |
David Hildenbrand <[email protected]> |
drivers/base/node: rename link_mem_sections() to register_memory_block_under_node()
Patch series "drivers/base/memory: determine and store zone for single-zone memory blocks", v2.
I remember talking to Michal in the past about removing test_pages_in_a_zone(), which we use for: * verifying that a memory block we intend to offline is really only managed by a single zone. We don't support offlining of memory blocks that are managed by multiple zones (e.g., multiple nodes, DMA and DMA32) * exposing that zone to user space via /sys/devices/system/memory/memory*/valid_zones
Now that I identified some more cases where test_pages_in_a_zone() might go wrong, and we received an UBSAN report (see patch #3), let's get rid of this PFN walker.
So instead of detecting the zone at runtime with test_pages_in_a_zone() by scanning the memmap, let's determine and remember for each memory block if it's managed by a single zone. The stored zone can then be used for the above two cases, avoiding a manual lookup using test_pages_in_a_zone().
This avoids eventually stumbling over uninitialized memmaps in corner cases, especially when ZONE_DEVICE ranges partly fall into memory block (that are responsible for managing System RAM).
Handling memory onlining is easy, because we online to exactly one zone. Handling boot memory is more tricky, because we want to avoid scanning all zones of all nodes to detect possible zones that overlap with the physical memory region of interest. Fortunately, we already have code that determines the applicable nodes for a memory block, to create sysfs links -- we'll hook into that.
Patch #1 is a simple cleanup I had laying around for a longer time. Patch #2 contains the main logic to remove test_pages_in_a_zone() and further details.
[1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected]
This patch (of 2):
Let's adjust the stale terminology, making it match unregister_memory_block_under_nodes() and do_register_memory_block_under_node(). We're dealing with memory block devices, which span 1..X memory sections.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Oscar Salvador <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Rafael Parra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
| #
2848a28b |
| 22-Mar-2022 |
David Hildenbrand <[email protected]> |
drivers/base/node: consolidate node device subsystem initialization in node_dev_init()
... and call node_dev_init() after memory_dev_init() from driver_init(), so before any of the existing arch/subsys calls. All online nodes should be known at that point: early during boot, arch code determines node and zone ranges and sets the relevant nodes online; usually this happens in setup_arch().
This is in line with memory_dev_init(), which initializes the memory device subsystem and creates all memory block devices.
Similar to memory_dev_init(), panic() if anything goes wrong, we don't want to continue with such basic initialization errors.
The important part is that node_dev_init() gets called after memory_dev_init() and after cpu_dev_init(), but before any of the relevant archs call register_cpu() to register the new cpu device under the node device. The latter should be the case for the current users of topology_init().
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Tested-by: Anatoly Pugachev <[email protected]> (sparc64) Cc: Greg Kroah-Hartman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Albert Ou <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
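An abridged sketch of the resulting ordering in driver_init() (other calls elided): memory block devices are created first, then node devices, before any arch code registers CPUs under the nodes:

  #include <linux/init.h>

  void __init driver_init(void)
  {
          /* ... core pieces elided ... */
          cpu_dev_init();
          memory_dev_init();
          node_dev_init();        /* panics internally on failure, like memory_dev_init() */
          /* ... */
  }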
|
|
Revision tags: v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2 |
|
| #
50468e43 |
| 16-Nov-2021 |
Jarkko Sakkinen <[email protected]> |
x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node
== Problem ==
The amount of SGX memory on a system is determined by the BIOS and it varies wildly between systems. It can be as small as dozens of MBs and as large as many GBs on servers. Just like applications need to know how much regular RAM is available, enclave builders need to know how much SGX memory an enclave can consume.
== Solution ==
Introduce a new sysfs file:
/sys/devices/system/node/nodeX/x86/sgx_total_bytes
to enumerate the amount of SGX memory available in each NUMA node. This serves the same function for SGX as /proc/meminfo or /sys/devices/system/node/nodeX/meminfo does for normal RAM.
'sgx_total_bytes' is needed today to help drive the SGX selftests. SGX-specific swap code is exercised by creating overcommitted enclaves which are larger than the physical SGX memory on the system. They currently use a CPUID-based approach which can diverge from the actual amount of SGX memory available. 'sgx_total_bytes' ensures that the selftests can work efficiently and do not attempt stupid things like creating a 100,000 MB enclave on a system with 128 MB of SGX memory.
== Implementation Details ==
Introduce CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an arch specific attribute group, and add an attribute for the amount of SGX memory in bytes to each NUMA node:
== ABI Design Discussion ==
As opposed to the per-node ABI, a single, global ABI was considered. However, this would prevent enclaves from being able to size themselves so that they fit on a single NUMA node. Essentially, a single value would rule out NUMA optimizations for enclaves.
Create a new "x86/" directory inside each "nodeX/" sysfs directory. 'sgx_total_bytes' is expected to be the first of at least a few sgx-specific files to be placed in the new directory. Just scanning /proc/meminfo, these are the no-brainers that we have for RAM, but we need for SGX:
MemTotal: xxxx kB // sgx_total_bytes (implemented here) MemFree: yyyy kB // sgx_free_bytes SwapTotal: zzzz kB // sgx_swapped_bytes
So, at *least* three. I think we will eventually end up needing something more along the lines of a dozen. A new directory (as opposed to being in the nodeX/ "root") directory avoids cluttering the root with several "sgx_*" files.
Place the new file in a new "nodeX/x86/" directory because SGX is highly x86-specific. It is very unlikely that any other architecture (or even non-Intel x86 vendor) will ever implement SGX. Using "sgx/" as opposed to "x86/" was also considered. But, there is a real chance this can get used for other arch-specific purposes.
[ dhansen: rewrite changelog ]
Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Acked-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
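A sketch of the per-node attribute group matching the description above; sgx_numa_nodes and its size field stand in for the real SGX bookkeeping and are assumptions here:

  #include <linux/device.h>
  #include <linux/sysfs.h>

  static ssize_t sgx_total_bytes_show(struct device *dev,
                                      struct device_attribute *attr, char *buf)
  {
          /* dev->id is the NUMA node id for node devices */
          return sysfs_emit(buf, "%lu\n", sgx_numa_nodes[dev->id].size);
  }
  static DEVICE_ATTR_RO(sgx_total_bytes);

  static struct attribute *arch_node_dev_attrs[] = {
          &dev_attr_sgx_total_bytes.attr,
          NULL,
  };

  /* exposed when CONFIG_HAVE_ARCH_NODE_DEV_GROUP is selected */
  const struct attribute_group arch_node_dev_group = {
          .name   = "x86",        /* creates nodeX/x86/sgx_total_bytes */
          .attrs  = arch_node_dev_attrs,
  };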
|
|
Revision tags: v5.16-rc1 |
|
| #
50f9481e |
| 05-Nov-2021 |
David Hildenbrand <[email protected]> |
mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE
CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Shuah Khan <[email protected]> [kselftest] Acked-by: Greg Kroah-Hartman <[email protected]> Acked-by: Oscar Salvador <[email protected]> Cc: Alex Shi <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1 |
|
| #
859a85dd |
| 08-Sep-2021 |
Mike Rapoport <[email protected]> |
mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".
After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes the pfn_valid_within() check and the CONFIG_HOLES_IN_ZONE configuration option redundant.
The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone(), which had pfn_valid_within() surrounded by more logic than a simple if.
This patch (of 2):
After recent changes in freeing of the unused parts of the memory map and the rework of pfn_valid() in arm and arm64, there are no architectures that can have holes in the memory map within a pageblock, so nothing can enable CONFIG_HOLES_IN_ZONE, which guards the non-trivial implementation of pfn_valid_within().
With that, pfn_valid_within() is always hardwired to 1 and can be completely removed.
Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|