topology.h - OpenGrok history log for /linux-6.15/include/linux/topology.h

Revision (<<< Hide revision tags) (Show revision tags >>>)	Date	Author	Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7
# 4b455f59	11-Mar-2025	Yicong Yang <[email protected]>	cpu/SMT: Provide a default topology_is_primary_thread() Currently if architectures want to support HOTPLUG_SMT they need to provide a topology_is_primary_thread() telling the framework which thread cpu/SMT: Provide a default topology_is_primary_thread() Currently if architectures want to support HOTPLUG_SMT they need to provide a topology_is_primary_thread() telling the framework which thread in the SMT cannot offline. However arm64 doesn't have a restriction on which thread in the SMT cannot offline, a simplest choice is that just make 1st thread as the "primary" thread. So just make this as the default implementation in the framework and let architectures like x86 that have special primary thread to override this function (which they've already done). There's no need to provide a stub function if !CONFIG_SMP or !CONFIG_HOTPLUG_SMT. In such case the testing CPU is already the 1st CPU in the SMT so it's always the primary thread. Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Pierre Gondois <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Signed-off-by: Yicong Yang <[email protected]> Reviewed-by: Sudeep Holla <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]> show more ...
Revision tags: v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3
# f09177ca	14-Feb-2025	Andrea Righi <[email protected]>	sched/topology: Introduce for_each_node_numadist() iterator Introduce the new helper for_each_node_numadist() to iterate over node IDs in order of increasing NUMA distance from a given starting node sched/topology: Introduce for_each_node_numadist() iterator Introduce the new helper for_each_node_numadist() to iterate over node IDs in order of increasing NUMA distance from a given starting node. This iterator is somehow similar to for_each_numa_hop_mask(), but instead of providing a cpumask at each iteration, it provides a node ID. Example usage: nodemask_t unvisited = NODE_MASK_ALL; int node, start = cpu_to_node(smp_processor_id()); node = start; for_each_node_numadist(node, unvisited) pr_info("node (%d, %d) -> %d\n", start, node, node_distance(start, node)); On a system with equidistant nodes: $ numactl -H ... node distances: node 0 1 2 3 0: 10 20 20 20 1: 20 10 20 20 2: 20 20 10 20 3: 20 20 20 10 Output of the example above (on node 0): [ 7.367022] node (0, 0) -> 10 [ 7.367151] node (0, 1) -> 20 [ 7.367186] node (0, 2) -> 20 [ 7.367247] node (0, 3) -> 20 On a system with non-equidistant nodes (simulated using virtme-ng): $ numactl -H ... node distances: node 0 1 2 3 0: 10 51 31 41 1: 51 10 21 61 2: 31 21 10 11 3: 41 61 11 10 Output of the example above (on node 0): [ 8.953644] node (0, 0) -> 10 [ 8.953712] node (0, 2) -> 31 [ 8.953764] node (0, 3) -> 41 [ 8.953817] node (0, 1) -> 51 Suggested-by: Yury Norov [NVIDIA] <[email protected]> Signed-off-by: Andrea Righi <[email protected]> Acked-by: Yury Norov [NVIDIA] <[email protected]> Signed-off-by: Tejun Heo <[email protected]> show more ...
Revision tags: v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7
# 8ab63d41	19-Aug-2023	Yury Norov <[email protected]>	sched/topology: Fix sched_numa_find_nth_cpu() in non-NUMA case When CONFIG_NUMA is enabled, sched_numa_find_nth_cpu() searches for a CPU in sched_domains_numa_masks. The masks includes only online C sched/topology: Fix sched_numa_find_nth_cpu() in non-NUMA case When CONFIG_NUMA is enabled, sched_numa_find_nth_cpu() searches for a CPU in sched_domains_numa_masks. The masks includes only online CPUs, so effectively offline CPUs are skipped. When CONFIG_NUMA is disabled, the fallback function should be consistent. Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()") Signed-off-by: Yury Norov <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Mel Gorman <[email protected]> Link: https://lore.kernel.org/r/[email protected] show more ...
Revision tags: v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5
# 06ac0172	21-Jan-2023	Valentin Schneider <[email protected]>	sched/topology: Introduce for_each_numa_hop_mask() The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs reachable within a given distance budget, wrap the logic for iterating over sched/topology: Introduce for_each_numa_hop_mask() The recently introduced sched_numa_hop_mask() exposes cpumasks of CPUs reachable within a given distance budget, wrap the logic for iterating over all (distance, mask) values inside an iterator macro. Signed-off-by: Valentin Schneider <[email protected]> Reviewed-by: Yury Norov <[email protected]> Signed-off-by: Yury Norov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> show more ...
# 9feae658	21-Jan-2023	Valentin Schneider <[email protected]>	sched/topology: Introduce sched_numa_hop_mask() Tariq has pointed out that drivers allocating IRQ vectors would benefit from having smarter NUMA-awareness - cpumask_local_spread() only knows about t sched/topology: Introduce sched_numa_hop_mask() Tariq has pointed out that drivers allocating IRQ vectors would benefit from having smarter NUMA-awareness - cpumask_local_spread() only knows about the local node and everything outside is in the same bucket. sched_domains_numa_masks is pretty much what we want to hand out (a cpumask of CPUs reachable within a given distance budget), introduce sched_numa_hop_mask() to export those cpumasks. Link: http://lore.kernel.org/r/[email protected] Signed-off-by: Valentin Schneider <[email protected]> Reviewed-by: Yury Norov <[email protected]> Signed-off-by: Yury Norov <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> show more ...
# cd7f5535	21-Jan-2023	Yury Norov <[email protected]>	sched: add sched_numa_find_nth_cpu() The function finds Nth set CPU in a given cpumask starting from a given node. Leveraging the fact that each hop in sched_domains_numa_masks includes the same or sched: add sched_numa_find_nth_cpu() The function finds Nth set CPU in a given cpumask starting from a given node. Leveraging the fact that each hop in sched_domains_numa_masks includes the same or greater number of CPUs than the previous one, we can use binary search on hops instead of linear walk, which makes the overall complexity of O(log n) in terms of number of cpumask_weight() calls. Signed-off-by: Yury Norov <[email protected]> Acked-by: Tariq Toukan <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Reviewed-by: Peter Lafreniere <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> show more ...
Revision tags: v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7
# 991d8d81	13-May-2022	Dietmar Eggemann <[email protected]>	topology: Remove unused cpu_cluster_mask() default_topology[] uses cpu_clustergroup_mask() for the CLS level (guarded by CONFIG_SCHED_CLUSTER) which is currently provided by x86 (arch/x86/kernel/smp topology: Remove unused cpu_cluster_mask() default_topology[] uses cpu_clustergroup_mask() for the CLS level (guarded by CONFIG_SCHED_CLUSTER) which is currently provided by x86 (arch/x86/kernel/smpboot.c) and arm64 (drivers/base/arch_topology.c). Fixes: 778c558f49a2c ("sched: Add cluster scheduler level in core and related Kconfig for ARM64") Signed-off-by: Dietmar Eggemann <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Barry Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] show more ...
# 15f214f9	13-May-2022	Dietmar Eggemann <[email protected]>	topology: Remove unused cpu_cluster_mask() default_topology[] uses cpu_clustergroup_mask() for the CLS level (guarded by CONFIG_SCHED_CLUSTER) which is currently provided by x86 (arch/x86/kernel/smp topology: Remove unused cpu_cluster_mask() default_topology[] uses cpu_clustergroup_mask() for the CLS level (guarded by CONFIG_SCHED_CLUSTER) which is currently provided by x86 (arch/x86/kernel/smpboot.c) and arm64 (drivers/base/arch_topology.c). Fixes: 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64") Acked-by: Barry Song <[email protected]> Signed-off-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> show more ...
Revision tags: v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3
# ab28e944	31-Jan-2022	Tony Luck <[email protected]>	topology/sysfs: Add PPIN in sysfs under cpu topology PPIN is the Protected Processor Identification Number. This is used to identify the socket as a Field Replaceable Unit (FRU). Existing code only topology/sysfs: Add PPIN in sysfs under cpu topology PPIN is the Protected Processor Identification Number. This is used to identify the socket as a Field Replaceable Unit (FRU). Existing code only displays this when reporting errors. But this makes it inconvenient for large clusters to use it for its intended purpose of inventory control. Add ppin to /sys/devices/system/cpu/cpu*/topology to make what is already available using RDMSR more easily accessible. Make the file read only for root in case there are still people concerned about making a unique system "serial number" available. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/[email protected] show more ...
Revision tags: v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4
# f1045056	29-Nov-2021	Heiko Carstens <[email protected]>	topology/sysfs: rework book and drawer topology ifdefery Provide default defines for the topology_book_[id\|cpumask] and topology_drawer_[id\|cpumask] macros just like for each other topology level. T topology/sysfs: rework book and drawer topology ifdefery Provide default defines for the topology_book_[id\|cpumask] and topology_drawer_[id\|cpumask] macros just like for each other topology level. This way all topology levels are handled in a similar way. Still the the book and drawer levels are only used on s390, and also the sysfs attributes are only created on s390. However other architectures may opt in if wanted. Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Heiko Carstens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> show more ...
# e7957077	29-Nov-2021	Heiko Carstens <[email protected]>	topology/sysfs: export cluster attributes only if an architectures has support The cluster_id and cluster_cpus topology sysfs attributes have been added with commit c5e22feffdd7 ("topology: Represen topology/sysfs: export cluster attributes only if an architectures has support The cluster_id and cluster_cpus topology sysfs attributes have been added with commit c5e22feffdd7 ("topology: Represent clusters of CPUs within a die"). They are currently only used for x86, arm64, and riscv (via generic arch topology), however they are still present with bogus default values for all other architectures. Instead of enforcing such new sysfs attributes to all architectures, make them only optional visible if an architecture opts in by defining both the topology_cluster_id and topology_cluster_cpumask attributes. This is similar to what was done when the book and drawer topology levels were introduced: avoid useless and therefore confusing sysfs attributes for architectures which cannot make use of them. This should not break any existing applications, since this is a new interface introduced with the v5.16 merge window. Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Heiko Carstens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> show more ...
# 2c4dcd7f	29-Nov-2021	Heiko Carstens <[email protected]>	topology/sysfs: export die attributes only if an architectures has support The die_id and die_cpus topology sysfs attributes have been added with commit 0e344d8c709f ("cpu/topology: Export die_id") topology/sysfs: export die attributes only if an architectures has support The die_id and die_cpus topology sysfs attributes have been added with commit 0e344d8c709f ("cpu/topology: Export die_id") and commit 2e4c54dac7b3 ("topology: Create core_cpus and die_cpus sysfs attributes"). While they are currently only used and useful for x86 they are still present with bogus default values for all architectures. Instead of enforcing such new sysfs attributes to all architectures, make them only optional visible if an architecture opts in by defining both the topology_die_id and topology_die_cpumask attributes. This is similar to what was done when the book and drawer topology levels were introduced: avoid useless and therefore confusing sysfs attributes for architectures which cannot make use of them. This should not break any existing applications, since this is a rather new interface and applications should be able to handle also older kernel versions without such attributes - besides that they contain only useful information for x86. Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Heiko Carstens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> show more ...
Revision tags: v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3
# 778c558f	24-Sep-2021	Barry Song <[email protected]>	sched: Add cluster scheduler level in core and related Kconfig for ARM64 This patch adds scheduler level for clusters and automatically enables the load balance among clusters. It will directly bene sched: Add cluster scheduler level in core and related Kconfig for ARM64 This patch adds scheduler level for clusters and automatically enables the load balance among clusters. It will directly benefit a lot of workload which loves more resources such as memory bandwidth, caches. Testing has widely been done in two different hardware configurations of Kunpeng920: 24 cores in one NUMA(6 clusters in each NUMA node); 32 cores in one NUMA(8 clusters in each NUMA node) Workload is running on either one NUMA node or four NUMA nodes, thus, this can estimate the effect of cluster spreading w/ and w/o NUMA load balance. * Stream benchmark: 4threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 29929.64 ( 0.00%) 32932.68 ( 10.03%) MB/sec scale 29861.10 ( 0.00%) 32710.58 ( 9.54%) MB/sec add 27034.42 ( 0.00%) 32400.68 ( 19.85%) MB/sec triad 27225.26 ( 0.00%) 31965.36 ( 17.41%) 6threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 40330.24 ( 0.00%) 42377.68 ( 5.08%) MB/sec scale 40196.42 ( 0.00%) 42197.90 ( 4.98%) MB/sec add 37427.00 ( 0.00%) 41960.78 ( 12.11%) MB/sec triad 37841.36 ( 0.00%) 42513.64 ( 12.35%) 12threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 52639.82 ( 0.00%) 53818.04 ( 2.24%) MB/sec scale 52350.30 ( 0.00%) 53253.38 ( 1.73%) MB/sec add 53607.68 ( 0.00%) 55198.82 ( 2.97%) MB/sec triad 54776.66 ( 0.00%) 56360.40 ( 2.89%) Thus, it could help memory-bound workload especially under medium load. Similar improvement is also seen in lkp-pbzip2: * lkp-pbzip2 benchmark 2-96 threads (on 4NUMA * 24cores = 96cores) lkp-pbzip2 lkp-pbzip2 w/o patch w/ patch Hmean tput-2 11062841.57 ( 0.00%) 11341817.51 * 2.52%* Hmean tput-5 26815503.70 ( 0.00%) 27412872.65 * 2.23%* Hmean tput-8 41873782.21 ( 0.00%) 43326212.92 * 3.47%* Hmean tput-12 61875980.48 ( 0.00%) 64578337.51 * 4.37%* Hmean tput-21 105814963.07 ( 0.00%) 111381851.01 * 5.26%* Hmean tput-30 150349470.98 ( 0.00%) 156507070.73 * 4.10%* Hmean tput-48 237195937.69 ( 0.00%) 242353597.17 * 2.17%* Hmean tput-79 360252509.37 ( 0.00%) 362635169.23 * 0.66%* Hmean tput-96 394571737.90 ( 0.00%) 400952978.48 * 1.62%* 2-24 threads (on 1NUMA * 24cores = 24cores) lkp-pbzip2 lkp-pbzip2 w/o patch w/ patch Hmean tput-2 11071705.49 ( 0.00%) 11296869.10 * 2.03%* Hmean tput-4 20782165.19 ( 0.00%) 21949232.15 * 5.62%* Hmean tput-6 30489565.14 ( 0.00%) 33023026.96 * 8.31%* Hmean tput-8 40376495.80 ( 0.00%) 42779286.27 * 5.95%* Hmean tput-12 61264033.85 ( 0.00%) 62995632.78 * 2.83%* Hmean tput-18 86697139.39 ( 0.00%) 86461545.74 ( -0.27%) Hmean tput-24 104854637.04 ( 0.00%) 104522649.46 * -0.32%* In the case of 6 threads and 8 threads, we see the greatest performance improvement. Similar improvement can be seen on lkp-pixz though the improvement is smaller: * lkp-pixz benchmark 2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores) lkp-pixz lkp-pixz w/o patch w/ patch Hmean tput-2 6486981.16 ( 0.00%) 6561515.98 * 1.15%* Hmean tput-4 11645766.38 ( 0.00%) 11614628.43 ( -0.27%) Hmean tput-6 15429943.96 ( 0.00%) 15957350.76 * 3.42%* Hmean tput-8 19974087.63 ( 0.00%) 20413746.98 * 2.20%* Hmean tput-12 28172068.18 ( 0.00%) 28751997.06 * 2.06%* Hmean tput-18 39413409.54 ( 0.00%) 39896830.55 * 1.23%* Hmean tput-24 49101815.85 ( 0.00%) 49418141.47 * 0.64%* * SPECrate benchmark 4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores) Base Base Run Time Rate ------- --------- 4 Copies w/o 580 (w/ 570) w/o 11.1 (w/ 11.3) 8 Copies w/o 647 (w/ 605) w/o 20.0 (w/ 21.4, +7%) 16 Copies w/o 844 (w/ 844) w/o 30.6 (w/ 30.6) 32 Copies(on 4NUMA * 32 cores = 128cores) [w/o patch] Base Base Base Benchmarks Copies Run Time Rate --------------- ------- --------- --------- 500.perlbench_r 32 584 87.2 * 502.gcc_r 32 503 90.2 * 505.mcf_r 32 745 69.4 * 520.omnetpp_r 32 1031 40.7 * 523.xalancbmk_r 32 597 56.6 * 525.x264_r 1 -- CE 531.deepsjeng_r 32 336 109 * 541.leela_r 32 556 95.4 * 548.exchange2_r 32 513 163 * 557.xz_r 32 530 65.2 * Est. SPECrate2017_int_base 80.3 [w/ patch] Base Base Base Benchmarks Copies Run Time Rate --------------- ------- --------- --------- 500.perlbench_r 32 580 87.8 (+0.688%) * 502.gcc_r 32 477 95.1 (+5.432%) * 505.mcf_r 32 644 80.3 (+13.574%) * 520.omnetpp_r 32 942 44.6 (+9.58%) * 523.xalancbmk_r 32 560 60.4 (+6.714%%) * 525.x264_r 1 -- CE 531.deepsjeng_r 32 337 109 (+0.000%) * 541.leela_r 32 554 95.6 (+0.210%) * 548.exchange2_r 32 515 163 (+0.000%) * 557.xz_r 32 524 66.0 (+1.227%) * Est. SPECrate2017_int_base 83.7 (+4.062%) On the other hand, it is slightly helpful to CPU-bound tasks like kernbench: * 24-96 threads kernbench (on 4NUMA * 24cores = 96cores) kernbench kernbench w/o cluster w/ cluster Min user-24 12054.67 ( 0.00%) 12024.19 ( 0.25%) Min syst-24 1751.51 ( 0.00%) 1731.68 ( 1.13%) Min elsp-24 600.46 ( 0.00%) 598.64 ( 0.30%) Min user-48 12361.93 ( 0.00%) 12315.32 ( 0.38%) Min syst-48 1917.66 ( 0.00%) 1892.73 ( 1.30%) Min elsp-48 333.96 ( 0.00%) 332.57 ( 0.42%) Min user-96 12922.40 ( 0.00%) 12921.17 ( 0.01%) Min syst-96 2143.94 ( 0.00%) 2110.39 ( 1.56%) Min elsp-96 211.22 ( 0.00%) 210.47 ( 0.36%) Amean user-24 12063.99 ( 0.00%) 12030.78 * 0.28%* Amean syst-24 1755.20 ( 0.00%) 1735.53 * 1.12%* Amean elsp-24 601.60 ( 0.00%) 600.19 ( 0.23%) Amean user-48 12362.62 ( 0.00%) 12315.56 * 0.38%* Amean syst-48 1921.59 ( 0.00%) 1894.95 * 1.39%* Amean elsp-48 334.10 ( 0.00%) 332.82 * 0.38%* Amean user-96 12925.27 ( 0.00%) 12922.63 ( 0.02%) Amean syst-96 2146.66 ( 0.00%) 2122.20 * 1.14%* Amean elsp-96 211.96 ( 0.00%) 211.79 ( 0.08%) Note this patch isn't an universal win, it might hurt those workload which can benefit from packing. Though tasks which want to take advantages of lower communication latency of one cluster won't necessarily been packed in one cluster while kernel is not aware of clusters, they have some chance to be randomly packed. But this patch will make them more likely spread. Signed-off-by: Barry Song <[email protected]> Tested-by: Yicong Yang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> show more ...
# c5e22fef	24-Sep-2021	Jonathan Cameron <[email protected]>	topology: Represent clusters of CPUs within a die Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as topology: Represent clusters of CPUs within a die Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Nodes Structure representing a higher level of topology. For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data, but each cluster has local L3 tag. On the other hand, each clusters will share some internal system bus. +-----------------------------------+ +---------+ \| +------+ +------+ +--------------------------+ \| \| \| CPU0 \| \| cpu1 \| \| +-----------+ \| \| \| +------+ +------+ \| \| \| \| \| \| +----+ L3 \| \| \| \| +------+ +------+ cluster \| \| tag \| \| \| \| \| CPU2 \| \| CPU3 \| \| \| \| \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| \| \| +-----------------------------------+ \| \| +-----------------------------------+ \| \| \| +------+ +------+ +--------------------------+ \| \| \| \| \| \| \| +-----------+ \| \| \| +------+ +------+ \| \| \| \| \| \| \| \| L3 \| \| \| \| +------+ +------+ +----+ tag \| \| \| \| \| \| \| \| \| \| \| \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| \| \| +-----------------------------------+ \| L3 \| \| data \| +-----------------------------------+ \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| \| \| \| \| \| \| \| \| \| +------+ +------+ +----+ L3 \| \| \| \| \| \| tag \| \| \| \| +------+ +------+ \| \| \| \| \| \| \| \| \| \| \| +-----------+ \| \| \| +------+ +------+ +--------------------------+ \| +-----------------------------------\| \| \| +-----------------------------------\| \| \| \| +------+ +------+ +--------------------------+ \| \| \| \| \| \| \| +-----------+ \| \| \| +------+ +------+ \| \| \| \| \| \| +----+ L3 \| \| \| \| +------+ +------+ \| \| tag \| \| \| \| \| \| \| \| \| \| \| \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| \| \| +-----------------------------------+ \| \| +-----------------------------------+ \| \| \| +------+ +------+ +--------------------------+ \| \| \| \| \| \| \| +-----------+ \| \| \| +------+ +------+ \| \| \| \| \| \| \| \| L3 \| \| \| \| +------+ +------+ +---+ tag \| \| \| \| \| \| \| \| \| \| \| \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| \| \| +-----------------------------------+ \| \| +-----------------------------------+ \| \| \| +------+ +------+ +--------------------------+ \| \| \| \| \| \| \| +-----------+ \| \| \| +------+ +------+ \| \| \| \| \| \| \| \| L3 \| \| \| \| +------+ +------+ +--+ tag \| \| \| \| \| \| \| \| \| \| \| \| \| \| +------+ +------+ \| +-----------+ \| \| \| \| +---------+ +-----------------------------------+ That means spreading tasks among clusters will bring more bandwidth while packing tasks within one cluster will lead to smaller cache synchronization latency. So both kernel and userspace will have a chance to leverage this topology to deploy tasks accordingly to achieve either smaller cache latency within one cluster or an even distribution of load among clusters for higher throughput. This patch exposes cluster topology to both kernel and userspace. Libraried like hwloc will know cluster by cluster_cpus and related sysfs attributes. PoC of HWLOC support at [2]. Note this patch only handle the ACPI case. Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level). Note that arm64 / ACPI does not provide any means of identifying a die level in the topology but that may be unrelate to the cluster level. [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node structure (Type 0) [2] https://github.com/hisilicon/hwloc/tree/linux-cluster Signed-off-by: Jonathan Cameron <[email protected]> Signed-off-by: Tian Tao <[email protected]> Signed-off-by: Barry Song <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected] show more ...
Revision tags: v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5
# 620a6dc4	22-Jan-2021	Valentin Schneider <[email protected]>	sched/topology: Make sched_init_numa() use a set for the deduplicating sort The deduplicating sort in sched_init_numa() assumes that the first line in the distance table contains all unique values i sched/topology: Make sched_init_numa() use a set for the deduplicating sort The deduplicating sort in sched_init_numa() assumes that the first line in the distance table contains all unique values in the entire table. I've been trying to pen what this exactly means for the topology, but it's not straightforward. For instance, topology.c uses this example: node 0 1 2 3 0: 10 20 20 30 1: 20 10 20 20 2: 20 20 10 20 3: 30 20 20 10 0 ----- 1 \| / \| \| / \| \| / \| 2 ----- 3 Which works out just fine. However, if we swap nodes 0 and 1: 1 ----- 0 \| / \| \| / \| \| / \| 2 ----- 3 we get this distance table: node 0 1 2 3 0: 10 20 20 20 1: 20 10 20 30 2: 20 20 10 20 3: 20 30 20 10 Which breaks the deduplicating sort (non-representative first line). In this case this would just be a renumbering exercise, but it so happens that we can have a deduplicating sort that goes through the whole table in O(n²) at the extra cost of a temporary memory allocation (i.e. any form of set). The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following this, implement the set as a 256-bits bitmap. Should this not be satisfactory (i.e. we want to support 32-bit values), then we'll have to go for some other sparse set implementation. This has the added benefit of letting us allocate just the right amount of memory for sched_domains_numa_distance[], rather than an arbitrary (nr_node_ids + 1). Note: DT binding equivalent (distance-map) decodes distances as 32-bit values. Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected] show more ...
Revision tags: v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1
# 3babbe44	07-Aug-2020	Srikar Dronamraju <[email protected]>	sched/topology: Allow archs to override cpu_smt_mask cpu_smt_mask tracks topology_sibling_cpumask. This would be good for most architectures. One of the users of cpu_smt_mask(), would be to identify sched/topology: Allow archs to override cpu_smt_mask cpu_smt_mask tracks topology_sibling_cpumask. This would be good for most architectures. One of the users of cpu_smt_mask(), would be to identify idle-cores. On Power9, a pair of SMT4 cores can be presented by the firmware as a SMT8 core for backward compatibility reasons. powerpc allows LPARs to be live migrated from Power8 to Power9. Do note Power8 had only SMT8 cores. Existing software which has been developed/configured for Power8 would expect to see SMT8 core. Maintaining the illusion of SMT8 core is a requirement to make that work. In order to maintain above userspace backward compatibility with previous versions of processor, Power9 onwards there is option to the firmware to advertise a pair of SMT4 cores as a fused cores aka SMT8 core. On Power9 this pair shares the L2 cache as well. However, from the scheduler's point of view, a core should be determined by SMT4, since its a completely independent unit of compute. Hence allow powerpc architecture to override the default cpu_smt_mask() to point to the SMT4 cores in a SMT8 mode. This will ensure the scheduler is always given the right information. Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Srikar Dronamraju <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] show more ...
Revision tags: v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7, v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2, v5.7-rc1
# 667c7901	02-Apr-2020	Vlastimil Babka <[email protected]>	revert "topology: add support for node_to_mem_node() to determine the fallback node" This reverts commit ad2c8144418c6a81cefe65379fd47bbe8344cef2. The function node_to_mem_node() was introduced by revert "topology: add support for node_to_mem_node() to determine the fallback node" This reverts commit ad2c8144418c6a81cefe65379fd47bbe8344cef2. The function node_to_mem_node() was introduced by that commit for use in SLUB on systems with memoryless nodes, but it turned out to be unreliable on some architectures/configurations and a simpler solution exists than fixing it up. Thus commit 0715e6c516f1 ("mm, slub: prevent kmalloc_node crashes and memory leaks") removed the only user of node_to_mem_node() and we can revert the commit that introduced the function. Signed-off-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Bharata B Rao <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nathan Lynch <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: PUVICHAKRAVARTHY RAMACHANDRAN <[email protected]> Cc: Sachin Sant <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]> show more ...
Revision tags: v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3, v5.5-rc2, v5.5-rc1, v5.4, v5.4-rc8, v5.4-rc7, v5.4-rc6, v5.4-rc5, v5.4-rc4, v5.4-rc3, v5.4-rc2, v5.4-rc1, v5.3, v5.3-rc8, v5.3-rc7, v5.3-rc6, v5.3-rc5, v5.3-rc4
# a55c7454	08-Aug-2019	Matt Fleming <[email protected]>	sched/topology: Improve load balancing on AMD EPYC systems SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init() for any sched domains with a NUMA distance greater than 2 hops (RECLAIM sched/topology: Improve load balancing on AMD EPYC systems SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init() for any sched domains with a NUMA distance greater than 2 hops (RECLAIM_DISTANCE). The idea being that it's expensive to balance across domains that far apart. However, as is rather unfortunately explained in: commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30") the value for RECLAIM_DISTANCE is based on node distance tables from 2011-era hardware. Current AMD EPYC machines have the following NUMA node distances: node distances: node 0 1 2 3 4 5 6 7 0: 10 16 16 16 32 32 32 32 1: 16 10 16 16 32 32 32 32 2: 16 16 10 16 32 32 32 32 3: 16 16 16 10 32 32 32 32 4: 32 32 32 32 10 16 16 16 5: 32 32 32 32 16 10 16 16 6: 32 32 32 32 16 16 10 16 7: 32 32 32 32 16 16 16 10 where 2 hops is 32. The result is that the scheduler fails to load balance properly across NUMA nodes on different sockets -- 2 hops apart. For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4 (CPUs 32-39) like so, $ numactl -C 0-7,32-39 ./spinner 16 causes all threads to fork and remain on node 0 until the active balancer kicks in after a few seconds and forcibly moves some threads to node 4. Override node_reclaim_distance for AMD Zen. Signed-off-by: Matt Fleming <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: [email protected] Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> show more ...
Revision tags: v5.3-rc3, v5.3-rc2, v5.3-rc1, v5.2, v5.2-rc7
# 60c1b220	27-Jun-2019	Atish Patra <[email protected]>	cpu-topology: Move cpu topology code to common code. Both RISC-V & ARM64 are using cpu-map device tree to describe their cpu topology. It's better to move the relevant code to a common place instead cpu-topology: Move cpu topology code to common code. Both RISC-V & ARM64 are using cpu-map device tree to describe their cpu topology. It's better to move the relevant code to a common place instead of duplicate code. To: Will Deacon <[email protected]> To: Catalin Marinas <[email protected]> Signed-off-by: Atish Patra <[email protected]> [Tested on QDF2400] Tested-by: Jeffrey Hugo <[email protected]> [Tested on Juno and other embedded platforms.] Tested-by: Sudeep Holla <[email protected]> Reviewed-by: Sudeep Holla <[email protected]> Acked-by: Will Deacon <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Paul Walmsley <[email protected]> show more ...
Revision tags: v5.2-rc6, v5.2-rc5, v5.2-rc4, v5.2-rc3, v5.2-rc2, v5.2-rc1
# 2e4c54da	13-May-2019	Len Brown <[email protected]>	topology: Create core_cpus and die_cpus sysfs attributes Create CPU topology sysfs attributes: "core_cpus" and "core_cpus_list" These attributes represent all of the logical CPUs that share the sam topology: Create core_cpus and die_cpus sysfs attributes Create CPU topology sysfs attributes: "core_cpus" and "core_cpus_list" These attributes represent all of the logical CPUs that share the same core. These attriutes is synonymous with the existing "thread_siblings" and "thread_siblings_list" attribute, which will be deprecated. Create CPU topology sysfs attributes: "die_cpus" and "die_cpus_list". These attributes represent all of the logical CPUs that share the same die. Suggested-by: Brice Goglin <[email protected]> Signed-off-by: Len Brown <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/071c23a298cd27ede6ed0b6460cae190d193364f.1557769318.git.len.brown@intel.com show more ...
# 0e344d8c	13-May-2019	Len Brown <[email protected]>	cpu/topology: Export die_id Export die_id in cpu topology, for the benefit of hardware that has multiple-die/package. Signed-off-by: Len Brown <[email protected]> Signed-off-by: Thomas Gleixner < cpu/topology: Export die_id Export die_id in cpu topology, for the benefit of hardware that has multiple-die/package. Signed-off-by: Len Brown <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/e7d1caaf4fbd24ee40db6d557ab28d7d83298900.1557769318.git.len.brown@intel.com show more ...
Revision tags: v5.1, v5.1-rc7, v5.1-rc6, v5.1-rc5, v5.1-rc4, v5.1-rc3, v5.1-rc2, v5.1-rc1, v5.0, v5.0-rc8, v5.0-rc7, v5.0-rc6, v5.0-rc5, v5.0-rc4, v5.0-rc3, v5.0-rc2, v5.0-rc1, v4.20, v4.20-rc7, v4.20-rc6, v4.20-rc5, v4.20-rc4, v4.20-rc3, v4.20-rc2, v4.20-rc1, v4.19, v4.19-rc8, v4.19-rc7, v4.19-rc6, v4.19-rc5, v4.19-rc4, v4.19-rc3, v4.19-rc2, v4.19-rc1, v4.18, v4.18-rc8, v4.18-rc7, v4.18-rc6, v4.18-rc5, v4.18-rc4, v4.18-rc3, v4.18-rc2, v4.18-rc1, v4.17, v4.17-rc7, v4.17-rc6, v4.17-rc5, v4.17-rc4, v4.17-rc3, v4.17-rc2, v4.17-rc1, v4.16, v4.16-rc7, v4.16-rc6, v4.16-rc5, v4.16-rc4, v4.16-rc3, v4.16-rc2, v4.16-rc1, v4.15, v4.15-rc9, v4.15-rc8, v4.15-rc7, v4.15-rc6, v4.15-rc5, v4.15-rc4, v4.15-rc3, v4.15-rc2, v4.15-rc1, v4.14, v4.14-rc8, v4.14-rc7, v4.14-rc6, v4.14-rc5, v4.14-rc4, v4.14-rc3, v4.14-rc2, v4.14-rc1, v4.13, v4.13-rc7, v4.13-rc6, v4.13-rc5, v4.13-rc4, v4.13-rc3, v4.13-rc2, v4.13-rc1, v4.12, v4.12-rc7, v4.12-rc6, v4.12-rc5, v4.12-rc4, v4.12-rc3, v4.12-rc2, v4.12-rc1, v4.11, v4.11-rc8, v4.11-rc7, v4.11-rc6, v4.11-rc5, v4.11-rc4, v4.11-rc3, v4.11-rc2, v4.11-rc1, v4.10, v4.10-rc8, v4.10-rc7, v4.10-rc6, v4.10-rc5, v4.10-rc4, v4.10-rc3, v4.10-rc2, v4.10-rc1, v4.9, v4.9-rc8, v4.9-rc7, v4.9-rc6, v4.9-rc5, v4.9-rc4, v4.9-rc3, v4.9-rc2, v4.9-rc1, v4.8, v4.8-rc8, v4.8-rc7, v4.8-rc6, v4.8-rc5, v4.8-rc4, v4.8-rc3, v4.8-rc2, v4.8-rc1
# a5f5f91d	28-Jul-2016	Mel Gorman <[email protected]>	mm: convert zone_reclaim to node_reclaim As reclaim is now per-node based, convert zone_reclaim to be node_reclaim. It is possible that a node will be reclaimed multiple times if it has multiple zo mm: convert zone_reclaim to node_reclaim As reclaim is now per-node based, convert zone_reclaim to be node_reclaim. It is possible that a node will be reclaimed multiple times if it has multiple zones but this is unavoidable without caching all nodes traversed so far. The documentation and interface to userspace is the same from a configuration perspective and will will be similar in behaviour unless the node-local allocation requests were also limited to lower zones. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Hillf Danton <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> show more ...
Revision tags: v4.7, v4.7-rc7, v4.7-rc6, v4.7-rc5, v4.7-rc4, v4.7-rc3, v4.7-rc2, v4.7-rc1, v4.6, v4.6-rc7, v4.6-rc6, v4.6-rc5, v4.6-rc4, v4.6-rc3, v4.6-rc2, v4.6-rc1, v4.5, v4.5-rc7, v4.5-rc6, v4.5-rc5, v4.5-rc4, v4.5-rc3, v4.5-rc2, v4.5-rc1, v4.4
# 00d27c63	05-Jan-2016	Chris Metcalf <[email protected]>	numa: remove stale node_has_online_mem() define This isn't used anywhere, so delete it. Looks like the last usage (in x86-specific code) was removed by Tejun in 2011 in commit bd6709a91a59 ("x86, N numa: remove stale node_has_online_mem() define This isn't used anywhere, so delete it. Looks like the last usage (in x86-specific code) was removed by Tejun in 2011 in commit bd6709a91a59 ("x86, NUMA: Make 32bit use common NUMA init path"). Signed-off-by: Chris Metcalf <[email protected]> show more ...
Revision tags: v4.4-rc8, v4.4-rc7, v4.4-rc6, v4.4-rc5, v4.4-rc4, v4.4-rc3, v4.4-rc2, v4.4-rc1, v4.3, v4.3-rc7, v4.3-rc6, v4.3-rc5, v4.3-rc4, v4.3-rc3, v4.3-rc2, v4.3-rc1, v4.2, v4.2-rc8, v4.2-rc7, v4.2-rc6, v4.2-rc5, v4.2-rc4, v4.2-rc3, v4.2-rc2, v4.2-rc1, v4.1, v4.1-rc8, v4.1-rc7, v4.1-rc6
# 06931e62	26-May-2015	Bartosz Golaszewski <[email protected]>	sched/topology: Rename topology_thread_cpumask() to topology_sibling_cpumask() Rename topology_thread_cpumask() to topology_sibling_cpumask() for more consistency with scheduler code. Signed-off-by sched/topology: Rename topology_thread_cpumask() to topology_sibling_cpumask() Rename topology_thread_cpumask() to topology_sibling_cpumask() for more consistency with scheduler code. Signed-off-by: Bartosz Golaszewski <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Russell King <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: Benoit Cousson <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Jean Delvare <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Oleg Drokin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Russell King <[email protected]> Cc: Viresh Kumar <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> show more ...
Revision tags: v4.1-rc5, v4.1-rc4, v4.1-rc3, v4.1-rc2, v4.1-rc1, v4.0, v4.0-rc7, v4.0-rc6, v4.0-rc5, v4.0-rc4, v4.0-rc3, v4.0-rc2, v4.0-rc1, v3.19, v3.19-rc7, v3.19-rc6, v3.19-rc5, v3.19-rc4, v3.19-rc3, v3.19-rc2, v3.19-rc1, v3.18, v3.18-rc7, v3.18-rc6, v3.18-rc5, v3.18-rc4, v3.18-rc3, v3.18-rc2, v3.18-rc1
# ad2c8144	09-Oct-2014	Joonsoo Kim <[email protected]>	topology: add support for node_to_mem_node() to determine the fallback node Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that on ppc LPARs with memoryless nodes, a large amoun topology: add support for node_to_mem_node() to determine the fallback node Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that on ppc LPARs with memoryless nodes, a large amount of memory was consumed by slabs and was marked unreclaimable. He tracked it down to slab deactivations in the SLUB core when we allocate remotely, leading to poor efficiency always when memoryless nodes are present. After much discussion, Joonsoo provided a few patches that help significantly. They don't resolve the problem altogether: - memory hotplug still needs testing, that is when a memoryless node becomes memory-ful, we want to dtrt - there are other reasons for going off-node than memoryless nodes, e.g., fully exhausted local nodes Neither case is resolved with this series, but I don't think that should block their acceptance, as they can be explored/resolved with follow-on patches. The series consists of: [1/3] topology: add support for node_to_mem_node() to determine the fallback node [2/3] slub: fallback to node_to_mem_node() node if allocating on memoryless node - Joonsoo's patches to cache the nearest node with memory for each NUMA node [3/3] Partial revert of 81c98869faa5 (""kthread: ensure locality of task_struct allocations") - At Tejun's request, keep the knowledge of memoryless node fallback to the allocator core. This patch (of 3): We need to determine the fallback node in slub allocator if the allocation target node is memoryless node. Without it, the SLUB wrongly select the node which has no memory and can't use a partial slab, because of node mismatch. Introduced function, node_to_mem_node(X), will return a node Y with memory that has the nearest distance. If X is memoryless node, it will return nearest distance node, but, if X is normal node, it will return itself. We will use this function in following patch to determine the fallback node. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Nishanth Aravamudan <[email protected]> Cc: David Rientjes <[email protected]> Cc: Han Pingtian <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> show more ...
12 3 4