History log of /linux-6.15/include/linux/memory-tiers.h (Results 1 – 17 of 17)
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7
# 823430c8 04-Jul-2024 Ho-Ren (Jack) Chuang <[email protected]>

memory tier: consolidate the initialization of memory tiers

The current memory tier initialization process is distributed across two
different functions, memory_tier_init() and memory_tier_late_init(). This
design is hard to maintain. This patch therefore reduces the possible
code paths by consolidating the different initialization paths into
one.

The earlier discussion with Jonathan and Ying is listed here:
https://lore.kernel.org/lkml/[email protected]/

If we want to put these two initializations together, they must be placed
together in the later function, because only at that time will the HMAT
information be ready, so that adist between nodes can be calculated and
memory tiering can be established based on the adist. So we move the
initialization in memory_tier_init() to the memory_tier_late_init() call.
Moreover, it's natural to keep memory_tier initialization in drivers at
device_initcall() level.

If we simply move the set_node_memory_tier() from memory_tier_init() to
late_initcall(), it will result in HMAT not registering the
mt_adistance_algorithm callback function, because set_node_memory_tier()
is not performed during the memory tiering initialization phase, leading
to a lack of correct default_dram information.

Therefore, we introduced a nodemask to pass the information of the default
DRAM nodes. The reason for not choosing to reuse default_dram_type->nodes
is that it is not clean enough. So in the end, we use a __initdata
variable, which is a variable that is released once initialization is
complete, including both CPU and memory nodes for HMAT to iterate through.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Suggested-by: Jonathan Cameron <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gregory Price <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Ravi Jonnalagadda <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


Revision tags: v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3
# a72a30af 05-Apr-2024 Ho-Ren (Jack) Chuang <[email protected]>

memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

Patch series "Improved Memory Tier Creation for CPUless NUMA Nodes", v11.

When a memory device, such as CXL1.1 type3 memory, is emulated as normal
memory (E820_TYPE_RAM), the memory device is indistinguishable from normal
DRAM in terms of memory tiering with the current implementation. The
current memory tiering assigns all detected normal memory nodes to the
same DRAM tier. As a result, normal memory devices with different
attributes cannot be assigned to the correct memory tier, which makes it
impossible to migrate pages between different types of memory.
https://lore.kernel.org/linux-mm/PH0PR08MB7955E9F08CCB64F23963B5C3A860A@PH0PR08MB7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues. It delays the
initialization of memory tiers for CPUless NUMA nodes until they obtain
HMAT information and after all devices are initialized at boot time,
eliminating the need for user intervention. If no HMAT is specified, it
falls back to using `default_dram_type`.

Example use case:
We have CXL memory on the host, and we create VMs with a new system memory
device backed by host CXL memory. We inject CXL memory performance
attributes through QEMU, and the guest now sees memory nodes with
performance attributes in HMAT. With this change, we enable the guest
kernel to construct the correct memory tiering for the memory nodes.


This patch (of 2):

Since different memory devices require finding, allocating, and putting
memory types, these common steps are abstracted in this patch, enhancing
the scalability and conciseness of the code.
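The find/allocate/put steps described above can be sketched in user-space C as follows. This is only an illustration under stated assumptions: the names `mem_type`, `find_alloc_mem_type()` and `put_mem_types()` are hypothetical, the list is a plain singly linked list, and "put" simply frees every entry rather than dropping a reference as a kernel implementation would.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical memory type record, keyed by abstract distance. */
struct mem_type {
	int adistance;
	struct mem_type *next;
};

/* Return the type with this abstract distance, allocating it if absent. */
struct mem_type *find_alloc_mem_type(struct mem_type **list, int adistance)
{
	struct mem_type *t;

	assert(list != NULL);
	for (t = *list; t; t = t->next)
		if (t->adistance == adistance)
			return t;

	t = malloc(sizeof(*t));
	if (!t)
		return NULL;
	t->adistance = adistance;
	t->next = *list;	/* push the new type onto the caller's list */
	*list = t;
	return t;
}

/* Release every type on the list (simplified: frees instead of unref). */
void put_mem_types(struct mem_type **list)
{
	assert(list != NULL);
	while (*list) {
		struct mem_type *t = *list;

		*list = t->next;
		free(t);
	}
}
```

A driver-like caller would keep one such list per driver, call the find-or-allocate helper for each device it probes, and release the whole list on teardown.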

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Gregory Price <[email protected]>
Cc: Hao Xiang <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Ravi Jonnalagadda <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


Revision tags: v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7
# 6a954e94 21-Dec-2023 Dave Jiang <[email protected]>

base/node / acpi: Change 'node_hmem_attrs' to 'access_coordinates'

Dan Williams suggested changing the struct 'node_hmem_attrs' to
'access_coordinates' [1]. The struct is a container of r/w-latency and
r/w-bandwidth numbers. Moving forward, this container will also be used by
CXL to store the performance characteristics of each link hop in
the PCIE/CXL topology. So, where node_hmem_attrs is just the access
parameters of a memory-node, access_coordinates applies more broadly
to hardware topology characteristics. The observation is that it seemed like
an exercise in having the application identify "where" it falls on a
spectrum of bandwidth and latency needs. For the tuple of
read/write-latency and read/write-bandwidth, "coordinates" is not a perfect
fit. Sometimes it is just conveying values in isolation and not a
"location" relative to other performance points, but in the end this data
is used to identify the performance operation point of a given memory-node.
[2]

Link: http://lore.kernel.org/r/[email protected]/
Link: https://lore.kernel.org/linux-cxl/[email protected]/
Suggested-by: Dan Williams <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Link: https://lore.kernel.org/r/170319615734.2212653.15319394025985499185.stgit@djiang5-mobl3
Signed-off-by: Dan Williams <[email protected]>


Revision tags: v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4
# 6bc2cfdf 26-Sep-2023 Huang Ying <[email protected]>

dax, kmem: calculate abstract distance with general interface

Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE was
used for the slow memory type in the kmem driver. This limits the usage of
the kmem driver; for example, it cannot be used for HBM (high bandwidth
memory).

So, we use the general abstract distance calculation mechanism in the kmem
driver to get a more accurate abstract distance on systems with proper
support. The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as a fallback
only.

Now, multiple memory types may be managed by kmem. These memory types are
put into the "kmem_memory_types" list and protected by
kmem_memory_type_lock.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Tested-by: Bharata B Rao <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Rafael J Wysocki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


# 3718c02d 26-Sep-2023 Huang Ying <[email protected]>

acpi, hmat: calculate abstract distance with HMAT

A memory tiering abstract distance calculation algorithm based on ACPI
HMAT is implemented. The basic idea is as follows.

The performance attributes of system default DRAM nodes are recorded as
the baseline, whose abstract distance is MEMTIER_ADISTANCE_DRAM. Then,
the ratio of the abstract distance of a memory node (target) to
MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
attributes of the node to that of the default DRAM nodes.
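The scaling described above can be sketched in C roughly as follows. This is a minimal illustration, not the kernel code: the struct `perf`, the function `perf_to_adistance()` and the baseline value 512 are assumptions made for the example, read and write figures are simply summed, and overflow for extreme inputs is not handled.

```c
#include <assert.h>

/* Assumed baseline abstract distance for default DRAM in this sketch. */
#define ADISTANCE_DRAM 512

/* Hypothetical container for the four performance numbers of a node. */
struct perf {
	unsigned int read_latency;
	unsigned int write_latency;
	unsigned int read_bandwidth;
	unsigned int write_bandwidth;
};

/*
 * Scale the DRAM baseline by the latency ratio (node / DRAM) and by the
 * inverse bandwidth ratio (DRAM / node): slower or narrower memory gets
 * a larger abstract distance, faster or wider memory a smaller one.
 */
int perf_to_adistance(const struct perf *dram, const struct perf *node)
{
	unsigned long long lat_dram =
		(unsigned long long)dram->read_latency + dram->write_latency;
	unsigned long long lat_node =
		(unsigned long long)node->read_latency + node->write_latency;
	unsigned long long bw_dram =
		(unsigned long long)dram->read_bandwidth + dram->write_bandwidth;
	unsigned long long bw_node =
		(unsigned long long)node->read_bandwidth + node->write_bandwidth;

	assert(lat_dram > 0 && bw_node > 0);
	return (int)(ADISTANCE_DRAM * lat_node / lat_dram * bw_dram / bw_node);
}
```

With these assumptions, a node with twice the latency and half the bandwidth of DRAM lands at four times the baseline, while a node with DRAM-like latency and twice the bandwidth lands at half of it.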

The functions to record the read/write latency/bandwidth of the default
DRAM nodes and calculate abstract distance according to read/write
latency/bandwidth ratio will be used by CXL CDAT (Coherent Device
Attribute Table) and other memory device drivers. So, they are put in
memory-tiers.c.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Tested-by: Bharata B Rao <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Rafael J Wysocki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


# 07a8bdd4 26-Sep-2023 Huang Ying <[email protected]>

memory tiering: add abstract distance calculation algorithms management

Patch series "memory tiering: calculate abstract distance based on ACPI
HMAT", v4.

We have the explicit memory tiers framework to manage systems with
multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices.
There, memory devices of the same kind are grouped into memory types,
which are then put into memory tiers. To describe the performance of a
memory type, abstract distance is defined, which is directly proportional
to the memory latency and inversely proportional to the memory bandwidth.
To keep the code as simple as possible, fixed abstract distance is used in
dax/kmem to describe slow memory such as Optane DCPMM.

To support more memory types, in this series, we added the abstract
distance calculation algorithm management mechanism, provided an algorithm
implementation based on ACPI HMAT, and used the general abstract distance
calculation interface in dax/kmem driver. So, dax/kmem can support HBM
(high bandwidth memory) in addition to the original Optane DCPMM.


This patch (of 4):

The abstract distance may be calculated by various drivers, such as ACPI
HMAT, CXL CDAT, etc., while it may be used by various code that hot-adds
memory nodes, such as dax/kmem. To decouple the algorithm users and
the providers, the abstract distance calculation algorithms management
mechanism is implemented in this patch. It provides an interface for the
providers to register their implementations, and an interface for the users.

Multiple algorithm implementations can cooperate via calculating abstract
distance for different memory nodes. The preference of algorithm
implementations can be specified via priority (notifier_block.priority).
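The provider/user decoupling described above can be illustrated with a small priority-ordered callback chain in user-space C. Everything here is hypothetical and only mimics the idea of a notifier chain: `adist_alg`, `register_alg()`, `calculate_adistance()` and the two example providers are invented for the sketch and are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

enum { ALG_DONE, ALG_STOP };	/* DONE: not handled, try next; STOP: handled */

/* One registered abstract distance algorithm (hypothetical type). */
struct adist_alg {
	int priority;				/* higher runs first */
	int (*calc)(int node, int *adist);
	struct adist_alg *next;
};

static struct adist_alg *alg_chain;

/* Provider side: insert into the chain, keeping it sorted by priority. */
void register_alg(struct adist_alg *alg)
{
	struct adist_alg **pos = &alg_chain;

	assert(alg != NULL && alg->calc != NULL);
	while (*pos && (*pos)->priority >= alg->priority)
		pos = &(*pos)->next;
	alg->next = *pos;
	*pos = alg;
}

/* User side: ask each provider in turn until one handles the node. */
int calculate_adistance(int node, int *adist)
{
	for (struct adist_alg *alg = alg_chain; alg; alg = alg->next)
		if (alg->calc(node, adist) == ALG_STOP)
			return ALG_STOP;
	return ALG_DONE;	/* caller falls back to its own default */
}

/* Example provider that only knows about node 2 (made-up values). */
int table_calc(int node, int *adist)
{
	if (node != 2)
		return ALG_DONE;
	*adist = 1024;
	return ALG_STOP;
}

/* Example provider that only knows about node 3 (made-up values). */
int device_calc(int node, int *adist)
{
	if (node != 3)
		return ALG_DONE;
	*adist = 768;
	return ALG_STOP;
}
```

Each provider answers only for the nodes it has data for, so several providers can coexist; when none answers, the user keeps whatever default it started with.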

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Tested-by: Bharata B Rao <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Rafael J Wysocki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


Revision tags: v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5
# 51a23b1b 02-Aug-2023 Li Zhijian <[email protected]>

acpi,mm: fix typo sibiling -> sibling

First found this typo while reviewing memory tier code. Fix it with sed like:
$ sed -i 's/sibiling/sibling/g' $(git grep -l sibiling)

so the acpi one will be corrected as well.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Li Zhijian <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Len Brown <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


Revision tags: v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1
# bded67f8 06-Jul-2023 Miaohe Lin <[email protected]>

memory tier: rename destroy_memory_type() to put_memory_type()

It appears that destroy_memory_type() isn't a very good name because we
usually will not free the memory_type here. So rename it to a more
appropriate name i.e. put_memory_type().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Suggested-by: Huang, Ying <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Xiao Yang <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Vishal Verma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


Revision tags: v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7
# 1eeaa4fd 23-Sep-2022 Liu Shixin <[email protected]>

memory: move hotplug memory notifier priority to same file for easy sorting

The priority of hotplug memory callback is defined in a different file.
And there are some callers using numbers directly. Collect them together
into include/linux/memory.h for easy reading. This allows us to sort
their priorities more intuitively without additional comments.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liu Shixin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: zefan li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


Revision tags: v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2
# 467b171a 18-Aug-2022 Aneesh Kumar K.V <[email protected]>

mm/demotion: update node_is_toptier to work with memory tiers

With memory tier support we can have memory only NUMA nodes in the top
tier from which we want to avoid promotion tracking NUMA faults. Update
node_is_toptier to work with memory tiers. All NUMA nodes are by default
top tier nodes. With lower (slower) memory tiers added, we consider all
memory tiers above a memory tier having CPU NUMA nodes as a top memory
tier.
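One way to read the rule above is sketched below in C. The sketch assumes that each tier covers a fixed-width range of abstract distances (128 here), that the top-tier bound is the inclusive upper end of the slowest tier containing a CPU node, and that every tier counts as top tier when no such tier is known. The names `tier`, `toptier_upper_bound()` and `tier_is_toptier()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define CHUNK_SIZE 128	/* assumed width of one tier's adistance range */

/* Hypothetical description of one memory tier. */
struct tier {
	int adistance_start;	/* first abstract distance in the tier */
	bool has_cpu_node;	/* true if any node in the tier has CPUs */
};

/*
 * Inclusive upper bound of "top tier" abstract distances: the end of the
 * slowest tier that still contains a CPU node, or -1 if there is none.
 */
int toptier_upper_bound(const struct tier tiers[], int nr_tiers)
{
	int bound = -1;

	assert(nr_tiers >= 0);
	for (int i = 0; i < nr_tiers; i++) {
		int end = tiers[i].adistance_start + CHUNK_SIZE - 1;

		if (tiers[i].has_cpu_node && end > bound)
			bound = end;
	}
	return bound;
}

/* A tier is top tier if it starts at or before the bound. */
bool tier_is_toptier(const struct tier *t, int bound)
{
	return bound < 0 || t->adistance_start <= bound;
}
```

Under these assumptions a memory-only tier that is faster than the CPU tier still counts as top tier, which is the case the commit message is concerned with.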

[[email protected]: include missed header file, memory-tiers.h]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: mm/memory.c needs linux/memory-tiers.h]
[[email protected]: make toptier_distance inclusive upper bound of toptiers]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Wei Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hesham Almatary <[email protected]>
Cc: Jagdish Gediya <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


# 32008027 18-Aug-2022 Jagdish Gediya <[email protected]>

mm/demotion: demote pages according to allocation fallback order

Currently, a higher tier node can only be demoted to selected nodes on the
next lower tier as defined by the demotion path. This strict demotion
order does not work in all use cases (e.g. some use cases may want to
allow cross-socket demotion to another node in the same demotion tier as a
fallback when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order when
all the nodes in a higher tier are out of space: The page allocation can
fall back to any node from any lower tier, whereas the demotion order
doesn't allow that currently.

This patch adds support to get all the allowed demotion targets for a
memory tier. The demote_page_list() function is now modified to utilize
this allowed node mask as the fallback allocation mask.
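The idea of an allowed-targets mask can be sketched as a bitmask over node ids. This is an illustration only: it assumes at most 64 nodes, represents each node's tier as a rank where a larger rank means slower memory, and uses the hypothetical name `allowed_demotion_mask()`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build a bitmask of every node that sits in any tier slower than the
 * tier of `node`. tier_of[n] is the tier rank of node n (larger = slower).
 */
uint64_t allowed_demotion_mask(int node, int nr_nodes, const int tier_of[])
{
	uint64_t mask = 0;

	assert(nr_nodes >= 0 && nr_nodes <= 64);
	assert(node >= 0 && node < nr_nodes);
	for (int n = 0; n < nr_nodes; n++)
		if (tier_of[n] > tier_of[node])
			mask |= (uint64_t)1 << n;
	return mask;
}
```

A preferred target would still be tried first; the mask only widens the set of nodes the allocation may fall back to when that target is full.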

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jagdish Gediya <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Wei Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hesham Almatary <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


# b26ac6f3 18-Aug-2022 Aneesh Kumar K.V <[email protected]>

mm/demotion: drop memtier from memtype

Now that we track node-specific memtier in pg_data_t, we can drop memtier
from memtype.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Wei Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hesham Almatary <[email protected]>
Cc: Jagdish Gediya <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


# 6c542ab7 18-Aug-2022 Aneesh Kumar K.V <[email protected]>

mm/demotion: build demotion targets based on explicit memory tiers

This patch switches the demotion target building logic to use memory tiers
instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
default memory tier and additional memory tiers will be added by drivers
like dax kmem.

This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest
node in the immediately following memory tier is used as a demotion
target.
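The selection rule in the paragraph above can be sketched as follows. The sketch makes several assumptions: tiers are given as non-negative ranks (larger = slower), NUMA distances come from a small fixed-size matrix, and only a single closest node is returned even if several are equally close. `pick_demotion_target()`, `MAX_NODES` and `NO_TARGET` are hypothetical names.

```c
#include <assert.h>

#define MAX_NODES 8
#define NO_TARGET (-1)

/*
 * Return the node in the immediately following (next slower) tier that is
 * closest to `node`, or NO_TARGET if `node` is already in the slowest tier.
 */
int pick_demotion_target(int node, int nr_nodes, const int tier_of[],
			 const int distance[][MAX_NODES])
{
	int next_tier = -1;
	int best = NO_TARGET;

	assert(nr_nodes >= 0 && nr_nodes <= MAX_NODES);
	assert(node >= 0 && node < nr_nodes);

	/* Find the smallest tier rank that is larger than this node's rank. */
	for (int n = 0; n < nr_nodes; n++)
		if (tier_of[n] > tier_of[node] &&
		    (next_tier < 0 || tier_of[n] < next_tier))
			next_tier = tier_of[n];
	if (next_tier < 0)
		return NO_TARGET;

	/* Among nodes of that tier, keep the one with the shortest distance. */
	for (int n = 0; n < nr_nodes; n++) {
		if (tier_of[n] != next_tier)
			continue;
		if (best == NO_TARGET ||
		    distance[node][n] < distance[node][best])
			best = n;
	}
	return best;
}
```

On a two-socket layout with one slow node attached to each socket, each DRAM node would pick the slow node on its own socket.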

Since we are now only building demotion targets for N_MEMORY NUMA nodes,
the CPU hotplug calls are removed in this patch.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Wei Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hesham Almatary <[email protected]>
Cc: Jagdish Gediya <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


# 7b88bda3 18-Aug-2022 Aneesh Kumar K.V <[email protected]>

mm/demotion/dax/kmem: set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE

By default, all nodes are assigned to the default memory tier which is the
memory tier designated for nodes with DRAM.

Set the dax kmem device node's tier to a slower memory tier by setting its
abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE. Low-level drivers
like papr_scm or ACPI NFIT can initialize memory device type to a more
accurate value based on device tree details or HMAT. If the kernel
doesn't find the memory type initialized, a default slower memory type is
assigned by the kmem driver.

[[email protected]: assign correct memory type for multiple dax devices with the same node affinity]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Wei Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hesham Almatary <[email protected]>
Cc: Jagdish Gediya <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


# c6123a19 18-Aug-2022 Aneesh Kumar K.V <[email protected]>

mm/demotion: add hotplug callbacks to handle new numa node onlined

If a newly onlined NUMA node doesn't have an abstract distance assigned,
the kernel adds the NUMA node to the default memory tier.

[[email protected]: fix kernel error with memory hotplug]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Wei Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hesham Almatary <[email protected]>
Cc: Jagdish Gediya <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


# 91952440 18-Aug-2022 Aneesh Kumar K.V <[email protected]>

mm/demotion: move memory demotion related code

This moves memory demotion related code to mm/memory-tiers.c. No
functional change in this patch.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Wei Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hesham Almatary <[email protected]>
Cc: Jagdish Gediya <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>


# 992bf775 18-Aug-2022 Aneesh Kumar K.V <[email protected]>

mm/demotion: add support for explicit memory tiers

Patch series "mm/demotion: Memory tiers and demotion", v15.

The current kernel has the basic memory tiering support: Inactive pages on
a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA
node to make room for new allocations on the higher tier NUMA node.
Frequently accessed pages on a lower tier NUMA node can be migrated
(promoted) to a higher tier NUMA node to improve the performance.

In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy tier-by-tier by establishing the per-node
demotion targets based on the distances between nodes.

This current memory tier kernel implementation needs to be improved for
several important use cases:

* The current tier initialization code always initializes each
memory-only NUMA node into a lower tier. But a memory-only NUMA node
may have a high performance memory device (e.g. a DRAM-backed
memory-only node on a virtual machine) and that should be put into a
higher tier.

* The current tier hierarchy always puts CPU nodes into the top tier.
But on a system with HBM (e.g. GPU memory) devices, these memory-only
HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are
better to be placed into the next lower tier.

* Also because the current tier hierarchy always puts CPU nodes into the
top tier, when a CPU is hot-added (or hot-removed) and triggers a memory
node from CPU-less into a CPU node (or vice versa), the memory tier
hierarchy gets changed, even though no memory node is added or removed.
This can make the tier hierarchy unstable and make it difficult to
support tier-based memory accounting.

* A higher tier node can only be demoted to nodes with shortest distance
on the next lower tier as defined by the demotion path, not any other
node from any lower tier. This strict demotion order does not work in
all use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback when
the preferred demotion node is out of space), and has resulted in the
feature request for an interface to override the system-wide, per-node
demotion order from the userspace. This demotion order is also
inconsistent with the page allocation fallback order when all the nodes
in a higher tier are out of space: The page allocation can fall back to
any node from any lower tier, whereas the demotion order doesn't allow
that.

This patch series makes the creation of memory tiers explicit under the
control of device drivers.

Memory Tier Initialization
==========================

The Linux kernel presents memory devices as NUMA nodes and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract
distance. This allows for classifying memory devices with a specific
performance range into a memory tier.

By default, all memory nodes are assigned to the default tier with
abstract distance 512.

A device driver can move its memory nodes from the default tier. For
example, PMEM can move its memory nodes below the default tier, whereas
GPU can move its memory nodes above the default tier.

The kernel initialization code makes the decision on which exact tier a
memory node should be assigned to based on the requests from the device
drivers as well as the memory device hardware information provided by the
firmware.

Hot-adding/removing CPUs doesn't affect memory tier hierarchy.


This patch (of 10):

In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPU into the highest tier,
and builds the tier hierarchy by establishing the per-node demotion
targets based on the distances between nodes.

This current memory tier kernel implementation needs to be improved for
several important use cases:

The current tier initialization code always initializes each memory-only
NUMA node into a lower tier. But a memory-only NUMA node may have a high
performance memory device (e.g. a DRAM-backed memory-only node on a
virtual machine) that should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But
on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
these devices should be in the top tier, and DRAM nodes with CPUs are
better to be placed into the next lower tier.

With the current kernel, a higher tier node can only be demoted to nodes with
the shortest distance on the next lower tier as defined by the demotion path,
not any other node from any lower tier. This strict demotion order does
not work in all use cases (e.g. some use cases may want to allow
cross-socket demotion to another node in the same demotion tier as a
fallback when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order when
all the nodes in a higher tier are out of space: The page allocation can
fall back to any node from any lower tier, whereas the demotion order
doesn't allow that.

This patch series addresses the above by defining memory tiers explicitly.

The Linux kernel presents memory devices as NUMA nodes and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract
distance. This allows for classifying memory devices with a specific
performance range into a memory tier.

This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default
DRAM with abstract distance ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
Faster memory devices can be placed in these faster(higher) memory tiers.
Slower memory devices like persistent memory will have abstract distance
higher than the default DRAM level.
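The arithmetic behind these ranges can be sketched in C. The macro and function names below are hypothetical and only mirror the numbers quoted in the paragraph above (chunk size 128, default DRAM abstract distance 512); they are not copied from the kernel header.

```c
#include <assert.h>

/* Values taken from the text above; names are made up for this sketch. */
#define CHUNK_BITS	7			/* 1 << 7 == 128 */
#define CHUNK_SIZE	(1 << CHUNK_BITS)
#define ADISTANCE_DRAM	(4 * CHUNK_SIZE)	/* 512 */

/* Index of the chunk (memory tier range) an abstract distance falls into. */
int adistance_to_chunk(int adistance)
{
	assert(adistance >= 0);
	return adistance >> CHUNK_BITS;
}

/* First abstract distance of the chunk that contains `adistance`. */
int chunk_start(int adistance)
{
	return adistance_to_chunk(adistance) * CHUNK_SIZE;
}
```

With a chunk size of 128, abstract distances 0 to 511 fall into chunks 0 to 3, so exactly four faster tiers fit below the default DRAM value of 512, which itself opens chunk 4.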

[[email protected]: fix comment, per Aneesh]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: Wei Xu <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hesham Almatary <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Jagdish Gediya <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
