History log of /linux-6.15/kernel/dma/swiotlb.c (Results 1 – 25 of 177)
Revision Date Author Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3
# ba0fb44a 11-Aug-2024 Catalin Marinas <[email protected]>

dma-mapping: replace zone_dma_bits by zone_dma_limit

The hardware DMA limit might not be power of 2. When RAM range starts
above 0, say 4GB, DMA limit of 30 bits should end at 5GB. A single high
bit can not encode this limit.

Use a plain address for the DMA zone limit instead.

Since the DMA zone can now potentially span beyond the 4GB physical limit of
DMA32, make sure to use the DMA zone for GFP_DMA32 allocations in that case.

Signed-off-by: Catalin Marinas <[email protected]>
Co-developed-by: Baruch Siach <[email protected]>
Signed-off-by: Baruch Siach <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

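To make the limitation concrete, here is a standalone sketch with made-up numbers (not kernel code) contrasting a bit-count limit with a plain address limit when RAM starts at 4 GiB:

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-in for a bit-count DMA mask, for illustration only. */
#define BIT_LIMIT(n)    ((1ULL << (n)) - 1)

int main(void)
{
        uint64_t ram_start       = 4ULL << 30;                /* RAM begins at 4 GiB */
        uint64_t limit_from_bits = BIT_LIMIT(30);             /* ends at 1 GiB - 1   */
        uint64_t zone_dma_limit  = ram_start + (1ULL << 30);  /* ends at 5 GiB       */

        /* A single high bit cannot describe "RAM base + 1 GiB"; an address can. */
        printf("bit-encoded limit: %#llx\n", (unsigned long long)limit_from_bits);
        printf("address limit:     %#llx\n", (unsigned long long)zone_dma_limit);
        return 0;
}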


Revision tags: v6.11-rc2, v6.11-rc1, v6.10
# b69bdba5 12-Jul-2024 Yang Li <[email protected]>

swiotlb: fix kernel-doc description for swiotlb_del_transient

Describe the pool argument in the kernel-doc comment for
swiotlb_del_transient.

Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

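For context, kernel-doc describes each parameter with an @name: line; a generic, hypothetical example of the kind of line this patch adds (not the actual swiotlb comment):

/**
 * del_transient_example() - free a transient bounce buffer (hypothetical)
 * @addr: physical address of the bounce buffer
 * @pool: memory pool the buffer was allocated from; the missing description
 *        fixed by this patch is a line of this form
 *
 * Return: nonzero if @addr belonged to a transient pool and was released.
 */
static int del_transient_example(unsigned long addr, void *pool);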


# 7296f230 08-Jul-2024 Michael Kelley <[email protected]>

swiotlb: reduce swiotlb pool lookups

With CONFIG_SWIOTLB_DYNAMIC enabled, each round-trip map/unmap pair
in the swiotlb results in 6 calls to swiotlb_find_pool(). In multiple
places, the pool is found and used in one function, and then must
be found again in the next function that is called because only the
tlb_addr is passed as an argument. These are the six call sites:

dma_direct_map_page:
1. swiotlb_map -> swiotlb_tbl_map_single -> swiotlb_bounce

dma_direct_unmap_page:
2. dma_direct_sync_single_for_cpu -> is_swiotlb_buffer
3. dma_direct_sync_single_for_cpu -> swiotlb_sync_single_for_cpu ->
swiotlb_bounce
4. is_swiotlb_buffer
5. swiotlb_tbl_unmap_single -> swiotlb_del_transient
6. swiotlb_tbl_unmap_single -> swiotlb_release_slots

Reduce the number of calls by finding the pool at a higher level, and
passing it as an argument instead of searching again. A key change is
for is_swiotlb_buffer() to return a pool pointer instead of a boolean,
and then pass this pool pointer to subsequent swiotlb functions.

There are 9 occurrences of is_swiotlb_buffer() used to test if a buffer
is a swiotlb buffer before calling a swiotlb function. To reduce code
duplication in getting the pool pointer and passing it as an argument,
introduce inline wrappers for this pattern. The generated code is
essentially unchanged.

Since is_swiotlb_buffer() no longer returns a boolean, rename some
functions to reflect the change:

* swiotlb_find_pool() becomes __swiotlb_find_pool()
* is_swiotlb_buffer() becomes swiotlb_find_pool()
* is_xen_swiotlb_buffer() becomes xen_swiotlb_find_pool()

With these changes, a round-trip map/unmap pair requires only 2 pool
lookups (listed using the new names and wrappers):

dma_direct_unmap_page:
1. dma_direct_sync_single_for_cpu -> swiotlb_find_pool
2. swiotlb_tbl_unmap_single -> swiotlb_find_pool

These changes come from noticing the inefficiencies in a code review,
not from performance measurements. With CONFIG_SWIOTLB_DYNAMIC,
__swiotlb_find_pool() is not trivial, and it uses an RCU read lock,
so avoiding the redundant calls helps performance in a hot path.
When CONFIG_SWIOTLB_DYNAMIC is *not* set, the code size reduction
is minimal and the perf benefits are likely negligible, but no
harm is done.

No functional change is intended.

Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

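A standalone sketch of the pattern (hypothetical names and types, not the kernel's): the finder returns a pool pointer, a NULL check replaces the old boolean test, and an inline wrapper keeps the call sites short while passing the pool down instead of searching again.

#include <stdio.h>

struct pool { int id; };
struct device { struct pool *tlb_pool; };       /* hypothetical stand-ins */

/* Full lookup; plays the role of __swiotlb_find_pool() in the text above. */
static struct pool *find_pool(struct device *dev, unsigned long paddr)
{
        (void)paddr;
        return dev->tlb_pool;   /* real code walks an RCU-protected pool list */
}

/* Callee takes the pool as an argument instead of searching once more. */
static void sync_for_cpu(struct device *dev, unsigned long paddr, struct pool *p)
{
        (void)dev;
        printf("sync %#lx in pool %d\n", paddr, p->id);
}

/* Inline wrapper used where the old boolean is_swiotlb_buffer() test sat. */
static inline void maybe_sync_for_cpu(struct device *dev, unsigned long paddr)
{
        struct pool *p = find_pool(dev, paddr);         /* one lookup ...      */

        if (p)
                sync_for_cpu(dev, paddr, p);            /* ... reused below it */
}

int main(void)
{
        struct pool pool = { .id = 1 };
        struct device dev = { .tlb_pool = &pool };

        maybe_sync_for_cpu(&dev, 0x1000);
        return 0;
}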


Revision tags: v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9
# a6016aac 09-May-2024 Alexander Lobakin <[email protected]>

dma: fix DMA sync for drivers not calling dma_set_mask*()

There are several reports that the DMA sync shortcut broke non-coherent
devices.
dev->dma_need_sync is false after the &device allocation and if a driver
didn't call dma_set_mask*(), it will still be false even if the device
is not DMA-coherent and thus needs synchronizing. For historical
reasons, there are still many drivers that do not call it.
Invert the boolean, so that the sync will be performed by default and
the shortcut will be enabled only when calling dma_set_mask*().

Reported-by: Steven Price <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]
Reported-by: Marek Szyprowski <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]
Fixes: f406c8e4b770 ("dma: avoid redundant calls for sync operations")
Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Steven Price <[email protected]>
Tested-by: Marek Szyprowski <[email protected]>

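A minimal sketch of the inverted default (hypothetical field and function names): a freshly allocated device is assumed to need syncs, and only the dma_set_mask*() path may clear the flag.

#include <stdbool.h>

struct dev_state { bool dma_need_sync; };       /* hypothetical */

static void device_create(struct dev_state *d)
{
        d->dma_need_sync = true;        /* safe default: sync unless proven otherwise */
}

/* Only drivers that call dma_set_mask*() can earn the shortcut. */
static void dma_set_mask_path(struct dev_state *d, bool dev_is_coherent)
{
        d->dma_need_sync = !dev_is_coherent;
}

int main(void)
{
        struct dev_state d;

        device_create(&d);              /* driver never calls dma_set_mask*(): */
        /* d.dma_need_sync stays true, so syncs are not skipped                */
        dma_set_mask_path(&d, true);    /* coherent device opting in           */
        return d.dma_need_sync;         /* 0: shortcut enabled                 */
}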


# f406c8e4 07-May-2024 Alexander Lobakin <[email protected]>

dma: avoid redundant calls for sync operations

Quite often, at least on x86_64, devices do not need dma_sync operations.
Indeed, when dev_is_dma_coherent(dev) is true and
dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu()
and friends do nothing.

However, indirectly calling them when CONFIG_RETPOLINE=y consumes about
10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate.
Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%.

Add dev->need_dma_sync boolean and turn it off during the device
initialization (dma_set_mask()) depending on the setup:
dev_is_dma_coherent() for the direct DMA, !(sync_single_for_device ||
sync_single_for_cpu) or the new dma_map_ops flag, %DMA_F_CAN_SKIP_SYNC,
advertised for non-NULL DMA ops.
Then later, if/when swiotlb is used for the first time, the flag
is reset back to on, from swiotlb_tbl_map_single().

On iavf, the UDP trafficgen with XDP_DROP in skb mode test shows
+3-5% increase for direct DMA.

Suggested-by: Christoph Hellwig <[email protected]> # direct DMA shortcut
Co-developed-by: Eric Dumazet <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

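A standalone sketch of the resulting fast path (hypothetical names; the real logic is spread across dma-mapping inlines and swiotlb): a single predictable branch replaces the indirect call, and the first swiotlb bounce turns syncing back on.

#include <stdbool.h>
#include <stdio.h>

struct dev_state { bool need_dma_sync; };       /* hypothetical */

static void do_real_sync(struct dev_state *d)
{
        (void)d;
        puts("cache maintenance / bounce copy");
}

static void sync_for_cpu(struct dev_state *d)
{
        if (!d->need_dma_sync)
                return;                 /* shortcut: no indirect call at all */
        do_real_sync(d);
}

static void swiotlb_fallback_hit(struct dev_state *d)
{
        d->need_dma_sync = true;        /* bounce buffers always need syncing */
}

int main(void)
{
        struct dev_state d = { .need_dma_sync = false };

        sync_for_cpu(&d);               /* skipped   */
        swiotlb_fallback_hit(&d);
        sync_for_cpu(&d);               /* performed */
        return 0;
}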


Revision tags: v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4
# 327e2c97 08-Apr-2024 Michael Kelley <[email protected]>

swiotlb: remove alloc_size argument to swiotlb_tbl_map_single()

Currently swiotlb_tbl_map_single() takes alloc_align_mask and
alloc_size arguments to specify an swiotlb allocation that is larger
than mapping_size. This larger allocation is used solely by
iommu_dma_map_page() to handle untrusted devices that should not have
DMA visibility to memory pages that are partially used for unrelated
kernel data.

Having two arguments to specify the allocation is redundant. While
alloc_align_mask naturally specifies the alignment of the starting
address of the allocation, it can also implicitly specify the size
by rounding up the mapping_size to that alignment.

Additionally, the current approach has an edge case bug.
iommu_dma_map_page() already does the rounding up to compute the
alloc_size argument. But swiotlb_tbl_map_single() then calculates the
alignment offset based on the DMA min_align_mask, and adds that offset to
alloc_size. If the offset is non-zero, the addition may result in a value
that is larger than the max the swiotlb can allocate. If the rounding up
is done _after_ the alignment offset is added to the mapping_size (and
the original mapping_size conforms to the value returned by
swiotlb_max_mapping_size), then the max that the swiotlb can allocate
will not be exceeded.

In view of these issues, simplify the swiotlb_tbl_map_single() interface
by removing the alloc_size argument. Most call sites pass the same value
for mapping_size and alloc_size, and they pass alloc_align_mask as zero.
Just remove the redundant argument from these callers, as they will see
no functional change. For iommu_dma_map_page() also remove the alloc_size
argument, and have swiotlb_tbl_map_single() compute the alloc_size by
rounding up mapping_size after adding the offset based on min_align_mask.
This has the side effect of fixing the edge case bug but with no other
functional change.

Also add a sanity test on the alloc_align_mask. While IOMMU code
currently ensures the granule is not larger than PAGE_SIZE, if that
guarantee were to be removed in the future, the downstream effect on the
swiotlb might go unnoticed until strange allocation failures occurred.

Tested on an ARM64 system with 16K page size and some kernel test-only
hackery to allow modifying the DMA min_align_mask and the granule size
that becomes the alloc_align_mask. Tested these combinations with a
variety of original memory addresses and sizes, including those that
reproduce the edge case bug:

* 4K granule and 0 min_align_mask
* 4K granule and 0xFFF min_align_mask (4K - 1)
* 16K granule and 0xFFF min_align_mask
* 64K granule and 0xFFF min_align_mask
* 64K granule and 0x3FFF min_align_mask (16K - 1)

With the changes, all combinations pass.

Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

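The edge case is easiest to see with numbers. A standalone sketch with illustrative values (64 KiB IOMMU granule, 0xFFF min_align_mask, 256 KiB swiotlb maximum); only the order of the round-up and the offset addition differs:

#include <stdio.h>
#include <stddef.h>

#define MAX_SWIOTLB     0x40000UL       /* 128 slots * 2 KiB = 256 KiB       */
#define GRANULE         0x10000UL       /* illustrative 64 KiB IOMMU granule */
#define ROUND_UP(x, a)  (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
        size_t mapping_size = 0x3F000;  /* within swiotlb_max_mapping_size() */
        size_t offset       = 0x800;    /* orig_addr & min_align_mask        */

        /* Old order: round up to the granule first, add the offset later. */
        size_t old_alloc = ROUND_UP(mapping_size, GRANULE) + offset;    /* 0x40800 */

        /* New order: add the offset first, then round up once. */
        size_t new_alloc = ROUND_UP(mapping_size + offset, GRANULE);    /* 0x40000 */

        printf("old %#zx (%s), new %#zx (%s)\n",
               old_alloc, old_alloc > MAX_SWIOTLB ? "exceeds max" : "fits",
               new_alloc, new_alloc > MAX_SWIOTLB ? "exceeds max" : "fits");
        return 0;
}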


# 75961ffb 02-May-2024 Will Deacon <[email protected]>

swiotlb: initialise restricted pool list_head when SWIOTLB_DYNAMIC=y

Using restricted DMA pools (CONFIG_DMA_RESTRICTED_POOL=y) in conjunction
with dynamic SWIOTLB (CONFIG_SWIOTLB_DYNAMIC=y) leads to the following
crash when initialising the restricted pools at boot-time:

| Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
| Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
| pc : rmem_swiotlb_device_init+0xfc/0x1ec
| lr : rmem_swiotlb_device_init+0xf0/0x1ec
| Call trace:
| rmem_swiotlb_device_init+0xfc/0x1ec
| of_reserved_mem_device_init_by_idx+0x18c/0x238
| of_dma_configure_id+0x31c/0x33c
| platform_dma_configure+0x34/0x80

faddr2line reveals that the crash is in the list validation code:

include/linux/list.h:83
include/linux/rculist.h:79
include/linux/rculist.h:106
kernel/dma/swiotlb.c:306
kernel/dma/swiotlb.c:1695

because add_mem_pool() is trying to list_add_rcu() to a NULL
'mem->pools'.

Fix the crash by initialising the 'mem->pools' list_head in
rmem_swiotlb_device_init() before calling add_mem_pool().

Reported-by: Nikita Ioffe <[email protected]>
Tested-by: Nikita Ioffe <[email protected]>
Fixes: 1aaa736815eb ("swiotlb: allocate a new memory pool when existing pools are full")
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

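A minimal standalone sketch of the failure and the fix, using simplified types rather than the kernel's struct io_tlb_mem and list helpers: adding to a list whose head was never initialised dereferences NULL, so the head must be set up before add_mem_pool() runs.

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };
struct mem { struct list_head pools; };

static void init_list_head(struct list_head *h) { h->next = h->prev = h; }

static void add_mem_pool(struct mem *m, struct list_head *pool)
{
        /* list_add_rcu() dereferences m->pools.next: it must not be NULL. */
        pool->next = m->pools.next;
        pool->prev = &m->pools;
        m->pools.next->prev = pool;     /* crashes if the head is still zeroed */
        m->pools.next = pool;
}

int main(void)
{
        struct mem m = { { NULL, NULL } };      /* zeroed, like a fresh allocation */
        struct list_head pool;

        init_list_head(&m.pools);       /* the step added to rmem_swiotlb_device_init() */
        add_mem_pool(&m, &pool);
        puts("ok");
        return 0;
}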


Revision tags: v6.9-rc3, v6.9-rc2
# a1255cca 29-Mar-2024 Dexuan Cui <[email protected]>

swiotlb: do not set total_used to 0 in swiotlb_create_debugfs_files()

Sometimes the readout of /sys/kernel/debug/swiotlb/io_tlb_used and
io_tlb_used_hiwater can be a huge number (e.g. 18446744073709551615),
which is actually a negative number if we use "%ld" to print the number.

When swiotlb_create_default_debugfs() is running from late_initcall,
mem->total_used may already be non-zero, because the storage driver
may have already started to perform I/O operations: if the storage
driver is built-in, its probe() callback is called before late_initcall.

swiotlb_create_debugfs_files() should not blindly set mem->total_used
and mem->used_hiwater to 0; actually it doesn't have to initialize the
fields at all, because the fields, as part of the global struct
io_tlb_default_mem, have been implicitly initialized to zero.

Also don't explicitly set mem->transient_nslabs to 0.

Fixes: 8b0977ecc8b3 ("swiotlb: track and report io_tlb_used high water marks in debugfs")
Fixes: 02e765697038 ("swiotlb: add debugfs to track swiotlb transient pool usage")
Signed-off-by: Dexuan Cui <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Reviewed-by: ZhangPeng <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>



# e8068f2d 27-Mar-2024 Michael Kelley <[email protected]>

swiotlb: fix swiotlb_bounce() to do partial sync's correctly

In current code, swiotlb_bounce() may do partial sync's correctly in
some circumstances, but may incorrectly fail in other circumstances.
The failure cases require both of these to be true:

1) swiotlb_align_offset() returns a non-zero "offset" value
2) the tlb_addr of the partial sync area points into the first
"offset" bytes of the _second_ or subsequent swiotlb slot allocated
for the mapping

Code added in commit 868c9ddc182b ("swiotlb: add overflow checks
to swiotlb_bounce") attempts to WARN on the invalid case where
tlb_addr points into the first "offset" bytes of the _first_
allocated slot. But there's no way for swiotlb_bounce() to distinguish
the first slot from the second and subsequent slots, so the WARN
can be triggered incorrectly when #2 above is true.

Related, current code calculates an adjustment to the orig_addr stored
in the swiotlb slot. The adjustment compensates for the difference
in the tlb_addr used for the partial sync vs. the tlb_addr for the full
mapping. The adjustment is stored in the local variable tlb_offset.
But when #1 and #2 above are true, it's valid for this adjustment to
be negative. In such case the arithmetic to adjust orig_addr produces
the wrong result due to tlb_offset being declared as unsigned.

Fix these problems by removing the over-constraining validations added
in 868c9ddc182b. Change the declaration of tlb_offset to be signed
instead of unsigned so the adjustment arithmetic works correctly.

Tested with a test-only hack to how swiotlb_tbl_map_single() calls
swiotlb_bounce(). Instead of calling swiotlb_bounce() just once
for the entire mapped area, do a loop with each iteration doing
only a 128 byte partial sync until the entire mapped area is
sync'ed. Then with swiotlb=force on the kernel boot line, run a
variety of raw disk writes followed by read and verification of
all bytes of the written data. The storage device has DMA
min_align_mask set, and the writes are done with a variety of
original buffer memory address alignments and overall buffer
sizes. For many of the combinations, current code triggers the
WARN statements, or the data verification fails. With the fixes,
no WARNs occur and all verifications pass.

Fixes: 5f89468e2f06 ("swiotlb: manipulate orig_addr when tlb_addr has offset")
Fixes: 868c9ddc182b ("swiotlb: add overflow checks to swiotlb_bounce")
Signed-off-by: Michael Kelley <[email protected]>
Dominique Martinet <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

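A toy demonstration of the signedness problem (values are made up): when the correction is legitimately negative, an unsigned variable wraps instead.

#include <stdio.h>

int main(void)
{
        unsigned int a = 0x400, b = 0x600;

        unsigned int bad  = a - b;              /* wraps to 0xfffffe00        */
        int          good = (int)a - (int)b;    /* -0x200, the intended value */

        printf("unsigned tlb_offset: %#x\n", bad);
        printf("signed tlb_offset:   %d\n", good);
        return 0;
}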


# af133562 25-Mar-2024 Petr Tesarik <[email protected]>

swiotlb: extend buffer pre-padding to alloc_align_mask if necessary

Allow a buffer pre-padding of up to alloc_align_mask, even if it requires
allocating additional IO TLB slots.

If the allocation alignment is bigger than IO_TLB_SIZE and min_align_mask
covers any non-zero bits in the original address between IO_TLB_SIZE and
alloc_align_mask, these bits are not preserved in the swiotlb buffer
address.

To fix this case, increase the allocation size and use a larger offset
within the allocated buffer. As a result, extra padding slots may be
allocated before the mapping start address.

Leave orig_addr in these padding slots initialized to INVALID_PHYS_ADDR.
These slots do not correspond to any CPU buffer, so attempts to sync the
data should be ignored.

The padding slots should be automatically released when the buffer is
unmapped. However, swiotlb_tbl_unmap_single() takes only the address of the
DMA buffer slot, not the first padding slot. Save the number of padding
slots in struct io_tlb_slot and use it to adjust the slot index in
swiotlb_release_slots(), so all allocated slots are properly freed.

Fixes: 2fd4fa5d3fb5 ("swiotlb: Fix alignment checks when both allocation and DMA masks are present")
Link: https://lore.kernel.org/linux-iommu/[email protected]/
Signed-off-by: Petr Tesarik <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

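A minimal sketch of the bookkeeping (hypothetical names): the padding-slot count recorded at map time lets the release path step back from the slot holding the DMA start address to the first slot that was actually allocated.

struct slot_meta { unsigned int pad_slots; };   /* hypothetical */

/* index: slot that tlb_addr points into; returns the first slot to free. */
static unsigned int first_slot_to_free(const struct slot_meta *slots,
                                       unsigned int index)
{
        return index - slots[index].pad_slots;
}

int main(void)
{
        struct slot_meta slots[8] = { [3] = { .pad_slots = 2 } };

        return first_slot_to_free(slots, 3) == 1 ? 0 : 1;      /* slots 1..3 freed */
}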


Revision tags: v6.9-rc1, v6.8
# 14cebf68 08-Mar-2024 Will Deacon <[email protected]>

swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE

For swiotlb allocations >= PAGE_SIZE, the slab search historically
adjusted the stride to avoid checking unaligned slots. This had the
side-effect of aligning large mapping requests to PAGE_SIZE, but that
was broken by 0eee5ae10256 ("swiotlb: fix slot alignment checks").

Since this alignment could be relied upon by drivers, reinstate PAGE_SIZE
alignment for swiotlb mappings >= PAGE_SIZE.

Reported-by: Michael Kelley <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Reviewed-by: Robin Murphy <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

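A sketch of the reinstated behaviour (illustration only, not the exact kernel expression): for mappings of at least a page, page alignment is folded into the allocation alignment so the slot search only probes page-aligned candidates.

#define EX_PAGE_SIZE 4096UL             /* illustrative */

static unsigned long ex_align_mask(unsigned long alloc_align_mask,
                                   unsigned long alloc_size)
{
        if (alloc_size >= EX_PAGE_SIZE)
                alloc_align_mask |= EX_PAGE_SIZE - 1;
        return alloc_align_mask;
}

int main(void)
{
        /* An 8 KiB mapping now searches with at least page alignment. */
        return ex_align_mask(0, 8192) == 0xFFF ? 0 : 1;
}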


# 51b30ecb 08-Mar-2024 Will Deacon <[email protected]>

swiotlb: Fix alignment checks when both allocation and DMA masks are present

Nicolin reports that swiotlb buffer allocations fail for an NVME device
behind an IOMMU using 64KiB pages. This is because we end up with a
minimum allocation alignment of 64KiB (for the IOMMU to map the buffer
safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME
page (i.e. preserving the 4KiB page offset from the original allocation).
If the original address is not 4KiB-aligned, the allocation will fail
because swiotlb_search_pool_area() erroneously compares these unmasked
bits with the 64KiB-aligned candidate allocation.

Tweak swiotlb_search_pool_area() so that the DMA alignment mask is
reduced based on the required alignment of the allocation.

Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
Link: https://lore.kernel.org/r/[email protected]
Reported-by: Nicolin Chen <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

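A sketch of the tweak (illustration only): alignment bits that the allocation alignment already guarantees are dropped from the check against the original address. With the values from the report, a 0xFFF min_align_mask and a 0xFFFF allocation alignment leave nothing to compare, so a non-4KiB-aligned original address no longer fails the search.

static unsigned long ex_check_mask(unsigned long min_align_mask,
                                   unsigned long alloc_align_mask)
{
        return min_align_mask & ~alloc_align_mask;
}

int main(void)
{
        return ex_check_mask(0xFFF, 0xFFFF) == 0 ? 0 : 1;
}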


# cbf53074 08-Mar-2024 Will Deacon <[email protected]>

swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()

core-api/dma-api-howto.rst states the following properties of
dma_alloc_coherent():

| The CPU virtual address and the DMA address are both guaranteed to
| be aligned to the smallest PAGE_SIZE order which is greater than or
| equal to the requested size.

However, swiotlb_alloc() passes zero for the 'alloc_align_mask'
parameter of swiotlb_find_slots() and so this property is not upheld.
Instead, allocations larger than a page are aligned to PAGE_SIZE.

Calculate the mask corresponding to the page order suitable for holding
the allocation and pass that to swiotlb_find_slots().

Fixes: e81e99bacc9f ("swiotlb: Support aligned swiotlb buffers")
Signed-off-by: Will Deacon <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

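A standalone sketch of the calculation (simplified get_order(), illustrative page size): the mask is derived from the smallest power-of-two number of pages that holds the request, which is exactly the guarantee quoted above.

#include <stdio.h>

#define EX_PAGE_SHIFT 12UL

/* Order of the smallest power-of-two number of pages >= size. */
static unsigned int ex_get_order(unsigned long size)
{
        unsigned int order = 0;

        while ((1UL << (order + EX_PAGE_SHIFT)) < size)
                order++;
        return order;
}

int main(void)
{
        unsigned long size = 3 * 4096;          /* 12 KiB request            */
        unsigned long mask = (1UL << (ex_get_order(size) + EX_PAGE_SHIFT)) - 1;

        /* 12 KiB rounds to a 16 KiB order, so the buffer is 16 KiB aligned. */
        printf("alloc_align_mask = %#lx\n", mask);
        return 0;
}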


# 823353b7 08-Mar-2024 Will Deacon <[email protected]>

swiotlb: Enforce page alignment in swiotlb_alloc()

When allocating pages from a restricted DMA pool in swiotlb_alloc(),
the buffer address is blindly converted to a 'struct page *' that is
returned to the caller. In the unlikely event of an allocation bug,
page-unaligned addresses are not detected and slots can silently be
double-allocated.

Add a simple check of the buffer alignment in swiotlb_alloc() to make
debugging a little easier if something has gone wonky.

Signed-off-by: Will Deacon <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>



# 04867a7a 08-Mar-2024 Will Deacon <[email protected]>

swiotlb: Fix double-allocation of slots due to broken alignment handling

Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"),
which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment
checks"), causes a functional regression with vsock in a virtual machine
using bouncing via a restricted DMA SWIOTLB pool.

When virtio allocates the virtqueues for the vsock device using
dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
allocations if 'area->index' was left unaligned by a previous allocation
from the buffer:

# Final address in brackets is the SWIOTLB address returned to the caller
| virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
| virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
| virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)

This ends badly (typically buffer corruption and/or a hang) because
swiotlb_alloc() is expecting a page-aligned allocation and so blindly
returns a pointer to the 'struct page' corresponding to the allocation,
therefore double-allocating the first half (2KiB slot) of the 4KiB page.

Fix the problem by treating the allocation alignment separately to any
additional alignment requirements from the device, using the maximum
of the two as the stride to search the buffer slots and taking care
to ensure a minimum of page-alignment for buffers larger than a page.

This also resolves swiotlb allocation failures occurring due to the
inclusion of ~PAGE_MASK in 'iotlb_align_mask' for large allocations and
resulting in alignment requirements exceeding swiotlb_max_mapping_size().

Fixes: bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix")
Fixes: 0eee5ae10256 ("swiotlb: fix slot alignment checks")
Signed-off-by: Will Deacon <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Tested-by: Nicolin Chen <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

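A sketch of the separation (illustration only, not the exact kernel code): the allocation alignment and the device's min_align_mask stay distinct, and the slot-search stride comes from whichever is stricter, so ~PAGE_MASK is never folded into the per-device mask.

#define EX_IO_TLB_SHIFT 11UL            /* 2 KiB slots, as in swiotlb */

static unsigned long ex_stride(unsigned long alloc_align_mask,
                               unsigned long iotlb_align_mask)
{
        unsigned long m = alloc_align_mask > iotlb_align_mask ?
                          alloc_align_mask : iotlb_align_mask;

        return (m >> EX_IO_TLB_SHIFT) + 1;      /* at least one slot per step */
}

int main(void)
{
        /* Page-aligned allocation, 2 KiB device alignment: stride of 2 slots. */
        return ex_stride(0xFFF, 0x7FF) == 2 ? 0 : 1;
}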


Revision tags: v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1
# 02e76569 09-Jan-2024 ZhangPeng <[email protected]>

swiotlb: add debugfs to track swiotlb transient pool usage

Introduce a new debugfs interface io_tlb_transient_nslabs. The device
driver can create a new swiotlb transient memory pool once default
memory pool is full. To export the swiotlb transient memory pool usage
via debugfs would help the user estimate the size of transient swiotlb
memory pool or analyze device driver memory leak issue.

Signed-off-by: ZhangPeng <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>



# 3dc2f209 09-Jan-2024 ZhangPeng <[email protected]>

swiotlb: check alloc_size before the allocation of a new memory pool

The allocation request for swiotlb contiguous memory greater than
128*2KB cannot be fulfilled because it exceeds the maximum contiguous
memory limit. If the swiotlb memory we allocate is larger than 128*2KB,
swiotlb_find_slots() will still schedule the allocation of a new memory
pool, which will increase memory overhead.

Fix it by adding a check with alloc_size no more than 128*2KB before
scheduling the allocation of a new memory pool in swiotlb_find_slots().

Signed-off-by: ZhangPeng <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>



Revision tags: v6.7, v6.7-rc8
# 5e0a760b 28-Dec-2023 Kirill A. Shutemov <[email protected]>

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.

To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>



Revision tags: v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4
# 55c54386 01-Dec-2023 Petr Tesarik <[email protected]>

swiotlb: reduce area lock contention for non-primary IO TLB pools

If multiple areas and multiple IO TLB pools exist, first iterate the
current CPU specific area in all pools. Then move to the next area index.

This is best illustrated by a diagram:

        area 0 | area 1 | ... | area M |
pool 0    A    |   B    | ... |   C    |
pool 1    D    |   E    |
...
pool N    F    |   G    | ... |   H    |

Currently, each pool is searched before moving on to the next pool,
i.e. the search order is A, B ... C, D, E ... F, G ... H. With this patch,
each area is searched in all pools before moving on to the next area,
i.e. the search order is A, D ... F, B, E ... G ... C ... H.

Note that preemption is not disabled, and raw_smp_processor_id() may not
return a stable result, but it is called only once to determine the initial
area index. The search will iterate over all areas eventually, even if the
current task is preempted.

Next, some pools may have fewer (but not more) areas than default_nareas.
Skip such pools if the distance from the initial area index is greater than
pool->nareas. This logic ensures that for every pool the search starts in
the initial CPU's own area and never tries any area twice.

To verify performance impact, I booted the kernel with a minimum pool
size ("swiotlb=512,4,force"), so multiple pools get allocated, and I ran
these benchmarks:

- small: single-threaded I/O of 4 KiB blocks,
- big: single-threaded I/O of 64 KiB blocks,
- 4way: 4-way parallel I/O of 4 KiB blocks.

The "var" column in the tables below is the coefficient of variance over 5
runs of the test, the "diff" column is the relative difference against base
in read-write I/O bandwidth (MiB/s).

Tested on an x86 VM against a QEMU virtio SATA driver backed by a RAM-based
block device on the host:

        base         patched
         var          var      diff
small   0.69%        0.62%    +25.4%
big     2.14%        2.27%    +25.7%
4way    2.65%        1.70%    +23.6%

Tested on a Raspberry Pi against a class-10 A1 microSD card:

        base         patched
         var          var      diff
small   0.53%        1.96%     -0.3%
big     0.02%        0.57%     +0.8%
4way    6.17%        0.40%     +0.3%

These results confirm that there is significant performance boost in the
software IO TLB slot allocation itself. Where performance is dominated by
actual hardware, there is no measurable change.

Signed-off-by: Petr Tesarik <[email protected]>
Reviewed-by: Mirsad Todorovac <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

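A standalone sketch of the new search order (simplified; the real code also skips pools that have fewer areas than the default): the starting CPU's area index is tried in every pool before the area index advances.

#include <stdio.h>

#define NAREAS 4
#define NPOOLS 3

/* free_slots[pool][area]: nonzero means that area still has a free slot. */
static int try_area(int free_slots[NPOOLS][NAREAS], int pool, int area)
{
        return free_slots[pool][area] ? pool * NAREAS + area : -1;
}

static int find_slot(int free_slots[NPOOLS][NAREAS], int start_area)
{
        for (int a = 0; a < NAREAS; a++) {              /* outer: area index */
                int area = (start_area + a) % NAREAS;

                for (int p = 0; p < NPOOLS; p++) {      /* inner: every pool */
                        int slot = try_area(free_slots, p, area);

                        if (slot >= 0)
                                return slot;
                }
        }
        return -1;                                      /* all areas exhausted */
}

int main(void)
{
        int free_slots[NPOOLS][NAREAS] = { { 0 } };

        free_slots[2][1] = 1;                   /* only pool 2, area 1 has space */
        printf("found slot %d\n", find_slot(free_slots, 1));
        return 0;
}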


Revision tags: v6.7-rc3, v6.7-rc2, v6.7-rc1
# 53c87e84 08-Nov-2023 Petr Tesarik <[email protected]>

swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMIC

Limit the free list length to the size of the IO TLB. Transient pool can be
smaller than IO_TLB_SEGSIZE, but the free list is initialized with the
assumption that the total number of slots is a multiple of IO_TLB_SEGSIZE.
As a result, swiotlb_area_find_slots() may allocate slots past the end of
a transient IO TLB buffer.

Reported-by: Niklas Schnelle <[email protected]>
Closes: https://lore.kernel.org/linux-iommu/[email protected]/
Fixes: 79636caad361 ("swiotlb: if swiotlb is full, fall back to a transient memory pool")
Cc: [email protected]
Signed-off-by: Petr Tesarik <[email protected]>
Reviewed-by: Halil Pasic <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>



# a5e3b127 02-Nov-2023 Petr Tesarik <[email protected]>

swiotlb: do not free decrypted pages if dynamic

Fix these two error paths:

1. When set_memory_decrypted() fails, pages may be left fully or partially
decrypted.

2. Decrypted pages may be freed if swiotlb_alloc_tlb() determines that the
physical address is too high.

To fix the first issue, call set_memory_encrypted() on the allocated region
after a failed decryption attempt. If that also fails, leak the pages.

To fix the second issue, check that the TLB physical address is below the
requested limit before decrypting.

Let the caller differentiate between unsuitable physical address (=> retry
from a lower zone) and allocation failures (=> no point in retrying).

Cc: [email protected]
Fixes: 79636caad361 ("swiotlb: if swiotlb is full, fall back to a transient memory pool")
Signed-off-by: Petr Tesarik <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

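A compact sketch of the corrected error handling (toy helpers standing in for set_memory_decrypted()/set_memory_encrypted(); the physical-address check mentioned above happens in the caller before this point): a failed decryption is undone before freeing, and if even that fails the pages are deliberately leaked rather than returned in an unknown state.

#include <stdlib.h>

static int set_decrypted(void *p, size_t n) { (void)p; (void)n; return 0; }
static int set_encrypted(void *p, size_t n) { (void)p; (void)n; return 0; }

static void *alloc_tlb_sketch(size_t nbytes)
{
        void *p = malloc(nbytes);

        if (!p)
                return NULL;

        if (set_decrypted(p, nbytes)) {
                if (set_encrypted(p, nbytes))
                        return NULL;            /* cannot restore: leak on purpose     */
                free(p);                        /* fully encrypted again: safe to free */
                return NULL;
        }
        return p;
}

int main(void)
{
        free(alloc_tlb_sketch(64));
        return 0;
}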


Revision tags: v6.6
# d5090484 25-Oct-2023 Petr Tesarik <[email protected]>

swiotlb: do not try to allocate a TLB bigger than MAX_ORDER pages

When allocating a new pool at runtime, reduce the number of slabs so
that the allocation order is at most MAX_ORDER. This avoids a kernel
warning in __alloc_pages().

The warning is relatively benign, because the pool size is subsequently
reduced when allocation fails, but it is silly to start with a request
that is known to fail, especially since this is the default behavior if
the kernel is built with CONFIG_SWIOTLB_DYNAMIC=y and booted without any
swiotlb= parameter.

Reported-by: Ben Greear <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Fixes: 1aaa736815eb ("swiotlb: allocate a new memory pool when existing pools are full")
Signed-off-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>



Revision tags: v6.6-rc7
# 1132a1dc 20-Oct-2023 Sean Christopherson <[email protected]>

swiotlb: rewrite comment explaining why the source is preserved on DMA_FROM_DEVICE

Rewrite the comment explaining why swiotlb copies the original buffer to
the TLB buffer before initiating DMA *from* the device, i.e. before the
device DMAs into the TLB buffer. The existing comment's argument that
preserving the original data can prevent a kernel memory leak is bogus.

If the driver that triggered the mapping _knows_ that the device will
overwrite the entire mapping, or the driver will consume only the written
parts, then copying from the original memory is completely pointless.

If neither of the above holds true, then copying from the original adds
value only if preserving the data is necessary for functional
correctness, or the driver explicitly initialized the original memory.
If the driver didn't initialize the memory, then copying the original
buffer to the TLB buffer simply changes what kernel data is leaked to
user space.

Writing the entire TLB buffer _does_ prevent leaking stale TLB buffer
data from a previous bounce, but that can be achieved by simply zeroing
the TLB buffer when grabbing a slot.

The real reason swiotlb ended up initializing the TLB buffer with the
original buffer is that it's necessary to make swiotlb operate as
transparently as possible, i.e. to behave as closely as possible to
hardware, and to avoid corrupting the original buffer, e.g. if the driver
knows the device will do partial writes and is relying on the unwritten
data to be preserved.

Reviewed-by: Robin Murphy <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>



Revision tags: v6.6-rc6, v6.6-rc5, v6.6-rc4
# 2d5780bb 26-Sep-2023 Petr Tesarik <[email protected]>

swiotlb: fix the check whether a device has used software IO TLB

When CONFIG_SWIOTLB_DYNAMIC=y, devices which do not use the software IO TLB
can avoid swiotlb lookup. A flag is added by commit 1395706a1490 ("swiotlb:
search the software IO TLB only if the device makes use of it"), the flag
is correctly set, but it is then never checked. Add the actual check here.

Note that this code is an alternative to the default pool check, not an
additional check, because:

1. swiotlb_find_pool() also searches the default pool;
2. if dma_uses_io_tlb is false, the default swiotlb pool is not used.

Tested in a KVM guest against a QEMU RAM-backed SATA disk over virtio and
*not* using software IO TLB, this patch increases IOPS by approx 2% for
4-way parallel I/O.

The write memory barrier in swiotlb_dyn_alloc() is not needed, because a
newly allocated pool must always be observed by swiotlb_find_slots() before
an address from that pool is passed to is_swiotlb_buffer().

Correctness was verified using the following litmus test:

C swiotlb-new-pool

(*
 * Result: Never
 *
 * Check that a newly allocated pool is always visible when the
 * corresponding swiotlb buffer is visible.
 *)

{
        mem_pools = default;
}

P0(int **mem_pools, int *pool)
{
        /* add_mem_pool() */
        WRITE_ONCE(*pool, 999);
        rcu_assign_pointer(*mem_pools, pool);
}

P1(int **mem_pools, int *flag, int *buf)
{
        /* swiotlb_find_slots() */
        int *r0;
        int r1;

        rcu_read_lock();
        r0 = READ_ONCE(*mem_pools);
        r1 = READ_ONCE(*r0);
        rcu_read_unlock();

        if (r1) {
                WRITE_ONCE(*flag, 1);
                smp_mb();
        }

        /* device driver (presumed) */
        WRITE_ONCE(*buf, r1);
}

P2(int **mem_pools, int *flag, int *buf)
{
        /* device driver (presumed) */
        int r0 = READ_ONCE(*buf);

        /* is_swiotlb_buffer() */
        int r1;
        int *r2;
        int r3;

        smp_rmb();
        r1 = READ_ONCE(*flag);
        if (r1) {
                /* swiotlb_find_pool() */
                rcu_read_lock();
                r2 = READ_ONCE(*mem_pools);
                r3 = READ_ONCE(*r2);
                rcu_read_unlock();
        }
}

exists (2:r0<>0 /\ 2:r3=0) (* Not found. *)

Fixes: 1395706a1490 ("swiotlb: search the software IO TLB only if the device makes use of it")
Reported-by: Jonathan Corbet <[email protected]>
Closes: https://lore.kernel.org/linux-iommu/[email protected]/
Signed-off-by: Petr Tesarik <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>



Revision tags: v6.6-rc3, v6.6-rc2
# a6a24176 11-Sep-2023 Ross Lagerwall <[email protected]>

swiotlb: use the calculated number of areas

Commit 8ac04063354a ("swiotlb: reduce the number of areas to match
actual memory pool size") calculated the reduced number of areas in
swiotlb_init_remap() but didn't actually use the value. Replace usage of
default_nareas accordingly.

Fixes: 8ac04063354a ("swiotlb: reduce the number of areas to match actual memory pool size")
Signed-off-by: Ross Lagerwall <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>


