History log of /linux-6.15/net/core/skbuff.c (Results 1 – 25 of 891)
Revision    Date    Author    Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14
# 983e0e4e 18-Mar-2025 Pauli Virtanen <[email protected]>


net-timestamp: COMPLETION timestamp on packet tx completion

Add SOF_TIMESTAMPING_TX_COMPLETION, for requesting a software timestamp
when hardware reports a packet completed.

Completion tstamp is useful for Bluetooth, as hardware timestamps do not
exist in the HCI specification except for ISO packets, and the hardware
has a queue where packets may wait. In this case the software SND
timestamp only reflects the kernel-side part of the total latency
(usually small) and the queue length (usually 0 unless HW buffers are
congested), whereas the completion report time is more informative of
the true latency.

It may also be useful in other cases where HW TX timestamps cannot be
obtained and the user wants to estimate an upper bound on when the TX
probably happened.

Signed-off-by: Pauli Virtanen <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
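
A hedged user-space sketch of requesting the new timestamp (the
SOF_TIMESTAMPING_TX_COMPLETION flag comes from this patch; delivery via
the usual SO_TIMESTAMPING error-queue flow is assumed):

	#include <linux/net_tstamp.h>
	#include <sys/socket.h>

	/* Request software completion timestamps on a socket; reports are
	 * assumed to arrive on the error queue like other tx timestamps.
	 */
	static int enable_completion_tstamp(int fd)
	{
		int val = SOF_TIMESTAMPING_TX_COMPLETION | /* stamp at completion */
			  SOF_TIMESTAMPING_SOFTWARE |      /* deliver sw stamps */
			  SOF_TIMESTAMPING_OPT_ID;         /* tag with a counter */

		return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
				  &val, sizeof(val));
	}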



Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5
# 859d6acd 25-Feb-2025 Alexander Lobakin <[email protected]>


net: skbuff: introduce napi_skb_cache_get_bulk()

Add a function to get an array of skbs from the NAPI percpu cache.
It's supposed to be a drop-in replacement for
kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC) and
xdp_alloc_skb_bulk(GFP_ATOMIC). The difference (apart from the
requirement to call it only from the BH) is that it tries to use
as many NAPI cache entries for skbs as possible, and allocate new
ones only if needed.

The logic is as follows:

* there are enough skbs in the cache: decache them and return to the
caller;
* not enough: try refilling the cache first. If there are now enough
skbs, return;
* still not enough: try allocating skbs directly to the output array
with %GFP_ZERO, maybe we'll be able to get some. If there's now
enough, return;
* still not enough: return as many as we were able to obtain.

Most of the time, when called from the NAPI polling loop, the first case
holds, sometimes (rarely) the second. The third and the fourth occur
only under heavy memory pressure.
It can save a significant number of CPU cycles when there are GRO cycles
and/or Tx completion cycles (anything that descends to
napi_skb_cache_put()) happening on this CPU.

Tested-by: Daniel Xu <[email protected]>
Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
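
A hedged sketch of a BH-context caller (the fill-array-and-return-count
signature is an assumption based on this changelog, not a quote of the
patch):

	/* Bulk-get skbs from the NAPI percpu cache; assumed to fill @skbs
	 * and return how many entries were obtained, possibly fewer than
	 * requested under memory pressure.
	 */
	void *skbs[16];
	u32 i, got;

	got = napi_skb_cache_get_bulk(skbs, ARRAY_SIZE(skbs));
	for (i = 0; i < got; i++) {
		struct sk_buff *skb = skbs[i];
		/* ... initialise and use each skb ... */
	}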



Revision tags: v6.14-rc4
# de2c2118 22-Feb-2025 Philo Lu <[email protected]>


ipvs: Always clear ipvs_property flag in skb_scrub_packet()

We found an issue when using bpf_redirect with ipvs NAT mode after
commit ff70202b2d1a ("dev_forward_skb: do not scrub skb mark within
the same name space"). Particularly, we use bpf_redirect to return
the skb directly back to the netif it comes from, i.e., xnet is
false in skb_scrub_packet(), and then ipvs_property is preserved
and SNAT is skipped in the rx path.

ipvs_property was already being cleared when the netns changed, in
commit 2b5ec1a5f973 ("netfilter/ipvs: clear ipvs_property flag when
SKB net namespace changed"). This patch clears it regardless of
netns.

Fixes: 2b5ec1a5f973 ("netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed")
Signed-off-by: Philo Lu <[email protected]>
Acked-by: Julian Anastasov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
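
A hedged sketch of the shape of the fix, with the clear hoisted above
the xnet-only early return (illustrative, not the verbatim diff):

	void skb_scrub_packet(struct sk_buff *skb, bool xnet)
	{
		/* ... scrubbing done for every packet ... */
	#if IS_ENABLED(CONFIG_IP_VS)
		skb->ipvs_property = 0;	/* now cleared regardless of netns */
	#endif
		if (!xnet)
			return;

		/* ... scrubbing done only when crossing netns ... */
	}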



# b3b81e6b 20-Feb-2025 Jason Xing <[email protected]>


bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback

Support the ACK case for bpf timestamping.

Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_ACK_CB. This
callback will occur at the same timestamping point as the user
space's SCM_TSTAMP_ACK. The BPF program can use it to get the
same SCM_TSTAMP_ACK timestamp without modifying the user-space
application.

This patch extends txstamp_ack to two bits: 1 stands for
SO_TIMESTAMPING mode, 2 for the bpf extension.

Signed-off-by: Jason Xing <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]
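
A hedged BPF-side sketch of consuming this callback together with the
SND/SCHED callbacks from the rest of this series (shown in the following
entries); the timestamp handling itself is illustrative:

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	SEC("sockops")
	int tx_tstamp(struct bpf_sock_ops *skops)
	{
		switch (skops->op) {
		case BPF_SOCK_OPS_TSTAMP_SCHED_CB:  /* SCM_TSTAMP_SCHED point */
		case BPF_SOCK_OPS_TSTAMP_SND_SW_CB: /* software SCM_TSTAMP_SND */
		case BPF_SOCK_OPS_TSTAMP_SND_HW_CB: /* hardware SCM_TSTAMP_SND */
		case BPF_SOCK_OPS_TSTAMP_ACK_CB:    /* SCM_TSTAMP_ACK point */
			bpf_printk("tstamp op=%d t=%llu",
				   skops->op, bpf_ktime_get_ns());
			break;
		}
		return 1;
	}

	char _license[] SEC("license") = "GPL";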



# 2deaf7f4 20-Feb-2025 Jason Xing <[email protected]>


bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback

Support hw SCM_TSTAMP_SND case for bpf timestamping.

Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_HW_CB. This
callback will occur at the same timestamping point as the user
space's hardware SCM_TSTAMP_SND. The BPF program can use it to
get the same SCM_TSTAMP_SND timestamp without modifying the
user-space application.

To avoid increasing code complexity, replace SKBTX_HW_TSTAMP with
SKBTX_HW_TSTAMP_NOBPF instead of changing the numerous driver-side
callers that use SKBTX_HW_TSTAMP. The new definition of SKBTX_HW_TSTAMP
now represents the combination of socket timestamping and bpf
timestamping. After this patch, drivers also work with bpf timestamping.

Since some drivers don't assign the hardware timestamp to the skb, this
patch does the assignment so that the BPF program can acquire the
hwstamp from the skb directly.

Signed-off-by: Jason Xing <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]



# ecebb17a 20-Feb-2025 Jason Xing <[email protected]>


bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback

Support sw SCM_TSTAMP_SND case for bpf timestamping.

Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_SW_CB. This
callback will occur at the same timestamping point as the user
space's software SCM_TSTAMP_SND. The BPF program can use it to
get the same SCM_TSTAMP_SND timestamp without modifying the
user-space application.

With this patch, the BPF program will get the software timestamp when
the driver is ready to send the skb. In the subsequent patch, the
hardware timestamp will be supported.

Signed-off-by: Jason Xing <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]



# 6b98ec7e 20-Feb-2025 Jason Xing <[email protected]>


bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback

Support SCM_TSTAMP_SCHED case for bpf timestamping.

Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SCHED_CB. This
callback will occur at the same timestamping point as the user
space's SCM_TSTAMP_SCHED. The BPF program can use it to get the
same SCM_TSTAMP_SCHED timestamp without modifying the user-space
application.

A new SKBTX_BPF flag is added to mark skb_shinfo(skb)->tx_flags,
ensuring that the new BPF timestamping and the current user
space's SO_TIMESTAMPING do not interfere with each other.

Signed-off-by: Jason Xing <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]



# aa290f93 20-Feb-2025 Jason Xing <[email protected]>


net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING

No functional changes here. Only add a test to see whether the orig_skb
matches the application's use of SO_TIMESTAMPING.

In this series, bpf timestamping and the existing socket timestamping
are implemented in the same function, __skb_tstamp_tx(). To test whether
the socket has the socket timestamping feature enabled, the function
skb_tstamp_tx_report_so_timestamping() is added.

In the next patch, a corresponding check for the bpf timestamping
feature, skb_tstamp_tx_report_bpf_timestamping(), will be introduced
just like the report function above. Users will then be able to tell
whether the socket enables either or both features.

Signed-off-by: Jason Xing <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]



# 6bc7e4eb 18-Feb-2025 Paolo Abeni <[email protected]>


Revert "net: skb: introduce and use a single page frag cache"

After the previous commit, it is finally safe to revert commit dbae2b062824
("net: skb: introduce and use a single page frag cache"): do it here.

The intended goal of such change was to counter a performance regression
introduced by commit 3226b158e67c ("net: avoid 32 x truesize
under-estimation for tiny skbs").

Unfortunately, the blamed commit introduces another regression for the
virtio_net driver. Such a driver calls napi_alloc_skb() with a tiny
size, so that the whole head frag could fit in a 512-byte block.

The single page frag cache uses a 1K fragment for such allocation, and
the additional overhead, under small UDP packets flood, makes the page
allocator a bottleneck.

Thanks to commit bf9f1baa279f ("net: add dedicated kmem_cache for
typical/small skb->head"), this revert does not re-introduce the
original regression. Actually, in the relevant test on top of this
revert, I measure a small but noticeable positive delta, just above
noise level.

The revert itself required some additional mangling due to recent updates
in the affected code.

Suggested-by: Eric Dumazet <[email protected]>
Fixes: dbae2b062824 ("net: skb: introduce and use a single page frag cache")
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>



# 14ad6ed3 18-Feb-2025 Paolo Abeni <[email protected]>


net: allow small head cache usage with large MAX_SKB_FRAGS values

Sabrina reported the following splat:

WARNING: CPU: 0 PID: 1 at net/core/dev.c:6935 netif_napi_add_weight_locked+0x8f2/0xba0
Modules linked in:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc1-net-00092-g011b03359038 #996
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
RIP: 0010:netif_napi_add_weight_locked+0x8f2/0xba0
Code: e8 c3 e6 6a fe 48 83 c4 28 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc c7 44 24 10 ff ff ff ff e9 8f fb ff ff e8 9e e6 6a fe <0f> 0b e9 d3 fe ff ff e8 92 e6 6a fe 48 8b 04 24 be ff ff ff ff 48
RSP: 0000:ffffc9000001fc60 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88806ce48128 RCX: 1ffff11001664b9e
RDX: ffff888008f00040 RSI: ffffffff8317ca42 RDI: ffff88800b325cb6
RBP: ffff88800b325c40 R08: 0000000000000001 R09: ffffed100167502c
R10: ffff88800b3a8163 R11: 0000000000000000 R12: ffff88800ac1c168
R13: ffff88800ac1c168 R14: ffff88800ac1c168 R15: 0000000000000007
FS: 0000000000000000(0000) GS:ffff88806ce00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888008201000 CR3: 0000000004c94001 CR4: 0000000000370ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
gro_cells_init+0x1ba/0x270
xfrm_input_init+0x4b/0x2a0
xfrm_init+0x38/0x50
ip_rt_init+0x2d7/0x350
ip_init+0xf/0x20
inet_init+0x406/0x590
do_one_initcall+0x9d/0x2e0
do_initcalls+0x23b/0x280
kernel_init_freeable+0x445/0x490
kernel_init+0x20/0x1d0
ret_from_fork+0x46/0x80
ret_from_fork_asm+0x1a/0x30
</TASK>
irq event stamp: 584330
hardirqs last enabled at (584338): [<ffffffff8168bf87>] __up_console_sem+0x77/0xb0
hardirqs last disabled at (584345): [<ffffffff8168bf6c>] __up_console_sem+0x5c/0xb0
softirqs last enabled at (583242): [<ffffffff833ee96d>] netlink_insert+0x14d/0x470
softirqs last disabled at (583754): [<ffffffff8317c8cd>] netif_napi_add_weight_locked+0x77d/0xba0

on a kernel built with MAX_SKB_FRAGS=45, where SKB_WITH_OVERHEAD(1024)
is smaller than GRO_MAX_HEAD.

Such a build additionally contains the revert of the single page frag
cache, so that napi_get_frags() ends up using the page frag allocator,
triggering the splat.

Note that the underlying issue is independent of the mentioned revert;
address it by ensuring that the small head cache can fit both TCP and
GRO allocations, and by updating napi_alloc_skb() and
__netdev_alloc_skb() to select kmalloc() for any allocation that fits
such a cache.

Reported-by: Sabrina Dubroca <[email protected]>
Suggested-by: Eric Dumazet <[email protected]>
Fixes: 3948b05950fd ("net: introduce a config option to tweak MAX_SKB_FRAGS")
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>



Revision tags: v6.14-rc3
# 0892b840 13-Feb-2025 Jakub Kicinski <[email protected]>


Reapply "net: skb: introduce and use a single page frag cache"

This reverts commit 011b0335903832facca86cd8ed05d7d8d94c9c76.

Sabrina reports that the revert may trigger warnings due to intervening
changes, especially the ability to raise MAX_SKB_FRAGS. Let's drop it
and revisit once that part is also ironed out.

Fixes: 011b03359038 ("Revert "net: skb: introduce and use a single page frag cache"")
Reported-by: Sabrina Dubroca <[email protected]>
Link: https://lore.kernel.org/6bf54579233038bc0e76056c5ea459872ce362ab.1739375933.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>



Revision tags: v6.14-rc2
# 011b0335 06-Feb-2025 Paolo Abeni <[email protected]>


Revert "net: skb: introduce and use a single page frag cache"

This reverts commit dbae2b062824 ("net: skb: introduce and use a single
page frag cache"). The intended goal of such change was to counter a
performance regression introduced by commit 3226b158e67c ("net: avoid
32 x truesize under-estimation for tiny skbs").

Unfortunately, the blamed commit introduces another regression for the
virtio_net driver. Such a driver calls napi_alloc_skb() with a tiny
size, so that the whole head frag could fit in a 512-byte block.

The single page frag cache uses a 1K fragment for such allocation, and
the additional overhead, under small UDP packets flood, makes the page
allocator a bottleneck.

Thanks to commit bf9f1baa279f ("net: add dedicated kmem_cache for
typical/small skb->head"), this revert does not re-introduce the
original regression. Actually, in the relevant test on top of this
revert, I measure a small but noticeable positive delta, just above
noise level.

The revert itself required some additional mangling due to the
introduction of the SKB_HEAD_ALIGN() helper and local lock infra in the
affected code.

Suggested-by: Eric Dumazet <[email protected]>
Fixes: dbae2b062824 ("net: skb: introduce and use a single page frag cache")
Signed-off-by: Paolo Abeni <[email protected]>
Link: https://patch.msgid.link/e649212fde9f0fdee23909ca0d14158d32bb7425.1738877290.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>



Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2
# 7cd1107f 03-Dec-2024 Alexander Lobakin <[email protected]>


bpf, xdp: constify some bpf_prog * function arguments

In lots of places, the bpf_prog pointer is used only for tracing or
other things that don't modify the structure itself. The same goes for
net_device. Address at least some of them and add `const` attributes
there. The object code didn't change, but this may prevent unwanted data
modifications and also allows more helpers to take const arguments.

Reviewed-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
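
A hedged illustration of the kind of change described (the helper name
here is hypothetical):

	/* Before: the prototype permits modification of both structures. */
	void trace_xdp_run(struct net_device *dev, struct bpf_prog *prog);

	/* After: read-only use is encoded in the prototype, so
	 * const-qualified callers can pass their pointers and accidental
	 * writes become compile errors.
	 */
	void trace_xdp_run(const struct net_device *dev,
			   const struct bpf_prog *prog);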



Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6
# 3d18dfe6 28-Oct-2024 Yunsheng Lin <[email protected]>


mm: page_frag: avoid caller accessing 'page_frag_cache' directly

Use the appropriate page_frag API instead of callers accessing
'page_frag_cache' directly.

CC: Andrew Morton <[email protected]>
CC: Linux-MM <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Acked-by: Chuck Lever <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
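
A hedged caller-side sketch: go through the page_frag allocator entry
points instead of reading or writing struct page_frag_cache fields
(the per-cpu cache name here is hypothetical, and its initialisation is
elided):

	struct page_frag_cache *nc = this_cpu_ptr(&my_frag_cache); /* hypothetical */
	void *buf;

	buf = page_frag_alloc(nc, 256, GFP_ATOMIC);
	if (buf) {
		/* ... fill and hand off, or release: */
		page_frag_free(buf);
	}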



Revision tags: v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2
# da5e06de 05-Oct-2024 Jason Xing <[email protected]>


net-timestamp: namespacify the sysctl_tstamp_allow_data

Let it be tuned per netns by admins.

Signed-off-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>



Revision tags: v6.12-rc1, v6.11
# 65249feb 10-Sep-2024 Mina Almasry <[email protected]>


net: add support for skbs with unreadable frags

For device memory TCP, we expect the skb headers to be available in host
memory for access, and we expect the skb frags to be in device memory
and inaccessible to the host. We expect there to be no mixing and
matching of device memory frags (inaccessible) with host memory frags
(accessible) in the same skb.

Add a skb->devmem flag which indicates whether the frags in this skb
are device memory frags or not.

__skb_fill_netmem_desc() now checks frags added to skbs for net_iov,
and marks the skb as skb->devmem accordingly.

Add checks throughout the network stack to avoid accessing the frags of
devmem skbs and to avoid coalescing devmem skbs with non-devmem skbs.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Kaiyuan Zhang <[email protected]>
Signed-off-by: Mina Almasry <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
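
A hedged sketch of the kind of guard those checks take (the helper names
here are hypothetical; the flag name follows this changelog):

	/* Host code must not touch the payload of device-memory skbs. */
	static bool skb_host_readable(const struct sk_buff *skb)
	{
		return !skb->devmem;
	}

	/* No mixing and matching when merging skbs. */
	static bool skb_devmem_compatible(const struct sk_buff *to,
					  const struct sk_buff *from)
	{
		return to->devmem == from->devmem;
	}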



# 9f6b619e 10-Sep-2024 Mina Almasry <[email protected]>


net: support non paged skb frags

Make skb_frag_page() fail in the case where the frag is not backed
by a page, and fix its relevant callers to handle this case.

Signed-off-by: Mina Almasry <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
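
A hedged caller-side pattern per the description above (error handling
illustrative):

	/* skb_frag_page() may now return NULL for non-page (net_iov)
	 * frags, so check before touching the page.
	 */
	const skb_frag_t *frag = &skb_shinfo(skb)->frags[0];
	struct page *page = skb_frag_page(frag);

	if (!page)
		return -EFAULT;	/* frag is not backed by a host page */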



# 8ab79ed5 10-Sep-2024 Mina Almasry <[email protected]>


page_pool: devmem support

Convert netmem to be a union of struct page and struct net_iov. Overload
the LSB of struct netmem* to indicate that it's a net_iov, otherwise
it's a page.

Currently these entries in struct page are rented by the page_pool and
used exclusively by the net stack:

	struct {
		unsigned long pp_magic;
		struct page_pool *pp;
		unsigned long _pp_mapping_pad;
		unsigned long dma_addr;
		atomic_long_t pp_ref_count;
	};

Mirror these (and only these) entries into struct net_iov and implement
netmem helpers that can access these common fields regardless of
whether the underlying type is page or net_iov.

Implement checks for net_iov in netmem helpers which delegate to mm
APIs, to ensure net_iov are never passed to the mm stack.

Signed-off-by: Mina Almasry <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
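
A hedged sketch of the LSB overload described above (the type and helper
names approximate the ones this series introduces):

	#define NET_IOV_BIT	1UL

	typedef unsigned long netmem_ref;

	static inline bool netmem_is_net_iov(netmem_ref netmem)
	{
		return netmem & NET_IOV_BIT;
	}

	static inline struct page *netmem_to_page(netmem_ref netmem)
	{
		if (netmem_is_net_iov(netmem))
			return NULL;	/* never hand a net_iov to mm APIs */
		return (struct page *)netmem;
	}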



Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5
# 93886372 22-Aug-2024 Boris Sukholitko <[email protected]>


tc: adjust network header after 2nd vlan push

<tldr>
The skb network header of a single-tagged vlan packet continues to point to
the vlan payload (e.g. IP) after a second vlan tag is pushed by tc act_vlan.
This causes a problem at the dissector, which expects the network header of
a double-tagged packet to point to the inner vlan.

The fix is to adjust the network header in tcf_act_vlan.c, but this requires
refactoring of the skb_vlan_push function.
</tldr>

Consider the following shell script snippet configuring TC rules on the
veth interface:

ip link add veth0 type veth peer veth1
ip link set veth0 up
ip link set veth1 up

tc qdisc add dev veth0 clsact

tc filter add dev veth0 ingress pref 10 chain 0 flower \
	num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5
tc filter add dev veth0 ingress pref 20 chain 0 flower \
	num_of_vlans 1 action vlan push id 100 \
	protocol 0x8100 action goto chain 5
tc filter add dev veth0 ingress pref 30 chain 5 flower \
	num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success"

Sending double-tagged vlan packet with the IP payload inside:

cat <<ENDS | text2pcap - - | tcpreplay -i veth1 -
0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64 ..........."...d
0010 81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11 ......E..&......
0020 18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12 ................
0030 e1 c7 00 00 00 00 00 00 00 00 00 00 ............
ENDS

will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to
the dmesg.

OTOH, sending single-tagged vlan packet:

cat <<ENDS | text2pcap - - | tcpreplay -i veth1 -
0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14 ..........."....
0010 08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00 ..E..*..........
0020 00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00 ................
0030 00 00 00 00 00 00 00 00 00 00 00 00 ............
ENDS

will match rule 20, will push the second vlan tag but will *not* match
rule 30. IOW, the match at rule 30 fails if the second vlan was freshly
pushed by the kernel.

Let's look at __skb_flow_dissect() working on the double-tagged vlan packet.
Here is the relevant code from around net/core/flow_dissector.c:1277,
copy-pasted here for convenience:

	if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX &&
	    skb && skb_vlan_tag_present(skb)) {
		proto = skb->protocol;
	} else {
		vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
					    data, hlen, &_vlan);
		if (!vlan) {
			fdret = FLOW_DISSECT_RET_OUT_BAD;
			break;
		}

		proto = vlan->h_vlan_encapsulated_proto;
		nhoff += sizeof(*vlan);
	}

The "else" clause above gets the protocol of the encapsulated packet from
the skb data at the network header location. printk debugging has shown
that in the good double-tagged packet case proto is
htons(0x800 == ETH_P_IP), as expected. However, in the single-tagged packet
case proto is garbage, leading to the failure to match tc filter 30.

proto is being set from the skb header pointed to by the nhoff parameter,
which is set at the beginning of __skb_flow_dissect
(net/core/flow_dissector.c:1055 in the current version):

nhoff = skb_network_offset(skb);

Therefore the culprit seems to be that the skb network offset differs
between a double-tagged packet received from the interface and a
single-tagged packet that had its vlan tag pushed by TC.

Let's look at the interesting points in the lifetime of the single/double
tagged packets as they traverse our packet flow.

Both of them will start at __netif_receive_skb_core where the first vlan
tag will be stripped:

	if (eth_type_vlan(skb->protocol)) {
		skb = skb_vlan_untag(skb);
		if (unlikely(!skb))
			goto out;
	}

At this stage, in the double-tagged case skb->data points to the second vlan
tag, while in the single-tagged case skb->data points to the network
(e.g. IP) header.

Looking at TC vlan push action (net/sched/act_vlan.c) we have the following
code at tcf_vlan_act (interesting points are in square brackets):

	if (skb_at_tc_ingress(skb))
[1]		skb_push_rcsum(skb, skb->mac_len);

	....

	case TCA_VLAN_ACT_PUSH:
		err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid |
				    (p->tcfv_push_prio << VLAN_PRIO_SHIFT),
				    0);
		if (err)
			goto drop;
		break;

	....

out:
	if (skb_at_tc_ingress(skb))
[3]		skb_pull_rcsum(skb, skb->mac_len);

And the skb_vlan_push() function (net/core/skbuff.c:6204) does:

	err = __vlan_insert_tag(skb, skb->vlan_proto,
				skb_vlan_tag_get(skb));
	if (err)
		return err;

	skb->protocol = skb->vlan_proto;
[2]	skb->mac_len += VLAN_HLEN;

in the case of pushing the second tag. Let's look at what happens with
skb->data of the single-tagged packet at each of the above points:

1. As a result of the skb_push_rcsum, skb->data is moved back to the start
of the packet.

2. The first VLAN tag is moved from the skb into the packet buffer;
skb->mac_len is incremented; skb->data still points to the start of the
packet.

3. As a result of the skb_pull_rcsum, skb->data is moved forward by the
modified skb->mac_len, thus pointing to the network header again.

Then __skb_flow_dissect will get confused by having a double-tagged vlan
packet with skb->data at the network header.

The solution for the bug is to preserve the "skb->data at the second vlan
header" semantics in the skb_vlan_push function. We do this by manipulating
skb->network_header rather than skb->mac_len. skb_vlan_push callers are
updated to do skb_reset_mac_len.

Signed-off-by: Boris Sukholitko <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
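
A hedged sketch of the resulting skb_vlan_push() tail and caller update
(not the verbatim patch):

	err = __vlan_insert_tag(skb, skb->vlan_proto,
				skb_vlan_tag_get(skb));
	if (err)
		return err;

	skb->protocol = skb->vlan_proto;
	/* keep "skb->data at the second vlan header" semantics: move the
	 * network header over the freshly inserted tag instead of growing
	 * skb->mac_len.
	 */
	skb->network_header -= VLAN_HLEN;

	/* ...and callers such as tcf_vlan_act() then refresh mac_len: */
	skb_reset_mac_len(skb);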



# a8c924e9 22-Aug-2024 Simon Horman <[email protected]>


net: Correct spelling in net/core

Correct spelling in net/core.
As reported by codespell.

Signed-off-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>



Revision tags: v6.11-rc4
# 6ad8bc92 15-Aug-2024 Christian Hopps <[email protected]>


net: add copy from skb_seq_state to buffer function

Add an skb helper function to copy a range of bytes from within
an existing skb_seq_state.

Signed-off-by: Christian Hopps <[email protected]>
Signed-off-by: Steffen Klassert <[email protected]>
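
A hedged usage sketch; the helper name and signature follow the mainline
version of this patch (skb_copy_seq_read()) and should be treated as
assumptions here:

	struct skb_seq_state st;
	u8 hdr[32];
	int err;

	skb_prepare_seq_read(skb, 0, skb->len, &st);	/* existing seq API */
	err = skb_copy_seq_read(&st, 8, hdr, sizeof(hdr)); /* new helper */
	skb_abort_seq_read(&st);
	if (err)
		return err;	/* requested range ran past the skb */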



# dca9d62a 15-Aug-2024 Zhang Changzhong <[email protected]>


net: remove redundant check in skb_shift()

The check for '!to' is redundant here, since skb_can_coalesce() already
contains this check.

Signed-off-by: Zhang Changzhong <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>



Revision tags: v6.11-rc3, v6.11-rc2
# c89cca30 02-Aug-2024 Jakub Kicinski <[email protected]>


net: skbuff: sprinkle more __GFP_NOWARN on ingress allocs

build_skb() and frag allocations done with GFP_ATOMIC will fail in real
life, when the system is under memory pressure, and there's nothing we
can do about that. So there is no point in printing warnings.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>



Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6
# 4dec64c5 28-Jun-2024 Mina Almasry <[email protected]>


page_pool: convert to use netmem

Abstract the memory type from the page_pool so we can later add support
for new memory types. Convert the page_pool to use the new netmem type
abstraction, rather than use struct page directly.

As of this patch the netmem type is a no-op abstraction: it's always a
struct page underneath. All the page pool internals are converted to
use struct netmem instead of struct page, and the page pool now exports
2 APIs:

1. The existing struct page API.
2. The new struct netmem API.

Keeping the existing API is transitional; we do not want to refactor all
the current drivers using the page pool at once.

The netmem abstraction is currently a no-op. The page_pool uses
page_to_netmem() to convert allocated pages to netmem, and uses
netmem_to_page() to convert the netmem back to pages to pass to mm APIs.

Follow-up patches in this series add non-paged netmem support to the
page_pool. This change is factored out on its own to limit the code
churn to this one patch, for ease of code review.

Signed-off-by: Mina Almasry <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
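
A hedged usage sketch of the two conversion helpers named above, wrapped
around the existing page API:

	struct page *page = page_pool_alloc_pages(pool, GFP_ATOMIC); /* API 1 */
	netmem_ref nm = page_to_netmem(page);	/* netmem view of the page */

	/* ... page pool internals operate on nm ... */

	struct page *back = netmem_to_page(nm);	/* unwrap before any mm call */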



# 2ca58ed2 27-Jun-2024 Pavel Begunkov <[email protected]>


net: limit scope of a skb_zerocopy_iter_stream var

skb_zerocopy_iter_stream() only uses @orig_uarg in the !link_skb path,
so we can move the local variable into the appropriate block.

Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>


