|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14 |
|
| #
983e0e4e |
| 18-Mar-2025 |
Pauli Virtanen <[email protected]> |
net-timestamp: COMPLETION timestamp on packet tx completion
Add SOF_TIMESTAMPING_TX_COMPLETION, for requesting a software timestamp when hardware reports a packet completed.
Completion tstamp is us
net-timestamp: COMPLETION timestamp on packet tx completion
Add SOF_TIMESTAMPING_TX_COMPLETION, for requesting a software timestamp when hardware reports a packet completed.
Completion tstamp is useful for Bluetooth, as hardware timestamps do not exist in the HCI specification except for ISO packets, and the hardware has a queue where packets may wait. In this case the software SND timestamp only reflects the kernel-side part of the total latency (usually small) and queue length (usually 0 unless HW buffers congested), whereas the completion report time is more informative of the true latency.
It may also be useful in other cases where HW TX timestamps cannot be obtained and user wants to estimate an upper bound to when the TX probably happened.
Signed-off-by: Pauli Virtanen <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
859d6acd |
| 25-Feb-2025 |
Alexander Lobakin <[email protected]> |
net: skbuff: introduce napi_skb_cache_get_bulk()
Add a function to get an array of skbs from the NAPI percpu cache. It's supposed to be a drop-in replacement for kmem_cache_alloc_bulk(skbuff_head_ca
net: skbuff: introduce napi_skb_cache_get_bulk()
Add a function to get an array of skbs from the NAPI percpu cache. It's supposed to be a drop-in replacement for kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC) and xdp_alloc_skb_bulk(GFP_ATOMIC). The difference (apart from the requirement to call it only from the BH) is that it tries to use as many NAPI cache entries for skbs as possible, and allocate new ones only if needed.
The logic is as follows:
* there is enough skbs in the cache: decache them and return to the caller; * not enough: try refilling the cache first. If there is now enough skbs, return; * still not enough: try allocating skbs directly to the output array with %GFP_ZERO, maybe we'll be able to get some. If there's now enough, return; * still not enough: return as many as we were able to obtain.
Most of times, if called from the NAPI polling loop, the first one will be true, sometimes (rarely) the second one. The third and the fourth -- only under heavy memory pressure. It can save significant amounts of CPU cycles if there are GRO cycles and/or Tx completion cycles (anything that descends to napi_skb_cache_put()) happening on this CPU.
Tested-by: Daniel Xu <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc4 |
|
| #
de2c2118 |
| 22-Feb-2025 |
Philo Lu <[email protected]> |
ipvs: Always clear ipvs_property flag in skb_scrub_packet()
We found an issue when using bpf_redirect with ipvs NAT mode after commit ff70202b2d1a ("dev_forward_skb: do not scrub skb mark within the
ipvs: Always clear ipvs_property flag in skb_scrub_packet()
We found an issue when using bpf_redirect with ipvs NAT mode after commit ff70202b2d1a ("dev_forward_skb: do not scrub skb mark within the same name space"). Particularly, we use bpf_redirect to return the skb directly back to the netif it comes from, i.e., xnet is false in skb_scrub_packet(), and then ipvs_property is preserved and SNAT is skipped in the rx path.
ipvs_property has been already cleared when netns is changed in commit 2b5ec1a5f973 ("netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed"). This patch just clears it in spite of netns.
Fixes: 2b5ec1a5f973 ("netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed") Signed-off-by: Philo Lu <[email protected]> Acked-by: Julian Anastasov <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
b3b81e6b |
| 20-Feb-2025 |
Jason Xing <[email protected]> |
bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
Support the ACK case for bpf timestamping.
Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_ACK_CB. This callback will occur at the same timestamping po
bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
Support the ACK case for bpf timestamping.
Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_ACK_CB. This callback will occur at the same timestamping point as the user space's SCM_TSTAMP_ACK. The BPF program can use it to get the same SCM_TSTAMP_ACK timestamp without modifying the user-space application.
This patch extends txstamp_ack to two bits: 1 stands for SO_TIMESTAMPING mode, 2 bpf extension.
Signed-off-by: Jason Xing <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected]
show more ...
|
| #
2deaf7f4 |
| 20-Feb-2025 |
Jason Xing <[email protected]> |
bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback
Support hw SCM_TSTAMP_SND case for bpf timestamping.
Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_HW_CB. This callback will occur at the same
bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback
Support hw SCM_TSTAMP_SND case for bpf timestamping.
Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_HW_CB. This callback will occur at the same timestamping point as the user space's hardware SCM_TSTAMP_SND. The BPF program can use it to get the same SCM_TSTAMP_SND timestamp without modifying the user-space application.
To avoid increasing the code complexity, replace SKBTX_HW_TSTAMP with SKBTX_HW_TSTAMP_NOBPF instead of changing numerous callers from driver side using SKBTX_HW_TSTAMP. The new definition of SKBTX_HW_TSTAMP means the combination tests of socket timestamping and bpf timestamping. After this patch, drivers can work under the bpf timestamping.
Considering some drivers don't assign the skb with hardware timestamp, this patch does the assignment and then BPF program can acquire the hwstamp from skb directly.
Signed-off-by: Jason Xing <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected]
show more ...
|
| #
ecebb17a |
| 20-Feb-2025 |
Jason Xing <[email protected]> |
bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback
Support sw SCM_TSTAMP_SND case for bpf timestamping.
Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_SW_CB. This callback will occur at the same
bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback
Support sw SCM_TSTAMP_SND case for bpf timestamping.
Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_SW_CB. This callback will occur at the same timestamping point as the user space's software SCM_TSTAMP_SND. The BPF program can use it to get the same SCM_TSTAMP_SND timestamp without modifying the user-space application.
Based on this patch, BPF program will get the software timestamp when the driver is ready to send the skb. In the sebsequent patch, the hardware timestamp will be supported.
Signed-off-by: Jason Xing <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected]
show more ...
|
| #
6b98ec7e |
| 20-Feb-2025 |
Jason Xing <[email protected]> |
bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
Support SCM_TSTAMP_SCHED case for bpf timestamping.
Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SCHED_CB. This callback will occur at the same ti
bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
Support SCM_TSTAMP_SCHED case for bpf timestamping.
Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SCHED_CB. This callback will occur at the same timestamping point as the user space's SCM_TSTAMP_SCHED. The BPF program can use it to get the same SCM_TSTAMP_SCHED timestamp without modifying the user-space application.
A new SKBTX_BPF flag is added to mark skb_shinfo(skb)->tx_flags, ensuring that the new BPF timestamping and the current user space's SO_TIMESTAMPING do not interfere with each other.
Signed-off-by: Jason Xing <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected]
show more ...
|
| #
aa290f93 |
| 20-Feb-2025 |
Jason Xing <[email protected]> |
net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
No functional changes here. Only add test to see if the orig_skb matches the usage of application SO_TIMESTAMPING.
In this series,
net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
No functional changes here. Only add test to see if the orig_skb matches the usage of application SO_TIMESTAMPING.
In this series, bpf timestamping and previous socket timestamping are implemented in the same function __skb_tstamp_tx(). To test the socket enables socket timestamping feature, this function skb_tstamp_tx_report_so_timestamping() is added.
In the next patch, another check for bpf timestamping feature will be introduced just like the above report function, namely, skb_tstamp_tx_report_bpf_timestamping(). Then users will be able to know the socket enables either or both of features.
Signed-off-by: Jason Xing <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected]
show more ...
|
| #
6bc7e4eb |
| 18-Feb-2025 |
Paolo Abeni <[email protected]> |
Revert "net: skb: introduce and use a single page frag cache"
After the previous commit is finally safe to revert commit dbae2b062824 ("net: skb: introduce and use a single page frag cache"): do it
Revert "net: skb: introduce and use a single page frag cache"
After the previous commit is finally safe to revert commit dbae2b062824 ("net: skb: introduce and use a single page frag cache"): do it here.
The intended goal of such change was to counter a performance regression introduced by commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs").
Unfortunately, the blamed commit introduces another regression for the virtio_net driver. Such a driver calls napi_alloc_skb() with a tiny size, so that the whole head frag could fit a 512-byte block.
The single page frag cache uses a 1K fragment for such allocation, and the additional overhead, under small UDP packets flood, makes the page allocator a bottleneck.
Thanks to commit bf9f1baa279f ("net: add dedicated kmem_cache for typical/small skb->head"), this revert does not re-introduce the original regression. Actually, in the relevant test on top of this revert, I measure a small but noticeable positive delta, just above noise level.
The revert itself required some additional mangling due to recent updates in the affected code.
Suggested-by: Eric Dumazet <[email protected]> Fixes: dbae2b062824 ("net: skb: introduce and use a single page frag cache") Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
14ad6ed3 |
| 18-Feb-2025 |
Paolo Abeni <[email protected]> |
net: allow small head cache usage with large MAX_SKB_FRAGS values
Sabrina reported the following splat:
WARNING: CPU: 0 PID: 1 at net/core/dev.c:6935 netif_napi_add_weight_locked+0x8f2/0xba0
net: allow small head cache usage with large MAX_SKB_FRAGS values
Sabrina reported the following splat:
WARNING: CPU: 0 PID: 1 at net/core/dev.c:6935 netif_napi_add_weight_locked+0x8f2/0xba0 Modules linked in: CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc1-net-00092-g011b03359038 #996 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 RIP: 0010:netif_napi_add_weight_locked+0x8f2/0xba0 Code: e8 c3 e6 6a fe 48 83 c4 28 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc c7 44 24 10 ff ff ff ff e9 8f fb ff ff e8 9e e6 6a fe <0f> 0b e9 d3 fe ff ff e8 92 e6 6a fe 48 8b 04 24 be ff ff ff ff 48 RSP: 0000:ffffc9000001fc60 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff88806ce48128 RCX: 1ffff11001664b9e RDX: ffff888008f00040 RSI: ffffffff8317ca42 RDI: ffff88800b325cb6 RBP: ffff88800b325c40 R08: 0000000000000001 R09: ffffed100167502c R10: ffff88800b3a8163 R11: 0000000000000000 R12: ffff88800ac1c168 R13: ffff88800ac1c168 R14: ffff88800ac1c168 R15: 0000000000000007 FS: 0000000000000000(0000) GS:ffff88806ce00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff888008201000 CR3: 0000000004c94001 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> gro_cells_init+0x1ba/0x270 xfrm_input_init+0x4b/0x2a0 xfrm_init+0x38/0x50 ip_rt_init+0x2d7/0x350 ip_init+0xf/0x20 inet_init+0x406/0x590 do_one_initcall+0x9d/0x2e0 do_initcalls+0x23b/0x280 kernel_init_freeable+0x445/0x490 kernel_init+0x20/0x1d0 ret_from_fork+0x46/0x80 ret_from_fork_asm+0x1a/0x30 </TASK> irq event stamp: 584330 hardirqs last enabled at (584338): [<ffffffff8168bf87>] __up_console_sem+0x77/0xb0 hardirqs last disabled at (584345): [<ffffffff8168bf6c>] __up_console_sem+0x5c/0xb0 softirqs last enabled at (583242): [<ffffffff833ee96d>] netlink_insert+0x14d/0x470 softirqs last disabled at (583754): [<ffffffff8317c8cd>] netif_napi_add_weight_locked+0x77d/0xba0
on kernel built with MAX_SKB_FRAGS=45, where SKB_WITH_OVERHEAD(1024) is smaller than GRO_MAX_HEAD.
Such built additionally contains the revert of the single page frag cache so that napi_get_frags() ends up using the page frag allocator, triggering the splat.
Note that the underlying issue is independent from the mentioned revert; address it ensuring that the small head cache will fit either TCP and GRO allocation and updating napi_alloc_skb() and __netdev_alloc_skb() to select kmalloc() usage for any allocation fitting such cache.
Reported-by: Sabrina Dubroca <[email protected]> Suggested-by: Eric Dumazet <[email protected]> Fixes: 3948b05950fd ("net: introduce a config option to tweak MAX_SKB_FRAGS") Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc3 |
|
| #
0892b840 |
| 13-Feb-2025 |
Jakub Kicinski <[email protected]> |
Reapply "net: skb: introduce and use a single page frag cache"
This reverts commit 011b0335903832facca86cd8ed05d7d8d94c9c76.
Sabrina reports that the revert may trigger warnings due to intervening
Reapply "net: skb: introduce and use a single page frag cache"
This reverts commit 011b0335903832facca86cd8ed05d7d8d94c9c76.
Sabrina reports that the revert may trigger warnings due to intervening changes, especially the ability to rise MAX_SKB_FRAGS. Let's drop it and revisit once that part is also ironed out.
Fixes: 011b03359038 ("Revert "net: skb: introduce and use a single page frag cache"") Reported-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/6bf54579233038bc0e76056c5ea459872ce362ab.1739375933.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc2 |
|
| #
011b0335 |
| 06-Feb-2025 |
Paolo Abeni <[email protected]> |
Revert "net: skb: introduce and use a single page frag cache"
This reverts commit dbae2b062824 ("net: skb: introduce and use a single page frag cache"). The intended goal of such change was to count
Revert "net: skb: introduce and use a single page frag cache"
This reverts commit dbae2b062824 ("net: skb: introduce and use a single page frag cache"). The intended goal of such change was to counter a performance regression introduced by commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs").
Unfortunately, the blamed commit introduces another regression for the virtio_net driver. Such a driver calls napi_alloc_skb() with a tiny size, so that the whole head frag could fit a 512-byte block.
The single page frag cache uses a 1K fragment for such allocation, and the additional overhead, under small UDP packets flood, makes the page allocator a bottleneck.
Thanks to commit bf9f1baa279f ("net: add dedicated kmem_cache for typical/small skb->head"), this revert does not re-introduce the original regression. Actually, in the relevant test on top of this revert, I measure a small but noticeable positive delta, just above noise level.
The revert itself required some additional mangling due to the introduction of the SKB_HEAD_ALIGN() helper and local lock infra in the affected code.
Suggested-by: Eric Dumazet <[email protected]> Fixes: dbae2b062824 ("net: skb: introduce and use a single page frag cache") Signed-off-by: Paolo Abeni <[email protected]> Link: https://patch.msgid.link/e649212fde9f0fdee23909ca0d14158d32bb7425.1738877290.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2 |
|
| #
7cd1107f |
| 03-Dec-2024 |
Alexander Lobakin <[email protected]> |
bpf, xdp: constify some bpf_prog * function arguments
In lots of places, bpf_prog pointer is used only for tracing or other stuff that doesn't modify the structure itself. Same for net_device. Addre
bpf, xdp: constify some bpf_prog * function arguments
In lots of places, bpf_prog pointer is used only for tracing or other stuff that doesn't modify the structure itself. Same for net_device. Address at least some of them and add `const` attributes there. The object code didn't change, but that may prevent unwanted data modifications and also allow more helpers to have const arguments.
Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6 |
|
| #
3d18dfe6 |
| 28-Oct-2024 |
Yunsheng Lin <[email protected]> |
mm: page_frag: avoid caller accessing 'page_frag_cache' directly
Use appropriate frag_page API instead of caller accessing 'page_frag_cache' directly.
CC: Andrew Morton <[email protected]>
mm: page_frag: avoid caller accessing 'page_frag_cache' directly
Use appropriate frag_page API instead of caller accessing 'page_frag_cache' directly.
CC: Andrew Morton <[email protected]> CC: Linux-MM <[email protected]> Signed-off-by: Yunsheng Lin <[email protected]> Reviewed-by: Alexander Duyck <[email protected]> Acked-by: Chuck Lever <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2 |
|
| #
da5e06de |
| 05-Oct-2024 |
Jason Xing <[email protected]> |
net-timestamp: namespacify the sysctl_tstamp_allow_data
Let it be tuned in per netns by admins.
Signed-off-by: Jason Xing <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]>
net-timestamp: namespacify the sysctl_tstamp_allow_data
Let it be tuned in per netns by admins.
Signed-off-by: Jason Xing <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc1, v6.11 |
|
| #
65249feb |
| 10-Sep-2024 |
Mina Almasry <[email protected]> |
net: add support for skbs with unreadable frags
For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and una
net: add support for skbs with unreadable frags
For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and unaccessible to the host. We expect there to be no mixing and matching of device memory frags (unaccessible) with host memory frags (accessible) in the same skb.
Add a skb->devmem flag which indicates whether the frags in this skb are device memory frags or not.
__skb_fill_netmem_desc() now checks frags added to skbs for net_iov, and marks the skb as skb->devmem accordingly.
Add checks through the network stack to avoid accessing the frags of devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: Kaiyuan Zhang <[email protected]> Signed-off-by: Mina Almasry <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
9f6b619e |
| 10-Sep-2024 |
Mina Almasry <[email protected]> |
net: support non paged skb frags
Make skb_frag_page() fail in the case where the frag is not backed by a page, and fix its relevant callers to handle this case.
Signed-off-by: Mina Almasry <almasry
net: support non paged skb frags
Make skb_frag_page() fail in the case where the frag is not backed by a page, and fix its relevant callers to handle this case.
Signed-off-by: Mina Almasry <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
8ab79ed5 |
| 10-Sep-2024 |
Mina Almasry <[email protected]> |
page_pool: devmem support
Convert netmem to be a union of struct page and struct netmem. Overload the LSB of struct netmem* to indicate that it's a net_iov, otherwise it's a page.
Currently these e
page_pool: devmem support
Convert netmem to be a union of struct page and struct netmem. Overload the LSB of struct netmem* to indicate that it's a net_iov, otherwise it's a page.
Currently these entries in struct page are rented by the page_pool and used exclusively by the net stack:
struct { unsigned long pp_magic; struct page_pool *pp; unsigned long _pp_mapping_pad; unsigned long dma_addr; atomic_long_t pp_ref_count; };
Mirror these (and only these) entries into struct net_iov and implement netmem helpers that can access these common fields regardless of whether the underlying type is page or net_iov.
Implement checks for net_iov in netmem helpers which delegate to mm APIs, to ensure net_iov are never passed to the mm stack.
Signed-off-by: Mina Almasry <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5 |
|
| #
93886372 |
| 22-Aug-2024 |
Boris Sukholitko <[email protected]> |
tc: adjust network header after 2nd vlan push
<tldr> skb network header of the single-tagged vlan packet continues to point the vlan payload (e.g. IP) after second vlan tag is pushed by tc act_vlan.
tc: adjust network header after 2nd vlan push
<tldr> skb network header of the single-tagged vlan packet continues to point the vlan payload (e.g. IP) after second vlan tag is pushed by tc act_vlan. This causes problem at the dissector which expects double-tagged packet network header to point to the inner vlan.
The fix is to adjust network header in tcf_act_vlan.c but requires refactoring of skb_vlan_push function. </tldr>
Consider the following shell script snippet configuring TC rules on the veth interface:
ip link add veth0 type veth peer veth1 ip link set veth0 up ip link set veth1 up
tc qdisc add dev veth0 clsact
tc filter add dev veth0 ingress pref 10 chain 0 flower \ num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5 tc filter add dev veth0 ingress pref 20 chain 0 flower \ num_of_vlans 1 action vlan push id 100 \ protocol 0x8100 action goto chain 5 tc filter add dev veth0 ingress pref 30 chain 5 flower \ num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success"
Sending double-tagged vlan packet with the IP payload inside:
cat <<ENDS | text2pcap - - | tcpreplay -i veth1 - 0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64 ..........."...d 0010 81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11 ......E..&...... 0020 18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12 ................ 0030 e1 c7 00 00 00 00 00 00 00 00 00 00 ............ ENDS
will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to the dmesg.
OTOH, sending single-tagged vlan packet:
cat <<ENDS | text2pcap - - | tcpreplay -i veth1 - 0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14 ...........".... 0010 08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00 ..E..*.......... 0020 00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00 ................ 0030 00 00 00 00 00 00 00 00 00 00 00 00 ............ ENDS
will match rule 20, will push the second vlan tag but will *not* match rule 30. IOW, the match at rule 30 fails if the second vlan was freshly pushed by the kernel.
Lets look at __skb_flow_dissect working on the double-tagged vlan packet. Here is the relevant code from around net/core/flow_dissector.c:1277 copy-pasted here for convenience:
if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX && skb && skb_vlan_tag_present(skb)) { proto = skb->protocol; } else { vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan), data, hlen, &_vlan); if (!vlan) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; }
proto = vlan->h_vlan_encapsulated_proto; nhoff += sizeof(*vlan); }
The "else" clause above gets the protocol of the encapsulated packet from the skb data at the network header location. printk debugging has showed that in the good double-tagged packet case proto is htons(0x800 == ETH_P_IP) as expected. However in the single-tagged packet case proto is garbage leading to the failure to match tc filter 30.
proto is being set from the skb header pointed by nhoff parameter which is defined at the beginning of __skb_flow_dissect (net/core/flow_dissector.c:1055 in the current version):
nhoff = skb_network_offset(skb);
Therefore the culprit seems to be that the skb network offset is different between double-tagged packet received from the interface and single-tagged packet having its vlan tag pushed by TC.
Lets look at the interesting points of the lifetime of the single/double tagged packets as they traverse our packet flow.
Both of them will start at __netif_receive_skb_core where the first vlan tag will be stripped:
if (eth_type_vlan(skb->protocol)) { skb = skb_vlan_untag(skb); if (unlikely(!skb)) goto out; }
At this stage in double-tagged case skb->data points to the second vlan tag while in single-tagged case skb->data points to the network (eg. IP) header.
Looking at TC vlan push action (net/sched/act_vlan.c) we have the following code at tcf_vlan_act (interesting points are in square brackets):
if (skb_at_tc_ingress(skb)) [1] skb_push_rcsum(skb, skb->mac_len);
....
case TCA_VLAN_ACT_PUSH: err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid | (p->tcfv_push_prio << VLAN_PRIO_SHIFT), 0); if (err) goto drop; break;
....
out: if (skb_at_tc_ingress(skb)) [3] skb_pull_rcsum(skb, skb->mac_len);
And skb_vlan_push (net/core/skbuff.c:6204) function does:
err = __vlan_insert_tag(skb, skb->vlan_proto, skb_vlan_tag_get(skb)); if (err) return err;
skb->protocol = skb->vlan_proto; [2] skb->mac_len += VLAN_HLEN;
in the case of pushing the second tag. Lets look at what happens with skb->data of the single-tagged packet at each of the above points:
1. As a result of the skb_push_rcsum, skb->data is moved back to the start of the packet.
2. First VLAN tag is moved from the skb into packet buffer, skb->mac_len is incremented, skb->data still points to the start of the packet.
3. As a result of the skb_pull_rcsum, skb->data is moved forward by the modified skb->mac_len, thus pointing to the network header again.
Then __skb_flow_dissect will get confused by having double-tagged vlan packet with the skb->data at the network header.
The solution for the bug is to preserve "skb->data at second vlan header" semantics in the skb_vlan_push function. We do this by manipulating skb->network_header rather than skb->mac_len. skb_vlan_push callers are updated to do skb_reset_mac_len.
Signed-off-by: Boris Sukholitko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
| #
a8c924e9 |
| 22-Aug-2024 |
Simon Horman <[email protected]> |
net: Correct spelling in net/core
Correct spelling in net/core. As reported by codespell.
Signed-off-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/20240822-net-spell-v1-13-3a98
net: Correct spelling in net/core
Correct spelling in net/core. As reported by codespell.
Signed-off-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc4 |
|
| #
6ad8bc92 |
| 15-Aug-2024 |
Christian Hopps <[email protected]> |
net: add copy from skb_seq_state to buffer function
Add an skb helper function to copy a range of bytes from within an existing skb_seq_state.
Signed-off-by: Christian Hopps <[email protected]> Signe
net: add copy from skb_seq_state to buffer function
Add an skb helper function to copy a range of bytes from within an existing skb_seq_state.
Signed-off-by: Christian Hopps <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
show more ...
|
| #
dca9d62a |
| 15-Aug-2024 |
Zhang Changzhong <[email protected]> |
net: remove redundant check in skb_shift()
The check for '!to' is redundant here, since skb_can_coalesce() already contains this check.
Signed-off-by: Zhang Changzhong <[email protected]>
net: remove redundant check in skb_shift()
The check for '!to' is redundant here, since skb_can_coalesce() already contains this check.
Signed-off-by: Zhang Changzhong <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc3, v6.11-rc2 |
|
| #
c89cca30 |
| 02-Aug-2024 |
Jakub Kicinski <[email protected]> |
net: skbuff: sprinkle more __GFP_NOWARN on ingress allocs
build_skb() and frag allocations done with GFP_ATOMIC will fail in real life, when system is under memory pressure, and there's nothing we c
net: skbuff: sprinkle more __GFP_NOWARN on ingress allocs
build_skb() and frag allocations done with GFP_ATOMIC will fail in real life, when system is under memory pressure, and there's nothing we can do about that. So no point printing warnings.
Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6 |
|
| #
4dec64c5 |
| 28-Jun-2024 |
Mina Almasry <[email protected]> |
page_pool: convert to use netmem
Abstract the memory type from the page_pool so we can later add support for new memory types. Convert the page_pool to use the new netmem type abstraction, rather th
page_pool: convert to use netmem
Abstract the memory type from the page_pool so we can later add support for new memory types. Convert the page_pool to use the new netmem type abstraction, rather than use struct page directly.
As of this patch the netmem type is a no-op abstraction: it's always a struct page underneath. All the page pool internals are converted to use struct netmem instead of struct page, and the page pool now exports 2 APIs:
1. The existing struct page API. 2. The new struct netmem API.
Keeping the existing API is transitional; we do not want to refactor all the current drivers using the page pool at once.
The netmem abstraction is currently a no-op. The page_pool uses page_to_netmem() to convert allocated pages to netmem, and uses netmem_to_page() to convert the netmem back to pages to pass to mm APIs,
Follow up patches to this series add non-paged netmem support to the page_pool. This change is factored out on its own to limit the code churn to this 1 patch, for ease of code review.
Signed-off-by: Mina Almasry <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
2ca58ed2 |
| 27-Jun-2024 |
Pavel Begunkov <[email protected]> |
net: limit scope of a skb_zerocopy_iter_stream var
skb_zerocopy_iter_stream() only uses @orig_uarg in the !link_skb path, and we can move the local variable in the appropriate block.
Signed-off-by:
net: limit scope of a skb_zerocopy_iter_stream var
skb_zerocopy_iter_stream() only uses @orig_uarg in the !link_skb path, and we can move the local variable in the appropriate block.
Signed-off-by: Pavel Begunkov <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|