|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3 |
|
| #
04e65df9 |
| 09-Oct-2024 |
Paolo Abeni <[email protected]> |
netlink: spec: add shaper YAML spec
Define the user-space visible interface to query, configure and delete network shapers via yaml definition.
Add dummy implementations for the relevant NL callbac
netlink: spec: add shaper YAML spec
Define the user-space visible interface to query, configure and delete network shapers via yaml definition.
Add dummy implementations for the relevant NL callbacks.
set() and delete() operations touch a single shaper creating/updating or deleting it. The group() operation creates a shaper's group, nesting multiple input shapers under the specified output shaper.
Reviewed-by: Jiri Pirko <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> Link: https://patch.msgid.link/7a33a1ff370bdbcd0cd3f909575c912cd56f41da.1728460186.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.12-rc2, v6.12-rc1, v6.11 |
|
| #
26d74602 |
| 13-Sep-2024 |
Mina Almasry <[email protected]> |
memory-provider: disable building dmabuf mp on !CONFIG_PAGE_POOL
When CONFIG_TRACEPOINTS=y but CONFIG_PAGE_POOL=n, we end up with this build failure that is reported by the 0-day bot:
ld: vmlinux.o
memory-provider: disable building dmabuf mp on !CONFIG_PAGE_POOL
When CONFIG_TRACEPOINTS=y but CONFIG_PAGE_POOL=n, we end up with this build failure that is reported by the 0-day bot:
ld: vmlinux.o: in function `mp_dmabuf_devmem_alloc_netmems': >> (.text+0xc37286): undefined reference to `__tracepoint_page_pool_state_hold' >> ld: (.text+0xc3729a): undefined reference to `__SCT__tp_func_page_pool_state_hold' >> ld: vmlinux.o:(__jump_table+0x10c48): undefined reference to `__tracepoint_page_pool_state_hold' >> ld: vmlinux.o:(.static_call_sites+0xb824): undefined reference to `__SCK__tp_func_page_pool_state_hold'
The root cause is that in this configuration, traces are enabled but the page_pool specific trace_page_pool_state_hold is not registered.
There is no reason to build the dmabuf memory provider when CONFIG_PAGE_POOL is not present, as it's really a provider to the page_pool.
In fact the whole NET_DEVMEM is RX path-only at the moment, so we can make the entire config dependent on the PAGE_POOL.
Note that this may need to be revisited after/while devmem TX is added, as devmem TX likely does not need CONFIG_PAGE_POOL. For now this build fix is sufficient.
Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Mina Almasry <[email protected]> Reviewed-by: Simon Horman <[email protected]> Tested-by: Simon Horman <[email protected]> # build-tested Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
170aafe3 |
| 10-Sep-2024 |
Mina Almasry <[email protected]> |
netdev: support binding dma-buf to netdevice
Add a netdev_dmabuf_binding struct which represents the dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to rx queues on the netdevice
netdev: support binding dma-buf to netdevice
Add a netdev_dmabuf_binding struct which represents the dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to rx queues on the netdevice. On the binding, the dma_buf_attach & dma_buf_map_attachment will occur. The entries in the sg_table from mapping will be inserted into a genpool to make it ready for allocation.
The chunks in the genpool are owned by a dmabuf_chunk_owner struct which holds the dma-buf offset of the base of the chunk and the dma_addr of the chunk. Both are needed to use allocations that come from this chunk.
We create a new type that represents an allocation from the genpool: net_iov. We setup the net_iov allocation size in the genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally allocated by the page pool and given to the drivers.
The user can unbind the dmabuf from the netdevice by closing the netlink socket that established the binding. We do this so that the binding is automatically unbound even if the userspace process crashes.
The binding and unbinding leaves an indicator in struct netdev_rx_queue that the given queue is bound, and the binding is actuated by resetting the rx queue using the queue API.
The netdev_dmabuf_binding struct is refcounted, and releases its resources only when all the refs are released.
Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: Kaiyuan Zhang <[email protected]> Signed-off-by: Mina Almasry <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> # excluding netlink Acked-by: Daniel Vetter <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5 |
|
| #
f750dfe8 |
| 21-Jun-2024 |
Heng Qi <[email protected]> |
ethtool: provide customized dim profile management
The NetDIM library, currently leveraged by an array of NICs, delivers excellent acceleration benefits. Nevertheless, NICs vary significantly in the
ethtool: provide customized dim profile management
The NetDIM library, currently leveraged by an array of NICs, delivers excellent acceleration benefits. Nevertheless, NICs vary significantly in their dim profile list prerequisites.
Specifically, virtio-net backends may present diverse sw or hw device implementation, making a one-size-fits-all parameter list impractical. On Alibaba Cloud, the virtio DPU's performance under the default DIM profile falls short of expectations, partly due to a mismatch in parameter configuration.
I also noticed that ice/idpf/ena and other NICs have customized profilelist or placed some restrictions on dim capabilities.
Motivated by this, I tried adding new params for "ethtool -C" that provides a per-device control to modify and access a device's interrupt parameters.
Usage ======== The target NIC is named ethx.
Assume that ethx only declares support for rx profile setting (with DIM_PROFILE_RX flag set in profile_flags) and supports modification of usec and pkt fields.
1. Query the currently customized list of the device
$ ethtool -c ethx ... rx-profile: {.usec = 1, .pkts = 256, .comps = n/a,}, {.usec = 8, .pkts = 256, .comps = n/a,}, {.usec = 64, .pkts = 256, .comps = n/a,}, {.usec = 128, .pkts = 256, .comps = n/a,}, {.usec = 256, .pkts = 256, .comps = n/a,} tx-profile: n/a
2. Tune $ ethtool -C ethx rx-profile 1,1,n_2,n,n_3,3,n_4,4,n_n,5,n "n" means do not modify this field. $ ethtool -c ethx ... rx-profile: {.usec = 1, .pkts = 1, .comps = n/a,}, {.usec = 2, .pkts = 256, .comps = n/a,}, {.usec = 3, .pkts = 3, .comps = n/a,}, {.usec = 4, .pkts = 4, .comps = n/a,}, {.usec = 256, .pkts = 5, .comps = n/a,} tx-profile: n/a
3. Hint If the device does not support some type of customized dim profiles, the corresponding "n/a" will display.
If the "n/a" field is being modified, -EOPNOTSUPP will be reported.
Signed-off-by: Heng Qi <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc4, v6.10-rc3 |
|
| #
9b6a30fe |
| 05-Jun-2024 |
Jason Xing <[email protected]> |
net: allow rps/rfs related configs to be switched
After John Sperbeck reported a compile error if the CONFIG_RFS_ACCEL is off, I found that I cannot easily enable/disable the config because of lack
net: allow rps/rfs related configs to be switched
After John Sperbeck reported a compile error if the CONFIG_RFS_ACCEL is off, I found that I cannot easily enable/disable the config because of lack of the prompt when using 'make menuconfig'. Therefore, I decided to change rps/rfc related configs altogether.
Signed-off-by: Jason Xing <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
show more ...
|
|
Revision tags: v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7 |
|
| #
768cf841 |
| 03-May-2024 |
Oleksij Rempel <[email protected]> |
net: add IEEE 802.1q specific helpers
IEEE 802.1q specification provides recommendation and examples which can be used as good default values for different drivers.
This patch implements mapping ex
net: add IEEE 802.1q specific helpers
IEEE 802.1q specification provides recommendation and examples which can be used as good default values for different drivers.
This patch implements mapping examples documented in IEEE 802.1Q-2022 in Annex I "I.3 Traffic type to traffic class mapping" and IETF DSCP naming and mapping DSCP to Traffic Type inspired by RFC8325.
This helpers will be used in followup patches for dsa/microchip DCB implementation.
Signed-off-by: Oleksij Rempel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3 |
|
| #
9f06f87f |
| 03-Apr-2024 |
Jakub Kicinski <[email protected]> |
net: skbuff: generalize the skb->decrypted bit
The ->decrypted bit can be reused for other crypto protocols. Remove the direct dependency on TLS, add helpers to clean up the ifdefs leaking out every
net: skbuff: generalize the skb->decrypted bit
The ->decrypted bit can be reused for other crypto protocols. Remove the direct dependency on TLS, add helpers to clean up the ifdefs leaking out everywhere.
Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5 |
|
| #
ea7f3cfa |
| 15-Feb-2024 |
Breno Leitao <[email protected]> |
net: bql: allow the config to be disabled
It is impossible to disable BQL individually today, since there is no prompt for the Kconfig entry, so, the BQL is always enabled if SYSFS is enabled.
Crea
net: bql: allow the config to be disabled
It is impossible to disable BQL individually today, since there is no prompt for the Kconfig entry, so, the BQL is always enabled if SYSFS is enabled.
Create a prompt entry for BQL, so, it could be enabled or disabled at build time independently of SYSFS.
Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8 |
|
| #
98e20e5e |
| 26-Dec-2023 |
Quentin Deslandes <[email protected]> |
bpfilter: remove bpfilter
bpfilter was supposed to convert iptables filtering rules into BPF programs on the fly, from the kernel, through a usermode helper. The base code for the UMH was introduced
bpfilter: remove bpfilter
bpfilter was supposed to convert iptables filtering rules into BPF programs on the fly, from the kernel, through a usermode helper. The base code for the UMH was introduced in 2018, and couple of attempts (2, 3) tried to introduce the BPF program generate features but were abandoned.
bpfilter now sits in a kernel tree unused and unusable, occasionally causing confusion amongst Linux users (4, 5).
As bpfilter is now developed in a dedicated repository on GitHub (6), it was suggested a couple of times this year (LSFMM/BPF 2023, LPC 2023) to remove the deprecated kernel part of the project. This is the purpose of this patch.
[1]: https://lore.kernel.org/lkml/[email protected]/ [2]: https://lore.kernel.org/bpf/[email protected]/#t [3]: https://lore.kernel.org/lkml/[email protected]/ [4]: https://dxuuu.xyz/bpfilter.html [5]: https://github.com/linuxkit/linuxkit/pull/3904 [6]: https://github.com/facebook/bpfilter
Signed-off-by: Quentin Deslandes <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
show more ...
|
|
Revision tags: v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6 |
|
| #
b3098d32 |
| 09-Oct-2023 |
Willem de Bruijn <[email protected]> |
net: add skb_segment kunit test
Add unit testing for skb segment. This function is exercised by many different code paths, such as GSO_PARTIAL or GSO_BY_FRAGS, linear (with or without head_frag), fr
net: add skb_segment kunit test
Add unit testing for skb segment. This function is exercised by many different code paths, such as GSO_PARTIAL or GSO_BY_FRAGS, linear (with or without head_frag), frags or frag_list skbs, etc.
It is infeasible to manually run tests that cover all code paths when making changes. The long and complex function also makes it hard to establish through analysis alone that a patch has no unintended side-effects.
Add code coverage through kunit regression testing. Introduce kunit infrastructure for tests under net/core, and add this first test.
This first skb_segment test exercises a simple case: a linear skb. Follow-on patches will parametrize the test and add more variants.
Tested: Built and ran the test with
make ARCH=um mrproper
./tools/testing/kunit/kunit.py run \ --kconfig_add CONFIG_NET=y \ --kconfig_add CONFIG_DEBUG_KERNEL=y \ --kconfig_add CONFIG_DEBUG_INFO=y \ --kconfig_add=CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y \ net_core_gso
Signed-off-by: Willem de Bruijn <[email protected]> Reviewed-by: Florian Westphal <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
| #
1dab4713 |
| 09-Oct-2023 |
Arnd Bergmann <[email protected]> |
appletalk: remove ipddp driver
After the cops driver is removed, ipddp is now the only CONFIG_DEV_APPLETALK but as far as I can tell, this also has no users and can be removed, making appletalk supp
appletalk: remove ipddp driver
After the cops driver is removed, ipddp is now the only CONFIG_DEV_APPLETALK but as far as I can tell, this also has no users and can be removed, making appletalk support purely based on ethertalk, using ethernet hardware.
Link: https://lore.kernel.org/netdev/[email protected]/ Link: https://lore.kernel.org/netdev/[email protected]/ Cc: Doug Brown <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3 |
|
| #
e420bed0 |
| 19-Jul-2023 |
Daniel Borkmann <[email protected]> |
bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF ingress and egress data path side for allowing BPF program managem
bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF ingress and egress data path side for allowing BPF program management based on fds via bpf() syscall through the newly added generic multi-prog API. The main goal behind this work which we also presented at LPC [0] last year and a recent update at LSF/MM/BPF this year [3] is to support long-awaited BPF link functionality for tc BPF programs, which allows for a model of safe ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes necessary to avoid hard to debug incidents either through stale leftover programs or 3rd party applications accidentally stepping on each others toes. As a recap, a BPF link represents the attachment of a BPF program to a BPF hook point. The BPF link holds a single reference to keep BPF program alive. Moreover, hook points do not reference a BPF link, only the application's fd or pinning does. A BPF link holds meta-data specific to attachment and implements operations for link creation, (atomic) BPF program update, detachment and introspection. The motivation for BPF links for tc BPF programs is multi-fold, for example:
- From Meta: "It's especially important for applications that are deployed fleet-wide and that don't "control" hosts they are deployed to. If such application crashes and no one notices and does anything about that, BPF program will keep running draining resources or even just, say, dropping packets. We at FB had outages due to such permanent BPF attachment semantics. With fd-based BPF link we are getting a framework, which allows safe, auto-detachable behavior by default, unless application explicitly opts in by pinning the BPF link." [1]
- From Cilium-side the tc BPF programs we attach to host-facing veth devices and phys devices build the core datapath for Kubernetes Pods, and they implement forwarding, load-balancing, policy, EDT-management, etc, within BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently experienced hard-to-debug issues in a user's staging environment where another Kubernetes application using tc BPF attached to the same prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath it. The goal is to establish a clear/safe ownership model via links which cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are in line also with XDP links: BPF links cannot replace other BPF links, BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF links and lastly only non-BPF links can replace non-BPF links. In case of Cilium, this would solve mentioned issue of safe ownership model as 3rd party applications would not be able to accidentally wipe Cilium programs, even if they are not BPF link aware.
Earlier attempts [4] have tried to integrate BPF links into core tc machinery to solve cls_bpf, which has been intrusive to the generic tc kernel API with extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could be wiped from the qdisc also. Locking a tc BPF program in place this way, is getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF attach API, so that the BPF link implementation blends in naturally similar to other link types which are fd-based and without the need for changing core tc internal APIs. BPF programs for tc can then be successively migrated from classic cls_bpf to the new tc BPF link without needing to change the program's source code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change and neither does this change touch on tc core kernel APIs. The gist of this patch is that the ingress and egress hook have a lightweight, qdisc-less extension for BPF to attach its tc BPF programs, in other words, a minimal entry point for tc BPF. The name tcx has been suggested from discussion of earlier revisions of this work as a good fit, and to more easily differ between the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array with program pointers which is separated from control plane (slow-path) data. Earlier versions of this work used priority to determine ordering and expression of dependencies similar as with classic tc, but it was challenged that for something more future-proof a better user experience is required. Hence this resulted in the design and development of the generic attach/detach/query API for multi-progs. See prior patch with its discussion on the API design. tcx is the first user and later we plan to integrate also others, for example, one candidate is multi-prog support for XDP which would benefit and have the same 'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs, so they don't need to be rewritten specifically. Compatibility to call into classic tcf_classify() is also provided in order to allow successive migration or both to cleanly co-exist where needed given its all one logical tc layer and the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT. The fd-based API is behind a static key, so that when unused the code is also not entered. The struct tcx_entry's program array is currently static, but could be made dynamic if necessary at a point in future. The a/b pair swap design has been chosen so that for detachment there are no allocations which otherwise could fail.
The work has been tested with tc-testing selftest suite which all passes, as well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews of this work.
[0] https://lpc.events/event/16/contributions/1353/ [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf [4] https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
show more ...
|
|
Revision tags: v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4 |
|
| #
c857946a |
| 23-May-2023 |
Kurt Kanzenbach <[email protected]> |
net/core: Enable socket busy polling on -RT
Busy polling is currently not allowed on PREEMPT_RT, because it disables preemption while invoking the NAPI callback. It is not possible to acquire sleepi
net/core: Enable socket busy polling on -RT
Busy polling is currently not allowed on PREEMPT_RT, because it disables preemption while invoking the NAPI callback. It is not possible to acquire sleeping locks with disabled preemption. For details see commit 20ab39d13e2e ("net/core: disable NET_RX_BUSY_POLL on PREEMPT_RT").
However, strict cyclic and/or low latency network applications may prefer busy polling e.g., using AF_XDP instead of interrupt driven communication.
The preempt_disable() is used in order to prevent the poll_owner and NAPI owner to be preempted while owning the resource to ensure progress. Netpoll performs busy polling in order to acquire the lock. NAPI is locked by setting the NAPIF_STATE_SCHED flag. There is no busy polling if the flag is set and the "owner" is preempted. Worst case is that the task owning NAPI gets preempted and NAPI processing stalls. This is can be prevented by properly prioritising the tasks within the system.
Allow RX_BUSY_POLL on PREEMPT_RT if NETPOLL is disabled. Don't disable preemption on PREEMPT_RT within the busy poll loop.
Tested on x86 hardware with v6.1-RT and v6.3-RT on Intel i225 (igc) with AF_XDP/ZC sockets configured to run in busy polling mode.
Suggested-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Kurt Kanzenbach <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3 |
|
| #
88232ec1 |
| 17-Apr-2023 |
Chuck Lever <[email protected]> |
net/handshake: Add Kunit tests for the handshake consumer API
These verify the API contracts and help exercise lifetime rules for consumer sockets and handshake_req structures.
One way to run these
net/handshake: Add Kunit tests for the handshake consumer API
These verify the API contracts and help exercise lifetime rules for consumer sockets and handshake_req structures.
One way to run these tests:
./tools/testing/kunit/kunit.py run --kunitconfig ./net/handshake/.kunitconfig
Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
| #
3b3009ea |
| 17-Apr-2023 |
Chuck Lever <[email protected]> |
net/handshake: Create a NETLINK service for handling handshake requests
When a kernel consumer needs a transport layer security session, it first needs a handshake to negotiate and establish a sessi
net/handshake: Create a NETLINK service for handling handshake requests
When a kernel consumer needs a transport layer security session, it first needs a handshake to negotiate and establish a session. This negotiation can be done in user space via one of the several existing library implementations, or it can be done in the kernel.
No in-kernel handshake implementations yet exist. In their absence, we add a netlink service that can:
a. Notify a user space daemon that a handshake is needed.
b. Once notified, the daemon calls the kernel back via this netlink service to get the handshake parameters, including an open socket on which to establish the session.
c. Once the handshake is complete, the daemon reports the session status and other information via a second netlink operation. This operation marks that it is safe for the kernel to use the open socket and the security session established there.
The notification service uses a multicast group. Each handshake mechanism (eg, tlshd) adopts its own group number so that the handshake services are completely independent of one another. The kernel can then tell via netlink_has_listeners() whether a handshake service is active and prepared to handle a handshake request.
A new netlink operation, ACCEPT, acts like accept(2) in that it instantiates a file descriptor in the user space daemon's fd table. If this operation is successful, the reply carries the fd number, which can be treated as an open and ready file descriptor.
While user space is performing the handshake, the kernel keeps its muddy paws off the open socket. A second new netlink operation, DONE, indicates that the user space daemon is finished with the socket and it is safe for the kernel to use again. The operation also indicates whether a session was established successfully.
Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4 |
|
| #
3948b059 |
| 23-Mar-2023 |
Eric Dumazet <[email protected]> |
net: introduce a config option to tweak MAX_SKB_FRAGS
Currently, MAX_SKB_FRAGS value is 17.
For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg() attempts order-3 allocations, stuff
net: introduce a config option to tweak MAX_SKB_FRAGS
Currently, MAX_SKB_FRAGS value is 17.
For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg() attempts order-3 allocations, stuffing 32768 bytes per frag.
But with zero copy, we use order-0 pages.
For BIG TCP to show its full potential, we add a config option to be able to fit up to 45 segments per skb.
This is also needed for BIG TCP rx zerocopy, as zerocopy currently does not support skbs with frag list.
We have used MAX_SKB_FRAGS=45 value for years at Google before we deployed 4K MTU, with no adverse effect, other than a recent issue in mlx4, fixed in commit 26782aad00cc ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")
Back then, goal was to be able to receive full size (64KB) GRO packets without the frag_list overhead.
Note that /proc/sys/net/core/max_skb_frags can also be used to limit the number of fragments TCP can use in tx packets.
By default we keep the old/legacy value of 17 until we get more coverage for the updated values.
Sizes of struct skb_shared_info on 64bit arches
MAX_SKB_FRAGS | sizeof(struct skb_shared_info): ============================================== 17 320 21 320+64 = 384 25 320+128 = 448 29 320+192 = 512 33 320+256 = 576 37 320+320 = 640 41 320+384 = 704 45 320+448 = 768
This inflation might cause problems for drivers assuming they could pack both the incoming packet (for MTU=1500) and skb_shared_info in half a page, using build_skb().
v3: fix build error when CONFIG_NET=n v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Reviewed-by: Jason Xing <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2 |
|
| #
1202cdd6 |
| 18-Aug-2022 |
Stephen Hemminger <[email protected]> |
Remove DECnet support from kernel
DECnet is an obsolete network protocol that receives more attention from kernel janitors than users. It belongs in computer protocol history museum not in Linux ker
Remove DECnet support from kernel
DECnet is an obsolete network protocol that receives more attention from kernel janitors than users. It belongs in computer protocol history museum not in Linux kernel.
It has been "Orphaned" in kernel since 2010. The iproute2 support for DECnet was dropped in 5.0 release. The documentation link on Sourceforge says it is abandoned there as well.
Leave the UAPI alone to keep userspace programs compiling. This means that there is still an empty neighbour table for AF_DECNET.
The table of /proc/sys/net entries was updated to match current directories and reformatted to be alphabetical.
Signed-off-by: Stephen Hemminger <[email protected]> Acked-by: David Ahern <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7 |
|
| #
8610037e |
| 02-Mar-2022 |
Joe Damato <[email protected]> |
page_pool: Add allocation stats
Add per-pool statistics counters for the allocation path of a page pool. These stats are incremented in softirq context, so no locking or per-cpu variables are needed
page_pool: Add allocation stats
Add per-pool statistics counters for the allocation path of a page pool. These stats are incremented in softirq context, so no locking or per-cpu variables are needed.
This code is disabled by default and a kernel config option is provided for users who wish to enable them.
The statistics added are: - fast: successful fast path allocations - slow: slow path order-0 allocations - slow_high_order: slow path high order allocations - empty: ptr ring is empty, so a slow path allocation was forced. - refill: an allocation which triggered a refill of the cache - waive: pages obtained from the ptr ring that cannot be added to the cache due to a NUMA mismatch.
Signed-off-by: Joe Damato <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Reviewed-by: Ilias Apalodimas <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2 |
|
| #
2c193f2c |
| 19-Nov-2021 |
Jakub Kicinski <[email protected]> |
net: kunit: add a test for dev_addr_lists
Add a KUnit test for the dev_addr API.
Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
|
|
Revision tags: v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4 |
|
| #
20ab39d1 |
| 01-Oct-2021 |
Sebastian Andrzej Siewior <[email protected]> |
net/core: disable NET_RX_BUSY_POLL on PREEMPT_RT
napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire sleeping locks with disabled preemption which would be required while
net/core: disable NET_RX_BUSY_POLL on PREEMPT_RT
napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire sleeping locks with disabled preemption which would be required while __napi_poll() invokes the callback of the driver.
A threaded interrupt performing the NAPI-poll can be preempted on PREEMPT_RT. A RT thread on another CPU may observe NAPIF_STATE_SCHED bit set and busy-spin until it is cleared or its spin time runs out. Given it is the task with the highest priority it will never observe the NEED_RESCHED bit set. In this case the time is better spent by simply sleeping.
The NET_RX_BUSY_POLL is disabled by default (the system wide sysctls for poll/read are set to zero). Disabling NET_RX_BUSY_POLL on PREEMPT_RT to avoid wrong locking context in case it is used.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4 |
|
| #
bc49d816 |
| 29-Jul-2021 |
Jeremy Kerr <[email protected]> |
mctp: Add MCTP base
Add basic Kconfig, an initial (empty) af_mctp source object, and {AF,PF}_MCTP definitions, and the required definitions for a new protocol type.
Signed-off-by: Jeremy Kerr <jk@c
mctp: Add MCTP base
Add basic Kconfig, an initial (empty) af_mctp source object, and {AF,PF}_MCTP definitions, and the required definitions for a new protocol type.
Signed-off-by: Jeremy Kerr <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2 |
|
| #
b24abcff |
| 11-May-2021 |
Daniel Borkmann <[email protected]> |
bpf, kconfig: Add consolidated menu entry for bpf with core options
Right now, all core BPF related options are scattered in different Kconfig locations mainly due to historic reasons. Moving forwar
bpf, kconfig: Add consolidated menu entry for bpf with core options
Right now, all core BPF related options are scattered in different Kconfig locations mainly due to historic reasons. Moving forward, lets add a proper subsystem entry under ...
General setup ---> BPF subsystem --->
... in order to have all knobs in a single location and thus ease BPF related configuration. Networking related bits such as sockmap are out of scope for the general setup and therefore better suited to remain in net/Kconfig.
Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/f23f58765a4d59244ebd8037da7b6a6b2fb58446.1620765074.git.daniel@iogearbox.net
show more ...
|
|
Revision tags: v5.13-rc1 |
|
| #
4a52dd8f |
| 28-Apr-2021 |
Oleksij Rempel <[email protected]> |
net: selftest: fix build issue if INET is disabled
In case ethernet driver is enabled and INET is disabled, selftest will fail to build.
Reported-by: Randy Dunlap <[email protected]> Fixes: 3e1
net: selftest: fix build issue if INET is disabled
In case ethernet driver is enabled and INET is disabled, selftest will fail to build.
Reported-by: Randy Dunlap <[email protected]> Fixes: 3e1e58d64c3d ("net: add generic selftest support") Signed-off-by: Oleksij Rempel <[email protected]> Acked-by: Randy Dunlap <[email protected]> # build-tested Reviewed-by: Florian Fainelli <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
show more ...
|
|
Revision tags: v5.12 |
|
| #
3e1e58d6 |
| 19-Apr-2021 |
Oleksij Rempel <[email protected]> |
net: add generic selftest support
Port some parts of the stmmac selftest and reuse it as basic generic selftest library. This patch was tested with following combinations: - iMX6DL FEC -> AT8035 - i
net: add generic selftest support
Port some parts of the stmmac selftest and reuse it as basic generic selftest library. This patch was tested with following combinations: - iMX6DL FEC -> AT8035 - iMX6DL FEC -> SJA1105Q switch -> KSZ8081 - iMX6DL FEC -> SJA1105Q switch -> KSZ9031 - AR9331 ag71xx -> AR9331 PHY - AR9331 ag71xx -> AR9331 switch -> AR9331 PHY
Signed-off-by: Oleksij Rempel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|
|
Revision tags: v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4 |
|
| #
919067cc |
| 19-Mar-2021 |
Eric Dumazet <[email protected]> |
net: add CONFIG_PCPU_DEV_REFCNT
I was working on a syzbot issue, claiming one device could not be dismantled because its refcount was -1
unregister_netdevice: waiting for sit0 to become free. Usage
net: add CONFIG_PCPU_DEV_REFCNT
I was working on a syzbot issue, claiming one device could not be dismantled because its refcount was -1
unregister_netdevice: waiting for sit0 to become free. Usage count = -1
It would be nice if syzbot could trigger a warning at the time this reference count became negative.
This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults to per cpu variables (as before this patch) on SMP builds.
v2: free_dev label in alloc_netdev_mqs() is moved to avoid a compiler warning (-Wunused-label), as reported by kernel test robot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
show more ...
|