Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5

# 4754affe | 28-Feb-2025 | Nicolas Dichtel <[email protected]>
net: advertise netns_immutable property via netlink
Since commit 05c1280a2bcf ("netdev_features: convert NETIF_F_NETNS_LOCAL to dev->netns_local"), there is no way to see if the netns_immutable property is set on a device. Let's add a netlink attribute to advertise it.
Signed-off-by: Nicolas Dichtel <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Alexander Lobakin <[email protected]> Reviewed-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>

# e1f95b19 | 26-Feb-2025 | Daniel Borkmann <[email protected]>
geneve: Allow users to specify source port range
Recently, in the case of Cilium, we ran into users on Azure who need to use tunneling for east/west traffic because they would hit IPAM API limits for Kubernetes Pods if they went with publicly routable IPs for Pods. For tunneling, Cilium supports the option of vxlan or geneve. In order to spread flows via RSS among remote CPUs, both derive a source port hash via udp_flow_src_port(), which takes the inner packet's skb->hash into account. For clusters with many nodes, this can then hit a new limitation [0]: Today, the Azure networking stack supports 1M total flows (500k inbound and 500k outbound) for a VM. [...] Once this limit is hit, other connections are dropped. [...] Each flow is distinguished by 5-tuple (protocol, local IP address, remote IP address, local port, and remote port) information. [...]
For vxlan and geneve, this can create a massive amount of UDP flows which then run into the limits if stale flows are not evicted fast enough. One option to mitigate this for vxlan is to narrow the source port range via IFLA_VXLAN_PORT_RANGE while still being able to benefit from RSS. However, geneve currently does not have this option and it spreads traffic across the full source port range of [1, USHRT_MAX]. To overcome this limitation also for geneve, add an equivalent IFLA_GENEVE_PORT_RANGE setting for users.
Note that struct geneve_config before/after still remains at 2 cachelines on x86-64. The low/high members of struct ifla_geneve_port_range (which is uapi exposed) are of type __be16. While they would be perfectly fine to be of __u16 type, the consensus was that it would be good to be consistent with the existing struct ifla_vxlan_port_range from a uapi consumer PoV.
Signed-off-by: Daniel Borkmann <[email protected]> Link: https://learn.microsoft.com/en-us/azure/virtual-network/virtual-machine-network-throughput [0] Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
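A minimal usage sketch of a narrowed source port range at link creation time; the geneve keyword shown here is an assumption by analogy with VXLAN's existing srcport option and may differ in your iproute2 version:
# ip link add geneve1 type geneve id 100 remote 192.0.2.2 srcport 49152 49216
# ip link add vxlan1 type vxlan id 100 remote 192.0.2.2 dstport 4789 srcport 49152 49216
The second command shows the pre-existing VXLAN equivalent (IFLA_VXLAN_PORT_RANGE) for comparison.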

Revision tags: v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4

# b9ed315d | 20-Dec-2024 | Daniel Borkmann <[email protected]>
netkit: Allow for configuring needed_{head,tail}room
Allow the user to configure needed_{head,tail}room for both netkit devices. The idea is similar to 163e529200af ("veth: implement ndo_set_rx_headroom"), with the difference that the two parameters can be specified upon device creation. By default, the current behavior stays as is, that is, needed_{head,tail}room is 0.
In case of Cilium, for example, the netkit devices are not enslaved into a bridge or openvswitch device (rather, BPF-based redirection is used out of tcx), and as such these parameters are not propagated into the Pod's netns via peer device.
Given Cilium can run in vxlan/geneve tunneling mode (needed_headroom) and/or be used in combination with WireGuard (needed_{head,tail}room), allow the Cilium CNI plugin to specify these two upon netkit device creation.
Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]

Revision tags: v6.13-rc3, v6.13-rc2

# 6c11379b | 05-Dec-2024 | Petr Machata <[email protected]>
vxlan: Add an attribute to make VXLAN header validation configurable
The set of bits that the VXLAN netdevice currently considers reserved is defined by the features enabled at the netdevice construction. In order to make this configurable, add an attribute, IFLA_VXLAN_RESERVED_BITS. The payload is a pair of big-endian u32's covering the VXLAN header. This is validated against the set of flags used by the various enabled VXLAN features, and attempts to override bits used by an enabled feature are bounced.
Signed-off-by: Petr Machata <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Link: https://patch.msgid.link/c657275e5ceed301e62c69fe8e559e32909442e2.1733412063.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <[email protected]>

Revision tags: v6.13-rc1, v6.12, v6.12-rc7

# 580db513 | 05-Nov-2024 | Khang Nguyen <[email protected]>
net: mctp: Expose transport binding identifier via IFLA attribute
MCTP control protocol implementations are transport binding dependent. Endpoint discovery is mandatory based on transport binding. Message timing requirements are specified in each respective transport binding specification.
However, we currently have no means to get this information from MCTP links.
Add an IFLA_MCTP_PHYS_BINDING netlink link attribute, which represents the transport type using the DMTF DSP0239-defined type numbers, returned as part of RTM_GETLINK data.
We get an IFLA_MCTP_PHYS_BINDING attribute for each MCTP link, for example:
- 0x00 (unspec) for loopback interface;
- 0x01 (SMBus/I2C) for mctpi2c%d interfaces; and
- 0x05 (serial) for mctpserial%d interfaces.
Signed-off-by: Khang Nguyen <[email protected]> Reviewed-by: Matt Johnston <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

Revision tags: v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2

# 83134ef4 | 04-Oct-2024 | Daniel Borkmann <[email protected]>
netkit: Add option for scrubbing skb meta data
Jordan reported that when running Cilium with netkit in per-endpoint-routes mode, network policy misclassifies traffic. In this direct routing mode of Cilium, which is used in the case of GKE/EKS/AKS, the Pod's BPF program that enforces policy sits on the netkit primary device's egress side.
The issue here is that netkit's netkit_prep_forward() clears meta data such as skb->mark and skb->priority before executing the BPF program. Thus, identity data stored there by earlier BPF programs (e.g. from tcx ingress on the physical device) gets cleared instead of being made available for the primary device's program to process. While for traffic egressing the Pod via the peer device this might be desired, it is different for the primary device, where, compared to tcx egress on the host veth, this information would be available.
To address this, add a new parameter for the device orchestration to allow control of skb->mark and skb->priority scrubbing, to make the two accessible from BPF (and eventually leave it up to the program to scrub). By default, the current behavior is retained. For netkit peer this also enables the use case where applications could cooperate/signal intent to the BPF program.
Note that struct netkit has a 4 byte hole between policy and bundle which is used here, in other words, struct netkit's first cacheline content used in fast-path does not get moved around.
Fixes: 35dfaad7188c ("netkit, bpf: Add bpf programmable net device") Reported-by: Jordan Rife <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Nikolay Aleksandrov <[email protected]> Link: https://github.com/cilium/cilium/issues/34042 Acked-by: Jakub Kicinski <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>

# f858cc9e | 03-Oct-2024 | Eric Dumazet <[email protected]>
net: add IFLA_MAX_PACING_OFFLOAD_HORIZON device attribute
Some network devices have the ability to offload EDT (Earliest Departure Time) which is the model used for TCP pacing and FQ packet scheduler.
Some of them implement the timing wheel mechanism described in https://saeed.github.io/files/carousel-sigcomm17.pdf with an associated 'timing wheel horizon'.
This patch adds dev->max_pacing_offload_horizon expressing this timing wheel horizon in nsec units.
This is a read-only attribute.
Unless a driver sets it, dev->max_pacing_offload_horizon is zero.
v2: addressed Jakub feedback ( https://lore.kernel.org/netdev/[email protected]/T/#mf6294d714c41cc459962154cc2580ce3c9693663 ) v3: added yaml doc (also per Jakub feedback)
Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

Revision tags: v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9

# 999cb275 | 06-May-2024 | Pablo Neira Ayuso <[email protected]>
gtp: add IPv6 support
Add new iflink attributes to configure the in-kernel UDP listener socket address: IFLA_GTP_LOCAL and IFLA_GTP_LOCAL6. If none of these attributes are specified, the default is still IPv4 INADDR_ANY for backward compatibility.
Add new attributes to set up family and IPv6 address of GTP tunnels: GTPA_FAMILY, GTPA_PEER_ADDR6 and GTPA_MS_ADDR6. If no GTPA_FAMILY is specified, AF_INET is assumed for backward compatibility.
The IPV6_ADDRFORM setsockopt allows downgrading a socket from IPv6 to IPv4 after the socket is bound. The assumption is that the socket listener attached to the gtp device needs to be either IPv4 or IPv6; therefore, the GTP socket listener does not allow an IPv4-mapped-IPv6 listener.
Signed-off-by: Pablo Neira Ayuso <[email protected]>

Revision tags: v6.9-rc7, v6.9-rc6

# 5055cccf | 23-Apr-2024 | Lukasz Majewski <[email protected]>
net: hsr: Provide RedBox support (HSR-SAN)
Introduce RedBox support (HSR-SAN to be more precise) for HSR networks. The following traffic reduction optimizations have been implemented:
- Do not send HSR supervisory frames to Port C (interlink)
- Do not forward to HSR ring frames addressed to Port C
- Do not forward to Port C frames from HSR ring
- Do not send duplicate HSR frame to HSR ring when destination is Port C
The corresponding patch to modify iproute2 sources has already been sent: https://lore.kernel.org/netdev/[email protected]/T/
Testing procedure (veth and netns): ----------------------------------- One shall run: linux-vanila/tools/testing/selftests/net/hsr/hsr_redbox.sh (Detailed description of the setup one can find in the test script file).
Testing procedure (real hardware): ---------------------------------- The EVB-KSZ9477 has been used for testing on net-next branch (SHA1: 5fc68320c1fb3c7d456ddcae0b4757326a043e6f).
Ports 4/5 were used for SW-managed HSR (hsr1), after hsr0 had first been created for ports 1/2 (with HW offloading for ksz9477). Port 3 has been used as the interlink port (single USB-ETH dongle).
Configuration - RedBox (EVB-KSZ9477):
ip link set lan1 down; ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 interlink lan3 supervision 45 version 1
ip link set lan4 up; ip link set lan5 up
ip link set lan3 up
ip addr add 192.168.0.11/24 dev hsr1
ip link set hsr1 up
Configuration - DAN-H (EVB-KSZ9477):
ip link set lan1 down; ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 supervision 45 version 1
ip link set lan4 up; ip link set lan5 up
ip addr add 192.168.0.12/24 dev hsr1
ip link set hsr1 up
This approach uses only SW based HSR devices (hsr1).
(ASCII topology diagram flattened in the original: the DAN-H EVB-KSZ9477's Port4/Port5 connect to Port4/Port5 of the RedBox EVB-KSZ9477, whose Port3 interlink connects to a PC via the USB-ETH dongle.)
Signed-off-by: Lukasz Majewski <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>

Revision tags: v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3

# 240fd405 | 02-Feb-2024 | Aahil Awatramani <[email protected]>
bonding: Add independent control state machine
Add support for the independent control state machine per IEEE 802.1AX-2008 5.4.15 in addition to the existing implementation of the coupled control state machine.
Introduces two new states, AD_MUX_COLLECTING and AD_MUX_DISTRIBUTING, in the LACP MUX state machine for separate handling of an initial Collecting state before the combined Collecting and Distributing state. This enables a port to be in a state where it can receive incoming packets while not yet distributing. This is useful for reducing packet loss when a port begins distributing before its partner is able to collect.
Added new functions such as bond_set_slave_tx_disabled_flags and bond_set_slave_rx_enabled_flags to precisely manage the port's collecting and distributing states. Previously, there was no dedicated method to disable TX while keeping RX enabled, which this patch addresses.
Note that the regular flow process in the kernel's bonding driver remains unaffected by this patch. The extension requires explicit opt-in by the user (in order to ensure no disruptions for existing setups) via netlink support using the new bonding parameter coupled_control. The default value for coupled_control is set to 1 so as to preserve existing behaviour.
Signed-off-by: Aahil Awatramani <[email protected]> Reviewed-by: Hangbin Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
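A creation-time sketch of opting in to the independent control state machine; the coupled_control keyword and its on/off value syntax are assumed to be the iproute2 spelling of the new bonding parameter and may not exist in older ip(8) versions:
# ip link add bond0 type bond mode 802.3ad lacp_rate fast coupled_control off
# ip link set dev eth0 down; ip link set dev eth0 master bond0
# ip link set dev eth1 down; ip link set dev eth1 master bond0
With coupled_control left at its default of 1, behaviour is unchanged.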

Revision tags: v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4

# 8c4bafdb | 01-Dec-2023 | Hangbin Liu <[email protected]>
net: bridge: add document for IFLA_BRPORT enum
Add document for IFLA_BRPORT enum so we can use it in Documentation/networking/bridge.rst.
Signed-off-by: Hangbin Liu <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>

# 8ebe0661 | 01-Dec-2023 | Hangbin Liu <[email protected]>
net: bridge: add document for IFLA_BR enum
Add document for IFLA_BR enum so we can use it in Documentation/networking/bridge.rst.
Signed-off-by: Hangbin Liu <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>

Revision tags: v6.7-rc3, v6.7-rc2

# c6e9dba3 | 14-Nov-2023 | Alce Lafranque <[email protected]>
vxlan: add support for flowlabel inherit
By default, VXLAN encapsulation over IPv6 sets the flow label to 0, with an option for a fixed value. This commit adds the ability to inherit the flow label from the inner packet, like other tunnel implementations. This enables devices using only L3 headers for ECMP to correctly balance VXLAN-encapsulated IPv6 packets.
```
$ ./ip/ip link add dummy1 type dummy
$ ./ip/ip addr add 2001:db8::2/64 dev dummy1
$ ./ip/ip link set up dev dummy1
$ ./ip/ip link add vxlan1 type vxlan id 100 flowlabel inherit remote 2001:db8::1 local 2001:db8::2
$ ./ip/ip link set up dev vxlan1
$ ./ip/ip addr add 2001:db8:1::2/64 dev vxlan1
$ ./ip/ip link set arp off dev vxlan1
$ ping -q 2001:db8:1::1 &
$ tshark -d udp.port==8472,vxlan -Vpni dummy1 -c1
[...]
Internet Protocol Version 6, Src: 2001:db8::2, Dst: 2001:db8::1
    0110 .... = Version: 6
    .... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT)
    .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0)
    .... .... ..00 .... .... .... .... .... = Explicit Congestion Notification: Not ECN-Capable Transport (0)
    .... 1011 0001 1010 1111 1011 = Flow Label: 0xb1afb
[...]
Virtual eXtensible Local Area Network
    Flags: 0x0800, VXLAN Network ID (VNI)
    Group Policy ID: 0
    VXLAN Network Identifier (VNI): 100
[...]
Internet Protocol Version 6, Src: 2001:db8:1::2, Dst: 2001:db8:1::1
    0110 .... = Version: 6
    .... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT)
    .... 0000 00.. .... .... .... .... .... = Differentiated Services Codepoint: Default (0)
    .... .... ..00 .... .... .... .... .... = Explicit Congestion Notification: Not ECN-Capable Transport (0)
    .... 1011 0001 1010 1111 1011 = Flow Label: 0xb1afb
```
Signed-off-by: Alce Lafranque <[email protected]> Co-developed-by: Vincent Bernat <[email protected]> Signed-off-by: Vincent Bernat <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>

Revision tags: v6.7-rc1, v6.6

# 35dfaad7 | 24-Oct-2023 | Daniel Borkmann <[email protected]>
netkit, bpf: Add bpf programmable net device
This work adds a new, minimal BPF-programmable device called "netkit" (former PoC code-name "meta") which we recently presented at LSF/MM/BPF. The core idea is that BPF programs are executed within the driver's xmit routine and therefore, e.g. in the case of containers/Pods, BPF processing moves closer to the source.
One of the goals was that, in the case of Pod egress traffic, this allows moving BPF programs from hostns tcx ingress into the device itself, providing earlier drop or forward mechanisms; for example, if the BPF program determines that the skb must be sent out of the node, then a redirect to the physical device can take place directly without going through the per-CPU backlog queue. This helps shift processing for such traffic from softirq to process context, leading to better scheduling decisions/performance (see measurements in the slides).
In this initial version, the netkit device ships as a pair, but we plan to extend this further so it can also operate in single device mode. The pair comes with a primary and a peer device. Only the primary device, typically residing in hostns, can manage BPF programs for itself and its peer. The peer device is designated for containers/Pods and cannot attach/detach BPF programs. Upon the device creation, the user can set the default policy to 'pass' or 'drop' for the case when no BPF program is attached.
Additionally, the device can be operated in L3 (default) or L2 mode. The management of BPF programs is done via bpf_mprog, so that multi-attach is supported right from the beginning with similar API and dependency controls as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic attach/detach/query API for multi-progs"). tc BPF compatibility is provided, so that existing programs can be easily migrated.
Going forward, we plan to use netkit devices in Cilium as the main device type for connecting Pods. They will be operated in L3 mode in order to simplify a Pod's neighbor management and the peer will operate in default drop mode, so that no traffic is leaving between the time when a Pod is brought up by the CNI plugin and programs attached by the agent. Additionally, the programs we attach via tcx on the physical devices are using bpf_redirect_peer() for inbound traffic into netkit device, hence the latter is also supporting the ndo_get_peer_dev callback. Similarly, we use bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device directly. Also, BIG TCP is supported on netkit device. For the follow-up work in single device mode, we plan to convert Cilium's cilium_host/_net devices into a single one.
An extensive test suite for checking device operations and the BPF program and link management API comes as BPF selftests in this series.
Co-developed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://github.com/borkmann/iproute2/tree/pr/netkit Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.) Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>

# 87cd8371 | 23-Oct-2023 | Florian Fainelli <[email protected]>
net: dsa: Rename IFLA_DSA_MASTER to IFLA_DSA_CONDUIT
This preserves the existing IFLA_DSA_MASTER which is part of the uAPI and creates an alias named IFLA_DSA_CONDUIT.
Reviewed-by: Andrew Lunn <[email protected]> Reviewed-by: Vladimir Oltean <[email protected]> Signed-off-by: Florian Fainelli <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

Revision tags: v6.6-rc7

# ddd1ad68 | 16-Oct-2023 | Johannes Nixdorf <[email protected]>
net: bridge: Add netlink knobs for number / max learned FDB entries
The previous patch added accounting and a limit for the number of dynamically learned FDB entries per bridge. However it did not provide means to actually configure those bounds or read back the count. This patch does that.
Two new netlink attributes are added for the accounting and limit of dynamically learned FDB entries:
- IFLA_BR_FDB_N_LEARNED (RO) for the number of entries accounted for a single bridge.
- IFLA_BR_FDB_MAX_LEARNED (RW) for the configured limit of entries for the bridge.
The new attributes are used like this:
# ip link add name br up type bridge fdb_max_learned 256
# ip link add name v1 up master br type veth peer v2
# ip link set up dev v2
# mausezahn -a rand -c 1024 v2
0.01 seconds (90877 packets per second
# bridge fdb | grep -v permanent | wc -l
256
# ip -d link show dev br
13: br: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 [...]
    [...] fdb_n_learned 256 fdb_max_learned 256
Signed-off-by: Johannes Nixdorf <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

Revision tags: v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2

# 5f184269 | 13-Sep-2023 | Jiri Pirko <[email protected]>
netdev: expose DPLL pin handle for netdevice
In case a netdevice represents a SyncE port, the user needs to understand the connection between the netdevice and the associated DPLL pin. There might be multiple netdevices pointing to the same pin, e.g. in the case of a VF/SF implementation.
Add an IFLA netlink attribute to nest the DPLL pin handle, similar to how it is implemented for devlink port. Add a struct dpll_pin pointer to netdev and protect access to it by RTNL. Expose netdev_dpll_pin_set() and netdev_dpll_pin_clear() helpers to the drivers so they can set/clear the DPLL pin relationship to netdev.
Note that during the lifetime of struct dpll_pin the pin handle does not change. Therefore it is safe to access it locklessly. It is the driver's responsibility to call netdev_dpll_pin_clear() before dpll_pin_put().
Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Arkadiusz Kubalewski <[email protected]> Signed-off-by: Vadim Fedorenko <[email protected]> Signed-off-by: David S. Miller <[email protected]>

Revision tags: v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3

# 29cfb2aa | 17-Jul-2023 | Ido Schimmel <[email protected]>
bridge: Add backup nexthop ID support
Add a new bridge port attribute that allows attaching a nexthop object ID to an skb that is redirected to a backup bridge port with VLAN tunneling enabled.
Specifically, when redirecting a known unicast packet, read the backup nexthop ID from the bridge port that lost its carrier and set it in the bridge control block of the skb before forwarding it via the backup port. Note that reading the ID from the bridge port should not result in a cache miss as the ID is added next to the 'backup_port' field that was already accessed. After this change, the 'state' field still stays on the first cache line, together with other data path related fields such as 'flags' and 'vlgrp':
struct net_bridge_port {
	struct net_bridge *            br;               /*  0  8 */
	struct net_device *            dev;              /*  8  8 */
	netdevice_tracker              dev_tracker;      /* 16  0 */
	struct list_head               list;             /* 16 16 */
	long unsigned int              flags;            /* 32  8 */
	struct net_bridge_vlan_group * vlgrp;            /* 40  8 */
	struct net_bridge_port *       backup_port;      /* 48  8 */
	u32                            backup_nhid;      /* 56  4 */
	u8                             priority;         /* 60  1 */
	u8                             state;            /* 61  1 */
	u16                            port_no;          /* 62  2 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	[...]
} __attribute__((__aligned__(8)));
When forwarding an skb via a bridge port that has VLAN tunneling enabled, check if the backup nexthop ID stored in the bridge control block is valid (i.e., not zero). If so, instead of attaching the pre-allocated metadata (that only has the tunnel key set), allocate a new metadata, set both the tunnel key and the nexthop object ID and attach it to the skb.
By default, do not dump the new attribute to user space as a value of zero is an invalid nexthop object ID.
The above is useful for EVPN multihoming. When one of the links composing an Ethernet Segment (ES) fails, traffic needs to be redirected towards the host via one of the other ES peers. For example, if a host is multihomed to three different VTEPs, the backup port of each ES link needs to be set to the VXLAN device and the backup nexthop ID needs to point to an FDB nexthop group that includes the IP addresses of the other two VTEPs. The VXLAN driver will extract the ID from the metadata of the redirected skb, calculate its flow hash and forward it towards one of the other VTEPs. If the ID does not exist, or represents an invalid nexthop object, the VXLAN driver will drop the skb. This relieves the bridge driver from the need to validate the ID.
Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
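A rough EVPN-multihoming style configuration sketch for the new attribute; backup_port and FDB nexthop groups are existing iproute2 features, while the backup_nhid keyword is assumed to be the bridge(8) counterpart of the new netlink attribute:
# ip nexthop add id 1 via 192.0.2.101 fdb
# ip nexthop add id 2 via 192.0.2.102 fdb
# ip nexthop add id 10 group 1/2 fdb
# bridge link set dev swp1 backup_port vxlan0 backup_nhid 10
Here nexthop group 10 holds the IP addresses of the other two VTEPs that the VXLAN driver can hash over when swp1 loses its carrier.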

Revision tags: v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2

# 69474a8a | 12-May-2023 | Vladimir Nikishkin <[email protected]>
net: vxlan: Add nolocalbypass option to vxlan.
If a packet needs to be encapsulated towards a local destination IP, the packet will undergo a "local bypass" and be injected into the Rx path as if it was received by the target VXLAN device without undergoing encapsulation. If such a device does not exist, the packet will be dropped.
There are scenarios where we do not want to perform such a bypass, but instead want the packet to be encapsulated and locally received by a user space program for post-processing.
To that end, add a new VXLAN device attribute that controls whether a "local bypass" is performed or not. Default to performing a bypass to maintain existing behavior.
Signed-off-by: Vladimir Nikishkin <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
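A minimal sketch; the localbypass/nolocalbypass keywords are assumed to be the iproute2 spelling of the new attribute:
# ip link add vxlan0 type vxlan id 100 local 192.0.2.1 dstport 4789 nolocalbypass
With nolocalbypass set, a packet encapsulated towards a local destination IP stays encapsulated and can be received locally by a user space program for post-processing, instead of being short-circuited into the Rx path of a local VXLAN device.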

Revision tags: v6.4-rc1, v6.3

# 160656d7 | 19-Apr-2023 | Ido Schimmel <[email protected]>
bridge: Allow setting per-{Port, VLAN} neighbor suppression state
Add a new bridge port attribute that allows user space to enable per-{Port, VLAN} neighbor suppression. Example:
# bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
false
# bridge link set dev swp1 neigh_vlan_suppress on
# bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
true
# bridge link set dev swp1 neigh_vlan_suppress off
# bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
false
Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>

Revision tags: v6.3-rc7, v6.3-rc6, v6.3-rc5

# 954d1fa1 | 28-Mar-2023 | Herbert Xu <[email protected]>
macvlan: Add netlink attribute for broadcast cutoff
Make the broadcast cutoff configurable through netlink. Note that macvlan is weird because there is no central device for us to configure (the lowerdev could be anything). So all the options are duplicated over what could be thousands of child devices.
IFLA_MACVLAN_BC_QUEUE_LEN took the approach of taking the maximum of all child device settings. This is unnecessary as we could simply store the option in the port device and take the last child device that gets updated as the value to use.
Signed-off-by: Herbert Xu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
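A creation-time sketch; bclim is, as far as I can tell, the iproute2 keyword that maps to the broadcast cutoff attribute, so treat the exact name as an assumption:
# ip link add link eth0 name macvlan0 type macvlan mode bridge bclim 1000
Since the option lives in the port device, the value last set on any child device is the one that takes effect for the lower device's macvlan port.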

Revision tags: v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7

# a1aee20d | 02-Feb-2023 | Petr Machata <[email protected]>
net: bridge: Add netlink knobs for number / maximum MDB entries
The previous patch added accounting for number of MDB entries per port and per port-VLAN, and the logic to verify that these values stay within configured bounds. However it didn't provide means to actually configure those bounds or read the occupancy. This patch does that.
Two new netlink attributes are added for the MDB occupancy: IFLA_BRPORT_MCAST_N_GROUPS for the per-port occupancy and BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS for the per-port-VLAN occupancy. And another two for the maximum number of MDB entries: IFLA_BRPORT_MCAST_MAX_GROUPS for the per-port maximum, and BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS for the per-port-VLAN one.
Note that the two new IFLA_BRPORT_ attributes prompt bumping of RTNL_SLAVE_MAX_TYPE to size the slave attribute tables large enough.
The new attributes are used like this:
# ip link add name br up type bridge vlan_filtering 1 mcast_snooping 1 \
      mcast_vlan_snooping 1 mcast_querier 1
# ip link set dev v1 master br
# bridge vlan add dev v1 vid 2

# bridge vlan set dev v1 vid 1 mcast_max_groups 1
# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
# bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.

# bridge link set dev v1 mcast_max_groups 1
# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 2
Error: bridge: Port is already in 1 groups, and mcast_max_groups=1.

# bridge -d link show
5: v1@v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br [...]
    [...] mcast_n_groups 1 mcast_max_groups 1

# bridge -d vlan show
port    vlan-id
br      1 PVID Egress Untagged
          state forwarding mcast_router 1
v1      1 PVID Egress Untagged [...]
          mcast_n_groups 1 mcast_max_groups 1
        2 [...]
          mcast_n_groups 0 mcast_max_groups 0
Signed-off-by: Petr Machata <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Reviewed-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>

Revision tags: v6.2-rc6

# 9eefedd5 | 28-Jan-2023 | Xin Long <[email protected]>
net: add gso_ipv4_max_size and gro_ipv4_max_size per device
This patch introduces gso_ipv4_max_size and gro_ipv4_max_size per device and adds netlink attributes for them, so that IPV4 BIG TCP can be guarded by a separate tunable in the next patch.
To avoid breaking old applications that use "gso/gro_max_size" for IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size" in netif_set_gso/gro_max_size() if the new size isn't greater than GSO_LEGACY_MAX_SIZE, so that nothing changes even if userspace is unaware of the new netlink attributes.
Signed-off-by: Xin Long <[email protected]> Reviewed-by: David Ahern <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
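For example, once the separate IPv4 tunable from the next patch is in place, the limits can be raised per device (the values below are purely illustrative; the keywords follow the existing gso_max_size/gro_max_size naming):
# ip link set dev eth0 gso_ipv4_max_size 131072 gro_ipv4_max_size 131072
# ip -d link show dev eth0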

Revision tags: v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4

# dca56c30 | 02-Nov-2022 | Jiri Pirko <[email protected]>
net: expose devlink port over rtnetlink
Expose devlink port handle related to netdev over rtnetlink. Introduce a new nested IFLA attribute to carry the info. Call into devlink code to fill-up the nest with existing devlink attributes that are used over devlink netlink.
Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>

# a35ec8e3 | 01-Nov-2022 | Hans J. Schultz <[email protected]>
bridge: Add MAC Authentication Bypass (MAB) support
Hosts that support 802.1X authentication are able to authenticate themselves by exchanging EAPOL frames with an authenticator (Ethernet bridge, in this case) and an authentication server. Access to the network is only granted by the authenticator to successfully authenticated hosts.
The above is implemented in the bridge using the "locked" bridge port option. When enabled, link-local frames (e.g., EAPOL) can be locally received by the bridge, but all other frames are dropped unless the host is authenticated. That is, unless the user space control plane installed an FDB entry according to which the source address of the frame is located behind the locked ingress port. The entry can be dynamic, in which case learning needs to be enabled so that the entry will be refreshed by incoming traffic.
There are deployments in which not all the devices connected to the authenticator (the bridge) support 802.1X. Such devices can include printers and cameras. One option to support such deployments is to unlock the bridge ports connecting these devices, but a slightly more secure option is to use MAB. When MAB is enabled, the MAC address of the connected device is used as the user name and password for the authentication.
For MAB to work, the user space control plane needs to be notified about MAC addresses that are trying to gain access so that they will be compared against an allow list. This can be implemented via the regular learning process with the sole difference that learned FDB entries are installed with a new "locked" flag indicating that the entry cannot be used to authenticate the device. The flag cannot be set by user space, but user space can clear the flag by replacing the entry, thereby authenticating the device.
Locked FDB entries implement the following semantics with regards to roaming, aging and forwarding:
1. Roaming: Locked FDB entries can roam to unlocked (authorized) ports, in which case the "locked" flag is cleared. FDB entries cannot roam to locked ports regardless of MAB being enabled or not. Therefore, locked FDB entries are only created if an FDB entry with the given {MAC, VID} does not already exist. This behavior prevents unauthenticated devices from disrupting traffic destined to already authenticated devices.
2. Aging: Locked FDB entries age and refresh by incoming traffic like regular entries.
3. Forwarding: Locked FDB entries forward traffic like regular entries. If user space detects an unauthorized MAC behind a locked port and wishes to prevent traffic with this MAC DA from reaching the host, it can do so using tc or a different mechanism.
Enable the above behavior using a new bridge port option called "mab". It can only be enabled on a bridge port that is both locked and has learning enabled. Locked FDB entries are flushed from the port once MAB is disabled. A new option is added because there are pure 802.1X deployments that are not interested in notifications about locked FDB entries.
Signed-off-by: Hans J. Schultz <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Reviewed-by: Vladimir Oltean <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
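A small enablement sketch; mab only works on a port that is both locked and has learning enabled, which the bridge(8) command below reflects (the mab keyword is the iproute2 spelling of the new option, to the best of my knowledge):
# bridge link set dev swp1 learning on locked on mab on
User space can then watch for FDB entries carrying the new locked flag and replace an entry to authenticate the corresponding device.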