aboutsummaryrefslogtreecommitdiffstats
path: root/net/ipv6
AgeCommit message (Collapse)AuthorFilesLines
2025-11-25tcp: introduce icsk->icsk_keepalive_timerEric Dumazet1-2/+2
sk->sk_timer has been used for TCP keepalives. Keepalive timers are not in fast path, we want to use sk->sk_timer storage for retransmit timers, for better cache locality. Create icsk->icsk_keepalive_timer and change keepalive code to no longer use sk->sk_timer. Added space is reclaimed in the following patch. This includes changes to MPTCP, which was also using sk_timer. Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer for inet_sk_diag_fill() sake. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25tcp: rename icsk_timeout() to tcp_timeout_expires()Eric Dumazet1-2/+2
In preparation of sk->tcp_timeout_timer introduction, rename icsk_timeout() helper and change its argument to plain 'const struct sock *sk'. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+4
Cross-merge networking fixes after downstream PR (net-6.18-rc7). No conflicts, adjacent changes: tools/testing/selftests/net/af_unix/Makefile e1bb28bf13f4 ("selftest: af_unix: Add test for SO_PEEK_OFF.") 45a1cd8346ca ("selftests: af_unix: Add tests for ECONNRESET and EOF semantics") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-18ipv6: clear RA flags when adding a static routeFernando Fernandez Mancera1-0/+4
When an IPv6 Router Advertisement (RA) is received for a prefix, the kernel creates the corresponding on-link route with flags RTF_ADDRCONF and RTF_PREFIX_RT configured and RTF_EXPIRES if lifetime is set. If later a user configures a static IPv6 address on the same prefix the kernel clears the RTF_EXPIRES flag but it doesn't clear the RTF_ADDRCONF and RTF_PREFIX_RT. When the next RA for that prefix is received, the kernel sees the route as RA-learned and wrongly configures back the lifetime. This is problematic because if the route expires, the static address won't have the corresponding on-link route. This fix clears the RTF_ADDRCONF and RTF_PREFIX_RT flags preventing that the lifetime is configured when the next RA arrives. If the static address is deleted, the route becomes RA-learned again. Fixes: 14ef37b6d00e ("ipv6: fix route lookup in addrconf_prefix_rcv()") Reported-by: Garri Djavadyan <g.djavadyan@gmail.com> Closes: https://lore.kernel.org/netdev/ba807d39aca5b4dcf395cc11dca61a130a52cfd3.camel@gmail.com/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20251115095939.6967-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-14ipv6: clean up routes when manually removing address with a lifetimeJakub Kicinski1-1/+1
When an IPv6 address with a finite lifetime (configured with valid_lft and preferred_lft) is manually deleted, the kernel does not clean up the associated prefix route. This results in orphaned routes (marked "proto kernel") remaining in the routing table even after their corresponding address has been deleted. This is particularly problematic on networks using combination of SLAAC and bridges. 1. Machine comes up and performs RA on eth0. 2. User creates a bridge - does an ip -6 addr flush dev eth0; - adds the eth0 under the bridge. 3. SLAAC happens on br0. Even tho the address has "moved" to br0 there will still be a route pointing to eth0, but eth0 is not usable for IP any more. Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20251113031700.3736285-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07tcp: Call tcp_syn_ack_timeout() directly.Kuniyuki Iwashima1-1/+0
Since DCCP has been removed, we do not need to use request_sock_ops.syn_ack_timeout(). Let's call tcp_syn_ack_timeout() directly. Now other function pointers of request_sock_ops are protocol-dependent. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251106003357.273403-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: Convert proto callbacks from sockaddr to sockaddr_unsizedKees Cook6-14/+16
Convert struct proto pre_connect(), connect(), bind(), and bind_add() callback function prototypes from struct sockaddr to struct sockaddr_unsized. This does not change per-implementation use of sockaddr for passing around an arbitrarily sized sockaddr struct. Those will be addressed in future patches. Additionally removes the no longer referenced struct sockaddr from include/net/inet_common.h. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-5-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: Convert proto_ops connect() callbacks to use sockaddr_unsizedKees Cook1-1/+1
Update all struct proto_ops connect() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: Convert proto_ops bind() callbacks to use sockaddr_unsizedKees Cook2-3/+3
Update all struct proto_ops bind() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-30xfrm: Determine inner GSO type from packet inner protocolJianbo Liu1-2/+4
The GSO segmentation functions for ESP tunnel mode (xfrm4_tunnel_gso_segment and xfrm6_tunnel_gso_segment) were determining the inner packet's L2 protocol type by checking the static x->inner_mode.family field from the xfrm state. This is unreliable. In tunnel mode, the state's actual inner family could be defined by x->inner_mode.family or by x->inner_mode_iaf.family. Checking only the former can lead to a mismatch with the actual packet being processed, causing GSO to create segments with the wrong L2 header type. This patch fixes the bug by deriving the inner mode directly from the packet's inner protocol stored in XFRM_MODE_SKB_CB(skb)->protocol. Instead of replicating the code, this patch modifies the xfrm_ip2inner_mode helper function. It now correctly returns &x->inner_mode if the selector family (x->sel.family) is already specified, thereby handling both specific and AF_UNSPEC cases appropriately. With this change, ESP GSO can use xfrm_ip2inner_mode to get the correct inner mode. It doesn't affect existing callers, as the updated logic now mirrors the checks they were already performing externally. Fixes: 26dbd66eab80 ("esp: choose the correct inner protocol for GSO on inter address family tunnels") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-10-29ipv6: icmp: Add RFC 5837 supportIdo Schimmel2-2/+213
Add the ability to append the incoming IP interface information to ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is required for more meaningful traceroute results in unnumbered networks. The feature is disabled by default and controlled via a new sysctl ("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP extensions to append to ICMP error messages. Currently, only a single value is supported, but the interface and the implementation should be able to support more extensions, if needed. Clone the skb and copy the relevant data portions before modifying the skb as the caller of icmp6_send() still owns the skb after the function returns. This should be fine since by default ICMP error messages are rate limited to 1000 per second and no more than 1 per second per specific host. Trim or pad the packet to 128 bytes before appending the ICMP extension structure in order to be compatible with legacy applications that assume that the ICMP extension structure always starts at this offset (the minimum length specified by RFC 4884). Since commit 20e1954fe238 ("ipv6: RFC 4884 partial support for SIT/GRE tunnels") it is possible for icmp6_send() to be called with an skb that already contains ICMP extensions. This can happen when we receive an ICMPv4 message with extensions from a tunnel and translate it to an ICMPv6 message towards an IPv6 host in the overlay network. I could not find an RFC that supports this behavior, but it makes sense to not overwrite the original extensions that were appended to the packet. Therefore, avoid appending extensions if the length field in the provided ICMPv6 header is already filled. Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it available to IPv6 when it is built as a module. Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251027082232.232571-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-24neighbour: Annotate access to neigh_parms fields.Kuniyuki Iwashima1-4/+4
NEIGH_VAR() is read locklessly in the fast path, and IPv6 ndisc uses NEIGH_VAR_SET() locklessly. The next patch will convert neightbl_dump_info() to RCU. Let's annotate accesses to neigh_param with READ_ONCE() and WRITE_ONCE(). Note that ndisc_ifinfo_sysctl_change() uses &NEIGH_VAR() and we cannot use '&' with READ_ONCE(), so NEIGH_VAR_PTR() is introduced. Note also that NEIGH_VAR_INIT() does not need WRITE_ONCE() as it is before parms is published. Also, the only user hippi_neigh_setup_dev() is no longer called since commit e3804cbebb67 ("net: remove COMPAT_NET_DEV_OPS"), which looks wrong, but probably no one uses HIPPI and RoadRunner. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251022054004.2514876-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17tcp: Convert tcp-md5 to use MD5 library instead of crypto_ahashEric Biggers1-82/+37
Make tcp-md5 use the MD5 library API (added in 6.18) instead of the crypto_ahash API. This is much simpler and also more efficient: - The library API just operates on struct md5_ctx. Just allocate this struct on the stack instead of using a pool of pre-allocated crypto_ahash and ahash_request objects. - The library API accepts standard pointers and doesn't require scatterlists. So, for hashing the headers just use an on-stack buffer instead of a pool of pre-allocated kmalloc'ed scratch buffers. - The library API never fails. Therefore, checking for MD5 hashing errors is no longer necessary. Update tcp_v4_md5_hash_skb(), tcp_v6_md5_hash_skb(), tcp_v4_md5_hash_hdr(), tcp_v6_md5_hash_hdr(), tcp_md5_hash_key(), tcp_sock_af_ops::calc_md5_hash, and tcp_request_sock_ops::calc_md5_hash to return void instead of int. - The library API provides direct access to the MD5 code, eliminating unnecessary overhead such as indirect function calls and scatterlist management. Microbenchmarks of tcp_v4_md5_hash_skb() on x86_64 show a speedup from 7518 to 7041 cycles (6% fewer) with skb->len == 1440, or from 1020 to 678 cycles (33% fewer) with skb->len == 140. Since tcp_sigpool_hash_skb_data() can no longer be used, add a function tcp_md5_hash_skb_data() which is specialized to MD5. Of course, to the extent that this duplicates any code, it's well worth it. To preserve the existing behavior of TCP-MD5 support being disabled when the kernel is booted with "fips=1", make tcp_md5_do_add() check fips_enabled itself. Previously it relied on the error from crypto_alloc_ahash("md5") being bubbled up. I don't know for sure that this is actually needed, but this preserves the existing behavior. Tested with bidirectional TCP-MD5, both IPv4 and IPv6, between a kernel that includes this commit and a kernel that doesn't include this commit. (Side note: please don't use TCP-MD5! It's cryptographically weak. But as long as Linux supports it, it might as well be implemented properly.) Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20251014215836.115616-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-17ipv6: Move ipv6_fl_list from ipv6_pinfo to inet_sock.Kuniyuki Iwashima2-29/+28
In {tcp6,udp6,raw6}_sock, struct ipv6_pinfo is always placed at the beginning of a new cache line because 1. __alignof__(struct tcp_sock) is 64 due to ____cacheline_aligned of __cacheline_group_begin(tcp_sock_write_tx) 2. __alignof__(struct udp_sock) is 64 due to ____cacheline_aligned of struct numa_drop_counters 3. in raw6_sock, struct numa_drop_counters is placed before struct ipv6_pinfo . struct ipv6_pinfo is 136 bytes, but the last cache line is only used by ipv6_fl_list: $ pahole -C ipv6_pinfo vmlinux struct ipv6_pinfo { ... /* --- cacheline 2 boundary (128 bytes) --- */ struct ipv6_fl_socklist * ipv6_fl_list; /* 128 8 */ /* size: 136, cachelines: 3, members: 23 */ Let's move ipv6_fl_list from struct ipv6_pinfo to struct inet_sock to save a full cache line for {tcp6,udp6,raw6}_sock. Now, struct ipv6_pinfo is 128 bytes, and {tcp6,udp6,raw6}_sock have 64 bytes less, while {tcp,udp,raw}_sock retain the same size. Before: # grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}' RAWv6 1408 UDPv6 1472 TCPv6 2560 RAW 1152 UDP 1280 TCP 2368 After: # grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}' RAWv6 1344 UDPv6 1408 TCPv6 2496 RAW 1152 UDP 1280 TCP 2368 Also, ipv6_fl_list and inet_flags (SNDFLOW bit) are placed in the same cache line. $ pahole -C inet_sock vmlinux ... /* --- cacheline 11 boundary (704 bytes) was 56 bytes ago --- */ struct ipv6_pinfo * pinet6; /* 760 8 */ /* --- cacheline 12 boundary (768 bytes) --- */ struct ipv6_fl_socklist * ipv6_fl_list; /* 768 8 */ unsigned long inet_flags; /* 776 8 */ Doc churn is due to the insufficient Type column (only 1 space short). Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251014224210.2964778-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-13net/ip6_tunnel: Prevent perpetual tunnel growthDmitry Safonov1-2/+1
Similarly to ipv4 tunnel, ipv6 version updates dev->needed_headroom, too. While ipv4 tunnel headroom adjustment growth was limited in commit 5ae1e9922bbd ("net: ip_tunnel: prevent perpetual headroom growth"), ipv6 tunnel yet increases the headroom without any ceiling. Reflect ipv4 tunnel headroom adjustment limit on ipv6 version. Credits to Francesco Ruggeri, who was originally debugging this issue and wrote local Arista-specific patch and a reproducer. Fixes: 8eb30be0352d ("ipv6: Create ip6_tnl_xmit") Cc: Florian Westphal <fw@strlen.de> Cc: Francesco Ruggeri <fruggeri05@gmail.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Link: https://patch.msgid.link/20251009-ip6_tunnel-headroom-v2-1-8e4dbd8f7e35@arista.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-03net: psp: don't assume reply skbs will have a socketJakub Kicinski1-1/+1
Rx path may be passing around unreferenced sockets, which means that skb_set_owner_edemux() may not set skb->sk and PSP will crash: KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] RIP: 0010:psp_reply_set_decrypted (./include/net/psp/functions.h:132 net/psp/psp_sock.c:287) tcp_v6_send_response.constprop.0 (net/ipv6/tcp_ipv6.c:979) tcp_v6_send_reset (net/ipv6/tcp_ipv6.c:1140 (discriminator 1)) tcp_v6_do_rcv (net/ipv6/tcp_ipv6.c:1683) tcp_v6_rcv (net/ipv6/tcp_ipv6.c:1912) Fixes: 659a2899a57d ("tcp: add datapath logic for PSP with inline key exchange") Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251001022426.2592750-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-26Merge tag 'ipsec-next-2025-09-26' of ↵Jakub Kicinski1-19/+31
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2025-09-26 1) Fix field-spanning memcpy warning in AH output. From Charalampos Mitrodimas. 2) Replace the strcpy() calls for alg_name by strscpy(). From Miguel García. * tag 'ipsec-next-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: xfrm_user: use strscpy() for alg_name net: ipv6: fix field-spanning memcpy warning in AH output ==================== Link: https://patch.msgid.link/20250926053025.2242061-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-25net: gro: remove is_ipv6 from napi_gro_cbRichard Gobert1-2/+0
Remove is_ipv6 from napi_gro_cb and use sk->sk_family instead. This frees up space for another ip_fixedid bit that will be added in the next commit. udp_sock_create always creates either a AF_INET or a AF_INET6 socket, so using sk->sk_family is reliable. In IPv6-FOU, cfg->ipv6_v6only is always enabled. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250923085908.4687-2-richardbgobert@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-23udp: remove busylock and add per NUMA queuesEric Dumazet1-2/+3
busylock was protecting UDP sockets against packet floods, but unfortunately was not protecting the host itself. Under stress, many cpus could spin while acquiring the busylock, and NIC had to drop packets. Or packets would be dropped in cpu backlog if RPS/RFS were in place. This patch replaces the busylock by intermediate lockless queues. (One queue per NUMA node). This means that fewer number of cpus have to acquire the UDP receive queue lock. Most of the cpus can either: - immediately drop the packet. - or queue it in their NUMA aware lockless queue. Then one of the cpu is chosen to process this lockless queue in a batch. The batch only contains packets that were cooked on the same NUMA node, thus with very limited latency impact. Tested: DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes (Intel(R) Xeon(R) 6985P-C) Before: nstat -n ; sleep 1 ; nstat | grep Udp Udp6InDatagrams 1004179 0.0 Udp6InErrors 3117 0.0 Udp6RcvbufErrors 3117 0.0 After: nstat -n ; sleep 1 ; nstat | grep Udp Udp6InDatagrams 1116633 0.0 Udp6InErrors 14197275 0.0 Udp6RcvbufErrors 14197275 0.0 We can see this host can now proces 14.2 M more packets per second while under attack, and the victim socket can receive 11 % more packets. I used a small bpftrace program measuring time (in us) spent in __udp_enqueue_schedule_skb(). Before: @udp_enqueue_us[398]: [0] 24901 |@@@ | [1] 63512 |@@@@@@@@@ | [2, 4) 344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [4, 8) 244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [8, 16) 54022 |@@@@@@@@ | [16, 32) 222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32, 64) 232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [64, 128) 4219 | | [128, 256) 188 | | After: @udp_enqueue_us[398]: [0] 5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [1] 1111277 |@@@@@@@@@@ | [2, 4) 501439 |@@@@ | [4, 8) 102921 | | [8, 16) 29895 | | [16, 32) 43500 | | [32, 64) 31552 | | [64, 128) 979 | | [128, 256) 13 | | Note that the remaining bottleneck for this platform is in udp_drops_inc() because we limited struct numa_drop_counters to only two nodes so far. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250922104240.2182559-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22tcp: Remove inet6_hash().Kuniyuki Iwashima2-12/+1
inet_hash() and inet6_hash() are exactly the same. Also, we do not need to export inet6_hash(). Let's consolidate the two into __inet_hash() and rename it to inet_hash(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250919083706.1863217-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-22tcp: Remove osk from __inet_hash() arg.Kuniyuki Iwashima1-1/+1
__inet_hash() is called from inet_hash() and inet6_hash with osk NULL. Let's remove the 2nd arg from __inet_hash(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250919083706.1863217-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18Merge branch 'add-basic-psp-encryption-for-tcp-connections'Paolo Abeni2-4/+19
Daniel Zahka says: ================== add basic PSP encryption for TCP connections This is v13 of the PSP RFC [1] posted by Jakub Kicinski one year ago. General developments since v1 include a fork of packetdrill [2] with support for PSP added, as well as some test cases, and an implementation of PSP key exchange and connection upgrade [3] integrated into the fbthrift RPC library. Both [2] and [3] have been tested on server platforms with PSP-capable CX7 NICs. Below is the cover letter from the original RFC: Add support for PSP encryption of TCP connections. PSP is a protocol out of Google: https://github.com/google/psp/blob/main/doc/PSP_Arch_Spec.pdf which shares some similarities with IPsec. I added some more info in the first patch so I'll keep it short here. The protocol can work in multiple modes including tunneling. But I'm mostly interested in using it as TLS replacement because of its superior offload characteristics. So this patch does three things: - it adds "core" PSP code PSP is offload-centric, and requires some additional care and feeding, so first chunk of the code exposes device info. This part can be reused by PSP implementations in xfrm, tunneling etc. - TCP integration TLS style Reuse some of the existing concepts from TLS offload, such as attaching crypto state to a socket, marking skbs as "decrypted", egress validation. PSP does not prescribe key exchange protocols. To use PSP as a more efficient TLS offload we intend to perform a TLS handshake ("inline" in the same TCP connection) and negotiate switching to PSP based on capabilities of both endpoints. This is also why I'm not including a software implementation. Nobody would use it in production, software TLS is faster, it has larger crypto records. - mlx5 implementation That's mostly other people's work, not 100% sure those folks consider it ready hence the RFC in the title. But it works :) Not posted, queued a branch [4] are follow up pieces: - standard stats - netdevsim implementation and tests [1] https://lore.kernel.org/netdev/20240510030435.120935-1-kuba@kernel.org/ [2] https://github.com/danieldzahka/packetdrill [3] https://github.com/danieldzahka/fbthrift/tree/dzahka/psp [4] https://github.com/kuba-moo/linux/tree/psp Comments we intend to defer to future series: - we prefer to keep the version field in the tx-assoc netlink request, because it makes parsing keys require less state early on, but we are willing to change in the next version of this series. - using a static branch to wrap psp_enqueue_set_decrypted() and other functions called from tcp. - using INDIRECT_CALL for tls/psp in sk_validate_xmit_skb(). We prefer to address this in a dedicated patch series, so that this series does not need to modify the way tls_validate_xmit_skb() is declared and stubbed out. v12: https://lore.kernel.org/netdev/20250916000559.1320151-1-kuba@kernel.org/ v11: https://lore.kernel.org/20250911014735.118695-1-daniel.zahka@gmail.com v10: https://lore.kernel.org/netdev/20250828162953.2707727-1-daniel.zahka@gmail.com/ v9: https://lore.kernel.org/netdev/20250827155340.2738246-1-daniel.zahka@gmail.com/ v8: https://lore.kernel.org/netdev/20250825200112.1750547-1-daniel.zahka@gmail.com/ v7: https://lore.kernel.org/netdev/20250820113120.992829-1-daniel.zahka@gmail.com/ v6: https://lore.kernel.org/netdev/20250812003009.2455540-1-daniel.zahka@gmail.com/ v5: https://lore.kernel.org/netdev/20250723203454.519540-1-daniel.zahka@gmail.com/ v4: https://lore.kernel.org/netdev/20250716144551.3646755-1-daniel.zahka@gmail.com/ v3: https://lore.kernel.org/netdev/20250702171326.3265825-1-daniel.zahka@gmail.com/ v2: https://lore.kernel.org/netdev/20250625135210.2975231-1-daniel.zahka@gmail.com/ v1: https://lore.kernel.org/netdev/20240510030435.120935-1-kuba@kernel.org/ ================== Links: https://patch.msgid.link/20250917000954.859376-1-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- * add-basic-psp-encryption-for-tcp-connections: net/mlx5e: Implement PSP key_rotate operation net/mlx5e: Add Rx data path offload psp: provide decapsulation and receive helper for drivers net/mlx5e: Configure PSP Rx flow steering rules net/mlx5e: Add PSP steering in local NIC RX net/mlx5e: Implement PSP Tx data path psp: provide encapsulation helper for drivers net/mlx5e: Implement PSP operations .assoc_add and .assoc_del net/mlx5e: Support PSP offload functionality psp: track generations of device key net: psp: update the TCP MSS to reflect PSP packet overhead net: psp: add socket security association code net: tcp: allow tcp_timewait_sock to validate skbs before handing to device net: move sk_validate_xmit_skb() to net/core/dev.c psp: add op for rotation of device key tcp: add datapath logic for PSP with inline key exchange net: modify core data structures for PSP datapath support psp: base PSP device support psp: add documentation
2025-09-18net: psp: update the TCP MSS to reflect PSP packet overheadJakub Kicinski2-4/+8
PSP eats 40B of header space. Adjust MSS appropriately. We can either modify tcp_mtu_to_mss() / tcp_mss_to_mtu() or reuse icsk_ext_hdr_len. The former option is more TCP specific and has runtime overhead. The latter is a bit of a hack as PSP is not an ext_hdr. If one squints hard enough, UDP encap is just a more practical version of IPv6 exthdr, so go with the latter. Happy to change. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-10-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18tcp: add datapath logic for PSP with inline key exchangeJakub Kicinski1-0/+11
Add validation points and state propagation to support PSP key exchange inline, on TCP connections. The expectation is that application will use some well established mechanism like TLS handshake to establish a secure channel over the connection and if both endpoints are PSP-capable - exchange and install PSP keys. Because the connection can existing in PSP-unsecured and PSP-secured state we need to make sure that there are no race conditions or retransmission leaks. On Tx - mark packets with the skb->decrypted bit when PSP key is at the enqueue time. Drivers should only encrypt packets with this bit set. This prevents retransmissions getting encrypted when original transmission was not. Similarly to TLS, we'll use sk->sk_validate_xmit_skb to make sure PSP skbs can't "escape" via a PSP-unaware device without being encrypted. On Rx - validation is done under socket lock. This moves the validation point later than xfrm, for example. Please see the documentation patch for more details on the flow of securing a connection, but for the purpose of this patch what's important is that we want to enforce the invariant that once connection is secured any skb in the receive queue has been encrypted with PSP. Add GRO and coalescing checks to prevent PSP authenticated data from being combined with cleartext data, or data with non-matching PSP state. On Rx, check skb's with psp_skb_coalesce_diff() at points before psp_sk_rx_policy_check(). After skb's are policy checked and on the socket receive queue, skb_cmp_decrypted() is sufficient for checking for coalescable PSP state. On Tx, tcp_write_collapse_fence() should be called when transitioning a socket into PSP Tx state to prevent data sent as cleartext from being coalesced with PSP encapsulated data. This change only adds the validation points, for ease of review. Subsequent change will add the ability to install keys, and flesh the enforcement logic out Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-5-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18udp: add udp_drops_inc() helperEric Dumazet1-3/+3
Generic sk_drops_inc() reads sk->sk_drop_counters. We know the precise location for UDP sockets. Move sk_drop_counters out of sock_read_rxtx so that sock_write_rxtx starts at a cache line boundary. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-9-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18ipv6: np->rxpmtu race annotationEric Dumazet2-2/+2
Add READ_ONCE() annotations because np->rxpmtu can be changed while udpv6_recvmsg() and rawv6_recvmsg() read it. Since this is a very rarely used feature, and that udpv6_recvmsg() and rawv6_recvmsg() read np->rxopt anyway, change the test order so that np->rxpmtu does not need to be in a hot cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-4-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18ipv6: make ipv6_pinfo.daddr_cache a booleanEric Dumazet5-7/+7
ipv6_pinfo.daddr_cache is either NULL or &sk->sk_v6_daddr We do not need 8 bytes, a boolean is enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-3-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18ipv6: make ipv6_pinfo.saddr_cache a booleanEric Dumazet5-7/+8
ipv6_pinfo.saddr_cache is either NULL or &np->saddr. We do not need 8 bytes, a boolean is enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-2-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18tcp: accecn: AccECN negotiationIlpo Järvinen2-0/+3
Accurate ECN negotiation parts based on the specification: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt Accurate ECN is negotiated using ECE, CWR and AE flags in the TCP header. TCP falls back into using RFC3168 ECN if one of the ends supports only RFC3168-style ECN. The AccECN negotiation includes reflecting IP ECN field value seen in SYN and SYNACK back using the same bits as negotiation to allow responding to SYN CE marks and to detect ECN field mangling. CE marks should not occur currently because SYN=1 segments are sent with Non-ECT in IP ECN field (but proposal exists to remove this restriction). Reflecting SYN IP ECN field in SYNACK is relatively simple. Reflecting SYNACK IP ECN field in the final/third ACK of the handshake is more challenging. Linux TCP code is not well prepared for using the final/third ACK a signalling channel which makes things somewhat complicated here. tcp_ecn sysctl can be used to select the highest ECN variant (Accurate ECN, ECN, No ECN) that is attemped to be negotiated and requested for incoming connection and outgoing connection: TCP_ECN_IN_NOECN_OUT_NOECN, TCP_ECN_IN_ECN_OUT_ECN, TCP_ECN_IN_ECN_OUT_NOECN, TCP_ECN_IN_ACCECN_OUT_ACCECN, TCP_ECN_IN_ACCECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_NOECN. After this patch, the size of tcp_request_sock remains unchanged and no new holes are added. Below are the pahole outcomes before and after this patch: [BEFORE THIS PATCH] struct tcp_request_sock { [...] u32 rcv_nxt; /* 352 4 */ u8 syn_tos; /* 356 1 */ /* size: 360, cachelines: 6, members: 16 */ } [AFTER THIS PATCH] struct tcp_request_sock { [...] u32 rcv_nxt; /* 352 4 */ u8 syn_tos; /* 356 1 */ bool accecn_ok; /* 357 1 */ u8 syn_ect_snt:2; /* 358: 0 1 */ u8 syn_ect_rcv:2; /* 358: 2 1 */ u8 accecn_fail_mode:4; /* 358: 4 1 */ /* size: 360, cachelines: 6, members: 20 */ } After this patch, the size of tcp_sock remains unchanged and no new holes are added. Also, 4 bits of the existing 2-byte hole are exploited. Below are the pahole outcomes before and after this patch: [BEFORE THIS PATCH] struct tcp_sock { [...] u8 dup_ack_counter:2; /* 2761: 0 1 */ u8 tlp_retrans:1; /* 2761: 2 1 */ u8 unused:5; /* 2761: 3 1 */ u8 thin_lto:1; /* 2762: 0 1 */ u8 fastopen_connect:1; /* 2762: 1 1 */ u8 fastopen_no_cookie:1; /* 2762: 2 1 */ u8 fastopen_client_fail:2; /* 2762: 3 1 */ u8 frto:1; /* 2762: 5 1 */ /* XXX 2 bits hole, try to pack */ [...] u8 keepalive_probes; /* 2765 1 */ /* XXX 2 bytes hole, try to pack */ [...] /* size: 3200, cachelines: 50, members: 164 */ } [AFTER THIS PATCH] struct tcp_sock { [...] u8 dup_ack_counter:2; /* 2761: 0 1 */ u8 tlp_retrans:1; /* 2761: 2 1 */ u8 syn_ect_snt:2; /* 2761: 3 1 */ u8 syn_ect_rcv:2; /* 2761: 5 1 */ u8 thin_lto:1; /* 2761: 7 1 */ u8 fastopen_connect:1; /* 2762: 0 1 */ u8 fastopen_no_cookie:1; /* 2762: 1 1 */ u8 fastopen_client_fail:2; /* 2762: 2 1 */ u8 frto:1; /* 2762: 4 1 */ /* XXX 3 bits hole, try to pack */ [...] u8 keepalive_probes; /* 2765 1 */ u8 accecn_fail_mode:4; /* 2766: 0 1 */ /* XXX 4 bits hole, try to pack */ /* XXX 1 byte hole, try to pack */ [...] /* size: 3200, cachelines: 50, members: 166 */ } Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-3-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-12Merge tag 'nf-next-25-09-11' of ↵Jakub Kicinski1-0/+30
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1) Don't respond to ICMP_UNREACH errors with another ICMP_UNREACH error. 2) Support fetching the current bridge ethernet address. This allows a more flexible approach to packet redirection on bridges without need to use hardcoded addresses. From Fernando Fernandez Mancera. 3) Zap a few no-longer needed conditionals from ipvs packet path and convert to READ/WRITE_ONCE to avoid KCSAN warnings. From Zhang Tengfei. 4) Remove a no-longer-used macro argument in ipset, from Zhen Ni. * tag 'nf-next-25-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_reject: don't reply to icmp error messages ipvs: Use READ_ONCE/WRITE_ONCE for ipvs->enable netfilter: nft_meta_bridge: introduce NFT_META_BRI_IIFHWADDR support netfilter: ipset: Remove unused htable_bits in macro ahash_region selftest:net: fixed spelling mistakes ==================== Link: https://patch.msgid.link/20250911143819.14753-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11tcp: Destroy TCP-AO, TCP-MD5 keys in .sk_destruct()Dmitry Safonov1-0/+8
Currently there are a couple of minor issues with destroying the keys tcp_v4_destroy_sock(): 1. The socket is yet in TCP bind buckets, making it reachable for incoming segments [on another CPU core], potentially available to send late FIN/ACK/RST replies. 2. There is at least one code path, where tcp_done() is called before sending RST [kudos to Bob for investigation]. This is a case of a server, that finished sending its data and just called close(). The socket is in TCP_FIN_WAIT2 and has RCV_SHUTDOWN (set by __tcp_close()) tcp_v4_do_rcv()/tcp_v6_do_rcv() tcp_rcv_state_process() /* LINUX_MIB_TCPABORTONDATA */ tcp_reset() tcp_done_with_error() tcp_done() inet_csk_destroy_sock() /* Destroys AO/MD5 keys */ /* tcp_rcv_state_process() returns SKB_DROP_REASON_TCP_ABORT_ON_DATA */ tcp_v4_send_reset() /* Sends an unsigned RST segment */ tcpdump: > 22:53:15.399377 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 33929, offset 0, flags [DF], proto TCP (6), length 60) > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [F.], seq 2185658590, ack 3969644355, win 502, options [nop,nop,md5 valid], length 0 > 22:53:15.399396 00:00:01:01:00:00 > 00:00:b2:1f:00:00, ethertype IPv4 (0x0800), length 86: (tos 0x0, ttl 64, id 51951, offset 0, flags [DF], proto TCP (6), length 72) > 1.0.0.2.49848 > 1.0.0.1.34567: Flags [.], seq 3969644375, ack 2185658591, win 128, options [nop,nop,md5 valid,nop,nop,sack 1 {2185658590:2185658591}], length 0 > 22:53:16.429588 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40) > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658590, win 0, length 0 > 22:53:16.664725 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60) > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0 > 22:53:17.289832 00:00:b2:1f:00:00 > 00:00:01:01:00:00, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60) > 1.0.0.1.34567 > 1.0.0.2.49848: Flags [R], seq 2185658591, win 0, options [nop,nop,md5 valid], length 0 Note the signed RSTs later in the dump - those are sent by the server when the fin-wait socket gets removed from hash buckets, by the listener socket. Instead of destroying AO/MD5 info and their keys in inet_csk_destroy_sock(), slightly delay it until the actual socket .sk_destruct(). As shutdown'ed socket can yet send non-data replies, they should be signed in order for the peer to process them. Now it also matches how AO/MD5 gets destructed for TIME-WAIT sockets (in tcp_twsk_destructor()). This seems optimal for TCP-MD5, while for TCP-AO it seems to have an open problem: once RST get sent and socket gets actually destructed, there is no information on the initial sequence numbers. So, in case this last RST gets lost in the network, the server's listener socket won't be able to properly sign another RST. Nothing in RFC 1122 prescribes keeping any local state after non-graceful reset. Luckily, BGP are known to use keep alive(s). While the issue is quite minor/cosmetic, these days monitoring network counters is a common practice and getting invalid signed segments from a trusted BGP peer can get customers worried. Investigated-by: Bob Gilligan <gilligan@arista.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Dmitry Safonov <dima@arista.com> Link: https://patch.msgid.link/20250909-b4-tcp-ao-md5-rst-finwait2-v5-1-9ffaaaf8b236@arista.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11ipv6: udp: fix typos in commentsAlok Tiwari1-3/+3
Correct typos in ipv6/udp.c comments: "execeeds" -> "exceeds" "tacking care" -> "taking care" "measureable" -> "measurable" No functional changes. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250909122611.3711859-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11netfilter: nf_reject: don't reply to icmp error messagesFlorian Westphal1-0/+30
tcp reject code won't reply to a tcp reset. But the icmp reject 'netdev' family versions will reply to icmp dst-unreach errors, unlike icmp_send() and icmp6_send() which are used by the inet family implementation (and internally by the REJECT target). Check for the icmp(6) type and do not respond if its an unreachable error. Without this, something like 'ip protocol icmp reject', when used in a netdev chain attached to 'lo', cause a packet loop. Same for two hosts that both use such a rule: each error packet will be replied to. Such situation persist until the (bogus) rule is amended to ratelimit or checks the icmp type before the reject statement. As the inet versions don't do this make the netdev ones follow along. Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-08ipv6: snmp: do not track per idev ICMP6_MIB_RATELIMITHOSTEric Dumazet2-3/+6
Blamed commit added a critical false sharing on a single atomic_long_t under DOS, like receiving UDP packets to closed ports. Per netns ICMP6_MIB_RATELIMITHOST tracking uses per-cpu storage and is enough, we do not need per-device and slow tracking. Fixes: d0941130c9351 ("icmp: Add counters for rate limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jamie Bainbridge <jamie.bainbridge@gmail.com> Cc: Abhishek Rawal <rawal.abhishek92@gmail.com> Link: https://patch.msgid.link/20250905165813.1470708-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-08ipv6: snmp: do not use SNMP_MIB_SENTINEL anymoreEric Dumazet1-19/+24
Use ARRAY_SIZE(), so that we know the limit at compile time. Following patch needs this preliminary change. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250905165813.1470708-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-08ipv6: snmp: remove icmp6type2name[]Eric Dumazet1-22/+22
This 2KB array can be replaced by a switch() to save space. Before: $ size net/ipv6/proc.o text data bss dec hex filename 6410 624 0 7034 1b7a net/ipv6/proc.o After: $ size net/ipv6/proc.o text data bss dec hex filename 5516 592 0 6108 17dc net/ipv6/proc.o Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250905165813.1470708-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-23/+21
Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h c51613fa276f ("net: add sk->sk_drop_counters") 5d6b58c932ec ("net: lockless sock_i_ino()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-04ipv6: sit: Add ipip6_tunnel_dst_find() for cleanupYue Haibing1-56/+48
Extract the dst lookup logic from ipip6_tunnel_xmit() into new helper ipip6_tunnel_dst_find() to reduce code duplication and enhance readability. No functional change intended. On a x86_64, with allmodconfig object size is also reduced: ./scripts/bloat-o-meter net/ipv6/sit.o net/ipv6/sit-new.o add/remove: 5/3 grow/shrink: 3/4 up/down: 1841/-2275 (-434) Function old new delta ipip6_tunnel_dst_find - 1697 +1697 __pfx_ipip6_tunnel_dst_find - 64 +64 __UNIQUE_ID_modinfo2094 - 43 +43 ipip6_tunnel_xmit.isra.cold 79 88 +9 __UNIQUE_ID_modinfo2096 12 20 +8 __UNIQUE_ID___addressable_init_module2092 - 8 +8 __UNIQUE_ID___addressable_cleanup_module2093 - 8 +8 __func__ 55 59 +4 __UNIQUE_ID_modinfo2097 20 18 -2 __UNIQUE_ID___addressable_init_module2093 8 - -8 __UNIQUE_ID___addressable_cleanup_module2094 8 - -8 __UNIQUE_ID_modinfo2098 18 - -18 __UNIQUE_ID_modinfo2095 43 12 -31 descriptor 112 56 -56 ipip6_tunnel_xmit.isra 9910 7758 -2152 Total: Before=72537, After=72103, chg -0.60% Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20250901114857.1968513-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-03Merge tag 'nf-next-25-09-02' of ↵Jakub Kicinski1-12/+25
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter: updates for net-next 1) prefer vmalloc_array in ebtables, from Qianfeng Rong. 2) Use csum_replace4 instead of open-coding it, from Christophe Leroy. 3+4) Get rid of GFP_ATOMIC in transaction object allocations, those cause silly failures with large sets under memory pressure, from myself. 5) Remove test for AVX cpu feature in nftables pipapo set type, testing for AVX2 feature is sufficient. 6) Unexport a few function in nf_reject infra: no external callers. 7) Extend payload offset to u16, this was restricted to values <=255 so far, from Fernando Fernandez Mancera. * tag 'nf-next-25-09-02' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nft_payload: extend offset to 65535 bytes netfilter: nf_reject: remove unneeded exports netfilter: nft_set_pipapo: remove redundant test for avx feature bit netfilter: nf_tables: all transaction allocations can now sleep netfilter: nf_tables: allow iter callbacks to sleep netfilter: nft_payload: Use csum_replace4() instead of opencoding netfilter: ebtables: Use vmalloc_array() to improve code ==================== Link: https://patch.msgid.link/20250902133549.15945-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02ipv6: Add sanity checks on ipv6_devconf.rpl_seg_enabledYue Haibing1-1/+3
In ipv6_rpl_srh_rcv() we use min(net->ipv6.devconf_all->rpl_seg_enabled, idev->cnf.rpl_seg_enabled) is intended to return 0 when either value is zero, but if one of the values is negative it will in fact return non-zero. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250901123726.1972881-3-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02ipv6: annotate data-races around devconf->rpl_seg_enabledYue Haibing1-4/+2
devconf->rpl_seg_enabled can be changed concurrently from /proc/sys/net/ipv6/conf, annotate lockless reads on it. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250901123726.1972881-2-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02net/tcp: Fix socket memory leak in TCP-AO failure handling for IPv6Christoph Paasch1-17/+15
When tcp_ao_copy_all_matching() fails in tcp_v6_syn_recv_sock() it just exits the function. This ends up causing a memory-leak: unreferenced object 0xffff0000281a8200 (size 2496): comm "softirq", pid 0, jiffies 4295174684 hex dump (first 32 bytes): 7f 00 00 06 7f 00 00 06 00 00 00 00 cb a8 88 13 ................ 0a 00 03 61 00 00 00 00 00 00 00 00 00 00 00 00 ...a............ backtrace (crc 5ebdbe15): kmemleak_alloc+0x44/0xe0 kmem_cache_alloc_noprof+0x248/0x470 sk_prot_alloc+0x48/0x120 sk_clone_lock+0x38/0x3b0 inet_csk_clone_lock+0x34/0x150 tcp_create_openreq_child+0x3c/0x4a8 tcp_v6_syn_recv_sock+0x1c0/0x620 tcp_check_req+0x588/0x790 tcp_v6_rcv+0x5d0/0xc18 ip6_protocol_deliver_rcu+0x2d8/0x4c0 ip6_input_finish+0x74/0x148 ip6_input+0x50/0x118 ip6_sublist_rcv+0x2fc/0x3b0 ipv6_list_rcv+0x114/0x170 __netif_receive_skb_list_core+0x16c/0x200 netif_receive_skb_list_internal+0x1f0/0x2d0 This is because in tcp_v6_syn_recv_sock (and the IPv4 counterpart), when exiting upon error, inet_csk_prepare_forced_close() and tcp_done() need to be called. They make sure the newsk will end up being correctly free'd. tcp_v4_syn_recv_sock() makes this very clear by having the put_and_exit label that takes care of things. So, this patch here makes sure tcp_v4_syn_recv_sock and tcp_v6_syn_recv_sock have similar error-handling and thus fixes the leak for TCP-AO. Fixes: 06b22ef29591 ("net/tcp: Wire TCP-AO to request sockets") Signed-off-by: Christoph Paasch <cpaasch@openai.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20250830-tcpao_leak-v1-1-e5878c2c3173@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02netfilter: nf_reject: remove unneeded exportsFlorian Westphal1-12/+25
These functions have no external callers and can be static. Signed-off-by: Florian Westphal <fw@strlen.de>
2025-09-01inet: ping: remove ping_hash()Eric Dumazet1-1/+0
There is no point in keeping ping_hash(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250829153054.474201-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01icmp: fix icmp_ndo_send address translation for reply directionFabian Bläse1-2/+4
The icmp_ndo_send function was originally introduced to ensure proper rate limiting when icmp_send is called by a network device driver, where the packet's source address may have already been transformed by SNAT. However, the original implementation only considers the IP_CT_DIR_ORIGINAL direction for SNAT and always replaced the packet's source address with that of the original-direction tuple. This causes two problems: 1. For SNAT: Reply-direction packets were incorrectly translated using the source address of the CT original direction, even though no translation is required. 2. For DNAT: Reply-direction packets were not handled at all. In DNAT, the original direction's destination is translated. Therefore, in the reply direction the source address must be set to the reply-direction source, so rate limiting works as intended. Fix this by using the connection direction to select the correct tuple for source address translation, and adjust the pre-checks to handle reply-direction packets in case of DNAT. Additionally, wrap the `ct->status` access in READ_ONCE(). This avoids possible KCSAN reports about concurrent updates to `ct->status`. Fixes: 0b41713b6066 ("icmp: introduce helper for nat'd source address in network device context") Signed-off-by: Fabian Bläse <fabian@blaese.de> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01tcp: Remove sk->sk_prot->orphan_count.Kuniyuki Iwashima1-1/+0
TCP tracks the number of orphaned (SOCK_DEAD but not yet destructed) sockets in tcp_orphan_count. In some code that was shared with DCCP, tcp_orphan_count is referenced via sk->sk_prot->orphan_count. Let's reference tcp_orphan_count directly. inet_csk_prepare_for_destroy_sock() is moved to inet_connection_sock.c due to header dependency. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250829215641.711664-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29ipv6: use RCU in ip6_output()Eric Dumazet1-14/+15
Use RCU in ip6_output() in order to use dst_dev_rcu() to prevent possible UAF. We can remove rcu_read_lock()/rcu_read_unlock() pairs from ip6_finish_output2(). Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29ipv6: use RCU in ip6_xmit()Eric Dumazet1-14/+21
Use RCU in ip6_xmit() in order to use dst_dev_rcu() to prevent possible UAF. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29ipv6: start using dst_dev_rcu()Eric Dumazet6-13/+14
Refactor icmpv6_xrlim_allow() and ip6_dst_hoplimit() so that we acquire rcu_read_lock() a bit longer to be able to use dst_dev_rcu() instead of dst_dev(). __ip6_rt_update_pmtu() and rt6_do_redirect can directly use dst_dev_rcu() in sections already holding rcu_read_lock(). Small changes to use dst_dev_net_rcu() in ip6_default_advmss(), ipv6_sock_ac_join(), ip6_mc_find_dev() and ndisc_send_skb(). Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250828195823.3958522-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-28inet: raw: add drop_counters to raw socketsEric Dumazet1-0/+1
When a packet flood hits one or more RAW sockets, many cpus have to update sk->sk_drops. This slows down other cpus, because currently sk_drops is in sock_write_rx group. Add a socket_drop_counters structure to raw sockets. Using dedicated cache lines to hold drop counters makes sure that consumers no longer suffer from false sharing if/when producers only change sk->sk_drops. This adds 128 bytes per RAW socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-6-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28net: add sk_drops_skbadd() helperEric Dumazet1-2/+2
Existing sk_drops_add() helper is renamed to sk_drops_skbadd(). Add sk_drops_add() and convert sk_drops_inc() to use it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpersEric Dumazet3-8/+8
We want to split sk->sk_drops in the future to reduce potential contention on this field. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-26ipv6: sr: Prepare HMAC key ahead of timeEric Biggers1-5/+9
Prepare the HMAC key when it is added to the kernel, instead of preparing it implicitly for every packet. This significantly improves the performance of seg6_hmac_compute(). A microbenchmark on x86_64 shows seg6_hmac_compute() (with HMAC-SHA256) dropping from ~1978 cycles to ~1419 cycles, a 28% improvement. The size of 'struct seg6_hmac_info' increases by 128 bytes, but that should be fine, since there should not be a massive number of keys. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20250824013644.71928-3-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26ipv6: sr: Use HMAC-SHA1 and HMAC-SHA256 library functionsEric Biggers3-191/+30
Use the HMAC-SHA1 and HMAC-SHA256 library functions instead of crypto_shash. This is simpler and faster. Pre-allocating per-CPU hash transformation objects and descriptors is no longer needed, and a microbenchmark on x86_64 shows seg6_hmac_compute() (with HMAC-SHA256) dropping from ~2494 cycles to ~1978 cycles, a 20% improvement. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20250824013644.71928-2-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25tcp: Don't pass hashinfo to socket lookup helpers.Kuniyuki Iwashima6-40/+34
These socket lookup functions required struct inet_hashinfo because they are shared by TCP and DCCP. * __inet_lookup_established() * __inet_lookup_listener() * __inet6_lookup_established() * inet6_lookup_listener() DCCP has gone, and we don't need to pass hashinfo down to them. Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the above 4 functions. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25tcp: Remove hashinfo test for inet6?_lookup_run_sk_lookup().Kuniyuki Iwashima1-2/+1
Commit 6c886db2e78c ("net: remove duplicate sk_lookup helpers") started to check if hashinfo == net->ipv4.tcp_death_row.hashinfo in __inet_lookup_listener() and inet6_lookup_listener() and stopped invoking BPF sk_lookup prog for DCCP. DCCP has gone and the condition is always true. Let's remove the hashinfo test. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25tcp: Remove timewait_sock_ops.twsk_destructor().Kuniyuki Iwashima1-1/+0
Since DCCP has been removed, sk->sk_prot->twsk_prot->twsk_destructor is always tcp_twsk_destructor(). Let's call tcp_twsk_destructor() directly in inet_twsk_free() and remove ->twsk_destructor(). While at it, tcp_twsk_destructor() is un-exported. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25tcp: Remove sk_protocol test for tcp_twsk_unique().Kuniyuki Iwashima1-2/+1
Commit 383eed2de529 ("tcp: get rid of twsk_unique()") added sk->sk_protocol test in __inet_check_established() and __inet6_check_established() to remove twsk_unique() and call tcp_twsk_unique() directly. DCCP has gone, and the condition is always true. Let's remove the sk_protocol test. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250822190803.540788-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25ipv6: mcast: Add ip6_mc_find_idev() helperYue Haibing1-36/+31
Extract the same code logic from __ipv6_sock_mc_join() and ip6_mc_find_dev(), also add new helper ip6_mc_find_idev() to reduce redundancy and enhance readability. No functional changes intended. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Link: https://patch.msgid.link/20250822064051.2991480-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25tcp: annotate data-races around icsk->icsk_probes_outEric Dumazet1-1/+1
icsk->icsk_probes_out is read locklessly from inet_sk_diag_fill(), get_tcp4_sock() and get_tcp6_sock(). Add corresponding READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250822091727.835869-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25tcp: annotate data-races around icsk->icsk_retransmitsEric Dumazet1-1/+1
icsk->icsk_retransmits is read locklessly from inet_sk_diag_fill(), tcp_get_timestamping_opt_stats, get_tcp4_sock() and get_tcp6_sock(). Add corresponding READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250822091727.835869-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-4/+7
Cross-merge networking fixes after downstream PR (net-6.17-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21netfilter: nf_reject: don't leak dst refcount for loopback packetsFlorian Westphal1-3/+2
recent patches to add a WARN() when replacing skb dst entry found an old bug: WARNING: include/linux/skbuff.h:1165 skb_dst_check_unset include/linux/skbuff.h:1164 [inline] WARNING: include/linux/skbuff.h:1165 skb_dst_set include/linux/skbuff.h:1210 [inline] WARNING: include/linux/skbuff.h:1165 nf_reject_fill_skb_dst+0x2a4/0x330 net/ipv4/netfilter/nf_reject_ipv4.c:234 [..] Call Trace: nf_send_unreach+0x17b/0x6e0 net/ipv4/netfilter/nf_reject_ipv4.c:325 nft_reject_inet_eval+0x4bc/0x690 net/netfilter/nft_reject_inet.c:27 expr_call_ops_eval net/netfilter/nf_tables_core.c:237 [inline] .. This is because blamed commit forgot about loopback packets. Such packets already have a dst_entry attached, even at PRE_ROUTING stage. Instead of checking hook just check if the skb already has a route attached to it. Fixes: f53b9b0bdc59 ("netfilter: introduce support for reject at prerouting stage") Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20250820123707.10671-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-20net: set net.core.rmem_max and net.core.wmem_max to 4 MBEric Dumazet1-1/+1
SO_RCVBUF and SO_SNDBUF have limited range today, unless distros or system admins change rmem_max and wmem_max. Even iproute2 uses 1 MB SO_RCVBUF which is capped by the kernel. Decouple [rw]mem_max and [rw]mem_default and increase [rw]mem_max to 4 MB. Before: $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max net.core.rmem_default = 212992 net.core.rmem_max = 212992 net.core.wmem_default = 212992 net.core.wmem_max = 212992 After: $ sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max net.core.rmem_default = 212992 net.core.rmem_max = 4194304 net.core.wmem_default = 212992 net.core.wmem_max = 4194304 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250819174030.1986278-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-20ipv6: sr: Fix MAC comparison to be constant-timeEric Biggers1-1/+2
To prevent timing attacks, MACs need to be compared in constant time. Use the appropriate helper function for this. Fixes: bf355b8d2c30 ("ipv6: sr: add core files for SR HMAC support") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Link: https://patch.msgid.link/20250818202724.15713-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19ipv6: ip6_gre: replace strcpy with strscpy for tunnel nameMiguel García1-5/+5
Replace the strcpy() call that copies the device name into tunnel->parms.name with strscpy(), to avoid potential overflow and guarantee NULL termination. This uses the two-argument form of strscpy(), where the destination size is inferred from the array type. Destination is tunnel->parms.name (size IFNAMSIZ). Tested in QEMU (Alpine rootfs): - Created IPv6 GRE tunnels over loopback - Assigned overlay IPv6 addresses - Verified bidirectional ping through the tunnel - Changed tunnel parameters at runtime (`ip -6 tunnel change`) Signed-off-by: Miguel García <miguelgarciaroman8@gmail.com> Link: https://patch.msgid.link/20250818220203.899338-1-miguelgarciaroman8@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-19netfilter: Switch to skb_dstref_steal to clear dst_entryStanislav Fomichev1-1/+4
Going forward skb_dst_set will assert that skb dst_entry is empty during skb_dst_set. skb_dstref_steal is added to reset existing entry without doing refcnt. Switch to skb_dstref_steal in ip[6]_route_me_harder and add a comment on why it's safe to skip skb_dstref_restore. Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250818154032.3173645-4-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-18ipv6: sr: validate HMAC algorithm ID in seg6_hmac_info_addMinhong He1-0/+3
The seg6_genl_sethmac() directly uses the algorithm ID provided by the userspace without verifying whether it is an HMAC algorithm supported by the system. If an unsupported HMAC algorithm ID is configured, packets using SRv6 HMAC will be dropped during encapsulation or decapsulation. Fixes: 4f4853dc1c9c ("ipv6: sr: implement API to control SR HMAC structure") Signed-off-by: Minhong He <heminhong@kylinos.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250815063845.85426-1-heminhong@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-15net: ipv6: fix field-spanning memcpy warning in AH outputCharalampos Mitrodimas1-19/+31
Fix field-spanning memcpy warnings in ah6_output() and ah6_output_done() where extension headers are copied to/from IPv6 address fields, triggering fortify-string warnings about writes beyond the 16-byte address fields. memcpy: detected field-spanning write (size 40) of single field "&top_iph->saddr" at net/ipv6/ah6.c:439 (size 16) WARNING: CPU: 0 PID: 8838 at net/ipv6/ah6.c:439 ah6_output+0xe7e/0x14e0 net/ipv6/ah6.c:439 The warnings are false positives as the extension headers are intentionally placed after the IPv6 header in memory. Fix by properly copying addresses and extension headers separately, and introduce helper functions to avoid code duplication. Reported-by: syzbot+01b0667934cdceb4451c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=01b0667934cdceb4451c Signed-off-by: Charalampos Mitrodimas <charmitro@posteo.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-08-12Merge tag 'ipsec-2025-08-11' of ↵Paolo Abeni1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2025-08-11 1) Fix flushing of all states in xfrm_state_fini. From Sabrina Dubroca. 2) Fix some IPsec software offload features. These got lost with some recent HW offload changes. From Sabrina Dubroca. Please pull or let me know if there are problems. * tag 'ipsec-2025-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: udp: also consider secpath when evaluating ipsec use for checksumming xfrm: bring back device check in validate_xmit_xfrm xfrm: restore GSO for SW crypto xfrm: flush all states in xfrm_state_fini ==================== Link: https://patch.msgid.link/20250811092008.731573-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-07netfilter: add back NETFILTER_XTABLES dependenciesArnd Bergmann1-0/+1
Some Kconfig symbols were changed to depend on the 'bool' symbol NETFILTER_XTABLES_LEGACY, which means they can now be set to built-in when the xtables code itself is in a loadable module: x86_64-linux-ld: vmlinux.o: in function `arpt_unregister_table_pre_exit': (.text+0x1831987): undefined reference to `xt_find_table' x86_64-linux-ld: vmlinux.o: in function `get_info.constprop.0': arp_tables.c:(.text+0x1831aab): undefined reference to `xt_request_find_table_lock' x86_64-linux-ld: arp_tables.c:(.text+0x1831bea): undefined reference to `xt_table_unlock' x86_64-linux-ld: vmlinux.o: in function `do_arpt_get_ctl': arp_tables.c:(.text+0x183205d): undefined reference to `xt_find_table_lock' x86_64-linux-ld: arp_tables.c:(.text+0x18320c1): undefined reference to `xt_table_unlock' x86_64-linux-ld: arp_tables.c:(.text+0x183219a): undefined reference to `xt_recseq' Change these to depend on both NETFILTER_XTABLES and NETFILTER_XTABLES_LEGACY. Fixes: 9fce66583f06 ("netfilter: Exclude LEGACY TABLES on PREEMPT_RT.") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Florian Westphal <fw@strlen.de> Tested-by: Breno Leitao <leitao@debian.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-08-06xfrm: flush all states in xfrm_state_finiSabrina Dubroca1-1/+1
While reverting commit f75a2804da39 ("xfrm: destroy xfrm_state synchronously on net exit path"), I incorrectly changed xfrm_state_flush's "proto" argument back to IPSEC_PROTO_ANY. This reverts some of the changes in commit dbb2483b2a46 ("xfrm: clean up xfrm protocol checks"), and leads to some states not being removed when we exit the netns. Pass 0 instead of IPSEC_PROTO_ANY from both xfrm_state_fini xfrm6_tunnel_net_exit, so that xfrm_state_flush deletes all states. Fixes: 2a198bbec691 ("Revert "xfrm: destroy xfrm_state synchronously on net exit path"") Reported-by: syzbot+6641a61fe0e2e89ae8c5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6641a61fe0e2e89ae8c5 Tested-by: syzbot+6641a61fe0e2e89ae8c5@syzkaller.appspotmail.com Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-08-01ipv6: reject malicious packets in ipv6_gso_segment()Eric Dumazet1-1/+3
syzbot was able to craft a packet with very long IPv6 extension headers leading to an overflow of skb->transport_header. This 16bit field has a limited range. Add skb_reset_transport_header_careful() helper and use it from ipv6_gso_segment() WARNING: CPU: 0 PID: 5871 at ./include/linux/skbuff.h:3032 skb_reset_transport_header include/linux/skbuff.h:3032 [inline] WARNING: CPU: 0 PID: 5871 at ./include/linux/skbuff.h:3032 ipv6_gso_segment+0x15e2/0x21e0 net/ipv6/ip6_offload.c:151 Modules linked in: CPU: 0 UID: 0 PID: 5871 Comm: syz-executor211 Not tainted 6.16.0-rc6-syzkaller-g7abc678e3084 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025 RIP: 0010:skb_reset_transport_header include/linux/skbuff.h:3032 [inline] RIP: 0010:ipv6_gso_segment+0x15e2/0x21e0 net/ipv6/ip6_offload.c:151 Call Trace: <TASK> skb_mac_gso_segment+0x31c/0x640 net/core/gso.c:53 nsh_gso_segment+0x54a/0xe10 net/nsh/nsh.c:110 skb_mac_gso_segment+0x31c/0x640 net/core/gso.c:53 __skb_gso_segment+0x342/0x510 net/core/gso.c:124 skb_gso_segment include/net/gso.h:83 [inline] validate_xmit_skb+0x857/0x11b0 net/core/dev.c:3950 validate_xmit_skb_list+0x84/0x120 net/core/dev.c:4000 sch_direct_xmit+0xd3/0x4b0 net/sched/sch_generic.c:329 __dev_xmit_skb net/core/dev.c:4102 [inline] __dev_queue_xmit+0x17b6/0x3a70 net/core/dev.c:4679 Fixes: d1da932ed4ec ("ipv6: Separate ipv6 offload support") Reported-by: syzbot+af43e647fd835acc02df@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/688a1a05.050a0220.5d226.0008.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250730131738.3385939-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-30Merge tag 'bpf-next-6.17' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Remove usermode driver (UMD) framework (Thomas Weißschuh) - Introduce Strongly Connected Component (SCC) in the verifier to detect loops and refine register liveness (Eduard Zingerman) - Allow 'void *' cast using bpf_rdonly_cast() and corresponding '__arg_untrusted' for global function parameters (Eduard Zingerman) - Improve precision for BPF_ADD and BPF_SUB operations in the verifier (Harishankar Vishwanathan) - Teach the verifier that constant pointer to a map cannot be NULL (Ihor Solodrai) - Introduce BPF streams for error reporting of various conditions detected by BPF runtime (Kumar Kartikeya Dwivedi) - Teach the verifier to insert runtime speculation barrier (lfence on x86) to mitigate speculative execution instead of rejecting the programs (Luis Gerhorst) - Various improvements for 'veristat' (Mykyta Yatsenko) - For CONFIG_DEBUG_KERNEL config warn on internal verifier errors to improve bug detection by syzbot (Paul Chaignon) - Support BPF private stack on arm64 (Puranjay Mohan) - Introduce bpf_cgroup_read_xattr() kfunc to read xattr of cgroup's node (Song Liu) - Introduce kfuncs for read-only string opreations (Viktor Malik) - Implement show_fdinfo() for bpf_links (Tao Chen) - Reduce verifier's stack consumption (Yonghong Song) - Implement mprog API for cgroup-bpf programs (Yonghong Song) * tag 'bpf-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (192 commits) selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite selftests/bpf: Add selftest for attaching tracing programs to functions in deny list bpf: Add log for attaching tracing programs to functions in deny list bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions bpf: Fix various typos in verifier.c comments bpf: Add third round of bounds deduction selftests/bpf: Test invariants on JSLT crossing sign selftests/bpf: Test cross-sign 64bits range refinement selftests/bpf: Update reg_bound range refinement logic bpf: Improve bounds when s64 crosses sign boundary bpf: Simplify bounds refinement from s32 selftests/bpf: Enable private stack tests for arm64 bpf, arm64: JIT support for private stack bpf: Move bpf_jit_get_prog_name() to core.c bpf, arm64: Fix fp initialization for exception boundary umd: Remove usermode driver framework bpf/preload: Don't select USERMODE_DRIVER selftests/bpf: Fix test dynptr/test_dynptr_memset_xdp_chunks failure selftests/bpf: Fix test dynptr/test_dynptr_copy_xdp failure selftests/bpf: Increase xdp data size for arm64 64K page size ...
2025-07-30Merge tag 'net-next-6.17' of ↵Linus Torvalds42-732/+901
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Wrap datapath globals into net_aligned_data, to avoid false sharing - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container) - Add SO_INQ and SCM_INQ support to AF_UNIX - Add SIOCINQ support to AF_VSOCK - Add TCP_MAXSEG sockopt to MPTCP - Add IPv6 force_forwarding sysctl to enable forwarding per interface - Make TCP validation of whether packet fully fits in the receive window and the rcv_buf more strict. With increased use of HW aggregation a single "packet" can be multiple 100s of kB - Add MSG_MORE flag to optimize large TCP transmissions via sockmap, improves latency up to 33% for sockmap users - Convert TCP send queue handling from tasklet to BH workque - Improve BPF iteration over TCP sockets to see each socket exactly once - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code - Support enabling kernel threads for NAPI processing on per-NAPI instance basis rather than a whole device. Fully stop the kernel NAPI thread when threaded NAPI gets disabled. Previously thread would stick around until ifdown due to tricky synchronization - Allow multicast routing to take effect on locally-generated packets - Add output interface argument for End.X in segment routing - MCTP: add support for gateway routing, improve bind() handling - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink - Add a new neighbor flag ("extern_valid"), which cedes refresh responsibilities to userspace. This is needed for EVPN multi-homing where a neighbor entry for a multi-homed host needs to be synced across all the VTEPs among which the host is multi-homed - Support NUD_PERMANENT for proxy neighbor entries - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM - Add sequence numbers to netconsole messages. Unregister netconsole's console when all net targets are removed. Code refactoring. Add a number of selftests - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol should be used for an inbound SA lookup - Support inspecting ref_tracker state via DebugFS - Don't force bonding advertisement frames tx to ~333 ms boundaries. Add broadcast_neighbor option to send ARP/ND on all bonded links - Allow providing upcall pid for the 'execute' command in openvswitch - Remove DCCP support from Netfilter's conntrack - Disallow multiple packet duplications in the queuing layer - Prevent use of deprecated iptables code on PREEMPT_RT Driver API: - Support RSS and hashing configuration over ethtool Netlink - Add dedicated ethtool callbacks for getting and setting hashing fields - Add support for power budget evaluation strategy in PSE / Power-over-Ethernet. Generate Netlink events for overcurrent etc - Support DPLL phase offset monitoring across all device inputs. Support providing clock reference and SYNC over separate DPLL inputs - Support traffic classes in devlink rate API for bandwidth management - Remove rtnl_lock dependency from UDP tunnel port configuration Device drivers: - Add a new Broadcom driver for 800G Ethernet (bnge) - Add a standalone driver for Microchip ZL3073x DPLL - Remove IBM's NETIUCV device driver - Ethernet high-speed NICs: - Broadcom (bnxt): - support zero-copy Tx of DMABUF memory - take page size into account for page pool recycling rings - Intel (100G, ice, idpf): - idpf: XDP and AF_XDP support preparations - idpf: add flow steering - add link_down_events statistic - clean up the TSPLL code - preparations for live VM migration - nVidia/Mellanox: - support zero-copy Rx/Tx interfaces (DMABUF and io_uring) - optimize context memory usage for matchers - expose serial numbers in devlink info - support PCIe congestion metrics - Meta (fbnic): - add 25G, 50G, and 100G link modes to phylink - support dumping FW logs - Marvell/Cavium: - support for CN20K generation of the Octeon chips - Amazon: - add HW clock (without timestamping, just hypervisor time access) - Ethernet virtual: - VirtIO net: - support segmentation of UDP-tunnel-encapsulated packets - Google (gve): - support packet timestamping and clock synchronization - Microsoft vNIC: - add handler for device-originated servicing events - allow dynamic MSI-X vector allocation - support Tx bandwidth clamping - Ethernet NICs consumer, and embedded: - AMD: - amd-xgbe: hardware timestamping and PTP clock support - Broadcom integrated MACs (bcmgenet, bcmasp): - use napi_complete_done() return value to support NAPI polling - add support for re-starting auto-negotiation - Broadcom switches (b53): - support BCM5325 switches - add bcm63xx EPHY power control - Synopsys (stmmac): - lots of code refactoring and cleanups - TI: - icssg-prueth: read firmware-names from device tree - icssg: PRP offload support - Microchip: - lan78xx: convert to PHYLINK for improved PHY and MAC management - ksz: add KSZ8463 switch support - Intel: - support similar queue priority scheme in multi-queue and time-sensitive networking (taprio) - support packet pre-emption in both - RealTek (r8169): - enable EEE at 5Gbps on RTL8126 - Airoha: - add PPPoE offload support - MDIO bus controller for Airoha AN7583 - Ethernet PHYs: - support for the IPQ5018 internal GE PHY - micrel KSZ9477 switch-integrated PHYs: - add MDI/MDI-X control support - add RX error counters - add cable test support - add Signal Quality Indicator (SQI) reporting - dp83tg720: improve reset handling and reduce link recovery time - support bcm54811 (and its MII-Lite interface type) - air_en8811h: support resume/suspend - support PHY counters for QCA807x and QCA808x - support WoL for QCA807x - CAN drivers: - rcar_canfd: support for Transceiver Delay Compensation - kvaser: report FW versions via devlink dev info - WiFi: - extended regulatory info support (6 GHz) - add statistics and beacon monitor for Multi-Link Operation (MLO) - support S1G aggregation, improve S1G support - add Radio Measurement action fields - support per-radio RTS threshold - some work around how FIPS affects wifi, which was wrong (RC4 is used by TKIP, not only WEP) - improvements for unsolicited probe response handling - WiFi drivers: - RealTek (rtw88): - IBSS mode for SDIO devices - RealTek (rtw89): - BT coexistence for MLO/WiFi7 - concurrent station + P2P support - support for USB devices RTL8851BU/RTL8852BU - Intel (iwlwifi): - use embedded PNVM in (to be released) FW images to fix compatibility issues - many cleanups (unused FW APIs, PCIe code, WoWLAN) - some FIPS interoperability - MediaTek (mt76): - firmware recovery improvements - more MLO work - Qualcomm/Atheros (ath12k): - fix scan on multi-radio devices - more EHT/Wi-Fi 7 features - encapsulation/decapsulation offload - Broadcom (brcm80211): - support SDIO 43751 device - Bluetooth: - hci_event: add support for handling LE BIG Sync Lost event - ISO: add socket option to report packet seqnum via CMSG - ISO: support SCM_TIMESTAMPING for ISO TS - Bluetooth drivers: - intel_pcie: support Function Level Reset - nxpuart: add support for 4M baudrate - nxpuart: implement powerup sequence, reset, FW dump, and FW loading" * tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits) dpll: zl3073x: Fix build failure selftests: bpf: fix legacy netfilter options ipv6: annotate data-races around rt->fib6_nsiblings ipv6: fix possible infinite loop in fib6_info_uses_dev() ipv6: prevent infinite loop in rt6_nlmsg_size() ipv6: add a retry logic in net6_rt_notify() vrf: Drop existing dst reference in vrf_ip6_input_dst net/sched: taprio: align entry index attr validation with mqprio net: fsl_pq_mdio: use dev_err_probe selftests: rtnetlink.sh: remove esp4_offload after test vsock: remove unnecessary null check in vsock_getname() igb: xsk: solve negative overflow of nb_pkts in zerocopy mode stmmac: xsk: fix negative overflow of budget in zerocopy mode dt-bindings: ieee802154: Convert at86rf230.txt yaml format net: dsa: microchip: Disable PTP function of KSZ8463 net: dsa: microchip: Setup fiber ports for KSZ8463 net: dsa: microchip: Write switch MAC address differently for KSZ8463 net: dsa: microchip: Use different registers for KSZ8463 net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver dt-bindings: net: dsa: microchip: Add KSZ8463 switch support ...
2025-07-28Merge tag 'libcrypto-updates-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull crypto library updates from Eric Biggers: "This is the main crypto library pull request for 6.17. The main focus this cycle is on reorganizing the SHA-1 and SHA-2 code, providing high-quality library APIs for SHA-1 and SHA-2 including HMAC support, and establishing conventions for lib/crypto/ going forward: - Migrate the SHA-1 and SHA-512 code (and also SHA-384 which shares most of the SHA-512 code) into lib/crypto/. This includes both the generic and architecture-optimized code. Greatly simplify how the architecture-optimized code is integrated. Add an easy-to-use library API for each SHA variant, including HMAC support. Finally, reimplement the crypto_shash support on top of the library API. - Apply the same reorganization to the SHA-256 code (and also SHA-224 which shares most of the SHA-256 code). This is a somewhat smaller change, due to my earlier work on SHA-256. But this brings in all the same additional improvements that I made for SHA-1 and SHA-512. There are also some smaller changes: - Move the architecture-optimized ChaCha, Poly1305, and BLAKE2s code from arch/$(SRCARCH)/lib/crypto/ to lib/crypto/$(SRCARCH)/. For these algorithms it's just a move, not a full reorganization yet. - Fix the MIPS chacha-core.S to build with the clang assembler. - Fix the Poly1305 functions to work in all contexts. - Fix a performance regression in the x86_64 Poly1305 code. - Clean up the x86_64 SHA-NI optimized SHA-1 assembly code. Note that since the new organization of the SHA code is much simpler, the diffstat of this pull request is negative, despite the addition of new fully-documented library APIs for multiple SHA and HMAC-SHA variants. These APIs will allow further simplifications across the kernel as users start using them instead of the old-school crypto API. (I've already written a lot of such conversion patches, removing over 1000 more lines of code. But most of those will target 6.18 or later)" * tag 'libcrypto-updates-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (67 commits) lib/crypto: arm64/sha512-ce: Drop compatibility macros for older binutils lib/crypto: x86/sha1-ni: Convert to use rounds macros lib/crypto: x86/sha1-ni: Minor optimizations and cleanup crypto: sha1 - Remove sha1_base.h lib/crypto: x86/sha1: Migrate optimized code into library lib/crypto: sparc/sha1: Migrate optimized code into library lib/crypto: s390/sha1: Migrate optimized code into library lib/crypto: powerpc/sha1: Migrate optimized code into library lib/crypto: mips/sha1: Migrate optimized code into library lib/crypto: arm64/sha1: Migrate optimized code into library lib/crypto: arm/sha1: Migrate optimized code into library crypto: sha1 - Use same state format as legacy drivers crypto: sha1 - Wrap library and add HMAC support lib/crypto: sha1: Add HMAC support lib/crypto: sha1: Add SHA-1 library functions lib/crypto: sha1: Rename sha1_init() to sha1_init_raw() crypto: x86/sha1 - Rename conflicting symbol lib/crypto: sha2: Add hmac_sha*_init_usingrawkey() lib/crypto: arm/poly1305: Remove unneeded empty weak function lib/crypto: x86/poly1305: Fix performance regression on short messages ...
2025-07-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-38/+57
Merge in late fixes to prepare for the 6.17 net-next PR. Conflicts: net/core/neighbour.c 1bbb76a89948 ("neighbour: Fix null-ptr-deref in neigh_flush_dev().") 13a936bb99fb ("neighbour: Protect tbl->phash_buckets[] with a dedicated mutex.") 03dc03fa0432 ("neighbor: Add NTF_EXT_VALIDATED flag for externally validated entries") Adjacent changes: drivers/net/usb/usbnet.c 0d9cfc9b8cb1 ("net: usbnet: Avoid potential RCU stall on LINK_CHANGE event") 2c04d279e857 ("net: usb: Convert tasklet API to new bottom half workqueue mechanism") net/ipv6/route.c 31d7d67ba127 ("ipv6: annotate data-races around rt->fib6_nsiblings") 1caf27297215 ("ipv6: adopt dst_dev() helper") 3b3ccf9ed05e ("net: Remove unnecessary NULL check for lwtunnel_fill_encap()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-26ipv6: annotate data-races around rt->fib6_nsiblingsEric Dumazet2-9/+16
rt->fib6_nsiblings can be read locklessly, add corresponding READ_ONCE() and WRITE_ONCE() annotations. Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250725140725.3626540-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-26ipv6: fix possible infinite loop in fib6_info_uses_dev()Eric Dumazet1-6/+11
fib6_info_uses_dev() seems to rely on RCU without an explicit protection. Like the prior fix in rt6_nlmsg_size(), we need to make sure fib6_del_route() or fib6_add_rt2node() have not removed the anchor from the list, or we risk an infinite loop. Fixes: d9ccb18f83ea ("ipv6: Fix soft lockups in fib6_select_path under high next hop churn") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250725140725.3626540-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-26ipv6: prevent infinite loop in rt6_nlmsg_size()Eric Dumazet2-18/+20
While testing prior patch, I was able to trigger an infinite loop in rt6_nlmsg_size() in the following place: list_for_each_entry_rcu(sibling, &f6i->fib6_siblings, fib6_siblings) { rt6_nh_nlmsg_size(sibling->fib6_nh, &nexthop_len); } This is because fib6_del_route() and fib6_add_rt2node() uses list_del_rcu(), which can confuse rcu readers, because they might no longer see the head of the list. Restart the loop if f6i->fib6_nsiblings is zero. Fixes: d9ccb18f83ea ("ipv6: Fix soft lockups in fib6_select_path under high next hop churn") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250725140725.3626540-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-26ipv6: add a retry logic in net6_rt_notify()Eric Dumazet1-5/+10
inet6_rt_notify() can be called under RCU protection only. This means the route could be changed concurrently and rt6_fill_node() could return -EMSGSIZE. Re-size the skb when this happens and retry, removing one WARN_ON() that syzbot was able to trigger: WARNING: CPU: 3 PID: 6291 at net/ipv6/route.c:6342 inet6_rt_notify+0x475/0x4b0 net/ipv6/route.c:6342 Modules linked in: CPU: 3 UID: 0 PID: 6291 Comm: syz.0.77 Not tainted 6.16.0-rc7-syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:inet6_rt_notify+0x475/0x4b0 net/ipv6/route.c:6342 Code: fc ff ff e8 6d 52 ea f7 e9 47 fc ff ff 48 8b 7c 24 08 4c 89 04 24 e8 5a 52 ea f7 4c 8b 04 24 e9 94 fd ff ff e8 9c fe 84 f7 90 <0f> 0b 90 e9 bd fd ff ff e8 6e 52 ea f7 e9 bb fb ff ff 48 89 df e8 RSP: 0018:ffffc900035cf1d8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffffc900035cf540 RCX: ffffffff8a36e790 RDX: ffff88802f7e8000 RSI: ffffffff8a36e9d4 RDI: 0000000000000005 RBP: ffff88803c230f00 R08: 0000000000000005 R09: 00000000ffffffa6 R10: 00000000ffffffa6 R11: 0000000000000001 R12: 00000000ffffffa6 R13: 0000000000000900 R14: ffff888032ea4100 R15: 0000000000000000 FS: 00007fac7b89a6c0(0000) GS:ffff8880d6a20000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fac7b899f98 CR3: 0000000034b3f000 CR4: 0000000000352ef0 Call Trace: <TASK> ip6_route_mpath_notify+0xde/0x280 net/ipv6/route.c:5356 ip6_route_multipath_add+0x1181/0x1bd0 net/ipv6/route.c:5536 inet6_rtm_newroute+0xe4/0x1a0 net/ipv6/route.c:5647 rtnetlink_rcv_msg+0x95e/0xe90 net/core/rtnetlink.c:6944 netlink_rcv_skb+0x155/0x420 net/netlink/af_netlink.c:2552 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0x58d/0x850 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x8d1/0xdd0 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:712 [inline] __sock_sendmsg net/socket.c:727 [inline] ____sys_sendmsg+0xa95/0xc70 net/socket.c:2566 ___sys_sendmsg+0x134/0x1d0 net/socket.c:2620 Fixes: 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://patch.msgid.link/20250725140725.3626540-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-25Merge tag 'nf-next-25-07-25' of ↵Jakub Kicinski1-10/+9
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following series contains Netfilter/IPVS updates for net-next: 1) Display netns inode in conntrack table full log, from lvxiafei. 2) Autoload nf_log_syslog in case no logging backend is available, from Lance Yang. 3) Three patches to remove unused functions in x_tables, nf_tables and conntrack. From Yue Haibing. 4) Exclude LEGACY TABLES on PREEMPT_RT: Add NETFILTER_XTABLES_LEGACY to exclude xtables legacy infrastructure. 5) Restore selftests by toggling NETFILTER_XTABLES_LEGACY where needed. From Florian Westphal. 6) Use CONFIG_INET_SCTP_DIAG in tools/testing/selftests/net/netfilter/config, from Sebastian Andrzej Siewior. 7) Use timer_delete in comment in IPVS codebase, from WangYuli. 8) Dump flowtable information in nfnetlink_hook, this includes an initial patch to consolidate common code in helper function, from Phil Sutter. 9) Remove unused arguments in nft_pipapo set backend, from Florian Westphal. 10) Return nft_set_ext instead of boolean in set lookup function, from Florian Westphal. 11) Remove indirection in dynamic set infrastructure, also from Florian. 12) Consolidate pipapo_get/lookup, from Florian. 13) Use kvmalloc in nft_pipapop, from Florian Westphal. 14) syzbot reports slab-out-of-bounds in xt_nfacct log message, fix from Florian Westphal. 15) Ignored tainted kernels in selftest nft_interface_stress.sh, from Phil Sutter. 16) Fix IPVS selftest by disabling rp_filter with ipip tunnel device, from Yi Chen. * tag 'nf-next-25-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: selftests: netfilter: ipvs.sh: Explicity disable rp_filter on interface tunl0 selftests: netfilter: Ignore tainted kernels in interface stress test netfilter: xt_nfacct: don't assume acct name is null-terminated netfilter: nft_set_pipapo: prefer kvmalloc for scratch maps netfilter: nft_set_pipapo: merge pipapo_get/lookup netfilter: nft_set: remove indirection from update API call netfilter: nft_set: remove one argument from lookup and update functions netfilter: nft_set_pipapo: remove unused arguments netfilter: nfnetlink_hook: Dump flowtable info netfilter: nfnetlink: New NFNLA_HOOK_INFO_DESC helper ipvs: Rename del_timer in comment in ip_vs_conn_expire_now() selftests: netfilter: Enable CONFIG_INET_SCTP_DIAG selftests: net: Enable legacy netfilter legacy options. netfilter: Exclude LEGACY TABLES on PREEMPT_RT. netfilter: conntrack: Remove unused net in nf_conntrack_double_lock() netfilter: nf_tables: Remove unused nft_reduce_is_readonly() netfilter: x_tables: Remove unused functions xt_{in|out}name() netfilter: load nf_log_syslog on enabling nf_conntrack_log_invalid netfilter: conntrack: table full detailed log ==================== Link: https://patch.msgid.link/20250725170340.21327-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-25ipv6: add `force_forwarding` sysctl to enable per-interface forwardingGabriel Goller2-1/+84
It is currently impossible to enable ipv6 forwarding on a per-interface basis like in ipv4. To enable forwarding on an ipv6 interface we need to enable it on all interfaces and disable it on the other interfaces using a netfilter rule. This is especially cumbersome if you have lots of interfaces and only want to enable forwarding on a few. According to the sysctl docs [0] the `net.ipv6.conf.all.forwarding` enables forwarding for all interfaces, while the interface-specific `net.ipv6.conf.<interface>.forwarding` configures the interface Host/Router configuration. Introduce a new sysctl flag `force_forwarding`, which can be set on every interface. The ip6_forwarding function will then check if the global forwarding flag OR the force_forwarding flag is active and forward the packet. To preserve backwards-compatibility reset the flag (on all interfaces) to 0 if the net.ipv6.conf.all.forwarding flag is set to 0. Add a short selftest that checks if a packet gets forwarded with and without `force_forwarding`. [0]: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Gabriel Goller <g.goller@proxmox.com> Link: https://patch.msgid.link/20250722081847.132632-1-g.goller@proxmox.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-25netfilter: Exclude LEGACY TABLES on PREEMPT_RT.Pablo Neira Ayuso1-10/+9
The seqcount xt_recseq is used to synchronize the replacement of xt_table::private in xt_replace_table() against all readers such as ipt_do_table() To ensure that there is only one writer, the writing side disables bottom halves. The sequence counter can be acquired recursively. Only the first invocation modifies the sequence counter (signaling that a writer is in progress) while the following (recursive) writer does not modify the counter. The lack of a proper locking mechanism for the sequence counter can lead to live lock on PREEMPT_RT if the high prior reader preempts the writer. Additionally if the per-CPU lock on PREEMPT_RT is removed from local_bh_disable() then there is no synchronisation for the per-CPU sequence counter. The affected code is "just" the legacy netfilter code which is replaced by "netfilter tables". That code can be disabled without sacrificing functionality because everything is provided by the newer implementation. This will only requires the usage of the "-nft" tools instead of the "-legacy" ones. The long term plan is to remove the legacy code so lets accelerate the progress. Relax dependencies on iptables legacy, replace select with depends on, this should cause no harm to existing kernel configs and users can still toggle IP{6}_NF_IPTABLES_LEGACY in any case. Make EBTABLES_LEGACY, IPTABLES_LEGACY and ARPTABLES depend on NETFILTER_XTABLES_LEGACY. Hide xt_recseq and its users, xt_register_table() and xt_percpu_counter_alloc() behind NETFILTER_XTABLES_LEGACY. Let NETFILTER_XTABLES_LEGACY depend on !PREEMPT_RT. This will break selftest expecing the legacy options enabled and will be addressed in a following patch. Co-developed-by: Florian Westphal <fw@strlen.de> Co-developed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-1/+6
Cross-merge networking fixes after downstream PR (net-6.16-rc8). Conflicts: drivers/net/ethernet/microsoft/mana/gdma_main.c 9669ddda18fb ("net: mana: Fix warnings for missing export.h header inclusion") 755391121038 ("net: mana: Allocate MSI-X vectors dynamically") https://lore.kernel.org/20250711130752.23023d98@canb.auug.org.au Adjacent changes: drivers/net/ethernet/ti/icssg/icssg_prueth.h 6e86fb73de0f ("net: ti: icssg-prueth: Fix buffer allocation for ICSSG") ffe8a4909176 ("net: ti: icssg-prueth: Read firmware-names from device tree") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-24Merge tag 'ipsec-2025-07-23' of ↵Paolo Abeni3-1/+6
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2025-07-23 1) Premption fixes for xfrm_state_find. From Sabrina Dubroca. 2) Initialize offload path also for SW IPsec GRO. This fixes a performance regression on SW IPsec offload. From Leon Romanovsky. 3) Fix IPsec UDP GRO for IKE packets. From Tobias Brunner, 4) Fix transport header setting for IPcomp after decompressing. From Fernando Fernandez Mancera. 5) Fix use-after-free when xfrmi_changelink tries to change collect_md for a xfrm interface. From Eyal Birger . 6) Delete the special IPcomp x->tunnel state along with the state x to avoid refcount problems. From Sabrina Dubroca. Please pull or let me know if there are problems. * tag 'ipsec-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: Revert "xfrm: destroy xfrm_state synchronously on net exit path" xfrm: delete x->tunnel as we delete x xfrm: interface: fix use-after-free after changing collect_md xfrm interface xfrm: ipcomp: adjust transport header after decompressing xfrm: Set transport header to fix UDP GRO handling xfrm: always initialize offload path xfrm: state: use a consistent pcpu_id in xfrm_state_find xfrm: state: initialize state_ptrs earlier in xfrm_state_find ==================== Link: https://patch.msgid.link/20250723075417.3432644-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22ip6_gre: Factor out common ip6gre tunnel match into helperYue Haibing1-66/+34
Extract common ip6gre tunnel match from ip6gre_tunnel_lookup() into new helper function ip6gre_tunnel_match() to reduce code duplication. No functional change intended. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250719081551.963670-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-18net: s/dev_get_flags/netif_get_flags/Stanislav Fomichev1-1/+1
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: track pfmemalloc drops via SKB_DROP_REASON_PFMEMALLOCJesper Dangaard Brouer2-9/+4
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets dropped due to memory pressure. In production environments, we've observed memory exhaustion reported by memory layer stack traces, but these drops were not properly tracked in the SKB drop reason infrastructure. While most network code paths now properly report pfmemalloc drops, some protocol-specific socket implementations still use sk_filter() without drop reason tracking: - Bluetooth L2CAP sockets - CAIF sockets - IUCV sockets - Netlink sockets - SCTP sockets - Unix domain sockets These remaining cases represent less common paths and could be converted in a follow-up patch if needed. The current implementation provides significantly improved observability into memory pressure events in the network stack, especially for key protocols like TCP and UDP, helping to diagnose problems in production environments. Reported-by: Matt Fleming <mfleming@cloudflare.com> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc6Alexei Starovoitov3-12/+7
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-17neighbour: Remove __pneigh_lookup().Kuniyuki Iwashima1-4/+2
__pneigh_lookup() is the lockless version of pneigh_lookup(), but its only caller pndisc_is_router() holds the table lock and reads pneigh_netry.flags. This is because accessing pneigh_entry after pneigh_lookup() was illegal unless the caller holds RTNL or the table lock. Now, pneigh_entry is guaranteed to be alive during the RCU critical section. Let's call pneigh_lookup() and use READ_ONCE() for n->flags in pndisc_is_router() and remove __pneigh_lookup(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Split pneigh_lookup().Kuniyuki Iwashima2-2/+2
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which is confusing. When called with the last argument, creat, 0, pneigh_lookup() literally looks up a proxy neighbour entry. This is the case of the reader path as the fast path and RTM_GETNEIGH. pneigh_lookup(), however, creates a pneigh_entry when called with creat 1 from RTM_NEWNEIGH and SIOCSARP, which require RTNL. Let's split pneigh_lookup() into two functions. We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock) in the new pneigh_lookup() will be dropped. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-5/+5
Cross-merge networking fixes after downstream PR (net-6.16-rc7). Conflicts: Documentation/netlink/specs/ovpn.yaml 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") af52020fc599 ("ovpn: reject unexpected netlink attributes") drivers/net/phy/phy_device.c a44312d58e78 ("net: phy: Don't register LEDs for genphy") f0f2b992d818 ("net: phy: Don't register LEDs for genphy") https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org drivers/net/wireless/intel/iwlwifi/fw/regulatory.c drivers/net/wireless/intel/iwlwifi/mld/regulatory.c 5fde0fcbd760 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap") ea045a0de3b9 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware") net/ipv6/mcast.c ae3264a25a46 ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()") a8594c956cc9 ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()") https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-16bpf: Clean up individual BTF_ID codeFeng Yang1-2/+1
Use BTF_ID_LIST_SINGLE(a, b, c) instead of BTF_ID_LIST(a) BTF_ID(b, c) Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Link: https://lore.kernel.org/r/20250710055419.70544-1-yangfeng59949@163.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-16ipv6: mcast: Delay put pmc->idev in mld_del_delrec()Yue Haibing1-1/+1
pmc->idev is still used in ip6_mc_clear_src(), so as mld_clear_delrec() does, the reference should be put after ip6_mc_clear_src() return. Fixes: 63ed8de4be81 ("mld: add mc_lock for protecting per-interface mld data") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://patch.msgid.link/20250714141957.3301871-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-16ipv6: mcast: Simplify mld_clear_{report|query}()Yue Haibing1-8/+2
Use __skb_queue_purge() instead of re-implementing it. Note that it uses kfree_skb_reason() instead of kfree_skb() internally, and pass SKB_DROP_REASON_QUEUE_PURGE drop reason to the kfree_skb tracepoint. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250715120709.3941510-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-15ipv6: mcast: Remove unnecessary null check in ip6_mc_find_dev()Yue Haibing1-3/+0
These is no need to check null for idev before return NULL. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250714081732.3109764-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-15ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()Yue Haibing1-27/+25
Avoid duplicate non-null pointer check for pmc in mld_del_delrec(). No functional changes. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250714081949.3109947-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-14dev: Pass netdevice_tracker to dev_get_by_flags_rcu().Kuniyuki Iwashima1-5/+6
This is a follow-up for commit eb1ac9ff6c4a5 ("ipv6: anycast: Don't hold RTNL for IPV6_JOIN_ANYCAST."). We should not add a new device lookup API without netdevice_tracker. Let's pass netdevice_tracker to dev_get_by_flags_rcu() and rename it with netdev_ prefix to match other newer APIs. Note that we always use GFP_ATOMIC for netdev_hold() as it's expected to be called under RCU. Suggested-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/netdev/20250708184053.102109f6@kernel.org/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250711051120.2866855-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14lib/crypto: sha1: Rename sha1_init() to sha1_init_raw()Eric Biggers1-1/+1
Rename the existing sha1_init() to sha1_init_raw(), since it conflicts with the upcoming library function. This will later be removed, but this keeps the kernel building for the introduction of the library. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250712232329.818226-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-07-13rpl: Fix use-after-free in rpl_do_srh_inline().Kuniyuki Iwashima1-4/+4
Running lwt_dst_cache_ref_loop.sh in selftest with KASAN triggers the splat below [0]. rpl_do_srh_inline() fetches ipv6_hdr(skb) and accesses it after skb_cow_head(), which is illegal as the header could be freed then. Let's fix it by making oldhdr to a local struct instead of a pointer. [0]: [root@fedora net]# ./lwt_dst_cache_ref_loop.sh ... TEST: rpl (input) [ 57.631529] ================================================================== BUG: KASAN: slab-use-after-free in rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174) Read of size 40 at addr ffff888122bf96d8 by task ping6/1543 CPU: 50 UID: 0 PID: 1543 Comm: ping6 Not tainted 6.16.0-rc5-01302-gfadd1e6231b1 #23 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <IRQ> dump_stack_lvl (lib/dump_stack.c:122) print_report (mm/kasan/report.c:409 mm/kasan/report.c:521) kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:636) kasan_check_range (mm/kasan/generic.c:175 (discriminator 1) mm/kasan/generic.c:189 (discriminator 1)) __asan_memmove (mm/kasan/shadow.c:94 (discriminator 2)) rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:174) rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282) lwtunnel_input (net/core/lwtunnel.c:459) ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1)) __netif_receive_skb_one_core (net/core/dev.c:5967) process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440) __napi_poll.constprop.0 (net/core/dev.c:7452) net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643) handle_softirqs (kernel/softirq.c:579) do_softirq (kernel/softirq.c:480 (discriminator 20)) </IRQ> <TASK> __local_bh_enable_ip (kernel/softirq.c:407) __dev_queue_xmit (net/core/dev.c:4740) ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141) ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226) ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248) ip6_send_skb (net/ipv6/ip6_output.c:1983) rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918) __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1)) __x64_sys_sendto (net/socket.c:2231) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f68cffb2a06 Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08 RSP: 002b:00007ffefb7c53d0 EFLAGS: 00000202 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000564cd69f10a0 RCX: 00007f68cffb2a06 RDX: 0000000000000040 RSI: 0000564cd69f10a4 RDI: 0000000000000003 RBP: 00007ffefb7c53f0 R08: 0000564cd6a032ac R09: 000000000000001c R10: 0000000000000000 R11: 0000000000000202 R12: 0000564cd69f10a4 R13: 0000000000000040 R14: 00007ffefb7c66e0 R15: 0000564cd69f10a0 </TASK> Allocated by task 1543: kasan_save_stack (mm/kasan/common.c:48) kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1)) __kasan_slab_alloc (mm/kasan/common.c:319 mm/kasan/common.c:345) kmem_cache_alloc_node_noprof (./include/linux/kasan.h:250 mm/slub.c:4148 mm/slub.c:4197 mm/slub.c:4249) kmalloc_reserve (net/core/skbuff.c:581 (discriminator 88)) __alloc_skb (net/core/skbuff.c:669) __ip6_append_data (net/ipv6/ip6_output.c:1672 (discriminator 1)) ip6_append_data (net/ipv6/ip6_output.c:1859) rawv6_sendmsg (net/ipv6/raw.c:911) __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1)) __x64_sys_sendto (net/socket.c:2231) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Freed by task 1543: kasan_save_stack (mm/kasan/common.c:48) kasan_save_track (mm/kasan/common.c:60 (discriminator 1) mm/kasan/common.c:69 (discriminator 1)) kasan_save_free_info (mm/kasan/generic.c:579 (discriminator 1)) __kasan_slab_free (mm/kasan/common.c:271) kmem_cache_free (mm/slub.c:4643 (discriminator 3) mm/slub.c:4745 (discriminator 3)) pskb_expand_head (net/core/skbuff.c:2274) rpl_do_srh_inline.isra.0 (net/ipv6/rpl_iptunnel.c:158 (discriminator 1)) rpl_input (net/ipv6/rpl_iptunnel.c:201 net/ipv6/rpl_iptunnel.c:282) lwtunnel_input (net/core/lwtunnel.c:459) ipv6_rcv (./include/net/dst.h:471 (discriminator 1) ./include/net/dst.h:469 (discriminator 1) net/ipv6/ip6_input.c:79 (discriminator 1) ./include/linux/netfilter.h:317 (discriminator 1) ./include/linux/netfilter.h:311 (discriminator 1) net/ipv6/ip6_input.c:311 (discriminator 1)) __netif_receive_skb_one_core (net/core/dev.c:5967) process_backlog (./include/linux/rcupdate.h:869 net/core/dev.c:6440) __napi_poll.constprop.0 (net/core/dev.c:7452) net_rx_action (net/core/dev.c:7518 net/core/dev.c:7643) handle_softirqs (kernel/softirq.c:579) do_softirq (kernel/softirq.c:480 (discriminator 20)) __local_bh_enable_ip (kernel/softirq.c:407) __dev_queue_xmit (net/core/dev.c:4740) ip6_finish_output2 (./include/linux/netdevice.h:3358 ./include/net/neighbour.h:526 ./include/net/neighbour.h:540 net/ipv6/ip6_output.c:141) ip6_finish_output (net/ipv6/ip6_output.c:215 net/ipv6/ip6_output.c:226) ip6_output (./include/linux/netfilter.h:306 net/ipv6/ip6_output.c:248) ip6_send_skb (net/ipv6/ip6_output.c:1983) rawv6_sendmsg (net/ipv6/raw.c:588 net/ipv6/raw.c:918) __sys_sendto (net/socket.c:714 (discriminator 1) net/socket.c:729 (discriminator 1) net/socket.c:2228 (discriminator 1)) __x64_sys_sendto (net/socket.c:2231) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) The buggy address belongs to the object at ffff888122bf96c0 which belongs to the cache skbuff_small_head of size 704 The buggy address is located 24 bytes inside of freed 704-byte region [ffff888122bf96c0, ffff888122bf9980) The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x122bf8 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x200000000000040(head|node=0|zone=2) page_type: f5(slab) raw: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002 raw: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000 head: 0200000000000040 ffff888101fc0a00 ffffea000464dc00 0000000000000002 head: 0000000000000000 0000000080270027 00000000f5000000 0000000000000000 head: 0200000000000003 ffffea00048afe01 00000000ffffffff 00000000ffffffff head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888122bf9580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888122bf9600: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc >ffff888122bf9680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb ^ ffff888122bf9700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888122bf9780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: a7a29f9c361f8 ("net: ipv6: add rpl sr tunnel") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-7/+2
Cross-merge networking fixes after downstream PR (net-6.16-rc6-2). No conflicts. Adjacent changes: drivers/net/wireless/mediatek/mt76/mt7925/mcu.c c701574c5412 ("wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan") b3a431fe2e39 ("wifi: mt76: mt7925: fix off by one in mt7925_mcu_hw_scan()") drivers/net/wireless/mediatek/mt76/mt7996/mac.c 62da647a2b20 ("wifi: mt76: mt7996: Add MLO support to mt7996_tx_check_aggr()") dc66a129adf1 ("wifi: mt76: add a wrapper for wcid access with validation") drivers/net/wireless/mediatek/mt76/mt7996/main.c 3dd6f67c669c ("wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl()") 8989d8e90f5f ("wifi: mt76: mt7996: Do not set wcid.sta to 1 in mt7996_mac_sta_event()") net/mac80211/cfg.c 58fcb1b4287c ("wifi: mac80211: reject VHT opmode for unsupported channel widths") 037dc18ac3fb ("wifi: mac80211: add support for storing station S1G capabilities") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10gre: Fix IPv6 multicast route creation.Guillaume Nault1-7/+2
Use addrconf_add_dev() instead of ipv6_find_idev() in addrconf_gre_config() so that we don't just get the inet6_dev, but also install the default ff00::/8 multicast route. Before commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address generation."), the multicast route was created at the end of the function by addrconf_add_mroute(). But this code path is now only taken in one particular case (gre devices not bound to a local IP address and in EUI64 mode). For all other cases, the function exits early and addrconf_add_mroute() is not called anymore. Using addrconf_add_dev() instead of ipv6_find_idev() in addrconf_gre_config(), fixes the problem as it will create the default multicast route for all gre devices. This also brings addrconf_gre_config() a bit closer to the normal netdevice IPv6 configuration code (addrconf_dev_config()). Cc: stable@vger.kernel.org Fixes: 3e6a0243ff00 ("gre: Fix again IPv6 link-local address generation.") Reported-by: Aiden Yang <ling@moedove.com> Closes: https://lore.kernel.org/netdev/CANR=AhRM7YHHXVxJ4DmrTNMeuEOY87K2mLmo9KMed1JMr20p6g@mail.gmail.com/ Reviewed-by: Gary Guo <gary@garyguo.net> Tested-by: Gary Guo <gary@garyguo.net> Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/027a923dcb550ad115e6d93ee8bb7d310378bd01.1752070620.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10net: replace ND_PRINTK with dynamic debugWang Liang1-96/+61
ND_PRINTK with val > 1 only works when the ND_DEBUG was set in compilation phase. Replace it with dynamic debug. Convert ND_PRINTK with val <= 1 to net_{err,warn}_ratelimited, and convert the rest to net_dbg_ratelimited. Suggested-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250708033342.1627636-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: Remove setsockopt_needs_rtnl().Kuniyuki Iwashima1-11/+2
We no longer need to hold RTNL for IPv6 socket options. Let's remove setsockopt_needs_rtnl(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-16-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: anycast: Don't hold RTNL for IPV6_JOIN_ANYCAST.Kuniyuki Iwashima2-12/+14
inet6_sk(sk)->ipv6_ac_list is protected by lock_sock(). In ipv6_sock_ac_join(), only __dev_get_by_index(), __dev_get_by_flags(), and __in6_dev_get() require RTNL. __dev_get_by_flags() is only used by ipv6_sock_ac_join() and can be converted to RCU version. Let's replace RCU version helper and drop RTNL from IPV6_JOIN_ANYCAST. setsockopt_needs_rtnl() will be removed in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-15-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: anycast: Unify two error paths in ipv6_sock_ac_join().Kuniyuki Iwashima1-9/+16
The next patch will replace __dev_get_by_index() and __dev_get_by_flags() to RCU + refcount version. Then, we will need to call dev_put() in some error paths. Let's unify two error paths to make the next patch cleaner. Note that we add READ_ONCE() for net->ipv6.devconf_all->forwarding and idev->conf.forwarding as we will drop RTNL that protects them. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-14-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: anycast: Don't hold RTNL for IPV6_LEAVE_ANYCAST and IPV6_ADDRFORM.Kuniyuki Iwashima2-18/+20
inet6_sk(sk)->ipv6_ac_list is protected by lock_sock(). In ipv6_sock_ac_drop() and ipv6_sock_ac_close(), only __dev_get_by_index() and __in6_dev_get() requrie RTNL. Let's replace them with dev_get_by_index() and in6_dev_get() and drop RTNL from IPV6_LEAVE_ANYCAST and IPV6_ADDRFORM. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-13-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: anycast: Don't use rtnl_dereference().Kuniyuki Iwashima2-11/+8
inet6_dev->ac_list is protected by inet6_dev->lock, so rtnl_dereference() is a bit rough annotation. As done in mcast.c, we can use ac_dereference() that checks if inet6_dev->lock is held. Let's replace rtnl_dereference() with a new helper ac_dereference(). Note that now addrconf_join_solict() / addrconf_leave_solict() in __ipv6_dev_ac_inc() / __ipv6_dev_ac_dec() does not need RTNL, so we can remove ASSERT_RTNL() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-12-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Remove unnecessary ASSERT_RTNL and comment.Kuniyuki Iwashima1-6/+0
Now, RTNL is not needed for mcast code, and what's commented in ip6_mc_msfget() is apparent by for_each_pmc_socklock(), which has lockdep annotation for lock_sock(). Let's remove the comment and ASSERT_RTNL() in ipv6_mc_rejoin_groups(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-11-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Don't hold RTNL for MCAST_ socket options.Kuniyuki Iwashima2-34/+45
In ip6_mc_source() and ip6_mc_msfilter(), per-socket mld data is protected by lock_sock() and inet6_dev->mc_lock is also held for some per-interface functions. ip6_mc_find_dev_rtnl() only depends on RTNL. If we want to remove it, we need to check inet6_dev->dead under mc_lock to close the race with addrconf_ifdown(), as mentioned earlier. Let's do that and drop RTNL for the rest of MCAST_ socket options. Note that ip6_mc_msfilter() has unnecessary lock dances and they are integrated into one to avoid the last-minute error and simplify the error handling. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-10-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Don't hold RTNL in ipv6_sock_mc_close().Kuniyuki Iwashima1-21/+1
In __ipv6_sock_mc_close(), per-socket mld data is protected by lock_sock(), and only __dev_get_by_index() and __in6_dev_get() require RTNL. Let's call __ipv6_sock_mc_drop() and drop RTNL in ipv6_sock_mc_close(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-9-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Don't hold RTNL for IPV6_DROP_MEMBERSHIP and MCAST_LEAVE_GROUP.Kuniyuki Iwashima2-22/+27
In __ipv6_sock_mc_drop(), per-socket mld data is protected by lock_sock(), and only __dev_get_by_index() and __in6_dev_get() require RTNL. Let's use dev_get_by_index() and in6_dev_get() and drop RTNL for IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP. Note that __ipv6_sock_mc_drop() is factorised to reuse in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-8-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Don't hold RTNL for IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP.Kuniyuki Iwashima2-13/+13
In __ipv6_sock_mc_join(), per-socket mld data is protected by lock_sock(), and only __dev_get_by_index() requires RTNL. Let's use dev_get_by_index() and drop RTNL for IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP. Note that we must call rt6_lookup() and dev_hold() under RCU. If rt6_lookup() returns an entry from the exception table, dst_dev_put() could change rt->dev.dst to loopback concurrently, and the original device could lose the refcount before dev_hold() and unblock device registration. dst_dev_put() is called from NETDEV_UNREGISTER and synchronize_net() follows it, so as long as rt6_lookup() and dev_hold() are called within the same RCU critical section, the dev is alive. Even if the race happens, they are synchronised by idev->dead and mcast addresses are cleaned up. For the racy access to rt->dst.dev, we use dst_dev(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-7-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Use in6_dev_get() in ipv6_dev_mc_dec().Kuniyuki Iwashima2-10/+7
As well as __ipv6_dev_mc_inc(), all code in __ipv6_dev_mc_dec() are protected by inet6_dev->mc_lock, and RTNL is not needed. Let's use in6_dev_get() in ipv6_dev_mc_dec() and remove ASSERT_RTNL() in __ipv6_dev_mc_dec(). Now, we can remove the RTNL comment above addrconf_leave_solict() too. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-6-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Remove mca_get().Kuniyuki Iwashima1-8/+1
Since commit 63ed8de4be81 ("mld: add mc_lock for protecting per-interface mld data"), the newly allocated struct ifmcaddr6 cannot be removed until inet6_dev->mc_lock is released, so mca_get() and mc_put() are unnecessary. Let's remove the extra refcounting. Note that mca_get() was only used in __ipv6_dev_mc_inc(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-5-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Check inet6_dev->dead under idev->mc_lock in __ipv6_dev_mc_inc().Kuniyuki Iwashima2-10/+8
Since commit 63ed8de4be81 ("mld: add mc_lock for protecting per-interface mld data"), every multicast resource is protected by inet6_dev->mc_lock. RTNL is unnecessary in terms of protection but still needed for synchronisation between addrconf_ifdown() and __ipv6_dev_mc_inc(). Once we removed RTNL, there would be a race below, where we could add a multicast address to a dead inet6_dev. CPU1 CPU2 ==== ==== addrconf_ifdown() __ipv6_dev_mc_inc() if (idev->dead) <-- false dead = true return -ENODEV; ipv6_mc_destroy_dev() / ipv6_mc_down() mutex_lock(&idev->mc_lock) ... mutex_unlock(&idev->mc_lock) mutex_lock(&idev->mc_lock) ... mutex_unlock(&idev->mc_lock) The race window can be easily closed by checking inet6_dev->dead under inet6_dev->mc_lock in __ipv6_dev_mc_inc() as addrconf_ifdown() will acquire it after marking inet6_dev dead. Let's check inet6_dev->dead under mc_lock in __ipv6_dev_mc_inc(). Note that now __ipv6_dev_mc_inc() no longer depends on RTNL and we can remove ASSERT_RTNL() there and the RTNL comment above addrconf_join_solict(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-4-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Replace locking comments with lockdep annotations.Kuniyuki Iwashima1-54/+71
Commit 63ed8de4be81 ("mld: add mc_lock for protecting per-interface mld data") added the same comments regarding locking to many functions. Let's replace the comments with lockdep annotation, which is more helpful. Note that we just remove the comment for mld_clear_zeros() and mld_send_cr(), where mc_dereference() is used in the entry of the function. While at it, a comment for __ipv6_sock_mc_join() is moved back to the correct place. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-3-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: ndisc: Remove __in6_dev_get() in pndisc_{constructor,destructor}().Kuniyuki Iwashima1-6/+7
ipv6_dev_mc_{inc,dec}() has the same check. Let's remove __in6_dev_get() from pndisc_constructor() and pndisc_destructor(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-2-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: splice: Drop unused @gfpMichal Luczaj1-2/+1
Since its introduction in commit 2e910b95329c ("net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES"), skb_splice_from_iter() never used the @gfp argument. Remove it and adapt callers. No functional change intended. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-2-55f68b60d2b7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: replace ADDRLABEL with dynamic debugWang Liang1-21/+11
ADDRLABEL only works when it was set in compilation phase. Replace it with net_dbg_ratelimited(). Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702104417.1526138-1-wangliang74@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08Revert "xfrm: destroy xfrm_state synchronously on net exit path"Sabrina Dubroca1-1/+1
This reverts commit f75a2804da391571563c4b6b29e7797787332673. With all states (whether user or kern) removed from the hashtables during deletion, there's no need for synchronous destruction of states. xfrm6_tunnel states still need to have been destroyed (which will be the case when its last user is deleted (not destroyed)) so that xfrm6_tunnel_free_spi removes it from the per-netns hashtable before the netns is destroyed. This has the benefit of skipping one synchronize_rcu per state (in __xfrm_state_destroy(sync=true)) when we exit a netns. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-07-08xfrm: delete x->tunnel as we delete xSabrina Dubroca2-1/+3
The ipcomp fallback tunnels currently get deleted (from the various lists and hashtables) as the last user state that needed that fallback is destroyed (not deleted). If a reference to that user state still exists, the fallback state will remain on the hashtables/lists, triggering the WARN in xfrm_state_fini. Because of those remaining references, the fix in commit f75a2804da39 ("xfrm: destroy xfrm_state synchronously on net exit path") is not complete. We recently fixed one such situation in TCP due to defered freeing of skbs (commit 9b6412e6979f ("tcp: drop secpath at the same time as we currently drop dst")). This can also happen due to IP reassembly: skbs with a secpath remain on the reassembly queue until netns destruction. If we can't guarantee that the queues are flushed by the time xfrm_state_fini runs, there may still be references to a (user) xfrm_state, preventing the timely deletion of the corresponding fallback state. Instead of chasing each instance of skbs holding a secpath one by one, this patch fixes the issue directly within xfrm, by deleting the fallback state as soon as the last user state depending on it has been deleted. Destruction will still happen when the final reference is dropped. A separate lockdep class for the fallback state is required since we're going to lock x->tunnel while x is locked. Fixes: 9d4139c76905 ("netns xfrm: per-netns xfrm_state_all list") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-07-03ipv6: Cleanup fib6_drop_pcpu_from()Yue Haibing1-19/+7
Since commit 0e2338749192 ("ipv6: fix races in ip6_dst_destroy()"), 'table' is unused in __fib6_drop_pcpu_from(), no need pass it from fib6_drop_pcpu_from(). Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701041235.1333687-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-02net: ipv6: Fix spelling mistakeChenguang Zhao1-3/+3
change 'Maximium' to 'Maximum' Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20250702055820.112190-1-zhaochenguang@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: preserve MSG_ZEROCOPY with forwardingWillem de Bruijn1-0/+7
MSG_ZEROCOPY data must be copied before data is queued to local sockets, to avoid indefinite timeout until memory release. This test is performed by skb_orphan_frags_rx, which is called when looping an egress skb to packet sockets, error queue or ingress path. To preserve zerocopy for skbs that are looped to ingress but are then forwarded to an egress device rather than delivered locally, defer this last check until an skb enters the local IP receive path. This is analogous to existing behavior of skb_clear_delivery_time. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv6: ip6_mc_input() and ip6_mr_input() cleanupsEric Dumazet2-23/+15
Both functions are always called under RCU. We remove the extra rcu_read_lock()/rcu_read_unlock(). We use skb_dst_dev_net_rcu() instead of skb_dst_dev_net(). We use dev_net_rcu() instead of dev_net(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-11-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv6: adopt skb_dst_dev() and skb_dst_dev_net[_rcu]() helpersEric Dumazet14-35/+39
Use the new helpers as a step to deal with potential dst->dev races. v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv6: adopt dst_dev() helperEric Dumazet16-45/+58
Use the new helper as a step to deal with potential dst->dev races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->lastuseEric Dumazet1-1/+2
(dst_entry)->lastuse is read and written locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->expiresEric Dumazet1-7/+6
(dst_entry)->expires is read and written locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->obsoleteEric Dumazet2-6/+5
(dst_entry)->obsolete is read locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02udp: move udp_memory_allocated into net_aligned_dataEric Dumazet3-2/+3
____cacheline_aligned_in_smp attribute only makes sure to align a field to a cache line. It does not prevent the linker to use the remaining of the cache line for other variables, causing potential false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02tcp: move tcp_memory_allocated into net_aligned_dataEric Dumazet1-1/+2
____cacheline_aligned_in_smp attribute only makes sure to align a field to a cache line. It does not prevent the linker to use the remaining of the cache line for other variables, causing potential false sharing. Move tcp_memory_allocated into a dedicated cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02xfrm: Set transport header to fix UDP GRO handlingTobias Brunner1-0/+3
The referenced commit replaced a call to __xfrm4|6_udp_encap_rcv() with a custom check for non-ESP markers. But what the called function also did was setting the transport header to the ESP header. The function that follows, esp4|6_gro_receive(), relies on that being set when it calls xfrm_parse_spi(). We have to set the full offset as the skb's head was not moved yet so adding just the UDP header length won't work. Fixes: e3fd05777685 ("xfrm: Fix UDP GRO handling for some corner cases") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-07-01seg6: fix lenghts typo in a commentAndrea Mayer1-1/+1
Fix a typo: lenghts -> length The typo has been identified using codespell, and the tool currently does not report any additional issues in comments. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250629171226.4988-2-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01ip6_tunnel: enable to change proto of fb tunnelsNicolas Dichtel1-5/+36
This is possible via the ioctl API: > ip -6 tunnel change ip6tnl0 mode any Let's align the netlink API: > ip link set ip6tnl0 type ip6tnl mode any Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20250630145602.1027220-1-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-30ipv6: guard ip6_mr_output() with rcuEric Dumazet1-1/+1
syzbot found at least one path leads to an ip_mr_output() without RCU being held. Add guard(rcu)() to fix this in a concise way. WARNING: net/ipv6/ip6mr.c:2376 at ip6_mr_output+0xe0b/0x1040 net/ipv6/ip6mr.c:2376, CPU#1: kworker/1:2/121 Call Trace: <TASK> ip6tunnel_xmit include/net/ip6_tunnel.h:162 [inline] udp_tunnel6_xmit_skb+0x640/0xad0 net/ipv6/ip6_udp_tunnel.c:112 send6+0x5ac/0x8d0 drivers/net/wireguard/socket.c:152 wg_socket_send_skb_to_peer+0x111/0x1d0 drivers/net/wireguard/socket.c:178 wg_packet_create_data_done drivers/net/wireguard/send.c:251 [inline] wg_packet_tx_worker+0x1c8/0x7c0 drivers/net/wireguard/send.c:276 process_one_work kernel/workqueue.c:3239 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3322 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3403 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> Fixes: 96e8f5a9fe2d ("net: ipv6: Add ip6_mr_output()") Reported-by: syzbot+0141c834e47059395621@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/685e86b3.a00a0220.129264.0003.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Roopa Prabhu <roopa@nvidia.com> Cc: Benjamin Poirier <bpoirier@nvidia.com> Cc: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250627115822.3741390-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-27tcp: remove rtx_syn_ack fieldEric Dumazet1-1/+0
Now inet_rtx_syn_ack() is only used by TCP, it can directly call tcp_rtx_synack() instead of using an indirect call to req->rsk_ops->rtx_syn_ack(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250626153017.2156274-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-25net: Remove unnecessary NULL check for lwtunnel_fill_encap()Yue Haibing1-2/+1
lwtunnel_fill_encap() has NULL check and return 0, so no need to check before call it. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250624140015.3929241-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23net: remove sock_i_uid()Eric Dumazet2-3/+3
Difference between sock_i_uid() and sk_uid() is that after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID while sk_uid() returns the last cached sk->sk_uid value. None of sock_i_uid() callers care about this. Use sk_uid() which is much faster and inlined. Note that diag/dump users are calling sock_i_ino() and can not see the full benefit yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23net: annotate races around sk->sk_uidEric Dumazet9-12/+13
sk->sk_uid can be read while another thread changes its value in sockfs_setattr(). Add sk_uid(const struct sock *sk) helper to factorize the needed READ_ONCE() annotations, and add corresponding WRITE_ONCE() where needed. Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://patch.msgid.link/20250620133001.4090592-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+8
Cross-merge networking fixes after downstream PR (net-6.16-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19calipso: Fix null-ptr-deref in calipso_req_{set,del}attr().Kuniyuki Iwashima1-0/+8
syzkaller reported a null-ptr-deref in sock_omalloc() while allocating a CALIPSO option. [0] The NULL is of struct sock, which was fetched by sk_to_full_sk() in calipso_req_setattr(). Since commit a1a5344ddbe8 ("tcp: avoid two atomic ops for syncookies"), reqsk->rsk_listener could be NULL when SYN Cookie is returned to its client, as hinted by the leading SYN Cookie log. Here are 3 options to fix the bug: 1) Return 0 in calipso_req_setattr() 2) Return an error in calipso_req_setattr() 3) Alaways set rsk_listener 1) is no go as it bypasses LSM, but 2) effectively disables SYN Cookie for CALIPSO. 3) is also no go as there have been many efforts to reduce atomic ops and make TCP robust against DDoS. See also commit 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood"). As of the blamed commit, SYN Cookie already did not need refcounting, and no one has stumbled on the bug for 9 years, so no CALIPSO user will care about SYN Cookie. Let's return an error in calipso_req_setattr() and calipso_req_delattr() in the SYN Cookie case. This can be reproduced by [1] on Fedora and now connect() of nc times out. [0]: TCP: request_sock_TCPv6: Possible SYN flooding on port [::]:20002. Sending cookies. Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] CPU: 3 UID: 0 PID: 12262 Comm: syz.1.2611 Not tainted 6.14.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:read_pnet include/net/net_namespace.h:406 [inline] RIP: 0010:sock_net include/net/sock.h:655 [inline] RIP: 0010:sock_kmalloc+0x35/0x170 net/core/sock.c:2806 Code: 89 d5 41 54 55 89 f5 53 48 89 fb e8 25 e3 c6 fd e8 f0 91 e3 00 48 8d 7b 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 26 01 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b RSP: 0018:ffff88811af89038 EFLAGS: 00010216 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff888105266400 RDX: 0000000000000006 RSI: ffff88800c890000 RDI: 0000000000000030 RBP: 0000000000000050 R08: 0000000000000000 R09: ffff88810526640e R10: ffffed1020a4cc81 R11: ffff88810526640f R12: 0000000000000000 R13: 0000000000000820 R14: ffff888105266400 R15: 0000000000000050 FS: 00007f0653a07640(0000) GS:ffff88811af80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f863ba096f4 CR3: 00000000163c0005 CR4: 0000000000770ef0 PKRU: 80000000 Call Trace: <IRQ> ipv6_renew_options+0x279/0x950 net/ipv6/exthdrs.c:1288 calipso_req_setattr+0x181/0x340 net/ipv6/calipso.c:1204 calipso_req_setattr+0x56/0x80 net/netlabel/netlabel_calipso.c:597 netlbl_req_setattr+0x18a/0x440 net/netlabel/netlabel_kapi.c:1249 selinux_netlbl_inet_conn_request+0x1fb/0x320 security/selinux/netlabel.c:342 selinux_inet_conn_request+0x1eb/0x2c0 security/selinux/hooks.c:5551 security_inet_conn_request+0x50/0xa0 security/security.c:4945 tcp_v6_route_req+0x22c/0x550 net/ipv6/tcp_ipv6.c:825 tcp_conn_request+0xec8/0x2b70 net/ipv4/tcp_input.c:7275 tcp_v6_conn_request+0x1e3/0x440 net/ipv6/tcp_ipv6.c:1328 tcp_rcv_state_process+0xafa/0x52b0 net/ipv4/tcp_input.c:6781 tcp_v6_do_rcv+0x8a6/0x1a40 net/ipv6/tcp_ipv6.c:1667 tcp_v6_rcv+0x505e/0x5b50 net/ipv6/tcp_ipv6.c:1904 ip6_protocol_deliver_rcu+0x17c/0x1da0 net/ipv6/ip6_input.c:436 ip6_input_finish+0x103/0x180 net/ipv6/ip6_input.c:480 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip6_input+0x13c/0x6b0 net/ipv6/ip6_input.c:491 dst_input include/net/dst.h:469 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline] ip6_rcv_finish+0xb6/0x490 net/ipv6/ip6_input.c:69 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ipv6_rcv+0xf9/0x490 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core+0x12e/0x1f0 net/core/dev.c:5896 __netif_receive_skb+0x1d/0x170 net/core/dev.c:6009 process_backlog+0x41e/0x13b0 net/core/dev.c:6357 __napi_poll+0xbd/0x710 net/core/dev.c:7191 napi_poll net/core/dev.c:7260 [inline] net_rx_action+0x9de/0xde0 net/core/dev.c:7382 handle_softirqs+0x19a/0x770 kernel/softirq.c:561 do_softirq.part.0+0x36/0x70 kernel/softirq.c:462 </IRQ> <TASK> do_softirq arch/x86/include/asm/preempt.h:26 [inline] __local_bh_enable_ip+0xf1/0x110 kernel/softirq.c:389 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:919 [inline] __dev_queue_xmit+0xc2a/0x3c40 net/core/dev.c:4679 dev_queue_xmit include/linux/netdevice.h:3313 [inline] neigh_hh_output include/net/neighbour.h:523 [inline] neigh_output include/net/neighbour.h:537 [inline] ip6_finish_output2+0xd69/0x1f80 net/ipv6/ip6_output.c:141 __ip6_finish_output net/ipv6/ip6_output.c:215 [inline] ip6_finish_output+0x5dc/0xd60 net/ipv6/ip6_output.c:226 NF_HOOK_COND include/linux/netfilter.h:303 [inline] ip6_output+0x24b/0x8d0 net/ipv6/ip6_output.c:247 dst_output include/net/dst.h:459 [inline] NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip6_xmit+0xbbc/0x20d0 net/ipv6/ip6_output.c:366 inet6_csk_xmit+0x39a/0x720 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x1a7b/0x3b40 net/ipv4/tcp_output.c:1471 tcp_transmit_skb net/ipv4/tcp_output.c:1489 [inline] tcp_send_syn_data net/ipv4/tcp_output.c:4059 [inline] tcp_connect+0x1c0c/0x4510 net/ipv4/tcp_output.c:4148 tcp_v6_connect+0x156c/0x2080 net/ipv6/tcp_ipv6.c:333 __inet_stream_connect+0x3a7/0xed0 net/ipv4/af_inet.c:677 tcp_sendmsg_fastopen+0x3e2/0x710 net/ipv4/tcp.c:1039 tcp_sendmsg_locked+0x1e82/0x3570 net/ipv4/tcp.c:1091 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1358 inet6_sendmsg+0xb9/0x150 net/ipv6/af_inet6.c:659 sock_sendmsg_nosec net/socket.c:718 [inline] __sock_sendmsg+0xf4/0x2a0 net/socket.c:733 __sys_sendto+0x29a/0x390 net/socket.c:2187 __do_sys_sendto net/socket.c:2194 [inline] __se_sys_sendto net/socket.c:2190 [inline] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2190 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xc3/0x1d0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f06553c47ed Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f0653a06fc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007f0655605fa0 RCX: 00007f06553c47ed RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000b RBP: 00007f065545db38 R08: 0000200000000140 R09: 000000000000001c R10: f7384d4ea84b01bd R11: 0000000000000246 R12: 0000000000000000 R13: 00007f0655605fac R14: 00007f0655606038 R15: 00007f06539e7000 </TASK> Modules linked in: [1]: dnf install -y selinux-policy-targeted policycoreutils netlabel_tools procps-ng nmap-ncat mount -t selinuxfs none /sys/fs/selinux load_policy netlabelctl calipso add pass doi:1 netlabelctl map del default netlabelctl map add default address:::1 protocol:calipso,1 sysctl net.ipv4.tcp_syncookies=2 nc -l ::1 80 & nc ::1 80 Fixes: e1adea927080 ("calipso: Allow request sockets to be relabelled by the lsm.") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: John Cheung <john.cs.hey@gmail.com> Closes: https://lore.kernel.org/netdev/CAP=Rh=MvfhrGADy+-WJiftV2_WzMH4VEhEFmeT28qY+4yxNu4w@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20250617224125.17299-1-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19ipv6: Simplify link-local address generation for IPv6 GRE.Guillaume Nault1-5/+5
Since commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address generation."), addrconf_gre_config() has stopped handling IP6GRE devices specially and just calls the regular addrconf_addr_gen() function to create their link-local IPv6 addresses. We can thus avoid using addrconf_gre_config() for IP6GRE devices and use the normal IPv6 initialisation path instead (that is, jump directly to addrconf_dev_config() in addrconf_init_auto_addrs()). See commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address generation.") for a deeper explanation on how and why GRE devices started handling their IPv6 link-local address generation specially, why it was a problem, and why this is not even necessary in most cases (especially for GRE over IPv6). Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/a9144be9c7ec3cf09f25becae5e8fdf141fde9f6.1750075076.git.gnault@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-17net: ipv6: Add ip6_mr_output()Petr Machata2-0/+119
Multicast routing is today handled in the input path. Locally generated MC packets don't hit the IPMR code today. Thus if a VXLAN remote address is multicast, the driver needs to set an OIF during route lookup. Thus MC routing configuration needs to be kept in sync with the VXLAN FDB and MDB. Ideally, the VXLAN packets would be routed by the MC routing code instead. To that end, this patch adds support to route locally generated multicast packets. The newly-added routines do largely what ip6_mr_input() and ip6_mr_forward() do: make an MR cache lookup to find where to send the packets, and use ip6_output() to send each of them. When no cache entry is found, the packet is punted to the daemon for resolution. Similarly to the IPv4 case in a previous patch, the new logic is contingent on a newly-added IP6CB flag being set. Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/3bcc034a3ab4d3c291072fff38f78d7fbbeef4e6.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: ipv6: ip6mr: Split ip6mr_forward2() in twoPetr Machata1-7/+16
Some of the work of ip6mr_forward2() is specific to IPMR forwarding, and should not take place on the output path. In order to allow reuse of the common parts, extract out of the function a helper, ip6mr_prepare_forward(). Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/8932bd5c0fbe3f662b158803b8509604fa7bc113.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: ipv6: ip6mr: Make ip6mr_forward2() voidPetr Machata1-6/+6
Nobody uses the return value, so convert the function to void. Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/e0bee259da0da58da96647ea9e21e6360c8f7e11.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: ipv6: ip6mr: Fix in/out netdev to pass to the FORWARD chainPetr Machata1-1/+2
The netfilter hook is invoked with skb->dev for input netdevice, and vif_dev for output netdevice. However at the point of invocation, skb->dev is already set to vif_dev, and MR-forwarded packets are reported with in=out: # ip6tables -A FORWARD -j LOG --log-prefix '[forw]' # cd tools/testing/selftests/net/forwarding # ./router_multicast.sh # dmesg | fgrep '[forw]' [ 1670.248245] [forw]IN=v5 OUT=v5 [...] For reference, IPv4 MR code shows in and out as appropriate. Fix by caching skb->dev and using the updated value for output netdev. Fixes: 7bc570c8b4f7 ("[IPV6] MROUTE: Support multicast forwarding.") Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/3141ae8386fbe13fef4b793faa75e6bae58d798a.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: ipv6: Add a flags argument to ip6tunnel_xmit(), udp_tunnel6_xmit_skb()Petr Machata2-3/+4
ip6tunnel_xmit() erases the contents of the SKB control block. In order to be able to set particular IP6CB flags on the SKB, add a corresponding parameter, and propagate it to udp_tunnel6_xmit_skb() as well. In one of the following patches, VXLAN driver will use this facility to mark packets as subject to IPv6 multicast routing. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/acb4f9f3e40c3a931236c3af08a720b017fbfbfb.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: ipv6: Make udp_tunnel6_xmit_skb() voidPetr Machata1-8/+7
The function always returns zero, thus the return value does not carry any signal. Just make it void. Most callers already ignore the return value. However: - Refold arguments of the call from sctp_v6_xmit() so that they fit into the 80-column limit. - tipc_udp_xmit() initializes err from the return value, but that should already be always zero at that point. So there's no practical change, but elision of the assignment prompts a couple more tweaks to clean up the function. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/7facacf9d8ca3ca9391a4aee88160913671b868d.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17net: ipv4: Add a flags argument to iptunnel_xmit(), udp_tunnel_xmit_skb()Petr Machata1-1/+1
iptunnel_xmit() erases the contents of the SKB control block. In order to be able to set particular IPCB flags on the SKB, add a corresponding parameter, and propagate it to udp_tunnel_xmit_skb() as well. In one of the following patches, VXLAN driver will use this facility to mark packets as subject to IP multicast routing. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/89c9daf9f2dc088b6b92ccebcc929f51742de91f.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16seg6: Allow End.X behavior to accept an oifIdo Schimmel1-2/+3
Extend the End.X behavior to accept an output interface as an optional attribute and make use of it when resolving a route. This is needed when user space wants to use a link-local address as the nexthop address. Before: # ip route add 2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6 # ip route add 2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6 $ ip -6 route show 2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 dev sr6 metric 1024 pref medium 2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6 metric 1024 pref medium After: # ip route add 2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6 # ip route add 2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6 $ ip -6 route show 2001:db8:1::/64 encap seg6local action End.X nh6 fe80::1 oif eth0 dev sr6 metric 1024 pref medium 2001:db8:2::/64 encap seg6local action End.X nh6 2001:db8:10::1 dev sr6 metric 1024 pref medium Note that the oif attribute is not dumped to user space when it was not specified (as an oif of 0) since each entry keeps track of the optional attributes that it parsed during configuration (see struct seg6_local_lwt::parsed_optattrs). Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Link: https://patch.msgid.link/20250612122323.584113-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16seg6: Call seg6_lookup_any_nexthop() from End.X behaviorIdo Schimmel1-1/+1
seg6_lookup_nexthop() is a wrapper around seg6_lookup_any_nexthop(). Change End.X behavior to invoke seg6_lookup_any_nexthop() directly so that we would not need to expose the new output interface argument outside of the seg6local module. No functional changes intended. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Link: https://patch.msgid.link/20250612122323.584113-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-16seg6: Extend seg6_lookup_any_nexthop() with an oif argumentIdo Schimmel1-7/+10
seg6_lookup_any_nexthop() is called by the different endpoint behaviors (e.g., End, End.X) to resolve an IPv6 route. Extend the function with an output interface argument so that it could be used to resolve a route with a certain output interface. This will be used by subsequent patches that will extend the End.X behavior with an output interface as an optional argument. ip6_route_input_lookup() cannot be used when an output interface is specified as it ignores this parameter. Similarly, calling ip6_pol_route() when a table ID was not specified (e.g., End.X behavior) is wrong. Therefore, when an output interface is specified without a table ID, resolve the route using ip6_route_output() which will take the output interface into account. Note that no endpoint behavior currently passes both a table ID and an output interface, so the oif argument passed to ip6_pol_route() is always zero and there are no functional changes in this regard. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Link: https://patch.msgid.link/20250612122323.584113-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12Merge tag 'net-6.16-rc2' of ↵Linus Torvalds1-55/+55
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth and wireless. Current release - regressions: - af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD Current release - new code bugs: - eth: airoha: correct enable mask for RX queues 16-31 - veth: prevent NULL pointer dereference in veth_xdp_rcv when peer disappears under traffic - ipv6: move fib6_config_validate() to ip6_route_add(), prevent invalid routes Previous releases - regressions: - phy: phy_caps: don't skip better duplex match on non-exact match - dsa: b53: fix untagged traffic sent via cpu tagged with VID 0 - Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused transient packet loss, exact reason not fully understood, yet Previous releases - always broken: - net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6) - sched: sfq: fix a potential crash on gso_skb handling - Bluetooth: intel: improve rx buffer posting to avoid causing issues in the firmware - eth: intel: i40e: make reset handling robust against multiple requests - eth: mlx5: ensure FW pages are always allocated on the local NUMA node, even when device is configure to 'serve' another node - wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850, prevent kernel crashes - wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request() for 3 sec if fw_stats_done is not set" * tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits) selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context net: ethtool: Don't check if RSS context exists in case of context 0 af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD. ipv6: Move fib6_config_validate() to ip6_route_add(). net: drv: netdevsim: don't napi_complete() from netpoll net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get() veth: prevent NULL pointer dereference in veth_xdp_rcv net_sched: remove qdisc_tree_flush_backlog() net_sched: ets: fix a race in ets_qdisc_change() net_sched: tbf: fix a race in tbf_change() net_sched: red: fix a race in __red_change() net_sched: prio: fix a race in prio_tune() net_sched: sch_sfq: reject invalid perturb period net: phy: phy_caps: Don't skip better duplex macth on non-exact match MAINTAINERS: Update Kuniyuki Iwashima's email address. selftests: net: add test case for NAT46 looping back dst net: clear the dst when changing skb protocol net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper net/mlx5e: Fix leak of Geneve TLV option object net/mlx5: HWS, make sure the uplink is the last destination ...
2025-06-12ipv6: Move fib6_config_validate() to ip6_route_add().Kuniyuki Iwashima1-55/+55
syzkaller created an IPv6 route from a malformed packet, which has a prefix len > 128, triggering the splat below. [0] This is a similar issue fixed by commit 586ceac9acb7 ("ipv6: Restore fib6_config validation for SIOCADDRT."). The cited commit removed fib6_config validation from some callers of ip6_add_route(). Let's move the validation back to ip6_route_add() and ip6_route_multipath_add(). [0]: UBSAN: array-index-out-of-bounds in ./include/net/ipv6.h:616:34 index 20 is out of range for type '__u8 [16]' CPU: 1 UID: 0 PID: 7444 Comm: syz.0.708 Not tainted 6.16.0-rc1-syzkaller-g19272b37aa4f #0 PREEMPT Hardware name: riscv-virtio,qemu (DT) Call Trace: [<ffffffff80078a80>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:132 [<ffffffff8000327a>] show_stack+0x30/0x3c arch/riscv/kernel/stacktrace.c:138 [<ffffffff80061012>] __dump_stack lib/dump_stack.c:94 [inline] [<ffffffff80061012>] dump_stack_lvl+0x12e/0x1a6 lib/dump_stack.c:120 [<ffffffff800610a6>] dump_stack+0x1c/0x24 lib/dump_stack.c:129 [<ffffffff8001c0ea>] ubsan_epilogue+0x14/0x46 lib/ubsan.c:233 [<ffffffff819ba290>] __ubsan_handle_out_of_bounds+0xf6/0xf8 lib/ubsan.c:455 [<ffffffff85b363a4>] ipv6_addr_prefix include/net/ipv6.h:616 [inline] [<ffffffff85b363a4>] ip6_route_info_create+0x8f8/0x96e net/ipv6/route.c:3793 [<ffffffff85b635da>] ip6_route_add+0x2a/0x1aa net/ipv6/route.c:3889 [<ffffffff85b02e08>] addrconf_prefix_route+0x2c4/0x4e8 net/ipv6/addrconf.c:2487 [<ffffffff85b23bb2>] addrconf_prefix_rcv+0x1720/0x1e62 net/ipv6/addrconf.c:2878 [<ffffffff85b92664>] ndisc_router_discovery+0x1a06/0x3504 net/ipv6/ndisc.c:1570 [<ffffffff85b99038>] ndisc_rcv+0x500/0x600 net/ipv6/ndisc.c:1874 [<ffffffff85bc2c18>] icmpv6_rcv+0x145e/0x1e0a net/ipv6/icmp.c:988 [<ffffffff85af6798>] ip6_protocol_deliver_rcu+0x18a/0x1976 net/ipv6/ip6_input.c:436 [<ffffffff85af8078>] ip6_input_finish+0xf4/0x174 net/ipv6/ip6_input.c:480 [<ffffffff85af8262>] NF_HOOK include/linux/netfilter.h:317 [inline] [<ffffffff85af8262>] NF_HOOK include/linux/netfilter.h:311 [inline] [<ffffffff85af8262>] ip6_input+0x16a/0x70c net/ipv6/ip6_input.c:491 [<ffffffff85af8dcc>] ip6_mc_input+0x5c8/0x1268 net/ipv6/ip6_input.c:588 [<ffffffff85af6112>] dst_input include/net/dst.h:469 [inline] [<ffffffff85af6112>] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline] [<ffffffff85af6112>] NF_HOOK include/linux/netfilter.h:317 [inline] [<ffffffff85af6112>] NF_HOOK include/linux/netfilter.h:311 [inline] [<ffffffff85af6112>] ipv6_rcv+0x5ae/0x6e0 net/ipv6/ip6_input.c:309 [<ffffffff85087e84>] __netif_receive_skb_one_core+0x106/0x16e net/core/dev.c:5977 [<ffffffff85088104>] __netif_receive_skb+0x2c/0x144 net/core/dev.c:6090 [<ffffffff850883c6>] netif_receive_skb_internal net/core/dev.c:6176 [inline] [<ffffffff850883c6>] netif_receive_skb+0x1aa/0xbf2 net/core/dev.c:6235 [<ffffffff8328656e>] tun_rx_batched.isra.0+0x430/0x686 drivers/net/tun.c:1485 [<ffffffff8329ed3a>] tun_get_user+0x2952/0x3d6c drivers/net/tun.c:1938 [<ffffffff832a21e0>] tun_chr_write_iter+0xc4/0x21c drivers/net/tun.c:1984 [<ffffffff80b9b9ae>] new_sync_write fs/read_write.c:593 [inline] [<ffffffff80b9b9ae>] vfs_write+0x56c/0xa9a fs/read_write.c:686 [<ffffffff80b9c2be>] ksys_write+0x126/0x228 fs/read_write.c:738 [<ffffffff80b9c42e>] __do_sys_write fs/read_write.c:749 [inline] [<ffffffff80b9c42e>] __se_sys_write fs/read_write.c:746 [inline] [<ffffffff80b9c42e>] __riscv_sys_write+0x6e/0x94 fs/read_write.c:746 [<ffffffff80076912>] syscall_handler+0x94/0x118 arch/riscv/include/asm/syscall.h:112 [<ffffffff8637e31e>] do_trap_ecall_u+0x396/0x530 arch/riscv/kernel/traps.c:341 [<ffffffff863a69e2>] handle_exception+0x146/0x152 arch/riscv/kernel/entry.S:197 Fixes: fa76c1674f2e ("ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().") Reported-by: syzbot+4c2358694722d304c44e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6849b8c3.a00a0220.1eb5f5.00f0.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250611193551.2999991-1-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-08treewide, timers: Rename from_timer() to timer_container_of()timers-cleanups-2025-06-08Ingo Molnar5-5/+5
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-05seg6: Fix validation of nexthop addressesIdo Schimmel1-4/+2
The kernel currently validates that the length of the provided nexthop address does not exceed the specified length. This can lead to the kernel reading uninitialized memory if user space provided a shorter length than the specified one. Fix by validating that the provided length exactly matches the specified one. Fixes: d1df6fd8a1d2 ("ipv6: sr: define core operations for seg6local lightweight tunnel") Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250604113252.371528-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-30net: Fix checksum update for ILA adj-transportPaul Chaignon1-3/+3
During ILA address translations, the L4 checksums can be handled in different ways. One of them, adj-transport, consist in parsing the transport layer and updating any found checksum. This logic relies on inet_proto_csum_replace_by_diff and produces an incorrect skb->csum when in state CHECKSUM_COMPLETE. This bug can be reproduced with a simple ILA to SIR mapping, assuming packets are received with CHECKSUM_COMPLETE: $ ip a show dev eth0 14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 62:ae:35:9e:0f:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet6 3333:0:0:1::c078/64 scope global valid_lft forever preferred_lft forever inet6 fd00:10:244:1::c078/128 scope global nodad valid_lft forever preferred_lft forever inet6 fe80::60ae:35ff:fe9e:f8d/64 scope link proto kernel_ll valid_lft forever preferred_lft forever $ ip ila add loc_match fd00:10:244:1 loc 3333:0:0:1 \ csum-mode adj-transport ident-type luid dev eth0 Then I hit [fd00:10:244:1::c078]:8000 with a server listening only on [3333:0:0:1::c078]:8000. With the bug, the SYN packet is dropped with SKB_DROP_REASON_TCP_CSUM after inet_proto_csum_replace_by_diff changed skb->csum. The translation and drop are visible on pwru [1] traces: IFACE TUPLE FUNC eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) ipv6_rcv eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) ip6_rcv_core eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) nf_hook_slow eth0:9 [fd00:10:244:3::3d8]:51420->[fd00:10:244:1::c078]:8000(tcp) inet_proto_csum_replace_by_diff eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) tcp_v6_early_demux eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_route_input eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_input eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_input_finish eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ip6_protocol_deliver_rcu eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) raw6_local_deliver eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) ipv6_raw_deliver eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) tcp_v6_rcv eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) __skb_checksum_complete eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) kfree_skb_reason(SKB_DROP_REASON_TCP_CSUM) eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_release_head_state eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_release_data eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) skb_free_head eth0:9 [fd00:10:244:3::3d8]:51420->[3333:0:0:1::c078]:8000(tcp) kfree_skbmem This is happening because inet_proto_csum_replace_by_diff is updating skb->csum when it shouldn't. The L4 checksum is updated such that it "cancels" the IPv6 address change in terms of checksum computation, so the impact on skb->csum is null. Note this would be different for an IPv4 packet since three fields would be updated: the IPv4 address, the IP checksum, and the L4 checksum. Two would cancel each other and skb->csum would still need to be updated to take the L4 checksum change into account. This patch fixes it by passing an ipv6 flag to inet_proto_csum_replace_by_diff, to skip the skb->csum update if we're in the IPv6 case. Note the behavior of the only other user of inet_proto_csum_replace_by_diff, the BPF subsystem, is left as is in this patch and fixed in the subsequent patch. With the fix, using the reproduction from above, I can confirm skb->csum is not touched by inet_proto_csum_replace_by_diff and the TCP SYN proceeds to the application after the ILA translation. Link: https://github.com/cilium/pwru [1] Fixes: 65d7ab8de582 ("net: Identifier Locator Addressing module") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/b5539869e3550d46068504feb02d37653d939c0b.1748509484.git.paul.chaignon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-26Merge tag 'nf-next-25-05-23' of ↵Paolo Abeni3-11/+14
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next, specifically 26 patches: 5 patches adding/updating selftests, 4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables): 1) Improve selftest coverage for pipapo 4 bit group format, from Florian Westphal. 2) Fix incorrect dependencies when compiling a kernel without legacy ip{6}tables support, also from Florian. 3) Two patches to fix nft_fib vrf issues, including selftest updates to improve coverage, also from Florian Westphal. 4) Fix incorrect nesting in nft_tunnel's GENEVE support, from Fernando F. Mancera. 5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure and nft_inner to match in inner headers, from Sebastian Andrzej Siewior. 6) Integrate conntrack information into nft trace infrastructure, from Florian Westphal. 7) A series of 13 patches to allow to specify wildcard netdevice in netdev basechain and flowtables, eg. table netdev filter { chain ingress { type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept; } } This also allows for runtime hook registration on NETDEV_{UN}REGISTER event, from Phil Sutter. netfilter pull request 25-05-23 * tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits) selftests: netfilter: Torture nftables netdev hooks netfilter: nf_tables: Add notifications for hook changes netfilter: nf_tables: Support wildcard netdev hook specs netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc() netfilter: nf_tables: Handle NETDEV_CHANGENAME events netfilter: nf_tables: Wrap netdev notifiers netfilter: nf_tables: Respect NETDEV_REGISTER events netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook() netfilter: nf_tables: Introduce nft_register_flowtable_ops() netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}() netfilter: nf_tables: Introduce functions freeing nft_hook objects netfilter: nf_tables: add packets conntrack state to debug trace info netfilter: conntrack: make nf_conntrack_id callable without a module dependency netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx netfilter: nf_dup{4, 6}: Move duplication check to task_struct netfilter: nft_tunnel: fix geneve_opt dump selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs ... ==================== Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-23netfilter: nf_dup{4, 6}: Move duplication check to task_structSebastian Andrzej Siewior2-4/+4
nf_skb_duplicated is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Due to the recursion involved, the simplest change is to make it a per-task variable. Move the per-CPU variable nf_skb_duplicated to task_struct and name it in_nf_duplicate. Add it to the existing bitfield so it doesn't use additional memory. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: nft_fib: consistent l3mdev handlingFlorian Westphal1-3/+1
fib has two modes: 1. Obtain output device according to source or destination address 2. Obtain the type of the address, e.g. local, unicast, multicast. 'fib daddr type' should return 'local' if the address is configured in this netns or unicast otherwise. 'fib daddr . iif type' should return 'local' if the address is configured on the input interface or unicast otherwise, i.e. more restrictive. However, if the interface is part of a VRF, then 'fib daddr type' returns unicast even if the address is configured on the incoming interface. This is broken for both ipv4 and ipv6. In the ipv4 case, inet_dev_addr_type must only be used if the 'iif' or 'oif' (strict mode) was requested. Else inet_addr_type_dev_table() needs to be used and the correct dev argument must be passed as well so the correct fib (vrf) table is used. In the ipv6 case, the bug is similar, without strict mode, dev is NULL so .flowi6_l3mdev will be set to 0. Add a new 'nft_fib_l3mdev_master_ifindex_rcu()' helper and use that to init the .l3mdev structure member. For ipv6, use it from nft_fib6_flowi_init() which gets called from both the 'type' and the 'route' mode eval functions. This provides consistent behaviour for all modes for both ipv4 and ipv6: If strict matching is requested, the input respectively output device of the netfilter hooks is used. Otherwise, use skb->dev to obtain the l3mdev ifindex. Without this, most type checks in updated nft_fib.sh selftest fail: FAIL: did not find veth0 . 10.9.9.1 . local in fibtype4 FAIL: did not find veth0 . dead:1::1 . local in fibtype6 FAIL: did not find veth0 . dead:9::1 . local in fibtype6 FAIL: did not find tvrf . 10.0.1.1 . local in fibtype4 FAIL: did not find tvrf . 10.9.9.1 . local in fibtype4 FAIL: did not find tvrf . dead:1::1 . local in fibtype6 FAIL: did not find tvrf . dead:9::1 . local in fibtype6 FAIL: fib expression address types match (iif in vrf) (fib errounously returns 'unicast' for all of them, even though all of these addresses are local to the vrf). Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-65/+18
Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") 4bcc063939a5 ("ice, irdma: fix an off by one in error handling code") c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22netfilter: nf_tables: nft_fib_ipv6: fix VRF ipv4/ipv6 result discrepancyFlorian Westphal1-4/+9
With a VRF, ipv4 and ipv6 FIB expression behave differently. fib daddr . iif oif Will return the input interface name for ipv4, but the real device for ipv6. Example: If VRF device name is tvrf and real (incoming) device is veth0. First round is ok, both ipv4 and ipv6 will yield 'veth0'. But in the second round (incoming device will be set to "tvrf"), ipv4 will yield "tvrf" whereas ipv6 returns "veth0" for the second round too. This makes ipv6 behave like ipv4. A followup patch will add a test case for this, without this change it will fail with: get element inet t fibif6iif { tvrf . dead:1::99 . tvrf } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ FAIL: did not find tvrf . dead:1::99 . tvrf in fibif6iif Alternatively we could either not do anything at all or change ipv4 to also return the lower/real device, however, nft (userspace) doc says "iif: if fib lookup provides a route then check its output interface is identical to the packets input interface." which is what the nft fib ipv4 behaviour is. Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22Merge tag 'ipsec-2025-05-21' of ↵Paolo Abeni2-54/+17
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2025-05-21 1) Fix some missing kfree_skb in the error paths of espintcp. From Sabrina Dubroca. 2) Fix a reference leak in espintcp. From Sabrina Dubroca. 3) Fix UDP GRO handling for ESPINUDP. From Tobias Brunner. 4) Fix ipcomp truesize computation on the receive path. From Sabrina Dubroca. 5) Sanitize marks before policy/state insertation. From Paul Chaignon. * tag 'ipsec-2025-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Sanitize marks before insert xfrm: ipcomp: fix truesize computation on receive xfrm: Fix UDP GRO handling for some corner cases espintcp: remove encap socket caching to avoid reference leak espintcp: fix skb leaks ==================== Link: https://patch.msgid.link/20250521054348.4057269-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-20ipv6: Revert two per-cpu var allocation for RTM_NEWROUTE.Kuniyuki Iwashima1-31/+3
These two commits preallocated two per-cpu variables in ip6_route_info_create() as fib_nh_common_init() and fib6_nh_init() were expected to be called under RCU. * commit d27b9c40dbd6 ("ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().") * commit 5720a328c3e9 ("ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().") Now these functions can be called without RCU and can use GFP_KERNEL. Let's revert the commits. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20ipv6: Pass gfp_flags down to ip6_route_info_create_nh().Kuniyuki Iwashima1-4/+5
Since commit c4837b9853e5 ("ipv6: Split ip6_route_info_create()."), ip6_route_info_create_nh() uses GFP_ATOMIC as it was expected to be called under RCU. Now, we can call it without RCU and use GFP_KERNEL. Let's pass gfp_flags to ip6_route_info_create_nh(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20Revert "ipv6: Factorise ip6_route_multipath_add()."Kuniyuki Iwashima1-123/+70
Commit 71c0efb6d12f ("ipv6: Factorise ip6_route_multipath_add().") split a loop in ip6_route_multipath_add() so that we can put rcu_read_lock() between ip6_route_info_create() and ip6_route_info_create_nh(). We no longer need to do so as ip6_route_info_create_nh() does not require RCU now. Let's revert the commit to simplify the code. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20Revert "ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during ↵Kuniyuki Iwashima1-3/+3
seg6local LWT setup" The previous patch fixed the same issue mentioned in commit 14a0087e7236 ("ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup"). Let's revert it. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Link: https://patch.msgid.link/20250516022759.44392-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20ipv6: Narrow down RCU critical section in inet6_rtm_newroute().Kuniyuki Iwashima2-15/+25
Commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.") added rcu_read_lock() covering ip6_route_info_create_nh() and __ip6_ins_rt() to guarantee that nexthop and netdev will not go away. However, as reported by syzkaller [0], ip_tun_build_state() calls dst_cache_init() with GFP_KERNEL during the RCU critical section. ip6_route_info_create_nh() fetches nexthop or netdev depending on whether RTA_NH_ID is set, and struct fib6_info holds a refcount of either of them by nexthop_get() or netdev_get_by_index(). netdev_get_by_index() looks up a dev and calls dev_hold() under RCU. So, we need RCU only around nexthop_find_by_id() and nexthop_get() ( and a few more nexthop code). Let's add rcu_read_lock() there and remove rcu_read_lock() in ip6_route_add() and ip6_route_multipath_add(). Now these functions called from fib6_add() need RCU: - inet6_rt_notify() - fib6_drop_pcpu_from() (via fib6_purge_rt()) - rt6_flush_exceptions() (via fib6_purge_rt()) - ip6_ignore_linkdown() (via rt6_multipath_rebalance()) All callers of inet6_rt_notify() need RCU, so rcu_read_lock() is added there. [0]: [ BUG: Invalid wait context ] 6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 Tainted: G W ---------------------------- syz-executor234/5832 is trying to lock: ffffffff8e021688 (pcpu_alloc_mutex){+.+.}-{4:4}, at: pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782 other info that might help us debug this: context-{5:5} 1 lock held by syz-executor234/5832: 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:331 [inline] 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:841 [inline] 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: ip6_route_add+0x4d/0x2f0 net/ipv6/route.c:3913 stack backtrace: CPU: 0 UID: 0 PID: 5832 Comm: syz-executor234 Tainted: G W 6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 PREEMPT(full) Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/29/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_lock_invalid_wait_context kernel/locking/lockdep.c:4831 [inline] check_wait_context kernel/locking/lockdep.c:4903 [inline] __lock_acquire+0xbcf/0xd20 kernel/locking/lockdep.c:5185 lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866 __mutex_lock_common kernel/locking/mutex.c:601 [inline] __mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746 pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782 dst_cache_init+0x37/0xc0 net/core/dst_cache.c:145 ip_tun_build_state+0x193/0x6b0 net/ipv4/ip_tunnel_core.c:687 lwtunnel_build_state+0x381/0x4c0 net/core/lwtunnel.c:137 fib_nh_common_init+0x129/0x460 net/ipv4/fib_semantics.c:635 fib6_nh_init+0x15e4/0x2030 net/ipv6/route.c:3669 ip6_route_info_create_nh+0x139/0x870 net/ipv6/route.c:3866 ip6_route_add+0xf6/0x2f0 net/ipv6/route.c:3915 inet6_rtm_newroute+0x284/0x1c50 net/ipv6/route.c:5732 rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955 netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x758/0x8d0 net/netlink/af_netlink.c:1339 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:712 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:727 ____sys_sendmsg+0x505/0x830 net/socket.c:2566 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620 __sys_sendmsg net/socket.c:2652 [inline] __do_sys_sendmsg net/socket.c:2657 [inline] __se_sys_sendmsg net/socket.c:2655 [inline] __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94 Fixes: 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.") Reported-by: syzbot+bcc12d6799364500fbec@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bcc12d6799364500fbec Reported-by: Eric Dumazet <edumazet@google.com> Closes: https://lore.kernel.org/netdev/CANn89i+r1cGacVC_6n3-A-WSkAa_Nr+pmxJ7Gt+oP-P9by2aGw@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20inet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?().Kuniyuki Iwashima1-4/+2
Commit f130a0cc1b4f ("inet: fix lwtunnel_valid_encap_type() lock imbalance") added the rtnl_is_held argument as a temporary fix while I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU. Now all callers of lwtunnel_valid_encap_type() do not hold RTNL. Let's remove the argument. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20ipv6: Remove rcu_read_lock() in fib6_get_table().Kuniyuki Iwashima1-12/+10
Once allocated, the IPv6 routing table is not freed until netns is dismantled. fib6_get_table() uses rcu_read_lock() while iterating net->ipv6.fib_table_hash[], but it's not needed and rather confusing. Because some callers have this pattern, table = fib6_get_table(); rcu_read_lock(); /* ... use table here ... */ rcu_read_unlock(); [ See: addrconf_get_prefix_route(), ip6_route_del(), rt6_get_route_info(), rt6_get_dflt_router() ] and this looks illegal but is actually safe. Let's remove rcu_read_lock() in fib6_get_table() and pass true to the last argument of hlist_for_each_entry_rcu() to bypass the RCU check. Note that protection is not needed but RCU helper is used to avoid data-race. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16mr: consolidate the ipmr_can_free_table() checks.Paolo Abeni1-11/+1
Guoyu Yin reported a splat in the ipmr netns cleanup path: WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_free_table net/ipv4/ipmr.c:440 [inline] WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361 Modules linked in: CPU: 2 UID: 0 PID: 14564 Comm: syz.4.838 Not tainted 6.14.0 #1 Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:ipmr_free_table net/ipv4/ipmr.c:440 [inline] RIP: 0010:ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361 Code: ff df 48 c1 ea 03 80 3c 02 00 75 7d 48 c7 83 60 05 00 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e e9 71 67 7f 00 e8 4c 2d 8a fd 90 <0f> 0b 90 eb 93 e8 41 2d 8a fd 0f b6 2d 80 54 ea 01 31 ff 89 ee e8 RSP: 0018:ffff888109547c58 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff888108c12dc0 RCX: ffffffff83e09868 RDX: ffff8881022b3300 RSI: ffffffff83e098d4 RDI: 0000000000000005 RBP: ffff888104288000 R08: 0000000000000000 R09: ffffed10211825c9 R10: 0000000000000001 R11: ffff88801816c4a0 R12: 0000000000000001 R13: ffff888108c13320 R14: ffff888108c12dc0 R15: fffffbfff0b74058 FS: 00007f84f39316c0(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f84f3930f98 CR3: 0000000113b56000 CR4: 0000000000350ef0 Call Trace: <TASK> ipmr_net_exit_batch+0x50/0x90 net/ipv4/ipmr.c:3160 ops_exit_list+0x10c/0x160 net/core/net_namespace.c:177 setup_net+0x47d/0x8e0 net/core/net_namespace.c:394 copy_net_ns+0x25d/0x410 net/core/net_namespace.c:516 create_new_namespaces+0x3f6/0xaf0 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc3/0x180 kernel/nsproxy.c:228 ksys_unshare+0x78d/0x9a0 kernel/fork.c:3342 __do_sys_unshare kernel/fork.c:3413 [inline] __se_sys_unshare kernel/fork.c:3411 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:3411 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xa6/0x1a0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f84f532cc29 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f84f3931038 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f84f5615fa0 RCX: 00007f84f532cc29 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000400 RBP: 00007f84f53fba18 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007f84f5615fa0 R15: 00007fff51c5f328 </TASK> The running kernel has CONFIG_IP_MROUTE_MULTIPLE_TABLES disabled, and the sanity check for such build is still too loose. Address the issue consolidating the relevant sanity check in a single helper regardless of the kernel configuration. Also share it between the ipv4 and ipv6 code. Reported-by: Guoyu Yin <y04609127@gmail.com> Fixes: 50b94204446e ("ipmr: tune the ipmr_can_free_table() checks.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/372dc261e1bf12742276e1b984fc5a071b7fc5a8.1747321903.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15ipv6: sr: Use nested-BH locking for hmac_storageSebastian Andrzej Siewior1-2/+11
hmac_storage is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT. Cc: David Ahern <dsahern@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250512092736.229935-5-bigeasy@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13net: devmem: Implement TX pathMina Almasry1-1/+2
Augment dmabuf binding to be able to handle TX. Additional to all the RX binding, we also create tx_vec needed for the TX path. Provide API for sendmsg to be able to send dmabufs bound to this device: - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from. - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf. Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying. We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems instead of the traditional page netmems. We also special case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages. The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path. In that case, the netmems will be put_netmem'd from a context where we can't unmap the dmabuf, Resolve this by making __net_devmem_dmabuf_binding_free schedule_work'd. Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov. Cc: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-6/+9
Cross-merge networking fixes after downstream PR (net-6.15-rc6). No conflicts. Adjacent changes: net/core/dev.c: 08e9f2d584c4 ("net: Lock netdevices during dev_shutdown") a82dc19db136 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05gre: Fix again IPv6 link-local address generation.Guillaume Nault1-6/+9
Use addrconf_addr_gen() to generate IPv6 link-local addresses on GRE devices in most cases and fall back to using add_v4_addrs() only in case the GRE configuration is incompatible with addrconf_addr_gen(). GRE used to use addrconf_addr_gen() until commit e5dd729460ca ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address") restricted this use to gretap and ip6gretap devices, and created add_v4_addrs() (borrowed from SIT) for non-Ethernet GRE ones. The original problem came when commit 9af28511be10 ("addrconf: refuse isatap eui64 for INADDR_ANY") made __ipv6_isatap_ifid() fail when its addr parameter was 0. The commit says that this would create an invalid address, however, I couldn't find any RFC saying that the generated interface identifier would be wrong. Anyway, since gre over IPv4 devices pass their local tunnel address to __ipv6_isatap_ifid(), that commit broke their IPv6 link-local address generation when the local address was unspecified. Then commit e5dd729460ca ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address") tried to fix that case by defining add_v4_addrs() and calling it to generate the IPv6 link-local address instead of using addrconf_addr_gen() (apart for gretap and ip6gretap devices, which would still use the regular addrconf_addr_gen(), since they have a MAC address). That broke several use cases because add_v4_addrs() isn't properly integrated into the rest of IPv6 Neighbor Discovery code. Several of these shortcomings have been fixed over time, but add_v4_addrs() remains broken on several aspects. In particular, it doesn't send any Router Sollicitations, so the SLAAC process doesn't start until the interface receives a Router Advertisement. Also, add_v4_addrs() mostly ignores the address generation mode of the interface (/proc/sys/net/ipv6/conf/*/addr_gen_mode), thus breaking the IN6_ADDR_GEN_MODE_RANDOM and IN6_ADDR_GEN_MODE_STABLE_PRIVACY cases. Fix the situation by using add_v4_addrs() only in the specific scenario where the normal method would fail. That is, for interfaces that have all of the following characteristics: * run over IPv4, * transport IP packets directly, not Ethernet (that is, not gretap interfaces), * tunnel endpoint is INADDR_ANY (that is, 0), * device address generation mode is EUI64. In all other cases, revert back to the regular addrconf_addr_gen(). Also, remove the special case for ip6gre interfaces in add_v4_addrs(), since ip6gre devices now always use addrconf_addr_gen() instead. Note: This patch was originally applied as commit 183185a18ff9 ("gre: Fix IPv6 link-local address generation."). However, it was then reverted by commit fc486c2d060f ("Revert "gre: Fix IPv6 link-local address generation."") because it uncovered another bug that ended up breaking net/forwarding/ip6gre_custom_multipath_hash.sh. That other bug has now been fixed by commit 4d0ab3a6885e ("ipv6: Start path selection from the first nexthop"). Therefore we can now revive this GRE patch (no changes since original commit 183185a18ff9 ("gre: Fix IPv6 link-local address generation."). Fixes: e5dd729460ca ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/a88cc5c4811af36007645d610c95102dccb360a6.1746225214.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05netfilter: bridge: Move specific fragmented packet to slow_path instead of ↵Huajian Yang1-6/+6
dropping it The config NF_CONNTRACK_BRIDGE will change the bridge forwarding for fragmented packets. The original bridge does not know that it is a fragmented packet and forwards it directly, after NF_CONNTRACK_BRIDGE is enabled, function nf_br_ip_fragment and br_ip6_fragment will check the headroom. In original br_forward, insufficient headroom of skb may indeed exist, but there's still a way to save the skb in the device driver after dev_queue_xmit.So droping the skb will change the original bridge forwarding in some cases. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: Huajian Yang <huajianyang@asrmicro.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-02ipv6: Restore fib6_config validation for SIOCADDRT.Kuniyuki Iwashima1-42/+55
syzkaller reported out-of-bounds read in ipv6_addr_prefix(), where the prefix length was over 128. The cited commit accidentally removed some fib6_config validation from the ioctl path. Let's restore the validation. [0]: BUG: KASAN: slab-out-of-bounds in ip6_route_info_create (./include/net/ipv6.h:616 net/ipv6/route.c:3814) Read of size 1 at addr ff11000138020ad4 by task repro/261 CPU: 3 UID: 0 PID: 261 Comm: repro Not tainted 6.15.0-rc3-00614-g0d15a26b247d #87 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:123) print_report (mm/kasan/report.c:409 mm/kasan/report.c:521) kasan_report (mm/kasan/report.c:636) ip6_route_info_create (./include/net/ipv6.h:616 net/ipv6/route.c:3814) ip6_route_add (net/ipv6/route.c:3902) ipv6_route_ioctl (net/ipv6/route.c:4523) inet6_ioctl (net/ipv6/af_inet6.c:577) sock_do_ioctl (net/socket.c:1190) sock_ioctl (net/socket.c:1314) __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:906 fs/ioctl.c:892 fs/ioctl.c:892) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f518fb2de5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007fff14f38d18 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f518fb2de5d RDX: 00000000200015c0 RSI: 000000000000890b RDI: 0000000000000003 RBP: 00007fff14f38d30 R08: 0000000000000800 R09: 0000000000000800 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff14f38e48 R13: 0000000000401136 R14: 0000000000403df0 R15: 00007f518fd3c000 </TASK> Fixes: fa76c1674f2e ("ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: Yi Lai <yi1.lai@linux.intel.com> Closes: https://lore.kernel.org/netdev/aBAcKDEFoN%2FLntBF@ly-workstation/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250501005335.53683-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT ↵Andrea Mayer1-3/+3
setup Recent updates to the locking mechanism that protects IPv6 routing tables [1] have affected the SRv6 networking subsystem. Such changes cause problems with some SRv6 Endpoints behaviors, like End.B6.Encaps and also impact SRv6 counters. Starting from commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE."), the inet6_rtm_newroute() function no longer needs to acquire the RTNL lock for creating and configuring IPv6 routes and set up lwtunnels. The RTNL lock can be avoided because the ip6_route_add() function finishes setting up a new route in a section protected by RCU. This makes sure that no dev/nexthops can disappear during the operation. Because of this, the steps for setting up lwtunnels - i.e., calling lwtunnel_build_state() - are now done in a RCU lock section and not under the RTNL lock anymore. However, creating and configuring a lwtunnel instance in an RCU-protected section can be problematic when that tunnel needs to allocate memory using the GFP_KERNEL flag. For example, the following trace shows what happens when an SRv6 End.B6.Encaps behavior is instantiated after commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE."): [ 3061.219696] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321 [ 3061.226136] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 445, name: ip [ 3061.232101] preempt_count: 0, expected: 0 [ 3061.235414] RCU nest depth: 1, expected: 0 [ 3061.238622] 1 lock held by ip/445: [ 3061.241458] #0: ffffffff83ec64a0 (rcu_read_lock){....}-{1:3}, at: ip6_route_add+0x41/0x1e0 [ 3061.248520] CPU: 1 UID: 0 PID: 445 Comm: ip Not tainted 6.15.0-rc3-micro-vm-dev-00590-ge527e891492d #2058 PREEMPT(full) [ 3061.248532] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 3061.248549] Call Trace: [ 3061.248620] <TASK> [ 3061.248633] dump_stack_lvl+0xa9/0xc0 [ 3061.248846] __might_resched+0x218/0x360 [ 3061.248871] __kmalloc_node_track_caller_noprof+0x332/0x4e0 [ 3061.248889] ? rcu_is_watching+0x3a/0x70 [ 3061.248902] ? parse_nla_srh+0x56/0xa0 [ 3061.248938] kmemdup_noprof+0x1c/0x40 [ 3061.248952] parse_nla_srh+0x56/0xa0 [ 3061.248969] seg6_local_build_state+0x2e0/0x580 [ 3061.248992] ? __lock_acquire+0xaff/0x1cd0 [ 3061.249013] ? do_raw_spin_lock+0x111/0x1d0 [ 3061.249027] ? __pfx_seg6_local_build_state+0x10/0x10 [ 3061.249068] ? lwtunnel_build_state+0xe1/0x3a0 [ 3061.249274] lwtunnel_build_state+0x10d/0x3a0 [ 3061.249303] fib_nh_common_init+0xce/0x1e0 [ 3061.249337] ? __pfx_fib_nh_common_init+0x10/0x10 [ 3061.249352] ? in6_dev_get+0xaf/0x1f0 [ 3061.249369] ? __rcu_read_unlock+0x64/0x2e0 [ 3061.249392] fib6_nh_init+0x290/0xc30 [ 3061.249422] ? __pfx_fib6_nh_init+0x10/0x10 [ 3061.249447] ? __lock_acquire+0xaff/0x1cd0 [ 3061.249459] ? _raw_spin_unlock_irqrestore+0x22/0x70 [ 3061.249624] ? ip6_route_info_create+0x423/0x520 [ 3061.249641] ? rcu_is_watching+0x3a/0x70 [ 3061.249683] ip6_route_info_create_nh+0x190/0x390 [ 3061.249715] ip6_route_add+0x71/0x1e0 [ 3061.249730] ? __pfx_inet6_rtm_newroute+0x10/0x10 [ 3061.249743] inet6_rtm_newroute+0x426/0xc50 [ 3061.249764] ? avc_has_perm_noaudit+0x13d/0x360 [ 3061.249853] ? __pfx_inet6_rtm_newroute+0x10/0x10 [ 3061.249905] ? __lock_acquire+0xaff/0x1cd0 [ 3061.249962] ? rtnetlink_rcv_msg+0x52f/0x890 [ 3061.249996] ? __pfx_inet6_rtm_newroute+0x10/0x10 [ 3061.250012] rtnetlink_rcv_msg+0x551/0x890 [ 3061.250040] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 3061.250065] ? __lock_acquire+0xaff/0x1cd0 [ 3061.250092] netlink_rcv_skb+0xbd/0x1f0 [ 3061.250108] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 3061.250124] ? __pfx_netlink_rcv_skb+0x10/0x10 [ 3061.250179] ? netlink_deliver_tap+0x10b/0x700 [ 3061.250210] netlink_unicast+0x2e7/0x410 [ 3061.250232] ? __pfx_netlink_unicast+0x10/0x10 [ 3061.250241] ? __lock_acquire+0xaff/0x1cd0 [ 3061.250280] netlink_sendmsg+0x366/0x670 [ 3061.250306] ? __pfx_netlink_sendmsg+0x10/0x10 [ 3061.250313] ? find_held_lock+0x2d/0xa0 [ 3061.250344] ? import_ubuf+0xbc/0xf0 [ 3061.250370] ? __pfx_netlink_sendmsg+0x10/0x10 [ 3061.250381] __sock_sendmsg+0x13e/0x150 [ 3061.250420] ____sys_sendmsg+0x33d/0x450 [ 3061.250442] ? __pfx_____sys_sendmsg+0x10/0x10 [ 3061.250453] ? __pfx_copy_msghdr_from_user+0x10/0x10 [ 3061.250489] ? __pfx_slab_free_after_rcu_debug+0x10/0x10 [ 3061.250514] ___sys_sendmsg+0xe5/0x160 [ 3061.250530] ? __pfx____sys_sendmsg+0x10/0x10 [ 3061.250568] ? __lock_acquire+0xaff/0x1cd0 [ 3061.250617] ? find_held_lock+0x2d/0xa0 [ 3061.250678] ? __virt_addr_valid+0x199/0x340 [ 3061.250704] ? preempt_count_sub+0xf/0xc0 [ 3061.250736] __sys_sendmsg+0xca/0x140 [ 3061.250750] ? __pfx___sys_sendmsg+0x10/0x10 [ 3061.250786] ? syscall_exit_to_user_mode+0xa2/0x1e0 [ 3061.250825] do_syscall_64+0x62/0x140 [ 3061.250844] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3061.250855] RIP: 0033:0x7f0b042ef914 [ 3061.250868] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 e9 5d 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53 [ 3061.250876] RSP: 002b:00007ffc2d113ef8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 3061.250885] RAX: ffffffffffffffda RBX: 00000000680f93fa RCX: 00007f0b042ef914 [ 3061.250891] RDX: 0000000000000000 RSI: 00007ffc2d113f60 RDI: 0000000000000003 [ 3061.250897] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000008 [ 3061.250902] R10: fffffffffffff26d R11: 0000000000000246 R12: 0000000000000001 [ 3061.250907] R13: 000055a961f8a520 R14: 000055a961f63eae R15: 00007ffc2d115270 [ 3061.250952] </TASK> To solve this issue, we replace the GFP_KERNEL flag with the GFP_ATOMIC one in those SRv6 Endpoints that need to allocate memory during the setup phase. This change makes sure that memory allocations are handled in a way that works with RCU critical sections. [1] - https://lore.kernel.org/all/20250418000443.43734-1-kuniyu@amazon.com/ Fixes: 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.") Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250429132453.31605-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
Cross-merge networking fixes after downstream PR (net-6.15-rc5). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01net: use sock_gen_put() when sk_state is TCP_TIME_WAITJibin Zhang1-1/+1
It is possible for a pointer of type struct inet_timewait_sock to be returned from the functions __inet_lookup_established() and __inet6_lookup_established(). This can cause a crash when the returned pointer is of type struct inet_timewait_sock and sock_put() is called on it. The following is a crash call stack that shows sk->sk_wmem_alloc being accessed in sk_free() during the call to sock_put() on a struct inet_timewait_sock pointer. To avoid this issue, use sock_gen_put() instead of sock_put() when sk->sk_state is TCP_TIME_WAIT. mrdump.ko ipanic() + 120 vmlinux notifier_call_chain(nr_to_call=-1, nr_calls=0) + 132 vmlinux atomic_notifier_call_chain(val=0) + 56 vmlinux panic() + 344 vmlinux add_taint() + 164 vmlinux end_report() + 136 vmlinux kasan_report(size=0) + 236 vmlinux report_tag_fault() + 16 vmlinux do_tag_recovery() + 16 vmlinux __do_kernel_fault() + 88 vmlinux do_bad_area() + 28 vmlinux do_tag_check_fault() + 60 vmlinux do_mem_abort() + 80 vmlinux el1_abort() + 56 vmlinux el1h_64_sync_handler() + 124 vmlinux > 0xFFFFFFC080011294() vmlinux __lse_atomic_fetch_add_release(v=0xF2FFFF82A896087C) vmlinux __lse_atomic_fetch_sub_release(v=0xF2FFFF82A896087C) vmlinux arch_atomic_fetch_sub_release(i=1, v=0xF2FFFF82A896087C) + 8 vmlinux raw_atomic_fetch_sub_release(i=1, v=0xF2FFFF82A896087C) + 8 vmlinux atomic_fetch_sub_release(i=1, v=0xF2FFFF82A896087C) + 8 vmlinux __refcount_sub_and_test(i=1, r=0xF2FFFF82A896087C, oldp=0) + 8 vmlinux __refcount_dec_and_test(r=0xF2FFFF82A896087C, oldp=0) + 8 vmlinux refcount_dec_and_test(r=0xF2FFFF82A896087C) + 8 vmlinux sk_free(sk=0xF2FFFF82A8960700) + 28 vmlinux sock_put() + 48 vmlinux tcp6_check_fraglist_gro() + 236 vmlinux tcp6_gro_receive() + 624 vmlinux ipv6_gro_receive() + 912 vmlinux dev_gro_receive() + 1116 vmlinux napi_gro_receive() + 196 ccmni.ko ccmni_rx_callback() + 208 ccmni.ko ccmni_queue_recv_skb() + 388 ccci_dpmaif.ko dpmaif_rxq_push_thread() + 1088 vmlinux kthread() + 268 vmlinux 0xFFFFFFC08001F30C() Fixes: c9d1d23e5239 ("net: add heuristic for enabling TCP fraglist GRO") Signed-off-by: Jibin Zhang <jibin.zhang@mediatek.com> Signed-off-by: Shiming Cheng <shiming.cheng@mediatek.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250429020412.14163-1-shiming.cheng@mediatek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-29ip: load balance tcp connections to single dst addr and portWillem de Bruijn2-3/+12
Load balance new TCP connections across nexthops also when they connect to the same service at a single remote address and port. This affects only port-based multipath hashing: fib_multipath_hash_policy 1 or 3. Local connections must choose both a source address and port when connecting to a remote service, in ip_route_connect. This "chicken-and-egg problem" (commit 2d7192d6cbab ("ipv4: Sanitize and simplify ip_route_{connect,newports}()")) is resolved by first selecting a source address, by looking up a route using the zero wildcard source port and address. As a result multiple connections to the same destination address and port have no entropy in fib_multipath_hash. This is not a problem when forwarding, as skb-based hashing has a 4-tuple. Nor when establishing UDP connections, as autobind there selects a port before reaching ip_route_connect. Load balance also TCP, by using a random port in fib_multipath_hash. Port assignment in inet_hash_connect is not atomic with ip_route_connect. Thus ports are unpredictable, effectively random. Implementation details: Do not actually pass a random fl4_sport, as that affects not only hashing, but routing more broadly, and can match a source port based policy route, which existing wildcard port 0 will not. Instead, define a new wildcard flowi flag that is used only for hashing. Selecting a random source is equivalent to just selecting a random hash entirely. But for code clarity, follow the normal 4-tuple hash process and only update this field. fib_multipath_hash can be reached with zero sport from other code paths, so explicitly pass this flowi flag, rather than trying to infer this case in the function itself. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250424143549.669426-3-willemdebruijn.kernel@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.Kuniyuki Iwashima1-6/+12
Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE. The remaining things to do are 1. pass false to lwtunnel_valid_encap_type_attr() 2. use rcu_dereference_rtnl() in fib6_check_nexthop() 3. place rcu_read_lock() before ip6_route_info_create_nh(). Let's complete the RTNL-free conversion. When each CPU-X adds 100000 routes on table-X in a batch concurrently on c7a.metal-48xl EC2 instance with 192 CPUs, without this series: $ sudo ./route_test.sh ... added 19200000 routes (100000 routes * 192 tables). time elapsed: 191577 milliseconds. with this series: $ sudo ./route_test.sh ... added 19200000 routes (100000 routes * 192 tables). time elapsed: 62854 milliseconds. I changed the number of routes in each table (1000 ~ 100000) and consistently saw it finish 3x faster with this series. Note that now every caller of lwtunnel_valid_encap_type() passes false as the last argument, and this can be removed later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-16-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Protect nh->f6i_list with spinlock and flag.Kuniyuki Iwashima1-5/+34
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT. Then, we may be going to add a route tied to a dying nexthop. The nexthop itself is not freed during the RCU grace period, but if we link a route after __remove_nexthop_fib() is called for the nexthop, the route will be leaked. To avoid the race between IPv6 route addition under RCU vs nexthop deletion under RTNL, let's add a dead flag and protect it and nh->f6i_list with a spinlock. __remove_nexthop_fib() acquires the nexthop's spinlock and sets false to nh->dead, then calls ip6_del_rt() for the linked route one by one without the spinlock because fib6_purge_rt() acquires it later. While adding an IPv6 route, fib6_add() acquires the nexthop lock and checks the dead flag just before inserting the route. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-15-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().Kuniyuki Iwashima1-7/+14
The next patch adds per-nexthop spinlock which protects nh->f6i_list. When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock. fib6_add_rt2node() could call fib6_purge_rt() for another route, which could holds another nexthop lock. Then, deadlock could happen between two nexthops. Let's defer fib6_purge_rt() after fib6_add_rt2node(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-14-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Protect fib6_link_table() with spinlock.Kuniyuki Iwashima1-5/+21
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT. If the request specifies a new table ID, fib6_new_table() is called to create a new routing table. Two concurrent requests could specify the same table ID, so we need a lock to protect net->ipv6.fib_table_hash[h]. Let's add a spinlock to protect the hash bucket linkage. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-13-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Factorise ip6_route_multipath_add().Kuniyuki Iwashima1-75/+130
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely on RCU to guarantee dev and nexthop lifetime. Then, the RCU section will start before ip6_route_info_create_nh() in ip6_route_multipath_add(), but ip6_route_info_create() is called in the same loop and will sleep. Let's split the loop into ip6_route_mpath_info_create() and ip6_route_mpath_info_create_nh(). Note that ip6_route_info_append() is now integrated into ip6_route_mpath_info_create_nh() because we need to call different free functions for nexthops that passed ip6_route_info_create_nh(). In case of failure, the remaining nexthops that ip6_route_info_create_nh() has not been called for will be freed by ip6_route_mpath_info_cleanup(). OTOH, if a nexthop passes ip6_route_info_create_nh(), it will be linked to a local temporary list, which will be spliced back to rt6_nh_list. In case of failure, these nexthops will be released by fib6_info_release() in ip6_route_multipath_add(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-12-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Rename rt6_nh.next to rt6_nh.list.Kuniyuki Iwashima1-7/+7
ip6_route_multipath_add() allocates struct rt6_nh for each config of multipath routes to link them to a local list rt6_nh_list. struct rt6_nh.next is the list node of each config, so the name is quite misleading. Let's rename it to list. Suggested-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/netdev/c9bee472-c94e-4878-8cc2-1512b2c54db5@redhat.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-11-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Don't pass net to ip6_route_info_append().Kuniyuki Iwashima1-4/+2
net is not used in ip6_route_info_append() after commit 36f19d5b4f99 ("net/ipv6: Remove extra call to ip6_convert_metrics for multipath case"). Let's remove the argument. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-10-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().Kuniyuki Iwashima1-0/+9
ip6_route_info_create_nh() will be called under RCU. It calls fib_nh_common_init() and allocates nhc->nhc_pcpu_rth_output. As with the reason for rt->fib6_nh->rt6i_pcpu, we want to avoid GFP_ATOMIC allocation for nhc->nhc_pcpu_rth_output under RCU. Let's preallocate it in ip6_route_info_create(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-9-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().Kuniyuki Iwashima1-3/+22
ip6_route_info_create_nh() will be called under RCU. Then, fib6_nh_init() is also under RCU, but per-cpu memory allocation is very likely to fail with GFP_ATOMIC while bulk-adding IPv6 routes and we would see a bunch of this message in dmesg. percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left Let's preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create(). If something fails before the original memory allocation in fib6_nh_init(), ip6_route_info_create_nh() calls fib6_info_release(), which releases the preallocated per-cpu memory. Note that rt->fib6_nh->rt6i_pcpu is not preallocated when called via ipv6_stub, so we still need alloc_percpu_gfp() in fib6_nh_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-8-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Split ip6_route_info_create().Kuniyuki Iwashima1-33/+62
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely on RCU to guarantee dev and nexthop lifetime. Then, we want to allocate as much as possible before entering the RCU section. The RCU section will start in the middle of ip6_route_info_create(), and this is problematic for ip6_route_multipath_add() that calls ip6_route_info_create() multiple times. Let's split ip6_route_info_create() into two parts; one for memory allocation and another for nexthop setup. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-7-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Move nexthop_find_by_id() after fib6_info_alloc().Kuniyuki Iwashima1-16/+18
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT. Then, we must perform two lookups for nexthop and dev under RCU to guarantee their lifetime. ip6_route_info_create() calls nexthop_find_by_id() first if RTA_NH_ID is specified, and then allocates struct fib6_info. nexthop_find_by_id() must be called under RCU, but we do not want to use GFP_ATOMIC for memory allocation here, which will be likely to fail in ip6_route_multipath_add(). Let's move nexthop_find_by_id() after the memory allocation so that we can later split ip6_route_info_create() into two parts: the sleepable part and the RCU part. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-6-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().Kuniyuki Iwashima1-7/+9
In ip6_route_multipath_add(), we call rt6_qualify_for_ecmp() for each entry. If it returns false, the request fails. rt6_qualify_for_ecmp() returns false if either of the conditions below is true: 1. f6i->fib6_flags has RTF_ADDRCONF 2. f6i->nh is not NULL 3. f6i->fib6_nh->fib_nh_gw_family is AF_UNSPEC 1 is unnecessary because rtm_to_fib6_config() never sets RTF_ADDRCONF to cfg->fc_flags. 2. is equivalent with cfg->fc_nh_id. 3. can be replaced by checking RTF_GATEWAY in the base and each multipath entry because AF_INET6 is set to f6i->fib6_nh->fib_nh_gw_family only when cfg.fc_is_fdb is true or RTF_GATEWAY is set, but the former is always false. These checks do not require RCU and can be done earlier. Let's perform the equivalent checks in rtm_to_fib6_multipath_config(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-5-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().Kuniyuki Iwashima1-37/+42
ip6_route_info_create() is called from 3 functions: * ip6_route_add() * ip6_route_multipath_add() * addrconf_f6i_alloc() addrconf_f6i_alloc() does not need validation for struct fib6_config in ip6_route_info_create(). ip6_route_multipath_add() calls ip6_route_info_create() for multiple routes with slightly different fib6_config instances, which is copied from the base config passed from userspace. So, we need not validate the same config repeatedly. Let's move such validation into rtm_to_fib6_config(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-4-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.Kuniyuki Iwashima1-20/+28
Basically, removing an IPv6 route does not require RTNL because the IPv6 routing tables are protected by per table lock. inet6_rtm_delroute() calls nexthop_find_by_id() to check if the nexthop specified by RTA_NH_ID exists. nexthop uses rbtree and the top-down walk can be safely performed under RCU. ip6_route_del() already relies on RCU and the table lock, but we need to extend the RCU critical section a bit more to cover __ip6_del_rt(). For example, nexthop_for_each_fib6_nh() and inet6_rt_notify() needs RCU. Let's call nexthop_find_by_id() and __ip6_del_rt() under RCU and get rid of RTNL from inet6_rtm_delroute() and SIOCDELRT. Even if the nexthop is removed after rcu_read_unlock() in inet6_rtm_delroute(), __remove_nexthop_fib() cleans up the routes tied to the nexthop, and ip6_route_del() returns -ESRCH. So the request was at least valid as of nexthop_find_by_id(), and it's just a matter of timing. Note that we need to pass false to lwtunnel_valid_encap_type_attr(). The following patches also use the newroute bool. Note also that fib6_get_table() does not require RCU because once allocated fib6_table is not freed until netns dismantle. I will post a follow-up series to convert such callers to RCU-lockless version. [0] Link: https://lore.kernel.org/netdev/20250417174557.65721-1-kuniyu@amazon.com/ #[0] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-3-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().Kuniyuki Iwashima1-39/+43
We will perform RTM_NEWROUTE and RTM_DELROUTE under RCU, and then we want to perform some validation out of the RCU scope. When creating / removing an IPv6 route with RTA_MULTIPATH, inet6_rtm_newroute() / inet6_rtm_delroute() validates RTA_GATEWAY in each multipath entry. Let's do that in rtm_to_fib6_config(). Note that now RTM_DELROUTE returns an error for RTA_MULTIPATH with 0 entries, which was accepted but should result in -EINVAL as RTM_NEWROUTE. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-2-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+1
Cross-merge networking fixes after downstream PR (net-6.15-rc3). No conflicts. Adjacent changes: tools/net/ynl/pyynl/ynl_gen_c.py 4d07bbf2d456 ("tools: ynl-gen: don't declare loop iterator in place") 7e8ba0c7de2b ("tools: ynl: don't use genlmsghdr in classic netlink") Signed-off-by: Jakub Kicinski <kuba@kernel.org>