aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/xen
AgeCommit message (Collapse)AuthorFilesLines
6 daysMerge tag 'for-linus-6.19-rc1-tag' of ↵Linus Torvalds2-12/+6
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: "This round it contains only three small cleanup patches" * tag 'for-linus-6.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: drivers/xen: use min() instead of min_t() drivers/xen/xenbus: Replace deprecated strcpy in xenbus_transaction_end drivers/xen/xenbus: Simplify return statement in join()
6 daysMerge tag 'dma-mapping-6.19-2025-12-05' of ↵Linus Torvalds2-42/+41
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping updates from Marek Szyprowski: - More DMA mapping API refactoring to physical addresses as the primary interface instead of page+offset parameters. This time dma_map_ops callbacks are converted to physical addresses, what in turn results also in some simplification of architecture specific code (Leon Romanovsky and Jason Gunthorpe) - Clarify that dma_map_benchmark is not a kernel self-test, but standalone tool (Qinxin Xia) * tag 'dma-mapping-6.19-2025-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: remove unused map_page callback xen: swiotlb: Convert mapping routine to rely on physical address x86: Use physical address for DMA mapping sparc: Use physical address DMA mapping powerpc: Convert to physical address DMA mapping parisc: Convert DMA map_page to map_phys interface MIPS/jazzdma: Provide physical address directly alpha: Convert mapping routine to rely on physical address dma-mapping: remove unused mapping resource callbacks xen: swiotlb: Switch to physical address mapping callbacks ARM: dma-mapping: Switch to physical address mapping callbacks ARM: dma-mapping: Reduce struct page exposure in arch_sync_dma*() dma-mapping: convert dummy ops to physical address mapping dma-mapping: prepare dma_map_ops to conversion to physical address tools/dma: move dma_map_benchmark from selftests to tools/dma
7 daysMerge tag 'soc-drivers-6.19' of ↵Linus Torvalds1-4/+8
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull SoC driver updates from Arnd Bergmann: "This is the first half of the driver changes: - A treewide interface change to the "syscore" operations for power management, as a preparation for future Tegra specific changes - Reset controller updates with added drivers for LAN969x, eic770 and RZ/G3S SoCs - Protection of system controller registers on Renesas and Google SoCs, to prevent trivially triggering a system crash from e.g. debugfs access - soc_device identification updates on Nvidia, Exynos and Mediatek - debugfs support in the ST STM32 firewall driver - Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI - Cleanups for memory controller support on Nvidia and Renesas" * tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits) memory: tegra186-emc: Fix missing put_bpmp Documentation: reset: Remove reset_controller_add_lookup() reset: fix BIT macro reference reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe reset: th1520: Support reset controllers in more subsystems reset: th1520: Prepare for supporting multiple controllers dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets reset: remove legacy reset lookup code clk: davinci: psc: drop unused reset lookup reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support reset: eswin: Add eic7700 reset driver dt-bindings: reset: eswin: Documentation for eic7700 SoC reset: sparx5: add LAN969x support dt-bindings: reset: microchip: Add LAN969x support soc: rockchip: grf: Add select correct PWM implementation on RK3368 soc/tegra: pmc: Add USB wake events for Tegra234 amba: tegra-ahb: Fix device leak on SMMU enable ...
7 daysMerge tag 'pull-persistency' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull persistent dentry infrastructure and conversion from Al Viro: "Some filesystems use a kinda-sorta controlled dentry refcount leak to pin dentries of created objects in dcache (and undo it when removing those). A reference is grabbed and not released, but it's not actually _stored_ anywhere. That works, but it's hard to follow and verify; among other things, we have no way to tell _which_ of the increments is intended to be an unpaired one. Worse, on removal we need to decide whether the reference had already been dropped, which can be non-trivial if that removal is on umount and we need to figure out if this dentry is pinned due to e.g. unlink() not done. Usually that is handled by using kill_litter_super() as ->kill_sb(), but there are open-coded special cases of the same (consider e.g. /proc/self). Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set claims responsibility for +1 in refcount. The end result this series is aiming for: - get these unbalanced dget() and dput() replaced with new primitives that would, in addition to adjusting refcount, set and clear persistency flag. - instead of having kill_litter_super() mess with removing the remaining "leaked" references (e.g. for all tmpfs files that hadn't been removed prior to umount), have the regular shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries, dropping the corresponding reference if it had been set. After that kill_litter_super() becomes an equivalent of kill_anon_super(). Doing that in a single step is not feasible - it would affect too many places in too many filesystems. It has to be split into a series. This work has really started early in 2024; quite a few preliminary pieces have already gone into mainline. This chunk is finally getting to the meat of that stuff - infrastructure and most of the conversions to it. Some pieces are still sitting in the local branches, but the bulk of that stuff is here" * tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) d_make_discardable(): warn if given a non-persistent dentry kill securityfs_recursive_remove() convert securityfs get rid of kill_litter_super() convert rust_binderfs convert nfsctl convert rpc_pipefs convert hypfs hypfs: swich hypfs_create_u64() to returning int hypfs: switch hypfs_create_str() to returning int hypfs: don't pin dentries twice convert gadgetfs gadgetfs: switch to simple_remove_by_name() convert functionfs functionfs: switch to simple_remove_by_name() functionfs: fix the open/removal races functionfs: need to cancel ->reset_work in ->kill_sb() functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}() functionfs: don't abuse ffs_data_closed() on fs shutdown convert selinuxfs ...
7 daysdrivers/xen: use min() instead of min_t()David Laight1-1/+1
min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'. Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long' and so cannot discard significant bits. In this case the 'unsigned long' value is small enough that the result is ok. Detected by an extra check added to min_t(). Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20251119224140.8616-30-david.laight.linux@gmail.com> Signed-off-by: Juergen Gross <jgross@suse.com>
9 daysMerge tag 'net-next-6.19' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Replace busylock at the Tx queuing layer with a lockless list. Resulting in a 300% (4x) improvement on heavy TX workloads, sending twice the number of packets per second, for half the cpu cycles. - Allow constantly busy flows to migrate to a more suitable CPU/NIC queue. Normally we perform queue re-selection when flow comes out of idle, but under extreme circumstances the flows may be constantly busy. Add sysctl to allow periodic rehashing even if it'd risk packet reordering. - Optimize the NAPI skb cache, make it larger, use it in more paths. - Attempt returning Tx skbs to the originating CPU (like we already did for Rx skbs). - Various data structure layout and prefetch optimizations from Eric. - Remove ktime_get() from the recvmsg() fast path, ktime_get() is sadly quite expensive on recent AMD machines. - Extend threaded NAPI polling to allow the kthread busy poll for packets. - Make MPTCP use Rx backlog processing. This lowers the lock pressure, improving the Rx performance. - Support memcg accounting of MPTCP socket memory. - Allow admin to opt sockets out of global protocol memory accounting (using a sysctl or BPF-based policy). The global limits are a poor fit for modern container workloads, where limits are imposed using cgroups. - Improve heuristics for when to kick off AF_UNIX garbage collection. - Allow users to control TCP SACK compression, and default to 33% of RTT. - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid unnecessarily aggressive rcvbuf growth and overshot when the connection RTT is low. - Preserve skb metadata space across skb_push / skb_pull operations. - Support for IPIP encapsulation in the nftables flowtable offload. - Support appending IP interface information to ICMP messages (RFC 5837). - Support setting max record size in TLS (RFC 8449). - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL. - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock. - Let users configure the number of write buffers in SMC. - Add new struct sockaddr_unsized for sockaddr of unknown length, from Kees. - Some conversions away from the crypto_ahash API, from Eric Biggers. - Some preparations for slimming down struct page. - YAML Netlink protocol spec for WireGuard. - Add a tool on top of YAML Netlink specs/lib for reporting commonly computed derived statistics and summarized system state. Driver API: - Add CAN XL support to the CAN Netlink interface. - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as defined by the OPEN Alliance's "Advanced diagnostic features for 100BASE-T1 automotive Ethernet PHYs" specification. - Add DPLL phase-adjust-gran pin attribute (and implement it in zl3073x). - Refactor xfrm_input lock to reduce contention when NIC offloads IPsec and performs RSS. - Add info to devlink params whether the current setting is the default or a user override. Allow resetting back to default. - Add standard device stats for PSP crypto offload. - Leverage DSA frame broadcast to implement simple HSR frame duplication for a lot of switches without dedicated HSR offload. - Add uAPI defines for 1.6Tbps link modes. Device drivers: - Add Motorcomm YT921x gigabit Ethernet switch support. - Add MUCSE driver for N500/N210 1GbE NIC series. - Convert drivers to support dedicated ops for timestamping control, and away from the direct IOCTL handling. While at it support GET operations for PHY timestamping. - Add (and convert most drivers to) a dedicated ethtool callback for reading the Rx ring count. - Significant refactoring efforts in the STMMAC driver, which supports Synopsys turn-key MAC IP integrated into a ton of SoCs. - Ethernet high-speed NICs: - Broadcom (bnxt): - support PPS in/out on all pins - Intel (100G, ice, idpf): - ice: implement standard ethtool and timestamping stats - i40e: support setting the max number of MAC addresses per VF - iavf: support RSS of GTP tunnels for 5G and LTE deployments - nVidia/Mellanox (mlx5): - reduce downtime on interface reconfiguration - disable being an XDP redirect target by default (same as other drivers) to avoid wasting resources if feature is unused - Meta (fbnic): - add support for Linux-managed PCS on 25G, 50G, and 100G links - Wangxun: - support Rx descriptor merge, and Tx head writeback - support Rx coalescing offload - support 25G SPF and 40G QSFP modules - Ethernet virtual: - Google (gve): - allow ethtool to configure rx_buf_len - implement XDP HW RX Timestamping support for DQ descriptor format - Microsoft vNIC (mana): - support HW link state events - handle hardware recovery events when probing the device - Ethernet NICs consumer, and embedded: - usbnet: add support for Byte Queue Limits (BQL) - AMD (amd-xgbe): - add device selftests - NXP (enetc): - add i.MX94 support - Broadcom integrated MACs (bcmgenet, bcmasp): - bcmasp: add support for PHY-based Wake-on-LAN - Broadcom switches (b53): - support port isolation - support BCM5389/97/98 and BCM63XX ARL formats - Lantiq/MaxLinear switches: - support bridge FDB entries on the CPU port - use regmap for register access - allow user to enable/disable learning - support Energy Efficient Ethernet - support configuring RMII clock delays - add tagging driver for MaxLinear GSW1xx switches - Synopsys (stmmac): - support using the HW clock in free running mode - add Eswin EIC7700 support - add Rockchip RK3506 support - add Altera Agilex5 support - Cadence (macb): - cleanup and consolidate descriptor and DMA address handling - add EyeQ5 support - TI: - icssg-prueth: support AF_XDP - Airoha access points: - add missing Ethernet stats and link state callback - add AN7583 support - support out-of-order Tx completion processing - Power over Ethernet: - pd692x0: preserve PSE configuration across reboots - add support for TPS23881B devices - Ethernet PHYs: - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support - Support 50G SerDes and 100G interfaces in Linux-managed PHYs - micrel: - support for non PTP SKUs of lan8814 - enable in-band auto-negotiation on lan8814 - realtek: - cable testing support on RTL8224 - interrupt support on RTL8221B - motorcomm: support for PHY LEDs on YT853 - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag - mscc: support for PHY LED control - CAN drivers: - m_can: add support for optional reset and system wake up - remove can_change_mtu() obsoleted by core handling - mcp251xfd: support GPIO controller functionality - Bluetooth: - add initial support for PASTa - WiFi: - split ieee80211.h file, it's way too big - improvements in VHT radiotap reporting, S1G, Channel Switch Announcement handling, rate tracking in mesh networks - improve multi-radio monitor mode support, and add a cfg80211 debugfs interface for it - HT action frame handling on 6 GHz - initial chanctx work towards NAN - MU-MIMO sniffer improvements - WiFi drivers: - RealTek (rtw89): - support USB devices RTL8852AU and RTL8852CU - initial work for RTL8922DE - improved injection support - Intel: - iwlwifi: new sniffer API support - MediaTek (mt76): - WED support for >32-bit DMA - airoha NPU support - regdomain improvements - continued WiFi7/MLO work - Qualcomm/Atheros: - ath10k: factory test support - ath11k: TX power insertion support - ath12k: BSS color change support - ath12k: statistics improvements - brcmfmac: Acer A1 840 tablet quirk - rtl8xxxu: 40 MHz connection fixes/support" * tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits) net: page_pool: sanitise allocation order net: page pool: xa init with destroy on pp init net/mlx5e: Support XDP target xmit with dummy program net/mlx5e: Update XDP features in switch channels selftests/tc-testing: Test CAKE scheduler when enqueue drops packets net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop wireguard: netlink: generate netlink code wireguard: uapi: generate header with ynl-gen wireguard: uapi: move flag enums wireguard: uapi: move enum wg_cmd wireguard: netlink: add YNL specification selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py selftests: drv-net: introduce Iperf3Runner for measurement use cases selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive() Documentation: net: dsa: mention simple HSR offload helpers Documentation: net: dsa: mention availability of RedBox ...
2025-11-17drivers/xen/xenbus: Replace deprecated strcpy in xenbus_transaction_endThorsten Blum1-10/+4
strcpy() is deprecated; inline the read-only string instead. Fix the function comment and use bool instead of int while we're at it. Link: https://github.com/KSPP/linux/issues/88 Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20251031112145.103257-2-thorsten.blum@linux.dev>
2025-11-17drivers/xen/xenbus: Simplify return statement in join()Thorsten Blum1-1/+1
Don't unnecessarily negate 'buffer' and simplify the return statement. Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20251112171410.3140-2-thorsten.blum@linux.dev>
2025-11-16convert xenfsAl Viro1-1/+1
entirely static tree, populated by simple_fill_super(). Can switch to kill_anon_super() without any other changes. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-11-14syscore: Pass context data to callbacksThierry Reding1-4/+8
Several drivers can benefit from registering per-instance data along with the syscore operations. To achieve this, move the modifiable fields out of the syscore_ops structure and into a separate struct syscore that can be registered with the framework. Add a void * driver data field for drivers to store contextual data that will be passed to the syscore ops. Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Signed-off-by: Thierry Reding <treding@nvidia.com>
2025-11-13drivers/xen/xenbus: Fix namespace collision and split() section placement ↵Josh Poimboeuf1-2/+2
with AutoFDO When compiling the kernel with -ffunction-sections enabled, the split() function gets compiled into the .text.split section. In some cases it can even be cloned into .text.split.constprop.0 or .text.split.isra.0. However, .text.split.* is already reserved for use by the Clang -fsplit-machine-functions flag, which is used by AutoFDO. That may place part of a function's code in a .text.split.<func> section. This naming conflict causes the vmlinux linker script to wrongly place split() with other .text.split.* code, rather than where it belongs with regular text. Fix it by renaming split() to split_strings(). Fixes: 6568f14cb5ae ("vmlinux.lds: Exclude .text.startup and .text.exit from TEXT_MAIN") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: live-patching@vger.kernel.org Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://patch.msgid.link/92a194234a0f757765e275b288bb1a7236c2c35c.1762991150.git.jpoimboe@kernel.org
2025-11-04net: Convert proto_ops connect() callbacks to use sockaddr_unsizedKees Cook1-1/+1
Update all struct proto_ops connect() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: Convert proto_ops bind() callbacks to use sockaddr_unsizedKees Cook1-1/+1
Update all struct proto_ops bind() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-29xen: swiotlb: Convert mapping routine to rely on physical addressLeon Romanovsky1-8/+12
Switch to .map_phys callback instead of .map_page. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251015-remove-map-page-v5-13-3bbfe3a25cdf@kernel.org
2025-10-29xen: swiotlb: Switch to physical address mapping callbacksLeon Romanovsky1-34/+29
Combine resource and page mappings routines to one function and remove .map_resource/.unmap_resource callbacks completely. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251015-remove-map-page-v5-5-3bbfe3a25cdf@kernel.org
2025-10-03Merge tag 'dma-mapping-6.18-2025-09-30' of ↵Linus Torvalds1-1/+20
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping updates from Marek Szyprowski: - Refactoring of DMA mapping API to physical addresses as the primary interface instead of page+offset parameters This gets much closer to Matthew Wilcox's long term wish for struct-pageless IO to cacheable DRAM and is supporting memdesc project which seeks to substantially transform how struct page works. An advantage of this approach is the possibility of introducing DMA_ATTR_MMIO, which covers existing 'dma_map_resource' flow in the common paths, what in turn lets to use recently introduced dma_iova_link() API to map PCI P2P MMIO without creating struct page Developped by Leon Romanovsky and Jason Gunthorpe - Minor clean-up by Petr Tesarik and Qianfeng Rong * tag 'dma-mapping-6.18-2025-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: kmsan: fix missed kmsan_handle_dma() signature conversion mm/hmm: properly take MMIO path mm/hmm: migrate to physical address-based DMA mapping API dma-mapping: export new dma_*map_phys() interface xen: swiotlb: Open code map_resource callback dma-mapping: implement DMA_ATTR_MMIO for dma_(un)map_page_attrs() kmsan: convert kmsan_handle_dma to use physical addresses dma-mapping: convert dma_direct_*map_page to be phys_addr_t based iommu/dma: implement DMA_ATTR_MMIO for iommu_dma_(un)map_phys() iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys dma-debug: refactor to use physical addresses for page mapping iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link(). dma-mapping: introduce new DMA attribute to indicate MMIO memory swiotlb: Remove redundant __GFP_NOWARN dma-direct: clean up the logic in __dma_direct_alloc_pages()
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of ↵Linus Torvalds2-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-09-22xen: take system_transition_mutex on suspendMarek Marczykowski-Górecki1-1/+10
Xen's do_suspend() calls dpm_suspend_start() without taking required system_transition_mutex. Since 12ffc3b1513eb moved the pm_restrict_gfp_mask() call, not taking that mutex results in a WARN. Take the mutex in do_suspend(), and use mutex_trylock() to follow how enter_state() does this. Suggested-by: Jürgen Groß <jgross@suse.com> Fixes: 12ffc3b1513eb "PM: Restrict swap use to later in the suspend sequence" Link: https://lore.kernel.org/xen-devel/aKiBJeqsYx_4Top5@mail-itl/ Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Cc: stable@vger.kernel.org # v6.16+ Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250921162853.223116-1-marmarek@invisiblethingslab.com>
2025-09-13mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()David Hildenbrand2-2/+4
... and hide it behind a kconfig option. There is really no need for any !xen code to perform this check. The naming is a bit off: we want to find the "normal" page when a PTE was marked "special". So it's really not "finding a special" page. Improve the documentation, and add a comment in the code where XEN ends up performing the pte_mkspecial() through a hypercall. More details can be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as special on x86 PV guests"). Link: https://lkml.kernel.org/r/20250811112631.759341-12-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-12xen: swiotlb: Open code map_resource callbackLeon Romanovsky1-1/+20
General dma_direct_map_resource() is going to be removed in next patch, so simply open-code it in xen driver. Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/e9c66a92e818f416875441b6711963f9782dbbeb.1757423202.git.leonro@nvidia.com
2025-09-09xen/manage: Fix suspend error pathLukas Wunner1-1/+2
The device power management API has the following asymmetry: * dpm_suspend_start() does not clean up on failure (it requires a call to dpm_resume_end()) * dpm_suspend_end() does clean up on failure (it does not require a call to dpm_resume_start()) The asymmetry was introduced by commit d8f3de0d2412 ("Suspend-related patches for 2.6.27") in June 2008: It removed a call to device_resume() from device_suspend() (which was later renamed to dpm_suspend_start()). When Xen began using the device power management API in May 2008 with commit 0e91398f2a5d ("xen: implement save/restore"), the asymmetry did not yet exist. But since it was introduced, a call to dpm_resume_end() is missing in the error path of dpm_suspend_start(). Fix it. Fixes: d8f3de0d2412 ("Suspend-related patches for 2.6.27") Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: stable@vger.kernel.org # v2.6.27 Reviewed-by: "Rafael J. Wysocki (Intel)" <rafael@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <22453676d1ddcebbe81641bb68ddf587fee7e21e.1756990799.git.lukas@wunner.de>
2025-09-09xen/events: Update virq_to_irq on migrationJason Andryuk1-1/+12
VIRQs come in 3 flavors, per-VPU, per-domain, and global, and the VIRQs are tracked in per-cpu virq_to_irq arrays. Per-domain and global VIRQs must be bound on CPU 0, and bind_virq_to_irq() sets the per_cpu virq_to_irq at registration time Later, the interrupt can migrate, and info->cpu is updated. When calling __unbind_from_irq(), the per-cpu virq_to_irq is cleared for a different cpu. If bind_virq_to_irq() is called again with CPU 0, the stale irq is returned. There won't be any irq_info for the irq, so things break. Make xen_rebind_evtchn_to_cpu() update the per_cpu virq_to_irq mappings to keep them update to date with the current cpu. This ensures the correct virq_to_irq is cleared in __unbind_from_irq(). Fixes: e46cdb66c8fc ("xen: event channels") Cc: stable@vger.kernel.org Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250828003604.8949-4-jason.andryuk@amd.com>
2025-09-09xen/events: Return -EEXIST for bound VIRQsJason Andryuk1-5/+14
Change find_virq() to return -EEXIST when a VIRQ is bound to a different CPU than the one passed in. With that, remove the BUG_ON() from bind_virq_to_irq() to propogate the error upwards. Some VIRQs are per-cpu, but others are per-domain or global. Those must be bound to CPU0 and can then migrate elsewhere. The lookup for per-domain and global will probably fail when migrated off CPU 0, especially when the current CPU is tracked. This now returns -EEXIST instead of BUG_ON(). A second call to bind a per-domain or global VIRQ is not expected, but make it non-fatal to avoid trying to look up the irq, since we don't know which per_cpu(virq_to_irq) it will be in. Cc: stable@vger.kernel.org Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250828003604.8949-3-jason.andryuk@amd.com>
2025-09-09xen/events: Cleanup find_virq() return codesJason Andryuk1-3/+4
rc is overwritten by the evtchn_status hypercall in each iteration, so the return value will be whatever the last iteration is. This could incorrectly return success even if the event channel was not found. Change to an explicit -ENOENT for an un-found virq and return 0 on a successful match. Fixes: 62cc5fc7b2e0 ("xen/pv-on-hvm kexec: rebind virqs to existing eventchannel ports") Cc: stable@vger.kernel.org Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250828003604.8949-2-jason.andryuk@amd.com>
2025-09-08drivers/xen/gntdev: use xen_pv_domain() instead of cached valueJuergen Gross3-24/+18
Eliminate the use_ptemod variable by replacing its use cases with xen_pv_domain(). Instead of passing the xen_pv_domain() return value to gntdev_ioctl_dmabuf_exp_from_refs(), use xen_pv_domain() in that function. Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250826145608.10352-4-jgross@suse.com>
2025-09-08xen: replace XENFEAT_auto_translated_physmap with xen_pv_domain()Juergen Gross6-17/+15
Instead of testing the XENFEAT_auto_translated_physmap feature, just use !xen_pv_domain() which is equivalent. This has the advantage that a kernel not built with CONFIG_XEN_PV will be smaller due to dead code elimination. Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250826145608.10352-3-jgross@suse.com>
2025-08-20drivers/xen/xenbus: remove quirk for Xen 3.xJuergen Gross1-23/+0
The kernel is not supported to run as a Xen guest on Xen versions older than 4.0. Remove xen_strict_xenbus_quirk() which is testing the Xen version to be at least 4.0. Acked-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250815074052.13792-1-jgross@suse.com>
2025-07-14xen/gntdev: remove struct gntdev_copy_batch from stackJuergen Gross2-21/+54
When compiling the kernel with LLVM, the following warning was issued: drivers/xen/gntdev.c:991: warning: stack frame size (1160) exceeds limit (1024) in function 'gntdev_ioctl' The main reason is struct gntdev_copy_batch which is located on the stack and has a size of nearly 1kb. For performance reasons it shouldn't by just dynamically allocated instead, so allocate a new instance when needed and instead of freeing it put it into a list of free structs anchored in struct gntdev_priv. Fixes: a4cdb556cae0 ("xen/gntdev: add ioctl for grant copy") Reported-by: Abinash Singh <abinashsinghlalotra@gmail.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250703073259.17356-1-jgross@suse.com>
2025-07-14xen: fix UAF in dmabuf_exp_from_pages()Al Viro1-18/+10
[dma_buf_fd() fixes; no preferences regarding the tree it goes through - up to xen folks] As soon as we'd inserted a file reference into descriptor table, another thread could close it. That's fine for the case when all we are doing is returning that descriptor to userland (it's a race, but it's a userland race and there's nothing the kernel can do about it). However, if we follow fd_install() with any kind of access to objects that would be destroyed on close (be it the struct file itself or anything destroyed by its ->release()), we have a UAF. dma_buf_fd() is a combination of reserving a descriptor and fd_install(). gntdev dmabuf_exp_from_pages() calls it and then proceeds to access the objects destroyed on close - starting with gntdev_dmabuf itself. Fix that by doing reserving descriptor before anything else and do fd_install() only when everything had been set up. Fixes: a240d6e42e28 ("xen/gntdev: Implement dma-buf export functionality") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Juergen Gross <jgross@suse.com> Message-ID: <20250712050916.GY1880847@ZenIV> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-07-14xen: Remove some deadcode (x)Dr. David Alan Gilbert3-31/+0
Remove three uncalled functions: xenbus_mkdir() was added in 2007 by commit 4bac07c993d0 ("xen: add the Xenbus sysfs and virtual device hotplug driver") but has remained unused. xen_get_runstate_snapshot() last use was removed in 2016 by commit 6ba286ad8457 ("xen: support runqueue steal time on xen") which replaces the use by the _cpu version. xen_resume_notifier_unregister() last use was removed in 2017 by commit 1914f0cd203c ("xen/acpi: upload PM state from init-domain to Xen") Remove them. Signed-off-by: "Dr. David Alan Gilbert" <linux@treblig.org> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250713132625.164728-1-linux@treblig.org>
2025-07-14xen-pciback: Replace scnprintf() with sysfs_emit_at()Ryan Chung1-6/+6
This is the third revision (v3) of this patch series. No changes since v2—only adding Reviewed-by lines. Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ryan Chung <seokwoo.chung130@gmail.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250708001444.86155-1-seokwoo.chung130@gmail.com>
2025-07-14xen/xenbus: fix W=1 build warning in xenbus_va_dev_error functionPeng Jiang1-0/+2
This patch fixes a W=1 format-string warning reported by GCC 12.3.0 by annotating xenbus_switch_fatal() and xenbus_va_dev_error() with the __printf attribute. The attribute enables compile-time validation of printf-style format strings in these functions. The original warning trace: drivers/xen/xenbus/xenbus_client.c:304:9: warning: function 'xenbus_va_dev_error' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format] Signed-off-by: Peng Jiang <jiang.peng9@zte.com.cn> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20250620084104786r5xoR16_AmYZMJLnno3_Q@zte.com.cn> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-05-23xen/x86: fix initial memory balloon targetRoger Pau Monne1-5/+8
When adding extra memory regions as ballooned pages also adjust the balloon target, otherwise when the balloon driver is started it will populate memory to match the target value and consume all the extra memory regions added. This made the usage of the Xen `dom0_mem=,max:` command line parameter for dom0 not work as expected, as the target won't be adjusted and when the balloon is started it will populate memory straight to the 'max:' value. It would equally affect domUs that have memory != maxmem. Kernels built with CONFIG_XEN_UNPOPULATED_ALLOC are not affected, because the extra memory regions are consumed by the unpopulated allocation driver, and then balloon_add_regions() becomes a no-op. Reported-by: John <jw@nuclearfallout.net> Fixes: 87af633689ce ('x86/xen: fix balloon target initialization for PVH dom0') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Message-ID: <20250514080427.28129-1-roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-05-23xen: swiotlb: Wire up map_resource callbackJohn Ernberg1-0/+1
When running Xen on iMX8QXP, an Arm SoC without IOMMU, DMA performed via its eDMA v3 DMA engine fail with a mapping error. The eDMA performs DMA between RAM and MMIO space, and it's the MMIO side that cannot be mapped. MMIO->RAM DMA access cannot be bounce buffered if it would straddle a page boundary and on Xen the MMIO space is 1:1 mapped for Arm, and x86 PV Dom0. Cases where MMIO space is not 1:1 mapped, such as x86 PVH Dom0, requires an IOMMU present to deal with the mapping. Considering the above the map_resource callback can just be wired to the existing dma_direct_map_resource() function. There is nothing to do for unmap so the unmap callback is not needed. Signed-off-by: John Ernberg <john.ernberg@actia.se> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Message-ID: <20250512071440.3726697-1-john.ernberg@actia.se> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-05-07xenbus: Use kref to track req lifetimeJason Andryuk4-8/+23
Marek reported seeing a NULL pointer fault in the xenbus_thread callstack: BUG: kernel NULL pointer dereference, address: 0000000000000000 RIP: e030:__wake_up_common+0x4c/0x180 Call Trace: <TASK> __wake_up_common_lock+0x82/0xd0 process_msg+0x18e/0x2f0 xenbus_thread+0x165/0x1c0 process_msg+0x18e is req->cb(req). req->cb is set to xs_wake_up(), a thin wrapper around wake_up(), or xenbus_dev_queue_reply(). It seems like it was xs_wake_up() in this case. It seems like req may have woken up the xs_wait_for_reply(), which kfree()ed the req. When xenbus_thread resumes, it faults on the zero-ed data. Linux Device Drivers 2nd edition states: "Normally, a wake_up call can cause an immediate reschedule to happen, meaning that other processes might run before wake_up returns." ... which would match the behaviour observed. Change to keeping two krefs on each request. One for the caller, and one for xenbus_thread. Each will kref_put() when finished, and the last will free it. This use of kref matches the description in Documentation/core-api/kref.rst Link: https://lore.kernel.org/xen-devel/ZO0WrR5J0xuwDIxW@mail-itl/ Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Fixes: fd8aa9095a95 ("xen: optimize xenbus driver for multiple concurrent xenstore accesses") Cc: stable@vger.kernel.org Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250506210935.5607-1-jason.andryuk@amd.com>
2025-05-07xenbus: Allow PVH dom0 a non-local xenstoreJason Andryuk1-6/+8
Make xenbus_init() allow a non-local xenstore for a PVH dom0 - it is currently forced to XS_LOCAL. With Hyperlaunch booting dom0 and a xenstore stubdom, dom0 can be handled as a regular XS_HVM following the late init path. Ideally we'd drop the use of xen_initial_domain() and just check for the event channel instead. However, ARM has a xen,enhanced no-xenstore mode, where the event channel and PFN would both be 0. Retain the xen_initial_domain() check, and use that for an additional check when the event channel is 0. Check the full 64bit HVM_PARAM_STORE_EVTCHN value to catch the off chance that high bits are set for the 32bit event channel. Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Change-Id: I5506da42e4c6b8e85079fefb2f193c8de17c7437 Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250506204456.5220-1-jason.andryuk@amd.com>
2025-05-07xen: swiotlb: Use swiotlb bouncing if kmalloc allocation demands itJohn Ernberg1-0/+1
Xen swiotlb support was missed when the patch set starting with 4ab5f8ec7d71 ("mm/slab: decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN") was merged. When running Xen on iMX8QXP, a SoC without IOMMU, the effect was that USB transfers ended up corrupted when there was more than one URB inflight at the same time. Add a call to dma_kmalloc_needs_bounce() to make sure that allocations too small for DMA get bounced via swiotlb. Closes: https://lore.kernel.org/linux-usb/ab2776f0-b838-4cf6-a12a-c208eb6aad59@actia.se/ Fixes: 4ab5f8ec7d71 ("mm/slab: decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN") Cc: stable@kernel.org # v6.5+ Signed-off-by: John Ernberg <john.ernberg@actia.se> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250502114043.1968976-2-john.ernberg@actia.se>
2025-04-07x86/xen: fix balloon target initialization for PVH dom0Roger Pau Monne1-10/+24
PVH dom0 re-uses logic from PV dom0, in which RAM ranges not assigned to dom0 are re-used as scratch memory to map foreign and grant pages. Such logic relies on reporting those unpopulated ranges as RAM to Linux, and mark them as reserved. This way Linux creates the underlying page structures required for metadata management. Such approach works fine on PV because the initial balloon target is calculated using specific Xen data, that doesn't take into account the memory type changes described above. However on HVM and PVH the initial balloon target is calculated using get_num_physpages(), and that function does take into account the unpopulated RAM regions used as scratch space for remote domain mappings. This leads to PVH dom0 having an incorrect initial balloon target, which causes malfunction (excessive memory freeing) of the balloon driver if the dom0 memory target is later adjusted from the toolstack. Fix this by using xen_released_pages to account for any pages that are part of the memory map, but are already unpopulated when the balloon driver is initialized. This accounts for any regions used for scratch remote mappings. Note on x86 xen_released_pages definition is moved to enlighten.c so it's uniformly available for all Xen-enabled builds. Take the opportunity to unify PV with PVH/HVM guests regarding the usage of get_num_physpages(), as that avoids having to add different logic for PV vs PVH in both balloon_add_regions() and arch_xen_unpopulated_init(). Much like a6aa4eb994ee, the code in this changeset should have been part of 38620fc4e893. Fixes: a6aa4eb994ee ('xen/x86: add extra pages to unpopulated-alloc if available') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250407082838.65495-1-roger.pau@citrix.com>
2025-04-07xen: Change xen-acpi-processor dom0 dependencyJason Andryuk1-1/+1
xen-acpi-processor functions under a PVH dom0 with only a xen_initial_domain() runtime check. Change the Kconfig dependency from PV dom0 to generic dom0 to reflect that. Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Tested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250331172913.51240-1-jason.andryuk@amd.com>
2025-04-07xenbus: add module descriptionArnd Bergmann1-0/+1
Modules without a description now cause a warning: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/xen/xenbus/xenbus_probe_frontend.o Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20250328113302.2632353-1-arnd@kernel.org>
2025-04-01Merge tag 'mm-stable-2025-03-30-16-52' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The series "Some cleanup for memcg" from Chen Ridong implements some relatively monir cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found then using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes thins better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed remove the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. - The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. * tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ...
2025-03-21xen: balloon: update the NR_BALLOON_PAGES stateNico Pache1-0/+4
Update the NR_BALLOON_PAGES counter when pages are added to or removed from the Xen balloon. Link: https://lkml.kernel.org/r/20250314213757.244258-5-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Alexander Atanasov <alexander.atanasov@virtuozzo.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Juegren Gross <jgross@suse.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-21xen/pci: Do not register devices with segments >= 0x10000Roger Pau Monne1-0/+32
The current hypercall interface for doing PCI device operations always uses a segment field that has a 16 bit width. However on Linux there are buses like VMD that hook up devices into the PCI hierarchy at segment >= 0x10000, after the maximum possible segment enumerated in ACPI. Attempting to register or manage those devices with Xen would result in errors at best, or overlaps with existing devices living on the truncated equivalent segment values. Note also that the VMD segment numbers are arbitrarily assigned by the OS, and hence there would need to be some negotiation between Xen and the OS to agree on how to enumerate VMD segments and devices behind them. Skip notifying Xen about those devices. Given how VMD bridges can multiplex interrupts on behalf of devices behind them there's no need for Xen to be aware of such devices for them to be usable by Linux. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Juergen Gross <jgross@suse.com> Message-ID: <20250219092059.90850-2-roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-03-14xen/pciback: Remove unused pcistub_get_pci_devDr. David Alan Gilbert2-22/+0
pcistub_get_pci_dev() was added in 2009 as part of: commit 30edc14bf39a ("xen/pciback: xen pci backend driver.") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20250307004736.291229-1-linux@treblig.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-03-14xenfs/xensyms: respect hypervisor's "next" indicationJan Beulich1-2/+2
The interface specifies the symnum field as an input and output; the hypervisor sets it to the next sequential symbol's index. xensyms_next() incrementing the position explicitly (and xensyms_next_sym() decrementing it to "rewind") is only correct as long as the sequence of symbol indexes is non-sparse. Use the hypervisor-supplied value instead to update the position in xensyms_next(), and use the saved incoming index in xensyms_next_sym(). Cc: stable@kernel.org Fixes: a11f4f0a4e18 ("xen: xensyms support") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <15d5e7fa-ec5d-422f-9319-d28bed916349@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-03-14xen: Add support for XenServer 6.1 platform deviceFrediano Ziglio1-0/+4
On XenServer on Windows machine a platform device with ID 2 instead of 1 is used. This device is mainly identical to device 1 but due to some Windows update behaviour it was decided to use a device with a different ID. This causes compatibility issues with Linux which expects, if Xen is detected, to find a Xen platform device (5853:0001) otherwise code will crash due to some missing initialization (specifically grant tables). Specifically from dmesg RIP: 0010:gnttab_expand+0x29/0x210 Code: 90 0f 1f 44 00 00 55 31 d2 48 89 e5 41 57 41 56 41 55 41 89 fd 41 54 53 48 83 ec 10 48 8b 05 7e 9a 49 02 44 8b 35 a7 9a 49 02 <8b> 48 04 8d 44 39 ff f7 f1 45 8d 24 06 89 c3 e8 43 fe ff ff 44 39 RSP: 0000:ffffba34c01fbc88 EFLAGS: 00010086 ... The device 2 is presented by Xapi adding device specification to Qemu command line. Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com> Acked-by: Juergen Gross <jgross@suse.com> Message-ID: <20250227145016.25350-1-frediano.ziglio@cloud.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-02-14Merge tag 'for-linus-6.14-rc3-tag' of ↵Linus Torvalds1-9/+13
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "Three fixes to xen-swiotlb driver: - two fixes for issues coming up due to another fix in 6.12 - addition of an __init annotation" * tag 'for-linus-6.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: Xen/swiotlb: mark xen_swiotlb_fixup() __init x86/xen: allow larger contiguous memory regions in PV guests xen/swiotlb: relax alignment requirements
2025-02-13Xen/swiotlb: mark xen_swiotlb_fixup() __initJan Beulich1-1/+1
It's sole user (pci_xen_swiotlb_init()) is __init, too. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Message-ID: <e1198286-99ec-41c1-b5ad-e04e285836c9@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-02-13xen/swiotlb: relax alignment requirementsJuergen Gross1-8/+12
When mapping a buffer for DMA via .map_page or .map_sg DMA operations, there is no need to check the machine frames to be aligned according to the mapped areas size. All what is needed in these cases is that the buffer is contiguous at machine level. So carve out the alignment check from range_straddles_page_boundary() and move it to a helper called by xen_swiotlb_alloc_coherent() and xen_swiotlb_free_coherent() directly. Fixes: 9f40ec84a797 ("xen/swiotlb: add alignment check for dma buffers") Reported-by: Jan Vejvalka <jan.vejvalka@lfmotol.cuni.cz> Tested-by: Jan Vejvalka <jan.vejvalka@lfmotol.cuni.cz> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-01-29Merge tag 'for-linus-6.14-rc1-tag' of ↵Linus Torvalds3-4/+14
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: "Three minor fixes" * tag 'for-linus-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: update pvcalls_front_accept prototype Grab mm lock before grabbing pt lock xen: pcpu: remove unnecessary __ref annotation
2025-01-28treewide: const qualify ctl_tables where applicableJoel Granados1-1/+1
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function. Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers. Created this by running an spatch followed by a sed command: Spatch: virtual patch @ depends on !(file in "net") disable optional_qualifier @ identifier table_name != { watchdog_hardlockup_sysctl, iwcm_ctl_table, ucma_ctl_table, memory_allocation_profiling_sysctls, loadpin_sysctl_table }; @@ + const struct ctl_table table_name [] = { ... }; sed: sed --in-place \ -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \ kernel/utsname_sysctl.c Reviewed-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/ Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-22xen: update pvcalls_front_accept prototypeStefano Stabellini2-3/+13
While currently there are no in-tree callers of these functions, it is best to keep them up-to-date with the latest network API. In addition, add the missing EXPORT_SYMBOL_GPL to the functions listed in pvcalls-front.h. Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <alpine.DEB.2.22.394.2501201537560.11086@ubuntu-linux-20-04-desktop> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-01-21Merge tag 'irq-core-2025-01-21' of ↵Linus Torvalds1-6/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt subsystem updates from Thomas Gleixner: - Consolidate the machine_kexec_mask_interrupts() by providing a generic implementation and replacing the copy & pasta orgy in the relevant architectures. - Prevent unconditional operations on interrupt chips during kexec shutdown, which can trigger warnings in certain cases when the underlying interrupt has been shut down before. - Make the enforcement of interrupt handling in interrupt context unconditionally available, so that it actually works for non x86 related interrupt chips. The earlier enablement for ARM GIC chips set the required chip flag, but did not notice that the check was hidden behind a config switch which is not selected by ARM[64]. - Decrapify the handling of deferred interrupt affinity setting. Some interrupt chips require that affinity changes are made from the context of handling an interrupt to avoid certain race conditions. For x86 this was the default, but with interrupt remapping this requirement was lifted and a flag was introduced which tells the core code that affinity changes can be done in any context. Unrestricted affinity changes are the default for the majority of interrupt chips. RISCV has the requirement to add the deferred mode to one of it's interrupt controllers, but with the original implementation this would require to add the any context flag to all other RISC-V interrupt chips. That's backwards, so reverse the logic and require that chips, which need the deferred mode have to be marked accordingly. That avoids chasing the 'sane' chips and marking them. - Add multi-node support to the Loongarch AVEC interrupt controller driver. - The usual tiny cleanups, fixes and improvements all over the place. * tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set() genirq/timings: Add kernel-doc for a function parameter genirq: Remove IRQ_MOVE_PCNTXT and related code x86/apic: Convert to IRQCHIP_MOVE_DEFERRED genirq: Provide IRQCHIP_MOVE_DEFERRED hexagon: Remove GENERIC_PENDING_IRQ leftover ARC: Remove GENERIC_PENDING_IRQ genirq: Remove handle_enforce_irqctx() wrapper genirq: Make handle_enforce_irqctx() unconditionally available irqchip/loongarch-avec: Add multi-nodes topology support irqchip/ts4800: Replace seq_printf() by seq_puts() irqchip/ti-sci-inta : Add module build support irqchip/ti-sci-intr: Add module build support irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function irqchip: keystone: Use syscon_regmap_lookup_by_phandle_args genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown kexec: Consolidate machine_kexec_mask_interrupts() implementation genirq: Reuse irq_thread_fn() for forced thread case genirq: Move irq_thread_fn() further up in the code
2025-01-20xen: pcpu: remove unnecessary __ref annotationSergio Miguéns Iglesias1-1/+1
The __ref annotation has been there since many years now, but the reason why it has been added has been removed from the code long ago. [jgross: clarify commit message] Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: xen-devel@lists.xenproject.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergio Miguéns Iglesias <sergio@lony.xyz> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20241202085910.5539-1-sergio@lony.xyz> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-01-15x86/apic: Convert to IRQCHIP_MOVE_DEFERREDThomas Gleixner1-6/+0
Instead of marking individual interrupts as safe to be migrated in arbitrary contexts, mark the interrupt chips, which require the interrupt to be moved in actual interrupt context, with the new IRQCHIP_MOVE_DEFERRED flag. This makes more sense because this is a per interrupt chip property and not restricted to individual interrupts. That flips the logic from the historical opt-out to a opt-in model. This is simpler to handle for other architectures, which default to unrestricted affinity setting. It also allows to cleanup the redundant core logic significantly. All interrupt chips, which belong to a top-level domain sitting directly on top of the x86 vector domain are marked accordingly, unless the related setup code marks the interrupts with IRQ_MOVE_PCNTXT, i.e. XEN. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Acked-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/all/20241210103335.563277044@linutronix.de
2024-12-02module: Convert symbol namespace to string literalPeter Zijlstra1-1/+1
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-01Get rid of 'remove_new' relic from platform driver structLinus Torvalds1-1/+1
The continual trickle of small conversion patches is grating on me, and is really not helping. Just get rid of the 'remove_new' member function, which is just an alias for the plain 'remove', and had a comment to that effect: /* * .remove_new() is a relic from a prototype conversion of .remove(). * New drivers are supposed to implement .remove(). Once all drivers are * converted to not use .remove_new any more, it will be dropped. */ This was just a tree-wide 'sed' script that replaced '.remove_new' with '.remove', with some care taken to turn a subsequent tab into two tabs to make things line up. I did do some minimal manual whitespace adjustment for places that used spaces to line things up. Then I just removed the old (sic) .remove_new member function, and this is the end result. No more unnecessary conversion noise. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-11-19Merge tag 'irq-core-2024-11-18' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt subsystem updates from Thomas Gleixner: "Tree wide: - Make nr_irqs static to the core code and provide accessor functions to remove existing and prevent future aliasing problems with local variables or function arguments of the same name. Core code: - Prevent freeing an interrupt in the devres code which is not managed by devres in the first place. - Use seq_put_decimal_ull_width() for decimal values output in /proc/interrupts which increases performance significantly as it avoids parsing the format strings over and over. - Optimize raising the timer and hrtimer soft interrupts by using the 'set bit only' variants instead of the combined version which checks whether ksoftirqd should be woken up. The latter is a pointless exercise as both soft interrupts are raised in the context of the timer interrupt and therefore never wake up ksoftirqd. - Delegate timer/hrtimer soft interrupt processing to a dedicated thread on RT. Timer and hrtimer soft interrupts are always processed in ksoftirqd on RT enabled kernels. This can lead to high latencies when other soft interrupts are delegated to ksoftirqd as well. The separate thread allows to run them seperately under a RT scheduling policy to reduce the latency overhead. Drivers: - New drivers or extensions of existing drivers to support Renesas RZ/V2H(P), Aspeed AST27XX, T-HEAD C900 and ATMEL sam9x7 interrupt chips - Support for multi-cluster GICs on MIPS. MIPS CPUs can come with multiple CPU clusters, where each CPU cluster has its own GIC (Generic Interrupt Controller). This requires to access the GIC of a remote cluster through a redirect register block. This is encapsulated into a set of helper functions to keep the complexity out of the actual code paths which handle the GIC details. - Support for encrypted guests in the ARM GICV3 ITS driver The ITS page needs to be shared with the hypervisor and therefore must be decrypted. - Small cleanups and fixes all over the place" * tag 'irq-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) irqchip/riscv-aplic: Prevent crash when MSI domain is missing genirq/proc: Use seq_put_decimal_ull_width() for decimal values softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT. timers: Use __raise_softirq_irqoff() to raise the softirq. hrtimer: Use __raise_softirq_irqoff() to raise the softirq riscv: defconfig: Enable T-HEAD C900 ACLINT SSWI drivers irqchip: Add T-HEAD C900 ACLINT SSWI driver dt-bindings: interrupt-controller: Add T-HEAD C900 ACLINT SSWI device irqchip/stm32mp-exti: Use of_property_present() for non-boolean properties irqchip/mips-gic: Fix selection of GENERIC_IRQ_EFFECTIVE_AFF_MASK irqchip/mips-gic: Prevent indirect access to clusters without CPU cores irqchip/mips-gic: Multi-cluster support irqchip/mips-gic: Setup defaults in each cluster irqchip/mips-gic: Support multi-cluster in for_each_online_cpu_gic() irqchip/mips-gic: Replace open coded online CPU iterations genirq/irqdesc: Use str_enabled_disabled() helper in wakeup_show() genirq/devres: Don't free interrupt which is not managed by devres irqchip/gic-v3-its: Fix over allocation in itt_alloc_pool() irqchip/aspeed-intc: Add AST27XX INTC support dt-bindings: interrupt-controller: Add support for ASPEED AST27XX INTC ...
2024-11-18Merge tag 'for-linus-6.13-rc1-tag' of ↵Linus Torvalds1-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - a series for booting as a PVH guest, doing some cleanups after the previous work to make PVH boot code position independent - a fix of the xenbus driver avoiding a leak in an error case * tag 'for-linus-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: Fix the issue of resource not being properly released in xenbus_dev_probe() x86/pvh: Avoid absolute symbol references in .head.text x86/xen: Avoid relocatable quantities in Xen ELF notes x86/pvh: Omit needless clearing of phys_base x86/pvh: Use correct size value in GDT descriptor x86/pvh: Call C code via the kernel virtual mapping
2024-11-18Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-23/+5
Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...
2024-11-13xen: Fix the issue of resource not being properly released in xenbus_dev_probe()Qiu-ji Chen1-1/+7
This patch fixes an issue in the function xenbus_dev_probe(). In the xenbus_dev_probe() function, within the if (err) branch at line 313, the program incorrectly returns err directly without releasing the resources allocated by err = drv->probe(dev, id). As the return value is non-zero, the upper layers assume the processing logic has failed. However, the probe operation was performed earlier without a corresponding remove operation. Since the probe actually allocates resources, failing to perform the remove operation could lead to problems. To fix this issue, we followed the resource release logic of the xenbus_dev_remove() function by adding a new block fail_remove before the fail_put block. After entering the branch if (err) at line 313, the function will use a goto statement to jump to the fail_remove block, ensuring that the previously acquired resources are correctly released, thus preventing the reference count leak. This bug was identified by an experimental static analysis tool developed by our team. The tool specializes in analyzing reference count operations and detecting potential issues where resources are not properly managed. In this case, the tool flagged the missing release operation as a potential problem, which led to the development of this patch. Fixes: 4bac07c993d0 ("xen: add the Xenbus sysfs and virtual device hotplug driver") Cc: stable@vger.kernel.org Signed-off-by: Qiu-ji Chen <chenqiuji666@gmail.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20241105130919.4621-1-chenqiuji666@gmail.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-11-03assorted variants of irqfd setup: convert to CLASS(fd)Al Viro1-13/+4
in all of those failure exits prior to fdget() are plain returns and the only thing done after fdput() is (on failure exits) a kfree(), which can be done before fdput() just fine. NOTE: in acrn_irqfd_assign() 'fail:' failure exit is wrong for eventfd_ctx_fileget() failure (we only want fdput() there) and once we stop doing that, it doesn't need to check if eventfd is NULL or ERR_PTR(...) there. NOTE: in privcmd we move fdget() up before the allocation - more to the point, before the copy_from_user() attempt. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()Al Viro1-10/+1
just call it, same as privcmd_ioeventfd_deassign() does... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18xen: Remove dependency between pciback and privcmdJiqian Chen4-7/+35
Commit 2fae6bb7be32 ("xen/privcmd: Add new syscall to get gsi from dev") adds a weak reverse dependency to the config XEN_PRIVCMD definition, that dependency causes xen-privcmd can't be loaded on domU, because dependent xen-pciback isn't always be loaded successfully on domU. To solve above problem, remove that dependency, and do not call pcistub_get_gsi_from_sbdf() directly, instead add a hook in drivers/xen/apci.c, xen-pciback register the real call function, then in privcmd_ioctl_pcidev_get_gsi call that hook. Fixes: 2fae6bb7be32 ("xen/privcmd: Add new syscall to get gsi from dev") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20241012084537.1543059-1-Jiqian.Chen@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-10-16xen/events: Switch to irq_get_nr_irqs()Bart Van Assche1-1/+1
Use the irq_get_nr_irqs() function instead of the global variable 'nr_irqs'. Prepare for changing 'nr_irqs' from an exported global variable into a variable with file scope. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241015190953.1266194-20-bvanassche@acm.org
2024-10-02xen: Fix config option reference in XEN_PRIVCMD definitionLukas Bulwahn1-1/+1
Commit 2fae6bb7be32 ("xen/privcmd: Add new syscall to get gsi from dev") adds a weak reverse dependency to the config XEN_PRIVCMD definition, referring to CONFIG_XEN_PCIDEV_BACKEND. In Kconfig files, one refers to config options without the CONFIG prefix, though. So in its current form, this does not create the reverse dependency as intended, but is an attribute with no effect. Refer to the intended config option XEN_PCIDEV_BACKEND in the XEN_PRIVCMD definition. Fixes: 2fae6bb7be32 ("xen/privcmd: Add new syscall to get gsi from dev") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20240930090650.429813-1-lukas.bulwahn@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-27Merge tag 'for-linus-6.12-rc1a-tag' of ↵Linus Torvalds6-8/+168
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull more xen updates from Juergen Gross: "A second round of Xen related changes and features: - a small fix of the xen-pciback driver for a warning issued by sparse - support PCI passthrough when using a PVH dom0 - enable loading the kernel in PVH mode at arbitrary addresses, avoiding conflicts with the memory map when running as a Xen dom0 using the host memory layout" * tag 'for-linus-6.12-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/pvh: Add 64bit relocation page tables x86/kernel: Move page table macros to header x86/pvh: Set phys_base when calling xen_prepare_pvh() x86/pvh: Make PVH entrypoint PIC for x86-64 xen: sync elfnote.h from xen tree xen/pciback: fix cast to restricted pci_ers_result_t and pci_power_t xen/privcmd: Add new syscall to get gsi from dev xen/pvh: Setup gsi for passthrough device xen/pci: Add a function to reset device for xen
2024-09-27[tree-wide] finally take no_llseek outAl Viro3-3/+0
no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-25xen/pciback: fix cast to restricted pci_ers_result_t and pci_power_tMin-Hua Chen2-2/+2
This patch fix the following sparse warning by applying __force cast to pci_ers_result_t and pci_power_t. drivers/xen/xen-pciback/pci_stub.c:760:16: sparse: warning: cast to restricted pci_ers_result_t drivers/xen/xen-pciback/conf_space_capability.c:125:22: sparse: warning: cast to restricted pci_power_t No functional changes intended. Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20240917233653.61630-1-minhuadotchen@gmail.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-25xen/privcmd: Add new syscall to get gsi from devJiqian Chen3-3/+68
On PVH dom0, when passthrough a device to domU, QEMU and xl tools want to use gsi number to do pirq mapping, see QEMU code xen_pt_realize->xc_physdev_map_pirq, and xl code pci_add_dm_done->xc_physdev_map_pirq, but in current codes, the gsi number is got from file /sys/bus/pci/devices/<sbdf>/irq, that is wrong, because irq is not equal with gsi, they are in different spaces, so pirq mapping fails. And in current linux codes, there is no method to get gsi for userspace. For above purpose, record gsi of pcistub devices when init pcistub and add a new syscall into privcmd to let userspace can get gsi when they have a need. Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Message-ID: <20240924061437.2636766-4-Jiqian.Chen@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-25xen/pvh: Setup gsi for passthrough deviceJiqian Chen2-0/+70
In PVH dom0, the gsis don't get registered, but the gsi of a passthrough device must be configured for it to be able to be mapped into a domU. When assigning a device to passthrough, proactively setup the gsi of the device during that process. Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Message-ID: <20240924061437.2636766-3-Jiqian.Chen@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-25xen/pci: Add a function to reset device for xenJiqian Chen2-3/+28
When device on dom0 side has been reset, the vpci on Xen side won't get notification, so that the cached state in vpci is all out of date with the real device state. To solve that problem, add a new function to clear all vpci device state when device is reset on dom0 side. And call that function in pcistub_init_device. Because when using "pci-assignable-add" to assign a passthrough device in Xen, it will reset passthrough device and the vpci state will out of date, and then device will fail to restore bar state. Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Message-ID: <20240924061437.2636766-2-Jiqian.Chen@amd.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-23Merge tag 'pull-stable-struct_fd' of ↵Linus Torvalds1-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.
2024-09-19Merge tag 'dma-mapping-6.12-2024-09-19' of ↵Linus Torvalds1-2/+2
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - support DMA zones for arm64 systems where memory starts at > 4GB (Baruch Siach, Catalin Marinas) - support direct calls into dma-iommu and thus obsolete dma_map_ops for many common configurations (Leon Romanovsky) - add DMA-API tracing (Sean Anderson) - remove the not very useful return value from various dma_set_* APIs (Christoph Hellwig) - misc cleanups and minor optimizations (Chen Y, Yosry Ahmed, Christoph Hellwig) * tag 'dma-mapping-6.12-2024-09-19' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: reflow dma_supported dma-mapping: reliably inform about DMA support for IOMMU dma-mapping: add tracing for dma-mapping API calls dma-mapping: use IOMMU DMA calls for common alloc/free page calls dma-direct: optimize page freeing when it is not addressable dma-mapping: clearly mark DMA ops as an architecture feature vdpa_sim: don't select DMA_OPS arm64: mm: keep low RAM dma zone dma-mapping: don't return errors from dma_set_max_seg_size dma-mapping: don't return errors from dma_set_seg_boundary dma-mapping: don't return errors from dma_set_min_align_mask scsi: check that busses support the DMA API before setting dma parameters arm64: mm: fix DMA zone when dma-ranges is missing dma-mapping: direct calls for dma-iommu dma-mapping: call ->unmap_page and ->unmap_sg unconditionally arm64: support DMA zone above 4GB dma-mapping: replace zone_dma_bits by zone_dma_limit dma-mapping: use bit masking to check VM_DMA_COHERENT
2024-09-17xen/swiotlb: fix allocated sizeJuergen Gross1-2/+2
The allocated size in xen_swiotlb_alloc_coherent() and xen_swiotlb_free_coherent() is calculated wrong for the case of XEN_PAGE_SIZE not matching PAGE_SIZE. Fix that. Fixes: 7250f422da04 ("xen-swiotlb: use actually allocated size on check physical continuous") Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-17xen/swiotlb: add alignment check for dma buffersJuergen Gross1-0/+6
When checking a memory buffer to be consecutive in machine memory, the alignment needs to be checked, too. Failing to do so might result in DMA memory not being aligned according to its requested size, leading to error messages like: 4xxx 0000:2b:00.0: enabling device (0140 -> 0142) 4xxx 0000:2b:00.0: Ring address not aligned 4xxx 0000:2b:00.0: Failed to initialise service qat_crypto 4xxx 0000:2b:00.0: Resetting device qat_dev0 4xxx: probe of 0000:2b:00.0 failed with error -14 Fixes: 9435cce87950 ("xen/swiotlb: Add support for 64KB page granularity") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-13xen/pci: Avoid -Wflex-array-member-not-at-end warningGustavo A. R. Silva1-9/+5
Use the `DEFINE_RAW_FLEX()` helper for an on-stack definition of a flexible structure where the size of the flexible-array member is known at compile-time, and refactor the rest of the code, accordingly. So, with this, fix the following warning: drivers/xen/pci.c:48:55: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Message-ID: <ZsU58MvoYEEqBHZl@elsanto> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-12xen/xenbus: Convert to use ERR_CAST()Shen Lichuan1-3/+3
Use ERR_CAST() as it is designed for casting an error pointer to another type. This macro utilizes the __force and __must_check modifiers, which instruct the compiler to verify for errors at the locations where it is employed. Signed-off-by: Shen Lichuan <shenlichuan@vivo.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20240829084710.30312-1-shenlichuan@vivo.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-04dma-mapping: clearly mark DMA ops as an architecture featureChristoph Hellwig1-2/+2
DMA ops are a helper for architectures and not for drivers to override the DMA implementation. Unfortunately driver authors keep ignoring this. Make the fact more clear by renaming the symbol to ARCH_HAS_DMA_OPS and having the two drivers overriding their dma_ops depend on that. These drivers should probably be marked broken, but we can give them a bit of a grace period for that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> # for IPU6 Acked-by: Robin Murphy <robin.murphy@arm.com>
2024-08-12introduce fd_file(), convert all accessors to it.Al Viro1-5/+5
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-07-25Merge tag 'driver-core-6.11-rc1' of ↵Linus Torvalds2-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the big set of driver core changes for 6.11-rc1. Lots of stuff in here, with not a huge diffstat, but apis are evolving which required lots of files to be touched. Highlights of the changes in here are: - platform remove callback api final fixups (Uwe took many releases to get here, finally!) - Rust bindings for basic firmware apis and initial driver-core interactions. It's not all that useful for a "write a whole driver in rust" type of thing, but the firmware bindings do help out the phy rust drivers, and the driver core bindings give a solid base on which others can start their work. There is still a long way to go here before we have a multitude of rust drivers being added, but it's a great first step. - driver core const api changes. This reached across all bus types, and there are some fix-ups for some not-common bus types that linux-next and 0-day testing shook out. This work is being done to help make the rust bindings more safe, as well as the C code, moving toward the end-goal of allowing us to put driver structures into read-only memory. We aren't there yet, but are getting closer. - minor devres cleanups and fixes found by code inspection - arch_topology minor changes - other minor driver core cleanups All of these have been in linux-next for a very long time with no reported problems" * tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (55 commits) ARM: sa1100: make match function take a const pointer sysfs/cpu: Make crash_hotplug attribute world-readable dio: Have dio_bus_match() callback take a const * zorro: make match function take a const pointer driver core: module: make module_[add|remove]_driver take a const * driver core: make driver_find_device() take a const * driver core: make driver_[create|remove]_file take a const * firmware_loader: fix soundness issue in `request_internal` firmware_loader: annotate doctests as `no_run` devres: Correct code style for functions that return a pointer type devres: Initialize an uninitialized struct member devres: Fix memory leakage caused by driver API devm_free_percpu() devres: Fix devm_krealloc() wasting memory driver core: platform: Switch to use kmemdup_array() driver core: have match() callback in struct bus_type take a const * MAINTAINERS: add Rust device abstractions to DRIVER CORE device: rust: improve safety comments MAINTAINERS: add Danilo as FIRMWARE LOADER maintainer MAINTAINERS: add Rust FW abstractions to FIRMWARE LOADER firmware: rust: improve safety comments ...
2024-07-21Merge tag 'mm-stable-2024-07-21-14-50' of ↵Linus Torvalds1-2/+7
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - In the series "mm: Avoid possible overflows in dirty throttling" Jan Kara addresses a couple of issues in the writeback throttling code. These fixes are also targetted at -stable kernels. - Ryusuke Konishi's series "nilfs2: fix potential issues related to reserved inodes" does that. This should actually be in the mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad. - More folio conversions from Kefeng Wang in the series "mm: convert to folio_alloc_mpol()" - Kemeng Shi has sent some cleanups to the writeback code in the series "Add helper functions to remove repeated code and improve readability of cgroup writeback" - Kairui Song has made the swap code a little smaller and a little faster in the series "mm/swap: clean up and optimize swap cache index". - In the series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David Hildenbrand has reworked the rather sketchy handling of the use of the zeropage in MAP_SHARED mappings. I don't see any runtime effects here - more a cleanup/understandability/maintainablity thing. - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of higher addresses, for aarch64. The (poorly named) series is "Restructure va_high_addr_switch". - The core TLB handling code gets some cleanups and possible slight optimizations in Bang Li's series "Add update_mmu_tlb_range() to simplify code". - Jane Chu has improved the handling of our fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the series "Enhance soft hwpoison handling and injection". - Jeff Johnson has sent a billion patches everywhere to add MODULE_DESCRIPTION() to everything. Some landed in this pull. - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has simplified migration's use of hardware-offload memory copying. - Yosry Ahmed performs more folio API conversions in his series "mm: zswap: trivial folio conversions". - In the series "large folios swap-in: handle refault cases first", Chuanhua Han inches us forward in the handling of large pages in the swap code. This is a cleanup and optimization, working toward the end objective of full support of large folio swapin/out. - In the series "mm,swap: cleanup VMA based swap readahead window calculation", Huang Ying has contributed some cleanups and a possible fixlet to his VMA based swap readahead code. - In the series "add mTHP support for anonymous shmem" Baolin Wang has taught anonymous shmem mappings to use multisize THP. By default this is a no-op - users must opt in vis sysfs controls. Dramatic improvements in pagefault latency are realized. - David Hildenbrand has some cleanups to our remaining use of page_mapcount() in the series "fs/proc: move page_mapcount() to fs/proc/internal.h". - David also has some highmem accounting cleanups in the series "mm/highmem: don't track highmem pages manually". - Build-time fixes and cleanups from John Hubbard in the series "cleanups, fixes, and progress towards avoiding "make headers"". - Cleanups and consolidation of the core pagemap handling from Barry Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers and utilize them". - Lance Yang's series "Reclaim lazyfree THP without splitting" has reduced the latency of the reclaim of pmd-mapped THPs under fairly common circumstances. A 10x speedup is seen in a microbenchmark. It does this by punting to aother CPU but I guess that's a win unless all CPUs are pegged. - hugetlb_cgroup cleanups from Xiu Jianfeng in the series "mm/hugetlb_cgroup: rework on cftypes". - Miaohe Lin's series "Some cleanups for memory-failure" does just that thing. - Someone other than SeongJae has developed a DAMON feature in Honggyu Kim's series "DAMON based tiered memory management for CXL memory". This adds DAMON features which may be used to help determine the efficiency of our placement of CXL/PCIe attached DRAM. - DAMON user API centralization and simplificatio work in SeongJae Park's series "mm/damon: introduce DAMON parameters online commit function". - In the series "mm: page_type, zsmalloc and page_mapcount_reset()" David Hildenbrand does some maintenance work on zsmalloc - partially modernizing its use of pageframe fields. - Kefeng Wang provides more folio conversions in the series "mm: remove page_maybe_dma_pinned() and page_mkclean()". - More cleanup from David Hildenbrand, this time in the series "mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline() pages" and permits the removal of some virtio-mem hacks. - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()" is a cleanup to the anon folio handling in preparation for mTHP (multisize THP) swapin. - Kefeng Wang's series "mm: improve clear and copy user folio" implements more folio conversions, this time in the area of large folio userspace copying. - The series "Docs/mm/damon/maintaier-profile: document a mailing tool and community meetup series" tells people how to get better involved with other DAMON developers. From SeongJae Park. - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does that. - David Hildenbrand sends along more cleanups, this time against the migration code. The series is "mm/migrate: move NUMA hinting fault folio isolation + checks under PTL". - Jan Kara has found quite a lot of strangenesses and minor errors in the readahead code. He addresses this in the series "mm: Fix various readahead quirks". - SeongJae Park's series "selftests/damon: test DAMOS tried regions and {min,max}_nr_regions" adds features and addresses errors in DAMON's self testing code. - Gavin Shan has found a userspace-triggerable WARN in the pagecache code. The series "mm/filemap: Limit page cache size to that supported by xarray" addresses this. The series is marked cc:stable. - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations and cleanup" cleans up and slightly optimizes KSM. - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of code motion. The series (which also makes the memcg-v1 code Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put under config option" and "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1" - Dan Schatzberg's series "Add swappiness argument to memory.reclaim" adds an additional feature to this cgroup-v2 control file. - The series "Userspace controls soft-offline pages" from Jiaqi Yan permits userspace to stop the kernel's automatic treatment of excessive correctable memory errors. In order to permit userspace to monitor and handle this situation. - Kefeng Wang's series "mm: migrate: support poison recover from migrate folio" teaches the kernel to appropriately handle migration from poisoned source folios rather than simply panicing. - SeongJae Park's series "Docs/damon: minor fixups and improvements" does those things. - In the series "mm/zsmalloc: change back to per-size_class lock" Chengming Zhou improves zsmalloc's scalability and memory utilization. - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare refcount increments. So these paes can first be moved aside if they reside in the movable zone or a CMA block. - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps for much faster reading of vma information. The series is "query VMAs from /proc/<pid>/maps". - In the series "mm: introduce per-order mTHP split counters" Lance Yang improves the kernel's presentation of developer information related to multisize THP splitting. - Michael Ellerman has developed the series "Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)". This permits userspace to use all available huge page sizes. - In the series "revert unconditional slab and page allocator fault injection calls" Vlastimil Babka removes a performance-affecting and not very useful feature from slab fault injection. * tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits) mm/mglru: fix ineffective protection calculation mm/zswap: fix a white space issue mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio mm/hugetlb: fix possible recursive locking detected warning mm/gup: clear the LRU flag of a page before adding to LRU batch mm/numa_balancing: teach mpol_to_str about the balancing mode mm: memcg1: convert charge move flags to unsigned long long alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting lib: reuse page_ext_data() to obtain codetag_ref lib: add missing newline character in the warning message mm/mglru: fix overshooting shrinker memory mm/mglru: fix div-by-zero in vmpressure_calc_level() mm/kmemleak: replace strncpy() with strscpy() mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB mm: ignore data-race in __swap_writepage hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr mm: shmem: rename mTHP shmem counters mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async() mm/migrate: putback split folios when numa hint migration fails ...
2024-07-19Merge tag 'dma-mapping-6.11-2024-07-19' of ↵Linus Torvalds1-11/+20
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - reduce duplicate swiotlb pool lookups (Michael Kelley) - minor small fixes (Yicong Yang, Yang Li) * tag 'dma-mapping-6.11-2024-07-19' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: fix kernel-doc description for swiotlb_del_transient swiotlb: reduce swiotlb pool lookups dma-mapping: benchmark: Don't starve others when doing the test
2024-07-10swiotlb: reduce swiotlb pool lookupsMichael Kelley1-11/+20
With CONFIG_SWIOTLB_DYNAMIC enabled, each round-trip map/unmap pair in the swiotlb results in 6 calls to swiotlb_find_pool(). In multiple places, the pool is found and used in one function, and then must be found again in the next function that is called because only the tlb_addr is passed as an argument. These are the six call sites: dma_direct_map_page: 1. swiotlb_map -> swiotlb_tbl_map_single -> swiotlb_bounce dma_direct_unmap_page: 2. dma_direct_sync_single_for_cpu -> is_swiotlb_buffer 3. dma_direct_sync_single_for_cpu -> swiotlb_sync_single_for_cpu -> swiotlb_bounce 4. is_swiotlb_buffer 5. swiotlb_tbl_unmap_single -> swiotlb_del_transient 6. swiotlb_tbl_unmap_single -> swiotlb_release_slots Reduce the number of calls by finding the pool at a higher level, and passing it as an argument instead of searching again. A key change is for is_swiotlb_buffer() to return a pool pointer instead of a boolean, and then pass this pool pointer to subsequent swiotlb functions. There are 9 occurrences of is_swiotlb_buffer() used to test if a buffer is a swiotlb buffer before calling a swiotlb function. To reduce code duplication in getting the pool pointer and passing it as an argument, introduce inline wrappers for this pattern. The generated code is essentially unchanged. Since is_swiotlb_buffer() no longer returns a boolean, rename some functions to reflect the change: * swiotlb_find_pool() becomes __swiotlb_find_pool() * is_swiotlb_buffer() becomes swiotlb_find_pool() * is_xen_swiotlb_buffer() becomes xen_swiotlb_find_pool() With these changes, a round-trip map/unmap pair requires only 2 pool lookups (listed using the new names and wrappers): dma_direct_unmap_page: 1. dma_direct_sync_single_for_cpu -> swiotlb_find_pool 2. swiotlb_tbl_unmap_single -> swiotlb_find_pool These changes come from noticing the inefficiencies in a code review, not from performance measurements. With CONFIG_SWIOTLB_DYNAMIC, __swiotlb_find_pool() is not trivial, and it uses an RCU read lock, so avoiding the redundant calls helps performance in a hot path. When CONFIG_SWIOTLB_DYNAMIC is *not* set, the code size reduction is minimal and the perf benefits are likely negligible, but no harm is done. No functional change is intended. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Petr Tesarik <petr@tesarici.cz> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-07-03mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() ↵David Hildenbrand1-2/+7
instead of PageReserved() We currently initialize the memmap such that PG_reserved is set and the refcount of the page is 1. In virtio-mem code, we have to manually clear that PG_reserved flag to make memory offlining with partially hotplugged memory blocks possible: has_unmovable_pages() would otherwise bail out on such pages. We want to avoid PG_reserved where possible and move to typed pages instead. Further, we want to further enlighten memory offlining code about PG_offline: offline pages in an online memory section. One example is handling managed page count adjustments in a cleaner way during memory offlining. So let's initialize the pages with PG_offline instead of PG_reserved. generic_online_page()->__free_pages_core() will now clear that flag before handing that memory to the buddy. Note that the page refcount is still 1 and would forbid offlining of such memory except when special care is take during GOING_OFFLINE as currently only implemented by virtio-mem. With this change, we can now get non-PageReserved() pages in the XEN balloon list. From what I can tell, that can already happen via decrease_reservation(), so that should be fine. HV-balloon should not really observe a change: partial online memory blocks still cannot get surprise-offlined, because the refcount of these PageOffline() pages is 1. Update virtio-mem, HV-balloon and XEN-balloon code to be aware that hotplugged pages are now PageOffline() instead of PageReserved() before they are handed over to the buddy. We'll leave the ZONE_DEVICE case alone for now. Note that self-hosted vmemmap pages will no longer be marked as reserved. This matches ordinary vmemmap pages allocated from the buddy during memory hotplug. Now, really only vmemmap pages allocated from memblock during early boot will be marked reserved. Existing PageReserved() checks seem to be handling all relevant cases correctly even after this change. Link: https://lkml.kernel.org/r/20240607090939.89524-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> [generic memory-hotplug bits] Cc: Alexander Potapenko <glider@google.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Marco Elver <elver@google.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03driver core: have match() callback in struct bus_type take a const *Greg Kroah-Hartman2-3/+3
In the match() callback, the struct device_driver * should not be changed, so change the function callback to be a const *. This is one step of many towards making the driver core safe to have struct device_driver in read-only memory. Because the match() callback is in all busses, all busses are modified to handle this properly. This does entail switching some container_of() calls to container_of_const() to properly handle the constant *. For some busses, like PCI and USB and HV, the const * is cast away in the match callback as those busses do want to modify those structures at this point in time (they have a local lock in the driver structure.) That will have to be changed in the future if they wish to have their struct device * in read-only-memory. Cc: Rafael J. Wysocki <rafael@kernel.org> Reviewed-by: Alex Elder <elder@kernel.org> Acked-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/2024070136-wrongdoer-busily-01e8@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-02xen: privcmd: Fix possible access to a freed kirqfd instanceViresh Kumar1-1/+9
Nothing prevents simultaneous ioctl calls to privcmd_irqfd_assign() and privcmd_irqfd_deassign(). If that happens, it is possible that a kirqfd created and added to the irqfds_list by privcmd_irqfd_assign() may get removed by another thread executing privcmd_irqfd_deassign(), while the former is still using it after dropping the locks. This can lead to a situation where an already freed kirqfd instance may be accessed and cause kernel oops. Use SRCU locking to prevent the same, as is done for the KVM implementation for irqfds. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/9e884af1f1f842eacbb7afc5672c8feb4dea7f3f.1718703669.git.viresh.kumar@linaro.org Signed-off-by: Juergen Gross <jgross@suse.com>
2024-07-02xen: privcmd: Switch from mutex to spinlock for irqfdsViresh Kumar1-10/+15
irqfd_wakeup() gets EPOLLHUP, when it is called by eventfd_release() by way of wake_up_poll(&ctx->wqh, EPOLLHUP), which gets called under spin_lock_irqsave(). We can't use a mutex here as it will lead to a deadlock. Fix it by switching over to a spin lock. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/a66d7a7a9001424d432f52a9fc3931a1f345464f.1718703669.git.viresh.kumar@linaro.org Signed-off-by: Juergen Gross <jgross@suse.com>
2024-07-02xen: add missing MODULE_DESCRIPTION() macrosJeff Johnson4-0/+4
With ARCH=x86, make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/xen/xen-pciback/xen-pciback.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/xen/xen-evtchn.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/xen/xen-privcmd.o Add the missing invocations of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240611-md-drivers-xen-v1-1-1eb677364ca6@quicinc.com Signed-off-by: Juergen Gross <jgross@suse.com>
2024-07-01xen/manage: Constify struct shutdown_handlerChristophe JAILLET1-1/+1
'struct shutdown_handler' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 7043 788 8 7839 1e9f drivers/xen/manage.o After: ===== text data bss dec hex filename 7164 676 8 7848 1ea8 drivers/xen/manage.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/ca1e75f66aed43191cf608de6593c7d6db9148f1.1719134768.git.christophe.jaillet@wanadoo.fr Signed-off-by: Juergen Gross <jgross@suse.com>
2024-05-24Merge tag 'for-linus-6.10a-rc1-tag' of ↵Linus Torvalds2-21/+29
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - a small cleanup in the drivers/xen/xenbus Makefile - a fix of the Xen xenstore driver to improve connecting to a late started Xenstore - an enhancement for better support of ballooning in PVH guests - a cleanup using try_cmpxchg() instead of open coding it * tag 'for-linus-6.10a-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: drivers/xen: Improve the late XenStore init protocol xen/xenbus: Use *-y instead of *-objs in Makefile xen/x86: add extra pages to unpopulated-alloc if available locking/x86/xen: Use try_cmpxchg() in xen_alloc_p2m_entry()
2024-05-23drivers/xen: Improve the late XenStore init protocolHenry Wang1-13/+23
Currently, the late XenStore init protocol is only triggered properly for the case that HVM_PARAM_STORE_PFN is ~0ULL (invalid). For the case that XenStore interface is allocated but not ready (the connection status is not XENSTORE_CONNECTED), Linux should also wait until the XenStore is set up properly. Introduce a macro to describe the XenStore interface is ready, use it in xenbus_probe_initcall() to select the code path of doing the late XenStore init protocol or not. Since now we have more than one condition for XenStore late init, rework the check in xenbus_probe() for the free_irq(). Take the opportunity to enhance the check of the allocated XenStore interface can be properly mapped, and return error early if the memremap() fails. Fixes: 5b3353949e89 ("xen: add support for initializing xenstore later as HVM domain") Signed-off-by: Henry Wang <xin.wang2@amd.com> Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/20240517011516.1451087-1-xin.wang2@amd.com Signed-off-by: Juergen Gross <jgross@suse.com>
2024-05-20Merge tag 'dma-mapping-6.10-2024-05-20' of ↵Linus Torvalds1-1/+1
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - optimize DMA sync calls when they are no-ops (Alexander Lobakin) - fix swiotlb padding for untrusted devices (Michael Kelley) - add documentation for swiotb (Michael Kelley) * tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping: dma: fix DMA sync for drivers not calling dma_set_mask*() xsk: use generic DMA sync shortcut instead of a custom one page_pool: check for DMA sync shortcut earlier page_pool: don't use driver-set flags field directly page_pool: make sure frag API fields don't span between cachelines iommu/dma: avoid expensive indirect calls for sync operations dma: avoid redundant calls for sync operations dma: compile-out DMA sync op calls when not used iommu/dma: fix zeroing of bounce buffer padding used by untrusted devices swiotlb: remove alloc_size argument to swiotlb_tbl_map_single() Documentation/core-api: add swiotlb documentation
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-17xen/xenbus: Use *-y instead of *-objs in MakefileAndy Shevchenko1-8/+6
*-objs suffix is reserved rather for (user-space) host programs while usually *-y suffix is used for kernel drivers (although *-objs works for that purpose for now). Let's correct the old usages of *-objs in Makefiles. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240508152658.1445809-1-andriy.shevchenko@linux.intel.com Signed-off-by: Juergen Gross <jgross@suse.com>
2024-05-13net: change proto and proto_ops accept typeJens Axboe1-1/+5
Rather than pass in flags, error pointer, and whether this is a kernel invocation or not, add a struct proto_accept_arg struct as the argument. This then holds all of these arguments, and prepares accept for being able to pass back more information. No functional changes in this patch. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07swiotlb: remove alloc_size argument to swiotlb_tbl_map_single()Michael Kelley1-1/+1
Currently swiotlb_tbl_map_single() takes alloc_align_mask and alloc_size arguments to specify an swiotlb allocation that is larger than mapping_size. This larger allocation is used solely by iommu_dma_map_single() to handle untrusted devices that should not have DMA visibility to memory pages that are partially used for unrelated kernel data. Having two arguments to specify the allocation is redundant. While alloc_align_mask naturally specifies the alignment of the starting address of the allocation, it can also implicitly specify the size by rounding up the mapping_size to that alignment. Additionally, the current approach has an edge case bug. iommu_dma_map_page() already does the rounding up to compute the alloc_size argument. But swiotlb_tbl_map_single() then calculates the alignment offset based on the DMA min_align_mask, and adds that offset to alloc_size. If the offset is non-zero, the addition may result in a value that is larger than the max the swiotlb can allocate. If the rounding up is done _after_ the alignment offset is added to the mapping_size (and the original mapping_size conforms to the value returned by swiotlb_max_mapping_size), then the max that the swiotlb can allocate will not be exceeded. In view of these issues, simplify the swiotlb_tbl_map_single() interface by removing the alloc_size argument. Most call sites pass the same value for mapping_size and alloc_size, and they pass alloc_align_mask as zero. Just remove the redundant argument from these callers, as they will see no functional change. For iommu_dma_map_page() also remove the alloc_size argument, and have swiotlb_tbl_map_single() compute the alloc_size by rounding up mapping_size after adding the offset based on min_align_mask. This has the side effect of fixing the edge case bug but with no other functional change. Also add a sanity test on the alloc_align_mask. While IOMMU code currently ensures the granule is not larger than PAGE_SIZE, if that guarantee were to be removed in the future, the downstream effect on the swiotlb might go unnoticed until strange allocation failures occurred. Tested on an ARM64 system with 16K page size and some kernel test-only hackery to allow modifying the DMA min_align_mask and the granule size that becomes the alloc_align_mask. Tested these combinations with a variety of original memory addresses and sizes, including those that reproduce the edge case bug: * 4K granule and 0 min_align_mask * 4K granule and 0xFFF min_align_mask (4K - 1) * 16K granule and 0xFFF min_align_mask * 64K granule and 0xFFF min_align_mask * 64K granule and 0x3FFF min_align_mask (16K - 1) With the changes, all combinations pass. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Petr Tesarik <petr@tesarici.cz> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-04-25change alloc_pages name in dma_map_ops to avoid name conflictsSuren Baghdasaryan2-2/+2
After redefining alloc_pages, all uses of that name are being replaced. Change the conflicting names to prevent preprocessor from replacing them when it's not intended. Link: https://lkml.kernel.org/r/20240321163705.3067592-18-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-19Merge tag 'for-linus-6.9-rc1-tag' of ↵Linus Torvalds4-15/+21
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - Xen event channel handling fix for a regression with a rare kernel config and some added hardening - better support of running Xen dom0 in PVH mode - a cleanup for the xen grant-dma-iommu driver * tag 'for-linus-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/events: increment refcnt only if event channel is refcounted xen/evtchn: avoid WARN() when unbinding an event channel x86/xen: attempt to inflate the memory balloon on PVH xen/grant-dma-iommu: Convert to platform remove callback returning void
2024-03-17xen/events: increment refcnt only if event channel is refcountedJuergen Gross1-9/+13
In bind_evtchn_to_irq_chip() don't increment the refcnt of the event channel blindly. In case the event channel is NOT refcounted, issue a warning instead. Add an additional safety net by doing the refcnt increment only if the caller has specified IRQF_SHARED in the irqflags parameter. Fixes: 9e90e58c11b7 ("xen: evtchn: Allow shared registration of IRQ handers") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Link: https://lore.kernel.org/r/20240313071409.25913-3-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2024-03-17xen/evtchn: avoid WARN() when unbinding an event channelJuergen Gross1-0/+6
When unbinding a user event channel, the related handler might be called a last time in case the kernel was built with CONFIG_DEBUG_SHIRQ. This might cause a WARN() in the handler. Avoid that by adding an "unbinding" flag to struct user_event which will short circuit the handler. Fixes: 9e90e58c11b7 ("xen: evtchn: Allow shared registration of IRQ handers") Reported-by: Demi Marie Obenour <demi@invisiblethingslab.com> Tested-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Link: https://lore.kernel.org/r/20240313071409.25913-2-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2024-03-13x86/xen: attempt to inflate the memory balloon on PVHRoger Pau Monne1-2/+0
When running as PVH or HVM Linux will use holes in the memory map as scratch space to map grants, foreign domain pages and possibly miscellaneous other stuff. However the usage of such memory map holes for Xen purposes can be problematic. The request of holesby Xen happen quite early in the kernel boot process (grant table setup already uses scratch map space), and it's possible that by then not all devices have reclaimed their MMIO space. It's not unlikely for chunks of Xen scratch map space to end up using PCI bridge MMIO window memory, which (as expected) causes quite a lot of issues in the system. At least for PVH dom0 we have the possibility of using regions marked as UNUSABLE in the e820 memory map. Either if the region is UNUSABLE in the native memory map, or it has been converted into UNUSABLE in order to hide RAM regions from dom0, the second stage translation page-tables can populate those areas without issues. PV already has this kind of logic, where the balloon driver is inflated at boot. Re-use the current logic in order to also inflate it when running as PVH. onvert UNUSABLE regions up to the ratio specified in EXTRA_MEM_RATIO to RAM, while reserving them using xen_add_extra_mem() (which is also moved so it's no longer tied to CONFIG_PV). [jgross: fixed build for CONFIG_PVH without CONFIG_XEN_PVH] Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240220174341.56131-1-roger.pau@citrix.com Signed-off-by: Juergen Gross <jgross@suse.com>
2024-03-13xen/grant-dma-iommu: Convert to platform remove callback returning voidUwe Kleine-König1-4/+2
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new(), which already returns void. Eventually after all drivers are converted, .remove_new() will be renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240307181135.191192-2-u.kleine-koenig@pengutronix.de Signed-off-by: Juergen Gross <jgross@suse.com>
2024-03-11Merge tag 'x86-fred-2024-03-10' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FRED support from Thomas Gleixner: "Support for x86 Fast Return and Event Delivery (FRED). FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes: 1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios. 2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this. 3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI. 4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace. 5) Partial restore of ESP when returning to a 16-bit segment 6) Limitation of the vector space which can cause vector exhaustion on large systems. 7) Inability to differentiate NMI sources FRED addresses these shortcomings by: 1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software. 2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the currently interrupt stack. 3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware. 4) NMIs are now nesting protected. They are only reenabled on the return from NMI. 5) FRED guarantees full restore of ESP 6) FRED does not put a limitation on the vector space by design because it uses a central entry points for kernel and user space and the CPUstores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction. The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC. 7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place. The series implements the initial FRED support by: - Reworking the existing entry and IDT handling infrastructure to accomodate for the alternative entry mechanism. - Expanding the stack frame to accomodate for the extra 16 bytes FRED requires to store context and meta information - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB. - Providing FRED specific C entry points for #NMI and #MCE - Implementing the FRED specific ASM entry points and the C code to demultiplex the events - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc. The FRED integration aims for maximum code reuse vs the existing IDT implementation to the extent possible and the deviation in hot paths like context switching are handled with alternatives to minimalize the impact. The low level entry and exit paths are seperate due to the extended stack frame and the hardware based GS BASE swichting and therefore have no impact on IDT based systems. It has been extensively tested on existing systems and on the FRED simulation and as of now there are no outstanding problems" * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/fred: Fix init_task thread stack pointer initialization MAINTAINERS: Add a maintainer entry for FRED x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly x86/fred: Invoke FRED initialization code to enable FRED x86/fred: Add FRED initialization functions x86/syscall: Split IDT syscall setup code into idt_syscall_init() KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled x86/traps: Add sysvec_install() to install a system interrupt handler x86/fred: FRED entry/exit and dispatch code x86/fred: Add a machine check entry stub for FRED x86/fred: Add a NMI entry stub for FRED x86/fred: Add a debug fault entry stub for FRED x86/idtentry: Incorporate definitions/declarations of the FRED entries x86/fred: Make exc_page_fault() work for FRED x86/fred: Allow single-step trap and NMI when starting a new task x86/fred: No ESPFIX needed when FRED is enabled ...
2024-02-13xen/events: close evtchn after mapping cleanupMaximilian Heyne1-2/+6
shutdown_pirq and startup_pirq are not taking the irq_mapping_update_lock because they can't due to lock inversion. Both are called with the irq_desc->lock being taking. The lock order, however, is first irq_mapping_update_lock and then irq_desc->lock. This opens multiple races: - shutdown_pirq can be interrupted by a function that allocates an event channel: CPU0 CPU1 shutdown_pirq { xen_evtchn_close(e) __startup_pirq { EVTCHNOP_bind_pirq -> returns just freed evtchn e set_evtchn_to_irq(e, irq) } xen_irq_info_cleanup() { set_evtchn_to_irq(e, -1) } } Assume here event channel e refers here to the same event channel number. After this race the evtchn_to_irq mapping for e is invalid (-1). - __startup_pirq races with __unbind_from_irq in a similar way. Because __startup_pirq doesn't take irq_mapping_update_lock it can grab the evtchn that __unbind_from_irq is currently freeing and cleaning up. In this case even though the event channel is allocated, its mapping can be unset in evtchn_to_irq. The fix is to first cleanup the mappings and then close the event channel. In this way, when an event channel gets allocated it's potential previous evtchn_to_irq mappings are guaranteed to be unset already. This is also the reverse order of the allocation where first the event channel is allocated and then the mappings are setup. On a 5.10 kernel prior to commit 3fcdaf3d7634 ("xen/events: modify internal [un]bind interfaces"), we hit a BUG like the following during probing of NVMe devices. The issue is that during nvme_setup_io_queues, pci_free_irq is called for every device which results in a call to shutdown_pirq. With many nvme devices it's therefore likely to hit this race during boot because there will be multiple calls to shutdown_pirq and startup_pirq are running potentially in parallel. ------------[ cut here ]------------ blkfront: xvda: barrier or flush: disabled; persistent grants: enabled; indirect descriptors: enabled; bounce buffer: enabled kernel BUG at drivers/xen/events/events_base.c:499! invalid opcode: 0000 [#1] SMP PTI CPU: 44 PID: 375 Comm: kworker/u257:23 Not tainted 5.10.201-191.748.amzn2.x86_64 #1 Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006 Workqueue: nvme-reset-wq nvme_reset_work RIP: 0010:bind_evtchn_to_cpu+0xdf/0xf0 Code: 5d 41 5e c3 cc cc cc cc 44 89 f7 e8 2b 55 ad ff 49 89 c5 48 85 c0 0f 84 64 ff ff ff 4c 8b 68 30 41 83 fe ff 0f 85 60 ff ff ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 RSP: 0000:ffffc9000d533b08 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006 RDX: 0000000000000028 RSI: 00000000ffffffff RDI: 00000000ffffffff RBP: ffff888107419680 R08: 0000000000000000 R09: ffffffff82d72b00 R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000001ed R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff88bc8b500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000002610001 CR4: 00000000001706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? show_trace_log_lvl+0x1c1/0x2d9 ? show_trace_log_lvl+0x1c1/0x2d9 ? set_affinity_irq+0xdc/0x1c0 ? __die_body.cold+0x8/0xd ? die+0x2b/0x50 ? do_trap+0x90/0x110 ? bind_evtchn_to_cpu+0xdf/0xf0 ? do_error_trap+0x65/0x80 ? bind_evtchn_to_cpu+0xdf/0xf0 ? exc_invalid_op+0x4e/0x70 ? bind_evtchn_to_cpu+0xdf/0xf0 ? asm_exc_invalid_op+0x12/0x20 ? bind_evtchn_to_cpu+0xdf/0xf0 ? bind_evtchn_to_cpu+0xc5/0xf0 set_affinity_irq+0xdc/0x1c0 irq_do_set_affinity+0x1d7/0x1f0 irq_setup_affinity+0xd6/0x1a0 irq_startup+0x8a/0xf0 __setup_irq+0x639/0x6d0 ? nvme_suspend+0x150/0x150 request_threaded_irq+0x10c/0x180 ? nvme_suspend+0x150/0x150 pci_request_irq+0xa8/0xf0 ? __blk_mq_free_request+0x74/0xa0 queue_request_irq+0x6f/0x80 nvme_create_queue+0x1af/0x200 nvme_create_io_queues+0xbd/0xf0 nvme_setup_io_queues+0x246/0x320 ? nvme_irq_check+0x30/0x30 nvme_reset_work+0x1c8/0x400 process_one_work+0x1b0/0x350 worker_thread+0x49/0x310 ? process_one_work+0x350/0x350 kthread+0x11b/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 Modules linked in: ---[ end trace a11715de1eee1873 ]--- Fixes: d46a78b05c0e ("xen: implement pirq type event channels") Cc: stable@vger.kernel.org Co-debugged-by: Andrew Panyakin <apanyaki@amazon.com> Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240124163130.31324-1-mheyne@amazon.de Signed-off-by: Juergen Gross <jgross@suse.com>
2024-02-13xen/gntalloc: Replace UAPI 1-element arrayKees Cook1-1/+1
Without changing the structure size (since it is UAPI), add a proper flexible array member, and reference it in the kernel so that it will not be trip the array-bounds sanitizer[1]. Link: https://github.com/KSPP/linux/issues/113 [1] Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: xen-devel@lists.xenproject.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20240206170320.work.437-kees@kernel.org Signed-off-by: Juergen Gross <jgross@suse.com>
2024-02-13xen: balloon: make balloon_subsys constRicardo B. Marliere1-1/+1
Now that the driver core can properly handle constant struct bus_type, move the balloon_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20240203-bus_cleanup-xen-v1-2-c2f5fe89ed95@marliere.net Signed-off-by: Juergen Gross <jgross@suse.com>
2024-02-13xen: pcpu: make xen_pcpu_subsys constRicardo B. Marliere1-1/+1
Now that the driver core can properly handle constant struct bus_type, move the xen_pcpu_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20240203-bus_cleanup-xen-v1-1-c2f5fe89ed95@marliere.net Signed-off-by: Juergen Gross <jgross@suse.com>
2024-02-13xen/privcmd: Use memdup_array_user() in alloc_ioreq()Markus Elfring1-10/+5
* The function “memdup_array_user” was added with the commit 313ebe47d75558511aa1237b6e35c663b5c0ec6f ("string.h: add array-wrappers for (v)memdup_user()"). Thus use it accordingly. This issue was detected by using the Coccinelle software. * Delete a label which became unnecessary with this refactoring. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/41e333f7-1f3a-41b6-a121-a3c0ae54e36f@web.de Signed-off-by: Juergen Gross <jgross@suse.com>
2024-02-12xen/xenbus: document will_handle argument for xenbus_watch_path()SeongJae Park1-6/+9
Commit 2e85d32b1c86 ("xen/xenbus: Add 'will_handle' callback support in xenbus_watch_path()") added will_handle argument to xenbus_watch_path() and its wrapper, xenbus_watch_pathfmt(), but didn't document it on the kerneldoc comments of the function. This is causing warnings that reported by kernel test robot. Add the documentation to fix it. Fixes: 2e85d32b1c86 ("xen/xenbus: Add 'will_handle' callback support in xenbus_watch_path()") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401121154.FI8jDGun-lkp@intel.com/ Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240112185903.83737-1-sj@kernel.org Signed-off-by: Juergen Gross <jgross@suse.com>
2024-01-31x86/traps: Add sysvec_install() to install a system interrupt handlerXin Li1-1/+1
Add sysvec_install() to install a system interrupt handler into the IDT or the FRED system interrupt handler table. Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Shan Kang <shan.kang@intel.com> Link: https://lore.kernel.org/r/20231205105030.8698-28-xin3.li@intel.com
2024-01-17Merge tag 'for-linus-6.8-rc1-tag' of ↵Linus Torvalds2-50/+59
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - update some Xen PV interface related headers - fix some kernel-doc comments in the xenbus driver - fix the Xen gntdev driver to not access the struct page of pages imported from a DMA-buf * tag 'for-linus-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/gntdev: Fix the abuse of underlying struct page in DMA-buf import xen/xenbus: client: fix kernel-doc comments xen: update PV-device interface headers
2024-01-10xen/gntdev: Fix the abuse of underlying struct page in DMA-buf importOleksandr Tyshchenko1-25/+25
DO NOT access the underlying struct page of an sg table exported by DMA-buf in dmabuf_imp_to_refs(), this is not allowed. Please see drivers/dma-buf/dma-buf.c:mangle_sg_table() for details. Fortunately, here (for special Xen device) we can avoid using pages and calculate gfns directly from dma addresses provided by the sg table. Suggested-by: Daniel Vetter <daniel@ffwll.ch> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Daniel Vetter <daniel@ffwll.ch> Link: https://lore.kernel.org/r/20240107103426.2038075-1-olekstysh@gmail.com Signed-off-by: Juergen Gross <jgross@suse.com>
2024-01-09xen/xenbus: client: fix kernel-doc commentsRandy Dunlap1-25/+34
Correct function kernel-doc notation to prevent warnings from scripts/kernel-doc. xenbus_client.c:134: warning: No description found for return value of 'xenbus_watch_path' xenbus_client.c:177: warning: No description found for return value of 'xenbus_watch_pathfmt' xenbus_client.c:258: warning: missing initial short description on line: * xenbus_switch_state xenbus_client.c:267: warning: No description found for return value of 'xenbus_switch_state' xenbus_client.c:308: warning: missing initial short description on line: * xenbus_dev_error xenbus_client.c:327: warning: missing initial short description on line: * xenbus_dev_fatal xenbus_client.c:350: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Equivalent to xenbus_dev_fatal(dev, err, fmt, args), but helps xenbus_client.c:457: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Allocate an event channel for the given xenbus_device, assigning the newly xenbus_client.c:486: warning: expecting prototype for Free an existing event channel. Returns 0 on success or(). Prototype was for xenbus_free_evtchn() instead xenbus_client.c:502: warning: missing initial short description on line: * xenbus_map_ring_valloc xenbus_client.c:517: warning: No description found for return value of 'xenbus_map_ring_valloc' xenbus_client.c:602: warning: missing initial short description on line: * xenbus_unmap_ring xenbus_client.c:614: warning: No description found for return value of 'xenbus_unmap_ring' xenbus_client.c:715: warning: missing initial short description on line: * xenbus_unmap_ring_vfree xenbus_client.c:727: warning: No description found for return value of 'xenbus_unmap_ring_vfree' xenbus_client.c:919: warning: missing initial short description on line: * xenbus_read_driver_state xenbus_client.c:926: warning: No description found for return value of 'xenbus_read_driver_state' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: xen-devel@lists.xenproject.org Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20231206181724.27767-1-rdunlap@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com>
2024-01-08Merge tag 'vfs-6.8.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...
2023-11-28eventfd: simplify eventfd_signal()Christian Brauner1-1/+1
Ever since the eventfd type was introduced back in 2007 in commit e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal() function only ever passed 1 as a value for @n. There's no point in keeping that additional argument. Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org Acked-by: Xu Yilun <yilun.xu@intel.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl Acked-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-28xen/events: fix error code in xen_bind_pirq_msi_to_irq()Dan Carpenter1-1/+3
Return -ENOMEM if xen_irq_init() fails. currently the code returns an uninitialized variable or zero. Fixes: 5dd9ad32d775 ("xen/events: drop xen_allocate_irqs_dynamic()") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Juergen Gross <jgross@ssue.com> Link: https://lore.kernel.org/r/3b9ab040-a92e-4e35-b687-3a95890a9ace@moroto.mountain Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-17xen: privcmd: Replace zero-length array with flex-array member and use ↵Gustavo A. R. Silva1-1/+1
__counted_by Fake flexible arrays (zero-length and one-element arrays) are deprecated, and should be replaced by flexible-array members. So, replace zero-length array with a flexible-array member in `struct privcmd_kernel_ioreq`. Also annotate array `ports` with `__counted_by()` to prepare for the coming implementation by GCC and Clang of the `__counted_by` attribute. Flexible array members annotated with `__counted_by` can have their accesses bounds-checked at run-time via `CONFIG_UBSAN_BOUNDS` (for array indexing) and `CONFIG_FORTIFY_SOURCE` (for strcpy/memcpy-family functions). This fixes multiple -Warray-bounds warnings: drivers/xen/privcmd.c:1239:30: warning: array subscript i is outside array bounds of 'struct ioreq_port[0]' [-Warray-bounds=] drivers/xen/privcmd.c:1240:30: warning: array subscript i is outside array bounds of 'struct ioreq_port[0]' [-Warray-bounds=] drivers/xen/privcmd.c:1241:30: warning: array subscript i is outside array bounds of 'struct ioreq_port[0]' [-Warray-bounds=] drivers/xen/privcmd.c:1245:33: warning: array subscript i is outside array bounds of 'struct ioreq_port[0]' [-Warray-bounds=] drivers/xen/privcmd.c:1258:67: warning: array subscript i is outside array bounds of 'struct ioreq_port[0]' [-Warray-bounds=] This results in no differences in binary output. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/ZVZlg3tPMPCRdteh@work Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-17swiotlb-xen: provide the "max_mapping_size" methodKeith Busch1-0/+1
There's a bug that when using the XEN hypervisor with bios with large multi-page bio vectors on NVMe, the kernel deadlocks [1]. The deadlocks are caused by inability to map a large bio vector - dma_map_sgtable always returns an error, this gets propagated to the block layer as BLK_STS_RESOURCE and the block layer retries the request indefinitely. XEN uses the swiotlb framework to map discontiguous pages into contiguous runs that are submitted to the PCIe device. The swiotlb framework has a limitation on the length of a mapping - this needs to be announced with the max_mapping_size method to make sure that the hardware drivers do not create larger mappings. Without max_mapping_size, the NVMe block driver would create large mappings that overrun the maximum mapping size. Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Link: https://lore.kernel.org/stable/ZTNH0qtmint%2FzLJZ@mail-itl/ [1] Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Suggested-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/151bef41-e817-aea9-675-a35fdac4ed@redhat.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-15xen/events: remove some info_for_irq() calls in pirq handlingJuergen Gross1-49/+68
Instead of the IRQ number user the struct irq_info pointer as parameter in the internal pirq related functions. This allows to drop some calls of info_for_irq(). Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-15xen/events: modify internal [un]bind interfacesJuergen Gross1-135/+124
Modify the internal bind- and unbind-interfaces to take a struct irq_info parameter. When allocating a new IRQ pass the pointer from the allocating function further up. This will reduce the number of info_for_irq() calls and make the code more efficient. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-15xen/events: drop xen_allocate_irqs_dynamic()Juergen Gross1-30/+44
Instead of having a common function for allocating a single IRQ or a consecutive number of IRQs, split up the functionality into the callers of xen_allocate_irqs_dynamic(). This allows to handle any allocation error in xen_irq_init() gracefully instead of panicing the system. Let xen_irq_init() return the irq_info pointer or NULL in case of an allocation error. Additionally set the IRQ into irq_info already at allocation time, as otherwise the IRQ would be '0' (which is a valid IRQ number) until being set. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-14xen/events: remove some simple helpers from events_base.cJuergen Gross1-59/+38
The helper functions type_from_irq() and cpu_from_irq() are just one line functions used only internally. Open code them where needed. At the same time modify and rename get_evtchn_to_irq() to return a struct irq_info instead of the IRQ number. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-14xen/events: reduce externally visible helper functionsJuergen Gross3-9/+13
get_evtchn_to_irq() has only one external user while irq_from_evtchn() provides the same functionality and is exported for a wider user base. Modify the only external user of get_evtchn_to_irq() to use irq_from_evtchn() instead and make get_evtchn_to_irq() static. evtchn_from_irq() and irq_from_virq() have a single external user and can easily be combined to a new helper irq_evtchn_from_virq() allowing to drop irq_from_virq() and to make evtchn_from_irq() static. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-13xen/events: remove unused functionsJuergen Gross1-30/+0
There are no users of xen_irq_from_pirq() and xen_set_irq_pending(). Remove those functions. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-13xen/events: fix delayed eoi list handlingJuergen Gross1-1/+3
When delaying eoi handling of events, the related elements are queued into the percpu lateeoi list. In case the list isn't empty, the elements should be sorted by the time when eoi handling is to happen. Unfortunately a new element will never be queued at the start of the list, even if it has a handling time lower than all other list elements. Fix that by handling that case the same way as for an empty list. Fixes: e99502f76271 ("xen/events: defer eoi in case of excessive number of events") Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-13xen/shbuf: eliminate 17 kernel-doc warningsRandy Dunlap1-17/+17
Don't use kernel-doc markers ("/**") for comments that are not in kernel-doc format. This prevents multiple kernel-doc warnings: xen-front-pgdir-shbuf.c:25: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * This structure represents the structure of a shared page xen-front-pgdir-shbuf.c:37: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Shared buffer ops which are differently implemented xen-front-pgdir-shbuf.c:65: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Get granted reference to the very first page of the xen-front-pgdir-shbuf.c:85: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Map granted references of the shared buffer. xen-front-pgdir-shbuf.c:106: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Unmap granted references of the shared buffer. xen-front-pgdir-shbuf.c:127: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Free all the resources of the shared buffer. xen-front-pgdir-shbuf.c:154: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Get the number of pages the page directory consumes itself. xen-front-pgdir-shbuf.c:164: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Calculate the number of grant references needed to share the buffer xen-front-pgdir-shbuf.c:176: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Calculate the number of grant references needed to share the buffer xen-front-pgdir-shbuf.c:194: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Unmap the buffer previously mapped with grant references xen-front-pgdir-shbuf.c:242: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Map the buffer with grant references provided by the backend. xen-front-pgdir-shbuf.c:324: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Fill page directory with grant references to the pages of the xen-front-pgdir-shbuf.c:354: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Fill page directory with grant references to the pages of the xen-front-pgdir-shbuf.c:393: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Grant references to the frontend's buffer pages. xen-front-pgdir-shbuf.c:422: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Grant all the references needed to share the buffer. xen-front-pgdir-shbuf.c:470: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Allocate all required structures to mange shared buffer. xen-front-pgdir-shbuf.c:510: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Allocate a new instance of a shared buffer. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Closes: lore.kernel.org/r/202311060203.yQrpPZhm-lkp@intel.com Acked-by: Juergen Gross <jgross@suse.com> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: xen-devel@lists.xenproject.org Link: https://lore.kernel.org/r/20231106055631.21520-1-rdunlap@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-13acpi/processor: sanitize _OSC/_PDC capabilities for Xen dom0Roger Pau Monne1-0/+22
The Processor capability bits notify ACPI of the OS capabilities, and so ACPI can adjust the return of other Processor methods taking the OS capabilities into account. When Linux is running as a Xen dom0, the hypervisor is the entity in charge of processor power management, and hence Xen needs to make sure the capabilities reported by _OSC/_PDC match the capabilities of the driver in Xen. Introduce a small helper to sanitize the buffer when running as Xen dom0. When Xen supports HWP, this serves as the equivalent of commit a21211672c9a ("ACPI / processor: Request native thermal interrupt handling via _OSC") to avoid SMM crashes. Xen will set bit ACPI_PROC_CAP_COLLAB_PROC_PERF (bit 12) in the capability bits and the _OSC/_PDC call will apply it. [ jandryuk: Mention Xen HWP's need. Support _OSC & _PDC ] Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: stable@vger.kernel.org Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20231108212517.72279-1-jandryuk@gmail.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-13xen/events: avoid using info_for_irq() in xen_send_IPI_one()Juergen Gross1-4/+8
xen_send_IPI_one() is being used by cpuhp_report_idle_dead() after it calls rcu_report_dead(), meaning that any RCU usage by xen_send_IPI_one() is a bad idea. Unfortunately xen_send_IPI_one() is using notify_remote_via_irq() today, which is using irq_get_chip_data() via info_for_irq(). And irq_get_chip_data() in turn is using a maple-tree lookup requiring RCU. Avoid this problem by caching the ipi event channels in another percpu variable, allowing the use notify_remote_via_evtchn() in xen_send_IPI_one(). Fixes: 721255b9826b ("genirq: Use a maple tree for interrupt descriptor management") Reported-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds1-7/+10
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-11-02Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-0/+3
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, megaraid_sas, lpfc, target, ibmvfc, scsi_debug) plus the usual assorted minor fixes and updates. The major change this time around is a prep patch for rethreading of the driver reset handler API not to take a scsi_cmd structure which starts to reduce various drivers' dependence on scsi_cmd in error handling" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (132 commits) scsi: ufs: core: Leave space for '\0' in utf8 desc string scsi: ufs: core: Conversion to bool not necessary scsi: ufs: core: Fix race between force complete and ISR scsi: megaraid: Fix up debug message in megaraid_abort_and_reset() scsi: aic79xx: Fix up NULL command in ahd_done() scsi: message: fusion: Initialize return value in mptfc_bus_reset() scsi: mpt3sas: Fix loop logic scsi: snic: Remove useless code in snic_dr_clean_pending_req() scsi: core: Add comment to target_destroy in scsi_host_template scsi: core: Clean up scsi_dev_queue_ready() scsi: pmcraid: Add missing scsi_device_put() in pmcraid_eh_target_reset_handler() scsi: target: core: Fix kernel-doc comment scsi: pmcraid: Fix kernel-doc comment scsi: core: Handle depopulation and restoration in progress scsi: ufs: core: Add support for parsing OPP scsi: ufs: core: Add OPP support for scaling clocks and regulators scsi: ufs: dt-bindings: common: Add OPP table scsi: scsi_debug: Add param to control sdev's allow_restart scsi: scsi_debug: Add debugfs interface to fail target reset scsi: scsi_debug: Add new error injection type: Reset LUN failed ...
2023-11-01Merge tag 'sysctl-6.7-rc1' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "To help make the move of sysctls out of kernel/sysctl.c not incur a size penalty sysctl has been changed to allow us to not require the sentinel, the final empty element on the sysctl array. Joel Granados has been doing all this work. On the v6.6 kernel we got the major infrastructure changes required to support this. For v6.7-rc1 we have all arch/ and drivers/ modified to remove the sentinel. Both arch and driver changes have been on linux-next for a bit less than a month. It is worth re-iterating the value: - this helps reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array - the extra 64-byte penalty is no longer inncurred now when we move sysctls out from kernel/sysctl.c to their own files For v6.8-rc1 expect removal of all the sentinels and also then the unneeded check for procname == NULL. The last two patches are fixes recently merged by Krister Johansen which allow us again to use softlockup_panic early on boot. This used to work but the alias work broke it. This is useful for folks who want to detect softlockups super early rather than wait and spend money on cloud solutions with nothing but an eventual hung kernel. Although this hadn't gone through linux-next it's also a stable fix, so we might as well roll through the fixes now" * tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (23 commits) watchdog: move softlockup_panic back to early_param proc: sysctl: prevent aliased sysctls from getting passed to init intel drm: Remove now superfluous sentinel element from ctl_table array Drivers: hv: Remove now superfluous sentinel element from ctl_table array raid: Remove now superfluous sentinel element from ctl_table array fw loader: Remove the now superfluous sentinel element from ctl_table array sgi-xp: Remove the now superfluous sentinel element from ctl_table array vrf: Remove the now superfluous sentinel element from ctl_table array char-misc: Remove the now superfluous sentinel element from ctl_table array infiniband: Remove the now superfluous sentinel element from ctl_table array macintosh: Remove the now superfluous sentinel element from ctl_table array parport: Remove the now superfluous sentinel element from ctl_table array scsi: Remove now superfluous sentinel element from ctl_table array tty: Remove now superfluous sentinel element from ctl_table array xen: Remove now superfluous sentinel element from ctl_table array hpet: Remove now superfluous sentinel element from ctl_table array c-sky: Remove now superfluous sentinel element from ctl_talbe array powerpc: Remove now superfluous sentinel element from ctl_table arrays riscv: Remove now superfluous sentinel element from ctl_table array x86/vdso: Remove now superfluous sentinel element from ctl_table array ...
2023-11-01Merge tag 'for-linus-6.7-rc1-tag' of ↵Linus Torvalds9-37/+437
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - two small cleanup patches - a fix for PCI passthrough under Xen - a four patch series speeding up virtio under Xen with user space backends * tag 'for-linus-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen-pciback: Consider INTx disabled when MSI/MSI-X is enabled xen: privcmd: Add support for ioeventfd xen: evtchn: Allow shared registration of IRQ handers xen: irqfd: Use _IOW instead of the internal _IOC() macro xen: Make struct privcmd_irqfd's layout architecture independent xen/xenbus: Add __counted_by for struct read_buffer and use struct_size() xenbus: fix error exit in xenbus_init()
2023-10-30Merge tag 'locking-core-2023-10-28' of ↵Linus Torvalds2-20/+16
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Info Molnar: "Futex improvements: - Add the 'futex2' syscall ABI, which is an attempt to get away from the multiplex syscall and adds a little room for extentions, while lifting some limitations. - Fix futex PI recursive rt_mutex waiter state bug - Fix inter-process shared futexes on no-MMU systems - Use folios instead of pages Micro-optimizations of locking primitives: - Improve arch_spin_value_unlocked() on asm-generic ticket spinlock architectures, to improve lockref code generation - Improve the x86-32 lockref_get_not_zero() main loop by adding build-time CMPXCHG8B support detection for the relevant lockref code, and by better interfacing the CMPXCHG8B assembly code with the compiler - Introduce arch_sync_try_cmpxchg() on x86 to improve sync_try_cmpxchg() code generation. Convert some sync_cmpxchg() users to sync_try_cmpxchg(). - Micro-optimize rcuref_put_slowpath() Locking debuggability improvements: - Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well - Enforce atomicity of sched_submit_work(), which is de-facto atomic but was un-enforced previously. - Extend <linux/cleanup.h>'s no_free_ptr() with __must_check semantics - Fix ww_mutex self-tests - Clean up const-propagation in <linux/seqlock.h> and simplify the API-instantiation macros a bit RT locking improvements: - Provide the rt_mutex_*_schedule() primitives/helpers and use them in the rtmutex code to avoid recursion vs. rtlock on the PI state. - Add nested blocking lockdep asserts to rt_mutex_lock(), rtlock_lock() and rwbase_read_lock() .. plus misc fixes & cleanups" * tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits) futex: Don't include process MM in futex key on no-MMU locking/seqlock: Fix grammar in comment alpha: Fix up new futex syscall numbers locking/seqlock: Propagate 'const' pointers within read-only methods, remove forced type casts locking/lockdep: Fix string sizing bug that triggers a format-truncation compiler-warning locking/seqlock: Change __seqprop() to return the function pointer locking/seqlock: Simplify SEQCOUNT_LOCKNAME() locking/atomics: Use atomic_try_cmpxchg_release() to micro-optimize rcuref_put_slowpath() locking/atomic, xen: Use sync_try_cmpxchg() instead of sync_cmpxchg() locking/atomic/x86: Introduce arch_sync_try_cmpxchg() locking/atomic: Add generic support for sync_try_cmpxchg() and its fallback locking/seqlock: Fix typo in comment futex/requeue: Remove unnecessary ‘NULL’ initialization from futex_proxy_trylock_atomic() locking/local, arch: Rewrite local_add_unless() as a static inline function locking/debug: Fix debugfs API return value checks to use IS_ERR() locking/ww_mutex/test: Make sure we bail out instead of livelock locking/ww_mutex/test: Fix potential workqueue corruption locking/ww_mutex/test: Use prng instead of rng to avoid hangs at bootup futex: Add sys_futex_requeue() futex: Add flags2 argument to futex_requeue() ...
2023-10-16xen-pciback: Consider INTx disabled when MSI/MSI-X is enabledMarek Marczykowski-Górecki3-25/+23
Linux enables MSI-X before disabling INTx, but keeps MSI-X masked until the table is filled. Then it disables INTx just before clearing MASKALL bit. Currently this approach is rejected by xen-pciback. According to the PCIe spec, device cannot use INTx when MSI/MSI-X is enabled (in other words: enabling MSI/MSI-X implicitly disables INTx). Change the logic to consider INTx disabled if MSI/MSI-X is enabled. This applies to three places: - checking currently enabled interrupts type, - transition to MSI/MSI-X - where INTx would be implicitly disabled, - clearing INTx disable bit - which can be allowed even if MSI/MSI-X is enabled, as device should consider INTx disabled anyway in that case Fixes: 5e29500eba2a ("xen-pciback: Allow setting PCI_MSIX_FLAGS_MASKALL too") Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20231016131348.1734721-1-marmarek@invisiblethingslab.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-10-16xen: privcmd: Add support for ioeventfdViresh Kumar2-6/+407
Virtio guests send VIRTIO_MMIO_QUEUE_NOTIFY notification when they need to notify the backend of an update to the status of the virtqueue. The backend or another entity, polls the MMIO address for updates to know when the notification is sent. It works well if the backend does this polling by itself. But as we move towards generic backend implementations, we end up implementing this in a separate user-space program. Generally, the Virtio backends are implemented to work with the Eventfd based mechanism. In order to make such backends work with Xen, another software layer needs to do the polling and send an event via eventfd to the backend once the notification from guest is received. This results in an extra context switch. This is not a new problem in Linux though. It is present with other hypervisors like KVM, etc. as well. The generic solution implemented in the kernel for them is to provide an IOCTL call to pass the address to poll and eventfd, which lets the kernel take care of polling and raise an event on the eventfd, instead of handling this in user space (which involves an extra context switch). This patch adds similar support for xen. Inspired by existing implementations for KVM, etc.. This also copies ioreq.h header file (only struct ioreq and related macros) from Xen's source tree (Top commit 5d84f07fe6bf ("xen/pci: drop remaining uses of bool_t")). Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/b20d83efba6453037d0c099912813c79c81f7714.1697439990.git.viresh.kumar@linaro.org Signed-off-by: Juergen Gross <jgross@suse.com>
2023-10-16xen: evtchn: Allow shared registration of IRQ handersViresh Kumar2-2/+3
Currently the handling of events is supported either in the kernel or userspace, but not both. In order to support fast delivery of interrupts from the guest to the backend, we need to handle the Queue notify part of Virtio protocol in kernel and the rest in userspace. Update the interrupt handler registration flag to IRQF_SHARED for event channels, which would allow multiple entities to bind their interrupt handler for the same event channel port. Also increment the reference count of irq_info when multiple entities try to bind event channel to irqchip, so the unbinding happens only after all the users are gone. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/99b1edfd3147c6b5d22a5139dab5861e767dc34a.1697439990.git.viresh.kumar@linaro.org Signed-off-by: Juergen Gross <jgross@suse.com>
2023-10-16xen: Make struct privcmd_irqfd's layout architecture independentViresh Kumar1-1/+1
Using indirect pointers in an ioctl command argument means that the layout is architecture specific, in particular we can't use the same one from 32-bit compat tasks. The general recommendation is to have __u64 members and use u64_to_user_ptr() to access it from the kernel if we are unable to avoid the pointers altogether. Fixes: f8941e6c4c71 ("xen: privcmd: Add support for irqfd") Reported-by: Arnd Bergmann <arnd@kernel.org> Closes: https://lore.kernel.org/all/268a2031-63b8-4c7d-b1e5-8ab83ca80b4a@app.fastmail.com/ Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/a4ef0d4a68fc858b34a81fd3f9877d9b6898eb77.1697439990.git.viresh.kumar@linaro.org Signed-off-by: Juergen Gross <jgross@suse.com>
2023-10-16xen/xenbus: Add __counted_by for struct read_buffer and use struct_size()Gustavo A. R. Silva1-2/+2
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). While there, use struct_size() helper, instead of the open-coded version, to calculate the size for the allocation of the whole flexible structure, including of course, the flexible-array member. This code was found with the help of Coccinelle, and audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Jason Andryuk <jandryuk@gmail.com> Link: https://lore.kernel.org/r/ZSRMosLuJJS5Y/io@work Signed-off-by: Juergen Gross <jgross@suse.com>
2023-10-16xenbus: fix error exit in xenbus_init()Juergen Gross1-1/+1
In case an error occurs in xenbus_init(), xen_store_domain_type should be set to XS_UNKNOWN. Fix one instance where this action is missing. Fixes: 5b3353949e89 ("xen: add support for initializing xenstore later as HVM domain") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/r/202304200845.w7m4kXZr-lkp@intel.com/ Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Link: https://lore.kernel.org/r/20230822091138.4765-1-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-10-13scsi: target: Have drivers report if they support direct submissionsMike Christie1-0/+3
In some cases, like with multiple LUN targets or where the target has to respond to transport level requests from the receiving context it can be better to defer cmd submission to a helper thread. If the backend driver blocks on something like request/tag allocation it can block the entire target submission path and other LUs and transport IO on that session. In other cases like single LUN targets with storage that can support all the commands that the target can queue, then it's best to submit the cmd to the backend from the target's cmd receiving context. Subsequent commits will allow the user to config what they prefer, but drivers like loop can't directly submit because they can be called from a context that can't sleep. And, drivers like vhost-scsi can support direct submission, but need to keep their default behavior of deferring execution to avoid possible regressions where the backend can block. Make the drivers tell LIO core if they support direct submissions and their current default, so we can prevent users from misconfiguring the system and initialize devices correctly. Signed-off-by: Mike Christie <michael.christie@oracle.com> Link: https://lore.kernel.org/r/20230928020907.5730-2-michael.christie@oracle.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-10-11xen: Remove now superfluous sentinel element from ctl_table arrayJoel Granados1-1/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel from balloon_table Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-10-09locking/atomic, xen: Use sync_try_cmpxchg() instead of sync_cmpxchg()Uros Bizjak2-20/+16
Use sync_try_cmpxchg() instead of sync_cmpxchg(*ptr, old, new) == old in clear_masked_cond(), clear_linked() and gnttab_end_foreign_access_ref_v1(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg), improving the cmpxchg loop in gnttab_end_foreign_access_ref_v1() from: 174: eb 0e jmp 184 <...> 176: 89 d0 mov %edx,%eax 178: f0 66 0f b1 31 lock cmpxchg %si,(%rcx) 17d: 66 39 c2 cmp %ax,%dx 180: 74 11 je 193 <...> 182: 89 c2 mov %eax,%edx 184: 89 d6 mov %edx,%esi 186: 66 83 e6 18 and $0x18,%si 18a: 74 ea je 176 <...> to: 614: 89 c1 mov %eax,%ecx 616: 66 83 e1 18 and $0x18,%cx 61a: 75 11 jne 62d <...> 61c: f0 66 0f b1 0a lock cmpxchg %cx,(%rdx) 621: 75 f1 jne 614 <...> No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org
2023-10-09xen/events: replace evtchn_rwlock with RCUJuergen Gross1-41/+46
In unprivileged Xen guests event handling can cause a deadlock with Xen console handling. The evtchn_rwlock and the hvc_lock are taken in opposite sequence in __hvc_poll() and in Xen console IRQ handling. Normally this is no problem, as the evtchn_rwlock is taken as a reader in both paths, but as soon as an event channel is being closed, the lock will be taken as a writer, which will cause read_lock() to block: CPU0 CPU1 CPU2 (IRQ handling) (__hvc_poll()) (closing event channel) read_lock(evtchn_rwlock) spin_lock(hvc_lock) write_lock(evtchn_rwlock) [blocks] spin_lock(hvc_lock) [blocks] read_lock(evtchn_rwlock) [blocks due to writer waiting, and not in_interrupt()] This issue can be avoided by replacing evtchn_rwlock with RCU in xen_free_irq(). Note that RCU is used only to delay freeing of the irq_info memory. There is no RCU based dereferencing or replacement of pointers involved. In order to avoid potential races between removing the irq_info reference and handling of interrupts, set the irq_info pointer to NULL only when freeing its memory. The IRQ itself must be freed at that time, too, as otherwise the same IRQ number could be allocated again before handling of the old instance would have been finished. This is XSA-441 / CVE-2023-34324. Fixes: 54c9de89895e ("xen/events: add a new "late EOI" evtchn framework") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-10-04xenbus/backend: dynamically allocate the xen-backend shrinkerQi Zheng1-7/+10
Use new APIs to dynamically allocate the xen-backend shrinker. Link: https://lkml.kernel.org/r/20230911094444.68966-6-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-19xen: simplify evtchn_do_upcall() call mazeJuergen Gross2-20/+3
There are several functions involved for performing the functionality of evtchn_do_upcall(): - __xen_evtchn_do_upcall() doing the real work - xen_hvm_evtchn_do_upcall() just being a wrapper for __xen_evtchn_do_upcall(), exposed for external callers - xen_evtchn_do_upcall() calling __xen_evtchn_do_upcall(), too, but without any user Simplify this maze by: - removing the unused xen_evtchn_do_upcall() - removing xen_hvm_evtchn_do_upcall() as the only left caller of __xen_evtchn_do_upcall(), while renaming __xen_evtchn_do_upcall() to xen_evtchn_do_upcall() Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Juergen Gross <jgross@suse.com>
2023-08-29Merge tag 'dma-mapping-6.6-2023-08-29' of ↵Linus Torvalds1-1/+1
git://git.infradead.org/users/hch/dma-mapping Pull dma-maping updates from Christoph Hellwig: - allow dynamic sizing of the swiotlb buffer, to cater for secure virtualization workloads that require all I/O to be bounce buffered (Petr Tesarik) - move a declaration to a header (Arnd Bergmann) - check for memory region overlap in dma-contiguous (Binglei Wang) - remove the somewhat dangerous runtime swiotlb-xen enablement and unexport is_swiotlb_active (Christoph Hellwig, Juergen Gross) - per-node CMA improvements (Yajun Deng) * tag 'dma-mapping-6.6-2023-08-29' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: optimize get_max_slots() swiotlb: move slot allocation explanation comment where it belongs swiotlb: search the software IO TLB only if the device makes use of it swiotlb: allocate a new memory pool when existing pools are full swiotlb: determine potential physical address limit swiotlb: if swiotlb is full, fall back to a transient memory pool swiotlb: add a flag whether SWIOTLB is allowed to grow swiotlb: separate memory pool data from other allocator data swiotlb: add documentation and rename swiotlb_do_find_slots() swiotlb: make io_tlb_default_mem local to swiotlb.c swiotlb: bail out of swiotlb_init_late() if swiotlb is already allocated dma-contiguous: check for memory region overlap dma-contiguous: support numa CMA for specified node dma-contiguous: support per-numa CMA for all architectures dma-mapping: move arch_dma_set_mask() declaration to header swiotlb: unexport is_swiotlb_active x86: always initialize xen-swiotlb when xen-pcifront is enabling xen/pci: add flag for PCI passthrough being possible
2023-08-22xen: privcmd: Add support for irqfdViresh Kumar2-2/+287
Xen provides support for injecting interrupts to the guests via the HYPERVISOR_dm_op() hypercall. The same is used by the Virtio based device backend implementations, in an inefficient manner currently. Generally, the Virtio backends are implemented to work with the Eventfd based mechanism. In order to make such backends work with Xen, another software layer needs to poll the Eventfds and raise an interrupt to the guest using the Xen based mechanism. This results in an extra context switch. This is not a new problem in Linux though. It is present with other hypervisors like KVM, etc. as well. The generic solution implemented in the kernel for them is to provide an IOCTL call to pass the interrupt details and eventfd, which lets the kernel take care of polling the eventfd and raising of the interrupt, instead of handling this in user space (which involves an extra context switch). This patch adds support to inject a specific interrupt to guest using the eventfd mechanism, by preventing the extra context switch. Inspired by existing implementations for KVM, etc.. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/8e724ac1f50c2bc1eb8da9b3ff6166f1372570aa.1692697321.git.viresh.kumar@linaro.org Signed-off-by: Juergen Gross <jgross@suse.com>
2023-08-22xen/xenbus: Avoid a lockdep warning when adding a watchPetr Pavlu1-2/+2
The following lockdep warning appears during boot on a Xen dom0 system: [ 96.388794] ====================================================== [ 96.388797] WARNING: possible circular locking dependency detected [ 96.388799] 6.4.0-rc5-default+ #8 Tainted: G EL [ 96.388803] ------------------------------------------------------ [ 96.388804] xenconsoled/1330 is trying to acquire lock: [ 96.388808] ffffffff82acdd10 (xs_watch_rwsem){++++}-{3:3}, at: register_xenbus_watch+0x45/0x140 [ 96.388847] but task is already holding lock: [ 96.388849] ffff888100c92068 (&u->msgbuffer_mutex){+.+.}-{3:3}, at: xenbus_file_write+0x2c/0x600 [ 96.388862] which lock already depends on the new lock. [ 96.388864] the existing dependency chain (in reverse order) is: [ 96.388866] -> #2 (&u->msgbuffer_mutex){+.+.}-{3:3}: [ 96.388874] __mutex_lock+0x85/0xb30 [ 96.388885] xenbus_dev_queue_reply+0x48/0x2b0 [ 96.388890] xenbus_thread+0x1d7/0x950 [ 96.388897] kthread+0xe7/0x120 [ 96.388905] ret_from_fork+0x2c/0x50 [ 96.388914] -> #1 (xs_response_mutex){+.+.}-{3:3}: [ 96.388923] __mutex_lock+0x85/0xb30 [ 96.388930] xenbus_backend_ioctl+0x56/0x1c0 [ 96.388935] __x64_sys_ioctl+0x90/0xd0 [ 96.388942] do_syscall_64+0x5c/0x90 [ 96.388950] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 96.388957] -> #0 (xs_watch_rwsem){++++}-{3:3}: [ 96.388965] __lock_acquire+0x1538/0x2260 [ 96.388972] lock_acquire+0xc6/0x2b0 [ 96.388976] down_read+0x2d/0x160 [ 96.388983] register_xenbus_watch+0x45/0x140 [ 96.388990] xenbus_file_write+0x53d/0x600 [ 96.388994] vfs_write+0xe4/0x490 [ 96.389003] ksys_write+0xb8/0xf0 [ 96.389011] do_syscall_64+0x5c/0x90 [ 96.389017] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 96.389023] other info that might help us debug this: [ 96.389025] Chain exists of: xs_watch_rwsem --> xs_response_mutex --> &u->msgbuffer_mutex [ 96.413429] Possible unsafe locking scenario: [ 96.413430] CPU0 CPU1 [ 96.413430] ---- ---- [ 96.413431] lock(&u->msgbuffer_mutex); [ 96.413432] lock(xs_response_mutex); [ 96.413433] lock(&u->msgbuffer_mutex); [ 96.413434] rlock(xs_watch_rwsem); [ 96.413436] *** DEADLOCK *** [ 96.413436] 1 lock held by xenconsoled/1330: [ 96.413438] #0: ffff888100c92068 (&u->msgbuffer_mutex){+.+.}-{3:3}, at: xenbus_file_write+0x2c/0x600 [ 96.413446] An ioctl call IOCTL_XENBUS_BACKEND_SETUP (record #1 in the report) results in calling xenbus_alloc() -> xs_suspend() which introduces ordering xs_watch_rwsem --> xs_response_mutex. The xenbus_thread() operation (record #2) creates xs_response_mutex --> &u->msgbuffer_mutex. An XS_WATCH write to the xenbus file then results in a complain about the opposite lock order &u->msgbuffer_mutex --> xs_watch_rwsem. The dependency xs_watch_rwsem --> xs_response_mutex is spurious. Avoid it and the warning by changing the ordering in xs_suspend(), first acquire xs_response_mutex and then xs_watch_rwsem. Reverse also the unlocking order in xs_suspend_cancel() for consistency, but keep xs_resume() as is because it needs to have xs_watch_rwsem unlocked only after exiting xs suspend and re-adding all watches. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230607123624.15739-1-petr.pavlu@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-08-21xen: Fix one kernel-doc commentYang Li1-1/+1
Use colon to separate parameter name from their specific meaning. silence the warning: drivers/xen/grant-table.c:1051: warning: Function parameter or member 'nr_pages' not described in 'gnttab_free_pages' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6030 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Acked-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230731030037.123946-1-yang.lee@linux.alibaba.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-08-21xen: xenbus: Use helper function IS_ERR_OR_NULL()Li Zetao1-1/+1
Use IS_ERR_OR_NULL() to detect an error pointer or a null pointer open-coding to simplify the code. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230817014736.3094289-1-lizetao1@huawei.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-08-21xen: Switch to use kmemdup() helperRuan Jinjie1-5/+2
Use kmemdup() helper instead of open-coding to simplify the code. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Chen Jiahao <chenjiahao16@huawei.com> Link: https://lore.kernel.org/r/20230815092434.1206386-1-ruanjinjie@huawei.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-08-21xen-pciback: Remove unused function declarationsYue Haibing2-5/+0
Commit a92336a1176b ("xen/pciback: Drop two backends, squash and cleanup some code.") declared but never implemented these functions. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230808150912.43416-1-yuehaibing@huawei.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-08-01swiotlb: make io_tlb_default_mem local to swiotlb.cPetr Tesarik1-1/+1
SWIOTLB implementation details should not be exposed to the rest of the kernel. This will allow to make changes to the implementation without modifying non-swiotlb code. To avoid breaking existing users, provide helper functions for the few required fields. As a bonus, using a helper function to initialize struct device allows to get rid of an #ifdef in driver core. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2023-07-27xen: speed up grant-table reclaimDemi Marie Obenour1-11/+29
When a grant entry is still in use by the remote domain, Linux must put it on a deferred list. Normally, this list is very short, because the PV network and block protocols expect the backend to unmap the grant first. However, Qubes OS's GUI protocol is subject to the constraints of the X Window System, and as such winds up with the frontend unmapping the window first. As a result, the list can grow very large, resulting in a massive memory leak and eventual VM freeze. To partially solve this problem, make the number of entries that the VM will attempt to free at each iteration tunable. The default is still 10, but it can be overridden via a module parameter. This is Cc: stable because (when combined with appropriate userspace changes) it fixes a severe performance and stability problem for Qubes OS users. Cc: stable@vger.kernel.org Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230726165354.1252-1-demi@invisiblethingslab.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-07-26xen/evtchn: Introduce new IOCTL to bind static evtchnRahul Singh2-20/+31
Xen 4.17 supports the creation of static evtchns. To allow user space application to bind static evtchns introduce new ioctl "IOCTL_EVTCHN_BIND_STATIC". Existing IOCTL doing more than binding that’s why we need to introduce the new IOCTL to only bind the static event channels. Static evtchns to be available for use during the lifetime of the guest. When the application exits, __unbind_from_irq() ends up being called from release() file operations because of that static evtchns are getting closed. To avoid closing the static event channel, add the new bool variable "is_static" in "struct irq_info" to mark the event channel static when creating the event channel to avoid closing the static evtchn. Also, take this opportunity to remove the open-coded version of the evtchn close in drivers/xen/evtchn.c file and use xen_evtchn_close(). Signed-off-by: Rahul Singh <rahul.singh@arm.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/ae7329bf1713f83e4aad4f3fa0f316258c40a3e9.1689677042.git.rahul.singh@arm.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-07-25xenbus: check xen_domain in xenbus_probe_initcallStefano Stabellini1-0/+3
The same way we already do in xenbus_init. Fixes the following warning: [ 352.175563] Trying to free already-free IRQ 0 [ 352.177355] WARNING: CPU: 1 PID: 88 at kernel/irq/manage.c:1893 free_irq+0xbf/0x350 [...] [ 352.213951] Call Trace: [ 352.214390] <TASK> [ 352.214717] ? __warn+0x81/0x170 [ 352.215436] ? free_irq+0xbf/0x350 [ 352.215906] ? report_bug+0x10b/0x200 [ 352.216408] ? prb_read_valid+0x17/0x20 [ 352.216926] ? handle_bug+0x44/0x80 [ 352.217409] ? exc_invalid_op+0x13/0x60 [ 352.217932] ? asm_exc_invalid_op+0x16/0x20 [ 352.218497] ? free_irq+0xbf/0x350 [ 352.218979] ? __pfx_xenbus_probe_thread+0x10/0x10 [ 352.219600] xenbus_probe+0x7a/0x80 [ 352.221030] xenbus_probe_thread+0x76/0xc0 Fixes: 5b3353949e89 ("xen: add support for initializing xenstore later as HVM domain") Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Tested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Link: https://lore.kernel.org/r/alpine.DEB.2.22.394.2307211609140.3118466@ubuntu-linux-20-04-desktop Signed-off-by: Juergen Gross <jgross@suse.com>
2023-07-13Merge tag 'for-linus-6.5-rc2-tag' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - a cleanup of the Xen related ELF-notes - a fix for virtio handling in Xen dom0 when running Xen in a VM * tag 'for-linus-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/virtio: Fix NULL deref when a bridge of PCI root bus has no parent x86/Xen: tidy xen-head.S
2023-07-04xen/virtio: Fix NULL deref when a bridge of PCI root bus has no parentPetr Pavlu1-0/+2
When attempting to run Xen on a QEMU/KVM virtual machine with virtio devices (all x86_64), function xen_dt_get_node() crashes on accessing bus->bridge->parent->of_node because a bridge of the PCI root bus has no parent set: [ 1.694192][ T1] BUG: kernel NULL pointer dereference, address: 0000000000000288 [ 1.695688][ T1] #PF: supervisor read access in kernel mode [ 1.696297][ T1] #PF: error_code(0x0000) - not-present page [ 1.696297][ T1] PGD 0 P4D 0 [ 1.696297][ T1] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 1.696297][ T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.3.7-1-default #1 openSUSE Tumbleweed a577eae57964bb7e83477b5a5645a1781df990f0 [ 1.696297][ T1] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014 [ 1.696297][ T1] RIP: e030:xen_virtio_restricted_mem_acc+0xd9/0x1c0 [ 1.696297][ T1] Code: 45 0c 83 e8 c9 a3 ea ff 31 c0 eb d7 48 8b 87 40 ff ff ff 48 89 c2 48 8b 40 10 48 85 c0 75 f4 48 8b 82 10 01 00 00 48 8b 40 40 <48> 83 b8 88 02 00 00 00 0f 84 45 ff ff ff 66 90 31 c0 eb a5 48 89 [ 1.696297][ T1] RSP: e02b:ffffc90040013cc8 EFLAGS: 00010246 [ 1.696297][ T1] RAX: 0000000000000000 RBX: ffff888006c75000 RCX: 0000000000000029 [ 1.696297][ T1] RDX: ffff888005ed1000 RSI: ffffc900400f100c RDI: ffff888005ee30d0 [ 1.696297][ T1] RBP: ffff888006c75010 R08: 0000000000000001 R09: 0000000330000006 [ 1.696297][ T1] R10: ffff888005850028 R11: 0000000000000002 R12: ffffffff830439a0 [ 1.696297][ T1] R13: 0000000000000000 R14: ffff888005657900 R15: ffff888006e3e1e8 [ 1.696297][ T1] FS: 0000000000000000(0000) GS:ffff88804a000000(0000) knlGS:0000000000000000 [ 1.696297][ T1] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.696297][ T1] CR2: 0000000000000288 CR3: 0000000002e36000 CR4: 0000000000050660 [ 1.696297][ T1] Call Trace: [ 1.696297][ T1] <TASK> [ 1.696297][ T1] virtio_features_ok+0x1b/0xd0 [ 1.696297][ T1] virtio_dev_probe+0x19c/0x270 [ 1.696297][ T1] really_probe+0x19b/0x3e0 [ 1.696297][ T1] __driver_probe_device+0x78/0x160 [ 1.696297][ T1] driver_probe_device+0x1f/0x90 [ 1.696297][ T1] __driver_attach+0xd2/0x1c0 [ 1.696297][ T1] bus_for_each_dev+0x74/0xc0 [ 1.696297][ T1] bus_add_driver+0x116/0x220 [ 1.696297][ T1] driver_register+0x59/0x100 [ 1.696297][ T1] virtio_console_init+0x7f/0x110 [ 1.696297][ T1] do_one_initcall+0x47/0x220 [ 1.696297][ T1] kernel_init_freeable+0x328/0x480 [ 1.696297][ T1] kernel_init+0x1a/0x1c0 [ 1.696297][ T1] ret_from_fork+0x29/0x50 [ 1.696297][ T1] </TASK> [ 1.696297][ T1] Modules linked in: [ 1.696297][ T1] CR2: 0000000000000288 [ 1.696297][ T1] ---[ end trace 0000000000000000 ]--- The PCI root bus is in this case created from ACPI description via acpi_pci_root_add() -> pci_acpi_scan_root() -> acpi_pci_root_create() -> pci_create_root_bus() where the last function is called with parent=NULL. It indicates that no parent is present and then bus->bridge->parent is NULL too. Fix the problem by checking bus->bridge->parent in xen_dt_get_node() for NULL first. Fixes: ef8ae384b4c9 ("xen/virtio: Handle PCI devices which Host controller is described in DT") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/20230621131214.9398-2-petr.pavlu@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-27Merge tag 'wq-for-6.5-cleanup-ordered' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull ordered workqueue creation updates from Tejun Heo: "For historical reasons, unbound workqueues with max concurrency limit of 1 are considered ordered, even though the concurrency limit hasn't been system-wide for a long time. This creates ambiguity around whether ordered execution is actually required for correctness, which was actually confusing for e.g. btrfs (btrfs updates are being routed through the btrfs tree). There aren't that many users in the tree which use the combination and there are pending improvements to unbound workqueue affinity handling which will make inadvertent use of ordered workqueue a bigger loss. This clarifies the situation for most of them by updating the ones which require ordered execution to use alloc_ordered_workqueue(). There are some conversions being routed through subsystem-specific trees and likely a few stragglers. Once they're all converted, workqueue can trigger a warning on unbound + @max_active==1 usages and eventually drop the implicit ordered behavior" * tag 'wq-for-6.5-cleanup-ordered' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: rxrpc: Use alloc_ordered_workqueue() to create ordered workqueues net: qrtr: Use alloc_ordered_workqueue() to create ordered workqueues net: wwan: t7xx: Use alloc_ordered_workqueue() to create ordered workqueues dm integrity: Use alloc_ordered_workqueue() to create ordered workqueues media: amphion: Use alloc_ordered_workqueue() to create ordered workqueues scsi: NCR5380: Use default @max_active for hostdata->work_q media: coda: Use alloc_ordered_workqueue() to create ordered workqueues crypto: octeontx2: Use alloc_ordered_workqueue() to create ordered workqueues wifi: ath10/11/12k: Use alloc_ordered_workqueue() to create ordered workqueues wifi: mwifiex: Use default @max_active for workqueues wifi: iwlwifi: Use default @max_active for trans_pcie->rba.alloc_wq xen/pvcalls: Use alloc_ordered_workqueue() to create ordered workqueues virt: acrn: Use alloc_ordered_workqueue() to create ordered workqueues net: octeontx2: Use alloc_ordered_workqueue() to create ordered workqueues net: thunderx: Use alloc_ordered_workqueue() to create ordered workqueues greybus: Use alloc_ordered_workqueue() to create ordered workqueues powerpc, workqueue: Use alloc_ordered_workqueue() to create ordered workqueues
2023-06-19mm: ptep_get() conversionRyan Roberts1-1/+1
Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-27Merge tag 'for-linus-6.4-rc4-tag' of ↵Linus Torvalds1-5/+4
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - a double free fix in the Xen pvcalls backend driver - a fix for a regression causing the MSI related sysfs entries to not being created in Xen PV guests - a fix in the Xen blkfront driver for handling insane input data better * tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/pci/xen: populate MSI sysfs entries xen/pvcalls-back: fix double frees with pvcalls_new_active_socket() xen/blkfront: Only check REQ_FUA for writes
2023-05-24xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()Dan Carpenter1-5/+4
In the pvcalls_new_active_socket() function, most error paths call pvcalls_back_release_active(fedata->dev, fedata, map) which calls sock_release() on "sock". The bug is that the caller also frees sock. Fix this by making every error path in pvcalls_new_active_socket() release the sock, and don't free it in the caller. Fixes: 5db4d286a8ef ("xen/pvcalls: implement connect command") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/e5f98dc2-0305-491f-a860-71bbd1398a2f@kili.mountain Signed-off-by: Juergen Gross <jgross@suse.com>
2023-05-08xen/pvcalls: Use alloc_ordered_workqueue() to create ordered workqueuesTejun Heo1-2/+2
BACKGROUND ========== When multiple work items are queued to a workqueue, their execution order doesn't match the queueing order. They may get executed in any order and simultaneously. When fully serialized execution - one by one in the queueing order - is needed, an ordered workqueue should be used which can be created with alloc_ordered_workqueue(). However, alloc_ordered_workqueue() was a later addition. Before it, an ordered workqueue could be obtained by creating an UNBOUND workqueue with @max_active==1. This originally was an implementation side-effect which was broken by 4c16bd327c74 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered"). Because there were users that depended on the ordered execution, 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") made workqueue allocation path to implicitly promote UNBOUND workqueues w/ @max_active==1 to ordered workqueues. While this has worked okay, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly gets an ordered one instead. With planned UNBOUND workqueue updates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever. This patch series audits all callsites that create an UNBOUND workqueue w/ @max_active==1 and converts them to alloc_ordered_workqueue() as necessary. WHAT TO LOOK FOR ================ The conversions are from alloc_workqueue(WQ_UNBOUND | flags, 1, args..) to alloc_ordered_workqueue(flags, args...) which don't cause any functional changes. If you know that fully ordered execution is not ncessary, please let me know. I'll drop the conversion and instead add a comment noting the fact to reduce confusion while conversion is in progress. If you aren't fully sure, it's completely fine to let the conversion through. The behavior will stay exactly the same and we can always reconsider later. As there are follow-up workqueue core changes, I'd really appreciate if the patch can be routed through the workqueue tree w/ your acks. Thanks. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: xen-devel@lists.xenproject.org
2023-04-27Merge tag 'for-linus-6.4-rc1-tag' of ↵Linus Torvalds3-37/+42
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - some cleanups in the Xen blkback driver - fix potential sleeps under lock in various Xen drivers * tag 'for-linus-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/blkback: move blkif_get_x86_*_req() into blkback.c xen/blkback: simplify free_persistent_gnts() interface xen/blkback: remove stale prototype xen/blkback: fix white space code style issues xen/pvcalls: don't call bind_evtchn_to_irqhandler() under lock xen/scsiback: don't call scsiback_free_translation_entry() under lock xen/pciback: don't call pcistub_device_put() under lock
2023-04-27Merge tag 'sysctl-6.4-rc1' of ↵Linus Torvalds1-19/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "This only does a few sysctl moves from the kernel/sysctl.c file, the rest of the work has been put towards deprecating two API calls which incur recursion and prevent us from simplifying the registration process / saving memory per move. Most of the changes have been soaking on linux-next since v6.3-rc3. I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's feedback that we should see if we could *save* memory with these moves instead of incurring more memory. We currently incur more memory since when we move a syctl from kernel/sysclt.c out to its own file we end up having to add a new empty sysctl used to register it. To achieve saving memory we want to allow syctls to be passed without requiring the end element being empty, and just have our registration process rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls would make the sysctl registration pretty brittle, hard to read and maintain as can be seen from Meng Tang's efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations also implies doing the work to deprecate two API calls which use recursion in order to support sysctl declarations with subdirectories. And so during this development cycle quite a bit of effort went into this deprecation effort. I've annotated the following two APIs are deprecated and in few kernel releases we should be good to remove them: - register_sysctl_table() - register_sysctl_paths() During this merge window we should be able to deprecate and unexport register_sysctl_paths(), we can probably do that towards the end of this merge window. Deprecating register_sysctl_table() will take a bit more time but this pull request goes with a few example of how to do this. As it turns out each of the conversions to move away from either of these two API calls *also* saves memory. And so long term, all these changes *will* prove to have saved a bit of memory on boot. The way I see it then is if remove a user of one deprecated call, it gives us enough savings to move one kernel/sysctl.c out from the generic arrays as we end up with about the same amount of bytes. Since deprecating register_sysctl_table() and register_sysctl_paths() does not require maintainer coordination except the final unexport you'll see quite a bit of these changes from other pull requests, I've just kept the stragglers after rc3" Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0] * tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits) fs: fix sysctls.c built mm: compaction: remove incorrect #ifdef checks mm: compaction: move compaction sysctl to its own file mm: memory-failure: Move memory failure sysctls to its own file arm: simplify two-level sysctl registration for ctl_isa_vars ia64: simplify one-level sysctl registration for kdump_ctl_table utsname: simplify one-level sysctl registration for uts_kern_table ntfs: simplfy one-level sysctl registration for ntfs_sysctls coda: simplify one-level sysctl registration for coda_table fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls xfs: simplify two-level sysctl registration for xfs_table nfs: simplify two-level sysctl registration for nfs_cb_sysctls nfs: simplify two-level sysctl registration for nfs4_cb_sysctls lockd: simplify two-level sysctl registration for nlm_sysctls proc_sysctl: enhance documentation xen: simplify sysctl registration for balloon md: simplify sysctl registration hv: simplify sysctl registration scsi: simplify sysctl registration with register_sysctl() csky: simplify alignment sysctl registration ...
2023-04-26Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-30/+0
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (megaraid_sas, scsi_debug, lpfc, target, mpi3mr, hisi_sas, arcmsr). The major core change is the constification of the host templates (which touches everything) along with other minor fixups and clean ups" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (207 commits) scsi: ufs: mcq: Use pointer arithmetic in ufshcd_send_command() scsi: ufs: mcq: Annotate ufshcd_inc_sq_tail() appropriately scsi: cxlflash: s/semahpore/semaphore/ scsi: lpfc: Silence an incorrect device output scsi: mpi3mr: Use IRQ save variants of spinlock to protect chain frame allocation scsi: scsi_debug: Fix missing error code in scsi_debug_init() scsi: hisi_sas: Work around build failure in suspend function scsi: lpfc: Fix ioremap issues in lpfc_sli4_pci_mem_setup() scsi: mpt3sas: Fix an issue when driver is being removed scsi: mpt3sas: Remove HBA BIOS version in the kernel log scsi: target: core: Fix invalid memory access scsi: scsi_debug: Drop sdebug_queue scsi: scsi_debug: Only allow sdebug_max_queue be modified when no shosts scsi: scsi_debug: Use scsi_host_busy() in delay_store() and ndelay_store() scsi: scsi_debug: Use blk_mq_tagset_busy_iter() in stop_all_queued() scsi: scsi_debug: Use blk_mq_tagset_busy_iter() in sdebug_blk_mq_poll() scsi: scsi_debug: Dynamically allocate sdebug_queued_cmd scsi: scsi_debug: Use scsi_block_requests() to block queues scsi: scsi_debug: Protect block_unblock_all_queues() with mutex scsi: scsi_debug: Change shost list lock to a mutex ...
2023-04-24xen/pvcalls: don't call bind_evtchn_to_irqhandler() under lockJuergen Gross1-20/+26
bind_evtchn_to_irqhandler() shouldn't be called under spinlock, as it can sleep. This requires to move the calls of create_active() out of the locked regions. This is no problem, as the worst which could happen would be a spurious call of the interrupt handler, causing a spurious wake_up(). Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/lkml/Y+JUIl64UDmdkboh@kadam/ Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/20230403092711.15285-1-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-04-24xen/scsiback: don't call scsiback_free_translation_entry() under lockJuergen Gross1-13/+14
scsiback_free_translation_entry() shouldn't be called under spinlock, as it can sleep. This requires to split removing a translation entry from the v2p list from actually calling kref_put() for the entry. Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/lkml/Y+JUIl64UDmdkboh@kadam/ Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Link: https://lore.kernel.org/r/20230328084602.20729-1-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-04-24xen/pciback: don't call pcistub_device_put() under lockJuergen Gross1-4/+2
pcistub_device_put() shouldn't be called under spinlock, as it can sleep. For this reason pcistub_device_get_pci_dev() needs to be modified: instead of always calling pcistub_device_get() just do the call of pcistub_device_get() only if it is really needed. This removes the need to call pcistub_device_put(). Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/lkml/Y+JUIl64UDmdkboh@kadam/ Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Link: https://lore.kernel.org/r/20230328084549.20695-1-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-04-13xen: simplify sysctl registration for balloonLuis Chamberlain1-19/+1
register_sysctl_table() is a deprecated compatibility wrapper. register_sysctl_init() can do the directory creation for you so just use that. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Juergen Gross <jgross@suse.com>
2023-03-22ACPI: processor: Fix evaluating _PDC method when running as Xen dom0Roger Pau Monne1-0/+20
In ACPI systems, the OS can direct power management, as opposed to the firmware. This OS-directed Power Management is called OSPM. Part of telling the firmware that the OS going to direct power management is making ACPI "_PDC" (Processor Driver Capabilities) calls. These _PDC methods must be evaluated for every processor object. If these _PDC calls are not completed for every processor it can lead to inconsistency and later failures in things like the CPU frequency driver. In a Xen system, the dom0 kernel is responsible for system-wide power management. The dom0 kernel is in charge of OSPM. However, the number of CPUs available to dom0 can be different than the number of CPUs physically present on the system. This leads to a problem: the dom0 kernel needs to evaluate _PDC for all the processors, but it can't always see them. In dom0 kernels, ignore the existing ACPI method for determining if a processor is physically present because it might not be accurate. Instead, ask the hypervisor for this information. Fix this by introducing a custom function to use when running as Xen dom0 in order to check whether a processor object matches a CPU that's online. Such checking is done using the existing information fetched by the Xen pCPU subsystem, extending it to also store the ACPI ID. This ensures that _PDC method gets evaluated for all physically online CPUs, regardless of the number of CPUs made available to dom0. Fixes: 5d554a7bb064 ("ACPI: processor: add internal processor_physically_present()") Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-03-17Merge tag 'for-linus-6.3-rc3-tag' of ↵Linus Torvalds1-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - cleanup for xen time handling - enable the VGA console in a Xen PVH dom0 - cleanup in the xenfs driver * tag 'for-linus-6.3-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: remove unnecessary (void*) conversions x86/PVH: obtain VGA console info in Dom0 x86/xen/time: cleanup xen_tsc_safe_clocksource xen: update arch/x86/include/asm/xen/cpuid.h
2023-03-16scsi: xen-scsiback: Remove default fabric ops calloutsDmitry Bogdanov1-30/+0
Remove callouts that are identical to the default implementations in TCM Core. Acked-by: Juergen Gross <jgross@suse.com> Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Link: https://lore.kernel.org/r/20230313181110.20566-10-d.bogdanov@yadro.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-03-16xen: remove unnecessary (void*) conversionsYu Zhe1-5/+5
Pointer variables of void * type do not require type cast. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230316083954.4223-1-yuzhe@nfschina.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-24Merge tag 'driver-core-6.3-rc1' of ↵Linus Torvalds3-7/+7
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.3-rc1. There's a lot of changes this development cycle, most of the work falls into two different categories: - fw_devlink fixes and updates. This has gone through numerous review cycles and lots of review and testing by lots of different devices. Hopefully all should be good now, and Saravana will be keeping a watch for any potential regression on odd embedded systems. - driver core changes to work to make struct bus_type able to be moved into read-only memory (i.e. const) The recent work with Rust has pointed out a number of areas in the driver core where we are passing around and working with structures that really do not have to be dynamic at all, and they should be able to be read-only making things safer overall. This is the contuation of that work (started last release with kobject changes) in moving struct bus_type to be constant. We didn't quite make it for this release, but the remaining patches will be finished up for the release after this one, but the groundwork has been laid for this effort. Other than that we have in here: - debugfs memory leak fixes in some subsystems - error path cleanups and fixes for some never-able-to-be-hit codepaths. - cacheinfo rework and fixes - Other tiny fixes, full details are in the shortlog All of these have been in linux-next for a while with no reported problems" [ Geert Uytterhoeven points out that that last sentence isn't true, and that there's a pending report that has a fix that is queued up - Linus ] * tag 'driver-core-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (124 commits) debugfs: drop inline constant formatting for ERR_PTR(-ERROR) OPP: fix error checking in opp_migrate_dentry() debugfs: update comment of debugfs_rename() i3c: fix device.h kernel-doc warnings dma-mapping: no need to pass a bus_type into get_arch_dma_ops() driver core: class: move EXPORT_SYMBOL_GPL() lines to the correct place Revert "driver core: add error handling for devtmpfs_create_node()" Revert "devtmpfs: add debug info to handle()" Revert "devtmpfs: remove return value of devtmpfs_delete_node()" driver core: cpu: don't hand-override the uevent bus_type callback. devtmpfs: remove return value of devtmpfs_delete_node() devtmpfs: add debug info to handle() driver core: add error handling for devtmpfs_create_node() driver core: bus: update my copyright notice driver core: bus: add bus_get_dev_root() function driver core: bus: constify bus_unregister() driver core: bus: constify some internal functions driver core: bus: constify bus_get_kset() driver core: bus: constify bus_register/unregister_notifier() driver core: remove private pointer from struct bus_type ...
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds4-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-23Merge tag 'efi-next-for-v6.3' of ↵Linus Torvalds1-0/+61
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI updates from Ard Biesheuvel: "A healthy mix of EFI contributions this time: - Performance tweaks for efifb earlycon (Andy) - Preparatory refactoring and cleanup work in the efivar layer, which is needed to accommodate the Snapdragon arm64 laptops that expose their EFI variable store via a TEE secure world API (Johan) - Enhancements to the EFI memory map handling so that Xen dom0 can safely access EFI configuration tables (Demi Marie) - Wire up the newly introduced IBT/BTI flag in the EFI memory attributes table, so that firmware that is generated with ENDBR/BTI landing pads will be mapped with enforcement enabled - Clean up how we check and print the EFI revision exposed by the firmware - Incorporate EFI memory attributes protocol definition and wire it up in the EFI zboot code (Evgeniy) This ensures that these images can execute under new and stricter rules regarding the default memory permissions for EFI page allocations (More work is in progress here) - CPER header cleanup (Dan Williams) - Use a raw spinlock to protect the EFI runtime services stack on arm64 to ensure the correct semantics under -rt (Pierre) - EFI framebuffer quirk for Lenovo Ideapad (Darrell)" * tag 'efi-next-for-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits) firmware/efi sysfb_efi: Add quirk for Lenovo IdeaPad Duet 3 arm64: efi: Make efi_rt_lock a raw_spinlock efi: Add mixed-mode thunk recipe for GetMemoryAttributes efi: x86: Wire up IBT annotation in memory attributes table efi: arm64: Wire up BTI annotation in memory attributes table efi: Discover BTI support in runtime services regions efi/cper, cxl: Remove cxl_err.h efi: Use standard format for printing the EFI revision efi: Drop minimum EFI version check at boot efi: zboot: Use EFI protocol to remap code/data with the right attributes efi/libstub: Add memory attribute protocol definitions efi: efivars: prevent double registration efi: verify that variable services are supported efivarfs: always register filesystem efi: efivars: add efivars printk prefix efi: Warn if trying to reserve memory under Xen efi: Actually enable the ESRT under Xen efi: Apply allowlist to EFI configuration tables when running under Xen efi: xen: Implement memory descriptor lookup based on hypercall efi: memmap: Disregard bogus entries instead of returning them ...
2023-02-21Merge tag 'net-next-6.3' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
2023-02-18xen: sysfs: make kobj_type structure constantThomas Weißschuh1-1/+1
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definition to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230216-kobj_type-xen-v1-1-742423de7d71@weissschuh.net Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-13xen: Replace one-element array with flexible-array memberGustavo A. R. Silva1-1/+1
One-element arrays are deprecated, and we are replacing them with flexible array members instead. So, replace one-element array with flexible-array member in struct xen_page_directory. This helps with the ongoing efforts to tighten the FORTIFY_SOURCE routines on memcpy() and help us make progress towards globally enabling -fstrict-flex-arrays=3 [1]. This results in no differences in binary output. Link: https://github.com/KSPP/linux/issues/79 Link: https://github.com/KSPP/linux/issues/255 Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [1] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/Y9xjN6Wa3VslgXeX@work Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-13xen/grant-dma-iommu: Implement a dummy probe_device() callbackOleksandr Tyshchenko1-2/+9
Update stub IOMMU driver (which main purpose is to reuse generic IOMMU device-tree bindings by Xen grant DMA-mapping layer on Arm) according to the recent changes done in the following commit 57365a04c921 ("iommu: Move bus setup to IOMMU device registration"). With probe_device() callback being called during IOMMU device registration, the uninitialized callback just leads to the "kernel NULL pointer dereference" issue during boot. Fix that by adding a dummy callback. Looks like the release_device() callback is not mandatory to be implemented as IOMMU framework makes sure that callback is initialized before dereferencing. Reported-by: Viresh Kumar <viresh.kumar@linaro.org> Fixes: 57365a04c921 ("iommu: Move bus setup to IOMMU device registration") Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Tested-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/20230208153649.3604857-1-olekstysh@gmail.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-13xen/pvcalls-back: fix permanently masked event channelVolodymyr Babchuk1-1/+2
There is a sequence of events that can lead to a permanently masked event channel, because xen_irq_lateeoi() is newer called. This happens when a backend receives spurious write event from a frontend. In this case pvcalls_conn_back_write() returns early and it does not clears the map->write counter. As map->write > 0, pvcalls_back_ioworker() returns without calling xen_irq_lateeoi(). This leaves the event channel in masked state, a backend does not receive any new events from a frontend and the whole communication stops. Move atomic_set(&map->write, 0) to the very beginning of pvcalls_conn_back_write() to fix this issue. Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Reported-by: Oleksii Moisieiev <oleksii_moisieiev@epam.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230119211037.1234931-1-volodymyr_babchuk@epam.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-13xen: Allow platform PCI interrupt to be sharedDavid Woodhouse2-6/+8
When we don't use the per-CPU vector callback, we ask Xen to deliver event channel interrupts as INTx on the PCI platform device. As such, it can be shared with INTx on other PCI devices. Set IRQF_SHARED, and make it return IRQ_HANDLED or IRQ_NONE according to whether the evtchn_upcall_pending flag was actually set. Now I can share the interrupt: 11: 82 0 IO-APIC 11-fasteoi xen-platform-pci, ens4 Drop the IRQF_TRIGGER_RISING. It has no effect when the IRQ is shared, and besides, the only effect it was having even beforehand was to trigger a debug message in both I/OAPIC and legacy PIC cases: [ 0.915441] genirq: No set_type function for IRQ 11 (IO-APIC) [ 0.951939] genirq: No set_type function for IRQ 11 (XT-PIC) Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/f9a29a68d05668a3636dd09acd94d970269eaec6.camel@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-13drivers/xen/hypervisor: Expose Xen SIF flags to userspacePer Bilse1-4/+65
/proc/xen is a legacy pseudo filesystem which predates Xen support getting merged into Linux. It has largely been replaced with more normal locations for data (/sys/hypervisor/ for info, /dev/xen/ for user devices). We want to compile xenfs support out of the dom0 kernel. There is one item which only exists in /proc/xen, namely /proc/xen/capabilities with "control_d" being the signal of "you're in the control domain". This ultimately comes from the SIF flags provided at VM start. This patch exposes all SIF flags in /sys/hypervisor/start_flags/ as boolean files, one for each bit, returning '1' if set, '0' otherwise. Two known flags, 'privileged' and 'initdomain', are explicitly named, and all remaining flags can be accessed via generically named files, as suggested by Andrew Cooper. Signed-off-by: Per Bilse <per.bilse@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230103130213.2129753-1-per.bilse@citrix.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-02-09mm: replace vma->vm_flags direct modifications with modifier callsSuren Baghdasaryan4-6/+6
Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-27driver core: make struct bus_type.uevent() take a const *Greg Kroah-Hartman3-7/+7
The uevent() callback in struct bus_type should not be modifying the device that is passed into it, so mark it as a const * and propagate the function signature changes out into all relevant subsystems that use this callback. Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230111113018.459199-16-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-23net/sock: Introduce trace_sk_data_ready()Peilin Ye1-0/+5
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready() callback implementations. For example: <...> iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable <...> Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23efi: Apply allowlist to EFI configuration tables when running under XenDemi Marie Obenour1-0/+25
As it turns out, Xen does not guarantee that EFI boot services data regions in memory are preserved, which means that EFI configuration tables pointing into such memory regions may be corrupted before the dom0 OS has had a chance to inspect them. This is causing problems for Qubes OS when it attempts to perform system firmware updates, which requires that the contents of the EFI System Resource Table are valid when the fwupd userspace program runs. However, other configuration tables such as the memory attributes table or the runtime properties table are equally affected, and so we need a comprehensive workaround that works for any table type. So when running under Xen, check the EFI memory descriptor covering the start of the table, and disregard the table if it does not reside in memory that is preserved by Xen. Co-developed-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-01-22efi: xen: Implement memory descriptor lookup based on hypercallDemi Marie Obenour1-0/+36
Xen on x86 boots dom0 in EFI mode but without providing a memory map. This means that some consistency checks we would like to perform on configuration tables or other data structures in memory are not currently possible. Xen does, however, expose EFI memory descriptor info via a Xen hypercall, so let's wire that up instead. It turns out that the returned information is not identical to what Linux's efi_mem_desc_lookup would return: the address returned is the address passed to the hypercall, and the size returned is the number of bytes remaining in the configuration table. However, none of the callers of efi_mem_desc_lookup() currently care about this. In the future, Xen may gain a hypercall that returns the actual start address, which can be used instead. Co-developed-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-01-12Merge tag 'for-linus-6.2-rc4-tag' of ↵Linus Torvalds4-11/+7
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - two cleanup patches - a fix of a memory leak in the Xen pvfront driver - a fix of a locking issue in the Xen hypervisor console driver * tag 'for-linus-6.2-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/pvcalls: free active map buffer on pvcalls_front_free_map hvc/xen: lock console list traversal x86/xen: Remove the unused function p2m_index() xen: make remove callback of xen driver void returned
2023-01-09xen/pvcalls: free active map buffer on pvcalls_front_free_mapOleksii Moisieiev1-1/+3
Data buffer for active map is allocated in alloc_active_ring and freed in free_active_ring function, which is used only for the error cleanup. pvcalls_front_release is calling pvcalls_front_free_map which ends foreign access for this buffer, but doesn't free allocated pages. Call free_active_ring to clean all allocated resources. Signed-off-by: Oleksii Moisieiev <oleksii_moisieiev@epam.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/6a762ee32dd655cbb09a4aa0e2307e8919761311.1671531297.git.oleksii_moisieiev@epam.com Signed-off-by: Juergen Gross <jgross@suse.com>
2022-12-15xen: make remove callback of xen driver void returnedDawei Li4-10/+4
Since commit fc7a6209d571 ("bus: Make remove callback return void") forces bus_type::remove be void-returned, it doesn't make much sense for any bus based driver implementing remove callbalk to return non-void to its caller. This change is for xen bus based drivers. Acked-by: Juergen Gross <jgross@suse.com> Signed-off-by: Dawei Li <set_pte_at@outlook.com> Link: https://lore.kernel.org/r/TYCP286MB23238119AB4DF190997075C9CAE39@TYCP286MB2323.JPNP286.PROD.OUTLOOK.COM Signed-off-by: Juergen Gross <jgross@suse.com>
2022-12-13Merge tag 'drm-next-2022-12-13' of git://anongit.freedesktop.org/drm/drmLinus Torvalds1-4/+4
Pull drm updates from Dave Airlie: "The biggest highlight is that the accel subsystem framework is merged. Hopefully for 6.3 we will be able to line up a driver to use it. In drivers land, i915 enables DG2 support by default now, and nouveau has a big stability refactoring and initial ampere support, AMD includes new hw IP support and should build on ARM again. There is also an ofdrm driver to take over offb on platforms it's used. Stuff outside my tree, the dma-buf patches hit a few places, the vc4 firmware changes also do, and i915 has some interactions with MEI for discrete GPUs. I think all of those should have been acked/reviewed by relevant parties. New driver: - ofdrm - replacement for offb fbdev: - add support for nomodeset fourcc: - add Vivante tiled modifier core: - atomic-helpers: CRTC primary plane test fixes, fb access hooks - connector: TV API consistency, cmdline parser improvements - send connector hotplug on cleanup - sort makefile objects tests: - sort kunit tests - improve DP-MST tests - add kunit helpers to create a device sched: - module param for scheduling policy - refcounting fix buddy: - add back random seed log ttm: - convert ttm_resource to size_t - optimize pool allocations edid: - HFVSDB parsing support fixes - logging/debug improvements - DSC quirks dma-buf: - Add unlocked vmap and attachment mapping - move drivers to common locking convention - locking improvements firmware: - new API for rPI firmware and vc4 xilinx: - zynqmp: displayport bridge support - dpsub fix bridge: - adv7533: Remove dynamic lane switching - it6505: Runtime PM support, sync improvements - ps8640: Handle AUX defer messages - tc358775: Drop soft-reset over I2C panel: - panel-edp: Add INX N116BGE-EA2 C2 and C4 support. - Jadard JD9365DA-H3 - NewVision NV3051D amdgpu: - DCN support on ARM - DCN 2.1 secure display - Sienna Cichlid mode2 reset fixes - new GC 11.x firmware versions - drop AMD specific DSC workarounds in favour of drm code - clang warning fixes - scheduler rework - SR-IOV fixes - GPUVM locking fixes - fix memory leak in CS IOCTL error path - flexible array updates - enable new GC/PSP/SMU/NBIO IP - GFX preemption support for gfx9 amdkfd: - cache size fixes - userptr fixes - enable cooperative launch on gfx 10.3 - enable GC 11.0.4 KFD support radeon: - replace kmap with kmap_local_page - ACPI ref count fix - HDA audio notifier support i915: - DG2 enabled by default - MTL enablement work - hotplug refactoring - VBT improvements - Display and watermark refactoring - ADL-P workaround - temp disable runtime_pm for discrete- - fix for A380 as a secondary GPU - Wa_18017747507 for DG2 - CS timestamp support fixes for gen5 and earlier - never purge busy TTM objects - use i915_sg_dma_sizes for all backends - demote GuC kernel contexts to normal priority - gvt: refactor for new MDEV interface - enable DC power states on eDP ports - fix gen 2/3 workarounds nouveau: - fix page fault handling - Ampere acceleration support - driver stability improvements - nva3 backlight support msm: - MSM_INFO_GET_FLAGS support - DPU: XR30 and P010 image formats - Qualcomm SM6115 support - DSI PHY support for QCM2290 - HDMI: refactored dev init path - remove exclusive-fence hack - fix speed-bin detection - enable clamp to idle on 7c3 - improved hangcheck detection vmwgfx: - fb and cursor refactoring - convert to generic hashtable - cursor improvements etnaviv: - hw workarounds - softpin MMU fixes ast: - atomic gamma LUT support - convert to SHMEM lcdif: - support YUV planes - Increase DMA burst size - FIFO threshold tuning meson: - fix return type of cvbs mode_valid mgag200: - fix PLL setup on some revisions sun4i: - A100 and D1 support udl: - modesetting improvements - hot unplug support vc4: - support PAL-M - fix regression preventing 4K @ 60Hz - fix NULL ptr deref v3d: - switch to drm managed resources renesas: - RZ/G2L DSI support - DU Kconfig cleanup mediatek: - fixup dpi and hdmi - MT8188 dpi support - MT8195 AFBC support tegra: - NVDEC hardware on Tegra234 SoC hdlcd: - switch to drm managed resources ingenic: - fix registration error path hisilicon: - convert to drm_mode_init maildp: - use managed resources mtk: - use drm_mode_init rockchip: - use drm_mode_copy" * tag 'drm-next-2022-12-13' of git://anongit.freedesktop.org/drm/drm: (1397 commits) drm/amdgpu: fix mmhub register base coding error drm/amdgpu: add tmz support for GC IP v11.0.4 drm/amdgpu: enable GFX Clock Gating control for GC IP v11.0.4 drm/amdgpu: enable GFX Power Gating for GC IP v11.0.4 drm/amdgpu: enable GFX IP v11.0.4 CG support drm/amdgpu: Make amdgpu_ring_mux functions as static drm/amdgpu: generally allow over-commit during BO allocation drm/amd/display: fix array index out of bound error in DCN32 DML drm/amd/display: 3.2.215 drm/amd/display: set optimized required for comp buf changes drm/amd/display: Add debug option to skip PSR CRTC disable drm/amd/display: correct DML calc error of UrgentLatency drm/amd/display: correct static_screen_event_mask drm/amd/display: Ensure commit_streams returns the DC return code drm/amd/display: read invalid ddc pin status cause engine busy drm/amd/display: Bypass DET swath fill check for max clocks drm/amd/display: Disable uclk pstate for subvp pipes drm/amd/display: Fix DCN2.1 default DSC clocks drm/amd/display: Enable dp_hdmi21_pcon support drm/amd/display: prevent seamless boot on displays that don't have the preferred dig ...
2022-12-12Merge tag 'pull-iov_iter' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: "iov_iter work; most of that is about getting rid of direction misannotations and (hopefully) preventing more of the same for the future" * tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: use less confusing names for iov_iter direction initializers iov_iter: saner checks for attempt to copy to/from iterator [xen] fix "direction" argument of iov_iter_kvec() [vhost] fix 'direction' argument of iov_iter_{init,bvec}() [target] fix iov_iter_bvec() "direction" argument [s390] memcpy_real(): WRITE is "data source", not destination... [s390] zcore: WRITE is "data source", not destination... [infiniband] READ is "data destination", not source... [fsi] WRITE is "data source", not destination... [s390] copy_oldmem_kernel() - WRITE is "data source", not destination csum_and_copy_to_iter(): handle ITER_DISCARD get rid of unlikely() on page_copy_sane() calls
2022-12-12Merge tag 'acpi-6.2-rc1' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and PNP updates from Rafael Wysocki: "These include new code (for instance, support for the FFH address space type and support for new firmware data structures in ACPICA), some new quirks (mostly related to backlight handling and I2C enumeration), a number of fixes and a fair amount of cleanups all over. Specifics: - Update the ACPICA code in the kernel to the 20221020 upstream version and fix a couple of issues in it: - Make acpi_ex_load_op() match upstream implementation (Rafael Wysocki) - Add support for loong_arch-specific APICs in MADT (Huacai Chen) - Add support for fixed PCIe wake event (Huacai Chen) - Add EBDA pointer sanity checks (Vit Kabele) - Avoid accessing VGA memory when EBDA < 1KiB (Vit Kabele) - Add CCEL table support to both compiler/disassembler (Kuppuswamy Sathyanarayanan) - Add a couple of new UUIDs to the known UUID list (Bob Moore) - Add support for FFH Opregion special context data (Sudeep Holla) - Improve warning message for "invalid ACPI name" (Bob Moore) - Add support for CXL 3.0 structures (CXIMS & RDPAS) in the CEDT table (Alison Schofield) - Prepare IORT support for revision E.e (Robin Murphy) - Finish support for the CDAT table (Bob Moore) - Fix error code path in acpi_ds_call_control_method() (Rafael Wysocki) - Fix use-after-free in acpi_ut_copy_ipackage_to_ipackage() (Li Zetao) - Update the version of the ACPICA code in the kernel (Bob Moore) - Use ZERO_PAGE(0) instead of empty_zero_page in the ACPI device enumeration code (Giulio Benetti) - Change the return type of the ACPI driver remove callback to void and update its users accordingly (Dawei Li) - Add general support for FFH address space type and implement the low- level part of it for ARM64 (Sudeep Holla) - Fix stale comments in the ACPI tables parsing code and make it print more messages related to MADT (Hanjun Guo, Huacai Chen) - Replace invocations of generic library functions with more kernel- specific counterparts in the ACPI sysfs interface (Christophe JAILLET, Xu Panda) - Print full name paths of ACPI power resource objects during enumeration (Kane Chen) - Eliminate a compiler warning regarding a missing function prototype in the ACPI power management code (Sudeep Holla) - Fix and clean up the ACPI processor driver (Rafael Wysocki, Li Zhong, Colin Ian King, Sudeep Holla) - Add quirk for the HP Pavilion Gaming 15-cx0041ur to the ACPI EC driver (Mia Kanashi) - Add some mew ACPI backlight handling quirks and update some existing ones (Hans de Goede) - Make the ACPI backlight driver prefer the native backlight control over vendor backlight control when possible (Hans de Goede) - Drop unsetting ACPI APEI driver data on remove (Uwe Kleine-König) - Use xchg_release() instead of cmpxchg() for updating new GHES cache slots (Ard Biesheuvel) - Clean up the ACPI APEI code (Sudeep Holla, Christophe JAILLET, Jay Lu) - Add new I2C device enumeration quirks for Medion Lifetab S10346 and Lenovo Yoga Tab 3 Pro (YT3-X90F) (Hans de Goede) - Make the ACPI battery driver notify user space about adding new battery hooks and removing the existing ones (Armin Wolf) - Modify the pfr_update and pfr_telemetry drivers to use ACPI_FREE() for freeing acpi_object structures to help diagnostics (Wang ShaoBo) - Make the ACPI fan driver use sysfs_emit_at() in its sysfs interface code (ye xingchen) - Fix the _FIF package extraction failure handling in the ACPI fan driver (Hanjun Guo) - Fix the PCC mailbox handling error code path (Huisong Li) - Avoid using PCC Opregions if there is no platform interrupt allocated for this purpose (Huisong Li) - Use sysfs_emit() instead of scnprintf() in the ACPI PAD driver and CPPC library (ye xingchen) - Fix some kernel-doc issues in the ACPI GSI processing code (Xiongfeng Wang) - Fix name memory leak in pnp_alloc_dev() (Yang Yingliang) - Do not disable PNP devices on suspend when they cannot be re-enabled on resume (Hans de Goede) - Clean up the ACPI thermal driver a bit (Rafael Wysocki)" * tag 'acpi-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (67 commits) ACPI: x86: Add skip i2c clients quirk for Medion Lifetab S10346 ACPI: APEI: EINJ: Refactor available_error_type_show() ACPI: APEI: EINJ: Fix formatting errors ACPI: processor: perflib: Adjust acpi_processor_notify_smm() return value ACPI: processor: perflib: Rearrange acpi_processor_notify_smm() ACPI: processor: perflib: Rearrange unregistration routine ACPI: processor: perflib: Drop redundant parentheses ACPI: processor: perflib: Adjust white space ACPI: processor: idle: Drop unnecessary statements and parens ACPI: thermal: Adjust critical.flags.valid check ACPI: fan: Convert to use sysfs_emit_at() API ACPICA: Fix use-after-free in acpi_ut_copy_ipackage_to_ipackage() ACPI: battery: Call power_supply_changed() when adding hooks ACPI: use sysfs_emit() instead of scnprintf() ACPI: x86: Add skip i2c clients quirk for Lenovo Yoga Tab 3 Pro (YT3-X90F) ACPI: APEI: Remove a useless include PNP: Do not disable devices on suspend when they cannot be re-enabled on resume ACPI: processor: Silence missing prototype warnings ACPI: processor_idle: Silence missing prototype warnings ACPI: PM: Silence missing prototype warning ...
2022-12-12Merge branches 'acpi-scan', 'acpi-bus', 'acpi-tables' and 'acpi-sysfs'Rafael J. Wysocki1-2/+1
Merge ACPI changes related to device enumeration, device object managenet, operation region handling, table parsing and sysfs interface: - Use ZERO_PAGE(0) instead of empty_zero_page in the ACPI device enumeration code (Giulio Benetti). - Change the return type of the ACPI driver remove callback to void and update its users accordingly (Dawei Li). - Add general support for FFH address space type and implement the low- level part of it for ARM64 (Sudeep Holla). - Fix stale comments in the ACPI tables parsing code and make it print more messages related to MADT (Hanjun Guo, Huacai Chen). - Replace invocations of generic library functions with more kernel- specific counterparts in the ACPI sysfs interface (Christophe JAILLET, Xu Panda). * acpi-scan: ACPI: scan: substitute empty_zero_page with helper ZERO_PAGE(0) * acpi-bus: ACPI: FFH: Silence missing prototype warnings ACPI: make remove callback of ACPI driver void ACPI: bus: Fix the _OSC capability check for FFH OpRegion arm64: Add architecture specific ACPI FFH Opregion callbacks ACPI: Implement a generic FFH Opregion handler * acpi-tables: ACPI: tables: Fix the stale comments for acpi_locate_initial_tables() ACPI: tables: Print CORE_PIC information when MADT is parsed * acpi-sysfs: ACPI: sysfs: use sysfs_emit() to instead of scnprintf() ACPI: sysfs: Use kstrtobool() instead of strtobool()
2022-12-05xen/privcmd: Fix a possible warning in privcmd_ioctl_mmap_resource()Harshit Mogalapalli1-1/+1
As 'kdata.num' is user-controlled data, if user tries to allocate memory larger than(>=) MAX_ORDER, then kcalloc() will fail, it creates a stack trace and messes up dmesg with a warning. Call trace: -> privcmd_ioctl --> privcmd_ioctl_mmap_resource Add __GFP_NOWARN in order to avoid too large allocation warning. This is detected by static analysis using smatch. Fixes: 3ad0876554ca ("xen/privcmd: add IOCTL_PRIVCMD_MMAP_RESOURCE") Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20221126050745.778967-1-harshit.m.mogalapalli@oracle.com Signed-off-by: Juergen Gross <jgross@suse.com>
2022-12-05xen/virtio: Handle PCI devices which Host controller is described in DTOleksandr Tyshchenko1-7/+39
Use the same "xen-grant-dma" device concept for the PCI devices behind device-tree based PCI Host controller, but with one modification. Unlike for platform devices, we cannot use generic IOMMU bindings (iommus property), as we need to support more flexible configuration. The problem is that PCI devices under the single PCI Host controller may have the backends running in different Xen domains and thus have different endpoints ID (backend domains ID). Add ability to deal with generic PCI-IOMMU bindings (iommu-map/ iommu-map-mask properties) which allows us to describe relationship between PCI devices and backend domains ID properly. To avoid having to look up for the PCI Host bridge twice and reduce the amount of checks pass an extra struct device_node *np to xen_dt_grant_init_backend_domid(). So with current patch the code expects iommus property for the platform devices and iommu-map/iommu-map-mask properties for PCI devices. The example of generated by the toolstack iommu-map property for two PCI devices 0000:00:01.0 and 0000:00:02.0 whose backends are running in different Xen domains with IDs 1 and 2 respectively: iommu-map = <0x08 0xfde9 0x01 0x08 0x10 0xfde9 0x02 0x08>; Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Xenia Ragiadakou <burzalodowa@gmail.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/20221025162004.8501-3-olekstysh@gmail.com Signed-off-by: Juergen Gross <jgross@suse.com>