| Age | Commit message | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Replace busylock at the Tx queuing layer with a lockless list.
This yields a 300% (4x) improvement on heavy Tx workloads, sending
twice the number of packets per second for half the CPU cycles.
- Allow constantly busy flows to migrate to a more suitable CPU/NIC
queue.
Normally we perform queue re-selection when a flow comes out of idle,
but under extreme circumstances the flows may be constantly busy.
Add a sysctl to allow periodic rehashing even if it'd risk packet
reordering.
- Optimize the NAPI skb cache, make it larger, use it in more paths.
- Attempt returning Tx skbs to the originating CPU (like we already
did for Rx skbs).
- Various data structure layout and prefetch optimizations from Eric.
- Remove ktime_get() from the recvmsg() fast path, ktime_get() is
sadly quite expensive on recent AMD machines.
- Extend threaded NAPI polling to allow the kthread to busy poll for
packets.
- Make MPTCP use Rx backlog processing. This lowers the lock
pressure, improving the Rx performance.
- Support memcg accounting of MPTCP socket memory.
- Allow admin to opt sockets out of global protocol memory accounting
(using a sysctl or BPF-based policy). The global limits are a poor
fit for modern container workloads, where limits are imposed using
cgroups.
- Improve heuristics for when to kick off AF_UNIX garbage collection.
- Allow users to control TCP SACK compression, and default to 33% of
RTT.
- Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
unnecessarily aggressive rcvbuf growth and overshoot when the
connection RTT is low.
- Preserve skb metadata space across skb_push / skb_pull operations.
- Support for IPIP encapsulation in the nftables flowtable offload.
- Support appending IP interface information to ICMP messages (RFC
5837).
- Support setting max record size in TLS (RFC 8449).
- Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.
- Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.
- Let users configure the number of write buffers in SMC.
- Add new struct sockaddr_unsized for sockaddr of unknown length,
from Kees.
- Some conversions away from the crypto_ahash API, from Eric Biggers.
- Some preparations for slimming down struct page.
- YAML Netlink protocol spec for WireGuard.
- Add a tool on top of YAML Netlink specs/lib for reporting commonly
computed derived statistics and summarized system state.
Driver API:
- Add CAN XL support to the CAN Netlink interface.
- Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
defined by the OPEN Alliance's "Advanced diagnostic features for
100BASE-T1 automotive Ethernet PHYs" specification.
- Add DPLL phase-adjust-gran pin attribute (and implement it in
zl3073x).
- Refactor xfrm_input lock to reduce contention when NIC offloads
IPsec and performs RSS.
- Add info to devlink params whether the current setting is the
default or a user override. Allow resetting back to default.
- Add standard device stats for PSP crypto offload.
- Leverage DSA frame broadcast to implement simple HSR frame
duplication for a lot of switches without dedicated HSR offload.
- Add uAPI defines for 1.6Tbps link modes.
Device drivers:
- Add Motorcomm YT921x gigabit Ethernet switch support.
- Add MUCSE driver for N500/N210 1GbE NIC series.
- Convert drivers to support dedicated ops for timestamping control,
and away from the direct IOCTL handling. While at it support GET
operations for PHY timestamping.
- Add (and convert most drivers to) a dedicated ethtool callback for
reading the Rx ring count.
- Significant refactoring efforts in the STMMAC driver, which
supports Synopsys turn-key MAC IP integrated into a ton of SoCs.
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support PPS in/out on all pins
- Intel (100G, ice, idpf):
- ice: implement standard ethtool and timestamping stats
- i40e: support setting the max number of MAC addresses per VF
- iavf: support RSS of GTP tunnels for 5G and LTE deployments
- nVidia/Mellanox (mlx5):
- reduce downtime on interface reconfiguration
- disable being an XDP redirect target by default (same as
other drivers) to avoid wasting resources if feature is
unused
- Meta (fbnic):
- add support for Linux-managed PCS on 25G, 50G, and 100G links
- Wangxun:
- support Rx descriptor merge, and Tx head writeback
- support Rx coalescing offload
- support 25G SFP and 40G QSFP modules
- Ethernet virtual:
- Google (gve):
- allow ethtool to configure rx_buf_len
- implement XDP HW RX Timestamping support for DQ descriptor
format
- Microsoft vNIC (mana):
- support HW link state events
- handle hardware recovery events when probing the device
- Ethernet NICs consumer, and embedded:
- usbnet: add support for Byte Queue Limits (BQL)
- AMD (amd-xgbe):
- add device selftests
- NXP (enetc):
- add i.MX94 support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- bcmasp: add support for PHY-based Wake-on-LAN
- Broadcom switches (b53):
- support port isolation
- support BCM5389/97/98 and BCM63XX ARL formats
- Lantiq/MaxLinear switches:
- support bridge FDB entries on the CPU port
- use regmap for register access
- allow user to enable/disable learning
- support Energy Efficient Ethernet
- support configuring RMII clock delays
- add tagging driver for MaxLinear GSW1xx switches
- Synopsys (stmmac):
- support using the HW clock in free running mode
- add Eswin EIC7700 support
- add Rockchip RK3506 support
- add Altera Agilex5 support
- Cadence (macb):
- cleanup and consolidate descriptor and DMA address handling
- add EyeQ5 support
- TI:
- icssg-prueth: support AF_XDP
- Airoha access points:
- add missing Ethernet stats and link state callback
- add AN7583 support
- support out-of-order Tx completion processing
- Power over Ethernet:
- pd692x0: preserve PSE configuration across reboots
- add support for TPS23881B devices
- Ethernet PHYs:
- Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
- Support 50G SerDes and 100G interfaces in Linux-managed PHYs
- micrel:
- support for non PTP SKUs of lan8814
- enable in-band auto-negotiation on lan8814
- realtek:
- cable testing support on RTL8224
- interrupt support on RTL8221B
- motorcomm: support for PHY LEDs on YT853
- microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
- mscc: support for PHY LED control
- CAN drivers:
- m_can: add support for optional reset and system wake up
- remove can_change_mtu() obsoleted by core handling
- mcp251xfd: support GPIO controller functionality
- Bluetooth:
- add initial support for PASTa
- WiFi:
- split ieee80211.h file, it's way too big
- improvements in VHT radiotap reporting, S1G, Channel Switch
Announcement handling, rate tracking in mesh networks
- improve multi-radio monitor mode support, and add a cfg80211
debugfs interface for it
- HT action frame handling on 6 GHz
- initial chanctx work towards NAN
- MU-MIMO sniffer improvements
- WiFi drivers:
- RealTek (rtw89):
- support USB devices RTL8852AU and RTL8852CU
- initial work for RTL8922DE
- improved injection support
- Intel:
- iwlwifi: new sniffer API support
- MediaTek (mt76):
- WED support for >32-bit DMA
- airoha NPU support
- regdomain improvements
- continued WiFi7/MLO work
- Qualcomm/Atheros:
- ath10k: factory test support
- ath11k: TX power insertion support
- ath12k: BSS color change support
- ath12k: statistics improvements
- brcmfmac: Acer A1 840 tablet quirk
- rtl8xxxu: 40 MHz connection fixes/support"
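For illustration, the general shape of a lockless list like the one mentioned in the first item can be sketched in user-space C11. This is not the kernel's Tx queuing code: struct node, list_push(), list_take_all() and list_reverse() are hypothetical names, and the sketch assumes any number of producers but a single consumer that detaches the whole list at once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type; a real queue would carry a packet here. */
struct node {
    struct node *next;
    int value;
};

struct lockless_list {
    _Atomic(struct node *) head;
};

/* Push one node. Returns 1 if the list was empty before the push. */
static int list_push(struct lockless_list *l, struct node *n)
{
    struct node *first = atomic_load_explicit(&l->head, memory_order_relaxed);

    do {
        /* On CAS failure 'first' is reloaded, so relink and retry. */
        n->next = first;
    } while (!atomic_compare_exchange_weak_explicit(&l->head, &first, n,
                                                    memory_order_release,
                                                    memory_order_relaxed));
    return first == NULL;
}

/* Detach the whole list with a single atomic exchange. */
static struct node *list_take_all(struct lockless_list *l)
{
    return atomic_exchange_explicit(&l->head, NULL, memory_order_acquire);
}

/* Pushes produce LIFO order; reverse to process in arrival order. */
static struct node *list_reverse(struct node *n)
{
    struct node *prev = NULL;

    while (n) {
        struct node *next = n->next;

        n->next = prev;
        prev = n;
        n = next;
    }
    return prev;
}
```

Producers never wait on a lock here; contention shows up only as a retried compare-and-exchange.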
* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
net: page_pool: sanitise allocation order
net: page pool: xa init with destroy on pp init
net/mlx5e: Support XDP target xmit with dummy program
net/mlx5e: Update XDP features in switch channels
selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
wireguard: netlink: generate netlink code
wireguard: uapi: generate header with ynl-gen
wireguard: uapi: move flag enums
wireguard: uapi: move enum wg_cmd
wireguard: netlink: add YNL specification
selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
selftests: drv-net: introduce Iperf3Runner for measurement use cases
selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
Documentation: net: dsa: mention simple HSR offload helpers
Documentation: net: dsa: mention availability of RedBox
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild updates from Nicolas Schier:
- Enable -fms-extensions, allowing anonymous use of tagged struct or
union in struct/union (tag kbuild-ms-extensions-6.19). An exemplary
conversion patch is added here, too (btrfs).
[ Editor's note: the core of this actually came in early through a
shared branch and a few other trees - Linus ]
- Introduce architecture-specific CC_CAN_LINK and flags for userprogs
- Add new packaging target 'modules-cpio-pkg' for building an initramfs
cpio w/ kmods
- Handle included .c files in gen_compile_commands
- Minor kbuild changes:
- Use objtree for module signing key path, fixing oot kmod signing
- Improve documentation of KBUILD_BUILD_TIMESTAMP
- Reuse KBUILD_USERCFLAGS for UAPI, instead of defining twice
- Rename scripts/Makefile.extrawarn to Makefile.warn
- Drop obsolete types.h check from headers_check.pl
- Remove outdated config leak ignore entries
* tag 'kbuild-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
kbuild: add target to build a cpio containing modules
initramfs: add gen_init_cpio to hostprogs unconditionally
kbuild: allow architectures to override CC_CAN_LINK
init: deduplicate cc-can-link.sh invocations
kbuild: don't enable CC_CAN_LINK if the dummy program generates warnings
scripts: headers_install.sh: Remove two outdated config leak ignore entries
scripts/clang-tools: Handle included .c files in gen_compile_commands
kbuild: uapi: Drop types.h check from headers_check.pl
kbuild: Rename Makefile.extrawarn to Makefile.warn
MAINTAINERS, .mailmap: Update mail address for Nicolas Schier
kbuild: uapi: reuse KBUILD_USERCFLAGS
kbuild: doc: improve KBUILD_BUILD_TIMESTAMP documentation
kbuild: Use objtree for module signing key path
btrfs: send: make use of -fms-extensions for defining struct fs_path
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Allow creating nbcon console drivers with an unsafe write_atomic()
callback that can only be called by the final nbcon_atomic_flush_unsafe().
Otherwise, the driver would rely on the kthread.
It is going to be used as a best-effort approach for an
experimental nbcon netconsole driver, see
https://lore.kernel.org/r/20251121-nbcon-v1-2-503d17b2b4af@debian.org
Note that a safe .write_atomic() callback is supposed to work in NMI
context. But some networking drivers are not safe even in IRQ
context:
https://lore.kernel.org/r/oc46gdpmmlly5o44obvmoatfqo5bhpgv7pabpvb6sjuqioymcg@gjsma3ghoz35
In an ideal world, all networking drivers would be fixed first and
the atomic flush would be blocked only in NMI context. But that raises
the question of how reliable networking drivers are when the system is
in a bad state. They might block flushing more reliable serial
consoles which are more suitable for serious debugging anyway.
- Allow using the last 4 bytes of the printk ring buffer.
- Prevent queuing IRQ work and block printk kthreads when consoles are
suspended. Otherwise, they create unnecessary churn or even block
the suspend.
- Release console_lock() between each record in the kthread used for
legacy consoles on RT. It might significantly speed up the boot.
- Release nbcon context between each record in the atomic flush. It
prevents stalls of the related printk kthread after it has lost the
ownership in the middle of a record.
- Add support for NBCON consoles into KDB
- Add %ptSp modifier for printing struct timespec64 and use it where
possible
- Misc code clean up
* tag 'printk-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (48 commits)
printk: Use console_is_usable on console_unblank
arch: um: kmsg_dump: Use console_is_usable
drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
lib/vsprintf: Unify FORMAT_STATE_NUM handlers
printk: Avoid irq_work for printk_deferred() on suspend
printk: Avoid scheduling irq_work on suspend
printk: Allow printk_trigger_flush() to flush all types
tracing: Switch to use %ptSp
scsi: snic: Switch to use %ptSp
scsi: fnic: Switch to use %ptSp
s390/dasd: Switch to use %ptSp
ptp: ocp: Switch to use %ptSp
pps: Switch to use %ptSp
PCI: epf-test: Switch to use %ptSp
net: dsa: sja1105: Switch to use %ptSp
mmc: mmc_test: Switch to use %ptSp
media: av7110: Switch to use %ptSp
ipmi: Switch to use %ptSp
igb: Switch to use %ptSp
e1000e: Switch to use %ptSp
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Add support for AMD's Smart Data Cache Injection feature which allows
for direct insertion of data from I/O devices into the L3 cache, thus
bypassing DRAM and saving its bandwidth; the resctrl side of the
feature allows the size of the L3 used for data injection to be
controlled
- Add Intel Clearwater Forest to the list of CPUs which support
Sub-NUMA clustering
- Other fixes and cleanups
* tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/resctrl: Update bit_usage to reflect io_alloc
fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks
fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID
fs/resctrl: Introduce interface to display io_alloc CBMs
fs/resctrl: Add user interface to enable/disable io_alloc feature
fs/resctrl: Introduce interface to display "io_alloc" support
x86,fs/resctrl: Implement "io_alloc" enable/disable handlers
x86,fs/resctrl: Detect io_alloc feature
x86/resctrl: Add SDCIAE feature in the command line options
x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement
fs/resctrl: Consider sparse masks when initializing new group's allocation
x86/resctrl: Support Sub-NUMA Cluster (SNC) mode on Clearwater Forest
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq updates from Thomas Gleixner:
"A large overhaul of the restartable sequences and CID management:
The recent enablement of RSEQ in glibc resulted in regressions which
are caused by the related overhead. It turned out that the decision to
invoke the exit to user work was not really a decision. More or less
each context switch caused that. There is a long list of small issues
which add up and result in a 3-4% regression in I/O
benchmarks.
The other detail which caused issues due to extra work in context
switch and task migration is the CID (memory context ID) management.
It also requires using task work to consolidate the CID space,
which is executed in the context of an arbitrary task and results in
sporadic uncontrolled exit latencies.
The rewrite addresses this by:
- Removing deprecated and long unsupported functionality
- Moving the related data into dedicated data structures which are
optimized for fast path processing.
- Caching values so actual decisions can be made
- Replacing the current implementation with an optimized inlined
variant.
- Separating fast and slow path for architectures which use the
generic entry code, so that only fault and error handling goes into
the TIF_NOTIFY_RESUME handler.
- Rewriting the CID management so that it becomes mostly invisible in
the context switch path. That moves the work of switching modes
into the fork/exit path, which is a reasonable tradeoff. That work
is only required when a process creates more threads than the
cpuset it is allowed to run on or when enough threads exit after
that. An artificial thread pool benchmark which triggers this did
not degrade; it actually improved significantly.
The main effect in migration heavy scenarios is that runqueue lock
held time and therefore contention goes down significantly"
* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
sched/mmcid: Switch over to the new mechanism
sched/mmcid: Implement deferred mode change
irqwork: Move data struct to a types header
sched/mmcid: Provide CID ownership mode fixup functions
sched/mmcid: Provide new scheduler CID mechanism
sched/mmcid: Introduce per task/CPU ownership infrastructure
sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
sched/mmcid: Provide precomputed maximal value
sched/mmcid: Move initialization out of line
signal: Move MMCID exit out of sighand lock
sched/mmcid: Convert mm CID mask to a bitmap
cpumask: Cache num_possible_cpus()
sched/mmcid: Use cpumask_weighted_or()
cpumask: Introduce cpumask_weighted_or()
sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
sched/mmcid: Move scheduler code out of global header
sched: Fixup whitespace damage
sched/mmcid: Cacheline align MM CID storage
sched/mmcid: Use proper data structures
sched/mmcid: Revert the complex CID management
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scoped user access updates from Thomas Gleixner:
"Scoped user mode access and related changes:
- Implement the missing u64 user access function on ARM when
CONFIG_CPU_SPECTRE=n.
This makes it possible to access a 64bit value in generic code with
[unsafe_]get_user(). All other architectures and ARM variants
provide the relevant accessors already.
- Ensure that ASM GOTO jump label usage in the user mode access
helpers always goes through a local C scope label indirection
inside the helpers.
This is required because compilers do not support an ASM GOTO
target leaving an auto cleanup scope. GCC silently fails to emit
the cleanup invocation and CLANG fails the build.
[ Editor's note: gcc-16 will have fixed the code generation issue
in commit f68fe3ddda4 ("eh: Invoke cleanups/destructors in asm
goto jumps [PR122835]"). But we obviously have to deal with clang
and older versions of gcc, so.. - Linus ]
This provides generic wrapper macros and the conversion of affected
architecture code to use them.
- Scoped user mode access with auto cleanup
Access to user mode memory can be required in hot code paths, but
if it has to be done with user controlled pointers, the access is
shielded with a speculation barrier, so that the CPU cannot
speculate around the address range check. Those speculation
barriers impact performance quite significantly.
This cost can be avoided by "masking" the provided pointer so it is
guaranteed to be in the valid user memory access range and
otherwise to point to a guaranteed unpopulated address space. This
has to be done without branches so it creates an address dependency
for the access, which the CPU cannot speculate ahead.
This results in repeating and error prone programming patterns:
    if (can_do_masked_user_access())
            from = masked_user_read_access_begin(from);
    else if (!user_read_access_begin(from, sizeof(*from)))
            return -EFAULT;
    unsafe_get_user(val, from, Efault);
    user_read_access_end();
    return 0;
  Efault:
    user_read_access_end();
    return -EFAULT;
which can be replaced with scopes and automatic cleanup:
    scoped_user_read_access(from, Efault)
            unsafe_get_user(val, from, Efault);
    return 0;
  Efault:
    return -EFAULT;
- Convert code which implements the above pattern over to
scoped_user_*_access(). This also corrects a couple of imbalanced
masked_*_begin() instances which are harmless on most
architectures, but prevent PowerPC from implementing the masking
optimization.
- Add a missing speculation barrier in copy_from_user_iter()"
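The pointer "masking" idea described above can be sketched in plain user-space C. This is only an illustration under stated assumptions: HYPOTHETICAL_USER_MAX and mask_addr() are made-up names, the all-ones address merely stands in for "guaranteed unpopulated address space", and the real per-architecture helpers differ. A compiler will typically lower the comparison to a flag-setting instruction rather than a branch, but C itself does not guarantee that.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed upper bound of the valid user range (illustrative value only). */
#define HYPOTHETICAL_USER_MAX ((uint64_t)0x00007fffffffffffULL)

/*
 * Return addr unchanged when it is inside the valid range, otherwise
 * return an all-ones address that is assumed never to be mapped.
 * The result depends on addr through arithmetic only, so a later
 * access carries an address dependency instead of following a
 * predicted branch.
 */
static uint64_t mask_addr(uint64_t addr)
{
    /* 0 when in range, 0xffff...ffff when out of range */
    uint64_t out_of_range =
        (uint64_t)0 - (uint64_t)(addr > HYPOTHETICAL_USER_MAX);

    return addr | out_of_range;
}
```

In-range addresses pass through untouched, so the common case pays for a compare and an OR rather than a speculation barrier.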
* tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lib/strn*,uaccess: Use masked_user_{read/write}_access_begin when required
scm: Convert put_cmsg() to scoped user access
iov_iter: Add missing speculation barrier to copy_from_user_iter()
iov_iter: Convert copy_from_user_iter() to masked user access
select: Convert to scoped user access
x86/futex: Convert to scoped user access
futex: Convert to get/put_user_inline()
uaccess: Provide put/get_user_inline()
uaccess: Provide scoped user access regions
arm64: uaccess: Use unsafe wrappers for ASM GOTO
s390/uaccess: Use unsafe wrappers for ASM GOTO
riscv/uaccess: Use unsafe wrappers for ASM GOTO
powerpc/uaccess: Use unsafe wrappers for ASM GOTO
x86/uaccess: Use unsafe wrappers for ASM GOTO
uaccess: Provide ASM GOTO safe wrappers for unsafe_*_user()
ARM: uaccess: Implement missing __get_user_asm_dword()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Mutexes:
- Redo __mutex_init() to reduce generated code size (Sebastian
Andrzej Siewior)
Seqlocks:
- Introduce scoped_seqlock_read() (Peter Zijlstra)
- Change thread_group_cputime() to use scoped_seqlock_read() (Oleg
Nesterov)
- Change do_task_stat() to use scoped_seqlock_read() (Oleg Nesterov)
- Change do_io_accounting() to use scoped_seqlock_read() (Oleg
Nesterov)
- Fix the incorrect documentation of read_seqbegin_or_lock() /
need_seqretry() (Oleg Nesterov)
- Allow KASAN to fail optimizing (Peter Zijlstra)
Local lock updates:
- Fix all kernel-doc warnings (Randy Dunlap)
- Add the <linux/local_lock*.h> headers to MAINTAINERS (Sebastian
Andrzej Siewior)
- Reduce the risk of shadowing via s/l/__l/ and s/tl/__tl/ (Vincent
Mailhol)
Lock debugging:
- spinlock/debug: Fix data-race in do_raw_write_lock (Alexander
Sverdlin)
Atomic primitives infrastructure:
- atomic: Skip alignment check for try_cmpxchg() old arg (Arnd
Bergmann)
Rust runtime integration:
- sync: atomic: Enable generated Atomic<T> usage (Boqun Feng)
- sync: atomic: Implement Debug for Atomic<Debug> (Boqun Feng)
- debugfs: Remove Rust native atomics and replace them with Linux
versions (Boqun Feng)
- debugfs: Implement Reader for Mutex<T> only when T is Unpin (Boqun
Feng)
- lock: guard: Add T: Unpin bound to DerefMut (Daniel Almeida)
- lock: Pin the inner data (Daniel Almeida)
- lock: Add a Pin<&mut T> accessor (Daniel Almeida)"
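As background for the seqlock items, the read-side retry loop that a scoped helper is meant to hide can be sketched in user-space C11. This is a generic sequence-counter sketch, not the kernel's seqlock or scoped_seqlock_read(); struct seq_pair, pair_write() and pair_read_sum() are hypothetical names, and a single writer is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct seq_pair {
    atomic_uint seq;        /* even: stable, odd: write in progress */
    _Atomic uint64_t a, b;  /* relaxed atomics keep the sketch race-free */
};

/* Single writer assumed: bump to odd, update the data, bump to even. */
static void pair_write(struct seq_pair *p, uint64_t a, uint64_t b)
{
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);

    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->a, a, memory_order_relaxed);
    atomic_store_explicit(&p->b, b, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader: retry until both fields were read under one even sequence. */
static uint64_t pair_read_sum(struct seq_pair *p)
{
    unsigned s0, s1;
    uint64_t a, b;

    do {
        s0 = atomic_load_explicit(&p->seq, memory_order_acquire);
        a = atomic_load_explicit(&p->a, memory_order_relaxed);
        b = atomic_load_explicit(&p->b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s1 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);

    return a + b;
}
```

Every reader has to repeat the begin/retry bookkeeping by hand in this form, which is the kind of boilerplate a scoped construct can fold into one statement.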
* tag 'locking-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/local_lock: Fix all kernel-doc warnings
locking/local_lock: s/l/__l/ and s/tl/__tl/ to reduce the risk of shadowing
locking/local_lock: Add the <linux/local_lock*.h> headers to MAINTAINERS
locking/mutex: Redo __mutex_init() to reduce generated code size
rust: debugfs: Replace the usage of Rust native atomics
rust: sync: atomic: Implement Debug for Atomic<Debug>
rust: sync: atomic: Make Atomic*Ops pub(crate)
seqlock: Allow KASAN to fail optimizing
rust: debugfs: Implement Reader for Mutex<T> only when T is Unpin
seqlock: Change do_io_accounting() to use scoped_seqlock_read()
seqlock: Change do_task_stat() to use scoped_seqlock_read()
seqlock: Change thread_group_cputime() to use scoped_seqlock_read()
seqlock: Introduce scoped_seqlock_read()
documentation: seqlock: fix the wrong documentation of read_seqbegin_or_lock/need_seqretry
atomic: Skip alignment check for try_cmpxchg() old arg
rust: lock: Add a Pin<&mut T> accessor
rust: lock: Pin the inner data
rust: lock: guard: Add T: Unpin bound to DerefMut
locking/spinlock/debug: Fix data-race in do_raw_write_lock
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fd prepare updates from Christian Brauner:
"This adds the FD_ADD() and FD_PREPARE() primitives. They simplify the
common pattern of get_unused_fd_flags() + create file + fd_install()
that is used extensively throughout the kernel and currently requires
cumbersome cleanup paths.
FD_ADD() - For simple cases where a file is installed immediately:
    fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
    if (fd < 0)
            vfio_device_put_registration(device);
    return fd;
FD_PREPARE() - For cases requiring access to the fd or file, or
additional work before publishing:
    FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
    if (fdf.err) {
            fput(sync_file->file);
            return fdf.err;
    }
    data.fence = fd_prepare_fd(fdf);
    if (copy_to_user((void __user *)arg, &data, sizeof(data)))
            return -EFAULT;
    return fd_publish(fdf);
The primitives are centered around struct fd_prepare. FD_PREPARE()
encapsulates all allocation and cleanup logic and must be followed by
a call to fd_publish() which associates the fd with the file and
installs it into the caller's fdtable. If fd_publish() isn't called,
both are deallocated automatically. FD_ADD() is a shorthand that does
fd_publish() immediately and never exposes the struct to the caller.
I've implemented this in a way that it's compatible with the cleanup
infrastructure while also being usable separately. IOW, it's centered
around struct fd_prepare which is aliased to class_fd_prepare_t and so
we can make use of all the basic guard infrastructure"
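The "reserve, do more work, then publish or automatically undo" shape described above can be imitated in user space with the GCC/Clang cleanup attribute. This is a sketch of the pattern only, not the kernel implementation: the slot table stands in for an fd table, and slot_reserve(), slot_publish(), slot_prepare_cleanup() and open_thing() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 4
static bool slot_used[NSLOTS];        /* hypothetical stand-in for an fd table */
static const char *slot_obj[NSLOTS];

struct slot_prepare {
    int slot;   /* reserved index, or a negative value when nothing is held */
};

/* Reserve the lowest free slot, or report -EMFILE. */
static struct slot_prepare slot_reserve(void)
{
    struct slot_prepare sp = { .slot = -EMFILE };

    for (int i = 0; i < NSLOTS; i++) {
        if (!slot_used[i]) {
            slot_used[i] = true;
            sp.slot = i;
            break;
        }
    }
    return sp;
}

/* Runs at scope exit: release the reservation unless it was published. */
static void slot_prepare_cleanup(struct slot_prepare *sp)
{
    if (sp->slot >= 0)
        slot_used[sp->slot] = false;
}

/* Attach the object and disarm the cleanup. */
static int slot_publish(struct slot_prepare *sp, const char *obj)
{
    int slot = sp->slot;

    slot_obj[slot] = obj;
    sp->slot = -1;
    return slot;
}

static int open_thing(const char *obj, bool fail_midway)
{
    struct slot_prepare sp
        __attribute__((cleanup(slot_prepare_cleanup))) = slot_reserve();

    if (sp.slot < 0)
        return sp.slot;
    if (fail_midway)
        return -EFAULT;     /* no explicit unwind: cleanup frees the slot */
    return slot_publish(&sp, obj);
}
```

The error path in open_thing() has no manual release call, which is the property the commit message highlights for FD_PREPARE() users.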
* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
io_uring: convert io_create_mock_file() to FD_PREPARE()
file: convert replace_fd() to FD_PREPARE()
vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
tty: convert ptm_open_peer() to FD_ADD()
ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
media: convert media_request_alloc() to FD_PREPARE()
hv: convert mshv_ioctl_create_partition() to FD_ADD()
gpio: convert linehandle_create() to FD_PREPARE()
pseries: port papr_rtas_setup_file_interface() to FD_ADD()
pseries: convert papr_platform_dump_create_handle() to FD_ADD()
spufs: convert spufs_gang_open() to FD_PREPARE()
papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
spufs: convert spufs_context_open() to FD_PREPARE()
net/socket: convert __sys_accept4_file() to FD_ADD()
net/socket: convert sock_map_fd() to FD_ADD()
net/kcm: convert kcm_ioctl() to FD_PREPARE()
net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
secretmem: convert memfd_secret() to FD_ADD()
memfd: convert memfd_create() to FD_ADD()
bpf: convert bpf_token_create() to FD_PREPARE()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull autofs update from Christian Brauner:
"Prevent futile mount triggers in private mount namespaces.
Fix a problematic loop in autofs when a mount namespace contains
autofs mounts that are propagation private and there is no
namespace-specific automount daemon to handle possible automounting.
Previously, attempted path resolution would loop until MAXSYMLINKS was
reached before failing, causing significant noise in the log.
The fix adds a check in autofs ->d_automount() so that the VFS can
immediately return EPERM in this case. Since the mount is propagation
private, EPERM is the most appropriate error code"
* tag 'vfs-6.19-rc1.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
autofs: dont trigger mount if it cant succeed
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull overlayfs cred guard conversion from Christian Brauner:
"This converts all of overlayfs to use credential guards, eliminating
manual credential management throughout the filesystem.
Credential guard conversion:
- Convert all of overlayfs to use credential guards, replacing the
manual ovl_override_creds()/ovl_revert_creds() pattern with scoped
guards.
This makes credential handling visually explicit and eliminates a
class of potential bugs from mismatched override/revert calls.
(1) Basic credential guard (with_ovl_creds)
(2) Creator credential guard (ovl_override_creator_creds):
Introduced a specialized guard for file creation operations
that handles the two-phase credential override (mounter
credentials, then fs{g,u}id override). The new pattern is much
clearer:
    with_ovl_creds(dentry->d_sb) {
            scoped_class(prepare_creds_ovl, cred, dentry, inode, mode) {
                    if (IS_ERR(cred))
                            return PTR_ERR(cred);
                    /* creation operations */
            }
    }
(3) Copy-up credential guard (ovl_cu_creds):
Introduced a specialized guard for copy-up operations,
simplifying the previous struct ovl_cu_creds helper and
associated functions.
Ported ovl_copy_up_workdir() and ovl_copy_up_tmpfile() to this
pattern.
Cleanups:
- Remove ovl_revert_creds() after all callers converted to guards
- Remove struct ovl_cu_creds and associated functions
- Drop ovl_setup_cred_for_create() after conversion
- Refactor ovl_fill_super(), ovl_lookup(), ovl_iterate(),
ovl_rename() for cleaner credential guard scope
- Introduce struct ovl_renamedata to simplify rename handling
- Don't override credentials for ovl_check_whiteouts() (unnecessary)
- Remove unneeded semicolon"
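How a scoped guard removes mismatched override/revert calls can be shown with a small user-space sketch built on the GCC/Clang cleanup attribute. It is an assumption-laden illustration, not overlayfs code: with_creds(), cred_override(), cred_revert() and the integer "credential" are all hypothetical.

```c
#include <assert.h>

static int current_cred = 1000;   /* hypothetical "current credentials" id */
static int override_depth;        /* counts overrides not yet reverted */

struct cred_guard {
    int saved;
    int done;
};

static struct cred_guard cred_override(int new_cred)
{
    struct cred_guard g = { .saved = current_cred, .done = 0 };

    current_cred = new_cred;
    override_depth++;
    return g;
}

/* Called automatically when the guard variable goes out of scope. */
static void cred_revert(struct cred_guard *g)
{
    current_cred = g->saved;
    override_depth--;
}

/* One-iteration for loop whose variable reverts the override on exit. */
#define with_creds(new_cred)                                              \
    for (struct cred_guard guard_                                         \
             __attribute__((cleanup(cred_revert))) =                      \
             cred_override(new_cred);                                     \
         !guard_.done; guard_.done = 1)

static int do_lookup(int fail)
{
    with_creds(0) {
        if (fail)
            return -1;      /* early return still reverts */
        assert(current_cred == 0);
    }
    return current_cred;    /* back to the saved value here */
}
```

Both the normal and the early-return path leave override_depth at zero without any explicit revert call in do_lookup().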
* tag 'vfs-6.19-rc1.ovl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (54 commits)
ovl: remove unneeded semicolon
ovl: remove struct ovl_cu_creds and associated functions
ovl: port ovl_copy_up_tmpfile() to cred guard
ovl: mark *_cu_creds() as unused temporarily
ovl: port ovl_copy_up_workdir() to cred guard
ovl: add copy up credential guard
ovl: drop ovl_setup_cred_for_create()
ovl: port ovl_create_or_link() to new ovl_override_creator_creds cleanup guard
ovl: mark ovl_setup_cred_for_create() as unused temporarily
ovl: reflow ovl_create_or_link()
ovl: port ovl_create_tmpfile() to new ovl_override_creator_creds cleanup guard
ovl: add ovl_override_creator_creds cred guard
ovl: remove ovl_revert_creds()
ovl: port ovl_fill_super() to cred guard
ovl: refactor ovl_fill_super()
ovl: port ovl_lower_positive() to cred guard
ovl: port ovl_lookup() to cred guard
ovl: refactor ovl_lookup()
ovl: port ovl_copyfile() to cred guard
ovl: port ovl_rename() to cred guard
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull directory locking updates from Christian Brauner:
"This contains the work to add centralized APIs for directory locking
operations.
This series is part of a larger effort to change directory operation
locking to allow multiple concurrent operations in a directory. The
ultimate goal is to lock the target dentry(s) rather than the whole
parent directory.
To help with changing the locking protocol, this series centralizes
locking and lookup in new helper functions. The helpers establish a
pattern where it is the dentry that is being locked and unlocked
(currently the lock is held on dentry->d_parent->d_inode, but that can
change in the future).
This also changes vfs_mkdir() to unlock the parent on failure, as well
as dput()ing the dentry. This allows end_creating() to only require
the target dentry (which may be IS_ERR() after vfs_mkdir()), not the
parent"
* tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nfsd: fix end_creating() conversion
VFS: introduce end_creating_keep()
VFS: change vfs_mkdir() to unlock on failure.
ecryptfs: use new start_creating/start_removing APIs
Add start_renaming_two_dentries()
VFS/ovl/smb: introduce start_renaming_dentry()
VFS/nfsd/ovl: introduce start_renaming() and end_renaming()
VFS: add start_creating_killable() and start_removing_killable()
VFS: introduce start_removing_dentry()
smb/server: use end_removing_noperm for for target of smb2_create_link()
VFS: introduce start_creating_noperm() and start_removing_noperm()
VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()
VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating()
VFS: tidy up do_unlinkat()
VFS: introduce start_dirop() and end_dirop()
debugfs: rename end_creating() to debugfs_end_creating()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull directory delegations update from Christian Brauner:
"This contains the work for recall-only directory delegations for
knfsd.
Add support for simple, recallable-only directory delegations. This
was decided at the fall NFS Bakeathon where the NFS client and server
maintainers discussed how to merge directory delegation support.
The approach starts with recallable-only delegations for several reasons:
1. RFC8881 has gaps that are being addressed in RFC8881bis. In
particular, it requires directory position information for
CB_NOTIFY callbacks, which is difficult to implement properly
under Linux. The spec is being extended to allow that information
to be omitted.
2. Client-side support for CB_NOTIFY still lags. The client side
involves heuristics about when to request a delegation.
3. Early indication shows simple, recallable-only delegations can
help performance. Anna Schumaker mentioned seeing a multi-minute
speedup in xfstests runs with them enabled.
With these changes, userspace can also request a read lease on a
directory that will be recalled on conflicting accesses. This may be
useful for applications like Samba. Users can disable leases
altogether via the fs.leases-enable sysctl if needed.
VFS changes:
- Dedicated Type for Delegations
Introduce struct delegated_inode to track inodes that may have
delegations that need to be broken. This replaces the previous
approach of passing raw inode pointers through the delegation
breaking code paths, providing better type safety and clearer
semantics for the delegation machinery.
- Break parent directory delegations in open(..., O_CREAT) codepath
- Allow mkdir to wait for delegation break on parent
- Allow rmdir to wait for delegation break on parent
- Add try_break_deleg calls for parents to vfs_link(), vfs_rename(),
and vfs_unlink()
- Make vfs_create(), vfs_mknod(), and vfs_symlink() break delegations
on parent directory
- Clean up argument list for vfs_create()
- Expose delegation support to userland
Filelock changes:
- Make lease_alloc() take a flags argument
- Rework the __break_lease API to use flags
- Add struct delegated_inode
- Push the S_ISREG check down to ->setlease handlers
- Lift the ban on directory leases in generic_setlease
NFSD changes:
- Allow filecache to hold S_IFDIR files
- Allow DELEGRETURN on directories
- Wire up GET_DIR_DELEGATION handling
Fixes:
- Fix kernel-doc warnings in __fcntl_getlease
- Add needed headers for new struct delegation definition"
* tag 'vfs-6.19-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
vfs: add needed headers for new struct delegation definition
filelock: __fcntl_getlease: fix kernel-doc warnings
vfs: expose delegation support to userland
nfsd: wire up GET_DIR_DELEGATION handling
nfsd: allow DELEGRETURN on directories
nfsd: allow filecache to hold S_IFDIR files
filelock: lift the ban on directory leases in generic_setlease
vfs: make vfs_symlink break delegations on parent dir
vfs: make vfs_mknod break delegations on parent directory
vfs: make vfs_create break delegations on parent directory
vfs: clean up argument list for vfs_create()
vfs: break parent dir delegations in open(..., O_CREAT) codepath
vfs: allow rmdir to wait for delegation break on parent
vfs: allow mkdir to wait for delegation break on parent
vfs: add try_break_deleg calls for parents to vfs_{link,rename,unlink}
filelock: push the S_ISREG check down to ->setlease handlers
filelock: add struct delegated_inode
filelock: rework the __break_lease API to use flags
filelock: make lease_alloc() take a flags argument
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull minix fixes from Christian Brauner:
"Fix two syzbot corruption bugs in the minix filesystem.
Syzbot fuzzes filesystems by trying to mount and manipulate
deliberately corrupted images. This should not lead to BUG_ONs and
WARN_ONs for easy to detect corruptions.
- Add error handling to minix filesystem for inode corruption
detection, enabling the filesystem to report such corruptions
cleanly.
- Fix a drop_nlink warning in minix_rmdir() triggered by corrupted
directory link counts.
- Fix a drop_nlink warning in minix_rename() triggered by corrupted
inode link counts"
* tag 'vfs-6.19-rc1.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Fix a drop_nlink warning in minix_rename
Fix a drop_nlink warning in minix_rmdir
Add error handling to minix filesystem for inode corruption detection
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull superblock lock guard updates from Christian Brauner:
"This starts the work of introducing guards for superblock related
locks.
Introduce super_write_guard for scoped superblock write protection.
This provides a guard-based alternative to the manual sb_start_write()
and sb_end_write() pattern, allowing the compiler to automatically
handle the cleanup"
* tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: use super write guard in xfs_file_ioctl()
open: use super write guard in do_ftruncate()
btrfs: use super write guard in relocating_repair_kthread()
ext4: use super write guard in write_mmp_block()
btrfs: use super write guard in sb_start_write()
btrfs: use super write guard in btrfs_run_defrag_inode()
btrfs: use super write guard in btrfs_reclaim_bgs_work()
fs: add super_write_guard
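The guard pattern described in this pull can be illustrated in plain userspace C. The following is a minimal sketch, not the kernel's super_write_guard implementation: struct fake_sb, fake_sb_start_write(), fake_sb_end_write() and FAKE_SB_WRITE_GUARD() are hypothetical names invented for the example. It relies only on the GCC/Clang cleanup attribute, which makes the compiler run a handler when a variable leaves scope.

```c
#include <assert.h>

/* Hypothetical stand-in for a superblock; it only counts active writers. */
struct fake_sb {
	int writers;
};

static void fake_sb_start_write(struct fake_sb *sb) { sb->writers++; }
static void fake_sb_end_write(struct fake_sb *sb)   { sb->writers--; }

/* Cleanup handler: receives a pointer to the guarded variable. */
static void fake_sb_guard_exit(struct fake_sb **sbp)
{
	if (*sbp)
		fake_sb_end_write(*sbp);
}

/* Take write protection now; the compiler drops it at scope exit. */
#define FAKE_SB_WRITE_GUARD(name, sb)					\
	struct fake_sb *name						\
		__attribute__((cleanup(fake_sb_guard_exit))) =		\
		(fake_sb_start_write(sb), (sb))

/* Returns the writer count seen while the guard is held, or -1. */
static int do_guarded_work(struct fake_sb *sb, int fail_early)
{
	FAKE_SB_WRITE_GUARD(guard, sb);

	if (fail_early)
		return -1;		/* early return: cleanup still runs */
	return guard->writers;		/* read before the cleanup runs */
}
```

Calling do_guarded_work() on a zero-initialized struct fake_sb leaves writers at 0 again on both the normal and the early-return path, which is the property the manual start/end pairing has to maintain by hand.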
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fs header updates from Christian Brauner:
"This contains initial work to start splitting up fs.h.
Begin the long-overdue work of splitting up the monolithic fs.h
header. The header has grown to over 3000 lines and includes types and
functions for many different subsystems, making it difficult to
navigate and causing excessive compilation dependencies.
This series introduces new focused headers for superblock-related
code:
- Rename fs_types.h to fs_dirent.h to better reflect its actual
content (directory entry types)
- Add fs/super_types.h containing superblock type definitions
- Add fs/super.h containing superblock function declarations
This is the first step in a longer effort to modularize the VFS
headers.
Cleanups:
- Inode Field Layout Optimization (Mateusz Guzik)
Move inode fields used during fast path lookup closer together to
improve cache locality during path resolution.
- current_umask() Optimization (Mateusz Guzik)
Inline current_umask() and move it to fs_struct.h. This improves
performance by avoiding function call overhead for this
frequently-used function, and places it in a more appropriate
header since it operates on fs_struct"
* tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: move inode fields used during fast path lookup closer together
fs: inline current_umask() and move it to fs_struct.h
fs: add fs/super.h header
fs: add fs/super_types.h header
fs: rename fs_types.h to fs_dirent.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull cred guard updates from Christian Brauner:
"This contains substantial credential infrastructure improvements
adding guard-based credential management that simplifies code and
eliminates manual reference counting in many subsystems.
Features:
- Kernel Credential Guards
Add with_kernel_creds() and scoped_with_kernel_creds() guards that
allow using the kernel credentials without allocating and copying
them. This was requested by Linus after seeing repeated
prepare_kernel_creds() calls that duplicate the kernel credentials
only to drop them again later.
The new guards completely avoid the allocation and never expose the
temporary variable to hold the kernel credentials anywhere in
callers.
- Generic Credential Guards
Add scoped_with_creds() guards for the common override_creds() and
revert_creds() pattern. This builds on earlier work that made
override_creds()/revert_creds() completely reference count free.
- Prepare Credential Guards
Add prepare credential guards for the more complex pattern of
preparing a new set of credentials and overriding the current
credentials with them:
- prepare_creds()
- modify new creds
- override_creds()
- revert_creds()
- put_cred()
Cleanups:
- Make init_cred static since it should not be directly accessed
- Add kernel_cred() helper to properly access the kernel credentials
- Fix scoped_class() macro that was introduced two cycles ago
- coredump: split out do_coredump() from vfs_coredump() for cleaner
credential handling
- coredump: move revert_cred() before coredump_cleanup()
- coredump: mark struct mm_struct as const
- coredump: pass struct linux_binfmt as const
- sev-dev: use guard for path"
* tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
trace: use override credential guard
trace: use prepare credential guard
coredump: use override credential guard
coredump: use prepare credential guard
coredump: split out do_coredump() from vfs_coredump()
coredump: mark struct mm_struct as const
coredump: pass struct linux_binfmt as const
coredump: move revert_cred() before coredump_cleanup()
sev-dev: use override credential guards
sev-dev: use prepare credential guard
sev-dev: use guard for path
cred: add prepare credential guard
net/dns_resolver: use credential guards in dns_query()
cgroup: use credential guards in cgroup_attach_permissions()
act: use credential guards in acct_write_process()
smb: use credential guards in cifs_get_spnego_key()
nfs: use credential guards in nfs_idmap_get_key()
nfs: use credential guards in nfs_local_call_write()
nfs: use credential guards in nfs_local_call_read()
erofs: use credential guards
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull folio updates from Christian Brauner:
"Add a new folio_next_pos() helper function that returns the file
position of the first byte after the current folio. This is a common
operation in filesystems when needing to know the end of the current
folio.
The helper is lifted from btrfs which already had its own version, and
is now used across multiple filesystems and subsystems:
- btrfs
- buffer
- ext4
- f2fs
- gfs2
- iomap
- netfs
- xfs
- mm
This fixes a long-standing bug in ocfs2 on 32-bit systems with files
larger than 2GiB. Presumably this is not a common configuration, but
the fix is backported anyway. The other filesystems did not have bugs,
they were just mildly inefficient.
This also introduces uoff_t as the unsigned version of loff_t. A recent
commit inadvertently changed a comparison from being unsigned (on
64-bit systems) to being signed (which it had always been on 32-bit
systems), leading to sporadic fstests failures.
Generally file sizes are restricted to being a signed integer, but in
places where -1 is passed to indicate "up to the end of the file", it
is convenient to have an unsigned type to ensure comparisons are
always unsigned regardless of architecture"
* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Add uoff_t
mm: Use folio_next_pos()
xfs: Use folio_next_pos()
netfs: Use folio_next_pos()
iomap: Use folio_next_pos()
gfs2: Use folio_next_pos()
f2fs: Use folio_next_pos()
ext4: Use folio_next_pos()
buffer: Use folio_next_pos()
btrfs: Use folio_next_pos()
filemap: Add folio_next_pos()
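The signed versus unsigned comparison issue mentioned in this pull can be shown with a small userspace sketch. The typedefs sketch_loff_t and sketch_uoff_t are stand-ins chosen for this example and are assumed to be 64-bit; the helper names are hypothetical and do not come from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t  sketch_loff_t;	/* stand-in for the signed loff_t */
typedef uint64_t sketch_uoff_t;	/* stand-in for the unsigned uoff_t */

/* Signed comparison: an end of -1 sorts below every valid position. */
static bool before_end_signed(sketch_loff_t pos, sketch_loff_t end)
{
	return pos < end;
}

/* Unsigned comparison: -1 converts to the largest value, so it acts
 * as "up to the end of the file" for every valid position. */
static bool before_end_unsigned(sketch_loff_t pos, sketch_loff_t end)
{
	return (sketch_uoff_t)pos < (sketch_uoff_t)end;
}
```

With pos = 4096 and end = -1, the signed helper reports that the position is past the end while the unsigned helper reports that it is before the end. For ordinary non-negative ends both helpers agree.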
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfd and coredump updates from Christian Brauner:
"Features:
- Expose coredump signal via pidfd
Expose the signal that caused the coredump through the pidfd
interface. The recent changes to rework coredump handling to rely
on unix sockets are in the process of being used in systemd. The
previous systemd coredump container interface requires the coredump
file descriptor and basic information including the signal number
to be sent to the container. This means the signal number needs to
be available before sending the coredump to the container.
- Add supported_mask field to pidfd
Add a new supported_mask field to struct pidfd_info that indicates
which information fields are supported by the running kernel. This
allows userspace to detect feature availability without relying on
error codes or kernel version checks.
Cleanups:
- Drop struct pidfs_exit_info and prepare to drop exit_info pointer,
simplifying the internal publication mechanism for exit and
coredump information retrievable via the pidfd ioctl
- Use guard() for task_lock in pidfs
- Reduce wait_pidfd lock scope
- Add missing PIDFD_INFO_SIZE_VER1 constant
- Add missing BUILD_BUG_ON() assert on struct pidfd_info
Fixes:
- Fix PIDFD_INFO_COREDUMP handling
Selftests:
- Split out coredump socket tests and common helpers into separate
files for better organization
- Fix userspace coredump client detection issues
- Handle edge-triggered epoll correctly
- Ignore ENOSPC errors in tests
- Add debug logging to coredump socket tests, socket protocol tests,
and test helpers
- Add tests for PIDFD_INFO_COREDUMP_SIGNAL
- Add tests for supported_mask field
- Update pidfd header for selftests"
* tag 'vfs-6.19-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
pidfs: reduce wait_pidfd lock scope
selftests/coredump: add second PIDFD_INFO_COREDUMP_SIGNAL test
selftests/coredump: add first PIDFD_INFO_COREDUMP_SIGNAL test
selftests/coredump: ignore ENOSPC errors
selftests/coredump: add debug logging to coredump socket protocol tests
selftests/coredump: add debug logging to coredump socket tests
selftests/coredump: add debug logging to test helpers
selftests/coredump: handle edge-triggered epoll correctly
selftests/coredump: fix userspace coredump client detection
selftests/coredump: fix userspace client detection
selftests/coredump: split out coredump socket tests
selftests/coredump: split out common helpers
selftests/pidfd: add second supported_mask test
selftests/pidfd: add first supported_mask test
selftests/pidfd: update pidfd header
pidfs: expose coredump signal
pidfs: drop struct pidfs_exit_info
pidfs: prepare to drop exit_info pointer
pidfd: add a new supported_mask field
pidfs: add missing BUILD_BUG_ON() assert on struct pidfd_info
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
"This contains substantial namespace infrastructure changes including a new
system call, active reference counting, and extensive header cleanups.
The branch depends on the shared kbuild branch for -fms-extensions support.
Features:
- listns() system call
Add a new listns() system call that allows userspace to iterate
through namespaces in the system. This provides a programmatic
interface to discover and inspect namespaces, addressing
longstanding limitations:
Currently, there is no direct way for userspace to enumerate
namespaces. Applications must resort to scanning /proc/*/ns/ across
all processes, which is:
- Inefficient - requires iterating over all processes
- Incomplete - misses namespaces not attached to any running
process but kept alive by file descriptors, bind mounts, or
parent references
- Permission-heavy - requires access to /proc for many processes
- No ordering or ownership information
- No filtering per namespace type
The listns() system call solves these problems:
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
size_t nr_ns_ids, unsigned int flags);
struct ns_id_req {
__u32 size;
__u32 spare;
__u64 ns_id;
struct /* listns */ {
__u32 ns_type;
__u32 spare2;
__u64 user_ns_id;
};
};
Features include:
- Pagination support for large namespace sets
- Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.)
- Filtering by owning user namespace
- Permission checks respecting namespace isolation
- Active Reference Counting
Introduce an active reference count that tracks namespace
visibility to userspace. A namespace is visible in the following
cases:
- The namespace is in use by a task
- The namespace is persisted through a VFS object (namespace file
descriptor or bind-mount)
- The namespace is a hierarchical type and is the parent of child
namespaces
The active reference count does not regulate lifetime (that's still
done by the normal reference count) - it only regulates visibility
to namespace file handles and listns().
This prevents resurrection of namespaces that are pinned only for
internal kernel reasons (e.g., user namespaces held by
file->f_cred, lazy TLB references on idle CPUs, etc.) which should
not be accessible via any of the three cases listed above.
- Unified Namespace Tree
Introduce a unified tree structure for all namespaces with:
- Fixed IDs assigned to initial namespaces
- Lookup based solely on inode number
- Maintained list of owned namespaces per user namespace
- Simplified rbtree comparison helpers
Cleanups:
- Header Reorganization:
- Move namespace types into separate header (ns_common_types.h)
- Decouple nstree from ns_common header
- Move nstree types into separate header
- Switch to new ns_tree_{node,root} structures with helper functions
- Use guards for ns_tree_lock
- Initial Namespace Reference Count Optimization
- Make all reference counts on initial namespaces a nop to avoid
pointless cacheline ping-pong for namespaces that can never go
away
- Drop custom reference count initialization for initial namespaces
- Add NS_COMMON_INIT() macro and use it for all namespaces
- pid: rely on common reference count behavior
- Miscellaneous Cleanups
- Rename exit_task_namespaces() to exit_nsproxy_namespaces()
- Rename is_initial_namespace() and make argument const
- Use boolean to indicate anonymous mount namespace
- Simplify owner list iteration in nstree
- nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly
- nsfs: use inode_just_drop()
- pidfs: raise DCACHE_DONTCACHE explicitly
- pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
- libfs: allow to specify s_d_flags
- cgroup: add cgroup namespace to tree after owner is set
- nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
Fixes:
- setns(pidfd, ...) race condition
Fix a subtle race when using pidfds with setns(). When the target
task exits after prepare_nsset() but before commit_nsset(), the
namespace's active reference count might have been dropped. If
setns() then installs the namespaces, it would bump the active
reference count from zero without taking the required reference on
the owner namespace, leading to underflow when later decremented.
The fix resurrects the ownership chain if necessary - if the caller
succeeded in grabbing passive references, the setns() should
succeed even if the target task exits or gets reaped.
- Return EFAULT on put_user() error instead of success
- Make sure references are dropped outside of RCU lock (some
namespaces like mount namespace sleep when putting the last
reference)
- Don't skip active reference count initialization for network
namespace
- Add asserts for active refcount underflow
- Add asserts for initial namespace reference counts (both passive
and active)
- ipc: enable is_ns_init_id() assertions
- Fix kernel-doc comments for internal nstree functions
- Selftests
- 15 active reference count tests
- 9 listns() functionality tests
- 7 listns() permission tests
- 12 inactive namespace resurrection tests
- 3 threaded active reference count tests
- commit_creds() active reference tests
- Pagination and stress tests
- EFAULT handling test
- nsid tests fixes"
* tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits)
pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
nstree: fix kernel-doc comments for internal functions
nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
selftests/namespaces: fix nsid tests
ns: drop custom reference count initialization for initial namespaces
pid: rely on common reference count behavior
ns: add asserts for initial namespace active reference counts
ns: add asserts for initial namespace reference counts
ns: make all reference counts on initial namespace a nop
ipc: enable is_ns_init_id() assertions
fs: use boolean to indicate anonymous mount namespace
ns: rename is_initial_namespace()
ns: make is_initial_namespace() argument const
nstree: use guards for ns_tree_lock
nstree: simplify owner list iteration
nstree: switch to new structures
nstree: add helper to operate on struct ns_tree_{node,root}
nstree: move nstree types into separate header
nstree: decouple from ns_common header
ns: move namespace types into separate header
...
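The listns() request layout quoted in the namespace pull message can be mirrored in userspace as a rough sketch. Only the field names and order come from the message; the struct name ns_id_req_sketch, the helper ns_id_req_sketch_init(), the use of <stdint.h> types in place of __u32/__u64, and the convention of zeroing the spare fields and storing sizeof() in size are assumptions made for illustration. The numeric values for ns_type are not given in the message, so the caller supplies them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the request layout quoted in the pull message. */
struct ns_id_req_sketch {
	uint32_t size;
	uint32_t spare;
	uint64_t ns_id;
	struct /* listns */ {
		uint32_t ns_type;
		uint32_t spare2;
		uint64_t user_ns_id;
	};
};

/* Prepare a request that filters by type and owning user namespace.
 * Zeroing everything first keeps the spare fields and ns_id at 0. */
static void ns_id_req_sketch_init(struct ns_id_req_sketch *req,
				  uint32_t ns_type, uint64_t user_ns_id)
{
	memset(req, 0, sizeof(*req));
	req->size = sizeof(*req);	/* assumed extensible-struct convention */
	req->ns_type = ns_type;
	req->user_ns_id = user_ns_id;
}
```

With this layout the structure occupies 32 bytes with no implicit padding, since every 64-bit field already sits on an 8-byte boundary.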
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
"Features:
- Allow file systems to increase the minimum writeback chunk size.
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached
relative to the available writeback bandwidth.
This adds a superblock field that allows the file system to
override the default size, and sets it to the zone size for zoned
XFS.
- Add logging for slow writeback when it exceeds
sysctl_hung_task_timeout_secs. This helps identify tasks waiting
for a long time and pinpoint potential issues. Recording the
starting jiffies is also useful when debugging a crashed vmcore.
- Wake up waiting tasks when finishing the writeback of a chunk
Cleanups:
- filemap_* writeback interface cleanups.
Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
the original btrfs caller should be using better high level
interfaces instead.
This series removes all these low-level interfaces, switches btrfs
to a more specific interface, and cleans up other too low-level
interfaces. With this the writeback_control that is passed to the
writeback code is only initialized in three places.
- Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
filemap_fdatawrite_wbc
- Add filemap_flush_nr helper for btrfs
- Push struct writeback_control into start_delalloc_inodes in btrfs
- Rename filemap_fdatawrite_range_kick to filemap_flush_range
- Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm
- Make wbc_to_tag() inline and use it in fs"
* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Make wbc_to_tag() inline and use it in fs.
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"Features:
- Hide inode->i_state behind accessors. Open-coded accesses prevent
asserting they are done correctly. One obvious aspect is locking,
but significantly more can be checked. For example it can be
detected when the code is clearing flags which are already missing,
or is setting flags when it is illegal (e.g., I_FREEING when
->i_count > 0)
- Provide accessors for ->i_state, convert all filesystems using
coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
overlayfs, nilfs2, xfs), and make plain ->i_state access fail to
compile
- Rework I_NEW handling to operate without fences, simplifying the
code after the accessor infrastructure is in place
Cleanups:
- Move wait_on_inode() from writeback.h to fs.h
- Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
for clarity
- Cosmetic fixes to LRU handling
- Push list presence check into inode_io_list_del()
- Touch up predicts in __d_lookup_rcu()
- ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
- Assert on ->i_count in iput_final()
- Assert ->i_lock held in __iget()
Fixes:
- Add missing fences to I_NEW handling"
* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
dcache: touch up predicts in __d_lookup_rcu()
fs: push list presence check into inode_io_list_del()
fs: cosmetic fixes to lru handling
fs: rework I_NEW handling to operate without fences
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
fs: add missing fences to I_NEW handling
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
...
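The kind of checking that accessors make possible, as described in the inode pull message, can be sketched in userspace C. Everything here is hypothetical: struct sketch_inode, the SK_I_* flags and the sketch_state_*() helpers are invented for the example and are not the kernel's accessors. Where the kernel would presumably warn or assert, this sketch simply returns false.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_I_NEW	(1u << 0)	/* hypothetical flag values */
#define SK_I_FREEING	(1u << 1)

struct sketch_inode {
	bool lock_held;		/* models holding the inode lock */
	unsigned int count;	/* models the reference count */
	unsigned int state;	/* only touched through the helpers */
};

static unsigned int sketch_state_read(const struct sketch_inode *inode)
{
	return inode->state;
}

/* Set flags; refuse if unlocked or if the transition is illegal. */
static bool sketch_state_set(struct sketch_inode *inode, unsigned int flags)
{
	if (!inode->lock_held)
		return false;
	if ((flags & SK_I_FREEING) && inode->count > 0)
		return false;	/* freeing an inode that is still referenced */
	inode->state |= flags;
	return true;
}

/* Clear flags; refuse if unlocked or if a flag is already missing. */
static bool sketch_state_clear(struct sketch_inode *inode, unsigned int flags)
{
	if (!inode->lock_held)
		return false;
	if ((inode->state & flags) != flags)
		return false;
	inode->state &= ~flags;
	return true;
}
```

Because every modification goes through one of two functions, the locking and flag-consistency checks live in one place instead of being repeated, or forgotten, at each open-coded access.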
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
permission checks during path lookup and adds the
IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
expensive permission work.
- Hide dentry_cache behind runtime const machinery.
- Add German Maglione as virtiofs co-maintainer.
Cleanups:
- Tidy up and inline step_into() and walk_component() for improved
code generation.
- Re-enable IOCB_NOWAIT writes to files. This refactors file
timestamp update logic, fixing a layering bypass in btrfs when
updating timestamps on device files and improving FMODE_NOCMTIME
handling in VFS now that nfsd started using it.
- Path lookup optimizations extracting slowpaths into dedicated
routines and adding branch prediction hints for mntput_no_expire(),
fd_install(), lookup_slow(), and various other hot paths.
- Enable clang's -fms-extensions flag, requiring a JFS rename to
avoid conflicts.
- Remove spurious exports in fs/file_attr.c.
- Stop duplicating union pipe_index declaration. This depends on the
shared kbuild branch that brings in -fms-extensions support which
is merged into this branch.
- Use MD5 library instead of crypto_shash in ecryptfs.
- Use largest_zero_folio() in iomap_dio_zero().
- Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
initrd code.
- Various typo fixes.
Fixes:
- Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
call with wait == 1 to commit super blocks. The emergency sync path
never passed this, leaving btrfs data uncommitted during emergency
sync.
- Use local kmap in watch_queue's post_one_notification().
- Add hint prints in sb_set_blocksize() for LBS dependency on THP"
* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
MAINTAINERS: add German Maglione as virtiofs co-maintainer
fs: inline step_into() and walk_component()
fs: tidy up step_into() & friends before inlining
orangefs: use inode_update_timestamps directly
btrfs: fix the comment on btrfs_update_time
btrfs: use vfs_utimes to update file timestamps
fs: export vfs_utimes
fs: lift the FMODE_NOCMTIME check into file_update_time_flags
fs: refactor file timestamp update logic
include/linux/fs.h: trivial fix: regualr -> regular
fs/splice.c: trivial fix: pipes -> pipe's
fs: mark lookup_slow() as noinline
fs: add predicts based on nd->depth
fs: move mntput_no_expire() slowpath into a dedicated routine
fs: remove spurious exports in fs/file_attr.c
watch_queue: Use local kmap in post_one_notification()
fs: touch up predicts in path lookup
fs: move fd_install() slowpath into a dedicated routine and provide commentary
fs: hide dentry_cache behind runtime const machinery
fs: touch predicts in do_dentry_open()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
"FUSE iomap Support for Buffered Reads:
This adds iomap support for FUSE buffered reads and readahead. This
enables granular uptodate tracking with large folios so only
non-uptodate portions need to be read. Also fixes a race condition
with large folios + writeback cache that could cause data corruption
on partial writes followed by reads.
- Refactored iomap read/readahead bio logic into helpers
- Added caller-provided callbacks for read operations
- Moved buffered IO bio logic into new file
- FUSE now uses iomap for read_folio and readahead
Zero Range Folio Batch Support:
Add folio batch support for iomap_zero_range() to handle dirty
folios over unwritten mappings. Fix raciness issues where dirty data
could be lost during zero range operations.
- filemap_get_folios_tag_range() helper for dirty folio lookup
- Optional zero range dirty folio processing
- XFS fills dirty folios on zero range of unwritten mappings
- Removed old partial EOF zeroing optimization
DIO Write Completions from Interrupt Context:
Restore pre-iomap behavior where pure overwrite completions run
inline rather than being deferred to workqueue. Reduces context
switches for high-performance workloads like ScyllaDB.
- Removed unused IOCB_DIO_CALLER_COMP code
- Error completions always run in user context (fixes zonefs)
- Reworked REQ_FUA selection logic
- Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP
Buffered IO Cleanups:
Some performance and code clarity improvements:
- Replace manual bitmap scanning with find_next_bit()
- Simplify read skip logic for writes
- Optimize pending async writeback accounting
- Better variable naming
- Documentation for iomap_finish_folio_write() requirements
Misaligned Vectors for Zoned XFS:
Enables sub-block aligned vectors in XFS always-COW mode for zoned
devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag.
Bug Fixes:
- Allocate s_dio_done_wq for async reads (fixes syzbot report after
error completion changes)
- Fix iomap_read_end() for already uptodate folios (regression fix)"
* tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits)
iomap: allocate s_dio_done_wq for async reads as well
iomap: fix iomap_read_end() for already uptodate folios
iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
iomap: support write completions from interrupt context
iomap: rework REQ_FUA selection
iomap: always run error completions in user context
fs, iomap: remove IOCB_DIO_CALLER_COMP
iomap: use find_next_bit() for uptodate bitmap scanning
iomap: use find_next_bit() for dirty bitmap scanning
iomap: simplify when reads can be skipped for writes
iomap: simplify ->read_folio_range() error handling for reads
iomap: optimize pending async writeback accounting
docs: document iomap writeback's iomap_finish_folio_write() requirement
iomap: account for unaligned end offsets when truncating read range
iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
xfs: support sub-block aligned vectors in always COW mode
iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
xfs: error tag to force zeroing on debug kernels
iomap: remove old partial eof zeroing optimization
xfs: fill dirty folios on zero range of unwritten mappings
...
|
|
Fix an uninitialised variable (key) in afs_alloc_anon_key() by setting it
to cell->anonymous_key. Without this change, the error check may return a
false failure with a bad error number.
Most of the time this is unlikely to happen because the first encounter
with afs_alloc_anon_key() will usually be from (auto)mount, for which all
subsequent operations must wait - apart from other (auto)mounts. Once the
cell->anonymous_key is allocated, all further calls to afs_request_key()
will skip the call to afs_alloc_anon_key() for that cell.
Fixes: d27c71257825 ("afs: Fix delayed allocation of a cell's anonymous key")
Reported-by: Paulo Alcantara <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: syzbot+41c68824eefb67cdf00c@syzkaller.appspotmail.com
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- afs: Fix delayed allocation of a cell's anonymous key
The allocation of a cell's anonymous key is done in a background
thread along with other cell setup such as doing a DNS upcall. The
normal key lookup tries to use the key description on the anonymous
authentication key as the reference for request_key() - but it may
not yet be set, causing an oops
- ovl: fail ovl_lock_rename_workdir() if either target is unhashed
As well as checking that the parent hasn't changed after getting the
lock, the code needs to check that the dentry hasn't been unhashed.
Otherwise overlayfs might try to rename something that has been
removed
- namespace: fix a reference leak in grab_requested_mnt_ns
lookup_mnt_ns() already takes a reference on mnt_ns, and so
grab_requested_mnt_ns() doesn't need to take an extra reference
* tag 'vfs-6.18-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
afs: Fix delayed allocation of a cell's anonymous key
ovl: fail ovl_lock_rename_workdir() if either target is unhashed
fs/namespace: fix reference leak in grab_requested_mnt_ns
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-44-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-21-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-17-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-16-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-15-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-14-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-13-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-12-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-11-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-10-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-9-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Christian Brauner <brauner@kernel.org> says:
The fix sent in [1] was squashed into this commit.
Link: https://lore.kernel.org/20251127201618.2115275-1-kuniyu@google.com [1]
Reported-by: syzbot+321168dfa622eda99689@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/6928b121.a70a0220.d98e3.0110.GAE@google.com
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-8-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Christian Brauner <brauner@kernel.org> says:
A variant of the fix sent in [1] was squashed into this commit.
Link: https://lore.kernel.org/20251128035149.392402-1-kartikey406@gmail.com [1]
Reported-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: syzbot+94048264da5715c251f9@syzkaller.appspotmail.com
Tested-by: syzbot+94048264da5715c251f9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=94048264da5715c251f9
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-7-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-6-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-5-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-4-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-3-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-2-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The allocation of a cell's anonymous key is done in a background thread
along with other cell setup such as doing a DNS upcall. In the reported
bug, this is triggered by afs_parse_source() parsing the device name given
to mount() and calling afs_lookup_cell() with the name of the cell.
The normal key lookup then tries to use the key description on the
anonymous authentication key as the reference for request_key() - but it
may not yet be set and so an oops can happen.
This has been made more likely to happen by the fix for dynamic lookup
failure.
Fix this by firstly allocating a reference name and attaching it to the
afs_cell record when the record is created. It can share the memory
allocation with the cell name (unfortunately it can't just overlap the cell
name by prepending it with "afs@" as the cell name already has a '.'
prepended for other purposes). This reference name is then passed to
request_key().
Secondly, the anon key is now allocated on demand at the point a key is
requested in afs_request_key() if it is not already allocated. A mutex is
used to prevent multiple allocation for a cell.
Thirdly, make afs_request_key_rcu() return NULL if the anonymous key isn't
yet allocated (if we need it) and then the caller can return -ECHILD to
drop out of RCU-mode and afs_request_key() can be called.
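The on-demand allocation described in the second step can be sketched in userspace C. This is only an illustration under stated assumptions: `toy_cell`, `toy_get_anon_key()` and `toy_allocations` are hypothetical names, a pthread mutex stands in for the kernel mutex, and none of this is the afs code itself.

```c
#include <pthread.h>
#include <stdlib.h>

struct toy_cell {                 /* hypothetical stand-in for a cell record */
	pthread_mutex_t lock;
	int *anon_key;                /* NULL until a key is first requested */
};

static int toy_allocations;       /* counts allocations, for illustration only */

/*
 * Return the cell's key, allocating it on first use. The check and the
 * allocation both happen under the mutex, so two concurrent callers
 * cannot both allocate a key for the same cell.
 */
static int *toy_get_anon_key(struct toy_cell *cell)
{
	int *key;

	pthread_mutex_lock(&cell->lock);
	if (!cell->anon_key) {
		cell->anon_key = malloc(sizeof(*cell->anon_key));
		if (cell->anon_key) {
			*cell->anon_key = 0;
			toy_allocations++;
		}
	}
	key = cell->anon_key;
	pthread_mutex_unlock(&cell->lock);
	return key;
}
```

A second call for the same cell finds the pointer already set and returns it without allocating again.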
Note that the anonymous key is kind of necessary to make the key lookup
cache work as that doesn't currently cache a negative lookup, but it's
probably worth some investigation to see if NULL can be used instead.
Fixes: 330e2c514823 ("afs: Fix dynamic lookup to fail on cell lookup failure")
Reported-by: syzbot+41c68824eefb67cdf00c@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/800328.1764325145@warthog.procyon.org.uk
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Remove unnecessary semicolons reported by Coccinelle/coccicheck and the
semantic patch at scripts/coccinelle/misc/semicolon.cocci.
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Fixed: 7ab96df840e60 ("VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
As well as checking that the parent hasn't changed after getting the
lock we need to check that the dentry hasn't been unhashed.
Otherwise we might try to rename something that has been removed.
Reported-by: syzbot+bfc9a0ccf0de47d04e8c@syzkaller.appspotmail.com
Fixes: d2c995581c7c ("ovl: Call ovl_create_temp() without lock held.")
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/176429295510.634289.1552337113663461690@noble.neil.brown.name
Tested-by: syzbot+bfc9a0ccf0de47d04e8c@syzkaller.appspotmail.com
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Rationale is that if the parent dentry is the same and the length is the
same, then you have to be unlucky for the name to not match.
At the same time the dentry was literally just found on the hash, so you
have to be even more unlucky to determine it is unhashed.
While here, add commentary on why d_unhashed() is necessary. It was
already removed once and brought back in:
2e321806b681b192 ("Revert "vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu"")
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251127131526.4137768-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the correct function name and add description for the @flavor
parameter to avoid these kernel-doc warnings:
Warning: fs/locks.c:1706 function parameter 'flavor' not described in
'__fcntl_getlease'
WARNING: fs/locks.c:1706 expecting prototype for fcntl_getlease().
Prototype was for __fcntl_getlease() instead
Fixes: 1602bad16d7d ("vfs: expose delegation support to userland")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251128000826.457120-1-rdunlap@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Avoid a double-unlock as nfs_create_locked() will have unlocked the
parent and do the dput() manually.
Christian Brauner <brauner@kernel.org> says:
I've taken Neil's proposed fix from [1] and added a commit message.
Fixes: https://lore.kernel.org/202511252132.2c621407-lkp@intel.com [1]
Fixes: bd6ede8a06e8 ("VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()")
Signed-off-by: Neil Brown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Conflicts:
net/xdp/xsk.c
0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
8da7bea7db69 ("xsk: add indirect call for xsk_destruct_skb")
30ed05adca4a ("xsk: use a smaller new lock for shared pool case")
https://lore.kernel.org/20251127105450.4a1665ec@canb.auug.org.au
https://lore.kernel.org/eb4eee14-7e24-4d1b-b312-e9ea738fefee@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull ceph fixes from Ilya Dryomov:
"A patch to make sparse read handling work in msgr2 secure mode from
Slava and a couple of fixes from Ziming and myself to avoid operating
on potentially invalid memory, all marked for stable"
* tag 'ceph-for-6.18-rc8' of https://github.com/ceph/ceph-client:
libceph: prevent potential out-of-bounds writes in handle_auth_session_key()
libceph: replace BUG_ON with bounds check for map->max_osd
ceph: fix crash in process_v2_sparse_read() for encrypted directories
libceph: drop started parameter of __ceph_open_session()
libceph: fix potential use-after-free in have_mon_and_osd_map()
|
|
With the previous commit revamping the timeout handling, started isn't
used anymore. It could be taken into account by adjusting the initial
value of the timeout, but there is little point as both callers capture
the timestamp shortly before calling __ceph_open_session() -- the only
thing of note that happens in the interim is taking client->mount_mutex
and that isn't expected to take multiple seconds.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
|
|
The primary consumer is link_path_walk(), calling walk_component() every
time which in turn calls step_into().
Inlining these saves overhead of 2 function calls per path component,
along with allowing the compiler to do a better job optimizing them in place.
step_into() had absolutely atrocious assembly to facilitate the
slowpath. In order to lessen the burden at the callsite all the hard
work is moved into step_into_slowpath() and instead an inline-able
fastpath is implemented for rcu-walk.
The new fastpath is a stripped down step_into() RCU handling with a
d_managed() check from handle_mounts().
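The general shape of such a split, an inlinable fast path in front of an out-of-line slow path, might look like the sketch below. It is a generic illustration, not the namei.c code: `toy_entry`, `toy_step()`, `toy_step_slowpath()` and `slow_calls` are hypothetical, and `__builtin_expect` and the `noinline` attribute assume GCC or Clang.

```c
#include <stdbool.h>

struct toy_entry {                    /* hypothetical stand-in for a dentry */
	bool needs_special_handling;  /* e.g. a mount point or a symlink */
	int  value;
};

static int slow_calls;  /* counts slow-path entries, for illustration only */

/* All rarely needed work lives out of line, so call sites stay small. */
__attribute__((noinline))
static int toy_step_slowpath(const struct toy_entry *e)
{
	slow_calls++;
	return e->value * 2;          /* placeholder for the expensive handling */
}

/* Inlinable fast path: one predicted branch, then a direct return. */
static inline int toy_step(const struct toy_entry *e)
{
	if (__builtin_expect(!e->needs_special_handling, 1))
		return e->value;
	return toy_step_slowpath(e);
}
```

Keeping the slow path out of line means the compiler does not have to reserve registers for it at every inlined call site.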
Benchmarked as follows on Sapphire Rapids:
1. the "before" was a kernel with not-yet-merged optimizations (notably
elision of calls to security_inode_permission() and marking ext4
inodes as not having acls as applicable)
2. "after" is the same + the prep patch + this patch
3. benchmark consists of issuing 205 calls to access(2) in a loop with
pathnames lifted out of gcc and the linker building real code, most
of which have several path components and 118 of which fail with
-ENOENT.
Result in terms of ops/s:
before: 21619
after: 22536 (+4%)
profile before:
20.25% [kernel] [k] __d_lookup_rcu
10.54% [kernel] [k] link_path_walk
10.22% [kernel] [k] entry_SYSCALL_64
6.50% libc.so.6 [.] __GI___access
6.35% [kernel] [k] strncpy_from_user
4.87% [kernel] [k] step_into
3.68% [kernel] [k] kmem_cache_alloc_noprof
2.88% [kernel] [k] walk_component
2.86% [kernel] [k] kmem_cache_free
2.14% [kernel] [k] set_root
2.08% [kernel] [k] lookup_fast
after:
23.38% [kernel] [k] __d_lookup_rcu
11.27% [kernel] [k] entry_SYSCALL_64
10.89% [kernel] [k] link_path_walk
7.00% libc.so.6 [.] __GI___access
6.88% [kernel] [k] strncpy_from_user
3.50% [kernel] [k] kmem_cache_alloc_noprof
2.01% [kernel] [k] kmem_cache_free
2.00% [kernel] [k] set_root
1.99% [kernel] [k] lookup_fast
1.81% [kernel] [k] do_syscall_64
1.69% [kernel] [k] entry_SYSCALL_64_safe_stack
While walk_component() and step_into() of course disappear from the
profile, the link_path_walk() barely gets more overhead despite the
inlining thanks to the fast path added and while completing more walks
per second.
I did not investigate why overhead grew a lot on __d_lookup_rcu().
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251120003803.2979978-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Symlink handling is already marked as unlikely and pushing out some of
it into pick_link() reduces register spillage on entry to step_into()
with gcc 14.2.
The compiler needed additional convincing that handle_mounts() is
unlikely to fail.
At the same time neither clang nor gcc could be convinced to tail-call
into pick_link().
While pick_link() takes an address of stack-based object as an argument
(which definitely prevents the optimization), splitting it into separate
<dentry, mount> tuple did not help. The issue persists even when
compiled without stack protector. As such nothing was done about this
for the time being to not grow the diff.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251120003803.2979978-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Orangefs has no i_version handling and __orangefs_setattr already
explicitly marks the inode dirty. So instead of using the flags
return value from generic_update_time, just call the
lower level inode_update_timestamps helper directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-7-hch@lst.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Since commit e41f941a2311 ("Btrfs: move over to use ->update_time") this
is not a copy of the high-level file_update_time helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-6-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Btrfs updates the device node timestamps for block device special files
when it stops using the device.
Commit 8f96a5bfa150 ("btrfs: update the bdev time directly when closing")
switched that update from the correct layering to directly calling the
low-level helper on the bdev inode. This is wrong and got fixed in
commit 54fde91f52f5 ("btrfs: update device path inode time instead of
bd_inode") by updating the file system inode instead of the bdev inode,
but this kept the incorrect bypassing of the VFS interfaces and file
system ->update_times method. Fix this by using the proper vfs_utimes
interface.
Fixes: 8f96a5bfa150 ("btrfs: update the bdev time directly when closing")
Fixes: 54fde91f52f5 ("btrfs: update device path inode time instead of bd_inode")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-5-hch@lst.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This will be used to replace an incorrect direct call into
generic_update_time in btrfs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-4-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
FMODE_NOCMTIME used to be just a hack for the legacy XFS handle-based
"invisible I/O", but commit e5e9b24ab8fa ("nfsd: freeze c/mtime updates
with outstanding WRITE_ATTRS delegation") started using it from
generic callers.
I'm not sure other file systems are actually ready for this in general,
so the above commit should get a closer look, but for it to make any
sense, file_update_time needs to respect the flag.
Lift the check from file_modified_flags to file_update_time so that
users of file_update_time inherit the behavior and so that all the
checks are done in one place.
Fixes: e5e9b24ab8fa ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-3-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Currently the two high-level APIs use two helper functions to implement
almost all of the logic. Refactor the two helpers and the common logic
into a new file_update_time_flags routine that gets the iocb flags or
0 in case of file_update_time passed so that the entire logic is
contained in a single function and can be easily understood and modified.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-2-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a comment on regeneration to the generated files.
The comment is placed after the YNL-GEN line[1], so as not to interfere
with ynl-regen.sh's detection logic.
[1] and after the optional YNL-ARG line.
Link: https://lore.kernel.org/r/aR5m174O7pklKrMR@zx2c4.com/
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251120174429.390574-3-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When having a multiuser mount with domain= specified and using
cifscreds, cifs_set_cifscreds() will end up setting @ctx->domainname,
so it needs to be freed before leaving cifs_construct_tcon().
This fixes the following memory leak reported by kmemleak:
mount.cifs //srv/share /mnt -o domain=ZELDA,multiuser,...
su - testuser
cifscreds add -d ZELDA -u testuser
...
ls /mnt/1
...
umount /mnt
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8881203c3f08 (size 8):
comm "ls", pid 5060, jiffies 4307222943
hex dump (first 8 bytes):
5a 45 4c 44 41 00 cc cc ZELDA...
backtrace (crc d109a8cf):
__kmalloc_node_track_caller_noprof+0x572/0x710
kstrdup+0x3a/0x70
cifs_sb_tlink+0x1209/0x1770 [cifs]
cifs_get_fattr+0xe1/0xf50 [cifs]
cifs_get_inode_info+0xb5/0x240 [cifs]
cifs_revalidate_dentry_attr+0x2d1/0x470 [cifs]
cifs_getattr+0x28e/0x450 [cifs]
vfs_getattr_nosec+0x126/0x180
vfs_statx+0xf6/0x220
do_statx+0xab/0x110
__x64_sys_statx+0xd5/0x130
do_syscall_64+0xbb/0x380
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: f2aee329a68f ("cifs: set domainName when a domain-key is used in multiuser")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Jay Shin <jaeshin@redhat.com>
Cc: stable@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
For consistency with sb routines.
ext4 is the only consumer outside of evict(). Damage-controlling it is
outside of the scope of this cleanup.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251103230911.516866-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
1. inode_bit_waitqueue() was somehow placed between __inode_add_lru() and
inode_add_lru(). move it up
2. assert ->i_lock is held in __inode_add_lru instead of just claiming it is
needed
3. s/__inode_add_lru/__inode_lru_list_add/ for consistency with itself
(inode_lru_list_del()) and similar routines for sb and io list
management
4. push list presence check into inode_lru_list_del(), just like sb and
io list
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251029131428.654761-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In the inode hash code grab the state while ->i_lock is held. If found
to be set, synchronize the sleep once more with the lock held.
In the real world the flag is not set most of the time.
Apart from being simpler to reason about, it comes with a minor speed up
as now clearing the flag does not require the smp_mb() fence.
While here rename wait_on_inode() to wait_on_new_inode() to line it up
with __wait_on_freeing_inode().
Christian Brauner <brauner@kernel.org> says:
As per the discussion in [1] I folded in the diff sent in [2].
Link: https://lore.kernel.org/69238e4d.a70a0220.d98e3.006e.GAE@google.com [1]
Link: https://lore.kernel.org/c2kpawomkbvtahjm7y5mposbhckb7wxthi3iqy5yr22ggpucrm@ufvxwy233qxo [2]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251010221737.1403539-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Since commit 222f2c7c6d14 ("iomap: always run error completions in user
context"), read error completions are deferred to s_dio_done_wq. This
means the workqueue also needs to be allocated for async reads.
Fixes: 222f2c7c6d14 ("iomap: always run error completions in user context")
Reported-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251124140013.902853-1-hch@lst.de
Tested-by: syzbot+a2b9a4ed0d61b1efb3f5@syzkaller.appspotmail.com
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
There are some cases where when iomap_read_end() is called, the folio
may already have been marked uptodate. For example, if the iomap block
needed zeroing, then the folio may have been marked uptodate after the
zeroing.
iomap_read_end() should unlock the folio instead of calling
folio_end_read(), which is how these cases were handled prior to commit
f8eaf79406fe ("iomap: simplify ->read_folio_range() error handling for
reads"). Calling folio_end_read() on an uptodate folio leads to buggy
behavior where marking an already uptodate folio as uptodate will XOR it
to be marked nonuptodate.
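The XOR hazard can be shown with a toy model. The flag bits and helpers below (`TOY_LOCKED`, `TOY_UPTODATE`, `toy_end_read()`, `toy_unlock()`) are hypothetical and only mimic the idea of finishing a read with a single XOR; they are not the kernel's folio API.

```c
#include <stdbool.h>

#define TOY_LOCKED   (1u << 0)  /* hypothetical flag bits for this sketch */
#define TOY_UPTODATE (1u << 1)

/*
 * Toy end-of-read: drops "locked" and sets "uptodate" with one XOR.
 * This is only correct if the folio is locked and not yet uptodate.
 */
static unsigned int toy_end_read(unsigned int flags, bool success)
{
	unsigned int mask = TOY_LOCKED;

	if (success)
		mask |= TOY_UPTODATE;
	return flags ^ mask;
}

/* Toy plain unlock: touches only the "locked" bit. */
static unsigned int toy_unlock(unsigned int flags)
{
	return flags & ~TOY_LOCKED;
}
```

In this model, `toy_end_read()` on a locked folio that is already uptodate clears both bits, while `toy_unlock()` leaves the uptodate bit set.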
Fixes: f8eaf79406fe ("iomap: simplify ->read_folio_range() error handling for reads")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251118211111.1027272-2-joannelkoong@gmail.com
Tested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Replace IOMAP_DIO_INLINE_COMP with a flag to indicate that the
completion should be offloaded. This removes a tiny bit of boilerplate
code, but more importantly just makes the code easier to follow as this
new flag gets set most of the time and only cleared in one place, while
it was the inverse for the old version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-6-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Completions for pure overwrites don't need to be deferred to a workqueue
as there is no work to be done, or at least no work that needs a user
context. Set the IOMAP_DIO_INLINE_COMP by default for writes like we
already do for reads, and then clear it for all the cases that actually
do need a user context for completions to update the inode size or
record updates to the logical to physical mapping.
I've audited all users of the ->end_io callback, and they only require
user context for I/O that involves unwritten extents, COW, size
extensions, or error handling and all those are still run from workqueue
context.
This restores the behavior of the old pre-iomap direct I/O code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-5-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The way iomap_dio_can_use_fua and its caller are structured is
a bit confusing, as the main guarding condition is hidden in the
helper, and the secondary conditions are split between caller and
callee.
Refactor the code, so that iomap_dio_bio_iter itself tracks if a write
might need metadata updates based on the iomap type and flags, and
then have a condition based on that to use the FUA flag.
Note that this also moves the REQ_OP_WRITE assignment to the end of
the branch to improve readability a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-4-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
At least zonefs expects error completions to be able to sleep. Because
error completions aren't performance critical, just defer them to workqueue
context unconditionally.
Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This was added by commit 099ada2c8726 ("io_uring/rw: add write support
for IOCB_DIO_CALLER_COMP") and disabled a little later by commit
838b35bb6a89 ("io_uring/rw: disable IOCB_DIO_CALLER_COMP") because it
didn't work. Remove all the related code that sat unused for 2 years.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-2-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use find_next_bit()/find_next_zero_bit() for iomap uptodate bitmap
scanning. This uses __ffs() internally and is more efficient for
finding the next uptodate or non-uptodate bit than iterating through the
bitmap range testing every bit.
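The difference between the two scanning strategies can be sketched in userspace C. This is an illustration, not the kernel's find_next_bit(): `next_set_bit_naive()` and `next_set_bit_fast()` are hypothetical helpers, and `__builtin_ctzl` (GCC/Clang) stands in for __ffs().

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Naive scan: test every bit in [start, nbits) until a set one is found. */
static size_t next_set_bit_naive(const unsigned long *map, size_t nbits,
				 size_t start)
{
	for (size_t i = start; i < nbits; i++)
		if (map[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)))
			return i;
	return nbits;
}

/*
 * Word-at-a-time scan: mask off the bits below start in the first word,
 * skip words that are entirely zero, then use count-trailing-zeros to
 * jump straight to the lowest set bit. Returns nbits if none is found.
 */
static size_t next_set_bit_fast(const unsigned long *map, size_t nbits,
				size_t start)
{
	size_t word, bit;
	unsigned long cur;

	if (start >= nbits)
		return nbits;

	word = start / BITS_PER_WORD;
	cur = map[word] & (~0UL << (start % BITS_PER_WORD));

	while (!cur) {
		word++;
		if (word * BITS_PER_WORD >= nbits)
			return nbits;
		cur = map[word];
	}

	bit = word * BITS_PER_WORD + (size_t)__builtin_ctzl(cur);
	return bit < nbits ? bit : nbits;
}
```

Both helpers return the same index; the second touches one word per BITS_PER_WORD bits instead of testing each bit individually.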
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-10-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use find_next_bit()/find_next_zero_bit() for iomap dirty bitmap
scanning. This uses __ffs() internally and is more efficient for
finding the next dirty or clean bit than iterating through the bitmap
range testing every bit.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-9-joannelkoong@gmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Trivial fix.
Signed-off-by: Askar Safin <safinaskar@gmail.com>
Link: https://patch.msgid.link/20251120211316.706725-1-safinaskar@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Otherwise it gets inlined notably in walk_component(), which convinces
the compiler to push/pop additional registers in the fast path to
accommodate the existence of the inlined version.
Shortens the fast path of that routine from 87 to 71 bytes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251119144930.2911698-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Stats from nd->depth usage during the venerable kernel build collected like so:
bpftrace -e 'kprobe:terminate_walk,kprobe:walk_component,kprobe:legitimize_links
{ @[probe] = lhist(((struct nameidata *)arg0)->depth, 0, 8, 1); }'
@[kprobe:legitimize_links]:
[0, 1) 6554906 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 3534 | |
@[kprobe:terminate_walk]:
[0, 1) 12153664 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@[kprobe:walk_component]:
[0, 1) 53075749 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 971421 | |
[2, 3) 84946 | |
Additionally a custom probe was added for depth within link_path_walk():
bpftrace -e 'kprobe:link_path_walk_probe { @[probe] = lhist(arg0, 0, 8, 1); }'
@[kprobe:link_path_walk_probe]:
[0, 1) 7528231 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2) 407905 |@@ |
Given these results:
1. terminate_walk() is called towards the end of the lookup and in this
test it never had any links to clean up.
2. legitimize_links() is also called towards the end of lookup and most
of the time the depth is 0. Patch consumers to avoid calling into it
in that case.
3. walk_component() is typically called with WALK_MORE and zero depth,
checked in that order. Check depth first and predict it is 0.
4. link_path_walk() also does not deal with a symlink most of the time
when !*name
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251119142954.2909394-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
lookup_mnt_ns() already takes a reference on mnt_ns.
grab_requested_mnt_ns() doesn't need to take an extra reference.
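A toy reference count makes the leak easy to see. Everything below (`toy_ns`, `toy_lookup()`, `toy_grab_leaky()`, `toy_grab()`, `toy_put()`) is hypothetical and only models the pattern; it is not the fs/namespace.c code.

```c
struct toy_ns { int refcount; };   /* hypothetical stand-in for a namespace */

/* Lookup helper that already returns a counted reference. */
static struct toy_ns *toy_lookup(struct toy_ns *ns)
{
	ns->refcount++;
	return ns;
}

/* Buggy caller: takes a second reference that nobody will ever drop. */
static struct toy_ns *toy_grab_leaky(struct toy_ns *ns)
{
	struct toy_ns *found = toy_lookup(ns);

	found->refcount++;
	return found;
}

/* Fixed caller: hands the lookup's reference straight to its own caller. */
static struct toy_ns *toy_grab(struct toy_ns *ns)
{
	return toy_lookup(ns);
}

/* Drop one reference. */
static void toy_put(struct toy_ns *ns)
{
	ns->refcount--;
}
```

After one grab and one put, the leaky variant leaves the count one higher than it started, so the object can never be freed.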
Fixes: 78f0e33cd6c93 ("fs/namespace: correctly handle errors returned by grab_requested_mnt_ns")
Signed-off-by: Andrei Vagin <avagin@google.com>
Link: https://patch.msgid.link/20251122071953.3053755-1-avagin@google.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull xfs fix from Carlos Maiolino:
"A single out-of-bounds fix, nothing special"
* tag 'xfs-fixes-6.18-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix out of bounds memory read error in symlink repair
|
|
The "shareable_bits" and "bit_usage" resctrl files associated with cache
resources give insight into how instances of a cache are used.
Update the annotated capacity bitmasks displayed by "bit_usage" to include the
cache portions allocated for I/O via the "io_alloc" feature. "shareable_bits"
is a global bitmask of shareable cache with I/O and can thus not present the
per-domain I/O allocations possible with the "io_alloc" feature. Revise the
"shareable_bits" documentation to direct users to "bit_usage" for accurate
cache usage information.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/e02a0d424129fd7f3e45822a559b1c614ae4652a.1762995456.git.babu.moger@amd.com
|
|
The io_alloc feature in resctrl enables system software to configure the
portion of the cache allocated for I/O traffic. When supported, the
io_alloc_cbm file in resctrl provides access to capacity bitmasks (CBMs)
allocated for I/O devices.
Enable users to modify io_alloc CBMs by writing to the io_alloc_cbm resctrl
file when the io_alloc feature is enabled.
Mirror the CBMs between CDP_CODE and CDP_DATA when CDP is enabled to present
consistent I/O allocation information to user space.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/67609641b03ccfba18a8ee0bf9dbd1f3dcbecda3.1762995456.git.babu.moger@amd.com
|
|
parse_cbm() requires resource group mode and CLOSID to validate the capacity
bitmask (CBM). It is passed via struct rdtgroup in struct rdt_parse_data.
The io_alloc feature also uses CBMs to indicate which portions of cache are
allocated for I/O traffic. The CBMs are provided by user space and need to be
validated the same as CBMs provided for general (CPU) cache allocation.
parse_cbm() cannot be used as-is since io_alloc does not have rdtgroup context.
Pass the resource group mode and CLOSID directly to parse_cbm() via struct
rdt_parse_data, instead of through the rdtgroup struct, to facilitate calling
parse_cbm() to verify the CBM of the io_alloc feature.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/f8ec6ab5cf594d906a3fe75f56793d5fbd63f38f.1762995456.git.babu.moger@amd.com
|
|
Introduce the "io_alloc_cbm" resctrl file to display the capacity bitmasks
(CBMs) that represent the portions of each cache instance allocated
for I/O traffic on a cache resource that supports the "io_alloc" feature.
io_alloc_cbm resides in the info directory of a cache resource, for example,
/sys/fs/resctrl/info/L3/. Since the resource name is part of the path, it
is not necessary to display the resource name as done in the schemata file.
When CDP is enabled, io_alloc routes traffic using the highest CLOSID
associated with the CDP_CODE resource and that CLOSID becomes unusable for
the CDP_DATA resource. The highest CLOSID of CDP_CODE and CDP_DATA resources
will be kept in sync to ensure a consistent user interface. In preparation for
this, access the CBMs for I/O traffic through the highest CLOSID of either the
CDP_CODE or the CDP_DATA resource.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/55a3ff66a70e7ce8239f022e62b334e9d64af604.1762995456.git.babu.moger@amd.com
|
|
AMD's SDCIAE forces all SDCI lines to be placed into the L3 cache portions
identified by the highest-supported L3_MASK_n register, where n is the maximum
supported CLOSID.
To support this, when the io_alloc resctrl feature is enabled, reserve the highest
CLOSID exclusively for I/O allocation traffic, making it no longer available for
general CPU cache allocation.
Introduce a user interface to enable/disable the io_alloc feature and encourage users
to enable io_alloc only when running workloads that can benefit from this
functionality. On enable, initialize the io_alloc CLOSID with all usable CBMs
across all the domains.
Since CLOSIDs are managed by resctrl fs, it is least invasive to make "io_alloc
is supported by maximum supported CLOSID" part of the initial resctrl fs
support for io_alloc. Take care to minimally (only in error messages) expose
this use of CLOSID for io_alloc to user space so that this is not required from
other architectures that may support io_alloc differently in the future.
When resctrl is mounted with "-o cdp" to enable code/data prioritization,
there are two L3 resources that can support I/O allocation: L3CODE and
L3DATA. From resctrl fs perspective the two resources share a CLOSID and
the architecture's available CLOSIDs are halved to support this.
The architecture's underlying CLOSID used by SDCIAE when CDP is enabled is the
CLOSID associated with the CDP_CODE resource, but from resctrl's perspective
there is only one CLOSID for both CDP_CODE and CDP_DATA. CDP_DATA is thus not
usable for general (CPU) cache allocation nor I/O allocation.
Keep the CDP_CODE and CDP_DATA I/O alloc status in sync to avoid any confusion
to user space. That is, enabling io_alloc on CDP_CODE does so on CDP_DATA and
vice-versa, and keep the I/O allocation CBMs of CDP_CODE and CDP_DATA in sync.
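The CLOSID arithmetic described above can be sketched in user space. This is a toy model, not the resctrl code: visible_closids(), io_alloc_closid() and hw_index() are hypothetical names, and the code/data interleaving of hardware indices (code = 2n + 1, data = 2n) is an assumption chosen to match the statement that the hardware CLOSID used with CDP is the one associated with CDP_CODE.

```c
#include <assert.h>
#include <stdbool.h>

enum conf_type { CONF_NONE, CONF_CODE, CONF_DATA };

/* CLOSIDs visible to the filesystem layer: halved when CDP is enabled. */
static unsigned int visible_closids(unsigned int hw_closids, bool cdp)
{
	return cdp ? hw_closids / 2 : hw_closids;
}

/* The io_alloc CLOSID is the highest one the filesystem layer can see. */
static unsigned int io_alloc_closid(unsigned int hw_closids, bool cdp)
{
	unsigned int n = visible_closids(hw_closids, cdp);

	assert(n > 0);
	return n - 1;
}

/*
 * ASSUMPTION: with CDP, each visible CLOSID n maps to two hardware
 * indices, 2n for data and 2n + 1 for code.
 */
static unsigned int hw_index(unsigned int closid, enum conf_type type)
{
	switch (type) {
	case CONF_CODE:
		return closid * 2 + 1;
	case CONF_DATA:
		return closid * 2;
	default:
		return closid;
	}
}
```

Under these assumptions, a system with 16 hardware CLOSIDs exposes 8 of them with CDP enabled. The io_alloc CLOSID is then 7, and its code index, 15, is the highest hardware CLOSID.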
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/c7d3037795e653e22b02d8fc73ca80d9b075031c.1762995456.git.babu.moger@amd.com
|
|
Introduce the "io_alloc" resctrl file to the "info" area of a cache resource,
for example /sys/fs/resctrl/info/L3/io_alloc. "io_alloc" indicates support for
the "io_alloc" feature that allows direct insertion of data from I/O
devices into the cache.
Restrict exposing support for "io_alloc" to the L3 resource that is the only
resource where this feature can be backed by AMD's L3 Smart Data Cache
Injection Allocation Enforcement (SDCIAE). With that, the "io_alloc" file is
only visible to user space if the L3 resource supports "io_alloc".
Doing so makes the file visible for all cache resources though, for example
also L2 cache (if it supports cache allocation). As a consequence, add
capability for file to report expected "enabled" and "disabled", as well as
"not supported".
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/e8b116a8f424128b227734bb1d433c14af478d90.1762995456.git.babu.moger@amd.com
|
|
Pull smb client fixes from Steve French:
- Fix potential memory leak in mount
- Add some missing read tracepoints
- Fix locking issue with directory leases
* tag 'v6.18-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Add the smb3_read_* tracepoints to SMB1
cifs: fix memory leak in smb3_fs_context_parse_param error path
smb: client: introduce close_cached_dir_locked()
|
|
Cross-merge networking fixes after downstream PR (net-6.18-rc7).
No conflicts, adjacent changes:
tools/testing/selftests/net/af_unix/Makefile
e1bb28bf13f4 ("selftest: af_unix: Add test for SO_PEEK_OFF.")
45a1cd8346ca ("selftests: af_unix: Add tests for ECONNRESET and EOF semantics")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
xfs/286 produced this report on my test fleet:
==================================================================
BUG: KFENCE: out-of-bounds read in memcpy_orig+0x54/0x110
Out-of-bounds read at 0xffff88843fe9e038 (184B right of kfence-#184):
memcpy_orig+0x54/0x110
xrep_symlink_salvage_inline+0xb3/0xf0 [xfs]
xrep_symlink_salvage+0x100/0x110 [xfs]
xrep_symlink+0x2e/0x80 [xfs]
xrep_attempt+0x61/0x1f0 [xfs]
xfs_scrub_metadata+0x34f/0x5c0 [xfs]
xfs_ioc_scrubv_metadata+0x387/0x560 [xfs]
xfs_file_ioctl+0xe23/0x10e0 [xfs]
__x64_sys_ioctl+0x76/0xc0
do_syscall_64+0x4e/0x1e0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
kfence-#184: 0xffff88843fe9df80-0xffff88843fe9dfea, size=107, cache=kmalloc-128
allocated by task 3470 on cpu 1 at 263329.131592s (192823.508886s ago):
xfs_init_local_fork+0x79/0xe0 [xfs]
xfs_iformat_local+0xa4/0x170 [xfs]
xfs_iformat_data_fork+0x148/0x180 [xfs]
xfs_inode_from_disk+0x2cd/0x480 [xfs]
xfs_iget+0x450/0xd60 [xfs]
xfs_bulkstat_one_int+0x6b/0x510 [xfs]
xfs_bulkstat_iwalk+0x1e/0x30 [xfs]
xfs_iwalk_ag_recs+0xdf/0x150 [xfs]
xfs_iwalk_run_callbacks+0xb9/0x190 [xfs]
xfs_iwalk_ag+0x1dc/0x2f0 [xfs]
xfs_iwalk_args.constprop.0+0x6a/0x120 [xfs]
xfs_iwalk+0xa4/0xd0 [xfs]
xfs_bulkstat+0xfa/0x170 [xfs]
xfs_ioc_fsbulkstat.isra.0+0x13a/0x230 [xfs]
xfs_file_ioctl+0xbf2/0x10e0 [xfs]
__x64_sys_ioctl+0x76/0xc0
do_syscall_64+0x4e/0x1e0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
CPU: 1 UID: 0 PID: 1300113 Comm: xfs_scrub Not tainted 6.18.0-rc4-djwx #rc4 PREEMPT(lazy) 3d744dd94e92690f00a04398d2bd8631dcef1954
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-4.module+el8.8.0+21164+ed375313 04/01/2014
==================================================================
On further analysis, I realized that the second parameter to min() is
not correct. xfs_ifork::if_bytes is the size of the xfs_ifork::if_data
buffer. if_bytes can be smaller than the data fork size because:
(a) the forkoff code tries to keep the data area as large as possible
(b) for symbolic links, if_bytes is the ondisk file size + 1
(c) forkoff is always a multiple of 8.
Case in point: for a single-byte symlink target, forkoff will be
8 but the buffer will only be 2 bytes long.
In other words, the logic here is wrong and we walk off the end of the
incore buffer. Fix that.
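A user-space sketch of the bounds problem, not the actual xfs code: struct fake_ifork and salvage_inline() are hypothetical, and only the if_data/if_bytes names mirror the description above. The point is that the copy length must be clamped to the size of the incore buffer, not to the larger on-disk fork area.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the incore fork: if_data holds if_bytes bytes. */
struct fake_ifork {
	const char *if_data;
	size_t if_bytes;
};

static size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Copy an inline symlink target into dst.  area_len models the size of
 * the on-disk data fork area (for example forkoff), which may exceed
 * the incore buffer.  Clamping to if_bytes keeps the read in bounds.
 */
static size_t salvage_inline(const struct fake_ifork *ifp, char *dst,
			     size_t area_len)
{
	size_t len;

	assert(ifp != NULL && dst != NULL);
	len = min_size(area_len, ifp->if_bytes);
	memcpy(dst, ifp->if_data, len);
	return len;
}
```

For the single-byte target from the commit message, calling salvage_inline() with an area_len of 8 and an if_bytes of 2 copies only 2 bytes.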
Cc: stable@vger.kernel.org # v6.10
Fixes: 2651923d8d8db0 ("xfs: online repair of symbolic links")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Add the smb3_read_* tracepoints to SMB1's cifs_async_readv() and
cifs_readv_callback().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add proper cleanup of ctx->source and fc->source to the
cifs_parse_mount_err error handler. This ensures that memory allocated
for the source strings is correctly freed on all error paths, matching
the cleanup already performed in the success path by
smb3_cleanup_fs_context_contents().
Pointers are also set to NULL after freeing to prevent potential
double-free issues.
This change fixes a memory leak originally detected by syzbot. The
leak occurred when processing Opt_source mount options if an error
happened after ctx->source and fc->source were successfully
allocated but before the function completed.
The specific leak sequence was:
1. ctx->source = smb3_fs_context_fullpath(ctx, '/') allocates memory
2. fc->source = kstrdup(ctx->source, GFP_KERNEL) allocates more memory
3. A subsequent error jumps to cifs_parse_mount_err
4. The old error handler freed passwords but not the source strings,
causing the memory to leak.
This issue was not addressed by commit e8c73eb7db0a ("cifs: client:
fix memory leak in smb3_fs_context_parse_param"), which only fixed
leaks from repeated fsconfig() calls but not this error path.
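The leak sequence above can be modelled in plain C. This is a simplified sketch, not the cifs code: struct fake_ctx, struct fake_fc, dup_string(), free_sources() and parse_source() are hypothetical, with malloc()/free() standing in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified owners of the two source strings. */
struct fake_ctx { char *source; };
struct fake_fc  { char *source; };

static char *dup_string(const char *s)
{
	size_t n = strlen(s) + 1;
	char *p = malloc(n);

	if (p)
		memcpy(p, s, n);
	return p;
}

/* Free both strings and clear the pointers so a repeated cleanup is harmless. */
static void free_sources(struct fake_ctx *ctx, struct fake_fc *fc)
{
	free(ctx->source);
	ctx->source = NULL;
	free(fc->source);
	fc->source = NULL;
}

/*
 * Two allocations followed by an optional later error.  Returns 0 on
 * success and -1 on failure; on failure nothing stays allocated.
 */
static int parse_source(struct fake_ctx *ctx, struct fake_fc *fc,
			const char *path, int fail_later)
{
	ctx->source = dup_string(path);
	if (!ctx->source)
		goto err;
	fc->source = dup_string(ctx->source);
	if (!fc->source)
		goto err;
	if (fail_later)
		goto err;
	return 0;
err:
	free_sources(ctx, fc);
	return -1;
}
```

Setting the pointers to NULL after freeing means a second call to free_sources() passes NULL to free(), which is a no-op.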
Patch updated with minor change suggested by kernel test robot
Reported-by: syzbot+87be6809ed9bf6d718e3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=87be6809ed9bf6d718e3
Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api")
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Replace close_cached_dir() calls under cfid_list_lock with a new
close_cached_dir_locked() variant that uses kref_put() instead of
kref_put_lock() to avoid recursive locking when dropping references.
While the existing code works if the refcount >= 2 invariant holds,
this area has proven error-prone. Make deadlocks impossible and WARN
on invariant violations.
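A toy model of the two put variants, assuming a plain integer flag in place of the spinlock and plain counters in place of kref. All names here (fake_list, fake_cfid, put_unlocked(), put_locked()) are hypothetical, and the real helper may differ in detail. The sketch only shows why a put that can take the list lock must not run while that lock is already held.

```c
#include <assert.h>

struct fake_list { int locked; };	/* models cfid_list_lock */
struct fake_cfid { int refcount; int released; };

/*
 * Unlocked put: the final put takes the list lock to release the entry.
 * Returns 1 if released, 0 if references remain, and -1 where the real
 * code would deadlock by re-taking a lock that is already held.
 */
static int put_unlocked(struct fake_list *list, struct fake_cfid *cfid)
{
	if (cfid->refcount > 1) {
		cfid->refcount--;
		return 0;
	}
	if (list->locked)
		return -1;
	list->locked = 1;
	cfid->refcount = 0;
	cfid->released = 1;
	list->locked = 0;
	return 1;
}

/*
 * Locked put: the caller holds the list lock, so this variant never
 * takes it.  It refuses (the kernel would WARN) when it would drop the
 * final reference, because the refcount >= 2 invariant is violated.
 */
static int put_locked(struct fake_list *list, struct fake_cfid *cfid)
{
	assert(list->locked);
	if (cfid->refcount < 2)
		return -1;
	cfid->refcount--;
	return 0;
}
```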
Cc: stable@vger.kernel.org
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Now that we have this all ported to a cred guard remove the struct and
the associated helpers.
Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-5-ea3fb15cf427@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Remove the complicated struct ovl_cu_creds dance and use our new copy up
cred guard.
Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-4-ea3fb15cf427@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
They will become unused in the next patch and we'll drop them after the
conversion is finished together with the struct. This keeps the changes
small and reviewable.
Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-3-ea3fb15cf427@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Remove the complicated struct ovl_cu_creds dance and use our new copy up
cred guard.
Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-2-ea3fb15cf427@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a credential guard for copy up. This will allow us to drop
struct ovl_cu_creds and simplify the code.
Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-1-ea3fb15cf427@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
It is now unused and can be removed.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-6-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This clearly indicates the double-credential override and makes the code
a lot easier to grasp with one glance.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-5-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The function will become unused in the next patch.
We'll remove it in later patches to keep the diff legible.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-4-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Reflow the creation routine in preparation of porting it to a guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-3-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This clearly indicates the double-credential override and makes the code
a lot easier to grasp with one glance.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-2-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The current code to override credentials for creation operations is
pretty difficult to understand. We effectively override the credentials
twice:
(1) override with the mounter's credentials
(2) copy the mounter's credentials and override the fs{g,u}id with the inode {u,g}id
And then we elide the revert because it would be an idempotent revert.
That elision doesn't buy us anything anymore though because I've made it
all work without any reference counting anyway. All it does is mix the
two credential overrides together.
We can use a cleanup guard to clarify the creation codepaths and make
them easier to understand.
This just introduces the cleanup guard keeping the patch reviewable.
We'll convert the caller in follow-up patches and then drop the
duplicated code.
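The double override can be sketched in user space with the GCC/Clang cleanup variable attribute. This is an illustration only: struct fake_cred, current_cred_ptr, CRED_GUARD() and create_as() are hypothetical and do not reflect the overlayfs implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal credential object. */
struct fake_cred { unsigned int fsuid, fsgid; };

/* Models the credentials currently active for the task. */
static const struct fake_cred *current_cred_ptr;

struct cred_guard { const struct fake_cred *old; };

static struct cred_guard cred_guard_enter(const struct fake_cred *new_cred)
{
	struct cred_guard g = { current_cred_ptr };

	assert(new_cred != NULL);
	current_cred_ptr = new_cred;
	return g;
}

static void cred_guard_exit(struct cred_guard *g)
{
	current_cred_ptr = g->old;
}

/* Override now, revert automatically when the variable leaves scope. */
#define CRED_GUARD(name, cred) \
	struct cred_guard name __attribute__((cleanup(cred_guard_exit))) = \
		cred_guard_enter(cred)

/* Returns the fsuid in effect during creation, or 0 on the failure path. */
static unsigned int create_as(const struct fake_cred *mounter,
			      unsigned int uid, unsigned int gid, int fail)
{
	/* (1) override with the mounter's credentials */
	CRED_GUARD(outer, mounter);

	/* (2) copy them and override fsuid/fsgid with the inode's ids */
	struct fake_cred creator = *current_cred_ptr;
	creator.fsuid = uid;
	creator.fsgid = gid;
	CRED_GUARD(inner, &creator);

	if (fail)
		return 0;	/* both guards still revert, in reverse order */
	return current_cred_ptr->fsuid;
}
```

Because each override has its own guard, every return path reverts both in reverse order and no explicit revert call is needed.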
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-1-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The wrapper isn't needed anymore. Overlayfs completely relies on its
cleanup guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-42-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-41-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Split the core into a separate helper in preparation of converting the
caller to the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-40-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-39-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-38-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Split the core into a separate helper in preparation of converting the
caller to the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-37-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-36-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-35-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Extract the code that runs under overridden credentials into a separate
ovl_rename_upper() helper function and the code that runs before/after to
ovl_rename_start/end(). Error handling is simplified.
The helpers return errors directly instead of using goto labels.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-34-b31603935724@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a struct ovl_renamedata to group rename-related state that was
previously stored in local variables. Embed struct renamedata directly,
aligning with the VFS.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-33-b31603935724@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-32-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-31-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-30-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-29-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-28-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-27-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-26-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Factor out ovl_iterate_merged() and move some code into
ovl_iterate_real() for easier use of the scoped ovl cred guard.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-25-b31603935724@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The function is only called when rdd->dentry is non-NULL:
if (!err && rdd->first_maybe_whiteout && rdd->dentry)
err = ovl_check_whiteouts(realpath, rdd);
| Caller | Sets rdd->dentry? | Can call ovl_check_whiteouts()? |
|-------------------------------|-------------------|---------------------------------|
| ovl_dir_read_merged() | ✓ Yes (line 430) | ✓ YES |
| ovl_dir_read_impure() | ✗ No | ✗ NO |
| ovl_check_d_type_supported() | ✗ No | ✗ NO |
| ovl_workdir_cleanup_recurse() | ✗ No | ✗ NO |
| ovl_indexdir_cleanup() | ✗ No | ✗ NO |
VFS layer (.iterate_shared file operation)
  → ovl_iterate()
      [CRED OVERRIDE]
      → ovl_cache_get()
          → ovl_dir_read_merged()
              → ovl_dir_read()
                  → ovl_check_whiteouts()
      [CRED REVERT]
ovl_unlink()
  → ovl_do_remove()
      → ovl_check_empty_dir()
          [CRED OVERRIDE]
          → ovl_dir_read_merged()
              → ovl_dir_read()
                  → ovl_check_whiteouts()
          [CRED REVERT]
ovl_rename()
  → ovl_check_empty_dir()
      [CRED OVERRIDE]
      → ovl_dir_read_merged()
          → ovl_dir_read()
              → ovl_check_whiteouts()
      [CRED REVERT]
All valid callchains already override credentials so drop the override.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-24-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-23-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-22-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-21-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-20-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-19-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-18-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-17-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-16-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-15-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-14-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-13-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-12-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-11-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-10-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-9-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-8-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-7-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-6-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-5-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-4-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-3-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the scoped ovl cred guard.
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-2-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Overlayfs plucks the relevant creds from the superblock. Extend the
override_creds cleanup class I added to override_creds_ovl which uses
the ovl_override_creds() function as initialization helper. Add
with_ovl_creds() based on this new class.
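A scoped block of this kind can be built from a cleanup attribute and a one-shot for loop. The sketch below is a user-space approximation with hypothetical names (active_cred, with_fake_creds()); it takes the credential directly instead of deriving it from a superblock and is not the kernel macro.

```c
#include <assert.h>

/* Models the id of the credentials currently active for the task. */
static int active_cred;

struct scoped_cred { int old; int done; };

static struct scoped_cred scoped_cred_enter(int new_cred)
{
	struct scoped_cred s = { active_cred, 0 };

	active_cred = new_cred;
	return s;
}

static void scoped_cred_exit(struct scoped_cred *s)
{
	active_cred = s->old;
}

/* Run the following block exactly once under the overridden credentials. */
#define with_fake_creds(cred) \
	for (struct scoped_cred _s __attribute__((cleanup(scoped_cred_exit))) = \
		     scoped_cred_enter(cred); !_s.done; _s.done = 1)

static int read_under_override(int mounter_cred)
{
	int seen = -1;

	with_fake_creds(mounter_cred) {
		seen = active_cred;	/* runs with the override in place */
	}
	/* the original credentials are active again here */
	return seen;
}
```

The loop variable goes out of scope when the block ends, including on break or return, so the revert always runs.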
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-1-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Bring in the directory locking changes as they touch overlayfs in a
pretty substantial way and we are about to change the credential
override semantics quite substantially as well.
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Bring in the shared branch with the kbuild tree to enable
'-fms-extensions' for 6.19. The overlayfs cred guard work
depends on this.
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In the stock variant the compiler spills several registers on the stack
and employs stack smashing protection, adding even more code + a branch
on exit.
The actual fast path is small enough that the compiler inlines it for
all callers -- the symbol is no longer emitted.
Forcing noinline on it just for code-measurement purposes shows the fast
path dropping from 111 to 39 bytes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251114201803.2183505-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Commit 2f952c9e8fe1 ("fs: split fileattr related helpers into separate
file") added various exports without users despite claiming to be a
simple refactor. Drop them again.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251119101415.2732320-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If a mount namespace contains autofs mounts, and they are propagation
private, and there is no namespace specific automount daemon to handle
possible automounting then attempted path resolution will loop until
MAXSYMLINKS is reached before failing causing quite a bit of noise in
the log.
Add a check for this in autofs ->d_automount() so that the VFS can
immediately return an error in this case. Since the mount is propagation
private an EPERM return seems most appropriate.
Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://patch.msgid.link/20251118024631.10854-2-raven@themaw.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use %ptSp instead of open coded variants to print content of
struct timespec64 in human readable format.
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20251113150217.3030010-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
A new resource group is intended to be created with sane defaults. For a cache
resource this means all cache portions the new group could possibly allocate
into. This includes unused cache portions and shareable cache portions used by
other groups and hardware.
New resource group creation does not take sparse masks into account. After
determining the bitmask reflecting the new group's possible allocations the
bitmask is forced to be contiguous even if the system supports sparse masks.
For example, a new group could by default allocate into a large portion of
cache represented by 0xff0f, but it is instead created with a mask of 0xf.
Do not force a contiguous allocation range if the system supports sparse masks.
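The 0xff0f example can be reproduced with a few lines of bit manipulation. This is a sketch under assumptions: lowest_contiguous_run() and default_group_mask() are hypothetical helpers, and keeping the lowest run of set bits is inferred from the example above rather than taken from the resctrl source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Keep only the lowest contiguous run of set bits in mask. */
static uint32_t lowest_contiguous_run(uint32_t mask)
{
	uint32_t run = 0;
	uint32_t bit;

	if (mask == 0)
		return 0;

	bit = mask & (~mask + 1);	/* isolate the lowest set bit */
	while (bit && (mask & bit)) {
		run |= bit;
		bit <<= 1;
	}
	return run;
}

/*
 * Default mask for a new group: the full set of possible portions when
 * sparse masks are supported, otherwise a single contiguous run.
 */
static uint32_t default_group_mask(uint32_t possible, bool sparse_masks)
{
	return sparse_masks ? possible : lowest_contiguous_run(possible);
}
```

With possible = 0xff0f, the non-sparse path yields 0xf, while the sparse path keeps all of 0xff0f.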
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/abbbb008bc09d982d715e79d3b885c10f92c64e0.1763426240.git.reinette.chatre@intel.com
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix uninitialized variable in statmount_string()
- Fix hostfs mounting when passing host root during boot
- Fix dynamic lookup to fail on cell lookup failure
- Fix missing file type when reading bfs inodes from disk
- Enforce checking of sb_min_blocksize() calls and update all callers
accordingly
- Restore write access before closing files opened by open_exec() in
binfmt_misc
- Always freeze efivarfs during suspend/hibernate cycles
- Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to
actually allow mount namespace file descriptor in addition to mount
namespace ids
- Fix tmpfs remount when noswap is specified
- Switch Landlock to iput_not_last() to remove false-positives from
might_sleep() annotations in iput()
- Remove dead node_to_mnt_ns() code
- Ensure that per-queue kobjects are successfully created
* tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
landlock: fix splats from iput() after it started calling might_sleep()
fs: add iput_not_last()
shmem: fix tmpfs reconfiguration (remount) when noswap is set
fs/namespace: correctly handle errors returned by grab_requested_mnt_ns
power: always freeze efivarfs
binfmt_misc: restore write access before closing files opened by open_exec()
block: add __must_check attribute to sb_min_blocksize()
virtio-fs: fix incorrect check for fsvq->kobj
xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super
isofs: check the return value of sb_min_blocksize() in isofs_fill_super
exfat: check return value of sb_min_blocksize in exfat_read_boot_sector
vfat: fix missing sb_min_blocksize() return value checks
mnt: Remove dead code which might prevent from building
bfs: Reconstruct file type when loading from disk
afs: Fix dynamic lookup to fail on cell lookup failure
hostfs: Fix only passing host root in boot stage with new mount
fs: Fix uninitialized 'offp' in statmount_string()
|
|
We have reworked namespaces sufficiently that all this special-casing
shouldn't be needed anymore
Link: https://patch.msgid.link/20251117-eidesstattlich-apotheke-36d2e644079f@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull NFS client fixes from Anna Schumaker:
- Various fixes when using NFS with TLS
- Localio direct-IO fixes
- Fix error handling in nfs_atomic_open_v23()
- Fix sysfs memory leak when nfs_client kobject add fails
- Fix an incorrect parameter when calling nfs4_call_sync()
- Fix a failing LTP test when using delegated timestamps
* tag 'nfs-for-6.18-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Fix LTP test failures when timestamps are delegated
NFSv4: Fix an incorrect parameter when calling nfs4_call_sync()
NFS: sysfs: fix leak when nfs_client kobject add fails
NFSv2/v3: Fix error handling in nfs_atomic_open_v23()
nfs/localio: do not issue misaligned DIO out-of-order
nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
nfs/localio: backfill missing partial read support for misaligned DIO
nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
NFS: Check the TLS certificate fields in nfs_match_client()
pnfs: Set transport security policy to RPC_XPRTSEC_NONE unless using TLS
pnfs: Fix TLS logic in _nfs4_pnfs_v4_ds_connect()
pnfs: Fix TLS logic in _nfs4_pnfs_v3_ds_connect()
|
|
Pull smb client fixes from Steve French:
- Multichannel reconnect channel selection fix
- Fix for smbdirect (RDMA) disconnect bug
- Fix for incorrect username length check
- Fix memory leak in mount parm processing
* tag 'v6.18-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: let smbd_disconnect_rdma_connection() turn CREATED into DISCONNECTED
smb: fix invalid username check in smb3_fs_context_parse_param()
cifs: client: fix memory leak in smb3_fs_context_parse_param
smb: client: fix cifs_pick_channel when channel needs reconnect
|
|
Occasionally the caller of end_creating() wants to keep using the dentry.
Rather than requiring them to dget() the dentry (when not an error)
before calling end_creating(), provide end_creating_keep() which does
this.
cachefiles and overlayfs make use of this.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-16-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
vfs_mkdir() already drops the reference to the dentry on failure but it
leaves the parent locked.
This complicates end_creating() which needs to unlock the parent even
though the dentry is no longer available.
If we change vfs_mkdir() to unlock on failure as well as releasing the
dentry, we can remove the "parent" arg from end_creating() and simplify
the rules for calling it.
Note that cachefiles_get_directory() can choose to substitute an error
instead of actually calling vfs_mkdir(), for fault injection. In that
case it needs to call end_creating(), just as vfs_mkdir() now does on
error.
ovl_create_real() will now unlock on error. So the conditional
end_creating() after the call is removed, and end_creating() is called
internally on error.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-15-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This requires the addition of start_creating_dentry() which is given the
dentry which has already been found, and asks for it to be locked and
its parent validated.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-14-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
A few callers want to lock for a rename and already have both dentries.
Also debugfs does want to perform a lookup but doesn't want permission
checking, so start_renaming_dentry() cannot be used.
This patch introduces start_renaming_two_dentries() which is given both
dentries. debugfs performs one lookup itself. As it will only continue
with a negative dentry and as those cannot be renamed or unlinked, it is
safe to do the lookup before getting the rename locks.
overlayfs uses start_renaming_two_dentries() in three places and selinux
uses it twice in sel_make_policy_nodes().
In sel_make_policy_nodes() we now lock for rename twice instead of just
once so the combined operation is no longer atomic w.r.t the parent
directory locks. As selinux_state.policy_mutex is held across the whole
operation this does not open up any interesting races.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-13-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Several callers perform a rename on a dentry they already have, and only
require lookup for the target name. This includes smb/server and a few
different places in overlayfs.
start_renaming_dentry() performs the required lookup and takes the
required lock using lock_rename_child().
It is used in three places in overlayfs and in ksmbd_vfs_rename().
In the ksmbd case, the parent of the source is not important - the
source must be renamed from wherever it is. So start_renaming_dentry()
allows rd->old_parent to be NULL and only checks it if it is non-NULL.
On success rd->old_parent will be the parent of old_dentry with an extra
reference taken. The other start_renaming functions now also take the
extra reference, and end_renaming() now drops this reference as well.
ovl_lookup_temp(), ovl_parent_lock(), and ovl_parent_unlock() are
all removed as they are no longer needed.
OVL_TEMPNAME_SIZE and ovl_tempname() are now declared in overlayfs.h so
that ovl_check_rename_whiteout() can access them.
ovl_copy_up_workdir() now always cleans up on error.
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-12-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
start_renaming() combines name lookup and locking to prepare for rename.
It is used when two names need to be looked up as in nfsd and overlayfs -
cases where one or both dentries are already available will be handled
separately.
__start_renaming() avoids the inode_permission check and hash
calculation and is suitable after filename_parentat() in do_renameat2().
It subsumes quite a bit of code from that function.
start_renaming() does calculate the hash and check X permission and is
suitable elsewhere:
- nfsd_rename()
- ovl_rename()
In ovl, ovl_do_rename_rd() is factored out of ovl_do_rename(), which
itself will be gone by the end of the series.
Acked-by: Chuck Lever <chuck.lever@oracle.com> (for nfsd parts)
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: NeilBrown <neil@brown.name>
--
Changes since v3:
- added missing dput() in ovl_rename when "whiteout" is not-NULL.
Changes since v2:
- in __start_renaming() some labels have been renamed, and err
is always set before a "goto out_foo" rather than passing the
error in a dentry*.
- ovl_do_rename() changed to call the new ovl_do_rename_rd() rather
than keeping duplicate code
- code around ovl_cleanup() call in ovl_rename() restructured.
Link: https://patch.msgid.link/20251113002050.676694-11-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
These are similar to start_creating() and start_removing(), but allow a
fatal signal to abort waiting for the lock.
They are used in btrfs for subvol creation and removal.
btrfs_may_create() no longer needs IS_DEADDIR() and
start_creating_killable() includes that check.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-10-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
start_removing_dentry() is similar to start_removing() but instead of
providing a name for lookup, the target dentry is given.
start_removing_dentry() checks that the dentry is still hashed and in
the parent, and if so it locks and increases the refcount so that
end_removing() can be used to finish the operation.
This is used in cachefiles, overlayfs, smb/server, and apparmor.
There will be other users including ecryptfs.
As start_removing_dentry() takes an extra reference to the dentry (to be
put by end_removing()), there is no need to explicitly take an extra
reference to stop d_delete() from using dentry_unlink_inode() to negate
the dentry - as in cachefiles_delete_object(), and ksmbd_vfs_unlink().
cachefiles_bury_object() now gets an extra ref to the victim, which it
drops. As it includes the needed end_removing() calls, the caller
doesn't need them.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-9-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Sometimes smb2_create_link() needs to remove the target before creating
the link.
It uses ksmbd_vfs_kern_locked(), and is the only user of that interface.
To match the new naming, that function is changed to
ksmbd_vfs_kern_start_removing(), and related functions or flags are also
renamed.
The lock actually happens in ksmbd_vfs_path_lookup() and that is changed
to use start_removing_noperm() - permission to perform lookup in the
parent was already checked in vfs_path_parent_lookup().
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-8-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
xfs, fuse, ipc/mqueue need variants of start_creating or start_removing
which do not check permissions.
This patch adds _noperm versions of these functions.
Note that do_mq_open() was only calling mntget() so it could call
path_put() - it didn't really need an extra reference on the mnt.
Now it doesn't call mntget() and uses end_creating() which does
the dput() half of path_put().
Also mq_unlink() previously passed
d_inode(dentry->d_parent)
as the dir inode to vfs_unlink(). This is after locking
d_inode(mnt->mnt_root)
These two inodes are the same, but normally calls use the textual
parent.
So I've changed the vfs_unlink() call to be given d_inode(mnt->mnt_root).
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
--
changes since v2:
- dir arg passed to vfs_unlink() in mq_unlink() changed to match
the dir passed to lookup_noperm()
- restore assignment to path->mnt even though the mntget() is removed.
Link: https://patch.msgid.link/20251113002050.676694-7-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
start_removing() is similar to start_creating() but will only return a
positive dentry with the expectation that it will be removed. This is
used by nfsd, cachefiles, and overlayfs. They are changed to also use
end_removing() to terminate the action begun by start_removing(). This
is a simple alias for end_dirop().
Apart from changes to the error paths, as we no longer need to unlock on
a lookup error, an effect on callers is that they don't need to test if
the found dentry is positive or negative - they can be sure it is
positive.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-6-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
start_creating() is similar to simple_start_creating() but is not so
simple.
It takes a qstr for the name, includes permission checking, and does NOT
report an error if the name already exists, returning a positive dentry
instead.
This is currently used by nfsd, cachefiles, and overlayfs.
end_creating() is called after the dentry has been used.
end_creating() drops the reference to the dentry as it is generally no
longer needed. This is exactly the first section of end_creating_path()
so that function is changed to call the new end_creating()
These calls help encapsulate locking rules so that directory locking can
be changed.
Occasionally this change means that the parent lock is held for a
shorter period of time, for example in cachefiles_commit_tmpfile().
As this function now unlocks after an unlink and before the following
lookup, it is possible that the lookup could again find a positive
dentry, so a while loop is introduced there.
In overlayfs the ovl_lookup_temp() function has ovl_tempname()
split out to be used in ovl_start_creating_temp(). The other use
of ovl_lookup_temp() is preparing for a rename. When rename handling
is updated, ovl_lookup_temp() will be removed.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-5-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The simplification of locking in the previous patch opens up some room
for tidying up do_unlinkat():
- change all "exit" labels to describe what will happen at the label.
- always goto an exit label on an error - unwrap the "if (!IS_ERR())" branch.
- Move the "slashes" handling inline, but mark it as unlikely()
- simplify use of the "inode" variable - we no longer need to test for NULL.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-4-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The fact that directory operations (create,remove,rename) are protected
by a lock on the parent is known widely throughout the kernel.
In order to change this - to instead lock the target dentry - it is
best to centralise this knowledge so it can be changed in one place.
This patch introduces start_dirop() which is local to VFS code.
It performs the required locking for create and remove. Rename
will be handled separately.
Various functions with names like start_creating() or start_removing_path(),
some of which already exist, will export this functionality beyond the VFS.
end_dirop() is the partner of start_dirop(). It drops the lock and
releases the reference on the dentry.
It *is* exported so that various end_creating etc functions can be inline.
As vfs_mkdir() drops the dentry on error we cannot use end_dirop() as
that won't unlock when the dentry IS_ERR(). For now we need an explicit
unlock when dentry IS_ERR(). I hope to change vfs_mkdir() to unlock
when it drops a dentry so that explicit unlock can go away.
end_dirop() can always be called on the result of start_dirop(), but not
after vfs_mkdir(). After a vfs_mkdir() we still may need the explicit
unlock as seen in end_creating_path().
As well as adding start_dirop() and end_dirop()
this patch uses them in:
- simple_start_creating (which requires sharing lookup_noperm_common()
with libfs.c)
- start_removing_path / start_removing_user_path_at
- filename_create / end_creating_path()
- do_rmdir(), do_unlinkat()
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-3-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
By not using the generic end_creating() name here we are free to use it
more globally for a more generic function.
This should have been done when start_creating() was renamed.
For consistency, also rename failed_creating().
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-2-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.18-rc6).
No conflicts, adjacent changes in:
drivers/net/phy/micrel.c
96a9178a29a6 ("net: phy: micrel: lan8814 fix reset of the QSGMII interface")
61b7ade9ba8c ("net: phy: micrel: Add support for non PTP SKUs for lan8814")
and a trivial one in tools/testing/selftests/drivers/net/Makefile.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rationale:
- ND_ROOT_PRESET is only set in a condition already marked unlikely
- LOOKUP_IS_SCOPED already has unlikely on it, but inconsistently
applied
- set_root() only fails if there is a bug
- most names are not empty (see !*s)
- most of the time path_init() does not encounter LOOKUP_CACHED without
LOOKUP_RCU
- LOOKUP_IN_ROOT is a rarely seen flag
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251105150630.756606-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
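
As a rough illustration of the kind of annotation listed in the rationale
above, here is a userspace sketch, assuming GCC or Clang. The two macros
follow the well-known __builtin_expect() pattern; parse_flags() and the
FLAG_* constants are hypothetical and are not taken from path_init().

```c
/* Userspace copies of the kernel-style branch hints (GCC/Clang builtin). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical flags, for illustration only. */
#define FLAG_RARE_A 0x1u
#define FLAG_RARE_B 0x2u

/* Returns 0 on the common path and a negative value on the rare ones. */
int parse_flags(const char *s, unsigned int flags)
{
	if (unlikely(!*s))			/* most names are not empty */
		return -1;
	if (unlikely(flags & FLAG_RARE_A))	/* rarely seen flag */
		return -2;
	if (unlikely(flags & FLAG_RARE_B))	/* rarely seen flag */
		return -3;
	return 0;
}
```

The hints do not change what the function returns; they only tell the
compiler which side of each branch to lay out as the fall-through path.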
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Add Chunhai Guo as an EROFS reviewer to get more eyes from interested
industry vendors
- Fix infinite loop caused by incomplete crafted zstd-compressed data
(thanks to Robert again!)
* tag 'erofs-for-6.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: avoid infinite loop due to incomplete zstd-compressed data
MAINTAINERS: erofs: add myself as reviewer
|
|
Pull smb server fixes from Steve French:
- Fix smbdirect (RDMA) disconnect hang bug
- Fix potential Denial of Service when connection limit exceeded
- Fix smbdirect (RDMA) connection (potentially accessing freed memory)
bug
* tag 'v6.18-rc5-smb-server-fixes' of git://git.samba.org/ksmbd:
smb: server: let smb_direct_disconnect_rdma_connection() turn CREATED into DISCONNECTED
ksmbd: close accepted socket when per-IP limit rejects connection
smb: server: rdma: avoid unmapping posted recv on accept failure
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Address recently reported issues or issues found at the recent NFS
bake-a-thon held in Raleigh, NC.
Issues reported with v6.18-rc:
- Address a kernel build issue
- Reorder SEQUENCE processing to avoid spurious NFS4ERR_SEQ_MISORDERED
Issues that need expedient stable backports:
- Close a refcount leak exposure
- Report support for NFSv4.2 CLONE correctly
- Fix oops during COPY_NOTIFY processing
- Prevent rare crash after XDR encoding failure
- Prevent crash due to confused or malicious NFSv4.1 client"
* tag 'nfsd-6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
Revert "SUNRPC: Make RPCSEC_GSS_KRB5 select CRYPTO instead of depending on it"
nfsd: ensure SEQUENCE replay sends a valid reply.
NFSD: Never cache a COMPOUND when the SEQUENCE operation fails
NFSD: Skip close replay processing if XDR encoding fails
NFSD: free copynotify stateid in nfs4_free_ol_stateid()
nfsd: add missing FATTR4_WORD2_CLONE_BLKSIZE from supported attributes
nfsd: fix refcount leak in nfsd_set_fh_dentry()
|
|
On stock kernel gcc 14 emits avoidable register spillage:
endbr64
call ffffffff81374630 <__fentry__>
push %r13
push %r12
push %rbx
sub $0x8,%rsp
[snip]
Total fast path is 99 bytes.
Moving the slowpath out avoids it and shortens the fast path to 74
bytes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251110095634.1433061-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
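
A minimal sketch of the general technique, not of the routine touched by
this commit: the rare case is moved into a separate noinline function so
that only the hot path determines the caller's prologue. struct counter
and both functions are hypothetical, the attributes are GCC/Clang
extensions, and whether register spills actually go away depends on the
compiler and the surrounding code.

```c
#include <stdbool.h>

struct counter {
	long value;
	long limit;
	long overflows;
};

/*
 * Rare case, kept out of line so that its register and stack needs do
 * not have to be paid for by the hot caller.
 */
__attribute__((noinline, cold))
bool counter_add_slow(struct counter *c)
{
	c->overflows++;
	c->value = c->limit;	/* clamp to the limit */
	return false;
}

/* Hot path: one compare and one add when the limit is not hit. */
static inline bool counter_add(struct counter *c, long n)
{
	if (__builtin_expect(c->value + n > c->limit, 0))
		return counter_add_slow(c);
	c->value += n;
	return true;
}
```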
|
|
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251105153622.758836-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Helps out some of the asm, the routine is still a mess.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251109125254.1288882-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The primary non-MAY_WRITE consumer now uses lookup_inode_permission_may_exec().
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251107142149.989998-4-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Root filesystem was ext4, btrfs was mounted on /testfs.
Then issuing access(2) in a loop on /testfs/repos/linux/include/linux/fs.h
on Sapphire Rapids (ops/s):
before: 3447976
after: 3620879 (+5%)
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251107142149.989998-3-mjguzik@gmail.com
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The generic inode_permission() routine does work which is known to be of
no significance for lookup. There are checks for MAY_WRITE, while the
requested permission is MAY_EXEC. Additionally devcgroup_inode_permission()
is called to check for devices, but it is an invariant the inode is a
directory.
Absent a ->permission func, execution lands in generic_permission()
which checks upfront if the requested permission is granted for
everyone.
We can elide the branches which are guaranteed to be false and cut
straight to the check if everyone happens to be allowed MAY_EXEC on the
inode (which holds true most of the time).
Moreover, filesystems which provide their own ->permission routine can
take advantage of the optimization by setting the IOP_FASTPERM_MAY_EXEC
flag on their inodes, which they can legitimately do if their MAY_EXEC
handling matches generic_permission().
As a simple benchmark, as part of compilation gcc issues access(2) on
numerous long paths, for example /usr/lib/gcc/x86_64-linux-gnu/12/crtendS.o
Issuing access(2) on it in a loop on ext4 on Sapphire Rapids (ops/s):
before: 3797556
after: 3987789 (+5%)
Note: this depends on the not-yet-landed ext4 patch to mark inodes with
cache_no_acl()
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251107142149.989998-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Currently, the logic for skipping the read range for a write is
if (!(iter->flags & IOMAP_UNSHARE) &&
(from <= poff || from >= poff + plen) &&
(to <= poff || to >= poff + plen))
which breaks down to skipping the read if any of these are true:
a) from <= poff && to <= poff
b) from <= poff && to >= poff + plen
c) from >= poff + plen && to <= poff
d) from >= poff + plen && to >= poff + plen
This can be simplified to
if (!(iter->flags & IOMAP_UNSHARE) && from <= poff && to >= poff + plen)
from the following reasoning:
a) from <= poff && to <= poff
This reduces to 'to <= poff' since it is guaranteed that 'from <= to'
(since to = from + len). It is not possible for 'from <= to' to be true
here because we only reach here if plen > 0 (thanks to the preceding 'if
(plen == 0)' check that would break us out of the loop). If 'to <=
poff', plen would have to be 0 since poff and plen get adjusted in
lockstep for uptodate blocks. This means we can eliminate this check.
c) from >= poff + plen && to <= poff
This is not possible since 'from <= to' and 'plen > 0'. We can eliminate
this check.
d) from >= poff + plen && to >= poff + plen
This reduces to 'from >= poff + plen' since 'from <= to'.
It is not possible for 'from >= poff + plen' to be true here. We only
reach here if plen > 0 and for writes, poff and plen will always be
block-aligned, which means poff <= from < poff + plen. We can eliminate
this check.
The only valid check is b) from <= poff && to >= poff + plen.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-7-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
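
The case analysis above can be checked by brute force. The sketch below
is a userspace test, not kernel code: the function names are
hypothetical, the IOMAP_UNSHARE flag test is left out, and the
preconditions are the ones argued in the commit message (plen > 0,
from <= to, case a and case d cannot happen).

```c
#include <stdbool.h>
#include <stddef.h>

/* Original range test. */
bool skip_read_old(size_t from, size_t to, size_t poff, size_t plen)
{
	return (from <= poff || from >= poff + plen) &&
	       (to <= poff || to >= poff + plen);
}

/* Simplified range test. */
bool skip_read_new(size_t from, size_t to, size_t poff, size_t plen)
{
	return from <= poff && to >= poff + plen;
}

/*
 * Compare both tests on every small input that satisfies the
 * preconditions: plen > 0, from <= to, to > poff (rules out case a)
 * and from < poff + plen (rules out case d).
 * Returns the number of inputs on which the two tests disagree.
 */
size_t count_mismatches(size_t limit)
{
	size_t from, to, poff, plen, bad = 0;

	for (from = 0; from < limit; from++) {
		for (to = from; to < limit; to++) {
			for (poff = 0; poff < limit; poff++) {
				for (plen = 1; plen < limit; plen++) {
					if (to <= poff || from >= poff + plen)
						continue;	/* excluded */
					if (skip_read_old(from, to, poff, plen) !=
					    skip_read_new(from, to, poff, plen))
						bad++;
				}
			}
		}
	}
	return bad;
}
```

Without the preconditions the two tests do differ, for example for
from = to = 0, poff = 4, plen = 4, so the simplification relies on the
invariants and not on boolean algebra alone.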
|
|
Instead of requiring that the caller calls iomap_finish_folio_read()
even if the ->read_folio_range() callback returns an error, account for
this internally in iomap instead, which makes the interface simpler and
makes it match writeback's ->read_folio_range() error handling
expectations.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-6-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pending writebacks must be accounted for to determine when all requests
have completed and writeback on the folio should be ended. Currently
this is done by atomically incrementing ifs->write_bytes_pending for
every range to be written back.
Instead, the number of atomic operations can be minimized by setting
ifs->write_bytes_pending to the folio size, internally tracking how many
bytes are written back asynchronously, and then after sending off all
the requests, decrementing ifs->write_bytes_pending by the number of
bytes not written back asynchronously. Now, for N ranges written back,
only N + 2 atomic operations are required instead of 2N + 2.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-5-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
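
A single-threaded sketch of the biased counter described above, using
C11 atomics. struct folio_state, pending_sub() and writeback_folio() are
hypothetical names, the atomic_ops and writeback_ended fields exist only
to make the count visible, and the real iomap code differs in detail.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct folio_state {
	atomic_size_t write_bytes_pending;
	size_t atomic_ops;	/* bookkeeping for the demo only */
	int writeback_ended;	/* how many times writeback was ended */
};

/* Subtract from the pending count; end writeback when it reaches zero. */
void pending_sub(struct folio_state *fs, size_t bytes)
{
	size_t old = atomic_fetch_sub(&fs->write_bytes_pending, bytes);

	fs->atomic_ops++;
	if (old == bytes)
		fs->writeback_ended++;
}

/*
 * Write back n ranges of the given lengths from a folio of folio_size
 * bytes. Completions run last here; the bias keeps the count above
 * zero until the submitter has dropped its share.
 */
void writeback_folio(struct folio_state *fs, size_t folio_size,
		     const size_t *lens, size_t n)
{
	size_t submitted = 0, i;

	/* 1 op: bias the counter by the whole folio size. */
	atomic_store(&fs->write_bytes_pending, folio_size);
	fs->atomic_ops++;

	/* Plain local arithmetic, no atomics per submitted range. */
	for (i = 0; i < n; i++)
		submitted += lens[i];

	/* 1 op: drop the part of the bias that was never submitted. */
	pending_sub(fs, folio_size - submitted);

	/* N ops: one per completed range. */
	for (i = 0; i < n; i++)
		pending_sub(fs, lens[i]);
}
```

With three ranges the sketch performs 3 + 2 atomic operations and ends
writeback exactly once, matching the N + 2 figure in the commit message.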
|
|
The end position to start truncating from may be at an offset into a
block, which under the current logic would result in overtruncation.
Adjust the calculation to account for unaligned end offsets.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-3-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
bytes_submitted/bytes_not_submitted
The names "bytes_pending" and "bytes_accounted" may be confusing.
Rename them to "bytes_submitted" and "bytes_not_submitted" to make it
clearer that they track the bytes we passed (or did not pass) to the IO
helper to read in.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-2-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251105212025.807549-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
grab_requested_mnt_ns was changed to return error codes on failure, but
its callers were not updated to check for error pointers, still checking
only for a NULL return value.
This commit updates the callers to use IS_ERR() or IS_ERR_OR_NULL() and
PTR_ERR() to correctly check for and propagate errors.
This also makes sure that the logic actually works and mount namespace
file descriptors can be used to refer to mounts.
Christian Brauner <brauner@kernel.org> says:
Rework the patch to be more ergonomic and in line with our overall error
handling patterns.
Fixes: 7b9d14af8777 ("fs: allow mount namespace fd")
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
Link: https://patch.msgid.link/20251111062815.2546189-1-avagin@google.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
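
The class of bug fixed here can be reproduced in userspace. In the
sketch below ERR_PTR(), PTR_ERR() and IS_ERR() are imitations of the
kernel helpers of the same names, while struct ns, grab_ns() and the two
callers are hypothetical and only model the situation described above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace imitations of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct ns { int id; };

/* Hypothetical lookup: reports failure as an error pointer, never NULL. */
struct ns *grab_ns(struct ns *table, size_t n, int id)
{
	size_t i;

	if (id < 0)
		return ERR_PTR(-EINVAL);
	for (i = 0; i < n; i++)
		if (table[i].id == id)
			return &table[i];
	return ERR_PTR(-ENOENT);
}

/* Stale caller: only tests for NULL, so an error pointer looks valid. */
int use_ns_null_check(struct ns *table, size_t n, int id)
{
	struct ns *ns = grab_ns(table, n, id);

	if (!ns)
		return -ENOENT;
	return 0;	/* reached even when ns is an error pointer */
}

/* Updated caller: tests with IS_ERR() and propagates PTR_ERR(). */
int use_ns_is_err(struct ns *table, size_t n, int id)
{
	struct ns *ns = grab_ns(table, n, id);

	if (IS_ERR(ns))
		return (int)PTR_ERR(ns);
	return 0;
}
```

The stale caller reports success for a failed lookup, and a real caller
would then go on to dereference the error pointer.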
|
|
The efivarfs filesystems must always be frozen and thawed to resync
variable state. Make it so.
Link: https://patch.msgid.link/20251105-vorbild-zutreffen-fe00d1dd98db@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Now that support for recallable directory delegations is available,
expose this functionality to userland with new F_SETDELEG and F_GETDELEG
commands for fcntl().
Note that this also allows userland to request a FL_DELEG type lease on
files too. Userland applications that do will get signalled when there
are metadata changes in addition to just data changes (which is a
limitation of FL_LEASE leases).
These commands accept a new "struct delegation" argument that contains a
flags field for future expansion.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-17-52f3feebb2f2@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a new routine for acquiring a read delegation on a directory. These
are recallable-only delegations with no support for CB_NOTIFY. That will
be added in a later phase.
Since the same CB_RECALL/DELEGRETURN infrastructure is used for regular
and directory delegations, a normal nfs4_delegation is used to represent
a directory delegation.
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-16-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
As Trond pointed out: "...provided that the presented stateid is
actually valid, it is also sufficient to uniquely identify the file to
which it is associated (see RFC8881 Section 8.2.4), so the filehandle
should be considered mostly irrelevant for operations like DELEGRETURN."
Don't ask fh_verify to filter on file type.
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-15-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The filecache infrastructure will only handle S_IFREG files at the
moment. Directory delegations will require adding support for opening
S_IFDIR inodes.
Plumb a "type" argument into nfsd_file_do_acquire() and have all of the
existing callers set it to S_IFREG. Add a new nfsd_file_acquire_dir()
wrapper that nfsd can call to request a nfsd_file that holds a directory
open.
For now, there is no need for a fsnotify_mark for directories, as
CB_NOTIFY is not yet supported. Change nfsd_file_do_acquire() to avoid
allocating one for non-S_IFREG inodes.
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-14-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
With the addition of the try_break_lease calls in directory changing
operations, allow generic_setlease to hand them out. Write leases on
directories are never allowed however, so continue to reject them.
For now, there is no API for requesting delegations from userland, so
ensure that userland is prevented from acquiring a lease on a directory.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-13-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In order to add directory delegation support, we must break delegations
on the parent on any change to the directory.
Add a delegated_inode parameter to vfs_symlink() and have it break the
delegation. do_symlinkat() can then wait on the delegation break before
proceeding.
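The break-then-wait-then-retry flow described above can be modelled in userspace C. This is a simplified sketch: the sketch_* types and helpers are hypothetical, the wait is modelled as simply clearing a flag, and -EWOULDBLOCK is assumed as the "delegation outstanding" signal.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory inode. */
struct sketch_inode {
	bool delegated;		/* a delegation is outstanding */
	int nr_entries;		/* number of entries in the directory */
};

/*
 * Models a vfs_*() helper: refuse to change a delegated directory and
 * report the inode to wait on through *delegated_inode, if provided.
 */
static int sketch_vfs_symlink(struct sketch_inode *dir,
			      struct sketch_inode **delegated_inode)
{
	if (dir->delegated) {
		if (delegated_inode)
			*delegated_inode = dir;
		return -EWOULDBLOCK;
	}
	dir->nr_entries++;
	return 0;
}

/* Models waiting until the delegation has been returned. */
static void sketch_break_wait(struct sketch_inode *inode)
{
	inode->delegated = false;
}

/* Models the syscall-level caller: wait on the break, then retry. */
static int sketch_do_symlinkat(struct sketch_inode *dir)
{
	struct sketch_inode *delegated_inode = NULL;
	int err;

retry:
	err = sketch_vfs_symlink(dir, &delegated_inode);
	if (err == -EWOULDBLOCK && delegated_inode) {
		sketch_break_wait(delegated_inode);
		delegated_inode = NULL;
		goto retry;
	}
	return err;
}
```

A caller that passes NULL for delegated_inode gets the error back instead of waiting, which is how this model represents callers that cannot block on the break.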
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-12-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In order to add directory delegation support, we need to break
delegations on the parent whenever there is going to be a change in the
directory.
Add a new delegated_inode pointer to vfs_mknod() and have the
appropriate callers wait when there is an outstanding delegation. All
other callers just set the pointer to NULL.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-11-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In order to add directory delegation support, we need to break
delegations on the parent whenever there is going to be a change in the
directory.
Add a delegated_inode parameter to vfs_create(). Most callers are
converted to pass in NULL, but do_mknodat() is changed to wait for a
delegation break if there is one.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-10-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
As Neil points out:
"I would be in favour of dropping the "dir" arg because it is always
d_inode(dentry->d_parent) which is stable."
...and...
"Also *every* caller of vfs_create() passes ".excl = true". So maybe we
don't need that arg at all."
Drop both arguments from vfs_create() and fix up the callers.
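Why the "dir" argument is redundant can be shown with a toy model. The sketch_* structures below are hypothetical, pared-down stand-ins for the dentry and inode, and the function body is illustrative only.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct inode and struct dentry. */
struct sketch_dir_inode {
	int nr_entries;
};

struct sketch_dentry {
	struct sketch_dentry *d_parent;
	struct sketch_dir_inode *d_inode;
};

/* The parent directory is always reachable from the dentry itself. */
static struct sketch_dir_inode *sketch_parent_inode(const struct sketch_dentry *dentry)
{
	return dentry->d_parent->d_inode;
}

/*
 * Create with neither a "dir" nor an "excl" argument: the parent is
 * derived from the dentry, and exclusive creation is simply assumed.
 */
static int sketch_create(struct sketch_dentry *dentry)
{
	struct sketch_dir_inode *dir = sketch_parent_inode(dentry);

	dir->nr_entries++;
	return 0;
}
```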
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-9-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In order to add directory delegation support, we need to break
delegations on the parent whenever there is going to be a change in the
directory.
Add a delegated_inode parameter to lookup_open and have it break the
delegation. Then, open_last_lookups can wait for the delegation break
and retry the call to lookup_open once it's done.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-8-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In order to add directory delegation support, we need to break
delegations on the parent whenever there is going to be a change in the
directory.
Add a delegated_inode parameter to vfs_rmdir() and, if that pointer is
non-NULL, populate it with the parent inode. Most existing in-kernel
callers pass in a NULL pointer.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-7-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|