aboutsummaryrefslogtreecommitdiffstats
path: root/fs/stat.c
AgeCommit message (Collapse)AuthorFilesLines
2025-09-15constify path argument of vfs_statx_path()Al Viro1-1/+1
Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-26Merge tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-1/+5
Pull xfs updates from Carlos Maiolino: - Atomic writes for XFS - Remove experimental warnings for pNFS, scrub and parent pointers * tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits) xfs: add inode to zone caching for data placement xfs: free the item in xfs_mru_cache_insert on failure xfs: remove the EXPERIMENTAL warning for pNFS xfs: remove some EXPERIMENTAL warnings xfs: Remove deprecated xfs_bufd sysctl parameters xfs: stop using set_blocksize xfs: allow sysadmins to specify a maximum atomic write limit at mount time xfs: update atomic write limits xfs: add xfs_calc_atomic_write_unit_max() xfs: add xfs_file_dio_write_atomic() xfs: commit CoW-based atomic writes atomically xfs: add large atomic writes checks in xfs_direct_write_iomap_begin() xfs: add xfs_atomic_write_cow_iomap_begin() xfs: refine atomic write size check in xfs_file_write_iter() xfs: refactor xfs_reflink_end_cow_extent() xfs: allow block allocator to take an alignment hint xfs: ignore HW which cannot atomic write a single block xfs: add helpers to compute transaction reservation for finishing intent items xfs: add helpers to compute log item overhead xfs: separate out setting buftarg atomic writes limits ...
2025-05-26Merge tag 'vfs-6.16-rc1.misc' of ↵Linus Torvalds1-15/+20
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Use folios for symlinks in the page cache FUSE already uses folios for its symlinks. Mirror that conversion in the generic code and the NFS code. That lets us get rid of a few folio->page->folio conversions in this path, and some of the few remaining users of read_cache_page() / read_mapping_page() - Try and make a few filesystem operations killable on the VFS inode->i_mutex level - Add sysctl vfs_cache_pressure_denom for bulk file operations Some workloads need to preserve more dentries than we currently allow through out sysctl interface A HDFS servers with 12 HDDs per server, on a HDFS datanode startup involves scanning all files and caching their metadata (including dentries and inodes) in memory. Each HDD contains approximately 2 million files, resulting in a total of ~20 million cached dentries after initialization To minimize dentry reclamation, they set vfs_cache_pressure to 1. Despite this configuration, memory pressure conditions can still trigger reclamation of up to 50% of cached dentries, reducing the cache from 20 million to approximately 10 million entries. During the subsequent cache rebuild period, any HDFS datanode restart operation incurs substantial latency penalties until full cache recovery completes To maintain service stability, more dentries need to be preserved during memory reclamation. The current minimum reclaim ratio (1/100 of total dentries) remains too aggressive for such workload. This patch introduces vfs_cache_pressure_denom for more granular cache pressure control The configuration [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000] effectively maintains the full 20 million dentry cache under memory pressure, preventing datanode restart performance degradation - Avoid some jumps in inode_permission() using likely()/unlikely() - Avid a memory access which is most likely a cache miss when descending into devcgroup_inode_permission() - Add fastpath predicts for stat() and fdput() - Anonymous inodes currently don't come with a proper mode causing issues in the kernel when we want to add useful VFS debug assert. Fix that by giving them a proper mode and masking it off when we report it to userspace which relies on them not having any mode - Anonymous inodes currently allow to change inode attributes because the VFS falls back to simple_setattr() if i_op->setattr isn't implemented. This means the ownership and mode for every single user of anon_inode_inode can be changed. Block that as it's either useless or actively harmful. If specific ownership is needed the respective subsystem should allocate anonymous inodes from their own private superblock - Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock - Add proper tests for anonymous inode behavior - Make it easy to detect proper anonymous inodes and to ensure that we can detect them in codepaths such as readahead() Cleanups: - Port pidfs to the new anon_inode_{g,s}etattr() helpers - Try to remove the uselib() system call - Add unlikely branch hint return path for poll - Add unlikely branch hint on return path for core_sys_select - Don't allow signals to interrupt getdents copying for fuse - Provide a size hint to dir_context for during readdir() - Use writeback_iter directly in mpage_writepages - Update compression and mtime descriptions in initramfs documentation - Update main netfs API document - Remove useless plus one in super_cache_scan() - Remove unnecessary NULL-check guards during setns() - Add separate separate {get,put}_cgroup_ns no-op cases Fixes: - Fix typo in root= kernel parameter description - Use KERN_INFO for infof()|info_plog()|infofc() - Correct comments of fs_validate_description() - Mark an unlikely if condition with unlikely() in vfs_parse_monolithic_sep() - Delete macro fsparam_u32hex() - Remove unused and problematic validate_constant_table() - Fix potential unsigned integer underflow in fs_name() - Make file-nr output the total allocated file handles" * tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits) fs: Pass a folio to page_put_link() nfs: Use a folio in nfs_get_link() fs: Convert __page_get_link() to use a folio fs/read_write: make default_llseek() killable fs/open: make do_truncate() killable fs/open: make chmod_common() and chown_common() killable include/linux/fs.h: add inode_lock_killable() readdir: supply dir_context.count as readdir buffer size hint vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations fuse: don't allow signals to interrupt getdents copying Documentation: fix typo in root= kernel parameter description include/cgroup: separate {get,put}_cgroup_ns no-op case kernel/nsproxy: remove unnecessary guards fs: use writeback_iter directly in mpage_writepages fs: remove useless plus one in super_cache_scan() fs: add S_ANON_INODE fs: remove uselib() system call device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission() fs/fs_parse: Remove unused and problematic validate_constant_table() fs: touch up predicts in inode_permission() ...
2025-05-07fs: add atomic write unit max opt to statxJohn Garry1-1/+5
XFS will be able to support large atomic writes (atomic write > 1x block) in future. This will be achieved by using different operating methods, depending on the size of the write. Specifically a new method of operation based in FS atomic extent remapping will be supported in addition to the current HW offload-based method. The FS method will generally be appreciably slower performing than the HW-offload method. However the FS method will be typically able to contribute to achieving a larger atomic write unit max limit. XFS will support a hybrid mode, where HW offload method will be used when possible, i.e. HW offload is used when the length of the write is supported, and for other times FS-based atomic writes will be used. As such, there is an atomic write length at which the user may experience appreciably slower performance. Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt. When zero, it means that there is no such performance boundary. Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is ok for older kernels which don't support this new field, as they would report 0 in this field (from zeroing in cp_statx()) already. Furthermore those older kernels don't support large atomic writes - apart from block fops, but there would be consistent performance there for atomic writes in range [unit min, unit max]. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-04-17fs: move the bdex_statx call to vfs_getattr_nosecChristoph Hellwig1-14/+18
Currently bdex_statx is only called from the very high-level vfs_statx_path function, and thus bypassing it for in-kernel calls to vfs_getattr or vfs_getattr_nosec. This breaks querying the block ѕize of the underlying device in the loop driver and also is a pitfall for any other new kernel caller. Move the call into the lowest level helper to ensure all callers get the right results. Fixes: 2d985f8c6b91 ("vfs: support STATX_DIOALIGN on block devices") Fixes: f4774e92aab8 ("loop: take the file system minimum dio alignment into account") Reported-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250417064042.712140-1-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-08fs: sort out cosmetic differences between stat funcs and add predictsMateusz Guzik1-15/+20
This is a nop, but I did verify asm improves. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/20250406235806.1637000-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07fs/stat.c: avoid harmless garbage value problem in vfs_statx_path()Su Hui1-1/+3
Clang static checker(scan-build) warning: fs/stat.c:287:21: warning: The left expression of the compound assignment is an uninitialized value. The computed value will also be garbage. 287 | stat->result_mask |= STATX_MNT_ID_UNIQUE; | ~~~~~~~~~~~~~~~~~ ^ fs/stat.c:290:21: warning: The left expression of the compound assignment is an uninitialized value. The computed value will also be garbage. 290 | stat->result_mask |= STATX_MNT_ID; When vfs_getattr() failed because of security_inode_getattr(), 'stat' is uninitialized. In this case, there is a harmless garbage problem in vfs_statx_path(). It's better to return error directly when vfs_getattr() failed, avoiding garbage value and more clearly. Signed-off-by: Su Hui <suhui@nfschina.com> Link: https://lore.kernel.org/r/20250119025946.1168957-1-suhui@nfschina.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-09fs: add STATX_DIO_READ_ALIGNChristoph Hellwig1-0/+1
Add a separate dio read align field, as many out of place write file systems can easily do reads aligned to the device sector size, but require bigger alignment for writes. This is usually papered over by falling back to buffered I/O for smaller writes and doing read-modify-write cycles, but performance for this sucks, so applications benefit from knowing the actual write alignment. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250109083109.1441561-3-hch@lst.de Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-18Merge tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-17/+7
Pull statx updates from Al Viro: "Sanitize struct filename and lookup flags handling in statx and friends" * tag 'pull-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: libfs: kill empty_dir_getattr() fs: Simplify getattr interface function checking AT_GETATTR_NOSEC flag fs/stat.c: switch to CLASS(fd_raw) kill getname_statx_lookup_flags() io_statx_prep(): use getname_uflags()
2024-11-18Merge tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-24/+4
Pull xattr updates from Al Viro: "Sanitize xattr and io_uring interactions with it, add *xattrat() syscalls, sanitize struct filename handling in there" * tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: xattr: remove redundant check on variable err fs/xattr: add *at family syscalls new helpers: file_removexattr(), filename_removexattr() new helpers: file_listxattr(), filename_listxattr() replace do_getxattr() with saner helpers. replace do_setxattr() with saner helpers. new helper: import_xattr_name() fs: rename struct xattr_ctx to kernel_xattr_ctx xattr: switch to CLASS(fd) io_[gs]etxattr_prep(): just use getname() io_uring: IORING_OP_F[GS]ETXATTR is fine with REQ_F_FIXED_FILE getname_maybe_null() - the third variant of pathname copy-in teach filename_lookup() to treat NULL filename as ""
2024-11-13fs: Simplify getattr interface function checking AT_GETATTR_NOSEC flagStefan Berger1-4/+1
Commit 8a924db2d7b5 ("fs: Pass AT_GETATTR_NOSEC flag to getattr interface function")' introduced the AT_GETATTR_NOSEC flag to ensure that the call paths only call vfs_getattr_nosec if it is set instead of vfs_getattr. Now, simplify the getattr interface functions of filesystems where the flag AT_GETATTR_NOSEC is checked. There is only a single caller of inode_operations getattr function and it is located in fs/stat.c in vfs_getattr_nosec. The caller there is the only one from which the AT_GETATTR_NOSEC flag is passed from. Two filesystems are checking this flag in .getattr and the flag is always passed to them unconditionally from only vfs_getattr_nosec: - ecryptfs: Simplify by always calling vfs_getattr_nosec in ecryptfs_getattr. From there the flag is passed to no other function and this function is not called otherwise. - overlayfs: Simplify by always calling vfs_getattr_nosec in ovl_getattr. From there the flag is passed to no other function and this function is not called otherwise. The query_flags in vfs_getattr_nosec will mask-out AT_GETATTR_NOSEC from any caller using AT_STATX_SYNC_TYPE as mask so that the flag is not important inside this function. Also, since no filesystem is checking the flag anymore, remove the flag entirely now, including the BUG_ON check that never triggered. The net change of the changes here combined with the original commit is that ecryptfs and overlayfs do not call vfs_getattr but only vfs_getattr_nosec. Fixes: 8a924db2d7b5 ("fs: Pass AT_GETATTR_NOSEC flag to getattr interface function") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Closes: https://lore.kernel.org/linux-fsdevel/20241101011724.GN1350452@ZenIV/T/#u Cc: Tyler Hicks <code@tyhicks.com> Cc: ecryptfs@vger.kernel.org Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Amir Goldstein <amir73il@gmail.com> Cc: linux-unionfs@vger.kernel.org Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-13fs/stat.c: switch to CLASS(fd_raw)Al Viro1-9/+4
... and use fd_empty() consistently Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-13kill getname_statx_lookup_flags()Al Viro1-4/+2
LOOKUP_EMPTY is ignored by the only remaining user, and without that 'getname_' prefix makes no sense. Remove LOOKUP_EMPTY part, rename to statx_lookup_flags() and make static. It most likely is _not_ statx() specific, either, but that's the next step. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-19getname_maybe_null() - the third variant of pathname copy-inAl Viro1-24/+4
Semantics used by statx(2) (and later *xattrat(2)): without AT_EMPTY_PATH it's standard getname() (i.e. ERR_PTR(-ENOENT) on empty string, ERR_PTR(-EFAULT) on NULL), with AT_EMPTY_PATH both empty string and NULL are accepted. Calling conventions: getname_maybe_null(user_pointer, flags) returns * pointer to struct filename when non-empty string had been successfully read * ERR_PTR(...) on error * NULL if an empty string or NULL pointer had been given with AT_EMPTY_PATH in the flags argument. It tries to avoid allocation in the last case; it's not always able to do so, in which case the temporary struct filename instance is freed and NULL returned anyway. Fast path is inlined. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-10Merge patch series "timekeeping/fs: multigrain timestamp redux"Christian Brauner1-2/+44
Jeff Layton <jlayton@kernel.org> says: The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. To remedy this, keep a global monotonic atomic64_t value that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems). * patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org: tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10fs: tracepoints around multigrain timestamp eventsJeff Layton1-0/+3
Add some tracepoints around various multigrain timestamp events. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-6-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-07fs: add infrastructure for multigrain timestampsJeff Layton1-2/+41
The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If fine-grained timestamps were always used, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What is needed is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, allow the update to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, accept that value. If it isn't, then get a fine-grained timestamp and attempt to stamp the inode ctime with that value. If that races with another concurrent stamp, then abandon the update and take the new value without retrying. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems). Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-3-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12introduce fd_file(), convert all accessors to it.Al Viro1-4/+4
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-07-15Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linuxLinus Torvalds1-7/+41
Pull block updates from Jens Axboe: - NVMe updates via Keith: - Device initialization memory leak fixes (Keith) - More constants defined (Weiwen) - Target debugfs support (Hannes) - PCIe subsystem reset enhancements (Keith) - Queue-depth multipath policy (Redhat and PureStorage) - Implement get_unique_id (Christoph) - Authentication error fixes (Gaosheng) - MD updates via Song - sync_action fix and refactoring (Yu Kuai) - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li) - Fix loop detach/open race (Gulam) - Fix lower control limit for blk-throttle (Yu) - Add module descriptions to various drivers (Jeff) - Add support for atomic writes for block devices, and statx reporting for same. Includes SCSI and NVMe (John, Prasad, Alan) - Add IO priority information to block trace points (Dongliang) - Various zone improvements and tweaks (Damien) - mq-deadline tag reservation improvements (Bart) - Ignore direct reclaim swap writes in writeback throttling (Baokun) - Block integrity improvements and fixes (Anuj) - Add basic support for rust based block drivers. Has a dummy null_blk variant for now (Andreas) - Series converting driver settings to queue limits, and cleanups and fixes related to that (Christoph) - Cleanup for poking too deeply into the bvec internals, in preparation for DMA mapping API changes (Christoph) - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas, Ming, Zhu, Damien, Christophe, Chaitanya) * tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits) floppy: add missing MODULE_DESCRIPTION() macro loop: add missing MODULE_DESCRIPTION() macro ublk_drv: add missing MODULE_DESCRIPTION() macro xen/blkback: add missing MODULE_DESCRIPTION() macro block/rnbd: Constify struct kobj_type block: take offset into account in blk_bvec_map_sg again block: fix get_max_segment_size() warning loop: Don't bother validating blocksize virtio_blk: Don't bother validating blocksize null_blk: Don't bother validating blocksize block: Validate logical block size in blk_validate_limits() virtio_blk: Fix default logical block size fallback nvmet-auth: fix nvmet_auth hash error handling nvme: implement ->get_unique_id block: pass a phys_addr_t to get_max_segment_size block: add a bvec_phys helper blk-lib: check for kill signal in ioctl BLKZEROOUT block: limit the Write Zeroes to manually writing zeroes fallback block: refacto blkdev_issue_zeroout block: move read-only and supported checks into (__)blkdev_issue_zeroout ...
2024-06-27vfs: support statx(..., NULL, AT_EMPTY_PATH, ...)Mateusz Guzik1-28/+83
The newly used helper also checks for empty ("") paths. NULL paths with any flag value other than AT_EMPTY_PATH go the usual route and end up with -EFAULT to retain compatibility (Rust is abusing calls of the sort to detect availability of statx). This avoids path lookup code, lockref management, memory allocation and in case of NULL path userspace memory access (which can be quite expensive with SMAP on x86_64). Benchmarked with statx(..., AT_EMPTY_PATH, ...) running on Sapphire Rapids, with the "" path for the first two cases and NULL for the last one. Results in ops/s: stock: 4231237 pre-check: 5944063 (+40%) NULL path: 6601619 (+11%/+56%) Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240625151807.620812-1-mjguzik@gmail.com Tested-by: Xi Ruoyao <xry111@xry111.site> [brauner: use path_mounted() and other tweaks] Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-27stat: use vfs_empty_path() helperChristian Brauner1-10/+2
Use the newly added helper for this. Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-20block: Add atomic write support for statxPrasad Singamsetty1-7/+9
Extend statx system call to return additional info for atomic write support support if the specified file is a block device. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-7-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20fs: Add initial atomic write support info to statxPrasad Singamsetty1-0/+34
Extend statx system call to return additional info for atomic write support support for a file. Helper function generic_fill_statx_atomic_writes() can be used by FSes to fill in the relevant statx fields. For now atomic_write_segments_max will always be 1, otherwise some rules would need to be imposed on iovec length and alignment, which we don't want now. Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> jpg: relocate bdev support to another patch Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-05vfs: retire user_path_at_empty and drop empty arg from getname_flagsMateusz Guzik1-3/+3
No users after do_readlinkat started doing the job on its own. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240604155257.109500-3-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-05vfs: stop using user_path_at_empty in do_readlinkatMateusz Guzik1-19/+24
It is the only consumer and it saddles getname_flags with an argument set to NULL by everyone else. Instead the routine can do the empty check on its own. Then user_path_at_empty can get retired and getname_flags can lose the argument. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240604155257.109500-2-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-26statx: stx_subvolKent Overstreet1-0/+1
Add a new statx field for (sub)volume identifiers, as implemented by btrfs and bcachefs. This includes bcachefs support; we'll definitely want btrfs support as well. Link: https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240308022914.196982-1-kent.overstreet@linux.dev Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-08Merge tag 'vfs-6.8.mount' of ↵Linus Torvalds1-2/+7
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains the work to retrieve detailed information about mounts via two new system calls. This is hopefully the beginning of the end of the saga that started with fsinfo() years ago. The LWN articles in [1] and [2] can serve as a summary so we can avoid rehashing everything here. At LSFMM in May 2022 we got into a room and agreed on what we want to do about fsinfo(). Basically, split it into pieces. This is the first part of that agreement. Specifically, it is concerned with retrieving information about mounts. So this only concerns the mount information retrieval, not the mount table change notification, or the extended filesystem specific mount option work. That is separate work. Currently mounts have a 32bit id. Mount ids are already in heavy use by libmount and other low-level userspace but they can't be relied upon because they're recycled very quickly. We agreed that mounts should carry a unique 64bit id by which they can be referenced directly. This is now implemented as part of this work. The new 64bit mount id is exposed in statx() through the new STATX_MNT_ID_UNIQUE flag. If the flag isn't raised the old mount id is returned. If it is raised and the kernel supports the new 64bit mount id the flag is raised in the result mask and the new 64bit mount id is returned. New and old mount ids do not overlap so they cannot be conflated. Two new system calls are introduced that operate on the 64bit mount id: statmount() and listmount(). A summary of the api and usage can be found on LWN as well (cf. [3]) but of course, I'll provide a summary here as well. Both system calls rely on struct mnt_id_req. Which is the request struct used to pass the 64bit mount id identifying the mount to operate on. It is extensible to allow for the addition of new parameters and for future use in other apis that make use of mount ids. statmount() mimicks the semantics of statx() and exposes a set flags that userspace may raise in mnt_id_req to request specific information to be retrieved. A statmount() call returns a struct statmount filled in with information about the requested mount. Supported requests are indicated by raising the request flag passed in struct mnt_id_req in the @mask argument in struct statmount. Currently we do support: - STATMOUNT_SB_BASIC: Basic filesystem info - STATMOUNT_MNT_BASIC Mount information (mount id, parent mount id, mount attributes etc) - STATMOUNT_PROPAGATE_FROM Propagation from what mount in current namespace - STATMOUNT_MNT_ROOT Path of the root of the mount (e.g., mount --bind /bla /mnt returns /bla) - STATMOUNT_MNT_POINT Path of the mount point (e.g., mount --bind /bla /mnt returns /mnt) - STATMOUNT_FS_TYPE Name of the filesystem type as the magic number isn't enough due to submounts The string options STATMOUNT_MNT_{ROOT,POINT} and STATMOUNT_FS_TYPE are appended to the end of the struct. Userspace can use the offsets in @fs_type, @mnt_root, and @mnt_point to reference those strings easily. The struct statmount reserves quite a bit of space currently for future extensibility. This isn't really a problem and if this bothers us we can just send a follow-up pull request during this cycle. listmount() is given a 64bit mount id via mnt_id_req just as statmount(). It takes a buffer and a size to return an array of the 64bit ids of the child mounts of the requested mount. Userspace can thus choose to either retrieve child mounts for a mount in batches or iterate through the child mounts. For most use-cases it will be sufficient to just leave space for a few child mounts. But for big mount tables having an iterator is really helpful. Iterating through a mount table works by setting @param in mnt_id_req to the mount id of the last child mount retrieved in the previous listmount() call" Link: https://lwn.net/Articles/934469 [1] Link: https://lwn.net/Articles/829212 [2] Link: https://lwn.net/Articles/950569 [3] * tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: add selftest for statmount/listmount fs: keep struct mnt_id_req extensible wire up syscalls for statmount/listmount add listmount(2) syscall statmount: simplify string option retrieval statmount: simplify numeric option retrieval add statmount(2) syscall namespace: extract show_path() helper mounts: keep list of mounts in an rbtree add unique mount ID
2024-01-08Merge tag 'vfs-6.8.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...
2023-12-21fs: fix doc comment typo fs tree wideAlexander Mikhalitsyn1-1/+1
Do the replacement: s/simply passs @nop_mnt_idmap/simply pass @nop_mnt_idmap/ in the fs/ tree. Found by chance while working on support for idmapped mounts in fuse. Cc: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: <linux-fsdevel@vger.kernel.org> Cc: <linux-kernel@vger.kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20231215130927.136917-1-aleksandr.mikhalitsyn@canonical.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18add unique mount IDMiklos Szeredi1-2/+7
If a mount is released then its mnt_id can immediately be reused. This is bad news for user interfaces that want to uniquely identify a mount. Implementing a unique mount ID is trivial (use a 64bit counter). Unfortunately userspace assumes 32bit size and would overflow after the counter reaches 2^32. Introduce a new 64bit ID alongside the old one. Initialize the counter to 2^32, this guarantees that the old and new IDs are never mixed up. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20231025140205.3586473-2-mszeredi@redhat.com Reviewed-by: Ian Kent <raven@themaw.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18fs: Pass AT_GETATTR_NOSEC flag to getattr interface functionStefan Berger1-1/+5
When vfs_getattr_nosec() calls a filesystem's getattr interface function then the 'nosec' should propagate into this function so that vfs_getattr_nosec() can again be called from the filesystem's gettattr rather than vfs_getattr(). The latter would add unnecessary security checks that the initial vfs_getattr_nosec() call wanted to avoid. Therefore, introduce the getattr flag GETATTR_NOSEC and allow to pass with the new getattr_flags parameter to the getattr interface function. In overlayfs and ecryptfs use this flag to determine which one of the two functions to call. In a recent code change introduced to IMA vfs_getattr_nosec() ended up calling vfs_getattr() in overlayfs, which in turn called security_inode_getattr() on an exiting process that did not have current->fs set anymore, which then caused a kernel NULL pointer dereference. With this change the call to security_inode_getattr() can be avoided, thus avoiding the NULL pointer dereference. Reported-by: <syzbot+a67fc5321ffb4b311c98@syzkaller.appspotmail.com> Fixes: db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version") Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <linux-fsdevel@vger.kernel.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Tyler Hicks <code@tyhicks.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Suggested-by: Christian Brauner <brauner@kernel.org> Co-developed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Stefan Berger <stefanb@linux.ibm.com> Link: https://lore.kernel.org/r/20231002125733.1251467-1-stefanb@linux.vnet.ibm.com Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18fs: convert core infrastructure to new timestamp accessorsJeff Layton1-2/+2
Convert the core vfs code to use the new timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185239.80830-2-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20Revert "fs: add infrastructure for multigrain timestamps"Christian Brauner1-39/+2
This reverts commit ffb6cf19e06334062744b7e3493f71e500964f8e. Users reported regressions due to enabling multi-grained timestamps unconditionally. As no clear consensus on a solution has come up and the discussion has gone back to the drawing board revert the infrastructure changes for. If it isn't code that's here to stay, make it go away. Message-ID: <20230920-keine-eile-c9755b5825db@brauner> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-17stat: remove no-longer-used helper macrosLinus Torvalds1-6/+0
The choose_32_64() macros were added to deal with an odd inconsistency between the 32-bit and 64-bit layout of 'struct stat' way back when in commit a52dd971f947 ("vfs: de-crapify "cp_new_stat()" function"). Then a decade later Mikulas noticed that said inconsistency had been a mistake in the early x86-64 port, and shouldn't have existed in the first place. So commit 932aba1e1690 ("stat: fix inconsistency between struct stat and struct compat_stat") removed the uses of the helpers. But the helpers remained around, unused. Get rid of them. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-09-07vfs: mostly undo glibc turning 'fstat()' into 'fstatat(AT_EMPTY_PATH)'Linus Torvalds1-0/+17
Mateusz reports that glibc turns 'fstat()' calls into 'fstatat()', and that seems to have been going on for quite a long time due to glibc having tried to simplify its stat logic into just one point. This turns out to cause completely unnecessary overhead, where we then go off and allocate the kernel side pathname, and actually look up the empty path. Sure, our path lookup is quite optimized, but it still causes a fair bit of allocation overhead and a couple of completely unnecessary rounds of lockref accesses etc. This is all hopefully getting fixed in user space, and there is a patch floating around for just having glibc use the native fstat() system call. But even with the current situation we can at least improve on things by catching the situation and short-circuiting it. Note that this is still measurably slower than just a plain 'fstat()', since just checking that the filename is actually empty is somewhat expensive due to inevitable user space access overhead from the kernel (ie verifying pointers, and SMAP on x86). But it's still quite a bit faster than actually looking up the path for real. To quote numers from Mateusz: "Sapphire Rapids, will-it-scale, ops/s stock fstat 5088199 patched fstat 7625244 (+49%) real fstat 8540383 (+67% / +12%)" where that 'stock fstat' is the glibc translation of fstat into fstatat() with an empty path, the 'patched fstat' is with this short circuiting of the path lookup, and the 'real fstat' is the actual native fstat() system call with none of this overhead. Link: https://lore.kernel.org/lkml/20230903204858.lv7i3kqvw6eamhgz@f/ Reported-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-11fs: add infrastructure for multigrain timestampsJeff Layton1-2/+39
The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. POSIX generally mandates that when the the mtime changes, the ctime must also change. The kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Use the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Later patches will convert individual filesystems to use the new infrastructure. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-9-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09fs: pass the request_mask to generic_fillattrJeff Layton1-11/+13
generic_fillattr just fills in the entire stat struct indiscriminately today, copying data from the inode. There is at least one attribute (STATX_CHANGE_COOKIE) that can have side effects when it is reported, and we're looking at adding more with the addition of multigrain timestamps. Add a request_mask argument to generic_fillattr and have most callers just pass in the value that is passed to getattr. Have other callers (e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of STATX_CHANGE_COOKIE into generic_fillattr. Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-13fs: convert to ctime accessor functionsJeff Layton1-1/+1
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230705190309.579783-23-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-02-20Merge tag 'fs.idmapped.v6.3' of ↵Linus Torvalds1-12/+12
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs idmapping updates from Christian Brauner: - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastucture in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs. This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap. Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments. Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings. We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements. In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs. - Enable idmapped mounts for tmpfs and fulfill a longstanding request. A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this. However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up. As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests. * tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits) shmem: support idmapped mounts for tmpfs fs: move mnt_idmap fs: port vfs{g,u}id helpers to mnt_idmap fs: port fs{g,u}id helpers to mnt_idmap fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap fs: port i_{g,u}id_{needs_}update() to mnt_idmap quota: port to mnt_idmap fs: port privilege checking helpers to mnt_idmap fs: port inode_owner_or_capable() to mnt_idmap fs: port inode_init_owner() to mnt_idmap fs: port acl to mnt_idmap fs: port xattr to mnt_idmap fs: port ->permission() to pass mnt_idmap fs: port ->fileattr_set() to pass mnt_idmap fs: port ->set_acl() to pass mnt_idmap fs: port ->get_acl() to pass mnt_idmap fs: port ->tmpfile() to pass mnt_idmap fs: port ->rename() to pass mnt_idmap fs: port ->mknod() to pass mnt_idmap fs: port ->mkdir() to pass mnt_idmap ...
2023-01-26vfs: plumb i_version handling into struct kstatJeff Layton1-2/+15
The NFS server has a lot of special handling for different types of change attribute access, depending on the underlying filesystem. In most cases, it's doing a getattr anyway and then fetching that value after the fact. Rather that do that, add a new STATX_CHANGE_COOKIE flag that is a kernel-only symbol (for now). If requested and getattr can implement it, it can fill out this field. For IS_I_VERSION inodes, add a generic implementation in vfs_getattr_nosec. Take care to mask STATX_CHANGE_COOKIE off in requests from userland and in the result mask. Since not all filesystems can give the same guarantees of monotonicity, claim a STATX_ATTR_CHANGE_MONOTONIC flag that filesystems can set to indicate that they offer an i_version value that can never go backward. Eventually if we decide to make the i_version available to userland, we can just designate a field for it in struct statx, and move the STATX_CHANGE_COOKIE definition to the uapi header. Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-01-19fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmapChristian Brauner1-4/+2
Convert to struct mnt_idmap. Remove legacy file_mnt_user_ns() and mnt_user_ns(). Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-19fs: port ->getattr() to pass mnt_idmapChristian Brauner1-10/+12
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-10-26fs: use type safe idmapping helpersChristian Brauner1-2/+5
We already ported most parts and filesystems over for v6.0 to the new vfs{g,u}id_t type and associated helpers for v6.0. Convert the remaining places so we can remove all the old helpers. This is a non-functional change. Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-09-11vfs: support STATX_DIOALIGN on block devicesEric Biggers1-0/+12
Add support for STATX_DIOALIGN to block devices, so that direct I/O alignment restrictions are exposed to userspace in a generic way. Note that this breaks the tradition of stat operating only on the block device node, not the block device itself. However, it was felt that doing this is preferable, in order to make the interface useful and avoid needing separate interfaces for regular files and block devices. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20220827065851.135710-3-ebiggers@kernel.org
2022-09-11statx: add direct I/O alignment informationEric Biggers1-0/+2
Traditionally, the conditions for when DIO (direct I/O) is supported were fairly simple. For both block devices and regular files, DIO had to be aligned to the logical block size of the block device. However, due to filesystem features that have been added over time (e.g. multi-device support, data journalling, inline data, encryption, verity, compression, checkpoint disabling, log-structured mode), the conditions for when DIO is allowed on a regular file have gotten increasingly complex. Whether a particular regular file supports DIO, and with what alignment, can depend on various file attributes and filesystem mount options, as well as which block device(s) the file's data is located on. Moreover, the general rule of DIO needing to be aligned to the block device's logical block size was recently relaxed to allow user buffers (but not file offsets) aligned to the DMA alignment instead. See commit bf8d08532bc1 ("iomap: add support for dma aligned direct-io"). XFS has an ioctl XFS_IOC_DIOINFO that exposes DIO alignment information. Uplifting this to the VFS is one possibility. However, as discussed (https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebiggers@kernel.org/T/#u), this ioctl is rarely used and not known to be used outside of XFS-specific code. It was also never intended to indicate when a file doesn't support DIO at all, nor was it intended for block devices. Therefore, let's expose this information via statx(). Add the STATX_DIOALIGN flag and two new statx fields associated with it: * stx_dio_mem_align: the alignment (in bytes) required for user memory buffers for DIO, or 0 if DIO is not supported on the file. * stx_dio_offset_align: the alignment (in bytes) required for file offsets and I/O segment lengths for DIO, or 0 if DIO is not supported on the file. This will only be nonzero if stx_dio_mem_align is nonzero, and vice versa. Note that as with other statx() extensions, if STATX_DIOALIGN isn't set in the returned statx struct, then these new fields won't be filled in. This will happen if the file is neither a regular file nor a block device, or if the file is a regular file and the filesystem doesn't support STATX_DIOALIGN. It might also happen if the caller didn't include STATX_DIOALIGN in the request mask, since statx() isn't required to return unrequested information. This commit only adds the VFS-level plumbing for STATX_DIOALIGN. For regular files, individual filesystems will still need to add code to support it. For block devices, a separate commit will wire it up too. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220827065851.135710-2-ebiggers@kernel.org
2022-05-31Merge tag 'riscv-for-linus-5.19-mw0' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for the Svpbmt extension, which allows memory attributes to be encoded in pages - Support for the Allwinner D1's implementation of page-based memory attributes - Support for running rv32 binaries on rv64 systems, via the compat subsystem - Support for kexec_file() - Support for the new generic ticket-based spinlocks, which allows us to also move to qrwlock. These should have already gone in through the asm-geneic tree as well - A handful of cleanups and fixes, include some larger ones around atomics and XIP * tag 'riscv-for-linus-5.19-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (51 commits) RISC-V: Prepare dropping week attribute from arch_kexec_apply_relocations[_add] riscv: compat: Using seperated vdso_maps for compat_vdso_info RISC-V: Fix the XIP build RISC-V: Split out the XIP fixups into their own file RISC-V: ignore xipImage RISC-V: Avoid empty create_*_mapping definitions riscv: Don't output a bogus mmu-type on a no MMU kernel riscv: atomic: Add custom conditional atomic operation implementation riscv: atomic: Optimize dec_if_positive functions riscv: atomic: Cleanup unnecessary definition RISC-V: Load purgatory in kexec_file RISC-V: Add purgatory RISC-V: Support for kexec_file on panic RISC-V: Add kexec_file support RISC-V: use memcpy for kexec_file mode kexec_file: Fix kexec_file.c build error for riscv platform riscv: compat: Add COMPAT Kbuild skeletal support riscv: compat: ptrace: Add compat_arch_ptrace implement riscv: compat: signal: Add rt_frame implementation riscv: add memory-type errata for T-Head ...
2022-04-26fs: stat: compat: Add __ARCH_WANT_COMPAT_STATGuo Ren1-1/+1
RISC-V doesn't neeed compat_stat, so using __ARCH_WANT_COMPAT_STAT to exclude unnecessary SYSCALL functions. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Heiko Stuebner <heiko@sntech.de> Acked-by: Helge Deller <deller@gmx.de> # parisc Link: https://lore.kernel.org/r/20220405071314.3225832-6-guoren@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-04-12stat: fix inconsistency between struct stat and struct compat_statMikulas Patocka1-9/+10
struct stat (defined in arch/x86/include/uapi/asm/stat.h) has 32-bit st_dev and st_rdev; struct compat_stat (defined in arch/x86/include/asm/compat.h) has 16-bit st_dev and st_rdev followed by a 16-bit padding. This patch fixes struct compat_stat to match struct stat. [ Historical note: the old x86 'struct stat' did have that 16-bit field that the compat layer had kept around, but it was changes back in 2003 by "struct stat - support larger dev_t": https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e95b2065677fe32512a597a79db94b77b90c968d and back in those days, the x86_64 port was still new, and separate from the i386 code, and had already picked up the old version with a 16-bit st_dev field ] Note that we can't change compat_dev_t because it is used by compat_loop_info. Also, if the st_dev and st_rdev values are 32-bit, we don't have to use old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW on filesystems that are on NVMe because NVMe uses the major number 259. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-10io-uring: Make statx API stableStefan Roesch1-14/+35
One of the key architectual tenets is to keep the parameters for io-uring stable. After the call has been submitted, its value can be changed. Unfortunaltely this is not the case for the current statx implementation. IO-Uring change: This changes replaces the const char * filename pointer in the io_statx structure with a struct filename *. In addition it also creates the filename object during the prepare phase. With this change, the opcode also needs to invoke cleanup, so the filename object gets freed after processing the request. fs change: This replaces the const char* __user filename parameter in the two functions do_statx and vfs_statx with a struct filename *. In addition to be able to correctly construct a filename object a new helper function getname_statx_lookup_flags is introduced. The function makes sure that do_statx and vfs_statx is invoked with the correct lookup flags. Signed-off-by: Stefan Roesch <shr@fb.com> Link: https://lore.kernel.org/r/20220225185326.1373304-2-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-17fs: add generic helper for filling statx attribute flagsAmir Goldstein1-0/+18
The immutable and append-only properties on an inode are published on the inode's i_flags and enforced by the VFS. Create a helper to fill the corresponding STATX_ATTR_ flags in the kstat structure from the inode's i_flags. Only orange was converted to use this helper. Other filesystems could use it in the future. Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-17fs: fix reporting supported extra file attributes for statx()Theodore Ts'o1-0/+8
statx(2) notes that any attribute that is not indicated as supported by stx_attributes_mask has no usable value. Commits 801e523796004 ("fs: move generic stat response attr handling to vfs_getattr_nosec") and 712b2698e4c02 ("fs/stat: Define DAX statx attribute") sets STATX_ATTR_AUTOMOUNT and STATX_ATTR_DAX, respectively, without setting stx_attributes_mask, which can cause xfstests generic/532 to fail. Fix this in the same way as commit 1b9598c8fb99 ("xfs: fix reporting supported extra file attributes for statx()") Fixes: 801e523796004 ("fs: move generic stat response attr handling to vfs_getattr_nosec") Fixes: 712b2698e4c02 ("fs/stat: Define DAX statx attribute") Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-01-24fs: make helpers idmap mount awareChristian Brauner1-3/+5
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24stat: handle idmapped mountsChristian Brauner1-6/+14
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-09-26fs: remove KSTAT_QUERY_FLAGSChristoph Hellwig1-4/+4
KSTAT_QUERY_FLAGS expands to AT_STATX_SYNC_TYPE, which itself already is a mask. Remove the double name, especially given that the prefix is a little confusing vs the normal AT_* flags. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-26fs: remove vfs_stat_set_lookup_flagsChristoph Hellwig1-21/+12
The function really obsfucates checking for valid flags and setting the lookup flags. The fact that it returns -EINVAL through and unsigned return value, which is then used as boolean really doesn't help either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-26fs: move vfs_fstatat out of lineChristoph Hellwig1-2/+7
This allows to keep vfs_statx static in fs/stat.c to prepare for the following changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-26fs: remove vfs_statx_fdChristoph Hellwig1-15/+7
vfs_statx_fd is only used to implement vfs_fstat. Remove vfs_statx_fd and just implement vfs_fstat directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-02Merge tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-0/+3
Pull DAX updates part one from Darrick Wong: "After many years of LKML-wrangling about how to enable programs to query and influence the file data access mode (DAX) when a filesystem resides on storage devices such as persistent memory, Ira Weiny has emerged with a proposed set of standard behaviors that has not been shot down by anyone! We're more or less standardizing on the current XFS behavior and adapting ext4 to do the same. This is the first of a handful pull requests that will make ext4 and XFS present a consistent interface for user programs that care about DAX. We add a statx attribute that programs can check to see if DAX is enabled on a particular file. Then, we update the DAX documentation to spell out the user-visible behaviors that filesystems will guarantee (until the next storage industry shakeup). The on-disk inode flag has been in XFS for a few years now. Summary: - Clean up io_is_direct. - Add a new statx flag to indicate when file data access is being done via DAX (as opposed to the page cache). - Update the documentation for how system administrators and application programmers can take advantage of the (still experimental DAX) feature" Link: https://lore.kernel.org/lkml/20200505002016.1085071-1-ira.weiny@intel.com/ * tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: Documentation/dax: Update Usage section fs/stat: Define DAX statx attribute fs: Remove unneeded IS_DAX() check in io_is_direct()
2020-06-02Merge tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-blockLinus Torvalds1-15/+22
Pull io_uring updates from Jens Axboe: "A relatively quiet round, mostly just fixes and code improvements. In particular: - Make statx just use the generic statx handler, instead of open coding it. We don't need that anymore, as we always call it async safe (Bijan) - Enable closing of the ring itself. Also fixes O_PATH closure (me) - Properly name completion members (me) - Batch reap of dead file registrations (me) - Allow IORING_OP_POLL with double waitqueues (me) - Add tee(2) support (Pavel) - Remove double off read (Pavel) - Fix overflow cancellations (Pavel) - Improve CQ timeouts (Pavel) - Async defer drain fixes (Pavel) - Add support for enabling/disabling notifications on a registered eventfd (Stefano) - Remove dead state parameter (Xiaoguang) - Disable SQPOLL submit on dying ctx (Xiaoguang) - Various code cleanups" * tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block: (29 commits) io_uring: fix overflowed reqs cancellation io_uring: off timeouts based only on completions io_uring: move timeouts flushing to a helper statx: hide interfaces no longer used by io_uring io_uring: call statx directly statx: allow system call to be invoked from io_uring io_uring: add io_statx structure io_uring: get rid of manual punting in io_close io_uring: separate DRAIN flushing into a cold path io_uring: don't re-read sqe->off in timeout_prep() io_uring: simplify io_timeout locking io_uring: fix flush req->refs underflow io_uring: don't submit sqes when ctx->refs is dying io_uring: async task poll trigger cleanup io_uring: add tee(2) support splice: export do_tee() io_uring: don't repeat valid flag list io_uring: rename io_file_put() io_uring: remove req->needs_fixed_files io_uring: cleanup io_poll_remove_one() logic ...
2020-05-26statx: hide interfaces no longer used by io_uringBijan Mottahedeh1-2/+3
The io_uring interfaces have been replaced by do_statx() and are no longer needed. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-26statx: allow system call to be invoked from io_uringBijan Mottahedeh1-13/+19
This is a prepatory patch to allow io_uring to invoke statx directly. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-14statx: add mount_rootMiklos Szeredi1-0/+3
Determining whether a path or file descriptor refers to a mountpoint (or more precisely a mount root) is not trivial using current tools. Add a flag to statx that indicates whether the path or fd refers to the root of a mount or not. Cc: linux-api@vger.kernel.org Cc: linux-man@vger.kernel.org Reported-by: Lennart Poettering <mzxreary@0pointer.de> Reported-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14statx: add mount IDMiklos Szeredi1-0/+4
Systemd is hacking around to get it and it's trivial to add to statx, so... Cc: linux-api@vger.kernel.org Cc: linux-man@vger.kernel.org Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14statx: don't clear STATX_ATIME on SB_RDONLYMiklos Szeredi1-1/+2
IS_NOATIME(inode) is defined as __IS_FLG(inode, SB_RDONLY|SB_NOATIME), so generic_fillattr() will clear STATX_ATIME from the result_mask if the super block is marked read only. This was probably not the intention, so fix to only clear STATX_ATIME if the fs doesn't support atime at all. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-14uapi: deprecate STATX_ALLMiklos Szeredi1-1/+0
Constants of the *_ALL type can be actively harmful due to the fact that developers will usually fail to consider the possible effects of future changes to the definition. Deprecate STATX_ALL in the uapi, while no damage has been done yet. We could keep something like this around in the kernel, but there's actually no point, since all filesystems should be explicitly checking flags that they support and not rely on the VFS masking unknown ones out: a flag could be known to the VFS, yet not known to the filesystem. Cc: David Howells <dhowells@redhat.com> Cc: linux-api@vger.kernel.org Cc: linux-man@vger.kernel.org Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-05-04fs/stat: Define DAX statx attributeIra Weiny1-0/+3
In order for users to determine if a file is currently operating in DAX state (effective DAX). Define a statx attribute value and set that attribute if the effective DAX flag is set. To go along with this we propose the following addition to the statx man page: STATX_ATTR_DAX The file is in the DAX (cpu direct access) state. DAX state attempts to minimize software cache effects for both I/O and memory mappings of this file. It requires a file system which has been configured to support DAX. DAX generally assumes all accesses are via cpu load / store instructions which can minimize overhead for small accesses, but may adversely affect cpu utilization for large transfers. File I/O is done directly to/from user-space buffers and memory mapped I/O may be performed with direct memory mappings that bypass kernel page cache. While the DAX property tends to result in data being transferred synchronously, it does not give the same guarantees of O_SYNC where data and the necessary metadata are transferred together. A DAX file may support being mapped with the MAP_SYNC flag, which enables a program to use CPU cache flush instructions to persist CPU store operations without an explicit fsync(2). See mmap(2) for more information. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-01-20fs: make two stat prep helpers availableJens Axboe1-12/+22
To implement an async stat, we need to provide the flags mapping and the statx user copy. Make them available internally, through fs/internal.h. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-01fs: move generic stat response attr handling to vfs_getattr_nosecChristoph Hellwig1-5/+7
generic_fillattr is an optional helper that isn't used by all file systems, move handling purely based on inode flags to vfs_getattr_nosec, which is common code. This fixes setting this flag for file systems not using generic_fillattr like xfs. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-29y2038: Remove newstat family from default syscall setArnd Bergmann1-0/+3
We have four generations of stat() syscalls: - the oldstat syscalls that are only used on the older architectures - the newstat family that is used on all 64-bit architectures but lacked support for large files on 32-bit architectures. - the stat64 family that is used mostly on 32-bit architectures to replace newstat - statx() to replace all of the above, adding 64-bit timestamps among other things. We already compile stat64 only on those architectures that need it, but newstat is always built, including on those that don't reference it. This adds a new __ARCH_WANT_NEW_STAT symbol along the lines of __ARCH_WANT_OLD_STAT and __ARCH_WANT_STAT64 to control compilation of newstat. All architectures that need it use an explict define, the others now get a little bit smaller, and future architecture (including 64-bit targets) won't ever see it. Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-02fs: add do_readlinkat() helper; remove internal call to sys_readlinkat()Dominik Brodowski1-3/+9
Using the do_readlinkat() helper removes an in-kernel call to the sys_readlinkat() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman1-0/+1
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-17fs: Provide __inode_get_bytes()Jan Kara1-1/+1
Provide helper __inode_get_bytes() which assumes i_lock is already acquired. Quota code will need this to be able to use i_lock to protect consistency of quota accounting information and inode usage. Signed-off-by: Jan Kara <jack@suse.cz>
2017-06-09ufs: restore maintaining ->i_blocksAl Viro1-0/+1
Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-02Merge branch 'work.compat' of ↵Linus Torvalds1-0/+86
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fs/compat.c cleanups from Al Viro: "More moving of compat syscalls from fs/compat.c to fs/*.c where the native counterparts live. And death to compat_sys_getdents64() - the only architecture that used to need it was ia64, and _that_ has lost biarch support quite a few years ago" * 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/compat.c: trim unused includes move compat_rw_copy_check_uvector() over to fs/read_write.c fhandle: move compat syscalls from compat.c open: move compat syscalls from compat.c stat: move compat syscalls from compat.c fcntl: move compat syscalls from compat.c readdir: move compat syscalls from compat.c statfs: move compat syscalls from compat.c utimes: move compat syscalls from compat.c move compat select-related syscalls to fs/select.c Remove compat_sys_getdents64()
2017-04-27statx: correct error handling of NULL pathnameMichael Kerrisk (man-pages)1-2/+0
The change in commit 1e2f82d1e9d1 ("statx: Kill fd-with-NULL-path support in favour of AT_EMPTY_PATH") to error on a NULL pathname to statx() is inconsistent. It results in the error EINVAL for a NULL pathname. Other system calls with similar APIs (fchownat(), fstatat(), linkat()), return EFAULT. The solution is simply to remove the EINVAL check. As I already pointed out in [1], user_path_at*() and filename_lookup() will handle the NULL pathname as per the other APIs, to correctly produce the error EFAULT. [1] https://lkml.org/lkml/2017/4/26/561 Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-26statx: Kill fd-with-NULL-path support in favour of AT_EMPTY_PATHDavid Howells1-7/+6
With the new statx() syscall, the following both allow the attributes of the file attached to a file descriptor to be retrieved: statx(dfd, NULL, 0, ...); and: statx(dfd, "", AT_EMPTY_PATH, ...); Change the code to reject the first option, though this means copying the path and engaging pathwalk for the fstat() equivalent. dfd can be a non-directory provided path is "". [ The timing of this isn't wonderful, but applying this now before we have statx() in any released kernel, before anybody starts using the NULL special case. - Linus ] Fixes: a528d35e8bfc ("statx: Add a system call to make enhanced file info available") Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric Sandeen <sandeen@sandeen.net> cc: fstests@vger.kernel.org cc: linux-api@vger.kernel.org cc: linux-man@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-17stat: move compat syscalls from compat.cAl Viro1-0/+86
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03statx: Include a mask for stx_attributes in struct statxDavid Howells1-0/+1
Include a mask in struct stat to indicate which bits of stx_attributes the filesystem actually supports. This would also be useful if we add another system call that allows you to do a 'bulk attribute set' and pass in a statx struct with the masks appropriately set to say what you want to set. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03statx: Reserve the top bit of the mask for future struct expansionDavid Howells1-0/+2
Reserve the top bit of the mask for future expansion of the statx struct and give an error if statx() sees it set. All the other bits are ignored if we see them set but don't support the bit; we just clear the bit in the returned mask. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03statx: optimize copy of struct statx to userspaceEric Biggers1-42/+32
I found that statx() was significantly slower than stat(). As a microbenchmark, I compared 10,000,000 invocations of fstat() on a tmpfs file to the same with statx() passed a NULL path: $ time ./stat_benchmark real 0m1.464s user 0m0.275s sys 0m1.187s $ time ./statx_benchmark real 0m5.530s user 0m0.281s sys 0m5.247s statx is expected to be a little slower than stat because struct statx is larger than struct stat, but not by *that* much. It turns out that most of the overhead was in copying struct statx to userspace, mostly in all the stac/clac instructions that got generated for each __put_user() call. (This was on x86_64, but some other architectures, e.g. arm64, have something similar now too.) stat() instead initializes its struct on the stack and copies it to userspace with a single call to copy_to_user(). This turns out to be much faster, and changing statx to do this makes it almost as fast as stat: $ time ./statx_benchmark real 0m1.624s user 0m0.270s sys 0m1.354s For zeroing the reserved fields, start by zeroing the full struct with memset. This makes it clear that every byte copied to userspace is initialized, even implicit padding bytes (though there are none currently). In the scenarios I tested, it also performed the same as a designated initializer. Manually initializing each field was still slightly faster, but would have been more error-prone and less verifiable. Also rename statx_set_result() to cp_statx() for consistency with cp_old_stat() et al., and make it noinline so that struct statx doesn't add to the stack usage during the main portion of the syscall execution. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03statx: remove incorrect part of vfs_statx() commentEric Biggers1-3/+0
request_mask and query_flags are function arguments, not passed in struct kstat. So remove the part of the comment which claims otherwise. This was apparently left over from an earlier version of the statx patch. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03statx: reject unknown flags when using NULL pathEric Biggers1-1/+5
The statx() system call currently accepts unknown flags when called with a NULL path to operate on a file descriptor. Left unchanged, this could make it hard to introduce new query flags in the future, since applications may not be able to tell whether a given flag is supported. Fix this by failing the system call with EINVAL if any flags other than KSTAT_QUERY_FLAGS are specified in combination with a NULL path. Arguably, we could still permit known lookup-related flags such as AT_SYMLINK_NOFOLLOW. However, that would be inconsistent with how sys_utimensat() behaves when passed a NULL path, which seems to be the closest precedent. And given that the NULL path case is (I believe) mainly intended to be used to implement a wrapper function like fstatx() that doesn't have a path argument, I think rejecting lookup-related flags too is probably the best choice. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-03Merge branch 'rebased-statx' of ↵Linus Torvalds1-38/+176
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs 'statx()' update from Al Viro. This adds the new extended stat() interface that internally subsumes our previous stat interfaces, and allows user mode to specify in more detail what kind of information it wants. It also allows for some explicit synchronization information to be passed to the filesystem, which can be relevant for network filesystems: is the cached value ok, or do you need open/close consistency, or what? From David Howells. Andreas Dilger points out that the first version of the extended statx interface was posted June 29, 2010: https://www.spinics.net/lists/linux-fsdevel/msg33831.html * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: statx: Add a system call to make enhanced file info available
2017-03-02statx: Add a system call to make enhanced file info availableDavid Howells1-38/+176
Add a system call to make extended file information available, including file creation and some attribute flags where available through the underlying filesystem. The getattr inode operation is altered to take two additional arguments: a u32 request_mask and an unsigned int flags that indicate the synchronisation mode. This change is propagated to the vfs_getattr*() function. Functions like vfs_stat() are now inline wrappers around new functions vfs_statx() and vfs_statx_fd() to reduce stack usage. ======== OVERVIEW ======== The idea was initially proposed as a set of xattrs that could be retrieved with getxattr(), but the general preference proved to be for a new syscall with an extended stat structure. A number of requests were gathered for features to be included. The following have been included: (1) Make the fields a consistent size on all arches and make them large. (2) Spare space, request flags and information flags are provided for future expansion. (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an __s64). (4) Creation time: The SMB protocol carries the creation time, which could be exported by Samba, which will in turn help CIFS make use of FS-Cache as that can be used for coherency data (stx_btime). This is also specified in NFSv4 as a recommended attribute and could be exported by NFSD [Steve French]. (5) Lightweight stat: Ask for just those details of interest, and allow a netfs (such as NFS) to approximate anything not of interest, possibly without going to the server [Trond Myklebust, Ulrich Drepper, Andreas Dilger] (AT_STATX_DONT_SYNC). (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks its cached attributes are up to date [Trond Myklebust] (AT_STATX_FORCE_SYNC). And the following have been left out for future extension: (7) Data version number: Could be used by userspace NFS servers [Aneesh Kumar]. Can also be used to modify fill_post_wcc() in NFSD which retrieves i_version directly, but has just called vfs_getattr(). It could get it from the kstat struct if it used vfs_xgetattr() instead. (There's disagreement on the exact semantics of a single field, since not all filesystems do this the same way). (8) BSD stat compatibility: Including more fields from the BSD stat such as creation time (st_btime) and inode generation number (st_gen) [Jeremy Allison, Bernd Schubert]. (9) Inode generation number: Useful for FUSE and userspace NFS servers [Bernd Schubert]. (This was asked for but later deemed unnecessary with the open-by-handle capability available and caused disagreement as to whether it's a security hole or not). (10) Extra coherency data may be useful in making backups [Andreas Dilger]. (No particular data were offered, but things like last backup timestamp, the data version number and the DOS archive bit would come into this category). (11) Allow the filesystem to indicate what it can/cannot provide: A filesystem can now say it doesn't support a standard stat feature if that isn't available, so if, for instance, inode numbers or UIDs don't exist or are fabricated locally... (This requires a separate system call - I have an fsinfo() call idea for this). (12) Store a 16-byte volume ID in the superblock that can be returned in struct xstat [Steve French]. (Deferred to fsinfo). (13) Include granularity fields in the time data to indicate the granularity of each of the times (NFSv4 time_delta) [Steve French]. (Deferred to fsinfo). (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags. Note that the Linux IOC flags are a mess and filesystems such as Ext4 define flags that aren't in linux/fs.h, so translation in the kernel may be a necessity (or, possibly, we provide the filesystem type too). (Some attributes are made available in stx_attributes, but the general feeling was that the IOC flags were to ext[234]-specific and shouldn't be exposed through statx this way). (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer, Michael Kerrisk]. (Deferred, probably to fsinfo. Finding out if there's an ACL or seclabal might require extra filesystem operations). (16) Femtosecond-resolution timestamps [Dave Chinner]. (A __reserved field has been left in the statx_timestamp struct for this - if there proves to be a need). (17) A set multiple attributes syscall to go with this. =============== NEW SYSTEM CALL =============== The new system call is: int ret = statx(int dfd, const char *filename, unsigned int flags, unsigned int mask, struct statx *buffer); The dfd, filename and flags parameters indicate the file to query, in a similar way to fstatat(). There is no equivalent of lstat() as that can be emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is also no equivalent of fstat() as that can be emulated by passing a NULL filename to statx() with the fd of interest in dfd. Whether or not statx() synchronises the attributes with the backing store can be controlled by OR'ing a value into the flags argument (this typically only affects network filesystems): (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this respect. (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise its attributes with the server - which might require data writeback to occur to get the timestamps correct. (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a network filesystem. The resulting values should be considered approximate. mask is a bitmask indicating the fields in struct statx that are of interest to the caller. The user should set this to STATX_BASIC_STATS to get the basic set returned by stat(). It should be noted that asking for more information may entail extra I/O operations. buffer points to the destination for the data. This must be 256 bytes in size. ====================== MAIN ATTRIBUTES RECORD ====================== The following structures are defined in which to return the main attribute set: struct statx_timestamp { __s64 tv_sec; __s32 tv_nsec; __s32 __reserved; }; struct statx { __u32 stx_mask; __u32 stx_blksize; __u64 stx_attributes; __u32 stx_nlink; __u32 stx_uid; __u32 stx_gid; __u16 stx_mode; __u16 __spare0[1]; __u64 stx_ino; __u64 stx_size; __u64 stx_blocks; __u64 __spare1[1]; struct statx_timestamp stx_atime; struct statx_timestamp stx_btime; struct statx_timestamp stx_ctime; struct statx_timestamp stx_mtime; __u32 stx_rdev_major; __u32 stx_rdev_minor; __u32 stx_dev_major; __u32 stx_dev_minor; __u64 __spare2[14]; }; The defined bits in request_mask and stx_mask are: STATX_TYPE Want/got stx_mode & S_IFMT STATX_MODE Want/got stx_mode & ~S_IFMT STATX_NLINK Want/got stx_nlink STATX_UID Want/got stx_uid STATX_GID Want/got stx_gid STATX_ATIME Want/got stx_atime{,_ns} STATX_MTIME Want/got stx_mtime{,_ns} STATX_CTIME Want/got stx_ctime{,_ns} STATX_INO Want/got stx_ino STATX_SIZE Want/got stx_size STATX_BLOCKS Want/got stx_blocks STATX_BASIC_STATS [The stuff in the normal stat struct] STATX_BTIME Want/got stx_btime{,_ns} STATX_ALL [All currently available stuff] stx_btime is the file creation time, stx_mask is a bitmask indicating the data provided and __spares*[] are where as-yet undefined fields can be placed. Time fields are structures with separate seconds and nanoseconds fields plus a reserved field in case we want to add even finer resolution. Note that times will be negative if before 1970; in such a case, the nanosecond fields will also be negative if not zero. The bits defined in the stx_attributes field convey information about a file, how it is accessed, where it is and what it does. The following attributes map to FS_*_FL flags and are the same numerical value: STATX_ATTR_COMPRESSED File is compressed by the fs STATX_ATTR_IMMUTABLE File is marked immutable STATX_ATTR_APPEND File is append-only STATX_ATTR_NODUMP File is not to be dumped STATX_ATTR_ENCRYPTED File requires key to decrypt in fs Within the kernel, the supported flags are listed by: KSTAT_ATTR_FS_IOC_FLAGS [Are any other IOC flags of sufficient general interest to be exposed through this interface?] New flags include: STATX_ATTR_AUTOMOUNT Object is an automount trigger These are for the use of GUI tools that might want to mark files specially, depending on what they are. Fields in struct statx come in a number of classes: (0) stx_dev_*, stx_blksize. These are local system information and are always available. (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino, stx_size, stx_blocks. These will be returned whether the caller asks for them or not. The corresponding bits in stx_mask will be set to indicate whether they actually have valid values. If the caller didn't ask for them, then they may be approximated. For example, NFS won't waste any time updating them from the server, unless as a byproduct of updating something requested. If the values don't actually exist for the underlying object (such as UID or GID on a DOS file), then the bit won't be set in the stx_mask, even if the caller asked for the value. In such a case, the returned value will be a fabrication. Note that there are instances where the type might not be valid, for instance Windows reparse points. (2) stx_rdev_*. This will be set only if stx_mode indicates we're looking at a blockdev or a chardev, otherwise will be 0. (3) stx_btime. Similar to (1), except this will be set to 0 if it doesn't exist. ======= TESTING ======= The following test program can be used to test the statx system call: samples/statx/test-statx.c Just compile and run, passing it paths to the files you want to examine. The file is built automatically if CONFIG_SAMPLES is enabled. Here's some example output. Firstly, an NFS directory that crosses to another FSID. Note that the AUTOMOUNT attribute is set because transiting this directory will cause d_automount to be invoked by the VFS. [root@andromeda ~]# /tmp/test-statx -A /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:26 Inode: 1703937 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------) Secondly, the result of automounting on that directory. [root@andromeda ~]# /tmp/test-statx /warthog/data statx(/warthog/data) = 0 results=7ff Size: 4096 Blocks: 8 IO Block: 1048576 directory Device: 00:27 Inode: 2 Links: 125 Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041 Access: 2016-11-24 09:02:12.219699527+0000 Modify: 2016-11-17 10:44:36.225653653+0000 Change: 2016-11-17 10:44:36.225653653+0000 Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-02sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>Ingo Molnar1-0/+1
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h doing that for them. Note that even if the count where we need to add extra headers seems high, it's still a net win, because <linux/sched.h> is included in over 2,200 files ... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-27fs: add i_blocksize()Fabian Frederick1-1/+1
Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs branch. This patch also fixes multiple checkpatch warnings: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' Thanks to Andrew Morton for suggesting more appropriate function instead of macro. [geliangtang@gmail.com: truncate: use i_blocksize()] Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds1-1/+1
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-09vfs: replace calling i_op->readlink with vfs_readlink()Miklos Szeredi1-3/+5
Also check d_is_symlink() in callers instead of inode->i_op->readlink because following patches will allow NULL ->readlink for symlinks. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-01-16fs/stat.c: drop the last new_valid_dev checkYaowei Bai1-1/+1
New_valid_dev() always returns true, so that's unnecessary to perform new_valid_dev() checks in some filesystems. Most checks of new_valid_dev() have been removed so let's drop this last one and then we can remove new_valid_dev() from the source code. No functional change. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-09fs/stat.c: remove unnecessary new_valid_dev() checkYaowei Bai1-2/+0
new_valid_dev() always returns 1, so the !new_valid_dev() check is not needed. Remove it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15VFS: assorted d_backing_inode() annotationsDavid Howells1-2/+2
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11switch security_inode_getattr() to struct path *Al Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09vfs: split out vfs_getattr_nosecJ. Bruce Fields1-6/+25
The filehandle lookup code wants this version of getattr. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-17quota: provide interface for readding allocated space into reserved spaceJan Kara1-2/+9
ext4 needs to convert allocated (metadata) blocks back into blocks reserved for delayed allocation. Add functions into quota code for supporting such operation. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-26switch vfs_getattr() to struct pathAl Viro1-7/+6
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20vfs: fix readlinkat to retry on ESTALEJeff Layton1-1/+7
Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20vfs: make fstatat retry on ESTALE errors from getattr callJeff Layton1-2/+6
Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-02Merge branch 'for-linus' of ↵Linus Torvalds1-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...
2012-10-01Merge tag 'arm64-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64 Pull arm64 support from Catalin Marinas: "Linux support for the 64-bit ARM architecture (AArch64) Features currently supported: - 39-bit address space for user and kernel (each) - 4KB and 64KB page configurations - Compat (32-bit) user applications (ARMv7, EABI only) - Flattened Device Tree (mandated for all AArch64 platforms) - ARM generic timers" * tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (35 commits) arm64: ptrace: remove obsolete ptrace request numbers from user headers arm64: Do not set the SMP/nAMP processor bit arm64: MAINTAINERS update arm64: Build infrastructure arm64: Miscellaneous header files arm64: Generic timers support arm64: Loadable modules arm64: Miscellaneous library functions arm64: Performance counters support arm64: Add support for /proc/sys/debug/exception-trace arm64: Debugging support arm64: Floating point and SIMD arm64: 32-bit (compat) applications support arm64: User access library functions arm64: Signal handling support arm64: VDSO support arm64: System calls handling arm64: ELF definitions arm64: SMP support arm64: DMA mapping API ...
2012-09-26switch simple cases of fget_light to fdgetAl Viro1-5/+5
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-14vfs: make O_PATH file descriptors usable for 'fstat()'Linus Torvalds1-1/+1
We already use them for openat() and friends, but fstat() also wants to be able to use O_PATH file descriptors. This should make it more directly comparable to the O_SEARCH of Solaris. Note that you could already do the same thing with "fstatat()" and an empty path, but just doing "fstat()" directly is simpler and faster, so there is no reason not to just allow it directly. See also commit 332a2e1244bd, which did the same thing for fchdir, for the same reasons. Reported-by: ольга крыжановская <olga.kryzhanovska@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org # O_PATH introduced in 3.0+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-14fs: Build sys_stat64() and friends if __ARCH_WANT_COMPAT_STAT64Catalin Marinas1-2/+2
On AArch64 Linux, we want the sys_stat64() and related functions for compat support but do not need the generic struct stat64, enabled automatically if __ARCH_WANT_STAT64. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
2012-05-23Merge branch 'for-linus' of ↵Linus Torvalds1-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...
2012-05-21Merge branch 'vfs-cleanups' (random vfs cleanups)Linus Torvalds1-2/+3
This teaches vfs_fstat() to use the appropriate f[get|put]_light functions, allowing it to avoid some unnecessary locking for the common case. More noticeably, it also cleans up and simplifies the "getname_flags()" function, which now relies on the architecture strncpy_from_user() doing all the user access checks properly, instead of hacking around the fact that on x86 it didn't use to do it right (see commit 92ae03f2ef99: "x86: merge 32/64-bit versions of 'strncpy_from_user()' and speed it up"). * vfs-cleanups: VFS: make vfs_fstat() use f[get|put]_light() VFS: clean up and simplify getname_flags() x86: make word-at-a-time strncpy_from_user clear bytes at the end
2012-05-15userns: Convert stat to return values mapped from kuids and kgidsEric W. Biederman1-6/+6
- Store uids and gids with kuid_t and kgid_t in struct kstat - Convert uid and gids to userspace usable values with from_kuid and from_kgid Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-06vfs: don't force a big memset of stat data just to clear padding fieldsLinus Torvalds1-2/+10
Admittedly this is something that the compiler should be able to just do for us, but gcc just isn't that smart. And trying to use a structure initializer (which would get us the right semantics) ends up resulting in gcc allocating stack space for _two_ 'struct stat', and then copying one into the other. So do it by hand - just have a per-architecture macro that initializes the padding fields. And if the architecture doesn't provide one, fall back to the old behavior of just doing the whole memset() first. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-06vfs: de-crapify "cp_new_stat()" functionLinus Torvalds1-18/+14
It's an unreadable mess of 32-bit vs 64-bit #ifdef's that mostly follow a rather simple pattern. Make a helper #define to handle that pattern, in the process making the code both shorter and more readable. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-28VFS: make vfs_fstat() use f[get|put]_light()Linus Torvalds1-2/+3
Use the *_light() versions that properly avoid doing the file user count updates when they are unnecessary. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-24Merge tag 'module-for-3.4' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull cleanup of fs/ and lib/ users of module.h from Paul Gortmaker: "Fix up files in fs/ and lib/ dirs to only use module.h if they really need it. These are trivial in scope vs the work done previously. We now have things where any few remaining cleanups can be farmed out to arch or subsystem maintainers, and I have done so when possible. What is remaining here represents the bits that don't clearly lie within a single arch/subsystem boundary, like the fs dir and the lib dir. Some duplicate includes arising from overlapping fixes from independent subsystem maintainer submissions are also quashed." Fix up trivial conflicts due to clashes with other include file cleanups (including some due to the previous bug.h cleanup pull). * tag 'module-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: lib: reduce the use of module.h wherever possible fs: reduce the use of module.h wherever possible includecheck: delete any duplicate instances of module.h
2012-03-20switch touch_atime to struct pathAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-02-28fs: reduce the use of module.h wherever possiblePaul Gortmaker1-1/+1
For files only using THIS_MODULE and/or EXPORT_SYMBOL, map them onto including export.h -- or if the file isn't even using those, then just delete the include. Fix up any implicit include dependencies that were being masked by module.h along the way. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-11-02readlinkat: ensure we return ENOENT for the empty pathname for normal lookupsAndy Whitcroft1-2/+3
Since the commit below which added O_PATH support to the *at() calls, the error return for readlink/readlinkat for the empty pathname has switched from ENOENT to EINVAL: commit 65cfc6722361570bfe255698d9cd4dccaf47570d Author: Al Viro <viro@zeniv.linux.org.uk> Date: Sun Mar 13 15:56:26 2011 -0400 readlinkat(), fchownat() and fstatat() with empty relative pathnames This is both unexpected for userspace and makes readlink/readlinkat inconsistant with all other interfaces; and inconsistant with our stated return for these pathnames. As the readlinkat call does not have a flags parameter we cannot use the AT_EMPTY_PATH approach used in the other calls. Therefore expose whether the original path is infact entry via a new user_path_at_empty() path lookup function. Use this to determine whether to default to EINVAL or ENOENT for failures. Addresses http://bugs.launchpad.net/bugs/817187 [akpm@linux-foundation.org: remove unused getname_flags()] Signed-off-by: Andy Whitcroft <apw@canonical.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-09-27vfs: remove LOOKUP_NO_AUTOMOUNT flagLinus Torvalds1-2/+0
That flag no longer makes sense, since we don't look up automount points as eagerly any more. Additionally, it turns out that the NO_AUTOMOUNT handling was buggy to begin with: it would avoid automounting even for cases where we really *needed* to do the automount handling, and could return ENOENT for autofs entries that hadn't been instantiated yet. With our new non-eager automount semantics, one discussion has been about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the newfstatat() and fstatat64() system calls), but it's probably not worth it: you can always force at least directory automounting by simply adding the final '/' to the filename, which works for *all* of the stat family system calls, old and new. So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a result of our bad default behavior. Acked-by: Ian Kent <raven@themaw.net> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-06vfs: optimize inode cache access patternsLinus Torvalds1-2/+2
The inode structure layout is largely random, and some of the vfs paths really do care. The path lookup in particular is already quite D$ intensive, and profiles show that accessing the 'inode->i_op->xyz' fields is quite costly. We already optimized the dcache to not unnecessarily load the d_op structure for members that are often NULL using the DCACHE_OP_xyz bits in dentry->d_flags, and this does something very similar for the inode ops that are used during pathname lookup. It also re-orders the fields so that the fields accessed by 'stat' are together at the beginning of the inode structure, and roughly in the order accessed. The effect of this seems to be in the 1-2% range for an empty kernel "make -j" run (which is fairly kernel-intensive, mostly in filename lookup), so it's visible. The numbers are fairly noisy, though, and likely depend a lot on exact microarchitecture. So there's more tuning to be done. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-15readlinkat(), fchownat() and fstatat() with empty relative pathnamesAl Viro1-2/+5
For readlinkat() we simply allow empty pathname; it will fail unless we have dfd equal to O_PATH-opened symlink, so we are outside of POSIX scope here. For fchownat() and fstatat() we allow AT_EMPTY_PATH; let the caller explicitly ask for such behaviour. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-15Add an AT_NO_AUTOMOUNT flag to suppress terminal automountDavid Howells1-1/+3
Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of automount point directories. This can be used by fstatat() users to permit the gathering of attributes on an automount point and also prevent mass-automounting of a directory of automount points by ls. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-13Mark arguments to certain syscalls as being constDavid Howells1-11/+18
Mark arguments to certain system calls as being const where they should be but aren't. The list includes: (*) The filename arguments of various stat syscalls, execve(), various utimes syscalls and some mount syscalls. (*) The filename arguments of some syscall helpers relating to the above. (*) The buffer argument of various write syscalls. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-23Add unlocked version of inode_add_bytes() functionDmitry Monakhov1-2/+8
Quota code requires unlocked version of this function. Off course we can just copy-paste the code, but copy-pasting is always an evil. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-04-20kill vfs_stat_fd / vfs_lstat_fdChristoph Hellwig1-63/+42
There's really no reason to keep vfs_stat_fd and vfs_lstat_fd with Oleg's vfs_fstatat. Use vfs_fstatat for the few cases having the directory fd, and switch all others to vfs_stat / vfs_lstat. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-04-20Separate out common fstatat code into vfs_fstatatOleg Drokin1-28/+28
This is a version incorporating Christoph's suggestion. Separate out common *fstatat functionality into a single function instead of duplicating it all over the code. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-01-14[CVE-2009-0029] System call wrappers part 30Heiko Carstens1-6/+6
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14[CVE-2009-0029] System call wrappers part 16Heiko Carstens1-2/+2
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14[CVE-2009-0029] System call wrappers part 11Heiko Carstens1-8/+12
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14[CVE-2009-0029] System call wrappers part 10Heiko Carstens1-1/+1
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-05inode->i_op is never NULLAl Viro1-1/+1
We used to have rather schizophrenic set of checks for NULL ->i_op even though it had been eliminated years ago. You'd need to go out of your way to set it to NULL explicitly _and_ a bunch of code would die on such inodes anyway. After killing two remaining places that still did that bogosity, all that crap can go away. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-07-26[PATCH] sanitize __user_walk_fd() et.al.Al Viro1-16/+16
* do not pass nameidata; struct path is all the callers want. * switch to new helpers: user_path_at(dfd, pathname, flags, &path) user_path(pathname, &path) user_lpath(pathname, &path) user_path_dir(pathname, &path) (fail if not a directory) The last 3 are trivial macro wrappers for the first one. * remove nameidata in callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-02-14Introduce path_put()Jan Blunck1-3/+3
* Add path_put() functions for releasing a reference to the dentry and vfsmount of a struct path in the right order * Switch from path_release(nd) to path_put(&nd->path) * Rename dput_path() to path_put_conditional() [akpm@linux-foundation.org: fix cifs] Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Cc: <linux-fsdevel@vger.kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14Embed a struct path into struct nameidata instead of nd->{dentry,mnt}Jan Blunck1-6/+7
This is the central patch of a cleanup series. In most cases there is no good reason why someone would want to use a dentry for itself. This series reflects that fact and embeds a struct path into nameidata. Together with the other patches of this series - it enforced the correct order of getting/releasing the reference count on <dentry,vfsmount> pairs - it prepares the VFS for stacking support since it is essential to have a struct path in every place where the stack can be traversed - it reduces the overall code size: without patch series: text data bss dec hex filename 5321639 858418 715768 6895825 6938d1 vmlinux with patch series: text data bss dec hex filename 5320026 858418 715768 6894212 693284 vmlinux This patch: Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix cifs] [akpm@linux-foundation.org: fix smack] Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08header cleaning: don't include smp_lock.h when not usedRandy Dunlap1-1/+0
Remove includes of <linux/smp_lock.h> where it is not used/needed. Suggested by Al Viro. Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc, sparc64, and arm (all 59 defconfigs). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-12-08[PATCH] VFS: change struct file to use struct pathJosef "Jeff" Sipek1-1/+1
This patch changes struct file to use struct path instead of having independent pointers to struct dentry and struct vfsmount, and converts all users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}. Additionally, it adds two #define's to make the transition easier for users of the f_dentry and f_vfsmnt. Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] vfs_getattr(): remove dead codeAndrew Morton1-7/+0
As Mikulas points out, (1 << anything) won't be evaluating to zero. This code is long-dead. Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03[PATCH] VFS: Make filldir_t and struct kstat deal in 64-bit inode numbersDavid Howells1-0/+6
These patches make the kernel pass 64-bit inode numbers internally when communicating to userspace, even on a 32-bit system. They are required because some filesystems have intrinsic 64-bit inode numbers: NFS3+ and XFS for example. The 64-bit inode numbers are then propagated to userspace automatically where the arch supports it. Problems have been seen with userspace (eg: ld.so) using the 64-bit inode number returned by stat64() or getdents64() to differentiate files, and failing because the 64-bit inode number space was compressed to 32-bits, and so overlaps occur. This patch: Make filldir_t take a 64-bit inode number and struct kstat carry a 64-bit inode number so that 64-bit inode numbers can be passed back to userspace. The stat functions then returns the full 64-bit inode number where available and where possible. If it is not possible to represent the inode number supplied by the filesystem in the field provided by userspace, then error EOVERFLOW will be issued. Similarly, the getdents/readdir functions now pass the full 64-bit inode number to userspace where possible, returning EOVERFLOW instead when a directory entry is encountered that can't be properly represented. Note that this means that some inodes will not be stat'able on a 32-bit system with old libraries where they were before - but it does mean that there will be no ambiguity over what a 32-bit inode number refers to. Note similarly that directory scans may be cut short with an error on a 32-bit system with old libraries where the scan would work before for the same reasons. It is judged unlikely that this situation will occur because modern glibc uses 64-bit capable versions of stat and getdents class functions exclusively, and that older systems are unlikely to encounter unrepresentable inode numbers anyway. [akpm: alpha build fix] Signed-off-by: David Howells <dhowells@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27[PATCH] inode-diet: Eliminate i_blksize from the inode structureTheodore Ts'o1-1/+2
This eliminates the i_blksize field from struct inode. Filesystems that want to provide a per-inode st_blksize can do so by providing their own getattr routine instead of using the generic_fillattr() function. Note that some filesystems were providing pretty much random (and incorrect) values for i_blksize. [bunk@stusta.de: cleanup] [akpm@osdl.org: generic_fillattr() fix] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30Remove obsolete #include <linux/config.h>Jörn Engel1-1/+0
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-28[PATCH] powerpc: Wire up *at syscallsAndreas Schwab1-1/+1
Wire up *at syscalls. This patch has been tested on ppc64 (using glibc's testsuite, both 32bit and 64bit), and compile-tested for ppc32 (I have currently no ppc32 system available, but I expect no problems). Signed-off-by: Andreas Schwab <schwab@suse.de> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-11[PATCH] fstatat64 supportUlrich Drepper1-0/+22
The *at patches introduced fstatat and, due to inusfficient research, I used the newfstat functions generally as the guideline. The result is that on 32-bit platforms we don't have all the information needed to implement fstatat64. This patch modifies the code to pass up 64-bit information if __ARCH_WANT_STAT64 is defined. I renamed the syscall entry point to make this clear. Other archs will continue to use the existing code. On x86-64 the compat code is implemented using a new sys32_ function. this is what is done for the other stat syscalls as well. This patch might break some other archs (those which define __ARCH_WANT_STAT64 and which already wired up the syscall). Yet others might need changes to accomodate the compatibility mode. I really don't want to do that work because all this stat handling is a mess (more so in glibc, but the kernel is also affected). It should be done by the arch maintainers. I'll provide some stand-alone test shortly. Those who are eager could compile glibc and run 'make check' (no installation needed). The patch below has been tested on x86 and x86-64. Signed-off-by: Ulrich Drepper <drepper@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] vfs: *at functions: coreUlrich Drepper1-13/+53
Here is a series of patches which introduce in total 13 new system calls which take a file descriptor/filename pair instead of a single file name. These functions, openat etc, have been discussed on numerous occasions. They are needed to implement race-free filesystem traversal, they are necessary to implement a virtual per-thread current working directory (think multi-threaded backup software), etc. We have in glibc today implementations of the interfaces which use the /proc/self/fd magic. But this code is rather expensive. Here are some results (similar to what Jim Meyering posted before). The test creates a deep directory hierarchy on a tmpfs filesystem. Then rm -fr is used to remove all directories. Without syscall support I get this: real 0m31.921s user 0m0.688s sys 0m31.234s With syscall support the results are much better: real 0m20.699s user 0m0.536s sys 0m20.149s The interfaces are for obvious reasons currently not much used. But they'll be used. coreutils (and Jeff's posixutils) are already using them. Furthermore, code like ftw/fts in libc (maybe even glob) will also start using them. I expect a patch to make follow soon. Every program which is walking the filesystem tree will benefit. Signed-off-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@ftp.linux.org.uk> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16Linux-2.6.12-rc2v2.6.12-rc2Linus Torvalds1-0/+410
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!