Merge tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull folio updates from Christian Brauner:
"Add a new folio_next_pos() helper function that returns the file
position of the first byte after the current folio. This is a common
operation in filesystems when needing to know the end of the current
folio.
The helper is lifted from btrfs which already had its own version, and
is now used across multiple filesystems and subsystems:
- btrfs
- buffer
- ext4
- f2fs
- gfs2
- iomap
- netfs
- xfs
- mm
This fixes a long-standing bug in ocfs2 on 32-bit systems with files
larger than 2GiB. Presumably this is not a common configuration, but
the fix is backported anyway. The other filesystems did not have bugs,
they were just mildly inefficient.
This also introduces uoff_t as the unsigned version of loff_t. A recent
commit inadvertently changed a comparison from being unsigned (on
64-bit systems) to being signed (which it had always been on 32-bit
systems), leading to sporadic fstests failures.
Generally file sizes are restricted to being a signed integer, but in
places where -1 is passed to indicate "up to the end of the file", it
is convenient to have an unsigned type to ensure comparisons are
always unsigned regardless of architecture"
* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Add uoff_t
mm: Use folio_next_pos()
xfs: Use folio_next_pos()
netfs: Use folio_next_pos()
iomap: Use folio_next_pos()
gfs2: Use folio_next_pos()
f2fs: Use folio_next_pos()
ext4: Use folio_next_pos()
buffer: Use folio_next_pos()
btrfs: Use folio_next_pos()
filemap: Add folio_next_pos()
|
|
This is one instruction more efficient than open-coding folio_pos() +
folio_size(). It's the equivalent of computing (x + y) << z rather than
(x << z) + (y << z).
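For illustration, a minimal sketch of what such a helper looks like (the
internal details here are assumptions; the canonical definition lives in
the pagecache headers):

static inline loff_t folio_next_pos(struct folio *folio)
{
	/* One shift of the combined count, rather than shifting the
	 * index and the size separately and adding the results. */
	return ((loff_t)folio->index + folio_nr_pages(folio)) << PAGE_SHIFT;
}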
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-9-willy@infradead.org
Acked-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Paulo Alcantara <pc@manguebit.org>
Cc: netfs@lists.linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
All places were patched by coccinelle, with the default expectation that
->i_lock is held; afterwards, entries were fixed up by hand to use the
unlocked variants as needed.
The script:
@@
expression inode, flags;
@@
- inode->i_state & flags
+ inode_state_read(inode) & flags
@@
expression inode, flags;
@@
- inode->i_state &= ~flags
+ inode_state_clear(inode, flags)
@@
expression inode, flag1, flag2;
@@
- inode->i_state &= ~flag1 & ~flag2
+ inode_state_clear(inode, flag1 | flag2)
@@
expression inode, flags;
@@
- inode->i_state |= flags
+ inode_state_set(inode, flags)
@@
expression inode, flags;
@@
- inode->i_state = flags
+ inode_state_assign(inode, flags)
@@
expression inode, flags;
@@
- flags = inode->i_state
+ flags = inode_state_read(inode)
@@
expression inode, flags;
@@
- READ_ONCE(inode->i_state) & flags
+ inode_state_read(inode) & flags
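As a rough illustration of the accessor shape being introduced (a hedged
sketch only; the real helpers also deal with ->i_lock assertions and the
exact field width, and may differ in detail):

static inline unsigned long inode_state_read(const struct inode *inode)
{
	return READ_ONCE(inode->i_state);
}

static inline void inode_state_set(struct inode *inode, unsigned long flags)
{
	/* the default (locked) variant expects ->i_lock to be held */
	WRITE_ONCE(inode->i_state, inode->i_state | flags);
}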
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Merge tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs workqueue updates from Christian Brauner:
"This contains various workqueue changes affecting the filesystem
layer.
Currently, if a user enqueues a work item using schedule_delayed_work(),
the workqueue used is "system_wq" (a per-CPU wq), while
queue_delayed_work() uses WORK_CPU_UNBOUND (used when a CPU is not
specified). The same applies to schedule_work(), which uses system_wq,
and queue_work(), which again makes use of WORK_CPU_UNBOUND.
This replaces the use of system_wq and system_unbound_wq. system_wq is
a per-CPU workqueue which isn't very obvious from the name and
system_unbound_wq is to be used when locality is not required.
So this renames system_wq to system_percpu_wq, and system_unbound_wq
to system_dfl_wq.
This also adds a new WQ_PERCPU flag to allow the fs subsystem users to
explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and
WQ_PERCPU flags coexist for one release cycle to allow callers to
transition their calls. WQ_UNBOUND will be removed in a future release
cycle"
* tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: WQ_PERCPU added to alloc_workqueue users
fs: replace use of system_wq with system_percpu_wq
fs: replace use of system_unbound_wq with system_dfl_wq
|
|
Commit 20d72b00ca81 ("netfs: Fix the request's work item to not
require a ref") modified netfs_alloc_request() to initialize the
reference counter to 2 instead of 1. The rationale was that the
request's "work" would release the second reference after completion
(via netfs_{read,write}_collection_worker()). That works most of the
time if all goes well.
However, it leaks this additional reference if the request is released
before the I/O operation has been submitted: the error code path only
decrements the reference counter once and the work item will never be
queued because there will never be a completion.
This has caused outages of our whole server cluster today because
tasks were blocked in netfs_wait_for_outstanding_io(), leading to
deadlocks in Ceph (another bug that I will address soon in another
patch). This was caused by a netfs_pgpriv2_begin_copy_to_cache() call
which failed in fscache_begin_write_operation(). The leaked
netfs_io_request was never completed, leaving `netfs_inode.io_count`
with a positive value forever.
All of this is super-fragile code. It is hard to see which code paths
will lead to an eventual completion and which will not:
- Some functions like netfs_create_write_req() allocate a request, but
will never submit any I/O.
- netfs_unbuffered_read_iter_locked() calls netfs_unbuffered_read()
and then netfs_put_request(); however, netfs_unbuffered_read() can
also fail early before submitting the I/O request, therefore another
netfs_put_request() call must be added there.
A rule of thumb is that functions that return a `netfs_io_request` do
not submit I/O, and all of their callers must be checked.
For my taste, the whole netfs code needs an overhaul to make reference
counting easier to understand and less fragile & obscure. But to fix
this bug here and now and produce a patch that is adequate for a
stable backport, I tried a minimal approach that quickly frees the
request object upon early failure.
I decided against adding a second netfs_put_request() each time
because that would cause code duplication which obscures the code
further. Instead, I added the function netfs_put_failed_request()
which frees such a failed request synchronously under the assumption
that the reference count is exactly 2 (as initially set by
netfs_alloc_request() and never touched), verified by a
WARN_ON_ONCE(). It then deinitializes the request object (without
going through the "cleanup_work" indirection) and frees the allocation
(with RCU protection to protect against concurrent access by
netfs_requests_seq_start()).
All code paths that fail early have been changed to call
netfs_put_failed_request() instead of netfs_put_request().
Additionally, I have added a netfs_put_request() call to
netfs_unbuffered_read() as explained above because the
netfs_put_failed_request() approach does not work there.
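The resulting early-failure pattern looks roughly like this (a hedged
sketch with argument lists elided; the actual call sites are spread
across fs/netfs/):

rreq = netfs_alloc_request(/* ... */);	/* refcount starts at 2 */

ret = fscache_begin_write_operation(/* ... */);
if (ret < 0) {
	/* No I/O was submitted, so the completion path will never drop
	 * the second reference - free the request synchronously. */
	netfs_put_failed_request(rreq);
	return ret;
}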
Fixes: 20d72b00ca81 ("netfs: Fix the request's work item to not require a ref")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Currently, if a user enqueues a work item using schedule_delayed_work(), the
workqueue used is "system_wq" (a per-CPU wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Add system_dfl_wq to encourage its use when unbound work should be used.
The old system_unbound_wq will be kept for a few release cycles.
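A minimal illustration of the intended usage after the rename (assuming
the system_dfl_wq name introduced here; the module, work item and
function below are made up for the example):

#include <linux/module.h>
#include <linux/workqueue.h>

static void example_work_fn(struct work_struct *work)
{
	/* ... */
}
static DECLARE_WORK(example_work, example_work_fn);

static int __init example_init(void)
{
	/* Unbound work with no locality requirement goes to the new,
	 * explicitly named default workqueue. */
	queue_work(system_dfl_wq, &example_work);
	return 0;
}

static void __exit example_exit(void)
{
	flush_work(&example_work);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");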
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/20250916082906.77439-2-marco.crivellari@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The folio lock has been released here, so there is no need to jump to
error_folio_unlock to release it again.
Reported-by: syzbot+b73c7d94a151e2ee1e9b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b73c7d94a151e2ee1e9b
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Acked-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If all the subrequests in an unbuffered write stream fail, the subrequest
collector doesn't update the stream->transferred value and it retains its
initial LONG_MAX value. Unfortunately, if all active streams fail, then we
take the smallest value of { LONG_MAX, LONG_MAX, ... } as the value to set
in wreq->transferred - which is then returned from ->write_iter().
LONG_MAX was chosen as the initial value so that all the streams can be
quickly assessed by taking the smallest value of all stream->transferred -
but this only works if we've set any of them.
Fix this by adding a flag to indicate whether the value in
stream->transferred is valid and checking that when we integrate the
values. stream->transferred can then be initialised to zero.
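In isolation, the integration step then amounts to something like this
(a standalone sketch, not the actual netfs code):

/* Only streams that actually recorded a transfer contribute; if none
 * did, report zero rather than the LONG_MAX sentinel. */
static long integrate_transferred(const long *transferred,
				  const bool *valid, int nr_streams)
{
	long smallest = LONG_MAX;
	bool any = false;
	int i;

	for (i = 0; i < nr_streams; i++) {
		if (!valid[i])
			continue;
		any = true;
		if (transferred[i] < smallest)
			smallest = transferred[i];
	}
	return any ? smallest : 0;
}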
This was found by running the generic/750 xfstest against cifs with
cache=none. It splices data to the target file. Once (if) it has used up
all the available scratch space, the writes start failing with ENOSPC.
This causes ->write_iter() to fail. However, it was returning
wreq->transferred, i.e. LONG_MAX, rather than an error (because it thought
the amount transferred was non-zero) and iter_file_splice_write() would
then try to clean up that amount of pipe bufferage - leading to an oops
when it overran. The kernel log showed:
CIFS: VFS: Send error in write = -28
followed by:
BUG: kernel NULL pointer dereference, address: 0000000000000008
with:
RIP: 0010:iter_file_splice_write+0x3a4/0x520
do_splice+0x197/0x4e0
or:
RIP: 0010:pipe_buf_release (include/linux/pipe_fs_i.h:282)
iter_file_splice_write (fs/splice.c:755)
Also put a warning check into splice to announce if ->write_iter() returned
that it had written more than it was asked to.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Reported-by: Xiaoli Feng <fengxiaoli0714@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220445
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/915443.1755207950@warthog.procyon.org.uk
cc: Paulo Alcantara <pc@manguebit.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: netfs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When netfslib is issuing subrequests, the subrequests start processing
immediately and may complete before we reach the end of the issuing
function. At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED
to indicate to the collector that we aren't going to issue any more subreqs
and that it can do the final notifications and cleanup.
Now, this isn't a problem if the request is synchronous
(NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be
done in-thread and we're guaranteed an opportunity to run the collector.
However, if the request is asynchronous, collection is primarily triggered
by the termination of subrequests queuing it on a workqueue. Now, a race
can occur here if the app thread sets ALL_QUEUED after the last subrequest
terminates.
This can happen most easily with the copy2cache code (as used by Ceph)
where, in the collection routine of a read request, an asynchronous write
request is spawned to copy data to the cache. Folios are added to the
write request as they're unlocked, but there may be a delay before
ALL_QUEUED is set as the write subrequests may complete before we get
there.
If all the write subreqs have finished by the ALL_QUEUED point, no further
events happen and the collection never happens, leaving the request
hanging.
Fix this by queuing the collector after setting ALL_QUEUED. This is a bit
heavy-handed and it may be sufficient to do it only if there are no extant
subreqs.
Also add a tracepoint to cross-reference both requests in a copy-to-request
operation and add a trace to the netfs_rreq tracepoint to indicate the
setting of ALL_QUEUED.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250711151005.2956810-3-dhowells@redhat.com
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The netfs copy-to-cache that is used by Ceph with local caching sets up a
new request to write data just read to the cache. The request is started
and then left to look after itself whilst the app continues. The request
gets notified by the backing fs upon completion of the async DIO write, but
then tries to wake up the app because NETFS_RREQ_OFFLOAD_COLLECTION isn't
set - but the app isn't waiting there, and so the request just hangs.
Fix this by setting NETFS_RREQ_OFFLOAD_COLLECTION which causes the
notification from the backing filesystem to put the collection onto a work
queue instead.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250711151005.2956810-2-dhowells@redhat.com
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Make a number of updates to the netfs tracepoints:
(1) Remove a duplicate trace from netfs_unbuffered_write_iter_locked().
(2) Move the trace in netfs_wake_rreq_flag() to after the flag is cleared
so that the change appears in the trace.
(3) Differentiate the use of netfs_rreq_trace_wait/woke_queue symbols.
(4) Don't do so many trace emissions in the wait functions as some of them
are redundant.
(5) In netfs_collect_read_results(), differentiate a subreq that's being
abandoned vs one that has been consumed in a regular way.
(6) Add a tracepoint to indicate the call to ->ki_complete().
(7) Don't double-increment the subreq_counter when retrying a write.
(8) Move the netfs_sreq_trace_io_progress tracepoint within cifs code to
just MID_RESPONSE_RECEIVED and add different tracepoints for other MID
states and note check failure.
Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Paulo Alcantara <pc@manguebit.org>
Signed-off-by: Paulo Alcantara <pc@manguebit.org>
Link: https://lore.kernel.org/20250701163852.2171681-14-dhowells@redhat.com
cc: Steve French <sfrench@samba.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Renumber the NETFS_RREQ_* flags to put the most useful status bits in the
bottom nibble - and therefore the last hex digit in the trace output -
making it easier to grasp the state at a glance.
In particular, put the IN_PROGRESS flag in bit 0 and ALL_QUEUED at bit 1.
Also make the flags field in /proc/fs/netfs/requests larger to accommodate
all the flags.
Also make the flags field in the netfs_sreq tracepoint larger to
accommodate all the NETFS_SREQ_* flags.
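The intent, roughly (bit values as described above; the authoritative
definitions live in include/linux/netfs.h):

#define NETFS_RREQ_IN_PROGRESS	0	/* bit 0: last hex digit in traces */
#define NETFS_RREQ_ALL_QUEUED	1	/* bit 1 */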
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-13-dhowells@redhat.com
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Netfslib has two functions for updating the i_size after a write: one for
buffered writes into the pagecache and one for direct/unbuffered writes.
However, what needs to be done is much the same in both cases, so merge
them together.
This does raise one question, though: should updating the i_size after a
direct write do the same estimated update of i_blocks as is done for
buffered writes?
Also get rid of the cleanup function pointer from netfs_io_request as it's
only used for direct write to update i_size; instead do the i_size setting
directly from write collection.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-12-dhowells@redhat.com
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix the updating of i_size, particularly in regard to the completion of DIO
writes and especially async DIO writes by using a lock.
The bug is triggered occasionally by the generic/207 xfstest as it chucks a
bunch of AIO DIO writes at the filesystem and then checks that fstat()
returns a reasonable st_size as each completes.
The problem is that netfs is trying to do "if new_size > inode->i_size,
update inode->i_size" sort of thing but without a lock around it.
This can be seen with cifs, but shouldn't be seen with kafs because kafs
serialises modification ops on the client whereas cifs sends the requests
to the server as they're generated and lets the server order them.
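Conceptually, the fix makes the update look like the following (hedged
sketch; the lock used by the real code in the write collector may
differ):

static void example_update_i_size(struct inode *inode, loff_t new_size)
{
	spin_lock(&inode->i_lock);
	if (new_size > i_size_read(inode))
		i_size_write(inode, new_size);
	spin_unlock(&inode->i_lock);
}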
Fixes: 153a9961b551 ("netfs: Implement unbuffered/DIO write support")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-11-dhowells@redhat.com
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The write-retry algorithm will insert extra subrequests into the list if it
can't get sufficient capacity to split the range that needs to be retried
into the sequence of subrequests it currently has (for instance, if the
cifs credit pool has fewer credits available than it did when the range was
originally divided).
However, the allocator furnishes each new subreq with 2 refs and then
another is added for resubmission, causing one to be leaked.
Fix this by replacing the ref-getting line with a neutral trace line.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-6-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
netfs_wait_for_request() and netfs_wait_for_pause() can loop forever if
netfs_collect_in_app() returns 2, indicating that it wants to repeat
because the ALL_QUEUED flag isn't yet set and there are no subreqs left
that haven't been collected.
The problem is that, unless collection is offloaded (OFFLOAD_COLLECTION),
we have to return to the application thread to continue and eventually set
ALL_QUEUED after pausing to deal with a retry - but we never get there.
Fix this by inserting checks for the IN_PROGRESS and PAUSE flags as
appropriate before cycling round - and add cond_resched() for good measure.
Fixes: 2b1424cd131c ("netfs: Fix wait/wake to be consistent about the waitqueue used")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-5-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Provide helpers to clear and test the NETFS_RREQ_IN_PROGRESS flag and to
insert the appropriate barrierage.
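A hedged sketch of the shape of such helpers (names here are invented;
the real ones live in the netfs internals and may differ):

static inline void example_clear_and_wake_in_progress(struct netfs_io_request *rreq)
{
	clear_bit_unlock(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
	smp_mb__after_atomic();	/* clear must be visible before the wake-up */
	wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);
}

static inline bool example_check_in_progress(const struct netfs_io_request *rreq)
{
	return test_bit_acquire(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
}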
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-4-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If a netfs request finishes during the pause loop, it will have the ref
that belongs to the IN_PROGRESS flag removed at that point - however, if it
then goes to the final wait loop, that will *also* put the ref because it
sees that the IN_PROGRESS flag is clear and incorrectly assumes that this
happened when it called the collector.
In fact, since IN_PROGRESS is clear, we shouldn't call the collector again
since it's done all the cleanup, such as calling ->ki_complete().
Fix this by making netfs_collect_in_app() just return, indicating that
we're done if IN_PROGRESS is removed.
Fixes: 2b1424cd131c ("netfs: Fix wait/wake to be consistent about the waitqueue used")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-3-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: Steve French <sfrench@samba.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When doing a DIO read, if the subrequests we issue fail and cause the
request PAUSE flag to be set to put a pause on subrequest generation, we
may complete collection of the subrequests (possibly discarding them) prior
to the ALL_QUEUED flag being set.
In such a case, netfs_read_collection() doesn't see ALL_QUEUED being set
after netfs_collect_read_results() returns and will just return to the app
(the collector can be seen unpausing the generator in the trace log).
The subrequest generator can then set ALL_QUEUED and the app thread reaches
netfs_wait_for_request(). This causes netfs_collect_in_app() to be called
to see if we're done yet, but there's a missing case here.
netfs_collect_in_app() will see that a thread is active and set inactive to
false, but won't see any subrequests in the read stream, and so won't set
need_collect to true. The function will then just return 0, indicating
that the caller should just sleep until further activity (which won't be
forthcoming) occurs.
Fix this by making netfs_collect_in_app() check to see if an active thread
is complete - i.e. that ALL_QUEUED is set and the subrequests list is empty
- and to skip the sleep return path. The collector will then be called
which will clear the request IN_PROGRESS flag, allowing the app to
progress.
Fixes: 2b1424cd131c ("netfs: Fix wait/wake to be consistent about the waitqueue used")
Reported-by: Steve French <sfrench@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250701163852.2171681-2-dhowells@redhat.com
Tested-by: Steve French <sfrench@samba.org>
Reviewed-by: Paulo Alcantara <pc@manguebit.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Merge tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
- The main API document has been extensively updated/rewritten
- Fix an oops in write-retry due to mis-resetting the I/O iterator
- Fix the recording of transferred bytes for short DIO reads
- Fix a request's work item to not require a reference, thereby
avoiding the need to get rid of it in BH/IRQ context
- Fix waiting and waking to be consistent about the waitqueue used
- Remove NETFS_SREQ_SEEK_DATA_READ, NETFS_INVALID_WRITE,
NETFS_ICTX_WRITETHROUGH, NETFS_READ_HOLE_CLEAR,
NETFS_RREQ_DONT_UNLOCK_FOLIOS, and NETFS_RREQ_BLOCKED
- Reorder structs to eliminate holes
- Remove netfs_io_request::ractl
- Only provide proc_link field if CONFIG_PROC_FS=y
- Remove folio_queue::marks3
- Fix undifferentiation of DIO reads from unbuffered reads
* tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix undifferentiation of DIO reads from unbuffered reads
netfs: Fix wait/wake to be consistent about the waitqueue used
netfs: Fix the request's work item to not require a ref
netfs: Fix setting of transferred bytes with short DIO reads
netfs: Fix oops in write-retry from mis-resetting the subreq iterator
fs/netfs: remove unused flag NETFS_RREQ_BLOCKED
fs/netfs: remove unused flag NETFS_RREQ_DONT_UNLOCK_FOLIOS
folio_queue: remove unused field `marks3`
fs/netfs: declare field `proc_link` only if CONFIG_PROC_FS=y
fs/netfs: remove `netfs_io_request.ractl`
fs/netfs: reorder struct fields to eliminate holes
fs/netfs: remove unused enum choice NETFS_READ_HOLE_CLEAR
fs/netfs: remove unused flag NETFS_ICTX_WRITETHROUGH
fs/netfs: remove unused source NETFS_INVALID_WRITE
fs/netfs: remove unused flag NETFS_SREQ_SEEK_DATA_READ
|
|
On cifs, "DIO reads" (specified by O_DIRECT) need to be differentiated from
"unbuffered reads" (specified by cache=none in the mount parameters). The
difference is flagged in the protocol and the server may behave
differently: Windows Server will, for example, mandate that DIO reads are
block aligned.
Fix this by adding a NETFS_UNBUFFERED_READ to differentiate this from
NETFS_DIO_READ, parallelling the write differentiation that already exists.
cifs will then do the right thing.
Fixes: 016dc8516aec ("netfs: Implement unbuffered/DIO read support")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/3444961.1747987072@warthog.procyon.org.uk
Reviewed-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
cc: Steve French <sfrench@samba.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
David Howells <dhowells@redhat.com> says:
Here are some miscellaneous fixes and changes for netfslib, if you could
pull them:
(1) Fix an oops in write-retry due to mis-resetting the I/O iterator.
(2) Fix the recording of transferred bytes for short DIO reads.
(3) Fix a request's work item to not require a reference, thereby avoiding
the need to get rid of it in BH/IRQ context.
(4) Fix waiting and waking to be consistent about the waitqueue used.
* patches from https://lore.kernel.org/20250519090707.2848510-1-dhowells@redhat.com:
netfs: Fix wait/wake to be consistent about the waitqueue used
netfs: Fix the request's work item to not require a ref
netfs: Fix setting of transferred bytes with short DIO reads
netfs: Fix oops in write-retry from mis-resetting the subreq iterator
Link: https://lore.kernel.org/20250519090707.2848510-1-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix further inconsistencies in the use of waitqueues
(clear_and_wake_up_bit() vs private waitqueue).
Move some of this stuff from the read and write sides into common code so
that it can be done in fewer places.
To make this work, async I/O needs to set NETFS_RREQ_OFFLOAD_COLLECTION to
indicate that a workqueue will do the collecting and places that call the
wait function need to deal with it returning the amount transferred.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519090707.2848510-5-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <stfrench@microsoft.com>
cc: Ihor Solodrai <ihor.solodrai@pm.me>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When the netfs_io_request struct's work item is queued, it must be supplied
with a ref to the work item struct to prevent it being deallocated whilst
on the queue or whilst it is being processed. This is tricky to manage as
we have to get a ref before we try and queue it and then we may find it's
already queued and is thus already holding a ref - in which case we have to
try and get rid of the ref again.
The problem comes if we're in BH or IRQ context and need to drop the ref:
if netfs_put_request() reduces the count to 0, we have to do the cleanup -
but the cleanup may need to wait.
Fix this by adding a new work item to the request, ->cleanup_work, and
dispatching that when the refcount hits zero. That can then synchronously
cancel any outstanding work on the main work item before doing the cleanup.
Adding a new work item also deals with another problem upstream where it's
sometimes changing the work func in the put function and requeuing it -
which has occasionally in the past caused the cleanup to happen
incorrectly.
As a bonus, this allows us to get rid of the 'was_async' parameter from a
bunch of functions. This indicated whether the put function might not be
permitted to sleep.
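Roughly, the put path becomes (hedged sketch; the refcount field name
and the choice of workqueue are assumptions):

void example_netfs_put_request(struct netfs_io_request *rreq)
{
	if (refcount_dec_and_test(&rreq->ref))
		/* Safe from BH/IRQ context: the cleanup work item can sleep,
		 * synchronously cancel ->work and then free the request. */
		queue_work(system_unbound_wq, &rreq->cleanup_work);
}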
Fixes: 3d3c95046742 ("netfs: Provide readahead and readpage netfs helpers")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519090707.2848510-4-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <stfrench@microsoft.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
A netfslib request comprises an ordered stream of subrequests that, when
doing an unbuffered/DIO read, are contiguous. The subrequests may be
performed in parallel, but may not be fully completed.
For instance, if we try and make a 256KiB DIO read from a 3-byte file with
a 64KiB rsize and 256KiB bsize, netfslib will attempt to make a read of
256KiB, broken up into four 64KiB subreads, with the expectation that the
first will be short and the subsequent three be completely devoid - but we
do all four on the basis that the file may have been changed by a third
party.
The read-collection code, however, walks through all the subreqs and
advances the notion of how much data has been read in the stream to the
start of each subreq plus its amount transferred (which are 3, 0, 0, 0 for
the example above) - which gives an amount apparently read of 3*64KiB -
which is incorrect.
Fix the collection code to cut short the calculation of the transferred
amount with the first short subrequest in an unbuffered read; everything
beyond that must be ignored as there's a hole that cannot be filled. This
applies both to shortness due to hitting the EOF and shortness due to an
error.
This is achieved by setting a flag on the request when we collect the first
short subrequest (collection is done in ascending order).
This can be tested by mounting a cifs volume with rsize=65536,bsize=262144
and doing a 256k DIO read of a very small file (e.g. 3 bytes). read()
should return 3, not >3.
This problem came in when netfs_read_collection() set rreq->transferred to
stream->transferred, even for DIO. Prior to that, netfs_rreq_assess_dio()
just went over the list and added up the subreqs till it met a short one -
but now the subreqs are discarded earlier.
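The collection rule, in isolation (a standalone sketch of the
arithmetic, not the actual collector):

/* Subrequests are contiguous and collected in ascending order; the
 * first short one ends the count because a hole follows it. */
static size_t dio_read_total(const size_t *start, const size_t *got,
			     const size_t *len, int nr)
{
	size_t total = 0;
	int i;

	for (i = 0; i < nr; i++) {
		total = start[i] + got[i];
		if (got[i] < len[i])
			break;
	}
	return total;
}

For the 3-byte example above this yields 3 (0 + 3 from the first, short
subrequest) rather than 192KiB.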
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Nicolas Baranger <nicolas.baranger@3xo.fr>
Closes: https://lore.kernel.org/all/10bec2430ed4df68bde10ed95295d093@3xo.fr/
Signed-off-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519090707.2848510-3-dhowells@redhat.com
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix the resetting of the subrequest iterator in netfs_retry_write_stream()
to use the iterator-reset function as the iterator may have been shortened
by a previous retry. In such a case, the amount of data to be written by
the subrequest is not "subreq->len" but "subreq->len -
subreq->transferred".
Without this, KASAN may see an error in iov_iter_revert():
BUG: KASAN: slab-out-of-bounds in iov_iter_revert lib/iov_iter.c:633 [inline]
BUG: KASAN: slab-out-of-bounds in iov_iter_revert+0x443/0x5a0 lib/iov_iter.c:611
Read of size 4 at addr ffff88802912a0b8 by task kworker/u32:7/1147
CPU: 1 UID: 0 PID: 1147 Comm: kworker/u32:7 Not tainted 6.15.0-rc6-syzkaller-00052-g9f35e33144ae #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Workqueue: events_unbound netfs_write_collection_worker
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0xc3/0x670 mm/kasan/report.c:521
kasan_report+0xe0/0x110 mm/kasan/report.c:634
iov_iter_revert lib/iov_iter.c:633 [inline]
iov_iter_revert+0x443/0x5a0 lib/iov_iter.c:611
netfs_retry_write_stream fs/netfs/write_retry.c:44 [inline]
netfs_retry_writes+0x166d/0x1a50 fs/netfs/write_retry.c:231
netfs_collect_write_results fs/netfs/write_collect.c:352 [inline]
netfs_write_collection_worker+0x23fd/0x3830 fs/netfs/write_collect.c:374
process_one_work+0x9cf/0x1b70 kernel/workqueue.c:3238
process_scheduled_works kernel/workqueue.c:3319 [inline]
worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400
kthread+0x3c2/0x780 kernel/kthread.c:464
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Fixes: cd0277ed0c18 ("netfs: Use new folio_queue data type and iterator instead of xarray iter")
Reported-by: syzbot+25b83a6f2c702075fcbc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=25b83a6f2c702075fcbc
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519090707.2848510-2-dhowells@redhat.com
Tested-by: syzbot+25b83a6f2c702075fcbc@syzkaller.appspotmail.com
cc: Paulo Alcantara <pc@manguebit.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
NETFS_RREQ_BLOCKED was added by commit 016dc8516aec ("netfs: Implement
unbuffered/DIO read support") but has never been used either. Without
NETFS_RREQ_BLOCKED, NETFS_RREQ_NONBLOCK makes no sense, and thus can
be removed as well.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519134813.2975312-12-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
NETFS_RREQ_DONT_UNLOCK_FOLIOS has never been used ever since it was
added by commit 3d3c95046742 ("netfs: Provide readahead and readpage
netfs helpers").
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519134813.2975312-11-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Since this field (->ractl) is only used by netfs_prepare_read_iterator() when
called by netfs_readahead(), we can simply pass it as parameter. This
shrinks the struct from 576 to 568 bytes.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519134813.2975312-8-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This flag was added by commit 41d8e7673a77 ("netfs: Implement a
write-through caching option") but it was never used.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519134813.2975312-5-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This enum choice was added by commit 16af134ca4b7 ("netfs: Extend the
netfs_io_*request structs to handle writes") and its only user was
later removed by commit c245868524cc ("netfs: Remove the old writeback
code").
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/20250519134813.2975312-4-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
GCC 15's new -Wunterminated-string-initialization notices that the
character lookup tables "fscache_cache_states" and "fscache_cookie_states"
(which are not used as C-strings) need to be marked as "nonstring":
fs/netfs/fscache_cache.c:375:67: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (6 chars into 5 available) [-Wunterminated-string-initialization]
375 | static const char fscache_cache_states[NR__FSCACHE_CACHE_STATE] = "-PAEW";
| ^~~~~~~
fs/netfs/fscache_cookie.c:32:69: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (11 chars into 10 available) [-Wunterminated-string-initialization]
32 | static const char fscache_cookie_states[FSCACHE_COOKIE_STATE__NR] = "-LCAIFUWRD";
| ^~~~~~~~~~~~
Annotate the arrays.
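That is, mark them with the kernel's __nonstring attribute, e.g. (the
exact placement shown is illustrative):

static const char fscache_cache_states[NR__FSCACHE_CACHE_STATE] __nonstring = "-PAEW";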
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/20250416221654.work.028-kees@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When testing a special config:
CONFIG_NETFS_SUPPORT=y
CONFIG_PROC_FS=n
The system crashes with something like:
[ 3.766197] ------------[ cut here ]------------
[ 3.766484] kernel BUG at mm/mempool.c:560!
[ 3.766789] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 3.767123] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G W
[ 3.767777] Tainted: [W]=WARN
[ 3.767968] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
[ 3.768523] RIP: 0010:mempool_alloc_slab.cold+0x17/0x19
[ 3.768847] Code: 50 fe ff 58 5b 5d 41 5c 41 5d 41 5e 41 5f e9 93 95 13 00
[ 3.769977] RSP: 0018:ffffc90000013998 EFLAGS: 00010286
[ 3.770315] RAX: 000000000000002f RBX: ffff888100ba8640 RCX: 0000000000000000
[ 3.770749] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 00000000ffffffff
[ 3.771217] RBP: 0000000000092880 R08: 0000000000000000 R09: ffffc90000013828
[ 3.771664] R10: 0000000000000001 R11: 00000000ffffffea R12: 0000000000092cc0
[ 3.772117] R13: 0000000000000400 R14: ffff8881004b1620 R15: ffffea0004ef7e40
[ 3.772554] FS: 0000000000000000(0000) GS:ffff8881b5f3c000(0000) knlGS:0000000000000000
[ 3.773061] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3.773443] CR2: ffffffff830901b4 CR3: 0000000004296001 CR4: 0000000000770ef0
[ 3.773884] PKRU: 55555554
[ 3.774058] Call Trace:
[ 3.774232] <TASK>
[ 3.774371] mempool_alloc_noprof+0x6a/0x190
[ 3.774649] ? _printk+0x57/0x80
[ 3.774862] netfs_alloc_request+0x85/0x2ce
[ 3.775147] netfs_readahead+0x28/0x170
[ 3.775395] read_pages+0x6c/0x350
[ 3.775623] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3.775928] page_cache_ra_unbounded+0x1bd/0x2a0
[ 3.776247] filemap_get_pages+0x139/0x970
[ 3.776510] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3.776820] filemap_read+0xf9/0x580
[ 3.777054] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3.777368] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3.777674] ? find_held_lock+0x32/0x90
[ 3.777929] ? netfs_start_io_read+0x19/0x70
[ 3.778221] ? netfs_start_io_read+0x19/0x70
[ 3.778489] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3.778800] ? lock_acquired+0x1e6/0x450
[ 3.779054] ? srso_alias_return_thunk+0x5/0xfbef5
[ 3.779379] netfs_buffered_read_iter+0x57/0x80
[ 3.779670] __kernel_read+0x158/0x2c0
[ 3.779927] bprm_execve+0x300/0x7a0
[ 3.780185] kernel_execve+0x10c/0x140
[ 3.780423] ? __pfx_kernel_init+0x10/0x10
[ 3.780690] kernel_init+0xd5/0x150
[ 3.780910] ret_from_fork+0x2d/0x50
[ 3.781156] ? __pfx_kernel_init+0x10/0x10
[ 3.781414] ret_from_fork_asm+0x1a/0x30
[ 3.781677] </TASK>
[ 3.781823] Modules linked in:
[ 3.782065] ---[ end trace 0000000000000000 ]---
This is caused by the following error path in netfs_init():
if (!proc_mkdir("fs/netfs", NULL))
	goto error_proc;
Fix this by adding an #ifdef in netfs_init(), so that /proc/fs/netfs is only
created when CONFIG_PROC_FS is set.
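A sketch of the guard (illustrative; the actual change in
fs/netfs/main.c may be structured differently):

#ifdef CONFIG_PROC_FS
	if (!proc_mkdir("fs/netfs", NULL))
		goto error_proc;
#endif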
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/20250409170015.2651829-1-song@kernel.org
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix netfs_unbuffered_read() to return an ssize_t rather than an int as
netfs_wait_for_read() returns ssize_t and this gets implicitly truncated.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250314164201.1993231-5-dhowells@redhat.com
Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Viacheslav Dubeyko <slava@dubeyko.com>
cc: Alex Markuze <amarkuze@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
rolling_buffer_load_from_ra() looms large in the perf report because it
loops around doing an atomic clear for each of the three mark bits per
folio. However, this is both inefficient (it would be better to build a
mask and atomically AND them out) and unnecessary as they shouldn't be set.
Fix this by removing the loop.
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250314164201.1993231-4-dhowells@redhat.com
Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: netfs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Many filesystems such as NFS and Ceph do not implement the
`invalidate_cache` method. On those filesystems, if writing to the
cache (`NETFS_WRITE_TO_CACHE`) fails for some reason, the kernel
crashes like this:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 0 P4D 0
Oops: Oops: 0010 [#1] SMP PTI
CPU: 9 UID: 0 PID: 3380 Comm: kworker/u193:11 Not tainted 6.13.3-cm4all1-hp #437
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/17/2018
Workqueue: events_unbound netfs_write_collection_worker
RIP: 0010:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0018:ffff9b86e2ca7dc0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 7fffffffffffffff
RDX: 0000000000000001 RSI: ffff89259d576a18 RDI: ffff89259d576900
RBP: ffff89259d5769b0 R08: ffff9b86e2ca7d28 R09: 0000000000000002
R10: ffff89258ceaca80 R11: 0000000000000001 R12: 0000000000000020
R13: ffff893d158b9338 R14: ffff89259d576900 R15: ffff89259d5769b0
FS: 0000000000000000(0000) GS:ffff893c9fa40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000054442e003 CR4: 00000000001706f0
Call Trace:
<TASK>
? __die+0x1f/0x60
? page_fault_oops+0x15c/0x460
? try_to_wake_up+0x2d2/0x530
? exc_page_fault+0x5e/0x100
? asm_exc_page_fault+0x22/0x30
netfs_write_collection_worker+0xe9f/0x12b0
? xs_poll_check_readable+0x3f/0x80
? xs_stream_data_receive_workfn+0x8d/0x110
process_one_work+0x134/0x2d0
worker_thread+0x299/0x3a0
? __pfx_worker_thread+0x10/0x10
kthread+0xba/0xe0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Modules linked in:
CR2: 0000000000000000
This patch adds the missing `NULL` check.
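The guard amounts to testing the method pointer before the indirect
call, roughly (a hedged sketch; the ops pointer name is assumed):

if (wreq->netfs_ops->invalidate_cache)
	wreq->netfs_ops->invalidate_cache(wreq);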
Fixes: 0e0f2dfe880f ("netfs: Dispatch write requests to process a writeback slice")
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250314164201.1993231-3-dhowells@redhat.com
Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
cc: netfs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
A netfs read request can run in one of two modes: for synchronous reads,
the app thread does the collection of results and for asynchronous
reads, this is offloaded to a worker thread. This is controlled by the
NETFS_RREQ_OFFLOAD_COLLECTION flag.
Now, if a subrequest incurs an error, the NETFS_RREQ_PAUSE flag is set to
stop the issuing loop temporarily from issuing more subrequests until a
retry is successful or the request is abandoned.
When the issuing loop sees NETFS_RREQ_PAUSE, it jumps to
netfs_wait_for_pause() which will wait for the PAUSE flag to be cleared -
and whilst it is waiting, it will call out to the collector as more results
accrue... But this is the wrong thing to do if OFFLOAD_COLLECTION is set as
we can then end up with both the app thread and the work item collecting
results simultaneously.
This manifests itself occasionally when running the generic/323 xfstest
against multichannel cifs as an oops that's a bit random but frequently
involving io_submit() (the test does lots of simultaneous async DIO reads).
Fix this by only doing the collection in netfs_wait_for_pause() if the
NETFS_RREQ_OFFLOAD_COLLECTION is not set.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Steve French <stfrench@microsoft.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250314164201.1993231-2-dhowells@redhat.com
Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Due to the code that queues a subreq on the active subrequest list getting
moved to netfs_issue_read(), the NETFS_RREQ_ALL_QUEUED flag may now get set
before the list-add actually happens. This is not a problem if the
collection worker happens after the list-add, but it's a race - and, for
9P, where the read from the server is synchronous and done in the
submitting thread, this is a lot more likely.
The result is that, if the timing is wrong, a ref gets leaked because the
collector thinks that all the subreqs have completed (because it can't see
the last one yet) and clears NETFS_RREQ_IN_PROGRESS - at which point, the
collection worker no longer goes into the collector.
This can be provoked with AFS by injecting an msleep() right before the
final subreq is queued.
Fix this by splitting the queuing part out of netfs_issue_read() into a new
function, netfs_queue_read(), and calling it separately. The setting of
NETFS_RREQ_ALL_QUEUED is then done by netfs_queue_read() whilst it is
holding the spinlock (that's probably unnecessary, but shouldn't hurt).
It might be better to set a flag on the final subreq, but this could be a
problem if an error occurs and we can't queue it.
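Schematically, the split looks like this (a hedged sketch; the lock and
list member names are assumptions):

static void example_queue_read(struct netfs_io_request *rreq,
			       struct netfs_io_subrequest *subreq,
			       bool last_subreq)
{
	spin_lock(&rreq->lock);
	list_add_tail(&subreq->rreq_link, &rreq->io_streams[0].subrequests);
	if (last_subreq)
		set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
	spin_unlock(&rreq->lock);
}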
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Ihor Solodrai <ihor.solodrai@pm.me>
Closes: https://lore.kernel.org/r/a7x33d4dnMdGTtRivptq6S1i8btK70SNBP2XyX_xwDAhLvgQoPox6FVBOkifq4eBinfFfbZlIkMZBe3QarlWTxoEtHZwJCZbNKtaqrR7PvI=@pm.me/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250212222402.3618494-4-dhowells@redhat.com
Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <stfrench@microsoft.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add stat counters to count the number of request and subrequest retries and
display them in /proc/fs/netfs/stats.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250212222402.3618494-3-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix a number of hangs in the netfslib read-retry code, including:
(1) netfs_reissue_read() doubles up the getting of references on
subrequests, thereby leaking the subrequest and causing inode eviction
to wait indefinitely. This can lead to the kernel reporting a hang in
the filesystem's evict_inode().
Fix this by removing the get from netfs_reissue_read() and adding one
to netfs_retry_read_subrequests() to deal with the one place that
didn't double up.
(2) The loop in netfs_retry_read_subrequests() that retries a sequence of
failed subrequests doesn't record whether or not it retried the one
that the "subreq" pointer points to when it leaves the loop. It may
not if renegotiation/repreparation of the subrequests means that fewer
subrequests are needed to span the cumulative range of the sequence.
Because it doesn't record this, the piece of code that discards
now-superfluous subrequests doesn't know whether it should discard the
one "subreq" points to - and so it doesn't.
Fix this by noting whether the last subreq it examines is superfluous
and if it is, then getting rid of it and all subsequent subrequests.
If that one wasn't superfluous, then we would have tried to go
round the previous loop again and so there can be no further unretried
subrequests in the sequence.
(3) netfs_retry_read_subrequests() gets yet an extra ref on any additional
subrequests it has to get because it ran out of ones it could reuse due
to renegotiation/repreparation shrinking the subrequests.
Fix this by removing that extra ref.
(4) In netfs_retry_reads(), it was using wait_on_bit() to wait for
NETFS_SREQ_IN_PROGRESS to be cleared on all subrequests in the
sequence - but netfs_read_subreq_terminated() is now using a wait
queue on the request instead and so this wait will never finish.
Fix this by waiting on the wait queue instead. To make this work, a
new flag, NETFS_RREQ_RETRYING, is now set around the wait loop to tell
the wake-up code to wake up the wait queue rather than requeuing the
request's work item.
Note that this flag replaces the NETFS_RREQ_NEED_RETRY flag which is
no longer used.
(5) Whilst not strictly anything to do with the hang,
netfs_retry_read_subrequests() was also doubly incrementing the
subreq_counter and re-setting the debug index, leaving a gap in the
trace. This is also fixed.
One of these hangs was observed with 9p and with cifs. Others were forced
by manual code injection into fs/afs/file.c. Firstly, afs_prepare_read()
was created to provide an changing pattern of maximum subrequest sizes:
static int afs_prepare_read(struct netfs_io_subrequest *subreq)
{
	struct netfs_io_request *rreq = subreq->rreq;

	if (!S_ISREG(subreq->rreq->inode->i_mode))
		return 0;
	if (subreq->retry_count < 20)
		rreq->io_streams[0].sreq_max_len =
			umax(200, 2222 - subreq->retry_count * 40);
	else
		rreq->io_streams[0].sreq_max_len = 3333;
	return 0;
}
and pointed to by afs_req_ops. Then the following:
struct netfs_io_subrequest *subreq = op->fetch.subreq;

if (subreq->error == 0 &&
    S_ISREG(subreq->rreq->inode->i_mode) &&
    subreq->retry_count < 20) {
	subreq->transferred = subreq->already_done;
	__clear_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
	__set_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
	afs_fetch_data_notify(op);
	return;
}
was inserted into afs_fetch_data_success() at the beginning and struct
netfs_io_subrequest given an extra field, "already_done" that was set to
the value in "subreq->transferred" by netfs_reissue_read().
When reading a 4K file, the subrequests would get gradually smaller, a new
subrequest would be allocated around the 3rd retry and then eventually be
rendered superfluous when the 20th retry was hit and the limit on the first
subrequest was eased.
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250212222402.3618494-2-dhowells@redhat.com
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Steve French <stfrench@microsoft.com>
cc: Ihor Solodrai <ihor.solodrai@pm.me>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs netfs updates from Christian Brauner:
"This contains read performance improvements and support for monolithic
single-blob objects that have to be read/written as such (e.g. AFS
directory contents). The implementation of the two parts is interwoven
as each makes the other possible.
- Read performance improvements
The read performance improvements are intended to speed up some
loss of performance detected in cifs and to a lesser extent in afs.
The problem is that we queue too many work items during the
collection of read results: each individual subrequest is collected
by its own work item, and then they have to interact with each
other when a series of subrequests don't exactly align with the
pattern of folios that are being read by the overall request.
Whilst the processing of the pages covered by individual
subrequests as they complete potentially allows folios to be woken
in parallel and with minimum delay, it can shuffle wakeups for
sequential reads out of order - and that is the most common I/O
pattern.
The final assessment and cleanup of an operation is then held up
until the last I/O completes - and for a synchronous sequential
operation, this means the bouncing around of work items just adds
latency.
Two changes have been made to make this work:
(1) All collection is now done in a single "work item" that works
progressively through the subrequests as they complete (and
also dispatches retries as necessary).
(2) For readahead and AIO, this work item can be done on a workqueue
and can run in parallel with the ultimate consumer of the data;
for synchronous direct or unbuffered reads, the collection is
run in the application thread and not offloaded.
Functions such as smb2_readv_callback() then just tell netfslib
that the subrequest has terminated; netfslib does a minimal bit of
processing on the spot - stat counting and tracing mostly - and
then queues/wakes up the worker. This simplifies the logic as the
collector just walks sequentially through the subrequests as they
complete and walks through the folios, if buffered, unlocking them
as it goes. It also keeps to a minimum the amount of latency
injected into the filesystem's low-level I/O handling.
The way netfs supports filesystems using the deprecated
PG_private_2 flag is changed: folios are flagged and added to a
write request as they complete and that takes care of scheduling
the writes to the cache. The originating read request can then just
unlock the pages whatever happens.
- Single-blob object support
Single-blob objects are files for which the content of the file
must be read from or written to the server in a single operation
because reading them in parts may yield inconsistent results. AFS
directories are an example of this as there exists the possibility
that the contents are generated on the fly and would differ between
reads or might change due to third party interference.
Such objects will be written to and retrieved from the cache if one
is present, though we allow/may need to propose multiple
subrequests to do so. The important part is that read from/write to
the *server* is monolithic.
Single blob reading is, for the moment, fully synchronous and does
result collection in the application thread and, also for the
moment, the API is supplied the buffer in the form of a folio_queue
chain rather than using the pagecache.
- Related afs changes
This series makes a number of changes to the kafs filesystem,
primarily in the area of directory handling:
- AFS's FetchData RPC reply processing is made partially
asynchronous which allows the netfs_io_request's outstanding
operation counter to be removed as part of reducing the
collection to a single work item.
- Directory and symlink reading are plumbed through netfslib using
the single-blob object API and are now cacheable with fscache.
This also allows the afs_read struct to be eliminated and
netfs_io_subrequest to be used directly instead.
- Directory and symlink content are now stored in a folio_queue
buffer rather than in the pagecache. This means we don't require
the RCU read lock and xarray iteration to access it, and folios
won't randomly disappear under us because the VM wants them
back.
- The vnode operation lock is changed from a mutex struct to a
private lock implementation. The problem is that the lock now
needs to be dropped in a separate thread and mutexes don't
permit that.
- When a new directory or symlink is created, we now initialise it
locally and mark it valid rather than downloading it (we know
what it's likely to look like).
- We now use the in-directory hashtable to reduce the number of
entries we need to scan when doing a lookup. The edit routines
have to maintain the hash chains.
- Cancellation (e.g. by signal) of an async call after the
rxrpc_call has been set up is now offloaded to the worker thread
as there will be a notification from rxrpc upon completion. This
avoids a double cleanup.
- A "rolling buffer" implementation is created to abstract out the
two separate folio_queue chaining implementations I had (one for
read and one for write).
- Functions are provided to create/extend a buffer in a folio_queue
chain and tear it down again.
This is used to handle AFS directories, but could also be used to
create bounce buffers for content crypto and transport crypto.
- The was_async argument is dropped from netfs_read_subreq_terminated().
Instead we wake the read collection work item by either queuing it
or waking up the app thread.
- We don't need to use BH-excluding locks when communicating between
the issuing thread and the collection thread as neither of them now
run in BH context.
- Also included are a number of new tracepoints; a split of the
netfslib write collection code to put retrying into its own file
(it gets more complicated with content encryption).
- There are also some minor AFS fixes included, including fixing the
AFS directory format struct layout, reducing some directory
over-invalidation and making afs_mkdir() translate EEXIST to
ENOTEMPTY (which is not available on all systems the servers
support).
- Finally, there's a patch to try and detect entry into the folio
unlock function with no folio_queue structs in the buffer (which
isn't allowed in the cases that can get there).
This is a debugging patch, but should be minimal overhead"
* tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits)
netfs: Report on NULL folioq in netfs_writeback_unlock_folios()
afs: Add a tracepoint for afs_read_receive()
afs: Locally initialise the contents of a new symlink on creation
afs: Use the contained hashtable to search a directory
afs: Make afs_mkdir() locally initialise a new directory's content
netfs: Change the read result collector to only use one work item
afs: Make {Y,}FS.FetchData an asynchronous operation
afs: Fix cleanup of immediately failed async calls
afs: Eliminate afs_read
afs: Use netfslib for symlinks, allowing them to be cached
afs: Use netfslib for directories
afs: Make afs_init_request() get a key if not given a file
netfs: Add support for caching single monolithic objects such as AFS dirs
netfs: Add functions to build/clean a buffer in a folio_queue
afs: Add more tracepoints to do with tracking validity
cachefiles: Add auxiliary data trace
cachefiles: Add some subrequest tracepoints
netfs: Remove some extraneous directory invalidations
afs: Fix directory format encoding struct
afs: Fix EEXIST error returned from afs_rmdir() to be ENOTEMPTY
...
|
|
Fix netfslib's read-retry to only call ->prepare_read() in the backing
filesystem if such a function is provided. We can get to this point if
there's an active cache, as failed reads from the cache need negotiating
with the server instead.
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/529329.1736261010@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Netfslib needs to be able to handle kernel-initiated asynchronous DIO that
is supplied with a bio_vec[] array. Currently, because of the async flag,
this gets passed to netfs_extract_user_iter() which throws a warning and
fails because it only handles IOVEC and UBUF iterators. This can be
triggered through a combination of cifs and a loopback blockdev with
something like:
mount //my/cifs/share /foo
dd if=/dev/zero of=/foo/m0 bs=4K count=1K
losetup --sector-size 4096 --direct-io=on /dev/loop2046 /foo/m0
echo hello >/dev/loop2046
This causes the following to appear in syslog:
WARNING: CPU: 2 PID: 109 at fs/netfs/iterator.c:50 netfs_extract_user_iter+0x170/0x250 [netfs]
and the write to fail.
Fix this by removing the check in netfs_unbuffered_write_iter_locked() that
causes async kernel DIO writes to be handled as userspace writes. Note
that this change relies on the kernel caller maintaining the existence of
the bio_vec array (or kvec[] or folio_queue) until the op is complete.
Fixes: 153a9961b551 ("netfs: Implement unbuffered/DIO write support")
Reported-by: Nicolas Baranger <nicolas.baranger@3xo.fr>
Closes: https://lore.kernel.org/r/fedd8a40d54b2969097ffa4507979858@3xo.fr/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/608725.1736275167@warthog.procyon.org.uk
Tested-by: Nicolas Baranger <nicolas.baranger@3xo.fr>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Steve French <smfrench@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
It seems that it's possible to get to netfs_writeback_unlock_folios() with
an empty rolling buffer during buffered writes. This should not be
possible as the rolling buffer is initialised as the write request is set
up and thereafter maintains at least one folio_queue struct therein until
it gets destroyed. This allows lockless addition and removal of
folio_queue structs in the buffer because, unlike with a ring buffer, the
producer and consumer each only need to look at and alter one pointer into
the buffer.
Now, the rolling buffer is only used for buffered I/O operations as
netfs_collect_write_results() should only call
netfs_writeback_unlock_folios() if the request is of origin type
NETFS_WRITEBACK, NETFS_WRITETHROUGH or NETFS_PGPRIV2_COPY_TO_CACHE.
So it would seem that one of the following occurred: (1) I/O started before
the request was fully initialised, (2) the origin got switched mid-flow or
(3) the request has already been freed and this is a UAF error. I think
the last is the most likely.
Make netfs_writeback_unlock_folios() report information about the request
and subrequests if folioq is seen to be NULL to try and help debug this,
throw a warning and return.
Note that this does not try to fix the problem.
Reported-by: syzbot+af5c06208fa71bf31b16@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=af5c06208fa71bf31b16
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/ZxshMEW4U7MTgQYa@gmail.com/
Link: https://lore.kernel.org/r/20241216204124.3752367-33-dhowells@redhat.com
cc: Chang Yu <marcus.yu.56@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Since we know what the contents of a symlink will be when we create it on
the server, initialise its contents locally too to avoid the need to
download it.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-31-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Change the way netfslib collects read results to do all the collection for
a particular read request using a single work item that walks along the
subrequest queue as subrequests make progress or complete, unlocking folios
progressively rather than doing the unlock in parallel as parallel requests
come in.
The code is remodelled to be more like the write-side code, though only
using a single stream. This makes it more directly comparable and thus
easier to duplicate fixes between the two sides.
This has a number of advantages:
(1) It's simpler. There doesn't need to be a complex donation mechanism
to handle mismatches between the size and alignment of subrequests and
folios. The collector unlocks folios as the subrequests covering each one
complete.
(2) It should cause less scheduler overhead as there's a single work item
in play unlocking pages in parallel when a read gets split up into a
lot of subrequests instead of one per subrequest.
Whilst the parallellism is nice in theory, in practice, the vast
majority of loads are sequential reads of the whole file, so
committing a bunch of threads to unlocking folios out of order doesn't
help in those cases.
(3) It should make it easier to implement content decryption. A folio
cannot be decrypted until all the requests that contribute to it have
completed - and, again, most loads are sequential and so, most of the
time, we want to begin decryption sequentially (though it's great if
the decryption can happen in parallel).
There is a disadvantage in that we're losing the ability to decrypt and
unlock things on an as-things-arrive basis which may affect some
applications.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-28-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Now that directory and symlink reads go through netfslib, the afs_read
struct is mostly redundant with almost all data duplicated in the
netfs_io_request and netfs_io_subrequest structs that are also available
any time we're doing a fetch.
Eliminate afs_read by moving the one field we still need there to the
afs_call struct (we may be given a different amount of data than what we
asked for and have to track what remains of that) and using the
netfs_io_subrequest directly instead.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-24-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add support for caching the content of a file that contains a single
monolithic object that must be read/written with a single I/O operation,
such as an AFS directory.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-20-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: netfs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add two netfslib functions to build up or clean up a buffer in a
folio_queue. The first, netfs_alloc_folioq_buffer(), will add folios to a
buffer, extending it up to at least the given size. If it can, it will add
multipage folios. The folios optionally have the mapping set and will
have the index set according to the distance from the front of the folio
queue.
The second function will free up a folio queue and put any folios in the
queue that have the first mark set.
The netfs_folio tracepoint is also altered to cope with folios that have a
NULL mapping, and the folios being added/put will have trace lines emitted
and will be accounted in the stats.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-19-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: netfs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
All the accessing of the subrequest lists is now done in process context,
possibly in a workqueue, but not now in a BH context, so we don't need the
lock against BH interference when taking the netfs_io_request::lock
spinlock.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-11-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Drop the was_async argument from netfs_read_subreq_terminated(). Almost
every caller is in process context and passes false. Some
filesystems delegate the call to a workqueue to avoid doing the work in
their network message queue parsing thread.
The only exception is netfs_cache_read_terminated() which handles
completion in the cache - which is usually a callback from the backing
filesystem in softirq context, though it can be from process context if an
error occurred. In this case, delegate to a workqueue.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wiVC5Cgyz6QKXFu6fTaA6h4CjexDR-OV9kL6Vo5x9v8=A@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-10-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Drop the error argument from netfs_read_subreq_terminated() in favour of
passing the value in subreq->error.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-9-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Split write-retry code out of fs/netfs/write_collect.c as it will become
more elaborate when content crypto is introduced.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-8-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
netfs_advance_write() calculates the amount of data it's attaching to a
stream with size_t, but then returns this as an int. Switch the return
value to size_t for consistency.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-7-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
A rolling buffer is a series of folios held in a list of folio_queues. New
folios and folio_queue structs may be inserted at the head simultaneously
with spent ones being removed from the tail without the need for locking.
The rolling buffer includes an iov_iter and it has to be careful managing
this as the list of folio_queues is extended such that an oops doesn't
incurred because the iterator was pointing to the end of a folio_queue
segment that got appended to and then removed.
We need to use the mechanism twice, once for read and once for write, and,
in future patches, we will use a second rolling buffer to handle bounce
buffering for content encryption.
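As a rough illustration of the shape of the thing (a sketch only; field names
are approximate, see the rolling buffer definition added by this series for
the real layout):
    struct rolling_buffer_sketch {
            struct folio_queue *head; /* producer appends new segments here */
            struct folio_queue *tail; /* consumer discards spent segments here */
            struct iov_iter    iter;  /* spans the live folios in between */
    };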
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-6-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a tracepoint to log the lifespan of folio_queue structs. For tracing
illustrative purposes, folio_queues are tagged with the debug ID of
whatever they're related to (typically a netfs_io_request) and a debug ID
of their own.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-5-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Provide and use folio_queue allocation and free functions to combine the
allocation, initialisation and stat (un)accounting steps that are repeated
in several places.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-4-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
netfs: Fix is-caching check in read-retry
The read-retry code checks the NETFS_RREQ_COPY_TO_CACHE flag to determine
if there might be failed reads from the cache that need turning into reads
from the server, with the intention of skipping the complicated part if it
can. The code that set the flag, however, got lost during the read-side
rewrite.
Fix the check to see if the cache_resources are valid instead. The flag
can then be removed.
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3752048.1734381285@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When the caching for a cookie is temporarily disabled (e.g. due to a DIO
write on that file), future copying to the cache for that file is disabled
until all fds open on that file are closed. However, if netfslib is using
the deprecated PG_private_2 method (such as is currently used by ceph), and
decides it wants to copy to the cache, netfs_advance_write() will just bail
at the first check seeing that the cache stream is unavailable, and
indicate that it dealt with all the content.
This means that we have no subrequests to provide notifications to drive
the state machine or even to pin the request and the request just gets
discarded, leaving the folios with PG_private_2 set.
Fix this by jumping directly to cancel the request if the cache is not
available. That way, we don't remove mark3 from the folio_queue list and
netfs_pgpriv2_cancel() will clean up the folios.
This was found by running the generic/013 xfstest against ceph with an
active cache and the "-o fsc" option passed to ceph. That would usually
hang.
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Closes: https://lore.kernel.org/r/CAKPOu+_4m80thNy5_fvROoxBm689YtA0dZ-=gcmkzwYSY4syqw@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-11-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
At the end of netfs_unlock_read_folio() in which folios are marked
appropriately for copying to the cache (either by being marked dirty
and having their private data set or by having PG_private_2 set) and then
unlocked, the folio_queue struct has the entry pointing to the folio
cleared. This presents a problem for netfs_pgpriv2_write_to_the_cache(),
which is used to write folios marked with PG_private_2 to the cache as it
expects to be able to trawl the folio_queue list thereafter to find the
relevant folios, leading to a hang.
Fix this by not clearing the folio_queue entry if we're going to do the
deprecated copy-to-cache. The clearance will be done instead as the folios
are written to the cache.
This can be reproduced by starting cachefiles, mounting a ceph filesystem
with "-o fsc" and writing to it.
Fixes: 796a4049640b ("netfs: In readahead, put the folio refs as soon extracted")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Closes: https://lore.kernel.org/r/CAKPOu+_4m80thNy5_fvROoxBm689YtA0dZ-=gcmkzwYSY4syqw@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-10-dhowells@redhat.com
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
cc: Jeff Layton <jlayton@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
syzkaller reported recursion with a loop of three calls (netfs_rreq_assess,
netfs_retry_reads and netfs_rreq_terminated) hitting the limit of the stack
during an unbuffered or direct I/O read.
There are a number of issues:
(1) There is no limit on the number of retries.
(2) A subrequest is supposed to be abandoned if it does not transfer
anything (NETFS_SREQ_NO_PROGRESS), but that isn't checked under all
circumstances.
(3) The actual root cause, which is this:
if (atomic_dec_and_test(&rreq->nr_outstanding))
netfs_rreq_terminated(rreq, ...);
When we do a retry, we bump the rreq->nr_outstanding counter to
prevent the final cleanup phase running before we've finished
dispatching the retries. The problem is if we hit 0, we have to do
the cleanup phase - but we're in the cleanup phase and end up
repeating the retry cycle, hence the recursion.
Work around the problem by limiting the number of retries. This is based
on Lizhi Xu's patch[1], and makes the following changes:
(1) Replace NETFS_SREQ_NO_PROGRESS with NETFS_SREQ_MADE_PROGRESS and make
the filesystem set it if it managed to read or write at least one byte
of data. Clear this bit before issuing a subrequest.
(2) Add a ->retry_count member to the subrequest and increment it any time
we do a retry.
(3) Remove the NETFS_SREQ_RETRYING flag as it is superfluous with
->retry_count. If the latter is non-zero, we're doing a retry.
(4) Abandon a subrequest if retry_count is non-zero and we made no
progress.
(5) Use ->retry_count in both the write-side and the read-side (a rough
sketch of the resulting check is given below).
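As a rough condensation of points (1)-(4) above (a sketch, not the actual
patched code):
    /* A retried subrequest that made no progress is abandoned. */
    if (subreq->retry_count &&
        !test_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags)) {
            /* abandon the subrequest */
    } else {
            subreq->retry_count++;
            __clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
            /* reissue the subrequest */
    }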
[?] Question: Should I set a hard limit on retry_count in both read and
write? Say it hits 50, we always abandon it. The problem is that
these changes only mitigate the issue. As long as it made at least one
byte of progress, the recursion is still an issue. This patch
mitigates the problem, but does not fix the underlying cause. I have
patches that will do that, but it's an intrusive fix that's currently
pending for the next merge window.
The oops generated by KASAN looks something like:
BUG: TASK stack guard page was hit at ffffc9000482ff48 (stack is ffffc90004830000..ffffc90004838000)
Oops: stack guard page: 0000 [#1] PREEMPT SMP KASAN NOPTI
...
RIP: 0010:mark_lock+0x25/0xc60 kernel/locking/lockdep.c:4686
...
mark_usage kernel/locking/lockdep.c:4646 [inline]
__lock_acquire+0x906/0x3ce0 kernel/locking/lockdep.c:5156
lock_acquire.part.0+0x11b/0x380 kernel/locking/lockdep.c:5825
local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
___slab_alloc+0x123/0x1880 mm/slub.c:3695
__slab_alloc.constprop.0+0x56/0xb0 mm/slub.c:3908
__slab_alloc_node mm/slub.c:3961 [inline]
slab_alloc_node mm/slub.c:4122 [inline]
kmem_cache_alloc_noprof+0x2a7/0x2f0 mm/slub.c:4141
radix_tree_node_alloc.constprop.0+0x1e8/0x350 lib/radix-tree.c:253
idr_get_free+0x528/0xa40 lib/radix-tree.c:1506
idr_alloc_u32+0x191/0x2f0 lib/idr.c:46
idr_alloc+0xc1/0x130 lib/idr.c:87
p9_tag_alloc+0x394/0x870 net/9p/client.c:321
p9_client_prepare_req+0x19f/0x4d0 net/9p/client.c:644
p9_client_zc_rpc.constprop.0+0x105/0x880 net/9p/client.c:793
p9_client_read_once+0x443/0x820 net/9p/client.c:1570
p9_client_read+0x13f/0x1b0 net/9p/client.c:1534
v9fs_issue_read+0x115/0x310 fs/9p/vfs_addr.c:74
netfs_retry_read_subrequests fs/netfs/read_retry.c:60 [inline]
netfs_retry_reads+0x153a/0x1d00 fs/netfs/read_retry.c:232
netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
...
netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
netfs_dispatch_unbuffered_reads fs/netfs/direct_read.c:103 [inline]
netfs_unbuffered_read fs/netfs/direct_read.c:127 [inline]
netfs_unbuffered_read_iter_locked+0x12f6/0x19b0 fs/netfs/direct_read.c:221
netfs_unbuffered_read_iter+0xc5/0x100 fs/netfs/direct_read.c:256
v9fs_file_read_iter+0xbf/0x100 fs/9p/vfs_file.c:361
do_iter_readv_writev+0x614/0x7f0 fs/read_write.c:832
vfs_readv+0x4cf/0x890 fs/read_write.c:1025
do_preadv fs/read_write.c:1142 [inline]
__do_sys_preadv fs/read_write.c:1192 [inline]
__se_sys_preadv fs/read_write.c:1187 [inline]
__x64_sys_preadv+0x22d/0x310 fs/read_write.c:1187
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Closes: https://syzkaller.appspot.com/bug?extid=1fc6f64c40a9d143cfb6
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241108034020.3695718-1-lizhi.xu@windriver.com/ [1]
Link: https://lore.kernel.org/r/20241213135013.2964079-9-dhowells@redhat.com
Tested-by: syzbot+885c03ad650731743489@syzkaller.appspotmail.com
Suggested-by: Lizhi Xu <lizhi.xu@windriver.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Reported-by: syzbot+885c03ad650731743489@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use clear_and_wake_up_bit() rather than something like:
clear_bit_unlock(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);
as there needs to be a barrier inserted between the two, which is present
in clear_and_wake_up_bit().
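For illustration, the combined form collapses to a single call (a sketch of
the pattern, not the exact patched hunks):
    clear_and_wake_up_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags);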
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-8-dhowells@redhat.com
Reviewed-by: Akira Yokosawa <akiyks@gmail.com>
cc: Zilin Guan <zilin@seu.edu.cn>
cc: Akira Yokosawa <akiyks@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The function netfs_unbuffered_write_iter_locked() in
fs/netfs/direct_write.c contains an unnecessary smp_rmb() call after
wait_on_bit(). Since wait_on_bit() already incorporates a memory barrier
that ensures the flag update is visible before the function returns, the
smp_rmb() provides no additional benefit and incurs unnecessary overhead.
This patch removes the redundant barrier to simplify and optimize the code.
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241207021952.2978530-1-zilin@seu.edu.cn/
Link: https://lore.kernel.org/r/20241213135013.2964079-7-dhowells@redhat.com
Reviewed-by: Akira Yokosawa <akiyks@gmail.com>
cc: Akira Yokosawa <akiyks@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If netfs_read_to_pagecache() gets an error from either ->prepare_read() or
from netfs_prepare_read_iterator(), it needs to decrement ->nr_outstanding,
cancel the subrequest and break out of the issuing loop. Currently, it
only does this for two of the cases, but there are two more that aren't
handled.
Fix this by moving the handling to a common place and jumping to it from
all four places. This is in preference to inserting a wrapper around
netfs_prepare_read_iterator() as proposed by Dmitry Antipov[1].
Link: https://lore.kernel.org/r/20241202093943.227786-1-dmantipov@yandex.ru/ [1]
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Reported-by: syzbot+404b4b745080b6210c6c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=404b4b745080b6210c6c
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-4-dhowells@redhat.com
Tested-by: syzbot+404b4b745080b6210c6c@syzkaller.appspotmail.com
cc: Dmitry Antipov <dmantipov@yandex.ru>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When a read subrequest finishes, if it doesn't have sufficient coverage to
complete the folio(s) covering either side of it, it will donate the excess
coverage to the adjacent subrequests on either side, offloading
responsibility for unlocking the folio(s) covered to them.
Now, preference is given to donating down to a lower file offset over
donating up because that check is done first - but there's no check that
the lower subreq is actually contiguous, and so we can end up donating
incorrectly.
The scenario seen[1] is that an 8MiB readahead request spanning four 2MiB
folios is split into eight 1MiB subreqs (numbered 1 through 8). These
terminate in the order 1,6,2,5,3,7,4,8. What happens is:
- 1 donates to 2
- 6 donates to 5
- 2 completes, unlocking the first folio (with 1).
- 5 completes, unlocking the third folio (with 6).
- 3 donates to 4
- 7 donates to 4 incorrectly
- 4 completes, unlocking the second folio (with 3), but can't use
the excess from 7.
- 8 donates to 4, also incorrectly.
Fix this by preventing downward donation if the subreqs are not contiguous
(in the example above, 7 donates to 4 across the gap left by 5 and 6).
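A sketch of the contiguity test this implies (start/len as in struct
netfs_io_subrequest; the surrounding logic is illustrative):
    /* Only donate downwards if prev ends exactly where this subreq starts. */
    if (prev && prev->start + prev->len == subreq->start) {
            /* donate the excess to the lower neighbour */
    } else {
            /* otherwise donate upwards to the following subrequest */
    }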
Reported-by: Shyam Prasad N <nspmangalore@gmail.com>
Closes: https://lore.kernel.org/r/CANT5p=qBwjBm-D8soFVVtswGEfmMtQXVW83=TNfUtvyHeFQZBA@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/526707.1733224486@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/20241213135013.2964079-3-dhowells@redhat.com
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Remove duplicate included header file linux/uio.h
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Link: https://lore.kernel.org/r/20240628062329.321162-2-thorsten.blum@toblux.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"Various fixes for the netfs library and related infrastructure:
cachefiles:
- Fix a dentry leak in cachefiles_open_file()
- Fix incorrect length return value in
cachefiles_ondemand_fd_write_iter()
- Fix missing pos updates in cachefiles_ondemand_fd_write_iter()
- Clean up in cachefiles_commit_tmpfile()
- Fix NULL pointer dereference in object->file
- Add a memory barrier for FSCACHE_VOLUME_CREATING
netfs:
- Remove call to folio_index()
- Fix a few minor bugs in netfs_page_mkwrite()
- Remove unnecessary references to pages"
* tag 'vfs-6.13.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs/fscache: Add a memory barrier for FSCACHE_VOLUME_CREATING
cachefiles: Fix NULL pointer dereference in object->file
cachefiles: Clean up in cachefiles_commit_tmpfile()
cachefiles: Fix missing pos updates in cachefiles_ondemand_fd_write_iter()
cachefiles: Fix incorrect length return value in cachefiles_ondemand_fd_write_iter()
netfs: Remove unnecessary references to pages
netfs: Fix a few minor bugs in netfs_page_mkwrite()
netfs: Remove call to folio_index()
|
|
In fscache_create_volume(), there is a missing memory barrier between the
bit-clearing operation and the wake-up operation. This may cause a
situation where, after a wake-up, the bit-clearing operation hasn't been
detected yet, leading to an indefinite wait. The triggering process is as
follows:
[cookie1] [cookie2] [volume_work]
fscache_perform_lookup
fscache_create_volume
fscache_perform_lookup
fscache_create_volume
fscache_create_volume_work
cachefiles_acquire_volume
clear_and_wake_up_bit
test_and_set_bit
test_and_set_bit
goto maybe_wait
goto no_wait
In the above process, cookie1 and cookie2 have the same volume. When cookie1
enters the no_wait path, it will clear the bit and wake up the waiting
process. If the barrier is missing, cookie2 may remain stuck in the wait
path indefinitely.
In commit 3288666c7256 ("fscache: Use clear_and_wake_up_bit() in
fscache_create_volume_work()"), barriers were added to similar operations
in fscache_create_volume_work(), but fscache_create_volume() was missed.
Fix this by combining the clear and wake operations into
clear_and_wake_up_bit().
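A minimal sketch of the change in the no_wait path (not the exact patched
hunk; the flag word follows the naming above):
    /* before: no barrier between clearing the bit and waking the waiter */
    clear_bit(FSCACHE_VOLUME_CREATING, &volume->flags);
    wake_up_bit(&volume->flags, FSCACHE_VOLUME_CREATING);
    /* after: the helper clears the bit, inserts the barrier and wakes */
    clear_and_wake_up_bit(FSCACHE_VOLUME_CREATING, &volume->flags);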
Fixes: bfa22da3ed65 ("fscache: Provide and use cache methods to lookup/create/free a volume")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/r/20241107110649.3980193-6-wozizhi@huawei.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In the I/O locking code borrowed from NFS into netfslib, i_rwsem is held
locked across a buffered write - but this causes a performance regression
in cifs as it excludes buffered reads for the duration (cifs didn't use any
locking for buffered reads).
Mitigate this somewhat by downgrading the i_rwsem to a read lock across the
buffered write. This at least allows parallel reads to occur whilst
excluding other writes, DIO, truncate and setattr.
Note that this shouldn't be a problem for a buffered write as a read
through an mmap can circumvent i_rwsem anyway.
Also note that we might want to make this change in NFS also.
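A minimal sketch of the locking pattern being described (illustrative; the
actual change lives in the netfs I/O-locking helpers rather than open-coded
rwsem calls):
    down_write(&inode->i_rwsem);      /* exclude other writes, DIO, truncate, setattr */
    /* ... set up the write ... */
    downgrade_write(&inode->i_rwsem); /* buffered reads may now run in parallel */
    /* ... perform the buffered write ... */
    up_read(&inode->i_rwsem);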
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1317958.1729096113@warthog.procyon.org.uk
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Trond Myklebust <trondmy@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
netfslib currently defers dropping the ref on the folios it obtains during
readahead to after it has started I/O on the basis that we can do it whilst
we wait for the I/O to complete, but this runs the risk of the I/O
collection racing with this in future.
Furthermore, Matthew Wilcox strongly suggests that the refs should be
dropped immediately, as readahead_folio() does (netfslib is using
__readahead_batch() which doesn't drop the refs).
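For illustration, the readahead_folio() pattern being suggested looks roughly
like this (ractl is just the conventional readahead_control argument):
    struct folio *folio;
    while ((folio = readahead_folio(ractl)) != NULL) {
            /* The readahead-held reference has already been dropped;
             * queue the folio for I/O without keeping a ref to put later. */
    }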
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3771538.1728052438@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Matthew Wilcox (Oracle) <willy@infradead.org> says:
A few minor fixes; nothing earth-shattering.
Matthew Wilcox (Oracle) (3):
netfs: Remove call to folio_index()
netfs: Fix a few minor bugs in netfs_page_mkwrite()
netfs: Remove unnecessary references to pages
Link: https://lore.kernel.org/r/20241005182307.3190401-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
These places should all use folios instead of pages.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241005182307.3190401-4-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We can't return with VM_FAULT_SIGBUS | VM_FAULT_LOCKED; the core
code will not unlock the folio in this instance. Introduce a new
"unlock" error exit to handle this case. Use it to handle
the "folio is truncated" check, and change the "writeback interrupted
by a fatal signal" to do a NOPAGE exit instead of letting the core
code install the folio currently under writeback before killing the
process.
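A sketch of the control flow described above (illustrative names, not the
exact netfs_page_mkwrite() body):
    vm_fault_t ret;
    folio_lock(folio);
    if (folio->mapping != mapping) {
            ret = VM_FAULT_NOPAGE;  /* folio was truncated: take the unlock exit */
            goto unlock;
    }
    /* ... on success the folio stays locked ... */
    return VM_FAULT_LOCKED;
unlock:
    folio_unlock(folio);
    return ret;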
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241005182307.3190401-3-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
After dividing up a proposed write into subrequests, netfslib sets
NETFS_RREQ_ALL_QUEUED to indicate to the collector that it can move on to
the final cleanup once it has emptied the subrequest queues.
Now, whilst the collector will normally end up running at least once after
this bit is set just because it takes a while to process all the write
subrequests before the collector runs out of subrequests, there exists the
possibility that the issuing thread will be forced to sleep and the
collector thread will clean up all the subrequests before ALL_QUEUED gets
set.
In such a case, the collector thread will not get triggered again and will
never clear NETFS_RREQ_IN_PROGRESS thus leaving a request uncompleted and
causing a potential future hang.
Fix this by scheduling the write collector if all the subrequest queues are
empty (and thus no writes pending issuance).
Note that we'd do this ideally before queuing the subrequest, but in the
case of buffered writeback, at least, we can't find out that we've run out
of folios until after we've called writeback_iter() and it has returned
NULL - at which point we might not actually have any subrequests still
under construction.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3317784.1727880350@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use folioq_count instead of folioq_nr_slots to fix a KMSAN uninit-value
error in netfs_clear_buffer().
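For illustration, the difference amounts to bounding the walk by the occupied
slots rather than the capacity (a sketch, not the actual netfs_clear_buffer()
body):
    unsigned int slot;
    for (slot = 0; slot < folioq_count(folioq); slot++) {
            struct folio *folio = folioq_folio(folioq, slot);
            if (folio)
                    folio_put(folio); /* only occupied slots are initialised */
    }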
Signed-off-by: Chang Yu <marcus.yu.56@gmail.com>
Link: https://lore.kernel.org/r/ZvuXWC2bYpvQsWgS@gmail.com
Fixes: cd0277ed0c18 ("netfs: Use new folio_queue data type and iterator instead of xarray iter")
Acked-by: David Howells <dhowells@redhat.com>
Reported-by: syzbot+921873345a95f4dae7e9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=921873345a95f4dae7e9
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"afs:
- Fix setting of the server responding flag
- Remove unused struct afs_address_list and afs_put_address_list()
function
- Fix infinite loop because of unresponsive servers
- Ensure that afs_retry_request() function is correctly added to the
afs_req_ops netfs operations table
netfs:
- Fix netfs_folio tracepoint handling to handle NULL mappings
- Add a missing folio_queue API documentation
- Ensure that netfs_write_folio() correctly advances the iterator via
iov_iter_advance()
- Fix a dentry leak during concurrent cull and cookie lookup
operations in cachefiles
pidfs:
- Correctly handle accessing another task's pid namespace"
* tag 'vfs-6.12-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix the netfs_folio tracepoint to handle NULL mapping
netfs: Add folio_queue API documentation
netfs: Advance iterator correctly rather than jumping it
afs: Fix the setting of the server responding flag
afs: Remove unused struct and function prototype
afs: Fix possible infinite loop with unresponsive servers
pidfs: check for valid pid namespace
afs: Fix missing wire-up of afs_retry_request()
cachefiles: fix dentry leak in cachefiles_open_file()
|
|
In netfs_write_folio(), use iov_iter_advance() to advance the folio as we
split bits of it off to subrequests rather than manually jumping the
->iov_offset value around. This becomes more problematic when we use a
bounce buffer made out of single-page folios to cover a multipage pagecache
folio.
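For illustration, the change amounts to something like the following inside
the per-folio splitting loop (names illustrative):
    iov_iter_advance(iter, part); /* rather than: iter->iov_offset += part */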
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/2238548.1727424522@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In netfslib, a buffered writeback operation has a 'write queue' of folios
that are being written, held in a linear sequence of folio_queue structs.
The 'issuer' adds new folio_queues on the leading edge of the queue and
populates each one progressively; the 'collector' pops them off the
trailing edge and discards them and the folios they point to as they are
consumed.
The queue is required to always retain at least one folio_queue structure.
This allows the queue to be accessed without locking and with just a bit of
barriering.
When a new subrequest is prepared, its ->io_iter iterator is pointed at the
current end of the write queue and then the iterator is extended as more
data is added to the queue until the subrequest is committed.
Now, the problem is that the folio_queue at the leading edge of the write
queue when a subrequest is prepared might have been entirely consumed - but
not yet removed from the queue as it is the only remaining one and is
preventing the queue from collapsing.
So, what happens is that subreq->io_iter is pointed at the spent
folio_queue, then a new folio_queue is added, and, at that point, the
collector is entirely at liberty to immediately delete the spent
folio_queue.
This leaves the subreq->io_iter pointing at a freed object. If the system
is lucky, iterate_folioq() sees ->io_iter, sees the as-yet uncorrupted
freed object and advances to the next folio_queue in the queue.
In the case seen, however, the freed object gets recycled and put back onto
the queue at the tail and filled to the end. This confuses
iterate_folioq() and it tries to step ->next, which may be NULL - resulting
in an oops.
Fix this by the following means:
(1) When preparing a write subrequest, make sure there's a folio_queue
struct with space in it at the leading edge of the queue. A function
to make space is split out of the function to append a folio so that
it can be called for this purpose.
(2) If the request struct iterator is pointing to a completely spent
folio_queue when we make space, then advance the iterator to the newly
allocated folio_queue. The subrequest's iterator will then be set
from this.
The oops could be triggered using the generic/346 xfstest with a filesystem
on 9P over TCP with cache=loose. The oops looked something like:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
...
RIP: 0010:_copy_from_iter+0x2db/0x530
...
Call Trace:
<TASK>
...
p9pdu_vwritef+0x3d8/0x5d0
p9_client_prepare_req+0xa8/0x140
p9_client_rpc+0x81/0x280
p9_client_write+0xcf/0x1c0
v9fs_issue_write+0x87/0xc0
netfs_advance_write+0xa0/0xb0
netfs_write_folio.isra.0+0x42d/0x500
netfs_writepages+0x15a/0x1f0
do_writepages+0xd1/0x220
filemap_fdatawrite_wbc+0x5c/0x80
v9fs_mmap_vm_close+0x7d/0xb0
remove_vma+0x35/0x70
vms_complete_munmap_vmas+0x11a/0x170
do_vmi_align_munmap+0x17d/0x1c0
do_vmi_munmap+0x13e/0x150
__vm_munmap+0x92/0xd0
__x64_sys_munmap+0x17/0x20
do_syscall_64+0x80/0xe0
entry_SYSCALL_64_after_hwframe+0x71/0x79
This also fixed a similar-looking issue with cifs and generic/074.
Fixes: cd0277ed0c18 ("netfs: Use new folio_queue data type and iterator instead of xarray iter")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202409180928.f20b5a08-oliver.sang@intel.com
Closes: https://lore.kernel.org/oe-lkp/202409131438.3f225fbf-oliver.sang@intel.com
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The cifs flag CIFS_INO_MODIFIED_ATTR, which indicates that the mtime and
ctime need to be written back on close, got taken over by netfs as
NETFS_ICTX_MODIFIED_ATTR to avoid the need to call a function pointer to
set it.
The flag gets set correctly on buffered writes, but doesn't get set by
netfs_page_mkwrite(), leading to occasional failures in generic/080 and
generic/215.
Fix this by setting the flag in netfs_page_mkwrite().
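The fix boils down to a one-liner of roughly this shape in
netfs_page_mkwrite(), with ictx being the file's netfs_inode context (a
sketch, not the exact hunk):
    set_bit(NETFS_ICTX_MODIFIED_ATTR, &ictx->flags);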
Fixes: 73425800ac94 ("netfs, cifs: Move CIFS_INO_MODIFIED_ATTR to netfs_inode")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202409161629.98887b2-oliver.sang@intel.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"This contains the work to improve read/write performance for the new
netfs library.
The main performance enhancing changes are:
- Define a structure, struct folio_queue, and a new iterator type,
ITER_FOLIOQ, to hold a buffer as a replacement for ITER_XARRAY. See
that patch for questions about naming and form.
ITER_FOLIOQ is provided as a replacement for ITER_XARRAY. The
problem with an xarray is that accessing it requires the use of a
lock (typically the RCU read lock) - and this means that we can't
supply iterate_and_advance() with a step function that might sleep
(crypto for example) without having to drop the lock between pages.
ITER_FOLIOQ is the iterator for a chain of folio_queue structs,
where each folio_queue holds a small list of folios. A folio_queue
struct is a simpler structure than xarray and is not subject to
concurrent manipulation by the VM. folio_queue is used rather than
a bvec[] as it can form lists of indefinite size, adding to one end
and removing from the other on the fly.
- Provide a copy_folio_from_iter() wrapper.
- Make cifs RDMA support ITER_FOLIOQ.
- Use folio queues in the write-side helpers instead of xarrays.
- Add a function to reset the iterator in a subrequest.
- Simplify the write-side helpers to use sheaves to skip gaps rather
than trying to work out where gaps are.
- In afs, make the read subrequests asynchronous, putting them into
work items to allow the next patch to do progressive
unlocking/reading.
- Overhaul the read-side helpers to improve performance.
- Fix the caching of a partial block at the end of a file.
- Allow a store to be cancelled.
Then some changes for cifs to make it use folio queues instead of
xarrays for crypto bufferage:
- Use raw iteration functions rather than manually coding iteration
when hashing data.
- Switch to using folio_queue for crypto buffers.
- Remove the xarray bits.
Make some adjustments to the /proc/fs/netfs/stats file such that:
- All the netfs stats lines begin 'Netfs:' but change this to
something a bit more useful.
- Add a couple of stats counters to track the numbers of skips and
waits on the per-inode writeback serialisation lock to make it
easier to check for this as a source of performance loss.
Miscellaneous work:
- Ensure that the sb_writers lock is taken around
vfs_{set,remove}xattr() in the cachefiles code.
- Reduce the number of conditional branches in netfs_perform_write().
- Move the CIFS_INO_MODIFIED_ATTR flag to the netfs_inode struct and
remove cifs_post_modify().
- Move the max_len/max_nr_segs members from netfs_io_subrequest to
netfs_io_request as they're only needed for one subreq at a time.
- Add an 'unknown' source value for tracing purposes.
- Remove NETFS_COPY_TO_CACHE as it's no longer used.
- Set the request work function up front at allocation time.
- Use bh-disabling spinlocks for rreq->lock as cachefiles completion
may be run from block-filesystem DIO completion in softirq context.
- Remove fs/netfs/io.c"
* tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
docs: filesystems: corrected grammar of netfs page
cifs: Don't support ITER_XARRAY
cifs: Switch crypto buffer to use a folio_queue rather than an xarray
cifs: Use iterate_and_advance*() routines directly for hashing
netfs: Cancel dirty folios that have no storage destination
cachefiles, netfs: Fix write to partial block at EOF
netfs: Remove fs/netfs/io.c
netfs: Speed up buffered reading
afs: Make read subreqs async
netfs: Simplify the writeback code
netfs: Provide an iterator-reset function
netfs: Use new folio_queue data type and iterator instead of xarray iter
cifs: Provide the capability to extract from ITER_FOLIOQ to RDMA SGEs
iov_iter: Provide copy_folio_from_iter()
mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios
netfs: Use bh-disabling spinlocks for rreq->lock
netfs: Set the request work function upon allocation
netfs: Remove NETFS_COPY_TO_CACHE
netfs: Reserve netfs_sreq_source 0 as unset/unknown
netfs: Move max_len/max_nr_segs from netfs_io_subrequest to netfs_io_stream
...
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual pile of misc updates:
Features:
- Add F_CREATED_QUERY fcntl() that allows userspace to query whether
a file was actually created. Often userspace wants to know whether
an O_CREATE request did actually create a file without using
O_EXCL. The current logic is that to first attempts to open the
file without O_CREAT | O_EXCL and if ENOENT is returned userspace
tries again with both flags. If that succeeds all is well. If it
now reports EEXIST it retries.
That works fairly well but some corner cases make this more
involved. If this operates on a dangling symlink the first openat()
without O_CREAT | O_EXCL will return ENOENT but the second openat()
with O_CREAT | O_EXCL will fail with EEXIST.
The reason is that openat() without O_CREAT | O_EXCL follows the
symlink while O_CREAT | O_EXCL doesn't for security reasons. So
it's not something we can really change unless we add an explicit
opt-in via O_FOLLOW which seems really ugly.
All available workarounds are really nasty (fanotify, bpf lsm etc)
so add a simple fcntl().
- Try an opportunistic lookup for O_CREAT. Today, when opening a file
we'll typically do a fast lookup, but if O_CREAT is set, the kernel
always takes the exclusive inode lock. This was likely done with
the expectation that O_CREAT means that we always expect to do the
create, but that's often not the case. Many programs set O_CREAT
even in scenarios where the file already exists (see related
F_CREATED_QUERY patch motivation above).
The series contained in the pr rearranges the pathwalk-for-open
code to also attempt a fast_lookup in certain O_CREAT cases. If a
positive dentry is found, the inode_lock can be avoided altogether
and it can stay in rcuwalk mode for the last step_into.
- Expose the 64 bit mount id via name_to_handle_at()
Now that we provide a unique 64-bit mount ID interface in statx(2),
we can now provide a race-free way for name_to_handle_at(2) to
provide a file handle and corresponding mount without needing to
worry about racing with /proc/mountinfo parsing or having to open a
file just to do statx(2).
While this is not necessary if you are using AT_EMPTY_PATH and
don't care about an extra statx(2) call, users that pass full paths
into name_to_handle_at(2) need to know which mount the file handle
comes from (to make sure they don't try to open_by_handle_at a file
handle from a different filesystem) and switching to AT_EMPTY_PATH
would require allocating a file for every name_to_handle_at(2) call
- Add a per dentry expire timeout to autofs
There are two fairly well known automounter map formats, the autofs
format and the amd format (more or less System V and Berkley).
Some time ago Linux autofs added an amd map format parser that
implemented a fair amount of the amd functionality. This was done
within the autofs infrastructure and some functionality wasn't
implemented because it either didn't make sense or required extra
kernel changes. The idea was to restrict changes to be within the
existing autofs functionality as much as possible and leave changes
with a wider scope to be considered later.
One of these changes is implementing the amd options:
1) "unmount", expire this mount according to a timeout (same as
the current autofs default).
2) "nounmount", don't expire this mount (same as setting the
autofs timeout to 0 except only for this specific mount).
3) "utimeout=<seconds>", expire this mount using the specified
timeout (again same as setting the autofs timeout but only for
this mount)
To implement these options per-dentry expire timeouts need to be
implemented for autofs indirect mounts. This is because all map
keys (mounts) for autofs indirect mounts use an expire timeout
stored in the autofs mount super block info. structure and all
indirect mounts use the same expire timeout.
Fixes:
- Fix missing fput for FSCONFIG_SET_FD in autofs
- Use param->file for FSCONFIG_SET_FD in coda
- Delete the 'fs/netfs' proc subtree when netfs module exits
- Make sure that struct uid_gid_map fits into a single cacheline
- Don't flush in-flight wb switches for superblocks without cgroup
writeback
- Correcting the idmapping mount example in the idmapping
documentation
- Fix a race between evict_inodes() and find_inode() and iput()
- Refine the show_inode_state() macro definition in writeback code
- Prevent dump_mapping() from accessing invalid dentry.d_name.name
- Show actual source for debugfs in /proc/mounts
- Annotate data-race of busy_poll_usecs in eventpoll
- Don't WARN for racy path_noexec check in exec code
- Handle OOM on mnt_warn_timestamp_expiry()
- Fix some spelling in the iomap design documentation
- Fix typo in procfs comment
- Fix typo in fs/namespace.c comment
Cleanups:
- Add the VFS git tree to the MAINTAINERS file
- Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode
bit in struct file bringing us to 5 free f_mode bits
- Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify
the wait mechanism
- Remove the unused path_put_init() helper
- Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi
specific
- Replace the unsigned long i_state member with a u32 i_state member
in struct inode freeing up 4 bytes in struct inode. Instead of
using the bit based wait apis we're now using the var event apis
and using the individual bytes of the i_state member to wait on
state changes
- Explain how per-syscall AT_* flags should be allocated
- Use in_group_or_capable() helper to simplify the posix acl mode
update code
- Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code
- Remove a comment about d_rcu_to_refcount() as that function doesn't
exist anymore
- Add kernel documentation for lookup_fast()
- Don't re-zero eventpoll fields
- Remove outdated comment after close_fd()
- Fix imprecise wording in comment about the pipe filesystem
- Drop GFP_NOFAIL mode from alloc_page_buffers
- Fix missing blank line warnings and improve a struct declaration
in file_table
- Annotate struct poll_list with __counted_by()
- Remove the unused read parameter in percpu-rwsem
- Remove linux/prefetch.h include from direct-io code
- Use kmemdup_array instead of kmemdup for multiple allocation in
mnt_idmapping code
- Remove unused mnt_cursor_del() declaration
Performance tweaks:
- Dodge smp_mb in break_lease and break_deleg in the common case
- Only read fops once in fops_{get,put}()
- Use RCU in ilookup()
- Elide smp_mb in iversion handling in the common case
- Drop one lock trip in evict()"
* tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits)
uidgid: make sure we fit into one cacheline
proc: Fix typo in the comment
fs/pipe: Correct imprecise wording in comment
fhandle: expose u64 mount id to name_to_handle_at(2)
uapi: explain how per-syscall AT_* flags should be allocated
fs: drop GFP_NOFAIL mode from alloc_page_buffers
writeback: Refine the show_inode_state() macro definition
fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation
netfs: Delete subtree of 'fs/netfs' when netfs module exits
fs: use LIST_HEAD() to simplify code
inode: make i_state a u32
inode: port __I_LRU_ISOLATING to var event
vfs: fix race between evice_inodes() and find_inode()&iput()
inode: port __I_NEW to var event
inode: port __I_SYNC to var event
fs: reorder i_state bits
fs: add i_state helpers
MAINTAINERS: add the VFS git tree
fs: s/__u32/u32/ for s_fsnotify_mask
...
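As a rough illustration of the name_to_handle_at(2) change mentioned above,
the following userspace sketch queries a handle together with the unique
64-bit mount ID. The AT_HANDLE_MNT_ID_UNIQUE flag name, its fallback value
and the convention that the kernel writes a 64-bit ID through the mount_id
pointer are assumptions drawn from the description above, not a verified
uapi reference.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#ifndef AT_HANDLE_MNT_ID_UNIQUE
#define AT_HANDLE_MNT_ID_UNIQUE 0x001	/* assumed value, illustration only */
#endif
int main(int argc, char **argv)
{
	struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	uint64_t mnt_id = 0;	/* the 64-bit ID is written through this pointer */
	if (argc < 2 || !fh)
		return 1;
	fh->handle_bytes = MAX_HANDLE_SZ;
	if (name_to_handle_at(AT_FDCWD, argv[1], fh, (int *)&mnt_id,
			      AT_HANDLE_MNT_ID_UNIQUE) == -1) {
		perror("name_to_handle_at");
		return 1;
	}
	printf("handle_type=%d mount_id=%llu\n", fh->handle_type,
	       (unsigned long long)mnt_id);
	return 0;
}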
|
|
Kafs wants to be able to cache the contents of directories (and symlinks),
but whilst these are downloaded from the server with the FS.FetchData RPC
op and similar, the same as for regular files, they can't be updated by
FS.StoreData, but rather have special operations (FS.MakeDir, etc.).
Now, rather than redownloading a directory's content after each change made
to that directory, kafs modifies the local blob. This blob can be saved
out to the cache, and since it's using netfslib, kafs just marks the folios
dirty and lets ->writepages() on the directory take care of it, as for a
regular file.
This is fine as long as there's a cache as although the upload stream is
disabled, there's a cache stream to drive the procedure. But if the cache
goes away in the meantime, suddenly there's no way to do any writes and the
code gets confused, complains "R=%x: No submit" to dmesg and leaves the
dirty folio hanging.
Fix this by just cancelling the store of the folio if neither stream is
active. (If there's no cache at the time of dirtying, we should just not
mark the folio dirty).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-23-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Because it uses DIO writes, cachefiles is unable to make a write to the
backing file if that write is not aligned to and sized according to the
backing file's DIO block alignment. This makes it tricky to handle a write
to the cache where the EOF on the network file is not correctly aligned.
To get around this, netfslib attempts to tell the driver it is calling how
much more data there is available beyond the EOF that it can use to pad the
write (netfslib preclears the part of the folio above the EOF). However,
it tries to tell the cache what the maximum length is, but doesn't
calculate this correctly; and, in any case, cachefiles actually ignores the
value and just skips the block.
Fix this by:
(1) Change the value passed to indicate the amount of extra data that can
be added to the operation (now ->submit_extendable_to). This is much
simpler to calculate as it's just the end of the folio minus the top
of the data within the folio - rather than having to account for data
spread over multiple folios.
(2) Make cachefiles add some of this data if the subrequest it is given
ends at the network file's i_size if the extra data is sufficient to
pad out to a whole block.
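A minimal sketch of the padding decision described in (1) and (2); the names
here (the DIO block size, the extendable amount) are illustrative stand-ins
for the cachefiles/netfs fields, not the actual implementation.
#include <linux/math.h>
#include <linux/types.h>
static size_t example_pad_to_dio_block(loff_t start, size_t len,
				       size_t extendable_to,
				       unsigned int dio_block_size)
{
	/* Extra bytes needed to end the write on a DIO block boundary. */
	size_t pad = round_up(start + len, dio_block_size) - (start + len);
	/* Only pad if the precleared space past the EOF covers it. */
	if (pad && pad <= extendable_to)
		return len + pad;
	return len;
}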
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-22-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Remove fs/netfs/io.c as it is no longer used.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-21-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Improve the efficiency of buffered reads in a number of ways:
(1) Overhaul the algorithm in general so that it's a lot more compact and
split the read submission code between buffered and unbuffered
versions. The unbuffered version can be vastly simplified.
(2) Read-result collection is handed off to a work queue rather than being
done in the I/O thread. Multiple subrequests can be processed
simultaneously.
(3) When a subrequest is collected, any folios it fully spans are
collected and "spare" data on either side is donated to either the
previous or the next subrequest in the sequence.
Notes:
(*) Readahead expansion massively slows down fio, presumably because it
causes a load of extra allocations, both folio and xarray, up front
before RPC requests can be transmitted.
(*) RDMA with cifs does appear to work, both with SIW and RXE.
(*) PG_private_2-based reading and copy-to-cache is split out into its own
file and altered to use folio_queue. Note that the copy to the cache
now creates a new write transaction against the cache and adds the
folios to be copied into it. This allows it to use part of the
writeback I/O code.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-20-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Use the new folio_queue structures to simplify the writeback code. The
problem with referring to the i_pages xarray directly is that we may have
gaps in the sequence of folios we're writing from that we need to skip when
we're removing the writeback mark from the folios we're writing back from.
At the moment the code tries to deal with this by carefully tracking the
gaps in each writeback stream (eg. write to server and write to cache) and
divining when there's a gap that spans folios (something that's not helped
by folios not being a consistent size).
Instead, the folio_queue buffer contains pointers only to the folios we're
dealing with, has them in ascending order and indicates a gap by placing
non-consecutive folios next to each other. This makes it possible to
track where we need to clean up to by just keeping track of where we've
processed to on each stream and taking the minimum.
Note that the I/O iterator is always rounded up to the end of the folio,
even if that is beyond the EOF position, so that the cache can do DIO from
the page. The excess space is cleared, though mmapped writes clobber it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-18-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Provide a function to reset the iterator on a subrequest.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-17-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Make the netfs write-side routines use the new folio_queue struct to hold a
rolling buffer of folios, with the issuer adding folios at the tail and the
collector removing them from the head as they're processed instead of using
an xarray.
This will allow a subsequent patch to simplify the write collector.
The primary mark (as tested by folioq_is_marked()) is used to note if the
corresponding folio needs putting.
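As a rough sketch of how the rolling buffer is used, assuming the
folio_queue helpers behave as described here (this is illustrative, not the
actual netfs issuer/collector code):
#include <linux/folio_queue.h>
#include <linux/mm.h>
static void example_issue_folio(struct folio_queue *tail, struct folio *folio,
				bool needs_put)
{
	unsigned int slot = folioq_append(tail, folio);	/* issuer adds at the tail */
	if (needs_put)
		folioq_mark(tail, slot);	/* primary mark: folio needs putting */
}
static void example_collect_folio(struct folio_queue *head, unsigned int slot)
{
	struct folio *folio = folioq_folio(head, slot);
	if (folioq_is_marked(head, slot))
		folio_put(folio);	/* collector drops the reference it was given */
}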
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-16-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull smb client fixes from Steve French:
- fix potential mount hang
- fix retry problem in two types of compound operations
- important netfs integration fix in SMB1 read paths
- fix potential uninitialized zero point of inode
- minor patch to improve debugging for potential crediting problems
* tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
netfs, cifs: Improve some debugging bits
cifs: Fix SMB1 readv/writev callback in the same way as SMB2/3
cifs: Fix zero_point init on inode initialisation
smb: client: fix double put of @cfile in smb2_set_path_size()
smb: client: fix double put of @cfile in smb2_rename_path()
smb: client: fix hang in wait_for_response() for negproto
|
|
Use bh-disabling spinlocks when accessing rreq->lock because, in the
future, it may be twiddled from softirq context when cleanup is driven from
cache backend DIO completion.
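A minimal sketch of the locking pattern this enables, shown with an
illustrative lock parameter rather than the real rreq field:
#include <linux/spinlock.h>
static void example_update_subreq_lists(spinlock_t *rreq_lock)
{
	spin_lock_bh(rreq_lock);	/* bh-disabling variant instead of spin_lock() */
	/* ... touch state that a softirq-driven completion may also touch ... */
	spin_unlock_bh(rreq_lock);
}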
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-12-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Set the work function in the netfs_io_request work_struct when we allocate
the request rather than doing this later. This reduces the number of
places we need to set it in future code.
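A sketch of the pattern with placeholder names rather than the actual netfs
symbols: the work function is bound once at allocation, so later paths only
queue the work.
#include <linux/workqueue.h>
struct example_request {
	struct work_struct work;
};
static void example_request_work(struct work_struct *work)
{
	/* ... collection/cleanup driven from the workqueue ... */
}
static void example_request_init(struct example_request *rreq)
{
	INIT_WORK(&rreq->work, example_request_work);	/* set once, up front */
}
/* Later code can simply do: queue_work(system_unbound_wq, &rreq->work); */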
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-11-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Remove NETFS_COPY_TO_CACHE as it isn't used anymore.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-10-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Move max_len/max_nr_segs from struct netfs_io_subrequest to struct
netfs_io_stream as we only issue one subreq at a time and then don't need
these values again for that subreq unless and until we have to retry it -
in which case we want to renegotiate them.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-8-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Move CIFS_INO_MODIFIED_ATTR to netfs_inode as NETFS_ICTX_MODIFIED_ATTR and
then make netfs_perform_write() set it. This means that cifs doesn't need
to implement the ->post_modify() hook.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-7-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Reduce the number of conditional branches in netfs_perform_write() by
merging in netfs_how_to_modify() and then creating a separate if-statement
for each way we might modify a folio. Note that this means replicating the
data copy in each path.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-6-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Record statistics for contention upon the writeback serialisation lock that
prevents racing writeback calls from causing each other to interleave their
writebacks. These can be viewed in /proc/fs/netfs/stats on the WbLock line,
with skip=N indicating the number of non-SYNC writebacks skipped and wait=N
indicating the number of SYNC writebacks that waited.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-5-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Adjust the labels in /proc/fs/netfs/stats that refer to netfs-specific
counters. These currently all begin with "Netfs", but change them to begin
with more specific labels.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-4-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"Two netfs fixes for this merge window:
- Ensure that fscache_cookie_lru_time is deleted when the fscache
module is removed to prevent UAF
- Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range()
Before it used truncate_inode_pages_partial() which causes
copy_file_range() to fail on cifs"
* tag 'vfs-6.11-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fscache: delete fscache_cookie_lru_timer when fscache exits to avoid UAF
mm: Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range()
|
|
Improve some debugging bits:
(1) The netfslib _debug() macro doesn't need a newline in its format
string.
(2) Display the request debug ID and subrequest index in messages emitted
in smb2_adjust_credits() to make it easier to reference in traces.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The fscache_cookie_lru_timer is initialized when the fscache module
is inserted, but is not deleted when the fscache module is removed.
If timer_reduce() is called before removing the fscache module,
the fscache_cookie_lru_timer will be added to the timer list of
the current cpu. Afterwards, a use-after-free will be triggered
in the softIRQ after removing the fscache module, as follows:
==================================================================
BUG: unable to handle page fault for address: fffffbfff803c9e9
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 21ffea067 P4D 21ffea067 PUD 21ffe6067 PMD 110a7c067 PTE 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G W 6.11.0-rc3 #855
Tainted: [W]=WARN
RIP: 0010:__run_timer_base.part.0+0x254/0x8a0
Call Trace:
<IRQ>
tmigr_handle_remote_up+0x627/0x810
__walk_groups.isra.0+0x47/0x140
tmigr_handle_remote+0x1fa/0x2f0
handle_softirqs+0x180/0x590
irq_exit_rcu+0x84/0xb0
sysvec_apic_timer_interrupt+0x6e/0x90
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1a/0x20
RIP: 0010:default_idle+0xf/0x20
default_idle_call+0x38/0x60
do_idle+0x2b5/0x300
cpu_startup_entry+0x54/0x60
start_secondary+0x20d/0x280
common_startup_64+0x13e/0x148
</TASK>
Modules linked in: [last unloaded: netfs]
==================================================================
Therefore delete fscache_cookie_lru_timer when removing the fscache module.
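A minimal sketch of the shape of the fix, assuming the timer is simply shut
down in the module exit path (the exact helper used upstream may differ):
#include <linux/timer.h>
static struct timer_list example_lru_timer;
static void example_module_exit(void)
{
	/* Ensure the timer cannot fire after the module text is gone. */
	timer_shutdown_sync(&example_lru_timer);
}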
Fixes: 12bb21a29c19 ("fscache: Implement cookie user counting and resource pinning")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240826112056.2458299-1-libaokun@huaweicloud.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In netfs_init() or fscache_proc_init(), we create dentry under 'fs/netfs',
but in netfs_exit(), we only delete the proc entry of 'fs/netfs' without
deleting its subtree. This triggers the following WARNING:
==================================================================
remove_proc_entry: removing non-empty directory 'fs/netfs', leaking at least 'requests'
WARNING: CPU: 4 PID: 566 at fs/proc/generic.c:717 remove_proc_entry+0x160/0x1c0
Modules linked in: netfs(-)
CPU: 4 UID: 0 PID: 566 Comm: rmmod Not tainted 6.11.0-rc3 #860
RIP: 0010:remove_proc_entry+0x160/0x1c0
Call Trace:
<TASK>
netfs_exit+0x12/0x620 [netfs]
__do_sys_delete_module.isra.0+0x14c/0x2e0
do_syscall_64+0x4b/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
==================================================================
Therefore use remove_proc_subtree() instead of remove_proc_entry() to
fix the above problem.
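A minimal sketch of the corrected teardown:
#include <linux/proc_fs.h>
static void example_netfs_exit(void)
{
	/* Removes 'fs/netfs' together with children such as 'requests'. */
	remove_proc_subtree("fs/netfs", NULL);
}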
Fixes: 7eb5b3e3a0a5 ("netfs, fscache: Move /proc/fs/fscache to /proc/fs/netfs and put in a symlink")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240826113404.3214786-1-libaokun@huaweicloud.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Afaict, we can just rely on inode->i_dio_count for waiting instead of
this awkward indirection through __I_DIO_WAKEUP. This survives LTP dio
and xfstests dio tests.
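A sketch of the idiom this relies on: i_dio_count is manipulated with the
standard helpers and waiters simply use inode_dio_wait().
#include <linux/fs.h>
static void example_dio_submit(struct inode *inode)
{
	inode_dio_begin(inode);		/* bump i_dio_count */
	/* ... submit and complete the direct I/O ... */
	inode_dio_end(inode);		/* drop it and wake any waiter */
}
static void example_setattr_path(struct inode *inode)
{
	inode_dio_wait(inode);		/* wait for i_dio_count to hit zero */
	/* ... now safe to change the file size ... */
}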
Link: https://lore.kernel.org/r/20240816-vfs-misc-dio-v1-1-80fe21a2c710@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Short DIO reads, particularly in relation to cifs, are not being handled
correctly by cifs and netfslib. This can be tested by doing a DIO read of
a file where the size of read is larger than the size of the file. When it
crosses the EOF, it gets a short read and this gets retried, and in the
case of cifs, the retry read fails, with the failure being translated to
ENODATA.
Fix this by the following means:
(1) Add a flag, NETFS_SREQ_HIT_EOF, for the filesystem to set when it
detects that the read did hit the EOF.
(2) Make the netfslib read assessment stop processing subrequests when it
encounters one with that flag set.
(3) Return rreq->transferred, the accumulated contiguous amount read to
that point, to userspace for a DIO read.
(4) Make cifs set the flag and clear the error if the read RPC returned
ENODATA.
(5) Make cifs set the flag and clear the error if a short read occurred
without error and the read-to file position is now at the remote inode
size.
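A rough sketch of the filesystem-side handling in (4) and (5); the
surrounding cifs plumbing is elided and only the flag usage is shown:
#include <linux/netfs.h>
static int example_read_completed(struct netfs_io_subrequest *subreq, int rc,
				  bool at_remote_i_size)
{
	if (rc == -ENODATA || (rc == 0 && at_remote_i_size)) {
		__set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
		rc = 0;		/* an EOF-hitting read is not an error */
	}
	return rc;
}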
Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When netfslib asks cifs to issue a read operation, it prefaces this with a
call to ->clamp_length() which cifs uses to negotiate credits, providing
receive capacity on the server; however, in the event that a read op needs
reissuing, netfslib doesn't call ->clamp_length() again as that could
shorten the subrequest, leaving a gap.
This causes the retried read to be done with zero credits which causes the
server to reject it with STATUS_INVALID_PARAMETER. This is a problem for a
requested DIO read that would go over the EOF. The short read will
be retried, causing EINVAL to be returned to the user when it fails.
Fix this by making cifs_req_issue_read() negotiate new credits if retrying
(NETFS_SREQ_RETRYING now gets set in the read side as well as the write
side in this instance).
This isn't sufficient, however: the new credits might not be sufficient to
complete the remainder of the read, so also add an additional field,
rreq->actual_len, that holds the actual size of the op we want to perform
without having to alter subreq->len.
We then rely on repeated short reads being retried until we finish the read
or reach the end of file and make a zero-length read.
Also fix a couple of places where the subrequest start and length need to
be altered by the amount so far transferred when being used.
Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When a folio that is marked for streaming write (dirty, but not uptodate,
with partial content specified in the private data) is written back, the
folio is effectively switched to the blank state upon completion of the
write. This means that if we want to read it in future, we need to reread
the whole folio.
However, if the folio is above the zero_point position, when it is read
back, it will just be cleared and the read skipped, leading to apparent
local corruption.
Fix this by increasing the zero_point to the end of the dirty data in the
folio when clearing the folio state after writeback. This is analogous to
the folio having ->release_folio() called upon it.
This was causing the config.log generated by configuring a cpython tree on
a cifs share to get corrupted because the scripts involved were appending
text to the file in small pieces.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/563286.1724500613@warthog.procyon.org.uk
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix netfs_rreq_perform_resubmissions() to reset the subrequest iterator
before retrying a short read, otherwise the wrong part of the output buffer
will be used.
Fixes: 92b6cc5d1e7c ("netfs: Add iov_iters to (sub)requests to describe various buffers")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240823200819.532106-6-dhowells@redhat.com
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When netfslib writes to a folio that it doesn't have data for, but that
data exists on the server, it will make a 'streaming write' whereby it
stores data in a folio that is marked dirty, but not uptodate. When it
does this, it attaches a record to folio->private to track the dirty
region.
When truncate() or fallocate() wants to invalidate part of such a folio, it
will call into ->invalidate_folio(), specifying the part of the folio that
is to be invalidated. netfs_invalidate_folio(), on behalf of the
filesystem, must then determine how to trim the streaming write record. In
a couple of cases, however, it does this incorrectly (the reduce-length and
move-start cases are switched over and don't, in any case, calculate the
value correctly).
Fix this by making the logic tree more obvious and fixing the cases.
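A generic sketch of the trimming cases described above; the record and field
names are illustrative, not the actual netfs_folio private data:
#include <linux/types.h>
struct example_streaming_rec {
	unsigned int dirty_offset;	/* start of dirty data within the folio */
	unsigned int dirty_len;		/* length of the dirty data */
};
/* Trim @rec against an invalidated range [@off, @off + @len); returns false
 * if nothing dirty remains and the record should be dropped. */
static bool example_trim(struct example_streaming_rec *rec,
			 unsigned int off, unsigned int len)
{
	unsigned int dirty_end = rec->dirty_offset + rec->dirty_len;
	unsigned int inval_end = off + len;
	if (inval_end <= rec->dirty_offset || off >= dirty_end)
		return true;				/* no overlap */
	if (off <= rec->dirty_offset && inval_end >= dirty_end)
		return false;				/* fully invalidated */
	if (off <= rec->dirty_offset) {			/* move-start case */
		rec->dirty_len = dirty_end - inval_end;
		rec->dirty_offset = inval_end;
	} else if (inval_end >= dirty_end) {		/* reduce-length case */
		rec->dirty_len = off - rec->dirty_offset;
	}
	/* An invalidation strictly inside the dirty span is not shown here. */
	return true;
}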
Fixes: 9ebff83e6481 ("netfs: Prep to use folio->private for write grouping and streaming write")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240823200819.532106-5-dhowells@redhat.com
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Pankaj Raghav <p.raghav@samsung.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix netfs_release_folio() to say no (ie. return false) if the folio is
dirty (analogous with iomap's behaviour). Without this, it will say yes to
the release of a dirty page by split_huge_page_to_list_to_order(), which
will result in the loss of untruncated data in the folio.
Without this, the generic/075 and generic/112 xfstests (both fsx-based
tests) fail with minimum folio size patches applied[1].
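A minimal sketch of the dirty check; the real netfs_release_folio() also
deals with the attached private data, this only shows the added refusal:
#include <linux/pagemap.h>
static bool example_release_folio(struct folio *folio)
{
	if (folio_test_dirty(folio))
		return false;	/* don't let dirty, untruncated data be lost */
	/* ... otherwise free any private state and allow the release ... */
	return true;
}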
Fixes: c1ec4d7c2e13 ("netfs: Provide invalidate_folio and release_folio calls")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240815090849.972355-1-kernel@pankajraghav.com/ [1]
Link: https://lore.kernel.org/r/20240823200819.532106-4-dhowells@redhat.com
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Pankaj Raghav <p.raghav@samsung.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
folio->private and marking dirty"
This partially reverts commit 2ff1e97587f4d398686f52c07afde3faf3da4e5c.
In addition to reverting the removal of PG_private_2 wrangling from the
buffered read code[1][2], the removal of the waits for PG_private_2 from
netfs_release_folio() and netfs_invalidate_folio() need reverting too.
It also adds a wait into ceph_evict_inode() to wait for netfs read and
copy-to-cache ops to complete.
Fixes: 2ff1e97587f4 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e5ced7804cb9184c4a23f8054551240562a8eda [2]
Link: https://lore.kernel.org/r/20240814203850.2240469-2-dhowells@redhat.com
cc: Max Kellermann <max.kellermann@ionos.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If a program is watching a file on a 9p mount, it won't see any change in
size if the file being exported by the server is changed directly in the
source filesystem, presumably because 9p doesn't have change notifications,
and because netfs skips the reads if the file is empty.
Fix this by attempting to read the full size specified when a DIO read is
requested (such as when 9p is operating in unbuffered mode) and dealing
with a short read if the EOF was less than the expected read.
To make this work, filesystems using netfslib must not set
NETFS_SREQ_CLEAR_TAIL if performing a DIO read where that read hit the EOF.
I don't want to mandatorily clear this flag in netfslib for DIO because,
say, ceph might make a read from an object that is not completely filled,
but does not reside at the end of file - and so we need to clear the
excess.
This can be tested by watching an empty file over 9p within a VM (such as
in the ktest framework):
while true; do read content; if [ -n "$content" ]; then echo $content; break; fi; done < /host/tmp/foo
then writing something into the empty file. The watcher should immediately
display the file content and break out of the loop. Without this fix, it
remains in the loop indefinitely.
Fixes: 80105ed2fd27 ("9p: Use netfslib read/write_iter")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218916
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used
correctly. The problem is that we try to set them up in the request
initialisation, but the cache may still be in the process of setting up,
so the state may not be correct. Further, we secondarily sample the
cache state and make contradictory decisions later.
The issue arises because we set up the cache resources, which allows the
cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which
triggers cache writing even if we didn't set the flags when allocating.
Fix this in the following way:
(1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in
->init_request() rather than trying to juggle that in
netfs_alloc_request().
(2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is
to be done, then PG_private_2 is to be used rather than only setting
it if we decide to cache and then having netfs_rreq_unlock_folios()
set the non-PG_private_2 writeback-to-cache if it wasn't set.
(3) Split netfs_rreq_unlock_folios() into two functions, one of which
contains the deprecated code for using PG_private_2 to avoid
accidentally doing the writeback path - and always use it if
USE_PGPRIV2 is set.
(4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always
wait for PG_private_2. This function is deprecated and only used by
ceph anyway, and so label it so.
(5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use
fscache_operation_valid() on the cache_resources instead. This has
the advantage of picking up the result of netfs_begin_cache_read() and
fscache_begin_write_operation() - which are called after the object is
initialised and will wait for the cache to come to a usable state.
Just reverting ae678317b95e[1] isn't a sufficient fix, so this needs to be
applied on top of that. Without this as well, things like:
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: {
and:
WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386
may happen, along with some UAFs due to PG_private_2 not getting used to
wait on writeback completion.
Fixes: 2ff1e97587f4 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Hristo Venev <hristo@venev.name>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
second writeback flag"
This reverts commit ae678317b95e760607c7b20b97c9cd4ca9ed6e1a.
Revert the patch that removes the deprecated use of PG_private_2 in
netfslib for the moment as Ceph is actually still using this to track
data copied to the cache.
Fixes: ae678317b95e ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Commit 6b8e61472529 ("netfs: Rename CONFIG_FSCACHE_DEBUG to
CONFIG_NETFS_DEBUG") renames the config, but introduces two issues: First,
NETFS_DEBUG mistakenly depends on the non-existing config NETFS, whereas
the actual intended config is called NETFS_SUPPORT. Second, the config
renaming fails to adjust the documentation of the functionality of this
config.
Clean up those two points.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Link: https://lore.kernel.org/r/20240731073902.69262-1-lukas.bulwahn@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This fixes a NULL pointer dereference bug due to a data race which
looks like this:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 33 PID: 16573 Comm: kworker/u97:799 Not tainted 6.8.7-cm4all1-hp+ #43
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/17/2018
Workqueue: events_unbound netfs_rreq_write_to_cache_work
RIP: 0010:cachefiles_prepare_write+0x30/0xa0
Code: 57 41 56 45 89 ce 41 55 49 89 cd 41 54 49 89 d4 55 53 48 89 fb 48 83 ec 08 48 8b 47 08 48 83 7f 10 00 48 89 34 24 48 8b 68 20 <48> 8b 45 08 4c 8b 38 74 45 49 8b 7f 50 e8 4e a9 b0 ff 48 8b 73 10
RSP: 0018:ffffb4e78113bde0 EFLAGS: 00010286
RAX: ffff976126be6d10 RBX: ffff97615cdb8438 RCX: 0000000000020000
RDX: ffff97605e6c4c68 RSI: ffff97605e6c4c60 RDI: ffff97615cdb8438
RBP: 0000000000000000 R08: 0000000000278333 R09: 0000000000000001
R10: ffff97605e6c4600 R11: 0000000000000001 R12: ffff97605e6c4c68
R13: 0000000000020000 R14: 0000000000000001 R15: ffff976064fe2c00
FS: 0000000000000000(0000) GS:ffff9776dfd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000005942c002 CR4: 00000000001706f0
Call Trace:
<TASK>
? __die+0x1f/0x70
? page_fault_oops+0x15d/0x440
? search_module_extables+0xe/0x40
? fixup_exception+0x22/0x2f0
? exc_page_fault+0x5f/0x100
? asm_exc_page_fault+0x22/0x30
? cachefiles_prepare_write+0x30/0xa0
netfs_rreq_write_to_cache_work+0x135/0x2e0
process_one_work+0x137/0x2c0
worker_thread+0x2e9/0x400
? __pfx_worker_thread+0x10/0x10
kthread+0xcc/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
Modules linked in:
CR2: 0000000000000008
---[ end trace 0000000000000000 ]---
This happened because fscache_cookie_state_machine() was slow and was
still running while another process invoked fscache_unuse_cookie();
this led to a fscache_cookie_lru_do_one() call, setting the
FSCACHE_COOKIE_DO_LRU_DISCARD flag, which was picked up by
fscache_cookie_state_machine(), withdrawing the cookie via
cachefiles_withdraw_cookie(), clearing cookie->cache_priv.
At the same time, yet another process invoked
cachefiles_prepare_write(), which found a NULL pointer in this code
line:
struct cachefiles_object *object = cachefiles_cres_object(cres);
The next line crashes, obviously:
struct cachefiles_cache *cache = object->volume->cache;
During cachefiles_prepare_write(), the "n_accesses" counter is
non-zero (via fscache_begin_operation()). The cookie must not be
withdrawn until it drops to zero.
The counter is checked by fscache_cookie_state_machine() before
switching to FSCACHE_COOKIE_STATE_RELINQUISHING and
FSCACHE_COOKIE_STATE_WITHDRAWING (in "case
FSCACHE_COOKIE_STATE_FAILED"), but not for
FSCACHE_COOKIE_STATE_LRU_DISCARDING ("case
FSCACHE_COOKIE_STATE_ACTIVE").
This patch adds the missing check. With a non-zero access counter,
the function returns and the next fscache_end_cookie_access() call
will queue another fscache_cookie_state_machine() call to handle the
still-pending FSCACHE_COOKIE_DO_LRU_DISCARD.
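A sketch of the added guard, assuming the cookie's access counter is the
n_accesses field referred to above:
#include <linux/atomic.h>
#include <linux/fscache.h>
static bool example_may_lru_discard(struct fscache_cookie *cookie)
{
	/* An in-flight operation (e.g. cachefiles_prepare_write()) still
	 * holds an access; defer the discard until the count drops to zero
	 * and fscache_end_cookie_access() requeues the state machine. */
	return atomic_read(&cookie->n_accesses) == 0;
}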
Fixes: 12bb21a29c19 ("fscache: Implement cookie user counting and resource pinning")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240729162002.3436763-2-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
As in commit 4e527d5841e2 ("iomap: fault in smaller chunks for non-large
folio mappings"), we can see a performance loss for filesystems
which have not yet been converted to large folios.
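A sketch of the idea carried over from the iomap fix: cap the per-iteration
copy chunk at one page when the mapping has not been converted to large
folios, so the fault-in size matches the folio size.
#include <linux/pagemap.h>
static size_t example_copy_chunk(struct address_space *mapping, size_t chunk)
{
	if (!mapping_large_folio_support(mapping))
		chunk = PAGE_SIZE;	/* match the fault-in granularity */
	return chunk;
}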
Fixes: c38f4e96e605 ("netfs: Provide func to copy data to pagecache for buffered write")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240527201735.1898381-1-willy@infradead.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When netfslib is performing writeback (ie. ->writepages), it maintains two
parallel streams of writes, one to the server and one to the cache, but it
doesn't mark either stream of writes as active until it gets some data that
needs to be written to that stream.
This is done because some folios will only be written to the cache
(e.g. copying to the cache on read is done by marking the folios and
letting writeback do the actual work) and sometimes we'll only be writing
to the server (e.g. if there's no cache).
Now, since we don't actually dispatch uploads and cache writes in parallel,
but rather flip between the streams, depending on which has the lowest
so-far-issued offset, and don't wait for the subreqs to finish before
flipping, we can end up in a situation where, say, we issue a write to the
server and this completes before we start the write to the cache.
But because we only activate a stream when we first add a subreq to it, the
result collection code may run before we manage to activate the stream -
resulting in the folio being cleaned and having the writeback-in-progress
mark removed. At this point, the folio no longer belongs to us.
This is only really a problem for folios that need to be written to both
streams - and in that case, the upload to the server is started first,
followed by the write to the cache - and the cache write may see a bad
folio.
Fix this by activating the cache stream up front if there's a cache
available. If there's a cache, then all data is going to be written to it.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1599053.1721398818@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
CONFIG_FSCACHE_DEBUG should have been renamed to CONFIG_NETFS_DEBUG, so do
that now.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1410796.1721333406@warthog.procyon.org.uk
cc: Uwe Kleine-König <ukleinek@kernel.org>
cc: Christian Brauner <brauner@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Revert commit 163eae0fb0d4c610c59a8de38040f8e12f89fd43 to get back the
original operation of the debugging macros.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240608151352.22860-2-ukleinek@kernel.org
Link: https://lore.kernel.org/r/1410685.1721333252@warthog.procyon.org.uk
cc: Uwe Kleine-König <ukleinek@kernel.org>
cc: Christian Brauner <brauner@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My
bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability
of cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache
index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of
the zeropage in MAP_SHARED mappings. I don't see any runtime effects
here - more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
of higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
the series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
has simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in via sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to another CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplification work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large
folio userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's
self testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
under config option" and "mm: memcg: put cgroup v1-specific memcg
data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of
excessive correctable memory errors. In order to permit userspace to
monitor and handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from
migrate folio" teaches the kernel to appropriately handle migration
from poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory
utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than
bare refcount increments. So these pages can first be moved aside if
they reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to
/proc/pid/maps for much faster reading of vma information. The series
is "query VMAs from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance
Yang improves the kernel's presentation of developer information
related to multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and
not very useful feature from slab fault injection.
* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
mm/mglru: fix ineffective protection calculation
mm/zswap: fix a white space issue
mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
mm/hugetlb: fix possible recursive locking detected warning
mm/gup: clear the LRU flag of a page before adding to LRU batch
mm/numa_balancing: teach mpol_to_str about the balancing mode
mm: memcg1: convert charge move flags to unsigned long long
alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
lib: reuse page_ext_data() to obtain codetag_ref
lib: add missing newline character in the warning message
mm/mglru: fix overshooting shrinker memory
mm/mglru: fix div-by-zero in vmpressure_calc_level()
mm/kmemleak: replace strncpy() with strscpy()
mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
mm: ignore data-race in __swap_writepage
hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
mm: shmem: rename mTHP shmem counters
mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
mm/migrate: putback split folios when numa hint migration fails
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"cachefiles:
- Export an existing and add a new cachefile helper to be used in
filesystems to fix reference count bugs
- Use the newly added fscache_try_get_volume() helper to get a
reference count on an fscache_volume to handle volumes that are
about to be removed cleanly
- After withdrawing a fscache_cache via FSCACHE_CACHE_IS_WITHDRAWN
wait for all ongoing cookie lookups to complete and for the object
count to reach zero
- Propagate errors from vfs_getxattr() to avoid an infinite loop in
cachefiles_check_volume_xattr() because it keeps seeing ESTALE
- Don't send new requests when an object is dropped by raising
CACHEFILES_ONDEMAND_OBJSTATE_DROPPING
- Cancel all requests for an object that is about to be dropped
- Wait for the ondemand_object_worker to finish before dropping a
cachefiles object to prevent use-after-free
- Use cyclic allocation for message ids to better handle id recycling
- Add missing lock protection when iterating through the xarray when
polling
netfs:
- Use standard logging helpers for debug logging
VFS:
- Fix potential use-after-free in file locks during
trace_posix_lock_inode(). The tracepoint could fire while another
task raced it and freed the lock that was requested to be traced
- Only increment the nr_dentry_negative counter for dentries that are
present on the superblock LRU. Currently, DCACHE_LRU_LIST list is
used to detect this case. However, the flag is also raised in
combination with DCACHE_SHRINK_LIST to indicate that dentry->d_lru
is used. So checking only DCACHE_LRU_LIST will lead to wrong
nr_dentry_negative count. Fix the check to not count dentries that
are on a shrink related list
Misc:
- hfsplus: fix an uninitialized value issue in copy_name
- minix: fix minixfs_rename with HIGHMEM. It still uses kunmap() even
though we switched it to kmap_local_page() a while ago"
* tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
minixfs: Fix minixfs_rename with HIGHMEM
hfsplus: fix uninit-value in copy_name
vfs: don't mod negative dentry count when on shrinker list
filelock: fix potential use-after-free in posix_lock_inode
cachefiles: add missing lock protection when polling
cachefiles: cyclic allocation of msg_id to avoid reuse
cachefiles: wait for ondemand_object_worker to finish when dropping object
cachefiles: cancel all requests for the object that is being dropped
cachefiles: stop sending new request when dropping object
cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop
cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume()
netfs: Switch debug logging to pr_debug()
|
|
libaokun@huaweicloud.com <libaokun@huaweicloud.com> says:
This is the third version of this patch series, in which another patch set
is subsumed into this one to avoid confusing the two patch sets.
(https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914)
We've been testing ondemand mode for cachefiles since January, and we're
almost done. We hit a lot of issues during the testing period, and this
patch series fixes some of the issues. The patches have passed internal
testing without regression.
The following is a brief overview of the patches, see the patches for
more details.
Patch 1-2: Add fscache_try_get_volume() helper function to avoid
fscache_volume use-after-free on cache withdrawal.
Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache()
concurrency causing cachefiles_volume use-after-free.
Patch 4: Propagate error codes returned by vfs_getxattr() to avoid
endless loops.
Patch 5-7: A read request waiting for reopen could be closed maliciously
before the reopen worker is executing or waiting to be scheduled. So
ondemand_object_worker() may be called after the info and object and even
the cache have been freed and trigger use-after-free. So use
cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the
reopen worker or wait for it to finish. Since it makes no sense to wait
for the daemon to complete the reopen request, to avoid this pointless
operation blocking cancel_work_sync(), Patch 1 avoids request generation
by the DROPPING state when the request has not been sent, and Patch 2
flushes the requests of the current object before cancel_work_sync().
Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading
the daemon and causing a hang.
Patch 9: Hold xas_lock during polling to avoid dereferencing reqs causing
use-after-free. This issue was triggered frequently in our tests, and we
found that anolis 5.10 had fixed it. So to avoid failing the test, this
patch is pushed upstream as well.
Baokun Li (7):
netfs, fscache: export fscache_put_volume() and add
fscache_try_get_volume()
cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
cachefiles: propagate errors from vfs_getxattr() to avoid infinite
loop
cachefiles: stop sending new request when dropping object
cachefiles: cancel all requests for the object that is being dropped
cachefiles: cyclic allocation of msg_id to avoid reuse
Hou Tao (1):
cachefiles: wait for ondemand_object_worker to finish when dropping
object
Jingbo Xu (1):
cachefiles: add missing lock protection when polling
fs/cachefiles/cache.c | 45 ++++++++++++++++++++++++++++-
fs/cachefiles/daemon.c | 4 +--
fs/cachefiles/internal.h | 3 ++
fs/cachefiles/ondemand.c | 52 ++++++++++++++++++++++++++++++----
fs/cachefiles/volume.c | 1 -
fs/cachefiles/xattr.c | 5 +++-
fs/netfs/fscache_volume.c | 14 +++++++++
fs/netfs/internal.h | 2 --
include/linux/fscache-cache.h | 6 ++++
include/trace/events/fscache.h | 4 +++
10 files changed, 123 insertions(+), 13 deletions(-)
Link: https://lore.kernel.org/r/20240628062930.2467993-1-libaokun@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
folio_file_pos is only needed for mixed usage of page cache and swap
cache; for pure page cache usage, the caller can just use folio_pos
instead.
It can't be a swap cache page here. Swap mapping may only call into fs
through swap_rw and that is not supported for netfs. So just drop it and
use folio_pos instead.
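A representative change of this shape (illustrative only, not the actual
hunk), relying on folio_pos() from linux/pagemap.h:

#include <linux/pagemap.h>

/* Illustrative only: in a pagecache-only path the file position of a folio
 * can be taken directly; no swapcache handling is needed, so the heavier
 * folio_file_pos() helper is unnecessary.
 */
static loff_t example_folio_start(struct folio *folio)
{
	return folio_pos(folio);	/* rather than folio_file_pos(folio) */
}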
Link: https://lkml.kernel.org/r/20240521175854.96038-7-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Export fscache_put_volume() and add fscache_try_get_volume()
helper function to allow cachefiles to get/put fscache_volume
via linux/fscache-cache.h.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240628062930.2467993-2-libaokun@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix netfs_page_mkwrite() to use filemap_fdatawrite_range(), not
filemap_fdatawait_range() to flush conflicting data.
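A minimal sketch of the corrected call (illustrative, not the exact hunk):

#include <linux/fs.h>

/* Illustrative: flushing conflicting data means starting writeback on the
 * range, not merely waiting for writeback that may never have been started.
 */
static int example_flush_conflicting(struct address_space *mapping,
				     loff_t start, loff_t end)
{
	return filemap_fdatawrite_range(mapping, start, end);
}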
Fixes: 102a7e2c598c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/614300.1719228243@warthog.procyon.org.uk
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix netfs_page_mkwrite() to check that folio->mapping is valid once it has
taken the folio lock (as filemap_page_mkwrite() does). Without this,
generic/247 occasionally oopses with something like the following:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
RIP: 0010:trace_event_raw_event_netfs_folio+0x61/0xc0
...
Call Trace:
<TASK>
? __die_body+0x1a/0x60
? page_fault_oops+0x6e/0xa0
? exc_page_fault+0xc2/0xe0
? asm_exc_page_fault+0x22/0x30
? trace_event_raw_event_netfs_folio+0x61/0xc0
trace_netfs_folio+0x39/0x40
netfs_page_mkwrite+0x14c/0x1d0
do_page_mkwrite+0x50/0x90
do_pte_missing+0x184/0x200
__handle_mm_fault+0x42d/0x500
handle_mm_fault+0x121/0x1f0
do_user_addr_fault+0x23e/0x3c0
exc_page_fault+0xc2/0xe0
asm_exc_page_fault+0x22/0x30
This is due to the invalidate_inode_pages2_range() issued at the end of the
DIO write interfering with the mmap'd writes.
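The added check follows the same pattern used by filemap_page_mkwrite(); a
hedged sketch (names illustrative):

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Illustrative sketch: after taking the folio lock, check the folio is
 * still attached to the expected mapping; if it was invalidated in the
 * meantime, back out and let the fault be retried.
 */
static vm_fault_t example_mkwrite_recheck(struct folio *folio,
					  struct address_space *mapping)
{
	folio_lock(folio);
	if (folio->mapping != mapping) {
		folio_unlock(folio);
		return VM_FAULT_NOPAGE;
	}
	return 0;
}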
Fixes: 102a7e2c598c ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/780211.1719318546@warthog.procyon.org.uk
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Delete some xarray-based buffer wangling functions that are intended for
use with bounce buffering, but aren't used because bounce-buffering got
deferred to a later patch series. Now, however, the intention is to use
something other than an xarray to do this.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240620173137.610345-9-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
During the writeback procedure, at the end of netfs_write_folio(), pending
write operations are flushed if the amount of write-streaming data stored
in the folio is less than the size of the folio, because if we haven't
modified a folio right up to its end, it cannot be contiguous with the
following folio... except when the dirty region of the folio runs right up
to the end of the folio space.
Fix the test to take the offset into the folio into account as well, such
that if the dirty region runs right up to the end of the folio, we leave
the flushing for later.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com> (DFS, global name space)
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240620173137.610345-4-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
[This was included in v2 of 9b038d004ce95551cb35381c49fe896c5bc11ffe, but
v1 got pushed instead]
Fix netfs_unbuffered_write_iter_locked() to set the total request length in
the netfs_io_request struct rather than leaving it as zero.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <stfrench@microsoft.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: Christian Brauner <brauner@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240620173137.610345-2-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Instead of inventing a custom way to conditionally enable debugging,
just make use of pr_debug(), which also has dynamic debugging facilities
and is more likely to be known to someone hunting a problem in the netfs
code. Also drop the module parameter netfs_debug, which had no effect
without further source changes. (The variable netfs_debug was only used
in #ifdef blocks for cpp variables that don't exist; note that
CONFIG_NETFS_DEBUG isn't settable via kconfig, a variable with that name
never existed in mainline, and it was probably just carried over (and
renamed) from similar custom debug-logging implementations.)
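For illustration, the resulting style and how it can be switched on at
runtime through the standard dynamic-debug interface:

#include <linux/printk.h>

/* Illustrative: pr_debug() is compiled out unless DEBUG or dynamic debug is
 * in effect, and with CONFIG_DYNAMIC_DEBUG it can be enabled at runtime,
 * e.g.:  echo 'module netfs +p' > /sys/kernel/debug/dynamic_debug/control
 */
static void example_debug_trace(unsigned long long start, size_t len)
{
	pr_debug("read: start=%llx len=%zx\n", start, len);
}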
Signed-off-by: Uwe Kleine-König <ukleinek@kernel.org>
Link: https://lore.kernel.org/r/20240608151352.22860-2-ukleinek@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix io_uring based write-through after converting cifs to use the
netfs library
- Fix aio error handling when doing write-through via netfs library
- Fix performance regression in iomap when used with non-large folio
mappings
- Fix signalfd error code
- Remove obsolete comment in signalfd code
- Fix async request indication in netfs_perform_write() by raising
BDP_ASYNC when IOCB_NOWAIT is set
- Yield swap device immediately to prevent spurious EBUSY errors
- Don't cross a .backup mountpoint from backup volumes in afs to avoid
infinite loops
- Fix a race between umount and async request completion in 9p after 9p
was converted to use the netfs library
* tag 'vfs-6.10-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs, 9p: Fix race between umount and async request completion
afs: Don't cross .backup mountpoint from backup volume
swap: yield device immediately
netfs: Fix setting of BDP_ASYNC from iocb flags
signalfd: drop an obsolete comment
signalfd: fix error return code
iomap: fault in smaller chunks for non-large folio mappings
filemap: add helper mapping_max_folio_size()
netfs: Fix AIO error handling when doing write-through
netfs: Fix io_uring based write-through
|
|
There's a problem in 9p's interaction with netfslib whereby a crash occurs
because the 9p_fid structs get forcibly destroyed during client teardown
(without paying attention to their refcounts) before netfslib has finished
with them. However, it's not a simple case of deferring the clunking that
p9_fid_put() does as that requires the p9_client record to still be
present.
The problem is that netfslib has to unlock pages and clear the IN_PROGRESS
flag before destroying the objects involved - including the fid - and, in
any case, nothing checks to see if writeback completed barring looking at
the page flags.
Fix this by keeping a count of outstanding I/O requests (of any type) and
waiting for it to quiesce during inode eviction.
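A minimal sketch of the quiescing scheme; the structure, field and function
names here are assumptions for illustration, not the exact ones added:

#include <linux/atomic.h>
#include <linux/wait_bit.h>

/* Illustrative only; names are assumptions. */
struct example_netfs_inode {
	atomic_t io_count;		/* outstanding I/O requests */
};

static void example_io_begin(struct example_netfs_inode *ctx)
{
	atomic_inc(&ctx->io_count);
}

static void example_io_end(struct example_netfs_inode *ctx)
{
	if (atomic_dec_and_test(&ctx->io_count))
		wake_up_var(&ctx->io_count);
}

static void example_quiesce(struct example_netfs_inode *ctx)
{
	/* Called from inode eviction to wait for outstanding I/O. */
	wait_var_event(&ctx->io_count, !atomic_read(&ctx->io_count));
}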
Reported-by: syzbot+df038d463cca332e8414@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/0000000000005be0aa061846f8d6@google.com/
Reported-by: syzbot+d7c7a495a5e466c031b6@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/000000000000b86c5e06130da9c6@google.com/
Reported-by: syzbot+1527696d41a634cc1819@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/000000000000041f960618206d7e@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/755891.1716560771@warthog.procyon.org.uk
Tested-by: syzbot+d7c7a495a5e466c031b6@syzkaller.appspotmail.com
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Hillf Danton <hdanton@sina.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Reported-and-tested-by: syzbot+d7c7a495a5e466c031b6@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix netfs_perform_write() to set BDP_ASYNC if IOCB_NOWAIT is set rather
than if IOCB_SYNC is not set. It reflects asynchronicity in the sense of
not waiting rather than synchronicity in the sense of not returning until
the op is complete.
Without this, generic/590 fails on cifs in strict caching mode with a
complaint that one of the writes fails with EAGAIN. The test can be
distilled down to:
mount -t cifs /my/share /mnt -ostuff
xfs_io -i -c 'falloc 0 8191M' -c fsync -f /mnt/file
xfs_io -i -c 'pwrite -b 1M -W 0 8191M' /mnt/file
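The substance of the fix is small; a hedged sketch of the flag selection
(illustrative, not the exact hunk):

#include <linux/fs.h>
#include <linux/writeback.h>

/* Illustrative: BDP_ASYNC means "don't sleep in dirty-page throttling",
 * which corresponds to IOCB_NOWAIT, not to the absence of IOCB_SYNC.
 */
static unsigned int example_bdp_flags(const struct kiocb *iocb)
{
	return (iocb->ki_flags & IOCB_NOWAIT) ? BDP_ASYNC : 0;
}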
Fixes: c38f4e96e605 ("netfs: Provide func to copy data to pagecache for buffered write")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/316306.1716306586@warthog.procyon.org.uk
Reviewed-by: Jens Axboe <axboe@kernel.dk>
cc: Jeff Layton <jlayton@kernel.org>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If an error occurs whilst we're doing an AIO write in write-through mode,
we may end up calling ->ki_complete() *and* returning an error from
->write_iter(). This can result in either a UAF (the ->ki_complete() func
pointer may get overwritten, for example) or a refcount underflow in
io_submit() as ->ki_complete is called twice.
Fix this by making netfs_end_writethrough() - and thus
netfs_perform_write() - unconditionally return -EIOCBQUEUED if we're doing
an AIO write and wait for completion if we're not.
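A rough sketch of the rule (illustrative, not the exact code):

#include <linux/fs.h>

/* Illustrative: for an async kiocb the result must be delivered exactly
 * once via ->ki_complete(), so the writethrough path returns -EIOCBQUEUED
 * rather than also returning an error; synchronous callers wait and get
 * the real result.
 */
static ssize_t example_writethrough_result(struct kiocb *iocb, ssize_t ret)
{
	if (!is_sync_kiocb(iocb))
		return -EIOCBQUEUED;
	return ret;
}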
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/295052.1716298587@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This can be triggered by mounting a cifs filesystem with a cache=strict
mount option and then, using the fsx program from xfstests, doing:
ltp/fsx -A -d -N 1000 -S 11463 -P /tmp /cifs-mount/foo \
--replay-ops=gen112-fsxops
Where gen112-fsxops holds:
fallocate 0x6be7 0x8fc5 0x377d3
copy_range 0x9c71 0x77e8 0x2edaf 0x377d3
write 0x2776d 0x8f65 0x377d3
The problem is that netfs_io_request::len is being used for two purposes
and ends up getting set to the amount of data we transferred, not the
amount of data the caller asked to be transferred (for various reasons,
such as mmap'd writes, we might end up rounding out the data written to the
server to include the entire folio at each end).
Fix this by keeping the amount we were asked to write in ->len and using
->submitted to track what we issued ops for. Then, when we come to calling
->ki_complete(), ->len is the right size.
This also required netfs_cleanup_dio_write() to change since we're no
longer advancing wreq->len. Use wreq->transferred instead as we might have
done a short read.
With this, the generic/112 xfstest passes if cifs is forced to put all
non-DIO opens into write-through mode.
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/295086.1716298663@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <stfrench@microsoft.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
With the changes to folios/netfs it is now easier to re-enable swapfile
support over SMB3, which fixes various xfstests.
Reviewed-by: David Howells <dhowells@redhat.com>
Suggested-by: David Howells <dhowells@redhat.com>
Fixes: e1209d3a7a67 ("mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space")
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix to take the i_rwsem (through the netfs locking wrappers) before taking
cinode->lock_sem.
Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib")
Reported-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Make the cifs filesystem use netfslib to handle reading and writing on
behalf of cifs. The changes include:
(1) Various read_iter/write_iter type functions are turned into wrappers
around netfslib API functions or are pointed directly at those
functions:
cifs_file_direct{,_nobrl}_ops switch to use
netfs_unbuffered_read_iter and netfs_unbuffered_write_iter.
Large pieces of code that are due to be removed are #if'd out here and
will be deleted in subsequent patches.
[?] Why does cifs mark the page dirty in the destination buffer of a DIO
read? Should that happen automatically? Does netfs need to do that?
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide implementation of the netfslib hooks that will be used by netfslib
to ask cifs to set up and perform operations. Of particular note are
(*) cifs_clamp_length() - This is used to negotiate the size of the next
subrequest in a read request, taking into account the credit available
and the rsize. The credits are attached to the subrequest.
(*) cifs_req_issue_read() - This is used to issue a subrequest that has
been set up and clamped.
(*) cifs_prepare_write() - This prepares to fill a subrequest by picking a
channel, reopening the file and requesting credits so that we can set
the maximum size of the subrequest and also sets the maximum number of
segments if we're doing RDMA.
(*) cifs_issue_write() - This releases any unneeded credits and issues an
asynchronous data write for the contiguous slice of the file covered by
the subrequest. This should possibly be folded into the individual
->async_writev() ops, which could then be called directly.
(*) cifs_begin_writeback() - This gets the cached writable handle through
which we do writeback (this does not affect writethrough, unbuffered
or direct writes).
At this point, cifs is not wired up to actually *use* netfslib; that will
be done in a subsequent patch.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Use a hook in the new writeback code's retry algorithm to rotate the keys
once all the outstanding subreqs have failed rather than doing it
separately on each subreq.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|
|
Do a couple of miscellaneous tidy ups:
(1) Add a qualifier into a file banner comment.
(2) Put the writeback folio traces back into alphabetical order.
(3) Remove some unused folio traces.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|
|
Remove the old writeback code.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|
|
Cut over to using the new writeback code. The old code is #ifdef'd out or
otherwise removed from compilation to avoid conflicts and will be removed
in a future patch.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|
|
Add some write-side stats to count buffered writes, buffered writethrough,
and writepages calls.
Whilst we're at it, clean up the naming on some of the existing stats
counters and organise the output into two sets.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|
|
The current netfslib writeback implementation creates writeback requests of
contiguous folio data and then separately tiles subrequests over the space
twice, once for the server and once for the cache. This creates a few
issues:
(1) Every time there's a discontiguity or a change between writing to only
one destination or writing to both, it must create a new request.
This makes it harder to do vectored writes.
(2) The folios don't have the writeback mark removed until the end of the
request - and a request could be hundreds of megabytes.
(3) In future, I want to support a larger cache granularity, which will
require aggregation of some folios that contain unmodified data (which
only need to go to the cache) and some which contain modifications
(which need to be uploaded and stored to the cache) - but, currently,
these are treated as discontiguous.
There's also a move to get everyone to use writeback_iter() to extract
writable folios from the pagecache. That said, currently writeback_iter()
has some issues that make it less than ideal:
(1) there's no way to cancel the iteration, even if you find a "temporary"
error that means the current folio and all subsequent folios are going
to fail;
(2) there's no way to filter the folios being written back - something
that will impact Ceph with its ordered snap system;
(3) and if you get a folio you can't immediately deal with (say you need
to flush the preceding writes), you are left with a folio hanging in
the locked state for the duration, when really we should unlock it and
relock it later.
In this new implementation, I use writeback_iter() to pump folios,
progressively creating two parallel, but separate streams and cleaning up
the finished folios as the subrequests complete. Either or both streams
can contain gaps, and the subrequests in each stream can be of variable
size, don't need to align with each other and don't need to align with the
folios.
Indeed, subrequests can cross folio boundaries, may cover several folios or
a folio may be spanned by multiple subrequests, e.g.:
+---+---+-----+-----+---+----------+
Folios: | | | | | | |
+---+---+-----+-----+---+----------+
+------+------+ +----+----+
Upload: | | |.....| | |
+------+------+ +----+----+
+------+------+------+------+------+
Cache: | | | | | |
+------+------+------+------+------+
The progressive subrequest construction permits the algorithm to be
preparing both the next upload to the server and the next write to the
cache whilst the previous ones are already in progress. Throttling can be
applied to control the rate of production of subrequests - and, in any
case, we probably want to write them to the server in ascending order,
particularly if the file will be extended.
Content crypto can also be prepared at the same time as the subrequests and
run asynchronously, with the prepped requests being stalled until the
crypto catches up with them. This might also be useful for transport
crypto, but that happens at a lower layer, so probably would be harder to
pull off.
The algorithm is split into three parts:
(1) The issuer. This walks through the data, packaging it up, encrypting
it and creating subrequests. The part of this that generates
subrequests only deals with file positions and spans and so is usable
for DIO/unbuffered writes as well as buffered writes.
(2) The collector. This asynchronously collects completed subrequests,
unlocks folios, frees crypto buffers and performs any retries. This
runs in a work queue so that the issuer can return to the caller for
writeback (so that the VM can have its kswapd thread back) or async
writes.
(3) The retryer. This pauses the issuer, waits for all outstanding
subrequests to complete and then goes through the failed subrequests
to reissue them. This may involve reprepping them (with cifs, the
credits must be renegotiated, and a subrequest may need splitting),
and doing RMW for content crypto if there's a conflicting change on
the server.
[!] Note that some of the functions are prefixed with "new_" to avoid
clashes with existing functions. These will be renamed in a later patch
that cuts over to the new algorithm.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|
|
Switch to using unsigned long long rather than loff_t in netfslib to avoid
problems with the sign flipping in the maths when we're dealing with the
byte at position 0x7fffffffffffffff.
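A small demonstration of the hazard (purely illustrative, not netfs code):

#include <linux/types.h>

/* Purely demonstrative: with a signed 64-bit loff_t, start + len can exceed
 * LLONG_MAX when a span touches byte 0x7fffffffffffffff and the comparison
 * misbehaves; with unsigned long long both the sum and the comparison stay
 * well defined.
 */
static bool example_span_reaches(unsigned long long start,
				 unsigned long long len,
				 unsigned long long point)
{
	return start + len > point;
}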
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
|
|
Use mempools for allocating requests and subrequests in an effort to make
sure that allocation always succeeds so that when performing writeback we
can always make progress.
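A hedged sketch of the pattern (cache and pool names are examples only):

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/netfs.h>
#include <linux/slab.h>

/* Illustrative: back request allocation with a mempool so that a minimum
 * number of objects is always reserved and writeback can make progress
 * even under memory pressure.
 */
static struct kmem_cache *example_request_slab;
static mempool_t *example_request_pool;

static int __init example_init_pools(void)
{
	example_request_slab = kmem_cache_create("example_netfs_request",
					sizeof(struct netfs_io_request),
					0, 0, NULL);
	if (!example_request_slab)
		return -ENOMEM;
	example_request_pool = mempool_create_slab_pool(4, example_request_slab);
	return example_request_pool ? 0 : -ENOMEM;
}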
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Remove support for ->launder_folio() from netfslib and expect filesystems
to use filemap_invalidate_inode() instead. netfs_launder_folio() can then
be got rid of.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <sfrench@samba.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: devel@lists.orangefs.org
|
|
Implement a replacement for launder_folio. The key feature of
invalidate_inode_pages2() is that it locks each folio individually, unmaps
it to prevent mmap'd accesses interfering and calls the ->launder_folio()
address_space op to flush it. This has problems: firstly, each folio is
written individually as one or more small writes; secondly, adjacent folios
cannot be added so easily into the laundry; thirdly, it's yet another op to
implement.
Instead, use the invalidate lock to cause anyone wanting to add a folio to
the inode to wait, then unmap all the folios if we have mmaps, then,
conditionally, use ->writepages() to flush any dirty data back and then
discard all pages.
The invalidate lock prevents ->read_iter(), ->write_iter() and faulting
through mmap all from adding pages for the duration.
This is then used from netfslib to handle the flushing in unbuffered and
direct writes.
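A hedged outline of the approach (the real helper, per the patch above, is
filemap_invalidate_inode() in mm; this is only a sketch of the sequence):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Illustrative outline: hold the invalidate lock so read/write/fault paths
 * cannot add folios, unmap any mmaps, optionally flush dirty data, then
 * discard the pagecache.
 */
static void example_invalidate_inode(struct inode *inode, bool flush)
{
	struct address_space *mapping = inode->i_mapping;

	filemap_invalidate_lock(mapping);
	if (mapping_mapped(mapping))
		unmap_mapping_pages(mapping, 0, ULONG_MAX, false);
	if (flush)
		filemap_fdatawrite(mapping);
	truncate_inode_pages(mapping, 0);
	filemap_invalidate_unlock(mapping);
}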
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christian Brauner <brauner@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: devel@lists.orangefs.org
|
|
Use the subreq_counter in netfs_io_request to allocate subrequest
debug_index values in read ops as well as write ops.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|
|
Make the netfs_io_request::subreq_counter, used to generate values for
netfs_io_subrequest::debug_index, into an atomic_t so that it can be called
from the retry thread at the same time as the app thread issuing writes.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
|
|
Remove the deprecated use of PG_private_2 in netfslib.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Remove the PG_fscache alias for PG_private_2 and use the latter directly.
Use of this flag for marking pages undergoing writing to the cache should
be considered deprecated and the folios should be marked dirty instead and
the write done in ->writepages().
Note that PG_private_2 itself should be considered deprecated and up for
future removal by the MM folks too.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: Bharath SM <bharathsm@microsoft.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
When dirty data is being written to the cache, setting/waiting on/clearing
the fscache flag is always done in tandem with setting/waiting on/clearing
the writeback flag. The netfslib buffered write routines wait on and set
both flags and the write request cleanup clears both flags, so the fscache
flag is almost superfluous.
The reason it isn't superfluous is because the fscache flag is also used to
indicate that data just read from the server is being written to the cache.
The flag is used to prevent a race involving overlapping direct-I/O writes
to the cache.
Change this to indicate that a page is in need of being copied to the cache
by placing a magic value in folio->private and marking the folios dirty.
Then when the writeback code sees a folio marked in this way, it only
writes it to the cache and not to the server.
If a folio that has this magic value set is modified, the value is just
replaced and the folio will then be uploaded too.
With this, PG_fscache is no longer required by the netfslib core, 9p and
afs.
Ceph and nfs, however, still need to use the old PG_fscache-based tracking.
To deal with this, a flag, NETFS_ICTX_USE_PGPRIV2, now has to be set on the
flags in the netfs_inode struct for those filesystems. This reenables the
use of PG_fscache in that inode. 9p and afs use the netfslib write helpers
so get switched over; cifs, for the moment, does page-by-page manual access
to the cache, so doesn't use PG_fscache and is unaffected.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: Bharath SM <bharathsm@microsoft.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Update i_blocks alongside i_size when we finish making a write to the
pagecache, to reflect the amount of space we think will be consumed.
This maintains cifs commit dbfdff402d89854126658376cbcb08363194d3cd ("smb3:
update allocation size more accurately on write completion") which would
otherwise be removed by the cifs part of the netfs writeback rewrite.
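A rough sketch of the idea (illustrative only; the exact rounding policy is
up to the filesystem):

#include <linux/fs.h>
#include <linux/kernel.h>

/* Illustrative: i_blocks is counted in 512-byte units, so keep it at least
 * in step with the new i_size after a pagecache write; never shrink it
 * here.
 */
static void example_update_i_blocks(struct inode *inode, loff_t new_size)
{
	inode->i_blocks = max_t(blkcnt_t, inode->i_blocks,
				DIV_ROUND_UP(new_size, 512));
}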
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
In netfs_perform_write(), when the file is marked NETFS_ICTX_WRITETHROUGH
or O_*SYNC or RWF_*SYNC was specified, write-through caching is performed
on a buffered file. When setting up for write-through, we flush any
conflicting writes in the region and wait for the write to complete,
failing if there's a write error to return.
The issue arises if we're writing at or above the EOF position because we
skip the flush and - more importantly - the wait. This becomes a problem
if there's a partial folio at the end of the file that is being written out
and we want to make a write to it too. Both the already-running write and
the write we start want to clear the writeback mark, but whoever is
second causes a warning looking something like:
------------[ cut here ]------------
R=00000012: folio 11 is not under writeback
WARNING: CPU: 34 PID: 654 at fs/netfs/write_collect.c:105
...
CPU: 34 PID: 654 Comm: kworker/u386:27 Tainted: G S ...
...
Workqueue: events_unbound netfs_write_collection_worker
...
RIP: 0010:netfs_writeback_lookup_folio
Fix this by making the flush-and-wait unconditional. It will do nothing if
there are no folios in the pagecache and will return quickly if there are
no folios in the region specified.
Further, move the WBC attachment above the flush call as the flush is going
to attach a WBC and detach it again if it is not present - and since we
need one anyway we might as well share it.
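The core of the change amounts to always doing something like the following
before starting the write-through (illustrative sketch, not the exact code):

#include <linux/fs.h>

/* Illustrative: flush and wait on the target range unconditionally before
 * starting a write-through write.  If there are no folios in the range
 * this returns quickly, and it guarantees no folio still carries the
 * writeback mark from an earlier write.
 */
static int example_flush_region(struct address_space *mapping,
				loff_t start, loff_t end)
{
	return filemap_write_and_wait_range(mapping, start, end);
}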
Fixes: 41d8e7673a77 ("netfs: Implement a write-through caching option")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202404161031.468b84f-oliver.sang@intel.com
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/2150448.1714130115@warthog.procyon.org.uk
Reviewed-by: Jeffrey Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix the error return in netfs_perform_write() acting in writethrough-mode
to return any cached error in the case that netfs_end_writethrough()
returns 0.
This can affect the use of O_SYNC/O_DSYNC/RWF_SYNC/RWF_DSYNC in 9p and afs.
Fixes: 41d8e7673a77 ("netfs: Implement a write-through caching option")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/6736.1713343639@warthog.procyon.org.uk
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix fscache_begin_operation() to clear cres->cache_priv on error, otherwise
fscache_resources_valid() will report it as being valid.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3933237.1710514106@warthog.procyon.org.uk
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reported-by: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix netfs_unbuffered_write_iter() to return immediately if
generic_write_checks() returns 0, indicating there's nothing to write.
Note that netfs_file_write_iter() already does this.
Also, whilst we're at it, put in checks for the size being zero before we
even take the locks. Note that generic_write_checks() can still reduce the
size to zero, so we still need that check.
Without this, a warning similar to the following is logged to dmesg:
netfs: Zero-sized write [R=1b6da]
and the syscall fails with EIO, e.g.:
/sbin/ldconfig.real: Writing of cache extension data failed: Input/output error
This can be reproduced on 9p by:
xfs_io -f -c 'pwrite 0 0' /xfstest.test/foo
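A minimal sketch of the shape of the checks (illustrative; the real function
also takes locks between the two tests):

#include <linux/fs.h>
#include <linux/uio.h>

/* Illustrative sketch only. */
static ssize_t example_write_prologue(struct kiocb *iocb, struct iov_iter *from)
{
	if (!iov_iter_count(from))
		return 0;	/* nothing to write: return before locking */
	return generic_write_checks(iocb, from);	/* may also reduce to 0 */
}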
Fixes: 153a9961b551 ("netfs: Implement unbuffered/DIO write support")
Reported-by: Eric Van Hensbergen <ericvh@kernel.org>
Link: https://lore.kernel.org/r/ZbQUU6QKmIftKsmo@FV7GG9FTHL/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240129094924.1221977-3-dhowells@redhat.com
Tested-by: Dominique Martinet <asmadeus@codewreck.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: <v9fs@lists.linux.dev>
cc: <linux_oss@crudebyte.com>
cc: <netfs@lists.linux.dev>
cc: <linux-fsdevel@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If netfs_begin_read gets a NETFS_DIO_READ request that begins
past i_size, it won't perform any i/o and just return 0. This
will leak an increment to i_dio_count that is done at the top
of the function.
This can cause subsequent buffered read requests to block indefinitely,
waiting for a non-existent dio operation to complete.
Add an inode_dio_end() call for the NETFS_DIO_READ case, before returning.
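A hedged sketch of the early-return path (illustrative):

#include <linux/fs.h>

/* Illustrative: if no I/O will be issued because the DIO read starts at or
 * beyond i_size, drop the i_dio_count reference taken at the start before
 * returning, so later buffered I/O is not blocked forever.
 */
static int example_dio_read_past_eof(struct inode *inode)
{
	inode_dio_end(inode);
	return 0;
}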
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240129094924.1221977-2-dhowells@redhat.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: <linux-afs@lists.infradead.org>
cc: <netfs@lists.linux.dev>
cc: <linux-fsdevel@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The netfs_grab_folio_for_write() function doesn't return NULL, it returns
error pointers. Update the check accordingly.
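The corrected pattern, as an illustrative fragment:

#include <linux/err.h>
#include <linux/pagemap.h>

/* Illustrative pattern only: a helper that returns ERR_PTR()-encoded
 * pointers is checked with IS_ERR(), never compared against NULL.
 */
static int example_check_grabbed_folio(struct folio *folio)
{
	if (IS_ERR(folio))
		return PTR_ERR(folio);
	return 0;
}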
Fixes: c38f4e96e605 ("netfs: Provide func to copy data to pagecache for buffered write")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/29fb1310-8e2d-47ba-b68d-40354eb7b896@moroto.mountain/
|
|
This function dereferences "cache" and then checks if it's
IS_ERR_OR_NULL(). Check first, then dereference.
Fixes: 9549332df4ed ("fscache: Implement cache registration")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/e84bc740-3502-4f16-982a-a40d5676615c@moroto.mountain/ # v2
|
|
Filesystems should use folio->index and folio->mapping, instead of
folio_index(folio), folio_mapping() and folio_file_mapping() since
they know that it's in the pagecache.
Change this automagically with:
perl -p -i -e 's/folio_mapping[(]([^)]*)[)]/\1->mapping/g' fs/netfs/*.c
perl -p -i -e 's/folio_file_mapping[(]([^)]*)[)]/\1->mapping/g' fs/netfs/*.c
perl -p -i -e 's/folio_index[(]([^)]*)[)]/\1->index/g' fs/netfs/*.c
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-erofs@lists.ozlabs.org
cc: linux-fsdevel@vger.kernel.org
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"This extends the netfs helper library that network filesystems can use
to replace their own implementations. Both afs and 9p are ported. cifs
is ready as well but the patches are way bigger and will be routed
separately once this is merged. That will remove lots of code as well.
The overall goal is to get high-level I/O and knowledge of the page
cache out of the filesystem drivers. This includes knowledge about
the existence of pages and folios.
The pull request converts afs and 9p. This removes about 800 lines of
code from afs and 300 from 9p. For 9p it is now possible to do writes
in larger than a page chunks. Additionally, multipage folio support
can be turned on for 9p. Separate patches exist for cifs removing
another 2000+ lines. I've included detailed information in the
individual pulls I took.
Summary:
- Add NFS-style (and Ceph-style) locking around DIO vs buffered I/O
calls to prevent these from happening at the same time.
- Support for direct and unbuffered I/O.
- Support for write-through caching in the page cache.
- O_*SYNC and RWF_*SYNC writes use write-through rather than writing
to the page cache and then flushing afterwards.
- Support for write-streaming.
- Support for write grouping.
- Skip reads for which the server could only return zeros or EOF.
- The fscache module is now part of the netfs library and the
corresponding maintainer entry is updated.
- Some helpers from the fscache subsystem are renamed to mark them as
belonging to the netfs library.
- Follow-up fixes for the netfs library.
- Follow-up fixes for the 9p conversion"
* tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (50 commits)
netfs: Fix wrong #ifdef hiding wait
cachefiles: Fix signed/unsigned mixup
netfs: Fix the loop that unmarks folios after writing to the cache
netfs: Fix interaction between write-streaming and cachefiles culling
netfs: Count DIO writes
netfs: Mark netfs_unbuffered_write_iter_locked() static
netfs: Fix proc/fs/fscache symlink to point to "netfs" not "../netfs"
netfs: Rearrange netfs_io_subrequest to put request pointer first
9p: Use length of data written to the server in preference to error
9p: Do a couple of cleanups
9p: Fix initialisation of netfs_inode for 9p
cachefiles: Fix __cachefiles_prepare_write()
9p: Use netfslib read/write_iter
afs: Use the netfs write helpers
netfs: Export the netfs_sreq tracepoint
netfs: Optimise away reads above the point at which there can be no data
netfs: Implement a write-through caching option
netfs: Provide a launder_folio implementation
netfs: Provide a writepages implementation
netfs, cachefiles: Pass upper bound length to allow expansion
...
|
|
netfs_writepages_begin() has the wait on the fscache folio conditional on
CONFIG_NETFS_FSCACHE - which doesn't exist.
Fix it to be conditional on CONFIG_FSCACHE instead.
Fixes: 62c3b7481b9a ("netfs: Provide a writepages implementation")
Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20240109083257.GK132648@kernel.org/
|
|
In the loop in netfs_rreq_unmark_after_write() that removes the PG_fscache
from folios after they've been written to the cache, as soon as we remove
the mark from a multipage folio, it can get split - and then we might see a
fragment of folio again.
Guard against this by advancing the 'unlocked' tracker to the index of the
last page in the folio to avoid a double removal of the PG_fscache mark.
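A hedged sketch of the idea (illustrative):

#include <linux/pagemap.h>

/* Illustrative: once the mark is cleared a large folio may be split and
 * fragments of it seen again, so note that the whole folio has been
 * handled before clearing the mark.
 */
static pgoff_t example_last_page_of(struct folio *folio)
{
	return folio_next_index(folio) - 1;
}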
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
An issue can occur between write-streaming (storing dirty data in partial
non-uptodate pages) and a cachefiles object being culled to make space.
The problem occurs because the cache object is only marked in use while
there are files open using it. Once it has been released, it can be culled
and the cookie marked disabled.
At this point, a streaming write is permitted to occur (if the cache is
active, we require pages to be prefetched and cached), but the cache can
become active again before this gets flushed out - and then two effects can
occur:
(1) The cache may be asked to write out a region that's less than its DIO
block size (assumed by cachefiles to be PAGE_SIZE) - and this causes
one of two debugging statements to be emitted.
(2) netfs_how_to_modify() gets confused because it sees a page that isn't
allowed to be non-uptodate being uptodate and tries to prefetch it -
leading to a warning that PG_fscache is set twice.
Fix this by the following means:
(1) Add a netfs_inode flag to disallow write-streaming to an inode and set
it if we ever do local caching of that inode. It remains set for the
lifetime of that inode - even if the cookie becomes disabled.
(2) If the no-write-streaming flag is set, then make netfs_how_to_modify()
always want to prefetch instead.
(3) If netfs_how_to_modify() decides it wants to prefetch a folio, but
that folio has write-streamed data in it, then it requires the folio
be flushed first.
(4) Export a counter of the number of times we wanted to prefetch a
non-uptodate page, but found it had write-streamed data in it.
(5) Export a counter of the number of times we cancelled a write to the
cache because it didn't DIO align and remove the debug statements.
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-erofs@lists.ozlabs.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a counter for DIO writes to match that for DIO reads.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Mark netfs_unbuffered_write_iter_locked() static as it's only called from
the file in which it is defined.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Fix the proc/fs/fscache symlink to point to "netfs" not "../netfs".
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Christian Brauner <christian@brauner.io>
cc: linux-fsdevel@vger.kernel.org
cc: linux-cachefs@redhat.com
|
|
Export the netfs_sreq tracepoint so that it can be called directly from
client filesystems/cache backend modules.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Track the file position above which the server is not expected to have any
data (the "zero point") and preemptively assume that we can satisfy
requests by filling them with zeroes locally rather than attempting to
download them if they're over that line - even if we've written data back
to the server. Assume that any data that was written back above that
position is held in the local cache. Note that we have to split requests
that straddle the line.
Make use of this to optimise away some reads from the server. We need to
set the zero point in the following circumstances:
(1) When we see an extant remote inode and have no cache for it, we set
the zero_point to i_size.
(2) On local inode creation, we set zero_point to 0.
(3) On local truncation down, we reduce zero_point to the new i_size if
the new i_size is lower.
(4) On local truncation up, we don't change zero_point.
(5) On local modification, we don't change zero_point.
(6) On remote invalidation, we set zero_point to the new i_size.
(7) If stored data is discarded from the pagecache or culled from fscache,
we must set zero_point above that if the data also got written to the
server.
(8) If dirty data is written back to the server, but not fscache, we must
set zero_point above that.
(9) If a direct I/O write is made, set zero_point above that.
Assuming the above, any read from the server at or above the zero_point
position will return all zeroes.
The zero_point value can be stored in the cache, provided the above rules
are applied to it by any code that culls part of the local cache.
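A minimal sketch of the read-side test this enables (illustrative; the
zero_point field is the per-inode value described above):

#include <linux/netfs.h>

/* Illustrative: a read that starts at or above the zero point can be
 * satisfied locally with zeroes rather than being sent to the server.
 */
static bool example_read_needs_download(struct netfs_inode *ctx,
					unsigned long long start)
{
	return start < ctx->zero_point;
}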
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a flag whereby a filesystem may request that netfs_perform_write()
perform write-through caching. This involves putting pages directly into
writeback rather than dirtying them, and attaching them to a write
operation as we go.
Further, the writes being made are limited to the byte range being written
rather than whole folios being written. This can be used by cifs, for
example, to deal with strict byte-range locking.
This can't be used with content encryption as that may require expansion of
the write RPC beyond the write being made.
This doesn't affect writes via mmap - those are written back in the normal
way; similarly failed writethrough writes are marked dirty and left to
writeback to retry. Another option would be to simply invalidate them, but
the contents can be simultaneously accessed by read() and through mmap.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a launder_folio implementation for netfslib.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide an implementation of writepages for network filesystems to delegate
to.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Make netfslib pass the maximum length to the ->prepare_write() op to tell
the cache how much it can expand the length of a write to. This allows a
write to the server at the end of a file to be limited to a few bytes
whilst writing an entire block to the cache (something required by direct
I/O).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a top-level-ish function that can be pointed to directly by
->read_iter file op.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide an entry point to delegate a filesystem's ->page_mkwrite() to.
This checks for conflicting writes, then attaches any netfs-specific group
marking (e.g. a ceph snap) to the folio being marked dirty.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Institute a netfs write helper, netfs_file_write_iter(), to be pointed at
by the network filesystem ->write_iter() call. Make it handle buffered
writes by calling the previously defined netfs_perform_write() to copy the
source data into the pagecache.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Implement support for unbuffered writes and direct I/O writes. If the
write is misaligned with respect to the fscrypt block size, then RMW cycles
are performed if necessary. DIO writes are a special case of unbuffered
writes with extra restrictions imposed, such as block size alignment
requirements.
Also provide a field that can tell the code to add some extra space onto
the bounce buffer for use by the filesystem in the case of a
content-encrypted file.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Implement support for unbuffered and DIO reads in the netfs library,
utilising the existing read helper code to do block splitting and
individual queuing. The code also handles extraction of the destination
buffer from the supplied iterator, allowing async unbuffered reads to take
place.
The read will be split up according to the rsize setting and, if supplied,
the ->clamp_length() method. Note that the next subrequest will be issued
as soon as issue_op returns, without waiting for previous ones to finish.
The network filesystem needs to pause or handle queuing them if it doesn't
want to fire them all at the server simultaneously.
Once all the subrequests have finished, the state will be assessed and the
amount of data to be indicated as having been obtained will be
determined. As the subrequests may finish in any order, if an intermediate
subrequest is short, any further subrequests may be copied into the buffer
and then abandoned.
In the future, this will also take care of doing an unbuffered read from
encrypted content, with the decryption being done by the library.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Allocate a multipage folio when copying data into the pagecache if possible
if there's sufficient data to warrant it.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
netfs_read_folio() needs to handle partially-valid pages that are marked
dirty, but not uptodate, in the event that someone tries to read a page
that was used to cache data by a streaming write.
In such a case, make netfs_read_folio() set up a bvec iterator that points
to the parts of the folio that need filling and to a sink page for the data
that should be discarded and use that instead of i_pages as the iterator to
be written to.
This requires netfs_rreq_unlock_folios() to convert the page into a normal
dirty uptodate page, getting rid of the partial write record and bumping
the group pointer over to folio->private.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide a netfs write helper, netfs_perform_write() to buffer data to be
written in the pagecache and mark the modified folios dirty.
It will perform "streaming writes" for folios that aren't currently
resident, if possible, storing data in partially modified folios that are
marked dirty, but not uptodate. It will also tag pages as belonging to
fs-specific write groups if so directed by the filesystem.
This is derived from generic_perform_write(), but doesn't use
->write_begin() and ->write_end(), having that logic rolled in instead.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Dispatch one or more write requests to process a writeback slice, where a
slice is tailored more to logical block divisions within the file (such as
crypto blocks, an object layout or cache granules) than the protocol RPC
maximum capacity.
The dispatch doesn't happen until throttling allows, at which point the
entire writeback slice is processed and queued. A slice may be written to
multiple destinations (one or more servers and the local cache) and the
writes to each destination might be split up along different lines.
The writeback slice holds the required folios pinned. An iov_iter is
provided in netfs_write_request that describes the buffer to be used. This
may be part of the pagecache, may have auxiliary padding pages attached or
may be a bounce buffer resulting from crypto or compression. Consequently,
the filesystem must not twiddle the folio markings directly.
The following API is available to the filesystem:
(1) The ->create_write_requests() method is called to ask the filesystem
to create the requests it needs. This is passed the writeback slice
to be processed.
(2) The filesystem should then call netfs_create_write_request() to create
the requests it needs.
(3) Once a request is initialised, netfs_queue_write_request() can be
called to dispatch it asynchronously, if not completed immediately.
(4) netfs_write_request_completed() should be called to note the
completion of a request.
(5) netfs_get_write_request() and netfs_put_write_request() are provided
to refcount a request. These take constants from the netfs_wreq_trace
enum for logging into ftrace.
(6) The ->free_write_request() method is called to ask the filesystem to
clean up a request.
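A hedged sketch of steps (1) to (3); only the function names come from the
list above, while the prototypes, NETFS_UPLOAD_TO_SERVER and
myfs_write_worker are assumptions:

static void myfs_create_write_requests(struct netfs_io_request *wreq,
                                       loff_t start, size_t len)
{
        struct netfs_io_subrequest *subreq;

        /* (2) Create a write request covering this part of the slice. */
        subreq = netfs_create_write_request(wreq, NETFS_UPLOAD_TO_SERVER,
                                            start, len, myfs_write_worker);
        if (!subreq)
                return;

        /* (3) Dispatch it asynchronously; the worker later calls
         * netfs_write_request_completed() to note completion (step 4).
         */
        netfs_queue_write_request(subreq);
}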
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Prepare to use folio->private to hold information for write grouping and
streaming write. These are implemented in the same commit as they both
make use of folio->private and will both be checked at the same time in
several places.
"Write grouping" involves ordering the writeback of groups of writes, such
as is needed for ceph snaps. A group is represented by a
filesystem-supplied object which must contain a netfs_group struct. This
contains just a refcount and a pointer to a destructor.
"Streaming write" is the storage of data in folios that are marked dirty,
but not uptodate, to avoid unnecessary reads of data. This is represented
by a netfs_folio struct. This contains the offset and length of the
modified region plus the otherwise displaced write grouping pointer.
The way folio->private is multiplexed is as follows (see the sketch below):
(1) If private is NULL then neither is in operation on a dirty folio.
(2) If private is set, with bit 0 clear, then this points to a group.
(3) If private is set, with bit 0 set, then this points to a netfs_folio
struct (with bit 0 AND'ed out).
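A decoding sketch; the NETFS_FOLIO_INFO constant and helper name are
assumptions used only to illustrate the bit-0 tagging:

#define NETFS_FOLIO_INFO 0x1UL  /* folio->private points to a netfs_folio */

static struct netfs_folio *myfs_folio_info(struct folio *folio)
{
        unsigned long priv = (unsigned long)folio_get_private(folio);

        /* Case (3): bit 0 set - this is a streaming-write record. */
        if (priv & NETFS_FOLIO_INFO)
                return (struct netfs_folio *)(priv & ~NETFS_FOLIO_INFO);
        /* Cases (1) and (2): NULL or a pointer to a netfs_group. */
        return NULL;
}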
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Make the refcounting of netfs_begin_read() easier to use by not eating the
caller's ref on the netfs_io_request it's given. This makes it easier to
use when we need to look at the request struct afterwards.
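A sketch of the resulting convention (the trace constant is illustrative):

ret = netfs_begin_read(rreq, true /* synchronous */);
/* The caller still holds its ref, so rreq->error, rreq->transferred and
 * friends can be inspected here before the ref is dropped explicitly. */
netfs_put_request(rreq, false, netfs_rreq_trace_put_return);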
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Make netfs_put_request() just return if given a NULL request pointer.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Modify the netfs_io_request struct to act as a point around which writes
can be coordinated. It represents and pins a range of pages that need
writing and a list of regions of dirty data in that range of pages.
If RMW is required, the original data can be downloaded into the bounce
buffer, decrypted if necessary, the modifications made, then the modified
data can be reencrypted/recompressed and sent back to the server.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Limit a subrequest to a maximum size and/or a maximum number of contiguous
physical regions. This permits, for instance, a subreq's iterator to be
limited to the number of DMA'able segments that a large RDMA request can
handle.
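A hedged sketch of a filesystem applying such limits from its
->clamp_length() hook; max_nr_segs and the MYFS_* constants are assumptions:

static bool myfs_clamp_length(struct netfs_io_subrequest *subreq)
{
        /* Cap the byte count at what one RPC can carry... */
        subreq->len = min_t(size_t, subreq->len, MYFS_MAX_RPC_PAYLOAD);
        /* ...and the buffer at the number of RDMA segments supported. */
        subreq->max_nr_segs = MYFS_MAX_RDMA_SEGS;
        return true;
}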
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a function to work out how much of an ITER_BVEC or ITER_XARRAY iterator
we can use in a pagecount-limited and size-limited span. This will be
used, for example, to limit the number of segments in a subrequest to the
maximum number of elements that an RDMA transfer can handle.
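A usage sketch; the helper's name and prototype here (netfs_limit_iter) are
assumptions based on the description:

/* How much of the subrequest's buffer fits in one transfer, limited both
 * by byte count and by the RDMA segment count? */
size_t usable = netfs_limit_iter(&subreq->io_iter, 0, subreq->len,
                                 rdma_max_segs);
subreq->len = usable;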
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide tools to create a buffer in an xarray, with a function to add new
folios with a mark. This will be used to create a bounce buffer and can more
easily be used to create a list of folios whose span would require more than
a page's worth of bio_vec structs.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a bvec array pointer and an iterator to netfs_io_request for either
holding a copy of a DIO iterator or a list of all the bits of buffer
pointed to by a DIO iterator.
There are two problems: Firstly, if an iovec-class iov_iter is passed to
->read_iter() or ->write_iter(), this cannot be passed directly to
kernel_sendmsg() or kernel_recvmsg() as that may cause locking recursion if
a fault is generated, so we need to keep track of the pages involved
separately.
Secondly, if the I/O is asynchronous, we must copy the iov_iter describing
the buffer before returning to the caller as it may be immediately
deallocated.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add three iov_iter structs:
(1) Add an iov_iter (->iter) to the I/O request to describe the
unencrypted-side buffer.
(2) Add an iov_iter (->io_iter) to the I/O request to describe the
encrypted-side I/O buffer. This may be a different size to the buffer
in (1).
(3) Add an iov_iter (->io_iter) to the I/O subrequest to describe the part
of the I/O buffer for that subrequest.
This will allow future patches to point to a bounce buffer instead for
purposes of handling oversize writes, decryption (where we want to save the
encrypted data to the cache) and decompression.
These iov_iters persist for the lifetime of the (sub)request, and so can be
accessed multiple times without worrying about them being deallocated upon
return to the caller.
The network filesystem must appropriately advance the iterator before
terminating the request.
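For example, a transport might do the following when a transfer finishes
(netfs_subreq_terminated() being the existing completion call):

/* Consume what was transferred before signalling completion; the iterator
 * persists for the life of the subrequest, so this must be done by the
 * network filesystem. */
iov_iter_advance(&subreq->io_iter, transferred);
netfs_subreq_terminated(subreq, transferred, false /* not async */);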
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Borrow NFS's direct-vs-buffered I/O locking into netfslib. Similar code is
also used in ceph.
Modify it to have the correct checker annotations for i_rwsem lock
acquisition/release and to return -ERESTARTSYS if waits are interrupted.
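An illustrative buffered read path, assuming the borrowed helpers keep
their NFS-style names (netfs_start_io_read()/netfs_end_io_read()):

static ssize_t myfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        /* Excludes concurrent direct I/O; may return -ERESTARTSYS. */
        ret = netfs_start_io_read(inode);
        if (ret < 0)
                return ret;
        ret = generic_file_read_iter(iocb, to);
        netfs_end_io_read(inode);
        return ret;
}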
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Provide default invalidate_folio and release_folio calls. These will need
to interact with invalidation correctly at some point. They will be needed
if netfslib is to make use of folio->private for its own purposes.
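Sketch of the intended wiring in a filesystem's address_space_operations
(function names assumed from the description; other ops elided):

static const struct address_space_operations myfs_aops = {
        .read_folio       = netfs_read_folio,
        .invalidate_folio = netfs_invalidate_folio,
        .release_folio    = netfs_release_folio,
};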
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a ->free_subrequest() op so that the netfs can clean up data attached
to a subrequest.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Allow the network filesystem to specify extra space to be allocated at the
end of the I/O (sub)request. This allows cifs, for example, to use this
space rather than allocating its own cifs_readdata struct.
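A hedged sketch of how a filesystem could use this; the io_subrequest_size
field name is an assumption:

struct myfs_io_subrequest {
        struct netfs_io_subrequest subreq;      /* must come first */
        unsigned int rpc_credits;               /* fs-private state follows */
};

static const struct netfs_request_ops myfs_req_ops = {
        .io_subrequest_size = sizeof(struct myfs_io_subrequest),
};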
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Add a procfile, /proc/fs/netfs/requests, to list in-progress netfslib I/O
requests.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Move the resource pinning-for-writeback from fscache code to netfslib code.
This is used to keep a cache backing object pinned whilst we have dirty
pages on the netfs inode in the pagecache such that VM writeback will be
able to reach it.
Whilst we're at it, switch the parameters of netfs_unpin_writeback() to
match ->write_inode() so that it can be used for that directly.
Note that this mechanism could be more generally useful than just for
network filesystems. Quite often they have to keep around other resources
(e.g. authentication tokens or network connections) until the writeback is
complete.
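With the parameters matching ->write_inode(), the helper can be plugged
straight into a filesystem's super_operations, e.g.:

static const struct super_operations myfs_super_ops = {
        .write_inode    = netfs_unpin_writeback,
};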
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
|
|
Rename /proc/fs/fscache to "netfs" and make a symlink from fscache to that.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Christian Brauner <christian@brauner.io>
cc: linux-fsdevel@vger.kernel.org
cc: linux-cachefs@redhat.com
|