|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4 |
|
| #
9e705016 |
| 16-Dec-2024 |
David Howells <[email protected]> |
afs: Add more tracepoints to do with tracking validity
Add wrappers to set and clear the callback promise and to mark a directory as invalidated, and add tracepoints to track these events:
(1) afs_cb_promise: Log when a callback promise is set on a vnode.
(2) afs_vnode_invalid: Log when the server's callback promise for a vnode is no longer valid and we need to refetch the vnode metadata.
(3) afs_dir_invalid: Log when the contents of a directory are marked invalid, requiring them to be refetched from the server and the cache to be invalidated.
and two tracepoints to record data version number management:
(4) afs_set_dv: Log when the DV is recorded on a vnode.
(5) afs_dv_mismatch: Log when the DV recorded on a vnode plus the expected delta for the operation does not match the DV we got back from the server.
Signed-off-by: David Howells <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
cc: Marc Dionne <[email protected]>
cc: [email protected]
Signed-off-by: Christian Brauner <[email protected]>
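As a rough userspace illustration of the wrapper idea described above, the sketch below funnels every change to the callback promise and the data version through small helpers that also emit a trace line. All names here (toy_vnode, toy_trace(), toy_set_cb_promise() and so on) are hypothetical stand-ins and not the kernel's actual functions or tracepoints; the "trace" is just a counter plus a line on stderr.

```c
#include <stdint.h>
#include <stdio.h>

#define TOY_NO_PROMISE INT64_MIN	/* assumption: sentinel for "no promise" */

/* Hypothetical stand-in for a vnode; only the fields used here. */
struct toy_vnode {
	int64_t cb_expires_at;		/* promise expiry, or TOY_NO_PROMISE */
	uint64_t data_version;		/* last DV recorded for the file */
	unsigned int trace_events;	/* number of trace lines emitted */
};

static void toy_trace(struct toy_vnode *v, const char *event)
{
	v->trace_events++;
	fprintf(stderr, "trace: %s\n", event);
}

/* Wrapper: set the promise and log it in one place. */
static void toy_set_cb_promise(struct toy_vnode *v, int64_t expires_at)
{
	v->cb_expires_at = expires_at;
	toy_trace(v, "cb_promise");
}

/* Wrapper: drop the promise and log that the vnode needs refetching. */
static void toy_clear_cb_promise(struct toy_vnode *v)
{
	v->cb_expires_at = TOY_NO_PROMISE;
	toy_trace(v, "vnode_invalid");
}

/*
 * Record the DV returned by the server.  Returns 1 if it equals the
 * recorded DV plus the delta expected for the operation, 0 otherwise
 * (in which case a mismatch event is logged first).
 */
static int toy_check_dv(struct toy_vnode *v, uint64_t expected_delta,
			uint64_t server_dv)
{
	int ok = (v->data_version + expected_delta == server_dv);

	if (!ok)
		toy_trace(v, "dv_mismatch");
	v->data_version = server_dv;
	toy_trace(v, "set_dv");
	return ok;
}
```

Keeping the state change and the trace call in the same helper means no call site can update the promise or the DV without leaving a record.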
|
|
Revision tags: v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1 |
|
| #
453924de |
| 08-Nov-2023 |
David Howells <[email protected]> |
afs: Overhaul invalidation handling to better support RO volumes
Overhaul the third party-induced invalidation handling, making use of the previously added volume-level event counters (cb_scrub and cb_ro_snapshot) that are now being parsed out of the VolSync record returned by the fileserver in many of its replies.
This allows better handling of RO (and Backup) volumes. Since these are snapshots of a RW volume that are updated atomically and simultaneously across all servers that host them, they only require a single callback promise for the entire volume. The current upstream code assumes that RO volumes operate in the same manner as RW volumes, and that each file has its own individual callback - which means that it does a status fetch for *every* file in a RO volume, whether or not the volume got "released" (volume callback breaks can occur for other reasons too, such as the volumeserver taking ownership of a volume from a fileserver).
To this end, make the following changes:
(1) Change the meaning of the volume's cb_v_break counter so that it is now a hint that we need to issue a status fetch to work out the state of a volume. cb_v_break is incremented by volume break callbacks and by server initialisation callbacks.
(2) Add a second counter, cb_v_check, to the afs_volume struct such that if this differs from cb_v_break, we need to do a check. When the check is complete, cb_v_check is advanced to what cb_v_break was at the start of the status fetch.
(3) Move the list of mmap'd vnodes to the volume and trigger removal of PTEs that map to files on a volume break rather than on a server break.
(4) When a server reinitialisation callback comes in, use the server-to-volume reverse mapping added in a preceding patch to iterate over all the volumes using that server and clear the volume callback promises for that server and the general volume promise as a whole to trigger reanalysis.
(5) Replace the AFS_VNODE_CB_PROMISED flag with an AFS_NO_CB_PROMISE (TIME64_MIN) value in the cb_expires_at field, reducing the number of checks we need to make.
(6) Change afs_check_validity() to quickly see if various event counters have been incremented or if the vnode or volume callback promise is due to expire/has expired without making any changes to the state. That is now left to afs_validate() as this may get more complicated in future as we may have to examine server records too.
(7) Overhaul afs_validate() so that it does a single status fetch if we need to check the state of either the vnode or the volume - and do so under appropriate locking. The function does the following steps:
(A) If the vnode/volume is no longer seen as valid, then we take the vnode validation lock and, if the volume promise has expired, the volume check lock also. The latter prevents redundant checks being made to find out if a new version of the volume got released.
(B) If a previous RPC call found that the volsync changed unexpectedly or that a RO volume was updated, then we unmap all PTEs pointing to the file to stop mmap being used for access.
(C) If the vnode is still seen to be of uncertain validity, then we perform an FS.FetchStatus RPC op to jointly update the volume status and the vnode status. This assessment is done as part of parsing the reply:
- If the RO volume creation timestamp advances, cb_ro_snapshot is incremented; if either the creation or update timestamp changes in an unexpected way, the cb_scrub counter is incremented.
- If the Data Version returned doesn't match the copy we have locally, then we ask for the pagecache to be zapped. This takes care of handling RO update.
(D) If cb_scrub differs between volume and vnode, the vnode's pagecache is zapped and the vnode's cb_scrub is updated unless the file is marked as having been deleted.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
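Points (1), (2) and (5) above can be sketched in userspace C as follows. This is a minimal model under stated assumptions, not the kernel code: the toy_* types and functions are hypothetical, INT64_MIN stands in for TIME64_MIN, and locking is omitted entirely.

```c
#include <stdint.h>

#define TOY_NO_CB_PROMISE INT64_MIN	/* stands in for AFS_NO_CB_PROMISE */

struct toy_volume {
	unsigned int cb_v_break;	/* bumped by volume breaks and server reinit */
	unsigned int cb_v_check;	/* cb_v_break as of the last completed check */
};

struct toy_vnode {
	struct toy_volume *volume;
	int64_t cb_expires_at;		/* expiry time, or TOY_NO_CB_PROMISE */
};

/* Read-only test: does this vnode need revalidating at time 'now'? */
static int toy_needs_validation(const struct toy_vnode *v, int64_t now)
{
	if (v->volume->cb_v_check != v->volume->cb_v_break)
		return 1;	/* a break arrived since the last check */
	/*
	 * The sentinel sorts below every real time, so this one comparison
	 * covers both "no promise" and "promise expired".
	 */
	return v->cb_expires_at <= now;
}

/*
 * Called once the status fetch has completed.  cb_v_check is advanced to
 * the value cb_v_break had when the fetch started, so a break that
 * arrived during the fetch still triggers another check.
 */
static void toy_finish_check(struct toy_volume *vol, unsigned int break_at_start)
{
	vol->cb_v_check = break_at_start;
}
```

Using a sentinel time instead of a separate "promised" flag is what removes one test from the fast path: there is a single field to compare rather than a flag plus a time.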
|
| #
16069e13 |
| 05-Nov-2023 |
David Howells <[email protected]> |
afs: Parse the VolSync record in the reply of a number of RPC ops
A number of fileserver RPC operations return a VolSync record as part of their reply that gives some information about the state of the volume being accessed, including:
(1) A volume Creation timestamp. For an RW volume, this is the time at which the volume was created; if it changes, the RW volume was presumably restored from a backup and all cached data should be scrubbed as Data Version numbers could regress on the files in the volume.
For an RO volume, this is the time it was last snapshotted from the RW volume. It is expected to advance each time this happens; if it regresses, cached data should be scrubbed.
(2) A volume Update timestamp (Auristor only). For an RW volume, this is updated any time any change is made to a volume or its contents. If it regresses, all cached data must be scrubbed.
For an RO volume, this is a copy of the RW volume's Update timestamp at the point of snapshotting. It can be used as a version number when checking to see if a callback on a RO volume was due to a snapshot. If it regresses, all cached data must be scrubbed.
but this is currently not made use of by the in-kernel afs filesystem.
Make the afs filesystem use this by:
(1) Add an update time field to the afs_volsync struct and use a value of TIME64_MIN in both that and the creation time to indicate that they are unset.
(2) Add creation and update time fields to the afs_volume struct and use this to track the two timestamps.
(3) Add a volsync_lock mutex to the afs_volume struct to control modification access for when we detect a change in these values.
(4) Add a 'pre-op volsync' struct to the afs_operation struct to record the state of the volume tracking before the op.
(5) Add a new counter, cb_scrub, to the afs_volume struct to count events that require all data to be scrubbed. A copy is placed in the afs_vnode struct (inode) and if they no longer match, a scrub takes place.
(6) When the result of an operation is being parsed, parse the VolSync data too, if it is provided. Note that the two timestamps are handled separately, since they don't work in quite the same way.
- If the afs_volume tracking is unset, just set it and do nothing else.
- If the result timestamps are the same as the ones in afs_volume, do nothing.
- If the timestamps regress, increment cb_scrub if not already done so.
- If the creation timestamp on a RW volume changes, increment cb_scrub if not already done so.
- If the creation timestamp on a RO volume advances, update the server list and see if the current server has been excluded; if so, reissue the op. Once over half of the replication sites have been updated, increment cb_ro_snapshot to indicate updates may be required and switch over to excluding unupdated replication sites.
- If the creation timestamp on a Backup volume advances, just increment cb_ro_snapshot to trigger updates.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
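The decision rules for the creation timestamp listed above can be sketched roughly as below. This is a deliberately simplified model: the toy_* names are hypothetical, INT64_MIN stands in for TIME64_MIN, the "if not already done so" guard is left out, and the RO case skips the server-list update and the "over half of the replication sites" test, bumping cb_ro_snapshot immediately instead.

```c
#include <stdint.h>

#define TOY_TIME_UNSET INT64_MIN	/* stands in for TIME64_MIN */

enum toy_vol_type { TOY_RW, TOY_RO, TOY_BACKUP };

struct toy_volume {
	enum toy_vol_type type;
	int64_t creation_time;		/* TOY_TIME_UNSET until first seen */
	unsigned int cb_scrub;		/* events requiring a full scrub */
	unsigned int cb_ro_snapshot;	/* events suggesting a new snapshot */
};

/* Apply the creation time from a reply's VolSync record to the volume. */
static void toy_update_creation(struct toy_volume *vol, int64_t reply_creation)
{
	if (reply_creation == TOY_TIME_UNSET)
		return;			/* the reply carried no value */
	if (vol->creation_time == TOY_TIME_UNSET) {
		vol->creation_time = reply_creation;	/* first sighting */
		return;
	}
	if (reply_creation == vol->creation_time)
		return;			/* unchanged: nothing to do */

	if (reply_creation < vol->creation_time || vol->type == TOY_RW)
		vol->cb_scrub++;	/* regression, or any change on RW */
	else
		vol->cb_ro_snapshot++;	/* RO/Backup snapshot advanced */
	vol->creation_time = reply_creation;
}
```

The asymmetry is the point of the sketch: an advancing timestamp is routine for RO and Backup volumes but suspicious for RW volumes, while a regressing timestamp forces a scrub for every volume type.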
|
| #
32222f09 |
| 07-Nov-2023 |
David Howells <[email protected]> |
afs: Apply server breaks to mmap'd files in the call processor
Apply server breaks to mmap'd files that are being used from that server from the call processor work function rather than punting it off to a workqueue. The work item, afs_server_init_callback(), then bumps each individual inode off to its own work item introducing a potentially lengthy delay. This reduces that delay at the cost of extending the amount of time we delay replying to the CB.InitCallBack3 notification RPC from the server.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
| #
4121b433 |
| 30-Nov-2023 |
Oleg Nesterov <[email protected]> |
afs: fix the usage of read_seqbegin_or_lock() in afs_lookup_volume_rcu()
David Howells says:
(2) afs_lookup_volume_rcu().
There can be a lot of volumes known by a system. A thousand would require a 10-step walk and this is drivable by remote operation, so I think this should probably take a lock on the second pass too.
Make the "seq" counter odd on the 2nd pass, otherwise read_seqbegin_or_lock() never takes the lock.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]/
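Why the parity of "seq" matters can be shown with a single-threaded toy model. The toy_* functions below only imitate the even/odd convention (even means a lockless pass, odd means take the lock); they are not the kernel's seqlock implementation, and the concurrent writer is simulated by bumping the counter by hand. Starting at 1 and incrementing at the top of the loop is one way to get "even on the first pass, odd on the retry".

```c
/* Toy single-threaded model; NOT the kernel seqlock implementation. */
struct toy_seqlock {
	unsigned int sequence;	/* bumped by writers */
	int locked;		/* 1 while a reader holds the lock */
};

static void toy_read_begin_or_lock(struct toy_seqlock *l, int *seq)
{
	if (!(*seq & 1))
		*seq = (int)l->sequence;	/* even: lockless, sample counter */
	else
		l->locked = 1;			/* odd: take the lock */
}

static int toy_need_retry(const struct toy_seqlock *l, int seq)
{
	/* Only a lockless pass can be invalidated by a writer. */
	return !(seq & 1) && l->sequence != (unsigned int)seq;
}

static void toy_done(struct toy_seqlock *l, int seq)
{
	if (seq & 1)
		l->locked = 0;
}

/* Returns the number of passes made; *took_lock says if the lock was used. */
int toy_lookup(struct toy_seqlock *l, int writer_interferes, int *took_lock)
{
	int seq = 1, passes = 0;

	*took_lock = 0;
	do {
		seq++;	/* 2 (even) on the first pass, odd on a retry */
		toy_read_begin_or_lock(l, &seq);
		passes++;
		if (seq & 1)
			*took_lock = 1;
		/* ... the tree walk would go here ... */
		if (writer_interferes && passes == 1)
			l->sequence += 2;	/* simulate a concurrent update */
	} while (toy_need_retry(l, seq));
	toy_done(l, seq);
	return passes;
}
```

Without the increment, the retry would start with the even value sampled on the first pass, so the second pass would be lockless again and could in principle keep retrying for as long as writers keep interfering.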
|
|
Revision tags: v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2 |
|
| #
874c8ca1 |
| 09-Jun-2022 |
David Howells <[email protected]> |
netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context
While randstruct was satisfied with using an open-coded "void *" offset cast for the netfs_i_context <-> inode casting, __builtin_object_size() as used by FORTIFY_SOURCE was not as easily fooled. This was causing the following complaint[1] from gcc v12:
  In file included from include/linux/string.h:253,
                   from include/linux/ceph/ceph_debug.h:7,
                   from fs/ceph/inode.c:2:
  In function 'fortify_memset_chk',
      inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
      inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
  include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
    242 |                         __write_overflow_field(p_size_field, size);
        |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by embedding a struct inode into struct netfs_i_context (which should perhaps be renamed to struct netfs_inode). The struct inode vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode structs and vfs_inode is then simply changed to "netfs.inode" in those filesystems.
Further, rename netfs_i_context to netfs_inode, get rid of the netfs_inode() function that converted a netfs_i_context pointer to an inode pointer (that can now be done with &ctx->inode) and rename the netfs_i_context() function to netfs_inode() (which is now a wrapper around container_of()).
Most of the changes were done with:
  perl -p -i -e 's/vfs_inode/netfs.inode/'g \
	`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`
Kees suggested doing it with a pair structure[2] and a special declarator to insert that into the network filesystem's inode wrapper[3], but I think it's cleaner to embed it - and then it doesn't matter if struct randomisation reorders things.
Dave Chinner suggested using a filesystem-specific VFS_I() function in each filesystem to convert that filesystem's own inode wrapper struct into the VFS inode struct[4].
Version #2:
 - Fix a couple of missed name changes due to a disabled cifs option.
 - Rename netfs_i_context to netfs_inode.
 - Use "netfs" instead of "nic" as the member name in per-fs inode wrapper structs.
[ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily disable '-Wattribute-warning' for now") that is no longer needed ]
Fixes: bc899ee1c898 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Xiubo Li <[email protected]>
cc: Jonathan Corbet <[email protected]>
cc: Eric Van Hensbergen <[email protected]>
cc: Latchesar Ionkov <[email protected]>
cc: Dominique Martinet <[email protected]>
cc: Christian Schoenebeck <[email protected]>
cc: Marc Dionne <[email protected]>
cc: Ilya Dryomov <[email protected]>
cc: Steve French <[email protected]>
cc: William Kucharski <[email protected]>
cc: "Matthew Wilcox (Oracle)" <[email protected]>
cc: Dave Chinner <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]/ [1]
Link: https://lore.kernel.org/r/[email protected]/ [2]
Link: https://lore.kernel.org/r/[email protected]/ [3]
Link: https://lore.kernel.org/r/[email protected]/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <[email protected]>
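The embedding described above can be illustrated with a cut-down userspace sketch. The struct layouts are heavily simplified assumptions (the real structs have many more members), and toy_container_of(), toy_netfs_inode() and toy_AFS_FS_I() are hypothetical names modelled on the container_of() pattern that the message mentions.

```c
#include <stddef.h>

/* Simplified stand-ins; the real structs have many more members. */
struct inode {
	unsigned long i_ino;
};

struct netfs_inode {
	struct inode inode;	/* the VFS inode, embedded directly */
	long remote_i_size;	/* illustrative extra field */
};

struct toy_afs_vnode {
	struct netfs_inode netfs;	/* formerly a bare 'vfs_inode' member */
	unsigned int flags;
};

/* Userspace equivalent of the container_of() idea. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* inode -> netfs_inode: a wrapper around container_of(). */
static struct netfs_inode *toy_netfs_inode(struct inode *inode)
{
	return toy_container_of(inode, struct netfs_inode, inode);
}

/* inode -> filesystem-specific wrapper, via the nested member. */
static struct toy_afs_vnode *toy_AFS_FS_I(struct inode *inode)
{
	return toy_container_of(inode, struct toy_afs_vnode, netfs.inode);
}
```

Because the conversion goes through offsetof() on a real member rather than an open-coded pointer offset, the compiler knows the full size of the enclosing object, and the result stays correct even if struct randomisation reorders the members.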
|
|
Revision tags: v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1 |
|
| #
4fe6a946 |
| 02-Sep-2021 |
David Howells <[email protected]> |
afs: Try to avoid taking RCU read lock when checking vnode validity
Try to avoid taking the RCU read lock when checking the validity of a vnode's callback state. The only thing it's needed for is to pin the parent volume's server list whilst we search it to find the record of the server we're currently using to see if it has been reinitialised (ie. it sent us a CB.InitCallBackState* RPC).
Do this by the following means:
(1) Keep an additional per-cell counter (fs_s_break) that's incremented each time any of the fileservers in the cell reinitialises.
Since the new counter can be accessed without RCU from the vnode, we can check that first - and only if it differs, get the RCU read lock and check the volume's server list.
(2) Replace afs_get_s_break_rcu() with afs_check_server_good() which now indicates whether the callback promise is still expected to be present on the server. This does the checks as described in (1).
(3) Restructure afs_check_validity() to take account of the change in (2).
We can also get rid of the valid variable and just use the need_clear variable with the addition of the afs_cb_break_no_promise reason.
(4) afs_check_validity() probably shouldn't be altering vnode->cb_v_break and vnode->cb_s_break when it doesn't have cb_lock exclusively locked.
Move the change to vnode->cb_v_break to __afs_break_callback().
Delegate the change to vnode->cb_s_break to afs_select_fileserver() and set vnode->cb_fs_s_break there also.
(5) afs_validate() no longer needs to get the RCU read lock around its call to afs_check_validity() - and can skip the call entirely if we don't have a promise.
Signed-off-by: David Howells <[email protected]>
Tested-by: Markus Suvanto <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163111669583.283156.1397603105683094563.stgit@warthog.procyon.org.uk/
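The two-level check in points (1) and (2) can be modelled roughly as follows. The struct layouts are hypothetical simplifications, the server-list search is reduced to a single pointer, and a counter (slow_checks) stands in for "took the RCU read lock" so that the effect of the fast path is observable.

```c
struct toy_cell {
	unsigned int fs_s_break;	/* bumped when any server in the cell reinitialises */
};

struct toy_server {
	unsigned int cb_s_break;	/* bumped when this server reinitialises */
};

struct toy_vnode {
	struct toy_cell *cell;
	struct toy_server *server;	/* server the promise came from */
	unsigned int cb_fs_s_break;	/* cell counter seen when promise obtained */
	unsigned int cb_s_break;	/* server counter seen at the same time */
	unsigned int slow_checks;	/* times the "locked" path was needed */
};

/* Returns 1 if the server is still expected to hold our callback promise. */
static int toy_check_server_good(struct toy_vnode *v)
{
	/* Fast path: nothing in the cell has reinitialised, no lock needed. */
	if (v->cb_fs_s_break == v->cell->fs_s_break)
		return 1;

	/*
	 * Slow path: the real code would take the RCU read lock here and
	 * search the volume's server list for the server record.
	 */
	v->slow_checks++;
	return v->cb_s_break == v->server->cb_s_break;
}
```

The per-cell counter is a cheap over-approximation: it can send a vnode down the slow path when a different server reinitialised, but it should never let a vnode skip the check when its own server did.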
|
| #
6e0e99d5 |
| 02-Sep-2021 |
David Howells <[email protected]> |
afs: Fix mmap coherency vs 3rd-party changes
Fix the coherency management of mmap'd data such that 3rd-party changes become visible as soon as possible after the callback notification is delivered by the fileserver. This is done by the following means:
(1) When we break a callback on a vnode specified by the CB.CallBack call from the server, we queue a work item (vnode->cb_work) to go and clobber all the PTEs mapping to that inode.
This causes the CPU to trip through the ->map_pages() and ->page_mkwrite() handlers if userspace attempts to access the page(s) again.
(Ideally, this would be done in the service handler for CB.CallBack, but the server is waiting for our reply before considering, and we have a list of vnodes, all of which need breaking - and the process of getting the mmap_lock and stripping the PTEs on all CPUs could be quite slow.)
(2) Call afs_validate() from the ->map_pages() handler to check to see if the file has changed and to get a new callback promise from the server.
Also handle the fileserver telling us that it's dropping all callbacks, possibly after it's been restarted by sending us a CB.InitCallBackState* call by the following means:
(3) Maintain a per-cell list of afs files that are currently mmap'd (cell->fs_open_mmaps).
(4) Add a work item to each server that is invoked if there are any open mmaps when CB.InitCallBackState happens. This work item goes through the aforementioned list and invokes the vnode->cb_work work item for each one that is currently using this server.
This causes the PTEs to be cleared, causing ->map_pages() or ->page_mkwrite() to be called again, thereby calling afs_validate() again.
I've chosen to simply strip the PTEs at the point of notification reception rather than invalidate all the pages as well because (a) it's faster, (b) we may get a notification for other reasons than the data being altered (in which case we don't want to clobber the pagecache) and (c) we need to ask the server to find out - and I don't want to wait for the reply before holding up userspace.
This was tested using the attached test program:
	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <sys/mman.h>

	int main(int argc, char *argv[])
	{
		size_t size = getpagesize();
		unsigned char *p;
		bool mod = (argc == 3);
		int fd;

		if (argc != 2 && argc != 3) {
			fprintf(stderr, "Format: %s <file> [mod]\n", argv[0]);
			exit(2);
		}

		fd = open(argv[1], mod ? O_RDWR : O_RDONLY);
		if (fd < 0) {
			perror(argv[1]);
			exit(1);
		}

		p = mmap(NULL, size, mod ? PROT_READ|PROT_WRITE : PROT_READ,
			 MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (;;) {
			if (mod) {
				p[0]++;
				msync(p, size, MS_ASYNC);
				fsync(fd);
			}
			printf("%02x", p[0]);
			fflush(stdout);
			sleep(1);
		}
	}
It runs in two modes: in one mode, it mmaps a file, then sits in a loop reading the first byte, printing it and sleeping for a second; in the second mode it mmaps a file, then sits in a loop incrementing the first byte and flushing, then printing and sleeping.
Two instances of this program can be run on different machines, one doing the reading and one doing the writing. The reader should see the changes made by the writer, but without this patch it doesn't, because validity checking is being done lazily - only on entry to the filesystem.
Testing the InitCallBackState change is more complicated. The server has to be taken offline, the saved callback state file removed and then the server restarted whilst the reading-mode program continues to run. The client machine then has to poke the server to trigger the InitCallBackState call.
Signed-off-by: David Howells <[email protected]>
Tested-by: Markus Suvanto <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163111668833.283156.382633263709075739.stgit@warthog.procyon.org.uk/
|
|
Revision tags: v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7 |
|
| #
3c4c4075 |
| 27-May-2020 |
David Howells <[email protected]> |
afs: Fix the by-UUID server tree to allow servers with the same UUID
Whilst it shouldn't happen, it is possible for multiple fileservers to share a UUID, particularly if an entire cell has been duplicated, UUIDs and all. In such a case, it's not necessarily possible to map the effect of the CB.InitCallBackState3 incoming RPC to a specific server unambiguously by UUID and thus to a specific cell.
Indeed, there's a problem whereby multiple server records may need to occupy the same spot in the rb_tree rooted in the afs_net struct.
Fix this by allowing servers to form a list, with the head of the list in the tree. When the front entry in the list is removed, the second in the list just replaces it. afs_init_callback_state() then just goes down the line, poking each server in the list.
This means that some servers will be unnecessarily poked, unfortunately. An alternative would be to route by call parameters.
Reported-by: Jeffrey Altman <[email protected]>
Signed-off-by: David Howells <[email protected]>
Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
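The "list hanging off one tree slot" arrangement can be sketched as below. Only a single slot is modelled and the tree itself is omitted; the toy_* names are hypothetical, and the "poke" is reduced to setting a flag on each server sharing the UUID.

```c
#include <stddef.h>

struct toy_server {
	struct toy_server *uuid_next;	/* next server sharing the same UUID */
	int poked;			/* set when callback state is reset */
};

/* One slot of the by-UUID tree; the tree itself is omitted. */
struct toy_uuid_slot {
	struct toy_server *head;
};

/* Add a server with a UUID that is already present: push onto the list. */
static void toy_add(struct toy_uuid_slot *slot, struct toy_server *s)
{
	s->uuid_next = slot->head;
	slot->head = s;
}

/* Remove a server; if it was the head, the second entry takes its place. */
static void toy_remove(struct toy_uuid_slot *slot, struct toy_server *s)
{
	struct toy_server **pp;

	for (pp = &slot->head; *pp; pp = &(*pp)->uuid_next) {
		if (*pp == s) {
			*pp = s->uuid_next;
			s->uuid_next = NULL;
			return;
		}
	}
}

/* Go down the line, poking every server that shares the UUID. */
static int toy_init_callback_state(struct toy_uuid_slot *slot)
{
	struct toy_server *s;
	int n = 0;

	for (s = slot->head; s; s = s->uuid_next) {
		s->poked = 1;
		n++;
	}
	return n;
}
```

The cost mentioned in the message is visible here: every server on the list gets poked, even though only one of them actually sent the notification.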
|
|
Revision tags: v5.7-rc7, v5.7-rc6, v5.7-rc5, v5.7-rc4 |
|
| #
20325960 |
| 30-Apr-2020 |
David Howells <[email protected]> |
afs: Reorganise volume and server trees to be rooted on the cell
Reorganise afs_volume objects such that they're in a tree keyed on volume ID, rooted on an afs_cell object rather than being in multiple trees, each of which is rooted on an afs_server object.
afs_server structs become per-cell and acquire a pointer to the cell.
The process of breaking a callback then starts with finding the server by its network address, following that to the cell and then looking up each volume ID in the volume tree.
This is simpler than the afs_vol_interest/afs_cb_interest N:M mapping web and allows those structs and the code for maintaining them to be simplified or removed.
It does make a couple of things a bit more tricky, though:
(1) Operations now start with a volume, not a server, so there can be more than one answer as to whether or not the server we'll end up using supports the FS.InlineBulkStatus RPC.
(2) CB RPC operations that specify the server UUID. There's still a tree of servers by UUID on the afs_net struct, but the UUIDs in it aren't guaranteed unique.
Signed-off-by: David Howells <[email protected]>
|
|
Revision tags: v5.7-rc3, v5.7-rc2, v5.7-rc1 |
|
| #
e49c7b2f |
| 10-Apr-2020 |
David Howells <[email protected]> |
afs: Build an abstraction around an "operation" concept
Turn the afs_operation struct into the main way that most fileserver operations are managed. Various things are added to the struct, including the following:
(1) All the parameters and results of the relevant operations are moved into it, removing corresponding fields from the afs_call struct. afs_call gets a pointer to the op.
(2) The target volume is made the main focus of the operation, rather than the target vnode(s), and a bunch of op->vnode->volume are made op->volume instead.
(3) Two vnode records are defined (op->file[]) for the vnode(s) involved in most operations. The vnode record (struct afs_vnode_param) contains:
- The vnode pointer.
- The fid of the vnode to be included in the parameters or that was returned in the reply (eg. FS.MakeDir).
- The status and callback information that may be returned in the reply about the vnode.
- Callback break and data version tracking for detecting simultaneous third-party changes.
(4) Pointers to dentries to be updated with new inodes.
(5) An operations table pointer. The table includes pointers to functions for issuing AFS and YFS-variant RPCs, handling the success and abort of an operation and handling post-I/O-lock local editing of a directory.
To make this work, the following function restructuring is made:
(A) The rotation loop that issues calls to fileservers that can be found in each function that wants to issue an RPC (such as afs_mkdir()) is extracted out into common code, in a new file called fs_operation.c.
(B) The rotation loops, such as the one in afs_mkdir(), are replaced with a much smaller piece of code that allocates an operation, sets the parameters and then calls out to the common code to do the actual work.
(C) The code for handling the success and failure of an operation is moved into operation functions (as (5) above) and these are called from the core code at appropriate times.
(D) The pseudo inode getting stuff used by the dynamic root code is moved over into dynroot.c.
(E) struct afs_iget_data is absorbed into the operation struct and afs_iget() expects to be given an op pointer and a vnode record.
(F) Point (E) doesn't work for the root dir of a volume, but we know the FID in advance (it's always vnode 1, unique 1), so a separate inode getter, afs_root_iget(), is provided to special-case that.
(G) The inode status init/update functions now also take an op and a vnode record.
(H) The RPC marshalling functions now, for the most part, just take an afs_operation struct as their only argument. All the data they need is held there. The result delivery functions write their answers there as well.
(I) The call is attached to the operation and then the operation core does the waiting.
And then the new operation code is, for the moment, made to just initialise the operation, get the appropriate vnode I/O locks and do the same rotation loop as before.
This lays the foundation for the following changes in the future:
(*) Overhauling the rotation (again).
(*) Support for asynchronous I/O, where the fileserver rotation must be done asynchronously also.
Signed-off-by: David Howells <[email protected]>
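The shape of the abstraction (an operation struct carrying parameters and results, an ops table of function pointers, and one shared rotation loop) can be sketched in userspace as follows. Everything here is a hypothetical simplification: the toy_* names, the "reachable" flag that decides whether an RPC succeeds, and the reduction of rotation to "try each server in order" are assumptions, and vnode locking is omitted.

```c
#include <stddef.h>

struct toy_operation;

/* Per-operation table, in the spirit of point (5) above. */
struct toy_operation_ops {
	int (*issue_afs_rpc)(struct toy_operation *op);
	int (*issue_yfs_rpc)(struct toy_operation *op);
	void (*success)(struct toy_operation *op);
	void (*aborted)(struct toy_operation *op);
};

struct toy_server {
	int is_yfs;		/* selects the YFS-variant RPC */
	int reachable;		/* toy stand-in for "the RPC succeeds" */
};

struct toy_operation {
	const struct toy_operation_ops *ops;
	struct toy_server *servers;	/* candidate fileservers */
	size_t nr_servers;
	size_t server_index;		/* server currently being tried */
	int error;
	int succeeded;			/* result written by the handlers */
};

/* Common code: the rotation loop lives here, not in each caller. */
int toy_do_operation(struct toy_operation *op)
{
	op->error = -1;
	for (op->server_index = 0; op->server_index < op->nr_servers;
	     op->server_index++) {
		struct toy_server *s = &op->servers[op->server_index];

		op->error = s->is_yfs ? op->ops->issue_yfs_rpc(op)
				      : op->ops->issue_afs_rpc(op);
		if (op->error == 0) {
			if (op->ops->success)
				op->ops->success(op);
			return 0;
		}
	}
	if (op->ops->aborted)
		op->ops->aborted(op);
	return op->error;
}

/* A hypothetical "mkdir"-style operation built on the common code. */
static int toy_issue(struct toy_operation *op)
{
	return op->servers[op->server_index].reachable ? 0 : -1;
}
static void toy_success(struct toy_operation *op) { op->succeeded = 1; }
static void toy_aborted(struct toy_operation *op) { op->succeeded = 0; }

static const struct toy_operation_ops toy_mkdir_ops = {
	.issue_afs_rpc	= toy_issue,
	.issue_yfs_rpc	= toy_issue,
	.success	= toy_success,
	.aborted	= toy_aborted,
};
```

A caller then only allocates the operation, fills in its parameters, points it at an ops table and calls the common function, which is the reduction described in (B).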
|
|
Revision tags: v5.6 |
|
| #
8230fd82 |
| 27-Mar-2020 |
David Howells <[email protected]> |
afs: Make callback processing more efficient.
afs_vol_interest objects represent the volume IDs currently being accessed from a fileserver. These hold lists of afs_cb_interest objects that represent the superblocks using that volume ID on that server.
When a callback notification from the server telling of a modification by another client arrives, the volume ID specified in the notification is looked up in the server's afs_vol_interest list. Through the afs_cb_interest list, the relevant superblocks can be iterated over and the specific inode looked up and marked in each one.
Make the following efficiency improvements:
(1) Hold rcu_read_lock() over the entire processing rather than locking it each time.
(2) Do all the callbacks for each vid together rather than individually. Each volume then only needs to be looked up once.
(3) afs_vol_interest objects are now stored in an rb_tree rather than a flat list to reduce the lookup step count.
(4) afs_vol_interest lookup is now done with RCU, but because it's in an rb_tree which may rotate under us, a seqlock is used so that if it changes during the walk, we repeat the walk with a lock held.
With this and the preceding patch which adds RCU-based lookups in the inode cache, target volumes/vnodes can be taken without the need to take any locks, except on the target itself.
Signed-off-by: David Howells <[email protected]>
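Improvement (2) above, handling all the callbacks for one volume ID together, can be illustrated with the sketch below. It assumes, for simplicity, that entries with the same vid are adjacent in the array, which the real notification need not guarantee; toy_callback and toy_break_callbacks() are hypothetical names, and the function merely counts how many volume lookups would be made.

```c
#include <stddef.h>

struct toy_callback {
	unsigned long vid;	/* volume ID from the notification */
	unsigned int vnode;	/* vnode number within that volume */
};

/*
 * Process the callbacks in runs of equal vid.  Returns the number of
 * volume lookups performed, which is one per run rather than one per
 * callback.
 */
size_t toy_break_callbacks(const struct toy_callback *cbs, size_t count)
{
	size_t lookups = 0, i = 0;

	while (i < count) {
		unsigned long vid = cbs[i].vid;

		lookups++;	/* look the volume up once for this run */
		while (i < count && cbs[i].vid == vid) {
			/* ... break the callback on cbs[i].vnode here ... */
			i++;
		}
	}
	return lookups;
}
```

With five callbacks spread over two volumes, this makes two lookups where a per-callback approach would make five.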
|
|
Revision tags: v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4, v5.5-rc3, v5.5-rc2, v5.5-rc1, v5.4, v5.4-rc8, v5.4-rc7, v5.4-rc6, v5.4-rc5, v5.4-rc4, v5.4-rc3, v5.4-rc2, v5.4-rc1, v5.3, v5.3-rc8, v5.3-rc7, v5.3-rc6, v5.3-rc5, v5.3-rc4, v5.3-rc3, v5.3-rc2, v5.3-rc1, v5.2, v5.2-rc7, v5.2-rc6, v5.2-rc5, v5.2-rc4, v5.2-rc3, v5.2-rc2, v5.2-rc1, v5.1, v5.1-rc7, v5.1-rc6, v5.1-rc5, v5.1-rc4, v5.1-rc3, v5.1-rc2, v5.1-rc1, v5.0, v5.0-rc8, v5.0-rc7, v5.0-rc6, v5.0-rc5, v5.0-rc4, v5.0-rc3, v5.0-rc2, v5.0-rc1, v4.20, v4.20-rc7, v4.20-rc6, v4.20-rc5, v4.20-rc4, v4.20-rc3, v4.20-rc2, v4.20-rc1, v4.19, v4.19-rc8, v4.19-rc7, v4.19-rc6, v4.19-rc5, v4.19-rc4, v4.19-rc3, v4.19-rc2, v4.19-rc1, v4.18, v4.18-rc8, v4.18-rc7, v4.18-rc6, v4.18-rc5, v4.18-rc4, v4.18-rc3, v4.18-rc2, v4.18-rc1, v4.17, v4.17-rc7, v4.17-rc6, v4.17-rc5, v4.17-rc4, v4.17-rc3, v4.17-rc2, v4.17-rc1, v4.16, v4.16-rc7, v4.16-rc6, v4.16-rc5, v4.16-rc4, v4.16-rc3, v4.16-rc2, v4.16-rc1, v4.15, v4.15-rc9, v4.15-rc8, v4.15-rc7, v4.15-rc6, v4.15-rc5, v4.15-rc4, v4.15-rc3, v4.15-rc2 |
|
| #
3f19b2ab |
| 01-Dec-2017 |
David Howells <[email protected]> |
vfs, afs, ext4: Make the inode hash table RCU searchable
Make the inode hash table RCU searchable so that searches that want to access or modify an inode without taking a ref on that inode can do so without taking the inode hash table lock.
The main thing this requires is some RCU annotation on the list manipulation operations. Inodes are already freed by RCU in most cases.
Users of this interface must take care as the inode may be still under construction or may be being torn down around them.
There are at least three instances where this can be of use:
(1) Testing whether the inode number iunique() is going to return is currently unique (the iunique_lock is still held).
(2) Ext4 date stamp updating.
(3) AFS callback breaking.
Signed-off-by: David Howells <[email protected]>
Acked-by: Konstantin Khlebnikov <[email protected]>
cc: [email protected]
cc: [email protected]
|
| #
cd340703 |
| 21-Nov-2019 |
Marc Dionne <[email protected]> |
afs: Fix possible assert with callbacks from yfs servers
Servers sending callback breaks to the YFS_CM_SERVICE service may send up to YFSCBMAX (1024) fids in a single RPC. Anything over AFSCBMAX (50) will cause the assert in afs_break_callbacks to trigger.
Remove the assert, as the count has already been checked against the appropriate max values in afs_deliver_cb_callback and afs_deliver_yfs_cb_callback.
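The per-service limit check that makes the assert redundant can be sketched as follows. The two limits come from the commit message; the enum and the function name are hypothetical and do not reflect the kernel's actual delivery functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AFSCBMAX 50	/* limit for the AFS cache manager service */
#define YFSCBMAX 1024	/* limit for the YFS cache manager service */

/* Hypothetical tag for which service received the RPC. */
enum cm_service {
	CM_SERVICE_AFS,
	CM_SERVICE_YFS,
};

/* Accept a callback-break count only if it fits the service's own limit. */
static bool cb_count_ok(enum cm_service svc, uint32_t count)
{
	uint32_t max = (svc == CM_SERVICE_YFS) ? YFSCBMAX : AFSCBMAX;

	return count <= max;
}
```

Once the count has been validated against the right limit at delivery time, a later assert against AFSCBMAX alone would wrongly reject valid YFS batches.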
Fixes: 35dbfba3111a ("afs: Implement the YFS cache manager service")
Signed-off-by: Marc Dionne <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
| #
45218193 |
| 20-Jun-2019 |
David Howells <[email protected]> |
afs: Trace afs_server usage
Add a tracepoint (afs_server) to track the afs_server object usage count.
Signed-off-by: David Howells <[email protected]>
|
| #
051d2525 |
| 20-Jun-2019 |
David Howells <[email protected]> |
afs: Add some callback management tracepoints
Add a couple of tracepoints to track callback management:
(1) afs_cb_miss - Logs when we were unable to apply a callback, either due to the inode being discarded or due to a competing thread applying a callback first.
(2) afs_cb_break - Logs when we attempted to clear the noted callback promise, either due to the server explicitly breaking the callback, the callback promise lapsing or a local event obsoleting it.
Signed-off-by: David Howells <[email protected]>
|
| #
90fa9b64 |
| 20-Jun-2019 |
David Howells <[email protected]> |
afs: Fix uninitialised spinlock afs_volume::cb_break_lock
Fix the cb_break_lock spinlock in afs_volume struct by initialising it when the volume record is allocated.
Also rename the lock to cb_v_break_lock to distinguish it from the lock of the same name in the afs_server struct.
Without this, the following trace may be observed when a volume-break callback is received:
    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 2 PID: 50 Comm: kworker/2:1 Not tainted 5.2.0-rc1-fscache+ #3045
    Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
    Workqueue: afs SRXAFSCB_CallBack
    Call Trace:
     dump_stack+0x67/0x8e
     register_lock_class+0x23b/0x421
     ? check_usage_forwards+0x13c/0x13c
     __lock_acquire+0x89/0xf73
     lock_acquire+0x13b/0x166
     ? afs_break_callbacks+0x1b2/0x3dd
     _raw_write_lock+0x2c/0x36
     ? afs_break_callbacks+0x1b2/0x3dd
     afs_break_callbacks+0x1b2/0x3dd
     ? trace_event_raw_event_afs_server+0x61/0xac
     SRXAFSCB_CallBack+0x11f/0x16c
     process_one_work+0x2c5/0x4ee
     ? worker_thread+0x234/0x2ac
     worker_thread+0x1d8/0x2ac
     ? cancel_delayed_work_sync+0xf/0xf
     kthread+0x11f/0x127
     ? kthread_park+0x76/0x76
     ret_from_fork+0x24/0x30
Fixes: 68251f0a6818 ("afs: Fix whole-volume callback handling")
Signed-off-by: David Howells <[email protected]>
|
| #
f642404a |
| 13-May-2019 |
David Howells <[email protected]> |
afs: Make vnode->cb_interest RCU safe
Use RCU-based freeing for afs_cb_interest struct objects and use RCU on vnode->cb_interest. Use that change to allow afs_check_validity() to use read_seqbegin_or_lock() instead of read_seqlock_excl().
This also requires the caller of afs_check_validity() to hold the RCU read lock across the call.
Signed-off-by: David Howells <[email protected]>
|
| #
c7226e40 |
| 10-May-2019 |
David Howells <[email protected]> |
afs: Fix lock-wait/callback-break double locking
__afs_break_callback() holds vnode->lock around its call of afs_lock_may_be_available() - which also takes that lock.
Fix this by not taking the lock in __afs_break_callback().
Also, there's no point checking the granted_locks and pending_locks queues; it's sufficient to check lock_state, so move that check out of afs_lock_may_be_available() into __afs_break_callback() to replace the queue checks.
Fixes: e8d6c554126b ("AFS: implement file locking")
Signed-off-by: David Howells <[email protected]>
|
| #
eeba1e9c |
| 13-Apr-2019 |
David Howells <[email protected]> |
afs: Fix in-progess ops to ignore server-level callback invalidation
The in-kernel afs filesystem client counts the number of server-level callback invalidation events (CB.InitCallBackState* RPC operations) that it receives from the server. This is stored in cb_s_break in various structures, including afs_server and afs_vnode.
If an inode is examined by afs_validate(), say, the afs_server copy is compared, along with other break counters, to those in afs_vnode, and if one or more of the counters do not match, it is considered that the server's callback promise is broken. At points where this happens, AFS_VNODE_CB_PROMISED is cleared to indicate that the status must be refetched from the server.
afs_validate() issues an FS.FetchStatus operation to get updated metadata - and based on the updated data_version may invalidate the pagecache too.
However, the break counters are also used to determine whether to note a new callback in the vnode (which would set the AFS_VNODE_CB_PROMISED flag) and whether to cache the permit data included in the YFSFetchStatus record by the server.
The problem comes when the server sends us a CB.InitCallBackState op. The first such instance doesn't cause cb_s_break to be incremented, but rather causes AFS_SERVER_FL_NEW to be cleared - but thereafter, say some hours after last use and all the volumes have been automatically unmounted and the server has forgotten about the client[*], this *will* likely cause an increment.
[*] There are other circumstances too, such as the server restarting or needing to make space in its callback table.
Note that the server won't send us a CB.InitCallBackState op until we talk to it again.
So what happens is:
(1) A mount for a new volume is attempted, an inode is created for the root vnode and vnode->cb_s_break and AFS_VNODE_CB_PROMISED aren't set immediately, as we don't have a nominated server to talk to yet - and we may iterate through a few to find one.
(2) Before the operation happens, afs_fetch_status(), say, notes in the cursor (fc.cb_break) the break counter sum from the vnode, volume and server counters, but the server->cb_s_break is currently 0.
(3) We send FS.FetchStatus to the server. The server sends us back CB.InitCallBackState. We increment server->cb_s_break.
(4) Our FS.FetchStatus completes. The reply includes a callback record.
(5) xdr_decode_AFSCallBack()/xdr_decode_YFSCallBack() check to see whether the callback promise was broken by checking the break counter sum from step (2) against the current sum.
This fails because of step (3), so we don't set the callback record and, importantly, don't set AFS_VNODE_CB_PROMISED on the vnode.
This does not preclude the syscall from progressing, and we don't loop here rechecking the status, but rather assume it's good enough for one round only and will need to be rechecked next time.
(6) afs_validate() is triggered on the vnode, probably called from d_revalidate() checking the parent directory.
(7) afs_validate() notes that AFS_VNODE_CB_PROMISED isn't set, so doesn't update vnode->cb_s_break and assumes the vnode to be invalid.
(8) afs_validate() needs to call afs_fetch_status(). Go back to step (2) and repeat, every time the vnode is validated.
This primarily affects volume root dir vnodes. Everything subsequent to those inherits an already incremented cb_s_break upon mounting.
The issue is that we assume that the callback record and the cached permit information in a reply from the server can't be trusted after getting a server break - but this is wrong since the server makes sure things are done in the right order, holding up our ops if necessary[*].
[*] There is an extremely unlikely scenario where a reply from before the CB.InitCallBackState could get its delivery deferred till after - at which point we think we have a promise when we don't. This, however, requires unlucky mass packet loss to one call.
AFS_SERVER_FL_NEW tries to paper over the cracks for the initial mount from a server we've never contacted before, but this should be unnecessary. It's also further insulated from the problem on an initial mount by querying the server first with FS.GetCapabilities, which triggers the CB.InitCallBackState.
Fix this by
(1) Remove AFS_SERVER_FL_NEW.
(2) In afs_calc_vnode_cb_break(), don't include cb_s_break in the calculation.
(3) In afs_cb_is_broken(), don't include cb_s_break in the check.
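The effect of dropping cb_s_break from the sum can be sketched as below. This is a simplified model under stated assumptions: the struct layout and both function names are hypothetical stand-ins for afs_calc_vnode_cb_break() and afs_cb_is_broken(), and the include_server flag exists only to contrast the old and new behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bundle of the three break counters named in the text. */
struct break_counters {
	unsigned int cb_break;		/* per-vnode breaks */
	unsigned int cb_v_break;	/* whole-volume breaks */
	unsigned int cb_s_break;	/* server-level CB.InitCallBackState events */
};

/* Snapshot taken before an operation is issued (step 2). */
static unsigned int calc_cb_break(const struct break_counters *c,
				  bool include_server)
{
	unsigned int sum = c->cb_break + c->cb_v_break;

	if (include_server)	/* old behaviour, removed by the fix */
		sum += c->cb_s_break;
	return sum;
}

/* Check made when the reply is decoded (step 5). */
static bool cb_is_broken(unsigned int snapshot,
			 const struct break_counters *c,
			 bool include_server)
{
	return snapshot != calc_cb_break(c, include_server);
}
```

With include_server set, a CB.InitCallBackState arriving between the snapshot and the reply makes every reply look broken, which reproduces the loop in steps (2) to (8); without it, only vnode and volume breaks invalidate the promise.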
Signed-off-by: David Howells <[email protected]>
|
| #
30062bd1 |
| 19-Oct-2018 |
David Howells <[email protected]> |
afs: Implement YFS support in the fs client
Implement support for talking to YFS-variant fileservers in the cache manager and the filesystem client. These implement upgraded services on the same port as their AFS services.
YFS fileservers provide expanded capabilities over AFS.
Signed-off-by: David Howells <[email protected]>
|
| #
06aeb297 |
| 19-Oct-2018 |
David Howells <[email protected]> |
afs: Remove callback details from afs_callback_break struct
Remove unnecessary details of a broken callback, such as version, expiry and type, from the afs_callback_break struct as they're not actually used and make the list take more memory.
Signed-off-by: David Howells <[email protected]>
|
| #
3b6492df |
| 19-Oct-2018 |
David Howells <[email protected]> |
afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
Increase the sizes of the volume ID to 64 bits and the vnode ID (inode number equivalent) to 96 bits to allow the support of YFS.
This requires the iget comparator to check the vnode->fid rather than i_ino and i_generation as i_ino is not sufficiently capacious. It also requires this data to be placed into the vnode cache key for fscache.
For the moment, just discard the top 32 bits of the vnode ID when returning it through stat.
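A rough sketch of why the comparator has to look at the full FID: two vnodes can differ only in the top 32 bits that stat discards. The struct below is an illustrative assumption, not the kernel's definition; the field names, the split of the 96-bit vnode ID into a 64-bit low part and a 32-bit high part, and the unique field are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical file identifier with a 64-bit volume ID and 96-bit vnode ID. */
struct fid {
	uint64_t vid;		/* volume ID */
	uint64_t vnode_lo;	/* low 64 bits of the vnode ID */
	uint32_t vnode_hi;	/* top 32 bits of the vnode ID */
	uint32_t unique;	/* assumed uniquifier, akin to a generation */
};

/* Compare every field, as an inode-cache comparator would have to. */
static bool fid_equal(const struct fid *a, const struct fid *b)
{
	return a->vid == b->vid &&
	       a->vnode_lo == b->vnode_lo &&
	       a->vnode_hi == b->vnode_hi &&
	       a->unique == b->unique;
}

/* Inode number as reported through stat: the top 32 bits are dropped. */
static uint64_t stat_ino(const struct fid *f)
{
	return f->vnode_lo;
}
```

Two FIDs that differ only in vnode_hi report the same inode number here, so matching on the inode number alone could return the wrong vnode.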
Signed-off-by: David Howells <[email protected]>
|
| #
47ea0f2e |
| 15-Jun-2018 |
David Howells <[email protected]> |
afs: Optimise callback breaking by not repeating volume lookup
At the moment, afs_break_callbacks calls afs_break_one_callback() for each separate FID it was given, and the latter looks up the volume individually for each one.
However, this is inefficient if two or more FIDs have the same vid as we could reuse the volume. This is complicated by cell aliasing whereby we may have multiple cells sharing a volume and can therefore have multiple callback interests for any particular volume ID.
At the moment afs_break_one_callback() scans the entire list of volumes we're getting from a server and breaks the appropriate callback in every matching volume, regardless of cell. This scan is done for every FID.
Optimise callback breaking by the following means:
(1) Sort the FID list by vid so that all FIDs belonging to the same volume are clumped together.
This is done through an indirection table: we cannot insertion-sort the afs_callback_break array while decoding FIDs into it, because we subsequently have to decode callback info into it that corresponds by array index only.
We also don't really want to bubblesort afterwards if we can avoid it.
(2) Sort the server->cb_interests array by vid so that all the matching volumes are grouped together. This permits the scan to stop after finding a record that has a higher vid.
(3) When breaking FIDs, we try to keep server->cb_break_lock as long as possible, caching the start point in the array for that volume group as long as possible.
It might make sense to add another layer in that list and have a refcounted volume ID anchor that has the matching interests attached to it rather than being in the list. This would allow the lock to be dropped without losing the cursor.
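The indirection-table idea from point (1) can be sketched in userspace C. Note the swapped-in technique: this uses qsort() on an array of pointers rather than whatever sort the kernel code performs, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical decoded callback-break record. */
struct cb_break {
	uint64_t vid;	/* volume ID */
	uint64_t vnode;	/* vnode ID within the volume */
};

/* Order two index entries by the vid of the records they point at. */
static int cmp_by_vid(const void *a, const void *b)
{
	const struct cb_break *x = *(const struct cb_break *const *)a;
	const struct cb_break *y = *(const struct cb_break *const *)b;

	return (x->vid > y->vid) - (x->vid < y->vid);
}

/*
 * Fill index[] with pointers into breaks[] and sort only the pointers, so
 * that breaks[] keeps its decode order and later per-index decoding still
 * lines up.
 */
static void build_sorted_index(struct cb_break *breaks, size_t nr,
			       struct cb_break **index)
{
	for (size_t i = 0; i < nr; i++)
		index[i] = &breaks[i];
	qsort(index, nr, sizeof(index[0]), cmp_by_vid);
}

/* Count runs of equal vid: one volume lookup would be needed per run. */
static size_t count_volume_groups(struct cb_break *const *index, size_t nr)
{
	size_t groups = 0;

	for (size_t i = 0; i < nr; i++)
		if (i == 0 || index[i]->vid != index[i - 1]->vid)
			groups++;
	return groups;
}
```

Walking index[] in order visits all FIDs of one volume consecutively, so the volume only has to be located once per group instead of once per FID.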
Signed-off-by: David Howells <[email protected]>
|
| #
68251f0a |
| 12-May-2018 |
David Howells <[email protected]> |
afs: Fix whole-volume callback handling
It's possible for an AFS file server to issue a whole-volume notification that callbacks on all the vnodes in the volume have been broken. This is done for R/O and backup volumes (which don't have per-file callbacks) and for things like a volume being taken offline.
Fix callback handling to detect whole-volume notifications, to track it across operations and to check it during inode validation.
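One way to picture the detect, track and check steps is sketched below. Everything here is an assumption made for illustration: the structs, the field names, and in particular the rule that a break with a zero vnode ID and zero uniquifier denotes the whole volume, which the commit message does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-volume state: counts whole-volume breaks seen so far. */
struct volume {
	unsigned int cb_v_break;
};

/* Hypothetical per-vnode state recorded when a promise was last noted. */
struct vnode {
	unsigned int cb_v_break_seen;	/* volume counter at that time */
	bool promised;			/* callback promise currently held */
};

/* Assumed encoding: vnode 0 with uniquifier 0 means "the whole volume". */
static bool is_whole_volume_break(uint64_t vnode_id, uint32_t unique)
{
	return vnode_id == 0 && unique == 0;
}

/* Detect and track: bump the volume counter on a whole-volume break. */
static void note_break(struct volume *vol, uint64_t vnode_id, uint32_t unique)
{
	if (is_whole_volume_break(vnode_id, unique))
		vol->cb_v_break++;
}

/* Check during validation: the promise holds only if no break intervened. */
static bool vnode_still_valid(const struct vnode *vn, const struct volume *vol)
{
	return vn->promised && vn->cb_v_break_seen == vol->cb_v_break;
}
```

Keeping a counter on the volume avoids touching every cached vnode when the notification arrives; each vnode notices the change lazily at its next validation.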
Fixes: c435ee34551e ("afs: Overhaul the callback handling")
Signed-off-by: David Howells <[email protected]>
|