|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3 |
|
| #
417f01e7 |
| 18-Apr-2025 |
Kent Overstreet <[email protected]> |
bcachefs: Error ratelimiting is no longer only during fsck
We now more often do repair automatically, without the user invoking fsck - and sometimes that can involve fixing lots of errors, so let's avoid flooding the dmesg log.
Signed-off-by: Kent Overstreet <[email protected]>
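The ratelimiting described above can be pictured with a small userspace sketch. This is not the bcachefs code: `struct err_state`, `ERR_PRINT_LIMIT`, `err_should_print()` and `report_err()` are hypothetical names, and the threshold of 10 is an assumption chosen for illustration. The point is only that every error is still counted, while printing stops once a per-error limit is reached.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-error-type state; not the bcachefs structures. */
struct err_state {
	unsigned	nr;		/* times this error has been seen */
	unsigned	suppressed;	/* messages dropped so far */
};

#define ERR_PRINT_LIMIT	10	/* assumed threshold, for illustration only */

/*
 * Count the error unconditionally, then report whether the caller
 * should print it.
 */
static bool err_should_print(struct err_state *s)
{
	s->nr++;
	if (s->nr <= ERR_PRINT_LIMIT)
		return true;
	s->suppressed++;
	return false;
}

static void report_err(struct err_state *s, const char *msg)
{
	if (!err_should_print(s))
		return;
	fprintf(stderr, "%s\n", msg);
	/* On the last printed instance, say that later ones will be dropped. */
	if (s->nr == ERR_PRINT_LIMIT)
		fprintf(stderr, "ratelimiting further instances of this error\n");
}
```

Keeping the count separate from the print decision means a repair that fixes thousands of errors still leaves an accurate total behind, even though the log only shows the first few.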
|
|
Revision tags: v6.15-rc2, v6.15-rc1 |
|
| #
6d77ce4a |
| 26-Mar-2025 |
Kent Overstreet <[email protected]> |
bcachefs: Better printing of inconsistency errors
Build up and emit the error message for an inconsistency error all at once, instead of spread over multiple printk calls, so they're not jumbled in the dmesg log.
Also, add better indenting.
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
7337f9f1 |
| 28-Mar-2025 |
Kent Overstreet <[email protected]> |
bcachefs: bch2_count_fsck_err()
Factor out a helper from __bch2_fsck_err(), for counting the error in the superblock and deciding whether to print or ratelimit - will be used to replace some log_fsck_err() calls, where we want to lift out printing the error message.
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
b00750c2 |
| 28-Mar-2025 |
Kent Overstreet <[email protected]> |
bcachefs: Better helpers for inconsistency errors
An inconsistency error often happens as part of an event with multiple error messages, and we want to build up one single error message with proper indenting to produce more readable log messages that don't get garbled.
Add new helpers that emit messages to a printbuf instead of printing them directly, next patch will convert to use them.
Signed-off-by: Kent Overstreet <[email protected]>
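A minimal userspace sketch of the "build the message in a buffer, emit it once" idea follows. `struct msgbuf`, `msgbuf_line()` and `msgbuf_emit()` are hypothetical stand-ins invented for this example, not the kernel's printbuf API; the fixed 512-byte buffer is likewise an assumption.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical fixed-size message buffer; a stand-in, not the kernel's printbuf. */
struct msgbuf {
	char		buf[512];
	size_t		pos;
	unsigned	indent;		/* spaces prepended to each new line */
};

/* Append one formatted line at the current indent; truncates if the buffer fills. */
static void msgbuf_line(struct msgbuf *b, const char *fmt, ...)
{
	va_list args;
	int n;

	for (unsigned i = 0; i < b->indent && b->pos + 1 < sizeof(b->buf); i++)
		b->buf[b->pos++] = ' ';

	va_start(args, fmt);
	n = vsnprintf(b->buf + b->pos, sizeof(b->buf) - b->pos, fmt, args);
	va_end(args);

	if (n > 0) {
		size_t room = sizeof(b->buf) - b->pos - 1;

		b->pos += (size_t) n < room ? (size_t) n : room;
	}

	if (b->pos + 1 < sizeof(b->buf))
		b->buf[b->pos++] = '\n';
	b->buf[b->pos] = '\0';
}

/* A single write for the whole message, so other output cannot interleave with it. */
static void msgbuf_emit(const struct msgbuf *b)
{
	fputs(b->buf, stderr);
}
```

A caller would zero-initialise a `struct msgbuf`, add the headline, raise `indent` for the detail lines, and call `msgbuf_emit()` once at the end.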
|
|
Revision tags: v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5 |
|
| #
981e3801 |
| 26-Feb-2025 |
Kent Overstreet <[email protected]> |
bcachefs: Kick devices out after too many write IO errors
We're improving our handling of write errors - we shouldn't write degraded data just because a write failed once, we should retry it (on other devices, if possible).
But for this to work, we need to kick devices out when they're only returning errors - otherwise those retries will loop infinitely.
This adds a configurable timeout - if writes are failing for too long, we'll set that device read-only.
In the future we should also implement more tracking and another knob for an "allowed error rate", so that we can kick out drives that are acting "unhealthy".
Another thing we'll want is a mechanism (likely in userspace) for bringing a device back in after a transient error - perhaps a cable was jiggled, or there was a controller reset.
After transient errors we also need a mechanism to walk (from the journal) recent btree updates that weren't flushed to that device and treat them as "degraded", since unflushed data may well not have been written. Out of scope for this patch, but becoming relevant.
Signed-off-by: Kent Overstreet <[email protected]>
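One way to read the timeout rule, sketched in userspace C. The struct, its field names and `account_write_completion()` are hypothetical and are not taken from the patch; the sketch assumes that "failing for too long" means "no successful write for at least the configured timeout", which is why successes have to be accounted as well as failures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device state; field names are illustrative only. */
struct dev_io_state {
	uint64_t	last_success;		/* time of the last successful write, in seconds */
	uint64_t	write_error_timeout;	/* configurable; 0 disables the check */
	bool		read_only;
};

/*
 * Called on every write completion, success or failure. The device is
 * set read-only only once writes have been failing continuously for at
 * least the timeout, so a single failed write changes nothing.
 */
static void account_write_completion(struct dev_io_state *d, uint64_t now, bool failed)
{
	if (!failed) {
		d->last_success = now;
		return;
	}

	if (d->write_error_timeout &&
	    now - d->last_success >= d->write_error_timeout)
		d->read_only = true;
}
```

Because any success resets the clock, a device that fails intermittently is never kicked out by this rule alone, which matches the commit's remark that an "allowed error rate" knob would be a separate, later addition.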
|
| #
b31c0704 |
| 28-Feb-2025 |
Kent Overstreet <[email protected]> |
bcachefs: Finish bch2_account_io_completion() conversions
More prep work for automatically kicking devices out after too many IO errors.
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
3526bca3 |
| 28-Feb-2025 |
Kent Overstreet <[email protected]> |
bcachefs: bch2_account_io_completion()
We need to start accounting successes for every IO, not just failures, so introduce a unified hook for io completion accounting and convert io_read.c.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Revision tags: v6.14-rc4, v6.14-rc3 |
|
| #
1ccbcd32 |
| 10-Feb-2025 |
Kent Overstreet <[email protected]> |
bcachefs: bch2_write_op_error() now prints info about data update
A user has been seeing the "error verifying existing checksum while rewriting existing data (memory corruption?)" error.
This generally indicates a hardware issue (and that may be the case here), but it might also indicate a bug, in which case we need more information to look for patterns.
Reported-by: Roland Vet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Revision tags: v6.14-rc2 |
|
| #
06284963 |
| 07-Feb-2025 |
Kent Overstreet <[email protected]> |
bcachefs: bch2_inum_offset_err_msg_trans() no longer handles transaction restarts
We're starting to use error messages with paths in fsck_errors(), where we do not want nested transaction restart handling, so let's prepare for that.
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
45f0e6c8 |
| 07-Feb-2025 |
Kent Overstreet <[email protected]> |
bcachefs: bch2_indirect_extent_missing_error() prints path, not just inode number
We want all error messages converted to print paths, not just inode numbers - users want this information, and it speeds up debugging too.
Auditing and converting all error messages is going to be a big project, so for the moment we're just doing this incrementally.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Revision tags: v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1 |
|
| #
f7727a67 |
| 28-Sep-2024 |
Kent Overstreet <[email protected]> |
bcachefs: bch2_inum_to_path()
Add a function for walking backpointers to find a path from a given inode number, and convert various error messages to use it.
Signed-off-by: Kent Overstreet <[email protected]>
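The backpointer walk can be illustrated with a toy in-memory model. Everything here is an assumption made for the example: `struct toy_inode`, `ROOT_INUM`, `lookup()` and this `inum_to_path()` are hypothetical and share only the idea with bch2_inum_to_path(), which works against the btree rather than an array. Each inode records its parent directory and its name there, and the path is built right to left while following parents up to the root.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory inode table; real inodes live in a btree. */
struct toy_inode {
	uint64_t	inum;
	uint64_t	parent;		/* backpointer to the containing directory */
	const char	*name;		/* name within that directory */
};

#define ROOT_INUM 1

static const struct toy_inode *lookup(const struct toy_inode *tbl, size_t nr, uint64_t inum)
{
	for (size_t i = 0; i < nr; i++)
		if (tbl[i].inum == inum)
			return &tbl[i];
	return NULL;
}

/*
 * Walk parent backpointers from inum up to the root, prepending each
 * name. Returns 0 on success, -1 if a backpointer dangles, the walk
 * takes more than nr steps (a loop), or the path does not fit.
 */
static int inum_to_path(const struct toy_inode *tbl, size_t nr,
			uint64_t inum, char *out, size_t out_len)
{
	char tmp[256];
	size_t pos = sizeof(tmp) - 1;

	tmp[pos] = '\0';

	for (size_t steps = 0; inum != ROOT_INUM; steps++) {
		const struct toy_inode *i = lookup(tbl, nr, inum);
		size_t len;

		if (!i || steps >= nr)
			return -1;

		len = strlen(i->name);
		if (len + 1 > pos)
			return -1;

		pos -= len;
		memcpy(tmp + pos, i->name, len);
		tmp[--pos] = '/';
		inum = i->parent;
	}

	if (tmp[pos] == '\0')		/* the root itself */
		tmp[--pos] = '/';

	if (sizeof(tmp) - pos > out_len)
		return -1;
	memcpy(out, tmp + pos, sizeof(tmp) - pos);
	return 0;
}
```

The failure cases matter as much as the success case here: an error message is usually being printed because something is already wrong, so a dangling or looping backpointer has to produce an error return rather than an endless walk.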
|
| #
052210c3 |
| 28-Nov-2024 |
Kent Overstreet <[email protected]> |
bcachefs: Don't error out when logging fsck error
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
a6f4794f |
| 27-Nov-2024 |
Kent Overstreet <[email protected]> |
bcachefs: struct bkey_validate_context
Add a new parameter to bkey validate functions, and use it to improve invalid bkey error messages: we can now print the btree and depth it came from, or if it came from the journal, or is a btree root.
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
c8e58813 |
| 27-Oct-2024 |
Kent Overstreet <[email protected]> |
bcachefs: bch2_bucket_do_index(): inconsistent_err -> fsck_err
Factor out a common helper, need_discard_or_freespace_err(), which is now used by both fsck and the runtime checks, and can repair.
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
eb73e777 |
| 29-Oct-2024 |
Kent Overstreet <[email protected]> |
bcachefs: Kill FSCK_NEED_FSCK
If we find an error that indicates that we need to run fsck, we can specify that directly with run_explicit_recovery_pass().
These are now log_fsck_err() calls: we're just logging in the superblock that an error occurred - and possibly doing an emergency shutdown, depending on policy.
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
1f282f1e |
| 12-Nov-2024 |
Kent Overstreet <[email protected]> |
bcachefs: delete dead code
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
658c82f4 |
| 04-Oct-2024 |
Kent Overstreet <[email protected]> |
bcachefs: bkey errors are only AUTOFIX during read
Newly generated keys, in the transaction commit path or write path, should not be AUTOFIX; those indicate bugs that we need to fail fast for.
Fixes: 5612daafb764 ("bcachefs: Fix fsck warnings from bkey validation")
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
5612daaf |
| 26-Sep-2024 |
Kent Overstreet <[email protected]> |
bcachefs: Fix fsck warnings from bkey validation
__bch2_fsck_err() warns if the current task has a btree_trans object and it wasn't passed in, because if it has to prompt for user input it has to be able to unlock it.
But plumbing the btree_trans through bkey_validate(), as well as transaction restarts, is problematic - so instead make bkey fsck errors FSCK_AUTOFIX, which doesn't need to warn.
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Revision tags: v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4 |
|
| #
d97de0d0 |
| 13-Aug-2024 |
Kent Overstreet <[email protected]> |
bcachefs: Make bkey_fsck_err() a wrapper around fsck_err()
bkey_fsck_err() was added as an interface that looks like fsck_err(), but previously all it did was ensure that the appropriate error counter was incremented in the superblock.
This is a cleanup and bugfix patch that converts it to a wrapper around fsck_err(). This is needed to fix an issue with the upgrade path to disk_accounting_v3, where the "silent fix" error list now includes bkey_fsck errors; fsck_err() handles this in a unified way, and since we need to change printing of bkey fsck errors from the caller to the inner bkey_fsck_err() calls, this ends up being a pretty big change.
Also, rename .invalid() methods to .validate(), for clarity, while we're changing the function signature anyway (to drop the printbuf argument).
Signed-off-by: Kent Overstreet <[email protected]>
|
|
Revision tags: v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4 |
|
| #
a850bde6 |
| 09-Feb-2024 |
Kent Overstreet <[email protected]> |
bcachefs: fsck_err() may now take a btree_trans
fsck_err() now optionally takes a btree_trans; if the current thread has one, it is required that it be passed.
The next patch will use this to unlock when waiting for user input.
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
e76a2b65 |
| 07-Jun-2024 |
Kent Overstreet <[email protected]> |
bcachefs: add might_sleep() annotations for fsck_err()
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
33dfafa9 |
| 19-Jun-2024 |
Kent Overstreet <[email protected]> |
bcachefs: Fix safe errors by default
i.e. the start of automatic self healing:
If errors=continue or fix_safe, we now automatically fix simple errors without user intervention.
New error action option: fix_safe
This replaces the existing errors=ro option, which gets a new slot, i.e. existing errors=ro users now get errors=fix_safe.
This is currently only enabled for a limited set of errors - initially just disk accounting; errors we would never not want to fix, and we don't want to require user intervention (i.e. to make sure a bug report gets filed).
Errors will still be counted in the superblock, so we (developers) will still know they've been occurring if a bug report gets filed (as bug reports typically include the errors superblock section).
Eventually we'll be enabling this for a much wider set of errors, after we've done thorough error injection testing.
Signed-off-by: Kent Overstreet <[email protected]>
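The policy in this message reduces to a small decision, sketched here with hypothetical names. The enum values, `struct err_counter` and `handle_err()` are assumptions for illustration and do not reflect how bcachefs encodes the option; the `safe_to_fix` flag stands for membership in the "limited set of errors" the message mentions.

```c
#include <stdbool.h>

/* Hypothetical encoding of the errors= mount option. */
enum err_action {
	ERR_ACTION_CONTINUE,
	ERR_ACTION_FIX_SAFE,
	ERR_ACTION_PANIC,
	ERR_ACTION_RO,
};

/* Hypothetical stand-in for the per-error counter kept in the superblock. */
struct err_counter {
	unsigned long	nr;
};

/*
 * Count the error regardless of what happens next, then report whether
 * it should be repaired without asking the user.
 */
static bool handle_err(struct err_counter *c, enum err_action action, bool safe_to_fix)
{
	c->nr++;
	return safe_to_fix &&
	       (action == ERR_ACTION_CONTINUE || action == ERR_ACTION_FIX_SAFE);
}
```

Counting before deciding is what keeps silent repairs visible to developers later: the repair leaves no prompt behind, but the counter still moves.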
|
| #
79032b07 |
| 23-Mar-2024 |
Kent Overstreet <[email protected]> |
bcachefs: Improved topology repair checks
Consolidate bch2_gc_check_topology() and btree_node_interior_verify(), and replace them with an improved version, bch2_btree_node_check_topology().
This checks that children of an interior node correctly span the full range of the parent node with no overlaps.
Also, ensure that topology repairs at runtime are always a fatal error; in particular, this adds a check in btree_iter_down() - if we don't find a key while walking down the btree, that's indicative of a topology error and should be flagged as such, not a null ptr deref.
Some checks in btree_update_interior.c remain BUG_ON()s, because we already checked the node for topology errors when starting the update, and the assertions indicate that we _just_ corrupted the btree node - i.e. the problem can't be existing on-disk corruption; they indicate an actual algorithmic bug.
In the future, we'll be annotating the fsck errors list with which recovery pass corrects them; the open coded "run explicit recovery pass or fatal error" in bch2_btree_node_check_topology() will in the future be done for every fsck_err() call.
Signed-off-by: Kent Overstreet <[email protected]>
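The spanning rule ("children cover the full range of the parent with no overlaps") can be checked in a few lines once keys are simplified to integers. This is a sketch under that assumption: `struct toy_node` and `check_topology()` are hypothetical, and ranges are taken to be inclusive `[min, max]`, whereas real btree nodes use multi-field key positions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node covering the inclusive key range [min, max]. */
struct toy_node {
	uint64_t	min;
	uint64_t	max;
};

/*
 * Children must be sorted, start at the parent's min, end at the
 * parent's max, and each child must begin exactly one past the end of
 * the previous one: no gaps and no overlaps.
 */
static bool check_topology(const struct toy_node *parent,
			   const struct toy_node *children, size_t nr)
{
	uint64_t expected_min = parent->min;

	if (!nr)
		return false;

	for (size_t i = 0; i < nr; i++) {
		if (children[i].min != expected_min ||
		    children[i].max < children[i].min ||
		    children[i].max > parent->max)
			return false;

		if (i + 1 < nr) {
			if (children[i].max == parent->max)
				return false;	/* no room left for later children */
			expected_min = children[i].max + 1;
		}
	}

	return children[nr - 1].max == parent->max;
}
```

Requiring each child to start exactly at the expected position catches gaps and overlaps with the same comparison, so the check stays a single pass over the children.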
|
| #
3ed94062 |
| 18-Mar-2024 |
Kent Overstreet <[email protected]> |
bcachefs: Improve bch2_fatal_error()
error messages should always include __func__
Signed-off-by: Kent Overstreet <[email protected]>
|
| #
52946d82 |
| 06-Feb-2024 |
Kent Overstreet <[email protected]> |
bcachefs: Kill more -EIO error codes
This converts -EIOs related to btree node errors to private error codes, which will help with some ongoing debugging by giving us better error messages.
Signed-off-by: Kent Overstreet <[email protected]>
|