The current journal_max_cmp() and journal_min_cmp() assume that a smaller fifo
index indicates an older journal entry, but this is only true as long as the
fifo index has not wrapped around.

The fifo structure journal.pin is implemented on top of a circular buffer:
once the back index reaches the highest location of the circular buffer, it
wraps around to 0. After such a wrap-around, a smaller fifo index may be
associated with a newer journal entry, so the btree node holding the oldest
journal entry will not be selected by btree_flush_write() to be flushed out to
the cache device. As a result, the oldest journal entries may never get a
chance to be written to the cache device, and after a reboot
bch_journal_replay() may complain that some journal entries are missing.

This patch handles the fifo index wrap-around conditions properly, so that in
btree_flush_write() the btree node with the oldest journal entry can be
selected from c->flush_btree correctly.
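
For reference, the ordering rule implemented by the new macros below can be
written as a stand-alone helper. This is an illustrative sketch only, not part
of the patch; the function name idx_older() is made up:

	#include <stdbool.h>

	/* Return true if fifo index 'l_idx' refers to an older journal
	 * entry than 'r_idx', given the current front/back positions of
	 * the circular buffer. */
	static bool idx_older(unsigned int l_idx, unsigned int r_idx,
			      unsigned int front, unsigned int back)
	{
		bool older = (l_idx < r_idx);	/* holds while back has not wrapped */

		/* back < front means the back pointer has wrapped around to 0 */
		if (back < front) {
			if (l_idx <= back && r_idx >= front)
				older = false;	/* l is in the wrapped (newer) region */
			else if (l_idx >= front && r_idx <= back)
				older = true;	/* r is in the wrapped (newer) region */
		}
		return older;
	}

journal_max_cmp() below follows this "l is older than r" rule, and
journal_min_cmp() the reverse.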
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/journal.c | 47 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 41 insertions(+), 6 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index bdb6f9cefe48..bc0e01151155 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -464,12 +464,47 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
}
/* Journalling */
-#define journal_max_cmp(l, r) \
- (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
- fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
-#define journal_min_cmp(l, r) \
- (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
- fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
+#define journal_max_cmp(l, r) \
+({ \
+ int l_idx, r_idx, f_idx, b_idx; \
+ bool _ret = true; \
+ \
+ l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
+ r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
+ f_idx = c->journal.pin.front; \
+ b_idx = c->journal.pin.back; \
+ \
+ _ret = (l_idx < r_idx); \
+ /* in case fifo back pointer is swapped */ \
+ if (b_idx < f_idx) { \
+ if (l_idx <= b_idx && r_idx >= f_idx) \
+ _ret = false; \
+ else if (l_idx >= f_idx && r_idx <= b_idx) \
+ _ret = true; \
+ } \
+ _ret; \
+})
+
+#define journal_min_cmp(l, r) \
+({ \
+ int l_idx, r_idx, f_idx, b_idx; \
+ bool _ret = true; \
+ \
+ l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
+ r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
+ f_idx = c->journal.pin.front; \
+ b_idx = c->journal.pin.back; \
+ \
+ _ret = (l_idx > r_idx); \
+ /* in case fifo back pointer is swapped */ \
+ if (b_idx < f_idx) { \
+ if (l_idx <= b_idx && r_idx >= f_idx) \
+ _ret = true; \
+ else if (l_idx >= f_idx && r_idx <= b_idx) \
+ _ret = false; \
+ } \
+ _ret; \
+})
static void btree_flush_write(struct cache_set *c)
{
--
2.16.4
When the bcache journal is initialized while a cache set starts running,
c->journal.blocks_free is initialized to 0. Then, during journal replay, if
journal_meta() is called and an empty jset is written to the cache device,
journal_reclaim() is called. If there is an available journal bucket to
reclaim, c->journal.blocks_free is set to the number of blocks of a journal
bucket, which is c->sb.bucket_size >> c->block_bits.

Most of the time the above process works correctly, except for the condition
when the journal space is almost full. "Almost full" means there is no free
journal bucket, but there are still free blocks in the last available bucket,
indexed by ja->cur_idx.

If the system crashes or reboots when the journal space is almost full, a
problem arises. During cache set reload after the reboot,
c->journal.blocks_free is initialized to 0. When the journal replay process
writes the bcache journal, journal_reclaim() is called to reclaim an available
journal bucket and to set c->journal.blocks_free to
c->sb.bucket_size >> c->block_bits. But there is no fully free bucket to
reclaim in journal_reclaim(), so c->journal.blocks_free stays 0. If the first
journal entry processed by journal_replay() causes a btree split and requires
writing journal space via journal_meta(), journal_meta() ends up in an
infinite loop trying to reclaim a journal bucket, and blocks the whole cache
set from running.

Such a buggy situation can be solved if we do the following things before
journal replay starts,
- Recover the value c->journal.blocks_free had in the last run time, and use
  it as the initial value of the current c->journal.blocks_free.
- Recover the value ja->cur_idx had in the last run time, and use it to set
  the KEY_PTR of the current c->journal.key as its initial value.

After c->journal.blocks_free and c->journal.key are recovered, in the
condition when journal space is almost full and the cache set is reloaded,
meta journal entries from journal replay can be written into the free blocks
of the last available journal bucket; then old journal entries can be replayed
and reclaimed for further journaling requests.

This patch adds bch_journal_key_reload() to recover the journal blocks_free
and key ptr values for the above purpose. bch_journal_key_reload() is called
in bch_journal_read() before the journal is replayed by bch_journal_replay().
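
For clarity, the recovery logic boils down to the simplified shape below. This
is an illustrative sketch only, not part of the patch: the struct and helper
names are hypothetical, and the real bch_journal_key_reload() in the diff
reads the bucket via bio and validates each jset's magic and checksum before
counting it.

	#include <stdbool.h>
	#include <stddef.h>

	struct jset_view {
		size_t blocks;		/* set_blocks(j, block_bytes(c)) */
		bool valid;		/* jset magic and checksum verified */
	};

	static unsigned int recover_blocks_free(const struct jset_view *sets,
						size_t nr_sets,
						unsigned int bucket_blocks)
	{
		unsigned int used_blocks = 0;
		size_t i;

		for (i = 0; i < nr_sets; i++) {
			if (!sets[i].valid)
				break;		/* stop at the first invalid jset */
			used_blocks += sets[i].blocks;
		}

		/* free blocks left in the last, partially used journal bucket */
		return bucket_blocks - used_blocks;
	}

Here bucket_blocks corresponds to c->sb.bucket_size >> c->block_bits, and the
result is what gets stored into c->journal.blocks_free.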
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/journal.c | 87 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 87 insertions(+)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 5180bed911ef..a6deb16c15c8 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -143,6 +143,89 @@ reread: left = ca->sb.bucket_size - offset;
return ret;
}
+static int bch_journal_key_reload(struct cache_set *c)
+{
+ struct cache *ca;
+ unsigned int iter, n = 0;
+ struct bkey *k = &c->journal.key;
+ int ret = 0;
+
+ for_each_cache(ca, c, iter) {
+ struct journal_device *ja = &ca->journal;
+ struct bio *bio = &ja->bio;
+ struct jset *j, *data = c->journal.w[0].data;
+ struct closure cl;
+ unsigned int len, left;
+ unsigned int offset = 0, used_blocks = 0;
+ sector_t bucket = bucket_to_sector(c, ca->sb.d[ja->cur_idx]);
+
+ closure_init_stack(&cl);
+
+ while (offset < ca->sb.bucket_size) {
+reread: left = ca->sb.bucket_size - offset;
+ len = min_t(unsigned int,
+ left, PAGE_SECTORS << JSET_BITS);
+
+ bio_reset(bio);
+ bio->bi_iter.bi_sector = bucket + offset;
+ bio_set_dev(bio, ca->bdev);
+ bio->bi_iter.bi_size = len << 9;
+
+ bio->bi_end_io = journal_read_endio;
+ bio->bi_private = &cl;
+ bio_set_op_attrs(bio, REQ_OP_READ, 0);
+ bch_bio_map(bio, data);
+
+ closure_bio_submit(c, bio, &cl);
+ closure_sync(&cl);
+
+ j = data;
+ while (len) {
+ size_t blocks, bytes = set_bytes(j);
+
+ if (j->magic != jset_magic(&ca->sb))
+ goto out;
+
+ if (bytes > left << 9 ||
+ bytes > PAGE_SIZE << JSET_BITS) {
+ pr_err("jset may be correpted: too big");
+ ret = -EIO;
+ goto err;
+ }
+
+ if (bytes > len << 9)
+ goto reread;
+
+ if (j->csum != csum_set(j)) {
+ pr_err("jset may be corrupted: bad csum");
+ ret = -EIO;
+ goto err;
+ }
+
+ blocks = set_blocks(j, block_bytes(c));
+ used_blocks += blocks;
+
+ offset += blocks * ca->sb.block_size;
+ len -= blocks * ca->sb.block_size;
+ j = ((void *) j) + blocks * block_bytes(ca);
+ }
+ }
+out:
+ c->journal.blocks_free =
+ (c->sb.bucket_size >> c->block_bits) -
+ used_blocks;
+
+ k->ptr[n++] = MAKE_PTR(0, bucket, ca->sb.nr_this_dev);
+ }
+
+ BUG_ON(n == 0);
+ bkey_init(k);
+ SET_KEY_PTRS(k, n);
+
+err:
+ return ret;
+}
+
int bch_journal_read(struct cache_set *c, struct list_head *list)
{
#define read_bucket(b) \
@@ -268,6 +351,10 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
struct journal_replay,
list)->j.seq;
+ /* Initial value of c->journal.blocks_free should be 0 */
+ BUG_ON(c->journal.blocks_free != 0);
+ ret = bch_journal_key_reload(c);
+
return ret;
#undef read_bucket
}
--
2.16.4
In journal_reclaim(), ja->cur_idx of each cache is updated to reclaim
available journal buckets. The variable 'int n' counts how many caches
successfully reclaim a bucket, and n is then stored into c->journal.key by
SET_KEY_PTRS(). Later, in journal_write_unlocked(), a for_each_cache() loop
writes the jset data onto each cache.

The problem is, if all journal buckets on each cache are full, the following
code in journal_reclaim(),
529 for_each_cache(ca, c, iter) {
530 struct journal_device *ja = &ca->journal;
531 unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533 /* No space available on this device */
534 if (next == ja->discard_idx)
535 continue;
536
537 ja->cur_idx = next;
538 k->ptr[n++] = MAKE_PTR(0,
539 bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540 ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);
If there is no available bucket to reclaim, the if() condition at line 534 is
always true and n remains 0. Then at line 544, SET_KEY_PTRS() sets the
KEY_PTRS field of c->journal.key to 0.

Setting the KEY_PTRS field of c->journal.key to 0 is wrong, because in
journal_write_unlocked() the journal data is written in the following loop,
649 for (i = 0; i < KEY_PTRS(k); i++) {
650-671 submit journal data to cache device
672 }
If the KEY_PTRS field is set to 0 in journal_reclaim(), the journal data won't
be written to the cache device here. If the system crashes or reboots before
the bkeys of the lost journal entries are written into btree nodes, data
corruption will be reported during bcache reload after rebooting the system.

Indeed, since there is only one cache in a cache set, there is no need to set
the KEY_PTRS field in journal_reclaim() at all. But in order to keep the
for_each_cache() logic consistent for now, this patch fixes the above problem
by not setting KEY_PTRS of the journal key to 0 when there is no bucket
available to reclaim.
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/bcache/journal.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 6e18057d1d82..5180bed911ef 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -541,11 +541,11 @@ static void journal_reclaim(struct cache_set *c)
ca->sb.nr_this_dev);
}
- bkey_init(k);
- SET_KEY_PTRS(k, n);
-
- if (n)
+ if (n) {
+ bkey_init(k);
+ SET_KEY_PTRS(k, n);
c->journal.blocks_free = c->sb.bucket_size >> c->block_bits;
+ }
out:
if (!journal_full(&c->journal))
__closure_wake_up(&c->journal.wait);
@@ -672,6 +672,9 @@ static void journal_write_unlocked(struct closure *cl)
ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
}
+ /* If KEY_PTRS(k) == 0, this jset gets lost in air */
+ BUG_ON(i == 0);
+
atomic_dec_bug(&fifo_back(&c->journal.pin));
bch_journal_next(&c->journal);
journal_reclaim(c);
--
2.16.4
For a cyclic DMA, a residue of 0 is not an indication of a completed DMA. In
the case of cyclic DMA, make sure that dma_set_residue() is called, so that a
residue of 0 is forwarded correctly to the caller.
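
As background for why the residue must be forwarded even when it is 0: a
typical cyclic client (for example an audio driver) derives the current
position inside the ring buffer from the residue. The sketch below is
illustrative only, not part of the patch; the helper name and the buf_len
parameter are made up:

	#include <linux/dmaengine.h>

	static size_t cyclic_buf_position(struct dma_chan *chan,
					  dma_cookie_t cookie, size_t buf_len)
	{
		struct dma_tx_state state;
		enum dma_status status;

		status = dmaengine_tx_status(chan, cookie, &state);
		if (status == DMA_COMPLETE)
			return 0;	/* a running cyclic transfer never reports this */

		/* the residue counts down from buf_len as the transfer progresses */
		return buf_len - state.residue;
	}

With the fix, such a caller keeps receiving a meaningful residue (including 0)
instead of a premature DMA_COMPLETE.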
Fixes: 3544d2878817 ("dmaengine: rcar-dmac: use result of updated get_residue in tx_status")
Signed-off-by: Dirk Behme <dirk.behme@de.bosch.com>
Signed-off-by: Achim Dahlhoff <Achim.Dahlhoff@de.bosch.com>
Signed-off-by: Hiroyuki Yokoyama <hiroyuki.yokoyama.vx@renesas.com>
Signed-off-by: Yao Lihua <ylhuajnu@outlook.com>
Cc: <stable@vger.kernel.org> # v4.8+
---
Note: Patch done against mainline v5.0
Changes in v2: None
Changes in v3: Move reading rchan into the spin lock protection.
drivers/dma/sh/rcar-dmac.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/dma/sh/rcar-dmac.c b/drivers/dma/sh/rcar-dmac.c
index 2b4f25698169..54810ffd95e2 100644
--- a/drivers/dma/sh/rcar-dmac.c
+++ b/drivers/dma/sh/rcar-dmac.c
@@ -1368,6 +1368,7 @@ static enum dma_status rcar_dmac_tx_status(struct dma_chan *chan,
enum dma_status status;
unsigned long flags;
unsigned int residue;
+ bool cyclic;
status = dma_cookie_status(chan, cookie, txstate);
if (status == DMA_COMPLETE || !txstate)
@@ -1375,10 +1376,11 @@ static enum dma_status rcar_dmac_tx_status(struct dma_chan *chan,
spin_lock_irqsave(&rchan->lock, flags);
residue = rcar_dmac_chan_get_residue(rchan, cookie);
+ cyclic = rchan->desc.running ? rchan->desc.running->cyclic : false;
spin_unlock_irqrestore(&rchan->lock, flags);
/* if there's no residue, the cookie is complete */
- if (!residue)
+ if (!residue && !cyclic)
return DMA_COMPLETE;
dma_set_residue(txstate, residue);
--
2.20.0
In the commit referenced by the Fixes tag, removing SIGKILL from each thread's
signal mask and executing "goto fatal" directly skips the call to
trace_signal_deliver(). As a result, the delivery tracking of the SIGKILL
signal becomes inaccurate.

Therefore, we need to call trace_signal_deliver() before "goto fatal", right
after the sigdelset().

Note: the action for SIGKILL must be SIG_DFL, and SEND_SIG_NOINFO matches the
fact that SIGKILL doesn't carry any siginfo.
Acked-by: Christian Brauner <christian@brauner.io>
Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT")
Signed-off-by: Zhenliang Wei <weizhenliang@huawei.com>
---
kernel/signal.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/signal.c b/kernel/signal.c
index 227ba170298e..0f69ada376ef 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2441,6 +2441,7 @@ bool get_signal(struct ksignal *ksig)
if (signal_group_exit(signal)) {
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
+ trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO, SIG_DFL);
recalc_sigpending();
goto fatal;
}
--
2.14.1.windows.1
The patch titled
Subject: signal: trace_signal_deliver when signal_group_exit
has been removed from the -mm tree. Its filename was
signal-trace_signal_deliver-when-signal_group_exit.patch
This patch was dropped because it had testing failures
------------------------------------------------------
From: Zhenliang Wei <weizhenliang@huawei.com>
Subject: signal: trace_signal_deliver when signal_group_exit
In the commit referenced by the Fixes tag, removing SIGKILL from each thread's
signal mask and executing "goto fatal" directly skips the call to
trace_signal_deliver(). As a result, the delivery tracking of the SIGKILL
signal becomes inaccurate.

Therefore, we need to call trace_signal_deliver() before "goto fatal", right
after the sigdelset().

Note: the action for SIGKILL must be SIG_DFL, and SEND_SIG_NOINFO matches the
fact that SIGKILL doesn't carry any siginfo.
Link: http://lkml.kernel.org/r/20190422145950.78056-1-weizhenliang@huawei.com
Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT")
Signed-off-by: Zhenliang Wei <weizhenliang@huawei.com>
Acked-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ivan Delalande <colona@arista.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
kernel/signal.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/signal.c~signal-trace_signal_deliver-when-signal_group_exit
+++ a/kernel/signal.c
@@ -2441,6 +2441,7 @@ relock:
if (signal_group_exit(signal)) {
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
+ trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO, SIG_DFL);
recalc_sigpending();
goto fatal;
}
_
Patches currently in -mm which might be from weizhenliang@huawei.com are