Hi Greg and Ted,
I've been looking at ext4 history the past couple of days seeing if the
patches that you attempted to apply from 4.17-rc1 were relevant and I
noticed a couple from previous 4.x versions that seem like they should
be applied here, as they are clean picks and tagged for stable:
74dae4278546 ("ext4: fix crashes in dioread_nolock mode")
c755e251357a ("ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()")
I've been running them for the day and not had any issues (though I
didn't have any before). If there are any objections, let me know!
Thanks!
Nathan
These are all the ext4 patches that were tagged for -stable and failed
to apply to 3.18.y.
Patch e40ff2138985 ("ext4: force revalidation of directory pointer after seekdir(2)")
was Cc'd to stable as well but it requires commmit ae5e165d855d
("fs: new API for handling inode->i_version") to be applied as well
which is neither a stable candidate nor under 100 lines so I've skipped e40ff2138985.
If somebody can suggest a backport of the commit which doesn't require ae5e165d855d, I'll
be glad.
Theodore Ts'o (3):
ext4: add validity checks for bitmap block numbers
ext4: fail ext4_iget for root directory if unallocated
ext4: don't allow r/w mounts if metadata blocks overlap the superblock
fs/ext4/balloc.c | 16 ++++++++++++++--
fs/ext4/ialloc.c | 8 +++++++-
fs/ext4/inode.c | 6 ++++++
fs/ext4/super.c | 6 ++++++
4 files changed, 33 insertions(+), 3 deletions(-)
--
2.15.0.2308.g658a28aa74af
Hi Greg,
Upstream commit cf0d53ba4947 ("vfio/pci: Virtualize Maximum Read Request
Size") and commit 523184972b28 ("vfio/pci: Virtualize Maximum Payload Size")
fixes nasty PCIe virtualization issues for platforms that support Maximum
Payload Size bigger than 128.
Issue shows up when a device is assigned to the guest machine as a passthrough.
Guest machine configures the MPS/MRRS settings to values that are incompatible
with the parent bridge device.
This causes PCIe transaction timeouts and AER errors to be spilled in the host
kernel.
Please apply commit cf0d53ba4947 and 523184972b28 to all affected releases to
fix the resulting regression.
Thanks,
Sinan
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
These are all the ext4 patches that were tagged for -stable and failed
to apply to 3.18.y.
Side note: Patch e15dc99dbb9c ("ALSA: pcm: Fix endless loop for XRUN recovery in OSS emulation")
which was tagged for -stable is not required on 3.18.y so I have skipped
the backport.
Theodore Ts'o (4):
ext4: add validity checks for bitmap block numbers
ext4: fail ext4_iget for root directory if unallocated
ext4: don't allow r/w mounts if metadata blocks overlap the superblock
ext4: force revalidation of directory pointer after seekdir(2)
fs/ext4/balloc.c | 16 ++++++++++++++--
fs/ext4/dir.c | 8 +++++---
fs/ext4/ialloc.c | 7 +++++++
fs/ext4/inode.c | 6 ++++++
fs/ext4/super.c | 6 ++++++
5 files changed, 38 insertions(+), 5 deletions(-)
--
2.15.0.2308.g658a28aa74af
The blk-mq timeout handling code ignores completions that occur after
blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
has reset rq->aborted_gstate. If a block driver timeout handler always
returns BLK_EH_RESET_TIMER then the result will be that the request
never terminates.
Fix this race as follows:
- Use the deadline instead of the request generation to detect whether
or not a request timer fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline instead
of the lowest two bits of 'gstate'.
- Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
enumeration member into a #define such that its type can be changed
into unsigned long. That allows to write & ~RQ_STATE_MASK instead of
~(unsigned long)RQ_STATE_MASK.
- Remove all request member variables that became superfluous due to
this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to this
patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
the RCU lock and unlock statements in blk_mq_complete_request() and
also the synchronize_rcu() call in the timeout handler.
Signed-off-by: Bart Van Assche <bart.vanassche(a)wdc.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Ming Lei <ming.lei(a)redhat.com>
Cc: Sagi Grimberg <sagi(a)grimberg.me>
Cc: Israel Rukshin <israelr(a)mellanox.com>,
Cc: Max Gurtovoy <maxg(a)mellanox.com>
Cc: <stable(a)vger.kernel.org> # v4.16
---
Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
handling and the actual timeout handling to avoid that timeout handling
starts while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
state concurrently. Use WRITE_ONCE() otherwise.
Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
there is now a nice and clean split between the legacy and blk-mq versions
of blk_add_timer().
- Changed the patch name and modified the patch description because there is
disagreement about whether or not the v4.16 blk-mq core can complete a
single request twice. Kept the "Cc: stable" tag because of
https://bugzilla.kernel.org/show_bug.cgi?id=199077.
Changes compared to v3 (see also https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
for blk-mq.
Changes compared to v2 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.
Changes compared to v1 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
the __deadline member to encode both the generation and state information.
block/blk-core.c | 6 --
block/blk-mq-debugfs.c | 1 -
block/blk-mq.c | 158 ++++++++++---------------------------------------
block/blk-mq.h | 85 +++++++++++++++++---------
block/blk-timeout.c | 89 ++++++++++++++++------------
block/blk.h | 13 ++--
include/linux/blkdev.h | 29 +++------
7 files changed, 154 insertions(+), 227 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index de90ecab61cd..730a8e3be7ce 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -200,12 +200,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
- seqcount_init(&rq->gstate_seq);
- u64_stats_init(&rq->aborted_gstate_sync);
- /*
- * See comment of blk_mq_init_request
- */
- WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
}
EXPORT_SYMBOL(blk_rq_init);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index adb8d6f00098..529383841b3b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -346,7 +346,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
- RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
};
#undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index bb7f59d319fa..6f20845827f4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -481,7 +481,8 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
- blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+ if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+ WARN_ON_ONCE(true);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
@@ -527,8 +528,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
- WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
- blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+ WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
@@ -577,36 +577,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
*srcu_idx = srcu_read_lock(hctx->srcu);
}
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
- unsigned long flags;
-
- /*
- * blk_mq_rq_aborted_gstate() is used from the completion path and
- * can thus be called from irq context. u64_stats_fetch in the
- * middle of update on the same CPU leads to lockup. Disable irq
- * while updating.
- */
- local_irq_save(flags);
- u64_stats_update_begin(&rq->aborted_gstate_sync);
- rq->aborted_gstate = gstate;
- u64_stats_update_end(&rq->aborted_gstate_sync);
- local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
- unsigned int start;
- u64 aborted_gstate;
-
- do {
- start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
- aborted_gstate = rq->aborted_gstate;
- } while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
- return aborted_gstate;
-}
-
/**
* blk_mq_complete_request - end I/O on a request
* @rq: the request being processed
@@ -618,27 +588,12 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
void blk_mq_complete_request(struct request *rq)
{
struct request_queue *q = rq->q;
- struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
- int srcu_idx;
if (unlikely(blk_should_fake_timeout(q)))
return;
- /*
- * If @rq->aborted_gstate equals the current instance, timeout is
- * claiming @rq and we lost. This is synchronized through
- * hctx_lock(). See blk_mq_timeout_work() for details.
- *
- * Completion path never blocks and we can directly use RCU here
- * instead of hctx_lock() which can be either RCU or SRCU.
- * However, that would complicate paths which want to synchronize
- * against us. Let stay in sync with the issue path so that
- * hctx_lock() covers both issue and completion paths.
- */
- hctx_lock(hctx, &srcu_idx);
- if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
+ if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE))
__blk_mq_complete_request(rq);
- hctx_unlock(hctx, srcu_idx);
}
EXPORT_SYMBOL(blk_mq_complete_request);
@@ -662,27 +617,7 @@ void blk_mq_start_request(struct request *rq)
wbt_issue(q->rq_wb, &rq->issue_stat);
}
- WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-
- /*
- * Mark @rq in-flight which also advances the generation number,
- * and register for timeout. Protect with a seqcount to allow the
- * timeout path to read both @rq->gstate and @rq->deadline
- * coherently.
- *
- * This is the only place where a request is marked in-flight. If
- * the timeout path reads an in-flight @rq->gstate, the
- * @rq->deadline it reads together under @rq->gstate_seq is
- * guaranteed to be the matching one.
- */
- preempt_disable();
- write_seqcount_begin(&rq->gstate_seq);
-
- blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
- blk_add_timer(rq);
-
- write_seqcount_end(&rq->gstate_seq);
- preempt_enable();
+ blk_mq_add_timer(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
if (q->dma_drain_size && blk_rq_bytes(rq)) {
/*
@@ -695,22 +630,19 @@ void blk_mq_start_request(struct request *rq)
}
EXPORT_SYMBOL(blk_mq_start_request);
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
- */
static void __blk_mq_requeue_request(struct request *rq)
{
struct request_queue *q = rq->q;
+ enum mq_rq_state old_state = blk_mq_rq_state(rq);
blk_mq_put_driver_tag(rq);
trace_block_rq_requeue(q, rq);
wbt_requeue(q->rq_wb, &rq->issue_stat);
- if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
- blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+ if (old_state != MQ_RQ_IDLE) {
+ if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+ WARN_ON_ONCE(true);
if (q->dma_drain_size && blk_rq_bytes(rq))
rq->nr_phys_segments--;
}
@@ -819,8 +751,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
- req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
if (ops->timeout)
ret = ops->timeout(req, reserved);
@@ -829,13 +759,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
- /*
- * As nothing prevents from completion happening while
- * ->aborted_gstate is set, this may lead to ignored
- * completions and further spurious timeouts.
- */
- blk_mq_rq_update_aborted_gstate(req, 0);
- blk_add_timer(req);
+ blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT);
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -849,48 +773,35 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
{
struct blk_mq_timeout_data *data = priv;
- unsigned long gstate, deadline;
- int start;
-
- might_sleep();
+ unsigned long __deadline = READ_ONCE(rq->__deadline);
+ unsigned long deadline = __deadline & ~RQ_STATE_MASK;
+ enum mq_rq_state rq_state = __deadline & RQ_STATE_MASK;
- if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
- return;
-
- /* read coherent snapshots of @rq->state_gen and @rq->deadline */
- while (true) {
- start = read_seqcount_begin(&rq->gstate_seq);
- gstate = READ_ONCE(rq->gstate);
- deadline = blk_rq_deadline(rq);
- if (!read_seqcount_retry(&rq->gstate_seq, start))
- break;
- cond_resched();
- }
-
- /* if in-flight && overdue, mark for abortion */
- if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
- time_after_eq(jiffies, deadline)) {
- blk_mq_rq_update_aborted_gstate(rq, gstate);
+ rq->aborted_gstate = __deadline ^ (1ULL << 63);
+ if (time_after_eq(jiffies, deadline) && rq_state == MQ_RQ_IN_FLIGHT) {
+ rq->aborted_gstate = __deadline;
data->nr_expired++;
hctx->nr_expired++;
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
+
}
static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
{
+ unsigned long old_val = rq->aborted_gstate;
+ unsigned long new_val = (rq->aborted_gstate & ~RQ_STATE_MASK) |
+ MQ_RQ_COMPLETE;
+
/*
- * We marked @rq->aborted_gstate and waited for RCU. If there were
- * completions that we lost to, they would have finished and
- * updated @rq->gstate by now; otherwise, the completion path is
- * now guaranteed to see @rq->aborted_gstate and yield. If
- * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+ * We marked @rq->aborted_gstate and waited for ongoing .queue_rq()
+ * calls. If rq->__deadline has not changed that means that it is
+ * now safe to change the request state and to handle the timeout.
*/
- if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
- READ_ONCE(rq->gstate) == rq->aborted_gstate)
+ if (cmpxchg(&rq->__deadline, old_val, new_val) == old_val)
blk_mq_rq_timed_out(rq, reserved);
}
@@ -929,10 +840,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
bool has_rcu = false;
/*
- * Wait till everyone sees ->aborted_gstate. The
- * sequential waits for SRCUs aren't ideal. If this ever
- * becomes a problem, we can add per-hw_ctx rcu_head and
- * wait in parallel.
+ * For very short timeouts it can happen that
+ * blk_mq_check_expired() modifies the state of a request
+ * while .queue_rq() is still in progress. Hence wait until
+ * these .queue_rq() calls have finished.
*/
queue_for_each_hw_ctx(q, hctx, i) {
if (!hctx->nr_expired)
@@ -948,7 +859,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
if (has_rcu)
synchronize_rcu();
- /* terminate the ones we won */
+ /* Terminate the requests marked by blk_mq_check_expired(). */
blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
}
@@ -2060,15 +1971,6 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
return ret;
}
- seqcount_init(&rq->gstate_seq);
- u64_stats_init(&rq->aborted_gstate_sync);
- /*
- * start gstate with gen 1 instead of 0, otherwise it will be equal
- * to aborted_gstate, and be identified timed out by
- * blk_mq_terminate_expired.
- */
- WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
-
return 0;
}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..66efc8a3988b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,18 +27,11 @@ struct blk_mq_ctx {
struct kobject kobj;
} ____cacheline_aligned_in_smp;
-/*
- * Bits for request->gstate. The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
- */
+/* Lowest two bits of request->__deadline. */
enum mq_rq_state {
MQ_RQ_IDLE = 0,
MQ_RQ_IN_FLIGHT = 1,
MQ_RQ_COMPLETE = 2,
-
- MQ_RQ_STATE_BITS = 2,
- MQ_RQ_STATE_MASK = (1 << MQ_RQ_STATE_BITS) - 1,
- MQ_RQ_GEN_INC = 1 << MQ_RQ_STATE_BITS,
};
void blk_mq_freeze_queue(struct request_queue *q);
@@ -100,37 +93,73 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
void blk_mq_release(struct request_queue *q);
+/*
+ * If the state of request @rq equals @old_state, update deadline and request
+ * state atomically to @time and @new_state. blk-mq only. cmpxchg() is only
+ * used if there could be a concurrent update attempt from another context.
+ */
+static inline bool blk_mq_rq_set_deadline(struct request *rq,
+ unsigned long new_time,
+ enum mq_rq_state old_state,
+ enum mq_rq_state new_state)
+{
+ unsigned long old_val, new_val;
+
+ if (old_state != MQ_RQ_IN_FLIGHT) {
+ old_val = READ_ONCE(rq->__deadline);
+ if ((old_val & RQ_STATE_MASK) != old_state)
+ return false;
+ new_val = (new_time & ~RQ_STATE_MASK) |
+ (new_state & RQ_STATE_MASK);
+ WRITE_ONCE(rq->__deadline, new_val);
+ return true;
+ }
+
+ do {
+ old_val = READ_ONCE(rq->__deadline);
+ if ((old_val & RQ_STATE_MASK) != old_state)
+ return false;
+ new_val = (new_time & ~RQ_STATE_MASK) |
+ (new_state & RQ_STATE_MASK);
+ } while (cmpxchg(&rq->__deadline, old_val, new_val) != old_val);
+
+ return true;
+}
+
/**
* blk_mq_rq_state() - read the current MQ_RQ_* state of a request
* @rq: target request.
*/
-static inline int blk_mq_rq_state(struct request *rq)
+static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
{
- return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+ return READ_ONCE(rq->__deadline) & RQ_STATE_MASK;
}
/**
- * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
- * @rq: target request.
- * @state: new state to set.
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
*
- * Set @rq's state to @state. The caller is responsible for ensuring that
- * there are no other updaters. A request can transition into IN_FLIGHT
- * only from IDLE and doing so increments the generation number.
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
*/
-static inline void blk_mq_rq_update_state(struct request *rq,
- enum mq_rq_state state)
+static inline bool blk_mq_change_rq_state(struct request *rq,
+ enum mq_rq_state old_state,
+ enum mq_rq_state new_state)
{
- u64 old_val = READ_ONCE(rq->gstate);
- u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
- if (state == MQ_RQ_IN_FLIGHT) {
- WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
- new_val += MQ_RQ_GEN_INC;
- }
-
- /* avoid exposing interim values */
- WRITE_ONCE(rq->gstate, new_val);
+ unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
+ old_state;
+ unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
+
+ /*
+ * For transitions from state in-flight to another state cmpxchg() must
+ * be used. For other state transitions it is safe to use WRITE_ONCE().
+ */
+ if (old_state == MQ_RQ_IN_FLIGHT)
+ return cmpxchg(&rq->__deadline, old_val, new_val) == old_val;
+ WRITE_ONCE(rq->__deadline, new_val);
+ return true;
}
static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 50a191720055..e98da6db7d4b 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,8 +165,9 @@ void blk_abort_request(struct request *req)
* immediately and that scan sees the new timeout value.
* No need for fancy synchronizations.
*/
- blk_rq_set_deadline(req, jiffies);
- kblockd_schedule_work(&req->q->timeout_work);
+ if (blk_mq_rq_set_deadline(req, jiffies, MQ_RQ_IN_FLIGHT,
+ MQ_RQ_IN_FLIGHT))
+ kblockd_schedule_work(&req->q->timeout_work);
} else {
if (blk_mark_rq_complete(req))
return;
@@ -187,52 +188,17 @@ unsigned long blk_rq_timeout(unsigned long timeout)
return timeout;
}
-/**
- * blk_add_timer - Start timeout timer for a single request
- * @req: request that is about to start running.
- *
- * Notes:
- * Each request has its own timer, and as it is added to the queue, we
- * set up the timer. When the request completes, we cancel the timer.
- */
-void blk_add_timer(struct request *req)
+static void __blk_add_timer(struct request *req)
{
struct request_queue *q = req->q;
unsigned long expiry;
- if (!q->mq_ops)
- lockdep_assert_held(q->queue_lock);
-
- /* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
- if (!q->mq_ops && !q->rq_timed_out_fn)
- return;
-
- BUG_ON(!list_empty(&req->timeout_list));
-
- /*
- * Some LLDs, like scsi, peek at the timeout to prevent a
- * command from being retried forever.
- */
- if (!req->timeout)
- req->timeout = q->rq_timeout;
-
- blk_rq_set_deadline(req, jiffies + req->timeout);
- req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
-
- /*
- * Only the non-mq case needs to add the request to a protected list.
- * For the mq case we simply scan the tag map.
- */
- if (!q->mq_ops)
- list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
/*
* If the timer isn't already pending or this timeout is earlier
* than an existing one, modify the timer. Round up to next nearest
* second.
*/
expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
-
if (!timer_pending(&q->timeout) ||
time_before(expiry, q->timeout.expires)) {
unsigned long diff = q->timeout.expires - expiry;
@@ -247,5 +213,52 @@ void blk_add_timer(struct request *req)
if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
mod_timer(&q->timeout, expiry);
}
+}
+
+/**
+ * blk_add_timer - Start timeout timer for a single request
+ * @req: request that is about to start running.
+ *
+ * Notes:
+ * Each request has its own timer, and as it is added to the queue, we
+ * set up the timer. When the request completes, we cancel the timer.
+ */
+void blk_add_timer(struct request *req)
+{
+ struct request_queue *q = req->q;
+
+ lockdep_assert_held(q->queue_lock);
+ if (!q->rq_timed_out_fn)
+ return;
+ if (!req->timeout)
+ req->timeout = q->rq_timeout;
+
+ blk_rq_set_deadline(req, jiffies + req->timeout);
+ list_add_tail(&req->timeout_list, &req->q->timeout_list);
+
+ return __blk_add_timer(req);
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req: request for which to set the deadline.
+ * @old: current request state.
+ * @new: new request state.
+ *
+ * Sets the deadline of a request if and only if it has state @old and
+ * at the same time changes the request state from @old into @new. The caller
+ * must guarantee that the request state won't be modified while this function
+ * is in progress.
+ */
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+ enum mq_rq_state new)
+{
+ struct request_queue *q = req->q;
+
+ if (!req->timeout)
+ req->timeout = q->rq_timeout;
+ if (!blk_mq_rq_set_deadline(req, jiffies + req->timeout, old, new))
+ WARN_ON_ONCE(true);
+ return __blk_add_timer(req);
}
diff --git a/block/blk.h b/block/blk.h
index b034fd2460c4..7cd64f533a46 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,8 @@ static inline bool bio_integrity_endio(struct bio *bio)
void blk_timeout_work(struct work_struct *work);
unsigned long blk_rq_timeout(unsigned long timeout);
void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+ enum mq_rq_state new);
void blk_delete_timer(struct request *);
@@ -308,18 +310,19 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
}
/*
- * Steal a bit from this field for legacy IO path atomic IO marking. Note that
- * setting the deadline clears the bottom bit, potentially clearing the
- * completed bit. The user has to be OK with this (current ones are fine).
+ * Steal two bits from this field. The legacy IO path uses the lowest bit for
+ * atomic IO marking. Note that setting the deadline clears the bottom bit,
+ * potentially clearing the completed bit. The current legacy block layer is
+ * fine with that. Must be called with the request queue lock held.
*/
static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
{
- rq->__deadline = time & ~0x1UL;
+ rq->__deadline = time & RQ_STATE_MASK;
}
static inline unsigned long blk_rq_deadline(struct request *rq)
{
- return rq->__deadline & ~0x1UL;
+ return rq->__deadline & ~RQ_STATE_MASK;
}
/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b7681f3ee793..51cd69f14537 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,8 +27,6 @@
#include <linux/percpu-refcount.h>
#include <linux/scatterlist.h>
#include <linux/blkzoned.h>
-#include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
struct module;
struct scsi_ioctl_command;
@@ -125,10 +123,8 @@ typedef __u32 __bitwise req_flags_t;
#define RQF_SPECIAL_PAYLOAD ((__force req_flags_t)(1 << 18))
/* The per-zone write lock is held for this request */
#define RQF_ZONE_WRITE_LOCKED ((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED ((__force req_flags_t)(1 << 20))
/* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT ((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT ((__force req_flags_t)(1 << 20))
/* flags that prevent us from merging requests: */
#define RQF_NOMERGE_FLAGS \
@@ -225,28 +221,19 @@ struct request {
unsigned int extra_len; /* length of alignment and padding */
- /*
- * On blk-mq, the lower bits of ->gstate (generation number and
- * state) carry the MQ_RQ_* state value and the upper bits the
- * generation number which is monotonically incremented and used to
- * distinguish the reuse instances.
- *
- * ->gstate_seq allows updates to ->gstate and other fields
- * (currently ->deadline) during request start to be read
- * atomically from the timeout path, so that it can operate on a
- * coherent set of information.
- */
- seqcount_t gstate_seq;
- u64 gstate;
-
/*
* ->aborted_gstate is used by the timeout to claim a specific
* recycle instance of this request. See blk_mq_timeout_work().
*/
- struct u64_stats_sync aborted_gstate_sync;
u64 aborted_gstate;
- /* access through blk_rq_set_deadline, blk_rq_deadline */
+ /*
+ * Access through blk_rq_deadline() and blk_rq_set_deadline(),
+ * blk_mark_rq_complete(), blk_clear_rq_complete() and
+ * blk_rq_is_complete() for legacy queues or blk_mq_rq_state() for
+ * blk-mq queues.
+ */
+#define RQ_STATE_MASK 0x3UL
unsigned long __deadline;
struct list_head timeout_list;
--
2.16.3
Friend,
My name is Miss Qadesa AbdulAziz and I am 17 years old girl from Syria. There is serious war crisis here in Syria, and I have lost my parents and
my two brothers in this war. I want you to help me and receive ($7.md) which my late father deposited with my name in a bank in London. I want
to come to your country and start a new life and invest with you, because am the only survival
in my family.
I wait to hear from you, Please do not let me die here, I begging you the name of Almighty. please respond here my pirate email qadesa(a)protonmail.com
Regards
Miss Qadesa AbdulAziz
The bug exists in the memcmp in which the length passed in must
be guaranteed to be 1. This bug currently exists because
the second pointer passed in, can be smaller than the
cmd->data_length, which causes a fortify_panic.
The fix is to use memchr_inv instead to find whether or not
a 0 exists instead of using memcmp. This way you dont have to
worry about buffer overflow which is the reason for the
fortify_panic.
The bug was found by running a block backstore via LIO.
[ 496.212958] Call Trace:
[ 496.212960] [c0000007e58e3800] [c000000000cbbefc] fortify_panic+0x24/0x38 (unreliable)
[ 496.212965] [c0000007e58e3860] [d00000000f150c28] iblock_execute_write_same+0x3b8/0x3c0 [target_core_iblock]
[ 496.212976] [c0000007e58e3910] [d000000006c737d4] __target_execute_cmd+0x54/0x150 [target_core_mod]
[ 496.212982] [c0000007e58e3940] [d000000006d32ce4] ibmvscsis_write_pending+0x74/0xe0 [ibmvscsis]
[ 496.212991] [c0000007e58e39b0] [d000000006c74fc8] transport_generic_new_cmd+0x318/0x370 [target_core_mod]
[ 496.213001] [c0000007e58e3a30] [d000000006c75084] transport_handle_cdb_direct+0x64/0xd0 [target_core_mod]
[ 496.213011] [c0000007e58e3aa0] [d000000006c75298] target_submit_cmd_map_sgls+0x1a8/0x320 [target_core_mod]
[ 496.213021] [c0000007e58e3b30] [d000000006c75458] target_submit_cmd+0x48/0x60 [target_core_mod]
[ 496.213026] [c0000007e58e3bd0] [d000000006d34c20] ibmvscsis_scheduler+0x370/0x600 [ibmvscsis]
[ 496.213031] [c0000007e58e3c90] [c00000000013135c] process_one_work+0x1ec/0x580
[ 496.213035] [c0000007e58e3d20] [c000000000131798] worker_thread+0xa8/0x600
[ 496.213039] [c0000007e58e3dc0] [c00000000013a468] kthread+0x168/0x1b0
[ 496.213044] [c0000007e58e3e30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
Fixes: 2237498f0b5c ("target/iblock: Convert WRITE_SAME to blkdev_issue_zeroout")
Signed-off-by: Bryant G. Ly <bryantly(a)linux.vnet.ibm.com>
Reviewed-by: Steven Royer <seroyer(a)linux.vnet.ibm.com>
Tested-by: Taylor Jakobson <tjakobs(a)us.ibm.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Nicholas Bellinger <nab(a)linux-iscsi.org>
Cc: <stable(a)vger.kernel.org>
---
drivers/target/target_core_iblock.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 07c814c..6042901 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -427,8 +427,8 @@ iblock_execute_zero_out(struct block_device *bdev, struct se_cmd *cmd)
{
struct se_device *dev = cmd->se_dev;
struct scatterlist *sg = &cmd->t_data_sg[0];
- unsigned char *buf, zero = 0x00, *p = &zero;
- int rc, ret;
+ unsigned char *buf, *not_zero;
+ int ret;
buf = kmap(sg_page(sg)) + sg->offset;
if (!buf)
@@ -437,10 +437,10 @@ iblock_execute_zero_out(struct block_device *bdev, struct se_cmd *cmd)
* Fall back to block_execute_write_same() slow-path if
* incoming WRITE_SAME payload does not contain zeros.
*/
- rc = memcmp(buf, p, cmd->data_length);
+ not_zero = memchr_inv(buf, 0x00, cmd->data_length);
kunmap(sg_page(sg));
- if (rc)
+ if (not_zero)
return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
ret = blkdev_issue_zeroout(bdev,
--
2.7.2
From: Matthew Wilcox <mawilcox(a)microsoft.com>
Subject: mm/filemap.c: fix NULL pointer in page_cache_tree_insert()
f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
Unfortunately, the page cache also uses the mapping's GFP flags for
allocating radix tree nodes. It always masked off the __GFP_HIGHMEM
flag, and masks off __GFP_ZERO in some paths, but not all. That causes
radix tree nodes to be allocated with a NULL list_head, which causes
backtraces like:
[<ffffff80086f4de0>] __list_del_entry+0x30/0xd0
[<ffffff8008362018>] list_lru_del+0xac/0x1ac
[<ffffff800830f04c>] page_cache_tree_insert+0xd8/0x110
The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through if
they are ever used. Fix them all by using GFP_RECLAIM_MASK at the
innermost location, and remove it from earlier in the callchain.
Link: http://lkml.kernel.org/r/20180411060320.14458-2-willy@infradead.org
Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Matthew Wilcox <mawilcox(a)microsoft.com>
Reported-by: Chris Fries <cfries(a)google.com>
Debugged-by: Minchan Kim <minchan(a)kernel.org>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/filemap.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff -puN mm/filemap.c~fix-null-pointer-in-page_cache_tree_insert mm/filemap.c
--- a/mm/filemap.c~fix-null-pointer-in-page_cache_tree_insert
+++ a/mm/filemap.c
@@ -786,7 +786,7 @@ int replace_page_cache_page(struct page
VM_BUG_ON_PAGE(!PageLocked(new), new);
VM_BUG_ON_PAGE(new->mapping, new);
- error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_preload(gfp_mask & GFP_RECLAIM_MASK);
if (!error) {
struct address_space *mapping = old->mapping;
void (*freepage)(struct page *);
@@ -842,7 +842,7 @@ static int __add_to_page_cache_locked(st
return error;
}
- error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_maybe_preload(gfp_mask & GFP_RECLAIM_MASK);
if (error) {
if (!huge)
mem_cgroup_cancel_charge(page, memcg, false);
@@ -1585,8 +1585,7 @@ no_page:
if (fgp_flags & FGP_ACCESSED)
__SetPageReferenced(page);
- err = add_to_page_cache_lru(page, mapping, offset,
- gfp_mask & GFP_RECLAIM_MASK);
+ err = add_to_page_cache_lru(page, mapping, offset, gfp_mask);
if (unlikely(err)) {
put_page(page);
page = NULL;
@@ -2387,7 +2386,7 @@ static int page_cache_read(struct file *
if (!page)
return -ENOMEM;
- ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & GFP_KERNEL);
+ ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask);
if (ret == 0)
ret = mapping->a_ops->readpage(file, page);
else if (ret == -EEXIST)
_
From: Ian Kent <raven(a)themaw.net>
Subject: autofs: mount point create should honour passed in mode
The autofs file system mkdir inode operation blindly sets the created
directory mode to S_IFDIR | 0555, ingoring the passed in mode, which can
cause selinux dac_override denials.
But the function also checks if the caller is the daemon (as no-one else
should be able to do anything here) so there's no point in not honouring
the passed in mode, allowing the daemon to set appropriate mode when
required.
Link: http://lkml.kernel.org/r/152361593601.8051.14014139124905996173.stgit@pluto…
Signed-off-by: Ian Kent <raven(a)themaw.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/autofs4/root.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN fs/autofs4/root.c~autofs-mount-point-create-should-honour-passed-in-mode fs/autofs4/root.c
--- a/fs/autofs4/root.c~autofs-mount-point-create-should-honour-passed-in-mode
+++ a/fs/autofs4/root.c
@@ -749,7 +749,7 @@ static int autofs4_dir_mkdir(struct inod
autofs4_del_active(dentry);
- inode = autofs4_get_inode(dir->i_sb, S_IFDIR | 0555);
+ inode = autofs4_get_inode(dir->i_sb, S_IFDIR | mode);
if (!inode)
return -ENOMEM;
d_add(dentry, inode);
_
From: Ioan Nicu <ioan.nicu.ext(a)nokia.com>
Subject: rapidio: fix rio_dma_transfer error handling
Some of the mport_dma_req structure members were initialized late
inside the do_dma_request() function, just before submitting the
request to the dma engine. But we have some error branches before
that. In case of such an error, the code would return on the error
path and trigger the calling of dma_req_free() with a req structure
which is not completely initialized. This causes a NULL pointer
dereference in dma_req_free().
This patch fixes these error branches by making sure that all
necessary mport_dma_req structure members are initialized in
rio_dma_transfer() immediately after the request structure gets
allocated.
Link: http://lkml.kernel.org/r/20180412150605.GA31409@nokia.com
Fixes: bbd876adb8c72 ("rapidio: use a reference count for struct mport_dma_req")
Signed-off-by: Ioan Nicu <ioan.nicu.ext(a)nokia.com>
Tested-by: Alexander Sverdlin <alexander.sverdlin(a)nokia.com>
Acked-by: Alexandre Bounine <alex.bou9(a)gmail.com>
Cc: Barry Wood <barry.wood(a)idt.com>
Cc: Matt Porter <mporter(a)kernel.crashing.org>
Cc: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Cc: Frank Kunz <frank.kunz(a)nokia.com>
Cc: <stable(a)vger.kernel.org> [4.6+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/rapidio/devices/rio_mport_cdev.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff -puN drivers/rapidio/devices/rio_mport_cdev.c~rapidio-fix-rio_dma_transfer-error-handling drivers/rapidio/devices/rio_mport_cdev.c
--- a/drivers/rapidio/devices/rio_mport_cdev.c~rapidio-fix-rio_dma_transfer-error-handling
+++ a/drivers/rapidio/devices/rio_mport_cdev.c
@@ -740,10 +740,7 @@ static int do_dma_request(struct mport_d
tx->callback = dma_xfer_callback;
tx->callback_param = req;
- req->dmach = chan;
- req->sync = sync;
req->status = DMA_IN_PROGRESS;
- init_completion(&req->req_comp);
kref_get(&req->refcount);
cookie = dmaengine_submit(tx);
@@ -831,13 +828,20 @@ rio_dma_transfer(struct file *filp, u32
if (!req)
return -ENOMEM;
- kref_init(&req->refcount);
-
ret = get_dma_channel(priv);
if (ret) {
kfree(req);
return ret;
}
+ chan = priv->dmach;
+
+ kref_init(&req->refcount);
+ init_completion(&req->req_comp);
+ req->dir = dir;
+ req->filp = filp;
+ req->priv = priv;
+ req->dmach = chan;
+ req->sync = sync;
/*
* If parameter loc_addr != NULL, we are transferring data from/to
@@ -925,11 +929,6 @@ rio_dma_transfer(struct file *filp, u32
xfer->offset, xfer->length);
}
- req->dir = dir;
- req->filp = filp;
- req->priv = priv;
- chan = priv->dmach;
-
nents = dma_map_sg(chan->device->dev,
req->sgt.sgl, req->sgt.nents, dir);
if (nents == 0) {
_
From: Greg Thelen <gthelen(a)google.com>
Subject: writeback: safer lock nesting
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.
unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains. Switches occur when
enough writes are issued from a new domain.
This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);
If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock. This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().
truncate
__cancel_dirty_page
lock_page_memcg
unlocked_inode_to_wb_begin
unlocked_inode_to_wb_end
<interrupts mistakenly enabled>
<interrupt>
end_page_writeback
test_clear_page_writeback
lock_page_memcg
<deadlock>
unlock_page_memcg
Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).
If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
cd /mnt/cgroup/memory
mkdir a b
echo 1 > a/memory.move_charge_at_immigrate
echo 1 > b/memory.move_charge_at_immigrate
(
echo $BASHPID > a/cgroup.procs
while true; do
dd if=/dev/zero of=/mnt/big bs=1M count=256
done
) &
while true; do
sync
done &
sleep 1h &
SLEEP=$!
while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
done
The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel. I suggest we should to prevent future
surprises. And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146
Wang Long said "this deadlock occurs three times in our environment"
[gthelen(a)google.com: v4]
Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm(a)linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen(a)google.com>
Reported-by: Wang Long <wanglong19(a)meituan.com>
Acked-by: Wang Long <wanglong19(a)meituan.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [v4.2+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/fs-writeback.c | 7 +++---
include/linux/backing-dev-defs.h | 5 ++++
include/linux/backing-dev.h | 30 +++++++++++++++--------------
mm/page-writeback.c | 18 ++++++++---------
4 files changed, 34 insertions(+), 26 deletions(-)
diff -puN fs/fs-writeback.c~writeback-safer-lock-nesting fs/fs-writeback.c
--- a/fs/fs-writeback.c~writeback-safer-lock-nesting
+++ a/fs/fs-writeback.c
@@ -745,11 +745,12 @@ int inode_congested(struct inode *inode,
*/
if (inode && inode_to_wb_is_valid(inode)) {
struct bdi_writeback *wb;
- bool locked, congested;
+ struct wb_lock_cookie lock_cookie = {};
+ bool congested;
- wb = unlocked_inode_to_wb_begin(inode, &locked);
+ wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
congested = wb_congested(wb, cong_bits);
- unlocked_inode_to_wb_end(inode, locked);
+ unlocked_inode_to_wb_end(inode, &lock_cookie);
return congested;
}
diff -puN include/linux/backing-dev-defs.h~writeback-safer-lock-nesting include/linux/backing-dev-defs.h
--- a/include/linux/backing-dev-defs.h~writeback-safer-lock-nesting
+++ a/include/linux/backing-dev-defs.h
@@ -223,6 +223,11 @@ static inline void set_bdi_congested(str
set_wb_congested(bdi->wb.congested, sync);
}
+struct wb_lock_cookie {
+ bool locked;
+ unsigned long flags;
+};
+
#ifdef CONFIG_CGROUP_WRITEBACK
/**
diff -puN include/linux/backing-dev.h~writeback-safer-lock-nesting include/linux/backing-dev.h
--- a/include/linux/backing-dev.h~writeback-safer-lock-nesting
+++ a/include/linux/backing-dev.h
@@ -347,7 +347,7 @@ static inline struct bdi_writeback *inod
/**
* unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
* @inode: target inode
- * @lockedp: temp bool output param, to be passed to the end function
+ * @cookie: output param, to be passed to the end function
*
* The caller wants to access the wb associated with @inode but isn't
* holding inode->i_lock, the i_pages lock or wb->list_lock. This
@@ -355,12 +355,12 @@ static inline struct bdi_writeback *inod
* association doesn't change until the transaction is finished with
* unlocked_inode_to_wb_end().
*
- * The caller must call unlocked_inode_to_wb_end() with *@lockdep
- * afterwards and can't sleep during transaction. IRQ may or may not be
- * disabled on return.
+ * The caller must call unlocked_inode_to_wb_end() with *@cookie afterwards and
+ * can't sleep during the transaction. IRQs may or may not be disabled on
+ * return.
*/
static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
{
rcu_read_lock();
@@ -368,10 +368,10 @@ unlocked_inode_to_wb_begin(struct inode
* Paired with store_release in inode_switch_wb_work_fn() and
* ensures that we see the new wb if we see cleared I_WB_SWITCH.
*/
- *lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
+ cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
- if (unlikely(*lockedp))
- xa_lock_irq(&inode->i_mapping->i_pages);
+ if (unlikely(cookie->locked))
+ xa_lock_irqsave(&inode->i_mapping->i_pages, cookie->flags);
/*
* Protected by either !I_WB_SWITCH + rcu_read_lock() or the i_pages
@@ -383,12 +383,13 @@ unlocked_inode_to_wb_begin(struct inode
/**
* unlocked_inode_to_wb_end - end inode wb access transaction
* @inode: target inode
- * @locked: *@lockedp from unlocked_inode_to_wb_begin()
+ * @cookie: @cookie from unlocked_inode_to_wb_begin()
*/
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+ struct wb_lock_cookie *cookie)
{
- if (unlikely(locked))
- xa_unlock_irq(&inode->i_mapping->i_pages);
+ if (unlikely(cookie->locked))
+ xa_unlock_irqrestore(&inode->i_mapping->i_pages, cookie->flags);
rcu_read_unlock();
}
@@ -435,12 +436,13 @@ static inline struct bdi_writeback *inod
}
static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
{
return inode_to_wb(inode);
}
-static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
+static inline void unlocked_inode_to_wb_end(struct inode *inode,
+ struct wb_lock_cookie *cookie)
{
}
diff -puN mm/page-writeback.c~writeback-safer-lock-nesting mm/page-writeback.c
--- a/mm/page-writeback.c~writeback-safer-lock-nesting
+++ a/mm/page-writeback.c
@@ -2502,13 +2502,13 @@ void account_page_redirty(struct page *p
if (mapping && mapping_cap_account_dirty(mapping)) {
struct inode *inode = mapping->host;
struct bdi_writeback *wb;
- bool locked;
+ struct wb_lock_cookie cookie = {};
- wb = unlocked_inode_to_wb_begin(inode, &locked);
+ wb = unlocked_inode_to_wb_begin(inode, &cookie);
current->nr_dirtied--;
dec_node_page_state(page, NR_DIRTIED);
dec_wb_stat(wb, WB_DIRTIED);
- unlocked_inode_to_wb_end(inode, locked);
+ unlocked_inode_to_wb_end(inode, &cookie);
}
}
EXPORT_SYMBOL(account_page_redirty);
@@ -2614,15 +2614,15 @@ void __cancel_dirty_page(struct page *pa
if (mapping_cap_account_dirty(mapping)) {
struct inode *inode = mapping->host;
struct bdi_writeback *wb;
- bool locked;
+ struct wb_lock_cookie cookie = {};
lock_page_memcg(page);
- wb = unlocked_inode_to_wb_begin(inode, &locked);
+ wb = unlocked_inode_to_wb_begin(inode, &cookie);
if (TestClearPageDirty(page))
account_page_cleaned(page, mapping, wb);
- unlocked_inode_to_wb_end(inode, locked);
+ unlocked_inode_to_wb_end(inode, &cookie);
unlock_page_memcg(page);
} else {
ClearPageDirty(page);
@@ -2654,7 +2654,7 @@ int clear_page_dirty_for_io(struct page
if (mapping && mapping_cap_account_dirty(mapping)) {
struct inode *inode = mapping->host;
struct bdi_writeback *wb;
- bool locked;
+ struct wb_lock_cookie cookie = {};
/*
* Yes, Virginia, this is indeed insane.
@@ -2691,14 +2691,14 @@ int clear_page_dirty_for_io(struct page
* always locked coming in here, so we get the desired
* exclusion.
*/
- wb = unlocked_inode_to_wb_begin(inode, &locked);
+ wb = unlocked_inode_to_wb_begin(inode, &cookie);
if (TestClearPageDirty(page)) {
dec_lruvec_page_state(page, NR_FILE_DIRTY);
dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
dec_wb_stat(wb, WB_RECLAIMABLE);
ret = 1;
}
- unlocked_inode_to_wb_end(inode, locked);
+ unlocked_inode_to_wb_end(inode, &cookie);
return ret;
}
return TestClearPageDirty(page);
_
The pci-hyperv driver's channel callback hv_pci_onchannelcallback() is not
really a hot path, so we don't need to mark it as a perf_device, meaning
with this patch all HV_PCIE channels' target_cpu will be CPU0.
Signed-off-by: Dexuan Cui <decui(a)microsoft.com>
Cc: stable(a)vger.kernel.org
Cc: Stephen Hemminger <sthemmin(a)microsoft.com>
Cc: K. Y. Srinivasan <kys(a)microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys(a)microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
(cherry picked from commit 238064f13d057390a8c5e1a6a80f4f0a0ec46499)
Signed-off-by: Mohammed Gamal <mgamal(a)redhat.com>
---
drivers/hv/channel_mgmt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index c6d9d19..ecc2bd2 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -71,7 +71,7 @@ static const struct vmbus_device vmbus_devs[] = {
/* PCIE */
{ .dev_type = HV_PCIE,
HV_PCIE_GUID,
- .perf_device = true,
+ .perf_device = false,
},
/* Synthetic Frame Buffer */
--
1.8.3.1
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 6225f9c64b40bc8a22503e9cda70f55d7a9dd3c6 Mon Sep 17 00:00:00 2001
From: Ryo Kodama <ryo.kodama.vz(a)renesas.com>
Date: Fri, 9 Mar 2018 20:24:21 +0900
Subject: [PATCH] pwm: rcar: Fix a condition to prevent mismatch value setting
to duty
This patch fixes an issue that is possible to set mismatch value to duty
for R-Car PWM if we input the following commands:
# cd /sys/class/pwm/<pwmchip>/
# echo 0 > export
# cd pwm0
# echo 30 > period
# echo 30 > duty_cycle
# echo 0 > duty_cycle
# cat duty_cycle
0
# echo 1 > enable
--> Then, the actual duty_cycle is 30, not 0.
So, this patch adds a condition into rcar_pwm_config() to fix this
issue.
Signed-off-by: Ryo Kodama <ryo.kodama.vz(a)renesas.com>
[shimoda: revise the commit log and add Fixes and Cc tags]
Fixes: ed6c1476bf7f ("pwm: Add support for R-Car PWM Timer")
Cc: Cc: <stable(a)vger.kernel.org> # v4.4+
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh(a)renesas.com>
Signed-off-by: Thierry Reding <thierry.reding(a)gmail.com>
diff --git a/drivers/pwm/pwm-rcar.c b/drivers/pwm/pwm-rcar.c
index 1c85ecc9e7ac..0fcf94ffad32 100644
--- a/drivers/pwm/pwm-rcar.c
+++ b/drivers/pwm/pwm-rcar.c
@@ -156,8 +156,12 @@ static int rcar_pwm_config(struct pwm_chip *chip, struct pwm_device *pwm,
if (div < 0)
return div;
- /* Let the core driver set pwm->period if disabled and duty_ns == 0 */
- if (!pwm_is_enabled(pwm) && !duty_ns)
+ /*
+ * Let the core driver set pwm->period if disabled and duty_ns == 0.
+ * But, this driver should prevent to set the new duty_ns if current
+ * duty_cycle is not set
+ */
+ if (!pwm_is_enabled(pwm) && !duty_ns && !pwm->state.duty_cycle)
return 0;
rcar_pwm_update(rp, RCAR_PWMCR_SYNC, RCAR_PWMCR_SYNC, RCAR_PWMCR);
The patch below does not apply to the 4.16-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5d447f09b8d8346c64f4c952a67c61f7ce88d3c1 Mon Sep 17 00:00:00 2001
From: Shirish S <shirish.s(a)amd.com>
Date: Wed, 21 Feb 2018 16:10:33 +0530
Subject: [PATCH] drm/amd/display: check for ipp before calling cursor
operations
Currently all cursor related functions are made to all
pipes that are attached to a particular stream.
This is not applicable to pipes that do not have cursor plane
initialised like underlay.
Hence this patch allows cursor related operations on a pipe
only if ipp in available on that particular pipe.
The check is added to set_cursor_position & set_cursor_attribute.
Signed-off-by: Shirish S <shirish.s(a)amd.com>
Reviewed-by: Harry Wentland <harry.wentland(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_stream.c b/drivers/gpu/drm/amd/display/dc/core/dc_stream.c
index 87a193ac2883..cd5819789d76 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_stream.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_stream.c
@@ -198,7 +198,8 @@ bool dc_stream_set_cursor_attributes(
for (i = 0; i < MAX_PIPES; i++) {
struct pipe_ctx *pipe_ctx = &res_ctx->pipe_ctx[i];
- if (pipe_ctx->stream != stream || (!pipe_ctx->plane_res.xfm && !pipe_ctx->plane_res.dpp))
+ if (pipe_ctx->stream != stream || (!pipe_ctx->plane_res.xfm &&
+ !pipe_ctx->plane_res.dpp) || !pipe_ctx->plane_res.ipp)
continue;
if (pipe_ctx->top_pipe && pipe_ctx->plane_state != pipe_ctx->top_pipe->plane_state)
continue;
@@ -237,7 +238,8 @@ bool dc_stream_set_cursor_position(
if (pipe_ctx->stream != stream ||
(!pipe_ctx->plane_res.mi && !pipe_ctx->plane_res.hubp) ||
!pipe_ctx->plane_state ||
- (!pipe_ctx->plane_res.xfm && !pipe_ctx->plane_res.dpp))
+ (!pipe_ctx->plane_res.xfm && !pipe_ctx->plane_res.dpp) ||
+ !pipe_ctx->plane_res.ipp)
continue;
core_dc->hwss.set_cursor_position(pipe_ctx);
The patch below does not apply to the 4.16-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ea74e15fb547483f9f86088443f2d3c9f518de8b Mon Sep 17 00:00:00 2001
From: Harry Wentland <harry.wentland(a)amd.com>
Date: Tue, 20 Feb 2018 13:36:23 -0500
Subject: [PATCH] drm/amd/display: Default HDMI6G support to true. Log VBIOS
table error.
There have been many reports of Ellesmere and Baffin systems not being
able to drive HDMI 4k60 due to the fact that we check the HDMI_6GB_EN
bit from VBIOS table. Windows seems to not have this issue.
On some systems we fail to the encoder cap info from VBIOS. In that case
we should default to enabling HDMI6G support.
This was tested by dwagner on
https://bugs.freedesktop.org/show_bug.cgi?id=102820
Signed-off-by: Harry Wentland <harry.wentland(a)amd.com>
Reviewed-by: Roman Li <Roman.Li(a)amd.com>
Reviewed-by: Tony Cheng <Tony.Cheng(a)amd.com>
Acked-by: Harry Wentland <harry.wentland(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/gpu/drm/amd/display/dc/dce/dce_link_encoder.c b/drivers/gpu/drm/amd/display/dc/dce/dce_link_encoder.c
index f0d63ac7724a..81776e4797ed 100644
--- a/drivers/gpu/drm/amd/display/dc/dce/dce_link_encoder.c
+++ b/drivers/gpu/drm/amd/display/dc/dce/dce_link_encoder.c
@@ -678,6 +678,7 @@ void dce110_link_encoder_construct(
{
struct bp_encoder_cap_info bp_cap_info = {0};
const struct dc_vbios_funcs *bp_funcs = init_data->ctx->dc_bios->funcs;
+ enum bp_result result = BP_RESULT_OK;
enc110->base.funcs = &dce110_lnk_enc_funcs;
enc110->base.ctx = init_data->ctx;
@@ -752,15 +753,24 @@ void dce110_link_encoder_construct(
enc110->base.preferred_engine = ENGINE_ID_UNKNOWN;
}
+ /* default to one to mirror Windows behavior */
+ enc110->base.features.flags.bits.HDMI_6GB_EN = 1;
+
+ result = bp_funcs->get_encoder_cap_info(enc110->base.ctx->dc_bios,
+ enc110->base.id, &bp_cap_info);
+
/* Override features with DCE-specific values */
- if (BP_RESULT_OK == bp_funcs->get_encoder_cap_info(
- enc110->base.ctx->dc_bios, enc110->base.id,
- &bp_cap_info)) {
+ if (BP_RESULT_OK == result) {
enc110->base.features.flags.bits.IS_HBR2_CAPABLE =
bp_cap_info.DP_HBR2_EN;
enc110->base.features.flags.bits.IS_HBR3_CAPABLE =
bp_cap_info.DP_HBR3_EN;
enc110->base.features.flags.bits.HDMI_6GB_EN = bp_cap_info.HDMI_6GB_EN;
+ } else {
+ dm_logger_write(enc110->base.ctx->logger, LOG_WARNING,
+ "%s: Failed to get encoder_cap_info from VBIOS with error code %d!\n",
+ __func__,
+ result);
}
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0731de476a37c33485af82d64041c9d193208df8 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams(a)intel.com>
Date: Wed, 21 Mar 2018 21:22:34 -0700
Subject: [PATCH] nfit: skip region registration for incomplete control regions
Per the ACPI specification the only functional purpose for a DIMM
Control Region to be mapped into the system physical address space, from
an OSPM perspective, is to support block-apertures. However, there are
some BIOSen that publish DIMM Control Region SPA entries for pre-boot
environment consumption. Undo the kernel policy of generating disabled
'ndblk' regions when this configuration is detected.
Cc: <stable(a)vger.kernel.org>
Fixes: 1f7df6f88b92 ("libnvdimm, nfit: regions (block-data-window...)")
Reviewed-by: Toshi Kani <toshi.kani(a)hpe.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 39ad06143e78..4530d89044db 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2578,7 +2578,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
struct acpi_nfit_system_address *spa = nfit_spa->spa;
struct nd_blk_region_desc *ndbr_desc;
struct nfit_mem *nfit_mem;
- int blk_valid = 0, rc;
+ int rc;
if (!nvdimm) {
dev_err(acpi_desc->dev, "spa%d dimm: %#x not found\n",
@@ -2598,15 +2598,14 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
if (!nfit_mem || !nfit_mem->bdw) {
dev_dbg(acpi_desc->dev, "spa%d %s missing bdw\n",
spa->range_index, nvdimm_name(nvdimm));
- } else {
- mapping->size = nfit_mem->bdw->capacity;
- mapping->start = nfit_mem->bdw->start_address;
- ndr_desc->num_lanes = nfit_mem->bdw->windows;
- blk_valid = 1;
+ break;
}
+ mapping->size = nfit_mem->bdw->capacity;
+ mapping->start = nfit_mem->bdw->start_address;
+ ndr_desc->num_lanes = nfit_mem->bdw->windows;
ndr_desc->mapping = mapping;
- ndr_desc->num_mappings = blk_valid;
+ ndr_desc->num_mappings = 1;
ndbr_desc = to_blk_region_desc(ndr_desc);
ndbr_desc->enable = acpi_nfit_blk_region_enable;
ndbr_desc->do_io = acpi_desc->blk_do_io;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0731de476a37c33485af82d64041c9d193208df8 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams(a)intel.com>
Date: Wed, 21 Mar 2018 21:22:34 -0700
Subject: [PATCH] nfit: skip region registration for incomplete control regions
Per the ACPI specification the only functional purpose for a DIMM
Control Region to be mapped into the system physical address space, from
an OSPM perspective, is to support block-apertures. However, there are
some BIOSen that publish DIMM Control Region SPA entries for pre-boot
environment consumption. Undo the kernel policy of generating disabled
'ndblk' regions when this configuration is detected.
Cc: <stable(a)vger.kernel.org>
Fixes: 1f7df6f88b92 ("libnvdimm, nfit: regions (block-data-window...)")
Reviewed-by: Toshi Kani <toshi.kani(a)hpe.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 39ad06143e78..4530d89044db 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2578,7 +2578,7 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
struct acpi_nfit_system_address *spa = nfit_spa->spa;
struct nd_blk_region_desc *ndbr_desc;
struct nfit_mem *nfit_mem;
- int blk_valid = 0, rc;
+ int rc;
if (!nvdimm) {
dev_err(acpi_desc->dev, "spa%d dimm: %#x not found\n",
@@ -2598,15 +2598,14 @@ static int acpi_nfit_init_mapping(struct acpi_nfit_desc *acpi_desc,
if (!nfit_mem || !nfit_mem->bdw) {
dev_dbg(acpi_desc->dev, "spa%d %s missing bdw\n",
spa->range_index, nvdimm_name(nvdimm));
- } else {
- mapping->size = nfit_mem->bdw->capacity;
- mapping->start = nfit_mem->bdw->start_address;
- ndr_desc->num_lanes = nfit_mem->bdw->windows;
- blk_valid = 1;
+ break;
}
+ mapping->size = nfit_mem->bdw->capacity;
+ mapping->start = nfit_mem->bdw->start_address;
+ ndr_desc->num_lanes = nfit_mem->bdw->windows;
ndr_desc->mapping = mapping;
- ndr_desc->num_mappings = blk_valid;
+ ndr_desc->num_mappings = 1;
ndbr_desc = to_blk_region_desc(ndr_desc);
ndbr_desc->enable = acpi_nfit_blk_region_enable;
ndbr_desc->do_io = acpi_desc->blk_do_io;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From c31898c8c711f2bbbcaebe802a55827e288d875a Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams(a)intel.com>
Date: Fri, 6 Apr 2018 11:25:38 -0700
Subject: [PATCH] libnvdimm, dimm: fix dpa reservation vs uninitialized label
area
At initialization time the 'dimm' driver caches a copy of the memory
device's label area and reserves address space for each of the
namespaces defined.
However, as can be seen below, the reservation occurs even when the
index blocks are invalid:
nvdimm nmem0: nvdimm_init_config_data: len: 131072 rc: 0
nvdimm nmem0: config data size: 131072
nvdimm nmem0: __nd_label_validate: nsindex0 labelsize 1 invalid
nvdimm nmem0: __nd_label_validate: nsindex1 labelsize 1 invalid
nvdimm nmem0: : pmem-6025e505: 0x1000000000 @ 0xf50000000 reserve <-- bad
Gate dpa reservation on the presence of valid index blocks.
Cc: <stable(a)vger.kernel.org>
Fixes: 4a826c83db4e ("libnvdimm: namespace indices: read and validate")
Reported-by: Krzysztof Rusocki <krzysztof.rusocki(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index f8913b8124b6..233907889f96 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -67,9 +67,11 @@ static int nvdimm_probe(struct device *dev)
ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
nd_label_copy(ndd, to_next_namespace_index(ndd),
to_current_namespace_index(ndd));
- rc = nd_label_reserve_dpa(ndd);
- if (ndd->ns_current >= 0)
- nvdimm_set_aliasing(dev);
+ if (ndd->ns_current >= 0) {
+ rc = nd_label_reserve_dpa(ndd);
+ if (rc == 0)
+ nvdimm_set_aliasing(dev);
+ }
nvdimm_clear_locked(dev);
nvdimm_bus_unlock(dev);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From c31898c8c711f2bbbcaebe802a55827e288d875a Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams(a)intel.com>
Date: Fri, 6 Apr 2018 11:25:38 -0700
Subject: [PATCH] libnvdimm, dimm: fix dpa reservation vs uninitialized label
area
At initialization time the 'dimm' driver caches a copy of the memory
device's label area and reserves address space for each of the
namespaces defined.
However, as can be seen below, the reservation occurs even when the
index blocks are invalid:
nvdimm nmem0: nvdimm_init_config_data: len: 131072 rc: 0
nvdimm nmem0: config data size: 131072
nvdimm nmem0: __nd_label_validate: nsindex0 labelsize 1 invalid
nvdimm nmem0: __nd_label_validate: nsindex1 labelsize 1 invalid
nvdimm nmem0: : pmem-6025e505: 0x1000000000 @ 0xf50000000 reserve <-- bad
Gate dpa reservation on the presence of valid index blocks.
Cc: <stable(a)vger.kernel.org>
Fixes: 4a826c83db4e ("libnvdimm: namespace indices: read and validate")
Reported-by: Krzysztof Rusocki <krzysztof.rusocki(a)intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index f8913b8124b6..233907889f96 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -67,9 +67,11 @@ static int nvdimm_probe(struct device *dev)
ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
nd_label_copy(ndd, to_next_namespace_index(ndd),
to_current_namespace_index(ndd));
- rc = nd_label_reserve_dpa(ndd);
- if (ndd->ns_current >= 0)
- nvdimm_set_aliasing(dev);
+ if (ndd->ns_current >= 0) {
+ rc = nd_label_reserve_dpa(ndd);
+ if (rc == 0)
+ nvdimm_set_aliasing(dev);
+ }
nvdimm_clear_locked(dev);
nvdimm_bus_unlock(dev);
The patch below does not apply to the 4.16-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e2fb992d82c626c43ed0566e07c410e56a087af3 Mon Sep 17 00:00:00 2001
From: James Bottomley <James.Bottomley(a)HansenPartnership.com>
Date: Wed, 21 Mar 2018 11:43:48 -0700
Subject: [PATCH] tpm: add retry logic
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
TPM2 can return TPM2_RC_RETRY to any command and when it does we get
unexpected failures inside the kernel that surprise users (this is
mostly observed in the trusted key handling code). The UEFI 2.6 spec
has advice on how to handle this:
The firmware SHALL not return TPM2_RC_RETRY prior to the completion
of the call to ExitBootServices().
Implementer’s Note: the implementation of this function should check
the return value in the TPM response and, if it is TPM2_RC_RETRY,
resend the command. The implementation may abort if a sufficient
number of retries has been done.
So we follow that advice in our tpm_transmit() code using
TPM2_DURATION_SHORT as the initial wait duration and
TPM2_DURATION_LONG as the maximum wait time. This should fix all the
in-kernel use cases and also means that user space TSS implementations
don't have to have their own retry handling.
Signed-off-by: James Bottomley <James.Bottomley(a)HansenPartnership.com>
Cc: stable(a)vger.kernel.org
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
index 22288ff70a0b..d5379a79274c 100644
--- a/drivers/char/tpm/tpm-interface.c
+++ b/drivers/char/tpm/tpm-interface.c
@@ -399,21 +399,10 @@ static void tpm_relinquish_locality(struct tpm_chip *chip)
chip->locality = -1;
}
-/**
- * tpm_transmit - Internal kernel interface to transmit TPM commands.
- *
- * @chip: TPM chip to use
- * @space: tpm space
- * @buf: TPM command buffer
- * @bufsiz: length of the TPM command buffer
- * @flags: tpm transmit flags - bitmap
- *
- * Return:
- * 0 when the operation is successful.
- * A negative number for system errors (errno).
- */
-ssize_t tpm_transmit(struct tpm_chip *chip, struct tpm_space *space,
- u8 *buf, size_t bufsiz, unsigned int flags)
+static ssize_t tpm_try_transmit(struct tpm_chip *chip,
+ struct tpm_space *space,
+ u8 *buf, size_t bufsiz,
+ unsigned int flags)
{
struct tpm_output_header *header = (void *)buf;
int rc;
@@ -544,6 +533,62 @@ ssize_t tpm_transmit(struct tpm_chip *chip, struct tpm_space *space,
return rc ? rc : len;
}
+/**
+ * tpm_transmit - Internal kernel interface to transmit TPM commands.
+ *
+ * @chip: TPM chip to use
+ * @space: tpm space
+ * @buf: TPM command buffer
+ * @bufsiz: length of the TPM command buffer
+ * @flags: tpm transmit flags - bitmap
+ *
+ * A wrapper around tpm_try_transmit that handles TPM2_RC_RETRY
+ * returns from the TPM and retransmits the command after a delay up
+ * to a maximum wait of TPM2_DURATION_LONG.
+ *
+ * Note: TPM1 never returns TPM2_RC_RETRY so the retry logic is TPM2
+ * only
+ *
+ * Return:
+ * the length of the return when the operation is successful.
+ * A negative number for system errors (errno).
+ */
+ssize_t tpm_transmit(struct tpm_chip *chip, struct tpm_space *space,
+ u8 *buf, size_t bufsiz, unsigned int flags)
+{
+ struct tpm_output_header *header = (struct tpm_output_header *)buf;
+ /* space for header and handles */
+ u8 save[TPM_HEADER_SIZE + 3*sizeof(u32)];
+ unsigned int delay_msec = TPM2_DURATION_SHORT;
+ u32 rc = 0;
+ ssize_t ret;
+ const size_t save_size = min(space ? sizeof(save) : TPM_HEADER_SIZE,
+ bufsiz);
+
+ /*
+ * Subtlety here: if we have a space, the handles will be
+ * transformed, so when we restore the header we also have to
+ * restore the handles.
+ */
+ memcpy(save, buf, save_size);
+
+ for (;;) {
+ ret = tpm_try_transmit(chip, space, buf, bufsiz, flags);
+ if (ret < 0)
+ break;
+ rc = be32_to_cpu(header->return_code);
+ if (rc != TPM2_RC_RETRY)
+ break;
+ delay_msec *= 2;
+ if (delay_msec > TPM2_DURATION_LONG) {
+ dev_err(&chip->dev, "TPM is in retry loop\n");
+ break;
+ }
+ tpm_msleep(delay_msec);
+ memcpy(buf, save, save_size);
+ }
+ return ret;
+}
/**
* tpm_transmit_cmd - send a tpm command to the device
* The function extracts tpm out header return code
diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index ab3bcdd4d328..67656a97793a 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -115,6 +115,7 @@ enum tpm2_return_codes {
TPM2_RC_COMMAND_CODE = 0x0143,
TPM2_RC_TESTING = 0x090A, /* RC_WARN */
TPM2_RC_REFERENCE_H0 = 0x0910,
+ TPM2_RC_RETRY = 0x0922,
};
enum tpm2_algorithms {