Kernel message explanation:
 * Description:
 * The FCP channel reported that its bit error threshold has been exceeded.
 * These errors might result from a problem with the physical components
 * of the local fibre link into the FCP channel.
 * The problem might be damage or malfunction of the cable or
 * cable connection between the FCP channel and
 * the adjacent fabric switch port or the point-to-point peer.
 * Find details about the errors in the HBA trace for the FCP device.
 * The zfcp device driver closed down the FCP device
 * to limit the performance impact from possible I/O command timeouts.
 * User action:
 * Check for problems on the local fibre link, ensure that fibre optics are
 * clean and functional, and all cables are properly plugged.
 * After the repair action, you can manually recover the FCP device by
 * writing "0" into its "failed" sysfs attribute.
 * If recovery through sysfs is not possible, set the CHPID of the device
 * offline and back online on the service element.
Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org> #2.6.30+
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
---
Martin, James,
an important zfcp fix for v5.4-rc. It applies to Martin's 5.4/scsi-fixes or to James' fixes branch.
 drivers/s390/scsi/zfcp_fsf.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/drivers/s390/scsi/zfcp_fsf.c b/drivers/s390/scsi/zfcp_fsf.c
index 296bbc3c4606..cf63916814cc 100644
--- a/drivers/s390/scsi/zfcp_fsf.c
+++ b/drivers/s390/scsi/zfcp_fsf.c
@@ -27,6 +27,11 @@
 
 struct kmem_cache *zfcp_fsf_qtcb_cache;
 
+static bool ber_stop = true;
+module_param(ber_stop, bool, 0600);
+MODULE_PARM_DESC(ber_stop,
+		 "Shuts down FCP devices for FCP channels that report a bit-error count in excess of its threshold (default on)");
+
 static void zfcp_fsf_request_timeout_handler(struct timer_list *t)
 {
 	struct zfcp_fsf_req *fsf_req = from_timer(fsf_req, t, timer);
@@ -236,10 +241,15 @@ static void zfcp_fsf_status_read_handler(struct zfcp_fsf_req *req)
 	case FSF_STATUS_READ_SENSE_DATA_AVAIL:
 		break;
 	case FSF_STATUS_READ_BIT_ERROR_THRESHOLD:
-		dev_warn(&adapter->ccw_device->dev,
-			 "The error threshold for checksum statistics "
-			 "has been exceeded\n");
 		zfcp_dbf_hba_bit_err("fssrh_3", req);
+		if (ber_stop) {
+			dev_warn(&adapter->ccw_device->dev,
+				 "All paths over this FCP device are disused because of excessive bit errors\n");
+			zfcp_erp_adapter_shutdown(adapter, 0, "fssrh_b");
+		} else {
+			dev_warn(&adapter->ccw_device->dev,
+				 "The error threshold for checksum statistics has been exceeded\n");
+		}
 		break;
 	case FSF_STATUS_READ_LINK_DOWN:
 		zfcp_fsf_status_read_link_down(req);
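The user action in the message explanation names one recovery step: writing "0" into the "failed" sysfs attribute of the FCP device. A minimal userspace sketch of that step follows. The helper name zfcp_clear_failed() is made up for illustration, and the attribute location /sys/bus/ccw/drivers/zfcp/<device_bus_id>/failed is an assumption about the usual zfcp sysfs layout, not something this patch states.

```c
#include <stdio.h>

/*
 * Write "0" into a zfcp "failed" sysfs attribute to request recovery.
 * Assumed location (the bus ID is a made-up example):
 *   /sys/bus/ccw/drivers/zfcp/0.0.1900/failed
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int zfcp_clear_failed(const char *attr_path)
{
	FILE *f = fopen(attr_path, "w");

	if (!f)
		return -1;
	if (fputs("0\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes the buffer, so a rejected write surfaces here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

A caller would pass the full attribute path of the affected FCP device and fall back to toggling the CHPID on the service element if the call returns -1.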
Steffen,
> Kernel message explanation:
>
>  * Description:
>  * The FCP channel reported that its bit error threshold has been exceeded.
>  * These errors might result from a problem with the physical components
>  * of the local fibre link into the FCP channel.
>  * The problem might be damage or malfunction of the cable or
>  * cable connection between the FCP channel and
>  * the adjacent fabric switch port or the point-to-point peer.
>  * Find details about the errors in the HBA trace for the FCP device.
>  * The zfcp device driver closed down the FCP device
>  * to limit the performance impact from possible I/O command timeouts.
>  * User action:
>  * Check for problems on the local fibre link, ensure that fibre optics are
>  * clean and functional, and all cables are properly plugged.
>  * After the repair action, you can manually recover the FCP device by
>  * writing "0" into its "failed" sysfs attribute.
>  * If recovery through sysfs is not possible, set the CHPID of the device
>  * offline and back online on the service element.
This commentary does not read like a patch description. It makes no mention of the actual kernel changes and the introduced module parameter.
> +static bool ber_stop = true;
> +module_param(ber_stop, bool, 0600);
> +MODULE_PARM_DESC(ber_stop,
> +		 "Shuts down FCP devices for FCP channels that report a bit-error count in excess of its threshold (default on)");
On excessive bit errors for the FCP channel ingress fibre path, the channel notifies us. Previously, we only emitted a kernel message and a trace record. Since performance can become suboptimal with I/O timeouts due to bit errors, we now stop using an FCP device by default on channel notification so multipath on top can timely failover to other paths. A new module parameter zfcp.ber_stop can be used to get zfcp old behavior.
User explanation of new kernel message:

 * Description:
 * The FCP channel reported that its bit error threshold has been exceeded.
 * These errors might result from a problem with the physical components
 * of the local fibre link into the FCP channel.
 * The problem might be damage or malfunction of the cable or
 * cable connection between the FCP channel and
 * the adjacent fabric switch port or the point-to-point peer.
 * Find details about the errors in the HBA trace for the FCP device.
 * The zfcp device driver closed down the FCP device
 * to limit the performance impact from possible I/O command timeouts.
 * User action:
 * Check for problems on the local fibre link, ensure that fibre optics are
 * clean and functional, and all cables are properly plugged.
 * After the repair action, you can manually recover the FCP device by
 * writing "0" into its "failed" sysfs attribute.
 * If recovery through sysfs is not possible, set the CHPID of the device
 * offline and back online on the service element.
Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org> #2.6.30+
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
---
Martin, James,
an important zfcp fix for v5.4-rc. It applies to Martin's 5.4/scsi-fixes or to James' fixes branch.
Changes since v1:
* Martin's review comments: describe code change and new module parameter
 drivers/s390/scsi/zfcp_fsf.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/drivers/s390/scsi/zfcp_fsf.c b/drivers/s390/scsi/zfcp_fsf.c
index e31c6b47af97..1e279220f073 100644
--- a/drivers/s390/scsi/zfcp_fsf.c
+++ b/drivers/s390/scsi/zfcp_fsf.c
@@ -29,6 +29,11 @@
 
 struct kmem_cache *zfcp_fsf_qtcb_cache;
 
+static bool ber_stop = true;
+module_param(ber_stop, bool, 0600);
+MODULE_PARM_DESC(ber_stop,
+		 "Shuts down FCP devices for FCP channels that report a bit-error count in excess of its threshold (default on)");
+
 static void zfcp_fsf_request_timeout_handler(struct timer_list *t)
 {
 	struct zfcp_fsf_req *fsf_req = from_timer(fsf_req, t, timer);
@@ -238,10 +243,15 @@ static void zfcp_fsf_status_read_handler(struct zfcp_fsf_req *req)
 	case FSF_STATUS_READ_SENSE_DATA_AVAIL:
 		break;
 	case FSF_STATUS_READ_BIT_ERROR_THRESHOLD:
-		dev_warn(&adapter->ccw_device->dev,
-			 "The error threshold for checksum statistics "
-			 "has been exceeded\n");
 		zfcp_dbf_hba_bit_err("fssrh_3", req);
+		if (ber_stop) {
+			dev_warn(&adapter->ccw_device->dev,
+				 "All paths over this FCP device are disused because of excessive bit errors\n");
+			zfcp_erp_adapter_shutdown(adapter, 0, "fssrh_b");
+		} else {
+			dev_warn(&adapter->ccw_device->dev,
+				 "The error threshold for checksum statistics has been exceeded\n");
+		}
 		break;
 	case FSF_STATUS_READ_LINK_DOWN:
 		zfcp_fsf_status_read_link_down(req);
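Because ber_stop is registered with mode 0600, root should be able to read and change it at runtime through the module parameter file, in addition to setting zfcp.ber_stop at load or boot time. The following is a minimal userspace sketch under that assumption. The helper names are hypothetical, and the path /sys/module/zfcp/parameters/ber_stop is inferred from the module_param() call rather than stated in the patch.

```c
#include <stdio.h>

/*
 * Read a kernel bool module parameter file, e.g. the assumed path
 *   /sys/module/zfcp/parameters/ber_stop
 * Bool parameters read back as "Y" or "N".
 * Returns 1 (on), 0 (off), or -1 on error or unexpected content.
 */
int read_bool_param(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);
	if (c == 'Y' || c == 'y' || c == '1')
		return 1;
	if (c == 'N' || c == 'n' || c == '0')
		return 0;
	return -1;
}

/*
 * Write "1" or "0" into a bool module parameter file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int write_bool_param(const char *path, int on)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(on ? "1\n" : "0\n", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes the buffer, so a rejected write surfaces here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Calling write_bool_param() with on set to 0 would select the old behavior (warning only, no adapter shutdown) for all FCP devices at once, since the parameter is global to the driver.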
On Tue, Oct 01, 2019 at 12:49:49PM +0200, Steffen Maier wrote:
> On excessive bit errors for the FCP channel ingress fibre path, the channel notifies us. Previously, we only emitted a kernel message and a trace record. Since performance can become suboptimal with I/O timeouts due to bit errors, we now stop using an FCP device by default on channel notification so multipath on top can timely failover to other paths. A new module parameter zfcp.ber_stop can be used to get zfcp old behavior.
Ugh, module parameters? This isn't the 1990's anymore :(
Why not just make this a dynamic sysfs variable, that way you properly can set this on whatever device you want, not just "all or nothing"?
thanks,
greg k-h
On 10/1/19 4:14 PM, Greg KH wrote:
> On Tue, Oct 01, 2019 at 12:49:49PM +0200, Steffen Maier wrote:
>> On excessive bit errors for the FCP channel ingress fibre path, the channel notifies us. Previously, we only emitted a kernel message and a trace record. Since performance can become suboptimal with I/O timeouts due to bit errors, we now stop using an FCP device by default on channel notification so multipath on top can timely failover to other paths. A new module parameter zfcp.ber_stop can be used to get zfcp old behavior.
> Ugh, module parameters? This isn't the 1990's anymore :(
> Why not just make this a dynamic sysfs variable, that way you properly can set this on whatever device you want, not just "all or nothing"?
Since we can see many more (virtual) FCP devices than we want to actually use, we defer probing. It means, we only start allocating structures and sysfs entries on setting an FCP "online" for the first time. Setting online works through another sysfs attribute owned by our ccw bus code component called "cio". IIRC, setting online does not emit a uevent. On setting online, the (add) uevent of hot-/coldplug of an FCP device had already happened, so we could not easily have end users craft udev rules to automatically/persistently configure a new sysfs attribute (which is FCP-device-specific and appears late) to disable the new code behavior.
Not sure if that could ever become a problem for end users: Even if we were to write into a new sysfs attribute, the attribute only appears during setting online so this might race with starting to actually use the FCP device with the new default behavior and could potentially disable I/O paths before the sysfs attribute write could become effective to disable the new behavior.
On Tue, Oct 01, 2019 at 05:07:50PM +0200, Steffen Maier wrote:
> On 10/1/19 4:14 PM, Greg KH wrote:
>> On Tue, Oct 01, 2019 at 12:49:49PM +0200, Steffen Maier wrote:
>>> On excessive bit errors for the FCP channel ingress fibre path, the channel notifies us. Previously, we only emitted a kernel message and a trace record. Since performance can become suboptimal with I/O timeouts due to bit errors, we now stop using an FCP device by default on channel notification so multipath on top can timely failover to other paths. A new module parameter zfcp.ber_stop can be used to get zfcp old behavior.
>> Ugh, module parameters? This isn't the 1990's anymore :(
>> Why not just make this a dynamic sysfs variable, that way you properly can set this on whatever device you want, not just "all or nothing"?
> Since we can see many more (virtual) FCP devices than we want to actually use, we defer probing. It means, we only start allocating structures and sysfs entries on setting an FCP "online" for the first time. Setting online works through another sysfs attribute owned by our ccw bus code component called "cio". IIRC, setting online does not emit a uevent. On setting online, the (add) uevent of hot-/coldplug of an FCP device had already happened, so we could not easily have end users craft udev rules to automatically/persistently configure a new sysfs attribute (which is FCP-device-specific and appears late) to disable the new code behavior.
> Not sure if that could ever become a problem for end users: Even if we were to write into a new sysfs attribute, the attribute only appears during setting online so this might race with starting to actually use the FCP device with the new default behavior and could potentially disable I/O paths before the sysfs attribute write could become effective to disable the new behavior.
Ok, then why make this a module option that you will have to support for the next 20+ years anyway if you feel this fix is the correct way that it should be done instead?
module options are tough to manage and support, only add them as a very last thing, when all other options have been ruled out.
thanks,
greg k-h
Greg,
> Ok, then why make this a module option that you will have to support for the next 20+ years anyway if you feel this fix is the correct way that it should be done instead?
I agree.
Why not just shut FCP down unconditionally on excessive bit errors? What's the benefit of allowing things to continue? Are you hoping things will eventually recover in a single-path scenario?
On 10/1/19 8:26 PM, Martin K. Petersen wrote:
>> Ok, then why make this a module option that you will have to support for the next 20+ years anyway if you feel this fix is the correct way that it should be done instead?
> I agree.
> Why not just shut FCP down unconditionally on excessive bit errors? What's the benefit of allowing things to continue? Are you hoping things will eventually recover in a single-path scenario?
Experience told me that there will be an unforeseen end user scenario where I need a quick switch to let even shaky paths survive.
Steffen,
>> Why not just shut FCP down unconditionally on excessive bit errors? What's the benefit of allowing things to continue? Are you hoping things will eventually recover in a single-path scenario?
> Experience told me that there will be an unforeseen end user scenario where I need a quick switch to let even shaky paths survive.
Can't say I like it. But it's your driver.
Applied to 5.4/scsi-fixes. Thanks!
linux-stable-mirror@lists.linaro.org