- Linux-stable-mirror - lists.linaro.org

[PATCH 4.14 00/44] 4.14.58-stable review

by Greg Kroah-Hartman

This is the start of the stable review cycle for the 4.14.58 release. There are 44 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Wed Jul 25 12:24:22 UTC 2018. Anything received after that time might be too late. The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.58-rc… or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.14.y and the diffstat can be found below. thanks, greg k-h ------------- Pseudo-Shortlog of commits: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Linux 4.14.58-rc1 Mathias Nyman <mathias.nyman(a)linux.intel.com> xhci: Fix perceived dead host due to runtime suspend race with event handler Gautham R. Shenoy <ego(a)linux.vnet.ibm.com> powerpc/powernv: Fix save/restore of SPRG3 on entry/exit from stop (idle) Al Viro <viro(a)zeniv.linux.org.uk> cxl_getfile(): fix double-iput() on alloc_file() failures Al Viro <viro(a)ZenIV.linux.org.uk> alpha: fix osf_wait4() breakage Alexander Couzens <lynxis(a)fe80.eu> net: usb: asix: replace mii_nway_restart in resume path Sabrina Dubroca <sd(a)queasysnail.net> ipv6: make DAD fail with enhanced DAD when nonce length differs Florian Fainelli <f.fainelli(a)gmail.com> net: systemport: Fix CRC forwarding check for SYSTEMPORT Lite Saeed Mahameed <saeedm(a)mellanox.com> net/mlx4_en: Don't reuse RX page when XDP is set Haiyang Zhang <haiyangz(a)microsoft.com> hv_netvsc: Fix napi reschedule while receive completion is busy Sanjeev Bansal <sanjeevb.bansal(a)broadcom.com> tg3: Add higher cpu clock for 5762. Matevz Vucnik <vucnikm(a)gmail.com> qmi_wwan: add support for Quectel EG91 Gustavo A. R. Silva <gustavo(a)embeddedor.com> ptp: fix missing break in switch Heiner Kallweit <hkallweit1(a)gmail.com> net: phy: fix flag masking in __set_phy_supported David Ahern <dsahern(a)gmail.com> net/ipv4: Set oif in fib_compute_spec_dst Stefano Brivio <sbrivio(a)redhat.com> skbuff: Unconditionally copy pfmemalloc in __skb_clone() Stefano Brivio <sbrivio(a)redhat.com> net: Don't copy pfmemalloc flag in __copy_skb_header() Lorenzo Colitti <lorenzo(a)google.com> net: diag: Don't double-free TCP_NEW_SYN_RECV sockets in tcp_abort Davidlohr Bueso <dave(a)stgolabs.net> lib/rhashtable: consider param->min_size when setting initial table size Arnd Bergmann <arnd(a)arndb.de> ipv6: ila: select CONFIG_DST_CACHE Colin Ian King <colin.king(a)canonical.com> ipv6: fix useless rol32 call on hash Tyler Hicks <tyhicks(a)canonical.com> ipv4: Return EINVAL when ping_group_range sysctl doesn't map to user ns Toke Høiland-Jørgensen <toke(a)toke.dk> gen_stats: Fix netlink stats dumping in the presence of padding Lyude Paul <lyude(a)redhat.com> drm/nouveau: Avoid looping through fake MST connectors Lyude Paul <lyude(a)redhat.com> drm/nouveau: Use drm_connector_list_iter_* for iterating connectors Ville Syrjälä <ville.syrjala(a)linux.intel.com> drm/i915: Fix hotplug irq ack on i965/g4x Isaac J. Manjarres <isaacm(a)codeaurora.org> stop_machine: Disable preemption when waking two stopper threads Alexey Kardashevskiy <aik(a)ozlabs.ru> vfio/spapr: Use IOMMU pageshift rather than pagesize Gustavo A. R. Silva <gustavo(a)embeddedor.com> vfio/pci: Fix potential Spectre v1 Rafael J. Wysocki <rafael.j.wysocki(a)intel.com> cpufreq: intel_pstate: Register when ACPI PCCH is present Hugh Dickins <hughd(a)google.com> mm/huge_memory.c: fix data loss when splitting a file pmd Jing Xia <jing.xia.mail(a)gmail.com> mm: memcg: fix use after free in mem_cgroup_iter() Vineet Gupta <vgupta(a)synopsys.com> ARC: mm: allow mprotect to make stack mappings executable Alexey Brodkin <Alexey.Brodkin(a)synopsys.com> ARC: configs: Remove CONFIG_INITRAMFS_SOURCE from defconfigs Alexey Brodkin <abrodkin(a)synopsys.com> ARC: Fix CONFIG_SWAP Vineet Gupta <vgupta(a)synopsys.com> ARCv2: [plat-hsdk]: Save accl reg pair by default Po-Hsu Lin <po-hsu.lin(a)canonical.com> ALSA: hda: add mute led support for HP ProBook 455 G5 YOKOTA Hiroshi <yokota.hgml(a)gmail.com> ALSA: hda/realtek - Add Panasonic CF-SZ6 headset jack quirk Takashi Iwai <tiwai(a)suse.de> ALSA: rawmidi: Change resized buffers atomically OGAWA Hirofumi <hirofumi(a)mail.parknet.co.jp> fat: fix memory allocation failure handling of match_strdup() Dewet Thibaut <thibaut.dewet(a)nokia.com> x86/MCE: Remove min interval polling limitation Hugh Dickins <hughd(a)google.com> x86/events/intel/ds: Fix bts_interrupt_threshold alignment Ville Syrjälä <ville.syrjala(a)linux.intel.com> x86/apm: Don't access __preempt_count with zeroed fs Lan Tianyu <tianyu.lan(a)intel.com> KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel. Damien Le Moal <damien.lemoal(a)wdc.com> scsi: sd_zbc: Fix variable type and bogus comment ------------- Diffstat: Makefile | 4 +-- arch/alpha/kernel/osf_sys.c | 5 +--- arch/arc/Kconfig | 2 +- arch/arc/configs/axs101_defconfig | 1 - arch/arc/configs/axs103_defconfig | 1 - arch/arc/configs/axs103_smp_defconfig | 1 - arch/arc/configs/haps_hs_defconfig | 1 - arch/arc/configs/haps_hs_smp_defconfig | 1 - arch/arc/configs/hsdk_defconfig | 1 - arch/arc/configs/nsim_700_defconfig | 1 - arch/arc/configs/nsim_hs_defconfig | 1 - arch/arc/configs/nsim_hs_smp_defconfig | 1 - arch/arc/configs/nsimosci_defconfig | 1 - arch/arc/configs/nsimosci_hs_defconfig | 1 - arch/arc/configs/nsimosci_hs_smp_defconfig | 1 - arch/arc/include/asm/page.h | 2 +- arch/arc/include/asm/pgtable.h | 2 +- arch/arc/plat-hsdk/Kconfig | 2 ++ arch/powerpc/kernel/idle_book3s.S | 2 ++ arch/x86/events/intel/ds.c | 8 +++--- arch/x86/include/asm/apm.h | 6 ----- arch/x86/kernel/apm_32.c | 5 ++++ arch/x86/kernel/cpu/mcheck/mce.c | 3 --- drivers/cpufreq/intel_pstate.c | 17 +++++++++++- drivers/cpufreq/pcc-cpufreq.c | 4 +++ drivers/gpu/drm/i915/i915_irq.c | 32 +++++++++++++++++++++-- drivers/gpu/drm/nouveau/nouveau_backlight.c | 6 +++-- drivers/gpu/drm/nouveau/nouveau_connector.c | 9 +++++-- drivers/gpu/drm/nouveau/nouveau_connector.h | 36 +++++++++++++++++++++++--- drivers/gpu/drm/nouveau/nouveau_display.c | 10 ++++++-- drivers/misc/cxl/api.c | 8 +++--- drivers/net/ethernet/broadcom/bcmsysport.c | 4 +-- drivers/net/ethernet/broadcom/bcmsysport.h | 3 ++- drivers/net/ethernet/broadcom/tg3.c | 9 +++++++ drivers/net/ethernet/mellanox/mlx4/en_rx.c | 8 ++++-- drivers/net/hyperv/netvsc.c | 17 +++++++----- drivers/net/phy/phy_device.c | 7 ++--- drivers/net/usb/asix_devices.c | 4 ++- drivers/net/usb/qmi_wwan.c | 1 + drivers/ptp/ptp_chardev.c | 1 + drivers/scsi/sd_zbc.c | 5 ++-- drivers/usb/host/xhci.c | 40 ++++++++++++++++++++++++++--- drivers/usb/host/xhci.h | 4 +++ drivers/vfio/pci/vfio_pci.c | 4 +++ drivers/vfio/vfio_iommu_spapr_tce.c | 8 +++--- fs/fat/inode.c | 20 ++++++++++----- include/linux/sched/task.h | 2 +- include/linux/skbuff.h | 10 ++++---- include/net/ipv6.h | 2 +- kernel/stop_machine.c | 6 ++++- lib/rhashtable.c | 17 +++++++----- mm/huge_memory.c | 2 ++ mm/memcontrol.c | 2 +- net/core/gen_stats.c | 16 ++++++++++-- net/core/skbuff.c | 1 + net/ipv4/fib_frontend.c | 1 + net/ipv4/sysctl_net_ipv4.c | 5 ++-- net/ipv4/tcp.c | 3 +-- net/ipv6/Kconfig | 1 + net/ipv6/ndisc.c | 2 +- sound/core/rawmidi.c | 20 ++++++++++----- sound/pci/hda/patch_conexant.c | 1 + sound/pci/hda/patch_realtek.c | 1 + virt/kvm/eventfd.c | 6 ++++- 64 files changed, 295 insertions(+), 113 deletions(-)

7 years, 5 months

3
41
0 0

[PATCH 4.9 00/28] 4.9.115-stable review

by Greg Kroah-Hartman

This is the start of the stable review cycle for the 4.9.115 release. There are 28 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Wed Jul 25 12:24:13 UTC 2018. Anything received after that time might be too late. The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.115-rc… or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y and the diffstat can be found below. thanks, greg k-h ------------- Pseudo-Shortlog of commits: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Linux 4.9.115-rc1 Alan Jenkins <alan.christopher.jenkins(a)gmail.com> block: do not use interruptible wait anywhere Chuck Lever <chuck.lever(a)oracle.com> xprtrdma: Return -ENOBUFS when no pages are available Mathias Nyman <mathias.nyman(a)linux.intel.com> xhci: Fix perceived dead host due to runtime suspend race with event handler Stefano Brivio <sbrivio(a)redhat.com> skbuff: Unconditionally copy pfmemalloc in __skb_clone() Stefano Brivio <sbrivio(a)redhat.com> net: Don't copy pfmemalloc flag in __copy_skb_header() Alexander Couzens <lynxis(a)fe80.eu> net: usb: asix: replace mii_nway_restart in resume path Sanjeev Bansal <sanjeevb.bansal(a)broadcom.com> tg3: Add higher cpu clock for 5762. Matevz Vucnik <vucnikm(a)gmail.com> qmi_wwan: add support for Quectel EG91 Gustavo A. R. Silva <gustavo(a)embeddedor.com> ptp: fix missing break in switch Heiner Kallweit <hkallweit1(a)gmail.com> net: phy: fix flag masking in __set_phy_supported David Ahern <dsahern(a)gmail.com> net/ipv4: Set oif in fib_compute_spec_dst Lorenzo Colitti <lorenzo(a)google.com> net: diag: Don't double-free TCP_NEW_SYN_RECV sockets in tcp_abort Davidlohr Bueso <dave(a)stgolabs.net> lib/rhashtable: consider param->min_size when setting initial table size Colin Ian King <colin.king(a)canonical.com> ipv6: fix useless rol32 call on hash Tyler Hicks <tyhicks(a)canonical.com> ipv4: Return EINVAL when ping_group_range sysctl doesn't map to user ns Toke Høiland-Jørgensen <toke(a)toke.dk> gen_stats: Fix netlink stats dumping in the presence of padding Ville Syrjälä <ville.syrjala(a)linux.intel.com> drm/i915: Fix hotplug irq ack on i965/g4x Gustavo A. R. Silva <gustavo(a)embeddedor.com> vfio/pci: Fix potential Spectre v1 Hugh Dickins <hughd(a)google.com> mm/huge_memory.c: fix data loss when splitting a file pmd Jing Xia <jing.xia.mail(a)gmail.com> mm: memcg: fix use after free in mem_cgroup_iter() Alexey Brodkin <Alexey.Brodkin(a)synopsys.com> ARC: configs: Remove CONFIG_INITRAMFS_SOURCE from defconfigs Vineet Gupta <vgupta(a)synopsys.com> ARC: mm: allow mprotect to make stack mappings executable Alexey Brodkin <abrodkin(a)synopsys.com> ARC: Fix CONFIG_SWAP Takashi Iwai <tiwai(a)suse.de> ALSA: rawmidi: Change resized buffers atomically OGAWA Hirofumi <hirofumi(a)mail.parknet.co.jp> fat: fix memory allocation failure handling of match_strdup() Dewet Thibaut <thibaut.dewet(a)nokia.com> x86/MCE: Remove min interval polling limitation Ville Syrjälä <ville.syrjala(a)linux.intel.com> x86/apm: Don't access __preempt_count with zeroed fs Lan Tianyu <tianyu.lan(a)intel.com> KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel. ------------- Diffstat: Makefile | 4 +-- arch/arc/configs/axs101_defconfig | 1 - arch/arc/configs/axs103_defconfig | 1 - arch/arc/configs/axs103_smp_defconfig | 1 - arch/arc/configs/nsim_700_defconfig | 1 - arch/arc/configs/nsim_hs_defconfig | 1 - arch/arc/configs/nsim_hs_smp_defconfig | 1 - arch/arc/configs/nsimosci_defconfig | 1 - arch/arc/configs/nsimosci_hs_defconfig | 1 - arch/arc/configs/nsimosci_hs_smp_defconfig | 1 - arch/arc/include/asm/page.h | 2 +- arch/arc/include/asm/pgtable.h | 2 +- arch/x86/include/asm/apm.h | 6 ----- arch/x86/kernel/apm_32.c | 5 ++++ arch/x86/kernel/cpu/mcheck/mce.c | 3 --- block/blk-core.c | 9 +++---- drivers/gpu/drm/i915/i915_irq.c | 32 ++++++++++++++++++++++-- drivers/net/ethernet/broadcom/tg3.c | 9 +++++++ drivers/net/phy/phy_device.c | 7 ++---- drivers/net/usb/asix_devices.c | 4 ++- drivers/net/usb/qmi_wwan.c | 1 + drivers/ptp/ptp_chardev.c | 1 + drivers/usb/host/xhci.c | 40 +++++++++++++++++++++++++++--- drivers/usb/host/xhci.h | 4 +++ drivers/vfio/pci/vfio_pci.c | 4 +++ fs/fat/inode.c | 20 +++++++++------ include/linux/skbuff.h | 10 ++++---- include/net/ipv6.h | 2 +- lib/rhashtable.c | 17 ++++++++----- mm/huge_memory.c | 2 ++ mm/memcontrol.c | 2 +- net/core/gen_stats.c | 16 ++++++++++-- net/core/skbuff.c | 1 + net/ipv4/fib_frontend.c | 1 + net/ipv4/sysctl_net_ipv4.c | 5 ++-- net/ipv4/tcp.c | 3 +-- net/sunrpc/xprtrdma/rpc_rdma.c | 2 +- sound/core/rawmidi.c | 20 ++++++++++----- virt/kvm/eventfd.c | 6 ++++- 39 files changed, 176 insertions(+), 73 deletions(-)

7 years, 5 months

4
27
0 0

[PATCH 1/2] [v2] hfs/hfsplus: follow MacOS time behavior

by Arnd Bergmann

According to the official documentation for HFS+ [1], inode timestamps are supposed to cover the time range from 1904 to 2040 as originally used in classic MacOS. The traditional Linux usage is to convert the timestamps into an unsigned 32-bit number based on the Unix epoch and from there to a time_t. On 32-bit systems, that wraps the time from 2038 to 1902, so the last two years of the valid time range become garbled. On 64-bit systems, all times before 1970 get turned into timestamps between 2038 and 2106, which is more convenient but also different from the documented behavior. Looking at the Darwin sources [2], it seems that MacOS is inconsistent in yet another way: all timestamps are wrapped around to a 32-bit unsigned number when written to the disk, but when read back, all numeric values lower than 2082844800U are assumed to be invalid, so we cannot represent the times before 1970 or the times after 2040. While all implementations seem to agree on the interpretation of values between 1970 and 2038, they often differ on the exact range they support when reading back values outside of the common range: MacOS (traditional): 1904-2040 Apple Documentation: 1904-2040 MacOS X source comments: 1970-2040 MacOS X source code: 1970-2038 32-bit Linux: 1902-2038 64-bit Linux: 1970-2106 hfsfuse: 1970-2040 hfsutils (32 bit, old libc) 1902-2038 hfsutils (32 bit, new libc) 1970-2106 hfsutils (64 bit) 1904-2040 hfsplus-utils 1904-2040 hfsexplorer 1904-2040 7-zip 1904-2040 This changes Linux over to mostly the same behavior as described in the code comment in MacOS X, disallowing all times before 1970 and after 2040, while still allowing times between 2038 and 2040 like most other implementations do. Most importantly, it means we can have the same behavior on 32-bit and 64-bit. Cc: stable(a)vger.kernel.org Link: [1] https://developer.apple.com/library/archive/technotes/tn/tn1150.html Link: [2] https://opensource.apple.com/source/hfs/hfs-407.30.1/core/MacOSStubs.c.auto… Suggested-by: Viacheslav Dubeyko <slava(a)dubeyko.com> Signed-off-by: Arnd Bergmann <arnd(a)arndb.de> --- v2: treat pre-1970 dates as invalid following MacOS X behavior, reword and expand changelog text --- fs/hfs/hfs_fs.h | 29 +++++++++++++++++++++++++---- fs/hfsplus/hfsplus_fs.h | 26 +++++++++++++++++++++++--- 2 files changed, 48 insertions(+), 7 deletions(-) diff --git a/fs/hfs/hfs_fs.h b/fs/hfs/hfs_fs.h index 6d0783e2e276..1af998fb522e 100644 --- a/fs/hfs/hfs_fs.h +++ b/fs/hfs/hfs_fs.h @@ -246,14 +246,35 @@ extern void hfs_mark_mdb_dirty(struct super_block *sb); * mac: unsigned big-endian since 00:00 GMT, Jan. 1, 1904 * */ -#define __hfs_u_to_mtime(sec) cpu_to_be32(sec + 2082844800U - sys_tz.tz_minuteswest * 60) -#define __hfs_m_to_utime(sec) (be32_to_cpu(sec) - 2082844800U + sys_tz.tz_minuteswest * 60) +static inline time64_t __hfs_m_to_utime(__be32 mt) +{ + time64_t ut = (u32)(be32_to_cpu(mt) - 2082844800U); + + /* + * Times past 2040-02-06 06:28 are assumed to be invalid, + * matching the MacOS behavior. + */ + if (ut > 2082844800U + UINT_MAX) + ut = 0; + + return ut + sys_tz.tz_minuteswest * 60; +} +static inline __be32 __hfs_u_to_mtime(time64_t ut) +{ + ut -= - sys_tz.tz_minuteswest * 60; + + /* + * MacOS wraps "invalid" times after 2040 when writing back, so + * let's do the same here. + */ + return cpu_to_be32(lower_32_bits(ut + 2082844800U)); +} #define HFS_I(inode) (container_of(inode, struct hfs_inode_info, vfs_inode)) #define HFS_SB(sb) ((struct hfs_sb_info *)(sb)->s_fs_info) -#define hfs_m_to_utime(time) (struct timespec){ .tv_sec = __hfs_m_to_utime(time) } -#define hfs_u_to_mtime(time) __hfs_u_to_mtime((time).tv_sec) +#define hfs_m_to_utime(time) (struct timespec){ .tv_sec = __hfs_m_to_utime(time) } +#define hfs_u_to_mtime(time) __hfs_u_to_mtime((time).tv_sec) #define hfs_mtime() __hfs_u_to_mtime(get_seconds()) static inline const char *hfs_mdb_name(struct super_block *sb) diff --git a/fs/hfsplus/hfsplus_fs.h b/fs/hfsplus/hfsplus_fs.h index d9255abafb81..7f0943e540a0 100644 --- a/fs/hfsplus/hfsplus_fs.h +++ b/fs/hfsplus/hfsplus_fs.h @@ -530,9 +530,29 @@ int hfsplus_submit_bio(struct super_block *sb, sector_t sector, void *buf, void **data, int op, int op_flags); int hfsplus_read_wrapper(struct super_block *sb); -/* time macros */ -#define __hfsp_mt2ut(t) (be32_to_cpu(t) - 2082844800U) -#define __hfsp_ut2mt(t) (cpu_to_be32(t + 2082844800U)) +/* time helpers */ +static inline time64_t __hfsp_mt2ut(__be32 mt) +{ + time64_t ut = (u32)(be32_to_cpu(mt) - 2082844800U); + + /* + * Times past 2040-02-06 06:28 are assumed to be invalid, + * matching the MacOS behavior. + */ + if (ut > 2082844800U + UINT_MAX) + ut = 0; + + return ut; +} + +static inline __be32 __hfsp_ut2mt(time64_t ut) +{ + /* + * MacOS wraps "invalid" times after 2040 when writing back, so + * let's do the same here. + */ + return cpu_to_be32(lower_32_bits(ut + 2082844800U)); +} /* compatibility */ #define hfsp_mt2ut(t) (struct timespec){ .tv_sec = __hfsp_mt2ut(t) } -- 2.9.0

7 years, 5 months

2
2
0 0

[PATCH 2/3] [BUGFIX] ring_buffer: tracing: Inherit the tracing setting to next ring buffer

by Masami Hiramatsu

Inherit the tracing on/off setting on ring_buffer to next trace buffer when taking a snapshot. Taking a snapshot is done by swapping with backup ring buffer (max_tr_buffer). But since the tracing on/off setting is set in the ring buffer, when swapping it, tracing on/off setting can also be changed. This causes a strange result like below; /sys/kernel/debug/tracing # cat tracing_on 1 /sys/kernel/debug/tracing # echo 0 > tracing_on /sys/kernel/debug/tracing # echo 1 > snapshot /sys/kernel/debug/tracing # cat tracing_on 1 /sys/kernel/debug/tracing # echo 1 > snapshot /sys/kernel/debug/tracing # cat tracing_on 0 We don't touch tracing_on, but snapshot changes tracing_on setting each time. This must be a bug, because user never know that each "ring_buffer" stores tracing-enable state and snapshot is done by swapping ring buffers. This patch fixes above strange behavior. Fixes: commit debdd57f5145 ("tracing: Make a snapshot feature available from userspace") Signed-off-by: Masami Hiramatsu <mhiramat(a)kernel.org> Cc: Steven Rostedt <rostedt(a)goodmis.org> Cc: Ingo Molnar <mingo(a)redhat.com> Cc: Hiraku Toyooka <hiraku.toyooka(a)cybertrust.co.jp> Cc: stable(a)vger.kernel.org --- include/linux/ring_buffer.h | 1 + kernel/trace/ring_buffer.c | 12 ++++++++++++ kernel/trace/trace.c | 6 ++++++ 3 files changed, 19 insertions(+) diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index b72ebdff0b77..003d09ab308d 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -165,6 +165,7 @@ void ring_buffer_record_enable(struct ring_buffer *buffer); void ring_buffer_record_off(struct ring_buffer *buffer); void ring_buffer_record_on(struct ring_buffer *buffer); int ring_buffer_record_is_on(struct ring_buffer *buffer); +int ring_buffer_record_is_set_on(struct ring_buffer *buffer); void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu); void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 6a46af21765c..4038ed74ab95 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -3227,6 +3227,18 @@ int ring_buffer_record_is_on(struct ring_buffer *buffer) } /** + * ring_buffer_record_is_set_on - return true if the ring buffer is set writable + * @buffer: The ring buffer to see if write is set enabled + * + * Returns true if the ring buffer is set writable by ring_buffer_record_on(). + * Note that this does NOT mean it is in a writable state. + */ +int ring_buffer_record_is_set_on(struct ring_buffer *buffer) +{ + return !(atomic_read(&buffer->record_disabled) & RB_BUFFER_OFF); +} + +/** * ring_buffer_record_disable_cpu - stop all writes into the cpu_buffer * @buffer: The ring buffer to stop writes to. * @cpu: The CPU buffer to stop diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 2556d8c097d2..bbd5a94a7ef1 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1378,6 +1378,12 @@ update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu) arch_spin_lock(&tr->max_lock); + /* Inherit the recordable setting from trace_buffer */ + if (ring_buffer_record_is_set_on(tr->trace_buffer.buffer)) + ring_buffer_record_on(tr->max_buffer.buffer); + else + ring_buffer_record_off(tr->max_buffer.buffer); + swap(tr->trace_buffer.buffer, tr->max_buffer.buffer); __update_max_tr(tr, tsk, cpu);

7 years, 5 months

2
2
0 0

[PATCHv3 3/3] mm: Fix vma_is_anonymous() false-positives

by Kirill A. Shutemov

vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous VMA. This is unreliable as ->mmap may not set ->vm_ops. False-positive vma_is_anonymous() may lead to crashes: next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0 prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000 pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000 flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare) ------------[ cut here ]------------ kernel BUG at mm/memory.c:1422! invalid opcode: 0000 [#1] SMP KASAN CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline] RIP: 0010:zap_pud_range mm/memory.c:1466 [inline] RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline] RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508 Code: ff 31 ff 4c 89 e6 42 c6 04 33 f8 e8 92 dd d0 ff 4d 85 e4 0f 85 4a eb ff ff e8 54 dc d0 ff 48 8b bd 10 fc ff ff e8 82 95 fe ff <0f> 0b e8 41 dc d0 ff 0f 0b 4c 89 ad 18 fc ff ff c7 85 7c fb ff ff RSP: 0018:ffff8801b0587330 EFLAGS: 00010286 RAX: 000000000000013c RBX: 1ffff100360b0e9c RCX: ffffc90002620000 RDX: 0000000000000000 RSI: ffffffff81631851 RDI: 0000000000000001 RBP: ffff8801b05877c8 R08: ffff880199d40300 R09: ffffed003b5c4fc0 R10: ffffed003b5c4fc0 R11: ffff8801dae27e07 R12: 0000000000000000 R13: ffff88019c1e13c0 R14: dffffc0000000000 R15: 0000000020e01000 FS: 00007fca32251700(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f04c540d000 CR3: 00000001ac1f0000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: unmap_single_vma+0x1a0/0x310 mm/memory.c:1553 zap_page_range_single+0x3cc/0x580 mm/memory.c:1644 unmap_mapping_range_vma mm/memory.c:2792 [inline] unmap_mapping_range_tree mm/memory.c:2813 [inline] unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845 unmap_mapping_range+0x48/0x60 mm/memory.c:2880 truncate_pagecache+0x54/0x90 mm/truncate.c:800 truncate_setsize+0x70/0xb0 mm/truncate.c:826 simple_setattr+0xe9/0x110 fs/libfs.c:409 notify_change+0xf13/0x10f0 fs/attr.c:335 do_truncate+0x1ac/0x2b0 fs/open.c:63 do_sys_ftruncate+0x492/0x560 fs/open.c:205 __do_sys_ftruncate fs/open.c:215 [inline] __se_sys_ftruncate fs/open.c:213 [inline] __x64_sys_ftruncate+0x59/0x80 fs/open.c:213 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Reproducer: #include <stdio.h> #include <stddef.h> #include <stdint.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/ioctl.h> #include <sys/mman.h> #include <unistd.h> #include <fcntl.h> #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long) #define KCOV_ENABLE _IO('c', 100) #define KCOV_DISABLE _IO('c', 101) #define COVER_SIZE (1024<<10) #define KCOV_TRACE_PC 0 #define KCOV_TRACE_CMP 1 int main(int argc, char **argv) { int fd; unsigned long *cover; system("mount -t debugfs none /sys/kernel/debug"); fd = open("/sys/kernel/debug/kcov", O_RDWR); ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE); cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); munmap(cover, COVER_SIZE * sizeof(unsigned long)); cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long), PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0); memset(cover, 0, COVER_SIZE * sizeof(unsigned long)); ftruncate(fd, 3UL << 20); return 0; } This can be fixed by assigning anonymous VMAs own vm_ops and not relying on it being NULL. If ->mmap() failed to set ->vm_ops, mmap_region() will set it to dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs. Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com> Reported-by: syzbot+3f84280d52be9b7083cc(a)syzkaller.appspotmail.com Cc: stable(a)vger.kernel.org Cc: Dmitry Vyukov <dvyukov(a)google.com> Cc: Oleg Nesterov <oleg(a)redhat.com> Cc: Andrea Arcangeli <aarcange(a)redhat.com> --- drivers/char/mem.c | 1 + fs/exec.c | 1 + include/linux/mm.h | 8 ++++++++ mm/mmap.c | 3 +++ mm/nommu.c | 2 ++ 5 files changed, 15 insertions(+) diff --git a/drivers/char/mem.c b/drivers/char/mem.c index ffeb60d3434c..df66a9dd0aae 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -708,6 +708,7 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) #endif if (vma->vm_flags & VM_SHARED) return shmem_zero_setup(vma); + vma_set_anonymous(vma); return 0; } diff --git a/fs/exec.c b/fs/exec.c index 72e961a62adb..bdd0eacefdf5 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -293,6 +293,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm) bprm->vma = vma = vm_area_alloc(mm); if (!vma) return -ENOMEM; + vma_set_anonymous(vma); if (down_write_killable(&mm->mmap_sem)) { err = -EINTR; diff --git a/include/linux/mm.h b/include/linux/mm.h index 31540f166987..7ba6d356d18f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -454,10 +454,18 @@ struct vm_operations_struct { static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) { + static const struct vm_operations_struct dummy_vm_ops = {}; + vma->vm_mm = mm; + vma->vm_ops = &dummy_vm_ops; INIT_LIST_HEAD(&vma->anon_vma_chain); } +static inline void vma_set_anonymous(struct vm_area_struct *vma) +{ + vma->vm_ops = NULL; +} + struct mmu_gather; struct inode; diff --git a/mm/mmap.c b/mm/mmap.c index ff1944d8d458..17bbf4d3e24f 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1778,6 +1778,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, error = shmem_zero_setup(vma); if (error) goto free_vma; + } else { + vma_set_anonymous(vma); } vma_link(mm, vma, prev, rb_link, rb_parent); @@ -2983,6 +2985,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla return -ENOMEM; } + vma_set_anonymous(vma); vma->vm_start = addr; vma->vm_end = addr + len; vma->vm_pgoff = pgoff; diff --git a/mm/nommu.c b/mm/nommu.c index 1d22fdbf7d7c..9fc9e43335b6 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1145,6 +1145,8 @@ static int do_mmap_private(struct vm_area_struct *vma, if (ret < len) memset(base + ret, 0, len - ret); + } else { + vma_set_anonymous(vma); } return 0; -- 2.18.0

7 years, 5 months

1
0
0 0

[PATCH] nvmet-fc: fix target sgl list on large transfers

by James Smart

The existing code to carve up the sg list expected an sg element-per-page which can be very incorrect with iommu's remapping multiple memory pages to fewer bus addresses. To hit this error required a large io payload (greater than 256k) and a system that maps on a per-page basis. It's possible that large ios could get by fine if the system condensed the sgl list into the first 64 elements. This patch corrects the sg list handling by specifically walking the sg list element by element and attempting to divide the transfer up on a per-sg element boundary. While doing so, it still tries to keep sequences under 256k, but will exceed that rule if a single sg element is larger than 256k. Fixes: 48fa362b6c3f ("nvmet-fc: simplify sg list handling") Cc: <stable(a)vger.kernel.org> # 4.14 Signed-off-by: James Smart <james.smart(a)broadcom.com> --- drivers/nvme/target/fc.c | 44 +++++++++++++++++++++++++++++++++++--------- 1 file changed, 35 insertions(+), 9 deletions(-) diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c index 408279cb6f2c..29b4b236afd8 100644 --- a/drivers/nvme/target/fc.c +++ b/drivers/nvme/target/fc.c @@ -58,8 +58,8 @@ struct nvmet_fc_ls_iod { struct work_struct work; } __aligned(sizeof(unsigned long long)); +/* desired maximum for a single sequence - if sg list allows it */ #define NVMET_FC_MAX_SEQ_LENGTH (256 * 1024) -#define NVMET_FC_MAX_XFR_SGENTS (NVMET_FC_MAX_SEQ_LENGTH / PAGE_SIZE) enum nvmet_fcp_datadir { NVMET_FCP_NODATA, @@ -74,6 +74,7 @@ struct nvmet_fc_fcp_iod { struct nvme_fc_cmd_iu cmdiubuf; struct nvme_fc_ersp_iu rspiubuf; dma_addr_t rspdma; + struct scatterlist *next_sg; struct scatterlist *data_sg; int data_sg_cnt; u32 offset; @@ -1025,8 +1026,7 @@ nvmet_fc_register_targetport(struct nvmet_fc_port_info *pinfo, INIT_LIST_HEAD(&newrec->assoc_list); kref_init(&newrec->ref); ida_init(&newrec->assoc_cnt); - newrec->max_sg_cnt = min_t(u32, NVMET_FC_MAX_XFR_SGENTS, - template->max_sgl_segments); + newrec->max_sg_cnt = template->max_sgl_segments; ret = nvmet_fc_alloc_ls_iodlist(newrec); if (ret) { @@ -1722,6 +1722,7 @@ nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod) ((fod->io_dir == NVMET_FCP_WRITE) ? DMA_FROM_DEVICE : DMA_TO_DEVICE)); /* note: write from initiator perspective */ + fod->next_sg = fod->data_sg; return 0; @@ -1866,24 +1867,49 @@ nvmet_fc_transfer_fcp_data(struct nvmet_fc_tgtport *tgtport, struct nvmet_fc_fcp_iod *fod, u8 op) { struct nvmefc_tgt_fcp_req *fcpreq = fod->fcpreq; + struct scatterlist *sg = fod->next_sg; unsigned long flags; - u32 tlen; + u32 remaininglen = fod->req.transfer_len - fod->offset; + u32 tlen = 0; int ret; fcpreq->op = op; fcpreq->offset = fod->offset; fcpreq->timeout = NVME_FC_TGTOP_TIMEOUT_SEC; - tlen = min_t(u32, tgtport->max_sg_cnt * PAGE_SIZE, - (fod->req.transfer_len - fod->offset)); + /* + * for next sequence: + * break at a sg element boundary + * attempt to keep sequence length capped at + * NVMET_FC_MAX_SEQ_LENGTH but allow sequence to + * be longer if a single sg element is larger + * than that amount. This is done to avoid creating + * a new sg list to use for the tgtport api. + */ + fcpreq->sg = sg; + fcpreq->sg_cnt = 0; + while (tlen < remaininglen && + fcpreq->sg_cnt < tgtport->max_sg_cnt && + tlen + sg_dma_len(sg) < NVMET_FC_MAX_SEQ_LENGTH) { + fcpreq->sg_cnt++; + tlen += sg_dma_len(sg); + sg = sg_next(sg); + } + if (tlen < remaininglen && fcpreq->sg_cnt == 0) { + fcpreq->sg_cnt++; + tlen += min_t(u32, sg_dma_len(sg), remaininglen); + sg = sg_next(sg); + } + if (tlen < remaininglen) + fod->next_sg = sg; + else + fod->next_sg = NULL; + fcpreq->transfer_length = tlen; fcpreq->transferred_length = 0; fcpreq->fcp_error = 0; fcpreq->rsplen = 0; - fcpreq->sg = &fod->data_sg[fod->offset / PAGE_SIZE]; - fcpreq->sg_cnt = DIV_ROUND_UP(tlen, PAGE_SIZE); - /* * If the last READDATA request: check if LLDD supports * combined xfr with response. -- 2.13.1

7 years, 5 months

2
1
0 0

editing for your photos

by Roland

I would like to speak with the person that managing photos for your company? We provide image editing like – photos cutting out and retouching. Enhancing your images is just a part of what we can do for your business. Whether you’re an ecommerce store or portrait photographer, real estate professional, or an e-Retailer, we are your personal team of photo editors that integrate seamlessly with your business. Our mainly services are: . Cut out, masking, clipping path, deep etching, transparent background . Colour correction, black and white, light and shadows etc. . Dust cleaning, spot cleaning . Beauty retouching, skin retouching, face retouching, body retouching . Fashion/Beauty Image Retouching . Product image Retouching . Real estate image Retouching . Wedding & Event Album Design. . Restoration and repair old images . Vector Conversion . Portrait image Retouching We can provide you editing test on your photos. Please reply if you are interested. Thanks, Roland

7 years, 5 months

1
0
0 0

[PATCH] perf/core: fix a possible deadlock scenario

by Cong Wang

hrtimer_cancel() busy-waits for the hrtimer callback to stop, pretty much like del_timer_sync(). This creates a possible deadlock scenario where we hold a spinlock before calling hrtimer_cancel() while in trying to acquire the same spinlock in the callback. This kind of deadlock is already known and is catchable by lockdep, like for del_timer_sync(), we can add lockdep annotations. However, it is still missing for hrtimer_cancel(). (I have a WIP patch to make it complete for hrtimer_cancel() but it breaks booting.) And there is such a deadlock scenario in kernel/events/core.c too, well actually, it is a simpler version: the hrtimer callback waits for itself to finish on the same CPU! It sounds stupid but it is not obvious at all, it hides very deeply in the perf event code: cpu_clock_event_init(): perf_swevent_init_hrtimer(): hwc->hrtimer.function = perf_swevent_hrtimer; perf_swevent_hrtimer(): __perf_event_overflow(): __perf_event_account_interrupt(): perf_adjust_period(): pmu->stop(): cpu_clock_event_stop(): perf_swevent_cancel(): hrtimer_cancel() Getting stuck in a timer doesn't sound very scary, however, in this case, its consequences are a disaster: perf_event_overflow() which calls __perf_event_overflow() is called in NMI handler too, so it is racy with hrtimer callback as disabling IRQ can't possibly disable NMI. This means this hrtimer callback once interrupted by an NMI handler could deadlock within NMI! As a further consequence, other IRQ handling is blocked too, notably the IPI handler, especially when smp_call_function_*() waits for their callbacks synchronously. This is why we saw so many soft lockup's in smp_call_function_single() given how widely they are used in kernel. Ironically, perf event code uses synchronous smp_call_function_single() heavily too. The fix is not easy. To minimize the impact, ideally we should just avoid busy waiting when it is called within the hrtimer callback on the same CPU, there is no reason to wait for itself to finish anyway. Probably it doesn't even need to cancel itself either since it will restart by pmu->start() later. There are two possible fixes here: 1. Modify hrtimer API to detect if a hrtimer callback is running on the same CPU now. This does not look pretty though. 2. Passing some information from perf_swevent_hrtimer() down to perf_swevent_cancel(). So I pick the latter approach, it is simple and straightforward. Note, currently perf_swevent_hrtimer() still races with perf_event_overflow() in NMI on the same CPU anyway, given there is no lock around and probably locking does not even help. But it is nothing new, and the race itself is not bad either, at most we have some inconsistent updates on the event sample period. Fixes: abd50713944c ("perf: Reimplement frequency driven sampling") Cc: Peter Zijlstra <peterz(a)infradead.org> Cc: Ingo Molnar <mingo(a)redhat.com> Cc: Linus Torvalds <torvalds(a)linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme(a)kernel.org> Cc: Alexander Shishkin <alexander.shishkin(a)linux.intel.com> Cc: Jiri Olsa <jolsa(a)redhat.com> Cc: Namhyung Kim <namhyung(a)kernel.org> Cc: <stable(a)vger.kernel.org> Signed-off-by: Cong Wang <xiyou.wangcong(a)gmail.com> --- include/linux/perf_event.h | 3 +++ kernel/events/core.c | 43 +++++++++++++++++++++++++++---------------- 2 files changed, 30 insertions(+), 16 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 1fa12887ec02..aab39b8aa720 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -310,6 +310,9 @@ struct pmu { #define PERF_EF_START 0x01 /* start the counter when adding */ #define PERF_EF_RELOAD 0x02 /* reload the counter when starting */ #define PERF_EF_UPDATE 0x04 /* update the counter when stopping */ +#define PERF_EF_NO_WAIT 0x08 /* do not wait when stopping, for + * example, waiting for a timer + */ /* * Adds/Removes a counter to/from the PMU, can be done inside a diff --git a/kernel/events/core.c b/kernel/events/core.c index 8f0434a9951a..f15832346b35 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3555,7 +3555,8 @@ do { \ static DEFINE_PER_CPU(int, perf_throttled_count); static DEFINE_PER_CPU(u64, perf_throttled_seq); -static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bool disable) +static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, + bool disable, bool nowait) { struct hw_perf_event *hwc = &event->hw; s64 period, sample_period; @@ -3574,8 +3575,13 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo hwc->sample_period = sample_period; if (local64_read(&hwc->period_left) > 8*sample_period) { - if (disable) - event->pmu->stop(event, PERF_EF_UPDATE); + if (disable) { + int flags = PERF_EF_UPDATE; + + if (nowait) + flags |= PERF_EF_NO_WAIT; + event->pmu->stop(event, flags); + } local64_set(&hwc->period_left, 0); @@ -3645,7 +3651,7 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx, * twice. */ if (delta > 0) - perf_adjust_period(event, period, delta, false); + perf_adjust_period(event, period, delta, false, false); event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0); next: @@ -7681,7 +7687,8 @@ static void perf_log_itrace_start(struct perf_event *event) } static int -__perf_event_account_interrupt(struct perf_event *event, int throttle) +__perf_event_account_interrupt(struct perf_event *event, int throttle, + bool nowait) { struct hw_perf_event *hwc = &event->hw; int ret = 0; @@ -7710,7 +7717,8 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle) hwc->freq_time_stamp = now; if (delta > 0 && delta < 2*TICK_NSEC) - perf_adjust_period(event, delta, hwc->last_period, true); + perf_adjust_period(event, delta, hwc->last_period, true, + nowait); } return ret; @@ -7718,7 +7726,7 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle) int perf_event_account_interrupt(struct perf_event *event) { - return __perf_event_account_interrupt(event, 1); + return __perf_event_account_interrupt(event, 1, false); } /* @@ -7727,7 +7735,7 @@ int perf_event_account_interrupt(struct perf_event *event) static int __perf_event_overflow(struct perf_event *event, int throttle, struct perf_sample_data *data, - struct pt_regs *regs) + struct pt_regs *regs, bool nowait) { int events = atomic_read(&event->event_limit); int ret = 0; @@ -7739,7 +7747,7 @@ static int __perf_event_overflow(struct perf_event *event, if (unlikely(!is_sampling_event(event))) return 0; - ret = __perf_event_account_interrupt(event, throttle); + ret = __perf_event_account_interrupt(event, throttle, nowait); /* * XXX event_limit might not quite work as expected on inherited @@ -7768,7 +7776,7 @@ int perf_event_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs) { - return __perf_event_overflow(event, 1, data, regs); + return __perf_event_overflow(event, 1, data, regs, true); } /* @@ -7831,7 +7839,7 @@ static void perf_swevent_overflow(struct perf_event *event, u64 overflow, for (; overflow; overflow--) { if (__perf_event_overflow(event, throttle, - data, regs)) { + data, regs, false)) { /* * We inhibit the overflow from happening when * hwc->interrupts == MAX_INTERRUPTS. @@ -9110,7 +9118,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer) if (regs && !perf_exclude_event(event, regs)) { if (!(event->attr.exclude_idle && is_idle_task(current))) - if (__perf_event_overflow(event, 1, &data, regs)) + if (__perf_event_overflow(event, 1, &data, regs, true)) ret = HRTIMER_NORESTART; } @@ -9141,7 +9149,7 @@ static void perf_swevent_start_hrtimer(struct perf_event *event) HRTIMER_MODE_REL_PINNED); } -static void perf_swevent_cancel_hrtimer(struct perf_event *event) +static void perf_swevent_cancel_hrtimer(struct perf_event *event, bool sync) { struct hw_perf_event *hwc = &event->hw; @@ -9149,7 +9157,10 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event) ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer); local64_set(&hwc->period_left, ktime_to_ns(remaining)); - hrtimer_cancel(&hwc->hrtimer); + if (sync) + hrtimer_cancel(&hwc->hrtimer); + else + hrtimer_try_to_cancel(&hwc->hrtimer); } } @@ -9200,7 +9211,7 @@ static void cpu_clock_event_start(struct perf_event *event, int flags) static void cpu_clock_event_stop(struct perf_event *event, int flags) { - perf_swevent_cancel_hrtimer(event); + perf_swevent_cancel_hrtimer(event, flags & PERF_EF_NO_WAIT); cpu_clock_event_update(event); } @@ -9277,7 +9288,7 @@ static void task_clock_event_start(struct perf_event *event, int flags) static void task_clock_event_stop(struct perf_event *event, int flags) { - perf_swevent_cancel_hrtimer(event); + perf_swevent_cancel_hrtimer(event, flags & PERF_EF_NO_WAIT); task_clock_event_update(event, event->ctx->time); } -- 2.14.4

7 years, 5 months

3
7
0 0

[PATCH v2] bcache: set max writeback rate when I/O request is idle

by Coly Li

Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") allows the writeback rate to be faster if there is no I/O request on a bcache device. It works well if there is only one bcache device attached to the cache set. If there are many bcache devices attached to a cache set, it may introduce performance regression because multiple faster writeback threads of the idle bcache devices will compete the btree level locks with the bcache device who have I/O requests coming. This patch fixes the above issue by only permitting fast writebac when all bcache devices attached on the cache set are idle. And if one of the bcache devices has new I/O request coming, minimized all writeback throughput immediately and let PI controller __update_writeback_rate() to decide the upcoming writeback rate for each bcache device. Also when all bcache devices are idle, limited wrieback rate to a small number is wast of thoughput, especially when backing devices are slower non-rotation devices (e.g. SATA SSD). This patch sets a max writeback rate for each backing device if the whole cache set is idle. A faster writeback rate in idle time means new I/Os may have more available space for dirty data, and people may observe a better write performance then. Please note bcache may change its cache mode in run time, and this patch still works if the cache mode is switched from writeback mode and there is still dirty data on cache. Fixes: Commit b1092c9af9ed ("bcache: allow quick writeback when backing idle") Cc: stable(a)vger.kernel.org #4.16+ Signed-off-by: Coly Li <colyli(a)suse.de> Tested-by: Kai Krakow <kai(a)kaishome.de> Cc: Michael Lyle <mlyle(a)lyle.org> Cc: Stefan Priebe <s.priebe(a)profihost.ag> --- Channgelog: v2, Fix a deadlock reported by Stefan Priebe. v1, Initial version. drivers/md/bcache/bcache.h | 11 ++-- drivers/md/bcache/request.c | 51 ++++++++++++++- drivers/md/bcache/super.c | 1 + drivers/md/bcache/sysfs.c | 14 +++-- drivers/md/bcache/util.c | 2 +- drivers/md/bcache/util.h | 2 +- drivers/md/bcache/writeback.c | 115 ++++++++++++++++++++++++++-------- 7 files changed, 155 insertions(+), 41 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index d6bf294f3907..469ab1a955e0 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -328,13 +328,6 @@ struct cached_dev { */ atomic_t has_dirty; - /* - * Set to zero by things that touch the backing volume-- except - * writeback. Incremented by writeback. Used to determine when to - * accelerate idle writeback. - */ - atomic_t backing_idle; - struct bch_ratelimit writeback_rate; struct delayed_work writeback_rate_update; @@ -514,6 +507,8 @@ struct cache_set { struct cache_accounting accounting; unsigned long flags; + atomic_t idle_counter; + atomic_t at_max_writeback_rate; struct cache_sb sb; @@ -523,6 +518,8 @@ struct cache_set { struct bcache_device **devices; unsigned devices_max_used; + /* See set_at_max_writeback_rate() for how it is used */ + unsigned previous_dirty_dc_nr; struct list_head cached_devs; uint64_t cached_dev_sectors; struct closure caching; diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index ae67f5fa8047..1af3d96abfa5 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -1104,6 +1104,43 @@ static void detached_dev_do_request(struct bcache_device *d, struct bio *bio) /* Cached devices - read & write stuff */ +static void quit_max_writeback_rate(struct cache_set *c, + struct cached_dev *this_dc) +{ + int i; + struct bcache_device *d; + struct cached_dev *dc; + + /* + * If bch_register_lock is acquired by other attach/detach operations, + * waiting here will increase I/O request latency for seconds or more. + * To avoid such situation, only writeback rate of current cached device + * is set to 1, and __update_write_back() will decide writeback rate + * of other cached devices (remember c->idle_counter is 0 now). + */ + if (mutex_trylock(&bch_register_lock)){ + for (i = 0; i < c->devices_max_used; i++) { + if (!c->devices[i]) + continue; + + if (UUID_FLASH_ONLY(&c->uuids[i])) + continue; + + d = c->devices[i]; + dc = container_of(d, struct cached_dev, disk); + /* + * set writeback rate to default minimum value, + * then let update_writeback_rate() to decide the + * upcoming rate. + */ + atomic64_set(&dc->writeback_rate.rate, 1); + } + + mutex_unlock(&bch_register_lock); + } else + atomic64_set(&this_dc->writeback_rate.rate, 1); +} + static blk_qc_t cached_dev_make_request(struct request_queue *q, struct bio *bio) { @@ -1119,7 +1156,19 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q, return BLK_QC_T_NONE; } - atomic_set(&dc->backing_idle, 0); + if (d->c) { + atomic_set(&d->c->idle_counter, 0); + /* + * If at_max_writeback_rate of cache set is true and new I/O + * comes, quit max writeback rate of all cached devices + * attached to this cache set, and set at_max_writeback_rate + * to false. + */ + if (unlikely(atomic_read(&d->c->at_max_writeback_rate) == 1)) { + atomic_set(&d->c->at_max_writeback_rate, 0); + quit_max_writeback_rate(d->c, dc); + } + } generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0); bio_set_dev(bio, dc->bdev); diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index fa4058e43202..fa532d9f9353 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1687,6 +1687,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) c->block_bits = ilog2(sb->block_size); c->nr_uuids = bucket_bytes(c) / sizeof(struct uuid_entry); c->devices_max_used = 0; + c->previous_dirty_dc_nr = 0; c->btree_pages = bucket_pages(c); if (c->btree_pages > BTREE_MAX_PAGES) c->btree_pages = max_t(int, c->btree_pages / 4, diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 225b15aa0340..d719021bff81 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -170,7 +170,8 @@ SHOW(__bch_cached_dev) var_printf(writeback_running, "%i"); var_print(writeback_delay); var_print(writeback_percent); - sysfs_hprint(writeback_rate, dc->writeback_rate.rate << 9); + sysfs_hprint(writeback_rate, + atomic64_read(&dc->writeback_rate.rate) << 9); sysfs_hprint(io_errors, atomic_read(&dc->io_errors)); sysfs_printf(io_error_limit, "%i", dc->error_limit); sysfs_printf(io_disable, "%i", dc->io_disable); @@ -188,7 +189,8 @@ SHOW(__bch_cached_dev) char change[20]; s64 next_io; - bch_hprint(rate, dc->writeback_rate.rate << 9); + bch_hprint(rate, + atomic64_read(&dc->writeback_rate.rate) << 9); bch_hprint(dirty, bcache_dev_sectors_dirty(&dc->disk) << 9); bch_hprint(target, dc->writeback_rate_target << 9); bch_hprint(proportional,dc->writeback_rate_proportional << 9); @@ -255,8 +257,12 @@ STORE(__cached_dev) sysfs_strtoul_clamp(writeback_percent, dc->writeback_percent, 0, 40); - sysfs_strtoul_clamp(writeback_rate, - dc->writeback_rate.rate, 1, INT_MAX); + if (attr == &sysfs_writeback_rate) { + int v; + + sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX); + atomic64_set(&dc->writeback_rate.rate, v); + } sysfs_strtoul_clamp(writeback_rate_update_seconds, dc->writeback_rate_update_seconds, diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c index fc479b026d6d..84f90c3d996d 100644 --- a/drivers/md/bcache/util.c +++ b/drivers/md/bcache/util.c @@ -200,7 +200,7 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done) { uint64_t now = local_clock(); - d->next += div_u64(done * NSEC_PER_SEC, d->rate); + d->next += div_u64(done * NSEC_PER_SEC, atomic64_read(&d->rate)); /* Bound the time. Don't let us fall further than 2 seconds behind * (this prevents unnecessary backlog that would make it impossible diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index cced87f8eb27..7e17f32ab563 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -442,7 +442,7 @@ struct bch_ratelimit { * Rate at which we want to do work, in units per second * The units here correspond to the units passed to bch_next_delay() */ - uint32_t rate; + atomic64_t rate; }; static inline void bch_ratelimit_reset(struct bch_ratelimit *d) diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index ad45ebe1a74b..11ffadc3cf8f 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -49,6 +49,80 @@ static uint64_t __calc_target_rate(struct cached_dev *dc) return (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT; } +static bool set_at_max_writeback_rate(struct cache_set *c, + struct cached_dev *dc) +{ + int i, dirty_dc_nr = 0; + struct bcache_device *d; + + /* + * bch_register_lock is acquired in cached_dev_detach_finish() before + * calling cancel_writeback_rate_update_dwork() to stop the delayed + * kworker writeback_rate_update (where the context we are for now). + * Therefore call mutex_lock() here may introduce deadlock when shut + * down the bcache device. + * c->previous_dirty_dc_nr is used to record previous calculated + * dirty_dc_nr when mutex_trylock() last time succeeded. Then if + * mutex_trylock() failed here, use c->previous_dirty_dc_nr as dirty + * cached device number. Of cause it might be inaccurate, but a few more + * or less loop before setting c->at_max_writeback_rate is much better + * then a deadlock here. + */ + if (mutex_trylock(&bch_register_lock)) { + for (i = 0; i < c->devices_max_used; i++) { + if (!c->devices[i]) + continue; + if (UUID_FLASH_ONLY(&c->uuids[i])) + continue; + d = c->devices[i]; + dc = container_of(d, struct cached_dev, disk); + if (atomic_read(&dc->has_dirty)) + dirty_dc_nr++; + } + c->previous_dirty_dc_nr = dirty_dc_nr; + + mutex_unlock(&bch_register_lock); + } else + dirty_dc_nr = c->previous_dirty_dc_nr; + + /* + * Idle_counter is increased everytime when update_writeback_rate() + * is rescheduled in. If all backing devices attached to the same + * cache set has same dc->writeback_rate_update_seconds value, it + * is about 10 rounds of update_writeback_rate() is called on each + * backing device, then the code will fall through at set 1 to + * c->at_max_writeback_rate, and a max wrteback rate to each + * dc->writeback_rate.rate. This is not very accurate but works well + * to make sure the whole cache set has no new I/O coming before + * writeback rate is set to a max number. + */ + if (atomic_inc_return(&c->idle_counter) < dirty_dc_nr * 10) + return false; + + if (atomic_read(&c->at_max_writeback_rate) != 1) + atomic_set(&c->at_max_writeback_rate, 1); + + + atomic64_set(&dc->writeback_rate.rate, INT_MAX); + + /* keep writeback_rate_target as existing value */ + dc->writeback_rate_proportional = 0; + dc->writeback_rate_integral_scaled = 0; + dc->writeback_rate_change = 0; + + /* + * Check c->idle_counter and c->at_max_writeback_rate agagain in case + * new I/O arrives during before set_at_max_writeback_rate() returns. + * Then the writeback rate is set to 1, and its new value should be + * decided via __update_writeback_rate(). + */ + if (atomic_read(&c->idle_counter) < dirty_dc_nr * 10 || + !atomic_read(&c->at_max_writeback_rate)) + return false; + + return true; +} + static void __update_writeback_rate(struct cached_dev *dc) { /* @@ -104,8 +178,9 @@ static void __update_writeback_rate(struct cached_dev *dc) dc->writeback_rate_proportional = proportional_scaled; dc->writeback_rate_integral_scaled = integral_scaled; - dc->writeback_rate_change = new_rate - dc->writeback_rate.rate; - dc->writeback_rate.rate = new_rate; + dc->writeback_rate_change = new_rate - + atomic64_read(&dc->writeback_rate.rate); + atomic64_set(&dc->writeback_rate.rate, new_rate); dc->writeback_rate_target = target; } @@ -138,9 +213,16 @@ static void update_writeback_rate(struct work_struct *work) down_read(&dc->writeback_lock); - if (atomic_read(&dc->has_dirty) && - dc->writeback_percent) - __update_writeback_rate(dc); + if (atomic_read(&dc->has_dirty) && dc->writeback_percent) { + /* + * If the whole cache set is idle, set_at_max_writeback_rate() + * will set writeback rate to a max number. Then it is + * unncessary to update writeback rate for an idle cache set + * in maximum writeback rate number(s). + */ + if (!set_at_max_writeback_rate(c, dc)) + __update_writeback_rate(dc); + } up_read(&dc->writeback_lock); @@ -422,27 +504,6 @@ static void read_dirty(struct cached_dev *dc) delay = writeback_delay(dc, size); - /* If the control system would wait for at least half a - * second, and there's been no reqs hitting the backing disk - * for awhile: use an alternate mode where we have at most - * one contiguous set of writebacks in flight at a time. If - * someone wants to do IO it will be quick, as it will only - * have to contend with one operation in flight, and we'll - * be round-tripping data to the backing disk as quickly as - * it can accept it. - */ - if (delay >= HZ / 2) { - /* 3 means at least 1.5 seconds, up to 7.5 if we - * have slowed way down. - */ - if (atomic_inc_return(&dc->backing_idle) >= 3) { - /* Wait for current I/Os to finish */ - closure_sync(&cl); - /* And immediately launch a new set. */ - delay = 0; - } - } - while (!kthread_should_stop() && !test_bit(CACHE_SET_IO_DISABLE, &dc->disk.c->flags) && delay) { @@ -715,7 +776,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc) dc->writeback_running = true; dc->writeback_percent = 10; dc->writeback_delay = 30; - dc->writeback_rate.rate = 1024; + atomic64_set(&dc->writeback_rate.rate, 1024); dc->writeback_rate_minimum = 8; dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT; -- 2.17.1

7 years, 5 months

2
3
0 0

[PATCH v6 00/13] mm: Teach memory_failure() about ZONE_DEVICE pages

by Dan Williams

Changes since v5 [1]: * Move put_page() before memory_failure() in madvise_inject_error() (Naoya) * The previous change uncovered a latent bug / broken assumption in __put_devmap_managed_page(). We need to preserve page->mapping for dax pages when they go idle. * Rename mapping_size() to dev_pagemap_mapping_size() (Naoya) * Catch and fail attempts to soft-offline dax pages (Naoya) * Collect Naoya's ack on "mm, memory_failure: Collect mapping size in collect_procs()" [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-July/016682.html --- As it stands, memory_failure() gets thoroughly confused by dev_pagemap backed mappings. The recovery code has specific enabling for several possible page states and needs new enabling to handle poison in dax mappings. In order to support reliable reverse mapping of user space addresses: 1/ Add new locking in the memory_failure() rmap path to prevent races that would typically be handled by the page lock. 2/ Since dev_pagemap pages are hidden from the page allocator and the "compound page" accounting machinery, add a mechanism to determine the size of the mapping that encompasses a given poisoned pfn. 3/ Given pmem errors can be repaired, change the speculatively accessed poison protection, mce_unmap_kpfn(), to be reversible and otherwise allow ongoing access from the kernel. A side effect of this enabling is that MADV_HWPOISON becomes usable for dax mappings, however the primary motivation is to allow the system to survive userspace consumption of hardware-poison via dax. Specifically the current behavior is: mce: Uncorrected hardware memory error in user-access at af34214200 {1}[Hardware Error]: It has been corrected by h/w and requires no further action mce: [Hardware Error]: Machine check events logged {1}[Hardware Error]: event severity: corrected Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users [..] Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed mce: Memory error not recovered <reboot> ...and with these changes: Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000 Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption Memory failure: 0x20cb00: recovery action for dax page: Recovered Given all the cross dependencies I propose taking this through nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax folks. --- Dan Williams (13): device-dax: Convert to vmf_insert_mixed and vm_fault_t device-dax: Enable page_mapping() device-dax: Set page->index filesystem-dax: Set page->index mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages mm, dev_pagemap: Do not clear ->mapping on final put mm, madvise_inject_error: Let memory_failure() optionally take a page reference mm, memory_failure: Collect mapping size in collect_procs() filesystem-dax: Introduce dax_lock_mapping_entry() mm, memory_failure: Teach memory_failure() about dev_pagemap pages x86/mm/pat: Prepare {reserve,free}_memtype() for "decoy" addresses x86/memory_failure: Introduce {set,clear}_mce_nospec() libnvdimm, pmem: Restore page attributes when clearing errors arch/x86/include/asm/set_memory.h | 42 ++++++ arch/x86/kernel/cpu/mcheck/mce-internal.h | 15 -- arch/x86/kernel/cpu/mcheck/mce.c | 38 ----- arch/x86/mm/pat.c | 16 ++ drivers/dax/device.c | 75 +++++++--- drivers/nvdimm/pmem.c | 26 ++++ drivers/nvdimm/pmem.h | 13 ++ fs/dax.c | 125 ++++++++++++++++- include/linux/dax.h | 13 ++ include/linux/huge_mm.h | 5 - include/linux/mm.h | 1 include/linux/set_memory.h | 14 ++ kernel/memremap.c | 1 mm/hmm.c | 2 mm/huge_memory.c | 4 - mm/madvise.c | 16 ++ mm/memory-failure.c | 210 +++++++++++++++++++++++------ 17 files changed, 481 insertions(+), 135 deletions(-)

7 years, 5 months

4
5
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror