linaro-kernel

linaro-kernel@lists.linaro.org

10095 discussions

by John Stultz

=== Highlights === * Flew to SF and presented at ABS, then flew back home on Monday. Slides are here: http://events.linuxfoundation.org/images/stories/slides/abs2013_stultz.pdf * Spurred by discussion at ABS, worked out how to get ADB running on vanilla linux: https://plus.google.com/u/0/111524780435806926688/posts/AaEccFjKNHE * My discussion proposal for lsf/mm-minisummit on volatile ranges was accepted and I was formally invited to attend * Discussed Serbans' ashmem compat_ioctl patches with Arve and Serban. * Sent out late android upstreaming subteam mail * Synced up with Jakub on Android Upstreaming session at connect * Got my 3.9 timekeeping queue merged upstream, and reviewed and queued a number of timekeeping patches for 3.10 * Discussed some timekeeping changes with tglx, and reviewed some of his patches. * Implemented a first pass at using valid-cycle-ranges with vdso based gettime calls to avoid potential race windows with virtualized kernels. This will allow for reduced lock hold times in the future. === Plans === * Review Serban's binder patches * Look into Androids support of large-files with 32bit applications * Send out sync driver for staging (as I've not heard back from Erik or other folks at Google) * Prep for Connect === Issues === * NA

12 years, 4 months

[ACTIVITY] (David Long) 2013-02-16 - 2013-02-22

by David Long

=== David Long === === Travel/Time Off === * Monday February 18th (U.S. Washington's Birthday, aka President's Day) === Highlights === * I'm dealing with problems getting the uprobe uprobe patch to correctly process the breakpoint. I see the breakpoint being placed but the result when it hits it seems to be corrupted context. * I received email back from Rabin Vincent saying he had no plans to work on this any more and he is happy if I want to take it over. He has volunteered to supply his tests, which I hope to see shortly. === Plans === * Debug the problems I'm experiencing with the patch, then move on to addressing the upstream concerns about its integration. === Issues === -dl

12 years, 4 months

[PATCH 0/2] cpustat: use atomic operations to read/update stats

by Kevin Hilman

On 64-bit platforms, reads/writes of the various cpustat fields are atomic due to native 64-bit loads/stores. However, on non 64-bit platforms, reads/writes of the cpustat fields are not atomic and could lead to inconsistent statistics. This problem was originally reported by Frederic Weisbecker as a 64-bit limitation with the nsec granularity cputime accounting for full dynticks, but then we realized that it's a problem that's been around for awhile and not specific to the new cputime accounting. This series fixes this by first converting all access to the cputime fields to use accessor functions, and then converting the accessor functions to use the atomic64 functions. Implemented based on idea proposed by Frederic Weisbecker. Kevin Hilman (2): cpustat: use accessor functions for get/set/add cpustat: convert to atomic operations arch/s390/appldata/appldata_os.c | 16 +++++++-------- drivers/cpufreq/cpufreq_governor.c | 18 ++++++++--------- drivers/cpufreq/cpufreq_ondemand.c | 2 +- drivers/macintosh/rack-meter.c | 6 +++--- fs/proc/stat.c | 40 +++++++++++++++++++------------------- fs/proc/uptime.c | 2 +- include/linux/kernel_stat.h | 11 ++++++++++- kernel/sched/core.c | 12 +++++------- kernel/sched/cputime.c | 29 +++++++++++++-------------- 9 files changed, 70 insertions(+), 66 deletions(-) -- 1.8.1.2

12 years, 4 months

[ACTIVITY] (Linus Walleij) 2013-02-11 - 2013-02-22

by Linus Walleij

== Linus Walleij linusw == === Highlights === * Finalized a GPIO+pinctrl presentation for the Embedded Linux Conference, and presented on the first day of the conference. Slides will be posted. * Finalized the pinctrl tree before traveling, sent a pull request to Torvalds as soon as the merge window opened and he pulled it in. * AB8500 GPIO patches and all other cleanup has been merged up to the pinctrl and ARM SoC trees and pulled in by Torvalds. MFD is pending but Sam has sent a pull request for this part as well. * Other queued fixes for mach-ux500 and also the PCI regression fix has propagated upstream. * Reviewed misc GPIO, pinctrl and other patches, updated blueprints... === Plans === * Fix regressions popping up in the merge window. There are always such... * Attack the remaining headers in arch/arm/mach-ux500 so we can move forward with multiplatform for v3.9. * Convert the Nomadik to multiplatform. * Convert Nomadik pinctrl driver to register GPIO ranges from the gpiochip side. * Test the PL08x patches on the Ericsson Research PB11MPCore and submit platform data for using pl08x DMA on that platform. * Look into other Ux500 stuff in need of mainlining... using an internal tracking sheet for this. * Get hands dirty with regmap. === Issues === * N/A Thanks, Linus Walleij

12 years, 4 months

[PATCH v2] arm: add check for global exclusive monitor

by Vladimir Murzin

Since ARMv6 new atomic instructions have been introduced: ldrex/strex. Several implementation are possible based on (1) global and local exclusive monitors and (2) local exclusive monitor and snoop unit. In case of the 2nd options exclusive store operation on uncached region may be faulty. Check for availability of global monitor to provide some hint about possible issues. Signed-off-by: Vladimir Murzin <murzin.v(a)gmail.com> --- Changes since v1: - Using L_PTE_MT_BUFFERABLE instead of L_PTE_MT_UNCACHABLE Thanks to Russell for ponting this silly error - added comment about how checking is done arch/arm/include/asm/bugs.h | 14 +++++++++-- arch/arm/mm/fault-armv.c | 55 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 67 insertions(+), 2 deletions(-) diff --git a/arch/arm/include/asm/bugs.h b/arch/arm/include/asm/bugs.h index a97f1ea..29d73cd 100644 --- a/arch/arm/include/asm/bugs.h +++ b/arch/arm/include/asm/bugs.h @@ -13,9 +13,19 @@ #ifdef CONFIG_MMU extern void check_writebuffer_bugs(void); -#define check_bugs() check_writebuffer_bugs() +#if __LINUX_ARM_ARCH__ < 6 +static void check_gmonitor_bugs(void) {}; #else -#define check_bugs() do { } while (0) +extern void check_gmonitor_bugs(void); +#endif + +static inline void check_bugs(void) +{ + check_writebuffer_bugs(); + check_gmonitor_bugs(); +} +#else +static inline void check_bugs(void) { } #endif #endif diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c index 2a5907b..6a1a07e 100644 --- a/arch/arm/mm/fault-armv.c +++ b/arch/arm/mm/fault-armv.c @@ -205,6 +205,61 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr, __flush_icache_all(); } } +#else +/* + * Check for the global exclusive monitor. The global monitor is a external + * transaction monitoring block for tracking exclusive accesses to sharable + * memory regions. LDREX/STREX rely on this monitor when accessing uncached + * shared memory. + * If global monitor is not implemented STREX operation on uncached shared + * memory region always fail, returning 0 in the destination register. + * We rely on this property to check whether global monitor is implemented + * or not. + * NB: The name of L_PTE_MT_BUFFERABLE is not for B bit, but for normal + * non-cacheable memory type (XXCB = 0001). + */ +void __init check_gmonitor_bugs(void) +{ + struct page *page; + const char *reason; + unsigned long res = 1; + + printk(KERN_INFO "CPU: Testing for global monitor: "); + + page = alloc_page(GFP_KERNEL); + if (page) { + unsigned long *p; + pgprot_t prot = __pgprot_modify(PAGE_KERNEL, + L_PTE_MT_MASK, L_PTE_MT_BUFFERABLE); + + p = vmap(&page, 1, VM_IOREMAP, prot); + + if (p) { + int temp, res; + + __asm__ __volatile__( + "ldrex %1, [%2]\n" + "strex %0, %1, [%2]" + : "=&r" (res), "=&r" (temp) + : "r" (p) + : "cc", "memory"); + + reason = "n/a (atomic ops may be faulty)"; + } else { + reason = "unable to map memory\n"; + } + + vunmap(p); + put_page(page); + } else { + reason = "unable to grab page\n"; + } + + if (res) + printk("failed, %s\n", reason); + else + printk("ok\n"); +} #endif /* __LINUX_ARM_ARCH__ < 6 */ /* -- 1.7.10.4

12 years, 4 months

[PATCH] arm: add check for global exclusive monitor

by Vladimir Murzin

Thanks for review Russel! On Mon, Feb 18, 2013 at 04:44:20PM +0000, Russell King - ARM Linux wrote: > On Mon, Feb 18, 2013 at 08:26:50PM +0400, Vladimir Murzin wrote: > > Since ARMv6 new atomic instructions have been introduced: > > ldrex/strex. Several implementation are possible based on (1) global > > and local exclusive monitors and (2) local exclusive monitor and snoop > > unit. > > > > In case of the 2nd option exclusive store operation on uncached > > region may be faulty. > > > > Check for availability of the global monitor to provide some hint about > > possible issues. > > How does this code actually do that? According to DHT0008A_arm_synchronization_primitives.pdf the global monitor is introduce to track exclusive accesses to sharable memory regions. This article also says about some system-wide implication which should be taken into account: (1) for systems with coherency management (2) for systems without coherency management The first one lay on SCU, L1 data cache and local monitor. The second one requires implementation of global monitor if memory regions cannot be cached. It set up the behaviour for store-exclusive operations when global monitor is not available: these operations always fail. Taking all these into account we can guess about availability of global monitor by using store-exclusive operation on uncached memory region. > > > +void __init check_gmonitor_bugs(void) > > +{ > > + struct page *page; > > + const char *reason; > > + unsigned long res = 1; > > + > > + printk(KERN_INFO "CPU: Testing for global monitor: "); > > + > > + page = alloc_page(GFP_KERNEL); > > + if (page) { > > + unsigned long *p; > > + pgprot_t prot = __pgprot_modify(PAGE_KERNEL, > > + L_PTE_MT_MASK, L_PTE_MT_UNCACHED); > > + > > + p = vmap(&page, 1, VM_IOREMAP, prot); > > This is bad practise. Remapping a page of already mapped kernel memory > using different attributes (in this case, strongly ordered) is _definitely_ > a violation of the architecture requirements. The behaviour you will see > from this are in no way guaranteed. DDI0406C_arm_architecture_reference_manual.pdf (A3-131) says: A memory location can be marked as having different cacheability attributes, for example when using aliases in a virtual to physical address mapping: * if the attributes differ only in the cache allocation hint this does not affect the behavior of accesses to that location * for other cases see Mismatched memory attributes on page A3-136. Isn't L_PTE_MT_UNCACHED about cache allocation hint? > > If you want to do this, it must either come from highmem, or not already > be mapped. > > Moreover, this is absolutely silly - the ARM ARM says: > > "LDREX and STREX operations *must* be performed only on memory with the > Normal memory attribute." DDI0406C_arm_architecture_reference_manual.pdf (A3-121) says: It is IMPLEMENTATION DEFINED whether LDREX and STREX operations can be performed to a memory region with the Device or Strongly-ordered memory attribute. Unless the implementation documentation explicitly states that LDREX and STREX operations to a memory region with the Device or Strongly-ordered attribute are permitted, the effect of such operations is UNPREDICTABLE. At least it allows to perform operations on memory region with the Strongly-ordered attribute... but still unpredictable. > > L_PTE_MT_UNCACHED doesn't get you that. As I say above, that gets you > strongly ordered memory, not "normal memory" as required by the > architecture for use with exclusive types. > > > + > > + if (p) { > > + int temp; > > + > > + __asm__ __volatile__( \ > > + "ldrex %1, [%2]\n" \ > > + "strex %0, %1, [%2]" \ > > + : "=&r" (res), "=&r" (temp) \ > > + : "r" (p) \ > > \ character not required for any of the above. Neither is the __ version > of "asm" and "volatile". Thanks. > > > + : "cc", "memory"); > > + > > + reason = "n\\a (atomic ops may be faulty)"; > > "n\\a" ? "not detected"? > So... at the moment this has me wondering - you're testing atomic > operations with a strongly ordered memory region, which ARM already > define this to be outside of the architecture spec. The behaviour you > see is not defined architecturally. > > And if you're trying to use LDREX/STREX to a strongly ordered or device > memory region, then you're quite right that it'll be unreliable. It's > not defined to even work. That's not because they're faulty, it's because > you're abusing them. However, IRL it is not hard to meet this undefined difference. At least I'm able to see it on Tegra2 Harmony and Pandaboard. Moreover, demand on Normal memory attribute breaks up ability to turn caches off. In this case we are not able to boot the system up (seen on Tegra2 harmony). This patch is aimed to highlight the difference in implementation. That's why it has some softering in guessing about faulty. Might be it worth warning about unpredictable effect instead? Best wishes Vladimir

12 years, 4 months

13.1 Versatile Express kernel spewing lock warnings

by Eric Van Hensbergen

I was trying out the new linaro binary image for 13.1 on my TC2 and am getting lock warnings on my console. Should I be worried or is this expected behavior? ----- [ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Linux version 3.8.0-1-linaro-vexpress (buildd@wani10) (gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-2ubuntu1) ) #1ubuntu1~ci+130125174543-Ubuntu SMP Sat Jan 26 00:47:35 UTC 201 ... [ 602.909859] [ 602.909864] ====================================================== [ 602.909868] [ INFO: possible circular locking dependency detected ] [ 602.909877] 3.8.0-1-linaro-vexpress #1ubuntu1~ci+130125174543-Ubuntu Not tainted [ 602.909881] ------------------------------------------------------- [ 602.909887] kworker/0:1/376 is trying to acquire lock: [ 602.909922] ((fb_notifier_list).rwsem){.+.+.+}, at: [<c00385b9>] __blocking_notifier_call_chain+0x1d/0x40 [ 602.909926] [ 602.909926] but task is already holding lock: [ 602.909953] (console_lock){+.+.+.}, at: [<c0285103>] console_callback+0xf/0xe0 [ 602.909957] [ 602.909957] which lock already depends on the new lock. [ 602.909957] [ 602.909960] [ 602.909960] the existing dependency chain (in reverse order) is: [ 602.909976] [ 602.909976] -> #1 (console_lock){+.+.+.}: [ 602.909994] [<c005b041>] __lock_acquire+0x29d/0x858 [ 602.910010] [<c005b93f>] lock_acquire+0x5f/0xbc [ 602.910027] [<c001d2f5>] console_lock+0x31/0x40 [ 602.910041] [<c028350f>] register_con_driver+0x27/0xe8 [ 602.910054] [<c0283ded>] take_over_console+0x19/0x240 [ 602.910071] [<c0269263>] fbcon_takeover+0x3b/0x88 [ 602.910083] [<c00383ed>] notifier_call_chain+0x45/0x54 [ 602.910097] [<c00385cb>] __blocking_notifier_call_chain+0x2f/0x40 [ 602.910110] [<c00385f3>] blocking_notifier_call_chain+0x17/0x1c [ 602.910123] [<c0265d35>] register_framebuffer+0x109/0x1cc [ 602.910135] [<c026fdbb>] hdlcd_probe+0x507/0x5d0 [ 602.910148] [<c02909d3>] platform_drv_probe+0x17/0x18 [ 602.910159] [<c028fd49>] driver_probe_device+0x51/0x170 [ 602.910169] [<c028febf>] __driver_attach+0x57/0x58 [ 602.910185] [<c028ec9b>] bus_for_each_dev+0x2b/0x4c [ 602.910195] [<c028f7dd>] bus_add_driver+0xe5/0x170 [ 602.910206] [<c02901ef>] driver_register+0x43/0xd0 [ 602.910218] [<c0008511>] do_one_initcall+0xc9/0x114 [ 602.910231] [<c03d5ceb>] kernel_init+0xcf/0x1ec [ 602.910245] [<c000ce95>] ret_from_fork+0x11/0x20 [ 602.910261] [ 602.910261] -> #0 ((fb_notifier_list).rwsem){.+.+.+}: [ 602.910274] [<c005a799>] validate_chain.isra.26+0xafd/0xbe4 [ 602.910288] [<c005b041>] __lock_acquire+0x29d/0x858 [ 602.910301] [<c005b93f>] lock_acquire+0x5f/0xbc [ 602.910316] [<c03dfe8d>] down_read+0x25/0x30 [ 602.910329] [<c00385b9>] __blocking_notifier_call_chain+0x1d/0x40 [ 602.910342] [<c00385f3>] blocking_notifier_call_chain+0x17/0x1c [ 602.910354] [<c0264e7d>] fb_blank+0x29/0x64 [ 602.910364] [<c026a859>] fbcon_blank+0x135/0x1ac [ 602.910377] [<c0283471>] do_blank_screen+0x109/0x180 [ 602.910391] [<c028513d>] console_callback+0x49/0xe0 [ 602.910402] [<c00304a1>] process_one_work+0x12d/0x3c4 [ 602.910413] [<c00309a7>] worker_thread+0x117/0x344 [ 602.910426] [<c0033dd3>] kthread+0x77/0x84 [ 602.910439] [<c000ce95>] ret_from_fork+0x11/0x20 [ 602.910443] [ 602.910443] other info that might help us debug this: [ 602.910443] [ 602.910446] Possible unsafe locking scenario: [ 602.910446] [ 602.910450] CPU0 CPU1 [ 602.910453] ---- ---- [ 602.910463] lock(console_lock); [ 602.910472] lock((fb_notifier_list).rwsem); [ 602.910481] lock(console_lock); [ 602.910489] lock((fb_notifier_list).rwsem); [ 602.910493] [ 602.910493] *** DEADLOCK *** [ 602.910493] [ 602.910499] 3 locks held by kworker/0:1/376: [ 602.910523] #0: (events){.+.+.+}, at: [<c003044e>] process_one_work+0xda/0x3c4 [ 602.910546] #1: (console_work){+.+...}, at: [<c003044e>] process_one_work+0xda/0x3c4 [ 602.910572] #2: (console_lock){+.+.+.}, at: [<c0285103>] console_callback+0xf/0xe0 [ 602.910575] [ 602.910575] stack backtrace: [ 602.910596] [<c0011fd1>] (unwind_backtrace+0x1/0x9c) from [<c03da9cb>] (print_circular_bug+0x1a7/0x1f0) [ 602.910614] [<c03da9cb>] (print_circular_bug+0x1a7/0x1f0) from [<c005a799>] (validate_chain.isra.26+0xafd/0xbe4) [ 602.910632] [<c005a799>] (validate_chain.isra.26+0xafd/0xbe4) from [<c005b041>] (__lock_acquire+0x29d/0x858) [ 602.910649] [<c005b041>] (__lock_acquire+0x29d/0x858) from [<c005b93f>] (lock_acquire+0x5f/0xbc) [ 602.910665] [<c005b93f>] (lock_acquire+0x5f/0xbc) from [<c03dfe8d>] (down_read+0x25/0x30) [ 602.910683] [<c03dfe8d>] (down_read+0x25/0x30) from [<c00385b9>] (__blocking_notifier_call_chain+0x1d/0x40) [ 602.910701] [<c00385b9>] (__blocking_notifier_call_chain+0x1d/0x40) from [<c00385f3>] (blocking_notifier_call_chain+0x17/0x1c) [ 602.910718] [<c00385f3>] (blocking_notifier_call_chain+0x17/0x1c) from [<c0264e7d>] (fb_blank+0x29/0x64) [ 602.910732] [<c0264e7d>] (fb_blank+0x29/0x64) from [<c026a859>] (fbcon_blank+0x135/0x1ac) [ 602.910746] [<c026a859>] (fbcon_blank+0x135/0x1ac) from [<c0283471>] (do_blank_screen+0x109/0x180) [ 602.910764] [<c0283471>] (do_blank_screen+0x109/0x180) from [<c028513d>] (console_callback+0x49/0xe0) [ 602.910780] [<c028513d>] (console_callback+0x49/0xe0) from [<c00304a1>] (process_one_work+0x12d/0x3c4) [ 602.910793] [<c00304a1>] (process_one_work+0x12d/0x3c4) from [<c00309a7>] (worker_thread+0x117/0x344) [ 602.910808] [<c00309a7>] (worker_thread+0x117/0x344) from [<c0033dd3>] (kthread+0x77/0x84) [ 602.910825] [<c0033dd3>] (kthread+0x77/0x84) from [<c000ce95>] (ret_from_fork+0x11/0x20)

12 years, 4 months

Re: [PATCH v3 2/3] ab8500: make res_to_temp tables public

by Hongbo Zhang

On 22 February 2013 08:49, Guenter Roeck <linux(a)roeck-us.net> wrote: > On Thu, Feb 21, 2013 at 02:24:23PM -0800, Anton Vorontsov wrote: >> On Thu, Feb 21, 2013 at 06:32:40PM +0800, Hongbo Zhang wrote: >> > These NTC resistance to temperature tables should be public, so others such as >> > ab8500 hwmon driver can look up these tables to convert NTC resistance to >> > temperature. >> > >> > Signed-off-by: Hongbo Zhang <hongbo.zhang(a)linaro.org> >> > --- >> >> For 1/3 and 2/3 patches: >> >> Acked-by: Anton Vorontsov <anton(a)enomsg.org> >> >> (Do you need EXPORT_SYMBOL()? You don't use this from modules?) >> > I would think so. Also, the variables should be exported through an include > file. > I have these two lines in drivers/hwmon/ab8500.h, extern struct abx500_res_to_temp temp_tbl_A_thermistor[]; extern int temp_tbl_A_size; Do you mean this? Or do you mean we should create a public header file holding all the tables? Where to place these tables really baffled me, if the current hwmon driver is acceptable, I will talk to the ab8500_bmdata.c author to discuss how to re-arrange all the tables, that should be another patch in future if possible. > The variable names are quite generic for global variables; we need to find > something more specific/descriptive. > I noticed this too, this original naming isn't so good, there are also other names like this. I will rename these two tables I am using this time. > There is also some overlap with functionality in drivers/hwmon/ntc_thermistor.c. > Wonder if it would be possible to unify the code. > It seems not so easy to unify the code for me, if necessary and possible, that should be another dedicated patch I think. > Guenter > >> Thanks. >> >> > drivers/power/ab8500_bmdata.c | 8 ++++++-- >> > 1 file changed, 6 insertions(+), 2 deletions(-) >> > >> > diff --git a/drivers/power/ab8500_bmdata.c b/drivers/power/ab8500_bmdata.c >> > index f034ae4..53f3324 100644 >> > --- a/drivers/power/ab8500_bmdata.c >> > +++ b/drivers/power/ab8500_bmdata.c >> > @@ -11,7 +11,7 @@ >> > * Note that the res_to_temp table must be strictly sorted by falling resistance >> > * values to work. >> > */ >> > -static struct abx500_res_to_temp temp_tbl_A_thermistor[] = { >> > +struct abx500_res_to_temp temp_tbl_A_thermistor[] = { >> > {-5, 53407}, >> > { 0, 48594}, >> > { 5, 43804}, >> > @@ -29,7 +29,9 @@ static struct abx500_res_to_temp temp_tbl_A_thermistor[] = { >> > {65, 12500}, >> > }; >> > >> > -static struct abx500_res_to_temp temp_tbl_B_thermistor[] = { >> > +int temp_tbl_A_size = ARRAY_SIZE(temp_tbl_A_thermistor); >> > + >> > +struct abx500_res_to_temp temp_tbl_B_thermistor[] = { >> > {-5, 200000}, >> > { 0, 159024}, >> > { 5, 151921}, >> > @@ -47,6 +49,8 @@ static struct abx500_res_to_temp temp_tbl_B_thermistor[] = { >> > {65, 82869}, >> > }; >> > >> > +int temp_tbl_B_size = ARRAY_SIZE(temp_tbl_B_thermistor); >> > + >> > static struct abx500_v_to_cap cap_tbl_A_thermistor[] = { >> > {4171, 100}, >> > {4114, 95}, >> > -- >> > 1.8.0 >>

12 years, 4 months

[PATCH v2] memcg: Add memory.pressure_level events

by Anton Vorontsov

With this patch userland applications that want to maintain the interactivity/memory allocation cost can use the pressure level notifications. The levels are defined like this: The "low" level means that the system is reclaiming memory for new allocations. Monitoring this reclaiming activity might be useful for maintaining cache level. Upon notification, the program (typically "Activity Manager") might analyze vmstat and act in advance (i.e. prematurely shutdown unimportant services). The "medium" level means that the system is experiencing medium memory pressure, the system might be making swap, paging out active file caches, etc. Upon this event applications may decide to further analyze vmstat/zoneinfo/memcg or internal memory usage statistics and free any resources that can be easily reconstructed or re-read from a disk. The "critical" level means that the system is actively thrashing, it is about to out of memory (OOM) or even the in-kernel OOM killer is on its way to trigger. Applications should do whatever they can to help the system. It might be too late to consult with vmstat or any other statistics, so it's advisable to take an immediate action. The events are propagated upward until the event is handled, i.e. the events are not pass-through. Here is what this means: for example you have three cgroups: A->B->C. Now you set up an event listener on cgroups A, B and C, and suppose group C experiences some pressure. In this situation, only group C will receive the notification, i.e. groups A and B will not receive it. This is done to avoid excessive "broadcasting" of messages, which disturbs the system and which is especially bad if we are low on memory or thrashing. So, organize the cgroups wisely, or propagate the events manually (or, ask us to implement the pass-through events, explaining why would you need them.) Signed-off-by: Anton Vorontsov <anton.vorontsov(a)linaro.org> Acked-by: Kirill A. Shutemov <kirill(a)shutemov.name> --- Hi all, Many thanks for the previous reviews! In this revision: - Addressed Glauber Costa's comments: o Use parent_mem_cgroup() instead of own parent function (also suggested by Kamezawa). This change also affected events distribution logic, so it became more like memory thresholds notifications, i.e. we deliver the event to the cgroup where the event originated, not to the parent cgroup; (This also addreses Kamezawa's remark regarding which cgroup receives which event.) o Register vmpressure cgroup file directly in memcontrol.c. - Addressed Greg Thelen's comments: o Fixed bool/int inconsistency in the code; o Fixed nr_scanned accounting; o Don't use cryptic 's', 'r' abbreviations; get rid of confusing 'window' argument. - Addressed Kamezawa Hiroyuki's comments: o Moved declarations from mm/internal.h into linux/vmpressue.h; o Removed Kconfig symbol. Vmpressure is pretty lightweight (especially comparing to the memcg accounting). If it ever causes any measurable performance effect, we want to fix it, not paper it over with a Kconfig option. :-) o Removed read operation on pressure_level cgroup file. In apps, we only use notifications, we don't need the content of the file, so let's keep things simple for now. Plus this resolves questions like what should we return there when the system is not reclaiming; o Reworded documentation; o Improved comments for vmpressure_prio(). Old changelogs/submissions: v1: http://lkml.org/lkml/2013/2/10/140 mempressure cgroup: http://lkml.org/lkml/2013/1/4/55 Thanks! Anton Documentation/cgroups/memory.txt | 61 +++++++++- include/linux/vmpressure.h | 47 ++++++++ mm/Makefile | 2 +- mm/memcontrol.c | 28 +++++ mm/vmpressure.c | 252 +++++++++++++++++++++++++++++++++++++++ mm/vmscan.c | 8 ++ 6 files changed, 396 insertions(+), 2 deletions(-) create mode 100644 include/linux/vmpressure.h create mode 100644 mm/vmpressure.c diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index addb1f1..0c004de 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -40,6 +40,7 @@ Features: - soft limit - moving (recharging) account at moving a task is selectable. - usage threshold notifier + - memory pressure notifier - oom-killer disable knob and oom-notifier - Root cgroup has no limit controls. @@ -65,6 +66,7 @@ Brief summary of control files. memory.stat # show various statistics memory.use_hierarchy # set/show hierarchical account enabled memory.force_empty # trigger forced move charge to parent + memory.pressure_level # set memory pressure notifications memory.swappiness # set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) memory.move_charge_at_immigrate # set/show controls of moving charges @@ -778,7 +780,64 @@ At reading, current status of OOM is shown. under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may be stopped.) -11. TODO +11. Memory Pressure + +The pressure level notifications can be used to monitor the memory +allocation cost; based on the pressure, applications can implement +different strategies of managing their memory resources. The pressure +levels are defined as following: + +The "low" level means that the system is reclaiming memory for new +allocations. Monitoring this reclaiming activity might be useful for +maintaining cache level. Upon notification, the program (typically +"Activity Manager") might analyze vmstat and act in advance (i.e. +prematurely shutdown unimportant services). + +The "medium" level means that the system is experiencing medium memory +pressure, the system might be making swap, paging out active file caches, +etc. Upon this event applications may decide to further analyze +vmstat/zoneinfo/memcg or internal memory usage statistics and free any +resources that can be easily reconstructed or re-read from a disk. + +The "critical" level means that the system is actively thrashing, it is +about to out of memory (OOM) or even the in-kernel OOM killer is on its +way to trigger. Applications should do whatever they can to help the +system. It might be too late to consult with vmstat or any other +statistics, so it's advisable to take an immediate action. + +The events are propagated upward until the event is handled, i.e. the +events are not pass-through. Here is what this means: for example you have +three cgroups: A->B->C. Now you set up an event listener on cgroups A, B +and C, and suppose group C experiences some pressure. In this situation, +only group C will receive the notification, i.e. groups A and B will not +receive it. This is done to avoid excessive "broadcasting" of messages, +which disturbs the system and which is especially bad if we are low on +memory or thrashing. So, organize the cgroups wisely, or propagate the +events manually (or, ask us to implement the pass-through events, +explaining why would you need them.) + +The file memory.pressure_level is only used to setup an eventfd, +read/write operations are no implemented. + +Test: + + Here is a small script example that makes a new cgroup, sets up a + memory limit, sets up a notification in the cgroup and then makes child + cgroup experience a critical pressure: + + # cd /sys/fs/cgroup/memory/ + # mkdir foo + # cd foo + # cgroup_event_listener memory.pressure_level low & + # echo 8000000 > memory.limit_in_bytes + # echo 8000000 > memory.memsw.limit_in_bytes + # echo $$ > tasks + # dd if=/dev/zero | read x + + (Expect a bunch of notifications, and eventually, the oom-killer will + trigger.) + +12. TODO 1. Add support for accounting huge pages (as a separate controller) 2. Make per-cgroup scanner reclaim not-shared pages first diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h new file mode 100644 index 0000000..fa84783 --- /dev/null +++ b/include/linux/vmpressure.h @@ -0,0 +1,47 @@ +#ifndef __LINUX_VMPRESSURE_H +#define __LINUX_VMPRESSURE_H + +#include <linux/mutex.h> +#include <linux/list.h> +#include <linux/workqueue.h> +#include <linux/gfp.h> +#include <linux/types.h> +#include <linux/cgroup.h> + +struct vmpressure { + unsigned int scanned; + unsigned int reclaimed; + /* The lock is used to keep the scanned/reclaimed above in sync. */ + struct mutex sr_lock; + + struct list_head events; + /* Have to grab the lock on events traversal or modifications. */ + struct mutex events_lock; + + struct work_struct work; +}; + +struct mem_cgroup; + +#ifdef CONFIG_MEMCG +extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, + unsigned long scanned, unsigned long reclaimed); +extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio); +#else +static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, + unsigned long scanned, unsigned long reclaimed) {} +static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, + int prio) {} +#endif /* CONFIG_MEMCG */ + +extern void vmpressure_init(struct vmpressure *vmpr); +extern struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg); +extern struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr); +extern struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css); +extern int vmpressure_register_event(struct cgroup *cg, struct cftype *cft, + struct eventfd_ctx *eventfd, + const char *args); +extern void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft, + struct eventfd_ctx *eventfd); + +#endif /* __LINUX_VMPRESSURE_H */ diff --git a/mm/Makefile b/mm/Makefile index 3a46287..72c5acb 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -50,7 +50,7 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o -obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o +obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 25ac5f4..b41727b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -49,6 +49,7 @@ #include <linux/fs.h> #include <linux/seq_file.h> #include <linux/vmalloc.h> +#include <linux/vmpressure.h> #include <linux/mm_inline.h> #include <linux/page_cgroup.h> #include <linux/cpu.h> @@ -370,6 +371,9 @@ struct mem_cgroup { atomic_t numainfo_events; atomic_t numainfo_updating; #endif + + struct vmpressure vmpr; + /* * Per cgroup active and inactive list, similar to the * per zone LRU lists. @@ -570,6 +574,24 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) return container_of(s, struct mem_cgroup, css); } +/* Some nice accessors for the vmpressure. */ +struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg) +{ + if (!memcg) + memcg = root_mem_cgroup; + return &memcg->vmpr; +} + +struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr) +{ + return &container_of(vmpr, struct mem_cgroup, vmpr)->css; +} + +struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css) +{ + return &mem_cgroup_from_css(css)->vmpr; +} + static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) { return (memcg == root_mem_cgroup); @@ -6000,6 +6022,11 @@ static struct cftype mem_cgroup_files[] = { .unregister_event = mem_cgroup_oom_unregister_event, .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), }, + { + .name = "pressure_level", + .register_event = vmpressure_register_event, + .unregister_event = vmpressure_unregister_event, + }, #ifdef CONFIG_NUMA { .name = "numa_stat", @@ -6291,6 +6318,7 @@ mem_cgroup_css_alloc(struct cgroup *cont) memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); spin_lock_init(&memcg->move_lock); + vmpressure_init(&memcg->vmpr); return &memcg->css; diff --git a/mm/vmpressure.c b/mm/vmpressure.c new file mode 100644 index 0000000..ae0ff8e --- /dev/null +++ b/mm/vmpressure.c @@ -0,0 +1,252 @@ +/* + * Linux VM pressure + * + * Copyright 2012 Linaro Ltd. + * Anton Vorontsov <anton.vorontsov(a)linaro.org> + * + * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro, + * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation. + */ + +#include <linux/cgroup.h> +#include <linux/fs.h> +#include <linux/sched.h> +#include <linux/mm.h> +#include <linux/vmstat.h> +#include <linux/eventfd.h> +#include <linux/swap.h> +#include <linux/printk.h> +#include <linux/vmpressure.h> + +/* + * The window size is the number of scanned pages before we try to analyze + * the scanned/reclaimed ratio (or difference). + * + * It is used as a rate-limit tunable for the "low" level notification, + * and for averaging medium/critical levels. Using small window sizes can + * cause lot of false positives, but too big window size will delay the + * notifications. + * + * TODO: Make the window size depend on machine size, as we do for vmstat + * thresholds. + */ +static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16; +static const unsigned int vmpressure_level_med = 60; +static const unsigned int vmpressure_level_critical = 95; +static const unsigned int vmpressure_level_critical_prio = 3; + +enum vmpressure_levels { + VMPRESSURE_LOW = 0, + VMPRESSURE_MEDIUM, + VMPRESSURE_CRITICAL, + VMPRESSURE_NUM_LEVELS, +}; + +static const char *vmpressure_str_levels[] = { + [VMPRESSURE_LOW] = "low", + [VMPRESSURE_MEDIUM] = "medium", + [VMPRESSURE_CRITICAL] = "critical", +}; + +static enum vmpressure_levels vmpressure_level(unsigned int pressure) +{ + if (pressure >= vmpressure_level_critical) + return VMPRESSURE_CRITICAL; + else if (pressure >= vmpressure_level_med) + return VMPRESSURE_MEDIUM; + return VMPRESSURE_LOW; +} + +static enum vmpressure_levels vmpressure_calc_level(unsigned int scanned, + unsigned int reclaimed) +{ + unsigned long scale = scanned + reclaimed; + unsigned long pressure; + + if (!scanned) + return VMPRESSURE_LOW; + + /* + * We calculate the ratio (in percents) of how many pages were + * scanned vs. reclaimed in a given time frame (window). Note that + * time is in VM reclaimer's "ticks", i.e. number of pages + * scanned. This makes it possible to set desired reaction time + * and serves as a ratelimit. + */ + pressure = scale - (reclaimed * scale / scanned); + pressure = pressure * 100 / scale; + + pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, pressure, + scanned, reclaimed); + + return vmpressure_level(pressure); +} + +void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, + unsigned long scanned, unsigned long reclaimed) +{ + struct vmpressure *vmpr = memcg_to_vmpr(memcg); + + /* + * So far we are only interested application memory, or, in case + * of low pressure, in FS/IO memory reclaim. We are also + * interested indirect reclaim (kswapd sets sc->gfp_mask to + * GFP_KERNEL). + */ + if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS))) + return; + + if (!scanned) + return; + + mutex_lock(&vmpr->sr_lock); + vmpr->scanned += scanned; + vmpr->reclaimed += reclaimed; + mutex_unlock(&vmpr->sr_lock); + + if (scanned < vmpressure_win || work_pending(&vmpr->work)) + return; + schedule_work(&vmpr->work); +} + +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio) +{ + if (prio > vmpressure_level_critical_prio) + return; + + /* + * OK, the prio is below the threshold, updating vmpressure + * information before diving into long shrinking of long range + * vmscan. + */ + vmpressure(gfp, memcg, vmpressure_win, 0); +} + +static struct vmpressure *wk_to_vmpr(struct work_struct *wk) +{ + return container_of(wk, struct vmpressure, work); +} + +static struct vmpressure *cg_to_vmpr(struct cgroup *cg) +{ + return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id)); +} + +struct vmpressure_event { + struct eventfd_ctx *efd; + enum vmpressure_levels level; + struct list_head node; +}; + +static bool vmpressure_event(struct vmpressure *vmpr, + unsigned long scanned, unsigned long reclaimed) +{ + struct vmpressure_event *ev; + int level = vmpressure_calc_level(scanned, reclaimed); + bool signalled = false; + + mutex_lock(&vmpr->events_lock); + + list_for_each_entry(ev, &vmpr->events, node) { + if (level >= ev->level) { + eventfd_signal(ev->efd, 1); + signalled = true; + } + } + + mutex_unlock(&vmpr->events_lock); + + return signalled; +} + +static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr) +{ + struct cgroup *cg = vmpr_to_css(vmpr)->cgroup; + struct mem_cgroup *memcg = mem_cgroup_from_cont(cg); + + memcg = parent_mem_cgroup(memcg); + if (!memcg) + return NULL; + return memcg_to_vmpr(memcg); +} + +static void vmpressure_wk_fn(struct work_struct *wk) +{ + struct vmpressure *vmpr = wk_to_vmpr(wk); + unsigned long s; + unsigned long r; + + mutex_lock(&vmpr->sr_lock); + s = vmpr->scanned; + r = vmpr->reclaimed; + vmpr->scanned = 0; + vmpr->reclaimed = 0; + mutex_unlock(&vmpr->sr_lock); + + do { + if (vmpressure_event(vmpr, s, r)) + break; + /* + * If not handled, propagate the event upward into the + * hierarchy. + */ + } while ((vmpr = vmpressure_parent(vmpr))); +} + +int vmpressure_register_event(struct cgroup *cg, struct cftype *cft, + struct eventfd_ctx *eventfd, const char *args) +{ + struct vmpressure *vmpr = cg_to_vmpr(cg); + struct vmpressure_event *ev; + int lvl; + + for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) { + if (!strcmp(vmpressure_str_levels[lvl], args)) + break; + } + + if (lvl >= VMPRESSURE_NUM_LEVELS) + return -EINVAL; + + ev = kzalloc(sizeof(*ev), GFP_KERNEL); + if (!ev) + return -ENOMEM; + + ev->efd = eventfd; + ev->level = lvl; + + mutex_lock(&vmpr->events_lock); + list_add(&ev->node, &vmpr->events); + mutex_unlock(&vmpr->events_lock); + + return 0; +} + +void vmpressure_unregister_event(struct cgroup *cg, struct cftype *cft, + struct eventfd_ctx *eventfd) +{ + struct vmpressure *vmpr = cg_to_vmpr(cg); + struct vmpressure_event *ev; + + mutex_lock(&vmpr->events_lock); + list_for_each_entry(ev, &vmpr->events, node) { + if (ev->efd != eventfd) + continue; + list_del(&ev->node); + kfree(ev); + break; + } + mutex_unlock(&vmpr->events_lock); +} + +void vmpressure_init(struct vmpressure *vmpr) +{ + mutex_init(&vmpr->sr_lock); + mutex_init(&vmpr->events_lock); + INIT_LIST_HEAD(&vmpr->events); + INIT_WORK(&vmpr->work, vmpressure_wk_fn); +} diff --git a/mm/vmscan.c b/mm/vmscan.c index 88c5fed..9530777 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -19,6 +19,7 @@ #include <linux/pagemap.h> #include <linux/init.h> #include <linux/highmem.h> +#include <linux/vmpressure.h> #include <linux/vmstat.h> #include <linux/file.h> #include <linux/writeback.h> @@ -1982,6 +1983,11 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc) } memcg = mem_cgroup_iter(root, memcg, &reclaim); } while (memcg); + + vmpressure(sc->gfp_mask, sc->target_mem_cgroup, + sc->nr_scanned - nr_scanned, + sc->nr_reclaimed - nr_reclaimed); + } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed, sc->nr_scanned - nr_scanned, sc)); } @@ -2167,6 +2173,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, count_vm_event(ALLOCSTALL); do { + vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup, + sc->priority); sc->nr_scanned = 0; aborted_reclaim = shrink_zones(zonelist, sc); -- 1.8.1.1

12 years, 4 months

Re: [PATCH v3 2/3] ab8500: make res_to_temp tables public

by Hongbo Zhang

On 22 February 2013 06:24, Anton Vorontsov <anton(a)enomsg.org> wrote: > On Thu, Feb 21, 2013 at 06:32:40PM +0800, Hongbo Zhang wrote: >> These NTC resistance to temperature tables should be public, so others such as >> ab8500 hwmon driver can look up these tables to convert NTC resistance to >> temperature. >> >> Signed-off-by: Hongbo Zhang <hongbo.zhang(a)linaro.org> >> --- > > For 1/3 and 2/3 patches: > > Acked-by: Anton Vorontsov <anton(a)enomsg.org> > > (Do you need EXPORT_SYMBOL()? You don't use this from modules?) Thanks, will export them. > > Thanks. > >> drivers/power/ab8500_bmdata.c | 8 ++++++-- >> 1 file changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/power/ab8500_bmdata.c b/drivers/power/ab8500_bmdata.c >> index f034ae4..53f3324 100644 >> --- a/drivers/power/ab8500_bmdata.c >> +++ b/drivers/power/ab8500_bmdata.c >> @@ -11,7 +11,7 @@ >> * Note that the res_to_temp table must be strictly sorted by falling resistance >> * values to work. >> */ >> -static struct abx500_res_to_temp temp_tbl_A_thermistor[] = { >> +struct abx500_res_to_temp temp_tbl_A_thermistor[] = { >> {-5, 53407}, >> { 0, 48594}, >> { 5, 43804}, >> @@ -29,7 +29,9 @@ static struct abx500_res_to_temp temp_tbl_A_thermistor[] = { >> {65, 12500}, >> }; >> >> -static struct abx500_res_to_temp temp_tbl_B_thermistor[] = { >> +int temp_tbl_A_size = ARRAY_SIZE(temp_tbl_A_thermistor); >> + >> +struct abx500_res_to_temp temp_tbl_B_thermistor[] = { >> {-5, 200000}, >> { 0, 159024}, >> { 5, 151921}, >> @@ -47,6 +49,8 @@ static struct abx500_res_to_temp temp_tbl_B_thermistor[] = { >> {65, 82869}, >> }; >> >> +int temp_tbl_B_size = ARRAY_SIZE(temp_tbl_B_thermistor); >> + >> static struct abx500_v_to_cap cap_tbl_A_thermistor[] = { >> {4171, 100}, >> {4114, 95}, >> -- >> 1.8.0

12 years, 4 months

Jump to page:

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

linaro-kernel