After some time investigating why I wasn't seeing some kernel section
mismatch errors that someone else was seeing, I found the cause: at
Linaro we mainly build Thumb2 kernels, and modpost.c doesn't support
any of the Thumb relocation types in addend_arm_rel().
I thought I would spread this knowledge, because the lack of section
mismatch warnings means we might miss some nasty bugs when developing
code.
If this is old news, then sorry for the noise.
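For reference, here is a minimal standalone sketch (illustration only,
not the actual modpost fix) of the kind of work addend_arm_rel() would
need to do for Thumb relocation types such as
R_ARM_THM_CALL/R_ARM_THM_JUMP24: decoding the branch offset out of a
32-bit Thumb-2 BL/B.W instruction, which is encoded quite differently
from the ARM B/BL case modpost already handles.

/* thumb2_bl_offset.c -- illustration only; assumes the Thumb-2
 * BL (T1) / B.W (T4) immediate layout: S:I1:I2:imm10:imm11:'0',
 * with I1 = !(J1 ^ S) and I2 = !(J2 ^ S).
 */
#include <stdint.h>
#include <stdio.h>

static int32_t thumb2_branch_offset(uint16_t upper, uint16_t lower)
{
        uint32_t s     = (upper >> 10) & 1;
        uint32_t imm10 = upper & 0x03ff;
        uint32_t j1    = (lower >> 13) & 1;
        uint32_t j2    = (lower >> 11) & 1;
        uint32_t imm11 = lower & 0x07ff;
        uint32_t i1    = !(j1 ^ s);
        uint32_t i2    = !(j2 ^ s);
        uint32_t imm   = (s << 24) | (i1 << 23) | (i2 << 22) |
                         (imm10 << 12) | (imm11 << 1);

        /* Sign-extend the 25-bit immediate. */
        return (int32_t)(imm << 7) >> 7;
}

int main(void)
{
        /* 0xf000 0xf800 encodes a BL with a zero branch offset. */
        printf("offset = %ld\n",
               (long)thumb2_branch_offset(0xf000, 0xf800));
        return 0;
}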
--
Tixy
Hi Mark,
I am getting a compilation warning while compiling v3.8
commit 19f949f52599ba7c3f67a5897ac6be14bfcb1200
Author: Linus Torvalds <torvalds(a)linux-foundation.org>
Date: Mon Feb 18 15:58:34 2013 -0800
Linux 3.8
Warning:
CC drivers/base/regmap/regmap-debugfs.o
drivers/base/regmap/regmap-debugfs.c: In function ‘regmap_read_debugfs’:
drivers/base/regmap/regmap-debugfs.c:180:9: warning: ‘ret’ may be used
uninitialized in this function [-Wmaybe-uninitialized]
I am unable to understand why this warning is emitted, and why it points
at line 180 (which doesn't use this variable). I can't see how this
variable could be used uninitialized.
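For reference, here is a minimal standalone example (nothing to do with
the regmap code itself) of the kind of pattern that can make GCC emit
-Wmaybe-uninitialized: a variable that is only assigned inside a loop
the compiler cannot prove runs at least once, and is then read afterwards.

/* maybe_uninit.c -- illustration only; build with gcc -O2 -Wall. */
#include <stdio.h>

int last_value(const int *vals, int n)
{
        int ret;                /* assigned only inside the loop */
        int i;

        for (i = 0; i < n; i++)
                ret = vals[i];  /* last assignment wins */

        return ret;             /* 'ret' may be uninitialized if n == 0 */
}

int main(void)
{
        int vals[] = { 1, 2, 3 };

        printf("%d\n", last_value(vals, 3));
        return 0;
}

With optimization enabled, GCC sometimes reports the warning at a line
other than the actual read, which might explain the confusing line number.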
Toolchain I used:
arm-linux-gnueabihf-gcc (crosstool-NG
linaro-1.13.1-4.7-2012.12-20121214 - Linaro GCC 2012.12) 4.7.3
20121205 (prerelease)
--
viresh
=== David Long ===
=== Travel/Time Off ===
* Monday February 18th (U.S. Washington's Birthday, aka President's Day)
=== Highlights ===
Coming up to speed on process.
* Studied the history and content of Rabin Vincent's ARM uprobe kernel
patch. It does a good job of integrating with existing kprobe
instruction interpretation code.
* Upleveled the uprobe patch to 3.7 (for now) and booted on 4460
Panda. I am experimenting with it to verify basic correct operation.
* Sent email to Rabin on the topic of assisting in getting this
patch upstreamed.
=== Plans ===
* Once basic functionality is verified, uplevel the patch to 3.8 and
complete testing (especially as regards Thumb).
* Determine whether it is possible to work with the patch originator, or
whether to push this patch forward independently.
=== Issues ===
* Eventually I will need hardware other than Panda for testing. For
now Panda works well enough, and QEMU is (theoretically) an option.
-dl
=== Highlights ===
* Lots of practice and refining of slides for ABS talk
* Sent out android upstreaming subteam mail
* Synced with Zach/Deepak
* Mailed a bit with Zach on hotplug and volatile ranges
* Submitted discussion proposal for lsf/mm-minisummit on volatile ranges
(and pinged Anton to maybe do so for mempressure cg)
* Pinged Arve on Serban's ashmem compat_ioctl patches
* Emailed briefly with Tom and Sumit about dmabuf-fences
* Pinged Erik again on my proposal to move sync driver to staging
* Mailed Maarten and Daniel about dmabuf-fences. Trying to see how we
can get folks talking on how to unify sync with dmabuf-fences.
=== Plans ===
* Give Android talk on Monday at ABS
* Follow up on additional sync/dmabuf-fences discussion
* Possibly submit sync upstream to staging
* Try to refocus back on volatile ranges some
=== Issues ===
* NA
With this patch, userland applications that want to maintain
interactivity/memory allocation cost can use the new pressure level
notifications. The levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring reclaim activity might be useful for maintaining
the system's overall cache level. Upon notification, the program
(typically "Activity Manager") might analyze vmstat and act in advance
(e.g. prematurely shut down unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure; there is some mild swapping activity. Upon this event,
applications may decide to analyze vmstat/zoneinfo/memcg or internal
memory usage statistics and free any resources that can be easily
reconstructed or re-read from disk.
The "critical" level means that the system is actively thrashing; it is
about to run out of memory (OOM), or the in-kernel OOM killer may even be
about to trigger. Applications should do whatever they can to help the
system. It might be too late to consult vmstat or any other statistics,
so it's advisable to take immediate action.
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: suppose you have
three cgroups: A->B->C. Now you set up an event listener on cgroups A and
B, and group C experiences some pressure. In this situation, only group B
will receive the notification, i.e. group A will not receive it. This is
done to avoid excessive "broadcasting" of messages, which disturbs the
system and which is especially bad if we are low on memory or thrashing.
So, organize the cgroups wisely, or propagate the events manually (or ask
us to implement pass-through events, explaining why you would need them).
The file memory.pressure_level is used to show the current memory
pressure level, and the cgroup event control file can be used to set up
an eventfd notification with a specific memory pressure level threshold.
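As an illustration, a minimal userland listener (hypothetical cgroup
path, error handling mostly omitted) that registers an eventfd for "low"
notifications via the usual memcg cgroup.event_control convention could
look like this:

/* pressure_listener.c -- sketch only, not part of this patch. */
#include <sys/eventfd.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *cg = "/sys/fs/cgroup/memory";       /* example path */
        char path[256], line[64];
        uint64_t count;
        int efd, lfd, cfd;

        efd = eventfd(0, 0);

        snprintf(path, sizeof(path), "%s/memory.pressure_level", cg);
        lfd = open(path, O_RDONLY);

        snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
        cfd = open(path, O_WRONLY);

        /* "<event_fd> <fd of memory.pressure_level> <level>" */
        snprintf(line, sizeof(line), "%d %d low", efd, lfd);
        if (write(cfd, line, strlen(line)) < 0) {
                perror("register");
                return 1;
        }

        /* Each read blocks until the kernel signals "low" (or higher). */
        while (read(efd, &count, sizeof(count)) == sizeof(count))
                printf("memory pressure event, count=%llu\n",
                       (unsigned long long)count);
        return 0;
}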
Signed-off-by: Anton Vorontsov <anton.vorontsov(a)linaro.org>
Acked-by: Kirill A. Shutemov <kirill(a)shutemov.name>
---
Hi all,
Here comes another iteration of the memory pressure saga. The previous
version of the patch (and discussion) can be found here:
http://lkml.org/lkml/2013/1/4/55
And here are changes in this revision:
- Andrew Morton was concerned that the mempressure stuff was tied to
memcg, which was a non-issue since mempressure wasn't actually bolted into
memcg at that time. But now it is. :) So now you need memcg to use
mempressure. Why? It makes things easier and simpler (e.g. it ends any
questions about how two different cgroups would interact, which can be
complex when the two are distinct entities). Plus, as I understood it,
that's how the cgroup folks want to see it eventually;
- Only cgroups API implemented. Let's start with making memcg people
happy, i.e. handling the most complex cases, and then we can start with
any niche solutions;
- Implemented Minchan Kim's idea of checking the gfp mask. Unfortunately,
it is not as simple as checking '__GFP_HIGHMEM | __GFP_MOVABLE', since we
also need to account for file caches and kswapd reclaim. But even so, we
can filter out DMA or atomic allocations, which are not interesting for
userland. Plus it opens the door for other gfp tuning, so it is definitely
good stuff;
- Per Leonid Moiseichuk's comments, decreased vmpressure_level_critical to
95. I didn't look closely enough, but it seems that the minimum step is
indeed ~3%, and 99% effectively makes it 100%. 95% should be fine;
- Per Kamezawa Hiroyuki's comments, added some words to the documentation
about it always being a good idea to consult vmstat/zoneinfo/memcg
statistics before taking any action (with the exception of the critical
level). Also added a 'TODO' wrt. automatic window adjustment;
- Documented events propagation strategy;
- Removed ulong/uint usage, per Andrew's comments;
- Glauber Costa didn't like the short and non-descriptive mpc_ naming,
suggesting mempressure_ instead. And Andrew suggested mpcg_. I went with
something completely different: vmpressure_/vmpr_. :) Also renamed
xxx2yyy() to xxx_to_yyy() per Glauber Costa's suggestion.
- _OOM level renamed to _CRITICAL. Andrew wanted _HIGH affix, but by using
'critical' I want to denote that this level is the last one (e.g. we
might want to introduce _HIGH some time later, if we can find a good
definition for it);
- This patch does not include the shrinker interface. In the last series I
showed that implementing a shrinker is possible, and that it actually can
be useful. At the same time I explained that the shrinker is not a
substitute for the pressure levels. So, once we settle on the simple
thing, I might continue my shrinker efforts (which, btw, the QEMU guys
found interesting and potentially useful).
For those who are curious, the shrinker patch is here:
http://lkml.org/lkml/2013/1/4/56
- Now tested with various debugging & preempt checks enabled, plus added
small comments on locks usage, thanks to Andrew;
- Rebased onto the current linux-next;
- While things changed somewhat, I preserved Kirill's ack. Kirill at
least liked the idea, and I desperately need Acks. :-D
Thanks!
Anton
Documentation/cgroups/memory.txt | 66 ++++++++-
init/Kconfig | 13 ++
mm/Makefile | 1 +
mm/internal.h | 34 +++++
mm/memcontrol.c | 25 ++++
mm/vmpressure.c | 300 +++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 6 +
7 files changed, 444 insertions(+), 1 deletion(-)
create mode 100644 mm/vmpressure.c
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index addb1f1..006ef58 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -40,6 +40,7 @@ Features:
- soft limit
- moving (recharging) account at moving a task is selectable.
- usage threshold notifier
+ - memory pressure notifier
- oom-killer disable knob and oom-notifier
- Root cgroup has no limit controls.
@@ -65,6 +66,7 @@ Brief summary of control files.
memory.stat # show various statistics
memory.use_hierarchy # set/show hierarchical account enabled
memory.force_empty # trigger forced move charge to parent
+ memory.pressure_level # show the memory pressure level
memory.swappiness # set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
@@ -778,7 +780,69 @@ At reading, current status of OOM is shown.
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)
-11. TODO
+11. Memory Pressure
+
+To maintain the interactivity/memory allocation cost, one can use the
+pressure level notifications, and the levels are defined like this:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring reclaim activity might be useful for maintaining
+the system's overall cache level. Upon notification, the program
+(typically "Activity Manager") might analyze vmstat and act in advance
+(e.g. prematurely shut down unimportant services).
+
+The "medium" level means that the system is experiencing medium memory
+pressure; there is some mild swapping activity. Upon this event,
+applications may decide to analyze vmstat/zoneinfo/memcg or internal
+memory usage statistics and free any resources that can be easily
+reconstructed or re-read from disk.
+
+The "critical" level means that the system is actively thrashing; it is
+about to run out of memory (OOM), or the in-kernel OOM killer may even be
+about to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult vmstat or any other statistics,
+so it's advisable to take immediate action.
+
+The events are propagated upward until the event is handled, i.e. the
+events are not pass-through. Here is what this means: suppose you have
+three cgroups: A->B->C. Now you set up an event listener on cgroups A and
+B, and group C experiences some pressure. In this situation, only group B
+will receive the notification, i.e. group A will not receive it. This is
+done to avoid excessive "broadcasting" of messages, which disturbs the
+system and which is especially bad if we are low on memory or thrashing.
+So, organize the cgroups wisely, or propagate the events manually (or ask
+us to implement pass-through events, explaining why you would need
+them).
+
+The file memory.pressure_level is used to show the current memory
+pressure level, and the cgroup event control file can be used to set up
+an eventfd notification with a specific memory pressure level threshold.
+
+ Read:
+ Reads memory pressure levels: low, medium or critical.
+ Write:
+ Not implemented.
+ Test:
+ Here is a script: make a new cgroup, set up a memory limit, set up a
+ notification on the parent cgroup, and make the child cgroup experience
+ critical pressure. The expected result is that the parent cgroup gets a
+ notification:
+
+ (Note that we are setting up a listener on the parent's cgroup, and
+ then creating a child cgroup, showing how event propagation works.)
+
+ # cd /sys/fs/cgroup/memory/
+ # cgroup_event_listener memory.pressure_level low &
+ # mkdir foo
+ # cd foo
+ # echo 8000000 > memory.limit_in_bytes
+ # echo $$ > tasks
+ # dd if=/dev/zero | read x
+
+ (Expect a bunch of notifications, and eventually, the oom-killer will
+ trigger.)
+
+12. TODO
1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
diff --git a/init/Kconfig b/init/Kconfig
index ccd1ca5..6d61ef5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -908,6 +908,19 @@ config MEMCG_DEBUG_ASYNC_DESTROY
This is a developer-oriented debugging facility only, and no
guarantees of interface stability will be given.
+config MEMCG_PRESSURE
+ bool "Memory Resource Controller Pressure Monitor"
+ help
+ The memory pressure monitor provides a facility for userland
+ programs to watch for memory pressure on a per-cgroup basis. This
+ is useful if you have programs that want to respond to the
+ pressure, possibly improving memory management.
+
+ For more information see Memory Pressure section in
+ Documentation/cgroups/memory.txt.
+
+ If unsure, say N.
+
config CGROUP_HUGETLB
bool "HugeTLB Resource Controller for Control Groups"
depends on RESOURCE_COUNTERS && HUGETLB_PAGE
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..51f7f52 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMCG_PRESSURE) += vmpressure.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/internal.h b/mm/internal.h
index 1c0c4cc..eb50685 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -374,4 +374,38 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
+struct vmpressure {
+#ifdef CONFIG_MEMCG_PRESSURE
+ unsigned int scanned;
+ unsigned int reclaimed;
+ /* The lock is used to keep the scanned/reclaimed above in sync. */
+ struct mutex sr_lock;
+
+ struct list_head events;
+ /* Have to grab the lock on events traversal or modifications. */
+ struct mutex events_lock;
+
+ struct work_struct work;
+#endif /* CONFIG_MEMCG_PRESSURE */
+};
+
+struct mem_cgroup;
+#ifdef CONFIG_MEMCG_PRESSURE
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed);
+extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
+extern void vmpressure_init(struct vmpressure *vmpr);
+extern struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr);
+extern struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css);
+extern void __init enable_pressure_cgroup(void);
+#else
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed) {}
+static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
+ int prio) {}
+static inline void vmpressure_init(struct vmpressure *vmpr) {}
+static inline void __init enable_pressure_cgroup(void) {}
+#endif /* CONFIG_MEMCG_PRESSURE */
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25ac5f4..60f277a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -370,6 +370,9 @@ struct mem_cgroup {
atomic_t numainfo_events;
atomic_t numainfo_updating;
#endif
+
+ struct vmpressure vmpr;
+
/*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
@@ -575,6 +578,26 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return (memcg == root_mem_cgroup);
}
+/* Some nice accessors for the vmpressure. */
+#ifdef CONFIG_MEMCG_PRESSURE
+struct vmpressure *memcg_to_vmpr(struct mem_cgroup *memcg)
+{
+ if (!memcg)
+ memcg = root_mem_cgroup;
+ return &memcg->vmpr;
+}
+
+struct cgroup_subsys_state *vmpr_to_css(struct vmpressure *vmpr)
+{
+ return &container_of(vmpr, struct mem_cgroup, vmpr)->css;
+}
+
+struct vmpressure *css_to_vmpr(struct cgroup_subsys_state *css)
+{
+ return &mem_cgroup_from_css(css)->vmpr;
+}
+#endif /* CONFIG_MEMCG_PRESSURE */
+
/* Writing them here to avoid exposing memcg's inner layout */
#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
@@ -6291,6 +6314,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
memcg->move_charge_at_immigrate = 0;
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
+ vmpressure_init(&memcg->vmpr);
return &memcg->css;
@@ -7018,6 +7042,7 @@ static int __init mem_cgroup_init(void)
{
hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
enable_swap_cgroup();
+ enable_pressure_cgroup();
mem_cgroup_soft_limit_tree_init();
memcg_stock_init();
return 0;
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
new file mode 100644
index 0000000..7922503
--- /dev/null
+++ b/mm/vmpressure.c
@@ -0,0 +1,300 @@
+/*
+ * Linux VM pressure
+ *
+ * Copyright 2012 Linaro Ltd.
+ * Anton Vorontsov <anton.vorontsov(a)linaro.org>
+ *
+ * Based on ideas from Andrew Morton, David Rientjes, KOSAKI Motohiro,
+ * Leonid Moiseichuk, Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+#include "internal.h"
+
+/*
+ * Generic VM Pressure routines (no cgroups or any other API details)
+ */
+
+/*
+ * The window size is the number of scanned pages before we try to analyze
+ * the scanned/reclaimed ratio (or difference).
+ *
+ * It is used as a rate-limit tunable for the "low" level notification,
+ * and for averaging the medium/critical levels. Using small window sizes
+ * can cause a lot of false positives, but too big a window size will
+ * delay the notifications.
+ *
+ * TODO: Make the window size depend on machine size, as we do for vmstat
+ * thresholds.
+ */
+static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const unsigned int vmpressure_level_med = 60;
+static const unsigned int vmpressure_level_critical = 95;
+static const unsigned int vmpressure_level_critical_prio = 3;
+
+enum vmpressure_levels {
+ VMPRESSURE_LOW = 0,
+ VMPRESSURE_MEDIUM,
+ VMPRESSURE_CRITICAL,
+ VMPRESSURE_NUM_LEVELS,
+};
+
+static const char *vmpressure_str_levels[] = {
+ [VMPRESSURE_LOW] = "low",
+ [VMPRESSURE_MEDIUM] = "medium",
+ [VMPRESSURE_CRITICAL] = "critical",
+};
+
+static enum vmpressure_levels vmpressure_level(unsigned int pressure)
+{
+ if (pressure >= vmpressure_level_critical)
+ return VMPRESSURE_CRITICAL;
+ else if (pressure >= vmpressure_level_med)
+ return VMPRESSURE_MEDIUM;
+ return VMPRESSURE_LOW;
+}
+
+static unsigned long vmpressure_calc_level(unsigned int win,
+ unsigned int s, unsigned int r)
+{
+ unsigned long p;
+
+ if (!s)
+ return 0;
+
+ /*
+ * We calculate the ratio (in percents) of how many pages were
+ * scanned vs. reclaimed in a given time frame (window). Note that
+ * time is in VM reclaimer's "ticks", i.e. number of pages
+ * scanned. This makes it possible to set desired reaction time
+ * and serves as a ratelimit.
+ */
+ p = win - (r * win / s);
+ p = p * 100 / win;
+
+ pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
+
+ return vmpressure_level(p);
+}
+
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
+ unsigned long scanned, unsigned long reclaimed)
+{
+ struct vmpressure *vmpr = memcg_to_vmpr(memcg);
+
+ /*
+ * So far we are only interested in application memory, or, in the
+ * case of low pressure, in FS/IO memory reclaim. We are also
+ * interested in indirect reclaim (kswapd sets sc->gfp_mask to
+ * GFP_KERNEL).
+ */
+ if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
+ return;
+
+ if (!scanned)
+ return;
+
+ mutex_lock(&vmpr->sr_lock);
+ vmpr->scanned += scanned;
+ vmpr->reclaimed += reclaimed;
+ mutex_unlock(&vmpr->sr_lock);
+
+ if (scanned < vmpressure_win || work_pending(&vmpr->work))
+ return;
+ schedule_work(&vmpr->work);
+}
+
+void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
+{
+ if (prio > vmpressure_level_critical_prio)
+ return;
+
+ /* OK, the prio is below the threshold, we're about to oom. */
+ vmpressure(gfp, memcg, vmpressure_win, 0);
+}
+
+static struct vmpressure *wk_to_vmpr(struct work_struct *wk)
+{
+ return container_of(wk, struct vmpressure, work);
+}
+
+static struct vmpressure *cg_to_vmpr(struct cgroup *cg)
+{
+ return css_to_vmpr(cgroup_subsys_state(cg, mem_cgroup_subsys_id));
+}
+
+struct vmpressure_event {
+ struct eventfd_ctx *efd;
+ enum vmpressure_levels level;
+ struct list_head node;
+};
+
+static bool vmpressure_event(struct vmpressure *vmpr,
+ unsigned long s, unsigned long r)
+{
+ struct vmpressure_event *ev;
+ int level = vmpressure_calc_level(vmpressure_win, s, r);
+ bool signalled = 0;
+
+ mutex_lock(&vmpr->events_lock);
+
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (level >= ev->level) {
+ eventfd_signal(ev->efd, 1);
+ signalled++;
+ }
+ }
+
+ mutex_unlock(&vmpr->events_lock);
+
+ return signalled;
+}
+
+static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr)
+{
+ struct cgroup *cg = vmpr_to_css(vmpr)->cgroup->parent;
+
+ if (!cg)
+ return NULL;
+ return cg_to_vmpr(cg);
+}
+
+static void vmpressure_wk_fn(struct work_struct *wk)
+{
+ struct vmpressure *vmpr = wk_to_vmpr(wk);
+ unsigned long s;
+ unsigned long r;
+
+ mutex_lock(&vmpr->sr_lock);
+ s = vmpr->scanned;
+ r = vmpr->reclaimed;
+ vmpr->scanned = 0;
+ vmpr->reclaimed = 0;
+ mutex_unlock(&vmpr->sr_lock);
+
+ do {
+ if (vmpressure_event(vmpr, s, r))
+ break;
+ /*
+ * If not handled, propagate the event upward into the
+ * hierarchy.
+ */
+ } while ((vmpr = vmpressure_parent(vmpr)));
+}
+
+/* cgroups "frontend" for vmpressure. */
+
+static ssize_t vmpressure_read_level(struct cgroup *cg, struct cftype *cft,
+ struct file *file, char __user *buf,
+ size_t sz, loff_t *ppos)
+{
+ struct vmpressure *vmpr = cg_to_vmpr(cg);
+ unsigned int level;
+ const char *str;
+ ssize_t len = 0;
+
+ if (*ppos >= sz)
+ return 0;
+
+ mutex_lock(&vmpr->sr_lock);
+
+ level = vmpressure_calc_level(vmpressure_win,
+ vmpr->scanned, vmpr->reclaimed);
+
+ mutex_unlock(&vmpr->sr_lock);
+
+ str = vmpressure_str_levels[level];
+ len += strlen(str) + 1;
+ if (len > sz)
+ return -EINVAL;
+
+ if (copy_to_user(buf, str, len - 1))
+ return -EFAULT;
+ if (copy_to_user(buf + len - 1, "\n", 1))
+ return -EFAULT;
+
+ *ppos += sz;
+ return len;
+}
+
+static int vmpressure_register_level(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd,
+ const char *args)
+{
+ struct vmpressure *vmpr = cg_to_vmpr(cg);
+ struct vmpressure_event *ev;
+ int lvl;
+
+ for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
+ if (!strcmp(vmpressure_str_levels[lvl], args))
+ break;
+ }
+
+ if (lvl >= VMPRESSURE_NUM_LEVELS)
+ return -EINVAL;
+
+ ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+ if (!ev)
+ return -ENOMEM;
+
+ ev->efd = eventfd;
+ ev->level = lvl;
+
+ mutex_lock(&vmpr->events_lock);
+ list_add(&ev->node, &vmpr->events);
+ mutex_unlock(&vmpr->events_lock);
+
+ return 0;
+}
+
+static void vmpressure_unregister_level(struct cgroup *cg, struct cftype *cft,
+ struct eventfd_ctx *eventfd)
+{
+ struct vmpressure *vmpr = cg_to_vmpr(cg);
+ struct vmpressure_event *ev;
+
+ mutex_lock(&vmpr->events_lock);
+ list_for_each_entry(ev, &vmpr->events, node) {
+ if (ev->efd != eventfd)
+ continue;
+ list_del(&ev->node);
+ kfree(ev);
+ break;
+ }
+ mutex_unlock(&vmpr->events_lock);
+}
+
+static struct cftype vmpressure_cgroup_files[] = {
+ {
+ .name = "pressure_level",
+ .read = vmpressure_read_level,
+ .register_event = vmpressure_register_level,
+ .unregister_event = vmpressure_unregister_level,
+ },
+ {},
+};
+
+void vmpressure_init(struct vmpressure *vmpr)
+{
+ mutex_init(&vmpr->sr_lock);
+ mutex_init(&vmpr->events_lock);
+ INIT_LIST_HEAD(&vmpr->events);
+ INIT_WORK(&vmpr->work, vmpressure_wk_fn);
+}
+
+void __init enable_pressure_cgroup(void)
+{
+ WARN_ON(cgroup_add_cftypes(&mem_cgroup_subsys,
+ vmpressure_cgroup_files));
+}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 88c5fed..34f09b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1982,6 +1982,10 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
}
memcg = mem_cgroup_iter(root, memcg, &reclaim);
} while (memcg);
+
+ vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
+ sc->nr_scanned - nr_scanned, nr_reclaimed);
+
} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));
}
@@ -2167,6 +2171,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
count_vm_event(ALLOCSTALL);
do {
+ vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
+ sc->priority);
sc->nr_scanned = 0;
aborted_reclaim = shrink_zones(zonelist, sc);
--
1.8.1.1
== Linus Walleij linusw ==
=== Highlights ===
* Finalized AB8500 GPIO patches, tested and obtained working IRQs.
Merged some of these into the MFD tree, some into the pinctrl tree
and some into a patch set targeted at ARM SoC.
* GPIO maintenance:
- Handed working tree over to Grant, who picked it and added
some more.
- Reviewed some of the nice GPIO descriptor rework patches,
and Grant started merging some of them.
* Pinctrl maintenance:
- Requested Torvalds to pull in the last two pinctrl fixes. He pulled
them in.
- Merged the ABx500 pinctrl stuff.
- Merged a bunch of lantiq patches.
* Reviewed some PXA SPI DMA stuff: they are basically splitting
the custom DMA API from the dmaengine API so that the former can
optionally be compiled out and eventually deleted, which is
nice stuff. The PXA SPI is apparently also used by all the
Intel SoCs, so this is a big win.
* Cooked two fix-up patches against the compile regression
introduced in ux500 by the <mach/id.h> removal patches,
and sent them out:
http://marc.info/?l=linux-arm-kernel&m=136051407426331&w=2
http://marc.info/?l=linux-arm-kernel&m=136051407826332&w=2
Hopefully these can get merged. Still no clue how I managed
to screw things up like this; I know for sure I compiled this
branch, but maybe new support was introduced somewhere in
the v3.7 cycle and I missed it.
* Russell merged the Versatile QEMU PCI fix.
* Interviewed a potential KWG assignee on Deepak's request.
* Got fed up with people not fixing the NO_IRQ business
(i.e. using Linux IRQ 0), so I sent two attack-patches bumping
fixed Linux IRQ offsets to 64 for mach-netx and mach-ep93xx.
netx patch ACKed, merging through Russell.
* Stood by Fabio while he was root-causing an issue on the
DMA40 DMA controller. He found the culprit and everyone is
happy.
* Debated heavy subjects:
- Is virtio or dmaengine the best way forward for OMAP's
odd USB acceleration.
- Status of the HSI subsystem.
- Deferred probe is completing after __init sections have
been discarded, on the assumption that nothing needing these
sections will still be around. That doesn't work for the console
set-up calls, d'oh. Haojian has an interesting pending patch:
http://marc.info/?l=linux-kernel&m=136042916203488&w=2
=== Plans ===
* Finalize a GPIO+pinctrl presentation for the Embedded Linux
Conference next week. My presentation will be on the first day of
the conference. It's all fun! I will be travelling and hanging out
at ELC all of next week, Monday the 18th through Monday the 25th.
* Attack the remaining headers in arch/arm/mach-ux500
so we can move forward with multiplatform for v3.9.
* Convert Nomadik pinctrl driver to register GPIO ranges
from the gpiochip side.
* Test the PL08x patches on the Ericsson Research
PB11MPCore and submit platform data for using
pl08x DMA on that platform.
* Look into other Ux500 stuff in need of mainlining...
using an internal tracking sheet for this.
* Get hands dirty with regmap.
=== Issues ===
* Some stress still, but it feels better when things have started
working and regressions get fixed.
Thanks,
Linus Walleij
== Ulf Hansson ==
=== Highlights ===
Storage:
* Monitoring patches on mmc-list.
* Patches for fixing the signal voltage switch procedure for SD card UHS
mode are ready. Acked and tested by different host driver authors.
* Patch to improve DMA handling in the mmci host driver accepted for 3.9.
* Cooperating with an internal STE colleague, Johan Rudholm, on reworking
parts of the HS200 and SDR104 support in the mmc protocol
layer.
* Received another eMMC -> SD card adapter with corresponding eMMC 4.5
samples, this time from Toshiba via Pär Andersson. Really great to
have another vendor to test with, thanks Toshiba!
Clk:
* Still a high focus on internal work for STE ux500. Started to
prepare a patchset for upstreaming this work; some dependencies on Lee
Jones' upstream work for the mfd-driver-related parts complicate it
a bit. The patches will add support for abx500 clocks, update
different drivers' clk support and include ux500 clk optimizations.
* Follow up on the patchset for fixing the clk_set_parent API.
* Follow up on the patchset for disabling unused prepared clks.
=== Plans ===
Storage:
* Follow up on Idle time BKOPS patches on mmc list. Will soon send a
skeleton patch which the work can be based upon, related to runtime
PM.
* Do an overall analysis of the eMMC 4.5/4.6 features: check what
can be considered finished, what needs further fixing, and point out
the new features that the Linaro storage team should focus on. As
also stated above, the rework of HS200/SDR104 support has
started.
* Push patches for mmci host driver to support UHS cards.
* Push patches for mmci host driver to further extend the power
management support.
* Push patches for mmci host driver to add new features like CMD23
support and more.
* Push patches for mmci host driver to add support for new STE 8540 variant.
Clk:
* Upstreaming of internal work for ux500.
=== Issues ===
* Still need to increase focus on storage; all work related to
clks has been given higher prio for a while now.
Kind regards
Ulf Hansson
=== Highlights ===
* Got my current timekeeping queue merged into -tip for 3.9
* Got my plane tickets for ABS
* Got my ABS slides finished (including charts that were annoyingly hard
to create)
* Sent out android upstreaming subteam mail
* Synced with Deepak
* Agreed to help run the Android miniconf at LPC
* Reviewed and queued patch for NTP/RTC update issue
* Started looking at Android Sync driver, pinged Erik on his plans, and
pinged Maarten on dmabuf-fences
* Reworked the Android Sync driver so it could be merged into staging
(pending feedback from Erik)
=== Plans ===
* Submit ABS slides
* Rehearsing for ABS talk & any last polishing of the slides
* Hopefully continue discussions around dmabuf-fence/android-sync and
possibly submit sync to staging.
=== Issues ===
* NA
== Linus Walleij linusw ==
=== Highlights ===
* Working on AB8500 GPIO as it is a
roadblock for multiplatform, being a SPARSE_IRQ
regression.
https://blueprints.launchpad.net/linux-linaro/+spec/ab8500-gpio-shapeup
Working on Lee Jones' cleanup and IRQ fixup series.
Finally acquired hardware that can actually fire these IRQs.
* Requested Torvalds to pull in a bunch of pinctrl fixes and he pulled
them in. One outstanding patch needs to be sent still :-(
* GPIO maintenance:
- Got the PCA GPIO cleanups back from the maintainer, modified and
working; merged them.
- Merged ACPI extensions for gpiolib from Mathias Nyman; the
build robot found issues and I have asked Mathias to fix them.
- Finalizing tree for the merge window.
* Pinctrl maintenance:
- Merged a few allwinner pinctrl patches. More yet queued.
- Finalizing tree for the merge window.
* Arnd found a bug in the Nomadik (mach-nomadik) device tree
patch set: need to select USE_OF over just OF. Made a patch
and sent it.
* Got an ACK for the missing <mach/id.h> removal dependency
from the MFD maintainer. Sent a pull request for it, and it has
landed in linux-next. However, I seem to have screwed up the
patch set somehow and now I must fix it :-(
* Fixed a regression in the Versatile QEMU PCI code.
(I don't know if anyone is actually using the QEMU Versatile
PCI on real hardware, or if that even really works. There are
rumors that it does not.)
The patch is in Russell's patch tracker:
http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=7635/1
=== Plans ===
* First fix the AB8500 GPIO mess.
* Large pinctrl single patch set in the INBOX.
* Large GPIO descriptor rework patch set in the INBOX.
* Attack the remaining headers in arch/arm/mach-ux500
so we can move forward with multiplatform for v3.9.
* Convert Nomadik pinctrl driver to register GPIO ranges
from the gpiochip side.
* Test the PL08x patches on the Ericsson Research
PB11MPCore and submit platform data for using
pl08x DMA on that platform.
* Look into other Ux500 stuff in need of mainlining...
using an internal tracking sheet for this.
* Get hands dirty with regmap.
=== Issues ===
* The constant overload, and still a feeling of not making progress,
leads me to make stupid mistakes like the bug in the Nomadik patch
set and the <mach/id.h> removal bugs. Maybe I should drop some
stuff from the merge window to avoid more stupid mistakes.
Thanks,
Linus Walleij