From: Al Viro viro@zeniv.linux.org.uk
cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via ->hlist and via ->tp_root together. u32_destroy() drops the former and, in case when there had been links, leaves the sucker on the list. As the result, there's nothing to protect it from getting freed once links are dropped. That also makes the "is it busy" check incapable of catching the root hnode - it *is* busy (there's a reference from tp), but we don't see it as something separate. "Is it our root?" check partially covers that, but the problem exists for others' roots as well.
AFAICS, the minimal fix preserving the existing behaviour (where it doesn't include oopsen, that is) would be this: * count tp->root and tp_c->hlist as separate references. I.e. have u32_init() set refcount to 2, not 1. * in u32_destroy() we always drop the former; in u32_destroy_hnode() - the latter.
That way we have *all* references contributing to refcount. List removal happens in u32_destroy_hnode() (called only when ->refcnt is 1) an in u32_destroy() in case of tc_u_common going away, along with everything reachable from it. IOW, that way we know that u32_destroy_key() won't free something still on the list (or pointed to by someone's ->root).
Cc: stable@vger.kernel.org Signed-off-by: Al Viro viro@zeniv.linux.org.uk --- net/sched/cls_u32.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index f218ccf1e2d9..3f985f29ef30 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -398,6 +398,7 @@ static int u32_init(struct tcf_proto *tp) rcu_assign_pointer(tp_c->hlist, root_ht); root_ht->tp_c = tp_c;
+ root_ht->refcnt++; rcu_assign_pointer(tp->root, root_ht); tp->data = tp_c; return 0; @@ -610,7 +611,7 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht, struct tc_u_hnode __rcu **hn; struct tc_u_hnode *phn;
- WARN_ON(ht->refcnt); + WARN_ON(--ht->refcnt);
u32_clear_hnode(tp, ht, extack);
@@ -649,7 +650,7 @@ static void u32_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
WARN_ON(root_ht == NULL);
- if (root_ht && --root_ht->refcnt == 0) + if (root_ht && --root_ht->refcnt == 1) u32_destroy_hnode(tp, root_ht, extack);
if (--tp_c->refcnt == 0) { @@ -698,7 +699,6 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last, }
if (ht->refcnt == 1) { - ht->refcnt--; u32_destroy_hnode(tp, ht, extack); } else { NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter");
On 2018-09-05 3:04 p.m., Al Viro wrote:
From: Al Viro viro@zeniv.linux.org.uk
cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references via ->hlist and via ->tp_root together. u32_destroy() drops the former and, in case when there had been links, leaves the sucker on the list. As the result, there's nothing to protect it from getting freed once links are dropped. That also makes the "is it busy" check incapable of catching the root hnode - it *is* busy (there's a reference from tp), but we don't see it as something separate. "Is it our root?" check partially covers that, but the problem exists for others' roots as well.
AFAICS, the minimal fix preserving the existing behaviour (where it doesn't include oopsen, that is) would be this: * count tp->root and tp_c->hlist as separate references. I.e. have u32_init() set refcount to 2, not 1.
- in u32_destroy() we always drop the former; in u32_destroy_hnode() -
the latter.
That way we have *all* references contributing to refcount. List removal happens in u32_destroy_hnode() (called only when ->refcnt is 1) an in u32_destroy() in case of tc_u_common going away, along with everything reachable from it. IOW, that way we know that u32_destroy_key() won't free something still on the list (or pointed to by someone's ->root).
Cc: stable@vger.kernel.org Signed-off-by: Al Viro viro@zeniv.linux.org.uk
For networking patches, subject should be reflective of tree and subsystem. Example for this one: "[PATCH net 1/7]:net: sched: cls_u32: fix hnode refcounting" Also useful to have a cover letter summarizing the patchset in 0/7. Otherwise
Acked-by: Jamal Hadi Salim jhs@mojatatu.com
cheers, jamal
On Thu, Sep 06, 2018 at 06:21:09AM -0400, Jamal Hadi Salim wrote:
For networking patches, subject should be reflective of tree and subsystem. Example for this one: "[PATCH net 1/7]:net: sched: cls_u32: fix hnode refcounting" Also useful to have a cover letter summarizing the patchset in 0/7. Otherwise
Acked-by: Jamal Hadi Salim jhs@mojatatu.com
Argh... Unfortunately, there's this: in u32_delete() we have if (root_ht) { if (root_ht->refcnt > 1) { *last = false; goto ret; } if (root_ht->refcnt == 1) { if (!ht_empty(root_ht)) { *last = false; goto ret; } } } and that would need to be updated. However, that logics is bloody odd to start with. First of all, root_ht has come from struct tc_u_hnode *root_ht = rtnl_dereference(tp->root); and the only place where it's ever modified is rcu_assign_pointer(tp->root, root_ht); in u32_init(), where we'd bloody well checked that root_ht is non-NULL (see if (root_ht == NULL) return -ENOBUFS; upstream of that place) and where that assignment is inevitable on the way to returning 0. No matter what, if tp has passed u32_init() it will have non-NULL ->root, forever. And there is no way for tcf_proto to be seen outside of tcf_proto_create() without ->init() having returned 0 - it gets freed before anyone sees it.
So this 'if (root_ht)' can't be false. What's more, what the hell is the whole thing checking? We are in u32_delete(). It's called (as ->delete()) from tfilter_del_notify(), which is called from tc_del_tfilter(). If we return 0 with *last true, we follow up calling tcf_proto_destroy(). OK, let's look at the logics in there: * if there are links to root hnode => false * if there's no links to root hnode and it has knodes => false (BTW, if we ever get there with root_ht->refcnt < 1, we are obviously screwed) * if there is a tcf_proto sharing tp->data => false (i.e. any filters with different prio - don't bother) * if tp is the only one with reference to tp->data and there are *any* knodes => false.
Any extra links can come only from knodes in a non-empty hnode. And it's not a common case. Shouldn't thIe whole thing be * shared tp->data => false * any non-empty hnode => false instead? Perhaps even with the knode counter in tp->data, avoiding any loops in there, as well as the entire ht_empty()...
Now, in the very beginning of u32_delete() we have this: struct tc_u_hnode *ht = arg; if (ht == NULL) goto out; OK, but the call of ->delete() is err = tp->ops->delete(tp, fh, last, extack); and arg == NULL seen in u32_delete() means fh == NULL in tfilter_del_notify(). Which is called in if (!fh) { ... } else { bool last;
err = tfilter_del_notify(net, skb, n, tp, block, q, parent, fh, false, &last, extack); How can we ever get there with NULL fh?
The whole thing makes very little sense; looks like it used to live in u32_destroy() prior to commit 763dbf6328e41 ("net_sched: move the empty tp check from ->destroy() to ->delete()"), but looking at the rationale in that commit... I don't see how it fixes anything - sure, now we remove tcf_proto from the list before calling ->destroy(). Without any RCU delays in between. How could it possibly solve any issues with ->classify() called in parallel with ->destroy()? cls_u32 (at least these days) does try to survive u32_destroy() in parallel with u32_classify(); if any other classifiers do not, they are still broken and that commit has not done anything for them.
Anyway, adjusting 1/7 for that is trivial, but I would really like to understand what that code is doing... Comments?
On 2018-09-06 10:35 p.m., Al Viro wrote:
On Thu, Sep 06, 2018 at 06:21:09AM -0400, Jamal Hadi Salim wrote:
[..]
Argh... Unfortunately, there's this: in u32_delete() we have if (root_ht) { if (root_ht->refcnt > 1) { *last = false; goto ret; } if (root_ht->refcnt == 1) { if (!ht_empty(root_ht)) { *last = false; goto ret; } } } and that would need to be updated.
It is not detrimental as you have it right now but you are right an adjustment is needed...
Deleting of a root directly should not be allowed. But you can flush a whole tp. Consider this: -- sudo tc qdisc add dev $P ingress sudo tc filter add dev $P parent ffff: protocol ip prio 10 \ u32 match ip protocol 1 0xff
Which creates root ht 800
You shouldnt be allowed to do this: -- tc filter delete dev $P parent ffff: protocol ip prio 10 handle 800: u32 ---
But you can delete the tp entirely as such: --- tc filter delete dev $P parent ffff: protocol ip prio 10 u32 --
The later will go via the destroy() path and flush all filters.
You should also be able to delete individual filters. ex: $tc filter del dev $P parent ffff: prio 10 handle 800:0:800 u32
Where that code you are referring to is important is when the last filter deleted - we need the caller to know and it destroys root.
i.e you should return last=true when the last filter is deleted so root gets auto deleted (just like it was autocreated)
However, that logics is bloody odd to start with. First of all, root_ht has come from struct tc_u_hnode *root_ht = rtnl_dereference(tp->root); and the only place where it's ever modified is rcu_assign_pointer(tp->root, root_ht); in u32_init(), where we'd bloody well checked that root_ht is non-NULL (see if (root_ht == NULL) return -ENOBUFS; upstream of that place) and where that assignment is inevitable on the way to returning 0. No matter what, if tp has passed u32_init() it will have non-NULL ->root, forever. And there is no way for tcf_proto to be seen outside of tcf_proto_create() without ->init() having returned 0 - it gets freed before anyone sees it.
Yes, the check for root_ht is not necessary - but the check for the last filter (and testing for last) is needed.
So this 'if (root_ht)' can't be false. What's more, what the hell is the whole thing checking? We are in u32_delete(). It's called (as ->delete()) from tfilter_del_notify(), which is called from tc_del_tfilter(). If we return 0 with *last true, we follow up calling tcf_proto_destroy(). OK, let's look at the logics in there:
- if there are links to root hnode => false
- if there's no links to root hnode and it has knodes => false
(BTW, if we ever get there with root_ht->refcnt < 1, we are obviously screwed)
- if there is a tcf_proto sharing tp->data => false (i.e. any filters
with different prio - don't bother)
- if tp is the only one with reference to tp->data and there are *any*
knodes => false.
Any extra links can come only from knodes in a non-empty hnode. And it's not a common case. Shouldn't thIe whole thing be
- shared tp->data => false
- any non-empty hnode => false
instead? Perhaps even with the knode counter in tp->data, avoiding any loops in there, as well as the entire ht_empty()...
Now, in the very beginning of u32_delete() we have this: struct tc_u_hnode *ht = arg; if (ht == NULL) goto out; OK, but the call of ->delete() is err = tp->ops->delete(tp, fh, last, extack); and arg == NULL seen in u32_delete() means fh == NULL in tfilter_del_notify(). Which is called in if (!fh) { ... } else { bool last;
err = tfilter_del_notify(net, skb, n, tp, block, q, parent, fh, false, &last, extack);
How can we ever get there with NULL fh?
Try: tc filter delete dev $P parent ffff: protocol ip prio 10 u32 tcm handle is 0, so will hit that code path.
The whole thing makes very little sense; looks like it used to live in u32_destroy() prior to commit 763dbf6328e41 ("net_sched: move the empty tp check from ->destroy() to ->delete()"), but looking at the rationale in that commit... I don't see how it fixes anything - sure, now we remove tcf_proto from the list before calling ->destroy(). Without any RCU delays in between. How could it possibly solve any issues with ->classify() called in parallel with ->destroy()? cls_u32 (at least these days) does try to survive u32_destroy() in parallel with u32_classify(); if any other classifiers do not, they are still broken and that commit has not done anything for them.
Anyway, adjusting 1/7 for that is trivial, but I would really like to understand what that code is doing... Comments?
Refer to above.
cheers, jamal
To clarify with an example i used to test your patches:
#0 add ingress filter $TC qdisc add dev $P ingress #1 add filter $TC filter add dev $P parent ffff: protocol ip prio 10 \ u32 match ip protocol 1 0xff #2 display $TC filter ls dev $P parent ffff:
#3 try to delete root $TC filter delete dev $P parent ffff: protocol ip prio 10 \ handle 800: u32
#4 nothing changes.. $TC filter ls dev $P parent ffff: #5 delete filter $TC filter delete dev $P parent ffff: protocol ip prio 10 \ handle 800:0:800 u32 #6 filter gone but hash table still there.. $TC filter ls dev $P parent ffff: #7 delete tp $TC filter delete dev $P parent ffff: protocol ip prio 10 \ u32 #8 now it is gone.. $TC filter ls dev $P parent ffff:
your patches show #6 filter as still active. We want it to look like #8
Hope this helps.
cheers, jamal
On Fri, Sep 07, 2018 at 08:13:56AM -0400, Jamal Hadi Salim wrote:
} else { bool last;
err = tfilter_del_notify(net, skb, n, tp, block, q, parent, fh, false, &last, extack);
How can we ever get there with NULL fh?
Try: tc filter delete dev $P parent ffff: protocol ip prio 10 u32 tcm handle is 0, so will hit that code path.
Huh? It will hit tcf_proto_destroy() (and thus u32_destroy()), but where will it hit u32_delete()? Sure, we have fh == NULL there; what happens next is if (t->tcm_handle == 0) { tcf_chain_tp_remove(chain, &chain_info, tp); tfilter_notify(net, skb, n, tp, block, q, parent, fh, RTM_DELTFILTER, false); tcf_proto_destroy(tp, extack); and that's it. IDGI... Direct experiment shows that on e.g. tc qdisc add dev eth0 ingress tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 match ip protocol 1 0xff tc filter delete dev eth0 parent ffff: protocol ip prio 10 u32 we get u32_destroy() called, with u32_destroy_hnode() called by it, but no u32_delete() is called at all, let alone with ht == NULL...
linux-stable-mirror@lists.linaro.org