Hi Noah,
On Thu, May 25, 2023 at 8:04 PM tip-bot2 for Noah Goldstein tip-bot2@linutronix.de wrote:
The following commit has been merged into the x86/misc branch of tip:
Commit-ID: 688eb8191b475db5acfd48634600b04fd3dda9ad Gitweb: https://git.kernel.org/tip/688eb8191b475db5acfd48634600b04fd3dda9ad Author: Noah Goldstein goldstein.w.n@gmail.com AuthorDate: Wed, 10 May 2023 20:10:02 -05:00 Committer: Dave Hansen dave.hansen@linux.intel.com CommitterDate: Thu, 25 May 2023 10:55:18 -07:00
x86/csum: Improve performance of `csum_partial`
Add special case for len == 40 as that is the hottest value. The nets a ~8-9% latency improvement and a ~30% throughput improvement in the len == 40 case.
Use multiple accumulators in the 64-byte loop. This dramatically improves ILP and results in up to a 40% latency/throughput improvement (better for more iterations).
Results from benchmarking on Icelake. Times measured with rdtsc() len lat_new lat_old r tput_new tput_old r 8 3.58 3.47 1.032 3.58 3.51 1.021 16 4.14 4.02 1.028 3.96 3.78 1.046 24 4.99 5.03 0.992 4.23 4.03 1.050 32 5.09 5.08 1.001 4.68 4.47 1.048 40 5.57 6.08 0.916 3.05 4.43 0.690 48 6.65 6.63 1.003 4.97 4.69 1.059 56 7.74 7.72 1.003 5.22 4.95 1.055 64 6.65 7.22 0.921 6.38 6.42 0.994 96 9.43 9.96 0.946 7.46 7.54 0.990 128 9.39 12.15 0.773 8.90 8.79 1.012 200 12.65 18.08 0.699 11.63 11.60 1.002 272 15.82 23.37 0.677 14.43 14.35 1.005 440 24.12 36.43 0.662 21.57 22.69 0.951 952 46.20 74.01 0.624 42.98 53.12 0.809 1024 47.12 78.24 0.602 46.36 58.83 0.788 1552 72.01 117.30 0.614 71.92 96.78 0.743 2048 93.07 153.25 0.607 93.28 137.20 0.680 2600 114.73 194.30 0.590 114.28 179.32 0.637 3608 156.34 268.41 0.582 154.97 254.02 0.610 4096 175.01 304.03 0.576 175.89 292.08 0.602
There is no such thing as a free lunch, however, and the special case for len == 40 does add overhead to the len != 40 cases. This seems to amount to be ~5% throughput and slightly less in terms of latency.
Testing: Part of this change is a new kunit test. The tests check all alignment X length pairs in [0, 64) X [0, 512). There are three cases. 1) Precomputed random inputs/seed. The expected results where generated use the generic implementation (which is assumed to be non-buggy). 2) An input of all 1s. The goal of this test is to catch any case a carry is missing. 3) An input that never carries. The goal of this test si to catch any case of incorrectly carrying.
More exhaustive tests that test all alignment X length pairs in [0, 8192) X [0, 8192] on random data are also available here: https://github.com/goldsteinn/csum-reproduction
The reposity also has the code for reproducing the above benchmark numbers.
Signed-off-by: Noah Goldstein goldstein.w.n@gmail.com Signed-off-by: Dave Hansen dave.hansen@linux.intel.com
Thanks for your patch, which is now commit 688eb8191b475db5 ("x86/csum: Improve performance of `csum_partial`") in linus/master stable/master
Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.co...
This does not seem to be a message sent to a public mailing list archived at lore (yet).
On m68k (ARAnyM):
KTAP version 1 # Subtest: checksum 1..3 # test_csum_fixed_random_inputs: ASSERTION FAILED at lib/checksum_kunit.c:243 Expected result == expec, but result == 54991 (0xd6cf) expec == 33316 (0x8224) not ok 1 test_csum_fixed_random_inputs # test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267 Expected result == expec, but result == 255 (0xff) expec == 65280 (0xff00)
Endianness issue in the test?
not ok 2 test_csum_all_carry_inputs # test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:306 Expected result == expec, but result == 64515 (0xfc03) expec == 0 (0x0) not ok 3 test_csum_no_carry_inputs # checksum: pass:0 fail:3 skip:0 total:3 # Totals: pass:0 fail:3 skip:0 total:3 not ok 1 checksum
Gr{oetje,eeting}s,
Geert
CC linux-m68k
On Sun, Jul 2, 2023 at 5:19 PM Geert Uytterhoeven geert@linux-m68k.org wrote:
On Thu, May 25, 2023 at 8:04 PM tip-bot2 for Noah Goldstein tip-bot2@linutronix.de wrote:
The following commit has been merged into the x86/misc branch of tip:
Commit-ID: 688eb8191b475db5acfd48634600b04fd3dda9ad Gitweb: https://git.kernel.org/tip/688eb8191b475db5acfd48634600b04fd3dda9ad Author: Noah Goldstein goldstein.w.n@gmail.com AuthorDate: Wed, 10 May 2023 20:10:02 -05:00 Committer: Dave Hansen dave.hansen@linux.intel.com CommitterDate: Thu, 25 May 2023 10:55:18 -07:00
x86/csum: Improve performance of `csum_partial`
Add special case for len == 40 as that is the hottest value. The nets a ~8-9% latency improvement and a ~30% throughput improvement in the len == 40 case.
Use multiple accumulators in the 64-byte loop. This dramatically improves ILP and results in up to a 40% latency/throughput improvement (better for more iterations).
Results from benchmarking on Icelake. Times measured with rdtsc() len lat_new lat_old r tput_new tput_old r 8 3.58 3.47 1.032 3.58 3.51 1.021 16 4.14 4.02 1.028 3.96 3.78 1.046 24 4.99 5.03 0.992 4.23 4.03 1.050 32 5.09 5.08 1.001 4.68 4.47 1.048 40 5.57 6.08 0.916 3.05 4.43 0.690 48 6.65 6.63 1.003 4.97 4.69 1.059 56 7.74 7.72 1.003 5.22 4.95 1.055 64 6.65 7.22 0.921 6.38 6.42 0.994 96 9.43 9.96 0.946 7.46 7.54 0.990 128 9.39 12.15 0.773 8.90 8.79 1.012 200 12.65 18.08 0.699 11.63 11.60 1.002 272 15.82 23.37 0.677 14.43 14.35 1.005 440 24.12 36.43 0.662 21.57 22.69 0.951 952 46.20 74.01 0.624 42.98 53.12 0.809 1024 47.12 78.24 0.602 46.36 58.83 0.788 1552 72.01 117.30 0.614 71.92 96.78 0.743 2048 93.07 153.25 0.607 93.28 137.20 0.680 2600 114.73 194.30 0.590 114.28 179.32 0.637 3608 156.34 268.41 0.582 154.97 254.02 0.610 4096 175.01 304.03 0.576 175.89 292.08 0.602
There is no such thing as a free lunch, however, and the special case for len == 40 does add overhead to the len != 40 cases. This seems to amount to be ~5% throughput and slightly less in terms of latency.
Testing: Part of this change is a new kunit test. The tests check all alignment X length pairs in [0, 64) X [0, 512). There are three cases. 1) Precomputed random inputs/seed. The expected results where generated use the generic implementation (which is assumed to be non-buggy). 2) An input of all 1s. The goal of this test is to catch any case a carry is missing. 3) An input that never carries. The goal of this test si to catch any case of incorrectly carrying.
More exhaustive tests that test all alignment X length pairs in [0, 8192) X [0, 8192] on random data are also available here: https://github.com/goldsteinn/csum-reproduction
The reposity also has the code for reproducing the above benchmark numbers.
Signed-off-by: Noah Goldstein goldstein.w.n@gmail.com Signed-off-by: Dave Hansen dave.hansen@linux.intel.com
Thanks for your patch, which is now commit 688eb8191b475db5 ("x86/csum: Improve performance of `csum_partial`") in linus/master stable/master
Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.co...
This does not seem to be a message sent to a public mailing list archived at lore (yet).
On m68k (ARAnyM):
KTAP version 1 # Subtest: checksum 1..3 # test_csum_fixed_random_inputs: ASSERTION FAILED at
lib/checksum_kunit.c:243 Expected result == expec, but result == 54991 (0xd6cf) expec == 33316 (0x8224) not ok 1 test_csum_fixed_random_inputs # test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267 Expected result == expec, but result == 255 (0xff) expec == 65280 (0xff00)
Endianness issue in the test?
not ok 2 test_csum_all_carry_inputs # test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:306 Expected result == expec, but result == 64515 (0xfc03) expec == 0 (0x0) not ok 3 test_csum_no_carry_inputs
# checksum: pass:0 fail:3 skip:0 total:3 # Totals: pass:0 fail:3 skip:0 total:3 not ok 1 checksum
As this test already exposed bugs on x86 and UML:
https://lore.kernel.org/all/20230704083206.693155-2-davidgow@google.com https://lore.kernel.org/linux-um/20230704083022.692368-2-davidgow@google.com
there's probably a real bug in the m68k csumming code.
Gr{oetje,eeting}s,
Geert
On Sun, 2 Jul 2023 at 23:19, Geert Uytterhoeven geert@linux-m68k.org wrote:
Hi Noah,
On Thu, May 25, 2023 at 8:04 PM tip-bot2 for Noah Goldstein tip-bot2@linutronix.de wrote:
The following commit has been merged into the x86/misc branch of tip:
Commit-ID: 688eb8191b475db5acfd48634600b04fd3dda9ad Gitweb: https://git.kernel.org/tip/688eb8191b475db5acfd48634600b04fd3dda9ad Author: Noah Goldstein goldstein.w.n@gmail.com AuthorDate: Wed, 10 May 2023 20:10:02 -05:00 Committer: Dave Hansen dave.hansen@linux.intel.com CommitterDate: Thu, 25 May 2023 10:55:18 -07:00
x86/csum: Improve performance of `csum_partial`
Add special case for len == 40 as that is the hottest value. The nets a ~8-9% latency improvement and a ~30% throughput improvement in the len == 40 case.
Use multiple accumulators in the 64-byte loop. This dramatically improves ILP and results in up to a 40% latency/throughput improvement (better for more iterations).
Results from benchmarking on Icelake. Times measured with rdtsc() len lat_new lat_old r tput_new tput_old r 8 3.58 3.47 1.032 3.58 3.51 1.021 16 4.14 4.02 1.028 3.96 3.78 1.046 24 4.99 5.03 0.992 4.23 4.03 1.050 32 5.09 5.08 1.001 4.68 4.47 1.048 40 5.57 6.08 0.916 3.05 4.43 0.690 48 6.65 6.63 1.003 4.97 4.69 1.059 56 7.74 7.72 1.003 5.22 4.95 1.055 64 6.65 7.22 0.921 6.38 6.42 0.994 96 9.43 9.96 0.946 7.46 7.54 0.990 128 9.39 12.15 0.773 8.90 8.79 1.012 200 12.65 18.08 0.699 11.63 11.60 1.002 272 15.82 23.37 0.677 14.43 14.35 1.005 440 24.12 36.43 0.662 21.57 22.69 0.951 952 46.20 74.01 0.624 42.98 53.12 0.809 1024 47.12 78.24 0.602 46.36 58.83 0.788 1552 72.01 117.30 0.614 71.92 96.78 0.743 2048 93.07 153.25 0.607 93.28 137.20 0.680 2600 114.73 194.30 0.590 114.28 179.32 0.637 3608 156.34 268.41 0.582 154.97 254.02 0.610 4096 175.01 304.03 0.576 175.89 292.08 0.602
There is no such thing as a free lunch, however, and the special case for len == 40 does add overhead to the len != 40 cases. This seems to amount to be ~5% throughput and slightly less in terms of latency.
Testing: Part of this change is a new kunit test. The tests check all alignment X length pairs in [0, 64) X [0, 512). There are three cases. 1) Precomputed random inputs/seed. The expected results where generated use the generic implementation (which is assumed to be non-buggy). 2) An input of all 1s. The goal of this test is to catch any case a carry is missing. 3) An input that never carries. The goal of this test si to catch any case of incorrectly carrying.
More exhaustive tests that test all alignment X length pairs in [0, 8192) X [0, 8192] on random data are also available here: https://github.com/goldsteinn/csum-reproduction
The reposity also has the code for reproducing the above benchmark numbers.
Signed-off-by: Noah Goldstein goldstein.w.n@gmail.com Signed-off-by: Dave Hansen dave.hansen@linux.intel.com
Thanks for your patch, which is now commit 688eb8191b475db5 ("x86/csum: Improve performance of `csum_partial`") in linus/master stable/master
Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.co...
This does not seem to be a message sent to a public mailing list archived at lore (yet).
On m68k (ARAnyM):
KTAP version 1 # Subtest: checksum 1..3 # test_csum_fixed_random_inputs: ASSERTION FAILED at
lib/checksum_kunit.c:243 Expected result == expec, but result == 54991 (0xd6cf) expec == 33316 (0x8224) not ok 1 test_csum_fixed_random_inputs # test_csum_all_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:267 Expected result == expec, but result == 255 (0xff) expec == 65280 (0xff00)
Endianness issue in the test?
not ok 2 test_csum_all_carry_inputs # test_csum_no_carry_inputs: ASSERTION FAILED at lib/checksum_kunit.c:306 Expected result == expec, but result == 64515 (0xfc03) expec == 0 (0x0) not ok 3 test_csum_no_carry_inputs
# checksum: pass:0 fail:3 skip:0 total:3 # Totals: pass:0 fail:3 skip:0 total:3 not ok 1 checksum
This also seems to be broken (albeit slightly differently) on 32-bit UML and non-UML x86 with CONFIG_M486=y.
In those cases, it's an alignment issue, not an endianness one, but I suspect this may be the first test to try calling the checksum functions with unaligned data: certainly the x86 version seems to have originally been written to assume 2-byte alignment (and the fixes for unaligned data missed some corner cases like UML and pre-pentium-pro codepaths).
It definitely looks like there are endianness issues as well, but do be on the lookout for alignment ones, too.
Patches for the x86 and UML issues, FTR: https://lore.kernel.org/linux-um/20230704083022.692368-2-davidgow@google.com... https://lore.kernel.org/linux-kselftest/20230704083206.693155-2-davidgow@goo...
-- David
linux-kselftest-mirror@lists.linaro.org