SVE routines for cortex-strings

List overview All Threads
Download

newer

older

[ANNOUNCE] Linaro Binary Toolchain...

[ACTIVITY] Week 24

Richard Henderson

31 May 2018 31 May '18

4:08 a.m.

I spoke with Ramana about these at HKG18, and I'm finally getting back to these. I have routines for

-rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S -rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S -rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S -rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S -rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S -rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S -rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S -rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S -rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S -rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S -rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S

The tests pass when run under Foundation Platform 11.3. What is the best way to submit these for review and upstreaming? There's nothing in the git README about an upstream mailing list...

FWIW, my code is at

https://github.com/rth7680/cortex-strings/tree/rth/sve

Show replies by date

Siddhesh Poyarekar

31 May 31 May

8:12 a.m.

On 31 May 2018 at 09:38, Richard Henderson rth@twiddle.net wrote:

...

I spoke with Ramana about these at HKG18, and I'm finally getting back to these. I have routines for

-rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S -rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S -rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S -rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S -rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S -rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S -rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S -rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S -rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S -rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S -rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S

The tests pass when run under Foundation Platform 11.3. What is the best way to submit these for review and upstreaming? There's nothing in the git README about an upstream mailing list...

FWIW, my code is at

https://github.com/rth7680/cortex-strings/tree/rth/sve

I believe the canonical procedure to submit patches to cortex-strings is via gerrit: review.linaro.org. If you have the git-review plugin installed it should be as simple as running the 'git review' command.

Siddhesh

Richard Henderson

5:37 p.m.

On 05/31/2018 01:12 AM, Siddhesh Poyarekar wrote:

...

I believe the canonical procedure to submit patches to cortex-strings is via gerrit: review.linaro.org. If you have the git-review plugin installed it should be as simple as running the 'git review' command.

Thanks. I believe I've now successfully done so. Please let me know if my procedure is off; this is my first time using gerrit.

Siddhesh Poyarekar

5:41 p.m.

On 31 May 2018 at 23:07, Richard Henderson richard.henderson@linaro.org wrote:

...

Thanks. I believe I've now successfully done so. Please let me know if my procedure is off; this is my first time using gerrit.

You need to add reviewers to your review requests. I'd suggest adding at least Adhemerval, Christophe and Maxim at the minimum; they may then suggest more reviewers if necessary.

Siddhesh

Richard Sandiford

1 Jun 1 Jun

9:32 a.m.

Richard Henderson rth@twiddle.net writes:

...

I spoke with Ramana about these at HKG18, and I'm finally getting back to these. I have routines for

-rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S -rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S -rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S -rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S -rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S -rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S -rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S -rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S -rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S -rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S -rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S

The tests pass when run under Foundation Platform 11.3. What is the best way to submit these for review and upstreaming? There's nothing in the git README about an upstream mailing list...

FWIW, my code is at

https://github.com/rth7680/cortex-strings/tree/rth/sve

Thanks for doing these. One general comment is that the routines tend to use the FFR result even in the case where no potential fault is detected. Although it's not as obvious as it could be from some of the published documentation, the architecturally- preferred approach is instead to have the "normal" case depend only on the flags set by RDFFRS, not on the FFR itself. E.g. the typical structure would be something like this:

--------------------------------------------------------------------- SETFFR loop: ...[A] first-faulting and non-faulting loads predicated... ... on some predicate Pg... RDFFRS Pn.B, Pg/Z B.NLAST recovery ...[B] code that ignores Pn and operates on all lanes active in Pg... continue: ...[C] any code that can naturally be shared without using Pn... ...[D] branch back to loop when appropriate...

...

recovery: SETFFR ...[E] code that operates only on the lanes active in Pn... B continue ---------------------------------------------------------------------

Also, using INCB, INCH, INCW and INCD is architecturally preferred over INCP in cases where either could be used. So if the above loop has a pointer or byte index Xm, and if Pg is all-true, it would be better to do:

--------------------------------------------------------------------- SETFFR loop: ...[A] first-faulting and non-faulting loads predicated... ... on some predicate Pg... RDFFRS Pn.B, Pg/Z B.NLAST recovery INCB Xm ...[B] code that ignores Pn and operates on all lanes active in Pg... continue: ...[C] any code that can naturally be shared without using Pn... ...[D] branch back to loop when appropriate...

...

recovery: SETFFR INCP Xm, Pn.B ...[E] code that operates only on the lanes active in Pn... B continue ---------------------------------------------------------------------

If [B] is too complex to copy to [E], an alternative approach is to do something like this:

--------------------------------------------------------------------- SETFFR loop: PTRUE Pg.B or WHILELO Pg.B, Xi, Xj // set Pg for this iteration ...[A] first-faulting and non-faulting loads predicated on Pg... RDFFRS Pn.B, Pg/Z B.NLAST recovery INCB Xm continue: ...[B] code that ignores Pn and operates on all lanes active in Pg... ...[D] branch back to loop when appropriate...

...

recovery: SETFFR INCP Xm, Pn.B MOV Pg.B, Pn.B B continue ---------------------------------------------------------------------

The common case then has the extra overhead of resetting Pg in each iteration, but if [B] is complex (or if Pg is being reset anyway), then that shouldn't matter.

The idea is that the B.NLAST should be highly predictable, so it's usually not necessary to wait for the FFR value to become available. And in practice, getting a precise FFR predicate is likely to be a slow operation (to the extent that this is an ISA-level principle rather than a uarch optimisation).

Thanks, Richard

Richard Henderson

3:52 p.m.

On 06/01/2018 02:32 AM, Richard Sandiford wrote:

...

Richard Henderson rth@twiddle.net writes:

...
I spoke with Ramana about these at HKG18, and I'm finally getting back to these. I have routines for

-rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S -rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S -rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S -rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S -rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S -rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S -rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S -rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S -rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S -rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S -rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S

The tests pass when run under Foundation Platform 11.3. What is the best way to submit these for review and upstreaming? There's nothing in the git README about an upstream mailing list...

FWIW, my code is at

https://github.com/rth7680/cortex-strings/tree/rth/sve

Thanks for doing these. One general comment is that the routines tend to use the FFR result even in the case where no potential fault is detected. Although it's not as obvious as it could be from some of the published documentation, the architecturally- preferred approach is instead to have the "normal" case depend only on the flags set by RDFFRS, not on the FFR itself.

Is it possible to elaborate on the reasons for that? In some cases, the usage of ffr are so trivial (e.g. strlen) that I can use the unpredicated RDFFR instead of the predicated.

...

Also, using INCB, INCH, INCW and INCD is architecturally preferred over INCP in cases where either could be used. So if the above loop has a pointer or byte index Xm, and if Pg is all-true, it would be better to do:

This is much easier to understand -- that adding a near constant is preferred to popcount over up to 256-bits. Avoiding INCP for strlen rearranges the loop in just the sort of ways you say are preferred for RDFFR.

...

The idea is that the B.NLAST should be highly predictable, so it's usually not necessary to wait for the FFR value to become available. And in practice, getting a precise FFR predicate is likely to be a slow operation (to the extent that this is an ISA-level principle rather than a uarch optimisation).

Interesting. From the language that I read, which was essentially "first-faulting loads can fail for any reason", I assumed that it happened more often than what you're implying here -- that it should be on the rare side.

Richard Sandiford

4:59 p.m.

Richard Henderson rth@twiddle.net writes:

...

On 06/01/2018 02:32 AM, Richard Sandiford wrote:

...
Richard Henderson rth@twiddle.net writes:

...
I spoke with Ramana about these at HKG18, and I'm finally getting back to these. I have routines for

-rw-rw-r--. 1 rth rth 2538 May 30 19:12 memchr.S -rw-rw-r--. 1 rth rth 2405 May 30 20:49 memcmp.S -rw-rw-r--. 1 rth rth 2385 May 30 19:12 rawmemchr.S -rw-rw-r--. 1 rth rth 2470 May 30 19:12 strchrnul.S -rw-rw-r--. 1 rth rth 2588 May 30 19:12 strchr.S -rw-rw-r--. 1 rth rth 2370 May 30 19:12 strcmp.S -rw-rw-r--. 1 rth rth 2403 May 30 19:12 strcpy.S -rw-rw-r--. 1 rth rth 2263 May 30 19:12 strlen.S -rw-rw-r--. 1 rth rth 2595 May 30 19:12 strncmp.S -rw-rw-r--. 1 rth rth 2344 May 30 19:12 strnlen.S -rw-rw-r--. 1 rth rth 3105 May 30 19:12 strrchr.S

The tests pass when run under Foundation Platform 11.3. What is the best way to submit these for review and upstreaming? There's nothing in the git README about an upstream mailing list...

FWIW, my code is at

https://github.com/rth7680/cortex-strings/tree/rth/sve

Thanks for doing these. One general comment is that the routines tend to use the FFR result even in the case where no potential fault is detected. Although it's not as obvious as it could be from some of the published documentation, the architecturally- preferred approach is instead to have the "normal" case depend only on the flags set by RDFFRS, not on the FFR itself.

Is it possible to elaborate on the reasons for that? In some cases, the usage of ffr are so trivial (e.g. strlen) that I can use the unpredicated RDFFR instead of the predicated.

Yeah, RDFFRS is preferred even then. The reason is that the FFR accumulates information across multiple loads (all LDFFs and LDNFs since the last SETFFR) and so it might not be able to issue until all those loads have completed. Using the other structure should mean that the normal path can be issued in a similar way to loops that use ordinary loads.

...

...
Also, using INCB, INCH, INCW and INCD is architecturally preferred over INCP in cases where either could be used. So if the above loop has a pointer or byte index Xm, and if Pg is all-true, it would be better to do:

This is much easier to understand -- that adding a near constant is preferred to popcount over up to 256-bits. Avoiding INCP for strlen rearranges the loop in just the sort of ways you say are preferred for RDFFR.

Yeah.

...

...
The idea is that the B.NLAST should be highly predictable, so it's usually not necessary to wait for the FFR value to become available. And in practice, getting a precise FFR predicate is likely to be a slow operation (to the extent that this is an ISA-level principle rather than a uarch optimisation).

Interesting. From the language that I read, which was essentially "first-faulting loads can fail for any reason", I assumed that it happened more often than what you're implying here -- that it should be on the rare side.

That's more a trade-off between giving software a guarantee about the specific situations in which a potential fault is and isn't flagged vs. giving implementations the leeway to do something efficient. It's not really saying how frequent the "potential fault" case is.

Thanks, Richard

Richard Henderson

5:24 p.m.

[ How odd. I could find no record that the message to which you are replying was ever actually sent. I assumed I simply closed the window instead of hitting send. And thus I just now sent a second reply. ]

On 06/01/2018 09:59 AM, Richard Sandiford wrote:

...

...
Is it possible to elaborate on the reasons for that? In some cases, the usage of ffr are so trivial (e.g. strlen) that I can use the unpredicated RDFFR instead of the predicated.

Yeah, RDFFRS is preferred even then. The reason is that the FFR accumulates information across multiple loads (all LDFFs and LDNFs since the last SETFFR) and so it might not be able to issue until all those loads have completed. Using the other structure should mean that the normal path can be issued in a similar way to loops that use ordinary loads.

Ok. I'll reorg the routines along these lines.

Peter Maydell

4 Jun 4 Jun

3:16 p.m.

On 1 June 2018 at 18:24, Richard Henderson richard.henderson@linaro.org wrote:

...

[ How odd. I could find no record that the message to which you are replying was ever actually sent. I assumed I simply closed the window instead of hitting send. And thus I just now sent a second reply. ]

Our mailing list server apparently sat on it over the weekend for some reason:

Received: by lists.linaro.org (Postfix, from userid 109) id 43DCB619F3; Mon, 4 Jun 2018 14:58:52 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on ip-10-142-244-252 X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL autolearn=disabled version=3.4.0 Received: from [127.0.0.1] (localhost [127.0.0.1]) by lists.linaro.org (Postfix) with ESMTP id 1F544617CD; Mon, 4 Jun 2018 14:58:40 +0000 (UTC) X-Original-To: linaro-toolchain@lists.linaro.org Delivered-To: linaro-toolchain@lists.linaro.org Received: by lists.linaro.org (Postfix, from userid 109) id 750F361775; Fri, 1 Jun 2018 15:53:02 +0000 (UTC) Received: from mail-pl0-f67.google.com (mail-pl0-f67.google.com [209.85.160.67]) by lists.linaro.org (Postfix) with ESMTPS id 2067E617C0 for linaro-toolchain@lists.linaro.org; Fri, 1 Jun 2018 15:53:01 +0000 (UTC)

thanks -- PMM

Richard Henderson

1 Jun 1 Jun

5:18 p.m.

On 06/01/2018 02:32 AM, Richard Sandiford wrote:

...

Thanks for doing these. One general comment is that the routines tend to use the FFR result even in the case where no potential fault is detected. Although it's not as obvious as it could be from some of the published documentation, the architecturally- preferred approach is instead to have the "normal" case depend only on the flags set by RDFFRS, not on the FFR itself.

Clearly it would be interesting to read the microarch docs once they are available. This is not the result I would have imagined.

...

RDFFRS Pn.B, Pg/Z
B.NLAST recovery

So the takeaway is that, if the branch is predicted untaken, and we don't use Pn in the predicted path, then FFR is speculatively unused.

...

Also, using INCB, INCH, INCW and INCD is architecturally preferred over INCP in cases where either could be used.

This is much more directly understandable. I guess it's the sort of thing where the real cycle count for INCP is going to depend on the actual width of the implementation, whereas INCB is always going to be a simple add.

...

The idea is that the B.NLAST should be highly predictable, so it's usually not necessary to wait for the FFR value to become available. And in practice, getting a precise FFR predicate is likely to be a slow operation (to the extent that this is an ISA-level principle rather than a uarch optimisation).

I'm surprised about FFR being quite so slow. And I guess from the language elsewhere in the manual -- more or less that first-fault reads can fail at any time for any reason -- I expected them to fail more often than you are implying that they should.

I suppose if they only fail on tlb misses, and the OS is using 64k pages, then the failure rate must be below 0.5%.

Thanks. This is the sort of stuff that's missing from the public manual so far.

Richard Henderson

2 Jun 2 Jun

10:46 p.m.

On 06/01/2018 02:32 AM, Richard Sandiford wrote:

...

Thanks for doing these. One general comment is that the routines tend to use the FFR result even in the case where no potential fault is detected.

Thanks.

I have updated the patch set on review.linaro.org along the lines you suggested. I would appreciate any other feedback you might have.

Adhemerval Zanella

4 Jun 4 Jun

6 p.m.

Do you plan to adapt them to glibc?

On 02/06/2018 19:46, Richard Henderson wrote:

...

On 06/01/2018 02:32 AM, Richard Sandiford wrote:

...
Thanks for doing these. One general comment is that the routines tend to use the FFR result even in the case where no potential fault is detected.

Thanks.

I have updated the patch set on review.linaro.org along the lines you suggested. I would appreciate any other feedback you might have.

r~ _______________________________________________ linaro-toolchain mailing list linaro-toolchain@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-toolchain

Richard Henderson

6:12 p.m.

On 06/04/2018 11:00 AM, Adhemerval Zanella wrote:

...

Do you plan to adapt them to glibc?

Ramana suggested starting "upstream" in cortex-strings and copying to glibc from there, as a general work-flow preference. I believe this has been the case for all of the other aarch64 string routines.

For now, I'm not sure it's useful to copy to glibc until we have hardware against which we can run make bench -- given that would be the very first thing any reviewer would ask.

Adhemerval Zanella

6:19 p.m.

On 04/06/2018 15:12, Richard Henderson wrote:

...

On 06/04/2018 11:00 AM, Adhemerval Zanella wrote:

...
Do you plan to adapt them to glibc?

Ramana suggested starting "upstream" in cortex-strings and copying to glibc from there, as a general work-flow preference. I believe this has been the case for all of the other aarch64 string routines.

For now, I'm not sure it's useful to copy to glibc until we have hardware against which we can run make bench -- given that would be the very first thing any reviewer would ask.

r~

Yes, using cortex-strings as the bridge is indeed the preferred way. I am asking which strategy do you see:

- Adding ifunc variants and add them as default if system advertise SVE support.

- Or adding as an extra option (as Szabolcs has proposed sometime ago for LSE), where one would need to build and extra glibc library and loader will select it if the case.

Richard Henderson

6:30 p.m.

On 06/04/2018 11:19 AM, Adhemerval Zanella wrote:

...

Yes, using cortex-strings as the bridge is indeed the preferred way. I am asking which strategy do you see:

Adding ifunc variants and add them as default if system advertise SVE support.

Or adding as an extra option (as Szabolcs has proposed sometime ago for LSE), where one would need to build and extra glibc library and loader will select it if the case.

When bringing them to glibc, I would use ifuncs. SVE is isolatable to a few routines.

LSE becomes scattered throughout libc, which is why an extra binary was proposed. Although for LSE it might work just as well to pull the atomic operations out to helper functions and ifunc those as well. It really all depends on the relative costs.

Ramana Radhakrishnan

8 Jun 8 Jun

9:23 a.m.

On 04/06/2018 19:30, Richard Henderson wrote:

...

On 06/04/2018 11:19 AM, Adhemerval Zanella wrote:

...
Yes, using cortex-strings as the bridge is indeed the preferred way. I am asking which strategy do you see:

Adding ifunc variants and add them as default if system advertise SVE support.

Or adding as an extra option (as Szabolcs has proposed sometime ago for LSE), where one would need to build and extra glibc library and loader will select it if the case.

When bringing them to glibc, I would use ifuncs. SVE is isolatable to a few routines.

LSE becomes scattered throughout libc, which is why an extra binary was proposed. Although for LSE it might work just as well to pull the atomic operations out to helper functions and ifunc those as well. It really all depends on the relative costs.

Amen,.

Ramana

...

r~ _______________________________________________ linaro-toolchain mailing list linaro-toolchain@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-toolchain

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Ramana Radhakrishnan

18 Jun 18 Jun

9:24 a.m.

On 04/06/2018 19:12, Richard Henderson wrote:

...

On 06/04/2018 11:00 AM, Adhemerval Zanella wrote:

...
Do you plan to adapt them to glibc?

Ramana suggested starting "upstream" in cortex-strings and copying to glibc from there, as a general work-flow preference. I believe this has been the case for all of the other aarch64 string routines.

For now, I'm not sure it's useful to copy to glibc until we have hardware against which we can run make bench -- given that would be the very first thing any reviewer would ask.

Sorry about getting into this a bit late.

To be honest, I would prefer that we tried to get this into glibc sooner rather than later.

I don't think the make bench discussion is totally relevant in this case. We'd like to see what the performance of this stuff is and it's always goign to be a battle to get things benchmarked perfectly on hardware and there is unlikely to be one implementation in the Arm world unlike other architectures. It's under an ifunc and gives folks something to reason with, if we are really that concerned about it's performance today lets put it under a tunable that allows us to turn it off easily ?

regards Ramana

...

r~ _______________________________________________ linaro-toolchain mailing list linaro-toolchain@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-toolchain

Siddhesh Poyarekar

9:39 a.m.

On Mon, 18 Jun 2018 at 14:54, Ramana Radhakrishnan ramana.radhakrishnan@arm.com wrote:

...

I don't think the make bench discussion is totally relevant in this case. We'd like to see what the performance of this stuff is and it's always goign to be a battle to get things benchmarked perfectly on hardware and there is unlikely to be one implementation in the Arm world

+1, I don't think there's a risk in pushing these patches without a benchmark result. When implementations appear we can run benchmarks and tweak and choose accordingly. In the worst case if we find that *everyone* botched up their SVE implementations then we can just drop that code in some years :)

...

unlike other architectures. It's under an ifunc and gives folks something to reason with, if we are really that concerned about it's performance today lets put it under a tunable that allows us to turn it off easily ?

Existing tunables can do the job just fine:

- glibc.tune.cpu can be used to emulate a different processor if that's suitable for the core - glibc.tune.hwcap_mask can be used to mask out HWCAP_SVE completely

Siddhesh

2623

days inactive

2641

days old

linaro-toolchain@lists.linaro.org

17 comments

participants

tags (0)

participants (7)

Adhemerval Zanella
Peter Maydell
Ramana Radhakrishnan
Richard Henderson
Richard Henderson
Richard Sandiford
Siddhesh Poyarekar