[Linux-kselftest-mirror] selftests/x86/fsgsbase_64 test problem

List overview All Threads
Download

newer

older

[PATCH] selftest: fix...

[RFC PATCH 0/2] Ion unit test with...

Dan Rue

26 Jan 2018 26 Jan '18

3:36 p.m.

We've noticed that fsgsbase_64 can fail intermittently with the following error:

[RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1 Before schedule, set selector to 0x1 other thread: ARCH_SET_GS(0x1) -- sel is 0x0 [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0

This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.

for i in $(seq 1 10000); do ./fsgsbase_64 || break; done

This problem isn't new - I've reproduced it on latest mainline and every release going back to v4.12 (I did not try earlier). This was tested on a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an i3-5010U.

Thanks, Dan -- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Show replies by date

Andy Lutomirski

26 Jan 26 Jan

4:22 p.m.

On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue dan.rue@linaro.org wrote:

...

We've noticed that fsgsbase_64 can fail intermittently with the following error:
    [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
            Before schedule, set selector to 0x1
            other thread: ARCH_SET_GS(0x1) -- sel is 0x0
    [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
This problem isn't new - I've reproduced it on latest mainline and every release going back to v4.12 (I did not try earlier). This was tested on a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an i3-5010U.

Hmm, I can reproduce it, too. I'll look in a bit. -- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Andy Lutomirski

6:59 p.m.

On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski luto@kernel.org wrote:

...

On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue dan.rue@linaro.org wrote:

...
We've noticed that fsgsbase_64 can fail intermittently with the following error:
    [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
            Before schedule, set selector to 0x1
            other thread: ARCH_SET_GS(0x1) -- sel is 0x0
    [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
This problem isn't new - I've reproduced it on latest mainline and every release going back to v4.12 (I did not try earlier). This was tested on a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an i3-5010U.
Hmm, I can reproduce it, too. I'll look in a bit.

I'm triggering a different error, and I think what's going on is that the kernel doesn't currently re-save GSBASE when a task switches out and that task has save gsbase != 0 and in-register GS == 0. This is arguably a bug, but it's not an infoleak, and fixing it could be a wee bit expensive. I'm not sure what, if anything, to do about this. I suppose I could add some gross perf hackery to the test to detect this case and suppress the error.

I can also trigger the problem you're seeing, and I don't know what's up. It may be related to and old problem I've seen that causes signal delivery to sometimes corrupt %gs. It's deterministic, but it depends in some odd way on register state. I can currently reproduce that issue 100% of the time, and I'm trying to see if I can figure out what's happening. -- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Andy Lutomirski

7:46 p.m.

On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski luto@kernel.org wrote:

...

On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski luto@kernel.org wrote:

...
On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue dan.rue@linaro.org wrote:

...
We've noticed that fsgsbase_64 can fail intermittently with the following error:
    [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
            Before schedule, set selector to 0x1
            other thread: ARCH_SET_GS(0x1) -- sel is 0x0
    [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
This problem isn't new - I've reproduced it on latest mainline and every release going back to v4.12 (I did not try earlier). This was tested on a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an i3-5010U.
Hmm, I can reproduce it, too. I'll look in a bit.
I'm triggering a different error, and I think what's going on is that the kernel doesn't currently re-save GSBASE when a task switches out and that task has save gsbase != 0 and in-register GS == 0. This is arguably a bug, but it's not an infoleak, and fixing it could be a wee bit expensive. I'm not sure what, if anything, to do about this. I suppose I could add some gross perf hackery to the test to detect this case and suppress the error.

I can also trigger the problem you're seeing, and I don't know what's up. It may be related to and old problem I've seen that causes signal delivery to sometimes corrupt %gs. It's deterministic, but it depends in some odd way on register state. I can currently reproduce that issue 100% of the time, and I'm trying to see if I can figure out what's happening.

I think it's a CPU bug, and I'm a bit mystified. I can trigger the following, plausibly related issue:

Write a program that writes %gs = 1. Run that program under gdb break in which %gs == 1 display/x $gs si

Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it changes to 0.

On KVM or native, I do not observe do_debug getting called with %gs == 1. On TCG, I do. I don't think that's precisely the problem that's causing the test to fail, since the test doesn't use TF or ptrace, but I wouldn't be shocked if it's related.

hpa, any insight?

(NB: if you want to play with this as I've described it, you may need to make invalid_selector() in ptrace.c always return false. The current implementation is too strict and causes problems.) -- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Andy Lutomirski

10:38 p.m.

On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski luto@kernel.org wrote:

...

On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski luto@kernel.org wrote:

...
On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski luto@kernel.org wrote:

...
On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue dan.rue@linaro.org wrote:

...
We've noticed that fsgsbase_64 can fail intermittently with the following error:
    [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
            Before schedule, set selector to 0x1
            other thread: ARCH_SET_GS(0x1) -- sel is 0x0
    [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
This problem isn't new - I've reproduced it on latest mainline and every release going back to v4.12 (I did not try earlier). This was tested on a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an i3-5010U.
Hmm, I can reproduce it, too. I'll look in a bit.
I'm triggering a different error, and I think what's going on is that the kernel doesn't currently re-save GSBASE when a task switches out and that task has save gsbase != 0 and in-register GS == 0. This is arguably a bug, but it's not an infoleak, and fixing it could be a wee bit expensive. I'm not sure what, if anything, to do about this. I suppose I could add some gross perf hackery to the test to detect this case and suppress the error.

I can also trigger the problem you're seeing, and I don't know what's up. It may be related to and old problem I've seen that causes signal delivery to sometimes corrupt %gs. It's deterministic, but it depends in some odd way on register state. I can currently reproduce that issue 100% of the time, and I'm trying to see if I can figure out what's happening.
I think it's a CPU bug, and I'm a bit mystified. I can trigger the following, plausibly related issue:

Write a program that writes %gs = 1. Run that program under gdb break in which %gs == 1 display/x $gs si

Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it changes to 0.

On KVM or native, I do not observe do_debug getting called with %gs ==

On TCG, I do. I don't think that's precisely the problem that's

causing the test to fail, since the test doesn't use TF or ptrace, but I wouldn't be shocked if it's related.

hpa, any insight?

(NB: if you want to play with this as I've described it, you may need to make invalid_selector() in ptrace.c always return false. The current implementation is too strict and causes problems.)

Much simpler test. Run the attached program (gs1). It more or less just sets %gs to 1 and spins until it stops being 1. Do it on a kernel with the attached patch applied. I see stuff like this:

# ./gs1 PID = 129 [ 15.703015] pid 129 saved gs = 1 [ 15.703517] pid 129 loaded gs = 1 [ 15.703973] pid 129 prepare_exit_to_usermode: gs = 1 ax = 0, cx = 0, dx = 0

So we're interrupting the program, switching out, switching back in, setting %gs to 1, observing that %gs is *still* 1 in prepare_exit_to_usermode(), returning to usermode, and observing %gs == 0.

Presumably what's happening is that the IRET microcode matches the SDM's pseudocode, which says:

RETURN-TO-OUTER-PRIVILEGE-LEVEL: ... FOR each SegReg in (ES, FS, GS, and DS) DO tempDesc ← descriptor cache for SegReg (* hidden part of segment register *) IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code THEN (* Segment register invalid *) SegReg ← NULL; FI; OD;

But this is very odd. The actual permission checks (in the docs for MOV) are:

IF DS, ES, FS, or GS is loaded with non-NULL selector THEN IF segment selector index is outside descriptor table limits or segment is not a data or readable code segment or ((segment is a data or nonconforming code segment) or ((RPL > DPL) and (CPL > DPL)) THEN #GP(selector); FI;

^^^^ This makes no sense. This says that the data segments cannot be loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL and RPL <= DPL, but I haven't checked that hard.

IF segment not marked present THEN #NP(selector); ELSE SegmentRegister ← segment selector; SegmentRegister ← segment descriptor; FI; FI;

IF DS, ES, FS, or GS is loaded with NULL selector THEN SegmentRegister ← segment selector; SegmentRegister ← segment descriptor; ^^^^ wtf? There is no "segment descriptor". Presumably what actually gets written to segment.DPL is nonsense. FI;

Anyway, I think it's nonsense that user code can load a selector using MOV that is, in turn, rejected by IRET. I don't suppose Intel would consider fixing this going forward.

Borislav, any chance you could run the attached program on an AMD machine to see what it does?

Andy Lutomirski

10:42 p.m.

On Fri, Jan 26, 2018 at 2:38 PM, Andy Lutomirski luto@kernel.org wrote:

...

On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski luto@kernel.org wrote:

...
On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski luto@kernel.org wrote:

...
On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski luto@kernel.org wrote:

...
On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue dan.rue@linaro.org wrote:

...
We've noticed that fsgsbase_64 can fail intermittently with the following error:
    [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
            Before schedule, set selector to 0x1
            other thread: ARCH_SET_GS(0x1) -- sel is 0x0
    [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
This problem isn't new - I've reproduced it on latest mainline and every release going back to v4.12 (I did not try earlier). This was tested on a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an i3-5010U.
Hmm, I can reproduce it, too. I'll look in a bit.
I'm triggering a different error, and I think what's going on is that the kernel doesn't currently re-save GSBASE when a task switches out and that task has save gsbase != 0 and in-register GS == 0. This is arguably a bug, but it's not an infoleak, and fixing it could be a wee bit expensive. I'm not sure what, if anything, to do about this. I suppose I could add some gross perf hackery to the test to detect this case and suppress the error.

I can also trigger the problem you're seeing, and I don't know what's up. It may be related to and old problem I've seen that causes signal delivery to sometimes corrupt %gs. It's deterministic, but it depends in some odd way on register state. I can currently reproduce that issue 100% of the time, and I'm trying to see if I can figure out what's happening.
I think it's a CPU bug, and I'm a bit mystified. I can trigger the following, plausibly related issue:

Write a program that writes %gs = 1. Run that program under gdb break in which %gs == 1 display/x $gs si

Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it changes to 0.

On KVM or native, I do not observe do_debug getting called with %gs ==

On TCG, I do. I don't think that's precisely the problem that's

causing the test to fail, since the test doesn't use TF or ptrace, but I wouldn't be shocked if it's related.

hpa, any insight?

(NB: if you want to play with this as I've described it, you may need to make invalid_selector() in ptrace.c always return false. The current implementation is too strict and causes problems.)
Much simpler test. Run the attached program (gs1). It more or less just sets %gs to 1 and spins until it stops being 1. Do it on a kernel with the attached patch applied. I see stuff like this:

# ./gs1 PID = 129 [ 15.703015] pid 129 saved gs = 1 [ 15.703517] pid 129 loaded gs = 1 [ 15.703973] pid 129 prepare_exit_to_usermode: gs = 1 ax = 0, cx = 0, dx = 0

So we're interrupting the program, switching out, switching back in, setting %gs to 1, observing that %gs is *still* 1 in prepare_exit_to_usermode(), returning to usermode, and observing %gs == 0.

Presumably what's happening is that the IRET microcode matches the SDM's pseudocode, which says:

RETURN-TO-OUTER-PRIVILEGE-LEVEL: ... FOR each SegReg in (ES, FS, GS, and DS) DO tempDesc ← descriptor cache for SegReg (* hidden part of segment register *) IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code THEN (* Segment register invalid *) SegReg ← NULL; FI; OD;

But this is very odd. The actual permission checks (in the docs for MOV) are:

IF DS, ES, FS, or GS is loaded with non-NULL selector THEN IF segment selector index is outside descriptor table limits or segment is not a data or readable code segment or ((segment is a data or nonconforming code segment) or ((RPL > DPL) and (CPL > DPL)) THEN #GP(selector); FI;

^^^^ This makes no sense. This says that the data segments cannot be loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL and RPL <= DPL, but I haven't checked that hard.

Surely Intel meant:

... or ((segment is a data segment or nonconforming code segment) and ((RPL > DPL) or (CPL > DPL))

This would be consistent with the AMD APM #GP condition of "The DS, ES, FS, or GS register was loaded and the segment pointed to was a data or non-conforming code segment, but the RPL or CPL was greater than the DPL."

...

IF segment not marked present THEN #NP(selector); ELSE SegmentRegister ← segment selector; SegmentRegister ← segment descriptor; FI; FI;

IF DS, ES, FS, or GS is loaded with NULL selector THEN SegmentRegister ← segment selector; SegmentRegister ← segment descriptor; ^^^^ wtf? There is no "segment descriptor". Presumably what actually gets written to segment.DPL is nonsense. FI;

I think the bug is here. I think that, when writing a NULL selector to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas they should set DPL to 3.

...

Anyway, I think it's nonsense that user code can load a selector using MOV that is, in turn, rejected by IRET. I don't suppose Intel would consider fixing this going forward.

Borislav, any chance you could run the attached program on an AMD machine to see what it does?

-- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Andy Lutomirski

28 Jan 28 Jan

7:21 p.m.

On Fri, Jan 26, 2018 at 2:42 PM, Andy Lutomirski luto@kernel.org wrote:

...

On Fri, Jan 26, 2018 at 2:38 PM, Andy Lutomirski luto@kernel.org wrote:

...
On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski luto@kernel.org wrote:

...
On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski luto@kernel.org wrote:

...
On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski luto@kernel.org wrote:

...
On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue dan.rue@linaro.org wrote:

...
We've noticed that fsgsbase_64 can fail intermittently with the following error:
    [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
            Before schedule, set selector to 0x1
            other thread: ARCH_SET_GS(0x1) -- sel is 0x0
    [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
This problem isn't new - I've reproduced it on latest mainline and every release going back to v4.12 (I did not try earlier). This was tested on a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an i3-5010U.
Hmm, I can reproduce it, too. I'll look in a bit.
I'm triggering a different error, and I think what's going on is that the kernel doesn't currently re-save GSBASE when a task switches out and that task has save gsbase != 0 and in-register GS == 0. This is arguably a bug, but it's not an infoleak, and fixing it could be a wee bit expensive. I'm not sure what, if anything, to do about this. I suppose I could add some gross perf hackery to the test to detect this case and suppress the error.

I can also trigger the problem you're seeing, and I don't know what's up. It may be related to and old problem I've seen that causes signal delivery to sometimes corrupt %gs. It's deterministic, but it depends in some odd way on register state. I can currently reproduce that issue 100% of the time, and I'm trying to see if I can figure out what's happening.
I think it's a CPU bug, and I'm a bit mystified. I can trigger the following, plausibly related issue:

Write a program that writes %gs = 1. Run that program under gdb break in which %gs == 1 display/x $gs si

Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it changes to 0.

On KVM or native, I do not observe do_debug getting called with %gs ==

On TCG, I do. I don't think that's precisely the problem that's

causing the test to fail, since the test doesn't use TF or ptrace, but I wouldn't be shocked if it's related.

hpa, any insight?

(NB: if you want to play with this as I've described it, you may need to make invalid_selector() in ptrace.c always return false. The current implementation is too strict and causes problems.)
Much simpler test. Run the attached program (gs1). It more or less just sets %gs to 1 and spins until it stops being 1. Do it on a kernel with the attached patch applied. I see stuff like this:

# ./gs1 PID = 129 [ 15.703015] pid 129 saved gs = 1 [ 15.703517] pid 129 loaded gs = 1 [ 15.703973] pid 129 prepare_exit_to_usermode: gs = 1 ax = 0, cx = 0, dx = 0

So we're interrupting the program, switching out, switching back in, setting %gs to 1, observing that %gs is *still* 1 in prepare_exit_to_usermode(), returning to usermode, and observing %gs == 0.

Presumably what's happening is that the IRET microcode matches the SDM's pseudocode, which says:

RETURN-TO-OUTER-PRIVILEGE-LEVEL: ... FOR each SegReg in (ES, FS, GS, and DS) DO tempDesc ← descriptor cache for SegReg (* hidden part of segment register *) IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code THEN (* Segment register invalid *) SegReg ← NULL; FI; OD;

But this is very odd. The actual permission checks (in the docs for MOV) are:

IF DS, ES, FS, or GS is loaded with non-NULL selector THEN IF segment selector index is outside descriptor table limits or segment is not a data or readable code segment or ((segment is a data or nonconforming code segment) or ((RPL > DPL) and (CPL > DPL)) THEN #GP(selector); FI;

^^^^ This makes no sense. This says that the data segments cannot be loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL and RPL <= DPL, but I haven't checked that hard.
Surely Intel meant:

... or ((segment is a data segment or nonconforming code segment) and ((RPL > DPL) or (CPL > DPL))

This would be consistent with the AMD APM #GP condition of "The DS, ES, FS, or GS register was loaded and the segment pointed to was a data or non-conforming code segment, but the RPL or CPL was greater than the DPL."

...
IF segment not marked present THEN #NP(selector); ELSE SegmentRegister ← segment selector; SegmentRegister ← segment descriptor; FI; FI;

IF DS, ES, FS, or GS is loaded with NULL selector THEN SegmentRegister ← segment selector; SegmentRegister ← segment descriptor; ^^^^ wtf? There is no "segment descriptor". Presumably what actually gets written to segment.DPL is nonsense. FI;

I think the bug is here. I think that, when writing a NULL selector to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas they should set DPL to 3.

As an experiment, I did this:

DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { + [0] = { .dpl = 3, }, +

This had no apparent effect. I was hoping that maybe loading NULL into a selector would copy DPL from from gdt[0], but it seems like it doesn't. -- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

H. Peter Anvin

29 Jan 29 Jan

9:13 a.m.

On 01/28/18 11:21, Andy Lutomirski wrote:

...

...
I think the bug is here. I think that, when writing a NULL selector to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas they should set DPL to 3.

As an experiment, I did this:

DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
  [0] = { .dpl = 3, },
This had no apparent effect. I was hoping that maybe loading NULL into a selector would copy DPL from from gdt[0], but it seems like it doesn't.

GDT[0] doesn't actually exist. It is pretty much scratch space (I have suggested using it for the gsbase once all those issues get sorted out, because it lets the paranoid code do something like:

rdgsbase %rax push %rax /* Save old gsbase */ push %rax /* Reserve space on stack */ sgdt -2(%rsp) /* We don't care about the limit */ pop %rax /* %rax <- gdtbase */ mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */ wrgsbase %rax -- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Andy Lutomirski

4:37 p.m.

On Mon, Jan 29, 2018 at 1:13 AM, H. Peter Anvin hpa@zytor.com wrote:

...

On 01/28/18 11:21, Andy Lutomirski wrote:

...
...
I think the bug is here. I think that, when writing a NULL selector to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas they should set DPL to 3.

As an experiment, I did this:

DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
  [0] = { .dpl = 3, },
This had no apparent effect. I was hoping that maybe loading NULL into a selector would copy DPL from from gdt[0], but it seems like it doesn't.
GDT[0] doesn't actually exist.

That's what I thought, too, and the SDM does say that, but the SDM says all kinds of not-quite-correct things about segmentation.

...

It is pretty much scratch space (I have suggested using it for the gsbase once all those issues get sorted out, because it lets the paranoid code do something like:
    rdgsbase %rax
    push %rax       /* Save old gsbase */
    push %rax       /* Reserve space on stack */
    sgdt -2(%rsp)   /* We don't care about the limit */
    pop %rax        /* %rax <- gdtbase */
    mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */
    wrgsbase %rax

That will utterly suck on non-UMIP machines that have hypervisor-provided UMIP emulation. -- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

H. Peter Anvin

6:12 p.m.

On 01/29/18 08:37, Andy Lutomirski wrote:

...

That's what I thought, too, and the SDM does say that, but the SDM says all kinds of not-quite-correct things about segmentation.

...
It is pretty much scratch space (I have suggested using it for the gsbase once all those issues get sorted out, because it lets the paranoid code do something like:
    rdgsbase %rax
    push %rax       /* Save old gsbase */
    push %rax       /* Reserve space on stack */
    sgdt -2(%rsp)   /* We don't care about the limit */
    pop %rax        /* %rax <- gdtbase */
    mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */
    wrgsbase %rax
That will utterly suck on non-UMIP machines that have hypervisor-provided UMIP emulation.

Is that a valid thing to optimize for, especially given that paranoid entries aren't the most common anyway?

-hpa -- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Andy Lutomirski

6:26 p.m.

...

On Jan 29, 2018, at 10:12 AM, H. Peter Anvin hpa@zytor.com wrote:

...
On 01/29/18 08:37, Andy Lutomirski wrote:

That's what I thought, too, and the SDM does say that, but the SDM says all kinds of not-quite-correct things about segmentation.

...
It is pretty much scratch space (I have suggested using it for the gsbase once all those issues get sorted out, because it lets the paranoid code do something like:
   rdgsbase %rax
   push %rax       /* Save old gsbase */
   push %rax       /* Reserve space on stack */
   sgdt -2(%rsp)   /* We don't care about the limit */
   pop %rax        /* %rax <- gdtbase */
   mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */
   wrgsbase %rax
That will utterly suck on non-UMIP machines that have hypervisor-provided UMIP emulation.
Is that a valid thing to optimize for, especially given that paranoid entries aren't the most common anyway?

A bunch of people seem to care about NMI performance for perf. And the current patch set works without this trick.

FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free.-- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

H. Peter Anvin

6:30 p.m.

On 01/29/18 10:26, Andy Lutomirski wrote:

...

...
...
That will utterly suck on non-UMIP machines that have hypervisor-provided UMIP emulation.

Is that a valid thing to optimize for, especially given that paranoid entries aren't the most common anyway?

A bunch of people seem to care about NMI performance for perf.

That wasn't really the question...

...

And the current patch set works without this trick.

But I believe the tricks it uses are fragile.

...

FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free.

That might be a better option.

-hpa

-- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Dan Rue

27 Feb 27 Feb

10:59 p.m.

New subject: selftests/x86/fsgsbase_64 test problem

On Mon, Jan 29, 2018 at 10:30:05AM -0800, H. Peter Anvin wrote:

...

On 01/29/18 10:26, Andy Lutomirski wrote:

...
...
...
That will utterly suck on non-UMIP machines that have hypervisor-provided UMIP emulation.

Is that a valid thing to optimize for, especially given that paranoid entries aren't the most common anyway?

A bunch of people seem to care about NMI performance for perf.

That wasn't really the question...

...
And the current patch set works without this trick.

But I believe the tricks it uses are fragile.

...
FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free.

That might be a better option.

Has there been any conclusion to this thread? I can still reproduce the issue on mainline and next.

Borislav Petkov

26 Jan 26 Jan

10:56 p.m.

On Fri, Jan 26, 2018 at 02:38:28PM -0800, Andy Lutomirski wrote:

...

Borislav, any chance you could run the attached program on an AMD machine to see what it does?

[boris@pd: /tmp> ./gs1 PID = 3420 ax = 0, cx = 0, dx = 0

This is on 4.15-rc7-ish + tip/master from 2 wks ago.

-- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. -- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

Andy Lutomirski

28 Jan 28 Jan

7:21 p.m.

On Fri, Jan 26, 2018 at 2:56 PM, Borislav Petkov bp@alien8.de wrote:

...

On Fri, Jan 26, 2018 at 02:38:28PM -0800, Andy Lutomirski wrote:

...
Borislav, any chance you could run the attached program on an AMD machine to see what it does?

[boris@pd: /tmp> ./gs1 PID = 3420 ax = 0, cx = 0, dx = 0

This is on 4.15-rc7-ish + tip/master from 2 wks ago.

Phooey :( -- To unsubscribe from this list: send the line "unsubscribe linux-kselftest" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

H. Peter Anvin

26 Jan 26 Jan

10:51 p.m.

On 01/26/18 11:46, Andy Lutomirski wrote:

...

Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it changes to 0.

On KVM or native, I do not observe do_debug getting called with %gs ==

On TCG, I do. I don't think that's precisely the problem that's

causing the test to fail, since the test doesn't use TF or ptrace, but I wouldn't be shocked if it's related.

hpa, any insight?

(NB: if you want to play with this as I've described it, you may need to make invalid_selector() in ptrace.c always return false. The current implementation is too strict and causes problems.)

Looking at it. I want to grok this in the general context of fsgsbase as well, of course.

2852

days inactive

2884

days old

linux-kselftest-mirror@lists.linaro.org

15 comments

participants

tags (0)

participants (5)

Andy Lutomirski
Andy Lutomirski
Borislav Petkov
Dan Rue
H. Peter Anvin