On Tue, Mar 24, 2020 at 10:37:49AM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
On 2020/3/24 6:52, Jason Gunthorpe wrote:
On Mon, Mar 23, 2020 at 01:35:07PM -0700, Mike Kravetz wrote:
On 3/23/20 11:07 AM, Jason Gunthorpe wrote:
On Mon, Mar 23, 2020 at 10:27:48AM -0700, Mike Kravetz wrote:
pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
+	if (!pgd_present(READ_ONCE(*pgd)))
 		return NULL;
 	p4d = p4d_offset(pgd, addr);
-	if (!p4d_present(*p4d))
+	if (!p4d_present(READ_ONCE(*p4d)))
 		return NULL;
 	pud = pud_offset(p4d, addr);
One could argue that pgd and p4d can not change from present to !present during the execution of this code. To me, that seems like the only thing which would cause an issue. Of course, I could be missing something.
This I am not sure of. I think it must be true under the read side of the mmap_sem, but probably not guaranteed under RCU..
In any case, it doesn't matter, the fact that *p4d can change at all is problematic. Unwinding the above inlines we get:
  p4d = p4d_offset(pgd, addr);
  if (!p4d_present(*p4d))
        return NULL;
  pud = (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
According to our memory model the compiler/CPU is free to execute this as:
  p4d = p4d_offset(pgd, addr);
  p4d_for_vaddr = *p4d;
  if (!p4d_present(*p4d))
        return NULL;
  pud = (pud_t *)p4d_page_vaddr(p4d_for_vaddr) + pud_index(address);
Wow! How do you know this? You don't need to answer :)
It says explicitly in Documentation/memory-barriers.txt - see section COMPILER BARRIER:
(*) The compiler is within its rights to reorder loads and stores to the same variable, and in some cases, the CPU is within its rights to reorder loads to the same variable. This means that the following code:
	a[0] = x;
	a[1] = x;

    Might result in an older value of x stored in a[1] than in a[0].
It also says READ_ONCE() forces the loads into program order, but we don't use READ_ONCE() inside pud_offset(), so it doesn't help us.
Best answer is to code things so there is exactly one dereference of the pointer protected by READ_ONCE. Very clear to read, very safe.
Maybe Longpeng can rework the patch around these principles?
Thanks Jason and Mike, I learn a lot from your analysis.
So... the patch should look like this?
Yes, the pattern looks right
The commit message should reference the above section of COMPILER BARRIER and explain that de-referencing the entries is a data race, so we must consolidate all the reads into one single place.
Also, since CH moved all the get_user_pages_fast code out of the arches, many/all archs can drop their arch-specific version of this routine. This is really just a specialized version of gup_fast's algorithm..
(also the arch versions seem to differ - why do some return actual ptes, not NULL?)
Jason