On Tue, Mar 03, 2020 at 02:00:12PM +0100, Florian Weimer wrote:
- Peter Zijlstra:
So how about we introduce new syscalls:
sys_futex_wait(void *uaddr, unsigned long val, unsigned long flags, ktime_t *timo);
struct futex_wait {
	void *uaddr;
	unsigned long val;
	unsigned long flags;
};
sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters, unsigned long flags, ktime_t *timo);
sys_futex_wake(void *uaddr, unsigned int nr, unsigned long flags);
sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake, unsigned int nr_requeue, unsigned long cmpval, unsigned long flags);
Where flags:
- has 2 bits for size: 8,16,32,64
- has 2 more bits for size (requeue) ??
- has ... bits for clocks
- has private/shared
- has numa
What's the actual type of *uaddr? Does it vary by size (which I assume is in bits?)? Are there alignment constraints?
Yeah, u8, u16, u32, u64 depending on the size specified in flags. Naturally aligned.
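To make the size and alignment answer concrete, here is a minimal sketch of how a size could be decoded from flags and the natural-alignment rule checked. The thread only says "2 bits for size"; the bit position, the HYP_ constant names, and both helper functions are assumptions made for illustration and are not part of the proposal.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: the proposal does not say where the size bits live. */
#define HYP_FUTEX_SIZE_MASK 0x3ul /* assumed: low two bits of flags */
#define HYP_FUTEX_SIZE_U8   0x0ul
#define HYP_FUTEX_SIZE_U16  0x1ul
#define HYP_FUTEX_SIZE_U32  0x2ul
#define HYP_FUTEX_SIZE_U64  0x3ul

/* Futex width in bytes selected by the size bits: 1, 2, 4 or 8. */
static size_t hyp_futex_size_bytes(unsigned long flags)
{
	return (size_t)1 << (flags & HYP_FUTEX_SIZE_MASK);
}

/* "Naturally aligned": the address is a multiple of the futex width. */
static bool hyp_futex_aligned(const void *uaddr, unsigned long flags)
{
	size_t size = hyp_futex_size_bytes(flags);

	return ((uintptr_t)uaddr & (size - 1)) == 0;
}
```

With this encoding a u32 futex at an address ending in 0x4 passes the check, while a u16 futex at an odd address does not.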
These system calls seem to be type-polymorphic still, which is problematic for defining a really nice C interface. I would really like to have a strongly typed interface for this, with a nice struct futex wrapper type (even if it means that we need four of them).
You mean like: futex_wait1(u8 *,...) futex_wait2(u16 *,...) futex_wait4(u32 *,...) etc.. ?
I suppose making it 16 or so syscalls (more if we want WAKE_OP or requeue across size) is a bit daft, so yeah, sucks.
Will all architectures support all sizes? If not, how do we probe which size/flags combinations are supported?
Up to the native word size (long), IOW ILP32 will not support u64.
Overlapping futexes are expressly forbidden, that is:
{ u32 var; void *addr = &var; }
P0() { futex_wait4(addr,...); }
P1() { futex_wait1(addr+1,...); }
Will have one of them return something bad.
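The overlap rule in the example above can be stated as a byte-range test. This is only a sketch of the condition: the thread does not say how the kernel would detect the conflict or which error it would return, and the function name and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True when two futexes share at least one byte without being the same
 * futex (same address and same size). Hypothetical helper, illustration only.
 */
static bool hyp_futex_overlaps(uintptr_t a, size_t a_size,
			       uintptr_t b, size_t b_size)
{
	if (a == b && a_size == b_size)
		return false; /* identical futex: waiting on it twice is fine */

	/* Half-open ranges [a, a + a_size) and [b, b + b_size) intersect. */
	return a < b + b_size && b < a + a_size;
}
```

For the P0/P1 case, a 4-byte futex at addr and a 1-byte futex at addr+1 intersect, so one of the two waits would be rejected; two adjacent u32 futexes do not intersect.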
For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int node_id', with the following semantics:
on WAIT, node_id is read and when 0 <= node_id <= nr_nodes, is directly used to index into per-node hash-tables. When -1, it is replaced by the current node_id and an smp_mb() is issued before we load and compare the @uaddr.
on WAKE/REQUEUE, it is an immediate index.
Does this mean the first waiter determines the NUMA index, and all future waiters use the same chain even if they are on different nodes?
Every new waiter could (re)set node_id, after all, when it's not actually waiting, nobody cares what's in that field.
I think documenting this as a node index would be a mistake. It could be an arbitrary hint for locating the corresponding kernel data structures.
Nah, it allows explicit placement, after all, we have set_mempolicy() and sched_setaffinity() and all the other NUMA crud so that programs that think they know what they're doing, can do explicit placement.
Any invalid value will result in EINVAL.
Using uaddr-4 is slightly tricky with a 64-bit futex value, due to the need to maintain alignment and avoid padding.
Yes, but it works, unlike uaddr+4 :-) Also, 1 and 2 byte futexes and NUMA_FLAG are incompatible due to this, but I feel short futexes and NUMA don't really make sense anyway, the only reason to use a short futex is to save space, so you don't want another 4 bytes for numa on top of that anyway.
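A sketch of what the uaddr-4 layout could look like in userspace, for the 32-bit and 64-bit cases. The struct names, field names, and the accessor are hypothetical; the only thing taken from the thread is that node_id is an int sitting immediately before the futex word. In the 64-bit case an explicit leading pad keeps the value 8-byte aligned while leaving no hole between node_id and the value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit futex with NUMA hint: node_id sits directly below the futex word. */
struct hyp_numa_futex32 {
	int32_t  node_id; /* read by the kernel at uaddr - 4 */
	uint32_t val;     /* uaddr points here */
};

/* 64-bit futex with NUMA hint: pad first, so no gap opens before val. */
struct hyp_numa_futex64 {
	uint32_t unused;  /* explicit padding, keeps val 8-byte aligned */
	int32_t  node_id; /* still exactly at uaddr - 4 */
	uint64_t val;     /* uaddr points here */
};

static_assert(offsetof(struct hyp_numa_futex32, val) ==
	      offsetof(struct hyp_numa_futex32, node_id) + 4,
	      "node_id must be at uaddr - 4");
static_assert(offsetof(struct hyp_numa_futex64, val) ==
	      offsetof(struct hyp_numa_futex64, node_id) + 4,
	      "node_id must be at uaddr - 4");

/* Locate the hint the way the proposal describes: 4 bytes below uaddr. */
static int32_t *hyp_futex_node_id(void *uaddr)
{
	return (int32_t *)((char *)uaddr - 4);
}
```

Putting node_id first in the 64-bit struct without the pad would make most 64-bit ABIs insert 4 bytes of padding after it, and the hint would then no longer be at uaddr-4.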