Hi All,
Added some people harvested from glibc.git and added libc-alpha.
We currently have 2 big new futex features proposed, and still have the whole NUMA thing on the table.
The proposed features are:
- a vectored FUTEX_WAIT (as per the parent thread); allows userspace to wait on up to 128 futex values.
- multi-size (8,16,32) futexes (WAIT,WAKE,CMP_REQUEUE).
Both these features are specific to the 'simple' futex interfaces, that is, they exclude all the PI / robust stuff.
As is, the vectored WAIT doesn't interact nicely with the multi-size proposal (or, for that matter, with the already existing PRIVATE flag), because it does not allow specifying flags per WAIT instance, but this should be fixable with some small changes to the proposed ABI.
The much bigger sticking point, as already noticed by the multi-size patches, is that the current ABI is a limiting factor. The giant horrible syscall.
Now, we have a whole bunch of futex ops that are already gone (FD) or are fundamentally broken (REQUEUE) or partially weird (WAIT_BITSET has CLOCK selection where WAIT does not) or unused (per glibc, WAKE_OP, WAKE_BITSET, WAIT_BITSET (except for that CLOCK crud)).
So how about we introduce new syscalls:
sys_futex_wait(void *uaddr, unsigned long val, unsigned long flags, ktime_t *timo);
struct futex_wait {
	void *uaddr;
	unsigned long val;
	unsigned long flags;
};
sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters, unsigned long flags, ktime_t *timo);
sys_futex_wake(void *uaddr, unsigned int nr, unsigned long flags);
sys_futex_cmp_requeue(void *uaddr1, void *uaddr2, unsigned int nr_wake, unsigned int nr_requeue, unsigned long cmpval, unsigned long flags);
Where flags:
- has 2 bits for size: 8,16,32,64
- has 2 more bits for size (requeue) ??
- has ... bits for clocks
- has private/shared
- has numa
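To make the flags word a bit more concrete, here is a rough sketch of how the size bits could be encoded and decoded. All FUTEX2_* names and all bit positions are made up for illustration and are not part of the proposal; the requeue-size and clock fields are left unassigned because they are still open above.

```c
#include <assert.h>

/* Hypothetical layout: the low 2 bits hold log2 of the futex size in bytes. */
#define FUTEX2_SIZE_U8    0x0UL
#define FUTEX2_SIZE_U16   0x1UL
#define FUTEX2_SIZE_U32   0x2UL
#define FUTEX2_SIZE_U64   0x3UL
#define FUTEX2_SIZE_MASK  0x3UL

/* Hypothetical positions for the single-bit flags (assumption, not ABI). */
#define FUTEX2_PRIVATE    (1UL << 4)
#define FUTEX2_NUMA       (1UL << 5)

/* Catch accidental overlap between the size field and the flag bits. */
static_assert((FUTEX2_SIZE_MASK & (FUTEX2_PRIVATE | FUTEX2_NUMA)) == 0,
	      "flag fields must not overlap");

/* Decode the futex size in bytes (1, 2, 4 or 8) from a flags word. */
static inline unsigned int futex2_size_bytes(unsigned long flags)
{
	return 1U << (flags & FUTEX2_SIZE_MASK);
}
```

With an encoding like this, a 32-bit private futex would pass FUTEX2_SIZE_U32 | FUTEX2_PRIVATE, and the same word could be carried per entry in struct futex_wait.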
This does not provide BITSET functionality, as I found no use in glibc. Both wait and wake have arguments left, do we need this?
For NUMA I propose that when NUMA_FLAG is set, uaddr-4 will be 'int node_id', with the following semantics:
- on WAIT, node_id is read and when 0 <= node_id <= nr_nodes, is directly used to index into per-node hash-tables. When -1, it is replaced by the current node_id and an smp_mb() is issued before we load and compare the @uaddr.
- on WAKE/REQUEUE, it is an immediate index.
Any invalid value will result in EINVAL.
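The node_id rules above could be sketched roughly as follows. This is a userspace model only, not kernel code: futex2_resolve_node(), the futex2_op enum and the current_node/nr_nodes parameters are hypothetical, the C11 fence merely stands in for smp_mb(), and a real implementation would presumably replace the -1 with a cmpxchg rather than a plain store. The sketch also assumes the valid indices are 0 .. nr_nodes-1.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

enum futex2_op { FUTEX2_OP_WAIT, FUTEX2_OP_WAKE, FUTEX2_OP_REQUEUE };

/*
 * node_id_word models the 'int node_id' stored at uaddr-4.
 * Returns the per-node hash-table index to use, or -EINVAL.
 */
static int futex2_resolve_node(int *node_id_word, enum futex2_op op,
			       int current_node, int nr_nodes)
{
	assert(node_id_word != NULL && nr_nodes > 0);

	int node = *node_id_word;

	if (node == -1 && op == FUTEX2_OP_WAIT) {
		/* WAIT only: replace -1 with the current node ... */
		*node_id_word = current_node;
		/* ... and order that store before the load/compare of @uaddr. */
		atomic_thread_fence(memory_order_seq_cst);
		return current_node;
	}

	/* WAKE/REQUEUE treat the value as an immediate index, so -1 is invalid. */
	if (node < 0 || node >= nr_nodes)
		return -EINVAL;

	return node;
}
```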
Then later, we can look at doing sys_futex_{,un}lock_{,pi}(), which have all the mind-meld associated with robust and PI and possibly optimistic spinning etc.
Opinions?