On 1/25/20 10:54 PM, Andres Freund wrote:
Hi,
On 2020-01-24 11:38:02 +0100, Stefan Metzmacher wrote:
Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
From: Jens Axboe axboe@kernel.dk
commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
If the credentials or the mm doesn't match, don't allow the task to submit anything on behalf of this ring. The task that owns the ring can pass the file descriptor to another task, but we don't want to allow that task to submit an SQE that then assumes the ring mm and creds if it needs to go async.
Cc: stable@vger.kernel.org Suggested-by: Stefan Metzmacher metze@samba.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
fs/io_uring.c | 6 ++++++ 1 file changed, 6 insertions(+)
--- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned wake_up(&ctx->sqo_wait); submitted = to_submit; } else if (to_submit) {
if (current->mm != ctx->sqo_mm ||
current_cred() != ctx->creds) {
ret = -EPERM;
goto out;
}
I thought about this a bit more.
I'm not sure if this is actually to restrictive, because it means applications like Samba won't be able to use io-uring at all.
Yea, I think it is too restrictive. In fact, it broke my WIP branch to make postgres use io_uring.
Postgres uses a forked process model, with all sub-processes forked off one parent process ("postmaster"), sharing MAP_ANONYMOUS|MAP_SHARED memory (buffer pool, locks, and lots of other IPC). My WIP branch so far has postmaster create a number of io_urings that then the different processes can use (with locking if necessary).
In plenty of the cases it's fairly important for performance to not require an additional context switch initiate IO, therefore we cannot delegate submitting to an io_uring to separate process. But it's not feasible to have one (or even two) urings for each process either: For one, that's just about guaranteed to bring us over the default RLIMIT_MEMLOCK limit, and requiring root only config changes is not an option for many (nor user friendly).
Not sharing queues makes it basically impossible to rely on io_uring ordering properties when operation interlock is needed. E.g. to guarantee that the journal is flushed before some data buffer can be written back, being able to make use of links and drains is great - but there's one journal for all processes. To be able to guarantee anything, all the interlocked writes need to go through one io_uring. I've not yet implemented this, so I don't have numbers, but I expect pretty significant savings.
Not being able to share urings also makes it harder to resolve deadlocks:
As we call into both library and user defined code, we cannot guarantee that a specific backend process will promptly (or at all, when waiting for some locks) process cqes. There's also sections where we don't want to constantly check for ready events, for performance reasons. But operations initiated by a process might be blocking other connections:
E.g. process #1 might have initiated transferring a number of blocks into postgres' buffer pool via io_uring , and now is busy processing the first block that completed. But now process #2 might need one of the buffers that had IO queued, but didn't complete in time for #1 to see the results. The way I have it set up right now, #2 simply can process pending cqes in the relevant queue. Which, in this example, would mark the pg buffer pool entry as valid, allowing #2 to continue.
Now, completions can still be read by all processes, so I could continue to do the above: But that'd require all potentially needed io_urings to be set up in postmaster, before the first fork, and all processes to keep all those FDs open (commonly several hundred). Not an attractive option either, imo.
Obviously we could solve this by having a sqe result processing thread running within each process - but that'd be a very significant new overhead. And it'd require making some non-threadsafe code threadsafe, which I do not relish tackling as a side-effect of io_uring adoption.
It also turns out to be nice from a performance / context-switch rate angle to be able to process cqes for submissions by other processes. Saves an expensive context switch, and often enough it really doesn't matter which process processes the completion (especially for readahead). And in other cases it's cheap to just schedule the subsequent work from the cqe processor, e.g. initiating readahead of a few more blocks into the pg buffer pool. Similarly, there are a few cases where it's useful for several processes to submit IO into a uring primarily drained by one specific process, to offload the subsequent action, if that's expensive
Now, I think there's a valid argument to be made that postgres should just use threads, and not be hampered by any of this. But a) that's not going to happen all that soon, it's a large change, b) it's far from free from problems either, especially scalability on larger machines, and robustness.
As even if current_cred() and ctx->creds describe the same set of uid,gids the != won't ever match again and makes the whole ring unuseable.
Indeed. It also seems weird that a sqpoll now basically has different semantics, allowing the io_uring to be used by multiple processes - a task with a different mm can still wake the sqpoll thread up, even.
Since the different processes attached still can still write to the io_uring mmaped memory, they can still queue sqes, they just can't initiate the processing. But the next time the creator of the uring submits, they will still be - and thus it seems that the kernel needs to handle this safely. So I really don't get what this actually achieves? Am I missing something here?
Thanks for your detailed reported, I'm going to revert this change for 5.5.