... But after reviewing the previous discussion I think I should try to find people with 32-bit systems who can tell me whether they see a performance regression.
I'll do that and try to get an answer soon.
Please consider this patch a first-cut draft to get the ball rolling on this again. It would be nice to take this to completion, given all the effort you and Arnd put into the discussion last year. I didn't get the chance to discuss this with Arnd before sending it out, and it's likely that what I implemented is different from, and sub-par compared to, what he had in mind. The discussion last year didn't seem to mention the need for a 32-bit divide, but I ended up needing it here. Also, Arnd has some efficient interfaces in mind, perhaps introducing a lightweight ktime_get_us(). I don't know how that would be done, and hence I tried the other approach he had suggested.