Hi,
On Sun, Nov 19, 2023 at 06:14:50AM -0700, Jens Axboe wrote:
> On 11/18/23 4:45 PM, Timothy Pearson wrote:
> > During floating point and vector save to thread data fr0/vs0 are clobbered by the FPSCR/VSCR store routine. This leads to userspace register corruption and application data corruption / crash under the following rare condition:
> >
> > - A userspace thread is executing with VSX/FP mode enabled
> > - The userspace thread is making active use of fr0 and/or vs0
> > - An IPI is taken in kernel mode, forcing the userspace thread to reschedule
> > - The userspace thread is interrupted by the IPI before accessing data it previously stored in fr0/vs0
> > - The thread being switched in by the IPI has a pending signal
> >
> > If these exact criteria are met, then the following sequence happens:
> >
> > - The existing thread FP storage is still valid before the IPI, due to a prior call to save_fpu() or store_fp_state(). Note that the current fr0/vs0 registers have been clobbered, so the FP/VSX state in registers is now invalid pending a call to restore_fp()/restore_altivec().
> > - IPI -- FP/VSX register state remains invalid
> > - interrupt_exit_user_prepare_main() calls do_notify_resume(), due to the pending signal
> > - do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which merrily reads and saves the invalid FP/VSX state to thread local storage.
> > - interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid FP/VSX state back to registers.
> > - Execution is released to userspace, and the application crashes or corrupts data.
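For anyone following along, the sequence above can be sketched as a small user-space toy model in C. This is only an illustration under stated assumptions, not the kernel code: struct fp_regs, the toy_* functions, and the 0xdead status value are hypothetical names made up for this sketch. The "preserving" variant shows one possible way to avoid the clobber (reloading fr0 from the save area after the status store) and is not claimed to be what the actual patch does.

```c
#include <stdint.h>
#include <string.h>

#define NR_FPRS 32

/* Hypothetical stand-in for both the live register file and the
 * per-thread save area. Not a kernel structure. */
struct fp_regs {
    uint64_t fpr[NR_FPRS];
    uint64_t fpscr;
};

/* Toy save routine: copies every FPR to the save area, then uses fr0
 * as scratch to move the status register out. The save area ends up
 * correct, but the live fr0 is left clobbered. */
static void toy_save_clobbering(struct fp_regs *live, struct fp_regs *saved)
{
    memcpy(saved->fpr, live->fpr, sizeof(saved->fpr));
    live->fpr[0] = live->fpscr;   /* status register read into fr0 */
    saved->fpscr = live->fpr[0];  /* status stored from fr0 */
}

/* Assumed alternative: same save, followed by reloading the live fr0
 * from the value that was just saved. */
static void toy_save_preserving(struct fp_regs *live, struct fp_regs *saved)
{
    toy_save_clobbering(live, saved);
    live->fpr[0] = saved->fpr[0];
}

/* Toy restore: copies the save area back into the live registers. */
static void toy_restore(struct fp_regs *live, const struct fp_regs *saved)
{
    memcpy(live->fpr, saved->fpr, sizeof(live->fpr));
    live->fpscr = saved->fpscr;
}

/* Replays the sequence: an earlier save leaves the storage valid, the
 * signal path saves the live registers again, then the state is
 * restored. Returns the fr0 value userspace would see afterwards. */
static uint64_t toy_run(void (*save)(struct fp_regs *, struct fp_regs *),
                        uint64_t user_fr0)
{
    struct fp_regs live = {0}, saved = {0};

    live.fpr[0] = user_fr0;
    live.fpscr = 0xdeadULL;      /* arbitrary marker value */

    save(&live, &saved);         /* prior save: storage is valid */
    save(&live, &saved);         /* second save reads the live registers */
    toy_restore(&live, &saved);  /* state written back before return */
    return live.fpr[0];
}
```

In this model, toy_run() with the clobbering save hands back the status marker instead of the value userspace put in fr0, while the preserving variant hands back the original value.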
> What an epic bug hunt! Hats off to you for seeing it through and getting to the bottom of it. Particularly difficult as the commit that made it easier to trigger was in no way related to where the actual bug was.
>
> I ran this on the vm I have access to, and it survived 2x500 iterations. Happy to call that good:
>
> Tested-by: Jens Axboe <axboe@kernel.dk>

Thanks to all involved!
Is this going to land soon in mainline so it can be picked as well for the affected stable trees?
Regards,
Salvatore