On 2022/9/30 4:52 AM, Luck, Tony wrote:
Thanks for your patient explanations.
You are welcome :)
STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on the current CPU in a workqueue and adds a task work to sync with the workqueue.
Why is there a difference if the interrupted task was a user task vs. a kernel thread?
It seems arbitrary. If the error can be handled in the kernel thread case without a task_work_add() to the current process, can't all errors be handled this way?
I'm afraid not. The kworker in the workqueue runs asynchronously with respect to ret_to_user() of the interrupted task. If we return to user-space before the queued memory_failure() work is processed, we will take the fault again, because the error is signaled by a synchronous external abort. This loop may cause platform firmware to exceed some threshold and reboot.
When a user task consumes poisoned data, a synchronous external abort is signaled, for example by "einj_mem_uc single" in ras-tools. In that case, the handling flow is as below:
----------------------------------STEP 0-------------------------------------------

[ghes_sdei_critical_callback: current einj_mem_uc, local cpu]
ghes_sdei_critical_callback
    => __ghes_sdei_callback
        => ghes_in_nmi_queue_one_entry: peek and read estatus
        => irq_work_queue(&ghes_proc_irq_work) // ghes_proc_in_irq - irq_work
[ghes_sdei_critical_callback: return]

-----------------------------------STEP 1------------------------------------------

[ghes_proc_in_irq: current einj_mem_uc, local cpu]
    => ghes_do_proc
        => ghes_handle_memory_failure
            => ghes_do_memory_failure
                => memory_failure_queue - put work task on a specific cpu
                    => if (kfifo_put(&mf_cpu->fifo, entry))
                           schedule_work_on(smp_processor_id(), &mf_cpu->work);
    => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
[ghes_proc_in_irq: return]

-----------------------------------STEP 3------------------------------------------

// kworker preempts einj_mem_uc on local cpu due to RESCHED flag
[memory_failure_work_func: current kworker, local cpu]
    => memory_failure_work_func(&mf_cpu->work)
        => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work
            => soft/hard offline

------------------------------------STEP 4-----------------------------------------

[ghes_kick_task_work: current einj_mem_uc, other cpu]
    => memory_failure_queue_kick
        => cancel_work_sync // wait memory_failure_work_func finish
        => memory_failure_work_func(&mf_cpu->work)
            => kfifo_get(&mf_cpu->fifo, &entry); // no work here

------------------------------------STEP 5-----------------------------------------

[current einj_mem_uc returned to userspace]
    => Killed by SIGBUS
STEP 4 adds a task work to ensure that the queued memory_failure() work is processed before returning to user-space, so the interrupted user task will be killed by a SIGBUS signal.
If we delete STEP 4, the interrupted user task can return to user space before the queued work has run and consume the poisoned data again.
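To make the ordering argument concrete, here is a minimal user-space sketch of it. It is only a toy model, not kernel code: the names (mf_model, model_queue, model_work_func, model_task_work, model_run), the PFN value and the MAX_FAULTS limit are all made up for illustration, and the "no sync" branch assumes the worst case in which the kworker never gets to run before the task returns to user space.

```c
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SIZE  4
#define MAX_FAULTS 3   /* hypothetical firmware threshold, not a real value */

/* Toy stand-in for the per-CPU memory failure fifo plus the poisoned mapping. */
struct mf_model {
    unsigned long fifo[FIFO_SIZE]; /* queued poisoned PFNs */
    size_t head, tail;
    bool poisoned_page_mapped;     /* true until the queued work has run */
};

/* Models queuing the work from IRQ context (STEP 1). */
static bool model_queue(struct mf_model *m, unsigned long pfn)
{
    if (m->tail - m->head == FIFO_SIZE)
        return false;              /* fifo full */
    m->fifo[m->tail++ % FIFO_SIZE] = pfn;
    return true;
}

/* Models the work function: drain the fifo and offline the page (STEP 3). */
static void model_work_func(struct mf_model *m)
{
    while (m->head != m->tail) {
        m->head++;
        m->poisoned_page_mapped = false;
    }
}

/* Models the task work: make sure the queued work is done (STEP 4). */
static void model_task_work(struct mf_model *m)
{
    model_work_func(m);
}

/*
 * Returns how many times the task takes the abort before the page is
 * handled, or MAX_FAULTS if the model gives up first.
 */
static int model_run(bool sync_before_return)
{
    struct mf_model m = { .poisoned_page_mapped = true };
    int faults = 0;

    while (m.poisoned_page_mapped && faults < MAX_FAULTS) {
        faults++;                  /* synchronous external abort */
        model_queue(&m, 0x1234);   /* hypothetical PFN */
        if (sync_before_return)
            model_task_work(&m);   /* runs before returning to user */
        /* else: worst case, user space resumes before the kworker runs */
    }
    return faults;
}
```

In this model, model_run(true) takes the abort once, while model_run(false) keeps faulting until it reaches the MAX_FAULTS limit, which is the loop described above.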
The current thread likely has nothing to do with the error. Just a matter of chance on what is running when the NMI is delivered, right?
Yes, the error is actually handled in the workqueue. I think the point is that an exception signaled by a synchronous external abort must be handled synchronously; otherwise, it will be signaled again.

Best Regards,
Shuai