Diffstat (limited to 'Documentation/accel')
-rw-r--r-- Documentation/accel/qaic/aic100.rst | 25
-rw-r--r-- Documentation/accel/qaic/qaic.rst | 8
2 files changed, 27 insertions, 6 deletions
diff --git a/Documentation/accel/qaic/aic100.rst b/Documentation/accel/qaic/aic100.rst
index 273da6192fb346..41331cf580b118 100644
--- a/Documentation/accel/qaic/aic100.rst
+++ b/Documentation/accel/qaic/aic100.rst
@@ -487,8 +487,8 @@ one user crashes, the fallout of that should be limited to that workload and not
 impact other workloads. SSR accomplishes this.
 
 If a particular workload crashes, QSM notifies the host via the QAIC_SSR MHI
-channel. This notification identifies the workload by it's assigned DBC. A
-multi-stage recovery process is then used to cleanup both sides, and get the
+channel. This notification identifies the workload by its assigned DBC. A
+multi-stage recovery process is then used to cleanup both sides, and gets the
 DBC/NSPs into a working state.
 
 When SSR occurs, any state in the workload is lost. Any inputs that were in
@@ -496,6 +496,27 @@ process, or queued by not yet serviced, are lost. The loaded artifacts will
 remain in on-card DDR, but the host will need to re-activate the workload if it
 desires to recover the workload.
 
+When SSR occurs for a specific NSP, the assigned DBC goes through the
+following state transactions in order:
+
+DBC_STATE_BEFORE_SHUTDOWN
+ Indicates that the affected NSP was found in an unrecoverable error
+ condition.
+DBC_STATE_AFTER_SHUTDOWN
+ Indicates that the NSP is under reset.
+DBC_STATE_BEFORE_POWER_UP
+ Indicates that the NSP's debug information has been collected, and is
+ ready to be collected by the host (if desired). At that stage the NSP
+ is restarted by QSM.
+DBC_STATE_AFTER_POWER_UP
+ Indicates that the NSP has been restarted, fully operational and is
+ in idle state.
+
+SSR also has an optional crashdump collection feature. If enabled, the host can
+collect the memory dump for the crashed NSP and dump it to the user space via
+the dev_coredump subsystem. The host can also decline the crashdump collection
+request from the device.
+
 Reliability, Accessibility, Serviceability (RAS)
 ================================================
 
diff --git a/Documentation/accel/qaic/qaic.rst b/Documentation/accel/qaic/qaic.rst
index 018d6cc173d7e9..ef27e262cb9141 100644
--- a/Documentation/accel/qaic/qaic.rst
+++ b/Documentation/accel/qaic/qaic.rst
@@ -36,7 +36,7 @@ polling mode and reenables the IRQ line.
 This mitigation in QAIC is very effective. The same lprnet usecase that
 generates 100k IRQs per second (per /proc/interrupts) is reduced to roughly 64
 IRQs over 5 minutes while keeping the host system stable, and having the same
-workload throughput performance (within run to run noise variation).
+workload throughput performance (within run-to-run noise variation).
 
 Single MSI Mode
 ---------------
@@ -49,7 +49,7 @@ useful to be able to fall back to a single MSI when needed.
 To support this fallback, we allow the case where only one MSI is able to be
 allocated, and share that one MSI between MHI and the DBCs. The device detects
 when only one MSI has been configured and directs the interrupts for the DBCs
-to the interrupt normally used for MHI. Unfortunately this means that the
+to the interrupt normally used for MHI. Unfortunately, this means that the
 interrupt handlers for every DBC and MHI wake up for every interrupt that
 arrives; however, the DBC threaded irq handlers only are started when work to be
 done is detected (MHI will always start its threaded handler).
@@ -62,9 +62,9 @@ never disabled, allowing each new entry to the FIFO to trigger a new interrupt.
 Neural Network Control (NNC) Protocol
 =====================================
 
-The implementation of NNC is split between the KMD (QAIC) and UMD. In general
+The implementation of NNC is split between the KMD (QAIC) and UMD. In general,
 QAIC understands how to encode/decode NNC wire protocol, and elements of the
-protocol which require kernel space knowledge to process (for example, mapping
+protocol which requires kernel space knowledge to process (for example, mapping
 host memory to device IOVAs). QAIC understands the structure of a message, and
 all of the transactions. QAIC does not understand commands (the payload of a
 passthrough transaction).
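
The ordered DBC state sequence that this patch documents for SSR could be tracked on the host roughly as below. This is a minimal sketch, not the qaic driver's code: only the four DBC_STATE_* names come from the documentation text. The numeric values, the DBC_STATE_IDLE placeholder, and the helpers dbc_ssr_transition_ok() and dbc_ssr_step() are assumptions made for illustration, as is the choice to let a new SSR cycle start once the NSP is back in idle state.

```c
#include <stdbool.h>

/*
 * State names follow the documentation text. The numeric values and
 * DBC_STATE_IDLE are assumptions made for this sketch only.
 */
enum dbc_ssr_state {
	DBC_STATE_IDLE = 0,		/* hypothetical: no SSR in progress */
	DBC_STATE_BEFORE_SHUTDOWN,	/* NSP found in unrecoverable error */
	DBC_STATE_AFTER_SHUTDOWN,	/* NSP is under reset */
	DBC_STATE_BEFORE_POWER_UP,	/* debug info ready, QSM restarts NSP */
	DBC_STATE_AFTER_POWER_UP,	/* NSP restarted and in idle state */
};

/*
 * Hypothetical helper: a transition is accepted only if it is the next
 * state in the documented order. A new SSR cycle may begin from the
 * placeholder idle state or after the previous cycle has completed.
 */
static bool dbc_ssr_transition_ok(enum dbc_ssr_state cur,
				  enum dbc_ssr_state next)
{
	if (next == DBC_STATE_BEFORE_SHUTDOWN)
		return cur == DBC_STATE_IDLE ||
		       cur == DBC_STATE_AFTER_POWER_UP;

	/* Every other state must directly follow its predecessor. */
	return cur != DBC_STATE_IDLE && (int)next == (int)cur + 1;
}

/*
 * Hypothetical helper: apply one notification to the tracked state.
 * Returns false and leaves the state untouched if it is out of order.
 */
static bool dbc_ssr_step(enum dbc_ssr_state *state, enum dbc_ssr_state next)
{
	if (!dbc_ssr_transition_ok(*state, next))
		return false;
	*state = next;
	return true;
}
```

Rejecting out-of-order notifications, rather than jumping to whatever state arrives, keeps a skipped stage visible to the caller so it can treat the recovery as failed.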
