Strange failure of CRU

I had just completed several data acquisition tests for packet mode with my CRU installed in a Dell PowerEdge R740, when all of a sudden the following messages appeared on the console:
Message from syslogd@gemini at Oct 24 15:41:38 …
kernel:Uhhuh. NMI received for unknown reason 20 on CPU 0.

Message from syslogd@gemini at Oct 24 15:41:38 …
kernel:Do you have a strange power saving mode enabled?

Message from syslogd@gemini at Oct 24 15:41:38 …
kernel:Dazed and confused, but trying to continue

The command roc-list-cards at that point resulted in:

[root@gemini data]# roc-list-cards
infoLoggerD not available, falling back to stdout logging
Error: CRU reported invalid serial number 0xffffffff, a fatal error may have occurred
[AliceO2::Common::ErrorInfo::_Message*] = CRU reported invalid serial number 0xffffffff, a fatal error may have occurred

Any idea what could have happened?

There are no additional log entries in /var/log/messages, other than:

Oct 24 15:41:38 gemini kernel: Uhhuh. NMI received for unknown reason 20 on CPU 0.
Oct 24 15:41:38 gemini kernel: Do you have a strange power saving mode enabled?
Oct 24 15:41:38 gemini kernel: Dazed and confused, but trying to continue

Reloading the firmware and executing the startup script after a power cycle brought the CRU back to life. I still have no idea, though, what made it crash to begin with.

Ciao Jo, difficult to say.
The kernel message usually appears when the FPGA is reloaded (i.e. the card disappears from the PCIe bus).
I doubt it is a power issue, as we have several CRU cards running in our lab and they are not connected to the power cables.
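If it happens again, one quick check is whether the card is still visible on the PCIe bus. Below is a minimal sketch, not part of the ReadoutCard tools: the helper names are hypothetical, and the vendor ID 0x1172 (Altera) is an assumption about how the CRU enumerates on your machine. It also flags all-ones register values such as the 0xffffffff serial number above, which is the pattern a read typically returns when a PCIe device no longer responds.

```python
import os


def looks_like_dead_read(value, width_bits=32):
    """Return True if a register value is all ones for the given width.

    Reads from a PCIe device that has stopped responding typically
    come back as all ones (e.g. 0xffffffff for a 32-bit register).
    """
    return value == (1 << width_bits) - 1


def find_pci_devices(vendor_id, sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses under sysfs_root whose vendor ID matches."""
    found = []
    if not os.path.isdir(sysfs_root):
        return found
    for addr in sorted(os.listdir(sysfs_root)):
        vendor_path = os.path.join(sysfs_root, addr, "vendor")
        try:
            with open(vendor_path) as f:
                # sysfs stores the vendor ID as text, e.g. "0x1172"
                value = int(f.read().strip(), 16)
        except (OSError, ValueError):
            continue  # skip entries without a readable vendor file
        if value == vendor_id:
            found.append(addr)
    return found


if __name__ == "__main__":
    # Assumption: the card enumerates with the Altera vendor ID.
    EXAMPLE_VENDOR_ID = 0x1172
    print("serial looks dead:", looks_like_dead_read(0xFFFFFFFF))
    print("matching devices:", find_pci_devices(EXAMPLE_VENDOR_ID))
```

An empty device list right after the NMI messages would be consistent with the card having dropped off the bus. A card that is still listed but returns all-ones values would point to the FPGA logic rather than to enumeration.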

I’ll keep an eye on that and see if it also happens in our lab.

Thank you
PiPPo