QC proxy occasionally crash

Dear experts,

We are using the QC v0.26.1 (commit 19a915e32b9c04904df19b9ae87de894f24d7197) and alidist (commit 17838c60e2ce56e24179eff8e11e74a75bcb66d2, FairMQ v1.4.18) for ITS commissioning.

But after running:

o2-qc --config json:///home/its/QC_Online/QualityControl/Modules/ITS/readout-multi-a.json --remote -b --session default

the proxy occasionally broke at the very beginning:

[26682:MERGER-ITSFHR1l-0]: [05:41:00][STATE] Starting FairMQ state machine → IDLE
[26683:QC-CHECK-RUNNER-sink-QC_ITSFHR-mo_0]: 2020-09-04 05:41:00 ESC[1;49mStdOut backend initializedESC[0m
[26683:QC-CHECK-RUNNER-sink-QC_ITSFHR-mo_0]: [05:41:00][STATE] Starting FairMQ state machine → IDLE
[26681:ITSFHR-proxy]: [METRIC] matcher_variables/197,1 null 15991
[26681:ITSFHR-proxy]: *** Break *** segmentation violation

I checked the /dev/shm, and it looks good for me:

-rw-r–r-- 1 its its 32 Sep 3 20:23 sem.fmq_11930162_mtx
-rw-r–r-- 1 its its 104 Sep 3 20:23 fmq_11930162_cv
-rw-r–r-- 1 its its 655360 Sep 4 12:33 fmq_11930162_mng
-rw-r–r-- 1 its its 2000000000 Sep 4 12:37 fmq_11930162_main

Any idea about that? Thank you.

Cheers,
Jian

Hi, is there any stack trace printed? Could you provide us with the full logs?

Hi,
The log file attached here
qc_remote_2020-09-04_05-40-37.log (119.6 KB)

Here is another similar break occasionally happened when starting QC:

[ERROR] error while setting up workflow: basic_string::_M_create
[INFO] Process 15323 is exiting.

Log file attached:
qc_b_run500522_2020-09-04_17-48-49.log (4.6 KB)

Thank you for these to log files. However, I must say that tells me very little aside from giving suspicions that it is related to shared memory.

I see that you are using quite old software versions. Would you consider updating them, or this is not an option now? @eulisse Where there any fixes in DPL<->FairMQ
in the last few months which could have fixed this issue?

@jian Does the problem occur each time or just occasionally? Did it work before?

Hi @pkonopka,

It only occasionally occurred. The chance is about 1/8.

We reverted to these old versions as we are still using the RDHv4 for the ITS commissioning (Readout v1.3.10-1). This workflow started processing data since Sep. 02 and the problem is always there.

The current workflow for shifters is: readout.exe (raw data replay) → StfB → raw-proxy → qc-its. We are running QC on 5 flps; 4 local + 1 “remote”. The problem is occurring randomly on these machines. We do clean the shm by running:

fairmq-shmmonitor --cleanup --session session-name

before starting the workflow. I can see the segments were removed.

We will update to the latest versions as soon as we run the detector with RDHv6 (hopefully a couple of weeks later).

For ITS QC task development, we keep following the latest software versions. I was facing a similar problem in July and August. Sorry, I did not remember clearly. It seemed the proxy thread occasionally could not be set up when starting QC but no crash happened. We will pay attention to gather the crash info.

Thanks,
Jian

I’ve created a Jira issue to follow it further: https://alice.its.cern.ch/jira/browse/QC-403

The fix has been merged https://github.com/AliceO2Group/AliceO2/pull/4371 .