O2 workflows stuck at initialization phase

Dear experts, for a while now all the workflows I try to run on my development machine have been getting stuck at the initialization phase, before any of the run() methods are called.

I am building O2 via alibuild in the following way:

DEFAULTS=o2-epn

aliBuild init O2@dev --defaults ${DEFAULTS}
aliBuild init QualityControl@master --defaults ${DEFAULTS}
aliBuild init DataDistribution@dev --defaults ${DEFAULTS}

aliBuild -j 4 build O2Suite --defaults ${DEFAULTS} $1

Even a simple attempt to read a CTF file gets stuck. Here is an example command:

o2-ctf-reader-workflow --session default --shm-segment-size 16000000000 --delay 1 --ctf-input o2_ctf_run00514053_orbit0002051111_tf0000015940.root --ctf-dict ctf_dictionary.root --onlyDet MCH,MID | o2-dpl-run --session default --shm-segment-size 16000000000 --batch --run

I am getting the following output:

[INFO] Initialising O2 Data Processing Layer. Driver PID: 103217.
[INFO] Rate limiting set up at 500MB distributed over 1 readers
[INFO] O2 Data Processing Layer initialised. We brake for nobody.
[INFO] Optimised build. O2DEBUG / LOG(debug) / LOGF(debug) / assert statement will not be shown.
[INFO] Redeployment of configuration asked.
[INFO] Starting internal-dpl-clock on pid 103234
[INFO] Starting ctf-reader on pid 103235
[INFO] Starting internal-dpl-ccdb-backend on pid 103236
[INFO] Starting mid-entropy-decoder on pid 103237
[INFO] Starting mch-entropy-decoder on pid 103238
[INFO] Starting internal-dpl-injected-dummy-sink on pid 103239
[INFO] Redeployment of configuration done.
[103235:ctf-reader]: [INFO] Instrumenting crash signals
[103235:ctf-reader]: [INFO] Spawing new device ctf-reader in process with pid 103235
[103235:ctf-reader]: [14:30:23][INFO] 
[103235:ctf-reader]:       ______      _    _______  _________ 
[103235:ctf-reader]:      / ____/___ _(_)_______   |/  /_  __ \    version 1.4.50
[103235:ctf-reader]:     / /_  / __ `/ / ___/__  /|_/ /_  / / /    build RELWITHDEBINFO
[103235:ctf-reader]:    / __/ / /_/ / / /    _  /  / / / /_/ /     https://github.com/FairRootGroup/FairMQ
[103235:ctf-reader]:   /_/    \__,_/_/_/     /_/  /_/  \___\_\     LGPL-3.0  © 2012-2022 GSI
[103235:ctf-reader]: 
[103235:ctf-reader]: [14:30:23][STATE] Starting FairMQ state machine --> IDLE
[103239:internal-dpl-injected-dummy-sink]: [INFO] Instrumenting crash signals
[103236:internal-dpl-ccdb-backend]: [INFO] Instrumenting crash signals
[103235:ctf-reader]: [14:30:23][STATE] IDLE ---> INITIALIZING DEVICE
[103239:internal-dpl-injected-dummy-sink]: [INFO] Spawing new device internal-dpl-injected-dummy-sink in process with pid 103239
[103235:ctf-reader]: [14:30:23][INFO] Input contains 1 files, 0 remote
[103235:ctf-reader]: [14:30:23][INFO] Finished file fetching: 1 of 1 files fetched successfully in 0 iterations
[103236:internal-dpl-ccdb-backend]: [INFO] Spawing new device internal-dpl-ccdb-backend in process with pid 103236
[103237:mid-entropy-decoder]: [INFO] Instrumenting crash signals
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][INFO] 
[103239:internal-dpl-injected-dummy-sink]:       ______      _    _______  _________ 
[103239:internal-dpl-injected-dummy-sink]:      / ____/___ _(_)_______   |/  /_  __ \    version 1.4.50
[103239:internal-dpl-injected-dummy-sink]:     / /_  / __ `/ / ___/__  /|_/ /_  / / /    build RELWITHDEBINFO
[103239:internal-dpl-injected-dummy-sink]:    / __/ / /_/ / / /    _  /  / / / /_/ /     https://github.com/FairRootGroup/FairMQ
[103239:internal-dpl-injected-dummy-sink]:   /_/    \__,_/_/_/     /_/  /_/  \___\_\     LGPL-3.0  © 2012-2022 GSI
[103239:internal-dpl-injected-dummy-sink]: 
[103235:ctf-reader]: [14:30:23][STATE] INITIALIZING DEVICE ---> INITIALIZED
[103235:ctf-reader]: [14:30:23][STATE] INITIALIZED ---> BINDING
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][STATE] Starting FairMQ state machine --> IDLE
[103236:internal-dpl-ccdb-backend]: [14:30:23][INFO] 
[103236:internal-dpl-ccdb-backend]:       ______      _    _______  _________ 
[103236:internal-dpl-ccdb-backend]:      / ____/___ _(_)_______   |/  /_  __ \    version 1.4.50
[103236:internal-dpl-ccdb-backend]:     / /_  / __ `/ / ___/__  /|_/ /_  / / /    build RELWITHDEBINFO
[103236:internal-dpl-ccdb-backend]:    / __/ / /_/ / / /    _  /  / / / /_/ /     https://github.com/FairRootGroup/FairMQ
[103236:internal-dpl-ccdb-backend]:   /_/    \__,_/_/_/     /_/  /_/  \___\_\     LGPL-3.0  © 2012-2022 GSI
[103236:internal-dpl-ccdb-backend]: 
[103236:internal-dpl-ccdb-backend]: [14:30:23][STATE] Starting FairMQ state machine --> IDLE
[103235:ctf-reader]: [14:30:23][STATE] BINDING ---> BOUND
[103235:ctf-reader]: [14:30:23][STATE] BOUND ---> CONNECTING
[103235:ctf-reader]: [14:30:23][STATE] CONNECTING ---> DEVICE READY
[103235:ctf-reader]: [14:30:23][STATE] DEVICE READY ---> INITIALIZING TASK
[103235:ctf-reader]: [14:30:23][STATE] INITIALIZING TASK ---> READY
[103235:ctf-reader]: [14:30:23][STATE] READY ---> RUNNING
[103235:ctf-reader]: [14:30:23][INFO] fair::mq::Device running...
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][STATE] IDLE ---> INITIALIZING DEVICE
[103237:mid-entropy-decoder]: [INFO] Spawing new device mid-entropy-decoder in process with pid 103237
[103236:internal-dpl-ccdb-backend]: [14:30:23][STATE] IDLE ---> INITIALIZING DEVICE
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][STATE] INITIALIZING DEVICE ---> INITIALIZED
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][STATE] INITIALIZED ---> BINDING
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][STATE] BINDING ---> BOUND
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][STATE] BOUND ---> CONNECTING
[103236:internal-dpl-ccdb-backend]: [14:30:23][INFO] CCDB Backend at: http://alice-ccdb.cern.ch, validity check for every 1 TF
[103237:mid-entropy-decoder]: [14:30:23][INFO] 
[103237:mid-entropy-decoder]:       ______      _    _______  _________ 
[103237:mid-entropy-decoder]:      / ____/___ _(_)_______   |/  /_  __ \    version 1.4.50
[103237:mid-entropy-decoder]:     / /_  / __ `/ / ___/__  /|_/ /_  / / /    build RELWITHDEBINFO
[103237:mid-entropy-decoder]:    / __/ / /_/ / / /    _  /  / / / /_/ /     https://github.com/FairRootGroup/FairMQ
[103237:mid-entropy-decoder]:   /_/    \__,_/_/_/     /_/  /_/  \___\_\     LGPL-3.0  © 2012-2022 GSI
[103237:mid-entropy-decoder]: 
[103237:mid-entropy-decoder]: [14:30:23][STATE] Starting FairMQ state machine --> IDLE
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][STATE] CONNECTING ---> DEVICE READY
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][STATE] DEVICE READY ---> INITIALIZING TASK
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][STATE] INITIALIZING TASK ---> READY
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][STATE] READY ---> RUNNING
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][INFO] fair::mq::Device running...
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][WARN] Unable to communicate with driver because client is not connected. Continuing connection attempts.
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][INFO] Correctly handshaken websocket connection.
[103239:internal-dpl-injected-dummy-sink]: [14:30:23][WARN] DriverClient connected successfully. Flushing message backlog of 3057 messages. All is good.
[103234:internal-dpl-clock]: [INFO] Instrumenting crash signals
[103234:internal-dpl-clock]: [INFO] Spawing new device internal-dpl-clock in process with pid 103234
[103234:internal-dpl-clock]: [14:30:23][INFO] 
[103234:internal-dpl-clock]:       ______      _    _______  _________ 
[103234:internal-dpl-clock]:      / ____/___ _(_)_______   |/  /_  __ \    version 1.4.50
[103234:internal-dpl-clock]:     / /_  / __ `/ / ___/__  /|_/ /_  / / /    build RELWITHDEBINFO
[103234:internal-dpl-clock]:    / __/ / /_/ / / /    _  /  / / / /_/ /     https://github.com/FairRootGroup/FairMQ
[103234:internal-dpl-clock]:   /_/    \__,_/_/_/     /_/  /_/  \___\_\     LGPL-3.0  © 2012-2022 GSI
[103234:internal-dpl-clock]: 
[103234:internal-dpl-clock]: [14:30:23][STATE] Starting FairMQ state machine --> IDLE
[103234:internal-dpl-clock]: [14:30:23][STATE] IDLE ---> INITIALIZING DEVICE
[103234:internal-dpl-clock]: [14:30:23][STATE] INITIALIZING DEVICE ---> INITIALIZED
[103234:internal-dpl-clock]: [14:30:23][STATE] INITIALIZED ---> BINDING
[103234:internal-dpl-clock]: [14:30:23][STATE] BINDING ---> BOUND
[103234:internal-dpl-clock]: [14:30:23][STATE] BOUND ---> CONNECTING
[103234:internal-dpl-clock]: [14:30:23][STATE] CONNECTING ---> DEVICE READY
[103234:internal-dpl-clock]: [14:30:23][STATE] DEVICE READY ---> INITIALIZING TASK
[103234:internal-dpl-clock]: [14:30:23][STATE] INITIALIZING TASK ---> READY
[103234:internal-dpl-clock]: [14:30:23][STATE] READY ---> RUNNING
[103234:internal-dpl-clock]: [14:30:23][INFO] fair::mq::Device running...
[103234:internal-dpl-clock]: [14:30:23][WARN] Unable to communicate with driver because client is not connected. Continuing connection attempts.
[103237:mid-entropy-decoder]: [14:30:23][STATE] IDLE ---> INITIALIZING DEVICE
[103238:mch-entropy-decoder]: [INFO] Instrumenting crash signals
[103238:mch-entropy-decoder]: [INFO] Spawing new device mch-entropy-decoder in process with pid 103238
[103238:mch-entropy-decoder]: [14:30:23][INFO] 
[103238:mch-entropy-decoder]:       ______      _    _______  _________ 
[103238:mch-entropy-decoder]:      / ____/___ _(_)_______   |/  /_  __ \    version 1.4.50
[103238:mch-entropy-decoder]:     / /_  / __ `/ / ___/__  /|_/ /_  / / /    build RELWITHDEBINFO
[103238:mch-entropy-decoder]:    / __/ / /_/ / / /    _  /  / / / /_/ /     https://github.com/FairRootGroup/FairMQ
[103238:mch-entropy-decoder]:   /_/    \__,_/_/_/     /_/  /_/  \___\_\     LGPL-3.0  © 2012-2022 GSI
[103238:mch-entropy-decoder]: 
[103238:mch-entropy-decoder]: [14:30:23][STATE] Starting FairMQ state machine --> IDLE
[103238:mch-entropy-decoder]: [14:30:23][STATE] IDLE ---> INITIALIZING DEVICE
[103234:internal-dpl-clock]: [14:30:23][INFO] Correctly handshaken websocket connection.
[103234:internal-dpl-clock]: [14:30:23][WARN] DriverClient connected successfully. Flushing message backlog of 7209 messages. All is good.
[103236:internal-dpl-ccdb-backend]: [14:30:23][INFO] Is alien token present?: 1
[103236:internal-dpl-ccdb-backend]: [14:30:23][INFO] Initialised default CCDB host http://alice-ccdb.cern.ch
[103236:internal-dpl-ccdb-backend]: [14:30:23][INFO] The following route is a condition  { MID/CTFDICT/0}
[103236:internal-dpl-ccdb-backend]: [14:30:23][INFO] - ccdb-path: MID/Calib/CTFDictionary
[103236:internal-dpl-ccdb-backend]: [14:30:23][INFO] The following route is a condition  { MCH/CTFDICT/0}
[103236:internal-dpl-ccdb-backend]: [14:30:23][INFO] - ccdb-path: MCH/Calib/CTFDictionary
[103236:internal-dpl-ccdb-backend]: [14:30:23][STATE] INITIALIZING DEVICE ---> INITIALIZED
[103236:internal-dpl-ccdb-backend]: [14:30:23][STATE] INITIALIZED ---> BINDING
[103236:internal-dpl-ccdb-backend]: [14:30:23][STATE] BINDING ---> BOUND
[103236:internal-dpl-ccdb-backend]: [14:30:23][STATE] BOUND ---> CONNECTING
[103236:internal-dpl-ccdb-backend]: [14:30:23][STATE] CONNECTING ---> DEVICE READY
[103236:internal-dpl-ccdb-backend]: [14:30:23][STATE] DEVICE READY ---> INITIALIZING TASK
[103236:internal-dpl-ccdb-backend]: [14:30:23][STATE] INITIALIZING TASK ---> READY
[103236:internal-dpl-ccdb-backend]: [14:30:23][STATE] READY ---> RUNNING
[103236:internal-dpl-ccdb-backend]: [14:30:23][INFO] fair::mq::Device running...
[103236:internal-dpl-ccdb-backend]: [14:30:23][WARN] Unable to communicate with driver because client is not connected. Continuing connection attempts.
[103236:internal-dpl-ccdb-backend]: [14:30:23][INFO] Correctly handshaken websocket connection.
[103236:internal-dpl-ccdb-backend]: [14:30:23][WARN] DriverClient connected successfully. Flushing message backlog of 2197 messages. All is good.
[103238:mch-entropy-decoder]: [14:30:31][INFO] Loaded ITS CTF dictionary v1.0.1626472048 (16/07/21 21:47:28 UTC) from ctf_dictionary.root
[103238:mch-entropy-decoder]: [14:30:31][STATE] INITIALIZING DEVICE ---> INITIALIZED
[103238:mch-entropy-decoder]: [14:30:31][STATE] INITIALIZED ---> BINDING
[103238:mch-entropy-decoder]: [14:30:31][STATE] BINDING ---> BOUND
[103238:mch-entropy-decoder]: [14:30:31][STATE] BOUND ---> CONNECTING
[103238:mch-entropy-decoder]: [14:30:31][STATE] CONNECTING ---> DEVICE READY
[103238:mch-entropy-decoder]: [14:30:31][STATE] DEVICE READY ---> INITIALIZING TASK
[103238:mch-entropy-decoder]: [14:30:31][STATE] INITIALIZING TASK ---> READY
[103238:mch-entropy-decoder]: [14:30:31][STATE] READY ---> RUNNING
[103238:mch-entropy-decoder]: [14:30:31][INFO] fair::mq::Device running...
[103238:mch-entropy-decoder]: [14:30:31][WARN] Unable to communicate with driver because client is not connected. Continuing connection attempts.
[103238:mch-entropy-decoder]: [14:30:31][INFO] Correctly handshaken websocket connection.
[103238:mch-entropy-decoder]: [14:30:31][WARN] DriverClient connected successfully. Flushing message backlog of 2286 messages. All is good.
[103237:mid-entropy-decoder]: [14:30:31][INFO] Loaded ITS CTF dictionary v1.0.1626472048 (16/07/21 21:47:28 UTC) from ctf_dictionary.root
[103237:mid-entropy-decoder]: [14:30:31][STATE] INITIALIZING DEVICE ---> INITIALIZED
[103237:mid-entropy-decoder]: [14:30:31][STATE] INITIALIZED ---> BINDING
[103237:mid-entropy-decoder]: [14:30:31][STATE] BINDING ---> BOUND
[103237:mid-entropy-decoder]: [14:30:31][STATE] BOUND ---> CONNECTING
[103237:mid-entropy-decoder]: [14:30:31][STATE] CONNECTING ---> DEVICE READY
[103237:mid-entropy-decoder]: [14:30:31][STATE] DEVICE READY ---> INITIALIZING TASK
[103237:mid-entropy-decoder]: [14:30:31][STATE] INITIALIZING TASK ---> READY
[103237:mid-entropy-decoder]: [14:30:31][STATE] READY ---> RUNNING
[103237:mid-entropy-decoder]: [14:30:31][INFO] fair::mq::Device running...
[103237:mid-entropy-decoder]: [14:30:31][WARN] Unable to communicate with driver because client is not connected. Continuing connection attempts.
[103237:mid-entropy-decoder]: [14:30:31][INFO] Correctly handshaken websocket connection.
[103237:mid-entropy-decoder]: [14:30:31][WARN] DriverClient connected successfully. Flushing message backlog of 2290 messages. All is good.

I suspect an issue with the CCDB access, even though my alien access token seems to be valid and I am using a machine connected to the CERN network.

I tested the manual download of a CCDB object as suggested here, but I get the following:

[O2Suite/latest-o2-epn] ~/TED-Shots $> o2-ccdb-downloadccdbfile --host http://alice-ccdb.cern.ch -p CTP/Calib/OrbitReset -t 1635659148972
[INFO] Is alien token present?: 1
Querying host http://alice-ccdb.cern.ch for path(s) CTP/Calib/OrbitReset ... and timestamp 1635659148972
[ERROR] No ETag found in header for path CTP/Calib/OrbitReset. Aborting.

Does anyone have an idea of what could be wrong with my setup? At the moment all my development work is stuck due to this, as I cannot test any code…

Thanks!

Hi @aferrero

I’ve just tried, and for me both your workflow and o2-ccdb-downloadccdbfile work. There could have been a temporary glitch with the CCDB server, though from the workflow output you pasted it does not look like it is stuck on fetching.
Could you check whether your /dev/shm contains files from previously crashed jobs? If it does, just delete them.

Cleaning /dev/shm did the trick for the CTF reader workflow, thanks! I think the problem was due to some dangling /dev/shm/sem.fmq_* files. I admit I should have thought about the SHM, but usually one gets crashes due to insufficient memory…
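
For reference, the cleanup described above could be sketched as a small shell function. This is only a sketch: the function name list_stale_fmq is made up for illustration, the sem.fmq_* pattern comes from this thread, and the plain fmq_* pattern is an assumption about how the shared-memory segments themselves are named. It only lists the files unless --delete is given, and it should only be run while no workflow is active, since removing the files of a running workflow would break it.

```shell
# Hypothetical helper: list leftover FairMQ files in a directory (default: /dev/shm).
# Pass "--delete" as the second argument to remove the files that are listed.
list_stale_fmq() {
    fmq_dir="${1:-/dev/shm}"
    fmq_action="${2:-}"
    # sem.fmq_* is the pattern reported in this thread; fmq_* is an assumption.
    for f in "$fmq_dir"/fmq_* "$fmq_dir"/sem.fmq_*; do
        # Skip patterns that matched nothing (the glob is left unexpanded).
        [ -e "$f" ] || continue
        echo "$f"
        if [ "$fmq_action" = "--delete" ]; then
            rm -f -- "$f"
        fi
    done
}
```

Calling list_stale_fmq with no arguments shows what is left in /dev/shm; calling list_stale_fmq /dev/shm --delete removes the listed files.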

On the other hand, the problem with o2-ccdb-downloadccdbfile is still there. However, it does not seem to prevent the workflows from running…
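
One way to narrow down the "No ETag found in header" error might be to look at the HTTP response headers directly. The sketch below is not part of O2: has_etag is a made-up helper that reads headers on standard input and succeeds if an ETag line is present. The URL layout in the usage comment (host, then path, then timestamp) is an assumption about how the CCDB server is addressed, and the server may reply differently to a header-only request.

```shell
# Hypothetical helper: succeed if the HTTP headers on stdin contain an ETag line.
has_etag() {
    # Header names are case-insensitive, so match "ETag:", "etag:", and so on.
    grep -qi '^etag:'
}

# Possible usage (URL layout is an assumption, not taken from the O2 documentation):
#   curl -sI "http://alice-ccdb.cern.ch/CTP/Calib/OrbitReset/1635659148972" | has_etag \
#       && echo "ETag present" || echo "no ETag in response"
```

If the headers seen from the affected machine lack an ETag while another machine gets one, that would point to something between the machine and the server (a proxy, for instance) rather than to the O2 build.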

Strange, I did not see any problem with o2-ccdb-downloadccdbfile.
BTW, you don’t need to provide --ctf-dict ctf_dictionary.root unless you want to use a local dictionary file: by default the dictionaries are now fetched from the CCDB.
Another trick: unless you expect the CCDB object to change within the timespan of the data you are processing, you can pass the option --condition-tf-per-query -1 at the end of the command to query the CCDB only once at the beginning (passing N instead of -1 enforces a query every N TFs instead of every TF).
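
Putting both suggestions together, the command from the first post could look roughly like the sketch below. It only assembles and prints the pipeline rather than running it. The variable names are made up for illustration, and appending --condition-tf-per-query -1 to the last command of the pipe is my reading of "at the end", not something confirmed in this thread.

```shell
# Assemble the pipeline from the first post, without --ctf-dict and with a single CCDB query.
CTF_FILE="o2_ctf_run00514053_orbit0002051111_tf0000015940.root"
COMMON_OPTS="--session default --shm-segment-size 16000000000"

# Reader: --ctf-dict is dropped, so the dictionaries come from the CCDB.
READER_CMD="o2-ctf-reader-workflow $COMMON_OPTS --delay 1 --ctf-input $CTF_FILE --onlyDet MCH,MID"

# Last command of the pipe: the option placement here is an assumption.
RUNNER_CMD="o2-dpl-run $COMMON_OPTS --batch --run --condition-tf-per-query -1"

# Print the full pipeline so it can be inspected before running it by hand.
echo "$READER_CMD | $RUNNER_CMD"
```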
