and it seems that the QC is not seeing any input data and so is basically doing nothing.
Most probably I’ve misconfigured something, but I realize I have no idea how to start debugging this: e.g. is it the readout that is not sending, or the QC that is not receiving?
I’ve also tried:
o2-qc-run-readout-for-data-dump
to remove (most of) the QC parts from the equation (at least that is my belief…), but here I’m not so sure what to expect (I don’t see much activity in the GUI or in the terminal log…).
Thanks,
PS: I’m trying this on a Mac, but at some point I also checked on an FLP-like machine, with the same outcome.
The way I would start investigating the problem would be to look at the DPL debug GUI. Do you have it?
You should see where the data stops. Is it in the proxy? If so, the data is probably not being sent by the readout. If it stops at the Dispatcher, then Data Sampling is not correctly configured.
Yes, I have the debug GUI (I just needed Diego to tell me I had to double-click on a device to get its output). Anyway, the log (either from the terminal or from the debug GUI) says something like:
So it seems the readout-proxy is outputting some data to the Dispatcher, but then the flow stops, meaning my Dispatcher is not configured correctly, right?
As for ideas on how to check situations like this: not really, apart from maybe detecting (and notifying the user) when and where data stops flowing somewhere in the workflow while it flows elsewhere. I’m not so sure, however, whether there are legitimate cases of no data transfer within a workflow?
Both readout.cfg and readout.json are clickable in the original post (they point to CERNBox files, as apparently one cannot attach .cfg or .json files to Discourse).
Not sure what the problem might be. The config files look good to me; the logs show suspiciously little happening, and the QC task does not publish even empty Monitor Objects.
I would like to exclude the possibility that something is silently going wrong inside the QC task. Could you please run that setup, but with the default readout.json (the one in QualityControl/Framework/readout.json)? Please check whether data seems to move throughout the chain in the logs, and whether the object qc/DAQ/daqTask/payloadSize is being updated on https://qcg-test.cern.ch
Instructive test indeed. With o2-qc-run-readout | o2-qc-run-qc --config json://$QUALITYCONTROL_ROOT/etc/readout.json the data is flowing and the payloadSize histogram is updated…
So I’ll have a closer look at the QC task, then (I’m not the author of that one, I just wanted to use it for some tests…).
OK, the origin of the problem is now identified. The QcTaskMCHPhysics initialize method never returned: it was trying to read from a non-existent file without handling the missing-file case. With this part fixed, the data now flows correctly and the memory no longer grows.
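For reference, the kind of defensive pattern that avoids this class of hang is to check for the file up front and fail fast instead of blocking. This is a minimal hypothetical sketch (the function name and structure are illustrative, not the actual QcTaskMCHPhysics code):

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Read an input file needed during a task's initialization.
// Instead of hanging or silently doing nothing when the file is
// missing, verify its existence first and throw a clear error, so
// the framework can report the failure instead of stalling forever.
std::string readRequiredFile(const std::string& path)
{
  if (!std::filesystem::exists(path)) {
    throw std::runtime_error("Required input file not found: " + path);
  }
  std::ifstream in(path);
  return std::string(std::istreambuf_iterator<char>(in), {});
}
```

Failing loudly in initialize would at least have produced a visible error in the device log, rather than a task that appears alive but never processes data.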
While this solves my initial problem (again, thanks for the help in debugging it), maybe there’s a lesson to be learned here? I assume the memory growth I observed was some consequence of the task being “non-responsive” in some way? If that’s the case, then such occurrences should probably be detected by the framework itself, to avoid the memory going overboard.
I agree, there is something to be learned and probably some tool/mechanism to be developed to avoid this in the future.
When you say “framework”, do you mean DPL or QC?
In general, I have the feeling that we lack a way to quickly identify where the data gets stuck or lost, and/or which component is in which state. I was misled by the message indicating that no data was going from the Dispatcher to the Task, which made me blame the Dispatcher: [48577:Dispatcher]: [12:24:55][INFO] from_Dispatcher_to_QC-TASK-RUNNER-QcTaskMCHPhysics[0]: in: 0 (0 MB) out: 0 (0 MB). @eulisse any idea why the Dispatcher would report 0 when it was actually pushing data? I understand that the task was not processing the data, but I would expect the “other side” to be the one reporting 0.
For systems running with the ECS, and thus without a GUI, I had in mind to use the Monitoring package to know in which state each component is and to represent it in Grafana (plus probably the throughput). But I have to see with Adam how to do that. It would probably be installed with the FLP Suite, and it should be possible to set it up on dev machines.
Which framework I had in mind is a good question.
I would have said DPL, as it’s the mother framework, but maybe it’s “too” general to be able to set clear-cut expectations about data flow in each and every device?
On the QC side, on the other hand, we can probably expect the QC tasks to “see” some data as long as the Dispatcher is seeing some? So some sort of (lack of) correlation between the Dispatcher and the QC tasks could be used to detect such cases?
Just thinking out loud here…
Hmm, the Dispatcher could have debug output where it publishes some statistics every x seconds, so we would know whether it is at fault and not sending data. The QC TaskRunner could receive them and react accordingly. Though I am skeptical about automating that; visualizing the data flow would be a good first step.
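The “publish statistics every x seconds” idea boils down to a simple time-based throttle on emitting counters. A minimal sketch, with entirely hypothetical names (this is not an existing O2 or DPL API):

```cpp
#include <chrono>

// Throttle that lets a device (e.g. the Dispatcher) publish its
// message/byte counters at most once per interval, so downstream
// consumers or a visualization can tell whether it is actually
// sending data without flooding the logs.
class StatsThrottle {
 public:
  explicit StatsThrottle(std::chrono::seconds interval) : mInterval(interval) {}

  // Returns true when the interval has elapsed since the last
  // publication; the caller then emits its statistics snapshot.
  bool shouldPublish(std::chrono::steady_clock::time_point now)
  {
    if (now - mLast >= mInterval) {
      mLast = now;
      return true;
    }
    return false;
  }

 private:
  std::chrono::seconds mInterval;
  std::chrono::steady_clock::time_point mLast{};
};
```

In the device loop one would call `shouldPublish(std::chrono::steady_clock::now())` each iteration and, when it returns true, send the current in/out counters to the monitoring backend.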