DPL synchronisation of devices

dstocco · June 6, 2019, 2:22pm

Dear @eulisse,
I’m working on the reconstruction workflow for MID.
We typically have 2 cases: the reconstruction from data and the one from simulations.
Imagine that the data reconstruction is given by a series of DPL devices as:
RAW_READ -- RECO -- SINK
When we run in simulation, we have the same structure but in addition we want to propagate the MC info. This can be done with (almost) fully separated devices, as:
MC_READ – RECO – SINK
| \ |
MC_LABELLER – LABEL_SINK

i.e. we can add a Labeller device that takes as input:

the same input of RECO
an additional input which is the MC info
the output of RECO

And provides the labels associating the RECO input with its output.

Of course, for things to work as expected I need that:

MC_READ sends the same output to both RECO and MC_LABELLER (I guess I can use a broadcaster for this?)
RECO processes its input and sends the output to MC_LABELLER and to SINK (another broadcaster needed)
MC_LABELLER waits for RECO output and then starts processing
MC_READ waits for both MC_LABELLER and RECO to finish before sending new data

If MC_READ sends further data as soon as RECO has finished, then I have a problem because MC_LABELLER might not receive the output of RECO that corresponds to the inputs.
In other words, I need everything to be in synch.

How do I get this with DPL? Should I add some specific options/statuses or everything will run out of the box?

It might be a silly question, but better asking before starting the development.
Thanks in advance,
ciao,
Diego

eulisse · June 11, 2019, 9:23am

Ciao,

sorry for the late reply, I was away.

Maybe I do not understand, but the DPL will always give you synched inputs (by default), so as long as all the objects being exchanged have the same time associated to it, there shouldn’t be any problem. It’s actually even smarter than that and in principle it allows you to define a time interval (albeit this still requires a bit more development to be fully functional).

Am I misunderstanding your question?

dstocco · June 11, 2019, 9:42am

Ciao @eulisse,
as I told you this might have been a trivial question
However, there is one thing that I do not fully understand. What do you mean by:

as long as all the objects being exchanged have the same time associated to it ?

I know that at some point we will send a timeframe. However, in my devices, there is no notion of time for the moment. I simply pack the output in std::vector and I send it to the next device through a snapshot.
Where is the concept of timestamp in this? Maybe each time I do a snapshot the DPL framework attaches an header with a unique timestamp to my payload and carries it all along the chain, or should I handle the timestamp myself instead?

Sorry for the very basic questions, but this part is not straightforward for me…

Thanks,
cheers,
Diego

eulisse · June 11, 2019, 9:48am

DPL attaches an incremental time, if no other time is provided, as part of the DataProcessingHeader. So unless I miss something everything should be handled by the framework for you.

dstocco · June 11, 2019, 9:49am

Hi @eulisse,
ok, thanks: everything is clear now!

Cheers,
Diego

dstocco · June 20, 2019, 2:51pm

Hi @eulisse,
sorry to come back to this topic, but after finally implementing the reconstruction workflow I realised that the synchronisation does not work as expected.
I have a MIDDigitsReader that sends the data to MIDClusterizer and the MC labels to MIDClusterLabeler.
The MIDClusterLabeler waits for the clusters from MIDClusterizer and, combining the info with the MC labels sent by MIDDigitsReader, generates the MC labels of the clusters.
The only CAVEAT here is that, for the moment, the MIDDigitsReader sends one vector per event, so my timeframe is basically one event.
My full workflow can be found here.
Now, what happens is that everything works fine at the beginning. Then, at some point (which tend to vary), I get these error messages:

[78860]: [16:34:22][WARN] Incoming data is already obsolete, not relaying.
[78860]: [16:34:22][ERROR] Unable to relay part.

and the MIDClusterLabeler gets stuck. Some events are skipped…and then at some point the inputs from MIDDigitsReader and MIDClusterizer are in synch again and the cluster labeller works again.
Why is that so? I thought that the chain is blocked until MIDClusterLabeler has all of its outputs but it looks like the input from MIDDigitsReader is dropped at some point, before the output of MIDClusterizer is available for MIDClusterLabeler.
Should I change the CompletionPolicy in order for this chain to work? Or am I doing some trivial mistake?

Thanks in advance for further info,
best regards,
Diego

P.S. For info, to run the workflow one can first produce the digits with:

o2-sim -g fwmugen -m MID -n 100
o2-sim-digitizer-workflow

(all of the needed code is already in O2)
And then one can launch the workflow in the same directory with:

mid-reco-workflow

(this code is in my GitHub, at the link provided above)

eulisse · June 24, 2019, 7:14am

I think this happens because one path is much faster than the other so the slower one gets dropped. I need to introduce a way to control the polling policy so that users can specify when one should wait for a slower channel, rather than drop it. In the meanwhile, can you try changing DEFAULT_PIPELINE_LENGTH in DataRelayer.cxx and see if that improves the situation?

dstocco · June 24, 2019, 10:22am

Hi @eulisse,
indeed, if I increase DEFAULT_PIPELINE_LENGTH the message is gone.
I tried with 42, 64 and 80. 42 seems not to be enough, but I suspect that the results varies a lot from one try to another.
When I use 80 or 64 the message is gone, but I have problems in quitting the GUI (it seems to hang forever). I have to stop the process with ctrl+C
Anyways, at least it seems that the source of the problem is clear.

Thanks a lot!
Cheers,
Diego

therman · April 1, 2020, 8:54pm

Hi @eulisse,

I’ve encountered the same error as Diego:

[11782:internal-dpl-global-binary-file-sink]: [19:18:28][WARN] Incoming data is already obsolete, not relaying.
[11782:internal-dpl-global-binary-file-sink]: [19:18:28][ERROR] Unable to relay part.

When running the MFT cluster QC task on MC file which is read by a data producer macro.

Here is the data producer: https://github.com/AliceO2Group/QualityControl/blob/master/Modules/MFT/src/runClustersRootFileReaderMFT.cxx
Here is the QC task: https://github.com/AliceO2Group/QualityControl/blob/master/Modules/MFT/src/BasicClusterQcTask.cxx

When I changed the DEFAULT_PIPELINE_LENGTH in DataRelayer.cxx from 16 to 64 it worked fine, no error was reported.

Anyway, I wanted to check if there is a more permanent solution now or if this is still the recommended approach?

Thanks,
Tomáš