DPL packets dropped in case of large back-pressure between DPL source and QC?

aferrero · January 25, 2021, 8:04pm

After having recently updated the O2 processing chain that we use for the MCH commissioning, I think I observe a different behavior with respect to the code from June 2020 that I was using beforehand.

For the MCH commissioning we analyze data that was previously written to disk using readout.exe. The data is read out via a custom DPL device that loads HB frames from the input file and pushes them as DPL messages. The messages are received by QC (at 100% sampling rate) and processed.
The DPL messages are pushed much faster than they are digested by QC, so some back-pressure builds up quite quickly.

Using an O2 version from before June 2020, the DPL source was apparently slowing down to cope with the sink speed, and all messages were received and processed by QC.

With the current O2 code however it seems that DPL messages get dropped in order to reduce the back-pressure, instead of slowing down the source.

I have not done any detailed benchmark so far, but I could try to prepare a minimal example if this behavior is unexpected.

If instead this is a known feature, is there an option to disable the DPL message dropping, so that all messages are transferred from source to sink?

Thanks a lot in advance.

Cheers, Andrea

laphecet · January 25, 2021, 8:46pm

Hi Andrea,

I suspect you’re hitting the same problem that Philippe and I hit, which is described in https://alice.its.cern.ch/jira/browse/O2-1924 where Philippe actually prepared a minimal reproducer.

So either there’s indeed a bug somewhere or we (mch people) consistently misuse the DPL framework somehow.

aferrero · January 26, 2021, 7:32am

Hi Laurent,

thanks for pointing me to the Jira ticket. I have tried to find similar discussions before posting this, but I missed that one…

My impression is that with the current behavior it is impossible to process 100% of the DPL messages, because in a DPL chain most of the consumers are typically slower than the producers… so for me it looks like a bug.

However, mixing the order of DPL messages and dropping some of them are two different issues IMHO. Mixing is somehow OK, as long as we consider that each TF is independent of the others. On the other hand, skipping messages means we completely miss some data…

Cheers, Andrea

bvonhall · January 26, 2021, 10:10am

My 2 cents: it should be, at the very least, configurable. In ZeroMQ, one would switch between subscriber-publisher and push-pull to have one behaviour or the other (i.e. non-blocking vs blocking).
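To make the blocking vs non-blocking distinction concrete, here is a minimal sketch that uses only Python's standard library, not ZeroMQ or DPL. The function name `run_chain`, the `policy` values and the queue capacity are all made up for illustration. A fast producer feeds a slow consumer through a bounded queue: with `"block"` the producer waits when the queue is full (back-pressure slows the source), with `"drop"` it discards the message instead.

```python
import queue
import threading
import time


def run_chain(n_messages, policy, capacity=4, consume_delay=0.001):
    """Push n_messages through a bounded queue to a slow consumer.

    policy="block": the producer waits while the queue is full.
    policy="drop":  the producer discards a message if the queue is full.
    Returns (received, dropped).
    """
    if policy not in ("block", "drop"):
        raise ValueError(f"unknown policy: {policy!r}")

    channel = queue.Queue(maxsize=capacity)
    received = []
    dropped = 0
    sentinel = object()  # end-of-stream marker

    def consumer():
        while True:
            msg = channel.get()
            if msg is sentinel:
                return
            time.sleep(consume_delay)  # emulate a sink slower than the source
            received.append(msg)

    worker = threading.Thread(target=consumer)
    worker.start()

    for i in range(n_messages):
        if policy == "block":
            channel.put(i)  # blocks until the consumer frees a slot
        else:
            try:
                channel.put_nowait(i)  # never blocks
            except queue.Full:
                dropped += 1  # message is lost

    channel.put(sentinel)  # blocking put, so the marker itself is never dropped
    worker.join()
    return received, dropped


if __name__ == "__main__":
    for policy in ("block", "drop"):
        received, dropped = run_chain(50, policy)
        print(policy, "received:", len(received), "dropped:", dropped)
```

With `"block"` every message arrives, at the cost of the source running at the speed of the sink. With `"drop"` the source runs at full speed and the number of lost messages depends on timing.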

aferrero · January 27, 2021, 8:58am

@laphecet of course, if we want to compare the output of a given chain with a reference file, then the order in which the messages are processed is also important. I think we would ideally need three options:

100% of messages processed sequentially, for debugging and cross-checks
100% of messages processed in arbitrary order, for data analysis, where the order of processing of the TFs should not matter as long as at the end we are sure to have analyzed 100% of the data
DPL message dropping, for online processing, if back-pressure cannot be avoided otherwise
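
To tell which of these three situations a given run actually falls into, one could compare the identifiers sent by the source with those seen by the sink. A minimal sketch in Python follows. The function name `classify_delivery` and the status strings are hypothetical, and it assumes each message carries a unique identifier (for instance a TF counter), which is not something DPL is claimed to provide here.

```python
def classify_delivery(sent_ids, received_ids):
    """Compare what the source sent with what the sink received.

    Assumes identifiers are unique. Returns (status, missing) where status is
    "in_order", "reordered" or "lossy", and missing is the sorted list of
    identifiers that never reached the sink.
    """
    sent = list(sent_ids)
    received = list(received_ids)

    missing = sorted(set(sent) - set(received))
    if missing:
        return "lossy", missing      # some messages were dropped
    if received == sent:
        return "in_order", []        # everything arrived, sequentially
    return "reordered", []           # everything arrived, order was mixed


if __name__ == "__main__":
    print(classify_delivery([0, 1, 2, 3], [0, 1, 2, 3]))
    print(classify_delivery([0, 1, 2, 3], [2, 0, 3, 1]))
    print(classify_delivery([0, 1, 2, 3], [0, 3]))
```

A check of this kind separates the two issues discussed above: a "reordered" result is acceptable for analysis of independent TFs, while "lossy" means data was missed.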

@bvonhall I absolutely agree with you…