Issue in QC multinode setup with DPL EOS

sheckel · June 17, 2021, 5:45pm

Dear all,

while running tests with the QC multinode setup on local machines to investigate another issue, we have run into a new problem. We run two processes on the same physical machine, one as local and one as remote, using a common JSON file, as usual in a multinode setup. With this JSON, we run two QC tasks:

The first one (PIDlocal) uses TPC tracks; the QC task runs locally and sends its output to the remote.
The second one (Clusters) runs only the data sampling locally; the QC task itself runs on the remote.

Such a setup was working fine locally until a few days ago. Now, with alidist, O2 and QualityControl from today, the PIDlocal-proxy from the remote process crashes with the following messages (in debug mode):

[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854153     DPL: 0: QC/PIDlocal-mo/1 part 0 of 1  payload 456224
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854174     DPL: processing QC/PIDlocal-mo/1 time slice 0 part 0 of 1
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854190     DPL: matching: QC/PIDlocal-mo/1 to route QC/PIDlocal-mo/1
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854206     DPL: associating part with index 0 to channel from_PIDlocal-proxy_to_MERGER-PIDlocal1l-0 (2)
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854217     DPL: sending 2 messages on from_PIDlocal-proxy_to_MERGER-PIDlocal1l-0
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854576     DPL: Data received from server
[10800:MERGER-PIDlocal1l-0]: 2021-06-17 16:49:05.854598     DPL: socket polled UV_READABLE: from_PIDlocal-proxy_to_MERGER-PIDlocal1l-0
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854644     DPL: Correctly handshaken websocket connection.
[10797:PIDlocal-proxy]: [16:49:05][INFO] Correctly handshaken websocket connection.
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855250     DPL: 0: DPL/EOS/0 part 0 of 1  payload 0
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855261     DPL: processing DPL/EOS/0 time slice 1 part 0 of 1
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855271     DPL: matching: DPL/EOS/0 to route QC/PIDlocal-mo/1
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855280     DPL: matching: DPL/EOS/0 to route QC/PIDlocal-mo/2
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855292  !  Warning - DPL: can not find matching channel, not able to adopt DPL/EOS/0
[10797:PIDlocal-proxy]: [16:49:05][WARN] can not find matching channel, not able to adopt DPL/EOS/0
[10797:PIDlocal-proxy]: [16:49:05][ERROR] Exception caught: No matching filter rule for input data 
[10797:PIDlocal-proxy]:    DPL/EOS/0
[10797:PIDlocal-proxy]:  Add appropriate matcher(s) to dataspec definition or allow to drop unmatched data 
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.856241     DPL: State transition requested
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.856324     DPL: Exception caught in ProcessWork(), going to Error state and rethrowing
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.856400     DPL: Shutting down Plugin Manager
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855514 !!! Error - DPL: Exception caught: No matching filter rule for input data 
[10797:PIDlocal-proxy]:    DPL/EOS/0
[10797:PIDlocal-proxy]:  Add appropriate matcher(s) to dataspec definition or allow to drop unmatched data

Afterwards, the remote still publishes the PIDlocal histograms and continues to publish the same histograms each cycle, but it does not accept any new input data. The Clusters task, however, still works as expected, accepting new input data and adding it to the existing histograms.

The warning and the error originate from alice/O2/Framework/Core/src/ExternalFairMQDeviceProxy.cxx:

Warning "can not find matching channel, not able to adopt DPL/EOS/0", line 247
Error "No matching filter rule for input data", line 295

We tried commenting out the line that throws the error, and even the complete if-statement, to see what happens if the condition behind the warning is simply ignored, but this leads to a crash even before the PID plots are published for the first time.

As far as we know, the handling of EndOfStream has changed recently. If a QC task forwards an EOS to the remote, which does not know how to handle it, this could be the cause of the error. The Clusters task may still work because the EOS message is simply lost in the data sampling.

Some more info: We work on Ubuntu 18.04, and both Thomas and I observe the same behaviour. Should it be of any relevance, the aliBuild version is 1.8.0.

We would be grateful for any help on this issue! Let me also ping @richterm, who might have some insight.

Cheers,
Stefan

richterm · June 17, 2021, 6:37pm

Hello Stefan,
I think it's a side effect of the recent changes to propagate end of stream. It seems that your workflow combines the DPL output proxy with the input/raw proxy. I have added forwarding of EOS to the output proxy; the message is labeled DPL/EOS/0. If you can configure the input proxy so that it also accepts this data, you can avoid the exception. If o2-dpl-raw-proxy is used, there is also the '--throwOnUnmatched' option; setting it to false also avoids the exception.
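
To illustrate the two workarounds mentioned above, here is a minimal, self-contained C++ sketch of the matching behaviour seen in the log. It is not the O2 implementation: Route, findRoute and handleMessage are hypothetical names made up for this example, and headers are compared as plain strings rather than through real DPL data specs.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a proxy route, written as "origin/description/subspec".
struct Route {
  std::string spec;
};

// Returns the index of the first route whose spec equals the message header,
// or -1 if no route matches.
int findRoute(const std::vector<Route>& routes, const std::string& header)
{
  for (std::size_t i = 0; i < routes.size(); ++i) {
    if (routes[i].spec == header) {
      return static_cast<int>(i);
    }
  }
  return -1;
}

// Returns true if the message can be forwarded on a matching route.
// An unmatched message either throws (mirroring the error in the log)
// or is silently dropped, depending on throwOnUnmatched.
bool handleMessage(const std::vector<Route>& routes, const std::string& header,
                   bool throwOnUnmatched)
{
  if (findRoute(routes, header) >= 0) {
    return true;
  }
  if (throwOnUnmatched) {
    throw std::runtime_error("No matching filter rule for input data " + header);
  }
  return false;
}
```

With only the routes QC/PIDlocal-mo/1 and QC/PIDlocal-mo/2 configured, as in the log, a DPL/EOS/0 message throws when throwOnUnmatched is true and is dropped when it is false. Adding a route for DPL/EOS/0 makes the message match, which corresponds to configuring the input proxy to accept it.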

In any case, this needs to be handled correctly. Please open a ticket on Jira and post your message together with the versions of the O2 and QC packages and the commands you are executing, and attach the workflow configuration files. Then we can continue the discussion in Jira. You might not have all the information, so add what you have. The most important information is the workflow configuration.

Best regards
Matthias

sheckel · June 18, 2021, 8:08am

Hi Matthias,

thanks a lot for your reply! I have opened a Jira ticket to follow up on this topic.

Cheers,
Stefan