Dear all,
While running tests with the QC multinode setup on local machines to investigate another issue, we encountered a new one. We run two processes on the same physical machine, one as local and one as remote, using a common JSON file, as usual in a multinode setup. With this JSON, we run two QC tasks:
- The first one (PIDlocal) uses TPC tracks; the QC task runs locally and sends its output to the remote.
- The second one (Clusters) runs only the data sampling locally; the QC task itself runs on the remote.
This setup was working fine locally until a few days ago. Now, with alidist, O2 and QualityControl from today, the PIDlocal-proxy from the remote process crashes with the following messages (in debug mode):
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854153 DPL: 0: QC/PIDlocal-mo/1 part 0 of 1 payload 456224
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854174 DPL: processing QC/PIDlocal-mo/1 time slice 0 part 0 of 1
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854190 DPL: matching: QC/PIDlocal-mo/1 to route QC/PIDlocal-mo/1
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854206 DPL: associating part with index 0 to channel from_PIDlocal-proxy_to_MERGER-PIDlocal1l-0 (2)
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854217 DPL: sending 2 messages on from_PIDlocal-proxy_to_MERGER-PIDlocal1l-0
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854576 DPL: Data received from server
[10800:MERGER-PIDlocal1l-0]: 2021-06-17 16:49:05.854598 DPL: socket polled UV_READABLE: from_PIDlocal-proxy_to_MERGER-PIDlocal1l-0
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.854644 DPL: Correctly handshaken websocket connection.
[10797:PIDlocal-proxy]: [16:49:05][INFO] Correctly handshaken websocket connection.
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855250 DPL: 0: DPL/EOS/0 part 0 of 1 payload 0
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855261 DPL: processing DPL/EOS/0 time slice 1 part 0 of 1
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855271 DPL: matching: DPL/EOS/0 to route QC/PIDlocal-mo/1
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855280 DPL: matching: DPL/EOS/0 to route QC/PIDlocal-mo/2
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855292 ! Warning - DPL: can not find matching channel, not able to adopt DPL/EOS/0
[10797:PIDlocal-proxy]: [16:49:05][WARN] can not find matching channel, not able to adopt DPL/EOS/0
[10797:PIDlocal-proxy]: [16:49:05][ERROR] Exception caught: No matching filter rule for input data
[10797:PIDlocal-proxy]: DPL/EOS/0
[10797:PIDlocal-proxy]: Add appropriate matcher(s) to dataspec definition or allow to drop unmatched data
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.856241 DPL: State transition requested
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.856324 DPL: Exception caught in ProcessWork(), going to Error state and rethrowing
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.856400 DPL: Shutting down Plugin Manager
[10797:PIDlocal-proxy]: 2021-06-17 16:49:05.855514 !!! Error - DPL: Exception caught: No matching filter rule for input data
[10797:PIDlocal-proxy]: DPL/EOS/0
[10797:PIDlocal-proxy]: Add appropriate matcher(s) to dataspec definition or allow to drop unmatched data
Afterwards, the remote still publishes the PIDlocal histograms and continues to publish the same histograms each cycle, but it does not accept any new input data. The Clusters task, however, still works as expected, accepting new input data and adding it to the existing histograms.
The warning and the error originate from alice/O2/Framework/Core/src/ExternalFairMQDeviceProxy.cxx:
- Warning "can not find matching channel, not able to adopt DPL/EOS/0", line 247
- Error "No matching filter rule for input data", line 295
We tried commenting out the line that throws the error, and even the complete if-statement, to see what happens when the corresponding warning is simply ignored, but this leads to a crash even before the PID plots are published for the first time.
As far as we know, there have been recent changes in the handling of EndOfStream. If a QC task now forwards an EOS to the remote, which does not know how to handle it, this could be the cause of the error. The Clusters task might keep working because the EOS message is simply lost in the data sampling.
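To make our suspicion concrete, here is a toy model in Python of what the log seems to show. It is not the code from ExternalFairMQDeviceProxy.cxx: the names Header, forward, NoMatchingRoute and drop_unmatched are made up for illustration, and the channel for the second route is a placeholder because it does not appear in our log. The sketch only assumes that an incoming header is compared against the configured routes and that an unmatched header raises unless dropping is allowed.

```python
# Toy model of the route matching suggested by the log; NOT the O2 implementation.
# All names here (Header, forward, NoMatchingRoute, drop_unmatched) are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    origin: str
    description: str
    sub_spec: int

    def __str__(self) -> str:
        return f"{self.origin}/{self.description}/{self.sub_spec}"


class NoMatchingRoute(RuntimeError):
    """Stands in for the 'No matching filter rule for input data' exception."""


def forward(header, routes, drop_unmatched=False):
    """Return the output channel for `header`, or None if it was dropped."""
    for route_header, channel in routes.items():
        if route_header == header:
            return channel
    if drop_unmatched:
        return None  # unmatched data is silently discarded
    raise NoMatchingRoute(f"No matching filter rule for input data {header}")


# The two routes visible in the log. The second channel name is a placeholder.
routes = {
    Header("QC", "PIDlocal-mo", 1): "from_PIDlocal-proxy_to_MERGER-PIDlocal1l-0",
    Header("QC", "PIDlocal-mo", 2): "placeholder-channel-for-subspec-2",
}

if __name__ == "__main__":
    # A monitor object matches its route and is forwarded.
    print(forward(Header("QC", "PIDlocal-mo", 1), routes))
    # An EOS header has no route, so the proxy-like function raises.
    try:
        forward(Header("DPL", "EOS", 0), routes)
    except NoMatchingRoute as err:
        print("error:", err)
```

In this picture, the PIDlocal proxy fails because the forwarded DPL/EOS/0 reaches a device whose routes only cover QC/PIDlocal-mo, whereas in the Clusters chain the EOS would never arrive at such a proxy in the first place.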
Some more info: We work on Ubuntu 18.04, and both Thomas and myself observe the same behaviour. Should it be of any relevance, the aliBuild version is 1.8.0.
We would be grateful for any help on this issue! Let me also ping @richterm, who might have some insight.
Cheers,
Stefan