We are wondering about the dataSamplingPolicies for QC tasks.
If I understood correctly, to use a dataSamplingPolicy, one needs to specify it in the config file of a task and then set it as the dataSource.
E.g. QualityControl/Modules/TPC/run/tpcQCPID_sampled.json (can be found in QualityControl master)
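For readers without the repository at hand, a policy entry generally looks like the following sketch (the id, query and fraction are illustrative, not copied from the actual tpcQCPID_sampled.json; the layout assumes the usual Data Sampling config schema):

```json
{
  "dataSamplingPolicies": [
    {
      "id": "tpcclusters",
      "active": "true",
      "query": "clusters:TPC/CLUSTERNATIVE",
      "samplingConditions": [
        {
          "condition": "random",
          "fraction": "0.1",
          "seed": "1234"
        }
      ],
      "blocking": "false"
    }
  ]
}
```

The task then refers to the policy by name in its dataSource, e.g. `"dataSource": { "type": "dataSamplingPolicy", "name": "tpcclusters" }`.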
For TPC/TRACKS it does not matter which fraction we set; all tracks always go through.
In the terminal output one can see that a Dispatcher is set up with the proper policyName. A log file of running the above config can be found here.
In the case of TPC/CLUSTERNATIVE, no data goes through the sampling, regardless of the fraction we set.
In both cases everything works as expected when the sampling is bypassed via e.g.
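For comparison, a direct (unsampled) dataSource would look roughly like this, assuming the usual QC task config schema (the query name is illustrative):

```json
"dataSource": {
  "type": "direct",
  "query": "clusters:TPC/CLUSTERNATIVE"
}
```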
This makes me think that the issue is with the data being sampled. How are you getting these tracks or clusters? If this is some file reader, then maybe it sends all tracks/clusters within one message? Data Sampling can sample messages, but it doesn’t inspect their content. It just chooses to accept or reject the full message, but it won’t randomly extract e.g. 10% of tracks in a vector of tracks.
Yes, exactly. Maybe this file reader has some option to split these tracks into several messages?
I assume that during the processing we will get them in multiple messages. Otherwise, we could think of some extension which allows for customized sampling (creating a new message based on the original one instead of copying it). Though given the amount of work we have now, I don’t think I could implement it sooner than in a few months.
It does not, I think, but this is not much of a problem for now.
The more serious thing is that we do not get TPC/CLUSTERNATIVE through the data sampling at all. (afaik the same goes for TPC/RAWDATA)
If one runs e.g. the cluster workflow (QualityControl/Modules/TPC/run/tpcQCClusters_direct.json)
With the policy
while taking the clusters either from the o2-tpc-reco-workflow or from the file reader, no clusters arrive at the QC-TASK-RUNNER-Clusters. The published MOs are empty.
Another interesting thing we observe: if we run the cluster task in a multinode setting (within one machine, adding localhost to machines and a proper port), the data sent by the file reader does actually arrive at the data proxy and the monitorData function of the task is triggered, but there is no data present, meaning the output MOs are still empty.
[30397:tpc-native-cluster-reader]: [16:30:15][INFO] branch TPCClusterNative: publishing binary chunk of 32691520 bytes(s)
Regarding the problem with the clusters alone, I would check whether they aren't also sent in one message. In that case it could be that the first message is usually discarded, due to the sequence of pseudo-random numbers generated in Data Sampling.
Regarding the problem with Data Sampling + multi node setup - I don’t know. We actually test it with a localhost configuration, so I would expect it to work. Could you provide more information on how you execute it, some logs and the configuration file?
For the log file for the local part I used the first option, getting the data from the reconstruction.
The log file of the remote task can be found here.
The log file of the local task can be found here.
I investigated more and found where the problem lies. However, I do not know why this happens nor how to fix it.
In the clusterHandler (QualityControl/Modules/TPC/src/Utility.cxx) that “prepares” the CLUSTERNATIVE coming from the DPL, the loop over InputRecordWalker(input, filter) (QualityControl/Utility.cxx at master · AliceO2Group/QualityControl · GitHub), where input is an o2::framework::InputRecord (obtained from the ProcessingContext), is not entered when data sampling is used. If data sampling is bypassed, the loop is entered.
The problem is not that the data is missing, but that for some reason the filter
no longer matches any data. Again: the same works fine when bypassing the data sampling. When the filter passed to the InputRecordWalker is left empty, everything works without problems and the MOs are filled again.
What we also see is that both ctx.inputs().size() and ctx.inputs().countValidInputs() are the same whether data sampling is used or bypassed: size is 2 and valid inputs is 1.
For the record: I know that there is also the WorkflowHelper that can (and probably should) be used instead of the clusterHandler but in both cases there is exactly the same problem.
I see that this might be a problem on our side, but do you have an idea what could change in the InputSpec when the data runs through sampling, so that the filter no longer matches? And do you know how I could check the specs of the available data in the DPL, so that I can find out what the spec of TPC/CLUSTERNATIVE actually is after sampling?
Thanks again for your help, we managed to get the clusters through the data sampling when running locally, and also when running the local part on one EPN and the QC task on another (acting as the remote machine).
But now we have a new issue: when running the local part distributed over many EPNs, using them only to send the data through the data sampling to the merger on the “remote” EPN, where the QC task is supposed to run, we get a crash, because for each sector of the TPC only one data sample can be sent/processed per time frame. In principle, each EPN should handle different time frames, so this should work, but unfortunately it does not. Our best guess at the moment is that the information on the time frame is lost within the data sampling. Can you confirm this? Is there a way to preserve the time frame information? Or do you maybe have another idea of what might have gone wrong?
Thanks for any further help! Maybe here also @bvonhall could have a look.
I can’t quite follow you. Could you explain what is crashing and why exactly?
In principle Data Sampling/Dispatcher should forward all the additional headers on the header stack, if this is what you mean by the “time frame information”. Though maybe something is lost in proxies which interconnect the two parts of the workflow. Is the data lost if Dispatcher and QC Task run on the same machine?
Thanks for your reply, and sorry, I think I was not very clear.
There is a protection inside the TPC cluster handler which does not accept more than one set of data per TPC sector within one time frame. The usual, expected behaviour is: each EPN processes full time frames, so it sends data of specific time frames not handled by any other EPN. In this case, processing many time frames on many EPNs should work, because even if the merger gets data from several EPNs at the same time, all of them with data in the same sector, the data originates from different time frames.
Now, if the information on the time frame itself, i.e. which time frame it is, were lost during the data sampling, the merger would get data on the same sector from several EPNs without knowing that they come from different time frames, and hence the protection described above would lead to exactly the crash we observe.
That’s why we now think, the actual information, to which time frame the data belongs, might be lost in the data sampling.
Thanks for the clarification. Do you know perhaps where this information about the time frame is contained? In the message payload, in some specific header? I am trying to understand which piece of information exactly is lost on the way.
Also repeating the question from my previous comment: Is the data lost if Dispatcher and QC Task run on the same machine?
Hm, unfortunately I don’t know where exactly the information on the time frame is stored. But I can try to find out.
Running one local part (i.e. the dispatcher) and one remote part (i.e. QC task) on the same machine works, but this is expected to work in any case, also without this information. I am not sure whether we tried with multiple local parts on one machine for the cluster task. We will try this (but likely only at the beginning of next week).
What is meant here is that we expect the data of one time frame to be processed in one call of the processing function (monitorData) in case of QC.
The cluster data in our QC task is accessed here:
The exception is then thrown here:
So it seems that one ProcessingContext contains data of multiple time frames. Could this be the case? As far as I know, we didn’t see the problem when the data is sent by only one process. We saw it when multiple EPNs sample data and send it to the merger node.
Thanks Jens, this helped me a lot to understand the issue you are dealing with… though I don’t know why it could happen. In principle, QC, or any other DPL device should not get more than one timeframe at a time, because DPL matches timeslices based on DataProcessingHeader::startTime (unless something has changed). Also, I don’t expect the DPL nor QC framework to join any messages into multi-parts, but again, maybe I am not up to date with DPL. You should get these samples sequentially in your QC task, to my understanding.
I see the following possibilities to rule out:
Dispatcher does not copy o2::tpc::TPCSectorHeader correctly and there is some random rubbish there, but it is only detected when you deal with more data at the same time. Not likely, because the headers use magic strings as a validation ("TPCSectH" in your case), but maybe the part of header which is not part of BaseHeader is wrong.
Timeframes do not have unique IDs (unique to DPL) and they are matched to the same timeslice. This could happen if your query is something like TPC/CLUSTERNATIVE/0;TPC/CLUSTERNATIVE/1;TPC/CLUSTERNATIVE/2;... and EPNs running in parallel use separate subspecs but common IDs. Maybe it could also happen if your query is TPC/CLUSTERNATIVE, but I am not sure.
Input or output proxies do some weird aggregation or overwrite some headers. It doesn’t seem so, looking quickly at the code.
There is a rare bug in your QC task or the reco code, which you cannot see with one EPN because it’s rare, but you see it with many EPNs, because the probability increases. Though looking at the QC task, I don’t see anything obvious.
Message parts get duplicated somewhere or iterated over twice.
At this point, I think I can help you only if I get a reproducer (if this is doable, a setup with files should suffice). Otherwise, maybe I would spot something obvious in the logs and/or the QC config file.
Thanks a lot for your fast reply and all these thoughts. We’ll try to set up something simple that can reproduce this. We’ll also add some debugging to our code to better trace what might go wrong, and come back to you soon.