We are wondering about the dataSamplingPolicies for QC tasks.
If I understood correctly, to use a dataSamplingPolicy, one needs to specify it in the config file of a task and then set it as the dataSource.
E.g. QualityControl/Modules/TPC/run/tpcQCPID_sampled.json (can be found in QualityControl master)
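For readers without the repository at hand, a policy entry generally looks like the following sketch (the id, query and fraction are illustrative, not copied from the actual tpcQCPID_sampled.json; the layout assumes the usual Data Sampling config schema):

```json
{
  "dataSamplingPolicies": [
    {
      "id": "tpcclusters",
      "active": "true",
      "query": "clusters:TPC/CLUSTERNATIVE",
      "samplingConditions": [
        {
          "condition": "random",
          "fraction": "0.1",
          "seed": "1234"
        }
      ],
      "blocking": "false"
    }
  ]
}
```

The task then refers to the policy by name in its dataSource, e.g. `"dataSource": { "type": "dataSamplingPolicy", "name": "tpcclusters" }`.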
For TPC/TRACKS it does not matter which fraction we set; all tracks always go through.
In the terminal output one can see that a Dispatcher is set up with the proper policyName. A log file of running the above config can be found here.
In the case of TPC/CLUSTERNATIVE, no data goes through the sampling, regardless of the fraction we set.
In both cases everything works as expected when the sampling is bypassed via e.g.
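For comparison, a direct (unsampled) dataSource would look roughly like this, assuming the usual QC task config schema (the query name is illustrative):

```json
"dataSource": {
  "type": "direct",
  "query": "clusters:TPC/CLUSTERNATIVE"
}
```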
This makes me think that the issue is with the data being sampled. How are you getting these tracks or clusters? If this is some file reader, then maybe it sends all tracks/clusters within one message? Data Sampling can sample messages, but it doesn’t inspect their content. It just chooses to accept or reject the full message, but it won’t randomly extract e.g. 10% of tracks in a vector of tracks.
Yes, exactly. Maybe this file reader has some option to split these tracks into several messages?
I assume that during the processing we will get them in multiple messages. Otherwise, we could think of some extension which allows for customized sampling (creating a new message based on the original one instead of copying it). Though given the amount of work we have now, I don’t think I could implement it sooner than in a few months.
It does not, I think, but this is not much of a problem for now.
The more serious thing is that we do not get TPC/CLUSTERNATIVE through the data sampling at all. (afaik the same goes for TPC/RAWDATA)
If one runs e.g. the cluster workflow (QualityControl/Modules/TPC/run/tpcQCClusters_direct.json)
With the policy
while taking the clusters either from the o2-tpc-reco-workflow or from the file reader, no clusters arrive at the QC-TASK-RUNNER-Clusters. The published MOs are empty.
Another interesting thing we observe: if we run the cluster task in a multinode setting (within one machine, adding localhost to machines and a proper port), the data sent by the file reader does actually arrive at the data proxy and the monitorData function of the task is triggered, but there is no data present, meaning the output MOs are still empty.
[30397:tpc-native-cluster-reader]: [16:30:15][INFO] branch TPCClusterNative: publishing binary chunk of 32691520 bytes(s)
Regarding the problem with the clusters alone, I would check whether they aren't also sent in one message. In that case it could be that the first message is usually discarded, due to the sequence of pseudo-random numbers generated in Data Sampling.
Regarding the problem with Data Sampling + multi node setup - I don’t know. We actually test it with a localhost configuration, so I would expect it to work. Could you provide more information on how you execute it, some logs and the configuration file?
For the log file for the local part I used the first option, getting the data from the reconstruction.
The log file of the remote task can be found here.
The log file of the local task can be found here.
I investigated more and found where the problem lies. However, I do not know why this happens nor how to fix it.
In the clusterHandler (QualityControl/Modules/TPC/src/Utility.cxx) that “prepares” the CLUSTERNATIVE coming from the DPL, the loop over InputRecordWalker(input, filter) (QualityControl/Utility.cxx at master · AliceO2Group/QualityControl · GitHub), where input is an o2::framework::InputRecord (obtained from the ProcessingContext), is not entered when data sampling is used. If data sampling is bypassed, the loop is entered.
The problem is not that the data is missing, but that for some reason the filter
no longer matches any data. Again: the same works fine when bypassing the data sampling. When the filter passed to the InputRecordWalker is left empty, everything works without problems and the MOs are filled again.
What we also see is that both ctx.inputs().size() and ctx.inputs().countValidInputs() are the same whether data sampling is used or bypassed: size is 2 and valid inputs is 1.
For the record: I know that there is also the WorkflowHelper that can (and probably should) be used instead of the clusterHandler but in both cases there is exactly the same problem.
I see that this might be a problem on our side, but do you have an idea what could change in the InputSpec when the data runs through sampling, so that the filter no longer matches? And do you know how I could check the specs of the available data in the DPL, so that I can find out what the spec of TPC/CLUSTERNATIVE actually is after sampling?
Thanks again for your help, we managed to get the clusters through the data sampling when running locally, and also when running the local part on one EPN and the QC task on another (acting as the remote machine).
But now we have a new issue: when running the local part distributed over many EPNs, using them only to send the data through the data sampling to the merger on the “remote” EPN, where the QC task is supposed to run, we get a crash, because for each sector of the TPC only one data sample can be sent/processed per time frame. In principle, each EPN should handle different time frames, so this should work, but unfortunately it does not. Our best guess at the moment is that the information on the time frame is lost within the data sampling. Can you confirm this? Is there a way to preserve the time frame information? Or do you maybe have another idea of what might have gone wrong?
Thanks for any further help! Maybe here also @bvonhall could have a look.
I can’t quite follow you. Could you explain what is crashing and why exactly?
In principle Data Sampling/Dispatcher should forward all the additional headers on the header stack, if this is what you mean by the “time frame information”. Though maybe something is lost in proxies which interconnect the two parts of the workflow. Is the data lost if Dispatcher and QC Task run on the same machine?
Thanks for your reply, and sorry, I think I was not very clear.
There is a protection inside the TPC cluster handler which does not accept more than one set of data per TPC sector within one time frame. The usual, expected behaviour is: each EPN processes full time frames, so it sends data of specific time frames not handled by any other EPN. In this case, processing many time frames on many EPNs should work, because even if the merger gets data from several EPNs at the same time, all of them with data in the same sector, the data originates from different time frames.
Now, if the information on the time frame itself, i.e. which time frame it is, were lost during the data sampling, the merger would get data on the same sector from several EPNs without knowing that they come from different time frames, and hence the protection described above would lead to exactly the crash we observe.
That’s why we now think, the actual information, to which time frame the data belongs, might be lost in the data sampling.
Thanks for the clarification. Do you know perhaps where this information about the time frame is contained? In the message payload, in some specific header? I am trying to understand which piece of information exactly is lost on the way.
Also repeating the question from my previous comment: Is the data lost if Dispatcher and QC Task run on the same machine?
Hm, unfortunately I don’t know where exactly the information on the time frame is stored. But I can try to find out.
Running one local part (i.e. the dispatcher) and one remote part (i.e. QC task) on the same machine works, but this is expected to work in any case, also without this information. I am not sure whether we tried with multiple local parts on one machine for the cluster task. We will try this (but likely only at the beginning of next week).
What is meant here is that we expect the data of one time frame to be processed in one call of the processing function (monitorData) in case of QC.
The cluster data in our QC task is accessed here:
The exception is then thrown here:
So it seems that one ProcessingContext contains data of multiple time frames. Could this be the case? As far as I know, we didn’t see the problem when the data is sent by only one process. We saw it when multiple EPNs sample data and send it to the merger node.
Thanks Jens, this helped me a lot to understand the issue you are dealing with… though I don’t know why it could happen. In principle, QC, or any other DPL device should not get more than one timeframe at a time, because DPL matches timeslices based on DataProcessingHeader::startTime (unless something has changed). Also, I don’t expect the DPL nor QC framework to join any messages into multi-parts, but again, maybe I am not up to date with DPL. You should get these samples sequentially in your QC task, to my understanding.
I see the following possibilities to rule out:
Dispatcher does not copy o2::tpc::TPCSectorHeader correctly and there is some random rubbish there, but it is only detected when you deal with more data at the same time. Not likely, because the headers use magic strings as a validation ("TPCSectH" in your case), but maybe the part of header which is not part of BaseHeader is wrong.
Timeframes do not have unique IDs (unique to DPL) and they are matched to the same timeslice. This could happen if your query is something like TPC/CLUSTERNATIVE/0;TPC/CLUSTERNATIVE/1;TPC/CLUSTERNATIVE/2;... and EPNs running in parallel use separate subspecs but common IDs. Maybe it could also happen if your query is TPC/CLUSTERNATIVE, but I am not sure.
Input or output proxies do some weird aggregation or overwrite some headers. It doesn’t seem so, looking quickly at the code.
There is a rare bug in your QC task or the reco code, which you cannot see with one EPN because it’s rare, but you see it with many EPNs, because the probability increases. Though looking at the QC task, I don’t see anything obvious.
Message parts get duplicated somewhere or iterated over twice.
At this point, I think I can help you only if I get a reproducer (if this is doable, a setup with files should suffice). Otherwise, maybe I would spot something obvious in the logs and/or the QC config file.
Thanks a lot for your fast reply and all these thoughts. We’ll try to set up something simple that can reproduce this. We’ll also add some debugging to our code to better trace what might go wrong, and come back to you soon.