Resetting of histograms in QC tasks

tklemenz · June 8, 2021, 12:47pm

Hi,

I have questions concerning the resetting of histograms within a QC task, especially in a multinode setup.

We are running local QC tasks on the EPNs (mergingMode delta) and a remote task which merges the MOs from the local tasks. The QC tasks implement a function which resets the MOs (just a loop calling Reset() on the histograms). We tried calling it from the reset() function of the QC task, but the output on the QCG just keeps piling up data, and the number of entries increases with every cycle. Calling it from startOfCycle() gives the same result. When we call it from endOfCycle(), we get empty histograms on the QCG (which is expected, I would say).

Now I am wondering why we never see only the data from one cycle, but always the sum over all cycles so far. Could it be that the merger itself somehow adds up the data? What exactly is the merger process doing with the data? Does it take the MOs, merge all those with the same name that are received within one cycle, publish, and then reset/delete the objects so that the next cycle is completely fresh?

Also: Is it possible that the remote merger gets data from one local process multiple times within one cycle?

Summarizing: What we want to see on the QCG is histograms containing only the data that arrived within one cycle, rather than the sum over all cycles.

I guess @pkonopka is the expert here.
Please let me know if you need more information/code snippets.

Cheers,
Thomas

pkonopka · June 8, 2021, 2:33pm

Hi,
Short answer: it’s not a bug, it’s a feature.
The generally expected behaviour for “delta” mode is that parallel QC Tasks send deltas (differences) to Mergers, which accumulate them and publish the most recent complete version of the objects. The Merger never resets. We use deltas because merging them requires less CPU, memory and throughput in the case of sparse histograms and trees. It also makes tracking their provenance easier: if we were to receive entire objects, we would have to know which one should update which previous one. This is what "entire" mode does.

To show only the data from one cycle, we would have to implement a periodic Reset in Mergers, which was noted down some time ago:
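
To illustrate the accumulation, here is a minimal Python sketch. It is not the O2 Merger code; `DeltaTask` and `AccumulatingMerger` are hypothetical stand-ins, and a histogram is modelled as a plain bin-to-entries counter. It assumes each task ships only what it filled since its last publication, and that the merger adds every delta to a running total it never clears.

```python
from collections import Counter


class DeltaTask:
    """Hypothetical stand-in for a local QC task in delta mode."""

    def __init__(self):
        self.hist = Counter()  # bin index -> number of entries

    def fill(self, bin_index, n=1):
        self.hist[bin_index] += n

    def end_of_cycle(self):
        # Ship only what was filled during this cycle, then start fresh.
        delta = self.hist
        self.hist = Counter()
        return delta


class AccumulatingMerger:
    """Hypothetical stand-in for a merger that never resets."""

    def __init__(self):
        self.merged = Counter()

    def receive(self, delta):
        self.merged.update(delta)  # Counter.update adds the counts

    def publish(self):
        return dict(self.merged)


task_a, task_b = DeltaTask(), DeltaTask()
merger = AccumulatingMerger()
entries_per_cycle = []
for cycle in range(3):
    task_a.fill(0, 5)
    task_b.fill(0, 3)
    task_b.fill(1, 2)
    merger.receive(task_a.end_of_cycle())
    merger.receive(task_b.end_of_cycle())
    entries_per_cycle.append(sum(merger.publish().values()))

print(entries_per_cycle)  # [10, 20, 30]
```

In this model each task starts every cycle empty, yet the published number of entries still grows by 10 per cycle, because the sum lives in the merger and not in the tasks.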
https://alice.its.cern.ch/jira/browse/QC-134
I will increase the priority of the ticket.

Just to add, I don’t think it is realistic to assume you will reliably get objects from exactly one cycle in an asynchronous system like O2. Some of them will arrive at the Mergers later, some sooner, so they might not always appear in the right “merged cycle”. If we were to wait for all of the sources before publishing, we could get an object from one source twice before getting any from another one. Which leads us to your other question:

Yes, the remote Merger can get data from one local process multiple times within one cycle, in the case mentioned above and hopefully clarified by my graph below:

Task 1 objs arrive :    |       |       |       |
Task 2 objs arrive :  |       |         |     |
Merger cycle       :   |       |       |       |
Time               :  ----------------------------->

Thus you could perhaps keep the Merger Reset interval a few times longer than the QC task cycle to minimize the fluctuations.
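
The diagram above can be turned into a small counting exercise. This is only a Python sketch: the times are read off the column positions of the diagram (arbitrary units), and `arrivals_per_window` is a hypothetical helper, not part of the QC framework. It counts how many objects each task delivers between two consecutive merger publications.

```python
from bisect import bisect_left


def arrivals_per_window(arrivals, publish_times):
    """Count objects delivered per merger window.

    Window i covers (publish_times[i-1], publish_times[i]]; the first
    window has no lower bound. Arrivals after the last publication
    are ignored.
    """
    counts = [0] * len(publish_times)
    for t in arrivals:
        i = bisect_left(publish_times, t)  # first publication at or after t
        if i < len(publish_times):
            counts[i] += 1
    return counts


# Illustrative times taken from the diagram, arbitrary units.
merger_publishes = [3, 11, 19, 27]
task1_arrivals = [4, 12, 20, 28]
task2_arrivals = [2, 10, 20, 26]

task1_counts = arrivals_per_window(task1_arrivals, merger_publishes)
task2_counts = arrivals_per_window(task2_arrivals, merger_publishes)

print(task1_counts)  # [0, 1, 1, 1]
print(task2_counts)  # [1, 1, 0, 2]
```

With these numbers Task 2 contributes nothing to the third merger cycle and two objects to the fourth, although both tasks and the merger have nearly the same period. Averaging over several cycles before a reset smooths this out.
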
Cheers!

tklemenz · June 8, 2021, 3:08pm

Thanks a lot for the fast reply! This is a lot of useful information.

How would one set the merger cycle? AFAIK, what one sets in the config file with cycleDurationSeconds is the QC task cycle, and I thought that this is also the merger cycle. Can the merger cycle be adjusted independently?

Cheers,
Thomas

pkonopka · June 9, 2021, 6:53am

Currently the QC framework sets it automatically to the same cycle length as for the QC tasks. We will have to expose it in the config files, along with a setting to reset the merger after x cycles.
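
A rough Python sketch of what "reset the merger after x cycles" could mean. This is not the actual implementation: `ResettingMerger` and its `reset_after_cycles` parameter are hypothetical names, a histogram is modelled as a bin-to-entries counter, and treating 0 as "never reset" is an assumption made for the example.

```python
from collections import Counter


class ResettingMerger:
    """Hypothetical merger that clears its sum every N cycles."""

    def __init__(self, reset_after_cycles):
        # Assumption for this sketch: 0 means "never reset".
        self.reset_after_cycles = reset_after_cycles
        self.merged = Counter()
        self.cycles_since_reset = 0

    def receive(self, delta):
        self.merged.update(delta)  # add the delta to the running sum

    def end_of_cycle(self):
        # Publish first, then decide whether the next cycle starts empty.
        snapshot = dict(self.merged)
        self.cycles_since_reset += 1
        if 0 < self.reset_after_cycles <= self.cycles_since_reset:
            self.merged.clear()
            self.cycles_since_reset = 0
        return snapshot


merger = ResettingMerger(reset_after_cycles=3)
published = []
for cycle in range(6):
    merger.receive({0: 10})  # one delta with 10 entries per cycle
    published.append(sum(merger.end_of_cycle().values()))

print(published)  # [10, 20, 30, 10, 20, 30]
```

The published object then integrates over at most three cycles instead of the whole run.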

wiechula · June 9, 2021, 8:21am

Hi @pkonopka ,
I also think the resetting of histograms in the merger is important, unless there is a different mechanism foreseen to do trending over time.
As it is now, if I understand correctly, the merger simply accumulates over the full data-taking time. If I have a long run, say several hours, how would the trending work? I think doing trending over accumulated statistics does not make much sense. What one would like to see is something like the trend of the variables integrated over, let's say, 10 min each.

pkonopka · June 9, 2021, 8:40am

Yes, @wiechula, that is a fair point.

Btw, I am also trying to look a bit ahead. Do you think we should also expect cases where someone needs both 10-minute windows and a 10-hour integral at the end? We could just run the same QC Task with two different configurations to achieve that, but this is perhaps not optimal.

wiechula · June 9, 2021, 9:41am

Hi @pkonopka , but if one has the 10 min windows, one can always merge them at the end, right? So you could do it at the level of QCDB queries and even be flexible about the time windows, e.g. group in 30 min or 1 h steps.
What might be more interesting is to have it triggered by some flag. E.g. in low-IR running, 10 min might be too short statistics-wise. So one could say: accumulate in 10 min intervals if the statistics are sufficient, otherwise extend by another 10 min. Or merge afterwards, but then trending might not work out of the box, only as an afterburner.
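
Both ideas can be sketched in a few lines of Python. This is only an illustration: plain dicts stand in for the stored 10 min objects, no QCDB API is used, and `regroup` and `regroup_until` are hypothetical helper names.

```python
from collections import Counter


def regroup(windows, group_size):
    """Merge consecutive fixed-length windows into coarser ones."""
    grouped = []
    for start in range(0, len(windows), group_size):
        total = Counter()
        for window in windows[start:start + group_size]:
            total.update(window)  # add bin contents
        grouped.append(dict(total))
    return grouped


def regroup_until(windows, min_entries):
    """Extend each group window by window until it holds min_entries."""
    grouped, total = [], Counter()
    for window in windows:
        total.update(window)
        if sum(total.values()) >= min_entries:
            grouped.append(dict(total))
            total = Counter()
    if total:
        grouped.append(dict(total))  # trailing group, below the threshold
    return grouped


# Six 10 min windows of a single-bin histogram, made-up entry counts.
ten_min = [{0: n} for n in (4, 6, 5, 7, 3, 5)]

print(regroup(ten_min, 3))         # [{0: 15}, {0: 15}]  (30 min steps)
print(regroup(ten_min, 6))         # [{0: 30}]           (1 h step)
print(regroup_until(ten_min, 10))  # [{0: 10}, {0: 12}, {0: 8}]
```

The fixed grouping keeps equal time spans, while the threshold variant produces windows of varying length, which is why trending on it would need the actual time span stored alongside each merged object.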

pkonopka · June 9, 2021, 11:51am

Yes, it can be approached in many ways, if we actually need it. I just indicated how we could do it with existing tools.

I am afraid this might obscure the QC results, but it should be possible to do it with some effort.

tklemenz · July 14, 2021, 3:13pm

Hi @pkonopka, I was just looking into the moving window. I don’t know if I am using it correctly but it doesn’t seem to work the way I do it.

Here is the task config:

"tasks": {
  "Clusters": {
    "active": "true",
    "className": "o2::quality_control_modules::tpc::Clusters",
    "moduleName": "QcTPC",
    "detectorName": "TPC",
    "cycleDurationSeconds": "10",
    "maxNumberCycles": "-1",
    "resetAfterCycles": "87",
    "dataSource": {
      "type": "dataSamplingPolicy",
      "name": "random-clusters"
    },
    "taskParameters": {
      "NClustersNBins": "100",  "NClustersXMin": "0", "NClustersXMax": "100",
      "QmaxNBins":      "200",  "QmaxXMin":      "0", "QmaxXMax":      "200",
      "QtotNBins":      "600",  "QtotXMin":      "0", "QtotXMax":      "600",
      "SigmaPadNBins":  "200",  "SigmaPadXMin":  "0", "SigmaPadXMax":  "2",
      "SigmaTimeNBins": "200",  "SigmaTimeXMin": "0", "SigmaTimeXMax": "2",
      "TimeBinNBins":   "1000", "TimeBinXMin":   "0", "TimeBinXMax":   "100000"
    },
    "location": "remote"
  }
}

with dataSamplingPolicy

"dataSamplingPolicies": [
  {
    "id": "random-clusters",
    "active": "true",
    "machines": [
      "localhost"
    ],
    "port": "32627",
    "query": "inputClus:TPC/CLUSTERNATIVE",
    "outputs": "sampled-clusters:DS/CLUSTERNATIVE",
    "samplingConditions": [
      {
        "condition": "random",
        "fraction": "1",
        "seed": "1234"
      }
    ],
    "blocking": "false"
  }
]

Don’t mind fraction=1, I am just playing around at the moment.
I put resetAfterCycles into the task config, but then the workflow (e.g. when starting the --remote part) is invalid and gives the following error:

[ERROR] invalid workflow in o2-qc: No matching output found for QC/Clusters-mo/0 as requested by data processor "QC-CHECK-RUNNER-sink-QC_Clusters-mo_0". Candidates:

and among the candidates there is

-QC/Clusters-mo/87

The number is always exactly the one I set for resetAfterCycles, so the workflow can never be created unless I set it to 0, which is not what I want (I think).

To me this looks like something unintended is happening. Please let me know if I am making an obvious mistake somewhere.

Thanks!

Cheers
Thomas

pkonopka · July 15, 2021, 6:33am

Hi, I’ve done something stupid in the code, let me fix it.

tklemenz · July 15, 2021, 6:40am

Thanks a lot for the quick fix!