Usage of seed in DataSamplingConditionRandom

tklemenz · September 1, 2021, 9:00am

Hi,

we are wondering how exactly to use the seed in the random samplingCondition of dataSamplingPolicies.

In our case we have a task that sends data from multiple EPNs to the merger on the QC node, which runs the remote QC task on the data. Each EPN therefore has its own config, but the configs are identical on all EPNs. This means that the seed in the dataSamplingPolicies is also the same on all EPNs.
E.g.

  "dataSamplingPolicies": [
    {
      "id": "random-clusters",
      "active": "true",
      "machines": [
        "Machine1", "Machine2", "Machine3"
      ],
      "port": "32627",
      "query": "inputClus:TPC/CLUSTERNATIVE",
      "outputs": "sampled-clusters:DS/CLUSTERNATIVE",
      "samplingConditions": [
        {
          "condition": "random",
          "fraction": "0.01",
          "seed": "1234"
        }
      ],
      "blocking": "false"
    }
  ]

The main question is: is it intended to be used like this, with every machine having the same seed, or should each machine have a different one?
Can equal seeds on each machine lead to “bursts” of data?
If the seeds should be randomized so that each machine has a different one, is there an easy way to do so? For example, seed = 0 could lead to a random one. However, after checking the code I don’t think this is the case.
I also checked the DataSampling documentation but didn’t find a hint concerning our question.

I assume @pkonopka is the one to be summoned.

Thanks a lot!

Cheers,
Thomas

pkonopka · September 1, 2021, 9:38am

Hi,

Just for clarity, if a Task runs remotely, there are no mergers. Data just reach the remote QC task via some proxies. Mergers are used to merge Monitor Objects, which is not needed in this case.

If you use the same seed, you will always get the same pseudo-random selection decisions for the same timeframeIDs (which are taken from DataProcessingHeader::startTime).
As far as I understand, EPNs receive different TFs in a round-robin scheme or some other algorithm, so you cannot have any bursts of data then. These would happen for FLPs, which work on STFs with the same ID in parallel.
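
To illustrate the point above, here is a minimal sketch, not the actual O2 code. It assumes the decision is a pure function of the seed and the timeframe ID; the hash-based `sample_decision` helper and the round-robin assignment of timeframes are hypothetical stand-ins chosen only to show the effect.

```python
import hashlib


def sample_decision(seed: int, timeframe_id: int, fraction: float) -> bool:
    """Hypothetical stand-in for a seeded random condition (not the O2 code).

    The decision depends only on (seed, timeframe_id), so it is reproducible.
    """
    digest = hashlib.sha256(f"{seed}:{timeframe_id}".encode()).digest()
    # Use the top 53 bits of the hash to get a number in [0, 1).
    u = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return u < fraction


SEED = 1234
FRACTION = 0.1
N_MACHINES = 3
TF_IDS = range(3000)

# EPN-like case: each timeframe goes to exactly one machine (round-robin),
# so the same ID is never evaluated on two machines.
epn_selected = [
    [tf for tf in TF_IDS
     if tf % N_MACHINES == m and sample_decision(SEED, tf, FRACTION)]
    for m in range(N_MACHINES)
]

# FLP-like case: every machine sees every (sub)timeframe ID,
# so with equal seeds all machines make identical decisions.
flp_selected = [
    [tf for tf in TF_IDS if sample_decision(SEED, tf, FRACTION)]
    for _ in range(N_MACHINES)
]

print([len(selected) for selected in epn_selected])
print(flp_selected[0] == flp_selected[1] == flp_selected[2])
```

In the EPN-like case the selected IDs are spread over the machines without overlap, while in the FLP-like case all machines select exactly the same IDs at the same time, which is where bursts would come from.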

That being said, we can add the proposed feature of selecting a random seed for seed==0 if you need it. It should be quick.
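
A rough sketch of what such a convention could look like, assuming 0 is reserved to mean "pick a seed for me". The `resolve_seed` name is hypothetical, and this is not necessarily how the feature is or will be implemented in QC.

```python
import secrets


def resolve_seed(configured_seed: int) -> int:
    """Return the seed to use, treating 0 as a request for a random one.

    Hypothetical helper for illustration only.
    """
    if configured_seed != 0:
        # An explicit seed keeps the selection reproducible.
        return configured_seed
    # Draw a 64-bit value from the OS entropy source; avoid returning 0 again.
    return secrets.randbits(64) or 1


print(resolve_seed(1234))
print(resolve_seed(0))
```

With this convention, identical configs on all machines would still end up with different seeds at startup, at the cost of no longer having reproducible selections.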

tklemenz · September 1, 2021, 10:35am

Hi Piotr,

thanks for the quick answer.

Yes, sorry about the mergers. That is clear; it was just sloppy use of the process nomenclature on the QC node on our part.

OK, then it looks like what we do is fine in general.

The random seed for seed==0 would actually be very good to have. We would very much appreciate you implementing it.

Thanks a lot!

Cheers,
Thomas

pkonopka · September 1, 2021, 12:24pm

https://alice.its.cern.ch/jira/browse/QC-643