Trending in Framework cannot be published on local QCG

Dear all,

I wanted to test again the tpcQCSimpleTrending.json that we configured to try out some simple trending for the TPC. I was running the trending on the PID of some simulated data and publishing everything on my local QCG. The json for the PID ran without any problem, but when I tried the json I have for the trending with the command o2-qc-run-postprocessing --config json://home/cindy/aliceO2/QualityControl/Modules/TPC/run/tpcQCSimpleTrending.json --name ExampleTrendingPID --period 10, I got the following terminal output:

2020-04-25 17:16:59.653469     QualityControl Module QcCommon loaded 
2020-04-25 17:17:09.643909     Checking triggers of the task 'ExampleTrendingPID'
2020-04-25 17:17:19.643922     Checking triggers of the task 'ExampleTrendingPID'
2020-04-25 17:17:19.643986     Updating the user task due to trigger '7'
2020-04-25 17:17:19.686156     Retrieved object qc/TPC/PID/hNClusters
2020-04-25 17:17:19.686198     Version of object qc/TPC/PID/hNClusters is < 0.25
2020-04-25 17:17:19.686213 !!! Error - Could not cast the object qc/TPC/PID/hNClusters to MonitorObject
2020-04-25 17:17:19.694334     Retrieved object qc/TPC/PID/hPhi
2020-04-25 17:17:19.694359     Version of object qc/TPC/PID/hPhi is < 0.25
2020-04-25 17:17:19.694369 !!! Error - Could not cast the object qc/TPC/PID/hPhi to MonitorObject
2020-04-25 17:17:19.883837     Retrieved object qc/checks/DET/PIDClusterCheck
2020-04-25 17:17:19.884208     Generating and storing 6 plots.
2020-04-25 17:17:20.027480     Storing the trend, entries: 1

The script does not seem to be able to trend the normal histograms published by the PID task, but it has no problem trending the checkers.
I should mention that this json was working without any problems before the latest QC update.

Furthermore, I modified the json to publish on the central https://qcg-test.cern.ch and used the same commands as before. Again, I had no problem publishing the PID histograms, but this time the trending ran and was published as expected, with these lines in the terminal:

2020-04-25 17:24:50.079463     QualityControl Module QcCommon loaded 
2020-04-25 17:25:00.069521     Checking triggers of the task 'ExampleTrendingPID'
2020-04-25 17:25:10.069305     Checking triggers of the task 'ExampleTrendingPID'
2020-04-25 17:25:10.069370     Updating the user task due to trigger '7'
2020-04-25 17:25:10.205655     Retrieved object qc/TPC/PID/hNClusters
2020-04-25 17:25:10.205700     Version of object qc/TPC/PID/hNClusters is >= 0.25
2020-04-25 17:25:10.309888     Retrieved object qc/TPC/PID/hPhi
2020-04-25 17:25:10.309916     Version of object qc/TPC/PID/hPhi is >= 0.25
2020-04-25 17:25:10.610320     Retrieved object qc/checks/DET/PIDClusterCheck
2020-04-25 17:25:10.610665     Generating and storing 6 plots.
2020-04-25 17:25:11.394009     Storing the trend, entries: 1

I am running on Ubuntu 18.04 and I have updated alidist, O2 and QualityControl this Monday 20.04. Moreover, @sheckel tried the same commands to publish on his local QCG and encountered the same issue with his freshly updated environment.

Thank you in advance!
Best,
Cindy

Dear Cindy, dear all,

just a bit of additional info from my side:

  • my version of alidist, O2 and QC still exhibiting the same issue is from Friday afternoon, so really rather recent.
  • We found the part, where the checking of the version takes place, in: QualityControl/Framework/src/CcdbDatabase.cxx
  • The check of the version is done in the function CcdbDatabase::retrieveMO starting at line 179.

I hope this helps a bit to pin down the issue.

Cheers,
Stefan

Hello, I am sorry that you are encountering this. I will have a look.
You can track it here https://alice.its.cern.ch/jira/browse/QC-320
@bvonhall That must have broken after we started to store TObjects instead of MonitorObjects.

Could you check what is the “qc_version” of the latest object in your local CCDB? You should be able to see it there http://localhost:8080/browse/qc/TPC/PID/hNClusters, in the metadata column.
These latest objects were produced also by the latest QC version, right? I am surprised that they are interpreted as objects from versions <0.25.0, as seen in the logs.

Hi Piotr,

Thank you for your reply. I reran this afternoon the PID and the trending tasks in my localhost and went to check the metadata with the link you gave me. Apparently, hNClusters, but also other objects I checked, have “qc_version” 0.25.0, but I still got this error

2020-04-27 15:13:32.010971     Version of object qc/TPC/PID/hPhi is < 0.25
2020-04-27 15:13:32.010980 !!! Error - Could not cast the object qc/TPC/PID/hPhi to MonitorObject

in my terminal when I ran the trending json.
I updated alidist, O2 and QC last Monday, and did not update it since, so all objects are with the version from then. I don’t think it will change something, but I also produced new simulations with this version of the framework.

Best,
Cindy

Hi,

I have some questions to try and understand the problem here.

  1. Do you both, Stefan and Cindy, work on Ubuntu ?
  2. Stefan, you did try to access the objects with the qcg, not the post-processing ?
  3. Is your local ccdb set up using these instructions ? (https://github.com/AliceO2Group/QualityControl/blob/master/doc/Advanced.md#local-ccdb-setup).
    If so, it would help if you could zip/tar the directory /tmp/QC which contains your CCDB so that we can try ourselves against it.
  4. Could you try to run ./tests/testCcdbDatabase and ./tests/testCcdbDatabaseExtra from within your build directory to see if there are errors.

Meanwhile I will setup a fresh local CCDB and store some objects there using the tests and try to retrieve them with a local qcg.

EDIT: removed the last point that was not going to help

Hi again,

You could also try to add the following lines in CcdbDatabase.cxx after the line Version objectVersion(headers["qc_version"]); // retrieve headers to determine the version of the QC framework :

  ILOG(Debug) << "Version of object is " << objectVersion << ENDM;
  ILOG(Debug) << "objectVersion == Version(\"0.0.0\") : " << (objectVersion == Version("0.0.0")) << ENDM;
  ILOG(Debug) << "objectVersion < Version(\"0.25\") : " << (objectVersion < Version("0.25")) << ENDM;

And recompile. This should give us a hint what is going on. I am not able to reproduce it on my computer.

Dear all,

Just to keep you posted. I have built everything on Ubuntu and could not reproduce the problem with the ccdb-test (you did not either, so that was expected). I then tried to store and retrieve objects on a local ccdb on Ubuntu and got no trouble. I accessed them with the QCG.

So it would really help if you could provide me with a zip of your local database which is causing problems.

Cheers,
Barth

Dear Barth, all,

To answer your questions. Yes, I am working on Ubuntu 18.04 LTS.
I added the ILOG lines in the CcdbDatabase.cxx like you suggested and this is the message in the terminal when I run the trending in my local QCG:

2020-04-29 15:41:52.284958     QualityControl Module QcCommon loaded 
2020-04-29 15:42:02.139988     Checking triggers of the task 'ExampleTrendingPID'
2020-04-29 15:42:12.139817     Checking triggers of the task 'ExampleTrendingPID'
2020-04-29 15:42:12.139881     Updating the user task due to trigger '7'
2020-04-29 15:42:12.181021     Retrieved object qc/TPC/PID/hNClusters
2020-04-29 15:42:12.181067     Version of object is 0.0.0
2020-04-29 15:42:12.181076     objectVersion == Version("0.0.0") : 1
2020-04-29 15:42:12.181082     objectVersion < Version("0.25") : 1
2020-04-29 15:42:12.181089     Version of object qc/TPC/PID/hNClusters is < 0.25
2020-04-29 15:42:12.181097 !!! Error - Could not cast the object qc/TPC/PID/hNClusters to MonitorObject
2020-04-29 15:42:12.189395     Retrieved object qc/TPC/PID/hPhi
2020-04-29 15:42:12.189428     Version of object is 0.0.0
2020-04-29 15:42:12.189437     objectVersion == Version("0.0.0") : 1
2020-04-29 15:42:12.189443     objectVersion < Version("0.25") : 1
2020-04-29 15:42:12.189450     Version of object qc/TPC/PID/hPhi is < 0.25
2020-04-29 15:42:12.189458 !!! Error - Could not cast the object qc/TPC/PID/hPhi to MonitorObject
2020-04-29 15:42:12.206266     Retrieved object qc/checks/DET/PIDClusterCheck
2020-04-29 15:42:12.206508     Generating and storing 6 plots.
2020-04-29 15:42:12.342133     Storing the trend, entries: 1

You can find the zip of my /tmp/QC/qc of today with this link

Please tell me if there is something more you need.
Cheers,
Cindy

Dear Barth,

thanks a lot for your investigations and sorry for the late reply.

To the four points in your first reply:

  1. Yes, we are both on Ubuntu 18.04. In addition, I did update my alidist, O2 and QualityControl on Friday and now again today and I see the same errors as Cindy.
  2. What I do, step by step:
  • I run a simulation corresponding to the instructions here (with option -g pythia8, although this probably does not matter).
  • I delete the folder /tmp/QC/ to make sure I have no leftovers.
  • I start my local CCDB and QCG
  • I run first:
o2-qc-run-tpctrackreader -b | o2-qc -b --config json://home/sheckel/alice/QualityControl/Modules/TPC/run/tpcQCClusterChecker.json
  • I check in a web browser on localhost:8081 (the port is set correspondingly in the config file) and I can access the histograms produced by the previous step.
  • Then I run, including now the log messages proposed by you:
o2-qc-run-postprocessing --config json://home/sheckel/alice/QualityControl/Modules/TPC/run/tpcQCSimpleTrending.json --name ExampleTrendingPID --period 10
  • It fails, as before, and I get:
2020-04-29 15:48:07.393033     Retrieved object qc/TPC/PID/hNClusters
2020-04-29 15:48:07.393078     Version of object is 0.0.0
2020-04-29 15:48:07.393086     objectVersion == Version("0.0.0") : 1
2020-04-29 15:48:07.393091     objectVersion < Version("0.25") : 1
2020-04-29 15:48:07.393098     Version of object qc/TPC/PID/hNClusters is < 0.25
2020-04-29 15:48:07.393106 !!! Error - Could not cast the object qc/TPC/PID/hNClusters to MonitorObject
2020-04-29 15:48:07.398717     Retrieved object qc/TPC/PID/hPhi
2020-04-29 15:48:07.398734     Version of object is 0.0.0
2020-04-29 15:48:07.398741     objectVersion == Version("0.0.0") : 1
2020-04-29 15:48:07.398746     objectVersion < Version("0.25") : 1
2020-04-29 15:48:07.398751     Version of object qc/TPC/PID/hPhi is < 0.25
2020-04-29 15:48:07.398757 !!! Error - Could not cast the object qc/TPC/PID/hPhi to MonitorObject
[ERROR] Invalid URL : localhost:8080/qc/checks/TPC/PIDClusterCheck/1588168087398/
invalid URL : localhost:8080/qc/checks/TPC/PIDClusterCheck/1588168087398/
2020-04-29 15:48:07.409321 !!! Error - We could NOT retrieve the object qc/checks/TPC/PIDClusterCheck.
2020-04-29 15:48:07.409334 !!! Error - Could not cast the object qc/checks/TPC/PIDClusterCheck to QualityObject

 *** Break *** segmentation violation
  • So, actually, somehow the version is “0.0.0” and therefore “< 0.25” :frowning:
  3. Yes, I followed these instructions; I will provide the directory.
  4. In which directory exactly should I run those commands? I tried in several places and always get
bash: ./tests/testCcdbDatabase: No such file or directory

Cheers,
Stefan

One addition, to the last point (4): I found the correct directory, and both commands run fine and produce some output, in the end concluding *** No errors detected.

I include here the last lines of the output of the 2nd command, which also illustrate that here the version is obtained correctly as 0.25.3:

2020-04-29 16:19:36.742059     Retrieved object qc/TST/my/task/asdf/asdf
2020-04-29 16:19:36.742201     Version of object is 0.25.3
2020-04-29 16:19:36.742246     objectVersion == Version("0.0.0") : 0
2020-04-29 16:19:36.742283     objectVersion < Version("0.25") : 0
2020-04-29 16:19:36.742322     Version of object qc/TST/my/task/asdf/asdf is >= 0.25
2020-04-29 16:19:36.787194     Retrieved object qc/TST/my/task/asdf/asdf
2020-04-29 16:19:36.787273     Version of object is 0.25.3
2020-04-29 16:19:36.787311     objectVersion == Version("0.0.0") : 0
2020-04-29 16:19:36.787344     objectVersion < Version("0.25") : 0
2020-04-29 16:19:36.787379     Version of object qc/TST/my/task/asdf/asdf is >= 0.25
2020-04-29 16:19:36.831520     Retrieved object qc/TST/my/task/asdf/asdf
2020-04-29 16:19:36.831593     Version of object is 0.25.3
2020-04-29 16:19:36.831627     objectVersion == Version("0.0.0") : 0
2020-04-29 16:19:36.831656     objectVersion < Version("0.25") : 0
2020-04-29 16:19:36.831687     Version of object qc/TST/my/task/asdf/asdf is >= 0.25

*** No errors detected

Cheers,
Stefan

Dear both,
Thank you, this helps a lot. I see that the database sent by Cindy has the proper version number for the object and yet it reports “0.0.0” which is the default.
So the culprit is in the retrieval of the version number and not in the logic of the comparison.
I strongly suspect that the local CCDB, which is very different from the one installed on a web server, does not handle the metadata correctly, or at all. @grigoras might enlighten us on this.
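To illustrate why a missing header would show up as "< 0.25", here is a minimal sketch. SketchVersion and versionFromHeaders are hypothetical names made up for this example, not the real QC classes; the only things taken from this thread are the pattern Version objectVersion(headers["qc_version"]); and the fact that "0.0.0" is the default.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-in for the QC Version class (not the real implementation).
// Assumption: parts that cannot be parsed stay at 0, so "" becomes 0.0.0.
struct SketchVersion {
  int majorPart = 0;
  int minorPart = 0;
  int patchPart = 0;

  explicit SketchVersion(const std::string& text)
  {
    // On an empty string nothing is parsed and all parts keep their default 0.
    std::sscanf(text.c_str(), "%d.%d.%d", &majorPart, &minorPart, &patchPart);
  }
};

bool operator==(const SketchVersion& a, const SketchVersion& b)
{
  return std::tie(a.majorPart, a.minorPart, a.patchPart) ==
         std::tie(b.majorPart, b.minorPart, b.patchPart);
}

bool operator<(const SketchVersion& a, const SketchVersion& b)
{
  return std::tie(a.majorPart, a.minorPart, a.patchPart) <
         std::tie(b.majorPart, b.minorPart, b.patchPart);
}

// Mirrors the pattern `Version objectVersion(headers["qc_version"]);`.
// std::map::operator[] silently creates an empty value when the key is absent,
// so a response without the "qc_version" header yields version 0.0.0.
SketchVersion versionFromHeaders(std::map<std::string, std::string> headers)
{
  return SketchVersion(headers["qc_version"]);
}
```

With a header map that lacks "qc_version", versionFromHeaders returns 0.0.0, which compares as older than 0.25 even though the stored object is recent. Using headers.find("qc_version") and logging an explicit "header missing" message would make this case distinguishable from a genuinely old object.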

I will do a few more tests tomorrow.
Cheers,
Barth

Ok, so it is not a problem with the local ccdb.

I noticed that the object qc/TPC/PID/hNClusters has a very short validity range (5 minutes). Could you confirm ?
If that is the case, then this is the reason for the break. Indeed, we retrieve the object based on the current timestamp, which is after the validTo, and thus there are no matches. As a consequence, no object is returned and there is no version.

If it were the case, I note that

  1. We need to better handle the case where no object is retrieved. Better error logging at least. (JIRA ticket)
  2. We need to consider what retrieveMO should do. I am not convinced we want to retrieve the object based on timestamp “now” but rather use the “latest” object. This needs a discussion with CCDB experts and a change in O2 as this default comes from there. (JIRA ticket)
  3. We need to understand how this end of validity came to be. Have you changed it somehow ? The QC should always store with a 10 years validity range.
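
The lookup by timestamp described above, and the "latest" alternative from point 2, can be sketched as follows. This is only an illustration under simple assumptions: StoredObject, findValidAt and findLatest are hypothetical names, the validity interval is treated as [validFrom, validUntil), and nothing here is the actual CCDB or QC code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record: only the validity interval of a stored object (ms since epoch).
struct StoredObject {
  std::int64_t validFrom;
  std::int64_t validUntil;
};

// "now" strategy: newest object whose validity interval contains the timestamp.
// Returns nullptr when no interval matches, which is the case discussed above.
const StoredObject* findValidAt(const std::vector<StoredObject>& objects, std::int64_t timestamp)
{
  const StoredObject* best = nullptr;
  for (const auto& object : objects) {
    const bool covers = object.validFrom <= timestamp && timestamp < object.validUntil;
    if (covers && (best == nullptr || object.validFrom > best->validFrom)) {
      best = &object;
    }
  }
  return best;
}

// "latest" strategy: newest object regardless of the end of its validity.
const StoredObject* findLatest(const std::vector<StoredObject>& objects)
{
  const StoredObject* best = nullptr;
  for (const auto& object : objects) {
    if (best == nullptr || object.validFrom > best->validFrom) {
      best = &object;
    }
  }
  return best;
}
```

With a single object valid for about five minutes and a query timestamp after its end, findValidAt returns nullptr while findLatest still returns the object. In either strategy the caller has to check for nullptr before reading a version from the result.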

@grigoras There seems to be something weird. Inside the properties file, the validUntil is correct (1903527666251) :

[barth@iMac QC]$ cat qc/TPC/PID/hNClusters/1588167666251/136c8600-8a1f-11ea-8812-7f0000015566.properties
#Wed Apr 29 15:41:06 CEST 2020
InitialValidityLimit=1903527666251
qc_detector_name=TPC
OriginalFileName=TObject_1588167666251.root
UploadedFrom=127.0.0.1
custom=34
CreateTime=1588167666272
UpdatedFrom=127.0.0.1
Last-Modified=1588167666272
qc_version=0.25.0
Content-MD5=070b3e57a4383a9876df91a021b3c94c
ValidUntil=1903527666251
ObjectType=TH1F
qc_task_name=PID
File-Size=5120
Content-Type=application/octet-stream

But in the web interface and the ccdbApi it is wrong (1588167989000) :

To test it, download the database uploaded by Cindy and run the local.jar against it. Then access it with a browser. Do you see the same discrepancy ?
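
For reference, the two end-of-validity values can be checked with a few lines of arithmetic. This sketch assumes that the timestamp in the object path (1588167666251) is the start of validity and that "10 years" means 3650 days; the constant and function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Values quoted in this post (milliseconds since epoch).
constexpr std::int64_t kValidFrom = 1588167666251LL;       // assumed: timestamp in the object path
constexpr std::int64_t kValidUntilFile = 1903527666251LL;  // ValidUntil in the .properties file
constexpr std::int64_t kValidUntilWeb = 1588167989000LL;   // value seen via web interface / ccdbApi

constexpr std::int64_t kMsPerMinute = 60LL * 1000;
constexpr std::int64_t kMsPerDay = 24LL * 60 * kMsPerMinute;

// Length of a validity interval in milliseconds.
constexpr std::int64_t validityMs(std::int64_t from, std::int64_t until)
{
  return until - from;
}

// Whole days covered by the interval stored in the properties file.
constexpr std::int64_t fileValidityDays()
{
  return validityMs(kValidFrom, kValidUntilFile) / kMsPerDay;
}

// Whole minutes covered by the interval reported by the web interface.
constexpr std::int64_t webValidityMinutes()
{
  return validityMs(kValidFrom, kValidUntilWeb) / kMsPerMinute;
}
```

The file value corresponds to exactly 3650 days, while the web value corresponds to 322749 ms, i.e. a bit more than 5 minutes, which matches the short validity range mentioned above.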

Ok, seems to work now, before I was getting a message like “405 Permission denied”. Never mind, looks like we can continue here then.

We discussed by email with @grigoras who had difficulties posting here.
There was a problem of zipping and sending the DB as the file timestamps were used. It is solved in a new version of the java program that Costin has kindly produced.

@cimordas @sheckel
Could you check out this new version, here, and see if it helps by any chance ?

I am still unable to reproduce the problem on my system after the update of the java program. I can retrieve the object from your local database you sent me. The version is properly detected, I can even count the bins so the object is fine. What is very strange is that it works if you point to ccdb-test.

Cindy, could you try to add even more output to your CcdbDatabase.cxx ? Sorry to bother you but until I am able to reproduce it it is hard to debug.
Right after TObject* obj = retrieveTObject(path, metadata, when, &headers); could you add :

  std::cout << "headers : " << std::endl;
  for (const auto& [k, v] : headers)
    std::cout << "m[" << k << "] = (" << v << ") " << std::endl;
  std::cout << "obj : " << obj << std::endl;

Cheers,
Barth

PS: my next step, next week because tomorrow is off, is to do exactly what you Stefan described. Could you upload or point me to the config files you use in the qc and the post-processing ?

Dear Barth,

first, I did the same as you did in your second-to-last post here: I looked at the validity timestamps both in the file and online. In my case, both show a correct validity of 10 years!

From the file:

sheckel@QCbuilder:/tmp/QC/qc/TPC/PID/hNClusters/1588167582868$ cat e1b6ffa0-8a1e-11ea-b926-7f0000015566.properties
#Wed Apr 29 15:39:42 CEST 2020
InitialValidityLimit=1903527582868
...
ValidUntil=1903527582868
...

And in the CCDB in a web browser, looking at
localhost:8080/browse/qc/TPC/PID/hNClusters

Next, I will add the additional couts in CcdbDatabase.cxx and try the new version of the local jar, but I am not sure whether I will still manage today.

The json config files I used are both already available in the QualityControl master. First, used in the qc together with the tpctrackreader:
QualityControl/Modules/TPC/run/tpcQCClusterChecker.json

And then, for the postprocessing:
QualityControl/Modules/TPC/run/tpcQCSimpleTrending.json

EDIT: I also add here the exact commands I used for the TPC simulation, then you don’t have to look it up in the link I posted.

o2-sim -m TPC [ITS PIPE] -n 100 -g pythia8
o2-sim-digitizer-workflow -b
o2-tpc-reco-workflow -b --infile tpcdigits.root  --output-type clusters,tracks

Have a nice long weekend!
Cheers,
Stefan

Hi,

I still found some time. First, I used the old CCDB and just included the further couts. The result is:

2020-04-30 20:25:07.809079     Retrieved object qc/TPC/PID/hNClusters
headers : 
m[Accept-Ranges] = (bytes) 
m[Content-Disposition] = (inline;filename="TObject_1588271081060.root") 
m[Content-Length] = (0) 
m[Content-MD5] = (a42ed5fd0086eff4c45cb603664be17a) 
m[Content-Type] = (application/octet-stream) 
m[Created] = (1588271081068) 
m[Date] = (Thu, 30 Apr 2020 18:25:07 GMT) 
m[ETag] = ("db721ac0-8b0f-11ea-8df4-7f0000015566") 
m[Last-Modified] = (Thu, 30 Apr 2020 18:24:41 GMT) 
m[Location] = (/qc/TPC/PID/hNClusters/1588271081060/db721ac0-8b0f-11ea-8df4-7f0000015566) 
m[Valid-From] = (1588271081060) 
m[Valid-Until] = (1903631081060) 
obj : 0x55f9a9686010
2020-04-30 20:25:07.809191     Version of object is 0.0.0
2020-04-30 20:25:07.809203     objectVersion == Version("0.0.0") : 1
2020-04-30 20:25:07.809210     objectVersion < Version("0.25") : 1
2020-04-30 20:25:07.809219     Version of object qc/TPC/PID/hNClusters is < 0.25
2020-04-30 20:25:07.809230 !!! Error - Could not cast the object qc/TPC/PID/hNClusters to MonitorObject

Then I used the new CCDB (local-1.0.6.jar). Now, the version check is fine! For the hNClusters I get:

2020-04-30 20:46:50.312868     Retrieved object qc/TPC/PID/hNClusters
headers : 
m[Accept-Ranges] = (bytes) 
m[Content-Disposition] = (inline;filename="TObject_1588272375941.root") 
m[Content-Length] = (0) 
m[Content-MD5] = (41cb2b2cfc5cc2eb3266a974d4cdc973) 
m[Content-Type] = (application/octet-stream) 
m[Created] = (1588272375947) 
m[Date] = (Thu, 30 Apr 2020 18:46:50 GMT) 
m[ETag] = ("df40f5b0-8b12-11ea-a0ca-7f0000015566") 
m[Last-Modified] = (Thu, 30 Apr 2020 18:46:15 GMT) 
m[Location] = (/qc/TPC/PID/hNClusters/1588272375941/df40f5b0-8b12-11ea-a0ca-7f0000015566) 
m[ObjectType] = (TH1F) 
m[Valid-From] = (1588272375941) 
m[Valid-Until] = (1903632375941) 
m[custom] = (34) 
m[qc_detector_name] = (TPC) 
m[qc_task_name] = (PID) 
m[qc_version] = (0.25.3) 
obj : 0x562352bac460
2020-04-30 20:46:50.312970     Version of object is 0.25.3
2020-04-30 20:46:50.312977     objectVersion == Version("0.0.0") : 0
2020-04-30 20:46:50.312982     objectVersion < Version("0.25") : 0
2020-04-30 20:46:50.312989     Version of object qc/TPC/PID/hNClusters is >= 0.25
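
Comparing the two header dumps by eye, the old jar returns none of the custom metadata. A small helper like the following sketch can list such differences; missingKeys is a hypothetical function written for this post, not part of QC or CCDB.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Headers = std::map<std::string, std::string>;

// Returns the keys present in `reference` but absent from `candidate`,
// in the (byte-wise sorted) iteration order of std::map.
std::vector<std::string> missingKeys(const Headers& reference, const Headers& candidate)
{
  std::vector<std::string> missing;
  for (const auto& entry : reference) {
    if (candidate.find(entry.first) == candidate.end()) {
      missing.push_back(entry.first);
    }
  }
  return missing;
}
```

Applied to the two dumps above, with the local-1.0.6.jar headers as reference and the old ones as candidate, the missing keys are ObjectType, custom, qc_detector_name, qc_task_name and qc_version.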

After the hNClusters, the hPhi is also obtained successfully with similar output, but then we run into a second issue:

[ERROR] Invalid URL : localhost:8080/qc/checks/TPC/PIDClusterCheck/1588272410319/
invalid URL : localhost:8080/qc/checks/TPC/PIDClusterCheck/1588272410319/
2020-04-30 20:46:50.330813 !!! Error - We could NOT retrieve the object qc/checks/TPC/PIDClusterCheck.
2020-04-30 20:46:50.330824 !!! Error - Could not cast the object qc/checks/TPC/PIDClusterCheck to QualityObject

 *** Break *** segmentation violation

In fact, this object does not exist in the requested path, but similar objects are found here:

/tmp/QC/qc/checks/DET/PIDClusterCheck/1588272365965
/tmp/QC/qc/checks/DET/PIDClusterCheck/1588272375935

Here I see two issues:

  • In the path there is a “DET” instead of “TPC”.
  • The object itself being requested (i.e. the time-stamp number) is not identical to the ones available.

Before I ran with the new CCDB, I deleted the /tmp/QC/ directory, so it cannot be a leftover from before. I think, at least the “DET” can be in our configuration, maybe @cimordas or @lserksny could check.

To conclude: One issue solved, another one discovered.

Cheers,
Stefan

Hi Stefan, all,

@bvonhall I have not yet tested the new CCDB, nor added the output in CcdbDatabase.cxx, but it is on my to-do list for the upcoming days for sure.

However, @sheckel, I also noticed this issue of DET instead of TPC. In the json for the trending, the given path to trend checkers is qc/checks/TPC. This can easily be changed to qc/checks/DET to solve the issue. What confuses me is that I don’t recall having this problem when I tested and committed the json for the trending, nor when I ran it to get the plots for the WP7 meeting, so either the checkers were automatically published in TPC, or I had changed my local version of the json for the checkers to do so.
I also checked the json for the checkers and I did not find where the DET is indicated, but I may have missed it. I guess it would make more sense to change it here than in the trending. What do you think @lserksny?

Cheers,
Cindy

Dear all,

sorry it is a bug from my side from the last commit! I will fix it and make a pull request. Cindy is right, the problem is in my json file!

Cheers
Laura

Dear Stefan,
Dear Cindy,

I would call that progress :slight_smile:
I don’t know how or why the new CCDB jar fixes the problem but I am glad it does.

I take it from the subsequent messages that the problem with “DET” detector is understood with a fix under way.

I will work today on the next issue, i.e. the failure to retrieve PIDClusterCheck.

Thanks for all your involvement in this issue.
Cheers,
Barth