QC task: Error - ServiceDiscovery: Timeout was reached

Dear all,

we are observing the same error when running our TPC QC tasks. We investigated a bit but found that it did not interfere with the tasks, so we ignored it for the time being. We had meant to bring this up earlier but were busy with other things; since it is being discussed here, some information from our side:

  • We get consul errors both when using the default publishing on the QCG test webpage and also when running in a local environment (see below for more details).
  • We also get the errors when running the basic task, and therefore concluded that they have nothing to do with our tasks.
  • Trying to access the consul in a browser (i.e. following Piotr’s link), we get a connection time-out.
  • Concerning running locally, we found a bug in the file: $QCG_ROOT/node_modules/@aliceo2/qc/config.js -> in line 28 it has to be localhost and not locahost.
  • When running both the CCDB and the QCG in two terminals and then opening localhost:8081 in a web browser, the QCG shows the following errors (although it keeps on running):
Trace: Error: Unable to connect to Consul: Error: connect ECONNREFUSED 127.0.0.1:8500
    at /home/sheckel/alice/sw/ubuntu1804_x86-64/qcg/v1.6.9-10/node_modules/@aliceo2/qc/lib/ConsulConnector.js:31:13
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at Log.trace (/home/sheckel/alice/sw/ubuntu1804_x86-64/qcg/v1.6.9-10/node_modules/@aliceo2/web-ui/Backend/log/Log.js:113:13)
    at errorHandler (/home/sheckel/alice/sw/ubuntu1804_x86-64/qcg/v1.6.9-10/node_modules/@aliceo2/qc/lib/api.js:272:11)
    at /home/sheckel/alice/sw/ubuntu1804_x86-64/qcg/v1.6.9-10/node_modules/@aliceo2/qc/lib/api.js:62:23
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
2020-01-29T19:34:09.523Z error: [QualityControl] Unable to connect to Consul: Error: connect ECONNREFUSED 127.0.0.1:8500
2020-01-29T19:34:09.533Z debug: [WebSocket] ID 0 Processing "filter"
2020-01-29T19:34:09.534Z debug: [WebSocket] ID 0 Sent filter/200
2020-01-29T19:34:09.544Z debug: [HTTP] Page was not found: /favicon.ico

Some more information concerning the local environment:

  • We are running on Ubuntu 18.04 with a custom (and up-to-date) installation of O2 and QualityControl
  • We have the CCDB installed and running according to this
  • We have the QCG installed and running according to this
  • In the config file we changed the port for the QCG to 8081, i.e. in line 5 of
    $QCG_ROOT/node_modules/@aliceo2/qc/config.js

As said, this is not an issue for our current developments, but maybe it helps to pin down the issue.

Cheers,
Stefan

Thanks Stefan, this is very helpful information. For now, let me ask just one question.

  • Trying to access the consul in a browser (i.e. following Piotr’s link), we get a connection time-out.

Are you connecting from CERN or from outside? I would expect this link to work only from within the CERN network (and I am afraid it had better stay like that).

Hi Piotr,
we are trying to connect from outside, which then also explains why we get the time-out.
Cheers,
Stefan

I don’t think that it is accessible from outside CERN. Thus the error.

Hello experts,

I also have a question concerning consul. Apparently the

Unable to connect to Consul: Error: connect ECONNREFUSED 127.0.0.1:8500

does not pose a problem at the moment. When I run a local CCDB and QCG on my local machine (Ubuntu 18.04) as described above by @sheckel, I see similar lines to those posted above, and everything nonetheless works as expected. No problem here.

However, on a remote CentOS 7 machine I tried the exact same thing and it does not work. Upon starting the qcg I see the following lines:

2020-02-27T11:58:22.392Z warn: Created default instance of logger
2020-02-27T11:58:22.632Z info: [QualityControlLog] Read config file "/home/tklemenz/AliSoftware/sw/slc7_x86-64/qcg/v1.6.10-1/node_modules/@aliceo2/qc/config.js"
2020-02-27T11:58:22.821Z info: [QualityControlModel] Object listing: CCDB
Warning in <TClass::Init>: no dictionary for class o2::quality_control::core::CheckDefinition is available
2020-02-27T11:58:23.302Z info: [QualityControl] HTTP endpoint: http://localhost:8081
2020-02-27T11:58:23.313Z info: [HTTP] Server listening on port 8081
2020-02-27T11:58:23.316Z info: [WebSocket] Server started
2020-02-27T11:58:23.323Z error: [QualityControlModel] Consul Service connection could not be established. Please try restarting the service due to: Error: connect ECONNREFUSED 127.0.0.1:8500
2020-02-27T11:58:23.334Z info: [QualityControlJson] DB file updated
2020-02-27T11:58:23.334Z info: [QualityControlJson] Preferences will be saved in /home/tklemenz/AliSoftware/sw/slc7_x86-64/qcg/v1.6.10-1/node_modules/@aliceo2/qc/db.json

Then, when I go to localhost:8081 in the browser, which shows me the QCG page, I immediately get the following output from the QCG in the terminal:

2020-02-27T11:58:33.917Z error: [QualityControl] Unable to retrieve Consul Status: Error: connect ECONNREFUSED 127.0.0.1:8500
Trace: Error: Non-2xx status code: 500
    at ClientRequest.requestHandler (/home/tklemenz/AliSoftware/sw/slc7_x86-64/qcg/v1.6.10-1/node_modules/@aliceo2/qc/lib/CCDBConnector.js:95:18)
    at Object.onceWrapper (events.js:300:26)
    at ClientRequest.emit (events.js:210:5)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:583:27)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:115:17)
    at Socket.socketOnData (_http_client.js:456:22)
    at Socket.emit (events.js:210:5)
    at addChunk (_stream_readable.js:308:12)
    at readableAddChunk (_stream_readable.js:289:11)
    at Socket.Readable.push (_stream_readable.js:223:10)
    at Log.trace (/home/tklemenz/AliSoftware/sw/slc7_x86-64/qcg/v1.6.10-1/node_modules/@aliceo2/web-ui/Backend/log/Log.js:113:13)
    at errorHandler (/home/tklemenz/AliSoftware/sw/slc7_x86-64/qcg/v1.6.10-1/node_modules/@aliceo2/qc/lib/api.js:276:11)
    at /home/tklemenz/AliSoftware/sw/slc7_x86-64/qcg/v1.6.10-1/node_modules/@aliceo2/qc/lib/api.js:36:21
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
2020-02-27T11:58:33.924Z error: [QualityControl] Non-2xx status code: 500
2020-02-27T11:58:33.931Z debug: [WebSocket] ID 0 Processing "filter"
2020-02-27T11:58:33.931Z debug: [WebSocket] ID 0 Sent filter/200

In addition on the qcg page in the browser I get a small red window popping up stating:

Failed to retrieve list of objects due to undefined

and I cannot click on the object tree on the left side.
This seems a bit suspicious to me, since the Consul Status error appears again and is immediately followed by the show-stopping error 500.

It would be interesting to know whether this definitely does not result from the Consul Status error.
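One way to see which component raised an error is to look at the first frame of the Node.js stack trace. The following is a minimal Python sketch, not part of QCG: the helper name `first_frame_file` is hypothetical, and the two trace excerpts are copied from the logs in this thread.

```python
import posixpath
import re

# Matches one Node.js stack frame, with or without a function name, e.g.
#   at fn (/path/file.js:95:18)   or   at /path/file.js:31:13
FRAME = re.compile(r"at (?:.*\()?([^()\s]+):\d+:\d+\)?$")


def first_frame_file(trace_text):
    """Return the file name of the first stack frame in a trace, or None."""
    for line in trace_text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            return posixpath.basename(match.group(1))
    return None


# Excerpt of the trace that comes with the error 500 (from this post).
CCDB_TRACE = """Trace: Error: Non-2xx status code: 500
    at ClientRequest.requestHandler (/home/tklemenz/AliSoftware/sw/slc7_x86-64/qcg/v1.6.10-1/node_modules/@aliceo2/qc/lib/CCDBConnector.js:95:18)
    at Object.onceWrapper (events.js:300:26)"""

# Excerpt of the Consul trace posted earlier in the thread.
CONSUL_TRACE = """Trace: Error: Unable to connect to Consul: Error: connect ECONNREFUSED 127.0.0.1:8500
    at /home/sheckel/alice/sw/ubuntu1804_x86-64/qcg/v1.6.9-10/node_modules/@aliceo2/qc/lib/ConsulConnector.js:31:13
    at processTicksAndRejections (internal/process/task_queues.js:93:5)"""

print(first_frame_file(CCDB_TRACE))    # CCDBConnector.js
print(first_frame_file(CONSUL_TRACE))  # ConsulConnector.js
```

The first frame of the error 500 trace points to CCDBConnector.js rather than ConsulConnector.js, which suggests, without proving it, that the two errors have different origins.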

Thanks!

Cheers,
Thomas

Hi @sheckel,

It seems like there is a bit of confusion on how Quality Control, Consul (Service Discovery) and QCG work together. I will try my best now to explain and update the documentation.
Feel free to ask me any questions if there are still any concerns afterwards.

“We get consul errors both when using the default publishing on the QCG test webpage and also when running in a local environment (see below for more details).”

Consul (Service Discovery) is not part of QualityControl or QCG. It is an independent system that you may choose to install and use or not; it does not come packaged with QC or QCG.
If you have chosen not to use it, you will have to remove it from the config.js file. Otherwise QCG will assume that you need it and will attempt to connect to it. It then prints errors because it cannot connect to something that does not exist (on localhost, for example).

Trying to access the consul in a browser (i.e. following Piotr’s link), we get a connection time-out.

I believe this was answered by @bvonhall and @pkonopka

Concerning running locally, we found a bug in the file: $QCG_ROOT/node_modules/@aliceo2/qc/config.js -> in line 28 it has to be localhost and not locahost.

This is not a bug, as the config.js file is created by the user installing QCG. It is a simple typo in the example configuration file (config-default.js), and it does not affect operation in any way.

When running both the CCDB and the QCG in two terminals and then opening localhost:8081 in a web browser, the QCG shows the following errors (although it keeps on running):

This is the same issue as the first bullet point. QCG is attempting to connect to your localhost:8500 (as specified in your config.js file), but because you do not have Consul installed, there is nothing to connect to. Simply remove consul from config.js and you will not see the errors anymore.
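To confirm that ECONNREFUSED here only means that nothing is listening on the Consul port, the port can be probed directly. The following is a minimal Python sketch, not part of QCG: the helper name `is_port_open` is hypothetical, and the host and port are the ones from the error message above.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening, and also on timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 127.0.0.1:8500 is the address QCG tried to reach in the log above.
    if is_port_open("127.0.0.1", 8500):
        print("Something is listening on 127.0.0.1:8500")
    else:
        print("Nothing is listening on 127.0.0.1:8500")
```

If the probe reports that nothing is listening, the errors are the expected result of a configuration that points QCG at a Consul instance that does not exist.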

Please let me know if this clears everything and if not feel free to ask me anything else. In the next few days I will do my best to update the documentation of QCG and Consul on our repository as well.

Thank you,
George

Hi @tklemenz,

It seems like there is a bit of confusion on how Quality Control, Consul (Service Discovery) and QCG work together. I will try my best now to explain and update the documentation.
Feel free to ask me any questions if there are still any concerns afterwards.

Unable to connect to Consul: Error: connect ECONNREFUSED 127.0.0.1:8500

Consul (Service Discovery) is not part of QualityControl or QCG. It is an independent system that you may choose to install and use or not; it does not come packaged with QC or QCG.
If you choose not to use it, you will have to remove it from config.js; otherwise QCG will assume that you need it and will attempt to connect. It then prints errors because it cannot connect to something that does not exist (on localhost, for example).

Failed to retrieve list of objects due to undefined

This is a more serious issue that I will kindly ask @pkonopka and @bvonhall to have a look at. The red notification is displayed due to a problem when trying to get the objects from Quality Control. More specifically, from the logs you have posted, it seems to be this one:

2020-02-27T11:58:22.821Z info: [QualityControlModel] Object listing: CCDB
Warning in <TClass::Init>: no dictionary for class o2::quality_control::core::CheckDefinition is available

I also want to confirm that whether or not Consul is running does NOT affect your object tree in any way, as this is treated as a separate module.

Please let me know if this clears everything and if not feel free to ask me anything else. In the next few days I will do my best to update the documentation of QCG and Consul on our repository as well.

Thank you,
George

It seems that a MonitorObject from Checkers 1.0 (with CheckDefinition) is being opened with a recent version of QC. This should not have been a problem, as we increased the class version of MonitorObject (see this commit) when we removed its member CheckDefinition. However, could ROOT get confused when it sees a CheckDefinition and cannot find any information about it?

I will have to try on my machine. In the worst case we should probably add back CheckDefinition to make it part of the dictionary. What do you think, @pkonopka?

Hello George, Barth,

I think now it is clear that Consul does not impact the outcome of our tasks in any way. Thanks for clarifying that.

Concerning the

Warning in <TClass::Init>: no dictionary for class o2::quality_control::core::CheckDefinition is available

I also see this warning when I run on my local machine and this does not seem to be a problem. The task still runs without apparent problems there.

Cheers,
Thomas

We could, but we should then strongly indicate that this is for backward compatibility.

Given Thomas’ reply I would hold off on putting back the CheckDefinition.

Hi @tklemenz & @sheckel,

Please find below updated documentation which better describes the use and how to configure Online Mode within QC:

Thank you,
George

Dear @graduta,

thanks a lot for this additional information! Now I have another question: We want to run our QC tasks in the final TPC pre-commissioning phase online directly connected to the TPC reconstruction workflow. Piping a QC task to the reco workflow works already (although parallel writing the tracks and handing them over to the QC does not yet work, which is discussed elsewhere). Up to now, we have only used simulated data, where the reco workflow is running once, and hence only providing data to the QC task once as well.

If we now want to run online, i.e. with the reco workflow constantly (at some given time intervals) producing data that the QC tasks have to pick up and monitor, do we need the consul for this? From your description here I gather that the consul is in general optional and that, running in online mode, the tasks will only get those objects generated live (i.e. what we want to have). Does this mean that without consul we would not get these objects continuously, or would this also work without consul?

Thanks and cheers,
Stefan

Hi @sheckel,

In order to use Online Mode in QCG, you will need to configure and use Consul.

QC uses ServiceDiscovery to post the name of the live objects to Consul and then QCG will monitor Consul to refresh the current objects that are considered to be in live mode.

Hence, without Consul, QCG will not have Online Mode.
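The division of roles described above can be illustrated with a toy model. This is a sketch in Python under stated assumptions, not the QC ServiceDiscovery or Consul API: `ToyRegistry` and `gui_listing` are hypothetical names, and the object names in the usage example are made up for illustration.

```python
class ToyRegistry:
    """In-memory stand-in for the registry (the role Consul plays here)."""

    def __init__(self):
        self._live = set()

    def register(self, name):
        # A running task announces an object it is currently producing.
        self._live.add(name)

    def deregister(self, name):
        # The task stopped; the object is no longer live.
        self._live.discard(name)

    def live_names(self):
        return set(self._live)


def gui_listing(stored_names, registry=None):
    """Return (name, is_online) pairs as a GUI might list them.

    Without a registry every stored object is still listed, but none
    can be marked as online.
    """
    live = registry.live_names() if registry is not None else set()
    return [(name, name in live) for name in stored_names]


# Hypothetical object names, for illustration only.
stored = ["qc/TPC/Clusters", "qc/TPC/Tracks"]
registry = ToyRegistry()
registry.register("qc/TPC/Tracks")
print(gui_listing(stored, registry))
print(gui_listing(stored))  # no registry configured
```

In this model the stored objects remain visible either way; only the "online" flag depends on the registry.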

I would also like to ask @bvonhall to read your requirement and confirm that it would work from QC side.

Thank you,
George

Regarding the other, almost forgotten topic here:
With the latest O2/dev and latest QC/master QC Tasks and Checks should correctly react to an EndOfStream signal.

Once all of the producers have sent EndOfStream (usage example here), the QC Task will finish its cycle, publishing its MonitorObjects for the last time. It will then send them to the CheckRunner along with an EndOfStream; the CheckRunner will perform the Checks and store the results for the last time, then quit if all devices in the topology are done.

This means that even small files should now be handled correctly, provided the data readers send EndOfStream upon reaching the end of the file.
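The EndOfStream behaviour described above can be illustrated with a toy model. This is a Python sketch, not the QC or DPL API: `ToyTask` and `END_OF_STREAM` are hypothetical names, and the toy simply snapshots everything received so far when it publishes.

```python
from dataclasses import dataclass, field

END_OF_STREAM = object()  # sentinel standing in for the EndOfStream signal


@dataclass
class ToyTask:
    producers: set  # names of upstream producers that are still active
    received: list = field(default_factory=list)
    published: list = field(default_factory=list)
    finished: bool = False

    def receive(self, producer, message):
        if self.finished:
            return
        if message is END_OF_STREAM:
            self.producers.discard(producer)
            if not self.producers:
                # All producers are done: publish one last time and stop.
                self.publish()
                self.finished = True
        else:
            self.received.append(message)

    def publish(self):
        # Snapshot of everything received so far.
        self.published.append(list(self.received))


task = ToyTask(producers={"reader-a", "reader-b"})
task.receive("reader-a", 1)
task.receive("reader-a", END_OF_STREAM)  # one producer still active
task.receive("reader-b", 2)
task.receive("reader-b", END_OF_STREAM)  # last producer done
print(task.finished, task.published)
```

The task only finishes after the last producer has signalled EndOfStream, so data from a short file is still published once before the task stops.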

The task publishes the objects at the end of every cycle, whether Consul is there or not. The QCG will also display the latest version of an object if you refresh it (by clicking on a different object, for instance).

Dear George, dear Barth,

thanks for your replies. So for publishing the objects and displaying the most recent ones via the QCG we do not have any issues; this also works without the consul.

Now, for me the question remains: what about the connection from the reco workflow to the QC task? Up to now, we have been piping a QC task directly to the reco workflow. E.g., we first run a TPC O2 simulation, more precisely the hit creation, digitisation and clusterisation. Afterwards, we start the reco workflow (performing the tracking) with a QC task directly piped to it. In this setup, the reco workflow runs first, finishes the tracking, and then provides the whole bunch of tracks once to the QC task. The QC task continues to run, but it will not receive further input. Now, what happens when the reco workflow is running continuously (if I understand correctly, that would mean in online mode)? In this latter case we cannot use a simple pipe, because that would wait for the reco workflow to finish, which, however, is running continuously.

To me it is still unclear how to get this connection from the continuously running reco workflow to the QC task, and whether we need the consul for this.

Cheers,
Stefan

Dear Stefan,
I am not sure I completely understand; I have the impression that you might have a few wrong assumptions about how O2, DPL and QC work. Let’s tackle them one by one.

If you need it to exit gracefully, you can send an EndOfStream signal, as described a few posts above.

QC tasks also run continuously. They can receive new data over and over; every x seconds (10 in debug, 60 in production) they will publish new versions of MonitorObjects.
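The cycle-based publishing can be pictured with a small deterministic model. This is a Python sketch under stated assumptions, not QC code: `run_cycles` is a hypothetical helper, and arrival times are given explicitly instead of using a real clock.

```python
def run_cycles(samples, cycle_seconds, end_time):
    """Return (publish_time, samples_seen_so_far) for each cycle.

    samples: list of (arrival_time, value) pairs, sorted by arrival_time.
    """
    publications = []
    seen = 0
    index = 0
    next_publish = cycle_seconds
    while next_publish <= end_time:
        # Consume everything that arrived before this cycle ended.
        while index < len(samples) and samples[index][0] < next_publish:
            seen += 1
            index += 1
        publications.append((next_publish, seen))
        next_publish += cycle_seconds
    return publications


# Data arriving at t = 1, 4, 12 and 25 s with a 10 s cycle (debug value).
samples = [(1, "a"), (4, "b"), (12, "c"), (25, "d")]
print(run_cycles(samples, cycle_seconds=10, end_time=30))
# [(10, 2), (20, 3), (30, 4)]
```

A new version is published at the end of every cycle, regardless of whether the producer has finished.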

The fact that you pipe the workflows (say, workflow-a | workflow-b) doesn’t mean that the first one sends detector data through stdout and the latter receives them on stdin. In any case, with a pipe the first process doesn’t have to exit before the following one starts reading its stdin.

By piping these workflows, we only sum up their configuration, but all the DPL devices are run anyway by the last executable in the chain. Then, they transfer data by message queues. However, this is an internal detail and hopefully you shouldn’t have to worry about it too much.
QC tasks should be fine running continuously and I am pretty sure that other detector teams use them this way.
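The idea that piping only sums up the configuration can be sketched as a toy model. This is Python and not how DPL actually serialises or exchanges its configuration: `workflow_step` is a hypothetical helper, the JSON list format is an assumption made for illustration, and the device names are made up.

```python
import json


def workflow_step(own_devices, upstream_config, is_last):
    """Model one executable in a chain such as `workflow-a | workflow-b`.

    upstream_config: text received on stdin ("" for the first executable).
    Returns (text written to stdout, devices run by this executable).
    """
    devices = json.loads(upstream_config) if upstream_config else []
    devices = devices + list(own_devices)
    if is_last:
        # The last executable in the chain runs every device.
        return "", devices
    # Not last: pass the merged configuration on and run nothing.
    return json.dumps(devices), []


# Hypothetical device names, for illustration only.
out_a, run_a = workflow_step(["tpc-tracker"], "", is_last=False)
out_b, run_b = workflow_step(["qc-task"], out_a, is_last=True)
print(run_a)  # []
print(run_b)  # ['tpc-tracker', 'qc-task']
```

Only configuration travels through the pipe in this model; the data itself would be exchanged between the running devices by other means (message queues, as mentioned above).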

By the way, in production, we won’t use pipes. We will just connect different workflows by matching their channels with the Control software(s).

You don’t have to do anything unusual, I guess. If a reco workflow sends data continuously, QC tasks will continuously receive them. Please tell us if you encounter problems with that.

You absolutely don’t need consul for that. For now, it is only needed if you want to see in QCG which Monitor Objects (MOs) are currently being published (a.k.a. the online mode). Without it, you will still be able to see the latest published MOs, but they won’t be marked as “online”.

I hope that clarifies things. Don’t hesitate to ask more questions if something is not clear.
Best, Piotr

Dear Piotr,

thanks a lot for your detailed clarifications! I think I now understand much better how the different things work together. The connection from the reco workflow to the QC task is provided by the DPL and should also work when running continuously. The consul is only needed for checking whether the QC task has published anything new, and in that case updating the QCG. But this also works without the consul, just that one has to update the QCG manually. For the pre-commissioning this is not an issue, so we will not use the consul.

Thanks again!
Cheers, Stefan
