Can not create environment through AliECS GUI (FLP suite v0.2.0)

Dear experts,

We upgraded a few days ago the target machine alimftcom2 with the FLP suite v0.2.0. After the machine was rebooted, the deployment was successfully tested with:
./o2-install-flp-standalone alimftcom2 root test-diagnosis

I am currently going further in the tests by following the detailed test scenarii in https://gitlab.cern.ch/AliceO2Group/system-configuration/blob/flp-suite-v0.2.1/ansible/docs/FLP_test_scenarios.xlsx

Following test scenario number 3, I tried to create an environment through AliECS GUI, but it failed, with the error message below from AliECS core/env:
workflow deployment error error = “workflow deployment timed out” timeout=“1m30s”

Below is the content of AliECS core journal:

[root@alimftcom2 ~]# journalctl -u o2-aliecs-core
– Logs begin at mar. 2019-10-22 15:46:43 CEST, end at jeu. 2019-10-24 13:44:13 CEST. –
oct. 22 15:46:52 localhost.localdomain systemd[1]: [/etc/systemd/system/o2-aliecs-core.service:14] Unknown lvalue ‘SyslogIndentifier’ in section ‘Service’
oct. 22 15:47:02 alimftcom2 systemd[1]: Started O² AliECS core.
oct. 22 15:47:03 alimftcom2 systemd[1]: [/etc/systemd/system/o2-aliecs-core.service:14] Unknown lvalue ‘SyslogIndentifier’ in section ‘Service’

Thanks in advance for any clue to solve this situation.

Best regards,
Andry

Hello Andry,

from what I can tell by looking at the Mesos log, this machine performed at least one successful or partially successful environment deployment, followed by a core restart.

Mesos then couldn’t allocate resources for your environment deployment attempts, because the resources were already in use for currently running processes (several instances of Readout in this case). This is likely to be a transitory condition, but if it happens again please let us know, and please include any information you have on the exact workflows that were deployed or attempted to deploy since the last reboot.

The easiest remedy at this point is to simply reboot the machine.

Dear Teo,

Thanks a lot for your advice. I just reboot alimftcom2. And right after the reboot I tried again to create a new environment. Unfortunately, it is still in error, but the error message is different:

Request to server failed (504 Gateway Timeout): 13 INTERNAL: cannot create new environment: transition canceled with error: transition unsuccessful: ok: false, trigger: DEVICE_ERROR, event: CONFIGURE, state: ERROR
Cheers,
Andry

Hello again Andry,

thank you for your reply.

I’m pasting here the Mesos log of what happened (you can get this kind of output from the Mesos GUI, click on a task’s “Sandbox” on the right and then “stdout”):

time="2019-10-24T16:36:01+02:00" level=debug msg="polling task for IDLE state reached" elapsed=3s id=97e02e33-f66b-11e9-99dd-10c37b1f7cbf prefix=executor task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=info msg="[request GetState] handler BEGIN" prefix=task-stdout task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=info msg="[request GetState] handler END" prefix=task-stdout task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=debug msg="task status queried" prefix=executor state=STANDBY task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=debug msg="task running and ready for control input" id=97e02e33-f66b-11e9-99dd-10c37b1f7cbf prefix=executor task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=debug msg="notifying of task running state" prefix=executor task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=debug msg="sending UPDATE on task status" id=97e02e33-f66b-11e9-99dd-10c37b1f7cbf prefix=executor status=TASK_RUNNING
time="2019-10-24T16:36:01+02:00" level=info msg="[request EventStream] handler BEGIN" prefix=task-stdout task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=debug msg="handling event" event=ACKNOWLEDGED prefix=executor
time="2019-10-24T16:36:01+02:00" level=debug msg="begin new event loop iteration" prefix=executor
time="2019-10-24T16:36:01+02:00" level=debug msg="handling event" event=MESSAGE prefix=executor
time="2019-10-24T16:36:01+02:00" level=debug msg="received message data" length=380 message="{\"name\":\"MesosCommand_Transition\",\"id\":\"99d95f78-f66b-11e9-99dd-10c37b1f7cbf\",\"timeout\":45000000000,\"arguments\":{},\"targetList\":[{\"AgentId\":{\"value\":\"29fecec7-b21b-4c24-831b-c7ec85f828c9-S0\"},\"ExecutorId\":{\"value\":\"97dfe4da-f66b-11e9-99dd-10c37b1f7cbf\"},\"TaskId\":{\"value\":\"97e02e33-f66b-11e9-99dd-10c37b1f7cbf\"}}],\"source\":\"STANDBY\",\"event\":\"CONFIGURE\",\"destination\":\"CONFIGURED\"}" prefix=executor
time="2019-10-24T16:36:01+02:00" level=debug msg="processing incoming MESSAGE" name=MesosCommand_Transition payload="{\"name\":\"MesosCommand_Transition\",\"id\":\"99d95f78-f66b-11e9-99dd-10c37b1f7cbf\",\"timeout\":45000000000,\"arguments\":{},\"targetList\":[{\"AgentId\":{\"value\":\"29fecec7-b21b-4c24-831b-c7ec85f828c9-S0\"},\"ExecutorId\":{\"value\":\"97dfe4da-f66b-11e9-99dd-10c37b1f7cbf\"},\"TaskId\":{\"value\":\"97e02e33-f66b-11e9-99dd-10c37b1f7cbf\"}}],\"source\":\"STANDBY\",\"event\":\"CONFIGURE\",\"destination\":\"CONFIGURED\"}" prefix=executor
time="2019-10-24T16:36:01+02:00" level=debug msg="CONFIGURE pushing FairMQ properties" map="map[]" prefix=executor taskId="{97e02e33-f66b-11e9-99dd-10c37b1f7cbf}"
time="2019-10-24T16:36:01+02:00" level=debug msg="executor<->occplugin interface requesting transition" event=CONFIGURE prefix=executorcmd
time="2019-10-24T16:36:01+02:00" level=info msg="[request Transition] handler BEGIN" prefix=task-stdout task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=info msg="[request Transition] src: STANDBY currentState: STANDBY event: CONFIGURE" prefix=task-stdout task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=info msg="Object: Readout Process - processing event configure in state STANDBY with run number 0." prefix=task-stdout task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=info msg="Object: Readout Process - event configure processed in state STANDBY. New state: ERROR" prefix=task-stdout task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=debug msg="response received, about to parse status" prefix=executorcmd
time="2019-10-24T16:36:01+02:00" level=info msg="Object: Readout Process - updating state = ERROR" prefix=task-stdout task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=info msg="[request Transition] handler END, new state: ERROR" prefix=task-stdout task="github.com/AliceO2Group/ControlWorkflows/tasks/readout@4e4a405cfeb69a202aa1e534309de5bb4fe9e1dc#97e02e33-f66b-11e9-99dd-10c37b1f7cbf"
time="2019-10-24T16:36:01+02:00" level=debug msg="begin new event loop iteration" prefix=executor
time="2019-10-24T16:36:01+02:00" level=debug msg="response sent" commandId=99d95f78-f66b-11e9-99dd-10c37b1f7cbf commandName=MesosCommand_Transition error="transition unsuccessful: ok: false, trigger: DEVICE_ERROR, event: CONFIGURE, state: ERROR" prefix=executor state=ERROR

It looks like the Readout process started successfully, but failed somewhere in the CONFIGURE transition.

@Andry have you perhaps modified the Readout configuration file, or any other configuration file?

@sy-c does any of this look familiar to you?

The readout logs (in infologger) can certainly explain what happened, an error message was certainly issued in this situation. This can happen for any number of reasons involving readout hardware or system resources.
I don’t expect users to dig into mesos debug logs to find that readout fails to configure.

Dear Teo,

Thanks for the clue of how to get Mesos log. This will be very useful.

Indeed, I changed a bit the Readout config file, to enable the 2 CRUs in the FLP. The modifications were done in the lines of what was described in the test scenario number 3 of https://gitlab.cern.ch/AliceO2Group/system-configuration/blob/flp-suite-v0.2.1/ansible/docs/FLP_test_scenarios.xlsx

i.e. in more details: disable the dummy equipment, enable the rorc 1 and change the “generatorLoopback” to DDG. I also added a second rorc, but I just realised I forgot to change its name from equipment-rorc-1 to equipment-roc-2 (copy/paste mistake). I retried with this typo corrected, after a reboot, but it did not help.

Cheers,
Andry

----> relevant lines of readout.cfg

define a (enabled) dummy equipment generating random data

[equipment-dummy-1]
enabled=0
equipmentType=dummy
eventMinSize=100k
eventMaxSize=200k
memoryPoolNumberOfPages=100
memoryPoolPageSize=1M
fillData=1

define a (disabled) CRU equipment for CRU end point 02:00.0

with 1 link and internal data generator

[equipment-rorc-1]
enabled=1
equipmentType=rorc
cardId=02:00.0
generatorEnabled=1
generatorLoopback=DDG
memoryPoolNumberOfPages=2048
memoryPoolPageSize=1M
linkMask=0-3

define a (disabled) CRU equipment for CRU end point 85:00.0

with 1 link and internal data generator

[equipment-rorc-2]
enabled=1
equipmentType=rorc
cardId=85:00.0
generatorEnabled=1
generatorLoopback=DDG
memoryPoolNumberOfPages=2048
memoryPoolPageSize=1M
linkMask=4-11

Hi,

Has mentioned by Sylvain, the best starting point to debug is infoLogger.
Here’s what I did:

  1. Opened InfoLogger GUI at http://alimftcom2.dyndns.cern.ch:8081
  2. Unselected Debug messages
  3. Set facility=readout filter
  4. Clicked on Query

There I could see the following relevant messages:

24/10/2019 16:36:01: Error : duplicate section name in /home/flp/readout.cfg line 33
This refers to the failure you mentioned after the reboot. As stated this was fixed.

24/10/2019 17:55:58 Not enough space left in memory bank ‘bank-0’ (need 2147483648 bytes more)
Now the config error is gone, but there’s a new error related with memory allocation. The explanation is the following: you enabled the second CRU without provisioning the corresponding memory for it to operate. If you look into the readout config file, there’s a section at the top called [bank-0], please increase the size from 2G to 4G and it should work (I’d do a reboot to start from a clean state).

Let us know if this helped.

Cheers,
Vasco

Dear Vasco,

Thanks a lot for the detailed answer. Indeed, I am still in the learning curve of how to use the various FLP suite components. Now that all three MFT FLPs have a fresh re-installation of the FLP suite, I can try again. Since I know that Stefano had a hand-on session on the last FLP suite, I will also ask him to propagate his new knowledge to me :wink:

Cheers,
Andry