Problems with coconut during FLPSuite upgrade (0.17.2 -> 0.19)

Dear Experts,

I cannot upgrade FLPSuite because of problems with coconut. Here is the last part of the relevant log:

Refresh AliECS repositories (via handler)... 
  j183.localdomain ok
included: /home/flp/.local/share/o2-flp-setup/system-configuration/ansible/roles/control-consul/tasks/refresh.yml for j183.localdomain
Wait for AliECS to respond on grpc port (via handler)... 
  j183.localdomain failed | msg: Timeout when waiting for j183:32102
Run coconut repo refresh (via handler)... 
  j183.localdomain failed | msg: non-zero return code | stdout: Repository refresh operation failed. | stderr: FATAL refresh:     command finished with error error=rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:32102: connect: connection refused"
Restart o2-qcg (via handler)... 
  j183.localdomain done
Restart mesos-master (via handler)... 
  j183.localdomain done
Restart mesos-slave (via handler)... 
  j183.localdomain done
  Ran out of hosts!

- Play recap -
  j183.localdomain           : ok=303  changed=63   unreachable=0    failed=2    rescued=0    ignored=1   

SSH key is not found. ssh to root@j183.localdomain using the keyboard interactive method.
Password: 
Couldn't establish a connection to the remote server  ssh: handshake failed: EOF

Full log: CERNBox

Just before upgrading, I added my fork of the ControlWorkflows repository with:

coconut repo add github.com/afurs/ControlWorkflows

I also tried to check the coconut/AliECS status:

[flp@j183 ~]$ coconut repo list
FATAL list:        command finished with error error=rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:32102: connect: connection refused"
[flp@j183 ~]$ coconut info
FATAL info:        command finished with error error=rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:32102: connect: connection refused"
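
The "connection refused" on 127.0.0.1:32102 suggests that nothing is listening on the AliECS control port, rather than coconut itself being broken. A minimal sketch for probing the port directly is given here; the function name port_open is hypothetical, and the sketch assumes that bash (for its /dev/tcp redirection) and the coreutils timeout command are available on the host:

```shell
# Return 0 if a TCP connection to host:port can be opened within 2 seconds,
# non-zero otherwise. port_open is a hypothetical helper name.
port_open() {
    host=$1
    port=$2
    # bash treats /dev/tcp/<host>/<port> as a TCP connection when redirecting;
    # the values are passed as positional arguments to avoid quoting issues.
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example (port taken from the error message above):
#   port_open 127.0.0.1 32102 && echo "port open" || echo "port closed"
```

If the port is closed, the next place to look is the service that should be listening on it.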

[flp@j183 ~]$ systemctl status o2-aliecs-core.service
● o2-aliecs-core.service - O² AliECS core
   Loaded: loaded (/etc/systemd/system/o2-aliecs-core.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Сб 2021-07-03 11:10:55 CEST; 1s ago
  Process: 18443 ExecStart=/bin/bash -a -c source /opt/rh/rh-git29/enable && /opt/o2/bin/o2-aliecs-core --controlPort 32102 --configServiceUri apricot://j183:32101 (code=exited, status=1/FAILURE)
  Process: 18413 ExecStartPre=/bin/bash -c until consul kv get o2/components/aliecs/ANY/any/settings; do sleep 1; done (code=exited, status=0/SUCCESS)
 Main PID: 18443 (code=exited, status=1/FAILURE)

июл 03 11:10:55 j183.localdomain systemd[1]: Unit o2-aliecs-core.service entered failed state.
июл 03 11:10:55 j183.localdomain systemd[1]: o2-aliecs-core.service failed.
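
The status output above shows only that the process exited with status 1, not why. The reason is usually in the unit's journal. A small sketch for pulling the most recent journal lines of a unit follows; show_unit_log is a hypothetical helper name, and the default of 50 lines is an arbitrary choice:

```shell
# Print the last N journal lines (default 50) for a systemd unit.
# show_unit_log is a hypothetical helper name.
show_unit_log() {
    unit=$1
    lines=${2:-50}
    journalctl -u "$unit" -n "$lines" --no-pager
}

# Example (unit name taken from the status output above):
#   show_unit_log o2-aliecs-core.service 100
```

Reading the journal may require root or membership in a group with journal access, depending on the system configuration.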


Best Regards,
Artur Furs.

Additional info:

[flp@j183 ~]$ systemctl status consul.service
● consul.service - Consul agent
   Loaded: loaded (/etc/systemd/system/consul.service; enabled; vendor preset: disabled)
   Active: active (running) since Сб 2021-07-03 11:05:00 CEST; 8min ago
 Main PID: 13905 (consul)
    Tasks: 38
   CGroup: /system.slice/consul.service
           └─13905 /usr/local/bin/consul agent -server -bootstrap-expect 1 -bind 127.0.0.1 -client {{ GetAllInterfaces | include "flags" "forwardable|up" | exclude ...

июл 03 11:05:08 j183.localdomain consul[13905]: 2021-07-03T11:05:08.194+0200 [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=
июл 03 11:05:08 j183.localdomain consul[13905]: 2021-07-03T11:05:08.194+0200 [INFO]  agent.server.raft: entering candidate state: node="Node at 127.0.0...." term=44
июл 03 11:05:08 j183.localdomain consul[13905]: 2021-07-03T11:05:08.259+0200 [INFO]  agent.server.raft: election won: tally=1
июл 03 11:05:08 j183.localdomain consul[13905]: 2021-07-03T11:05:08.259+0200 [INFO]  agent.server.raft: entering leader state: leader="Node at 127.0.0.1...[Leader]"
июл 03 11:05:08 j183.localdomain consul[13905]: 2021-07-03T11:05:08.259+0200 [INFO]  agent.server: cluster leadership acquired
июл 03 11:05:08 j183.localdomain consul[13905]: 2021-07-03T11:05:08.259+0200 [INFO]  agent.server: New leader elected: payload=j183.localdomain
июл 03 11:05:08 j183.localdomain consul[13905]: 2021-07-03T11:05:08.507+0200 [INFO]  agent.leader: started routine: routine="federation state anti-entropy"
июл 03 11:05:08 j183.localdomain consul[13905]: 2021-07-03T11:05:08.507+0200 [INFO]  agent.leader: started routine: routine="federation state pruning"
июл 03 11:05:11 j183.localdomain consul[13905]: 2021-07-03T11:05:11.016+0200 [INFO]  agent: Synced node info
июл 03 11:05:19 j183.localdomain consul[13905]: 2021-07-03T11:05:19.833+0200 [INFO]  agent: Newer Consul version available: new_version=1.10.0 current_version=1.9.5
Hint: Some lines were ellipsized, use -l to show in full.
[flp@j183 ~]$ systemctl status o2-apricot.service
● o2-apricot.service - O² apricot
   Loaded: loaded (/etc/systemd/system/o2-apricot.service; enabled; vendor preset: disabled)
   Active: active (running) since Сб 2021-07-03 10:56:07 CEST; 17min ago
 Main PID: 22747 (o2-apricot)
    Tasks: 27
   CGroup: /system.slice/o2-apricot.service
           └─22747 /opt/o2/bin/o2-apricot --backendUri consul://j183:8500

июл 03 10:59:49 j183.localdomain systemd[1]: [/etc/systemd/system/o2-apricot.service:14] Unknown lvalue 'SyslogIndentifier' in section 'Service'
июл 03 11:00:59 j183.localdomain systemd[1]: [/etc/systemd/system/o2-apricot.service:14] Unknown lvalue 'SyslogIndentifier' in section 'Service'
июл 03 11:01:32 j183.localdomain systemd[1]: [/etc/systemd/system/o2-apricot.service:14] Unknown lvalue 'SyslogIndentifier' in section 'Service'
июл 03 11:01:37 j183.localdomain systemd[1]: [/etc/systemd/system/o2-apricot.service:14] Unknown lvalue 'SyslogIndentifier' in section 'Service'
июл 03 11:02:31 j183.localdomain systemd[1]: [/etc/systemd/system/o2-apricot.service:14] Unknown lvalue 'SyslogIndentifier' in section 'Service'
июл 03 11:04:10 j183.localdomain systemd[1]: [/etc/systemd/system/o2-apricot.service:14] Unknown lvalue 'SyslogIndentifier' in section 'Service'
июл 03 11:04:11 j183.localdomain systemd[1]: [/etc/systemd/system/o2-apricot.service:14] Unknown lvalue 'SyslogIndentifier' in section 'Service'
июл 03 11:04:18 j183.localdomain systemd[1]: [/etc/systemd/system/o2-apricot.service:14] Unknown lvalue 'SyslogIndentifier' in section 'Service'
июл 03 11:04:20 j183.localdomain systemd[1]: [/etc/systemd/system/o2-apricot.service:14] Unknown lvalue 'SyslogIndentifier' in section 'Service'
июл 03 11:04:59 j183.localdomain systemd[1]: [/etc/systemd/system/o2-apricot.service:14] Unknown lvalue 'SyslogIndentifier' in section 'Service'
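
A side note on the repeated warning above: the systemd directive is spelled SyslogIdentifier, so the line with 'SyslogIndentifier' in the unit file is ignored by systemd. This appears unrelated to the AliECS failure. A small sketch for locating the misspelled line follows; find_misspelled_directive is a hypothetical helper name:

```shell
# Print, with line numbers, any lines of a unit file that contain the
# misspelled directive reported in the warnings above.
# find_misspelled_directive is a hypothetical helper name.
find_misspelled_directive() {
    grep -n 'SyslogIndentifier' "$1"
}

# Example (path taken from the warning message):
#   find_misspelled_directive /etc/systemd/system/o2-apricot.service
```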

Hello Artur, in which phase does the error occur: deploy --modules post-installation or deploy?

Did you reboot your machine(s) after each step, as described in the instructions?

Hello Roberto,
thank you for replying.
Well, it seems the problem is gone :)

I used “deploy” for the upgrade, and the problem was caused by the extra repositories for coconut (I had added a repo before upgrading). I removed them manually from “/var/lib/o2/aliecs/repos/”. Somehow that helped, and the AliECS service restarted successfully at the end of the upgrade.
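
For anyone hitting the same issue, a more cautious variant of the manual removal is to move the extra repository aside instead of deleting it. This is only a sketch: stash_repo is a hypothetical helper name, and the relative path of a repository under the repos directory is an assumption that should be checked on the actual host:

```shell
# Move one cloned repository out of the repos directory into a backup
# directory, preserving its relative path. stash_repo is a hypothetical helper.
# Usage: stash_repo <repos_dir> <relative_repo_path> <backup_dir>
stash_repo() {
    repos_dir=$1
    rel_path=$2
    backup_dir=$3
    src="$repos_dir/$rel_path"
    dest="$backup_dir/$rel_path"
    if [ ! -d "$src" ]; then
        echo "not found: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "backup already exists: $dest" >&2
        return 1
    fi
    # Create the parent directories of the destination, then move the clone.
    mkdir -p "$(dirname "$dest")" || return 1
    mv "$src" "$dest"
}

# Example (the relative path below is an assumed layout, not confirmed):
#   stash_repo /var/lib/o2/aliecs/repos github.com/afurs/ControlWorkflows /root/aliecs-repos-backup
```

After moving the repository aside, the core can be restarted with `systemctl restart o2-aliecs-core.service` and checked again with `coconut info`.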

Best Regards,
Artur Furs.