Dear all,
we have several FLPs that we use as a test bench (plus the MID FLP at CERN).
All FLPs are ok, except for one, for which I can run readout.exe only as user root.
If I launch readout.exe as user flp, I get:
2020-11-20 14:27:07.170398 [pci=3b:00.0 serial=0 endpoint=0 channel=0] Initializing with DMA buffer from memory region
2020-11-20 14:27:07.251961 Freeing PDA buffer '/sys/bus/pci/drivers/uio_pci_dma/0000:3b:00.0/dma/1000000000/map'
2020-11-20 14:27:07.274780 [pci=3b:00.0 serial=0 endpoint=0 channel=0] Releasing DMA channel lock
2020-11-20 14:27:07.274874 !!! Error - Exception : Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer
2020-11-20 14:27:07.274982 !!! Error - /mnt/mesos/sandbox/sandbox/jenkins/workspace/BuildRPM/sw/20156182/1/SOURCES/ReadoutCard/v0.25.1/v0.25.1/src/Pda/PdaDmaBuffer.cxx(69): Throw in function AliceO2::roc::Pda::PdaDmaBuffer::PdaDmaBuffer(PciDevice*, void*, size_t, int, bool)
Dynamic exception type: boost::wrapexcept<AliceO2::roc::PdaException>
std::exception::what: Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer
[Possible cause] = Program previously exited without cleaning up DMA buffer, reinserting DMA kernel module may help, but ensure no channels are open before reinsertion (modprobe -r uio_pci_dma; modprobe uio_pci_dma
[AliceO2::Common::ErrorInfo::_Message*] = Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer
2020-11-20 14:27:07.275087 !!! Error - Failed to configure equipment equipment-rorc-1
2020-11-20 14:27:07.275102 !!! Fatal - Some equipments failed to initialize, exiting
If I re-launch it as user root, everything works just fine.
I guess it is a permission issue, but I’m not sure how to solve it.
Note that flp is in the “pda” group, as it should be.
Thanks in advance for any hint,
best regards,
Diego
Hi @sy-c ,
rebooting does not really help. We have had this issue for months. We have rebooted several times in the last few days (most recently yesterday), because we recently upgraded to FLP suite 0.11, but the problem persists.
On the other hand, I’ve never seen this error when running as root.
roc-bench-dma --i=#0 --links=0-1 --to-file-bin=/tmp/mid --bytes=100M --data=FEE --no-er --bypass
Error: Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer
[Possible cause] = Program previously exited without cleaning up DMA buffer, reinserting DMA kernel module may help, but ensure no channels are open before reinsertion (modprobe -r uio_pci_dma; modprobe uio_pci_dma
[Error message] = Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer
And here is the configuration:
[readout]
# disable slicing into timeframes
# needed if we don't have enough pages to buffer at least 1 STF per link
disableAggregatorSlicing=1
# setup memory bank of 2GB using HugePages
[bank-0]
type=MemoryMappedFile
size=2G
[equipment-rorc-1]
enabled=1
equipmentType=rorc
cardId=#0
dataSource=Fee
memoryBankName=bank-0
memoryPoolNumberOfPages=2047
memoryPoolPageSize=1M
rdhUseFirstInPageEnabled=1
linkMask=0,1,15
firmwareCheckEnabled=0
[consumer-rec]
enabled=1
consumerType=fileRecorder
fileName=/tmp/data.raw
bytesMax=100M
Removing readout 2MB hugepage mappings
rm: cannot remove ‘/var/lib/hugetlbfs/global/pagesize-2MB/readout*’: No such file or directory
Removing readout 1GB hugepage mappings
rm: cannot remove ‘/var/lib/hugetlbfs/global/pagesize-1GB/readout*’: No such file or directory
Removing roc-bench-dma 2MB hugepage mappings
rm: cannot remove ‘/var/lib/hugetlbfs/global/pagesize-2MB/roc-bench-dma*’: No such file or directory
Removing roc-bench-dma 1GB hugepage mappings
rm: cannot remove ‘/var/lib/hugetlbfs/global/pagesize-1GB/roc-bench-dma*’: No such file or directory
Removing uio_pci_dma
Reinserting uio_pci_dma
After this, roc-bench-dma still fails.
Cheers,
Diego
Hi @kalexopo,
actually I connect to the machine with an ssh public key.
Can you send me your public key privately at dstocco@cern.ch ?
I’ll add it to the flp so that you can connect (and provide the hostname).
So we found a way to reproduce the problem. If the login happens as follows
ssh root@flpmid
su - flp
everything is fine: ulimit -l correctly reports the value enforced by PDA, which is unlimited, and the ReadoutCard library works.
If the login happens like this
ssh flp@flpmid
the environment differs from the one above. In particular, ulimit -l reports 64 (KiB), the default value, instead of unlimited, which causes the errors reported above.
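To compare the two login paths without relying on the shell builtin, the same limit can be read from a script. This is only a minimal sketch: `ulimit -l` corresponds to the RLIMIT_MEMLOCK resource limit, which Python exposes through the standard `resource` module; the helper names `describe_memlock_limit` and `memlock_is_unlimited` are made up for illustration and are not part of ReadoutCard or PDA.

```python
import resource


def describe_memlock_limit(limit: int) -> str:
    """Format an RLIMIT_MEMLOCK value (bytes) the way `ulimit -l` prints it (KiB)."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return str(limit // 1024)


def memlock_is_unlimited() -> bool:
    """True if the current process may lock an unlimited amount of memory."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("soft:", describe_memlock_limit(soft))
    print("hard:", describe_memlock_limit(hard))
    print("unlimited:", memlock_is_unlimited())
```

Running it once in a session opened with ssh flp@flpmid and once after ssh root@flpmid followed by su - flp should show whether the limit differs between the two sessions.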
This points to an issue with the SSH login that has not previously been reported on FLP setups. Searching online turns up issues related to PAM authentication. Interestingly, the sshd status also reports a PAM issue.
[flp@flpmid ~] systemctl status sshd
● sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-11-18 12:46:10 CET; 2 days ago
Docs: man:sshd(8)
man:sshd_config(5)
Main PID: 3208 (sshd)
Tasks: 24
CGroup: /system.slice/sshd.service
├─ 3208 /usr/sbin/sshd -D
├─155958 sshd: flp [priv]
├─155961 sshd: flp@pts/1
├─155963 -bash
├─156195 dbus-launch --autolaunch 72549b562c264f7e8a7ea44c69b15fca --binary-syntax --close-stderr
├─156196 /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
├─161615 sshd: root@pts/3
├─161677 -bash
├─181893 sshd: root@pts/4
├─181917 -bash
├─194160 sshd: root@pts/2
├─194177 -bash
├─194824 sshd: flp [priv]
├─194827 sshd: flp@pts/5
├─194829 -bash
├─194969 bash -i
├─196257 sshd: flp [priv]
├─196265 sshd: flp@pts/0
├─196266 -bash
├─196729 sshd: [accepted]
├─196730 sshd: [net]
├─196871 sshd: [accepted]
├─196873 sshd: [net]
└─196972 systemctl status sshd
Nov 20 16:44:28 flpmid sshd[196890]: WARNING: 'UsePAM no' is not supported in Red Hat Enterprise Linux and may cause several problems.
Nov 20 16:44:28 flpmid sshd[196890]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Nov 20 16:44:28 flpmid sshd[196890]: Connection from 137.74.173.182 port 39496 on 154.114.23.2 port 22
Nov 20 16:44:29 flpmid sshd[196890]: reprocess config line 47: Deprecated option RSAAuthentication
Nov 20 16:44:29 flpmid sshd[196890]: Invalid user ranjit from 137.74.173.182 port 39496
Nov 20 16:44:29 flpmid sshd[196890]: input_userauth_request: invalid user ranjit [preauth]
Nov 20 16:44:29 flpmid sshd[196890]: error: Could not get shadow information for NOUSER
Nov 20 16:44:29 flpmid sshd[196890]: Failed password for invalid user ranjit from 137.74.173.182 port 39496 ssh2
Nov 20 16:44:30 flpmid sshd[196890]: Received disconnect from 137.74.173.182 port 39496:11: Bye Bye [preauth]
Nov 20 16:44:30 flpmid sshd[196890]: Disconnected from 137.74.173.182 port 39496 [preauth]
Hint: Some lines were ellipsized, use -l to show in full.
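The 'UsePAM no' warning in the log above is a plausible lead, assuming the unlimited value is applied through PAM session limits (pam_limits), which sshd skips when PAM is disabled. I have not verified that this is how PDA sets it on this machine. As a rough sketch, the setting can be extracted from the sshd configuration text like this; the function name `use_pam_setting` is hypothetical, and the fallback of "no" is my assumption about the upstream OpenSSH default when the keyword is absent.

```python
def use_pam_setting(config_text: str, default: str = "no") -> str:
    """Return the UsePAM value from sshd_config text (first occurrence wins).

    The `default` is an assumption about what sshd uses when the keyword
    is not present in the file.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Keyword and argument may be separated by whitespace or by '='.
        fields = line.replace("=", " ", 1).split()
        if len(fields) >= 2 and fields[0].lower() == "usepam":
            return fields[1].lower()
    return default


# Illustrative input only, not the actual file from flpmid.
sample = """
# example fragment
Port 22
UsePAM no
"""
print(use_pam_setting(sample))
```

On the real machine the text would come from /etc/ssh/sshd_config, which typically needs root to read.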
I leave this to the system administrators in charge of the MID FLP.