Readout.exe only works if launched as user root

Dear all,
we have different FLPs that we use as a test bench (plus the MID FLP at CERN).
All FLPs are ok, except for one, for which I can run readout.exe only as user root.
If I launch the readout.exe as user flp I get:

2020-11-20 14:27:07.170398 [pci=3b:00.0 serial=0 endpoint=0 channel=0] Initializing with DMA buffer from memory region
2020-11-20 14:27:07.251961 Freeing PDA buffer '/sys/bus/pci/drivers/uio_pci_dma/0000:3b:00.0/dma/1000000000/map'
2020-11-20 14:27:07.274780 [pci=3b:00.0 serial=0 endpoint=0 channel=0] Releasing DMA channel lock
2020-11-20 14:27:07.274874 !!! **Error - Exception : Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer**
2020-11-20 14:27:07.274982 !!! **Error - /mnt/mesos/sandbox/sandbox/jenkins/workspace/BuildRPM/sw/20156182/1/SOURCES/ReadoutCard/v0.25.1/v**
**0.25.1/src/Pda/PdaDmaBuffer.cxx(69): Throw in function AliceO2::roc::Pda::PdaDmaBuffer::PdaDmaBuffer(PciDevice*, void*, size_t, int, boo**
**l)**
**Dynamic exception type: boost::wrapexcept<AliceO2::roc::PdaException>**
**std::exception::what: Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer**
**[Possible cause] = Program previously exited without cleaning up DMA buffer, reinserting DMA kernel module may help, but ensure no channels are open before reinsertion (modprobe -r uio_pci_dma; modprobe uio_pci_dma**
**[AliceO2::Common::ErrorInfo::_Message*] = Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer**
2020-11-20 14:27:07.275087 !!! **Error - Failed to configure equipment equipment-rorc-1**
2020-11-20 14:27:07.275102 !!! **Fatal - Some equipments failed to initialize, exiting**

If I re-launch it as user root, everything works just fine.
I guess it is a permission issue, but I’m not sure how to solve it
Notice that flp is in group “pda” as it should.

Thanks in advance for any hint,
best regards,
Diego

Hello,
I don’t think this is a permission problem, I have occasionally seen this error also running as root. Please try to reboot the machine.

cheers,
Sylvain

Hi @sy-c ,
reboot does not really help. We have this issue since months. We’ve performed several reboots in these days (last one was yesterday), because we recently upgraded to flp suite 0.11, but the problem persists.
I’ve never seen this error when running as root, on the other hand.

Cheers,
Diego

Can you reproduce this error with roc-bench-dma ?
Can you please show the readout configuration?

cheers,
Sylvain

Hi @sy-c,
yes I can:

roc-bench-dma --i=#0 --links=0-1 --to-file-bin=/tmp/mid --bytes=100M --data=FEE --no-er --bypass
Error: Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer
[Possible cause] = Program previously exited without cleaning up DMA buffer, reinserting DMA kernel module may help, but ensure no channels are open before reinsertion (modprobe -r uio_pci_dma; modprobe uio_pci_dma
[Error message] = Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer

And here is the configuration:

[readout]
# disable slicing into timeframes
# needed if we don't have enough pages to buffer at least 1 STF per link
disableAggregatorSlicing=1

# setup memory bank of 2GB using HugePages
[bank-0]
type=MemoryMappedFile
size=2G

[equipment-rorc-1]
enabled=1
equipmentType=rorc
cardId=#0
dataSource=Fee
memoryBankName=bank-0
memoryPoolNumberOfPages=2047
memoryPoolPageSize=1M
rdhUseFirstInPageEnabled=1
linkMask=0,1,15
firmwareCheckEnabled=0

[consumer-rec]
enabled=1
consumerType=fileRecorder
fileName=/tmp/data.raw
bytesMax=100M

If the error happens with roc-bench-dma, it’s a readoutcard issue, not related to readout.exe. @kalexopo should advise.

Hello @dstocco. Can you please give me the output of the next two commands?

cat /etc/o2.d/FLP_suite_version
aliswmod list #from within the env your are launching roc-bench-dma

Can you also please try sudo roc-cleanup and try running roc-bench-dma again?

Cheers,
Kostas

Hi @kalexopo
here they are:

cat /etc/o2.d/FLP_suite_version
0.11.0 

 aliswmod list
Currently Loaded Modulefiles:
  1) BASE/1.0                       16) lz4/v1.8.3-16
  2) GCC-Toolchain/v7.3.0-alice2-9  17) fmt/6.1.2-25
  3) zlib/v1.2.8-65                 18) FairLogger/v1.9.0-4
  4) libpng/v1.6.34-67              19) ZeroMQ/v4.3.2-9
  5) libffi/v3.2.1-21               20) ofi/v1.7.1-13
  6) FreeType/v2.10.1-15            21) asiofi/v0.3.1-94
  7) Python/v3.6.10-62              22) DDS/3.5.1-21
  8) boost/v1.72.0-alice2-9         23) FairMQ/v1.4.25-5
  9) Monitoring/v3.3.4-1            24) protobuf/v3.12.3-6
 10) Ppconsul/0.1.0-99              25) c-ares/v1.15.0-13
 11) Configuration/v2.2.6-50        26) OpenSSL/v1.0.2o-30
 12) Common-O2/v1.4.9-46            27) grpc/v1.30.0-alice1-6
 13) libInfoLogger/v1.3.14-5        28) Control-OCCPlugin/v0.17.1-1
 14) PDA/12.0.0-18                  29) Readout/v1.4.7.2-4
 15) ReadoutCard/v0.25.1-3

roc-cleanup

Removing readout 2MB hugepage mappings
rm: cannot remove ?/var/lib/hugetlbfs/global/pagesize-2MB/readout*?: No such file or directory
Removing readout 1GB hugepage mappings
rm: cannot remove ?/var/lib/hugetlbfs/global/pagesize-1GB/readout*?: No such file or directory
Removing roc-bench-dma 2MB hugepage mappings
rm: cannot remove ?/var/lib/hugetlbfs/global/pagesize-2MB/roc-bench-dma*?: No such file or directory
Removing roc-bench-dma 1GB hugepage mappings
rm: cannot remove ?/var/lib/hugetlbfs/global/pagesize-1GB/roc-bench-dma*?: No such file or directory
Removing uio_pci_dma
Reinserting uio_pci_dma

After this, roc-bench-dma still fails.
Cheers,
Diego

Thanks for the info. Is it okay if I log on the machine to debug?

Cheers,
Kostas

Hi @kalexopo,
actually I connect to the machine with an ssh public key.
Can you send me your public key privately at dstocco@cern.ch ?
I’ll add it to the flp so that you can connect (and provide the hostname).

Thanks!
Cheers,
Diego

I can actually start roc-bench-dma and readout.exe as flp succesfully, without the error reported…

[flp@flpmid ~] INFOLOGGER_MODE=stdout roc-bench-dma --i=#0 --links=0-1 --to-file-bin=/tmp/mid --bytes=100M --data=FEE --no-er --bypass
2020-11-20 16:11:16.021580     DMA channel: 0
2020-11-20 16:11:16.022031     IOMMU enabled
2020-11-20 16:11:16.022205     Using buffer file path: /var/lib/hugetlbfs/global/pagesize-1GB/roc-bench-dma_id=#0_chan=0_pages
2020-11-20 16:11:16.022250     Buffer size: 1073741824
2020-11-20 16:11:16.022258     Superpage size: 1048576
2020-11-20 16:11:16.022281     Superpages in buffer: 1024
2020-11-20 16:11:16.022291     Superpage limit: 95
2020-11-20 16:11:16.022298     DMA page size: 8192
2020-11-20 16:11:16.095455     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Acquiring DMA channel lock
2020-11-20 16:11:16.095495     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Acquired DMA channel lock
2020-11-20 16:11:16.095754     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Initializing with DMA buffer from memory region
2020-11-20 16:11:16.354770     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Scatter-gather list size: 1
2020-11-20 16:11:16.357330     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Buffer is hugepage-backed
2020-11-20 16:11:16.357699     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Using link(s): 0 1 15
2020-11-20 16:11:16.357731     Card type: CRU
2020-11-20 16:11:16.357748     Card PCI address: 3b:00.0
2020-11-20 16:11:16.357811     Card NUMA node: 0
2020-11-20 16:11:16.357845     Card firmware info: 883db8e0
2020-11-20 16:11:16.357852     Starting benchmark
2020-11-20 16:11:16.357865     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Starting DMA
[flp@flpmid ~] INFOLOGGER_MODE=stdout readout.exe file:readout_mid_dbg.cfg
2020-11-20 16:15:55.030367     Readout 1.4.7 - process starting, pid 184565
2020-11-20 16:15:55.030534     Optional built features enabled:
2020-11-20 16:15:55.030553     READOUTCARD : yes
2020-11-20 16:15:55.030570     CONFIG : yes
2020-11-20 16:15:55.030587     FAIRMQ : yes
2020-11-20 16:15:55.030733     NUMA : yes
2020-11-20 16:15:55.030756     RDMA : no
2020-11-20 16:15:55.030774     OCC : yes
2020-11-20 16:15:55.030810     LOGBOOK : no
2020-11-20 16:15:55.030826     ZMQ : yes
2020-11-20 16:15:55.030854     Readout entering standalone state machine
2020-11-20 16:15:55.030875     Readout executing CONFIGURE
2020-11-20 16:15:55.030907     Reading configuration from file:readout_mid_dbg.cfg
2020-11-20 16:15:55.031279     Merging selected content of OCC configuration
2020-11-20 16:15:55.031308     No OCC FMQ channels configuration found
2020-11-20 16:15:55.031348     CPU deep sleep not disabled for process
2020-11-20 16:15:55.031978     Creating memory bank bank-0: type MemoryMappedFile size 2147483648
2020-11-20 16:15:55.032015     Creating shared memory block for bank bank-0 : size 2147483648 using /var/lib/hugetlbfs/global/pagesize-1GB/readout-bank-0
2020-11-20 16:15:55.032246     Shared memory block for bank bank-0 is ready
2020-11-20 16:15:55.667828     Bank bank-0 added
2020-11-20 16:15:55.667969 !!! Error - Failed to configure consumer consumer-rec : No such node (consumer-rec.consumerType)
2020-11-20 16:15:55.667983     Configuring equipment equipment-rorc-1: rorc
2020-11-20 16:15:55.668104     Equipment equipment-rorc-1: from config [equipment-rorc-1], max rate=-1.000000 Hz, idleSleepTime=200 us, outputFifoSize=2047
2020-11-20 16:15:55.668114     Equipment equipment-rorc-1: requesting memory pool 2047 pages x 1048576 bytes from bank 'bank-0', block aligned @ 0x200000, 1st page offset @ 0x0
2020-11-20 16:15:55.668140     pageSpaceReserved = 56, aligning 1st page @ 0xFFFC8
2020-11-20 16:15:55.668277  !  Warning - Bypassing RORC firmware compatibility check
2020-11-20 16:15:55.668300     Using superpage size 1015808
2020-11-20 16:15:55.668306     Opening ROC #0:0
2020-11-20 16:15:55.668564     Register DMA block 0x2aaac0000000:2147483648
2020-11-20 16:15:55.748128     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Acquiring DMA channel lock
2020-11-20 16:15:55.748174     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Acquired DMA channel lock
2020-11-20 16:15:55.748425     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Initializing with DMA buffer from memory region
2020-11-20 16:15:55.823005     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Scatter-gather list size: 1
2020-11-20 16:15:55.826155     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Buffer is hugepage-backed
2020-11-20 16:15:55.826505     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Using link(s): 0 1 15
2020-11-20 16:15:55.826606     Equipment equipment-rorc-1 : PCI 3b:00.0 @ NUMA node 0, serial number unknown, firmware version 883db8e0, card id 00540186-2853fc0d
2020-11-20 16:15:55.826617     Timeframe length = 256 orbits
2020-11-20 16:15:55.826625     Timeframe IDs generated from RDH trigger counters
2020-11-20 16:15:55.826648     Creating aggregator
2020-11-20 16:15:55.826686     Aggregator: 1 equipments
2020-11-20 16:15:55.826693     Readout completed CONFIGURE
2020-11-20 16:15:55.826708     Readout executing START
2020-11-20 16:15:55.826722     Starting aggregator
2020-11-20 16:15:55.826779     Starting readout equipments
2020-11-20 16:15:55.826811     Starting DMA for ROC equipment-rorc-1
2020-11-20 16:15:55.826845     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Starting DMA
2020-11-20 16:15:56.037310     ROC input queue size = 384 pages
2020-11-20 16:15:56.037363     Running
2020-11-20 16:15:56.037522     Readout completed START
2020-11-20 16:15:56.037605     Entering main loop

Hi @kalexopo,
well, I do not quite get it.
I exited and enter again.
Then:

aliswmod enter Readout

INFOLOGGER_MODE=stdout roc-bench-dma --i=#0 --links=0-1 --to-file-bin=/tmp/mid --bytes=100M --data=FEE --no-er --bypass

2020-11-20 16:22:02.818022     DMA channel: 0
2020-11-20 16:22:02.819075     IOMMU enabled
2020-11-20 16:22:02.819453     Using buffer file path: /var/lib/hugetlbfs/global/pagesize-1GB/roc-bench-dma_id=#0_chan=0_pages
2020-11-20 16:22:02.819549     Buffer size: 1073741824
2020-11-20 16:22:02.819571     Superpage size: 1048576
2020-11-20 16:22:02.819591     Superpages in buffer: 1024
2020-11-20 16:22:02.819610     Superpage limit: 95
2020-11-20 16:22:02.819642     DMA page size: 8192
2020-11-20 16:22:02.900988     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Acquiring DMA channel lock
2020-11-20 16:22:02.901028     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Acquired DMA channel lock
2020-11-20 16:22:02.901263     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Initializing with DMA buffer from memory region
2020-11-20 16:22:03.163879     Freeing PDA buffer '/sys/bus/pci/drivers/uio_pci_dma/0000:3b:00.0/dma/1000000000/map'
2020-11-20 16:22:03.179251     [pci=3b:00.0 serial=0 endpoint=0 channel=0] Releasing DMA channel lock
Error: Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer
[Possible cause] = Program previously exited without cleaning up DMA buffer, reinserting DMA kernel module may  help, but ensure no channels are open before reinsertion (modprobe -r uio_pci_dma; modprobe uio_pci_dma
[Error message] = Failed to register external DMA buffer; Failed retry after automatic cleanup of previous buffer 

So, what’s the difference w.r.t. what you do?

Hello,

So we found a way to reproduce the problem. If the login happens as follows

ssh root@flpmid
su - flp

everything is alright, and more specifically ulimit -l correctly reports the PDA enforced value which is unlimited and the ReadoutCard library works.

If the login happens like this

ssh flp@flpmid

the env is different to the one before. Moreover ulimit -l incorrectly reports 64, the default value, which causes the issues encountered above.

This points to some issue with the SSH login that has not been previously reported on FLP setups. Scouting the internet, issues wrt PAM authentication come up. Interestingly, the SSHD status also reports a PAM issue.

[flp@flpmid ~] systemctl status sshd
● sshd.service - OpenSSH server daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-11-18 12:46:10 CET; 2 days ago
     Docs: man:sshd(8)
           man:sshd_config(5)
 Main PID: 3208 (sshd)
    Tasks: 24
   CGroup: /system.slice/sshd.service
           ├─  3208 /usr/sbin/sshd -D
           ├─155958 sshd: flp [priv]
           ├─155961 sshd: flp@pts/1
           ├─155963 -bash
           ├─156195 dbus-launch --autolaunch 72549b562c264f7e8a7ea44c69b15fca --binary-syntax --close-stderr
           ├─156196 /usr/bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
           ├─161615 sshd: root@pts/3
           ├─161677 -bash
           ├─181893 sshd: root@pts/4
           ├─181917 -bash
           ├─194160 sshd: root@pts/2
           ├─194177 -bash
           ├─194824 sshd: flp [priv]
           ├─194827 sshd: flp@pts/5
           ├─194829 -bash
           ├─194969 bash -i
           ├─196257 sshd: flp [priv]
           ├─196265 sshd: flp@pts/0
           ├─196266 -bash
           ├─196729 sshd: [accepted]
           ├─196730 sshd: [net]
           ├─196871 sshd: [accepted]
           ├─196873 sshd: [net]
           └─196972 systemctl status sshd

Nov 20 16:44:28 flpmid sshd[196890]: WARNING: 'UsePAM no' is not supported in Red Hat Enterprise Linux and may cause several problems.
Nov 20 16:44:28 flpmid sshd[196890]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Nov 20 16:44:28 flpmid sshd[196890]: Connection from 137.74.173.182 port 39496 on 154.114.23.2 port 22
Nov 20 16:44:29 flpmid sshd[196890]: reprocess config line 47: Deprecated option RSAAuthentication
Nov 20 16:44:29 flpmid sshd[196890]: Invalid user ranjit from 137.74.173.182 port 39496
Nov 20 16:44:29 flpmid sshd[196890]: input_userauth_request: invalid user ranjit [preauth]
Nov 20 16:44:29 flpmid sshd[196890]: error: Could not get shadow information for NOUSER
Nov 20 16:44:29 flpmid sshd[196890]: Failed password for invalid user ranjit from 137.74.173.182 port 39496 ssh2
Nov 20 16:44:30 flpmid sshd[196890]: Received disconnect from 137.74.173.182 port 39496:11: Bye Bye [preauth]
Nov 20 16:44:30 flpmid sshd[196890]: Disconnected from 137.74.173.182 port 39496 [preauth]
Hint: Some lines were ellipsized, use -l to show in full.

I leave this to the system administrators in charge of the MID FLP.

hi @kalexopo,
ok, thanks a lot for the debugging!
Cheers,
Diego

@rmonteve: just for information