TPC Readout with CRU

Hello,

tried to run a TPC readout. Used the scripts provided under

cru-tests/DETECTORS/tpc-readout/ from the Git master (8d22c1d887…)

Two things i noticed.

  • The execution of the standalone-startup.py which is executed by the script takes ages. Timed it, was 44 seconds. I remember this was much faster in the past, maybe 1 second max.
  • The script fails. Apparently because i don’t work as root (which i still think is a bad idea). Output is below.
[talt@flp2 tpc-readout]$ source tpc-set.sh 04:00.0
Starting TTC PON calibration
Starting GBT calibration
Checking link status
/home/talt/cru/cru-tests/DETECTORS/tpc-readout
rm -vf /tmp/output.bin
roc-setup-hugetlbfs
File '/etc/flpprototype.d/hugepages-2MiB.conf' not found, allocating default amount of 2 MiB hugepages: 128
/opt/alisw/el7/ReadoutCard/v0.8.8-1/bin/roc-setup-hugetlbfs: line 24: /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages: Permission denied
File '/etc/flpprototype.d/hugepages-1GiB.conf' found, allocating configured amount of 1 GiB hugepages: 16
/opt/alisw/el7/ReadoutCard/v0.8.8-1/bin/roc-setup-hugetlbfs: line 33: /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages: Permission denied
Adding 'pda' group
/opt/alisw/el7/ReadoutCard/v0.8.8-1/bin/roc-setup-hugetlbfs: line 39: /usr/sbin/groupadd: Permission denied
Creating hugetlbfs mounts
hugeadm:ERROR: Mounts can only be created by root
Setting permissions on hugeltbfs mounts
chgrp: changing group of '/var/lib/hugetlbfs/global/pagesize-1GB/readout-bank-a2': Operation not permitted
chgrp: changing group of '/var/lib/hugetlbfs/global/pagesize-1GB': Operation not permitted
chgrp: changing group of '/var/lib/hugetlbfs/global/pagesize-2MB': Operation not permitted
chmod: changing permissions of '/var/lib/hugetlbfs/global/pagesize-1GB': Operation not permitted
chmod: changing permissions of '/var/lib/hugetlbfs/global/pagesize-1GB/readout-bank-a2': Operation not permitted
chmod: changing permissions of '/var/lib/hugetlbfs/global/pagesize-2MB': Operation not permitted

Hugepages:
      Size  Minimum  Current  Maximum  Default
   2097152        0        0        0        *
1073741824        0        0        0         

Use 'echo [number] > /sys/kernel/mm/hugepages/hugepages-[size]/nr_hugepages' to allocate hugepages manually
Or set a number in the following conf files and run the script again:
  echo [number] > /etc/flpprototype.d/hugepages-2MiB.conf
  echo [number] > /etc/flpprototype.d/hugepages-1GiB.conf
build date 0x20180726
verinfo0=0x20202020
verinfo1=0x1ee1ed72
verinfo2=0x7670
verinfo3=0x56494e46
Altera chip ID=0x0054018810d20309
roc-reg-write --id=04:00.0 --channel=0 --address=0xc00 --value=0
infoLoggerD not available, falling back to stdout logging
0x0
roc-bench-dma --verbose   --id=04:00.0   --buffer-size=128Mi --superpage-size=2Mi --links="0-31"  --no-errorche --bytes=4096Mi  --to-file-bin=/tmp/output.bin --loopback=NONE --generator=0
infoLoggerD not available, falling back to stdout logging
2018-08-17 14:46:09.355792     DMA channel: 0
2018-08-17 14:46:09.356396     IOMMU enabled
Error: Permission denied
/build/workarea/sw/20155875/1/SOURCES/ReadoutCard/v0.8.8/v0.8.8/src/InterprocessLock.h(47): Throw in function AliceO2::roc::Interprocess::Lock::Lock(const boost::filesystem::path&, const string&, bool)
Dynamic exception type: boost::exception_detail::clone_impl<AliceO2::roc::LockException>
std::exception::what: Permission denied
[AliceO2::roc::ErrorInfo::_NamedMutexName*] = AliceO2_RoC_Pda_Mutex
[Possible cause] = Invalid lock file path
[AliceO2::roc::ErrorInfo::_SharedLockFile*] = /dev/shm/AliceO2_RoC_Pda.lock
[Error message] = Permission denied

Traceback (most recent call last):
  File "tpc-readout-test.py", line 93, in <module>
    p.wait()
AttributeError: 'int' object has no attribute 'wait'
[talt@flp2 tpc-readout]$ groups
z2 pda
[talt@flp2 tpc-readout]$ 

Ciao,
all the scripts from cru-tests should not need root credentials.

I will have a look … which tag version are you using?
We had many modifications I think it would make sense to updgrade firmware and software to the latest release (that currently is develop) … let me know.

Cheers

I tried with the master branch and then with the development branch. Both fail with the same issue. Can try to figure out which firmware version is installed. Unfortunately the roc-list-cards doesn’t give me that information because of the know mutex issues. So can try to delete the mutex by hand to get it working…

I checked the sof file you have in the server … it should be the tag_v1.0.0 … we are already at tagv1.5.0 … I think we should load v1.5.0 … but I would really like to try the develop branch as it it the last one used to test the firmware and software.
We will tag as soon Olivier is back

you can find the firmware in the dev folder here https://cernbox.cern.ch/index.php/s/mWIgJy1OC3817TR

Alright, will try the new firmware. Another thing i saw is the execution speed of the tools. Not sure if it is an issue with the machine or the software in combination with the machine. As you can see from the output, it took a whooping 1 minute and 27 seconds to run the tool. Same with the python stuff. And it uses 100% CPU…

[talt@flp2 tpc-readout]$ roc-list-cards --verbose
infoLoggerD not available, falling back to stdout logging
2018-08-17 15:35:06.261619     [pci=04:00.0 serial=0 channel=0] Acquiring DMA channel lock
2018-08-17 15:35:06.261936     [pci=04:00.0 serial=0 channel=0] Acquired DMA channel lock
2018-08-17 15:35:10.870828     [pci=04:00.0 serial=0 channel=0] Initializing with null DMA buffer
2018-08-17 15:35:10.870897     [pci=04:00.0 serial=0 channel=0] Scatter-gather list size: 0
2018-08-17 15:35:10.871041     [pci=04:00.0 serial=0 channel=0] Enabling link(s): 0 
2018-08-17 15:35:10.871999     [pci=04:00.0 serial=0 channel=0] Releasing DMA channel lock
2018-08-17 15:35:20.537720     [pci=04:00.0 serial=0 channel=0] Acquiring DMA channel lock
2018-08-17 15:35:20.538008     [pci=04:00.0 serial=0 channel=0] Acquired DMA channel lock
2018-08-17 15:35:25.237883     [pci=04:00.0 serial=0 channel=0] Initializing with null DMA buffer
2018-08-17 15:35:25.237957     [pci=04:00.0 serial=0 channel=0] Scatter-gather list size: 0
2018-08-17 15:35:25.238037     [pci=04:00.0 serial=0 channel=0] Enabling link(s): 0 
2018-08-17 15:35:25.239592     [pci=04:00.0 serial=0 channel=0] Releasing DMA channel lock
2018-08-17 15:35:35.059232     [pci=05:00.0 serial=0 channel=0] Acquiring DMA channel lock
2018-08-17 15:35:35.059528     [pci=05:00.0 serial=0 channel=0] Acquired DMA channel lock
2018-08-17 15:35:40.269393     [pci=05:00.0 serial=0 channel=0] Initializing with null DMA buffer
2018-08-17 15:35:40.269460     [pci=05:00.0 serial=0 channel=0] Scatter-gather list size: 0
2018-08-17 15:35:40.269534     [pci=05:00.0 serial=0 channel=0] Enabling link(s): 0 
2018-08-17 15:35:40.270835     [pci=05:00.0 serial=0 channel=0] Releasing DMA channel lock
2018-08-17 15:35:50.308285     [pci=05:00.0 serial=0 channel=0] Acquiring DMA channel lock
2018-08-17 15:35:50.308538     [pci=05:00.0 serial=0 channel=0] Acquired DMA channel lock
2018-08-17 15:35:55.014660     [pci=05:00.0 serial=0 channel=0] Initializing with null DMA buffer
2018-08-17 15:35:55.014711     [pci=05:00.0 serial=0 channel=0] Scatter-gather list size: 0
2018-08-17 15:35:55.014754     [pci=05:00.0 serial=0 channel=0] Enabling link(s): 0 
2018-08-17 15:35:55.015663     [pci=05:00.0 serial=0 channel=0] Releasing DMA channel lock
2018-08-17 15:36:04.377784     [pci=83:00.0 serial=0 channel=0] Acquiring DMA channel lock
2018-08-17 15:36:04.378102     [pci=83:00.0 serial=0 channel=0] Acquired DMA channel lock
2018-08-17 15:36:08.932595     [pci=83:00.0 serial=0 channel=0] Initializing with null DMA buffer
2018-08-17 15:36:08.932680     [pci=83:00.0 serial=0 channel=0] Scatter-gather list size: 0
2018-08-17 15:36:08.932755     [pci=83:00.0 serial=0 channel=0] Enabling link(s): 0 
2018-08-17 15:36:08.934270     [pci=83:00.0 serial=0 channel=0] Releasing DMA channel lock
2018-08-17 15:36:18.357240     [pci=83:00.0 serial=0 channel=0] Acquiring DMA channel lock
2018-08-17 15:36:18.357627     [pci=83:00.0 serial=0 channel=0] Acquired DMA channel lock
2018-08-17 15:36:23.093293     [pci=83:00.0 serial=0 channel=0] Initializing with null DMA buffer
2018-08-17 15:36:23.093343     [pci=83:00.0 serial=0 channel=0] Scatter-gather list size: 0
2018-08-17 15:36:23.093384     [pci=83:00.0 serial=0 channel=0] Enabling link(s): 0 
2018-08-17 15:36:23.095706     [pci=83:00.0 serial=0 channel=0] Releasing DMA channel lock
2018-08-17 15:36:33.156991     [pci=84:00.0 serial=0 channel=0] Acquiring DMA channel lock
2018-08-17 15:36:33.157158     [pci=84:00.0 serial=0 channel=0] Acquired DMA channel lock
2018-08-17 15:36:38.291942     [pci=84:00.0 serial=0 channel=0] Initializing with null DMA buffer
2018-08-17 15:36:38.291988     [pci=84:00.0 serial=0 channel=0] Scatter-gather list size: 0
2018-08-17 15:36:38.292029     [pci=84:00.0 serial=0 channel=0] Enabling link(s): 0 
2018-08-17 15:36:38.292941     [pci=84:00.0 serial=0 channel=0] Releasing DMA channel lock
2018-08-17 15:36:48.670325     [pci=84:00.0 serial=0 channel=0] Acquiring DMA channel lock
2018-08-17 15:36:48.670737     [pci=84:00.0 serial=0 channel=0] Acquired DMA channel lock
2018-08-17 15:36:53.818352     [pci=84:00.0 serial=0 channel=0] Initializing with null DMA buffer
2018-08-17 15:36:53.818401     [pci=84:00.0 serial=0 channel=0] Scatter-gather list size: 0
2018-08-17 15:36:53.818444     [pci=84:00.0 serial=0 channel=0] Enabling link(s): 0 
2018-08-17 15:36:53.819349     [pci=84:00.0 serial=0 channel=0] Releasing DMA channel lock
===========================================================================================
  #   Type   PCI Addr   Vendor ID   Device ID   Serial   FW Version      Card ID          
-------------------------------------------------------------------------------------------
  0   CRU    04:00.0    0x1172      0xe001      0        20180726-220009-1ee1ed72 00000000-00000000
  1   CRU    05:00.0    0x1172      0xe001      0        20180726-220009-1ee1ed72 00000000-00000000
  2   CRU    83:00.0    0x1172      0xe001      0        20180726-220009-1ee1ed72 00000000-00000000
  3   CRU    84:00.0    0x1172      0xe001      0        20180726-220009-1ee1ed72 00000000-00000000
===========================================================================================

:sweat: that is unexpected.
We will have soon a new developer to take over the driver and the tools stuff … we will have him have a look asap.
Cheers

the version is 1.3.0 … but still better to upgrade

Alright, will try with the latest version of the FW. Downloaded it just now. Short remark here. In the folder under v1.5.0 is the file called cru-v2.sof. Would it be possible to add the tag also to the file-name. Files can be copied from folders to other locations, so having all the information in the filename (tag + maybe data) is really helpful :slight_smile:

yes … we’ll do that … but still take the dev

You mean the “development” branch for the cru_test tools, right?

there is a dev folder in the CRU firmware … take that one … we will tag it when Olivier is back.
It has the latest firmware features.

I am clening your script for TPC in the dev of cru-tests I will send a message when is pushed

O.k. So apparently we are using our own firmware which Lars compiled. It is based on the one from the master branch and he changed the number of links to 20. Will first try your firmware, then checkout the latest develop, change the links to 20 (if it is not already in yet) and compile/update.

no dthe dev has only 6 … we had 20 when we tag

Installed the v1.5.0 version. Link status gives me 6 links…

ah but I said you should install dev :smiley:

Alright, will install dev then. But to quote you from 2 messages above : “no dthe dev has only 6 … we had 20 when we tag” So which is it? :wink:

Alight, “dev” firmware is in the card. After the reboot the host seems to be back to normal speed again, tools run at normal speed again. Readout for TPC still doesn’t work and complains about no access rights and that it requires root privileges.

This issue appeared again after sending SCA ADC commands for about 12 hours (might or might not be a causality there).

It was taking 1.5 seconds to send a couple of SCA commands (read GPIOs and a couple of I2C transactions).
After digging a bit with perf trace I found that the time was spent in the PCI DMA driver:

-bash-4.2$ perf trace --duration 10.0 --pf=maj tfec --id 04:00.0 -m 0x4 -s 
46.466 (283.151 ms): tfec/34521 write(fd: 3</sys/bus/pci/drivers/uio_pci_dma/new_id>, buf: 0x7f6e2ab86000, count: 9) = 9
349.816 (272.340 ms): tfec/34521 write(fd: 3</sys/bus/pci/drivers/uio_pci_dma/new_id>, buf: 0x7f6e2ab86000, count: 9) = 9
687.824 (278.161 ms): tfec/34521 write(fd: 3</sys/bus/pci/drivers/uio_pci_dma/new_id>, buf: 0x7f6e2ab86000, count: 9) = 9
984.796 (261.142 ms): tfec/34521 write(fd: 3</sys/bus/pci/drivers/uio_pci_dma/new_id>, buf: 0x7f6e2ab86000, count: 9) = 9
FEC #2057 [0x809] [rev1a]
GBTx0 I2C addr 15 (sw_cfg)
GBTx1 I2C addr 14 (sw_cfg)
SCA id 0x64 -> v2
SAMPAs v4
SAMPAs powered 0x1f

After reinitializing the DMA driver, everything is back to normal speed again
-bash-4.2$ sudo modprobe -r uio_pci_dma
-bash-4.2$ sudo modprobe uio_pci_dma

Maybe this can hint to what the problem might be?

I will have our Driver expert have a look at it.
What do you do exactly with your software?
What type of command is it? Do you run it in a loop for several hours?
I have a FEC in the lab … can I try to reproduce it?
And last but not least, what is the O2 software version you are using?
SO I can install the same setup here in the lab.

Thx