Dropped packets using latest readout.exe and very small data rate

I am currently running tests with the installed MCH equipment, reading data from all the equipped links in our CRUs at P2. While doing these tests I noticed that readout.exe reports dropped packets even when the data rate is very low. The dropped packets appear when writing the data to disk, while they are absent if the data is written to /dev/null.

The equipped links for the two data paths of the CRUs with the higher drop rate are as follows:

[pci=3b:00.0 serial=10220 endpoint=0 channel=0] Using link(s): 0 1 2 3 4 5 6 7 8 9 10
[pci=3c:00.0 serial=10220 endpoint=1 channel=0] Using link(s): 0 1 2 3 4 5 6 7 8

Other CRUs with fewer active links are less affected by the packet dropping.

The fact is that the dropping occurs even in triggered mode, with a very low trigger rate of 1 Hz, giving a data rate to disk of about 2 MB/s.

My gut feeling is that readout.exe has a hard time absorbing large instantaneous data rate peaks, like those coming from physics triggers when most of the links are active. But that does not come from the DMA interface, otherwise writing to /dev/null should not make a difference. Is there some decoupling mechanism inside readout.exe between the input DMA channel and the output channel?

Pinging @sy-c to make sure he sees this message…

For reference, here is the full readout.exe configuration I am using:

###################################
# readout configuration file
#
# comments lines start with #
# section names are in brackets []
# settings are defined with key=value pairs
###################################


###################################
# general settings
###################################

[readout]

# per-equipment data rate limit, in Hertz (-1 for unlimited)
rate=-1

# time (in seconds) after which program exits automatically (-1 for unlimited)
exitTimeout=-1




[consumer-data-sampling]
consumerType=DataSampling
enabled=0

###################################
# data consumers
###################################

# collect data statistics
[consumer-stats]
consumerType=stats
enabled=0
# this publishes stats, if enabled, to O2 monitoring system
monitoringEnabled=1
monitoringUpdatePeriod=0
consoleUpdate=1
#monitoringURI=infologger://
monitoringURI=influxdb-udp://aido2mon-gpn.cern.ch:8088
#processMonitoringInterval=5



# recording to file
[consumer-rec]
consumerType=fileRecorder
enabled=1
#fileName=/tmp/%T_data.raw
#fileName=/tmp/readout-pipe
fileName=data.raw
#fileName=/dev/null
bytesMax=10000M
dropEmptyHBFrames=1


###################################
# memory banks
###################################
# All section names should start with 'bank-' to be taken into account.
# They define memory to be allocated to readout
# If bank name not specified in each equipment, the first available bank (created first) will be used.
# Types of memory banks include: malloc, MemoryMappedFile
# NB: the FairMQChannel consumers may also create some banks, which will not be
# listed here, and created before them.

[bank-1]
type=MemoryMappedFile
size=1G
numaNode=1

[equipment-rorc-1]
equipmentType=rorc
enabled=1
cardId=3b:0.0
dataSource=Fee
memoryBankName=bank-1
rdhCheckEnabled=0
rdhDumpEnabled=0
rdhUseFirstInPageEnabled = 0
consoleStatsUpdateTime=0
memoryPoolNumberOfPages=1023
memoryPoolPageSize=1048576
firmwareCheckEnabled=0
linkMask=0-10


[bank-2]
type=MemoryMappedFile
size=1G
numaNode=1

[equipment-rorc-2]
equipmentType=rorc
enabled=1
cardId=3c:0.0
dataSource=Fee
memoryBankName=bank-2
rdhCheckEnabled=0
rdhDumpEnabled=0
rdhUseFirstInPageEnabled = 0
consoleStatsUpdateTime=0
memoryPoolNumberOfPages=1023
memoryPoolPageSize=1048576
firmwareCheckEnabled=0
linkMask=0-10



# push to fairMQ device
# currently fixed local TCP port 5555
# needs FairMQ at compile time
[consumer-fmq]
consumerType=FairMQDevice
enabled=0



# push to fairMQ device
# light FMQ channel implementation
# WP5-compatible output
[consumer-fmq-wp5]
# session name must match --session parameter of all O2 devices in the chain
sessionName=default
consumerType=FairMQChannel
enabled=0
transportType=shmem
channelName=readout-out
channelType=pair
channelAddress=ipc:///tmp/readout-pipe-0
unmanagedMemorySize=2G
disableSending=0
#need also a memory pool for headers and partial HBf chunks copies
memPoolNumberOfElements=100
memPoolElementSize=128k



[receiver-fmq]
transportType=shmem
channelName=readout
channelType=pair
channelAddress=ipc:///tmp/readout-pipe-0
decodingMode=readout
#one of: readout, none

Hello,
What version of readout are you using?
Do you have an output log, please?

It’s possible you are short on the number of data pages, and I’m not surprised that enabling recording adds enough latency to run out of pages for DMA. (It’s not only a matter of throughput, as pages are changed for each new TF / link, and there are some internal timeouts.)
Please give it a try with 2000 pages (memoryPoolNumberOfPages), and also set rdhUseFirstInPageEnabled = 1 for proper timestamping. Also, linkMask does not need to be set for readout v1.3.8.

cheers,
Sylvain

Hi Sylvain,

EDIT: I am using Readout v1.3.8.

here is the modified configuration file:

###################################
# readout configuration file
#
# comments lines start with #
# section names are in brackets []
# settings are defined with key=value pairs
###################################


###################################
# general settings
###################################

[readout]

# per-equipment data rate limit, in Hertz (-1 for unlimited)
rate=-1

# time (in seconds) after which program exits automatically (-1 for unlimited)
exitTimeout=-1




[consumer-data-sampling]
consumerType=DataSampling
enabled=0

###################################
# data consumers
###################################

# collect data statistics
[consumer-stats]
consumerType=stats
enabled=0
# this publishes stats, if enabled, to O2 monitoring system
monitoringEnabled=1
monitoringUpdatePeriod=0
consoleUpdate=1
#monitoringURI=infologger://
monitoringURI=influxdb-udp://aido2mon-gpn.cern.ch:8088
#processMonitoringInterval=5



# recording to file
[consumer-rec]
consumerType=fileRecorder
enabled=1
#fileName=/tmp/%T_data.raw
#fileName=/tmp/readout-pipe
fileName=data.raw
#fileName=/dev/null
bytesMax=10000M
dropEmptyHBFrames=1


###################################
# memory banks
###################################
# All section names should start with 'bank-' to be taken into account.
# They define memory to be allocated to readout
# If bank name not specified in each equipment, the first available bank (created first) will be used.
# Types of memory banks include: malloc, MemoryMappedFile
# NB: the FairMQChannel consumers may also create some banks, which will not be
# listed here, and created before them.

[bank-1]
type=MemoryMappedFile
size=2G
numaNode=1

[equipment-rorc-1]
equipmentType=rorc
enabled=1
cardId=3b:0.0
dataSource=Fee
memoryBankName=bank-1
rdhCheckEnabled=0
rdhDumpEnabled=0
rdhUseFirstInPageEnabled = 1
consoleStatsUpdateTime=0
memoryPoolNumberOfPages=2047
memoryPoolPageSize=1048576
firmwareCheckEnabled=0
#linkMask=0-10


[bank-2]
type=MemoryMappedFile
size=2G
numaNode=1

[equipment-rorc-2]
equipmentType=rorc
enabled=1
cardId=3c:0.0
dataSource=Fee
memoryBankName=bank-2
rdhCheckEnabled=0
rdhDumpEnabled=0
rdhUseFirstInPageEnabled = 1
consoleStatsUpdateTime=0
memoryPoolNumberOfPages=2047
memoryPoolPageSize=1048576
firmwareCheckEnabled=0
#linkMask=0-10



# push to fairMQ device
# currently fixed local TCP port 5555
# needs FairMQ at compile time
[consumer-fmq]
consumerType=FairMQDevice
enabled=0



# push to fairMQ device
# light FMQ channel implementation
# WP5-compatible output
[consumer-fmq-wp5]
# session name must match --session parameter of all O2 devices in the chain
sessionName=default
consumerType=FairMQChannel
enabled=0
transportType=shmem
channelName=readout-out
channelType=pair
channelAddress=ipc:///tmp/readout-pipe-0
unmanagedMemorySize=2G
disableSending=0
#need also a memory pool for headers and partial HBf chunks copies
memPoolNumberOfElements=100
memPoolElementSize=128k



[receiver-fmq]
transportType=shmem
channelName=readout
channelType=pair
channelAddress=ipc:///tmp/readout-pipe-0
decodingMode=readout
#one of: readout, none
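
As a sanity check on the sizes in the configuration above, here is a minimal Python sketch that verifies that a memory pool (memoryPoolNumberOfPages x memoryPoolPageSize) fits inside its bank. The helper names `parse_size` and `pool_fits` are hypothetical and not part of readout. The "G" and "M" multipliers are taken from what readout prints in its log (2G is reported as 2147483648 bytes, 10000M as 10485760000 bytes); the "k" multiplier is an assumption. Readout may reserve extra space for alignment, so a pool that fits by this check is not guaranteed to be accepted.

```python
# Size suffix multipliers. "G" and "M" match the byte counts printed in the
# readout log; "k" = 1024 is an assumption made by analogy.
SUFFIXES = {"k": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert a config value such as '2G', '128k' or '1048576' to bytes."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)


def pool_fits(bank_size, n_pages, page_size):
    """Return (pool_bytes, bank_bytes, fits) for one equipment/bank pair."""
    pool_bytes = n_pages * parse_size(page_size)
    bank_bytes = parse_size(bank_size)
    return pool_bytes, bank_bytes, pool_bytes <= bank_bytes


if __name__ == "__main__":
    # Values from [bank-1] / [equipment-rorc-1] in the modified configuration.
    pool, bank, ok = pool_fits("2G", 2047, "1048576")
    print(f"pool={pool} bytes, bank={bank} bytes, fits={ok}")
```

With 2047 pages of 1048576 bytes in a 2G bank, the pool is exactly one page smaller than the bank.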

Dropping still happens… this is the readout log file:

2020-03-11 15:58:52.268263     Readout process starting, pid 233334
2020-03-11 15:58:52.268356     Optional built features enabled:
2020-03-11 15:58:52.268364     CONFIG : yes
2020-03-11 15:58:52.268369     FAIRMQ : yes
2020-03-11 15:58:52.268429     NUMA : yes
2020-03-11 15:58:52.268436     RDMA : no
2020-03-11 15:58:52.268441     OCC : yes
2020-03-11 15:58:52.268445     LOGBOOK : no
2020-03-11 15:58:52.268451     Readout entering standalone state machine
2020-03-11 15:58:52.268458     Readout executing CONFIGURE
2020-03-11 15:58:52.268466     Reading configuration from file:///home/flp/mch_upgr_prod/cru/scripts/readout.cfg 
2020-03-11 15:58:52.268790     Merging selected content of OCC configuration
2020-03-11 15:58:52.268800     No OCC FMQ channels configuration found
2020-03-11 15:58:52.268813     CPU deep sleep not disabled for process
2020-03-11 15:58:52.269193     Enforcing memory allocated on NUMA node 1
2020-03-11 15:58:52.269202     Creating memory bank bank-1: type MemoryMappedFile size 2147483648
2020-03-11 15:58:52.269215     Creating shared memory block for bank bank-1 : size 2147483648 using /var/lib/hugetlbfs/global/pagesize-1GB/readout-bank-1
2020-03-11 15:58:52.269342     Shared memory block for bank bank-1 is ready
2020-03-11 15:58:53.338708     Bank bank-1 added
2020-03-11 15:58:53.338785     Enforcing memory allocated on NUMA node 1
2020-03-11 15:58:53.338790     Creating memory bank bank-2: type MemoryMappedFile size 2147483648
2020-03-11 15:58:53.338797     Creating shared memory block for bank bank-2 : size 2147483648 using /var/lib/hugetlbfs/global/pagesize-1GB/readout-bank-2
2020-03-11 15:58:53.338845     Shared memory block for bank bank-2 is ready
2020-03-11 15:58:54.403032     Bank bank-2 added
2020-03-11 15:58:54.403069     Releasing memory NUMA node enforcment
2020-03-11 15:58:54.403116     Configuring consumer consumer-rec: fileRecorder
2020-03-11 15:58:54.403129     Recording path = data.raw
2020-03-11 15:58:54.403137     Maximum recording size: 10485760000 bytes
2020-03-11 15:58:54.403168     Recording internal data block headers = 0
2020-03-11 15:58:54.403179     Some packets with RDH-only payload will not be recorded to file, option dropEmptyHBFrames is enabled
2020-03-11 15:58:54.403189     Opening file for writing: data.raw
2020-03-11 15:58:54.403250     Recording enabled
2020-03-11 15:58:54.403264     Configuring equipment equipment-rorc-1: rorc
2020-03-11 15:58:54.403354     Equipment equipment-rorc-1: from config [equipment-rorc-1], max rate=-1.000000 Hz, idleSleepTime=200 us, outputFifoSize=2047
2020-03-11 15:58:54.403358     Equipment equipment-rorc-1: requesting memory pool 2047 pages x 1048576 bytes from bank 'bank-1', block aligned @ 0x200000, 1st page offset @ 0x0
2020-03-11 15:58:54.403365     pageSpaceReserved = 56, aligning 1st page @ 0xFFFC8
2020-03-11 15:58:54.403492  !  Warning - Bypassing RORC firmware compatibility check
2020-03-11 15:58:54.403509     Using superpage size 1015808
2020-03-11 15:58:54.403512     Opening ROC 3b:0.0:0
2020-03-11 15:58:54.403567     Register DMA block 0x2aaac0000000:2147483648
2020-03-11 15:58:54.618885     [pci=3b:00.0 serial=10220 endpoint=0 channel=0] Acquiring DMA channel lock
2020-03-11 15:58:54.618950     [pci=3b:00.0 serial=10220 endpoint=0 channel=0] Acquired DMA channel lock
2020-03-11 15:58:54.691611     [pci=3b:00.0 serial=10220 endpoint=0 channel=0] Initializing with DMA buffer from memory region
2020-03-11 15:58:54.809028     [pci=3b:00.0 serial=10220 endpoint=0 channel=0] Scatter-gather list size: 1
2020-03-11 15:58:54.813613     [pci=3b:00.0 serial=10220 endpoint=0 channel=0] Buffer is hugepage-backed
2020-03-11 15:58:54.958538     [pci=3b:00.0 serial=10220 endpoint=0 channel=0] Using link(s): 0 1 2 3 4 5 6 7 8 9 10 
2020-03-11 15:58:54.958592     [pci=3b:00.0 serial=10220 endpoint=0 channel=0] Resetting channel
2020-03-11 15:58:55.231617     Equipment equipment-rorc-1 : PCI 3b:00.0 @ NUMA node 0, serial number 10220, firmware version 6ec5d23f, card id 00540190-28a6020a
2020-03-11 15:58:55.231646     Timeframe length = 256 orbits
2020-03-11 15:58:55.231669     Timeframe IDs generated from RDH trigger counters
2020-03-11 15:58:55.231708     Configuring equipment equipment-rorc-2: rorc
2020-03-11 15:58:55.231974     Equipment equipment-rorc-2: from config [equipment-rorc-2], max rate=-1.000000 Hz, idleSleepTime=200 us, outputFifoSize=2047
2020-03-11 15:58:55.231989     Equipment equipment-rorc-2: requesting memory pool 2047 pages x 1048576 bytes from bank 'bank-2', block aligned @ 0x200000, 1st page offset @ 0x0
2020-03-11 15:58:55.232002     pageSpaceReserved = 56, aligning 1st page @ 0xFFFC8
2020-03-11 15:58:55.232242  !  Warning - Bypassing RORC firmware compatibility check
2020-03-11 15:58:55.232289     Using superpage size 1015808
2020-03-11 15:58:55.232299     Opening ROC 3c:0.0:0
2020-03-11 15:58:55.232317     Register DMA block 0x2aab40000000:2147483648
2020-03-11 15:58:55.418292     [pci=3c:00.0 serial=10220 endpoint=1 channel=0] Acquiring DMA channel lock
2020-03-11 15:58:55.418353     [pci=3c:00.0 serial=10220 endpoint=1 channel=0] Acquired DMA channel lock
2020-03-11 15:58:55.490931     [pci=3c:00.0 serial=10220 endpoint=1 channel=0] Initializing with DMA buffer from memory region
2020-03-11 15:58:55.604974     [pci=3c:00.0 serial=10220 endpoint=1 channel=0] Scatter-gather list size: 1
2020-03-11 15:58:55.609599     [pci=3c:00.0 serial=10220 endpoint=1 channel=0] Buffer is hugepage-backed
2020-03-11 15:58:55.753565     [pci=3c:00.0 serial=10220 endpoint=1 channel=0] Using link(s): 0 1 2 3 4 5 6 7 8 
2020-03-11 15:58:55.753610     [pci=3c:00.0 serial=10220 endpoint=1 channel=0] Resetting channel
2020-03-11 15:58:56.026496     Equipment equipment-rorc-2 : PCI 3c:00.0 @ NUMA node 0, serial number 10220, firmware version 6ec5d23f, card id 00540190-28a6020a
2020-03-11 15:58:56.026531     Timeframe length = 256 orbits
2020-03-11 15:58:56.026541     Timeframe IDs generated from RDH trigger counters
2020-03-11 15:58:56.026560     Creating aggregator
2020-03-11 15:58:56.026599     Aggregator: 2 equipments
2020-03-11 15:58:56.026608     Readout completed CONFIGURE
2020-03-11 15:58:56.026616     Readout executing START
2020-03-11 15:58:56.026631     Starting aggregator
2020-03-11 15:58:56.026750     Starting readout equipments
2020-03-11 15:58:56.026841     Starting DMA for ROC equipment-rorc-1
2020-03-11 15:58:56.026870     [pci=3b:00.0 serial=10220 endpoint=0 channel=0] Starting DMA
2020-03-11 15:58:56.237597     ROC input queue size = 1408 pages
2020-03-11 15:58:56.237627     Starting DMA for ROC equipment-rorc-2
2020-03-11 15:58:56.237669     [pci=3c:00.0 serial=10220 endpoint=1 channel=0] Starting DMA
2020-03-11 15:58:56.448253     ROC input queue size = 1152 pages
2020-03-11 15:58:56.448286     Running
2020-03-11 15:58:56.448377     Readout completed START
2020-03-11 15:58:56.448442     Entering main loop
2020-03-11 15:59:37.238079  !  Warning - Equipment equipment-rorc-1: CRU has dropped packets (new=128721 total=128721)
2020-03-11 15:59:37.750541  !  Warning - Non-contiguous timeframe IDs 1754 ... 1799
2020-03-11 15:59:38.238013  !  Warning - Equipment equipment-rorc-1: CRU has dropped packets (new=118185 total=246906)
2020-03-11 15:59:49.238070  !  Warning - Equipment equipment-rorc-1: CRU has dropped packets (new=14 total=246920)
2020-03-11 15:59:49.379271     Received signal 2
2020-03-11 15:59:49.379344     Exit requested
2020-03-11 15:59:49.379357     Readout requesting to stop
2020-03-11 15:59:49.379366     Readout executing STOP
2020-03-11 15:59:49.379378     Stopping DMA for ROC equipment-rorc-1
2020-03-11 15:59:49.379430     [pci=3b:00.0 serial=10220 endpoint=0 channel=0] Stopping DMA
2020-03-11 15:59:49.379538     Stopping DMA for ROC equipment-rorc-2
2020-03-11 15:59:49.379557     [pci=3c:00.0 serial=10220 endpoint=1 channel=0] Stopping DMA
2020-03-11 15:59:50.380430     Exiting main loop
2020-03-11 15:59:50.380755     Equipment equipment-rorc-1 : 24926 pages (+ 0 lost + 1397 empty)
2020-03-11 15:59:50.380784     equipment-rorc-1.nBlocksOut = 24926  (avg=0.12  min=0  max=11  count=205331)
2020-03-11 15:59:50.380808     equipment-rorc-1.nBytesOut = 891769936  (avg=35776.70  min=27712  max=293712  count=24926)
2020-03-11 15:59:50.380817     equipment-rorc-1.nMemoryLow = 0
2020-03-11 15:59:50.380825     equipment-rorc-1.nOutputFull = 0
2020-03-11 15:59:50.380834     equipment-rorc-1.nIdle = 202994  (avg=1.00  min=1  max=1  count=202994)
2020-03-11 15:59:50.380842     equipment-rorc-1.nLoop = 205331  (avg=1.00  min=1  max=1  count=205331)
2020-03-11 15:59:50.380850     equipment-rorc-1.nThrottle = 0
2020-03-11 15:59:50.380857     equipment-rorc-1.nFifoUpEmpty = 0
2020-03-11 15:59:50.380864     equipment-rorc-1.nFifoReadyFull = 0
2020-03-11 15:59:50.380873     equipment-rorc-1.nPushedUp = 26323  (avg=0.13  min=0  max=1408  count=200641)
2020-03-11 15:59:50.380881     equipment-rorc-1.fifoOccupancyFreeBlocks = 0  (avg=0.12  min=0  max=11  count=200640)
2020-03-11 15:59:50.380890     equipment-rorc-1.fifoOccupancyReadyBlocks = 0  (avg=0.00  min=0  max=0  count=200641)
2020-03-11 15:59:50.380899     equipment-rorc-1.fifoOccupancyOutBlocks = 0  (avg=0.29  min=0  max=11  count=205331)
2020-03-11 15:59:50.380908     equipment-rorc-1.nPagesUsed = 0  (avg=1391.95  min=0  max=1430  count=205331)
2020-03-11 15:59:50.380918     equipment-rorc-1.nPagesFree = 2047  (avg=655.05  min=617  max=2047  count=205331)
2020-03-11 15:59:50.380926     Average pages pushed per iteration: 10.7
2020-03-11 15:59:50.380934     Average fifoready occupancy: 0.0
2020-03-11 15:59:50.380986     Average data throughput: 15.65 MB/s
2020-03-11 15:59:50.381195     Equipment equipment-rorc-2 : 20790 pages (+ 0 lost + 1143 empty)
2020-03-11 15:59:50.381205     equipment-rorc-2.nBlocksOut = 20790  (avg=0.10  min=0  max=9  count=206446)
2020-03-11 15:59:50.381215     equipment-rorc-2.nBytesOut = 742737600  (avg=35725.71  min=27712  max=161792  count=20790)
2020-03-11 15:59:50.381223     equipment-rorc-2.nMemoryLow = 0
2020-03-11 15:59:50.381230     equipment-rorc-2.nOutputFull = 0
2020-03-11 15:59:50.381238     equipment-rorc-2.nIdle = 204078  (avg=1.00  min=1  max=1  count=204078)
2020-03-11 15:59:50.381246     equipment-rorc-2.nLoop = 206446  (avg=1.00  min=1  max=1  count=206446)
2020-03-11 15:59:50.381254     equipment-rorc-2.nThrottle = 0
2020-03-11 15:59:50.381261     equipment-rorc-2.nFifoUpEmpty = 0
2020-03-11 15:59:50.381268     equipment-rorc-2.nFifoReadyFull = 0
2020-03-11 15:59:50.381276     equipment-rorc-2.nPushedUp = 21933  (avg=0.11  min=0  max=1152  count=200942)
2020-03-11 15:59:50.381285     equipment-rorc-2.fifoOccupancyFreeBlocks = 0  (avg=0.10  min=0  max=9  count=200941)
2020-03-11 15:59:50.381293     equipment-rorc-2.fifoOccupancyReadyBlocks = 0  (avg=0.00  min=0  max=0  count=200942)
2020-03-11 15:59:50.381302     equipment-rorc-2.fifoOccupancyOutBlocks = 0  (avg=0.25  min=0  max=9  count=206446)
2020-03-11 15:59:50.381311     equipment-rorc-2.nPagesUsed = 0  (avg=1133.76  min=0  max=1170  count=206446)
2020-03-11 15:59:50.381321     equipment-rorc-2.nPagesFree = 2047  (avg=913.24  min=877  max=2047  count=206446)
2020-03-11 15:59:50.381328     Average pages pushed per iteration: 8.8
2020-03-11 15:59:50.381335     Average fifoready occupancy: 0.0
2020-03-11 15:59:50.381347     Average data throughput: 13.03 MB/s
2020-03-11 15:59:50.381355     Readout stopped
2020-03-11 15:59:50.381362     Stopping aggregator
2020-03-11 15:59:50.382454     Aggregator processed 45716 blocks
2020-03-11 15:59:50.382476     Stopping consumers
2020-03-11 15:59:50.382486     Equipment equipment-rorc-1 : 0/2047 pages (0.00%) still in use
2020-03-11 15:59:50.382494     Equipment equipment-rorc-2 : 0/2047 pages (0.00%) still in use
2020-03-11 15:59:50.382501     Readout completed STOP
2020-03-11 15:59:50.382508     Readout executing RESET
2020-03-11 15:59:50.382515     Releasing primary consumers
2020-03-11 15:59:50.382522     Releasing consumer consumer-rec
2020-03-11 15:59:50.382536     Closing file data.raw : 136965712 bytes (~130.6 MB)
2020-03-11 15:59:50.382590     Packets recorded=23489 discarded(empty)=23399080
2020-03-11 15:59:50.382601     Releasing secondary consumers
2020-03-11 15:59:50.382612     Releasing aggregator
2020-03-11 15:59:50.382629     Releasing equipment equipment-rorc-1
2020-03-11 15:59:50.382636     Releasing equipment equipment-rorc-2
2020-03-11 15:59:50.382643     Releasing readout devices
2020-03-11 15:59:50.404146     [pci=3b:00.0 serial=10220 endpoint=0 channel=0] Releasing DMA channel lock
2020-03-11 15:59:50.422699     [pci=3c:00.0 serial=10220 endpoint=1 channel=0] Releasing DMA channel lock
2020-03-11 15:59:50.422726     Releasing memory bank manager
2020-03-11 15:59:50.422734     Releasing bank bank-1
2020-03-11 15:59:50.422739     Releasing bank bank-2
2020-03-11 15:59:50.422826     Readout completed RESET
2020-03-11 15:59:50.422834     Readout process exiting
*** break ***
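
To make the drop pattern in this log easier to see, here is a small Python sketch that extracts the "CRU has dropped packets" warnings and sums the `new` counters per equipment. The three sample lines are copied from the log above with the terminal color codes removed; the function names `parse_drops` and `summarize` are hypothetical, and the regular expression is only an assumption about the line layout based on these three lines.

```python
import re
from datetime import datetime

# Warning lines copied from the readout log (color codes removed).
LOG = """\
2020-03-11 15:59:37.238079  !  Warning - Equipment equipment-rorc-1: CRU has dropped packets (new=128721 total=128721)
2020-03-11 15:59:38.238013  !  Warning - Equipment equipment-rorc-1: CRU has dropped packets (new=118185 total=246906)
2020-03-11 15:59:49.238070  !  Warning - Equipment equipment-rorc-1: CRU has dropped packets (new=14 total=246920)
"""

# Timestamp, equipment name, then the two counters in parentheses.
DROP_RE = re.compile(
    r"^(\S+ \S+)\s.*Equipment (\S+): CRU has dropped packets "
    r"\(new=(\d+) total=(\d+)\)"
)


def parse_drops(text):
    """Return a list of (timestamp, equipment, new, total) tuples."""
    events = []
    for line in text.splitlines():
        match = DROP_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append(
                (stamp, match.group(2), int(match.group(3)), int(match.group(4)))
            )
    return events


def summarize(events):
    """Sum the 'new' counters per equipment."""
    totals = {}
    for _, equipment, new, _ in events:
        totals[equipment] = totals.get(equipment, 0) + new
    return totals


if __name__ == "__main__":
    events = parse_drops(LOG)
    for stamp, equipment, new, total in events:
        print(stamp.time(), equipment, new, total)
    print(summarize(events))
```

In this run the first two warnings, about one second apart, account for 246906 of the 246920 dropped packets, and only equipment-rorc-1 (the endpoint with 11 links) reports drops.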

Hello Andrea,

The config is perfectly fine, and readout always provides data pages to the CRU (the FIFO providing free pages to DMA strictly never goes below 99% full, so there is a huge margin: about 2.5 s of buffer ready to be written even if the software were completely stuck). We can do nothing more on the software side in these conditions.

I'll leave it to @costaf to comment on the possible cause of dropped packets in this situation. There were some recent investigations done at the system and firmware levels.

cheers,
Sylvain

Ciao Andrea,
we should talk… What is the length of the acquisition window?
The CRU can absorb a few blocks of 8 KB… your data-taking window is very large, and this could be the cause of the dropped packets when many links are enabled.

You can call me when you want.

Ciao Filippo,

setting the CRU window to 2000 with

roc-reg-write --id=3b:0.0 --channel=2 --address=0x600034 --value=0x7D0
roc-reg-write --id=3b:0.0 --channel=2 --address=0x700034 --value=0x7D0

seems to be enough to avoid packet drops, even at a trigger rate of 100 Hz.
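
For trying other window values, the decimal-to-hex conversion and the two command lines can be generated with a short Python sketch. The `--id`, `--channel`, `--address` and `--value` options and the two register addresses are copied from the commands above; the function name `window_commands` is hypothetical, and the sketch only prints the commands, it does not run them.

```python
# Register addresses copied from the roc-reg-write commands in this post.
WINDOW_ADDRESSES = (0x600034, 0x700034)


def window_commands(window, card_id="3b:0.0", channel=2):
    """Return the roc-reg-write command lines for a given window value."""
    if window < 0:
        raise ValueError("window must be non-negative")
    value = f"0x{window:X}"  # e.g. 2000 -> 0x7D0
    return [
        f"roc-reg-write --id={card_id} --channel={channel} "
        f"--address=0x{address:x} --value={value}"
        for address in WINDOW_ADDRESSES
    ]


if __name__ == "__main__":
    # Window values mentioned in this thread.
    for window in (2000, 3000, 4500, 6000, 6500):
        print(f"# window {window}")
        for command in window_commands(window):
            print(command)
```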

A value of 3000 (0xBB8) is not small enough: packets are still dropped. This is the output of roc-pkt-monitor when packets are dropped:

[flp@alio2-cr1-flp151 ~]$ roc-pkt-monitor --id=10220
=======================================================
  Link ID   Accepted       Rejected       Forced      
-------------------------------------------------------
  0         1484494        0              0           
  1         1484494        0              0           
  2         1484494        0              0           
  3         1484494        0              0           
  4         1484494        0              0           
  5         1484494        0              0           
  6         1484494        0              0           
  7         1484494        0              0           
  8         1484494        0              0           
  9         1484494        0              0           
  10        1484494        0              0           
  11        0              0              0           
=======================================================
  ULL ID    Accepted       Rejected       Forced      
-------------------------------------------------------
  15        0              0              0           
=======================================================
  Wrapper   Dropped          Total Packets per second 
-------------------------------------------------------
  0         7406             254540                   
=======================================================

EDIT: a window of 6000 at a trigger rate of 10 Hz also seems to be OK, even when writing to disk and not to /dev/null.

Hope this helps…

Ciao Andrea.
With 10 links per endpoint I can run with a time window of 6500 at 10 Hz.

Beyond that, the CRU drops packets.

With a window of 6500 I still see some rare packet drops. The safe value in our setup seems to be more on the order of 4500, which is still fine for our pedestal data taking (although in this case we need to limit the number of samples per SAMPA hit to 5, and the trigger rate to 10 Hz).

EDIT: I have actually seen a sudden burst of packet drops even with a window of 4500, after writing about 3 GB of data… Re-checking with /dev/null output, drops also occur in this case, so they do not seem to be related to the actual readout output.

We might be able to tolerate dropped packets when taking calibration data; for debugging the system, however, this is probably not ideal. On the other hand, going lower than 4500 in the window size would mean reducing the number of samples for each SAMPA hit in pedestal mode to 4 or fewer, which is very small…

@costaf would it make sense to try reading the DMA in the same conditions with roc-bench-dma instead of readout.exe, and see if the dropping still happens?

We would not gain much. Readout is needed for the memory allocation using FAIRMQ to communicate with STBF, so we need to understand where the limitations are, as the final system will use readout to collect data. In addition, roc-bench-dma can’t be used efficiently to store data on disk.

I think with the help of Sylvain we can find out the proper settings to run with readout.

Of course! My idea would be to try to understand whether the limitation we are seeing is in the CRU itself or in the way readout handles the data transfer…

Sure, I get your point.
I will call you tomorrow to schedule a debug session.


I rule out readout as the cause: the FIFO providing data pages to the CRU is always full. If the CRU were lacking superpages, the FIFO would sometimes be empty when readout fills it, which is never the case according to the logs.