I am currently running tests with the installed MCH equipment, reading data from all the equipped links in our CRUs at P2. While doing these tests I noticed that readout.exe reports dropped packets even when the data rate is very low. The dropped packets appear when the data is written to disk, and are absent when it is written to /dev/null.
The equipped links for the two data paths of the CRUs with the higher drop rate are as follows:
Other CRUs with less active links are less affected by the packet dropping.
The fact is that the dropping occurs even in triggered mode, with a very low trigger rate of 1 Hz, giving a data rate to disk of about 2 MB/s.
My gut feeling is that readout.exe has a hard time absorbing large instantaneous data rate peaks, like those coming from physics triggers when most of the links are active. But that does not come from the DMA interface, otherwise writing to /dev/null should not make a difference. Is there some decoupling mechanism inside readout.exe between the input DMA channel and the output channel?
For reference, here is the full readout.exe configuration I am using:
###################################
# readout configuration file
#
# comments lines start with #
# section names are in brackets []
# settings are defined with key=value pairs
###################################
###################################
# general settings
###################################
[readout]
# per-equipment data rate limit, in Hertz (-1 for unlimited)
rate=-1
# time (in seconds) after which program exits automatically (-1 for unlimited)
exitTimeout=-1
[consumer-data-sampling]
consumerType=DataSampling
enabled=0
###################################
# data consumers
###################################
# collect data statistics
[consumer-stats]
consumerType=stats
enabled=0
# this publishes stats, if enabled, to O2 monitoring system
monitoringEnabled=1
monitoringUpdatePeriod=0
consoleUpdate=1
#monitoringURI=infologger://
monitoringURI=influxdb-udp://aido2mon-gpn.cern.ch:8088
#processMonitoringInterval=5
# recording to file
[consumer-rec]
consumerType=fileRecorder
enabled=1
#fileName=/tmp/%T_data.raw
#fileName=/tmp/readout-pipe
fileName=data.raw
#fileName=/dev/null
bytesMax=10000M
dropEmptyHBFrames=1
###################################
# memory banks
###################################
# All section names should start with 'bank-' to be taken into account.
# They define memory to be allocated to readout
# If bank name not specified in each equipment, the first available bank (created first) will be used.
# Types of memory banks include: malloc, MemoryMappedFile
# NB: the FairMQChannel consumers may also create some banks, which will not be
# listed here, and created before them.
[bank-1]
type=MemoryMappedFile
size=1G
numaNode=1
[equipment-rorc-1]
equipmentType=rorc
enabled=1
cardId=3b:0.0
dataSource=Fee
memoryBankName=bank-1
rdhCheckEnabled=0
rdhDumpEnabled=0
rdhUseFirstInPageEnabled = 0
consoleStatsUpdateTime=0
memoryPoolNumberOfPages=1023
memoryPoolPageSize=1048576
firmwareCheckEnabled=0
linkMask=0-10
[bank-2]
type=MemoryMappedFile
size=1G
numaNode=1
[equipment-rorc-2]
equipmentType=rorc
enabled=1
cardId=3c:0.0
dataSource=Fee
memoryBankName=bank-2
rdhCheckEnabled=0
rdhDumpEnabled=0
rdhUseFirstInPageEnabled = 0
consoleStatsUpdateTime=0
memoryPoolNumberOfPages=1023
memoryPoolPageSize=1048576
firmwareCheckEnabled=0
linkMask=0-10
# push to fairMQ device
# currently fixed local TCP port 5555
# needs FairMQ at compile time
[consumer-fmq]
consumerType=FairMQDevice
enabled=0
# push to fairMQ device
# light FMQ channel implementation
# WP5-compatible output
[consumer-fmq-wp5]
# session name must match --session parameter of all O2 devices in the chain
sessionName=default
consumerType=FairMQChannel
enabled=0
transportType=shmem
channelName=readout-out
channelType=pair
channelAddress=ipc:///tmp/readout-pipe-0
unmanagedMemorySize=2G
disableSending=0
#need also a memory pool for headers and partial HBf chunks copies
memPoolNumberOfElements=100
memPoolElementSize=128k
[receiver-fmq]
transportType=shmem
channelName=readout
channelType=pair
channelAddress=ipc:///tmp/readout-pipe-0
decodingMode=readout
#one of: readout, none
Hello,
What version of readout are you using?
Do you have some output log, please?
It’s possible you are short on data pages, and I’m not surprised that enabling recording adds enough latency to run out of pages for DMA. (It’s not only a matter of throughput: pages are changed for each new TF / link, and there are some internal timeouts.)
Please give it a try with memoryPoolNumberOfPages around 2000, and also set rdhUseFirstInPageEnabled = 1 for proper timestamping. Also, linkMask does not need to be set for readout v1.3.8.
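When raising the page count, the memory bank may have to grow as well. The following is a minimal sketch of that arithmetic, not part of readout itself: it assumes that the page pool (memoryPoolNumberOfPages × memoryPoolPageSize) has to fit inside the bank it is allocated from, and that the k/M/G size suffixes are binary multiples. The helper names `parse_size`, `pool_fits` and `max_pages` are hypothetical.

```python
# Sanity check for readout memory settings (illustrative sketch only).
# Assumption: size suffixes are binary multiples (k=2**10, M=2**20, G=2**30).
# Assumption: the page pool must fit entirely inside its memory bank.

SUFFIXES = {"k": 2**10, "M": 2**20, "G": 2**30}


def parse_size(text: str) -> int:
    """Convert a size string such as '1G' or '128k' to a number of bytes."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)


def pool_fits(bank_size: str, pages: int, page_size: int) -> bool:
    """True if pages * page_size does not exceed the bank size."""
    return pages * page_size <= parse_size(bank_size)


def max_pages(bank_size: str, page_size: int) -> int:
    """Largest page count that fits in the bank under the assumptions above."""
    return parse_size(bank_size) // page_size


if __name__ == "__main__":
    page = 1048576  # memoryPoolPageSize from the configuration in this thread
    for bank, pages in (("1G", 1023), ("1G", 2000), ("2G", 2000)):
        print(bank, pages, "fits" if pool_fits(bank, pages, page) else "does NOT fit")
    print("ceiling for a 2G bank:", max_pages("2G", page))
```

Under these assumptions, 2000 pages of 1 MiB do not fit in a 1G bank, so the bank size would need to be raised (for example to 2G) together with the page count.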
###################################
# readout configuration file
#
# comments lines start with #
# section names are in brackets []
# settings are defined with key=value pairs
###################################
###################################
# general settings
###################################
[readout]
# per-equipment data rate limit, in Hertz (-1 for unlimited)
rate=-1
# time (in seconds) after which program exits automatically (-1 for unlimited)
exitTimeout=-1
[consumer-data-sampling]
consumerType=DataSampling
enabled=0
###################################
# data consumers
###################################
# collect data statistics
[consumer-stats]
consumerType=stats
enabled=0
# this publishes stats, if enabled, to O2 monitoring system
monitoringEnabled=1
monitoringUpdatePeriod=0
consoleUpdate=1
#monitoringURI=infologger://
monitoringURI=influxdb-udp://aido2mon-gpn.cern.ch:8088
#processMonitoringInterval=5
# recording to file
[consumer-rec]
consumerType=fileRecorder
enabled=1
#fileName=/tmp/%T_data.raw
#fileName=/tmp/readout-pipe
fileName=data.raw
#fileName=/dev/null
bytesMax=10000M
dropEmptyHBFrames=1
###################################
# memory banks
###################################
# All section names should start with 'bank-' to be taken into account.
# They define memory to be allocated to readout
# If bank name not specified in each equipment, the first available bank (created first) will be used.
# Types of memory banks include: malloc, MemoryMappedFile
# NB: the FairMQChannel consumers may also create some banks, which will not be
# listed here, and created before them.
[bank-1]
type=MemoryMappedFile
size=2G
numaNode=1
[equipment-rorc-1]
equipmentType=rorc
enabled=1
cardId=3b:0.0
dataSource=Fee
memoryBankName=bank-1
rdhCheckEnabled=0
rdhDumpEnabled=0
rdhUseFirstInPageEnabled = 1
consoleStatsUpdateTime=0
memoryPoolNumberOfPages=2047
memoryPoolPageSize=1048576
firmwareCheckEnabled=0
#linkMask=0-10
[bank-2]
type=MemoryMappedFile
size=2G
numaNode=1
[equipment-rorc-2]
equipmentType=rorc
enabled=1
cardId=3c:0.0
dataSource=Fee
memoryBankName=bank-2
rdhCheckEnabled=0
rdhDumpEnabled=0
rdhUseFirstInPageEnabled = 1
consoleStatsUpdateTime=0
memoryPoolNumberOfPages=2047
memoryPoolPageSize=1048576
firmwareCheckEnabled=0
#linkMask=0-10
# push to fairMQ device
# currently fixed local TCP port 5555
# needs FairMQ at compile time
[consumer-fmq]
consumerType=FairMQDevice
enabled=0
# push to fairMQ device
# light FMQ channel implementation
# WP5-compatible output
[consumer-fmq-wp5]
# session name must match --session parameter of all O2 devices in the chain
sessionName=default
consumerType=FairMQChannel
enabled=0
transportType=shmem
channelName=readout-out
channelType=pair
channelAddress=ipc:///tmp/readout-pipe-0
unmanagedMemorySize=2G
disableSending=0
#need also a memory pool for headers and partial HBf chunks copies
memPoolNumberOfElements=100
memPoolElementSize=128k
[receiver-fmq]
transportType=shmem
channelName=readout
channelType=pair
channelAddress=ipc:///tmp/readout-pipe-0
decodingMode=readout
#one of: readout, none
Dropping still happens… this is the readout log file:
The config is perfectly fine, and readout always provides data pages to the CRU. The FIFO providing free pages to DMA never goes below 99% full, so there is a huge margin: about 2.5 s of buffer ready to be written, even if the software were completely stuck. We can do nothing more on the software side in these conditions.
I'll leave it to @costaf to comment on the possible cause of dropped packets in such a situation. Some investigations were done recently at the system and firmware levels.
Ciao Andrea,
we should talk … what is the length of the acquisition windows?
The CRU can absorb a few blocks of 8 KB … your data-taking window is very large, and this could be the cause of the dropped packets when many links are enabled.
With a window of 6500 I still see some rare packet drops. The safe value in our setup seems to be of the order of 4500, which is still fine for our pedestal data taking (although in this case we need to limit the number of samples per SAMPA hit to 5, and the trigger rate to 10 Hz).
EDIT: I have actually seen a sudden burst of packet drops even with a window of 4500, after writing about 3 GB of data… Re-checking with /dev/null output, drops also occur in this case, so they do not seem to be related to the actual readout output.
We might be able to tolerate dropped packets when taking calibration data, but for debugging the system this is probably not ideal. On the other hand, going lower than 4500 in the window size would mean reducing the number of samples for each SAMPA hit in pedestal mode to 4 or fewer, which is very small…
@costaf would it make sense to try reading the DMA in the same conditions with roc-bench-dma instead of readout.exe, and see if the dropping still happens?
We would not gain much. Readout is needed for the memory allocation using FairMQ to communicate with STBF, so we need to understand where the limitations are, as the final system will use readout to collect data. In addition, roc-bench-dma can’t be used efficiently to store data on disk.
I think with the help of Sylvain we can find out the proper settings to run with readout.
I rule out readout as the cause: the FIFO providing data pages to the CRU is always full. If the CRU were lacking superpages, the FIFO would sometimes be empty when readout fills it, which is never the case according to the logs.