StfBuilder crashes on MID FLP

Dear all,
I’m trying to run an acquisition on the MID FLP installed at CERN.
readout.exe works fine when writing the output to a file, but when I try to use StfBuilder I get this error message:

terminate called without an active exception
96654 Aborted                 StfBuilder --session default --id stf_builder-0 --transport shmem --detector MID --dpl-channel-name dpl-chan --channel-config "name=dpl-chan,type=push,method=bind,address=ipc:///tmp/stf-builder-dpl-pipe-0,transport=shmem,rateLogging=5" --channel-config "name=readout,type=pull,method=connect,address=ipc:///tmp/readout-pipe-0,transport=shmem,rateLogging=5" --detector-rdh 6 --verbosity veryhigh

The config file I use is the following:
# test config to run readout-stbf out of the box with data emulator

# dummy data source
[equipment-emulator-1]
enabled=0
name=emulator-1
equipmentType=cruEmulator
memoryPoolNumberOfPages=1800
memoryPoolPageSize=1M
numberOfLinks=4
PayloadSize=8000

# define a (disabled) CRU equipment for CRU end point #0
[equipment-rorc-1]
enabled=1
equipmentType=rorc
cardId=#1
dataSource=Fee
memoryBankName=bank-o2
memoryPoolNumberOfPages=1800
memoryPoolPageSize=1M
rdhUseFirstInPageEnabled=1
linkMask=10,11
firmwareCheckEnabled=0

# monitor counters
[consumer-stats]
consumerType=stats
monitoringEnabled=0
monitoringUpdatePeriod=5
monitoringURI=influxdb-udp://alio2-cr1-flp159:8088

# record data to file (disabled)
[consumer-rec]
enabled=0
consumerType=fileRecorder
fileName=/tmp/data.raw

# allow data sampling to take data
[consumer-data-sampling]
enabled=0
consumerType=DataSampling

# send data to stfb
[consumer-StfBuilder]
enabled = 1
consumerType = FairMQChannel
sessionName = default
fmq-transport = shmem
fmq-name = readout
fmq-type = push
# fmq-address = ipc:///tmp/flp-readout-pipe-0
fmq-address = ipc:///tmp/readout-pipe-0
memoryBankName = bank-o2
unmanagedMemorySize = 2G
memoryPoolNumberOfPages = 200
memoryPoolPageSize = 1M
disableSending=0

# matching config for the test receiver
# [receiver-fmq]
# decodingMode=stfHbf
# channelAddress=ipc:///tmp/flp-readout-pipe-0
# channelType=pull
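
Before digging further, it can help to confirm that both ends of the readout channel agree. The sketch below is only an illustration: it compares the [consumer-StfBuilder] section quoted above with the "readout" --channel-config string from the StfBuilder command line. The helper names (parse_channel_config, check_channel) are hypothetical and not part of Readout or DataDistribution.

```python
import configparser

# Excerpt of the readout configuration quoted above.
READOUT_CFG = """
[consumer-StfBuilder]
enabled = 1
consumerType = FairMQChannel
fmq-transport = shmem
fmq-name = readout
fmq-type = push
fmq-address = ipc:///tmp/readout-pipe-0
"""

# The --channel-config value passed to StfBuilder for the "readout" channel.
STFB_CHANNEL = (
    "name=readout,type=pull,method=connect,"
    "address=ipc:///tmp/readout-pipe-0,transport=shmem,rateLogging=5"
)


def parse_channel_config(value):
    """Split a 'key=value,key=value' channel string into a dict."""
    return dict(item.split("=", 1) for item in value.split(","))


def check_channel(readout_cfg_text, stfb_channel):
    """Return a list of mismatches between the two ends of the channel."""
    cfg = configparser.ConfigParser()
    cfg.read_string(readout_cfg_text)
    consumer = cfg["consumer-StfBuilder"]
    channel = parse_channel_config(stfb_channel)

    problems = []
    pairs = (
        ("fmq-name", "name"),
        ("fmq-address", "address"),
        ("fmq-transport", "transport"),
    )
    for readout_key, channel_key in pairs:
        if consumer[readout_key] != channel[channel_key]:
            problems.append(
                f"{readout_key}={consumer[readout_key]} vs "
                f"{channel_key}={channel[channel_key]}"
            )
    # One side must push and the other must pull.
    if {consumer["fmq-type"], channel["type"]} != {"push", "pull"}:
        problems.append("channel types are not a push/pull pair")
    return problems


print(check_channel(READOUT_CFG, STFB_CHANNEL))
```

For the configuration in this post the list comes back empty, so the crash is unlikely to be a simple address or transport mismatch.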

The data distribution version I’m using is the one shipped with the flp-suite, namely: v0.7.6.

Note that a similar configuration works on other FLPs (not located at CERN), one with the same software version and the other with an older version (0.7.3).

Any suggestion is welcome.
Thanks in advance!
Best regards,
Diego

It’s a shot in the dark … but I have recently seen this exception in connection with shared memory problems. Could you try increasing the shared memory segment size with --shm-segment-size 10000000000 (10 GB instead of the default 2 GB)? Alternatively, you could try the new --no-IPC option to disable shared memory.

Hi @swenzel,
I tried both options you’re suggesting, but I still get the same error.

For information, I also tried to run the command with gdb. Here’s what I get:

Thread 13 "StfBuilder" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f586097f700 (LWP 117237)]
0x00007f58ecbb4387 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f58ecbb4387 in raise () from /lib64/libc.so.6
#1  0x00007f58ecbb5a78 in abort () from /lib64/libc.so.6
#2  0x00007f58ed1fc305 in __gnu_cxx::__verbose_terminate_handler () at ../../../../gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007f58ed1fa0d6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00007f58ed1fa121 in std::terminate () at ../../../../gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
#5  0x00007f58ec837e6e in fair::mq::shmem::Message::~Message (this=<optimized out>, __in_chrg=<optimized out>)
    at /mnt/mesos/sandbox/sandbox/jenkins/workspace/build-any-ib/sw/20156136/1/SOURCES/FairMQ/v1.4.18/v1.4.18/fairmq/shmem/Message.h:227
#6  0x00007f58ec838089 in fair::mq::shmem::Message::~Message (this=0x7f5830800900, __in_chrg=<optimized out>)
    at /mnt/mesos/sandbox/sandbox/jenkins/workspace/build-any-ib/sw/20156136/1/SOURCES/FairMQ/v1.4.18/v1.4.18/fairmq/shmem/Message.h:236
#7  0x0000000000417682 in std::default_delete<FairMQMessage>::operator()(FairMQMessage*) const ()
    at /jenkins/workspace/BuildRPM/sw/20156139/1/slc7_x86-64/GCC-Toolchain/v7.3.0-alice2-9/include/c++/7.3.0/bits/unique_ptr.h:78
#8  std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >::~unique_ptr() ()
    at /jenkins/workspace/BuildRPM/sw/20156139/1/slc7_x86-64/GCC-Toolchain/v7.3.0-alice2-9/include/c++/7.3.0/bits/unique_ptr.h:268
#9  std::_Destroy<std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> > >(std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >*) ()
    at /jenkins/workspace/BuildRPM/sw/20156139/1/slc7_x86-64/GCC-Toolchain/v7.3.0-alice2-9/include/c++/7.3.0/bits/stl_construct.h:98
#10 std::_Destroy_aux<false>::__destroy<std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >*>(std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >*, std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >*) ()
    at /jenkins/workspace/BuildRPM/sw/20156139/1/slc7_x86-64/GCC-Toolchain/v7.3.0-alice2-9/include/c++/7.3.0/bits/stl_construct.h:108
#11 std::_Destroy<std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >*>(std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >*, std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >*) ()
    at /jenkins/workspace/BuildRPM/sw/20156139/1/slc7_x86-64/GCC-Toolchain/v7.3.0-alice2-9/include/c++/7.3.0/bits/stl_construct.h:137
#12 std::_Destroy<std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >*, std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> > >(std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >*, std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >*, std::allocator<std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> > >&) ()
    at /jenkins/workspace/BuildRPM/sw/20156139/1/slc7_x86-64/GCC-Toolchain/v7.3.0-alice2-9/include/c++/7.3.0/bits/stl_construct.h:206
#13 std::vector<std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> >, std::allocator<std::unique_ptr<FairMQMessage, std::default_delete<FairMQMessage> > > >::~vector() [clone .lto_priv.419] (this=0x7f586097e160)
    at /jenkins/workspace/BuildRPM/sw/20156139/1/slc7_x86-64/GCC-Toolchain/v7.3.0-alice2-9/include/c++/7.3.0/bits/stl_vector.h:434
#14 0x0000000000425780 in o2::DataDistribution::StfInputInterface::DataHandlerThread(unsigned int) (this=0x1089ee0,
    pInputChannelIdx=<optimized out>)
    at /jenkins/workspace/BuildRPM/sw/20156139/1/SOURCES/DataDistribution/v0.7.6/v0.7.6/src/StfBuilder/StfBuilderInput.cxx:87
#15 0x00007f58ed224f3f in std::execute_native_thread_routine (__p=0x1096420) at ../../../../../gcc/libstdc++-v3/src/c++11/thread.cc:83
#16 0x00007f58ed4f1ea5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f58ecc7c8dd in clone () from /lib64/libc.so.6

Thanks,
cheers,
Diego

It does seem to be related to shared memory. What date is this software distribution from?

It was installed about one week ago using the flp-suite…
Here is the full sw version:

Currently Loaded Modulefiles:
  1) BASE/1.0                       20) libxml2/v2.9.3-34
  2) GCC-Toolchain/v7.3.0-alice2-9  21) GSL/v1.16-54
  3) zlib/v1.2.8-65                 22) Python-modules/1.0-117
  4) libpng/v1.6.34-56              23) Clang/v9.0.0-29
  5) libffi/v3.2.1-21               24) arrow/v0.17.0-5
  6) FreeType/v2.10.1-10            25) lzma/v5.2.3-29
  7) Python/v3.6.10-34              26) ROOT/v6-20-02-alice7-3
  8) boost/v1.72.0-alice1-24        27) FairRoot/7701b7e691-10
  9) FairLogger/v1.5.0-20           28) Vc/1.4.1-33
 10) ZeroMQ/v4.3.2-9                29) Monitoring/v3.0.6-2
 11) ofi/v1.7.1-13                  30) Configuration/v2.2.6-19
 12) asiofi/v0.3.1-70               31) libInfoLogger/v1.3.9-19
 13) DDS/3.2-2                      32) Common-O2/v1.4.9-19
 14) FairMQ/v1.4.18-1               33) ms_gsl/2.1.0-2
 15) Ppconsul/0.1.0-70              34) GLFW/3.3-090b16bfae-19
 16) c-ares/v1.15.0-10              35) DebugGUI/v0.1.0-34bc77ae9c-17
 17) OpenSSL/v1.0.2o-27             36) libuv/v1.38.0-4
 18) protobuf/v3.11.4-13            37) O2/cdb6ef451b_O2_DATAFLOW-1
 19) grpc/v1.27.3-5                 38) DataDistribution/v0.7.6-2

The O2 version is from over a month ago. The --no-IPC option was not there yet.

Hi @swenzel,
ok, I think I figured it out.
readout.exe is compiled against FairMQ v1.4.20, while DataDistribution comes with FairMQ v1.4.18. I reinstalled DataDistribution with aliBuild (instead of the RPMs) using an updated version of alidist, and the issue is gone.
I am not sure how to ensure that all of the packages needed by Readout and DataDistribution are in sync, though…
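
One rough way to spot this kind of mismatch is to read the FairMQ version out of the module list and compare it with the version the other package was built against. The sketch below assumes the two-column "module list" layout shown earlier in this thread; loaded_modules is a hypothetical helper, and the expected version is simply the one mentioned above.

```python
import re

# Two lines copied from the "Currently Loaded Modulefiles" output above.
MODULE_LIST = """
 13) DDS/3.2-2                      32) Common-O2/v1.4.9-19
 14) FairMQ/v1.4.18-1               33) ms_gsl/2.1.0-2
"""


def loaded_modules(text):
    """Map module name -> version, dropping the trailing '-<build>' number."""
    entries = re.findall(r"\d+\)\s+([^\s/]+)/(\S+)", text)
    return {name: re.sub(r"-\d+$", "", version) for name, version in entries}


modules = loaded_modules(MODULE_LIST)
expected_fairmq = "v1.4.20"  # version readout.exe was compiled against, per this post

if modules["FairMQ"] != expected_fairmq:
    print(f"FairMQ mismatch: loaded {modules['FairMQ']}, expected {expected_fairmq}")
```

This only compares version strings; it does not say anything about binary compatibility between the two releases.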
This is a problem similar to what I posted here:

Anyways, thanks for the suggestions.
Cheers,
Diego

Dear Diego,

Before each release of the FLP Suite, we run integration tests to identify issues like this and prevent them from reaching end users, so I’d like to understand whether there is something to improve in the test procedure.

The latest release of FLP Suite (v0.8.0) ships readout 1.4.0-4 but I see that you also have Readout 1.4.4-3 (which, as indicated, compiles against FairMQ v1.4.20).

Was there any manual installation of Readout on this machine?

Cheers,
Vasco

Dear Vasco,
I did not install the software myself, but it might be that the needed Readout version was not tagged yet and the Readout package was updated later on (I guess via yum).
So maybe this is the problem.
Thanks,
cheers,
Diego

Dear all,
While the crash is solved, I now have another issue.
Readout does not seem to recognise that the data come from the MID detector. I usually define the data spec as MID/RAWDATA. However, the DPL proxy gives me the warning message:

[195868:readout-proxy]: [15:46:14][WARN] Some input data are not matched by filter rules
[195868:readout-proxy]:    FLP/DISTSUBTIMEFRAME/0
[195868:readout-proxy]:    NIL/RAWDATA/12
[195868:readout-proxy]:    NIL/RAWDATA/11

meaning that somewhere in the chain (Readout? StfBuilder?) the data source is not identified. Is this a known issue? How can I tell the system that the data source is MID?
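
To make the warning concrete, here is a minimal sketch of how ORIGIN/DESCRIPTION/SUBSPEC strings like the ones in the log compare against a MID/RAWDATA spec. It is not the DPL matcher, just an illustration; parse_descriptor and unmatched are hypothetical names.

```python
def parse_descriptor(text):
    """Split 'ORIGIN/DESCRIPTION[/SUBSPEC]' into its parts."""
    parts = text.split("/")
    origin, description = parts[0], parts[1]
    subspec = int(parts[2]) if len(parts) > 2 else None
    return origin, description, subspec


def unmatched(inputs, specs):
    """Return the inputs that match none of the requested specs."""
    wanted = [parse_descriptor(spec) for spec in specs]
    missing = []
    for item in inputs:
        origin, description, subspec = parse_descriptor(item)
        matched = any(
            origin == w_origin
            and description == w_description
            # A spec without a subspec accepts any subspec.
            and (w_subspec is None or w_subspec == subspec)
            for w_origin, w_description, w_subspec in wanted
        )
        if not matched:
            missing.append(item)
    return missing


seen = ["FLP/DISTSUBTIMEFRAME/0", "NIL/RAWDATA/12", "NIL/RAWDATA/11"]
print(unmatched(seen, ["MID/RAWDATA"]))
```

With the origin set to NIL, none of the three inputs match MID/RAWDATA, which is what the proxy reports.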

Thanks in advance,
cheers,
Diego

Dear @gneskovi,
AFAIU, in the past the data source for the StfBuilder was specified with the --detector option.
However, if we have RDH v6, this option is ignored, and the data source is taken directly from the RDH.
I guess that it is up to each detector to correctly fill the data source in the RDH when the UL is implemented, but I’m not sure what to do if the UL is not implemented. In this case, it seems that the data source is invalid.
Is there a way to force the StfBuilder to use the data source specified with the --detector option?

Thanks,
cheers,
Diego

Hi @dstocco,
Do you know what value is used in the RDH?
Using the RDH value was intentional, but I could add a fallback to the configuration parameter if this is invalid…
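
The fallback idea could look roughly like the sketch below. This is only an illustration of the logic, not StfBuilder code: the table of known IDs, the rule that any ID missing from the table counts as invalid, and the function name are all assumptions.

```python
# Hypothetical lookup table. 37 (0x25) is the MID value quoted elsewhere in this thread.
KNOWN_SYSTEM_IDS = {37: "MID"}


def resolve_data_origin(rdh_system_id, detector_option):
    """Prefer the system ID from the RDH; fall back to the --detector value.

    Assumption: an ID that is not in KNOWN_SYSTEM_IDS is treated as invalid.
    """
    origin = KNOWN_SYSTEM_IDS.get(rdh_system_id)
    if origin is not None:
        return origin
    return detector_option


print(resolve_data_origin(0, "MID"))   # RDH field not filled: use --detector
print(resolve_data_origin(37, "MID"))  # RDH field filled: use the RDH value
```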

@costaf : Can the Detector/System ID be specified during configuration of CRU to have the correct value in the RDHv6?

Hello, which CRU firmware version has been used for the tests?
There is a way to configure the SYSTEM ID for the streaming detector without UL.
It is not yet implemented in the software, so I have to check which register has to be written.

Ciao @costaf,
the FW version is: e7687156 (you updated it last week).

However, before inserting the correct value (I need to check what it is for MID), do you know if a reload of the electronics configuration might be required after the modification?
We have a problem with the mini-PCs used to configure the electronics: they seem to be offline, so we cannot reconfigure the electronics remotely.
In this case I’d prefer to stick with the current configuration until September and use a hack instead (if I tell StfBuilder that the RDH version is v5 instead of v6 it seems to work…)

Thanks,
Cheers,
Diego

Hello,
No, writing this register will not change the clock of the CRU, so the link should stay up.

I am not in favor of hacking the system; otherwise we will progress with the testing and at some point realize that everything was working only because we were tweaking something.

For MID the magic number is 37 (0x25)

How many links do you use?

For EP0, the raw commands to load the correct system ID are the following (one per link):

# LINK 0
roc-reg-write --i=PCIEADD --address=0x00640004 --ch=2 --val=0x250000
# LINK 1
roc-reg-write --i=PCIEADD --address=0x00642004 --ch=2 --val=0x250000
# LINK 2
roc-reg-write --i=PCIEADD --address=0x00644004 --ch=2 --val=0x250000
# LINK 3
roc-reg-write --i=PCIEADD --address=0x00646004 --ch=2 --val=0x250000
# LINK 4
roc-reg-write --i=PCIEADD --address=0x00648004 --ch=2 --val=0x250000
# LINK 5
roc-reg-write --i=PCIEADD --address=0x0064a004 --ch=2 --val=0x250000
# LINK 6
roc-reg-write --i=PCIEADD --address=0x0064c004 --ch=2 --val=0x250000
# LINK 7
roc-reg-write --i=PCIEADD --address=0x0064e004 --ch=2 --val=0x250000
# LINK 8
roc-reg-write --i=PCIEADD --address=0x00650004 --ch=2 --val=0x250000
# LINK 9
roc-reg-write --i=PCIEADD --address=0x00652004 --ch=2 --val=0x250000
# LINK 10
roc-reg-write --i=PCIEADD --address=0x00654004 --ch=2 --val=0x250000
# LINK 11
roc-reg-write --i=PCIEADD --address=0x00656004 --ch=2 --val=0x250000

For EP1, just replace 0x006 with 0x007.
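
The addresses above follow a regular pattern, so the commands can also be generated. The sketch below is inferred purely from the listed commands: a step of 0x2000 per link, 0x006... becoming 0x007... for EP1, and the value 0x250000 read as 0x25 shifted left by 16 bits. The 12-link limit and the function names are assumptions, and PCIEADD remains a placeholder for the card.

```python
LINK0_ADDRESS = 0x00640004    # EP0, link 0, taken from the commands above
LINK_STRIDE = 0x2000          # address step between consecutive links
ENDPOINT_STRIDE = 0x00100000  # turns 0x006... (EP0) into 0x007... (EP1)
MID_SYSTEM_ID = 0x25          # 37, the MID value given above


def system_id_address(endpoint, link):
    """Register address for one link (assumes 2 endpoints, links 0..11)."""
    if endpoint not in (0, 1) or not 0 <= link <= 11:
        raise ValueError("endpoint must be 0 or 1 and link must be in 0..11")
    return LINK0_ADDRESS + endpoint * ENDPOINT_STRIDE + link * LINK_STRIDE


def system_id_command(card, endpoint, link, system_id=MID_SYSTEM_ID):
    """Build the roc-reg-write command line for one link."""
    value = system_id << 16  # assumption: 0x250000 is 0x25 << 16
    address = system_id_address(endpoint, link)
    return (
        f"roc-reg-write --i={card} --address=0x{address:08x} "
        f"--ch=2 --val=0x{value:x}"
    )


for link in (10, 11):
    print(system_id_command("PCIEADD", 0, link))
```

The script only prints the command lines; nothing is written to the card.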

Cheers

Ciao @costaf,
well, indeed the hack is not working as expected (it complains about the RDH size).
But if the proper configuration does not bring the link down, let’s use it.
I use 2 links (10 and 11) on CRU #1 (3b:00.0, EP 0).
So, if I get it correctly the command is:

# LINK 10
roc-reg-write --i=PCIEADD --address=0x00654004 --ch=2 --val=0x250000
# LINK 11
roc-reg-write --i=PCIEADD --address=0x00656004 --ch=2 --val=0x250000

Quick question to be sure: when I write --ch=2, does it mean CRU ID #1, or should I do something else to target that CRU?

@costaf,
forget about my last comment: I guess that I need to replace PCIEADD with #1, right?

ch=2 is the BAR channel; you should not change it. Yes, you have to use #1 or 3b:0.0 in place of PCIEADD.

Hi @costaf,
ok, this works. Thanks a lot!

However, I still get issues on readout.exe:

2020-07-15 17:38:50.928050 ! **Warning - no page left**
2020-07-15 17:38:50.928184 !!! **Error - ConsumerFMQ : error 478** 

And in StfBuilder:

[2020-07-15 17:42:43.538][ **E** ] READOUT INTERFACE: wrong number of HBFrames in the header.header_cnt=256 msg_length=449 total_occurrences=1
[2020-07-15 17:42:43.539][ **E** ] READOUT INTERFACE: Error when accessing the RDH: RDH size is too small. size=12
[2020-07-15 17:42:43.561][ **E** ] READOUT INTERFACE: TF ID non-contiguous increase! (6) -> (11) readout.exe sent messages with non-monotonic TF id!
SubTimeFrames will be incomplete! Total occurrences: 0
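
For reference, the kind of check behind the last message can be sketched as follows. This is not the StfBuilder implementation, only an illustration of spotting non-contiguous TF IDs in a sequence; tf_id_gaps is a hypothetical name and the sample IDs are made up around the (6) -> (11) jump in the log.

```python
def tf_id_gaps(tf_ids):
    """Return (previous, current) pairs where the TF ID does not increase by one."""
    gaps = []
    for previous, current in zip(tf_ids, tf_ids[1:]):
        if current != previous + 1:
            gaps.append((previous, current))
    return gaps


# Made-up sequence reproducing the jump reported above.
print(tf_id_gaps([4, 5, 6, 11, 12]))
```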

Can you dump some data locally on the disk with readout and give me the path to the file?
RDH size = 12 is really wrong, but I doubt it is coming from the CRU, as it should chop the data correctly … Is there some data processing in the middle?