Bus error during digitization and reconstruction

Dear experts,

As of late last week, o2-sim-digitizer has often been breaking with a bus error:

...
[11788:ITSDigitizer]:  *** Break *** bus error
[11788:ITSDigitizer]:  Generating stack trace...
[11788:ITSDigitizer]:  0x00007f2af948eb13 in o2::framework::DataProcessor::doSend(FairMQDevice&, o2::framework::MessageContext&) at /usr/include/c++/7/bits/unique_ptr.h:355 from /home/alidocklite/sw/ubuntu1804_x86-64/O2/v1.2.0-1/lib/libO2Framework.so
[11788:ITSDigitizer]:  0x00007f2af9459281 in <unknown> from /home/alidocklite/sw/ubuntu1804_x86-64/O2/v1.2.0-1/lib/libO2Framework.so
...

While simulations with 500 to 1000 events run fine, this error shows up for more than roughly 2000 Pythia events with the ITS and MFT enabled. The full output of o2-sim-digitizer can be found here: digitizer.log. All files here. This happens on Ubuntu 18.04 on different machines, both Intel and AMD.

In some intermediate cases the digitizer runs fine, but the reconstruction workflow then breaks with a bus error:

...
[16761:mft-clusterer]: [20:02:42][INFO] MFTClusterer pushed 967112 compressed clusters, in 9822 RO frames
[16761:mft-clusterer]: 
[16761:mft-clusterer]:  *** Break *** bus error
[16761:mft-clusterer]:  Generating stack trace...
[16761:mft-clusterer]:  0x00007fd1497e6a93 in o2::framework::DataProcessor::doSend(FairMQDevice&, o2::framework::MessageContext&) at /usr/include/c++/7/bits/unique_ptr.h:355 from /home/alidocklite/sw/ubuntu1804_x86-64/O2/v1.2.0-1/lib/libO2Framework.so
...

Alidist and O2 dev are up to date. Anyone else experiencing this? Suggestions are much appreciated.

Thanks in advance,
Rafael

Do you have the same problem without alidock? In particular, do you have any settings for Docker on your machine which might reduce the amount of shared memory available to Docker itself? Can you do a docker inspect while it runs?
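
For reference, a quick way to check the shared-memory size of a running container is docker inspect; the container name below is a placeholder:

# Prints HostConfig.ShmSize in bytes; 67108864 is Docker's 64MB default
docker inspect --format '{{.HostConfig.ShmSize}}' my-o2-container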

Hi @pezzi

I’ve seen bus errors when the shared memory segment was not sufficiently large; can you try with a larger --shm-segment-size?
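
A sketch of what that could look like, assuming the option takes the size in bytes and using the workflow name from the original post; the 1GB value is just an example to start from:

# Hypothetical invocation: enlarge the FairMQ shared-memory segment to 1GB
o2-sim-digitizer --shm-segment-size 1000000000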

Cheers,
Ruben

Hi @eulisse and @shahoian, thanks for your replies. Yes, this is on Docker, which has a default shm segment of only 64MB. I have been running these setups for a while with up to 50000 events without this ever showing up. Anyway, after increasing the shared memory to 1GB with --shm-size="1g" the problem is gone! Thanks!
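
For anyone hitting the same problem, this is roughly the Docker invocation; the image and the rest of the command line are placeholders for whatever you normally run:

# The relevant part is --shm-size; image and shell are placeholders
docker run -it --shm-size="1g" ubuntu:18.04 /bin/bash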

@pezzi with only 64MB it’s actually weird it never showed up before. What do you have for ulimit -a?

@mconcas can you please update the docker defaults for alidock to a more sensible (for O2) amount of shared memory? We use 20GB by default for the segment size. I would at least try 1GB.

Hi @eulisse

It seems that alidock is also sticking to the default 64MB limit. As for ulimit -a, on my system the limits do not change with the shm size.

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256727
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
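
That is expected, since ulimit does not report the /dev/shm size; it can be checked directly inside the container instead (standard coreutils, assuming /dev/shm is mounted as usual):

df -h /dev/shm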

Hi, I just saw this issue. A fix has been added in master, thanks to @fcatalan.