DPL workflow shared memory issue

dstocco · April 26, 2021, 1:26pm

Dear all,
I’m running a workflow on the FLP to decode and check output data.
Everything runs smoothly when using only orbit triggers, or when running with 100kHz physics triggers with only 2 GBT links. But when I run with 100kHz physics triggers on 32 links, I get:

[22055:readout-proxy]: [14:55:34][WARN] dispatching on channel from_readout-proxy_to_MIDRawDecoder was delayed by 10 ms

and finally:

[22056:MIDRawDecoder]: [14:55:35][ERROR] Exception caught: shmem: could not create a message of size 24126080, alignment: default, free memory: 53944160

My guess is that the devices are too slow to process the full TF with 32 GBT links when the trigger rate is 100kHz, and that the resulting back pressure eventually causes the failure.
But I do not quite understand why the system complains about shmem…considering that the requested message size is less than half of the reported free memory.

Do you have a better understanding of what the problem could be and how to solve it? Is it a processing-time or shared memory issue?
Thanks in advance for any hint.
Best regards,
Diego

costaf · April 27, 2021, 9:38am

Ciao Diego … we’ll check and let you know

costaf · April 27, 2021, 9:44am

I am contacting the experts; we will let you know if a test session is needed.

dstocco · April 27, 2021, 9:50am

Ciao @costaf ,
@divia replied to me privately yesterday. We verified that the checker requires too much CPU (close to 90% when running with orbit triggers only). The CPU usage becomes too high when the trigger rate increases, and this probably generates backpressure…and eventually the crash.
He told me that the failure to create a message while more than twice the required size is free might be due to memory fragmentation.

I’m trying to optimize the checker, and I’ve gained 20% in CPU…but it is not enough to handle the output of 4 EPs with 100kHz physics triggers…
And I cannot easily go down a factor of two (at some point I need to order the data according to their timestamp, and the malloc/free of the unordered_maps is time-consuming).
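For illustration only, one common way to cut allocation churn when ordering by timestamp is to collect records in a pre-reserved vector, sort it in place, and reuse the buffer. This is a minimal sketch, not the actual checker code: the Record type, its fields, and the TimeOrderer class are hypothetical names, and whether this fits the MID checker logic is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record type; the real MID data structures are not shown in this thread.
struct Record {
  std::uint64_t timestamp;
  std::uint32_t payload;
};

// Collects records in one contiguous buffer and orders them by timestamp.
class TimeOrderer
{
 public:
  // Reserve once up front so that push_back does not reallocate
  // as long as the expected size is not exceeded.
  explicit TimeOrderer(std::size_t expectedRecords) { mBuffer.reserve(expectedRecords); }

  void add(const Record& rec) { mBuffer.push_back(rec); }

  // Sorts in place by timestamp. The relative order of records
  // with equal timestamps is unspecified (std::sort is not stable).
  const std::vector<Record>& sorted()
  {
    std::sort(mBuffer.begin(), mBuffer.end(),
              [](const Record& a, const Record& b) { return a.timestamp < b.timestamp; });
    return mBuffer;
  }

  // Drops the content but keeps the capacity, so the next batch reuses the same memory.
  void clear() { mBuffer.clear(); }

 private:
  std::vector<Record> mBuffer;
};
```

Calling clear() between time frames instead of building a new container each time is what avoids the repeated malloc/free; whether it is enough to gain the missing factor would have to be checked by profiling.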
Anyways, as long as it is the checker (and not the decoder) that needs too much CPU, it is not critical…
Of course, if some experts have further insight/suggestions, they’re welcome!
Cheers,
Diego

pkonopka · April 27, 2021, 9:56am

Hello,

It’s probably because these 53944160 bytes aren’t contiguous. In reality they could be 10 separate pieces of memory which add up to 53MB in total, with no single piece able to hold 24MB at once.
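To make the fragmentation argument concrete, here is a small sketch using the numbers from the error message. It only models the arithmetic: the list of free block sizes is made up (ten equal blocks), and the function names are hypothetical, not part of the shared memory transport.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum of all free blocks: this is what a "free memory" figure would report.
inline std::size_t totalFree(const std::vector<std::size_t>& freeBlocks)
{
  return std::accumulate(freeBlocks.begin(), freeBlocks.end(), std::size_t{0});
}

// Size of the largest single free block (0 if there are none).
inline std::size_t largestFree(const std::vector<std::size_t>& freeBlocks)
{
  return freeBlocks.empty() ? 0 : *std::max_element(freeBlocks.begin(), freeBlocks.end());
}

// A contiguous request succeeds only if one single block is large enough,
// regardless of how much memory is free in total.
inline bool canAllocate(const std::vector<std::size_t>& freeBlocks, std::size_t requestBytes)
{
  return largestFree(freeBlocks) >= requestBytes;
}

// Made-up snapshot: ten free blocks of 5394416 bytes each, 53944160 bytes in total.
inline std::vector<std::size_t> exampleFreeBlocks()
{
  return std::vector<std::size_t>(10, 5394416);
}
```

With this snapshot, totalFree() gives 53944160 bytes, yet canAllocate() is false for a 24126080-byte message, because the largest block is only about 5.4MB.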

Either your processing (or something down the chain) is just too slow and cannot keep up with the throughput, or it can keep up but the latency is too high, in which case you might just need more memory.

In the first case, you could check which process takes 100% CPU and profile it (e.g. perf top -p <pid>), or you could parallelize some processing if applicable.

In the latter case, you can increase the shared memory segment size. If you are running on FLP with AliECS, then you can try setting shm_segment_size in the Advanced Configuration panel to a higher number (in bytes). Probably 10GB is your default, but to be checked. If you run the DPL workflow standalone, then you have to add --shm-segment-size <your value> to all parts of the workflow.
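Since the value has to be given in bytes, a quick sketch of the arithmetic may help. It assumes that 10GB means 10 * 1024^3 bytes (it could also be 10^9-based, to be checked), and the helper names are hypothetical. The message count is only an upper bound: it ignores fragmentation and any other messages in the segment.

```cpp
#include <cstdint>

// Converts a size in GiB (1024^3 bytes) to bytes.
constexpr std::uint64_t gibToBytes(std::uint64_t gib)
{
  return gib * 1024ULL * 1024ULL * 1024ULL;
}

// Upper bound on how many messages of a given size fit in a segment,
// assuming perfectly contiguous packing and nothing else stored in it.
constexpr std::uint64_t maxMessagesInSegment(std::uint64_t segmentBytes, std::uint64_t messageBytes)
{
  return messageBytes == 0 ? 0 : segmentBytes / messageBytes;
}
```

Under these assumptions gibToBytes(10) is 10737418240, and at most 445 messages of 24126080 bytes fit in such a segment. If the consumer is slower than the producer, a larger segment only buys more buffering time before it fills up.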

dstocco · April 27, 2021, 10:03am

Hi @pkonopka ,
thanks for the detailed explanations. Ok, I’ll try to reduce the CPU time and increase the shm_segment_size…although I fear that I will not be able to handle the full data rate in the checks…
Thanks!
Cheers,
Diego

pkonopka · April 27, 2021, 10:06am

Judging from what you wrote a few mins before me, that should only postpone the inevitable.