Memory limits in https://qcg-test.cern.ch/?

Dear experts,

In MFT we plan to use TH2F histograms of size 1024x512. We need 936 such objects.
We are having problems when publishing them.

Up to now we have been testing with just one histogram and everything ran fine.
(That is, the QC task has all 936 histograms, but only one is published.)
When we increased the number of published histograms to 100, the program crashed after 21 cycles with the message:

[ERROR] Exception caught: shmem: could not create a message of size 269354496, alignment: default, free memory: 114500464

If we change the number of published histograms to 200, the problem appears after just 5 cycles (same message as above, slightly different numbers).

If we use all 936 histograms, then the program crashes in the first cycle:
[ERROR] Exception caught: Requested message size (2147483638) exceeds segment size (2000000000)

This led us to suspect that there may be a limit on the number/size of objects that can be posted to https://qcg-test.cern.ch/. Is that so? If yes, is it possible to increase it?
If it is not possible to increase the limit, or if there is no such limit, do you have any suggestions?

thanks a lot

guillermo

Hi, QCG has nothing to do with this. To clarify the processing chain a bit: QC runs as a bunch of processes on a machine, which exchange data via messages. These messages are passed around inside a block of shared memory (shmem). One of the processes, the CheckRunner, stores objects in the repository (probably ccdb-test.cern.ch:8080/ in your case). Then you can inspect these objects in QCG (https://qcg-test.cern.ch), but that is basically a web-based browser for the database.

Back to your problem, let’s count the memory needs for your objects:
936 (nb of histograms) x 1024 (dim) x 512 (dim) x 4 (bytes in a float) = 1,962,934,272 B ≈ 1962 MB
The serialized messages are somewhat larger still (ROOT histograms also carry under/overflow bins, plus serialization overhead), which lines up with the 2147483638 B message size requested in your log.
Now, the log messages show that you have 2000000000 B = 2 GB of shared memory allocated. This is definitely not enough if your QC Task needs that much space.
You can try running your workflow with more shmem allocated by adding the argument --shm-segment-size <some-larger-amount>, provided that you have more physical memory available.
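
For example (illustrative values; the workflow command is a placeholder for however you start your QC workflow, and the size is given in bytes):

<your-workflow-command> --shm-segment-size 4000000000

With your 936 histograms adding up to roughly 2 GB per cycle, you want something comfortably larger than the 2000000000 B segment shown in your log.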

Please note that if one process is overwhelmed with incoming data (i.e. it cannot process it in real time), its input message queues will grow until they eat all the available memory. So you should also make sure that each process in your workflow can sustain its input data throughput.
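
One simple way to see this happening (assuming a Linux machine, where FairMQ creates its segment under /dev/shm) is to watch the shared memory usage while the workflow runs:

watch -n 1 df -h /dev/shm

If the Used column keeps climbing cycle after cycle, some process is not keeping up.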

Hi Piotr

thanks for the info. Allocating more memory does not seem to help.

I see that you also write: “So you should also make sure that each process in your workflow can sustain its input data throughput”.
Do you have any hints on how to achieve this?

I tried reducing the number of histograms (to 250) and increasing the length of the cycle (cycleDurationSeconds in the JSON file). This seems to help. So I have identified three options (more shared memory, less data, longer cycles).
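
For reference, this is roughly where that setting lives in the config (a trimmed sketch; the task and class names are placeholders, not my actual ones):

{
  "qc": {
    "tasks": {
      "MFTTask": {
        "className": "o2::quality_control_modules::mft::BasicTask",
        "cycleDurationSeconds": 60
      }
    }
  }
}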

Is there any other ingredient to play with to increase the processing capabilities of the workflow?

thanks a lot

guillermo

I am sorry, I missed your reply before.

Basically, by improving the code performance. First, you should check whether any of the processes in the workflow uses 100% or more CPU. In its full command, each of them should have a --id <device-name> argument, which should give you an idea which DPL process it is. Then you can use a profiler to inspect what is taking so much processing power, for example (if you develop on Linux):

sudo perf top -g -p <process pid>
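
To find the PID, you can match on the device name from the --id argument (the name below is a placeholder):

pgrep -af "<device-name>"

and then pass the resulting PID to perf top.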

That should give you a clue about where to look for bottlenecks.

Surely a more powerful machine would help. Are you using a PC or some FLP-like server? Increasing the amount of memory in the argument won’t help if there is no physical memory available. Also, perhaps /dev/shm is too small on your machine; you can crosscheck it as shown below.
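
A quick check (standard Linux; /dev/shm is the tmpfs where the shmem segment files are created):

df -h /dev/shm

The Size reported there needs to be at least as large as the segment you request with --shm-segment-size.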

Cheers, Piotr