Readout.exe + QC for MCH software

Dear O2 experts,

I have started to play with readout.exe + the QC framework, and I have some basic questions regarding the structure of the buffers that are sent from readout to QC.

I can successfully acquire and decode the CRU data by letting readout.exe write into a named pipe, with this configuration:

# recording to file
[consumer-rec]
consumerType=fileRecorder
enabled=1
fileName=/tmp/readout-pipe
bytesMax=0G

In this case, the data is divided into 8 kB buffers, each consisting of a 64-byte header and an 8128-byte payload.
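For illustration, a minimal reader for this layout could look like the following sketch (only a sketch assuming the pages are written back-to-back to the pipe, not the actual decoder):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
  constexpr std::size_t kPageSize = 8192; // one 8 kB page
  constexpr std::size_t kHeaderSize = 64; // readout header at the start of each page

  std::FILE* f = std::fopen("/tmp/readout-pipe", "rb");
  if (f == nullptr) {
    std::perror("fopen");
    return 1;
  }

  std::vector<std::uint8_t> page(kPageSize);
  while (std::fread(page.data(), 1, kPageSize, f) == kPageSize) {
    const std::uint8_t* cruPayload = page.data() + kHeaderSize;
    // ... decode the 8128 bytes of CRU data starting at cruPayload ...
    (void)cruPayload;
  }
  std::fclose(f);
  return 0;
}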

However, when I receive the CRU data inside the QC framework, the payload has a different size.

Here is the relevant part of my readout.exe configuration for QC:

[consumer-data-sampling]
consumerType=DataSampling
enabled=1

and here is the code snippet that processes the received data:

void RawDataProcessor::monitorData(o2::framework::ProcessingContext& ctx)
{
  for (auto&& input : ctx.inputs()) {
    // extract the O2 DataHeader from the header stack of this input
    const auto* header = o2::header::get<header::DataHeader*>(input.header);
    mHistogram->Fill(header->payloadSize);
    QcInfoLogger::GetInstance() << "payloadSize: " << header->payloadSize << AliceO2::InfoLogger::InfoLogger::endm;

    // const char* payload = input.payload;
  }
}

which produces the following output:

[28689]: 2018-12-14 11:07:45.142392     payloadSize: 1015808

What is the structure of the payload received by QC? Where does the CRU data start?

Thanks a lot!

Hello Andrea,

We discussed this with @sy-c and @pkonopka this morning. The payload received in the QC is a CRU data page. It should be equivalent to what is actually copied to the data pipe.

The header is an O2 header populated with data from the header built by the Readout. I don’t think you are interested in the header, I just want to be complete.

I’ll let Sylvain correct me if I am wrong. Maybe you could have a look at the first 8 kB to see if it contains the 64-byte header and 8128-byte payload you expect?

Cheers,
Barth

Hi!

Indeed, I can see the 64-byte headers and the 8128-byte payloads in the data, starting right at the beginning of the O2 payload.
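A minimal sketch of how such a check can be done (walkPages is a hypothetical helper, not part of the QC API), assuming the same 8 kB page layout as on the named pipe; note that the payloadSize reported above, 1015808 bytes, is exactly 124 × 8192, i.e. a whole number of pages:

#include <cstddef>

// Walk an O2 payload made of back-to-back 8 kB readout pages,
// each page being a 64-byte header followed by 8128 bytes of CRU data.
void walkPages(const char* payload, std::size_t payloadSize)
{
  constexpr std::size_t kPageSize = 8192;
  constexpr std::size_t kHeaderSize = 64;
  for (std::size_t offset = 0; offset + kPageSize <= payloadSize; offset += kPageSize) {
    const char* cruData = payload + offset + kHeaderSize; // CRU data of this page
    // ... decode (kPageSize - kHeaderSize) = 8128 bytes starting at cruData ...
    (void)cruData;
  }
}

From monitorData this would be called as walkPages(input.payload, header->payloadSize).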

At this point I am able to perform some basic decoding of the data; however, the terminal is flooded with messages like

[METRIC] dpl/min_input_latency_ms,0 -1648084763 1544797793928 hostname=pcphaid004,name=qcRunReadout

Is there a way to disable them? It seems they are enough to slow down the processing to the point where readout.exe starts dropping frames…

Hi,
you can use the option -q which is a default option of the DPL.
You could also specify a monitoring backend where this type of data would go. I have to check how to do that.

Some progress on this… I have been able to get the CRU data and do some basic decoding, and I now notice that the payloads are not consecutive, even though I have set

      "samplingConditions": [
        {
          "condition": "random",
          "fraction": "1.0",
          "seed": "1441"
        }
      ],
      "blocking": "false"

in my readout.json configuration file.

Could it be that some of the lines above need to be changed in order to really get 100% of the CRU data?

Thanks!

One possible problem is that readout is going too fast for the task and some blocks are dropped in the Data Sampling. Could you tell me whether the data is OK within a single payload? Is it that the payloads are not properly ordered, or is it that you have missing payloads?

It looks like the data is OK within one payload, but there is a gap when the next payload is received.

As far as I can tell, readout.exe can cope with the data rate when writing to a named pipe.

I strongly suspect that we are not fast enough either in the Data Sampling or in the QC itself. This is to be investigated on our side.

Pragmatically, this brings you back to some kind of sampling (<100%), e.g. with a lower fraction in the Data Sampling configuration (see the sketch below). Can you produce meaningful plots with that at all, or not?
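A possible variant (keeping the same samplingConditions block as above, with 0.1 as an arbitrary example value) would be:

      "samplingConditions": [
        {
          "condition": "random",
          "fraction": "0.1",
          "seed": "1441"
        }
      ],
      "blocking": "false"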

Most likely not, at least not before we have some working pre-filtering in the CRU user logic.
Nevertheless, I think it would be interesting to investigate where the bottleneck is compared to a straight dump into a named pipe… what do you think?

We will certainly investigate the bottleneck, no doubt. This is very interesting for us, and we should aim at being able to collect 100% of the data in such a case. Could you give me details on the rate we are talking about here?

It is the full GBTx data rate from one CRU link. The GBTx runs at 80 MHz. Maybe @costaf can easily quote the exact bit rate in this case?

Ciao … here are my 2 cents.
The SOLAR board generates GBT words (80 bits) in continuous mode.
That means 80 bits @ 40 MHz (the clock rate of the GBT link).

The 80 bits are currently translated to 128 bits,
so the raw input data throughput per link is 128 bits × 40 MHz = 5.12 Gb/s.

This is the amount of data transferred to the FLP memory.
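(In bytes, that is 5.12 / 8 = 0.64 GB/s, i.e. 640 MB/s, or roughly 630 of the 1015808-byte payloads quoted above per second.)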

Is this what you wanted to know?

Yes! Thanks…

In case it can be useful, here is the typical CPU usage during the data processing with the QC task:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                    
 9211 root      20   0  335876  37976  30212 R 100.0  0.2   0:18.84 qcRunReadout                                                               
 9215 root      20   0 1489840 987472  30868 S  35.5  6.1   0:05.21 qcRunReadout                                                               
 9212 root      20   0 1685624   1.1g  30664 S  19.6  7.3   0:03.15 qcRunReadout                                                               
 9213 root      20   0  770504 118864  65368 S  15.3  0.7   0:02.90 qcRunReadout                                                               
 9164 root      20   0 2316536  19184   8040 S  11.0  0.1   0:02.50 readout.exe                                                                
 9214 root      20   0  839720 111100  65504 S   4.0  0.7   0:00.97 qcRunReadout                                                               
 9216 root      20   0  765908 110492  65088 S   4.0  0.7   0:01.10 qcRunReadout                                                               

There are several QC-related threads running, one of them reaching 100% CPU usage. The readout.exe process stays at around 10%.

For comparison, when processing the data through a named pipe, the decoding process runs at around 90% CPU and readout.exe at around 24%.

Therefore it seems that the overhead is on the QC side…
For this test, the QC processing was reduced to the bare minimum (just getting the O2 header):

void RawDataProcessor::monitorData(o2::framework::ProcessingContext& ctx)
{
  for (auto&& input : ctx.inputs()) {
    // only fetch the O2 header; the payload is not touched at all
    const auto* header = o2::header::get<header::DataHeader*>(input.header);
    (void)header; // silence the unused-variable warning
  }
}

Hope this helps…

It helps! I now have to see in which piece we are wasting time. It seems that the readout is not affected by the Data Sampling, which is good.
I’ll keep you informed.

Looking at the process flow with valgrind, it seems that quite a lot of CPU cycles are spent in the following function:

void processChildrenOutput(DriverInfo& driverInfo, DeviceInfos& infos, DeviceSpecs const& specs,
                           DeviceControls& controls, std::vector<DeviceMetricsInfo>& metricsInfos)

from

AliceO2/Framework/Core/src/runDataProcessing.cxx

Could you try using --batch?
This is in the DPL, so I am shooting in the dark a bit.
Maybe @eulisse can help us?

Are you sending a lot of debug messages to stdout? That’s the handler of messages for all the processes, and therefore it can get congested. Can we sit together to profile what is going on?

I’ve now read the whole thread. I think you’ll need at least https://github.com/AliceO2Group/AliceO2/pull/1552 to fix the spurious “METRIC” stuff.

Hi!

I am running qcRunReadout with the -q option, so that the terminal output is reduced to a minimum.

I’m available to profile the code, but most likely not before the end of the Christmas break, as I need to bring the equipment to the P2 cavern this afternoon for other tests…