QC slowed down on flpits

Dear experts,

We are running QC for ITS inner barrel on our flpits1 and flpits5.

Recently, the QC processing time has increased dramatically. For the same run (same data), it went from 1 hour to 2 hours. We tracked the QC threads and found that the lz4 decompression and O2 data decoding stages of the pipeline are what slow the QC processing down.

This problem started on flpits5 roughly 2 weeks ago and is still there. On flpits1, it started yesterday afternoon around 4 pm. We tested standalone lz4 decompression of our data, and the speed appears to be around 4 MB/s. CPU usage stays around 55% during the decompression. The available disk space is fine on both FLPs.
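A standalone check of this kind could be timed roughly as in the sketch below. It is an illustration only, not the exact test that was run. It assumes the lz4 command-line tool is on the PATH, and the function names measure_throughput and lz4_decompress_rate are hypothetical.

```python
import subprocess
import time


def measure_throughput(stream, chunk_size=1 << 20):
    """Read a binary stream to EOF; return (total_bytes, seconds, MB_per_s)."""
    total = 0
    start = time.perf_counter()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, elapsed, rate


def lz4_decompress_rate(path):
    """Time `lz4 -dc <path>` and report the rate of decompressed output."""
    # -d decompresses, -c writes the result to stdout instead of a file.
    proc = subprocess.Popen(["lz4", "-dc", path], stdout=subprocess.PIPE)
    try:
        result = measure_throughput(proc.stdout)
    finally:
        proc.stdout.close()
        proc.wait()
    return result
```

Running measure_throughput on the compressed file opened in binary mode gives the plain read rate of the disk, and comparing that with the result of lz4_decompress_rate on the same file may help to tell whether the time goes into reading or into decompressing.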

Do you have any suggestions on how to investigate this issue?

Thank you in advance.

Best regards,
Jian

Dear Jian,

As you noted, this is not directly related to the QC framework; however, I’d like to ask a couple of questions.

  • Was there any change made to these machines? A software update? A git pull?
  • What do you mean by “We tracked the QC threads”?

Cheers,
Barth

Dear Barth,

Thanks a lot for the quick feedback.

I am sure that we did not change our QC or O2. As for other changes on these machines, I am checking with my colleagues.

We investigated the QC processing sequence and found that the lz4 decompression and Raw-Pixel-Reader are much slower than before. As a result, Qc-Task-Runner has to wait for a long time (but its own processing is fine).

Cheers,
Jian

Hello Jian,
Where do you read the data file from? What size is it?
Sylvain

Hello Sylvain,

We read the data from /data/shifts/runs, which resides on /dev/sde1. Actually, the reading is done via pipes; the pipes (for flpits1) are under /home/its/QC/workdir/infiles.
The typical size of a single file is 1.3 GB (.lz4 file).
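
Since the files come from a disk and CPU usage stays well below 100%, one rough way to see whether the machine is waiting on I/O could be to compare two samples of the aggregate cpu line in /proc/stat. The sketch below is only an assumption about how this might be checked, the function names are hypothetical, and the iowait counter itself is only a rough hint on Linux.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two /proc/stat samples."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Fields: user nice system idle iowait irq softirq steal (guest, guest_nice).
    # Guest time is already counted in user/nice, so only the first 8 are summed.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total


def sample_iowait(interval=5.0):
    """Take two /proc/stat samples `interval` seconds apart (Linux only)."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = f.read()
    return iowait_fraction(before, after)
```

Calling sample_iowait() while the decompression is running, and again while the machine is idle, would show whether the iowait share rises noticeably during the slow step.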

Thanks,
Jian

Hello Jian,

Looking at the YUM logs, there was an alisw-lz4 update done on flpits2 on 11 October and on flpits1 this morning at 10:18 (therefore too late to match your timeline). This is the only obvious difference I can spot from the system logs. I did not check software that has been updated outside of YUM.

Otherwise, the two machines followed the same upgrade path and the same rebooting episodes. As neither has been rebooted in the last 32 days, I would exclude an effect induced by a Linux kernel update (this holds for both active and pending updates).

Cheers,

Roberto

Hello Roberto,

Thanks a lot for your inspection.

I was also aware of the lz4 update time, and it does not seem to be related to the issue. We are continuing to investigate.

Thanks,
Jian