Testing data pipeline at FLP level

Dear Experts,
I was trying to start the following data-flow chain:
readout.exe (equipment-player) -> StfBuilder -> o2-dpl-raw-proxy | o2-dpl-raw-parser

However, there was a crash at o2-dpl-raw-proxy, and I would like to figure out the reason for this problem. Could it be an incorrect binary file for the equipment-player, or the configuration? The binary data file seems to be okay, but I am not sure about this.
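
For context, the DPL end of the chain is started roughly as follows (a sketch only: the data spec, channel name and IPC address are illustrative placeholders that have to match the StfBuilder configuration, not a copy of my exact command):

o2-dpl-raw-proxy -b \
  --dataspec "A:FLP/RAWDATA" \
  --channel-config "name=readout-proxy,type=pull,method=connect,address=ipc:///tmp/stf-builder-dpl-pipe-0,transport=shmem" \
  | o2-dpl-raw-parser -b --input-spec "A:FLP/RAWDATA"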

I have placed all the relevant files (log files, the config file, and the binary data file) on CernBox:
https://cernbox.cern.ch/index.php/s/ePeMsnw3Va1qSYZ

Best Regards,
Artur Furs.

You can check the raw data files with

o2-raw-file-check [options] <raw_file0> .. <raw_fileN>

See https://github.com/AliceO2Group/AliceO2/blob/dev/Detectors/Raw/README.md#raw-data-file-checker-standalone-executable for details.

I see in your logs:

[472:readout-proxy]: [16:40:27][ERROR] data on input 0 does not follow the O2 data model, DataHeader missing
[472:readout-proxy]: [16:40:27][ERROR] Unable to locate the region info

Dear Ruben,
thank you for the answer. I tried this tool on my file and the check reports errors:

[INFO] RawDataHeader v4 is assumed
[INFO] perform check for /Wrong RDH.packetCounter increment/
[INFO] perform check for /Wrong RDH.pageCnt increment/
[INFO] perform check for /RDH.stop set of 1st HBF page/
[INFO] perform check for /New HBF starts w/o closing old one/
[INFO] perform check for /Data does not start with TF/HBF/
[INFO] perform check for /Number of HBFs per TF not as expected/
[INFO] perform check for /Number of TFs is less than expected/
[INFO] perform check for /Wrong HBF orbit increment/
[INFO] perform check for /TF does not start by new superpage/
[ERROR] Data does not start with TF/HBF
[ERROR] ^^^Problem(s) was encountered at offset 0 of file 0
[INFO] EP:0 CRU:0x0000 Link:5 FEEID:0xaaaa Packet:0 MemSize:64 OffsNext:8192 prio.:1 BL:0 HS:64 HV:4
[INFO] HBOrb:20377418 TrOrb:20377418 Trg:00000000000000000000001000000010 HBBC:0 TrBC:0 Page:0 Stop:0 Par:48059 DetFld:0xcccc
[INFO] File 0 : 183885824 bytes scanned, 22447 RDH read for 1 links from test_readout_realsim05_HDdt.raw
[INFO] Summary of preprocessing:
[INFO] Lnk0 | Link FLP/RAWDATA/0x00000006 FEE:0xaaaa CRU: 0 Lnk: 5 EP:0 | SPages: 1 Pages: 22447 TFs: 0 with 0 HBF in 1 blocks (1 err)
[WARN] Attention: largest superpage 183885824 B exceeds expected 1048576 B
[INFO] First orbit: 20377418, Last orbit: 20377418
[INFO] Largest super-page: 183885824 B, largest TF: 183885824 B
Real time 0:00:00, CP time 0.050

However, I have another binary data file (ft0raw.raw; I have put it in the same CernBox location as in my previous message) which seems to be okay:

[fitdaq@j91 data]$ o2-raw-file-check ft0raw.bin
[INFO] RawDataHeader v4 is assumed
[INFO] perform check for /Wrong RDH.packetCounter increment/
[INFO] perform check for /Wrong RDH.pageCnt increment/
[INFO] perform check for /RDH.stop set of 1st HBF page/
[INFO] perform check for /New HBF starts w/o closing old one/
[INFO] perform check for /Data does not start with TF/HBF/
[INFO] perform check for /Number of HBFs per TF not as expected/
[INFO] perform check for /Number of TFs is less than expected/
[INFO] perform check for /Wrong HBF orbit increment/
[INFO] perform check for /TF does not start by new superpage/
[INFO] File 0 : 809840 bytes scanned, 9728 RDH read for 19 links from ft0raw.bin
[INFO] Summary of preprocessing:
[INFO] Lnk0 | Link FLP/RAWDATA/0x00000001 FEE:0x0000 CRU: 0 Lnk: 0 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk1 | Link FLP/RAWDATA/0x00010002 FEE:0x0001 CRU: 1 Lnk: 1 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk2 | Link FLP/RAWDATA/0x00020003 FEE:0x0002 CRU: 2 Lnk: 2 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk3 | Link FLP/RAWDATA/0x00030004 FEE:0x0003 CRU: 3 Lnk: 3 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk4 | Link FLP/RAWDATA/0x00040005 FEE:0x0004 CRU: 4 Lnk: 4 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk5 | Link FLP/RAWDATA/0x00050006 FEE:0x0005 CRU: 5 Lnk: 5 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk6 | Link FLP/RAWDATA/0x00060007 FEE:0x0006 CRU: 6 Lnk: 6 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk7 | Link FLP/RAWDATA/0x00070008 FEE:0x0007 CRU: 7 Lnk: 7 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk8 | Link FLP/RAWDATA/0x00080009 FEE:0x0008 CRU: 8 Lnk: 8 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk9 | Link FLP/RAWDATA/0x0009000a FEE:0x0009 CRU: 9 Lnk: 9 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk10 | Link FLP/RAWDATA/0x000a000b FEE:0x000a CRU: 10 Lnk: 10 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk11 | Link FLP/RAWDATA/0x000b000c FEE:0x000b CRU: 11 Lnk: 11 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk12 | Link FLP/RAWDATA/0x000c000d FEE:0x000c CRU: 12 Lnk: 12 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk13 | Link FLP/RAWDATA/0x000d000e FEE:0x000d CRU: 13 Lnk: 13 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk14 | Link FLP/RAWDATA/0x000e000f FEE:0x000e CRU: 14 Lnk: 14 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk15 | Link FLP/RAWDATA/0x000f0010 FEE:0x000f CRU: 15 Lnk: 15 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk16 | Link FLP/RAWDATA/0x00100011 FEE:0x0010 CRU: 16 Lnk: 16 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk17 | Link FLP/RAWDATA/0x00110012 FEE:0x0011 CRU: 17 Lnk: 17 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] Lnk18 | Link FLP/RAWDATA/0x00120013 FEE:0x0012 CRU: 18 Lnk: 18 EP:0 | SPages: 1 Pages: 512 TFs: 1 with 256 HBF in 256 blocks (0 err)
[INFO] First orbit: 0, Last orbit: 255
[INFO] Largest super-page: 44768 B, largest TF: 44768 B
Real time 0:00:00, CP time 0.000

When I used this file with the equipment-player in readout.exe, I got the same problem, plus some strange behavior: the data from readout.exe was sent only after interrupting it with Ctrl+C.

Also, it looks to me as if o2-raw-file-check is only meant for checking files that will be used as input for o2-raw-file-reader-workflow, am I right? Could you please clarify whether the source files for the equipment-player (readout.exe) and for o2-raw-file-reader-workflow should have the same structure?

By the way, test_readout_realsim05_HDdt.raw (the first binary data file) was recorded during tests at the FLP, using readout.exe with the “fileRecorder” consumer type and the “rorc” equipment type.

Best Regards,
Artur Furs.

The raw-file-reader-workflow expects the data (dumped to a file) exactly as it comes out of readout.exe, so o2-raw-file-check can check for errors in any raw file (although it was written primarily to validate the packaging of emulated raw data).
Since your 2nd file passes all the checks, I think the problem is not at the level of the input data but in one of the processors (StfBuilder or raw-proxy).
Your 1st file does not have any orbit or TF flag set; I don't know how the StfBuilder is supposed to react in this case, but o2-raw-file-reader-workflow will not be able to replay it.
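
As a side note, a file which passes the checks can be replayed with o2-raw-file-reader-workflow by pointing it to a small configuration file, roughly like this (the origin/description and the file path below are just an illustration, adapt them to your data):

ft0raw.cfg:
[input-0]
dataOrigin = FLP
dataDescription = RAWDATA
filePath = ft0raw.bin

o2-raw-file-reader-workflow --input-conf ft0raw.cfg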

Hi @afurs,
I believe the actual problem is with the shmem session parameter. You have to give this option to all DPL tasks; otherwise they use some random value. All processes in the chain must belong to the same session in order to exchange data. Please try the same chain again, adding --session default to both DPL tasks. You can also add it to the StfBuilder, but there it is already the default.
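
Schematically, keeping all your other options as they are (the "..." stands for whatever options you already pass):

o2-dpl-raw-proxy --session default ... | o2-dpl-raw-parser --session default ...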
In the future I recommend opening Jira tickets. If you attach all components to the ticket, the relevant people should be notified.

@eulisse This is becoming a common problem with new users. Would you consider reverting this? I don't see a different-session scenario being used in the online system.

Regards

Okay, thanks!
I will try this option.

Best Regards,
Artur Furs.

The DPL driver simply ensures a given workflow has a unique session id when running, because otherwise it would be impossible to run more than one workflow on a node, e.g. on the Grid. This is and remains the main use case for it, so I would keep it as is.

The two options you currently have are:

  1. Pass --session default everywhere. This is obviously the most trivial one.
  2. Use a proper control system which creates the channels for you, as per original design, like DDS (on EPN) or AliECS (on FLP).

There is a third way, which is to use what I call the “DPL Component Model” (a buzzword for the JSON which DPL uses to describe workflows, and which is used e.g. for workflow merging). I will open a ticket about it so that we can discuss how feasible that is.
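
For reference, that JSON can already be inspected for any workflow by letting the DPL driver dump it, e.g. (a sketch; check the driver --help for the exact option):

o2-dpl-raw-parser --input-spec "A:FLP/RAWDATA" --dump-workflow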

What about defaulting the session to “default” in case the user did not pass it? It would avoid this common mistake with new users.

That would require extra arguments to run on the Grid or on a user's laptop. Given that's the main use case of the DPL driver, I think the correct behavior is the current one, especially since this is a non-issue when running under AliECS or OCD. In order to better support this use case, though, I suggested a possible solution in:

https://alice.its.cern.ch/jira/browse/O2-1424