Corrupted data during readout

Could you try to repeat the test mounting the ramdisk?

I am running with 6 links in continuous mode and checking the data … I don’t see errors (I have been running for several minutes).
You dump 16 MB to disk, which is very little data over a very short time … so I am surprised you get errors with a single link. So I am wondering whether writing to disk has an effect.

By the way, in the new release of the bench-dma we fixed the issue of the file size.

Cheers

  Time       Pushed        Read          Errors        °C
  00:01:00   28058112      28057864      0             39.5
  00:02:00   56460288      56460232      0             40.2
  00:02:39   74649600      74649600      0             40.2

  Seconds     159.73
  Superpages  74649600
  Bytes       6.1153e+11
  GB          611.53
  GB/s        3.82851
  Gb/s        30.6281
  GiB/s       3.56558
  Errors      0
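
The rates in this summary follow directly from the byte count and the run time. A quick check of the arithmetic, using only the numbers printed above (the variable names are illustrative and not part of the tool):

```python
# Values copied from the summary above.
seconds = 159.73
total_bytes = 6.1153e11
pushed = 74649600  # last "Pushed" value in the table

gb_per_s = total_bytes / seconds / 1e9      # decimal gigabytes per second
gbit_per_s = gb_per_s * 8                   # gigabits per second
gib_per_s = total_bytes / seconds / 2**30   # binary gibibytes per second
bytes_per_page = total_bytes / pushed       # close to 8192, i.e. 8 KiB pages

print(f"GB/s  {gb_per_s:.3f}")
print(f"Gb/s  {gbit_per_s:.2f}")
print(f"GiB/s {gib_per_s:.3f}")
print(f"bytes per pushed page: {bytes_per_page:.0f}")
```

Small differences in the last digits relative to the tool output are expected, because the byte count printed above is already rounded.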

Sure, will run a test after lunch & keep you posted.

Cheers,
T.

Alright, did a test with the ramdisk. First event was fine, second one showed scrambling. I guess that was to be expected. Once you have the data in the memory, the writing to disk should not be the issue. This is buffered by the OS.

So, did more tests, this time with 12 links. Packet loss now happens nearly immediately, seen in the data and in the packet headers. The packet number jumps frequently on all links.

Ciao,
I am not sure what you mean by first event and second one … anyway I think it is time for me to debug your system a bit to see if I find something strange.
Today I am a bit stuck with other stuff … I’ll contact you later or tomorrow to see how to run a few things on your machines.

With 12 links the situation is indeed at the limit and I would expect to see errors immediately when writing data to disk. Let me discuss the issue with other CRU/O2 colleagues.

Thx

Hey,

well, whenever we take data with the roc-bench-dma tool, the data is dumped to disk. I call that an event. You can call it a run or whatever. So I take data once, data looks good. I take data again, data looks not good.

I don’t write data to disk, I used the ramdisk for the test. I reserved 2 GB of buffer and took 16 MB (bytes option). So we should be fine.

Again, why would writing to disk be an issue? The data is in the memory (DMA) and from there stored to disk. So the host memory acts as a buffer and we don’t send that much data to overflow it. Disk writes are usually handled by the OS, so why should there be a loss when writing from memory to disk?

Cheers,
Torsten

Ok, I checked with Kostas.
The software writes data to disk every time a superpage is filled … that explains the lower performance when writing data to a file.

I think we have reached the limit of this software’s scope … I will see how to move to readout.

Can you send me instructions to take data with the DDG? Then I can check on our side already

OK, after discussing with another colleague:

  1. You will need the new software to check the data online … it has been tagged, and the RPM should be available to install soon.
  2. During data taking you can check whether the CRU drops packets … that would explain why you see full DMA pages missing in your stream.

I have done many tests … and indeed, if I check the data online or store it on disk, I can collect data without errors with only a few links (max 4) … but if I disable the error check and monitor the DROPPED counter, I can run with 9 links and data is never dropped in the CRU.

To read the DROPPED packet counter, read this register during data taking (the counter is cleared when RUN_ENABLE is 0):

roc-reg-read --id=PCIeID --channel=2 --address=0x60001c

We have a few statistics counters that give the number of packets received and dropped. The Python tool has a bug; we will fix it and release it soon.

I have been running with 9 links for 12 minutes and the dropped counter is 0:

 roc-reg-read --id=16:0.0 --channel=2 --address=0x60001c
infoLoggerD not available, falling back to stdout logging
0x0
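
To watch this counter during a run without reading it by hand, the command can be wrapped in a small script. This is only a sketch: it assumes the register value is printed as a line starting with `0x` (as in the output above), that log lines such as the infoLoggerD message can be skipped, and the helper names `parse_reg_value` and `read_dropped` are made up for illustration.

```python
import subprocess

def parse_reg_value(output: str) -> int:
    """Return the register value from roc-reg-read output, skipping log lines."""
    for line in reversed(output.strip().splitlines()):
        line = line.strip()
        if line.lower().startswith("0x"):
            return int(line, 16)
    raise ValueError("no hexadecimal value found in output")

def read_dropped(pcie_id: str) -> int:
    """Run roc-reg-read with the flags quoted above (needs the tool installed)."""
    cmd = ["roc-reg-read", f"--id={pcie_id}", "--channel=2", "--address=0x60001c"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_reg_value(result.stdout)

# Parsing the sample output shown above; no hardware is needed for this part.
sample = "infoLoggerD not available, falling back to stdout logging\n0x0\n"
print(parse_reg_value(sample))
```

Calling `read_dropped("16:0.0")` in a loop during data taking would show whether the counter ever leaves zero.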

Hi Pippo,

Is there a way to test with the software we currently have installed? I want to repeat exactly the same test first and write data to disk. The only difference to testing with FECs would be that the data is sourced from the DDG instead of from the FECs.
So, how do I set the data path to DDG and control it?

Thanks,
T.

Yes …

Activate the DDG and read it out

  • ENABLE gbt RX mode

Edit your tpc-readout script:

cru.gbt.txmux(0)
cru.gbt.txmode("gbt")
cru.gbt.rxmode("wb") <= change this to "gbt"

Execute it.

  • Activate LOOPBACK
    Edit standalone-startup.py and add the following line:
print ("Checking link status")
cru.gbt.health()

cru.gbt.loopback(1) <== ADD this command

Execute it.

All the links should be up:

python linkstat.py -i 2:0.0 -v
  • Enable DDG in all the CRU GBT TX links (in COMMON)
python gbt-mux.py -i 2:0.0 -l all -m ddg
  • START DDG (in DDG/)
python ddg.py -i 2:0.0 -c continuous -u

From now on, the readout works the same way as for the FEC.

Let me know.
You should see a counter all over the DMA pages.

Cool, will check it out and report back.

T.

Hey,

so I did run a check with the DDG and I see the same behavior: packets are lost. I can see it in the packet counter in the RDH and also in the raw data. Below is the output from both.

GBT Frames

00000008 00073408 00004409 : 00000000.0000afe5.0fa8afe5.0fa8afe5 : 001e 0007 000e 0005 | 0013 001a 0013 000a | 000f 0008 000d 0008 | 000e 0007 000e 0005 | 0000 0000 0000 0000 | 
00000008 00073424 00004410 : 00000000.0000afe6.0fa8afe6.0fa8afe6 : 001e 0007 000f 0004 | 0013 001a 001b 0002 | 000f 0008 000d 0008 | 000e 0007 000f 0004 | 0000 0000 0000 0000 | 
00000009 00073792 00004411 : 00000000.0000c6df.0fa8c6df.0fa8c6df : 001b 000f 0005 0003 | 001b 001a 000b 001a | 000f 0009 000c 0008 | 000b 000f 0005 0003 | 0000 0000 0000 0000 | 
00000009 00073808 00004412 : 00000000.0000c6e0.0fa8c6e0.0fa8c6e0 : 001a 000e 0006 0000 | 0013 0012 0013 0002 | 000f 0009 000c 0008 | 000a 000e 0006 0000 | 0000 0000 0000 0000 | 

Packet RDH

Packet 00008 - Link 0
Header 00 : 1ea04003
Header 01 : 000012ec
Header 02 : 1ee02000
Header 03 : 00000000
Header 04 : a3f86ae6
Header 05 : 00000000
Header 06 : 00112244
Header 07 : 00112233
Header 08 : 00000000
Header 09 : 00000003
Header 10 : 00112255
Header 11 : 00112233
Header 12 : 00000000
Header 13 : 00000800 <- Packet counter = 8
Header 14 : 00112266
Header 15 : 00112233

Packet 00009 - Link 0
Header 00 : 1ea04003
Header 01 : 000012ec
Header 02 : 1ee02000
Header 03 : 00000000
Header 04 : a3f86ae6
Header 05 : 00000000
Header 06 : 00112244
Header 07 : 00112233
Header 08 : 00000000
Header 09 : 00000003
Header 10 : 00112255
Header 11 : 00112233
Header 12 : 00000000
Header 13 : 00001500 <- Packet counter should be 9 but is 0x15 instead. 
Header 14 : 00112266
Header 15 : 00112233

Having a quick look over the packet RDHs, there is actually a lot of packet loss.
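
The check on header word 13 can be automated. The sketch below assumes the packet counter sits in bits 8 to 15 of that word, which is inferred only from the two dumps above (`00000800` is counter 8, `00001500` is counter 0x15), and that it increments by one per packet on a link and wraps at 8 bits. Both are assumptions, not taken from the RDH specification, and the function names are illustrative.

```python
def packet_counter(header_word_13: int) -> int:
    # Assumption: counter sits in bits 8..15, as the two dumps above suggest.
    return (header_word_13 >> 8) & 0xFF

def find_jumps(words):
    """Return (index, expected, found) for every non-consecutive counter."""
    jumps = []
    counters = [packet_counter(w) for w in words]
    for i in range(1, len(counters)):
        expected = (counters[i - 1] + 1) & 0xFF  # assumes 8-bit wrap-around
        if counters[i] != expected:
            jumps.append((i, expected, counters[i]))
    return jumps

# Header 13 of packets 8 and 9 on link 0, copied from the dumps above.
words = [0x00000800, 0x00001500]
print(find_jumps(words))  # [(1, 9, 21)]
```

Feeding it the header-13 words of all packets of one link gives a list of every place where packets went missing.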

Was running the test with RamDisk and 8 links, byte=16Mi and buffer=1Gi

Cheers,
Torsten

Ciao,
yes, I was expecting that. 8 links and dumping data to disk won’t work.
I have just finished discussing with the other O2 colleagues … we should move to readout.exe.
I will write an email to you and Johannes to see how to sync the operation next week.
Readout should dump data to disk in a more efficient way.
Would that work for you?

Thx

Hi Pippo,

sure, we can move to the new readout. I still would like to understand why dumping to disk won’t work. You reserve a DMA buffer, right? That is located in the host memory. From there you dump the data to disk. The buffer is 1 GB and the actual data only 16 M. So plenty of space in the host memory. Once it is there, it should not matter how long the disk will take to store it. There is no more data coming in, so there is no overflow or overwriting in the memory.

Anyway, let’s try the new readout.

Ciao,
roc-bench-dma doesn’t work like that.
The current version dumps every single SUPERPAGE to disk as soon as it is filled … that defeats the purpose of the buffer in memory.

The data is not kept in memory and written at the end. This is why writing to disk is very inefficient.
Since readout is ready, we decided to use roc-bench-dma for other purposes and to move on to readout, which actually does what you described.
Data is stored in memory while a second thread dumps it to disk … so if there is enough buffer in RAM, writing to disk will not slow it down.
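
The difference between the two approaches can be illustrated with a toy producer/consumer. This is not the readout code, only a minimal sketch of the pattern described here: the acquisition side puts pages into a bounded in-memory queue and never waits for the disk, while a second thread drains the queue to the output. All names (`writer`, `acquire`, `max_buffered`) are hypothetical.

```python
import io
import queue
import threading

def writer(pages: queue.Queue, out) -> None:
    """Drain buffered pages to the output until a None sentinel arrives."""
    while True:
        page = pages.get()
        if page is None:
            break
        out.write(page)

def acquire(n_pages: int, page_size: int, out, max_buffered: int) -> int:
    """Push n_pages into an in-memory queue; return how many had to be dropped."""
    pages = queue.Queue(maxsize=max_buffered)
    thread = threading.Thread(target=writer, args=(pages, out))
    thread.start()
    dropped = 0
    for i in range(n_pages):
        page = bytes([i % 256]) * page_size  # stand-in for a DMA page
        try:
            pages.put_nowait(page)           # never block the acquisition side
        except queue.Full:
            dropped += 1                     # buffer exhausted: page is lost
    pages.put(None)                          # sentinel; waits for a free slot
    thread.join()
    return dropped

out = io.BytesIO()  # stand-in for the output file
dropped = acquire(n_pages=30, page_size=8192, out=out, max_buffered=64)
print(dropped, len(out.getvalue()))
```

As long as the queue can hold the whole acquisition, nothing is dropped regardless of how slow the writer is. Writing each page synchronously in the acquisition loop instead would tie the acquisition rate to the disk speed.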

Hi Pippo,

thanks for the explanation. That actually makes a lot of sense (the explanation, not the implementation). For the next time, it would be nice to declare this type of tool immediately unsuitable for data taking. Because it is, the way it is implemented. A single TPC link will send 500 MB/s and this already brings the tool over its limits. Instead, it is advertised in the CRU Test GIT as the readout tool one should use. Could have saved the two of us a lot of time and debugging effort.

So, let’s see how the new tool performs. Is there an RPM for it?

Cheers,
Torsten

Well, I apologize for that, but it took me time to figure it out.
The previous developer left the group quite some time ago, and the new one has just joined the collaboration and is still getting used to the code and to past decisions (which were taken when there was not much to test).
So far it was working fine for the different test systems, and on my machine I can store data for a couple of links without errors, but indeed we soon reached its limits when adding more links.

I have contacted Johannes; readout in the latest available RPM should be OK.
I will do tests to see how much data we can dump to disk before dropping data.

Cheers

No worries, no hard feelings. At least we found the issue and understand it, so we can happily move to the new readout.

Concerning dropping data, we’re not planning to take huge amounts of data for the noise testing. If we get 1 MB/Link that is already quite a bit and plenty to check the noise. So if you buffer the data in the host memory first, there should be no issues at all.

Cheers,
T.

Ciao,
so I did some tests with readout.exe.

On the current test machine (no tuning) I can read data from 12 links without errors up to 30 DMA pages per link.

That means
30 * 8KB = 240 KB per link
240 KB * 12 = 2880 KB = ~2.8 MB
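
The same arithmetic as a few lines of Python, with a comparison against the 1 MB per link mentioned earlier in the thread for the noise tests. Taking 1 MB as 1024 KB is an assumption made here for the comparison; the variable names are illustrative.

```python
pages_per_link = 30
page_size_kb = 8
links = 12

per_link_kb = pages_per_link * page_size_kb   # 240 KB per link
total_kb = per_link_kb * links                # 2880 KB over 12 links
total_mb = total_kb / 1024                    # about 2.8 MB

requested_per_link_kb = 1024                  # assumption: 1 MB = 1024 KB
fraction_of_request = per_link_kb / requested_per_link_kb

print(per_link_kb, total_kb, round(total_mb, 2))
print(f"{fraction_of_request:.0%} of the 1 MB/link mentioned for noise tests")
```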

Would that be enough for the time being to do some more complete tests on your electronics?
We are looking into how to configure the machine to get better memory usage.

Cheers