Corrupted data during readout

Hello CRU team,

During data taking we spotted another potential bug. While data taking seemed to work last week, this week we see artifacts in the data. To exclude any mixed-up state in the CRU, we rebooted the host and started fresh. Still, the data corruption remains.

What we did so far: the data stream is decoded into the ADC values of the different SAMPA channels. One can then look at the values over time (the baseline), which has to stay within a certain range for each SAMPA channel, so it can be used as a “digital” fingerprint. What we see is a sudden jump in these values.

Below is the time sequence (each line = one time-bin) for the 16 ADC channels of one SAMPA stream. At time-bin 4941 there is a sudden jump, meaning data is lost somewhere.

4938 :   52   61   70   76   83   73   82   83   73   68   79   69   78   86   76   98
4939 :   55   63   71   77   84   74   84   84   72   69   80   65   79   85   76   98
4940 :   55   60   71   76   84   74   82   85   74   69   80   67   80   85   78   98
4941 :   54   61   70   99   55   61   70   76   83   73   84   84   72   69   79   67  <-- Jump in baseline
4942 :   79   84   76   98   52   61   72   77   83   73   83   85   74   69   80   70
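
For reference, the check itself is nothing fancy. A minimal sketch of the idea (not my actual decoder; it just learns a per-channel baseline from the first time-bins and flags rows where several channels leave their band at once, with placeholder thresholds):

from statistics import median

# Sketch of the baseline "fingerprint" check: each of the 16 ADC channels
# should stay within a small band around its own baseline, so a row where
# many channels are off at once indicates a shift in the data, not noise.

def find_baseline_jumps(rows, n_ref=100, tolerance=5, min_bad=4):
    """rows: list of 16-element ADC value lists, one per time-bin."""
    ref = rows[:n_ref]
    baseline = [median(r[ch] for r in ref) for ch in range(len(rows[0]))]
    flagged = []
    for t, row in enumerate(rows):
        off = sum(abs(adc - b) > tolerance for adc, b in zip(row, baseline))
        if off >= min_bad:
            flagged.append(t)
    return flagged

# Applied to the dump above, the rows for time-bins 4941 and 4942 would be flagged.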

I traced this back to the GBT frames in the data. The first column is the packet number, the second column the file position (in bytes), the third column the GBT frame number (counting from the SYNC). Then comes the GBT frame itself and, in brackets, the decoded ADC values around the jump in the data.

00000080 00663200 00039687 : 00000082.7a0a20fa.0807a0a0.052200d8 : 0003 0002 0010 0002 
00000080 00663216 00039688 : 000010a2.f81288da.00af21a2.a6182a78 : 0015 0002 000e 0002 
00000080 00663232 00039689 : 0000f088.f0f80870.82272fa0.07a202e5 : 0002 0003 0016 0001 (98  54)
00000080 00663248 00039690 : 0000f80a.70f88050.08050f2a.a5288a3c : 001d 0001 0006 0002 (61  70)
00000081 00663616 00039691 : 0000f088.daf8007a.0aaf2fa0.078202ef : 0003 0003 0017 0001 (99  55)
00000081 00663632 00039692 : 0000f80a.78f808da.2a263f2a.a5288a3c : 001d 0001 0006 0002 
00000081 00663648 00039693 : 0000f0a8.5af2a0d2.88a3ef28.87828872 : 000c 0002 0013 0002 
00000081 00663664 00039694 : 0000e082.72e88836.08ad8e28.8eb28258 : 0009 0002 0014 0002
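
For completeness, this is how such dump lines can be read back in (a small sketch, assuming the lines look exactly as printed above: decimal packet number, byte position and frame number, then the four 32-bit words of the frame separated by dots). It also checks that consecutive frames within a packet are 16 bytes apart, which they are in the dump above.

# Parse the frame-dump lines above and verify the 16-byte spacing of
# consecutive GBT frames within a packet.

def parse_dump_line(line):
    head, frame = line.split(" : ")[:2]              # third field = decoded ADC values
    packet, byte_pos, frame_no = (int(x) for x in head.split())
    words = [int(w, 16) for w in frame.split(".")]   # the four 32-bit words
    return packet, byte_pos, frame_no, words

def check_frame_spacing(lines):
    prev_packet, prev_pos = None, None
    for line in lines:
        packet, pos, frame_no, _ = parse_dump_line(line)
        if packet == prev_packet and pos - prev_pos != 16:
            print("gap of %d bytes before frame %d" % (pos - prev_pos, frame_no))
        prev_packet, prev_pos = packet, pos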

I cross-checked this with the raw 32-bit data dumped to file by the CRU. Below is the end of packet number 80. The position here is simply the line number in the hex dump, so each line corresponds to one 32-bit word, and counting starts at 1.

Last GBT frame starts with 0xA5288A3C (Position 165812 *4 = 663248). This is consistent with the frame dump above.

End of packet 80
 165809 07A202E5
 165810 82272FA0
 165811 F0F80870
 165812 0000F088
 165813 A5288A3C
 165814 08050F2A
 165815 70F88050
 165816 0000F80A
 165817 00000000
 165818 00000000
 165819 00000000
 165820 00000000
 165821 00000000

And this is the start of packet number 81. The first GBT frame starts with 0x078202EF (position 165904 * 4 = 663616).

Start of packet 81

 165885 00000000
 165886 00000000
 165887 00000000
 165888 00000000
 165889 1EA04003
 165890 000012EC
 165891 1EE02000
 165892 00000003
 165893 A3F86AE6
 165894 00000000
 165895 00112244
 165896 00112233
 165897 00000000
 165898 00000003
 165899 00112255
 165900 00112233
 165901 00000000
 165902 0000EC00
 165903 00112266
 165904 00112233
 165905 078202EF
 165906 0AAF2FA0
 165907 DAF8007A
 165908 0000F088
 165909 A5288A3C
 165910 2A263F2A

All those values match. This is also consistent with the dump of the extracted frames.
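
This kind of cross-check is easy to script; a minimal sketch (raw_dump.bin is just a placeholder name for the file the raw data was dumped to, and the '<' in the struct format assumes the 32-bit words are stored little-endian on disk, adjust if not):

import struct

# Read the four raw 32-bit words of one GBT frame straight from the binary dump.

def read_frame_words(path, byte_offset, n_words=4):
    with open(path, "rb") as f:
        f.seek(byte_offset)
        return struct.unpack("<%dI" % n_words, f.read(4 * n_words))

# File position 663248 is GBT frame 00039690 in the frame dump. The four words
# appear in the file in the reverse order of the frame dump, so reversing them
# reproduces the printed frame:
words = read_frame_words("raw_dump.bin", 663248)
print(".".join("%08x" % w for w in reversed(words)))
# -> 0000f80a.70f88050.08050f2a.a5288a3c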

So it seems that data is lost somewhere. This happens randomly, for each link. I tried it with single links (also different ones); sometimes it works, sometimes it does not. With several links, same behavior. The position where it happens is also kind of erratic. In this case it happened at the boundary of packets 80/81 (single-link readout), but I have seen it at many different positions.

Cheers,
Torsten

Ok,
we will have a look.
Just to be sure we are testing with the same configuration:
you are using bench-dma with 8 links enabled in the data taking, right?

So I can use the same settings you sent me yesterday in another thread.
Right?

Thx

Hi Pippo,

well, as I wrote, I used different settings for the links. What I did was the following:

  • Readout with 13 links, but only 12 links showed up in the data (might be a wrong setting on my side, I only used 12 MB for buffer and bytes). Data looked corrupted. Same settings as I sent you in the other thread.
  • Reduced the number of links to 8 in the readout for comparison with data from last week. Data shows corruption.
  • Rebooted the host, used the same settings and redid the readout with 8 links. Data shows corruption.
  • Tested single-link readout with links 1, 2 and 3. Data looks “better”, but there is still corruption sometimes.
  • Changed the bytes value in roc-bench-dma to 28 MB and increased the buffer size to 48 MB.

So I tested several combinations of settings and links; all show the above-mentioned behavior. I can redo some of the tests and document the exact settings.

Cheers,
Torsten

Whoa, many things.
So let's drill down.

  1. more than 12 links … that means 2 datapath wrappers (every datapath wrapper has 12 links) … that means 2 bench-dma instances as long as we don't use readout (2 different endpoints)
  2. I am testing the stability of readout using 8 links … I will increase to 12

I’ll keep you posted.

Hi Pippo & CRU team,

I am doing more tests, putting all the information in a log file on Google.

So far I tested 2 links (separately) with 5 readout cycles each: 3 cycles had good data, 2 showed no SYNC in the data, and 5 showed corrupted (scrambled) data. The scrambling happens multiple times.

Cheers,
Torsten

Ciao, ok

  1. the SYNC issue has been fixed in develop; it will be available in the next tag (quite soon)
  2. for the corrupted data, could you try the following: change 0x200 to 0x1fc in RAWMAXLEN in your script
cru.dwrapper.setGbtDatapathLink(wrap=0,link=l,is_GBT_pkt=0,RAWBYLEN=1,RAWBYID=0,RAWMAXLEN=0x1fc)

I am doing some tests … I will come back to you once I am done.

Hi Pippo,

actually, it was never set to 0x200 by us; it is still at the default of 0x1EA.

cru.dwrapper.setGbtDatapathLink(wrap=0,link=l,is_GBT_pkt=0,RAWBYLEN=1,RAWBYID=0,RAWMAXLEN=0x1EA)

But I can try with 0x1FC to see if it changes anything.

Cheers,
Torsten

Ah, ok … no, in this case it should be ok.
We are still checking it.
Thx

I can try to modify my decoder to automatically check for these baseline jumps and calculate the positions where they happen. What I have seen so far is that it works fine for about 4000 time-bins (about 80 packets), then it jumps. Sometimes nothing further happens and it stays that way, sometimes it jumps again after only a few frames. It looks to me like some buffer overflow somewhere in the DMA channel.

Cheers,
Torsten

No need for that.
We are checking the stream with a counter, so it is easier to understand what is going on.
It looks more like a software issue than a firmware one, but we are still looking into it.

Let’s wait a bit more before starting to change too many things.
Cheers

Ciao,
could you try to run your tests again, changing the following options in bench-dma?

--buffer-size=2Gi --superpage-size=1Mi

and see if it affects the data?

The rest should be kept the same.

I am doing tests to verify the quality of the data when running in continuous mode and including more than one link.
We now have a new bench-dma in line with the firmware, and it can check the data without writing a file to disk.
I will leave it running for the night and let you know the results.

Cheers

Yo!

Just did a test; I executed tpc-set.sh and the patgen beforehand to have a clean start. The error is still there. Attached are the command line and the output of roc-bench-dma.

[talt@flp2 ~]$ roc-bench-dma --verbose   --id=05:00.0   --buffer-size=2Gi --superpage-size=1Mi --links=3  --no-errorche --bytes=16Mi  --to-file-bin=/tmp/data/raw/debug-17-10-18/rdo_3_2G.bin --loopback=NONE --generator=0
infoLoggerD not available, falling back to stdout logging
2018-10-17 19:38:38.261164     DMA channel: 0
2018-10-17 19:38:38.261630     IOMMU enabled
2018-10-17 19:38:38.297338     Freeing PDA buffer '/sys/bus/pci/drivers/uio_pci_dma/0000:05:00.0/dma/1000000000/map'
2018-10-17 19:38:38.304752     Using buffer file path: /var/lib/hugetlbfs/global/pagesize-1GB/roc-bench-dma_id=05:00.0_chan=0_pages
2018-10-17 19:38:38.304902     Buffer size: 2147483648
2018-10-17 19:38:38.304924     Superpage size: 1048576
2018-10-17 19:38:38.304940     Superpages in buffer: 2048
2018-10-17 19:38:38.304956     Page size: 8192
2018-10-17 19:38:38.304971     Page limit: 2048
2018-10-17 19:38:38.304986     Pages per superpage: 128
2018-10-17 19:38:38.305006     Generator data size: <internal default>
2018-10-17 19:38:39.643472     [pci=05:00.0 serial=0 channel=0] Acquiring DMA channel lock
2018-10-17 19:38:39.643647     [pci=05:00.0 serial=0 channel=0] Acquired DMA channel lock
2018-10-17 19:38:40.441550     [pci=05:00.0 serial=0 channel=0] Initializing with DMA buffer from memory region
2018-10-17 19:38:42.474008     [pci=05:00.0 serial=0 channel=0] Scatter-gather list size: 1
2018-10-17 19:38:42.483915  !  Warning - [pci=05:00.0 serial=0 channel=0] Buffer is NOT hugepage-backed, but IOMMU is enabled
2018-10-17 19:38:42.484558     [pci=05:00.0 serial=0 channel=0] Enabling link(s): 3 
2018-10-17 19:38:42.484623     Card type: CRU
2018-10-17 19:38:42.484652     Card PCI address: 05:00.0
2018-10-17 19:38:42.484744     Card NUMA node: 0
2018-10-17 19:38:42.484934     Card firmware info: 0-0-5168b6b8
2018-10-17 19:38:42.484950     Starting benchmark
2018-10-17 19:38:42.484974     [pci=05:00.0 serial=0 channel=0] Starting DMA

  Time       Pushed        Read          Errors        °C  
  00:00:10   1920          1920          n/a           -263.6

2018-10-17 19:38:52.866214     [pci=05:00.0 serial=0 channel=0] Stopping DMA
2018-10-17 19:38:52.866817     [pci=05:00.0 serial=0 channel=0] Moved 128 remaining superpages to ready queue
2018-10-17 19:38:52.949562     Popped 128 remaining superpages

  Seconds     10.1705   
  Superpages  2176      
  Bytes       1.78258e+07
  GB          0.0178258 
  GB/s        0.00175269
  Gb/s        0.0140215 
  GiB/s       0.00163232
  Errors      n/a       

2018-10-17 19:38:52.949779     Benchmark complete
2018-10-17 19:38:53.433967     [pci=05:00.0 serial=0 channel=0] Releasing DMA channel lock

The file size is 144 MB. Apart from the jumps in the data, the file size is a mystery to me: I specify a 2 GB buffer and 16 MB of data to be read out, and I end up with 144 MB.

[talt@flp2 ~]$ ls -hla /tmp/data/raw/debug-17-10-18/rdo_3_2G.bin 
-rw-r--r-- 1 talt z2 144M Oct 17 19:38 /tmp/data/raw/debug-17-10-18/rdo_3_2G.bin

Ciao, we checked roc-bench-dma with Kostas.
There were some implementation errors that are going to be fixed in the next release of the software.
This should fix the wrong file size.

I am still puzzled about the error you are seeing.
I am running the following tests.
I have the DDG generating a continuous stream (a counter).
Data is looped back into the CRU input and different links are read out in parallel (6 in total).
The new roc-bench-dma checks that the data is ok (counter increasing).
For 16 MB I don't see errors most of the time.
When I do detect an error, it is always at the beginning of a superpage, so I doubt it is caused by data corruption in the DMA … otherwise it would happen at random positions in the data stream. I think it is more an issue in the memory handler.

If I reduce the number of links included in the run, the errors disappear. For example, with 4 links included I can run for several minutes without errors (I will run a long test).
I am really surprised you see errors with a single link …
I will test the new software some more and see whether it makes sense to upgrade your system, just to be sure it is not a software issue.
Even better would be to move to readout … I will check with Sylvain what steps are needed to install it on your system.

Cheers

Hey Pippo,

thanks for checking. I will also implement the automatic check and trace the jumps in the ADC values back to the raw 32-bit data on disk. I did this once manually to make sure there is no bug in my decoder, and I am pretty sure I can exclude that since I could trace it back. Maybe the exact positions where we lose data will help.

Cheers,
Torsten

Hi Pippo,

I think I know what is happening; I had a theory and did some more tests which point towards it. It needs a bit of explaining:

This is how our data looks when it is decoded. Below you see one SAMPA stream, which consists of 16 ADC channels (the columns). The rows are the time-bins (or sample numbers). Each channel has a dedicated value (the baseline) plus a bit of noise, so you get fluctuations around the baseline. Long story short, in each column you should see a specific value +/- 2.

5799 :   52   60   71   76   83   74   83   83   73   68   79   69   78   85   77  100
5800 :   54   61   69   77   85   73   82   85   74   69   79   67   80   85   75   98
5801 :   54   62   70   77   83   73   84   84   70   69   78   69   78   85   77   96
5802 :   54   61   70   78   84   74   82   85   74   69   80   69   80   85   76  100
5803 :   54   62   71   76   83   74   84   84   72   67   79   67   80   84   75   99
5804 :   54   60   71   76   85   74   82   84   73   68   79   68   79   84   77   99
5805 :   54   61   71   76   83   73   81   84   74   70   79   68   79   84   76   99

Now, each GBT frame contains 2 ADC values per stream. That means you need 8 frames to fill a complete row (16 ADC values). So if I lose 8 frames, or any N x 8 frames, I won't see it in the data: one (or N) complete rows are simply missing. If the number of lost frames is not a multiple of 8, all following rows are shifted by 2 columns per missing frame.

Now, my theory is that you lose complete packets. To test this, I changed the payload size and tried 4 different values (see the small sketch after the list for the arithmetic):

  • 0x1EA = 490 frames/packet. 490 frames/8 frames = 61.25 => Jumps in baseline values

  • 0x190 = 400 frames/packet. 400 frames/8 frames = 50 => No jumps seen, data looks good

  • 0x191 = 401 frames/packet. 401 frames/8 frames = 50.125 => ADC values jump by 2 positions

4287 :   53   59   71   76   83   75   83   84   72   69   79   67   79   85   76   99
4288 :   54   62   71   74   85   74   83   84   73   68   78   67   78   85   79   98
4289 :   53   61   70   76   83   73   84   84   72   67   79   68   79   84   76   85
4290 :   77   97   53   61   71   76   86   73   83   83   73   69   80   67   79   84
4291 :   76   98   54   61   72   75   85   73   81   84   73   68   79   69   79   85
4292 :   77   98   54   61   69   75   84   73   83   84   73   68   80   67   80   84
  • 0x192 = 402 frames/packet. 402 frames/8 frames = 50.25 => ADC values jump by 4 positions
5806 :   54   61   71   75   84   72   85   82   73   67   78   68   79   83   76   97
5807 :   53   64   70   76   83   73   83   84   74   68   80   67   80   84   75   97
5808 :   53   60   71   75   84   71   82   82   73   67   79   67   79   84   76   99
5809 :   54   83   76   98   54   62   71   77   84   72   82   83   74   67   79   69
5810 :   80   84   77   98   53   60   68   78   85   74   82   83   72   67   79   68
5811 :   77   85   77   99   53   62   72   76   84   73   82   84   71   69   78   68
5812 :   80   84   76   97   56   62   71   76   83   74   84   84   73   69   79   66
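
Here is that small sketch; it just encodes the counting from above (2 ADC values per stream per GBT frame, 8 frames per full 16-channel row) and the shift you would expect after losing one complete packet:

# Column shift expected in the decoded data when one complete packet is lost,
# given the payload size in GBT frames (2 ADC values per frame, 8 frames per row).

def shift_per_lost_packet(frames_per_packet, frames_per_row=8, adc_per_frame=2):
    return (frames_per_packet % frames_per_row) * adc_per_frame

for raw_max_len in (0x1EA, 0x190, 0x191, 0x192):
    print("0x%X = %d frames/packet -> shift of %d ADC positions"
          % (raw_max_len, raw_max_len, shift_per_lost_packet(raw_max_len)))

# 0x1EA = 490 -> 4, 0x190 = 400 -> 0 (loss invisible), 0x191 = 401 -> 2,
# 0x192 = 402 -> 4, which matches the jumps in the dumps above.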

So, these observations are consistent with the loss of complete packets. Now, why do you not see this with the counter? That depends on how your test procedure handles it. If the counter always starts with a fixed number (let's say 0), each DMA test run will write the exact same pattern into memory. And if the memory is not actively cleared, you will not notice any packet loss: if you lose a packet and this is not the first test run, the check will simply see the old data from the previous run, which is identical to what was supposed to be written. That's my theory :wink:

Cheers,
Torsten

Ok, I need to think about this a bit more.

The DDG stream is a continuous counter … it is not cleared at every run.

  1. I disable the CRU
  2. I start the DDG
  3. I start the DMA
  4. I enable the CRU and data flows in.

The error checking locks onto the first number in the first DMA page and starts checking from there.
The DMA data taking therefore always starts at a different number, as I don't pause/reset the DDG, I just DISABLE/ENABLE the CRU.

But indeed it requires a bit more thinking; I will have a deeper look tomorrow.

More info: I checked the packet headers of the corrupted data.

Packet 00085 - Link 3
Header 00 : 19104003
Header 01 : 000012ec
Header 02 : 19502000
Header 03 : 00000003
Header 04 : a3f86ae6
Header 05 : 00000000
Header 06 : 00112244
Header 07 : 00112233
Header 08 : 00000000
Header 09 : 00000003
Header 10 : 00112255
Header 11 : 00112233
Header 12 : 00000000
Header 13 : 00005500  <- seems to be the packet counter, starts with 0 and increases by 1
Header 14 : 00112266
Header 15 : 00112233

Packet 00086 - Link 3
Header 00 : 19104003
Header 01 : 000012ec
Header 02 : 19502000
Header 03 : 00000003
Header 04 : a3f86ae6
Header 05 : 00000000
Header 06 : 00112244
Header 07 : 00112233
Header 08 : 00000000
Header 09 : 00000003
Header 10 : 00112255
Header 11 : 00112233
Header 12 : 00000000
Header 13 : 00005d00 <- this should be 0x56, not 0x5D, no? If it really is the counter, that would mean 7 packets (0x56..0x5C) were lost in between.
Header 14 : 00112266
Header 15 : 00112233
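
Assuming header word 13 really is the packet counter and that it sits in bits [15:8] (both are my guesses from the two dumps above), counting the lost packets from consecutive headers is simple; a minimal sketch:

# Count packets missing between consecutive dumps of header word 13, assuming
# the packet counter sits in bits [15:8] and is 8 bits wide (both assumptions).

def missing_packets(header13_values):
    counters = [(w >> 8) & 0xFF for w in header13_values]
    lost = 0
    for prev, cur in zip(counters, counters[1:]):
        lost += (cur - prev - 1) % 256   # modulo in case the counter wraps
    return lost

print(missing_packets([0x00005500, 0x00005d00]))   # -> 7 packets lost between 85 and 86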

Yeah, go to bed! Maybe tomorrow or in the next days we can run your checker, or even just the counter, on our setup, to exclude any screw-ups due to different machines/setups.

Just a very simple question.
Did you mount the ramdisk?

No, we didn't. We write straight to disk, since we have a fast SSD in there.