
Multithreaded access to different channel and different memory area fails #516

Closed
X-Ryl669 opened this issue Apr 1, 2021 · 15 comments

Comments

@X-Ryl669

X-Ryl669 commented Apr 1, 2021

I'm trying to use XDMA.

Description of the issue:

  1. I have two threads (R and S).
  2. R has opened xdma0_c2h_0.
  3. S has opened xdma0_h2c_0.
  4. R does char buf[1024*1024]; pread(c2h_fd, buf, sizeof(buf), SOME_ADDR1); in a loop.
  5. S does char buf[20*1024]; pwrite(h2c_fd, buf, sizeof(buf), SOME_ADDR2); in a loop.
  6. Most transfers work, but some fail (timeout) with this in dmesg:
Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfe809400 flags=0x0020]
xdma_xfer_submit: xfer 0x00000000352a938a,1048832, s 0x1 timed out, ep 0x300100.
engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000389bf350) = 0x1fc10006 (id).
engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000c5bcecb2) = 0x00000001 (status).
engine_reg_dump: 0-C2H0-MM: ioread32(0x000000002fc32109) = 0x00f83e1f (control)
engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000fc4073fa) = 0xffeb0000 (first_desc_lo)
engine_reg_dump: 0-C2H0-MM: ioread32(0x000000007b883500) = 0x00000000 (first_desc_hi)
engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000190cefd2) = 0x0000003f (first_desc_adjacent).
engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000017def8cc) = 0x00000004 (completed_desc_count).
engine_reg_dump: 0-C2H0-MM: ioread32(0x000000005cc4976c) = 0x00f83e1e (interrupt_enable_mask)
engine_status_dump: SG engine 0-C2H0-MM status: 0x00000001: BUSY
transfer_abort: abort transfer 0x00000000352a938a, desc 257, engine desc queued 0.

Remarks

If I add a mutex around these calls to prevent both threads from accessing the XDMA driver at the same time, the failure does not happen (but obviously with much lower performance).

I'm not using the same memory buffer, not the same file descriptor, and not the same destination/source address in the FPGA here.

@deeppat
Contributor

deeppat commented Apr 1, 2021

Hi @X-Ryl669, Thanks for reporting this. We've asked a Xilinx subject matter expert to reply to this issue and they will either respond here or we will add their response here as soon as we hear back from them.

Thanks,

Deep

@X-Ryl669
Author

X-Ryl669 commented Apr 1, 2021

Thank you

@deeppat
Contributor

deeppat commented Apr 7, 2021

Hi @X-Ryl669, We are still waiting to hear back. I apologize for the delay and I hope to update you on this soon.

-Deep

@czfpga
Contributor

czfpga commented Apr 7, 2021

Hi @X-Ryl669,

Xilinx got back to us and asked for more information for analysis. Could you provide the following?

Your use case of two threads, each on a different DMA engine, looks correct. Each engine is independent of the others. Adding the mutex just slows down the traffic, allowing only one engine to be active at any given time, so performance would be worse. From the log, the reason for the timeout is not obvious; we need additional data. Please share the full dmesg log, including the events from the start. A few questions to understand your design use mode:

  1. Is this test done using a custom design or the AWS shell?
  2. Which version of the XDMA driver is used?
  3. Is the driver set in "interrupt mode" or "polling mode"?

Thanks

-Chen

@X-Ryl669
Author

Hi,

The full log:
[ 5073.773255] xdma:xdma_mod_init: desc_blen_max: 0xfffffff/268435455, timeout: h2c 10 c2h 10 sec.
[ 5073.773326] xdma:xdma_device_open: xdma device 0000:0e:00.0, 0x000000006cecc6ad.
[ 5073.773372] xdma:map_single_bar: BAR0 at 0xf2500000 mapped at 0x000000007f8e02ba, length=524288(/524288)
[ 5073.773381] xdma:map_single_bar: BAR1 at 0xf2580000 mapped at 0x00000000059b1b97, length=65536(/65536)
[ 5073.773384] xdma:map_bars: config bar 1, pos 1.
[ 5073.773385] xdma:identify_bars: 2 BARs: config 1, user 0, bypass -1.
[ 5073.773558] xdma:pci_keep_intx_enabled: 0000:0e:00.0: clear INTX_DISABLE, 0x407 -> 0x7.
[ 5073.773581] xdma:probe_one: 0000:0e:00.0 xdma0, pdev 0x000000006cecc6ad, xdev 0x00000000827e5e77, 0x00000000e358ef29, usr 16, ch 4,4.
[ 5073.774210] xdma:cdev_xvc_init: xcdev 0x00000000bb29dcfd, bar 0, offset 0x40000.
[ 5085.059221] xdma:xdma_xfer_submit: xfer 0x00000000f87a0510,1048832, s 0x1 timed out, ep 0x300100.
[ 5085.059225] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000046fd6c54) = 0x1fc10006 (id).
[ 5085.059228] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000069e996ec) = 0x00000001 (status).
[ 5085.059231] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f164dc34) = 0x00f83e1f (control)
[ 5085.059233] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000084869551) = 0xffab8960 (first_desc_lo)
[ 5085.059236] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000668e7d47) = 0x00000000 (first_desc_hi)
[ 5085.059239] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000001f3b3591) = 0x00000034 (first_desc_adjacent).
[ 5085.059242] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f28285f9) = 0x00000005 (completed_desc_count).
[ 5085.059245] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000fa0f3b77) = 0x00f83e1e (interrupt_enable_mask)
[ 5085.059248] xdma:engine_status_dump: SG engine 0-C2H0-MM status: 0x00000001: BUSY
[ 5085.059250] xdma:transfer_abort: abort transfer 0x00000000f87a0510, desc 257, engine desc queued 0.
[ 5085.059288] amd_iommu_report_page_fault: 115 callbacks suppressed
[ 5085.059291] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd408000 flags=0x0020]
[ 5085.059300] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd408200 flags=0x0020]
[ 5085.059307] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd408400 flags=0x0020]
[ 5085.059313] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd408600 flags=0x0020]
[ 5085.059320] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd408800 flags=0x0020]
[ 5085.059327] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd408a00 flags=0x0020]
[ 5085.059334] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd408c00 flags=0x0020]
[ 5085.059341] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd408e00 flags=0x0020]
[ 5085.059347] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd409000 flags=0x0020]
[ 5085.059354] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfd409200 flags=0x0020]
[ 5085.059362] AMD-Vi: Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfd409400 flags=0x0020]
[ 5085.059369] AMD-Vi: Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfd409600 flags=0x0020]
[ 5085.059375] AMD-Vi: Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfd409800 flags=0x0020]
[ 5085.059382] AMD-Vi: Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfd409a00 flags=0x0020]
[ 5085.059388] AMD-Vi: Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfd409c00 flags=0x0020]
[ 5085.059395] AMD-Vi: Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfd409e00 flags=0x0020]
[ 5085.059402] AMD-Vi: Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfd40a000 flags=0x0020]
[ 5085.059408] AMD-Vi: Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfd40a200 flags=0x0020]
[ 5085.059415] AMD-Vi: Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfd40a400 flags=0x0020]
[ 5085.059421] AMD-Vi: Event logged [IO_PAGE_FAULT device=0e:00.0 domain=0x001a address=0xfd40a600 flags=0x0020]
[ 5085.315219] xdma:xdma_xfer_submit: xfer 0x000000006305024f,131328, s 0x1 timed out, ep 0x141100.
[ 5085.315224] xdma:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000059b1b97) = 0x1fc00006 (id).
[ 5085.315227] xdma:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000c1190d09) = 0x00000001 (status).
[ 5085.315230] xdma:engine_reg_dump: 0-H2C0-MM: ioread32(0x000000008f63c48d) = 0x00f83e1f (control)
[ 5085.315233] xdma:engine_reg_dump: 0-H2C0-MM: ioread32(0x000000001489041c) = 0xffaf9ac0 (first_desc_lo)
[ 5085.315236] xdma:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000f0d75f27) = 0x00000000 (first_desc_hi)
[ 5085.315239] xdma:engine_reg_dump: 0-H2C0-MM: ioread32(0x000000003e1257a9) = 0x00000020 (first_desc_adjacent).
[ 5085.315243] xdma:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000aedf6ab0) = 0x00000005 (completed_desc_count).
[ 5085.315246] xdma:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000105b3249) = 0x00f83e1e (interrupt_enable_mask)
[ 5085.315249] xdma:engine_status_dump: SG engine 0-H2C0-MM status: 0x00000001: BUSY
[ 5085.315250] xdma:transfer_abort: abort transfer 0x000000006305024f, desc 33, engine desc queued 0.
[ 5085.623249] xdma:xdma_xfer_submit: xfer 0x00000000f87a0510,1048832, s 0x1 timed out, ep 0x300100.
[ 5085.623253] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000046fd6c54) = 0x1fc10006 (id).
[ 5085.623255] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000069e996ec) = 0x00000001 (status).
[ 5085.623258] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f164dc34) = 0x00f83e1f (control)
[ 5085.623261] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000084869551) = 0xffabc060 (first_desc_lo)
[ 5085.623264] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000668e7d47) = 0x00000000 (first_desc_hi)
[ 5085.623267] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000001f3b3591) = 0x0000003c (first_desc_adjacent).
[ 5085.623270] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f28285f9) = 0x00000004 (completed_desc_count).
[ 5085.623273] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000fa0f3b77) = 0x00f83e1e (interrupt_enable_mask)
[ 5085.623276] xdma:engine_status_dump: SG engine 0-C2H0-MM status: 0x00000001: BUSY
[ 5085.623277] xdma:transfer_abort: abort transfer 0x00000000f87a0510, desc 257, engine desc queued 0.
[ 5085.699242] xdma:xdma_xfer_submit: xfer 0x00000000f87a0510,1048832, s 0x1 timed out, ep 0x300100.
[ 5085.699246] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000046fd6c54) = 0x1fc10006 (id).
[ 5085.699249] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000069e996ec) = 0x00000001 (status).
[ 5085.699251] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f164dc34) = 0x00f83e1f (control)
[ 5085.699254] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000084869551) = 0xffab47e0 (first_desc_lo)
[ 5085.699257] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000668e7d47) = 0x00000000 (first_desc_hi)
[ 5085.699260] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000001f3b3591) = 0x00000000 (first_desc_adjacent).
[ 5085.699263] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f28285f9) = 0x00000004 (completed_desc_count).
[ 5085.699266] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000fa0f3b77) = 0x00f83e1e (interrupt_enable_mask)
[ 5085.699269] xdma:engine_status_dump: SG engine 0-C2H0-MM status: 0x00000001: BUSY
[ 5085.699270] xdma:transfer_abort: abort transfer 0x00000000f87a0510, desc 257, engine desc queued 0.
[ 5086.555240] xdma:xdma_xfer_submit: xfer 0x00000000f87a0510,893184, s 0x1 timed out, ep 0x300100.
[ 5086.555243] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000046fd6c54) = 0x1fc10006 (id).
[ 5086.555246] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000069e996ec) = 0x00000001 (status).
[ 5086.555249] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f164dc34) = 0x00f83e1f (control)
[ 5086.555253] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000084869551) = 0xffab0000 (first_desc_lo)
[ 5086.555256] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000668e7d47) = 0x00000000 (first_desc_hi)
[ 5086.555259] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000001f3b3591) = 0x0000003f (first_desc_adjacent).
[ 5086.555261] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f28285f9) = 0x00000007 (completed_desc_count).
[ 5086.555264] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000fa0f3b77) = 0x00f83e1e (interrupt_enable_mask)
[ 5086.555267] xdma:engine_status_dump: SG engine 0-C2H0-MM status: 0x00000001: BUSY
[ 5086.555268] xdma:transfer_abort: abort transfer 0x00000000f87a0510, desc 219, engine desc queued 0.
[ 5293.684296] xdma:xdma_xfer_submit: xfer 0x00000000f87a0510,1048832, s 0x1 timed out, ep 0x300100.
[ 5293.684300] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000046fd6c54) = 0x1fc10006 (id).
[ 5293.684304] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000069e996ec) = 0x00000001 (status).
[ 5293.684307] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f164dc34) = 0x00f83e1f (control)
[ 5293.684310] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000084869551) = 0xffab5180 (first_desc_lo)
[ 5293.684312] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000668e7d47) = 0x00000000 (first_desc_hi)
[ 5293.684315] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000001f3b3591) = 0x00000033 (first_desc_adjacent).
[ 5293.684318] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f28285f9) = 0x00000005 (completed_desc_count).
[ 5293.684321] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000fa0f3b77) = 0x00f83e1e (interrupt_enable_mask)
[ 5293.684324] xdma:engine_status_dump: SG engine 0-C2H0-MM status: 0x00000001: BUSY
[ 5293.684325] xdma:transfer_abort: abort transfer 0x00000000f87a0510, desc 257, engine desc queued 0.
[ 5293.684363] amd_iommu_report_page_fault: 74 callbacks suppressed
[ 5293.684366] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfec08000 flags=0x0020]
[ 5293.684375] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfec08200 flags=0x0020]
[ 5293.684382] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfec08400 flags=0x0020]
[ 5293.684388] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfec08600 flags=0x0020]
[ 5293.684395] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfec08800 flags=0x0020]
[ 5293.684402] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfec08a00 flags=0x0020]
[ 5293.684410] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfec08c00 flags=0x0020]
[ 5293.684418] xdma 0000:0e:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001a address=0xfec08e00 flags=0x0020]

To answer your questions:

  1. I'm using a custom design, based on Xilinx's XDMA example code. We are using an AXI MM interface plugged on a BRAM at 0x200000 (for receiving) and at 0x100000 (for sending).
  2. I've tried both aws-fpga and the "official" Xilinx dma_ip_driver. We are not using any exotic feature like cyclic transfers, so the AWS version works out of the box here.
  3. The driver is set in interrupt mode.

@X-Ryl669
Author

Please note that I tried enabling descriptor dumping, and the issue does not happen in that case (unfortunately), since it slows down the system too much.

My current hypothesis is that the driver "times out" and starts deregistering descriptors while the engine is still running. The engine would then try to read from a now-unmapped address, causing the IO_PAGE_FAULT errors I'm experiencing. Descriptor dumping gives the engine extra time to actually finish the transfer before the descriptors are unregistered/unmapped, which would explain why the issue does not happen in that case.

@czfpga
Contributor

czfpga commented Apr 26, 2021

Hi @X-Ryl669,

Sorry for the slow response. I want to let you know that we're still working on this issue with Xilinx. We'll keep you updated as soon as we hear back from Xilinx. Thank you for your patience.

Thanks,

-Chen

@czfpga
Contributor

czfpga commented Apr 30, 2021

Hi @X-Ryl669,

Here is the response from the subject matter expert at Xilinx:

The log shared seems to capture events only after the issue has been triggered; it does not look like the full log from the start of the test. Based on the initial description, I understand that the issue does not happen immediately after the test has started.

From the dmesg log, it seems the customer has enabled legacy interrupts with the IP. Can the customer confirm this?

As an experiment, try to set the driver in "poll mode" instead of "Interrupt mode".

Also, can customer share the .xci file of the IP used with their design?

Could you please confirm the information above and try switching the interrupt mode as suggested? Please let me know if you can share the .xci files.

Thanks,

-Chen

@X-Ryl669
Author

X-Ryl669 commented May 3, 2021

We've tried with these combinations and the result is the same (failing):

  1. MSI with Polling mode
  2. MSI without Polling mode

XCI and XML:

bd_PCIe_xdma_0_0.zip

Block design:

https://www12.zippyshare.com/v/1JIBV0x2/file.html

I think there are two bugs here: one in the FPGA's logic and one in the driver.

The first bug is that the XDMA engine is failing/timing out. Here we are clueless, since the XDMA code is not accessible to inspect.
The second bug is that when the first one happens, it is detected, which is followed by a transfer_abort that unmaps the buffers. Yet the FPGA keeps sending data after the PC has finished unmapping the buffers, and this causes the IO faults I'm observing.

It seems that the more pressure we put on the engine, the more likely it is to fail/time out and trigger this. I wonder if it would work to add a check that the engine has actually stopped between these two lines:

			xdma_engine_stop(engine);
//>>>>> Here
			transfer_destroy(engine, xfer);

So that the engine is really stopped before buffers are unmapped.

@czfpga
Contributor

czfpga commented May 5, 2021

Hi @X-Ryl669,

Here is the response from the subject matter expert at Xilinx:

  1. Ask the customer to add a system_ila IP on the M_AXI port of the XDMA IP as shown below:
    (screenshot: system_ila attached to the XDMA M_AXI port)

  2. In the customer's description of the test, they mentioned two addresses from which they initiate C2H and H2C transfers. In the block diagram, I find two BRAMs and a PS-DDR interfacing to XDMA's M_AXI port. Based on the memory to which the customer is doing C2H/H2C transactions, ask them to add a system ILA to the S_AXI port of that memory.

The reason for this test is to find out whether there is any issue with the memory read/write when the problem gets triggered, i.e. whether XDMA is stuck waiting for the memory read/write operation to succeed.

Could you give these tests a try and let us know the results?

Thanks,

-Chen

@X-Ryl669
Author

X-Ryl669 commented May 6, 2021

I've found the bug and I think I've solved it:

When looking at the log, I spotted this:

xdma_xfer_submit: xfer 0x00000000352a938a,1048832, s 0x1 timed out, ep 0x300100

Notice the s 0x1 part: that's supposed to be xfer->state, which equals TRANSFER_STATE_SUBMITTED.
Yet the code does this (libxdma.c, around line 3192):

			rv = wait_event_interruptible_timeout(xfer->wq,
				(xfer->state != TRANSFER_STATE_SUBMITTED),
				msecs_to_jiffies(timeout_ms));

The idea here is to wait until xfer->state differs from TRANSFER_STATE_SUBMITTED, then process the state-change event.

The return value of wait_event_interruptible_timeout is:

  • 0 if the condition was still false when the timeout expired,
  • the number of remaining jiffies if the event happened before the timeout,
  • -ERESTARTSYS if the wait was interrupted by a signal

In the current code, the return value is never checked. So I dumped it, and I got -ERESTARTSYS when the issue happened.

Typically the engine is still running and hasn't received the completion interrupt, but the wait is spuriously interrupted by a signal. From then on, the code believes there was an engine timeout, which is wrong: it calls transfer_destroy on this false timeout, and the engine later writes into the now-unmapped area.

So I replaced the code with this, and it works:

		/*
		 * When polling, determine how many descriptors have been queued
		 * on the engine to determine the writeback value expected
		 */
		if (poll_mode) {
			unsigned int desc_count;

			spin_lock_irqsave(&engine->lock, flags);
			desc_count = xfer->desc_num;
			spin_unlock_irqrestore(&engine->lock, flags);

			dbg_tfr("%s poll desc_count=%d\n",
				engine->name, desc_count);
			rv = engine_service_poll(engine, desc_count);

		} else {
			do {
				rv = wait_event_interruptible_timeout(xfer->wq,
					(xfer->state != TRANSFER_STATE_SUBMITTED),
					msecs_to_jiffies(timeout_ms));
			} while (rv == -ERESTARTSYS);
		}

Please note that this does not solve the race between the cleanup and the engine continuing to transfer data in the case of a legitimate timeout. Another bug likely remains in poll mode too, but I don't use it.

To answer Xilinx's expert: adding a system_ila to spy on the AXI bus hasn't revealed anything. We tried to trigger on a non-OK status, but it never triggered. And since there's no (obvious) way to feed an external signal into the system_ila block, we can't synchronize this ILA's capture with another ILA's signals. I can send a screenshot of the capture, but it shows perfectly standard AXI communication, since we failed to capture the timeout event (if it even happened).

@czfpga I'd really like to thank you for your time and your support.

@X-Ryl669
Author

X-Ryl669 commented May 6, 2021

I haven't checked the poll_mode path, so I don't know whether the same thing can happen there (the return value is not checked either). While testing with poll_mode = 1, I got a system freeze, which I interpreted as the FPGA writing to random memory (like what happens here in interrupt mode), but it might have been a completely different issue, and it's almost impossible to debug since I obviously can't check the log in that case.

@czfpga
Contributor

czfpga commented May 6, 2021

Hi @X-Ryl669,

You're welcome. Thank you for pointing these out. We'll work with Xilinx on this feedback.

Thanks,

-Chen

@czfpga
Contributor

czfpga commented May 7, 2021

Hi @X-Ryl669,

I want to keep you updated: we've had the driver developer from Xilinx look at the issue you pointed out. I'll post the response here as soon as I hear back from Xilinx.

Thanks,

-Chen

kyyalama2 pushed a commit that referenced this issue Jun 21, 2021
…out was interrupted by a signal. (#680)

Co-authored-by: root <[email protected]>

Release v1.4.20

* Bug Fix release to fix XDMA to fix Issue #516
* Miscellaneous documentation updates
kyyalama2 pushed a commit that referenced this issue Jun 22, 2021
* Bug Fix release to fix XDMA to fix Issue #516
* Miscellaneous documentation updates

Co-authored-by: root <[email protected]>
kyyalama2 pushed a commit that referenced this issue Jun 22, 2021
* Bug Fix release to fix XDMA to fix Issue #516
* Miscellaneous documentation updates

Co-authored-by: root <[email protected]>
kyyalama2 added a commit that referenced this issue Jun 22, 2021
* Bug Fix release to fix XDMA to fix Issue #516
* Miscellaneous documentation updates
kyyalama2 added a commit that referenced this issue Jun 22, 2021
* Bug Fix release to fix XDMA to fix Issue #516
* Miscellaneous documentation updates
deeppat pushed a commit that referenced this issue Jun 23, 2021
* Bug Fix release to fix XDMA to fix Issue #516
* Miscellaneous documentation updates
@deeppat
Contributor

deeppat commented Jun 23, 2021

Hi @X-Ryl669, this should be fixed. Please re-open this issue and let us know if it still doesn't work for you, and we'll be happy to help.

@deeppat deeppat closed this as completed Jun 23, 2021