Multithreaded access to different channel and different memory area fails #516
Hi @X-Ryl669, Thanks for reporting this. We've asked a Xilinx subject matter expert to reply to this issue and they will either respond here or we will add their response here as soon as we hear back from them. Thanks, Deep
Thank you
Hi @X-Ryl669, We are still waiting to hear back. I apologize for the delay and I hope to update you on this soon. -Deep
Hi @X-Ryl669, Xilinx got back to us and asked for more information for their analysis. Could you provide this information to us? Your use case of having 2 threads, each going to a different DMA engine, looks correct. Each engine is independent of the others. Adding the mutex just slows down the traffic, allowing only 1 engine to be active at any given time, so the performance would be worse. From the log, the reason for the timeout is not obvious. We need additional data. Please share the full dmesg log, which includes the events from the start. A few questions to understand your design and use mode:
Thanks, -Chen
Hi, the full log:
To answer your questions:
Please notice that I've tried to enable descriptor dumping, and the failure does not happen in that case (unfortunately), since it slows down the system too much. My current hypothesis is that the driver "times out" and starts deregistering descriptors while the engine is still running. It would then try to read from a (now) unknown address, causing the
Hi @X-Ryl669, Sorry for the slow response. I want to let you know that we're still working on this issue with Xilinx. We'll keep you updated as soon as we hear back from Xilinx. Thank you for your patience. Thanks, -Chen
Hi @X-Ryl669, Here is the response from the subject matter expert at Xilinx: The log shared seems to capture the events from when the issue was triggered; it does not look like the full log from the start of the test. Based on the initial description, I understand that the issue is not happening immediately after the test has started. From the dmesg log, it seems the customer has enabled Legacy interrupts with the IP. Can the customer confirm this? As an experiment, try to set the driver in "poll mode" instead of "interrupt mode". Also, can the customer share the .xci file of the IP used with their design? Could you please confirm the information above and try switching the interrupt mode as suggested? Please let me know if you can share the .XCI files. Thanks, -Chen
We've tried with these combinations and the result is the same (failing):
XCI and XML: Block design: https://www12.zippyshare.com/v/1JIBV0x2/file.html
I think there are 2 bugs here, one in the FPGA's logic and one in the driver. The first bug is that the XDMA engine is failing/timing out. Here, we are clueless, since the XDMA code is not accessible to inspect. It seems that the more pressure we put on the engine, the more likely it is to fail/time out, and this issue is triggered. I wonder if it would work to add a check that the engine has actually stopped between these 2 lines:
So that the engine is really stopped before buffers are unmapped.
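To illustrate, here is a rough sketch of the kind of check I have in mind. The symbols `engine->regs->status`, `XDMA_STAT_BUSY` and the retry counts are only assumptions for the sake of the example, not necessarily the actual driver names:

```c
/* Sketch only: after requesting the engine to stop, poll the engine status
 * register until the busy bit clears, and only then unmap the transfer's
 * buffers. engine->regs->status and XDMA_STAT_BUSY are assumed names here,
 * used purely for illustration. */
int retries = 100;
u32 status;

do {
    status = ioread32(&engine->regs->status);
    if (!(status & XDMA_STAT_BUSY))
        break;          /* the engine has really stopped */
    udelay(10);         /* give the engine a moment to drain */
} while (--retries);

if (status & XDMA_STAT_BUSY)
    pr_warn("%s: engine still busy after stop request\n", engine->name);
/* only unmap the descriptors/buffers once the busy bit has cleared */
```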
Hi @X-Ryl669, Here is the response from the subject matter expert at Xilinx:
Could you give these tests a try and let us know the results? Thanks, -Chen
I've found the bug and I think I've solved it: When looking at the log, I spotted this:
Notice the
The idea here is to wait until the transfer is no longer in TRANSFER_STATE_SUBMITTED. In
In the current code, the return condition is never checked. So I dumped it, and I got -ERESTARTSYS. Typically, the engine is still running and it hasn't received the completion interrupt yet, but the wait is spuriously interrupted; from then on, the code thinks it was due to an engine timeout, which is wrong, leading to the engine writing to a now-unmapped area later on because the code calls

So I've replaced the code with this, and it works:

```c
/*
 * When polling, determine how many descriptors have been queued
 * on the engine to determine the writeback value expected
 */
if (poll_mode) {
    unsigned int desc_count;

    spin_lock_irqsave(&engine->lock, flags);
    desc_count = xfer->desc_num;
    spin_unlock_irqrestore(&engine->lock, flags);

    dbg_tfr("%s poll desc_count=%d\n", engine->name, desc_count);
    rv = engine_service_poll(engine, desc_count);
} else {
    do {
        rv = wait_event_interruptible_timeout(xfer->wq,
            (xfer->state != TRANSFER_STATE_SUBMITTED),
            msecs_to_jiffies(timeout_ms));
    } while (rv == -ERESTARTSYS);
}
```
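For reference, this is how I understand the return value of `wait_event_interruptible_timeout()` (an illustrative sketch, not the actual driver code):

```c
long rv = wait_event_interruptible_timeout(xfer->wq,
        (xfer->state != TRANSFER_STATE_SUBMITTED),
        msecs_to_jiffies(timeout_ms));

if (rv == -ERESTARTSYS) {
    /* the wait was interrupted by a signal: the engine may still be
     * running, so this must NOT be treated as an engine timeout */
} else if (rv == 0) {
    /* the timeout elapsed with the condition still false: a real timeout */
} else {
    /* rv > 0: the condition became true, i.e. the transfer left
     * TRANSFER_STATE_SUBMITTED */
}
```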
Please notice that this does not solve the race condition between the cleanup and the engine continuing to transfer data in case of a legitimate timeout. So another bug happens in poll mode too, but I don't use it. To answer Xilinx's expert, adding

@czfpga I'd really like to thank you for your time and your support.
I haven't checked the
Hi @X-Ryl669, You're welcome. Thank you for pointing these out. We'll work with Xilinx on this feedback. Thanks, -Chen
Hi @X-Ryl669, I want to keep you updated: we've had the driver developer from Xilinx look into the issue you pointed out. I'll post the response here as soon as I hear back from Xilinx. Thanks, -Chen
…out was interrupted by a signal. (#680)
Co-authored-by: root <[email protected]>
Release v1.4.20
* Bug fix release to fix XDMA Issue #516
* Miscellaneous documentation updates
Hi @X-Ryl669, this should be fixed. Please re-open this issue and let us know if this still doesn't work for you, and we'll be happy to help.
I'm trying to use XDMA.
Description of the issue:
- Thread 1 opens `xdma0_c2h_0` and runs `char buf[1024*1024]; pread(c2h_fd, buf, sizeof(buf), SOME_ADDR1);` in a loop.
- Thread 2 opens `xdma0_h2c_0` and runs `char buf[20*1024]; pwrite(h2c_fd, buf, sizeof(buf), SOME_ADDR2);` in a loop (a minimal sketch of both threads follows after the remarks).

Remarks:
- If I add a mutex around these calls to prevent both threads from accessing the XDMA driver at the same time, the failure does not happen (but obviously with much lower performance).
- I'm not using the same memory buffer, I'm not using the same file descriptor, and I'm not using the same destination/source address in the FPGA here.
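For completeness, here is a minimal standalone sketch of the reproduction. The `/dev/xdma0_c2h_0` and `/dev/xdma0_h2c_0` paths are the standard XDMA character devices for the channels named above, and `SOME_ADDR1`/`SOME_ADDR2` remain placeholders as in the description:

```c
/* Minimal sketch of the two-thread reproduction described above.
 * Build with: gcc -pthread repro.c -o repro */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define SOME_ADDR1 0x00000000UL /* placeholder: FPGA address read by the C2H thread  */
#define SOME_ADDR2 0x10000000UL /* placeholder: FPGA address written by the H2C thread */

static void *c2h_reader(void *arg)
{
    static char buf[1024 * 1024];   /* static keeps the 1 MiB buffer off the stack */
    int c2h_fd = open("/dev/xdma0_c2h_0", O_RDONLY);

    (void)arg;
    if (c2h_fd < 0) { perror("open c2h"); return NULL; }
    for (;;) {
        if (pread(c2h_fd, buf, sizeof(buf), SOME_ADDR1) < 0) {
            perror("pread");
            break;
        }
    }
    close(c2h_fd);
    return NULL;
}

static void *h2c_writer(void *arg)
{
    static char buf[20 * 1024];
    int h2c_fd = open("/dev/xdma0_h2c_0", O_WRONLY);

    (void)arg;
    if (h2c_fd < 0) { perror("open h2c"); return NULL; }
    for (;;) {
        if (pwrite(h2c_fd, buf, sizeof(buf), SOME_ADDR2) < 0) {
            perror("pwrite");
            break;
        }
    }
    close(h2c_fd);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    /* Two threads, two independent channels, two independent buffers and
     * FPGA addresses, matching the remarks above. */
    pthread_create(&t1, NULL, c2h_reader, NULL);
    pthread_create(&t2, NULL, h2c_writer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```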