When running with F1.16xlarge on all FPGAs, PCIE access to one of them is stuck #656
Comments
Hi @NoamDualBird, Thank you for reaching out. In order for us to better investigate this, can you please provide more info about your application?
In addition, since you're using the small shell, it's unclear to me why the XDMA driver is used, as there is no DMA engine in the shell. (This is not directly related to the timeout issue, but I wanted to call it out.) Thanks, Chen
Hi Chen,
Regarding the XDMA driver: we need it mainly to enable registers in the shell interrupt controller (otherwise MSI-X does not work). We do not use it for DMA. Since we had to use it to enable interrupts, we also use its devices (_user/_events) to map BAR0 and interrupts to userspace. We were wondering whether there may be an issue in the shell when the CL AW/W channel uses burst sizes larger than the PCIe MTU. If the PCIe interface receives backpressure, that backpressure may propagate to the AW/W channel multiple times during a single transaction and so trigger the timeout, even though the protocol was not violated. Thanks, Noam
Hi Noam, Thank you for the details about your application and use of the XDMA driver. A burst size exceeding the PCIe MTU isn't problematic, as packets will be automatically fragmented. However, multiple cards sending simultaneous bursts to the host may exceed the hardware interface's bandwidth capacity, causing backpressure to the CL and triggering AW/W channel timeouts. If reducing the frequency of traffic requests isn't a preferred option, please consider staggering transmissions from the cards to see whether this reduces peak bandwidth spikes and potentially eliminates the timeouts. Hope this helps. Chen
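To make the staggering suggestion concrete, here is a minimal Python sketch. It is only an illustration: the function names, the fixed transmit period, and the burst duration are assumptions that do not come from this thread or from the shell. Each card gets an evenly spaced start offset within one period, and a small helper counts how many cards would be transmitting at the same time.

```python
def stagger_offsets(num_cards, period_us):
    """Return one start offset (in microseconds) per card, evenly spaced over one period."""
    if num_cards <= 0:
        raise ValueError("num_cards must be positive")
    slot_us = period_us / num_cards
    return [card * slot_us for card in range(num_cards)]


def peak_concurrency(offsets_us, burst_us):
    """Return the largest number of bursts in flight at once within a single period.

    Assumes every burst lasts burst_us and ignores wrap-around into the next period.
    """
    events = []
    for start in offsets_us:
        events.append((start, 1))               # burst begins
        events.append((start + burst_us, -1))   # burst ends
    # Sort by time; at equal times, process ends (-1) before starts (+1).
    events.sort(key=lambda event: (event[0], event[1]))
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


# Hypothetical numbers: 8 cards, an 80 us period, 10 us bursts.
aligned = [0.0] * 8
staggered = stagger_offsets(8, 80.0)
print(peak_concurrency(aligned, 10.0), peak_concurrency(staggered, 10.0))
```

With all cards starting together, every burst overlaps; with evenly spaced offsets and bursts no longer than one slot, only one card transmits at a time. Whether this is enough in practice depends on the real burst durations and on how the host interface arbitrates between cards.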
Hi Chen, Thank you for your response. We will implement these mechanisms to avoid the timeout; however, I still think there's an issue here.
Thanks,
Hi Noam, Timeouts protect the shell and facilitate debugging if, for example, a misbehaving AXI master continuously drives the AW/W channels without properly terminating its transactions. Your application is experiencing timeouts for a totally different reason. However, I doubt the issue is related to the PCIe MTU: your tests on 1 or 2 FPGAs would have shown similar problems if that were the case. It's more probable that your application is hitting a system bottleneck that occurs when the peak bandwidth from all FPGAs exceeds the system's capacity. The crash needs more investigation. A timeout shouldn't cause a system crash, because the shell should return OKAY even in a timeout event. Thanks, Chen
Hi Chen, Please advise how you propose to further investigate the system crash. Thanks
When we run our workload on 1 or 2 FPGAs we do not have any issues, but when we try to run on 4 or 8 FPGAs
we usually get an indication of a shell PCI master timeout error in one of the FPGA slots during high-bandwidth DMA.
our setup:
From our internal debug this is what we see:
Our PCI AXI master (CL) issues AXI write transactions to the shell with a typical burst size of 4 KB.
At some point we see the shell report a timeout error on the W channel (i.e. pcim-axi-protocol-wchannel-error).
After debugging, we see that there is indeed a timeout violation between some WDATA transfers,
but this violation occurs because WREADY is de-asserted during this period (while WVALID is asserted).
As a result of the WREADY backpressure, the CL can't complete the transaction within the timeout period.
Some time after the timeout occurs, all writes and reads from the FPGA towards PCI are stuck, including interrupts.
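The behaviour described above can be illustrated with a small Python model. This is a sketch under stated assumptions, not the shell's actual logic: it assumes the timeout counts consecutive cycles without a W-channel beat inside one burst, and the function name, trace values, and thresholds are all hypothetical. A beat transfers only on a cycle where WVALID and WREADY are both high.

```python
def first_timeout(wvalid, wready, timeout_cycles):
    """Return (cycle, ready_stalls, valid_stalls) when the gap between beats
    reaches timeout_cycles, or None if the trace never times out.

    ready_stalls counts cycles where WVALID was high but WREADY was low;
    valid_stalls counts cycles where the master itself was not driving WVALID.
    Both counters cover only the current gap since the last transferred beat.
    """
    gap = ready_stalls = valid_stalls = 0
    for cycle, (valid, ready) in enumerate(zip(wvalid, wready)):
        if valid and ready:
            # A beat transferred, so the gap counter restarts.
            gap = ready_stalls = valid_stalls = 0
            continue
        gap += 1
        if valid:
            ready_stalls += 1   # stall caused by the receiving side
        else:
            valid_stalls += 1   # stall caused by the master
        if gap >= timeout_cycles:
            return cycle, ready_stalls, valid_stalls
    return None


# Hypothetical trace: the master holds WVALID high for the whole burst,
# while WREADY drops for six cycles in the middle.
wvalid = [1] * 12
wready = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(first_timeout(wvalid, wready, 5))
print(first_timeout(wvalid, wready, 8))
```

In this model, a short threshold fires while every stalled cycle is attributed to WREADY being low, so the master never violated the handshake. A longer threshold lets the same trace complete. This matches the observation in the report that the timeout is reached because of backpressure rather than because the CL stopped driving data.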