When running with F1.16xlarge on all FPGAs, PCIE access to one of them is stuck #656

NoamDualBird · 2024-11-26T08:48:35Z

When we run our workload on 1 or 2 FPGAs we do not have any issues, but when we try to run on 4 or 8 FPGAs we usually get an indication of a shell PCI master timeout error in one of the FPGA slots during high-bandwidth DMA.
Our setup:

F1.16xlarge (8 FPGAs running in parallel)
Amazon Linux AMI
Small shell version - 0x04182104
Linux XDMA driver

From our internal debug this is what we see:
Our PCI AXI master (CL) writes AXI transactions to the shell with a typical burst size of 4 KB.
At some point the shell reports a timeout error on the W channel (i.e. pcim-axi-protocol-wchannel-error).
After debugging it, we see that there is indeed a timeout violation between some WDATA transfers,
but the violation occurs because the shell de-asserts WREADY during this period (while WVALID is asserted).
As a result of the WREADY backpressure, the CL can’t complete the transaction within the timeout period.

Some time after the timeout occurs, all writes and reads from FPGA towards PCI are stuck, including interrupts.
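
To illustrate the failure mode described above, here is a minimal Python sketch of a W-channel watchdog. It is only a behavioral model under stated assumptions: the function name `first_w_timeout`, the per-cycle boolean traces, and the rule "count stalled cycles while WVALID is high and fire at `timeout_cycles`" are all hypothetical and are not taken from the shell's documented logic. It shows how a watchdog that does not distinguish the source of the stall can fire even though the master keeps WVALID asserted the whole time.

```python
def first_w_timeout(wvalid, wready, timeout_cycles):
    """Return the cycle at which a hypothetical W-channel watchdog fires, or None.

    wvalid / wready are per-cycle booleans. The watchdog rule is an assumption
    for illustration: it counts consecutive cycles in which a beat is pending
    (WVALID high) but not accepted (WREADY low).
    """
    stalled = 0
    for cycle, (valid, ready) in enumerate(zip(wvalid, wready)):
        if valid and ready:
            stalled = 0              # beat accepted, restart the window
        elif valid:
            stalled += 1             # master is waiting on WREADY
            if stalled >= timeout_cycles:
                return cycle
    return None


# The master holds WVALID high; the slave drops WREADY for five cycles.
wvalid = [True] * 10
wready = [True, True, False, False, False, False, False, True, True, True]
print(first_w_timeout(wvalid, wready, timeout_cycles=4))
print(first_w_timeout(wvalid, wready, timeout_cycles=8))
```

With a short window the watchdog fires during the WREADY gap; with a window longer than the gap it does not, which matches the observation that the violation depends on how long the backpressure lasts rather than on master behavior.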

czfpga · 2024-11-26T21:37:52Z

Hi @NoamDualBird,

Thank you for reaching out.

In order for us to better investigate this, can you please provide more info about your application?

Does your CL send traffic only to a memory space on the host, or does it send to both the host and neighbor cards (in the case of a multi-card instance)?
When multiple cards are used, how often does each card send traffic? Are they sending at roughly the same time with a similar frequency?
If you have a flow control mechanism enabled in the CL, can you run a test that tunes down the frequency of requests generated on the AW/W channels, and see if that eliminates or mitigates this issue?

In addition, as you're using the small shell, it's unclear to me why the XDMA driver is used, since there is no DMA engine in the shell (this is not directly related to the timeout issue, but I just want to call it out).

Thanks,

Chen

NoamDualBird · 2024-11-27T10:05:33Z

Hi Chen,
Thanks for the quick response. With regards to your questions:

Our custom logic sends traffic only to host memory space.
When multiple cards are used, they work at the same time. We managed to re-create the failure in a dedicated test where all CLs write many GBs of data to the host in parallel.
We tried to tune down the frequency by reducing the number of outstanding transactions, which gives us some control over the AW/W channels. When we do this we do not see the timeout error. However, we wish to better understand how we should work with the shell interface so we can guarantee it cannot happen.
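
As an illustration of the throttling described in the last point, here is a minimal Python model of a cap on outstanding write transactions. It is a sketch only: the class name `OutstandingLimiter` and its methods are hypothetical, and the real mechanism lives in the CL's RTL, not in software.

```python
class OutstandingLimiter:
    """Hypothetical model of a cap on in-flight AXI write transactions."""

    def __init__(self, max_outstanding):
        if max_outstanding < 1:
            raise ValueError("max_outstanding must be at least 1")
        self.max_outstanding = max_outstanding
        self.in_flight = 0

    def try_issue(self):
        """Issue a new AW request if a slot is free; return True on success."""
        if self.in_flight >= self.max_outstanding:
            return False             # hold the request until a response returns
        self.in_flight += 1
        return True

    def complete(self):
        """Release a slot when a write response (B channel) arrives."""
        if self.in_flight == 0:
            raise RuntimeError("completion without a matching issue")
        self.in_flight -= 1


limiter = OutstandingLimiter(max_outstanding=2)
print(limiter.try_issue(), limiter.try_issue(), limiter.try_issue())
limiter.complete()
print(limiter.try_issue())
```

Lowering `max_outstanding` reduces how much data can be queued toward the shell at once, which lowers the chance of a long stall but, as noted above, does not by itself guarantee that a timeout cannot occur.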

Regarding the XDMA driver: we need it mainly to enable registers in the shell interrupt controller (otherwise MSI-X does not work). We do not use it for DMA. Since we had to use it to enable interrupts, we also use its devices (_user/_events) to map BAR0 and interrupts to userspace.

We were wondering whether there may be an issue in the shell when the CL AW/W channel works in burst sizes that are larger than the PCIe MTU. If the PCIe interface receives backpressure, it may be propagated to the AW/W channel multiple times during a single transaction, so the timeout is reached although the protocol wasn't violated.

Thanks,

Noam

czfpga · 2024-12-03T04:18:40Z

Hi Noam,

Thank you for the details about your application and use of the XDMA driver.

A burst size exceeding the PCIe MTU isn't problematic, as packets will be automatically fragmented. However, multiple cards sending simultaneous bursts to the host may exceed the hardware interface's bandwidth capacity, causing back pressure to the CL and triggering AW/W channel timeouts.

If reducing traffic request frequency isn't a preferred option, please consider implementing staggered transmissions from the cards to see if this helps minimize the peak BW spikes and potentially eliminate the timeout issues.
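
One way to picture the staggering suggested above is to give each card a fixed start offset within a shared period. The Python sketch below is an assumption-laden illustration: the function name `stagger_offsets`, the even spacing, and the example period are all hypothetical, and it presumes the cards share a common time reference.

```python
def stagger_offsets(num_cards, period_us):
    """Spread burst start times evenly across one period (hypothetical scheme).

    Returns one start offset in microseconds per card, so that no two cards
    begin a burst at the same moment within the period.
    """
    if num_cards < 1 or period_us <= 0:
        raise ValueError("num_cards must be >= 1 and period_us must be > 0")
    return [slot * period_us / num_cards for slot in range(num_cards)]


# Eight cards sharing an 80 us period (the period value is illustrative only).
print(stagger_offsets(8, 80))
```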

Hope this helps.

Chen

NoamDualBird · 2024-12-03T09:25:49Z

Hi Chen,

Thank you for your response. We will implement all of these mechanisms in order to avoid this timeout; however, I still think there's an issue here.

First of all, AXI timeouts cause the system to crash, since the shell's response is to shut down all AXI interfaces. If AXI timeouts are expected in a fully functional system, I would expect the shell to recover from the timeout gracefully and not crash. In our understanding, an AXI timeout indicates a broken system, such as inaccessible DRAM or a PCIe link that is down. In these cases I would expect the AWS monitor to show some other indication of a critical event on the instance. Backpressure on the PCIe channel is common and can happen for any number of reasons that may or may not involve high bandwidth from the FPGA. It does not indicate a broken system.
When we used 4 KB bursts on the AW/W channel, we assumed that if the shell doesn't have the capacity to absorb the entire burst, it will apply backpressure by de-asserting the aw_ready signal, thus avoiding the timeout altogether. Can we somehow alter the shell logic to do this? If not, the only "bulletproof" way to avoid timeouts is to limit the AXI burst size to no more than the PCIe MTU, which requires many adjustments in our CL in order to keep the required BW. Reducing the FPGA BW will only reduce the probability of the issue, so it is not a valid solution.
In any case, if a timeout is detected by the shell, why is slverr not asserted?
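
For reference, the burst-size adjustment mentioned in the second point amounts to the following address arithmetic, shown here as a minimal Python sketch. The function name `split_burst` and the 256-byte chunk size are hypothetical; the sketch assumes that "PCIe MTU" refers to the link's maximum payload size and that chunks should not cross a boundary aligned to that size.

```python
def split_burst(addr, length, max_chunk):
    """Split [addr, addr + length) into chunks that never cross a
    max_chunk-aligned boundary. max_chunk must be a power of two.

    Returns a list of (start_address, size_in_bytes) tuples.
    """
    if length <= 0 or max_chunk <= 0 or max_chunk & (max_chunk - 1):
        raise ValueError("length must be > 0 and max_chunk a power of two")
    chunks = []
    end = addr + length
    while addr < end:
        boundary = (addr // max_chunk + 1) * max_chunk   # next aligned boundary
        size = min(boundary, end) - addr
        chunks.append((addr, size))
        addr += size
    return chunks


# A 4 KB aligned burst becomes 16 chunks of 256 bytes (256 is illustrative).
print(len(split_burst(0x1000, 4096, 256)))
# An unaligned 512-byte burst is split at the aligned boundaries.
print(split_burst(0x10C0, 512, 256))
```

Each 4 KB burst turns into many smaller transactions, which is why keeping the required bandwidth would need more outstanding transactions or other changes in the CL.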

Thanks,
Noam

czfpga · 2024-12-04T22:39:47Z

Hi Noam,

Timeouts protect the shell and facilitate debugging if, for example, a misbehaving AXI master continuously drives the AW/W channels without properly terminating its transactions.

Your application is experiencing timeouts for a totally different reason. However, I doubt the issue is related to the PCIe MTU: your tests with 1 or 2 FPGAs would have shown similar problems if that were the case. It's more probable that your application is encountering a system bottleneck that occurs when the peak bandwidth from all FPGAs exceeds the system's capacity.

The crash needs more investigation. A timeout shouldn't cause a system crash, because the shell should return OKAY even in a timeout event.

Thanks,

Chen

NoamDualBird · 2024-12-12T06:58:58Z

Hi Chen,

Please advise how you propose to further investigate the system crash.

Thanks
Noam
