I need to show that gradient checkpointing really does reduce GPU memory usage during the backward pass. In the results, the two leftmost columns show active_bytes and reserved_bytes. In my test, active_bytes reads 3.83G while reserved_bytes reads 9.35G. Why does PyTorch still reserve that much GPU memory?
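For reference, here is a minimal sketch of how the two counters could be read side by side. It assumes the numbers come from `torch.cuda.memory_stats()`, whose `active_bytes.all.current` and `reserved_bytes.all.current` keys match the column names above. The helper names `summarize_memory` and `read_cuda_memory` are hypothetical, made up for this illustration. The gap between the two values is memory PyTorch's caching allocator has obtained from the driver but that is not currently backing active blocks.

```python
GIB = 2**30  # bytes per GiB


def summarize_memory(stats):
    """Convert the raw allocator counters to GiB (hypothetical helper).

    `stats` is a dict shaped like the return value of
    torch.cuda.memory_stats().
    """
    active = stats["active_bytes.all.current"]
    reserved = stats["reserved_bytes.all.current"]
    return {
        "active_gib": active / GIB,
        "reserved_gib": reserved / GIB,
        # Reserved by the caching allocator but not part of an active block.
        "cached_gib": (reserved - active) / GIB,
    }


def read_cuda_memory(device=None):
    """Read the current counters from a CUDA device (hypothetical helper)."""
    import torch  # imported lazily so summarize_memory works without a GPU

    if not torch.cuda.is_available():
        raise RuntimeError("a CUDA device is required to read allocator stats")
    return summarize_memory(torch.cuda.memory_stats(device))
```

Calling `read_cuda_memory()` right after the backward pass, once with and once without checkpointing, would give comparable numbers for both columns.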