Bug: Memory usage grows unbounded during training, OOM crashes #120668

@anshul23102

Problem Statement

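Memory usage grows without bound during training until the process crashes with an out-of-memory (OOM) error:

  • Unbounded growth: memory usage keeps increasing for as long as training runs
  • OOM crashes: the process eventually runs out of memory and crashes
  • Cannot train: training runs cannot complete

Business Impact: Training runs cannot complete
Technical Impact: Memory leak during training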

Root Cause Analysis

Tensors accumulate during training and are never released; there is no cleanup step.
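The issue does not identify the code path that holds on to tensors. As a minimal sketch of one common way this kind of accumulation happens, the example below compares a loop that appends each per-step object to a long-lived list with a loop that stores only a scalar summary. `FakeTensor`, `train_leaky`, `train_clean`, and `count_alive` are hypothetical names used for illustration only, and a plain Python object stands in for a framework tensor.

```python
import gc
import weakref


class FakeTensor:
    """Hypothetical stand-in for a framework tensor; owns a byte buffer."""

    def __init__(self, n_bytes):
        self.data = bytearray(n_bytes)


def train_leaky(steps, history):
    """Stores every per-step tensor in `history`, so none can be freed."""
    refs = []
    for _ in range(steps):
        loss = FakeTensor(1024)
        history.append(loss)  # strong reference retained for the whole run
        refs.append(weakref.ref(loss))
    return refs


def train_clean(steps, history):
    """Stores only a plain int per step, so each tensor can be freed."""
    refs = []
    for _ in range(steps):
        loss = FakeTensor(1024)
        history.append(len(loss.data))  # scalar summary, no tensor reference
        refs.append(weakref.ref(loss))
    return refs


def count_alive(refs):
    """Count how many weakly referenced objects are still alive."""
    gc.collect()
    return sum(1 for r in refs if r() is not None)
```

With a `history` list that outlives the loop, `count_alive(train_leaky(10, history))` reports every tensor as still alive, while the `train_clean` variant leaves none alive. Whether this pattern matches the actual leak here is an assumption that profiling would need to confirm.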

Solution Overview

  1. Cleanup: release tensors that are no longer needed
  2. Profiling: profile memory to locate the source of the growth
  3. Optimization: reduce memory usage
  4. Monitoring: monitor memory usage during training
  5. Testing: add memory tests
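For the profiling step, one possible starting point is Python's standard `tracemalloc` module, which can compare two snapshots and rank source lines by how much their allocations grew. This is a sketch under assumptions: `top_growth` and `step_fn` are hypothetical names, and `tracemalloc` only sees memory allocated through Python's allocators, so GPU memory or a framework's native allocator would need that framework's own tooling.

```python
import tracemalloc


def top_growth(step_fn, steps, limit=3):
    """Run step_fn `steps` times; return the lines whose allocations grew most.

    Each result is a (location, size_diff_in_bytes) pair, largest change first.
    """
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        for _ in range(steps):
            step_fn()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to sorts by the absolute size difference, largest first.
    stats = after.compare_to(before, "lineno")
    return [(str(s.traceback), s.size_diff) for s in stats[:limit]]
```

Passing a single training step as `step_fn` and inspecting the first few entries should point at the line that keeps allocating without releasing, provided the leak lives on the Python heap.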

Type of Change

  • Bug fix: Memory management

Testing Done

  • Leak test: no leak detected
  • Profiling test: memory usage stays stable
  • Training test: training completes
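The issue does not show how the leak test was run. A rough sketch of what such a check could look like, using only the standard library, is below. `assert_no_leak`, `step_fn`, and the default threshold of 64 KiB are assumptions chosen for illustration, and the check covers Python-heap allocations only.

```python
import gc
import tracemalloc


def assert_no_leak(step_fn, steps=200, warmup=20, max_growth_bytes=64 * 1024):
    """Raise AssertionError if traced memory grows too much after warmup."""
    # Warm up first so one-time allocations (caches, lazy init) are excluded.
    for _ in range(warmup):
        step_fn()
    gc.collect()
    tracemalloc.start()
    try:
        baseline, _ = tracemalloc.get_traced_memory()
        for _ in range(steps):
            step_fn()
        gc.collect()
        current, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    growth = current - baseline
    if growth > max_growth_bytes:
        raise AssertionError(
            f"memory grew by {growth} bytes over {steps} steps"
        )
    return growth
```

A step that frees what it allocates passes this check, while a step that appends to a long-lived list fails it. The threshold would need tuning for a real training loop.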

Related Issue

Relates to stability

Suggested Labels

bug, memory, stability, gssoc26
