@celestinoxp
Contributor

@celestinoxp celestinoxp commented Dec 10, 2025

Summary

Fixes #5433 - Incorrect memory reporting on Windows where available memory was reported as 3.2 TB instead of actual RAM (~32 GB).

Problem

During training with large datasets on Windows, the system reported unrealistic memory values:

  • Reported: Available Memory: 3203341.44 MB (3.2 TB) ❌
  • Actual: ~32 GB RAM ✅

This occurred because psutil.virtual_memory() occasionally returns incorrect values on Windows.

Solution

Implemented a simplified, platform-specific approach that uses the most reliable memory detection method for each platform:

Windows

# Use native Windows API (most reliable)
if platform.system() == "Windows":
    try:
        total_mem, _ = GlobalMemoryStatusEx()  # ctypes wrapper
        return total_mem
    except Exception:
        return psutil.virtual_memory().total  # Fallback

Linux/Mac

# psutil works well on these platforms
return psutil.virtual_memory().total

Why This Approach

  1. Simple: No complex validation thresholds or magic numbers
  2. Reliable: Always uses the most trusted source per platform
  3. Fast: Single API call instead of multiple
  4. Maintainable: ~70 lines less code than cross-validation approach
  5. Industry standard: How other projects handle platform-specific APIs
  6. Future-proof: Even if psutil improves, Windows API remains the best choice

Changes

Modified Files

  • common/src/autogluon/common/utils/resource_utils.py - Simplified platform-specific memory detection
  • common/tests/unittests/test_memory_validation.py - Unit tests for validation logic

Code Changes Summary

  • Removed: 76 lines of complex cross-validation code
  • Added: 14 lines of simple, clear platform-specific code
  • Net: -62 lines, much cleaner implementation

Key Functions

  1. _get_memory_size_windows() - Windows API wrapper using GlobalMemoryStatusEx

    • Returns (total_memory, available_memory) in bytes
    • More reliable than psutil per Microsoft documentation
  2. _validate_memory_size() - Basic sanity check

    • Validates memory is between 512 MB and 2 TB
    • Used for logging warnings, not primary detection
  3. _get_memory_size() and _get_available_virtual_mem()

    • Windows: Use Windows API first, fallback to psutil
    • Other platforms: Use psutil directly
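The sanity check described in item 2 could look roughly like the sketch below. This is an illustration only, not the PR's code: the function name `is_realistic_memory_size` and the constant names are assumptions, while the 512 MB and 2 TB bounds are the ones stated in the list above.

```python
import logging

logger = logging.getLogger(__name__)

# Bounds taken from the PR description; the names are illustrative.
MIN_REALISTIC_MEMORY = 512 * 1024**2  # 512 MB in bytes
MAX_REALISTIC_MEMORY = 2 * 1024**4    # 2 TB in bytes


def is_realistic_memory_size(num_bytes: int, source: str = "unknown") -> bool:
    """Return True if num_bytes falls inside the plausible RAM range.

    A warning is logged instead of raising, so the check can be used for
    diagnostics without changing which value the caller ends up using.
    """
    if MIN_REALISTIC_MEMORY <= num_bytes <= MAX_REALISTIC_MEMORY:
        return True
    logger.warning("Unrealistic memory size from %s: %d bytes", source, num_bytes)
    return False
```

With these bounds, the value from the bug report (3203341.44 MB) is flagged, while a 32 GB reading passes.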

Testing

All tests pass:

  • ✅ Windows: Correct memory detection (31.42 GB total, ~17 GB available)
  • ✅ Unit tests: Validation logic works correctly
  • ✅ No breaking changes

Test output:

Test PASSED: Total=31.42 GB, Available=17.08 GB

Impact

Before: Training logs showed 3.2 TB causing confusion and incorrect OOM warnings
After: Correct memory detection (32 GB) with accurate monitoring

Notes

  • The bug is specific to Windows; Linux/Mac use psutil without issues
  • Windows API (GlobalMemoryStatusEx) is recommended by Microsoft as the most reliable method
  • Even latest psutil (7.1.3) can exhibit this issue intermittently
  • Solution is permanent, not a temporary workaround

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

- Add validation to detect unrealistic memory values (< 512 MB or > 2 TB)
- Implement Windows API fallback using GlobalMemoryStatusEx
- Add comprehensive logging when fallback is used
- Update _get_memory_size() and _get_available_virtual_mem() with validation
- Add unit tests for validation and Windows API fallback

Fixes autogluon#5433

- Add cross-validation between psutil and Windows API
- Detect discrepancies > 50% (e.g., 1.97 TB vs 32 GB)
- Use Windows API when sources disagree significantly
- Catches subtle bugs that pass simple threshold validation
- Better logging for debugging memory detection issues

- Remove cross-validation with 50% threshold (over-complex)
- Windows: Use native API first (most reliable), fallback to psutil
- Linux/Mac: Use psutil (works well on these platforms)
- Simpler code: ~70 lines less, no magic numbers
- Faster: Single API call instead of two
- Based on user feedback and research of other projects

Addresses concerns raised in autogluon#5433
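
For context, the cross-validation idea from the second commit (later removed in the third) can be sketched as below. How the actual commit measured the discrepancy is not shown in this conversation, so the formula here (difference relative to the larger reading) and both function names are assumptions.

```python
def relative_discrepancy(a: float, b: float) -> float:
    """Relative difference between two readings, measured against the larger one."""
    largest = max(a, b)
    if largest <= 0:
        return 0.0
    return abs(a - b) / largest


def pick_memory_reading(psutil_bytes: float, native_bytes: float, threshold: float = 0.5) -> float:
    """Prefer the native reading when the two sources disagree by more than threshold."""
    if relative_discrepancy(psutil_bytes, native_bytes) > threshold:
        return native_bytes
    return psutil_bytes


# Example from the commit message: 1.97 TB (psutil) vs 32 GB (Windows API).
psutil_reading = 1.97 * 1024**4
native_reading = 32 * 1024**3
chosen = pick_memory_reading(psutil_reading, native_reading)
```

The final version of the PR drops this comparison and always asks the Windows API first, which avoids the second call and the threshold.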
Comment on lines +330 to +336
# On Windows, prefer native Windows API (more reliable than psutil)
if platform.system() == "Windows":
    try:
        _, available_mem = ResourceManager._get_memory_size_windows()
        return available_mem
    except Exception as e:
        logger.debug(f"Windows API unavailable, falling back to psutil: {e}")
Contributor

Might also need to adjust the above code:

    if os.environ.get("AG_MEMORY_LIMIT_IN_GB", None) is not None:
        total_memory = ResourceManager._get_custom_memory_size()
        p = psutil.Process()
        return total_memory - p.memory_info().rss

so that it uses the right value for p.memory_info().rss?

# Most systems have between 512 MB and 2 TB of RAM
# Values outside this range are likely errors
MIN_REALISTIC_MEMORY = 512 * 1024 * 1024 # 512 MB
MAX_REALISTIC_MEMORY = 2 * 1024 * 1024 * 1024 * 1024 # 2 TB
Contributor

2 TB is highly realistic in industry. Better to do something like 20 TB.

"""
# Most systems have between 512 MB and 2 TB of RAM
# Values outside this range are likely errors
MIN_REALISTIC_MEMORY = 512 * 1024 * 1024 # 512 MB
Contributor

Because this is available memory, it is possible that the system is running on very little available memory, and thus 512 MB is actually a reasonable value. IMO better to do 32 MB.

@Innixma
Contributor

Innixma commented Dec 11, 2025

@celestinoxp 3203341.44 MB is actually 3.2 TB, not 3.2 PB.
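
The unit arithmetic behind this correction can be checked with a few lines. The only input is the logged value; the variable names are illustrative.

```python
reported_mb = 3203341.44  # value from the training log

# Decimal interpretation: 1 TB = 1,000,000 MB
reported_tb_decimal = reported_mb / 1_000_000
# Binary interpretation: 1 TiB = 1024 * 1024 MiB
reported_tib_binary = reported_mb / (1024 * 1024)
# Petabytes are another factor of ~1000 up, so the value is nowhere near 3.2 PB.
reported_pb_decimal = reported_mb / 1_000_000_000

print(f"{reported_tb_decimal:.2f} TB (decimal)")
print(f"{reported_tib_binary:.2f} TiB (binary)")
print(f"{reported_pb_decimal:.4f} PB (decimal)")
```

Either way the figure is on the order of 3 TB, which matches the "3.2 TB" wording in the PR title and summary.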

@Innixma
Contributor

Innixma commented Dec 11, 2025

It seems like it is overestimating available memory by a factor of ~100, which seems like an oddly specific number.
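
A quick check of that factor, using the logged value and the total RAM from the PR's test output (31.42 GB). Comparing against total RAM is an assumption made here for illustration; the truly available memory at the time of the bug is not known.

```python
reported_available_mb = 3203341.44  # from the training log
actual_total_gb = 31.42             # total RAM from the PR's test output

actual_total_mb = actual_total_gb * 1024
overestimate_factor = reported_available_mb / actual_total_mb

print(f"Overestimate factor: {overestimate_factor:.1f}x")
```

The ratio comes out just under 100, consistent with the observation above.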

Comment on lines 215 to 252
@staticmethod
def _get_memory_size_windows():
    """
    Get total physical memory on Windows using GlobalMemoryStatusEx API.
    This is a fallback when psutil reports incorrect values.

    Returns
    -------
    tuple[float, float]
        (total_physical_memory_bytes, available_physical_memory_bytes)
    """
    try:
        import ctypes
        from ctypes import wintypes

        class MEMORYSTATUSEX(ctypes.Structure):
            _fields_ = [
                ("dwLength", wintypes.DWORD),
                ("dwMemoryLoad", wintypes.DWORD),
                ("ullTotalPhys", ctypes.c_ulonglong),
                ("ullAvailPhys", ctypes.c_ulonglong),
                ("ullTotalPageFile", ctypes.c_ulonglong),
                ("ullAvailPageFile", ctypes.c_ulonglong),
                ("ullTotalVirtual", ctypes.c_ulonglong),
                ("ullAvailVirtual", ctypes.c_ulonglong),
                ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
            ]

        mem_status = MEMORYSTATUSEX()
        mem_status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)

        if ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(mem_status)):
            return float(mem_status.ullTotalPhys), float(mem_status.ullAvailPhys)
        else:
            raise RuntimeError("GlobalMemoryStatusEx API call failed")
    except Exception as e:
        logger.warning(f"Failed to get memory size using Windows API: {e}")
        raise
Contributor

@celestinoxp do you have any references to others who have experienced this issue more generally (psutil on Windows)?

mem_status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)

if ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(mem_status)):
    return float(mem_status.ullTotalPhys), float(mem_status.ullAvailPhys)
Contributor

Are these values integers? If they are byte counts, it is probably best to return them as integers. I don't remember why the other methods return floats, but I'd guess they can return integers too if they are returning bytes.
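
As a small illustration of this point: fields declared as `ctypes.c_ulonglong` already read back as Python `int`, so the `float(...)` conversion is optional. The sketch below uses a cut-down structure with hand-assigned stand-in values and makes no Windows API call; the class name `MemoryStatusSketch` is hypothetical.

```python
import ctypes


class MemoryStatusSketch(ctypes.Structure):
    # Only the two fields discussed here; the real MEMORYSTATUSEX has more.
    _fields_ = [
        ("ullTotalPhys", ctypes.c_ulonglong),
        ("ullAvailPhys", ctypes.c_ulonglong),
    ]


status = MemoryStatusSketch()
status.ullTotalPhys = 32 * 1024**3  # stand-in value; no Windows API call is made
status.ullAvailPhys = 17 * 1024**3

total_bytes = status.ullTotalPhys  # reading the field yields a Python int
avail_bytes = status.ullAvailPhys

print(type(total_bytes).__name__, total_bytes)
```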

Comment on lines +83 to +87
assert ResourceManager._validate_memory_size(total_mem, "Windows API test"), \
    "Windows API total memory is unrealistic"
assert ResourceManager._validate_memory_size(avail_mem, "Windows API test"), \
    "Windows API available memory is unrealistic"
assert avail_mem <= total_mem, "Available memory cannot exceed total"
Contributor

Maybe it would be good to look at the discrepancy between this value and the one returned by the normal ResourceManager._get_available_virtual_mem call?

    except Exception as e:
        logger.debug(f"Windows API unavailable, falling back to psutil: {e}")

# On other platforms or if Windows API failed, use psutil
Contributor

@Innixma Innixma Dec 11, 2025

Maybe we can make this its own mini method, _get_available_virtual_mem_psutil, so we can call it in isolation in tests.
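
One way such an isolated method helps in tests is sketched below. `ResourceManagerSketch` is a hypothetical stand-in, not the real `ResourceManager`: its readers return fixed placeholder values so the example runs without psutil or Windows, and the public method name `get_available_virtual_mem` is an assumption.

```python
import platform
from unittest import mock


class ResourceManagerSketch:
    """Hypothetical stand-in for ResourceManager; readers return fixed values."""

    @staticmethod
    def _get_memory_size_windows() -> tuple:
        # Placeholder for the GlobalMemoryStatusEx wrapper.
        raise RuntimeError("Windows API not available in this sketch")

    @staticmethod
    def _get_available_virtual_mem_psutil() -> int:
        # Placeholder for psutil.virtual_memory().available.
        return 17 * 1024**3

    @staticmethod
    def get_available_virtual_mem() -> int:
        if platform.system() == "Windows":
            try:
                _, available = ResourceManagerSketch._get_memory_size_windows()
                return available
            except Exception:
                pass  # fall through to the psutil path
        return ResourceManagerSketch._get_available_virtual_mem_psutil()


def check_fallback_is_used() -> int:
    """Force the Windows branch to fail and confirm the psutil path is taken."""
    with mock.patch.object(platform, "system", return_value="Windows"), \
         mock.patch.object(
             ResourceManagerSketch,
             "_get_available_virtual_mem_psutil",
             return_value=123,
         ) as fake_psutil:
        result = ResourceManagerSketch.get_available_virtual_mem()
        fake_psutil.assert_called_once()
    return result
```

Because the psutil path lives in its own method, a test can patch or call it directly without touching the platform dispatch.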

Contributor

@Innixma Innixma left a comment

Thanks for the PR! This is very nice and I wasn't aware Windows can sometimes do this. I've added some comments.

@Innixma Innixma added this to the 1.5 Release milestone Dec 11, 2025
@celestinoxp
Contributor Author

celestinoxp commented Dec 12, 2025

Thanks for the review @Innixma! I've addressed your feedback:

  • Updated _validate_memory_size ranges to 32 MB (min) and 20 TB (max).
  • Changed Windows API return types to int.
  • Extracted _get_available_virtual_mem_psutil into its own method to isolate the logic.
  • Added test_psutil_fallback_direct to explicitly test the fallback mechanism.

Ready for another look!

@Innixma
Contributor

Innixma commented Dec 16, 2025

Thanks! Generally LGTM, will fix the linting when I get a chance and will then merge

@benhoumine-Abdelkhalek

benhoumine-Abdelkhalek commented Dec 17, 2025

Hello,
We have observed a similar issue when running workloads on Linux nodes in a Kubernetes cluster: the application reports severely inflated available-memory values.
Could you please advise whether the existing solution can be adapted for a Kubernetes environment (e.g., by relying on pod limits or cgroup-based memory reporting), or if we should open a separate issue to propose a Kubernetes-specific solution?
Thank you for your guidance.
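
For readers wondering what "cgroup-based memory reporting" could mean in practice, here is a rough sketch. It is not part of this PR and not an AutoGluon API: the function names are hypothetical, and whether the standard cgroup v2 and v1 files are mounted at these paths in a given cluster is an assumption.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup v2 and v1 locations inside a container; whether they are
# mounted at these paths in a given cluster is an assumption.
CGROUP_V2_LIMIT = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1_LIMIT = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")


def read_cgroup_memory_limit(paths=(CGROUP_V2_LIMIT, CGROUP_V1_LIMIT)) -> Optional[int]:
    """Return the container memory limit in bytes, or None if unlimited/unknown."""
    for path in paths:
        try:
            raw = Path(path).read_text().strip()
        except OSError:
            continue  # file missing or unreadable; try the next candidate
        if raw == "max":
            return None  # cgroup v2 spelling of "no limit"
        try:
            return int(raw)
        except ValueError:
            continue
    return None


def effective_total_memory(host_total_bytes: int, cgroup_limit: Optional[int]) -> int:
    """Use the smaller of the host total and the cgroup limit, if one is set."""
    if cgroup_limit is None:
        return host_total_bytes
    return min(host_total_bytes, cgroup_limit)
```

Taking the minimum also covers cgroup v1, where "no limit" shows up as a very large number rather than the string `max`.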

@celestinoxp
Contributor Author

celestinoxp commented Dec 17, 2025

@benhoumine-Abdelkhalek Thanks for the details!
From my experience, this issue seems to appear only in specific scenarios, such as with large datasets, and I’ve had difficulty reproducing it on my side. Have you been able to reproduce this consistently?
@Innixma, do you have insights on whether the current PR approach makes sense for Kubernetes environments, or should this be investigated separately?
It seems the main culprit may still be the psutil package, but further investigation is needed.

@Innixma
Contributor

Innixma commented Dec 18, 2025

Hi @benhoumine-Abdelkhalek, I suspect that what you are experiencing would require a separate fix, as this PR seems specific to Windows.

In general we need a reproducible example, or at least a description of the specific setup that triggers the problem and of how the reported value is incorrect.

Development

Successfully merging this pull request may close these issues.

[BUG] PerformanceWarning: Available Memory reported incorrectly on Windows (3.2 TB)

3 participants