-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Fix Windows memory detection bug (#5433) #5470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix Windows memory detection bug (#5433) #5470
Conversation
- Add validation to detect unrealistic memory values (< 512 MB or > 2 TB) - Implement Windows API fallback using GlobalMemoryStatusEx - Add comprehensive logging when fallback is used - Update _get_memory_size() and _get_available_virtual_mem() with validation - Add unit tests for validation and Windows API fallback Fixes autogluon#5433
- Add cross-validation between psutil and Windows API - Detect discrepancies > 50% (e.g., 1.97 TB vs 32 GB) - Use Windows API when sources disagree significantly - Catches subtle bugs that pass simple threshold validation - Better logging for debugging memory detection issues
- Remove cross-validation with 50% threshold (over-complex) - Windows: Use native API first (most reliable), fallback to psutil - Linux/Mac: Use psutil (works well on these platforms) - Simpler code: ~70 lines less, no magic numbers - Faster: Single API call instead of two - Based on user feedback and research of other projects Addresses concerns raised in autogluon#5433
| # On Windows, prefer native Windows API (more reliable than psutil) | ||
| if platform.system() == "Windows": | ||
| try: | ||
| _, available_mem = ResourceManager._get_memory_size_windows() | ||
| return available_mem | ||
| except Exception as e: | ||
| logger.debug(f"Windows API unavailable, falling back to psutil: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might also need to adjust the above code:
if os.environ.get("AG_MEMORY_LIMIT_IN_GB", None) is not None:
total_memory = ResourceManager._get_custom_memory_size()
p = psutil.Process()
return total_memory - p.memory_info().rss
so that it uses the right value for p.memory_info().rss?
| # Most systems have between 512 MB and 2 TB of RAM | ||
| # Values outside this range are likely errors | ||
| MIN_REALISTIC_MEMORY = 512 * 1024 * 1024 # 512 MB | ||
| MAX_REALISTIC_MEMORY = 2 * 1024 * 1024 * 1024 * 1024 # 2 TB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 TB is highly realistic in industry. Better to do something like 20 TB.
| """ | ||
| # Most systems have between 512 MB and 2 TB of RAM | ||
| # Values outside this range are likely errors | ||
| MIN_REALISTIC_MEMORY = 512 * 1024 * 1024 # 512 MB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is available memory, it is possible that the system is running on very little available memory, and thus 512 MB is actually a reasonable value. IMO better to do 32 MB.
|
@celestinoxp 3203341.44 MB is actually 3.2 TB, not 3.2 PB. |
|
It seems like it is overestimating available memory by a factor of ~100, which seems like an oddly specific number. |
| @staticmethod | ||
| def _get_memory_size_windows(): | ||
| """ | ||
| Get total physical memory on Windows using GlobalMemoryStatusEx API. | ||
| This is a fallback when psutil reports incorrect values. | ||
| Returns | ||
| ------- | ||
| tuple[float, float] | ||
| (total_physical_memory_bytes, available_physical_memory_bytes) | ||
| """ | ||
| try: | ||
| import ctypes | ||
| from ctypes import wintypes | ||
|
|
||
| class MEMORYSTATUSEX(ctypes.Structure): | ||
| _fields_ = [ | ||
| ("dwLength", wintypes.DWORD), | ||
| ("dwMemoryLoad", wintypes.DWORD), | ||
| ("ullTotalPhys", ctypes.c_ulonglong), | ||
| ("ullAvailPhys", ctypes.c_ulonglong), | ||
| ("ullTotalPageFile", ctypes.c_ulonglong), | ||
| ("ullAvailPageFile", ctypes.c_ulonglong), | ||
| ("ullTotalVirtual", ctypes.c_ulonglong), | ||
| ("ullAvailVirtual", ctypes.c_ulonglong), | ||
| ("ullAvailExtendedVirtual", ctypes.c_ulonglong), | ||
| ] | ||
|
|
||
| mem_status = MEMORYSTATUSEX() | ||
| mem_status.dwLength = ctypes.sizeof(MEMORYSTATUSEX) | ||
|
|
||
| if ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(mem_status)): | ||
| return float(mem_status.ullTotalPhys), float(mem_status.ullAvailPhys) | ||
| else: | ||
| raise RuntimeError("GlobalMemoryStatusEx API call failed") | ||
| except Exception as e: | ||
| logger.warning(f"Failed to get memory size using Windows API: {e}") | ||
| raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@celestinoxp do you have a references to others who have experienced this issue more generally (psutil on Windows)?
| mem_status.dwLength = ctypes.sizeof(MEMORYSTATUSEX) | ||
|
|
||
| if ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(mem_status)): | ||
| return float(mem_status.ullTotalPhys), float(mem_status.ullAvailPhys) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these values integers? Probably best to return them as integers if they are in bytes form. I don't remember why the other methods are returning as floats, but they probably can return integers too if they are returning bytes I'd guess.
| assert ResourceManager._validate_memory_size(total_mem, "Windows API test"), \ | ||
| "Windows API total memory is unrealistic" | ||
| assert ResourceManager._validate_memory_size(avail_mem, "Windows API test"), \ | ||
| "Windows API available memory is unrealistic" | ||
| assert avail_mem <= total_mem, "Available memory cannot exceed total" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe good to look at discrepancy in values between this and the normal ResourceManager._get_available_virtual_mem call?
| except Exception as e: | ||
| logger.debug(f"Windows API unavailable, falling back to psutil: {e}") | ||
|
|
||
| # On other platforms or if Windows API failed, use psutil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe can make this its own mini method _get_available_virtual_mem_psutil, so we can call it in an isolated fashion in tests.
Innixma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! This is very nice and I wasn't aware Windows can sometimes do this. I've added some comments.
|
Thanks for the review @Innixma! I've addressed your feedback:
Ready for another look! |
|
Thanks! Generally LGTM, will fix the linting when I get a chance and will then merge |
|
Hello, |
|
@benhoumine-Abdelkhalek Thanks for the details! |
|
Hi @benhoumine-Abdelkhalek, I suspect that what you are experiencing would require a separate fix, as this PR seems specific to Windows. In general we need a reproducible example, or at least where the user can tell us a specific setup that causes it to happen (and in what way it is incorrect) |
Summary
Fixes #5433 - Incorrect memory reporting on Windows where available memory was reported as 3.2 TB instead of actual RAM (~32 GB).
Problem
During training with large datasets on Windows, the system reported unrealistic memory values:
Available Memory: 3203341.44 MB(3.2TB) ❌This occurred because
psutil.virtual_memory()occasionally returns incorrect values on Windows.Solution
Implemented a simplified, platform-specific approach that uses the most reliable memory detection method for each platform:
Windows
Linux/Mac
Why This Approach
Changes
Modified Files
common/src/autogluon/common/utils/resource_utils.py- Simplified platform-specific memory detectioncommon/tests/unittests/test_memory_validation.py- Unit tests for validation logicCode Changes Summary
Key Functions
_get_memory_size_windows()- Windows API wrapper usingGlobalMemoryStatusEx(total_memory, available_memory)in bytes_validate_memory_size()- Basic sanity check_get_memory_size()and_get_available_virtual_mem()Testing
All tests pass:
Test output:
Impact
Before: Training logs showed 3.2 TB causing confusion and incorrect OOM warnings
After: Correct memory detection (32 GB) with accurate monitoring
Notes
GlobalMemoryStatusEx) is recommended by Microsoft as the most reliable methodBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.