Fix OOM in dynamic_preprocess for extreme aspect ratios #1256

Open
Mr-Neutr0n wants to merge 1 commit into OpenGVLab:main from Mr-Neutr0n:fix/dynamic-preprocess-oom-extreme-aspect-ratios

@Mr-Neutr0n

Summary

  • Adds safety bounds to dynamic_preprocess to prevent out-of-memory (OOM) errors when processing images with extreme aspect ratios or when max_num is set to very high values.
  • Caps max_num at MAX_PATCHES_LIMIT (24) to prevent runaway patch counts that lead to excessive memory allocation.
  • Filters out candidate patch grids whose aspect ratio exceeds MAX_ASPECT_RATIO_THRESHOLD (200), to avoid degenerate grids like (12, 1) that produce large intermediate images without useful visual information.
  • Adds a safety fallback to (1, 1) grid if all candidate ratios are filtered out.
  • Applies the fix consistently across all three copies of the function in internvl_chat, internvl_chat_gpt_oss, and streamlit_demo.

Closes #1221

Details

When dynamic_preprocess receives an image with extreme proportions (e.g., a very wide panorama or a very tall screenshot), the generated target ratios can include highly elongated grids. Combined with large max_num values that can be configured via training metadata (max_dynamic_patch), this causes:

  1. Large intermediate resize buffers (e.g., 5376x448 pixels for a (12, 1) grid with image_size=448)
  2. Many patch tensors being created simultaneously
  3. OOM crashes, especially in environments with limited GPU memory
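
The arithmetic behind these numbers can be sketched in a few lines. This is a minimal illustration, not the repository's code: it assumes candidate grids are all (cols, rows) pairs whose product lies in [min_num, max_num] and that the grid closest to the image's aspect ratio is selected. The helper names `candidate_grids`, `closest_grid`, and `resize_shape` are hypothetical.

```python
def candidate_grids(min_num, max_num):
    """All (cols, rows) grids whose patch count lies in [min_num, max_num] (assumed enumeration)."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Sort by patch count, then by grid, so the order is deterministic.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def closest_grid(width, height, grids):
    """Pick the grid whose cols/rows ratio is nearest the image's aspect ratio (simplified stand-in)."""
    aspect = width / height
    return min(grids, key=lambda g: abs(aspect - g[0] / g[1]))


def resize_shape(grid, image_size=448):
    """Width and height of the intermediate resized image for a grid."""
    cols, rows = grid
    return cols * image_size, rows * image_size


# A 10000x100 panorama with max_num=12 lands on the most elongated grid.
grid = closest_grid(10000, 100, candidate_grids(1, 12))
width, height = resize_shape(grid)
print(grid, width, height, grid[0] * grid[1])  # (12, 1) 5376 448 12
```

Under the same assumptions, raising max_num to 100 selects a (100, 1) grid for this image, which corresponds to a 44800x448 intermediate image and 100 patches.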

The fix introduces two configurable module-level constants:

  • MAX_PATCHES_LIMIT = 24: Hard upper bound on patch count regardless of max_num parameter
  • MAX_ASPECT_RATIO_THRESHOLD = 200: Maximum allowed ratio between grid dimensions

These values are intentionally generous to avoid affecting normal usage while preventing pathological cases.
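
How the cap, the filter, and the fallback could fit together is sketched below. The two constant names and values come from this PR; the function `bounded_grids`, its parameters, and the enumeration of candidate grids are assumptions made for illustration and may differ from the actual diff.

```python
MAX_PATCHES_LIMIT = 24            # value stated in this PR
MAX_ASPECT_RATIO_THRESHOLD = 200  # value stated in this PR


def bounded_grids(min_num, max_num,
                  patch_limit=MAX_PATCHES_LIMIT,
                  ratio_limit=MAX_ASPECT_RATIO_THRESHOLD):
    """Hypothetical helper: enumerate candidate (cols, rows) grids with safety bounds."""
    # 1. Cap the requested patch count, whatever the caller asked for.
    max_num = min(max_num, patch_limit)

    # 2. Enumerate grids within the patch budget, dropping overly elongated ones.
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
        and max(cols, rows) / min(cols, rows) <= ratio_limit
    }

    # 3. Fall back to a single patch if every candidate was rejected.
    if not grids:
        grids = {(1, 1)}

    return sorted(grids, key=lambda g: (g[0] * g[1], g))


print(max(c * r for c, r in bounded_grids(1, 100)))  # 24
print(bounded_grids(30, 40))                         # [(1, 1)]
print((12, 1) in bounded_grids(1, 12, ratio_limit=4))  # False
```

One thing this sketch makes visible: if the enumeration really has this shape, the most elongated grid that survives the cap is (24, 1), whose ratio of 24 is well below 200, so with the default values the cap does the limiting and the ratio filter only takes effect when the threshold is lowered (the last line uses a hypothetical threshold of 4).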

Test plan

  • Verify normal images (standard aspect ratios like 4:3, 16:9) produce identical results before and after the change
  • Test with extreme aspect ratio images (e.g., 10000x100, 100x10000) to confirm no OOM
  • Test with max_num values > 24 to confirm capping works correctly
  • Verify the fallback to (1, 1) works when all ratios are filtered
  • Run existing eval scripts (e.g., MME, VQA) to confirm no regression
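
The first four checks could be prototyped without loading a model, roughly as follows. This is only a sketch against simplified stand-ins: `grids` and `pick` are hypothetical helpers that mimic grid enumeration and selection before (`bounded=False`) and after (`bounded=True`) the change, and real tests would call the repository's dynamic_preprocess on actual images instead.

```python
PATCH_LIMIT = 24   # MAX_PATCHES_LIMIT from this PR
RATIO_LIMIT = 200  # MAX_ASPECT_RATIO_THRESHOLD from this PR


def grids(min_num, max_num, bounded):
    """Stand-in for candidate grid enumeration, with or without the safety bounds."""
    if bounded:
        max_num = min(max_num, PATCH_LIMIT)
    found = [
        (c, r)
        for c in range(1, max_num + 1)
        for r in range(1, max_num + 1)
        if min_num <= c * r <= max_num
        and (not bounded or max(c, r) / min(c, r) <= RATIO_LIMIT)
    ]
    # Fall back to a single patch if nothing qualifies.
    return found or [(1, 1)]


def pick(width, height, min_num, max_num, bounded):
    """Stand-in for grid selection: nearest aspect ratio, fewer patches on ties."""
    aspect = width / height
    return min(
        grids(min_num, max_num, bounded),
        key=lambda g: (abs(aspect - g[0] / g[1]), g[0] * g[1]),
    )


# Normal aspect ratios: selection is unchanged by the bounds.
for size in [(1024, 768), (1920, 1080), (800, 800)]:
    assert pick(*size, 1, 12, bounded=False) == pick(*size, 1, 12, bounded=True)

# Extreme aspect ratios with an oversized max_num: patch count stays capped.
for size in [(10000, 100), (100, 10000)]:
    cols, rows = pick(*size, 1, 100, bounded=True)
    assert cols * rows <= PATCH_LIMIT

# No candidate survives (min_num above the cap): fallback grid is used.
assert pick(640, 480, 30, 40, bounded=True) == (1, 1)
```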

…spect ratios

Images with extreme aspect ratios (e.g., very wide panoramas or tall
screenshots) can cause excessive memory allocation in dynamic_preprocess
because the function generates patch grid ratios without any aspect ratio
filtering. This can lead to OOM errors, especially when max_num is set
to high values.

Changes:
- Cap max_num at MAX_PATCHES_LIMIT (24) to prevent runaway patch counts
- Filter out target ratios where either dimension ratio exceeds
  MAX_ASPECT_RATIO_THRESHOLD (200) to avoid degenerate patch grids
- Add a safety fallback to (1,1) if all ratios are filtered out
- Apply the fix consistently across all three copies of the function:
  internvl_chat, internvl_chat_gpt_oss, and streamlit_demo
@Mr-Neutr0n

Friendly bump! Let me know if there's anything I should update or improve to help move this forward.

Development

Successfully merging this pull request may close these issues.

[Bug] AttributeError: 'InternVLChatModel' object has no attribute 'warp_llm_lora'
