Skip to content

Conversation

@FazeelUsmani
Copy link
Contributor

@FazeelUsmani FazeelUsmani commented Nov 12, 2025

Fixes #2488

PR summary

Problem

When using scatter() with per-point colors and then moving the axes limits so all points are off-screen, the PDF backend still writes all marker paths to the file, resulting in unnecessarily large PDFs (~400KB instead of ~7KB for 1000 points).

Solution

Added bounds checking in draw_path_collection (backend_pdf.py:2121-2125) to skip markers outside the visible canvas. This mirrors the existing optimization already present in draw_markers (lines 2157-2159).

Implementation

  • Check marker offsets (xo, yo) against canvas bounds (0, 0) to (width*72, height*72)
  • Skip the output() call for out-of-bounds markers
  • Conservative, backend-specific change (PDF only)
  • No API changes or user-facing behavior modifications

Reproduction Example

import numpy as np
import matplotlib.pyplot as plt

x = np.random.random(1000)
y = np.random.random(1000)
c = np.random.random(1000)

# Before fix: ~410 KB PDF
# After fix: ~7-15 KB PDF
plt.figure()
plt.scatter(x, y, c=c)
plt.xlim(20, 30)  # Move all points off-axis (data is 0-1 range)
plt.savefig('scatter_offaxis.pdf')

Testing

Added three new tests in test_backend_pdf.py:

  • test_scatter_offaxis_colored_pdf_size
  • test_scatter_offaxis_colored_visual
  • test_scatter_mixed_onoff_axis

PR checklist

Skip emitting markers outside canvas bounds in draw_path_collection
to reduce PDF file size when scatter points are off-axis.

Fixes matplotlib#2488
@FazeelUsmani FazeelUsmani marked this pull request as draft November 12, 2025 09:41
@FazeelUsmani FazeelUsmani marked this pull request as ready for review November 12, 2025 10:29
@jklymak
Copy link
Member

jklymak commented Nov 12, 2025

This seems to remove a marker if it's center is outside the document. What happens if the marker size is large enough that the edge of the marker should still be shown even if the center is not on the page?

# The off-axis colored scatter should be close to empty size
# Allow up to 50KB overhead for axes/metadata, but should be much smaller
# than if all 1000 markers were written (which would add ~200-400KB)
assert size_offaxis_colored < size_empty + 50_000, (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 50kb padding is confusing to me. The axes should be the same output either way (as the limits are the same etc). If there was anything extra I would expect it to be the structures for the scatter that has no paths (there are some open/close group stuff emitted if I recall correctly). To account for that we could ad ax2.scatter([], []) to the empty test.

It is probably worth also adding a third example where there are markers in in the plot (ax3.scatter(x+20, y+20, c=c)) to test that fully off axis one is about the same size as the empty one and both are smaller than the one with visible markers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the tests per your suggestions:

  1. Added ax2.scatter([], []) to the baseline test to match the axes structure exactly
  2. Reduced tolerance from 50KB to 5KB since axes output is now identical
  3. Added a third test case with visible markers (x + 20, y + 20, c=c) that validates:
    • Visible scatter is >50KB larger than empty
    • Visible scatter is >50KB larger than off-axis

Now, we can compare all three outputs:

  • fully off-axis scatter --> small file
  • empty scatter --> small file
  • visible scatter --> larger file

# may be partially visible even if its center is outside the canvas.
canvas_width = self.file.width * 72
canvas_height = self.file.height * 72
if not (-max_marker_extent <= xo <= canvas_width + max_marker_extent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you use a maximum extent rather than doing the computation here and having per-maker and per-direction filtering?

We have to compute the extent of every maker no matter what so might as well get the better behavior by doing it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially used a single max_marker_extent to simplify the bounds check and avoid recalculating the extent for each marker.

Yes, you're right - per-marker filtering is more precise. I've updated the commit.

bbox = path.get_extents(transform)
max_marker_extent = max(max_marker_extent,
bbox.width / 2, bbox.height / 2)
name = self.file.pathCollectionObject(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this write something into the file our only update our Python side sate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not write into file immediately but seems it writes after all drawing is done. Let me fix this by creating path templates only for used path_ids and compute extents for creted paths.

# Get the bounding box of the transformed marker path.
# Use get_extents() which is more efficient than transforming
# all vertices, and add padding for stroke width.
bbox = path.get_extents(transform)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test that this works as expected with log scale and polar?

@FazeelUsmani
Copy link
Contributor Author

This seems to remove a marker if it's center is outside the document. What happens if the marker size is large enough that the edge of the marker should still be shown even if the center is not on the page?

Great catch! You're absolutely right - the initial implementation didn't account for marker size. I've
updated the code to compute per-marker extents (bounding boxes) and only skip markers that are
completely outside the canvas.

The new implementation:

  • Calculates the extent (half-width, half-height) for each marker path
  • Checks if the marker's bounding box intersects the canvas: (-extent_x <= xo <= canvas_width + extent_x)
  • Only skips markers where the entire bounding box is outside the visible area

I've also added test_scatter_large_markers_partial_clip which specifically tests markers with centers
outside the canvas but edges extending into the visible area, checked and these are now correctly rendered.

@jklymak
Copy link
Member

jklymak commented Nov 13, 2025

@FazeelUsmani that looks correc. Nice job!

However, now I wonder if we are trading file-size bloat for run-time bloat? This seems to check the extent of every marker - what if I have 10,000 markers, and they are all in the axes - will this be super slow in needlessly getting all the extents? Does a hybrid of the two approaches seem useful (eg only update the extent if the markers centre is not in the figure?)

@FazeelUsmani
Copy link
Contributor Author

@jklymak Yes, hybrid suits well here. We will get extent only if the marker is off the grid.
Added the changes.

@FazeelUsmani
Copy link
Contributor Author

@jklymak, it's ready for review.

Copy link
Member

@jklymak jklymak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@FazeelUsmani
Copy link
Contributor Author

Hi @tacaswell, can you please take another look?

@FazeelUsmani
Copy link
Contributor Author

Thanks! for approving this PR @jklymak.

I've a question: Usually, when the PR is merged after changes are accepted? Is it before release?

@jklymak
Copy link
Member

jklymak commented Nov 18, 2025

The PR requires two approvals and then will be merged

@FazeelUsmani
Copy link
Contributor Author

I see. @jklymak, can you please look at this PR #30756? It has one approval, just need another.

@FazeelUsmani
Copy link
Contributor Author

The PR requires two approvals and then will be merged

@tacaswell waiting for you. Let me know if you've any questions or need clarifications

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Off-axes scatter() points unnecessarily saved to PDF when coloured

3 participants