-
-
Notifications
You must be signed in to change notification settings - Fork 8.1k
Fix PDF bloat for off-axis scatter with per-point colors #30746
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix PDF bloat for off-axis scatter with per-point colors #30746
Conversation
Skip emitting markers outside canvas bounds in draw_path_collection to reduce PDF file size when scatter points are off-axis. Fixes matplotlib#2488
|
This seems to remove a marker if it's center is outside the document. What happens if the marker size is large enough that the edge of the marker should still be shown even if the center is not on the page? |
| # The off-axis colored scatter should be close to empty size | ||
| # Allow up to 50KB overhead for axes/metadata, but should be much smaller | ||
| # than if all 1000 markers were written (which would add ~200-400KB) | ||
| assert size_offaxis_colored < size_empty + 50_000, ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 50kb padding is confusing to me. The axes should be the same output either way (as the limits are the same etc). If there was anything extra I would expect it to be the structures for the scatter that has no paths (there are some open/close group stuff emitted if I recall correctly). To account for that we could ad ax2.scatter([], []) to the empty test.
It is probably worth also adding a third example where there are markers in in the plot (ax3.scatter(x+20, y+20, c=c)) to test that fully off axis one is about the same size as the empty one and both are smaller than the one with visible markers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the tests per your suggestions:
- Added
ax2.scatter([], [])to the baseline test to match the axes structure exactly - Reduced tolerance from 50KB to 5KB since axes output is now identical
- Added a third test case with visible markers (x + 20, y + 20, c=c) that validates:
- Visible scatter is >50KB larger than empty
- Visible scatter is >50KB larger than off-axis
Now, we can compare all three outputs:
- fully off-axis scatter --> small file
- empty scatter --> small file
- visible scatter --> larger file
| # may be partially visible even if its center is outside the canvas. | ||
| canvas_width = self.file.width * 72 | ||
| canvas_height = self.file.height * 72 | ||
| if not (-max_marker_extent <= xo <= canvas_width + max_marker_extent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you use a maximum extent rather than doing the computation here and having per-maker and per-direction filtering?
We have to compute the extent of every maker no matter what so might as well get the better behavior by doing it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially used a single max_marker_extent to simplify the bounds check and avoid recalculating the extent for each marker.
Yes, you're right - per-marker filtering is more precise. I've updated the commit.
| bbox = path.get_extents(transform) | ||
| max_marker_extent = max(max_marker_extent, | ||
| bbox.width / 2, bbox.height / 2) | ||
| name = self.file.pathCollectionObject( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this write something into the file our only update our Python side sate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not write into file immediately but seems it writes after all drawing is done. Let me fix this by creating path templates only for used path_ids and compute extents for creted paths.
| # Get the bounding box of the transformed marker path. | ||
| # Use get_extents() which is more efficient than transforming | ||
| # all vertices, and add padding for stroke width. | ||
| bbox = path.get_extents(transform) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test that this works as expected with log scale and polar?
Great catch! You're absolutely right - the initial implementation didn't account for marker size. I've The new implementation:
I've also added |
|
@FazeelUsmani that looks correc. Nice job! However, now I wonder if we are trading file-size bloat for run-time bloat? This seems to check the extent of every marker - what if I have 10,000 markers, and they are all in the axes - will this be super slow in needlessly getting all the extents? Does a hybrid of the two approaches seem useful (eg only update the extent if the markers centre is not in the figure?) |
|
@jklymak Yes, hybrid suits well here. We will get extent only if the marker is off the grid. |
|
@jklymak, it's ready for review. |
jklymak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
|
Hi @tacaswell, can you please take another look? |
|
Thanks! for approving this PR @jklymak. I've a question: Usually, when the PR is merged after changes are accepted? Is it before release? |
|
The PR requires two approvals and then will be merged |
@tacaswell waiting for you. Let me know if you've any questions or need clarifications |
Fixes #2488
PR summary
Problem
When using
scatter()with per-point colors and then moving the axes limits so all points are off-screen, the PDF backend still writes all marker paths to the file, resulting in unnecessarily large PDFs (~400KB instead of ~7KB for 1000 points).Solution
Added bounds checking in
draw_path_collection(backend_pdf.py:2121-2125) to skip markers outside the visible canvas. This mirrors the existing optimization already present indraw_markers(lines 2157-2159).Implementation
(xo, yo)against canvas bounds(0, 0)to(width*72, height*72)output()call for out-of-bounds markersReproduction Example
Testing
Added three new tests in
test_backend_pdf.py:test_scatter_offaxis_colored_pdf_sizetest_scatter_offaxis_colored_visualtest_scatter_mixed_onoff_axisPR checklist