Conversation
|
Just a quick note to say that overall this looks good and I think it is going in the right direction. I'll give more detailed feedback early next week. For code comments you are already using correct sentence-case but could you also add a full stop/period at the end of each line. This makes it easier to change comments into documentation. It isn't a rule that has been adhered to in the existing code but I would like to improve that going forward. |
|
Man, i wish i discovered this pull early, i wanted to know if a cell is stricken or not. Sat whole day and edited the code myself. Came till adding if stricken or not from the styles. Learnt a lot from my experience though. |
|
Is the data and styles coming together? |
Data and Formats have always been processed separately (e.g. When I finish the base work, I'm going to look into optimizations:
|
|
i got it, even formulae were being used similarly. Was able to satisfy my requirement with Another doubt, why every item in the style is an option? hope i am not bothering you.😊 |
@azarmadr this is because the particular style may not exist for a given cell. |
|
@jmcnamara We're currently using this fork in production and it's working well. What do I need to do to get this merged with Calamine? Some things to discuss:
|
You should create a non-draft PR with all the WIP files and code removed. I will try to do a review on the current code.
Overall I think it is okay but I may have specific comments during review.
Copy the style of the Xlsx documentation:
https://docs.rs/calamine/latest/calamine/struct.Xlsx.html#method.merged_regions_by_sheet
That is always good. :-)
I think not.
Probably not but lets try get it right for Xlsx first. |
STYLE_FEATURE.md
Outdated
| @@ -0,0 +1,253 @@ | |||
| # Style Support in Calamine | |||
There was a problem hiding this comment.
Remove this files from the final PR or convert it to documentation.
| /// The cell value | ||
| pub value: Data, | ||
| /// The cell style | ||
| pub style: Option<Style>, |
There was a problem hiding this comment.
Strictly speaking the style is never optional in the Excel file formats. There is always an implicit 0 style that is the default and is defined in the theme.xml file. Note, I'm not necessarily saying that it shouldn't be an Option<>.
src/datatype.rs
Outdated
| self.style = None; | ||
| } | ||
|
|
||
| /// Check if the cell has style information |
There was a problem hiding this comment.
Again, an Excel cell will always have style information even if it is implicit for size reasons.
src/xlsx/cells_reader.rs
Outdated
| let mut style = None; | ||
|
|
||
| // Extract style ID if present | ||
| if let Ok(Some(style_id_str)) = |
There was a problem hiding this comment.
Again, the style id is 0 and not None if not present.
| } | ||
|
|
||
| /// Get effective column width (custom or default) | ||
| pub fn get_effective_column_width(&self, column: u32) -> f64 { |
There was a problem hiding this comment.
This effective width is probably not the width that the user would expect and also may be dependent on the default font for the file.
See the following for the calculation in the other direction: https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/worksheet.rs#L19340-L19368
However, there may not be any simple way to return the correct value in all cases.
| } | ||
|
|
||
| /// Get effective row height (custom or default) | ||
| pub fn get_effective_row_height(&self, row: u32) -> f64 { |
There was a problem hiding this comment.
Same comment for row height although the height calculation is simpler.
src/xlsx/style_parser.rs
Outdated
| use crate::style::*; | ||
| use crate::XlsxError; | ||
|
|
||
| /// Get default Excel 2007+ theme color scheme (Office theme) |
There was a problem hiding this comment.
Avoid using /// for non-public non-doc comments. Also this is an example of where proper sentence punctuation should be applied.
|
First review done. Some general comments:
|
|
@ddimaria Any update on this? |
src/xlsx/cells_reader.rs
Outdated
| }; | ||
|
|
||
| if style_id < self.styles.len() { | ||
| style = Some(self.styles[style_id].clone()); |
There was a problem hiding this comment.
I'm not sure if this is in scope for this project, but at work we've been dealing with a lot of very large workbooks and I'm concerned about the performance effect of cloning the style for each cell (especially because the style object includes strings).
Personally I'd prefer just exposing the style ID and also the styles table and allowing users to map it as required. Although I understand the ergonomics of that are not as nice, I still think it would avoid some potential performance impact.
To be clear I'm saying this without any data to back it up, but anecdotally I made a change to how this library parses formulas and got a significant* speed improvement by avoiding extra allocations for each cell.
Here's an example from a different PR: #568 (comment)
Let me know if you have any thoughts.
There was a problem hiding this comment.
@louisdewar I share your concern. A good data structure for styles is Run Length Encoding (since styles often repeat). I want to add that feature, as originally discussed here: #538 (comment), though it's been tabled in this PR by the maintainer: #538 (comment)
There was a problem hiding this comment.
hough it's been tabled in this PR by the maintainer: #538 (comment)
I tabled it because "stretch goal" sounded like a nice to have. If, in fact, it is something that is required to get reasonable performance then that will need to be done before merge.
There was a problem hiding this comment.
I'd like to get some perf number before implementing. We've been using this branch in production and haven't noticed perf issues.
There was a problem hiding this comment.
I'm happy to run some very basic benchmarking today and show the results. I'll just comment out the clone and see what impact that has and then do some flamegraphs to see if I can see a meaningful difference.
There was a problem hiding this comment.
Thanks @louisdewar! I want to do the same, time permitting, with the addition of comparing perf of style vs non-style parts of the library. It also won't take very long to implement RLE.
There was a problem hiding this comment.
Sorry for the delay.
Simple benchmark:
use std::env;
use calamine::{open_workbook_auto, Reader};
fn main() {
let file_path = env::args().nth(1).expect("Please provide a file path");
let mut workbook = open_workbook_auto(&file_path)
.expect("Failed to open workbook");
for sheet_name in workbook.sheet_names() {
let range = workbook
.worksheet_range(&sheet_name)
.expect("Failed to read worksheet");
for row in range.rows() {
for cell in row {
std::hint::black_box(cell);
}
}
}
}Before
# warmup = 2
Benchmark 1: ./target/release/examples/parse_tmp ~/Downloads/DESRI_BASE.xlsm
Time (mean ± σ): 1.285 s ± 0.005 s [User: 1.206 s, System: 0.075 s]
Range (min … max): 1.275 s … 1.292 s 10 runs
https://share.firefox.dev/4i0pUAo
7.2% of total extraction time is spent doing a specific style.clone().
After removing .clones
I'd imagine we would have comparible performance between this and just storing the style ID for each cell.
$ git diff
diff --git a/src/xlsx/cells_reader.rs b/src/xlsx/cells_reader.rs
index b1b1168..e60dc2f 100644
--- a/src/xlsx/cells_reader.rs
+++ b/src/xlsx/cells_reader.rs
@@ -132,7 +132,7 @@ where
};
if style_id < self.styles.len() {
- style = Some(self.styles[style_id].clone());
+ // style = Some(self.styles[style_id].clone());
}
loop {
@@ -211,7 +211,7 @@ where
};
if style_id < self.styles.len() {
- style = Some(self.styles[style_id].clone());
+ // style = Some(self.styles[style_id].clone());
}
loop {
@@ -352,7 +352,8 @@ where
{
if let Ok(style_id) = atoi_simd::parse::<usize>(style_id_str) {
if style_id < self.styles.len() {
- self.styles[style_id].clone()
+ Style::new()
+ // self.styles[style_id].clone()
} else {
Style::new()
}
# warmup = 2
Benchmark 1: ./target/release/examples/parse_tmp ~/Downloads/DESRI_BASE.xlsm
Time (mean ± σ): 1.103 s ± 0.005 s [User: 1.038 s, System: 0.061 s]
Range (min … max): 1.093 s … 1.110 s 10 runs
https://share.firefox.dev/47WkCBo
About a 14% reduction in time.
Notes
I assume the easiest way to get this speedup is to simply store style ID in the cells and expose the style table.
I haven't profiled memory, I'm not sure what the best test of this is but I imagine this would also show a reasonable impact assuming you were storing the entire workbook in memory e.g. doing a .collect() on the cell reader.
Also this benchmark could easily be biased, I just picked a large xlsx I was using at work, other workbooks might show something different.
My laptop is quite fast (m4 pro), I can imagine the different is a lot bigger on slower servers etc...
|
I'm also willing to help out on this if that would be useful |
|
@louisdewar Thanks. That would be great. Could you start by rebasing the PR to the latest HEAD, fixing any issues, and creating a new PR. Or, if you want to submit your own approach that is fine too. |
|
@jmcnamara sounds good. I should have some time in the next few weeks. I think my preference would be to squash all these commits into 1 with a Co-Authored-By and make changes on top or figure out ways to make smaller incremental PRs - I'll decide once I have time to dig into this in detail. Also it might be good to let me know if you have any thoughts about implementation / things you want to see before I start trying to tidy this up. |
Nothing major. I made some comments above on the original PR. My main change would be to remove the
I don't know if you saw this comment/proposal (See also: #597) but there is now a Rust based Excel number formatting crate called ssfmt that could be plugged in so that the user could have access to a fn get_formatted_number(&self) -> Option<String> {}All the style IDs and lookups could then be hidden in the background from the user. |
@jmcnamara I've been a bit busy and haven't been able to return to this. My next body of work is to improve the performance of excel imports (using Calamine), so I can handle this then. That said, if there is a pressing need and someone wants to take it over, they certainly can. Here is my plan:
Let me know how you'd like to proceed. Thanks! -David |
|
Thanks. As a first step it would be good to have a rebased PR with the CI checks passing. Maybe submit that as a new PR to supersede this one. Due to the number of comments this page often fails to load. |
|
Finally back on this 😄 |
|
I've merged main into this branch, resolving conflicts and fixing tests. I also have a 1M cell style file being used in new benchmark tests (used criterion). I'm moving on to flamegraph and will start improvements based on the results. |
|
I just implemented RLE and other optimizations. I'll close this PR and open a new one tomorrow, and will squash all of my commits. The perf is a lot better now, and there is less of a memory footprint. |
|
I added support for Rich Text, which handles various formats within a single cell. This is how excel stores it: <si>
<r>
<rPr>
<b/> <!-- bold -->
<sz val="11"/>
<color rgb="FF0000FF"/>
<rFont val="Calibri"/>
</rPr>
<t>Bold blue text</t>
</r>
<r>
<rPr>
<i/> <!-- italic -->
<sz val="11"/>
<color rgb="FFFF0000"/>
</rPr>
<t> italic red text</t>
</r>
</si> |
Resolves merge conflicts in Cargo.toml (rstest version) and tests/test.rs (pivot table tests vs style tests — keep both). This merge adds: - Style struct with Font/Fill/Borders/Alignment/NumberFormat/Protection - styles.xml parser (src/xlsx/style_parser.rs) - worksheet_style() API returning row×col style grid - RLE-compressed StyleRange for memory efficiency Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
I noticed a bug in colors where tints weren't being handled. This is now implemented. |
|
@jmcnamara I need to implement conditional formatting and validations, is it OK to add them to this PR? |
|
@ddimaria This PR is already very large. It would be best to get this stabilized first (is it suitable for review now?). You can create other PRs that are dependent on this on. Conditional formatting and Data Validation should be separate PRs. |
That works for me. I just rebased my fork's commits onto upstream/master and squashed into a single commit. |
is this PR now ready for review and can be merge? 👀 |
@Erik1000 I believe it is, though some conflicts just popped up. Let me review those and CI and it will be ready to review/merge. |
…worksheet layout Made-with: Cursor
|
@jmcnamara this PR is ready for review. I'm staring conditional formatting in another branch. |
Ok. Fix the |
Add style support for Calamine.
Basic api for getting styles
Here is the relevant structs so far: