Increase XLSX reading performance #617
base: master
Hi @agolovenkin, thanks for the pull request. Before we can merge it, we need you to sign our Contributor License Agreement. You can do so electronically here: http://opensource.box.com/cla Once you have signed, just add a comment to this pull request saying, "CLA signed". Thanks!
CLA signed
Verified that @agolovenkin has just signed the CLA. Thanks, and we look forward to your contribution.
Can this be merged with master?
While it works great with the test file, there are some files where it performs worse... That's why I'm not sure whether this should be merged as is.
I have had to downgrade to 2.7. I think the creation of millions of cell objects is what slows 3.0 down, and it would explain the varied benchmark results. Even if I want the values as arrays (to be backward compatible), Spout still creates the objects in the background, then loops through each one and calls getValue to cast the row to an array, which defeats the point of returning arrays.
Alexander Golovenkin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
I have a large XLSX file with about 1 million rows and 5 columns. The first column contains unique values and the last 2 contain the same values. When I read this file, performance was poor. After investigating, I found that in the FileBasedStrategy::getStringAtIndex() method the cache file is re-read for each row, because the string index differs greatly between the first and last columns of the XLSX document.
I have optimized the cache by adding an additional index file that stores the offset and length of each entry. This increases reading speed about 4 times (depending on the column count).
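The offset-and-length index idea can be sketched roughly as follows. This is a minimal Python illustration, not the PHP code from this pull request: the fixed-width record layout and the names `write_cache` and `get_string_at_index` (the latter only borrowed from the method mentioned above) are assumptions made for the example. Because every index record has the same size, the record for string `i` sits at byte `i * record_size`, so a lookup costs two seeks instead of a rescan of the cache file.

```python
import struct

# Assumed record layout (hypothetical): offset and length as two
# unsigned 64-bit little-endian integers, 16 bytes per string.
INDEX_RECORD = struct.Struct("<QQ")


def write_cache(strings, data_path, index_path):
    """Write all strings back to back into data_path and one
    fixed-size (offset, length) record per string into index_path."""
    with open(data_path, "wb") as data, open(index_path, "wb") as index:
        offset = 0
        for text in strings:
            encoded = text.encode("utf-8")
            data.write(encoded)
            # Length is stored in bytes, not characters.
            index.write(INDEX_RECORD.pack(offset, len(encoded)))
            offset += len(encoded)


def get_string_at_index(i, data_path, index_path):
    """Return string number i using one seek in the index file
    and one seek in the data file."""
    if i < 0:
        raise IndexError(f"no string at index {i}")
    with open(index_path, "rb") as index:
        index.seek(i * INDEX_RECORD.size)
        record = index.read(INDEX_RECORD.size)
    if len(record) != INDEX_RECORD.size:
        raise IndexError(f"no string at index {i}")
    offset, length = INDEX_RECORD.unpack(record)
    with open(data_path, "rb") as data:
        data.seek(offset)
        return data.read(length).decode("utf-8")
```

In this sketch the lookup cost does not depend on how far apart consecutive string indexes are, which is the access pattern described above (first and last columns pointing at distant entries). A real reader would likely keep both file handles open between calls rather than reopening them per lookup.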
P.S. I can't attach the whole XLSX file because it is too big, so I attached only 300K rows:
test.xlsx
Current implementation
After optimization