Skip to content
This repository has been archived by the owner on May 26, 2022. It is now read-only.

Increase XLSX reading performance #617

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Increase XLSX reading performance #617

wants to merge 2 commits into from

Conversation

agolovenkin
Copy link

@agolovenkin agolovenkin commented Jan 16, 2019

I have large XLSX file about 1 million rows and 5 columns. First col with unique values and last 2 with the same values. When I try to read this file I saw that performance is not so good. After investigation I found that in FileBasedStrategy::getStringAtIndex() method file with cache is rereading for each row because string index is differ greatly between first and last column of xlsx document.

I have optimized cache by adding additional index file with offset and length for each data and increase reading speed about 4 times (depends on column count).

P.S.

Cant attach whole XLSX file because it is too big. So attached only 300K rows
test.xlsx

<?php

use Box\Spout\Reader\Common\Creator\ReaderFactory;
use Box\Spout\Common\Type;

require_once __DIR__ . '/vendor/autoload.php';

$reader = ReaderFactory::create(Type::XLSX);
$reader->open(__DIR__ . '/test.xlsx');

$time = mktime();
foreach ($reader->getSheetIterator() as $sheet) {
	foreach ($sheet->getRowIterator() as $n => $row) {
		if ($n % 100000 == 0) {
			echo "Read $n in " . (mktime() - $time) . " seconds\n";
		}
	}
	break; // no need to read more sheets
}
echo "Read $n in " . (mktime() - $time) . " seconds\n";

$reader->close();

Current realization

php ./test.php
Read 100000 in 44 seconds
Read 200000 in 92 seconds
Read 300000 in 139 seconds

After optimization

php ./test.php
Read 100000 in 14 seconds
Read 200000 in 30 seconds
Read 300000 in 46 seconds

@boxcla
Copy link

boxcla commented Jan 16, 2019

Hi @agolovenkin, thanks for the pull request. Before we can merge it, we need you to sign our Contributor License Agreement. You can do so electronically here: http://opensource.box.com/cla

Once you have signed, just add a comment to this pull request saying, "CLA signed". Thanks!

@agolovenkin
Copy link
Author

CLA signed

@boxcla
Copy link

boxcla commented Jan 16, 2019

Verified that @agolovenkin has just signed the CLA. Thanks, and we look forward to your contribution.

@adrilo adrilo changed the base branch from develop_3.0 to master May 24, 2019 07:49
@developedsoftware
Copy link

Can this be merged with master?

@adrilo
Copy link
Collaborator

adrilo commented Nov 13, 2019

While it works great with the test file, there are some files where it performs worse... That's why I'm not sure whether this should be merged as is

@developedsoftware
Copy link

developedsoftware commented Nov 13, 2019

I have had to down grade to 2.7. I actually think the creation of millions of cell objects is what is slowing 3.0 down and would explain the varied performance benchmarks.

Even if I want the values as arrays - to be backward compatible - spout creates the objects in the background anyway and then loops through each one and calls getValue to cast the row as an array. Which defeats the point of returning them as an array.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Alexander Golovenkin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants