Increase XLSX reading performance #617

agolovenkin · 2019-01-16T13:38:04Z

I have large XLSX file about 1 million rows and 5 columns. First col with unique values and last 2 with the same values. When I try to read this file I saw that performance is not so good. After investigation I found that in FileBasedStrategy::getStringAtIndex() method file with cache is rereading for each row because string index is differ greatly between first and last column of xlsx document.

I have optimized cache by adding additional index file with offset and length for each data and increase reading speed about 4 times (depends on column count).

P.S.

Cant attach whole XLSX file because it is too big. So attached only 300K rows
test.xlsx

<?php

use Box\Spout\Reader\Common\Creator\ReaderFactory;
use Box\Spout\Common\Type;

require_once __DIR__ . '/vendor/autoload.php';

$reader = ReaderFactory::create(Type::XLSX);
$reader->open(__DIR__ . '/test.xlsx');

$time = mktime();
foreach ($reader->getSheetIterator() as $sheet) {
	foreach ($sheet->getRowIterator() as $n => $row) {
		if ($n % 100000 == 0) {
			echo "Read $n in " . (mktime() - $time) . " seconds\n";
		}
	}
	break; // no need to read more sheets
}
echo "Read $n in " . (mktime() - $time) . " seconds\n";

$reader->close();

Current realization

php ./test.php
Read 100000 in 44 seconds
Read 200000 in 92 seconds
Read 300000 in 139 seconds

After optimization

php ./test.php
Read 100000 in 14 seconds
Read 200000 in 30 seconds
Read 300000 in 46 seconds

boxcla · 2019-01-16T13:38:15Z

Hi @agolovenkin, thanks for the pull request. Before we can merge it, we need you to sign our Contributor License Agreement. You can do so electronically here: http://opensource.box.com/cla

Once you have signed, just add a comment to this pull request saying, "CLA signed". Thanks!

agolovenkin · 2019-01-16T13:57:14Z

CLA signed

boxcla · 2019-01-16T13:57:16Z

Verified that @agolovenkin has just signed the CLA. Thanks, and we look forward to your contribution.

developedsoftware · 2019-11-07T14:49:03Z

Can this be merged with master?

adrilo · 2019-11-13T20:40:06Z

While it works great with the test file, there are some files where it performs worse... That's why I'm not sure whether this should be merged as is

developedsoftware · 2019-11-13T23:46:08Z

I have had to down grade to 2.7. I actually think the creation of millions of cell objects is what is slowing 3.0 down and would explain the varied performance benchmarks.

Even if I want the values as arrays - to be backward compatible - spout creates the objects in the background anyway and then loops through each one and calls getValue to cast the row as an array. Which defeats the point of returning them as an array.

CLAassistant · 2020-03-03T19:02:01Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Alexander Golovenkin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

XLSX performance

abddc0b

Fix coding style

9560565

agolovenkin mentioned this pull request Jan 17, 2019

Forever to parse 150,000 row file #585

Closed

adrilo changed the base branch from develop_3.0 to master May 24, 2019 07:49

Slamdunk mentioned this pull request Mar 3, 2022

Not just another custom column widths PR #715

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase XLSX reading performance #617

Increase XLSX reading performance #617

agolovenkin commented Jan 16, 2019 •

edited

Loading

boxcla commented Jan 16, 2019

agolovenkin commented Jan 16, 2019

boxcla commented Jan 16, 2019

developedsoftware commented Nov 7, 2019

adrilo commented Nov 13, 2019

developedsoftware commented Nov 13, 2019 •

edited

Loading

CLAassistant commented Mar 3, 2020

Increase XLSX reading performance #617

Are you sure you want to change the base?

Increase XLSX reading performance #617

Conversation

agolovenkin commented Jan 16, 2019 • edited Loading

boxcla commented Jan 16, 2019

agolovenkin commented Jan 16, 2019

boxcla commented Jan 16, 2019

developedsoftware commented Nov 7, 2019

adrilo commented Nov 13, 2019

developedsoftware commented Nov 13, 2019 • edited Loading

CLAassistant commented Mar 3, 2020

agolovenkin commented Jan 16, 2019 •

edited

Loading

developedsoftware commented Nov 13, 2019 •

edited

Loading