Glob complexity is Quadratic on directory depth #106

jaraco · 2023-07-14T15:51:34Z

In #105, this project re-worked the glob functionality. In that effort, I found that in test_glob_depth, the best-fit complexity was never better than Quadratic. That's why I wrote test_baseline_regex_complexity: it shows that the regex match is Constant in the length of the path, which means matching should be Linear in the number of paths.

It's probably not important, but I'd like to get a good answer for why the test performance isn't better than Quadratic.

jaraco · 2023-07-14T15:57:06Z

@nh2 Perhaps you'd be interested in taking a look to see if you can work out why the performance is Quadratic.

jaraco · 2024-03-13T01:45:08Z

The complexity appears to come from the call to ZipFile.namelist. If I apply this patch:

 zipp main @ git diff
diff --git a/tests/test_complexity.py b/tests/test_complexity.py
index 67e9c17..7b91505 100644
--- a/tests/test_complexity.py
+++ b/tests/test_complexity.py
@@ -39,7 +39,9 @@ class TestComplexity(unittest.TestCase):
         for path, name in pairs:
             zf.writestr(f"{path}{name}.txt", b'')
         zf.filename = "big un.zip"
-        return zipp.Path(zf)
+        res = zipp.Path(zf)
+        res._saved_namelist = res.root.namelist()
+        return res
 
     @classmethod
     def make_names(cls, width, letters=string.ascii_lowercase):
@@ -81,6 +83,7 @@ class TestComplexity(unittest.TestCase):
             max_n=100,
             min_n=1,
         )
+        breakpoint()
         assert best <= big_o.complexities.Quadratic
 
     @pytest.mark.flaky
diff --git a/zipp/__init__.py b/zipp/__init__.py
index a1b9884..e62dc05 100644
--- a/zipp/__init__.py
+++ b/zipp/__init__.py
@@ -399,7 +399,7 @@ class Path:
         prefix = re.escape(self.at)
         tr = Translator(seps='/')
         matches = re.compile(prefix + tr.translate(pattern)).fullmatch
-        return map(self._next, filter(matches, self.root.namelist()))
+        return map(self._next, filter(matches, self._saved_namelist))
 
     def rglob(self, pattern):
         return self.glob(f'**/{pattern}')

With the patch applied, the result comes back as Constant time (in one test run; it's probably Linear).

jaraco · 2024-03-13T01:50:05Z

The problem is that ZipFile.namelist constructs a new list on every call, and that construction is apparently Quadratic in the length of the filelist. Bypassing the list construction restores the expected Linear-or-better performance.
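For illustration, here is a minimal sketch of the "snapshot the names once" idea, using only the standard library. It assumes a plain `zipfile.ZipFile` rather than zipp's own wrapper, so it only shows that `namelist()` rebuilds its list on every call; it does not reproduce the Quadratic measurement. The helpers `build_archive` and `glob_names` and the hand-written regex are hypothetical names for this example, not part of zipp.

```python
import io
import re
import zipfile


def build_archive(depth):
    """Return an in-memory ZipFile with one empty file per directory level."""
    zf = zipfile.ZipFile(io.BytesIO(), "w")
    for level in range(1, depth + 1):
        path = "/".join(["d"] * level)
        zf.writestr(f"{path}/file.txt", b"")
    return zf


def glob_names(names, pattern):
    """Filter an already-materialized list of names with a compiled regex."""
    matches = re.compile(pattern).fullmatch
    return list(filter(matches, names))


zf = build_archive(depth=5)

# The stdlib namelist() builds a fresh list from zf.filelist on each call,
# so two calls give equal but distinct list objects.
first, second = zf.namelist(), zf.namelist()
fresh_list_each_call = first is not second and first == second

# Hypothetical workaround in the spirit of the patch above:
# take one snapshot of the names and reuse it for every match.
saved_namelist = zf.namelist()
txt_files = glob_names(saved_namelist, r".*/file\.txt")
```

A snapshot like `saved_namelist` goes stale if entries are written to the archive afterwards, so this only suits archives that are no longer being modified.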

jaraco added a commit that referenced this issue Jul 14, 2023

Replace TODO with issue #106

47c5fe9

jaraco changed the title ~~Glob complexity is Quadratic~~ Glob complexity is Quadratic on directory depth Jul 14, 2023

jaraco added a commit that referenced this issue Mar 13, 2024

Bypass ZipFile.namelist in glob. Closes #106.

6494cd7

jaraco mentioned this issue Mar 13, 2024

Improve glob performance #113

Merged

jaraco added a commit that referenced this issue Mar 13, 2024

Bypass ZipFile.namelist in glob. Closes #106.

ac8ea7a

jaraco closed this as completed in #113 Mar 13, 2024
