Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(str): add invariant to avoid unexpected errors when parsing an invalid UTF8 string #410

Merged
merged 6 commits into from
Jul 18, 2023

Conversation

devnix
Copy link
Contributor

@devnix devnix commented Jun 17, 2023

Fixes #409

@devnix devnix marked this pull request as ready for review June 17, 2023 22:21
@devnix
Copy link
Contributor Author

devnix commented Jun 17, 2023

I took the approach of adding an invariant checking that the problematic parameter is a valid utf8 before doing anything else.

Maybe an error handling around preg_replace would feel safer but also this approach felt cleaner and more efficient to my eyes, let me know about anything I can improve or rewrite!

@veewee veewee added Priority: Medium This issue may be useful, and needs some attention. Status: Review Needed The issue has a PR attached to it which needs to be reviewed. Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors. labels Jun 18, 2023
@coveralls
Copy link

coveralls commented Jun 18, 2023

Pull Request Test Coverage Report for Build 5437950104

  • 20 of 20 (100.0%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.004%) to 99.046%

Totals Coverage Status
Change from base Build 5294391869: 0.004%
Covered Lines: 4151
Relevant Lines: 4191

💛 - Coveralls

Copy link
Collaborator

@veewee veewee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!
I've added a few small remarks

src/Psl/Type/Internal/PositiveIntType.php Outdated Show resolved Hide resolved
src/Psl/Str/trim.php Outdated Show resolved Hide resolved
@veewee
Copy link
Collaborator

veewee commented Jun 21, 2023

@devnix I took the liberty of performing some benchmarks here:

#[Groups(['str'])]
final class Utf8Bench
{
    public function benchValidUtf8CheckThroughPreg(): void
    {
        preg_match('/./u', "HELLO");
    }

    public function benchValidUtf8CheckThroughEncodingAndPreg(): void
    {
        Psl\invariant(Str\is_utf8("HELLO"), 'Expected $string to be a valid UTF-8 string.');
        preg_match('/./u', "HELLO");
    }
}

The results are rather painful:

./vendor/bin/phpbench run --config config/phpbench.json --group=str --report=benchmark                                                                                                                  [13:37:38]
PHPBench (1.2.7) running benchmarks... #standwithukraine
with configuration file: /Users/verweto/Projects/github/azjezz/psl/config/phpbench.json
with PHP version 8.1.10, xdebug ✔, opcache ❌

\Psl\Tests\Benchmark\Str\Utf8Bench

    benchValidUtf8CheckThroughPreg..........R1 I2 - Mo0.700μs (±0.00%)
    benchValidUtf8CheckThroughEncodingAndPr.R4 I4 - Mo4.103μs (±1.18%)

Subjects: 2, Assertions: 0, Failures: 0, Errors: 0
Utf8Bench
=========

Average iteration times by variant

4.1μs     │   █
3.6μs     │   █
3.1μs     │   █
2.6μs     │   █
2.1μs     │   █
1.5μs     │   █
1.0μs     │ ▃ █
0.5μs     │ █ █
          └─────
            1 2

[█ <current>]

1: benchValidUtf8Check᠁ 2: benchValidUtf8Check᠁

Memory by variant

6.1mb     │ █ █
5.3mb     │ █ █
4.6mb     │ █ █
3.8mb     │ █ █
3.1mb     │ █ █
2.3mb     │ █ █
1.5mb     │ █ █
762.8kb   │ █ █
          └─────
            1 2

[█ <current>]

1: benchValidUtf8Check᠁ 2: benchValidUtf8Check᠁

<current>
+----------------------------------------------+---------+---------+--------+---------+
| subject                                      | memory  | mode    | rstdev | stdev   |
+----------------------------------------------+---------+---------+--------+---------+
| benchValidUtf8CheckThroughPreg ()            | 6.065mb | 0.700μs | ±0.00% | 0.000μs |
| benchValidUtf8CheckThroughEncodingAndPreg () | 6.102mb | 4.103μs | ±1.18% | 0.049μs |
+----------------------------------------------+---------+---------+--------+---------+

As you can see, this way is close to 6 times slower for successcases in comparison with dealing with the null return of preg_x.
It might not be the best of ideas - so I let @azjezz decide on what to do with this :)

@azjezz
Copy link
Owner

azjezz commented Jun 21, 2023

I think using is_utf8 is not the right solution indeed, we could either manually check for null return, or use Psl\Regex if it doesn't introduce as much of an overhead.

@veewee
Copy link
Collaborator

veewee commented Jun 21, 2023

@azjezz Psl\Regex is about 4.6x slower than regular preg_ - yet it is 4.600μs so not really that big of a time :)
It seems like in this run, that the utf8 invariant is performing similarely to the Psl\Regex, but can't beat a simple null check.

#[Groups(['str'])]
final class Utf8Bench
{
    public function benchValidUtf8CheckThroughPreg(): void
    {
        if (null === preg_replace('/./u', '', "HELLO")) {
            throw new \RuntimeException('xxx');
        }
    }

    public function benchValidUtf8CheckThroughEncodingAndPreg(): void
    {
        Psl\invariant(Str\is_utf8("HELLO"), 'Expected $string to be a valid UTF-8 string.');
        preg_replace('/./u', '', "HELLO");
    }

    public function benchValidUtfWithPslRegex(): void
    {
        Regex\replace('HELLO', '/./u', '');
    }
}
PHPBench (1.2.7) running benchmarks... #standwithukraine
with configuration file: /Users/verweto/Projects/github/azjezz/psl/config/phpbench.json
with PHP version 8.1.10, xdebug ✔, opcache ❌

\Psl\Tests\Benchmark\Str\Utf8Bench

    benchValidUtf8CheckThroughPreg..........R1 I3 - Mo1.000μs (±0.00%)
    benchValidUtf8CheckThroughEncodingAndPr.R1 I4 - Mo4.580μs (±2.58%)
    benchValidUtfWithPslRegex...............R1 I3 - Mo4.697μs (±1.05%)

Subjects: 3, Assertions: 0, Failures: 0, Errors: 0
Utf8Bench
=========

Average iteration times by variant

4.7μs     │   ▇ █
4.1μs     │   █ █
3.5μs     │   █ █
2.9μs     │   █ █
2.3μs     │   █ █
1.8μs     │   █ █
1.2μs     │ ▆ █ █
0.6μs     │ █ █ █
          └───────
            1 2 3

[█ <current>]

1: benchValidUtf8Check᠁ 2: benchValidUtf8Check᠁ 3: benchValidUtfWithPs᠁

Memory by variant

6.1mb     │ █ █ █
5.3mb     │ █ █ █
4.6mb     │ █ █ █
3.8mb     │ █ █ █
3.1mb     │ █ █ █
2.3mb     │ █ █ █
1.5mb     │ █ █ █
762.9kb   │ █ █ █
          └───────
            1 2 3

[█ <current>]

1: benchValidUtf8Check᠁ 2: benchValidUtf8Check᠁ 3: benchValidUtfWithPs᠁

<current>
+----------------------------------------------+---------+---------+--------+---------+
| subject                                      | memory  | mode    | rstdev | stdev   |
+----------------------------------------------+---------+---------+--------+---------+
| benchValidUtf8CheckThroughPreg ()            | 6.066mb | 1.000μs | ±0.00% | 0.000μs |
| benchValidUtf8CheckThroughEncodingAndPreg () | 6.103mb | 4.580μs | ±2.58% | 0.117μs |
| benchValidUtfWithPslRegex ()                 | 6.066mb | 4.697μs | ±1.05% | 0.049μs |
+----------------------------------------------+---------+---------+--------+---------+

@azjezz
Copy link
Owner

azjezz commented Jun 21, 2023

Yep, i guess we just do a null check here then, we should try to minimise the performance difference between Psl and PHP stdlib, even though in this case, PHP does not have an alternative for these string functions ( at least not mb variants ).

src/Psl/Str/replace_ci.php Outdated Show resolved Hide resolved
Copy link
Collaborator

@veewee veewee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Added a small question about the suppress here - but apart from that it looks to do what it needs to do :)

src/Psl/Str/replace_ci.php Outdated Show resolved Hide resolved
src/Psl/Str/replace_ci.php Outdated Show resolved Hide resolved
src/Psl/Str/replace_ci.php Outdated Show resolved Hide resolved
src/Psl/Str/trim.php Outdated Show resolved Hide resolved
src/Psl/Str/trim_left.php Outdated Show resolved Hide resolved
src/Psl/Str/trim_right.php Outdated Show resolved Hide resolved
@devnix devnix requested review from veewee and azjezz July 3, 2023 06:06
Copy link
Collaborator

@veewee veewee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the Regex component is probably the best option - even though it's less performant than other options.

@azjezz azjezz added Status: Completed Nothing further to be done with this issue. Awaiting to be closed by the requestor out of politeness and removed Status: Review Needed The issue has a PR attached to it which needs to be reviewed. labels Jul 17, 2023
@azjezz azjezz merged commit 1d015ad into azjezz:next Jul 18, 2023
@devnix devnix deleted the str-fix-utf8-overlong branch July 19, 2023 12:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: Medium This issue may be useful, and needs some attention. Status: Completed Nothing further to be done with this issue. Awaiting to be closed by the requestor out of politeness Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Return value of Psl\Str\trim() must be of the type string, null returned
4 participants