
Devirtualize UTF8 with a specialized text converter #6016

Draft
bbowyersmyth wants to merge 2 commits into npgsql:main from bbowyersmyth:users/bruceb/utf8

Conversation

@bbowyersmyth
Contributor

Encoding.UTF8 actually returns a UTF8EncodingSealed object (https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/UTF8Encoding.Sealed.cs), which allows its methods to be devirtualized when accessed via that static property.

Npgsql is creating its own UTF8Encoding objects that don't allow for this specialization.

Using a custom UTF8Encoding, plus a specialized TextConverter when that encoding is in use, gives better performance. I did not see much performance difference from the remaining stack-allocation optimizations in UTF8EncodingSealed.
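As a rough illustration of the approach (the type and member names below are hypothetical, not Npgsql's actual ones): a sealed encoding subclass, referenced through a field of that sealed static type, lets the JIT emit direct rather than virtual calls.

```csharp
using System;
using System.Text;

// Sealed, so the JIT can devirtualize calls made through this static type.
// Configured like Npgsql's encoding: no BOM, throw on invalid bytes.
public sealed class PgUtf8Encoding : UTF8Encoding
{
    public static readonly PgUtf8Encoding Instance = new();
    public PgUtf8Encoding()
        : base(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true) { }
}

public sealed class Utf8TextConverter
{
    // Field is typed as the sealed class, so the calls below are direct.
    readonly PgUtf8Encoding _encoding = PgUtf8Encoding.Instance;

    public string Read(ReadOnlySpan<byte> bytes) => _encoding.GetString(bytes);

    public int Write(ReadOnlySpan<char> chars, Span<byte> destination)
        => _encoding.GetBytes(chars, destination);
}
```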

Before

| Method | Value                | Mean      | Error    | StdDev   | Op/s         | Gen0   | Allocated |
|--------|----------------------|----------:|---------:|---------:|-------------:|-------:|----------:|
| Read   | x                    |  24.72 ns | 0.098 ns | 0.087 ns | 40,449,395.2 | 0.0023 |      24 B |
| Write  | x                    |  NA       | NA       | NA       | NA           | NA     |      NA   |
| Read   | xxxxxxxxxx           |  26.19 ns | 0.078 ns | 0.069 ns | 38,187,844.1 | 0.0046 |      48 B |
| Write  | xxxxxxxxxx           |  18.96 ns | 0.027 ns | 0.024 ns | 52,748,503.0 | 0.0000 |       -   |
| Read   | xxxx(...)xxxx [100]  |  35.01 ns | 0.649 ns | 0.575 ns | 28,564,979.3 | 0.0214 |     224 B |
| Write  | xxxx(...)xxxx [100]  |  24.38 ns | 0.071 ns | 0.059 ns | 41,019,998.9 | 0.0005 |       5 B |
| Read   | xxxx(...)xxxx [1000] | 107.83 ns | 0.627 ns | 0.556 ns |  9,273,447.3 | 0.1935 |    2024 B |
| Write  | xxxx(...)xxxx [1000] |  71.47 ns | 0.165 ns | 0.147 ns | 13,992,508.8 | 0.0051 |      54 B |

After

| Method | Value                | Mean     | Error    | StdDev   | Op/s         | Gen0   | Allocated |
|--------|----------------------|---------:|---------:|---------:|-------------:|-------:|----------:|
| Read   | x                    | 18.88 ns | 0.161 ns | 0.135 ns | 52,956,082.8 | 0.0023 |      24 B |
| Write  | x                    | NA       | NA       | NA       | NA           | NA     |      NA   |
| Read   | xxxxxxxxxx           | 20.33 ns | 0.093 ns | 0.078 ns | 49,191,801.5 | 0.0046 |      48 B |
| Write  | xxxxxxxxxx           | 18.83 ns | 0.019 ns | 0.015 ns | 53,103,167.5 | 0.0000 |       -   |
| Read   | xxxx(...)xxxx [100]  | 27.33 ns | 0.091 ns | 0.085 ns | 36,587,970.4 | 0.0214 |     224 B |
| Write  | xxxx(...)xxxx [100]  | 21.84 ns | 0.028 ns | 0.023 ns | 45,783,841.6 | 0.0005 |       5 B |
| Read   | xxxx(...)xxxx [1000] | 98.62 ns | 0.773 ns | 0.723 ns | 10,139,873.0 | 0.1935 |    2024 B |
| Write  | xxxx(...)xxxx [1000] | 67.47 ns | 0.227 ns | 0.201 ns | 14,822,340.7 | 0.0051 |      54 B |

@bbowyersmyth
Contributor Author

@NinoFloris is this a change you would be interested in? I know there is some desire for small AOT size.

@NinoFloris
Member

Clever observation! Maybe we can devirtualize inside Read and Write? We would add a branch if (Encoding.UTF8 == encoding) but won't need the extra converter and so on.
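The suggested branch could look roughly like this (an illustrative sketch, not the actual Npgsql code): on the fast path the call site's static type is Encoding.UTF8, whose runtime type is the sealed UTF8EncodingSealed, so the JIT can devirtualize it.

```csharp
using System;
using System.Text;

static class TextRead
{
    public static string Read(Encoding encoding, ReadOnlySpan<byte> bytes)
        => ReferenceEquals(encoding, Encoding.UTF8)
            // Fast path: static call site on Encoding.UTF8, devirtualizable.
            ? Encoding.UTF8.GetString(bytes)
            // Slow path: ordinary virtual dispatch for any other encoding.
            : encoding.GetString(bytes);
}
```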

@NinoFloris
Member

NinoFloris commented Feb 8, 2025

Ah, I missed the fact that Npgsql actually defines its own UTF8Encoding instance as well. Since we don't call GetPreamble anywhere, we can just use Encoding.UTF8 for NpgsqlWriteBuffer.UTF8Encoding. However, we still need to use Encoding.UTF8 directly in the converter for the JIT to see that it's of type UTF8EncodingSealed.

EDIT:
I recall I did a similar optimization in https://github.com/npgsql/npgsql/blob/main/src/Npgsql/Internal/Converters/JsonConverter.cs#L122-L131 (which ideally should not depend on CodePage but on a straight reference equality to Encoding.UTF8).
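The distinction between the two checks, sketched with hypothetical helper names: a CodePage comparison matches any UTF-8 encoding, including user-created UTF8Encoding instances with different fallback settings, while reference equality matches only the Encoding.UTF8 singleton, which is the condition devirtualization actually relies on.

```csharp
using System.Text;

static class EncodingChecks
{
    // True for any UTF-8 encoding (code page 65001), regardless of
    // fallback settings or how the instance was created.
    public static bool IsUtf8ByCodePage(Encoding e)
        => e.CodePage == Encoding.UTF8.CodePage;

    // True only for the Encoding.UTF8 singleton itself.
    public static bool IsUtf8Singleton(Encoding e)
        => ReferenceEquals(e, Encoding.UTF8);
}
```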

@bbowyersmyth
Contributor Author

Encoding.UTF8 uses the "replace invalid chars" option whereas the main Npgsql encoding uses the "throw on invalid" setting.

I tried to combine the converters, but most attempts were a regression from the original perf. I had to deconstruct the Read function, and then it was good. The current change might be a bit worse for AOT, but the JIT likes it.
Only the sync Read is modified at the moment.

| Method | Value                | Mean     | Error    | StdDev   | Op/s         | Gen0   | Allocated |
|--------|----------------------|---------:|---------:|---------:|-------------:|-------:|----------:|
| Read   | x                    | 16.68 ns | 0.066 ns | 0.058 ns | 59,945,087.0 | 0.0023 |      24 B |
| Write  | x                    | NA       | NA       | NA       | NA           | NA     |      NA   |
| Read   | xxxxxxxxxx           | 18.09 ns | 0.073 ns | 0.068 ns | 55,264,866.3 | 0.0046 |      48 B |
| Write  | xxxxxxxxxx           | 19.68 ns | 0.028 ns | 0.025 ns | 50,810,920.2 | 0.0000 |       -   |
| Read   | xxxx(...)xxxx [100]  | 25.24 ns | 0.079 ns | 0.074 ns | 39,616,495.6 | 0.0214 |     224 B |
| Write  | xxxx(...)xxxx [100]  | 24.95 ns | 0.053 ns | 0.050 ns | 40,080,983.7 | 0.0005 |       5 B |
| Read   | xxxx(...)xxxx [1000] | 98.19 ns | 1.330 ns | 1.244 ns | 10,184,553.7 | 0.1935 |    2024 B |
| Write  | xxxx(...)xxxx [1000] | 72.26 ns | 0.242 ns | 0.214 ns | 13,838,010.8 | 0.0051 |      54 B |
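The fallback difference mentioned above can be demonstrated directly (hypothetical helper names; the strict settings mirror how Npgsql configures its encoding):

```csharp
using System;
using System.Text;

static class FallbackDemo
{
    // Default Encoding.UTF8: invalid bytes are silently replaced with U+FFFD.
    public static string DecodeLenient(byte[] bytes)
        => Encoding.UTF8.GetString(bytes);

    // Strict settings, as Npgsql uses: invalid bytes throw DecoderFallbackException.
    public static string DecodeStrict(byte[] bytes)
        => new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true)
            .GetString(bytes);
}
// For input { 0x61, 0xFF, 0x62 } (0xFF is never valid in UTF-8),
// DecodeLenient returns "a\uFFFDb" while DecodeStrict throws.
```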

@NinoFloris
Member

NinoFloris commented Feb 9, 2025

Encoding.UTF8 uses the "replace invalid chars" option whereas the main Npgsql encoding uses the "throw on invalid" setting.

I traced this decision back to #503; however, it seems outdated in the current scheme of things. At that time no other encodings were supported (support was introduced in ea9f203). So we're currently in a situation where we only do the strict thing for UTF8, which I don't think makes a lot of sense.

I'd personally be fine using Encoding.UTF8 throughout, @roji what are your thoughts on this?

@NinoFloris
Member

@roji friendly ping, do you think we can use the builtin Encoding.UTF8 over using strict today?

See #6016 (comment)

@roji
Member

roji commented Oct 29, 2025

Thanks for the ping @NinoFloris.

The point about us only being strict for UTF8 is definitely valid... But I think that throwing on invalid characters (as opposed to the, IIRC, default behavior of silently replacing them) is valuable - I remember at least some cases where users unknowingly got corrupt data into their database, and fail-fast behavior seems much better. If anything, I'd rather look into some way for us to be strict for all encodings instead.
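One possible way to be strict for any encoding, not just UTF8 (a sketch with a hypothetical helper name, not a proposed Npgsql API): Encoding.GetEncoding accepts explicit fallback objects, so any encoding can be recreated with exception fallbacks that throw on invalid data instead of substituting characters.

```csharp
using System.Text;

static class StrictEncodings
{
    // Returns a variant of the given encoding that throws on invalid
    // data in both directions rather than silently replacing it.
    public static Encoding MakeStrict(Encoding e)
        => Encoding.GetEncoding(
            e.CodePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
}
```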

I don't have all the context here (or even nearly enough), but if this is a tradeoff between slightly better AOT size plus a few nanoseconds vs. throwing for invalid characters, I think I'm more in the latter camp. But is that a valid summary of the decision here?
