Devirtualize UTF8 with a specialized text converter#6016
Devirtualize UTF8 with a specialized text converter#6016bbowyersmyth wants to merge 2 commits intonpgsql:mainfrom
Conversation
|
@NinoFloris is this a change you would be interested in? I know there is some desire for small AOT size. |
|
Clever observation! Maybe we can devirtualize inside Read and Write? We would add a branch |
|
Ah I missed the fact Npgsql actually defines its own UTF8Encoding instance as well. Since we don't call GetPreamble anywhere we can just use Encoding.UTF8 for NpgsqlWriteBuffer.UTF8Encoding. However we still need to use Encoding.UTF8 directly in the converter for the JIT to see it's of type UTF8EncodingSealed. EDIT: |
|
I tried to combine the converter but most things were a regression on the original perf. Had to deconstruct the Read function and then it was good. The current change might be a bit worse for AOT but the JIT likes it.
|
I traced back this decision to #503, however it seems outdated in the current scheme of things. At that time no other encodings were supported (support was introduced in ea9f203). So we're currently in the situation where we only do the strict thing for UTF8, which I don't think makes a lot of sense. I'd personally be fine using Encoding.UTF8 throughout, @roji what are your thoughts on this? |
|
@roji friendly ping, do you think we can use the builtin Encoding.UTF8 over using strict today? See #6016 (comment) |
|
Thanks for the ping @NinoFloris. The point about us only being strict for UTF8 is definittely valid... But I think that throwing for invalid characters (as opposed to the IIRC default behavior of silently replacing them with whatever) is valuable - I remember at least some cases where users unknowingly got corrupt data in their database, and fail-fast behavior seems much better. If anything I'd rather look into some way for us to be strict for all encodings instead. I don't have all the context here (or even nearly enough), but if this is a tradeoff between slightly better AOT size plus a few nanoseconds vs. throwing for invalid characters, I think I'm more in the latter camp. But is that a valid summary of the decision here? |
Encoding.UTF8actually returns a UTF8EncodingSealed object (https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/UTF8Encoding.Sealed.cs) that allows for the functions to be devirtualized if accessed via that static.Npgsql is creating its own UTF8Encoding objects that don't allow for this specialization.
Using a custom UTF8Encoding and a specific TextConverter when that encoding is in use allows greater performance. I did not see much performance differences with the rest of the stack allocations in UTF8EncodingSealed.
Before
After