I'm working on a new site at https://highperformancewebfonts.com/ where I'm doing everything wrong. E.g. using a joke-y client-side-only rendering of articles from .md files (Hello Lizzy.js)
Since there's no static generation, there was no RSS feed. And since someone asked, I decided to add one. But in the spirit of learning-while-doing, I thought I should do the feed generation in Rust—a language I know nothing about.
Here are my first steps in Rust, for posterity. BTW the end result is https://highperformancewebfonts.com/feed.xml
The recommended install is via the rustup tool. This page https://www.rust-lang.org/tools/install has the instructions:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Next, restart the terminal shell or run:
$ . "$HOME/.cargo/env"
Check if installation was ok:
$ rustc --version
rustc 1.84.0 (9fc6b4312 2025-01-07)
A tool called Cargo seems like the way to go. Looks like it's a package manager, an NPM of Rust:
$ cargo new russel && cd russel
(The name of my program is "russel", from "rusty", from "rust". Yeah, I'll see myself out.)
And Cargo.toml looks like a config file similar in spirit to package.json and similar in syntax to a php.ini. Since I'll need to write an RSS feed, a package called rss would be handy.
[package]
name = "russel"
version = "0.1.0"
edition = "2021"

[dependencies]
rss = "2.0.0"
Running $ cargo build after a dependency update seems necessary.
To explore packages, the crates.io site looks appropriate, e.g. https://crates.io/crates/rss, as well as docs.rs, e.g. https://docs.rs/rss/2.0.11/rss/index.html
The command `cargo new russel` from the previous step created a hello-world program. We can test it by running:
$ cargo run
This should print "Hello, world!"
Nice!
Let's just test we can make changes and see the result. Seeing is believing!
Open src/main.rs
, look at this wonderful function:
fn main() {
    println!("Hello, world!");
}
Replace the string with "Bello, world". Save. Run:
$ cargo run
If you see the "Bello", the setup seems to be working. Rejoice!
The rust-analyzer extension is pretty helpful. In VSCode you can go to Extensions and search for "rust-analyzer" to find it.
When you use a bog-standard WordPress install, the caching header in the HTML response is
Cache-Control: max-age=600
OK, cool, this means cache the HTML for 10 minutes.
Additionally these headers are sent:
Date: Sat, 07 Dec 2024 05:20:02 GMT
Expires: Sat, 07 Dec 2024 05:30:02 GMT
These don't help at all, because they instruct the browser to cache for 10 minutes too, which the browser already knows. These can actually be harmful in cases of clocks that are off. But let's move on.
The plugin I installed is WP Super Cache, made by WP folks themselves, so joy, joy, joy. It saves the generated HTML from the PHP code on the disk and then gives that cached content to the next visitor. Win!
However, I noticed it adds another header:
Cache-Control: max-age=3, must-revalidate
And actually now there are two cache-control headers being sent, the new and the old:
Cache-Control: max-age=3, must-revalidate
Cache-Control: max-age=600
What do you think happens? Well, the browser goes with the more restrictive one, so the wonderfully cached (on disk) HTML is now stale after 3 seconds. Not cool!
Looking around in the plugin settings I see there is no way to fix this. There's another curious setting though, disabled by default:
[ ] 304 Browser caching. Improves site performance by checking if the page has changed since the browser last requested it. (Recommended)
304 support is disabled by default because some hosts have had problems with the headers used in the past.
I turned this on. It means that instead of a new request after 3 seconds, the repeat visit will send an If-Modified-Since header, and since 3 seconds is a very short time, the server will very likely respond with a 304 Not Modified response, which means the browser is free to use the copy from the browser cache.
Better, but still... it's an HTTP request.
Then I had to poke around the code and saw this:
// Default headers.
$headers = array(
    'Vary'          => 'Accept-Encoding, Cookie',
    'Cache-Control' => 'max-age=3, must-revalidate',
);

// Allow users to override Cache-control header with WPSC_CACHE_CONTROL_HEADER
if ( defined( 'WPSC_CACHE_CONTROL_HEADER' ) && ! empty( WPSC_CACHE_CONTROL_HEADER ) ) {
    $headers['Cache-Control'] = WPSC_CACHE_CONTROL_HEADER;
}
Alrighty, so there is a way! All I needed to do was define the constant with the header I want.
The new constant lives in wp-content/wp-cache-config.php - a file that already exists, created by the cache plugin.
I opted for:
define( 'WPSC_CACHE_CONTROL_HEADER', 'max-age=600, stale-while-revalidate=100' );
Why 600? I'd do it for longer but there's this other Cache-Control 600 coming from who-knows-where, so 600 is the max I can do. (TODO: figure out that other Cache-Control and ditch it)
Why stale-while-revalidate? Well, this lets the browser use the cached response after the 10 minutes while it's re-checking for a fresher copy.
Let's compare three WebPageTest runs.
1. The repeat visit as-is, meaning the default less-than-ideal WP Super Cache behavior:
https://www.webpagetest.org/result/241207_AiDcHR_1QT/
Here you can see a new request for a repeat view, because 3 seconds have passed.
2. With the 304 setting turned on:
https://www.webpagetest.org/result/241207_AiDc4D_1QQ/
You can see a request being made that gets a 304 Not Modified response
3. Finally with the fix, the new header coming from the new constant:
https://www.webpagetest.org/result/241207_AiDcVT_1R8/
Here you can see no more requests for HTML, just one for stats. No static resources either (CSS, images, JS are cached "forever"). So the page is loaded completely from the browser cache.
Animated gifs are fun and all but they can get big (in filesize) quickly. At some point, maybe after just a few low-resolution frames, it's better to use an MP4 and an HTML <video> element. You also preferably need a "poster" image for the video so people can see a quick preview before they decide to play your video. The procedure can be pretty simple thanks to freely available amazing open source command-line tools.
For this we use ffmpeg:
$ ffmpeg -i amazing.gif amazing.mp4
Here we use ImageMagick to take the first frame in a gif and export it to a PNG:
$ magick "amazing.gif[0]" amazing.png
... or a JPEG, depending on the type of video (photographic vs more shape-y)
And the HTML:
<video width="640" height="480" controls preload="none" poster="amazing.png">
  <source src="amazing.mp4" type="video/mp4">
</video>
Finally, optimize the poster image with your favorite image-smushing tool e.g. ImageOptim.
I did this for a recent Perfplanet calendar post and the 2.5MB gif turned to 270K mp4. Another 23MB gif turned to 1.2MB mp4.
I dunno if my ffmpeg install is to blame but the videos didn't play in QuickTime/FF/Safari, only in Chrome. So I ran them through HandBrake and that solved it. Cuz... ffmpeg options are not for the faint-hearted.
Do use preload="none" so that the browser doesn't load the whole video unless the user decides to play. In my testing, without preload="none" Chrome and Safari send range requests (an HTTP header like Range: bytes=0-) for 206 Partial Content. Firefox just gets the whole thing.

There is no loading="lazy" for poster images.
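If the poster file size matters to you, one workaround (just a sketch, nothing the platform gives you out of the box) is to omit the poster attribute initially and set it via IntersectionObserver when the video is about to scroll into view. The data-poster attribute here is a made-up convention for holding the image URL:

const observer = new IntersectionObserver(
  (entries) => {
    for (const entry of entries) {
      if (entry.isIntersecting) {
        // copy the URL from the data-poster attribute into the real poster attribute
        entry.target.poster = entry.target.dataset.poster;
        observer.unobserve(entry.target);
      }
    }
  },
  { rootMargin: '200px' }, // start loading a bit before the video is visible
);

document.querySelectorAll('video[data-poster]').forEach((video) => {
  observer.observe(video);
});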
You've seen some of these UIs in recent AI tools that stream text, right? Like this:
I peeked under the hood of ChatGPT and meta.ai to figure how they work.
Server-sent events (SSE) seem like the right tool for the job. A server-side script flushes out content whenever it's ready. The browser listens to the content as it's coming down the wire with the help of EventSource() and updates the UI.
Sadly I couldn't make the PHP code work server-side on this here blog, even though I consulted Dreamhost's support. I never got the "chunked" response to flush progressively from the server, I always get the whole response once it's ready. It's not impossible though, it worked for me with a local PHP server (like $ php -S localhost:8000) and I'm pretty sure it used to work on Dreamhost before they switched to FastCGI.
If you want to make flush()-ing work in PHP, here are some pointers to try in .htaccess:
<FilesMatch "\.php$">
    SetEnv no-gzip 1
    Header always set Cache-Control "no-cache, no-store, must-revalidate"
    SetEnv chunked yes
    SetEnv FcgidOutputBufferSize 0
    SetEnv OutputBufferSize 0
</FilesMatch>
And a test page to tell the time every second:
<?php
header('Cache-Control: no-cache');

@ob_end_clean();

$go = 5;
while ($go) {
    $go--;
    // Send a message
    echo sprintf(
        "It's %s o'clock on my server.\n\n",
        date('H:i:s', time()),
    );
    flush();
    sleep(1);
}
In this repo stoyan/vexedbyalazyox you can find two PHP scripts that worked for me.
BTW, server-side partial responses and flushing are pretty old as web performance techniques go.
(I'll keep using PHP to illustrate for just a bit more and then switch to Node.js)
In their simplest form, server-sent events (or messages) are pretty sparse; all you do is:
echo "data: I am a message\n\n";
flush();
And now the client can receive "I am a message".
The events can have event names, anything you make up, like:
echo "event: start\n"; echo "data: Hi!\n\n"; flush();
More on the message fields is available on MDN. But all in all, the stuff you spit out on the server can be really simple:
event: start
data:

data: hello

data: foo

event: end
data:
Events can be named anything, "start" and "end" are just examples. And they are optional too.
data: is not optional, even if all you need is to send an event with no data. When event: is omitted, it's assumed to be event: message.
To get started you need an EventSource object pointed to the server-side script:
const evtSource = new EventSource(
  'https://pebble-capricious-bearberry.glitch.me/',
);
Then you just listen to events (messages) and update the UI:
evtSource.onmessage = (e) => {
  msg.textContent += e.data;
};
And that's all! You have optional event handlers should you need them:
evtSource.onopen = () => {};
evtSource.onerror = () => {};
Additionally, you can listen to any events with names you decide. For example I want the server to signal to the client that the response is over. So I have the server send this message:
event: imouttahere
data:
And then the client can listen to the imouttahere event:
evtSource.addEventListener('imouttahere', () => {
  console.info('Server calls it done');
  evtSource.close();
});
OK, demo time! The server side script takes a paragraph of text and spits out every word after a random delay:
$txt = "The zebra jumps quickly over a fence, vexed by..."; $words = explode(" ", $txt); foreach ($words as $word) { echo "data: $word \n\n"; usleep(rand(90000, 200000)); // Random delay flush(); }
The client side sets up EventSource and, on every message, updates the text on the page. When the server is done (event: imouttahere), the client closes the connection.
Try it here in action. View source for the complete code. Note: if nothing happens initially, that's because the server-side Glitch has gone to sleep and needs to wake up.
One cool Chrome devtools feature is the list of events under an EventStream tab in the Network panel:
Now, what happens if the server is done and doesn't send a special message (such as imouttahere)? Well, the browser thinks something went wrong and re-requests the same URL and the whole thing repeats. This is probably desired behavior in many cases, but here I don't want it.
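By the way, if you don't control the server and can't add an end-of-stream event, a workaround (a sketch, with the caveat that it also swallows genuine connection drops) is to close the connection yourself in the onerror handler, which fires when the server ends the response:

evtSource.onerror = () => {
  // readyState CONNECTING means the browser is about to retry the request
  if (evtSource.readyState === EventSource.CONNECTING) {
    evtSource.close();
  }
};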
Try the case of a non-terminating client.
The re-request will look like the following... note the error and the repeat request:
Alrighty, that just about clarifies SSE (Server-Sent Events) and provides a small demo to get you started.
In fact, this is the type of "streaming" ChatGPT uses when giving answers, take a look:
In the EventStream tab you can see the messages passing through. The server sends stuff like:
event: delta
data: {json: here}
This should look familiar now, except the chosen event name is "delta" (not the default, optional "message") and the data is JSON-encoded.
And at the end, the server switches back to "message" and the data is "[DONE]" as a way to signal to the client that the answer is complete and the UI can be updated appropriately, e.g. make the STOP button back to SEND (arrow pointing up)
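Putting those observations together, a client for that kind of stream could look roughly like this sketch. The "delta" event name and the "[DONE]" sentinel are as observed above; the JSON field (payload.text) is made up for illustration:

// custom "delta" events carry JSON-encoded chunks of the answer
evtSource.addEventListener('delta', (e) => {
  const payload = JSON.parse(e.data); // exact shape is whatever the API sends
  msg.textContent += payload.text || '';
});

// the default "message" event carries the end-of-answer sentinel
evtSource.onmessage = (e) => {
  if (e.data === '[DONE]') {
    evtSource.close();
    // e.g. flip the STOP button back to SEND here
  }
};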
OK, cool story ChatGPT, let's take a gander at what the competition is doing over at meta.ai
Asking meta.ai a question, I don't see an EventStream tab, so it must be something else. Looking at the Performance panel for UI updates I see:
All of these pinkish, purplish vertical almost-lines are updates. Zooming in on one:
Here we can see XHR readyState change. Aha! Our old friend XMLHttpRequest, the source of all things Ajax!
Looks like with similar server-side flushes meta.ai is streaming the answer. On every readyState change, the client can inspect the current state of the response and grab data from it.
Here's our version of the XHR boilerplate:
const xhr = new XMLHttpRequest();
xhr.open(
  'GET',
  'https://pebble-capricious-bearberry.glitch.me/xhr',
  true,
);
xhr.send(null);
Now the only thing left is to listen to onprogress:
xhr.onprogress = () => {
  console.log('LOADING', xhr.readyState);
  msg.textContent = xhr.responseText;
};
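One design note: responseText always holds everything received so far, so the handler above re-renders the full text every time. If you'd rather append only the newly arrived part, you can track an offset, something like:

let seen = 0; // how much of the response we've already rendered
xhr.onprogress = () => {
  const chunk = xhr.responseText.slice(seen); // just the new part
  seen = xhr.responseText.length;
  msg.textContent += chunk;
};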
Like before, for a test page, the server just flushes the next chunk of text after a random delay:
$txt = "The zebra jumps quickly over a fence, vexed ..."; $words = explode(" ", $txt); foreach ($words as $word) { echo "$word "; usleep(rand(20000, 200000)); // Random delay flush(); }
So how do the XHR and SSE approaches differ? First, the HTTP Content-Type header:
# XHR
Content-Type: text/plain

# SSE
Content-Type: text/event-stream
Second, message format. SSE requires a (however simple) format of "event:" and "data:" where data can be JSON-encoded or however you wish. Maybe even XML if you're feeling cheeky. XHR responses are completely free for all, no formatting imposed, and even XML is not required despite the unfortunate name.
And lastly, and most importantly IMO, is that SSE can be interrupted by the client. In my examples I have a "close" button:
document.querySelector('#close').onclick = function () {
  console.log('Connection closed');
  evtSource.close();
};
Here close() tells the server that's enough and the server takes a breath. No such thing is possible in XHR. And you can see inspecting meta.ai that even though the user can click "stop generating", the response is sent by the server until it completes.
Finally, here's the Node.js code I used for the demos. Since I couldn't get Dreamhost to flush() in PHP, I went to Glitch as free Node hosting for just this one script.
The code handles requests to / for SSE and /xhr for XHR. And there are a few ifs based on XHR vs SSE:
const http = require("http");

const server = http.createServer((req, res) => {
  if (req.url === "/" || req.url === "/xhr") {
    const xhr = req.url === "/xhr";
    res.writeHead(200, {
      "Content-Type": xhr ? "text/plain" : "text/event-stream",
      "Cache-Control": "no-cache",
      "Access-Control-Allow-Origin": "*",
    });

    if (xhr) {
      res.write(" ".repeat(1024)); // for Chrome
    }
    res.write("\n\n");

    const txt = "The zebra jumps quickly over a fence, vexed ...";
    const words = txt.split(" ");
    let to = 0;
    for (let word of words) {
      to += Math.floor(Math.random() * 200) + 80;
      setTimeout(() => {
        if (!xhr) {
          res.write(`data: ${word} \n\n`);
        } else {
          res.write(`${word} `);
        }
      }, to);
    }

    if (!xhr) {
      setTimeout(() => {
        res.write("event: imouttahere\n");
        res.write("data:\n\n");
        res.end();
      }, to + 1000);
    }

    req.on("close", () => {
      res.end();
    });
  } else {
    res.writeHead(404);
    res.end("Not Found\n");
  }
});

const port = 8080;
server.listen(port, () => {
  console.log(`Server started on port ${port}`);
});
Note the weird-looking line:
res.write(" ".repeat(1024)); // for Chrome
In the world of flushing, there are many foes that want to buffer the output. Apache, PHP, mod_gzip, you name it. Even the browser. Sometimes it's required to flush out some emptiness (in this case 1K of spaces). I was actually pleasantly surprised that not too much of it was needed. In my testing this 1K buffer was needed only in the XHR case and only in Chrome.
If you want to inspect the endpoints, here they are:
https://pebble-capricious-bearberry.glitch.me/ (SSE)
https://pebble-capricious-bearberry.glitch.me/xhr (XHR)
Once again, the repo stoyan/vexedbyalazyox has all the code from this blog and some more too.
And the demos one more time:
Web Sockets are yet another alternative for streaming content. Probably the most complex of the three in terms of implementation. Perplexity.ai and MS Copilot seem to have gone this route:
While at the most recent performance.now() conference, I had a little chat with Andy Davies about fonts and he mentioned it'd be cool if, while subsetting, you can easily create a second subset file that contains all the "rejects". All the characters that were not included in the initially desired subset.
And as the flight from Amsterdam is pretty long, I hacked on just that. Say hello to a new script, available as an NPM package, called...
Initially I was thinking of wrapping Glyphhanger and doing both subsets, but decided that there's no point in wrapping Glyphhanger to do what Glyphhanger already does. So the initial subset is left to the user to do in any way they see fit. What I set out to do was take The Source (the complete font file) and The Subset and produce an inversion, where
The Inverted Subset = The Source - The Subset
This way if your subset is all Latin characters, the inversion will be all non-Latin characters.
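Conceptually (this is just a sketch of the idea, not the actual code of the package), the inversion is a set difference over code points:

// Given the code points the source font supports and the ones kept in the
// subset, the inversion is everything in the source that's not in the subset.
function invertSubset(sourceCodePoints, subsetCodePoints) {
  const subset = new Set(subsetCodePoints);
  return sourceCodePoints.filter((cp) => !subset.has(cp));
}

// Format code points as a unicode-range value (consecutive code points could
// be collapsed into ranges; skipped here for brevity)
function toUnicodeRange(codePoints) {
  return codePoints
    .map((cp) => 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'))
    .join(', ');
}

// Example: the source has A, B and à; the subset keeps only A and B
console.log(toUnicodeRange(invertSubset([0x41, 0x42, 0xe0], [0x41, 0x42])));
// U+00E0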
When you craft the @font-face declaration, you can use the Unicode range of the subset, like
@font-face {
  font-family: "Oxanium";
  src: url("Oxanium-subset.woff2") format("woff2");
  unicode-range: U+0020-007E;
}
(Unicode range generated by wakamaifondue.com/beta)
Then for the occasional character that is not in this range, you can let the browser load the inverted subset. But that should be rare, otherwise an oft-needed character will be in the original subset.
Save on HTTP requests and bytes (in 99% of cases) and yet, take care of all characters your font supports for that extra special 1% of cases.
Wakamaifondue can generate the Unicode range for the inverted subset too, but it's not required (it's too long!) as long as the inverted declaration comes first. In other words if you have:
@font-face {
  font-family: "Oxanium";
  src: url("Oxanium-inverse-subset.woff2") format("woff2");
}

@font-face {
  font-family: "Oxanium";
  src: url("Oxanium-subset.woff2") format("woff2");
  unicode-range: U+0020-007E;
}
... and only Latin characters on the page, then Oxanium-inverse-subset.woff2 is NOT going to be downloaded, because the second declaration overwrites the first.
If you flip the two @font-face blocks, the inversion will be loaded because it claims to support everything. And the Latin will be loaded too, because the inversion proves inadequate.
If you cannot guarantee the order of @font-faces for some reason, specifying a scary-looking Unicode range for the inversion is advisable:
@font-face {
  font-family: "Oxanium";
  src: url("Oxanium-inverse-subset.woff2") format("woff2");
  unicode-range: U+0000, U+000D, U+00A0-0107, U+010C-0113, U+0116-011B, U+011E-011F, U+0122-0123, U+012A-012B, U+012E-0131, U+0136-0137, U+0139-013E, U+0141-0148, U+014C-014D, U+0150-015B, U+015E-0165, U+016A-016B, U+016E-0173, U+0178-017E, U+0192, U+0218-021B, U+0237, U+02C6-02C7, U+02C9, U+02D8-02DD, U+0300-0304, U+0306-0308, U+030A-030C, U+0326-0328, U+03C0, U+1E9E, U+2013-2014, U+2018-201A, U+201C-201E, U+2020-2022, U+2026, U+2030, U+2039-203A, U+2044, U+2070, U+2074, U+2080-2084, U+20AC, U+20BA, U+20BD, U+2113, U+2122, U+2126, U+212E, U+2202, U+2206, U+220F, U+2211-2212, U+2215, U+2219-221A, U+221E, U+222B, U+2248, U+2260, U+2264-2265, U+25CA, U+F000, U+FB01-FB02;
}
If you don't load the extended characters and someone uses your CMS to add a wee bit of je ne sais quoi, you get a fallback font:
(Note the à shown in a fallback font)
But if you do load the inversion, all is fine with the UI once again.
... and happy type setting, subsetting, and inverse subsetting!
Here's a view of the tool in action:
This is part 4 of an ongoing study of web font file sizes, subsetting, and file sizes of the subsets.
I used the collection of freely available web fonts that is Google Fonts.
Now, instead of focusing on just regular or just weight-variable fonts, I thought let's just do them all and let you, my dear reader, do your own filtering, analysis and conclusions.
One constraint I kept was just focusing on the LATIN subset (see part 1 as to what LATIN means) because, as Boris Shapira notes: "...even with basic high school Chinese, we would need a minimum of 3,000 characters..." which is an order of magnitude larger than Latin and we do need to keep some sort of apples-to-apples here.
First download all Google fonts (see part 1).
Then subset all of the fonts to LATIN and drop all fonts that don't support at least 200 characters. 200 and a bit is what the average LATIN font out there supports. This resulted in excluding fonts that focus mostly on non-Latin, e.g. Chinese characters. But it also dropped some fonts that are close to 200 Latin characters but not quite there. See part 1 for the "magic" 200 number. So this replicates part 1 and part 3 but this time for all available fonts.
This 200-LATIN filtering leaves us with 3277 font files to study and 261 font file "rejects". The full list of rejects is rejects.txt
Finally, subset each of the remaining fonts, 10 characters at a time, to see how they grow. This replicates part 2 for all fonts, albeit a bit more coarse (10 characters at a time as opposed to 1. Hey, it still took over 24 hours while running 10 threads simultaneously, meaning 10 copies of the subsetting script!). The subsets are 1 character, 10 characters, 20... up to 200. I ended up with 68,817 font files.
That is, (20 subsets from 10 to 200 characters, plus the 1-character subset) * 3277 fonts = 21 * 3277 = 68,817 files.
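To illustrate the "10 copies of the subsetting script" part, here's a sketch of a simple Node.js worker pool; subset.js is a stand-in name for whatever subsetting script does the actual work:

const { execFile } = require('child_process');

const fonts = ['Afacad[wght].ttf', 'Aleo[wght].ttf']; // ...all 3277 files
const CONCURRENCY = 10;
let next = 0;

function runNext() {
  if (next >= fonts.length) return;
  const font = fonts[next++];
  // "subset.js" is a placeholder for the actual subsetting script
  execFile('node', ['subset.js', font], (err) => {
    if (err) console.error(font, err);
    runNext(); // pick up the next font as soon as this one is done
  });
}

for (let i = 0; i < CONCURRENCY; i++) {
  runNext();
}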
The LATIN subset data is available in CSV (latin.csv) and HTML (latin.html)
The subset data is available as CSV (stats.csv) and Google spreadsheet
I'd love to hear your analysis on the data! I hope this data can be useful and I'm looking forward to any and all insights.
I've been crafting a nice font-face fallback, something like this:
@font-face {
  font-family: fallback;
  src: local('Helvetica Neue');
  ascent-override: 85%;
  descent-override: 19.5%;
  line-gap-override: 0%;
  size-adjust: 106.74%;
}
It works well, however Safari doesn't yet support ascent-override, descent-override, nor line-gap-override in @font-face blocks. It does support size-adjust though.

Since my code requires all 4, the results with size-adjust-only look bad. Worse than no overrides. Easy-peasy I thought, I'll target Safari and not give it any of the 4.
I wanted to use @supports in CSS to keep everything nice and compact. No JavaScript, no external CSS, all this is for a font fallback, so it should be loaded as early in the page as possible, together with the @font-face.
Unfortunately, turns out that for example both
@supports (ascent-override: normal) {/* css here */}
and
@supports (size-adjust: 100%) {/* css here */}
end up with the "css here" not being used.
In fact even the amazing font-display: swap is not declared as being @support-ed.
Using the JavaScript API I get this in Chrome, Safari and Firefox:
console.log(CSS.supports('font-stretch: normal')); // true
console.log(CSS.supports('font-style: normal')); // true
console.log(CSS.supports('font-display: swap')); // false
console.log(CSS.supports('size-adjust: 100%')); // false
console.log(CSS.supports('ascent-override: normal')); // false
Huh? Am I using @supports incorrectly? Or did browsers forget to update this part of the code after adding a new feature? But what are the chances that all three make the same error?

It's not like nothing in @font-face is declared @support-ed, because font-style and font-stretch are.
Ryan Townsend pointed out that font-style and font-stretch work because they double as properties, not only font descriptors. So it turns out font descriptors are not supported by @supports. Darn!
Noam Rosenthal pointed out this github issue, opened in 2018, to add support for descriptors too.
For now I came up with 2 (imperfect) solutions. One that uses JavaScript to check for a property, like
'ascentOverride' in new FontFace(1,1); // true in Chrome, FF, false in Saf
Not ideal because it's JavaScript.
The other one is to target non-Safari in CSS with a different property to use as a "proxy". Using the wonderful Compare Browsers feature of CanIUse.com I found a good candidate:
@supports (overflow-anchor: auto) {
  @font-face {
    /* works in Chrome, Edge, FF, but not in Safari */
  }
}
It's not ideal to test one thing (overflow-anchor) and use another (ascent-override) but at least no JavaScript is involved.
In this post, I talked about the letter frequency in English presented in Peter Norvig's research. And then I thought... what about my own mother tongue?
So I got a corpus of 5000 books (832,260 words), a mix of Bulgarian authors and translations, and counted the letter frequency. Here's the result in CSV format: letters.csv
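The counting itself is nothing fancy. A rough sketch of the kind of script involved (the file name is just an example, and the а-я check is an approximation, since Bulgarian doesn't use every letter in that Unicode range):

const fs = require('fs');

// Count Cyrillic letter frequency across a list of plain-text files
function letterFrequency(files) {
  const counts = new Map();
  let total = 0;
  for (const file of files) {
    const text = fs.readFileSync(file, 'utf8').toLowerCase();
    for (const ch of text) {
      if (ch >= 'а' && ch <= 'я') {
        counts.set(ch, (counts.get(ch) || 0) + 1);
        total++;
      }
    }
  }
  // Sort by frequency and print as CSV rows: letter,percentage
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([ch, n]) => `${ch},${((n / total) * 100).toFixed(2)}%`)
    .join('\n');
}

console.log(letterFrequency(['book1.txt'])); // file name is just an example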
Here are the results (in alphabetical order) in a graph:
And another graph, with data sorted by the frequency of letters:
ChatGPT gives a different result, even startlingly so (o is the winner at ~9.1% and a is third with 7.5%), which makes me like my letter count research even more
TL;DR:
For context see part 1 and part 2.
After publishing part 2 of my ongoing web fonts file size study, I got feedback on Mastodon to the effect of hey, what about variable fonts?
Good question! I speculated in part 2 that there may be savings if we can combine font variants (bold, italic) in a single file, sprite-style. And that's just what a variable font is (and more!)
Following the process described in part 1, I grabbed only fonts from Google fonts that have [wght] in the name and subset them to the LATIN subset, throwing away those with fewer than 200 characters. Also I removed all fonts with "Italic" in the name.
Why [wght] only and not stuff like AdventPro[wdth,wght]?
I wanted to keep only one variable dimension so we can see apples-to-apples as much as possible. And [wght] seems to be the most popular dimension by far.
Why no Italic?
I wanted to keep fonts kinda diverse. Chances are AlbertSans-Italic[wght].ttf and AlbertSans[wght].ttf are designed by the same person (or people). So they are using similar techniques, optimizations and so on. And I'm looking for what's "out there" in general.
Here are the results in HTML and in CSV format.
And just a taste of what the results look like...
| Num chars | Num glyphs | Bytes | File | Font name |
|---|---|---|---|---|
| 235 | 378 | 21400 | Afacad[wght]-subset.woff2 | Afacad |
| 217 | 243 | 34688 | Aleo[wght]-subset.woff2 | Aleo |
| ... | ... | ... | ... | ... |
| 241 | 609 | 61456 | YsabeauOffice[wght]-subset.woff2 | Ysabeau Office |
| 241 | 621 | 62552 | Ysabeau[wght]-subset.woff2 | Ysabeau |
| 241 | 584 | 58688 | YsabeauInfant[wght]-subset.woff2 | Ysabeau Infant |
Overall stats:
The file size difference is not big but we can still see a saving, probably because of duplicate metadata and some other similar elements in two files vs one. And then there's also the delivery saving: 2 HTTPS requests vs one.
In the spirit of part 2 I'd like to study the sizes when incrementing the number of characters in a subset (as opposed to a catch-all LATIN). This will address potential skew #2 above. Probably not increments of 1 but of 50 to save some processing.
I'd also like to experiment with ALL the fonts available. So far I've been looking at "Regular" and [wght] only. But I should just do it all and then have people smarter than me (such as yourself, my dear reader) slice the results and draw conclusions any way you want.
The zebra jumps quickly over a fence, vexed by a lazy ox. Eden tries to alter soft stone near it. Tall giants often need to rest, and open roads invite no pause. Some long lines appear there. In bright cold night, stars drift, and people watch them. A few near doors step out. Much light finds land slowly, while men feel deep quiet. Words run in ways, forward yet true. Look ahead, and things form still, yet dreams stay hidden. Down the path, close skies come, forming hard arcs. High above, quiet kites drift, fast on pure wind, yanking joints.
What's so special about the nonsense paragraph above? It's attempting to match the average distribution of letters in texts written in the English language.
This article by Peter Norvig discusses a 2012 study of letter frequency using the Google Books data set. And the distribution looks like so:
For font-fallback matching purposes (more on this later) I want a shorter paragraph, representing roughly similar distribution. One can, of course, just create a paragraph like "Zzzzzzzzz" (9 Zs), followed by 12 Qs and so on, all the way to 1249 Es. But where's the fun in that? Plus texts have spaces and punctuation too.
So after some tweaking and coaching the AI, this is the paragraph that came out; it looks more realistic and matches the letter frequency pretty well.
Here's a CSV that shows:
Letter,Norvig,Tall giants
E,12.49%,12.26%
T,9.28%,8.73%
A,8.04%,7.55%
O,7.64%,7.08%
I,7.57%,6.60%
N,7.23%,7.55%
S,6.51%,6.84%
R,6.28%,6.13%
H,5.05%,4.01%
L,4.07%,4.48%
D,3.82%,5.42%
C,3.34%,1.89%
U,2.73%,2.36%
M,2.51%,2.12%
F,2.40%,2.83%
P,2.14%,2.59%
G,1.87%,2.12%
W,1.68%,2.12%
Y,1.66%,2.12%
B,1.48%,0.94%
V,1.05%,0.94%
K,0.54%,1.18%
X,0.23%,0.47%
J,0.16%,0.47%
Q,0.12%,0.71%
Z,0.09%,0.47%
Here's the same data represented graphically:
Similar to the nonsense etaoin shrdlu used by typesetters, this paragraph can be used to find out the average character width of a font.
Just render the paragraph in a non-wrapping inline-block DOM element, measure the width of the element and divide by the length of the text.
How is this useful? Welp, to set the size-adjust CSS property of a fallback font to match a custom web font. Further write up is coming, stay tuned!
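In the meantime, here's a sketch of the measurement described above, plus one way to derive size-adjust from it. The formula is my shorthand for the technique, and the example assumes the web font (Oxanium here, just as an example) is already loaded when the code runs:

// the full test paragraph from the top of this post
const paragraph = 'The zebra jumps quickly over a fence, vexed by a lazy ox. ...';

// Render the text in a non-wrapping inline-block element and divide the
// measured width by the number of characters
function averageCharWidth(fontFamily) {
  const el = document.createElement('span');
  el.style.cssText =
    'position: absolute; visibility: hidden; display: inline-block; white-space: nowrap;';
  el.style.fontFamily = fontFamily;
  el.textContent = paragraph;
  document.body.appendChild(el);
  const avg = el.getBoundingClientRect().width / paragraph.length;
  el.remove();
  return avg;
}

// size-adjust for the fallback: the ratio of the web font's average
// character width to the fallback font's average character width
const ratio = averageCharWidth('Oxanium') / averageCharWidth('Helvetica Neue');
console.log(`size-adjust: ${(ratio * 100).toFixed(2)}%`);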
As you can see in the graph, the two lines do not match exactly. I think this is OK. It's extremely unlikely that any text on your page will have the exact average distribution of letters in it. So we're talking about an approximation to begin with. It may also be site-dependent, e.g. on an adult site maybe the X character will occur more often than in the average book.
Also Norvig's analysis doesn't mention spaces and punctuation. In my paragraph, these exist, maybe making it possible to match the average text on a web page just a little bit closer.
Well, it doesn't attempt to match the character distribution in English. (Duh, it's not even English!)
Here's what it looks like in the same diagram:
Note: no K, J, Z, W or Y. Barely any H.
Here are the stats in CSV and .numbers for your perusal.
May "The zebra jumps quickly over a fence, vexed by a lazy ox" be always in your favor!