phpied.com https://www.phpied.com Stoyan's blog Sat, 01 Feb 2025 20:44:32 +0000 en-US hourly 1 https://wordpress.org/?v=6.1.1 First timid steps in Rust https://www.phpied.com/first-timid-steps-in-rust/ https://www.phpied.com/first-timid-steps-in-rust/#comments <![CDATA[Stoyan]]> Sat, 01 Feb 2025 00:00:43 +0000 <![CDATA[Rust]]> https://www.phpied.com/?p=2154 <![CDATA[I'm working on a new site at https://highperformancewebfonts.com/ where I'm doing everything wrong. E.g. using a joke-y client-side-only rendering of articles from .md files (Hello Lizzy.js) Since there's no static generation, there was no RSS feed. And since someone asked, I decided to add one. But in the spirit of learning-while-doing, I thought I should […]]]> <![CDATA[

I'm working on a new site at https://highperformancewebfonts.com/ where I'm doing everything wrong. E.g. using a joke-y client-side-only rendering of articles from .md files (Hello, Lizzy.js).

Since there's no static generation, there was no RSS feed. And since someone asked, I decided to add one. But in the spirit of learning-while-doing, I thought I should do the feed generation in Rust—a language I know nothing about.

Here are my first steps in Rust, for posterity. BTW the end result is https://highperformancewebfonts.com/feed.xml

1. Install Rust

The recommended install is via the rustup tool. This page https://www.rust-lang.org/tools/install has the instructions:

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Next, restart the terminal shell or run:

$ . "$HOME/.cargo/env"

Check if installation was ok:

$ rustc --version
rustc 1.84.0 (9fc6b4312 2025-01-07)

2. A new project

A tool called Cargo seems like the way to go. Looks like it's a package manager, an NPM of Rust:

$ cargo new russel && cd russel

(The name of my program is "russel", from "rusty", from "rust". Yeah, I'll see myself out.)

3. Add dependencies to Cargo.toml

And Cargo.toml looks like a config file similar in spirit to package.json and similar in syntax to a php.ini. Since I'll need to write an RSS feed, a package called rss would be handy.

[package]
name = "russel"
version = "0.1.0"
edition = "2021"

[dependencies]
rss = "2.0.0"

Running $ cargo build after a dependency update seems necessary.

To explore packages, a crates.io site looks appropriate e.g. https://crates.io/crates/rss as well as docs.rs, e.g. https://docs.rs/rss/2.0.11/rss/index.html

4. All ok so far?

The command `cargo new russel` from the previous step created a hello-world program. We can test it by running:

$ cargo run

This should print "Hello, world!"

Nice!

5. Tweaks in src/main.rs

Let's just test we can make changes and see the result. Seeing is believing!

Open src/main.rs, look at this wonderful function:

fn main() {
  println!("Hello, world!");
}

Replace the string with "Bello, world". Save. Run:

$ cargo run

If you see the "Bello", the setup seems to be working. Rejoice!

6. (Optional) install rust-analyzer

It's pretty helpful. In VSCode you can go to Extensions and search for "rust-analyzer" to find it.

Go Rust in peace!

https://en.wikipedia.org/wiki/Rust_in_Peace

]]>
https://www.phpied.com/first-timid-steps-in-rust/feed/ 1
A quick WordPress Super Cache fix https://www.phpied.com/a-quick-wordpress-super-cache-fix/ https://www.phpied.com/a-quick-wordpress-super-cache-fix/#respond <![CDATA[Stoyan]]> Sat, 07 Dec 2024 06:41:15 +0000 <![CDATA[performance]]> <![CDATA[WordPress]]> https://www.phpied.com/?p=2151 <![CDATA[When you use a bog-standard WordPress install, the caching header in the HTML response is Cache-Control: max-age=600 OK, cool, this means cache the HTML for 10 minutes. Additionally these headers are sent: Date: Sat, 07 Dec 2024 05:20:02 GMT Expires: Sat, 07 Dec 2024 05:30:02 GMT These don't help at all, because they instruct the […]]]> <![CDATA[


When you use a bog-standard WordPress install, the caching header in the HTML response is

Cache-Control: max-age=600

OK, cool, this means cache the HTML for 10 minutes.

Additionally these headers are sent:

Date: Sat, 07 Dec 2024 05:20:02 GMT
Expires: Sat, 07 Dec 2024 05:30:02 GMT

These don't help at all, because they instruct the browser to cache for 10 minutes too, which the browser already knows. They can actually be harmful when clocks are off. But let's move on.

WP Super Cache

This is a plugin I installed, made by WP folks themselves, so joy, joy, joy. It saves the generated HTML from the PHP code on the disk and then gives that cached content to the next visitor. Win!

However, I noticed it adds another header:

Cache-Control: max-age=3, must-revalidate

And actually now there are two cache-control headers being sent, the new and the old:

Cache-Control: max-age=3, must-revalidate
Cache-Control: max-age=600

What do you think happens? Well, the browser goes with the more restrictive one, so the wonderfully cached (on disk) HTML is now stale after 3 seconds. Not cool!
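To make that concrete, HTTP combines repeated header fields into one comma-separated list of directives, and the conservative reading is that the shortest lifetime wins. Here's a small sketch of that conservative reading (the helper name is made up; actual browser behavior with duplicate max-age directives varies):

```javascript
// Given the values of all Cache-Control headers on a response,
// return the effective max-age under the conservative reading:
// join the fields into one directive list and take the minimum.
function effectiveMaxAge(cacheControlValues) {
  const directives = cacheControlValues
    .join(", ")
    .split(",")
    .map((d) => d.trim());
  const ages = directives
    .filter((d) => d.startsWith("max-age="))
    .map((d) => parseInt(d.slice("max-age=".length), 10))
    .filter((n) => !Number.isNaN(n));
  if (ages.length === 0) return null; // no max-age at all
  return Math.min(...ages); // shortest lifetime wins
}

console.log(effectiveMaxAge(["max-age=3, must-revalidate", "max-age=600"])); // 3
```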

A settings fix

Looking around in the plugin settings I see there is no way to fix this. There's another curious setting though, disabled by default:

[ ] 304 Browser caching. Improves site performance by checking if the page has changed since the browser last requested it. (Recommended)
304 support is disabled by default because some hosts have had problems with the headers used in the past.

I turned this on. It means that instead of a new request after 3 seconds, the repeat visit will send an If-Modified-Since header, and since 3 seconds is a very short time, the server will very likely respond with 304 Not Modified response, which means the browser is free to use the copy from the browser cache.

Better, but still... it's an HTTP request.
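For illustration, the revalidation decision on the server boils down to a timestamp comparison. A minimal sketch (not WP Super Cache's actual code; the function name is made up):

```javascript
// Decide between 304 and 200: if the browser's If-Modified-Since
// is at or after the cached page's last-modified time, the copy
// in the browser cache is still good.
function respond(ifModifiedSince, lastModified) {
  if (ifModifiedSince && new Date(ifModifiedSince) >= new Date(lastModified)) {
    return 304; // Not Modified: browser reuses its cached copy
  }
  return 200; // send the full page again
}

console.log(
  respond("Sat, 07 Dec 2024 05:20:02 GMT", "Sat, 07 Dec 2024 05:00:00 GMT"),
); // 304
```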

A config fix

Then I had to poke around the code and saw this:

// Default headers.
$headers = array(
  'Vary'          => 'Accept-Encoding, Cookie',
  'Cache-Control' => 'max-age=3, must-revalidate',
);

// Allow users to override Cache-control header with WPSC_CACHE_CONTROL_HEADER
if ( defined( 'WPSC_CACHE_CONTROL_HEADER' ) && ! empty( WPSC_CACHE_CONTROL_HEADER ) ) {
  $headers['Cache-Control'] = WPSC_CACHE_CONTROL_HEADER;
}

Alrighty, so there is a way! All I needed to do was define the constant with the header I want.

The new constant lives in wp-content/wp-cache-config.php - a file that already exists, created by the cache plugin.

I opted for:

define(
  'WPSC_CACHE_CONTROL_HEADER',
  'max-age=600, stale-while-revalidate=100'
);

Why 600? I'd do it for longer but there's this other Cache-Control 600 coming from who-knows-where, so 600 is the max I can do. (TODO: figure out that other Cache-Control and ditch it)

Why stale-while-revalidate? Well, this lets the browser use the cached response after the 10 minutes while it's re-checking for a fresher copy.

Some WebPageTest tests

1. The repeat visit as-is, meaning the default less-than-ideal WP Super Cache behavior:
https://www.webpagetest.org/result/241207_AiDcHR_1QT/

Here you can see a new request for a repeat view, because 3 seconds have passed.

2. With the 304 setting turned on:
https://www.webpagetest.org/result/241207_AiDc4D_1QQ/

You can see a request being made that gets a 304 Not Modified response

3. Finally with the fix, the new header coming from the new constant:
https://www.webpagetest.org/result/241207_AiDcVT_1R8/

Here you can see no more requests for HTML, just one for stats. No static resources either (CSS, images, JS are cached "forever"). So the page is loaded completely from the browser cache.

]]>
https://www.phpied.com/a-quick-wordpress-super-cache-fix/feed/ 0
Turn an animated GIF into a <video> https://www.phpied.com/turn-an-animated-gif-into-a-video/ https://www.phpied.com/turn-an-animated-gif-into-a-video/#respond <![CDATA[Stoyan]]> Wed, 04 Dec 2024 22:45:49 +0000 <![CDATA[(x)HTML(5)]]> <![CDATA[ffmpeg]]> <![CDATA[images]]> https://www.phpied.com/?p=2150 <![CDATA[Animated gifs are fun and all but they can get big (in filesize) quickly. At some point, maybe after just a few low-resolution frames it's better to use an MP4 and an HTML <video> element. You also preferably need a "poster" image for the video so people can see a quick preview before they decide […]]]> <![CDATA[


Animated gifs are fun and all, but they can get big (in file size) quickly. At some point, maybe after just a few low-resolution frames, it's better to use an MP4 and an HTML <video> element. You also probably want a "poster" image for the video, so people can see a quick preview before they decide to play your video. The procedure can be pretty simple thanks to freely available, amazing open-source command-line tools.

Step 1: an MP4

For this we use ffmpeg:

$ ffmpeg -i amazing.gif amazing.mp4

Step 2: a poster image

Here we use ImageMagick to take the first frame in a gif and export it to a PNG:

$ magick "amazing.gif[0]" amazing.png

... or a JPEG, depending on the type of video (photographic vs more shape-y)

Step 3: video tag

<video width="640" height="480" 
  controls preload="none" poster="amazing.png">
  <source src="amazing.mp4" type="video/mp4">
</video>

Step 4: optimize the image

... with your favorite image-smushing tool e.g. ImageOptim

Comments

I did this for a recent Perfplanet calendar post and the 2.5MB gif turned into a 270K mp4. Another 23MB gif turned into a 1.2MB mp4.

I dunno if my ffmpeg install is to blame but the videos didn't play in QuickTime/FF/Safari, only in Chrome. So I ran them through HandBrake and that solved it. Cuz... ffmpeg options are not for the faint-hearted.

Do use preload="none" so that the browser doesn't load the whole video unless the user decides to play it. In my testing, without preload="none", Chrome and Safari send range requests (with an HTTP header like Range: bytes=0-) and get 206 Partial Content responses. Firefox just gets the whole thing.

There is no loading="lazy" for poster images 🙁

]]>
https://www.phpied.com/turn-an-animated-gif-into-a-video/feed/ 0
AI’s “streaming text” UIs: a how-to https://www.phpied.com/ai-streaming-text-ui-how-to/ https://www.phpied.com/ai-streaming-text-ui-how-to/#comments <![CDATA[Stoyan]]> Wed, 27 Nov 2024 09:47:34 +0000 <![CDATA[AI]]> <![CDATA[JavaScript]]> <![CDATA[php]]> https://www.phpied.com/?p=2146 <![CDATA[You've seen some of these UIs as of recent AI tools that stream text, right? Like this: I peeked under the hood of ChatGPT and meta.ai to figure how they work. Server-sent events Server-sent events (SSE) seem like the right tool for the job. A server-side script flushes out content whenever it's ready. The browser […]]]> <![CDATA[


You've seen some of these UIs in recent AI tools that stream text, right? Like this:

I peeked under the hood of ChatGPT and meta.ai to figure how they work.

Server-sent events

Server-sent events (SSE) seem like the right tool for the job. A server-side script flushes out content whenever it's ready. The browser listens to the content as it's coming down the wire with the help of EventSource() and updates the UI.

(aside:) PHP on the server

Sadly I couldn't make the PHP code work server-side on this here blog, even though I consulted Dreamhost's support. I never got the "chunked" response to flush progressively from the server; I always got the whole response once it was ready. It's not impossible though: it worked for me with a local PHP server (like $ php -S localhost:8000) and I'm pretty sure it used to work on Dreamhost before they switched to FastCGI.

If you want to make flush()-ing work in PHP, here are some pointers to try in .htaccess

<FilesMatch "\.php$">
    SetEnv no-gzip 1
    Header always set Cache-Control "no-cache, no-store, must-revalidate"
    SetEnv chunked yes
    SetEnv FcgidOutputBufferSize 0
    SetEnv OutputBufferSize 0
</FilesMatch>

And a test page to tell the time every second:

<?php
header('Cache-Control: no-cache');

@ob_end_clean();

$go = 5;
while ($go) {
    $go--;
    // Send a message
    echo sprintf(
      "It's %s o'clock on my server.\n\n", 
      date('H:i:s', time()),
    );
    flush();
    sleep(1);
}

In this repo stoyan/vexedbyalazyox you can find two PHP scripts that worked for me.

BTW, the server-side partial responses and flushing is pretty old as web performance techniques go.

A bit about the server-sent messages

(I'll keep using PHP to illustrate for just a bit more and then switch to Node.js)

In their simplest form, server-sent events (or messages) are pretty sparse; all you do is:

echo "data: I am a message\n\n";
flush();

And now the client can receive "I am a message".

The events can have event names, anything you make up, like:

echo "event: start\n";
echo "data: Hi!\n\n";
flush();

More on the message fields is available on MDN. But all in all, the stuff you spit out on the server can be really simple:

event: start
data:

data: hello

data: foo

event: end
data:

Events can be named anything, "start" and "end" are just examples. And they are optional too.

data: is not optional, even if all you need is to send an event with no data.

When event: is omitted, it's assumed to be event: message.
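To make the wire format concrete, here's a toy parser for messages like the ones above. It's a minimal sketch, not a spec-compliant SSE parser (no id:, retry:, comments, or multi-line data handling):

```javascript
// Split a raw SSE payload into messages: messages are separated by
// blank lines, and each line is a "field: value" pair. When event:
// is omitted, the event name defaults to "message".
function parseSSE(raw) {
  return raw
    .split("\n\n")
    .filter(Boolean)
    .map((chunk) => {
      const msg = { event: "message", data: "" };
      for (const line of chunk.split("\n")) {
        if (line.startsWith("event:")) msg.event = line.slice(6).trim();
        else if (line.startsWith("data:")) msg.data += line.slice(5).trim();
      }
      return msg;
    });
}

const stream = "event: start\ndata:\n\ndata: hello\n\nevent: end\ndata:\n\n";
console.log(parseSSE(stream));
// → [ { event: 'start', data: '' },
//     { event: 'message', data: 'hello' },
//     { event: 'end', data: '' } ]
```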

The client's JavaScript

To get started you need an EventSource object pointed to the server-side script:

const evtSource = new EventSource(
  'https://pebble-capricious-bearberry.glitch.me/',
);

Then you just listen to events (messages) and update the UI:

evtSource.onmessage = (e) => {
  msg.textContent += e.data;
};

And that's all! You have optional event handlers should you need them:

evtSource.onopen = () => {};
evtSource.onerror = () => {};

Additionally, you can listen to any events with names you decide. For example I want the server to signal to the client that the response is over. So I have the server send this message:

event: imouttahere
data:

And then the client can listen to the imouttahere event:

evtSource.addEventListener('imouttahere', () => {
  console.info('Server calls it done');
  evtSource.close();
});

Demo time

OK, demo time! The server side script takes a paragraph of text and spits out every word after a random delay:

$txt = "The zebra jumps quickly over a fence, vexed by...";
$words = explode(" ", $txt);
foreach ($words as $word) {
    echo "data: $word \n\n";
    usleep(rand(90000, 200000)); // Random delay
    flush();
}

The client side sets up EventSource and, on every message, updates the text on the page. When the server is done (event: imouttahere), the client closes the connection.

Try it here in action. View source for the complete code. Note: if nothing happens initially, that's because the server-side Glitch is gone to sleep and needs to wake up.

One cool Chrome devtools feature is the list of events under an EventStream tab in the Network panel:

Now, what happens if the server is done and doesn't send a special message (such as imouttahere)? Well, the browser thinks something went wrong and re-requests the same URL and the whole thing repeats. This is probably desired behavior in many cases, but here I don't want it.

Try the case of a non-terminating client.

The re-request will look like the following... note the error and the repeat request:
re-requesting

Alrighty, that just about clarifies SSE (Server-Sent Events) and provides a small demo to get you started.

In fact, this is the type of "streaming" ChatGPT uses when giving answers, take a look:

ChatGPT's EventSource

In the EventStream tab you can see the messages passing through. The server sends stuff like:

event: delta
data: {json: here}

This should look familiar now, except the chosen event name is "delta" (not the default, optional "message") and the data is JSON-encoded.

And at the end, the server switches back to "message" and the data is "[DONE]" as a way to signal to the client that the answer is complete and the UI can be updated appropriately, e.g. turn the STOP button back into SEND (arrow pointing up).
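Handling a protocol like that on the client boils down to appending deltas and watching for the terminator. A sketch, with the JSON payload shape assumed for illustration (the real field names may differ):

```javascript
// Fold one SSE message into the UI state: "delta" events append
// text, and a "[DONE]" message marks the answer as complete.
function applyMessage(state, event, data) {
  if (event === "delta") {
    const payload = JSON.parse(data); // assumed shape: { text: "..." }
    return { ...state, text: state.text + (payload.text || "") };
  }
  if (event === "message" && data === "[DONE]") {
    return { ...state, done: true }; // flip STOP back to SEND here
  }
  return state;
}

let state = { text: "", done: false };
state = applyMessage(state, "delta", '{"text":"Hel"}');
state = applyMessage(state, "delta", '{"text":"lo"}');
state = applyMessage(state, "message", "[DONE]");
console.log(state); // { text: 'Hello', done: true }
```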

OK, cool story ChatGPT, let's take a gander at what the competition is doing over at meta.ai

XMLHttpRequest

Asking meta.ai a question, I don't see an EventStream tab, so it must be something else. Looking at the Performance panel for UI updates, I see:

meta.ai updates overview

All of these pinkish, purplish vertical almost-lines are updates. Zooming in on one:

meta.ai updates zoomed

Here we can see XHR readyState change. Aha! Our old friend XMLHttpRequest, the source of all things Ajax!

Looks like with similar server-side flushes meta.ai is streaming the answer. On every readyState change, the client can inspect the current state of the response and grab data from it.

Here's our version of the XHR boilerplate:

const xhr = new XMLHttpRequest();
xhr.open(
  'GET',
  'https://pebble-capricious-bearberry.glitch.me/xhr',
  true,
);
xhr.send(null);

Now the only thing left is to listen to onprogress:

xhr.onprogress = () => {
  console.log('LOADING', xhr.readyState);
  msg.textContent = xhr.responseText;
};

Like before, for a test page, the server just flushes the next chunk of text after a random delay:

$txt = "The zebra jumps quickly over a fence, vexed ...";
$words = explode(" ", $txt);
foreach ($words as $word) {
    echo "$word ";
    usleep(rand(20000, 200000)); // Random delay
    flush();
}

XHR client demo page

Differences between XHR and SSE

First, HTTP header:

# XHR
Content-Type: text/plain
# SSE
Content-Type: text/event-stream

Second, message format. SSE requires a (however simple) format of "event:" and "data:" where data can be JSON-encoded or however you wish. Maybe even XML if you're feeling cheeky. XHR responses are completely free for all, no formatting imposed, and even XML is not required despite the unfortunate name.

And lastly, and most importantly IMO, is that SSE can be interrupted by the client. In my examples I have a "close" button:

document.querySelector('#close').onclick = function () {
  console.log('Connection closed');
  evtSource.close();
};

Here close() tells the server that's enough and the server takes a breath. In XHR land there is xhr.abort(), but it doesn't seem to be used here: inspecting meta.ai you can see that even though the user can click "stop generating", the server keeps sending the response until it completes.

Node.js on the server

Finally, here's the Node.js code I used for the demos. Since I couldn't get Dreamhost to flush() in PHP, I went with Glitch as free Node hosting, just for this one script.

The code handles requests / for SSE and /xhr for XHR. And there are a few ifs based on XHR vs SSE:

const http = require("http");

const server = http.createServer((req, res) => {
  if (req.url === "/" || req.url === "/xhr") {
    const xhr = req.url === "/xhr";
    res.writeHead(200, {
      "Content-Type": xhr ? "text/plain" : "text/event-stream",
      "Cache-Control": "no-cache",
      "Access-Control-Allow-Origin": "*",
    });

    if (xhr) {
      res.write(" ".repeat(1024)); // for Chrome
    }
    res.write("\n\n");

    const txt = "The zebra jumps quickly over a fence, vexed ...";
    const words = txt.split(" ");
    let to = 0;
    for (let word of words) {
      to += Math.floor(Math.random() * 200) + 80;
      setTimeout(() => {
        if (!xhr) {
          res.write(`data: ${word} \n\n`);
        } else {
          res.write(`${word} `);
        }
      }, to);
    }

    if (!xhr) {
      setTimeout(() => {
        res.write("event: imouttahere\n");
        res.write("data:\n\n");
        res.end();
      }, to + 1000);
    }

    req.on("close", () => {
      res.end();
    });
  } else {
    res.writeHead(404);
    res.end("Not Found\n");
  }
});

const port = 8080;
server.listen(port, () => {
  console.log(`Server started on port ${port}`);
});

Note the weird-looking line:

res.write(" ".repeat(1024)); // for Chrome

In the world of flushing, there are many foes that want to buffer the output. Apache, PHP, mod_gzip, you name it. Even the browser. Sometimes it's required to flush out some emptiness (in this case 1K of spaces). I was actually pleasantly surprised that not too much of it was needed. In my testing this 1K buffer was needed only in the XHR case and only in Chrome.

That's all folks!

If you want to inspect the endpoints here they are:

Once again, the repo stoyan/vexedbyalazyox has all the code from this blog and some more too.

And the demos one more time:

Small update: honorable mention for Web Sockets

Web Sockets are yet another alternative for streaming content. Probably the most complex of the three in terms of implementation. Perplexity.ai and MS Copilot seem to have gone this route:

Perplexity

Copilot

]]>
https://www.phpied.com/ai-streaming-text-ui-how-to/feed/ 27
Inverse font subsetting https://www.phpied.com/inverse-font-subsetting/ https://www.phpied.com/inverse-font-subsetting/#respond <![CDATA[Stoyan]]> Mon, 25 Nov 2024 23:39:41 +0000 <![CDATA[font-face]]> <![CDATA[performance]]> https://www.phpied.com/?p=2144 <![CDATA[While at the most recent performance.now() conference, I had a little chat with Andy Davies about fonts and he mentioned it'd be cool if, while subsetting, you can easily create a second subset file that contains all the "rejects". All the characters that were not included in the initially desired subset. And as the flight […]]]> <![CDATA[

While at the most recent performance.now() conference, I had a little chat with Andy Davies about fonts and he mentioned it'd be cool if, while subsetting, you can easily create a second subset file that contains all the "rejects". All the characters that were not included in the initially desired subset.

And as the flight from Amsterdam is pretty long, I hacked on just that. Say hello to a new script, available as an NPM package, called...

inverse-subset

Initially I was thinking of wrapping Glyphhanger and doing both subsets, but decided there's no point in wrapping Glyphhanger to do what Glyphhanger already does. So the initial subset is left to the user to do in any way they see fit. What I set out to do was take The Source (the complete font file) and The Subset and produce an inversion, where

The Inverted Subset = The Source - The Subset

This way if your subset is all Latin characters, the inversion will be all non-Latin characters.
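The core operation is just a set difference over code points. A toy sketch of the idea (this is not the actual inverse-subset implementation, which operates on font files, not arrays):

```javascript
// The Inverted Subset = The Source - The Subset:
// keep every source code point that the subset does not cover.
function invertSubset(sourceCodepoints, subsetCodepoints) {
  const subset = new Set(subsetCodepoints);
  return sourceCodepoints.filter((cp) => !subset.has(cp));
}

// Toy example: the "source font" covers A–E, the subset took A–C,
// so the inversion covers D and E.
const source = [0x41, 0x42, 0x43, 0x44, 0x45];
console.log(invertSubset(source, [0x41, 0x42, 0x43])); // [ 68, 69 ]
```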

When you craft the @font-face declaration, you can use the Unicode range of the subset, like

@font-face {
    font-family: "Oxanium";
    src: url("Oxanium-subset.woff2") format("woff2");
    unicode-range: U+0020-007E;
}

(Unicode generated by wakamaifondue.com/beta)

Then for the occasional character that is not in this range, you can let the browser load the inverted subset. But that should be rare, otherwise an oft-needed character will be in the original subset.

Save on HTTP requests and bytes (in 99% of cases) and yet, take care of all characters your font supports for that extra special 1% of cases.

Unicode-optional

Wakamaifondue can generate the Unicode range for the inverted subset too, but it's not required (it's too long!) as long as the inverted declaration comes first. In other words, if you have:

@font-face {
  font-family: "Oxanium";
  src: url("Oxanium-inverse-subset.woff2") format("woff2");
}
@font-face {
  font-family: "Oxanium";
  src: url("Oxanium-subset.woff2") format("woff2");
  unicode-range: U+0020-007E;
}

... and only Latin characters on the page, then Oxanium-inverse-subset.woff2 is NOT going to be downloaded, because the second declaration overwrites the first.

Test page is here

If you flip the two @font-face blocks, the inversion will be loaded because it claims to support everything. And the Latin will be loaded too, because the inversion proves inadequate.

If you cannot guarantee the order of @font-faces for some reason, specifying a scary-looking Unicode range for the inversion is advisable:

@font-face {
    font-family: "Oxanium";
    src: url("Oxanium-inverse-subset.woff2") format("woff2");
    unicode-range: U+0000, U+000D, U+00A0-0107, U+010C-0113, U+0116-011B,
        U+011E-011F, U+0122-0123, U+012A-012B, U+012E-0131, U+0136-0137,
        U+0139-013E, U+0141-0148, U+014C-014D, U+0150-015B, U+015E-0165,
        U+016A-016B, U+016E-0173, U+0178-017E, U+0192, U+0218-021B, U+0237,
        U+02C6-02C7, U+02C9, U+02D8-02DD, U+0300-0304, U+0306-0308,
        U+030A-030C, U+0326-0328, U+03C0, U+1E9E, U+2013-2014, U+2018-201A,
        U+201C-201E, U+2020-2022, U+2026, U+2030, U+2039-203A, U+2044, U+2070,
        U+2074, U+2080-2084, U+20AC, U+20BA, U+20BD, U+2113, U+2122, U+2126,
        U+212E, U+2202, U+2206, U+220F, U+2211-2212, U+2215, U+2219-221A,
        U+221E, U+222B, U+2248, U+2260, U+2264-2265, U+25CA, U+F000,
        U+FB01-FB02;
}

What embarrassment looks like

If you don't load the extended characters and someone uses your CMS to add a wee bit of je ne sais quoi, you get a fallback font:

Test page is here

(Note the à shown in a fallback font)

But if you do load the inversion, all is fine with the UI once again.

Test page

Thank you!

... and happy type setting, subsetting, and inverse subsetting!

Here's a view of the tool in action:

]]>
https://www.phpied.com/inverse-font-subsetting/feed/ 0
Web font sizes: a more complete data set https://www.phpied.com/web-font-sizes-a-more-complete-data-set/ https://www.phpied.com/web-font-sizes-a-more-complete-data-set/#respond <![CDATA[Stoyan]]> Tue, 12 Nov 2024 00:43:07 +0000 <![CDATA[font-face]]> <![CDATA[performance]]> https://www.phpied.com/?p=2139 <![CDATA[This is part 4 of an ongoing study of web font file sizes, subsetting, and file sizes of the subsets. I used the collection of freely available web fonts that is Google Fonts. In part 1 I wondered How many bytes is "normal" for a web font by studying all regular fonts, meaning no bolds, […]]]> <![CDATA[

This is part 4 of an ongoing study of web font file sizes, subsetting, and file sizes of the subsets.

I used the collection of freely available web fonts that is Google Fonts.

  • In part 1 I wondered How many bytes is "normal" for a web font by studying all regular fonts, meaning no bolds, italics, etc. The answer was, of course 42, around 20K for a LATIN subset
  • In part 2 I wondered how does a font grow, by subsetting fonts one character at a time. The answer was, of course 42, about 0.1K per character
  • Part 3 was a re-study of part 1, but this time focusing on variable fonts using only one variable dimension, weight, i.e. a variable bold-ness. This time the answer was, of course 42: 35K is the median file size of a wght-variable font

Now, instead of focusing on just regular or just weight-variable fonts, I thought let's just do them all and let you, my dear reader, do your own filtering, analysis and conclusions.

One constraint I kept was just focusing on the LATIN subset (see part 1 as to what LATIN means) because as Boris Shapira notes: "...even with basic high school Chinese, we would need a minimum of 3,000 characters..." which is an order of magnitude larger than Latin, and we do need to keep some sort of apples-to-apples here.

The study

First download all Google fonts (see part 1).

Then subset all of them fonts to LATIN and drop all fonts that don't support at least 200 characters. 200 and a bit is what the average LATIN font out there supports. This resulted in excluding fonts that focus mostly on non-Latin, e.g. Chinese characters. But it also dropped some fonts that are close to 200 Latin characters but not quite there. See part 1 for the "magic" 200 number. So this replicates part 1 and part 3 but this time for all available fonts.

This 200-LATIN filtering leaves us with 3277 font files to study and 261 font file "rejects". The full list of rejects is rejects.txt

Finally, subset each of the remaining fonts, 10 characters at a time, to see how they grow. This replicates part 2 for all fonts, albeit a bit more coarsely (10 characters at a time as opposed to 1; hey, it still took over 24 hours while running 10 threads simultaneously, meaning 10 copies of the subsetting script!). The subsets are 1 character, 10 characters, 20... up to 200. I ended up with 68,817 font files.

(20 subsets of 10 to 200 characters + the 1-character subset) × 3277 files = 68,817
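The arithmetic behind that count, spelled out:

```javascript
// 20 subsets (10, 20, ..., 200 characters) plus the 1-character
// subset, for each of the 3277 font files that passed the filter.
const subsetsPerFont = 200 / 10 + 1; // 21
console.log(subsetsPerFont * 3277); // 68817
```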

Data

LATIN

The LATIN subset data is available in CSV (latin.csv) and HTML (latin.html)

Subsets

The subset data is available as CSV (stats.csv) and Google spreadsheet

Some observations

  • The data set contains 3277 different font files, each being subset 21 times
  • 588 are variable fonts
  • 429 are variable only on the weight axis
  • 196 are variable on more than one axis, e.g. [wdth,wght] or [FLAR,VOLM,slnt,wght]
  • 63 use the [opsz] axis (it's been suggested this is the "expensive" one in terms of file size)

Conclusions

I'd love to hear your analysis on the data! I hope this data can be useful and I'm looking forward to any and all insights.

]]>
https://www.phpied.com/web-font-sizes-a-more-complete-data-set/feed/ 0
@supports and @font-face troubles https://www.phpied.com/supports-and-font-face-troubles/ https://www.phpied.com/supports-and-font-face-troubles/#comments <![CDATA[Stoyan]]> Sun, 03 Nov 2024 05:01:51 +0000 <![CDATA[font-face]]> <![CDATA[performance]]> https://www.phpied.com/?p=2133 <![CDATA[I've been crafting a nice font-face fallback, something like this: @font-face { font-family: fallback; src: local('Helvetica Neue'); ascent-override: 85%; descent-override: 19.5%; line-gap-override: 0%; size-adjust: 106.74%; } It works well, however Safari doesn't yet support ascent-override, descent-override, nor line-gap-override in @font-face blocks. It does support size-adjust though. Since my code requires all 4, the results with […]]]> <![CDATA[

I've been crafting a nice font-face fallback, something like this:

@font-face {
  font-family: fallback;
  src: local('Helvetica Neue');

  ascent-override: 85%;
  descent-override: 19.5%;
  line-gap-override: 0%;

  size-adjust: 106.74%;
}

It works well; however, Safari doesn't yet support ascent-override, descent-override, or line-gap-override in @font-face blocks. It does support size-adjust though.

Since my code requires all 4, the results with size-adjust alone look bad. Worse than no overrides. Easy-peasy, I thought: I'll target Safari and not give it any of the 4.

I wanted to use @supports in CSS to keep everything nice and compact. No JavaScript, no external CSS, all this is for a font fallback, so it should be loaded as early in the page as possible, together with the @font-face.

Unfortunately, turns out that for example both

@supports (ascent-override: normal) {/* css here */}

and

@supports (size-adjust: 100%) {/* css here */}

end up with the "css here" not being used.

In fact even the amazing font-display: swap is not declared as being @support-ed.

Using the JavaScript API I get this in Chrome, Safari and Firefox:

console.log(CSS.supports('font-stretch: normal')); // true
console.log(CSS.supports('font-style: normal')); // true
console.log(CSS.supports('font-display: swap')); // false
console.log(CSS.supports('size-adjust: 100%')); // false
console.log(CSS.supports('ascent-override: normal')); // false

Huh? Am I using @supports incorrectly? Or did browsers forget to update this part of the code after adding a new feature? But what are the chances that all three make the same error?

It's not that everything inside @font-face fails @supports, because font-style and font-stretch pass.

Clearing out my confusion

Ryan Townsend pointed out that font-style and font-stretch work because they double as properties, not only font descriptors. So it turns out font descriptors are not supported by @supports. Darn!

Noam Rosenthal pointed out this github issue, open in 2018, to add support for descriptors too.

For now I came up with 2 (imperfect) solutions. One uses JavaScript to check for a property, like

'ascentOverride' in new FontFace(1,1); // true in Chrome, FF, false in Saf

Not ideal because it's JavaScript.

The other one is to target non-Safari in CSS with a different property to use as a "proxy". Using the wonderful Compare Browsers feature of CanIUse.com I found a good candidate:

@supports (overflow-anchor: auto) {
  @font-face {
    /* works in Chrome, Edge, FF, but not in Safari */
  }
}

It's not ideal to test one thing (overflow-anchor) and use another (ascent-override), but at least no JavaScript is involved.

]]>
https://www.phpied.com/supports-and-font-face-troubles/feed/ 3
Letter frequency in the Bulgarian language https://www.phpied.com/letter-frequency-in-bulgarian-language/ https://www.phpied.com/letter-frequency-in-bulgarian-language/#respond <![CDATA[Stoyan]]> Fri, 01 Nov 2024 06:18:52 +0000 <![CDATA[misc hackery]]> https://www.phpied.com/?p=2130 <![CDATA[In this post, I talked about the letter frequency in English presented in Peter Norvig's research. And then I thought... what about my own mother tongue? So I got a corpus of 5000 books (832,260 words), a mix of Bulgarian authors and translations, and counted the letter frequency. Here's the result in CSV format: letters.csv […]]]> <![CDATA[

In this post, I talked about the letter frequency in English presented in Peter Norvig's research. And then I thought... what about my own mother tongue?

So I got a corpus of 5000 books (832,260 words), a mix of Bulgarian authors and translations, and counted the letter frequency. Here's the result in CSV format: letters.csv
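
The post doesn't include the counting script itself; a minimal sketch of the idea in JavaScript follows. It is not the actual script used for the study, and the corpus file name `books.txt` is an assumption:

```javascript
// Sketch: count Cyrillic letter frequency in a text corpus.
// Not the actual script used for the study; the input path is hypothetical.
function letterFrequency(text) {
  const counts = {};
  let total = 0;
  for (const ch of text.toLowerCase()) {
    // keep only Cyrillic letters in the а-я range
    if (ch >= 'а' && ch <= 'я') {
      counts[ch] = (counts[ch] || 0) + 1;
      total++;
    }
  }
  // turn raw counts into percentages, sorted by frequency, descending
  return Object.entries(counts)
    .map(([letter, count]) => [letter, (100 * count / total).toFixed(2) + '%'])
    .sort((a, b) => parseFloat(b[1]) - parseFloat(a[1]));
}

// Usage (Node.js), with the corpus concatenated into one file:
// const text = require('fs').readFileSync('books.txt', 'utf8'); // hypothetical path
// console.log(letterFrequency(text));
```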


Here are the results (in alphabetical order) in a graph:

And another graph, with data sorted by the frequency of letters:

ChatGPT gives a different result, even startlingly so (o is the winner at ~9.1% and a is third with 7.5%), which makes me like my letter count research even more 😀

]]>
https://www.phpied.com/letter-frequency-in-bulgarian-language/feed/ 0
Web font file size study: a variable font addition https://www.phpied.com/web-font-file-size-study-a-variable-font-addition/ https://www.phpied.com/web-font-file-size-study-a-variable-font-addition/#comments <![CDATA[Stoyan]]> Mon, 28 Oct 2024 23:09:13 +0000 <![CDATA[font-face]]> <![CDATA[performance]]> https://www.phpied.com/?p=2129 <![CDATA[TL;DR: If your variable font file is significantly larger than 35K you may ask yourself "How did I get here?" Two font files (of the same family) means more bytes than one variable font that does both For context see part 1 and part 2. After publishing part 2 of my ongoing web fonts file […]]]> <![CDATA[

TL;DR:

  • If your variable font file is significantly larger than 35K you may ask yourself "How did I get here?"
  • Two font files (of the same family) means more bytes than one variable font that does both

For context see part 1 and part 2.

After publishing part 2 of my ongoing web fonts file size study, I got feedback on Mastodon to the effect of "hey, what about variable fonts?"

Good question! I speculated in part 2 that there may be savings if we can combine font variants (bold, italic) in a single file, sprite-style. And that's just what a variable font is (and more!)

Rerun them scripts

Following the process described in part 1, I grabbed only fonts from Google Fonts that have [wght] in the name and subset them to the LATIN subset, throwing away those with fewer than 200 characters. I also removed all fonts with "Italic" in the name.

Why [wght] only and not stuff like AdventPro[wdth,wght]?
I wanted to keep only one variable dimension so we can see apples-to-apples as much as possible. And [wght] seems to be the most popular dimension by far.

Why no Italic?
I wanted to keep the fonts kinda diverse. Chances are AlbertSans-Italic[wght].ttf and AlbertSans[wght].ttf are designed by the same person (or people), so they use similar techniques, optimizations and so on. And I'm looking for what's "out there" in general.

Results

Here are the results in HTML and in CSV format.

And just a taste of what the results look like...

Num chars Num glyphs Bytes File Font name
235 378 21400 Afacad[wght]-subset.woff2 Afacad
217 243 34688 Aleo[wght]-subset.woff2 Aleo
... ... ... ... ...
241 609 61456 YsabeauOffice[wght]-subset.woff2 Ysabeau Office
241 621 62552 Ysabeau[wght]-subset.woff2 Ysabeau
241 584 58688 YsabeauInfant[wght]-subset.woff2 Ysabeau Infant

Overall stats:

  • Average file size: 50,532.86 bytes
  • Median file size: 34,744 bytes
  • Average glyph count: 438.42
  • Median glyph count: 328
  • Median character count: 222
  • Number of font files: 335
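
The averages and medians above are plain aggregates over the results CSV. As a sketch, here's how the two could be computed in JavaScript, using the byte sizes from the sample table rows shown earlier:

```javascript
// Average and median helpers: the same kind of aggregates as the stats above.
function average(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // even count: mean of the two middle values; odd count: the middle value
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Byte sizes from the sample rows above:
const sizes = [21400, 34688, 61456, 62552, 58688];
console.log(average(sizes)); // 47756.8
console.log(median(sizes));  // 58688
```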

Conclusions?

  • In part 1 one of the conclusions was: the median file size of a regular web font with Latin-extended subset of characters is 19092 bytes. Where "regular" means no bolds, no italics, etc.
  • Here we see that the median file size of a variable web font with Latin-extended subset of characters is 34744 bytes
  • The whole is smaller than the sum of its parts. A variable font that has both normal and heavy (bold) weights (and also everything in between) is slightly smaller than two regular fonts. Assuming that a bold font file is as big as a regular one (we'll check on that assumption later), 19092 * 2 = 38,184 is greater than 34,744

The file size difference is not big, but we can still see a saving, probably because of duplicate metadata and other similar elements in two files vs. one. And then there's also the delivery saving: 2 HTTPS requests vs. one.

Potential skew-age?

  1. Smaller sample: here we're looking at the median file size amongst 335 files vs. 1009 files in the original study.
  2. Uneven number of characters: the median number of characters here is 222, where in the original study it was 219. Not a big difference, but still... Also, the total number of characters is random (but over 200) in both studies. We can control for this (in a followup) by comparing only 200-char subsets, for example.
  3. Google fonts only: well yeah, that's an easy corpus of fonts to download and mess around with.

Next?

In the spirit of part 2 I'd like to study the sizes when incrementing the number of characters in a subset (as opposed to a catch-all LATIN). This will address potential skew #2 above. Probably not increments of 1 but of 50 to save some processing.

I'd also like to experiment with ALL the fonts available. So far I've been looking at "Regular" and [wght] only. But I should just do it all and then have people smarter than me (such as yourself, my dear reader) slice the results and draw conclusions any way you want.

]]>
https://www.phpied.com/web-font-file-size-study-a-variable-font-addition/feed/ 1
The zebra jumps quickly over a fence, vexed by a lazy ox https://www.phpied.com/the-zebra-jumps-quickly-over-a-fence-vexed-by-a-lazy-ox/ https://www.phpied.com/the-zebra-jumps-quickly-over-a-fence-vexed-by-a-lazy-ox/#comments <![CDATA[Stoyan]]> Mon, 21 Oct 2024 05:18:13 +0000 <![CDATA[font-face]]> <![CDATA[performance]]> https://www.phpied.com/?p=2125 <![CDATA[The zebra jumps quickly over a fence, vexed by a lazy ox. Eden tries to alter soft stone near it. Tall giants often need to rest, and open roads invite no pause. Some long lines appear there. In bright cold night, stars drift, and people watch them. A few near doors step out. Much light […]]]> <![CDATA[

The zebra jumps quickly over a fence, vexed by a lazy ox. Eden tries to alter soft stone near it. Tall giants often need to rest, and open roads invite no pause. Some long lines appear there. In bright cold night, stars drift, and people watch them. A few near doors step out. Much light finds land slowly, while men feel deep quiet. Words run in ways, forward yet true. Look ahead, and things form still, yet dreams stay hidden. Down the path, close skies come, forming hard arcs. High above, quiet kites drift, fast on pure wind, yanking joints.

What's so special about the nonsense paragraph above? It's attempting to match the average distribution of letters in texts written in the English language.

This article by Peter Norvig discusses a 2012 study of letter frequency using the Google Books data set. And the distribution looks like so:

For font-fallback matching purposes (more on this later) I want a shorter paragraph, representing a roughly similar distribution. One can, of course, just create a paragraph like "Zzzzzzzzz" (9 Zs), followed by 12 Qs and so on, all the way to 1249 Es. But where's the fun in that? Plus, texts have spaces and punctuation too.

So after some tweaking and AI coaching, this is the paragraph that came out. It looks more realistic and matches the letter frequency pretty well.

Here's a CSV that shows:

  • each letter,
  • Norvig's frequencies (based on 3,563,505,777,820 letters in the data set), and
  • my frequencies too (based on a mere 424 letters, once you take out spaces and punctuation)
Letter,Norvig,Tall giants
E,12.49%,12.26%
T,9.28%,8.73%
A,8.04%,7.55%
O,7.64%,7.08%
I,7.57%,6.60%
N,7.23%,7.55%
S,6.51%,6.84%
R,6.28%,6.13%
H,5.05%,4.01%
L,4.07%,4.48%
D,3.82%,5.42%
C,3.34%,1.89%
U,2.73%,2.36%
M,2.51%,2.12%
F,2.40%,2.83%
P,2.14%,2.59%
G,1.87%,2.12%
W,1.68%,2.12%
Y,1.66%,2.12%
B,1.48%,0.94%
V,1.05%,0.94%
K,0.54%,1.18%
X,0.23%,0.47%
J,0.16%,0.47%
Q,0.12%,0.71%
Z,0.09%,0.47%

Here's the same data represented graphically:

Well, what's the point of this?

Similar to the nonsense etaoin shrdlu used by typesetters, this paragraph can be used to find out the average character width of a font.

Just render the paragraph in a non-wrapping inline-block DOM element, measure the width of the element and divide by the length of the text.

How is this useful? Welp, to set the size-adjust CSS property of a fallback font to match a custom web font. A further write-up is coming, stay tuned!
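
As a sketch of that measurement in JavaScript: the division is split out into its own function so the math stands alone, and the DOM setup below is just one assumed way to do the rendering, not the definitive one:

```javascript
// Average character width = rendered width of the paragraph / its length.
function averageCharWidth(text, measureWidth) {
  // measureWidth(text) should return the rendered width of the text in px
  return measureWidth(text) / text.length;
}

// Browser-only helper: render the text in a non-wrapping, hidden
// inline-block element and measure it (hypothetical setup).
function domMeasure(fontFamily) {
  return (text) => {
    const el = document.createElement('span');
    el.style.cssText =
      `display:inline-block;white-space:nowrap;font:16px ${fontFamily};` +
      'position:absolute;visibility:hidden';
    el.textContent = text;
    document.body.appendChild(el);
    const width = el.getBoundingClientRect().width;
    el.remove();
    return width;
  };
}

// In a browser, with the test paragraph from above:
// const avg = averageCharWidth(testParagraph, domMeasure('Georgia'));
```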

Close enough

As you can see in the graph, the two lines do not match exactly. I think this is OK. It's extremely unlikely that any text on your page will have the exact average distribution of letters in it. So we're talking about an approximation to begin with. It may also be site-dependent. E.g. on an adult site, maybe the X character will occur more often than in the average book.

Also Norvig's analysis doesn't mention spaces and punctuation. In my paragraph, these exist, maybe making it possible to match the average text on a web page just a little bit closer.

Aside: why not just Lorem Ipsum

Well, it doesn't attempt to match the character distribution in English. (Duh, it's not even English!)
Here's what it looks like in the same diagram:

Note: no K, J, Z, W or Y. Barely any H.

Here are the stats in CSV and .numbers for your perusal.

May "The zebra jumps quickly over a fence, vexed by a lazy ox" be always in your favor!

]]>
https://www.phpied.com/the-zebra-jumps-quickly-over-a-fence-vexed-by-a-lazy-ox/feed/ 2