The problems with embedding in REST today and how it might be solved with HTTP/2
Please note: Browsers have recently deprecated HTTP/2 Push from the Web Platform, making some of the ideas in this article no longer relevant. See Push is Dead for more information.
When looking at REST, the underlying theory, and various interpretations and even HTTP, you’ll find that REST is about singular resources and transferring state of those resources.
There’s very little supporting information about the concept of a collection of resources. Almost every REST API out there will have a need for them, but there doesn’t seem to be a clear definition for them, or an accepted way to technically handle them.
Fielding’s dissertation on REST says:
The key abstraction of information in REST is a resource. Any information that can be named can be a resource: a document or image, a temporal service (e.g. “today’s weather in Los Angeles”), a collection of other resources, a non-virtual object (e.g. a person), and so on. In other words, any concept that might be the target of an author’s hypertext reference must fit within the definition of a resource. A resource is a conceptual mapping to a set of entities, not the entity that corresponds to the mapping at any particular point in time.
This paragraph is the only place where the word ‘collection’ appears, and it basically says that it’s just another resource. This is good, because it tells us that collections are not a separate concept from other resources and should abide by the same constraints.
However, REST only gives us an architecture; we still have to impose our own further constraints and meaning on top of it. So how do we treat a collection of resources?
Well, when you look at any of the common hypermedia formats, you’ll find that each of them has a slightly different idea about it. However, each of these formats tends to have something in common: they treat items in a collection as a sort of ‘sub-resource’ inside a ‘resource’.
What is a collection, really?
Before we go into some examples, I think it’s worthwhile to discuss this. A collection is a set of resources. Resources may ‘belong’ to a collection in some cases, but resources may also appear in multiple collections.
For example, if my REST API represents blog posts, the same article may appear in a collection that represents all posts from 2017, and all posts written by me.
You might think of a collection as a database table, but I kind of think that that’s not the right analogy. I think a collection is more like a directory on a modern filesystem. The filesystem directory only contains links to files and their position on the filesystem (inode). You could say that a directory has references to files, but you can’t really say that a directory IS a collection of those files.
Files can appear in multiple directories using hardlinks too.
So knowing that, we can translate this to REST terms. A collection is a set of hyperlinks to other resources.
If you’re using HAL, you can express this using:
{
  "_links" : {
    "self" : { "href" : "/articles/byyear/2017" },
    "item" : [
      { "href" : "/articles/1" },
      { "href" : "/articles/2" },
      { "href" : "/articles/3" }
    ]
  }
}
There you have it. In HAL the top-level property names in the _links object are link relations, and the item link relation was standardized by Mike Amundsen as RFC6573.
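As a rough sketch of what a client does with such a document, the snippet below pulls the item links out of it. The helper name itemHrefs is hypothetical; the only thing assumed from HAL is that a relation may hold either a single link object or an array of them.

```javascript
// Hypothetical helper: list the hrefs behind the "item" relation of a HAL document.
function itemHrefs(doc) {
  const links = (doc._links && doc._links.item) || [];
  // HAL allows a relation to be a single link object or an array of them.
  const list = Array.isArray(links) ? links : [links];
  return list.map((link) => link.href);
}

const collection = {
  _links: {
    self: { href: '/articles/byyear/2017' },
    item: [{ href: '/articles/1' }, { href: '/articles/2' }, { href: '/articles/3' }],
  },
};

// Each href printed here is one more GET the client has to make.
console.log(itemHrefs(collection));
```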
This is great, but impractical
The problem with the linked approach is that very often, a REST client will want to receive the details of every item in a collection. In the last example this would mean an extra HTTP GET request for each of the items in the collection.
This is not acceptable for many real-world REST services out there, so we need a solution. HAL (and virtually every other hypermedia format) solves this by embedding the resources in the collection.
To demonstrate, here is the same collection again.
{
  "_links" : {
    "self" : { "href" : "/articles/byyear/2017" }
  },
  "_embedded" : {
    "item" : [
      {
        "_links" : {
          "self" : { "href" : "/articles/1" }
        },
        "title" : "..",
        "pubDate" : "..",
        "content" : ".."
      },
      {
        "_links" : {
          "self" : { "href" : "/articles/2" }
        },
        "title" : "..",
        "pubDate" : "..",
        "content" : ".."
      },
      {
        "_links" : {
          "self" : { "href" : "/articles/3" }
        },
        "title" : "..",
        "pubDate" : "..",
        "content" : ".."
      }
    ]
  }
}
Two things happened here. Each individual item in the collection now appears in _embedded. They have their own sets of links, including a self link referring to the URI of the resource we just embedded. We also removed the items from _links.
The way we look at _embedded is that things that appear in _embedded are:
- Link relations, equal to items appearing in _links.
- The data you would receive if you did a GET request on the target of the link.
This is important, because a good HAL-based REST client should ideally consider things appearing in _embedded and _links as the exact same thing. The only difference is that because an item appears in _embedded, the client no longer needs to perform that GET request. The client should cache it.
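One way to picture "the exact same thing" is a single lookup that answers from either place. This is only a sketch: resolveRelation is a hypothetical helper, and it assumes every embedded resource carries a self link.

```javascript
// Hypothetical helper: look up a relation in _embedded first, then in _links.
// Every result has an href; body is set only when the server embedded the resource.
function resolveRelation(doc, rel) {
  const asArray = (value) => (value === undefined ? [] : [].concat(value));
  const embedded = asArray(doc._embedded && doc._embedded[rel]).map((body) => ({
    href: body._links.self.href,
    body,
  }));
  const seen = new Set(embedded.map((entry) => entry.href));
  const linked = asArray(doc._links && doc._links[rel])
    .filter((link) => !seen.has(link.href))
    .map((link) => ({ href: link.href, body: null }));
  return embedded.concat(linked);
}

const linkedOnly = { _links: { item: [{ href: '/articles/1' }] } };
const withEmbedded = {
  _embedded: { item: [{ _links: { self: { href: '/articles/1' } }, title: '..' }] },
};

console.log(resolveRelation(linkedOnly, 'item'));   // body is null: a GET is still needed
console.log(resolveRelation(withEmbedded, 'item')); // body is present: no GET needed
```

The calling code never has to know which of the two the server chose.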
If a client is built around this core concept, another benefit is that the client becomes adaptable to changes to the server. You might for example see over time that a HAL client very often might follow a certain link and almost always will want the data for it.
An example
Here’s a real-world example from our API. Our API has a document on the root of the API that the client uses to discover all the other resources. It contains a link to a resource that has information about the current user:
{
  "_links" : {
    "self" : { "href" : "/" },
    "current-user" : { "href" : "/user/1356" },
    "support" : { "href": "mailto:[email protected]" }
  }
}
We noticed that all clients always request the current user’s information after logging in. They always follow up this initial GET with a GET to whatever the current-user relation points at.
Knowing this, we can change our API to simply assume this and pre-emptively send that resource over:
{
  "_links" : {
    "self" : { "href" : "/" },
    "support" : { "href": "mailto:[email protected]" }
  },
  "_embedded" : {
    "current-user" : {
      "_links" : {
        "self" : { "href" : "/user/1356" }
      },
      "firstName" : "...",
      "lastName" : "...",
      "email" : "..."
    }
  }
}
A good HAL client would need 0 changes, and simply adapt to this new situation and skip the second GET request.
A few problems with this
One of the biggest advantages and promises of REST is that by using the functionality of HTTP, we get all the benefits of HTTP. The usual example of this is being able to use the rich caching features of HTTP.
However, HTTP caches will not be aware of embedded resources. In the last example, the cache doesn’t know that /user/1356 was embedded and cached, and it does not know it can skip a future GET request to that resource.
Also, if we did a real cached GET request, but later on we issue a PUT on that same resource, a HTTP client ‘knows’ that since a PUT was issued, the local cache is no longer valid. A copy that arrived through _embedded gets no such automatic invalidation.
So typically, a HAL client that wants to be adaptable to this needs its own cache. What we’ll ideally want to do is something like this (in pseudo-code):
var api = new API('/'); // Referring to the root.
var currentUserResource = await api.follow('current-user');
console.log(await currentUserResource.get());
I hope the source makes some sense, but the general idea is that once we ‘follow’ the current-user link and get its representation, we only need the extra GET request if it wasn’t embedded. This should be a seamless experience.
To implement this in browsers today, it means that the API client will need to:
- Understand _embedded.
- Take items from _embedded, and treat them the same as links.
- Store them in some kind of local cache.
- Use that local cache when the user wants to do a GET request.
- Invalidate that local cache when PUT, DELETE or another non-safe HTTP request is issued on that same resource.
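The steps above could be sketched roughly like this. EmbeddedCache and its method names are hypothetical, the cache is a plain in-memory Map with no expiry, and it assumes every embedded resource has a self link.

```javascript
// Hypothetical client-side cache for resources that arrived via _embedded.
class EmbeddedCache {
  constructor() {
    this.store = new Map(); // href -> representation
  }

  // Walk a HAL document and remember every embedded resource by its self link.
  ingest(doc) {
    for (const value of Object.values(doc._embedded || {})) {
      for (const resource of [].concat(value)) {
        this.store.set(resource._links.self.href, resource);
        this.ingest(resource); // embedded resources may embed more resources
      }
    }
  }

  // Returns the cached representation, or undefined if a real GET is needed.
  get(href) {
    return this.store.get(href);
  }

  // Call this for every outgoing request; non-safe methods drop the cached copy.
  noteRequest(method, href) {
    const safe = ['GET', 'HEAD', 'OPTIONS', 'TRACE'];
    if (!safe.includes(method.toUpperCase())) {
      this.store.delete(href);
    }
  }
}

const cache = new EmbeddedCache();
cache.ingest({
  _links: { self: { href: '/' } },
  _embedded: {
    'current-user': { _links: { self: { href: '/user/1356' } }, firstName: '...' },
  },
});
console.log(cache.get('/user/1356')); // the embedded representation, no GET needed
cache.noteRequest('PUT', '/user/1356');
console.log(cache.get('/user/1356')); // undefined: the next read needs a real GET
```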
So while we can still use HTTP semantics, we now have two caching layers which may conflict.
Fetch: A future solution to this problem
The Fetch API is the future of doing HTTP requests in browsers. It’s a much nicer API than XMLHttpRequest, and it comes with another really cool feature: the related Cache API gives us a request/response cache that we can control directly from JavaScript.
The specific thing to look for is Cache.put(). This API allows us to directly add things to a browser-managed cache. A HAL client in this case could parse out everything that appears in _embedded and directly add it to this cache. A future GET request can then be answered directly from this cache, and the network request is avoided. This is huge. One caveat: this cache is separate from the browser’s built-in HTTP cache, so the client (or a service worker) has to look there with Cache.match() before going to the network.
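A rough sketch of that idea follows. caches.open(), Cache.put(), Request and Response are the real browser APIs; everything else is an assumption made for illustration: the helper names embeddedEntries and primeCache, the cache name hal-embedded, and the Content-Type header, which has to be invented because _embedded carries no headers.

```javascript
// Hypothetical helper: turn _embedded into { url, body } entries.
function embeddedEntries(doc, baseUrl) {
  const entries = [];
  for (const value of Object.values(doc._embedded || {})) {
    for (const resource of [].concat(value)) {
      // Resolve the self link against the URL the document was fetched from.
      const url = new URL(resource._links.self.href, baseUrl).toString();
      entries.push({ url, body: resource });
    }
  }
  return entries;
}

// Browser-only: store each entry as a request/response pair with Cache.put().
async function primeCache(doc, baseUrl) {
  const cache = await caches.open('hal-embedded'); // assumed cache name
  for (const { url, body } of embeddedEntries(doc, baseUrl)) {
    const request = new Request(url);
    const response = new Response(JSON.stringify(body), {
      headers: { 'Content-Type': 'application/hal+json' }, // assumed header
    });
    await cache.put(request, response);
  }
}
```

A later read would first try cache.match(url) and only fall back to fetch(url) when that resolves to undefined.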
What’s interesting about this API is that it does not just take a URL and the thing you want to store, you actually store a HTTP request and a HTTP response.
This is important, because a HTTP cache is not just URL-based. A single GET request to a single URL might have different responses based on HTTP request headers like:
- Accept
- Accept-Language
- Authorization
Or even other custom headers. A HTTP server can indicate to a client how the client should store responses in the cache based on the Vary header. For example, if a response to a GET request has the Vary: X-Foo header, the HTTP client knows that depending on the value of the X-Foo header in the HTTP request (not the response, this is important!) the server might emit a different response.
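To illustrate why a URL alone is not enough, here is a simplified sketch of a cache key that takes Vary into account. The function cacheKey is hypothetical and it ignores the special case Vary: *, which real caches treat as "never reuse".

```javascript
// Hypothetical helper: build a cache key that honours the response's Vary header.
function cacheKey(method, url, requestHeaders, vary) {
  // Header names are case-insensitive, so normalise them first.
  const headers = {};
  for (const [name, value] of Object.entries(requestHeaders)) {
    headers[name.toLowerCase()] = value;
  }
  // Only the request headers named in Vary take part in the key.
  const varied = (vary || '')
    .split(',')
    .map((name) => name.trim().toLowerCase())
    .filter((name) => name !== '')
    .sort()
    .map((name) => `${name}=${headers[name] === undefined ? '' : headers[name]}`);
  return [method.toUpperCase(), url, ...varied].join(' ');
}

const json = cacheKey('GET', '/user/1356', { Accept: 'application/hal+json' }, 'Accept');
const html = cacheKey('GET', '/user/1356', { Accept: 'text/html' }, 'Accept');
console.log(json === html); // false: Vary: Accept makes these two separate cache entries
```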
What’s also important to know is that the HTTP client cache will store and expire the cache for any given resource based on instructions the server gives in headers such as Cache-Control and Expires.
In HAL we have none of that information available in _embedded, so we sort of need to make these values up, and that’s not great.
A suggested proposal to improve HAL
Conceptually I feel that HAL’s _embedded is not strictly for specifying collections and/or sub-resources. Its core feature is really a means to prepopulate the HTTP cache to avoid future GET requests.
Knowing that, I believe _embedded is not the best format to achieve this goal. We’re missing crucial information to do this as correctly as possible.
Here’s a format I would like to see instead:
{
  "_links" : {
    "self" : { "href" : "/" },
    "current-user" : { "href" : "/user/1356" },
    "support" : { "href": "mailto:[email protected]" }
  },
  "_push" : [
    {
      "request" : {
        "method" : "GET",
        "uri" : "/user/1356",
        "headers" : {
          "accept" : "application/hal+json"
        }
      },
      "response" : {
        "headers" : {
          "vary": "Accept",
          "cache-control" : "private, max-age=3600",
          "content-type" : "application/hal+json",
          "etag" : "\"foo-bar\""
        },
        "body" : {
          "_links" : {
            "self" : { "href" : "/user/1356" }
          },
          "firstName" : "...",
          "lastName" : "...",
          "email" : "..."
        }
      }
    }
  ]
}
Changed from _embedded are:
- I added HTTP request and response headers and the request method.
- I’m no longer indexing things in _embedded by their relation type, it’s just an array of request and response pairs.
- I’m no longer removing items from _links. The link just stays there.
- _push is not nested. It only appears at the top and it’s really just a ‘transport-level’ feature instead of a part of the data-structure.
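With the headers present, a client no longer has to guess how long a pushed copy stays fresh. The sketch below shows one way a client might turn _push entries into cache records; maxAge and ingestPush are hypothetical helpers, and only max-age is interpreted, not the full set of Cache-Control directives.

```javascript
// Hypothetical helper: read max-age (in seconds) out of a Cache-Control value.
function maxAge(cacheControl) {
  const match = /(?:^|[,\s])max-age=(\d+)/i.exec(cacheControl || '');
  return match ? Number(match[1]) : 0;
}

// Turn each _push entry into a cache record with an absolute expiry time.
function ingestPush(doc, nowMs) {
  return (doc._push || []).map(({ request, response }) => ({
    method: request.method,
    uri: request.uri,
    vary: response.headers.vary || '',
    etag: response.headers.etag,
    expiresAt: nowMs + maxAge(response.headers['cache-control']) * 1000,
    body: response.body,
  }));
}

const records = ingestPush(
  {
    _push: [
      {
        request: { method: 'GET', uri: '/user/1356', headers: { accept: 'application/hal+json' } },
        response: {
          headers: { vary: 'Accept', 'cache-control': 'private, max-age=3600', etag: '"foo-bar"' },
          body: { _links: { self: { href: '/user/1356' } } },
        },
      },
    ],
  },
  Date.now()
);
console.log(records[0].uri, new Date(records[0].expiresAt).toISOString());
```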
The drawback? If you’re not interested in developing an advanced HAL client/server, and just want to build a ‘dumb’ parser of collections, there’s more data and a higher cognitive load.
However, even if this is a better way to push resources to a client, it’s only really needed for HTTP/1.1. HTTP/2 makes this obsolete.
HTTP/2
Those aware of what’s going on with HTTP/2 might recognize _push. HTTP/2 has a very similar feature.
HTTP/2 actually introduces a protocol-level push and it works in the exact same way. HTTP/2 Push can be used to preemptively populate the browser cache if the server knows the client will likely want to do certain GET requests in the future.
A HTTP/2 push message always contains BOTH the HTTP response and the HTTP request that the browser would have sent if it had to do the GET request itself.
And now there’s a new feature under development that makes this especially cool: HTTP/2 cache digest.
HTTP/2 cache digests are an extension to the HTTP/2 protocol that will allow a client to send a small summary of the cache the browser holds for the server.
This is really the missing key, because the biggest thing we were missing with HTTP/2 push is that the server doesn’t know in advance which resources the client already has. This results in unneeded pushes.
This is identical to a HAL server always adding resources to _embedded. The server doesn’t know if the client cared for them, or if it already had an up-to-date copy.
It would be easy for a HTTP/2 HAL client to add a HTTP header to GET requests, such as X-Please-Push: current-user, asking the server to automatically push request/response pairs for the current-user relationship, and only if the client didn’t already have an up-to-date copy of it.
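On the server, handling such a header could start out as simple as the sketch below. X-Please-Push is the made-up header from the paragraph above; the helper name requestedPushTargets and the comma-separated syntax for asking for several relations are my own assumptions, and the cache digest check is left out entirely.

```javascript
// Hypothetical server-side helper: which link targets did the client ask to have pushed?
// Expects header names in lower case, which is how Node.js exposes incoming headers.
function requestedPushTargets(requestHeaders, doc) {
  const header = requestHeaders['x-please-push'] || '';
  const rels = header
    .split(',')
    .map((rel) => rel.trim())
    .filter((rel) => rel !== '');
  const targets = [];
  for (const rel of rels) {
    // A relation may hold one link or an array of links; unknown relations are skipped.
    for (const link of [].concat((doc._links && doc._links[rel]) || [])) {
      targets.push(link.href);
    }
  }
  return targets;
}

const root = {
  _links: {
    self: { href: '/' },
    'current-user': { href: '/user/1356' },
  },
};
console.log(requestedPushTargets({ 'x-please-push': 'current-user' }, root));
```

Each returned path would then be a candidate for an actual push, for example through stream.pushStream() in Node's http2 module.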
For this reason I think that my _push proposal doesn’t make all that much sense. It’s only really useful to provide a HTTP/2-like push feature to HTTP/1.1. It’s a stopgap.
And since we no longer really need _embedded, it might make sense to also ditch _links and just move the information to the HTTP Link header, which also means all this REST linking logic is no longer restricted to JSON or XML-based formats.
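As a small illustration of that move, here is a sketch that serialises a HAL-style _links object into a Link header value. toLinkHeader is a hypothetical helper; it only emits the target and the rel parameter, and it does no escaping of unusual characters.

```javascript
// Hypothetical helper: serialise HAL-style _links into an HTTP Link header value.
function toLinkHeader(links) {
  const parts = [];
  for (const [rel, value] of Object.entries(links)) {
    // A relation may hold one link or an array of links.
    for (const link of [].concat(value)) {
      parts.push(`<${link.href}>; rel="${rel}"`);
    }
  }
  return parts.join(', ');
}

console.log(toLinkHeader({ self: { href: '/' }, 'current-user': { href: '/user/1356' } }));
// prints: </>; rel="self", </user/1356>; rel="current-user"
```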
Conclusion
In HAL and REST we currently have an awkward and poor fit for embedding resources. The only reason we need this in the first place is that clients doing many GET requests is expensive, which is a technical limitation of HTTP.
This causes problems because these resources are not easily cached and simply don’t fit well within the HTTP design. We need to create layers on top of HTTP just to work around this.
HTTP/2 and cache digests might offer a great solution, in that we can completely avoid embedding resources and preemptively send resources that the client will probably want and doesn’t already have an up-to-date copy for.
This will make REST much more practical and alleviate some of the largest issues people have with it today. I think it eventually makes formats like HAL completely obsolete, because we’re moving the hard stuff to the protocol layer.
Comments
Asmir Mustafic •
Nice idea "push on demand" :)
Saquib Rizwan •
Great concept, But it is very much complicated.
develCuy •
Only thinking on REST for browsers? how about REST on JS server-side (node.js)?
Evert •
Everything I talked about applies to the server-side too. Let me know if you have specific questions about stuff, I might be able to give you some pointers. To do the HTTP/2 stuff you'll need a HTTP/2 client in node.js, which I believe exists (but I haven't used yet).