One of the nice things about SQL is that the system is able to perform global optimizations. This is at least partly enabled by the fact that you are not able to give the system procedural “do this and then do that” instructions. I fear that PRQL’s “filter the thing that came before” design is going to undermine the DBMS’s ability to perform the kinds of global optimizations it can on SQL. The counterargument here will be that it is easier to understand what is going on and optimize the performance by ordering your statements. There may be some truth to that, but it also means that the DBMS is not able to intervene to select a good plan either, if your unskilled analysts do things in a foolish order, or if (more likely) the cardinality of tables changes and causes a different approach to be more performant in this environment than the original one in which the query was written. In fact that’s really the central strength of having a highly declarative query: the query planner knows things about your tables as they are now, which may differ from query to query or day to day.
I see why the author considers this to be simpler and more powerful. You often see these “let’s rewrite SQL” concepts and they usually bring some procedural thinking with them. In this case the author seems not to have noticed that “do this and then do that” is intrinsically procedural. The author also thinks that this kind of reasoning will be easier for the analysts writing the queries—which, maybe it will be, but it’s an assumption I wouldn’t feel comfortable making. Having a mental model of how SQL works isn’t easy, but understanding a large query here would entail keeping in your head all the steps that have already occurred, much like debugging any procedural program. I think this trades readability for writability. I wouldn’t be quick to make that trade, but I couldn’t come down hard on either side.
A desirable property of SQL is that if the query works once, it’s going to keep working. Adding procedurality makes me worry that queries will work in development and then could fail in production due to a missed state. But I don’t know enough about PRQL to say whether it is vulnerable to this problem or not.
With SQL, you also often have to introduce subqueries if you want to control some aspect of the order of things—for instance, you want to resort or refilter after the HAVING clause, or you want to select the first N rows before doing some joins or deeper processing. PRQL would obviously be advantageous there, since you can intermix filters and sorts freely; you don’t need to even have such a thing as HAVING, which is only there so you can filter after aggregates are applied.
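For contrast, a dataframe pipeline shows what it looks like when there is no dedicated post-aggregate keyword at all: a filter is just a step you place before or after the aggregation. A small pandas sketch (table and values made up for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c"],
    "amount":   [10, 20, 5, 5, 100],
})

# Filter before aggregation (what WHERE does)...
big = orders[orders["amount"] >= 10]

# ...aggregate...
totals = big.groupby("customer", as_index=False)["amount"].sum()

# ...then filter again after aggregation (what HAVING does) --
# the same operation, just placed later in the pipeline.
result = totals[totals["amount"] > 25]
print(result)
```

There is no separate keyword for the second filter; its position in the pipeline is what gives it post-aggregate semantics.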
I do really like the idea of creating functions inside the query to deal with the boilerplate. SQL can be a bit gassy, and usually your only recourse is to create a subquery that does some of the work for you.
I don’t think you should worry too much about the ordering in PRQL query text. It can all be converted back into SQL, and all the set-logic optimisations can be done as usual, as well as reordering based on table statistics. Unless you move your filters after aggregation (as in SQL), the performance should stay the same.
I’m not sure what makes you confident that it isn’t worth worrying about. In fact I expect the translation to SQL to suffer in much the same way that ORM-generated SQL typically suffers.
“do this and then do that” is intrinsically procedural
PRQL is actually much closer to the typical logical plan representation than SQL is. (It would likely be a much easier frontend to write too - SQL’s scope ordering and name resolution is horrific).
The fact that it is closer to the logical plan representation seems to reinforce my point, that it is less high-level than SQL. The fact that the generated SQL from PRQL arrived at a similar plan is really only evidence of the power of your engine’s planner. Is the objective here to replace SQL or to hide SQL from the query author?
I’m sure there are going to be some edge cases, but good db engines can easily optimise these queries. Even if you wrote each step of the example query as a stacked SQL subquery, SQL Server will most likely give you a plan identical to the flat query you’d write.
“Do this and then do that” is already baked into SQL via Common Table Expressions and subqueries.
SQL ends up being a language that describes a flow of data, sources and sinks, and some parts of that can be rearranged to be optimized better, but there is already a semantic notion of ordering built in to SQL. It’s just not surfaced very often.
I see FunSQL.jl is already mentioned in the discussion but I still want to highlight how it works as I think FunSQL represents a very interesting point in the design space of “SQL replacements”.
FunSQL is a Julia library which presents a combinator-based API to generate SQL. The query construction looks like:
I’ve myself ported the lib (with some deviations) to Python and OCaml (ended up using this one) and built a concrete syntax on top of it, so in the end it looks very similar to PRQL, albeit I’ve kept it conservative syntax-wise (closer to SQL). Same query as above in the concrete syntax:
from person
where year_of_birth >= 1930 and year_of_birth <= 1940
select person_id
(I’ll use this syntax going forward in this post)
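To give a feel for the combinator style without reproducing FunSQL’s actual API (the class and function names below are invented for illustration), here is a toy pipeline in Python that renders a chain of combinators to SQL:

```python
# A toy combinator pipeline in the spirit of FunSQL. This is NOT
# FunSQL's real API -- just the idea: each combinator is a node,
# and rendering walks the chain to emit a SQL clause per node.
class Node:
    def __init__(self, kind, arg, parent=None):
        self.kind, self.arg, self.parent = kind, arg, parent

    def __rshift__(self, other):  # chain nodes with `>>`
        other.parent = self
        return other

def From(table):
    return Node("from", table)

def Where(cond):
    return Node("where", cond)

def Select(*cols):
    return Node("select", cols)

def render(node):
    # Walk back to the root, collecting one clause per combinator.
    parts = {}
    while node is not None:
        parts[node.kind] = node.arg
        node = node.parent
    return "SELECT {} FROM {} WHERE {}".format(
        ", ".join(parts["select"]), parts["from"], parts["where"])

q = From("person") >> Where("year_of_birth BETWEEN 1930 AND 1940") >> Select("person_id")
print(render(q))
# SELECT person_id FROM person WHERE year_of_birth BETWEEN 1930 AND 1940
```

The appeal of this style is that the query is an ordinary data structure before it becomes SQL, so fragments can be stored, passed around, and composed.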
Now the feature I’ve wanted to highlight (and which I think makes FunSQL very useful) is how it treats group by and aggregate expressions. An example first:
from users as u
join (from comments group by user_id) as c on c.user_id = u.id
select
u.username,
c.count() as comment_count,
c.max(created_date) as comment_last_created_date
Here the subrelation c, which is joined to u, is being grouped by the user_id column. The c itself doesn’t specify any aggregates though, as you can see. All aggregates are located in the single select below.
FunSQL treats grouped relations as a special kind of namespace from which you can either select columns you grouped by or refer to aggregate functions.
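As a loose analogy only (this is pandas, not FunSQL): a groupby object is likewise a grouped relation with no aggregates chosen yet, and you pick the aggregates later:

```python
import pandas as pd

comments = pd.DataFrame({
    "user_id": [1, 1, 2],
    "created_date": ["2020-01-01", "2020-06-01", "2020-03-01"],
})

# The grouped object records *how* rows are grouped, but carries
# no aggregates yet -- roughly like FunSQL's grouped namespace.
by_user = comments.groupby("user_id")

# Aggregates are chosen later, at "select" time.
summary = by_user.agg(
    comment_count=("created_date", "size"),
    comment_last_created_date=("created_date", "max"),
).reset_index()
print(summary)
```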
In my mind, this feature is very powerful as it allows one to incrementally construct reusable query fragments and then use them to build final queries. Same query in two steps:
let users_with_comments =
from users as u
join (from comments group by user_id) as c on c.user_id = u.id
from users_with_comments
select u.username, c.array_agg(text) as comments
In other words: you can build a nested namespace from relations / groupings and then later decide what you want to select from this namespace.
This surprisingly works well if you want to implement visual query builders (I have one for my OCaml port).
Check out the JuliaCon 2021 presentation about FunSQL if interested; the main author Kyrylo (who’s by the way the co-creator of HTSQL, another query language mentioned in the discussion here) talks about the motivation and design of FunSQL.
Ah yeah that’s a cool one too. Though as much as I love SML I wouldn’t watch that project in the same way until he spins it out into a standalone thing that can be implemented as query language library for different languages.
If it’s not fair to say it’s more complicated (since I’m biased by just knowing sql already) it’s at least not less complicated or less verbose than sql, judging from the examples in your tutorial.
I’m most likely not your audience anyway. The particular thing I’m excited about shorthand sql is for more easily defining application auth policies.
Tutorial examples are necessarily trivial, so the difference in verbosity is not that visible. It’s when your queries get comparatively more complex that EdgeQL starts to shine. We’ll post a more elaborate “EdgeQL vs SQL” page to illustrate this better soon. Also, check out this talk by Yury with some good examples [1]
The particular thing I’m excited about shorthand sql is for more easily defining application auth policies.
We are working on just that right now [2] (though note it’s an early draft and things will likely change a lot).
Ah sorry I don’t mean postgres-level policies. If you look at https://www.dbcore.org/ and scroll down to the Authorization section you’ll see declarative policies for API endpoints that are backed by a combination of database queries and injected request variables.
I’m not actively working on that anymore and I wasn’t going to write my own SQL shorthand (well I thought about it for a second and decided against it). But I’d still like to see that sort of declarative application-level auth policies happen with a language that can combine SQL queries and injected request context.
This is exactly what we are looking to implement with the “Query rewrite” proposal. First of all, access rules are conditional, and, more importantly, you can combine them with globals [1] to get that “request context dependent” filtering.
The idea is that you set your access policies in the schema (because, like with everything else you want correctness and migration support). These policies may depend on context variables. For example:
type Post {
    link author -> User;
    access policy permit read using (.author.id = global current_user_id)
}
Then, in application code (likely in some middleware), you configure the database client to send the appropriate globals:
client = client.with_globals(current_user_id = user_id_from_jwt)
...
# Only Posts authored by `current_user_id` are visible
result = client.query('SELECT Post')
But my desire for a SQL shorthand also extends to wanting to be able to use it selectively. i.e. I just want to write vanilla SQL in my CRUD app for everything except the policies. But again I think I’m not your audience.
EdgeQL is a lot more than an SQL shorthand for some things. If you are a heavy SQL user and you like it, but perhaps wish for the query language to be a tad less verbose and a bit more modern, then maybe you’ll find reasons to love EdgeQL too :-)
Well, that’s what EdgeDB does. It takes your schema and your EdgeQL, and produces PostgreSQL queries in a 1:1 fashion. That is, every EdgeQL query is exactly one SQL query regardless of complexity (including things you can’t express in SQL easily, like correlated nested DML). Theoretically you can call the query compiler directly and execute the returned SQL manually, however it’s not clear how that would be of benefit as opposed to using the EdgeDB query I/O.
Thanks for these links. Lately I’ve been also thinking about how I can optimize my work, as I need to write a lot of exploratory SQL queries, involving a lot of joins, in a system without any foreign key defined(!). Most of the join conditions are fairly static, but there should be a more succinct way of expressing joins.
HTSQL looks quite promising. I was also looking into Cypher and cytosm, but the project looks unmaintained…
Having written some complex dbt projects (and if PRQL officially supported macros it could be a good drop-in replacement), I can definitely see the value in something like this. The first thing that jumped out to me that it gets right is to start with the table and work down. This is an enormous readability boost in large projects and leads to great intellisense. I saw this recently with import statements in Bagel, and I can just say it’s a major improvement over how, say, TypeScript does things.
One thing I’d love to see a system like this explore is a gradual type system à la TypeScript, as there’s no reason we can’t start to detect basic type interactions, like only being able to sum a number. In addition to the .sql output, it could also provide a .d.sql file which contains a type specification of the output table. Source tables can be imported directly from the database or written by hand (which also provides a convenient way to supply types for native functions/UDFs instead of having to natively support every dialect’s details in the parser). Even just having this do very basic “fall back to any if I see something unknown” inference would be a big step up over SQL.
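A minimal sketch of what that inference could look like, with a made-up schema and checker (none of this is an existing tool):

```python
# Toy gradual type check: sum() only accepts numeric columns, and
# unknown columns fall back to "any" rather than erroring.
SCHEMA = {"users": {"age": "number", "name": "text"}}  # hypothetical

def infer(table, expr):
    func, col = expr  # e.g. ("sum", "age")
    col_type = SCHEMA.get(table, {}).get(col, "any")
    if func == "sum":
        if col_type in ("number", "any"):  # "any" passes through
            return "number"
        raise TypeError(f"cannot sum column of type {col_type}")
    return "any"

print(infer("users", ("sum", "age")))     # known numeric column
print(infer("users", ("sum", "logins")))  # unknown column: falls back to any
# infer("users", ("sum", "name")) would raise TypeError
```

Even this crude version catches the “sum a text column” class of mistake at compile time while staying permissive about columns it has never seen.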
The main issue with SQL is that there is no machine-facing API or information model besides string concat.
Filter pushdown is a trivial optimization.
Here is a direct mechanical translation of the PRQL example - https://gist.github.com/jamii/2275b88dd65a77f808312759e4445fe1. It generates the same plan as the original SQL.
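To make “trivial” concrete, here is a toy pushdown pass (a sketch, not any real planner): a filter that mentions only the grouping key can hop in front of the aggregate.

```python
# Toy filter pushdown: a pipeline is a list of (op, payload) steps.
# A filter referencing only the group-by key may be moved before the
# aggregate, where it can prune rows earlier. (Only adjacent
# filter/aggregate pairs are handled -- enough to show the idea.)
def push_down_filters(pipeline):
    out = []
    for step in pipeline:
        op, payload = step
        if op == "filter" and out and out[-1][0] == "aggregate":
            group_keys = set(out[-1][1]["by"])
            if payload["columns"] <= group_keys:
                agg = out.pop()       # swap the filter before the aggregate
                out.append(step)
                out.append(agg)
                continue
        out.append(step)
    return out

pipeline = [
    ("scan", {"table": "invoices"}),
    ("aggregate", {"by": ["country"], "agg": "sum(total)"}),
    ("filter", {"columns": {"country"}, "pred": "country = 'USA'"}),
]
optimized = push_down_filters(pipeline)
print([op for op, _ in optimized])  # ['scan', 'filter', 'aggregate']
```

A filter on the aggregated value itself (a HAVING-style predicate) would fail the `columns <= group_keys` test and stay where it is, which is exactly the behavior you want.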
It’s much less overt than in PRQL.
I don’t think this proposal makes a useful improvement on SQL personally. But I do desperately want a well-adopted SQL shorthand.
Here are some of the projects I’m watching in that vein:
I’ve been looking at Morel (talk here)
Also check out:
Malloy is as evil-looking to me as the OP proposal is. EdgeDB is pretty interesting now that I look at it but still seems more complicated than SQL.
Curious: what makes you feel like EdgeQL is more complicated?
[1] https://youtu.be/CmXB5xqEENs?t=437 [2] https://github.com/edgedb/rfcs/blob/865bc48f4050ced99447bd77a5039f5d34fcb8b2/text/1011-query-rewrite.rst
Neither did I :-)
[1] https://github.com/edgedb/rfcs/blob/043acb494ed8e1c4f72dbd0fcf7c00a0b8439624/text/1010-global-vars.rst
Gotcha, then I guess I didn’t understand by that first RFC you shared how you are thinking it would be used by application developers.
Something like that.
Gotcha! Yeah that looks nice.
Is there a library to translate EdgeQL into SQL?
Yeah. I’m watching in vain. And hoping for something to take off…
This looks like just a reasonable DSL for relational algebra. That is an endorsement.
This reminds me of InfluxDB’s Flux language https://docs.influxdata.com/flux/v0.x/get-started/query-basics/