PRQL — a simpler and more powerful SQL

17

PRQL — a simpler and more powerful SQL databases github.com/max-sixty
via LeahNeukirchen 2 years ago | caches
Archive.org Archive.today Ghostarchive
| 31 comments

31

1. 10
  
  ghadi 2 years ago | link
  
  The main issue with SQL is that there is no machine-facing API or information model besides string concat.
2. 8
  
  dkl 2 years ago | link
  
  One of the nice things about SQL is that the system is able to perform global optimizations. This is at least partly enabled by the fact that you are not able to give the system procedural “do this and then do that” instructions. I fear that PRQL’s “filter the thing that came before” design is going to undermine the DBMS’s ability to perform the kinds of global optimizations it can on SQL. The counterargument here will be that it is easier to understand what is going on and optimize the performance by ordering your statements. There may be some truth to that, but it also means that the DBMS is not able to intervene to select a good plan either, if your unskilled analysts do things in a foolish order, or if (more likely) the cardinality of tables changes and causes a different approach to be more performant in this environment than the original one in which the query was written. In fact that’s really the central strength of having a highly declarative query: the query planner knows things about your tables as they are now, which may differ from query to query or day to day.
  
  I see why the author considers this to be simpler and more powerful. You often see these “let’s rewrite SQL” concepts and they usually bring some procedural thinking with them. In this case the author seems not to have noticed that “do this and then do that” is intrinsically procedural. The author also thinks that this kind of reasoning will be easier for the analysts writing the queries—which, maybe it will be, but it’s an assumption I wouldn’t feel comfortable making. Having a mental model of how SQL works isn’t easy, but understanding a large query here would entail keeping in your head all the steps that have already occurred, much like debugging any procedural program. I think this trades readability for writability. I wouldn’t be quick to make that trade, but I couldn’t come down hard on either side.
  
  A desirable property of SQL is that if the query works once, it’s going to keep working. Adding procedurality makes me worry that queries will work in development and then could fail in production due to a missed state. But I don’t know enough about PRQL to say whether it is vulnerable to this problem or not.
  
  With SQL, you also often have to introduce subqueries if you want to control some aspect of the order of things—for instance, you want to resort or refilter after the HAVING clause, or you want to select the first N rows before doing some joins or deeper processing. PRQL would obviously be advantageous there, since you can intermix filters and sorts freely; you don’t need to even have such a thing as HAVING, which is only there so you can filter after aggregates are applied.
  
  I do really like the idea of creating functions inside the query to deal with the boilerplate, I think SQL can be a bit gassy, and usually your only recourse is to create a subquery that does some of the work for you.
  1. 3
    
    viraptor 2 years ago | link
    
    I don’t think you should worry to much about the ordering in PRQL query text. It can be all converted back into SQL and all the set logic optimisations can be done as usual, as well as reordering based on table statistics. Unless you move your filters after aggregation (similarly to SQL) the performance should stay the same.
    1. 3
      
      dkl 2 years ago | link
      
      I’m not sure what makes you confident that it does not need to be worried about. In fact I expect the translation to SQL to suffer in much the same way that ORM-generated SQL typically suffers.
      1. 1
        
        jamii 2 years ago | link
        
        I’m not sure what makes you confident
        
        Filter pushdown is a trivial optimization.
        
        Here is a direct mechanical translation of the PRQL example - https://gist.github.com/jamii/2275b88dd65a77f808312759e4445fe1. It generates the same plan as the original SQL.
        
        “do this and then do that” is intrinsically procedural
        
        PRQL is actually much closer to the typical logical plan representation than SQL is. (It would likely be a much easier frontend to write too - SQL’s scope ordering and name resolution is horrific).
        
        1
        
        dkl 2 years ago | link
        
        The fact that it is closer to the logical plan representation seems to reinforce my point, that it is less high-level than SQL. The fact that the generated SQL from PRQL arrived at a similar plan is really only evidence of the power of your engine’s planner. Is the objective here to replace SQL or to hide SQL from the query author?
      2. 1
        
        viraptor edited 2 years ago | link
        
        I’m sure there are going to be some edge cases, but good db engines can easily optimise these queries. Even if you wrote each step of the example query as a stacked SQL subquery, SQL Server will most likely give you a plan identical to the flat query you’d write.
  2. 3
    
    yumaikas 2 years ago | link
    
    “Do this and then do that” is already baked into SQL via Common Table Expressions and subqueries.
    
    SQL ends up being a language that describes a flow of data, sources and sinks, and some parts of that can be rearranged to be optimized better, but there is already a semantic notion of ordering built in to SQL. It’s just not surfaced very often.
    1. 2
      
      dkl 2 years ago | link
      
      It’s much less overt than in PRQL.
3. 3
  andreypopp edited 2 years ago | link
  I see FunSQL.jl is already mentioned in the discussion but I still want to highlight how it works as I think FunSQL represents a very interesting point in design space of “SQL replacements”.
  
  FunSQL is a Julia library which presents a combinator-based API to generate SQL. The query construction looks like:
  
  From(:person) |> Where(Fun.between(Get.year_of_birth, 1930, 1940)) |> Select(Get.person_id)
  
  I’ve myself ported the lib (with some deviations) to Python and OCaml (ended up using this one) and built a concrete syntax on top of it, so at the end it looks very similar to PRQL, albeit I’ve kept it conservative syntax-wise (closer to SQL). Same query as above in the concrete syntax:
  
  from person where year_of_birth >= 1930 and year_of_birth <= 1940 select person_id
  
  (I’ll use this syntax going forward in this post)
  
  Now the feature I’ve wanted to highlight (and which I think makes FunSQL very useful) is how it treats group by and aggregate expressions. An example first:
  
  from users as u join (from comments group by user_id) as c on c.user_id = u.id select u.username, c.count() as comment_count, c.max(created_date) as comment_last_created_date
  
  Here the subrelation c which is joined to u is being grouped by user_id column. The c itself doesn’t specify any aggregates though as you can see. All aggregates are located in the single select below.
  
  FunSQL treats grouped relations as a special kind of namespace from which you can either select columns you grouped by or refer to aggregate functions.
  
  In my mind, this feature is very powerful as it allows one to incrementally construct reusable query fragments and then use them to build final queries. Same query in two steps:
  
  let users_with_comments = from users as u join (from comments group by user_id) as c on c.user_id = u.id from users_with_comments select u.username, c.array_agg(text) as comments
  
  In other words: you can build a nested namespace from relations / grouppings and then later decide what you want to select from this namespace.
  
  This surprisingly works well if you want to implement visual query builders (I have one for my OCaml port).
  
  Check out JuliaCon 2021 presentaton about FunSQL if interested, the main author Kyrylo (who’s by the way the co-creator of HTSQL, another query language mentioned in the discussion here) talks about motivation and design of FunSQL.
4. 3
  eatonphil 2 years ago | link
  I don’t think this proposal makes a useful improvement on SQL personally. But I do desperately want a well-adopted SQL shorthand.
  
  Here are some of the projects I’m watching in that vein:
  
  https://github.com/mrumkovskis/tresql
  
  https://www.htsql.org/doc/overview.html
  
  https://github.com/cytosm/cytosm
  
  https://github.com/MechanicalRabbit/FunSQL.jl
  1. 2
    
    pbsds 2 years ago | link
    
    I’ve been looking at Morel (talk here)
    1. 1
      
      eatonphil 2 years ago | link
      
      Ah yeah that’s a cool one too. Though as much as I love SML I wouldn’t watch that project in the same way until he spins it out into a standalone thing that can be implemented as query language library for different languages.
  2. 2
    jamii 2 years ago | link
    
    Also check out:
    
    https://www.edgedb.com/tutorial
    
    https://looker-open-source.github.io/malloy/documentation/language/join.html
    1. 1
      
      eatonphil 2 years ago | link
      
      Malloy is as evil-looking to me as the OP proposal is. EdgeDB is pretty interesting now that I look at it but still seems more complicated than SQL.
      1. 1
        
        elprans 2 years ago | link
        
        Curious: what makes you feel like EdgeQL is more complicated?
        
        1
        
        eatonphil 2 years ago | link
        
        If it’s not fair to say it’s more complicated (since I’m biased by just knowing sql already) it’s at least not less complicated or less verbose than sql, judging from the examples in your tutorial.
        
        I’m most likely not your audience anyway. The particular thing I’m excited about shorthand sql is for more easily defining application auth policies.
        
        1
        
        elprans 2 years ago | link
        
        Tutorial examples are necessarily trivial, so the difference in verbosity is not that visible. It’s when your queries get comparatively more complex, then EdgeQL starts to shine. We’ll post a more elaborate “EdgeQL vs SQL” page to illustrate this better soon. Also, check out this talk by Yury with some good examples [1]
        
        The particular thing I’m excited about shorthand sql is for more easily defining application auth policies.
        
        We are working on just that right now [2] (though note it’s an early draft and things will likely change a lot).
        
        [1] https://youtu.be/CmXB5xqEENs?t=437 [2] https://github.com/edgedb/rfcs/blob/865bc48f4050ced99447bd77a5039f5d34fcb8b2/text/1011-query-rewrite.rst
        
        1
        
        eatonphil 2 years ago | link
        
        Ah sorry I don’t mean postgres-level policies. If you look at https://www.dbcore.org/ and scroll down to the Authorization section you’ll see declarative policies for API endpoints that are backed by a combination of database queries and injected request variables.
        
        I’m not actively working on that anymore and I wasn’t going to write my own SQL shorthand (well I thought about it for a second and decided against it). But I’d still like to see that sort of declarative application-level auth policies happen with a language that can combine SQL queries and injected request context.
        
        1
        
        elprans 2 years ago | link
        
        Ah sorry I don’t mean postgres-level policies
        
        Nether did I :-)
        
        combine SQL queries and injected request context.
        
        This is exactly what we are looking to implement with the “Query rewrite” proposal. First of all, access rules are conditional, and, more importantly you can combine them with globals [1] to get that “request context dependent” filtering.
        
        [1] https://github.com/edgedb/rfcs/blob/043acb494ed8e1c4f72dbd0fcf7c00a0b8439624/text/1010-global-vars.rst
        
        1
        
        eatonphil 2 years ago | link
        
        Gotcha, then I guess I didn’t understand by that first RFC you shared how you are thinking it would be used by application developers.
        
        1
        
        elprans 2 years ago | link
        
        The idea is that you set your access policies in the schema (because, like with everything else you want correctness and migration support). These policies may depend on context variables. For example:
        
        type Post { link author -> User; access policy permit read using (.author.id = global current_user_id) }
        
        Then, in application code (likely in some middleware), you configure the database client to send the appropriate globals:
        
        client = client.with_globals(current_user_id = user_id_from_jwt) ... # Only Posts authored by `current_user_id` are visible result = client.query('SELECT Post')
        
        Something like that.
        
        1
        
        eatonphil 2 years ago | link
        
        Gotcha! Yeah that looks nice.
        
        But my desire for a SQL shorthand also extends to wanting to be able to use it selectively. i.e. I just want to write vanilla SQL in my CRUD app for everything except the policies. But again I think I’m not your audience.
        
        1
        
        elprans 2 years ago | link
        
        EdgeQL is a lot more than an SQL shorthand for some things. If you are a heavy SQL user and you like it, but perhaps wish for the query language to be a tad less verbose and a bit more modern, then maybe you’ll find reasons to love EdgeQL too :-)
        
        1
        
        eatonphil 2 years ago | link
        
        Is there a library to translate EdgeQL into SQL?
        
        1
        
        elprans 2 years ago | link
        
        Well, that’s what EdgeDB does. It takes your schema and your EdgeQL, and produces PostgreSQL queries in a 1:1 fashion. That is, every EdgeQL query is exactly one SQL query regardless of complexity (including things you can’t express in SQL easily, like correlated nested DML). Theoretically you can call the query compiler directly and execute the returned SQL manually, however it’s not clear how that would be of benefit as opposed to using the EdgeDB query I/O.
  3. 1
    
    knl edited 2 years ago | link
    
    Thanks for these links. Lately I’ve been also thinking how I can optimize my work, as I need to write a lot of exploratory SQL queries, involving a lot of joins in a system without any foreign key defined(!). Most of join conditions are fairly static, but there should be more succinct way of expressing joins.
    
    HTSQL looks quite promising. I was also looking into Cypher and cytosm, but the project looks unmaintained…
    1. 1
      
      eatonphil 2 years ago | link
      
      Yeah. I’m watching in vain. And hoping for something to take off…
5. 2
  
  rslabbert 2 years ago | link
  
  Having written some complex dbt projects (and if PRQL officially supported macros it could be a good drop in replacement), I can definitely see the value in something like this. The first thing that jumped out to me that it gets right is to start with the table and work down. This is an enormous readability boost in large projects and leads to great intellisense. I saw recently this with import statements in Bagel recently and I can just say it’s a major improvement over how say Typescript does things.
  
  One thing I’d love to see a system like this explore is a gradual type system ala Typescript as there’s no reason we can’t start to detect basic type interactions like only being able to sum a number. In addition to the .sql output, it could also provide a .d.sql file which contains a type specification of the output table. Source tables can be imported directly from the database or written by hand (also provides a convenient way to provide types for native functions/UDFs instead of having to natively support every dialect’s details in the parser). Even just having this do very basic “fall back to any if I see something unknown” inference would be a big step up over SQL.
6. 1
  
  0x2ba22e11 2 years ago | link
  
  This looks like just a reasonable DSL for relational algebra. That is an endorsement.
7. 1
  
  viraptor 2 years ago | link
  
  This reminds me of InfluxDB’s Flux language https://docs.influxdata.com/flux/v0.x/get-started/query-basics/