feat/issue 200/order of operations (#244)

* refactor: simplify op priority logic * chore: bump development to py3.10, fix warnings * fix: do not treat square brackets as standalone operators (regression from prev commit) * refactor: create OperatorPrecedence enum and tiers * feat: parse more operators into specific tiers * fix #207: between's and now a tight word operator * feat: add more granular operator precedence, close #200 * chore: update changelog and readme * chore: bump primer refs * fix: fix bugs in op precedence, improve test coverage * fix: update primer stats for dbt-utils
tconbeer · Aug 18, 2022 · c34fcf2 · c34fcf2
1 parent 73c1a0a
commit c34fcf2
Show file tree

Hide file tree

Showing 18 changed files with 394 additions and 117 deletions.
diff --git a/.python-version b/.python-version
@@ -1,4 +1,4 @@
-3.9.7
-3.8.13
 3.7.12
-3.10.3
+3.8.13
+3.9.13
+3.10.6
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -13,9 +13,11 @@ All notable changes to this project will be documented in this file.
 -   sqlfmt can now be called as a python module, with `python -m sqlfmt`
 
 ### Formatting Changes + Bug Fixes
--   fixed a bug that could cause lines with long jinja tags to be one character over the line length limit, and could result in unstable formatting ([#237](https://github.com/tconbeer/sqlfmt/pull/237) - thank you [@nfcampos](https://github.com/nfcampos)!)
--   fixed a bug that formatted array literals like they were indexing operations ([#235](https://github.com/tconbeer/sqlfmt/pull/235) - thank you [@nfcampos](https://github.com/nfcampos)!)
--   fixed a bug that caused unnecessary copying of the cache when using multiprocessing. Large projects should see dramatically faster (near-instant) runs once the cache is warm
+-   adds more granularity to operator precedence and will merge lines more aggressively that start with high-precedence operators ([#200](https://github.com/tconbeer/sqlfmt/issues/200))
+-   improves the formatting of `between ... and ...`, especially in situations where the source includes a line break ([#207](https://github.com/tconbeer/sqlfmt/issues/207))
+-   fixes a bug that caused unnecessary copying of the cache when using multiprocessing. Large projects should see dramatically faster (near-instant) runs once the cache is warm
+-   fixes a bug that could cause lines with long jinja tags to be one character over the line length limit, and could result in unstable formatting ([#237](https://github.com/tconbeer/sqlfmt/issues/237) - thank you [@nfcampos](https://github.com/nfcampos)!)
+-   fixes a bug that formatted array literals like they were indexing operations ([#235](https://github.com/tconbeer/sqlfmt/issues/235) - thank you [@nfcampos](https://github.com/nfcampos)!)
 
 ## [0.10.1] - 2022-08-05
 

diff --git a/README.md b/README.md
@@ -26,7 +26,7 @@ For now, sqlfmt only works on `select` statements (which is all you need if you
 ## Installation
 
 ### Try it first
-Want to test out sqlfmt on a query before you install it? Go to [sqlfmt.com](http://sqlfmt.com) to use the interactive, web-based version.
+Want to test out sqlfmt on a query before you install it? Go to [sqlfmt.com](https://sqlfmt.com) to use the interactive, web-based version.
 
 ### Prerequisites
 You will need Python 3.7-3.10 installed. You should use pipx or install into a virtual environment (maybe as a dev-dependency in your project). If you do not know how to install python and use pipx and/or virtual environments, go read about that first.
@@ -127,7 +127,7 @@ This can also be configured using the `pyproject.toml` file:
 dialect = "clickhouse"
 ```
 
-Note that with this option, sqlfmt will not lowercase **most** non-reserved keywords, even common ones like `sum` or `count`. See (and please join) [this issue](https://github.com/tconbeer/sqlfmt/issues/195) for a discussion of this topic.
+Note that with this option, sqlfmt will not lowercase **most** non-reserved keywords, even common ones like `sum` or `count`. See (and please join) [this discussion](https://github.com/tconbeer/sqlfmt/discussions/229) for more on this topic.
 
 ### Using sqlfmt with dbt
 
@@ -146,7 +146,7 @@ Add the following config to your `.pre-commit-config.yaml` file:
 ```yml
 repos:
   - repo: https://github.com/tconbeer/sqlfmt
-    rev: v0.10.0
+    rev: v0.10.1
     hooks:
       - id: sqlfmt
         language_version: python
@@ -199,28 +199,27 @@ You can use the [Run on Save](https://marketplace.visualstudio.com/items?itemNam
 ## The sqlfmt style
 The only thing you can configure with sqlfmt is the desired line length of the formatted file. You can do this with the `--line-length` or `-l` options. The default is 88.
 
-Given the desired line length, sqlfmt has four objectives:
-1. Break and indent lines to make the syntactical structure of the code apparent
-2. Break lines so they are shorter than the desired line length, if possible
-3. Combine lines to use the least possible vertical space, without violating #1 and #2
-4. Standardize capitalization (to lowercase) and in-line whitespace
+Given the desired line length, sqlfmt has three objectives:
+1. Break and indent lines to make the syntactical structure of the code apparent.
+2. Combine lines to use the least possible vertical space, without violating the line-length constraint or the structure in #1.
+3. Standardize capitalization (to lowercase) and in-line whitespace.
 
 sqlfmt borrows elements from well-accepted styles from other programming languages. It places opening brackets on the same line as preceding function names (like *black* for python and *1TBS* for C). It indents closing brackets to the same depth as the opening bracket (this is extended to statements that must be closed, like `case` and `end`).
 
 The sqlfmt style is as simple as possible, with little-to-no special-casing of formatting concerns. While at first blush, this may not create a format that is as "nice" or "expressive" as hand-crafted indentation, over time, as you grow accustomed to the style, formatting becomes transparent and the consistency will allow you to jump between files, projects, and even companies much faster.
 
 ### Why lowercase?
 There are several reasons that sqlfmt insists on lowercase SQL keywords:
-1. We believe that SQL is code (this is a surprisingly controversial statement!). Shouting-case keywords perpetuate the myth that SQL isn't "real code", or isn't "modern" and somehow is trapped in the era of low-level imperative languages: BASIC, COBOL, and FORTRAN. The reality is that SQL is an incredibly powerful, declarative, and modern language. It should look like one
-1. Syntax highlighting for SQL makes shouting-case keywords redundant; the syntax highlighter in any text editor is going to be more consistent than any manual shout-casing. If you have a SQL query as a string inside of a block of code in another language, you may want to capitalize your keywords; sqlfmt only operates on dedicated SQL (and templated sql) files, so this is not relevant. However, even without syntax highlighting, the hierarchical and consistent indentation provided by sqlfmt provides sufficient visual structure without shout-casing keywords
+1. We believe that SQL is code (this is a surprisingly controversial statement!). Shouting-case keywords perpetuate the myth that SQL isn't "real code", or isn't "modern" and somehow is trapped in the era of low-level imperative languages: BASIC, COBOL, and FORTRAN. The reality is that SQL is an incredibly powerful, declarative, and modern language. It should look like one.
+1. Syntax highlighting for SQL makes shouting-case keywords redundant; the syntax highlighter in any text editor is going to be more consistent than any manual shout-casing. If you have a SQL query as a string inside of a block of code in another language, you may want to capitalize your keywords; sqlfmt only operates on dedicated SQL (and templated sql) files, so this is not relevant. However, even without syntax highlighting, the hierarchical and consistent indentation provided by sqlfmt provides sufficient visual structure without shout-casing keywords.
 1. Even among people who like shout-cased keywords, there are disagreements between what gets shout-cased. SELECT, sure, but SUM? AS? OVER? AND? All-lowercase keywords eliminates this potential source of irregularity and disagreement.
-1. Research shows that generally, lowercase words are more readable
+1. Research shows that generally, lowercase words are more readable.
 
 ### Why trailing commas?
-1. Using trailing commas follows the convention of every other written language and programming language
-1. Leading commas require placing the first field name on the same line as `select`, which can obscure that field
-1. SQL query compilation is extremely fast; the "cost" of "last field" errors is very low. Some dialects (e.g., BigQuery) even allow a trailing comma in the final field of a select statement
-1. Trailing commas generalize better within `select` statements (e.g. `group by` and `partition by` clauses) and in other kinds of SQL statements (e.g. `insert` statements)
+1. Using trailing commas follows the convention of every other written language and programming language.
+1. Leading commas require placing the first field name on the same line as `select`, which can obscure that field (especially with `select distinct top 25 ...`).
+1. SQL query compilation is extremely fast; the "cost" of "last field" errors is very low. Some dialects (e.g., BigQuery, DuckDB) even allow a trailing comma in the final field of a select statement.
+1. Trailing commas generalize better within `select` statements (e.g. `group by` and `partition by` clauses) and in other kinds of SQL statements (e.g. `insert` statements).
 
 ### Examples
 
@@ -289,7 +288,7 @@ select
     ) as e
 from my_table
 ```
-Want more examples? See the `tests/data` directory, or go to http://sqlfmt.com to see how sqlfmt will format your queries.
+Want more examples? See the `tests/data` directory, or go to https://sqlfmt.com to see how sqlfmt will format your queries.
 
 ## Contributing
 

diff --git a/pyproject.toml b/pyproject.toml
@@ -9,7 +9,7 @@ description = "sqlfmt is an opinionated CLI tool that formats your sql files"
 readme = "README.md"
 authors = ["Ted Conbeer <[email protected]>"]
 license = "Apache-2.0"
-homepage = "http://sqlfmt.com"
+homepage = "https://sqlfmt.com"
 repository = "https://github.com/tconbeer/sqlfmt"
 classifiers = [
     "Development Status :: 3 - Alpha",

diff --git a/src/sqlfmt/api.py b/src/sqlfmt/api.py
@@ -168,9 +168,7 @@ def _format_many(
     format_func = partial(_format_one, mode=mode)
     if len(cache_misses) > 1 and not mode.single_process:
         results.extend(
-            asyncio.get_event_loop().run_until_complete(
-                _multiprocess_map(format_func, paths, callback=callback)
-            )
+            asyncio.run(_multiprocess_map(format_func, paths, callback=callback))
         )
     else:
         results.extend((map(format_func, cache_misses)))

diff --git a/src/sqlfmt/dialect.py b/src/sqlfmt/dialect.py
@@ -278,14 +278,14 @@ def __init__(self) -> None:
                         r"cube",
                         r"(not\s+)?exists",
                         r"grouping sets",
-                        r"(not\s+)?ilike",
                         r"(not\s+)?in",
                         r"is(\s+not)?",
                         r"isnull",
-                        r"(not\s+)?like(\s+any)?",
+                        r"(not\s+)?i?like(\s+any)?",
                         r"notnull",
-                        r"rollup",
+                        r"(not\s+)?regexp",
                         r"(not\s+)?rlike",
+                        r"rollup",
                         r"some",
                         r"(not\s+)?similar\s+to",
                         r"using",
@@ -318,15 +318,6 @@ def __init__(self) -> None:
                         token_type=TokenType.TIGHT_WORD_OPERATOR,
                     ),
                 ),
-                Rule(
-                    name="not",
-                    priority=923,
-                    pattern=group(r"not") + group(r"\W", r"$"),
-                    action=partial(
-                        actions.add_node_to_buffer,
-                        token_type=TokenType.WORD_OPERATOR,
-                    ),
-                ),
                 Rule(
                     name="as",
                     priority=930,
@@ -345,6 +336,7 @@ def __init__(self) -> None:
                     pattern=group(
                         r"and",
                         r"or",
+                        r"not",
                     )
                     + group(r"\W", r"$"),
                     action=partial(

diff --git a/src/sqlfmt/line.py b/src/sqlfmt/line.py
@@ -236,7 +236,9 @@ def is_standalone_jinja_statement(self) -> bool:
 
     @property
     def is_standalone_operator(self) -> bool:
-        return self._is_standalone_if(self.starts_with_operator)
+        return self._is_standalone_if(
+            self.starts_with_operator and not self.starts_with_square_bracket_operator
+        )
 
     @property
     def is_standalone_comma(self) -> bool:

diff --git a/src/sqlfmt/merger.py b/src/sqlfmt/merger.py
@@ -7,8 +7,8 @@
 from sqlfmt.line import Line
 from sqlfmt.mode import Mode
 from sqlfmt.node import Node
+from sqlfmt.operator_precedence import OperatorPrecedence
 from sqlfmt.segment import Segment, create_segments_from_lines
-from sqlfmt.token import TokenType
 
 
 @dataclass
@@ -62,7 +62,7 @@ def _extract_components(
             content_nodes = [
                 cls._raise_unmergeable(node, allow_multiline)
                 for node in line.nodes
-                if node.token.type != TokenType.NEWLINE
+                if not node.is_newline
             ]
             if content_nodes:
                 final_newline = line.nodes[-1]
@@ -125,6 +125,7 @@ def maybe_merge_lines(self, lines: List[Line]) -> List[Line]:
 
         Returns a new list of Lines
         """
+
         try:
             merged_lines = self.create_merged_line(lines)
         except CannotMergeException:
@@ -139,7 +140,9 @@ def maybe_merge_lines(self, lines: List[Line]) -> List[Line]:
             if len(segments) > 1:
                 # merge together segments of equal depth that are
                 # joined by operators
-                segments = self._maybe_merge_operators(segments)
+                segments = self._maybe_merge_operators(
+                    segments, OperatorPrecedence.tiers()
+                )
                 # some operators really should not be by themselves
                 # so if their segments are too long to be merged,
                 # we merge just their first line onto the prior segment
@@ -189,41 +192,42 @@ def _fix_standalone_operators(self, segments: List[Segment]) -> List[Segment]:
     def _maybe_merge_operators(
         self,
         segments: List[Segment],
-        priority: int = 2,
+        op_tiers: List[OperatorPrecedence],
     ) -> List[Segment]:
         """
         Tries to merge runs of segments that start with operators into previous
         segments. Operators have a priority that determines a sort of hierarchy;
         if we can't merge a whole run of operators, we increase the priority to
         create shorter runs that can be merged
         """
-        if len(segments) <= 1 or priority < 0:
+        if len(segments) <= 1 or not op_tiers:
             return segments
         head = 0
         new_segments: List[Segment] = []
+        precedence = op_tiers.pop()
 
         for i, segment in enumerate(segments[1:], start=1):
-            if not self._segment_continues_operator_sequence(segment, priority):
+            if not self._segment_continues_operator_sequence(segment, precedence):
                 new_segments.extend(
-                    self._try_merge_operator_segments(segments[head:i], priority)
+                    self._try_merge_operator_segments(segments[head:i], op_tiers.copy())
                 )
                 head = i
 
         # we need to try one more time to merge everything after head
         else:
             new_segments.extend(
-                self._try_merge_operator_segments(segments[head:], priority)
+                self._try_merge_operator_segments(segments[head:], op_tiers.copy())
             )
 
         return new_segments
 
     @classmethod
     def _segment_continues_operator_sequence(
-        cls, segment: Segment, min_priority: int
+        cls, segment: Segment, max_precedence: OperatorPrecedence
     ) -> bool:
         """
         Returns true if the first line of the segment is part
-        of a sequence of operators
+        of a sequence of operators of priority <= max_priority
         """
         try:
             line, _ = segment.head
@@ -232,30 +236,12 @@ def _segment_continues_operator_sequence(
             return True
         else:
             return (
-                (
-                    line.starts_with_operator
-                    and cls._operator_priority(line.nodes[0].token.type) <= min_priority
-                )
-                or line.starts_with_comma
-                or line.starts_with_square_bracket_operator
-            )
-
-    @staticmethod
-    def _operator_priority(token_type: TokenType) -> int:
-        if token_type in (TokenType.BOOLEAN_OPERATOR, TokenType.ON):
-            return 2
-        elif token_type not in (
-            # list of "tight binding" operators
-            TokenType.AS,
-            TokenType.DOUBLE_COLON,
-            TokenType.TIGHT_WORD_OPERATOR,
-        ):
-            return 1
-        else:
-            return 0
+                line.starts_with_operator
+                and OperatorPrecedence.from_node(line.nodes[0]) <= max_precedence
+            ) or line.starts_with_comma
 
     def _try_merge_operator_segments(
-        self, segments: List[Segment], priority: int
+        self, segments: List[Segment], op_tiers: List[OperatorPrecedence]
     ) -> List[Segment]:
         """
         Attempts to merge segments into a single line; if that fails,
@@ -269,7 +255,7 @@ def _try_merge_operator_segments(
                 Segment(self.create_merged_line(list(itertools.chain(*segments))))
             ]
         except CannotMergeException:
-            new_segments = self._maybe_merge_operators(segments, priority - 1)
+            new_segments = self._maybe_merge_operators(segments, op_tiers)
         finally:
             return new_segments
 
@@ -289,21 +275,25 @@ def _maybe_stubbornly_merge(self, segments: List[Segment]) -> List[Segment]:
         if len(segments) <= 1:
             return segments
 
+        stubborn_merge_tier = OperatorPrecedence.COMPARATORS
+
         new_segments = [segments[0]]
         for segment in segments[1:]:
             prev_operator = self._segment_continues_operator_sequence(
-                new_segments[-1], min_priority=1
+                new_segments[-1], max_precedence=stubborn_merge_tier
             )
             if (
                 # always stubbornly merge P0 operators (e.g., `over`)
-                self._segment_continues_operator_sequence(segment, min_priority=0)
+                self._segment_continues_operator_sequence(
+                    segment, max_precedence=OperatorPrecedence.OTHER_TIGHT
+                )
                 # stubbornly merge p1 operators only if they do NOT
                 # follow another p1 operator AND they open brackets
                 # and cover multiple lines
                 or (
                     not prev_operator
                     and self._segment_continues_operator_sequence(
-                        segment, min_priority=1
+                        segment, max_precedence=stubborn_merge_tier
                     )
                     and segment.tail_closes_head
                 )
@@ -325,12 +315,12 @@ def _stubbornly_merge(
         best possible merger of those two segments
         """
         new_segments: List[Segment] = []
-        # try to merge the first line of this segment with the previous segment
         head, i = segment.head
 
+        # try to merge the first line of this segment with the previous segment
         try:
             prev_segment = Segment(self.create_merged_line(prev_segment + [head]))
-            prev_segment.extend(Segment(segment[i + 1 :]))
+            prev_segment.extend(segment[i + 1 :])
             new_segments.append(prev_segment)
         except CannotMergeException:
             # try to add this segment to the last line of the previous segment

diff --git a/src/sqlfmt/node.py b/src/sqlfmt/node.py
@@ -207,6 +207,7 @@ def is_operator(self) -> bool:
                 TokenType.SEMICOLON,
             )
             or self.is_multiplication_star
+            or self.is_square_bracket_operator
         )
 
     @property
-Original file line number
+Diff line change
@@ -1,4 +1,4 @@
-.9.7
-.8.13
 .7.12
-.10.3
+.8.13
+.9.13
+.10.6