Conversation

@julian-elastic (Contributor) commented Dec 11, 2025:

This PR is a refactoring step towards "Init Lookup Driver Just Once".

Currently, AbstractLookupService::doLookup() creates operators directly from the request and input page. Since operators are tightly coupled with the driver and input pages, they cannot be reused across multiple pages. We plan to add local logical and physical planning; however, we cannot do that per page, as it would add too much overhead. We need to perform planning once during session initialization rather than for every page. This PR takes a step in that direction by first generating a physical plan that can be shared across multiple pages. Main changes include:
1. Refactor AbstractLookupService::doLookup(): instead of creating operators directly, we now create a PhysicalPlan, then convert PhysicalPlan -> Operator Factories -> Operators.
This separation allows the PhysicalPlan to be generated once and cached in a future PR, since it doesn't depend on the input page data.
2. QueryLists no longer depend on a particular page and are stateless with respect to page contents: they use a channelOffset instead of blocks (see the sketch after this list). Since QueryLists are now created during planning (before we have input pages), they can no longer store blocks directly. Instead, each QueryList stores a channelOffset (the index of the block within a page). Because the page structure is consistent across all pages in a session, the channelOffset remains the same. At runtime, when getQuery() is called, the QueryList extracts the appropriate block from the current page via inputPage.getBlock(channelOffset).
3. New physical plan nodes: LookupMergeDropExec and ParameterizedQueryExec.
4. New LookupExecutionMapper: converts a physical plan to operators for the lookup node and handles the dictionary-encoding optimization for enrich (and possibly lookup join in the future).
5. Add unit tests for the refactor.
6. Add LookupJoinIT, which allows fast debugging of failing csv-spec IT tests by loading only the required indexes.
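
A minimal sketch of the channelOffset idea from point 2, with stand-in types (Page, Block, and Query here are simplified placeholders for the real compute-engine types, and the getQuery signature is illustrative, not the PR's exact API):

```java
// Sketch only: stand-in types, not the actual ES|QL classes.
interface Block {}
interface Query {}
interface Page {
    Block getBlock(int channelIndex);
}

abstract class ChannelOffsetQueryList {
    // Index of the match-field block within every page; fixed at planning time
    // and valid for the whole session because the page layout doesn't change.
    private final int channelOffset;

    ChannelOffsetQueryList(int channelOffset) {
        this.channelOffset = channelOffset;
    }

    // Resolve the block lazily, per page, instead of storing it, so the same
    // QueryList instance can be reused for every incoming page.
    final Query getQuery(Page inputPage, int position) {
        Block block = inputPage.getBlock(channelOffset);
        return buildQuery(block, position);
    }

    // Subclasses turn the block's value at the given position into a query.
    protected abstract Query buildQuery(Block block, int position);
}
```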

@elasticsearchmachine (Collaborator):
Hi @julian-elastic, I've created a changelog YAML for you.

@julian-elastic marked this pull request as ready for review December 15, 2025 21:38
@elasticsearchmachine added the Team:Analytics label Dec 15, 2025
@elasticsearchmachine (Collaborator):
Pinging @elastic/es-analytical-engine (Team:Analytics)


@alex-spies (Contributor) left a comment:

I did a first pass. Thanks a lot @julian-elastic, this is going in the right direction and the AbstractLookupService#doLookup method is starting to look like a nice planning pipeline.

However, the main thing I noticed is that the physical plan abstraction is a bit broken in places, because we include operator-level concepts in the physical plans. I left some remarks about that.

As I continued reviewing, I realized that some of the complexity, and some of the places where the border between physical plan and operator is crossed, are probably due to AbstractLookupService being used for both enrich and lookup join.

In particular, LookupMergeDropExec confused me and is crossing the border into operator territory (see below). However, I think we can simplify that if we separate the code paths for enrich and lookup join. If I understand correctly, LookupMergeDropExec always maps to a project operator for lookup joins because for joins, we have mergePages = false.

If you agree with that (please see if you can confirm my observation - it's late in my day and I might have missed something), then my comments below are a bit outdated and the conclusion is rather that we shouldn't have LookupMergeDropExec at all, and the doLookup method should be refactored for the lookup join case, only, leaving the code for enrich alone for the most part.

It's perfectly fine to have the code for enrich and lookup join diverge, even to the point that we get rid of AbstractLookupService altogether and only have the separate EnrichLookupService and LookupFromIndexService. The code paths will necessarily diverge as we continue improving the planning and execution of lookup joins.

```java
}
Class<?> argClass = (Class<?>) argType;

// Handle array types - can't mock arrays, so create them directly
```
@alex-spies (Contributor):

I just realized that our query plans never really use arrays, but generally lists.

consistency nit: any reason to use arrays in LookupMergeDropExec (which necessitates this test change)? (Except that it's admittedly weird to not use int[].)

```java
public class LookupMergeDropExec extends UnaryExec {
    private final List<NamedExpression> extractFields;
    private final ElementType[] mergingTypes;
    private final int[] mergingChannels;
```
@alex-spies (Contributor):

Other PhysicalPlans are not aware of channels - channels are an operator concept. (I know, in other query planning frameworks, the physical plans would already be aware of the physical layout of columnar data.)

Normally, we'd use attributes to refer to specific columns of the physical plan. Can't we do the same here?

@julian-elastic (Contributor, Author):

The problem is that the attributes are different in subsequent requests, I think (they would have different ids due to deserialization). Finding them by name might lead to performance issues. Any ideas how to get around that? Do you still think it is a bad idea to go with channels for now?

@alex-spies (Contributor):

Looks like we only use mergingChannels in the mergePages == true case, that is, for ENRICH. Same for mergingTypes.

I think separating the enrich and lookup join code paths should get rid of this altogether, because we won't need them?

Comment on lines +28 to +29

```java
 * This handles either merging multiple result pages into one (via MergePositionsOperator)
 * or dropping the doc block (via ProjectOperator).
```
@alex-spies (Contributor):

This is confusing. The MergePositionsOperator and ProjectOperator are quite different. Shouldn't/can't they be represented using different physical plans?

@julian-elastic (Contributor, Author):

Well, we decide which one to use dynamically, depending on the page contents and the optimization value. So for one page we might use MergePositionsOperator, for another ProjectOperator. We cannot make that decision during physical planning because we don't have the page contents yet; the decision has to be made during execution planning.

@alex-spies (Contributor):

I think MergePositionsOperator is only required for ENRICH, when we merge multiple matches into multivalues, no?

Looking through the code, this depends on mergePages, which is set during instantiation of the AbstractLookupService. For lookup join, we instantiate it with mergePages == false. Which means that the LookupMergeDropExec could just be a ProjectExec every time, no?
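
If that observation holds, the mapping for this node reduces to a single branch. A hedged sketch of that branch (all names here are placeholders, not the PR's API):

```java
// Sketch only: placeholder types illustrating the branch described above.
interface OperatorFactory {}

final class MergeDropMapperSketch {
    OperatorFactory map(boolean mergePages) {
        if (mergePages) {
            // ENRICH: fold multiple matches per position into multivalues.
            return mergePositionsFactory();
        }
        // Lookup join: mergePages is always false, so this node is only ever
        // a projection that drops the doc block, i.e. a plain ProjectExec.
        return projectFactory();
    }

    private OperatorFactory mergePositionsFactory() {
        return new OperatorFactory() {}; // stub
    }

    private OperatorFactory projectFactory() {
        return new OperatorFactory() {}; // stub
    }
}
```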

```java
/**
 * Physical plan node representing a lookup source operation.
 * This represents the source of a lookup query before conversion to operators.
 * The QueryList is created during physical plan creation and will receive the Block at runtime.
```
@alex-spies (Contributor):

What is "the Block" in this context? Can we extend this javadoc, maybe also add an example?

@julian-elastic (Contributor, Author):

I will add better comments. It is the Page that is passed to doQuery() at runtime.

```java
Releasables.wrap(shardContext.release, localBreaker)

// Phase 2: Build PhysicalOperation, a factory for Operators needed
PhysicalOperation physicalOperation = executionMapper.buildOperatorFactories(
```
@alex-spies (Contributor):

In LocalExecutionPlanner, there's a little more machinery: we're building a LocalExecutionPlan there, which has driver factories, which in turn have driver suppliers, which then contain physical operations.

That doesn't apply to us because that machinery is aimed at building drivers based on shards, correct?

```java
AliasFilter aliasFilter,
Block block,
@Nullable DataType inputDataType
int channelOffset,
```
@alex-spies (Contributor):

The query list requires the offset, which makes it pretty much operator level. But ParameterizedQueryExec contains a query list, which means we need to think about channels to make a ParameterizedQueryExec.

Is there a way to postpone the instantiation of the query list to the point where we perform the mapping to operators? Or another way to abstract this so that we can forget about channels until we hit the actual LookupExecutionMapper?

@julian-elastic (Contributor, Author):

Yes, but the idea was that we do as much work as possible during Logical and Physical planning so it is done once and shared across requests. Also, during instantiation we decide which filters to push to Lucene and which to apply afterwards. You need the filters that are applied afterwards for your Local Logical/Physical Optimization Rules. I am open to ideas, though.

@alex-spies (Contributor):

> Yes, but the idea was that we do as much work as possible during Logical and Physical planning so it is done once and shared across requests.

We're on the same page here! Assuming that, in the future, we'll be at a point where we stream pages to the driver(s), there should be an operator that has a QueryList, which has a channel offset and is re-used for each incoming page. 100%!

What I'm criticizing: This means QueryList is something that should exist at operator level, not inside a physical plan node. The ParameterizedQueryExec should not have to know anything about channels - it's the mapper's job to determine which input channel to use.

If we start making physical plans aware of channels, I'm pretty sure we'll end up with some weirdnesses that will start to proliferate and require workarounds. For instance, I won't be able to rewrite the query plan upstream from ParameterizedQueryExec (injecting an EvalExec, for instance, or projecting away a column that we realize we no longer need) without having to also update ParameterizedQueryExec because the upstream channels may change. This means that suddenly, the lookup physical optimizer will need to be aware of channels, which are supposed to be still abstracted away at this point :(

It's actually worse than that, because the query list not only is channel-aware, but it also has an alias filter which requires the indices service. It has shard contexts. And a Warnings object!

All of that is something that an operator would have, but not a physical plan.

In practical terms, I think ParameterizedQueryExec shouldn't contain a QueryList, but instead all the things that the lookup mapper needs to create the query list from.
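
A sketch of the separation proposed here, with hypothetical names (the record, the layout map, and newQueryList are all illustrative, not actual PR code): the plan node carries only declarative inputs, and the mapper resolves the match-field attribute to a channel when it builds the operator-level QueryList.

```java
import java.util.Map;

// Sketch only: hypothetical types. The plan node stays attribute-level and
// never sees channels; channel resolution happens in the mapper.
record ParameterizedQuerySketch(String matchFieldName, String matchFieldType) {}

interface QueryListStub {}

final class LookupMapperSketch {
    // layout maps attribute names to channel indices and is only known at
    // operator-mapping time.
    QueryListStub map(ParameterizedQuerySketch node, Map<String, Integer> layout) {
        int channelOffset = layout.get(node.matchFieldName());
        return newQueryList(node.matchFieldType(), channelOffset);
    }

    private QueryListStub newQueryList(String type, int channelOffset) {
        // Real code would pick a term/range/etc. query list here.
        return new QueryListStub() {};
    }
}
```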

```java
}
releasables.add(finishPages);

var warnings = Warnings.createWarnings(
```
@alex-spies (Contributor) commented Dec 19, 2025:

Huh, do we need to create the warnings object in multiple places? There's another one that we make in createLookupPhysicalPlan.

Maybe we want to limit it to one warnings object that we pass around - or, even better, can we avoid needing the warnings object in createLookupPhysicalPlan if we manage to separate physical stuff out further from the query plan? This'd be the first time that a physical plan internally carries around a warnings object. Creating warnings is normally something that happens during runtime.

@julian-elastic (Contributor, Author):

I will look into refactoring. It works right now, though: the CSV tests pass with all warnings correct even though the object is created in multiple places.

```java
);

// Phase 1: Physical Planning
LookupShardContext shardContext = lookupShardContextFactory.create(request.shardId);
```
@alex-spies (Contributor):

The fact that we need a shard context to create the physical plan also looks like operator-level objects seep into the planning layer. I'd expect shard contexts to only be needed once we map to operators.

If you compare with e.g. EsQueryExec, the physical plan that corresponds to running a Lucene query, that one is completely oblivious of shards.

@julian-elastic (Contributor, Author):

This will need refactoring. I will work on it.

```java
 * because the list will never grow mega large.
 */
// Determine optimization state
BlockOptimization blockOptimization = executionMapper.determineOptimization(request.inputPage);
```
@alex-spies (Contributor):

Do the dictionary/range block optimizations apply to lookup joins? Or just to enrich?

As it stands, the actual physical operators depend on the page that we perform the lookup for. Being per-page, this check would normally live inside an operator, not in the mapping stage.

Update: I see that mergePages is the differentiator between enrich and lookup join - being always false for the latter. So, for lookup joins, this step doesn't matter and we don't actually have to look into each page.

Since enrich doesn't need a proper planning stage and only lookup join does: how about we separate the code for the two more? Enforcing so much code sharing will become increasingly hard as we re-architect lookup join's planning and execution flow. If needed, they honestly don't even have to inherit from the same superclass. That was useful a while ago, because lookup join re-used a ton of enrich code, but that's bound to change.

@julian-elastic (Contributor, Author):

Right now the dictionary/range optimizations apply only to enrich, but I would like to take advantage of them for lookup join too in the future. So I did not want to rip this code out completely and then have to put it back in when we apply the optimization to lookup join as well.

@alex-spies (Contributor) commented Dec 22, 2025:

When we apply the dictionary/range optimizations, I'm pretty sure the code won't be transferable, anyway, because for ENRICH, the matching values get stuffed into multivalues, while for lookup join, multiple matching values become multiple new rows.

Let's separate the code paths. I think this refactor is currently quite complicated because it also refactors ENRICH code that's supposed to stay the way it is.

I think in the mergePages == false case, a lot of things simplify.
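
To make the ENRICH vs. lookup join difference concrete, a toy example (hypothetical data, not from the PR):

```
Input row:             { key = "a" }
Matching lookup rows:  { key = "a", v = 1 }, { key = "a", v = 2 }

ENRICH (mergePages == true) emits one row, folding matches into a multivalue:
  { key = "a", v = [1, 2] }

LOOKUP JOIN (mergePages == false) emits one row per match:
  { key = "a", v = 1 }
  { key = "a", v = 2 }
```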


Labels

:Analytics/ES|QL, Team:Analytics, >tech debt, v9.4.0
