
Optimize CI: consolidate workflows, fix caching, speed up e2e tests (47min → 15min)#772

Merged: vikrantpuppala merged 3 commits into main from ci/optimize-e2e-and-coverage-v2 on Apr 14, 2026

@vikrantpuppala vikrantpuppala commented Apr 13, 2026

Summary

  • Consolidate 3 workflows into 1: Delete integration.yml and daily-telemetry-e2e.yml — coverage workflow already runs all e2e tests. Add push: main trigger. Run all tests (including telemetry) in a single pytest invocation with --dist=loadgroup for xdist_group isolation.
  • Fix pyarrow cache: Remove cache-path: .venv-pyarrow. Poetry always creates .venv, so the configured path never existed and the cache was never saved ("Path does not exist" error). 3.14 PyArrow jobs dropped from 18 min to 3 min once the cache was populated.
  • Fix 3.14 post-test DNS hang: Add enable_telemetry=False to unit test dummy connection args. Unit tests using server_hostname="foo" triggered real HTTP calls — on protected runners this caused an 8-min process hang. 3.14 unit tests dropped from 17min → 2min.
  • Better xdist distribution: Split TestPySQLLargeQueriesSuite into 3 separate classes and split lz4 on/off into separate parametrized cases so xdist distributes slow tests across 4 workers.
  • Use 4 workers: Pass -n 4 instead of -n auto, which resolves to 2 workers on the 2-CPU runners. E2e tests are network-bound (waiting on the warehouse), not CPU-bound.
  • Reduce test sizes: Large result set tests 300MB → 100MB. test_long_running_query threshold 3min → 1min, starting scale_factor 1 → 50.

Results (measured)

| Metric | Before | After |
|---|---|---|
| E2E workflows per PR | 3 | 1 |
| Coverage wall-clock | 47 min | 15 min |
| Integration workflow | 40 min | deleted |
| 3.14 unit tests | 17m38s | 2m46s |
| 3.14 PyArrow tests | 18m26s | 3m21s |
| 3.14 linting | 15m46s | 1m27s |
| Total warehouse compute per PR | ~85 min | ~15 min |
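
The speedups implied by these measurements can be checked with a few lines of Python. This is only an arithmetic sketch: the durations are copied from the measurements above, and the names `to_seconds`, `MEASURED`, and `speedups` are illustrative, not part of the repository.

```python
import re


def to_seconds(text: str) -> int:
    """Convert strings such as '17m38s' or '47 min' to whole seconds."""
    match = re.fullmatch(r"(\d+)\s*m(?:in)?\s*(?:(\d+)s)?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1))
    seconds = int(match.group(2) or 0)
    return minutes * 60 + seconds


# (before, after) pairs copied from the measured results.
MEASURED = {
    "Coverage wall-clock": ("47 min", "15 min"),
    "3.14 unit tests": ("17m38s", "2m46s"),
    "3.14 PyArrow tests": ("18m26s", "3m21s"),
    "3.14 linting": ("15m46s", "1m27s"),
}


def speedups() -> dict:
    """Return before/after ratios rounded to one decimal place."""
    return {
        name: round(to_seconds(before) / to_seconds(after), 1)
        for name, (before, after) in MEASURED.items()
    }
```

Under these numbers the coverage workflow is roughly 3x faster, and the 3.14 jobs improve by larger factors because most of their previous runtime was the hang and the cold cache rather than test work.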

Test plan

  • All 34 CI checks pass
  • Coverage workflow runs all tests including telemetry (870 passed, 25 skipped)
  • 3.14 pyarrow cache saves and hits on subsequent runs
  • 3.14 jobs no longer have post-test DNS hang
  • LargeQueriesSuite tests distributed across multiple xdist workers

SKIP_COVERAGE_CHECK = CI workflow changes only, no source code coverage impact

This pull request was AI-assisted by Isaac.

Workflow consolidation:
- Delete integration.yml and daily-telemetry-e2e.yml (redundant with
  coverage workflow which already runs all e2e tests)
- Add push-to-main trigger to coverage workflow
- Run all tests (including telemetry) in single pytest invocation with
  --dist=loadgroup to respect xdist_group markers for isolation
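
A minimal sketch of how an xdist_group marker interacts with --dist=loadgroup, assuming pytest and pytest-xdist are installed. The class names, test names, and the group name "telemetry_serial" are hypothetical and are not taken from this repository.

```python
import pytest


# Hypothetical group name. With --dist=loadgroup, pytest-xdist sends every
# test that shares a group name to the same worker, so tests that must not
# run concurrently with each other stay serialized.
@pytest.mark.xdist_group(name="telemetry_serial")
class TestTelemetryE2E:
    def test_first(self):
        assert True

    def test_second(self):
        assert True


class TestUngrouped:
    # No xdist_group marker: this test may be scheduled on any free worker.
    def test_other(self):
        assert True
```

A run such as `pytest -n 4 --dist=loadgroup` would then keep both TestTelemetryE2E tests on one worker while the ungrouped test is balanced across the rest.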

Fix pyarrow cache:
- Remove cache-path: .venv-pyarrow from pyarrow jobs. Poetry always
  creates .venv regardless of the cache-path input, so the cache was
  never saved ("Path does not exist" error). The cache-suffix already
  differentiates keys between variants.

Fix 3.14 post-test DNS hang:
- Add enable_telemetry=False to unit test DUMMY_CONNECTION_ARGS that
  use server_hostname="foo". This prevents FeatureFlagsContext from
  making real HTTP calls to fake hosts, eliminating ~8min hang from
  ThreadPoolExecutor threads timing out on DNS on protected runners.
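
The effect of the flag can be sketched with a stand-in connection class. Everything below is an assumption made for illustration: `StubConnection` is not the real connector, its default of `enable_telemetry=True` only models the behaviour described above, and the URL it records is invented. The real DUMMY_CONNECTION_ARGS may contain more keys than shown here.

```python
# Only the two keys named in this PR are shown.
DUMMY_CONNECTION_ARGS = {
    "server_hostname": "foo",
    "enable_telemetry": False,
}


class StubConnection:
    """Hypothetical stand-in that records would-be background HTTP calls."""

    def __init__(self, server_hostname, enable_telemetry=True, **_ignored):
        self.server_hostname = server_hostname
        self.background_requests = []
        if enable_telemetry:
            # A real client would start a network call here; against a fake
            # host this is where a DNS lookup could stall after the tests end.
            self.background_requests.append(
                f"https://{server_hostname}/hypothetical-feature-flags"
            )


# With the flag off, constructing the connection schedules no network work.
quiet = StubConnection(**DUMMY_CONNECTION_ARGS)
noisy = StubConnection(server_hostname="foo")
```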

Improve e2e test parallelization:
- Split TestPySQLLargeQueriesSuite into 3 separate classes
  (TestPySQLLargeWideResultSet, TestPySQLLargeNarrowResultSet,
  TestPySQLLongRunningQuery) so xdist distributes them across workers
  instead of all landing on one.

Speed up slow tests:
- Reduce large result set sizes from 300MB to 100MB (still validates
  large fetches, lz4, chunking, row integrity)
- Start test_long_running_query at scale_factor=50 instead of 1 to
  skip ramp-up iterations that finish instantly
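
Why a higher starting scale_factor saves iterations can be seen in a small model of the ramp-up loop. This is a sketch under stated assumptions: the 1.5x growth rule, the `run_query` callback, and the cost of 0.5 s per unit of scale_factor in `fake_query` are all hypothetical and are not taken from the actual test.

```python
import math


def ramp_until_long_enough(run_query, min_duration_s, start_scale_factor, growth=1.5):
    """Grow scale_factor until a single query runs for at least min_duration_s.

    Returns (final_scale_factor, number_of_queries_issued).
    """
    scale_factor = start_scale_factor
    attempts = 0
    while True:
        attempts += 1
        if run_query(scale_factor) >= min_duration_s:
            return scale_factor, attempts
        # Assumed growth rule: increase the workload by 50% and retry.
        scale_factor = math.ceil(growth * scale_factor)


def fake_query(scale_factor):
    # Pretend each unit of scale_factor costs 0.5 s of warehouse time.
    return 0.5 * scale_factor


from_one = ramp_until_long_enough(fake_query, 60, start_scale_factor=1)
from_fifty = ramp_until_long_enough(fake_query, 60, start_scale_factor=50)
```

In this model, starting at 1 issues 12 queries before one lasts a minute, while starting at 50 issues 4. Each skipped iteration also saves a round trip to the warehouse.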

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <[email protected]>
- Use -n 4 instead of -n auto in coverage workflow. The e2e tests are
  network-bound (waiting on warehouse), not CPU-bound, so 4 workers on
  a 2-CPU runner is fine and doubles parallelism.
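
The reasoning that network-bound work benefits from more workers than CPUs can be illustrated with a toy timing comparison. This is an analogy only: pytest-xdist uses worker processes rather than threads, and `fake_warehouse_wait` simply sleeps in place of a real warehouse round trip.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_warehouse_wait(_):
    # Stand-in for a test that is blocked waiting on a remote warehouse.
    time.sleep(0.05)


def wall_clock(workers, tasks=8):
    """Return the seconds taken to finish `tasks` waits with `workers` workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_warehouse_wait, range(tasks)))
    return time.perf_counter() - start
```

Because each task spends its time waiting rather than computing, `wall_clock(4)` should come out at roughly half of `wall_clock(2)` regardless of how many CPUs the machine has.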
- Lower test_long_running_query min_duration from 3 min to 1 min.
  The test validates long-running query completion — 1 minute is
  sufficient and saves ~4 min per variant.
- Split lz4 on/off loop in test_query_with_large_wide_result_set into
  separate parametrized test cases so xdist can run them on different
  workers instead of sequentially in one test.
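
Splitting an in-test loop into parametrized cases might look like the sketch below, assuming pytest is installed. The function name, the `lz4_compression` parameter name, and the body are hypothetical; the real test opens a connection and fetches a large result set.

```python
import pytest


# Before (sketch): one test looped over both settings, so a single xdist
# worker ran them back to back.
#
#     def test_large_wide_result_set():
#         for lz4_compression in [False, True]:
#             ...
#
# After (sketch): each setting becomes its own test item, which xdist can
# hand to a different worker.
@pytest.mark.parametrize("lz4_compression", [False, True])
def test_large_wide_result_set(lz4_compression):
    # Hypothetical placeholder for the real connection arguments.
    connection_args = {"lz4_compression": lz4_compression}
    assert connection_args["lz4_compression"] in (False, True)
```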

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <[email protected]>
Comment thread .github/workflows/code-coverage.yml
Comment thread .github/workflows/daily-telemetry-e2e.yml Outdated
@vikrantpuppala vikrantpuppala changed the title from "Optimize CI: consolidate workflows, fix caching, speed up e2e tests" to "Optimize CI: consolidate workflows, fix caching, speed up e2e tests (47min → 15min)" on Apr 13, 2026
Comment thread .github/workflows/integration.yml Outdated

@jprakash-db jprakash-db left a comment

Overall LGTM. Thanks for making the changes

Comment thread tests/e2e/common/large_queries_mixin.py Outdated
Comment thread tests/e2e/test_driver.py Outdated
Per review feedback from jprakash-db:
- Remove mixin classes (LargeWideResultSetMixin, etc) — inline the
  test methods directly into the test classes in test_driver.py
- Remove backward-compat LargeQueriesMixin alias (nothing uses it)
- Drop the planned rename of _LargeQueryRowHelper: the class is replaced
  entirely by inlining
- Convert large_queries_mixin.py to just a fetch_rows() helper function
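
A helper of this kind could be as small as the sketch below. The signature, the `batch_size` parameter, and `FakeCursor` are assumptions made for illustration; the actual fetch_rows() in large_queries_mixin.py is not shown in this PR text and may differ.

```python
def fetch_rows(cursor, batch_size=1000):
    """Yield rows from a DB-API style cursor, one batch at a time.

    Hypothetical signature: only the name fetch_rows comes from the PR.
    """
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield from batch


class FakeCursor:
    """In-memory stand-in for a database cursor, used only for this sketch."""

    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, size):
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch


rows = list(fetch_rows(FakeCursor(range(5)), batch_size=2))
```

A plain function like this can be called from any test class without inheritance, which is the point of removing the mixins.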

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <[email protected]>
@vikrantpuppala vikrantpuppala enabled auto-merge (squash) April 14, 2026 06:20
@vikrantpuppala vikrantpuppala merged commit c46b3a0 into main Apr 14, 2026
34 checks passed