-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: table-not-found issue with executeSelect while running long queries #2222
Conversation
…n table_not_found
# Conflicts: # README.md
// for getQueryResults) per iteration of the loop | ||
long startTimeMs = System.currentTimeMillis(); | ||
long totalTimeOutMs = 18 * 60 * 60 * 1000; // 18 hours, which is the max timeout for the job | ||
long poolingIntervalMs = 60000; // 1 min |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/pooling/polling/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for point it out, I have update it
// This logic will wait for approx (poolingIntervalMs + 10 seconds which is the default timeout | ||
// for getQueryResults) per iteration of the loop | ||
long startTimeMs = System.currentTimeMillis(); | ||
long totalTimeOutMs = 18 * 60 * 60 * 1000; // 18 hours, which is the max timeout for the job |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's reasonable to leave the totalTimeoutMs and assorted logic out of this PR. You may in the future want to add a user configurable timeout, but per-rpc timeouts are sufficient.
BQ guarantees that jobs make forward progress (a job won't get stuck in pending forever). This interval is so long you may as well just trust the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. I have updated the logic and have added a timeoutMs
param at the RPC layer which is currently hardcoded to 60 seconds. QQ, in the very corner case when the job runs for 18 hours and is still not-complete then will the backend throw and error? Otherwise it will be an infinite loop as jobComplete
will never be true
.
Also, I have captured task to make timeoutMs
user configurable here: #2240 (point#6)
|
||
} else { // wait for the defined poolingIntervalMs and the loop will retry | ||
try { | ||
Thread.sleep(poolingIntervalMs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior here seems wrong. getQueryResults will poll from the server side for up to 10 seconds, and then you sleep for 60 in the client? Effective latency for an 11 sec query becomes more than a minute.
You should not wait on the client side at all, just re-issue a subsequent getQueryResults call. Allow the poll and wait to happen within BQ.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, updated to logic to use RCP based poll and wait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting all this together!
Internal Bug's Ref: b/241134681 .
table-not-found
issue has been observed withexecuteSelect
while running long queries. This issue comes while initialising storage read session when the query job is not complete.This is a short term fix where we are polling the job's status using
jobs.getQueryResults
and the session with read API is initialised when the job is complete, thus avoidingtable-not-found
.Capturing this FR for the long term fix/re-design: #2240
Ref: go/executeselect-re-design