Skip to content
This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

MySQL 5.7 compatibility #33

Merged
merged 1 commit into from
Jan 27, 2020

Conversation

szczeles
Copy link
Contributor

SQLs modification to make DiscreETLy run on MySQL 5.7 (which has no row_number and poor regular expressions support)

Copy link
Collaborator

@MikolajBalcerek MikolajBalcerek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

  • keeps working with MySQL 8.0.11

https://airflow.apache.org/docs/stable/installation.html - Airflow itself starts MySQL support from 5.6.4, discreETLy says we need 8.0+

We can make changes to unofficially broaden support (keep Airflow parity?) for now, what do you think @Wikia/data-engineering?

Probably no promises that future changes won't break it again though

Comment on lines +119 to +131
SELECT
dag_run.dag_id,
dag_run.state
FROM dag_run
INNER JOIN (SELECT
dag_id,
MAX(execution_date) AS date
FROM dag_run
GROUP BY dag_id) mx
ON
dag_run.dag_id = mx.dag_id
AND dag_run.execution_date = mx.date
JOIN dag ON dag.dag_id = dag_run.dag_id AND is_active = 1 AND is_paused = 0'''.replace("\n", "")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks OK to me

Comment on lines +71 to +87
SELECT
dr.dag_id,
dr.execution_date,
dr_latest.state as dag_state,
ti.task_id,
ti.state as task_state,
ti.duration,
ti.start_date,
ti.end_date
FROM (
SELECT dag_id,
MAX(execution_date) as execution_date
FROM dag_run
JOIN dag ON dag.dag_id = dag_run.dag_id AND is_active = 1 AND is_paused = 0) dr
JOIN task_instance ti ON ti.dag_id = dr.dag_id AND ti.execution_date = dr.execution_date
WHERE dr.age = 1'''.replace("\n", "")
GROUP BY dag_id) dr
JOIN dag_run dr_latest ON dr.dag_id = dr_latest.dag_id AND dr.execution_date = dr_latest.execution_date
JOIN task_instance ti ON dr.dag_id = ti.dag_id AND dr.execution_date = ti.execution_date
JOIN dag ON dag.dag_id = dr.dag_id AND is_active = 1 AND is_paused = 0'''.replace("\n", "")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That also looks OK

Comment on lines +21 to +41
SELECT
@row_number:=CASE
WHEN @dag_id_current = clean_dag_id
THEN
@row_number + 1
ELSE
1
END AS rownum,
@dag_id_current:=clean_dag_id clean_dag_id,
dag_id,
execution_date as date,
state
FROM (
SELECT *,
CASE
WHEN substring_index(reverse(dag_id), "v_", 1) rlike '[0-9]+\.[0-9]'
THEN substring(dag_id, 1, length(dag_id) -1 - locate("v_", reverse(dag_id)))
ELSE dag_id
END as clean_dag_id from dag_run) dag_run, (SELECT @dag_id_current:='',@row_number:=0) as t
ORDER BY clean_dag_id, execution_date DESC
) dr WHERE rownum <= {days}'''
Copy link
Collaborator

@MikolajBalcerek MikolajBalcerek Jan 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is much less readible to me than the previous version.
We'd have to comment the method like it was so it's clear it's done to ignore the DAG's name versioning part

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this is much less readable. But well, it's the only way to do it in one query for all DAGs on the DB that doesn't support windowing functions. It passes all the tests in the project.

@mikulskibartosz
Copy link
Contributor

@MikolajBalcerek can we define the API for retrieving data from Airflow and provide two classes with separate implementations?

It complicates the implementation but we can continue using all features of the latest MySQL versions and at the same time provide compatibility with older versions.
We can add a configuration parameter that specifies the implementation to be used by discreETLy.

If we decide to have two implementations or use only the 5.7 version of SQL, we should have automated tests for that version too because in the future nobody will bother to test it manually after changes.

@szczeles
Copy link
Contributor Author

I like your idea about ability to add possibility to switch underlying implementation of Airflow's data provider. It can be useful for integrating discreETLy with other DBs like Postgres.

Current automated tests cover many cases about striping the DAGs versions, they were actually very useful when implementing this change. The both implementations (current for mysql >= 8 and the from the PR for mysql < 8) passes the same set of unit tests from the project, so there is no need to duplicate.

@MikolajBalcerek
Copy link
Collaborator

Hey @szczeles ,
We'll take your changes, so it's easier for everyone to keep using discreETLy :).
We are not touching it that often & performance is not crucial, so it's OK.

In the future we are going to continue developing with MySQL 8+ in mind, but if there are any incompatibilities feel free to prepare PRs or raise an issue and we'll see what can be done.
If the project gets some traction or we see additional uses we will reconsider @mikulskibartosz data providers idea.

Thanks for contributions and good luck :)!

@MikolajBalcerek MikolajBalcerek merged commit e946687 into Wikia:master Jan 27, 2020
@szczeles szczeles deleted the mysql_5_7_compatibility branch January 27, 2020 11:56
@jmalyszko-fandom
Copy link

Many thanks @szczeles for sharing your work!

@szczeles
Copy link
Contributor Author

Thanks @MikolajBalcerek, @mikulskibartosz 🙇‍♂️
Do you plan some docker image release as well with the current master code?

@MikolajBalcerek
Copy link
Collaborator

latest tag is already updated, I'll compile a changelog since 0.2.9 and release to dockerhub as 0.3.0 later :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants