-
Notifications
You must be signed in to change notification settings - Fork 14.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TaskGroup dependencies handled inconsistently #16764
Comments
Oh yeah. Okay, so why this happens: TaskGroups don't actually exist as dependencies, so when you do To fix this properly we would need need a "two pass" approach (which I think, isn't a problem): the first pass happens when parsing the DAG file, and when we do |
The workaround for now, is as you said, to move the |
I think in order to really tackle this one, there are two issues here that need to be addressed.
I believe that by solving these two problems independently, with respect to the bugs that are involved with TaskGroups, as shown in this Issue, will give a better base for the platform and provide a better integration with TaskGroups. Thoughts on anything that I've written here? I also believe that one solution may also inform the other. I know that you've already come up with some ideas @ashb but I'm really curious to see how you feel about tackling the two problems that I've given here separately, using one to inform the other. |
Did some more research and it leads me to believe that if we consider the TaskGroup to be a "dependable" in the same way that we consider tasks able to depend on each other, that is: Taskgroups may depend on or be depended on Tasks and other TaskGroups, then we will be available to avoid many other problems that occur like this. Appendix A: Expected Graph and Tree View I expect all definitions below to give a graph view, tree view, and actual running order to look like the pictures linked in Appendix A. Here are the definitions that I found that give the correct graph and tree view:
and
The definition below gives a graph and tree view that are consistent with each other, but not correct and matching with Appendix A:
The definition below gives an inconsistent tree and graph view, as well as incorrect running order. This is the example given by OP of this issue.
What we can see from the examples and the diagrams above is that there are a few events which depending on their order can affect the correctness of the dependencies in the DAG as well as the graph and tree view, which are sometimes inconsistent with each other. The events that are significant in these definitions that I can see are:
Before steps 2, 3, or 4 happens, we must ensure that step 1 has taken place. This means that we are left with 3 steps that can have an interchangeable order and affect the graph view, tree view, and running order of the DAG. I believe that all of the definitions above should give the running order and graph/tree view specified in Appendix A. This means that steps 2, 3, 4, from the above paragraph can be run in any order and the result will always be the same. |
One way I can think of to get rid of all the complexity is to prohibit the task group from being used before the context manager exits, i.e. with TaskGroup('tg') as taskgroup:
task1 = ...
task2 = ...
task1 >> task2 # Fine.
start >> taskgroup >> end # Throws error! with TaskGroup('tg') as taskgroup:
task1 = ...
task2 = ...
task1 >> task2
start >> taskgroup >> end # Must do this instead. This essentially ensures the step 3 happens after step 2, and leaves only steps 3 and 4 to be interchangable. Does with TaskGroup('tg') as taskgroup:
task1 = ...
task2 = ...
start >> taskgroup >> end
task1 >> task2 produce an “expected” graph? If it does, all problems would be solved from what I can tell; if not, this is the only thing we need to fix (aside from implementing logic to prohibit a task gorup to be used before exiting). This would break some of the existing usages, but I think we might be able to get away with the breaking change because most of the usages that would break does not work very well right now anyway (as shown in this issue), and therefore are unlikely to be widely in use right now. |
That would work, but there are also other problems that we get from not having TaskGroups actually exist in the DAG/dependency chain, so another option is to make tasks be able to depend directly on TaskGroups (I.e. rename downstream_task_ids to downstream_ids) |
This one really caught me out the other day as I couldn't see from the dependency lines being drawn in the graph view why my tasks were starting with their dependencies un-met. It would seem that for task groups to be totally convenient it shouldn't matter when I add the dependency information, the outcome should be the same (as others have posted.). Because that's exactly what would happen when you link two tasks. It doesn't matter when you link them or in what order you create them. This looks like it might get a fix some time quite soon but if that is not the case perhaps some form of warning logged if tasks are added to groups AFTER dependencies have been added would at least alert people that the behaviour they expect is unlikely to be what they get. :) Do like task groups though. Make things really simple and also lets me compose the same set of tasks multiple times in the same DAG by adding a different task group around the outside of them and letting the prefix keep them uniquely identifiable, which can be really really handy when you are programatically generating a DAG from meta data about the files you have been given to process. :) |
I had the same problem here and the tasks inside of TaskGroup didn't respect the TaskGroup previous dependencies. This issue helped me a lot, thanks! Ps: I'm using Airflow 2.0.2 from AWS (MWAA) |
I want to add that a fix to this problem is still requested (alongside better documentation of this problem in the first place). I had not realized that chaining the TaskGroup before populating it would result in no dependencies being defined between group-internal tasks and those downstream. This post was so illuminating to what had been hours of frustration as I tried to understand why the grid and graph views misaligned. |
Can someone please confirm if issue still happens on latest Airflow version? |
Yes, just tested the original sample dag. Still happens. Unfortunately, it causes the grid view to break. |
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There has been several Airflow releases since last activity on this issue. Kindly asking to recheck the report against latest Airflow version and let us know if the issue is reproducible. The issue will be closed in next 30 days if no further activity occurs from the issue author. |
Apache Airflow version: 2.0.1
Environment:
What happened:
I read the following documentation about Task Groups:
https://airflow.apache.org/docs/apache-airflow/stable/concepts/index.html
https://www.astronomer.io/guides/task-groups
From this documentation it seemed that dependencies between Task Groups are possible, which is a really nice feature for complex DAGs where adding a task to one group no longer involves updating the dependencies of tasks in downstream groups.
I implemented a Task Group with dependency relationships to start and end dummy tasks. However when the DAG was run the start, end, and first task of the group all ran simultaneously. It took me a while to see what I was doing wrong, which was that I was adding the group dependencies before adding tasks to the group.
One big source of confusion here is that the Graph View of the DAG does show connecting lines from the start/end tasks to the Task Group, so it looks like there should be dependencies when there aren't any. The Tree View however shows no such dependencies.
What you expected to happen:
I would expect the Graph View to show the same dependencies as the Tree View, and not show dependencies that aren't actually there.
My mental model from reading the documentation was that the dependencies were set on the group, whereas it seems as if the dependencies are actually set on whatever tasks happen to be in the group at the time the dependency is added.
If this is indeed how Task Groups are intended to work it might be worth clarifying this somewhere in the documentation and not just rely on examples that do the right thing.
How to reproduce it:
Here is an example that shows what how my DAG was laid out:
Here is what I see in the Graph View:
Here is what I see in the Tree View:
If I move the
start >> taskgroup >> end
line below thetask1 >> task2
line the Graph View is exactly identical but the Tree View matches my expectation:The text was updated successfully, but these errors were encountered: