Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TaskGroup dependencies handled inconsistently #16764

Open
tomyedwab opened this issue Jul 1, 2021 · 13 comments
Open

TaskGroup dependencies handled inconsistently #16764

tomyedwab opened this issue Jul 1, 2021 · 13 comments

Comments

@tomyedwab
Copy link

Apache Airflow version: 2.0.1

Environment:

  • Cloud provider or hardware configuration: Various (GCP & local Python)
  • OS (e.g. from /etc/os-release): Various (linux, OSX)

What happened:

I read the following documentation about Task Groups:
https://airflow.apache.org/docs/apache-airflow/stable/concepts/index.html
https://www.astronomer.io/guides/task-groups

From this documentation it seemed that dependencies between Task Groups are possible, which is a really nice feature for complex DAGs where adding a task to one group no longer involves updating the dependencies of tasks in downstream groups.

I implemented a Task Group with dependency relationships to start and end dummy tasks. However when the DAG was run the start, end, and first task of the group all ran simultaneously. It took me a while to see what I was doing wrong, which was that I was adding the group dependencies before adding tasks to the group.

One big source of confusion here is that the Graph View of the DAG does show connecting lines from the start/end tasks to the Task Group, so it looks like there should be dependencies when there aren't any. The Tree View however shows no such dependencies.

What you expected to happen:

I would expect the Graph View to show the same dependencies as the Tree View, and not show dependencies that aren't actually there.

My mental model from reading the documentation was that the dependencies were set on the group, whereas it seems as if the dependencies are actually set on whatever tasks happen to be in the group at the time the dependency is added.

If this is indeed how Task Groups are intended to work it might be worth clarifying this somewhere in the documentation and not just rely on examples that do the right thing.

How to reproduce it:

Here is an example that shows what how my DAG was laid out:

with DAG(
    'task_group_test',
    default_args=default_args,
    description='Task Group Test',
    start_date=datetime(2021, 7, 1),
    schedule_interval=None) as dag:

    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    with TaskGroup('tg') as taskgroup:
        start >> taskgroup >> end

        task1 = PythonOperator(task_id='hello1', python_callable=_print_hello)
        task2 = PythonOperator(task_id='hello2', python_callable=_print_hello)
        task1 >> task2

Here is what I see in the Graph View:
image

Here is what I see in the Tree View:
image

If I move the start >> taskgroup >> end line below the task1 >> task2 line the Graph View is exactly identical but the Tree View matches my expectation:

image

@tomyedwab tomyedwab added the kind:bug This is a clearly a bug label Jul 1, 2021
@eladkal eladkal added area:UI Related to UI/UX. For Frontend Developers. affected_version:2.0 Issues Reported for 2.0 labels Jul 4, 2021
@ashb
Copy link
Member

ashb commented Jul 7, 2021

Oh yeah.

Okay, so why this happens:

TaskGroups don't actually exist as dependencies, so when you do start >> taskgroup >> end you are setting the downstream of start and the upstream of end based on the current tasks in the taskgroup.

To fix this properly we would need need a "two pass" approach (which I think, isn't a problem): the first pass happens when parsing the DAG file, and when we do start >> taskgroup we store the Actual TaskGroup there, and only in the second pass (likely when we "bag" the DAG, handled internally in the parsing process of Airflow) is when we'd turn TaskGroups in the dependencies in to their actual values.

@ashb
Copy link
Member

ashb commented Jul 7, 2021

The workaround for now, is as you said, to move the start >> taskgroup >> end to outside of the TG context.

@ashb ashb added this to the Airflow 2.3 milestone Jul 27, 2021
@alex-astronomer
Copy link
Contributor

I think in order to really tackle this one, there are two issues here that need to be addressed.

  1. Just like the ticket mentions, we do expect that the TaskGroup can be a part of the dependency chain right? So if someone were to build a dependency like the "broken" example from this ticket, then the tasks would still all be connected like they are supposed to be. That would be my understanding at least, from the perspective of a user rather than a developer. By using the taskgroup as a "top-level" dependency, and handling all "sub-dependencies" within the TaskGroup separately, I think this problem could be solved.
  2. The graph view and tree view are showing inconsistencies, and my understanding is that the tree view dependencies are being honored in this case, rather than the ones that are showing in the graph view.

I believe that by solving these two problems independently, with respect to the bugs that are involved with TaskGroups, as shown in this Issue, will give a better base for the platform and provide a better integration with TaskGroups. Thoughts on anything that I've written here? I also believe that one solution may also inform the other. I know that you've already come up with some ideas @ashb but I'm really curious to see how you feel about tackling the two problems that I've given here separately, using one to inform the other.

@alex-astronomer
Copy link
Contributor

alex-astronomer commented Jan 17, 2022

Did some more research and it leads me to believe that if we consider the TaskGroup to be a "dependable" in the same way that we consider tasks able to depend on each other, that is: Taskgroups may depend on or be depended on Tasks and other TaskGroups, then we will be available to avoid many other problems that occur like this.


Appendix A: Expected Graph and Tree View

7280204D-5796-4741-8F2E-CB0E06C91A28
CDDC9FF2-B1ED-476B-BF35-845FD6028BC0


I expect all definitions below to give a graph view, tree view, and actual running order to look like the pictures linked in Appendix A.

Here are the definitions that I found that give the correct graph and tree view:

with TaskGroup(‘tg’) as taskgroup:
    task1 = PythonOperator(task_id=‘hello1’, python_callable=_print_hello)
    task2 = PythonOperator(task_id=‘hello2’, python_callable=_print_hello)
    task1 >> task2
    start >> taskgroup >> end

and

with TaskGroup('tg') as taskgroup:  
    task1 = PythonOperator(task_id='hello1', python_callable=_print_hello)  
    task2 = PythonOperator(task_id='hello2', python_callable=_print_hello)  
    task1 >> task2  
  
start >> taskgroup >> end

The definition below gives a graph and tree view that are consistent with each other, but not correct and matching with Appendix A:

with TaskGroup(‘tg’) as taskgroup:
    task1 = PythonOperator(task_id=‘hello1’, python_callable=_print_hello)
    task2 = PythonOperator(task_id=‘hello2’, python_callable=_print_hello)
    start >> taskgroup >> end
    task1 >> task2

1F30F659-8BBC-4014-997A-EB00F0BB0D42
6641254F-B31D-4BF1-9846-C6B17875C223


The definition below gives an inconsistent tree and graph view, as well as incorrect running order. This is the example given by OP of this issue.

with TaskGroup(‘tg’) as taskgroup:
    start >> taskgroup >> end
    task1 = PythonOperator(task_id=‘hello1’, python_callable=_print_hello)
    task2 = PythonOperator(task_id=‘hello2’, python_callable=_print_hello)
    task1 >> task2

53DDE4F7-DA19-4FBB-A1D3-359F4F7763F4
C3AED63E-C82A-4D16-A82F-81E344566624


What we can see from the examples and the diagrams above is that there are a few events which depending on their order can affect the correctness of the dependencies in the DAG as well as the graph and tree view, which are sometimes inconsistent with each other. The events that are significant in these definitions that I can see are:

  1. taskgroup variable defined
  2. internal tasks defined
  3. dependency set between start >> taskgroup >> end
  4. "internal" dependency set between hello1 >> hello2

Before steps 2, 3, or 4 happens, we must ensure that step 1 has taken place. This means that we are left with 3 steps that can have an interchangeable order and affect the graph view, tree view, and running order of the DAG.

I believe that all of the definitions above should give the running order and graph/tree view specified in Appendix A. This means that steps 2, 3, 4, from the above paragraph can be run in any order and the result will always be the same.

@uranusjr
Copy link
Member

One way I can think of to get rid of all the complexity is to prohibit the task group from being used before the context manager exits, i.e.

with TaskGroup('tg') as taskgroup:
    task1 = ...
    task2 = ...
    task1 >> task2  # Fine.
    start >> taskgroup >> end  # Throws error!
with TaskGroup('tg') as taskgroup:
    task1 = ...
    task2 = ...
    task1 >> task2
start >> taskgroup >> end  # Must do this instead.

This essentially ensures the step 3 happens after step 2, and leaves only steps 3 and 4 to be interchangable. Does

with TaskGroup('tg') as taskgroup:
    task1 = ...
    task2 = ...
start >> taskgroup >> end
task1 >> task2

produce an “expected” graph? If it does, all problems would be solved from what I can tell; if not, this is the only thing we need to fix (aside from implementing logic to prohibit a task gorup to be used before exiting).

This would break some of the existing usages, but I think we might be able to get away with the breaking change because most of the usages that would break does not work very well right now anyway (as shown in this issue), and therefore are unlikely to be widely in use right now.

@ashb
Copy link
Member

ashb commented Jan 18, 2022

That would work, but there are also other problems that we get from not having TaskGroups actually exist in the DAG/dependency chain, so another option is to make tasks be able to depend directly on TaskGroups (I.e. rename downstream_task_ids to downstream_ids)

@lucharo
Copy link

lucharo commented Mar 10, 2022

Hello, I am experiencing a similar issue with nested TaskGroups (TG). This is my current GraphView:
image
The tree view reveals the same:
image

This is not the behaviour I would expect to observe based on how I define my dag. Within ANAH-LOAD and DPI-LOAD there is a TG for each year and for each month and then each month contains several tasks. This is the code where I define the top layer of the nested TGs:

with TaskGroup(f'{platform}-LOAD', dag = dag) as TG:
    year_tasks = []
    for year_list in res:
        with TaskGroup(f'{year_list[0][:4]}', dag = dag) as tg_year:
            ym_tasks = []
            for ym in year_list:
                ym_tasks.append(populate_task_group(ym, gvars))

        year_tasks.append(ym_tasks)

where the function populate_task_group simply returns a TaskGroup object and year_list contains date in ym format (e.g. 201901 for January 2019). Note, the code above is part of a function (populate()) that returns the following:

return TG >> end_email

where end_email is an EmailOperator as you can imagine

Then finally I define my DAG order/sequence as:

reset >> start_populate_email >>  [populate(), populate('ANAH',HadoopCluster.ProdAnaH)]

Hope I've made myself clear, all help is appreciated and please message me if you need more details

Expected behaviour

I would expect the logic/graph view/tree view to be:

cleanrefresh >> start-populate-notification >> ANAH-LOAD >> END-EMAIL-ANAH 
                                            >> DPI-LOAD >> END-EMAIL-DPI

@tony-brookes
Copy link

This one really caught me out the other day as I couldn't see from the dependency lines being drawn in the graph view why my tasks were starting with their dependencies un-met.

It would seem that for task groups to be totally convenient it shouldn't matter when I add the dependency information, the outcome should be the same (as others have posted.). Because that's exactly what would happen when you link two tasks. It doesn't matter when you link them or in what order you create them.

This looks like it might get a fix some time quite soon but if that is not the case perhaps some form of warning logged if tasks are added to groups AFTER dependencies have been added would at least alert people that the behaviour they expect is unlikely to be what they get. :)

Do like task groups though. Make things really simple and also lets me compose the same set of tasks multiple times in the same DAG by adding a different task group around the outside of them and letting the prefix keep them uniquely identifiable, which can be really really handy when you are programatically generating a DAG from meta data about the files you have been given to process. :)

@ashb ashb modified the milestones: Airflow 2.4.0, Airflow 2.4.1 Sep 8, 2022
@o-nikolas o-nikolas removed the affected_version:2.0 Issues Reported for 2.0 label Oct 15, 2022
@ephraimbuddy ephraimbuddy modified the milestones: Airflow 2.4.3, Airflow 2.4.4 Nov 9, 2022
@ephraimbuddy ephraimbuddy modified the milestones: Airflow 2.4.4, Airflow 2.5.0, Airflow 2.5.1 Nov 23, 2022
@eladkal eladkal removed this from the Airflow 2.5.1 milestone Nov 25, 2022
@darambaris
Copy link

darambaris commented Mar 16, 2023

I had the same problem here and the tasks inside of TaskGroup didn't respect the TaskGroup previous dependencies. This issue helped me a lot, thanks!

Ps: I'm using Airflow 2.0.2 from AWS (MWAA)

@jayckaiser
Copy link

I want to add that a fix to this problem is still requested (alongside better documentation of this problem in the first place). I had not realized that chaining the TaskGroup before populating it would result in no dependencies being defined between group-internal tasks and those downstream. This post was so illuminating to what had been hours of frustration as I tried to understand why the grid and graph views misaligned.

@eladkal eladkal added area:core affected_version:2.0 Issues Reported for 2.0 and removed area:UI Related to UI/UX. For Frontend Developers. labels Jul 15, 2023
@eladkal
Copy link
Contributor

eladkal commented Jul 15, 2023

Can someone please confirm if issue still happens on latest Airflow version?

@bbovenzi
Copy link
Contributor

Can someone please confirm if issue still happens on latest Airflow version?

Yes, just tested the original sample dag. Still happens. Unfortunately, it causes the grid view to break.

@eladkal eladkal added affected_version:2.6 Issues Reported for 2.6 and removed affected_version:2.0 Issues Reported for 2.0 labels Jul 27, 2023
Copy link

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There has been several Airflow releases since last activity on this issue. Kindly asking to recheck the report against latest Airflow version and let us know if the issue is reproducible. The issue will be closed in next 30 days if no further activity occurs from the issue author.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests