-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-3041: Feature, NodeConformance, NodeFeature, and Feature Gate labels cleanup #3042
KEP-3041: Feature, NodeConformance, NodeFeature, and Feature Gate labels cleanup #3042
Conversation
0d600ac
to
3853fa2
Compare
/sig node |
/cc |
247d4d8
to
4010139
Compare
Today this was presented at SIG Testing meeting. Here are comments: [Sergey] Proposed changes to test labels: #3042
|
3d92433
to
8ef92d9
Compare
/assign @endocrimes Can you please take a look at the update (tag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add an Environment section, ideally that also clarifies how they are to be documented. That way we can easily transfer this to developer documentation and be strict about the labeling during reviews.
|
||
The label `NodeSpecialFeature:` was introduced in [this document](https://docs.google.com/document/d/1BdNVUGtYO6NDx10x_fueRh_DLT-SVdlPC_SsXjYCHOE/edit?usp=sharing), | ||
but was never consistently used. Today we only have a couple uses of it that | ||
might be cleaned up in favor of using `Feature:` tag when applicable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NodeSpecialFeature is basically a combination Feature + Environment tag as originally defined so 👍
/approve Current state is unsustainable and this moves towards a more coherent state |
#3042 (comment) but generally LGTM |
> * `[FeatureGate:.+]`: If a test only works when the certain feature gate is | ||
> enabled it receives a `[FeatureGate:.+]` label. `[FeatureGate:.+]` tests | ||
> must also be marked with the status of this feature gate: `[Alpha]`, | ||
> `[Beta]`, `[Stable]`, `[Deprecated]`. This label helps to skip tests that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a few problems with requiring these `[Alpha]/[Beta]/[Stable]/[Deprecated]' tags.
The first is conceptual: if a test depends on two different features, what tag should be used? All that apply, e.g. [FeatureGate:foo] [Beta] [FeatureGate:bar] [Stable]
?
In the past, annotations where removed once a feature graduated to stable, which then also automatically enabled the test in the core jobs. Is the plan now to keep the [FeatureGate]
annotation and instead adapt job definitions when a feature graduates to GA? I can see that it would be valuable to keep the [FeatureGate]
annotation because it provides information that otherwise would get lost. For example, one might want to run all tests for a specific feature even after it has graduated. I am less sure that [Stable]
is useful.
My biggest concern is that the [Alpha]/[Beta]/[Stable]/[Deprecated]
annotations are redundant and that maintaining them is going to be error-prone.
Instead of adding them manually to the test description, can we do that with helper functions? Here's how that could look like:
Current solution:
ginkgo.It("foo bar [FeatureGate:foo] [Beta]", ...)
Revised approach:
framework.It("foo bar", framework.WithFeatureGate("foo"), ...)
framework.It
then acts as wrapper around ginkgo.It
. When one of the optional parameters is the result of framework.WithFeatureGate
, it does several things:
- Checks that the feature is registered in a feature gate. That feature gate could be seeded with the Kubernetes features, but additional features could be added.
- Retrieves the stability level of that feature from the feature gate.
- Extends the test description with, in this example,
[FeatureGate:foo] [Beta]
. - Add
ginkgo.Label("FeatureGate:foo")
andginkgo.Label("Beta")
when callingginkgo.It
.
Another advantage is that typos in the feature name will be detected when registering the tests, i.e. when the suite starts. When embedded in the text we have to rely on code reviews.
Similar helper functions for [Environment]
and [Feature]
could also be added. Then it becomes possible to select tests with:
--label-filter [expression]
If set, ginkgo will only run specs with labels that match the label-filter.
The passed-in expression can include boolean operations (!, &&, ||, ','),
groupings via '()', and regular expressions '/regexp/'. e.g. '(cat || dog)
&& !fruit'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first is conceptual: if a test depends on two different features, what tag should be used? All that apply, e.g.
[FeatureGate:foo] [Beta] [FeatureGate:bar] [Stable]
?
we should not have features that depends on features that are not stable, right?
I can see that it would be valuable to keep the [FeatureGate] annotation because it provides information that otherwise would get lost
I see that more as part of the [Feature:]
tag, feature-gates should be part of the system when they graduate, if the FeatureGate enables a new Feature we can use this tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By requiring that all features, feature gates and environments must be registered in the framework, we can add command line flags that list them. Might be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, agree, we can do much better and we should aim for that, but at this point I think we should fix current state and have the right variables, and then iterate on those.
It seems this KEP allows us to define those variables with Feature, FeatureGate and Environment ... if we agree on those, we can start working later no how to "enforce" those and how to improve the automation , WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/kubernetes/kubernetes/pull/112894/files for a prototype implementation of this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not have features that depends on features that are not stable, right?
Yes, but a test itself might depend on two independent features. It's unlikely, I just wanted to call it out.
I see that more as part of the [Feature:] tag, feature-gates should be part of the system when they graduate, if the FeatureGate enables a new Feature we can use this tag
So before a feature is GA, tests get annotated with [FeatureGate:foo] [Beta]
, then when they graduate, we replace that with [Feature:foo] [Stable]
? That seems a bit inconsistent to me.
at this point I think we should fix current state and have the right variables, and then iterate on those
I'm worried about code churn and reviewer fatigue.If we do this in two steps, we have to touch all test definitions twice, first to clean up the text, then later to turn the text into function calls. Reviewing the first step will be harder because typos have to be caught by reviewers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So before a feature is GA, tests get annotated with [FeatureGate:foo] [Beta], then when they graduate, we replace that with [Feature:foo] [Stable]? That seems a bit inconsistent to me.
I think that this is one of the problems we are having, FeatureGates and Feature are different things, the problem that I see is that sometimes FeatureGates are used to gate Features, but not all FeatureGates are Features
I'm worried about code churn and reviewer fatigue.If we do this in two steps, we have to touch all test definitions twice, first to clean up the text, then later to turn the text into function calls. Reviewing the first step will be harder because typos have to be caught by reviewers.
I think that is a fantastic idea to use the functions as you described and keep it in sync with the code ... but I don't want to impose it to the KEP author 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can help with using those functions, both by updating the KEP and providing the implementation.
But let's clarify first how to handle graduated features. That has implications that are unclear, both with and without those helper functions.
|
||
Proposed change: | ||
|
||
> * `[Feature:.+]`: If a test validates functionality that may only work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought: Feature
and FeatureGate
are easy to confuse. Kubernetes itself didn't get this right: the featuregate
package has an implementation of one [Mutable]FeatureGate
that stores the state of different Feature
s, but everywhere else we call those Feature
strings "feature gates".
Can we perhaps replace the term Feature
? How about Functionality
?
With Functionality
and FeatureGate
it is more obvious that Functionality
is something else and "feature gate" is pretty well defined in Kubernetes (when ignoring the slight misnaming).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting that renaming Feature has massive churn requirement on CI configs upstream and downstream. The default state of most jobs is to opt out of features, to only test stably supported portable things. With specific jobs running unstable or not necessarily available things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the back compat was mostly the reason to keep the Feature
naming. Alternative may be Capability
, but again, it will require a lot of renaming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Perhaps add that rationale to the KEP?
/cc |
Survey of current test tags: https://gist.github.com/spiffxp/5e39919df9eaa3ad7d26ac0e69c5d4e5#file-tags-txt |
It's taking me some time to think through this deeply but my gut reaction is this seems like it introduces more complexity and potential for confusion |
With beta features disabled by default, at least for subset of feature gates. it will be more and more complicated to author test definitions and keep them up to date. Conflating all meanings in Feature tag causing a lot of confusion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I still can't tell what concrete problems this is solving, and so I'm not sure the additional complexity cost is worth it
There are parts of this I totally agree with:
- dropping NodeFeature, NodeSpecialFeature
- adding Alpha, Beta, Deprecated
- testing node functionality with FeatureGates disabled
But I'm just not sold on FeatureGate
and would prefer better names for Environment
and NodeConformance
I'll admit I am still spinning back up here, so if everyone else here disagrees with my takes I will defer to them.
Kubernetes certification program. The similarity in the name may be confusing, | ||
but this confusion will not be addressed in this KEP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"this confusion will not be addressed in this KEP"
If not now, when? If you're going to formalize this label, I think now is the appropriate time to choose a better name to avoid the confusion
The subtle differences as I see them:
[NodeConformance]
tests "enabled out of the box functionality [... that] may still require an additional environment set up"... can you give some examples here?[Beta]
features are allowed since they're usually enabled by default (though beta APIs are usually disabled by default)
I get the need to classify and test all "default" Node level functionality, I would just prefer to avoid the word Conformance
given these caveats. [NodeDefault]
, [NodeBase]
, even just [Node]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I just wanted to postpone solving this naming problem =)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plus changing name has some back compat issues I didn't want to explore too much
- There is no way to disable tests for a specific feature gate | ||
(example where it’s needed: https://github.com/kubernetes/kubernetes/issues/99854). | ||
It is a "feature" guarded by feature gate, but not marked as "feature". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what problem you are trying to solve here, so let's try a strawman.
What would be wrong with a test that used something like [Feature:DisabledExecProbeTimeout]
and ran against a cluster that had the ExecProbeTimeout
feature gate turned off (and would fail if the gate was on)?
To me this fits the "a test has non-default requirements to run" definition of [Feature:]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue with this is there are examples when Feature tag just stayed after feature gate was gone. Having explicit difference between not fully implemented and not running everywhere makes it easier to keep consistency in test definitions. Some treat Feature
tag as "I implemented this feature and called it like this. Kubernetes has this feature so I keep the tag here".
- There are no tests validating that k8s works with ALL beta feature gates disabled. | ||
Creating such test would be hard with today's test labeling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this problem either.
From my perspective we already do this today for [Conformance]
tests by setting the feature gates AllAlpha
and AllBeta
to false
:
- https://testgrid.k8s.io/sig-release-master-blocking#conformance-ga-only
- https://github.com/kubernetes/test-infra/blob/13e0036ab3dc00813758b2be3699056add81b478/config/jobs/kubernetes/sig-testing/conformance-e2e.yaml#L143-L184
- https://github.com/kubernetes-sigs/kind/blob/7c843a264b854942f9c18211d82d1bf79d4ca3ed/hack/ci/e2e-k8s.sh#L132
So I'm trying to understand how this is a labeling issue. What other subset of tests would you want to be able to run against a cluster setup this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with that, but we also have things like this https://github.com/kubernetes/test-infra/blob/4d7b0d15c5b7551ecc9d0a23ddf1a5a9c696de1f/config/jobs/kubernetes/sig-gcp/sig-gcp-gce-config.yaml#L209-L232
- --test_args=--ginkgo.focus=\[Feature:(GRPCContainerProbe|InPlacePodVerticalScaling|ProbeTerminationGracePeriod|APIServerTracing|APISelfSubjectReview|StorageVersionAPI|PodPreset|StatefulSetMinReadySeconds|ProxyTerminatingEndpoints|NodeOutOfServiceVolumeDetach|RetroactiveDefaultStorageClass)\]|Networking --ginkgo.skip=\[Feature:(SCTPConnectivity|Volumes|Networking-Performance)\]|IPv6|csi-hostpath-v0 --minStartupPods=8
and that list has not been updated for a long time 🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but we also have things like this
yeah exactly replacing all that with --ginkgo.focus=\[Alpha\]
makes complete sense to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, it seems this part has general agreement
- Some tests are degraded in how `[Feature]` and `[NodeFeature]` labels are applied | ||
(for example: https://github.com/kubernetes/kubernetes/pull/105921). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what degraded means in this context?
I interpret the example as "working as intended".. you removed the [Feature:]
tag because the functionality is now default / enabled out of the box aka the test no longer "has non-default requirements to run or targets some non-core functionality". You can still include/exclude these tests by selecting the text in the Describe
, e.g. Hostname of Pod
For example, https://github.com/kubernetes/kubernetes/pull/104803 is not marked | ||
as `Feature`, even though it depends on the environment - if `test-handler` | ||
is not installed, test will not succeed. Similar thing to AppArmor tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm missing context here... since it has "non-default requirements" why is it not marked as a [Feature:]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because things are pre-installed on our test machines in OSS. And we were filtering them out in other environments based on name.
Split the special environment and feature development labels: | ||
|
||
- Clarify the semantic of a `[Feature:]` label. | ||
- Introduce the `[FeatureGate:]` label with the stage (`[Alpha]`, `[Beta]`, etc.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what value [FeatureGate:]
adds that makes it worth being distinct from [Feature:]
.
It seems like it's going add more boilerplate ([Feature:foo]
tests become [Feature:foo][FeatureGate:foo]
tests)
Meanwhile I'm having a tough time coming up with examples of why I would want to test [FeatureGate:foo]
independent of [Feature:foo]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separately, [Alpha]
, [Beta]
and [Deprecated]
seem like worthwhile additions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@spiffxp the problem i see with [Feature:]
is that is misleading for developers, see current status:
1 [Feature:CloudProvider][Disruptive]
1 [Feature:ClusterSizeAutoscalingScaleUp]
1 [Feature:CustomMetricsAutoscaling]
1 [Feature:GPUDevicePlugin]
1 [Feature:HPA]
1 [Feature:NEG]
1 [Feature:NodeOutOfServiceVolumeDetach]
1 [Feature:PodGarbageCollector]
1 [Feature:StatefulSet]
1 [Feature:WindowsHostProcessContainers]
2 [Feature:HPA]
2 [Feature:NoSNAT]
2 [Feature:Windows]
3 [Feature:Flexvolumes]
3 [Feature:vsphere]
3 [Feature:Windows]
4 [Feature:NetworkPolicy]
6 [Feature:Windows]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KEP suggests to start implementing by adding [FeatureGate:Foo]
and stage labels to tests. Then clean up other things. If we want to split addition and later clean up, I'm happy with this too
> * `[Environment:.+]`: If test requires non-standard environment | ||
> (different from standard OSS test machines) to run it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sold on [Environment]
because it feels way too generic, compared the examples you're giving which are all about a given machine's configuration. Something like [NodeConfig:]
instead?
I think I'm also trying to understand what sets of tests you would want to group together with these that you cannot today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tag is mostly to filter tests out. Like run everything that doesn't require non-standard environment. Instead of filtering out [Feature:Performance]
have `Serial` or `Disruptive` labels as well. Out of all tests that has `Feature` | ||
label without `Serial` or `Disruptive`, most of them just degraded, | ||
and don’t actually need the `Feature` label any longer. Going forward, features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we introduced the [Alpha]
and [Beta]
labels, removing [Beta]
from a feature would likely be the time to remove the [Feature:]
tag entirely, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that we are not using the tag correctly, but if fFeature == FeatureGate I agree that after betas it has to be removed entirely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what @aojea said. Since we mixing up feature and featuregate, we had examples when tests keep being called Feature even though feature gate was not existing any longer and the functionality is almost a conformance. Explicit FeatureGate thing will ensure that we will remove the FeatureGate from that test and decide whether it's optional feature or conformance at this point
@spiffxp raised a good points, maybe we should define more clearly what do we want and where do we agree? I think one of the problems we have now is the lack of discipline tagging and how easy it goes out of sync, I like Patrick idea about having the feature gate state associated to the code, so developers don't have to check if the need to update the e2e tag or not. The more automation we can add here I think that is the better for everybody. The other fact is that the
|
What is the best way to proceed? Can we take parts of this KEP for 1.26? Or approve it directionally and polish details with the actual PRs to modify test tags definitions? I think when we will start applying tags we may find something that needs adjustment. |
Co-authored-by: Aaron Crickenberger <[email protected]>
Aaron has seniority 😄 |
@spiffxp has seniority - making this a github mention |
Like I said in my review comment, the parts I totally agree with moving forward on are:
The direction I'm not clear on is splitting Feature/FeatureGate, I'm continuing to mull this over. I agree Patrick's idea of automating the tagging adds value. I'm available to discuss this elsewhere if you'd like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
/lgtm
I'm available to discuss this elsewhere if you'd like.
We did this. I learned this was discussed in sig arch about a year ago
I'm fine approving directionally and iterating. Like, I'm not happy with the names, and I'm still not convinced about FeatureGate, but others seem fine with it. I guess I like the ideas of being more explicit about the required capabilities/configuration of the cluster (or nodes) under test, and automating some of the labeling.
But, we will eventually reach a limit where we can't stuff this all in the test name. Leaning into whatever ginkgo v2 can offer us may be necessary.
Let's make sure this is kept up to date as we iterate.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aojea, dchen1107, SergeyKanzhelev, spiffxp The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I'd like to see specific examples that explain when to use Let's also finish the helper functions from kubernetes/kubernetes#112894 and use those when actually touching test definitions. |
KEP: #3041
Feature, NodeConformance, NodeFeature, and Feature Gate labels cleanup
The document https://docs.google.com/document/d/19HqSyrS-4pyubqTvQV0hJKt_97nbCSt_aD0soL-RGGE/edit#heading=h.veqp9g4ihszu was discussed in a few forums and it was suggested to write this as a KEP for awareness.