1
0
mirror of https://github.com/kubernetes-sigs/descheduler.git synced 2026-01-25 20:59:28 +01:00

Address review comments

This commit is contained in:
Jan Chaloupka
2024-05-30 13:08:35 +02:00
parent b1c6e24f93
commit a889e57768

View File

@@ -264,7 +264,7 @@ The [evacuation API](https://github.com/kubernetes/enhancements/pull/4565) is ex
to replace the [eviction API](https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/).
Instead of checking whether an eviction was successful or failed, an evacuation request is created.
In case the request fails to be created a pod eviction is skipped as previously.
In addition, each cycle will keep a number of evacution requests created
In addition, each cycle will keep a number of evacuation requests created
instead of a number of pods evicted.
The annotation based approach is a special case of invoking the eviction API:
@@ -273,7 +273,7 @@ The annotation based approach is a special case of invoking the eviction API:
|--|--|--|
|**Eviction handling**|Eviction of a pod with `descheduler.alpha.kubernetes.io/request-evict-only` annotation. Checking the eviction error, return code and response text. Interpreting a specific error code and a presence of a known suffix in the response text. |Creation of an evacuation CR for the given pod. Checking the evacuation CR creation error.|
|**Pod nomination and sorting**|Preferring pods with both `descheduler.alpha.kubernetes.io/request-evict-only` and `descheduler.alpha.kubernetes.io/eviction-in-progress` annotations.|Preferring pods that have a corresponding evacuation CR present.|
|**Cache reconstruction**|When a descheduler gets restarted and a new internal caches (with tracked annotations) cleared the descheduler lists all pods and populate the cache with any pod that has both annotation (`descheduler.alpha.kubernetes.io/request-evict-only` and `descheduler.alpha.kubernetes.io/eviction-in-progress`) present. In case only the first annotation is present but the eviction request was already created, the update event handler will catch the second annotation addition and the cache gets synced. In the worst case the limit of max number of pods evicted gets exceeded.|The descheduler waits until all evacution CRs are synced.|
|**Cache reconstruction**|When a descheduler gets restarted and a new internal caches (with tracked annotations) cleared the descheduler lists all pods and populate the cache with any pod that has both annotation (`descheduler.alpha.kubernetes.io/request-evict-only` and `descheduler.alpha.kubernetes.io/eviction-in-progress`) present. In case only the first annotation is present but the eviction request was already created, the update event handler will catch the second annotation addition and the cache gets synced. In the worst case the limit of max number of pods evicted gets exceeded.|The descheduler waits until all evacuation CRs are synced.|
Given the evacuation API is still a work in progress without any
existing implementation, the annotation based eviction can be seen
@@ -295,6 +295,8 @@ without following additional eviction policies.
#### Workflow Steps
##### Annotation based evacuation (v1alpha1)
1. Policy configuration
1. The cluster administrator configures the descheduler to enable the new functionality.
2. A user deploys a workload through various nodes
@@ -302,19 +304,31 @@ without following additional eviction policies.
3. Eviction nomination
1. At the beginning of each descheduling cycle the descheduler increases
all the internal counters to take into account all pods that are subjects
to background eviction. Either by checking `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation,
list of evacuation requests or confronting the internal caches.
to background eviction. By checking `descheduler.alpha.kubernetes.io/eviction-in-progress`
annotation or confronting the internal caches.
2. Each descheduling plugin nominates a set of pods to be evicted.
3. All the nominated pods are sorted based on the eviction-in-progress first priority.
4. Pod eviction: the descheduler starts evicting nominated pods until a limit is reached
1. If an annotation based eviction is enabled:
1. If a pod is not annotated with `descheduler.alpha.kubernetes.io/request-evict-only` evict
the pod normally without additional handling.
2. If a pod is annotated with `descheduler.alpha.kubernetes.io/request-evict-only` evict the pod.
1. If the eviction fails intepret the error as a request for eviction in background.
2. Error code, resp. response text is checked to distinguish between
a genuine error and a confirmation of an eviction in background.
2. If the evacuation API is utilized create an evacution request
1. If a pod is not annotated with `descheduler.alpha.kubernetes.io/request-evict-only` evict
the pod normally without additional handling.
2. If a pod is annotated with `descheduler.alpha.kubernetes.io/request-evict-only` evict the pod.
1. If the eviction fails intepret the error as a request for eviction in background.
2. Error code, resp. response text is checked to distinguish between
a genuine error and a confirmation of an eviction in background.
##### Evacuation API (v1alphaN or beta)
1. Policy configuration
1. The cluster administrator configures the descheduler to enable the new functionality.
2. A user deploys a workload through various nodes
1. A scheduler distributes the pods based on configured scheduling policies
3. Eviction nomination
1. At the beginning of each descheduling cycle the descheduler increases
all the internal counters to take into account all pods that are subjects
to background eviction. By listing evacuation requests or confronting the internal caches.
2. Each descheduling plugin nominates a set of pods to be evicted.
3. All the nominated pods are sorted based on the eviction-in-progress first priority.
4. Pod eviction: the descheduler creates an evacuation request for each nominated pod until a limit is reached
### Notes/Constraints/Caveats (Optional)
@@ -344,6 +358,8 @@ without following additional eviction policies.
resource utilization.
- This new functionality needs to stay feature gated until
the evacuation API is available and stable.
- When `descheduler.alpha.kubernetes.io/request-evict-only` annotation is deprecated
workload that forgets to remove the annotation will not be evacuated.
## Design Details
@@ -406,6 +422,10 @@ without following additional eviction policies.
Existing deployments will need to remove `descheduler.alpha.kubernetes.io/request-evict-only`
annotation to utilize the new evacuation API.
Otherwise, the eviction API will be used as a fallback.
Workload owners utilizing the new evacuation API are expected to replace both annotations
with [evacuation.coordination.k8s.io/priority_${EVACUATOR_CLASS}: ${PRIORITY}/${ROLE}](https://github.com/kubernetes/enhancements/blob/2d7dbab6c95575d68395c30fa789f87d9a2c3c01/keps/sig-apps/4563-evacuation-api/README.md#evacuator).
Important: the new annotation have different semantics than the deprecated ones.
Users are strongly recommended to review the evacuation API proposal for necessary details.
- The evacuation API expects the evacuation request to be reconciled during each descheduling cycle.
Given each cycle starts from scratch the descheduler needs to first list
all evacuation requests and sort the pods accordingly in each plugin.
@@ -414,9 +434,10 @@ without following additional eviction policies.
E.g. by creating a snapshot (per cycle, per strategy) to avoid inconsistencies.
An evacuation request may exist without a corresponding pod
before the evacuation controller garbage collects the request.
- The descheduler needs to keep a cache of evacution API requests (CRs)
- The descheduler needs to keep a cache of evacuation API requests (CRs)
and remove itself from finalizers if the request is no longer needed.
E.g. `evacuation.coordination.k8s.io/instigator_descheduler.sigs.k8s.io` as a finalizer.
The removal is allowed only when the evacuation request has its cancellation policy set `Allow`.
- Read the eviction status from the [evacuation CR status](https://github.com/kubernetes/enhancements/pull/4565).
#### Code changes
@@ -442,14 +463,14 @@ A new metric `pod_evacuations` for counting the number of evacuated pods is intr
The descheduler will not observe whether a pod was ultimately evicted.
Only provide a summary about how many pods were requested to be evacuated.
If needed, another metric `pod_evacuations_in_progress` for counting
the number of evacutions that are reconciled can be introduced.
the number of evacuations that are reconciled can be introduced.
### Open Questions [optional]
This is where to call out areas of the design that require closure before deciding
to implement the design.
* The evacution API proposal mentions the evacuation instigator should remove
* The evacuation API proposal mentions the evacuation instigator should remove
its intent when the evacuation is no longer necessary.
Should each descheduling cycle reset all the eviction requests
or wait until some of the eviction requests were completed
@@ -458,18 +479,18 @@ to implement the design.
The descheduler will account for existing requests and update the internal counters accordingally.
In the future a mechanism for deleting too old evacuation requests can be introduced.
I.e. based on a new `--max-evacuation-request-ttl` option.
* The evacution API proposal mentions more than a one entity can request an eviction.
* The evacuation API proposal mentions more than a one entity can request an eviction.
Should the descheduler take these into account as well?
What if another entity decides to evict a pod that is of a low priority
from the descheduler's point of view?
**Answer**: The first implementation will consider all pods with an existing
evacuation request as "already getting evacuated". Independent of the pod priorities.
* With evacution requests a plugin does not evict a pod right away.
* With evacuation requests a plugin does not evict a pod right away.
Meaning, other plugins will need to check whether a pod has a corresponding
evacuation request present.
If so, take each such pod into account when constructing a list of nominated pods.
For that it might be easier to see the current evictions as marking pods
for future evacuation and perform the actual eviction/evacution
for future evacuation and perform the actual eviction/evacuation
at the end of each descheduling cycle instead of evicting pods from within plugins.
**Answer**: This is a breaking change for existing plugins.
The first implementation will filter out pods that has already
@@ -538,9 +559,9 @@ Current coverages:
These tests will be added:
- New tests will be added under `sigs.k8s.io/descheduler/pkg/framework/plugins` for each
mentioned plugin to test evacution (replacement of the eviction API).
mentioned plugin to test evacuation (replacement of the eviction API).
- New tests will be added under `sigs.k8s.io/descheduler/pkg/descheduler/evictions` to test
the new evacution (annotation based, API based) implementation.
the new evacuation (annotation based, API based) implementation.
- New tests under `sigs.k8s.io/descheduler/pkg/descheduler` will be created for:
- the new internal sorting and filtering algorithms
- validating the evacuation limits are respected with already existing
@@ -600,8 +621,8 @@ These tests will be added:
3. At the beginning of the next descheduling cycle the descheduler checks for all
evacuation requests/pods with `descheduler.alpha.kubernetes.io/request-evict-only` annotation
and remove finalizer/entry from the cache.
4. During the descheduling cycle the descheduler creates eviction/evacution requests.
5. Check there's an expected number of eviction/evacution requests.
4. During the descheduling cycle the descheduler creates eviction/evacuation requests.
5. Check there's an expected number of eviction/evacuation requests.
### Graduation Criteria
@@ -672,12 +693,16 @@ in back-to-back releases.
- Feature implemented behind a feature flag
- Initial e2e tests completed and enabled
- Annotation based evictions
- Workloads do not set `descheduler.alpha.kubernetes.io/request-evict-only` annotation
when deprecated
#### Beta
- Gather feedback from developers and surveys
- Evacuation API based evictions
- Identifying corner cases that are not addressed by the evacuation API
- `descheduler.alpha.kubernetes.io/request-evict-only` annotation is deprecated
- Workloads do not set `descheduler.alpha.kubernetes.io/request-evict-only` annotation
#### GA
@@ -775,7 +800,7 @@ automations, so be extremely careful here.
-->
Eviction of pods can be delayed. Users need to make sure they have a reliable external
components that can evict a pod based on the corresponding evacution request.
components that can evict a pod based on the corresponding evacuation request.
For the annotation based eviction only pods annotated
with `descheduler.alpha.kubernetes.io/request-evict-only` are effected.