1
0
mirror of https://github.com/kubernetes-sigs/descheduler.git synced 2026-01-25 20:59:28 +01:00
Files
descheduler/keps/1397-evictions-in-background

KEP-1397: descheduler integration with evacuation API as an alternative to eviction API

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The descheduler eviction policy is built on top of the eviction API. The API currently does not support eviction requests that are not completed right away. Instead, any eviction needs to either succeed or be rejected in response. Nevertheless, there are cases where an eviction request is expected to only initiate eviction. While getting confirmation or rejection of the eviction initiation (or its promise).

The descheduler treats all pods as cattle. Each descheduling loop consists of pods nomination followed by their eviction. When eviction of one pod does not succeed another pod is selected until a limit on the number of evictions is reached or all nominated pods are evicted. The descheduler does not keep track of pods that failed to be evicted. The next descheduling loop repeats the same routine. The order of eviction is not guaranteed. Thus, any subsequent loop can either evict pods that failed to be evicted in the previous loop or completely ignore them if a limit is reached. Thus, if an administrator installs a webhook that intercepts eviction requests while changing the underlying functionality (e.g. eviction in background initiation but returning an error code instead of success), the descheduler can do more harm than good.

Motivation

Kubernetes ecosystem is wide. There are various types of workloads in various environments that may require different eviction policies to be applied. Some pods can be evicted right away, some pods may need more time for clean up. Other pods may need more time than a graceful termination period gives or be able to retry eviction in case a live migration did not succeed.

Goals

  • Enhance Descheduler eviction policy: Allow pods that require eviction without getting evicted right away to get a promise from third part component on handling the eviction without directly invoking the upstream eviction API. Eventually replacing the eviction API with a new evacuation API.
  • No more evictions, only requests for evacuation: the descheduler will no longer guarantee eviction. All the configurable limits per namespace/node/run will change their semantics to a maximal number of evacuation requests per namespace/node/run.
  • New sorting algorithm to prioritize in-progress evictions: Sort the nominated pods to prefer those with eviction already in progress (or its promise) to minimize workload disruptions.
  • New prefilter algorithm to exclude in-progress evictions: Filter out all pods that have an already existing evacuation request present.
  • Measure pending evictions: New metric to keep count of evictions in progress/evacuation requests

Non-Goals

  • Implementation of any customized eviction policy: The Descheduler will only acknowledge existence of pods that require different eviction handling. The actual eviction (e.g. live migration) will be implemented by third parties.

Proposal

This enhancement proposes the implementation of a more informed eviction policy for the Descheduler. Allowing plugins (constructing lists of pods nominated for eviction) to be aware of pods with eviction in progress to avoid evicting more than a configured limit allows. Reducing unnecessary evictions can improve overall workload distribution and save cost of bringing up new pods.

Expected Outcomes

  • Live migration implemented through external admission webhooks are no longer perceived as eviction errors. Currently, no pod with a live migration in background is evicted properly. Thus resulting in eviction of all such pods from all possible nodes until a limit is reached. Which is an undesirable behaviour.
  • Descheduling plugins are more aware of pods with live migration in background. Thus, improving the decision making behind which pods get nominated for eviction.
  • The descheduler will no longer be responsible for direct eviction. Instead, only requesting eviction and letting other components (more informed and following various policies) to perform the actual eviction.

User Stories (Optional)

  • As a cluster admin running a K8s cluster with a descheduler, I want to evict KubeVirt pods that may require live migration of VMs before a pod can be evicted. A VM live migration can take minutes to complete or may require retries.
  • As an end user deploying my application on a K8s cluster with a descheduler, I want to evict pods while performing pod live migration to persist pod's state without storing any necessary data into pod-independent persistent storage.
  • As a developer I want to be able to implement a custom eviction policy to address various company use cases that are not supported by default when running a descheduler instance.
  • As a descheduler plugin developer I want to be aware of evictions in progress to improve pod nomination and thus avoid unnecessary disruptions.
  • As a security professional I want to make sure any live migration adhers to security policies and protects sensitive data.

Annotation vs. evacuation API based eviction

The evacuation API is expected to replace the eviction API. Instead of checking whether an eviction was successful or failed, an evacuation request is created. In case the request fails to be created a pod eviction is skipped as previously. In addition, each cycle will keep a number of evacuation requests created instead of a number of pods evicted.

The annotation based approach is a special case of invoking the eviction API:

Eviction API Evacuation API
Eviction handling Eviction of a pod with descheduler.alpha.kubernetes.io/request-evict-only annotation. Checking the eviction error, return code and response text. Interpreting a specific error code and a presence of a known suffix in the response text. Creation of an evacuation CR for the given pod. Checking the evacuation CR creation error.
Pod nomination and sorting Preferring pods with both descheduler.alpha.kubernetes.io/request-evict-only and descheduler.alpha.kubernetes.io/eviction-in-progress annotations. Preferring pods that have a corresponding evacuation CR present.
Cache reconstruction When a descheduler gets restarted and a new internal caches (with tracked annotations) cleared the descheduler lists all pods and populate the cache with any pod that has both annotation (descheduler.alpha.kubernetes.io/request-evict-only and descheduler.alpha.kubernetes.io/eviction-in-progress) present. In case only the first annotation is present but the eviction request was already created, the update event handler will catch the second annotation addition and the cache gets synced. In the worst case the limit of max number of pods evicted gets exceeded. The descheduler waits until all evacuation CRs are synced.

Given the evacuation API is still a work in progress without any existing implementation, the annotation based eviction can be seen as an v1alpha1 implemenation of this proposal. Integration with the evacuation API as an v1alphaN or beta implementation.

Workflow Description

This section outlines the workflow for enabling the proposed evictions in background that prevent workloads from being evicted right away without following additional eviction policies.

Actors

  • Cluster Administrator: Responsible for configuring descheduling policies and overseeing cluster operations.
  • User: Individuals or services running workloads within the cluster.
  • Descheduler: Component responsible for requesting pod evictions

Workflow Steps

Annotation based evacuation (v1alpha1)
  1. Policy configuration
    1. The cluster administrator configures the descheduler to enable the new functionality.
  2. A user deploys a workload through various nodes
    1. A scheduler distributes the pods based on configured scheduling policies
  3. Eviction nomination
    1. At the beginning of each descheduling cycle the descheduler increases all the internal counters to take into account all pods that are subjects to background eviction. By checking descheduler.alpha.kubernetes.io/eviction-in-progress annotation or confronting the internal caches.
    2. Each descheduling plugin nominates a set of pods to be evicted.
    3. All the nominated pods are sorted based on the eviction-in-progress first priority.
  4. Pod eviction: the descheduler starts evicting nominated pods until a limit is reached
    1. If a pod is not annotated with descheduler.alpha.kubernetes.io/request-evict-only evict the pod normally without additional handling.
    2. If a pod is annotated with descheduler.alpha.kubernetes.io/request-evict-only evict the pod.
      1. If the eviction fails intepret the error as a request for eviction in background.
      2. Error code, resp. response text is checked to distinguish between a genuine error and a confirmation of an eviction in background.
Evacuation API (v1alphaN or beta)
  1. Policy configuration
    1. The cluster administrator configures the descheduler to enable the new functionality.
  2. A user deploys a workload through various nodes
    1. A scheduler distributes the pods based on configured scheduling policies
  3. Eviction nomination
    1. At the beginning of each descheduling cycle the descheduler increases all the internal counters to take into account all pods that are subjects to background eviction. By listing evacuation requests or confronting the internal caches.
    2. Each descheduling plugin nominates a set of pods to be evicted.
    3. All the nominated pods are sorted based on the eviction-in-progress first priority.
  4. Pod eviction: the descheduler creates an evacuation request for each nominated pod until a limit is reached

Notes/Constraints/Caveats (Optional)

  • The upstream eviction API does not currently support evictions in background implemented by external components. For that, there's no community dedicated error code nor a response text for this type of eviction.
  • The eviction in background error code will be temporary mapped to 429 (TooManyRequests).
  • The response text will be searched for strings containing "Eviction triggered" prefix or similar.
  • The descheduler will still treat all the pods with eviction in background as cattle. It will only prioritize these pods to be processed sooner. There may still be unnecessary/sub-optimal evictions.

Risks and Mitigations

  • If an external component responsible for eviction in background does not work properly the descheduler might evict less pods than expected. Leaving pods that require eviction (e.g. long running or violating third party policies) running longer than expected. The descheduler will need to implement a TTL mechanism to abandon eviction requests that take longer than expected.
  • Most of the heavy lifting happens in the external components. The descheduler only acknowledges pods that are evicted in background. All the internal counters take these pods into account. When these pods are not properly evicted the internal counters might limit eviction of other pods that (when evicted) might improve overall cluster resource utilization.
  • This new functionality needs to stay feature gated until the evacuation API is available and stable.
  • When descheduler.alpha.kubernetes.io/request-evict-only annotation is deprecated workload that forgets to remove the annotation will not be evacuated.

Design Details

Implementation Strategies

API changes

  • New category of pods for eviction: Two new annotations are introduced:
    • descheduler.alpha.kubernetes.io/request-evict-only: to signal pods whose eviction is not expected to be completed right away. Instead, an eviction request is expected to be intercepted by an external component which will initiate the eviction process for the pod. E.g. by performing live migration or scheduling the eviction to a more suitable time.
    • descheduler.alpha.kubernetes.io/eviction-in-progress: to mark pods whose eviction was initiated by an external component.
  • New eviction limits: introduction of new MaxNoOfPodsEvacuationsPerNode and MaxNoOfPodsEvacuationsPerNamespace feature gated v1alpha1 fields.
  • New configurable list of eviction message prefixes (introduced later when needed): each third party validating admission webhook can provide different response message. It is not feasible to ask every party to change their messages for a temporary solution until the evacuation API is in place. Thus, temporary introducing a new feature gated evictionMessagePrefixes field ([]string) for a list of known prefixes signaling an eviction in background. Normally, "Eviction triggered" prefix will be acknowledged by default.

v1alpha1

  • Keep track of evictions in progress: A new cache will be introduced to keep track of pods annotated with descheduler.alpha.kubernetes.io/request-evict-only that were requested to be evicted through the upstream eviction API. In case a descheduler gets unexpectedly restarted any pod annotated with descheduler.alpha.kubernetes.io/eviction-in-progress will be added to the cache if not already present. The cache will help with sorting and avoiding a duplicated eviction request of pods that were not annotated with descheduler.alpha.kubernetes.io/eviction-in-progress quickly enough. Each item in the cache will have a TTL attribute to address cases where an eviction request was intercepted by an external component but the component failed to annotate the pod.
  • Implement a new built-in sorting plugin: Extend the descheduling framework with a new sorting plugin that will prefer pods with descheduler.alpha.kubernetes.io/eviction-in-progress annotation or those that are already in the cache of to-be-evicted pods. The plugin will be enabled when the new functionality is enabled (through a feature gate). Each plugin will have option to either take the new sorting into account or not. The sorting plugin can be executed either in pre-filter or pre-eviction phase.
  • Implement a multi-level sorting: Allow plugins to first sort pods based on eviction-in-progress quality and then based on configured sorting plugins. Preferring pods that are already in a process of eviction over other pods is a crucial step for reducing unnecessary disruptions.
  • Internal counters aware of eviction in progress: At the beginning of each descheduling loop all internal counters will take into account pods that are either annotated with descheduler.alpha.kubernetes.io/eviction-in-progress or are present in the cache of to-be-evicted pods.

v1alphaN or beta

  • Replace the current eviction API with the new evacuation API by default. Both descheduler.alpha.kubernetes.io/request-evict-only and descheduler.alpha.kubernetes.io/eviction-in-progress will be deprecated. Existing deployments will need to remove descheduler.alpha.kubernetes.io/request-evict-only annotation to utilize the new evacuation API. Otherwise, the eviction API will be used as a fallback. Workload owners utilizing the new evacuation API are expected to replace both annotations with evacuation.coordination.k8s.io/priority_${EVACUATOR_CLASS}: ${PRIORITY}/${ROLE}. Important: the new annotation have different semantics than the deprecated ones. Users are strongly recommended to review the evacuation API proposal for necessary details.
  • The evacuation API expects the evacuation request to be reconciled during each descheduling cycle. Given each cycle starts from scratch the descheduler needs to first list all evacuation requests and sort the pods accordingly in each plugin. Other components may create an evacuation request as well. The descheduler needs to take this into account. E.g. by creating a snapshot (per cycle, per strategy) to avoid inconsistencies. An evacuation request may exist without a corresponding pod before the evacuation controller garbage collects the request.
  • The descheduler needs to keep a cache of evacuation API requests (CRs) and remove itself from finalizers if the request is no longer needed. E.g. evacuation.coordination.k8s.io/instigator_descheduler.sigs.k8s.io as a finalizer. The removal is allowed only when the evacuation request has its cancellation policy set Allow.
  • Read the eviction status from the evacuation CR status.

Code changes

The code responsible for invoking the eviction API is located under sigs.k8s.io/descheduler/pkg/descheduler/evictions under private evictPod function. The function can be generalized into a generic interface:

type PodEvacuator interface {
    EvacutePod(ctx context.Context, pod *v1.Pod) error
}

to provide the descheduling framework a generic way of evicting/evacuating pods. The new interface can be used for implementing various testing scenarios. Or turned into a plugin if needed.

Metrics

A new metric pod_evacuations for counting the number of evacuated pods is introduced. The descheduler will not observe whether a pod was ultimately evicted. Only provide a summary about how many pods were requested to be evacuated. If needed, another metric pod_evacuations_in_progress for counting the number of evacuations that are reconciled can be introduced.

Open Questions [optional]

This is where to call out areas of the design that require closure before deciding to implement the design.

  • The evacuation API proposal mentions the evacuation instigator should remove its intent when the evacuation is no longer necessary. Should each descheduling cycle reset all the eviction requests or wait until some of the eviction requests were completed to free a "bucket" for new evictions? Answer: The first implementation will not reset any evacuation request. The descheduler will account for existing requests and update the internal counters accordingally. In the future a mechanism for deleting too old evacuation requests can be introduced. I.e. based on a new --max-evacuation-request-ttl option.
  • The evacuation API proposal mentions more than a one entity can request an eviction. Should the descheduler take these into account as well? What if another entity decides to evict a pod that is of a low priority from the descheduler's point of view? Answer: The first implementation will consider all pods with an existing evacuation request as "already getting evacuated". Independent of the pod priorities.
  • With evacuation requests a plugin does not evict a pod right away. Meaning, other plugins will need to check whether a pod has a corresponding evacuation request present. If so, take each such pod into account when constructing a list of nominated pods. For that it might be easier to see the current evictions as marking pods for future evacuation and perform the actual eviction/evacuation at the end of each descheduling cycle instead of evicting pods from within plugins. Answer: This is a breaking change for existing plugins. The first implementation will filter out pods that has already an existing evacuation request present. Keeping the backward compatibility with the current implementation. Later, a new proposal can be created to discuss the possibility of evicting pods at the end of each descheduling cycle (after both Deschedule and Balance extension points of all plugins were invoked).

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests

Current coverages:

  • sigs.k8s.io/descheduler/pkg/descheduler/evictions: 2024-05-13 - 4.3
  • sigs.k8s.io/descheduler/pkg/framework/plugins/nodeutilization: 2024-05-13 - 48.7
  • sigs.k8s.io/descheduler/pkg/framework/plugins/podlifetime: 2024-05-13 - 18.2
  • sigs.k8s.io/descheduler/pkg/framework/plugins/removeduplicates: 2024-05-13 - 32.3
  • sigs.k8s.io/descheduler/pkg/framework/plugins/removefailedpods: 2024-05-13 - 21.6
  • sigs.k8s.io/descheduler/pkg/framework/plugins/removepodshavingtoomanyrestarts: 2024-05-13 - 22.0
  • sigs.k8s.io/descheduler/pkg/framework/plugins/removepodsviolatinginterpodantiaffinity: 2024-05-13 - 17.4
  • sigs.k8s.io/descheduler/pkg/framework/plugins/removepodsviolatingnodeaffinity: 2024-05-13 - 18.0
  • sigs.k8s.io/descheduler/pkg/framework/plugins/removepodsviolatingnodetaints: 2024-05-13 - 18.2
  • sigs.k8s.io/descheduler/pkg/framework/plugins/removepodsviolatingtopologyspreadconstraint: 2024-05-13 - 43.0

These tests will be added:

  • New tests will be added under sigs.k8s.io/descheduler/pkg/framework/plugins for each mentioned plugin to test evacuation (replacement of the eviction API).
  • New tests will be added under sigs.k8s.io/descheduler/pkg/descheduler/evictions to test the new evacuation (annotation based, API based) implementation.
  • New tests under sigs.k8s.io/descheduler/pkg/descheduler will be created for:
    • the new internal sorting and filtering algorithms
    • validating the evacuation limits are respected with already existing evacuation requests from a previous descheduling cycle
Integration tests

No integration tests available

e2e tests

These tests will be added:

  • EvacuationInBackground for testing an evacuation request/annotated pod eviction:
    1. Setup a validating admission webhook that will intercept an eviction request, start a pod eviction/deletion in e.g. T=5s but rejects the eviction request.
    2. Create an evacuation request for a pod/evict a pod with descheduler.alpha.kubernetes.io/request-evict-only annotation.
    3. Observe the pod is not removed right away and wait for the configured delay T before checking a pod is removed.
    4. Check a pod was removed after T, not before.
  • EvacuationLimitReached for testing only a limited amount of pods is evacuated with already existing evacuation requests:
    1. Create evacuation requests (e.g. by annotating pods with descheduler.alpha.kubernetes.io/eviction-in-progress or creating evacuation requests).
    2. Run the descheduler and observe only an expected amount of pods is requested to be evicted/evacuated.
  • EvacuationTTL for testing an evacuation request (with a single owner in case of the evacuation API) is canceled after a configured TTL:
    1. Setup a validating admission webhook that will intercept an eviction request, start a pod eviction initiation but rejects the eviction request.
    2. Wait for e.g. 10s (TTL).
    3. At the beginning of the next descheduling cycle the descheduler checks for all evacuation requests/pods with descheduler.alpha.kubernetes.io/request-evict-only annotation and remove finalizer/entry from the cache.
    4. During the descheduling cycle the descheduler creates eviction/evacuation requests.
    5. Check there's an expected number of eviction/evacuation requests.

Graduation Criteria

Alpha

  • Feature implemented behind a feature flag
  • Initial e2e tests completed and enabled
  • Annotation based evictions
  • Workloads do not set descheduler.alpha.kubernetes.io/request-evict-only annotation when deprecated

Beta

  • Gather feedback from developers and surveys
  • Evacuation API based evictions
  • Identifying corner cases that are not addressed by the evacuation API
  • descheduler.alpha.kubernetes.io/request-evict-only annotation is deprecated
  • Workloads do not set descheduler.alpha.kubernetes.io/request-evict-only annotation

GA

  • At least 3 examples of real-world usage
  • Many adoptions of the new feature
  • No negative feedback
  • No bug issues reported

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name:
    • Components depending on the feature gate:
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?

Eviction of pods can be delayed. Users need to make sure they have a reliable external components that can evict a pod based on the corresponding evacuation request.

For the annotation based eviction only pods annotated with descheduler.alpha.kubernetes.io/request-evict-only are effected. The eviction itself is still performed through the eviction API. Only the error (code) is interpreted differently. Pods with both annotations present are not evicted more than once.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes.

What happens if we reenable the feature if it was previously rolled back?

The same as when the feature is enabled the first time.

Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

When the evacuation API is used new evacuation create/update/list/watch requests are sent.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Very likely. When the evacuation API is used the eviction itself very depends on existing external controllers facilitating the eviction.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

  • 2024-04-14: Initial draft KEP

Drawbacks

Alternatives

Infrastructure Needed (Optional)