Pods that don't pass the nodeFit condition currently log an
unsuppressible error message. This changes the log level to info,
as it's a normal operating condition.
Signed-off-by: Antoine Deschênes <antoine.deschenes@linux.com>
* Add handling for node eligibility
* Make tests buildable
* Update topologyspreadconstraint.go
* Updated test cases failing
* squashed changes for test case addition
corrected function name
refactored duplicate TopoConstraint check logic
Added more test cases for testing node eligibility scenario
Added 5 test cases for testing scenarios related to node eligibility
* topologySpreadConstraints e2e: `nodeTaintsPolicy` and `nodeAffinityPolicy` constraints
---------
Co-authored-by: Marc Power <marcpow@microsoft.com>
Co-authored-by: nitindagar0 <81955199+nitindagar0@users.noreply.github.com>
* feat: Implement preferredDuringSchedulingIgnoredDuringExecution for RemovePodsViolatingNodeAffinity
Now, the descheduler can detect and evict pods that are not optimally
allocated according to the "preferred..." node affinity. It only evicts
a pod if it can be scheduled on a node that scores higher in terms of
preferred node affinity than the current one.
This can be activated by enabling the RemovePodsViolatingNodeAffinity
plugin and passing "preferredDuringSchedulingIgnoredDuringExecution" in
the args.
For example, imagine we have a pod that prefers nodes with label "key1:
value1" with a weight of 10. If this pod is scheduled on a node that
doesn't have "key1: value1" as label but there's another node that has
this label and where this pod can potentially run, then the descheduler
will evict the pod.
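A minimal policy sketch of what enabling this might look like, assuming the v1alpha1 policy format and the existing `nodeAffinityType` parameter (the values shown here are illustrative, not taken from this commit):
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "preferredDuringSchedulingIgnoredDuringExecution"
```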
Another effect of this commit is that the
RemovePodsViolatingNodeAffinity plugin will not remove pods that don't
fit in the current node but for other reasons than violating the node
affinity. Before that, enabling this plugin could cause evictions on
pods that were running on tainted nodes without the necessary
tolerations.
This commit also fixes the wording of some tests from
node_affinity_test.go and some parameters and expectations of these
tests, which were wrong.
* Optimization on RemovePodsViolatingNodeAffinity
Before checking if a pod can be evicted or if it can be scheduled
somewhere else, we first check if it has the corresponding nodeAffinity
field defined. Otherwise, the pod is automatically discarded as a
candidate.
Apart from that, the method that calculates the weight that a pod
gives to a node based on its preferred node affinity has been
renamed to better reflect what it does.
1. Enable OTEL configuration and base framework
2. update generated conversion spec
3. enable docker based conversion and deep copy generate
4. fix broken unit tests
* use pod informers for listing pods in removepodsviolatingtopologyspreadconstraint and removepodsviolatinginterpodantiaffinity
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
* workaround in topologyspreadconstraint test to ensure that informer's index returns pods sorted by name
---------
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
Huge clusters with thousands of pods can quickly exceed the default
watch channel size of the fake clientset, causing the channel
to panic with "channel full".
* pod anti-affinity check among nodes
* avoid pod equality check with UID field
also add node equality check with Name for short-cut
* add test case for anti-affinity violation among different node
* reduce ListPodsOnANode call
* fix old code
* apply gofumpt -w -extra
move klog/v2 import entry to bottom according to master code
* fix plugin arg conversion when using multiple profiles with same plugin
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
* PR feedback to refactor validateDeschedulerConfiguration error handling
---------
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
* update helm chart to v0.27.0
* update manifest version and docs
* fix 1.27 release version from README.md
Co-authored-by: Mike Dame <mikedame@google.com>
---------
Co-authored-by: Mike Dame <mikedame@google.com>
The Evict extension point is not currently in use.
All DefaultEvictor plugin functionality is exposed through Filter and
PreEvictionFilter extension points instead.
Thus, no need to limit the number of evictors enabled.
* bump to k8s 1.27
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
* bump go version to 1.20.3
* bump k8s version and kine for e2e
---------
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
- Populate extension points automatically from plugin types
- Make a list of enabled extension points based on a profile
configuration
- Populate filter and pre-eviction filter handles from their
corresponding extension points
* Descheduling profile
* Fake plugin + profile unit testing
* Rename Profile config type into DeschedulerProfile
To avoid resemblance with profileImpl
* First run deschedule, then balance extension points
* Adding descheduler policy API Version option
in helm templates
* Updating comment for deschedulerPolicyAPIVersion
field
* Making v1alpha1 the default api version
* v1alpha2 docs
* remove internal toc (gh has this natively)
* fix typo and newlines
* name plugins with less confusing names
* add type column
* fix kv selector and nodeSelector desc
* group plugin types in a table
* link the deprecated doc
* warning signs
* add v1alpha2 registry based conversion
* test defaults, set our 1st explicit default
* fix typos and dates
* move pluginregistry to its own dir
* remove unused v1alpha2.Namespace type
* move migration code folders, remove switch
* validate internalPolicy a single time
* remove structured logs
* simplify return
* check for nil methods
* properly check before adding default evictor
* add TODO comment
* bump copyright year
* use plugin registry and prepare for conversion
* Register plugins explicitly to a registry
* check interface impl instead of struct var
* setup plugins at top level
* treat plugin type combinations
* pass registry as arg of V1alpha1ToInternal
* move registry yet another level up
* check interface type separately
* Remove log level from Errors
Every error printed via Errors is expected to be important and always
printable.
* Invoke first Deschedule and then Balance extension points (breaking change)
* Separate plugin arg conversion from pluginsMap
* Separate profile population from plugin execution
* Convert strategy params into profiles outside the main descheduling loop
Strategy params are static and do not change in time.
* Bump the internal DeschedulerPolicy to v1alpha2
Drop conversion from v1alpha1 to internal
* add tests to v1alpha1 to internal conversion
* add tests to strategyParamsToPluginArgs params wiring
* in v1alpha1 evictableNamespaces are still Namespaces
* add test passing in all params
Co-authored-by: Lucas Severo Alves <lseveroa@redhat.com>
--help is now a command, which means explicitly providing a command override in Kubernetes is no longer required. You can now simply provide the necessary arguments.
This commit changes the build_info metric labels:
- The AppVersion label will show the major+minor version
(for example 0.24.1) rather than minor version numbers and the commit hash.
Signed-off-by: eminaktas <eminaktas34@gmail.com>
This does the following:
1. Enables RemovePodsHavingTooManyRestarts by default when using Helm (currently it is not enabled)
2. Adds RemovePodsHavingTooManyRestarts to the values.yaml for clearer configs
update go to 1.19 and helm kubernetes cluster to 1.25
bump -rc.0 to 1.25 GA
bump k8s utils library
bump golang-ci
use go 1.19 for helm github action
upgrade kubectl from 0.20 to 0.25
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
When an error is returned, a strategy either stops completely or starts
processing another node. Given the error can be transient, or only one of
the limits can get exceeded, it is fair to just skip a pod that failed
eviction and proceed to the next one instead.
In order to optimize the processing and stop earlier, it is more
practical to implement a check which says whether a limit was exceeded.
The method uses the node object only to get the node name.
The node name can be retrieved from the pod object instead.
Some strategies might try to evict a pod in Pending state which
does not have the .spec.nodeName field set; in that case the
test for the node limit is skipped.
Both LowNode and HighNode utilization strategies evict only as many pods
as there's free resources on other nodes. Thus, the resource fit test
is always true by definition.
Add taint exclusion to RemovePodsViolatingNodeTaints. This permits node
taints to be ignored by allowing users to specify ignored taint keys or
ignored taint key=value pairs.
Currently, when the descheduler is running with --dry-run enabled, no strategy actually
evicts a pod, so every strategy always starts with a complete list of
pods. E.g. when the PodLifeTime strategy evicts a few pods, the RemoveDuplicatePods
strategy still takes into account even the pods eliminated by the PodLifeTime
strategy. This does not correspond to real scenarios, as the
same pod can be evicted multiple times. Instead, use a fake client and
evict/delete the pods from its cache so the strategies evict each pod
at most once, as would normally be done in a real cluster.
This patch adds the policy (evictFailedBarePods) to allow failed
pods without ownerReferences to be evicted. For backward compatibility,
the policy is disabled by default. Addresses #644.
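A rough sketch of opting in, assuming the top-level policy field shown elsewhere in this document:
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
# Disabled by default for backward compatibility; opt in explicitly.
evictFailedBarePods: true
strategies:
  "PodLifeTime":
    enabled: true
```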
calcContainerRestarts sums over containers. The new language makes
that clear, avoiding potential confusion vs. an alternative that looked
for pods where a single container had passed the configured threshold.
For example, with three containers with 50 restarts and a threshold of
100, the actual "sum over containers" logic makes that pod a candidate
for descheduling, but the "largest single container restart count"
hypothetical would not have made it a candidate.
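A minimal Go sketch of the "sum over containers" logic described here (illustrative only; the real helper may also count init containers):
```go
package restarts

import v1 "k8s.io/api/core/v1"

// calcContainerRestarts returns the restart count summed over all containers
// in the pod, so three containers with 50 restarts each yield 150.
func calcContainerRestarts(pod *v1.Pod) int32 {
	var count int32
	for _, cs := range pod.Status.ContainerStatuses {
		count += cs.RestartCount
	}
	return count
}
```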
Also shifts labelSelector into the parameter table, because when it
was added in 29ade13ce7 (README and e2e-testcase add for
labelSelector, 2021-03-02, #510), it landed a few lines too high.
RemoveDuplicates: take node taints, node affinity and node selector into account when computing the number of feasible nodes for the average occurrence of pods per node.
Nodes with taints which are not tolerated by evicted pods will never run the
pods. The same holds for node affinity and node selector.
So increase the number of pods per feasible node to decrease the
number of evicted pods.
Always use structured logging. Therefore update klog.Errorf() to instead
use klog.ErrorS().
Here is an example of the new log message.
E0428 23:58:57.048912 586 descheduler.go:145] "skipping strategy" err="unknown strategy name" strategy=ASDFPodLifeTime
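For illustration, a call of roughly this shape (function and variable names are hypothetical) produces the structured message above:
```go
package descheduler

import (
	"fmt"

	"k8s.io/klog/v2"
)

func skipUnknownStrategy(name string) {
	// Old, unstructured form: klog.Errorf("skipping strategy %s: unknown strategy name", name)
	err := fmt.Errorf("unknown strategy name")
	// New, structured form: a message plus key/value pairs.
	klog.ErrorS(err, "skipping strategy", "strategy", name)
}
```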
The master branch always represents the next release of the
descheduler. Therefore applying the descheduler k8s manifests
from the master branch is not considered stable. It is best for
users to install descheduler using the released tags.
Similar to ReplicaSet, ReplicationController, and Job pods, pods with a
StatefulSet metadata.ownerReference are considered for eviction.
Document this, so that it is clear to end users.
The resources.cpuRequest and resources.memoryRequest variables are
not valid in the helm chart values.yaml file. The correct variable
name for setting the requests and limits is resources.
Also, fixed white space alignment in the markdown table.
New metrics:
- build_info: Build info about descheduler, including Go version, Descheduler version, Git SHA, Git branch
- pods_evicted: Number of successfully evicted pods, labeled by result, strategy, and namespace
For roughly the past year damemi has been the only active approver for
the descheduler. Therefore move the inactive approvers to emeritus
status. This will help clarify to contributors who should be assigned to
pull requests.
Previous to this change official descheduler container images only
supported the AMD64 hardware architecture. This change enables
building official descheduler container images for multiple
architectures.
The initially supported architectures are AMD64 and ARM64.
Prior to this commit the helm chart used to install the descheduler
CronJob did not set container requests or limits. This is considered
an anti-pattern when deploying applications on k8s.
Set descheduler container resources to make it a burstable pod. This
will ensure a high quality experience for end users when deploying
descheduler into their clusters using the instructions from the README.
The default values chosen for CPU/Memory are not based on any real data.
Prior to this change the output from the command "descheduler version"
when run using the official container images from k8s.gcr.io would
always show empty version fields. See below for an example.
```
docker run k8s.gcr.io/descheduler/descheduler:v0.19.0 /bin/descheduler version
Descheduler version {Major: Minor: GitCommit: GitVersion: BuildDate:2020-09-01T16:43:23+0000 GoVersion:go1.15 Compiler:gc Platform:linux/amd64}
```
This change makes it possible to pass the descheduler version
information to the automated container image build process and
also makes it work for local builds too.
Prior to this commit the YAML manifests used to install the descheduler
Job and CronJob did not set container requests or limits. This is
considered an anti-pattern when deploying applications on k8s.
Set descheduler container resources to make it a burstable pod. This
will ensure a high quality experience for end users when deploying
descheduler into their clusters using the instructions from the README.
The values chosen for CPU/Memory are not based on any real data.
This adds a strategy to balance pod topology domains based on the scheduler's
PodTopologySpread constraints. It attempts to find the minimum number of pods
that should be sent for eviction by comparing the largest domains in a topology
with the smallest domains in that topology.
This commit adds restrictive PodSecurityPolicy, which can be
optionally created, so descheduler can be deployed on clusters with
PodSecurityPolicy admission controller, but which do not ship default
policies.
Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
The k8s.io/klog/v2 package does not currently support structured logging
for warning level log messages. Therefore update the one call in the
code base using klog.Warningf to instead use klog.InfoS.
While non-evictable pods should never be evicted, they should still be
considered when calculating PodAntiAffinity violations. For example, you
may have an evictable pod that should not be running next to a system-critical
static pod. We currently filter IsEvictable before checking for Affinity violations,
so this case would not be caught.
Updated the README with the link to the official descheduler helm chart
on https://hub.helm.sh. This makes it easier for end users to install the
descheduler using helm.
The k8s project recently cut over to the new official k8s.gcr.io
container registry. The descheduler image can now be pulled from
k8s.gcr.io. This is just a new DNS name for the container registry. The
previously documented DNS names for the registry still work, but
require more typing.
Basing this action on push to `chart-*` tags doesn't work: the action itself
creates the new release tag, so trying to push the changes to a tag ends up
with the releaser comparing the changes to themselves (and failing).
This also renames the chart from "descheduler" to "descheduler-helm-chart", to
avoid confusion with automated releases.
This moves the kind setup (previously used by Travis) to the e2e runner script
to accommodate the switch to Prow. This provides a KIND_E2E env var to specify
whether to run the tests in kind, or (by default) to run locally.
The kubernetes project has been updated to use Go 1.14.4. See below pull
request.
https://github.com/kubernetes/kubernetes/pull/88638
After making the updates to Go 1.14 "make gen" no longer worked. The
file hack/tools.go had to be created to get "make gen" working with Go
1.14.
Initially users can open a bug report, feature request, or misc request.
Bug reports and feature requests will have the "kind" label
automatically added.
Prior to this change the event created for every pod eviction was
identical. Instead leverage the newly added eviction reason when
creating k8s events. This makes it easier for end users to understand
why the descheduler evicted a pod when inspecting k8s events.
This is a very minor refactor to use a var declaration for the reason
variable. A var declaration is being used because the zero value for
strings is an empty string.
https://golang.org/ref/spec#The_zero_value
This is a minor refactor of PodEvictor to include evictLocalStoragePods as a
property (so that it doesn't need to be passed to each strategy function.) It
also removes ListEvictablePodsOnNode and extends ListPodsOnANode with an optional
"filter" function parameter, so that for example IsEvictable can be passed to it
and achieve the same results as the function formerly known as ListEvictablePodsOnNode.
This approach also now allows strategies to more explicitly extend their criteria of what
IsEvictable considers an evictable pod (such as NodeAffinity, which checks that the pod
can fit on any other node).
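A hypothetical sketch of the filter-function pattern described above; the actual signatures in the descheduler code base may differ:
```go
package podutil

import v1 "k8s.io/api/core/v1"

// FilterFunc decides whether a pod should be included in a listing.
type FilterFunc func(*v1.Pod) bool

// ListPodsOnANode returns the pods bound to nodeName; an optional filter
// (for example IsEvictable) narrows the result, reproducing what the former
// ListEvictablePodsOnNode did.
func ListPodsOnANode(nodeName string, allPods []*v1.Pod, filter FilterFunc) []*v1.Pod {
	var result []*v1.Pod
	for _, pod := range allPods {
		if pod.Spec.NodeName != nodeName {
			continue
		}
		if filter == nil || filter(pod) {
			result = append(result, pod)
		}
	}
	return result
}
```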
1. Set default CPU/Mem/Pods percentage of thresholds to 100
2. Stop evicting pods if any resource runs out
3. Add a thresholds verification method and limit resource percentages to [0, 100]
4. Change test cases and readme
This newly documented URL can be used to view the descheduler staging
registry in a web browser. This is easier to browse if the gcloud
command is not available.
Starting with descheduler v0.18 a release branch will be created for
each descheduler minor release. The release guide has been updated with
the steps to create release branches.
The matrix has been updated with the soon to be released v0.18
details. Also, clarified the descheduler and k8s version compatibility
requirements and recommendations.
This increases the specificity of the RemoveDuplicates strategy by
removing pods which not only have the same owner, but which also
must have the same list of container images. This also adds a
parameter, `ExcludeOwnerKinds` to the RemoveDuplicates strategy
which accepts a list of Kinds. If a pod has any of these Kinds as
an owner, that pod is not considered for eviction.
The EvictPod function previously returned a bool and an error. The
function now only returns an error. Callers can check for failure by
testing if the returned error is not nil. This aligns the EvictPod
function with idiomatic Go best practices.
Also, the function name has been changed from EvictPod to evictPod
because it is not used outside the evictions package.
End users should be able to see the detailed error from the EvictPod
method when it fails. Updates all strategies to log the error. The
PodLifeTime strategy already logs this error.
In multi-tenant environments it is useful to know which namespace a pod
was evicted from. Therefore log the namespace when evicting pods.
Also, do not log a nil error when successfully evicting pods.
The new PodLifeTime descheduler strategy can be used to evict pods that
were created more than the configured number of seconds ago.
In the below example pods created more than 24 hours ago will be evicted.
````
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"PodLifeTime":
enabled: true
params:
maxPodLifeTimeSeconds: 86400
````
Each strategy implements a test for checking if the maximum number of pods per node
was already evicted. The test duplicates code that can be put
under a single invocation, reducing the number of arguments passed
to each strategy, given the EvictPod call can encapsulate both the check
and the invocation of the pod eviction itself.
Only over utilized nodes need classification of pods into categories.
Thus, skip categorizing pods on the remaining nodes, which saves computation time in cases
where over utilized nodes make up less than 50% of all nodes, or an even
smaller fraction.
Compared to https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#node-affinity
the strategy effectively implements the requiredDuringSchedulingRequiredDuringExecution node affinity type
for kubelets. The only addition over the kubelet is that the strategy checks whether there is at least one other node
capable of respecting the node affinity rules.
When the requiredDuringSchedulingRequiredDuringExecution node affinity type is implemented in the kubelet,
it's likely the strategy either gets removed or re-implemented. Stressing the relation with
requiredDuringSchedulingRequiredDuringExecution will help consumers of the descheduler
keep in mind that the kubelet will eventually take over the strategy when implemented.
Add a link to the test grid dashboard for automated container image builds,
and document useful gcloud commands for working with the staging registry.
This documentation is useful for project maintainers that are creating
releases and publishing container images. The chance of remembering
these commands and links is slim to none. Therefore documenting this
information will be very useful.
Based on feedback during code review it was recommended to allow
updating events in addition to creating events, because event processing
logic on the client side sometimes updates existing events instead of
creating a new one.
Enable a build matrix in Travis CI to run the e2e tests against
multiple k8s versions. Additionally update kind to v0.7.0 which adds
support for k8s v1.17.
Replaced command "kind get kubeconfig-path" with "kind get kubeconfig"
because the "kubeconfig-path" subcommand is not valid with kind v0.7.0.
The descheduler creates a k8s event for each pod that it evicts. When
the code to create events was added the RBAC ClusterRole was not updated
to allow creating events. Users would see the below error in the
descheduler log when it tried to create an event.
"system:serviceaccount:kube-system:descheduler-sa" cannot create resource
"events" in API group "" in the namespace "xxxx-production"' (will not retry!)'
This change fixes this error by updating the ClusterRole to allow
creation of k8s events.
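An illustrative ClusterRole rule granting the missing permission (the exact manifest in the repository may differ, for example by also allowing updates):
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: descheduler-cluster-role
rules:
# Allow the descheduler service account to record eviction events.
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update"]
```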
Change the Job and CronJob YAML manifests to use container image
registry us.gcr.io/k8s-artifacts-prod/descheduler. This is the new
official location for the descheduler container image that is very close
to being all setup.
The k8s YAML manifests for deploying the descheduler as a k8s job were
duplicated across the "examples" and "kubernetes" directories and also
in README.md. This change consolidates the YAML manifests into the
"kubernetes" directory and simplifies the installation instructions for end
users in README.md.
Additionally a k8s CronJob has been added.
This allows end users to more easily monitor when the descheduler evicts
a pod in their environment. Prior to adding k8s event support the only
way to track evicted pods was to view the logs.
When a pod has this annotation, the descheduler will always try to evict it.
This lets users control which pods (only pods with this annotation) will be evicted.
Signed-off-by: ksimon1 <ksimon@redhat.com>
Run the below command to build and push
a staging image to staging-k8s.gcr.io.
```
$ make push
```
Run the below commands to build and push
a release image to k8s.gcr.io.
```
$ git checkout v0.9.0
$ VERSION=v0.9.0 REGISTRY=k8s.gcr.io make push
```
Given the repository is built and about to be tested under the openshift organization,
we need to change kubernetes-incubator to openshift, while still keeping both repos in sync.
By setting up a multistage Docker build, we can create the container
image in a single command. This eliminates external setup and allows us
to build this automatically on registries.
This commit adds checks to only allow valid resource names for the
thresholds field, which are "cpu", "memory", and "pods" right now.
For any other resource names, the descheduler will now
throw an error.
Also, tests have been added to cover the added behavior.
Fixes #40
This commit adds log messages to be printed whenever a node is
identified as a target node from which the pods can be evicted or
as an underutilized node to which the evicted pods can be
scheduled, if at all.
This should help with debugging as well as with letting the user
know which nodes were identified.
Welcome to Kubernetes. We are excited about the prospect of you joining our [community](https://github.com/kubernetes/community)! The Kubernetes community abides by the CNCF [code of conduct](code-of-conduct.md). Here is an excerpt:
_As contributors and maintainers of this project, and in the interest of fostering an open and welcoming community, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities._
## Getting Started
We have full documentation on how to get started contributing here:
- [Contributor License Agreement](https://git.k8s.io/community/CLA.md) Kubernetes projects require that you sign a Contributor License Agreement (CLA) before we can accept your pull requests
- [Kubernetes Contributor Guide](http://git.k8s.io/community/contributors/guide) - Main contributor documentation, or you can just jump directly to the [contributing section](http://git.k8s.io/community/contributors/guide#contributing)
- [Contributor Cheat Sheet](https://git.k8s.io/community/contributors/guide/contributor-cheatsheet/README.md) - Common resources for existing developers
## Mentorship
- [Mentoring Initiatives](https://git.k8s.io/community/mentoring) - We have a diverse set of mentorship programs available that are always looking for volunteers!
description: Descheduler for Kubernetes is used to rebalance clusters by evicting pods that can potentially be scheduled on better nodes. In the current implementation, descheduler does not schedule replacement of evicted pods but relies on the default scheduler for that.
[Descheduler](https://github.com/kubernetes-sigs/descheduler/) for Kubernetes is used to rebalance clusters by evicting pods that can potentially be scheduled on better nodes. In the current implementation, descheduler does not schedule replacement of evicted pods but relies on the default scheduler for that.
This chart bootstraps a [descheduler](https://github.com/kubernetes-sigs/descheduler/) cron job on a [Kubernetes](http://kubernetes.io) cluster using the [Helm](https://helm.sh) package manager.
## Prerequisites
- Kubernetes 1.14+
## Installing the Chart
To install the chart with the release name `my-release`:
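A typical invocation, assuming the chart is served from the project's Helm repository:
```shell
helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm install my-release --namespace kube-system descheduler/descheduler
```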
The command deploys _descheduler_ on the Kubernetes cluster in the default configuration. The [configuration](#configuration) section lists the parameters that can be configured during installation.
> **Tip**: List all releases using `helm list`
## Uninstalling the Chart
To uninstall/delete the `my-release` deployment:
```shell
helm delete my-release
```
The command removes all the Kubernetes components associated with the chart and deletes the release.
## Configuration
The following table lists the configurable parameters of the _descheduler_ chart and their default values.
| Parameter | Description | Default |
|------|---------------|-------------|
| `cronJobApiVersion` | CronJob API Group Version | `"batch/v1"` |
| `schedule` | The cron schedule to run the _descheduler_ job on | `"*/2 * * * *"` |
| `startingDeadlineSeconds` | If set, configure `startingDeadlineSeconds` for the _descheduler_ job | `nil` |
| `timeZone` | configure `timeZone` for CronJob | `nil` |
| `successfulJobsHistoryLimit` | If set, configure `successfulJobsHistoryLimit` for the _descheduler_ job | `3` |
| `failedJobsHistoryLimit` | If set, configure `failedJobsHistoryLimit` for the _descheduler_ job | `1` |
| `ttlSecondsAfterFinished` | If set, configure `ttlSecondsAfterFinished` for the _descheduler_ job | `nil` |
| `deschedulingInterval` | If using kind:Deployment, sets time between consecutive descheduler executions. | `5m` |
| `replicas` | The replica count for Deployment | `1` |
| `leaderElection` | The options for high availability when running replicated components | _see values.yaml_ |
| `cmdOptions` | The options to pass to the _descheduler_ command | _see values.yaml_ |
| `deschedulerPolicy.strategies` | The _descheduler_ strategies to apply | _see values.yaml_ |
| `priorityClassName` | The name of the priority class to add to pods | `system-cluster-critical` |
| `rbac.create` | If `true`, create & use RBAC resources | `true` |
| `resources` | Descheduler container CPU and memory requests/limits | _see values.yaml_ |
| `serviceAccount.create` | If `true`, create a service account for the cron job | `true` |
| `serviceAccount.name` | The name of the service account to use, if not set and create is true a name is generated using the fullname template | `nil` |
| `serviceAccount.annotations` | Specifies custom annotations for the serviceAccount | `{}` |
| `podAnnotations` | Annotations to add to the descheduler Pods | `{}` |
| `podLabels` | Labels to add to the descheduler Pods | `{}` |
| `nodeSelector` | Node selectors to run the descheduler cronjob/deployment on specific nodes | `nil` |
| `service.enabled` | If `true`, create a service for deployment | `false` |
| `serviceMonitor.enabled` | If `true`, create a ServiceMonitor for deployment | `false` |
| `serviceMonitor.namespace` | The namespace where Prometheus expects to find service monitors | `nil` |
| `serviceMonitor.additionalLabels` | Add custom labels to the ServiceMonitor resource | `{}` |
| `serviceMonitor.interval` | The scrape interval. If not set, the Prometheus default scrape interval is used | `nil` |
| `serviceMonitor.honorLabels` | Keep the scraped data's labels when they collide with target labels | `true` |
WARNING: You set replica count as 1 and workload kind as Deployment however leaderElection is not enabled. Consider enabling Leader Election for HA mode.
{{- end}}
{{- if .Values.leaderElection }}
{{- if and (hasKey .Values.cmdOptions "dry-run") (eq (get .Values.cmdOptions "dry-run") true) }}
WARNING: You enabled DryRun mode, you can't use Leader Election.
fs.DurationVar(&rs.DeschedulingInterval, "descheduling-interval", rs.DeschedulingInterval, "time interval between two consecutive descheduler executions")
fs.StringVar(&rs.KubeconfigFile, "kubeconfig-file", rs.KubeconfigFile, "File with kube configuration.")
fs.DurationVar(&rs.DeschedulingInterval, "descheduling-interval", rs.DeschedulingInterval, "Time interval between two consecutive descheduler executions. Setting this value instructs the descheduler to run in a continuous loop at the interval specified.")
fs.StringVar(&rs.ClientConnection.Kubeconfig, "kubeconfig", rs.ClientConnection.Kubeconfig, "File with kube configuration. Deprecated, use client-connection-kubeconfig instead.")
fs.StringVar(&rs.ClientConnection.Kubeconfig, "client-connection-kubeconfig", rs.ClientConnection.Kubeconfig, "File path to kube configuration for interacting with kubernetes apiserver.")
fs.Float32Var(&rs.ClientConnection.QPS, "client-connection-qps", rs.ClientConnection.QPS, "QPS to use for interacting with kubernetes apiserver.")
fs.Int32Var(&rs.ClientConnection.Burst, "client-connection-burst", rs.ClientConnection.Burst, "Burst to use for interacting with kubernetes apiserver.")
fs.StringVar(&rs.PolicyConfigFile, "policy-config-file", rs.PolicyConfigFile, "File with descheduler policy configuration.")
fs.BoolVar(&rs.DryRun, "dry-run", rs.DryRun, "execute descheduler in dry run mode.")
// node-selector query causes descheduler to run only on nodes that matches the node labels in the query
fs.StringVar(&rs.NodeSelector, "node-selector", rs.NodeSelector, "Selector (label query) to filter on, supports '=', '==', and '!='.(e.g. -l key1=value1,key2=value2)")
fs.BoolVar(&rs.DryRun, "dry-run", rs.DryRun, "Execute descheduler in dry run mode.")
fs.BoolVar(&rs.DisableMetrics, "disable-metrics", rs.DisableMetrics, "Disables metrics. The metrics are by default served through https://localhost:10258/metrics. Secure address, resp. port can be changed through --bind-address, resp. --secure-port flags.")
fs.StringVar(&rs.Tracing.CollectorEndpoint, "otel-collector-endpoint", "", "Set this flag to the OpenTelemetry Collector Service Address")
fs.StringVar(&rs.Tracing.TransportCert, "otel-transport-ca-cert", "", "Path of the CA Cert that can be used to generate the client Certificate for establishing secure connection to the OTEL in gRPC mode")
fs.StringVar(&rs.Tracing.ServiceName, "otel-service-name", tracing.DefaultServiceName, "OTEL Trace name to be used with the resources")
fs.StringVar(&rs.Tracing.ServiceNamespace, "otel-trace-namespace", "", "OTEL Trace namespace to be used with the resources")
fs.Float64Var(&rs.Tracing.SampleRate, "otel-sample-rate", 1.0, "Sample rate to collect the Traces")
fs.BoolVar(&rs.Tracing.FallbackToNoOpProviderOnError, "otel-fallback-no-op-on-error", false, "Fallback to NoOp Tracer in case of error")
fs.BoolVar(&rs.EnableHTTP2, "enable-http2", false, "If http/2 should be enabled for the metrics and health check")
As contributors and maintainers of this project, and in the interest of fostering
an open and welcoming community, we pledge to respect all people who contribute
through reporting issues, posting feature requests, updating documentation,
submitting pull requests or patches, and other activities.
We are committed to making participation in this project a harassment-free experience for
everyone, regardless of level of experience, gender, gender identity and expression,
sexual orientation, disability, personal appearance, body size, race, ethnicity, age,
religion, or nationality.
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery
* Personal attacks
* Trolling or insulting/derogatory comments
* Public or private harassment
* Publishing others' private information, such as physical or electronic addresses,
without explicit permission
* Other unethical or unprofessional conduct.
Project maintainers have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are not
aligned to this Code of Conduct. By adopting this Code of Conduct, project maintainers
commit themselves to fairly and consistently applying these principles to every aspect
of managing this project. Project maintainers who do not follow or enforce the Code of
Conduct may be permanently removed from the project team.
This code of conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community.
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting a Kubernetes maintainer, Sarah Novotny <sarahnovotny@google.com>, and/or Dan Kohn <dan@linuxfoundation.org>.
This Code of Conduct is adapted from the Contributor Covenant
(http://contributor-covenant.org), version 1.2.0, available at
http://contributor-covenant.org/version/1/2/0/
### Kubernetes Events Code of Conduct
Kubernetes events are working conferences intended for professional networking and collaboration in the
Kubernetes community. Attendees are expected to behave according to professional standards and in accordance
with their employer's policies on appropriate workplace behavior.
While at Kubernetes events or related social networking opportunities, attendees should not engage in
discriminatory or offensive speech or actions regarding gender, sexuality, race, or religion. Speakers should
be especially aware of these concerns.
The Kubernetes team does not condone any statements by speakers contrary to these standards. The Kubernetes
team reserves the right to deny entrance and/or eject from an event (without refund) any individual found to
be engaging in discriminatory or offensive speech or actions.
Please bring any concerns to the immediate attention of Kubernetes event staff.
The descheduler evicts pods which may be bound to less desired nodes
```
descheduler [flags]
```
### Options
```
--bind-address ip The IP address on which to listen for the --secure-port port. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces and IP address families will be used. (default 0.0.0.0)
--cert-dir string The directory where the TLS certs are located. If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored. (default "apiserver.local.config/certificates")
--client-connection-burst int32 Burst to use for interacting with kubernetes apiserver.
--client-connection-kubeconfig string File path to kube configuration for interacting with kubernetes apiserver.
--client-connection-qps float32 QPS to use for interacting with kubernetes apiserver.
--descheduling-interval duration Time interval between two consecutive descheduler executions. Setting this value instructs the descheduler to run in a continuous loop at the interval specified.
--disable-metrics Disables metrics. The metrics are by default served through https://localhost:10258/metrics. Secure address, resp. port can be changed through --bind-address, resp. --secure-port flags.
--dry-run Execute descheduler in dry run mode.
--enable-http2 If http/2 should be enabled for the metrics and health check
-h, --help help for descheduler
--http2-max-streams-per-connection int The limit that the server gives to clients for the maximum number of streams in an HTTP/2 connection. Zero means to use golang's default.
--kubeconfig string File with kube configuration. Deprecated, use client-connection-kubeconfig instead.
--leader-elect Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability.
--leader-elect-lease-duration duration The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 2m17s)
--leader-elect-renew-deadline duration The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than the lease duration. This is only applicable if leader election is enabled. (default 1m47s)
--leader-elect-resource-lock string The type of resource object that is used for locking during leader election. Supported options are 'leases', 'endpointsleases' and 'configmapsleases'. (default "leases")
--leader-elect-resource-name string The name of resource object that is used for locking during leader election. (default "descheduler")
--leader-elect-resource-namespace string The namespace of resource object that is used for locking during leader election. (default "kube-system")
--leader-elect-retry-period duration The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 26s)
--log-flush-frequency duration Maximum number of seconds between log flushes (default 5s)
--log-json-info-buffer-size quantity [Alpha] In JSON format with split output streams, the info messages can be buffered for a while to increase performance. The default value of zero bytes disables buffering. The size can be specified as number of bytes (512), multiples of 1000 (1K), multiples of 1024 (2Ki), or powers of those (3M, 4G, 5Mi, 6Gi). Enable the LoggingAlphaOptions feature gate to use this.
--log-json-split-stream [Alpha] In JSON format, write error messages to stderr and info messages to stdout. The default is to write a single stream to stdout. Enable the LoggingAlphaOptions feature gate to use this.
--logging-format string Sets the log format. Permitted formats: "json" (gated by LoggingBetaOptions), "text". (default "text")
--otel-collector-endpoint string Set this flag to the OpenTelemetry Collector Service Address
--otel-fallback-no-op-on-error Fallback to NoOp Tracer in case of error
--otel-sample-rate float Sample rate to collect the Traces (default 1)
--otel-service-name string OTEL Trace name to be used with the resources (default "descheduler")
--otel-trace-namespace string OTEL Trace namespace to be used with the resources
--otel-transport-ca-cert string Path of the CA Cert that can be used to generate the client Certificate for establishing secure connection to the OTEL in gRPC mode
--permit-address-sharing If true, SO_REUSEADDR will be used when binding the port. This allows binding to wildcard IPs like 0.0.0.0 and specific IPs in parallel, and it avoids waiting for the kernel to release sockets in TIME_WAIT state. [default=false]
--permit-port-sharing If true, SO_REUSEPORT will be used when binding the port, which allows more than one instance to bind on the same address and port. [default=false]
--policy-config-file string File with descheduler policy configuration.
--secure-port int The port on which to serve HTTPS with authentication and authorization. If 0, don't serve HTTPS at all. (default 10258)
--tls-cert-file string File containing the default x509 Certificate for HTTPS. (CA cert, if any, concatenated after server cert). If HTTPS serving is enabled, and --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to the directory specified by --cert-dir.
--tls-cipher-suites strings Comma-separated list of cipher suites for the server. If omitted, the default Go cipher suites will be used.
--tls-sni-cert-key namedCertKey A pair of x509 certificate and private key file paths, optionally suffixed with a list of domain patterns which are fully qualified domain names, possibly with prefixed wildcard segments. The domain patterns also allow IP addresses, but IPs should only be used if the apiserver has visibility to the IP address requested by a client. If no domain patterns are provided, the names of the certificate are extracted. Non-wildcard matches trump over wildcard matches, explicit domain patterns trump over extracted names. For multiple key/certificate pairs, use the --tls-sni-cert-key multiple times. Examples: "example.crt,example.key" or "foo.crt,foo.key:*.foo.com,foo.com". (default [])
-v, --v Level number for the log level verbosity
--vmodule pattern=N,... comma-separated list of pattern=N settings for file-filtered logging (only works for text log format)
```
### SEE ALSO
* [descheduler version](descheduler_version.md) - Version of descheduler
./_output/bin/descheduler --client-connection-kubeconfig <path to kubeconfig> --policy-config-file <path-to-policy-file> --v 1
```
View all CLI options.
```
./_output/bin/descheduler --help
```
## Run Tests
```
GOOS=linux make dev-image
make kind-multi-node
kind load docker-image <image name>
kind get kubeconfig > /tmp/admin.conf
export KUBECONFIG=/tmp/admin.conf
make test-unit
make test-e2e
```
## Format Code
After making changes in the code base, ensure that the code is formatted correctly:
```
make fmt
```
## Build Helm Package locally
If you made some changes in the chart and just want to check whether templating is OK or the chart is buildable, you can run this command to have a package built from the `./charts` directory.
```
make build-helm
```
## Lint Helm Chart locally
To check linting of your changes in the helm chart locally you can run:
```
make lint-chart
```
## Test helm changes locally with kind and ct
You will need kind and docker (or equivalent) installed. We can use the public ct image to avoid installing ct and all its dependencies.
```
make kind-multi-node
make ct-helm
```
### Miscellaneous
See the [hack directory](https://github.com/kubernetes-sigs/descheduler/tree/master/hack) for additional tools and scripts used for developing the descheduler.
Starting with release v0.18.0 there is an official helm chart that can be used to install the
descheduler. See the [helm chart README](https://github.com/kubernetes-sigs/descheduler/blob/master/charts/descheduler/README.md) for detailed instructions.
The descheduler helm chart is also listed on the [artifact hub](https://artifacthub.io/packages/helm/descheduler/descheduler).
### Install Using Kustomize
You can use kustomize to install descheduler.
See the [resources | Kustomize](https://kubectl.docs.kubernetes.io/references/kustomize/cmd/build/) for detailed instructions.
See the [user guide](docs/user-guide.md) in the `/docs` directory.
## Policy and Strategies
Descheduler's policy is configurable and includes strategies that can be enabled or disabled. By default, all strategies are enabled.
The policy includes a common configuration that applies to all the strategies:
| Name | Default Value | Description |
|------|---------------|-------------|
| `nodeSelector` | `nil` | limiting the nodes which are processed |
| `evictLocalStoragePods` | `false` | allows eviction of pods with local storage |
| `evictSystemCriticalPods` | `false` | [Warning: Will evict Kubernetes system pods] allows eviction of pods with any priority, including system pods like kube-dns |
| `ignorePvcPods` | `false` | set whether PVC pods should be evicted or ignored |
| `maxNoOfPodsToEvictPerNode` | `nil` | maximum number of pods evicted from each node (summed through all strategies) |
| `maxNoOfPodsToEvictPerNamespace` | `nil` | maximum number of pods evicted from each namespace (summed through all strategies) |
| `evictFailedBarePods` | `false` | allow eviction of pods without owner references and in failed phase |
As part of the policy, the parameters associated with each strategy can be configured.
See each strategy for details on available parameters.
**Policy:**
```yaml
apiVersion:"descheduler/v1alpha1"
kind:"DeschedulerPolicy"
nodeSelector:prod=dev
evictFailedBarePods:false
evictLocalStoragePods:true
evictSystemCriticalPods:true
maxNoOfPodsToEvictPerNode:40
ignorePvcPods:false
strategies:
...
```
The following diagram provides a visualization of most of the strategies to help
categorize how strategies fit together.

### RemoveDuplicates
This strategy makes sure that there is only one pod associated with a ReplicaSet (RS),
ReplicationController (RC), StatefulSet, or Job running on the same node. If there are more,
those duplicate pods are evicted for better spreading of pods in a cluster. This issue could happen
if some nodes went down due to whatever reasons, and pods on them were moved to other nodes leading to
more than one pod associated with a RS or RC, for example, running on the same node. Once the failed nodes
are ready again, this strategy could be enabled to evict those duplicate pods.
It provides one optional parameter, `excludeOwnerKinds`, which is a list of OwnerRef `Kind`s. If a pod
has any of these `Kind`s listed as an `OwnerRef`, that pod will not be considered for eviction. Note that
pods created by Deployments are considered for eviction by this strategy. The `excludeOwnerKinds` parameter
should include `ReplicaSet` to have pods created by Deployments excluded.
Policy should pass the following validation checks:
* Three basic native types of resources are supported: `cpu`, `memory` and `pods`.
If any of these resource types is not specified, all its thresholds default to 100% to avoid nodes going from underutilized to overutilized.
* Extended resources are supported. For example, resource type `nvidia.com/gpu` is specified for GPU node utilization. Extended resources are optional,
and will not be used to compute node's usage if it's not specified in `thresholds` and `targetThresholds` explicitly.
* `thresholds` or `targetThresholds` can not be nil and they must configure exactly the same types of resources.
* The valid range of the resource's percentage value is \[0, 100\]
* Percentage value of `thresholds` can not be greater than `targetThresholds` for the same resource.
There is another parameter associated with the `LowNodeUtilization` strategy, called `numberOfNodes`.
This parameter can be configured to activate the strategy only when the number of under utilized nodes
is above the configured value. This could be helpful in large clusters where a few nodes could go
under utilized frequently or for a short period of time. By default, `numberOfNodes` is set to zero.
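For illustration, a sketch of a policy setting `numberOfNodes` alongside the utilization thresholds (assuming it nests under `nodeResourceUtilizationThresholds` like the other parameters):
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50
        numberOfNodes: 3
```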
### HighNodeUtilization
This strategy finds nodes that are under utilized and evicts pods from the nodes in the hope that these pods will be
scheduled compactly into fewer nodes. Used in conjunction with node auto-scaling, this strategy is intended to help
trigger down scaling of under utilized nodes.
This strategy **must** be used with the scheduler scoring strategy `MostAllocated`. The parameters of this strategy are
configured under `nodeResourceUtilizationThresholds`.
The under utilization of nodes is determined by a configurable threshold `thresholds`. The threshold
`thresholds` can be configured for cpu, memory, number of pods, and extended resources in terms of percentage. The percentage is
calculated as the current resources requested on the node vs [total allocatable](https://kubernetes.io/docs/concepts/architecture/nodes/#capacity).
For pods, this means the number of pods on the node as a fraction of the pod capacity set for that node.
If a node's usage is below threshold for all (cpu, memory, number of pods and extended resources), the node is considered underutilized.
Currently, pods' resource requests are considered for computing node resource utilization.
Any node above `thresholds` is considered appropriately utilized and is not considered for eviction.
The `thresholds` param could be tuned as per your cluster requirements. Note that this
strategy evicts pods from `underutilized nodes` (those with usage below `thresholds`)
so that they can be recreated in appropriately utilized nodes.
The strategy aborts if the number of `underutilized nodes` or `appropriately utilized nodes` is zero.
**NOTE:** Node resource consumption is determined by the requests and limits of pods, not actual usage.
This approach is chosen in order to maintain consistency with the kube-scheduler, which follows the same
design for scheduling pods onto nodes. This means that resource usage as reported by Kubelet (or commands
like `kubectl top`) may differ from the calculated consumption, due to these components reporting
actual usage metrics. Implementing metrics-based descheduling is currently TODO for the project.
**Parameters:**
|Name|Type|
|---|---|
|`thresholds`|map(string:int)|
|`numberOfNodes`|int|
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|
Policy should pass the following validation checks:
* Three basic native types of resources are supported: `cpu`, `memory` and `pods`. If any of these resource types is not specified, all its thresholds default to 100%.
* Extended resources are supported. For example, resource type `nvidia.com/gpu` is specified for GPU node utilization. Extended resources are optional, and will not be used to compute node's usage if it's not specified in `thresholds` explicitly.
* `thresholds` can not be nil.
* The valid range of the resource's percentage value is \[0, 100\]
There is another parameter associated with the `HighNodeUtilization` strategy, called `numberOfNodes`.
This parameter can be configured to activate the strategy only when the number of under utilized nodes
is above the configured value. This could be helpful in large clusters where a few nodes could go
under utilized frequently or for a short period of time. By default, `numberOfNodes` is set to zero.
### RemovePodsViolatingInterPodAntiAffinity
This strategy makes sure that pods violating interpod anti-affinity are removed from nodes. For example,
if there is podA on a node, and podB and podC (running on the same node) have anti-affinity rules which prohibit
them from running on the same node, then podA will be evicted from the node so that podB and podC can run. This
issue can happen when the anti-affinity rules for podB and podC are created while they are already running on
the node.
**Parameters:**
|Name|Type|
|---|---|
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|
**Example:**
````yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemovePodsViolatingNodeTaints":
enabled: true
params:
excludedTaints:
- dedicated=special-user # exclude taints with key "dedicated" and value "special-user"
- reserved # exclude all taints with key "reserved"
````
### RemovePodsViolatingTopologySpreadConstraint
This strategy makes sure that pods violating [topology spread constraints](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
are evicted from nodes. Specifically, it tries to evict the minimum number of pods required to balance topology domains to within each constraint's `maxSkew`.
This strategy requires k8s version 1.18 at a minimum.
By default, this strategy only deals with hard constraints, setting parameter `includeSoftConstraints` to `true` will
include soft constraints.
Strategy parameter `labelSelector` is not utilized when balancing topology domains and is only applied during eviction to determine if the pod can be evicted.
**Parameters:**
|Name|Type|
|---|---|
|`includeSoftConstraints`|bool|
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|
**Example:**
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemoveFailedPods":
enabled: true
params:
failedPods:
reasons:
- "NodeAffinity"
includingInitContainers: true
excludeOwnerKinds:
- "Job"
minPodLifetimeSeconds: 3600
```
## Filter Pods
### Namespace filtering
The following strategies accept a `namespaces` parameter which allows specifying a list of included or, alternatively, excluded namespaces:
* `PodLifeTime`
* `RemovePodsHavingTooManyRestarts`
* `RemovePodsViolatingNodeTaints`
* `RemovePodsViolatingNodeAffinity`
* `RemovePodsViolatingInterPodAntiAffinity`
* `RemoveDuplicates`
* `RemovePodsViolatingTopologySpreadConstraint`
* `RemoveFailedPods`
* `LowNodeUtilization` and `HighNodeUtilization` (Only filtered right before eviction)
For example:
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"PodLifeTime":
enabled: true
params:
podLifeTime:
maxPodLifeTimeSeconds: 86400
namespaces:
include:
- "namespace1"
- "namespace2"
```
In the example above, `PodLifeTime` gets executed only over `namespace1` and `namespace2`.
The same holds for the `exclude` field:
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"PodLifeTime":
enabled: true
params:
podLifeTime:
maxPodLifeTimeSeconds: 86400
namespaces:
exclude:
- "namespace1"
- "namespace2"
```
The strategy gets executed over all namespaces but `namespace1` and `namespace2`.
It is not allowed to combine the `include` and `exclude` fields.
### Priority filtering
All strategies are able to configure a priority threshold; only pods under the threshold can be evicted. You can
specify this threshold by setting the `thresholdPriorityClassName` (setting the threshold to the value of the given
priority class) or `thresholdPriority` (directly setting the threshold) parameters. By default, this threshold
is set to the value of the `system-cluster-critical` priority class.
Note: Setting `evictSystemCriticalPods` to true disables priority filtering entirely.
E.g.
Setting `thresholdPriority`
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"PodLifeTime":
enabled: true
params:
podLifeTime:
maxPodLifeTimeSeconds: 86400
thresholdPriority: 10000
```
Setting `thresholdPriorityClassName`
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"PodLifeTime":
enabled: true
params:
podLifeTime:
maxPodLifeTimeSeconds: 86400
thresholdPriorityClassName: "priorityclass1"
```
Note that you can't configure both `thresholdPriority` and `thresholdPriorityClassName`. If the given priority class
does not exist, the descheduler won't create it and will throw an error.
### Label filtering
The following strategies can configure a [standard kubernetes labelSelector](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#labelselector-v1-meta)
to filter pods by their labels:
* `PodLifeTime`
* `RemovePodsHavingTooManyRestarts`
* `RemovePodsViolatingNodeTaints`
* `RemovePodsViolatingNodeAffinity`
* `RemovePodsViolatingInterPodAntiAffinity`
* `RemovePodsViolatingTopologySpreadConstraint`
* `RemoveFailedPods`
This allows running strategies only on the pods the descheduler is interested in.
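As a sketch (the label keys and values are purely illustrative), a `PodLifeTime` strategy restricted to a labelled subset of pods could be configured like this:
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      labelSelector:
        matchLabels:
          component: redis
        matchExpressions:
          - {key: tier, operator: In, values: [cache]}
```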
### Node Fit filtering
The following strategies accept a `nodeFit` boolean parameter which can optimize descheduling:
* `RemoveDuplicates`
* `LowNodeUtilization`
* `HighNodeUtilization`
* `RemovePodsViolatingInterPodAntiAffinity`
* `RemovePodsViolatingNodeAffinity`
* `RemovePodsViolatingNodeTaints`
* `RemovePodsViolatingTopologySpreadConstraint`
* `RemovePodsHavingTooManyRestarts`
* `RemoveFailedPods`
If set to `true` the descheduler will consider whether or not the pods that meet eviction criteria will fit on other nodes before evicting them. If a pod cannot be rescheduled to another node, it will not be evicted. Currently the following criteria are considered when setting `nodeFit` to `true`:
- A `nodeSelector` on the pod
- Any `tolerations` on the pod and any `taints` on the other nodes
- `nodeAffinity` on the pod
- Resource `requests` made by the pod and the resources available on other nodes
- Whether any of the other nodes are marked as `unschedulable`
E.g.
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeFit: true
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50
```
Note that node fit filtering references the current pod spec, and not that of its owner.
Thus, if the pod is owned by a ReplicationController (and that ReplicationController was modified recently),
the pod may be running with an outdated spec, which the descheduler will reference when determining node fit.
This is expected behavior as the descheduler is a "best-effort" mechanism.
Using Deployments instead of ReplicationControllers provides an automated rollout of pod spec changes, therefore ensuring that the descheduler has an up-to-date view of the cluster state.
The process for publishing each Descheduler release includes a mixture of manual and automatic steps. Over
time, it would be good to automate as much of this process as possible. However, due to current limitations,
care must be taken to perform each manual step precisely so that the automated steps execute properly.
## Pre-release Code Changes
Before publishing each release, the following code updates must be made:
- [ ] (Optional, but recommended) Bump `k8s.io` dependencies to the `-rc` tags. These tags are usually published around upstream code freeze. [Example](https://github.com/kubernetes-sigs/descheduler/pull/539)
- [ ] Bump `k8s.io` dependencies to GA tags once they are published (following the upstream release). [Example](https://github.com/kubernetes-sigs/descheduler/pull/615)
- [ ] Ensure that Go is updated to the same version as upstream. [Example](https://github.com/kubernetes-sigs/descheduler/pull/801)
- [ ] Make CI changes in [github.com/kubernetes/test-infra](https://github.com/kubernetes/test-infra) to add the new version's tests (note, this may also include a Go bump). [Example](https://github.com/kubernetes/test-infra/pull/25833)
- [ ] Update local CI versions for utils (such as golang-ci), kind, and go. [Example - e2e](https://github.com/kubernetes-sigs/descheduler/commit/ac4d576df8831c0c399ee8fff1e85469e90b8c44), [Example - helm](https://github.com/kubernetes-sigs/descheduler/pull/821)
- [ ] Update version references in docs and Readme. [Example](https://github.com/kubernetes-sigs/descheduler/pull/617)
## Release Process
When the above pre-release steps are complete and the release is ready to be cut, perform the following steps **in order**
(the flowchart below demonstrates these steps):
**Version release**
1. Create the `git tag` on `master` for the release, eg `v0.24.0`
2. Merge Helm chart version update to `master` (see [Helm chart](#helm-chart) below). [Example](https://github.com/kubernetes-sigs/descheduler/pull/709)
3. Perform the [image promotion process](https://github.com/kubernetes/k8s.io/tree/main/k8s.gcr.io#image-promoter). [Example](https://github.com/kubernetes/k8s.io/pull/3344)
4. Cut release branch from `master`, eg `release-1.24`
5. Publish release using Github's release process from the git tag you created
6. Email `kubernetes-sig-scheduling@googlegroups.com` to announce the release
**Patch release**
1. Pick relevant code change commits to the matching release branch, eg `release-1.24`
2. Create the patch tag on the release branch, eg `v0.24.1` on `release-1.24`
3. Merge Helm chart version update to release branch
4. Perform the image promotion process for the patch version
5. Publish release using Github's release process from the git tag you created
6. Email `kubernetes-sig-scheduling@googlegroups.com` to announce the release
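Creating and pushing the tag (step 1 of a version release, step 2 of a patch release) typically amounts to the commands below; the remote name `upstream` is an assumption and should be whatever remote points at `kubernetes-sigs/descheduler`:
```
git tag -a v0.24.0 -m "v0.24.0"   # or e.g. v0.24.1 on the release branch for a patch
git push upstream v0.24.0
```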
### Flowchart

### Image promotion process
Every merge to any branch triggers an [image build and push](https://github.com/kubernetes/test-infra/blob/c36b8e5/config/jobs/image-pushing/k8s-staging-descheduler.yaml) to a `gcr.io` repository.
These automated image builds are snapshots of the code in place at the time of every PR merge and
tagged with the latest git SHA at the time of the build. To create a final release image, the desired
auto-built image SHA is added to a [file upstream](https://github.com/kubernetes/k8s.io/blob/e9e971c/k8s.gcr.io/images/k8s-staging-descheduler/images.yaml) which
copies that image to a public registry.
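For orientation only (the digest below is a placeholder, not a real value), the entry added to that file maps the auto-built image digest to the release tag and looks roughly like:
```yaml
- name: descheduler
  dmap:
    "sha256:<auto-built image digest>": ["v0.24.0"]
```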
Automatic builds can be monitored and re-triggered with the [`post-descheduler-push-images` job](https://prow.k8s.io/?job=post-descheduler-push-images) on prow.k8s.io.
Note that images can also be manually built and pushed using `VERSION=$VERSION make push-all` by [users with access](https://github.com/kubernetes/k8s.io/blob/fbee8f67b70304241e613a672c625ad972998ad7/groups/sig-scheduling/groups.yaml#L33-L43).
## Helm Chart
We currently use the [chart-releaser-action GitHub Action](https://github.com/helm/chart-releaser-action) to automatically release the Helm chart.
This action is triggered when it detects any changes to [`Chart.yaml`](https://github.com/kubernetes-sigs/descheduler/blob/022e07c27853fade6d1304adc0a6ebe02642386c/charts/descheduler/Chart.yaml) on
a `release-*` branch.
Helm chart releases are managed by a separate set of git tags that are prefixed with `descheduler-helm-chart-*`. Example git tag name is `descheduler-helm-chart-0.18.0`.
Released versions of the helm charts are stored in the `gh-pages` branch of this repo.
The major and minor version of the chart matches the descheduler major and minor versions. For example, descheduler helm chart version `descheduler-helm-chart-0.18.0` corresponds
to descheduler version v0.18.0. The patch version of the descheduler helm chart and the patch version of the descheduler will not necessarily match. The patch
version of the descheduler helm chart is used to version changes specific to the helm chart.
1. Merge all helm chart changes into the master branch before the release is tagged/cut
   1. Ensure that `appVersion` in file `charts/descheduler/Chart.yaml` matches the descheduler version (no `v` prefix)
   2. Ensure that `version` in file `charts/descheduler/Chart.yaml` has been incremented. This is the chart version (see the illustrative `Chart.yaml` excerpt below)
2. Make sure your repo is clean by git's standards
3. Follow the release-branch or patch release tagging pattern from the above section
4. Verify the new helm artifact has been successfully pushed to the `gh-pages` branch
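For illustration only (the version numbers are hypothetical), the two fields in `charts/descheduler/Chart.yaml` that matter here are:
```yaml
version: 0.24.1    # chart version; incremented for every chart release
appVersion: 0.24.0 # descheduler version, without the "v" prefix
```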
## Notes
The Helm releaser-action compares the changes in the action-triggering branch to the latest tag on that branch, so if you tag before creating the new branch there
will be nothing to compare and it will fail. This is why it's necessary to tag, eg, `v0.24.0` *before* making the changes to the
Helm chart version, so that there is a new diff for the action to find. (Tagging *after* making the Helm chart changes would
also work, but then the code that gets built into the promoted image will be tagged as `descheduler-helm-chart-xxx` rather than `v0.xx.0`).
See [post-descheduler-push-images dashboard](https://testgrid.k8s.io/sig-scheduling#post-descheduler-push-images) for staging registry image build job status.
View the descheduler staging registry using [this URL](https://console.cloud.google.com/gcr/images/k8s-staging-descheduler/GLOBAL/descheduler) in a web browser
or use the below `gcloud` commands.
List images in staging registry.
```
gcloud container images list --repository gcr.io/k8s-staging-descheduler
```
List descheduler image tags in the staging registry.
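Assuming the image is published under the standard staging path, a command along these lines lists the tags:
```
gcloud container images list-tags gcr.io/k8s-staging-descheduler/descheduler
```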
The [examples](https://github.com/kubernetes-sigs/descheduler/tree/master/examples) directory has descheduler policy configuration examples.
## CLI Options
The descheduler has many CLI options that can be used to override its default behavior. Please check the [CLI Options](./cli/descheduler.md) documentation for details.
## Production Use Cases
This section contains descriptions of real world production use cases.
### Balance Cluster By Pod Age
When initially migrating applications from a static virtual machine infrastructure to a cloud native k8s
infrastructure there can be a tendency to treat application pods like static virtual machines. One approach
to help prevent developers and operators from treating pods like virtual machines is to ensure that pods
only run for a fixed amount
of time.
The `PodLifeTime` strategy can be used to ensure that old pods are evicted. It is recommended to create a
[pod disruption budget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) for each
application so that evictions do not disrupt too many replicas at once (a sample budget is sketched after the policy below).
This policy configuration file ensures that pods created more than 7 days ago are evicted.
```yaml
---
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
      - name: "PodLifeTime"
        args:
          maxPodLifeTimeSeconds: 604800
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"
```
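A minimal PodDisruptionBudget sketch for a hypothetical application (the name, label, and `minAvailable` value are assumptions) could look like:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb              # hypothetical name
spec:
  minAvailable: 2              # keep at least 2 replicas running while old pods are evicted
  selector:
    matchLabels:
      app: myapp               # hypothetical application label
```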
### Balance Cluster By Node Memory Utilization
If your cluster has been running for a long period of time, you may find that the resource utilization is not very
balanced. The following two strategies can be used to rebalance your cluster based on `cpu`, `memory`
or `number of pods`.
#### Balance high utilization nodes
Using `LowNodeUtilization`, the descheduler will rebalance the cluster based on memory by evicting pods
from nodes with memory utilization over 70%, so that the scheduler can place them on nodes with memory utilization below 20%.
```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
      - name: "LowNodeUtilization"
        args:
          thresholds:
            "memory": 20
          targetThresholds:
            "memory": 70
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
```
#### Balance low utilization nodes
Using `HighNodeUtilization`, the descheduler will rebalance the cluster based on memory by evicting pods
from nodes with memory utilization lower than 20%. This strategy should be used together with the scheduler's `NodeResourcesFit` plugin configured with the `MostAllocated` scoring strategy (see the [scheduling configuration docs](https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins)); a scheduler configuration sketch follows the example below.
The evicted pods will then be compacted onto a minimal set of nodes.
```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
      - name: "HighNodeUtilization"
        args:
          thresholds:
            "memory": 20
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"
```
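On the scheduler side (this is a KubeSchedulerConfiguration, not a descheduler policy; the resource weights are assumptions), enabling the `MostAllocated` scoring strategy looks roughly like:
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```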
### Autoheal Node Problems
Descheduler's `RemovePodsViolatingNodeTaints` strategy can be combined with
[Node Problem Detector](https://github.com/kubernetes/node-problem-detector/) and
[Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) to automatically remove
Nodes which have problems. Node Problem Detector can detect specific Node problems and report them to the API server.
There is a node controller feature called `TaintNodesByCondition` that takes some conditions and turns them into taints. Currently, this only works for the default node conditions: PIDPressure, MemoryPressure, DiskPressure, Ready, and some cloud provider specific conditions.
The Descheduler will then deschedule workloads from those Nodes. Finally, if the descheduled Node's resource
allocation falls below the Cluster Autoscaler's scale down threshold, the Node will become a scale down candidate
and can be removed by Cluster Autoscaler. These three components form an autohealing cycle for Node problems.
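A minimal descheduler policy sketch for this setup (the profile name is arbitrary) might be:
```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
      - name: "RemovePodsViolatingNodeTaints"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeTaints"
```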
---
**NOTE**
Once [kubernetes/node-problem-detector#565](https://github.com/kubernetes/node-problem-detector/pull/565) is available in NPD, we need to update this section.
echo"+++ Creating a pull request on GitHub at ${GITHUB_USER}:${NEWBRANCH}"
local numandtitle
numandtitle=$(printf'%s\n'"${SUBJECTS[@]}")
prtext=$(cat <<EOF
Cherry pick of ${PULLSUBJ} on ${rel}.
${numandtitle}
For details on the cherry pick process, see the [cherry pick requests](https://git.k8s.io/community/contributors/devel/sig-release/cherry-picks.md) page.