* fix plugin arg conversion when using multiple profiles with same plugin
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
* PR feedback to refactor validateDeschedulerConfiguration error handling
* update helm chart to v0.27.0
* update manifest version and docs
* fix 1.27 release version in README.md
Co-authored-by: Mike Dame <mikedame@google.com>
The Evict extension point is not currently in use.
All DefaultEvictor plugin functionality is exposed through Filter and
PreEvictionFilter extension points instead.
Thus, there is no need to limit the number of evictors enabled.
* bump to k8s 1.27
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
* bump go version to 1.20.3
* bump k8s version and kine for e2e
- Populate extension points automatically from plugin types
- Make a list of enabled extension points based on a profile
configuration
- Populate filter and pre-eviction filter handles from their
corresponding extension points
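As a rough sketch of the resulting shape, a v1alpha2 profile wires plugins into extension points roughly like this (the profile name, plugin choice, and args below are illustrative, not taken from the commits above):
```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default              # illustrative profile name
    pluginConfig:
      - name: "DefaultEvictor"
      - name: "PodLifeTime"
        args:
          maxPodLifeTimeSeconds: 86400
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"      # enabling a plugin under its extension point
```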
* Descheduling profile
* Fake plugin + profile unit testing
* Rename the Profile config type to DeschedulerProfile
to avoid resemblance with profileImpl
* First run deschedule, then balance extension points
* Adding descheduler policy API Version option
in helm templates
* Updating comment for deschedulerPolicyAPIVersion
field
* Making v1alpha1 the default api version
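A sketch of the corresponding Helm value (the key name is taken from the commits above; the exact quoting is an assumption):
```yaml
# values.yaml (sketch): which DeschedulerPolicy API version the chart renders
deschedulerPolicyAPIVersion: "descheduler/v1alpha1"
```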
* v1alpha2 docs
* remove internal toc (gh has this natively)
* fix typo and newlines
* name plugins with less confusing names
* add type column
* fix kv selector and nodeSelector desc
* group plugin types in a table
* link the deprecated doc
* warning signs
* add v1alpha2 registry based conversion
* test defaults, set our 1st explicit default
* fix typos and dates
* move pluginregistry to its own dir
* remove unused v1alpha2.Namespace type
* move migration code folders, remove switch
* validate internalPolicy a single time
* remove structured logs
* simplify return
* check for nil methods
* properly check before adding default evictor
* add TODO comment
* bump copyright year
* use plugin registry and prepare for conversion
* Register plugins explicitly to a registry
* check interface impl instead of struct var
* setup plugins at top level
* treat plugin type combinations
* pass registry as arg of V1alpha1ToInternal
* move registry yet another level up
* check interface type separately
* Remove log level from Errors
Every error printed via Errors is expected to be important and always
printable.
* Invoke first Deschedule and then Balance extension points (breaking change)
* Separate plugin arg conversion from pluginsMap
* Separate profile population from plugin execution
* Convert strategy params into profiles outside the main descheduling loop
Strategy params are static and do not change in time.
* Bump the internal DeschedulerPolicy to v1alpha2
Drop conversion from v1alpha1 to internal
* add tests to v1alpha1 to internal conversion
* add tests to strategyParamsToPluginArgs params wiring
* in v1alpha1 evictableNamespaces are still Namespaces
* add test passing in all params
Co-authored-by: Lucas Severo Alves <lseveroa@redhat.com>
--help is now a command, which means explicitly providing a command override in Kubernetes is no longer required. You can now simply provide the necessary arguments.
This commit changes the build_info metric labels:
- The AppVersion label will show the major+minor version, for example 0.24.1
- minor version numbers and commit hash
Signed-off-by: eminaktas <eminaktas34@gmail.com>
This does the following:
1. Enables RemovePodsHavingTooManyRestarts by default when using Helm (it is currently not enabled by default)
2. Adds RemovePodsHavingTooManyRestarts to the values.yaml for clearer configs
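A sketch of what such a values.yaml entry could look like (the threshold and init-container settings are illustrative, not necessarily the chart's defaults):
```yaml
deschedulerPolicy:
  strategies:
    RemovePodsHavingTooManyRestarts:
      enabled: true
      params:
        podsHavingTooManyRestarts:
          podRestartThreshold: 100
          includingInitContainers: true
```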
update go to 1.19 and helm kubernetes cluster to 1.25
bump -rc.0 to 1.25 GA
bump k8s utils library
bump golang-ci
use go 1.19 for helm github action
upgrade kubectl from 0.20 to 0.25
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
When an error is returned, a strategy either stops completely or starts
processing another node. Given the error can be transient, or only one
of the limits may be exceeded, it is fairer to just skip the
pod that failed eviction and proceed to the next one instead.
In order to optimize the processing and stop earlier, it is more
practical to implement a check which reports when a limit has been
exceeded.
The method uses the node object only to get the node name, which
can be retrieved from the pod object instead.
Some strategies might try to evict a pod in Pending state which
does not have the .spec.nodeName field set. In that case, the test
for the node limit is skipped.
Both LowNode and HighNode utilization strategies evict only as many pods
as there are free resources on other nodes. Thus, the resource fit test
is always true by definition.
Add taint exclusion to RemovePodsViolatingNodeTaints. This permits node
taints to be ignored by allowing users to specify ignored taint keys or
ignored taint key=value pairs.
Currently, when the descheduler is running with --dry-run enabled, no strategy actually
evicts a pod, so every strategy always starts with a complete list of
pods. E.g. when the PodLifeTime strategy evicts a few pods, the RemoveDuplicatePods
strategy still takes into account even the pods eliminated by the PodLifeTime
strategy. This does not correspond to real-world scenarios, as the
same pod can end up evicted multiple times. Instead, use a fake client and
evict/delete the pods from its cache so the strategies evict each pod
at most once, as would normally happen in a real cluster.
This patch adds a policy option (evictFailedBarePods) to allow failed
pods without ownerReferences to be evicted. For backward compatibility,
the option is disabled by default. Addresses #644.
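A hypothetical policy opting in to the new behavior could look like this (same format as the policy examples later in this document):
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
evictFailedBarePods: true   # opt in; disabled by default for backward compatibility
strategies:
  ...
```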
calcContainerRestarts sums over containers. The new language makes
that clear, avoiding potential confusion vs. an alternative that looked
for pods where a single container had passed the configured threshold.
For example, with three containers with 50 restarts and a threshold of
100, the actual "sum over containers" logic makes that pod a candidate
for descheduling, but the "largest single container restart count"
hypothetical would not have made it a candidate.
Also shifts labelSelector into the parameter table, because when it
was added in 29ade13ce7 (README and e2e-testcase add for
labelSelector, 2021-03-02, #510), it landed a few lines too high.
description: Descheduler for Kubernetes is used to rebalance clusters by evicting pods that can potentially be scheduled on better nodes. In the current implementation, descheduler does not schedule replacement of evicted pods but relies on the default scheduler for that.
| Parameter | Description | Default |
|-----------|-------------|---------|
| `cronJobApiVersion` | CronJob API Group Version | `"batch/v1"` |
| `schedule` | The cron schedule to run the _descheduler_ job on | `"*/2 * * * *"` |
| `startingDeadlineSeconds` | If set, configure `startingDeadlineSeconds` for the _descheduler_ job | `nil` |
| `successfulJobsHistoryLimit` | If set, configure `successfulJobsHistoryLimit` for the _descheduler_ job | `nil` |
| `failedJobsHistoryLimit` | If set, configure `failedJobsHistoryLimit` for the _descheduler_ job | `nil` |
| `deschedulingInterval` | If using kind:Deployment, sets time between consecutive descheduler executions. | `5m` |
| `cmdOptions` | The options to pass to the _descheduler_ command | _see values.yaml_ |
| `deschedulerPolicy.strategies` | The _descheduler_ strategies to apply | _see values.yaml_ |
| `priorityClassName` | The name of the priority class to add to pods | `system-cluster-critical` |
| `rbac.create`| If `true`, create & use RBAC resources | `true` |
| `podSecurityPolicy.create` | If `true`, create PodSecurityPolicy | `true` |
| `resources` | Descheduler container CPU and memory requests/limits | _see values.yaml_ |
| `serviceAccount.create` | If `true`, create a service account for the cron job | `true` |
| `serviceAccount.name` | The name of the service account to use, if not set and create is true a name is generated using the fullname template | `nil` |
| `nodeSelector` | Node selectors to run the descheduler cronjob on specific nodes | `nil` |
| `tolerations` | Tolerations to run the descheduler cronjob on specific nodes | `nil` |
| Parameter | Description | Default |
|-----------|-------------|---------|
| `cronJobApiVersion` | CronJob API Group Version | `"batch/v1"` |
| `schedule` | The cron schedule to run the _descheduler_ job on | `"*/2 * * * *"` |
| `startingDeadlineSeconds` | If set, configure `startingDeadlineSeconds` for the _descheduler_ job | `nil` |
| `successfulJobsHistoryLimit` | If set, configure `successfulJobsHistoryLimit` for the _descheduler_ job | `3` |
| `failedJobsHistoryLimit` | If set, configure `failedJobsHistoryLimit` for the _descheduler_ job | `1` |
| `ttlSecondsAfterFinished` | If set, configure `ttlSecondsAfterFinished` for the _descheduler_ job | `nil` |
| `deschedulingInterval` | If using kind:Deployment, sets time between consecutive descheduler executions. | `5m` |
| `replicas` | The replica count for Deployment | `1` |
| `leaderElection` | The options for high availability when running replicated components | _see values.yaml_ |
| `cmdOptions` | The options to pass to the _descheduler_ command | _see values.yaml_ |
| `deschedulerPolicy.strategies` | The _descheduler_ strategies to apply | _see values.yaml_ |
| `priorityClassName` | The name of the priority class to add to pods | `system-cluster-critical` |
| `rbac.create` | If `true`, create & use RBAC resources | `true` |
| `resources` | Descheduler container CPU and memory requests/limits | _see values.yaml_ |
| `serviceAccount.create` | If `true`, create a service account for the cronjob | `true` |
| `serviceAccount.name` | The name of the service account to use, if not set and create is true a name is generated using the fullname template | `nil` |
| `serviceAccount.annotations` | Specifies custom annotations for the serviceAccount | `{}` |
| `podAnnotations` | Annotations to add to the descheduler Pods | `{}` |
| `podLabels` | Labels to add to the descheduler Pods | `{}` |
| `nodeSelector` | Node selectors to run the descheduler cronjob/deployment on specific nodes | `nil` |
| `service.enabled` | If `true`, create a service for deployment | `false` |
| `serviceMonitor.enabled` | If `true`, create a ServiceMonitor for deployment | `false` |
| `serviceMonitor.namespace` | The namespace where Prometheus expects to find service monitors | `nil` |
| `serviceMonitor.interval` | The scrape interval. If not set, the Prometheus default scrape interval is used | `nil` |
| `serviceMonitor.honorLabels` | Keeps the scraped data's labels when they collide with target labels | `true` |
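For example, exposing metrics to Prometheus with the values above could be sketched as follows (the scrape interval is illustrative):
```yaml
service:
  enabled: true
serviceMonitor:
  enabled: true
  interval: 30s
  honorLabels: true
```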
WARNING: You set the replica count to 1 and the workload kind to Deployment, however leaderElection is not enabled. Consider enabling leader election for HA mode.
{{- end}}
{{- if .Values.leaderElection }}
{{- if and (hasKey .Values.cmdOptions "dry-run") (eq (get .Values.cmdOptions "dry-run") true) }}
WARNING: You enabled DryRun mode; you can't use Leader Election.
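To avoid the first warning when running as a Deployment, a values sketch along these lines could be used (only `kind`, `replicas`, and `leaderElection` appear in the parameter table above; the nested `enabled` key is an assumption about the chart's leaderElection block):
```yaml
kind: Deployment
replicas: 2
leaderElection:
  enabled: true    # assumed field; turns on leader election for the replicas
# note: leader election cannot be combined with cmdOptions dry-run, per the warning above
```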
fs.StringVar(&rs.Logging.Format, "logging-format", "text", `Sets the log format. Permitted formats: "text", "json". Non-default formats don't honor these flags: --add-dir-header, --alsologtostderr, --log-backtrace-at, --log-dir, --log-file, --log-file-max-size, --logtostderr, --skip-headers, --skip-log-headers, --stderrthreshold, --log-flush-frequency.\nNon-default choices are currently alpha and subject to change without warning.`)
fs.StringVar(&rs.Logging.Format, "logging-format", "text", `Sets the log format. Permitted formats: "text", "json". Non-default formats don't honor these flags: --add-dir-header, --alsologtostderr, --log-backtrace-at, --log_dir, --log_file, --log_file_max_size, --logtostderr, --skip-headers, --skip-log-headers, --stderrthreshold, --log-flush-frequency.\nNon-default choices are currently alpha and subject to change without warning.`)
fs.DurationVar(&rs.DeschedulingInterval, "descheduling-interval", rs.DeschedulingInterval, "Time interval between two consecutive descheduler executions. Setting this value instructs the descheduler to run in a continuous loop at the interval specified.")
fs.StringVar(&rs.KubeconfigFile, "kubeconfig", rs.KubeconfigFile, "File with kube configuration.")
fs.StringVar(&rs.ClientConnection.Kubeconfig, "kubeconfig", rs.ClientConnection.Kubeconfig, "File with kube configuration. Deprecated, use client-connection-kubeconfig instead.")
fs.StringVar(&rs.ClientConnection.Kubeconfig, "client-connection-kubeconfig", rs.ClientConnection.Kubeconfig, "File path to kube configuration for interacting with kubernetes apiserver.")
fs.Float32Var(&rs.ClientConnection.QPS, "client-connection-qps", rs.ClientConnection.QPS, "QPS to use for interacting with kubernetes apiserver.")
fs.Int32Var(&rs.ClientConnection.Burst, "client-connection-burst", rs.ClientConnection.Burst, "Burst to use for interacting with kubernetes apiserver.")
fs.StringVar(&rs.PolicyConfigFile, "policy-config-file", rs.PolicyConfigFile, "File with descheduler policy configuration.")
fs.BoolVar(&rs.DryRun, "dry-run", rs.DryRun, "execute descheduler in dry run mode.")
// node-selector query causes descheduler to run only on nodes that match the node labels in the query
fs.StringVar(&rs.NodeSelector, "node-selector", rs.NodeSelector, "DEPRECATED: selector (label query) to filter on, supports '=', '==', and '!='. (e.g. -l key1=value1,key2=value2)")
// max-pods-to-evict-per-node limits the maximum number of pods to be evicted per node by descheduler.
fs.IntVar(&rs.MaxNoOfPodsToEvictPerNode, "max-pods-to-evict-per-node", rs.MaxNoOfPodsToEvictPerNode, "DEPRECATED: limits the maximum number of pods to be evicted per node by descheduler")
// evict-local-storage-pods allows eviction of pods that are using local storage. This is false by default.
fs.BoolVar(&rs.EvictLocalStoragePods, "evict-local-storage-pods", rs.EvictLocalStoragePods, "DEPRECATED: enables evicting pods using local storage by descheduler")
fs.BoolVar(&rs.DryRun, "dry-run", rs.DryRun, "Execute descheduler in dry run mode.")
fs.BoolVar(&rs.DisableMetrics, "disable-metrics", rs.DisableMetrics, "Disables metrics. The metrics are by default served through https://localhost:10258/metrics. Secure address, resp. port can be changed through --bind-address, resp. --secure-port flags.")
The descheduler evicts pods which may be bound to less desired nodes
```
descheduler [flags]
```
### Options
```
--bind-address ip The IP address on which to listen for the --secure-port port. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces will be used. (default 0.0.0.0)
--cert-dir string The directory where the TLS certs are located. If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored. (default "apiserver.local.config/certificates")
--client-connection-burst int32 Burst to use for interacting with kubernetes apiserver.
--client-connection-kubeconfig string File path to kube configuration for interacting with kubernetes apiserver.
--client-connection-qps float32 QPS to use for interacting with kubernetes apiserver.
--descheduling-interval duration Time interval between two consecutive descheduler executions. Setting this value instructs the descheduler to run in a continuous loop at the interval specified.
--disable-metrics Disables metrics. The metrics are by default served through https://localhost:10258/metrics. Secure address, resp. port can be changed through --bind-address, resp. --secure-port flags.
--dry-run Execute descheduler in dry run mode.
-h, --help help for descheduler
--http2-max-streams-per-connection int The limit that the server gives to clients for the maximum number of streams in an HTTP/2 connection. Zero means to use golang's default.
--kubeconfig string File with kube configuration. Deprecated, use client-connection-kubeconfig instead.
--leader-elect Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability.
--leader-elect-lease-duration duration The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 2m17s)
--leader-elect-renew-deadline duration The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than the lease duration. This is only applicable if leader election is enabled. (default 1m47s)
--leader-elect-resource-lock string The type of resource object that is used for locking during leader election. Supported options are 'leases', 'endpointsleases' and 'configmapsleases'. (default "leases")
--leader-elect-resource-name string The name of resource object that is used for locking during leader election. (default "descheduler")
--leader-elect-resource-namespace string The namespace of resource object that is used for locking during leader election. (default "kube-system")
--leader-elect-retry-period duration The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 26s)
--logging-format string Sets the log format. Permitted formats: "text", "json". Non-default formats don't honor these flags: --add-dir-header, --alsologtostderr, --log-backtrace-at, --log_dir, --log_file, --log_file_max_size, --logtostderr, --skip-headers, --skip-log-headers, --stderrthreshold, --log-flush-frequency.\nNon-default choices are currently alpha and subject to change without warning. (default "text")
--permit-address-sharing If true, SO_REUSEADDR will be used when binding the port. This allows binding to wildcard IPs like 0.0.0.0 and specific IPs in parallel, and it avoids waiting for the kernel to release sockets in TIME_WAIT state. [default=false]
--permit-port-sharing If true, SO_REUSEPORT will be used when binding the port, which allows more than one instance to bind on the same address and port. [default=false]
--policy-config-file string File with descheduler policy configuration.
--secure-port int The port on which to serve HTTPS with authentication and authorization. If 0, don't serve HTTPS at all. (default 10258)
--tls-cert-file string File containing the default x509 Certificate for HTTPS. (CA cert, if any, concatenated after server cert). If HTTPS serving is enabled, and --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to the directory specified by --cert-dir.
--tls-cipher-suites strings Comma-separated list of cipher suites for the server. If omitted, the default Go cipher suites will be used.
--tls-sni-cert-key namedCertKey A pair of x509 certificate and private key file paths, optionally suffixed with a list of domain patterns which are fully qualified domain names, possibly with prefixed wildcard segments. The domain patterns also allow IP addresses, but IPs should only be used if the apiserver has visibility to the IP address requested by a client. If no domain patterns are provided, the names of the certificate are extracted. Non-wildcard matches trump over wildcard matches, explicit domain patterns trump over extracted names. For multiple key/certificate pairs, use the --tls-sni-cert-key multiple times. Examples: "example.crt,example.key" or "foo.crt,foo.key:*.foo.com,foo.com". (default [])
```
### SEE ALSO
* [descheduler version](descheduler_version.md) - Version of descheduler
Run the helm test for a particular descheduler release by setting the variables below:
```
HELM_IMAGE_REPO="descheduler"
HELM_IMAGE_TAG="helm-test"
HELM_CHART_LOCATION="./charts/descheduler"
```
The helm tests run as part of the descheduler CI. To run them manually from the descheduler root:
```
make test-helm
```
## Format Code
After making changes in the code base, ensure that the code is formatted correctly:
```
make fmt
```
## Build Helm Package locally
If you made some changes in the chart, and just want to check if templating is ok, or if the chart is buildable, you can run this command to have a package built from the `./charts` directory.
```
make build-helm
```
## Lint Helm Chart locally
To check linting of your changes in the helm chart locally you can run:
```
make lint-chart
```
## Test helm changes locally with kind and ct
You will need kind and docker (or equivalent) installed. We can use the public ct image to avoid installing ct and all of its dependencies.
Starting with release v0.18.0 there is an official helm chart that can be used to install the
descheduler. See the [helm chart README](https://github.com/kubernetes-sigs/descheduler/blob/master/charts/descheduler/README.md) for detailed instructions.
The descheduler helm chart is also listed on the [artifact hub](https://artifacthub.io/packages/helm/descheduler/descheduler).
### Install Using Kustomize
You can use kustomize to install descheduler.
See the [resources | Kustomize](https://kubectl.docs.kubernetes.io/references/kustomize/cmd/build/) for detailed instructions.
See the [user guide](docs/user-guide.md) in the `/docs` directory.
## Policy and Strategies
Descheduler's policy is configurable and includes strategies that can be enabled or disabled. By default, all strategies are enabled.
The policy includes a common configuration that applies to all the strategies:
| Name | Default Value | Description |
|------|---------------|-------------|
| `nodeSelector` | `nil` | limiting the nodes which are processed |
| `evictLocalStoragePods` | `false` | allows eviction of pods with local storage |
| `evictSystemCriticalPods` | `false` | [Warning: Will evict Kubernetes system pods] allows eviction of pods with any priority, including system pods like kube-dns |
| `ignorePvcPods` | `false` | set whether PVC pods should be evicted or ignored |
| `maxNoOfPodsToEvictPerNode` | `nil` | maximum number of pods evicted from each node (summed through all strategies) |
| `maxNoOfPodsToEvictPerNamespace` | `nil` | maximum number of pods evicted from each namespace (summed through all strategies) |
| `evictFailedBarePods` | `false` | allow eviction of pods without owner references and in failed phase |
As part of the policy, the parameters associated with each strategy can be configured.
See each strategy for details on available parameters.
**Policy:**
```yaml
apiVersion:"descheduler/v1alpha1"
kind:"DeschedulerPolicy"
nodeSelector:prod=dev
evictFailedBarePods:false
evictLocalStoragePods:true
evictSystemCriticalPods:true
maxNoOfPodsToEvictPerNode:40
ignorePvcPods:false
strategies:
...
```
The following diagram provides a visualization of most of the strategies to help
categorize how strategies fit together.

### RemoveDuplicates
This strategy makes sure that there is only one pod associated with a ReplicaSet (RS),
ReplicationController (RC), StatefulSet, or Job running on the same node. If there are more,
those duplicate pods are evicted for better spreading of pods in a cluster. This issue could happen
if some nodes went down for whatever reason, and pods on them were moved to other nodes, leading to
more than one pod associated with an RS or RC, for example, running on the same node. Once the failed nodes
are ready again, this strategy could be enabled to evict those duplicate pods.
It provides one optional parameter, `excludeOwnerKinds`, which is a list of OwnerRef `Kind`s. If a pod
has any of these `Kind`s listed as an `OwnerRef`, that pod will not be considered for eviction. Note that
pods created by Deployments are considered for eviction by this strategy. The `excludeOwnerKinds` parameter
should include `ReplicaSet` to have pods created by Deployments excluded.
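For instance, a policy along these lines (v1alpha1 format, as elsewhere in this document) excludes pods whose owner is a ReplicaSet, and therefore pods created by Deployments:
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      removeDuplicates:
        excludeOwnerKinds:
          - "ReplicaSet"
```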
Policy should pass the following validation checks:
* Three basic native types of resources are supported: `cpu`, `memory` and `pods`.
If any of these resource types is not specified, all its thresholds default to 100% to avoid nodes going from underutilized to overutilized.
* Extended resources are supported. For example, resource type `nvidia.com/gpu` is specified for GPU node utilization. Extended resources are optional,
and will not be used to compute a node's usage unless they are specified in both `thresholds` and `targetThresholds` explicitly.
* `thresholds` or `targetThresholds` can not be nil and they must configure exactly the same types of resources.
* The valid range of the resource's percentage value is \[0, 100\]
* Percentage value of `thresholds` can not be greater than `targetThresholds` for the same resource.
There is another parameter associated with the `LowNodeUtilization` strategy, called `numberOfNodes`.
This parameter can be configured to activate the strategy only when the number of under utilized nodes
is above the configured value. This could be helpful in large clusters where a few nodes could go
under utilized frequently or for a short period of time. By default, `numberOfNodes` is set to zero.
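Putting the checks above together, a `LowNodeUtilization` configuration sketch might look like the following (the percentage values are illustrative; `numberOfNodes` is omitted, so it defaults to zero):
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:           # nodes with usage below all of these are considered underutilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:     # nodes with usage above any of these are considered overutilized
          "cpu": 50
          "memory": 50
          "pods": 50
```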
### HighNodeUtilization
This strategy finds nodes that are under utilized and evicts pods from the nodes in the hope that these pods will be
scheduled compactly into fewer nodes. Used in conjunction with node auto-scaling, this strategy is intended to help
trigger down scaling of under utilized nodes.
This strategy **must** be used with the scheduler scoring strategy `MostAllocated`. The parameters of this strategy are
configured under `nodeResourceUtilizationThresholds`.
The under utilization of nodes is determined by a configurable threshold `thresholds`. The threshold
`thresholds` can be configured for cpu, memory, number of pods, and extended resources in terms of percentage. The percentage is
calculated as the current resources requested on the node vs [total allocatable](https://kubernetes.io/docs/concepts/architecture/nodes/#capacity).
For pods, this means the number of pods on the node as a fraction of the pod capacity set for that node.
If a node's usage is below threshold for all (cpu, memory, number of pods and extended resources), the node is considered underutilized.
Currently, pods' resource requests are considered for computing node resource utilization.
Any node above `thresholds` is considered appropriately utilized and is not considered for eviction.
The `thresholds` param could be tuned as per your cluster requirements. Note that this
strategy evicts pods from `underutilized nodes` (those with usage below `thresholds`)
so that they can be recreated in appropriately utilized nodes.
The strategy will abort if the number of `underutilized nodes` or `appropriately utilized nodes` is zero.
**NOTE:** Node resource consumption is determined by the requests and limits of pods, not actual usage.
This approach is chosen in order to maintain consistency with the kube-scheduler, which follows the same
design for scheduling pods onto nodes. This means that resource usage as reported by Kubelet (or commands
like `kubectl top`) may differ from the calculated consumption, due to these components reporting
actual usage metrics. Implementing metrics-based descheduling is currently TODO for the project.
**Parameters:**
|Name|Type|
|---|---|
|`thresholds`|map(string:int)|
|`numberOfNodes`|int|
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|
Policy should pass the following validation checks:
* Three basic native types of resources are supported: `cpu`, `memory` and `pods`. If any of these resource types is not specified, all its thresholds default to 100%.
* Extended resources are supported. For example, resource type `nvidia.com/gpu` is specified for GPU node utilization. Extended resources are optional, and will not be used to compute a node's usage unless they are specified in `thresholds` explicitly.
* `thresholds` can not be nil.
* The valid range of the resource's percentage value is \[0, 100\]
There is another parameter associated with the `HighNodeUtilization` strategy, called `numberOfNodes`.
This parameter can be configured to activate the strategy only when the number of under utilized nodes
is above the configured value. This could be helpful in large clusters where a few nodes could go
under utilized frequently or for a short period of time. By default, `numberOfNodes` is set to zero.
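Analogously, a `HighNodeUtilization` sketch only needs `thresholds` (the percentage values are illustrative):
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "HighNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:   # nodes with usage below all of these are candidates for draining
          "cpu": 20
          "memory": 20
          "pods": 20
```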
### RemovePodsViolatingInterPodAntiAffinity
This strategy makes sure that pods violating interpod anti-affinity are removed from nodes. For example,
if there is podA on a node, and podB and podC (running on the same node) have anti-affinity rules which prohibit
them from running on the same node, then podA will be evicted from the node so that podB and podC can run. This
issue could happen when the anti-affinity rules for podB and podC are created while they are already running on
the node.
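A minimal sketch that enables the strategy (it needs no strategy-specific parameters beyond the common filters listed below):
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
```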
**Parameters:**
|Name|Type|
|---|---|
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|
**Example:**
````yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemovePodsViolatingNodeTaints":
enabled: true
params:
excludedTaints:
- dedicated=special-user # exclude taints with key "dedicated" and value "special-user"
- reserved # exclude all taints with key "reserved"
````
### RemovePodsViolatingTopologySpreadConstraint
This strategy makes sure that pods violating [topology spread constraints](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
are evicted from nodes. Specifically, it tries to evict the minimum number of pods required to balance topology domains to within each constraint's `maxSkew`.
This strategy requires k8s version 1.18 at a minimum.
By default, this strategy only deals with hard constraints; setting the parameter `includeSoftConstraints` to `true` will
include soft constraints as well.
Strategy parameter `labelSelector` is not utilized when balancing topology domains and is only applied during eviction to determine if the pod can be evicted.
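For example, a sketch that also opts in to soft constraints (v1alpha1 format, as used elsewhere in this document):
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      includeSoftConstraints: true   # also consider ScheduleAnyway constraints
```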
**Parameters:**
|Name|Type|
|---|---|
|`includeSoftConstraints`|bool|
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|