User Guide

Starting with descheduler release v0.10.0 container images are available in the official k8s container registry.

Descheduler Version	Container Image	Architectures
v0.25.1	registry.k8s.io/descheduler/descheduler:v0.25.1	AMD64 ARM64 ARMv7
v0.25.0	registry.k8s.io/descheduler/descheduler:v0.25.0	AMD64 ARM64 ARMv7
v0.24.1	registry.k8s.io/descheduler/descheduler:v0.24.1	AMD64 ARM64 ARMv7
v0.24.0	registry.k8s.io/descheduler/descheduler:v0.24.0	AMD64 ARM64 ARMv7
v0.23.1	registry.k8s.io/descheduler/descheduler:v0.23.1	AMD64 ARM64 ARMv7
v0.22.0	registry.k8s.io/descheduler/descheduler:v0.22.0	AMD64 ARM64 ARMv7
v0.21.0	registry.k8s.io/descheduler/descheduler:v0.21.0	AMD64 ARM64 ARMv7
v0.20.0	registry.k8s.io/descheduler/descheduler:v0.20.0	AMD64 ARM64
v0.19.0	registry.k8s.io/descheduler/descheduler:v0.19.0	AMD64
v0.18.0	registry.k8s.io/descheduler/descheduler:v0.18.0	AMD64
v0.10.0	registry.k8s.io/descheduler/descheduler:v0.10.0	AMD64

Note that multi-arch container images cannot be pulled by kind from a registry. Therefore starting with descheduler release v0.20.0 use the below process to download the official descheduler image into a kind cluster.

kind create cluster
docker pull registry.k8s.io/descheduler/descheduler:v0.20.0
kind load docker-image registry.k8s.io/descheduler/descheduler:v0.20.0

Policy Configuration Examples

The examples directory has descheduler policy configuration examples.

CLI Options

The descheduler has many CLI options that can be used to override its default behavior.

descheduler --help
The descheduler evicts pods which may be bound to less desired nodes

Usage:
  descheduler [flags]
  descheduler [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  version     Version of descheduler

Flags:
      --bind-address ip                          The IP address on which to listen for the --secure-port port. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces will be used. (default 0.0.0.0)
      --cert-dir string                          The directory where the TLS certs are located. If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored. (default "apiserver.local.config/certificates")
      --client-connection-burst int32            Burst to use for interacting with kubernetes apiserver.
      --client-connection-kubeconfig string      File path to kube configuration for interacting with kubernetes apiserver.
      --client-connection-qps float32            QPS to use for interacting with kubernetes apiserver.
      --descheduling-interval duration           Time interval between two consecutive descheduler executions. Setting this value instructs the descheduler to run in a continuous loop at the interval specified.
      --disable-metrics                          Disables metrics. The metrics are by default served through https://localhost:10258/metrics. Secure address, resp. port can be changed through --bind-address, resp. --secure-port flags.
      --dry-run                                  execute descheduler in dry run mode.
  -h, --help                                     help for descheduler
      --http2-max-streams-per-connection int     The limit that the server gives to clients for the maximum number of streams in an HTTP/2 connection. Zero means to use golang's default.
      --kubeconfig string                        File with kube configuration. Deprecated, use client-connection-kubeconfig instead.
      --leader-elect                             Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability.
      --leader-elect-lease-duration duration     The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 2m17s)
      --leader-elect-renew-deadline duration     The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than the lease duration. This is only applicable if leader election is enabled. (default 1m47s)
      --leader-elect-resource-lock string        The type of resource object that is used for locking during leader election. Supported options are 'leases', 'endpointsleases' and 'configmapsleases'. (default "leases")
      --leader-elect-resource-name string        The name of resource object that is used for locking during leader election. (default "descheduler")
      --leader-elect-resource-namespace string   The namespace of resource object that is used for locking during leader election. (default "kube-system")
      --leader-elect-retry-period duration       The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 26s)
      --log-flush-frequency duration             Maximum number of seconds between log flushes (default 5s)
      --logging-format string                    Sets the log format. Permitted formats: "text", "json". Non-default formats don't honor these flags: --add-dir-header, --alsologtostderr, --log-backtrace-at, --log_dir, --log_file, --log_file_max_size, --logtostderr, --skip-headers, --skip-log-headers, --stderrthreshold, --log-flush-frequency.\nNon-default choices are currently alpha and subject to change without warning. (default "text")
      --permit-address-sharing                   If true, SO_REUSEADDR will be used when binding the port. This allows binding to wildcard IPs like 0.0.0.0 and specific IPs in parallel, and it avoids waiting for the kernel to release sockets in TIME_WAIT state. [default=false]
      --permit-port-sharing                      If true, SO_REUSEPORT will be used when binding the port, which allows more than one instance to bind on the same address and port. [default=false]
      --policy-config-file string                File with descheduler policy configuration.
      --secure-port int                          The port on which to serve HTTPS with authentication and authorization. If 0, don't serve HTTPS at all. (default 10258)
      --tls-cert-file string                     File containing the default x509 Certificate for HTTPS. (CA cert, if any, concatenated after server cert). If HTTPS serving is enabled, and --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to the directory specified by --cert-dir.
      --tls-cipher-suites strings                Comma-separated list of cipher suites for the server. If omitted, the default Go cipher suites will be used.
                                                 Preferred values: TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384.
                                                 Insecure values: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_RC4_128_SHA, TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_RC4_128_SHA, TLS_RSA_WITH_3DES_EDE_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_RC4_128_SHA.
      --tls-min-version string                   Minimum TLS version supported. Possible values: VersionTLS10, VersionTLS11, VersionTLS12, VersionTLS13
      --tls-private-key-file string              File containing the default x509 private key matching --tls-cert-file.
      --tls-sni-cert-key namedCertKey            A pair of x509 certificate and private key file paths, optionally suffixed with a list of domain patterns which are fully qualified domain names, possibly with prefixed wildcard segments. The domain patterns also allow IP addresses, but IPs should only be used if the apiserver has visibility to the IP address requested by a client. If no domain patterns are provided, the names of the certificate are extracted. Non-wildcard matches trump over wildcard matches, explicit domain patterns trump over extracted names. For multiple key/certificate pairs, use the --tls-sni-cert-key multiple times. Examples: "example.crt,example.key" or "foo.crt,foo.key:*.foo.com,foo.com". (default [])
  -v, --v Level                                  number for the log level verbosity
      --vmodule moduleSpec                       comma-separated list of pattern=N settings for file-filtered logging (only works for the default text log format)

Use "descheduler [command] --help" for more information about a command.

Production Use Cases

This section contains descriptions of real world production use cases.

Balance Cluster By Pod Age

When initially migrating applications from a static virtual machine infrastructure to a cloud native k8s infrastructure there can be a tendency to treat application pods like static virtual machines. One approach to help prevent developers and operators from treating pods like virtual machines is to ensure that pods only run for a fixed amount of time.

The PodLifeTime strategy can be used to ensure that old pods are evicted. It is recommended to create a pod disruption budget for each application to ensure application availability.

descheduler -v=3 --evict-local-storage-pods --policy-config-file=pod-life-time.yml

This policy configuration file ensures that pods created more than 7 days ago are evicted.

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 604800 # pods run for a maximum of 7 days

Balance Cluster By Node Memory Utilization

If your cluster has been running for a long period of time, you may find that the resource utilization is not very balanced. The following two strategies can be used to rebalance your cluster based on cpu, memory or number of pods.

Balance high utilization nodes

Using LowNodeUtilization, descheduler will rebalance the cluster based on memory by evicting pods from nodes with memory utilization over 70% to nodes with memory utilization below 20%.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "memory": 20
        targetThresholds:
          "memory": 70

Balance low utilization nodes

Using HighNodeUtilization, descheduler will rebalance the cluster based on memory by evicting pods from nodes with memory utilization lower than 20%. This should be use NodeResourcesFit with the MostAllocated scoring strategy based on these doc. The evicted pods will be compacted into minimal set of nodes.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "HighNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "memory": 20

Autoheal Node Problems

Descheduler's RemovePodsViolatingNodeTaints strategy can be combined with Node Problem Detector and Cluster Autoscaler to automatically remove Nodes which have problems. Node Problem Detector can detect specific Node problems and report them to the API server. There is a feature called TaintNodeByCondition of the node controller that takes some conditions and turns them into taints. Currently, this only works for the default node conditions: PIDPressure, MemoryPressure, DiskPressure, Ready, and some cloud provider specific conditions. The Descheduler will then deschedule workloads from those Nodes. Finally, if the descheduled Node's resource allocation falls below the Cluster Autoscaler's scale down threshold, the Node will become a scale down candidate and can be removed by Cluster Autoscaler. These three components form an autohealing cycle for Node problems.

NOTE

Once kubernetes/node-problem-detector#565 is available in NPD, we need to update this section.

14 KiB Raw Blame History