Move the newProtectedStorageClassesConfig helper function from local scope
to package level so it can be reused by both TestDefaultEvictorFilter and
Test_protectedPVCStorageClasses, eliminating code duplication.
Previously, the descheduler carried a copy of an old version of PodRequestsAndLimits that does not consider native sidecars.
It now relies on the resourcehelper libs, which will continue to receive upstream updates.
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
This commit introduces a new customization of the existing PodsWithPVC
protection. The new customization allows users to make pods that refer
to a given storage class unevictable.
For example, to protect pods referring to `storage-class-0` and
`storage-class-1`, this configuration can be used:
```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
      - name: "DefaultEvictor"
        args:
          podProtections:
            extraEnabled:
              - PodsWithPVC
            config:
              PodsWithPVC:
                protectedStorageClasses:
                  - name: storage-class-0
                  - name: storage-class-1
```
Changes introduced by this PR:
1. The descheduler starts to observe persistent volume claims.
1. A new API field was introduced to allow per-pod protection configuration.
1. RBAC had to be adjusted (+persistentvolumeclaims).
Although eviction requests (policy/v1) are not persisted long term,
their API still implements the full metav1.ObjectMeta struct. While
name and namespace refer to the pod being evicted, eviction requests
can still carry annotations.
This change adds annotations to descheduler-initiated evictions,
including the requester, reason, and the strategy or plugin that
triggered them.
While these details are already logged by the descheduler, exposing them
as annotations allows external webhooks or controllers to provide
clearer context about each eviction request, both for tracking and
prioritization purposes.
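A minimal sketch of how such annotations can be attached to a policy/v1 Eviction object via client-go; the annotation keys and helper function below are illustrative placeholders, not the exact keys the descheduler uses:
```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictWithContext issues an eviction request whose ObjectMeta carries
// context annotations about who requested the eviction and why.
func evictWithContext(ctx context.Context, client kubernetes.Interface, pod *v1.Pod, reason, plugin string) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name,
			Namespace: pod.Namespace,
			Annotations: map[string]string{
				"descheduler.example.io/requester": "sigs.k8s.io/descheduler", // placeholder key
				"descheduler.example.io/reason":    reason,                    // why the pod was selected
				"descheduler.example.io/plugin":    plugin,                    // strategy/plugin that triggered it
			},
		},
	}
	// Webhooks and controllers observing eviction requests can read these
	// annotations for tracking and prioritization.
	return client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction)
}
```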
Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>
tracing.Shutdown() uses the context, so we must guarantee that the
context we use is valid regardless of whether the original context has
already been cancelled.
This change introduces a dedicated context for the shutdown process
with an arbitrary timeout.
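A minimal sketch of the idea; the 5-second value is an arbitrary illustrative timeout, and the tracing.Shutdown signature is assumed here only for illustration:
```go
// Build a fresh context from context.Background() so it stays valid even if
// the original (parent) context has already been cancelled.
shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
tracing.Shutdown(shutdownCtx)
```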
NoEvictionPolicy dictates whether a no-eviction policy is preferred or mandatory.
It needs to be used with caution, as it gives users the ability to protect their pods
from eviction, which might work against enforced policies, e.g. plugins evicting pods
that violate security policies.
Sort pods on nodes above the ideal average based on whether they fit on other nodes that are below the average.
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
We have been carrying these no-ops for quite a while now. We should only
set defaults when they differ from the values provided by the user.
As per the Prometheus Golang client implementation, the only URL validation
done is a `url.Parse()` call. We should do the same and not
enforce the use of the https scheme.
Our README even shows an example of a descheduler config using an http
Prometheus URL scheme.
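A minimal sketch of the relaxed validation described above (the function name is illustrative):
```go
import (
	"fmt"
	"net/url"
)

// validatePrometheusURL mirrors the Prometheus Go client: accept any value
// that url.Parse can handle instead of requiring an https scheme.
func validatePrometheusURL(raw string) error {
	if _, err := url.Parse(raw); err != nil {
		return fmt.Errorf("failed to parse prometheus URL %q: %v", raw, err)
	}
	return nil
}
```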
With the strict eviction policy, the descheduler only evicts pods that
request the resource named in the given threshold. For example, if a
threshold is used for an extended resource called `example.com/gpu`, only
pods that request such a resource will be evicted.
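A minimal sketch of that check (the helper name and shape are assumptions, not the plugin's actual code):
```go
import v1 "k8s.io/api/core/v1"

// podRequestsResource reports whether at least one container of the pod
// requests the thresholded resource, e.g. v1.ResourceName("example.com/gpu").
func podRequestsResource(pod *v1.Pod, name v1.ResourceName) bool {
	for _, container := range pod.Spec.Containers {
		if quantity, ok := container.Resources.Requests[name]; ok && !quantity.IsZero() {
			return true
		}
	}
	return false
}
```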
When calculating the average and applying the deviations, it would be nice
to also see the assessed values.
This commit makes the descheduler log these values at verbosity level 3.
In some cases it might be useful to limit how many evictions can be
performed per domain, to avoid burning the whole per-descheduling-cycle
budget. Limiting the number of evictions per node is a
prerequisite for evicting pods whose usage can't be easily subtracted
from the overall node resource usage to predict the final usage, e.g. when a
pod is evicted due to high PSI pressure, which takes into account many
factors that can't be fully captured by the current predictive resource
model.
This commit adds a sample plugin implementation as follows:
This directory provides an example plugin for the Kubernetes Descheduler,
demonstrating how to evict pods based on custom criteria. The plugin targets
pods based on:
* **Name Regex:** Pods matching a specified regular expression.
* **Age:** Pods older than a defined duration.
* **Namespace:** Pods within or outside a given list of namespaces (inclusion
or exclusion).
To incorporate this plugin into your Descheduler build, you must register it
within the Descheduler's plugin registry. Follow these steps:
1. **Register the Plugin:**
* Modify the `pkg/descheduler/setupplugins.go` file.
* Add the following registration line to the end of the
`RegisterDefaultPlugins()` function:
```go
pluginregistry.Register(
	example.PluginName,
	example.New,
	&example.Example{},
	&example.ExampleArgs{},
	example.ValidateExampleArgs,
	example.SetDefaults_Example,
	registry,
)
```
2. **Generate Code:**
* If you modify the plugin's code, execute `make gen` before rebuilding the
Descheduler. This ensures generated code is up-to-date.
3. **Rebuild the Descheduler:**
* Build the descheduler with your changes.
Configure the plugin's behavior using the Descheduler's policy configuration.
Here's an example:
```yaml
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: LifecycleAndUtilization
    plugins:
      deschedule:
        enabled:
          - Example
    pluginConfig:
      - name: Example
        args:
          regex: ^descheduler-test.*$
          maxAge: 3m
          namespaces:
            include:
              - default
```
- `regex: ^descheduler-test.*$`: Evicts pods whose names match the regular
expression `^descheduler-test.*$`.
- `maxAge: 3m`: Evicts pods older than 3 minutes.
- `namespaces.include: - default`: Evicts pods within the default namespace.
This configuration will cause the plugin to evict pods that meet all three
criteria: matching the `regex`, exceeding the `maxAge`, and residing in the
specified namespace.
The golangci-lint tool gets stuck for a variety of reasons when
running in Prow CI. Enabling verbose output in an attempt to make
debugging easier.
ref: https://golangci-lint.run/contributing/debug/
Since the `strategies` parameter doesn't exist anywhere in the code or docs, I'm removing it from the chart README as a possible option.
It just makes things more confusing.
When the feature is enabled, each pod with the descheduler.alpha.kubernetes.io/request-evict-only
annotation will have its eviction API error examined for a specific
error code/reason and message. If matched, the eviction of such a pod will be interpreted
as the initiation of an eviction in background.
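A minimal sketch of that kind of error inspection; the status reason and message matched below are placeholders, not the exact values the feature checks for:
```go
// err is the error returned by the eviction API call for an annotated pod.
// apierrors is "k8s.io/apimachinery/pkg/api/errors".
if statusErr, ok := err.(*apierrors.StatusError); ok {
	status := statusErr.Status()
	if status.Reason == metav1.StatusReasonTooManyRequests &&
		strings.Contains(status.Message, "eviction is in background") { // placeholder message
		// Interpret the response as the initiation of an eviction in background
		// rather than as a failed eviction.
	}
}
```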
* add ignoreNonPDBPods option
* take2
* add test
* poddisruptionbudgets are now used by defaultevictor plugin
* add poddisruptionbudgets to rbac
* review comments
* don't use GetPodPodDisruptionBudgets
* review comment, don't hide error
At the time of making this commit, the package `github.com/ghodss/yaml`
is no longer actively maintained.
`sigs.k8s.io/yaml` is a permanent fork of `ghodss/yaml` and is actively
maintained by Kubernetes SIG.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* skip eviction when pod creation time is below the minPodAge threshold setting
In the default initialization phase of the descheduler, add a new
constraint to not evict pods whose age is below the minPodAge
threshold (a sketch follows below).
Added value:
- Avoids excessive pod movement when the autoscaler scales up and down.
- Avoids evicting pods while they are warming up.
- Decreases the overall cost of eviction, as no pod will be evicted
before doing a significant amount of work.
- Guards against scheduling/descheduling loops in situations where
the descheduler has different node-fit logic from the scheduler,
such as not considering topology spread constraints.
* Use *time.Duration instead of uint for MinPodAge type
* Remove '(in minutes)' from default evictor configuration table
* make fmt
* Add explicit name for Duration field
* Use Duration.String()
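A minimal sketch of the minPodAge constraint, assuming a *time.Duration arg as described above (names are illustrative):
```go
// olderThanMinPodAge returns false for pods that are still "warming up",
// i.e. younger than the configured minimum age.
func olderThanMinPodAge(pod *v1.Pod, minPodAge *time.Duration) bool {
	if minPodAge == nil {
		return true // constraint not configured
	}
	return time.Since(pod.CreationTimestamp.Time) >= *minPodAge
}
```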
Currently, all the plugins run in sequence, and no plugin executes
evictions in parallel internally.
Yet, there's no guarantee that a future plugin (e.g. a custom one)
will not attempt to evict pods in parallel.
When the node limit is exceeded, the pod eviction currently never fails,
so the metric reporting for a pod that fails to be evicted due to node
limit constraints is skipped.
Returning an error also allows plugins to react to other limits being
exceeded, e.g. the limit on the number of pods evicted per namespace.
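A minimal sketch of surfacing such a violation as a distinct error (the type name is illustrative), so the failed-eviction metric can be reported and plugins can react to other exceeded limits as well:
```go
// evictionLimitError signals that an eviction was refused because a limit
// (per node, per namespace, ...) has already been reached. Requires "fmt".
type evictionLimitError struct {
	scope string // e.g. "node" or "namespace"
	limit uint
}

func (e *evictionLimitError) Error() string {
	return fmt.Sprintf("maximum number of evicted pods per %s (%d) reached", e.scope, e.limit)
}
```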
Currently, the pod evictor is created during each descheduling cycle
to reset the internal counters and the fake client (in case a dry run is
configured). Instead, create the pod evictor once and reset only what's
needed. Later on, the pod evictor can then be extended with e.g. a cache
keeping track of eviction requests that are still in progress and
require more than a single descheduling cycle to complete.
When we cut a new release of the descheduler, we have to update the Go version in multiple places,
which presents an opportunity to miss updating one.
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
In recent Kubernetes 1.30, the code-gen flags were changed: --output-file-base -> --output-file, based on 144141734d#diff-beaa4412ca0edb2451061daa9570ce25858ec41951938fc60f17e2370462ad8e
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
Update the profiles to reflect that only Deschedule and Balance plugins are
run, and that the order is: first Deschedule of all profiles, then Balance of
all profiles.
* Allow the use of existing policy configMap.
* Update charts/descheduler/templates/configmap.yaml
Co-authored-by: Amir Alavi <amiralavi7@gmail.com>
* Remove references to unused variable and update documentation regarding deschedulerPolicy
* Add missing newLine at EOF
* Update charts/descheduler/values.yaml
* remove trailing space
---------
Co-authored-by: Amir Alavi <amiralavi7@gmail.com>
* Check if Pod matches inter-pod anti-affinity of other pod on node as part of NodeFit()
* Add unit tests for checking inter-pod anti-affinity match in NodeFit()
* Export setPodAntiAffinity() helper func to test utils
* Add docs for inter-pod anti-affinity in README
* Refactor logic for inter-pod anti-affinity to use in multiple pkgs
* Move logic for finding match between pods with antiaffinity out of framework to reuse in other pkgs
* Move interpod antiaffinity funcs to pkg/utils/predicates.go
* Add unit tests for inter-pod anti-affinity check
* Test logic in GroupByNodeName
* Test NodeFit() case where pods matches inter-pod anti-affinity
* Test for inter-pod anti-affinity pods match terms, have label selector
* NodeFit inter-pod anti-affinity check returns early if affinity spec not set
* feat: profile name for pods_evicted metric
Support a new label "profile" for the "pods_evicted" metric to allow
understanding which profiles are evicting more pods, allowing better
observability
* refactor: evictoptions improved observability
Send profile and strategy names for EvictOptions, allowing Evictors to
access observability information
* cleanup: remove unnecessary evictoption reference
* feat: evictoptions for nodeutilization
Explicit usage of options when invoking evictPods from the nodeutilization
helper function, for both highnodeutilization and lownodeutilization
When the pod creationTimestamp is later than the current time (which does not
make sense in real life, but is possible when testing such a case), the
computed age becomes a huge number if we convert it to uint; the test can
still pass, but the value doesn't make sense.
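A minimal, self-contained illustration of the overflow:
```go
created := time.Now().Add(1 * time.Hour) // pod "created" one hour in the future
age := time.Since(created)               // negative duration
fmt.Println(age)                         // roughly -1h0m0s
fmt.Println(uint(age))                   // wraps around to a value close to 2^64 on 64-bit platforms
```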
* helm: ability to specify security context for pod
* Update charts/descheduler/templates/cronjob.yaml
Co-authored-by: Amir Alavi <amiralavi7@gmail.com>
* Update charts/descheduler/templates/deployment.yaml
Co-authored-by: Amir Alavi <amiralavi7@gmail.com>
---------
Co-authored-by: Amir Alavi <amiralavi7@gmail.com>
Pods that don't pass the nodeFit condition currently log an error message
that cannot be suppressed. This changes the log level to info,
as it's a normal operating condition.
Signed-off-by: Antoine Deschênes <antoine.deschenes@linux.com>
* Add handling for node eligibility
* Make tests buildable
* Update topologyspreadconstraint.go
* Updated test cases failing
* squashed changes for test case addition
corrected function name
refactored duplicate TopoConstraint check logic
Added more test cases for testing node eligibility scenario
Added 5 test cases for testing scenarios related to node eligibility
* topologySpreadConstraints e2e: `nodeTaintsPolicy` and `nodeAffinityPolicy` constraints
---------
Co-authored-by: Marc Power <marcpow@microsoft.com>
Co-authored-by: nitindagar0 <81955199+nitindagar0@users.noreply.github.com>
* feat: Implement preferredDuringSchedulingIgnoredDuringExecution for RemovePodsViolatingNodeAffinity
Now, the descheduler can detect and evict pods that are not optimally
allocated according to the "preferred..." node affinity. It only evicts
a pod if it can be scheduled on a node that scores higher in terms of
preferred node affinity than the current one.
This can be activated by enabling the RemovePodsViolatingNodeAffinity
plugin and passing "preferredDuringSchedulingIgnoredDuringExecution" in
the args.
For example, imagine we have a pod that prefers nodes with label "key1:
value1" with a weight of 10. If this pod is scheduled on a node that
doesn't have "key1: value1" as label but there's another node that has
this label and where this pod can potentially run, then the descheduler
will evict the pod.
Another effect of this commit is that the
RemovePodsViolatingNodeAffinity plugin will no longer remove pods that don't
fit on their current node for reasons other than violating the node
affinity. Before this change, enabling this plugin could cause evictions of
pods that were running on tainted nodes without the necessary
tolerations.
This commit also fixes the wording of some tests from
node_affinity_test.go and some parameters and expectations of these
tests, which were wrong.
* Optimization on RemovePodsViolatingNodeAffinity
Before checking if a pod can be evicted or if it can be scheduled
somewhere else, we first check if it has the corresponding nodeAffinity
field defined. Otherwise, the pod is automatically discarded as a
candidate.
Apart from that, the method that calculates the weight that a pod
gives to a node based on its preferred node affinity has been
renamed to better reflect what it does.
1. Enable OTEL configuration and base framework
2. Update generated conversion spec
3. Enable Docker-based conversion and deep copy generation
4. Fix broken unit tests
* use pod informers for listing pods in removepodsviolatingtopologyspreadconstraint and removepodsviolatinginterpodantiaffinity
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
* workaround in topologyspreadconstraint test to ensure that informer's index returns pods sorted by name
---------
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
Huge clusters with thousands of pods can quickly exceed the default
watch channel size of the fake clientset, causing the channel
to panic with "channel full".
* pod anti-affinity check among nodes
* avoid pod equality check with UID field
also add node equality check with Name for short-cut
* add test case for anti-affinity violation among different node
* reduce ListPodsOnANode call
* fix old code
* apply gofumpt -w -extra
move klog/v2 import entry to bottom according to master code
* fix plugin arg conversion when using multiple profiles with same plugin
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
* PR feedback to refactor validateDeschedulerConfiguration error handling
---------
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
* update helm chart to v0.27.0
* update manifest version and docs
* fix 1.27 release version from README.md
Co-authored-by: Mike Dame <mikedame@google.com>
---------
Co-authored-by: Mike Dame <mikedame@google.com>
The Evict extension point is not currently in use.
All DefaultEvictor plugin functionality is exposed through Filter and
PreEvictionFilter extension points instead.
Thus, no need to limit the number of evictors enabled.
* bump to k8s 1.27
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
* bump go version to 1.20.3
* bump k8s version and kine for e2e
---------
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
- Populate extension points automatically from plugin types
- Make a list of enabled extension points based on a profile
configuration
- Populate filter and pre-eviction filter handles from their
corresponding extension points
* Descheduling profile
* Fake plugin + profile unit testing
* Rename Profile config type into DeschedulerProfile
To avoid resemblance with profileImpl
* First run deschedule, then balance extension points
* Adding descheduler policy API Version option
in helm templates
* Updating comment for deschedulerPolicyAPIVersion
field
* Making v1alpha1 the default api version
* v1alpha2 docs
* remove internal toc (gh has this natively)
* fix typo and newlines
* name plugins with less confusing names
* add type column
* fix kv selector and nodeSelector desc
* group plugin types in a table
* link the deprecated doc
* warning signs
* add v1alpha2 registry based conversion
* test defaults, set our 1st explicit default
* fix typos and dates
* move pluginregistry to its own dir
* remove unused v1alpha2.Namespace type
* move migration code folders, remove switch
* validate internalPolicy a single time
* remove structured logs
* simplify return
* check for nil methods
* properly check before adding default evictor
* add TODO comment
* bump copyright year
* use plugin registry and prepare for conversion
* Register plugins explicitly to a registry
* check interface impl instead of struct var
* setup plugins at top level
* treat plugin type combinations
* pass registry as arg of V1alpha1ToInternal
* move registry yet another level up
* check interface type separately
* Remove log level from Errors
Every error printed via Errors is expected to be important and always
printable.
* Invoke first Deschedule and then Balance extension points (breaking change)
* Separate plugin arg conversion from pluginsMap
* Separate profile population from plugin execution
* Convert strategy params into profiles outside the main descheduling loop
Strategy params are static and do not change in time.
* Bump the internal DeschedulerPolicy to v1alpha2
Drop conversion from v1alpha1 to internal
* add tests to v1alpha1 to internal conversion
* add tests to strategyParamsToPluginArgs params wiring
* in v1alpha1 evictableNamespaces are still Namespaces
* add test passing in all params
Co-authored-by: Lucas Severo Alves <lseveroa@redhat.com>
--help is now a CMD, which means explicitly providing a command override in Kubernetes is no longer required. You can now simply provide the necessary arguments.
This commit changes build_info metric labels
- AppVersion label will show major+minor version
for example 0.24.1
minor version numbers and commit hash
Signed-off-by: eminaktas <eminaktas34@gmail.com>
This does the following:
1. Enables RemovePodsHavingTooManyRestarts when using Helm by default (it is not currently)
2. Adds RemovePodsHavingTooManyRestarts to the values.yaml for clearer configs
update go to 1.19 and helm kubernetes cluster to 1.25
bump -rc.0 to 1.25 GA
bump k8s utils library
bump golang-ci
use go 1.19 for helm github action
upgrade kubectl from 0.20 to 0.25
Signed-off-by: Amir Alavi <amiralavi7@gmail.com>
When an error is returned, a strategy either stops completely or starts
processing another node. Given that the error can be transient, or that
only one of the limits may have been exceeded, it is fairer to just skip a
pod that failed eviction and proceed to the next one instead.
In order to optimize the processing and stop earlier, it is more
practical to implement a check that reports when a limit has been
exceeded.
The method uses the node object only to get the node name, and the node
name can be retrieved from the pod object.
Some strategies might try to evict a pod in Pending state, which
does not have the .spec.nodeName field set; in that case, the node limit
check is skipped.
Both LowNode and HighNode utilization strategies evict only as many pods
as there's free resources on other nodes. Thus, the resource fit test
is always true by definition.
Add taint exclusion to RemovePodsViolatingNodeTaints. This permits node
taints to be ignored by allowing users to specify ignored taint keys or
ignored taint key=value pairs.
Currently, when the descheduler runs with --dry-run enabled, no strategy actually
evicts a pod, so every strategy always starts with a complete list of
pods. E.g. when the PodLifeTime strategy evicts a few pods, the RemoveDuplicatePods
strategy still takes into account even the pods eliminated by the PodLifeTime
strategy. This does not correspond to real-world scenarios, as the
same pod can be evicted multiple times. Instead, use a fake client and
evict/delete the pods from its cache so the strategies evict each pod
at most once, as would normally happen in a real cluster.
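A minimal sketch of that dry-run approach (podObjects, pod, and ctx are assumed to exist in the surrounding code):
```go
// Seed a fake clientset with the real pods; "evicting" a pod then simply
// deletes it from the fake cache, so no strategy can evict it twice.
fakeClient := fake.NewSimpleClientset(podObjects...) // k8s.io/client-go/kubernetes/fake
err := fakeClient.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
```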
This patch adds a policy (evictFailedBarePods) to allow failed
pods without ownerReferences to be evicted. For backward compatibility,
the policy is disabled by default. Addresses #644.
calcContainerRestarts sums over containers. The new language makes
that clear, avoiding potential confusion with an alternative that looked
for pods where a single container had passed the configured threshold.
For example, with three containers with 50 restarts and a threshold of
100, the actual "sum over containers" logic makes that pod a candidate
for descheduling, but the "largest single container restart count"
hypothetical would not have made it a candidate.
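A minimal sketch of the "sum over containers" semantics (not the plugin's exact helper):
```go
// totalContainerRestarts adds up the restart counts of all containers in the
// pod; the sum is what gets compared against the configured threshold.
func totalContainerRestarts(pod *v1.Pod) int32 {
	var restarts int32
	for _, containerStatus := range pod.Status.ContainerStatuses {
		restarts += containerStatus.RestartCount
	}
	return restarts
}
```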
Also shifts labelSelector into the parameter table, because when it
was added in 29ade13ce7 (README and e2e-testcase add for
labelSelector, 2021-03-02, #510), it landed a few lines too high.
RemoveDuplicates: take node taints, node affinity and node selector into account when computing the number of feasible nodes for the average occurrence of pods per node
Nodes with taints that are not tolerated by the evicted pods will never run those
pods. The same holds for node affinity and node selector.
So increase the number of pods per feasible node to decrease the
number of evicted pods.
Always use structured logging. Therefore update klog.Errorf() to instead
use klog.ErrorS().
Here is an example of the new log message.
E0428 23:58:57.048912 586 descheduler.go:145] "skipping strategy" err="unknown strategy name" strategy=ASDFPodLifeTime
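The call producing a message of that shape looks roughly like this (a sketch, not the exact code):
```go
klog.ErrorS(fmt.Errorf("unknown strategy name"), "skipping strategy", "strategy", name)
```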
The master branch always represents the next release of the
descheduler. Therefore applying the descheduler k8s manifests
from the master branch is not considered stable. It is best for
users to install descheduler using the released tags.
Similar to ReplicaSet, ReplicationController, and Job pods, pods with a
StatefulSet metadata.ownerReference are considered for eviction.
Document this, so that it is clear to end users.
The resources.cpuRequest and resources.memoryRequest variables are
not valid in the helm chart values.yaml file. The correct variable
name for setting the requests and limits is resources.
Also, fixed white space alignment in the markdown table.
New metrics:
- build_info: Build info about descheduler, including Go version, Descheduler version, Git SHA, Git branch
- pods_evicted: Number of successfully evicted pods, by the result, by the strategy, by the namespace
For roughly the past year damemi has been the only active approver for
the descheduler. Therefore move the inactive approvers to emeritus
status. This will help clarify to contributors who should be assigned to
pull requests.
Previous to this change official descheduler container images only
supported the AMD64 hardware architecture. This change enables
building official descheduler container images for multiple
architectures.
The initially supported architectures are AMD64 and ARM64.
Prior to this commit the helm chart used to install the descheduler
CronJob did not set container requests or limits. This is considered
an anti-pattern when deploying applications on k8s.
Set descheduler container resources to make it a burstable pod. This
will ensure a high quality experience for end users when deploying
descheduler into their clusters using the instructions from the README.
The default values chosen for CPU/Memory are not based on any real data.
Prior to this change, the output from the command "descheduler version",
when run using the official container images from k8s.gcr.io, would
always contain empty version fields. See below for an example.
```
docker run k8s.gcr.io/descheduler/descheduler:v0.19.0 /bin/descheduler version
Descheduler version {Major: Minor: GitCommit: GitVersion: BuildDate:2020-09-01T16:43:23+0000 GoVersion:go1.15 Compiler:gc Platform:linux/amd64}
```
This change makes it possible to pass the descheduler version
information to the automated container image build process and
also makes it work for local builds too.
Prior to this commit the YAML manifests used to install the descheduler
Job and CronJob did not set container requests or limits. This is
considered an anti-pattern when deploying applications on k8s.
Set descheduler container resources to make it a burstable pod. This
will ensure a high quality experience for end users when deploying
descheduler into their clusters using the instructions from the README.
The values chosen for CPU/Memory are not based on any real data.
This adds a strategy to balance pod topology domains based on the scheduler's
PodTopologySpread constraints. It attempts to find the minimum number of pods
that should be sent for eviction by comparing the largest domains in a topology
with the smallest domains in that topology.
This commit adds restrictive PodSecurityPolicy, which can be
optionally created, so descheduler can be deployed on clusters with
PodSecurityPolicy admission controller, but which do not ship default
policies.
Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
The k8s.io/klog/v2 package does not currently support structured logging
for warning level log messages. Therefore update the one call in the
code base using klog.Warningf to instead use klog.InfoS.
While non-evictable pods should never be evicted, they should still be
considered when calculating PodAntiAffinity violations. For example, you
may have an evictable pod that should not be running next to a system-critical
static pod. We currently filter IsEvictable before checking for Affinity violations,
so this case would not be caught.
Updated the README with the link to the official descheduler helm chart
on https://hub.helm.sh. This makes it easier for end users to install the
descheduler using helm.
The k8s project recently cut over to the new official k8s.gcr.io
container registry. The descheduler image can now be pulled from
k8s.gcr.io. This is just a new DNS name for the container registry. The
previously documented DNS names for the registry still work, but
require more typing.
Basing this action on push to `chart-*` tags doesn't work: the action itself
creates the new release tag, so trying to push the changes to a tag ends up
with the releaser comparing the changes to themselves (and failing).
This also renames the chart from "descheduler" to "descheduler-helm-chart", to
avoid confusion with automated releases.
This moves the kind setup (previously used by Travis) to the e2e runner script
to accommodate the switch to Prow. This provides a KIND_E2E env var to specify
whether to run the tests in kind, or (by default) to run them locally.
The kubernetes project has been updated to use Go 1.14.4. See below pull
request.
https://github.com/kubernetes/kubernetes/pull/88638
After making the updates to Go 1.14 "make gen" no longer worked. The
file hack/tools.go had to be created to get "make gen" working with Go
1.14.
Initially users can open a bug report, feature request, or misc request.
Bug reports and feature requests will have the "kind" label
automatically added.
Prior to this change the event created for every pod eviction was
identical. Instead leverage the newly added eviction reason when
creating k8s events. This makes it easier for end users to understand
why the descheduler evicted a pod when inspecting k8s events.
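A minimal sketch of an eviction event that carries the reason, using client-go's record.EventRecorder; the reason and message format here are illustrative, not the descheduler's exact strings:
```go
// eventRecorder is a record.EventRecorder; pod and reason come from the caller.
eventRecorder.Eventf(pod, v1.EventTypeNormal, "Descheduled", "pod evicted from %s: %s", pod.Spec.NodeName, reason)
```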
This is a very minor refactor to use a var declaration for the reason
variable. A var declaration is being used because the zero value for
strings is an empty string.
https://golang.org/ref/spec#The_zero_value
This is a minor refactor of PodEvictor to include evictLocalStoragePods as a
property (so that it doesn't need to be passed to each strategy function.) It
also removes ListEvictablePodsOnNode and extends ListPodsOnANode with an optional
"filter" function parameter, so that for example IsEvictable can be passed to it
and achieve the same results as the function formerly known as ListEvictablePodsOnNode.
This approach also now allows strategies to more explicitly extend their criteria of what
IsEvictable considers an evictable pod (such as NodeAffinity, which checks that the pod
can fit on any other node).
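A minimal sketch of that filter-based listing (assumed types, not the descheduler's exact signature):
```go
type podFilter func(*v1.Pod) bool

// listPodsOnNode returns the pods bound to nodeName that pass every filter;
// callers can pass IsEvictable or any stricter strategy-specific predicate.
func listPodsOnNode(allPods []*v1.Pod, nodeName string, filters ...podFilter) []*v1.Pod {
	var result []*v1.Pod
pods:
	for _, pod := range allPods {
		if pod.Spec.NodeName != nodeName {
			continue
		}
		for _, filter := range filters {
			if !filter(pod) {
				continue pods
			}
		}
		result = append(result, pod)
	}
	return result
}
```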
1. Set default CPU/Mem/Pods percentage thresholds to 100
2. Stop evicting pods if any resource runs out
3. Add a thresholds verification method and limit resource percentages to within [0, 100] (see the sketch below)
4. Change test cases and readme
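A minimal sketch of the validation in item 3 (assumed types):
```go
// validateThresholds rejects any resource threshold outside [0, 100] percent.
func validateThresholds(thresholds map[v1.ResourceName]float64) error {
	for resourceName, percentage := range thresholds {
		if percentage < 0 || percentage > 100 {
			return fmt.Errorf("%v threshold %v is out of range [0, 100]", resourceName, percentage)
		}
	}
	return nil
}
```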
This newly documented URL can be used to view the descheduler staging
registry in a web browser. This is easier to browse if the gcloud
command is not available.
The matrix has been updated with the soon-to-be-released v0.18
details. Also, clarified the descheduler and k8s version compatibility
requirements and recommendations.
This is a slowly growing document that lists good practices, conventions, and design decisions.
## Overview
TBD
## Code convention
* *formatting code*: run `make fmt` before committing each change to avoid CI failures
## Unit Test Conventions
These are the known conventions that are useful to practice whenever reasonable:
* *single pod creation*: each pod variable built using `test.BuildTestPod` is updated only through the `apply` argument of `BuildTestPod`
* *single node creation*: each node variable built using `test.BuildTestNode` is updated only through the `apply` argument of `BuildTestNode`
* *no object instance sharing*: each object built through `test.BuildXXX` functions is newly created in each unit test to avoid accidental object mutations
* *no object instance duplication*: avoid duplication by not creating two objects with the same passed values in two different places, e.g. two nodes created with the same memory, cpu and pods requests. Rather, create a single function wrapping test.BuildTestNode and invoke this wrapper multiple times.
The aim is to reduce cognitive load when reading and debugging the test code.
## Design Decisions FAQ
This section documents common questions about design decisions in the descheduler codebase and the rationale behind them.
### Why doesn't the framework provide helpers for registering and retrieving indexers for plugins?
In general, each plugin can have many indexers—for example, for nodes, namespaces, pods, and other resources. Each plugin, depending on its internal optimizations, may choose a different indexing function. Indexers are currently used very rarely in the framework and default plugins. Therefore, extending the framework interface with additional helpers for registering and retrieving indexers might introduce an unnecessary and overly restrictive layer without first understanding how indexers will be used. For the moment, I suggest avoiding any restrictions on how many indexers can be registered or which ones can be registered. Instead, we should extend the framework handle to provide a unique ID for each profile, so that indexers within the same profile share a unique prefix. This avoids collisions when the same profile is instantiated more than once. Later, once we learn more about indexer usage, we can revisit whether it makes sense to impose additional restrictions.
$(CONTAINER_ENGINE) run --entrypoint make -it -v $(CURRENT_DIR):/go/src/sigs.k8s.io/descheduler -w /go/src/sigs.k8s.io/descheduler golang:$(GO_VERSION) gen
description: Descheduler for Kubernetes is used to rebalance clusters by evicting pods that can potentially be scheduled on better nodes. In the current implementation, descheduler does not schedule replacement of evicted pods but relies on the default scheduler for that.
This chart bootstraps a [descheduler](https://github.com/kubernetes-sigs/descheduler/) cron job with a default DeschedulerPolicy on a [Kubernetes](http://kubernetes.io) cluster using the [Helm](https://helm.sh) package manager. To preview what changes descheduler would make without actually going forward with the changes, you can install descheduler in dry run mode by providing the flag `--set cmdOptions.dry-run=true` to the `helm install` command below.
## Prerequisites
To install the chart with the release name `my-release`:
The command deploys _descheduler_ on the Kubernetes cluster in the default configuration. The [configuration](#configuration) section lists the parameters that can be configured during installation.
The following table lists the configurable parameters of the _descheduler_ chart and their default values.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `namespaceOverride` | Override the deployment namespace; defaults to .Release.Namespace | `""` |
| `cronJobApiVersion` | CronJob API Group Version | `"batch/v1"` |
| `schedule` | The cron schedule to run the _descheduler_ job on | `"*/2 * * * *"` |
| `startingDeadlineSeconds` | If set, configure `startingDeadlineSeconds` for the _descheduler_ job | `nil` |
| `timeZone` | configure `timeZone` for CronJob | `nil` |
| `successfulJobsHistoryLimit` | If set, configure `successfulJobsHistoryLimit` for the _descheduler_ job | `3` |
| `failedJobsHistoryLimit` | If set, configure `failedJobsHistoryLimit` for the _descheduler_ job | `1` |
| `ttlSecondsAfterFinished` | If set, configure `ttlSecondsAfterFinished` for the _descheduler_ job | `nil` |
| `deschedulingInterval` | If using kind:Deployment, sets time between consecutive descheduler executions. | `5m` |
| `replicas` | The replica count for Deployment | `1` |
| `leaderElection` | The options for high availability when running replicated components | _see values.yaml_ |
| `cmdOptions` | The options to pass to the _descheduler_ command | _see values.yaml_ |
| `priorityClassName` | The name of the priority class to add to pods | `system-cluster-critical` |
| `rbac.create` | If `true`, create & use RBAC resources | `true` |
| `resources` | Descheduler container CPU and memory requests/limits | _see values.yaml_ |
| `serviceAccount.create` | If `true`, create a service account for the cron job | `true` |
| `serviceAccount.name` | The name of the service account to use, if not set and create is true a name is generated using the fullname template | `nil` |
| `serviceAccount.annotations` | Specifies custom annotations for the serviceAccount | `{}` |
| `cronJobAnnotations` | Annotations to add to the descheduler CronJob | `{}` |
| `cronJobLabels` | Labels to add to the descheduler CronJob | `{}` |
| `jobAnnotations` | Annotations to add to the descheduler Job resources (created by CronJob) | `{}` |
| `jobLabels` | Labels to add to the descheduler Job resources (created by CronJob) | `{}` |
| `podAnnotations` | Annotations to add to the descheduler Pods | `{}` |
| `podLabels` | Labels to add to the descheduler Pods | `{}` |
| `nodeSelector` | Node selectors to run the descheduler cronjob/deployment on specific nodes | `nil` |
| `service.enabled` | If `true`, create a service for deployment | `false` |
| `serviceMonitor.enabled` | If `true`, create a ServiceMonitor for deployment | `false` |
| `serviceMonitor.namespace` | The namespace where Prometheus expects to find service monitors | `nil` |
| `serviceMonitor.additionalLabels` | Add custom labels to the ServiceMonitor resource | `{}` |
| `serviceMonitor.interval` | The scrape interval. If not set, the Prometheus default scrape interval is used | `nil` |
| `serviceMonitor.honorLabels` | Keeps the scraped data's labels when they collide with target labels. | `true` |
WARNING: You set replica count as 1 and workload kind as Deployment however leaderElection is not enabled. Consider enabling Leader Election for HA mode.
{{- end}}
{{- if .Values.leaderElection }}
{{- if and (hasKey .Values.cmdOptions "dry-run") (eq (get .Values.cmdOptions "dry-run") true) }}
WARNING: You enabled DryRun mode, you can't use Leader Election.
{{- end}}
{{- end}}
{{- end}}
{{- if .Values.deschedulerPolicy }}
A DeschedulerPolicy has been applied for you. You can view the policy with:
kubectl get configmap -n {{ include "descheduler.namespace" . }} {{ template "descheduler.fullname" . }} -o yaml
If you wish to define your own policies out of band from this chart, you may define a configmap named {{ template "descheduler.fullname" . }}.
To avoid a conflict between helm and your out of band method to deploy the configmap, please set deschedulerPolicy in values.yaml to an empty object as below.
fs.DurationVar(&rs.DeschedulingInterval,"descheduling-interval",rs.DeschedulingInterval,"Time interval between two consecutive descheduler executions. Setting this value instructs the descheduler to run in a continuous loop at the interval specified.")
fs.StringVar(&rs.KubeconfigFile,"kubeconfig",rs.KubeconfigFile,"File with kube configuration.")
fs.StringVar(&rs.ClientConnection.Kubeconfig,"kubeconfig",rs.ClientConnection.Kubeconfig,"File with kube configuration. Deprecated, use client-connection-kubeconfig instead.")
fs.StringVar(&rs.ClientConnection.Kubeconfig,"client-connection-kubeconfig",rs.ClientConnection.Kubeconfig,"File path to kube configuration for interacting with kubernetes apiserver.")
fs.Float32Var(&rs.ClientConnection.QPS,"client-connection-qps",rs.ClientConnection.QPS,"QPS to use for interacting with kubernetes apiserver.")
fs.Int32Var(&rs.ClientConnection.Burst,"client-connection-burst",rs.ClientConnection.Burst,"Burst to use for interacting with kubernetes apiserver.")
fs.StringVar(&rs.PolicyConfigFile,"policy-config-file",rs.PolicyConfigFile,"File with descheduler policy configuration.")
fs.BoolVar(&rs.DryRun,"dry-run",rs.DryRun,"execute descheduler in dry run mode.")
// node-selector query causes descheduler to run only on nodes that matches the node labels in the query
fs.StringVar(&rs.NodeSelector,"node-selector",rs.NodeSelector,"Selector (label query) to filter on, supports '=', '==', and '!='.(e.g. -l key1=value1,key2=value2)")
// max-no-pods-to-evict limits the maximum number of pods to be evicted per node by descheduler.
fs.IntVar(&rs.MaxNoOfPodsToEvictPerNode,"max-pods-to-evict-per-node",rs.MaxNoOfPodsToEvictPerNode,"Limits the maximum number of pods to be evicted per node by descheduler")
// evict-local-storage-pods allows eviction of pods that are using local storage. This is false by default.
fs.BoolVar(&rs.EvictLocalStoragePods,"evict-local-storage-pods",rs.EvictLocalStoragePods,"Enables evicting pods using local storage by descheduler")
fs.BoolVar(&rs.DryRun,"dry-run",rs.DryRun,"Execute descheduler in dry run mode.")
fs.BoolVar(&rs.DisableMetrics,"disable-metrics",rs.DisableMetrics,"Disables metrics. The metrics are by default served through https://localhost:10258/metrics. Secure address, resp. port can be changed through --bind-address, resp. --secure-port flags.")
fs.StringVar(&rs.Tracing.CollectorEndpoint,"otel-collector-endpoint","","Set this flag to the OpenTelemetry Collector Service Address")
fs.StringVar(&rs.Tracing.TransportCert,"otel-transport-ca-cert","","Path of the CA Cert that can be used to generate the client Certificate for establishing secure connection to the OTEL in gRPC mode")
fs.StringVar(&rs.Tracing.ServiceName,"otel-service-name",tracing.DefaultServiceName,"OTEL Trace name to be used with the resources")
fs.StringVar(&rs.Tracing.ServiceNamespace,"otel-trace-namespace","","OTEL Trace namespace to be used with the resources")
fs.Float64Var(&rs.Tracing.SampleRate,"otel-sample-rate",1.0,"Sample rate to collect the Traces")
fs.BoolVar(&rs.Tracing.FallbackToNoOpProviderOnError,"otel-fallback-no-op-on-error",false,"Fallback to NoOp Tracer in case of error")
fs.BoolVar(&rs.EnableHTTP2,"enable-http2",false,"If http/2 should be enabled for the metrics and health check")
fs.Var(cliflag.NewMapStringBool(&rs.FeatureGates),"feature-gates","A set of key=value pairs that describe feature gates for alpha/experimental features. "+
The descheduler evicts pods which may be bound to less desired nodes
```
descheduler [flags]
```
### Options
```
--bind-address ip The IP address on which to listen for the --secure-port port. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces and IP address families will be used. (default 0.0.0.0)
--cert-dir string The directory where the TLS certs are located. If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored. (default "apiserver.local.config/certificates")
--client-connection-burst int32 Burst to use for interacting with kubernetes apiserver.
--client-connection-kubeconfig string File path to kube configuration for interacting with kubernetes apiserver.
--client-connection-qps float32 QPS to use for interacting with kubernetes apiserver.
--descheduling-interval duration Time interval between two consecutive descheduler executions. Setting this value instructs the descheduler to run in a continuous loop at the interval specified.
--disable-http2-serving If true, HTTP2 serving will be disabled [default=false]
--disable-metrics Disables metrics. The metrics are by default served through https://localhost:10258/metrics. Secure address, resp. port can be changed through --bind-address, resp. --secure-port flags.
--dry-run Execute descheduler in dry run mode.
--enable-http2 If http/2 should be enabled for the metrics and health check
--feature-gates mapStringBool A set of key=value pairs that describe feature gates for alpha/experimental features. Options are:
--http2-max-streams-per-connection int The limit that the server gives to clients for the maximum number of streams in an HTTP/2 connection. Zero means to use golang's default.
--kubeconfig string File with kube configuration. Deprecated, use client-connection-kubeconfig instead.
--leader-elect Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability.
--leader-elect-lease-duration duration The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 2m17s)
--leader-elect-renew-deadline duration The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than the lease duration. This is only applicable if leader election is enabled. (default 1m47s)
--leader-elect-resource-lock string The type of resource object that is used for locking during leader election. Supported options are 'leases'. (default "leases")
--leader-elect-resource-name string The name of resource object that is used for locking during leader election. (default "descheduler")
--leader-elect-resource-namespace string The namespace of resource object that is used for locking during leader election. (default "kube-system")
--leader-elect-retry-period duration The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 26s)
--log-flush-frequency duration Maximum number of seconds between log flushes (default 5s)
--log-json-info-buffer-size quantity [Alpha] In JSON format with split output streams, the info messages can be buffered for a while to increase performance. The default value of zero bytes disables buffering. The size can be specified as number of bytes (512), multiples of 1000 (1K), multiples of 1024 (2Ki), or powers of those (3M, 4G, 5Mi, 6Gi). Enable the LoggingAlphaOptions feature gate to use this.
--log-json-split-stream [Alpha] In JSON format, write error messages to stderr and info messages to stdout. The default is to write a single stream to stdout. Enable the LoggingAlphaOptions feature gate to use this.
--log-text-info-buffer-size quantity [Alpha] In text format with split output streams, the info messages can be buffered for a while to increase performance. The default value of zero bytes disables buffering. The size can be specified as number of bytes (512), multiples of 1000 (1K), multiples of 1024 (2Ki), or powers of those (3M, 4G, 5Mi, 6Gi). Enable the LoggingAlphaOptions feature gate to use this.
--log-text-split-stream [Alpha] In text format, write error messages to stderr and info messages to stdout. The default is to write a single stream to stdout. Enable the LoggingAlphaOptions feature gate to use this.
--logging-format string Sets the log format. Permitted formats: "json" (gated by LoggingBetaOptions), "text". (default "text")
--otel-collector-endpoint string Set this flag to the OpenTelemetry Collector Service Address
--otel-fallback-no-op-on-error Fallback to NoOp Tracer in case of error
--otel-sample-rate float Sample rate to collect the Traces (default 1)
--otel-service-name string OTEL Trace name to be used with the resources (default "descheduler")
--otel-trace-namespace string OTEL Trace namespace to be used with the resources
--otel-transport-ca-cert string Path of the CA Cert that can be used to generate the client Certificate for establishing secure connection to the OTEL in gRPC mode
--permit-address-sharing If true, SO_REUSEADDR will be used when binding the port. This allows binding to wildcard IPs like 0.0.0.0 and specific IPs in parallel, and it avoids waiting for the kernel to release sockets in TIME_WAIT state. [default=false]
--permit-port-sharing If true, SO_REUSEPORT will be used when binding the port, which allows more than one instance to bind on the same address and port. [default=false]
--policy-config-file string File with descheduler policy configuration.
--secure-port int The port on which to serve HTTPS with authentication and authorization. If 0, don't serve HTTPS at all. (default 10258)
--tls-cert-file string File containing the default x509 Certificate for HTTPS. (CA cert, if any, concatenated after server cert). If HTTPS serving is enabled, and --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to the directory specified by --cert-dir.
--tls-cipher-suites strings Comma-separated list of cipher suites for the server. If omitted, the default Go cipher suites will be used.
--tls-sni-cert-key namedCertKey A pair of x509 certificate and private key file paths, optionally suffixed with a list of domain patterns which are fully qualified domain names, possibly with prefixed wildcard segments. The domain patterns also allow IP addresses, but IPs should only be used if the apiserver has visibility to the IP address requested by a client. If no domain patterns are provided, the names of the certificate are extracted. Non-wildcard matches trump over wildcard matches, explicit domain patterns trump over extracted names. For multiple key/certificate pairs, use the --tls-sni-cert-key multiple times. Examples: "example.crt,example.key" or "foo.crt,foo.key:*.foo.com,foo.com". (default [])
-v, --v Level number for the log level verbosity
--vmodule pattern=N,... comma-separated list of pattern=N settings for file-filtered logging (only works for text log format)
```
### SEE ALSO
* [descheduler version](descheduler_version.md) - Version of descheduler
After making changes in the code base, ensure that the code is formatted correctly:
```
make fmt
```
## Build Helm Package locally
If you made some changes in the chart, and just want to check if templating is ok, or if the chart is buildable, you can run this command to have a package built from the `./charts` directory.
```
make build-helm
```
## Lint Helm Chart locally
To check linting of your changes in the helm chart locally you can run:
```
make lint-chart
```
## Test helm changes locally with kind and ct
You will need kind and docker (or equivalent) installed. We can use the ct public image to avoid installing ct and all of its dependencies.
```
make kind-multi-node
make ct-helm
```
### Miscellaneous
See the [hack directory](https://github.com/kubernetes-sigs/descheduler/tree/master/hack) for additional tools and scripts used for developing the descheduler.
The process for publishing each Descheduler release includes a mixture of manual and automatic steps. Over
time, it would be good to automate as much of this process as possible. However, due to current limitations,
care must be taken to perform each manual step precisely so that the automated steps execute properly.
### Semi-automatic
## Pre-release Code Changes
1. Make sure your repo is clean by git's standards
2. Create a release branch `git checkout -b release-1.18` (not required for patch releases)
3. Push the release branch to the descheduler repo and ensure branch protection is enabled (not required for patch releases)
4. Tag the repository from the `master` branch (from the `release-1.18` branch for a patch release) and push the tag `VERSION=v0.18.0 git tag -m $VERSION $VERSION; git push origin $VERSION`
5. Publish a draft release using the tag you just created
6. Perform the [image promotion process](https://github.com/kubernetes/k8s.io/tree/master/k8s.gcr.io#image-promoter)
7. Publish release
8. Email `kubernetes-sig-scheduling@googlegroups.com` to announce the release
Before publishing each release, the following code updates must be made:
### Manual
- [ ] (Optional, but recommended) Bump `k8s.io` dependencies to the `-rc` tags. These tags are usually published around upstream code freeze. [Example](https://github.com/kubernetes-sigs/descheduler/pull/539)
- [ ] Bump `k8s.io` dependencies to GA tags once they are published (following the upstream release). [Example](https://github.com/kubernetes-sigs/descheduler/pull/615)
- [ ] Ensure that Go is updated to the same version as upstream. [Example](https://github.com/kubernetes-sigs/descheduler/pull/801)
- [ ] Make CI changes in [github.com/kubernetes/test-infra](https://github.com/kubernetes/test-infra) to add the new version's tests (note, this may also include a Go bump). [Example](https://github.com/kubernetes/test-infra/pull/25833)
- [ ] Update local CI versions for utils (such as golang-ci), kind, and go. [Example - e2e](https://github.com/kubernetes-sigs/descheduler/commit/ac4d576df8831c0c399ee8fff1e85469e90b8c44), [Example - helm](https://github.com/kubernetes-sigs/descheduler/pull/821)
- [ ] Update version references in docs and Readme. [Example](https://github.com/kubernetes-sigs/descheduler/pull/617)
1. Make sure your repo is clean by git's standards
2. Create a release branch `git checkout -b release-1.18` (not required for patch releases)
3. Push the release branch to the descheduler repo and ensure branch protection is enabled (not required for patch releases)
4. Tag the repository from the `master` branch (from the `release-1.18` branch for a patch release) and push the tag `VERSION=v0.18.0 git tag -m $VERSION $VERSION; git push origin $VERSION`
5. Checkout the tag you just created and make sure your repo is clean by git's standards `git checkout $VERSION`
6. Build and push the container image to the staging registry `VERSION=$VERSION make push`
7. Publish a draft release using the tag you just created
8. Perform the [image promotion process](https://github.com/kubernetes/k8s.io/tree/master/k8s.gcr.io#image-promoter)
9. Publish release
10. Email `kubernetes-sig-scheduling@googlegroups.com` to announce the release
## Release Process
### Notes
It's important to create the tag on the master branch after creating the release-* branch so that the [Helm releaser-action](#helm-chart) can work.
It compares the changes in the action-triggering branch to the latest tag on that branch, so if you tag before creating the new branch there
will be nothing to compare and it will fail (creating a new release branch usually involves no code changes). For this same reason, you should
also tag patch releases (on the release-* branch) *after* pushing changes (if those changes involve bumping the Helm chart version).
When the above pre-release steps are complete and the release is ready to be cut, perform the following steps **in order**
(the flowchart below demonstrates these steps):
**Version release**
1. Create the `git tag` on `master` for the release, eg `v0.24.0`
2. Merge Helm chart version update to `master` (see [Helm chart](#helm-chart) below). [Example](https://github.com/kubernetes-sigs/descheduler/pull/709)
3. Perform the [image promotion process](https://github.com/kubernetes/k8s.io/tree/main/k8s.gcr.io#image-promoter). [Example](https://github.com/kubernetes/k8s.io/pull/3344)
4. Cut release branch from `master`, eg `release-1.24`
5. Publish release using Github's release process from the git tag you created
6. Email `sig-scheduling@kubernetes.io` to announce the release
**Patch release**
1. Pick relevant code change commits to the matching release branch, eg `release-1.24`
2. Create the patch tag on the release branch, eg `v0.24.1` on `release-1.24`
3. Merge Helm chart version update to release branch
4. Perform the image promotion process for the patch version
5. Publish release using Github's release process from the git tag you created
6. Email `sig-scheduling@kubernetes.io` to announce the release
### Flowchart

### Image promotion process
Every merge to any branch triggers an [image build and push](https://github.com/kubernetes/test-infra/blob/c36b8e5/config/jobs/image-pushing/k8s-staging-descheduler.yaml) to a `gcr.io` repository.
These automated image builds are snapshots of the code in place at the time of every PR merge and
tagged with the latest git SHA at the time of the build. To create a final release image, the desired
auto-built image SHA is added to a [file upstream](https://github.com/kubernetes/k8s.io/blob/e9e971c/k8s.gcr.io/images/k8s-staging-descheduler/images.yaml) which
copies that image to a public registry.
Automatic builds can be monitored and re-triggered with the [`post-descheduler-push-images` job](https://prow.k8s.io/?job=post-descheduler-push-images) on prow.k8s.io.
Note that images can also be manually built and pushed using `VERSION=$VERSION make push-all` by [users with access](https://github.com/kubernetes/k8s.io/blob/fbee8f67b70304241e613a672c625ad972998ad7/groups/sig-scheduling/groups.yaml#L33-L43).
## Helm Chart
We currently use the [chart-releaser-action GitHub Action](https://github.com/helm/chart-releaser-action) to automatically release the Helm chart.
This action is triggered when it detects any changes to [`Chart.yaml`](https://github.com/kubernetes-sigs/descheduler/blob/022e07c27853fade6d1304adc0a6ebe02642386c/charts/descheduler/Chart.yaml) on
a `release-*` branch.
Helm chart releases are managed by a separate set of git tags that are prefixed with `descheduler-helm-chart-*`. Example git tag name is `descheduler-helm-chart-0.18.0`.
Released versions of the helm charts are stored in the `gh-pages` branch of this repo.
The major and minor version of the chart matches the descheduler major and minor versions. For example, descheduler helm chart version descheduler-helm-chart-0.18.0 corresponds
to descheduler version v0.18.0. The patch version of the descheduler helm chart and the patch version of the descheduler will not necessarily match. The patch
version of the descheduler helm chart is used to version changes specific to the helm chart.
1. Merge all helm chart changes into the master branch before the release is tagged/cut (an illustrative `Chart.yaml` sketch follows this list)
   1. Ensure that `appVersion` in `charts/descheduler/Chart.yaml` matches the descheduler version (no `v` prefix)
   2. Ensure that `version` in `charts/descheduler/Chart.yaml` has been incremented. This is the chart version.
2. Make sure your repo is clean by git's standards
3. Follow the release-branch or patch-release tagging pattern from the section above.
4. Verify the new helm artifact has been successfully pushed to the `gh-pages` branch.
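For illustration only, a hypothetical `Chart.yaml` for a chart-only patch release on top of descheduler `v0.24.0` might carry fields like the following (names and values are examples, not the current chart contents):
```
# charts/descheduler/Chart.yaml (illustrative excerpt)
apiVersion: v2
name: descheduler
version: 0.24.1    # chart version; incremented for every chart change
appVersion: 0.24.0 # descheduler version, without the "v" prefix
```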
## Notes
The Helm releaser-action compares the changes on the action-triggering branch to the latest tag on that branch, so if you tag before creating the new branch there
will be nothing to compare and it will fail. This is why it's necessary to tag, eg, `v0.24.0` *before* making the changes to the
Helm chart version, so that there is a new diff for the action to find. (Tagging *after* making the Helm chart changes would
also work, but then the code that gets built into the promoted image would be tagged as `descheduler-helm-chart-xxx` rather than `v0.xx.0`.)
See [post-descheduler-push-images dashboard](https://testgrid.k8s.io/sig-scheduling#post-descheduler-push-images) for staging registry image build job status.
View the descheduler staging registry using [this URL](https://console.cloud.google.com/gcr/images/k8s-staging-descheduler/GLOBAL/descheduler) in a web browser
or use the below `gcloud` commands.
List images in the staging registry:
```
gcloud container images list --repository gcr.io/k8s-staging-descheduler
```
Pull an image from the staging registry.
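A hedged sketch of such a pull, assuming a standard container runtime and an image tag of the form produced by the automated staging builds (the tag is a placeholder):
```
docker pull gcr.io/k8s-staging-descheduler/descheduler:<image-tag>
```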
The [examples](https://github.com/kubernetes-sigs/descheduler/tree/master/examples) directory has descheduler policy configuration examples.
## CLI Options
The descheduler has many CLI options that can be used to override its default behavior.
```
descheduler --help
The descheduler evicts pods which may be bound to less desired nodes
Usage:
descheduler [flags]
descheduler [command]
Available Commands:
help Help about any command
version Version of descheduler
Flags:
--add-dir-header If true, adds the file directory to the header
--alsologtostderr log to standard error as well as files
--descheduling-interval duration Time interval between two consecutive descheduler executions. Setting this value instructs the descheduler to run in a continuous loop at the interval specified.
--dry-run execute descheduler in dry run mode.
--evict-local-storage-pods Enables evicting pods using local storage by descheduler
-h, --help help for descheduler
--kubeconfig string File with kube configuration.
--log-backtrace-at traceLocation when logging hits line file:N, emit a stack trace (default :0)
--log-dir string If non-empty, write log files in this directory
--log-file string If non-empty, use this log file
--log-file-max-size uint Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
--log-flush-frequency duration Maximum number of seconds between log flushes (default 5s)
--logtostderr log to standard error instead of files (default true)
--max-pods-to-evict-per-node int Limits the maximum number of pods to be evicted per node by descheduler
--node-selector string Selector (label query) to filter on, supports '=', '==', and '!='.(e.g. -l key1=value1,key2=value2)
--policy-config-file string File with descheduler policy configuration.
--skip-headers If true, avoid header prefixes in the log messages
--skip-log-headers If true, avoid headers when opening log files
--stderrthreshold severity logs at or above this threshold go to stderr (default 2)
-v, --v Level number for the log level verbosity
--vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging
Use "descheduler [command] --help" for more information about a command.
```
Please check the [CLI Options](./cli/descheduler.md) documentation for details.
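As a quick illustration (the policy file path and flag values below are arbitrary), a typical invocation combines a policy file with a run interval and a log verbosity level:
```
descheduler --policy-config-file=/path/to/policy.yaml --descheduling-interval=5m --v=3
```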
## Production Use Cases
This section contains descriptions of real world production use cases.
This policy configuration file ensures that pods created more than 7 days ago are evicted.
```
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: ProfileName
  pluginConfig:
  - name: "PodLifeTime"
    args:
      maxPodLifeTimeSeconds: 604800 # pods run for a maximum of 7 days
  plugins:
    deschedule:
      enabled:
      - "PodLifeTime"
```
### Balance Cluster By Node Memory Utilization
If your cluster has been running for a long period of time, you may find that the resource utilization is not very
balanced. The following two strategies can be used to rebalance your cluster based on `cpu`, `memory`
or `number of pods`.
#### Balance high utilization nodes
Using `LowNodeUtilization`, the descheduler rebalances the cluster based on memory by evicting pods
from nodes whose memory utilization is over 70% (`targetThresholds`), so that they can be rescheduled onto nodes whose memory utilization is below 20% (`thresholds`).
```
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: ProfileName
  pluginConfig:
  - name: "LowNodeUtilization"
    args:
      thresholds:
        "memory": 20
      targetThresholds:
        "memory": 70
  plugins:
    balance:
      enabled:
      - "LowNodeUtilization"
```
#### Balance low utilization nodes
Using `HighNodeUtilization`, the descheduler rebalances the cluster based on memory by evicting pods
from nodes whose memory utilization is lower than 20%, so that the evicted pods are compacted onto a minimal set of nodes.
This strategy should be used together with the kube-scheduler's `NodeResourcesFit` plugin configured with the `MostAllocated` scoring strategy (see the [scheduler configuration docs](https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins)); a scheduler configuration sketch follows the policy example below.
```
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: ProfileName
  pluginConfig:
  - name: "HighNodeUtilization"
    args:
      thresholds:
        "memory": 20
  plugins:
    balance:
      enabled:
      - "HighNodeUtilization"
```
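A minimal sketch of the corresponding kube-scheduler configuration, assuming the `kubescheduler.config.k8s.io/v1` API and the default scheduler profile (the resource weights are illustrative):
```
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
```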
### Autoheal Node Problems
Descheduler's `RemovePodsViolatingNodeTaints` strategy can be combined with
[Node Problem Detector](https://github.com/kubernetes/node-problem-detector/) and
[Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) to automatically remove
Nodes which have problems. Node Problem Detector can detect specific Node problems and report them to the API server.
The node controller's `TaintNodesByCondition` feature takes certain node conditions and turns them into taints. Currently this only works for the default node conditions: PIDPressure, MemoryPressure, DiskPressure, Ready, and some cloud-provider-specific conditions.
The Descheduler will then deschedule workloads from those Nodes. Finally, if the descheduled Node's resource
allocation falls below the Cluster Autoscaler's scale down threshold, the Node will become a scale down candidate
and can be removed by Cluster Autoscaler. These three components form an autohealing cycle for Node problems.
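A minimal descheduler policy sketch for this setup, using the v1alpha2 API shown in the examples above (the profile name is arbitrary):
```
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: ProfileName
  pluginConfig:
  - name: "RemovePodsViolatingNodeTaints"
  plugins:
    deschedule:
      enabled:
      - "RemovePodsViolatingNodeTaints"
```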
---
**NOTE**
Once [kubernetes/node-problem-detector#565](https://github.com/kubernetes/node-problem-detector/pull/565) is available in NPD, we need to update this section.
echo"+++ Creating a pull request on GitHub at ${GITHUB_USER}:${NEWBRANCH}"
local numandtitle
numandtitle=$(printf'%s\n'"${SUBJECTS[@]}")
prtext=$(cat <<EOF
Cherry pick of ${PULLSUBJ} on ${rel}.
${numandtitle}
For details on the cherry pick process, see the [cherry pick requests](https://git.k8s.io/community/contributors/devel/sig-release/cherry-picks.md) page.
```
package metrics

import (
	"sync"

	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"

	"sigs.k8s.io/descheduler/pkg/version"
)

const (
	// DeschedulerSubsystem - subsystem name used by descheduler
	DeschedulerSubsystem = "descheduler"
)

var (
	PodsEvicted = metrics.NewCounterVec(
		&metrics.CounterOpts{
			Subsystem: DeschedulerSubsystem,
			Name:      "pods_evicted",
			Help:      "Number of total evicted pods, by the result, by the strategy, by the namespace, by the node name. 'error' result means a pod could not be evicted",
```