
Pod spec equivalency checks can break Cluster Autoscaler scalability #4724

@towca

The logic in buildPodEquivalenceGroups and filterOutSchedulable groups pods by their scheduling requirements as a scalability optimization. This is done by first grouping pods by their controller's UID, and then comparing pod specs between pods from the same controller. If some field in the pod spec is unique per pod within a controller, every pod ends up in a group of its own and the optimization breaks, as sketched below.
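
For illustration, here is a minimal Go sketch of that grouping idea. This is not the actual CA code: `equivalenceGroup` and `buildGroups` are illustrative names, and plain `reflect.DeepEqual` stands in for the real spec comparison.

```go
// A minimal sketch of the grouping idea, assuming the k8s.io/api and
// k8s.io/apimachinery modules; names here are illustrative, not the CA code.
package sketch

import (
	"reflect"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

type equivalenceGroup struct {
	representative *apiv1.Pod // first pod seen with this spec
	pods           []*apiv1.Pod
}

// buildGroups buckets pods by controller UID, then by spec equality within
// each bucket. Plain DeepEqual stands in for PodSpecSemanticallyEqual here.
func buildGroups(pods []*apiv1.Pod) map[types.UID][]*equivalenceGroup {
	groups := map[types.UID][]*equivalenceGroup{}
	for _, pod := range pods {
		ref := metav1.GetControllerOf(pod)
		if ref == nil {
			continue // pods without a controller are handled individually
		}
		matched := false
		for _, g := range groups[ref.UID] {
			if reflect.DeepEqual(g.representative.Spec, pod.Spec) {
				g.pods = append(g.pods, pod)
				matched = true
				break
			}
		}
		// A per-pod unique field means no existing group ever matches, so
		// every pod lands here and the group count grows with the pod count.
		if !matched {
			groups[ref.UID] = append(groups[ref.UID], &equivalenceGroup{
				representative: pod,
				pods:           []*apiv1.Pod{pod},
			})
		}
	}
	return groups
}
```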

In extreme cases with a lot of such pods (a couple thousand can be enough), CA can spend so long in a single loop iteration that it fails its health checks and is killed by the kubelet. Everything then repeats once it comes back up, and CA is effectively broken until the pods are scheduled or deleted.

One trigger for pod specs being different is the BoundServiceAccountTokenVolume feature, which injects uniquely-named projected volumes into each pod's spec. This was taken into account by CA in #4441.
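The fix there, roughly, was to ignore the injected volumes when comparing specs. A simplified sketch of that idea, continuing the code above (the helper name is mine, and the real patch in #4441 is more careful):

```go
// Simplified sketch of the #4441 idea: drop projected volumes (the shape
// BoundServiceAccountTokenVolume injects, with unique per-pod names) from a
// copy of the spec before comparison. The helper name is illustrative.
func sanitizeProjectedVolumes(spec apiv1.PodSpec) apiv1.PodSpec {
	sanitized := *spec.DeepCopy()
	kept := sanitized.Volumes[:0]
	for _, v := range sanitized.Volumes {
		if v.Projected == nil {
			kept = append(kept, v)
		}
	}
	sanitized.Volumes = kept
	return sanitized
}
```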

We've just run into another trigger: Jobs using completionMode: Indexed. In this mode, each pod gets a unique, indexed hostname in its spec. This is documented here: https://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode. AFAIU the hostname shouldn't affect scheduling, so sanitizing it in PodSpecSemanticallyEqual should be enough to fix this particular issue.
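If that's right, the fix could look something like the sketch below: blank the Hostname on deep copies of both specs before the semantic comparison. The function name and placement are my assumptions, not the actual patch.

```go
// Uses apiequality "k8s.io/apimachinery/pkg/api/equality".
// Illustrative stand-in for the extra sanitization PodSpecSemanticallyEqual
// would need: the indexed-Job hostname is cleared on copies before
// comparing, so it can no longer split pods into separate groups.
func specsEqualIgnoringHostname(a, b apiv1.PodSpec) bool {
	sa, sb := *a.DeepCopy(), *b.DeepCopy()
	sa.Hostname, sb.Hostname = "", ""
	return apiequality.Semantic.DeepEqual(sa, sb)
}
```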

However, this approach of "fixing" individual fields as issues pop up doesn't scale well. We should come up with a more generic solution to this class of problem. One idea could be a cutoff on the number of groups within one controller, as proposed in #4441 (comment); see the sketch after this paragraph.
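
To make that concrete, here is one shape such a cutoff could take, continuing the sketch above. The constant and the fallback behaviour are purely illustrative, not a settled design.

```go
// Purely illustrative cutoff: once a controller has accumulated this many
// distinct groups, stop spec-matching and treat further unmatched pods as
// unique, capping the comparison cost per pod instead of letting it grow
// with the number of stuck pods.
const maxGroupsPerController = 10

func groupFor(groups []*equivalenceGroup, pod *apiv1.Pod) (*equivalenceGroup, bool) {
	for _, g := range groups {
		if reflect.DeepEqual(g.representative.Spec, pod.Spec) {
			return g, true
		}
	}
	if len(groups) >= maxGroupsPerController {
		return nil, false // cutoff hit: fall back to per-pod handling
	}
	return &equivalenceGroup{representative: pod, pods: []*apiv1.Pod{pod}}, true
}
```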
