Fix RemovalSimulation for parallel scale down #5552

yaroslava-serdiuk · 2023-02-28T19:22:03Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Added a label to the outer loop and now break out of that loop when the timer fires. Previously the bare break only left the select statement, so CA ignored the timeout and the scale-down simulation had no effective time limit.

Also, I changed the default value of the scaleDownSimulationTimeout flag: 30 seconds is enough to process ~1000 non-empty nodes, and at the same time the cluster snapshot does not get as stale as it would with a 5-minute timeout.
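
For readers unfamiliar with the underlying Go behavior, here is a minimal sketch of why the label matters. It is not the planner code: `countProcessed` is a hypothetical helper, and a pre-filled buffered channel stands in for a timer that has already fired.

```go
package main

import "fmt"

// countProcessed is a hypothetical helper (not part of cluster-autoscaler).
// It returns how many nodes the loop still processes when the timeout
// signal is already pending at the first iteration.
func countProcessed(nodes []string, labeled bool) int {
	// Assumption: a buffered channel holding one value stands in for
	// timer.C of a timer that has already expired.
	fired := make(chan struct{}, 1)
	fired <- struct{}{}

	processed := 0
RemovalSimulation:
	for range nodes {
		select {
		case <-fired:
			if labeled {
				break RemovalSimulation // leaves the for loop
			}
			break // leaves only the select; the loop keeps going
		default:
		}
		processed++ // stands in for simulating the removal of one node
	}
	return processed
}

func main() {
	nodes := []string{"n1", "n2", "n3"}
	fmt.Println("bare break, nodes processed:   ", countProcessed(nodes, false))
	fmt.Println("labeled break, nodes processed:", countProcessed(nodes, true))
}
```

With the bare break the signal is consumed and every node is still processed; with the labeled break the loop stops before processing any node.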

x13n · 2023-03-01T08:14:32Z

cluster-autoscaler/core/scaledown/unremovable/nodes.go


+// Contains returns true iff a given node is unremovable.
+func (n *Nodes) Contains(nodeName string) bool {
+	_, found := n.ttls[nodeName]


This works for nodes added with AddTimeout, but not for Add/AddReason. I think checking n.reasons would actually deliver on the promise from the function comment. Also, since you're extending the public interface of this struct, it might be worth adding tests verifying that nodes added via any of the Add* methods can then be checked for presence using Contains.
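
A rough sketch of the suggestion above, under stated assumptions: the `Nodes` type below is a simplified, hypothetical stand-in (string reasons, string node names), not the real struct from the unremovable package, and it assumes every Add* method records an entry in `reasons`.

```go
package main

import (
	"fmt"
	"time"
)

// Nodes is a simplified, hypothetical stand-in for the unremovable node
// tracker. Field and method shapes are assumptions for illustration only.
type Nodes struct {
	ttls    map[string]time.Time
	reasons map[string]string
}

func NewNodes() *Nodes {
	return &Nodes{ttls: map[string]time.Time{}, reasons: map[string]string{}}
}

// Add marks a node as unremovable without a timeout.
func (n *Nodes) Add(nodeName string) { n.reasons[nodeName] = "unspecified" }

// AddReason marks a node as unremovable for the given reason.
func (n *Nodes) AddReason(nodeName, reason string) { n.reasons[nodeName] = reason }

// AddTimeout marks a node as unremovable until the given time.
func (n *Nodes) AddTimeout(nodeName string, timeout time.Time) {
	n.ttls[nodeName] = timeout
	n.Add(nodeName)
}

// Contains returns true iff a given node is unremovable. It checks reasons,
// which (by assumption) every Add* method populates, rather than ttls.
func (n *Nodes) Contains(nodeName string) bool {
	_, found := n.reasons[nodeName]
	return found
}

// exampleNodes adds one node through each Add* method.
func exampleNodes() *Nodes {
	n := NewNodes()
	n.Add("n1")
	n.AddReason("n2", "not-underutilized")
	n.AddTimeout("n3", time.Now().Add(5*time.Minute))
	return n
}

func main() {
	n := exampleNodes()
	for _, name := range []string{"n1", "n2", "n3", "never-added"} {
		fmt.Printf("%s: %v\n", name, n.Contains(name))
	}
}
```

A test along these lines would cover all three Add* paths plus a node that was never added.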

x13n · 2023-03-01T08:21:48Z

cluster-autoscaler/core/scaledown/planner/planner.go

 	}
 	p.nodeUtilizationMap = utilizationMap
 	timer := time.NewTimer(p.context.ScaleDownSimulationTimeout)
+RemovalSimulation:


nit, optional: I'd consider extracting an extra function (e.g. timedOut(time.Timer) bool) instead of introducing the loop label. I'm a fan of short functions :)

Not sure I understand your proposal. Do you mean extract the whole loop to another function or perform a for loop with additional condition timedOut(time.Timer)?

In my view, extracting the whole loop into a function would reduce readability, because that function would need a lot of parameters.

Yeah, extracting the loop might not be very readable, I was thinking about replacing the whole select statement with:

if timedOut(timer) { break }

to avoid the need for a label.
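
One way the proposed helper could look, as a sketch only. `timedOut` takes a `*time.Timer` here, and `firedAfter` is a hypothetical demo helper that is not part of the PR.

```go
package main

import (
	"fmt"
	"time"
)

// timedOut reports whether the timer has fired, without blocking.
// It consumes the timer's signal, so it returns true at most once per
// firing; that is fine when the caller breaks out right after a true result.
func timedOut(timer *time.Timer) bool {
	select {
	case <-timer.C:
		return true
	default:
		return false
	}
}

// firedAfter is a hypothetical demo helper: it starts a timer with the given
// timeout, waits for waitMs milliseconds, and then polls it once.
func firedAfter(timeoutMs, waitMs int) bool {
	timer := time.NewTimer(time.Duration(timeoutMs) * time.Millisecond)
	defer timer.Stop()
	time.Sleep(time.Duration(waitMs) * time.Millisecond)
	return timedOut(timer)
}

func main() {
	fmt.Println("1ms timer polled after 50ms:", firedAfter(1, 50))
	fmt.Println("60s timer polled immediately:", firedAfter(60000, 0))
}
```

Inside the loop, the call site then reduces to `if timedOut(timer) { break }`, where a plain break is enough because it is no longer nested in a select.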

x13n · 2023-03-01T08:23:51Z

cluster-autoscaler/core/scaledown/planner/planner.go

 		case <-timer.C:
 			klog.Warningf("%d out of %d nodes skipped in scale down simulation due to timeout.", len(currentlyUnneededNodeNames)-i, len(currentlyUnneededNodeNames))
-			break
+			break RemovalSimulation


WDYT about testing if the timeout is actually honored?

That would require a timer mock, because even with ScaleDownSimulationTimeout=0 seconds the channel still sends its signal later than the loop finishes processing a small number of nodes.
I don't think we want to move the timer into the categorizeNodes() variables. Do you have other ideas for how to mock the timer?

I was thinking rather about using a real timer, but mocking removalSimulator to make sure it is not called after the timeout.
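
A sketch of that idea, with hypothetical names throughout: `removalSimulator`, `fakeSimulator`, `categorize` and `callsWithSlowSimulator` are simplified illustrations, not the planner's real types or signatures. The fake is deliberately slower than the timeout, so the real timer has certainly fired by the time the second node is reached, and the timeout is long enough that scheduling jitter is unlikely to matter.

```go
package main

import (
	"fmt"
	"time"
)

// removalSimulator is a hypothetical, simplified interface.
type removalSimulator interface {
	SimulateNodeRemoval(nodeName string)
}

// fakeSimulator records every call and takes a fixed time per call.
type fakeSimulator struct {
	delay time.Duration
	calls []string
}

func (f *fakeSimulator) SimulateNodeRemoval(nodeName string) {
	f.calls = append(f.calls, nodeName)
	time.Sleep(f.delay)
}

// categorize mimics the shape of the simulation loop: it stops as soon as
// the real timer fires and never calls the simulator after that.
func categorize(nodes []string, sim removalSimulator, timeout time.Duration) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
RemovalSimulation:
	for _, node := range nodes {
		select {
		case <-timer.C:
			break RemovalSimulation
		default:
		}
		sim.SimulateNodeRemoval(node)
	}
}

// callsWithSlowSimulator runs the loop over nodeCount nodes and returns how
// many times the fake simulator was called.
func callsWithSlowSimulator(nodeCount, timeoutMs, delayMs int) int {
	nodes := make([]string, nodeCount)
	for i := range nodes {
		nodes[i] = fmt.Sprintf("n%d", i)
	}
	sim := &fakeSimulator{delay: time.Duration(delayMs) * time.Millisecond}
	categorize(nodes, sim, time.Duration(timeoutMs)*time.Millisecond)
	return len(sim.calls)
}

func main() {
	// 50ms timeout, 200ms per simulated removal, 5 candidate nodes.
	fmt.Println("simulator calls:", callsWithSlowSimulator(5, 50, 200))
}
```

A test built this way asserts only that the simulator was called for at most one node, so it does not depend on exactly when the timer fires.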

x13n · 2023-03-01T10:22:39Z

cluster-autoscaler/core/scaledown/unremovable/nodes_test.go

 	}
 }

+func TestContains(t *testing.T) {


Thanks for the test! It would be good to have a test case for a node that wasn't added. Right now it seems the test checks whether Contains always returns true :)

yaroslava-serdiuk · 2023-03-01T17:03:16Z

/hold

x13n · 2023-03-01T17:15:18Z

cluster-autoscaler/core/scaledown/planner/planner_test.go

 			p := New(&context, NewTestProcessors(&context), deleteOptions)
 			p.eligibilityChecker = &fakeEligibilityChecker{eligible: asMap(tc.eligible)}
+			if tc.isSimulationTimeout {
+				context.AutoscalingOptions.ScaleDownSimulationTimeout = 1 * time.Nanosecond


I'm a bit worried ns/ms scale may make this test flaky - in particular 1ns can easily pass before we have a chance to process the first node.

x13n · 2023-03-01T17:32:02Z

/lgtm
/approve

k8s-ci-robot · 2023-03-01T17:32:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: x13n, yaroslava-serdiuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [x13n]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

yaroslava-serdiuk · 2023-03-01T20:30:28Z

/unhold

k8s-ci-robot added the kind/bug (Categorizes issue or PR as related to a bug.) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels Feb 28, 2023

k8s-ci-robot requested a review from feiskyer February 28, 2023 19:22

k8s-ci-robot added the area/cluster-autoscaler label Feb 28, 2023

k8s-ci-robot requested a review from x13n February 28, 2023 19:22

k8s-ci-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) label Feb 28, 2023

x13n requested changes Mar 1, 2023

yaroslava-serdiuk force-pushed the scalability branch from 9b58845 to f37498e Compare March 1, 2023 09:24

k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) label and removed the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) label Mar 1, 2023

yaroslava-serdiuk force-pushed the scalability branch from f37498e to 1fa583e Compare March 1, 2023 10:17

k8s-ci-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) label and removed the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) label Mar 1, 2023

x13n reviewed Mar 1, 2023

yaroslava-serdiuk changed the title ~~Fix RemovalSimulation~~ Fix RemovalSimulation for parallel scale down Mar 1, 2023

yaroslava-serdiuk force-pushed the scalability branch from 1fa583e to 76fd882 Compare March 1, 2023 17:02

k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) label and removed the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) label Mar 1, 2023

yaroslava-serdiuk force-pushed the scalability branch from 76fd882 to c2ee917 Compare March 1, 2023 17:02

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2023

x13n reviewed Mar 1, 2023

yaroslava-serdiuk added 2 commits March 1, 2023 17:30

Fix RemovalSimulation for parallel scale down

a35d6d2

Add limit for removable nodes count

849bb5f

yaroslava-serdiuk force-pushed the scalability branch from c2ee917 to 849bb5f Compare March 1, 2023 17:31

k8s-ci-robot assigned x13n Mar 1, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 1, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 1, 2023

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2023

k8s-ci-robot merged commit e1d9861 into kubernetes:master Mar 1, 2023

yaroslava-serdiuk deleted the scalability branch May 28, 2024 23:34
