-
Notifications
You must be signed in to change notification settings - Fork 4.3k
Fix RemovalSimulation for parallel scale down #5552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| // Contains returns true iff a given node is unremovable. | ||
| func (n *Nodes) Contains(nodeName string) bool { | ||
| _, found := n.ttls[nodeName] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works for nodes added with AddTimeout, but not for Add/AddReason. I think checking n.reasons would actually deliver on the promise from the function comment. Also, since you're extending the public interface of this struct, it might be worth to add tests verifying that nodes added via any of the Add* methods can then be checked for presence using Contains.
| } | ||
| p.nodeUtilizationMap = utilizationMap | ||
| timer := time.NewTimer(p.context.ScaleDownSimulationTimeout) | ||
| RemovalSimulation: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, optional: I'd consider extracting an extra function (e.g. timedOut(time.Timer) bool) instead of introducing the loop label. I'm a fan of short functions :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand your proposal. Do you mean extract the whole loop to another function or perform a for loop with additional condition timedOut(time.Timer)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracting the whole loop into the function will reduce readability as for me, because the function will have a lot of parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, extracting the loop might not be very readable, I was thinking about replacing the whole select statement with:
if timedOut(timer) {
break
}
to avoid the need for a label.
| case <-timer.C: | ||
| klog.Warningf("%d out of %d nodes skipped in scale down simulation due to timeout.", len(currentlyUnneededNodeNames)-i, len(currentlyUnneededNodeNames)) | ||
| break | ||
| break RemovalSimulation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT about testing if the timeout is actually honored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's require a timer mock, because with ScaleDownSimulationTimeout=0seconds, the channel still sends a signal later then the loop processing for small number of nodes.
I don't think we want to move timer to categorizeNodes() variables. Do you have other ideas how to mock the timer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking rather about using real timer, but mocking removalSimulator to make sure it is not called after the timeout.
9b58845 to
f37498e
Compare
f37498e to
1fa583e
Compare
| } | ||
| } | ||
|
|
||
| func TestContains(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test! It would be good to have a test case for a node that wasn't added. Right now it seems the test checks whether Contains always returns true :)
1fa583e to
76fd882
Compare
76fd882 to
c2ee917
Compare
|
/hold |
| p := New(&context, NewTestProcessors(&context), deleteOptions) | ||
| p.eligibilityChecker = &fakeEligibilityChecker{eligible: asMap(tc.eligible)} | ||
| if tc.isSimulationTimeout { | ||
| context.AutoscalingOptions.ScaleDownSimulationTimeout = 1 * time.Nanosecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried ns/ms scale may make this test flaky - in particular 1ns can easily pass before we have a chance to process the first node.
c2ee917 to
849bb5f
Compare
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: x13n, yaroslava-serdiuk The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/unhold |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Added label for outer loop and break outer loop if the timer send signal. Before CA did nothing and didn't have any limit for scale down simulation.
Also, I modified the default value for scaleDownSimulationTimeout flag, because 30 seconds is enough to process ~1000 non empty nodes and in the same time cluster snapshot is not too old comparing to the snapshot for 5 minutes timeout.