Error changing node type #48

Closed
verdurin opened this issue Oct 19, 2019 · 3 comments

verdurin commented Oct 19, 2019

When running finish:

Error: Could not find shape information for 'n1-standard-8'.

This was on an already provisioned cluster which I was hoping to change. This may not be a supported use-case?

I wanted to use this machine type because it is recommended for Filestore clients:

https://cloud.google.com/filestore/docs/performance#client-machine

@christopheredsall
Contributor

Workaround

Yes, changing shapes (machine types in Google terms) is supported. The process is, as you tried: change the limits.yaml file and re-run finish.

Unfortunately the list of shapes is currently hardcoded in google-cloud-platform/files/shapes.yaml. At cluster creation time that file is copied onto the management node as /etc/citc/shapes.yaml.

So a workaround would be to edit the file

[provisioner@mgmt ~]$ sudo vim /etc/citc/shapes.yaml

And add a block, for example

n1-standard-8:
  memory: 29000
  cores_per_socket: 4
  threads_per_core: 2

And rerun

[provisioner@mgmt ~]$ finish

This rewrites the node specifications in /mnt/shared/etc/slurm/slurm.conf. When a compute node comes up, the Slurm controller uses those specifications to check with the Slurm daemon on the node that it has sufficient resources.
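
As a rough illustration of that check (a sketch of the idea, not Slurm's actual implementation), the comparison amounts to "everything the node reports must be at least what slurm.conf declares". The function name `node_fits` and the numbers in the usage lines are hypothetical and only serve the example.

```shell
# node_fits CONF_MEM CONF_CORES CONF_THREADS SEEN_MEM SEEN_CORES SEEN_THREADS
# Hypothetical helper: succeeds (exit status 0) when every value the node
# reports is at least as large as the value declared for its shape.
node_fits() {
    [ "$4" -ge "$1" ] && [ "$5" -ge "$2" ] && [ "$6" -ge "$3" ]
}

# Illustrative values: a shape declaring 29000 MB on a node reporting 29994 MB.
if node_fits 29000 4 2 29994 4 2; then
    echo "node accepted"
else
    echo "node rejected: fewer resources than slurm.conf declares"
fi
```

This is why a shape entry that overstates the memory is a problem: the node would report less than the configuration promises.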

Background

In theory, we should be able to get the required information via an API call like machineTypes.list. For example, with the response saved as types.json:

$ jq '[.items][][] | select (.name=="n1-standard-4" or .name=="n1-standard-8") | {name, memoryMb, guestCpus}' < types.json

Gives:

{
  "name": "n1-standard-4",
  "memoryMb": 15360,
  "guestCpus": 4
}
{
  "name": "n1-standard-8",
  "memoryMb": 30720,
  "guestCpus": 8
}

Whereas a freshly booted n1-standard-4-0001 has

[citc@n1-standard-4-0001 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          14876         364       13669           8         842       14208
Swap:             0           0           0
[citc@n1-standard-4-0001 ~]$ lscpu | grep -E '^CPU\(s|^Thread|^Core|^Socket'
CPU(s):                4
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1

and an n1-standard-8-0001 has

[citc@n1-standard-8-0001 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          29994         527       28621           8         846       29103
Swap:             0           0           0
[citc@n1-standard-8-0001 ~]$ lscpu | grep -E '^CPU\(s|^Thread|^Core|^Socket'
CPU(s):                8
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1

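So we have to "derate" the memory somewhat (~5%; the nodes above report about 2-3% less than memoryMb) and work out what the "topology" (threads, cores, sockets) is, which the guestCpus count alone does not tell us.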
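
A minimal sketch of what deriving a shape block from the API values could look like. The function name `shape_from_api`, the 5% default and the integer arithmetic are assumptions for illustration. So is the topology: it assumes a single socket with 2 threads per core, which is what lscpu showed on the two nodes above but is not guaranteed for other machine types.

```shell
# shape_from_api NAME MEMORY_MB GUEST_CPUS [DERATE_PERCENT] [THREADS_PER_CORE]
# Hypothetical helper: prints a shapes.yaml-style block from API values.
shape_from_api() {
    name=$1
    mem_mb=$2
    cpus=$3
    derate=${4:-5}     # assumed default: take 5% off the advertised memory
    threads=${5:-2}    # assumed default: 2 hardware threads per core

    # Integer arithmetic, so the result is rounded down.
    mem=$(( mem_mb * (100 - derate) / 100 ))
    # Assumes one socket, so all cores sit on that socket.
    cores=$(( cpus / threads ))

    printf '%s:\n  memory: %s\n  cores_per_socket: %s\n  threads_per_core: %s\n' \
        "$name" "$mem" "$cores" "$threads"
}

# Example: shape_from_api n1-standard-8 30720 8
```

With the n1-standard-8 values from the jq output, this gives a memory figure of 29184, slightly above the 29000 used in the workaround block but still below the 29994 the node actually reports.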

At the moment we are doing this empirically by booting a node and seeing what we get.
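
The empirical route can be sketched the same way: save the output of `free -m` and `lscpu` from a freshly booted node and turn it into a block for shapes.yaml. The function name `shape_from_probe` is hypothetical. Rounding the memory down to a multiple of 1000 is an assumption, chosen because it reproduces the 29000 in the workaround block; it is not necessarily how the project picks its values.

```shell
# shape_from_probe NAME FREE_OUTPUT_FILE LSCPU_OUTPUT_FILE
# Hypothetical helper: prints a shapes.yaml-style block from saved probe output.
shape_from_probe() {
    name=$1
    # `free -m`: on the "Mem:" row the second column is the total in MiB.
    # Round down to a multiple of 1000 to leave some headroom (assumption).
    mem=$(awk '/^Mem:/ {print int($2 / 1000) * 1000}' "$2")
    # `lscpu` prints "Key:   value"; strip the padding from the value.
    cores=$(awk -F: '/^Core\(s\) per socket/ {gsub(/[ \t]/, "", $2); print $2}' "$3")
    threads=$(awk -F: '/^Thread\(s\) per core/ {gsub(/[ \t]/, "", $2); print $2}' "$3")

    printf '%s:\n  memory: %s\n  cores_per_socket: %s\n  threads_per_core: %s\n' \
        "$name" "$mem" "$cores" "$threads"
}

# Example, run on the compute node:
#   free -m > free.txt; lscpu > lscpu.txt
#   shape_from_probe n1-standard-8 free.txt lscpu.txt
```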

@christopheredsall
Contributor

I've made pull request #50. In case you are blocked, you could pull the shapes.yaml out of that in the meantime.

@verdurin
Author

@christopheredsall thanks, I also spotted that there was a branch addressing this, after I posted the issue.
I've tried the PR and it does work, as expected.
