Fix oversubscription with CHPL_LOCALE_MODEL=gpu #26059

jhh67 · 2024-10-08T20:38:03Z

The PRs to add GPU support to co-locales (PRs #25734 and #25846) broke oversubscription such that no locales had any GPUs. This PR fixes that problem, and cleans up resource allocation with co-locales in general. Oversubscription is handled more cleanly, as is the "remainder" node that occurs when the number of locales is not evenly divisible by the number of nodes.

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Don't crash if the logical accessible CPU set for a locale isn't set. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

If we are oversubscribed, then the locales should share the devices instead of being treated as co-locales with the devices partitioned among them. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

If the number of nodes does not evenly divide the number of locales, there will be a "remainder node" that has fewer co-locales than the other nodes. Previously, the remainder node was handled by special-casing that was clunky and error-prone.

This commit introduces the "partition" abstraction, in which the number of partitions on each node is the expected number of co-locales on the node. All nodes, including the remainder node, allocate resources based on partitions, then assign co-locales to partitions. On the remainder node this means that some partitions (and therefore resources) go unused, but this is what we want because all locales should have the same amount of resources. This greatly cleans up the code.

In addition, oversubscription handling is cleaner. If there are locales on the node but the expected number of co-locales is zero, the node is oversubscribed and all locales share all resources. Also added some remainder node and oversubscription tests. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
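The partition scheme described above can be illustrated with a minimal sketch. This is not the code in `topo-hwloc.c`; the type `partition_t` and the functions `get_partition` and `unused_partitions` are hypothetical names invented for illustration, and the sketch assumes resources are numbered contiguously and that any leftover resources (when the count does not divide evenly) are simply not handed out.

```c
#include <assert.h>

/* Hypothetical type: a contiguous block of resource indices (e.g. cores). */
typedef struct {
  int first;  /* index of the first resource in the block */
  int count;  /* number of resources in the block */
} partition_t;

/*
 * Return the block of resources belonging to partition `index` when
 * `numResources` resources are split into `numPartitions` equal partitions.
 * `numPartitions` stands for the expected number of co-locales on the node.
 * Assumption: the numResources % numPartitions leftover resources are not
 * assigned, so every partition has the same size.
 */
partition_t get_partition(int numResources, int numPartitions, int index) {
  assert(numResources >= 0);
  assert(numPartitions > 0);
  assert(index >= 0 && index < numPartitions);
  partition_t p;
  p.count = numResources / numPartitions;
  p.first = index * p.count;
  return p;
}

/*
 * Number of partitions that go unused on a node that was partitioned for
 * `numPartitions` co-locales but actually runs `numLocalesOnNode` of them.
 * This is zero on a full node and positive on the remainder node.
 */
int unused_partitions(int numPartitions, int numLocalesOnNode) {
  assert(numLocalesOnNode >= 0 && numLocalesOnNode <= numPartitions);
  return numPartitions - numLocalesOnNode;
}
```

For example, with 7 locales on 4 nodes, 2 expected co-locales per node, and 8 cores per node, the remainder node runs a single locale. Under this sketch that locale gets partition 0 (cores 0 through 3), and partition 1 (cores 4 through 7) stays unused, so it has the same number of cores as every locale on the full nodes.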

runtime/src/topo/hwloc/topo-hwloc.c

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

bradcray · 2024-10-10T04:50:33Z

@jhh67: Does this resolve #25989?

@e-kayrakli: And presumably that user's GPU/SMP case that you were helping me with last week as well?

jhh67 · 2024-10-10T13:33:09Z

@bradcray: I realized last night I forgot to close the issues resolved by the PRs I merged yesterday. I will do it today.

jhh67 added 5 commits October 8, 2024 09:13

Fix logical accessible CPU set debug message

d0918dc

Don't crash if the logical accessible CPU set for a locale isn't set. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Dump distance matrix to debugging output

ef3f0a1

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Minimum distance might be the maximum

840e56d

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

Partition devices based on number of co-locales, not locales

b49c745

If we are oversubscribed, then the locales should share the devices instead of being treated as co-locales with the devices partitioned among them. Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>
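A rough sketch of the distinction this commit draws between sharing and partitioning devices follows. It is an illustration under stated assumptions, not the runtime's implementation: `devices_for_locale` and its parameters are hypothetical, devices are assumed to be identified by indices `0..numDevices-1`, and a co-locale count of zero is taken to mean the locales on the node are not co-locales (the oversubscribed case).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill `out` with the device indices visible to one locale and return how
 * many were written. `out` must have room for `numDevices` entries.
 *
 * numColocales: expected number of co-locales on the node; zero is assumed
 *               to mean the locales are not co-locales (oversubscription).
 * localRank:    rank of this locale among the locales on its node; it is
 *               ignored in the oversubscribed case.
 */
int devices_for_locale(int numDevices, int numColocales, int localRank,
                       int *out) {
  assert(numDevices >= 0 && numColocales >= 0 && out != NULL);
  if (numColocales == 0) {
    /* Oversubscribed: every locale shares every device. */
    for (int i = 0; i < numDevices; i++) {
      out[i] = i;
    }
    return numDevices;
  }
  /* Co-locales: split the devices evenly by co-locale count. */
  assert(localRank >= 0 && localRank < numColocales);
  int perLocale = numDevices / numColocales;
  for (int i = 0; i < perLocale; i++) {
    out[i] = localRank * perLocale + i;
  }
  return perLocale;
}
```

With 4 devices and 2 co-locales, the locale with rank 1 sees devices 2 and 3. With 4 devices and no co-locales, every locale on the node sees all four devices regardless of its rank.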

jhh67 force-pushed the gpu branch from 65c6e43 to 7a25877 Compare October 8, 2024 20:40

jhh67 requested a review from jabraham17 October 9, 2024 13:41

jhh67 marked this pull request as ready for review October 9, 2024 13:41

jabraham17 approved these changes Oct 9, 2024

jabraham17 reviewed Oct 9, 2024

runtime/src/topo/hwloc/topo-hwloc.c (outdated, resolved)

Fixed typo.

a204581

Signed-off-by: John H. Hartman <jhh67@users.noreply.github.com>

jhh67 merged commit 75028c6 into chapel-lang:main Oct 9, 2024
7 checks passed

jhh67 deleted the gpu branch October 9, 2024 21:40

jhh67 mentioned this pull request Oct 10, 2024

Oversubscribed gasnet with GPU support is broken #25989

Closed
