
gitlab: add sched #15

Merged — 1 commit merged into flux-framework:main from add-sched on Feb 22, 2024

Conversation


@wihobbs wihobbs commented Feb 14, 2024

This is being put up as WIP because there are some failures in the testsuite when run under the GitLab runners:

(s=125,d=0) fluxci@corona82 /usr/WS1/fluxci/cibuilds/380819_tioga/flux-sched/t (master)$ cat test-suite.log | grep ERROR
# ERROR: 5
ERROR: t4005-match-unsat
ERROR: t4005-match-unsat.t - missing test plan
ERROR: t4005-match-unsat.t - exited with status 1
ERROR: t4008-match-jgf
ERROR: t4008-match-jgf.t - missing test plan
ERROR: t4008-match-jgf.t - exited with status 1
ERROR: t5000-valgrind
ERROR: t5000-valgrind.t - exited with status 1

Notably, the valgrind test is failing, even when I rerun it. I don't really see what the "missing test plan" is all about...
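(Editor's aside, not from the original thread: "missing test plan" is TAP harness terminology. Sharness scripts print a trailing `1..N` plan line when they finish normally; if a script crashes partway through, the plan line never appears, so "missing test plan" usually means t4005 and t4008 died mid-run rather than failing a specific assertion. A minimal self-contained illustration, with a hypothetical file name:)

```shell
# A tiny TAP-style script. The "1..2" plan line at the end tells the
# harness how many tests to expect; if the script exited before printing
# it, the harness would report "missing test plan".
cat > /tmp/tap-demo.t <<'EOF'
#!/bin/sh
echo "ok 1 - first check"
echo "ok 2 - second check"
echo "1..2"
EOF
sh /tmp/tap-demo.t
```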


wihobbs commented Feb 14, 2024

What failed:

(s=125,d=0) fluxci@corona82 /usr/WS1/fluxci/cibuilds/380819_tioga/flux-sched/t (master)$ cat t5000-valgrind.output
expecting success:
	run_timeout 900 \
	flux start -s ${VALGRIND_NBROKERS} \
		--killer-timeout=120 \
		--wrap=libtool,e,${VALGRIND} \
		--wrap=--tool=memcheck \
		--wrap=--leak-check=full \
		--wrap=--gen-suppressions=all \
		--wrap=--trace-children=no \
		--wrap=--child-silent-after-fork=yes \
		--wrap=--num-callers=30 \
		--wrap=--leak-resolution=med \
		--wrap=--error-exitcode=1 \
		--wrap=--suppressions=$VALGRIND_SUPPRESSIONS \
		 ${VALGRIND_WORKLOAD}

==2147959== Memcheck, a memory error detector
==2147959== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==2147959== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==2147959== Command: /usr/WS1/fluxci/cibuilds/380819_tioga/flux-core/install/libexec/flux/cmd/flux-broker --setattr=rundir=/var/tmp/fluxci/flux-dARpVC /usr/WS1/fluxci/cibuilds/380819_tioga/flux-sched/t/valgrind/valgrind-workload.sh
==2147959==
==2147966== Memcheck, a memory error detector
==2147966== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==2147966== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==2147966== Command: /usr/WS1/fluxci/cibuilds/380819_tioga/flux-core/install/libexec/flux/cmd/flux-broker --setattr=rundir=/var/tmp/fluxci/flux-dARpVC
==2147966==
FLUX_URI=local:///var/tmp/fluxci/flux-dARpVC/local-0
Running 00-job
Submitting 10 jobs
fDCE4Q3h
fDWosPvP
fDtcUqJX
fEC26QDM
fEWXTSF1
fEonB5Tm
fFBgiTyH
fFRPMq27
fFeF5bmZ
fFt9as59
Waiting jobs to complete
Completed
Feb 14 18:21:00.896979 sched-fluxion-qmanager.err[0]: update_on_resource_response: exiting due to sched-fluxion-resource.notify failure: Operation canceled
==2147966==
==2147966== HEAP SUMMARY:
==2147966==     in use at exit: 560,453 bytes in 151 blocks
==2147966==   total heap usage: 201,853 allocs, 201,702 frees, 136,632,343 bytes allocated
==2147966==
==2147966== 496 bytes in 1 blocks are possibly lost in loss record 53 of 67
==2147966==    at 0x4C3D1C3: calloc (vg_replace_malloc.c:1554)
==2147966==    by 0x4015322: UnknownInlinedFun (rtld-malloc.h:44)
==2147966==    by 0x4015322: allocate_dtv (dl-tls.c:371)
==2147966==    by 0x4015D51: _dl_allocate_tls (dl-tls.c:629)
==2147966==    by 0x4E51E32: pthread_create@@GLIBC_2.2.5 (in /usr/lib64/libpthread-2.28.so)
==2147966==    by 0x12157BE2: ??? (in /usr/lib64/libcuda.so.545.23.08)
==2147966==    by 0x1221800E: ??? (in /usr/lib64/libcuda.so.545.23.08)
==2147966==    by 0x120F84E2: ??? (in /usr/lib64/libcuda.so.545.23.08)
==2147966==    by 0x1216A4D7: ??? (in /usr/lib64/libcuda.so.545.23.08)
==2147966==    by 0xFC08519: ??? (in /usr/lib64/hwloc/libcudart.so.12)
==2147966==    by 0xFC0B8DF: ??? (in /usr/lib64/hwloc/libcudart.so.12)
==2147966==    by 0x4E58E66: __pthread_once_slow (in /usr/lib64/libpthread-2.28.so)
==2147966==    by 0xFC54658: ??? (in /usr/lib64/hwloc/libcudart.so.12)
==2147966==    by 0xFBFC0BE: ??? (in /usr/lib64/hwloc/libcudart.so.12)
==2147966==    by 0xFC1F279: cudaGetDeviceCount (in /usr/lib64/hwloc/libcudart.so.12)
==2147966==    by 0xF9D00D8: hwloc_cuda_discover (in /usr/lib64/hwloc/hwloc_cuda.so)
==2147966==    by 0xE7DC90B: hwloc_discover_by_phase (in /usr/lib64/libhwloc.so.15.7.0)
==2147966==    by 0xE7DD08D: hwloc_discover (in /usr/lib64/libhwloc.so.15.7.0)
==2147966==    by 0xE7DE1FE: hwloc_topology_load (in /usr/lib64/libhwloc.so.15.7.0)
==2147966==    by 0x415730F: rhwloc_local_topology_load (rhwloc.c:216)
==2147966==    by 0x415735B: rhwloc_local_topology_xml (rhwloc.c:231)
==2147966==    by 0x414B6BD: topo_get_local_xml (topo.c:302)
==2147966==    by 0x414B6BD: topo_create (topo.c:332)
==2147966==    by 0x414A5A0: mod_main (resource.c:507)
==2147966==    by 0x4109D3: module_thread (module.c:208)
==2147966==    by 0x4E511C9: start_thread (in /usr/lib64/libpthread-2.28.so)
==2147966==    by 0x6043E72: clone (in /usr/lib64/libc-2.28.so)
==2147966==
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: possible
   fun:calloc
   fun:UnknownInlinedFun
   fun:allocate_dtv
   fun:_dl_allocate_tls
   fun:pthread_create@@GLIBC_2.2.5
   obj:/usr/lib64/libcuda.so.545.23.08
   obj:/usr/lib64/libcuda.so.545.23.08
   obj:/usr/lib64/libcuda.so.545.23.08
   obj:/usr/lib64/libcuda.so.545.23.08
   obj:/usr/lib64/hwloc/libcudart.so.12
   obj:/usr/lib64/hwloc/libcudart.so.12
   fun:__pthread_once_slow
   obj:/usr/lib64/hwloc/libcudart.so.12
   obj:/usr/lib64/hwloc/libcudart.so.12
   fun:cudaGetDeviceCount
   fun:hwloc_cuda_discover
   fun:hwloc_discover_by_phase
   fun:hwloc_discover
   fun:hwloc_topology_load
   fun:rhwloc_local_topology_load
   fun:rhwloc_local_topology_xml
   fun:topo_get_local_xml
   fun:topo_create
   fun:mod_main
   fun:module_thread
   fun:start_thread
   fun:clone
}
==2147966== 525,056 bytes in 16 blocks are definitely lost in loss record 67 of 67
==2147966==    at 0x4C38185: malloc (vg_replace_malloc.c:431)
==2147966==    by 0x60FF0B9: __alloc_dir (in /usr/lib64/libc-2.28.so)
==2147966==    by 0x60FF1BC: opendir_tail (in /usr/lib64/libc-2.28.so)
==2147966==    by 0x100E8C8A: amd::smi::getListOfAppTmpFiles[abi:cxx11]() (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147966==    by 0x100E9474: amd::smi::readTmpFile(unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147966==    by 0x100BFA2E: rsmi_status_t amd::smi::storeParameter<rsmi_compute_partition_type_t>(unsigned int) (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147966==    by 0x100BFD2E: amd::smi::Device::storeDevicePartitions(unsigned int) (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147966==    by 0x100CAEDF: amd::smi::RocmSMI::Initialize(unsigned long) (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147966==    by 0x100FDE8A: rsmi_init (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147966==    by 0xFE819FB: hwloc_rsmi_discover (in /usr/lib64/hwloc/hwloc_rsmi.so)
==2147966==    by 0xE7DC90B: hwloc_discover_by_phase (in /usr/lib64/libhwloc.so.15.7.0)
==2147966==    by 0xE7DD08D: hwloc_discover (in /usr/lib64/libhwloc.so.15.7.0)
==2147966==    by 0xE7DE1FE: hwloc_topology_load (in /usr/lib64/libhwloc.so.15.7.0)
==2147966==    by 0x415730F: rhwloc_local_topology_load (rhwloc.c:216)
==2147966==    by 0x415735B: rhwloc_local_topology_xml (rhwloc.c:231)
==2147966==    by 0x414B6BD: topo_get_local_xml (topo.c:302)
==2147966==    by 0x414B6BD: topo_create (topo.c:332)
==2147966==    by 0x414A5A0: mod_main (resource.c:507)
==2147966==    by 0x4109D3: module_thread (module.c:208)
==2147966==    by 0x4E511C9: start_thread (in /usr/lib64/libpthread-2.28.so)
==2147966==    by 0x6043E72: clone (in /usr/lib64/libc-2.28.so)
==2147966==
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   fun:__alloc_dir
   fun:opendir_tail
   fun:_ZN3amd3smi20getListOfAppTmpFilesB5cxx11Ev
   fun:_ZN3amd3smi11readTmpFileEjNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_
   fun:_ZN3amd3smi14storeParameterI29rsmi_compute_partition_type_tEE13rsmi_status_tj
   fun:_ZN3amd3smi6Device21storeDevicePartitionsEj
   fun:_ZN3amd3smi7RocmSMI10InitializeEm
   fun:rsmi_init
   fun:hwloc_rsmi_discover
   fun:hwloc_discover_by_phase
   fun:hwloc_discover
   fun:hwloc_topology_load
   fun:rhwloc_local_topology_load
   fun:rhwloc_local_topology_xml
   fun:topo_get_local_xml
   fun:topo_create
   fun:mod_main
   fun:module_thread
   fun:start_thread
   fun:clone
}
==2147966== LEAK SUMMARY:
==2147966==    definitely lost: 525,056 bytes in 16 blocks
==2147966==    indirectly lost: 0 bytes in 0 blocks
==2147966==      possibly lost: 496 bytes in 1 blocks
==2147966==    still reachable: 34,901 bytes in 134 blocks
==2147966==         suppressed: 0 bytes in 0 blocks
==2147966== Reachable blocks (those to which a pointer was found) are not shown.
==2147966== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==2147966==
==2147966== For lists of detected and suppressed errors, rerun with: -s
==2147966== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
flux-start: 1 (pid 2147966) exited with rc=1
==2147959==
==2147959== HEAP SUMMARY:
==2147959==     in use at exit: 614,398 bytes in 257 blocks
==2147959==   total heap usage: 598,586 allocs, 598,329 frees, 251,207,903 bytes allocated
==2147959==
==2147959== 496 bytes in 1 blocks are possibly lost in loss record 66 of 77
==2147959==    at 0x4C3D1C3: calloc (vg_replace_malloc.c:1554)
==2147959==    by 0x4015322: UnknownInlinedFun (rtld-malloc.h:44)
==2147959==    by 0x4015322: allocate_dtv (dl-tls.c:371)
==2147959==    by 0x4015D51: _dl_allocate_tls (dl-tls.c:629)
==2147959==    by 0x4E51E32: pthread_create@@GLIBC_2.2.5 (in /usr/lib64/libpthread-2.28.so)
==2147959==    by 0x1306CBE2: ??? (in /usr/lib64/libcuda.so.545.23.08)
==2147959==    by 0x1312D00E: ??? (in /usr/lib64/libcuda.so.545.23.08)
==2147959==    by 0x1300D4E2: ??? (in /usr/lib64/libcuda.so.545.23.08)
==2147959==    by 0x1307F4D7: ??? (in /usr/lib64/libcuda.so.545.23.08)
==2147959==    by 0x1071D519: ??? (in /usr/lib64/hwloc/libcudart.so.12)
==2147959==    by 0x107208DF: ??? (in /usr/lib64/hwloc/libcudart.so.12)
==2147959==    by 0x4E58E66: __pthread_once_slow (in /usr/lib64/libpthread-2.28.so)
==2147959==    by 0x10769658: ??? (in /usr/lib64/hwloc/libcudart.so.12)
==2147959==    by 0x107110BE: ??? (in /usr/lib64/hwloc/libcudart.so.12)
==2147959==    by 0x10734279: cudaGetDeviceCount (in /usr/lib64/hwloc/libcudart.so.12)
==2147959==    by 0x104E50D8: hwloc_cuda_discover (in /usr/lib64/hwloc/hwloc_cuda.so)
==2147959==    by 0xF2F190B: hwloc_discover_by_phase (in /usr/lib64/libhwloc.so.15.7.0)
==2147959==    by 0xF2F208D: hwloc_discover (in /usr/lib64/libhwloc.so.15.7.0)
==2147959==    by 0xF2F31FE: hwloc_topology_load (in /usr/lib64/libhwloc.so.15.7.0)
==2147959==    by 0x416130F: rhwloc_local_topology_load (rhwloc.c:216)
==2147959==    by 0x416135B: rhwloc_local_topology_xml (rhwloc.c:231)
==2147959==    by 0x41556BD: topo_get_local_xml (topo.c:302)
==2147959==    by 0x41556BD: topo_create (topo.c:332)
==2147959==    by 0x415473C: mod_main (resource.c:507)
==2147959==    by 0x4109D3: module_thread (module.c:208)
==2147959==    by 0x4E511C9: start_thread (in /usr/lib64/libpthread-2.28.so)
==2147959==    by 0x6043E72: clone (in /usr/lib64/libc-2.28.so)
==2147959==
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: possible
   fun:calloc
   fun:UnknownInlinedFun
   fun:allocate_dtv
   fun:_dl_allocate_tls
   fun:pthread_create@@GLIBC_2.2.5
   obj:/usr/lib64/libcuda.so.545.23.08
   obj:/usr/lib64/libcuda.so.545.23.08
   obj:/usr/lib64/libcuda.so.545.23.08
   obj:/usr/lib64/libcuda.so.545.23.08
   obj:/usr/lib64/hwloc/libcudart.so.12
   obj:/usr/lib64/hwloc/libcudart.so.12
   fun:__pthread_once_slow
   obj:/usr/lib64/hwloc/libcudart.so.12
   obj:/usr/lib64/hwloc/libcudart.so.12
   fun:cudaGetDeviceCount
   fun:hwloc_cuda_discover
   fun:hwloc_discover_by_phase
   fun:hwloc_discover
   fun:hwloc_topology_load
   fun:rhwloc_local_topology_load
   fun:rhwloc_local_topology_xml
   fun:topo_get_local_xml
   fun:topo_create
   fun:mod_main
   fun:module_thread
   fun:start_thread
   fun:clone
}
==2147959== 525,056 bytes in 16 blocks are definitely lost in loss record 77 of 77
==2147959==    at 0x4C38185: malloc (vg_replace_malloc.c:431)
==2147959==    by 0x60FF0B9: __alloc_dir (in /usr/lib64/libc-2.28.so)
==2147959==    by 0x60FF1BC: opendir_tail (in /usr/lib64/libc-2.28.so)
==2147959==    by 0x10BFDC8A: amd::smi::getListOfAppTmpFiles[abi:cxx11]() (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147959==    by 0x10BFE474: amd::smi::readTmpFile(unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147959==    by 0x10BD4A2E: rsmi_status_t amd::smi::storeParameter<rsmi_compute_partition_type_t>(unsigned int) (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147959==    by 0x10BD4D2E: amd::smi::Device::storeDevicePartitions(unsigned int) (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147959==    by 0x10BDFEDF: amd::smi::RocmSMI::Initialize(unsigned long) (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147959==    by 0x10C12E8A: rsmi_init (in /usr/lib64/hwloc/librocm_smi64.so.5)
==2147959==    by 0x109969FB: hwloc_rsmi_discover (in /usr/lib64/hwloc/hwloc_rsmi.so)
==2147959==    by 0xF2F190B: hwloc_discover_by_phase (in /usr/lib64/libhwloc.so.15.7.0)
==2147959==    by 0xF2F208D: hwloc_discover (in /usr/lib64/libhwloc.so.15.7.0)
==2147959==    by 0xF2F31FE: hwloc_topology_load (in /usr/lib64/libhwloc.so.15.7.0)
==2147959==    by 0x416130F: rhwloc_local_topology_load (rhwloc.c:216)
==2147959==    by 0x416135B: rhwloc_local_topology_xml (rhwloc.c:231)
==2147959==    by 0x41556BD: topo_get_local_xml (topo.c:302)
==2147959==    by 0x41556BD: topo_create (topo.c:332)
==2147959==    by 0x415473C: mod_main (resource.c:507)
==2147959==    by 0x4109D3: module_thread (module.c:208)
==2147959==    by 0x4E511C9: start_thread (in /usr/lib64/libpthread-2.28.so)
==2147959==    by 0x6043E72: clone (in /usr/lib64/libc-2.28.so)
==2147959==
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   fun:__alloc_dir
   fun:opendir_tail
   fun:_ZN3amd3smi20getListOfAppTmpFilesB5cxx11Ev
   fun:_ZN3amd3smi11readTmpFileEjNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES6_
   fun:_ZN3amd3smi14storeParameterI29rsmi_compute_partition_type_tEE13rsmi_status_tj
   fun:_ZN3amd3smi6Device21storeDevicePartitionsEj
   fun:_ZN3amd3smi7RocmSMI10InitializeEm
   fun:rsmi_init
   fun:hwloc_rsmi_discover
   fun:hwloc_discover_by_phase
   fun:hwloc_discover
   fun:hwloc_topology_load
   fun:rhwloc_local_topology_load
   fun:rhwloc_local_topology_xml
   fun:topo_get_local_xml
   fun:topo_create
   fun:mod_main
   fun:module_thread
   fun:start_thread
   fun:clone
}
==2147959== LEAK SUMMARY:
==2147959==    definitely lost: 525,056 bytes in 16 blocks
==2147959==    indirectly lost: 0 bytes in 0 blocks
==2147959==      possibly lost: 496 bytes in 1 blocks
==2147959==    still reachable: 88,846 bytes in 240 blocks
==2147959==         suppressed: 0 bytes in 0 blocks
==2147959== Reachable blocks (those to which a pointer was found) are not shown.
==2147959== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==2147959==
==2147959== For lists of detected and suppressed errors, rerun with: -s
==2147959== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
flux-start: 0 (pid 2147959) exited with rc=1
not ok 1 - valgrind reports no new errors on 2 broker run

# failed 1 among 1 test(s)
1..1
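(Editor's aside: the `{ ... }` blocks in the output above are printed because the test passes `--wrap=--gen-suppressions=all`; each one is a ready-made suppression that can be pasted, with a name substituted for `<insert_a_suppression_name_here>`, into the file referenced by `$VALGRIND_SUPPRESSIONS`. A sketch — the suppression name and file path below are illustrative, not from the PR, and `...` is valgrind's frame wildcard:)

```shell
# Append a named suppression for the rocm_smi/hwloc leak to a
# suppressions file (hypothetical name and path).
cat >> /tmp/valgrind.supp <<'EOF'
{
   hwloc_rsmi_appfile_leak
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   fun:__alloc_dir
   fun:opendir_tail
   ...
   fun:hwloc_rsmi_discover
}
EOF
```

Valgrind would then be invoked with `--suppressions=/tmp/valgrind.supp` to silence that record.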


wihobbs commented Feb 21, 2024

Since the above is a hwloc issue and not a sched issue, we're going to ignore the valgrind test in our CI for now.
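(Editor's aside: one way such an exemption can be done — a sketch, not the actual change from this PR — is to filter the sharness scripts passed to Automake's parallel test harness via a `TESTS` override, e.g. `make check TESTS="$(ls t/t[0-9]*.t | grep -v t5000-valgrind)"`. The filtering itself, demonstrated with stand-in files:)

```shell
# Build a TESTS list that excludes the valgrind script. Paths and file
# names here are stand-ins for the real flux-sched t/ directory.
mkdir -p /tmp/demo-t
touch /tmp/demo-t/t4005-match-unsat.t /tmp/demo-t/t5000-valgrind.t
TESTS=$(ls /tmp/demo-t/t[0-9]*.t | grep -v t5000-valgrind)
echo "$TESTS"
```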

@wihobbs wihobbs force-pushed the add-sched branch 3 times, most recently from b3c240d to 63d9f90, on February 22, 2024 00:40
@wihobbs wihobbs changed the title from "WIP: gitlab: add sched" to "gitlab: add sched" on Feb 22, 2024
@wihobbs wihobbs requested a review from grondo February 22, 2024 00:40

wihobbs commented Feb 22, 2024

Dropped WIP. This is ready for a review.


wihobbs commented Feb 22, 2024

Whoops, accidentally had two corona-sched-test entries in there.


@grondo grondo left a comment


LGTM! One suggestion inline, but feel free to ignore it if the current solution is working fine.

.gitlab-ci.yml (outdated excerpt):
- git clone https://github.com/flux-framework/flux-sched
- cd flux-sched
- module load gcc
- ${CORE_INSTALL_PREFIX}/bin/flux start ./configure

@grondo grondo Feb 22, 2024


I meant to mention that you could probably drop the start here and just use /path/to/flux ./configure. This is because when an explicit path is used with flux(1) it will execute its argument with the same environment as would have been passed to a subcommand.

Try that and see if it works if you'd like.
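(Editor's aside: applied to the excerpt above, the suggestion would look something like the following — a sketch of the CI script list, not the merged file.)

```yaml
# .gitlab-ci.yml script section (sketch): the `flux start` wrapper is
# dropped, and configure is run through the flux front-end directly, so
# it inherits the same environment a flux subcommand would get.
- git clone https://github.com/flux-framework/flux-sched
- cd flux-sched
- module load gcc
- ${CORE_INSTALL_PREFIX}/bin/flux ./configure
```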

wihobbs (Member Author) replied:

Oh, that's handy, thanks! Fixed and pushed.

I also noticed that the commit message was out of date. Fixed on recent push.


LGTM!

Problem: We need to be testing flux-sched's integration with the
rest of the framework, and we're not.

Do that on corona and tioga since they're the system-instance clusters.

Note that due to missing RPATH entries for certain libraries in Cray's
compilers, this currently needs to be compiled with GCC. Also, the
Valgrind test is temporarily exempted because it fails due to a memory
leak in hwloc.
@wihobbs wihobbs merged commit b3c8a1e into flux-framework:main Feb 22, 2024
1 check passed