This is a playbook for build cops to help deal with problems with the CI infrastructure.
List regional quotas to see which quotas are running hot
gcloud compute regions describe --project=kubeflow-ci ${REGION}
Check if we are leaking Kubeflow deployments and this is causing us to run out of quota.
gcloud --project=kubeflow-ci --format="table(name,createTime:sort=1,location,status)" container clusters list
gcloud --project=kubeflow-ci deployment-manager deployments list --format="table(name,insertTime:sort=1)"
- Deployments created by the E2E tests should be GC'd after roughly 2 hours
- So if there are resources older than roughly 2 hours, it indicates that there is a problem with garbage collection
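The age check above can be scripted. The following is a minimal sketch, not part of the existing tooling: `stale_resources` is a hypothetical helper that reads `NAME TIMESTAMP` lines and prints the names older than a threshold. It assumes GNU `date` (for `-d` parsing) and that the timestamps are in a format `date` understands, such as the ISO 8601 values gcloud prints.

```shell
# Hypothetical helper: print the names of resources older than MAX_AGE_HOURS.
# Reads whitespace-separated "NAME TIMESTAMP" lines on stdin.
# Assumes GNU date; lines whose timestamp cannot be parsed are skipped.
stale_resources() {
  local max_age_hours="${1:-2}"
  local now cutoff name ts created
  now=$(date -u +%s)
  cutoff=$(( now - max_age_hours * 3600 ))
  while read -r name ts; do
    [ -z "$name" ] && continue
    created=$(date -u -d "$ts" +%s 2>/dev/null) || continue
    if [ "$created" -lt "$cutoff" ]; then
      echo "$name"
    fi
  done
}

# Example usage (requires gcloud access to the project):
#   gcloud --project=kubeflow-ci deployment-manager deployments list \
#     --format="value(name,insertTime)" | stale_resources 2
```

Any name this prints is a candidate for a leaked deployment; an empty result suggests garbage collection is keeping up.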
To access k8s resources, make sure to get credentials and set the default namespace to kubeflow-test-infra
gcloud container clusters get-credentials kubeflow-testing --zone $ZONE --project kubeflow-ci
kubectl config set-context $(kubectl config current-context) --namespace=kubeflow-test-infra
Check if the cron job to GC resources is running in the test cluster
kubectl get cronjobs
NAME         SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cleanup-ci   0 */2 * * *   False     0        <none>          14m
The cron job is defined in cleanup-ci-cron.jsonnet
If the cron job is not configured then start it.
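The check can also be done without reading the table by eye. This is a rough sketch; `cleanup_cronjob_state` is a hypothetical helper name, and it assumes the cron job is called `cleanup-ci` (as in the output above) and that the current context already points at the test cluster and namespace.

```shell
# Hypothetical helper: report the state of the cleanup-ci cron job in the
# current namespace. Prints one of: missing, suspended, active.
# Returns non-zero unless the cron job exists and is not suspended.
cleanup_cronjob_state() {
  local suspend
  if ! suspend=$(kubectl get cronjob cleanup-ci -o jsonpath='{.spec.suspend}' 2>/dev/null); then
    echo "missing"
    return 1
  fi
  if [ "$suspend" = "true" ]; then
    echo "suspended"
    return 1
  fi
  echo "active"
}

# Example usage:
#   cleanup_cronjob_state || echo "cleanup-ci needs attention"
```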
Look for recent runs of the cron job and figure out whether they are running successfully
kubectl get jobs | grep cleanup-ci
Jobs triggered by cron will match the regex
Check that the job ran successfully
The pods associated with the job can be fetched via labels
kubectl logs -l job-name=${JOBNAME}
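To avoid copying the job name by hand, the two steps above can be combined. A minimal sketch, assuming job names contain `cleanup-ci` as in the grep above; `latest_cleanup_job` is a hypothetical helper, not an existing script.

```shell
# Hypothetical helper: print the name of the most recently created job whose
# name contains "cleanup-ci". Jobs are sorted oldest-first, so the last match
# is the newest. The "kind/" prefix added by "-o name" is stripped.
latest_cleanup_job() {
  kubectl get jobs --sort-by=.metadata.creationTimestamp -o name \
    | grep cleanup-ci \
    | tail -n 1 \
    | sed 's|^.*/||'
}

# Example usage:
#   JOBNAME=$(latest_cleanup_job)
#   kubectl logs -l job-name="${JOBNAME}"
```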
Do a one-off run of the cleanup job
cd test-infra
kubectl create -f cleanup-ci-kubeflow-ci-deployment.yaml
- You can adjust the command line arguments in order to do more aggressive garbage collection than usual
Use Stackdriver to check the disk usage
There are two ways to free up disk space
Delete old directories on the NFS share
Delete and recreate the NFS share
- Both options are outlined below
Start a shell in the debug worker
kubectl exec -it debug-worker-0 -- /bin/bash
Delete old directories
cd /mnt/test-data-volume
find -maxdepth 1 -type d ! -path . -mtime +7 -exec rm -rf {} ";"
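Since the `find` command above deletes immediately, it can help to preview what it would remove first. The sketch below lists the same set of directories without deleting anything; `list_old_dirs` is a hypothetical helper, and the defaults simply mirror the path and age used above.

```shell
# Hypothetical helper: list (without deleting) the top-level directories
# under DIR that were last modified more than DAYS days ago.
list_old_dirs() {
  local dir="${1:-/mnt/test-data-volume}"
  local days="${2:-7}"
  find "$dir" -mindepth 1 -maxdepth 1 -type d -mtime "+${days}" -print
}

# Example usage, run inside the debug worker:
#   list_old_dirs /mnt/test-data-volume 7
```

If the list looks right, run the delete command above; lowering the day count frees more space at the cost of removing more recent test data.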
Delete the PV and PVC
kubectl delete pvc nfs-external
kubectl delete pv gcfs
kubectl delete pods --all=true
- We delete the pods since the pods will be mounting the volume which will prevent deletion of the PV and PVC
Wait for them to be deleted
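One way to wait is `kubectl wait --for=delete` with a timeout, so that a stalled deletion shows up as a failure instead of hanging. This is a sketch only; `wait_for_volume_deletion` is a hypothetical helper, and how `kubectl wait` behaves when the object is already gone may differ between kubectl versions.

```shell
# Hypothetical helper: block until the PVC and then the PV are gone, or until
# the timeout expires. Returns non-zero if a wait fails or times out.
wait_for_volume_deletion() {
  local timeout="${1:-120s}"
  kubectl wait --for=delete pvc/nfs-external --timeout="$timeout" &&
    kubectl wait --for=delete pv/gcfs --timeout="$timeout"
}

# Example usage:
#   wait_for_volume_deletion 120s || echo "deletion is stalled"
```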
Most likely we will need to override delete protection because there will be some pods still mounting it
Dump the yaml
kubectl get pvc nfs-external -o yaml > /tmp/nfs-external.yaml
Delete the finalizer
In nfs-external.yaml:
...
finalizers: []
labels:
...
- Make sure you have the field finalizers and it's an empty list
Update the object
kubectl apply -f /tmp/nfs-external.yaml
- Alternatively you can use kubectl edit to remove finalizers.
Similarly, make sure you remove finalizers from the PV (i.e., gcfs)
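A third option, offered as a sketch and not as the established procedure, is to clear the finalizers with `kubectl patch` instead of editing YAML by hand. `clear_finalizers` is a hypothetical helper; the merge patch sets `metadata.finalizers` to null, which removes the field.

```shell
# Hypothetical helper: remove all finalizers from one object.
# KIND and NAME are passed straight to kubectl, e.g. "pvc nfs-external".
clear_finalizers() {
  local kind="$1"
  local name="$2"
  kubectl patch "$kind" "$name" --type=merge -p '{"metadata":{"finalizers":null}}'
}

# Example usage:
#   clear_finalizers pvc nfs-external
#   clear_finalizers pv gcfs
```

Removing finalizers bypasses delete protection, so only do this once you are sure the volume should go away.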
If PV/PVC deletion still stalls, delete all pods in the namespace manually
kubectl delete pods --all
Delete the nfs deployment
gcloud --project=kubeflow-ci deployment-manager deployments delete kubeflow-ci-nfs
Recreate the NFS share
cd test-infra/gcp_configs
gcloud --project=kubeflow-ci deployment-manager deployments create kubeflow-ci-nfs --config=gcfs.yaml
Get the IP address of the new NFS share
gcloud beta --project=kubeflow-ci filestore instances list
Set the IP address in the PV
- Edit
- Change the server address of the persistent volume to the IP address of the new NFS share
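The address change can also be scripted. The sketch below assumes the PV manifest carries the address in an NFS-style `server:` field and that GNU `sed` is available (for `-i`); `set_nfs_server` is a hypothetical helper and the manifest path is left as an argument because it depends on your checkout.

```shell
# Hypothetical helper: replace the value of every "server:" field in FILE
# with IP, keeping the original indentation. Edits the file in place.
set_nfs_server() {
  local file="$1"
  local ip="$2"
  sed -i -E "s|^([[:space:]]*server:)[[:space:]]*.*$|\1 ${ip}|" "$file"
}

# Example usage (PV_MANIFEST and NEW_IP are placeholders you must supply):
#   set_nfs_server "$PV_MANIFEST" "$NEW_IP"
```

Review the resulting diff before applying it, since any other `server:` field in the same file would be rewritten too.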
Recreate the PV and PVC
cd test-infra/
kustomize build base | kubectl apply -f -
Make sure the pod is able to successfully mount the PV
You may need to restart the debug worker if it isn't running
ks apply kubeflow-ci -c debug-worker
- If you already deleted the pod, make sure it is restarted and is healthy. Otherwise, if it stalls in the terminating state, force delete it as follows:
kubectl delete pods debug-worker-0 --grace-period=0 --force
- Connect to the debug worker to make sure it is able to mount the PV
kubectl exec -it debug-worker-0 -- /bin/bash
ls /secret