Deprecate checkpointing (#2361)
* Deprecate checkpointing

* Remove checkpointing from test
gpleiss authored Jun 2, 2023
1 parent 1be177e commit f73fa7d
Showing 5 changed files with 23 additions and 115 deletions.
@@ -8,7 +8,7 @@
"\n",
"This notebook demonstrates the most simple usage of contour integral quadrature with msMINRES as described [here](https://arxiv.org/pdf/2006.11267.pdf) to sample from the predictive distribution of an exact GP.\n",
"\n",
"Note that to achieve results where Cholesky would run the GPU out of memory, you'll either need to have KeOps installed (see our KeOps tutorial in this same folder), or use the `checkpoint_kernel` beta feature. Despite this, on this relatively simple example with 1000 training points but seeing to sample at 20000 test points in 1D, we will achieve significant speed ups over Cholesky."
"Note that to achieve results where Cholesky would run the GPU out of memory, you'll need to have KeOps installed (see our KeOps tutorial in this same folder). Despite this, on this relatively simple example with 1000 training points but seeing to sample at 20000 test points in 1D, we will achieve significant speed ups over Cholesky."
]
},
{
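The KeOps integration referenced in the updated notebook text swaps a standard kernel for its KeOps counterpart, which evaluates kernel matrix blocks lazily instead of materializing the full n x n matrix. A minimal sketch of such a model, assuming `pykeops` is installed and a CUDA device is available (the class name and choice of RBF kernel here are illustrative, not part of this commit):

```python
import gpytorch


class KeOpsGPModel(gpytorch.models.ExactGP):
    """Exact GP whose covariance is computed lazily via KeOps."""

    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # gpytorch.kernels.keops.RBFKernel is a drop-in replacement for
        # gpytorch.kernels.RBFKernel that never forms the full kernel matrix.
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.keops.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```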
94 changes: 9 additions & 85 deletions examples/02_Scalable_Exact_GPs/Simple_MultiGPU_GP_Regression.ipynb
@@ -4,14 +4,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exact GP Regression with Multiple GPUs and Kernel Partitioning\n",
"# Exact GP Regression with Multiple GPUs\n",
"## Introduction\n",
"In this notebook, we'll demonstrate training exact GPs on large datasets using two key features from the paper https://arxiv.org/abs/1903.08114: \n",
"In this notebook, we'll demonstrate training exact GPs on large datasets by distributing the kernel matrix across multiple GPUs, for additional parallelism.\n",
"\n",
"1. The ability to distribute the kernel matrix across multiple GPUs, for additional parallelism.\n",
"2. Partitioning the kernel into chunks computed on-the-fly when performing each MVM to reduce memory usage.\n",
"**NOTE**: Kernel partitioning (another memory-saving mechanism introduced in https://arxiv.org/abs/1903.08114) is no longer supported for multiple GPUs. If your kernel matrix is too big to fit on your available GPUs, please use the [GPyTorch KeOps integration](./KeOps_GP_Regression.ipynb) for kernel partitioning.\n",
"\n",
"We'll be using the `protein` dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: both the number of GPUs available and the amount of memory they have (which determines the partition size) have a significant effect on training time."
"We'll be using the `protein` dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: the number of GPUs available has a significant effect on training time."
]
},
{
@@ -164,7 +163,6 @@
" train_y,\n",
" n_devices,\n",
" output_device,\n",
" checkpoint_size,\n",
" preconditioner_size,\n",
" n_training_iter,\n",
"):\n",
@@ -178,8 +176,7 @@
" mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)\n",
"\n",
" \n",
" with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \\\n",
" gpytorch.settings.max_preconditioner_size(preconditioner_size):\n",
" with gpytorch.settings.max_preconditioner_size(preconditioner_size):\n",
"\n",
" def closure():\n",
" optimizer.zero_grad()\n",
@@ -208,78 +205,6 @@
" return model, likelihood"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Automatically determining GPU Settings\n",
- "\n",
- "In the next cell, we automatically determine a roughly reasonable partition or *checkpoint* size that will allow us to train without using more memory than the GPUs available have. Note that this is a coarse estimate of the largest possible checkpoint size, and may be off by as much as a factor of 2. A smarter search here could make up to a 2x performance improvement."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of devices: 2 -- Kernel partition size: 0\n",
- "RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 1; 10.73 GiB total capacity; 7.48 GiB already allocated; 2.46 GiB free; 21.49 MiB cached)\n",
- "Number of devices: 2 -- Kernel partition size: 18292\n",
- "RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 10.73 GiB total capacity; 6.37 GiB already allocated; 448.94 MiB free; 1.30 GiB cached)\n",
- "Number of devices: 2 -- Kernel partition size: 9146\n",
- "Iter 1/1 - Loss: 0.893 lengthscale: 0.486 noise: 0.248\n",
- "Finished training on 36584 data points using 2 GPUs.\n"
- ]
- }
- ],
- "source": [
- "import gc\n",
- "\n",
- "def find_best_gpu_setting(train_x,\n",
- " train_y,\n",
- " n_devices,\n",
- " output_device,\n",
- " preconditioner_size\n",
- "):\n",
- " N = train_x.size(0)\n",
- " \n",
- " # Find the optimum partition/checkpoint size by decreasing in powers of 2\n",
- " # Start with no partitioning (size = 0)\n",
- " settings = [0] + [int(n) for n in np.ceil(N / 2**np.arange(1, np.floor(np.log2(N))))]\n",
- "\n",
- " for checkpoint_size in settings:\n",
- " print('Number of devices: {} -- Kernel partition size: {}'.format(n_devices, checkpoint_size))\n",
- " try:\n",
- " # Try a full forward and backward pass with this setting to check memory usage\n",
- " _, _ = train(train_x, train_y,\n",
- " n_devices=n_devices, output_device=output_device,\n",
- " checkpoint_size=checkpoint_size,\n",
- " preconditioner_size=preconditioner_size, n_training_iter=1)\n",
- " \n",
- " # when successful, break out of for-loop and jump to finally block\n",
- " break\n",
- " except RuntimeError as e:\n",
- " print('RuntimeError: {}'.format(e))\n",
- " except AttributeError as e:\n",
- " print('AttributeError: {}'.format(e))\n",
- " finally:\n",
- " # handle CUDA OOM error\n",
- " gc.collect()\n",
- " torch.cuda.empty_cache()\n",
- " return checkpoint_size\n",
- "\n",
- "# Set a large enough preconditioner size to reduce the number of CG iterations run\n",
- "preconditioner_size = 100\n",
- "checkpoint_size = find_best_gpu_setting(train_x, train_y,\n",
- " n_devices=n_devices, \n",
- " output_device=output_device,\n",
- " preconditioner_size=preconditioner_size)"
- ]
- },
{
"cell_type": "markdown",
"metadata": {},
@@ -309,7 +234,6 @@
"source": [
"model, likelihood = train(train_x, train_y,\n",
" n_devices=n_devices, output_device=output_device,\n",
" checkpoint_size=10000,\n",
" preconditioner_size=100,\n",
" n_training_iter=20)"
]
@@ -331,7 +255,7 @@
"model.eval()\n",
"likelihood.eval()\n",
"\n",
"with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):\n",
"with torch.no_grad(), gpytorch.settings.fast_pred_var():\n",
" # Make predictions on a small number of test points to get the test time caches computed\n",
" latent_pred = model(test_x[:2, :])\n",
" del latent_pred # We don't care about these predictions, we really just want the caches."
@@ -360,7 +284,7 @@
}
],
"source": [
"with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):\n",
"with torch.no_grad(), gpytorch.settings.fast_pred_var():\n",
" %time latent_pred = model(test_x)\n",
" \n",
"test_rmse = torch.sqrt(torch.mean(torch.pow(latent_pred.mean - test_y, 2)))\n",
@@ -385,7 +309,7 @@
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -399,7 +323,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
"version": "3.8.0"
}
},
"nbformat": 4,
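For reference, the multi-GPU parallelism that this notebook keeps is provided by `MultiDeviceKernel`, which scatters blocks of the kernel matrix across devices. A condensed sketch of the model definition, following the pattern used in the notebook's surrounding cells (names such as `n_devices` and `output_device` match those cells; the base kernel choice is illustrative):

```python
import torch
import gpytorch

n_devices = torch.cuda.device_count()
output_device = torch.device("cuda:0")


class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        # Distribute blocks of the kernel matrix across all available GPUs,
        # gathering the result on output_device.
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module, device_ids=range(n_devices), output_device=output_device
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```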
10 changes: 10 additions & 0 deletions gpytorch/beta_features.py
@@ -39,6 +39,16 @@ class checkpoint_kernel(_value_context):

_global_value = 0

+ def __enter__(self, *args, **kwargs):
+ warnings.warn(
+ "The checkpointing feature is deprecated and will be removed in the next version. "
+ "If your data cannot fit on a single GPU, we recommend using the GPyTorch KeOps integration. "
+ "(The KeOps integration accomplishes the same thing that our checkpointing feature did, but better!) "
+ "See the KeOps example in the GPyTorch documentation at docs.gpytorch.ai",
+ DeprecationWarning,
+ )
+ return super().__enter__(*args, **kwargs)


class default_preconditioner(_feature_flag):
"""
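Because Python hides `DeprecationWarning` by default, the new warning in `__enter__` is easiest to observe with the warnings filter opened up. A quick sketch (not part of this commit) of how the deprecation surfaces:

```python
import warnings

import gpytorch

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # surface DeprecationWarning, hidden by default
    with gpytorch.beta_features.checkpoint_kernel(1000):
        pass  # merely entering the context triggers the warning
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```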
9 changes: 3 additions & 6 deletions test/examples/test_simple_gp_regression.py
@@ -216,7 +216,7 @@ def test_gp_posterior_single_training_point_smoke_test(self):

def test_posterior_latent_gp_and_likelihood_with_optimization(self, cuda=False, checkpoint=0):
train_x, test_x, train_y, test_y = self._get_data(
- cuda=cuda, num_data=(1000 if checkpoint else 11), add_noise=bool(checkpoint)
+ cuda=cuda, num_data=(11), add_noise=bool(checkpoint)
)
# We're manually going to set the hyperparameters to something they shouldn't be
likelihood = GaussianLikelihood(noise_prior=SmoothedBoxPrior(exp(-3), exp(3), sigma=0.1))
@@ -234,8 +234,8 @@ def test_posterior_latent_gp_and_likelihood_with_optimization(self, cuda=False,
gp_model.train()
likelihood.train()
optimizer = optim.Adam(gp_model.parameters(), lr=0.15)
- with gpytorch.beta_features.checkpoint_kernel(checkpoint), gpytorch.settings.fast_pred_var():
- for _ in range(20 if checkpoint else 50):
+ with gpytorch.settings.fast_pred_var():
+ for _ in range(50):
optimizer.zero_grad()
output = gp_model(train_x)
loss = -mll(output, train_y)
@@ -256,9 +256,6 @@

self.assertLess(mean_abs_error.item(), 0.05)

- def test_gp_with_checkpointing(self, cuda=False):
- return self.test_posterior_latent_gp_and_likelihood_with_optimization(cuda=cuda, checkpoint=250)

def test_fantasy_updates_cuda(self):
if torch.cuda.is_available():
with least_used_cuda_device():
23 changes: 0 additions & 23 deletions test/lazy/test_lazy_evaluated_kernel_tensor.py
@@ -112,29 +112,6 @@ def _test_inv_matmul(self, rhs, lhs=None, cholesky=False):
else:
self.assertFalse(linear_cg_mock.called)

- def test_inv_matmul_matrix_with_checkpointing(self):
- # Add one checkpointing test
- lazy_tensor = self.create_linear_op().requires_grad_(True)
- lazy_tensor_copy = lazy_tensor.clone().detach_().requires_grad_(True)
- evaluated = self.evaluate_linear_op(lazy_tensor_copy)
-
- test_vector = torch.randn(2, 5, 6)
- test_vector_copy = test_vector.clone()
- with gpytorch.beta_features.checkpoint_kernel(2):
- res = lazy_tensor.solve(test_vector)
- actual = evaluated.inverse().matmul(test_vector_copy)
- self.assertLess(((res - actual).abs() / actual.abs().clamp(1, 1e5)).max().item(), 3e-1)
-
- grad = torch.randn_like(res)
- res.backward(gradient=grad)
- actual.backward(gradient=grad)
-
- for param, param_copy in zip(lazy_tensor.kernel.parameters(), lazy_tensor_copy.kernel.parameters()):
- self.assertAllClose(param.grad, param_copy.grad, rtol=1e-3)
- self.assertAllClose(
- lazy_tensor.x1.grad + lazy_tensor.x2.grad, lazy_tensor_copy.x1.grad + lazy_tensor_copy.x2.grad, rtol=1e-3
- )

def test_batch_getitem(self):
"""Indexing was wrong when the kernel had more batch dimensions than the
data"""
