diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..9abd205 --- /dev/null +++ b/.gitattributes @@ -0,0 +1 @@ +*.nc filter=lfs diff=lfs merge=lfs -text diff --git a/api-reference/datasets/accessing-dataset.mdx b/api-reference/datasets/accessing-dataset.mdx index 7a36046..a36ba37 100644 --- a/api-reference/datasets/accessing-dataset.mdx +++ b/api-reference/datasets/accessing-dataset.mdx @@ -9,12 +9,12 @@ Once you have listed all available datasets, you can access a specific dataset b ```python Python (Sync) -dataset = datasets.open_data.asf.sentinel1_sar +dataset = datasets.open_data.copernicus.sentinel1_sar # or any other dataset available to you ``` ```python Python (Async) -dataset = datasets.open_data.asf.sentinel1_sar +dataset = datasets.open_data.copernicus.sentinel1_sar # or any other dataset available to you ``` diff --git a/api-reference/storage-providers/creating-storage-client.mdx b/api-reference/storage-providers/creating-storage-client.mdx index cd16fe1..5e9f8f8 100644 --- a/api-reference/storage-providers/creating-storage-client.mdx +++ b/api-reference/storage-providers/creating-storage-client.mdx @@ -6,6 +6,8 @@ icon: database You can create a cached storage client by importing the respective class and instantiating it. +For a complete example look at the [Accessing Open Data](/datasets/open-data#sample-code) section. + ```python Python (Sync) diff --git a/assets/data/example_satellite_data.nc b/assets/data/example_satellite_data.nc new file mode 100644 index 0000000..7d716ac --- /dev/null +++ b/assets/data/example_satellite_data.nc @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b9d96a425529d93f6cf94a8c8fc7f751c6943567c05969f3c18515bca2e2575d +size 466226 diff --git a/datasets/collections.mdx b/datasets/collections.mdx index 0a91db9..acaf8fd 100644 --- a/datasets/collections.mdx +++ b/datasets/collections.mdx @@ -1,6 +1,7 @@ --- title: Collections description: Learn about Time Series Dataset Collections +icon: layer-group --- Collections are a way of grouping together data points from the same dataset. They are useful for representing @@ -48,13 +49,13 @@ Each dataset has a list of collections associated with it. You can list the coll ```python Python (Sync) - dataset = datasets.open_data.asf.sentinel1_sar + dataset = datasets.open_data.copernicus.landsat8_oli_tirs collections = dataset.collections() print(collections) ``` ```python Python (Async) - dataset = datasets.open_data.asf.sentinel1_sar + dataset = datasets.open_data.copernicus.landsat8_oli_tirs collections = await dataset.collections() print(collections) ``` @@ -62,8 +63,10 @@ Each dataset has a list of collections associated with it. You can list the coll ```txt Output -{'Sentinel-1A': Collection Sentinel-1A: [2014-06-15T03:44:43.000 UTC, 2022-12-31T23:57:59.000 UTC] (1209636 data points), - 'Sentinel-1B': Collection Sentinel-1B: [2016-09-26T00:02:34.000 UTC, 2021-12-23T06:53:08.000 UTC] (657674 data points)} +{'L1GT': Collection L1GT: [2013-03-25T12:08:43.699 UTC, 2024-08-19T12:57:32.456 UTC], + 'L1T': Collection L1T: [2013-03-26T09:33:19.763 UTC, 2020-08-24T03:21:50.000 UTC], + 'L1TP': Collection L1TP: [2013-03-24T00:25:55.457 UTC, 2024-08-19T12:58:20.229 UTC], + 'L2SP': Collection L2SP: [2015-01-01T07:53:35.391 UTC, 2024-08-12T12:52:03.243 UTC]} ``` The `collections` variable is a dictionary, where the keys are the names of the collections and the values are @@ -78,13 +81,13 @@ method. 
```python Python (Sync) - dataset = datasets.open_data.asf.sentinel1_sar + dataset = datasets.open_data.copernicus.landsat8_oli_tirs collections = dataset.collections(availability=True, count=True) print(collections) ``` ```python Python (Async) - dataset = datasets.open_data.asf.sentinel1_sar + dataset = datasets.open_data.copernicus.landsat8_oli_tirs collections = await dataset.collections(availability=True, count=True) print(collections) ``` @@ -92,8 +95,10 @@ method. ```txt Output -{'Sentinel-1A': Collection Sentinel-1A: [2014-06-15T03:44:43.000 UTC, 2022-12-31T23:57:59.000 UTC] (1209636 data points), - 'Sentinel-1B': Collection Sentinel-1B: [2016-09-26T00:02:34.000 UTC, 2021-12-23T06:53:08.000 UTC] (657674 data points)} +{'L1GT': Collection L1GT: [2013-03-25T12:08:43.699 UTC, 2024-08-19T12:57:32.456 UTC] (154288 data points), + 'L1T': Collection L1T: [2013-03-26T09:33:19.763 UTC, 2020-08-24T03:21:50.000 UTC] (87958 data points), + 'L1TP': Collection L1TP: [2013-03-24T00:25:55.457 UTC, 2024-08-19T12:58:20.229 UTC] (322041 data points), + 'L2SP': Collection L2SP: [2015-01-01T07:53:35.391 UTC, 2024-08-12T12:52:03.243 UTC] (191110 data points)} ``` ## Accessing individual collections @@ -107,22 +112,22 @@ You can then use the `info()` method on the collection object to get information ```python Python (Sync) collections = dataset.collections() - sat1 = collections["Sat-1"] - collection_info = sat1.info(availability=True, count=True) + terrain_correction = collections["L1GT"] + collection_info = terrain_correction.info(availability=True, count=True) print(collection_info) ``` ```python Python (Async) collections = await dataset.collections() - sat1 = collections["Sat-1"] - collection_info = await sat1.info(availability=True, count=True) + terrain_correction = collections["L1GT"] + collection_info = await terrain_correction.info(availability=True, count=True) print(collection_info) ``` ```txt Output -Collection Sat-1: [2019-03-07T16:09:17.773000 UTC, 2021-05-23T19:17:23.472000 UTC] (910245 data points) +L1GT: [2013-03-25T12:08:43.699 UTC, 2024-08-19T12:57:32.456 UTC] (154288 data points) ``` You can also access a specific collection by using the `collection` method on the dataset object as well. @@ -131,21 +136,21 @@ This has the advantage that you can directly access the collection without havin ```python Python (Sync) - sat1 = dataset.collection("Sat-1") - collection_info = sat1.info(availability=True, count=True) + terrain_correction = dataset.collection("L1GT") + collection_info = terrain_correction.info(availability=True, count=True) print(collection_info) ``` ```python Python (Async) - sat1 = dataset.collection("Sat-1") - collection_info = await sat1.info(availability=True, count=True) + terrain_correction = dataset.collection("L1GT") + collection_info = await terrain_correction.info(availability=True, count=True) print(collection_info) ``` ```txt Output -Collection Sat-1: [2019-03-07T16:09:17.773000 UTC, 2021-05-23T19:17:23.472000 UTC] (910245 data points) +L1GT: [2013-03-25T12:08:43.699 UTC, 2024-08-19T12:57:32.456 UTC] (154288 data points) ``` ## Errors you may encounter @@ -166,6 +171,10 @@ await dataset.collection("Sat-X").info() # raises NotFoundError: 'No such collec -## Summary +## Next steps -Great, now you know how to list and access collections. Next you can look at [how to query data points from a collection](/datasets/loading-data). + + + How to load data points from a collection. 
+ + diff --git a/datasets/introduction.mdx b/datasets/introduction.mdx index 16e66cb..e852903 100644 --- a/datasets/introduction.mdx +++ b/datasets/introduction.mdx @@ -1,6 +1,7 @@ --- title: Introduction description: Learn about Tilebox Datasets +icon: house --- As the name suggests, time series datasets refer to a certain kind of datasets where each data point is associated with a timestamp. @@ -8,10 +9,20 @@ This is a common format for datasets that are collected over time, such as satel This section covers: -- [Which timeseries datasets are available](/datasets/timeseries#listing-datasets) and how to list them -- [Which common fields](/datasets/timeseries#common-fields) all time series datasets share -- [What collections are](/datasets/collections) and how to access them -- [How to access data](/datasets/loading-data) from a collection for a given time interval + + + Which time series datasets are available and how to list them. + + + Which common fields all time series datasets share. + + + What collections are and how to access them. + + + How to access data from a collection for a given time interval. + + If you want to quickly look up the name of some API method or the meaning of a specific parameter [check out the @@ -22,11 +33,21 @@ This section covers: Here are some terms used throughout this section. -- **Data points**: time series data points are the individual entities that make up a dataset. Each data point is associated with a timestamp. - Each data point consists of a set of fixed [metadata fields](/datasets/timeseries#common-fields) as well as individual fields that are defined on a dataset level. -- **Datasets**: time series datasets are a container for individual data points. All data points in a time series dataset share the same data type, so all - data points in a dataset share the same set of fields. -- **Collections**: Collections are a way of grouping data points within a dataset. They are useful for representing a logical grouping of data points that are commonly queried together. + + + Time series data points are the individual entities that make up a dataset. Each data point is associated with a + timestamp. Each data point consists of a set of fixed [metadata fields](/datasets/timeseries#common-fields) as well + as individual fields that are defined on a dataset level. + + + Time series datasets are a container for individual data points. All data points in a time series dataset share the + same data type, so all data points in a dataset share the same set of fields. + + + Collections are a way of grouping data points within a dataset. They are useful for representing a logical grouping + of data points that are commonly queried together. 
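+To see how these terms fit together in practice, here is a minimal sketch that goes from a client to a dataset, then to a collection and its data points. It reuses the dataset and collection names from the examples elsewhere in these docs; the API key is a placeholder to replace with your own.
+
+```python Python (Sync)
+from tilebox.datasets import Client
+
+client = Client(token="YOUR_TILEBOX_API_KEY")
+datasets = client.datasets()
+
+# a dataset is a container of data points that share the same fields
+dataset = datasets.open_data.copernicus.landsat8_oli_tirs
+
+# a collection is a logical grouping of data points within a dataset
+collection = dataset.collection("L1GT")
+
+# data points are the individual time-stamped entities in a collection
+data = collection.load(("2024-01-01", "2024-01-02"))
+```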
+ + ## Creating a datasets Client @@ -122,6 +143,8 @@ datasets = await client.datasets() # raises AuthenticationError ## Next steps -- [Accessing datasets](/datasets/timeseries) -- [Async support](/sdks/python/async) -- [Working with Xarray](/sdks/python/xarray) + + + + + diff --git a/datasets/loading-data.mdx b/datasets/loading-data.mdx index a6c5fa3..276b03e 100644 --- a/datasets/loading-data.mdx +++ b/datasets/loading-data.mdx @@ -1,6 +1,7 @@ --- title: Loading Time Series Data description: Learn about how to load data from Time Series Dataset collections +icon: download --- ## Overview @@ -24,16 +25,16 @@ assume that you have already [created a client](/datasets/introduction#creating- client = Client() datasets = client.datasets() - collections = datasets.open_data.asf.sentinel1_sar.collections() - collection = collections["Sentinel-1A"] + collections = datasets.open_data.copernicus.sentinel1_sar.collections() + collection = collections["S1A_IW_RAW__0S"] ``` ```python Python (Async) from tilebox.datasets.aio import Client client = Client() datasets = await client.datasets() - collections = await datasets.open_data.asf.sentinel1_sar.collections() - collection = collections["Sentinel-1A"] + collections = await datasets.open_data.copernicus.sentinel1_sar.collections() + collection = collections["S1A_IW_RAW__0S"] ``` @@ -56,42 +57,39 @@ Check out the example below to see how to load a data point at a specific time f ```python Python (Sync) - data = collection.load("2022-05-31 23:59:55.000") + data = collection.load("2024-08-01 00:00:01.362") print(data) ``` ```python Python (Async) - data = await collection.load("2022-05-31 23:59:55.000") + data = await collection.load("2024-08-01 00:00:01.362") print(data) ``` ```txt Output - Size: 549B -Dimensions: (time: 1, latlon: 2, n_footprint: 5) + Size: 721B +Dimensions: (time: 1, latlon: 2) Coordinates: - ingestion_time (time) datetime64[ns] 8B 2023-10-20T10:04:23 - id (time) @@ -112,12 +110,12 @@ when calling `load`. Check out the example below to see this in action. 
```python Python (Sync) -data = collection.load("2022-05-31 23:59:55.000", skip_data=True) +data = collection.load("2024-08-01 00:00:01.362", skip_data=True) print(data) ``` ```python Python (Async) -data = await collection.load("2022-05-31 23:59:55.000", skip_data=True) +data = await collection.load("2024-08-01 00:00:01.362", skip_data=True) print(data) ``` @@ -127,13 +125,11 @@ print(data) Size: 160B Dimensions: (time: 1) Coordinates: - ingestion_time (time) datetime64[ns] 8B 2023-10-20T10:04:23 - id (time) ```txt Output - + Size: 0B Dimensions: () Data variables: *empty* @@ -176,8 +172,8 @@ timestamps, which would need to be manually converted again to different timezon from datetime import datetime import pytz - # Tokyo has a UTC+9 hours offset, so this is the same as 2017-01-01 02:45:35 UTC - tokyo_time = pytz.timezone('Asia/Tokyo').localize(datetime(2017, 1, 1, 11, 45, 35)) + # Tokyo has a UTC+9 hours offset, so this is the same as 2017-01-01 02:45:25.679 UTC + tokyo_time = pytz.timezone('Asia/Tokyo').localize(datetime(2017, 1, 1, 11, 45, 25, 679000)) print(tokyo_time) data = collection.load(tokyo_time) print(data) # time is in UTC since the API always returns UTC timestamps @@ -186,8 +182,8 @@ timestamps, which would need to be manually converted again to different timezon from datetime import datetime import pytz - # Tokyo has a UTC+9 hours offset, so this is the same as 2017-01-01 02:45:35 UTC - tokyo_time = pytz.timezone('Asia/Tokyo').localize(datetime(2017, 1, 1, 11, 45, 35)) + # Tokyo has a UTC+9 hours offset, so this is the same as 2017-01-01 02:45:25.679 UTC + tokyo_time = pytz.timezone('Asia/Tokyo').localize(datetime(2017, 1, 1, 11, 45, 25, 679000)) print(tokyo_time) data = await collection.load(tokyo_time) print(data) # time is in UTC since the API always returns UTC timestamps @@ -196,13 +192,14 @@ timestamps, which would need to be manually converted again to different timezon ```txt Output -2017-05-01 11:45:35+09:00 - -Dimensions: (time: 1) +2017-01-01 11:45:25.679000+09:00 + Size: 725B +Dimensions: (time: 1, latlon: 2) Coordinates: - ingestion_time (time) datetime64[ns] 2017-01-01T15:26:32 - id (time) ```txt Output - Size: 456MB -Dimensions: (time: 955942, latlon: 2, n_footprint: 5) + Size: 725MB +Dimensions: (time: 1109597, latlon: 2) Coordinates: - ingestion_time (time) datetime64[ns] 8MB 2023-10-20T09:52:37 ... 20... - id (time) @@ -389,30 +383,27 @@ Another way of specifying a time interval when loading data is to use an iterabl ```txt Output - Size: 24kB -Dimensions: (time: 50, latlon: 2, n_footprint: 5) + Size: 33kB +Dimensions: (time: 50, latlon: 2) Coordinates: - ingestion_time (time) datetime64[ns] 400B 2023-10-20T09:52:37 ... 2... 
- id (time) ```python Python (Sync) -datapoint_id = "01856a9e-2c08-0990-6cc7-9a860b1115a1" +datapoint_id = "01916d89-ba23-64c9-e383-3152644bcbde" datapoint = collection.find(datapoint_id) print(datapoint) ``` ```python Python (Async) -datapoint_id = "01856a9e-2c08-0990-6cc7-9a860b1115a1" +datapoint_id = "01916d89-ba23-64c9-e383-3152644bcbde" datapoint = await collection.find(datapoint_id) print(datapoint) ``` @@ -443,30 +434,27 @@ print(datapoint) ```txt Output - Size: 549B -Dimensions: (latlon: 2, n_footprint: 5) + Size: 725B +Dimensions: (latlon: 2) Coordinates: - ingestion_time datetime64[ns] 8B 2023-10-20T10:05:57 - id + + + ### Copernicus Data Space @@ -83,3 +90,174 @@ Tilebox currently supports the following Umbra Space datasets: - Umbra Synthetic Aperture Radar (SAR) All data is provided with a Creative Commons License (CC by 4.0), which gives you the right to do just about anything you want with it. + +## Sample Code + +Here are sample code snippets that show how to access open data using the Tilebox Python client. + + + + + +```python Code +from pathlib import Path + +from tilebox.datasets import Client +from tilebox.storage import ASFStorageClient + +# Creating clients +client = Client(token="YOUR_TILEBOX_API_KEY") +datasets = client.datasets() +storage_client = ASFStorageClient( + user="YOUR_ASF_USER", + password="YOUR_ASF_PASSWORD", + cache_directory=Path("./data") +) + +# Choosing the dataset and collection +ers_dataset = datasets.open_data.asf.ers_sar +collections = ers_dataset.collections() +collection = collections["ERS-2"] + +# Loading metadata +ers_data = collection.load(("2009-01-01", "2009-01-02"), show_progress=True) + +# Selecting a data point to download +selected = ers_data.isel(time=0) # index 0 selected + +# Downloading the data +downloaded_data = storage_client.download(selected, extract=True) + +print(f"Downloaded granule: {downloaded_data.name} to {downloaded_data}") +print("Contents: ") +for content in downloaded_data.iterdir(): + print(f" - {content.relative_to(downloaded_data)}") +``` + +```txt Output +Downloaded granule: E2_71629_STD_L0_F183 to data/ASF/E2_71629_STD_F183/E2_71629_STD_L0_F183 +Contents: + - E2_71629_STD_L0_F183.000.vol + - E2_71629_STD_L0_F183.000.meta + - E2_71629_STD_L0_F183.000.raw + - E2_71629_STD_L0_F183.000.pi + - E2_71629_STD_L0_F183.000.nul + - E2_71629_STD_L0_F183.000.ldr +``` + + + + + + + +```python Code +from pathlib import Path + +from tilebox.datasets import Client +from tilebox.storage import CopernicusStorageClient + +# Creating clients +client = Client(token="YOUR_TILEBOX_API_KEY") +datasets = client.datasets() +storage_client = CopernicusStorageClient( + access_key="YOUR_ACCESS_KEY", + secret_access_key="YOUR_SECRET_ACCESS_KEY", + cache_directory=Path("./data") +) + +# Choosing the dataset and collection +s2_dataset = datasets.open_data.copernicus.sentinel2_msi +collections = s2_dataset.collections() +collection = collections["S2A_S2MSI2A"] + +# Loading metadata +s2_data = collection.load(("2024-08-01", "2024-08-02"), show_progress=True) + +# Selecting a data point to download +selected = s2_data.isel(time=0) # index 0 selected + +# Downloading the data +downloaded_data = storage_client.download(selected) + +print(f"Downloaded granule: {downloaded_data.name} to {downloaded_data}") +print("Contents: ") +for content in downloaded_data.iterdir(): + print(f" - {content.relative_to(downloaded_data)}") +``` + +```txt Output +Downloaded granule: S2A_MSIL2A_20240801T002611_N0511_R102_T58WET_20240819T170544.SAFE to 
data/Sentinel-2/MSI/L2A/2024/08/01/S2A_MSIL2A_20240801T002611_N0511_R102_T58WET_20240819T170544.SAFE +Contents: + - manifest.safe + - GRANULE + - INSPIRE.xml + - MTD_MSIL2A.xml + - DATASTRIP + - HTML + - rep_info + - S2A_MSIL2A_20240801T002611_N0511_R102_T58WET_20240819T170544-ql.jpg +``` + + + + + + + +```python Code +from pathlib import Path + +from tilebox.datasets import Client +from tilebox.storage import UmbraStorageClient + +# Creating clients +client = Client(token="YOUR_TILEBOX_API_KEY") +datasets = client.datasets() +storage_client = UmbraStorageClient(cache_directory=Path("./data")) + +# Choosing the dataset and collection +umbra_dataset = datasets.open_data.umbra.sar +collections = umbra_dataset.collections() +collection = collections["SAR"] + +# Loading metadata +umbra_data = collection.load(("2024-01-05", "2024-01-06"), show_progress=True) + +# Selecting a data point to download +selected = umbra_data.isel(time=0) # index 0 selected + +# Downloading the data +downloaded_data = storage_client.download(selected) + +print(f"Downloaded granule: {downloaded_data.name} to {downloaded_data}") +print("Contents: ") +for content in downloaded_data.iterdir(): + print(f" - {content.relative_to(downloaded_data)}") +``` + +```txt Output +Downloaded granule: 2024-01-05-01-53-37_UMBRA-07 to data/Umbra/ad hoc/Yi_Sun_sin_Bridge_SK/6cf02931-ca2e-4744-b389-4844ddc701cd/2024-01-05-01-53-37_UMBRA-07 +Contents: + - 2024-01-05-01-53-37_UMBRA-07_SIDD.nitf + - 2024-01-05-01-53-37_UMBRA-07_SICD.nitf + - 2024-01-05-01-53-37_UMBRA-07_CSI-SIDD.nitf + - 2024-01-05-01-53-37_UMBRA-07_METADATA.json + - 2024-01-05-01-53-37_UMBRA-07_GEC.tif + - 2024-01-05-01-53-37_UMBRA-07_CSI.tif +``` + + + + + +## Further reading + + + + diff --git a/datasets/timeseries.mdx b/datasets/timeseries.mdx index 2cd75db..638822c 100644 --- a/datasets/timeseries.mdx +++ b/datasets/timeseries.mdx @@ -1,6 +1,7 @@ --- title: Time Series Data description: Learn about Time Series Datasets +icon: timeline --- Time series datasets are a container for individual data points. @@ -14,10 +15,10 @@ One of those, the `time` field enables you to perform time-based data queries on Here is a quick overview of the API for listing and accessing datasets which this page covers. Some usage examples for different use-cases are provided below. -| Method | API Reference | Description | -| -------------------------------------- | ---------------------------------------------------------------- | ---------------------------- | -| `client.datasets` | [Listing datasets](/api-reference/datasets/listing-datasets) | List all available datasets. | -| `datasets.open_data.asf.sentinel1_sar` | [Accessing a dataset](/api-reference/datasets/accessing-dataset) | Access a specific dataset. | +| Method | API Reference | Description | +| --------------------------------------------- | ---------------------------------------------------------------- | ---------------------------- | +| `client.datasets` | [Listing datasets](/api-reference/datasets/listing-datasets) | List all available datasets. | +| `datasets.open_data.copernicus.sentinel1_sar` | [Accessing a dataset](/api-reference/datasets/accessing-dataset) | Access a specific dataset. 
| ## Listing datasets @@ -31,14 +32,14 @@ For example, to access a dataset called dataset in a dataset group called some, client = Client() datasets = client.datasets() - dataset = datasets.open_data.asf.sentinel1_sar + dataset = datasets.open_data.copernicus.sentinel1_sar ``` ```python Python (Async) from tilebox.datasets.aio import Client client = Client() datasets = await client.datasets() - dataset = datasets.open_data.asf.sentinel1_sar + dataset = datasets.open_data.copernicus.sentinel1_sar ``` diff --git a/quickstart.mdx b/quickstart.mdx index eb98f4b..dbab678 100644 --- a/quickstart.mdx +++ b/quickstart.mdx @@ -40,7 +40,7 @@ If you prefer to work locally in your device, the steps below help you get start # select an Opendata dataset datasets = client.datasets() - dataset = datasets.open_data.asf.sentinel2_msi + dataset = datasets.open_data.copernicus.sentinel2_msi # and load data from a collection in a given time range collection = dataset.collection("S2A_S2MSI1C") diff --git a/sdks/python/async.mdx b/sdks/python/async.mdx index 06962cb..5019231 100644 --- a/sdks/python/async.mdx +++ b/sdks/python/async.mdx @@ -43,11 +43,11 @@ Check out the examples below to see how that works for a few examples. datasets = client.datasets() # Listing collections -dataset = datasets.open_data.asf.sentinel1_sar +dataset = datasets.open_data.copernicus.sentinel1_sar collections = dataset.collections() # Collection information -collection = collections["Sentinel-1A"] +collection = collections["S1A_IW_RAW__0S"] info = collection.info() print(f"Data for My-collection is available for {info.availability}") @@ -55,7 +55,7 @@ print(f"Data for My-collection is available for {info.availability}") data = collection.load(("2022-05-01", "2022-06-01"), show_progress=True) # Finding a specific datapoint -datapoint_uuid = "01811c8f-0928-e6f5-df34-364cfa8a86e8" +datapoint_uuid = "01910b3c-8552-7671-3345-b902cc0813f3" datapoint = collection.find(datapoint_uuid) ``` @@ -64,11 +64,11 @@ datapoint = collection.find(datapoint_uuid) datasets = await client.datasets() # Listing collections -dataset = datasets.open_data.asf.sentinel1_sar +dataset = datasets.open_data.copernicus.sentinel1_sar collections = await dataset.collections() # Collection information -collection = collections["Sentinel-1A"] +collection = collections["S1A_IW_RAW__0S"] info = await collection.info() print(f"Data for My-collection is available for {info.availability}") @@ -76,7 +76,7 @@ print(f"Data for My-collection is available for {info.availability}") data = await collection.load(("2022-05-01", "2022-06-01"), show_progress=True) # Finding a specific datapoint -datapoint_uuid = "01811c8f-0928-e6f5-df34-364cfa8a86e8" +datapoint_uuid = "01910b3c-8552-7671-3345-b902cc0813f3" datapoint = await collection.find(datapoint_uuid) ``` @@ -110,13 +110,13 @@ from tilebox.datasets.timeseries import RemoteTimeseriesDatasetCollection # for client = Client() datasets = client.datasets() -collections = datasets.open_data.asf.sentinel1_sar.collections() +collections = datasets.open_data.copernicus.landsat8_oli_tirs.collections() def stats_for_2020(collection: RemoteTimeseriesDatasetCollection) -> None: """Fetch data for 2020 and print the number of data points that were loaded.""" data = collection.load(("2020-01-01", "2021-01-01"), show_progress=True) n = data.sizes['time'] if 'time' in data else 0 - print(f"There are {data.sizes['time']} datapoints in {collection.name} for 2020.") + print(f"There are {n} datapoints in {collection.name} for 2020.") start = 
time.time() @@ -139,13 +139,13 @@ from tilebox.datasets.timeseries import RemoteTimeseriesDatasetCollection # for client = Client() datasets = await client.datasets() -collections = await datasets.open_data.asf.sentinel1_sar.collections() +collections = await datasets.open_data.copernicus.landsat8_oli_tirs.collections() async def stats_for_2020(collection: RemoteTimeseriesDatasetCollection) -> None: """Fetch data for 2020 and print the number of data points that were loaded.""" data = await collection.load(("2020-01-01", "2021-01-01"), show_progress=True) n = data.sizes['time'] if 'time' in data else 0 - print(f"There are {data.sizes['time']} datapoints in {collection.name} for 2020.") + print(f"There are {n} datapoints in {collection.name} for 2020.") start = time.time() @@ -167,19 +167,19 @@ so it finishes first. ```txt Python (Sync) -Fetching data: 100% |██████████████████████████████ [00:13<00:00, 207858 datapoints, 3.91 MB/s] -There are 207858 datapoints in Sentinel-1A for 2020. -Fetching data: 100% |██████████████████████████████ [00:11<00:00, 179665 datapoints, 4.39 MB/s] -There are 179665 datapoints in Sentinel-1B for 2020. -Fetching data took 25.34 seconds +There are 19624 datapoints in L1GT for 2020. +There are 1281 datapoints in L1T for 2020. +There are 65313 datapoints in L1TP for 2020. +There are 25375 datapoints in L2SP for 2020. +Fetching data took 10.92 seconds ``` ```txt Python (Async) -Fetching data: 100% |██████████████████████████████ [00:19<00:00, 207858 datapoints, 2.21 MB/s] -Fetching data: 100% |██████████████████████████████ [00:17<00:00, 179665 datapoints, 2.94 MB/s] -There are 179665 datapoints in Sentinel-1B for 2020. -There are 207858 datapoints in Sentinel-1A for 2020. -Fetching data took 20.12 seconds +There are 1281 datapoints in L1T for 2020. +There are 19624 datapoints in L1GT for 2020. +There are 25375 datapoints in L2SP for 2020. +There are 65313 datapoints in L1TP for 2020. +Fetching data took 7.45 seconds ``` diff --git a/sdks/python/sample-notebooks.mdx b/sdks/python/sample-notebooks.mdx index 1520b9e..7850a5b 100644 --- a/sdks/python/sample-notebooks.mdx +++ b/sdks/python/sample-notebooks.mdx @@ -42,16 +42,16 @@ They allow to work in notebooks, which are documents that contains both code and Notebooks don't need any setup and can be shared with others. - + [Jupyter notebooks](https://jupyter.org/) are the original interactive environment for Python. They are great to work with, but require a local installation. - + [Google Colab](https://colab.research.google.com/) is a free tool that offers a hosted interactive Python environment. Google Colab is great to connect to local Jupyter instances, and to share code using Google credentials, or within organizations that use Google Workspace. - + [JetBrains Datalore](https://datalore.jetbrains.com/) is a free and convenient way to collaboratively test, develop and share Python code and algorithms. It comes with secret management built in, so you can store your credentials and share notebooks. 
Datalore comes with the advanced JetBrains syntax highlighting and autocompletion software diff --git a/sdks/python/xarray.mdx b/sdks/python/xarray.mdx index a8347ca..40316d7 100644 --- a/sdks/python/xarray.mdx +++ b/sdks/python/xarray.mdx @@ -4,6 +4,8 @@ description: Xarray library, common use-cases and how they can be implemented ea icon: chart-bar --- +[example_satellite_data.nc]: https://github.com/tilebox/docs/raw/main/assets/data/example_satellite_data.nc + [Xarray](https://xarray.dev/) is a library for working with labelled multi-dimensional arrays. Xarray is built on top of [NumPy](https://numpy.org/) and [Pandas](https://pandas.pydata.org/). Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, @@ -24,20 +26,30 @@ The Tilebox Python client provides access to your satellite data in the form of [xarray.Dataset](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html#xarray.Dataset). This brings a great number of benefits compared to custom Tilebox specific data structures such as: -- **Familiarity**: Xarray is built on top of NumPy and Pandas, which are two of the most popular Python libraries for - scientific computing. If you are already familiar with these libraries, you are right at home with Xarray. -- **Performance**: By using NumPy under the hood, which in turn is built on top of C and Fortran, Xarray benefits from - all the performance optimizations that those libraries offer. This means that Xarray is fast and can handle large - datasets with ease. -- **Interoperability**: Xarray is a popular library and is used by many other libraries. This means that you can - easily integrate Xarray into your existing workflows. Many third party libraries are available to extend Xarray - with more capability for different use cases. -- **Flexibility**: Xarray is a flexible library and can be used for a wide range of use-cases. It's also - easy to extend Xarray with custom capability. + + + Xarray is built on top of NumPy and Pandas, which are two of the most popular Python libraries for scientific + computing. If you are already familiar with these libraries, you are right at home with Xarray. + + + By using NumPy under the hood, which in turn is built on top of C and Fortran, Xarray benefits from all the + performance optimizations that those libraries offer. This means that Xarray is fast and can handle large datasets + with ease. + + + Xarray is a popular library and is used by many other libraries. This means that you can easily integrate Xarray + into your existing workflows. Many third party libraries are available to extend Xarray with more capability for + different use cases. + + + Xarray is a flexible library and can be used for a wide range of use-cases. It's also easy to extend Xarray with + custom capability. + + ## An example dataset -To get an understanding of how Xarray works, a simple example dataset is used, as it could be returned by a +To get an understanding of how Xarray works, a sample dataset is used, as it could be returned by a [Tilebox timeseries dataset](/datasets/timeseries). 
@@ -47,7 +59,7 @@ from tilebox.datasets import Client client = Client() datasets = client.datasets() -collection = datasets.open_data.asf.sentinel1_sar.collection("Sentinel-1A") +collection = datasets.open_data.copernicus.landsat8_oli_tirs.collection("L1GT") satellite_data = collection.load(("2022-05-01", "2022-06-01"), show_progress=True) print(satellite_data) ``` @@ -57,7 +69,7 @@ from tilebox.datasets.aio import Client client = Client() datasets = await client.datasets() -collection = datasets.open_data.asf.sentinel1_sar.collection("Sentinel-1A") +collection = datasets.open_data.copernicus.landsat8_oli_tirs.collection("L1GT") satellite_data = await collection.load(("2022-05-01", "2022-06-01"), show_progress=True) print(satellite_data) ``` @@ -65,52 +77,49 @@ print(satellite_data) ```txt Output - Size: 8MB -Dimensions: (time: 16507, latlon: 2, n_footprint: 5) + Size: 305kB +Dimensions: (time: 514, latlon: 2) Coordinates: - ingestion_time (time) datetime64[ns] 132kB 2023-10-20T10:04:07 ... ... - id (time) This is a simple dataset that was generated to showcase some common Xarray use-cases. If you want to follow along, you - can download the dataset as a NetCDF file. The [Reading and writing - files section](/sdks/python/xarray#reading-and-writing-files) explains how to save and load Xarray datasets to and - from NetCDF files. + can [download the dataset as a NetCDF file][example_satellite_data.nc]. The [Reading and writing files + section](/sdks/python/xarray#reading-and-writing-files) explains how to save and load Xarray datasets to and from + NetCDF files. Here is a breakdown of the preceding output: - `satellite_data` **dataset** contains different **dimensions**, **coordinates** and **variables** -- `time` **dimension** consists of 570396 elements. This means there are 570396 data points in the dataset +- `time` **dimension** consists of 514 elements. This means there are 514 data points in the dataset - `time` **dimension coordinate** contains datetime values. This is the time when the data was measured. The `*` mark shows that it's a dimension coordinate. Dimension coordinates are used for label based indexing and alignment, it means you can use the time to access individual data points in the dataset - `ingestion_time` **non-dimension coordinate** contains datetime values. This is the time when the data was ingested into the Tilebox database. Non-dimension coordinates are variables that contain coordinate data, but are not - used for label based indexing and alignment. They can [even be multidimensional](https://docs.xarray.dev/en/stable/examples/multidimensional-coords.html). -- `sensor` **variable** contains integers. This variable tells you which sensor produced a given measurement. - A sensor in this case is identified by a number, `1` or `2` in the example dataset -- `measurement` **variable** contains floating point values. This variable contains the actual measurement values. + used for label based indexing and alignment. 
They can [even be multidimensional](https://docs.xarray.dev/en/stable/examples/multidimensional-coords.html) +- The dataset contains 28 **variables** +- `bands` **variable** contains integers. This variable tells you how many bands the data contains +- `sun_elevation` **variable** contains floating point values. This variable contains the sun elevation when the data was measured Check out the [xarray terminology overview](https://docs.xarray.dev/en/stable/user-guide/terminology.html) to deepen @@ -127,30 +136,33 @@ no more API requests are required, there is no difference between the sync and a There are a couple of different ways that you can access data in a dataset. The Xarray documentation provides a [great overview](https://docs.xarray.dev/en/stable/user-guide/indexing.html) of all those methods. -You can access the `sun_elevation` variable: +You can access the `sun_elevation` variable: ```python Accessing values -# Let's print the first sun elevation value -print(satellite_data.sun_elevation[0]) +# Let's print the first sun elevation value +print(satellite_data.sun_elevation[0]) ``` ```txt Output - array(3.07027067) Coordinates: -ingestion_time datetime64[ns] 2017-01-01T15:26:32 time datetime64[ns] -2017-01-01T02:45:35 + Size: 8B +array(44.19904463) +Coordinates: + ingestion_time datetime64[ns] 8B 2024-07-22T09:06:43.558629 + id - Dimensions: () Coordinates: ingestion_time datetime64[ns] 2017-01-01T15:26:32 - time datetime64[ns] 2017-01-01T02:45:35 Data variables: sensor int64 2 - measurement float64 3.07 - + Size: 665B +Dimensions: (latlon: 2) +Coordinates: + ingestion_time datetime64[ns] 8B 2024-07-22T09:06:43.558629 + id - Dimensions: (time: 3) Coordinates: ingestion_time (time) datetime64[ns] - 2022-12-31T20:56:40 ... 2022-12-31T... * time (time) datetime64[ns] - 2022-12-31T15:47:54 ... 2022-12-31T... Data variables: sensor (time) int64 1 2 - 1 measurement (time) float64 1.491 2.045 2.798 - +First 3 sun_elevations [44.19904463 57.77561083 58.76316786] +Last 3 sun_elevations [55.60690523 56.72453179 57.81917624] +Sub dataset of the last 3 datapoints + Size: 2kB +Dimensions: (time: 3, latlon: 2) +Coordinates: + ingestion_time (time) datetime64[ns] 24B 2024-07-22T09:08:24.7395... + id (time) array([3.58839564e+00, -2.70314237e+00, 3.27767130e-03, ..., 2.83278085e+00, 1.49074120e+00, -2.79836407e+00]) Coordinates: ingestion_time (time) datetime64[ns] -2017-01-01T15:26:32 ... 2022-12-31T... * time (time) datetime64[ns] -2017-01-01T02:54:03 ... 2022-12-31T... + Size: 216B +array([63.89629314, 63.35038654, 64.10330149, 64.11904038, 64.32007459, + 65.00696561, 60.81739662, 65.72788105, 65.90881403, 65.90881403, + 66.51835574, 66.51835574, 61.24068875, 66.34420723, 66.34420723, + 65.07319907, 65.07319907, 67.19808628, 67.19808628, 67.69088228, + 61.54950615, 67.76723723, 67.76723723, 68.23219829, 68.23219829, + 64.37400345, 64.37400345]) +Coordinates: + ingestion_time (time) datetime64[ns] 216B 2024-07-22T09:06:43.558629 ...... + id (time) 1.5) & - (satellite_data.measurement < 1.6) + (satellite_data.cloud_cover == 0) & + (satellite_data.sun_elevation > 45) & + (satellite_data.sun_elevation < 90) ) -filtered_measurements = satellite_data.measurement[data_filter] -print(filtered_measurements) +filtered_sun_elevations = satellite_data.sun_elevation[data_filter] +print(filtered_sun_elevations) ``` ```txt Output - array([1.54675131, 1.58851704, -1.52978976, ..., 1.54684979, 1.58256101, 1.5325089 ]) Coordinates: -ingestion_time (time) datetime64[ns] 2017-01-01T05:21:17 ... 2022-12-31T...
* -time (time) datetime64[ns] 2017-01-01T18:17:47 ... 2022-12-31T... + Size: 216B +array([63.89629314, 63.35038654, 64.10330149, 64.11904038, 64.32007459, + 65.00696561, 60.81739662, 65.72788105, 65.90881403, 65.90881403, + 66.51835574, 66.51835574, 61.24068875, 66.34420723, 66.34420723, + 65.07319907, 65.07319907, 67.19808628, 67.19808628, 67.69088228, + 61.54950615, 67.76723723, 67.76723723, 68.23219829, 68.23219829, + 64.37400345, 64.37400345]) +Coordinates: + ingestion_time (time) datetime64[ns] 216B 2024-07-22T09:06:43.558629 ...... + id (time) - Dimensions: () Coordinates: ingestion_time datetime64[ns] 2020-12-27T18:30:47 - time datetime64[ns] 2021-01-14T07:21:04 Data variables: sensor int64 1 - measurement float64 3.873 - + Size: 665B +Dimensions: (latlon: 2) +Coordinates: + ingestion_time datetime64[ns] 8B 2024-07-22T09:06:43.558629 + id >> raises KeyError: "2021-01-14T07:21:05" +nearest_datapoint = satellite_data.sel(time="2022-05-01T11:28:28.000000") +>>> raises KeyError: "2022-05-01T11:28:28.000000" ``` The `method` parameter can be used to return the closest value instead of raising an error. -```python Finding the closest measurement -nearest_measurement = satellite_data.sel(time="2021-01-14T07:21:05", method="nearest") -assert nearest_measurement.equals(specific_measurement) # passes +```python Finding the closest data point +nearest_datapoint = satellite_data.sel(time="2022-05-01T11:28:28.000000", method="nearest") +assert nearest_datapoint.equals(specific_datapoint) # passes ``` @@ -296,29 +368,29 @@ Xarray and NumPy offer a wide range of statistical functions that can be applied a few examples: ```python Computing dataset statistics -measurements = satellite_data.measurement -min_meas = measurements.min().item() -max_meas = measurements.max().item() -mean_meas = measurements.mean().item() -std_meas = measurements.std().item() -print(f"Measurements from {min_meas:.2f} to {max_meas:.2f} with mean {mean_meas:.2f} and a std of {std_meas:.2f}") +cloud_cover = satellite_data.cloud_cover +min_meas = cloud_cover.min().item() +max_meas = cloud_cover.max().item() +mean_meas = cloud_cover.mean().item() +std_meas = cloud_cover.std().item() +print(f"Cloud cover from {min_meas:.2f} to {max_meas:.2f} with mean {mean_meas:.2f} and a std of {std_meas:.2f}") ``` ```txt Output -Measurements from 0.00 to 4.00 with mean 1.91 and a std of 1.44 +Cloud cover from 0.00 to 100.00 with mean 76.48 and a std of 34.17 ``` -You can also use many NumPy functions directly on a dataset or DataArray. For example, to find out which sensors -you are dealing with, you can use [np.unique](https://numpy.org/doc/stable/reference/generated/numpy.unique.html) to -get all the unique values in the `sensor` data array. +You can also use many NumPy functions directly on a dataset or DataArray. For example, to find out how many bands +the data contains, you can use [np.unique](https://numpy.org/doc/stable/reference/generated/numpy.unique.html) to +get all the unique values in the `bands` data array. ```python Finding unique values import numpy as np -print("Sensors:", np.unique(satellite_data.sensor)) +print("Bands:", np.unique(satellite_data.bands)) ``` ```txt Output -Sensors: [1 2] +Bands: [12] ``` ## Reading and writing files @@ -328,6 +400,8 @@ to share your data with others or if you want to persist your data for later use formats, including NetCDF, Zarr, GRIB, and many more.
For a full list of supported formats, please refer to the [official documentation page](https://docs.xarray.dev/en/stable/user-guide/io.html). +You might need to install the `netcdf4` package first. You can do this by running `pip install netcdf4`. + Here is how you can save the example dataset to a NetCDF file: ```python Saving a dataset to a file @@ -344,7 +418,7 @@ satellite_data = xr.open_dataset("example_satellite_data.nc") ``` In case you want to follow along with the examples in this section, you can download the example dataset as a NetCDF -file here. +file [here][example_satellite_data.nc]. ## Further reading @@ -354,8 +428,35 @@ or check out the [Xarray Tutorials](https://tutorial.xarray.dev/intro.html). Some useful capabilities that this section did not cover include: -- [Grouping data](https://docs.xarray.dev/en/stable/user-guide/groupby.html) -- [Computation](https://docs.xarray.dev/en/stable/user-guide/computation.html) -- [Time series specific functionality](https://docs.xarray.dev/en/stable/user-guide/time-series.html) -- [Interpolation](https://docs.xarray.dev/en/latest/user-guide/interpolation.html) -- [Plotting](https://docs.xarray.dev/en/latest/user-guide/plotting.html) + + + + + +
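+As a small taste of the grouping functionality linked above, here is a sketch that counts the data points per calendar month in the example dataset from this page. It assumes you have downloaded `example_satellite_data.nc` as described in the reading and writing files section.
+
+```python Grouping data by month
+import xarray as xr
+
+satellite_data = xr.open_dataset("example_satellite_data.nc")
+
+# group the data points by calendar month and count them
+per_month = satellite_data.groupby("time.month").count()
+print(per_month.sun_elevation)
+```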