Groupby decimal days of year -- groupby("ddayofyear") #10068

geacomputing · 2025-02-21T13:03:29Z

Hi there.

I wanted to share an idea I had while working with 5-minute sampled weather data. I often use the groupby feature to analyze daily trends, but I wanted to find a way to include sub-hourly variability in my analysis. To do this, I wanted to use decimal days of the year instead of just the day of the year, and re-aggregate the data by hour of the day before computing the mean.

<xarray.Dataset>
Dimensions:                 (time: 402897)
Coordinates:
  * time                    (time) datetime64[ns] 2020-09-08T06:05:00 ... 202...
Data variables: (12/15)
    RECORD                  (time) int64 0 1 2 3 ... 157853 157854 157855 157856
    BattV_Avg               (time) float64 13.2 13.2 13.19 ... 13.55 13.55 13.55
    PTemp_C_Avg             (time) float64 31.99 32.35 32.72 ... 12.13 12.22
    AirTC_Avg               (time) float64 31.56 32.14 32.31 ... 7.82 8.07 8.01
    RH                      (time) float64 35.6 34.6 34.5 ... 44.6 44.3 45.8
    Raining                 (time) int64 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
                     ...
    SlrW_Avg                (time) float64 336.4 348.9 359.9 ... 340.3 323.3
    SlrkJ_Tot               (time) float64 99.25 104.7 108.0 ... 102.1 96.98
    WindDir                 (time) int64 271 328 294 245 256 ... 65 330 345 42
    WS_ms_Avg               (time) float64 1.606 1.469 1.249 ... 2.118 2.344

The data are sampled every 5 minutes, as the first ten timestamps show:

print(ds1.time[0:10])

<xarray.DataArray 'time' (time: 10)>
array(['2020-09-08T06:05:00.000000000', '2020-09-08T06:10:00.000000000',
       '2020-09-08T06:15:00.000000000', '2020-09-08T06:20:00.000000000',
       '2020-09-08T06:25:00.000000000', '2020-09-08T06:30:00.000000000',
       '2020-09-08T06:35:00.000000000', '2020-09-08T06:40:00.000000000',
       '2020-09-08T06:45:00.000000000', '2020-09-08T06:50:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2020-09-08T06:05:00 ... 2020-09-08T06:50:00

I very often rely on the (amazing) groupby feature, specifically for doing daily trends:
ds1.groupby('time.dayofyear').mean()

However, in this specific case I wanted to avoid losing sub-hourly variability.

I was interested in using decimal days of the year, that is, in including the sub-hourly trends (and data) I have. Instead of averaging over the original high-frequency samples directly, I first wanted to re-aggregate them over the hour of the (same) day and then compute the mean.
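
To make "decimal day of year" concrete, here is a small sketch using only the standard library. It assumes the definition used in the code further down (day of year plus hour / 24); the helper name `decimal_dayofyear` is made up for illustration and is not part of xarray.

```python
from datetime import datetime


def decimal_dayofyear(ts: datetime) -> float:
    """Day of year plus the hour as a fraction of a day (hypothetical helper)."""
    return ts.timetuple().tm_yday + ts.hour / 24


# Two samples from the same hour map to the same decimal day,
# so they end up in the same group.
a = decimal_dayofyear(datetime(2020, 9, 8, 6, 5))
b = decimal_dayofyear(datetime(2020, 9, 8, 6, 50))
print(a, b)  # 252.25 252.25
```

Minutes are dropped on purpose: every sample within an hour shares one label, which is what lets the mean be taken per (day of year, hour) pair.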

I experimented with a pandas MultiIndex and came up with a decimal day-of-year grouping method, which I call "ddayofyear". This method takes in high-resolution data, re-aggregates it to a meaningful resolution (such as [hour, dayofyear]), and computes the mean. I think this could be helpful to others and wanted to share it.

I even think it could be refined and added to the existing groupers in xarray. The result of my method is a DataArray with a float coordinate representing the decimal day of the year.

The result is:

<xarray.DataArray (ddayofyear: 8784)>
array([25.81254691, 27.29062751, 28.85573064, ..., 22.1240744 ,
       23.37302464, 25.26632443])
Coordinates:
  * ddayofyear  (ddayofyear) float64 1.0 1.042 1.083 1.125 ... 366.9 366.9 367.0

As you can see, the coordinate is no longer an integer but a float (now decimal!).
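
In case it helps anyone reading the float coordinate back, here is a minimal sketch that splits a decimal day into its (day of year, hour) parts. It assumes the coordinate was built as day of year plus hour / 24; `split_decimal_day` is a hypothetical helper name.

```python
import math


def split_decimal_day(ddoy: float) -> tuple[int, int]:
    """Recover (dayofyear, hour) from dayofyear + hour / 24 (hypothetical helper)."""
    day = math.floor(ddoy)
    # round() absorbs the small floating-point error left by hour / 24
    hour = round((ddoy - day) * 24)
    return day, hour


print(split_decimal_day(252.25))        # (252, 6)
print(split_decimal_day(366 + 23 / 24))  # (366, 23)
```

Under that assumption the largest possible value is 366 + 23/24 ≈ 366.958, so the trailing 367.0 in the repr above is presumably just that value rounded for display.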

Happy coding,
Marco

The full code:


"""

Created on Fri Feb 21 10:12:28 2025

Marco Miani
EMS GEA Computing LTD
Through numbers, the Earth.

https://www.gea-computing.eu
office@geacomputing.eu



"""


import xarray as xr
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt





#%% Create a synthetic timeseries

# Define a string for controlling the sampling frequency
f = '10min'


# Create a datetime index at the sampling interval set by f (10 minutes here)
time = pd.date_range(start='2020-01-01T00:00:00', end='2025-01-01T23:00:00', freq=f)

# Generate a sinusoid with a daily cycle plus a slower seasonal (yearly) cycle
temperature = 25 + 5 * np.sin(2 * np.pi * (time.hour + time.minute / 60) / 24) + 2 * np.sin(2 * np.pi * ((time.dayofyear - 1) - 200) / 365.25)


# Add some Gaussian noise to the sinusoid
temperature += 1.25*np.random.normal(0, 1, size=len(time))



#%% Create an xarray data array 



# Create a DataArray
da = xr.DataArray(temperature, dims=['time'], coords={'time': time})




# %% Grouping: 

def groupby_hours(da):
    # Define a grouper: one (dayofyear, hour) pair per timestamp
    grouper = xr.DataArray(
        pd.MultiIndex.from_arrays(
            [
                da.time.dt.dayofyear.values,
                da.time.dt.hour.values,
                # da.time.dt.minute.values
            ],
            names=[
                'dayofyear',
                'hour',
                # 'minute'
            ],
        ),
        dims=['time'],
        coords=[da.time],
    )

    # Do the grouping
    da_gr = da.groupby(grouper).mean()

    group1 = 'group_level_0'
    group2 = 'group_level_1'

    # Combine the two index levels into a single decimal day of year
    ddayofyear = da_gr[group1] + (da_gr[group2] / 24)

    # Polish coordinates and dimensions
    da_gr = da_gr.rename({'group': 'ddayofyear'}).assign_coords(ddayofyear=ddayofyear.values)

    da_gr.ddayofyear.attrs['name'] = 'decimal days'

    return da_gr

da_gr = groupby_hours(da)

dcherian · 2025-02-22T14:39:42Z

Maintainer

Very nice. You shouldn't need to use pandas here.

ds.coords["ddoy"] = ds.time.dt.dayofyear + ds.time.dt.hour / 24
ds.groupby("ddoy").mean()
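
For anyone who wants to try the coordinate-based approach end to end, here is a minimal self-contained sketch on a toy series. It assumes the decimal day is defined as day of year plus hour / 24, as in the original post; the three-day synthetic series and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy data: three days sampled every 10 minutes (assumed, not the real dataset)
time = pd.date_range("2021-01-01", "2021-01-03 23:50", freq="10min")
da = xr.DataArray(
    np.arange(time.size, dtype=float),
    dims="time",
    coords={"time": time},
    name="temperature",
)

# Attach the decimal day of year as a non-index coordinate along time
da = da.assign_coords(ddoy=da.time.dt.dayofyear + da.time.dt.hour / 24)

# Group on the float coordinate: one mean per (day of year, hour)
hourly = da.groupby("ddoy").mean()

print(hourly.sizes["ddoy"])  # 72, i.e. 3 days x 24 hours
print(float(hourly[0]))      # 2.5, the mean of the first six samples
```

With several years of data, samples from the same day of year and hour in different years fall into the same group, which matches what `groupby('time.dayofyear')` does for whole days.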
