OpenGeoHub Summer School 2020, Wageningen

Outline

Part I: Introduction to Earth observation (EO) data cubes and the gdalcubes R package
Part II: Examples on using data cubes to combine imagery from different satellite-based EO missions
Part III: Summary, discussion, and practical exercises

All the material of this tutorial is online at GitHub, including R markdown sources, rendered HTML output, and solutions to the practical exercises.

Sample Data

Part I will use a collection of 180 Landsat 8 surface reflectance images, covering a small part of the Brazilian Amazon forest. The live demonstration will use the full resolution dataset (62 gigabytes compressed; > 200 gigabytes unzipped; available here), whereas we will use a downsampled version of the same dataset with coarse spatial resolution (300 meter pixel size; 740 megabytes compressed; 2 gigabytes unzipped) in the practical part (available here).

After downloading whichever version of the dataset, make sure to unzip it. The following R code will download the low resolution version to your current working directory and unzip it.

download.file("https://uni-muenster.sciebo.de/s/e5yUZmYGX0bo4u9/download", 
              destfile = "L8_Amazon.zip", mode="wb")
unzip("L8_Amazon.zip", exdir = "L8_Amazon")

Part I: Introduction to the gdalcubes R package

Raw satellite imagery is mostly distributed as collection of files, whether on download portals of space agencies, or in cloud computing environments (Amazon Web Services, Google Cloud, …). If we want to analyze more than a single image, or even images from multiple satellites, we quickly run into the following challenges:

spectral bands at different spatial resolutions
spatially overlapping images
irregular time series for pixels from different tiles (or in overlapping areas)
different spatial reference systems of images
different data formats and structures

To do time series analysis, process larger areas, and / or combine datasets from different sensors / satellites, we first must restructure our data, e.g. as a data cube with a single spatial reference system, regular pixel sizes, both in time and in space.

Notice that what we call cube is actually not really a cube. It has (up to) four dimensions, and the lengths of the dimensions may be different. Therefore, four dimensional regular raster data cubes also cover simple time series, multiband time series, grayscale images, multispectral images, and time-series of images.

Existing tools (with a focus on R)

GDAL, the Geospatial Data Abstraction Library is a software library reading and writing all relevant raster (and vector) data formats, and providing functions to warp (reproject, rescale, resample, and crop) multiband raster images. It has a three dimensional (space, bands) raster data model and solves some of the problems (data formats, image warping). However, it does not know about the organization of data products, and time. GDAL is written in C / C++ but the rgdalpackage (Bivand et al. 2019) provides an easy to use interface in R.

There are further R packages to process satellite imagery:

raster(Hijmans 2019)

well established, stable, reliable
three-dimensional only, no multispectral AND multitemporal stacks
modern version terra (Hijmans 2020) recently published on CRAN (see https://cran.r-project.org/package=terra)

stars (Pebesma 2019)

see tutorial by Edzer Pebesma yesterday
arbitrary dimensions
has vector data cubes
lazy evaluation approach, compute only the pixels you see

First steps with gdalcubes

gdalcubes is a quite new C++ library and R package that mostly wraps functions of the C++ library. It uses GDAL to read, write, and warp images, but understands date/time and how complex satellite image data products are organized. We will focus on the R package here. To get started, please install the gdalcubes package from CRAN with:

install.packages("gdalcubes")

We can load the package and make sure that all computations later in this tutorial use up to 8 threads with:

library(gdalcubes)
gdalcubes_options(threads=8)

Please notice that this tutorial needs package version 0.3.0, which you can check by running:

packageVersion("gdalcubes")

## [1] '0.3.0'

Image Collections

At first, we need to tell gdalcubes where our satellite images come from and how they are organized.

L8.files = list.files("/media/marius/Samsung_T5/eodata/L8_Amazon_full", pattern = ".tif", recursive = TRUE, full.names = TRUE) 
head(L8.files, 15)

##  [1] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_pixel_qa.tif"  
##  [2] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_radsat_qa.tif" 
##  [3] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_aerosol.tif"
##  [4] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band1.tif"  
##  [5] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band2.tif"  
##  [6] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band3.tif"  
##  [7] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band4.tif"  
##  [8] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band5.tif"  
##  [9] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band6.tif"  
## [10] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band7.tif"  
## [11] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014082001T1-SC20190715051515/LC08_L1TP_226063_20140820_20170420_01_T1_pixel_qa.tif"  
## [12] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014082001T1-SC20190715051515/LC08_L1TP_226063_20140820_20170420_01_T1_radsat_qa.tif" 
## [13] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014082001T1-SC20190715051515/LC08_L1TP_226063_20140820_20170420_01_T1_sr_aerosol.tif"
## [14] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014082001T1-SC20190715051515/LC08_L1TP_226063_20140820_20170420_01_T1_sr_band1.tif"  
## [15] "/media/marius/Samsung_T5/eodata/L8_Amazon_full/LC082260632014082001T1-SC20190715051515/LC08_L1TP_226063_20140820_20170420_01_T1_sr_band2.tif"

length(L8.files)

## [1] 1800

sum(file.size(L8.files)) / 1000^3 # gigabytes

## [1] 271.3001

We see that every image is represented by a directory, with individual files for spectral bands. To be able to work with these images as “one object”, we must scan all the files once and extract some metadata (date/time, coordinate reference system, spatial extent, etc.). Therefore, we compile the images to an image collection with the following command:

L8.col = create_image_collection(L8.files, format = "L8_SR", out_file = "L8.db")
# L8.col = image_collection("L8.db") 
L8.col

## A GDAL image collection object, referencing 180 images with 10  bands
## Images:
##                                       name      left       top    bottom
## 1 LC08_L1TP_226063_20140719_20170421_01_T1 -54.15776 -3.289862 -5.392073
## 2 LC08_L1TP_226063_20140820_20170420_01_T1 -54.16858 -3.289828 -5.392054
## 3 LC08_L1GT_226063_20160114_20170405_01_T2 -54.16317 -3.289845 -5.392064
## 4 LC08_L1TP_226063_20160724_20170322_01_T1 -54.16317 -3.289845 -5.392064
## 5 LC08_L1TP_226063_20170609_20170616_01_T1 -54.17399 -3.289810 -5.392044
## 6 LC08_L1TP_226063_20170711_20170726_01_T1 -54.15506 -3.289870 -5.392083
##       right            datetime
## 1 -52.10338 2014-07-19T00:00:00
## 2 -52.11418 2014-08-20T00:00:00
## 3 -52.10878 2016-01-14T00:00:00
## 4 -52.10878 2016-07-24T00:00:00
## 5 -52.11958 2017-06-09T00:00:00
## 6 -52.09798 2017-07-11T00:00:00
##                                                 srs
## 1 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs
## 2 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs
## 3 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs
## 4 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs
## 5 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs
## 6 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs
## [ omitted 174 images ] 
## 
## Bands:
##         name offset scale unit       nodata image_count
## 1    AEROSOL      0     1                           180
## 2        B01      0     1      -9999.000000         180
## 3        B02      0     1      -9999.000000         180
## 4        B03      0     1      -9999.000000         180
## 5        B04      0     1      -9999.000000         180
## 6        B05      0     1      -9999.000000         180
## 7        B06      0     1      -9999.000000         180
## 8        B07      0     1      -9999.000000         180
## 9   PIXEL_QA      0     1                           180
## 10 RADSAT_QA      0     1                           180

The result is essentially an index stored in a single file (SQLite) database, typically consuming around a few kilobytes per image. Later in the tutorial, this index e.g. allows to avoid reading image data from files which do not intersect with our spatiotemporal area of interest.

The format argument describes, how all the metadata can be extracted from the images. The gdalcubes package comes with a set of predefined image collection formats for particular data products. We can list available formats with:

collection_formats()

##    CHIRPS_v2_0_daily_p05_tif | Image collection format for CHIRPS v 2.0 daily
##                              | global precipitation dataset (0.05 degrees
##                              | resolution) from GeoTIFFs, expects list of .tif
##                              | or .tif.gz files as input. [TAGS: CHIRPS,
##                              | precipitation]
##  CHIRPS_v2_0_monthly_p05_tif | Image collection format for CHIRPS v 2.0 monthly
##                              | global precipitation dataset (0.05 degrees
##                              | resolution) from GeoTIFFs, expects list of .tif
##                              | or .tif.gz files as input. [TAGS: CHIRPS,
##                              | precipitation]
##            ESA_CCI_SM_ACTIVE | Collection format for ESA CCI soil moisture
##                              | active product (version 4.7) [TAGS: Soil
##                              | Moisture, ESA, CCI]
##           ESA_CCI_SM_PASSIVE | Collection format for ESA CCI soil moisture
##                              | passive product (version 4.7) [TAGS: Soil
##                              | Moisture, ESA, CCI]
##    GPM_IMERG_3B_DAY_GIS_V06A | Collection format for daily
##                              | IMERG_3B_DAY_GIS_V06A data [TAGS: Precipitation,
##                              | GPM, IMERG]
##                      L8_L1TP | Collection format for Landsat 8 Level 1 TP
##                              | product [TAGS: Landsat, USGS, Level 1, NASA]
##                        L8_SR | Collection format for Landsat 8 surface
##                              | reflectance product [TAGS: Landsat, USGS, Level
##                              | 2, NASA, surface reflectance]
##                      MxD09GA | Collection format for selected bands from the
##                              | MODIS MxD09GA (Aqua and Terra) product [TAGS:
##                              | MODIS, surface reflectance]
##                      MxD10A2 | Collection format for selected bands from the
##                              | MODIS MxD10A2 (Aqua and Terra) v006 Snow Cover
##                              | product [TAGS: MODIS, Snow Cover]
##                      MxD11A1 | Collection format for selected bands from the
##                              | MODIS MxD11A2 (Aqua and Terra) v006 Land Surface
##                              | Temperature product [TAGS: MODIS, LST]
##                      MxD11A2 | Collection format for selected bands from the
##                              | MODIS MxD11A2 (Aqua and Terra) v006 Land Surface
##                              | Temperature product [TAGS: MODIS, LST]
##                      MxD13A2 | Collection format for selected bands from the
##                              | MODIS MxD13A2 (Aqua and Terra) product [TAGS:
##                              | MODIS, VI, NDVI, EVI]
##                      MxD13A3 | Collection format for selected bands from the
##                              | MODIS MxD13A3 (Aqua and Terra) product [TAGS:
##                              | MODIS, VI, NDVI, EVI]
##                      MxD13Q1 | Collection format for selected bands from the
##                              | MODIS MxD13Q1 (Aqua and Terra) product [TAGS:
##                              | MODIS, VI, NDVI, EVI]
##                      MxD14A2 | Collection format for the MODIS MxD14A2 (Aqua
##                              | and Terra) product [TAGS: MODIS, Fire]
## PlanetScope_3B_AnalyticMS_SR | Image collection format for PlanetScope 4-band
##                              | scenes [TAGS: PlanetScope, BOA, Surface
##                              | Reflectance]
##             Sentinel1_IW_GRD | Image collection format for Sentinel 1 Level 1
##                              | GRD data as downloaded from the Copernicus Open
##                              | Access Hub, expects a list of file paths as
##                              | input. The format works on original ZIP
##                              | compressed as well as uncompressed imagery.
##                              | [TAGS: Sentinel, Copernicus, ESA, SAR]
##                Sentinel2_L1C | Image collection format for Sentinel 2 Level 1C
##                              | data as downloaded from the Copernicus Open
##                              | Access Hub, expects a list of file paths as
##                              | input. The format works on original ZIP
##                              | compressed as well as uncompressed imagery.
##                              | [TAGS: Sentinel, Copernicus, ESA, TOA]
##            Sentinel2_L1C_AWS | Image collection format for Sentinel 2 Level 1C
##                              | data in AWS [TAGS: Sentinel, Copernicus, ESA,
##                              | TOA]
##                Sentinel2_L2A | Image collection format for Sentinel 2 Level 2A
##                              | data as downloaded from the Copernicus Open
##                              | Access Hub, expects a list of file paths as
##                              | input. The format should work on original ZIP
##                              | compressed as well as uncompressed imagery.
##                              | [TAGS: Sentinel, Copernicus, ESA, BOA, Surface
##                              | Reflectance]
##          Sentinel2_L2A_THEIA | Image collection format for Sentinel 2 Level 2A
##                              | data as downloaded from Theia. [TAGS: Sentinel,
##                              | ESA, Flat Reflectance, Theia]

The number of available formats is still rather limited, but continues to grow and is extensible (using add_collection_format()). In fact, a collection format is a single JSON (JavaScript Object Notation) file, describing some rules how to extract e.g. date/time, and bands from filenames (examples at https://github.com/gdalcubes/collection_formats). Writing collection formats for your own non-standard datasets in many cases is not too difficult and documented here.

In our example, we used the predefined format "L8_SR" for Landsat 8 surface reflectance data as downloaded from the USGS portal.

The creation of image collections is typically done only once. We can add images to an existing collection with add_images() and we can derive the spatiotemporal extent of the collection with:

extent(L8.col, srs="EPSG:4326")

## $left
## [1] -59.12746
## 
## $right
## [1] -52.09798
## 
## $top
## [1] -1.844241
## 
## $bottom
## [1] -6.84404
## 
## $t0
## [1] "2013-06-12T00:00:00"
## 
## $t1
## [1] "2019-07-06T00:00:00"

Throughout the package, coordinate reference systems can be specified as strings which are understood by GDAL. This includes the “EPSG:XXXX”, WKT, proj4, and other formats (see here.

Defining a Data Cube View

We can define a target data cube by its geometry, i.e., the spatiotemporal extent, the spatial reference system, the spatial size, and the temporal duration of cells. We call this a data cube view, i.e., the geometry of a cube without connecting it to any data. To create a data cube view, we can use the cube_view() function:

v.overview.500m = cube_view(srs="EPSG:3857", extent=L8.col, dx=500, dy=500, dt = "P1Y", resampling="average", aggregation="median")
v.overview.500m

## A data cube view object
## 
## Dimensions:
##                 low              high count pixel_size
## t              2013              2019     7        P1Y
## y -763764.387686915 -205264.387686915  1117        500
## x -6582280.06164712 -5799280.06164712  1566        500
## 
## SRS: "EPSG:3857"
## Temporal aggregation method: "median"
## Spatial resampling method: "average"

v.subarea.100m = cube_view(extent=list(left=-6180000, right=-6080000, bottom=-550000, top=-450000, 
   t0="2014-01-01", t1="2018-12-31"), dt="P1Y", dx=100, dy=100, srs="EPSG:3857", 
   aggregation = "median", resampling = "average")
v.subarea.100m

## A data cube view object
## 
## Dimensions:
##        low     high count pixel_size
## t     2014     2018     5        P1Y
## y  -550000  -450000  1000        100
## x -6180000 -6080000  1000        100
## 
## SRS: "EPSG:3857"
## Temporal aggregation method: "median"
## Spatial resampling method: "average"

v.subarea.30m.daily = cube_view(view = v.subarea.100m, extent = 
    list(t0 = "2015-01-01",t1 = "2015-12-31"), dt="P1D", dx=30, dy=30) 
v.subarea.30m.daily

## A data cube view object
## 
## Dimensions:
##          low       high count pixel_size
## t 2015-01-01 2015-12-31   365        P1D
## y    -550010    -449990  3334         30
## x   -6180010   -6079990  3334         30
## 
## SRS: "EPSG:3857"
## Temporal aggregation method: "median"
## Spatial resampling method: "average"

Notice that the data cube view does not contain any information on bands, because it is independent from particular data products.

Data Cube Creation

Having defined an image collection, and a data cube view, a data cube is simply the combination of the two. We can create a data cube with the raster_cube() function:

L8.cube.overview = raster_cube(L8.col, v.overview.500m)
L8.cube.overview

## A GDAL data cube proxy object
## 
## Dimensions:
##                 low              high count pixel_size chunk_size
## t              2013              2019     7        P1Y          1
## y -763764.387686915 -205264.387686915  1117        500        256
## x -6582280.06164712 -5799280.06164712  1566        500        256
## 
## Bands:
##         name offset scale nodata unit
## 1    AEROSOL      0     1    NaN     
## 2        B01      0     1    NaN     
## 3        B02      0     1    NaN     
## 4        B03      0     1    NaN     
## 5        B04      0     1    NaN     
## 6        B05      0     1    NaN     
## 7        B06      0     1    NaN     
## 8        B07      0     1    NaN     
## 9   PIXEL_QA      0     1    NaN     
## 10 RADSAT_QA      0     1    NaN

L8.cube.subarea = raster_cube(L8.col, v.subarea.100m)
L8.cube.subarea

## A GDAL data cube proxy object
## 
## Dimensions:
##        low     high count pixel_size chunk_size
## t     2014     2018     5        P1Y          1
## y  -550000  -450000  1000        100        256
## x -6180000 -6080000  1000        100        256
## 
## Bands:
##         name offset scale nodata unit
## 1    AEROSOL      0     1    NaN     
## 2        B01      0     1    NaN     
## 3        B02      0     1    NaN     
## 4        B03      0     1    NaN     
## 5        B04      0     1    NaN     
## 6        B05      0     1    NaN     
## 7        B06      0     1    NaN     
## 8        B07      0     1    NaN     
## 9   PIXEL_QA      0     1    NaN     
## 10 RADSAT_QA      0     1    NaN

L8.cube.subarea.daily = raster_cube(L8.col, v.subarea.30m.daily)
L8.cube.subarea.daily

## A GDAL data cube proxy object
## 
## Dimensions:
##          low       high count pixel_size chunk_size
## t 2015-01-01 2015-12-31   365        P1D          1
## y    -550010    -449990  3334         30        256
## x   -6180010   -6079990  3334         30        256
## 
## Bands:
##         name offset scale nodata unit
## 1    AEROSOL      0     1    NaN     
## 2        B01      0     1    NaN     
## 3        B02      0     1    NaN     
## 4        B03      0     1    NaN     
## 5        B04      0     1    NaN     
## 6        B05      0     1    NaN     
## 7        B06      0     1    NaN     
## 8        B07      0     1    NaN     
## 9   PIXEL_QA      0     1    NaN     
## 10 RADSAT_QA      0     1    NaN

This is very cheap, simply returning proxy objects, but not reading any image data. The package delays the computational intensive parts as much as possible (e.g., until users call plot()). The returned object knows about the bands of the data product. We can use select_bands() to get only the spectral bands we are interested in:

L8.cube.overview.rgb = select_bands(L8.cube.overview, c("B02", "B03", "B04"))
L8.cube.overview.rgb

## A GDAL data cube proxy object
## 
## Dimensions:
##                 low              high count pixel_size chunk_size
## t              2013              2019     7        P1Y          1
## y -763764.387686915 -205264.387686915  1117        500        256
## x -6582280.06164712 -5799280.06164712  1566        500        256
## 
## Bands:
##   name offset scale nodata unit
## 1  B02      0     1    NaN     
## 2  B03      0     1    NaN     
## 3  B04      0     1    NaN

There are some utility functions on data cubes, including:

names(L8.cube.overview.rgb)

## [1] "B02" "B03" "B04"

srs(L8.cube.overview.rgb)

## [1] "EPSG:3857"

bands(L8.cube.overview.rgb)

##   name offset scale nodata unit
## 1  B02      0     1    NaN     
## 2  B03      0     1    NaN     
## 3  B04      0     1    NaN

dim(L8.cube.overview.rgb)

## [1]    7 1117 1566

Plotting Data Cubes

The plot function can be used to visualize data cubes. Calling plot() will start reading and processing the data:

For a simple RGB plot, we use the rgb argument to specify which bands correspond to the red, green, and blue channels, and specify the black and white points of the channels (to control contrast and brightness) in zlim.

plot(L8.cube.overview.rgb, rgb=3:1, zlim=c(0,1500))

plot(select_bands(L8.cube.subarea, c("B02", "B03", "B04")), rgb=3:1, zlim=c(0,1500))

Notice that we can also plot bands individually, creating a two-dimensional plot layout of bands and time. Using key.pos = 1, and col=viridis::viridis, we plot a legend at the bottom of the plot, and use the viridis color scales (this requires the viridis package).

plot(L8.cube.overview.rgb, zlim=c(0,1500), key.pos=1, col=viridis::viridis)

## Warning in plot.cube(L8.cube.overview.rgb, zlim = c(0, 1500), key.pos = 1, : too
## many time instances, plotting only times 1 to 5

Plotting an identical data cube twice, with different visualization arguments zlim, col, and others will not need to reprocess the data cube again. plot() internally writes netCDF files to a temporary directory and remembers that a specific cube is already available.

The plot() function also considers different types of data cubes. For example, if the number of cells in x and y direction equals one, we get a simple time series plot, as we will see later in this tutorial.

Masking

The raster_cube() function supports an additional mask band, e.g., to leave out cloudy pixels when creating a data cube. To define a mask, we can use the image_mask() function. This function expects the name of the mask band as its first band argument. Additionally, we can either pass a vector of values that are masked (all bands set to NA if the specified mask band has one of the provided values) as the values argument, or give a range of mask values by passing minimum and maximum values as min and max arguments. Masks can be inverted by setting invert = TRUE. For bit field masks (as for MODIS data), it is possible to extract specific bits (applying a logical AND) of the band values, before comparing them to the values or range of the mask.

L8.clear_mask = image_mask("PIXEL_QA", values=c(322, 386, 834, 898, 1346, 324, 388, 836, 900, 1348), invert = TRUE)
x = raster_cube(L8.col, v.subarea.100m, mask=L8.clear_mask) 
x = select_bands(x, c("B02","B03","B04"))
plot(x, rgb=3:1, zlim=c(0,1500))

Animations

The data cube representation makes it straightforward to create animations, by plotting time slices of the cube individually, and use these plots as animation frames:

animate(select_bands(raster_cube(L8.col, v.subarea.100m, mask = L8.clear_mask), c("B02","B03","B04")), rgb=3:1, zlim=c(0,1300))

## # A tibble: 5 x 7
##   format width height colorspace matte filesize density
##   <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
## 1 gif      720    720 sRGB       FALSE        0 72x72  
## 2 gif      720    720 sRGB       TRUE         0 72x72  
## 3 gif      720    720 sRGB       TRUE         0 72x72  
## 4 gif      720    720 sRGB       TRUE         0 72x72  
## 5 gif      720    720 sRGB       TRUE         0 72x72

Exporting Data Cubes

Sometimes we want to process or visualize data cubes with external software or other packages. We can export data cubes either as single netCDF files, or as a collection of (cloud-optimized) GeoTIFF files, where each time-slice of a cube will be stored as one (multiband) file.

Both, netCDF and GeoTIFF export support compression, and packing (converting double precision numeric values to smaller integer types by applying an offset and scale) to reduce the file size if needed (see documentation at ?write_ncdf, and ?write_tif).

write_ncdf(L8.cube.overview.rgb, tempfile(pattern = "cube", tmpdir = getwd(),fileext = ".nc"))

Similarly, the package includes an st_as_stars() function to convert data cubes to stars objects (Pebesma 2019) and the write_tif() function can be used to process (only three-dimensional) data cubes with the raster package (Hijmans 2019):

x = raster::stack(
  write_tif(
    select_bands(
      raster_cube(L8.col, v.subarea.100m), "B05")))
x

## class      : RasterStack 
## dimensions : 1000, 1000, 1e+06, 5  (nrow, ncol, ncell, nlayers)
## resolution : 100, 100  (x, y)
## extent     : -6180000, -6080000, -550000, -450000  (xmin, xmax, ymin, ymax)
## crs        : +proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +no_defs 
## names      : cube_3b4a3deb93162014, cube_3b4a3deb93162015, cube_3b4a3deb93162016, cube_3b4a3deb93162017, cube_3b4a3deb93162018

The exported GeoTIFF files can also be used for interactive plotting using the mapview package.

library(mapview)
mapView(x[[1]], maxpixels =  4e+06, layer.name="B05",
        map.types=c("Esri.WorldImagery","OpenTopoMap"))

Data Cube Processing

The gdalcubes package comes with some built-in operations to process data cubes. The following operations produce a derived data cube from one or more input data cubes.

Operator	Description
`apply_pixel`	Apply arithmetic expressions on band values per pixel.
`fill_time`	Fill missing values by simple time-series interpolation.
`filter_pixel`	Filter pixels based on logical expressions.
`filter_geom`	Filter pixels that do not intersect with a given input geometry
`join_bands`	Combine bands of two or more identically shaped input data cubes.
`reduce_space`	Apply a reducer function over time slices of a data cube.
`reduce_time`	Apply a reducer function over individual pixel time series.
`select_bands`	Select a subset of a data cube’s bands.
`window_time`	Apply a moving window reducer of kernel over individual pixel time series.

These operations can be chained (e.g., using the pipe operator %>% from the magrittr package, which passes a left-hand-side R expression as the first argument to the function on the right-hand-side (e.g. rnorm(100) %>% mean).

The implementation of these operations in gdalcubes works chunk-wise, i.e., reads only the chunk of the input data cube that is currently needed. This makes sure that only small parts are needed in main memory. The chunk sizes can be controlled as an additional argument to the raster_cube() function.

Arithmetic expressions on pixel band values

The apply_pixel() function can be used to apply per-pixel arithmetic expressions on band values of a data cube. Examples include the calculation of vegetation indexes. The function takes a data cube, a string vector of arithmetic expressions, and a vector of result band names as arguments. Below, we derive the normalized difference vegetation index (NDVI) from the red and near infrared (NIR) channel. We can apply multiple expressions at the same time by providing a vector of expressions (and names).

library(magrittr)
library(colorspace)
ndvi.col = function(n) {
  rev(sequential_hcl(n, "Green-Yellow"))
}
L8.ndvi = raster_cube(L8.col, v.subarea.100m, mask=L8.clear_mask) %>%
  select_bands(c("B04","B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", names = "NDVI", keep_bands=FALSE)
L8.ndvi

## A GDAL data cube proxy object
## 
## Dimensions:
##        low     high count pixel_size chunk_size
## t     2014     2018     5        P1Y          1
## y  -550000  -450000  1000        100        256
## x -6180000 -6080000  1000        100        256
## 
## Bands:
##   name offset scale nodata unit
## 1 NDVI      0     1    NaN

plot(L8.ndvi, col=ndvi.col, zlim=c(-0.3,1), key.pos = 1)

Creating a chain of data cube operations still returns proxy objects, knowing the size and shape of the output data cube, before calling plot will start computations. In the example, we do not need the original bands after computing the NDVI and set keep_bands = FALSE (this is the default).

Similar to apply_pixel() we can filter pixels by arithmetic expressions with filter_pixel(). Values of all bands for pixels not fulfilling a logical expression will be set to NA.

raster_cube(L8.col, v.subarea.100m, mask=L8.clear_mask) %>%
  select_bands(c("B04","B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)" , names = "NDVI") %>%
  filter_pixel("NDVI < 0.7") %>%
  plot(col=ndvi.col, zlim=c(-0.3,1), na.color = "aliceblue", key.pos = 1)

Reduction over time and space

Data cubes can be reduced over the space and time dimensions. The reduce_time() function applies one or more reducer functions over pixel time series, producing a single (multiband) result image, whereas reduce_space() reduces time slices in the cube to single values (per band), resulting in a single (multiband) time series.

The example below derives median NDVI values over all pixel time series.

raster_cube(L8.col, v.subarea.100m, mask=L8.clear_mask) %>%
  select_bands(c("B04","B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", names = "NDVI", keep_bands=FALSE) %>%
  reduce_time("median(NDVI)") %>%
  plot(col=ndvi.col, nbreaks=100, zlim=c(-0.3,1), key.pos = 1)

Possible reducers include "min", "mean", "median", "max", "count" (count non-missing values), "sum", "var" (variance), and "sd" (standard deviation). Reducer expressions are always given as a string starting with the reducer name followed by the band name in parentheses. Notice that it is not possible to apply more complex arithmetic expressions here. It is however possible to mix reducers and bands:

raster_cube(L8.col, v.subarea.100m, mask=L8.clear_mask) %>%
  select_bands(c("B04","B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", names = "NDVI", keep_bands=TRUE) %>%
  reduce_time("median(NDVI)", "mean(NDVI)","max(B05)")

## A GDAL data cube proxy object
## 
## Dimensions:
##        low     high count pixel_size chunk_size
## t     2014     2014     1        P5Y          1
## y  -550000  -450000  1000        100        256
## x -6180000 -6080000  1000        100        256
## 
## Bands:
##          name offset scale nodata unit
## 1 NDVI_median      0     1    NaN     
## 2   NDVI_mean      0     1    NaN     
## 3     B05_max      0     1    NaN

Results of reduce_space() are plotted as simple time series.

raster_cube(L8.col, v.subarea.100m,  mask=L8.clear_mask) %>%
  select_bands(c("B04","B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", names = "NDVI") %>%
  reduce_space("median(NDVI)", "sd(NDVI)") %>%
  plot()

The "count" reducer is often very useful to get an initial understanding of an image collection.

raster_cube(L8.col, cube_view(view=v.overview.500m, dt="P1D"), mask=L8.clear_mask) %>%
  select_bands(c("B01")) %>%
  reduce_time("count(B01)") %>%
  plot(key.pos=1)

raster_cube(L8.col, cube_view(view=v.overview.500m, dt="P1M"), mask=L8.clear_mask) %>%
  select_bands("B01") %>%
  reduce_space("count(B01)") %>%
  plot()

We can see that there are almost no observations during the months from October to May, because the download was limited to images with low cloud percentages.

Time-series methods

There are two more built-in functions that operate on individual pixel time series.

The fill_time() function interpolates missing values by preceding or succeeding values (using simple linear or nearest neighbor interpolation, or carrying observations forwards or backwards), The window_time() function can either apply a moving window kernel, or apply a reducer function over moving windows.

In the example below, we sum NDVI changes between subsequent time slices in the data cube, and visualize the result using a diverging color scale from the colorspace package.

col.trend = function(n) {
  rev(diverging_hcl(n = n, palette = "Green-Brown"))
}
raster_cube(L8.col, v.subarea.100m, mask=L8.clear_mask) %>%
  select_bands(c("B04","B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", names = "NDVI") %>%
  fill_time(method = "locf") %>%
  window_time(kernel = c(-1,1), window=c(1,0)) %>%
  reduce_time("sum(NDVI)") %>%
  plot(zlim=c(-0.4,0.4),col = col.trend, key.pos=1, nbreaks = 10)

Applying user-defined R functions

So far, we have provided expressions and reducers as characters / strings. The reason was that these methods automatically translate to built-in C++ functions. In the current version, the following functions may also receive R functions as argument.

Operator	Input	Output
`apply_pixel`	Vector of band values for one pixel	Vector of band values of one pixel
`reduce_time`	Multi-band time series as a matrix	Vector of band values
`reduce_space`	Three-dimensional array with dimensions bands, x, and y	Vector of band values
`apply_time`	Multi-band time series as a matrix	Multi-band time series as a matrix

This opens up quite a bunch of things we can do, e.g. using functions from our favorite R packages to process pixel time series. In the example below, we simply fit a line to individual NDVI pixel time series and return its slope (trend).

raster_cube(L8.col, cube_view(view = v.subarea.100m, dx=180), mask = L8.clear_mask) %>%
  select_bands(c("B04","B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", names = "NDVI") %>%
  reduce_time(names=c("ndvi_trend"), FUN=function(x) {
    z = data.frame(t=1:ncol(x), ndvi=x["NDVI",])
    result = NA
    if (sum(!is.na(z$ndvi)) > 3) {
      result = coef(lm(ndvi ~ t, z, na.action = na.exclude))[2]
    }
    return(result) 
  }) %>%
  plot(key.pos=1, col=col.trend, nbreaks=10, zlim=c(-0.2,0.2))

There is no limit in what we can do in the provided R function, but we must take care of a few things:

The reducer function is executed in a new R process without access to the current workspace. It is not possible to access variables defined outside of the function and packages must be loaded within the function.
The reducer function must always return a vector with the same length (for all time series).
It is a good idea to think about NA values, i.e. you should check whether the complete time series is NA, and that missing values do not produce errors.

Similarly, we could create a maximum NDVI RGB composite image for 2016 with the following function:

raster_cube(L8.col, cube_view(view = v.subarea.100m, dx=180, dt = "P16D", extent = list(t0 = "2016-01-01", t1="2016-12-31")), mask = L8.clear_mask) %>%
  select_bands(c("B02","B03", "B04","B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", names = "NDVI", keep_bands = TRUE) %>%
  reduce_time(names=c("R", "G", "B"), FUN=function(x) {
    i = which.max(x["NDVI",])
    return(x[c("B04","B03","B02"), i])
  }) %>%
  plot(rgb = 1:3, zlim=c(0, 1200))

Extraction from Data Cubes

Besides exporting data cubes as netCDF or GeoTIFF files and converting to stars objects, package version 0.3.0 adds functions to extract points (query_points()), time series (query_timeseries()), and summary statistics over polygons (zonal_statistics()). These functions are especially useful for integration with vector data, e.g. to prepare training datasets for applying machine learning methods, or to enrich movement trajectories with environmental data.

Internals

Data cubes in memory

Internally, data cubes are simple chunked arrays. A full data cube is not kept in memory (unless is has only one chunk). A chunk is simply a dense four-dimensional C++ array of type double. Chunks, which are completely empty, however, do not consume main memory and are propagated among cube operations. For example if one spatiotemporal chunk does not intersect with any images in a collection, it will be empty and operations will also be skipped. For this reason, working with higher temporal resolution (e.g. daily for Landsat 8) is not necessarily much slower. Exceptions to this include the application of user-defined R functions and the writing of dense files (netCDF / GeoTIFF). However, using compression should be effective in the latter case.

For performance tuning, the size of chunks can be specified as additional argument to the raster_cube() function. This affects also parallelization.

Proxy Objects

The package is mostly wrapping functions and objects in C++. It makes a lot of use of the Rcpp package and external pointers, pointing to temporary in-memory objects, managed in C++. Therefore, it is not possible to use save() to save a data cube and reload it in a new R session. Proxy objects internally can be serialized as processing graphs. These are JSON text documents containing all information how a proxy object can be recreated.

raster_cube(L8.col, cube_view(view = v.subarea.100m, extent=list(t0="2014-01",t1="2018-12")), mask=L8.clear_mask) %>%
  select_bands(c("B04","B05")) %>%
  apply_pixel("(B05-B04)/(B05+B04)", names = "NDVI") %>%
  fill_time(method = "locf") %>%
  window_time(kernel = c(-1,1), window=c(1,0)) %>%
  reduce_time("sum(NDVI)") %>%
  as_json() %>% 
  print()

## {
##     "cube_type": "reduce_time",
##     "in_cube": {
##         "cube_type": "window_time",
##         "in_cube": {
##             "cube_type": "fill_time",
##             "in_cube": {
##                 "band_names": [
##                     "NDVI"
##                 ],
##                 "cube_type": "apply_pixel",
##                 "expr": [
##                     "(b05-b04)/(b05+b04)"
##                 ],
##                 "in_cube": {
##                     "bands": [
##                         "B04",
##                         "B05"
##                     ],
##                     "cube_type": "select_bands",
##                     "in_cube": {
##                         "chunk_size": [
##                             1,
##                             256,
##                             256
##                         ],
##                         "cube_type": "image_collection",
##                         "file": "L8.db",
##                         "mask": {
##                             "bits": [
## 
##                             ],
##                             "invert": true,
##                             "mask_type": "value_mask",
##                             "values": [
##                                 900,
##                                 324,
##                                 836,
##                                 1346,
##                                 1348,
##                                 898,
##                                 388,
##                                 834,
##                                 386,
##                                 322
##                             ]
##                         },
##                         "mask_band": "PIXEL_QA",
##                         "view": {
##                             "aggregation": "median",
##                             "resampling": "average",
##                             "space": {
##                                 "bottom": -550000,
##                                 "left": -6180000,
##                                 "nx": 1000,
##                                 "ny": 1000,
##                                 "right": -6080000,
##                                 "srs": "EPSG:3857",
##                                 "top": -450000
##                             },
##                             "time": {
##                                 "dt": "P1Y",
##                                 "t0": "2014",
##                                 "t1": "2018"
##                             }
##                         }
##                     }
##                 },
##                 "keep_bands": false
##             },
##             "method": "locf"
##         },
##         "kernel": [
##             -1,
##             1
##         ],
##         "win_size_l": 1,
##         "win_size_r": 0
##     },
##     "reducer_bands": [
##         [
##             "sum",
##             "NDVI"
##         ]
##     ]
## }
##

Futher package options

Further global package options are documented in ?gdalcubes_options.

References

Bivand, R., Keitt, T., and Rowlingson, B. (2019), Rgdal: Bindings for the ’geospatial’ data abstraction library.

Hijmans, R. J. (2019), Raster: Geographic data analysis and modeling.

Hijmans, R. J. (2020), Terra: Spatial data analysis.

Pebesma, E. (2019), Stars: Spatiotemporal arrays, raster and vector data cubes.

Creating and Analyzing Multi-Variable Earth Observation Data Cubes in R

Marius Appel

Aug 18, 2020