In contrast to the previous parts, the actual analysis can be carried out entirely from R, on any operating system. To connect to the database we use the scidb and scidbst packages. The former is the ordinary R client from the SciDB developers, whereas the latter works on top of it: it overrides methods to maintain spatial and/or temporal references, provides a few functions that mimic the raster package, and supports downloading small images such as results from change monitoring.
To install both packages directly from GitHub, we can use the devtools package.
library(devtools)
install_github("Paradigm4/SciDBR")
install_github("flahn/scidbst")
Assuming you have a running SciDB installation as explained in the first part of this tutorial, you can connect to it with the scidbconnect() function. In the following, we assume the credentials needed to connect to the database have been stored as environment variables.
library(scidb)
library(scidbst)
scidbconnect(host = Sys.getenv("scidb.host"),
username = Sys.getenv("scidb.username"),
password = Sys.getenv("scidb.password"),
port = Sys.getenv("scidb.port"),
auth_type = "digest",
protocol = "https")
When connected, the scidbst.ls() function lists available arrays with spatial and/or temporal reference. Function arguments like extent can be used to request additional metadata such as the spatial and/or temporal extent.
scidbst.ls(extent = T)
One of the core design principles of the scidb package is to work with proxy objects. If you access an array from R, it will not download the data but a create a reference (proxy) to the array in the database. This is also true when you execute functions on scidb / scidbst objects. The packages will derive the shape of result arrays but not run any computations on the data.
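The proxy idea can be sketched in plain R. This is a toy illustration of the design principle only, not the actual scidb internals: operations merely build up an expression, and nothing is computed until the proxy is explicitly evaluated.

```r
# Toy proxy: operations only accumulate a query expression; in the real
# scidb package, evaluation is what finally runs the query in SciDB.
make_proxy <- function(expr) {
  structure(list(expr = expr), class = "proxy")
}
filter_proxy <- function(p, predicate) {
  make_proxy(sprintf("filter(%s, %s)", p$expr, predicate))
}
eval_proxy <- function(p) p$expr  # stand-in for "now run it in the database"

p <- make_proxy("L7_SW_ETHOPIA")
p <- filter_proxy(p, "ndvi >= -1")
eval_proxy(p)  # "filter(L7_SW_ETHOPIA, ndvi >= -1)"
```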
We are interested in the array ‘L7_SW_ETHOPIA’. The scidb() and scidbst() functions take the array name as argument and create an R proxy. The latter explicitly works for spatial and temporal arrays and maintains metadata such as the spatial extent.
l7 = scidbst("L7_SW_ETHOPIA")
l7
## Title: L7_SW_ETHOPIA
## Spatial Extent:
## xmin: 33.0733155251169
## xmax: 39.0830225403798
## ymin: 3.39572950381923
## ymax: 11.0748904038534
## CRS:
## +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## Temporal Extent:
## From: 2003-07-21
## To: 2014-12-27
## TRS:
## dimension: "t"
## t0: 2003-07-21
## dt: P1D
## SciDB expression L7_SW_ETHOPIA
## SciDB schema <band1:int16> [y=-14358:35189,2048,0,x=-12295:35417,2048,0,t=0:*,1,0]
## variable dimension type nullable start end chunk
## 1 y TRUE int64 FALSE -14358 35189 2048
## 2 x TRUE int64 FALSE -12295 35417 2048
## 3 t TRUE int64 FALSE 0 * 1
## 4 band1 FALSE int16 FALSE
This array is quite large. In this tutorial, we want to analyze a small subregion only.
We can select subregions of the complete array with the crop() function, which works similarly to raster::crop().
l7.subset = crop(l7, extent(37.42, 37.48, 9.78, 9.81))
Notice that this again only creates a proxy object and does not run any computations. By using scidbst, we are able to work with real coordinates instead of pixel indexes.
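Under the hood, such real-world coordinates are mapped to integer pixel indices by an affine georeference. A minimal sketch of that mapping, using the array's upper-left corner from the extent above and a made-up pixel size (scidbst derives the actual values from the array's metadata):

```r
# Hypothetical affine georeference: origin is the array's upper-left
# corner; the resolution dx/dy is assumed for illustration only.
xmin <- 33.0733; ymax <- 11.0749   # upper-left corner (degrees)
dx <- 0.00027; dy <- -0.00027      # assumed pixel size (dy negative: rows go down)
to_pixel <- function(x, y) {
  c(col = floor((x - xmin) / dx), row = floor((y - ymax) / dy))
}
to_pixel(37.42, 9.81)  # pixel indices of the crop's upper-left corner
```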
The next step is to repartition the data such that one chunk contains the complete time series of a very small region. This is needed because we want to run bfastmonitor() in parallel over all time series, and the SciDB plugins r_exec and stream run scripts independently (in parallel) on different chunks. In this example, one chunk will contain the complete time series (4177 days) of 32x32 pixels.
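As a quick sanity check, the number of cells a single chunk holds after this repartitioning is:

```r
# One chunk = full time series (4177 days) for a 32 x 32 pixel block.
cells_per_chunk <- 32 * 32 * 4177
cells_per_chunk  # 4277248
```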
l7.subset.repart = repart(l7.subset, chunk=c(32,32,4177))
So far, the data is stored as integers (scaled by 10000) rather than true NDVI values, and we have not yet removed any invalid values. We do both in a single line using the functions transform, project, and filter. transform computes new attributes based on existing attributes and an arithmetic expression. The expression here is a string because it will be executed in SciDB later and must conform to SciDB syntax, which may differ slightly from R. The project function simply selects a subset of the attributes (here, we are no longer interested in the original scaled integers). Finally, the filter function takes a predicate that is evaluated for all array cells; if it is false, the cell is removed from the output array. Here, we drop any NDVI values outside \([-1, 1]\).
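What the transform/filter chain computes can be replayed in plain R on a toy vector of scaled integers (the values here are made up, including an invalid fill value of -32768 that the filter removes):

```r
# Replay of the SciDB expressions in plain R, on toy data.
band1 <- c(4500L, 9111L, -32768L, -2507L)
ndvi  <- as.double(band1) / 10000        # transform: scale to true NDVI
ndvi  <- ndvi[ndvi >= -1 & ndvi <= 1]    # filter: drop values outside [-1, 1]
ndvi  # 0.4500 0.9111 -0.2507
```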
l7.subset.repart.ndvi = filter(project(transform(l7.subset.repart, ndvi="double(band1) / 10000"), "ndvi"), "ndvi >= -1 and ndvi <=1")
The last preprocessing step is to add (integer) dimension values as attributes to all cells (again, with the transform function). This is needed because we will later receive all observations of one chunk as a data frame and need to group them by x and y in order to process all \(32^2\) time series of a chunk separately. Afterwards, we store the result as a new array in SciDB with the scidbsteval() function. This function finally runs the computations of all previous operations; the result is a new array with the provided name.
l7.subset.bfast.in = transform(l7.subset.repart.ndvi, dimx = "double(x)", dimy = "double(y)", dimt="double(t)")
array.name1 = paste("wur_", paste(sample(letters,replace = T, size = 12), collapse=""), sep="")
array.name1 # generated random name to store the array
## [1] "wur_gzjosinefyqa"
scidbsteval(l7.subset.bfast.in, name = array.name1)
## Title: wur_gzjosinefyqa
## Spatial Extent:
## xmin: 37.4198355789559
## xmax: 37.4805974879332
## ymin: 9.77981707035939
## ymax: 9.81033426231212
## CRS:
## +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## Temporal Extent:
## From: 2003-10-02
## To: 2014-12-20
## TRS:
## dimension: "t"
## t0: 2003-07-21
## dt: P1D
## SciDB expression wur_gzjosinefyqa
## SciDB schema <ndvi:double,dimx:double,dimy:double,dimt:double> [y=-14358:35189,32,0,x=-12295:35417,32,0,t=0:*,4177,0]
## variable dimension type nullable start end chunk
## 1 y TRUE int64 FALSE -14358 35189 32
## 2 x TRUE int64 FALSE -12295 35417 32
## 3 t TRUE int64 FALSE 0 * 4177
## 4 ndvi FALSE double FALSE
## 5 dimx FALSE double FALSE
## 6 dimy FALSE double FALSE
## 7 dimt FALSE double FALSE
Let’s get a first overview of the data and compute some summary statistics (minimum, maximum, and mean of the NDVI, and the total number of observations).
l7.subset.bfast.in = scidbst(array.name1)
l7.subset.repart.ndvi.stats = aggregate(l7.subset.bfast.in, FUN="max(ndvi), min(ndvi), avg(ndvi), count(*)")
l7.subset.repart.ndvi.stats@proxy[]
## i ndvi_max ndvi_min ndvi_avg count
## 1 0 0.9111 -0.2507 0.47918 2735262
For scidbst objects, the slot @proxy returns the underlying scidb proxy object, and the operator [] can be used to download data from a SciDB array as an R data.frame.
Unfortunately, we need to build an AFL query to run bfastmonitor with r_exec. This query nests the operators r_exec() (which runs our R script on all chunks in parallel), unpack() (which flattens the result into a one-dimensional array), and store() (which saves it as a new array). We store the complete query as an R string and use the R function iquery() to run it. The most interesting part is the R script. It automatically receives vectors dimx, dimy, dimt, and ndvi with the chunk data. We merge these vectors as columns in a data.frame and use plyr::ddply() to group rows by dimx and dimy and to apply a function to each group; in other words, we apply a function to all time series independently. This function calls bfastts() to construct a ts object and runs bfastmonitor(). The result is a list of vectors whose elements will be mapped to array attributes. For each pixel, we return its x and y coordinates, the length of the time series, and the detected change date and magnitude. Results from all chunks will then be merged automatically, i.e. we get a single result array.
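The grouping that plyr::ddply() performs inside the script can be sketched with base R on toy data. The per-series function here just counts observations instead of calling bfastmonitor(), so the example runs without the bfast package:

```r
# Base-R equivalent of ddply(ndvi.df, c("dimy","dimx"), f): split the
# chunk's observations by pixel and apply a function to each time series.
ndvi.df <- data.frame(
  dimx = c(1, 1, 1, 2, 2),
  dimy = c(5, 5, 5, 5, 5),
  dimt = c(0, 10, 20, 0, 10),
  ndvi = c(0.40, 0.50, 0.45, 0.60, 0.62)
)
per_series <- function(g) {
  # stand-in for the bfastmonitor() call: report the series length only
  data.frame(dimx = g$dimx[1], dimy = g$dimy[1], nt = nrow(g))
}
groups <- split(ndvi.df, list(ndvi.df$dimx, ndvi.df$dimy), drop = TRUE)
result <- do.call(rbind, lapply(groups, per_series))
result  # one row per pixel
```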
array.name2 = paste("wur_bfastmonitor_", paste(sample(letters,replace = T, size = 12), collapse=""), sep="")
array.name2
## [1] "wur_bfastmonitor_uqocvocwzkwf"
afl.query.R = paste("store(unpack(r_exec(", array.name1, ",'output_attrs=5','expr=
require(xts)
require(bfast)
require(plyr)
set_fast_options()
ndvi.df = data.frame(ndvi=ndvi,dimy=dimy,dimx=dimx,dimt=dimt)
f <- function(x) {
return(
tryCatch({
ndvi.ts = bfastts(x$ndvi,as.Date(\"2003-07-21\") + x$dimt,\"irregular\")
bfast.result = bfastmonitor(ndvi.ts, start = c(2010, 1), order=1,history=\"ROC\")
return(c(nt=length(x$dimt), breakpoint = bfast.result$breakpoint, magnitude = bfast.result$magnitude ))
}, error=function(e) {
return (c(nt=0,breakpoint=NA,magnitude=NA))
}))}
ndvi.change = ddply(ndvi.df, c(\"dimy\",\"dimx\"), f)
list(dimy = as.double(ndvi.change$dimy), dimx = as.double(ndvi.change$dimx), nt = as.double(ndvi.change$nt), brk = as.double(ndvi.change$breakpoint), magn = as.double(ndvi.change$magnitude) )'),i),
", array.name2 ,")", sep="")
iquery(afl.query.R) # this runs the query above and takes a few minutes
The result of the operation is a one-dimensional array with 5 attributes: the pixel's y and x coordinates, the number of observations, the detected break date, and the change magnitude. This one-dimensional array no longer has a spatial / temporal reference. We will thus work with functions from scidb instead of scidbst; most functions in both packages work the same way. We will assign the spatial reference to the result after it has been postprocessed and stored as a new array.
l7.subset.bfast.out = scidb(array.name2)
l7.subset.bfast.out
## SciDB expression wur_bfastmonitor_uqocvocwzkwf
## SciDB schema <inst:int64,n:int64,expr_value_0:double,expr_value_1:double,expr_value_2:double,expr_value_3:double,expr_value_4:double> [i=0:*,1000000,0]
## variable dimension type nullable start end chunk
## 1 i TRUE int64 FALSE 0 * 1000000
## 2 inst FALSE int64 FALSE
## 3 n FALSE int64 FALSE
## 4 expr_value_0 FALSE double FALSE
## 5 expr_value_1 FALSE double FALSE
## 6 expr_value_2 FALSE double FALSE
## 7 expr_value_3 FALSE double FALSE
## 8 expr_value_4 FALSE double FALSE
Looking at the schema of the result array, it has rather uninformative attribute names such as expr_value_0. Our first step is hence to rename the attributes using a combination of transform and project. At the same time, we cast x and y to int64, which is needed because these will eventually become new dimensions of the result.
l7.subset.bfast.out = scidb::project(transform(l7.subset.bfast.out, y="int64(expr_value_0)", x="int64(expr_value_1)", n = "int16(expr_value_2)", breakdate = "expr_value_3", magnitude="expr_value_4"), c("y","x","n","breakdate","magnitude"))
head(l7.subset.bfast.out)
## i_1 i y x n breakdate magnitude
## 1 0 0 4689 7641 108 2012.510 -0.02435436
## 2 1 1 4689 7642 106 NA -0.02158497
## 3 2 2 4689 7643 108 2012.118 -0.04512922
## 4 3 3 4689 7644 109 2011.899 -0.05365862
## 5 4 4 4689 7645 107 2012.510 -0.06296517
## 6 5 5 4689 7646 108 2012.510 -0.04654914
To represent the result as a two-dimensional array (image), we need to convert its attributes x and y to dimensions. For this, the redimension() function can be used with the new dimensions as a character vector argument.
l7.subset.bfast.out.redim = redimension(l7.subset.bfast.out,dim = c("y","x"))
This array now has two unbounded dimensions. For simplicity, we trim empty areas around the actual data and provide specific bounds for the dimensions. We first need to derive the original array extent (in integer pixel coordinates) and then use these values in subarray(), which selects a rectangular subregion of the image and lets the new array dimensions start at 0.
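The effect of subarray() can be sketched on a plain R matrix: select a region by inclusive (min, max) index bounds, so the trimmed extent along each dimension is max - min + 1 (R matrices are 1-based, hence the + 1 below; the toy bounds are made up):

```r
# Toy illustration of subarray(): rectangular selection with inclusive
# bounds, re-based so indices conceptually start at 0.
m <- matrix(1:100, nrow = 10)        # toy 10 x 10 "array", 0-based dims 0..9
ymin <- 3; ymax <- 6; xmin <- 2; xmax <- 5
trimmed <- m[(ymin:ymax) + 1, (xmin:xmax) + 1]
dim(trimmed)  # 4 x 4, i.e. (ymax - ymin + 1) x (xmax - xmin + 1)
```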
array_extent = aggregate(l7.subset.bfast.in, FUN="min(dimx),max(dimx),min(dimy),max(dimy),min(dimt),max(dimt)")@proxy[]
array_extent
## i dimx_min dimx_max dimy_min dimy_max dimt_min dimt_max
## 1 0 7597 7819 4689 4800 73 4169
l7.subset.bfast.out.redim.trimmed = subarray(l7.subset.bfast.out.redim,limits = c(4689,7597, 4800, 7819))
Notice that the operator [] can be used to download data from scidb proxy objects as R data.frames. We are now ready to store the result as a new array:
array.name3 = paste("wur_bfastmonitor_map_", paste(sample(letters,replace = T, size = 12), collapse=""), sep="")
array.name3
## [1] "wur_bfastmonitor_map_kpicttaestqz"
scidbeval(l7.subset.bfast.out.redim.trimmed, name=array.name3)
## SciDB expression wur_bfastmonitor_map_kpicttaestqz
## SciDB schema <n:int16,breakdate:double,magnitude:double> [y=0:111,1000,0,x=0:222,1000,0]
## variable dimension type nullable start end chunk
## 1 y TRUE int64 FALSE 0 111 1000
## 2 x TRUE int64 FALSE 0 222 1000
## 3 n FALSE int16 FALSE
## 4 breakdate FALSE double FALSE
## 5 magnitude FALSE double FALSE
To set the spatial reference of the result image, we run the same trimming on a scidbst proxy for the bfastmonitor input array. This will not run any computations but automatically derive correct location information. We then use this information to set the spatial reference of the result image.
srs.reference.array = subarray(l7.subset.bfast.in,limits = c(7597,4689,0, 7819, 4800, 0))
setSRS(scidb(array.name3), srs(srs.reference.array), affine(srs.reference.array))
For two-dimensional arrays, the scidbst package comes with a function to download data as a RasterBrick via as(). We need to create a scidbst object first; then we plot the time series length of individual pixels, the change date, and the change magnitude. The length of the time series clearly shows a pattern due to the failure of Landsat 7's scan line corrector.
result = scidbst(array.name3)
result
## Title: wur_bfastmonitor_map_kpicttaestqz
## Spatial Extent:
## xmin: 36.6274784878975
## xmax: 36.6882403968748
## ymin: 8.98745997930097
## ymax: 9.0179771712537
## CRS:
## +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## SciDB expression wur_bfastmonitor_map_kpicttaestqz
## SciDB schema <n:int16,breakdate:double,magnitude:double> [y=0:111,1000,0,x=0:222,1000,0]
## variable dimension type nullable start end chunk
## 1 y TRUE int64 FALSE 0 111 1000
## 2 x TRUE int64 FALSE 0 222 1000
## 3 n FALSE int16 FALSE
## 4 breakdate FALSE double FALSE
## 5 magnitude FALSE double FALSE
x = as(result, "RasterBrick")
## Downloading data...
## Done.
## Processing image structure...
plot(x$n, main="Sample size")
plot(x$breakdate, main="Break date")
plot(x$magnitude, main="Change magnitude")