# Bulk downloading and analysing MODIS data in R

Today we are going to work with bulk-downloaded data from MODIS. The MODIS satellites Terra and Aqua have been orbiting the earth since around the year 2000 and provide a reliable and free-to-obtain source of remote sensing data for ecologists and conservationists. Among the many MODIS products, indices of vegetation greenness such as NDVI or EVI have been used in countless ecological studies and are the first choice of most ecologists for relating field data to remote sensing data. Here we are going to demonstrate **3** different ways to acquire and calculate the mean EVI for a given region and time period, using the MOD13Q1 product. We focus on an area a little south of Mount Kilimanjaro and the temporal span around May 2014.

**(1)**

The first handy R package we are going to use is *MODISTools*. It is able to download spatio-temporal data via a simple subset command. The nice advantage of *MODISTools* is that it only downloads the point of interest, as the subsetting and processing is done server-side. This makes it excellent for downloading only what you need and reducing the amount of downloaded data. A disadvantage is that it queries a Perl script on the daac.ornl.gov server, which is often horribly slow and stretched to full capacity almost every second day.

```r
# Using the MODISTools package
library(MODISTools)

# MODISTools requires you to make a query data.frame
coord <- c(-3.223774, 37.424605) # Coordinates south of Mount Kilimanjaro
product <- "MOD13Q1"
bands <- c("250m_16_days_EVI", "250m_16_days_pixel_reliability") # What to query. You can get the names via GetBands
savedir <- "tmp/" # You can save the downloaded file in a specific folder
pixel <- c(0, 0) # Get the central pixel only (0,0) or a quadratic tile around it
period <- data.frame(lat = coord[1], long = coord[2], start.date = 2013, end.date = 2014, id = 1)

# To download the pixels
MODISSubsets(LoadDat = period, Products = product, Bands = bands,
             Size = pixel, SaveDir = savedir, StartDate = TRUE)
MODISSummaries(LoadDat = period, FileSep = ",", Product = "MOD13Q1",
               Bands = "250m_16_days_EVI", ValidRange = c(-2000, 10000),
               NoDataFill = -3000, ScaleFactor = 0.0001, StartDate = TRUE,
               Yield = TRUE, Interpolate = TRUE, QualityScreen = TRUE,
               QualityThreshold = 0, QualityBand = "250m_16_days_pixel_reliability")

# Finally read the output
read.table("MODIS_Summary_MOD13Q1_2014-08-10.csv", header = TRUE, sep = ",")
```

The mean EVI between the start of 2013 and today on the southern slope of Kilimanjaro was 0.403 (SD = 0.036). The yield (the integral of the interpolated data between start and end date) is 0.05. MODISTools is very suitable if you want to get hundreds of individual point locations, but less so if you want to extract values for a whole area (e.g. a polygon shape) or are just annoyed by the frequent server breakdowns…
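If the "integral of interpolated data" idea behind the yield value is unclear, here is a minimal language-agnostic sketch of it in Python/numpy (the EVI numbers and the 16-day spacing are invented for illustration; MODISTools' exact scaling of its Yield value may differ):

```python
import numpy as np

# Hypothetical EVI values from consecutive 16-day composites
evi = np.array([0.38, 0.41, 0.44, 0.40, 0.39])
t = np.arange(len(evi)) * 16  # days since the first composite

# Trapezoidal integration: area under the piecewise-linear (interpolated) EVI curve
yield_ = ((evi[1:] + evi[:-1]) / 2 * np.diff(t)).sum()
mean_evi = evi.mean()
```

The mean summarises the typical greenness, while the integral also reflects how long the greenness was sustained over the period.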

**(2)**

If you want to get whole MODIS tiles for your area of interest, you have another option available in R. The *MODIS* package is particularly suited for this job and even has options to incorporate ArcGIS for additional processing. It is also possible to use the MODIS Reprojection Tool or GDAL (our choice) as the underlying workhorse.

```r
library(MODIS)

dates <- as.POSIXct(as.Date(c("01/05/2014", "31/05/2014"), format = "%d/%m/%Y"))
dates2 <- transDate(dates[1], dates[2]) # Transform the input dates from before

# The MODIS package allows you to select tiles interactively.
# We however select them manually here.
h <- "21"
v <- "09"
runGdal(product = product, begin = dates2$beginDOY, end = dates2$endDOY, tileH = h, tileV = v)
# Per default the data will be stored in ~/MODIS_ARC/PROCESSED

# After download you can stack the processed TIFFs
vi <- preStack(path = "~/MODIS_ARC/PROCESSED/MOD13Q1.005_20140810192530/", pattern = "*_EVI.tif$")
s <- stack(vi)
s <- s * 0.0001 # Rescale the downloaded files with the scaling factor

# And extract the mean value for our point from before.
# First transform our coordinates from lat-long to the MODIS sinusoidal projection
sp <- SpatialPoints(coords = cbind(coord[2], coord[1]),
                    proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84"))
sp <- spTransform(sp, CRS(proj4string(s)))
extract(s, sp) # Extract the EVI values for the available two layers from the generated stack
#> 0.2432 | 0.3113
```

**(3)**

If all the packages and tools so far did not work as expected, there is also the alternative of using a combination of R and Python to analyse your MODIS files, or of downloading your tiles manually. The awesome pyModis scripts allow you to download directly from a USGS server, which at least for me was almost always faster than the LP DAAC connection the two other solutions use. However, so far both the MODISTools and the MODIS package have processed the original MODIS tiles (which ship in the HDF4 container format) for you. With this solution you have to access the HDF4 files yourself and extract the layers you are interested in.

Here is an example of how to download a whole HDF4 container using the modis_download.py script. Just query **modis_download.py -h** if you are unsure about the command-line parameters.

```r
# Custom command. You could build your own using a loop and paste
tile <- "h21v09"
f <- paste0("modis_download.py -r -p MOD13Q1.005 -t ", tile,
            " -e ", "2014-05-01", " -f ", "2014-05-30", " tmp/")
# Call the Python script
system(f)
# Then go into your defined download folder (tmp/ in our case)

# Now we are dealing with the original HDF4 files.
# The MODIS package in R has some processing options to get an HDF4 layer
# into a raster object, using GDAL as a backbone
library(MODIS)
lay <- "MOD13Q1.A2014145.h21v09.005.2014162035037.hdf" # One of the raw 16-day MODIS tiles we downloaded
mod <- getSds(lay, method = "gdal") # Get the layer names and GDAL links

# Load the layers from the HDF file and directly apply the scaling factor
ndvi <- raster(mod$SDS4gdal[1]) * 0.0001
evi <- raster(mod$SDS4gdal[2]) * 0.0001
reliability <- raster(mod$SDS4gdal[12])
s <- stack(ndvi, evi, reliability)

# Now extract the values at our coordinates
extract(s, sp)
```

There you go. Everything presented here was executed on a Linux Debian machine and I have no clue if it works for Windows or Mac users. Try it out. I hope everyone got some inspiration on how to process MODIS data in R 😉

# BIOFRAG – Biodiversity responses to Forest Fragmentation

Another interesting project closely related to PREDICTS is the BIOFRAG project, which tries to construct a global database of research papers dealing with forest fragmentation and its impacts on biodiversity. One of the final goals of the BIOFRAG project is the development of a new fragmentation index using a watershed delineation algorithm and fragment descriptors in order to characterize fragment traits. I am very interested in seeing the final outcome of this approach, and maybe I will even find the time to implement their algorithm in LecoS for QGIS as soon as it is released. Their database paper, lead-authored by Marion Pfeifer, was just released to the public as an open-access paper. You can read it in full here.

If you are considering contributing data, more information can be found on the BIOFRAG blog. All researchers involved in forest fragmentation research should consider contributing to them, and also to PREDICTS (see here) if you haven't already done so. And as usual: if you are studying in Africa, then please get in touch with me! I will contact you as soon as I return from my fieldwork in Kenya and Tanzania at the end of May.

# Statistical inferences using p-values

And another quick post for today. Here is a nice infographic I just found on the Nature News page — a nice demonstration of how p-values can fail us in making inferences about hypotheses. Just another article bashing p-values, you could say. Or: "Just switch already to Bayesian stats, or report real effect sizes instead of p-values". Although the matter is clear to many ecologists out there, the majority still happily uses p-values, inferring that they proved their working hypothesis wrong or true. At my former and also at my current university, p-values are still being taught and used in all courses related to data analysis. Students are expected to always (!) report the p-value and trained to look specifically for what they claim is statistical significance of an effect. And then people wonder why everyone still uses century-old techniques, often without even knowing what exactly they mean. I certainly believe (and I say that while still being educated 🙂 ) that especially in the education of future ecologists and conservationists, statistics courses should become mandatory for all (under)graduates. In times of big-data analysis, basic statistical knowledge has to be a must for everyone.

The related Nature News article can be found here. More info and facts about my research in Africa and my fieldwork trip will appear around May.

EDIT: And as a funny addition, check out this awesome R function which gives you an appropriate significance description for every p-value 😀

# Macroecology playground (3) – Spatial autocorrelation

Hey, it has been over 2 months, so welcome to 2014 from my side. I am sorry for not posting more updates recently, but like everyone I was (and still am) under constant working pressure. This year will be quite interesting for me personally, as I am about to start my thesis project and will (among other things) go to Africa for fieldwork. But for now I will try to catch your interest with a new macroecology playground post dealing with the important issue of spatial autocorrelation. See the other macroecology playground posts here and here to catch up on what happened in the past.

Spatial autocorrelation is the issue that data points in geographical space are somewhat dependent on each other, or that their values are correlated, because of spatial proximity. This is also known as the first law of geography (Google it). However, most of the statistical tools we have available assume that all our data points are independent of each other, which is rarely the case in macroecology. Just imagine the steep slopes of mountain regions: all high values will occur near the peaks and decrease with distance from them. There is thus already a data-inherent gradient present which we somehow have to account for if we want to investigate the effect of altitude alone (and not the effect of proximity to nearby cells).

In our hypothetical example we want to explore how well the topographical average (average height per grid cell) can explain amphibian richness in South America, and whether the residuals (model errors) of our model are spatially autocorrelated. I can't share the data, but I believe the dear reader will get the idea of what we are trying to do.

```r
# Load libraries
library(raster)

# Load in your dataset. In my case I am loading both the topographical average
# and the richness layer from a raster stack.
amp <- s$Amphibians.richness
topo <- s$Topographical.Average
summary(fit1 <- lm(getValues(amp) ~ getValues(topo)))
# Extract from the output:
#> Multiple R-squared: 0.1248, Adjusted R-squared: 0.1242
#> F-statistic: 217.7 on 1 and 1527 DF, p-value: < 2.2e-16

par(mfrow = c(2, 1))
plot(amp, col = rainbow(100, start = 0.2))
plot(s$Topographical.Average)
```

What did we do? We fitted a simple linear regression model using the values from both the amphibian richness raster layer and the topographical average raster. The relation seems to be highly significant, and this simple model can explain about 12.4% of the variation. Here is the basic plot output for both response and predictor variable.

As you can see, high values of both layers seem to be spatially clustered, so we are quite likely violating the independence of data points in a linear regression model. Let's quantify the spatial autocorrelation with Moran's I, a measure of spatial autocorrelation (technically it is comparable to a Pearson's r computed between each value and the surrounding values within a certain window). So let's investigate whether the residual values (the error in model fit) are spatially autocorrelated.

```r
library(ncf) # For the correlogram

# Generate a residual raster from the model before
rval <- getValues(amp) # Create new raster values
rval[as.numeric(names(fit1$residuals))] <- fit1$residuals # replace all data cells with residual values
resid <- topo # Create the new raster
values(resid) <- rval; rm(rval) # replace our values in this new raster
names(resid) <- "Residuals"

# Now calculate Moran's I of the new residual raster layer
x <- xFromCell(resid, 1:ncell(resid)) # take x coordinates
y <- yFromCell(resid, 1:ncell(resid)) # take y coordinates
z <- getValues(resid) # and the values of course

# Use the extracted coordinates and values, increase the distance in steps of 100,
# and don't forget to use latlon=TRUE (given that your rasters are in WGS84)
system.time(co <- correlog(x, y, z, increment = 100, resamp = 0, latlon = TRUE, na.rm = TRUE))
# This can take a while. It takes even longer if you try to estimate
# the significance of the spatial autocorrelation.

# Now show the result
plot(0, type = "n", ylab = "Moran's I", xlab = "lag distance", xlim = c(0, 6500), ylim = c(-1, 1))
abline(h = 0, lty = "dotted")
lines(co$correlation ~ co$mean.of.class, col = "red", lwd = 2)
points(x = co$x.intercept, y = 0, pch = 19, col = "red")
```

Ideally Moran's I should be as close to zero as possible. In the plot above you can see that at short lag distances (up to roughly 2000 distance units), and at some greater distances as well, the model residuals are positively autocorrelated (more similar than expected by chance alone, thus correlated with proximity). The function **correlog** allows you to resample the dataset to investigate the significance of this pattern, but for now I will just assume that our model's residuals are significantly spatially autocorrelated.
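In case you want to see what is happening under the hood of **correlog**, here is a minimal from-scratch sketch of the global Moran's I statistic. It is written in plain Python/numpy so it stands alone; the toy values and the line-adjacency weights are purely illustrative:

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for a 1D array of values and an n x n spatial weights matrix."""
    z = values - values.mean()               # deviations from the mean
    num = (weights * np.outer(z, z)).sum()   # weighted cross-products between neighbours
    den = (z ** 2).sum()                     # total variance term
    n = len(values)
    return n / weights.sum() * num / den

# Toy example: values on a line, neighbours = adjacent cells
vals = np.array([1., 2., 3., 4., 5., 6.])    # a smooth spatial gradient
n = len(vals)
W = np.zeros((n, n))
for i in range(n - 1):                       # adjacency on a line
    W[i, i + 1] = W[i + 1, i] = 1.0

print(morans_i(vals, W))                     # → 0.6, strongly positive: neighbours are similar
```

A smooth gradient gives a clearly positive I, random noise gives a value near the permutation expectation of -1/(n-1), and a checkerboard pattern gives a negative one — which is exactly how the correlogram values per distance class should be read.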

There are numerous techniques to deal with or investigate spatial autocorrelation. The interested reader is advised to look at Dormann et al. (2007) for inspiration. In our example we will fit a simultaneous autoregressive model (SAR) and see if we can partially remove the spatial autocorrelation from the residual error. SARs can model the spatial error-generating process and operate with weight matrices that specify the strength of interaction between neighbouring sites (Dormann et al., 2007). If you know that the spatial autocorrelation occurs in the response variable only, a so-called "lagged-response model" would be most appropriate; use a "mixed" SAR if it occurs in both response and predictors. However, Kissling and Carl (2008) investigated SAR models in detail and came to the conclusion that lagged and mixed SARs might not always give better results than ordinary least squares regressions and can generate bias. Instead they recommend fitting "spatial error" SAR models when dealing with species distribution data, which assume that the spatial autocorrelation occurs neither in the response nor in the predictors, but in the error term.
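To make the "spatial error" structure concrete: the model is y = Xβ + u with u = λWu + ε, i.e. the error at each site is a mix of independent noise and its neighbours' errors. Here is a small standalone simulation of that error process in Python/numpy (all numbers are invented; this only illustrates the assumed data-generating process, not the fitting done by errorsarlm):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5        # five sites along a line
lam = 0.6    # strength of the spatial dependence in the error term

# Row-standardised weights for adjacent sites on a line (the analogue of nb2listw style "W")
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)

eps = rng.normal(size=n)                        # independent noise
u = np.linalg.solve(np.eye(n) - lam * W, eps)   # u = (I - lam W)^(-1) eps: autocorrelated error
y = 2.0 + 0.5 * np.arange(n) + u                # response = deterministic trend + correlated error
```

Fitting an ordinary regression to y would leave the structure of u in the residuals; the spatial error SAR instead estimates λ alongside the regression coefficients.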

So let's build the spatial weights and fit a SAR:

```r
library(spdep)

x <- xFromCell(amp, 1:ncell(amp))
y <- yFromCell(amp, 1:ncell(amp))
z <- getValues(amp)
val <- data.frame(topo = getValues(topo)) # predictor values as a data.frame

# Get a neighbour list of interactions within a distance of 2000 units.
# You should calculate the interaction weights with the maximal distance at which
# autocorrelation occurs; here we just take the first x-intercept, where the
# positive correlation turns negative.
nei <- dnearneigh(cbind(x, y), d1 = 0, d2 = 2000, longlat = TRUE)
nlw <- nb2listw(nei, style = "W", zero.policy = TRUE)

# Now fit the spatial error SAR. We use the generated z values and weights as input.
# NA values are excluded and zeros are given to boundary errors.
sar_e <- errorsarlm(z ~ topo, data = val, listw = nlw, na.action = na.omit, zero.policy = TRUE)

# Now compare how much variation can be explained
summary(fit1)$adj.r.squared # The r-squared of the ordinary regression
#> 0.124
summary(sar_e, Nagelkerke = TRUE)$NK # Nagelkerke's pseudo r-squared of the SAR
#> 0.504
# So we could increase the explained influence of the topographical average on amphibian richness

# Finally do a likelihood ratio test
LR.sarlm(sar_e, fit1)
#> Likelihood ratio for spatial linear models
#> Likelihood ratio = 869.7864, df = 1, p-value < 2.2e-16
#> sample estimates:
#> Log likelihood of sar_e   Log likelihood of fit1
#>              -7090.903                -7525.796
# Not only are the two models significantly different, the log likelihood of our SAR
# is also greater than that of the ordinary model, indicating a better fit.
```

The SAR is one of many methods to deal with spatial autocorrelation. I agree that the choice of the weights-matrix distance is a bit arbitrary (it made sense to me), so you might want to investigate the occurrence of spatial correlation a bit more before fitting a SAR. So, have we dealt with the autocorrelation? Let's just calculate Moran's I again for both the old residuals and the SAR residuals. Looks better, doesn't it?

References:

- Dormann, C. F., McPherson, J. M., Araújo, M. B., Bivand, R., Bolliger, J., Carl, G., … & Wilson, R. (2007). Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. *Ecography*, *30*(5), 609-628.
- Kissling, W. D., & Carl, G. (2008). Spatial autocorrelation and the selection of simultaneous autoregressive models. *Global Ecology and Biogeography*, *17*(1), 59-71.

# ZonalStatistics, Neighbourhood analysis and more for LecoS

Since QGIS 2.0 stable was released just a while ago, I thought it would be time to enhance my plugin LecoS a bit more. Furthermore, I was missing some functions; for instance, I found no appropriate function to compute zonal statistics for a set of rasters of mine. SAGA has a function to calculate some statistics using a categorical raster and a zone raster layer, but it lacks a raster output and some specific statistics. So I added a new ZonalStatistics function to LecoS, and I am sure it will be of some use to landscape ecologists and other GIS users out there. See a use case below!

Furthermore, I regularly use a lot of short Python scripts to generate and query raster layers using a **gdal**+**numpy** backbone. These custom functions of mine are a lot faster than any other plugin (all hail **numpy**), which is why I also implemented some functions that are already available in QGIS through other plugins.
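To give a flavour of what such a numpy backbone looks like, here is a minimal sketch of zonal statistics on two aligned arrays (in LecoS the arrays would come from GDAL raster bands; the function name and the toy data here are my own, purely for illustration):

```python
import numpy as np

def zonal_mean(values, zones):
    """Mean of `values` within each integer zone id of `zones` (same shape)."""
    v = values.ravel().astype(float)
    z = zones.ravel()
    sums = np.bincount(z, weights=v)   # per-zone sum in one vectorised pass
    counts = np.bincount(z)            # per-zone cell count
    return sums / counts               # per-zone mean, indexed by zone id

values = np.array([[1, 2, 3],
                   [4, 5, 6]])
zones = np.array([[0, 0, 1],
                  [1, 1, 0]])          # two zones, 0 and 1

print(zonal_mean(values, zones))       # zone 0: (1+2+6)/3 = 3.0, zone 1: (3+4+5)/3 = 4.0
```

The `bincount` trick is a big part of why vectorised numpy code beats cell-by-cell loops for this kind of raster work.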

Here is the total changelog from the last LecoS version 1.8.2 to the new 1.9 (note that QGIS 1.8 won’t be supported anymore):

**Version 1.9 – Major Update:**

- Added new tools to the Processing toolbox for use in complex models:
- Function to count raster cells -> Output as table
- Function to query raster values below a point layer
- Function to intersect two landscape (raster) layers -> Output clipped raster
- Function to create a new random raster based on values from a statistical distribution -> Output raster
- Function to conduct a neighbourhood analysis (analogous to r.neighbors or Focal Statistics in ArcGIS)
- Function to conduct a connected-component labeling of connected patches
- Function to conduct ZonalStatistics based on two landscapes (ZonalStatistics with raster layers in ArcGIS)
- Improved the overall documentation for the Processing Toolbox and created new simple icons
- Fixed bug: http://hub.qgis.org/issues/8810

I didn't create any new graphical interfaces, as I believe that Sextante, a.k.a. Processing, is the future. All new functions were therefore only added to the Processing toolbox and not as a separate GUI. This also has the cool advantage that you can use all LecoS tools within more complex multi-algorithm models. The most visible difference to older LecoS versions is that I created a new icon for every function (to make them distinguishable) and wrote documentation for each.

Click more to see a short tutorial demonstrating the functions using real data.

# Macroecology playground (2) – About the Mid domain effect null model

The use of null models in ecology has a long history (Connor & Simberloff, 1979) and has been at the epicenter of many scientific disputes, some of which continue until today (or here). I will spare the readers of this blog any further discussion, as I haven't entirely made up my own mind yet. Statistically speaking, many null models make perfect sense to me if ecological data is seen as just "data". The biological perspective of many null models, however, can be debated, as many of them make assumptions (random distribution of species in spatial community ecology, for instance) which seem to be hardly true *in natura*. I agree that ecologists have to make careful considerations while designing their statistical analyses. I am going to follow the debate about null models more in the future, but for now let me introduce you to a simple null model in macroecology.

One of the most used null models in macroecology is the so-called mid-domain effect (MDE) null model. If the effects of all possible environmental predictors on species distributions were to vanish, we would expect the species richness peaks to shift toward the center of the geometric constraints (Colwell & Lees, 2000; Colwell et al., 2004). This so-called mid-domain peak is built on the stochastic phenomenon that if you shuffle species ranges inside a geometric constraint, you will always find the greatest overlaps occurring in the very center.

For an easy visualization, just imagine an aluminum pencil box full of different-sized pencils — one of those you had back in primary school. The pencils inside are of varying size: some might be nearly as long as the whole box, others are worn down to stubs. Close the box and shuffle it. If you now open the box again, you will find the most pencils (or parts of pencils) in the middle of the box.
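The pencil-box picture can be simulated directly in one dimension: keep each "species" range size fixed, place the range at a random position inside the bounded domain, and count overlaps per position. A minimal standalone sketch in Python/numpy (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
domain = 100                                   # number of positions in the "box"
range_sizes = rng.integers(5, 80, size=500)    # 500 "species" with fixed range sizes

richness = np.zeros(domain)
for size in range_sizes:
    start = rng.integers(0, domain - size + 1) # random placement inside the constraint
    richness[start:start + size] += 1          # every position in the range gains a species

# Overlap (richness) should be highest toward the middle of the domain
print(richness[domain // 2] > richness[0], richness[domain // 2] > richness[-1])
```

The edges can only be covered by ranges that happen to start right at them, while the middle is covered by almost every possible placement of a long range — hence the mid-domain peak.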

One way to generate an MDE null model from given species ranges is to use a so-called spreading dye algorithm, which emulates the growth of occupied cells inside the given geometric constraints from a random starting point (emulating multiple drops of dye in a pond). Click the GIF image below to watch a growing MDE (**CAREFUL – BIG GIF PICTURE > 4mb**). As input, the number of occupied grid cells per bird species in South America was used. Each range size was kept constant, but the starting point varies.

As you can observe, the relative bird species richness peaks in the middle of the continent after some time. This pattern becomes more prominent if the algorithm runs for all 2869 bird species occurring in South America. The final image and the range quartiles look like this:

Here you can observe that the overall mid-domain peak can only be observed for the fourth quartile. For the other three the relative distribution is quite random, which might explain why the MDE null model often explains quite a lot of the variance for widespread species (Dunn et al., 2007). The MDE null model has been criticized and defended multiple times, but is still widely used in macroecology. Critics usually bring up possible influences of phylogeny (Davies et al., 2005) or geometric constraints (Connolly, 2005; McClain et al., 2007). A particular issue with the spreading dye algorithm is that the simulated species ranges are like spreading ink drops, which are all very similar in shape; in reality species ranges often have quite complex and varied configurations. Furthermore, the model stops at the borders of the geometric constraints (the coastline of South America). Any random drop of ink near the coastline will therefore always grow into the heart of the continent, which makes the shape of the geometric constraint the most important predictor of a possible richness peak. If, for instance, the model were repeated for a more irregular shape (like Middle America), the peaks would develop where the greatest land mass is (so around Texas and Bolivia). The sheer probability of an ink drop starting in Panama or Ecuador is too low, due to the low chance of hitting such a small shape. This is a property of the algorithm and might result in non-significant null models for the Middle American region.
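For readers who want to experiment with the behaviour discussed above, here is a rough standalone sketch of one spreading-dye step: grow a connected range of a given size from a random start cell, staying inside a domain mask. It is written in plain Python (toy square "continent", invented parameters) and uses a simple breadth-first growth instead of the randomized growth a full implementation would use:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)

def spreading_dye(domain, range_size):
    """Grow `range_size` connected cells inside the boolean `domain` mask."""
    cells = np.argwhere(domain)
    start = tuple(cells[rng.integers(len(cells))])  # random start inside the constraint
    occupied = {start}
    frontier = deque([start])
    while len(occupied) < range_size and frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < domain.shape[0] and 0 <= nc < domain.shape[1]
                    and domain[nr, nc] and (nr, nc) not in occupied
                    and len(occupied) < range_size):
                occupied.add((nr, nc))
                frontier.append((nr, nc))
    return occupied

domain = np.ones((20, 20), dtype=bool)  # a square toy "continent"
richness = np.zeros(domain.shape)
for _ in range(200):                    # 200 simulated species, fixed range size 50
    for cell in spreading_dye(domain, 50):
        richness[cell] += 1
```

Because growth is clipped at the domain border, ranges started near an edge are pushed inward — which is exactly the property that makes the constraint's shape so decisive for where the simulated richness peaks end up.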

**References**

- Colwell, R. K., & Lees, D. C. (2000). The mid-domain effect: geometric constraints on the geography of species richness. *Trends in Ecology & Evolution*, *15*, 70-76.
- Colwell, R. K., Rahbek, C., & Gotelli, N. J. (2004). The mid-domain effect and species richness patterns: what have we learned so far? *The American Naturalist*, *163*(3), E1-E23.
- Connor, E. F., & Simberloff, D. (1979). The assembly of species communities: chance or competition? *Ecology*, 1132-1140.
- Connolly, S. R. (2005). Process-based models of species distributions and the mid-domain effect. *The American Naturalist*, *166*(1), 1-11.
- Davies, T. J., Grenyer, R., & Gittleman, J. L. (2005). Phylogeny can make the mid-domain effect an inappropriate null model. *Biology Letters*, *1*(2), 143-146.
- Dunn, R. R., McCain, C. M., & Sanders, N. J. (2007). When does diversity fit null model predictions? Scale and range size mediate the mid-domain effect. *Global Ecology and Biogeography*, *16*(3), 305-312.
- McClain, C. R., White, E. P., & Hurlbert, A. H. (2007). Challenges in the application of geometric constraint models. *Global Ecology and Biogeography*, *16*(3), 257-264.

# Macroecology playground (1) – Bird species richness in a nutshell

Ahh, macroecology — the study of ecological patterns and processes at large scales. Questions like "what factors determine the distribution and diversity of all life on earth?" have occupied scientists since the times of A. v. Humboldt and Wallace. At the University of Copenhagen a whole research center has been dedicated to this field, and macroecological studies are more and more present in prestigious journals like Nature and Science. Previous studies at the center have found skewed distributions of bird richness with a specific bias towards the mountains (Jetz & Rahbek, 2002; Rahbek et al., 2007). In this blog post I am going to play around a bit with some data from Rahbek et al. (2007). The analysis and the graphs are by no means sufficient (they even violate several model assumptions like homoscedasticity, normality and data independence) and are therefore of a more exploratory nature 😉 The post will show you how to build a raster stack of geographical data and how to use the data in some very basic models.

It was recommended to me to use the freely available SAM software for the analysis, but although the program is really nice and fast, it isn't suitable for me, as you cannot modify specific model parameters or graphical outputs. And as a self-declared R junkie I refuse to work with "click-compute-result" tools 😉

So here is what the head of the SAM data file ("data.sam") looks like (I won't share it, so please generate your own data).

As you can see, the *.sam* file is technically just a tab-separated table with the coordinates of a grid cell (a 1° grid cell in a latitude-longitude projection) and all response and predictor values for this cell. To get this data into R we are going to use the raster package to generate a so-called raster stack for our analysis. This is how I did it:

```r
# Load libraries
library(raster)

# Create data from SAM
data <- read.delim(file = "data.sam", header = TRUE, sep = "\t", dec = ".") # read in a data.frame
coordinates(data) <- ~Longitude+Latitude # Convert to a SpatialPointsDataFrame
cs <- "+proj=longlat +datum=WGS84 +no_defs" # define the correct projection (long-lat)
gridded(data) <- TRUE # Make a SpatialPixelsDataFrame
proj4string(data) <- CRS(cs) # set the defined CRS

# Create a raster layer stack
s <- stack()
for(n in names(data)){
  d <- data.frame(coordinates(data), data[, n])
  ras <- rasterFromXYZ(xyz = d, digits = 10, crs = CRS(cs))
  s <- addLayer(s, ras)
  rm(d, n, ras)
}

# Now you can query and plot the raster layers from the stack
plot(s$Birds.richness, col = rainbow(100, start = 0.1))
```

You want to do some modeling or extract data? Here you go. First we make a subset of some of our predictors from the raster stack, then fit an ordinary least squares multiple regression model to our data to see how much variance can be explained. Note that a linear regression is not the proper technique for this kind of analysis (degrees of freedom too high due to spatial autocorrelation, violation of the assumptions mentioned before), but it is still useful for exploratory purposes.

```r
# Extract some predictors from the raster stack
predictors <- subset(s, c(7, 8, 10))
names(predictors)
#> "NDVI" "Topographical.Range" "Annual.Mean.Temperature"

# Now extract the data from both the bird richness layer and the predictors
birds <- getValues(s$Birds.richness)
val <- as.data.frame(getValues(predictors))

# Do the multiple regression
fit <- lm(birds ~ ., data = val)
summary(fit)
#>                          Estimate   Std. Error t value Pr(>|t|)
#> (Intercept)             215.675282  15.837493   13.62   <2e-16 ***
#> NDVI                    -34.541242   1.245769  -27.73   <2e-16 ***
#> Topographical.Range       0.056458   0.002452   23.03   <2e-16 ***
#> Annual.Mean.Temperature   0.940664   0.054747   17.18   <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Residual standard error: 81.86 on 1525 degrees of freedom
#>   (1461 observations deleted due to missingness)
#> Multiple R-squared: 0.6931, Adjusted R-squared: 0.6925
#> F-statistic: 1148 on 3 and 1525 DF, p-value: < 2.2e-16
```

Ignore the p-values and just focus on the adjusted r² value. As you can see, we are able to explain nearly 70% of the variance with this simple model. So what do our residuals and the predicted values look like? For that we have to create analogous raster layers containing both the predicted and the residual values. Then we plot all raster layers again using the **spplot** function from the package sp (automatically loaded with raster).

```r
# Prediction raster
rval <- getValues(s$Birds.richness) # Create new values
rval[as.numeric(names(fit$fitted.values))] <- predict(fit) # replace all data cells with predicted values
pred <- predictors$NDVI # make a copy of an existing raster
values(pred) <- rval; rm(rval) # replace all values in this raster copy
names(pred) <- "Prediction"

# Residual raster
rval <- getValues(s$Birds.richness)
rval[as.numeric(names(fit$residuals))] <- fit$residuals # replace all data cells with residual values
resid <- predictors$NDVI
values(resid) <- rval; rm(rval)
names(resid) <- "Residuals"

# Do the plot with spplot
ss <- stack(s$Birds.richness, pred, resid)
sp <- as(ss, 'SpatialGridDataFrame')
trellis.par.set(sp.theme())
spplot(sp)
```

Looking at the residual plot, you might notice that our simple model fails to explain all the variation at mountain altitudes (the Andes). Still, the predicted values look very much like the observed richness. Bird species richness is highest at tropical mountain ranges, which is consistent with results from Africa (Jetz & Rahbek, 2002). The reasons for this pattern are not fully understood yet, but if I had to discuss it with a colleague I would probably bring up arguments like longer evolutionary time, higher habitat heterogeneity and greater numbers of climatic niches at mountain ranges. At this point you would then test for spatial autocorrelation using Moran's I, adjust your data accordingly and use more sophisticated methods like generalized additive models (GAMs) or simultaneous autoregressive models (SARs) that account for the spatial autocorrelation. See Rahbek et al. (2007) for the actual study.

**References:**

- Jetz, W., & Rahbek, C. (2002). Geographic range size and determinants of avian species richness. *Science*, *297*(5586), 1548-1551.
- Rahbek, C., Gotelli, N. J., Colwell, R. K., Entsminger, G. L., Rangel, T. F. L., & Graves, G. R. (2007). Predicting continental-scale patterns of bird species richness with spatially explicit models. *Proceedings of the Royal Society B: Biological Sciences*, *274*(1607), 165-174.