Interesting Paper: Global warming and invertebrate colouring
Another very interesting paper has just been published in Nature Communications, written by former colleagues of mine from the University of Marburg.
Global warming favours light-coloured insects in Europe
As we all know, many insect species such as butterflies, bees or dragonflies are mainly active during the day because they are ectotherms and depend on external heat for thermoregulation. Body colour is an important aspect of this thermoregulation, as darker (more blackish) individuals usually heat up faster. Darker insects therefore have an advantage over lighter insects in cooler climates, as they warm up more rapidly and can forage earlier. This pattern can be mapped on a larger scale using occurrence data and is known in macroecology as the “thermal melanism hypothesis”. The authors go a step further: they not only show a biogeographic pattern previously unknown to science (a colour gradient of European dragonflies and butterflies from south to north), but also demonstrate how this mechanistic link between a macroecological pattern and a functional trait can be used to forecast the effect of climate change on insects.
Possible criticism: An obvious next step in the analysis would be to include real measurements of optical colour rather than RGB values of scanned pictures. The colour values used in this study were all derived from scientific taxonomic drawings of those insects and are thus biased by the subjectivity of the respective artist. Nevertheless, this bias should be consistent (if the same artist sketched the images), so it should not influence the colour gradient. It is also interesting to note that many insects (I know this for instance from my work with bugs and hoverflies) can adapt their body colouring to their habitat, or differ quite a lot within a population. Differences in melanism of body and wing colouration might be related to the climatic niche the insects occur in, but the insects themselves might also possess the phenotypic plasticity to adapt, for instance, to different habitats and backgrounds (Hochkirch et al. 2008). This pattern certainly needs more investigation in the future.
The article has been published as an open access paper, so give it a try 😉

Hochkirch, A., Deppermann, J. and Gröning, J. (2008), Phenotypic plasticity in insects: the effects of substrate color on the coloration of two groundhopper species. Evolution & Development, 10: 350–359.
 Zeuss, D. et al. (2014) Global warming favours light-coloured insects in Europe. Nat. Commun. 5:3874
Macroecology for QGIS, the new QSDM plugin
This is just a quick post to inform all QGIS-interested readers of this blog that I am about to release a new QGIS plugin. Its name is QSDM (QGIS Species Distribution Modelling) and, as with LecoS, it is particularly suited for practicing ecologists. This time I had no plan or interest in coding a graphical interface, so the whole plugin can only be executed from within the Processing Toolbox (QGIS version > 2.0). In my opinion this will be the future of most advanced QGIS plugins anyway.
So what is the idea? Basically, QSDM is a plugin that brings statistical models for species distribution modelling to QGIS. For now only the famous Maxent is enabled and working, but the ambitious plan is to enable other modelling techniques such as RandomForests and LogisticRegression as well, provided the user has the necessary libraries installed.
You might ask what the advantage of running Maxent from within QGIS is. First, you can immediately see the output, which is nice for visual exploration. Second, the QSDM plugin helps you with the formatting of your layers and occurrence files. For instance, all input raster layers are automatically unified to a common resolution and exported as ESRI .asc files. You simply need to load your layers and let the tool do the rest. For those of you who want more control (and I really insist that you should want it), I also added functions to generate a custom parameter file for Maxent and an option to start the Maxent GUI in a new process.
--> I recognize that the ease of use of this tool might tempt more people to run tools without really understanding what they do and how they work. Please be sure you know what you are doing and always (!!!) validate the outputs of the tools you use (this includes QSDM). For understanding the Maxent parameters I highly recommend reading the attached literature list and this publication!
Other things I implemented in the initial release of QSDM
 Create Species Richness grid
 Creates a new raster containing the species richness or endemism of an input occurrence layer
 Calculate Niche Overlap Statistics
 Can calculate Schoener’s D or Warren’s I (the latter is based on the Hellinger distance) for all input layers.
 Range Shift
 Shows the difference between two input prediction layers. For instance for current and likely future conditions.
 Data Transformations
 Makes quick transformations of input raster layers
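To make the niche overlap statistics listed above a bit more concrete, here is a minimal sketch in base R. It is not the QSDM implementation: the function name `niche_overlap` and the toy suitability vectors are made up for illustration, and it assumes the two prediction layers have already been read into numeric vectors of equal length (one value per grid cell).

```r
# Hypothetical helper (not part of QSDM): niche overlap between two
# prediction layers given as numeric vectors of suitability values.
niche_overlap <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)          # use only cells present in both layers
  p1 <- a[ok] / sum(a[ok])             # normalise so that each layer sums to 1
  p2 <- b[ok] / sum(b[ok])
  D  <- 1 - 0.5 * sum(abs(p1 - p2))    # Schoener's D
  H2 <- sum((sqrt(p1) - sqrt(p2))^2)   # squared Hellinger distance
  I  <- 1 - 0.5 * H2                   # Warren's I
  c(D = D, I = I)
}

# Toy suitability values for current and future conditions (invented numbers)
current <- c(0.1, 0.4, 0.8, NA, 0.3)
future  <- c(0.3, 0.5, 0.4, 0.2, 0.1)

niche_overlap(current, current)  # identical layers: complete overlap
niche_overlap(current, future)   # partial overlap

# A simple range shift layer is just the cell-wise difference
shift <- future - current
```

Both statistics range from 0 (no overlap) to 1 (identical predictions), so they can be compared directly across species pairs.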
More is planned, but this depends entirely on my inclination to do so, the time I have available and if it can be useful for my own research as well.
Please remember that the plugin is still experimental, so please don’t be angry if it doesn’t work for you. Testing was conducted on QGIS 2.2 stable on my Debian Linux machine and it should hopefully work on Windows as well. But, as with LecoS, I have no opportunity to test the plugin on Mac OS based systems and I also don’t really intend to :p. Sorry Apple.
Literature:
 Steven J. Phillips, Robert P. Anderson and Robert E. Schapire (2006) “Maximum entropy modeling of species geographic distributions” Ecological Modelling, Vol 190/3-4, pp 231-259
 Steven J. Phillips and Miroslav Dudik (2008) “Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation.” Ecography, Vol 31, pp 161-175
 Jane Elith et al. (2011) “A statistical explanation of MaxEnt for ecologists” Diversity and Distributions, 17, 43–57 DOI: 10.1111/j.1472-4642.2010.00725.x
Macroecology playground (3) – Spatial autocorrelation
Hey, it has been over 2 months, so welcome to 2014 from my side. I am sorry for not posting more updates recently, but like everyone I was (and still am) under constant work pressure. This year will be quite interesting for me personally, as I am about to start my thesis project and will (besides other things) go to Africa for fieldwork. But for now I will try to catch your interest with a new macroecology playground post dealing with the important issue of spatial autocorrelation. See the other Macroecology playground posts here and here to find out what happened in the past.
Spatial autocorrelation means that data points in geographical space are to some degree dependent on each other, i.e. their values are correlated because of spatial proximity. This is also known as the first law of geography (Google it). However, most of the statistical tools we have available assume that all our data points are independent of each other, which is rarely the case in macroecology. Just imagine altitude in a mountain region: the highest values occur near the peak and decrease with distance from it. There is thus already a data-inherent gradient present, which we somehow have to account for if we want to investigate the effect of altitude alone (and not the effect of the proximity to nearby cells).
In our hypothetical example we want to explore how well the topographical average (average height per grid cell) can explain amphibian richness in South America and whether the residuals (model errors) of our model are spatially autocorrelated. I can’t share the data, but I believe the dear reader will get the idea of what we are trying to do.
# Load libraries
library(raster)
# Load in your dataset. In my case I am loading both the topography and the richness layer from a raster stack.
amp <- s$Amphibians.richness
topo <- s$Topographical.Average
summary(fit1 <- lm(getValues(amp) ~ getValues(topo)))
# Extract from the output
# Multiple R-squared: 0.1248, Adjusted R-squared: 0.1242
# F-statistic: 217.7 on 1 and 1527 DF, p-value: < 2.2e-16
par(mfrow = c(2, 1))
plot(amp, col = rainbow(100, start = 0.2))
plot(s$Topographical.Average)
What did we do? As you can see, we fitted a simple linear regression model using the values from both the amphibian richness raster layer and the topographical average raster. The relation seems to be highly significant, and this simple model explains about 12.4% of the variation. Here is the basic plot output for both response and predictor variable.
As you can see, high values of both layers seem to be spatially clustered, so it is quite likely that we violate the independence of data points assumed by a linear regression model. Let’s investigate this by looking at Moran’s I, which is a measure of spatial autocorrelation (technically it is just a correlation coefficient, similar to Pearson’s r, calculated between values and their surrounding values within a certain distance window). So let’s find out if the residual values (the error in model fit) are spatially autocorrelated.
library(ncf) # For the correlogram
# Generate a residual raster from the model before
rval <- getValues(amp) # Create new raster values
rval[as.numeric(names(fit1$residuals))] <- fit1$residuals # replace all data cells with the residual value
resid <- topo
values(resid) <- rval; rm(rval) # replace our values in this new raster
names(resid) <- "Residuals"
# Now calculate Moran's I of the new residual raster layer
x = xFromCell(resid, 1:ncell(resid)) # take x coordinates
y = yFromCell(resid, 1:ncell(resid)) # take y coordinates
z = getValues(resid) # and the values of course
# Use the extracted coordinates and values, increase the distance in steps of 100
# and don't forget to use latlon=T (given that your rasters are in WGS84)
system.time(co <- correlog(x, y, z, increment = 100, resamp = 0, latlon = T, na.rm = T)) # this can take a while.
# It takes even longer if you try to estimate the significance of the spatial autocorrelation
# Now show the result
plot(0, type = "n", col = "black", ylab = "Moran's I", xlab = "lag distance", xlim = c(0, 6500), ylim = c(-1, 1))
abline(h = 0, lty = "dotted")
lines(co$correlation ~ co$mean.of.class, col = "red", lwd = 2)
points(x = co$x.intercept, y = 0, pch = 19, col = "red")
Ideally Moran’s I should be as close to zero as possible. In the above plot you can see that at short distances (up to about 2000 distance units), and again at greater distances, the model residuals are positively autocorrelated (more similar than expected by chance alone, and thus correlated with proximity). The function correlog allows you to resample the dataset to assess the significance of this pattern, but for now I will just assume that our model residuals are significantly spatially autocorrelated.
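For readers who want to see what Moran’s I actually computes, here is a minimal sketch in base R. It does not reproduce the distance classes of correlog; it assumes a toy transect of six cells and a hypothetical binary weights matrix in which only directly adjacent cells count as neighbours. The names `morans_i`, `smooth` and `alternating` are made up for this illustration.

```r
# Global Moran's I for values z and a spatial weights matrix w
# (hypothetical helper for illustration, not the ncf implementation).
morans_i <- function(z, w) {
  n   <- length(z)
  dev <- z - mean(z)                    # deviations from the mean
  (n / sum(w)) * sum(w * outer(dev, dev)) / sum(dev^2)
}

# Toy transect of 6 cells: neighbours are the directly adjacent cells
n <- 6
w <- matrix(0, n, n)
w[abs(row(w) - col(w)) == 1] <- 1

smooth      <- c(1, 2, 3, 4, 5, 6)      # similar values sit next to each other
alternating <- c(1, 6, 1, 6, 1, 6)      # neighbours are always dissimilar

morans_i(smooth, w)       # positive: neighbours resemble each other
morans_i(alternating, w)  # negative: neighbours differ systematically
```

A correlogram repeats this kind of calculation for successive distance classes, which is why the curve can cross zero at the lag distance where neighbouring residuals stop being similar.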
There are numerous techniques to deal with or investigate spatial autocorrelation. The interested reader is advised to look at Dormann et al. (2007) for inspiration. In our example we will fit a simultaneous spatial autoregressive model (SAR) and see if we can get at least part of the spatial autocorrelation out of the residual error. SARs can model the spatial error generating process and operate with weight matrices that specify the strength of interaction between neighbouring sites (Dormann et al., 2007). If you know that the spatial autocorrelation occurs in the response variable only, a so-called “lagged-response model” would be most appropriate; otherwise use a “mixed” SAR if it occurs in both response and predictors. However, Kissling and Carl (2008) investigated SAR models in detail and came to the conclusion that lagged and mixed SARs might not always give better results than ordinary least squares regressions and can generate bias. Instead they recommend calculating “spatial error” SAR models when dealing with species distribution data. These assume that the spatial autocorrelation occurs neither in the response nor in the predictors, but in the error term.
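As a rough illustration of what such a weights matrix looks like, the following base R sketch builds a row-standardised matrix from a distance threshold. This is a simplification and not the spdep implementation: the coordinates, the threshold `max_dist` and all variable names are invented, and plain Euclidean distances are used instead of great-circle distances.

```r
# Four hypothetical sites along a line; the last one is isolated
coords <- cbind(x = c(0, 1, 2, 5), y = c(0, 0, 0, 0))
d <- as.matrix(dist(coords))            # pairwise Euclidean distances

max_dist <- 1.5                         # assumed neighbourhood distance
nb <- (d > 0 & d <= max_dist) * 1       # binary neighbour matrix, no self-links

# Row-standardise: every row with neighbours sums to 1.
# Sites without neighbours keep a row of zeros.
rs <- rowSums(nb)
w  <- nb / ifelse(rs == 0, 1, rs)

# The spatially lagged value of a variable is the weighted mean of its neighbours
z     <- c(10, 20, 30, 40)
lag_z <- as.vector(w %*% z)
```

With row-standardised weights, each site is compared with the average of its neighbours, so sites with many neighbours do not dominate the fit simply because they have more links.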
So let’s build the spatial weights and fit a SAR:
library(spdep) # in recent versions errorsarlm() and LR.sarlm() live in the spatialreg package
x = xFromCell(amp, 1:ncell(amp))
y = yFromCell(amp, 1:ncell(amp))
z = getValues(amp)
val <- data.frame(z = z, topo = getValues(topo)) # response and predictor values per cell
nei <- dnearneigh(cbind(x, y), d1 = 0, d2 = 2000, longlat = T) # Get the neighbour list of interactions within a distance of 2000 units.
nlw <- nb2listw(nei, style = "W", zero.policy = T)
# You should calculate the interaction weights with the maximal distance in which autocorrelation occurs.
# But here we will just take the first x-intercept where positive correlation turns into negative.
# Now fit the spatial error SAR
sar_e <- errorsarlm(z ~ topo, data = val, listw = nlw, na.action = na.omit, zero.policy = T)
# We use the generated values and weights as input. No-data values are excluded and cells without neighbours are handled via zero.policy
# Now compare how much variation can be explained
summary(fit1)$adj.r.squared # The r-squared of the normal regression
# > 0.124
summary(sar_e, Nagelkerke = T)$NK # Nagelkerke's pseudo r-squared of the SAR
# > 0.504
# Much higher for the SAR. Note that this pseudo r-squared also includes the variation captured by the spatial error term
# Finally do a likelihood ratio test
LR.sarlm(sar_e, fit1)
# Likelihood ratio for spatial linear models
# > data:
# > Likelihood ratio = 869.7864, df = 1, p-value < 2.2e-16
# > sample estimates:
# > Log likelihood of sar_e Log likelihood of fit1
# >               -7090.903              -7525.796
# Not only are our two models significantly different, but the log likelihood of our SAR is also greater than that of the ordinary model,
# indicating a better fit.
The SAR is one of many methods to deal with spatial autocorrelation. I agree that the choice of the weights matrix distance is a bit arbitrary (it made sense to me), so you might want to investigate the occurrence of spatial correlations a bit more prior to fitting a SAR. So have we dealt with the autocorrelation? Let’s just calculate Moran’s I values again for both the old residuals and the SAR residuals. Looks better, doesn’t it?
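The likelihood ratio reported in the model comparison can be checked by hand, since it is simply twice the difference of the two log likelihoods. The sketch below assumes that both reported log likelihoods are negative, with the SAR being the larger (less negative) one, which is what the reported ratio and the statement about the better fit imply. The variable names are made up.

```r
# Log likelihoods as reported for the two models (signs assumed negative)
ll_sar <- -7090.903   # spatial error SAR
ll_ols <- -7525.796   # ordinary linear regression

# Likelihood ratio statistic: twice the difference in log likelihood
lr <- 2 * (ll_sar - ll_ols)

# Compare against a chi-squared distribution with 1 degree of freedom
# (the SAR has one additional parameter, the spatial error coefficient)
p <- pchisq(lr, df = 1, lower.tail = FALSE)

lr
p
```

The statistic agrees with the reported value of 869.7864 up to rounding of the log likelihoods.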
References:
 Dormann, C. F., McPherson, J. M., Araújo, M. B., Bivand, R., Bolliger, J., Carl, G., … & Wilson, R. (2007). Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography, 30(5), 609-628.
 Kissling, W. D., & Carl, G. (2008). Spatial autocorrelation and the selection of simultaneous autoregressive models. Global Ecology and Biogeography, 17(1), 59-71.
Macroecology playground (2) – About the Mid domain effect null model
The use of null models in ecology has a long history (Connor & Simberloff, 1979) and has been at the epicenter of many scientific disputes. Some of them continue until today (or here). I will spare the readers of this blog any further discussions or arguments, as I haven’t entirely made up my own mind yet. Statistically speaking, many null models make perfect sense to me if ecological data is just seen as “data”. The biological perspective of many null models, however, is open to discussion, as many of them make assumptions (random distribution of species in spatial community ecology, for instance) which hardly seem to hold in nature. I agree that ecologists have to make careful considerations while designing their statistical analysis. I am going to follow the debate about null models more closely in the future, but for now let me introduce you to a simple null model in macroecology.
One of the most used null models in macroecology is the so-called mid-domain effect (MDE) null model. If the effect of all possible environmental predictors on species distributions is removed, we would expect species richness to peak towards the center of the geometric constraints (Colwell & Lees, 2000; Colwell et al., 2004). This so-called mid-domain peak is built on the stochastic phenomenon that if you shuffle species ranges inside a geometric constraint, the greatest overlap tends to occur in the very center.
For an easy visualization: Just imagine an aluminum box full of different sized pencils. One of those you had back in primary school. The pencils inside are of varying size, some might be nearly as long as the whole box, others are nearly depleted. Close the box and shuffle it. If you now open the box again, you will find the most pencils (or parts of a pencil) in the middle of the box.
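The pencil box can be simulated in a few lines of base R. This is only a one-dimensional sketch under simple assumptions: the box has 50 cells, pencil lengths are drawn uniformly between 1 and 50 cells, and each pencil is placed at a random position where it still fits. All names and numbers are made up for illustration.

```r
set.seed(42)
n_cells   <- 50    # length of the box in cells
n_species <- 500   # number of pencils (species ranges)

richness <- integer(n_cells)
for (i in seq_len(n_species)) {
  len   <- sample.int(n_cells, 1)             # range size in cells
  start <- sample.int(n_cells - len + 1, 1)   # random placement that fits the box
  cells <- start:(start + len - 1)
  richness[cells] <- richness[cells] + 1L     # count one more pencil in these cells
}

# Overlap per cell: low at both ends, highest around the middle
plot(richness, type = "h", xlab = "position in box", ylab = "number of pencils")
```

No environmental gradient is involved at any point; the hump in the middle arises purely from the hard boundaries of the box.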
One way to generate an MDE null model from given species ranges is to use a so-called spreading dye algorithm, which emulates the growth of cells inside the given geometric constraints from a random starting point (like multiple drops of dye in a pond). Click the GIF image below to watch a growing MDE (careful: big GIF picture, > 4 MB). As input, the number of occupied grid cells per bird species in South America was used. The range size was kept constant, but the starting point varies.
As you can observe, the relative bird species richness peaks in the middle of the continent after some time. This pattern becomes more prominent if the algorithm runs for all 2869 bird species occurring in South America. The final image and the range quartiles look like this:
Here you can see that the overall mid-domain peak only appears for the fourth quartile. For the other three the relative distribution is quite random, which might explain why the MDE null model often explains quite a lot of the variance for widespread species (Dunn et al., 2007). The MDE null model has been criticized and defended multiple times, but is still widely used in macroecology. Critics usually bring up possible influences of phylogeny (Davies et al., 2005) or of the geometric constraints (Connolly, 2005; McClain et al., 2007). A particular issue with the spreading dye algorithm is that the simulated species ranges are like spreading ink drops, which are all very similar in shape. In reality species ranges often have quite complex and different configurations and shapes. Furthermore, the model stops at the borders of the geometric constraints (the coastline of South America). Any random drop of ink near the coastline will therefore always grow towards the heart of the continent, which makes the shape of the geometric constraint the most important predictor of a possible richness peak. If, for instance, the model were repeated for a more irregular shape (like Middle America), the peaks would develop where the greatest land mass is (so around Texas and Bolivia). The probability of an ink drop starting in Panama or Ecuador is low, simply because of the small chance of hitting such a small shape. This is a property of the algorithm and might result in non-significant null models for the Middle American regions.
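To illustrate the idea, here is a minimal sketch of a spreading dye algorithm in base R. It is not the code used for the figures above: the function `spreading_dye`, the rectangular 20 x 20 domain and the random range sizes are assumptions made for this example, and growth uses the four directly adjacent cells only.

```r
# Hypothetical spreading dye helper: grows one contiguous range of
# `range_size` cells inside `domain` (logical matrix, TRUE = land).
spreading_dye <- function(domain, range_size) {
  stopifnot(range_size <= sum(domain))
  nr <- nrow(domain); nc <- ncol(domain)
  occupied <- matrix(FALSE, nr, nc)
  land  <- which(domain)
  start <- land[sample.int(length(land), 1)]       # random starting drop
  occupied[start] <- TRUE
  while (sum(occupied) < range_size) {
    # mark all cells that touch the current range (up, down, left, right)
    grown <- occupied
    grown[-1, ]  <- grown[-1, ]  | occupied[-nr, ]
    grown[-nr, ] <- grown[-nr, ] | occupied[-1, ]
    grown[, -1]  <- grown[, -1]  | occupied[, -nc]
    grown[, -nc] <- grown[, -nc] | occupied[, -1]
    frontier <- which(grown & !occupied & domain)  # free land cells next to the range
    if (length(frontier) == 0) break               # no room left to grow
    occupied[frontier[sample.int(length(frontier), 1)]] <- TRUE
  }
  occupied
}

set.seed(1)
domain <- matrix(TRUE, 20, 20)                     # toy rectangular "continent"
sizes  <- sample(1:200, 100, replace = TRUE)       # invented range sizes

richness <- matrix(0L, 20, 20)
for (size in sizes) richness <- richness + spreading_dye(domain, size)

image(richness, main = "Simulated richness (spreading dye)")
```

Because growth stops at the domain border, drops that start near the edge can only expand inwards, which is exactly the property of the algorithm discussed above. Replacing `domain` with an irregular mask shows how strongly the shape of the constraint drives the position of the peak.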
References
 Colwell, R. K., & Lees, D. C. (2000). The mid-domain effect: Geometric constraints on the geography of species richness. Trends in Ecology & Evolution, 15, 70-76.
 Colwell, R. K., Rahbek, C., & Gotelli, N. J. (2004). The Mid‐Domain Effect and Species Richness Patterns: What Have We Learned So Far?. The American Naturalist, 163(3), E1-E23.
 Connor, E. F., & Simberloff, D. (1979). The assembly of species communities: chance or competition?. Ecology, 1132-1140.
 Connolly, S. R. (2005). Process‐Based Models of Species Distributions and the Mid‐Domain Effect. The American Naturalist, 166(1), 1-11.
 Davies, T. J., Grenyer, R., & Gittleman, J. L. (2005). Phylogeny can make the mid-domain effect an inappropriate null model. Biology Letters, 1(2), 143-146.
 Dunn, R. R., McCain, C. M., & Sanders, N. J. (2007). When does diversity fit null model predictions? Scale and range size mediate the mid‐domain effect. Global Ecology and Biogeography, 16(3), 305-312.
 McClain, C. R., White, E. P., & Hurlbert, A. H. (2007). Challenges in the application of geometric constraint models. Global Ecology and Biogeography, 16(3), 257-264.
Macroecology playground (1) – Bird species richness in a nutshell
Ahh, macroecology. The study of ecological patterns and processes on big scales. Questions like “what factors determine the distribution and diversity of all life on earth?” have troubled scientists since the times of A. v. Humboldt and Wallace. At the University of Copenhagen a whole research center has been dedicated to this specific field, and macroecological studies are more and more present in prestigious journals like Nature and Science. Previous studies at the center have found skewed distributions of bird richness with a specific bias towards the mountains (Jetz & Rahbek, 2002; Rahbek et al., 2007). In this blog post I am going to play around a bit with some data from Rahbek et al. (2007). The analysis and the graphs are by no means sufficient (and even violate many model assumptions like homoscedasticity, normality and data independence) and are therefore more of an exploratory nature 😉 The post will show you how to build a raster stack of geographical data and how to use the data in some very basic models.
It was recommended to me to use the freely available SAM software for the analysis. Although the program is really nice and fast, it isn’t suitable for me, as you cannot modify specific model parameters or graphical outputs. And as a self-declared R junkie I refuse to work with “click-compute-result” tools 😉
So here is what the head of the SAM data file (“data.sam”) looks like (I won’t share it, so please generate your own data).
As you can see, the .sam file is technically just a tab-separated table with the coordinates of a grid cell (1° grid cells on a latitude-longitude projection) and all response and predictor values for this cell. To get this data into R we are going to use the raster package to generate a so-called raster stack for our analysis. This is how I did it:
# Load libraries
library(raster)
# Create data from SAM
data <- read.delim(file = "data.sam", header = T, sep = "\t", dec = ".") # read in a data.frame
coordinates(data) <- ~Longitude+Latitude # Convert to a SpatialPointsDataFrame
cs <- "+proj=longlat +datum=WGS84 +no_defs" # define the correct projection (longlat)
gridded(data) <- T # Make a SpatialPixelsDataFrame
proj4string(data) <- CRS(cs) # set the defined CRS
# Create raster layer stack
s <- stack()
for(n in names(data)){
  d <- data.frame(coordinates(data), data[,n])
  ras <- rasterFromXYZ(xyz = d, digits = 10, crs = CRS(cs))
  s <- addLayer(s, ras)
  rm(d, n, ras)
}
# Now you can query and plot the raster layers from the stack
plot(s$Birds.richness, col = rainbow(100, start = 0.1))
Do you want to do some modelling or extract data? Here you go. First we make a subset of some of our predictors from the raster stack and then fit an ordinary least squares multiple regression model to our data to see how much variance can be explained. Note that linear regressions are not the proper technique for this kind of analysis (degrees of freedom too high due to spatial autocorrelation, violation of the assumptions mentioned before), but it’s still useful for explanatory purposes.
# Extract some predictors from the raster stack
predictors <- subset(s, c(7, 8, 10))
names(predictors)
# > "NDVI" "Topographical.Range" "Annual.Mean.Temperature"
# Now extract the data from both the bird richness layer and the predictors
birds <- getValues(s$Birds.richness)
val <- as.data.frame(getValues(predictors))
# Do the multiple regression
fit <- lm(birds ~ ., data = val)
summary(fit)
# >                           Estimate Std. Error t value Pr(>|t|)
# > (Intercept)             215.675282  15.837493   13.62   <2e-16 ***
# > NDVI                     34.541242   1.245769   27.73   <2e-16 ***
# > Topographical.Range       0.056458   0.002452   23.03   <2e-16 ***
# > Annual.Mean.Temperature   0.940664   0.054747   17.18   <2e-16 ***
# > ---
# > Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# > Residual standard error: 81.86 on 1525 degrees of freedom
# > (1461 observations deleted due to missingness)
# > Multiple R-squared: 0.6931, Adjusted R-squared: 0.6925
# > F-statistic: 1148 on 3 and 1525 DF, p-value: < 2.2e-16
Ignore the p-values and just focus on the adjusted r² value. As you can see, we are able to explain nearly 70% of the variance with this simple model. So what do our residuals and the predicted values look like? To find out, we have to create analogous raster layers containing the predicted and the residual values. Then we plot all raster layers together using the spplot function from the package sp (automatically loaded with “raster”).
# Estimate prediction
rval <- getValues(s$Birds.richness) # Create new values
rval[as.numeric(names(fit$fitted.values))] <- predict(fit) # replace all data cells with predicted values
pred <- predictors$NDVI # make a copy of an existing raster
values(pred) <- rval; rm(rval) # replace all values in this raster copy
names(pred) <- "Prediction"
# Residual raster
rval <- getValues(s$Birds.richness) # Create new values
rval[as.numeric(names(fit$residuals))] <- fit$residuals # replace all data cells with residual values
resid <- predictors$NDVI
values(resid) <- rval; rm(rval)
names(resid) <- "Residuals"
# Do the plot with spplot
ss <- stack(s$Birds.richness, pred, resid)
sp <- as(ss, 'SpatialGridDataFrame')
trellis.par.set(sp.theme())
spplot(sp)
Looking at the residual plot, you might notice that our simple model fails to explain all the variation at mountain altitudes (the Andes). Still, the predicted values look very much like the observed richness. Bird species richness is highest in tropical mountain ranges, which is consistent with results from Africa (Jetz & Rahbek, 2002). The reasons for this pattern are not fully understood yet, but if I had to discuss this with a colleague I would probably bring up arguments like longer evolutionary time, higher habitat heterogeneity and a greater number of climatic niches in mountain ranges. At this point you would then test for spatial autocorrelation using Moran’s I and use more sophisticated methods like Generalized Additive Models (GAMs) or Spatial Autoregressive Models (SARs) that account for the spatial autocorrelation. See Rahbek et al. (2007) for the actual study.
References:
 Jetz, W., & Rahbek, C. (2002). Geographic range size and determinants of avian species richness. Science, 297(5586), 1548-1551.
 Rahbek, C., Gotelli, N. J., Colwell, R. K., Entsminger, G. L., Rangel, T. F. L., & Graves, G. R. (2007). Predicting continental-scale patterns of bird species richness with spatially explicit models. Proceedings of the Royal Society B: Biological Sciences, 274(1607), 165-174.
More possibilities of GBIF records retrieval in R
Scott Chamberlain posted some interesting new examples of what you can do with the “rgbif” package in the rOpenSci suite. Head over to his page on GitHub for some excellent demonstrations of what is possible with just a few lines of R code and the vast amounts of GBIF data.
GBIF is certainly becoming one of the best and easiest to use data sources in many fields of ecology. Although the data for many countries is still underreported, other countries have made quite some progress towards free and easy access to biodiversity information (for instance, the majority of the volunteer-collected vegetation data provided by the German FloraWeb server is already available in GBIF).