#install.packages('gplots')
#install.packages('iterpc')
#install.packages('expm')
library(dartRverse)
library(dartR.popgen)
12 Genetic Structure
Session Presenters
Required packages
As always we need to have dartRverse installed and loaded. In addition you need to have dartR.popgen installend.
make sure you have the packages installed, see Install dartRverse
Genetic Distance and Visualization
Load data and explore
# Read in the example data
<- readRDS("./data/Example_12.1_SNP.Rdata")
gl # Interrogate to see what the data are
nLoc(gl)
[1] 10529
nInd(gl)
[1] 113
table(pop(gl))
signata farnorth krefftii nigra gunabarra emmottii macquarii
11 10 10 10 10 10 10
subglobosa worrelli tanybaraga victoriae
10 10 12 10
Calculate Distance
# Calculate an Euclidean Distance Matrix on individuals
<- gl.dist.ind(gl) D
Starting gl.dist.ind
Processing genlight object with SNP data
Calculating raw Euclidean Distances between individuals
Returning a stat::dist object
Completed: gl.dist.ind
# Represent visually as a heat map
gl.plot.heatmap(D)
Starting gl.plot.heatmap
Processing a distance matrix
Starting gl.colors
Selected color type div
Completed: gl.colors
Completed: gl.plot.heatmap
PCA
# Undertake a PCA on the raw data
<- gl.pcoa(gl) pc
Starting gl.pcoa
Processing genlight object with SNP data
Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state
Starting gl.colors
Selected color type 2
Completed: gl.colors
Completed: gl.pcoa
# Plot the first two dimensions of the PCA
gl.pcoa.plot(pc,gl,pop.labels="legend")
Starting gl.pcoa.plot
Processing an ordination file (glPca)
Processing genlight object with SNP data
Plotting populations identified by a legend
Preparing plot .... please wait
Completed: gl.pcoa.plot
# Plot the first and third dimensions of the PCA
gl.pcoa.plot(pc,gl,xaxis=1,yaxis=3, pop.labels="legend")
Starting gl.pcoa.plot
Processing an ordination file (glPca)
Processing genlight object with SNP data
Plotting populations identified by a legend
Preparing plot .... please wait
Completed: gl.pcoa.plot
Select one cluster
# Select only the data from one cluster in the primary PCA
<- gl.drop.pop(gl, pop.list=c("victoriae","subglobosa",
temp "worrelli","tanybaraga"))
Starting gl.drop.pop
Processing genlight object with SNP data
Checking for presence of nominated populations, deleting them
Warning: Resultant dataset may contain monomorphic loci
Locus metrics not recalculated
Completed: gl.drop.pop
# Plot the first two dimensions of the secondary PCA
<- gl.pcoa(temp) pc
Starting gl.pcoa
Processing genlight object with SNP data
Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state
Starting gl.colors
Selected color type 2
Completed: gl.colors
Completed: gl.pcoa
gl.pcoa.plot(pc, temp,pop.labels="legend")
Starting gl.pcoa.plot
Processing an ordination file (glPca)
Processing genlight object with SNP data
Plotting populations identified by a legend
Preparing plot .... please wait
Completed: gl.pcoa.plot
# Repeat the analysis as a PCoA
<- gl.dist.ind(temp) D
Starting gl.dist.ind
Processing genlight object with SNP data
Calculating raw Euclidean Distances between individuals
Returning a stat::dist object
Completed: gl.dist.ind
<- gl.pcoa(D) pco
Starting gl.pcoa
Processing a distance matrix
Performing a PCoA, individuals as entities, no correction applied
Starting gl.colors
Selected color type 2
Completed: gl.colors
Completed: gl.pcoa
gl.pcoa.plot(pco,temp,pop.labels="legend")
Starting gl.pcoa.plot
Processing an ordination file (glPca)
Processing genlight object with SNP data
Plotting populations identified by a legend
Preparing plot .... please wait
Completed: gl.pcoa.plot
Structure and FastStructure
To run STRUCTURE from within R, we need to install the non-GUI executable (the version without frontend) for your operating system [e.g windows, mac or linux]. You can download STRUCTURE for your OS from http://web.stanford.edu/group/pritchardlab/structure_software/release_versions/v2.3.4/html/structure.html.
To let R know where the executable is, you need to set the path to the executable. For example, under windows, if you have the executable in d:/programms/structure/structure.exe
, you can use the command below.
Running STRUCTURE
The small example below is just to test if the programm is set up correctly, hence the binary for your system is in the right place and the path is set correctly.
#quick test to run on windows
<- bandicoot.gl[,1:10]
bc <- gl.run.structure(bc, k.range = 2:6, num.k.rep = 5, exec = 'd:/programms/structure/structure.exe' ) sr
for Linux/Mac the command might look like this:
#quick test to run on linux/mac
<- bandicoot.gl[,1:10]
bc <- gl.run.structure(bc, k.range = 2:6, num.k.rep = 5, exec = '../binaries/linux/structure' ) sr
A more serious example: simulated possums
For this example we will run STRUCTURE on a simulated dataset of possums. The dataset is a genlight object and contains 10 populations of 30 individuals each and 1000 loci and is part of the dartRverse package. The populations are fairly independent but are linked by some imigration, so a typical Metapopulation scenario. The populations are named A to J and follow more or less an isolation by distance (actually an isolation by resistance model). So population next to each other (e.g. B and C) are fairly well mixed and populations further apart from the rest (e.g. D) are more isolated.
To simplify this further we limit our exercise to the first 150 individuals only (which entails now 5 populations a 30 individuals, A to $E). To get an overview on the population we use the function: gl.map.interactive
which plots the individuals on a map. Please note the genlight/dartR object needs to have valid lat long coordinates for each individual to be able to do so.
Also be aware this tutorial will not talk about how to filter your data set before running structure. For this please refer to the previous tutorials on filtering and data preparation.
<- possums.gl[1:150,] #only the first 150 individuals
possums table(pop(possums)) #check the individuals and the populations
A B C D E
30 30 30 30 30
gl.map.interactive(possums)
Starting gl.map.interactive
Processing genlight object with SNP data
Completed: gl.map.interactive
As mentioned this is a simulated data set and we know the true population structure. Populations B and C are very close to each other so those are basically one population and E is at least somewhat in the vicinity. To confirm that our simulation was successful, we quickly can run a pcoa to confirm the population structure.
<- gl.pcoa(possums, plot.out = FALSE, verbose = 0) pcoa
Starting gl.colors
Selected color type 2
Completed: gl.colors
<- gl.pcoa.plot(pcoa, possums, verbose = 0) pcoaplot
And yes it looks like expected.
Confirm that you understand how the map and the pcoa plot relate to each other.
Now, lets see what kind of information we can get from a structure run.
Different to the above we run structure with at least some repetitions for each K (num.k.rep=10) and also over a suitable range. Unfortunately, this takes a while and we will not run it here. Instead we will load the results from a previous run that is saved in the data folder of this repository. [In case you want to re-run structure simply rename or delete the file in your folder, but be aware that this will take a while.].
if (file.exists("./data/srnoad.rds")) {
<- readRDS("./data/srnoad.rds")
srnoadelse {
} <- possums.gl[1:150,]
possums <- gl.run.structure(possums, k.range = 2:7, num.k.rep = 10, exec =
srnoad './binaries/structure',plot.out = FALSE,
burnin=50000, numreps=100000, noadmix=FALSE)
saveRDS(srnoad, "./data/srnoad.rds")
}
Hey this was quick! We have the results of the structure run. Let’s see what we can do with it.
Structure run (no-admixture)
To run structure we use an dartR implementation of functions which is based on the functions provided by the strataG package (F. Archer (2016)). There you can find a detailed description of the function and its arguments. We had to integrate the functionality because strataG is no longer on CRAN.
The possible arguments are listed below:
parameter | description |
---|---|
k.range | vector of values to for maxpop in multiple runs. If set to NULL, a single STRUCTURE run is conducted with maxpops groups. If specified, do not also specify maxpops. |
num.k.rep | number of replicates for each value in k.range. |
label | label to use for input and output files |
delete.files | logical. Delete all files when STRUCTURE is finished? |
exec | name of executable for STRUCTURE. Defaults to “structure”. |
burnin | number of iterations for MCMC burnin. |
numreps | number of MCMC replicates. |
noadmix | logical. No admixture? |
freqscorr | logical. Correlated frequencies? |
randomize | randomize. |
pop.prior | a character specifying which population prior model to use: “locprior” or “usepopinfo”. |
locpriorinit | parameterizes locprior parameter r - how informative the populations are. Only used when pop.prior = “locprior”. |
maxlocprior | specifies range of locprior parameter r. Only used when pop.prior = “locprior”. |
gensback | integer defining the number of generations back to test for immigrant ancestry. Only used when pop.prior = “usepopinfo”. |
migrprior | numeric between 0 and 1 listing migration prior. Only used when pop.prior = “usepopinfo”. |
popflag | a vector of integers (0, 1) or logicals identifiying whether or not to use strata information. Only used when pop.prior = “usepopinfo”. |
pops | vector of population labels to be used in place of numbers in STRUCTURE file. |
Some words about admixture
Admixture Model
Definition: The admixture model assumes that individuals can have ancestry from multiple populations. This means that the genetic makeup of an individual can be a mixture of two or more populations. This model is particularly useful for analyzing genetic data from populations that are known to have mixed or where there is gene flow between populations.
Application: It is applied when there is historical or recent admixture between populations, and it allows for the estimation of individual ancestry proportions from each of the inferred clusters. For example, an individual might be 50% from population A, 30% from population B, and 20% from population C under the admixture model.
Utility: The admixture model can uncover complex patterns of genetic structure that are not apparent under the assumption of discrete, non-overlapping populations.
No-Admixture Model
Definition: The no-admixture model assumes that individuals have ancestry from only one population. This model is particularly useful for analyzing genetic data from populations that are known to be isolated from one another.
Application: This model is used in situations where populations are relatively well-defined and isolated, with little to no gene flow between them. It simplifies the analysis by considering that an individual’s entire genetic information originates from one of the K clusters without any mixture.
Utility: The no-admixture model is useful for identifying distinct populations and their members, especially in cases where populations are clearly separated geographically or temporally.
Structure Results
Okay now that we got that out of our way lets see how to interprete the results of the structure run.
Evanno Plots
The Evanno method is a statistical approach used to determine the most likely number of genetic clusters (K) present in a dataset analyzed by STRUCTURE software. STRUCTURE is a computational tool used for inferring population structure using genetic data. Identifying the correct number of clusters (K) is crucial for accurately interpreting genetic data in the context of population structure, evolutionary biology, and conservation genetics. The Evanno method specifically addresses the challenge of choosing the optimal K by analyzing the rate of change in the likelihood of data between successive K values, rather than just relying on the maximum likelihood. This is done through the calculation of ΔK, a quantity based on the second order rate of change of the likelihood function with respect to K. The method suggests that the value of K corresponding to the highest ΔK should be considered the most likely number of clusters present in the dataset.
The Evanno method is a method to determine the most likely number of populations. It is based on the second order rate of change of the likelihood function with respect to K. The method is implemented in the gl.evanno
function.
<- gl.evanno(srnoad) ev
Plotting the results (Q matrix)
The Q matrix represents the estimated ancestry proportions of individuals across different inferred genetic clusters. STRUCTURE aims to identify K clusters (populations) that best explain the patterns of genetic variation observed in the data, with K either being predefined by the user or determined using methods like the Evanno method.
The Q matrix is essentially a matrix where each row corresponds to an individual in the dataset, and each column represents one of the K inferred genetic clusters. The entries in the matrix are the estimated proportions (ranging from 0 to 1) of each individual’s genome that originated from each cluster. The sum of an individual’s ancestry proportions across all K clusters equals 1.
The values in the Q matrix can be interpreted as the fraction of an individual’s ancestry that comes from each of the K clusters. The Q matrix is often visualized using bar plots or stacked bar graphs, where each individual’s ancestry proportions are shown as segments of a bar colored differently for each cluster.
To get a plot for a certain level you need to specify K or at least a range of K.
<- gl.plot.structure(srnoad, K=3, clumpak = TRUE) qmatnoad
Starting gl.plot.structure
Completed: gl.plot.structure
head(qmatnoad[[1]])
Label cluster1 cluster2 cluster3 K orig.pop ord
<char> <num> <num> <num> <char> <fctr> <int>
1: 1 0 0 1 3 A 1
2: 10 0 0 1 3 A 2
3: 100 1 0 0 3 D 91
4: 101 1 0 0 3 D 92
5: 102 1 0 0 3 D 93
6: 103 1 0 0 3 D 94
A “spatial” structure plot
<- gl.map.structure(qmat = qmatnoad, x = possums, K=3, scalex=1, scaley=0.5 ) gm
The above run was a structure run using the no-admixture model for a range of K=2:7. Study the results by plotting it for various Ks and try to interpret the results in the light of the simulated population structure.
Structure run (no-admixture=FALSE :-))
We will now run the same example using the admixture model. The only difference is that we will set the no-admixture parameter to FALSE.
if (file.exists("./data/srnoad.rds")) {
<- readRDS("./data/srad.rds")
sradelse {
} <- possums.gl[1:150,]
possums <- gl.run.structure(possums, k.range = 2:7, num.k.rep = 10, exec =
srad '../binaries/linux/structure',plot.out = FALSE,
burnin=50000, numreps=100000, noadmix=FALSE)
saveRDS(srad, "./data/srad.rds")
}
And again the evanno plots
<- gl.evanno(srad) ev
Something between 4-5 seems to be a good number of K.
Hence lets look at the qmat (for K=4) and then you can check the same for K=5
<- gl.plot.structure(srad, K=4) qmatad
Starting gl.plot.structure
Completed: gl.plot.structure
head(qmatad[[1]])
Label cluster1 cluster2 cluster3 cluster4 K orig.pop ord
<char> <num> <num> <num> <num> <char> <fctr> <int>
1: 1 0.02099371 0.01889411 0.03918822 0.92092396 4 A 21
2: 10 0.01950371 0.01700341 0.02000401 0.94348888 4 A 23
3: 100 0.01869981 0.93530017 0.02700001 0.01900001 4 D 102
4: 101 0.01430130 0.94469458 0.01700171 0.02400241 4 D 113
5: 102 0.01999801 0.94820537 0.01699831 0.01479831 4 D 99
6: 103 0.03420331 0.91689208 0.02780251 0.02110211 4 D 91
An the spatial version
<- gl.map.structure(qmat = qmatad, x = possums,K=4, scalex=1, scaley=0.5 ) gm
Faststructure
Faststructure is a faster implementation of the structure algorithm. Be aware though it is named Fast’structure’ it is a fairly different implementation of the original approach, hence the results might differ from the original STRUCTURE. The method is based on a variational Bayesian framework and is designed to be faster and more scalable than the original STRUCTURE software. It is particularly useful for analyzing large datasets with many individuals and/or many SNPs (>5000). One of the most important differences is that there is no no-admixture model in Faststructure, but you can run two models that allow for “complex” situtations (logistic prior) and situations where the ancestry is more evenly distributed (simple prior). Also the way how to identify K differ between the methods. We will run the previous examples with both settings and compare the results.
The method is now implemented in the gl.run.faststructure
function. Unfortunatley noone to my knowledge has compiled Faststructure for windows, so it is only available for Linux and Mac. We also need to have plink installed as this is the required input format for faststructure.
Faststructure simple prior
#takes about 3 minutes
#fsr.sim <- gl.run.faststructure(possums, k.range = 2:7, num.k.rep = 10,
# exec = "./binaries/fastStructure", output = tempdir(),
# exec.plink="./binaries", prior = "simple")
#load from the previous run
<- readRDS(("./data/fsr.sim.rds")) fsr.simple
Here is the output of the run:
$plot fsr.simple
Here we can check the marginal likelihoods for the different K values. The recommended K is then the one with the highest marginal likelihood at the lowest K possible. So here we would decide on K=4. As before we can plot the Q matrix and the spatial structure plot.
<- gl.plot.faststructure(fsr.simple$q_list, k.range = 4,clumpak = T) qmat.simple
Starting gl.select.colors
Warning: No color library specified, set to scales and palette set to hue_pal
Need to select one of baseR, brewer, scales, gr.palette or gr.hcl
Showing and returning 4 colors for library scales : palette hue_pal
Completed: gl.select.colors
<-gl.map.structure(qmat.simple, K = 4, x = possums, scalex = 1, scaley = 0.5) gm
Faststructure logistic prior
#takes about 4 minutes
#fsr.log <- gl.run.faststructure(possums, k.range = 2:7, num.k.rep = 10, exec = "./binaries/fastStructure", output = tempdir(), exec.plink="./binaries", prior = "logistic",)
<- readRDS(("./data/fsr.log.rds")) fsr.log
Here is the output of the run:
$plot fsr.log
As before we can plot the Q matrix and the spatial structure plot.
<- gl.plot.faststructure(fsr.log$q_list, k.range = 4,clumpak = T)
qmat.log <-gl.map.structure(qmat.log, K = 4, x = possums, scalex = 1, scaley = 0.5) gm
Exercises
Now it is time to compare the results from all four previous runs.
If you are feeling energetic, you can try to run the structure and faststructure on the crocodile dataset. (crocs.rds)
This is a publically available dataset (though it has been filtered and pruned) from the paper:) to make is a bit easier to run. Nevertheless it still has >1500 SNPs and more then 400 individuals, so it takes a bit of time to run. If you do not want to wait I did two faststructure runs for you. * prior = simple, K=2:18 (max number of populations), but only 3 repeat (num.k.rep) * prior = logistic, K=2:18 (max number of populations), but only 3 repeat (num.k.rep)
Maybe plot the data first, have a look at a PCA and then decide on the number of populations (range of K) you want to run the faststructure on. It might also make sense to simplify the number of populations in a smaller number of subunits. A description of the dataset (and the original paper) can be found here: Fukuda et al. (2024)
<- readRDS("./data/crocs.rds")
crocs gl.map.interactive(crocs)
Starting gl.map.interactive
Processing genlight object with SNP data
Completed: gl.map.interactive
Further Study
Foxes in Australia
This is a data set (only a subset) of foxes in Australia, which I had lying around. Admittedly the dataset has been pruned, but maybe run a faststructure analysis to check for the population structure between foxes from the west and to the east.
<- readRDS("./data/foxes.rds")
foxes gl.map.interactive(foxes, pop.labels = FALSE)
Starting gl.map.interactive
Processing genlight object with SNP data
Completed: gl.map.interactive
Readings
Georges et al. (2023)
Battey, Ralph, and Kern (2020)