W08 Genetic Structure of Wild Populations to Inform Management

Author

Craig Moritz & Arthur Georges

Published

March 10, 2026

W08 Genetic Divergence & Phylogenetics

Introduction

In this session we are going to cover some of the basic analyses that can be taken to inform management when faced with threatened species showing genetic structure across their natural range. The central considerations are how to draw from those populations to establish captive insurance colonies, and how to plan releases to boost declining populations.

We have heard of some real-world examples on lions by Laura Pertola and small Australian mammals by Kate Rick. In this session, we will cover some of the considerations for defining management units, ESUs and other userful designations. Craig will cover the theory and practice of population divergence. We will then look a PCA in a little more detail as one of the most powerful tools for examining structure within large SNP datasets. We will run through the seven deadly sins of classical PCA that allow the technique to be used with confidence.

Please refer to the article Georges, A., Mijangos, L., Patel, H., Aitkens, M. and Gruber, B. 2023. Distances and their visualization in studies of spatial-temporal genetic variation using single nucleotide polymorphisms (SNPs). BioxRiv https://doi.org/10.1101/2023.03.22.533737.

We move on to the example of the Broad Toothed Rat in the Barrington Tops region of NSW, not so much as a definitive study, but as a sandpit to explore some of the analyses that can be applied to a structured wild population.

We finish with an Exercise using data from the endangered Western Sawshell Turtle of the Murray Darling River Basin.

A central question in both of these studies is when there is strong genetic structure across the landscape, does this imply separate management of the distinct genetic units, and if so, at what scale.

Prerequisites

No prerequisites required. This is a very basic treatment

Workflow

  1. Genetic structure and SNPs, defining units relevant to management.

  2. The seven deadly sins of classical PCA

  3. Jump into the sandpit with the Broad Toothed Rat

  4. Exercise with the Western Sawshell Turtle

  5. Discussion

Learning outcomes

  • Population divergence models

  • Usage of PCA

  • To correctly identify population structure

The why & how of population divergence models

Analysing the history of divergence and gene flow is key to defining conservation units within species

Sampling matters for species with limited dispersal!

Battey et al. 2020. Space is the place…. Genetics.
  • Restricted dispersal -> isolation by distance

  • Spatially cluster sampling yields biased estimates of population parameters, clustering etc.

Why not use SNP-based phylogenies to estimate divergence times?

  1. Gene trees differ from species trees: Average divergence is > split time due to deep coalescence of lineages. Difference increases with Na

  2. SNP-based analyses overestimate divergence unless # invariant sites is known

See Leache et al. 2015 Syst. Biol.

Approaches to modeling divergence histories – population models and the multispecies coalescent

  1. 2 population model at mutation-migration equilibrium (e.g MIGRATE)

  2. 2 population Isolation with Migration (IM) model

  3. Multispecies coalescent models +/- migration

Fitting population divergence models

DeJode et al. 2022; Evol. Appl.
  • All models are wrong…
  • Which one(s) best fit your data?
  • Methods: dadi, moments, DILS
  • Example: Marine populations

Example: rock wallabies from northern Australia (Potter et al. 2024 Syst. Biol.)

  • Phylogenetic discordance suggests current or historical gene flow among species

  • Models support migration (IM or SC) between species and populations

Take home messages

  • Well-distributed population sampling is important! Often this is iterative.
  • SNP based phylogenies, PCAs etc are useful as heuristics and to develop hypotheses BUT…
  • Coalescent-based population methods should be used to estimate population divergence histories
  • Comparing alternative models of divergence is useful BUT…
  • Remember all models are wrong…

Seven sins of classical PCA

1. Missing values

displacement

variance inflation

2. Inappropriate distance metric

PCA vs PCoA

You can substitute the standardized covariance matrix used by PCA with any standardized distance metric (Gower, 1966).

dartR coding

  • 0, homozygous reference allele (DArT set this as the most common allele)
  • 1, heterozygous
  • 2, homozygous alternate allele

Relative distances between individuals must not depend on the arbitrary choice of reference and alternate allele. Prudent to check this.

3. Including close relatives

Create spurious cryptic aggregations. Why?

4. Choosing two dimensions for convenience

  • Separation is definitive. Proximity is ambiguous.

  • Identify informative dimensions (Broken-Stick Criterion) Explicitly choose what information willing to discard

5. Confounding % explained with biological importance

Sin 5: “PC1 explains X% across datasets, a comparable across studies measure of”strength of structure”

But % variance depends on loci, scaling, LD pruning, MAF filters, sample composition. Use the Tracy-Widom framework, and by default in dartR, the broken-stick criterion to distinguish signal from noise.

AND

PCA and PCoA compute % explained in fundamentally different ways and so are not comparable.

6. Structural variants and gross linkage

Weird inexplicable patterns – 2 or 3 replicated sets of aggregations If two patterns – sex linkage?

If 2 or 3, large structural variant?

7. Over-interpretation

Sin 7: Treating PCs as direct evidence for specific processes, discrete groups, or biological mechanisms when PCA is fundamentally a variance-maximizing projection showing patterns.

(NovembreStephen2008?)

(McVean2009?)

BOTTOM LINE

Care to avoid the seven sins. PCA is an incredible tool that lends itself to SNP analysis. But it is a hypothesis generating tool only and must be followed by more definitive analyses.

Worked Example: Broad Toothed Rat

Cute as mustard - EPBC Act:Endangered. Patchy distribution – Barrington Tops, Snowy Mountains, Eastern Highlands, Victorian Alps, Otway Ranges, Wilsons Promontory and western Tasmania.
  • Insurance breeding colony established

  • Sampled using cheek swabs

  • Genotyped using DArTseq

  • Advice on who to breed with whom to maximize genetic diversity in the offspring

  • Opportunity for us to explore genetic diversity in the source population

Required packages

library(dartRverse)

Step 1 Load in the data:

gl.set.verbosity(3)
Starting gl.set.verbosity 
  Global verbosity set to: 3 
Completed: gl.set.verbosity 
gl <- gl.load(file="./data/Mastocomys.Rdata")
Starting gl.load 
  Processing genlight object with SNP data
  Loaded object of type SNP from ./data/Mastocomys.Rdata 
Starting gl.compliance.check 
  Processing genlight object with SNP data
  Checking coding of SNPs
    SNP data scored NA, 0, 1 or 2 confirmed
  Checking for population assignments
    Population assignments confirmed
  Checking locus metrics and flags
  Recalculating locus metrics
  Checking for monomorphic loci
    Dataset contains monomorphic loci
  Checking for loci with all missing data
    No loci with all missing data detected
  Checking whether individual names are unique.
  Checking for individual metrics
    Individual metrics confirmed
  Spelling of coordinates checked and changed if necessary to 
            lat/lon
Completed: gl.compliance.check 
Completed: gl.load 

Step 2 Interrogate:

gl
 ********************
 *** DARTR OBJECT ***
 ********************

 ** 20 genotypes,  1,727 SNPs , size: 1.5 Mb

    missing data: 21535 (=62.35 %) scored as NA

 ** Genetic data
   @gen: list of 20 SNPbin
   @ploidy: ploidy of each individual  (range: 2-2)

 ** Additional data
   @ind.names:  20 individual labels
   @loc.names:  1727 locus labels
   @loc.all:  1727 allele labels
   @position: integer storing positions of the SNPs [within 69 base sequence]
   @pop: population of each individual (group size range: 4-6)
   @other: a list containing: loc.metrics, ind.metrics, latlon, loc.metrics.flags, verbose, history 
    @other$ind.metrics: seq, id, pop, sex, lat, lon, service, plate_location 
    @other$loc.metrics: AlleleID, CloneID, AlleleSequence, TrimmedSequence, SNP, SnpPosition, CallRate, OneRatioRef, OneRatioSnp, FreqHomRef, FreqHomSnp, FreqHets, PICRef, PICSnp, AvgPIC, AvgCountRef, AvgCountSnp, RepAvg, clone, uid, rdepth, monomorphs, maf, OneRatio, PIC 
   @other$latlon[g]: coordinates for all individuals are attached
popNames(gl)
[1] "Brumlow"         "Gloucester_Tops" "Polblue"         "Watergauge"     
indNames(gl)
 [1] "ARP_Layley.1"  "ARP_Pete.1"    "ARP_Layley.2"  "ARP_Pete.2"   
 [5] "ARP_Deano.1"   "ARP_Matty.1"   "ARP_Ruby.1"    "ARP_Deano.2"  
 [9] "ARP_Matty.2"   "ARP_Ruby.2"    "ARP_Delilah.1" "ARP_Naura.1"  
[13] "ARP_Delilah.2" "ARP_Naura.2"   "ARP_Dot.1"     "ARP_Lawrie.1" 
[17] "ARP_Peggy.1"   "ARP_Dot.2"     "ARP_Lawrie.2"  "ARP_Peggy.2"  
table(pop(gl))

        Brumlow Gloucester_Tops         Polblue      Watergauge 
              6               6               4               4 
gl@other$latlong
NULL

Step 3 Examine geography:

gl.map.interactive(gl)
Starting gl.map.interactive 
  Processing genlight object with SNP data
Completed: gl.map.interactive 

Step 4 Filtering:

gl.report.callrate(gl)
Starting gl.report.callrate 
  Processing genlight object with SNP data
  Reporting Call Rate by Locus
  No. of loci = 1727 
  No. of individuals = 20 
    Minimum      :  0.05 
    1st quartile :  0.25 
    Median       :  0.35 
    Mean         :  0.37652 
    3r quartile  :  0.45 
    Maximum      :  1 
    Missing Rate Overall:  0.6235 

   Quantile Threshold Retained Percent Filtered Percent
1      100%      1.00        1     0.1     1726    99.9
2       95%      0.75      115     6.7     1612    93.3
3       90%      0.65      213    12.3     1514    87.7
4       85%      0.60      263    15.2     1464    84.8
5       80%      0.50      415    24.0     1312    76.0
6       75%      0.45      526    30.5     1201    69.5
7       70%      0.45      526    30.5     1201    69.5
8       65%      0.40      675    39.1     1052    60.9
9       60%      0.35      895    51.8      832    48.2
10      55%      0.35      895    51.8      832    48.2
11      50%      0.35      895    51.8      832    48.2
12      45%      0.30     1155    66.9      572    33.1
13      40%      0.30     1155    66.9      572    33.1
14      35%      0.30     1155    66.9      572    33.1
15      30%      0.25     1413    81.8      314    18.2
16      25%      0.25     1413    81.8      314    18.2
17      20%      0.25     1413    81.8      314    18.2
18      15%      0.20     1566    90.7      161     9.3
19      10%      0.20     1566    90.7      161     9.3
20       5%      0.15     1660    96.1       67     3.9
21       0%      0.05     1727   100.0        0     0.0

Completed: gl.report.callrate 
gl.filter.callrate(gl, threshold=0.5)
Starting gl.filter.callrate 
  Processing genlight object with SNP data
  Warning: Data may include monomorphic loci in call rate 
                    calculations for filtering
  Recalculating Call Rate
  Removing loci based on Call Rate, threshold = 0.5 
  Summary of filtered dataset
    Call Rate for loci > 0.5 
    Original No. of loci : 1727 
    Original No. of individuals: 20 
    No. of loci retained: 415 
    No. of individuals retained: 20 
    No. of populations:  4 

Completed: gl.filter.callrate 

Step 5 Exploratory visualization:

pca <- gl.pcoa(gl)
Starting gl.pcoa 
  Processing genlight object with SNP data
  Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state
Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
Also defined by 'spam'
Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
Also defined by 'spam'
  Ordination yielded 1 informative dimensions( broken-stick criterion) from 19 original dimensions
    PCA Axis 1 explains 0 % of the total variance
    PCA Axis 1 and 2 combined explain NA % of the total variance
    PCA Axis 1-3 combined explain NA % of the total variance
Starting gl.colors 
Selected color type 2 
Completed: gl.colors 
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

Completed: gl.pcoa 
gl.pcoa.plot(pca,gl,ellipse=TRUE, plevel=0.67)
Starting gl.pcoa.plot 
  Processing an ordination file (glPca)
  Processing genlight object with SNP data
  Plotting populations in a space defined by the SNPs
  Preparing plot .... please wait

Completed: gl.pcoa.plot 

gl <- gl.impute(gl) 
Starting gl.impute 
  Processing genlight object with SNP data
  Imputation based on drawing from the nearest neighbour
  Warning: Population Brumlow has 1336 loci with all missing values.
  Method = 'neighbour': -966 values to be imputed (excluding all-NA loci).
  Warning: Population Gloucester_Tops has 1323 loci with all missing values.
  Method = 'neighbour': -986 values to be imputed (excluding all-NA loci).
  Warning: Population Polblue has 1316 loci with all missing values.
  Method = 'neighbour': -1547 values to be imputed (excluding all-NA loci).
  Warning: Population Watergauge has 1250 loci with all missing values.
  Method = 'neighbour': -1184 values to be imputed (excluding all-NA loci).
  Calculating the unscaled distance matrix -- euclidean 
  Residual missing values were filled randomly drawing from the global allele profiles by locus
  Imputation method: neighbour 
  No. of missing values before imputation: 21535 
  No. of loci with all NA's for any one population before imputation: 3500 
  No. of values imputed: 21535 
  No. of missing values after imputation: 0 
  No. of loci with all NA's for any one population after imputation: 0 
Completed: gl.impute 
pca <- gl.pcoa(gl) # 2 informative axes
Starting gl.pcoa 
  Processing genlight object with SNP data
  Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state

  Ordination yielded 4 informative dimensions( broken-stick criterion) from 19 original dimensions
    PCA Axis 1 explains 39 % of the total variance
    PCA Axis 1 and 2 combined explain 53.5 % of the total variance
    PCA Axis 1-3 combined explain 65.5 % of the total variance
Starting gl.colors 
Selected color type 2 
Completed: gl.colors 

Completed: gl.pcoa 
gl.pcoa.plot(pca,gl,ellipse=TRUE, plevel=0.67)
Starting gl.pcoa.plot 
  Processing an ordination file (glPca)
  Processing genlight object with SNP data
  Plotting populations in a space defined by the SNPs
  Preparing plot .... please wait

Completed: gl.pcoa.plot 

Step 6 Analysis:

# F Statistics
gl.report.fstat(gl)
Starting gl.report.fstat 
  Processing genlight object with SNP data
$Stat_matrices
$Stat_matrices$Fst
                Brumlow Gloucester_Tops Polblue Watergauge
Brumlow              NA          0.5296  0.1881     0.1252
Gloucester_Tops  0.5296              NA  0.4305     0.4394
Polblue          0.1881          0.4305      NA     0.1283
Watergauge       0.1252          0.4394  0.1283         NA

$Stat_matrices$Fstp
                Brumlow Gloucester_Tops Polblue Watergauge
Brumlow              NA          0.6925  0.3166     0.2225
Gloucester_Tops  0.6925              NA  0.6019     0.6105
Polblue          0.3166          0.6019      NA     0.2274
Watergauge       0.2225          0.6105  0.2274         NA

$Stat_matrices$Dest
                Brumlow Gloucester_Tops Polblue Watergauge
Brumlow              NA          0.2220  0.0810     0.0444
Gloucester_Tops  0.2220              NA  0.2564     0.2352
Polblue          0.0810          0.2564      NA     0.0714
Watergauge       0.0444          0.2352  0.0714         NA

$Stat_matrices$Gst_H
                Brumlow Gloucester_Tops Polblue Watergauge
Brumlow              NA          0.8060  0.4062     0.2775
Gloucester_Tops  0.8060              NA  0.7711     0.7599
Polblue          0.4062          0.7711      NA     0.3139
Watergauge       0.2775          0.7599  0.3139         NA


$Stat_tables
      Stat_tables.Brumlow_vs_Gloucester_Tops Stat_tables.Brumlow_vs_Polblue
Fst                                   0.5296                         0.1881
Fstp                                  0.6925                         0.3166
Dest                                  0.2220                         0.0810
Gst_H                                 0.8060                         0.4062
      Stat_tables.Brumlow_vs_Watergauge Stat_tables.Gloucester_Tops_vs_Polblue
Fst                              0.1252                                 0.4305
Fstp                             0.2225                                 0.6019
Dest                             0.0444                                 0.2564
Gst_H                            0.2775                                 0.7711
      Stat_tables.Gloucester_Tops_vs_Watergauge
Fst                                      0.4394
Fstp                                     0.6105
Dest                                     0.2352
Gst_H                                    0.7599
      Stat_tables.Polblue_vs_Watergauge
Fst                              0.1283
Fstp                             0.2274
Dest                             0.0714
Gst_H                            0.3139

Starting gl.plot.heatmap 
Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
Also defined by 'spam'
  Processing a data matrix
Registered S3 method overwritten by 'dendextend':
  method     from 
  rev.hclust vegan
Warning in rep(col, length.out = leaves_length): 'x' is NULL so the result will
be NULL
Warning in rep(if (nam %in% names(L)) L[[nam]] else default, length.out =
indx): 'x' is NULL so the result will be NULL
Warning in rep(if (nam %in% names(L)) L[[nam]] else default, length.out =
indx): 'x' is NULL so the result will be NULL
Warning in rep(if (nam %in% names(L)) L[[nam]] else default, length.out =
indx): 'x' is NULL so the result will be NULL
Warning in rep(if (nam %in% names(L)) L[[nam]] else default, length.out =
indx): 'x' is NULL so the result will be NULL

Completed: gl.plot.heatmap 
Completed: gl.report.fstat 
$Stat_matrices
$Stat_matrices$Fst
                Brumlow Gloucester_Tops Polblue Watergauge
Brumlow              NA          0.5296  0.1881     0.1252
Gloucester_Tops  0.5296              NA  0.4305     0.4394
Polblue          0.1881          0.4305      NA     0.1283
Watergauge       0.1252          0.4394  0.1283         NA

$Stat_matrices$Fstp
                Brumlow Gloucester_Tops Polblue Watergauge
Brumlow              NA          0.6925  0.3166     0.2225
Gloucester_Tops  0.6925              NA  0.6019     0.6105
Polblue          0.3166          0.6019      NA     0.2274
Watergauge       0.2225          0.6105  0.2274         NA

$Stat_matrices$Dest
                Brumlow Gloucester_Tops Polblue Watergauge
Brumlow              NA          0.2220  0.0810     0.0444
Gloucester_Tops  0.2220              NA  0.2564     0.2352
Polblue          0.0810          0.2564      NA     0.0714
Watergauge       0.0444          0.2352  0.0714         NA

$Stat_matrices$Gst_H
                Brumlow Gloucester_Tops Polblue Watergauge
Brumlow              NA          0.8060  0.4062     0.2775
Gloucester_Tops  0.8060              NA  0.7711     0.7599
Polblue          0.4062          0.7711      NA     0.3139
Watergauge       0.2775          0.7599  0.3139         NA


[[2]]
      Stat_tables.Brumlow_vs_Gloucester_Tops Stat_tables.Brumlow_vs_Polblue
Fst                                   0.5296                         0.1881
Fstp                                  0.6925                         0.3166
Dest                                  0.2220                         0.0810
Gst_H                                 0.8060                         0.4062
      Stat_tables.Brumlow_vs_Watergauge Stat_tables.Gloucester_Tops_vs_Polblue
Fst                              0.1252                                 0.4305
Fstp                             0.2225                                 0.6019
Dest                             0.0444                                 0.2564
Gst_H                            0.2775                                 0.7711
      Stat_tables.Gloucester_Tops_vs_Watergauge
Fst                                      0.4394
Fstp                                     0.6105
Dest                                     0.2352
Gst_H                                    0.7599
      Stat_tables.Polblue_vs_Watergauge
Fst                              0.1283
Fstp                             0.2274
Dest                             0.0714
Gst_H                            0.3139
# Allelic Richness
r1 <- gl.report.allelerich(gl)
Starting gl.report.allelerich 
  Processing genlight object with SNP data
  Calculating Allelic Richness, averaged across
                    loci, for each population
Starting gl.colors 
Selected color type dis 
Completed: gl.colors 
  Returning a dataframe with allelic richness values
Completed: gl.report.allelerich 

# Allelic Richness Simulation
r1 <- gl.report.allelerich(gl)
Starting gl.report.allelerich 
  Processing genlight object with SNP data
  Calculating Allelic Richness, averaged across
                    loci, for each population
Starting gl.colors 
Selected color type dis 
Completed: gl.colors 
  Returning a dataframe with allelic richness values
Completed: gl.report.allelerich 

r2 <- gl.report.nall(  x = gl,  simlevels = seq(1, nInd(gl), 5),  reps = 10,  plot.colors.pop = gl.colors("dis"),  ncores = 2)
Starting gl.report.nall 
  Processing genlight object with SNP data
Starting gl.filter.allna 
  Identifying and removing loci that are all missing (NA) 
                    in any one population
  Deleting loci that are all missing (NA) in any one population
     Brumlow : deleted 0 loci
     Gloucester_Tops : deleted 0 loci
     Polblue : deleted 0 loci
     Watergauge : deleted 0 loci

  Loci all NA in one or more populations: 0 deleted

Completed: gl.filter.allna 
Starting gl.colors 
Selected color type dis 
Completed: gl.colors 

Completed: gl.report.nall 
# Private Alleles
r1 <- gl.report.pa(gl)
Starting gl.report.pa 
  Processing genlight object with SNP data
  p1 p2            pop1            pop2 N1 N2 fixed priv1 priv2 Chao1 Chao2
1  1  2         Brumlow Gloucester_Tops  6  6   244   547   495     0     1
2  1  3         Brumlow         Polblue  6  4    31   217   465     0     0
3  1  4         Brumlow      Watergauge  6  4     3   190   357     0     0
4  2  3 Gloucester_Tops         Polblue  6  4   233   455   755     1     0
5  2  4 Gloucester_Tops      Watergauge  6  4   226   438   657     1     0
6  3  4         Polblue      Watergauge  4  4    28   379   298     0     0
  totalpriv   AFD asym asym.sig
1      1042 0.283   NA       NA
2       682 0.191   NA       NA
3       547 0.146   NA       NA
4      1210 0.349   NA       NA
5      1095 0.319   NA       NA
6       677 0.201   NA       NA
  Table of private alleles and fixed differences returned
Completed: gl.report.pa 

Step 7. Interpetation

  • PCA indicates a high degree of population structure
  • This is borne out by the FST analysis
  • Allelic diversity is a little low, not exceptionally, and not differentially
  • Each subpopulation has a high frequency of private alleles

Working Hypothesis: The Barrington Tops populations likely represent a former widespread and connected population, subsequently fragmented with differential loss and retention of alleles under the influence of drift.

Future Work: - Increase sample sizes to better capture allelic profiles - Calculate \(N_e\) to assess likelihood of population persistence - Plot \(N_e\) through time to see if there is evidence of a bottleneck

Kate Rick: Strong structure does not necessarily mean separate management of divergent groups.

Maximizing the Barrington Tops population potential to adapt and persist may require cross-breeding and distributed release.

Management (subject to findings of future work) Sampling should be from across the sites in Barrington Tops, cross-breeding should be undertaken to capture allelic diversity, and strategic re-release across sites is likely to be a well-supported strategy.

Worked Example: Western Sawshell Turtle

Not cute as mustard, EPBC Act:Endangered; Headwaters of the Murray-Darling Basin – Namoi, Gwydir and Border Rivers subdrainage basins

  • Insurance breeding colony established

  • Sampled using skin biopsy

  • Genotyped using DArTseq

  • Is the colony representative of the genetic diversity of wild populations

  • Advice on who to breed with whom to maximize genetic diversity in the offspring for release

Exercise: Western Sawshell Turtle

Step 1. Load in the data (Myuchelys.Rdata)

Step 2. Interrogate the data

Step 3. Examine geographically

Step 4. Filtering – craft a regime

Step 5. Exploratory visualization

Step 6. Analysis

Step 7. Interpretation

Required packages

library(dartRverse)

Step 1 Load in the data:

gl.set.verbosity(3)
Starting gl.set.verbosity 
  Global verbosity set to: 3 
Completed: gl.set.verbosity 
gl <- gl.load(file="./data/Myuchelys.Rdata")
Starting gl.load 
  Processing genlight object with SNP data
  Loaded object of type SNP from ./data/Myuchelys.Rdata 
Starting gl.compliance.check 
  Processing genlight object with SNP data
  Checking coding of SNPs
    SNP data scored NA, 0, 1 or 2 confirmed
  Checking for population assignments
    Population assignments confirmed
  Checking locus metrics and flags
  Recalculating locus metrics
  Checking for monomorphic loci
    No monomorphic loci detected
  Checking for loci with all missing data
    No loci with all missing data detected
  Checking whether individual names are unique.
  Checking for individual metrics
    Individual metrics confirmed
  Spelling of coordinates checked and changed if necessary to 
            lat/lon
Completed: gl.compliance.check 
Completed: gl.load 

Step 2 Interrogate:

gl
 ********************
 *** DARTR OBJECT ***
 ********************

 ** 141 genotypes,  31,289 SNPs , size: 74.9 Mb

    missing data: 512130 (=11.61 %) scored as NA

 ** Genetic data
   @gen: list of 141 SNPbin
   @ploidy: ploidy of each individual  (range: 2-2)

 ** Additional data
   @ind.names:  141 individual labels
   @loc.names:  31289 locus labels
   @loc.all:  31289 allele labels
   @position: integer storing positions of the SNPs [within 69 base sequence]
   @pop: population of each individual (group size range: 13-65)
   @other: a list containing: loc.metrics, ind.metrics, latlon, loc.metrics.flags, verbose, history 
    @other$ind.metrics: id, pop, drainage, species, lat, lon, service, plate_location 
    @other$loc.metrics: AlleleID, CloneID, AlleleSequence, TrimmedSequence, Chrom_Myuchelys_georgesi_rMyuGeo1.pri, ChromPosTag_Myuchelys_georgesi_rMyuGeo1.pri, ChromPosSnp_Myuchelys_georgesi_rMyuGeo1.pri, AlnCnt_Myuchelys_georgesi_rMyuGeo1.pri, AlnEvalue_Myuchelys_georgesi_rMyuGeo1.pri, Strand_Myuchelys_georgesi_rMyuGeo1.pri, SNP, SnpPosition, CallRate, OneRatioRef, OneRatioSnp, FreqHomRef, FreqHomSnp, FreqHets, PICRef, PICSnp, AvgPIC, AvgCountRef, AvgCountSnp, RepAvg, clone, uid, rdepth, monomorphs, maf, OneRatio, PIC 
   @other$latlon[g]: coordinates for all individuals are attached
popNames(gl)
[1] "Border-Nth" "Border-Sth" "Gwydir"     "Namoi"     
indNames(gl)
  [1] "AA042487" "AA042500" "Q19140"   "AA004437" "AA004461" "AA004469"
  [7] "AA042479" "AA042488" "AA18895"  "Q19141"   "AA004438" "AA004462"
 [13] "AA004470" "AA042480" "AA042489" "Q19128"   "Q19145"   "AA004439"
 [19] "AA004463" "AA004471" "AA042481" "AA042490" "Q19130"   "AA004440"
 [25] "AA004464" "AA004472" "AA042482" "Q19131"   "AA061829" "AA004433"
 [31] "AA004441" "AA004465" "AA004473" "AA042483" "AA042497" "Q19133"  
 [37] "AA061831" "AA004434" "AA004442" "AA004466" "AA004474" "AA042484"
 [43] "AA042498" "Q19134"   "AA004435" "AA004459" "AA004467" "AA004475"
 [49] "AA042485" "AA042499" "Q19139"   "AA004436" "AA004460" "AA004468"
 [55] "AA004476" "AA042486" "LSWt049"  "LSWt057"  "LSWt066"  "LSWt025" 
 [61] "LSWt033"  "LSWt041"  "LSWt050"  "LSWt059"  "LSWt067"  "LSWt026" 
 [67] "LSWt034"  "LSWt042"  "LSWt051"  "LSWt060"  "LSWt068"  "LSWt027" 
 [73] "LSWt035"  "LSWt043"  "LSWt052"  "LSWt061"  "LSWt069"  "LSWt028" 
 [79] "LSWt036"  "LSWt044"  "LSWt053"  "LSWt062"  "LSWt070"  "LSWt029" 
 [85] "LSWt037"  "LSWt045"  "LSWt054"  "LSWt063"  "LSWt071"  "LSWt030" 
 [91] "LSWt038"  "LSWt046"  "LSWt055"  "LSWt064"  "LSWt031"  "LSWt039" 
 [97] "LSWt047"  "LSWt056"  "LSWt065"  "LSWt032"  "LSWt040"  "LSWt048" 
[103] "LSWt074"  "LSWt082"  "LSWt090"  "LSWt098"  "LSWt106"  "LSWt075" 
[109] "LSWt083"  "LSWt091"  "LSWt099"  "LSWt107"  "LSWt076"  "LSWt084" 
[115] "LSWt092"  "LSWt100"  "LSWt108"  "LSWt077"  "LSWt085"  "LSWt093" 
[121] "LSWt101"  "LSWt109"  "LSWt078"  "LSWt086"  "LSWt094"  "LSWt102" 
[127] "LSWt110"  "LSWt079"  "LSWt087"  "LSWt095"  "LSWt103"  "LSWt080" 
[133] "LSWt088"  "LSWt096"  "LSWt104"  "LSWt072"  "LSWt081"  "LSWt089" 
[139] "LSWt097"  "LSWt105"  "LSWt073" 
table(pop(gl))

Border-Nth Border-Sth     Gwydir      Namoi 
        13         41         65         22 
gl@other$latlon
               lat      lon
AA042487 -30.68550 151.1212
AA042500 -30.46795 151.3111
Q19140   -28.83190 151.9486
AA004437 -30.94012 151.3248
AA004461 -30.50848 151.1195
AA004469 -30.50848 151.1195
AA042479 -30.94012 151.3248
AA042488 -30.68550 151.1212
AA18895  -28.82950 151.9399
Q19141   -28.82950 151.9399
AA004438 -30.94012 151.3248
AA004462 -30.51561 151.1210
AA004470 -30.50848 151.1195
AA042480 -30.94012 151.3248
AA042489 -30.68550 151.1212
Q19128   -28.82950 151.9399
Q19145   -28.82950 151.9399
AA004439 -30.94012 151.3248
AA004463 -30.50848 151.1195
AA004471 -30.50848 151.1195
AA042481 -30.68550 151.1212
AA042490 -30.68550 151.1212
Q19130   -28.82950 151.9399
AA004440 -30.94012 151.3248
AA004464 -30.50848 151.1195
AA004472 -30.50848 151.1195
AA042482 -30.68550 151.1212
Q19131   -28.82950 151.9399
AA061829 -30.16712 151.0753
AA004433 -30.94012 151.3248
AA004441 -30.94012 151.3248
AA004465 -30.50848 151.1195
AA004473 -30.50848 151.1195
AA042483 -30.68550 151.1212
AA042497 -30.46795 151.3111
Q19133   -28.83190 151.9486
AA061831 -30.16712 151.0753
AA004434 -30.94012 151.3248
AA004442 -30.94012 151.3248
AA004466 -30.50848 151.1195
AA004474 -30.50848 151.1195
AA042484 -30.68550 151.1212
AA042498 -30.46795 151.3111
Q19134   -28.82950 151.9399
AA004435 -30.94012 151.3248
AA004459 -30.50848 151.1195
AA004467 -30.50848 151.1195
AA004475 -30.50848 151.1195
AA042485 -30.68550 151.1212
AA042499 -30.46795 151.3111
Q19139   -28.83190 151.9486
AA004436 -30.94012 151.3248
AA004460 -30.50848 151.1195
AA004468 -30.50848 151.1195
AA004476 -30.50848 151.1195
AA042486 -30.68550 151.1212
LSWt049  -30.67500 151.4513
LSWt057  -30.67630 151.4529
LSWt066  -30.07390 151.3439
LSWt025  -29.42525 151.8426
LSWt033  -29.92533 151.1063
LSWt041  -30.14095 151.3823
LSWt050  -30.67500 151.4513
LSWt059  -30.07130 151.3358
LSWt067  -30.07390 151.3439
LSWt026  -29.42525 151.8426
LSWt034  -29.92556 151.1069
LSWt042  -30.21392 151.0832
LSWt051  -30.67500 151.4513
LSWt060  -30.07089 151.3374
LSWt068  -30.42880 151.5974
LSWt027  -29.38272 151.8845
LSWt035  -29.92556 151.1069
LSWt043  -30.21540 151.0805
LSWt052  -30.67500 151.4513
LSWt061  -30.06970 151.3402
LSWt069  -30.42880 151.5974
LSWt028  -29.92550 151.1066
LSWt036  -29.92556 151.1069
LSWt044  -30.21540 151.0805
LSWt053  -30.67500 151.4513
LSWt062  -30.07178 151.3387
LSWt070  -29.72596 151.7875
LSWt029  -29.92550 151.1066
LSWt037  -29.92556 151.1069
LSWt045  -30.21540 151.0805
LSWt054  -30.67630 151.4529
LSWt063  -30.07380 151.3437
LSWt071  -29.73085 151.7885
LSWt030  -29.92546 151.1063
LSWt038  -30.14236 151.3813
LSWt046  -30.21457 151.0823
LSWt055  -30.67480 151.4529
LSWt064  -30.07380 151.3437
LSWt031  -29.92546 151.1063
LSWt039  -30.14204 151.3816
LSWt047  -30.21367 151.0836
LSWt056  -30.67480 151.4529
LSWt065  -30.07380 151.3437
LSWt032  -29.92546 151.1063
LSWt040  -30.14157 151.3820
LSWt048  -30.67630 151.4529
LSWt074  -29.73297 151.7878
LSWt082  -29.69141 151.7980
LSWt090  -29.47887 151.4875
LSWt098  -29.50870 151.7105
LSWt106  -29.50729 151.7103
LSWt075  -29.73255 151.7895
LSWt083  -29.58655 151.7411
LSWt091         NA       NA
LSWt099  -29.50729 151.7103
LSWt107  -29.68650 151.5483
LSWt076  -29.76509 151.7721
LSWt084  -29.68465 151.7977
LSWt092  -29.50630 151.7085
LSWt100  -29.50729 151.7103
LSWt108  -29.68650 151.5483
LSWt077  -29.73376 151.7869
LSWt085  -29.50850 151.6267
LSWt093  -29.53298 151.7194
LSWt101  -29.50729 151.7103
LSWt109  -29.68500 151.5521
LSWt078  -29.73375 151.7846
LSWt086  -29.50850 151.6271
LSWt094  -29.53244 151.7182
LSWt102  -29.50906 151.7040
LSWt110  -29.68490 151.5526
LSWt079  -29.69719 151.7991
LSWt087  -29.50270 151.6113
LSWt095  -29.50978 151.7107
LSWt103  -29.50906 151.7040
LSWt080  -29.69669 151.7990
LSWt088  -29.47887 151.4875
LSWt096  -29.50940 151.7105
LSWt104  -29.50906 151.7040
LSWt072  -29.73297 151.7878
LSWt081  -29.69669 151.7990
LSWt089  -29.47887 151.4875
LSWt097  -29.50895 151.7105
LSWt105  -29.50906 151.7040
LSWt073  -29.73330 151.7837

Step 3 Examine geography:

gl.map.interactive(gl)
Starting gl.map.interactive 
  Processing genlight object with SNP data
Warning in validateCoords(lng, lat, funcName): Data contains 1 rows with either
missing or invalid lat/lon values and will be ignored
Completed: gl.map.interactive 

Step 4 Craft a Filtering Regime:

First let us do a smear plot of individuals (vertical) against loci (horizontal)

gl.smearplot(gl)
  Processing genlight object with SNP data
Starting gl.smearplot 

Completed: gl.smearplot 

Note that missing values are in grey, so they give an indication of how dense your data set is (as opposed to being littered with gaps). Then we craft a filtering regime.

Unfortunately, we do not have the sexes in this dataset, so we cannot filter on sex linkage, which would be our first port of call. So lets start with call rate.

gl.report.callrate(gl)
Starting gl.report.callrate 
  Processing genlight object with SNP data
  Reporting Call Rate by Locus
  No. of loci = 31289 
  No. of individuals = 141 
    Minimum      :  0.0070922 
    1st quartile :  0.87234 
    Median       :  0.992908 
    Mean         :  0.8839168 
    3r quartile  :  1 
    Maximum      :  1 
    Missing Rate Overall:  0.1161 

   Quantile Threshold Retained Percent Filtered Percent
1      100% 1.0000000    13396    42.8    17893    57.2
2       95% 1.0000000    13396    42.8    17893    57.2
3       90% 1.0000000    13396    42.8    17893    57.2
4       85% 1.0000000    13396    42.8    17893    57.2
5       80% 1.0000000    13396    42.8    17893    57.2
6       75% 1.0000000    13396    42.8    17893    57.2
7       70% 1.0000000    13396    42.8    17893    57.2
8       65% 1.0000000    13396    42.8    17893    57.2
9       60% 1.0000000    13396    42.8    17893    57.2
10      55% 0.9929080    15752    50.3    15537    49.7
11      50% 0.9929080    15752    50.3    15537    49.7
12      45% 0.9787230    18056    57.7    13233    42.3
13      40% 0.9716310    18829    60.2    12460    39.8
14      35% 0.9503550    20399    65.2    10890    34.8
15      30% 0.9148940    22138    70.8     9151    29.2
16      25% 0.8723400    23655    75.6     7634    24.4
17      20% 0.8226950    25112    80.3     6177    19.7
18      15% 0.7517730    26704    85.3     4585    14.7
19      10% 0.6312060    28187    90.1     3102     9.9
20       5% 0.3049650    29732    95.0     1557     5.0
21       0% 0.0070922    31289   100.0        0     0.0

Completed: gl.report.callrate 

The distribution of call rate values indicates a need for filtering, and based on that distribution we pick a threshold of 0.95.

Any loci with less than 95% scored across individuals will be discarded.

gl.filter.callrate(gl, threshold=0.95)
Starting gl.filter.callrate 
  Processing genlight object with SNP data
  Recalculating Call Rate
  Removing loci based on Call Rate, threshold = 0.95 
  Summary of filtered dataset
    Call Rate for loci > 0.95 
    Original No. of loci : 31289 
    Original No. of individuals: 141 
    No. of loci retained: 20399 
    No. of individuals retained: 141 
    No. of populations:  4 

Completed: gl.filter.callrate 

Next we filter on callrate by individual.

gl.report.callrate(gl,method="ind")
Starting gl.report.callrate 
  Processing genlight object with SNP data

  Reporting Call Rate by Individual
  No. of loci = 31289 
  No. of individuals = 141 
    Minimum      :  0.797277 
    1st quartile :  0.8339352 
    Median       :  0.9124932 
    Mean         :  0.8839168 
    3r quartile  :  0.9163923 
    Maximum      :  0.9251814 
    Missing Rate Overall:  0.1161 

Listing 4 populations and their average CallRates
  Monitor again after filtering
  Population CallRate  N
1 Border-Nth   0.8448 13
2 Border-Sth   0.9183 41
3     Gwydir   0.8923 65
4      Namoi   0.8183 22

Listing 20 individuals with the lowest CallRates
  Use this list to see which individuals will be lost on filtering by individual
  Set ind.to.list parameter to see more individuals
   Individual Population  CallRate
1    AA042480      Namoi 0.7972770
2    AA042489      Namoi 0.8008885
3    AA042490      Namoi 0.8025824
4    AA042483      Namoi 0.8049155
5    AA042487      Namoi 0.8059382
6    AA042485      Namoi 0.8079836
7    AA004440      Namoi 0.8101569
8     AA18895 Border-Nth 0.8114673
9    AA004435      Namoi 0.8141200
10     Q19145 Border-Nth 0.8142159
11   AA042497     Gwydir 0.8143437
12     Q19128 Border-Nth 0.8147911
13   AA004436      Namoi 0.8150468
14   AA004460     Gwydir 0.8166448
15   AA042481      Namoi 0.8175078
16   AA004433      Namoi 0.8189779
17     Q19130 Border-Nth 0.8205440
18   AA004442      Namoi 0.8209914
19   AA061829     Gwydir 0.8233245
20     Q19131 Border-Nth 0.8238359

)

Completed: gl.report.callrate 

No individuals have an unacceptable call rate so we do not filter any out.

gl.report.rdepth(gl)
Starting gl.report.rdepth 
  Processing genlight object with SNP data
  Reporting Read Depth by Locus
  No. of loci = 31289 
  No. of individuals = 141 
    Minimum      :  2.5 
    1st quartile :  6.6 
    Median       :  10.4 
    Mean         :  12.50514 
    3r quartile  :  16.5 
    Maximum      :  109.9 
    Missing Rate Overall:  0.12 

   Quantile Threshold Retained Percent Filtered Percent
1      100%     109.9        1     0.0    31288   100.0
2       95%      28.4     1567     5.0    29722    95.0
3       90%      23.7     3170    10.1    28119    89.9
4       85%      20.7     4703    15.0    26586    85.0
5       80%      18.4     6272    20.0    25017    80.0
6       75%      16.5     7841    25.1    23448    74.9
7       70%      14.9     9476    30.3    21813    69.7
8       65%      13.6    10958    35.0    20331    65.0
9       60%      12.4    12538    40.1    18751    59.9
10      55%      11.3    14148    45.2    17141    54.8
11      50%      10.4    15647    50.0    15642    50.0
12      45%       9.5    17403    55.6    13886    44.4
13      40%       8.8    18886    60.4    12403    39.6
14      35%       8.1    20368    65.1    10921    34.9
15      30%       7.3    22033    70.4     9256    29.6
16      25%       6.6    23484    75.1     7805    24.9
17      20%       5.8    25199    80.5     6090    19.5
18      15%       5.1    26636    85.1     4653    14.9
19      10%       4.4    28252    90.3     3037     9.7
20       5%       3.7    29879    95.5     1410     4.5
21       0%       2.5    31289   100.0        0     0.0
Completed: gl.report.rdepth 

Filter on read depth, selecting 5x as a minimum.

gl.filter.rdepth(gl, lower=5)
Starting gl.filter.rdepth 
  Processing genlight object with SNP data
  Removing loci with rdepth <= 5 and >= 1000 

  Summary of filtered dataset
    Initial no. of loci = 31289 
    No. of loci deleted = 4424 
    No. of loci retained: 26865 
    No. of individuals: 141 
    No. of populations:  4 
Completed: gl.filter.rdepth 
 ********************
 *** DARTR OBJECT ***
 ********************

 ** 141 genotypes,  26,865 SNPs , size: 72.4 Mb

    missing data: 377576 (=9.97 %) scored as NA

 ** Genetic data
   @gen: list of 141 SNPbin
   @ploidy: ploidy of each individual  (range: 2-2)

 ** Additional data
   @ind.names:  141 individual labels
   @loc.names:  26865 locus labels
   @loc.all:  26865 allele labels
   @position: integer storing positions of the SNPs [within 69 base sequence]
   @pop: population of each individual (group size range: 13-65)
   @other: a list containing: loc.metrics, ind.metrics, latlon, loc.metrics.flags, verbose, history 
    @other$ind.metrics: id, pop, drainage, species, lat, lon, service, plate_location 
    @other$loc.metrics: AlleleID, CloneID, AlleleSequence, TrimmedSequence, Chrom_Myuchelys_georgesi_rMyuGeo1.pri, ChromPosTag_Myuchelys_georgesi_rMyuGeo1.pri, ChromPosSnp_Myuchelys_georgesi_rMyuGeo1.pri, AlnCnt_Myuchelys_georgesi_rMyuGeo1.pri, AlnEvalue_Myuchelys_georgesi_rMyuGeo1.pri, Strand_Myuchelys_georgesi_rMyuGeo1.pri, SNP, SnpPosition, CallRate, OneRatioRef, OneRatioSnp, FreqHomRef, FreqHomSnp, FreqHets, PICRef, PICSnp, AvgPIC, AvgCountRef, AvgCountSnp, RepAvg, clone, uid, rdepth, monomorphs, maf, OneRatio, PIC 
   @other$latlon[g]: coordinates for all individuals are attached

Then report on reproducibility. Recall that DArT run technical replicates to check the reliability of calls.

gl.report.reproducibility(gl)
Starting gl.report.reproducibility 
  Processing genlight object with SNP data
  Reporting Repeatability by Locus
  No. of loci = 31289 
  No. of individuals = 141 
    Minimum      :  0.930233 
    1st quartile :  0.983051 
    Median       :  0.992857 
    Mean         :  0.9890095 
    3r quartile  :  1 
    Maximum      :  1 
    Missing Rate Overall:  0.12 

   Quantile Threshold Retained Percent Filtered Percent
1      100%  1.000000    14560    46.5    16729    53.5
2       95%  1.000000    14560    46.5    16729    53.5
3       90%  1.000000    14560    46.5    16729    53.5
4       85%  1.000000    14560    46.5    16729    53.5
5       80%  1.000000    14560    46.5    16729    53.5
6       75%  1.000000    14560    46.5    16729    53.5
7       70%  1.000000    14560    46.5    16729    53.5
8       65%  1.000000    14560    46.5    16729    53.5
9       60%  1.000000    14560    46.5    16729    53.5
10      55%  1.000000    14560    46.5    16729    53.5
11      50%  0.992857    15668    50.1    15621    49.9
12      45%  0.991667    17337    55.4    13952    44.6
13      40%  0.991228    18776    60.0    12513    40.0
14      35%  0.988889    20346    65.0    10943    35.0
15      30%  0.985714    22052    70.5     9237    29.5
16      25%  0.983051    23708    75.8     7581    24.2
17      20%  0.980000    25123    80.3     6166    19.7
18      15%  0.975000    26626    85.1     4663    14.9
19      10%  0.967742    28169    90.0     3120    10.0
20       5%  0.956897    29735    95.0     1554     5.0
21       0%  0.930233    31289   100.0        0     0.0
Completed: gl.report.reproducibility 

There are quite a few loci that show poor reproducibility. So lets filter out those with a reproducibility of less than 99%

gl.filter.reproducibility(gl, threshold=0.99)
Starting gl.filter.reproducibility 
  Processing genlight object with SNP data
  Removing loci with repeatability less than 0.99 

  Summary of filtered dataset
    Retaining loci with repeatability >= 0.99 
    Original no. of loci: 31289 
    No. of loci discarded: 11431 
    No. of loci retained: 19858 
    No. of individuals: 141 
    No. of populations:  4 
Completed: gl.filter.reproducibility 
 ********************
 *** DARTR OBJECT ***
 ********************

 ** 141 genotypes,  19,858 SNPs , size: 69.6 Mb

    missing data: 371269 (=13.26 %) scored as NA

 ** Genetic data
   @gen: list of 141 SNPbin
   @ploidy: ploidy of each individual  (range: 2-2)

 ** Additional data
   @ind.names:  141 individual labels
   @loc.names:  19858 locus labels
   @loc.all:  19858 allele labels
   @position: integer storing positions of the SNPs [within 69 base sequence]
   @pop: population of each individual (group size range: 13-65)
   @other: a list containing: loc.metrics, ind.metrics, latlon, loc.metrics.flags, verbose, history 
    @other$ind.metrics: id, pop, drainage, species, lat, lon, service, plate_location 
    @other$loc.metrics: AlleleID, CloneID, AlleleSequence, TrimmedSequence, Chrom_Myuchelys_georgesi_rMyuGeo1.pri, ChromPosTag_Myuchelys_georgesi_rMyuGeo1.pri, ChromPosSnp_Myuchelys_georgesi_rMyuGeo1.pri, AlnCnt_Myuchelys_georgesi_rMyuGeo1.pri, AlnEvalue_Myuchelys_georgesi_rMyuGeo1.pri, Strand_Myuchelys_georgesi_rMyuGeo1.pri, SNP, SnpPosition, CallRate, OneRatioRef, OneRatioSnp, FreqHomRef, FreqHomSnp, FreqHets, PICRef, PICSnp, AvgPIC, AvgCountRef, AvgCountSnp, RepAvg, clone, uid, rdepth, monomorphs, maf, OneRatio, PIC 
   @other$latlon[g]: coordinates for all individuals are attached
gl.smearplot(gl)
  Processing genlight object with SNP data
Starting gl.smearplot 

Completed: gl.smearplot 

Step 5 Exploratory visualization:

pca <- gl.pcoa(gl)
Starting gl.pcoa 
  Processing genlight object with SNP data
  Performing a PCA, individuals as entities, loci as attributes, SNP genotype as state

  Ordination yielded 3 informative dimensions( broken-stick criterion) from 140 original dimensions
    PCA Axis 1 explains 22.8 % of the total variance
    PCA Axis 1 and 2 combined explain 29.9 % of the total variance
    PCA Axis 1-3 combined explain 34.1 % of the total variance
Starting gl.colors 
Selected color type 2 
Completed: gl.colors 

Completed: gl.pcoa 
gl.pcoa.plot(pca,gl,ellipse=TRUE, plevel=0.95)
Starting gl.pcoa.plot 
  Processing an ordination file (glPca)
  Processing genlight object with SNP data
  Plotting populations in a space defined by the SNPs
  Preparing plot .... please wait

Completed: gl.pcoa.plot 

Note: You might be tempted to filter out all loci that are less than 100% reliable, but this can cause an ascertainment bias. In particular, loci that are heterozygous will be called with less certainty, and so filtering that is too stringent will preferentially remove more polymophic sites.

Now lets visualize the structure with PCA. Note that we do not need to impute because the missing value rate is only 0.6%, unlikely to cause appreciable distortion.

gl.pcoa.plot(pca,gl,xaxis=1, yaxis=3, ellipse = TRUE, plevel=0.95)
Starting gl.pcoa.plot 
  Processing an ordination file (glPca)
  Processing genlight object with SNP data
  Plotting populations in a space defined by the SNPs
  Preparing plot .... please wait

Completed: gl.pcoa.plot 

Four nice clusters in the space defined by the top two dimensions. Note however that, by the split stick criterion, there are three informative dimensions. Separation can be believed, but proximity is ambiguous. So the proximity of the two drainages within the border rivers region may not be sustained when we examine the third dimension.

gl.pcoa.plot(pca,gl,xaxis=1, yaxis=3,interactive=TRUE)

And indeed it is not. Why do we look at PC axis 1 and 2, then PC axis 1 and 3. Why not 3 and 4. First of all, axis 4 was not informative, and will likely be misleading if we do not scale the axes. Second, if you plot 1 vs 2 and 1 vs 3, you can flop 1 vs 3 back into the page and better visualize the 3D structure in the dataset.

Better still, you can plot the data in 3D and then manually apply a rotation of the axes to better show the separation of sites.

gl.pcoa.plot(pca,gl,xaxis=1, yaxis=2, zaxis=3)
Starting gl.pcoa.plot 
  Processing an ordination file (glPca)
  Processing genlight object with SNP data
  Displaying a three dimensional plot, mouse over for details for each point
  May need to zoom out to place 3D plot within bounds
Completed: gl.pcoa.plot 

All good. Four aggregations that can be distinguished by their allelic profiles folloing ordination and dimension reduction.

Step 6 Analysis:

F Statistics

# F Statistics
gl.report.fstat(gl)
Starting gl.report.fstat 
  Processing genlight object with SNP data
$Stat_matrices
$Stat_matrices$Fst
           Border-Nth Border-Sth Gwydir  Namoi
Border-Nth         NA     0.1066 0.2428 0.1617
Border-Sth     0.1066         NA 0.2044 0.1337
Gwydir         0.2428     0.2044     NA 0.2117
Namoi          0.1617     0.1337 0.2117     NA

$Stat_matrices$Fstp
           Border-Nth Border-Sth Gwydir  Namoi
Border-Nth         NA     0.1926 0.3907 0.2784
Border-Sth     0.1926         NA 0.3395 0.2358
Gwydir         0.3907     0.3395     NA 0.3494
Namoi          0.2784     0.2358 0.3494     NA

$Stat_matrices$Dest
           Border-Nth Border-Sth Gwydir  Namoi
Border-Nth         NA     0.0272 0.0595 0.0405
Border-Sth     0.0272         NA 0.0575 0.0368
Gwydir         0.0595     0.0575     NA 0.0526
Namoi          0.0405     0.0368 0.0526     NA

$Stat_matrices$Gst_H
           Border-Nth Border-Sth Gwydir  Namoi
Border-Nth         NA     0.2281 0.4505 0.3254
Border-Sth     0.2281         NA 0.4022 0.2815
Gwydir         0.4505     0.4022     NA 0.4056
Namoi          0.3254     0.2815 0.4056     NA


$Stat_tables
      Stat_tables.Border.Nth_vs_Border.Sth Stat_tables.Border.Nth_vs_Gwydir
Fst                                 0.1066                           0.2428
Fstp                                0.1926                           0.3907
Dest                                0.0272                           0.0595
Gst_H                               0.2281                           0.4505
      Stat_tables.Border.Nth_vs_Namoi Stat_tables.Border.Sth_vs_Gwydir
Fst                            0.1617                           0.2044
Fstp                           0.2784                           0.3395
Dest                           0.0405                           0.0575
Gst_H                          0.3254                           0.4022
      Stat_tables.Border.Sth_vs_Namoi Stat_tables.Gwydir_vs_Namoi
Fst                            0.1337                      0.2117
Fstp                           0.2358                      0.3494
Dest                           0.0368                      0.0526
Gst_H                          0.2815                      0.4056

Starting gl.plot.heatmap 
Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
Also defined by 'spam'
  Processing a data matrix
Warning in rep(col, length.out = leaves_length): 'x' is NULL so the result will
be NULL
Warning in rep(if (nam %in% names(L)) L[[nam]] else default, length.out =
indx): 'x' is NULL so the result will be NULL
Warning in rep(if (nam %in% names(L)) L[[nam]] else default, length.out =
indx): 'x' is NULL so the result will be NULL
Warning in rep(if (nam %in% names(L)) L[[nam]] else default, length.out =
indx): 'x' is NULL so the result will be NULL
Warning in rep(if (nam %in% names(L)) L[[nam]] else default, length.out =
indx): 'x' is NULL so the result will be NULL

Completed: gl.plot.heatmap 
Completed: gl.report.fstat 
$Stat_matrices
$Stat_matrices$Fst
           Border-Nth Border-Sth Gwydir  Namoi
Border-Nth         NA     0.1066 0.2428 0.1617
Border-Sth     0.1066         NA 0.2044 0.1337
Gwydir         0.2428     0.2044     NA 0.2117
Namoi          0.1617     0.1337 0.2117     NA

$Stat_matrices$Fstp
           Border-Nth Border-Sth Gwydir  Namoi
Border-Nth         NA     0.1926 0.3907 0.2784
Border-Sth     0.1926         NA 0.3395 0.2358
Gwydir         0.3907     0.3395     NA 0.3494
Namoi          0.2784     0.2358 0.3494     NA

$Stat_matrices$Dest
           Border-Nth Border-Sth Gwydir  Namoi
Border-Nth         NA     0.0272 0.0595 0.0405
Border-Sth     0.0272         NA 0.0575 0.0368
Gwydir         0.0595     0.0575     NA 0.0526
Namoi          0.0405     0.0368 0.0526     NA

$Stat_matrices$Gst_H
           Border-Nth Border-Sth Gwydir  Namoi
Border-Nth         NA     0.2281 0.4505 0.3254
Border-Sth     0.2281         NA 0.4022 0.2815
Gwydir         0.4505     0.4022     NA 0.4056
Namoi          0.3254     0.2815 0.4056     NA


[[2]]
      Stat_tables.Border.Nth_vs_Border.Sth Stat_tables.Border.Nth_vs_Gwydir
Fst                                 0.1066                           0.2428
Fstp                                0.1926                           0.3907
Dest                                0.0272                           0.0595
Gst_H                               0.2281                           0.4505
      Stat_tables.Border.Nth_vs_Namoi Stat_tables.Border.Sth_vs_Gwydir
Fst                            0.1617                           0.2044
Fstp                           0.2784                           0.3395
Dest                           0.0405                           0.0575
Gst_H                          0.3254                           0.4022
      Stat_tables.Border.Sth_vs_Namoi Stat_tables.Gwydir_vs_Namoi
Fst                            0.1337                      0.2117
Fstp                           0.2358                      0.3494
Dest                           0.0368                      0.0526
Gst_H                          0.2815                      0.4056

The structure in the PCA is reflected in the Fst values.

0–0.05: little differentiation 0.05–0.15: low–moderate 0.15–0.25: moderate > 0.25: high / very high

Border Nth vs South : low-moderate Rest : Moderate

What about allelic richness.

# Allelic Richness
r1 <- gl.report.allelerich(gl)
Starting gl.report.allelerich 
  Processing genlight object with SNP data
  Calculating Allelic Richness, averaged across
                    loci, for each population
Starting gl.colors 
Selected color type dis 
Completed: gl.colors 
  Returning a dataframe with allelic richness values
Completed: gl.report.allelerich 

Nothing spectacular there.

gl.report.heterozygosity(gl)
Starting gl.report.heterozygosity 
  Processing genlight object with SNP data
  Calculating Observed Heterozygosities, averaged across
                    loci, for each population
  Calculating Expected Heterozygosities

Starting gl.colors 
Selected color type dis 
Completed: gl.colors 
  Reporting Heterozygosity by Population

  No. of loci = 31289 
  No. of individuals = 141 
  No. of populations = 4 
    Minimum Observed Heterozygosity:  0.059274 
    Maximum Observed Heterozygosity:  0.094814 
    Average Observed Heterozygosity:  0.073427 

    Minimum Unbiased Expected Heterozygosity:  0.078557 
    Maximum Unbiased Expected Heterozygosity:  0.11251 
    Average Unbiased Expected Heterozygosity:  0.093563 
  Heterozygosity estimates not corrected for uncalled invariant loci

                  pop    n.Ind n.Loc n.Loc.adj polyLoc monoLoc all_NALoc
Border-Nth Border-Nth 11.37846 30201         1   10030   20171      1088
Border-Sth Border-Sth 37.87628 31102         1   18810   12292       187
Gwydir         Gwydir 58.40274 31072         1   19520   11552       217
Namoi           Namoi 18.54781 30368         1   11547   18821       921
                 Ho     HoSD     HoSE       He     HeSD     HeSE      uHe
Border-Nth 0.069692 0.150969 0.000869 0.084082 0.156419 0.000900 0.087947
Border-Sth 0.094814 0.160880 0.000912 0.111025 0.166190 0.000942 0.112510
Gwydir     0.059274 0.125246 0.000711 0.077885 0.142433 0.000808 0.078557
Namoi      0.069927 0.145044 0.000832 0.092670 0.162685 0.000934 0.095238
              uHeSD    uHeSE      FIS    FISSD    FISSE
Border-Nth 0.163608 0.000941 0.169373 0.379021 0.004008
Border-Sth 0.168413 0.000955 0.104848 0.304782 0.002233
Gwydir     0.143663 0.000815 0.113107 0.299205 0.002154
Namoi      0.167192 0.000959 0.209529 0.376625 0.003654
  Returning a dataframe with heterozygosity values
Completed: gl.report.heterozygosity 

Again, not a lot in that. No strong evidence of inbreeding depression in any of the four aggregations.

gl.report.diversity(gl)
Starting gl.report.diversity 
  Processing genlight object with SNP data
Starting gl.filter.allna 
  Identifying and removing loci and individuals scored all 
                missing (NA)
  Deleting loci that are scored as all missing (NA)
  Zero loci that are missing (NA) across all individuals
  Deleting individuals that are scored as all missing (NA)
  Zero individuals that are missing (NA) across all loci
Completed: gl.filter.allna 
Warning in one_H_alpha_es[[x[1]]]$dummys[i1 %in% index] +
one_H_alpha_es[[x[2]]]$dummys[i2 %in% : longer object length is not a multiple
of shorter object length
Warning in one_H_alpha_all[i0 %in% index] - (one_H_alpha_es[[x[1]]]$dummys[i1
%in% : longer object length is not a multiple of shorter object length
Warning in one_H_alpha_es[[x[1]]]$dummys[i1 %in% index] +
one_H_alpha_es[[x[2]]]$dummys[i2 %in% : longer object length is not a multiple
of shorter object length
Warning in one_H_alpha_all[i0 %in% index] - (one_H_alpha_es[[x[1]]]$dummys[i1
%in% : longer object length is not a multiple of shorter object length
Warning in one_H_alpha_es[[x[1]]]$dummys[i1 %in% index] +
one_H_alpha_es[[x[2]]]$dummys[i2 %in% : longer object length is not a multiple
of shorter object length
Warning in one_H_alpha_all[i0 %in% index] - (one_H_alpha_es[[x[1]]]$dummys[i1
%in% : longer object length is not a multiple of shorter object length
Warning in one_H_alpha_es[[x[1]]]$dummys[i1 %in% index] +
one_H_alpha_es[[x[2]]]$dummys[i2 %in% : longer object length is not a multiple
of shorter object length
Warning in one_H_alpha_all[i0 %in% index] - (one_H_alpha_es[[x[1]]]$dummys[i1
%in% : longer object length is not a multiple of shorter object length
Warning in one_H_alpha_es[[x[1]]]$dummys[i1 %in% index] +
one_H_alpha_es[[x[2]]]$dummys[i2 %in% : longer object length is not a multiple
of shorter object length
Warning in one_H_alpha_all[i0 %in% index] - (one_H_alpha_es[[x[1]]]$dummys[i1
%in% : longer object length is not a multiple of shorter object length
Warning in one_H_alpha_es[[x[1]]]$dummys[i1 %in% index] +
one_H_alpha_es[[x[2]]]$dummys[i2 %in% : longer object length is not a multiple
of shorter object length
Warning in one_H_alpha_all[i0 %in% index] - (one_H_alpha_es[[x[1]]]$dummys[i1
%in% : longer object length is not a multiple of shorter object length
Starting gl.colors 
Selected color type dis 
Completed: gl.colors 



|           | nloci| m_0Ha| sd_0Ha| m_1Ha| sd_1Ha| m_2Ha| sd_2Ha|
|:----------|-----:|-----:|------:|-----:|------:|-----:|------:|
|Border-Nth | 30201| 0.296|  0.457| 0.125|  0.225| 0.084|  0.156|
|Border-Sth | 31102| 0.599|  0.490| 0.177|  0.237| 0.111|  0.166|
|Gwydir     | 31072| 0.621|  0.485| 0.129|  0.204| 0.078|  0.142|
|Namoi      | 30368| 0.350|  0.477| 0.139|  0.234| 0.093|  0.163|


pairwise non-missing loci

|           | Border-Nth| Border-Sth| Gwydir| Namoi|
|:----------|----------:|----------:|------:|-----:|
|Border-Nth |         NA|         NA|     NA|    NA|
|Border-Sth |      30154|         NA|     NA|    NA|
|Gwydir     |      30064|      30892|     NA|    NA|
|Namoi      |      29807|      30274|  30230|    NA|


0_H_beta

|           | Border-Nth| Border-Sth| Gwydir| Namoi|
|:----------|----------:|----------:|------:|-----:|
|Border-Nth |         NA|      0.252|  0.257| 0.230|
|Border-Sth |      0.218|         NA|  0.255| 0.254|
|Gwydir     |      0.272|      0.299|     NA| 0.257|
|Namoi      |      0.139|      0.228|  0.271|    NA|


1_H_beta

|           | Border-Nth| Border-Sth| Gwydir| Namoi|
|:----------|----------:|----------:|------:|-----:|
|Border-Nth |         NA|      0.277|  0.265| 0.279|
|Border-Sth |      0.041|         NA|  0.270| 0.280|
|Gwydir     |      0.064|      0.047|     NA| 0.268|
|Namoi      |      0.056|      0.035|  0.059|    NA|


2_H_beta

|           | Border-Nth| Border-Sth| Gwydir| Namoi|
|:----------|----------:|----------:|------:|-----:|
|Border-Nth |         NA|      0.160|  0.143| 0.161|
|Border-Sth |      0.028|         NA|  0.127| 0.146|
|Gwydir     |      0.060|      0.048|     NA| 0.141|
|Namoi      |      0.039|      0.025|  0.055|    NA|


|           | nloci| m_0Da| sd_0Da| m_1Da| sd_1Da| m_2Da| sd_2Da|
|:----------|-----:|-----:|------:|-----:|------:|-----:|------:|
|Border-Nth | 30201| 1.296|  0.457| 1.166|  0.311| 1.140|  0.283|
|Border-Sth | 31102| 1.599|  0.490| 1.230|  0.331| 1.181|  0.305|
|Gwydir     | 31072| 1.621|  0.485| 1.165|  0.284| 1.124|  0.259|
|Namoi      | 30368| 1.350|  0.477| 1.185|  0.323| 1.155|  0.296|


pairwise non-missing loci

|           | Border-Nth| Border-Sth| Gwydir| Namoi|
|:----------|----------:|----------:|------:|-----:|
|Border-Nth |         NA|         NA|     NA|    NA|
|Border-Sth |      30154|         NA|     NA|    NA|
|Gwydir     |      30064|      30892|     NA|    NA|
|Namoi      |      29807|      30274|  30230|    NA|


0_D_beta

|           | Border-Nth| Border-Sth| Gwydir| Namoi|
|:----------|----------:|----------:|------:|-----:|
|Border-Nth |         NA|      0.252|  0.257| 0.230|
|Border-Sth |      1.218|         NA|  0.255| 0.254|
|Gwydir     |      1.272|      1.299|     NA| 0.257|
|Namoi      |      1.139|      1.228|  1.271|    NA|


1_D_beta

|           | Border-Nth| Border-Sth| Gwydir| Namoi|
|:----------|----------:|----------:|------:|-----:|
|Border-Nth |         NA|      0.323|  0.315| 0.329|
|Border-Sth |      1.084|         NA|  0.315| 0.324|
|Gwydir     |      1.106|      1.088|     NA| 0.317|
|Namoi      |      1.102|      1.079|  1.101|    NA|


2_D_beta

|           | Border-Nth| Border-Sth| Gwydir| Namoi|
|:----------|----------:|----------:|------:|-----:|
|Border-Nth |         NA|      0.183|  0.174| 0.186|
|Border-Sth |      1.042|         NA|  0.157| 0.170|
|Gwydir     |      1.074|      1.059|     NA| 0.171|
|Namoi      |      1.054|      1.037|  1.068|    NA|
Completed: gl.report.diversity 
# Private Alleles
r1 <- gl.report.pa(gl)
Starting gl.report.pa 
  Processing genlight object with SNP data
  p1 p2       pop1       pop2 N1 N2 fixed priv1 priv2 Chao1 Chao2 totalpriv
1  1  2 Border-Nth Border-Sth 13 41   111  1950 11174   120   314     13124
2  1  3 Border-Nth     Gwydir 13 65   243  3258 13084   120   682     16342
3  1  4 Border-Nth      Namoi 13 22   159  3321  4948   120   265      8269
4  2  3 Border-Sth     Gwydir 41 65   296  8874  9571   314   682     18445
5  2  4 Border-Sth      Namoi 41 22   151 10684  3093   314   265     13777
6  3  4     Gwydir      Namoi 65 22   234 12302  4057   682   265     16359
    AFD asym asym.sig
1 0.084   NA       NA
2 0.116   NA       NA
3 0.097   NA       NA
4 0.122   NA       NA
5 0.097   NA       NA
6 0.110   NA       NA
  Table of private alleles and fixed differences returned
Completed: gl.report.pa 

Again, not a lot to see there, though there seems to be a discrepency between q=0 and the results from gl.report.heterozygosity. Coders?

Lets see where these four populations sit in relation to expectation build on the global trend in allelic richness against population size. This takes some time to run, so maybe do this at another time.

# Allelic Richness Simulation
r2 <- gl.report.nall(  x = gl,  simlevels = seq(1, nInd(gl), 5),  reps = 10,  plot.colors.pop = gl.colors("dis"),  ncores = 2)
Starting gl.report.nall 
  Processing genlight object with SNP data
Starting gl.filter.allna 
  Identifying and removing loci that are all missing (NA) 
                    in any one population
  Deleting loci that are all missing (NA) in any one population
     Border-Nth : deleted 1088 loci
     Border-Sth : deleted 187 loci
     Gwydir : deleted 217 loci
     Namoi : deleted 921 loci

  Loci all NA in one or more populations: 1608 deleted

Completed: gl.filter.allna 
Starting gl.colors 
Selected color type dis 
Completed: gl.colors 

Completed: gl.report.nall 

Questions:

  • Is there structure in the dataset evident in the PCA?
  • Is this borne out by FST analysis?
  • Are there discrete management units evident?
  • Are any of these showing evidence of critical loss of genetic diversity/inbreeding?
  • What are the lessons/advice for management?