W15 SNP that Matter

Author

Elise Furlan, Bernd Gruber & collaborators

Learning outcomes

In this session we discuss what SNP panels are for and when they offer a practical alternative to genome-wide SNP datasets. We also look at how panel selection can target SNPs using different criteria, how that choice affects which population genetic metrics a panel can estimate well, and therefore which conservation questions it can address.

By the end of this session you should be able to:

Explain why we use reduced SNP panels in conservation genetics (cost, logistics, speed, monitoring)
Define a panel goal (e.g. population assignment / structure, relatedness / parentage, individual ID, sex-linked, hybrid detection, adaptive markers)
Apply a quality-control workflow for panel design (missingness, call rate, MAF, sequence length, alignment quality)
Use gl.select.panel() to create panels using alternative strategies (e.g. dapc, random, pic)
Use gl.check.panel() to evaluate how well a panel recovers a target parameter (e.g. FST)
Design and compare replicate panels (stability / robustness)

Context: Why SNP panels?

Genome-wide SNP datasets are fantastic for discovery and deep inference, but monitoring programs often need something else:

cheaper per sample
robust on low-quality DNA (sometimes)
repeatable across years
scalable (lots of samples, many timepoints)
quick turnaround for management decisions

Typical conservation use-cases

assigning individuals to management units
detecting illegal translocations / admixture
parentage and pedigree tracking (captive or reintroduction programs)
long-term genetic monitoring (He, FIS, FST trends)
hybridisation detection
sex-linked markers for skewed sex ratios / demographic monitoring

Prerequisites

You should be comfortable with:

reading a genlight object
basic filtering in the dartRverse (gl.filter.*)
interpreting simple reports (call rate, missingness, MAF)

No prior experience with panel design is assumed.

Setup

library(dartRverse)
library(ggplot2)
library(knitr)
library(ggpmisc)

Data for this tutorial

We use an example SNP dataset (pre-filtered genome-wide SNPs), plus a version that already includes BLAST alignment metrics.

# path.data is assumed to point to the folder holding the workshop data files
rfbe      <- readRDS(file.path(path.data, "rfbe.rds"))
rfbe20_5  <- readRDS(file.path(path.data, "rfbe20_5_blast.rds"))

Interpretation questions

How many loci / individuals do we start with?
How many populations?
Are some populations very small? (this matters for the panel goal)

Part 1 — Decide what the panel is for

Panel goal → design choices

A panel for population structure (FST / assignment) often benefits from:

loci that differentiate populations (high between-pop variance)
moderate allele frequencies (informative, not singletons)
broad coverage across the genome (avoid tight linkage when possible)

A panel for relatedness / parentage often benefits from:

high heterozygosity loci (high information per locus)
more loci than structure panels (often)
careful QC to minimise genotyping error

A panel for sex-linked SNPs benefits from:

strong sex association
robust genotype calls
validation on independent samples

In this hands-on session we focus on a structure / assignment panel of 50 SNPs and evaluate it by how well it recovers pairwise FST.
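
The design choices above can be made concrete with a small base-R sketch. It is illustrative only: the allele frequencies are made up, and the two scores (mean expected heterozygosity and among-population variance in allele frequency) are simple stand-ins for "informative for relatedness" and "informative for structure", not the criteria used by any dartRverse function.

```r
# Toy allele frequencies (rows = populations, columns = loci); values are made up.
freq <- rbind(
  popA = c(0.50, 0.10, 0.90, 0.45),
  popB = c(0.50, 0.80, 0.95, 0.55),
  popC = c(0.50, 0.20, 0.92, 0.50)
)
colnames(freq) <- paste0("snp", 1:4)

# Expected heterozygosity within populations, averaged per locus
# (high values suit relatedness / parentage panels).
he <- colMeans(2 * freq * (1 - freq))

# Variance of allele frequency among populations per locus
# (a crude proxy for how well a locus separates populations).
between_var <- apply(freq, 2, var)

round(rbind(he, between_var), 3)

# The best locus differs depending on the goal.
names(which.max(he))           # most heterozygous locus
names(which.max(between_var))  # most differentiating locus
```

In this toy example the locus with the highest heterozygosity has identical frequencies in every population, so it carries no information about structure. That is why the panel goal has to be fixed before loci are chosen.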

Exercise 1 (discussion)

Pick one purpose for your panel. Write down:

Purpose (choose one): structure / assignment, relatedness, individual ID, hybridisation, sex-linked, adaptive, other
What is the key output you care about? (e.g. FST matrix, assignment accuracy, PID, relatedness error)
What is a realistic budgeted panel size? (e.g. 50, 96, 192, 384 loci)

Part 2 — Quality filtering workflow (worked example)

Step 1 — Ensure adequate sample sizes per population

Many panel methods implicitly assume you have enough individuals in each population to estimate allele frequencies reliably.

We keep populations with > 10 individuals:

tt     <- table(pop(rfbe))
pop20  <- names(tt)[tt > 10]

rfbe20 <- gl.keep.pop(rfbe, pop.list = pop20)

table(pop(rfbe20))
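
To see why small populations are a problem, the sketch below computes the binomial standard error of an allele frequency estimated from n diploid individuals. It assumes random sampling of 2n allele copies; the helper name se_freq and the example frequency of 0.3 are chosen here for illustration.

```r
# Binomial standard error of an allele frequency estimate from n diploid
# individuals (2n sampled allele copies), assuming random sampling.
se_freq <- function(p, n) sqrt(p * (1 - p) / (2 * n))

n <- c(5, 10, 20, 50)

# Standard error shrinks with sample size, but only with the square root of n.
data.frame(n = n, se = round(se_freq(p = 0.3, n = n), 3))
```

Whatever cut-off you choose, the uncertainty in the per-population allele frequencies carries through to every locus ranking that depends on them.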

Step 2 — Remove loci with population-specific all-NA

# by.pop = TRUE drops loci that are entirely missing in at least one population
rfbe20_1 <- gl.filter.allna(rfbe20, by.pop = TRUE)

Step 3 — Call-rate filtering

rfbe20_2 <- gl.filter.callrate(rfbe20_1, threshold = 0.99)
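
As a reminder of what a locus call rate measures, here is a base-R sketch on a toy genotype matrix. It is not the implementation of gl.filter.callrate(); the matrix, the 0.75 threshold, and the choice to keep loci exactly at the threshold are assumptions made for this example.

```r
# Toy genotype matrix: rows = individuals, columns = loci,
# coded as 0/1/2 copies of the alternate allele, NA = missing call.
geno <- matrix(
  c(0, 1, NA, 2,
    1, 1, NA, 2,
    0, NA, 2, 2,
    0, 2, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:4))
)

# Call rate per locus = proportion of individuals with a non-missing genotype.
call_rate <- colMeans(!is.na(geno))
call_rate

# Keep loci at or above a threshold (0.75 only because the toy data are tiny).
keep    <- call_rate >= 0.75
geno_cr <- geno[, keep, drop = FALSE]
colnames(geno_cr)
```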

Step 4 — Minor allele count / frequency filtering

For panel design, rare alleles can be:

uninformative for many tasks
more sensitive to genotyping error
unstable across timepoints

Here we use a minor allele count (MAC) threshold of 5. gl.filter.maf() treats a threshold greater than 1 as a count rather than a frequency:

rfbe20_3 <- gl.filter.maf(rfbe20_2, threshold = 5, by.pop = FALSE)
nLoc(rfbe20_3)
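
The difference between minor allele count (MAC) and minor allele frequency (MAF) is easy to check by hand. The base-R sketch below uses a toy 0/1/2 genotype matrix; the data and the MAC cut-off of 2 are illustrative and unrelated to the rfbe dataset.

```r
# Toy genotypes: rows = individuals, columns = loci, NA = missing call.
geno <- matrix(
  c(0, 0, 1, 2,
    0, 1, 1, 2,
    0, 0, NA, 1,
    0, 0, 2, 2,
    1, 0, 1, 2),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("ind", 1:5), paste0("snp", 1:4))
)

alt_count <- colSums(geno, na.rm = TRUE)   # copies of the alternate allele
n_alleles <- 2 * colSums(!is.na(geno))     # allele copies actually genotyped

# The minor allele is whichever of the two alleles is rarer at that locus.
mac <- pmin(alt_count, n_alleles - alt_count)
maf <- mac / n_alleles

rbind(mac, maf = round(maf, 3))

# Loci passing a MAC threshold of 2 in this toy example.
names(which(mac >= 2))
```

A count threshold is independent of how many individuals were genotyped, whereas the same MAF threshold corresponds to very different counts in small and large datasets.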

Step 5 (optional but recommended) — Sequence length

Panel development often requires flanking sequence (primer/probe design).
We keep loci with trimmed sequences of length ≥ 30.

index    <- nchar(as.character(rfbe20_3@other$loc.metrics$TrimmedSequence)) >= 30
rfbe20_4 <- rfbe20_3[, index]
nLoc(rfbe20_4)

Step 6 (optional) — Alignment quality to a reference genome

If you have a reference genome, you can BLAST trimmed sequences to check for:

strong matches (bitscore)
uniqueness (avoid multi-mapping)
alignment length / mismatches

In this tutorial we load a pre-computed BLAST-annotated dataset:

rfbe20_5 <- readRDS(file.path(path.data, "rfbe20_5_blast.rds"))

Filter by a bitscore threshold (example: ≥ 100):

index <- rfbe20_5@other$loc.metrics$bitscore >= 100
index[is.na(index)] <- FALSE   # loci with no bitscore (NA) fail the filter

rfbe20_6 <- rfbe20_5[, index]
nLoc(rfbe20_6)

# Order individuals by population (some panel methods benefit from this)
rfbe20_6 <- rfbe20_6[order(pop(rfbe20_6)), ]

Exercise 2 — Explore the impact of QC thresholds

# Goal: Explore how QC thresholds change the number of loci you have available for panel design.
#
# 1) Try call-rate thresholds of 0.95, 0.98, 0.99.
# 2) Try minor allele count thresholds of 2, 5, 10.
# 3) Record how many loci remain after each combination.
#
# Tip: Start from rfbe20_1 each time so comparisons are fair.

# Example template:
# cr <- 0.99
# mac <- 5
# tmp <- gl.filter.callrate(rfbe20_1, threshold = cr)
# tmp <- gl.filter.maf(tmp, threshold = mac, by.pop = FALSE)
# nLoc(tmp)

Reflection

When you increase call-rate or MAF thresholds, what happens to locus count?
Why might stricter QC improve panel robustness (even if you lose loci)?

Part 3 — Selecting informative SNPs

Panel selection strategies in `gl.select.panel()`

Different methods target different signals. For example:

dapc: picks loci most informative for discriminating groups (good for structure / assignment)
pic / picdart: picks loci with high information content (often useful broadly)
hafall / hafpop: targets high allele frequency loci (sometimes helps genotyping stability)
pahigh: private allele patterns
random: a baseline comparison

methods <- c(
  "dapc", "random", "pahigh", "monopop", "stratified",
  "hafall", "hafpop", "pic", "picdart"
)
methods
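
To give a feel for what an information-content criterion rewards, the sketch below computes the polymorphism information content (PIC) of biallelic loci from their allele frequencies and keeps the top-ranked ones. It uses the standard biallelic PIC formula; whether the "pic" and "picdart" methods in gl.select.panel() compute it in exactly this way is not confirmed here, and the frequencies are made up.

```r
# PIC for a biallelic locus with allele frequencies p and q = 1 - p.
pic_biallelic <- function(p) {
  q <- 1 - p
  1 - (p^2 + q^2) - 2 * p^2 * q^2
}

# Toy overall allele frequencies for six hypothetical loci.
p <- c(snp1 = 0.05, snp2 = 0.50, snp3 = 0.30,
       snp4 = 0.95, snp5 = 0.45, snp6 = 0.20)

pic <- pic_biallelic(p)
round(pic, 3)

# Keep the nl loci with the highest PIC.
nl    <- 3
panel <- names(sort(pic, decreasing = TRUE))[seq_len(nl)]
panel
```

PIC peaks at a frequency of 0.5 and is symmetric around it, so a criterion like this favours intermediate-frequency loci regardless of whether they differ between populations.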

Worked example — Select a 50 SNP panel using DAPC

panel_dapc <- gl.select.panel(rfbe20_6, method = "dapc", nl = 50)
nLoc(panel_dapc)

Evaluate the panel — does it recover FST?

We compare the panel-derived FST to the full dataset FST.

out_dapc <- gl.check.panel(panel_dapc, rfbe20_6, parameter = "Fst")
out_dapc

Now compare to a random panel:

panel_random <- gl.select.panel(rfbe20_6, method = "random", nl = 50)
out_random   <- gl.check.panel(panel_random, rfbe20_6, parameter = "Fst")
out_random

Interpretation

A good structure panel should:

preserve the ranking of pairwise FST values
keep values close to the full dataset (small error)
be stable across replicate panels (not overly “lucky”)
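
The first two criteria can be quantified with a rank correlation and a root-mean-square error. The sketch below shows this on simulated allele frequencies in base R. The data are simulated, the FST estimator is a simple Nei-style ratio with no sample-size correction, and none of this is claimed to match what gl.check.panel() reports.

```r
set.seed(1)

# Simulate allele frequencies for 4 populations at 500 loci (toy data).
n_pop <- 4
n_loc <- 500
base  <- runif(n_loc, 0.1, 0.9)
freq  <- t(sapply(seq_len(n_pop), function(i) {
  pmin(pmax(base + rnorm(n_loc, sd = 0.05 * i), 0.01), 0.99)
}))

# Pairwise FST from allele frequencies (ratio of sums over loci).
pair_fst <- function(freq) {
  pairs <- combn(nrow(freq), 2)
  apply(pairs, 2, function(ij) {
    p1   <- freq[ij[1], ]
    p2   <- freq[ij[2], ]
    pbar <- (p1 + p2) / 2
    hs   <- (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-pop He
    ht   <- 2 * pbar * (1 - pbar)                        # He of the pooled pair
    sum(ht - hs) / sum(ht)
  })
}

fst_full  <- pair_fst(freq)                       # all loci
fst_panel <- pair_fst(freq[, sample(n_loc, 50)])  # random 50-locus panel

# Ranking preserved? Values close to the full dataset?
rank_cor <- cor(fst_full, fst_panel, method = "spearman")
rmse     <- sqrt(mean((fst_panel - fst_full)^2))
c(rank_cor = rank_cor, rmse = rmse)
```

A high rank correlation with a large error means the panel orders population pairs correctly but is biased in magnitude, which may or may not matter for your management question.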

Exercise 3 — Build your own SNP panel and evaluate it

# Your task:
# 1) Choose a panel method (e.g. "dapc", "pic", "picdart", "hafpop", "random")
# 2) Choose a panel size (nl), e.g. 25, 50, 100
# 3) Create the panel
# 4) Check how well it recovers FST relative to the full dataset
#
# Record:
# - method
# - nl
# - your interpretation of the output (good / ok / poor)

my_method <- "pic"   # change me
my_nl     <- 50      # change me

my_panel <- gl.select.panel(rfbe20_6, method = my_method, nl = my_nl)
nLoc(my_panel)

my_eval <- gl.check.panel(my_panel, rfbe20_6, parameter = "Fst")
my_eval

Challenge

Try the same method at nl = 25, 50, 100. How many loci do you need before the panel becomes “good enough” for your management question?

Part 4 — Replicate panels and robustness

Why replicate?

Some selection methods can be sensitive to randomness and sampling variation.
To avoid overconfidence from a single “lucky” panel, we generate replicate panels and compare performance.

A simple replicate loop (example)

set.seed(42)

B <- 20                # number of replicate panels
nl <- 50
method <- "dapc"

evals <- vector("list", B)

for (b in seq_len(B)) {
  # select a fresh panel, then score it against the full dataset
  p   <- gl.select.panel(rfbe20_6, method = method, nl = nl)
  ev  <- gl.check.panel(p, rfbe20_6, parameter = "Fst")
  evals[[b]] <- ev
}

# You can now summarise evals (depends on what gl.check.panel returns in your version)
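
Because the structure returned by gl.check.panel() may differ between versions, the sketch below shows only the summarising step, using a base-R simulation in place of real panels. It assumes you can reduce each replicate to a single numeric metric (here an RMSE against the full-data FST). The simulated frequencies, the simple FST estimator, and the two random panel sizes being compared are all illustrative choices.

```r
set.seed(42)

# Toy allele frequencies: 4 populations x 400 loci (simulated, not real data).
n_loc <- 400
base  <- runif(n_loc, 0.1, 0.9)
freq  <- t(sapply(1:4, function(i) {
  pmin(pmax(base + rnorm(n_loc, sd = 0.05 * i), 0.01), 0.99)
}))

# Pairwise FST from allele frequencies; (p1 - p2)^2 / 2 equals Ht - Hs per locus.
pair_fst <- function(f) {
  apply(combn(nrow(f), 2), 2, function(ij) {
    pbar <- (f[ij[1], ] + f[ij[2], ]) / 2
    sum((f[ij[1], ] - f[ij[2], ])^2 / 2) / sum(2 * pbar * (1 - pbar))
  })
}
fst_full <- pair_fst(freq)

# One row per replicate panel: strategy label, replicate number, error metric.
B   <- 20
res <- do.call(rbind, lapply(c(25, 100), function(nl) {
  data.frame(
    panel     = paste0("random_nl", nl),
    replicate = seq_len(B),
    rmse      = replicate(B, {
      idx <- sample(n_loc, nl)
      sqrt(mean((pair_fst(freq[, idx]) - fst_full)^2))
    })
  )
}))

# Mean = typical error; sd = stability across replicates.
summary_tab <- aggregate(rmse ~ panel, data = res,
                         FUN = function(x) c(mean = mean(x), sd = sd(x)))
summary_tab
```

The same long-format table (one row per method and replicate) works for Exercise 4: swap the simulated metric for whatever value you extract from your gl.check.panel() output.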

Exercise 4 — Panel stability

# Goal: Compare stability of TWO methods.
#
# 1) Pick two methods (e.g. "dapc" vs "pic" OR "dapc" vs "random")
# 2) Generate B replicate panels for each method
# 3) Evaluate each with gl.check.panel(parameter="Fst")
# 4) Decide: which method is more stable and why?

set.seed(1)

B  <- 10
nl <- 50

m1 <- "dapc"
m2 <- "random"

# Write your replicate code here (use the template above).
# Tip: store results in a data.frame with columns: method, replicate, metric(s)

Discussion

In a monitoring program, robustness matters. A method that is slightly worse on average but more stable may be preferable.

Part 5 — Wrap-up: practical checklist

A pragmatic SNP panel checklist

Define the primary purpose (structure vs parentage vs ID vs sex-linked…)
QC hard (missingness, call rate, MAC/MAF)
Ensure adequate sample sizes for allele frequency estimation
Consider lab constraints (sequence length, alignment uniqueness, flanking sequence)
Select candidate panels (multiple methods + sizes)
Evaluate against the full dataset (target parameter)
Test stability (replicate panels)
Validate on independent samples (ideally)
