W09 Hidden dartR Powers

Author

Bernd Gruber, Luis Mijangos & Arthur Georges

Published

March 9, 2026

W09 dartR developer

Choosing colours for discrete groups

Introduction

When plotting discrete categories such as populations or locations, it is helpful to use palettes with clearly distinguishable colours.

# Explore palettes available through dartRverse
gl.select.colors()

# Explore palettes from RColorBrewer
RColorBrewer::display.brewer.all()
RColorBrewer::brewer.pal.info

# Explore built-in palettes from grDevices
grDevices::palette.pals()
grDevices::hcl.pals()

# Base R also provides palettes such as:
# rainbow(), heat.colors(), topo.colors(), terrain.colors(), and cm.colors()
# Select a palette for populations or sampling locations
col2 <- gl.select.colors(
  x = possums.gl,
  library = "gr.hcl",
  palette = "Dark 3"
)

Adobe Color Wheel

Adobe Color Wheel is useful for designing palettes for categorical data.
It helps create balanced colour combinations while keeping groups easy to distinguish.
Its accessibility tools are also useful for checking how palettes appear to people with colour-vision deficiencies.

https://color.adobe.com/create/color-wheel

Reusing the same colours across plots

To keep population colours consistent across multiple plots, use the same vector in plot.colors.pop.

gl.report.diversity(possums.gl, plot.colors.pop = col2)
gl.report.heterozygosity(possums.gl, plot.colors.pop = col2)

Saving and customising a PCoA plot

The arguments plot.dir and plot.file can be used to save a plot object for later editing.

t1 <- possums.gl

pca <- gl.pcoa(t1)

gl.pcoa.plot(
  pca,
  t1,
  plot.dir = getwd(),
  plot.file = "test",
  pt.colors = col2,
  pt.size = 2
)
# Read the saved plot object and modify it with ggplot2 syntax
p1 <- readRDS("test.RDS")

p2 <- p1 +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    legend.position = "none"
  )

print(p2)

Alternative PCoA displays

# Interactive 2D plot
gl.pcoa.plot(
  pca,
  t1,
  pt.colors = col2,
  interactive = TRUE,
  pt.size = 2
)
# 3D plot using the third axis
gl.pcoa.plot(
  pca,
  t1,
  pt.colors = col2,
  zaxis = 3,
  pt.size = 3
)

Clustering methods

# Run sparse non-negative matrix factorisation
r1 <- gl.run.snmf(platypus.gl, maxK = 3, rep = 4)
# Plot ancestry coefficients and order individuals by dendrogram
gl.plot.snmf(
  snmf.result = r1,
  border.ind = 0.25,
  plot.K = 3,
  plot.theme = NULL,
  color.clusters = NULL,
  ind.name = TRUE,
  plot.out = TRUE,
  plot.file = NULL,
  plot.dir = NULL,
  den = TRUE,
  inverse.den = TRUE,
  x = platypus.gl,
  plot.colors.pop = c("red", "blue", "green")
)

Smear plots

Smear plots help visualise locus informativeness and marker behaviour across datasets.

# SNP data: remove loci with incomplete call rate and monomorphic loci
t1 <- gl.filter.callrate(platypus.gl, threshold = 1, mono.rm = TRUE)

gl.smearplot(
  t1,
  den = TRUE,
  plot.colors = gl.colors("4")
)
# Reorder loci by FST to highlight the most differentiated loci
fst_loc <- data.frame(
  order = 1:nLoc(t1),
  fst = utils.basic.stats(t1)$perloc$Fstp
)

fst_loc <- fst_loc[order(fst_loc$fst, decreasing = TRUE), ]
t1 <- t1[, fst_loc$order]

gl.smearplot(
  t1,
  den = TRUE,
  plot.colors = gl.colors("4")
)
# SilicoDArT data
t2 <- testset.gs

gl.smearplot(
  t2,
  den = TRUE,
  plot.colors = gl.colors("4")
)
# Reorder SilicoDArT loci by PIC (polymorphism information content)
pic_loc <- data.frame(
  order = 1:nLoc(t2),
  pic = t2$other$loc.metrics$PIC
)

pic_loc <- pic_loc[order(pic_loc$pic, decreasing = TRUE), ]
t2 <- t2[, pic_loc$order]

gl.smearplot(
  t2,
  den = TRUE,
  plot.colors = gl.colors("4")
)
# Group individuals by population
gl.smearplot(
  platypus.gl,
  group.pop = TRUE,
  plot.colors = gl.colors("4")
)

gl.smearplot(
  bandicoot.gl,
  den = TRUE,
  plot.colors = gl.colors("4")
)

Using genomic resources

BLAST

BLAST (Basic Local Alignment Search Tool) is a fast sequence-comparison method used to search a query sequence against a reference database and identify local matches.
In dartRverse, gl.blast() can create FASTA files, prepare a BLAST database, run BLAST, and retain the best hit per sequence.

BLAST+ can be downloaded from:
https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Common BLAST nucleotide tasks

  • megablast: fastest option for highly similar DNA sequences
  • dc-megablast: better for more divergent but still related sequences
  • blastn: general-purpose nucleotide search
  • blastn-short: optimised for short sequences such as primers or probes
t1 <- platypus.gl

t1 <- gl.blast(
  x = t1,
  ref_genome = "./data/chr_X3_mOrnAna1.pri.v4.fa.gz",
  task = "megablast",
  Percentage_identity = 70,
  Percentage_overlap = 0.8,
  bitscore = 50,
  number_of_threads = 2
)

Assign chromosome names and SNP positions

head(t1$other$loc.metrics)

t1$position <- t1$other$loc.metrics$ChromPos_Platypus_Chrom_NCBIv1
t1$chromosome <- t1$other$loc.metrics$Chrom_Platypus_Chrom_NCBIv1

head(t1$chromosome)

SNP density and linkage disequilibrium

p1 <- gl.plot.snp.density(
  x = t1,
  bin.size = 1e6,
  min.snps = 20,
  min.length = 1e6,
  color.palette = viridis::viridis,
  chr.info = TRUE,
  plot.title = NULL,
  plot.theme = theme_dartR()
)
r2 <- gl.report.ld.map(t1, ld.max.pairwise = 1e7)
p2 <- gl.ld.distance(r2, ld.resolution = 1e6)

Haploview-style LD plotting

popNames(t1)

tbl_chr <- table(t1$chromosome)
head(tbl_chr)

chr <- names(tbl_chr)[2]

# Reminder:
# Individuals are stored in rows and loci are stored in columns
pos_snp_chr <- t1[, t1$chromosome == chr]$position
p2 <- gl.ld.haplotype(
  x = t1,
  pop_name = "TENTERFIELD",
  chrom_name = chr,
  ld_max_pairwise = 1e7,
  maf = 0.05,
  ld_stat = "R.squared",
  ind.limit = 10,
  min_snps = 10,
  ld_threshold_haplo = 0.5,
  plot_het = TRUE,
  snp_pos = TRUE,
  target.snp1 = pos_snp_chr[43],
  target.snp2 = pos_snp_chr[80],
  target.snp3 = pos_snp_chr[20],
  col.all = "black",
  col.target1 = "green",
  col.target2 = "blue",
  col.target3 = "red",
  coordinates = NULL,
  color_haplo = "viridis",
  color_het = "deeppink"
)

GFF files

A GFF file (General Feature Format) is a plain-text annotation file used to describe genomic features such as genes, exons, CDS regions, mRNA, and regulatory elements.
It stores genomic coordinates together with metadata such as feature type, strand, source, and identifiers.

Useful annotation sources include the NCBI and Ensembl genome databases.

# Read the first two lines of a GFF file
readLines(
  "./data/chr_X3_mOrnAna1.pri.v4.gff",
  n = 2
)
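The nine tab-separated columns of a GFF3 file can also be read into a data frame for closer inspection. The column names below follow the GFF3 specification; the file path matches the one used in this section:

```r
# Read the GFF into a data frame with the nine standard GFF3 columns
gff <- read.delim(
  "./data/chr_X3_mOrnAna1.pri.v4.gff",
  header = FALSE, sep = "\t", quote = "", comment.char = "#",
  col.names = c("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")
)

# Count feature types, e.g. genes, exons, CDS regions
table(gff$type)
```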
# Match chromosome names to the format used in the GFF file
# This removes everything after the second underscore
t1$chromosome <- as.factor(
  sub("^(([^_]*_){1}[^_]*)_.*$", "\\1", t1$chromosome)
)
chr <- "NC_041751.1"
loci <- locNames(t1[, t1@chromosome == chr])

# Map loci to the nearest gene feature in the GFF annotation
r3 <- gl.find.genes.for.loci(
  t1,
  gff.file = "./data/chr_X3_mOrnAna1.pri.v4.gff",
  loci = loci
)

head(r3)
# Find loci located within genes matching a name or pattern
r4 <- gl.find.loci.in.genes(
  t1,
  gff.file = "./data/chr_X3_mOrnAna1.pri.v4.gff",
  gene = "MHC",
  save2tmp = TRUE
)



Learning outcomes

In this session we will explore some of the newest and most exciting functions recently added to the dartRverse ecosystem.


Session overview

  • gl.map.interactive — Interactive maps with gene flow
    • Visualising population locations on an interactive leaflet map
    • Overlaying a directed gene flow graph using private allele estimates
    • Interpreting arrow colour and thickness as gene flow magnitude
  • gl.gen2fbm, gl.fbm2gen, gl.pca — The FBM memory model
    • What is a File-Backed Matrix (FBM) and why does it matter?
    • Comparing the compressed dartR object vs the FBM representation
    • Why imputation is required before running PCA on FBM objects
    • Benchmarking: runtime comparison of standard vs FBM-backed PCA
    • Demonstrating that imputed FBM-PCA results are identical to standard PCA
  • gl.print.history — Your analysis audit trail
    • Recovering the filter history applied to a genlight object
    • Practical use cases

gl.map.interactive

Package: dartR.base

gl.map.interactive creates an interactive leaflet map that plots the sampling locations of populations stored in a genlight object. Beyond a simple location map, the function can overlay a directed gene flow network: you supply a pairwise migration matrix and the function draws arrows between populations whose colour and line thickness scale with the estimated magnitude of gene flow, making asymmetric dispersal immediately visible.


Basic interactive map

The simplest call only needs your genlight object. We use the built-in possums.gl dataset, which contains five possum populations across south-eastern Australia.

# Basic interactive map — population locations only
gl.map.interactive(possums.gl)

You should see a leaflet map with one marker per population. Hover or click on a marker to see the population name and sample size.


Estimating private alleles as a gene flow proxy

Before drawing directional arrows we need a directed matrix encoding the relative strength of gene flow between every pair of populations. gl.report.pa returns a data frame with columns pop1, pop2, priv1 (private alleles in pop1 not in pop2) and priv2 (private alleles in pop2 not in pop1). We reshape this into a square populations × populations matrix where entry [i, j] holds the number of private alleles in population i that are absent in population j, a directional proxy for gene flow from population i to population j.

# Report private alleles — returns a data frame with columns pop1, pop2, priv1, priv2
pa_df <- gl.report.pa(possums.gl[1:120,], plot.display = FALSE, verbose = 0)
head(pa_df)

# Get sorted population names
pops <- sort(unique(c(as.character(pa_df$pop1), as.character(pa_df$pop2))))

# Initialise an empty square matrix
pa_matrix <- matrix(0, nrow = length(pops), ncol = length(pops),
                    dimnames = list(pops, pops))

# Fill: priv1 = private alleles in pop1 absent from pop2  →  flow pop1 → pop2
for (i in seq_len(nrow(pa_df))) {
  p1 <- as.character(pa_df$pop1[i])
  p2 <- as.character(pa_df$pop2[i])
  pa_matrix[p1, p2] <- pa_df$priv1[i]   # pop1 → pop2
  pa_matrix[p2, p1] <- pa_df$priv2[i]   # pop2 → pop1
}

pa_matrix

Directed gene flow map

Pass the square matrix to gl.map.interactive. Arrow thickness and colour scale with the private allele count, so source and sink populations are immediately apparent.

# Interactive map with directed gene flow overlay
gl.map.interactive(
  possums.gl[1:120],
  matrix = pa_matrix,
  symmetric = FALSE
)

gl.gen2fbm, gl.fbm2gen & gl.pca — The FBM memory model

Packages: dartR.base (conversion functions), dartR.popgen (PCA)

Background: two representations of a dartR object

A standard dartR / genlight object stores the SNP matrix in a highly compressed bitwise format. This keeps the object small in memory — often only a few MB even for tens of thousands of loci — but every mathematical operation must first decompress the data, which adds overhead for computationally intensive analyses such as PCA on very large datasets.

The File-Backed Matrix (FBM) representation, powered by the bigstatsr package, stores the genotype matrix in a binary flat file on disk. The file is larger than the compressed genlight object, but individual rows and columns can be accessed directly without decompression, making linear algebra operations (including PCA via randomised SVD) dramatically faster for large datasets.

Property                     Compressed genlight         FBM-backed
Object size in RAM           Very small                  Small (only a file handle)
File on disk                 None                        Large flat file
Can be saved with saveRDS    ✅ Yes                      ❌ No (file path becomes invalid)
PCA / linear algebra speed   Slower (decompress first)   Much faster
Requires imputation for PCA  Optional                    Required

Important: An FBM object cannot be saved across sessions with saveRDS / save. It must be recreated each time. Think of it as a fast in-session cache, not a storage format.


Converting to and from FBM

# Simulate a medium-sized genlight object
gl_sim <- glSim(n.ind = 200, n.snp.nonstruc = 5000, ploidy = 2)
gl_sim

class(gl_sim) <- "dartR"

# Convert to FBM — writes the flat file and attaches it to the object
gl_fbm <- gl.gen2fbm(gl_sim)

# Inspect the FBM slot
gl_fbm@fbm       # bigstatsr FBM object
dim(gl_fbm@fbm)  # rows = individuals, cols = loci

# Compare object sizes in RAM (the FBM data lives on disk, not in RAM)
cat("Compressed genlight: ", object.size(gl_sim) / 1024, "KB\n")
cat("FBM genlight object: ", object.size(gl_fbm) / 1024, "KB\n")

Converting back is equally straightforward:

# Convert back to standard compressed genlight
gl_back <- gl.fbm2gen(gl_fbm)
gl_back

Runtime benchmark: standard PCA vs FBM-backed PCA

library(patchwork)
# Simulate a larger dataset to make the speed difference apparent
set.seed(1)
gl_large <- glSim(n.ind = 200, n.snp.nonstruc = 5000, ploidy = 2)
class(gl_large) <- "dartR"
pop(gl_large) <- rep(paste0("Pop", 1:4), each = 50)
indNames(gl_large) <- paste0("Ind", 1:200)
# Prepare the FBM version (impute first if your data contain missing values)
gl_large_fbm <- gl.gen2fbm(gl_large)

system.time(pc_gen <- gl.pcoa(gl_large, verbose = 0))

system.time(pc_fbm <- gl.pcoa(gl_large_fbm, verbose = 0))

p1 <- gl.pcoa.plot(pc_gen, gl_large, verbose = 0) +
  ggplot2::ggtitle("Standard PCA (compressed genlight)")

# Flip axes for easier visual comparison (PCA axes have arbitrary sign)
pc_fbm$scores[, 1:2] <- -pc_fbm$scores[, 1:2]

p2 <- gl.pcoa.plot(pc_fbm, gl_large_fbm, verbose = 0) +
  ggplot2::ggtitle("FBM-backed PCA (imputed FBM)")

p1 + p2

The FBM pathway is typically 5–50× faster for datasets with >10 000 loci, with the speedup increasing with dataset size.


Are the results identical? Side-by-side PCA plots

The side-by-side plots above show that, apart from the arbitrary sign flip applied to the axes, the FBM-backed ordination matches the standard PCA.

Take-home message: Use the standard compressed genlight for everyday work and storage. Switch to the FBM representation when running computationally intensive analyses on large datasets, remembering to impute first and never to save the FBM object to disk.


gl.print.history

Package: dartR.base

Every filtering function in dartR automatically appends a record of what was done — thresholds used, loci/individuals removed, timestamp — to an internal history log stored in gl@other$history. gl.print.history retrieves and displays this log in a human-readable format.

This is invaluable when you return to an analysis weeks later and need to reconstruct exactly which filters were applied, or when documenting your QC pipeline for a methods section.


# Start with the built-in testset and apply a series of filters
gl_work <- testset.gl

gl_work <- gl.filter.callrate(gl_work, method = "loc", threshold = 0.95, verbose = 0)
gl_work <- gl.filter.maf(gl_work,      threshold = 0.05, verbose = 0)
gl_work <- gl.filter.monomorphs(gl_work,                 verbose = 0)
gl_work <- gl.filter.secondaries(gl_work,                verbose = 0)

# Recover the complete filter audit trail
gl.print.history(gl_work)

The output lists each function call in chronological order together with the key parameters and the number of loci/individuals retained at each step. When you share a filtered genlight object with a collaborator via saveRDS, the history travels with it — they can inspect your entire filter chain without needing to read through your script.
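For example, the history survives a round trip through saveRDS/readRDS (the file name here is illustrative):

```r
# Save the filtered object; the history log in @other$history travels with it
saveRDS(gl_work, "testset_filtered.rds")

# Later, or on a collaborator's machine
gl_shared <- readRDS("testset_filtered.rds")
gl.print.history(gl_shared)
```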

Tip: If you ever wonder “wait, did I filter for HWE on this object?”, gl.print.history will tell you immediately — no more scrolling through old scripts.


Additional reading

  • bigstatsr — Privé F et al. (2018). Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics.
  • dartRverse documentation — https://green-striped-gecko.github.io/dartR/
  • ?gl.map.interactive, ?gl.impute, ?gl.print.history

Wrap up

In this session we covered three powerful additions to the dartRverse:

  • gl.map.interactive lets you visualise population locations and overlay directed gene flow networks derived from private allele estimates, with arrow thickness and colour encoding migration magnitude and direction.

  • gl.gen2fbm / gl.fbm2gen / gl.pca introduce a File-Backed Matrix memory model that trades a larger disk footprint for dramatically faster linear algebra — ideal for PCA and similar analyses on large SNP datasets. The FBM object cannot be saved across sessions and requires imputation before PCA, but the resulting ordination is identical to the standard approach.

  • gl.print.history exposes the complete filter audit trail embedded in every dartR genlight object, making your analyses transparent, reproducible and easy to document.

Getting started with development - joining the dartR team

As an open-source project, dartR encourages its users to contribute actively to the package; as far as development goes, the more the merrier! By helping with development you can improve the quality of the package, limit bugs, and contribute new features.

That said, open source is not a synonym for a free-for-all! To keep the package stable and in good condition, there are a number of rules we must adhere to when developing new features.

In this tutorial we therefore provide an introduction to becoming a developer and actively contributing to the package. We introduce the style guide and how to use it when developing functions, and give a brief introduction to GitHub, covering pushing, pulling, merging and forking.

Session overview

  • 1. Introduction to basics of development
    • Starting with basics
    • Writing functions
  • 2. DartR style guide
    • Roxygen2 documentation at start of functions
    • Good practice for writing functions
    • Using gl.document for writing start of functions
  • 3. Introduction to github
    • Repos
    • Push/pull merge etc.
    • Forking directories
  • 4. Example - writing basic function
    • gl.something.basic
    • write basic function for printing or combining inputs
  • 5. GitHub specifically for dartR
    • fork directory
    • create local git repo on rstudio
    • create function - compile local copy - make sure the package is installable - i.e. changes have not broken the package
    • commit/push to forked repo
    • make pull/merge request from forked repo to original repo
  • 6. Common errors/hangups
    • e.g. errors with local variables not being defined etc.
  • 7. Roundup/questions etc.

1. Basics of development and the dartR style guide

The dartR function: a skeleton

As discussed, there is a variety of rules we should adhere to as developers; by doing so we ensure the quality, stability and uniformity of our work across platforms. We must also not forget that other developers exist! In all things open source there is more than one set of eyes, so easy collaboration is the key to success.

As such we’ll begin with the basic structure of a dartR function, its appearance and the logic by which you should write your code. All dartR functions have the following skeleton:

  1. Introductory documentation written with roxygen2;
  2. Examples, again written in roxygen2;
  3. The initial function definition;
  4. Setting the verbosity level with the help of gl.set.verbosity();
  5. Validating the data passed to the function with the help of utils.check.datatype();
  6. Checking that the required dependencies are in place;
  7. Applying function-specific checks on the values passed to the function;
  8. Announcing the start of the function to the user with the help of utils.flag.start();
  9. DO THE JOB;
  10. Delivering the graphical outputs;
  11. Saving the graphical outputs as ggplot objects and other outputs to the session temporary directory;
  12. Adding to the history of the genlight object, for later recall;
  13. Announcing the closure of the function to the user;
  14. Returning any parameters;
  15. Closing the function.

We’ll step through each of these features in more detail to ensure you understand their inclusion.

Roxygen2 documentation

Roxygen2 is an R package used to document functions, primarily to generate help() pages, documentation files (.Rd) and description blurbs.

INCLUDE PHOTO OF ROXYGEN2 DOCUMENTATION HERE
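A minimal roxygen2 header for a dartR-style function might look like the following sketch (the function name and field contents are illustrative, not an actual dartR function):

```r
#' @name gl.example.function
#' @title One-line title for the function
#' @description
#' A sentence or two describing what the function does.
#' @param x Name of the genlight object containing the SNP data [required].
#' @param verbose Verbosity: 0 silent, 1 begin/end, 2 progress log,
#' 3 progress and results summary, 5 full report [default NULL].
#' @return The modified genlight object.
#' @author Your Name
#' @examples
#' # result <- gl.example.function(testset.gl)
#' @export
```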

Define the function

The next step is to define the function and its inputs:

INSERT PHOTO OF DEFINITION
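A typical dartR function definition takes the genlight object as its first argument, followed by function-specific parameters and a verbose argument; for example (a hypothetical skeleton, not an actual dartR function):

```r
gl.example.function <- function(x,
                                threshold = 0.95,
                                verbose = NULL) {
  # Steps 4-8 of the skeleton (verbosity, data checks, flags) go here,
  # followed by the body of the function
}
```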

Set the verbosity level

Setting the verbosity allows the user to control how much output a function produces, including messages such as function completion, output location and progress bars. There are six possible parameter values to choose from:

  • 0 – silent, reporting only fatal errors. Good for batch processing.
  • 1 – notifies begin and end, and fatal errors.
  • 2 – notifies warnings and progress log, in addition to the above.
  • 3 – reports progress and a results summary, in addition to the above.
  • 4 – to be implemented.
  • 5 – displays a full report.

This value can either be set per call using the parameter verbose, or globally for the session using verbosity <- gl.set.verbosity(value = n).
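For instance (a sketch assuming the dartR packages are loaded and testset.gl is available):

```r
# Per call: silence this filter run only
gl <- gl.filter.callrate(testset.gl, threshold = 0.95, verbose = 0)

# Globally: set the default verbosity for the session
verbosity <- gl.set.verbosity(value = 3)
```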

Apply checks of parameters

We now need to apply checks to the input parameters to ensure the user-supplied data match the format specified. We can achieve a simple check with utils.check.datatype(), which checks that the main data supplied fit one of the following data types:

  • genlight
  • dist
  • fd
  • matrix

In addition you should apply your own checks of user input. This ensures that any errors arising from incorrectly formatted data are explicitly flagged as such in the function output, avoiding the vague, often verbose system errors that would otherwise appear and can be difficult for the user to debug. We can illustrate this with an example:
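For instance, a function-specific check on a threshold argument might look like this (a hypothetical helper, not part of dartR):

```r
# Hypothetical check: a threshold must be a single number between 0 and 1
check_threshold <- function(threshold) {
  if (!is.numeric(threshold) || length(threshold) != 1 ||
      threshold < 0 || threshold > 1) {
    stop("Parameter 'threshold' must be a single number between 0 and 1")
  }
  invisible(threshold)
}

check_threshold(0.95)   # passes silently
# check_threshold(2)    # stops with an informative error
```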

Basic flags

It’s important to flag the start of the script, to ensure the output shows when the job has commenced. We can do this with the utils.flag.start() function.

In addition, we mark the start of the main body of the function with a # DO THE JOB comment, for example:

thingy <- function(a, b) {

  # Function-specific parameter check
  if (!(is.numeric(a) && is.numeric(b))) {
    stop("Parameters a and b must both be numeric")
  }

  # DO THE JOB

  output <- a + b
  return(output)
}

Have some style! Programming with style

As a flexible programming language, R allows for a diversity of approaches to any given problem. Add the many packages designed to improve flow and structure, and this effect is amplified even further. It is therefore important to write code not only for yourself but for those with whom you are working.

Perhaps the most important rule in this regard is to COMMENT!!!! By generously commenting your code (even where it may seem unnecessary), you lend it a degree of readability that transcends the program itself. Even the densest code becomes more readable when commented well, allowing both you and other programmers to pick it up quickly and continue where you left off.

It is also important to indent and space your code properly. Because R is an interpreted language, it does not enforce indentation the way a language such as Python does, so consistent formatting is entirely the author’s responsibility.

For more on what constitutes good, readable, efficient code, Hadley Wickham has written a comprehensive tome on the subject, Advanced R, which includes a thorough treatment of style: http://adv-r.had.co.nz/Style.html.

2. Introduction to GitHub

  • Repos
  • Push/pull, merge etc.
  • Forking repositories

As I’m sure many of you are aware, dartR’s code is hosted on GitHub, an open-source development platform for version management and control. This includes all of the base packages, such as dartR.base and dartRverse, in addition to the expansion modules and the experimental functions/versions being actively developed.

This code is governed by a set of strict protocols (as mentioned above) designed to protect the stability of the packages and ensure they remain (relatively) bug free. These can be summarised as follows:

Repos

A repo, or repository, is the base from which all code is hosted. It contains a project’s files and code, together with a history of the revisions made to that code. In addition, it hosts branches, copies of the original main branch, on which developers can work in isolation from the original code and edit at will.

As seen in the images below, the dartR.captive repo hosts a number of branches, each maintained by its respective developer. None of these developers edits the code on the main branch directly, so as to avoid conflicts.

Clearly there is the potential for the local and remote branches established by individual developers to diverge from the work contributed by others. To avoid working on stale branches, developers should follow the procedure below:

  1. Pull the latest working version of dartR from the remote origin/dev to your local branch (e.g. dev_jacob).
  2. Resolve any conflicts (hopefully few, if you do all of this regularly).
  3. Add new scripts or alter existing scripts, do a local build, and test that the scripts function appropriately without error.
  4. Commit any changes you have made to scripts, and include any files created by the local build.
  5. Push your local branch (e.g. dev_jacob) to your remote branch (e.g. origin/dev_jacob), where it is then available to the core dartR team to evaluate your changes and ultimately merge with dev.
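Under the hood, the procedure above corresponds to a handful of git commands; the branch and file names here are illustrative:

```shell
# 1. Update your local branch with the latest dev
git checkout dev_jacob
git pull origin dev

# 2-3. Resolve any conflicts, edit or add scripts, then build and
#      test the package locally (e.g. R CMD build / R CMD check)

# 4. Commit your changes, including files created by the local build
git add R/gl.new.function.R
git commit -m "Add gl.new.function"

# 5. Push your local branch to your remote branch for review
git push origin dev_jacob
```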

Forking a repository

As an alternative, you can fork a repository and then push any changes you make without requiring developer access to the repo. This may be the best option when proposing a small change, or when you don’t expect to make repeated changes to the package (i.e. you might only add some functionality that YOU require).

To begin you simply press the fork button at the top of the repo home page:

This will then take you to a page where you can specify how you’d like to fork the original repository.

3. Example function - gl.document

Having introduced you to the basics of writing functions for dartR and to GitHub, we’ll now walk through the process of writing our own function and pushing it to a fork.

gl.document

#' @name gl.document
#' @title Generate a roxygen2 Documentation Template for a Function
#' @description
#' Creates a skeleton \code{roxygen2} documentation file for a specified 
#' function. The generated file contains standard documentation fields 
#' including title, description, parameters, details, return value, examples, 
#' and references. The output is written as a new \code{.R} file in the 
#' specified directory.
#'
#' The function inspects the formal arguments of the target function and 
#' automatically generates \code{@param} entries for each argument. This 
#' provides a structured starting point for developing consistent 
#' documentation across a package.
#'
#' @param func_name Name of the function to be documented (unquoted).
#' @param author_name Character string specifying the author of the function.
#' @param example_dataset Name of an example dataset to include in the 
#' documentation examples (currently placeholder, not implemented).
#' @param outputDir Character string specifying the directory where the 
#' documentation file will be written. The file will be named 
#' \code{<func_name>.R}.
#'
#' @details 
#' The function constructs a list of standard documentation fields and writes 
#' them in roxygen2 format to a new file connection. Parameter names are 
#' extracted using \code{formals()} and written as placeholder entries for 
#' subsequent manual editing.
#' 
#' The generated template includes the following sections:
#' \itemize{
#'   \item \code{@name}
#'   \item \code{@title}
#'   \item \code{@description}
#'   \item \code{@param}
#'   \item \code{@details}
#'   \item \code{@return}
#'   \item \code{@author}
#'   \item \code{@examples}
#'   \item \code{@references}
#' }
#'
#' @return 
#' A new \code{.R} file containing a roxygen2 documentation template is 
#' written to the specified directory. The function returns \code{NULL} 
#' invisibly.
#'
#' @author 
#' Author name supplied via \code{author_name}.
#'
#' @examples
#' # Example usage:
#' # gl.document(
#' #   func_name = myFunction,
#' #   author_name = "Your Name",
#' #   example_dataset = "gl.example",
#' #   outputDir = "R/"
#' # )
#'
#'
#' @export

gl.document <- function(func_name, 
                        author_name,
                        example_dataset = NULL,
                        outputDir){
  
  # Convert function to character name
  funcName <- as.character(substitute(func_name))
  
  # Extract parameter names
  parameters <- names(formals(func_name))
  
  # ---- Construct example call ----
  example_call <- paste0(funcName, "(", 
                         paste(parameters, collapse = ", "),
                         ")")
  
  # Build example block
  example_lines <- c(
    "#' @examples",
    "#' # Example usage:",
    if(!is.null(example_dataset)) 
      paste0("#' data(", example_dataset, ")"),
    paste0("#' ", example_call)
  )
  
  # Remove NULL if no dataset
  example_lines <- example_lines[!is.na(example_lines)]
  
  # ---- Create file connection ----
  fileConn <- file(paste0(outputDir, funcName, ".R"), "wt")
  
  # ---- Write header fields ----
  writeLines(paste0("#' @name ", funcName), fileConn)
  writeLines(paste0("#' @title Title for ", funcName), fileConn)
  writeLines(paste0("#' @description ", funcName, " does:"), fileConn)
  writeLines("#'", fileConn)
  
  # ---- Write parameters ----
  for(params in parameters){
    writeLines(paste0("#' @param ", params, " Insert description."), fileConn)
  }
  
  writeLines("#'", fileConn)
  
  # ---- Write details ----
  writeLines(paste0("#' @details"), fileConn)
  writeLines("#' Detailed description goes here.", fileConn)
  writeLines("#'", fileConn)
  
  # ---- Write return ----
  writeLines(paste0("#' @return ", funcName, " returns:"), fileConn)
  writeLines("#'", fileConn)
  
  # ---- Write author ----
  writeLines(paste0("#' @author ", author_name), fileConn)
  writeLines("#'", fileConn)
  
  # ---- Write examples ----
  writeLines(example_lines, fileConn)
  writeLines("#'", fileConn)
  
  # ---- Write references ----
  writeLines("#' @references", fileConn)
  writeLines("#' Patterson, J. (2005). Maximum ride. New York: Little, Brown.", fileConn)
  writeLines("#'", fileConn)
  
  writeLines("#' @export", fileConn)
  
  close(fileConn)
  
  invisible(NULL)
}

The function gl.document writes a basic documentation template for new R functions so that the author can then fill in the rest with ease. We’ll be pushing our changes to a fork of the dartR.base package.

Forking

I’ve already created a fork of dartR.base called dartR.base_testing to illustrate this example; this was done in the manner explained above.

We will now pull this fork into a new R project, test whether the function works within the confines of the package, and then push our changes back to the repo.

Setting up the R project

We’ll start by creating a new R project under version control. Go to ‘New Project’, then ‘Version Control’, then ‘Git’, upon which you should see a screen like the one below:

Then name the cloned repo and insert the repository URL of the GitHub repo you’d like to clone. We can now create a new branch to work in. Go to the Git tab and click ‘New Branch’, upon which you’ll be greeted with the following screen:

This will then pull the contents of the repo, displaying a screen such as:

After creating a new R file we can paste in the contents of our R function, saved under a useful name, in this case gl.document.R. We can then attempt to install the package to test that it can be built from scratch without error:

Given that this was successful, we can now push our changes to our forked repo; upon success we should see the following message:

We can now attempt to merge our dev branch into the main dev branch via a pull request.

Having completed the pull request we can see if the branches can be merged successfully.

As we can see, the branches can be merged without trouble, so we can go ahead and approve the merge request.

Looking at the dev page of our fork, we can see that the dev branch has had a recent push (as expected), with our new file in the R folder. Our push was successful and our changes have been added to the repository.