Filters loci based on pairwise Hamming distance between sequence tags

Hamming distance is calculated as the number of base differences between two sequences which can be expressed as a count or a proportion. Typically, it is calculated between two sequences of equal length. In the context of DArT trimmed sequences, which differ in length but which are anchored to the left by the restriction enzyme recognition sequence, it is sensible to compare the two trimmed sequences starting from immediately after the common recognition sequence and terminating at the last base of the shorter sequence.

gl.filter.hamming(
  x,
  threshold = 0.2,
  rs = 5,
  taglength = 69,
  plot.out = TRUE,
  plot_theme = theme_dartR(),
  plot_colors = two_colors,
  pb = FALSE,
  save2tmp = FALSE,
  verbose = NULL
)

Arguments

x: Name of the genlight object containing the SNP data [required].
threshold: A threshold Hamming distance for filtering loci [default threshold 0.2].
rs: Number of bases in the restriction enzyme recognition sequence [default 5].
taglength: Typical length of the sequence tags [default 69].
plot.out: Specify if plot is to be produced [default TRUE].
plot_theme: Theme for the plot. See Details for options [default theme_dartR()].
plot_colors: List of two color names for the borders and fill of the plots [default two_colors].
pb: Switch to output progress bar [default FALSE].
save2tmp: If TRUE, saves any ggplots and listings to the session temporary directory (tempdir) [default FALSE].
verbose: Verbosity: 0, silent or fatal errors; 1, begin and end; 2, progress log ; 3, progress and results summary; 5, full report [default 2, unless specified using gl.set.verbosity].

Value

A genlight object filtered on Hamming distance.

Details

Hamming distance can be computed by exploiting the fact that the dot product of two binary vectors x and (1-y) counts the corresponding elements that are different between x and y. This approach can also be used for vectors that contain more than two possible values at each position (e.g. A, C, T or G).

If a pair of DNA sequences are of differing length, the longer is truncated.

The algorithm is that of Johann de Jong https://johanndejong.wordpress.com/2015/10/02/faster-hamming-distance-in-r-2/ as implemented in utils.hamming.

Only one of two loci are retained if their Hamming distance is less that a specified percentage. 5 base differences out of 100 bases is a 20

Author

Custodian: Arthur Georges – Post to https://groups.google.com/d/forum/dartr

Examples

# SNP data
test <- platypus.gl
test <- gl.subsample.loci(platypus.gl,n=50)
#> Starting gl.subsample.loci 
#>   Processing genlight object with SNP data
#>   Warning: data include loci that are scored NA across all individuals.
#>   Consider filtering using gl <- gl.filter.allna(gl)
#>   Warning: Dataset contains monomorphic loci which will be included in the gl.subsample.loci selections
#>   Subsampling at random 50 loci from dartR object 
#> Completed: gl.subsample.loci 
#> 
result <- gl.filter.hamming(test, threshold=0.25, verbose=3)
#> Starting gl.filter.hamming 
#>   Processing genlight object with SNP data
#>   Warning: data include loci that are scored NA across all individuals.
#>   Consider filtering using gl <- gl.filter.allna(gl)
#>   Note: Hamming distance ranges from zero (sequence identity)
#>                 to 1 (no bases shared at any position)
#>   Note: Calculating pairwise Hamming distances between trimmed 
#>                 reference sequence tags
#>   Filtering loci with a Hamming Distance of less than 0.25 
#> 

#>   Summary of filtered dataset
#>     Initial No. of loci: 50 
#>     Hamming d > 0.25 = 17 bp
#>     Loci deleted 0 
#>     Final No. of loci: 50 
#>     No. of individuals: 81 
#>     No. of populations:  3 
#> Completed: gl.filter.hamming 
#>