Aligns nucleotides sequences against those present in a target database using blastn

Basic Local Alignment Search Tool (BLAST; Altschul et al., 1990 & 1997) is a sequence comparison algorithm optimized for speed used to search sequence databases for optimal local alignments to a query. This function creates fasta files, creates databases to run BLAST, runs blastn and filters these results to obtain the best hit per sequence.

This function can be used to run BLAST alignment of short-read (DArTseq data) and long-read sequences (Illumina, PacBio... etc). You can use reference genomes from NCBI, genomes from your private collection, contigs, scaffolds or any other genetic sequence that you would like to use as reference.

gl.blast(
  x,
  ref_genome,
  task = "megablast",
  Percentage_identity = 70,
  Percentage_overlap = 0.8,
  bitscore = 50,
  number_of_threads = 2,
  verbose = NULL
)

Arguments

x: Either a genlight object containing a column named 'TrimmedSequence' containing the sequence of the SNPs (the sequence tag) trimmed of adapters as provided by DArT; or a path to a fasta file with the query sequences [required].
ref_genome: Path to a reference genome in fasta of fna format [required].
task: Four different tasks are supported: 1) “megablast”, for very similar sequences (e.g, sequencing errors), 2) “dc-megablast”, typically used for inter-species comparisons, 3) “blastn”, the traditional program used for inter-species comparisons, 4) “blastn-short”, optimized for sequences less than 30 nucleotides [default 'megablast'].
Percentage_identity: Not a very sensitive or reliable measure of sequence similarity, however it is a reasonable proxy for evolutionary distance. The evolutionary distance associated with a 10 percent change in Percentage_identity is much greater at longer distances. Thus, a change from 80 – 70 percent identity might reflect divergence 200 million years earlier in time, but the change from 30 percent to 20 percent might correspond to a billion year divergence time change [default 70].
Percentage_overlap: Calculated as alignment length divided by the query length or subject length (whichever is shortest of the two lengths, i.e. length / min(qlen,slen) ) [default 0.8].
bitscore: A rule-of-thumb for inferring homology, a bit score of 50 is almost always significant [default 50].
number_of_threads: Number of threads (CPUs) to use in blastn search [default 2].
verbose: verbose= 0, silent or fatal errors; 1, begin and end; 2, progress log ; 3, progress and results summary; 5, full report [default 2 or as specified using gl.set.verbosity]

Value

If the input is a genlight object: returns a genlight object with one hit per sequence merged to the slot $other$loc.metrics. If the input is a fasta file: returns a dataframe with one hit per sequence.

Details

Installing BLAST

You can download the BLAST installs from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

It is important to install BLAST in a path that does not contain spaces for this function to work.

Running BLAST

Four different tasks are supported:

“megablast”, for very similar sequences (e.g, sequencing errors)
“dc-megablast”, typically used for inter-species comparisons
“blastn”, the traditional program used for inter-species comparisons
“blastn-short”, optimized for sequences less than 30 nucleotides

If you are running a BLAST alignment of similar sequences, for example Turtle Genome Vs Turtle Sequences, the recommended parameters are: task = “megablast”, Percentage_identity = 70, Percentage_overlap = 0.8 and bitscore = 50.

If you are running a BLAST alignment of highly dissimilar sequences because you are probably looking for sex linked hits in a distantly related species, and you are aligning for example sequences of Chicken Genome Vs Bassiana, the recommended parameters are: task = “dc-megablast”, Percentage_identity = 50, Percentage_overlap = 0.01 and bitscore = 30.

Be aware that running BLAST might take a long time (i.e. days) depending of the size of your query, the size of your database and the number of threads selected for your computer.

BLAST output

The BLAST output is formatted as a table using output format 6, with columns defined in the following order:

qseqid - Query Seq-id
sacc - Subject accession
stitle - Subject Title
qseq - Aligned part of query sequence
sseq - Aligned part of subject sequence
nident - Number of identical matches
mismatch - Number of mismatches
pident - Percentage of identical matches
length - Alignment length
evalue - Expect value
bitscore - Bit score
qstart - Start of alignment in query
qend - End of alignment in query
sstart - Start of alignment in subject
send - End of alignment in subject
gapopen - Number of gap openings
gaps - Total number of gaps
qlen - Query sequence length
slen - Subject sequence length
PercentageOverlap - length / min(qlen,slen)

Databases containing unfiltered aligned sequences, filtered aligned sequences and one hit per sequence are saved to the temporal directory (tempdir) and can be accessed with the function gl.print.reports and listed with the function gl.list.reports. Note that they can be accessed only in the current R session because tempdir is cleared each time that the R session is closed.

BLAST filtering

BLAST output is filtered by ordering the hits of each sequence first by the highest percentage identity, then the highest percentage overlap and then the highest bitscore. Only one hit per sequence is kept based on these selection criteria.

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3), 403-410.
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), 3389-3402.
Pearson, W. R. (2013). An introduction to sequence similarity (“homology”) searching. Current protocols in bioinformatics, 42(1), 3-1.

Author

Berenice Talamantes Becerra & Luis Mijangos (Post to https://groups.google.com/d/forum/dartr)

Examples

if (FALSE) { # \dontrun{
res <- gl.blast(x= testset.gl,ref_genome = 'sequence.fasta')
# display of reports saved in the temporal directory
gl.list.reports()
# open the reports saved in the temporal directory
blast_databases <- gl.print.reports(1)
} # }