Rationale
There are at least three major steps when doing single-locus species delimitation analysis:
Preparing your data;
Running species delimitation analysis;
Visualizing results.
The rationale of delimtools
is to provide helper
functions for all these steps. This guide introduces you to these
functions and shows how to apply them to your datasets. In this
tutorial, we will use the geophagus
dataset provided in
this package. This dataset contains 354 cytochrome c oxidase sequences
of the Geophagus sensu stricto species group downloaded from
GenBank. Most of these sequences are from the data analysed by Ximenes
et al. (2021) and is documented in ?geophagus
.
Preparing your data
Usually, our main source of information for single-locus species
delimitation analysis is a FASTA
formated file containing nucleotide sequences. The most straighforward
way to import a FASTA formated file to R Environment is by using
ape::read.FASTA()
:
path_to_file <- system.file("extdata/geophagus.fasta", package = "delimtools")
geophagus <- ape::read.FASTA(file= path_to_file)
Since geophagus
dataset is provided along with
delimtools
, one can simply call it directly:
geophagus
#> 354 DNA sequences in binary format stored in a list.
#>
#> All sequences of same length: 690
#>
#> Labels:
#> MZ504301.1
#> MZ504328.1
#> MZ504318.1
#> MZ504313.1
#> MZ504311.1
#> MZ504312.1
#> ...
#>
#> Base composition:
#> a c g t
#> 0.237 0.282 0.178 0.303
#> (Total: 244.26 kb)
Check identifiers across files
Although the header of each sequence in a FASTA file may contain
information about the source organism from which the sequence was
obtained, we recommend storing this information in a tabular format file
(e.g. csv
, tsv
, xlsx
, etc.),
using an unique sequence identifier for both FASTA and tabular files. We
can have a glimpse of our sequence metadata by checking
geophagus_info
:
dplyr::glimpse(geophagus_info)
#> Rows: 354
#> Columns: 19
#> $ scientificName <chr> "Geophagus_altifrons", "Geophagus_altifrons", "G…
#> $ scientificNameGenBank <chr> "Geophagus altifrons", "Geophagus altifrons", "G…
#> $ class <chr> "Teleostei", "Teleostei", "Teleostei", "Teleoste…
#> $ order <chr> "Cichliformes", "Cichliformes", "Cichliformes", …
#> $ family <chr> "Cichlidae", "Cichlidae", "Cichlidae", "Cichlida…
#> $ genus <chr> "Geophagus", "Geophagus", "Geophagus", "Geophagu…
#> $ dbid <dbl> 2063684375, 2063684465, 2063684523, 2063684453, …
#> $ gbAccession <chr> "MZ504486.1", "MZ504531.1", "MZ504560.1", "MZ504…
#> $ gene <chr> "coi", "coi", "coi", "coi", "coi", "coi", "coi",…
#> $ length <dbl> 659, 699, 627, 612, 552, 654, 627, 651, 666, 609…
#> $ organelle <chr> "mitochondrion", "mitochondrion", "mitochondrion…
#> $ catalogNumber <chr> "1614", "17578", "1616", "1194", "2748", "1610",…
#> $ country <chr> "Brazil: Tapajós- Jacareacanga- Pará", "Brazil: …
#> $ publishedAs <chr> "Mapping the hidden diversity of the Geophagus s…
#> $ publishedIn <chr> "Unpublished", "Unpublished", "Unpublished", "Un…
#> $ publishedBy <chr> "Ximenes,A.M.", "Ximenes,A.M.", "Ximenes,A.M.", …
#> $ date <chr> "25-OCT-2021", "25-OCT-2021", "25-OCT-2021", "25…
#> $ decimalLatitude <dbl> -6.276008, -3.449100, -6.276008, -1.501489, -6.1…
#> $ decimalLongitude <dbl> -57.74000, -57.73533, -57.74000, -52.68211, -48.…
It’s desirable that both our FASTA file and our metadata have the
same dimensions and no abscent, mistyped, or duplicated sequence
identifiers within and between these files. Let’s check if our FASTA
file and our metadata has the same sequence identifiers using
delimtools::check_identifiers()
:
check_identifiers(data= geophagus_info, identifier= "gbAccession", dna= geophagus)
#> ✔ Identifiers are the same across files.
Now, let’s omit the first 10 sequences and check again:
check_identifiers(geophagus_info, "gbAccession", geophagus[11:354])
#> Error in `check_identifiers()`:
#> ! Identifiers are not of equal length:
#> ✖ You've supplied inputs with size lengths of 354 and 344.
#> ℹ Identifiers absent in `geophagus_info`:
#>
#> ℹ Identifiers absent in `geophagus[11:354]`:
#> MZ504313.1, MZ504328.1, MZ504311.1, MZ504299.1, MZ504318.1, MZ504309.1,
#> MZ504301.1, MZ504337.1, MZ504312.1, MZ504341.1
Notice that check_identifiers
will not
evaluate if sequence identifiers in FASTA and metadata appears in the
same order:
identical(names(geophagus), geophagus_info$gbAccession)
#> [1] FALSE
all.equal(names(geophagus), geophagus_info$gbAccession)
#> [1] "351 string mismatches"
Many other functions of delimtools
relies on a tabular
format file like geophagus_info
, so please make sure to
have one prior to any analysis.
Cleaning DNA and Haplotype Collapsing
Sometimes, our alignment file may contain non ACTG bases,
like ambiguities (N, R, Y, etc), gaps (-), or missing data (?) which may
have unintended consequences later. Using clean_dna
, we can
find which sequences may need further manual checking.
clean_dna(geophagus)
#> Warning: ⚠ You have missing data "('N','-' '?')" or ambiguity inside your sequence, i.e.
#> not padding the ends, and this may have unintended consequences later, as they
#> have now been removed!
#> ℹ The names of the samples are bellow.
#> GU701784.1, GU701785.1
#> 354 DNA sequences in binary format stored in a list.
#>
#> Mean sequence length: 646.198
#> Shortest sequence: 505
#> Longest sequence: 690
#>
#> Labels:
#> MZ504301.1
#> MZ504328.1
#> MZ504318.1
#> MZ504313.1
#> MZ504311.1
#> MZ504312.1
#> ...
#>
#> Base composition:
#> a c g t
#> 0.237 0.282 0.178 0.303
#> (Total: 228.75 kb)
Notice that clean_dna
will remove both
leading and trailing gaps from the alignment, being necessary to realign
the file after that.
The presence of identical sequences (haplotypes) in our dataset may
overestimate the number of lineages detected by tree-based species
delimitation methods (e.g. bGMYC, GMYC, PTP, mPTP). Prior to
phylogenetic tree inference for these analyses, it is a best practice to
remove (collapse) these sequences into its unique haplotypes. To
collapse an alignment into its unique haplotypes, you may use the
hap_collapse
function with default settings:
hap_collapse(geophagus, clean= TRUE, collapseSubstrings = TRUE)
#> Warning: ⚠ You have missing data "('N','-' '?')" or ambiguity inside your sequence, i.e.
#> not padding the ends, and this may have unintended consequences later, as they
#> have now been removed!
#> ℹ The names of the samples are bellow.
#> GU701784.1, GU701785.1
#> 137 DNA sequences in binary format stored in a list.
#>
#> Mean sequence length: 643.007
#> Shortest sequence: 505
#> Longest sequence: 690
#>
#> Labels:
#> MZ504301.1
#> MZ504318.1
#> MZ504341.1
#> MZ504337.1
#> MZ504342.1
#> MZ504304.1
#> ...
#>
#> Base composition:
#> a c g t
#> 0.238 0.281 0.177 0.304
#> (Total: 88.09 kb)
Notice that from the initial 354 DNA sequences,
hap_collapse
returned a total of 137 unique DNA sequences.
This result may vary if setting collapseSubstrings
to
FALSE
:
hap_collapse(geophagus, clean=TRUE, collapseSubstrings = FALSE)
#> Warning: ⚠ You have missing data "('N','-' '?')" or ambiguity inside your sequence, i.e.
#> not padding the ends, and this may have unintended consequences later, as they
#> have now been removed!
#> ℹ The names of the samples are bellow.
#> GU701784.1, GU701785.1
#> 246 DNA sequences in binary format stored in a list.
#>
#> Mean sequence length: 639.764
#> Shortest sequence: 505
#> Longest sequence: 690
#>
#> Labels:
#> MZ504301.1
#> MZ504328.1
#> MZ504318.1
#> MZ504341.1
#> MZ504337.1
#> MZ504299.1
#> ...
#>
#> Base composition:
#> a c g t
#> 0.237 0.282 0.177 0.303
#> (Total: 157.38 kb)
After setting collapseSubstrings
to FALSE
,
we obtained 246 unique DNA sequences. These additional 109 sequences are
in reality shorter but identical sequences which in most cases varies
only by the presence of leading and trailing gaps. By setting both
clean_dna
and collapseSubstrings
to
TRUE
, we ensure the removal of these leading and trailing
gaps from sequences before collapsing these sequences into unique
haplotypes.
Warning: be carefull when using
collapseSubstrings
when using non-coding genes
(e.g. 12S/16S rRNAs)
To visualize the relationships between these unique haplotypes and
the samples, we may use the haplotype_tbl
function:
haplotype_tbl(geophagus, clean= TRUE, collapseSubstrings = TRUE, verbose = FALSE)
#> # A tibble: 137 × 3
#> labels n_seqs collapsed
#> <chr> <dbl> <chr>
#> 1 MZ504318.1 38 MZ504328.1, MZ504313.1, MZ504311.1, MZ504312.1, MZ504309.1…
#> 2 MZ504540.1 20 MZ504505.1, MZ504553.1, MZ504554.1, MZ504552.1, MZ504542.1…
#> 3 MZ504420.1 19 MZ504417.1, MZ504437.1, MZ504425.1, MZ504427.1, MZ504422.1…
#> 4 MZ504488.1 16 MZ504538.1, KU568830.1, JN026709.1, MZ504522.1, MZ504523.1…
#> 5 MZ504484.1 15 MZ504496.1, MZ504487.1, MZ504573.1, MZ504560.1, MZ504497.1…
#> 6 MZ504462.1 14 MZ504479.1, MZ504477.1, MZ504481.1, MZ504476.1, MZ504463.1…
#> 7 MZ504375.1 13 MZ504372.1, MZ504382.1, MZ504381.1, MZ504379.1, MZ504383.1…
#> 8 MZ504535.1 8 MZ504515.1, MZ504525.1, MZ504533.1, MZ504534.1, MZ504536.1…
#> 9 MZ504393.1 8 MZ504445.1, MZ504413.1, MZ504407.1, MZ504408.1, MZ504410.1…
#> 10 MZ504400.1 6 MZ504404.1, MZ504401.1, MZ504402.1, MZ504403.1, MZ504399.1
#> # ℹ 127 more rows
You may also collapse other types of information by using
collapse_others
. We can collapse our metadata to see which
species and sampling localities are associated with each haplotype:
haps_df <- haplotype_tbl(geophagus, verbose = FALSE)
collapse_others(data= geophagus_info,
hap_tbl= haps_df,
labels= "gbAccession",
cols= c("scientificName", "country"))
#> # A tibble: 137 × 5
#> labels n_seqs collapsed collapsed_scientific…¹ collapsed_country
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 MZ504318.1 38 MZ504328.1, MZ504… Geophagus_proximus Brazil: Lago do …
#> 2 MZ504540.1 20 MZ504505.1, MZ504… Geophagus_altifrons, … Brazil: Tocantin…
#> 3 MZ504420.1 19 MZ504417.1, MZ504… Geophagus_sp.1 Brazil: Branco- …
#> 4 MZ504488.1 16 MZ504538.1, KU568… Geophagus_altifrons, … Brazil: Água Boa…
#> 5 MZ504484.1 15 MZ504496.1, MZ504… Geophagus_altifrons Brazil: São Seba…
#> 6 MZ504462.1 14 MZ504479.1, MZ504… Geophagus_sp.3 Brazil: Iriri- P…
#> 7 MZ504375.1 13 MZ504372.1, MZ504… Geophagus_megasema Brazil: Mamirauá…
#> 8 MZ504535.1 8 MZ504515.1, MZ504… Geophagus_altifrons Brazil: Maués-Aç…
#> 9 MZ504393.1 8 MZ504445.1, MZ504… Geophagus_sp.1 Brazil: Xingu- u…
#> 10 MZ504400.1 6 MZ504404.1, MZ504… Geophagus_sp.1 Brazil: Demeni- …
#> # ℹ 127 more rows
#> # ℹ abbreviated name: ¹collapsed_scientificName
Running species delimitation analysis
After preparing your data (alignments and phylogenetic trees), the
next step it to run species delimitation analyses and combine them into
a single tbl_df
. Please check Installing
Species Delimitation Software to know which sofware
delimtools
support. Check the examples below.
ABGD
We can run ABGD by using the abgd_tbl
function. This
function requires a PATH for a FASTA file as input for analysis and a
PATH for the ABGD executable:
dat_all_ali <- here::here("vignettes/geophagus.fasta")
abgd_df <- abgd_tbl(infile= dat_all_ali,
exe= "/usr/local/bin/abgd",
model= 3,
outfolder= here::here("vignettes"),
haps= haps_df$labels)
#> ℹ ABGD files are located in directory '/home/pedro/Documentos/Projetos/Packages/delimtools/vignettes'.
Then we can check abgd_df
print(abgd_df)
#> # A tibble: 137 × 2
#> labels abgd
#> <chr> <int>
#> 1 MZ504301.1 1
#> 2 MZ504318.1 1
#> 3 MZ504341.1 1
#> 4 MZ504337.1 1
#> 5 MZ504342.1 1
#> 6 MZ504304.1 1
#> 7 MZ504332.1 1
#> 8 MZ504343.1 1
#> 9 MZ504315.1 1
#> 10 MZ504345.1 1
#> # ℹ 127 more rows
ASAP
We can run ASAP by using the asap_tbl
function. This
function requires a PATH for a FASTA file as input for analysis and a
PATH for the ASAP executable:
asap_df <- asap_tbl(infile= dat_all_ali,
exe= "/usr/local/bin/asap",
haps= haps_df$labels,
outfolder = here::here("vignettes"),
model= 3)
#> ℹ ASAP files are located in directory '/home/pedro/Documentos/Projetos/Packages/delimtools/vignettes'.
print(asap_df)
#> # A tibble: 137 × 2
#> labels asap
#> <chr> <int>
#> 1 MZ504301.1 1
#> 2 MZ504318.1 1
#> 3 MZ504332.1 1
#> 4 MZ504343.1 1
#> 5 MZ504304.1 1
#> 6 MZ504342.1 1
#> 7 MZ504341.1 1
#> 8 MZ504315.1 1
#> 9 MZ504345.1 1
#> 10 MZ504337.1 1
#> # ℹ 127 more rows
Note: You should always indicate full PATH for
infile
, exe
, and outfolder
in
ASAP for full functioning.
GMYC
We can run GMYC analysis by using the splits
function
gmyc
. This function requires an ultrametric tree file of
class phylo
as input for analysis. We can convert our
geophagus_beast
treedata object into a phylo
object by using ape::as.phylo
:
gmyc_res <- splits::gmyc(as.phylo(geophagus_beast), method= "single")
Now, lets check the summary
:
summary(gmyc_res)
#> Result of GMYC species delimitation
#>
#> method: single
#> likelihood of null model: 995.0483
#> maximum likelihood of GMYC model: 1010.027
#> likelihood ratio: 29.95778
#> result of LR test: 3.124278e-07***
#>
#> number of ML clusters: 19
#> confidence interval: 18-22
#>
#> number of ML entities: 21
#> confidence interval: 19-25
#>
#> threshold time: -0.0164289
GMYC outputs a list containing several results. We can use
gmyc_tbl
to extract the clusters in a tbl_df
format:
bGMYC
We can run bGMYC analysis by using the bGMYC
function
bgmyc.singlephy
. This function requires an ultrametric tree
file of class phylo
as input for analysis. We can convert
our geophagus_beast
treedata object into a
phylo
object by using ape::as.phylo
:
bgmyc_res <- bGMYC::bgmyc.singlephy(as.phylo(geophagus_beast),
mcmc= 11000,
burnin= 1000,
thinning= 100,
t1= 2,
t2= Ntip(as.phylo(geophagus_beast)),
start= c(1, 0.5, 50))
bGMYC outputs a list containing several results. We can use
bgmyc_tbl
to extract the clusters in a tbl_df
format using 0.05
as posterior probability threshold for
clustering samples into species partitions:
bgmyc_df <- bgmyc_tbl(bgmyc_res, ppcutoff = 0.05)
print(bgmyc_df)
#> # A tibble: 137 × 2
#> labels bgmyc
#> <chr> <int>
#> 1 GU701784.1 1
#> 2 GU701785.1 1
#> 3 JN988869.1 1
#> 4 MH780911.1 1
#> 5 OR732927.1 1
#> 6 OR732928.1 1
#> 7 MZ050845.1 2
#> 8 MZ051032.1 2
#> 9 MZ051706.1 2
#> 10 MZ051794.1 2
#> # ℹ 127 more rows
PTP and mPTP
We can run both PTP and mPTP analysis by using the
mptp_tbl
. This function requires a PATH for a phylogram as
input for analysis, a PATH for the mptp executable, a string specifying
which method to use (“multi” for mPTP; “single” for PTP), and a minimum
branch length (minbrlen) for computations:
mptp_df <- mptp_tbl(infile= here::here("vignettes/geophagus_raxml.nwk"),
exe= "/usr/local/bin/mptp",
method= "multi",
minbrlen= 2e-06,
outfolder = here::here("vignettes"),
delimname = "mptp")
#> mptp 0.2.5_linux_x86_64, 16GB RAM, 4 cores
#> https://github.com/Pas-Kapli/mptp
#>
#> Parsing tree file...
#> Loaded rooted tree...
#> Number of edges greater than minimum branch length: 159 / 272
#> Score Null Model: 681.702507
#> Best score for multi coalescent rate: 748.771263
#> LRT computed p-value: 0.000000
#> Writing delimitation file /home/pedro/Documentos/Projetos/Packages/delimtools/vignettes/geophagus_raxml.nwk.mptp.multi.txt ...
#> Number of delimited species: 14
#> Creating SVG delimitation file /home/pedro/Documentos/Projetos/Packages/delimtools/vignettes/geophagus_raxml.nwk.mptp.multi.svg ...
#> Done...
#> ℹ mPTP files are located in '/home/pedro/Documentos/Projetos/Packages/delimtools/vignettes'.
ptp_df <- mptp_tbl(infile= here::here("vignettes/geophagus_raxml.nwk"),
exe= "/usr/local/bin/mptp",
method= "single",
minbrlen= 2e-06,
outfolder = here::here("vignettes"),
delimname = "ptp")
#> mptp 0.2.5_linux_x86_64, 16GB RAM, 4 cores
#> https://github.com/Pas-Kapli/mptp
#>
#> Parsing tree file...
#> Loaded rooted tree...
#> Number of edges greater than minimum branch length: 159 / 272
#> Score Null Model: 681.702507
#> Best score for single coalescent rate: 747.727196
#> LRT computed p-value: 0.000000
#> Writing delimitation file /home/pedro/Documentos/Projetos/Packages/delimtools/vignettes/geophagus_raxml.nwk.mptp.single.txt ...
#> Number of delimited species: 17
#> Creating SVG delimitation file /home/pedro/Documentos/Projetos/Packages/delimtools/vignettes/geophagus_raxml.nwk.mptp.single.svg ...
#> Done...
#> ℹ mPTP files are located in '/home/pedro/Documentos/Projetos/Packages/delimtools/vignettes'.
To estimate a minimum branch length for your dataset, use
minbr_len
:
min_brlen(as.phylo(geophagus_raxml))
#> ℹ Printing 5 smallest tip-to-tip distances in a tree with 137 tips ...
#>
#>
#> |dist | n|
#> |:--------|--:|
#> |0.000002 | 12|
#> |0.000003 | 2|
#> |0.000004 | 6|
#> |0.000005 | 2|
#> |0.001561 | 2|
Local Minima
We can run the Local Minima analysis by using the spider
function localMinima
. This function requires a genetic
distance matrix as input for analysis:
mat <- dist.dna(geophagus, model= "raw", pairwise.deletion = TRUE)
lmin <- spider::localMinima(mat)
#> [1] 0.006828358 0.018994392 0.040791871 0.055999413 0.067151612 0.082105695
We can plot the lmin
object to select the threshold. We
will use the first value of the lmin
object:
Since Local Minima
output is a numeric vector, we need
to use locmin_tbl
to get our species partitions in a
tbl_df
format:
locmin_df <- locmin_tbl(mat,
threshold= lmin$localMinima[1],
haps= haps_df$labels)
print(locmin_df)
#> # A tibble: 137 × 2
#> labels locmin
#> <chr> <int>
#> 1 MZ504301.1 1
#> 2 MZ504318.1 1
#> 3 MZ504341.1 1
#> 4 MZ504337.1 1
#> 5 MZ504342.1 1
#> 6 MZ504304.1 1
#> 7 MZ504332.1 1
#> 8 MZ504343.1 1
#> 9 MZ504315.1 1
#> 10 MZ504345.1 1
#> # ℹ 127 more rows
We can also use patristic distances to generate species partitions as
well. Using the geophagus_beast
tree, we can apply a
threshold to classify tips of the tree by using its branch lengths:
tr_mat <- ape::cophenetic.phylo(as.phylo(geophagus_beast)) |> as.dist()
tr_lmin <- spider::localMinima(tr_mat)
#> [1] 0.1114476 0.2020889 0.3527801
plot(tr_lmin) |> abline(v=tr_lmin$localMinima[1],col="red")
treedist_df <- locmin_tbl(tr_mat,
threshold = tr_lmin$localMinima[1],
delimname = "treedist")
print(treedist_df)
#> # A tibble: 137 × 2
#> labels treedist
#> <chr> <int>
#> 1 GU701784.1 1
#> 2 GU701785.1 1
#> 3 JN988869.1 1
#> 4 MH780911.1 1
#> 5 OR732927.1 1
#> 6 OR732928.1 1
#> 7 MZ050845.1 2
#> 8 MZ051032.1 2
#> 9 MZ051706.1 2
#> 10 MZ051794.1 2
#> # ℹ 127 more rows
Fixed threshold
We can use locmin_tbl
to create a species partition
using a fixed threshold. For example, we can use a threshold of 2% to
generate species partitions:
percent_df <- locmin_tbl(mat, threshold = 0.02, haps= haps_df$labels, delimname = "percent")
print(percent_df)
#> # A tibble: 137 × 2
#> labels percent
#> <chr> <int>
#> 1 MZ504301.1 1
#> 2 MZ504318.1 1
#> 3 MZ504341.1 1
#> 4 MZ504337.1 1
#> 5 MZ504342.1 1
#> 6 MZ504304.1 1
#> 7 MZ504332.1 1
#> 8 MZ504343.1 1
#> 9 MZ504315.1 1
#> 10 MZ504345.1 1
#> # ℹ 127 more rows
Morphology
We can also turn species ranks or results from morphological analysis
into a species partition by using morph_tbl
. For
geophagus
dataset, we will use the scientific names in
geophagus_info
:
morph_df <- morph_tbl(labels= geophagus_info$gbAccession,
sppVector= geophagus_info$scientificName) |>
dplyr::filter(labels %in% haps_df$labels)
print(morph_df)
#> # A tibble: 137 × 2
#> labels morph
#> <chr> <int>
#> 1 MZ504501.1 1
#> 2 MZ504504.1 1
#> 3 MZ504510.1 1
#> 4 MZ504543.1 1
#> 5 MZ504545.1 1
#> 6 MZ504488.1 1
#> 7 MZ504483.1 1
#> 8 MZ504558.1 1
#> 9 MZ504484.1 1
#> 10 MZ504502.1 1
#> # ℹ 127 more rows
Joining Species Delimitation Outputs
Since we have several species partitions, we need to merge them into
a single data frame in order to visualize the results. However, since
each method has its own rankings, we need to recode them prior to any
further analysis. For this task, we can use delim_join
.
This function takes a list
of species partitions as input
in order to merge and recode all species partitions:
all_delims <- list(abgd_df, asap_df, bgmyc_df, gmyc_df,
locmin_df, morph_df, mptp_df, ptp_df,
percent_df, treedist_df)
all_delims_df <- delim_join(all_delims)
We can summarise the all_delims_df
object by using the
report_delim
function:
report_delim(all_delims_df)
#> ℹ Joined delimitations have a total of 45 unique species partitions.
#> ℹ Check below the number of species partitions per method:
#>
#>
#> |method | partitions|
#> |:--------|----------:|
#> |gmyc | 21|
#> |locmin | 21|
#> |abgd | 19|
#> |bgmyc | 19|
#> |ptp | 17|
#> |morph | 16|
#> |asap | 14|
#> |mptp | 14|
#> |percent | 12|
#> |treedist | 7|
#> # A tibble: 137 × 11
#> labels abgd asap bgmyc gmyc locmin morph mptp percent ptp treedist
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 MZ504301.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1 sp44
#> 2 MZ504318.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1 sp44
#> 3 MZ504341.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1 sp44
#> 4 MZ504337.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1 sp44
#> 5 MZ504342.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1 sp44
#> 6 MZ504304.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1 sp44
#> 7 MZ504332.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1 sp44
#> 8 MZ504343.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1 sp44
#> 9 MZ504315.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1 sp44
#> 10 MZ504345.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1 sp44
#> # ℹ 127 more rows
We can also check the match ratio statistics of Ahrens et al (2014)
using match_ratio
:
match_ratio(all_delims_df) |>
dplyr::arrange(dplyr::desc(match_ratio)) |>
print(n=Inf)
#> # A tibble: 45 × 5
#> pairs delim_1 delim_2 n_match match_ratio
#> <chr> <int> <int> <int> <dbl>
#> 1 abgd-locmin 19 21 17 0.85
#> 2 bgmyc-gmyc 19 21 17 0.85
#> 3 asap-percent 14 12 10 0.77
#> 4 abgd-bgmyc 19 19 14 0.74
#> 5 abgd-ptp 19 17 13 0.72
#> 6 bgmyc-ptp 19 17 13 0.72
#> 7 asap-ptp 14 17 11 0.71
#> 8 gmyc-locmin 21 21 15 0.71
#> 9 bgmyc-locmin 19 21 14 0.7
#> 10 gmyc-ptp 21 17 13 0.68
#> 11 asap-morph 14 16 10 0.67
#> 12 mptp-ptp 14 17 10 0.65
#> 13 asap-mptp 14 14 9 0.64
#> 14 bgmyc-morph 19 16 11 0.63
#> 15 locmin-ptp 21 17 12 0.63
#> 16 percent-ptp 12 17 9 0.62
#> 17 abgd-asap 19 14 10 0.61
#> 18 asap-bgmyc 14 19 10 0.61
#> 19 abgd-gmyc 19 21 12 0.6
#> 20 abgd-percent 19 12 9 0.58
#> 21 bgmyc-percent 19 12 9 0.58
#> 22 morph-percent 16 12 8 0.57
#> 23 abgd-morph 19 16 9 0.51
#> 24 asap-gmyc 14 21 9 0.51
#> 25 gmyc-morph 21 16 9 0.49
#> 26 abgd-mptp 19 14 8 0.48
#> 27 bgmyc-mptp 19 14 8 0.48
#> 28 gmyc-percent 21 12 8 0.48
#> 29 morph-ptp 16 17 8 0.48
#> 30 morph-mptp 16 14 7 0.47
#> 31 asap-locmin 14 21 8 0.46
#> 32 gmyc-mptp 21 14 8 0.46
#> 33 mptp-percent 14 12 6 0.46
#> 34 locmin-morph 21 16 8 0.43
#> 35 locmin-percent 21 12 7 0.42
#> 36 percent-treedist 12 7 4 0.42
#> 37 locmin-mptp 21 14 7 0.4
#> 38 asap-treedist 14 7 4 0.38
#> 39 mptp-treedist 14 7 4 0.38
#> 40 ptp-treedist 17 7 4 0.33
#> 41 abgd-treedist 19 7 4 0.31
#> 42 bgmyc-treedist 19 7 3 0.23
#> 43 gmyc-treedist 21 7 3 0.21
#> 44 locmin-treedist 21 7 3 0.21
#> 45 morph-treedist 16 7 2 0.17
By checking the results, all comparisons including treedist have a
low match_ratio
. Since species partitions with very low
match_ratio
values can impact consensus calculations, we
will remove treedist before visualizing our results
all_delims_df <- dplyr::select(all_delims_df, -treedist)
report_delim(all_delims_df)
#> ℹ Joined delimitations have a total of 43 unique species partitions.
#> ℹ Check below the number of species partitions per method:
#>
#>
#> |method | partitions|
#> |:-------|----------:|
#> |gmyc | 21|
#> |locmin | 21|
#> |abgd | 19|
#> |bgmyc | 19|
#> |ptp | 17|
#> |morph | 16|
#> |asap | 14|
#> |mptp | 14|
#> |percent | 12|
#> # A tibble: 137 × 10
#> labels abgd asap bgmyc gmyc locmin morph mptp percent ptp
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 MZ504301.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1
#> 2 MZ504318.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1
#> 3 MZ504341.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1
#> 4 MZ504337.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1
#> 5 MZ504342.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1
#> 6 MZ504304.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1
#> 7 MZ504332.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1
#> 8 MZ504343.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1
#> 9 MZ504315.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1
#> 10 MZ504345.1 sp1 sp20 sp1 sp1 sp1 sp36 sp1 sp40 sp1
#> # ℹ 127 more rows
Visualizing Results
We can visualize the results by using delim_autoplot
.
This function will plot a phylogenetic tree alongside with the results
contained in all_delims_df
. We will use the
geophagus_beast
tree to visualize the results:
# customize tip labels of the tree
tip_tab <- geophagus_info |>
dplyr::filter(gbAccession %in% geophagus_beast@phylo$tip.label) |>
dplyr::mutate(labs= glue::glue("{gbAccession} | {scientificName}")) |>
dplyr::select(gbAccession, labs, scientificName)
# create a customized color palette
cols <- delim_brewer(delim= all_delims_df, package="randomcoloR", seed=42)
# plot
delim_autoplot(all_delims_df,
geophagus_beast,
consensus= TRUE,
n_match = 5,
tbl_labs= tip_tab,
col_vec= cols,
hexpand= 0.7,
widths = c(0.5, 0.5))
#> Warning: ⚠ Argument `delim_order` not provided. Using default order from
#> `all_delims_df`.