vcf2others

Installation

This package needs to be installed from GitHub. devtools::install_github("legalLab/vcf2others")

Introduction

This package has two main functionalities. First, it is used for filtering, subsetting, merging and otherwise wrangling VCF files. Second, it is used for converting the VCF file to other population genetic and phylogenetic formats for downstream analyses.

How to use the functions of this package

Following are examples of the usage of the functions of this package.

Load VCF and associated files of individuals and groups

First we need to load our files. In addition to the VCF, we also need to have a white/black list of individuals (read from a file of one column with an ‘id’ header), a strata file associating individuals with groups (read from a file of two tab-separated columns with an ‘id’ and ‘pop’ headers), and a file of groups (read from a file of one column with a ‘pop’ header). These auxilary files are used for keeping/excluding individuals and groups of individuals during filtering and during conversion to other data formats. The strata file is also necessary since many of the population genetic formats require information on grouping of individuals.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.15.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(vcf2others)

# set path to example files and project name
data_path <- paste0(system.file("extdata", package="vcf2others"), "/")
project <- "crocs_"
postfix <- "discosnp_sub"

# load vcf
vcf <- read.vcfR(paste0(data_path, project, postfix, ".vcf.gz"))
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 22
#>   header_line: 23
#>   variant count: 10000
#>   column count: 26
#> Meta line 22 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 10000
#>   Character matrix gt cols: 26
#>   skip: 0
#>   nrows: 10000
#>   row_num: 0
#> Processed variant 1000Processed variant 2000Processed variant 3000Processed variant 4000Processed variant 5000Processed variant 6000Processed variant 7000Processed variant 8000Processed variant 9000Processed variant 10000Processed variant: 10000
#> All variants processed

# read individuals to include
indivs <- read.table(paste0(data_path, "indivs_b"), header = TRUE)$id %>%
  as.character()

# check if all indivs are in vcf_names
if (any(!(indivs %in% colnames(vcf@gt)[-1]))) stop(paste("Some individuals in list not in VCF"))

# get groups and group info on samples in vcf
# default file names for strata is "strata" and for groups "groups"
project_info <- get_vcf_group_info(vcf, data_path)
#> Joining with `by = join_by(id)`

Assessment of Missing Data

The function assess_vcf_missing_data() is used to generate a table and graph of missing data for each sample in a VCF. It is useful for visualizing data before and after filtering, and evaluate the effect of filtering. The function accepts parameters which form part of the name of the input file. The idea is that the VCF name contains information on the name of the project (project), how the VCF was extracted (postfix), and how it was filtered (fltr). The “postfix” and “fltr” can be left blank if the VCF name does not contain this information. Samples are ordered and colored by group assignment.

# define results path
res_path <- data_path
# species for plot title
species <- "Paleosuchus/Caiman"
# filter - filter parameters in file name
fltr <- ""

assess_vcf_missing_data(vcf, data_path, res_path, project, postfix, fltr, species)

Basic stats for individuals in VCF

The function vcf_stats() is used to generate a table with basic stats for each sample in the VCF. These stats include average read depth, heterozygosity, number of heterozygotes, homozygotes, reference homozygotes and alternate homogygotes, percent missing SNPs, total missing SNPs and total SNPs.

# get a table of basic sample stats

vcf_stats(vcf, res_path, project)

Filtering, subsetting, merging and otherwise wrangling VCF files

All functions can be used directly on a VCF, but many of them are called by the other functions of this package. For example vcf_filter_invariant() will remove invariant SNPs from a VCF, however, VCF by definition should not have invariant SNPs. However, if a VCF is subsetted or certain individuals are removed, SNPs can become invariant and need to be removed.

Subsetting

The functions vcf_extract_indivs(), vcf_extract_pops() and vcf_sub_loci() are used to subset a VCF by specific individuals, a group (all individuals of a group/population), or a random subset of loci, respectively. Subsetting individuals and groups can be done either as a whitelist (keep the listed individuals/groups) or blacklist (remove the listed individuals/groups).

# subset VCF by individuals (default whitelist = TRUE, remove invariants = TRUE)
indivs1 <- c("CTGA_H4635", "CTGA_H4667")
vcf1 <- vcf_extract_indivs(vcf, indivs1)

# subset VCF by groups (default whitelist = TRUE, remove invariants = TRUE)
groups1 <- c("paleosuchus_1", "paleosuchus_2") %>%
  as.factor()
vcf1 <- vcf_extract_pops(vcf, project_info$indiv_group, groups1)

# subset VCF by n random loci (default n_loci = 1000)
vcf1 <- vcf_sub_loci(vcf, n_loci = 5000)

Filtering

Filtering consists of removing individuals and loci based on specific properties. The function vcf_filter_missing_indivs() removes individuals above a % missing SNP threshold. The other functions filter on properties of individuals SNPs across individuals. The function vcf_filter_missingness() removes SNPs above a % missing threshold, vcf_filter_quality() removes SNPs below a quality threshold, vcf_filter_maf() removes SNPs with minor allele frequency below a % threshold, vcf_filter_hets() removes SNPs with heterozigosity above a % threshold, vcf_filter_rank() removes SNPs with rank below a % threshold (rank is specific to VCFs generated by DiscoSnpRad and measures the likelihood of observed SNP variation being due to paralogs), and vcf_filter_coverage() removes SNPs below a specific coverage (makes them missing in the individual). Finally the vcf_filter_invariant() removes invariant SNPs (SNPs may become invariant after subsetting and filtering). The functions vcf_filter_oneSNP() and vcf_filter_multiSNP() filter a VCF to remove linked SNPs (retain only unliked SNPs) or retain only linked SNPs, respectively. In general, all analyses assume that SNPs are unlinked, however, some analyses such as FineRadStructure work with linked SNPs.

# filter VCF for analyses (unlinked SNPs)
vcf_oneSNP <- vcf_extract_indivs(vcf, indivs, whitelist = FALSE) %>%
  vcf_filter_missing_indivs(.8) %>%
  vcf_filter_rank(.5) %>%
  vcf_filter_maf(.03) %>%
  vcf_filter_coverage(6) %>%
  vcf_filter_oneSNP() %>%
  vcf_filter_missingness(.2) %>%
  vcf_filter_missing_indivs(.3)
#> Removed samples are:
#> CTGA_H4663
#> CTGA_H4668
#> Final % missing data in VCF is 55.12%
#> Final % missing data in VCF is 13.92%
#> Removed samples are:
#> CTGA_H4666
#> CTGA_H4667
#> CTGA_H4669
#> Final % missing data in VCF is 7.1%

# filter VCF for analyses (linked SNPs)
vcf_multiSNP <- vcf_extract_indivs(vcf, indivs, whitelist = FALSE) %>%
  vcf_filter_missing_indivs(.8) %>%
  vcf_filter_rank(.5) %>%
  vcf_filter_maf(.03) %>%
  vcf_filter_coverage(6) %>%
  vcf_filter_multiSNP() %>%
  vcf_filter_missingness(.2) %>%
  vcf_filter_missing_indivs(.3)
#> Removed samples are:
#> CTGA_H4663
#> CTGA_H4668
#> Final % missing data in VCF is 55.12%
#> Warning: In dplyr::row_number(): 
#> ℹ Expression not supported in Arrow
#> → Pulling data into R
#> Final % missing data in VCF is 12.89%
#> Removed samples are:
#> CTGA_H4667
#> CTGA_H4669
#> Final % missing data in VCF is 8.27%

Extracting, merging and adding

The functions vcf_merge(), vcf_add_indivs(), vcf_extract_indivs() and vcf_extract_pops() are used to merge two VCF files, add individuals from a second VCF into a first VCF, and extract individuals/groups from a VCF saving it into a separate VCF. The vcf_add_indivs(), vcf_extract_indivs() and vcf_extract_pops() functions were designed in mind to work with outgroups. Because of phylogenetic effects, outgroups have relatively more missing data than ingroups, and so often are removed during filtering. The solution is to extract the outgroup taxa with vcf_extract_indivs() or vcf_extract_pops(), filter the ingroup VCF, and add the outgroup taxa with vcf_add_indivs(). It is important to NOT filter invariant loci during the filtering of the ingroup, and only filter them vcf_filter_invariant() after adding back the outgroup. The functions vcf_extract_indivs(), vcf_extract_pops(), vcf_filter_missing_indivs() and vcf_filter_coverage() by default filter invariants, but this can be turned off using the “f_invar = FALSE” flag.

# extract individuals from VCF (keep all loci)
indivs1 <- c("CTGA_H4635", "CTGA_H4637", "CTGA_H4643", "CTGA_H4645", "CTGA_H4647")
vcf_outgrp <- vcf_extract_indivs(vcf, indivs1, f_invar = FALSE)
vcf_ingrp <- vcf_extract_indivs(vcf, indivs1, whitelist = FALSE, f_invar = FALSE)

# extract groups of individuals from VCF (keep all loci)
groups1 <- as.factor("caiman")
vcf_outgrp <- vcf_extract_pops(vcf, project_info$indiv_group, groups1, f_invar = FALSE)
vcf_ingrp <- vcf_extract_pops(vcf, project_info$indiv_group, groups1, whitelist = FALSE, f_invar = FALSE)

# merge vcf_outgrp into vcf_ingrp
vcf1 <- vcf_merge(vcf_ingrp, vcf_outgrp)
#> Joining with `by = join_by(FORMAT, id)`

# add individuals from vcf_outgrp into vcf_ingrp (whitelist = TRUE is default)
indivs1 <- c("CTGA_H4637", "CTGA_H4643", "CTGA_H4645")
vcf1 <- vcf_add_indivs(vcf_ingrp, vcf_outgrp, indivs1)
#> Joining with `by = join_by(FORMAT, id)`

Filtering with outgroup

Outgroups, which tend to have many fewer individuals than ingroups, has more missing data than ingroups due to phylogenetic effects. Filtering the entire dataset often results in removal of the outgroup taxa when vcf_filter_invariant() is used during filtering to remove individuals with % missing data above some threshold. Therefore, a better strategy may be to extract the outgroup, filter the ingroup, add the outgroup, and remove any invariant SNPs.

# extract outgroup from VCF (keep all loci)
groups1 <- as.factor("caiman")
vcf_outgrp <- vcf_extract_pops(vcf, project_info$indiv_group, groups1, f_invar = FALSE)

# filter ingroup VCF for analyses then add outgroups
vcf1 <- vcf_extract_pops(vcf, project_info$indiv_group, groups1, whitelist = FALSE, f_invar = FALSE) %>%
  vcf_filter_missing_indivs(.8, f_invar = FALSE) %>%
  vcf_filter_rank(.5) %>%
  vcf_filter_maf(.03) %>%
  vcf_filter_coverage(6) %>%
  vcf_filter_oneSNP() %>%
  vcf_filter_missingness(.2) %>%
  vcf_filter_missing_indivs(.3, f_invar = FALSE) %>%
  vcf_merge(vcf_outgrp) %>%
  vcf_filter_invariant()
#> Removed samples are:
#> CTGA_H4663
#> CTGA_H4668
#> Final % missing data in VCF is 57.65%
#> Final % missing data in VCF is 7.5%
#> Removed samples are:
#> CTGA_H4667
#> Final % missing data in VCF is 2.78%
#> Joining with `by = join_by(FORMAT, id)`

Converting a VCF file to other population genetic and phylogenetic formats

All the following functions will take a VCF, groups and indiv_group (individuals to groups relationship), and convert the VCF to other populations genetic and phylogenetic formats, exporting/writing a file of this format. By default all functions will include missing data. If that is not desired, only 100% complete data matrices can be exported using the “inc_missing = FALSE” flag. The function vcf2genlight() automatically returns a genlight object and optionally can also export/write a file of this format to the working directory. Generally the vcf2genlight() function is called within a script using functions of the adegenet package rather than importing the genlight object.

res_path <- data_path
project <- "trigonatus_"
postfix <- "discosnp_sub"

##########
# datasets for analyses with unlinked SNPs
vcf <- vcf_oneSNP
# get groups and group info on samples in vcf
project_info <- get_vcf_group_info(vcf, data_path)
#> Joining with `by = join_by(id)`

##########
# export data formats
# migrate-n https://peterbeerli.com/migrate-html5/
vcf2migrate(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '_migrate.txt'))
# arlequin http://cmpg.unibe.ch/software/arlequin35/
vcf2arlequin(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.arp'))
# structure https://web.stanford.edu/group/pritchardlab/structure.html
vcf2structure(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.str'))
# faststucture http://rajanil.github.io/fastStructure/
vcf2structure(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.fstr'), method = "F")
# genepop https://gitlab.mbb.univ-montp2.fr/francois/genepop
vcf2genepop(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.gen'))
# smartsnp https://github.com/ChristianHuber/smartsnp
vcf2smartsnp(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.smartsnp'))
# eigenstrat https://github.com/DReichLab/EIG/tree/master
vcf2eigenstrat(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '_eigenstrat'))
# bayescan https://github.com/mfoll/BayeScan
vcf2bayescan(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.bayescan'))
# treemix https://bitbucket.org/nygcresearch/treemix/wiki/Home
vcf2treemix(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.treemix'))
# apparent https://github.com/halelab/apparent/tree/master
vcf2apparent(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.apparent'))
# related https://github.com/timothyfrasier/related
vcf2related(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.related'))
# dataframe with population membership in the last column
vcf2df(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.df'))
# snapp https://www.beast2.org/snapp/
vcf2snapp(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '_snapp.nex'))
# nexus - only SNPs, meant for SDVq analyses https://www.asc.ohio-state.edu/kubatko.2/software/SVDquartets/
vcf2nexus(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '_sdvq.nex'))
# fasta https://www.ncbi.nlm.nih.gov/genbank/fastaformat/
vcf2fasta(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.fna'))

##########
# datasets for analyses with linked SNPs
vcf <- vcf_multiSNP
# get groups and group info on samples in vcf
project_info <- get_vcf_group_info(vcf, data_path)
#> Joining with `by = join_by(id)`

##########
# export data formats
# fineRadStructure - expects VCF of linked SNPs
vcf2fineRadStructure(vcf, project_info$indiv_group, project_info$groups, out_file = paste0(res_path, project, postfix, '.finerad'))

Tomas Hrbek

11/26/2023