pophelper: an R package and web app to analyse and visualize population structure

The pophelper R package and web app are software tools to aid population structure analyses. They can be used to analyse and visualize output generated by population assignment programs such as ADMIXTURE, STRUCTURE and TESS. Functions include parsing output run files to tabulate data, estimating K using the Evanno method, generating input files for CLUMPP and creating barplots. These functions can be incorporated into standard R analysis workflows. The latest version of the package is available on GitHub (https://github.com/royfrancis/pophelper). An interactive web version of the pophelper package is also available; it covers the same functionality as the R package, with features such as interactive plots, cluster alignment during plotting, sorting of individuals and ordering of population groups. The interactive version is available at http://pophelper.com/.


Introduction
The use of multilocus molecular markers to differentiate populations and to infer their genetic structure and composition is a powerful tool in ecology, conservation and landscape genetics (Sunnucks 2000; Falush et al. 2003). One of the most successful approaches is the assignment of individuals (Manel et al. 2005), which includes distance-based methods (Piry et al. 2004), frequency-based methods (Paetkau et al. 1995), Bayesian methods (Rannala & Mountain 1997; Baudouin & Lebrun 2000) and those involving Markov chain Monte Carlo methods (Cornuet et al. 1999; Pritchard et al. 2000; Paetkau et al. 2004). Several programs with differing features have been published in recent years to investigate population structure. The programs discussed in this article include ADMIXTURE (Alexander et al. 2009), FASTSTRUCTURE (Raj et al. 2014), STRUCTURE (Pritchard et al. 2000) and TESS (François & Durand 2010). Along with the assignment of individuals, it is often useful to cluster individuals into K populations. Estimating the number of populations (K) is often difficult, and no standard method exists that can be applied in all situations. The approach of Evanno et al. (2005) is one method to estimate the value of K and has been widely cited (Morgan et al. 2007; Blackburn & Maddison 2014; Diez et al. 2015).
With STRUCTURE, preparing input has been made convenient by data conversion tools such as GENALEX (Peakall & Smouse 2012), but the handling of output files has not been implemented in a convenient manner. Both of these programs also lack functionality to generate high-quality graphics. Two common downstream approaches are to align assignment clusters across replicate runs using CLUMPP (Jakobsson & Rosenberg 2007) and to visualize the output using DISTRUCT (Rosenberg 2004). These programs are often difficult and cumbersome to use. A program that easily and quickly accepts run files and generates output that can be used directly in downstream programs was therefore needed.
The most popular of these 'helper' programs was STRUCTURE HARVESTER (Earl 2012). STRUCTURE HARVESTER is a web-based utility with a graphical user interface that accepts STRUCTURE runs and generates Evanno plots and input files for CLUMPP, but it does not produce barplots or allow users to work with CLUMPP output. A more recent development was STRUCTURE PLOT (Ramasamy et al. 2014), which uses the R shiny web framework (Chang et al. 2015) to work interactively in the browser. The limitations of STRUCTURE PLOT include the lack of run file parsing, data tabulation, cluster alignment, CLUMPP export and Evanno method calculation. Neither program handles input from multiple programs. The statistical programming language R (R Development Core Team 2013) is widely used in data analysis, and its use in ecology and population analysis has risen over the years. The program POPHELPER presented in this article is provided as an R package for command-line use and as a web app for interactive use. Many of the limitations of the applications discussed above have been addressed in this package, which supports parsing of run files from various programs, tabulation of runs, barplots with population labels, sorting of individuals and ordering of populations, and alignment of clusters and merging of repeats using CLUMPP.

Features
The latest version of the POPHELPER R package can be installed from the GitHub repository (https://github.com/royfrancis/pophelper), where installation instructions are provided. A brief introduction to the functions, usage and workflow is provided on the GitHub page. A detailed demonstration is provided as an HTML vignette with the package and is accessible from the GitHub page. The POPHELPER package employs the ggplot2 (Wickham 2009) graphing package to create high-quality graphics. The POPHELPER package currently accepts ADMIXTURE, FASTSTRUCTURE, STRUCTURE and TESS runs as input. Files from other sources or modified data can be used as input as long as they contain only numeric, tabular data delimited by white space, tabs or commas. The files must be free of headers, and the decimal separator must be a dot.
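The format requirements for generic input files can be illustrated with base R alone. The following is a minimal sketch, not the package's own parser: the helper name read_q_table, the column names and the example values are assumptions made for illustration.

```r
# Hypothetical helper (not part of POPHELPER): read a headerless, all-numeric
# table delimited by white space, tabs or commas, with "." as the decimal mark.
read_q_table <- function(path) {
  first <- readLines(path, n = 1)
  # Use a comma separator if the first line contains one; otherwise sep = ""
  # makes read.table split on any run of spaces or tabs.
  sep <- if (grepl(",", first, fixed = TRUE)) "," else ""
  q <- read.table(path, header = FALSE, sep = sep, dec = ".",
                  colClasses = "numeric")
  names(q) <- paste0("Cluster", seq_len(ncol(q)))
  q
}

# Made-up example: three individuals assigned to two clusters.
tmp <- tempfile(fileext = ".txt")
writeLines(c("0.90 0.10", "0.25 0.75", "0.50 0.50"), tmp)
q <- read_q_table(tmp)
```

Because colClasses is set to "numeric", a file that contains a header or any non-numeric field fails at read time rather than yielding a silently malformed table.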
A common scenario is that users have a directory full of run files. POPHELPER provides functions to tabulate all runs, listing the files and parameters such as filename, K value and number of individuals, and to summarize the runs by number of repeats. TESS files generated in separate directories can be easily collated into a single directory for further processing. ADMIXTURE, FASTSTRUCTURE, STRUCTURE, TESS or any tabular run files can be converted to R data frames for further analysis. The Evanno method to detect K has been implemented for STRUCTURE runs (Fig. 1). This method requires at least three sequential values of K with a minimum of two repeats. Informative feedback is provided in cases where the data do not fulfil the requirements. The results are available as plots as well as text files. Input files for CLUMPP (the combined file with repeats and the paramfile) are easily generated into separate directories. The CLUMPP executable can be used to generate the aligned and merged files in each K directory. The aligned and merged files can be collated into a single directory using POPHELPER. CLUMPP files can be generated from any of the input run files. Barplots (Fig. 2) can be created from any of the input run files, combined files, aligned files or merged files. The plots can be created separately or as joined plots, with or without population labels (Fig. 3). The combined, aligned and merged files from CLUMPP can also be plotted (Fig. 4).
Fig. 2 A screenshot of the POPHELPER GUI for analysing and visualizing assignment run files. A typical barplot produced in POPHELPER is seen at top right. Three runs (K = 2, K = 3 and K = 4) are joined together as a single plot with common grouped labels. The filenames and K values are shown on the right-side panel. Function arguments or GUI options are provided to tweak and customize almost every aspect of the plot.
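The Evanno calculation mentioned above can be sketched in base R. This is an illustration under stated assumptions rather than the package's implementation: the function name evanno_delta_k, the column names k and lnp (the estimated log probability of the data reported for each run) and the example values are all hypothetical. The sketch uses one common reading of the formula, in which the second-order difference is taken on the mean of L(K) across repeats and divided by the standard deviation of L(K).

```r
# Illustrative Evanno calculation (hypothetical helper, not the POPHELPER API).
# `runs` needs a column `k` (K of each run) and `lnp` (log probability of data).
evanno_delta_k <- function(runs) {
  ks <- sort(unique(runs$k))
  kf <- factor(runs$k, levels = ks)
  reps <- as.vector(table(kf))
  if (length(ks) < 3 || any(diff(ks) != 1) || any(reps < 2)) {
    stop("Need at least three sequential K values with two or more repeats each.")
  }
  mean_l <- as.vector(tapply(runs$lnp, kf, mean))
  sd_l <- as.vector(tapply(runs$lnp, kf, sd))
  n <- length(ks)
  # |L(K+1) - 2 L(K) + L(K-1)| is defined only for interior K values.
  l2 <- rep(NA_real_, n)
  l2[2:(n - 1)] <- abs(mean_l[3:n] - 2 * mean_l[2:(n - 1)] + mean_l[1:(n - 2)])
  data.frame(k = ks, mean_lnp = mean_l, sd_lnp = sd_l, delta_k = l2 / sd_l)
}

# Made-up example: K = 1 to 4 with two repeats each.
runs <- data.frame(
  k   = c(1, 1, 2, 2, 3, 3, 4, 4),
  lnp = c(-1000, -1002, -800, -804, -790, -792, -789, -795)
)
res <- evanno_delta_k(runs)
best_k <- res$k[which.max(res$delta_k)]
```

Delta K is undefined for the smallest and largest K tested, which is why at least three sequential values are needed. If all repeats of a given K return identical likelihoods, the standard deviation is zero and the ratio is not finite, so such cases need separate handling.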
Individuals in the barplot can be sorted by any one cluster or by an ordering that takes all clusters into account (similar to the 'Sort by Q' option in the STRUCTURE software). When using population labels, population groups can be subsetted or reordered (Fig. 4). When population labels are used along with sorting, individuals are sorted within the population groups. Multiline plots (similar to those implemented in the STRUCTURE software) can be created from any of the previously mentioned file types to identify individuals. Multiline plots (Fig. 5) produce an A4-size image in which the number of rows and the number of individuals per row depend on the total number of individuals. A reasonable default is calculated, but this can be adjusted. Multiline plots can be sorted by one or all clusters. The plot functions have several arguments to tweak and customize the figure as required. Export formats include JPEG (lossy), PNG (lossless) and PDF (vector). The package uses simple commands and basic R functions, making it accessible to beginner R users.
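Sorting by a single cluster within population groups, as described above, amounts to a two-key ordering. The sketch below shows the idea in base R; the helper name sort_by_cluster, the descending sort direction and the example data are assumptions for illustration and do not reproduce the package's plotting functions.

```r
# Made-up assignment table: six individuals, two clusters, two populations.
q <- data.frame(Cluster1 = c(0.2, 0.9, 0.6, 0.1, 0.7, 0.4),
                Cluster2 = c(0.8, 0.1, 0.4, 0.9, 0.3, 0.6))
pop <- c("A", "A", "A", "B", "B", "B")

# Hypothetical helper: return the row order that sorts individuals by one
# cluster (largest membership first), optionally within population groups.
sort_by_cluster <- function(q, cluster, groups = NULL) {
  if (is.null(groups)) {
    return(order(-q[[cluster]]))
  }
  # Keep population groups in their order of first appearance.
  g <- factor(groups, levels = unique(groups))
  order(g, -q[[cluster]])
}

ord <- sort_by_cluster(q, "Cluster1", pop)
q_sorted <- q[ord, ]
pop_sorted <- pop[ord]
```

Returning the row order rather than the sorted table lets the same index be applied to the population labels, so that labels and bars stay aligned.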
The web implementation of POPHELPER is available at https://pophelper.com. The web app was created in R using the shiny web framework (Chang et al. 2015) and is suitable for users who wish to work interactively. The app makes use of several R packages such as fields (Nychka et al. 2015), RESHAPE2 (Wickham 2007), MARKDOWN, GTABLE (Wickham 2012), GRIDEXTRA (Auguie 2015), PLYR (Wickham 2011), DT (Xie 2015), SHINYACE (LLC. TTa 2013), RCHARTS (Vaidyanathan 2013) and SHINYJS (Attali 2015). The web app has features similar to those of the R package. Interactive graphics are available for data exploration, and standard graphics (Fig. 2) are available for print and publication. The web app includes an implementation of the Evanno method (Fig. 1) for determining K, as well as cluster alignment during plotting using CLUMPP. Population labels can be uploaded as a text file or copy-pasted to mark populations under the barplot. Individuals can be sorted by one or more clusters. Population groups can be subsetted or reordered. As the cluster alignment is performed during plotting, the user does not need to handle CLUMPP or related files such as combined and aligned files. Population clusters can be coloured using a wide range of colour palettes, including colour-blind friendly palettes. Options are provided to adjust the size, colour and position of most chart elements as well as population labels.
Detailed descriptions of functionality and tutorials are available with the application (the R package as well as the web app). The POPHELPER package is expected to be updated frequently, and new functionality will be added as and when possible. It is hoped that the R package will be useful for implementation into analysis workflows and the web app for interactive use.

Conclusion
The POPHELPER R package and web app are available to assist users working with molecular markers to investigate population structure. The R package is easy to use and can be fitted into workflows and data analysis pipelines. The interactive web app is available for applied users who wish to use a graphical user interface.