Tuesday, June 2, 2015

QTLs for saltwater tolerance in mosquitos

Original article by Smith et al. 2015, Heredity 1-9.

In summary:
Saltwater tolerance is potentially a case of rapid evolution, occurring multiple times across genera and within species complexes. It is often studied through osmoregulation and the excretory system. Smith et al. 2015 crossed saltwater-tolerant and freshwater-obligate species of mosquito larvae. Both the Bayesian interval mapping (BIM) and random forest methods supported 6 QTL regions. Maternal inheritance may influence tolerance, as F1 hybrids produced by crossing freshwater males with euryhaline females tolerated higher salinities than other combinations. For the QTL at the X marker, having only 1 copy rather than 2 seemed to be beneficial. This was seen as a difference in salt tolerance between the sexes: salt-tolerant males were hemi/homozygous at the QTL on the X marker, while salt-tolerant females were heterozygous. Ultimately, salt-tolerant larvae were typically homozygous for the euryhaline genotype at at least 2 of the 6 QTL regions, although not consistently the same regions. More than 1 recombination event per chromosome (i.e., additional crosses) will be needed for fine-scale mapping, and hybrid sterility will also make fine-scale mapping difficult.

Terms
QTL - quantitative trait loci: loci, or regions in the genome, that are involved in the expression of quantitative traits, i.e., traits whose variance, when partitioned into environmental and genetic components, has some heritable part and is therefore coded in the genome. QTL mapping is a powerful way to find how genetics underlie traits, by associating phenotypes with genomic regions and genotypes. Techniques include screening for candidate genes based on other taxa and SNP-based analyses.

exaptation - when a trait originally evolved not in direct response to selection, as is typical of adaptations, but rather via co-expression with other traits that were directly under selection (like saltwater tolerance going along for the ride with drought resistance; Arribas et al. 2014)

epistasis (synergistic vs. disruptive): interaction between multiple genes, or one gene with its genetic background (its modifier genes), leading to non-additive effects. Smith et al 2015 tested for it by omitting subsets of data and rerunning analyses to see if QTL results changed.

Segregation distortion: when the segregation ratio of a locus deviates from the expected Mendelian ratio (i.e., varies from 1:1 for backcrosses or 1:2:1 for F2s). It is observed because, although the markers by definition have no function, they are linked to genes that are subject to gametic/zygotic selection. If the markers actually caused the distortion themselves, they would not be neutral but rather candidate genes for selection. SD influences the viability of individuals with different genotypes at the loci. (Xu et al. 2008)
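A quick way to test a marker for segregation distortion is a chi-square goodness-of-fit test against the Mendelian expectation; a minimal sketch in Python, with made-up F2 genotype counts:

```python
# Chi-square goodness-of-fit test for segregation distortion at one marker.
# Hypothetical F2 genotype counts (AA, Aa, aa); expected Mendelian ratio 1:2:1.
observed = [18, 52, 30]
total = sum(observed)                               # 100
expected = [total * r for r in (0.25, 0.5, 0.25)]   # [25.0, 50.0, 25.0]

# Chi-square statistic: sum of (O - E)^2 / E over genotype classes.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # -> 3.04
```

The critical value for df = 2 at alpha = 0.05 is about 5.99, so these counts would not be flagged as distorted; a marker exceeding that threshold would be a candidate for linkage to a locus under gametic/zygotic selection.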

Bayesian interval mapping - Bayesian model selection for mapping multiple, interacting QTLs.

bootstrapping - generally, a resampling method to determine properties of an estimator. Sample the samples.
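"Sample the samples," concretely: a minimal bootstrap of the mean in Python. The tolerance values are made up for illustration.

```python
import random

random.seed(1)

# Hypothetical sample of larval salinity tolerances (arbitrary units).
sample = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 4.7]

# Resample the sample with replacement many times, recomputing the
# estimator (here the mean) each time.
boot_means = []
for _ in range(10_000):
    resample = random.choices(sample, k=len(sample))
    boot_means.append(sum(resample) / len(resample))

# The spread of the bootstrap distribution approximates the estimator's
# sampling variability; a 95% interval comes from the 2.5th and 97.5th
# percentiles of the sorted bootstrap means.
boot_means.sort()
lo, hi = boot_means[249], boot_means[9749]
print(f"mean = {sum(sample)/len(sample):.2f}, 95% CI ~ ({lo:.2f}, {hi:.2f})")
```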

machine learning inre random forests: an ensemble learning method in which many decision trees are constructed and the forest outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Here it was used as a non-parametric technique for classifying samples into fresh/saltwater based on bootstrap sampling of predictors, like SNPs and sex.
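The mechanics are easier to see in a toy version. This is not Smith et al.'s actual analysis; it's a minimal sketch with fabricated data, where each "tree" is just a one-split stump fit to a bootstrap resample using one randomly chosen predictor, and the forest classifies by majority vote.

```python
import random

random.seed(0)

# Toy data: rows are larvae, columns are predictors (e.g., a SNP genotype
# coded 0/1/2 and sex coded 0/1); labels are 0 = freshwater, 1 = saltwater.
# All values are fabricated for illustration.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1], [2, 0], [1, 1]]
y = [0, 0, 0, 1, 1, 1, 1, 0]

def fit_stump(X, y):
    """One-split 'tree': pick a random predictor, then the threshold and
    direction that minimize misclassifications on this sample."""
    j = random.randrange(len(X[0]))          # random predictor subset of size 1
    best = None
    for t in sorted({row[j] for row in X}):
        for flip in (False, True):           # try both cut directions
            preds = [(row[j] > t) != flip for row in X]
            errs = sum(p != bool(label) for p, label in zip(preds, y))
            if best is None or errs < best[0]:
                best = (errs, j, t, flip)
    _, j, t, flip = best
    return lambda row: int((row[j] > t) != flip)

# Grow each stump on a bootstrap resample of the rows (bagging).
forest = []
for _ in range(25):
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))

def predict(row):
    votes = sum(tree(row) for tree in forest)
    return int(votes * 2 > len(forest))      # majority vote

print([predict(row) for row in X])
```

Real random forests grow deep trees and sample a new predictor subset at every split; the bootstrap-plus-vote structure is the same.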

telomeres: regions of repetitive nucleotide sequences at the ends of chromosomes. Centromeres also have highly repetitive DNA sequences and transposable elements. Both are areas that cause issues in inference because of errors, missing data, and abnormal linkage levels.


Saturday, May 2, 2015

some genomic terms and concepts

GWAS - genome-wide association study; studying the genetic basis of complex traits by identifying many loci (SNPs) across the genome and looking for patterns or outliers (e.g., in allele frequencies)

Missing heritability - refers in part to the complexity of trait heritability: the "significant" variants identified with GWAS often explain only a small fraction of a trait's heritability, for example when the combined effect of many small-effect genes contributes more to the phenotype than the prominent, individual causal genes.

Sanger sequencing - precedes next-gen sequencing; sequencing of nucleotides where chain-terminating nucleotides are added to the recipe. Labeled terminators are now common, so that all 4 bases can be combined in a single reaction. Products are then separated by length to determine the sequence.

Pyrosequencing - sequencing based on the emission of light when specific nucleotides are incorporated, with the amount of light proportional to the number of incorporated nucleotides. Nucleotides are added one at a time, so the sequence can be determined.

Illumina sequencing - a form of next-gen sequencing; reversible dye-terminators are used so that single bases can be id'd as they are introduced into DNA strands. Dyed bases with blocking groups attach during synthesis; a laser excites the dyes and enables an image of the incorporated bases to be taken, and then the blocking groups and dyes are removed. This process is repeated until the whole strand is sequenced. Because the process is massively parallel, many strands can be sequenced simultaneously.

Transposable elements/transposons/TEs: DNA sequences that are "mobile genetic elements", i.e., they can change their position within the genome, sometimes creating or reversing mutations. They are generally non-coding, although they can be important in genome function and evolution.

"random walk" and "house of cards" models of mutation - the former causes a random additive change (increase or decrease) to the character of the mutant individual, whereas in the latter each mutation's effect is independent of the previous value, thereby bringing down the "house of cards" built up by evolution.
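The contrast is easy to see in a toy simulation (my own sketch, not from the original papers): under the random walk each mutation nudges the current character value, so variance accumulates across generations, while under house of cards each mutation redraws the value independently of the history, so variance stays at the mutational variance.

```python
import random

random.seed(42)

def random_walk(generations=1000, step=0.1):
    # Each mutation adds a small random increment to the current value,
    # so the lineage drifts away from its start gradually.
    z = 0.0
    for _ in range(generations):
        z += random.gauss(0, step)
    return z

def house_of_cards(generations=1000, scale=1.0):
    # Each mutation redraws the value from scratch: the final value is
    # independent of the accumulated history.
    z = 0.0
    for _ in range(generations):
        z = random.gauss(0, scale)
    return z

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Replicate lineages: random-walk variance grows with the number of
# mutations (~ generations * step^2 = 10 here), while house-of-cards
# variance stays near the mutational variance (~ 1).
walks = [random_walk() for _ in range(200)]
hoc = [house_of_cards() for _ in range(200)]
print(f"random-walk variance ~ {var(walks):.1f}, house-of-cards variance ~ {var(hoc):.1f}")
```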

Friday, January 30, 2015

I see you: presence-only data.

While I've been reading a lot of black bear papers recently in preparation for a grant proposal, today I'm going to post about the issue of presence-only data: data that include only information about where individuals have been detected and do not include information about where individuals have not been detected. Data that include both are often referred to as presence-absence data, which in most, if not all, cases are preferred, because information about where individuals are not is just as informative for inferences about range distributions, population size, etc. However, sometimes such data can't be collected, because the data come from museum or herbarium collections, or are collected opportunistically or incidentally from sightings or reports. Opportunistic citizen-science data are an example of presence-only data. You take what you can get. In those cases, perhaps a lot of data can be collected, but the 'compromised' quality requires some thought and creative work-arounds. I've only just started thinking and reading about the problem, but my understanding of it is as such:

The problem: The data collected consist of instances and locations where individuals or species have been observed or detected, i.e., y=1. Basing range distributions solely on these data ignores the places that were not sampled but truly have y=1, and there is also no information about where there are no individuals, i.e., where y=0. There is no reference or background against which to assess the collected y=1's. Thus, only naive, cursory estimates of occupancy are possible - or at least so has been the prevailing thought.

Current proposed and adopted methods: In addition to the observed y=1's, environmental or covariate data are often also collected at the sampled locations, in an attempt to relate presence, occupancy, or distribution to environmental attributes like elevation, percent forest cover or urban densities.

There are envelope models that describe the distribution of the presence-only data. Methods like BIOCLIM, HABITAT, and SVM are examples, and I know nothing about these.

An option is to determine a reference or background against which to compare the observed y=1's. In other words, lacking y=0 data, an option is to generate them, or create pseudo-absences. But post-hoc y=0's could contain both true y=0's as well as some y=1's that appear to be y=0 if detection probability is less than 100%. Furthermore, a researcher then has to determine how many of these pseudo-absences to create, and how many are created can greatly affect the estimated probability of occurrence, that is, of state y=1. One approach to creating pseudo-absences is a case-control design, where the logistic regression for the probability of state y=1 given the environmental covariate data is adjusted with the ln of the proportions of occupied and unoccupied locations. But we don't know how occupied and unoccupied locations are actually split in the real world.

That background is either an unsampled matrix of unused landscapes (like plants that either use a spot or don't), or it can be viewed as available for use (like a bear moving on a landscape that could use agricultural areas, but perhaps just less often); this is apparently a subtle but methodologically important distinction. Some have suggested that background locations outnumber sampled locations by several orders of magnitude to minimize sampling errors (Manly et al. 2002 and McDonald 2003 via Pearce and Boyce 2006). An exponential model to estimate the relative likelihood of occupancy or occurrence can be used instead of the logistic model, and another approach is to use a logistic regression to approximate a logistic discrimination model. When relative likelihoods of occupancy are estimated, they are not constrained to be less than 1, which is weird. All of these approaches attempt to account for the background or landscape from which the observed y=1's were collected, but each takes a slightly different approach that I have not read enough on to describe.
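The case-control intercept adjustment mentioned above can be made concrete. Fitting a logistic regression on a sample built with an artificial ratio of presences to pseudo-absences biases only the intercept, and the bias can be removed with the log-ratio of the true and sampled presence proportions (this is the standard prior-correction form, e.g. King and Zeng 2001; all numbers below are made up):

```python
import math

# Prior correction of the logistic intercept under case-control sampling:
# if the model was fit on a sample with presence proportion y_bar, but the
# true (population) presence proportion is tau, the slopes are fine and
# only the intercept needs fixing:
#   beta0_true = beta0_hat - ln( ((1 - tau) / tau) * (y_bar / (1 - y_bar)) )

beta0_hat = 0.4   # intercept fit on the case-control sample (hypothetical)
y_bar = 0.5       # sample built with equal presences and pseudo-absences
tau = 0.1         # assumed true occupancy proportion in the landscape

beta0_corrected = beta0_hat - math.log(((1 - tau) / tau) * (y_bar / (1 - y_bar)))
print(round(beta0_corrected, 3))  # -> -1.797
```

The catch, as noted above, is that the correction requires knowing tau, the real-world split between occupied and unoccupied locations, which is exactly what we don't have.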

Pearce and Boyce (2006) state, "We are unaware of any application explicitly modelling abundance given presence only". Although not about abundance, Royle et al. in 2012 came out with a likelihood approach for occurrence/occupancy probability with presence-only data, arguing that the popular and widely used MaxEnt doesn't actually estimate that, but instead provides habitat suitability indices that are quite different from estimates of occurrence probability. Royle et al. provide a parametric approach, implemented via maxlike, an R package, and invoke Bayes' rule, which requires random sampling and constant detection probability. The major problem with MaxEnt, they argue, as I see it, is that it uses a penalized, exponential version of the detection probability given occurrence, based on the maximum entropy distribution. This penalization shrinks the regression coefficients toward 0, but Royle et al. argue that this approach biases the estimator because the intercept, Beta0, is set to an arbitrarily determined number. In comparing MaxEnt to their approach, they found that MaxEnt produced variable under- and overestimates. They caution that effort and detection probability are in fact often not constant, such as with roadside surveys or where density of the study population or effort is high.

Of course, a reply came quickly, from Hastie and Fithian (2013), and the debate about making inferences from presence-only data is not over yet. They claim Royle et al. have performed "statistical alchemy" by imposing parametric assumptions to estimate overall occurrence probability, which is shaky ground on which to build inference.

Needless to say, presence-only data are tricky to work with, and presence-absence data seem preferable in every comparison. On the other hand, presence-only data can be cheaper to collect and therefore yield larger datasets, which makes them attractive to use. In particular, I have been thinking about studies that may try to combine presence-absence data, such as from capture-recapture and occupancy efforts, with presence-only data. It seems like finding a way to generate pseudo-absences to make the presence-only data mirror the presence-absence data would be one approach. I can imagine studies where presence-absence data are collected via capture-recapture, while presence-only data come from depauperate occupancy approaches. How to combine, then? Blanc et al. (2014) provide one such example with Eurasian lynx, by making abundance an explicit instead of a derived parameter in the estimating models, and hinging the connection between abundance and occupancy on the fact that occupancy is only possible when abundance is >0. They mention that their approach is a development of Freeman and Besbeas (2012) with the addition of imperfect detection. But they also note that this is all for non-spatial capture-recapture, because their abundance N ~ homogeneous Poisson(lambda) and N is explicit, whereas spatial capture-recapture approaches use an inhomogeneous process and N is derived.
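The hinge between abundance and occupancy follows directly from the Poisson assumption: a site is occupied exactly when N > 0, so occupancy probability is psi = P(N > 0) = 1 - exp(-lambda). A minimal illustration:

```python
import math

# Under N ~ Poisson(lambda), occupancy probability is the chance that
# at least one individual is present:
#   psi = P(N > 0) = 1 - P(N = 0) = 1 - exp(-lambda)
def occupancy(lam):
    return 1.0 - math.exp(-lam)

for lam in (0.1, 0.5, 1.0, 3.0):
    print(f"lambda = {lam}: psi = {occupancy(lam):.3f}")
```

Note how quickly psi saturates toward 1 as lambda grows, which is why occupancy alone is a blunt index of abundance and why making N explicit adds information.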


Pearce, J. L. and Boyce, M. S. (2006), Modelling distribution and abundance with presence-only data. Journal of Applied Ecology, 43: 405–412. doi: 10.1111/j.1365-2664.2005.01112.x

Royle, J. A., Chandler, R. B., Yackulic, C. and Nichols, J. D. (2012), Likelihood analysis of species occurrence probability from presence-only data for modelling species distributions. Methods in Ecology and Evolution, 3: 545–554. doi: 10.1111/j.2041-210X.2011.00182.x

Hastie, T. and Fithian, W. (2013), Inference from presence-only data; the ongoing controversy. Ecography, 36: 864–867. doi: 10.1111/j.1600-0587.2013.00321.x

Blanc, L., Marboutin, E., Gatti, S., Zimmermann, F., Gimenez, O. (2014), Improving abundance estimation by combining capture–recapture and occupancy data: example with a large carnivore. Journal of Applied Ecology, 51: 1733–1739. doi: 10.1111/1365-2664.12319

Thursday, January 22, 2015

Influence of drift and admixture on population structure of American black bears in the Central Interior Highlands 50 years after translocation

Puckett, E. E. et al. 2014. Molecular Ecology 23: 2414-2427

Objectives
During the 19th and early 20th centuries, black bears experienced both overall range contraction and local extirpation (Smith and Clark 1994?). In the 60s and 70s, black bears were translocated from Minnesota and Manitoba to Arkansas, and since then the population size has increased. There have been 8 generations since translocation, given a generation time of 6.3 years (Onorato et al. 2004). There are some physical barriers, including rivers, highways, and discontinuous forest habitat.

"Bottlenecks, founder events (Nei et al 1975), and genetic drift (Nei and Tajima 1981) often result in decreased genetic diversity and increased population differentiation". On the other hand, migration can decrease the effects of drift due to gene flow.

The objective of the study was to identify population structure of bears in the 50 years following translocation and test for signatures of remnant genetic lineages.

Methods:
15 microsatellites from 7 studies, totaling n=643 bears

  • identified parent-offspring pairs with ML-Relate and removed one individual from each pair
  • deviations from HWE with ARLEQUIN
  • null alleles with Micro-checker.
  • differences in allelic richness with Kruskal-Wallis test
  • population structure with STRUCTURE, and analysed hierarchical substructure with separate analyses for each of the K=4 clusters under the admixture model
  • migration using BAYESASS.
  • demographic history with DIYABC, and tested different hypotheses: admixture, founder, split

mtDNA with cytochrome b:

  • aligned data with GENEIOUS
  • assigned new haplotypes to Wooding and Ward's clades with MRBAYES.
  • identified substitution rate with FINDMODEL
  • haplotype and nucleotide frequencies in ARLEQUIN.

Results

STRUCTURE results at successive values of K reflect the varying processes affecting differentiation between populations.

Haplotypic diversity mirrored nuclear genetic diversity. Admixture was the best-supported model. Lower-frequency haplotypes at the tips of the network suggest new mutations derived from haplotypes occupying internal nodes, but distinct haplotypes could be remnant original ones, recent mutations, or the result of introduction by translocation.

Drift was supported by decreased Fst values from the sources in MN and Manitoba to the study area.

Management implications: conserve gene flow, afford protection to subpopulations

Tuesday, January 20, 2015

Phylogeography and Pleistocene Evolution in the North American Black Bear

Wooding, S., and R. Ward. 1997. Mol. Biol. Evol 14:1096-1105.

Objectives: "to determine the character of phylogeographic structuring in a widespread North American carnivore" by 1) identifying if distinct patterns of distribution are present, 2) ascertaining the time scale over which diversity has evolved, using a molecular clock, 3a) identifying patterns of recent population growth using pairwise comparison of sequences, 3b) determining the prevalence of migration by comparing geographic distributions of diversity in the context of lineage age, and 4) discussing patterns of genetic diversity with respect to geological and habitat changes

Methods:
  • n=118 mtDNA sequences; Human primers H16498 and L15997, to amplify a control region in the mtDNA; Sequencing of single strand products
  • n=258 RFLPS of bears from 16 localities; Clades identified in sequencing were used to identify diagnostic RFLPS so that future unsequenced samples could be assigned to clades by amplifying the human primers and digesting the PCR products with restriction enzymes
  • Calculated a nucleotide substitution rate for the control region, using methods detailed by Waits 1996, resulting in 2.8% per Myr (slow for mammal coding)
  • Used the Asiatic black bear as an outgroup for phylogeographic analyses
  • Population growth assessed with mismatch distributions for pairwise sequences, to see if sample evolved in a growing population (Rogers 1995).
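The mismatch distribution in the last bullet is simply the frequency distribution of pairwise sequence differences; a toy computation with fabricated sequences:

```python
from collections import Counter
from itertools import combinations

# Fabricated aligned mtDNA fragments; a mismatch distribution tallies the
# number of differing sites for every pair of sequences in the sample.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TCGTACGT", "ACGTACGT"]

def diffs(a, b):
    return sum(x != y for x, y in zip(a, b))

mismatch = Counter(diffs(a, b) for a, b in combinations(seqs, 2))
print(sorted(mismatch.items()))  # -> [(0, 1), (1, 5), (2, 3), (3, 1)]
```

A smooth, unimodal "wave" in this distribution is the classic signature of a sample that evolved in a growing population (Rogers 1995), whereas a ragged, multimodal shape suggests long-term demographic stability.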
Results and Conclusions:
"The long-term population history of black bears appears to be characterized predominantly by long-term regional isolation followed by recent contact and hybridization": two major clades were identified from 12 lineages and were spatially clustered, with one clade represented in 14/16 localities. The clades differed at an average of 4.8% of nucleotide positions, which is unusual within mammalian populations and suggests deep, long-term divergence. The black bear clades seem to have originated at the Pliocene/Pleistocene boundary, 1.6-2.0 MYA, with patterns in diversity congruent "with forest refuge formation during the Pleistocene and regional expansion based on expanding forest refugia."

Within regions, no obvious phylogeographic structuring is present, and dispersal between populations is probably a regular occurrence (one may need to look at microsatellites to find any significant structure, which could also identify sex-biased differences, since μsats are biparentally inherited). Black bears have expanded with changes in their forest habitat, and patterns of genetic diversity within regions may be strongly affected by both regional mixing and population growth.

Keywords and Concepts:
molecular phylogeography: "a means of understanding evolutionary processes within species" (Avise 1994), and for "understanding the historical factors leading to extant patterns of diversity", using information from the "geographical distribution and topological relationships of genetic lineages, which reflects the long term structure and demographic history of populations"

mtDNA: mitochondrial DNA; circular and double-stranded. It is inherited only maternally and therefore has a smaller effective population size, so genetic drift can have a stronger effect. It has a higher mutation rate than nuclear DNA, making it informative for tracing maternal lineages and relatively recent divergences.

RFLPs: restriction fragment length polymorphisms. Fragmenting homologous DNA sequences using restriction enzymes, and then separating the fragments by length.
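The digest step can be sketched in code. The recognition site below is the real EcoRI site (cut after the G in GAATTC), but the sequence and the simplified cutting logic are my own toy example; in the actual assay the resulting fragment lengths, compared on a gel, are what assign a sample to a clade.

```python
# Toy restriction digest: cut a sequence at every occurrence of an
# enzyme's recognition site and report the fragment lengths.
def digest(seq, site, offset):
    """Cut immediately after `offset` bases into each recognition site."""
    frags, start, i = [], 0, seq.find(site)
    while i != -1:
        frags.append(seq[start:i + offset])
        start = i + offset
        i = seq.find(site, i + 1)
    frags.append(seq[start:])
    return [len(f) for f in frags]

# EcoRI-style example: site GAATTC, cut after the first base (G^AATTC).
seq = "ATATGAATTCGGCCGAATTCTTAA"
print(digest(seq, "GAATTC", 1))  # -> [5, 10, 9]
```

A sequence polymorphism that destroys or creates a site changes this length profile, which is the "length polymorphism" in the name.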

Simulations suggest that lineage age  ∝  lineage range ( Neigel and Avise 1993)

The beginning of a long organization project, hopefully

Good Afternoon!

This blog is a personal attempt to organize the papers I've read for discussion meetings with my research labs: one loosely focused on conservation genetics and another on spatial ecology. I recognize that this could be done using a citation manager, but I've got a backlog of papers starting from a few years back. I'd like to make sure that I lose as little as possible of the information and understanding gained from reading these papers. Furthermore, this will be good for me, to "write" on a consistent basis.

Each post will be a summary of a paper that I've read and discussed in lab meetings. They will include major conclusions, definitions of key words, and explicit but generalizable concepts. Organization and linking between posts will be maintained through blog labels, so that I can quickly find all the papers that talk about particular concepts, like Fst, for example.