library(simDNA)

This vignette will explain more about using the package simDNA for simulating DNA from various types of population sizes. For more information about analyzing DNA using this package, please consult the analyzeDNA vignette.

The standard coalescent model

We will start by explaining the theory behind the simulating function simDNAseq in the simDNA package. When looking at the standard coalescent model we are interested in analyzing how a given population evolved over time (measured in generations). This is typically done by taking a sample of size \(n\) from the entire population and finding their most recent common ancestor.

When going backwards in time, we start out by having \(n\) individuals (lineages) and end up with one common ancestor after the occurrence of \(n-1\) coalescent events (i.e. two individuals ‘merge’ into one \(n-1\) times). During this span of time we thus see that the total number of individuals decreases by one after each coalescent event. In other words there are \(n, n-1, \dots, 2\) individuals present respectively during this time. We denote by \(T_i\), for \(i=2,\dots ,n\), the time during which there were \(i\) individuals present. Using genealogical tree terminology these are also called branch lengths and are illustrated in the figure from Wakeley below.

Here we see that the time, when there were exactly 5 individuals present, corresponds to \(T_5\). We denote this specific region of the tree the 5’th level.

Mutations are the primary factor contributing to coalescent events, and once we have observed an ancestral tree these are sprinkled on according to a Poisson distribution. That is, given we are in level \(k\) of the tree, the number of mutations is determined by \(Pois(T_k k \theta/2)\) where \(\theta\) is the mutation rate.

When we have observed mutations as described above, we are interested in the segregating sites matrix. That is, for each individual we want to find the positions deviating from the nucleotide in the ancestral DNA sequence. Such mutant nucleotides are denoted by the number 1 and by 0 otherwise.

Since the branch lengths are generated differently according to the type of the population size, these are explained seperately.

Fixed population size

Assuming that the underlying population has a constant size \(N\) during the span of time, we know that \[ T_k \sim exp\Big(\binom{k}{2}\Big), \quad \text{for } k=2,\dots,n. \] When specifying the population type to be ‘fixedPop’ in the function simDNAseq the simulated branch lengths come from the above distribution, and we use these lengths to compute the segregating sites matrix. An example hereof could be

simDNAseq(n = 10, seqLen = 22, mutRate = 2, popType = "fixedPop")
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#>  [1,]    0    0    0    0    0    0    1    0    0     0     0     1     0
#>  [2,]    0    0    0    0    0    0    0    0    1     1     1     0     0
#>  [3,]    0    0    0    0    0    0    0    0    1     0     0     0     0
#>  [4,]    0    0    0    0    0    0    0    0    1     0     0     0     0
#>  [5,]    0    0    0    0    0    0    0    0    1     0     0     0     0
#>  [6,]    0    0    0    0    0    0    0    0    1     0     0     0     0
#>  [7,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#>  [8,]    0    0    0    0    0    0    0    0    1     0     0     0     1
#>  [9,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [10,]    0    0    0    0    0    0    0    0    1     0     0     0     0
#>       [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
#>  [1,]     0     0     0     0     0     0     0     0     0
#>  [2,]     0     0     0     0     0     0     1     0     0
#>  [3,]     0     0     0     0     0     0     0     0     0
#>  [4,]     0     0     0     0     0     0     0     0     0
#>  [5,]     0     0     0     0     0     0     0     0     1
#>  [6,]     0     0     0     0     0     0     0     0     0
#>  [7,]     0     0     0     0     0     0     0     0     0
#>  [8,]     0     0     0     0     0     0     0     0     0
#>  [9,]     0     0     0     0     0     0     0     0     0
#> [10,]     0     0     0     0     0     0     0     0     0

Variable population size

Assuming that the underlying population has a variable size during the span of time, we denote by \(N(j)\) the number of sequences in the population \(j\) generations before the present. For \(j=0\) we let \(N := N(0)\) denote the number of individuals in the present. Furthermore we let \[ f(x)=\frac{N(j)}{N}, \quad \text{for } \frac{j-1}{N} < x \le \frac{j}{N}, \,\, j=1,2,\dots \] denote the relative size function and \[ \Lambda(t) = \int_0^t \frac{1}{f(x)}\, dx \] denote the integrated intensity function. This function is essential in the algorithm for simulating branch lengths in a population of variable size. The algorithm is the following

  1. Simulate \(T_k \sim exp\big(\binom{k}{2}\big)\) for \(k=2,\dots,n\).
  2. Calculate the cumulative sum \((S_n,S_{n-1}\dots,S_2):=(T_n,T_n+T_{n-1},\dots,T_n+\cdots +T_2)\).
  3. Calculate \(\big(\Lambda^{-1}(S_n),\dots,\Lambda^{-1}(S_2)\big)\).
  4. Return \(\big(\Lambda^{-1}(S_n), \Lambda^{-1}(S_{n-1})-\Lambda^{-1}(S_n),\dots,\Lambda^{-1}(S_2)-\Lambda^{-1}(S_3)\big)\).

Note that since \(\Lambda(t)\) depends on the relative size function, the above algorithm looks different for different types of population sizes.

Exponentially growing population size

Assuming that the size of the population grows exponentially (forwards in time) we can use the above algorithm with \[ f(x)=exp(- \lambda x), \] where \(\lambda\) denotes the rate of which the population size increases.

When specifying the population type to be ‘varPop’ in the function simDNAseq the simulated branch lengths come from the algorithm using the above relative size function. We use these branch lengths to compute the segregating sites matrix. An example hereof could be

simDNAseq(n = 8, seqLen = 20, mutRate = 5, popType = "varPop", expRate = 1.5)
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [2,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [3,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [4,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [5,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [6,]    0    0    0    0    0    0    0    0    1     0     0     0     0
#> [7,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [8,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#>      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#> [1,]     0     1     0     0     0     0     0
#> [2,]     0     0     0     0     0     0     0
#> [3,]     0     0     0     0     0     0     0
#> [4,]     0     0     0     0     0     0     0
#> [5,]     0     0     0     0     0     0     0
#> [6,]     0     0     0     0     0     0     0
#> [7,]     0     0     0     0     0     1     0
#> [8,]     0     0     0     0     0     1     0

Suddenly expanded population size

When the size of the population suddenly expands, we have that backwards in time the population size is \(N\) before the expansion and \(\alpha N\) afterwards for \(0<\alpha \le 1\). The decline in population size happened at time \(b N\) (measured in generations) in the past. In this case the relative size function used in the algorithm is \[ f(x)= \left\{ \begin{array}{ll} 1 & 0\le x < b, \\ \alpha & x\ge b. \end{array} \right. \]

When specifying the population type to be ‘sudExpPop’ in the function simDNAseq, the simulated branch lengths come from the algorithm using the above relative size function. We use these lengths to compute the segregating sites matrix. An example hereof could be

simDNAseq(n = 25, seqLen = 30, mutRate = 8, popType = "sudExpPop", expansionTime = 2, proportion = 0.9)
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#>  [1,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#>  [2,]    0    0    1    0    0    0    0    0    0     0     0     0     0
#>  [3,]    0    0    0    0    0    1    0    0    0     0     0     0     0
#>  [4,]    0    0    0    0    0    0    0    0    0     0     0     0     1
#>  [5,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#>  [6,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#>  [7,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#>  [8,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#>  [9,]    0    0    0    0    0    0    0    0    0     0     0     0     1
#> [10,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [11,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [12,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [13,]    0    0    0    0    0    0    0    0    0     0     0     0     1
#> [14,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [15,]    0    1    0    0    1    0    0    0    0     0     0     0     0
#> [16,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [17,]    0    0    0    0    0    0    0    0    0     0     0     0     1
#> [18,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [19,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [20,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [21,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [22,]    1    0    0    0    0    0    0    0    0     1     1     0     0
#> [23,]    0    0    0    0    0    0    0    0    0     0     0     1     0
#> [24,]    0    0    0    0    0    0    0    0    0     0     0     0     0
#> [25,]    0    0    0    0    0    0    0    0    1     0     0     0     0
#>       [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
#>  [1,]     1     0     0     1     0     0     0     0     0     0     0
#>  [2,]     0     0     0     1     0     0     0     0     1     0     0
#>  [3,]     0     0     0     0     0     0     0     0     0     0     0
#>  [4,]     0     0     0     0     0     1     0     0     0     0     0
#>  [5,]     0     0     0     1     0     0     0     0     0     0     0
#>  [6,]     0     0     0     1     0     0     0     0     0     0     0
#>  [7,]     0     0     0     1     0     0     0     0     0     0     0
#>  [8,]     0     0     0     0     0     0     0     0     0     0     0
#>  [9,]     0     0     0     0     0     0     0     0     0     0     0
#> [10,]     0     1     0     0     0     0     0     0     0     0     0
#> [11,]     0     1     0     0     0     0     0     0     0     0     0
#> [12,]     0     0     0     1     0     0     0     0     0     0     0
#> [13,]     0     0     0     0     0     0     1     0     0     0     0
#> [14,]     0     0     0     1     0     0     0     0     0     0     0
#> [15,]     0     0     0     1     1     0     0     0     0     0     0
#> [16,]     0     1     1     0     0     0     0     0     0     0     0
#> [17,]     0     0     0     0     0     0     0     0     0     0     0
#> [18,]     0     0     0     1     0     0     0     0     0     0     0
#> [19,]     0     0     0     1     0     0     0     0     0     0     0
#> [20,]     0     0     0     1     0     0     0     1     0     0     0
#> [21,]     0     0     0     1     0     0     0     0     0     0     0
#> [22,]     0     0     0     1     0     0     0     0     0     0     0
#> [23,]     0     0     0     1     0     0     0     0     0     0     0
#> [24,]     0     0     0     1     0     0     0     0     0     0     0
#> [25,]     0     0     0     1     0     0     0     0     0     0     0
#>       [,25] [,26] [,27] [,28] [,29] [,30]
#>  [1,]     0     0     0     0     1     0
#>  [2,]     0     0     0     0     1     0
#>  [3,]     0     0     0     0     0     0
#>  [4,]     0     0     0     0     1     0
#>  [5,]     0     0     0     0     1     0
#>  [6,]     0     0     0     0     1     0
#>  [7,]     0     0     0     0     1     0
#>  [8,]     0     0     0     0     0     0
#>  [9,]     0     0     0     0     1     0
#> [10,]     0     0     0     0     1     0
#> [11,]     0     0     1     0     1     0
#> [12,]     0     0     0     0     1     0
#> [13,]     0     0     0     0     1     0
#> [14,]     0     0     0     0     1     0
#> [15,]     0     0     0     0     1     0
#> [16,]     0     0     0     0     1     0
#> [17,]     0     0     0     0     1     0
#> [18,]     0     0     0     0     1     0
#> [19,]     0     0     0     0     1     0
#> [20,]     0     0     0     0     1     0
#> [21,]     1     1     0     1     1     0
#> [22,]     0     0     0     0     1     0
#> [23,]     0     0     0     0     1     0
#> [24,]     0     0     0     0     1     0
#> [25,]     0     0     0     0     1     0

References

Wakeley J. (2009) Coalescent Theory: An Introduction. Colorado: Roberts and Company Publishers.

Tavaré, S. (2004) Ancestral Inference in Population Genetics. Berlin: Springer-Verlag.