Population Growth, Fitness, and Selection

From 22115
Jump to navigation Jump to search

This exercise is part of the course Computational Molecular Evolution (22115).

Overview

In this exercise you are going to use a simple computer program for simulating the growth of a population. The simulation will be based on an exponential growth model and will involve organisms having one of two different genotypes ('A' and 'a'). By performing the simulation with a range of different model parameter values you will sharpen your intuitive understanding of the dynamics of the model.

You will also get a brief introduction to assessing how well a model describes some observed data (the "fit" of a model), and learn how model parameters can be estimated from such data (the process known as "model fitting").

Finally, you will start learning how to work in the UNIX environment that we will use in the rest of the course (and that is alo used by bioinformaticians in the real world).

If you are interested in learning more about UNIX you can browse this set of tutorials.

Getting started

1. Construct today's working directory:

In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).

cd /path/to/molevol
mkdir simulation
cd simulation

The mkdir (make directory) command constructs a new directory named "simulation". Subsequently the cd (change directory) command is used to select the newly constructed directory as "current working directory".

2. Copy required files:

cp ../data/growth.py ./growth.py
cp ../data/modelfit.data ./modelfit.data
ls -l

The cp command copies the simulation program (growth.py) and a file containing experimental data (modelfit.data), from the data directory to your current working directory. Two dots in a filepath (../) means "parent directory" (so two levels up would be: ../../).

The ls (list) command with option -l (long) shows information about the files present in the current directory.

3. Start extra Terminal window and cd to working directory:

Start a second Terminal app via the start menu. Also cd to today's working directory in that terminal window:

cd /path/to/molevol/simulation

In this part of the exercise, you will need two terminal windows open at the same time.



Exploration of models of Population Growth, Fitness, and Selection


Question 1

Have a look at the simulation program:

nedit growth.py &
This command opens the file growth.py (which is a simple text file) in the nedit editor. Having an ampersand ("&") at the end of the command makes nedit run in the background.
Note: you can use any text editor to open this python file. If you are on a mac you will probably use `mate` instead of `nedit` for instance.
The growth.py program is written in the programming language Python. You will probably be able to understand most of what is going on even if you have no programming experience (Note that hash signs - "#" - indicate the beginning of a comment. Text after # on a line will not be executed.)

Question: what are the initial values of the parameters N0 (initial population size), fA (initial frequency of 'A'), rate_A (growth rate, or fitness, of 'A'), and rate_a (growth rate of 'a')


Question 2

Run the simulation program:

In one of your terminal windows, type the following:

./growth.py
This command executes the program growth.py, which is located in your current working directory (the dot in "./growth.py" means "current directory"). For each generation the state of the population and the frequencies of the 'A' and 'a' alleles will be printed to the screen. The simulation runs for 20 generations, but you could change this by altering the max_gen parameter.

Run the simulation again, saving output to file:

./growth.py > poptest
The symbol ">" causes the output to be "redirected" to a file named "poptest" (you could have named it anything). Open the result file in nedit to verify that all output has been saved:
nedit poptest &

Question: what is the total population size (N) at generation 0 and generation 20? (Close the nedit window when you're done.)


Question 3

Start rstudio:

(Note: depending on you operating system you may have to start Rstudio using the command-line as below, or by double clicking or using the start-menu. If you do not use the command-line, then you need to set the current working directory to be your simulation directory).

In the second terminal window type the following:

rstudio &
RStudio is an "integrated development environment for R, a programming language for statistical computing and graphics." It is among the tools you are required to have some familiarity with when you do bioinformatics.

In the console window in RStudio, type the following to load the "tidyverse" package:

library(tidyverse)
The 'tidyverse' is "a set of packages that work in harmony because they share common data representations and user interface. This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step. Learn more about the 'tidyverse' at <https://tidyverse.org>."

Now, read data from the file you just created, select variables, and reshape it for plotting: In RStudio type the following:

df = read_table("poptest")
df2 = df %>% select(t, N, N_A, N_a) 
df3 = df2 %>% pivot_longer(cols = -c("t"))
The "%>%" is the so-called "pipe" symbol, and works like pipe in UNIX, i.e, by sending the output from one command into a second command. This is extremely useful for making long, combined manipulations in R, while keeping them readable. The select command selects the columns we want to use for plotting (t, and all columns with population info: N, N_A, N_a). The pivot_longer command converts data from wide format to long format, which is useful when you want to automatically create legends for multiple variables (More generally, you want to understand what "tidy" data is, and why that will help with your data analyses: R For Data Science: Tidy Data). You may want to inspect the intermediate data frames by simply typing their names in RStudio:
df
df2
df3

(Note: Normally you would here have used the pipe operator to chain these commands together, thus avoiding constructing the intermediate data frames. Here I only included these steps to make it clear what is going on in the different steps of the command).

Now, plot the population sizes:

 ggplot(df3, aes(x=t, y=value, color=name)) + geom_line()
The ggplot command plots the population data as a function of time, with automatic legend and coloration based on variable (N, N_A, and N_a). Specifically, we here plot total population size ("N"), the number of organisms with allele 'A' ("NA"), and the number of organisms with allele 'a' ("Na") for each generation. Recall that in this run the two alleles had the same fitness (rate_A = rate_a = 1.2) but different initial population sizes (fA = 0.3, fa = 0.7).

Now plot the allele frequencies: In RStudio type the following:

df4 =  df %>% select(t, fA, fa) %>% pivot_longer(cols = -c("t"))
ggplot(df4, aes(x=t, y=value, color=name)) + geom_line()
Here we have plotted the frequencies of the two alleles for each simulated generation. Note how we here combined the select and pivot_longer commands into one by using the pipe.

Question: How do the frequencies of the two alleles (fA and fa) behave for this case where the two alleles have the same fitness?


Question 4:

Simulation number 1: In the terminal window where you are not running RStudio: Using what you learned above, run the simulation program for the parameter values listed below, and save the output to a file named "res.1" (without the quotes).

You should use nedit to alter the relevant parameter values in the growth.py file. Remember to save growth.py after making the alterations (File -> Save).

N0 = 50     rate_A = 1.2  rate_a = 1.2  fA = 0.3

Question: What is the relative fitness of allele 'a' (using 'A' as a reference):


Question 5:

Based on the file "res.1" you just created above, plot the population sizes: In RStudio, type the following:

df = read_table("res.1")
df2 =  df %>% select(t, N, N_A, N_a) %>% pivot_longer(cols = -c("t"))
ggplot(df2, aes(x=t, y=value, color=name)) + geom_line()

Now plot allele frequencies:

df3 =  df %>% select(t, fA, fa) %>% pivot_longer(cols = -c("t"))
ggplot(df3, aes(x=t, y=value, color=name)) + geom_line()

Question: What is the behavior of population sizes (N_A and N_a) and allele frequencies (fA and fa) for this set of parameter values?


Question 6:

Simulation number 2: Again, using what you have learned, Run the simulation program for the parameter values listed below, and save the output to a file named "res.2" (without the quotes).

N0 = 1000   rate_A = 0.7  rate_a = 0.7  fA = 0.3

Question: What is the relative fitness of allele 'a' (using 'A' as a reference):


Question 7:

Plot population sizes: In RStudio, type the following:

df = read_table("res.2")
df2 =  df %>% select(t, N, N_A, N_a) %>% pivot_longer(cols = -c("t"))
ggplot(df2, aes(x=t, y=value, color=name)) + geom_line()

Plot allele frequencies:

df3 =  df %>% select(t, fA, fa) %>% pivot_longer(cols = -c("t"))
ggplot(df3, aes(x=t, y=value, color=name)) + geom_line()

Question: What is the behavior of population sizes (N_A and N_a) and allele frequencies (fA and fa) for this set of parameter values?


Question 8:

Simulation number 3: Run the simulation program for the parameter values listed below, and save the output to a file named "res.3".

N0 = 1000   rate_A = 2.0  rate_a = 1.5  fA = 0.02

Question: What is the relative fitness of allele 'a' (using 'A' as a reference)


Question 9:

Plot population sizes:

df = read_table("res.3")
df2 =  df %>% select(t, N, N_A, N_a) %>% pivot_longer(cols = -c("t"))
ggplot(df2, aes(x=t, y=value, color=name)) + geom_line()

Plot allele frequencies:

df3 =  df %>% select(t, fA, fa) %>% pivot_longer(cols = -c("t"))
ggplot(df3, aes(x=t, y=value, color=name)) + geom_line()

Question: What is the behavior of population sizes (N_A and N_a) and allele frequencies (fA and fa) for this set of parameter values?


Question 10:

Simulation number 4: Run the simulation program for the parameter values listed below, and save the output to a file named "res.4".

N0 = 1000   rate_A = 1.2  rate_a = 0.9  fA = 0.02

Question: What is the relative fitness of allele 'a' (using 'A' as a reference)


Question 11:

Plot population sizes:

df = read_table("res.4")
df2 =  df %>% select(t, N, N_A, N_a) %>% pivot_longer(cols = -c("t"))
ggplot(df2, aes(x=t, y=value, color=name)) + geom_line()

Plot allele frequencies:

df3 =  df %>% select(t, fA, fa) %>% pivot_longer(cols = -c("t"))
ggplot(df3, aes(x=t, y=value, color=name)) + geom_line()

Question: What is the behavior of population sizes (N_A and N_a) and allele frequencies (fA and fa) for this set of parameter values?


Question 12:

Simulation number 5: Run the simulation program for the parameter values listed below, and save the output to a file named "res.5".

N0 = 10_000  rate_A = 0.8  rate_a = 0.6  fA = 0.02

NOTE: In python (for versions >= 3.6) you can add underscores to numbers to aid readability. The underscores will be ignored by python but are helpful to get the correct number of digits in a number.

Question: What is the relative fitness of allele 'a' (using 'A' as a reference)


Question 13:

Plot population sizes:

df = read_table("res.5")
df2 =  df %>% select(t, N, N_A, N_a) %>% pivot_longer(cols = -c("t"))
ggplot(df2, aes(x=t, y=value, color=name)) + geom_line()

Plot allele frequencies:

df3 =  df %>% select(t, fA, fa) %>% pivot_longer(cols = -c("t"))
ggplot(df3, aes(x=t, y=value, color=name)) + geom_line()

Question: What is the behavior of population sizes (N_A and N_a) and allele frequencies (fA and fa) for this set of parameter values?


Question 14:

Compare simulation 3 and 5: Simulation 3 had rate_A = 2.0 and rate_a = 1.5, while simulation 5 had rate_A = 0.8 and rate_a = 0.6. In the first simulation there was therefore exponential growth, while in the second there was exponential decline. Now, look at the plots of allele frequencies from these two simulations and compare their behaviour. (If you know some R you may even try to plot both sets of allele frequencies in the same plot).

Question: How does the behavior of fA and fa compare in the two simulations?


Model Fit, Parameter Estimation

In this part of the exercise, we will briefly consider some aspects of the fit between models and reality.


Question 15:

Have a look at the data file:

In RStudio type this:

dat = read_table("modelfit.data")
print(dat)
This file contains a set of (pseudo) empirical data: the size of a population (column labeled "N") for a number of generations (column labeled t).

Plot data points:

ggplot(dat, aes(x=t, y=N)) + geom_point()

Select first 7 data points for model fitting:

We will use the first 7 data points for fitting models, while leaving out the 8th data point to test how well our fitted models generalise. (This is a very minimal version of a technique called "out of sample testing").

train = dat %>% filter(t<14)
print(train)

Estimating parameters of exponential growth model from data:

We will now assume that the population is growing exponentially according to this model:

where is the initial population size, is the instantaneous rate of increase, and is the generation number. Our model thus has two "free parameters: and . You will now attempt to find a "good" set of values for these parameters. We will take "good" values to be those that cause the theoretical curve to lie as close as possible to the observed data. This process is called model-fitting.

There are actually several different ways of defining "as close as possible". One measure that turns out to be convenient is the "sum of squared residuals" (SSR; the word "error" is sometimes used instead of "residual"). The approach is the following: for each x,y point in the data set, the difference between the observed y and the y-value predicted by the model is computed. This difference (the "error" or "residual") is then squared, and the sum of all the squared residual terms is then taken to be an indication of how well the model fits the data. The best fitting model thus has the smallest possible SSR, and this approach is therefore referred to as "least squares model fitting". Another measure of model fit that we will return to later in the course is the model likelihood.

Find a good set of initial parameter values:

We will use the nls function ("Non-linear Least Squares") in R to fit our models. This function takes as input a formula (the model) and some data, and returns estimates of the free model parameters, along with summaries of model fit. The function also requires a set of "start values" - guesses at approximate values for the free parameters in the model - to run. These do not need to be very close to the final result, they should just help by placing the numerical model fitting algorithm in a reasonable neighbourhood in parameter space.

In the R commands below, replace "???" by your guess about reasonable parameter values, and then check how well they fit by plotting the data points along with the line corresponding to your initial values. You will probably need to repeat this step a few times until you are in the vicinity of the correct values (but dont overdo it - the nls function should do most of the work). Note: It will make it simpler to repeat these commands if you place them in a small R script (click the plus icon in the upper left corner of RStudio and select "R Script"):

N0_init = ???
r_init = ???
xpred = seq(0, 15, 0.1)
ypred = N0_init * exp(r_init * xpred)
dfpred = tibble(x=xpred, y=ypred)
ggplot(train, aes(x=t, y=N)) + 
    geom_point() + 
    geom_line(dfpred, mapping=aes(x=x, y=y), color="blue")


Fit the exponential model:

Once you are satisfied with the initial parameter values, use the following commands to fit the exponential model using least squares:

fit1 = nls(
    formula = N ~ N0 * exp(r * t), 
    data = train,
    start=list(N0=N0_init, r=r_init),
    trace=TRUE
)

The option trace=TRUE causes nls to print the sum-of-squared residuals and the parameter values are printed at the conclusion of each iteration. Have a look at how well the fitted values match the training data:

xpred = seq(0, 15, 0.1)
ypred = predict(fit1, list(t = xpred))
dfpred = tibble(x=xpred, y=ypred)
ggplot(train, aes(x=t, y=N)) + 
    geom_point() + 
    geom_line(dfpred, mapping=aes(x=x, y=y), color="blue")

The function "predict()" takes as input the results from fitting a model (i.e., estimates of the free parameters) along with some predictor values (here the t-values). It produces as output the predicted outcome values (here population sizes) corresponding to the input predictor values.

Now, print a summary of the fitted model

print(fit1)

Question: Find the estimated parameters (N0 and r) and the residual sum of squares of the fitted model in this output.


Question 16:

Fit polynomial model:

So far we have assumed that the exponential model was the best choice for describing the growth data. This model has two free parameters: "N0" and "r". Let us now consider an alternative model with 7 free parameters, namely the 6'th order polynomial

The free parameters of this model are: . Note that we now have as many free parameters as data points in our training set

Using the same approach as we did above for an exponential model, we now fit this polynomial model to the data. In this case it turns out we do not need to include guesses at starting values:

fit2 = nls(
    formula = N ~ b0 + I(b1*t) + I(b2*t^2) + I(b3*t^3) + I(b4*t^4) + I(b5*t^5) + I(b6*t^6), 
    data = train,
    trace=TRUE,
    control=list(warnOnly = TRUE)
)

Ignore the warning message. The I() function is here used to force nls to interpret the multiplication and exponentiation operators (* and ^) literally. This is necessary because they have special meanings in R formulas. The option warnOnly = TRUE was needed to force nls to finish fitting the model (in this case, with as many parameters as data points, nls would usually stop with an error message).

Plot the fitted model compared to the training data:

xpred = seq(0, 13, 0.1)
ypred = predict(fit2, list(t = xpred))
dfpred = tibble(x=xpred, y=ypred)
ggplot(train, aes(x=t, y=N)) + 
    geom_point() + 
    geom_line(dfpred, mapping=aes(x=x, y=y), color="blue")

Also, print a summary of the fitted model:

print(fit2)

Question: Report the estimated parameters (b0, b1, b2, b3, b4, b5, b6) and the sum of squares of residuals for the fitted model


Question 17:

Question: Does the polynomial or the exponential model have the best fit, measured using the sum of squared residuals?


Question 18:

Assess predictive performance of polynomial and exponential models:

You will recall that we used 7 of our 8 data points for fitting the models. The 8th data point had these values:

   t       N
  14    11825

Now, use the models trained on the first 7 data points to predict the population value for t=14:

predict(fit1, list(t=14))
predict(fit2, list(t=14))

Question: Report the predicted values for t=14 for the exponential and polynomial models


Question 19:

Also, plot the two models and compare to all 8 data points:

xpred = seq(0, 14.5, 0.1)
yexp = predict(fit1, list(t = xpred))
ypoly = predict(fit2, list(t = xpred))
dfpred = tibble(x=xpred, exponential=yexp, polynomium=ypoly) %>%
    pivot_longer(cols=c(exponential, polynomium), names_to="model") %>%
    filter(value < 14000 & value > -5000)
ggplot() + 
    geom_point(dat, mapping=aes(x=t, y=N)) + 
    geom_line(dfpred, mapping=aes(x=x, y=value, color=model))

Question: Based on the data point for t=14 which model do you now think has captured the biological reality best?