First, let’s read in the data and see how it is represented. For reading in the tree I’m going to take advantage of the read.tree()
function from the ape package because writing a function for reading in a tree is a pain. If you’re interested in how to write a tree reading function I’d recommend checking out Paul Lewis’ tutorial for building phylogenetic software, it also serves as a good introduction to C++. As for reading in the sequence data, we will be using the read.nexus.data()
function in ape.
##Load in relevant packages
library(ape) ##This package has a lot of tree-related functions and has the 'phylo' class
##Load in prewritten functions
source("./challenge_fxns.R")
##read in our files
tree<-read.tree("../data/challenge_tree.tree")
seq<-read.nexus.data("../data/challenge_data.nex")
Cool beans! Now let’s see how the data are represented in R. Let’s start with the tree.
print(tree)##We can see some basic information by just printing our tree
##
## Phylogenetic tree with 3 tips and 2 internal nodes.
##
## Tip labels:
## [1] "t1" "t2" "t3"
##
## Rooted; includes branch lengths.
str(tree)##We can see the attributes and class of our tree by using the str() function
## List of 4
## $ edge : int [1:4, 1:2] 4 5 5 4 5 1 2 3
## $ edge.length: num [1:4] 0.1 0.1 0.1 0.1
## $ Nnode : int 2
## $ tip.label : chr [1:3] "t1" "t2" "t3"
## - attr(*, "class")= chr "phylo"
## - attr(*, "order")= chr "cladewise"
By printing out the tree we can see that we have a rooted tree with 3 tips and branch lengths, this is what we want and a check for correctness.
when we used the str()
function we could see that the tree was of class phylo and had 4 attributes:
We can access the indivudal attributes of the tree by typing our variable followed by $ and then the name of the attribute. For example:
tree$edge
## [,1] [,2]
## [1,] 4 5
## [2,] 5 1
## [3,] 5 2
## [4,] 4 3
We can also plot our tree along with its branch lengths to check if we loaded things in correctly
{
plot(tree)
edgelabels(tree$edge.length)
}
The Newick format for this tree is ((t1:0.1,t2:0.1):0.1,t3:0.1);
, so it looks like the tree is correct.
Alternatively, we can plot the node and edge numberings on the tree. These node and edge numberings correspond to what is in the edge matrix.
{
plot(tree,show.tip.label = FALSE)
edgelabels()
nodelabels()
tiplabels()
}
We have things plotted such that:
This plotting is also a good place to make better sense of the edge matrix attribute.
## [,1] [,2]
## [1,] 4 5
## [2,] 5 1
## [3,] 5 2
## [4,] 4 3
We can see that the second row, denoted by \([2,]\), has \(5\) and \(1\) in the columns. We can look at our plot and find the branch labeled with a \(2\) and see that it is connected to nodes labeled \(5\) and \(1\).
We can also notice that the ordering the columns matters. Node numbers that appear in the first column of the edge matrix are considered the ancestral node of the branch while node numbers in the second column are considered the descendant node of the branch. We can easily see that \(5\) is the ancestor, or parent, node for edge \(2\) while node \(1\) is the descendent, or child, of the branch. Pretty slick!
Similar things can be done for the sequence data to get a feel for how it is represented.
str(seq)
## List of 3
## $ t1: chr "t"
## $ t2: chr "c"
## $ t3: chr "a"
Here we can see that we have a list where each attribute in the list corresponds to one of our tip names. Then, within each list is our sequence information. Usually each attribute contains a vector of characters, however, we only read in an allignment with one character.
Before we get into the meat and potatoes that is the pruning algorithm, we will need a few functions to make our lives easier. Examples of these functions will refer to the node and edge numbers that were plotted above, so either be ready to scroll up to that tree or it may be handy to jot the tree down on a piece of scratch paper. Lastly, for brevity’s sake I won’t bother showing the code for the functions here. The interested reader can find the code along with explainations of the code in the challenge_fxns.R.
isTip(phy,nd)
: this function returns TRUE
if the given node is a tip and FALSE
otherwise.This function has two inputs:
###Example###
isTip(tree,2) ##Returns TRUE
isTip(tree,4) ##Returns FALSE
getChildren(phy,nd)
: This function finds the child node numbers of a given node on a tree. This function has two inputs:
###Example###
getChildren(tree,4) ##Returns 5 3
getChildren(tree,2) ##The node is a tip and has no children. Returns NULL
getBranchLength(phy,nd)
: This function returns the length of the branch between a given node and its parent. This function has two inputs:
###Example###
getBranchLength(tree,3) ##Returns 0.1
getBranchLength(tree,5) ##Returns 0.1
getBranchLength(tree,4) ##The root has no branch that leads to it. Returns NULL
subst_probsJC(i,j,v)
: This function computes the probability that a site transitions from nucleotide \(i\) to nucleotide \(j\) along a branch with length \(\nu\). The probabilities are calculated according to the Jukes Cantor model of sequence evolution: \[
p_{ij}(v) = \left\{
\begin{array}{ll}
\frac{1}{4} + \frac{3}{4}e^{\frac{-4\nu}{3}} & \quad i = j \\
\frac{1}{4} - \frac{1}{4}e^{\frac{-4\nu}{3}} & \quad i \neq j
\end{array}
\right.
\] This function has three inputs:
"a"
, "c"
, "g"
, or "t"
"a"
, "c"
, "g"
, or "t"
###Example###
subst_probsJC("a","g",0.5) ##Returns 0.6350628
subst_probsJC("a","a",0.5) ##Returns 0.1216457
##Compute the probability of transitioning from an A to any other base
A_A<-subst_probsJC("a","a",0.5) #A to A
A_C<-subst_probsJC("a","c",0.5) #A to C
A_G<-subst_probsJC("a","g",0.5) #A to G
A_T<-subst_probsJC("a","t",0.5) #A to T
A_A+A_C+A_G+A_T ##Returns 1. This is a check for correctness. The probability of transitioning from A to any other base should be 1.
siteSeqs2Likelihood(alignment, site_no)
: This function takes the observed nucleotide data at the tips for a given site and converts it into a likelihood format that our other functions can use to compute likelihoods for internal nodes. This function has two inputs:
###Example###
lik_seq<-siteSeqs2Likelihood(seq,1) ##put the first site of our alignment into the correct format
##We made a list where each tip name is an attribute that contains a vector with a 1 for the observed base and 0s everywhere else
lik_seq
## $t1
## a c g t
## 0 0 0 1
##
## $t2
## a c g t
## 0 1 0 0
##
## $t3
## a c g t
## 1 0 0 0
nodeLikelihood(l1,l2,v1,v2,sub_model=subst_probsJC)
: This function computes the conditional probability of each nucleotide for the ancestral node given the likelihood of the nucleotides at the descendent nodes and given the branch lengths that lead to each ancestral node. The likelihood of the ancestor node, denoted \(anc\), for a given base \(i\) is computed as follows: \[
\ell_{anc}(i) = \left(\sum_{j} p_{ij} (\nu_{d1})\ell_{d1}(j) \right)*\left(\sum_{j} p_{ij} (\nu_{d2})\ell_{d2}(j) \right)
\] Where \(d1\) and \(d2\) represent the two descendent nodes of the ancestral node. The computation is repeated for each of the four nucleotides. The function has five inputs:
subst_probsJC
###Example###
##compute the likelihood of each base for the node numbered 5 in our tree
##We can see that the nodes numbered 1 and 2 are the descendents of 5, so we need the sequence data in a likelihood form
lik_seq<-siteSeqs2Likelihood(seq,1)
##the nucleotide likelihood for each child
node1_lik<-lik_seq$t1
node2_lik<-lik_seq$t2
##the branch lengths for each child
v1<-getBranchLength(tree,1) ##branch length for the branch between nodes 5 and 1
v2<-getBranchLength(tree,2) ##branch length for the branch between nodes 5 and 2
node5_lik<-nodeLikelihood(node1_lik,node2_lik,v1,v2,sub_model=subst_probsJC) ##compute the likelihood of each base for node 5
node5_lik
## a c g t
## 0.0009738563 0.0282851014 0.0009738563 0.0282851014
##These numbers look like what we saw in lecture! Nice!
We should now have all the infrastructure we need to compute the site likelihood. All we need to do for this is:
For step 1. we can use the nodeLikelihood()
function to compute the likelihood for each base at the root, the node numbered \(4\). However, in order to do this we need the likelihood at each of the descendent nodes, the nodes numbered \(3\) and \(5\).
The likelihood for node \(3\) is straightforward, it is a tip and we have observed data for that node. As such, we can just convert the sequence data into a likelihood format for that node.
node3_lik<-lik_seq$t3 ##get the likelihood for node 3
##We stored
## a c g t
## 1 0 0 0
v3<-getBranchLength(tree,3) ##also get the branch length that connects nodes 4 and 3
Node \(5\) is a bit more tricky since this isn’t a tip. In order to get the likelihood for the bases at node \(5\) we need to use the nodeLikelihood()
function again but this time on node \(5\) where nodes \(1\) and \(2\) are the descendents. Luckily, we already did this computation in the example for nodeLikelihood()
.
###See the example for nodeLikelihood to see the likelihood computation for node 5
##The likelihood is stored in a variable called node5_lik
v5<-getBranchLength(tree,5) ##also get the branch length that connects nodes 4 and 5
We only need to plug in values to calculate the base likelihoods at the root
root_lik<-nodeLikelihood(node3_lik,node5_lik,v3,v5,subst_probsJC) ##compute the likelihood at the root for each base
root_lik #These numbers look like what we saw in class!
## a c g t
## 2.427687e-03 8.294894e-04 8.358527e-05 8.294894e-04
So we’ve calculated the likelihood for each base, now for step 2. We need to multiply these by their stationary frequencies. The Jukes Cantor model assumes stationary frequencies of \(A=C=G=T=0.25\). After we do this multiplication we only need to sum the values together to get the site likelihood
stationary_freqs<-c(0.25,0.25,0.25,0.25) ##The stationary frequencies are all 0.25
site_lik<-root_lik*stationary_freqs ##multiply by stationary frequencies
site_lik<-sum(site_lik) ##add all values together
site_lik #what a beaut
## [1] 0.001042563
We done did it! Although we did the calculations by hand, hopefully this example gives somewhat of an idea for how this process can be applied more generally. Because, while this may be somewhat easy to do for a single site with 3 taxa, doing these calculations manually quickly becomes cumbersome as the number of taxa and length of sequences grow. We will want to write a recursive algorithm to automate the process of going through the nodes to compute likelihoods and slowly work our way to the root.
Let’s start by first writing a wrapper function that does some of the housekeeping things that we need at the start as well as doing the computations at the end. We will call this function siteLikelihood()
and it will have 4 inputs:
siteSeqs2Likelihood()
##Compute the site likelihood
siteLikelihood<-function(phy,tip_lik,sub_model,stat_freqs){
##first we want to find the root of the tree
rt<-length(phy$tip.label)+1 ##for a tree with n tips, the root node number is n+1
rt_lik<-getLikelihoodRecursive(rt,phy,tip_lik,sub_model) ## This is the recursive function. We call it at the root.
##We will just assume right now that it works and that it gets us the likelihood for each base at the root
##Now we just need to multiply the likelihoods by their base frequencies and add them together
site_lik<-rt_lik*stat_freqs ##multiply by stationary frequencies
site_lik<-sum(site_lik) ##add all values together
return(site_lik)
}
This wrapper function is actually pretty short and sweet. However, we still need to define our recursive function, getLikeLihoodRecursive()
. If we assume it works properly we can see that we are just getting the likelihood at the root and doing the same calculations as when we did things manually.
Before writing the recursive function, it may be helpful to go over recursion. They’re just a fancy way of saying that it is a function that calls itself somewhere in the code. Khan academy has a good explaination of recursion where they recursively compute the factorial of a number. But briefly, when the recursive call is made, it is done in a way that breaks the problem into a smaller subproblem. Recursive calls continue to be made to divide into smaller and smaller subproblems until it reaches what is called the base case. The base case is a state where we know the answer to the subproblem. The answer in the base case is then used to compute the answer in the subproblems and subsequently to solve our initial problem.
Not useful to us at all but here is a fun display of recursion being abused to make fractals in PowerPoint. Or if recursion really has you jazzed up, you may want to consider learning a purely functional programming language as these don’t have for
or while
loops and instead rely solely on recursion for iterative processes.
Alright, now onto the function. getLikelihoodRecursive()
will get us the likelihood for each nucleotide at the node that we specify. The function has 4 inputs:
siteSeqs2Likelihood()
##Compute the likelihood for each nucleotide at a given node
getLikelihoodRecursive<-function(nd,phy,tip_lik,sub_model){
##First we want to check if we are at a tip. This is our Base Case.
if(isTip(phy,nd)){
##If we are at a tip then getting the likelihood is as simple as getting the values from tip_lik
lik<-tip_lik[[phy$tip.label[nd]]]
}
else{ ##we are at an internal node
##Computing the likelihood for an internal node requires us to use nodeLikelihood()
##however in order to use nodeLikelihood we need the likelihood of the descendent nodes
##first let's find the descendents
descs<-getChildren(phy,nd)
##here's where the recursion is, we call getLikelihoodRecursive on each of the descendents and just assume that it works fine.
lik1<-getLikelihoodRecursive(descs[1],phy,tip_lik,sub_model) ##get the likelihood of the first descendent
lik2<-getLikelihoodRecursive(descs[2],phy,tip_lik,sub_model) ##get the likelihood of the second descendent
##Great! so now we have the likelihood at each base for both descendents. Now all we need are the branch lengths
v1<-getBranchLength(phy,descs[1])
v2<-getBranchLength(phy,descs[2])
##Now compute the likelihood as we normally would with nodeLikelihood()
lik<-nodeLikelihood(lik1,lik2,v1,v2,sub_model)
}
return(lik)
}
###Example
getLikelihoodRecursive(1,tree,lik_seq,subst_probsJC) ##Gets the likelihood at node 1 which is a tip.
##Returns
## a c g t
## 0 0 0 1
getLikelihoodRecursive(5,tree,lik_seq,subst_probsJC) ##Gets the likelihood at node 5.
##Returns
## a c g t
## 0.0009738563 0.0282851014 0.0009738563 0.0282851014
Neat-o so we’ve got everything we need to read in the data, format it how we want it and then compute the likelihood at a site. Let’s give it a go.
##load in functions and packages
source("./challenge_fxns.R") ##load in the functions
library(ape)
##read in files
tree<-read.tree("../data/challenge_tree.tree")
seq<-read.nexus.data("../data/challenge_data.nex")
##Format sequence data to a format we can use
lik_seq<-siteSeqs2Likelihood(seq,1)
##stationary frequencies are all 0.25
stationary_freqs<-c(0.25,0.25,0.25,0.25)
##compute the likelihood
siteLikelihood(tree,lik_seq,subst_probsJC,stationary_freqs)
## [1] 0.001042563
If we had longer sequences we would just repeat this process for each site and multiply those likelihoods together using the AND rule. Additionally, this process should work with more taxa. However, with more taxa or longer sequences we will need to worry more about underflow as we are not using log-likelihoods in our calculations.
Nonetheless, We’ve computed the likelihood of the tree given our sequence data and assuming Jukes Cantor model for sequence evolution.