Reading And Working With The Data

First, let’s read in the data and see how it is represented. For reading in the tree I’m going to take advantage of the read.tree() function from the ape package because writing a function for reading in a tree is a pain. If you’re interested in how to write a tree reading function I’d recommend checking out Paul Lewis’ tutorial for building phylogenetic software, it also serves as a good introduction to C++. As for reading in the sequence data, we will be using the read.nexus.data() function in ape.

##Load in relevant packages
library(ape) ##This package has a lot of tree-related functions and has the 'phylo' class

##Load in prewritten functions
source("./challenge_fxns.R")


##read in our files 
tree<-read.tree("../data/challenge_tree.tree")
seq<-read.nexus.data("../data/challenge_data.nex")

Cool beans! Now let’s see how the data are represented in R. Let’s start with the tree.

The Tree

print(tree)##We can see some basic information by just printing our tree
## 
## Phylogenetic tree with 3 tips and 2 internal nodes.
## 
## Tip labels:
## [1] "t1" "t2" "t3"
## 
## Rooted; includes branch lengths.
str(tree)##We can see the attributes and class of our tree by using the str() function
## List of 4
##  $ edge       : int [1:4, 1:2] 4 5 5 4 5 1 2 3
##  $ edge.length: num [1:4] 0.1 0.1 0.1 0.1
##  $ Nnode      : int 2
##  $ tip.label  : chr [1:3] "t1" "t2" "t3"
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"

By printing out the tree we can see that we have a rooted tree with 3 tips and branch lengths, this is what we want and a check for correctness.

when we used the str() function we could see that the tree was of class phylo and had 4 attributes:

  • edge: This is a matrix that denotes the topology of the tree and how the branches of our tree connect to nodes. Each row of the matrix corresponds to an branch and the two columns correspond to the two nodes for that branch.
  • edge.length: This is a vector of edge lengths. Each element corresponds to a row in the edge matrix. E.g. the 3rd element in the edge.length attribute corresponds to the branch in the 3rd row of the edge matrix.
  • Nnode: The number of internal nodes in the tree. For a rooted bifurcating tree with \(n\) tips we should have \(n-1\) nodes so this checks out.
  • tip.label:A vector of the tip names are stored here. Each element corresponds to the node numbered with that index. E.g. the 2nd element in tip.label corresponds to the name of the node that is numbered \(2\)

We can access the indivudal attributes of the tree by typing our variable followed by $ and then the name of the attribute. For example:

tree$edge
##      [,1] [,2]
## [1,]    4    5
## [2,]    5    1
## [3,]    5    2
## [4,]    4    3

We can also plot our tree along with its branch lengths to check if we loaded things in correctly

{
  plot(tree)
  edgelabels(tree$edge.length)
}

The Newick format for this tree is ((t1:0.1,t2:0.1):0.1,t3:0.1);, so it looks like the tree is correct.

Alternatively, we can plot the node and edge numberings on the tree. These node and edge numberings correspond to what is in the edge matrix.

{
  plot(tree,show.tip.label = FALSE)
  edgelabels()
  nodelabels()
  tiplabels()  
}

We have things plotted such that:

  • Edge Numbers are denoted with \(\color{green}{\text{green}}\)
  • Internal Node Numbers are denoted with \(\color{blue}{\text{blue}}\)
  • Tip Node Numbers are denoted with \(\color{yellow}{\text{yellow}}\)

This plotting is also a good place to make better sense of the edge matrix attribute.

##      [,1] [,2]
## [1,]    4    5
## [2,]    5    1
## [3,]    5    2
## [4,]    4    3

We can see that the second row, denoted by \([2,]\), has \(5\) and \(1\) in the columns. We can look at our plot and find the branch labeled with a \(2\) and see that it is connected to nodes labeled \(5\) and \(1\).

We can also notice that the ordering the columns matters. Node numbers that appear in the first column of the edge matrix are considered the ancestral node of the branch while node numbers in the second column are considered the descendant node of the branch. We can easily see that \(5\) is the ancestor, or parent, node for edge \(2\) while node \(1\) is the descendent, or child, of the branch. Pretty slick!

The Sequence Data

Similar things can be done for the sequence data to get a feel for how it is represented.

str(seq)
## List of 3
##  $ t1: chr "t"
##  $ t2: chr "c"
##  $ t3: chr "a"

Here we can see that we have a list where each attribute in the list corresponds to one of our tip names. Then, within each list is our sequence information. Usually each attribute contains a vector of characters, however, we only read in an allignment with one character.

Preliminary functions

Before we get into the meat and potatoes that is the pruning algorithm, we will need a few functions to make our lives easier. Examples of these functions will refer to the node and edge numbers that were plotted above, so either be ready to scroll up to that tree or it may be handy to jot the tree down on a piece of scratch paper. Lastly, for brevity’s sake I won’t bother showing the code for the functions here. The interested reader can find the code along with explainations of the code in the challenge_fxns.R.

###Example###
isTip(tree,2) ##Returns TRUE
isTip(tree,4) ##Returns FALSE
###Example###
getChildren(tree,4) ##Returns 5 3  
getChildren(tree,2) ##The node is a tip and has no children. Returns NULL
###Example###
getBranchLength(tree,3) ##Returns 0.1
getBranchLength(tree,5) ##Returns 0.1
getBranchLength(tree,4) ##The root has no branch that leads to it. Returns NULL
###Example###
subst_probsJC("a","g",0.5) ##Returns 0.6350628
subst_probsJC("a","a",0.5) ##Returns 0.1216457

##Compute the probability of transitioning from an A to any other base 
A_A<-subst_probsJC("a","a",0.5)  #A to A
A_C<-subst_probsJC("a","c",0.5)  #A to C
A_G<-subst_probsJC("a","g",0.5)  #A to G
A_T<-subst_probsJC("a","t",0.5)  #A to T
A_A+A_C+A_G+A_T ##Returns 1. This is a check for correctness. The probability of transitioning from A to any other base should be 1.
###Example###
lik_seq<-siteSeqs2Likelihood(seq,1) ##put the first site of our alignment into the correct format

##We made a list where each tip name is an attribute that contains a vector with a 1 for the observed base and 0s everywhere else
lik_seq 
## $t1
## a c g t 
## 0 0 0 1 
## 
## $t2
## a c g t 
## 0 1 0 0 
## 
## $t3
## a c g t 
## 1 0 0 0
###Example###

##compute the likelihood of each base for the node numbered 5 in our tree

##We can see that the nodes numbered 1 and 2 are the descendents of 5, so we need the sequence data in a likelihood form
lik_seq<-siteSeqs2Likelihood(seq,1)

##the nucleotide likelihood for each child
node1_lik<-lik_seq$t1
node2_lik<-lik_seq$t2

##the branch lengths for each child
v1<-getBranchLength(tree,1) ##branch length for the branch between nodes 5 and 1
v2<-getBranchLength(tree,2) ##branch length for the branch between nodes 5 and 2

node5_lik<-nodeLikelihood(node1_lik,node2_lik,v1,v2,sub_model=subst_probsJC) ##compute the likelihood of each base for node 5
node5_lik
##            a            c            g            t 
## 0.0009738563 0.0282851014 0.0009738563 0.0282851014
##These numbers look like what we saw in lecture! Nice!

The Pruning Algorithm

Doing the Algorithm Manually

We should now have all the infrastructure we need to compute the site likelihood. All we need to do for this is:

  1. Compute the likelihood for each base at the root
  2. Multiply the likelihood for each base by their respective stationary frequency
  3. Add all the elements together.

For step 1. we can use the nodeLikelihood() function to compute the likelihood for each base at the root, the node numbered \(4\). However, in order to do this we need the likelihood at each of the descendent nodes, the nodes numbered \(3\) and \(5\).

The likelihood for node \(3\) is straightforward, it is a tip and we have observed data for that node. As such, we can just convert the sequence data into a likelihood format for that node.

node3_lik<-lik_seq$t3 ##get the likelihood for node 3
##We stored
## a c g t 
## 1 0 0 0

v3<-getBranchLength(tree,3) ##also get the branch length that connects nodes 4 and 3

Node \(5\) is a bit more tricky since this isn’t a tip. In order to get the likelihood for the bases at node \(5\) we need to use the nodeLikelihood() function again but this time on node \(5\) where nodes \(1\) and \(2\) are the descendents. Luckily, we already did this computation in the example for nodeLikelihood().

###See the example for nodeLikelihood to see the likelihood computation for node 5

##The likelihood is stored in a variable called node5_lik
v5<-getBranchLength(tree,5) ##also get the branch length that connects nodes 4 and 5

We only need to plug in values to calculate the base likelihoods at the root

root_lik<-nodeLikelihood(node3_lik,node5_lik,v3,v5,subst_probsJC) ##compute the likelihood at the root for each base
root_lik #These numbers look like what we saw in class!
##            a            c            g            t 
## 2.427687e-03 8.294894e-04 8.358527e-05 8.294894e-04

So we’ve calculated the likelihood for each base, now for step 2. We need to multiply these by their stationary frequencies. The Jukes Cantor model assumes stationary frequencies of \(A=C=G=T=0.25\). After we do this multiplication we only need to sum the values together to get the site likelihood

stationary_freqs<-c(0.25,0.25,0.25,0.25) ##The stationary frequencies are all 0.25
site_lik<-root_lik*stationary_freqs ##multiply by stationary frequencies

site_lik<-sum(site_lik) ##add all values together
site_lik  #what a beaut
## [1] 0.001042563

We done did it! Although we did the calculations by hand, hopefully this example gives somewhat of an idea for how this process can be applied more generally. Because, while this may be somewhat easy to do for a single site with 3 taxa, doing these calculations manually quickly becomes cumbersome as the number of taxa and length of sequences grow. We will want to write a recursive algorithm to automate the process of going through the nodes to compute likelihoods and slowly work our way to the root.

Doing The Pruning Algorithm Recursively

Wrapper For The Recursive Function

Let’s start by first writing a wrapper function that does some of the housekeeping things that we need at the start as well as doing the computations at the end. We will call this function siteLikelihood() and it will have 4 inputs:

  • \(phy\): the phylogenetic tree
  • \(tip\_lik\): the sequence data of the tips, all in the likelihood format given by siteSeqs2Likelihood()
  • \(sub\_model\): A substitution model for sequence evolution
  • \(stat\_freqs\): The stationary frequencies for each nucleotide. Frequencies are ordered A,C,T,G.
##Compute the site likelihood
siteLikelihood<-function(phy,tip_lik,sub_model,stat_freqs){
  
  ##first we want to find the root of the tree
  rt<-length(phy$tip.label)+1 ##for a tree with n tips, the root node number is n+1
  
  rt_lik<-getLikelihoodRecursive(rt,phy,tip_lik,sub_model) ## This is the recursive function. We call it at the root.
  ##We will just assume right now that it works and that it gets us the likelihood for each base at the root
  
  
  ##Now we just need to multiply the likelihoods by their base frequencies and add them together
  site_lik<-rt_lik*stat_freqs ##multiply by stationary frequencies
  site_lik<-sum(site_lik) ##add all values together
  
  return(site_lik)
}

This wrapper function is actually pretty short and sweet. However, we still need to define our recursive function, getLikeLihoodRecursive(). If we assume it works properly we can see that we are just getting the likelihood at the root and doing the same calculations as when we did things manually.

A Bit On Recursion

Before writing the recursive function, it may be helpful to go over recursion. They’re just a fancy way of saying that it is a function that calls itself somewhere in the code. Khan academy has a good explaination of recursion where they recursively compute the factorial of a number. But briefly, when the recursive call is made, it is done in a way that breaks the problem into a smaller subproblem. Recursive calls continue to be made to divide into smaller and smaller subproblems until it reaches what is called the base case. The base case is a state where we know the answer to the subproblem. The answer in the base case is then used to compute the answer in the subproblems and subsequently to solve our initial problem.

Not useful to us at all but here is a fun display of recursion being abused to make fractals in PowerPoint. Or if recursion really has you jazzed up, you may want to consider learning a purely functional programming language as these don’t have for or while loops and instead rely solely on recursion for iterative processes.

The Recursive Function For The Pruning Algorithm

Alright, now onto the function. getLikelihoodRecursive() will get us the likelihood for each nucleotide at the node that we specify. The function has 4 inputs:

  • \(nd\): This is the number of the node for which we will compute the likelihood of each nucleotide
  • \(phy\): the phylogenetic tree
  • \(tip\_lik\): the sequence data of the tips, all in the likelihood format given by siteSeqs2Likelihood()
  • \(sub\_model\): A substitution model for sequence evolution.
##Compute the likelihood for each nucleotide at a given node
getLikelihoodRecursive<-function(nd,phy,tip_lik,sub_model){
  
  ##First we want to check if we are at a tip. This is our Base Case.
  if(isTip(phy,nd)){
    ##If we are at a tip then getting the likelihood is as simple as getting the values from tip_lik
    lik<-tip_lik[[phy$tip.label[nd]]]
  }
  else{ ##we are at an internal node 
    ##Computing the likelihood for an internal node requires us to use nodeLikelihood()
    ##however in order to use nodeLikelihood we need the likelihood of the descendent nodes
    
    ##first let's find the descendents
    descs<-getChildren(phy,nd)
    
    ##here's where the recursion is, we call getLikelihoodRecursive on each of the descendents and just assume that it works fine.
    lik1<-getLikelihoodRecursive(descs[1],phy,tip_lik,sub_model) ##get the likelihood of the first descendent
    lik2<-getLikelihoodRecursive(descs[2],phy,tip_lik,sub_model) ##get the likelihood of the second descendent
    
    ##Great! so now we have the likelihood at each base for both descendents. Now all we need are the branch lengths
    v1<-getBranchLength(phy,descs[1])
    v2<-getBranchLength(phy,descs[2])
    
    ##Now compute the likelihood as we normally would with nodeLikelihood()
    lik<-nodeLikelihood(lik1,lik2,v1,v2,sub_model)
  }
    return(lik)
}
###Example
getLikelihoodRecursive(1,tree,lik_seq,subst_probsJC) ##Gets the likelihood at node 1 which is a tip.
##Returns
## a c g t 
## 0 0 0 1

getLikelihoodRecursive(5,tree,lik_seq,subst_probsJC) ##Gets the likelihood at node 5. 
##Returns
##            a            c            g            t 
## 0.0009738563 0.0282851014 0.0009738563 0.0282851014

Putting It All Together

Neat-o so we’ve got everything we need to read in the data, format it how we want it and then compute the likelihood at a site. Let’s give it a go.

##load in functions and packages
source("./challenge_fxns.R") ##load in the functions
library(ape) 

##read in files 
tree<-read.tree("../data/challenge_tree.tree")
seq<-read.nexus.data("../data/challenge_data.nex")

##Format sequence data to a format we can use
lik_seq<-siteSeqs2Likelihood(seq,1)

##stationary frequencies are all 0.25
stationary_freqs<-c(0.25,0.25,0.25,0.25)

##compute the likelihood
siteLikelihood(tree,lik_seq,subst_probsJC,stationary_freqs)
## [1] 0.001042563

If we had longer sequences we would just repeat this process for each site and multiply those likelihoods together using the AND rule. Additionally, this process should work with more taxa. However, with more taxa or longer sequences we will need to worry more about underflow as we are not using log-likelihoods in our calculations.

Nonetheless, We’ve computed the likelihood of the tree given our sequence data and assuming Jukes Cantor model for sequence evolution.