● What's the long branch attraction problem?: Parsimony tries to minimize the number of mutations it can have some problems on trees with long branches. Parsimony algorithm will put the long branches together. If you get the sequences from the left tree, and use this sequences to simulate the tree with some parsimony algorithm, you can obtain the right tree.
● Why shall we model distances as stochastic processes?: Parsimony tryes to find the shortest evolutional path, but there can be a hidden mutations, that would be ignored by parsimony. We can model the substitutions with the probability matrix of substitutions, that depends on the branch length, as the realisation of continuos Markow chain.
● What does a substitution matrix look like?: Substitution matrix is the matrix that stores the probabilies of going from one nucleotide to another dependend on time. Time is aquivalent with branch length of phylogenetic tree.
● What is time-reversibility?: Time reversibility is needed for the efficient calculation of the Likelihood of the tree. The probability of going from i to j must be equal to probability of going from j to i. Evolution would occure in the same way if follow forward or backward in time.
● How does ML work in principle? Remember: AND and OR: at first, we habe to place a virtual root anywhere in the tree. We assume, that all sites are evolved independently, therefore we can calculate the ML for each site independetly, and then sum them. We dont know the inner states of the nodes. Therefore, we need to enumerate all possible assignments of nucleotides for each node
We multiply all probabilities for one site for some specific inner nodes assignment (AND), and summarize the Likelihoods of the all possible inner nodes assignments (OR).
● How does the Felsenstein pruning algorithm work?: we traverse thre tree, every time when we visit the node, we do some computations. We store the vectors with conditional likelihood in the nodes. This vectors describe the probability of observing the nucleotides in the subtree. For each node, bottom up: we sum the probabilies from going from the bottom node to top node, for each nucleotide. In the top node we multiply the probabilies from the left branch and the right branch.
● What's the time & space complexity for evaluating one tree?: We need to store the probability vectors in all nodes, therefore O(n) place, if n is the amount of sequences. We can build the structure bottom um, that requires O(n) operations for each site. If each sequence contains m sites, than O(mn) time for the evaluating the tree.
● How can we optimize branch lengths?: We choose a starting branch randomly, and we start changing the length of this branch, and try to find the brach lenght of this branch, than maximizes the likelihood. To optimize the entire tree we need to go bottom up and optimize the lengths of all brances. This is one iteration. Then we have to iterate over and over, until we reach some convergense.
● How are protein substitution models obtained?: the protein substitution models of the empirical models are obtained by optimizing the model parameters on a large collection of reference alignments.
● Which parameters do we have (to optimize)?: parameters are tree topology, the branch lengths, prior probabilies of the different nucleotides - vector $\pi$, and the probabilies of the substitutions - the values of matrix P.