● By reference versus de novo assembly: De novo assembly means the construction of a new genome sequence. By reference asembly maps the reads to the known genome of this species. De novo assembly is much harder and longer than by reference assembly.
● What is BLAST?: BLAST (Basic Local Alignment Search Tool) is the software to find similar sequences in huge database.
● What is BLAST good for?: BLAST can quickly find the genomes with the sequences that are similar to some query sequence.
● Why not use Smith-Waterman instead?: Smith-Waterman performs O(n*m) operations for each pair of sequences. The number of reference sequences in the databases is very large. For example, GenBank has around 200 milions of sequences. The use of Smith-Waterman would require a 200 millions tables to be calculated. That would take a huge amount of time, even with clever optimisations.
● How does BLAST work → seed, extend,evaluate!: Seed: build a list of all subwords of the given length from the query sequence. For each subword build a neighborhood, which includes similar subwords. → create a list of all subwords and calculate a list for all similar subwords. These subwords are called seeds.Then scan a database for exact matches with this list.
Extend: starting from seeds, extend alignment in both directions. Find the longest sequences with maximum similarity score.
Evaluate: calculate the statistical significance — are sequences biologically related or are they similar just by chance.
● What is Genome assembly?: Genome can be represented as a set of the long DNA molecules. In the shotgun sequencing this molecules will be splitted in the short parts. This parts can be translated into strings of letters. These strings are called reads. The assembler rebuilds the original sequence with these reads.
● How does de novo assembly work?: Assemble reads into larger strings using the overlaps between them. There are two methods to perform de novo assembly: the overlap graphs and the de Brujin graphs.
● What is an overlap graph?: Represent each read as a graph node. Compute all pair-wise alignments between reads and represent overlaps as graph edges.
● How do we traverse this graph to assemble a genome?: Then find the Hamiltonian path for this grap. Find a path, which visits each node exactly once. This path is the reconstructed original genome sequence. Problem: there is no efficient algorithm for finding such path.
● What is a de Bruijn graph?: Each read is decomposed into series of k-mers, similar to BLAST seeding. Each unique k-mer is represented by an edge of the graph. Nodes are (k-1)-mers: prefix and suffix of the k-mer associated with the edge connecting them.
● What is a k-mer?: k-mers are the subsequences of the original string with the lengt k.
● How do we traverse a de Bruijn graph?: The original sequence can be reconstructed by finding an Eulerian path (visit each edge exactly once). There exists an efficient algorithm to find an Eulerian path.
● Which problem are we trying to solve?: We are trying to map all reads to some reference genome.
● Why not use Blast?: we want to construct new genome, not to find the similarities in another genomes.