Welcome to Chapter 6, a first look at the evolution of coronaviruses. In this chapter, we will try to unravel the origin of SARS-CoV-2. In the previous chapter, we explored the artificial origin hypothesis. Now, you might naturally ask: what exactly is the origin of SARS-CoV-2? To help us explore that question, we will turn to the field of phylogenetics, the study of evolutionary relationships between organisms. The basic principles of phylogenetics can be modeled by the intuitive diagram shown here, which is called a phylogenetic tree or phylogeny, the formal term for an evolutionary tree. A phylogenetic tree is a diagram showing how present-day species are evolutionarily related.
Before we can uncover the secrets of the phylogenetic tree, we need to understand the concept of a tree data structure, which, as we shall see, demands an understanding of the graph abstract data type. In computer science, a graph is an abstract data type consisting of a collection of nodes, also called vertices, and edges connecting these nodes. A node is a single data point in the graph: one basic piece of information that the graph represents. An edge is simply a connection between two nodes, used to denote a relationship between the connected nodes. There are two types of edges. A directed edge is valid in one direction with respect to the nodes it connects. For example, a directed edge from node A, the source node, to node B, the destination node, is only valid in the stated direction, from A to B, but not from B to A. Think of these like one-way streets: you can only travel across a directed edge in one direction. An undirected edge is valid in both directions with respect to the nodes it connects. For example, an undirected edge between node A and node B is valid from A to B as well as in the reverse direction, from B to A.
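To make the graph vocabulary concrete, here is a minimal Python sketch of a graph with directed and undirected edges. The class and method names (Graph, add_directed_edge, add_undirected_edge, can_travel) are our own illustrative choices and not part of any standard library, and an undirected edge is simply stored as two directed edges, which is one common convention among several.

```python
class Graph:
    """A minimal graph: a set of nodes plus the edges that connect them."""

    def __init__(self):
        # Maps each node to the set of nodes reachable from it in one step.
        self.adjacent = {}

    def add_node(self, node):
        self.adjacent.setdefault(node, set())

    def add_directed_edge(self, source, destination):
        # A one-way street: valid from source to destination only.
        self.add_node(source)
        self.add_node(destination)
        self.adjacent[source].add(destination)

    def add_undirected_edge(self, a, b):
        # A two-way street: stored here as two directed edges (an assumption
        # of this sketch, not the only possible representation).
        self.add_directed_edge(a, b)
        self.add_directed_edge(b, a)

    def can_travel(self, start, end):
        return end in self.adjacent.get(start, set())


graph = Graph()
graph.add_directed_edge("A", "B")    # A -> B only
graph.add_undirected_edge("C", "D")  # C <-> D

one_way_forward = graph.can_travel("A", "B")
one_way_backward = graph.can_travel("B", "A")
two_way = graph.can_travel("C", "D") and graph.can_travel("D", "C")
```

Here the directed edge can be crossed from A to B but not from B to A, while the undirected edge between C and D can be crossed in either direction.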
Think of undirected edges as two-way streets: you can travel across an undirected edge in either direction. The simplest definition of a tree, then, is an undirected graph without any cycles or disconnected parts, such as the example shown here. We define a cycle as a path that starts and ends at the same node and does not contain any duplicate edges. You can prove to yourself that the tree here fits these criteria.
In a rooted tree, a given node can have a single parent node above it and any number of child nodes below it. There is a single node at the top of the tree that does not have a parent, which we call the root, and there can be any number of nodes at the bottom of the tree that have no children, which we call leaves. All nodes that have at least one child are called internal nodes. Just like with a family tree, a node's parent, grandparent, and so on, all the way to the root, are considered that node's ancestors, and a node's children, grandchildren, and so on, all the way to the leaves, are considered that node's descendants. In the example rooted tree to the left, the root is 1, the leaves are 2, 3, 5, and 6, and the internal nodes are 1 and 4. Note that we typically draw edges in a rooted tree as pointing away from the root.
In an unrooted tree, there is no notion of parents or children. Instead, a given node has neighbors. Any node with just a single neighbor is considered a leaf, and any node with more than one neighbor is considered an internal node. In the example unrooted tree, the leaves are 2, 3, 5, and 6, and the internal nodes are 1 and 4.
Now that we have an understanding of tree structure, let's map the components of a tree data structure onto the example rooted phylogenetic tree from earlier. In a typical phylogeny, the leaves represent present-day species. The leaves in this phylogeny are orangutan, gorilla, chimpanzee, and human.
The internal nodes of a typical phylogeny represent the ancestors of the leaves, and we typically assume these ancestors are extinct. The internal nodes in this phylogeny are X and Y. Note that each internal node of a rooted phylogeny represents the most recent common ancestor, or MRCA, of all of its descendants, and the root represents the MRCA of all nodes in the phylogeny. For example, Y is the MRCA of chimpanzee and human; X is the MRCA of gorilla and Y, and thus also of chimpanzee and human; and R is the MRCA of orangutan, X, and Y, and thus also of gorilla, chimpanzee, and human. In other words, R, which is the root of this phylogeny, is the MRCA of all nodes in the phylogeny. The edges that connect the nodes of a phylogenetic tree represent evolutionary relationships, and their lengths denote some unit of evolutionary distance: for example, millions of years, days, generations, number of mutations, and so on.
Note that the direction in which a phylogeny is drawn is irrelevant. We happened to draw the root to the left and the leaves to the right in this example, so time moves forward as we move to the right of the tree. But we could have drawn the root at the top and the leaves at the bottom, so that time would move forward as we move down the tree. Similarly, we could have drawn it in any other orientation.
In theory, all phylogenies should be rooted trees, but in practice, modern techniques that attempt to infer the evolutionary history of multiple molecular sequences are typically only able to infer unrooted phylogenies. For example, we can represent the evolutionary history between orangutan, gorilla, chimpanzee, and human using this unrooted phylogeny. In the field of phylogenetics, an active problem of interest is tree rooting: the task of estimating the true root of an unrooted phylogeny. For example, the true root of this phylogeny, which was R in our previously seen rooted phylogeny, would lie along the branch between orangutan and X.
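The rooted primate phylogeny described above is small enough to write down directly. The following Python sketch stores it as child-to-parent links and recovers the root, the leaves, and the MRCA of any two nodes. The variable and function names are our own, and for simplicity the sketch treats a node as its own ancestor, so the MRCA of X and human is X.

```python
# Child -> parent links of the rooted phylogeny: R is the root,
# X and Y are the (assumed extinct) ancestors.
parent = {
    "orangutan": "R", "X": "R",
    "gorilla": "X", "Y": "X",
    "chimpanzee": "Y", "human": "Y",
}

nodes = set(parent) | set(parent.values())
root = (nodes - set(parent)).pop()             # the only node without a parent
leaves = sorted(nodes - set(parent.values()))  # nodes that have no children
# Nodes with at least one child; by that definition this includes the root R.
nodes_with_children = sorted(set(parent.values()))


def path_to_root(node):
    """Return the node followed by its parent, grandparent, ... up to the root."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def mrca(a, b):
    """Most recent common ancestor: the first node on b's path to the root
    that also lies on a's path to the root."""
    on_path_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in on_path_of_a:
            return node
    raise ValueError("the two nodes are not in the same tree")
```

For instance, mrca("chimpanzee", "human") returns "Y", mrca("gorilla", "human") returns "X", and mrca("orangutan", "human") returns "R", matching the examples in the text.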
The position of the true root may not be obvious at first, but try to imagine moving the nodes in two-dimensional space. Inferring the true root of an unrooted phylogeny is inherently of interest to evolutionary biologists. The root of the phylogeny is the MRCA, and inferring the true root directly informs us of the true direction of evolution across species. This is especially important in viral phylogenetics, as the root of a viral phylogeny tells us the MRCA, and thus the probable origin, of the viral samples we observe in the epidemic. However, the concept of tree rooting is out of the scope of this course, and you can assume that all phylogenetic trees you will be analyzing will be unrooted. You will see that unrooted trees provide us with sufficient information for the types of questions we'll be exploring in this course, and we will attempt to discover the true root of the SARS-CoV-2 phylogeny in the next course.
Now that we have discussed the fundamentals of phylogenetics, it is time to investigate the origins of SARS-CoV-2. Specifically, you are likely wondering what species might have been responsible for this deadly outbreak in humans. To explore this question, we will study the evolutionary history of SARS-CoV-2 and its close relatives by attempting to infer a phylogeny. The first step in this endeavor is to identify and collect virus sequences similar to SARS-CoV-2. We will use BLAST, our favorite genetic search engine, as the tool for the job. Since we want to compare our SARS-CoV-2 sequence with other coronaviruses and not with itself, we can explicitly exclude SARS-CoV-2 from the search results. Much like the results from any other search engine, the BLAST results will likely contain far more hits than we actually need, and BLAST will sort them using a calculated metric of relevance. In simple terms, the most relevant results will be near the top. How exactly does BLAST measure relevance such that it is able to show us the most relevant hits at the top of the results?
For each potential hit, BLAST computes a metric known as the bit score, a normalized pairwise sequence alignment score that measures sequence similarity in a manner that is independent of query sequence length and database size. The specific details behind the bit score calculation are out of the scope of this course. Regardless, BLAST will report many more hits than we want to use, and it is our job to curate this dataset to include as many or as few sequences as we want.
At the time of recording this video, the entry with the largest max score is named synthetic SARS-CoV-2 spike glycoprotein measles morbillivirus, which has a max score of 2,637 and a percent identity of 100 percent. Should we be surprised by this? Despite the fact that we have excluded SARS-CoV-2 from our search space, the first hit is 100 percent identical to our query. What happened? If we look at the details of this top hit, we will see that the synthetic variant was created by inserting a SARS-CoV-2 S protein sequence into the measles morbillivirus genome in order to create a COVID-19 vaccine. For our purposes, given that this S protein sequence has a percent identity of 100 percent with respect to our own SARS-CoV-2 S protein sequence, meaning that the two sequences are identical, we can simply use this sequence to represent SARS-CoV-2 in our phylogenetic tree, which allows us to skip manually appending our own S protein sequence to the database of SARS-CoV-2-like S protein sequences that we will build. There are also quite a few bat and pangolin coronaviruses in the results.
As an exploratory exercise, we can extract the top 10 most relevant sequences to use as our dataset. Again, note that much like your everyday web search engine, the results of BLAST can change over time based on the latest sequences that have been uploaded to its database by scientists across the world. You might find that your sequence list differs from ours, but the overall message will be the same.
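The curation step described above, keeping only the most relevant hits, can be sketched in a few lines of Python. This is only an illustration: the hit names and scores below are made up for the example and are not real BLAST results, and the dictionary keys ("name", "bit_score") are our own choice rather than a BLAST output format.

```python
def top_hits(hits, n=10):
    """Return the n hits with the highest bit score, most relevant first."""
    return sorted(hits, key=lambda hit: hit["bit_score"], reverse=True)[:n]


# Hypothetical hits with invented scores, for illustration only.
example_hits = [
    {"name": "hypothetical_hit_A", "bit_score": 1200.0},
    {"name": "hypothetical_hit_B", "bit_score": 2600.0},
    {"name": "hypothetical_hit_C", "bit_score": 1900.0},
]

best_two = [hit["name"] for hit in top_hits(example_hits, n=2)]
```

With these invented scores, best_two holds hypothetical_hit_B followed by hypothetical_hit_C. In practice you would read the hits from the results that BLAST reports and then pick the cutoff, 10 in our case, that suits your analysis.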
Note that our current dataset contains a very narrow set of sequences that are extremely similar, and in fact almost identical, to SARS-CoV-2. In order to get a broader understanding of where SARS-CoV-2 fits among its more distant coronavirus relatives, we will run a second BLAST search that restricts the database to only include a single sequence per viral species. Each of these representative sequences is known as a reference sequence for that viral strain. Because we're only allowing one sequence per viral strain, and thus excluding the possibility of multiple identical or near-identical sequences from the same species, we can expect these sequences to be far more distinct from one another than in the results of our previous search, which can in fact be observed in lower percent identities as well as in lower max scores. At the time of recording this video, the entry with the largest max score is named spike glycoprotein SARS coronavirus Tor2, which has a max score of 2,038 and a percent identity of 75.96 percent. Indeed, SARS-CoV-2 is not only a relative of SARS-CoV-1, but it is also a member of a family of different coronaviruses which live in mammals. A subset of these coronaviruses live in humans, although this does not necessarily make them more likely to be closely related to SARS-CoV-2.
Once we have obtained our collection of S protein sequences, we can use MAFFT to perform multiple sequence alignment, or MSA, on them. In this visualization, white spaces represent gaps, and shades of green represent matches and mismatches. The darkest green represents an exact match to the most frequent amino acid in a given column, and lighter shades of green represent mismatched amino acids that are increasingly different from the most frequent amino acid in the column with respect to biochemical properties. Even just looking at this alignment with the naked eye, we can see that some sequences appear to be quite similar, while others have fairly unique amino acid sequences.
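What we are doing by eye here, judging how similar two aligned sequences are, can also be expressed as a number. The following Python sketch computes a percent identity between two rows of an alignment. It assumes one common convention, identical columns divided by aligned columns, skipping columns where both rows have a gap. BLAST and other tools may define the denominator differently, and the two short sequences are made up for illustration.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of aligned columns in which two MSA rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("rows of an alignment must have the same length")
    # Ignore columns where both rows are gaps: they carry no information
    # about this particular pair of sequences.
    columns = [(a, b) for a, b in zip(row_a, row_b)
               if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Two invented, already-aligned toy sequences (not real S protein data).
toy_row_1 = "MFVF-LVL"
toy_row_2 = "MFIFLLVL"
toy_identity = percent_identity(toy_row_1, toy_row_2)
```

In this toy pair, 6 of the 8 columns match, so toy_identity is 75.0. A value of 100 would mean the two aligned sequences are identical.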
In this alignment, for example, the sequences found from the default database look very similar to one another, whereas the sequences from the reference protein sequence database appear to be quite different from our SARS-CoV-2 S protein sequence, as seen here. This simple visualization of a multiple sequence alignment enables us to make a quick comparison of which sequences seem to be more or less similar to one another, but it does not allow us to answer higher-level questions about the evolutionary relationships between these sequences. Instead, we need to leverage a phylogenetic inference tool that will take our MSA as its input, conduct an analysis, and create a visualization that will make the answers to these questions trivial. We will use FastTree for this purpose.
FastTree takes a multiple sequence alignment as input and produces a phylogenetic tree such as the one shown here. FastTree is a maximum likelihood phylogenetic inference tool. In short, FastTree defines a likelihood score that is essentially the likelihood that a given tree generated a given multiple sequence alignment under a given mathematical model of evolution. We simply provide FastTree with the multiple sequence alignment and the mathematical model of evolution, and then FastTree searches for a tree that maximizes this likelihood score. How exactly does FastTree search for this maximum likelihood tree? The truth is that FastTree uses very complex algorithms and heuristics to do so, the details of which are outside the scope of this course. If, however, you are curious about the algorithmic details behind FastTree, you can find a complete in-depth explanation on the FastTree homepage.
One important comment: note that all trees outputted by FastTree, including this one here, are unrooted trees. Recall that unrooted trees only show similarity and local relationships between various strains. They fail to show the complete evolutionary history of these strains.
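If you would like to run this step from a script rather than by hand, the following Python sketch shows one way it might look. It assumes the FastTree executable is installed under the name FastTree and that the alignment is a protein alignment. The file names are placeholders, and the -lg flag (selecting the LG amino acid model instead of FastTree's default) is just one possible choice of evolutionary model, so check the FastTree documentation for the options that fit your data.

```python
import shutil
import subprocess


def fasttree_command(alignment_path, model_flag="-lg"):
    """Build the command line: the model of evolution plus the MSA file."""
    return ["FastTree", model_flag, alignment_path]


def infer_tree(alignment_path, output_path, model_flag="-lg"):
    """Run FastTree on an MSA and save the tree it prints (Newick text)."""
    if shutil.which("FastTree") is None:
        raise RuntimeError("FastTree executable not found on PATH")
    with open(output_path, "w") as tree_file:
        # FastTree writes the inferred tree to standard output.
        subprocess.run(fasttree_command(alignment_path, model_flag),
                       stdout=tree_file, check=True)


# Placeholder file name, for illustration only.
example_command = fasttree_command("spike_msa.fasta")
```

Calling infer_tree("spike_msa.fasta", "spike_tree.nwk") would then write the inferred unrooted tree to a file that tree viewers can open.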
For the remainder of this chapter, we will do what we can with this unrooted tree, and in a future course, we will discuss the computational problem of tree rooting, or the process of inferring an appropriate root, or MRCA, of a given unrooted phylogeny, as well as the corresponding analyses that you can conduct with a rooted tree. Note that if you try visualizing an unrooted phylogenetic tree outputted by FastTree, many visualization tools will display the tree in a way that makes it look rooted, but the rooting displayed by these visualizations is purely arbitrary. It's not necessarily the true root of the phylogeny. Be sure to keep this distinction in mind, and when visualizing unrooted trees, try to visualize them in this circular, radial fashion to avoid being misled by an arbitrary root.
Also note that the branch lengths in the tree outputted by FastTree are in units of normalized number of substitutions and not in units of time. The number of substitutions is correlated with time, because as more time passes, more mutations will occur. However, these two units are not directly proportional. We will discuss scaling a substitution tree into a time tree in future courses, but for now, keep this caveat in mind.
Let us take a closer look at our tree. It is clear that the viruses extracted using the default BLAST database, shown in green, sit on leaf nodes that are closer to one another than the viruses extracted using the reference sequence database, shown in red. This is expected because the default database, used in our first BLAST search, contains sequences identical or near-identical to our SARS-CoV-2 sequence, whereas the RefSeq database, used in our second BLAST search, does not contain any identical or redundant sequences. Now, let us take a closer look at the animal host species of the coronavirus strains in this phylogenetic study.
Specifically, we will look at the sequences that are very close to our SARS-CoV-2 S protein sequence, or in other words, the sequences from our first BLAST search. The SARS-CoV-2 S protein sequence is shown in green here. Sequences from bat reservoirs are shown in red, and sequences from pangolin reservoirs are shown in blue. Also, the segment of the phylogeny corresponding to these sequences is circled in green. It is a bit hard to see in this visualization, but the SARS-CoV-2 S protein sequence is nested very closely within a group of bat coronavirus sequences, and the pangolin coronavirus sequences are much further away in this tree. This could possibly suggest that a bat species acted as a host for the SARS-CoV-2 virus before it was transmitted to humans. However, looking only at an unrooted phylogeny rather than a rooted one, with branch lengths in units of substitutions rather than units of time, can mislead our interpretation. Further, our dataset is inherently sub-sampled. We are only able to analyze the data that we have collected, and not all possible coronavirus data has been collected to this day. There could well be other coronaviruses existing in animals that we simply haven't found yet, and if so, the sequences of those hypothetical coronaviruses are missing from our analysis. In the following section, we shall explore yet another theory for the origin of this virus.
The last hypothesis we will explore is the intermediate host hypothesis. Having completed our analysis in the preceding section, you may deduce that the currently available data seem to align well with the direct zoonotic transfer hypothesis. The phylogenetic analysis we conducted does, at least to some extent, support this mechanism. However, there are also researchers who look at the same data and believe that it supports another theory, the intermediate host hypothesis. Let's investigate why.
The intermediate host hypothesis argues that a species with which humans have regular contact was infected by bats or a similar animal and that, in turn, humans were infected by this intermediate species. It is supported by the same data that support the direct zoonotic transfer hypothesis, but it considers one additional piece of information: historical evidence. In particular, past epidemic and pandemic outbreaks have often been attributed to the direct zoonotic transfer hypothesis during the initial investigation stages, and they were then later revised to fit the intermediate host hypothesis. How is this possible? Let's consider the following hypothetical timeline. First, an epidemic is detected, and biologists identify the viral species that is responsible. Second, biologists perform a phylogenetic analysis that identifies the strain closest to the human one as coming from a non-human species, often bats. At the time of recording this video, this stage of investigation is where we currently are with COVID-19. With just these two steps, we would lean toward the direct zoonotic transfer hypothesis. However, we have often seen a third event in the timeline appear, often well after the initial epidemic: biologists discover an intermediate between humans and the originally suspected species. Let's explore two relatively recent historical scenarios in which this third event occurred.
We will go back to a pandemic we introduced in Chapter 1, the 2003 outbreak of Severe Acute Respiratory Syndrome, or SARS, caused by the virus SARS-CoV-1. During the search for the animal reservoir of SARS-CoV-1, biologists discovered infected palm civets in a live animal market in China. Meat from these animals is often added to so-called dragon-tiger-phoenix soup, an expensive Cantonese dish. The discovery dramatically changed the infected civets' fate. Instead of ending up in the soup, they were made into SARS scapegoats and were exterminated.
However, when later searches failed to identify more infected civets, biologists started to wonder whether palm civets were truly the original source of SARS. In 2005, they discovered SARS-like viruses in Chinese horseshoe bats. It turned out that bats could be SARS carriers, but they could probably only pass the virus to humans through intermediate hosts. Since bat meat is considered a delicacy and is also used in traditional Chinese medicine, there are plenty of chances to come in close contact with the virus in overcrowded live animal markets. At this stage of the SARS outbreak, both the direct zoonotic transfer hypothesis and the intermediate host hypothesis were plausible. However, as biologists gathered more data and eventually constructed the evolutionary tree of coronaviruses from bats, civets, and humans, they found strong evidence that civets were indeed intermediaries between bats and humans. In particular, they found that both the civet and human variants of SARS-CoV-1 were nested within the bat virus phylogeny. Therefore, viewpoints shifted to the intermediate host hypothesis in the case of SARS-CoV-1, which remains the leading theory to this day. It is important to note that, unlike with the unrooted mutation tree we inferred using FastTree, these conclusions could only be made from a rooted time tree. The concepts of phylogenetic rooting and dating are out of the scope of this course, and they will be discussed in a future course. For now, trust us that the tree rooting shown here is accurate.
Having explored SARS-CoV-1, let us now turn to our second case study. The second outbreak of interest to the intermediate host hypothesis is the 2012 Middle East Respiratory Syndrome, or MERS, epidemic, which was caused by the virus MERS-CoV. Similar to SARS-CoV-1, MERS-CoV was suspected to have originated from one of two sources: either directly from bats or through a camel intermediary.
Once again, we see both the direct zoonotic transfer hypothesis and the intermediate host hypothesis appearing. Phylogenetic analyses have confirmed that camel-to-human transmission occurred, which, given the evidence that bats do serve as reservoirs for MERS-CoV, strongly supports the intermediate host hypothesis for MERS. However, additional research suggests that both theories may be correct: in particular, in addition to camel-to-human transmission, bat-to-human transmission may have also occurred, therefore supporting the direct zoonotic transfer hypothesis. To this day, these investigations are still in progress.
We hope these two cases have served as examples of why some researchers support the intermediate host hypothesis. Specifically, some researchers argue that given how long it took to find the intermediate hosts in the SARS and MERS outbreaks, there could be an intermediate host in the COVID-19 outbreak that we just haven't found yet. Also, keep in mind that while we focused on SARS and MERS specifically in this video, there are other coronaviruses that have been traced to human transmission via an animal intermediate. Perhaps SARS-CoV-2 will join this list; only time will tell. In the next course, we will trace the spread of SARS-CoV-2 and study its evolution within the human population.