Chapter 5: Investigating the Origins of SARS-CoV-2.

In this chapter, we will try to unravel the origins of SARS-CoV-2. Let's first introduce the three most prominent hypotheses about its origins. For now, we will just provide a brief introduction, and later we will dive thoroughly into each of them. First is the artificial origin hypothesis, which posits that a genetics laboratory designed an artificial, man-made SARS-CoV-2 virus. Second is the direct zoonotic transfer hypothesis, which says that humans were infected with SARS-CoV-2 directly from bats or from some other original host. Last is the intermediate host hypothesis, which proposes that humans were infected with SARS-CoV-2 through a mammalian intermediate.

Before we explore the three hypotheses, we need to understand the fundamentals of aligning biological sequences. Up to this point, our analyses have used just one viral sample's genetic sequence. Recall that we began our investigation in Chapter 2 by downloading a single set of sequencing reads for SARS-CoV-2 with the accession number SRR10971381. As you might imagine, similar reads and corresponding assembled genome sequences exist for many other viral species. We can use sequence alignments to compare SARS-CoV-2 to closely related viruses.

In Chapter 2, we explained how we compare pairs of sequences by aligning them, for example with the Basic Local Alignment Search Tool, or BLAST for short. However, the term "alignment" is a bit too generic, as there are multiple types of alignment. We will use this example pair of sequences to explain the two most common types of pairwise alignment: global alignment and local alignment. Here we have lined up the entirety of our two sequences to construct what is called a global alignment. Green represents columns in which both sequences have the same letter, which are matches. Red represents columns in which the two sequences have different letters, which are mismatches.
Finally, blue represents columns in which one of the two sequences has a gap character, shown as a dash. These columns are known as indels, which is short for insertions and deletions. Here we have lined up the same two sequences to construct what is called a local alignment. In addition to the colors we used in the global alignment, we have one more color here: gray represents columns that are ignored by our alignment. You'll notice that the short region included in our local alignment seems to line up much more nicely between the two sequences than when we try to align them using global alignment.

We are being pretty hand-wavy about what counts as "lining up nicely", so let us formalize this a bit. Specifically, let us define a scoring function in which we score a given column as plus 1 if it is a match, minus 1 if it is a mismatch, and minus 1 if it is an indel. Further, let us define the score of an entire alignment as the sum of the scores of its columns. In the case of local alignment, we simply ignore the gray columns entirely. Our global alignment has 22 matches, 18 indels, and two mismatches, giving us an overall score of 22 - 18 - 2 = plus 2. An optimal global alignment is defined as a maximum-scoring alignment among all possible alignments of the two sequences over their entire lengths. On the other hand, our local alignment has 12 matches and two indels, giving us an overall alignment score of 12 - 2 = 10. Remember that we're ignoring the gray columns here. An optimal local alignment is defined as a maximum-scoring alignment among all possible alignments of all possible substrings of the two sequences.

We have now discussed the concept of pairwise sequence alignment, but how does sequence alignment help us explore the potential artificial origin of SARS-CoV-2? To explore the idea of artificial origin, we will want to compare SARS-CoV-2 against its close relatives using sequence alignment.
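To make the scoring scheme concrete, here is a minimal Python sketch. It scores a given gapped alignment column by column with the plus 1 / minus 1 / minus 1 scheme described above, and it computes the optimal global and local scores with the standard dynamic-programming recurrences (commonly known as Needleman-Wunsch and Smith-Waterman). The lecture does not name a specific algorithm, so treat the function names and the choice of recurrence as assumptions made for illustration.

```python
# Scoring scheme from the lecture: +1 match, -1 mismatch, -1 indel.
MATCH, MISMATCH, INDEL = 1, -1, -1


def score_alignment(row_a, row_b):
    """Sum the column scores of a gapped alignment given as two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += INDEL       # blue column: one sequence has a gap
        elif a == b:
            total += MATCH       # green column
        else:
            total += MISMATCH    # red column
    return total


def best_alignment_score(s, t, local=False):
    """Optimal global score, or optimal local score when local=True.

    Illustrative dynamic program (an assumption, not the lecture's tool):
    global alignment pays for every gap, local alignment may restart at 0.
    """
    prev = [0 if local else j * INDEL for j in range(len(t) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0 if local else i * INDEL]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (MATCH if s[i - 1] == t[j - 1] else MISMATCH)
            cell = max(diag, prev[j] + INDEL, curr[j - 1] + INDEL)
            if local:
                cell = max(cell, 0)      # ignore poorly matching prefixes
                best = max(best, cell)   # ignore poorly matching suffixes
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


# The column tallies quoted in the lecture reproduce the stated scores.
global_score = 22 * MATCH + 18 * INDEL + 2 * MISMATCH   # 2
local_score = 12 * MATCH + 2 * INDEL                    # 10
```

The `local=True` branch differs from the global one in only two places: every cell is clamped at zero, and the answer is the best cell anywhere in the table rather than the bottom-right corner. That is what lets a local alignment skip the "gray" columns on either side.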
If SARS-CoV-2 has any special features not apparent in its relatives, could that imply that it was engineered in a lab? Let's recall that the spike protein is the key to how SARS-CoV-2 invades human cells. In Chapter 4, we used Prokka to identify the start and end coordinates of the spike protein, which allows us to easily extract its nucleotide sequence. It turns out that Prokka also automatically translated this protein-coding nucleotide sequence into a protein sequence of amino acids. Just as we did with nucleotide sequences in the past, we can use BLAST to find existing protein sequences that are similar to our spike protein sequence. When we compare the SARS-CoV-2 spike protein sequence to the most similar existing spike protein sequence, which comes from bat coronavirus RaTG13, we see that the SARS-CoV-2 spike has an insertion of four amino acids, PRRA.

Is there any significance to this insertion? The answer is yes. In Chapter 3, we introduced proteases as molecular scissors that can cut proteins at specific locations called cleavage sites. It turns out that different proteases are specific to different cleavage sites. One protease called furin is known to cleave proteins where there is an amino acid sequence of RRAR. If we look carefully at the SARS-CoV-2 spike protein, we can see that the insertion of PRRA creates a furin cleavage site at position 682. The fact that the PRRA insertion forms an RRAR furin cleavage site is rather suspicious. This is because furin cleavage of the S protein in coronaviruses is known to be associated with increased infectiousness. In other words, coronaviruses with a furin cleavage site tend to be more infectious than coronaviruses without one. Well, eureka! The PRRA insertion that we found just happens to create a furin cleavage site, which is not a likely coincidence. Shouldn't this suggest that SARS-CoV-2 was engineered to make it a more infectious virus?
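The comparison just described can be sketched in a few lines of Python. The code below takes two rows of a gapped pairwise alignment, reports residues present in the first row but gapped in the second, and then locates the RRAR motif. The two short fragments and the starting coordinate of 675 are illustrative assumptions typed in by hand, not sequences retrieved from a database, and the function names are hypothetical; substitute the rows from your own BLAST alignment.

```python
def find_insertions(query_row, subject_row, query_start=1):
    """Return (position, residues) for runs present in the query but gapped
    in the subject. Positions are 1-based coordinates in the query protein."""
    if len(query_row) != len(subject_row):
        raise ValueError("alignment rows must have the same length")
    found, run, run_start = [], [], None
    pos = query_start - 1              # coordinate of the last query residue seen
    for q, s in zip(query_row, subject_row):
        if q != "-":
            pos += 1
        if s == "-" and q != "-":      # query residue aligned to a gap
            if not run:
                run_start = pos
            run.append(q)
        elif run:                      # the run of inserted residues has ended
            found.append((run_start, "".join(run)))
            run = []
    if run:
        found.append((run_start, "".join(run)))
    return found


def motif_positions(protein, motif, start=1):
    """Return the 1-based coordinates at which `motif` begins in `protein`."""
    width = len(motif)
    return [start + i for i in range(len(protein) - width + 1)
            if protein[i:i + width] == motif]


# Illustrative fragments (assumed, hand-typed), with the query fragment
# assumed to begin at spike residue 675.
sars_cov_2_row = "QTQTNSPRRARSVASQ"
ratg13_row     = "QTQTNS----RSVASQ"

inserted = find_insertions(sars_cov_2_row, ratg13_row, query_start=675)
furin_sites = motif_positions(sars_cov_2_row.replace("-", ""), "RRAR", start=675)
```

Under these assumptions, `inserted` reports the four-residue PRRA run and `furin_sites` reports an RRAR motif beginning at position 682, matching the coordinates quoted above. Note that the motif spans the last three inserted residues plus the R that follows them.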
Before we jump to such a dramatic conclusion, let's take a closer look at this insertion. Recall that a single amino acid sequence can be translated from multiple different nucleotide sequences. This is a direct consequence of multiple codons translating to the same amino acid. It also means that while the RaTG13 and SARS-CoV-2 amino acid sequences appear to be identical except for a suspicious, possibly man-made PRRA insertion, the underlying nucleotide sequences may actually be quite different. Let's take a look.

Let's zoom into this window at the nucleotide level. In this visualization, the top row represents the letters both sequences have in common, known as the consensus sequence. The middle row represents SARS-CoV-2 and the bottom row represents RaTG13. Notice that even though the amino acid sequences were identical aside from this specific insertion, the nucleotide sequences are actually quite different. In fact, if we were to look at the 144 aligned positions flanking either side of the PRRA insert, only partially pictured here, we would see 19 nucleotide mismatches. Yet the resulting translated amino acid sequences remain the same. Those 19 mismatches over 288 nucleotides (144 nucleotides on each side) represent a 6.6 percent difference. For now, take our word that a 6.6 percent difference is large, and join us in the next course to learn how we can use this information to estimate the time since the most recent common ancestor of these two viruses.

We have now established that although the RaTG13 and SARS-CoV-2 sequences look identical from an amino acid perspective, they are actually quite distant. With this observation in mind, we hope you agree with us that we need to examine some other relatives of SARS-CoV-2 in order to better understand where the PRRA insertion came from. Let's broaden our horizons beyond RaTG13 and SARS-CoV-2.
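The point about synonymous codons can be checked with a short Python sketch. It builds the standard genetic code, translates two different nucleotide strings, and counts the mismatches between them. The two 12-nucleotide strings are toy examples invented for illustration (they are not taken from either viral genome), and the function names are hypothetical. The last line reproduces the 6.6 percent figure from the mismatch count and window size quoted above.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nucleotides):
    """Translate a DNA string codon by codon, ignoring any trailing partial codon."""
    usable = len(nucleotides) - len(nucleotides) % 3
    return "".join(CODON_TABLE[nucleotides[i:i + 3]] for i in range(0, usable, 3))


def count_mismatches(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def percent_difference(mismatches, aligned_length):
    """Mismatches as a percentage of the aligned positions."""
    return 100 * mismatches / aligned_length


# Toy example (invented): two different ways to encode the peptide RRAR.
toy_a = "CGTCGAGCTCGC"
toy_b = "AGACGGGCAAGA"
same_protein = translate(toy_a) == translate(toy_b)   # both give "RRAR"
toy_mismatches = count_mismatches(toy_a, toy_b)       # 6 of 12 positions differ

# Figures quoted in the lecture: 19 mismatches over 2 * 144 = 288 positions.
lecture_difference = round(percent_difference(19, 288), 1)
```

In the toy pair, half of the nucleotide positions differ while the translated peptide is unchanged, which is the same effect, in exaggerated form, as the 19 silent mismatches around the insertion.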
Let's look at the spectrum of coronaviruses to see if there are others that share this suspicious PRRA insertion. In particular, if the insertion were truly artificial, it would be unlikely for us to observe it in the natural spike proteins of other coronaviruses. In the preceding section, we considered alignments of two sequences. However, our goal of examining a variety of coronaviruses creates a new challenge: we now have to align many spike protein sequences.

Perhaps we could simply align many spike proteins by constructing their individual pairwise alignments. The challenge is that while amino acid sequences of proteins performing the same function are likely to be somewhat similar, functionally critical similarities may be elusive in the case of distant species. You can indeed align pairs of sequences, but if the sequence similarities are weak, pairwise alignment may not identify evolutionarily related sequences. However, simultaneous comparison of many sequences often allows us to find similarities that pairwise sequence alignment fails to reveal. Bioinformaticians sometimes say that while pairwise alignment whispers, multiple alignment shouts.

We are thus facing the multiple sequence alignment problem, which we can illustrate using this example that represents an alignment of three sequences as a three-row matrix. The algorithms for constructing multiple sequence alignments are very complex, but the good news is that we don't need to reinvent the wheel. There are widely used tools we can leverage to examine spike proteins [inaudible]. It turns out that very few protein sequences related to the SARS-CoV-2 spike include the RRAR furin cleavage site, which would seem to strengthen the case for the artificial origin hypothesis. However, we may still be jumping to conclusions. Let's take a closer look at how furin cleavage sites work.
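To illustrate the "three-row matrix" view of a multiple alignment, here is a small Python sketch. It does not construct an alignment (that is the hard part, and is best left to established tools); it only takes rows that are already aligned and marks which columns are identical across every sequence. The three toy rows and the function name are invented for illustration.

```python
def conserved_columns(rows):
    """Return a string with the shared letter at fully conserved columns
    and '.' everywhere else. Rows must already be aligned (equal length)."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    mask = []
    for column in zip(*rows):            # walk the matrix column by column
        letters = set(column)
        if len(letters) == 1 and "-" not in letters:
            mask.append(column[0])       # every sequence agrees here
        else:
            mask.append(".")             # at least one mismatch or gap
    return "".join(mask)


# Toy alignment of three sequences, stored as a three-row matrix.
toy_alignment = [
    "ACG-TA",
    "ACGGTA",
    "A-GGTC",
]
mask = conserved_columns(toy_alignment)
```

For the toy rows, `mask` is `"A.G.T."`: three columns are conserved across all three sequences, while the others contain a gap or a mismatch. Scanning a real spike protein alignment this way is one simple means of seeing whether a motif such as RRAR is shared across rows.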
In fact, the furin cleavage site is not specific to RRAR. Rather, any amino acid sequence that fits the pattern of R, followed by any two amino acids, followed by R may serve as a furin cleavage site. When we generalize our search to this broader furin cleavage site pattern, it turns out that a furin cleavage site seems to be a fairly common feature of this viral group. The presence of furin cleavage sites in multiple coronaviruses is strong evidence that this insertion may not have been artificial, but rather may have been a product of natural evolution across many coronavirus strains.

We will close this chapter by mentioning a study suggesting that while the furin site does have a role in determining viral infectivity and host range, it may not be as critical in determining the virus's infectiousness as previously thought. Instead, a potentially more important feature of the spike protein is the receptor-binding domain, or RBD: the part of the spike protein that binds to the receptor of the host cell, which is crucial for host recognition and infectiousness. It turns out that the mutations in the RBD of human SARS-CoV-2 are not unique, and there are identical mutations in the pangolin coronavirus genome. This suggests that the mutations that determine high affinity to human cells may have existed for some time in viral populations before the virus passed to humans. Having explored the artificial origin hypothesis, we will now turn our attention toward the evolutionary history of the virus.
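The generalized pattern described at the start of this passage (R, any two amino acids, R) is easy to express as a regular expression. The Python sketch below scans a protein string for every occurrence, including overlapping ones. The function name is hypothetical, and the example strings are short illustrative fragments rather than database records. A hit only means the string fits the pattern; it does not by itself show that furin cleaves the protein there.

```python
import re

# R, then any two residues, then R. The lookahead makes matches zero-width,
# so overlapping candidate sites are all reported.
FURIN_LIKE = re.compile(r"(?=(R[A-Z]{2}R))")


def furin_like_sites(protein, start=1):
    """Return (position, motif) pairs for every R-X-X-R match in `protein`.

    `start` is the coordinate of the first residue of `protein`, so that
    fragments can be reported in full-length protein coordinates.
    """
    return [(start + m.start(), m.group(1))
            for m in FURIN_LIKE.finditer(protein.upper())]


# Illustrative fragments (assumed, hand-typed); 675 is an assumed start coordinate.
with_insert = furin_like_sites("QTQTNSPRRARSVASQ", start=675)
without_insert = furin_like_sites("QTQTNSRSVASQ", start=675)
toy_other = furin_like_sites("AARSKRAA")   # an invented non-RRAR example
```

Here `with_insert` contains a single hit, RRAR at position 682, `without_insert` is empty, and the invented `toy_other` string shows that a motif such as RSKR also fits the broader pattern even though it would be missed by a search for the literal string RRAR.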