In this section we will start with some practical applications and exercises that will let you work through some of the commands that we've shown. Let's go back to our data example. As you might recall, we were talking about three plant systems, toy apple, toy pear, and toy peach, and all the relevant data was stored in the directory plants.

We have three types of information. The first is genome files, which are multi-FASTA files containing the sequences of chromosomes or scaffolds. The second is annotations of genes and their potential variants; a gene can have one or multiple variants. And the third is the list of samples that have been collected for each of the species. So these are the three types of information, corresponding to the three types of files that we'll be performing operations on, and we'll try to answer a list of practical questions.

The first set of questions, which I will be showing in this section, refers to operations on the data set pertaining to a single species. In the following section, we'll be talking about operations that apply to cross-species comparisons. So let's get started.

Some examples of questions would be: how many chromosomes are there in the genome? That would be one question. Another one would be: how many genes, and how many transcripts, that is, how many variants of the genes, are there in each of the genomes? And we can further qualify this last question by asking: for how many genes do we have a single variant, and for how many genes do we have more than one variant? You might recall I was talking about color potentially being yellow, green, or red.

So let's see how we would try to answer these questions given our data set, and we're going to exemplify with apple. We're in the core directory plants, and we're going to move with cd to the apple directory. And we'll try to answer the first question: how many chromosomes are there in the genome?
You might recall that using the command more, we were able to visualize the content of the file apple.genome. It revealed a multi-FASTA structure, where each FASTA sequence, each chromosomal sequence, was introduced by a FASTA header line. The header started with a greater-than sign followed by an identifier, and it was followed by the actual sequence itself. In other words, one easy way to determine how many chromosomes or scaffolds we have in the genome is to simply count how many header lines we have.

Now, I'm going to point out that one distinctive item in a FASTA header line is the greater-than sign at its start. So what we can do is simply a grep: count how many lines that start with the greater-than sign appear in the genome file. This is simply typed as grep, then in quotes the greater-than sign, which is the marker for a sequence header, then the file apple.genome. The quotes matter, because without them the shell would treat the greater-than sign as an output redirection. This will simply list all of the header lines. But it only shows us the listing, and we want the number.

To get the number, we can apply one of two different commands in UNIX. We can pipe this to wc -l, which gives us the number of lines read from standard input, and it's going to tell us five. Alternatively, one option of the grep command is -c, so grep -c, the greater-than sign in quotes, apple.genome, which gives us the count directly, and it will again be five. So that's a simple way of looking at how many chromosomes we can find in a file.

Let's try to answer the next question: how many genes are there in the apple species, and how many variants are there? We can find that information in the file apple.genes. Let's remind ourselves: the gene name is listed in column one and the variant name is listed in column two. Observe that there might be multiple lines for the same gene, because there might be multiple variants of the gene.
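Before moving on to the genes, the two ways of counting header lines can be tried out with a small sketch. The file built here is only a hypothetical stand-in for apple.genome: the identifiers chr1 to chr5 and the sequences are invented for illustration, and only the count of five follows the lecture.

```shell
# Hypothetical stand-in for apple.genome: five short FASTA records.
# Identifiers and sequences are invented; only the record count (5) is from the lecture.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>chr1\nACGTACGT\n>chr2\nGGCCTTAA\n>chr3\nTTGACCA\n>chr4\nCCGGA\n>chr5\nATATGC\n' > apple.genome

# List the header lines. The quotes stop the shell from reading > as a redirection.
grep ">" apple.genome

# Count them by piping the listing into wc -l ...
n_wc=$(grep ">" apple.genome | wc -l)

# ... or let grep count the matching lines itself.
n_grep=$(grep -c ">" apple.genome)

echo "headers via wc -l:   $n_wc"
echo "headers via grep -c: $n_grep"
```

Both counts should agree, since grep -c reports the number of matching lines, which is exactly what wc -l counts in the piped version.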
So simply cutting the first column of the file, the column that contains the gene names, with cut -f1 apple.genes, and piping it into the tool more so we can look at it, will give us two listings of the gene smell, two listings of size, three listings of color, and so on. So what kind of command can you use at this point to keep only one copy of each gene name, so we can count the number of genes?

If you thought about uniq, then that is a good option, because the lines for each gene are actually grouped together within the list. So we pipe the output of cut -f1 apple.genes to uniq, and it gives us a listing of the genes. And now piping that to wc -l will give us the number. So there are ten different genes in the apple genome.

Another way of doing this, instead of using the command uniq, would be to sort uniquely, with sort -u: sort, and report only one line for any number of occurrences of the same item. That would give us the names listed alphabetically, color, shape, size, smell, taste, with uppercase words being listed before the lowercase words, as I mentioned in the beginning. And we can again pipe that to wc -l to see how many there are. And indeed, we're replicating the answer, which is ten.

So now we will do a similar operation to identify the number of variants. To find the number of variants, we're looking at column number two. So cut -f2 apple.genes gives us the list of variants. Let's first pipe that to wc -l to find the number, which is 16. And now let's also sort uniquely to see the number of distinct variants. If you count those, you're going to see that it is also 16. So every variant was listed exactly once. And that's true, because the file indeed contains only one line for each variant of a gene. So the number of genes is 10, and the number of variants is 16.

Now, let's address the more complicated question of how many genes have a single variant.
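The gene and variant counts above can be reproduced with a minimal sketch. The tab-separated file below is a hypothetical stand-in for apple.genes: the names appl6 to appl8 and all the variant names are invented, and the rows are arranged only so that the totals match the ones quoted in the lecture (10 genes, 16 variants).

```shell
# Hypothetical stand-in for apple.genes: column 1 = gene, column 2 = variant (tab-separated).
# Several gene names and all variant names are invented for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\t%s\n' \
  smell smell-1  smell smell-2 \
  size size-1    size size-2 \
  color color-1  color color-2  color color-3 \
  appl4 appl4-1  appl5 appl5-1  appl6 appl6-1  appl7 appl7-1  appl8 appl8-1 \
  taste taste-1  taste taste-2 \
  shape shape-1  shape shape-2 > apple.genes

# Number of genes: collapse adjacent repeats with uniq, or sort uniquely.
n_genes_uniq=$(cut -f1 apple.genes | uniq | wc -l)
n_genes_sort=$(cut -f1 apple.genes | sort -u | wc -l)

# Number of variant lines, and number of distinct variant names.
n_variants=$(cut -f2 apple.genes | wc -l)
n_distinct=$(cut -f2 apple.genes | sort -u | wc -l)

echo "genes (uniq):      $n_genes_uniq"
echo "genes (sort -u):   $n_genes_sort"
echo "variant lines:     $n_variants"
echo "distinct variants: $n_distinct"
```

Note that uniq only collapses repeated lines when they are adjacent, so it gives the right answer here because each gene's lines sit next to each other. sort -u does not depend on that.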
And then the related question of how many genes have multiple variants. The information pertains to genes and variants and is found in columns one and two. So let's first cut just columns one and two from the file apple.genes. Now, for each gene, we can obtain the number of variants that are related to it. The way we can do that is by looking at column one: every gene will be listed as many times as the number of variants that it has. So if we now cut only column one, we will have smell twice, size twice, color three times, and, for instance, appl10 showing up just one time.

So how do we extract the information that will allow us to identify which genes have multiple variants versus which genes have only one variant? A simple way is to use uniq, just as we have done before. But instead of just listing each name, we are going to count how many times it appears in the file, with the -c option. So uniq -c. Let's see what it gives us. It shows that smell appears two times, size appears two times, color three times, appl4 one time, appl5 one time, and so on.

As you've seen, we have already written a series of commands. To this listing we can apply one more command to identify first all the genes that have one variant. We recognize them because they have a one listed in the count column. So doing a grep of the pattern space, one, space, in quotes, will give us only those lines that have a single variant. We can further pipe that to wc -l and get 5, or equivalently we can use the -c option with grep, and that's also going to give us 5. So the number of genes that have a single variant is five.

Now we can simply say: ten total genes, minus five genes that have a single variant, leaves us with exactly five genes that have multiple variants. Or we could do it the hard but interesting way. So let's see what pipeline of commands would allow us to retrieve that information.
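Before the pipeline for multiple variants, here is a minimal sketch of the single-variant count described so far. As before, the file is a hypothetical stand-in for apple.genes, with invented variant names and some invented gene names, arranged so that five genes appear exactly once.

```shell
# Hypothetical stand-in for apple.genes (tab-separated: gene, variant); names partly invented.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\t%s\n' \
  smell smell-1  smell smell-2 \
  size size-1    size size-2 \
  color color-1  color color-2  color color-3 \
  appl4 appl4-1  appl5 appl5-1  appl6 appl6-1  appl7 appl7-1  appl8 appl8-1 \
  taste taste-1  taste taste-2 \
  shape shape-1  shape shape-2 > apple.genes

# Count how many lines (variants) each gene has.
cut -f1 apple.genes | uniq -c

# Keep only the genes whose count is exactly 1, then count those lines.
n_single_wc=$(cut -f1 apple.genes | uniq -c | grep " 1 " | wc -l)
n_single_grep=$(cut -f1 apple.genes | uniq -c | grep -c " 1 ")

# The easy route to the multi-variant count: total genes minus single-variant genes.
n_genes=$(cut -f1 apple.genes | uniq | wc -l)
n_multi=$((n_genes - n_single_grep))

echo "single-variant genes: $n_single_wc (wc -l), $n_single_grep (grep -c)"
echo "multi-variant genes by subtraction: $n_multi"
```

The spaces on both sides of the 1 in the pattern are what keep a count such as 11 from matching, since that line contains no 1 that is surrounded by spaces.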
Again we're cutting column number one. We're using uniq -c to see how many times the gene name appears, which is equivalent to the number of variants it has. Now we want all lines except for those genes that are listed just once. One option to the command grep is -v, which says ignore the lines that match the pattern and report everything else. So grep -v, open quotes, space, one, and another space, close quotes: that says take away all the lines that have this pattern. And that will show us, indeed, the five genes that have at least two variants. These are smell, size, color, taste, and shape. And to make sure that there are five of them, we can pipe the result through wc -l, and the result is five.

So we answered some simple questions related to the structure and organization of genes and genomes. Specifically, we answered the question of how many chromosomes there are in the genome, starting from the FASTA file, and then how many genes and transcript variants there are, and how many genes have a single variant versus how many have multiple variants. These are simple and practical questions in your general analysis. In the following section, we will be addressing a couple of questions that compare information across the species.
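As a recap of the last step, the grep -v pipeline for genes with multiple variants might look like the following sketch. It again uses a hypothetical stand-in for apple.genes, with invented variant names and some invented gene names, so only the overall counts mirror the lecture.

```shell
# Hypothetical stand-in for apple.genes (tab-separated: gene, variant); names partly invented.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\t%s\n' \
  smell smell-1  smell smell-2 \
  size size-1    size size-2 \
  color color-1  color color-2  color color-3 \
  appl4 appl4-1  appl5 appl5-1  appl6 appl6-1  appl7 appl7-1  appl8 appl8-1 \
  taste taste-1  taste taste-2 \
  shape shape-1  shape shape-2 > apple.genes

# -v inverts the match: drop the lines whose count is exactly 1, keep the rest.
multi_listing=$(cut -f1 apple.genes | uniq -c | grep -v " 1 ")
echo "$multi_listing"

# Count the remaining lines to get the number of multi-variant genes.
n_multi=$(cut -f1 apple.genes | uniq -c | grep -v " 1 " | wc -l)
echo "multi-variant genes: $n_multi"
```

This count should agree with the subtraction of single-variant genes from the total number of genes.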