Analyzing covid's genome

This tutorial makes heavy use of the k1lib.bioinfo.cli module and shows another example of what a typical workflow looks like. The input file is in GenBank format.

Overview

Here's roughly what the start of the file looks like:

And the end:

So, 29903 nucleotides in total, just as advertised. The nucleotide section at the end of the file always starts with "ORIGIN", so let's look for that:

Origin

Nice. Let's extract everything out:

That's rather long-winded, so there's a built-in operation for it:
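For reference, here is a minimal library-free sketch of what the manual extraction amounts to. It is not the k1lib operation itself; `extract_origin` is a hypothetical helper, and the record below is a tiny hand-made stand-in for the real file. It assumes the usual GenBank layout: an "ORIGIN" line, then numbered lines with blocks of bases, then a "//" terminator.

```python
def extract_origin(lines):
    """Return the nucleotide string stored in the ORIGIN section of a GenBank record."""
    bases = []
    in_origin = False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True          # sequence lines start after this marker
            continue
        if line.startswith("//"):
            break                     # "//" terminates the record
        if in_origin:
            # Each line is a position number followed by space-separated blocks
            # of bases, so keeping only the letters drops numbers and spaces.
            bases.extend(ch for ch in line if ch.isalpha())
    return "".join(bases)


# Hand-made toy record (assumption: stands in for the real GenBank file).
toy_record = """LOCUS       TOY                   25 bp
ORIGIN
        1 attaaaggtt tataccttcc
       21 caggt
//
""".splitlines()

sequence = extract_origin(toy_record)
print(len(sequence), sequence)
```

On the real file, the length of the returned string should be the 29903 nucleotides counted earlier.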

Features

Before the ORIGIN section, there's the FEATURES section, which looks like this:

As you can see, there are multiple features, like source, 5'UTR, gene, CDS, and whatnot. Of course, you can extract these on your own, but the library already has built-in functions for that:

Say you want to search the features for a frameshift event. You can do something like this:
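If you'd rather see the idea without the library, the search can be sketched in plain Python. This is not the k1lib API: `parse_features` is a hypothetical helper, and the excerpt is an abridged, hand-typed stand-in for the FEATURES section. It assumes feature keys are indented by 5 spaces and that qualifiers sit on deeper-indented lines starting with "/".

```python
# Abridged, hand-typed excerpt (assumption: same shape as the real FEATURES section).
EXCERPT = """
FEATURES             Location/Qualifiers
     source          1..29903
                     /organism="Severe acute respiratory syndrome coronavirus 2"
     5'UTR           1..265
     CDS             join(266..13468,13468..21555)
                     /gene="ORF1ab"
                     /ribosomal_slippage
                     /note="pp1ab; translated by -1 ribosomal
                     frameshift"
"""


def parse_features(lines):
    """Group feature-table lines into dicts with a key, a location and qualifiers."""
    features = []
    for line in lines:
        stripped = line.strip()
        if not stripped or not line.startswith(" "):
            continue                                  # blank line or the FEATURES header
        indent = len(line) - len(line.lstrip(" "))
        if indent <= 5:                               # a new feature: key + location
            key, location = stripped.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": []})
        elif not features:
            continue                                  # stray line before any feature
        elif stripped.startswith("/"):                # a new qualifier
            features[-1]["qualifiers"].append(stripped)
        elif features[-1]["qualifiers"]:              # a wrapped qualifier line
            features[-1]["qualifiers"][-1] += " " + stripped
        else:                                         # a wrapped location line
            features[-1]["location"] += stripped
    return features


features = parse_features(EXCERPT.splitlines())

# Keep features whose qualifiers mention a frameshift or ribosomal slippage.
hits = [
    f for f in features
    if any("frameshift" in q or "ribosomal_slippage" in q for q in f["qualifiers"])
]
for f in hits:
    print(f["key"], f["location"])
```

The `join(...)` location is the interesting part: the two segments share nucleotide 13468.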

So apparently, there's a frameshift at nucleotide 13468, where that nucleotide gets read twice. Let's check if that's correct. First, let's grab the protein:

ORF1ab is quite a chunky boi. It's over 7,000 amino acids long, and its coding region spans about 71% of the genome. The nucleotides of interest are:

So, the shifted nt sequence must be "AACCGG", or:

Yep, bingo! The peptide sequence starts with NR.
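The same check can be reproduced with a small stand-alone sketch. It assumes the five raw nucleotides around the slip are "AACGG" at positions 13466 to 13470, which is what "AACCGG" implies once the repeated nucleotide is removed. `join_segments` and `translate` are hypothetical helpers, not library functions, and the codon table is the standard genetic code.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def join_segments(seq, segments, offset=1):
    """Concatenate inclusive (start, end) segments; `offset` is the position of seq[0]."""
    return "".join(seq[start - offset:end - offset + 1] for start, end in segments)


def translate(nucleotides):
    """Translate a nucleotide string codon by codon, ignoring a trailing partial codon."""
    usable = len(nucleotides) - len(nucleotides) % 3
    return "".join(CODON_TABLE[nucleotides[i:i + 3]] for i in range(0, usable, 3))


# Assumption: raw nucleotides 13466..13470, inferred from the text above.
window_start = 13466
window = "AACGG"

# Mimic join(...13468,13468...): position 13468 ends one segment and starts the next.
segments = [(13466, 13468), (13468, 13470)]
shifted = join_segments(window, segments, offset=window_start)
peptide = translate(shifted)
print(shifted, peptide)
```

Reading position 13468 twice is what turns the five raw nucleotides into two full codons.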

Spike

Back before the Delta variant, the news talked a lot about the "D614G" variant. I wondered what that was all about, then discovered this:

Yeah, this checks out. So the "D614G" mutation just means that at position 614 of the spike protein, a D (aspartic acid) has become a G (glycine).
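To make the notation concrete, here is a minimal sketch that parses a substitution code of the form "reference residue, 1-based position, new residue" and applies it to a protein string. `parse_substitution` and `apply_substitution` are hypothetical helpers, and the four-residue protein is made up so the example stays readable.

```python
import re


def parse_substitution(code):
    """Split a code like 'D614G' into (reference residue, 1-based position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z*])", code)
    if match is None:
        raise ValueError(f"not a simple substitution code: {code!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def apply_substitution(protein, code):
    """Return the mutated protein, checking that the reference residue really is there."""
    ref, pos, alt = parse_substitution(code)
    if not 1 <= pos <= len(protein):
        raise ValueError(f"position {pos} is outside the protein (length {len(protein)})")
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {protein[pos - 1]}")
    return protein[:pos - 1] + alt + protein[pos:]


# Made-up toy protein (assumption: not the real spike sequence).
toy_protein = "MKDL"
print(parse_substitution("D614G"))            # how the real code splits up
print(apply_substitution(toy_protein, "D3G"))
```

The reference-residue check is the useful part: if the residue at that position doesn't match the code, the sequence and the mutation name probably refer to different proteins or different numbering.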

ORF3a

Let's try again at a different spot. I grabbed a random mutation with this code name: "hCoV-19/Japan/PG-69007/2021: ORF3a L275F"

Lmao, the change is right at the very last amino acid.

All proteins

Let's see the distribution of all genes:

And how much of the genome codes for the proteins themselves?

All proteins combined take up about 97.7% of the genome. That's quite densely packed, unlike eukaryotic genomes.
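One way to compute a figure like this is to merge the coding intervals first, so overlapping genes aren't counted twice, and then divide by the genome length. The sketch below shows that idea. `covered_length` is a hypothetical helper, and the intervals and genome length are made-up toy values, not the real coordinates.

```python
def covered_length(intervals):
    """Count positions covered by at least one inclusive (start, end) interval."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # A new, non-overlapping interval: close the previous one first.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlap: extend the interval being built.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


# Toy values (assumption: made up for illustration).
toy_genome_length = 100
toy_intervals = [(11, 40), (35, 60), (71, 90)]

covered = covered_length(toy_intervals)
fraction = covered / toy_genome_length
print(covered, f"{fraction:.1%}")
```

Simply summing the interval lengths would give 76 here instead of 70, because positions 35 to 40 belong to two intervals.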

UTR

How about the UTR regions? Do they take up much? Let's quickly search for them:

With the UTRs included, we're really close to 100% now.