Analyzing covid's genome¶

This tutorial will make heavy use of k1lib.bioinfo.cli module, and to show another example of what a typical workflow looks like. File is in GenBank format.

In [1]:
from k1lib.imports import *

Overview¶

Here's what it kinda looks like:

In [2]:
cat("covid.gb") | headOut()
LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020
DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1,
            complete genome.
ACCESSION   NC_045512
VERSION     NC_045512.2
DBLINK      BioProject: PRJNA485481
KEYWORDS    RefSeq.
SOURCE      Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
  ORGANISM  Severe acute respiratory syndrome coronavirus 2
            Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;

And the end:

In [3]:
cat("covid.gb") | rows()[-10:] | headOut()
    29341 tattgacgca tacaaaacat tcccaccaac agagcctaaa aaggacaaaa agaagaaggc
    29401 tgatgaaact caagccttac cgcagagaca gaagaaacag caaactgtga ctcttcttcc
    29461 tgctgcagat ttggatgatt tctccaaaca attgcaacaa tccatgagca gtgctgactc
    29521 aactcaggcc taaactcatg cagaccacac aaggcagatg ggctatataa acgttttcgc
    29581 ttttccgttt acgatatata gtctactctt gtgcagaatg aattctcgta actacatagc
    29641 acaagtagat gtagttaact ttaatctcac atagcaatct ttaatcagtg tgtaacatta
    29701 gggaggactt gaaagagcca ccacattttc accgaggcca cgcggagtac gatcgagtgt
    29761 acagtgaaca atgctaggga gagctgccta tatggaagag ccctaatgtg taaaattaat
    29821 tttagtagtg ctatccccat gtgattttaa tagcttctta ggagaatgac aaaaaaaaaa
    29881 aaaaaaaaaa aaaaaaaaaa aaa

So, 29903 nucleotides in total, just as advertised. The last nucleotide section always starts with "ORIGIN", so let's look for that:

Origin¶

In [4]:
cat("covid.gb") | grep("ORIGIN", after=1e9) | headOut(3)
ORIGIN      
        1 attaaaggtt tataccttcc caggtaacaa accaaccaac tttcgatctc ttgtagatct
       61 gttctctaaa cgaactttaa aatctgtgtg gctgtcactc ggctgcatgc ttagtgcact

Nice. Let's extract everything out:

In [5]:
cat("covid.gb") | grep("ORIGIN", after=1e9) | ~head(1) | op().strip().all() | op().split(" ").all() | cut()[1:] | join("").all() | join("") | op()[:100]
Out[5]:
'attaaaggtttataccttcccaggtaacaaaccaaccaactttcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgtggctgtcactc'

This is rather long, so there's a built in operation for that

In [6]:
# hide behind a wrapper cause I get annoyed at Jupyter Lab's contextual help displaying the huge text
nt = k1lib.Wrapper(cat("covid.gb") | gb.origin())
nt()[:100]
Out[6]:
'attaaaggtttataccttcccaggtaacaaaccaaccaactttcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgtggctgtcactc'

Features¶

Before ORIGIN "section", there's the FEATURES section that looks like this:

In [7]:
cat("covid.gb") | grep("FEATURES", after=1e9) | headOut(20)
FEATURES             Location/Qualifiers
     source          1..29903
                     /organism="Severe acute respiratory syndrome coronavirus
                     2"
                     /mol_type="genomic RNA"
                     /isolate="Wuhan-Hu-1"
                     /host="Homo sapiens"
                     /db_xref="taxon:2697049"
                     /country="China"
                     /collection_date="Dec-2019"
     5'UTR           1..265
     gene            266..21555
                     /gene="ORF1ab"
                     /locus_tag="GU280_gp01"
                     /db_xref="GeneID:43740578"
     CDS             join(266..13468,13468..21555)
                     /gene="ORF1ab"
                     /locus_tag="GU280_gp01"
                     /ribosomal_slippage
                     /note="pp1ab; translated by -1 ribosomal frameshift"

As you can see, there are multiple features, like source, 5'UTR, gene, CDS, and whatnot. Of course, you can extract these on your own, but builtin functions already have something like that:

In [8]:
feats = cat("covid.gb") | gb.feats() | deref()
feats | rows()[:3] | deref()
Out[8]:
[['     source          1..29903',
  '                     /organism="Severe acute respiratory syndrome coronavirus',
  '                     2"',
  '                     /mol_type="genomic RNA"',
  '                     /isolate="Wuhan-Hu-1"',
  '                     /host="Homo sapiens"',
  '                     /db_xref="taxon:2697049"',
  '                     /country="China"',
  '                     /collection_date="Dec-2019"'],
 ["     5'UTR           1..265"],
 ['     gene            266..21555',
  '                     /gene="ORF1ab"',
  '                     /locus_tag="GU280_gp01"',
  '                     /db_xref="GeneID:43740578"']]

Say you want to search the features for a frameshift event, you can do something like this:

In [9]:
feats | gb.feats.filt("frameshift", "CDS") | item() | headOut()
     CDS             join(266..13468,13468..21555)
                     /gene="ORF1ab"
                     /locus_tag="GU280_gp01"
                     /ribosomal_slippage
                     /note="pp1ab; translated by -1 ribosomal frameshift"
                     /codon_start=1
                     /product="ORF1ab polyprotein"
                     /protein_id="YP_009724389.1"
                     /db_xref="GeneID:43740578"
                     /translation="MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQ

So apparently, there's a frameshift at nucleotide 13468, where it gets repeated twice. Let's check if that's correct. First, let's grab the protein:

In [10]:
orf1ab = feats | gb.feats.filt("frameshift", "CDS") | item() | gb.feats.tags("translation") | op()[0][1].replace(" ", "")
len(orf1ab), len(orf1ab)*3 / len(nt()) * 100
Out[10]:
(7096, 71.19018158713173)

ORF1ab is quite a chunky boi. Over 7k length, or 71% of the genome. The nucleotides of interest are:

In [11]:
nt()[13465:][:20]
Out[11]:
'aacgggtttgcggtgtaagt'

So, the shifted nt sequence must be "AACCGG", or:

In [12]:
"AACCGG" | translate() | item()
Out[12]:
'NR'
In [13]:
orf1ab[(13468-266+1)//3-1:][:20]
Out[13]:
'NRVCGVSAARLTPCGTGTST'

Yep, bingo! Peptide sequence starts with NR

Spike¶

In [14]:
s = feats | gb.feats.filt("spike", "CDS") | item() | gb.feats.tags("translation") | op()[0][1].replace(" ", "")

Also in the news before delta variant times, I've heard they talk a lot about "D614G" variant, I wonder what's that all about, then discovered this:

In [15]:
s[613:][:10], "DG" | longAa() | item()
Out[15]:
('DVNCTEVPVA', 'AsparticAcid Glycine')

Yeah this checks out. So "D614G" mutation just means at position 614 on the spike protein, a D (aspartic acid) has become G (glycine).

ORF3a¶

In [16]:
orf3a = feats | gb.feats.filt("ORF3a", "CDS") | item() | gb.feats.tags("translation") | op()[0][1].replace(" ", "")

Let's try again at a different spot. I grabbed a random mutation with this code name: "hCoV-19/Japan/PG-69007/2021: ORF3a L275F"

In [17]:
orf3a[274:], "LF" | longAa() | item()
Out[17]:
('L', 'Leucine Phenylalanine')

Lmao, the change is right at the last amino acid

All proteins¶

In [18]:
genes = ["ORF1ab", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10"]
In [19]:
proteinLengths = feats | oneToMany(*(gb.feats.filt(f"/gene=\"{g}\"") for g in genes))\
| (gb.feats.filt(" CDS ") | item() | gb.feats.tags("translation") | op()[0][1].replace(" ", "")).all()\
| shape(0).all() | deref()

Let's see the distribution of all genes:

In [20]:
proteinLengths | wrapList() | transpose() | insertColumn(genes) | ~sort(1) | display(None)
ORF1ab   7096   
S        1273   
N        419    
ORF3a    275    
M        222    
ORF7a    121    
ORF8     121    
E        75     
ORF6     61     
ORF7b    43     
ORF10    38     

And how much of the genome are the proteins themselves?

In [21]:
sum(proteinLengths) * 3 / len(nt()) * 100
Out[21]:
97.75607798548641

All proteins combined take up like 97.7% of the genome. Quite densely packed, unlike eukaryote genomes.

UTR¶

How about utr regions? Do they take up much? Let's quickly search for them:

In [22]:
feats | gb.feats.filt("UTR") | item().all() | deref()
Out[22]:
["     5'UTR           1..265",
 '     stem_loop       29609..29644',
 '     stem_loop       29629..29657',
 "     3'UTR           29675..29903"]
In [23]:
utr = feats | gb.feats.filt("UTR") | item().all() | rows(0, 3) | op().split("R")[1].all()\
| (op().split("..") | toInt() | toList() | ~aS(lambda x, y: y - x)).all() | toSum()
In [24]:
(sum(proteinLengths) * 3 + utr) / len(nt()) * 100
Out[24]:
99.40139785305823

Really close to 100% now

In [ ]: