from k1lib.imports import *
Here's what it kinda looks like:
cat("covid.gb") | headOut()
LOCUS       NC_045512              29903 bp ss-RNA     linear   VRL 18-JUL-2020
DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1,
            complete genome.
ACCESSION   NC_045512
VERSION     NC_045512.2
DBLINK      BioProject: PRJNA485481
KEYWORDS    RefSeq.
SOURCE      Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
  ORGANISM  Severe acute respiratory syndrome coronavirus 2
            Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
And the end:
cat("covid.gb") | rows()[-10:] | headOut()
    29341 tattgacgca tacaaaacat tcccaccaac agagcctaaa aaggacaaaa agaagaaggc
    29401 tgatgaaact caagccttac cgcagagaca gaagaaacag caaactgtga ctcttcttcc
    29461 tgctgcagat ttggatgatt tctccaaaca attgcaacaa tccatgagca gtgctgactc
    29521 aactcaggcc taaactcatg cagaccacac aaggcagatg ggctatataa acgttttcgc
    29581 ttttccgttt acgatatata gtctactctt gtgcagaatg aattctcgta actacatagc
    29641 acaagtagat gtagttaact ttaatctcac atagcaatct ttaatcagtg tgtaacatta
    29701 gggaggactt gaaagagcca ccacattttc accgaggcca cgcggagtac gatcgagtgt
    29761 acagtgaaca atgctaggga gagctgccta tatggaagag ccctaatgtg taaaattaat
    29821 tttagtagtg ctatccccat gtgattttaa tagcttctta ggagaatgac aaaaaaaaaa
    29881 aaaaaaaaaa aaaaaaaaaa aaa
So, 29903 nucleotides in total, just as advertised. The last nucleotide section always starts with "ORIGIN", so let's look for that:
cat("covid.gb") | grep("ORIGIN", after=1e9) | headOut(3)
ORIGIN      
        1 attaaaggtt tataccttcc caggtaacaa accaaccaac tttcgatctc ttgtagatct
       61 gttctctaaa cgaactttaa aatctgtgtg gctgtcactc ggctgcatgc ttagtgcact
Nice. Let's extract everything out:
(cat("covid.gb") | grep("ORIGIN", after=1e9) | ~head(1) | op().strip().all() | op().split(" ").all() | cut()[1:] | join("").all() | join("") | item())[:100]
'a'
This is rather long, so there's a built in operation for that
# hide behind a wrapper cause I get annoyed at Jupyter Lab's contextual help displaying the huge text
nt = k1lib.Wrapper(cat("covid.gb") | gb.origin())
nt()[:100]
'attaaaggtttataccttcccaggtaacaaaccaaccaactttcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgtggctgtcactc'
Before ORIGIN "section", there's the FEATURES section that looks like this:
cat("covid.gb") | grep("FEATURES", after=1e9) | headOut(20)
FEATURES             Location/Qualifiers
     source          1..29903
                     /organism="Severe acute respiratory syndrome coronavirus
                     2"
                     /mol_type="genomic RNA"
                     /isolate="Wuhan-Hu-1"
                     /host="Homo sapiens"
                     /db_xref="taxon:2697049"
                     /country="China"
                     /collection_date="Dec-2019"
     5'UTR           1..265
     gene            266..21555
                     /gene="ORF1ab"
                     /locus_tag="GU280_gp01"
                     /db_xref="GeneID:43740578"
     CDS             join(266..13468,13468..21555)
                     /gene="ORF1ab"
                     /locus_tag="GU280_gp01"
                     /ribosomal_slippage
                     /note="pp1ab; translated by -1 ribosomal frameshift"
As you can see, there are multiple features, like source, 5'UTR, gene, CDS, and whatnot. Of course, you can extract these on your own, but builtin functions already have something like that:
feats = cat("covid.gb") | gb.feats() | deref()
feats | rows()[:3] | deref()
[[' source 1..29903', ' /organism="Severe acute respiratory syndrome coronavirus', ' 2"', ' /mol_type="genomic RNA"', ' /isolate="Wuhan-Hu-1"', ' /host="Homo sapiens"', ' /db_xref="taxon:2697049"', ' /country="China"', ' /collection_date="Dec-2019"'], [" 5'UTR 1..265"], [' gene 266..21555', ' /gene="ORF1ab"', ' /locus_tag="GU280_gp01"', ' /db_xref="GeneID:43740578"']]
Say you want to search the features for a frameshift event, you can do something like this:
feats | gb.feats.filt("frameshift", "CDS") | item() | headOut()
     CDS             join(266..13468,13468..21555)
                     /gene="ORF1ab"
                     /locus_tag="GU280_gp01"
                     /ribosomal_slippage
                     /note="pp1ab; translated by -1 ribosomal frameshift"
                     /codon_start=1
                     /product="ORF1ab polyprotein"
                     /protein_id="YP_009724389.1"
                     /db_xref="GeneID:43740578"
                     /translation="MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQ
So apparently, there's a frameshift at nucleotide 13468, where it gets repeated twice. Let's check if that's correct. First, let's grab the protein:
orf1ab = feats | gb.feats.filt("frameshift", "CDS") | item() | gb.feats.tag("translation")
len(orf1ab), len(orf1ab)*3 / len(nt()) * 100
(7096, 71.19018158713173)
ORF1ab is quite a chunky boi. Over 7k length, or 71% of the genome. The nucleotides of interest are:
nt()[13465:][:20]
'aacgggtttgcggtgtaagt'
So, the shifted nt sequence must be "AACCGG", or:
"AACCGG" | translate() | item()
'NR'
orf1ab[(13468-266+1)//3-1:][:20]
'NRVCGVSAARLTPCGTGTST'
Yep, bingo! Peptide sequence starts with NR
s = feats | gb.feats.filt("spike", "CDS") | item() | gb.feats.tag("translation")
Also in the news before delta variant times, I've heard they talk a lot about "D614G" variant, I wonder what's that all about, then discovered this:
s[613:][:10], "DG" | longAa() | item()
('DVNCTEVPVA', 'AsparticAcid Glycine')
Yeah this checks out. So "D614G" mutation just means at position 614 on the spike protein, a D (aspartic acid) has become G (glycine).
orf3a = feats | gb.feats.filt("ORF3a", "CDS") | item() | gb.feats.tag("translation")
Let's try again at a different spot. I grabbed a random mutation with this code name: "hCoV-19/Japan/PG-69007/2021: ORF3a L275F"
orf3a[274:], "LF" | longAa() | item()
('L', 'Leucine Phenylalanine')
Lmao, the change is right at the last amino acid
genes = ["ORF1ab", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10"]
proteinLengths = feats | oneToMany(*(gb.feats.filt(f"/gene=\"{g}\"") for g in genes))\
| (gb.feats.filt(" CDS ") | item() | gb.feats.tag("translation")).all() | lengths() | deref()
Let's see the distribution of all genes:
proteinLengths | wrapList() | transpose() | insertColumn(*genes) | ~sort(1) | display(None)
ORF1ab 7096 S 1273 N 419 ORF3a 275 M 222 ORF7a 121 ORF8 121 E 75 ORF6 61 ORF7b 43 ORF10 38
And how much of the genome are the proteins themselves?
sum(proteinLengths) * 3 / len(nt()) * 100
97.75607798548641
All proteins combined take up like 97.7% of the genome. Quite densely packed, unlike eukaryote genomes.
How about utr regions? Do they take up much? Let's quickly search for them:
feats | gb.feats.filt("UTR") | item().all() | deref()
[" 5'UTR 1..265", ' stem_loop 29609..29644', ' stem_loop 29629..29657', " 3'UTR 29675..29903"]
utr = feats | gb.feats.filt("UTR") | item().all() | rows(0, 3) | op().split("R")[1].all()\
| (op().split("..") | toInt() | toList() | applyS(lambda r: r[1] - r[0])).all() | toSum()
(sum(proteinLengths) * 3 + utr) / len(nt()) * 100
99.40139785305823
Really close to 100% now