k1lib.cli module

Setup

To install the library, run this in a terminal:

pip install k1lib[all]

If you don’t want to install extra dependencies (not recommended), you can do this instead:

pip install k1lib

To use it in a python file or a notebook, do this:

from k1lib.imports import *

Many functions in the library have common names, so a custom function or class of yours with the same name will override the library’s version. To reach the library’s version anyway, go through the cli namespace, for example cli.sort() instead of sort().

Intro

The main idea of this package is to emulate the terminal (hence “cli”, or “command line interface”), but doing all of that inside Python itself. So this bash statement:

cat file.txt | head -5 > headerFile.txt

Turns into this statement:

cat("file.txt") | head(5) > file("headerFile.txt")

Let’s step back a little bit. In the bash statement, “cat” and “head” are actual programs accessible through the terminal, and “|” pipes the output of one program into another. cat file.txt reads a file and returns all of its lines, which are then piped into head -5, which returns only the first 5 lines. Finally, > headerFile.txt redirects the output to the “headerFile.txt” file. See this video for more: https://www.youtube.com/watch?v=bKzonnwoR2I

On the Python side, “cat”, “head” and “file” are Python classes extended from BaseCli. cat("file.txt") reads the file line by line and returns a list of all the lines. head(5) takes in that list and returns a list with only the first 5 lines. Finally, > file("headerFile.txt") takes that in and writes it to a file.
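For comparison, the same three steps can be sketched in plain Python using only the standard library. This is not how k1lib implements cat, head or file. The helper names read_lines, first_n and write_lines are hypothetical, and the input file is a throwaway temp file created just so the sketch runs as-is.

```python
import itertools
import os
import tempfile

def read_lines(path):
    """Yield the lines of a file one at a time, without trailing newlines."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def first_n(lines, n):
    """Keep only the first n items of any iterable."""
    return itertools.islice(lines, n)

def write_lines(lines, path):
    """Write each item on its own line."""
    with open(path, "w") as f:
        for line in lines:
            f.write(line + "\n")

# build a throwaway 10-line input file so the sketch is runnable
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "file.txt")
dst = os.path.join(workdir, "headerFile.txt")
write_lines((f"line {i}" for i in range(10)), src)

# same shape as: cat file.txt | head -5 > headerFile.txt
write_lines(first_n(read_lines(src), 5), dst)
header = list(read_lines(dst))
print(header)
```

The nested calls read inside-out, which is the readability problem the pipe syntax is meant to avoid.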

You can even integrate with existing shell commands:

ls("~") | cmd("grep *.so")

Here, “ls” lists the files inside the home directory and pipes them into regular grep on Linux, whose output is then piped back into Python as a list of strings. So it’s equivalent to this bash statement:

ls ~ | grep *.so

Let’s see a really basic example:

# just a normal function
f = lambda x: x**2
# returns 9, no surprises here
f(3)
# f is now a cli tool
f = aS(lambda x: x**2)
# returns 9, demonstrating that they act like normal functions
f(3)
# returns 9, demonstrating that you can also pipe into them
3 | f

You can think of the flow of these clis in terms of 2 phases. The first is configuring what you want the cli to do, and the second is actually executing it. Let’s say you want to take a list of numbers and square each of them:

# configuration stage. You provide a function to `apply` to tell it what to apply to each element in the list
f = apply(lambda x: x**2)
# initialize the input
x = range(5)
# execution stage, normal style, returns [0, 1, 4, 9, 16]
list(f(x))
# execution stage, pipe style, returns [0, 1, 4, 9, 16]
list(x | f)

# typical usage: combining configuration stage and execution stage, returns [0, 1, 4, 9, 16]
list(range(5) | apply(lambda x: x**2))
# refactor converting to list so that it uses pipes, returns [0, 1, 4, 9, 16]
range(5) | apply(lambda x: x**2) | aS(list)
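To make the two phases concrete, here is a minimal sketch of a pipe-able object written from scratch. The class apply_sketch is a hypothetical stand-in, not k1lib’s apply, and it only assumes what the text above describes: the constructor stores the configuration, and calling or piping runs it lazily.

```python
class apply_sketch:
    """Hypothetical stand-in for a cli tool, NOT k1lib's apply."""

    def __init__(self, f):
        # configuration stage: remember what to do, run nothing yet
        self.f = f

    def __call__(self, it):
        # execution stage, normal style: returns a lazy generator
        return (self.f(e) for e in it)

    def __ror__(self, it):
        # execution stage, pipe style: `it | self` ends up here
        return self(it)

square = apply_sketch(lambda x: x**2)      # configure once
normal_style = list(square(range(5)))      # execute, normal style
pipe_style = list(range(5) | square)       # execute, pipe style
print(normal_style, pipe_style)
```

Because the configured object holds no input data, the same square object can be reused on any number of inputs.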

You may wonder why we have to turn it into a list. That’s because all cli tools execute things lazily, so they return iterators instead of lists. Here’s how iterators work:

def gen(): # this is a generator. It generates elements
    yield 3
    print("after yielding 3")
    yield 2
    yield 5
for e in gen():
    print(e)

It will print this out:

3
after yielding 3
2
5

So, iterators feel like lists. Strictly speaking, lists, range(5), numpy arrays and strings are “iterables”: anything that you can loop over is an iterable, and Python creates an iterator from it behind the scenes. The iterator above is a little special, as it’s specifically called a “generator”. Generators are a really cool aspect of Python because they execute code lazily, meaning gen() won’t run all the way when you call it. In fact, it doesn’t run at all. The function body only runs once you request new elements by iterating over it.
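The laziness can be observed directly by recording when the body of a generator runs. This sketch reuses the gen() example from above, with a hypothetical log list added in place of print so the order of events can be inspected.

```python
log = []

def gen():
    log.append("started")
    yield 3
    log.append("after yielding 3")
    yield 2
    yield 5

g = gen()                    # creating the generator runs none of the body
log_after_call = list(log)
first = next(g)              # runs the body up to the first yield
log_after_first = list(log)
rest = list(g)               # runs the remainder of the body
print(log_after_call, first, log_after_first, rest, log)
```

Nothing is logged until the first element is requested, and the second log entry only appears once iteration moves past the first yield.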

All cli tools utilize this fact, so they will not actually execute anything unless you force them to:

# returns "<generator object apply.__ror__.<locals>.<genexpr> at 0x7f7ae48e4d60>"
range(5) | apply(lambda x: x**2)
# you can iterate through it directly:
for element in range(5) | apply(lambda x: x**2):
    print(element)
# returns [0, 1, 4, 9, 16], in case you want it in a list
list(range(5) | apply(lambda x: x**2))
# returns [0, 1, 4, 9, 16], demonstrating deref
range(5) | apply(lambda x: x**2) | deref()

In the first line, it returns a generator, instead of a normal list, as nothing has actually been executed. You can still iterate through generators using for loops as usual, or you can convert it into a list. When you get more advanced, and have iterators nested within iterators within iterators, you can use deref to turn all of them into lists.
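The idea behind dereferencing nested iterators can be sketched as a short recursive function. This is only an illustration of the concept, not k1lib’s deref, which may treat strings, tensors and other types differently. The name deref_sketch is hypothetical.

```python
from collections.abc import Iterable

def deref_sketch(x):
    """Recursively turn nested iterables into lists; leave strings and scalars alone."""
    if isinstance(x, (str, bytes)) or not isinstance(x, Iterable):
        return x
    return [deref_sketch(e) for e in x]

# a generator of generators: nothing has been computed yet
nested = ((i * j for j in range(3)) for i in range(2))
result = deref_sketch(nested)
print(result)
```

Strings are excluded on purpose in this sketch, because iterating a string yields more strings and the recursion would never bottom out.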

Also, a lot of these tools (like apply and filt) sometimes assume that we are operating on a table. So this table:

col1	col2	col3
1	2	3
4	5	6

Is equivalent to this list:

[["col1", "col2", "col3"], [1, 2, 3], [4, 5, 6]]

transpose and mtmS provide more flexible ways to transform a table structure (but usually involve more code).
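In plain Python terms, treating a list of lists as a table means that per-column and per-row operations are simple indexing, and a transpose is a zip. The sketch below uses only built-ins and does not call any k1lib function.

```python
table = [["col1", "col2", "col3"], [1, 2, 3], [4, 5, 6]]

# split off the header row from the data rows
header, *data = table

# second column of every data row
col2 = [row[1] for row in data]

# swap rows and columns
transposed = [list(col) for col in zip(*table)]

print(col2)
print(transposed)
```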

Besides operating on string iterators alone, this package can also be extra meta, and operate on streams of strings, or streams of streams of anything. I think this is one of the most powerful concepts of the cli workflow. Check it out here:

Streams tutorial

All cli tools should work fine with torch.Tensor, numpy.ndarray and pandas.core.series.Series, but k1lib actually modifies Numpy arrays and Pandas series deep down for it to work. You can still do a normal bitwise or with a numpy float value, and everything works fine in all regression tests that I have, but you might encounter strange bugs. You can disable the patch manually by changing settings.startup.or_patch. If you choose to do this, you have to be careful and use these workarounds:

# returns (2, 3, 5), works fine
torch.randn(2, 3, 5) | shape()
# will not work, returns weird numpy array of shape (2, 3, 5)
np.random.randn(2, 3, 5) | shape()
# returns (2, 3, 5), mitigation strategy #1
shape()(np.random.randn(2, 3, 5))
# returns (2, 3, 5), mitigation strategy #2
[np.random.randn(2, 3, 5)] | (item() | shape())

All cli-related settings are at settings.cli.

Where to start?

Core clis include:

apply, aS, op, grep
filt, head, rows, cut
deref, item, shape
transpose, joinStreams, batched, count
cat(), ls(), file, stdout

These clis are pretty important, and are used all the time, so look over them to see what the library can do. Whenever you find a cli you have not encountered before, you can just search for it in the search bar on the top left of the page.

Then other important, not necessarily core clis include:

applyMp, sort, randomize
wrapList, ignore, cmd
repeat and friends, groupBy

So, start reading over what these do first, as you can utilize pretty much 95% of what the cli workflow has to offer with those alone. Then skim over basic conversions in module conv. While you’re doing that, check out trace(), a quite powerful debugging tool.

There are several written tutorials about cli here, and I also made some video tutorials, so go check those out.

For every example that you find in the tutorials, you might find it useful to follow these debugging steps to see how everything works:

# assume there's this piece of code:
A | B | C | D
# do this instead:
A | deref()
# once you understand it, do this:
A | B | deref()

# assume there's this piece of code:
A | B.all() | C
# do this instead:
A | item() | B | deref()
# once you understand it, you can move on:
A | B.all() | deref()

# assume there's this piece of code:
A | (B & C)
# do this instead:
A | B | deref()

# assume there's this piece of code:
A | (B + C)
# do these instead:
A | deref() | op()[0] | B | deref()
A | deref() | op()[1] | C | deref()
# there are alternatives to that:
A | item() | B | deref()
A | rows(1) | item() | C | deref()

Finally, you can read over the summary below, see what catches your eye and check that cli out.

Summary

structural	utils	conv	typehint	filt
`transpose`	`size`	`toTensor`	`tBase`	`filt`
`reshape`	`shape`	`toRange`	`tAny`	`inSet()`
`insert`	`item`	`toList`	`tList`	`contains()`
`splitW`	`iden`	`toSum`	`tIter`	`empty`
`splitC`	`join`	`toProd`	`tSet`	`isNumeric()`
`joinStreams`	`wrapList`	`toAvg`	`tCollection`	`instanceOf()`
`joinStreamsRandom`	`equals`	`toMean`	`tExpand`	`head`
`activeSamples`	`reverse`	`toMax`	`tNpArray`	`tail()`
`table()`	`ignore`	`toMin`	`tTensor`	`cut`
`batched`	`rateLimit`	`toPIL`	`tListIterSet()`	`rows`
`window`	`timeLimit`	`toImg`	`tListSet()`	`intersection`
`groupBy`	`tab()`	`toRgb`	`tListIter()`	`union`
`insertColumn`	`indent()`	`toRgba`	`tArrayTypes()`	`unique`
`insertIdColumn()`	`clipboard`	`toGray`	`inferType()`	`breakIf`
`expandE`	`deref`	`toDict`	`TypeHintException`	`mask`
`unsqueeze()`	`bindec`	`toFloat`	`tLowest()`	`tryout`
`count`	`smooth`	`toInt`	`tCheck`
`permute`	`disassemble()`	`toBytes`	`tOpt`
`accumulate`	`tree()`
`AA_`	`lookup`
`peek`	`dictFields`
`peekF`
`repeat`
`repeatF()`
`repeatFrom`
`oneHot`
`indexTable`

modifier	init	inp	output	kxml
`applyS`	`BaseCli`	`cat()`	`stdout`	`node`
`aS`	`Table`	`splitSeek`	`tee`	`maxDepth`
`apply`	`T()`	`curl()`	`file`	`tags`
`applyMp`	`fastF()`	`wget()`	`pretty`	`pretty`
`parallel`	`yieldT()`	`ls()`	`display()`	`display`
`applyTh`	`serial`	`cmd`	`headOut()`
`applySerial`	`oneToMany`	`walk`	`intercept`
`sort`	`mtmS`	`requireCli()`	`plotImgs`
`sortF`
`consume`
`randomize`
`stagger`
`op`
`integrate`

nb	grep	kcsv	trace	optimizations
`cells()`	`grep`	`cat()`	`trace`	`dummy()`
`pretty`	`grepTemplate`
`execute`

Under the hood

How it works underneath is pretty simple. All cli tools implement the “reverse or” operation, or __ror__. So essentially, these 2 statements are equivalent:

3 | obj
obj.__ror__(3)
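The dispatch can be watched step by step with a tiny object that defines only __ror__. The class add_one is hypothetical and exists purely to show Python’s operator fallback. It is not part of k1lib.

```python
class add_one:
    """Hypothetical pipe-able object, used only to show the dispatch."""

    def __ror__(self, left):
        # called for `left | self` when `left` cannot handle `|` itself
        return left + 1

obj = add_one()

# int does not know how to `|` with add_one, so it declines...
declined = (3).__or__(obj)
# ...and Python falls back to the right operand's __ror__
piped = 3 | obj
explicit = obj.__ror__(3)
print(declined is NotImplemented, piped, explicit)
```

The fallback only happens when the left operand declines. A left operand whose own __or__ accepts arbitrary objects never reaches __ror__, which is consistent with the Numpy and Pandas caveat in the intro.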

There are several other operations that certain clis can override, like “>” or “>>”. If you’re an advanced user, there’s also an optimizer that looks like LLVM, so you can implement optimization passes to speed up everything by a lot:

LLVM optimizer tutorial

Biology-related clis

I separated these out because they might not be interesting to the majority of users.

bio	sam	cif	gb	mgi
`go()`	`cat()`	`tables()`	`feats`	`batch`
`quality()`	`header`	`records()`	`origin`
`longFa()`	`flag`
`idx`
`transcribe`
`complement`
`translate`
`medAa`
`longAa`

bio module

This is for functions that are actually biology-related.

k1lib.cli.bio.go(term: int): Looks up a GO term

k1lib.cli.bio.quality(log=True)

Get numeric quality of sequence. Example:

# returns [2, 2, 5, 30]
"##&?" | quality() | deref()

Parameters: log – whether to use log scale (0 -> 40), or linear scale (1 -> 0.0001)
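The example output is consistent with the common Phred+33 encoding, where each character’s ASCII code minus 33 is the quality score Q, and the linear value is the error probability 10^(-Q/10). The sketch below assumes that encoding. The function quality_sketch is hypothetical and is not the library’s implementation.

```python
def quality_sketch(seq, log=True):
    """Decode quality characters, ASSUMING the Phred+33 ASCII offset."""
    scores = [ord(ch) - 33 for ch in seq]
    if log:
        return scores
    # linear scale: error probability 10^(-Q/10)
    return [10 ** (-q / 10) for q in scores]

log_scores = quality_sketch("##&?")
# "!" is Q=0 and "I" is Q=40, the two ends of the 0 -> 40 range
linear = quality_sketch("!I", log=False)
print(log_scores, linear)
```

Under this assumption, Q=0 maps to probability 1 and Q=40 maps to 0.0001, matching the ranges given in the parameter description.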

k1lib.cli.bio.longFa()

Takes in a fasta file and puts each sequence on 1 line. File “gene.fa”:

>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATC
TTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
>something other gene
CGGACACACAAAAAGAAAGAAGA
TTTTGTGTGCGAATAACTATGAG

Code:

cat("gene.fa") | bio.longFa() | cli.headOut()

Prints out:

>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
>something other gene
CGGACACACAAAAAGAAAGAAGATTTTGTGTGCGAATAACTATGAG
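The transformation itself is small enough to sketch as a generator: header lines starting with “>” pass through unchanged, and the sequence lines between headers are joined. The function long_fa_sketch is a hypothetical illustration under that assumption, not the library’s longFa, and it works on any iterable of lines rather than a file.

```python
def long_fa_sketch(lines):
    """Yield header lines unchanged and join each record's sequence lines into one."""
    seq = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if seq:
                # flush the sequence of the previous record
                yield "".join(seq)
                seq = []
            yield line
        elif line:
            seq.append(line)
    if seq:
        # flush the sequence of the last record
        yield "".join(seq)

fasta = [
    ">AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome",
    "CGGACACACAAAAAGAAAGAAGAATTTTTAGGATC",
    "TTTTGTGTGCGAATAACTATGAGGAAGATTAATAA",
    ">something other gene",
    "CGGACACACAAAAAGAAAGAAGA",
    "TTTTGTGTGCGAATAACTATGAG",
]
joined = list(long_fa_sketch(fasta))
for line in joined:
    print(line)
```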

class k1lib.cli.bio.idx(fs: list = [])
