k1lib.cli module¶
The main idea of this package is to emulate the terminal (hence “cli”, or “command line interface”), but doing all of that inside Python itself. So this bash statement:
cat file.txt | head -5 > headerFile.txt
Turns into this statement:
cat("file.txt") | head(5) > file("headerFile.txt")
You can even integrate with existing shell commands:
ls("~") | cmd("grep so")
Here, “ls” lists the files inside the home directory and pipes them into the regular Linux grep, whose output is then piped back into Python as a list of strings. So it’s equivalent to this bash statement:
ls ~ | grep so
“cat”, “head”, “file”, “ls” and “cmd” are all classes extended from BaseCli. All of them implement the “reverse or” operation, or __ror__. So essentially, these 2 statements are equivalent:
3 | obj
obj.__ror__(3)
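To make the “reverse or” mechanism concrete, here is a minimal sketch in plain Python. The `Pipe` class and this `head` helper are hypothetical stand-ins written for illustration, not k1lib's actual implementation; they only show how `data | tool` ends up calling `tool.__ror__(data)` when the left operand does not handle `|` itself.

```python
class Pipe:
    """Hypothetical stand-in for a cli tool; not k1lib's BaseCli."""
    def __init__(self, f):
        self.f = f

    def __ror__(self, it):
        # Python calls this for `it | self` when `it` does not handle `|` itself
        return self.f(it)


def head(n):
    # Keep only the first n items of an iterable
    return Pipe(lambda it: [x for _, x in zip(range(n), it)])


first_two = "abcde" | head(2)           # str has no __or__, so Pipe.__ror__ runs
same_thing = head(2).__ror__("abcde")   # the explicit spelling of the line above
print(first_two, same_thing)
```

Both expressions produce the same list, which is the equivalence stated above.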
Also, a lot of these tools (like apply and filt) assume that we are operating on a table. So this table:
col1 | col2 | col3
---|---|---
1 | 2 | 3
4 | 5 | 6
Is equivalent to this list:
[["col1", "col2", "col3"], [1, 2, 3], [4, 5, 6]]
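As a rough illustration of what “operating on a table” means, the sketch below uses only built-in Python on the list above. It does not use k1lib; it just shows the row-oriented layout, a transpose via `zip`, and a function applied to a single column.

```python
table = [["col1", "col2", "col3"], [1, 2, 3], [4, 5, 6]]
header, *rows = table

# Transpose the data rows into columns
columns = [list(c) for c in zip(*rows)]

# Apply a function to column 1 only, leaving the other columns untouched
doubled = [[v * 2 if i == 1 else v for i, v in enumerate(row)] for row in rows]

print(columns)
print(doubled)
```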
transpose and mtmS provide more flexible ways to transform a table structure (but usually involve more code).
Also, the expected way to use these tools is to import everything directly into the current environment, like this:
from k1lib.imports import *
Because there are a lot of clis, you may sometimes unintentionally overwrite an exposed cli tool. No worries, every tool is also under the cli object, meaning you can use deref() or cli.deref().
Besides operating on string iterators alone, this package can also be extra meta, and operate on streams of strings, or streams of streams of anything. I think this is one of the most powerful concepts of the cli workflow. If this interests you, check over this:
All cli tools should work totally fine with PyTorch tensors, but not with numpy arrays. This is because numpy arrays implement the __or__ operator themselves, which takes precedence over cli tools’ __ror__ operator. Workarounds might look like this:
# returns (2, 3, 5), works fine
torch.randn(2, 3, 5) | shape()
# will not work, returns weird numpy array of shape (2, 3, 5)
np.random.randn(2, 3, 5) | shape()
# returns (2, 3, 5), mitigation strategy #1
shape()(np.random.randn(2, 3, 5))
# returns (2, 3, 5), mitigation strategy #2
[np.random.randn(2, 3, 5)] | (item() | shape())
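The precedence rule behind this can be sketched without numpy. The classes below are hypothetical stand-ins (`ArrayLike` for anything that defines `|` itself, `Shape` for a cli-like tool); numpy's real behavior is more involved, but the operator dispatch order is the same.

```python
class ArrayLike:
    """Hypothetical stand-in for an object that defines `|` itself, as numpy arrays do."""
    def __or__(self, other):
        return "ArrayLike.__or__ handled it"


class Plain:
    """Hypothetical stand-in for an object with no `|` support of its own."""


class Shape:
    """Hypothetical cli-like tool that only implements __ror__ and __call__."""
    def __ror__(self, it):
        return "Shape.__ror__ handled it"

    def __call__(self, it):
        return self.__ror__(it)


a = Plain() | Shape()       # left side has no __or__, so Shape.__ror__ runs
b = ArrayLike() | Shape()   # left side's __or__ wins, Shape never sees the data
c = Shape()(ArrayLike())    # calling the tool directly sidesteps `|` entirely
print(a, b, c, sep="\n")
```

This is why calling the tool as a function, or wrapping the array in a list first, works as a mitigation.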
All settings are at settings
under name “cli”.
Where to start?¶
Core clis include apply, applyS (its multiprocessing cousins applyMp and applyMpBatched are great too), op, filt, deref, item, shape, iden and cmd, so start reading there first. Then, skim over everything to know what you can do with this collection of tools. While you’re doing that, check out trace() for a quite powerful debugging tool.
There are several written tutorials about cli here, and I also made some video tutorials as well, so go check those out.
bio module¶
This is for functions that are actually biology-related
-
k1lib.cli.bio.
quality
(log=True)[source]¶ Get numeric quality of sequence. Example:
# returns [2, 2, 5, 30]
"##&?" | quality() | deref()
- Parameters
log – whether to use log scale (0 -> 40), or linear scale (1 -> 0.0001)
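For intuition, the numbers in the example are consistent with the common Phred+33 encoding used in FASTQ files. The sketch below assumes that encoding; `phred_quality` is a hypothetical helper written for illustration, not the library's code.

```python
def phred_quality(seq, log=True):
    """Decode quality characters, assuming Phred+33 (score = ord(char) - 33)."""
    scores = [ord(c) - 33 for c in seq]
    if log:
        return scores
    # Linear scale: error probability 10 ** (-score / 10)
    return [10 ** (-q / 10) for q in scores]


print(phred_quality("##&?"))            # log scale scores
print(phred_quality("!I", log=False))   # score 0 and score 40 on the linear scale
```

Under this assumption a score of 0 maps to probability 1 and a score of 40 maps to roughly 0.0001, matching the ranges given for the log parameter.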
-
k1lib.cli.bio.
longFa
()[source]¶ Takes in a fasta file and puts each sequence on 1 line. File “gene.fa”:
>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATC
TTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
>something other gene
CGGACACACAAAAAGAAAGAAGA
TTTTGTGTGCGAATAACTATGAG
Code:
cat("gene.fa") | bio.longFa() | cli.headOut()
Prints out:
>AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
>something other gene
CGGACACACAAAAAGAAAGAAGATTTTGTGTGCGAATAACTATGAG
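The transformation itself is simple enough to sketch in plain Python. `long_fa` below is a hypothetical generator written for illustration (not k1lib's implementation): header lines starting with “>” pass through, and the sequence lines under each header are joined into one.

```python
def long_fa(lines):
    """Join the sequence lines under each '>' header into a single line."""
    buffer = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if buffer:                 # flush the previous record's sequence
                yield "".join(buffer)
                buffer = []
            yield line
        else:
            buffer.append(line)
    if buffer:                         # flush the last record
        yield "".join(buffer)


fasta = [
    ">AF086833.2 Ebola virus - Mayinga, Zaire, 1976, complete genome",
    "CGGACACACAAAAAGAAAGAAGAATTTTTAGGATC",
    "TTTTGTGTGCGAATAACTATGAGGAAGATTAATAA",
    ">something other gene",
    "CGGACACACAAAAAGAAAGAAGA",
    "TTTTGTGTGCGAATAACTATGAG",
]
result = list(long_fa(fasta))
print("\n".join(result))
```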
-
class
k1lib.cli.bio.
idx
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Indexes files with various formats.
-
static
blast
(fileName: Optional[str] = None, dbtype: Optional[str] = None)[source]¶ Uses makeblastdb to create a blast database from a fasta file. Example:
"file.fa" | bio.idx.blast()
bio.idx.blast("file.fa")
-
static
-
class
k1lib.cli.bio.
transcribe
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Transcribes (DNA -> RNA) incoming rows. Example:
# returns "AUCG"
"ATCG" | transcribe()
# returns ["AUCG"]
["ATCG"] | transcribe() | deref()
-
class
k1lib.cli.bio.
complement
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Get the reverse complement of DNA. Example:
# returns "TAGC"
"ATCG" | bio.complement()
# returns ["TAGC"]
["ATCG"] | bio.complement() | deref()
-
class
k1lib.cli.bio.
translate
(length: int = 0)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
class
k1lib.cli.bio.
medAa
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Converts short aa sequence to medium one
entrez module¶
This module is not really fleshed out, not that useful/elegant, and I just use cmd instead.
mgi module¶
All tools related to the MGI database. Expected to be used under the “mgi” module name, like this:
from k1lib.imports import *
["SOD1", "AMPK"] | mgi.batch()
filt module¶
This is for functions that cut out specific parts of the table
-
class
k1lib.cli.filt.
filt
(predicate: Callable[[T], bool], column: Optional[int] = None)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(predicate: Callable[[T], bool], column: Optional[int] = None)[source]¶ Filters out lines. Examples:
# returns [2, 6]
[2, 3, 5, 6] | filt(lambda x: x%2 == 0) | deref()
# returns [3, 5]
[2, 3, 5, 6] | ~filt(lambda x: x%2 == 0) | deref()
# returns [[2, 'a'], [6, 'c']]
[[2, "a"], [3, "b"], [5, "a"], [6, "c"]] | filt(lambda x: x%2 == 0, 0) | deref()
You can also pass in op, for extra intuitiveness:
# returns [2, 6]
[2, 3, 5, 6] | filt(op() % 2 == 0) | deref()
# returns ['abc', 'a12']
["abc", "def", "a12"] | filt(op().startswith("a")) | deref()
- Parameters
column –
if integer, then predicate(row[column])
if None, then predicate(row)
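The column parameter and the `~` inversion can be sketched together in plain Python. The `Filt` class below is a hypothetical illustration of the behavior described here, not k1lib's filt class.

```python
class Filt:
    """Hypothetical sketch of a filter tool; not k1lib's actual filt class."""
    def __init__(self, predicate, column=None, inverted=False):
        self.predicate, self.column, self.inverted = predicate, column, inverted

    def __invert__(self):
        # `~Filt(...)` returns a new tool that keeps what the original drops
        return Filt(self.predicate, self.column, not self.inverted)

    def __ror__(self, rows):
        for row in rows:
            # column=None tests the row itself, otherwise tests row[column]
            value = row if self.column is None else row[self.column]
            if bool(self.predicate(value)) != self.inverted:
                yield row


is_even = Filt(lambda x: x % 2 == 0)
evens = list([2, 3, 5, 6] | is_even)
odds = list([2, 3, 5, 6] | ~is_even)
by_col = list([[2, "a"], [3, "b"], [6, "c"]] | Filt(lambda x: x % 2 == 0, 0))
print(evens, odds, by_col)
```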
-
-
k1lib.cli.filt.
isFile
() → k1lib.cli.filt.filt[source]¶ Filters out non-files. Example:
# returns ["a.py", "b.py"], if those files really do exist
["a.py", "hg/", "b.py"] | isFile()
-
k1lib.cli.filt.
inSet
(values: Set[Any], column: Optional[int] = None) → k1lib.cli.filt.filt[source]¶ Filters out lines that are not in the specified set. Example:
# returns [2, 3]
range(5) | inSet([2, 8, 3]) | deref()
# returns [0, 1, 4]
range(5) | ~inSet([2, 8, 3]) | deref()
-
k1lib.cli.filt.
contains
(s: str, column: Optional[int] = None) → k1lib.cli.filt.filt[source]¶ Filters out lines that don’t contain the specified substring. Sort of similar to grep, but this is simpler, and can be inverted. Example:
# returns ['abcd', '2bcr']
["abcd", "0123", "2bcr"] | contains("bc") | deref()
-
class
k1lib.cli.filt.
empty
(reverse=False)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(reverse=False)[source]¶ Filters out streams that are not empty. Almost always used inverted, but “empty” is a short, sweet name that is easy to remember. Example:
# returns [[1, 2], ['a']]
[[], [1, 2], [], ["a"]] | ~empty() | deref()
- Parameters
reverse – not intended to be used by the end user. Do
~empty()
instead.
-
-
k1lib.cli.filt.
isNumeric
(column: Optional[int] = None) → k1lib.cli.filt.filt[source]¶ Filters out a line if that column is not a number. Example:
# returns [0, 2, '3']
[0, 2, "3", "a"] | isNumeric() | deref()
-
k1lib.cli.filt.
instanceOf
(cls: Union[type, Tuple[type]], column: Optional[int] = None) → k1lib.cli.filt.filt[source]¶ Filters out lines that are not instances of the given type. Example:
# returns [2]
[2, 2.3, "a"] | instanceOf(int) | deref()
# returns [2, 2.3]
[2, 2.3, "a"] | instanceOf((int, float)) | deref()
-
k1lib.cli.filt.
inRange
(min: float = - inf, max: float = inf, column: Optional[int] = None) → k1lib.cli.filt.filt[source]¶ Checks whether a column is in range or not. Example:
# returns [-2, 3, 6]
[-2, -8, 3, 6] | inRange(min=-3) | deref()
# returns [-8]
[-2, -8, 3, 6] | ~inRange(min=-3) | deref()
-
class
k1lib.cli.filt.
head
(n: int = 10)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(n: int = 10)[source]¶ Only outputs the first n lines. You can also negate it (like ~head(5)), which then only outputs the lines after the first n lines. Examples:
"abcde" | head(2) | deref() # returns ["a", "b"]
"abcde" | ~head(2) | deref() # returns ["c", "d", "e"]
"0123456" | head(-3) | deref() # returns ['0', '1', '2', '3']
"0123456" | ~head(-3) | deref() # returns ['4', '5', '6']
-
-
class
k1lib.cli.filt.
columns
(*columns: List[int])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(*columns: List[int])[source]¶ Cuts out specific columns, sliceable. Examples:
["0123456789"] | cut(5, 8) | deref() # returns [['5', '8']]
["0123456789"] | cut(2) | deref() # returns ['2']
["0123456789"] | ~cut()[:7:2] | deref() # returns [['1', '3', '5', '7', '8', '9']]
If you’re selecting only 1 column, then Iterator[T] will be returned, not Table[T].
-
-
k1lib.cli.filt.
cut
¶ alias of
k1lib.cli.filt.columns
-
class
k1lib.cli.filt.
rows
(*rows: List[int])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(*rows: List[int])[source]¶ Cuts out specific rows. Space complexity O(1) as a list is not constructed (unless you’re using some really weird slices).
- Parameters
rows – ints for the row indices
Example:
"0123456789" | rows(2) | deref() # returns ["2"]
"0123456789" | rows(5, 8) | deref() # returns ["5", "8"]
"0123456789" | rows()[2:5] | deref() # returns ["2", "3", "4"]
"0123456789" | ~rows()[2:5] | deref() # returns ["0", "1", "5", "6", "7", "8", "9"]
"0123456789" | ~rows()[:7:2] | deref() # returns ['1', '3', '5', '7', '8', '9']
"0123456789" | rows()[:-4] | deref() # returns ['0', '1', '2', '3', '4', '5']
"0123456789" | ~rows()[:-4] | deref() # returns ['6', '7', '8', '9']
-
-
class
k1lib.cli.filt.
intersection
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Returns the intersection of multiple streams. Example:
# returns set([2, 4, 5])
[[1, 2, 3, 4, 5], [7, 2, 4, 6, 5]] | intersection()
-
class
k1lib.cli.filt.
union
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Returns the union of multiple streams. Example:
# returns {0, 1, 2, 10, 11, 12, 13, 14}
[range(3), range(10, 15)] | union()
-
class
k1lib.cli.filt.
unique
(column: int)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(column: int)[source]¶ Filters out non-unique row elements. Example:
# returns [[1, "a"], [2, "a"]]
[[1, "a"], [2, "a"], [1, "b"]] | unique(0) | deref()
- Parameters
column – doesn’t have the default case of None, because you can always use
k1lib.cli.utils.toSet
-
-
class
k1lib.cli.filt.
breakIf
(f)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
class
k1lib.cli.filt.
mask
(mask: Iterator[bool])[source]¶ Bases:
k1lib.cli.init.BaseCli
gb module¶
All tools related to the GenBank file format. Expected to be used under the “gb” module name, like this:
from k1lib.imports import *
cat("abc.gb") | gb.feats()
-
class
k1lib.cli.gb.
feats
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Fetches features, each on a separate stream
-
static
filt
(*terms: str) → k1lib.cli.init.BaseCli[source]¶ Filters for specific terms in all the features texts. If there are multiple terms, then filters for first term, then second, then third, so the term’s order might matter to you
-
static
tag
(tag: str) → k1lib.cli.init.BaseCli[source]¶ Gets a single tag out. Applies this on a single feature only
-
static
-
class
k1lib.cli.gb.
origin
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Return the origin section of the genbank file
grep module¶
-
class
k1lib.cli.grep.
grep
(pattern: str, before: int = 0, after: int = 0, N: int = inf, sep: bool = False)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(pattern: str, before: int = 0, after: int = 0, N: int = inf, sep: bool = False)[source]¶ Finds lines that have the specified pattern. Example:
# returns ['d', 'd']
"abcde12d34" | grep("d") | deref()
# returns ['c', 'd', '2', 'd'], 2 sections of ['c', 'd'] and ['2', 'd']
"abcde12d34" | grep("d", 1) | deref()
# returns ['c', 'd']
"abcde12d34" | grep("d", 1, N=1) | deref()
# returns ['d', 'e', 'd', '3', '4'], 2 sections of ['d', 'e'] and ['d', '3', '4']
"abcde12d34" | grep("d", 0, 3).till("e") | deref()
# returns [['0', '1', '2'], ['3', '1', '4']]
"0123145" | grep("1", 2, 1, sep=True) | deref()
You can also separate out the sections:
# returns [['c', 'd'], ['2', 'd']]
"abcde12d34" | grep("d", 1, sep=True) | deref()
# returns [['c', 'd']]
"abcde12d34" | grep("d", 1, N=1, sep=True) | deref()
# returns [['1', '2', '3'], ['1', '4', '5']]
"0123145" | grep("1", sep=True).till() | deref()
- Parameters
pattern – regex pattern to search for in a line
before – lines before the hit. Outputs independent lines
after – lines after the hit. Outputs independent lines
N – max sections to output
sep – whether to separate out the sections as lists
-
till
(pattern: Optional[str] = None)[source]¶ Greps until some other pattern appears. Inclusive, so you might want to trim the last line. Example:
# returns ['5', '6', '7', '8'], includes last item
range(10) | join("") | grep("5").till("8") | deref()
# returns ['d', 'e', 'd', '3', '4']
"abcde12d34" | grep("d").till("e") | deref()
# returns ['d', 'e']
"abcde12d34" | grep("d", N=1).till("e") | deref()
If the initial pattern and the till pattern are the same, then you don’t have to use this method at all. Instead, do something like this:
# returns ['1', '2', '3']
"0123145" | grep("1", after=1e9, N=1) | deref()
-
init module¶
-
class
k1lib.cli.init.
BaseCli
(fs=[])[source]¶ Bases:
object
A base class for all the cli stuff. You can definitely create new cli tools that have the same feel without extending from this class, but advanced stream operations (like +, &, .all(), |) won’t work.
At the moment, you don’t have to call super().__init__() and super().__ror__(), as __init__’s only job right now is to solidify any op passed to it, and __ror__ does nothing.
-
__init__
(fs=[])[source]¶ Not expected to be instantiated by the end user.
- Parameters
fs – if a function in here is actually an op, then solidifies it (makes it not absorb __call__ anymore)
-
__and__
(cli: k1lib.cli.init.BaseCli) → k1lib.cli.init.oneToMany[source]¶ Duplicates input stream to multiple joined clis. Example:
# returns [[5], [0, 1, 2, 3, 4]]
range(5) | (shape() & iden()) | deref()
Kinda like apply. There are just multiple ways of doing this. This, I think, is more intuitive, and apply is more for lambdas and column mode. Performance is pretty much identical.
-
__add__
(cli: k1lib.cli.init.BaseCli) → k1lib.cli.init.mtmS[source]¶ Parallel pass multiple streams to multiple clis.
-
all
(n: int = 1) → k1lib.cli.init.BaseCli[source]¶ Applies this cli to all incoming streams.
- Parameters
n – how many times should I chain
.all()
?
-
__or__
(cli) → k1lib.cli.init.serial[source]¶ Joins clis end-to-end
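The four joining operations described for this class (`&`, `+`, `.all()` and `|`) can be sketched with a tiny hypothetical base class. This is only an illustration of the semantics under simplifying assumptions: it handles two clis at a time and expects re-iterable inputs such as lists, whereas the real classes (oneToMany, mtmS, manyToMany, serial) are more general.

```python
class Cli:
    """Hypothetical minimal base class; k1lib's BaseCli is more involved."""
    def __init__(self, f):
        self.f = f

    def __ror__(self, it):       # data | cli: run the tool on the data
        return self.f(it)

    def __or__(self, other):     # cli | cli: join end-to-end
        return Cli(lambda it: other.f(self.f(it)))

    def __and__(self, other):    # cli & cli: the same input goes to every cli
        return Cli(lambda it: [self.f(it), other.f(it)])

    def __add__(self, other):    # cli + cli: the i-th stream goes to the i-th cli
        return Cli(lambda its: [c.f(it) for c, it in zip((self, other), its)])

    def all(self):               # apply this cli to every incoming stream
        return Cli(lambda its: [self.f(it) for it in its])


length, total = Cli(len), Cli(sum)

duplicated = [1, 2, 3] | (length & total)          # one stream, two clis
parallel = [[1, 2], [3, 4, 5]] | (length + total)  # two streams, two clis
each = [[1, 2], [3, 4, 5]] | total.all()           # two streams, one cli
chained = [[1, 2], [3]] | (total.all() | total)    # clis joined end-to-end
print(duplicated, parallel, each, chained)
```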
-
-
k1lib.cli.init.
fastF
(c)[source]¶ Tries to figure out what’s going on, is it a normal function, or an applyS, or a BaseCli, etc., and return a really fast function for execution. Example:
# both return 16, fastF returns "lambda x: x**2", so it's really fast
fastF(op()**2)(4)
fastF(applyS(lambda x: x**2))(4)
-
class
k1lib.cli.init.
serial
(*clis: List[k1lib.cli.init.BaseCli])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(*clis: List[k1lib.cli.init.BaseCli])[source]¶ Merges clis into 1, feeding end to end. Used in chaining clis together without a prime iterator. Meaning, without this, stuff like this fails to run:
[1, 2] | a() | b() # runs
c = a() | b(); [1, 2] | c # doesn't run if this class doesn't exist
-
-
class
k1lib.cli.init.
oneToMany
(*clis: List[k1lib.cli.init.BaseCli])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(*clis: List[k1lib.cli.init.BaseCli])[source]¶ Duplicates 1 stream into multiple streams, each for a cli in the list. Used in the “a & b” joining operator. See also:
BaseCli.__and__()
-
-
class
k1lib.cli.init.
manyToMany
(cli: k1lib.cli.init.BaseCli)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(cli: k1lib.cli.init.BaseCli)[source]¶ Applies multiple streams to a single cli. Used in the
BaseCli.all()
operator.
-
-
class
k1lib.cli.init.
mtmS
(*clis: List[k1lib.cli.init.BaseCli])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(*clis: List[k1lib.cli.init.BaseCli])[source]¶ Applies multiple streams to multiple clis independently. Used in the “a + b” joining operator. See also: BaseCli.__add__(). The weird name is actually shorthand for “many to many specific”.
-
static
f
(f, i: int, n: int = 100)[source]¶ Convenience method, so that this:
mtmS(iden(), op()**2, iden(), iden(), iden())
# also the same as this btw:
(iden() + op()**2 + iden() + iden() + iden())
is the same as this:
mtmS.single(op()**2, 1, 5)
Example:
# returns [5, 36, 7, 8, 9]
range(5, 10) | mtmS.single(op()**2, 1, 5) | deref()
- Parameters
i – where should I put the function?
n – how many clis in total? Defaulted to 100
-
inp module¶
This module is for tools that will likely start the processing stream.
-
k1lib.cli.inp.
cat
(fileName: Optional[str] = None, text: bool = True)[source]¶ Reads a file line by line. Example:
# display first 10 lines of file
cat("file.txt") | headOut()
# piping in also works
"file.txt" | cat() | headOut()
# copy file
cat("img.png", False) | file("img2.png", False)
- Parameters
fileName – if None, then returns a BaseCli that accepts a file name and outputs Iterator[str]
text – if True, read text file, else read binary file
-
k1lib.cli.inp.
curl
(url: str) → Iterator[str][source]¶ Gets file from url. File can’t be a binary blob. Example:
# prints out first 10 lines of the website
curl("https://k1lib.github.io/") | headOut()
-
k1lib.cli.inp.
wget
(url: str, fileName: Optional[str] = None)[source]¶ Downloads a file. Also returns the file name, in case you want to pipe it to something else.
- Parameters
url – The url of the file
fileName – if None, then tries to infer it from the url
-
k1lib.cli.inp.
ls
(folder: Optional[str] = None)[source]¶ List every file and folder inside the specified folder. Example:
# returns List[str]
ls("/home")
# same as above
"/home" | ls()
# only outputs files, not folders
ls("/home") | isFile()
See also:
isFile()
-
class
k1lib.cli.inp.
cmd
(cmd: str, mode: int = 1, text=True)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(cmd: str, mode: int = 1, text=True)[source]¶ Runs a command, and returns the output line by line. Can pipe in some inputs. If there are no inputs, then you have to pipe in None. Example:
# return detailed list of files
None | cmd("ls -la")
# return list of files that end with "ipynb"
None | cmd("ls -la") | cmd('grep ipynb$')
It might be tiresome to pipe in None all the time. So, you can use the “>” operator to yield values right away:
# prints out first 10 lines of list of files
cmd("ls -la") > headOut()
If you’re using Jupyter notebook/lab, then displaying a cmd object will print out its outputs. So, a single command cmd("mkdir") displayed at the end of a cell is enough to trigger creating the directory.
Reminder that the “>” operator here sort of has a different meaning to that of BaseCli. So you kinda have to be careful about this:
# returns a serial cli, cmd not executed
cmd("ls -la") | deref()
# executes cmd with no input stream and pipes output to deref
cmd("ls -la") > deref()
# returns a serial cli
cmd("ls -la") > grep("txt") > headOut()
# executes pipeline
cmd("ls -la") > grep("txt") | headOut()
General advice is, right after a cmd, use “>”, and use “|” everywhere else.
Let’s see a few more exotic examples. File a.sh:
#!/bin/bash
echo 1; sleep 0.5
echo This message goes to stderr >&2
echo 2; sleep 0.5
echo $(</dev/stdin)
sleep 0.5; echo 3
Examples:
# returns [b'1', b'2', b'45', b'3'] and prints out the error message
"45" | cmd("./a.sh", text=False) | deref()
# returns [b'This message goes to stderr']
"45" | cmd("./a.sh", mode=2, text=False) | deref()
# returns [[b'1', b'2', b'45', b'3'], [b'This message goes to stderr']]
"45" | cmd("./a.sh", mode=0, text=False) | deref()
Performance-wise, stdout and stderr will yield values as soon as the process outputs them, so you get real-time feedback. However, this will convert the entire input into a bytes object, and not feed it bit by bit lazily, so if you have a humongous input, it might slow you down a little.
Settings:
- cli.quiet: if True, won’t display errors in mode 1
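The three modes seen in the examples (1 for stdout, 2 for stderr, 0 for both) can be approximated with the standard `subprocess` module. The `run` helper below is a hypothetical sketch, not how cmd is implemented: in particular it waits for the process to finish instead of streaming lines in real time. Like cmd, it passes the whole input as one bytes object.

```python
import subprocess
import sys


def run(args, stdin_bytes=b"", mode=1):
    """mode 1: stdout lines, mode 2: stderr lines, mode 0: [stdout, stderr]."""
    proc = subprocess.run(args, input=stdin_bytes, capture_output=True)
    out = proc.stdout.splitlines()
    err = proc.stderr.splitlines()
    return {0: [out, err], 1: out, 2: err}[mode]


# A child process that echoes stdin to stdout and writes one line to stderr
child = [sys.executable, "-c",
         "import sys; print(sys.stdin.read().strip()); print('oops', file=sys.stderr)"]

stdout_only = run(child, b"45")
stderr_only = run(child, b"45", mode=2)
both = run(child, b"45", mode=0)
print(stdout_only, stderr_only, both)
```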
-
kcsv module¶
All tools related to the csv file format. Expected to be used under the “kcsv” module name, like this:
from k1lib.imports import *
kcsv.cat("file.csv") | display()
kxml module¶
All tools related to the xml file format. Expected to be used under the “kxml” module name, like this:
from k1lib.imports import *
cat("abc.xml") | kxml.node() | kxml.display()
-
class
k1lib.cli.kxml.
node
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Turns lines into a single node
-
__ror__
(it: Iterator[str]) → Iterator[xml.etree.ElementTree.Element][source]¶
-
-
class
k1lib.cli.kxml.
maxDepth
(depth: Optional[int] = None, copy: bool = True)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(depth: Optional[int] = None, copy: bool = True)[source]¶ Filters out too deep nodes
- Parameters
depth – max depth to include in
copy – whether to limit the nodes itself, or limit a copy
-
__ror__
(nodes: Iterator[xml.etree.ElementTree.Element]) → Iterator[xml.etree.ElementTree.Element][source]¶
-
-
class
k1lib.cli.kxml.
tag
(tag: str)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(tag: str)[source]¶ Finds all tags that have a particular name. If found, then don’t search deeper
-
__ror__
(nodes: Iterator[xml.etree.ElementTree.Element]) → Iterator[xml.etree.ElementTree.Element][source]¶
-
-
class
k1lib.cli.kxml.
pretty
(indent: Optional[str] = None)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__ror__
(it: Iterator[xml.etree.ElementTree.Element]) → Iterator[str][source]¶
-
modifier module¶
This is for quick modifiers, think of them as changing formats
-
class
k1lib.cli.modifier.
applyS
(f: Callable[[T], T])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(f: Callable[[T], T])[source]¶ Like apply, but much simpler, just operating on the entire input object, essentially. The “S” stands for “single”. Example:
# returns 5
3 | applyS(lambda x: x+2)
Like apply, you can also use this as a decorator like this:
@applyS
def f(x):
    return x+2
# returns 5
3 | f
This also decorates the returned object so that it has the same qualname, docstring and whatnot.
-
-
class
k1lib.cli.modifier.
apply
(f: Callable[[T], T], column: Optional[int] = None, cache: int = 0)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(f: Callable[[T], T], column: Optional[int] = None, cache: int = 0)[source]¶ Applies a function f to every line. Example:
# returns [0, 1, 4, 9, 16]
range(5) | apply(lambda x: x**2) | deref()
# returns [[3.0, 1.0, 1.0], [3.0, 1.0, 1.0]]
torch.ones(2, 3) | apply(lambda x: x+2, 0) | deref()
You can also use this as a decorator, like this:
@apply
def f(x):
    return x**2
# returns [0, 1, 4, 9, 16]
range(5) | f | deref()
You can also add a cache, like this:
def calc(i): time.sleep(0.5); return i**2
# takes 2.5s
range(5) | repeatFrom(2) | apply(calc, cache=10) | deref()
# takes 5s
range(5) | repeatFrom(2) | apply(calc) | deref()
- Parameters
column – if not None, then applies the function to that column only
cache – if specified, then caches this many values
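The column and cache parameters can be sketched as a plain generator function. This is a hypothetical illustration that assumes an LRU-style cache via `functools.lru_cache`; the library's own caching strategy may differ.

```python
from functools import lru_cache


def apply(f, rows, column=None, cache=0):
    """Hypothetical function form of the tool described above."""
    if cache > 0:
        f = lru_cache(maxsize=cache)(f)   # arguments must be hashable
    for row in rows:
        if column is None:
            yield f(row)
        else:
            row = list(row)               # copy so the input row is not mutated
            row[column] = f(row[column])
            yield row


calls = []

def square(x):
    calls.append(x)                       # record every real (uncached) call
    return x ** 2


squares = list(apply(square, [0, 1, 2, 0, 1, 2], cache=10))
shifted = list(apply(lambda x: x + 2, [[1, 1, 1], [1, 1, 1]], column=0))
print(squares, calls, shifted)
```

With the cache enabled, the second pass over 0, 1, 2 never reaches `square`, which is the effect behind the 2.5s versus 5s timings in the example above.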
-
-
class
k1lib.cli.modifier.
applyMp
(f: Callable[[T], T], prefetch: Optional[int] = None, timeout: float = 8, utilization: float = 0.8, bs: int = 1, **kwargs)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(f: Callable[[T], T], prefetch: Optional[int] = None, timeout: float = 8, utilization: float = 0.8, bs: int = 1, **kwargs)[source]¶ Like apply, but executes f(row) of each row in multiple processes. Example:
# returns [3, 2]
["abc", "de"] | applyMp(lambda s: len(s)) | deref()
# returns [5, 6, 9]
range(3) | applyMp(lambda x, bias: x**2+bias, bias=5) | deref()
# returns [[1, 2, 3], [1, 2, 3]], demonstrating outside vars work
someList = [1, 2, 3]
["abc", "de"] | applyMp(lambda s: someList) | deref()
Internally, this will continuously spawn new jobs up until 80% of all CPU cores are utilized. On posix systems, the default multiprocessing start method is fork(). This sort of means that all the variables in memory will be copied over. This might be expensive (might also not, with copy-on-write), so you might have to think about that. On windows and macos, the default start method is spawn, meaning each child process is a completely new interpreter, so you have to pass in all required variables and reimport every dependency. Read more at https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
If you don’t wish to schedule all jobs at once, you can specify a prefetch amount, and it will only schedule that many jobs ahead of time. Example:
range(10000) | applyMp(lambda x: x**2) | head() | deref() # 700ms
range(10000) | applyMp(lambda x: x**2, 5) | head() | deref() # 300ms
# demonstrating there're no huge penalties even if we want all results at the same time
range(10000) | applyMp(lambda x: x**2) | deref() # 900ms
range(10000) | applyMp(lambda x: x**2, 5) | deref() # 1000ms
The first line will schedule all jobs at once, and thus will require more RAM and compute power, even though we discard most of the results anyway (the head cli). The second line only schedules 5 jobs ahead of time, and thus will be far more efficient if you don’t need all results right away.
Note
Remember that every BaseCli is also a function, meaning that you can do stuff like:
# returns [['ab', 'ac']]
[["ab", "cd", "ac"]] | applyMp(filt(op().startswith("a")) | deref()) | deref()
Also remember that the return result of f should not be a generator. That’s why in the example above, there’s a deref() inside f.
Most of the time, you would probably want to set bs to something bigger than 1 (maybe 32 or something like that). This will execute f multiple times in a single job, instead of executing f only once per job. This should reduce the overhead of process creation dramatically.
If you encounter strange errors not seen on apply, you can try to clear all pools (using clearPools()), to terminate all child processes and thus free resources. On earlier versions, you had to do this manually before exiting, but now applyMp is much more robust.
Also, you should not immediately assume that applyMp will always be faster than apply. Remember that applyMp will create new processes, serialize and transfer data to them, execute it, then transfer data back. If your code transfers a lot of data back and forth (compared to the amount of computation done), or the child processes don’t have a lot of stuff to do before returning, it may very well be a lot slower than apply.
- Parameters
prefetch – if not specified, schedules all jobs at the same time. If specified, keeps only that many jobs scheduled at a time, and only schedules more as results are actually being used.
timeout – seconds to wait for job before raising an error
utilization – what fraction of the cores to run on. 0 for no cores, 1 for all the cores. Defaulted to 0.8
bs – if specified, groups bs number of transforms into 1 job to be more efficient.
kwargs – extra arguments to be passed to the function. args not included as there’re a couple of options you can pass for this cli.
-
-
class
k1lib.cli.modifier.
applyTh
(f, prefetch: int = 2, timeout: float = 5, bs: int = 1)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(f, prefetch: int = 2, timeout: float = 5, bs: int = 1)[source]¶ Kinda the same as applyMp, but executes f on multiple threads, instead of on multiple processes. Advantages:
- Relatively low overhead for thread creation
- Fast, if f is io-bound
- Does not have to serialize and deserialize the result, meaning iterators can be exchanged
Disadvantages:
- Still has thread creation overhead, so it’s still recommended to specify bs
- Is slow if f has to obtain the GIL to be able to do anything
All examples from applyMp should work perfectly here.
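The prefetch idea (only schedule a bounded number of jobs ahead of what the consumer has used) can be sketched with the standard `concurrent.futures` module. `apply_threads` is a hypothetical helper for illustration, assuming prefetch is at least 1; it is not the library's implementation and it ignores bs.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def apply_threads(f, items, prefetch=2, timeout=5):
    """Yield f(item) in order, keeping at most `prefetch` jobs scheduled ahead."""
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        pending = deque()
        for item in items:
            pending.append(pool.submit(f, item))
            if len(pending) >= prefetch:
                # Hand out the oldest result before scheduling anything more
                yield pending.popleft().result(timeout=timeout)
        while pending:                    # drain whatever is still in flight
            yield pending.popleft().result(timeout=timeout)


squares = list(apply_threads(lambda x: x ** 2, range(5)))
print(squares)
```

Because the function is a generator, a consumer that stops early (for example after taking the first 10 results) never causes the rest of the input to be scheduled.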
-
-
class
k1lib.cli.modifier.
applySerial
(f, includeFirst=False)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(f, includeFirst=False)[source]¶ Applies a function repeatedly. Yields f(x), then f(f(x)), then f(f(f(x))) and so on, where x is the input. If includeFirst is True, yields x itself first. Example:
# returns [4, 8, 16, 32, 64]
2 | applySerial(op()*2) | head(5) | deref()
If the result of your operation is an iterator, you might want to deref it, like this:
rs = iter(range(8)) | applySerial(rows()[::2])
# returns [0, 2, 4, 6]
next(rs) | deref()
# returns []. This is because all the elements are taken by the previous deref()
next(rs) | deref()
rs = iter(range(8)) | applySerial(rows()[::2] | deref())
# returns [0, 2, 4, 6]
next(rs)
# returns [0, 4]
next(rs)
# returns [0]
next(rs)
- Parameters
f – function to apply repeatedly
includeFirst – whether to include the raw input value or not
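In plain Python, repeated application is an infinite generator. The sketch below is a hypothetical equivalent written for illustration; `apply_serial` and `include_first` are illustrative names, not the library's.

```python
from itertools import islice


def apply_serial(f, x, include_first=False):
    """Yield f(x), f(f(x)), ... forever; optionally yield x itself first."""
    if include_first:
        yield x
    while True:
        x = f(x)
        yield x


doubles = list(islice(apply_serial(lambda v: v * 2, 2), 5))
with_first = list(islice(apply_serial(lambda v: v * 2, 2, include_first=True), 3))
print(doubles, with_first)
```

The stream never ends on its own, which is why the example above limits it with head(5).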
-
-
k1lib.cli.modifier.
replace
(s: str, target: Optional[str] = None, column: Optional[int] = None)[source]¶ Replaces substring s with target for each line. Example:
# returns ['104', 'ab0c']
["1234", "ab23c"] | replace("23", "0") | deref()
- Parameters
target – if not specified, then use the default delimiter specified in
settings
-
k1lib.cli.modifier.
remove
(s: str, column: Optional[int] = None)[source]¶ Removes a specific substring in each line.
-
k1lib.cli.modifier.
toFloat
(*columns: List[int], force=False)[source]¶ Converts every row into a float. Example:
# returns [1, 3, -2.3]
["1", "3", "-2.3"] | toFloat() | deref()
# returns [[1.0, 'a'], [2.3, 'b'], [8.0, 'c']]
[["1", "a"], ["2.3", "b"], [8, "c"]] | toFloat(0) | deref()
With weird rows:
# returns [[1.0, 'a'], [8.0, 'c']]
[["1", "a"], ["c", "b"], [8, "c"]] | toFloat(0) | deref()
# returns [[1.0, 'a'], [0.0, 'b'], [8.0, 'c']]
[["1", "a"], ["c", "b"], [8, "c"]] | toFloat(0, force=True) | deref()
- Parameters
columns – if nothing, then will convert each row. If available, then convert all the specified columns
force – if True, forces weird values to 0.0, else filters out all weird rows
-
k1lib.cli.modifier.
toInt
(*columns: List[int], force=False)[source]¶ Converts every row into an integer. Example:
# returns [1, 3, -2]
["1", "3", "-2.3"] | toInt() | deref()
- Parameters
columns – if nothing, then will convert each row. If available, then convert all the specified columns
force – if True, forces weird values to 0, else filters out all weird rows
See also:
toFloat()
-
class
k1lib.cli.modifier.
sort
(column: int = 0, numeric=True, reverse=False)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(column: int = 0, numeric=True, reverse=False)[source]¶ Sorts all lines based on a specific column. Example:
# returns [[5, 'a'], [1, 'b']]
[[1, "b"], [5, "a"]] | ~sort(0) | deref()
# returns [[2, 3]]
[[1, "b"], [5, "a"], [2, 3]] | ~sort(1) | deref()
# errors out, as you can't really compare str with int
[[1, "b"], [2, 3], [5, "a"]] | sort(1, False) | deref()
- Parameters
column – if None, sort rows based on themselves and not an element
numeric – whether to convert column to float
reverse – False for smaller to bigger, True for bigger to smaller. Use
__invert__()
to quickly reverse the order instead of using this param
-
-
class
k1lib.cli.modifier.
sortF
(f: Callable[[T], float], reverse=False)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(f: Callable[[T], float], reverse=False)[source]¶ Sorts rows using a function. Example:
# returns ['a', 'aa', 'aaa', 'aaaa', 'aaaaa']
["a", "aaa", "aaaaa", "aa", "aaaa"] | sortF(lambda r: len(r)) | deref()
# returns ['aaaaa', 'aaaa', 'aaa', 'aa', 'a']
["a", "aaa", "aaaaa", "aa", "aaaa"] | ~sortF(lambda r: len(r)) | deref()
-
__invert__
() → k1lib.cli.modifier.sortF[source]¶
-
-
class
k1lib.cli.modifier.
consume
(f: Union[k1lib.cli.init.BaseCli, Callable[[T], None]])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(f: Union[k1lib.cli.init.BaseCli, Callable[[T], None]])[source]¶ Consumes the iterator in a side stream. Returns the iterator. Kinda like the bash command
tee
. Example:# prints "0\n1\n2" and returns [0, 1, 2] range(3) | consume(headOut()) | toList() # prints "range(0, 3)" and returns [0, 1, 2] range(3) | consume(lambda it: print(it)) | toList()
This is useful whenever you want to mutate something, but don’t want to include the function result into the main stream.
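To make the “side stream” idea concrete, here is a minimal sketch of a pass-through object built on __ror__. It assumes the function is simply called on the piped-in object and its result thrown away; the class name Consume is hypothetical and this is not the library’s code:

```python
class Consume:
    """Hypothetical stand-in for consume(f): run f on the input, pass the input through."""
    def __init__(self, f):
        self.f = f

    def __ror__(self, it):
        self.f(it)   # side effect only; the return value is discarded
        return it    # the main stream continues with the original object

seen = []
result = list(range(3) | Consume(lambda it: seen.append(repr(it))))
```

In this sketch, if the input is a one-shot iterator and f exhausts it, nothing is left for the main stream, so pipe in something re-iterable (like a range or list) when f loops over the data.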
-
-
class
k1lib.cli.modifier.
randomize
(bs=100)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(bs=100)[source]¶ Randomize input stream. In order to be efficient, this does not convert the input iterator to a giant list and yield random values from that. Instead, this fetches bs items at a time, randomizes them, yields them, then fetches another bs items. If you want to do the giant list, then just pass in float("inf"), or None. Example:
# returns [0, 1, 2, 3, 4], effectively no randomize at all
range(5) | randomize(1) | deref()
# returns something like this: [1, 0, 2, 3, 5, 4, 6, 8, 7, 9]. You can clearly see the batches
range(10) | randomize(3) | deref()
# returns something like this: [7, 0, 5, 2, 4, 9, 6, 3, 1, 8]
range(10) | randomize(float("inf")) | deref()
# same as above
range(10) | randomize(None) | deref()
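The batch-wise shuffling described above can be sketched in a few lines of standard-library Python. This is only an illustration of the technique under the stated behaviour (fetch bs items, shuffle, yield, repeat), not k1lib’s actual code:

```python
import random
from itertools import islice

def randomize(it, bs=100):
    """Shuffle a stream in chunks of bs items; bs=None or inf shuffles everything at once."""
    it = iter(it)
    n = None if bs is None or bs == float("inf") else int(bs)
    while True:
        batch = list(islice(it, n))  # fetch at most n items (all of them if n is None)
        if not batch:
            return
        random.shuffle(batch)
        yield from batch

out = list(randomize(range(10), 3))
```

Because each chunk is shuffled on its own, an element never moves outside its chunk, which is why the batches remain visible in the output.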
-
-
class
k1lib.cli.modifier.
stagger
(every: int)[source]¶ Bases:
k1lib.cli.init.BaseCli
Staggers input stream into multiple stream “windows” placed serially. Best explained with an example:
o = range(10) | stagger(3)
o | deref() # returns [0, 1, 2], 1st "window"
o | deref() # returns [3, 4, 5], 2nd "window"
o | deref() # returns [6, 7, 8]
o | deref() # returns [9]
o | deref() # returns []
This might be useful when you’re constructing a data loader:
dataset = [range(20), range(30, 50)] | transpose()
dl = dataset | batched(3) | (transpose() | toTensor()).all() | stagger(4)
for epoch in range(3):
    for xb, yb in dl: # looping over a window
        print(epoch)
        # then something like: model(xb)
The above code will print 6 lines. 4 of them are “0” (because we stagger every 4 batches), the other 2 are “1” (20 samples only make 6 full batches of 3), and xb’s shape will be (3,) (because we batched every 3 samples).
You should also keep in mind that this doesn’t really change the property of the stream itself. Essentially, treat these pairs of statements as being the same thing:
o = range(11, 100)
# both return 11
o | stagger(20) | item()
o | item()
# both return [11, 12, ..., 20]
o | head(10) | deref()
o | stagger(20) | head(10) | deref()
Lastly, multiple iterators might be getting values from the same stream window, meaning:
o = range(11, 100) | stagger(10)
it1 = iter(o); it2 = iter(o)
next(it1) # returns 11
next(it2) # returns 12
This may or may not be desirable. Also this should be obvious, but I want to mention this in case it’s not clear to you.
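The window behaviour, including the shared-iterator caveat, can be reproduced with a small class that keeps one underlying iterator and hands out a bounded slice on every iter() call. This is a sketch of the idea only; the class name Stagger is hypothetical and the real implementation may differ:

```python
from itertools import islice

class Stagger:
    """Hypothetical stand-in for stagger(every), built on one shared iterator."""
    def __init__(self, it, every):
        self._it = iter(it)   # single underlying iterator shared by all windows
        self.every = every

    def __iter__(self):
        # each new iteration is a window of at most `every` items
        return islice(self._it, self.every)

o = Stagger(range(10), 3)
windows = [list(o) for _ in range(5)]

shared = Stagger(range(11, 100), 10)
it1, it2 = iter(shared), iter(shared)
first, second = next(it1), next(it2)
```

Since it1 and it2 both pull from the same underlying iterator, they interleave rather than each seeing their own copy of the window.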
-
class
k1lib.cli.modifier.
op
[source]¶ Bases:
k1lib._baseClasses.Absorber
,k1lib.cli.init.BaseCli
Absorbs operations done on it and applies it on the stream. Based on
Absorber
. Example:
t = torch.tensor([[1, 2, 3], [4, 5, 6.0]])
# returns [torch.tensor([[4., 5., 6., 7., 8., 9.]])]
[t] | (op() + 3).view(1, -1).all() | deref()
Basically, you can treat
op()
as the input tensor. Tbh, you can do the same thing with this:
[t] | applyS(lambda t: (t+3).view(1, -1)).all() | deref()
But that’s kinda long and may not be obvious. This can be surprisingly resilient, as you can still combine with other cli tools as usual, for example:
# returns [2, 3], demonstrating "&" operator
torch.randn(2, 3) | (op().shape & identity()) | deref() | item()
a = torch.tensor([[1, 2, 3], [7, 8, 9]])
# returns torch.tensor([4, 5, 6]), demonstrating "+" operator for clis and not clis
(a | op() + 3 + identity() | item() == torch.tensor([4, 5, 6])).all()
# returns [[3], [3]], demonstrating .all() and "|" serial chaining
torch.randn(2, 3) | (op().shape.all() | deref())
# returns [[8, 18], [9, 19]], demonstrating you can treat `op()` as a regular function
[range(10), range(10, 20)] | transpose() | filt(op() > 7, 0) | deref()
This can deal with simple operations only. For complex operations, resort to the longer version
applyS(lambda x: ...)
instead!
Performance-wise, there is some degradation, but not a lot, so don’t worry about it. Simple operations execute pretty much on par with native lambdas:
n = 10_000_000
# takes 1.48s
for i in range(n): i**2
# takes 1.89s, 1.28x worse than for loop
range(n) | apply(lambda x: x**2) | ignore()
# takes 1.86s, 1.26x worse than for loop
range(n) | apply(op()**2) | ignore()
# takes 1.86s
range(n) | (op()**2).all() | ignore()
More complex operations can take more of a hit:
# takes 1.66s
for i in range(n): i**2-3
# takes 2.02s, 1.22x worse than for loop
range(n) | apply(lambda x: x**2-3) | ignore()
# takes 2.81s, 1.69x worse than for loop
range(n) | apply(op()**2-3) | ignore()
Reserved operations that are not absorbed are:
all
__ror__ (__or__ still works!)
op_solidify
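If the “absorbing” idea feels magical, the toy class below shows one way it could work: every operator returns a new object that remembers one more step, and piping a value in replays the steps. This is a simplified sketch with a hypothetical class Op that only covers a handful of operators, not the real Absorber:

```python
class Op:
    """Toy absorber: records a few operators and replays them on the piped-in value."""
    def __init__(self, steps=()):
        self._steps = tuple(steps)

    def _then(self, step):
        return Op(self._steps + (step,))

    def __add__(self, other): return self._then(lambda x: x + other)
    def __sub__(self, other): return self._then(lambda x: x - other)
    def __pow__(self, other): return self._then(lambda x: x ** other)
    def __gt__(self, other): return self._then(lambda x: x > other)

    def __call__(self, x):
        for step in self._steps:   # replay the recorded operations in order
            x = step(x)
        return x

    def __ror__(self, x):          # `value | Op()...` applies the recorded steps
        return self(x)

piped = 4 | Op() ** 2 - 3
mapped = list(map(Op() ** 2, range(4)))
kept = [x for x in range(10) if (Op() > 7)(x)]
```

The sketch also hints at why a few names have to stay reserved: anything the object needs for itself (here __ror__ and __call__) cannot simultaneously be absorbed as an operation on the stream.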
nb module¶
This is for everything related to IPython notebooks. You are expected to use it through the “nb” module name, like this:
from k1lib.imports import *
nb.execute("file.ipynb")
-
k1lib.cli.nb.
cells
(fileName)[source]¶ Gets simplified notebook cells from file source, including fields cell_type and source only. Example:
nb.cells("file.ipynb")
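For a rough idea of what “simplified cells” could look like, here is a sketch that reads a notebook with the standard json module and keeps only the two fields mentioned above. The exact return shape of nb.cells (for example, whether source is a string or a list of lines) is an assumption here, and the function below is not the library’s code:

```python
import json
import os
import tempfile

def cells(file_name):
    """Return [{'cell_type': ..., 'source': ...}, ...] for a notebook file."""
    with open(file_name, encoding="utf-8") as f:
        notebook = json.load(f)
    simplified = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):   # notebooks may store source as a list of lines
            source = "".join(source)
        simplified.append({"cell_type": cell["cell_type"], "source": source})
    return simplified

# tiny throwaway notebook, just to exercise the sketch
demo = {"cells": [
    {"cell_type": "code", "source": ["#notest, export\n", "a = 3"],
     "metadata": {}, "outputs": [], "execution_count": None},
    {"cell_type": "markdown", "source": "b = 6", "metadata": {}},
], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "file.ipynb")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(demo, f)
    result = cells(path)
```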
-
class
k1lib.cli.nb.
pretty
(magics: bool = False, whitelist: List[str] = [], blacklist: List[str] = [])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(magics: bool = False, whitelist: List[str] = [], blacklist: List[str] = [])[source]¶ Makes the cells prettier. Cell 1 in file.ipynb:
#notest, export
a = 3
Cell 2 in file.ipynb:
b = 6
Code:
# only cell 2 gets chosen
nb.cells("file.ipynb") | nb.pretty(blacklist=["notest"])
# only cell 1 gets chosen
nb.cells("file.ipynb") | nb.pretty(whitelist=["export"])
- Parameters
magics – if False, then removes every line in the cell’s source that contains magics (‘!’, ‘%%’ symbols)
whitelist – every cell that doesn’t have any of these properties will be filtered out
blacklist – every cell that has any of these properties will be filtered out
-
-
class
k1lib.cli.nb.
execute
(fileName=None, _globals: Optional[dict] = None)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(fileName=None, _globals: Optional[dict] = None)[source]¶ Executes cells. Example:
nb.cells("file.ipynb") | nb.execute("nb.ipynb")
Most of the time, you’d want to pass cells through
pretty
first, to make sure everything is nice and clean.
- Parameters
fileName – not actually used to read the file. If specified, then changes the current working directory to that of the file
_globals – optional dict of global variables
-
output module¶
For operations that feel like the termination of a pipeline
-
class
k1lib.cli.output.
stdout
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Prints out all lines. If not iterable, then print out the input raw. Example:
# prints out "0\n1\n2"
range(3) | stdout()
# same as above, but (maybe?) more familiar
range(3) > stdout()
-
class
k1lib.cli.output.
file
(fileName: Optional[str] = None, text: bool = True)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(fileName: Optional[str] = None, text: bool = True)[source]¶ Opens a new file for writing. Example:
# writes "0\n1\n2\n" to file
range(3) | file("test/f.txt")
# same as above, but (maybe?) more familiar
range(3) > file("test/f.txt")
# returns ['0', '1', '2']
cat("test/f.txt") | deref()
# writes bytes to file
b'5643' | file("test/a.bin", False)
# returns ['5643']
cat("test/a.bin") | deref()
You can create temporary files on the fly:
# creates temporary file
url = range(3) > file()
# returns ['0', '1', '2']
cat(url) | deref()
This can be especially useful when integrating with shell scripts that want to read in a file:
seq1 = "CCAAACCCCCCCTCCCCCGCTTC"
seq2 = "CCAAACCCCCCCCTCCCCCCGCTTC"
# use "needle" program to locally align 2 sequences
None | cmd(f"needle {[seq1] > file()} {[seq2] > file()} -filter")
You can also append to file with the “>>” operator:
url = range(3) > file()
# appended to file
range(10, 13) >> file(url)
# returns ['0', '1', '2', '10', '11', '12']
cat(url) | deref()
- Parameters
fileName – if not specified, creates a new temporary file and returns its url when something is piped into it
text – if True, accepts Iterator[str], and prints out each string on a new line. Else accepts bytes and write in 1 go.
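The temporary-file and append behaviours can be imitated with the standard library alone. The sketch below is only meant to show the mechanics (one item per line, a fresh temp file when no name is given, append mode for “>>”); the helper names write_lines and append_lines are hypothetical:

```python
import os
import tempfile

def write_lines(lines, file_name=None):
    """Write each item on its own line; create a temp file if no name is given."""
    if file_name is None:
        fd, file_name = tempfile.mkstemp(suffix=".txt")
        os.close(fd)
    with open(file_name, "w") as f:
        for line in lines:
            f.write(f"{line}\n")
    return file_name  # the caller gets the path back, like `url` above

def append_lines(lines, file_name):
    """Same as write_lines, but keeps the existing content (like the '>>' operator)."""
    with open(file_name, "a") as f:
        for line in lines:
            f.write(f"{line}\n")
    return file_name

url = write_lines(range(3))
append_lines(range(10, 13), url)
with open(url) as f:
    content = f.read().splitlines()
os.remove(url)  # clean up the temp file created by this sketch
```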
-
-
class
k1lib.cli.output.
pretty
(delim='')[source]¶ Bases:
k1lib.cli.init.BaseCli
-
class
k1lib.cli.output.
intercept
(raiseError: bool = True)[source]¶ Bases:
k1lib.cli.init.BaseCli
sam module¶
This is for functions that are .sam or .bam related
-
k1lib.cli.sam.
cat
(bamFile: str, header: bool = True)[source]¶ Get sam file outputs from bam file. Example:
sam.cat("file.bam") | display()
- Parameters
header – whether to include headers or not
-
class
k1lib.cli.sam.
header
(long=True)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
class
k1lib.cli.sam.
flag
(f=None)[source]¶ Bases:
k1lib.cli.utils.bindec
-
__init__
(f=None)[source]¶ Decodes flags attribute. Example:
# returns ['PAIRED', 'UNMAP']
5 | flag()
# returns 'PAIRED, UNMAP'
5 | flag(cli.join(", "))
You’ll mostly use this in this format:
sam.cat("file.bam", False) | apply(sam.flag(), 1) | display()
You can change the flag labels like this:
settings.cli.sam.flags = ["paired", ...]
- Parameters
f – transform function fed into bindec, defaults to join(", ")
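Decoding a flag is just reading off which bits are set. Here is a small standalone sketch of that idea; the label list follows the SAM flag names as printed by samtools, and whether k1lib’s default labels are exactly these is an assumption:

```python
# bit 0 is the least significant bit (0x1), bit 11 is 0x800
SAM_FLAGS = ["PAIRED", "PROPER_PAIR", "UNMAP", "MUNMAP", "REVERSE", "MREVERSE",
             "READ1", "READ2", "SECONDARY", "QCFAIL", "DUP", "SUPPLEMENTARY"]

def decode_flag(value, labels=SAM_FLAGS):
    """Return the labels whose bit is set in value."""
    return [label for bit, label in enumerate(labels) if (value >> bit) & 1]

decoded = decode_flag(5)        # 5 = 0b101, so bits 0 and 2 are set
joined = ", ".join(decoded)
```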
-
structural module¶
This is for functions that sort of change the table structure in a dramatic way. They’re the core transformations
-
k1lib.cli.structural.
yieldSentinel
¶ Object that can be yielded in a stream to ignore this stream for the moment in joinStreamsRandom. It will also stop deref early.
-
class
k1lib.cli.structural.
joinStreamsRandom
(fs=[])[source]¶ Join multiple streams randomly. If any stream runs out, then quits. If any stream yields yieldSentinel, then just ignores that result and continues. Could be useful in active learning. Example:
# could return [0, 1, 10, 2, 11, 12, 13, ...], with max length 20, typical length 18
[range(0, 10), range(10, 20)] | joinStreamsRandom() | deref()
stream2 = [[-5, yieldSentinel, -4, -3], yieldSentinel | repeat()] | joinStreams()
# could return [-5, -4, 0, -3, 1, 2, 3, 4, 5, 6], demonstrating yieldSentinel
[range(7), stream2] | joinStreamsRandom() | deref()
-
class
k1lib.cli.structural.
transpose
(dim1: int = 0, dim2: int = 1, fill=None)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(dim1: int = 0, dim2: int = 1, fill=None)[source]¶ Join multiple columns and loop through all rows. Aka transpose. Example:
# returns [[1, 4], [2, 5], [3, 6]]
[[1, 2, 3], [4, 5, 6]] | transpose() | deref()
# returns [[1, 4], [2, 5], [3, 6], [0, 7]]
[[1, 2, 3], [4, 5, 6, 7]] | transpose(fill=0) | deref()
Multidimensional transpose works just like torch.transpose() too:
# returns (2, 7, 5, 3), but detected Tensor, so it will use builtin torch.transpose
torch.randn(2, 3, 5, 7) | transpose(3, 1) | shape()
# also returns (2, 7, 5, 3), but actually does every required computation. Can be slow if shape is huge
torch.randn(2, 3, 5, 7) | deref(ignoreTensors=False) | transpose(3, 1) | shape()
Be careful with infinite streams, as transposing stream of shape (inf, 5) will hang this operation! Either don’t do it, or temporarily limit all infinite streams like this:
with settings.cli.context(inf=21):
    # returns (3, 21)
    [2, 1, 3] | repeat() | transpose() | shape()
Also be careful with empty streams, as you might not get any results at all:
# returns [], as the last stream has no elements
[[1, 2], [3, 4], []] | transpose() | deref()
# returns [[1, 3, 0], [2, 4, 0]]
[[1, 2], [3, 4], []] | transpose(fill=0) | deref()
- Parameters
fill – if not None, then will try to zip longest with this fill value
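For plain 2-D tables, the behaviour shown in the examples matches what zip and itertools.zip_longest do, which also explains the empty-stream result. A minimal sketch (not the library’s implementation, and without the multidimensional or tensor handling):

```python
from itertools import zip_longest

def transpose(table, fill=None):
    """Swap rows and columns; with fill, pad short rows instead of truncating."""
    rows = zip(*table) if fill is None else zip_longest(*table, fillvalue=fill)
    return [list(r) for r in rows]

plain = transpose([[1, 2, 3], [4, 5, 6]])
padded = transpose([[1, 2, 3], [4, 5, 6, 7]], fill=0)
empty = transpose([[1, 2], [3, 4], []])          # zip stops at the shortest row
rescued = transpose([[1, 2], [3, 4], []], fill=0)
```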
-
static
fill
(fill='', dim1: int = 0, dim2: int = 1)[source]¶ Convenience method to fill in missing elements of a table. Example:
# returns [[1, 2, 3], [4, 5, 0]]
[[1, 2, 3], [4, 5]] | transpose.fill(0) | deref()
# also returns [[1, 2, 3], [4, 5, 0]], demonstrating how it works underneath
[[1, 2, 3], [4, 5]] | transpose(fill=0) | transpose(fill=0) | deref()
-
static
wrap
(f, dim1: int = 0, dim2: int = 1, fill=None)[source]¶ Wraps f around 2 transposes, can be useful in combination with k1lib.cli.init.mtmS. Example:
# returns [[1, 4, 3, 4], [8, 81, 10, 11]]
[range(1, 5), range(8, 12)] | transpose.wrap(mtmS.f(apply(op()**2), 1)) | deref()
# also returns [[1, 4, 3, 4], [8, 81, 10, 11]], demonstrating the typical way to do this
[range(1, 5), range(8, 12)] | apply(op()**2, 1) | deref()
The example given is sort of to demonstrate this only. Most of the time, just use
apply
with columns instead. But sometimes you need direct access to a column, so this is how you can do it.
-
-
class
k1lib.cli.structural.
joinList
(element=None, begin=True)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(element=None, begin=True)[source]¶ Join element into list. Example:
# returns [5, 2, 6, 8]
[5, [2, 6, 8]] | joinList() | deref()
# also returns [5, 2, 6, 8]
[2, 6, 8] | joinList(5) | deref()
- Parameters
element – the element to insert. If None, then takes the input [e, […]], else takes the input […] as usual
-
-
class
k1lib.cli.structural.
splitList
(*weights: List[float])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(*weights: List[float])[source]¶ Splits list of elements into multiple lists. If no weights are provided, then automatically defaults to [0.8, 0.2]. Example:
# returns [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9]]
range(10) | splitList(0.8, 0.2) | deref()
# same as the above
range(10) | splitList() | deref()
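One straightforward way to turn weights into slices is to accumulate them and cut at the rounded boundaries. The sketch below reproduces the example above; the rounding rule and the normalisation by the weight total are assumptions, not documented behaviour:

```python
def split_list(items, *weights):
    """Split items into consecutive parts whose sizes follow the given weights."""
    items = list(items)
    weights = weights or (0.8, 0.2)   # same default as described above
    total = sum(weights)
    parts, start, running = [], 0, 0.0
    for w in weights:
        running += w
        end = round(len(items) * running / total)  # cumulative cut point
        parts.append(items[start:end])
        start = end
    return parts

explicit = split_list(range(10), 0.8, 0.2)
default = split_list(range(10))
```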
-
-
class
k1lib.cli.structural.
joinStreams
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Joins multiple streams. Example:
# returns [1, 2, 3, 4, 5] [[1, 2, 3], [4, 5]] | joinStreams() | deref()
-
class
k1lib.cli.structural.
activeSamples
(limit: int = 100, p: float = 0.95)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(limit: int = 100, p: float = 0.95)[source]¶ Yields active learning samples. Example:
o = activeSamples()
ds = range(10) # normal dataset
ds = [o, ds] | joinStreamsRandom() # dataset with active learning capability
next(ds) # returns 0
next(ds) # returns 1
next(ds) # returns 2
o.append(20)
next(ds) # can return 3 or 20
next(ds) # can return (4 or 20) or 4
So the point of this is to be a generator of samples. You can define your dataset as a mix of active learning samples and standard samples. Whenever there’s a data point that you want to focus on, you can add it to
o
and it will eventually yield it.
Warning
It might not be a good idea to set param limit higher than 100. This is because the network might still not understand a wrong sample after being shown it multiple times, and will keep adding that wrong sample back in, distracting it from other samples and reducing the network’s accuracy once active learning is removed.
If limit is low enough (from my testing, 30-100 should be fine), then old wrong samples will be kicked out, allowing a fresh stream of wrong samples to come in and preventing the problem above. If you find that removing active learning makes the accuracy drop dramatically, then try decreasing the limit.
- Parameters
limit – max number of active samples. Discards samples if number of samples is over this.
p – probability of actually adding the samples in
-
-
k1lib.cli.structural.
table
(delim: Optional[str] = None)[source]¶ Basically op().split(delim).all(). This exists because this is used quite a lot in bioinformatics. Example:
# returns [['a', 'bd'], ['1', '2', '3']]
["a|bd", "1|2|3"] | table("|") | deref()
-
class
k1lib.cli.structural.
batched
(bs=32, includeLast=False)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(bs=32, includeLast=False)[source]¶ Batches the input stream. Example:
# returns [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
range(11) | batched(3) | deref()
# returns [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]
range(11) | batched(3, True) | deref()
# returns [[0, 1, 2, 3, 4]]
range(5) | batched(float("inf"), True) | deref()
# returns []
range(5) | batched(float("inf"), False) | deref()
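All four results above follow from one rule: a batch is only emitted if it is full, unless includeLast is set. Here is a sketch of that rule with itertools.islice; it is an illustration and not the library’s implementation, and the parameter name include_last is just a Pythonic rename:

```python
from itertools import islice

def batched(it, bs=32, include_last=False):
    """Group a stream into lists of bs items; drop a short final batch unless asked."""
    it = iter(it)
    n = None if bs == float("inf") else bs   # islice needs None for "no limit"
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        if len(batch) == bs or include_last:
            yield batch

full_only = list(batched(range(11), 3))
with_last = list(batched(range(11), 3, True))
inf_last = list(batched(range(5), float("inf"), True))
inf_none = list(batched(range(5), float("inf"), False))
```

With an infinite batch size no batch can ever be “full”, which is why the last example returns nothing unless the partial batch is explicitly included.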
-
-
k1lib.cli.structural.
collate
()[source]¶ Puts individual columns into a tensor. Example:
# returns [tensor([ 0, 10, 20]), tensor([ 1, 11, 21]), tensor([ 2, 12, 22])]
[range(0, 3), range(10, 13), range(20, 23)] | collate() | toList()
-
k1lib.cli.structural.
insertRow
(*row: List[T])[source]¶ Inserts a row right before all other rows. See also:
joinList()
.
-
k1lib.cli.structural.
insertColumn
(*column, begin=True, fill='')[source]¶ Inserts a column at beginning or end. Example:
# returns [['a', 1, 2], ['b', 3, 4]]
[[1, 2], [3, 4]] | insertColumn("a", "b") | deref()
-
k1lib.cli.structural.
insertIdColumn
(table=False, begin=True, fill='')[source]¶ Inserts an id column at the beginning (or end). Example:
# returns [[0, 'a', 2], [1, 'b', 4]]
[["a", 2], ["b", 4]] | insertIdColumn(True) | deref()
# returns [[0, 'a'], [1, 'b']]
"ab" | insertIdColumn()
- Parameters
table – if False, then insert column to an Iterator[str], else treat input as a full fledged table
-
class
k1lib.cli.structural.
toDict
[source]¶ Bases:
k1lib.cli.init.BaseCli
-
class
k1lib.cli.structural.
toDictF
(keyF: Optional[Callable[[Any], str]] = None, valueF: Optional[Callable[[Any], Any]] = None)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(keyF: Optional[Callable[[Any], str]] = None, valueF: Optional[Callable[[Any], Any]] = None)[source]¶ Transform an incoming stream into a dict using a function for values. Example:
names = ["wanda", "vision", "loki", "mobius"]
names | toDictF(valueF=lambda s: len(s)) # will return {"wanda": 5, "vision": 6, ...}
names | toDictF(lambda s: s.title(), lambda s: len(s)) # will return {"Wanda": 5, "Vision": 6, ...}
-
-
class
k1lib.cli.structural.
expandE
(f: Callable[[T], List[T]], column: int)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
k1lib.cli.structural.
unsqueeze
(dim: int = 0)[source]¶ Unsqueeze input iterator. Example:
t = [[1, 2], [3, 4], [5, 6]]
# returns (3, 2)
t | shape()
# returns (1, 3, 2)
t | unsqueeze(0) | shape()
# returns (3, 1, 2)
t | unsqueeze(1) | shape()
# returns (3, 2, 1)
t | unsqueeze(2) | shape()
Behind the scenes, it’s really just wrapList().all(dim), but the “unsqueeze” name is a lot more familiar. Also note that the inverse operation “squeeze” is sort of item().all(dim), if you’re sure that this is desirable:
t = [[1, 2], [3, 4], [5, 6]]
# returns (3, 2)
t | unsqueeze(1) | item().all(1) | shape()
-
class
k1lib.cli.structural.
count
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Finds unique elements and returns a table with [frequency, value, percent] columns. Example:
# returns [[1, 'a', '33%'], [2, 'b', '67%']]
['a', 'b', 'b'] | count() | deref()
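A frequency table like this is easy to build with collections.Counter. The sketch below reproduces the example above; the row order (first appearance) and the rounding of the percent column are assumptions:

```python
from collections import Counter

def count(items):
    """Return rows of [frequency, value, percent] for each unique value."""
    counts = Counter(items)
    total = sum(counts.values())
    return [[c, value, f"{round(100 * c / total)}%"] for value, c in counts.items()]

table = count(["a", "b", "b"])
```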
-
class
k1lib.cli.structural.
permute
(*permutations: List[int])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(*permutations: List[int])[source]¶ Permutes the columns. Acts kinda like torch.Tensor.permute(). Example:
# returns [['b', 'a'], ['d', 'c']]
["ab", "cd"] | permute(1, 0) | deref()
-
-
class
k1lib.cli.structural.
accumulate
(columnIdx: int = 0, avg=False)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
class
k1lib.cli.structural.
AA_
(*idxs: List[int], wraps=False)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(*idxs: List[int], wraps=False)[source]¶ Returns 2 streams, one that has the selected element, and the other the rest. Example:
# returns [5, [1, 6, 3, 7]]
[1, 5, 6, 3, 7] | AA_(1)
# returns [[5, [1, 6, 3, 7]]]
[1, 5, 6, 3, 7] | AA_(1, wraps=True)
You can also put multiple indexes through:
# returns [[1, [5, 6]], [6, [1, 5]]]
[1, 5, 6] | AA_(0, 2)
If you don’t specify anything, then all indexes will be sliced:
# returns [[1, [5, 6]], [5, [1, 6]], [6, [1, 5]]]
[1, 5, 6] | AA_()
As for why the strange name, think of this operation as “AĀ”. In statistics, say you have a set “A”, then “not A” is commonly written as A with an overline “Ā”. So “AA_” represents “AĀ”, and that it first returns the selection A.
- Parameters
wraps – if True, then the first example will return [[5, [1, 6, 3, 7]]] instead, so that A has the same signature as Ā
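The “A and not-A” split is simple list slicing. Here is a sketch that matches the examples above, written as a plain function (the name aa_ is hypothetical, and it eagerly converts the input to a list, which the real tool may not do):

```python
def aa_(items, *idxs, wraps=False):
    """Return [selected, rest] for each index; a single index is unwrapped unless wraps=True."""
    items = list(items)
    single = len(idxs) == 1
    idxs = idxs or range(len(items))     # no indexes means "every index"
    pairs = [[items[i], items[:i] + items[i + 1:]] for i in idxs]
    return pairs[0] if single and not wraps else pairs

one = aa_([1, 5, 6, 3, 7], 1)
one_wrapped = aa_([1, 5, 6, 3, 7], 1, wraps=True)
two = aa_([1, 5, 6], 0, 2)
every = aa_([1, 5, 6])
```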
-
-
class
k1lib.cli.structural.
peek
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Returns (firstRow, iterator). This sort of peeks at the first row, to potentially gain some insights about the internal formats. The returned iterator is not tampered with. Example:
e, it = iter([[1, 2, 3], [1, 2]]) | peek()
print(e) # prints "[1, 2, 3]"
s = 0
for e in it: s += len(e)
print(s) # prints "5", or length of 2 lists
You kinda have to be careful about handling the firstRow, because you might inadvertently alter the iterator:
e, it = iter([iter(range(3)), range(4), range(2)]) | peek()
e = list(e) # e is [0, 1, 2]
list(next(it)) # supposed to be the same as `e`, but is [] instead
The example happens because you have already consumed all elements of the first row, and thus there aren’t any left when you try to call
next(it)
.
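Peeking without losing the first row is usually done by pulling one element and chaining it back in front. The sketch below shows that technique and reproduces the pitfall just described; it is an illustration, not necessarily how the library does it:

```python
from itertools import chain

def peek(it):
    """Return (first_row, iterator that still starts with first_row)."""
    it = iter(it)
    first = next(it)                  # raises StopIteration on an empty stream
    return first, chain([first], it)  # put the first row back in front

e, rest = peek(iter([[1, 2, 3], [1, 2]]))
total = sum(len(row) for row in rest)

# the pitfall: the first row is itself a one-shot iterator
row, rest2 = peek(iter([iter(range(3)), range(4)]))
consumed = list(row)        # uses up the inner iterator
again = list(next(rest2))   # same object, now empty
```

The chained iterator hands back the very same first-row object, so if that object is a one-shot iterator, consuming it once leaves nothing for the second read.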
-
class
k1lib.cli.structural.
peekF
(f: Union[k1lib.cli.init.BaseCli, Callable[[T], T]])[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(f: Union[k1lib.cli.init.BaseCli, Callable[[T], T]])[source]¶ Similar to peek, but will execute f(row) and return the input Iterator, which is not tampered with. Example:
it = lambda: iter([[1, 2, 3], [1, 2]])
# prints "[1, 2, 3]" and returns [[1, 2, 3], [1, 2]]
it() | peekF(lambda x: print(x)) | deref()
# prints "1\n2\n3"
it() | peekF(headOut()) | deref()
-
-
class
k1lib.cli.structural.
repeat
(limit: Optional[int] = None)[source]¶ Bases:
k1lib.cli.init.BaseCli
Yields a specified amount of the passed in object. If you intend to pass in an iterator, then make a list out of it first, as a second copy of the iterator probably won’t work, since you will have used it up the first time. Example:
# returns [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
[1, 2, 3] | repeat(3) | toList()
- Parameters
limit – if None, then repeats indefinitely
-
k1lib.cli.structural.
repeatF
(f, limit: Optional[int] = None)[source]¶ Yields a specified amount generated by a specified function. Example:
# returns [4, 4, 4]
repeatF(lambda: 4, 3) | toList()
# returns 10
repeatF(lambda: 4) | head() | shape(0)
- Parameters
limit – if None, then repeats indefinitely
See also:
repeatFrom
-
class
k1lib.cli.structural.
repeatFrom
(limit: Optional[int] = None)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(limit: Optional[int] = None)[source]¶ Yields from a list. If it runs out of elements, then it starts again from the beginning, going through the list limit times in total. Example:
# returns [1, 2, 3, 1, 2]
[1, 2, 3] | repeatFrom() | head(5) | deref()
# returns [1, 2, 3, 1, 2, 3]
[1, 2, 3] | repeatFrom(2) | deref()
- Parameters
limit – if None, then repeats indefinitely
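In standard-library terms this is itertools.cycle for the unlimited case and a chained repeat for the limited one. A small sketch that reproduces both examples above (the function name repeat_from is hypothetical):

```python
from itertools import chain, cycle, islice, repeat

def repeat_from(items, limit=None):
    """Loop over items forever, or exactly limit times if a limit is given."""
    items = list(items)   # keep a copy so the data can be replayed
    if limit is None:
        return cycle(items)
    return chain.from_iterable(repeat(items, limit))

endless_head = list(islice(repeat_from([1, 2, 3]), 5))
twice = list(repeat_from([1, 2, 3], 2))
```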
-
trace module¶
-
class
k1lib.cli.trace.
trace
(f=<k1lib.cli.utils.size object>, maxDepth=inf)[source]¶ Bases:
k1lib.cli.trace._trace
-
last
= None¶ Last instantiated trace object. Access this to view the previous (possibly nested) trace.
-
__init__
(f=<k1lib.cli.utils.size object>, maxDepth=inf)[source]¶ Traces out how the data stream is transformed through complex cli tools. Example:
# returns [1, 4, 9, 16], normal command
range(1, 5) | apply(lambda x: x**2) | deref()
# traced command, will display how the shapes evolve through cli tools
range(1, 5) | trace() | apply(lambda x: x**2) | deref()
There are a lot more instructions and code examples in the tutorial section. Go check it out!
-
utils module¶
This is for all short utilities that have the boilerplate feeling. Conversion clis might feel like they have different styles, as toFloat converts object iterator to float iterator, while toPIL converts single image url to single PIL image, whereas toSum converts float iterator into a single float value.
The general convention is, if the intended operation sounds simple (convert to floats, strings, types, …), then most likely it will convert iterator to iterator, as you can always use the function directly if you only want to apply it on 1 object.
If it sounds complicated (convert to PIL image, tensor, …) then most likely it will convert object to object. Lastly, there are some that just feel right to input an iterator and output a single object (like getting max, min, std, mean values).
-
class
k1lib.cli.utils.
size
(idx=None)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(idx=None)[source]¶ Returns number of rows and columns in the input. Example:
# returns (3, 2)
[[2, 3], [4, 5, 6], [3]] | size()
# returns 3
[[2, 3], [4, 5, 6], [3]] | size(0)
# returns 2
[[2, 3], [4, 5, 6], [3]] | size(1)
# returns (2, 0)
[[], [2, 3]] | size()
# returns (3,)
[2, 3, 5] | size()
# returns 3
[2, 3, 5] | size(0)
# returns (3, 2, 2)
[[[2, 1], [0, 6, 7]], 3, 5] | size()
# returns (1,) and not (1, 3)
["abc"] | size()
# returns (1, 2, 3)
[torch.randn(2, 3)] | size()
# returns (2, 3, 5)
size()(np.random.randn(2, 3, 5))
There’s also lengths, which is sort of a simplified/faster version of this, but only use it if you are sure that len(it) can be called.
If it encounters PyTorch tensors or Numpy arrays, then this will just get the shape instead of actually looping over them.
- Parameters
idx – if idx is None return (rows, columns). If 0 or 1, then rows or columns
-
-
k1lib.cli.utils.
shape
¶ alias of
k1lib.cli.utils.size
-
class
k1lib.cli.utils.
item
(amt: int = 1, fill=<object object>)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(amt: int = 1, fill=<object object>)[source]¶ Returns the first row. Example:
# returns 0
iter(range(5)) | item()
# returns torch.Size([5])
torch.randn(3,4,5) | item(2) | shape()
# returns 3
[] | item(fill=3)
- Parameters
amt – how many times do you want to call item() back to back?
fill – if iterator length is 0, return this
-
-
class
k1lib.cli.utils.
identity
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Yields whatever the input is. Useful for multiple streams. Example:
# returns range(5)
range(5) | identity()
-
k1lib.cli.utils.
iden
¶ alias of
k1lib.cli.utils.identity
-
class
k1lib.cli.utils.
toStr
(column: Optional[int] = None)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
class
k1lib.cli.utils.
join
(delim: Optional[str] = None)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(delim: Optional[str] = None)[source]¶ Merges all strings into 1, with delim in the middle. Basically str.join(). Example:
# returns '2\na'
[2, "a"] | join("\n")
-
-
class
k1lib.cli.utils.
toNumpy
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Converts generator to numpy array. Essentially
np.array(list(it))
-
class
k1lib.cli.utils.
toTensor
(dtype=torch.float32)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(dtype=torch.float32)[source]¶ Converts generator to torch.Tensor. Essentially torch.tensor(list(it)).
Also checks if input is a PIL Image. If yes, turn it into a torch.Tensor and return.
-
__ror__
(it: Iterator[float]) → torch.Tensor[source]¶
-
-
class
k1lib.cli.utils.
toList
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Converts generator to list.
list
would do the same, but this is just to maintain the style
-
class
k1lib.cli.utils.
wrapList
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Wraps inputs inside a list. There’s a more advanced cli tool built from this, which is
unsqueeze()
.
-
class
k1lib.cli.utils.
toSet
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Converts generator to set.
set
would do the same, but this is just to maintain the style
-
class
k1lib.cli.utils.
toIter
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Converts object to iterator. iter() would do the same, but this is just to maintain the style
-
class
k1lib.cli.utils.
toRange
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Returns iter(range(len(it))), effectively
-
class
k1lib.cli.utils.
toType
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Converts object to its type. Example:
# returns [int, float, str, torch.Tensor]
[2, 3.5, "ah", torch.randn(2, 3)] | toType() | deref()
-
class
k1lib.cli.utils.
equals
[source]¶ Bases:
object
Checks if all incoming columns/streams are identical
-
class
k1lib.cli.utils.
reverse
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Reverses incoming list. Example:
# returns [3, 5, 2]
[2, 5, 3] | reverse() | deref()
-
class
k1lib.cli.utils.
ignore
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Just loops through everything, ignoring the output. Example:
# will just return an iterator, and not print anything
[2, 3] | apply(lambda x: print(x))
# will print "2\n3"
[2, 3] | apply(lambda x: print(x)) | ignore()
-
class
k1lib.cli.utils.
toSum
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Calculates the sum of list of numbers. Can pipe in torch.Tensor. Example:
# returns 45
range(10) | toSum()
-
class
k1lib.cli.utils.
toAvg
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Calculates average of list of numbers. Can pipe in torch.Tensor. Example:
# returns 4.5
range(10) | toAvg()
# returns nan
[] | toAvg()
-
k1lib.cli.utils.
toMean
¶ alias of
k1lib.cli.utils.toAvg
-
class
k1lib.cli.utils.
toMax
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Calculates the max of a bunch of numbers. Can pipe in torch.Tensor. Example:
# returns 6
[2, 5, 6, 1, 2] | toMax()
-
class
k1lib.cli.utils.
toMin
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Calculates the min of a bunch of numbers. Can pipe in torch.Tensor. Example:
# returns 1
[2, 5, 6, 1, 2] | toMin()
-
class
k1lib.cli.utils.
toPIL
[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
()[source]¶ Converts a path to a PIL image. Example:
ls(".") | toPIL().all() | item() # get first image
-
__ror__
(path) → PIL.Image.Image[source]¶
-
-
class
k1lib.cli.utils.
toBin
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Converts integer to binary string. Example:
# returns "101"
5 | toBin()
-
class
k1lib.cli.utils.
toIdx
(chars: str)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
class
k1lib.cli.utils.
lengths
(fs=[])[source]¶ Bases:
k1lib.cli.init.BaseCli
Returns the lengths of each element. Example:
[range(5), range(10)] | lengths() == [5, 10]
This is a simpler (and faster!) version of shape. It assumes each element can be called with len(x), while shape iterates through every element to get the length, and thus is much slower.
-
k1lib.cli.utils.
headerIdx
()[source]¶ Cuts out first line, put an index column next to it, and prints it out. Useful when you want to know what your column’s index is to cut it out. Also sets the context variable “header”, in case you need it later. Example:
# returns [[0, 'a'], [1, 'b'], [2, 'c']]
["abc"] | headerIdx() | deref()
-
class
k1lib.cli.utils.
deref
(maxDepth=inf, ignoreTensors=True)[source]¶ Bases:
k1lib.cli.init.BaseCli
-
__init__
(maxDepth=inf, ignoreTensors=True)[source]¶ Recursively converts any iterator into a list. Only str, numbers.Number and Module are not converted. Example:
# returns something like "<range_iterator at 0x7fa8c52ca870>"
iter(range(5))
# returns [0, 1, 2, 3, 4]
iter(range(5)) | deref()
# returns [2, 3], yieldSentinel stops things early
[2, 3, yieldSentinel, 6] | deref()
You can also specify a maxDepth:
# returns something like "<list_iterator at 0x7f810cf0fdc0>"
iter([range(3)]) | deref(0)
# returns [range(3)]
iter([range(3)]) | deref(1)
# returns [[0, 1, 2]]
iter([range(3)]) | deref(2)
There are a few classes/types that are considered atomic, and deref will never try to iterate over them. If you wish to change it, do something like:
settings.cli.atomic.deref = (int, float, ...)
Warning
Can work well with PyTorch Tensors, but not Numpy arrays as they screw things up with the __ror__ operator, so do torch.from_numpy(…) first. Don’t worry about unnecessary copying, as numpy and torch both utilize the buffer protocol.
- Parameters
maxDepth – maximum depth to dereference. Starts at 0 for not doing anything at all
ignoreTensors – if True, then don’t loop over
torch.Tensor
internals
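The recursion with a depth limit and a set of atomic types can be sketched in a few lines. This version is only an illustration: it leaves out yieldSentinel and tensor handling, and treating bytes as atomic is an assumption made here, not something the documentation states:

```python
import numbers

ATOMIC = (str, bytes, numbers.Number)   # never iterated over in this sketch

def deref(obj, max_depth=float("inf"), _depth=0):
    """Recursively turn iterables into lists, down to max_depth levels."""
    if _depth >= max_depth or isinstance(obj, ATOMIC):
        return obj
    try:
        it = iter(obj)
    except TypeError:       # not iterable at all, leave it alone
        return obj
    return [deref(x, max_depth, _depth + 1) for x in it]

flat = deref(iter(range(5)))
depth1 = deref(iter([range(3)]), 1)
depth2 = deref(iter([range(3)]), 2)
untouched_src = iter([1])
untouched = deref(untouched_src, 0)
```

Without the atomic check, strings would recurse forever, since iterating a one-character string yields that same string again.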
-
__invert__
() → k1lib.cli.init.BaseCli[source]¶ Returns a
BaseCli
that makes everything an iterator. Not entirely sure when this comes in handy, but it’s there.
-
-
class
k1lib.cli.utils.
bindec
(cats: List[Any], f=None)[source]¶ Bases:
k1lib.cli.init.BaseCli
others module¶
This is for pretty random clis that are scattered everywhere.
Elsewhere in the library¶
There might still be more cli tools scattered around the library. These are pretty rare, quite dynamic and most likely a cool extra feature, not a core functionality, so not worth it/can’t mention it here. Anyway, execute this:
cli.scatteredClis()
to get a list of them.