k1lib.bioinfo.cli module

This tutorial covers the basics of the k1lib.bioinfo.cli module (docs at https://k1lib.github.io/latest/bioinfo/cli.html). As a quick reminder, this module lets you use common Linux cli tools from inside Python. The idea for this module came to me while I was reading the Biostar Handbook. It uses a lot of cli tools, but all of them are sort of weird, unintuitive, not powerful, and just painful to work with. That's why I made this module: to move everything into regular Python.

We're going to go over the multilanguage names dataset from a PyTorch RNN tutorial. The data folder is at cli_name_languages btw. My advice is to read this alongside the docs page, and to look at the source of any function you're interested in.

So, we have 18 files in total. Let's look over a few of them:

You can also pipe the file name in btw, like this:

Let's convert all unicode chars to regular ascii (taken from the PyTorch doc):
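For reference, here is a minimal plain-Python sketch of that conversion, adapted from the `unicodeToAscii` helper in the PyTorch name-classification tutorial. It uses only the standard library and is not the k1lib pipeline itself; the snake_case names are my own choice.

```python
import string
import unicodedata

# Characters we keep after normalization (same set the PyTorch tutorial uses).
ALL_LETTERS = string.ascii_letters + " .,;'"

def unicode_to_ascii(s: str) -> str:
    """Decompose accented characters (NFD), drop the combining marks,
    and keep only characters from ALL_LETTERS."""
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn" and c in ALL_LETTERS
    )

print(unicode_to_ascii("Ślusàrski"))  # Slusarski
```

Note that anything outside `ALL_LETTERS` that has no ASCII decomposition is simply dropped, so this is lossy by design.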

How many names in total across files?
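As a point of comparison, the same count can be sketched in plain Python with the standard library only. This is not the k1lib version; it assumes every file in the folder has a `.txt` extension and holds one name per line, and the function names are hypothetical.

```python
from pathlib import Path

def read_names(path):
    """Return the lines of one file, one name per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()

def count_all_names(folder):
    """Sum the line counts of every .txt file directly inside `folder`."""
    return sum(len(read_names(p)) for p in sorted(Path(folder).glob("*.txt")))

# Usage, assuming the data folder from this tutorial sits next to the script:
# count_all_names("cli_name_languages")
```

Empty lines are counted here too, since no cleaning has happened yet.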

How many names with weird unicode characters?
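One simple way to define "weird" is "contains any non-ASCII character". Here is a small standard-library sketch under that assumption (not the k1lib pipeline); the sample names are made up for illustration and `str.isascii()` needs Python 3.7 or later.

```python
def has_non_ascii(name: str) -> bool:
    """True if the name contains at least one non-ASCII character."""
    return not name.isascii()

def count_non_ascii(names):
    """Count how many names contain a non-ASCII character."""
    return sum(1 for name in names if has_non_ascii(name))

# Illustrative sample only, not taken from the dataset.
sample = ["Ślusàrski", "Smith", "Béringer", "Kim"]
print(count_non_ascii(sample))  # 2
```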

See https://k1lib.github.io/latest/bioinfo/streams for more info about how stuff like cats() and joinStreams() work. Also, partial is a pretty awesome function, I might add; look it up in the Python functools docs. There are lots of empty names here, so let's get rid of them:

Here, we're just stripping whitespace at both ends of each name (strip()) and filtering out the empty ones (~isValue("")). The tilde ~ sign, which can go in front of any filter function, inverts the filter's condition. How many duplicate names are there in a file?
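The strip-then-filter step and the duplicate count can be sketched in plain Python like this. It is a standard-library illustration of the logic, not the k1lib calls; the helper names and the sample list are hypothetical.

```python
def clean(names):
    """Strip whitespace from both ends of each name, then drop empty ones."""
    stripped = (name.strip() for name in names)
    return [name for name in stripped if name != ""]

def count_duplicates(names):
    """Number of entries that repeat a name already seen in the list."""
    return len(names) - len(set(names))

# Illustrative sample only, not taken from the dataset.
raw = ["Kim ", "", " Lee", "Kim", "  ", "Park"]
names = clean(raw)
print(names)                    # ['Kim', 'Lee', 'Kim', 'Park']
print(count_duplicates(names))  # 1
```

Stripping has to come before the filter, otherwise whitespace-only lines like `"  "` would survive as non-empty names.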

Okay yeah, there's a lot. Let's see how many unique names in each file also appear in other files:

Let's see which Korean names actually appear in other files:

The cat() | toList() | repeat() branch essentially creates Iterator[File], and each File is actually just Iterator[str]. The result of cats() is also Iterator[File]. We want to place the elements of these 2 lists side by side on each row, so we can actually operate on them. joinColumns() will output Iterator[(File, File)]. The first file is the Korean one, the second file is every other file in turn. intersection() finds the common names between the 2 files, and insertColumn() is just there for some nice formatting.
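To make the pairing idea concrete, here is a plain-Python sketch of the same logic: pair one target file with every other file and intersect the names. It uses sets from the standard library rather than the k1lib functions; `shared_with_others` and the tiny `files` dict are hypothetical stand-ins for the real data.

```python
def shared_with_others(target, files):
    """Pair the target file with every other file and return
    (other file's label, sorted names common to both) for each pair."""
    target_names = set(files[target])
    rows = []
    for language, names in files.items():
        if language == target:
            continue  # don't compare the file with itself
        rows.append((language, sorted(target_names & set(names))))
    return rows

# Illustrative sample only, not taken from the dataset.
files = {
    "Korean": ["Kim", "Lee", "Park"],
    "English": ["Lee", "Smith"],
    "Chinese": ["Lee", "Kim", "Wang"],
}
for language, common in shared_with_others("Korean", files):
    print(language, common)
# English ['Lee']
# Chinese ['Kim', 'Lee']
```

The label in each row plays the role the inserted column plays in the pipeline: it only tells you which file the intersection came from.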

How about we do this for every file and record how many of its names also appear in other files:
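Written out with explicit loops in plain Python, that computation might look like the sketch below. It is standard library only, not the k1lib one-liner; `overlap_counts` and the sample `files` dict are hypothetical.

```python
def overlap_counts(files):
    """For each file, count its unique names that also appear
    in at least one other file."""
    counts = {}
    for language, names in files.items():
        # Collect every name from all the *other* files.
        others = set()
        for other_language, other_names in files.items():
            if other_language != language:
                others.update(other_names)
        counts[language] = len(set(names) & others)
    return counts

# Illustrative sample only, not taken from the dataset.
files = {
    "Korean": ["Kim", "Lee", "Park"],
    "English": ["Lee", "Smith"],
    "Chinese": ["Lee", "Kim", "Wang"],
}
print(overlap_counts(files))  # {'Korean': 2, 'English': 1, 'Chinese': 2}
```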

Nice. Anyway, hope you are as thrilled as I am about this. Really complicated loops and whatnot can be explored quite quickly without actually writing any loops, and that helps with bringing down iteration time.