k1lib.cli module¶

This tutorial covers the basics of the k1lib.cli module (docs at https://k1lib.github.io/latest/cli.html). As a quick reminder, this module lets you use common Linux command-line tools from inside Python. The idea came to me while I was reading the Biostar Handbook. It uses a lot of cli tools, but all of them are sort of weird, unintuitive, not powerful, and just painful to work with. That's why I made this module: to move everything to regular Python.

We're going to go over the multilanguage names dataset from a PyTorch RNN tutorial. The data folder is at cli_name_languages. My advice is to read this alongside the docs page, and to look at the source of any function you're interested in.

In [1]:
from k1lib.imports import *
import unicodedata, string
In [2]:
namesFolder = "cli_name_languages/names"
nameFiles = glob.glob(f"{namesFolder}/*.txt")
withBareNames = insertColumn(nameFiles | op().split("/")[-1].all() | op().split(".")[0].all() | deref()) | display(None)
nameFiles[:3], len(nameFiles)
Out[2]:
(['cli_name_languages/names/Korean.txt',
  'cli_name_languages/names/Spanish.txt',
  'cli_name_languages/names/Greek.txt'],
 18)

So, we have 18 files in total. Let's look at the start of the first one:

In [3]:
cat(nameFiles[0]) | headOut(3)
Ahn
Baik
Bang

You can also pipe the file name in btw, like this:

In [4]:
nameFiles[0] | cat() | headOut(3)
Ahn
Baik
Bang
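
The piping form works because Python lets the right-hand operand of `|` handle the operation through `__ror__` when the left-hand one doesn't know how. Here's a minimal sketch of that mechanism. It's not k1lib's actual implementation, and `Op`, `Chain`, `Upper` and `Head` are hypothetical names made up for illustration:

```python
from itertools import islice

class Op:
    """Hypothetical base class: `data | op` ends up calling op.__ror__(data)."""
    def __ror__(self, data):
        return self.run(data)
    def __or__(self, other):
        # op1 | op2 builds a new op that runs op1 first, then op2
        return Chain(self, other)

class Chain(Op):
    def __init__(self, first, second):
        self.first, self.second = first, second
    def run(self, data):
        return self.second.run(self.first.run(data))

class Upper(Op):
    def run(self, data):
        return (s.upper() for s in data)  # lazy, like a stream

class Head(Op):
    def __init__(self, n):
        self.n = n
    def run(self, data):
        return list(islice(data, self.n))

names = ["Ahn", "Baik", "Bang", "Byon"]
firstThree = names | Upper() | Head(3)
print(firstThree)  # ['AHN', 'BAIK', 'BANG']

# Ops can also be combined first and applied later
pipeline = Upper() | Head(2)
print(names | pipeline)  # ['AHN', 'BAIK']
```

A plain list has no `|` operator of its own, so Python falls back to the `__ror__` of the object on the right, and that's all the trick needs.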

Let's define a helper that converts unicode chars to regular ascii (taken from the PyTorch doc):

In [5]:
letters = string.ascii_letters + ".,;'"
def unicodeToAscii(s, notIn=False):
    if notIn: # debug case
        return "".join(c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn" and c not in letters)
    else: # "right" case
        return "".join(c for c in unicodedata.normalize("NFD", s) if unicodedata.category(c) != "Mn" and c in letters)
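
To see what the two modes return, here's a compact restatement of the same function applied to a few strings. The sample strings are my own picks for illustration and aren't necessarily in the dataset:

```python
import string
import unicodedata

letters = string.ascii_letters + ".,;'"

def unicodeToAscii(s, notIn=False):
    # Decompose accented characters, drop the combining marks ("Mn"),
    # then keep characters inside `letters` (or outside, with notIn=True).
    decomposed = (c for c in unicodedata.normalize("NFD", s)
                  if unicodedata.category(c) != "Mn")
    return "".join(c for c in decomposed if (c not in letters) == notIn)

print(unicodeToAscii("Ślusàrski"))           # Slusarski
print(unicodeToAscii("Straße"))              # Strae
print(repr(unicodeToAscii("Straße", True)))  # 'ß'
print(repr(unicodeToAscii("Ahn", True)))     # ''
```

So in the debug case, a name made only of allowed characters collapses to an empty string, and anything else leaves behind just its leftover characters.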

How many names in total across files?

In [6]:
nameFiles | cat().all() | joinStreams() | shape(0)
Out[6]:
20074

How many names have weird characters left over, meaning characters that are still outside `letters` after the accents are stripped?

In [7]:
def unicodes(): return nameFiles | cat().all() | joinStreams() | apply(partial(unicodeToAscii, notIn=True))
unicodes() | count() | display(None)
19962          99%   
47             0%    
3              0%    
21      -      0%    
2       --     0%    
1              0%    
23             0%    
1       /      0%    
3       1      0%    
9       ß      0%    
1       ł      0%    
1       :      0%    
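
For readers who don't know these pieces yet, here are rough plain-Python analogues. This is a sketch that assumes joinStreams() flattens one level of nesting and count() tallies occurrences, as the outputs suggest. The in-memory lists stand in for files, and `tag` is a hypothetical helper:

```python
from collections import Counter
from functools import partial
from itertools import chain

# Stand-ins for the contents of two name files (made-up data)
files = [["Ahn", "Baik", "Bang"], ["Abana", "Abano", "Ahn"]]

def tag(name, prefix="?"):
    return f"{prefix}:{name}"

# Like joinStreams(): flatten a stream of streams into one stream
allNames = list(chain.from_iterable(files))
print(len(allNames))  # 6

# partial(): pre-fill the `prefix` keyword, leaving a one-argument function
tagX = partial(tag, prefix="x")
print(tagX("Ahn"))  # x:Ahn

# Like count(): tally how often each value occurs
print(Counter(allNames).most_common(1))  # [('Ahn', 2)]
```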

See https://k1lib.github.io/latest/cli/streams for more info about how stuff like cat() and joinStreams() works. Also, partial is a pretty awesome function, I might add; look it up in the Python functools docs. Most of the rows here are empty or whitespace-only strings (names with nothing unusual left over), so let's get rid of them:

In [8]:
unicodes() | op().strip().all() | filt(op() != "") | count() | display()
21   -    55%   
2    --   5%    
1    /    3%    
3    1    8%    
9    ß    24%   
1    ł    3%    
1    :    3%    

Here, we're just stripping whitespace from both ends of each string (strip()) and filtering out the empty ones (filt(op() != "")). How many duplicate names are there in a file?

In [9]:
nameFiles | cat().all() | (count() | filt(op() != "1", 0) | shape(0)).all() | unsqueeze(1) | withBareNames
Korean       94     
Spanish      296    
Greek        193    
Irish        226    
Scottish     100    
Portuguese   74     
Russian      9342   
Czech        503    
French       273    
German       706    
Japanese     990    
Polish       138    
Arabic       108    
English      3668   
Chinese      246    
Dutch        286    
Italian      701    
Vietnamese   71     
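
For comparison, here's a plain-Python sketch that picks out the names occurring more than once in a single list, using collections.Counter. The sample list is made up, and `duplicatedNames` is a hypothetical helper, not part of k1lib:

```python
from collections import Counter

def duplicatedNames(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    # Compare against the integer 1: Counter values are ints, not strings
    return [name for name, n in counts.items() if n != 1]

sample = ["Kim", "Lee", "Kim", "Park", "Lee", "Kim"]
print(duplicatedNames(sample))  # ['Kim', 'Lee']
print(len(set(sample)))         # 3 (distinct names)
```

Note the difference between the two numbers: 2 names are duplicated, while 3 names are distinct.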

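Okay yeah, there's a lot. One caveat though: the figures from In [9] sum to 18015, which is exactly the first number in Out[10] below, so they look like the number of distinct names per file rather than the number of duplicated ones. Presumably the counts are integers, so comparing them against the string "1" never filters anything out. Let's see how many of each file's unique names also appear in other files: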

In [10]:
nameFiles | cat().all() | toSet().all() | joinStreams() | (iden() & toSet()) | shape(0).all() | deref()
Out[10]:
[18015, 17458]
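
One way to read these two numbers: the first is the sum of each file's distinct-name count, and the second is the number of distinct names overall. Their difference, 18015 - 17458 = 557, is how many per-file entries repeat a name that some other file also has. A plain-Python sketch of the same computation on made-up data:

```python
from itertools import chain

# Stand-ins for three name files (made-up data)
files = [["Kim", "Lee", "Kim"], ["Lee", "Wang"], ["Wang", "Lee", "Chu"]]

perFileSets = [set(f) for f in files]             # like toSet().all()
joined = list(chain.from_iterable(perFileSets))   # like joinStreams()
sizes = [len(joined), len(set(joined))]           # like (iden() & toSet()) | shape(0).all()

print(sizes)                # [7, 4]
print(sizes[0] - sizes[1])  # 3
```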

Let's see which Korean names actually appear in other files:

In [11]:
nameFiles | AA_(0) | ((cat() | toList() | repeat()) + cat().all()) | transpose() | intersection().all()\
| insertColumn(list(nameFiles | op().split("/")[-1].all() | op().split(".")[0].all())[1:]) | display(None)
Spanish                                                                                                                                   
Greek                                                                                                                                     
Irish                                                                                                                                     
Scottish                                                                                                                                  
Portuguese                                                                                                                                
Russian      Li     Han                                                                                                                   
Czech                                                                                                                                     
French                                                                                                                                    
German       Wang                                                                                                                         
Japanese     Ko     Seo    Jo                                                                                                             
Polish                                                                                                                                    
Arabic                                                                                                                                    
English      Lee    Moon   Chong   Wang   Chung   Yang                                                                                    
Chinese      Hong   Koo    Chu     Yim    Kang    Han    Chong   Chou   Chin   Sun   Wang   Song   You   Woo   Chang   Yang   Chi   Yun   
Dutch                                                                                                                                     
Italian                                                                                                                                   
Vietnamese   Chu    Ha     Han     Kim    Chung   Ho     Ma                                                                               

AA_(0) hands us element 0 (the Korean file) together with a list of all the rest. The (cat() | toList() | repeat()) branch essentially creates an Iterator[File], where each File is really just an Iterator[str]. The result of cat().all() is also an Iterator[File]. We want to place the elements of these 2 streams side by side on each row so we can actually operate on them, and transpose() does that, giving Iterator[(File, File)]. The first file is the Korean one, the second is each of the other files in turn. intersection() finds the names common to the 2 files, and insertColumn() is just there for nice formatting.
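
The repeat-then-pair idea can be sketched in plain Python with itertools.repeat and zip. This assumes intersection() behaves like a set intersection, and the name lists are made up:

```python
from itertools import repeat

korean = ["Kim", "Lee", "Han", "Wang"]
others = [["Wang", "Han", "Chu"], ["Lee", "Smith"]]

# repeat(korean) plays the role of the `cat() | toList() | repeat()` branch:
# an endless stream of the same file. zip() then pairs it row by row with
# the other files and stops when `others` runs out.
pairs = zip(repeat(korean), others)

# One intersection per (Korean, other) pair
common = [sorted(set(a) & set(b)) for a, b in pairs]
print(common)  # [['Han', 'Wang'], ['Lee']]
```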

How about we do this for every file and record how many of its names also appear in other files:

In [12]:
analyze2Files = intersection() | shape(0) # takes 2 files, and squish them into 1 value
analyze1Combo = ((cat() | toList() | repeat()) + cat().all()) | transpose() | analyze2Files.all() | toSum() # summing all common values
nameFiles | AA_() | analyze1Combo.all() | unsqueeze(1) | withBareNames
Korean       37    
Spanish      104   
Greek        1     
Irish        78    
Scottish     115   
Portuguese   57    
Russian      74    
Czech        41    
French       102   
German       148   
Japanese     9     
Polish       24    
Arabic       5     
English      381   
Chinese      52    
Dutch        58    
Italian      54    
Vietnamese   20    

Nice. Anyway, I hope you're as thrilled as I am about this. Really complicated loops and whatnot can be explored quite quickly without actually writing any loops, which helps bring down iteration time.
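
For contrast, here's roughly what the explicit version of the every-file analysis might look like in plain Python. It's a sketch on made-up data, `sharedCount` is a hypothetical helper, and it assumes intersection() works like a set intersection:

```python
# Stand-ins for the name files, keyed by language (made-up data)
files = {
    "Korean": ["Kim", "Lee", "Han", "Wang"],
    "Chinese": ["Wang", "Han", "Chu"],
    "English": ["Lee", "Smith"],
}

def sharedCount(files, target):
    """Sum, over every other file, of the names it shares with `target`."""
    base = set(files[target])
    return sum(len(base & set(names))
               for lang, names in files.items() if lang != target)

totals = {lang: sharedCount(files, lang) for lang in files}
print(totals)  # {'Korean': 3, 'Chinese': 2, 'English': 1}
```

A name shared with two other files is counted twice here, once per file, since each pairwise intersection is summed separately.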

Speed analysis¶

While developing this module, I thought I'd have to drop down to C for it to be fast enough to process anything at all. However, time and time again, Python has turned out to be good enough for most things. A basic Python operation takes around 1.5 orders of magnitude longer than 1ns, so roughly 30ns. That means the flop rate should be around 7-7.5 orders of magnitude (10^7 to 10^7.5 ops/s), while we should expect 8-8.5 out of C code. Let's see:

In [13]:
%%time
range(400) | repeatFrom() | apply(lambda x: x+2) | batched(1000) | toSum().all() | ~head(10000) | headOut()
181500
221500
181500
221500
181500
221500
181500
221500
181500
221500
CPU times: user 1.37 s, sys: 4 µs, total: 1.37 s
Wall time: 1.36 s
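
The In [13] computation can be reproduced in plain Python with itertools, which is a handy way to check the printed sums. This is only a sketch of the same arithmetic, not how k1lib does it internally, and `batchSums` is a hypothetical helper:

```python
from itertools import cycle, islice

def batchSums(n):
    """Sums of the first n batches of 1000 values of x + 2, with x cycling over range(400)."""
    stream = (x + 2 for x in cycle(range(400)))
    # Each islice() call consumes the next 1000 values from the shared stream
    return [sum(islice(stream, 1000)) for _ in range(n)]

print(batchSums(4))  # [181500, 221500, 181500, 221500]
```

The two sums alternate because 1000 isn't a multiple of 400: every batch covers two and a half cycles, so odd and even batches start at different points in the cycle.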

The pipeline just takes an infinite stream of numbers, adds 2 to each, batches every 1000 numbers, sums over each batch, and does that 10000 times. So 10M ops in around 1.36 seconds, which is right around the 6.8-6.9 (nice haha) orders of magnitude flop rate. Is this good enough though?

Lots of people report that the builtin csv module can parse around 50MB/s. Let's say there are 10 columns with 10 characters each, which comes to 100 bytes/row. So the throughput should be around 500k rows/s, well below what we have here. Even if you assume operating on a table is costly and count every element, giving 5M table elements/s, that's still lower than what the cli tools can achieve. So no need to worry about this.
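
The back-of-the-envelope numbers can be checked in a few lines. The 50MB/s figure and the 10x10 table shape are the assumptions stated above, and the measured rate uses the 1.36 s wall time from In [13]:

```python
from math import log10

bytesPerSecond = 50 * 10**6        # ~50MB/s reported for the csv module
columns, charsPerColumn = 10, 10   # assumed table shape
bytesPerRow = columns * charsPerColumn

rowsPerSecond = bytesPerSecond // bytesPerRow
elementsPerSecond = rowsPerSecond * columns
measured = 10_000_000 / 1.36       # elements/s from the In [13] timing

print(rowsPerSecond)               # 500000
print(elementsPerSecond)           # 5000000
print(round(log10(measured), 2))   # 6.87
print(measured > elementsPerSecond)  # True
```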

For loops with yields vs .all()¶

For a long time, I was worried about the performance of the .all() operation. But it turns out applyS(f).all() has roughly the same performance as apply(f), so don't worry about it:

In [14]:
%%time
range(int(1e6)) | applyS(lambda x: x / 2).all() | ignore()
CPU times: user 75.6 ms, sys: 0 ns, total: 75.6 ms
Wall time: 74.7 ms
In [15]:
%%time
range(int(1e6)) | apply(lambda x: x / 2) | ignore()
CPU times: user 73.7 ms, sys: 0 ns, total: 73.7 ms
Wall time: 73 ms
In [16]:
%%time
range(int(1e6)) | apply(applyS(lambda x: x / 2)) | ignore()
CPU times: user 75.2 ms, sys: 45 µs, total: 75.2 ms
Wall time: 74.2 ms