The other file is for metadata; this one is for image properties, mainly to demonstrate the multiprocessing capabilities.
from k1lib.imports import *
import pydicom
["abc", "ab"] | applyMp(lambda x: len(x)) | toList()
[3, 2]
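Under the hood, applyMp fans the elements out to a pool of worker processes. Conceptually it's in the same spirit as the plain-multiprocessing sketch below (just an illustration of the idea, not k1lib's actual implementation; applyMpSketch and length are made-up names):
from multiprocessing import Pool

def length(x):
    # plain multiprocessing.Pool can't pickle lambdas, hence a named function here
    return len(x)

def applyMpSketch(items, f, workers=4):
    """Rough stand-in for applyMp: map f over items across worker processes."""
    with Pool(workers) as pool:
        return pool.map(f, items)

applyMpSketch(["abc", "ab"], length)  # [3, 2]
Not having to do the named-function dance (and getting to keep the pipe syntax) is most of the appeal of applyMp here.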
#base = "/home/kelvin/hdd/data/osic-pulmonary-fibrosis-progression"
base = "/home/kelvin/repos/labs/data/osic-pulmonary-fibrosis-progression"
baseT = f"{base}/train"
file = ls(base) | item() | ls() | item() | ls() | item()
f = pydicom.read_file(file)
img = torch.tensor(f.pixel_array.astype(np.float32))
img.shape
torch.Size([512, 512])
Are all images 512×512 pixels?
%%time
h, w = ls(baseT)\
| applyMp(ls() | apply(lambda dcmF: pydicom.read_file(dcmF).pixel_array.shape) | toList())\
| joinStreams() | transpose() | deref()
Corrupt JPEG data: bad Huffman code
Unsupported marker type 0xfa
CPU times: user 1.45 s, sys: 136 ms, total: 1.59 s
Wall time: 4.82 s
w | count() | ~sort() | display(None)
24050   512  73%
 8078   768  24%
  536   888   2%
  338   632   1%
   24  1302   0%
h | count() | ~sort() | display(None)
24050   512  73%
 8078   768  24%
  338   632   1%
  319   733   1%
   71   734   0%
   57   752   0%
   31   843   0%
   30   788   0%
   28  1100   0%
   24  1302   0%
Oh god, why can't the images even be square??? Like, why do you want me to suffer??
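If a uniform size is ever needed downstream, one option (just a sketch, not something used later in this notebook) is to interpolate every slice to 512×512:
import torch.nn.functional as F

def to512(img: torch.Tensor) -> torch.Tensor:
    # resize a (H, W) slice to 512x512 with bilinear interpolation
    return F.interpolate(img[None, None], size=(512, 512), mode="bilinear", align_corners=False)[0, 0]

to512(img).shape  # torch.Size([512, 512])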
How about the pixel value range? First, let's grab the mins and maxes.
def patientMinMax(patientUrl):
    # imports live inside the function so it's self-contained when shipped to worker processes
    import pydicom
    from k1lib.cli import ls, apply, deref
    def f(dcmF):
        img = pydicom.read_file(dcmF).pixel_array
        return img.min(), img.max()
    return patientUrl | ls() | apply(f) | deref()
%%time
minMax = ls(base) | item() | ls() | applyMp(patientMinMax)
min_, max_ = minMax | joinStreams() | transpose() | (toMin() + toMax()); min_, max_
Corrupt JPEG data: bad Huffman code
Unsupported marker type 0xfa
CPU times: user 282 ms, sys: 111 ms, total: 393 ms
Wall time: 6.3 s
(-31860, 32747)
Dear god, why is the range so fking big? Let's check the histograms then.
def patientHist(patientUrl):
    dcms = patientUrl | ls()
    histT = torch.zeros(300)
    # sample 30 random slices per patient, downsampling each by 10x in both dimensions
    for dcmF in dcms | randomize(None) | head(30):
        img = torch.from_numpy(pydicom.read_file(dcmF).pixel_array[::10,::10].astype(np.float32))
        histT += torch.histc(img, 300, min_, max_)
    return histT
%%time
hists = ls(baseT) | applyMp(patientHist, None, 20) | deref()
CPU times: user 87.8 ms, sys: 94.4 ms, total: 182 ms
Wall time: 5.44 s
histT = torch.zeros(300)
for h in hists: histT += h
r = torch.linspace(min_, max_, 300)
plt.plot(r, histT); plt.yscale("log");
b = histT > 1e5; plt.plot(r[b], histT[b]); r[b][::3], histT[b][::3]
(tensor([-3121.7695,  -961.0006,  -312.7697,   335.4611,   983.6928]),
 tensor([ 576358., 1537295.,  192812.,  587396., 2890611.]))
Very interesting. The bulk of the pixel values sits in the $[-4300, 3500]$ region, with per-bin frequencies of roughly $10^5$ up to a few $10^6$. Why is there a plateau around the 40-100 region? Just background noise, then?
This seems to agree with the aggregated histogram above:
plt.hist(f.pixel_array.flatten(), bins=30);
And for a slice:
plt.imshow(f.pixel_array);
The purple part between the circle and the square is supposedly all around -2048 (though the raw, unsigned array says otherwise):
all(f.pixel_array[:20, :20].flatten() == -2048)
False
And the "air" part above is sort of around -1000 to -1200
f.pixel_array[20:80, 200:300]
array([[  12,   66,    4, ...,   22,  177,    0],
       [   0,  211,  235, ...,    0,  137,  112],
       [   4,   46,  179, ...,    0,    0,  192],
       ...,
       [ 713,  948, 1065, ...,   84,   71,  233],
       [ 949,  987, 1014, ...,  745,  910, 1155],
       [ 730,  899,  920, ..., 1130, 1214, 1272]], dtype=uint16)
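To make that a bit more quantitative, here's a quick look at that crop's statistics; note these are the raw stored values straight out of pixel_array, with no rescaling applied:
region = f.pixel_array[20:80, 200:300].astype(np.float32)
region.min(), region.mean(), region.max()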
Here, we're looking over 3,000 different images (30 random batches of 100) from all over. I'm really hesitant to use all 32k images, as that's quite slow, and a few thousand is just as representative.
%%time
#crissCross = applyS(lambda tensors: torch.crissCross(*tensors))
bounds = base | ls() | item() | ls() | ls().all() | joinStreams() | randomize(None)\
| batched(100) | head(30) | (
applyMp(lambda dcmF: torch.tensor(pydicom.read_file(dcmF).pixel_array.flatten()[::128].astype(np.float32)))\
| crissCross()
).all() | crissCross()
CPU times: user 39.2 s, sys: 5.54 s, total: 44.8 s
Wall time: 5.84 s
def gen(img=None):
    if img is None:
        # grab a random slice from a random patient
        f = pydicom.read_file(ls(base) | item() | ls() | ls().all() | joinStreams() | randomize() | item())
        img = torch.tensor(f.pixel_array.astype(np.float32))
    fig, (ax1, ax2) = plt.subplots(1, 2, dpi=150)
    ax1.imshow(img); ax2.imshow(img.histScaled(bounds=bounds));  # raw vs histogram-scaled
gen()
The transformed image looks a million times better!
gen()
Like woah. Rescaled images look so, so much better. Details are so crisp!
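For intuition, this kind of rescaling is in the spirit of histogram equalization: push each pixel value through the empirical CDF of the pixel distribution, so the output spreads over the display range much more evenly. A minimal numpy sketch of that general idea (not k1lib's actual histScaled implementation, which works off the precomputed bounds from earlier):
def histEqualize(img: np.ndarray, bins: int = 300) -> np.ndarray:
    # map pixel values through the image's own empirical CDF; output lies in [0, 1]
    counts, edges = np.histogram(img, bins=bins)
    cdf = np.cumsum(counts).astype(np.float64); cdf /= cdf[-1]
    centers = (edges[:-1] + edges[1:]) / 2
    return np.interp(img, centers, cdf)

plt.imshow(histEqualize(f.pixel_array.astype(np.float32)));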
Remember to close all pools once you're done with everything, to terminate all child processes.
applyMp.clearPools()
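If you want that cleanup to happen even when a cell blows up halfway through, wrapping the heavy step in try/finally works fine (a usage sketch, reusing the same call as above):
try:
    hists = ls(baseT) | applyMp(patientHist, None, 20) | deref()
finally:
    applyMp.clearPools()  # make sure no child processes linger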