OSIC

This is another example of the cli workflow. I made this notebook while analyzing data from Kaggle's OSIC contest: https://www.kaggle.com/c/osic-pulmonary-fibrosis-progression. It mostly analyzes the provided csv file and the ~30k DICOM images. Also, note that this notebook is "quieter", as most things are left unmodified. I will chime in here and there, but not much.

For quick reference: DICOM standard browser

Just to "warm up" multiprocessing stuff. Not strictly necessary tho:

Directory looks something like this:

Train.csv

Looks nice. 1000 data points seems a bit low though.

Patient #scans distribution

Kay yeah, there may be too little data here. Seems like there are around 200 unique patients, and the most frequent #scans per patient is 9. 9 seems pretty comprehensive for a single patient, but ~200 patients? That seems a bit too low.

More precisely, there're 176 patients

Weeks distribution

Why are there negative weeks?

Right, so 11 examples of negative weeks. Let's just filter all of them out, they're not worth including in the analysis.

Metric

Different lines for different delta values:

So, delta small, sigma big will be best. Doesn't apply at the smallest deltas, but who really cares?
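For reference, a minimal sketch of the metric being plotted, assuming it's the competition's modified Laplace log likelihood as I understand it from the evaluation page (sigma floored at 70, error capped at 1000). The function name and the sample numbers are made up for illustration.

```python
import math

def laplace_log_likelihood(fvc_true, fvc_pred, sigma):
    """Modified Laplace log likelihood for one prediction (higher is better)."""
    sigma_clipped = max(sigma, 70.0)               # assumption: confidence floored at 70
    delta = min(abs(fvc_true - fvc_pred), 1000.0)  # assumption: error capped at 1000
    return (-math.sqrt(2) * delta / sigma_clipped
            - math.log(math.sqrt(2) * sigma_clipped))

# One row per delta, sweeping sigma, roughly mirroring the plot above
for delta in (50, 200, 1000):
    scores = {s: round(laplace_log_likelihood(2500 + delta, 2500, s), 3)
              for s in (70, 250, 1000)}
    print(delta, scores)
```

Sweeping sigma for a fixed delta like this is a quick way to read off which sigma scores best for a given error.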

FVC?

The typical prediction would be sth like $2500\pm250$, or $\pm10\%$, which sounds very reasonable.

Percent

Keep in mind that lots of percent values are greater than 100, so this doesn't actually mean percent. How many?

195, or 12% of all datapoints. Yeah, so this can't really be ignored.

Age

Keep in mind that this is for unique patients. Shape's still similar to the bigger picture though.

Sex

This is gonna skew things, but hopefully not a lot.

Smoking status

Sample_submission.csv

So, apparently, the problem requires us to actually predict negative weeks. Yikes. This is gonna be harder than I thought.

Sample looks quite reasonable though.

Train/

Quite nice. All the records in the csv file are also inside the "train" folder. How many individual dcm files?

Damn, that's a lot. File names all indexed nicely?

Getting fields...

Notice how we took only several seconds for everything. That's kinda remarkable!

Quick viz of how this works:

image.png image.png

More info: https://www.radiologycafe.com/radiology-trainees/frcr-physics-notes/ct-equipment

Also, "Image Position (Patient) and Image Orientation (Patient) are the 2 only attributes you should ever used when computing distances between slices"
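A rough sketch of what that quote means in practice: take the cross product of the row and column direction cosines from ImageOrientationPatient to get the slice normal, then project each ImagePositionPatient onto it. The helper name and the coordinates below are made up for illustration, not taken from the dataset.

```python
import numpy as np

def slice_positions(image_positions, image_orientation):
    """Project each slice origin onto the slice normal (result in mm)."""
    row = np.asarray(image_orientation[:3], dtype=float)  # row direction cosines
    col = np.asarray(image_orientation[3:], dtype=float)  # column direction cosines
    normal = np.cross(row, col)                           # perpendicular to the slice plane
    return np.asarray(image_positions, dtype=float) @ normal

# Hypothetical axial series with 5 mm between slices
ipp = [(-160.0, -160.0, z) for z in (-10.0, -5.0, 0.0, 5.0)]
iop = (1, 0, 0, 0, 1, 0)
pos = slice_positions(ipp, iop)
spacing = np.diff(np.sort(pos))  # distance between neighbouring slices
print(spacing)
```

Sorting by the projected position also gives a slice order that doesn't depend on file names or InstanceNumber.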

Single value fields:

What's left?

That's a lot to cover. Let's go through them 1 by 1.

Number of bits stored for each pixel sample. But why 12, 13 and 16?

Width of image

Convolution kernel algorithm used to reconstruct the data. I guess this is the algorithm that solves the CT inverse problem I've always wondered about. Apparently, there are multiple algorithms doing roughly the same thing, with different tradeoffs. I guess I just didn't expect there to be so many.

In mm

In mm, distance from x-ray source to isocenter/patient

Size of focal spot in mm. Ig this is sort of like resolution?

9 - FrameOfReferenceUID seems to be that random string, assigned for study or sth

Note that the huge spike at the beginning also has a wide distribution to it, it's just too small to be noticeable.

Also, this is power in kW provided to the x-ray generator. Some are actually up in the thousands because the operator confused W vs kW lmao. So I gotta filter this out.

Corresponds to BitsStored - 1. Pretty understandable

Specifies the x, y, and z coordinates of the upper left hand corner of the image. Do I really have to flip voxel images around tediously like this???

List of unique values:

Feels like #slice? "A number that identifies this image"

Kilovolt peak

Literally, biggest pixel value in the image. Checks out with my own testing

2 values, patient direction in rows and columns. Values include: A (anterior), P (posterior), R (right), L (left), H (head), F (foot). So, LP = left posterior, LA = left anterior, RA = right anterior. Empty values just default to LP, cause it has a bunch more stuff in it

Quite complex. Refer to docs for more

Pixel value to pad the background

What format are the pixels??? 0 and 1 (in DICOM, 0 = unsigned integer, 1 = two's complement signed). Let's hope this is automatically handled on getting pixel_array

x-y spacing between pixels, in mm.

x and y always equal to each other?

Specifies part of the imaging target that was used as a reference point. Read docs for more

$output = m\cdot SV+b$, where SV is the stored value. Rescale slope is 1 for all images already btw.
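A minimal sketch of applying that formula to a pixel array, where m is the rescale slope and b is the rescale intercept. The function name is made up, and the intercept of -1024 is just a typical-looking value I'm assuming for illustration, not one read from these files.

```python
import numpy as np

def to_output_units(stored, slope=1.0, intercept=0.0):
    """Apply output = m * SV + b elementwise."""
    return np.asarray(stored, dtype=np.float32) * slope + intercept

# Hypothetical stored values from a 12-bit unsigned image
stored = np.array([[0, 1024], [2048, 4095]], dtype=np.uint16)
hu = to_output_units(stored, slope=1.0, intercept=-1024.0)
print(hu)
```

Casting to float first avoids unsigned underflow when the intercept is negative.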

Specifies output units of rescale slope and rescale intercept. HU = hounsfield units, US = unspecified. So all of these are just HU then lmao.

Time for the x-ray source to rotate around the patient. Kinda hard to believe that these things rotate 1 revolution per fucking 0.4 second??? That's so fast.

Height of image

This seems like a unique value for every dcm file

This one's also full of weird numbers

??

I guess this is the locations? Note the tendency to center around 0, cause like, when the lung is exactly at the middle, it sort of should be zero in the comp.

Obv. This sort of means that distances below this threshold are fused into 1 reading.

Measured in mm. Kinda corresponds to pixel spacing. This z axis is much less accurate though, which kinda sucks. Good thing a lot of them clump toward 0. Also there shouldn't be any negative values here.

The middle bar, expanded out

Resolution physics limitation in mm. 0.35 means that any further detail below 0.35mm is considered bogus.

Sort of unicode stuff?

Ah right, after the RevolutionTime thingy, this makes sense. I guess it's how many mm the table moves per full rotation? Sounds kinda like SpacingBetweenSlices though, so what's up with that? Official definition is tableFeedPerRotation/totalCollimationWidth. Should really be as close to 1 as possible to maximize image quality and efficiency.
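The ratio above as a tiny sketch. The function name and the numbers are made up for illustration; only the formula comes from the definition quoted above.

```python
def pitch_factor(table_feed_per_rotation_mm, total_collimation_width_mm):
    """Pitch = table feed per rotation / total collimation width (unitless)."""
    return table_feed_per_rotation_mm / total_collimation_width_mm

# Hypothetical scanner with a 40 mm total collimation width
for feed_mm in (20.0, 40.0, 60.0):
    print(feed_mm, pitch_factor(feed_mm, 40.0))
```

Below 1 the beam overlaps itself between rotations, above 1 the table outruns the beam width.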

Just to visualize, this is what the collimation thingy looks like. I guess these are independent x-ray sources that each get their own detector? "C" is the total collimation width.

image.png

This one's also weird

Interesting question, does it always start with 2.25?

Yep, sure enough

This should be roughly equal to the total collimation width, because if it's much larger than that, then you get sparse spots in the image.

Distance from table to center of rotation in mm. Should be 1/2 depth of human body (yep, this checks out). So kinda have to filter out negative values here, cause if so, the patient wouldn't be in the center anymore.

Sort of dependent on table feed per rotation, and rotation/s. Apparently, the value is in mm/s, but I mean, typical value of 10cm/s seems outrageously fast, so there has to be something wrong here? UPDATE: so I actually calculated this dependent variable (feed 1cm-6cm/rotation, and 1 rotation/(0.4-1)s), and got back 1-10cm/s for this value. Okay, so I kinda have to reluctantly agree that the table speed is pretty big. However, that doesn't explain the 20cm/s mark.
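The sanity check from the update, sketched out: table speed is just feed per rotation divided by seconds per rotation. The function name and the sample pairs are made up, picked from inside the ranges mentioned above.

```python
def table_speed_mm_per_s(feed_per_rotation_mm, revolution_time_s):
    """Table travel per second = travel per rotation / seconds per rotation."""
    return feed_per_rotation_mm / revolution_time_s

# (feed in mm per rotation, revolution time in s), hypothetical combinations
for feed_mm, rev_s in [(10.0, 1.0), (40.0, 0.5), (40.0, 0.4)]:
    speed = table_speed_mm_per_s(feed_mm, rev_s)
    print(f"{feed_mm} mm/rot at {rev_s} s/rot -> {speed / 10:.1f} cm/s")
```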

Window center and window width act sort of like a filter for values. Anything below center - width/2 becomes the min value, anything above center + width/2 becomes the max value, and everything in the middle is scaled linearly like this: $min + ((x - c) / w + 0.5) \cdot (max - min)$. The term $(x-c)/w + 0.5$ goes from 0 to 1 across the window, rest is pretty obvious.
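A minimal sketch of that windowing, using the simplified formula from the text with the output minimum added back so the low end lands on min. The function name, the 0-255 output range, and the center/width values are assumptions for illustration.

```python
import numpy as np

def apply_window(x, center, width, out_min=0.0, out_max=255.0):
    """Clip to [center - width/2, center + width/2], then rescale to [out_min, out_max]."""
    t = (np.asarray(x, dtype=np.float32) - center) / width + 0.5  # 0 at lower edge, 1 at upper edge
    t = np.clip(t, 0.0, 1.0)                                      # values outside the window saturate
    return out_min + t * (out_max - out_min)

# Hypothetical window: center -600, width 1500, so the edges sit at -1350 and 150
values = np.array([-2000.0, -1350.0, -600.0, 150.0, 3000.0])
windowed = apply_window(values, center=-600, width=1500)
print(windowed)
```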

Tube current in mA