Divide et impera - the making of a dataset

11 March 2022

...especially when you have no other option.

Lovely evening in Turin. I made this myself. Also a lovely slicing into tiles. This will come in handy.

After deciding on the structure of the training dataset, we are faced with the task of feeding it to the PC. Giving whole pictures to the PC is a bad idea because, among other reasons, it would require immense amounts of RAM and excruciating processing times. Kinda like trying to run Crysis Remastered at 8K, “can it run crysis” preset, with ray-tracing, on a Commodore 64. For archaeological reference, my current hardware is a 6C/12T Intel Core i7-8750H (base 2.2 GHz, turbo 4.1 GHz) with 32 GB of RAM and an Nvidia RTX 2070. On the go, I am backed by my old faithful mid-2012 MacBook Air (Core i5-3427U, base 1.8 GHz, boost 2.8 GHz, 8 GB RAM).

A very good idea is to slice all pictures into paired clean/noisy little tiles. Like so:

The actual tiles will be far smaller, on the order of 56×56 px or so. They will also be cut from the whole image, not just from one spot as in this example.

This will:

maximise the dataset. Out of a reasonably small number of big pictures, we get a far greater number of tiles. The machine will then learn to rebuild patterns from a broad variety of tiny image specks, instead of from a limited number of gigantic matrices.
keep the hardware requirements within humane boundaries. This applies to both learning and denoising.
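The slicing idea can be sketched with NumPy in a few lines. This is only an illustration, not the actual project code: the function name `slice_into_tiles`, the random stand-in images and the choice of non-overlapping tiles (dropping leftover edge pixels) are all assumptions.

```python
import numpy as np

def slice_into_tiles(image, tile_size):
    """Cut a 2D array (one channel) into non-overlapping square tiles.

    Edge pixels that do not fill a whole tile are discarded (an assumption).
    Returns an array of shape (n_tiles, tile_size, tile_size).
    """
    rows = image.shape[0] // tile_size
    cols = image.shape[1] // tile_size
    # Crop to an exact multiple of the tile size
    cropped = image[: rows * tile_size, : cols * tile_size]
    # Split each axis into (tile index, offset inside tile),
    # then bring the two tile indices next to each other
    tiles = cropped.reshape(rows, tile_size, cols, tile_size).swapaxes(1, 2)
    return tiles.reshape(rows * cols, tile_size, tile_size)

# Hypothetical stand-ins for one clean/noisy pair (single channel)
rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(120, 180), dtype=np.uint8)
noisy = rng.integers(0, 256, size=(120, 180), dtype=np.uint8)

# The same slicing is applied to both pictures, so tile i of one
# always corresponds to tile i of the other
clean_tiles = slice_into_tiles(clean, 56)
noisy_tiles = slice_into_tiles(noisy, 56)
print(clean_tiles.shape, noisy_tiles.shape)
```

Because both pictures go through the same function, the pairing comes for free: the tiles at the same index cover the same patch of the scene.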

(For those interested, you can try and poke around with the – still experimental, unpolished, half-baked, but kinda working – code yourself).

With 16 picture pairs (5184×3456 pixels each), the memory usage is around 1.8 GB.

The make_dataset() method does the actual slicing of all pictures and populates the in-memory tile containers. shuffle_dataset() randomly shuffles all tiles (keeping the pairing, of course), so that when we split the dataset into training, validation and whatnot, there’s no chance of learning how to denoise pitch-black tiles and then applying what we have learned to tiles containing rainbow unicorns.
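One common way to shuffle while keeping the pairing is to draw a single random permutation and apply it to both arrays. The sketch below assumes that approach; `shuffle_pairs` and the toy arrays are hypothetical and are not taken from the real shuffle_dataset().

```python
import numpy as np

def shuffle_pairs(clean_tiles, noisy_tiles, seed=None):
    """Shuffle two tile arrays with the same permutation so pairs stay aligned."""
    if len(clean_tiles) != len(noisy_tiles):
        raise ValueError("clean and noisy tiles must have the same length")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(clean_tiles))
    # Indexing both arrays with the same order keeps tile i paired with tile i
    return clean_tiles[order], noisy_tiles[order]

# Toy data: clean tile i is filled with the value i, its noisy twin with i + 100
clean_demo = np.arange(5, dtype=np.int32).reshape(5, 1, 1) * np.ones((1, 2, 2), dtype=np.int32)
noisy_demo = clean_demo + 100

shuffled_clean, shuffled_noisy = shuffle_pairs(clean_demo, noisy_demo, seed=42)
print(shuffled_clean[:, 0, 0])
print(shuffled_noisy[:, 0, 0])
```

Whatever order comes out, each noisy tile still sits at the same index as its clean counterpart.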

But why keep all the tiles in memory? That’s a design decision. Reasons:

laziness. We know there’s a way to construct a generator function that feeds the machine learning model in steps with a minimal memory footprint, but this comes at the expense of our patience. A few more GB of RAM won’t hurt anybody, especially for training.
shuffling. Shuffling tiles (keeping them paired clean/noisy) would be more difficult to manage with a generator, and at some point we would need to keep everything in memory anyway, or directly on disk (in the form of millions of individual PNGs: just nonsense).

The paired tiles are stored in the ds.clean_tiles_ and ds.noise_tiles_ attributes, as numpy arrays. With default arguments, each one contains 1,092,240 28×28 px tiles.
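The tile count can be reproduced with a bit of arithmetic. The sketch assumes non-overlapping 28×28 tiles with leftover edge pixels dropped, each of the three color channels counted as its own monochrome image, and one byte per pixel (uint8); these are guesses that happen to match the numbers quoted in this post.

```python
width, height = 5184, 3456   # picture size in pixels
tile = 28                    # default tile side in pixels
pairs = 16                   # number of clean/noisy picture pairs
channels = 3                 # R, G, B treated as separate monochrome images

tiles_per_channel = (width // tile) * (height // tile)  # 185 * 123
tiles_total = tiles_per_channel * channels * pairs      # per container

# Assuming one byte per pixel (uint8); two containers: clean and noisy
bytes_per_container = tiles_total * tile * tile
total_gb = 2 * bytes_per_container / 1e9

print(tiles_per_channel)   # 22755
print(tiles_total)         # 1092240
print(round(total_gb, 2))  # 1.71
```

Under these assumptions the two containers add up to roughly 1.7 GB, in the same ballpark as the memory usage reported above.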

Another design decision: we’re currently using all image channels as if they were equal to keep things simple. They’re not. But what the heck are channels?

Images are usually RGB-encoded. The color of each pixel is represented by three integer values, one for each of the red, green and blue channels. Consumer-grade images define the “intensity” of each channel with 256 different values, from 0 to 255. 0 is the darkest, 255 is the brightest. These numbers come from the 8 bits that are used to store each value. As bits are in base 2, we have 2^8 (=256) possible values. Colors are formed by adding the channels together, like so:

How RGB channels are mixed. Source: Wikipedia.

This gives us a grand total of 24 bpp (bits per pixel) for a regular JPG image. Other formats allow a fourth channel, called alpha, that defines how transparent each pixel is. This will come in handy later on.
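The numbers above are easy to check, and additive mixing can be shown with three pure colors. A minimal sketch; the variable names are just for illustration.

```python
import numpy as np

levels = 2 ** 8        # values per channel: 256
bpp = 3 * 8            # three 8-bit channels: 24 bits per pixel
colors = levels ** 3   # distinct colors a 24 bpp pixel can take

# Pure colors as (R, G, B) triplets
red = np.array([255, 0, 0], dtype=np.uint8)
green = np.array([0, 255, 0], dtype=np.uint8)
blue = np.array([0, 0, 255], dtype=np.uint8)

# Additive mixing: these sums never exceed 255, so uint8 does not overflow
yellow = red + green
white = red + green + blue

print(levels, bpp, colors)
print(yellow, white)
```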

Not all channels are created equal: I already know that my camera has a very noisy red channel, a noisy green channel and a much-less-noisy blue channel. But, for the sake of simplicity, we will begin by treating each channel independently, learning how to remove noise from it as if it were a monochrome image (which, indeed, it is). Incidentally, this will also triple the size of our dataset:

Then, we will recombine the three denoised layers into one final color image.
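Splitting and recombining channels is straightforward with NumPy. In the sketch below, `denoise_channel` is a hypothetical placeholder that returns its input unchanged; in the real pipeline the trained model would take its place.

```python
import numpy as np

def denoise_channel(channel):
    """Placeholder for the trained model: returns the channel unchanged."""
    return channel.copy()

# Hypothetical tiny noisy RGB image, shape (height, width, 3)
rng = np.random.default_rng(1)
noisy_rgb = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

# Split into three monochrome (height, width) layers
layers = [noisy_rgb[..., c] for c in range(3)]

# Denoise each layer on its own, as if it were a grayscale picture
denoised_layers = [denoise_channel(layer) for layer in layers]

# Stack the layers back along the last axis to rebuild the color image
restored = np.stack(denoised_layers, axis=-1)
print(restored.shape)
```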