Writing a noise removal machine learning app from scratch – with no idea how to do it

...especially when you have no other option.

Lovely evening in Turin. I made this myself. Also a lovely slicing into tiles. This will come in handy.

After deciding on the structure of the training dataset, we are faced with the task of feeding it to the PC. Giving whole pictures to the PC is a bad idea because, among other reasons, it would require immense RAM resources and excruciating processing times. Kinda like trying to run Crysis Remastered at 8K, “can it run crysis” preset, with ray-tracing on a Commodore 64. For archaeological reference, my current hardware leverages a 6C/12T Intel Core i7-8750H (base 2.2 GHz, turbo 4.1 GHz), 32 GB of RAM with an Nvidia RTX 2070. On the go, I am backed by my old faithful mid-2012 MacBook Air – Core i5-3427U, base 1.8 GHz, boost 2.8 GHz, 8 GB RAM.

A very good idea is to slice all pictures into paired clean/noisy little tiles. Like so:

The size of the actual tiles will be far smaller, on the order of ~56×56 px or so. Tiles will also be cut from the whole image, not just one tile as in this example.

This will:

  • maximise the dataset. Out of a reasonably small number of big pictures, we will have a far greater number of tiles. The machine will then look and learn to rebuild patterns from a broad variety of tiny image specks, instead of from a limited number of gigantic matrices.
  • keep the hardware requirements within humane boundaries. This applies to both learning and denoising.
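To make the slicing idea concrete, here is a minimal sketch of how a single-channel picture could be cut into tiles with numpy. This is not the project's actual code: the function name slice_into_tiles, the non-overlapping grid and the choice to throw away partial tiles at the edges are all assumptions made for illustration.

```python
import numpy as np

def slice_into_tiles(image: np.ndarray, tile_size: int = 56) -> np.ndarray:
    """Cut a 2-D (single-channel) image into non-overlapping square tiles.

    Hypothetical helper: partial tiles at the right and bottom edges are
    discarded. Returns an array of shape (n_tiles, tile_size, tile_size).
    """
    height, width = image.shape
    rows, cols = height // tile_size, width // tile_size
    # Crop so that both sides are exact multiples of tile_size.
    cropped = image[: rows * tile_size, : cols * tile_size]
    # Split each axis into (tile index, offset inside tile), then move the
    # two tile indices next to each other so each tile is one block.
    tiles = cropped.reshape(rows, tile_size, cols, tile_size).swapaxes(1, 2)
    return tiles.reshape(rows * cols, tile_size, tile_size)

# Random stand-ins for one clean/noisy pair of the same scene.
rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(200, 300), dtype=np.uint8)
noisy = rng.integers(0, 256, size=(200, 300), dtype=np.uint8)

# Slicing both pictures the same way keeps tile i of one paired with
# tile i of the other.
clean_tiles = slice_into_tiles(clean)
noisy_tiles = slice_into_tiles(noisy)
print(clean_tiles.shape, noisy_tiles.shape)
```

Tiles come out in row-major order (left to right, then top to bottom), so the same index in both arrays always points at the same patch of the scene.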

(For those interested, you can try and poke around with the – still experimental, unpolished, half-baked, but kinda working – code yourself).

With 16 picture pairs (5184×3456 pixels), the memory usage is around ~1.8 GB.

The make_dataset() method does the actual slicing of all pictures and populates the in-memory tile containers. shuffle_dataset() randomly shuffles all tiles (keeping the pairing, of course), so that when we split the dataset into training and validation and whatnot, there’s no chance of learning how to denoise total-black tiles and then applying what we have learned to tiles containing rainbow unicorns.
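For the curious, one common way to shuffle two arrays while keeping their pairing is to draw a single random permutation and apply it to both. The sketch below shows that idea only; shuffle_paired is a made-up name and this is not necessarily how shuffle_dataset() does it.

```python
import numpy as np

def shuffle_paired(clean_tiles, noisy_tiles, seed=None):
    """Shuffle two tile arrays with the same random permutation.

    Hypothetical helper for illustration; returns shuffled copies.
    """
    if len(clean_tiles) != len(noisy_tiles):
        raise ValueError("clean and noisy tile arrays must have the same length")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(clean_tiles))
    # Indexing both arrays with the same order keeps pair i together.
    return clean_tiles[order], noisy_tiles[order]

# Toy data: clean tile i is filled with the value i, its noisy twin with i + 100.
clean = np.stack([np.full((4, 4), i) for i in range(10)])
noisy = clean + 100

clean_shuffled, noisy_shuffled = shuffle_paired(clean, noisy, seed=1)
# Every noisy tile still sits next to its own clean tile.
print(np.all(noisy_shuffled - clean_shuffled == 100))
```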

But why keep all the tiles in memory? That’s a design decision. Reasons:

  • laziness. We know there’s a way to construct a generator function that feeds the machine learning model in steps with minimal memory footprint, but this comes at the expense of our patience. A few more GB of RAM won’t hurt anybody, especially for training.
  • shuffling. Shuffling tiles (keeping them paired clean/noisy) would be more difficult to manage, and at some point we would need to keep everything in memory nonetheless, or directly on disk (in the form of millions of individual PNGs: just nonsense).

The paired tiles are stored in the ds.clean_tiles_ and ds.noise_tiles_ attributes, as numpy arrays. With default arguments, each one contains 1,092,240 28×28 px tiles.
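A quick back-of-the-envelope check of those numbers. The assumptions here are ours for the sake of the arithmetic: non-overlapping tiles, partial edge tiles discarded, each of the three color channels contributing its own tiles, and one byte per pixel.

```python
WIDTH, HEIGHT = 5184, 3456  # picture size quoted in this post
TILE = 28                   # default tile side quoted in this post
PAIRS = 16                  # number of picture pairs quoted in this post
CHANNELS = 3                # assumption: R, G and B each yield their own tiles

# Whole tiles that fit in one channel of one picture (185 x 123).
tiles_per_channel = (WIDTH // TILE) * (HEIGHT // TILE)
tiles_total = tiles_per_channel * CHANNELS * PAIRS

# Assumption: tiles are stored as 8-bit integers, 1 byte per pixel.
bytes_per_array = tiles_total * TILE * TILE
gb_both_arrays = 2 * bytes_per_array / 1e9  # clean + noisy containers

print(tiles_per_channel)         # 22755
print(tiles_total)               # 1092240
print(round(gb_both_arrays, 2))  # 1.71
```

Under these assumptions the tile count matches the 1,092,240 figure exactly, and the two containers together land in the same ballpark as the memory usage reported above.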

Another design decision: we’re currently using all image channels as if they were equal to keep things simple. They’re not. But what the heck are channels?

Images are usually RGB-encoded. The color of each pixel is represented by three integer values, one each for the red, green and blue channels. Consumer-grade images define the “intensity” of each channel with 256 different values, from 0 to 255. 0 is the darkest, 255 is the brightest. These numbers pop out from the 8 bits that are used to store the value of the integer. As bits are in base 2, we have 2^8 (=256) possible values. Colors are made up by the additive sum of the channels, like so:

How RGB channels are mixed. Source: Wikipedia.

This gives us a grand total of 24 bpp (bits per pixel) for a regular JPG image. Other formats allow a fourth channel, called alpha, that defines how transparent each pixel is. This will come in handy later on.

Not all channels are created equal: I already know that my camera has a very noisy red channel, a noisy green channel and a much-less-noisy blue channel. But, for the sake of simplicity, we will begin by treating each channel independently, learning how to remove noise from it as if it were a monochrome image (which, indeed, it is). This will also incidentally triple our dataset:

Then, we will recombine the three denoised layers into one final color image.
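The split-then-recombine round trip can be sketched in a few lines of numpy. All three function names below are made up for illustration, and denoise_plane is just a placeholder that returns its input untouched, standing in for whatever model ends up doing the real work.

```python
import numpy as np

def split_channels(rgb):
    """Return the R, G and B planes of an (H, W, 3) image as 2-D arrays."""
    return rgb[..., 0], rgb[..., 1], rgb[..., 2]

def merge_channels(red, green, blue):
    """Stack three 2-D planes back into one (H, W, 3) color image."""
    return np.stack([red, green, blue], axis=-1)

def denoise_plane(plane):
    """Placeholder for the denoiser: hands the plane back unchanged."""
    return plane

# A tiny random stand-in for an 8-bit RGB picture (3 x 8 bits = 24 bpp).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

planes = split_channels(image)  # three monochrome images
restored = merge_channels(*(denoise_plane(p) for p in planes))
print(restored.shape, np.array_equal(restored, image))
```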

Our goal is to make the computer learn how to denoise images. But how can it learn, and from what? And what is noise, anyway?

Creeping noise. Source.

A very technical explanation claims noise is something that makes your picture look bad. In pictures, it mostly occurs through two mechanisms:

  • Poverty. That’s why a picture shot with a top-notch camera (~7,500€) at ISO 102,400 still looks usable, while another shot with an entry-level one (~350€) at ISO 6,400 looks like a bunch of pixels with values decided by dice rolls.
  • Quantum mechanics. Light is actually made out of tiny individual particles (that are also waves, that are also particles, and so on), called photons. At the end of their journey(s), photons ultimately crash into the sensor, generating an electric current. When the light is low, when the exposure time is short, or both, there are fewer photons smashing into the sensor. This results in random fluctuations in each pixel reading due to the uneven arrival of photons from the source, possibly made worse by the sensor's attempts to amplify this feeble signal by forcing more current into its circuits. Last, this analog signal needs to be converted into digital form (another source of noise).
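The "uneven arrival of photons" part can be played with numerically. The sketch below assumes photon counts per pixel follow a Poisson distribution, which is the textbook model for this kind of shot noise and not something measured on our camera. It ignores amplification and analog-to-digital conversion entirely, and the function name is made up.

```python
import numpy as np

rng = np.random.default_rng(42)

def shot_noise_snr(mean_photons, n_pixels=100_000):
    """Estimate the signal-to-noise ratio of a flat, evenly lit patch.

    Assumption: each pixel collects a Poisson-distributed number of
    photons with the given mean. SNR is taken as mean / standard deviation.
    """
    counts = rng.poisson(mean_photons, size=n_pixels)
    return counts.mean() / counts.std()

# Fewer photons per pixel (dim light, short exposure) means a lower SNR.
for mean_photons in (10, 100, 1_000, 10_000):
    snr = shot_noise_snr(mean_photons)
    print(f"{mean_photons:>6} photons/pixel -> SNR around {snr:.1f} "
          f"(square root of the count: {np.sqrt(mean_photons):.1f})")
```

Under this model the SNR grows only with the square root of the photon count, so collecting four times the light merely doubles it.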

What can we do when quantum mechanics, thermodynamics, bad luck and other major forces of Nature conspire against us? We fight back.

Just like you would imagine how a noisy picture would look if it weren’t noisy, because you know from experience what a clean picture looks like, the computer can also learn in a similar fashion. Problem is, your computer never went on holiday, took pictures and spent time reviewing them while eating pizza.

So that’s what we’re going to do: we will build a dataset made of paired pictures: one clean, one noisy. We will make a lot of them and will then feed them to the computer. Settings:

  • clean picture: ISO 200
  • noisy picture: ISO 1,600 pushed + 2 EV (ISO 6,400 equivalent)

What’s ISO anyway? It’s a value related to the sensitivity of your sensor: the lower, the lower (low ISO == low sensitivity and vice versa).

If you can shoot at low ISO, this means light is plentiful and/or you can afford longer exposure times. This results in a noise-free image. If you cannot, it means that light is low and/or you must shoot very fast to freeze a moment (such as water moving), and you need to capture more information in less time. This results in noisy images. ISO values double when the light required to take a picture halves. This is convenient, as ISO 100 and ISO 200 are spaced just like ISO 800 and ISO 1,600 are: one stop apart.
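Since every doubling of ISO is one stop (1 EV), the spacing between two ISO values is just a base-2 logarithm, and a push of N EV multiplies the ISO by 2^N. A tiny sketch, with helper names that are ours:

```python
import math

def stops_between(iso_from, iso_to):
    """Number of stops (EV) separating two ISO values."""
    return math.log2(iso_to / iso_from)

def pushed_iso(iso, ev):
    """Equivalent ISO after brightening a picture by `ev` stops."""
    return iso * 2 ** ev

print(stops_between(100, 200))   # 1.0
print(stops_between(800, 1600))  # 1.0
print(pushed_iso(1600, 2))       # 6400
print(stops_between(200, 6400))  # 5.0
```

So the noisy setting used for the dataset (ISO 1,600 pushed by +2 EV) sits five stops above the clean one (ISO 200).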

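Practical example: at the same aperture, a picture taken at ISO 100, 1/125 second or at ISO 800, 1/1000 second will look exactly the same (with the exception of noise), because eight times the sensitivity makes up for one eighth of the exposure time.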

I will be using my Canon EOS 1200D for the making of the dataset. Reasons:

  • That’s the camera that I have.
  • It produces fantastically noisy pictures, which is good for this project (and, incidentally, the reason for it)
  • On top of abysmal high-ISO capabilities, ISO invariance is also horrible

Most datasets out there are built by artificially introducing Gaussian noise into the images. Our main goal is to produce a dataset containing real noise, thus we will take two pictures of the same thing. As mentioned above, one will be clean (ISO 200). The other one will be shot at ISO 1,600 (a value my camera already begins to struggle with), and artificially boosted to ISO 6,400 equivalent by pushing the picture by +2 EV. Because of the poor ISO invariance, this results in an even noisier image than one actually shot at ISO 6,400. Example:

This will likely cover real use-case scenarios, where I will need to push ISO 1,600 images upwards. You will find the datasets here.

Coming up: how we’re gonna digest all those images to feed them to the computer for learning.

Trials, errors, dead ends and strokes of genius of two Python tinkerers writing a ML-powered noise removal app from scratch

Stemanz (right) and houseofbards (left) before starting the project. Let’s see how aged we get at the end.

It all began when I downloaded the demo version of an app that promised AI-powered noise reduction in pictures. Sure, it worked well (from acceptably well to astonishingly well depending on individual cases). Sure, it cost a lot. Thus, I was left with two options: just shelling out the cash (quickest option), or getting my hands dirty and making it myself – from scratch (most hellish option).

I chose the most obvious one.

As a rule of thumb, it's better not to suffer alone. Thus, unfortunately for him, I dragged houseofbards into this.

These blog posts will tell the tale of two self-taught Python tinkerers trying to write a working app, from the ground up, that leverages machine learning to remove noise from images. We have no idea whatsoever what we’re doing, and we’re gonna fill all the gaps with these best practices:

  • stealing code (I’m told it’s OK as long as you give credit to the poor souls who made the effort in the first place)
  • implementing ideas without understanding them (matrix multiplication, tensor algebra, cost and loss functions, autoencoders, GANs and friends – you know they’re there, and it’s all you need. Stir the cauldron until it works.)
  • approximating (this is when you know when good enough is good enough)

Join us, it’ll be fun!
