Why Should I Do What Everyone Does
What I currently miss the most with my minimalist, no-budget music production setup would definitely be some organic, pounding and swinging battery. I do not really have much interest in using loops or pre-processed sample libs (tHat CUTs thrU Ze MiX!), because, as Chris Randall once said, you have no time to screw around with somebody else's music if you really want to make your own. Truly, with the standards I impose on myself, simply downloading a copy of MT PowerDrums would be a statement of artistic and engineering impotence. So far, I have got some applicable results in "personalised" percussive sounds by exploring the following approaches:
- Basic subtractive and FM synthesis. Unless you are a sound-design and electrical-engineering expert, this gives you the usual "blip-blop" type of electro-drum sounds.
- Found objects. This is always fun. Just put your smartphone on record every time you take out the trash.
- Physical modeling. Good for membrane percussion, but it quickly gets complicated for cymbals because of the complex, "clustered" nature of their sound.
So here is a small demo I built around a manipulated drum loop which was generated by a neural network. Over the course of this post I'll try to explain how you can create one yourself.
Sample WAVs Generated by NN
Neural Networks for Sound Generation
What I have not tried yet is NN-generated sound for drum loops and samples. First and foremost (for a math n00b like me), it requires considerably less mathematical knowledge: neural networks consist almost entirely of linear algebra, with none of the third-order partial differential equation systems that the wonderful world of physical modeling is so full of. Second, while the current physical modeling paradigm concentrates on modeling the sound of a single hit (the excitation of a membrane, or a string pluck), generative neural networks, being completely agnostic of the physical nature of the signals they work with, could in fact produce a non-deterministic rhythmic sequence of percussion hits (or rather, transients), provided we train the model exclusively on soloed drum parts, of course. Which is the exact thing we're gonna do!
This way, a trained NN model could become an endless source of free and unique sounds, phrases and rhythmic textures. Well, maybe not completely free, as it is still considered a bit of a legal "gray area" whether you really owe something to the copyright owners of the material you've trained your neural network on. Some GAN software explicitly states in its license that the code itself and the models it produces are free for non-commercial use only.
The reason it excites me so much is that I personally feel this kind of approach is a somewhat underexplored area of artistic expression using machine learning. I draw this conclusion from the fact that the most advanced projects in this segment, such as Jukebox AI, seem to concentrate on producing a flow of "consumption-ready" generated "music". I nevertheless admit the stellar results in facilitating that achieved by the community of developers and users of that software. My goals are much less ambitious.
Generative Adversarial Networks in One Passage Under a Minute
The basics of machine learning lie in finding the coefficients (weights) for nodes in a traversable structure, so that when the output is calculated it matches the label given a priori to a sample of input data. The way the nodes are connected loosely resembles neurons in a brain, hence the name. To describe a wide area of machine learning applications at a very high level, the term "labeling" could be used. So, an NN model that got its node weights determined by being "trained" on a million kitten and puppy photos could then determine ("label") with a decent level of accuracy whether an arbitrary input photo shows a puppy or a kitten. In the case of a content-generation task, two neural networks are used: a generative one and a discriminative one. While learning, the generative network acquires a latent space from which it draws samples to produce random signals. Those random signals are then sent to the discriminative network, which is trained to determine whether they have the same characteristics as the initial data; to that end, both networks are trained on the same dataset. So, by playing this adversarial game, the pair becomes capable of fluently generating new content that "fits" the traits of the initial data.
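To make the adversarial game above a bit more concrete, here is a toy sketch of such a training loop, written against the TensorFlow 1.x API that WaveGAN itself uses. Everything in it (the one-dimensional "data", the tiny dense networks, the hyperparameters) is made up purely for illustration; a real audio GAN like WaveGAN uses convolutional architectures and far longer training.

# Toy GAN on 1-D data, illustrating the generator/discriminator game.
import numpy as np
import tensorflow as tf  # TF 1.x API, as used by WaveGAN

Z_DIM = 16

def generator(z):
    with tf.variable_scope("G"):
        h = tf.layers.dense(z, 32, activation=tf.nn.relu)
        return tf.layers.dense(h, 1)          # fake "signal" sample

def discriminator(x, reuse=False):
    with tf.variable_scope("D", reuse=reuse):
        h = tf.layers.dense(x, 32, activation=tf.nn.relu)
        return tf.layers.dense(h, 1)          # real/fake logit

z = tf.placeholder(tf.float32, [None, Z_DIM])   # latent samples
x_real = tf.placeholder(tf.float32, [None, 1])  # samples of the training data

x_fake = generator(z)
d_real = discriminator(x_real)
d_fake = discriminator(x_fake, reuse=True)

# Discriminator: label real data 1, generated data 0.
d_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=d_real, labels=tf.ones_like(d_real))
    + tf.nn.sigmoid_cross_entropy_with_logits(logits=d_fake, labels=tf.zeros_like(d_fake)))
# Generator: try to make the discriminator label fakes as real.
g_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=d_fake, labels=tf.ones_like(d_fake)))

d_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="D")
g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="G")
d_train = tf.train.AdamOptimizer(1e-3).minimize(d_loss, var_list=d_vars)
g_train = tf.train.AdamOptimizer(1e-3).minimize(g_loss, var_list=g_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(2000):
        batch = np.random.normal(2.0, 0.5, size=(64, 1))    # "real" data
        noise = np.random.uniform(-1, 1, size=(64, Z_DIM))
        sess.run(d_train, {x_real: batch, z: noise})
        sess.run(g_train, {z: noise})

After enough steps, samples drawn from the generator start to land in the same neighbourhood as the "real" data distribution, which is all a GAN really promises.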
Using WaveGAN and AWS Cloud Computing Solution to Facilitate the Experiment
For our attempt at NN-generated percussion we will be using WaveGAN, an implementation of GAN that samples audio at high temporal resolution and is thus able to capture spectral structures across a range of timescales while learning. It is implemented in Python on top of the TensorFlow 1.x framework.
So, how do we train our WaveGAN model on those audio files? Well, unless you are fond of computer games released after Y2K, or are (unlike me) a real scientist with access to your university's mainframe, you probably do not have powerful enough GPU hardware to train a GAN in the comfort of your dungeon. As a cloud-based solution that provides GPU computing I chose AWS, for the following initial reasons:
- Uncanny familiarity. If you have a goatee and know what "MVC" is, I have bad news for ya.
- Religious concerns (there is no way I’m touching Azure)
- Advertised dedicated machine learning solution (emphasis on “advertised”)
- Advertised “Free-Tier” for newly created accounts (double emphasis on “advertised”)
SageMaker. High Level Overview and Initial Experiment Plan
Let's take a high-level overview of how SageMaker works. Stripped of all the buzzwords and marketing, it is a number of core Amazon AWS offerings (virtual machines and cloud storage), all duct-taped together around the topic of machine learning. Basically, it automates the following flow, which you would otherwise perform manually when training a neural network in the cloud:
- Provision and start an instance
- Copy the program and training data to the instance
- Start the program in training mode
- Collect metrics
- Stop training
- Copy the model ("checkpoint") from the instance to AWS S3 cloud storage
- Shut down the instance
// Illustration, training jobs
To be fair, it actually does more than that. For example, it supports training parallelization (it can partition the work by data or by model). Sounds fun, but unless you're using something ready-made from SageMaker's marketplace, you'll have to make quite unobvious modifications to your code to get it to work.
Also, you'll get something called "SageMaker Studio": a small dedicated server fronted by a Jupyter Python notebook, which also has permanent SSD storage, all the core AWS utils, and the essential data-science Python libs like SciPy pre-installed. To execute a notebook cell, you'll need to select a runtime and a kernel, which will start a remote cloud instance and connect your notebook to a Python kernel. You can choose between TensorFlow and PyTorch as the backbone of your computations. Those notebooks actually turn out to be useful for getting the output of your trained network, because SageMaker's solutions for hosting a model seem a bit overcomplicated to me, as they have a flow and setup completely separate from the "training" phase.
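Once training has produced checkpoints in S3, pulling them into a Studio notebook and sampling the generator could look roughly like the sketch below. It loosely follows the generation example from the WaveGAN repository; the tensor names ("z:0", "G_z:0"), the latent size of 100 and the 16 kHz output rate are assumptions to verify against your own checkpoint.

import numpy as np
import tensorflow as tf          # TF 1.x, same as on the training side
from scipy.io import wavfile

CKPT_DIR = "./checkpoint"        # downloaded from the training job's S3 output

tf.reset_default_graph()
saver = tf.train.import_meta_graph(CKPT_DIR + "/infer/infer.meta")
graph = tf.get_default_graph()

with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint(CKPT_DIR))

    # Draw a batch of random latent vectors and push them through the generator.
    _z = (np.random.rand(16, 100) * 2.0) - 1.0
    z = graph.get_tensor_by_name("z:0")
    G_z = graph.get_tensor_by_name("G_z:0")[:, :, 0]
    audio = sess.run(G_z, {z: _z})

    # Write each generated slice out as a 16 kHz WAV file.
    for i, clip in enumerate(audio):
        wavfile.write("gen_{:02d}.wav".format(i), 16000, clip)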
The Process. Expectations vs Reality
Before even creating my personal Amazon account, I had the following theoretical steps in mind:
- Create an Amazon account; they offer something like one month of free computation on their SageMaker platform.
- Use WaveGAN (probably not the best choice for the task, but seems easy enough to set up and go).
- Get some raw material. Demo loops from drum libraries should be good enough for a first try
- Upload them to S3 somehow
- Start learning
- Monitor the training. Use log-based monitoring
- Start a notebook instance
- Load the trained model
- Generate sounds
So, how did it hold up against my initial expectations? Well, OK-ish, I’d say.
First and foremost, the AWS Console Web UI is nothing but a freaking labyrinth. I had to bookmark a bunch of pages to which I kept constantly coming back to look up and reference things along the way.
Second, according to the documentation, if you want SageMaker to work with a custom learning algorithm, you'll need to build and deploy your own docker image. Yeah, "DevOps Plumber" trade union card or GTFO, this is how we do things in the world of enterprise-level cloud computing, right? Well, maybe I am too harsh. Actually, you can start a training job from SageMaker Studio using an entry point that starts the custom algorithm. To use multiple source files, you need to pass the source_dir argument when setting up the Estimator and put your code files in the directory you specify, as sketched below.
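For illustration, a training job with a custom entry point and extra source files could be kicked off from a Studio notebook roughly like this. It is a sketch against the SageMaker Python SDK as I understand it (argument names have shifted between SDK versions), and "train_wavegan.py" / "./wavegan" are placeholders for wherever your copy of the WaveGAN code lives.

import sagemaker
from sagemaker.tensorflow import TensorFlow

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = TensorFlow(
    entry_point="train_wavegan.py",   # script that starts the custom algorithm
    source_dir="./wavegan",           # the rest of the code files get shipped from here
    role=role,
    instance_count=1,
    instance_type="ml.p2.xlarge",     # a GPU instance; not covered by the free tier
    framework_version="1.15",
    py_version="py3",
    output_path="s3://{}/wavegan-output".format(sess.default_bucket()),
)

# The channel name ("training") becomes a directory under /opt/ml/input/data
# inside the training container.
estimator.fit({"training": "s3://{}/wavegan-input".format(sess.default_bucket())})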
Minor caveat though: if your algorithm uses a dependency outside the standard NumPy/SciPy set, which is almost always the case for anything audio-related, you cannot provide it with the training job. Technically, you can install this dependency on your SageMaker Studio instance and run your code there. But guess what? Training utilises almost all of the CPU, so after 10 minutes without heartbeats Amazon marks the instance as dead and shuts it down. Unbelievable!
Theoretically, the manual tells you that you could use the standard scipy utilities for reading your audio files, if they strictly comply with the 44.1 kHz, 16-bit PCM standard. I tried to enforce that with the following:
# Re-encode every WAV in place as 16-bit PCM at 44.1 kHz
find . -name '*.wav' | while read -r f; do
  ffmpeg -i "$f" -acodec pcm_s16le -ar 44100 "${f%.wav}z.wav" &&
    rm "$f" &&
    mv "${f%.wav}z.wav" "$f"
done
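To double-check what the converted files actually contain, a quick pass with the same scipy reader can help. A throwaway sketch, with "./drums" standing in for wherever the WAVs live:

import os
from scipy.io import wavfile

for root, _, files in os.walk("./drums"):
    for name in files:
        if not name.lower().endswith(".wav"):
            continue
        path = os.path.join(root, name)
        rate, data = wavfile.read(path)
        # 16-bit PCM comes back as int16; anything else means the file
        # still is not in the format the simple loader expects.
        print(path, rate, data.dtype, data.shape)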
Even after that conversion, WaveGAN kept crashing while reading the input files, so I refined the steps for training a GAN to be:
- Build and upload your custom image to Amazon ECR
- Upload raw material to S3
- Start a training job either through the SageMaker UI or the Python API (sketched below).
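The Python API route, for reference, boils down to a single boto3 call against the SageMaker service. A sketch only: the account ID, region, role ARN, bucket paths and image name below are all placeholders.

import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="wavegan-drums-001",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/wavegan:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
    InputDataConfig=[{
        "ChannelName": "training",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/wavegan-input/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/wavegan-output/"},
    ResourceConfig={"InstanceType": "ml.p2.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 24 * 60 * 60},
)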
Notice how I've amended the "monitoring" step. I did that because SageMaker's proposed solution for monitoring is a freaking joke. Like, downloading the checkpoints and examining them locally with Tensorboard? Bruuuuuuh
Using this tutorial I was able to put together a modified version of the original TensorFlow docker container, which I made publicly available in this repo. After a few 30-minute feedback cycles, I was able to get it working.
This is where you'll begin to realize that those "Free Tier" conditions are quite limiting. First and foremost, no GPU instances are included in the free-tier limits. Technically, TensorFlow supports computation on XLA devices, so to train on a CPU instance you would need to change the following in the initialisation code of WaveGAN here:
'/device:XLA_CPU:{}'.format(prefetch_gpu_num)))
Sweet, but it is still notably slower than one would expect: I got about 720 steps overnight, versus about 2000 on a GPU instance. Maybe I should have been using the CPU version of the AWS TensorFlow image for that?
And the adversities with GPU instances do not end with their cost. They are disabled in your AWS account by default, and you have to contact AWS support to get access to them. The AWS support SLA is about 24 hours, which is quite alright for an enterprise of this scale, but it can be a real show-stopper when you have your creative juices flowing.
What Was Left Behind
I did not familiarize myself enough with the machine learning domain in general, and the Tensorflow framework specifically, to take advantage of SageMaker's distributed learning capabilities.
It seems that WaveGAN's output sample rates are quite low, 16-22 kHz depending on the output length. This means very pronounced aliasing in the cymbals/percussion domain and generally low sound fidelity.
For more rhythmic excitement and cohesiveness, I would probably need longer slices (currently the maximum is 4 seconds at 16 kHz).
It might be better to skip the AWS Web UI altogether and use the boto3 / SageMaker Python libraries to deploy the image and run the training from local code:
# Assumes the SageMaker Python SDK and a role with SageMaker permissions (e.g. inside Studio).
import sagemaker as sage

sess = sage.Session()
role = sage.get_execution_role()

# URI of the image previously pushed to ECR (the image name here comes from
# the AWS "bring your own container" sample).
account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-decision-trees:latest".format(account, region)

tree = sage.estimator.Estimator(
    image,            # custom training image
    role,             # execution role for the training job
    1,                # instance count
    "ml.c4.2xlarge",  # instance type
    output_path="s3://{}/output".format(sess.default_bucket()),
    sagemaker_session=sess,
)

# Placeholder S3 prefix holding the training data
data_location = "s3://{}/data".format(sess.default_bucket())
tree.fit(data_location)
I am still not sure I can provide clear instructions on setting up the AWS permissions for the whole setup, so I'll just leave a citation from the AWS documentation here:
Running this notebook requires permissions in addition to the normal SageMakerFullAccess permissions. This is because we'll be creating new repositories in Amazon ECR. The easiest way to add these permissions is simply to add the managed policy AmazonEC2ContainerRegistryFullAccess to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this, the new permissions will be available immediately.
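If you prefer doing that from code rather than clicking through the IAM console, attaching the managed policy could look roughly like this (the role name is a placeholder for your actual SageMaker execution role):

import boto3

iam = boto3.client("iam")

# Attach the ECR full-access managed policy to the SageMaker execution role.
# "MySageMakerExecutionRole" is a placeholder -- use the role your notebook
# instance / Studio user actually runs under.
iam.attach_role_policy(
    RoleName="MySageMakerExecutionRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess",
)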
I used a very basic soloed drumkit from a drum lib demo for the input data. Next time I could try something more extreme like those isolated drum playthroughs: 1, 2, 3, 4
Need to see how WaveGAN generates sample slices from the source audio: how many slices there are and how they affect the number of training epochs. The number of steps recommended in the manual is 200K. For that matter, I would maybe need to extract the slicing logic into a separate "analytical" piece of code that would tell me the number of steps and the time needed (see the sketch below).
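Such an "analytical" helper could be as dumb as the sketch below. The slice length (16384 samples, about one second at 16 kHz) and the batch size of 64 are WaveGAN defaults as I understand them, so treat them as assumptions, and "./drums" is a placeholder path.

import os
from scipy.io import wavfile

SLICE_LEN = 16384      # samples per training slice (assumed WaveGAN default)
SAMPLE_RATE = 16000    # training sample rate (assumed)
BATCH_SIZE = 64        # assumed default batch size
TARGET_STEPS = 200000  # recommendation from the manual

total_samples = 0
for root, _, files in os.walk("./drums"):
    for name in files:
        if name.lower().endswith(".wav"):
            rate, data = wavfile.read(os.path.join(root, name))
            # Rescale the length to the training sample rate (rough estimate).
            total_samples += int(len(data) * SAMPLE_RATE / rate)

n_slices = total_samples // SLICE_LEN
steps_per_epoch = max(1, n_slices // BATCH_SIZE)
print("slices:", n_slices)
print("steps per epoch:", steps_per_epoch)
print("epochs for {} steps: {:.1f}".format(TARGET_STEPS, TARGET_STEPS / steps_per_epoch))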
P.S.
If you ever wondered how to run docker on a machine without a network interface:
dockerd --iptables=false --bridge=none