Harmonization of Music using Deep Learning

I took a machine learning course at my institute, in which a term project applying machine-learning concepts was to be submitted for the final evaluation. Many of my team members were new to machine learning, so we decided to adapt TensorFlow's pre-modeled "easy sequence to sequence" implementation to our purpose rather than building a new network from scratch. Here is a brief overview of my project.

Problem Statement

To output a harmony line for every input melody line using deep learning through neural networks.


In music, harmonization is the chordal accompaniment to a line or melody. A general overview suggests that the analysis of music is similar to Natural Language Processing. Thus, treating music as a series of notes and splitting it into fundamental units (bars) for processing was the preliminary idea. Works like “DeepHear: Composing and harmonizing music with neural networks” motivated us to model our own deep learning network. Below are the details of how music theory works and how music can be converted into useful units, keeping in mind the aim of our project.

Music Theory

On a keyboard, the white keys represent A through G, and back to A again. However, there are also black keys, which play intermediate notes known as sharp or flat notes. It is important to note that the interval between two successive keys is always the same; this distance is known as a half-step. Naturally, two half-steps make a whole-step. For example, the F key is a whole-step apart from the G key, while the B and C keys are only a half-step apart.


The staff provides a backdrop onto which music and musical symbols can be placed. It is made of five lines, which create four spaces within the outermost two lines.

Music is notated upon the staff in a very logical way. A note increases in pitch as it moves up the staff, and decreases in pitch as it moves down the staff. A staff of music is read in the same way a line of text is read: from left to right. Therefore, notes on the left are played before notes to their right.


Clefs associate particular pitches with particular lines and spaces on a staff. For this reason, a clef is always placed at the beginning of every line of music. A note on a certain line of a staff (say, the middle line) will indicate an entirely different pitch if the clef is changed.


Treble or G Clef – This is usually used for relatively medium to high pitches. The notes from the bottom line to the top are E, G, B, D, and F, with the other notes in between.

Bass or F Clef – This is usually used for relatively medium to low pitches. The notes from the bottom line to the top are G, B, D, F, and A.


An accidental can raise or lower the pitch of any given note. The direction in which a pitch is changed, and by how much a pitch will change (in terms of half or whole-steps) depends on the accidental being used.

When an accidental is being employed, it is always placed immediately before the note whose pitch is intended to be changed.

The flat is an indicator that a pitch is to be lowered one half-step. It looks a bit like a lowercase b. The sharp is an indicator that a pitch is to be raised one half-step. It looks a bit like the “pound sign” #.

For example, the black key to the right of G sounds a pitch that is a half-step higher than G. This pitch can be called a G-sharp. Conversely, the black key to the left of G sounds a pitch a half-step lower than G, which can be called a G-flat.


A musical note will also indicate its duration. Included here are rests. Rests are the opposite of notes. When a note is notated, it indicates the performance of a certain pitch of a certain length at a certain time. When a rest is notated, it indicates silence of a certain length at a certain time.

There are quite a few symbols to indicate the length of a note. Every symbol is two times the length of the symbol that follows it and is one half the duration of the symbol preceding it.

Time Signature

Music is usually divided into measures or bars, usually containing between 2 and 8 beats. The time signature is written next to the clef.

The upper number shows how many beats are in a measure. The lower number determines what kind of note gets the beat (4 = quarter note, 8 = eighth note). Measures are separated by bar lines.


Our neural network model cannot process music files as they are; it requires vector/word inputs. So the musical information in these files has to be extracted and represented in the form of logical vectors/words. The music files in the dataset are originally in the MIDI format. They are converted into MusicXML and further parsed to obtain word files, which are in turn converted to vectors to be used directly for training the model.

  • MIDI to XML : Music in the dataset is in the form of MIDI files, but MIDI is not well suited for sheet-music information. MusicXML files, on the other hand, have a standard representation for note information and are widely used for analysing music. MIDI to XML batch conversion can be easily achieved.
  • XML to Words : In our word representation of music, each note is identified by its pitch and octave. For example, A4 has pitch ‘A’ and is of the 4th octave. As for chords, the notes in them are concatenated without spaces. Thus, a word consists of a single note/chord. The duration for which a note/chord is played is encoded by repeating the word, i.e., if the duration of a note/chord is 8 units, the word is repeated 8 times. The unit of duration is typically taken to be a 1/64th note. For example:

G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4

Finally, a single bar constitutes a single sentence.

The XML files contain tags for each of these elements, as well as a separate tag for the harmony line. These can easily be parsed in Python to produce the word files.
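As a rough illustration, the parsing step could look like the sketch below. It is deliberately simplified: it reads only `step`, `octave`, `duration`, and the `<chord/>` marker from MusicXML, ignores sharps/flats, ties, and voices, and the ‘R’ word for rests is my own convention rather than something from the project.

```python
import xml.etree.ElementTree as ET

def measure_to_words(measure_xml):
    """Turn one MusicXML <measure> into duration-repeated words."""
    events = []  # list of [word, duration] pairs
    for note in ET.fromstring(measure_xml).iter('note'):
        pitch = note.find('pitch')
        # 'R' for rests is an assumption; MusicXML marks them with <rest/>
        name = 'R' if pitch is None else \
            pitch.findtext('step') + pitch.findtext('octave')
        if note.find('chord') is not None and events:
            # extra chord members carry a <chord/> tag; concatenate
            # their pitches into the previous word ('G3' + 'D4' -> 'G3D4')
            events[-1][0] += name
        else:
            events.append([name, int(note.findtext('duration'))])
    # repeat each word once per duration unit, as described above
    return [w for w, d in events for _ in range(d)]
```

For a measure holding a G3+D4 chord of duration 2, this yields `['G3D4', 'G3D4']`, matching the word format shown above.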

  • Words to Vec using Word2Vec : Once the words are obtained, with every bar forming a sentence, they are further processed and converted to vectors using the Word2Vec model. This intuitively captures the relationships between different chords/notes and is more meaningful than a term-frequency approach. A single model is trained on both the melody and harmony lines, extracting the relationships between different notes/chords and giving a vector for every note/chord.


The proposed project can be seen as analogous to an English–French translation system. As discussed in Music Theory, a note in the melody line will sound pleasing only with some specific note in the same time interval of the harmony line. Considering a ‘bar’ to be equivalent to a sentence and a ‘note’ to be equivalent to a word, the complete melody line can be understood as a corpus of English text and the corresponding harmony line as a corpus of French text.

English–French translation has already been achieved by treating it as a sequence to sequence learning problem. Since any word can be related to the earlier context of its sentence, the training model must be capable of memorizing previous contexts. This is well achieved by RNN (Recurrent Neural Network) cells, which are known for maintaining relations with previous inputs. In simple terms, the network must be able to figure out the relation of each input word to the neighboring words of the sequence. Thus, a sequence to sequence problem is formulated as an encoder network (an RNN) cascaded with a decoder network (another RNN). The job of the encoder network is to interpret the information in the input sequence, essentially the meaning it is trying to convey. The decoder then expresses that meaning in the other sequence.

In the translation problem, the encoder network extracts the important information from the English sentence and the decoder reads that information to produce the French sentence. The units generally used are either gated recurrent units (GRUs) or long short-term memory units (LSTMs).

Projecting this onto the music problem, I passed the corpus of melody-line notes to the encoder and took the corpus of notes produced by the decoder as the harmony line. This was done by adopting the sequence to sequence learning model provided by TensorFlow, which already includes the implementations of the encoder and decoder networks discussed above.

The model was fed two corpora of text, one for the melody line and one for the harmony line, containing string representations of the notes for each “bar” of the melody and harmony lines in row-wise order.

Example of bar in melody line fed to encoder:

C4 C4 C4 C4 C4 C4 C4 C4 G3 G3 G3 G3 C3 C3 C3 C3 E4 E4 E4 E4 E4 E4 E4 E4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 C4 C4 C4 C4 C4 C4 C4 C4 G3 G3 G3 G3 C3 C3 C3 C3 E4 E4 E4 E4 E4 E4 E4 E4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4

The model first creates vocabularies for the encoder and decoder independently, based on the occurrence of new notes and their associated frequencies. These vocabularies are then used to tag each note with an integer id that is readable by the model. This sequence of integers is then given as input to the model, but with a few caveats.

  1. Handling inputs and outputs of varying lengths: In our case the input and output sizes can vary, since the number of beats in each bar may not be constant across songs. To handle this I created ‘buckets’ such that inputs shorter than the given bucket size are padded with a special ‘_PAD’ symbol. E.g., if the bucket size is (5, 5), inputs and outputs shorter than this get the pad symbol affixed: ‘G4 G4 G4 G4’ becomes ‘G4 G4 G4 G4 _PAD’. The same applies to outputs.
  2. _GO and _END tags: The decoder labels (after padding) are prefixed with a ‘_GO’ symbol to signify the start of the sentence, and ‘_END’ marks the end of the sentence.
  3. Reversing the inputs to the encoder: It has been observed that reversing the order of the inputs to the encoder gives better results, and TensorFlow implements this by default, i.e. ‘G4 G4 G4 G4 _PAD’ is input as ‘_PAD G4 G4 G4 G4’.
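The three caveats above can be sketched in a few lines (a toy version; in the real pipeline TensorFlow's seq2seq data utilities do this work):

```python
PAD, GO, END = '_PAD', '_GO', '_END'

def prepare_pair(encoder_words, decoder_words, bucket):
    """Pad, tag, and reverse one (melody bar, harmony bar) pair."""
    enc_size, dec_size = bucket
    # 1. pad the encoder input up to the bucket size ...
    enc = encoder_words + [PAD] * (enc_size - len(encoder_words))
    # 3. ... and reverse it, which empirically improves results
    enc = list(reversed(enc))
    # 2. wrap the decoder labels with _GO/_END, then pad
    dec = [GO] + decoder_words + [END]
    dec = dec + [PAD] * (dec_size - len(dec))
    return enc, dec
```

With bucket (5, 7), the melody fragment ‘G4 G4 G4 G4’ becomes ‘_PAD G4 G4 G4 G4’ on the encoder side, exactly as in point 3 above.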

The result from the decoder is again a list of integers indicating positions in the vocabulary. They are decoded back into symbols using a reverse vocabulary, a mapping from indices to symbols.
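A toy sketch of the vocabulary and reverse vocabulary (the ‘_UNK’ symbol follows the convention of TensorFlow's seq2seq data utilities; the post itself mentions only ‘_PAD’, ‘_GO’, and ‘_END’):

```python
from collections import Counter

def build_vocab(words):
    """Map notes/chords to ids: special symbols first, then notes
    ordered by frequency, mirroring how the model builds its vocabulary."""
    specials = ['_PAD', '_GO', '_END', '_UNK']
    symbols = specials + [w for w, _ in Counter(words).most_common()]
    vocab = {w: i for i, w in enumerate(symbols)}
    rev_vocab = {i: w for w, i in vocab.items()}
    return vocab, rev_vocab

vocab, rev_vocab = build_vocab('G3 G3 G3 C4 C4 E4'.split())
ids = [vocab[w] for w in 'C4 E4 G3'.split()]   # notes -> integer ids
notes = [rev_vocab[i] for i in ids]            # decoder output -> symbols
```

The decoder's integer outputs are turned back into note words by the `rev_vocab` lookup in the last line.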


I started off with polyphonic data, in which multiple notes can be played at once. These were represented as ‘C3E3G3’ and the like. Since the order is always from lower pitch to higher pitch, only distinct chords are possible in the vocabulary. Bucket sizes of (70, 70) and (100, 100) were used throughout the experiments.

The test input is,

‘C4 C4 C4 C4 C4 C4 C4 C4 G3 G3 G3 G3 C3 C3 C3 C3 E4 E4 E4 E4 E4 E4 E4 E4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 C4 C4 C4 C4 C4 C4 C4 C4 G3 G3 G3 G3 C3 C3 C3 C3 E4 E4 E4 E4 E4 E4 E4 E4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4 G3D4’.

    1. 3 layer GRU network of size 128 each, batch-size = 100
       Perplexity after 100k iterations: 1.76
    2. 3 layer GRU network of size 256 each, batch-size = 100
       Perplexity after 100k iterations: 1.65
    3. 5 layer GRU network of size 128 each, batch-size = 100
       Perplexity after 100k iterations: 1.34

(I could not run a 5 layer GRU of size 256 because the free GPU memory was insufficient)

Seeing that the results were not quite good, I restricted the vocabulary to single notes (i.e. no chords). In this case the input is:

‘C4 C4 C4 C4 C4 C4 C4 C4 G3 G3 G3 G3 C3 C3 C3 C3 E4 E4 E4 E4 E4 E4 E4 E4 D4 D4 D4 D4 D4 D4 D4 D4 C4 C4 C4 C4 C4 C4 C4 C4 G3 G3 G3 G3 C3 C3 C3 C3 E4 E4 E4 E4 E4 E4 E4 E4 D4 D4 D4 D4 D4 D4 D4 D4’.

The results are as follows:

    1. 3 layer GRU network of size 128 each, batch-size = 100 (still training)
       Perplexity after 105k iterations: 1.16

(One of the most interesting things about this output is that not only are the two harmony notes pleasing with respect to the melody, but the model has picked up the repetition in the melody and repeated the harmony accordingly)

