Deep Reinforcement Learning With Neon

This is part 2 of my series on deep reinforcement learning. See part 1, "Demystifying Deep Reinforcement Learning", for a brief introduction to reinforcement learning, up to deep Q-learning.

It was the beginning of 2014 when we first tried to replicate DeepMind's Atari results; RMSProp was still just one slide in Hinton's Coursera course. We struggled with debugging and learned a lot, but when DeepMind also published their code alongside their Nature paper, there was little reason to keep polishing our own implementation.

Supposedly a new deep learning toolkit was released once every 22 days in 2015. Amongst the popular ones are both the old-timers like Theano, Torch and Caffe, as well as the newcomers like Neon, Keras and TensorFlow. New algorithms are getting implemented within days of publishing. At some point I realized that all the complicated parts that had caused us headaches a year earlier were now readily implemented in most deep learning toolkits. And when the Arcade Learning Environment – the system used for emulating Atari 2600 games – finally received a native Python API, the time was right for a new deep reinforcement learning implementation. Writing the main code took just a weekend, followed by weeks of debugging. But finally it worked! You can see the result here: https://github.com/tambetm/simple_dqn

Currently the most notable restriction in Neon is that it only runs on the latest nVidia Maxwell GPUs, but that is expected to change. Basically all you need are Neon, the Arcade Learning Environment and simple_dqn itself. For trying out pre-trained models you don't even need a GPU. For example, to run a pre-trained model for Breakout, type ./play.sh with the path to a model file. During the game you can take over manual control, and you can give the control back to the AI with a keypress. There is also an example video of the agent playing.

Training a New Model

To train a new model, you first need an Atari 2600 ROM file for the game. Once you have the ROM, save it to the roms folder and run training like this: ./train.sh roms/pong.bin. As a result of training, the statistics of the training process are written to results/pong.csv, along with periodic model snapshots. If you would like to re-test your pre-trained model later, you can use the testing script: ./test.sh with a model snapshot. To save the test results to a file, add the --csv flag.

There is a simple plotting script to produce graphs from the statistics file, for example ./plot.sh results/pong.csv. This produces the file results/pong.png. By default it produces four plots:

- average score,
- number of played games,
- average maximum Q-value of validation set states,
- average training loss.

For all the plots you can see the random baseline (where it makes sense) and the results from the training phase and the testing phase. You can actually plot any field from the statistics file by listing the names of the fields in the --fields parameter; the default four plots simply correspond to the default --fields value. Names of the fields can be taken from the first line of the CSV file.

Visualizing the Filters

The most exciting thing you can do with this code is to peek into the mind of the AI. For that we are going to use guided backpropagation, which comes out-of-the-box with Neon. In simplified terms, for each convolutional filter it finds an image from a given dataset that activates this filter the most. Then it performs backpropagation with respect to the input image, to see which parts of the image affect the filter's activation the most. This can be seen as a form of saliency detection.

To perform filter visualization, run ./nvis.sh with a model file. The results can be found in the results folder, and you can use the mouse wheel to zoom in! There are 3 convolutional layers, and I have visualized only 2 filters from each (Feature Map 0-1). For each filter an image was chosen that activates it the most. The right image shows the original input; the left image shows the guided backpropagation result. You can think of every filter as an "eye" of the AI: the left image shows where this particular "eye" was directed.

Because the input to our network is a sequence of four grayscale images, it is not obvious how to visualize a state. I made the following simplification: I use only the last three screens of a state and put them into different RGB color channels. So everything that is gray hasn't changed over the three screens; blue marks the most recent change, then green, then red.
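This channel-packing trick is easy to reproduce outside the toolkit. Below is a minimal NumPy sketch of the idea, assuming a state is stored as a (4, 84, 84) uint8 array with the oldest screen first; the function name and shapes are my own illustration, not the exact simple_dqn code:

```python
import numpy as np

def state_to_rgb(state):
    """Pack the last three screens of a state into RGB channels.

    state: uint8 array of shape (4, 84, 84), assumed oldest screen
    first. Oldest of the three used screens -> red, middle -> green,
    newest -> blue. Pixels identical in all three screens come out
    gray, so only moving objects show up in color.
    """
    assert state.ndim == 3 and state.shape[0] >= 3
    # Move the three screens into a trailing channel axis.
    return np.stack([state[-3], state[-2], state[-1]], axis=-1)

# Usage sketch with a random "state":
if __name__ == "__main__":
    dummy = np.random.randint(0, 256, (4, 84, 84), dtype=np.uint8)
    rgb = state_to_rgb(dummy)
    print(rgb.shape)  # (84, 84, 3) -- ready for e.g. plt.imshow(rgb)
```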
You can easily follow the idea if you zoom in on the ball – its trajectory is marked in red-green-blue. It is harder to make sense of the guided backpropagation results, but sometimes you can guess that a filter tracks movement – the color progresses from red to green to blue from one corner to another.

Filter visualization is an immensely useful tool; you can immediately make interesting observations:

- The first-layer filters focus on abstract patterns and cannot be reliably associated with any particular object. They often activate the most on the score and lives, possibly because these have many edges and corners.
- As expected, there are filters that track the ball and the paddle. There are also filters that activate when the ball is about to hit a brick or the paddle.
- Also as expected, higher-layer filters have bigger receptive fields. This is not so evident in Breakout, but can be clearly seen in the corresponding visualization for Pong. It is interesting that filters in different layers are more similar to each other in Breakout than in Pong.

Guided backpropagation is implemented in Neon as a callback that runs at the end of training. I made a simple wrapper that allows using it on pre-trained models. One advantage of the wrapper is that it incorporates guided backpropagation and visualization into one step. For that I needed to make a few modifications to the Neon code, which are stored in the nvis folder.

How Does It Compare To Others?

There are a few other deep reinforcement learning implementations out there, and it would be interesting to see how the implementation in Neon compares to them. The most well-known is the original DeepMind code published with their Nature article; another maintained version is deep_q_rl. One natural measure is the average game score. To calculate it for simple_dqn I used the testing script described above. For DeepMind I used the values from the Nature paper; their scores are not collected using exactly the same protocol, so the numbers are not directly comparable. Another interesting measure is the number of training steps per second. In all cases I looked at the first epoch, where the exploration rate is close to 1, so the results reflect the training speed more than the prediction speed. All tests were done on an nVidia GeForce Titan X. I compared, for each implementation (DeepMind's code in Lua + Torch, deep_q_rl and simple_dqn), the Breakout average score, the Pong average score and the training steps per second. The learning results are not on par with DeepMind's yet, but they are close enough to run interesting experiments.

How Can You Modify It?

The main idea in publishing simple_dqn was to keep the implementation simple enough that anyone can extend it and try out new ideas. The gist of the deep reinforcement learning algorithm is implemented in four classes: Environment, ReplayMemory, DeepQNetwork and Agent. (Minimal illustrative sketches of the first three follow the DeepQNetwork notes below.) There is also main.py, which handles the command-line parameters, and a Statistics class that implements a basic callback mechanism to keep statistics collection separate from the main loop. But the gist of the algorithm lives in the aforementioned four classes.

Environment

Environment is just a lightweight wrapper around the ALE Python API. It should be easy enough to add other environments besides ALE, for example Flappy Bird or Torcs – you just have to implement a new Environment class. Give it a try and let me know!

ReplayMemory

Replay memory stores state transitions or experiences. Because consecutive states overlap in all but one screen, each screen is stored only once and states are assembled on the fly; this results in a huge decrease in memory usage, without significant loss in performance. Assembling screens into states can be done fast with NumPy array slicing. The datatype for screen pixels is uint8, so 1M experiences take 6.57GB – you can run it with only 8GB of memory! The default float32 would have taken four times that, 26.28GB. If you would like to implement prioritized experience replay, then this is the main class you need to change.

DeepQNetwork

This class implements the deep Q-network. It is actually the only class that depends on Neon. Because deep reinforcement learning handles minibatching differently, there was no reason to use Neon's DataIterator class; the lower-level Model.fprop() and Model.bprop() are used instead. A few suggestions for anybody attempting to do the same thing:

- You need to call Model.initialize() after constructing the model. This allocates the tensors for layer activations and weights in GPU memory.
- Neon tensors have dimensions (channels, height, width, batch_size). In particular, the batch size is the last dimension. This data layout allows for the fastest convolution kernels.
- After transposing the dimensions of a NumPy array to match Neon's tensor layout, you need to make a copy of that array! Otherwise the actual memory layout hasn't changed and the data cannot be copied to the GPU correctly (see the snippet after this list).
- Reuse preallocated GPU tensors where you can instead of creating new ones. This results in fewer round-trips between CPU and GPU.
- Consider doing tensor arithmetic on the GPU; the Neon backend provides nice methods for that. Also note that these operations are not immediately evaluated, but stored as an OpTree. The tensor will actually be evaluated when you use it or transfer it to the CPU.

If you would like to implement double Q-learning, then this is the class that needs modifications.
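First, the promised Environment sketch. This is a minimal illustration of the kind of interface the agent loop needs from an environment; the method names are my guesses for illustration, not necessarily the exact simple_dqn method names:

```python
class Environment:
    """Minimal interface for plugging a new game into the agent loop.

    The method names below are illustrative -- check the agent code
    for the calls it actually makes before subclassing.
    """

    def num_actions(self):
        """Return the size of the discrete action space."""
        raise NotImplementedError

    def restart(self):
        """Reset the game to a fresh episode."""
        raise NotImplementedError

    def act(self, action):
        """Apply an action, advance one step, return the reward."""
        raise NotImplementedError

    def get_screen(self):
        """Return the current screen as a 2D grayscale NumPy array."""
        raise NotImplementedError

    def is_terminal(self):
        """Return True if the current episode has ended."""
        raise NotImplementedError
```

A Flappy Bird or Torcs environment would subclass this, translate act() into game inputs and render the game screen into a grayscale array.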
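Next, the replay memory idea in code. This is a simplified sketch of the uint8 storage and state slicing described above, not the actual simple_dqn class; the layout and names are assumptions for illustration:

```python
import numpy as np

class SimpleReplayMemory:
    """Simplified replay memory: each screen is stored exactly once."""

    def __init__(self, size=1000000, history=4, dims=(84, 84)):
        self.size = size
        self.history = history
        # uint8 screens: 1M * 84 * 84 bytes is about 6.57GB;
        # float32 would take four times that, about 26.28GB.
        self.screens = np.zeros((size,) + dims, dtype=np.uint8)
        self.actions = np.zeros(size, dtype=np.uint8)
        self.rewards = np.zeros(size, dtype=np.float32)
        self.terminals = np.zeros(size, dtype=np.bool_)
        self.count = 0    # number of filled slots
        self.current = 0  # next write position (circular buffer)

    def add(self, screen, action, reward, terminal):
        # Store only the newest screen; the other screens of the
        # corresponding state are already in the buffer.
        self.screens[self.current] = screen
        self.actions[self.current] = action
        self.rewards[self.current] = reward
        self.terminals[self.current] = terminal
        self.count = max(self.count, self.current + 1)
        self.current = (self.current + 1) % self.size

    def get_state(self, index):
        # A state is just a slice of `history` consecutive screens.
        # For clarity this ignores the wrap-around and episode-boundary
        # edge cases that a real implementation must handle.
        assert self.history - 1 <= index < self.count
        return self.screens[index - self.history + 1:index + 1]

if __name__ == "__main__":
    mem = SimpleReplayMemory(size=1000)  # small size for a quick test
    for i in range(10):
        mem.add(np.zeros((84, 84), np.uint8), 0, 0.0, False)
    print(mem.get_state(9).shape)  # (4, 84, 84)
```

A real implementation additionally has to avoid sampling states that straddle episode boundaries or the current write position of the circular buffer.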
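Finally, the tensor-layout pitfall from the list above, made concrete with plain NumPy. The shapes (a minibatch of 32 four-screen states) and variable names are my own:

```python
import numpy as np

# A minibatch as NumPy usually stores it: batch axis first.
batch = np.random.randint(0, 256, (32, 4, 84, 84)).astype(np.uint8)
# Neon's layout is (channels, height, width, batch_size),
# with the batch as the LAST dimension.
chwn = np.transpose(batch, (1, 2, 3, 0))       # shape (4, 84, 84, 32)
# transpose() only rearranges strides; the bytes are still in the old
# order. Force an actual copy into the new layout before uploading:
chwn = np.ascontiguousarray(chwn, dtype=np.float32)
print(chwn.shape, chwn.flags['C_CONTIGUOUS'])  # (4, 84, 84, 32) True
```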
Agent

The Agent class just ties everything together and implements the main loop.

Conclusion

I have shown that, using a well-featured deep learning toolkit such as Neon, implementing deep reinforcement learning for Atari video games is a breeze. The filter visualization features in Neon provide important insights into what the model has actually learned. While not a goal on its own, computer games provide an excellent sandbox for trying out new reinforcement learning approaches.