How To Develop and Evaluate Large Deep Learning Models with Keras on Amazon Web Services

Keras is a Python deep learning library that provides easy and convenient access to the powerful numerical libraries Theano and TensorFlow.

Large deep learning models require a lot of compute time to run. You can run them on your CPU, but it can take hours or days to get a result. If you have access to a GPU on your desktop, you can drastically speed up the training time of your deep learning models.

In this post, you will discover how you can get access to GPUs to speed up the training of your deep learning models by using the Amazon Web Services (AWS) infrastructure. For less than a dollar per hour, and often a lot cheaper, you can use this service from your workstation or laptop.

Let’s get started.

  • Update Oct/2016: Updated examples for Keras 1.1.0.
  • Update Mar/2017: Updated to use new AMI, Keras 2.0.2 and TensorFlow 1.0.
Amazon Web Services
Photo by Andrew Mager, some rights reserved

Tutorial Overview

The process is quite simple because most of the work has already been done for us.

Below is an overview of the process.

  1. Setup Your AWS Account.
  2. Launch Your AWS Instance.
  3. Login and Run Your Code.
  4. Close Your AWS Instance.

Note: it costs money to use a virtual server instance on Amazon. The cost is low for ad hoc model development (e.g. less than one US dollar per hour), which is why this is so attractive, but it is not free.

The server instance runs Linux. It is desirable, although not required, that you know how to navigate Linux or a Unix-like environment. We’re just running our Python scripts, so no advanced skills are needed.

1. Setup Your AWS Account

You need an account on Amazon Web Services.

  • 1. Visit the Amazon Web Services portal and click “Sign in to the Console”. From there you can sign in using an existing Amazon account or create a new account.
AWS Sign-in Button

  • 2. You will need to provide your details as well as a valid credit card that Amazon can charge. The process is a lot quicker if you are already an Amazon customer and have your credit card on file.
AWS Sign-In Form

Once you have an account you can log into the Amazon Web Services console.

You will see a range of different services that you can access.

2. Launch Your AWS Instance

Now that you have an AWS account, you want to launch an EC2 virtual server instance on which you can run Keras.

Launching an instance is as easy as selecting the image to load and starting the virtual server. Thankfully, there is already an image available that has almost everything we need. It is called the Deep Learning AMI (Amazon Linux Version), and it was created and is maintained by Amazon. Let’s launch it as an instance.

  • 1. Login to your AWS console if you have not already.
AWS Console

  • 2. Click on EC2 to launch a new virtual server.
  • 3. Select “US West (Oregon)” from the drop-down in the top right-hand corner. This is important; otherwise, you will not be able to find the image we plan to use.
  • 4. Click the “Launch Instance” button.
  • 5. Click “Community AMIs”. An AMI is an Amazon Machine Image. It is a frozen instance of a server that you can select and instantiate on a new virtual server.
Community AMIs

  • 6. Enter “ami-dfb13ebf” (this is the current AMI ID for v2.0, but the AMI may have been updated since; check for a more recent ID) in the “Search community AMIs” search box and press Enter. You should be presented with a single result.
Search for Deep Learning AMI

  • 7. Click “Select” to choose the AMI in the search result.
  • 8. Now you need to select the hardware on which to run the image. Scroll down and select the “g2.2xlarge” hardware. This includes a GPU that we can use to significantly increase the training speed of our models.
Select g2.2xlarge Hardware

  • 9. Click “Review and Launch” to finalize the configuration of your server instance.
  • 10. Click the “Launch” button.
  • 11. Select Your Key Pair.
    • If you have a key pair because you have used EC2 before, select “Choose an existing key pair” and choose your key pair from the list. Then check “I acknowledge…”.
    • If you do not have a key pair, select the option “Create a new key pair” and enter a “Key pair name” such as keras-keypair. Click the “Download Key Pair” button.
Select Your Key Pair

  • 12. Open a Terminal and change directory to where you downloaded your key pair.
  • 13. If you have not already done so, restrict the access permissions on your key pair file. This is required as part of the SSH access to your server. For example:
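As a sketch, assuming the key pair was named keras-keypair as in step 11 and was downloaded to your Downloads folder:

```shell
# Change to the folder where the key pair was downloaded
# (the path may differ on your machine).
cd ~/Downloads

# Restrict permissions so only you can read the key;
# SSH refuses private keys that other users can read.
chmod 600 keras-keypair.pem
```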

  • 14. Click “Launch Instances”. If this is your first time using AWS, Amazon may have to validate your request and this could take up to 2 hours (often just a few minutes).
  • 15. Click “View Instances” to review the status of your instance.
Deep Learning AMI Status

Your server is now running and ready for you to log in.

3. Login, Configure and Run

Now that you have launched your server instance, it is time to log in and start using it.

  • 1. Click “View Instances” in your Amazon EC2 console if you have not already.
  • 2. Copy the “Public IP” (at the bottom of the screen under Description) to the clipboard. In this example, my IP address is 54.186.97.77. Do not use this IP address; your IP address will be different.
  • 3. Open a Terminal and change directory to where you downloaded your key pair. Login to your server using SSH, for example:
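For example, assuming the key file from step 11 and the example IP address above (the Deep Learning AMI Amazon Linux Version accepts logins as the ec2-user account):

```shell
# Connect to the instance; substitute your own public IP address.
ssh -i keras-keypair.pem ec2-user@54.186.97.77
```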

  • 4. When prompted, type “yes” and press enter.

You are now logged into your server.

Terminal Login to Deep Learning AMI

We may need to make two small changes before we can start using Keras. If the AMI has been updated since writing this, you may not need to make these changes.

This will just take a minute. You will have to make these changes each time you start the instance.

3a. Update Keras

Update to a specific version of Keras known to work on this configuration. At the time of writing, the latest version of Keras is 2.0.2. We can specify this version as part of the upgrade of Keras via pip.
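As a sketch, the pinned upgrade looks like this (assuming packages on this AMI are installed system-wide, hence sudo):

```shell
# Upgrade Keras to the specific version known to work here.
sudo pip install --upgrade keras==2.0.2
```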

You can also confirm that Keras is installed and is working correctly by typing:

You should see:
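As a sketch, the check and its typical output, assuming the 2.0.2 upgrade above was applied and the TensorFlow backend is configured:

```shell
# Print the installed Keras version from the command line.
python -c "import keras; print(keras.__version__)"
# Using TensorFlow backend.
# 2.0.2
```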

You are now free to copy-and-paste or upload your Keras Python scripts to the server and start running them.
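For example, a script can be uploaded from your workstation with secure copy; the script name here is hypothetical, and the key file and IP address follow the earlier examples:

```shell
# Copy a local script into the home directory on the instance.
scp -i keras-keypair.pem my_experiment.py ec2-user@54.186.97.77:~/
```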

You may also want to install scikit-learn:
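Again via pip, as a sketch:

```shell
# Install (or upgrade) scikit-learn system-wide on the instance.
sudo pip install --upgrade scikit-learn
```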

Looking for something to try on your new instance? See this tutorial:

4. Close Your AWS Instance

When you are finished with your work you must close your instance.

Remember, you are charged by the amount of time that you use the instance. It is cheap, but you do not want to leave an instance running if you are not using it.

  • 1. Log out of your instance at the terminal; for example, you can type:
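For example, the standard shell command:

```shell
# End the SSH session and return to your local terminal.
exit
```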

  • 2. Log in to your AWS account with your web browser.
  • 3. Click EC2.
  • 4. Click “Instances” from the left-hand side menu.
Review Your List of Running Instances

  • 5. Select your running instance from the list (it may already be selected if you only have one running instance).
Select Your Running AWS Instance

  • 6. Click the “Actions” button and select “Instance State” and choose “Terminate”. Confirm that you want to terminate your running instance.

It may take a number of seconds for the instance to close and to be removed from your list of instances.


Tips and Tricks for Using Keras on AWS

Below are some tips and tricks for getting the most out of using Keras on AWS instances.

  • Design a suite of experiments to run beforehand. Experiments can take a long time to run and you are paying for the time you use. Make time to design a batch of experiments to run on AWS. Put each in a separate file and call them in turn from another script. This will allow you to answer multiple questions from one long run, perhaps overnight.
  • Run scripts as a background process. This will allow you to close your terminal and turn off your computer while your experiment is running.

You can do that easily as follows:
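A minimal sketch, assuming your experiment script is named experiment.py (a hypothetical name):

```shell
# Run the script immune to hangups, send all output to script.log,
# and place it in the background so the terminal can be closed.
nohup python experiment.py >script.log 2>&1 &
```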

You can then check the status and results in your script.log file later. Learn more about nohup.

  • Always close your instance at the end of your experiments. You do not want to be surprised with a very large AWS bill.
  • Try spot instances for a cheaper but less reliable option. Amazon sells unused time on their hardware at a much cheaper price, but at the cost of potentially having your instance closed at any second. If you are learning or your experiments are not critical, this might be an ideal option for you. You can access spot instances from the “Spot Instance” option on the left-hand side menu in your EC2 web console.

More Resources For Deep Learning on AWS

Below is a list of resources to learn more about AWS and building deep learning in the cloud.

Summary

In this post you discovered how you can develop and evaluate your large deep learning models in Keras using GPUs on the Amazon Web Service. You learned:

  • Amazon Web Services, with its Elastic Compute Cloud, offers an affordable way to run large deep learning models on GPU hardware.
  • How to set up and launch an EC2 server for deep learning experiments.
  • How to update the Keras version on the server and confirm that the system is working correctly.
  • How to run Keras experiments on AWS instances in batch as background tasks.

Do you have any questions about running your models on AWS or about this post? Ask your questions in the comments and I will do my best to answer.


39 Responses to How To Develop and Evaluate Large Deep Learning Models with Keras on Amazon Web Services

  1. Jeremy M June 11, 2016 at 4:21 am #

    Hi Jason,

    Do you guys have any tutorials on deploying the model as a service? I am trying to allow a user to upload an image and for me to classify it with a custom classifier.

    I saw your book and I don’t think this was touched upon. I think having that would be a great capstone project to the book.

    • Jason Brownlee June 14, 2016 at 8:19 am #

      I don’t have any information at the moment on deploying a model as a service.

      Generally, you could use an MLaaS like Google Prediction, Amazon, Azure, or BigML.

      I have setup models as a service in operations, but it has always been a custom job. E.g. custom delivery of inputs and custom handling of outputs of the model.

  2. Utkarsh June 21, 2016 at 4:14 am #

    So, this is the error I get when Theano is called.

    CNMeM is enabled with initial size: 95.0% of memory, cuDNN Version is too old. Update to v5, was 3007.

    How do I solve this ?

    • Jason Brownlee June 23, 2016 at 10:26 am #

      You can ignore this error. It will not affect the examples.

  3. Mmm July 23, 2016 at 9:49 pm #

    Thanks for the great tutorial.

    1. Is this limited to the specific examples, or can I train anything on AWS?
    2. How can I upload / download data to / from the server?
    3. Are you aware of a Windows-based server configuration with Caffe and Python available?

    • Jason Brownlee July 24, 2016 at 6:40 am #

      Yes, you can use AWS to train any models you like, within reason.

      You can copy your data/code to your AWS instance using secure copy from the command line as follows:

      scp -r -i /path/to/keys yourdir ip.address:/path/

      Sorry, I don’t know about windows or windows servers.

      • Fred Mailhot August 19, 2016 at 4:14 am #

        Thanks for a helpful start-up…looking forward to putting this to work on real examples.

        Also, the root partition only has about 4GB of free space, but /mnt should have around 60GB. Probably best to put bigger data files there.

  4. Alexander August 25, 2016 at 12:08 am #

    Hi! Thanks for a great tutorial! Currently I’m training VGG16 and it trains much faster compared to my CPU – 870s vs 70s per epoch 🙂
    BTW what if not just terminate but also create a backup image and start from it each time?

  5. Linghai Li September 14, 2016 at 1:25 pm #

    How do I deploy a deep learning cluster with AWS?

    • Jason Brownlee September 14, 2016 at 1:27 pm #

      Great question Linghai. I don’t know off the top of my head.

      Let me know how you go.

  6. Anurag Priyadarshi September 17, 2016 at 8:21 pm #

    Damn! Missed it. Another reason to regularly visit your blog. I just spent the night understanding how to ensure cuDNN and Keras and all that stuff work properly on AWS. I could have just seen this blog and used the AMI.

  7. Ritchie October 14, 2016 at 3:59 pm #

    Hi,

    I linked an open-source AWS AMI to your awesome guide so beginners can install. Do let me know if it’s cool.

    https://github.com/ritchieng/tensorflow-aws-ami

    TFAMI contains Keras, TensorFlow and OpenAI Gym. It is one of the hardest combinations to install. So I decided to create an open-source AMI that is actively maintained. Feel free to recommend to beginners instead of the one that you are currently recommending.

    TFAMI is available in ALL regions.

    Cheers,
    Ritchie

    • Jason Brownlee October 15, 2016 at 10:15 am #

      Hi Ritchie,

      Great work. I’ll try out your AMI when I get time – looks really useful.

      Thanks for sharing.

    • James December 12, 2016 at 5:24 am #

      Hi Ritchie,

      Thanks for creating this. However, I tried your TFAMI.v2 (N. Virginia ami-a96634be) using a p2.xlarge AWS spot request and found that no GPU could be located on the machine (output below). Any ideas?

      >>> import tensorflow as tf
      I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
      I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.5 locally
      I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
      I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
      I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
      >>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
      E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
      I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: ip-172-31-50-145
      I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: ip-172-31-50-145
      I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.57.0
      I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: “””NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
      GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2)
      “””
      I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.48.0
      E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:296] kernel version 367.48.0 does not match DSO version 367.57.0 — cannot find working devices in this configuration
      I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.

  8. Sachin October 21, 2016 at 2:00 pm #

    Hi Jason,

    Any chance you can show us how to deploy this on a small cluster (say two GPU machines for now). Would really really appreciate that.

    Also just bought your book on DL and ML (via company). Is there extra material in the books?

    Thanks,
    Sachin

    • Jason Brownlee October 22, 2016 at 6:56 am #

      Hi Sachin,

      I don’t have an example of deploying a model to a cluster, sorry.

  9. Katya November 13, 2016 at 1:43 am #

    Thank you for a very clear tutorial, Jason. Got your books now too. It works for me with one exception – I cannot switch to g2.2xlarge; I can only select t2.micro, as I chose to sign up for the 12-month free trial. I looked high and low but could not find a way to switch to faster hardware.

    I don’t mind paying for the compute instance at the hourly rate, but I don’t want to be stuck in an expensive developer account with a monthly subscription, as I need AWS for a pet project of mine.

    Is there a way about it? I looked into Reserved Instances but there again it seemed like I would be stuck with a service for 12 months minimum.

    Thank you.

    • Hugues Fontenelle November 13, 2016 at 2:48 pm #

      Katya, I don’t know if this applies to you, but I had to specifically request a limit increase to get access to the GPU instances. See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html
      Apart from that I pay per hour without any subscription.
      Hope it helps.

    • Jason Brownlee November 14, 2016 at 7:33 am #

      Hi Katya,

      I believe you may need to contact AWS support and request access to the larger hardware. They may need to perform a quick check of your account then approve access. This is often a very fast process.

      Let me know how you go.

  10. Jenny November 30, 2016 at 4:41 am #

    Hi Jason:

    Thank you for posting this! I was wondering if Jupyter Notebook is preinstalled in this AMI or if I have to install it myself? Thanks!

    • Jason Brownlee November 30, 2016 at 7:57 am #

      I don’t know Jenny, I do not use notebooks myself, just command line.

  11. Supriya January 2, 2017 at 12:09 pm #

    Hi,
    I am running into several errors while running a Faster RCNN network on the ami-125b2c72. Which version of OpenCV do I install for this AMI? Do I need to change/downgrade the versions of CUDA and cuDNN?

    • Jason Brownlee January 3, 2017 at 7:35 am #

      Hi Supriya, I did not install or upgrade anything other than is stated in the tutorial above. No opencv, and I did not touch cuda or cudnn.

  12. Weiwei January 7, 2017 at 4:51 pm #

    When we stop the instance all the data will be erased, it seems a little annoying to redownload the datasets and configure our codes. Should we use EBS storage service? or what is your suggestion? Thanks

    • Jason Brownlee January 8, 2017 at 5:19 am #

      Correct.

      Consider using storage like S3 for datasets. Take a look at the Amazon docs, but I think there are cost advantages to storing data in the Amazon infrastructure.

  13. Anupam March 12, 2017 at 5:32 am #

    Hi Jason, First of all Sorry for Off-topic discussion,

    I would like to know, is it possible to run a Java project developed in the Eclipse environment (lots of packages were imported while developing the project) on an AWS server to get high computational speed? If yes, can you give some references for the above? I am new to Amazon AWS and I found the description a bit complicated.

    I like this tutorial, from which anyone can understand the steps very easily.

    Regards

    • Jason Brownlee March 12, 2017 at 8:29 am #

      Yes I don’t see why not. I don’t have a tutorial or resources, sorry.

  14. Nick March 17, 2017 at 10:22 am #

    Hi Jason,

    unfortunately ami-125b2c72 is not available anymore at least at us-east-1. Can you recommend the other one having Keras + GPU?

    Best,
    Nick

    • Jason Brownlee March 18, 2017 at 7:45 am #

      Sorry to hear that, I’ll look into a next best step.

      Until then, I think this one is good:
      https://github.com/ritchieng/tensorflow-aws-ami

      • Nick March 21, 2017 at 12:37 am #

        Unfortunately this one is also not available: “Latest update: our Amazon credits has expired ¯\_(ツ)_/¯ As such, we are unable to host these AMIs on all regions “. I will try the Kaggle ML image, which seems to be a working one.

        • Jason Brownlee March 21, 2017 at 8:40 am #

          Hi Nick, the AMI is still available, I have just used it myself.

          • Nick March 21, 2017 at 10:34 pm #

            Hi Jason,

            thank you for your effort, but I still cannot find it. Today, I searched the “Community AMIs” section again in the N. Virginia region for the name TFAMI.v3 and for AMI ID ami-0e969619. Did the same for N. California, checking TFAMI.v3 and ami-08451468. No AMIs found.

            I think you were able to find it because you saved it into your “My AMIs” list.

          • Jason Brownlee March 22, 2017 at 8:00 am #

            I agree Nick, it too is now gone.

            I will find a solid replacement (or make one) and update the tutorial soon.

            UPDATE: I have updated the tutorial to use a new better AMI.

  15. CK March 30, 2017 at 11:49 am #

    Hi, did you decide not to update the AMI to Keras v2.0.2? Looks like the AMI has v1.2.2.

    • Jason Brownlee March 31, 2017 at 5:50 am #

      Note the section in the tutorial where I show how to update the version of Keras.
