Running large deep learning processes on Amazon Web Services EC2 is a cheap and effective way to learn and develop models.
For just a few dollars you can get access to tens of gigabytes of RAM, tens of CPU cores, and multiple GPUs. I highly recommend it.
If you are new to EC2 or the Linux command line, there is a suite of commands that you will find invaluable when running your deep learning scripts in the cloud.
In this tutorial, you will discover my private list of the 10 commands I use every time I use EC2 to fit large deep learning models.
After reading this post, you will know:
- How to copy your data to and from your EC2 instances.
- How to set up your scripts to run for days, weeks, or months safely.
- How to monitor processes, the system, and GPU performance.
Kick-start your project with my new book Deep Learning With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
Note: All commands executed from your workstation assume you are running a Linux-type environment (e.g., Linux, OS X, or Cygwin).
Do you have any other tips, tricks, or favorite commands for running models on EC2?
Let me know in the comments below.
Overview
The commands presented in this post assume that your AWS EC2 instance is already running.
For consistency, a few other assumptions are made:
- Your server IP address is 54.218.86.47; change this to the IP address of your server instance.
- Your username is ec2-user; change this to your user name on your instance.
- Your SSH key is located in ~/.ssh/ and has the filename aws-keypair.pem; change this to your SSH key location and filename.
- You are working with Python scripts.
If you need help setting up and running a GPU-based AWS EC2 instance for deep learning, see the tutorial:
1. Log in from Your Workstation to the Server
You must log into the server before you can do anything useful.
You can log in easily using the SSH secure shell.
I recommend storing your SSH key in your ~/.ssh/ directory with a useful name. I use the name aws-keypair.pem. Remember: the file must have the permissions 600.
The following command will log you into your server instance. Remember to change the username and IP address to your relevant username and server instance IP address.
ssh -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47
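If you find that long-lived SSH sessions get dropped, one option (my own habit, not required for this tutorial) is to ask the SSH client to send periodic keep-alive messages, for example every 60 seconds:
ssh -o ServerAliveInterval=60 -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47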
2. Copy Files from Your Workstation to the Server
You copy files from your workstation to your server instance using secure copy (scp).
The example below, run on your workstation, will copy the script.py Python script from the local directory on your workstation to your server instance.
scp -i ~/.ssh/aws-keypair.pem script.py ec2-user@54.218.86.47:~/
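If you need to copy a whole directory of data files as well, scp supports a recursive copy via the -r flag. In this sketch, data/ is a hypothetical directory on your workstation:
scp -r -i ~/.ssh/aws-keypair.pem data/ ec2-user@54.218.86.47:~/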
3. Run Script as Background Process on the Server
You can run your Python script as a background process.
Further, you can run it in such a way that it will ignore signals from other processes, ignore any standard input (stdin), and forward all output and errors to a log file.
In my experience, all of this is required for long-running scripts for fitting large deep learning models.
nohup python /home/ec2-user/script.py >/home/ec2-user/script.py.log </dev/null 2>&1 &
This assumes you are running the script.py Python script located in the /home/ec2-user/ directory and that you want the output of this script forwarded to the file script.py.log located in the same directory.
Tune for your needs.
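If it helps, here is my reading of what each part of the command above does, written as shell comments:
# nohup           - ignore the hangup signal sent when the terminal closes
# >script.py.log  - send standard output (stdout) to a log file
# </dev/null      - take standard input (stdin) from /dev/null instead of the terminal
# 2>&1            - send standard error (stderr) to the same place as stdout
# &                - run the whole command in the background
nohup python /home/ec2-user/script.py >/home/ec2-user/script.py.log </dev/null 2>&1 &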
If this is your first experience with nohup, you can learn more here:
If this is your first experience with redirecting standard input (stdin), standard output (stdout), and standard error (stderr), you can learn more here:
4. Run Script on a Specific GPU on the Server
I recommend running multiple scripts at the same time, if your AWS EC2 instance can handle the load for your problem.
For example, your chosen EC2 instance may have 4 GPUs, and you could choose to run one script on each.
With CUDA, you can specify which GPU device to use with the environment variable CUDA_VISIBLE_DEVICES.
You can use the same command as above to run the script and specify which GPU device to use, as follows:
CUDA_VISIBLE_DEVICES=0 nohup python /home/ec2-user/script.py >/home/ec2-user/script.py.log </dev/null 2>&1 &
If you have 4 GPU devices on your instance, you can specify CUDA_VISIBLE_DEVICES=0 to CUDA_VISIBLE_DEVICES=3.
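For example, on a 4-GPU instance you could launch four training runs, each pinned to its own GPU and writing to its own log file. The script names here are placeholders for your own scripts:
CUDA_VISIBLE_DEVICES=0 nohup python /home/ec2-user/script1.py >/home/ec2-user/script1.py.log </dev/null 2>&1 &
CUDA_VISIBLE_DEVICES=1 nohup python /home/ec2-user/script2.py >/home/ec2-user/script2.py.log </dev/null 2>&1 &
CUDA_VISIBLE_DEVICES=2 nohup python /home/ec2-user/script3.py >/home/ec2-user/script3.py.log </dev/null 2>&1 &
CUDA_VISIBLE_DEVICES=3 nohup python /home/ec2-user/script4.py >/home/ec2-user/script4.py.log </dev/null 2>&1 &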
I expect this would work for the Theano backend, but I have only tested it with the TensorFlow backend for Keras.
You can learn more about CUDA_VISIBLE_DEVICES in the post:
5. Monitor Script Output on the Server
You can monitor the output of your script while it is running.
This may be useful if you output a score each epoch or after each algorithm run.
This example will list the last few lines of your script log file and update the output as new lines are added to the log file.
tail -f script.py.log
Amazon may aggressively close your terminal if the screen does not receive new output for a while.
An alternative is to use the watch command; I have found that Amazon will keep this terminal open:
watch "tail script.py.log"
I have found that standard output (stdout) from Python scripts does not appear to be updated frequently.
I don't know if this is an EC2 thing or a Python thing. This means you may not see the output in the log file updated often. It seems to be buffered and written out when the buffer reaches a fixed size or at the end of a run.
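One workaround, also suggested in the comments below, is to start Python with the -u flag, which disables output buffering so that printed messages reach the log file immediately:
nohup python -u /home/ec2-user/script.py >/home/ec2-user/script.py.log </dev/null 2>&1 &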
Do you know more about this?
Let me know in the comments below.
6. Monitor System and Process Performance on the Server
It is a good idea to monitor EC2 system performance, especially the amount of RAM you are using and how much you have left.
You can do this using the top command, which will update every few seconds.
top -M
You can also monitor the system and just your process, if you know its process identifier (PID).
top -p PID -M
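If you do not know the process identifier, one way to find it (assuming the script was started as in step 3 above) is to search the process list by command line:
pgrep -f script.py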
7. Monitor GPU Performance on the Server
It is a good idea to keep an eye on your GPU performance.
Specifically, keep an eye on GPU utilization and GPU RAM usage, and on which GPUs are busy if you plan on running multiple scripts in parallel.
You can use the nvidia-smi command to keep an eye on GPU usage. I like to use the watch command that keeps the terminal open and clears the screen for each new result.
watch "nvidia-smi"
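If you want the readout to refresh more or less often, watch accepts an update interval in seconds via the -n flag; for example, to refresh every five seconds:
watch -n 5 "nvidia-smi"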
8. Check What Scripts Are Still Running on the Server
It is also important to keep an eye on which scripts are still running.
You can do this with the ps command.
Again, I like to use the watch command to keep the terminal open.
watch "ps -ef | grep python"
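A small refinement, if it bothers you that the grep command shows up in its own results, is the common bracket trick; the pattern [p]ython still matches python but no longer matches the grep command line itself:
watch "ps -ef | grep [p]ython"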
9. Edit a File on the Server
I recommend not editing files on the server unless you really have to.
Nevertheless, you can edit a file in place using the vi editor.
The example below will open your script in vi.
vi ~/script.py
Of course, you can use your favorite command line editor, like emacs; this note is really for you if you are new to the Unix command line.
If this is your first exposure to vi, you can learn more here:
10. Download Files from the Server to Your Workstation
I recommend saving your model and any results and graphs explicitly to new and separate files as part of your script.
You can download these files from your server instance to your workstation using secure copy (scp).
The example below is run from your workstation and will copy all PNG files from your home directory on the server to your workstation.
scp -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47:~/*.png .
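Similarly, if your script writes everything into a single results directory on the server (the results/ directory here is hypothetical), you can pull it all down at once with a recursive copy:
scp -r -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47:~/results .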
Additional Tips and Tricks
This section lists some additional tips when working heavily on AWS EC2.
- Run multiple scripts at a time. I recommend selecting hardware that has multiple GPUs and running multiple scripts at a time to make full use of the platform.
- Write and edit scripts on your workstation only. Treat EC2 as a pseudo-production environment and only ever copy scripts and data there to run. Do all development on your workstation and write small tests of your code to ensure it will work as expected.
- Save script outputs explicitly to a file. Save results, graphs, and models to files that can be downloaded later to your workstation for analysis and application.
- Use the watch command. Amazon aggressively kills terminal sessions that have no activity. You can keep an eye on things using the watch command, which sends output frequently enough to keep the terminal open.
- Run commands from your workstation. Any of the commands listed above intended to be run on the server can also be run from your workstation by prefixing the command with "ssh -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47" and quoting the command you want to run. This can be useful to check in on processes throughout the day; see the example after this list.
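For example, to check on your running Python processes from your workstation without opening an interactive session:
ssh -i ~/.ssh/aws-keypair.pem ec2-user@54.218.86.47 "ps -ef | grep python"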
Summary
In this tutorial, you discovered the 10 commands that I use every time I am training large deep learning models on AWS EC2 instances with GPUs.
Specifically, you learned:
- How to copy your data to and from your EC2 instances.
- How to set up your scripts to run for days, weeks, or months safely.
- How to monitor processes, the system, and GPU performance.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Amazon Web Services is quite good; I am also using the platform for backup purposes. This is a really informative blog post.
Thanks, I’m glad it helped.
Hi Jason, another tool I cannot live without is tmux. It enables easy switching between programs in one terminal and keeps programs running when you exit the terminal (https://github.com/tmux/tmux/wiki).
Great, thanks Primoz!
Hey Jason,
Thanks for your post, really helpful. One service that I'm using is Amazon Batch, which will automate every training run that I want to do using spot instances. Coupling it with S3 and Docker removes the hassle of handling the code and dataset. Just a couple more DevOps steps, but I think it is worth it 🙂
Very cool, thanks for sharing Nicolas.
If you ever put together a recipe and post it on github or something, let me know.
Sounds exciting, would you please let us know how you set them up?
Could you recommend a verbose setting to the Keras functions/methods? I’ve never understood the difference between 1, 2, and 100.
Also, I've noticed that if I have a long-running Jupyter notebook running Keras code on a P2 machine, it will often get disconnected. However, if I use verbose=2, for instance, then the DL task will finish. Do you have any additional recommendations for using Jupyter notebooks with DL on AWS?
Consider not running in a notebook. Do dev on your workstation and run code on the server. Don’t dev on the server if you can help it.
Regarding the verbose flag in Keras:
0 – no output
1 – progress bar and end of epoch summary
2 – no progress bar, just end of epoch summary.
Thanks a lot for sending email to me every day. I will follow you to learn ML when I have free time.
Thanks Pony.
Start the Python executable with “python -u” to avoid buffering the output. Then you will get every message to stdout immediately.
Great tip DS, thanks!
If you start a script without nohup, use the “disown” command to tell the process to ignore the hangup signal. Run “jobs” to see which job number to use with “disown”.
If you start a script without the “&”, that is, it isn’t in the background, you can “Ctrl-Z”, then type “bg” to put it in the background. “fg” will reverse this.
I don’t know any way to reattach stdout/stderr after a process has been started.
Interesting approach, thanks for sharing DS!
Your posts are always spot on. A truly complete resource to self-taught Machine Learning proficiency. Thanks.
Thanks.
Hi Dr. Jason,
I am a total newbie to AWS and the cloud. Could you please advise whether:
1) We could run .py scripts on AWS and point the data source (CSV files) to my local machine?
2) Or is there a way to upload all my CSV files to AWS and retrieve them?
3) What are the commands for retrieving the data?
Your books have helped me a lot, thank you and please keep up good work!
Technically yes, but don’t do that. Have all data local to the model if you can.
Yes, you can upload the CSV. I show how in the post above.
Again, I show how to download in the post above.
I think ssh is (or has to become) trivial for this audience. I would suggest enriching that part with how to configure the “.ssh/config” for easier and more secure access to remote Linux servers.
Great suggestion.
It depends; many of my readers still seem to have problems with running scripts on the command line.
Thanks for the article. Finally understood the meaning of nohup.
You’re welcome.
I’m happy to hear that.
Hi Jason,
Excellent summary.
Just regarding 9. Edit a File on the Server: I recommend using nano instead of vi for users who are not too used to vi. Nano is easier to use.
Thanks.
Great suggestion.
Amazing help here in one article! Thank you!
You’re welcome, I’m happy it helps.
Jason, I am your official fan! Thank you for the great contributions you are making to the developers' world!
You’re welcome!
Thank you very much for the informative post!
You’re welcome.