Quick Summary:
| Latest modENCODE Public AMI | ami-6f15c006 |
| Latest modENCODE Genome Browser AMI | ami-c9d01ba0 |
| Latest modENCODE Public Data Snapshot | snap-21a5f844 |
What is it?
The entire modENCODE data corpus is
now available on the Amazon Web
Services EC2 cloud. What this means is that virtual machines and
virtual compute clusters that you run within the EC2 cloud can mount
the modENCODE data set in whole or in part. Your software can run analyses against
the data files directly without experiencing the long waits and logistics associated
with copying the datasets
over to your local hardware.
How does it work?
The modENCODE DCC
has created a series of EC2 snapshots of the modENCODE data sets. Each
snapshot is between 100 and 1000 GB in size, due to Amazon's 1 TB
maximum volume size. There are currently 12 such volumes; although
they are currently a mixture of C. elegans and D. melanogaster data,
we are in the process of reorganizing the volume contents into
logically related sets, and the number will change. To manage these
volumes, we have created a small "root" snapshot that contains
utilities to mount the large volumes.
There are three ways to work with the modENCODE data in the
cloud:
- Launch a custom modENCODE AMI (Amazon Machine
Image) that has the entire data set pre-mounted. This is the
most convenient way, but gives you no flexibility in choosing
the machine image or the datasets that will be loaded. The image
is derived from a vanilla Ubuntu Maverick image (10.10), 64-bit
architecture, with minimal enhancements. The main additions are
the existence of FTP and Apache servers that can be used to
browse, search and bulk download the datasets.
- Launch a machine image running the modENCODE Genome Browser. This
has exactly the same data and user interface as the
public modENCODE
genome browser, but you will have privileged access to it and can
customize and configure it.
- Launch the machine image of your choice, then attach and mount the
modENCODE data root volume. You then run a small script that locates
and mounts the other data volumes. This gives you flexibility over
which machine image to run and allows you to include only the
modENCODE datasets that you care about.
Because AWS is a commercial service, you pay for your usage. All fees
go to Amazon Web Services; neither the modENCODE project nor its
funding agency (National Institutes of Health, USA) receives
compensation for this service.
You will be paying for two main items:
- CPU usage
- You pay Amazon for every full or partial hour of virtual machine
run time you use. The fees range from a few cents per hour to a
dollar/hr, depending on the type of virtual machine you choose to
run. See AWS Pricing
for details.
- Disk storage
- You pay Amazon for every GB-month of storage you use. This means
you will be charged for every copy of the modENCODE data set you make
for use with your virtual machines. The price is roughly
$0.10/GB-month of storage, or $400/mo if you mount the entire data set
(given that the set is currently 4 TB in size). However, because it
is quick to create the volumes, you can easily create them from
snapshot as needed, run them for a few hours or days, and then dispose
of them.
If you are careful to use only what you need, AWS can be much more
convenient, and often less expensive, than working with the data
locally.
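As a rough illustration of the arithmetic above, the following shell sketch estimates charges from the figures quoted in this section. The function names are hypothetical, the $0.10/GB-month storage rate is the approximate figure given above, and the hourly rate in the usage example is a made-up placeholder; consult AWS Pricing for current numbers.

```shell
# storage_cost GB MONTHS [RATE_PER_GB_MONTH]
# Prints the estimated storage charge in dollars.
# The default rate of 0.10 is the approximate figure quoted in the text.
storage_cost() {
  awk -v gb="$1" -v months="$2" -v rate="${3:-0.10}" \
    'BEGIN { printf "%.2f\n", gb * months * rate }'
}

# compute_cost HOURS RATE_PER_HOUR
# Prints the estimated instance charge in dollars, rounding a partial
# hour up to a full hour as described in the text.
compute_cost() {
  awk -v hours="$1" -v rate="$2" \
    'BEGIN { h = int(hours); if (h < hours) h++; printf "%.2f\n", h * rate }'
}
```

For example, storage_cost 4000 1 prints 400.00 (the whole 4 TB data set kept for one month), while storage_cost 4000 0.1 prints 40.00 (the same volumes kept for about three days). compute_cost 2.5 0.34 bills three hours at the placeholder rate and prints 1.02.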
How do I get started?
Before you begin, you will need to obtain an Amazon AWS account if
you do not already have one. Select the "Sign Up Now" button on
the AWS home page.
During the account creation process, you will be given an Access
Key ID and a Secret Access Key. Please be sure to record them both,
as you will need them to mount modENCODE data snapshots (these data
will remain private and will not be seen by modENCODE staff). The
scripts do not need the X.509 credentials that certain other AWS
applications use.
Running the modENCODE AMI
Here are step-by-step instructions for running the modENCODE AMI
with all the data sets preloaded.
- Use
the AWS Console to locate public AMI
ami-6f15c006.
- Right-click on the image entry and select "Launch Instance". Alternatively, just
click on the link at the end of the previous step.
- The launch wizard will guide you through selecting the number of
instances to launch, the availability region in which to launch the
instance(s), and the SSH keypair to use for login. You may use your
default SSH keypair or create a new one.
- When you select the security group, you can decide whether you
wish to support the built-in web and FTP servers. If you wish to
do so, then create a security group that has the following rules
defined:
| Port | Source | Status |
| 20 (FTP-data) | 0.0.0.0/0 | Open |
| 21 (FTP-comm) | 0.0.0.0/0 | Open |
| 22 (SSH) | 0.0.0.0/0 | Open |
| 80 (HTTP) | 0.0.0.0/0 | Open |
| 12000-12200 (FTP PASV) | 0.0.0.0/0 | Open |
You may also defer the creation of these rules till later. Just
create a security group that has the SSH port open at the time
of instance creation. When you are ready to open up the web and
FTP ports, log into the instance and run
/modencode/bin/open_ports.pl.
- By default, when you launch the modENCODE image a brief
message is logged to modENCODE staff indicating the time and date that
the resource was used. More information about why we do this, and instructions
to disable it, can be found under
Logging Usage of the modENCODE Image.
- When the console indicates that the instance is running, ssh to
its public DNS address using your ssh key and the username "ubuntu":
ssh -i path_to_key.pem ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
See Layout of the modENCODE data image for
information on where the files are located.
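If you prefer the command line to the console, the security group rules listed in the table above could be generated with a small helper like the one below. This is only a sketch: it assumes the present-day AWS command-line interface (aws), which is not part of the modENCODE image, and it assumes the security group already exists and can be addressed by name (which the AWS CLI supports only for EC2-Classic and default-VPC groups). The function name and the group name "modencode-public" are hypothetical. The helper only prints commands so that you can review them before running anything.

```shell
# ingress_commands GROUP PORT...
# Prints one AWS CLI command per TCP port or port range, opening it to
# all addresses (0.0.0.0/0) as in the table. Nothing is executed here.
ingress_commands() {
  local group="$1" port
  shift
  for port in "$@"; do
    printf 'aws ec2 authorize-security-group-ingress --group-name %s --protocol tcp --port %s --cidr 0.0.0.0/0\n' \
      "$group" "$port"
  done
}

# Rules for the data image: FTP, SSH, HTTP and the FTP passive range.
# "modencode-public" is a hypothetical group name.
ingress_commands modencode-public 20 21 22 80 12000-12200
```

Once the printed commands look right, they can be pasted into a shell or piped to sh. Passing only the ports you need (for example, just 22) keeps the web and FTP servers unreachable until you decide to open them.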
Running the modENCODE Genome Browser
Use
the AWS Console to locate public AMI
ami-c9d01ba0.
Right-click on the image entry and select "Launch Instance". Alternatively, just
click on the link at the end of the previous step.
The launch wizard will guide you through selecting the number of
instances to launch, the availability region in which to launch the
instance(s), and the SSH keypair to use for login. You may use your
default SSH keypair or create a new one.
The free tier-eligible "t1.micro" instance type will function, but it will be
slow and some tracks may time out. For acceptable performance, we suggest you
run at least an "m1.large" type.
When you select the security group, you should assign a security
group that allows for secure shell and web access:
| Port | Source | Status |
| 22 (SSH) | 0.0.0.0/0 | Open |
| 80 (HTTP) | 0.0.0.0/0 | Open |
| 443 (HTTPS) | 0.0.0.0/0 | Open |
When the console indicates that the instance is running, cut and
paste its public DNS address into your favorite browser. After a brief
pause, the browser interface will come up. On newly-created instances,
the browser will start a bit slowly because data is still being written
into the instance's virtual disks, but it will speed up after a short
while.
To log into the instance, ssh to it using your SSH private key
and the username "ubuntu":
ssh -i path_to_key.pem ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
The browser is based on the standard Generic Genome Browser
(GBrowse). The configuration files are located under /etc/gbrowse2
(the main one is /etc/gbrowse2/GBrowse.conf), and the databases and
data files needed to run the browser are located under /modencode.
See the GBrowse 2.0 HOWTO for tips on customizing and extending the
browser software.
Attaching the modENCODE Data Root to the Instance of your Choice
If you wish to attach the modENCODE data to your own instance, there are several steps.
- Using the Amazon Console, Elastic Firefox, or the command-line
tools of your choice, launch one or more virtual machine instances
in the availability zone of your choice. While you are free to
use any Amazon Machine Image (AMI) you care to, the modENCODE data
management scripts have only been tested on Ubuntu images version
10.10 (Maverick) and 11.04 (Natty), both 32-bit and 64-bit
versions. The scripts will likely work with other Linux
distributions, although some features (such as automatic web
server management) will need tweaking. Windows images
will not work.
The Bionimbus virtual
machine is a nice starting place, as it contains pipelines for
peak calling and other common operations. The current Bionimbus
AMI is ami-efa24c86 (available in AWS US Eastern region only).
For generic Ubuntu choices, see the Ubuntu AMI Locator.
- During the instance launching process, you will be asked to select an SSH keypair for login,
as well as a security group to assign the instance to. You can use your default AWS keypair for
this. However, you may wish to create a new security group with just the SSH port open. The reason for this
is that the modENCODE image setup script will open up the FTP and HTTP ports in order to give you
browse-level access to the data, and you may not wish to have these ports opened up in your default
security group.
- Once the instance is up and running, create a volume from the modENCODE root snapshot.
With the Amazon Console or tool of your choice, locate public
snapshot snap-21a5f844.
Be sure to place your volume in the same
availability zone as the instance created in the previous step. The root snapshot is small
(1 GB) and the volume will be ready to use almost instantly.
- Attach the newly-created volume to your instance. From the
console (or other tool), find the volume you just created and attach
it to the instance. You may specify any device. The default /dev/sdf works, but we recommend
using /dev/sdf1 in order to be consistent with other mountable volumes. This step should complete
very quickly.
- Using ssh and the private key of the keypair selected when you launched the instance,
log onto the instance. Create a mount point called /modencode, and mount the
volume:
sudo mkdir /modencode
sudo mount /dev/sdf1 /modencode
- [Optional] Edit the file /modencode/DATA_SNAPSHOTS.txt. This
contains the list of data snapshots to mount as volumes. To deselect a
volume, comment it out by placing a "#" in front of the line. For example:
# snap-14031774 modENCODE D. melanogaster signal data from 5 September 2011, part 1
Currently the data is spread out among the volumes without rhyme or reason (it was done
in order to maximize disk usage efficiency), so you can't choose among functionally-significant
data sets. However, we are reorganizing the files during September 2011, at which time the
data snapshot volumes will be sorted by organism and data type.
- Now run the setup script to mount the data volumes:
/modencode/bin/setup.pl
This will install a few libraries and then configure the data
volumes. The script uses "sudo" when necessary, and you may be asked
for your login password if your instance requires you to do so.
During the snapshot mounting steps, you will also be asked to provide
your EC2 Access Key ID and Secret Key. Please enter them. You will be
given the option to save these keys to a file in your home directory
to avoid the prompts in the future. Neither the sudo password nor the
EC2 secret key ever leaves the instance, and they are not available to
modENCODE staff.
- During the setup process, the script will ask you whether you are willing
to have a log entry sent to the modENCODE staff
so that we can assess the usage of the AWS resources. This records minimal,
non-identifying information such as the time and date you mounted the dataset(s).
Simply answer "no" if you wish to skip this.
- That's it! You will find a flat listing of the data at
/modencode/data/all_files, and a hierarchically-organized set of
symbolic links (by organism, data type, and technique) at
/modencode/data.
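The optional step of editing DATA_SNAPSHOTS.txt can also be scripted. The sketch below assumes that each line of the file begins with the snapshot ID followed by whitespace and a description, as in the example line shown in that step; the function names are hypothetical and are not part of the modENCODE utilities.

```shell
# active_snapshots FILE
# Prints the IDs of the snapshots that are still selected, i.e. the
# first field of every line that is neither blank nor commented out.
active_snapshots() {
  awk 'NF > 0 && $1 !~ /^#/ { print $1 }' "$1"
}

# deselect_snapshot FILE SNAPSHOT_ID
# Comments out the line for one snapshot by prefixing it with "# ".
# The file is rewritten through a temporary copy in the same directory.
deselect_snapshot() {
  local file="$1" snap="$2"
  awk -v snap="$snap" '$1 == snap { print "# " $0; next } { print }' \
    "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

Run active_snapshots /modencode/DATA_SNAPSHOTS.txt before setup.pl to confirm which volumes will be mounted. If the file is not writable by your login, apply deselect_snapshot to a copy in your home directory and move the result into place with sudo.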
Web Services
The default install creates a data browser running on your instance,
accessible at http://instance-address/. Using the data browser, you
can quickly browse through the datasets installed on the machine, link
through to ModMine and the modENCODE genome browser. If you choose,
you may download selected datasets as .tar.gz files. In addition,
there will be an anonymous FTP server running at ftp://instance-address. You may
choose to turn one or both of these services off:
sudo /etc/init.d/apache2 stop
sudo /etc/init.d/vsftpd stop
To disable them permanently, simply rename the services' startup scripts:
sudo mv /etc/init.d/apache2 /etc/init.d/apache2.off
sudo mv /etc/init/vsftpd.conf /etc/init/vsftpd.conf.off
Layout of the modENCODE data image
Regardless of whether you launched the preconfigured modENCODE AMI or
attached the data volumes to your own instance, the layout is as
follows:
/modencode
- The root of all the data and utilities
/modencode/data
- All the data is mounted under this directory. It is also the root
for anonymous login of the image's FTP server.
/modencode/data/C.elegans, /modencode/data/D.melanogaster,
/modencode/data/D.yakuba, ...
- All the datasets, organized hierarchically by species, experiment
and datatype. These are actually symbolic links into
/modencode/data/all_files. See /modencode/data/README for
organization, filenaming scheme and data formats.
/modencode/data/all_files/cele-raw-1,
/modencode/data/all_files/cele-raw-2 ...
- These are the mountpoints for the big datasets, organized by
species and datatype.
/modencode/data/MANIFEST.txt
- This is a three-column tab-separated file that maps modENCODE accession numbers
to their filenames. The three columns are:
MODENCODE_ACCESSION,ORIGINAL_FILENAME,UNIFORM_FILENAME. The original
filename is exactly as submitted by the wet lab group, and may be
cryptic. In particular, the original filenames sometimes make
reference to a genome build, such as C. elegans WS170, on which the
data was originally mapped. However, all genome coordinates have
been updated to the most recent freeze, WS220 for worm, and R5 for
fly. The "uniform" filename is a long, but consistent name that describes the
organism, the target factor, the technique, the file format and conditions
such as developmental stage. See the README at the top level of /modencode/data
for a full description.
/modencode/data/metadata.csv
- This is a longer tab-separated file that contains metadata in addition to
the filenames. The format is described in
/modencode/data/README.
/modencode/bin
- These are Perl scripts that are used for mounting the datasets,
initializing the instance, and building the Web and FTP sites.
/modencode/htdocs
- This is the root of the image's Web server.
/modencode/release
- This directory contains files and utilities used by modENCODE
staff to create and maintain the image.
For full information about a dataset of interest, you can retrieve the full experimental protocols and other metadata
using the following URL:
http://intermine.modencode.org/release-25/keywordSearchResults.do?searchTerm=modencode_XXXX&searchSubmit=GO
Replace "XXXX" with the accession number from column 1 of either MANIFEST.txt or metadata.csv. The accession number is also present in the uniform filename.
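The lookup described above can be sketched in a few lines of shell. The example assumes that column 1 of MANIFEST.txt holds exactly the value that replaces "XXXX" in the URL and that the file has no header row; both are assumptions, so check /modencode/data/README before relying on them. The function names are hypothetical.

```shell
# Default location of the manifest, as given in the layout above.
MANIFEST="${MANIFEST:-/modencode/data/MANIFEST.txt}"

# uniform_names ACCESSION [MANIFEST_FILE]
# Prints the uniform filename (column 3) of every row whose accession
# (column 1) matches. Appending "" forces a string comparison.
uniform_names() {
  awk -F '\t' -v acc="$1" '($1 "") == (acc "") { print $3 }' "${2:-$MANIFEST}"
}

# metadata_url ACCESSION
# Prints the ModMine keyword-search URL for one accession.
metadata_url() {
  printf 'http://intermine.modencode.org/release-25/keywordSearchResults.do?searchTerm=modencode_%s&searchSubmit=GO\n' "$1"
}
```

Calling uniform_names with an accession lists the files belonging to that submission, and metadata_url with the same accession prints the address to open in a browser for the full experimental protocols.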
Logging Usage of the modENCODE Image and/or Snapshot
By default we log the first time someone attaches the modENCODE
datasets or launches an instance based on the modENCODE image. We
record only the time and date this occurred, the version of the
snapshot or image that was launched, the availability zone in which
the resource was used, and the type of machine instance that was used,
such as "m1.small". The purpose of this logging is to assess the usage
of the resource, and to justify the cost of storing these datasets
in the cloud.
If you do not wish this initial logging to occur, you can disable it
as follows:
- When launching the modENCODE AMI, pass the following line of user-data to the instance:
#!/bin/echo noregister. This will disable the registration step
which otherwise occurs.
- When attaching the modENCODE data to an existing image via the
setup.pl script,
simply answer "no" when the script asks you whether you are willing to register your usage.
Feedback Requested
If you have questions, comments or suggestions, please contact us at the link below. Thanks, and have fun!
For assistance, please contact help@modencode.org