modENCODE on the EC2 Cloud

Quick Summary:
Latest modENCODE Public AMIami-3be94152
Latest modENCODE Genome Browser AMIami-dee738b7
Latest modENCODE Public Data Snapshotsnap-d4353ca5

What is it?

The entire modENCODE data corpus is now available on the Amazon Web Services EC2 cloud. What this means is that virtual machines and virtual compute clusters that you run within the EC2 cloud can mount the modENCODE data set in whole or in part. Your software can run analyses against the data files directly without experiencing the long waits and logistics associated with copying the datasets over to your local hardware.

How does it work?

The modENCODE DCC has created a series of EC2 snapshots of the modENCODE data sets. Each snapshot is between 100 and 1000 GB in size, due to Amazon's 1 TB maximum volume size. There are currently 12 such volumes; although they are currently a mixture of C. elegans and D. melanogaster date, we are in the process of reorganizing the volume contents into logically related sets, and the number will change. To manage these volumes, we have created a small "root" snapshot that contains utilities to mount the large volumes.

There are three ways to work with the modENCODE data in the cloud:

  1. Launch a custom modENCODE AMI (Amazon Machine Image) that has the entire data set pre-mounted. This is the most convenient way, but gives you no flexibility in choosing the machine image or the datasets that will be loaded. The image is derived from a vanilla Ubuntu Maverick image (10.10), 64-bit architecture, with minimal enhancements. The main additions are the existence of FTP and Apache servers that can be used to browse, search and bulk download the datasets.
  2. Launch a machine image running the modENCODE Genome Browser. This has exactly the same data and user interface as the public modENCODE genome browser, but you will have privileged access to it and can customize and configure it.
  3. Launch the machine image of your choice, then attach and mount the modENCODE data root volume. You then run a small script that locates and mounts the other data volumes. This gives you flexibility over which machine image to run and allows you to include only the modENCODE datasets that you care about

Because AWS is a commercial service, you pay for your usage. All fees go to Amazon Web Services; neither the modENCODE project nor its funding agency (National Institutes of Health, USA) receive compensation for this service.

You will be paying for two main items:

CPU usage
You pay Amazon for every full or partial hour of virtual machine run time you use. The fees range from a few cents per hour to a dollar/hr, depending on the type of virtual machine you choose to run. See AWS Pricing for details.
Disk storage
You pay Amazon for every GB-month of storage you use. This means you will be charged for every copy of the modENCODE data set you make for use with your virtual machines. The price is roughly $0.10/GB-month of storage, or $400/mo if you mount the entire data set (given that the set is currently 4 TB in size). However, because it is quick to create the volumes, you can easily create them from snapshot as needed, run them for a few hours or days, and then dispose of them.

If you are careful to use only what you need, AWS can be much more convenient, and often less expensive, than working with the data locally.

How do I get started?

Before you begin, you will need to obtain an Amazon AWS account if you do not already have one. Select the "Sign Up Now" button on the AWS home page.

During the account creation process, you will be given an Access Key ID and an Access Secret Key. Please be sure to record them both, as you will need them to mount modENCODE data snapshots (these data will remain private and will not be seen by modENCODE staff). The scripts do not need the X.509 credentials that certain other AWS applications use.

Running the modENCODE AMI

Here are step-by-step instructions for running the modENCODE AMI with all the data sets preloaded.

  1. Use the AWS Console to locate public AMI ami-3be94152.
  2. Right-click on the image entry and select "Launch Instance". Alternatively, just click on the link at the end of the previous step.
  3. The launch wizard will guide you through selecting the number of instances to launch, the availability region in which to launch the instance(s), and the SSH keypair to use for login. You may use your default SSH keypair or create a new one.
  4. When you select the security group, you can decide whether you wish to support the built-in web and FTP servers. If you wish to do so, then create a security group that has the following rules defined:

    20 (FTP-data)
    21 (FTP-comm)
    22 (SSH)
    80 (HTTP)
    12000-12200 (FTP PASV)

    You may also defer the creation of these rules till later. Just create a security group that has the SSH port open at the time of instance creation. When you are ready to open up the web and FTP ports, log into the instance and run /modencode/bin/

  5. By default, when you launch the modENCODE image a brief message is logged to modENCODE staff indicating the time and date that the resource was used. More information about why we do this, and instructions to disable it, can be found under Logging Usage of the modENCODE Image.
  6. When the console indicates that the instance is running ssh to its public DNS address using your ssh key and the username "ubuntu":
    ssh -i path_to_key.pem

See Layout of the modENCODE data image for information on where the files are located.

Running the modENCODE Genome Browser

  • Use the AWS Console to locate public AMI ami-dee738b7.
  • Right-click on the image entry and select "Launch Instance". Alternatively, just click on the link at the end of the previous step.
  • The launch wizard will guide you through selecting the number of instances to launch, the availability region in which to launch the instance(s), and the SSH keypair to use for login. You may use your default SSH keypair or create a new one.
  • The free tier-eligible "t1.micro" instance type will function, but it will be slow and some tracks may time out. For acceptable performance, we suggest you run at least an "m1.large" type.
  • When you select the security group, you should assign a security group that allows for secure shell and web access:

    22 (SSH)
    80 (HTTP)
    443 (HTTPS)

  • When the console indicates that the instance is running, cut and paste its public DNS address into your favorite browser. After a brief pause, the browser interface will come up. On newly-created instances, the browser will start a bit slowly because data is still being written into the instance's virtual disks, but it will speed up after a short while.
  • To log into the instance, ssh to the instance using your public key pair and the username "ubuntu":
    ssh -i path_to_key.pem
    The main browser configuration file can be found at /etc/gbrowse2/GBrowse.conf. The various databases and data files needed to run the browser are located under /modencode.

    The browser is based on the standard Generic Genome Browser (GBrowse). The configuration files are located under /etc/gbrowse2, and the datafiles are located in /modencode. See the GBrowse 2.0 HOWTO for tips on customizing and extending the browser software.

    Attaching the modENCODE Data Root to the Instance of your Choice

    If you wish to attach the modENCODE data to your own instance, there are several steps.

    1. Using the Amazon Console, Elastic Firefox, or the command-line tools of your choice, launch one or more virtual machine instances in the availability zone of your choice. While you are free to use any Amazon Machine Image (AMI) you care to, the modENCODE data management scripts have only been tested on Ubuntu images version 10.10 (Maverick) and 11.04 (Natty), both 32-bit and 64-bit versions. The scripts will likely work with other Linux distributions, although some features (such as automatic web server management) will need tweaking. Windows images will not work. The Bionimbus virtual machine is a nice starting place, as it contains pipelines for peak calling and other common operations. The current Bionimbus AMI is ami-efa24c86 (available in AWS US Eastern region only). For generic Ubuntu choices, see the Ubuntu AMI Locator.
    2. During the instance launching process, you will be asked to select an SSH keypair for login, as well as a security group to assign the instance to. You can use your default AWS keypair for this. However, you may wish to create a new security group with just the SSH port open. The reason for this is that the ModENCODE image setup script will open up the FTP and HTTP ports in order to give you browse-level access to the data, and you may not wish to have these ports opened up in your default security group.
    3. Once the instance is up and running, create a volume from the modENCODE root snapshot. With the Amazon Console or tool of your choice, locate public snapshot snap-d4353ca5. Be sure to place your volume in the same availability zone as the instance created in the previous step. The root snapshot is small (1 GB) and the volume will be ready to use almost instantly.
    4. Attach the newly-created volume to your instance. From the console (or other tool), find the volume you just created and attach it to the instance. You may specify any device. The default /dev/sdf works, but we recommend using /dev/sdf1 in order to be consistent with other mountable volumes. This step should complete very quickly.
    5. Using ssh and the private key selected in step (3), log onto the instance. Create a mount point called /modencode, and mount the volume:
      sudo mkdir /modencode
      sudo mount /dev/sdf1 /modencode
    6. [Optional] Edit the file /modencode/DATA_SNAPSHOTS.txt. This contains the list of data snapshots to mount as volumes. To deselect a volume, comment it out by placing a "#" in front of the line. For example:
      # snap-14031774	modENCODE D. melanogaster signal data from 5 September 2011, part 1
      Currently the data is spread out among the volumes without rhyme or reason (it was done in order to maximize disk usage efficiency), so you can't choose among functionally-significant data sets. However, we are reorganizing the files during September 2011, at which time the data snapshot volumes will be sorted by organism and data type.
    7. Now run the setup script to mount the data volumes:
      This will install a few libraries and then configure the data volumes. The script uses "sudo" when necessary, and you may be asked for your login password if your instance requires you to do so. During the snapshot mounting steps, you will also be asked to provide your EC2 Access Key ID and Secret Key. Please enter them. You will be given the option to save these keys to a file in your home directory to avoid the prompts in the future. Neither the sudo password, nor the EC2 secret key ever leaves the instance, and they are not available to modENCODE staff.
    8. During the setup process, the script will ask you whether you are willing to have a log entry sent to the modENCODE staff so that we can assess the usage of the AWS resources. This records minimal, non-identifying information such as the time and date you mounted the dataset(s). Simply answer "no" if you wish to skip this.
    9. That's it! You will find a flat listing of the data at /modencode/data/all_files, and a hierarchically-organized set of symbolic links (by organism, data type, and technique) at /modencode/data.

    Web Services

    The default install creates a data browser running on your instance, accessible at http://instance-address/. Using the data browser, you can quickly browse through the datasets installed on the machine, link through the ModMine, and the modENCODE genome browser. If you choose, you may download selected datasets as .tar.gz files. In addition, there will be an anonymous FTP server running at ftp://instance-address. You may choose to turn one or both of these services off:

    sudo /etc/init.d/apache2 stop
    sudo /etc/init.d/vsftpd stop

    To disable them permanently, simply rename the services' startup scripts:

    sudo mv /etc/init.d/apache2 /etc/init.d/
    sudo mv /etc/init/vsftpd.conf /etc/init/

    Layout of the modENCODE data image

    Regardless of whether you launched the preconfigured modENCODE AMI or attached the data volumes to your own instance, the layout is as follows:

    The root of all the data and utilities
    All the data is mounted under this directory. It is also the root for anonymous login of the image's FTP server.
    /modencode/data/C.elegans, /modencode/data/D.melanogaster, /modencode/data/D.yakuba, ...
    All the datasets, organized hierarchically by species, experiment and datatype. These are actually symbolic links into /modencode/data/all_files. See /modencode/data/README for organization, filenaming scheme and data formats.
    /modencode/data/all_files/cele-raw-1, /modencode/data/all_files/cele-raw-2 ...
    These are the mountpoints for the big datasets, organized by species and datatype.
    This is a three-column tab-separated file that maps modENCODE accession numbers to their files names. The three columns are: MODENCODE_ACCESSION,ORIGINAL_FILENAME,UNIFORM_FILENAME. The original filename is exactly as submitted by the web lab group, and may be cryptic. In particular, the original filenames sometimes make reference to a genome build, such as C. elegans WS170, on which the data was originally mapped. However, all genome coordinates have been updated to the most recent freeze, WS220 for worm, and R5 for fly. The "uniform" filename is a long, but consistent name that describes the organism, the target factor, the teechnique, the file format and conditions such as developmental stage. See the README at the top level of /modencode/data for a full description.
    This is a longer tab-separated file that contains metadata in addition to the filenames. The format is described in /modencode/data/README.
    These are Perl scripts that are used for mounting the datasets, initializing the instance, and building the Web and FTP sites.
    This is the root of the image's Web server.
    This directory contains files and utilities used by modENCODE staff to create and maintain the image.

    For full information about a dataset of interest, you can retrieve the full experimental protocols and other metadata using the following URL:

    Replace "XXXX" with the accession number from column 1 of either MANIFEST.txt of metadata.csv. The accession number is also present in the uniform filename.


    Logging Usage of the modENCODE Image and/or Snapshot

    By default we log the first time someone attaches the modENCODE datasets or launches an instance based on the modENCODE image. We record only the time and date this occurred, the version of the snapshot or image that was launched, the availability zone in which the resource was used, and the type of machine instance that was used, such as "m1.small". The purpose of this logging is assess the usage of the resource, and to justify the cost of storing these datasets in the cloud.

    If you do not wish this initial logging to occur, you can disable it as follows:

    1. When launching the modENCODE AMI, pass the following line of user-data to the instance: #!/bin/echo noregister. This will disable the registration step which otherwise occurs.
    2. When attaching the modENCODE data to an existing image via the script, simply answer "no" when the script asks you whether you are willing to register your usage.

    Feedback Requested

    If you have questions, comments or suggestions, please contact us at the link below. Thanks, and have fun!

    For assistance, please contact