Increase Amazon EC2 Reliability and Performance with RAID

May 25, 2012 at 06:00 PM

While I haven't *knock on wood* had any EBS failures in Amazon's cloud myself, I have heard the horror stories, and that makes me uneasy. Another issue with disks in the cloud that I do run into a lot is latency. Disk I/O is often slower to begin with, and random bouts of latency tend to crop up.

I have addressed both of these problems by deploying RAID 10 on my Amazon EC2 instances. It sounds techie, but you don't have to be a rocket scientist to do this. If you are managing an EC2 instance, you can do it, and I have published a script that will get you there in a few steps.

First you need to have the ec2-api-tools installed and working on a machine. This can be a server but you can also do this on your workstation. For Arch Linux users, there is a package in the AUR.

The key to getting those tools working is setting up your environment variables. I use a little script called awsenv.sh like this:

#!/bin/bash
# awsenv.sh -- export the credentials the ec2-api-tools expect

export AWS_USER_ID="0349-01234-09134"
export AWS_ACCESS_KEY_ID="BLAHDEBLAHBLAHBLAH"
export AWS_SECRET_ACCESS_KEY="somecharsthatmeansnothing"

# X.509 key pair used by the ec2-api-tools
export EC2_PRIVATE_KEY="/path/to/EC2-key.pem"
export EC2_CERT="/path/to/EC2-cert.pem"

Call it with: $ source awsenv.sh
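
One quick way to sanity-check that the tools and your credentials are wired up correctly is to list the available regions; if this prints a region list instead of an authentication error, you are good to go:

$ ec2-describe-regions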

Now you're ready to grab my script from: https://github.com/bparsons/buildec2raid

Once you have the api tools working, using the script is really easy:

Example:

$ ./buildec2raid.sh -s 1024 -z us-east-1a -i i-9i8u7y7y

This example would create a 1 TB (terabyte) array in the us-east-1a availability zone and attach it to instance i-9i8u7y7y.

The script does the basic RAID math for you. It uses 8 disks, but you can change the DISKS variable near the top of the script if you prefer another topology. I really suggest that you use RAID 10: that way you can pull a slow EBS volume out of your array and replace it without much hassle.
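
Once the array is up (see below), swapping out a sluggish volume looks roughly like this; the device name and replacement EBS volume here are placeholders, so substitute your own:

$ sudo mdadm /dev/md0 --fail /dev/xvdh3
$ sudo mdadm /dev/md0 --remove /dev/xvdh3
(attach a fresh EBS volume to the instance as /dev/xvdh3, then)
$ sudo mdadm /dev/md0 --add /dev/xvdh3

The array rebuilds onto the new volume in the background while staying online.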

Once the volumes are created and attached to the instance, you log into the instance and initialize the array:

$ sudo mdadm --create -l10 -n8 /dev/md0 /dev/xvdh*

That starts the array up. Then all you have to do is format it. Here is an XFS example:

$ sudo mkfs.xfs -l internal,lazy-count=1,size=128m -d agcount=2 /dev/md0
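
While the initial resync runs in the background, you can keep an eye on its progress with:

$ cat /proc/mdstat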

If you are new to software RAID, you will find it helpful to check out the Linux RAID Wiki.

Don't forget to add the mount point to your /etc/fstab file and create the /etc/mdadm.conf file:

# mdadm --examine --scan > /etc/mdadm.conf
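
A minimal fstab entry might look something like this; the /data mount point is just an example, so adjust it to wherever you want the array mounted:

/dev/md0    /data    xfs    defaults,noatime,nofail    0 0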

Permanent Link — Posted in Cloud Computing, Geek Tactics, Amazon Web Services

Update Amazon Route53 via Python and boto

April 18, 2012 at 08:00 AM

I wrote a Python script to update DNS on Amazon Route53. You can use it on dynamic hosts by putting it into cron, or run it at boot on cloud instances whose IP addresses change.

It uses boto, the Python interface to Amazon Web Services, for the heavy lifting. You'll need that installed (Arch Linux has a python-boto package).

You need to edit the script to place your AWS credentials in the two variables near the top of the script (awskeyid, awskeysecret). Then it's ready to go.

You can specify the hostname as an argument on the command line:

        updatedns.py myhost.mydomain.com

...or it will try to resolve the hostname itself.
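
If you are running it out of cron on a dynamic host, an entry along these lines keeps the record fresh; the path and five-minute interval are just an illustration:

        */5 * * * * /usr/local/bin/updatedns.py myhost.mydomain.com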

You can download the script here, or from GitHub.

Permanent Link — Posted in Cloud Computing, Geek Tactics, Amazon Web Services

Cloud Architecture Best Practices

August 31, 2011 at 09:33 AM

"Plan for failure" is not a new mantra when it comes to information technology. Evaluating the worst case scenario is part of defining system requirements in many organizations. The mistake that many are making when they start to implement cloud is that they don't re-evaluate their existing architecture and the economics around redundancy.

All organizations make trade-offs between cost and risk. Having a truly redundant architecture at all levels of the system is usually seen as unduly expensive. Big areas of exposure like databases and connectivity get addressed, but some risk is usually accepted.

One of the things that changes with cloud architectures is that cost-and-risk equation. Not taking that into account, combined with the assumption that failover is a built-in component of the cloud, is what leads to downtime.

Brian Heaton has published a great article, Securing Data in the Cloud, that walks through the Amazon cloud regional outage this past April. It shows contrasting examples of organizations that planned poorly and were affected and those who planned well and weren't impacted. It also lists six great rules for managing the risk of cloud outages:

1. Incorporate failover for all points in the system. Every server image should be deployable in multiple regions and data centers, so the system can keep running even if there are outages in more than one region.

2. Develop the right architecture for your software. Architectural nuances can make a huge difference to a system’s failover response. A carefully created system will keep the database in sync with a copy of the database elsewhere, allowing for a seamless failover.

3. Carefully negotiate service-level agreements. SLAs should provide reasonable compensation for the business losses you may suffer from an outage. Simply receiving prorated credit for your hosting costs during downtime won’t compensate for the costs of a large system failure.

4. Design, implement and test a disaster recovery strategy. One component of such a plan is the ability to draw on resources like failover instances, at a secondary provider. Provisions for data recovery and backup servers are also essential. Run simulations and periodic testing to ensure your plans will work.

5. In coding your software, plan for worst-case scenarios. In every part of your code, assume that the resources it needs to work might become unavailable, and that any part of the environment could go haywire. Simulate potential problems in your code, so that the software will respond correctly to cloud outages. (A minimal sketch of this idea follows the list.)

6. Keep your risks in perspective, and plan accordingly. In cases where even a brief downtime would incur massive costs or impair vital government services, multiple redundancies and split-second failover can be worth the investment, but it can be quite costly to eliminate the risk of a brief failure.
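
To make rule 5 concrete, here is a minimal bash sketch of the "assume the resource might be gone" mindset: retry an external dependency with backoff instead of treating the first failure as fatal. The endpoint and retry limits are made up for illustration.

#!/bin/bash
# Retry an external dependency with exponential backoff rather than
# assuming it is always available. Endpoint and limits are examples.
URL="https://example.com/health"
MAX_TRIES=5
DELAY=1

for ((try = 1; try <= MAX_TRIES; try++)); do
    if curl -sf "$URL" > /dev/null; then
        echo "resource reachable"
        exit 0
    fi
    echo "attempt $try failed, retrying in ${DELAY}s" >&2
    sleep "$DELAY"
    DELAY=$((DELAY * 2))
done

echo "resource still unavailable after $MAX_TRIES attempts" >&2
exit 1

The same pattern applies at every layer: database connections, queue consumers, and calls out to other services.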

Another thing I see in that article and many others is that "cloud" doesn't mean "easy". Most organizations are writing middleware to sit between their established processes and procedures and their cloud deployments. Part of this is that the cloud enforces good virtualization practices, and I suspect many IT shops have taken shortcuts here and there. There are cloud-centric projects concentrating on configuration and deployment management, but no one size fits all, so expect to do some custom development as you migrate to the cloud.

References: Securing Data in the Cloud, by Brian Heaton on Government Technology

Permanent Link — Posted in Cloud Computing

Understanding Cloud Computing Vulnerabilities

August 19, 2011 at 09:09 AM

Discussions about cloud computing security often fail to distinguish general issues from cloud-specific issues.

Here is a great overview from IEEE Security & Privacy magazine of common IT vulnerabilities and how they are affected by the new cloud paradigm.

The article starts off defining vulnerability in general and then goes on to establish the vulnerabilities that are inherent in cloud computing models.

It really boils down to access:

Of all these IAAA vulnerabilities, in the experience of cloud service providers, currently, authentication issues are the primary vulnerability that puts user data in cloud services at risk

...which is really no different from anything in traditional IT.

via InfoQ: Understanding Cloud Computing Vulnerabilities.

Permanent Link — Posted in Cloud Computing, Security

How The Cloud Changes Disaster Recovery

July 26, 2011 at 03:44 PM

[Chart: recovery time vs. cost]

Data Center Knowledge has posted a great article illuminating the effect that cloud computing is having on the economics of disaster recovery (DR) for information technology. Fast DR used to mean adding a considerable expense to your IT budget in order to "duplicate" what you already have.

With cloud technologies this is not only less expensive, it is also a great first step toward transitioning IT infrastructure into the cloud paradigm.

http://www.datacenterknowledge.com/archives/2011/07/26/how-the-cloud-changes-disaster-recovery/

Permanent Link — Posted in Cloud Computing