Posts like this are probably more common than not, especially since EC2 and similar platforms are nothing new these days, but I thought I would write a bit about my experience with EC2, as I have now been working with it for about 5 years.
Without getting into the pros and cons of the over-arching decision of whether or not to choose EC2 for project xyz, I will just say: sometimes it's a good fit, other times it's not. There are a lot of things to consider if you're going to use EC2, but for the purposes of this post I am not going to get into that.
EC2 is a wonderful technology. It can handle jobs both teeny and massively large; however, the bigger the project or application is, the more pitfalls there are. Here are some common ones I have hit over the last several years:
It is not magic.
Sometimes, people (clients, I'm looking at you) have this feeling that AWS/EC2 is a silver bullet. Well, like all things in computer land, there are no silver bullets. Just because you have the luxury of some added convenience doesn't mean you can actually take advantage of it. In fact, unless you've designed from the ground up with the whole 'cloud' paradigm (I hate the buzzwords, sorry), there is only a small chance that any of the convenience it offers, outside of setup time, is going to be much of an advantage.
It will force you out of business or into an extremely reliable architecture.
EC2 instances are not real computers. Shocker, I know. They run on top of real computers, but in my experience of working with thousands of instances, I have seen many failures, and many different kinds of failures: machines just up and disappearing as though they never existed, all kinds of I/O issues, and more traditional network/routing issues. Part of this is simply par for the course, virtualized or not, but I believe failures are quite a bit more common in EC2, and instead of driving down to a datacenter or having a NOC monkey pick up the pieces for you, you're just left sitting there, no server, no data.
The end result is that you learn from this (or don't, and suffer) and you end up designing architecture that can tolerate it. Once you finally figure this out, you stop caring about machines dying. You know in advance they are all shady. The cool thing is that the lessons learned apply back to real metal hardware, too.
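To make the "all machines are shady" mindset concrete, here is a minimal Python sketch of the kind of failover wrapper that attitude leads to; the host names and fetch function are hypothetical, not any real AWS API:

```python
def fetch_with_failover(hosts, fetch_fn, attempts_per_host=2):
    """Try each replica in turn, assuming any host can vanish at any moment.

    hosts    -- ordered list of replica hostnames (hypothetical names)
    fetch_fn -- callable that fetches data from one host, raising
                ConnectionError when the instance has disappeared
    """
    last_error = None
    for host in hosts:
        for _ in range(attempts_per_host):
            try:
                return fetch_fn(host)
            except ConnectionError as err:
                last_error = err  # treat the machine as dead and move on
    raise RuntimeError("all replicas failed") from last_error
```

The point isn't the code itself, but that every caller is written assuming the first machine it talks to may already be gone.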
It is not cheap to be inefficient, it is not reliable to be cheap.
This is a tricky one. On the one hand, it can become much too easy to start 'throwing hardware' at a problem. If you have a decent ops platform/person/button, it can be done with the click of a button or a single command. But this can add up at a furious pace if it's not tuned well. On the other hand, if you cheap out on everything and do not have fully redundant architecture, it just won't be reliable, as this is not your server, or your datacenter, or your uninterruptible power supply. The correct approach, I believe, is somewhere in the middle. You want to design the architecture correctly, so that it is highly efficient and you can afford redundancy, while at the same time not spending so much time pre-optimizing and pre-planning that you never get your product done. It always seems to be a fine line.
Local disks are superior to cheap EBS if architecture permits.
EBS (Elastic Block Storage) is basically a SAN or NAS that you can easily configure your virtual servers' filesystems to live on, so that if something happens to your virtual server and it goes away and never comes back, at least you'll have the data, and you can hook it up to a new virtual server and away you go. (Sounds much easier than it is.) However, if you do not spend the extra cash on what they refer to as provisioned IOPS (I/O operations per second), it seemingly always fails to impress. My experience has been much better with the local disks, but there is one catch: they are temporary. If there is an outage and you lose your instance, poof, there goes your data too. However, when approached correctly, with the mindset that all servers are disposable and your data will never live in only one place, this can be an acceptable shortcoming. Overall, either spend the money for provisioned IOPS (AKA the not-yet-oversubscribed SAN), or design things so that using ephemeral storage is an option. A common pattern is to keep things like master databases on ephemeral storage, and have them replicate data off to a system that is on mediocre-but-reliable normal EBS. This gives you the best of both worlds in some cases.
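As a toy illustration of the "never in only one place" rule, here is a hedged Python sketch that writes to a fast local (ephemeral-style) directory first and then mirrors to a slower durable one; the function and directory names are mine, and a real setup would use asynchronous database replication rather than file copies:

```python
import os
import shutil

def write_durably(name, data, fast_dir, durable_dir):
    """Write data to the fast local disk, then mirror it to durable storage.

    fast_dir    -- stands in for ephemeral local disk (fast, disposable)
    durable_dir -- stands in for an EBS-like volume (slower, survives)
    """
    fast_path = os.path.join(fast_dir, name)
    durable_path = os.path.join(durable_dir, name)
    with open(fast_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure it really hit the fast disk
    shutil.copy2(fast_path, durable_path)  # real systems replicate async
    return fast_path, durable_path
```

Losing the instance loses only the fast copy; the durable copy is what you rebuild from.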
If it looks and quacks like a duck, it still might only be the shady duck's cousin.
AWS provides many services besides just virtual servers, and the list is growing all the time. One example is Amazon RDS, or Relational Database Service. This product is marketed and sold to people who are interested in using MySQL in EC2, but don't want the hassle (and all the pitfalls) of running their own MySQL systems. On the surface it seems like a great idea, and if you're not running a complicated and large application, it probably suits most people pretty well. However, when you dig deep and need all the advanced and weird features of MySQL, you will find RDS falls short. Some examples just from recent experience: mysqldump not working as expected. In one case, we saw only about 10% of stored procedures get exported, and no errors were reported. I suspect this might have something to do with the magical stored procedures such as rds_kill, which AWS surely doesn't want you to see the internals of, and who knows if it's even a stored procedure; it could be some weird internal hack. Which brings me to the next point: you can't just kill a thread using KILL; you have to call the rds_kill stored procedure to kill a thread/query. Furthermore, trying to build normal replication chains the way you have been doing for years just won't work. Lastly, there are weird hiccups, such as when you change privileges: sometimes it works instantly (yes, we tried flushing), and sometimes it doesn't take effect until the connection is torn down and re-established. In other cases, the user has to be completely dropped and re-created for the new grants to work properly. The nuances do not end here, sadly.
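Here is a tiny, hedged example of the KILL difference; the mysql.rds_kill procedure name matches what we had to call in practice, but the helper itself is just my illustration:

```python
def kill_statement(thread_id, on_rds):
    """Build the SQL needed to kill a MySQL thread.

    Plain MySQL lets you issue KILL directly; RDS withholds that
    privilege and exposes a stored procedure instead.
    """
    tid = int(thread_id)  # guard: only ever interpolate a number
    if on_rds:
        return "CALL mysql.rds_kill(%d)" % tid
    return "KILL %d" % tid
```

Every ops script that assumes plain KILL has to grow a branch like this the day it meets RDS.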
Sacrifices in the name of simplicity for the masses.
Another AWS service that has gained traction is ELB, or Elastic Load Balancer. ELB was a much-welcomed product after having to manage load balancing in AWS using products like ZXTM (now called Stingray Traffic Manager) as virtual appliances. Using ZXTM inside the confines of EC2 works quite well, but things like failover feel like bastardized versions of the real ZXTM feature set outside the confines of EC2. This is due to how the network virtualization works, and the fact that you have to swap elastic IPs to get stuff done. In the real world, you have direct access to the ethernet layer, and failover works well. So, ELB was going to be awesome. The one problem is that there is virtually no configuration available on a deep level. Sure, you can configure how often you want it to check whether the servers behind it are alive, and what ports to forward where, and they even provide a decent API to automate machines joining an ELB pool. However, in comparison to a product like ZXTM/Stingray, it really falls short. The latter provides its own traffic management language in which you can describe complex rules for routing traffic around. You can modify headers, investigate parameters, and make decisions based on them. You can have multiple apps share an IP address and be separated based on request path patterns, and that is just the tip of the iceberg. With ELB, you get none of this. You get the most simple, cut-and-dried load balancing paradigm possible. Even hardware load balancers from the 90's have better feature sets. It's not all bad, though: ELB has actually proved to be the most reliable product I have used at AWS, despite its simplicity.
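To show what "none of this" means in practice, here is a rough Python sketch of the path-based pool selection that TrafficScript-style rules make trivial; the rule table and pool names are invented for illustration:

```python
def choose_pool(path, rules, default_pool):
    """Pick a backend pool based on the request path.

    rules        -- ordered (prefix, pool) pairs, first match wins
    default_pool -- used when no prefix matches
    """
    for prefix, pool in rules:
        if path.startswith(prefix):
            return pool
    return default_pool

# Hypothetical rule table: multiple apps sharing one IP address.
RULES = [("/api/", "api-pool"), ("/static/", "cdn-pool")]
```

With classic ELB you would instead need a separate load balancer (and address) per app, because there is no hook for inspecting the request at all.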
In summary, EC2 and its associated AWS services can be a blessing and a curse. I feel like working with them over the years has made me a better engineer, more capable of dealing with a wide array of problems, and as a result has made my output more consistent, more reliable, and more efficient. It has caused me to think about failures in every which way, and to accept them as part of daily life. In the end, there is more uptime to be had, I suppose.
Simply put, however, it's not for everyone, or everything. If I could go back in time, I am not sure I would have made any different decisions, as the decisions I have made have gotten me where I am; however, there have been times when, had I not chosen EC2, things would have been simpler.
Ok, that's all my ramblings on this for now. Good luck, and keep your umbrella close by!