During my work in BigDataCraft.com I saw repeating problem our customers face. The problem is how to get efficient SQL on big data in the cloud.
Lets see a typical case.
First case – daily logs of some nature arrived and stored in the S3. There is a need to do a few dozens reports over new data each day. Daily data size is from dozens to hundreds of gigabytes. There is also need to do ad-hoc reports on the older data.
What are our options?
to run Hive over S3. It will work, but slow. Transer from S3 will be big part of execution time. Hive itself is slow. Data will be transferred between S3 and EC2 instances for each query. We can do things a bit better by manually scripting to keep recent data in the local HDFS.
to run efficient engine – Cloudera Impala. Problem that it does not work with S3 directly. To go this way we will need to script moving data to local HDFS, and cleaning it up when it outgrow the local space. In other words – we will need to treat local HDFS as a cache manually.
To use Redshift. It is good option, but will also require managing moving data from s3. it is also proprietary and will lock us in the AWS. And it is expensive if we do not commit for a several years reservations.
Second case is extension of first one. We have some data accumulated in the S3. And our data science team wants to slice and dice some data to find some interesting XYZ. What should they do? Data scientists usually not so IT savvy to build and manage own hadoop cluster. Sometimes they just want to get SQLs on subset of data with interactive speeds.
So today they have the following options
To ask management to build Hadoop cluster with Impala , or Spark and ask DevOps to script data transfer , etc
They ask management for expensive MPP database and maintain modest amount of data inside, to avoid hefty per terabyte licensing
To master usage of EMR.
Just as a summary – it is very easy to store data in S3, but it is not easy to get efficient SQL over it. I
We decided to do live a bit easier for people who wants to analyze data living in the s3.
How we did it?
Conceptually we see the solution in organizing fast local drives and remote S3 into one storage. This blog post is devoted to this idea – http://bigdatacraft.com/archives/470
In order to field-test this concept we built ImpalaToGo – modification of Cloudera Impala which is capable of working directly with S3, while using local drives as a cache.
In practice it means the following advantages for the above cases:
We can define external tables over S3 data, in the same way we used to do it in Hive.
When we repeat query on the same data, it runs 3-10 times faster since data is accessed from local drives, which are much faster than s3.
When local drives are getting full – least used data will be deleted automatically, to spare you to manage it.
If you want to try it – we have pre built AMI and documentation how to organize them into cluster. It is completely free and open source (https://github.com/ImpalaToGo/ImpalaToGo/wiki/How-to-try).
If you have any questions – please write to email@example.com