Not long ago, Apache™ Hadoop® (a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models) emerged as a solution to big data challenges. However, it came with inherent performance and latency issues, because Hadoop is designed for batch processing and not for real-time queries. Another challenge with Hadoop is that it requires thinking in terms of ‘MapReduce’, which was a deterrent for SQL engineers. To address this, a data warehouse system called Hive was introduced. Hive wrapped the MapReduce nitty-gritty in an SQL-like interface, the Hive Query Language. However, this did not resolve the inherent issue with Hadoop’s MapReduce approach, i.e. latency.
As a result of these challenges, open-source tools such as Spark, Impala and HAWQ emerged, using various techniques to reduce the latency associated with batch-based Hadoop jobs. Shark, which runs on Spark, is one such tool; it speeds up both in-memory and on-disk queries. Impala, another such tool, works well with Hive/HDFS and resembles traditional parallel databases.
With our passion for technology, we at Tavant have tested these emerging solutions to evaluate their performance in real-world cases.
Given below is our analysis of Shark:
We simulated a total of six ad servers, each producing structured logs that capture the details of ad requests and deliveries. We generated 4 million requests per hour per ad server, which brought the log size on each server to 125 MB per hour. We then set up two clusters: one with Hadoop/Hive and one with Spark/Shark. The same machine configuration was used for both clusters: OS: Ubuntu 12.04 LTS; RAM: 2 GB; number of nodes: 2.
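As a rough sanity check on the data volume described above, the per-server figures can be scaled to all six servers. This is only a back-of-the-envelope sketch: the function name is ours, and it assumes 1 MB = 1,000,000 bytes and that all six servers' logs are loaded for the same hour.

```python
# Figures taken from the test setup described above.
SERVERS = 6
REQUESTS_PER_SERVER_PER_HOUR = 4_000_000
LOG_MB_PER_SERVER_PER_HOUR = 125


def hourly_volume(servers=SERVERS,
                  requests_per_server=REQUESTS_PER_SERVER_PER_HOUR,
                  mb_per_server=LOG_MB_PER_SERVER_PER_HOUR):
    """Return (total requests, total MB, average bytes per request) for one hour.

    Assumption: 1 MB is treated as 1,000,000 bytes.
    """
    total_requests = servers * requests_per_server
    total_mb = servers * mb_per_server
    bytes_per_request = mb_per_server * 1_000_000 / requests_per_server
    return total_requests, total_mb, bytes_per_request


if __name__ == "__main__":
    total_requests, total_mb, bytes_per_request = hourly_volume()
    print(f"Requests per hour (all servers): {total_requests:,}")
    print(f"Log volume per hour (all servers): {total_mb} MB")
    print(f"Average log size per request: {bytes_per_request} bytes")
```

Under these assumptions, one hour of logs comes to about 24 million requests and 750 MB across the six servers, which is useful context given the 2 GB of RAM per node.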
We executed a query to find out the number of requests, impressions and clicks based on the geographical location of the user.
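The exact query is not reproduced here, but its general shape can be sketched. The example below is an illustration only: the table name `ad_logs`, the columns `geo` and `event_type`, and the sample rows are all hypothetical, and Python's built-in `sqlite3` module stands in for Hive or Shark so that the snippet runs anywhere.

```python
import sqlite3

# Hypothetical sample rows: (geo, event_type). The real log schema is not shown here.
SAMPLE_LOGS = [
    ("US", "request"),
    ("US", "impression"),
    ("US", "click"),
    ("US", "request"),
    ("IN", "request"),
    ("IN", "impression"),
    ("DE", "request"),
]

# Count requests, impressions and clicks per geographical location.
GEO_QUERY = """
SELECT geo,
       SUM(CASE WHEN event_type = 'request' THEN 1 ELSE 0 END) AS requests,
       SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions,
       SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks
FROM ad_logs
GROUP BY geo
ORDER BY geo
"""


def count_events_by_geo(rows):
    """Load rows into an in-memory table and return per-geo counts."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE ad_logs (geo TEXT, event_type TEXT)")
        conn.executemany("INSERT INTO ad_logs VALUES (?, ?)", rows)
        return conn.execute(GEO_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for geo, requests, impressions, clicks in count_events_by_geo(SAMPLE_LOGS):
        print(geo, requests, impressions, clicks)
```

A GROUP BY aggregation of this kind scans every log record, which makes it a reasonable candidate for comparing engine execution times.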
The following infographic illustrates the execution time recorded for both the cases:
Thus, it can be inferred that, for this query and setup, Shark is superior to Hive in terms of performance.
However, we witnessed a few issues with Shark:
- The memory available to the Shark process must be sized carefully, based on the volume of data to be processed, to avoid ‘Out of Memory’ errors.
- Shark’s performance advantage over Hive is not a constant factor. Under heavy workloads, or with different queries, the gap between Shark’s and Hive’s execution times may be smaller.
Nonetheless, Shark seems a good option at this point. Future releases of Shark are expected to bring more features and upgrades.
Don’t miss our next blog: ‘Evaluation of Impala’.