From the Blogosphere
Cloudera Impala – Closing the Near Real Time Gap Working with Big Data
Building data structures and loading data
By: Bob Gourley
Nov. 27, 2012 07:05 AM
On October 24, 2012 Cloudera announced the release of Cloudera Impala and the commercial support subscription service of Cloudera Enterprise Real Time Query (RTQ). During the Hadoop World/STRATA Conference in NYC, I was invited over to see a demonstration. Impala is a SQL based Real Time Query/Ad Hoc query engine built on top of HDFS or Hbase. As I watched the demonstration unfold, I wondered if one of the remaining technology gaps in the NOSQL arsenal had been closed. What gap you ask? Near Real Time Analytics on a NOSQL stack. Working with customers across the Cyber Security customer space, not only do they face the familiar BIGDATA horsemen of the apocalypse: Volume, Velocity and Variety but one more large challenge crept in: Time (V3T). The Near Real Time Analysis/Near Real Time Analytic capability that Cloudera Impala provides is essential in many high value use cases associated with Cyber Security: comparing current activity with observed historical norms, correlation of many disparate data sources/enrichment and automated threat detection algorithms.
When the demonstration concluded, the Cloudera representatives and I discussed the potential of performing an informal independent evaluation of Cloudera Impala against some of the common Real Time/Near Real Time use cases in Cyber Security. I agreed to step up and perform an independent evaluation as well as developing a demonstration platform for FedCyber 2012 (almost three weeks hence for inquiring minds). So let us set the field: a new BETA technology, NO prior exposure to the technology or documentation, a vendor making promises, addressing a large technology gap and three weeks to implement, seemed straight forward; no pressure.
The day after I returned from the STRATA Conference, I returned to my office and provisioned four Virtual Machines in order to build the Impala demonstration. As a committer/contributor for SherpaSurfing an open source Cyber Security solution, I have an abundance of data sets, enrichment sources, Hive data structures and services. Given the amount of time and the audience for FedCyber 2012, I decided to focus on some Intrusion Detection and Netflow related use cases for the demonstration. The data sets for the demonstration included base data sets: 20 million Netflow events, 8 million Intrusion Detection System events and enrichment: Geographic, Blacklist, Whitelist and Protocol related information. Each of the selected uses cases for this demonstration is critical to the Perform Near-Real Time Network Analysis domain in Cyber Security. The name for the demonstration system was decided to be the Impala Mission Demonstration Platform (IMDP). The IMDP was implemented based on vendor recommendations with no tuning or optimization.
The IMDP effort provided me with my first opportunity to work with Cloudera Manager. Although this post is focused on Cloudera Impala I would be remiss not to mention Cloudera Manager. I have worked with Hadoop since 1.0 and built more than a few clusters over the years. I used the installation and configuration guides provided with Cloudera Impala and followed the recommendations. One of the first recommendations was use of the Cloudera Manager. Using the Cloudera Manager (CDH 4.1), I was able to roll out a four node cluster in two hours. I was able to discover the hosts, manage services and provision them in accordance with the IMDP deployment plan. The deployment plan consisted of:
The Cloudera Manager saved at least two days of effort in deploying the cluster, the tight integration with the support portal, comprehensive help and one place to work with all properties of the entire cluster and view space consumption metrics; verdict on Cloudera Manager: Cloudera masterful, bold stroke, thumbs up.
Now that the cluster build-out completed; I shifted attention to deploying and configuring the Cloudera Impala service. Using Cloudera Manager, I deployed Impala on three nodes: three instances of Impalad and one impala state store, in a matter of minutes. I completed the deployment and configuration of the Hive MetaStore. Keeping in mind this is a BETA; the documentation was complete, but fragmented on deployment and configuration (HIVE MetaStore portion); verdict on impala deployment and configuration: solid for a BETA (needs an example hive-site.xml, configuration guide needs better flow).
At this point all configuration and deployment was completed, attention turned to building data structures and loading data. I took the Data Definition Language (DDL) scripts or data structures for ten data sources and enrichment; ported them over to Hive and tested them in less than four hours. It is worthy of mention that the data sources for this demonstration are large flat tables: netflow and intrusion detection system. Cloudera Impala uses HIVE as an Extract Transform Load (ETL) engine, using Hive I defined all of the data structures in source files which were sourced using hive shell: created a database (Sherpa). Hive was then used to load data into the tables that were just created. Creating data structures in Hive was simple as usual and loading data sets was quick (20 million netflow events in 57 seconds). Logging into impala-shell, issued a refresh of the MetaStore and I was working with data. I performed verification of the data load, all data loaded and no issues were revealed. One area of potential improvement would be more comprehensive messages on load failure. Defining the data structures and loading data using Hive was nothing new; verdict: really good; easy to use, easy to load, but need to improve failed load messages.
Finally, we moved on to the most interesting stage which is using Cloudera Impala in a series of Real Time Query (RTQ) scenarios that are common across the Cyber Security customer space. The real world scenarios selected come from the perform netflow analysis set of use case(s). In each of these scenarios, the exact same queries were executed on the same cluster using Hive and then Impala against the same data structures (database and tables). In the Hive approach, we traverse the batch processing stack and with Impala we traverse the Real Time Query (RTQ) stack performing a series of analytics. In the first use case, I ran a five tuple (sip, sport, dip, dport, protocol) summary covering bytes per packet, summing bytes and packets for a 20 million event set resulted in: identical result sets, Hive 82 seconds – Impala 6 seconds. In the second use case, I performed a summary of destination ports where the source port is 80 which resulted in: identical result sets, Hive 57 seconds, Impala 5 seconds. In the third use case, I performed correlation between netflow and intrusion detection systems, correlating netflow with intrusion detection events for several hours which resulted in: identical result sets, Hive 40 seconds, Impala sub-second. Finally, for FedCyber 2012, I developed a java based situational awareness dashboard which connected to Cloudera Impala via ODBC and executed analytics performing: correlation of blacklists, Intrusion Detection, Netflow, statistical cubes for ten hours with a refresh of every five seconds without failure or issue. The ODBC implementation easily provided the ability to export data to desktop tools (using ODBC) and common BI tools as advertised. Developing and Using Cloudera Impala verdict: This is as advertised; easy to use, easy to implement on, very fast, very flexible and more than capable of running real time analytics. The Impala shell is limited but much of the demonstration work was done using result sets so it was not an impediment.
In summation, I have worked for over a decade across the vast BIGDATA technology space covering Legacy Relational Database, Data Warehouse, and NOSQL; Cloudera Impala proved more than capable of running near real time analytics and providing mission relevance to customers with a Near Real Time (NRT) requirement. Based on my initial review Cloudera Impala appears to be a bold step in closing the gap of near real time analytics on a NOSQL stack. I did encounter some minor problems, but the few problems and limitations that were encountered in this demonstration were documented and published in the known issues document so they will not be shared; none were show stoppers.
The notes, details and all of the lessons learned, data structures and the configuration guide from the demonstration are being published out on Github under SherpaSurfing in the coming days. These documents cover everything in detail and will enable developers to replicate the demonstration platform and get a jump start on Cloudera Impala. Finally, I would like to thank two contributors: Hanh Le, Robert Webb and Six3 Systems for helping me pull this off.
Cloud Expo Breaking News
Best Recent Articles on Cloud Computing & Big Data Topics
The Arlington, Virginia-based National Science Foundation has just released its "Report on Support for Cloud Computing" - in response to the America Competes Reauthorization Act of 2010, Section 524. It is an absolute must-read for all concerned with current and future research projects in Cloud Computing.
"The volume of data we're generating now from machines pales in comparison to the volume of data we'll soon generate from our own bodies," says data security expert Dave Asprey. Writing in a Trend Micro blog, Asprey - who is one of the leaders in the emerging Quantified Self movement - explains his vision of a world in which personal biometrical data is shared via the cloud.
Cloud computing has caught the attention of business leaders around the world in every industry because of its enormous transformative potential. Visionary companies know that the value of the cloud is far greater than the current focus solely on technology and operating costs: when combined with a collaborative approach to designing processes, cloud computing will change how we do business.
Want to make sense of the hottest new concept in Enterprise IT? Want to understand in just hours what experts have spent many hundreds of days deciphering? Cloud computing is a technology that has rapidly evolving peppered with a lot of hype along the way. Customers find it hard to navigate through this and make sense of what aspects of this technology will give them real business benefit. Cloud Computing Bootcamp, led by our 2013 Bootcamp Instructor Larry Carvalho, is a great way to get a practical understanding of this technology. We offer multiple days of actionable insight into what vendor offerings are currently available and help you comprehend their strategy. The ever-popular Bootcamp, which is now held regularly around the world, is being held in conjunction with the 12th Cloud Expo, June 10-13, 2013, at the Javits Center, New York, NY.
Did you know that ninety percent of the data in the world has been created in the last two years? Every day, we create 2.5 quintillion (or 2.518) bytes of data, according to IBM. As corporations across all industries globally are struggling with how to retain, aggregate and analyze this mounting volume of what the industry refers to as Big Data, it also provides a unique opportunity for innovative startups that recognize the business prospects Big Data presents. Big Data is not just unlocking new information but new sources of economic and business value. Interactivity is driving Big Data, with people and machines both consuming and creating it. Digital companies focused on becoming good at aggregating and analyzing the data created by the end users of their product, who then provide their customers with solid insights taken from that data are at a distinct competitive advantage over others in the marketplace.
Industry-specific clouds are those PaaS, IaaS, and PaaS services that are tailored for a specific vertical, such as transportation, retail, finance, and health care. IDC sees a $65 billion market in these industry solutions for 2013, rising to $100 billion in 2016. The value of industry-specific clouds is that businesses within a vertical can connect to applications, processes, and databases that are pre-defined for that vertical within a public or private cloud. They can extend processes and databases into the business domain, versus defining the data and processes within a generic cloud-based platform. So, are industry specific clouds right for your business? What options are out there? How do you figure out the ROI?
SYS-CON Events announced today that Rackspace Hosting, the open cloud company, has been named "Platinum Plus Sponsor" of SYS-CON's 12th International Cloud Expo, which will take place on June 10-13, 2013, at the Javits Center in New York City, New York. Rackspace® Hosting (NYSE: RAX) is the open cloud company, delivering open technologies and powering more than 205,000 customers worldwide. Rackspace provides its renowned Fanatical Support® across a broad portfolio of IT products, including Public Cloud, Private Cloud, Hybrid Hosting and Dedicated Hosting. Rackspace has been recognized by Bloomberg BusinessWeek as a Top 100 Performing Technology Company, is featured on Fortune's list of 100 Best Companies to Work For and is included on the Dow Jones Sustainability Index. Rackspace was positioned in the Leaders Quadrant by Gartner Inc. in the "2011 Magic Quadrant for Managed Hosting." Rackspace is headquartered in San Antonio with offices and data centers around the world.
10th International Cloud Expo, held on June 11-14, 2012 at the Javits Center in New York City, featured four content-packed days with a rich array of sessions about the business and technical value of cloud computing led by exceptional speakers from every sector of the cloud computing ecosystem. The Cloud Expo series is the fastest-growing Enterprise IT event in the past 10 years, devoted to every aspect of delivering massively scalable enterprise IT as a service. We invite you to enjoy our photo album of the show - we'll be adding new images all week.
Ulitzer.com announced "the World's 30 most influential Cloud bloggers," who collectively generated more than 24 million Ulitzer page views. Ulitzer's annual "most influential Cloud bloggers" list was announced at Cloud Expo, which drew more delegates than all other Cloud-related events put together worldwide. "The world's 50 most influential Cloud bloggers 2010" list will be announced at the Cloud Expo 2010 East, which will take place April 19-21, 2010, at the Jacob Javitz Convention Center, in New York City, with more than 5,000 expected to attend.
Cloud computing is becoming one of the next industry buzz words. It joins the ranks of terms including: grid computing, utility computing, virtualization, clustering, etc. Cloud computing overlaps some of the concepts of distributed, grid and utility computing, however it does have its own meaning if contextually used correctly. The conceptual overlap is partly due to technology changes, usages and implementations over the years. Trends in usage of the terms from Google searches shows Cloud Computing is a relatively new term introduced in the past year. There has also been a decline in general interest of Grid, Utility and Distributed computing. Likely they will be around in usage for quit a while to come. But Cloud computing has become the new buzz word driven largely by marketing and service offerings from big corporate players like Google, IBM and Amazon.
SYS-CON Events announced today that Dell Inc. has been named "Silver Sponsor" of SYS-CON's 12th International Cloud Expo, which will take place on June 10-13, 2013, at the Javits Center in New York City, New York. For more than 28 years, Dell has empowered countries, communities, customers and people everywhere to use technology to realize their dreams. Customers trust Dell to deliver technology solutions that help them do and achieve more, whether they're at home, work, school or anywhere in their world. Learn more about Dell's story, purpose and people behind its customer-centric approach.
One of the most compelling promises of the cloud is that you can pull out a credit card and be working in minutes. No purchase orders to fill out, no equipment to wait for on the loading dock. Just instant access to the resources you need, when you need them. But accessibility comes at a price, and an unintentional consequence may be that you create yet another orphaned identity silo. Enterprise IT has spent years consolidating its mishmash of directories, only to discover that cloud now threatens to turn back their hard-won victories. In his session at the 12th International Cloud Expo, Scott Morrison, CTO and Chief Architect at Layer 7 Technologies, will look at strategies to incorporate identity into cloud applications. Enterprise identity or social login can both be a part of your go-to-cloud strategy, but you must plan for this upfront, rather than try to retrofit identity and access control at a later date.
Cloud Expo, Cloud Expo East, Cloud Expo West, Cloud Expo Silicon Valley, Cloud Expo Europe, Cloud Expo Tokyo, Cloud Expo Prague, Cloud Expo Hong Kong, Cloud Expo Sao Paolo are trademarks and /or registered trademarks (USPTO serial number 85009040) of Cloud Expo, Inc.
The World's Most Influential Blogs