Big Data top paying skills

According to KDnuggets, Big Data-related skills led the list of top-paying technical skills (six-figure salaries) in 2013.

The study focuses on technology professionals in the U.S. who enjoyed raises over the last year (2013).

Average U.S. tech salaries increased nearly three percent to $87,811 in 2013, up from $85,619 the previous year. Technology professionals understand they can easily find ways to grow their careers in 2014, with two-thirds of respondents (65%) confident in finding a new, better position. That overwhelming confidence, matched with declining salary satisfaction (54%, down from 57%), will keep tech-powered companies on edge about their retention strategies.

Companies are willing to pay hefty amounts to professionals with Big Data skills.

According to a report released on Jan 29, 2014, the average salary for a professional with knowledge of and experience in the R programming language was $115,531 in 2013.

Other Big Data-oriented skills such as NoSQL, MapReduce, Cassandra, Pig, Hadoop and MongoDB are among the top 10 paying skills.


Source: kdnuggets

Mongo-Hadoop Adapter 1.1

The Mongo-Hadoop Adapter 1.1 has been released. It makes it easy to use MongoDB databases, or MongoDB backup files in .bson format, as the input source or output destination for Hadoop Map/Reduce jobs. By inspecting the data and computing input splits, Hadoop can process the data in parallel, so very large datasets can be processed quickly.

The Mongo-Hadoop adapter also includes support for Pig and Hive, which allow very sophisticated MapReduce workflows to be executed just by writing very simple scripts.

  • Pig is a high-level scripting language for data analysis and for building map/reduce workflows.
  • Hive is a SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems.

Hadoop streaming is also supported, so map/reduce functions can be written in languages other than Java. Right now the Mongo-Hadoop adapter supports streaming in Ruby, Node.js and Python.

How it Works

  • The adapter examines the MongoDB Collection and calculates a set of splits from the data
  • Each of the splits gets assigned to a node in the Hadoop cluster
  • In parallel, Hadoop nodes pull data for their splits from MongoDB (or BSON) and process them locally
  • Hadoop merges results and streams output back to MongoDB or BSON
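
The four steps above can be sketched in plain Python. This is only an illustration of the split/process/merge pattern, not the adapter's actual code or API: the in-memory `collection`, the fixed `split_size`, the thread pool standing in for Hadoop nodes, and all function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_splits(documents, split_size):
    """Step 1: partition the collection into contiguous chunks (the splits)."""
    return [documents[i:i + split_size]
            for i in range(0, len(documents), split_size)]


def process_split(split):
    """Step 3: stand-in for a map task; counts documents per city in one split."""
    counts = {}
    for doc in split:
        counts[doc["city"]] = counts.get(doc["city"], 0) + 1
    return counts


def merge_results(partials):
    """Step 4: stand-in for the merge; sums the per-split counts."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged


def run_job(documents, split_size=2, workers=3):
    splits = compute_splits(documents, split_size)
    # Step 2: each split is handed to a worker (a thread here, a node in Hadoop).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_split, splits))
    return merge_results(partials)


# Hypothetical documents standing in for a MongoDB collection.
collection = [
    {"_id": 1, "city": "Paris"},
    {"_id": 2, "city": "Lyon"},
    {"_id": 3, "city": "Paris"},
    {"_id": 4, "city": "Nice"},
    {"_id": 5, "city": "Lyon"},
]

print(run_job(collection))
```

Because each split is processed independently, adding workers shortens the job without changing the merged result.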


New Search App in Hue 2.4

Hue 2.4 unleashes the power of Hadoop: in this version you can search across Hadoop data just as you would run keyword searches with Google or Yahoo! In addition, a wizard lets you tweak the result snippets and tailor the search experience to your needs.

The new Hue Search app uses the regular Solr API under the hood, yet adds a remarkable list of UI features that make searching over data stored in Hadoop a breeze. It integrates with the other Hue apps, such as File Browser, for looking at the index file in a few clicks.

A video demo shows queries and results customization. It is based on Twitter Streaming data collected with Apache Flume and indexed in real time.


Open-sourcing Parquet

Twitter and Cloudera are open-sourcing Parquet, a columnar storage format for Hadoop.

Parquet is a columnar storage format for Hadoop.

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

Parquet is built from the ground up with complex nested data structures in mind, and uses the repetition/definition level approach to encoding such data structures, as popularized by Google Dremel. We believe this approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty when possible.
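
To make the separation of encoding and compression concrete, here is a minimal Python sketch. It is not Parquet's file format or API: dictionary encoding is used only as a familiar example of an encoding, zlib as a stand-in compression codec, and the column data and function names are hypothetical.

```python
import zlib


def dictionary_encode(values):
    """Encoding step: replace each value with a small integer id."""
    dictionary = []
    index = {}
    ids = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids


def count_equal_encoded(dictionary, ids, target):
    """Operator that works on encoded data: compares ids, never decodes rows."""
    if target not in dictionary:
        return 0
    target_id = dictionary.index(target)
    return sum(1 for i in ids if i == target_id)


def compress_ids(ids):
    """Compression step, applied separately to the already-encoded ids.

    Assumes fewer than 256 distinct values so that each id fits in one byte.
    """
    return zlib.compress(bytes(ids))


def decompress_ids(blob):
    return list(zlib.decompress(blob))


# Hypothetical values of a single "country" column.
country = ["US", "FR", "US", "US", "DE", "FR"]
dictionary, ids = dictionary_encode(country)

print(dictionary, ids)
print(count_equal_encoded(dictionary, ids, "US"))
```

The equality count touches only the integer ids, which is the kind of shortcut that keeping encoding distinct from compression makes possible.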

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive, difficult-to-set-up dependencies.

The initial code defines the file format (parquet-format), provides Java building blocks for processing columnar data, and implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example of a complex integration: Input/Output formats that can convert Parquet-stored data directly to and from Thrift objects (parquet-mr).

A preview version of Parquet support will be available in Cloudera’s Impala 0.7.

With Impala’s current preview implementation, we see a roughly 10x performance improvement compared to the other supported formats. We observe this performance benefit across multiple scale factors (10GB/node, 100GB/node, 1TB/node). We believe there is still a lot of room for improvement in the implementation and we’ll share more thorough results following the 0.7 release.

Twitter is starting to convert some of its major data sources to Parquet in order to take advantage of the compression and deserialization savings.


Parquet is currently under heavy development. Parquet’s near-term roadmap includes:

  • Hive SerDes (Criteo)
  • Cascading Taps (Criteo)
  • Support for dictionary encoding, zigzag encoding, and RLE encoding of data (Cloudera and Twitter)
  • Further improvements to Pig support (Twitter)
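
For readers unfamiliar with two of the encodings named in the roadmap, the sketch below shows the general techniques in Python. It illustrates zigzag and run-length encoding as textbook algorithms and makes no claim about how Parquet lays them out on disk; the function names and the 64-bit width are assumptions for the example.

```python
def zigzag_encode(n, bits=64):
    """Map signed integers to unsigned ones so small magnitudes stay small.

    0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    """
    return (n << 1) ^ (n >> (bits - 1))


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


print([zigzag_encode(n) for n in (0, -1, 1, -2, 2)])
print(rle_encode([7, 7, 7, 3, 3, 9]))
```

Zigzag helps when a column mixes small positive and negative numbers, and run-length encoding helps when a column contains long runs of the same value, which is common in sorted or low-cardinality columnar data.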

We’ve also heard requests to provide an Avro container layer, similar to what we do with Thrift. Seeking volunteers!

We welcome all feedback, patches, and ideas. We plan to contribute Parquet to the Apache Incubator when the development is farther along.

Parquet is Copyright 2013 Twitter, Cloudera and other contributors.

Licensed under the Apache License, Version 2.0.


Hadoop more than ever before @Yahoo!

Yahoo! is more committed to Hadoop than ever before, according to its developer blog.

Hadoop at Yahoo!

In 2012, we stabilized Hadoop 0.23 (a branch very close to Hadoop 2.0, less the HDFS HA enhancements), validated hundreds of user feeds and thousands of applications, and rolled it out on tens of thousands of production nodes. The rollout is expected to complete fully in Q1 2013, and is a testimony to what we stated earlier: our commitment to pioneering new ground for Hadoop. To give you an idea, we have run over 14 million jobs on YARN (Nextgen MapReduce for Apache Hadoop) and average more than 80,000 jobs on a single cluster per day on Hadoop 0.23. In addition, we made sure that other Apache projects like Pig, Hive, Oozie, HCatalog, and HBase run on top of Hadoop 0.23. We also stood up a near real-time scalable processing and storage infrastructure in a matter of a few weeks with MapReduce/YARN, HBase, ZooKeeper, and Storm clusters to enable the next generation of Personalization and Targeting services for Yahoo!.

As the largest Hadoop user and a major open source contributor, we have continued our commitment to the advancement of Hadoop through co-hosting Hadoop Summit 2012 and sponsoring Hadoop World + Strata Conference, 2012 in NY. We continue to sponsor the monthly Bay Area Hadoop User Group meetup (HUG), one of the largest Hadoop meetups anywhere in the world, running into its fourth year now at the URL’s café of our Sunnyvale campus.


CDH 4.1.3 has been released

CDH 4.1.3 is now available. CDH (Cloudera’s Distribution, including Apache Hadoop) is Cloudera’s 100% open-source Hadoop distribution. This version is a maintenance release that fixes some key issues, including the following:

  • HBASE-7498 – Make REST server thread pool size configurable
  • HADOOP-6762 – Exception while doing RPC I/O closes channel
  • OOZIE-1130 – Upgrade from 3.2 to 3.3 failing due to change in WorkflowInstance structure
  • MAPREDUCE-2217 – The expire launching task should cover the UNASSIGNED task
  • MAPREDUCE-4907 – TrackerDistributedCacheManager issues too many getFileStatus calls
  • OOZIE-994 – ActionCheckXCommand does not handle failures properly


Gartner predicts strong Hadoop adoption for Business Intelligence and Analytics



Gartner Says Business Intelligence and Analytics Need to Scale Up to Support Explosive Growth in Data Sources

Analysts to Discuss Growth in Data Sources at Gartner Business Intelligence and Analytics Summits 2013, February 5-7 in Barcelona, February 25-26 in Sydney and March 18-20 in Grapevine, Texas


Business intelligence (BI) and analytics need to scale up to support the robust growth in data sources, according to the latest predictions from Gartner, Inc. Business intelligence leaders must embrace a broadening range of information assets to help their organizations.

“New business insights and improved decision making with greater finesse are the key benefits achievable from turning more data into actionable insights, whether that data is from an increasing array of data sources from within or outside of the organization,” said Daniel Yuen, research director at Gartner. “Different technology vendors, especially niche vendors, are rushing into the market, providing organizations with the ability to tap into this wider information base in order to make sounder strategic and prompter operational decisions.”

Gartner outlined three key predictions for BI teams to consider when planning for the future:

By 2015, 65 percent of packaged analytic applications with advanced analytics will come embedded with Hadoop.

Organizations realize the strength that Hadoop-powered analysis brings to big data programs, particularly for analyzing poorly structured data, text, behavior analysis and time-based queries. While IT organizations conduct trials over the next few years, especially with Hadoop-enabled database management system (DBMS) products and appliances, application providers will go one step further and embed purpose-built, Hadoop-based analysis functions within packaged applications. The trend is most noticeable so far with cloud-based packaged application offerings, and this will continue.

“Organizations with the people and processes to benefit from new insights will gain a competitive advantage as having the technology packaged reduces operational costs and IT skills requirements, and speeds up the time to value,” said Bill Gassman, research director at Gartner. “Technology providers will benefit by offering a more competitive product that delivers task-specific analytics directly to the intended role, and avoids a competitive situation with internally developed resources.”

By 2016, 70 percent of leading BI vendors will have incorporated natural-language and spoken-word capabilities.

BI/analytics vendors continue to be slow in providing language- and voice-enabled applications. In their rush to port their applications to mobile and tablet devices, BI vendors have tended to focus only on adapting their traditional BI point-and-click and drag-and-drop user interfaces to touch-based interfaces. Over the next few years, BI vendors are expected to start playing a quick game of catch-up with the virtual personal assistant market. Initially, BI vendors will enable basic voice commands for their standard interfaces, followed by natural language processing of spoken or text input into SQL queries. Ultimately, “personal analytic assistants” will emerge that understand user context, offer two-way dialogue, and (ideally) maintain a conversational thread.

“Many of these technologies can and will underpin these voice-enabled analytic capabilities, rather than BI vendors or enterprises themselves developing them outright,” said Douglas Laney, research vice president at Gartner.

By 2015, more than 30 percent of analytics projects will deliver insights based on structured and unstructured data.

Business analytics have largely been focused on tools, technologies and approaches for accessing, managing, storing, modeling and optimizing for analysis of structured data. This is changing as organizations strive to gain insights from new and diverse data sources. The potential business value of harnessing and acting upon insights from these new and previously untapped sources of data, coupled with the significant market hype around big data, has fueled new product development to deal with data variety across existing information management stack vendors and has spurred the entry of a flood of new approaches for relating, correlating, managing, storing and finding insights in varied data.

“Organizations are exploring and combining insights from their vast internal repositories of content — such as text and emails and (increasingly) video and audio — in addition to externally generated content such as the exploding volume of social media, video feeds, and others, into existing and new analytic processes and use cases,” said Rita Sallam, research vice president at Gartner. “Correlating, analyzing, presenting and embedding insights from structured and unstructured information together enables organizations to better personalize the customer experience and exploit new opportunities for growth, efficiencies, differentiation, innovation and even new business models.”

More detailed analysis is available in the report “Predicts 2013: Business Intelligence and Analytics Need to Scale Up to Support Explosive Growth in Data Sources.” The report is available on Gartner’s website.

Additional information and analysis on data sources will be discussed at the Gartner Business Intelligence & Analytics Summit 2013 taking place February 5-7 in Barcelona, February 25-26 in Sydney and March 18-20 in Grapevine, Texas. The Gartner BI & Analytics Summit is specifically designed to drive organizations toward analytics excellence by exploring the latest trends in BI and analytics and examining how the two disciplines relate to one another. Gartner analysts will discuss how the Nexus of Forces will impact BI and analytics, and share best practices for developing and managing successful mobile BI, analytics and master data management initiatives.

The InfoWorld Technology of the Year 2013

InfoWorld just published its Technology of the Year Award winners, and some well-known NoSQL solutions have been recognized:

  • Apache Hadoop
  • Apache Cassandra
  • Couchbase Server

CDH 4.1 has been released

The cloud scripts for Cloudera’s Distribution for Hadoop (CDH) enable you to run Hadoop on cloud providers’ clusters. CDH consists of 100% open source Apache Hadoop plus nine other open source projects from the Hadoop ecosystem. CDH is thoroughly tested and certified to integrate with the widest range of operating systems and hardware, databases and data warehouses, and business intelligence and ETL systems.


As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Hadoop and related projects, annually and then releases updates to CDH every three months. Updates primarily comprise bug fixes, but we will also add enhancements. We only include fixes or enhancements in updates that maintain compatibility, improve system stability and still allow customers and users to skip updates as they see fit.

We’re pleased to announce the availability of CDH4.1.  We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production.  CDH4.1 is an update that has a number of fixes but also a number of useful enhancements.  Among them:

  • Quorum-based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
  • Hive security and concurrency – we’ve fixed some long standing issues with running Hive.  With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication.  In addition this new Hive server supports multiple users submitting queries at the same time.
  • Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations.  Big thanks to the LinkedIn team!!!
  • Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows.  The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
  • FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink, metrics for monitoring, and a number of performance improvements.
  • Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
  • Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase.  CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.

CDH4.1 is available on all of the usual platforms and form factors. You can install it via Cloudera Manager or install the packages manually.


Jaspersoft Big Data Survey

Jaspersoft’s new big data survey includes 631 respondents from the company’s user community. The respondents come from more than fifteen countries, and the largest group (30 percent) is employed by companies with less than US$10M in revenue.

In addition, most of the participants indicated they were in technical roles. Only 6 percent of respondents specified they were business users, while 63 percent were application developers and 19 percent were either report developers or business intelligence administrators. The high number of respondents in non-management roles is important to note because there is a risk it could skew the results. The participants may have detailed knowledge of implementation details, but may lack visibility across the enterprise to all big data initiatives that are underway.

Even if participants don’t have insight into what’s swirling in executives’ heads, they are aware that the work on managing big data has started. According to the survey, twelve percent of the companies represented have already deployed a big data analytics solution. Twice as many, 24 percent, are currently implementing one, and 13 percent plan to have a project underway in the next six months. Another 13 percent are planning a project in the next 12 months. However, a significant number of organizations, 38 percent, have no immediate plans for initiating a big data project.
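
As a quick sanity check on these adoption figures, the arithmetic can be worked through in a few lines of Python. The percentages and the respondent total come from the survey as reported above; the approximate head counts assume, purely for illustration, that each respondent represents one company, which the survey does not state.

```python
RESPONDENTS = 631  # total respondents reported by the survey

# Adoption stages and their reported shares, in percent.
adoption = {
    "deployed": 12,
    "implementing": 24,
    "planned_within_6_months": 13,
    "planned_within_12_months": 13,
    "no_immediate_plans": 38,
}

total = sum(adoption.values())
active_or_planned = total - adoption["no_immediate_plans"]

# Rough head counts under the one-respondent-per-company assumption.
approx_counts = {stage: round(RESPONDENTS * pct / 100)
                 for stage, pct in adoption.items()}

print(total)              # 100
print(active_or_planned)  # 62
print(approx_counts)
```

The five categories account for all responses, and the 62 percent that have deployed, are implementing, or are planning a project is what supports the reading that a majority of organizations are beginning to invest.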

Given all of the data that suggests big data can yield enormous business benefits, why are 38 percent of the companies represented choosing not to invest in big data? Respondents cited multiple reasons, but the most prevalent response (37 percent) was that their organization only had structured relational data. The second most popular answer, 35 percent of responses, was that the organization did not understand what big data is.

Other studies have also highlighted the lack of big data skills as a barrier to organizations capitalizing on the potential of big data. Both of the top responses in Jaspersoft’s survey show this is indeed a problem, especially the top answer, since data does not have to be unstructured to be considered big data.

It is also interesting that most of the companies included in the survey are not dealing with the massive data volumes often discussed by vendors and technology evangelists. Only two percent of survey participants said their project would manage exabytes of data, and just slightly more, eight percent, dealt with petabytes. Most respondents, 78 percent, indicated their projects dealt with terabytes of data or less. Twelve percent of respondents were unsure of their project’s total estimated data volume.

E-commerce, financials and customer relationship management enterprise applications were the biggest sources of content for big data projects. Respondents could select more than one response, and 363 responses specified that data originated from one of the top three systems. Hadoop may be getting the most press for managing big data, but respondents overwhelmingly indicated (60 percent of responses) that the data for their big data analytics project was stored in a traditional relational database. Only 26 percent of responses mentioned Hadoop HDFS or HBase.


The Big Picture for Big Data

Jaspersoft’s study supports what many of the other big data studies have shown. The majority of organizations are beginning to invest in big data, but big data skills remain a significant challenge. The technology landscape is still diverse, and relational databases continue to play an important role in managing growing data volumes.