hRaven v0.9.8

The @twitterhadoop team has just released hRaven v0.9.8.

hRaven collects runtime data and statistics from MapReduce jobs running on Hadoop clusters and stores the collected job history in an easily queryable format. For jobs that are run through frameworks (Pig or Scalding/Cascading) that decompose a script or application into a DAG of MapReduce jobs for actual execution, hRaven groups job history data together by an application construct. This allows easier visualization of all of the component jobs’ execution for an application, and more comprehensive trending and analysis over time.

  • Apache HBase (0.94+) – a running HBase cluster is required for hRaven data storage
  • Apache Hadoop – hRaven currently supports collection of job data on specific versions of Hadoop:
    • CDH up to CDH3u5, Hadoop 1.x up to MAPREDUCE-1016
    • Hadoop 1.x post MAPREDUCE-1016 and Hadoop 2.0 are supported from version 0.9.4 onwards



Twemproxy v0.3.0 has been released

twemproxy v0.3.0 is out: bug fixes and support for SmartOS (Solaris) / BSD (macOS)

twemproxy (pronounced “two-em-proxy”), aka nutcracker, is a fast and lightweight proxy for the memcached and redis protocols. It was primarily built to reduce the connection count on the backend caching servers.


  • Fast.
  • Lightweight.
  • Maintains persistent server connections.
  • Keeps connection count on the backend caching servers low.
  • Enables pipelining of requests and responses.
  • Supports proxying to multiple servers.
  • Supports multiple server pools simultaneously.
  • Shards data automatically across multiple servers.
  • Implements the complete memcached ASCII and redis protocols.
  • Easy configuration of server pools through a YAML file.
  • Supports multiple hashing modes including consistent hashing and distribution.
  • Can be configured to disable nodes on failures.
  • Observability through stats exposed on a stats monitoring port.
  • Works with Linux, *BSD, OS X and Solaris (SmartOS).
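For instance, a server pool is configured through a YAML file along these lines (a sketch based on twemproxy’s documented configuration options; the pool name, addresses and timeout values here are placeholders):

```yaml
alpha:
  listen: 127.0.0.1:22121
  hash: fnv1a_64
  distribution: ketama        # consistent hashing across the pool
  auto_eject_hosts: true      # disable nodes on failures
  server_retry_timeout: 30000
  server_failure_limit: 3
  redis: true                 # speak the redis protocol (omit for memcached)
  servers:
    - 127.0.0.1:6379:1
    - 127.0.0.1:6380:1
```

Clients then connect to the proxy on port 22121 instead of to the backend servers directly, which is how the backend connection count stays low.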


More details and source code available here:

TheBigDB of facts

TheBigDB is a very loosely structured database of facts, free and open to everybody.


Through a very simple API you can browse the database and access facts such as:

  • { nodes: ["Gold", "atomic radius", "144 pm"] }
  • { nodes: ["Bill Clinton", "job", "President of the United States"], period: { from: "1993-01-20 12:00:00", to: "2001-01-20 11:59:59" } }
  • { nodes: ["Apple", "average weight", "150g"] }

That’s it. Really.

Anyone can create, upvote or downvote a statement.

There are no datatypes, namespaces, lists or domains. Just nodes, one after the other, with a simple and easy to use API to search through them.
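To illustrate the statement model (a purely local sketch mirroring the examples above; it does not reproduce the actual hosted API), each statement is just an ordered list of nodes, optionally with a validity period, so searching is plain matching:

```python
# Statements as plain dicts, shaped like the examples above.
statements = [
    {"nodes": ["Gold", "atomic radius", "144 pm"]},
    {"nodes": ["Bill Clinton", "job", "President of the United States"],
     "period": {"from": "1993-01-20 12:00:00", "to": "2001-01-20 11:59:59"}},
    {"nodes": ["Apple", "average weight", "150g"]},
]

def search(node):
    """Return every statement that contains the given node."""
    return [s for s in statements if node in s["nodes"]]

gold = search("Gold")   # -> the single "atomic radius" statement
```

Because there are no datatypes or namespaces, "Gold", "job" and "144 pm" are all just nodes; whether a node acts as a subject, predicate or value depends only on its position.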

"Taming big data": IBM's best practices for the care of big data

The infographic “Taming big data” is provided by IBM.

Certain things cannot be overlooked when dealing with data. Best practices must be instituted for the care of big data just as they have long been in small data. Before enjoying big data’s amazing analytical feats, you must first get it under control – with tools that are up to the challenge of implementing best practices in a big data world.

  • availability
  • management
  • disaster recovery
  • provisioning
  • optimization
  • backup & restore
  • security
  • governance
  • auditing
  • replication
  • virtualization
  • archiving


Hadoop more than ever before @Yahoo!

Yahoo! is committed to Hadoop more than ever before, according to their developer blog.

Hadoop at Yahoo!


In 2012, we stabilized Hadoop 0.23 (a branch very close to Hadoop 2.0, less the HDFS HA enhancements), validated hundreds of user feeds and thousands of applications, and rolled it out on tens of thousands of production nodes. The rollout is expected to complete fully in Q1 2013, and is a testament to what we stated earlier: our commitment to pioneering new ground for Hadoop. To give you an idea, we have run over 14 million jobs on YARN (Nextgen MapReduce for Apache Hadoop) and average more than 80,000 jobs per day on a single Hadoop 0.23 cluster. In addition, we made sure that the other Apache projects like Pig, Hive, Oozie, HCatalog, and HBase run on top of Hadoop 0.23. We also stood up a near real-time scalable processing and storage infrastructure in a matter of a few weeks with MapReduce/YARN, HBase, ZooKeeper, and Storm clusters to enable the next generation of Personalization and Targeting services for Yahoo!.

As the largest Hadoop user and a major open source contributor, we have continued our commitment to the advancement of Hadoop through co-hosting Hadoop Summit 2012 and sponsoring Hadoop World + Strata Conference, 2012 in NY. We continue to sponsor the monthly Bay Area Hadoop User Group meetup (HUG), one of the largest Hadoop meetups anywhere in the world, running into its fourth year now at the URL’s café of our Sunnyvale campus.


More information available at:

Probability, The Analysis of Data

Probability, The Analysis of Data – Volume 1

is a free book available online. It provides educational material in the area of data analysis.

  • The project features comprehensive coverage of all relevant disciplines including probability, statistics, computing, and machine learning.
  • The content is almost self-contained and includes mathematical prerequisites and basic computing concepts.
  • The R programming language is used to demonstrate the contents. Full code is available, facilitating reproducibility of experiments and letting readers experiment with variations of the code.
  • The presentation is mathematically rigorous, and includes derivations and proofs in most cases.
  • HTML versions are freely available on the website. Hardcopies are available at affordable prices.

Handling "schema" change in production

A situation I have often heard and read about: you start a brand new application on a NoSQL datastore, everything goes fine so far and you’re almost happy, but all of a sudden you hit a critical point: you need to change the “schema” of your application, and you’re already live, running a production system.

From this point on, you must ask yourself: is my amount of data small enough (i.e., the document count) that I can write a small conversion program and run a batch process to update all the documents in bulk?

Unfortunately it won’t always turn out this way; sometimes, given the large amount of data you’re dealing with, performing bulk batch updates isn’t feasible because of the time required and the impact on performance.

In such cases you should consider a lazy update approach: in your application, you check whether a document is in the previous schema when you read it in, and update it when you write it out again.

Over time this will eventually migrate documents from the previous schema to the new one, though it’s possible that some documents are rarely accessed and so remain in the previous schema. You must then wait for the number of documents remaining in the previous schema to be small enough that you can run batch jobs to update them.

The downside: during this conversion process you need to pay close attention to any process that operates over multiple documents; those processes may need to be rewritten as well, or at the very least carefully reviewed.
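The read path of such a lazy migration can be sketched in a few lines of Python, using plain dicts as a stand-in document store (the version field and the v1-to-v2 name-splitting change are hypothetical examples, not tied to any particular datastore):

```python
CURRENT_VERSION = 2

def upgrade(doc):
    """Bring a document from the previous schema up to the current one."""
    if doc.get("schema_version", 1) < CURRENT_VERSION:
        # Hypothetical change: v1 stored a single "name" field,
        # v2 splits it into first and last name.
        first, _, last = doc.pop("name", "").partition(" ")
        doc["first_name"], doc["last_name"] = first, last
        doc["schema_version"] = CURRENT_VERSION
    return doc

def read_document(store, key):
    """Read-repair: upgrade on read, then persist the migrated form."""
    doc = upgrade(store[key])
    store[key] = doc          # write the document back out in the new schema
    return doc

store = {"u1": {"name": "Ada Lovelace"}}   # an old-schema document
doc = read_document(store, "u1")
```

Every document is migrated the first time it is touched; documents that are never read stay in the old schema, which is exactly why a final clean-up batch job is still needed.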

DuckDuckGo serves 1 million searches a day

An interview with Gabriel Weinberg, founder of DuckDuckGo and general all-around startup guru, has been published on what DDG’s architecture looks like in 2012. You will find details on how they use memcached, PostgreSQL and many other great pieces of software to serve 1 million searches a day!

Can’t searching the open Web provide all this data? Not really. This is structured data with semantics, not an HTML page. You need a search engine that’s capable of categorizing, mapping, merging, filtering, prioritizing, searching, formatting, and disambiguating richer data sets, and you can’t do that with a keyword search. You need the kind of smarts DDG has built into their search engine. One problem, of course, is that now that data has become valuable, many grown-ups don’t want to share anymore.



The full article on

Gartner predicts strong Hadoop adoption for Business Intelligence and Analytics



Gartner Says Business Intelligence and Analytics Need to Scale Up to Support Explosive Growth in Data Sources

Analysts to Discuss Growth in Data Sources at Gartner Business Intelligence and Analytics Summits 2013, February 5-7 in Barcelona, February 25-26 in Sydney and March 18-20 in Grapevine, Texas


Business intelligence (BI) and analytics need to scale up to support the robust growth in data sources, according to the latest predictions from Gartner, Inc. Business intelligence leaders must embrace a broadening range of information assets to help their organizations.

“New business insights and improved decision making with greater finesse are the key benefits achievable from turning more data into actionable insights, whether that data is from an increasing array of data sources from within or outside of the organization,” said Daniel Yuen, research director at Gartner. “Different technology vendors, especially niche vendors, are rushing into the market, providing organizations with the ability to tap into this wider information base in order to make sounder strategic and prompter operational decisions.”

Gartner outlined three key predictions for BI teams to consider when planning for the future:

By 2015, 65 percent of packaged analytic applications with advanced analytics will come embedded with Hadoop.

Organizations realize the strength that Hadoop-powered analysis brings to big data programs, particularly for analyzing poorly structured data, text, behavior analysis and time-based queries. While IT organizations conduct trials over the next few years, especially with Hadoop-enabled database management system (DBMS) products and appliances, application providers will go one step further and embed purpose-built, Hadoop-based analysis functions within packaged applications. The trend is most noticeable so far with cloud-based packaged application offerings, and this will continue.

“Organizations with the people and processes to benefit from new insights will gain a competitive advantage as having the technology packaged reduces operational costs and IT skills requirements, and speeds up the time to value,” said Bill Gassman, research director at Gartner. “Technology providers will benefit by offering a more competitive product that delivers task-specific analytics directly to the intended role, and avoids a competitive situation with internally developed resources.”

By 2016, 70 percent of leading BI vendors will have incorporated natural-language and spoken-word capabilities.

BI/analytics vendors continue to be slow in providing language- and voice-enabled applications. In their rush to port their applications to mobile and tablet devices, BI vendors have tended to focus only on adapting their traditional BI point-and-click and drag-and-drop user interfaces to touch-based interfaces. Over the next few years, BI vendors are expected to start playing a quick game of catch-up with the virtual personal assistant market. Initially, BI vendors will enable basic voice commands for their standard interfaces, followed by natural language processing of spoken or text input into SQL queries. Ultimately, “personal analytic assistants” will emerge that understand user context, offer two-way dialogue, and (ideally) maintain a conversational thread.

“Many of these technologies can and will underpin these voice-enabled analytic capabilities, rather than BI vendors or enterprises themselves developing them outright,” said Douglas Laney, research vice president at Gartner.

By 2015, more than 30 percent of analytics projects will deliver insights based on structured and unstructured data.

Business analytics have largely been focused on tools, technologies and approaches for accessing, managing, storing, modeling and optimizing for analysis of structured data. This is changing as organizations strive to gain insights from new and diverse data sources. The potential business value of harnessing and acting upon insights from these new and previously untapped sources of data, coupled with the significant market hype around big data, has fueled new product development to deal with a data variety across existing information management stack vendors and has spurred the entry of a flood of new approaches for relating, correlating, managing, storing and finding insights in varied data.

“Organizations are exploring and combining insights from their vast internal repositories of content — such as text and emails and (increasingly) video and audio — in addition to externally generated content such as the exploding volume of social media, video feeds, and others, into existing and new analytic processes and use cases,” said Rita Sallam, research vice president at Gartner. “Correlating, analyzing, presenting and embedding insights from structured and unstructured information together enables organizations to better personalize the customer experience and exploit new opportunities for growth, efficiencies, differentiation, innovation and even new business models.”

More detailed analysis is available in the report “Predicts 2013: Business Intelligence and Analytics Need to Scale Up to Support Explosive Growth in Data Sources.” The report is available on Gartner’s website at

Additional information and analysis on data sources will be discussed at the Gartner Business Intelligence & Analytics Summit 2013 taking place February 5-7 in Barcelona, February 25-26 in Sydney and March 18-20 in Grapevine, Texas. The Gartner BI & Analytics Summit is specifically designed to drive organizations toward analytics excellence by exploring the latest trends in BI and analytics and examining how the two disciplines relate to one another. Gartner analysts will discuss how the Nexus of Forces will impact BI and analytics, and share best practices for developing and managing successful mobile BI, analytics and master data management initiatives.

Processing data with Drake

Introducing ‘Drake’, a “Make for Data”

We call this tool Drake, and today we are excited to share Drake with the world, as an open source project. It is written in Clojure.

Drake is a text-based command line data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs.  It automatically resolves dependencies and provides a rich set of options for controlling the workflow. It supports multiple inputs and outputs and has HDFS support built-in.
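A workflow file might look like the following sketch (the step syntax, with outputs on the left of `<-` and `$INPUT`/`$OUTPUT` variable substitution, follows Drake’s documented format; the file names and commands are invented for illustration):

```
; Drakefile: each step declares its output and input, and Drake
; resolves the dependency order automatically.

filtered.csv <- raw.csv
  grep -v '^#' $INPUT > $OUTPUT

counts.txt <- filtered.csv
  wc -l < $INPUT > $OUTPUT
```

Running `drake` evaluates the graph and, much like Make, only re-runs the steps whose outputs are missing or out of date with respect to their inputs.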

We use Drake at Factual on various internal projects. It serves as a primary way to define, run, and manage data workflow. Some core benefits we’ve seen:
    • Non-programmers can run Drake and fully manage a workflow
    • Encourages repeatability of the overall data building process
    • Encourages consistent organization (e.g., where supporting scripts live, and how they’re run)
    • Precise control over steps (for more effective testing, debugging, etc.)
    • Unifies different tools in a single workflow (shell commands, Ruby, Python, Clojure, pushing data to production, etc.)

Drake official blog: