Hadoop more than ever before @Yahoo!

Yahoo! is more committed to Hadoop than ever before, according to its developer blog.

Hadoop at Yahoo!

In 2012, we stabilized Hadoop 0.23 (a branch very close to Hadoop 2.0, less the HDFS HA enhancements), validated hundreds of user feeds and thousands of applications, and rolled it out on tens of thousands of production nodes. The rollout is expected to complete fully in Q1 2013, and is a testimony to what we stated earlier: our commitment to pioneering new ground for Hadoop. To give you an idea, we have run over 14 million jobs on YARN (Nextgen MapReduce for Apache Hadoop) and average more than 80,000 jobs on a single cluster per day on Hadoop 0.23. In addition, we made sure that other Apache projects like Pig, Hive, Oozie, HCatalog, and HBase run on top of Hadoop 0.23. We also stood up a near real-time scalable processing and storage infrastructure in a matter of a few weeks with MapReduce/YARN, HBase, ZooKeeper, and Storm clusters to enable the next generation of Personalization and Targeting services for Yahoo!.

As the largest Hadoop user and a major open source contributor, we have continued our commitment to the advancement of Hadoop through co-hosting Hadoop Summit 2012 and sponsoring Hadoop World + Strata Conference, 2012 in NY. We continue to sponsor the monthly Bay Area Hadoop User Group meetup (HUG), one of the largest Hadoop meetups anywhere in the world, running into its fourth year now at the URL’s café of our Sunnyvale campus.


More information available at: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/

Yahoo planning to spin off its Hadoop Software Unit

According to an article in the Wall Street Journal, Yahoo is weighing spinning off its Hadoop engineering unit into a new firm that would continue to develop the free software and charge companies for its expertise in using it.


The article:


Yahoo Inc., which generates $6 billion in annual revenue by selling online ads, is considering a new strategy to exploit what analysts say could be another billion-dollar business: Hadoop.

Over the past six years, the Internet pioneer helped to develop Hadoop, data analysis software that it now uses to ferret and cull spam from Yahoo mail, determine which stories to place on its home page and pick relevant ads for viewers. Thousands of other companies also use it to analyze large amounts of data, making it a valuable tool amid the explosion of digital information generated on the Web and mobile devices, as well as within financial markets and other industries.

Yahoo is now weighing spinning off its Hadoop engineering unit into a new firm that would continue to develop the free software and charge companies for its expertise in using it, according to people familiar with the matter.

Yahoo would collaborate with and take a significant stake in the new firm, which would compete with a bevy of start-ups such as Cloudera Inc. and established players such as International Business Machines Corp. that distribute versions of Hadoop and offer related services to other companies, these people said.

A Yahoo spokeswoman declined to comment. But Benchmark Capital, a Silicon Valley venture-capital firm, says it has talked to Yahoo about how it might form a Hadoop company.

Hadoop is “the biggest movement in enterprise software in years,” said Rob Bearden, a partner at Benchmark Capital.

Yahoo “could stand to make a ton of money and enable the innovation…of how new data architecture is managed” by businesses, said Mr. Bearden.

Raymie Stata, Yahoo’s chief technology officer, said in an interview that “no matter what might happen, Yahoo will be in the business of developing Hadoop.”

Hadoop has become an increasingly popular tool for Internet companies including Yahoo, eBay Inc., Facebook Inc., and Twitter Inc. EHarmony.com, the match-making site, has said it uses Hadoop to help make matches among its members.

But the software is being adopted by other industries. Financial companies such as Visa Inc. have said they use Hadoop to detect fraud, and financial-trading firms use it to detect stock-market patterns or predict mortgage-default rates, according to Amr Awadallah, a former Yahoo engineer and co-founder of Cloudera, which helps more than 90 companies to use Hadoop. Doug Cutting, the creator of Hadoop who named the software after his child’s toy elephant, now works at Cloudera.

James Kobielus, a senior analyst at Forrester Research, said in five years the market for Hadoop-based products could reach billions of dollars.

“This whole market is on the upswing,” said Mr. Kobielus, adding that there are no standards in place for how companies use Hadoop and that Yahoo could help create them.

Technology giant IBM recently added Hadoop to its existing product offerings for enterprise-software customers. IBM used Hadoop to power “Watson,” the supercomputer that gained notoriety by recently beating human champions on the “Jeopardy!” game show.

Yahoo executives say the company began developing Hadoop in 2005 following the publication of a paper by Google Inc. about MapReduce, software Google created to help convert large amounts of data on the Web into a clean set of Web-search results.

The non-profit Apache Software Foundation coordinates Hadoop contributions by programmers to what the industry calls open-source software. Yahoo has contributed the majority of the code, and the core team that developed much of it is still at the company, including Eric Baldeschwieler, who manages Yahoo’s Hadoop engineers.

Write to Amir Efrati at amir.efrati@wsj.com



http://s4.io/ available

Following our previous article, “Yahoo’s RealTime MapReduce: S4 being open source”, the website http://s4.io/ and the Twitter account http://www.twitter.com/s4project are now available.



Yahoo's RealTime MapReduce: S4 being open source

Yahoo is making a strong move by releasing some of its more advanced technology as an open source project. Yahoo’s RealTime MapReduce, called S4, is meant to be a real-time, distributed, fault-tolerant, scalable, event-driven, expandable platform that allows programmers to easily implement applications for processing continuous unbounded streams of data. The project has been approved for open sourcing and will be available on GitHub (http://github.com/s4). You should see the S4 codebase available soon (the code and website content are being staged by the team).
All details about S4 are available here: http://labs.yahoo.com/event/99
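
To give a rough feel for what "event-driven processing of an unbounded stream" means, here is a minimal Python sketch. It is not S4 code and does not use S4's API: the `CountingElement` class, the `process` function, and the `(key, value)` event shape are all hypothetical names made up for illustration. The idea shown is simply that events arrive one at a time, are routed by key to a small stateful handler, and produce an updated result immediately rather than waiting for a batch job to finish.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape for this sketch: (key, value).
Event = Tuple[str, int]


class CountingElement:
    """Hypothetical per-key handler that keeps a running total."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.total = 0

    def on_event(self, value: int) -> Tuple[str, int]:
        # Update the state for this key and emit the new total right away.
        self.total += value
        return (self.key, self.total)


def process(stream: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Route each event to the handler for its key, creating it on first use."""
    elements: Dict[str, CountingElement] = {}
    for key, value in stream:
        element = elements.get(key)
        if element is None:
            element = elements[key] = CountingElement(key)
        # Yield one result per incoming event; the stream may never end.
        yield element.on_event(value)


if __name__ == "__main__":
    # A tiny finite stand-in for a continuous stream of click events.
    clicks = [("ad-1", 1), ("ad-2", 1), ("ad-1", 1)]
    for update in process(clicks):
        print(update)
```

Because `process` is a generator, it works the same way on a list as on an endless source such as a socket or a message queue. A real platform would additionally spread the per-key handlers across machines and deal with failures, which this sketch leaves out.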