Understanding Cloudera by using its VirtualBox Demo

Cloudera packages the Apache Hadoop stack into one integrated, deployable solution that includes Hadoop, Sqoop, Pig, Hive, HBase, ZooKeeper, Oozie, Hue, Flume, and Whirr. The Cloudera VirtualBox Demo gives you the whole platform, configured and ready to experiment with, in less than 5 minutes.

Making it easy for users to experiment with these tools increases the chances for adoption.

CDH Mac OS X VirtualBox VM

 

 

HBase 0.90.3 released, first release candidate

The first HBase (the Hadoop database) 0.90.3 release candidate is available.

Download

Release note

 

Bug

  • [HBASE-3539] – Improve shell help to reflect all possible options
  • [HBASE-3597] – ageOfLastAppliedOp should update after cluster replication failures
  • [HBASE-3708] – createAndFailSilent is not so silent; leaves lots of logging in ensemble logs
  • [HBASE-3712] – HTable.close() doesn’t shutdown thread pool
  • [HBASE-3722] – A lot of data is lost when name node crashed
  • [HBASE-3734] – HBaseAdmin creates new configurations in getCatalogTracker
  • [HBASE-3740] – hbck doesn’t reset the number of errors when retrying
  • [HBASE-3741] – Make HRegionServer aware of the regions it’s opening/closing
  • [HBASE-3744] – createTable blocks until all regions are out of transition
  • [HBASE-3749] – Master can’t exit when open port failed
  • [HBASE-3750] – HTablePool.putTable() should call tableFactory.releaseHTableInterface() for discarded table
  • [HBASE-3755] – Catch zk’s ConnectionLossException and augment error message with more help
  • [HBASE-3756] – Can’t move META or ROOT from shell
  • [HBASE-3771] – All jsp pages don’t clean their HBA
  • [HBASE-3783] – hbase-0.90.2.jar exists in hbase root and in ‘lib/’
  • [HBASE-3794] – TestRpcMetrics fails on machine where region server is running
  • [HBASE-3800] – If HMaster is started after NN without starting DN in Hbase 090.2 then HMaster is not able to start due to AlreadyCreatedException for /hbase/hbase.version
  • [HBASE-3817] – HBase Shell has an issue accepting FILTER for the ‘scan’ command.
  • [HBASE-3821] – “NOT flushing memstore for region” keep on printing for half an hour

Improvement

  • [HBASE-2470] – Add Scan.setTimeRange() support in Shell
  • [HBASE-3536] – [site] Add book all-in-one-page
  • [HBASE-3580] – Remove RS from DeadServer when new instance checks in
  • [HBASE-3634] – Fix JavaDoc for put(List<Put> puts) in HTableInterface
  • [HBASE-3652] – Speed up tests by lowering some sleeps
  • [HBASE-3695] – Some improvements to Hbck to test the entire region chain in Meta and provide better error reporting
  • [HBASE-3746] – Clean up CompressionTest to not directly reference DistributedFileSystem
  • [HBASE-3747] – [replication] ReplicationSource should differanciate remote and local exceptions
  • [HBASE-3767] – Improve how HTable handles threads used for multi actions
  • [HBASE-3773] – Set ZK max connections much higher in 0.90
  • [HBASE-3795] – Remove the “Cache hit for row” message
  • [HBASE-3798] – [REST] Allow representation to elide row key and column key
  • [HBASE-3805] – Log RegionState that are processed too late in the master
  • [HBASE-3818] – docs adding troubleshooting.xml
  • [HBASE-3860] – HLog shouldn’t create a new HBC when rolling
  • [HBASE-3861] – MiniZooKeeperCluster.startup() should refer to hbase.zookeeper.property.maxClientCnxns

Task

 

Thirty+ issues have been resolved since 0.90.2

Release notes are available here, [2].

HBase 0.90.2 released

HBase 0.90.2 is a maintenance release that fixes several important bugs since version 0.90.1, while retaining API and data compatibility. The release notes may be found on the Apache JIRA:

Improvements:

  • Allow Explicit Splits from the Shell
  • HFile CLI Improvements
  • MultiGet methods in Thrift
  • number of active threads in HTable’s ThreadPoolExecutor
  • Improve the selection of regions to balance
  • [replication] Wait a few seconds before transferring queues
  • Improve RegionSplitter Performance
  • Make HBCK Faster
  • improve/fix support excluding Tests via Maven -D property
  • [replication] Transferring queues shouldn’t be done inline with RS startup
  • Parallelize Server Requests on HBase Client
  • Check the sloppiness of the region load before balancing
  • NMapInputFormat should use a different config param for number of maps

New feature:

  • RegionSplitter : Utility class for manual region splitting

Downloads

Release notes

HSearch – open source, NoSQL Search Engine

HSearch is an open source NoSQL search engine. Use it when you need real-time search on your Big Data. The project's goal is to index over 100 billion records on a cluster of commodity hardware. HSearch is a distributed, multi-format search engine for structured and unstructured content, built on the HBase platform. Because the complete index is stored in HBase tables, it inherits HBase's storage capabilities.

HSearch features include:

  • Multi-XML formats
  • Record and document level search access control
  • Continuous index updates
  • Parallel indexing across multiple machines
  • Embeddable inside an application
  • A REST-ful Web service gateway that supports XML
  • Auto sharding
  • Auto replication

 

Official Website:

http://www.bizosys.com/product.html

http://bizosyshsearch.sourceforge.net/

NoSQL at Netflix: how & why

Netflix, the leading US provider of video on demand (VOD), posted an article about how it moved into the cloud.

In it, Yury Izrailevsky, Director of Cloud and Systems Infrastructure at Netflix, explains their NoSQL infrastructure in detail.

The full article is available on their blog:

http://techblog.netflix.com/2011/01/nosql-at-netflix.html

HBase: The Definitive Guide 1st edition

“HBase: The Definitive Guide” by Lars George will be available soon. You can order it here: http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/

Product Description

If your organization is looking for a storage solution to accommodate a virtually endless amount of data, this book will show you how Apache HBase can fulfill your needs. As the open source implementation of Google’s BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. HBase: The Definitive Guide provides the details you require, whether you simply want to evaluate this high-performance, non-relational database, or put it into practice right away.

HBase’s adoption rate is beginning to climb, and several IT executives are asking pointed questions about this high-capacity database. This is the only book available to give you meaningful answers.

  • Learn how to distribute large datasets across an inexpensive cluster of commodity servers
  • Develop HBase clients in many programming languages, including Java, Python, and Ruby
  • Get details on HBase’s primary storage system, HDFS—Hadoop’s distributed and replicated filesystem
  • Learn how HBase’s native interface to Hadoop’s MapReduce framework enables easy development and execution of batch jobs that can scan entire tables
  • Discover the integration between HBase and other facets of the Apache Hadoop project

About the Author

Lars George has been involved with HBase since 2007, and became a full HBase committer in 2009. He has spoken at various Hadoop User Group meetings, as well as large conferences such as FOSDEM in Brussels. He also started the Munich OpenHUG meetings. He now works closely with Cloudera to support Hadoop and HBase in and around Europe through technical support, consulting work, and training.

 

 

 

How Twitter uses NoSQL

InfoQ has released a video of Twitter's Kevin Weil explaining how the company uses NoSQL.

Twitter's architecture relies mainly on the following products:

  • Scribe
  • Hadoop
  • Pig
  • HBase
  • FlockDB
  • Cassandra

 

Video is available here: http://www.slideshare.net/nkallen/q-con-3770885

 

Twitter and all that data: what for?

Twitter uses the data for, among other things, comparing different types of users. Twitter analyzes data to determine whether mobile users, users of 3rd party clients, or “power users” use Twitter differently from average users. The company is also interested in determining whether certain features or occurrences “trigger” a casual user into becoming a frequent user. For example: do people become frequent users when they start following the right people or discover the right feature?

Other questions Weil says Twitter is interested in include determining which types of tweets get retweeted the most, what types of social graph structures result in the most successful networks and how to tell the difference between humans and bots.

 

NoSQL job offer in Luxembourg

Trendiction Hiring Search Specialist:

Trendiction is a growing web data collector company which crawls, analyzes, indexes and stores the content of news sites, blogs, message boards and other social media, in order to make this data available to our clients in a structured format.

Our clients are market research companies, analytic companies, and generally anyone who needs access to high quality social media data.

For more information, please visit our website at http://www.trendiction.de/w/product

We are extending our developer team by 3 or more developers. This job offer is for a permanent position at our offices at the Technoport in Esch-Sur-Alzette, Luxembourg.

Requirements and Experience:

* Java
* Good algorithm & data structure knowledge (e.g. http://www.amazon.de/Algorithmen-Datenstrukturen-Thomas-Ottmann/dp/3827410290)
* Very strong information retrieval knowledge (tf-idf, lucene, stemming, faceted search, …)
* Ability (and appreciation) for working in a small company/startup environment.
* University background (Bachelor or Master degree in computer sciences or related fields)

You should be creative, self-motivated, be able to write quick prototypes, try things out on your own and simply get things done.

A plus if you have experience setting up automated build tools and test tools, or are a Java expert.

Preferably, you have developed/designed/run a large scale search cluster in production.

Responsibilities:

* Improve/redesign current search engine (in terms of retrieval speed, indexation speed, quantity of indexed data and reliability)
* Add new features (language dependent search, statistics, ability to access more metadata, faceted search, …)

What we provide:

* Work on interesting problems/features/products involving large datasets.
* Write production code for large scale applications built on tools like Apache Hadoop, Cassandra, HBase, and ZooKeeper, running on over 50 servers and processing terabytes of data each day.
* Work in a team of ETH Zurich, Harvard and TU Munich graduates.
* Start-up atmosphere and environment: actively take part in the development of the company.

Contact:

Thibaut Britz   –  CTO
thibaut.britz@trendiction.com
+352 20 33 35 31

Trendiction S.à.r.l.
66, rue de Luxembourg
L-4221 Esch-sur-Alzette
Luxembourg