Wrangler: the smartest ETL and much more

Wrangler is an interactive tool for data cleaning and transformation: smart ETL software that lets you spend less time formatting and more time analyzing your data. Take the time to watch the video below; it’s really worth it.
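
As a flavour of what Wrangler automates, here is a hand-written Python sketch of a typical cleaning transform: trim stray whitespace, drop empty rows, and parse formatted numbers. The sample data and the rules are invented for illustration; this is not code generated by Wrangler.

    import csv

    # Messy input of the kind Wrangler is built to clean up interactively.
    lines = [
        "city;population",
        " New York ;8,175,133",
        ";",                          # an empty record to be dropped
        "Los Angeles;3,792,621",
    ]

    clean = []
    for row in csv.DictReader(lines, delimiter=";"):
        city = (row["city"] or "").strip()                      # trim stray whitespace
        if not city:                                            # drop empty rows
            continue
        population = int(row["population"].replace(",", ""))    # "8,175,133" -> 8175133
        clean.append({"city": city, "population": population})

    print(clean)
    # [{'city': 'New York', 'population': 8175133},
    #  {'city': 'Los Angeles', 'population': 3792621}]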

Official website:  http://vis.stanford.edu/wrangler/

Intel releases open source GraphBuilder for big data

Intel has released an open source tool designed to improve firms’ handling and analysis of unstructured data.

Source: Intel official blog http://blogs.intel.com/intellabs/2012/12/06/graphbuilder/

Intel said that its GraphBuilder tool would aim to fill a market void in the handling of big data for machine learning. Currently available as a beta release, the tool allows developers to construct large graphs which can then be used with big data analysis frameworks.

“GraphBuilder not only constructs large-scale graphs fast but also offloads many of the complexities of graph construction, including graph formation, cleaning, compression, partitioning, and serialisation,” wrote Intel principal scientist Ted Willke.

“This makes it easy for just about anyone to build graphs for interesting research and commercial applications.”

Willke said that the tool was developed in a collaboration with researchers at the University of Washington in Seattle. The teams sought to address a perceived hole in the market for tools to build the graph data used for many big data analysis activities.

“Scanning the environment, we identified a more general hole in the open source ecosystem: A number of systems were out there to process, store, visualise, and mine graphs but, surprisingly, not to construct them from unstructured sources,” Willke explained.

“So, we set out to develop a demo of a scalable graph construction library for Hadoop.”
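
To make the idea concrete, here is a minimal Python sketch of MapReduce-style graph construction: a map step extracts co-occurrence edges from raw text, and a reduce step deduplicates them into weighted edges. This only illustrates the general technique; it is not GraphBuilder's API (which is Java and runs on Hadoop), and the tokenizer and tiny in-memory driver below are illustrative stand-ins.

    from collections import defaultdict
    from itertools import combinations

    def map_edges(doc_id, text):
        # Map step: emit a co-occurrence edge for every pair of words in a document.
        words = sorted(set(text.lower().split()))
        for src, dst in combinations(words, 2):
            yield (src, dst), 1

    def reduce_edges(edge, counts):
        # Reduce step: collapse duplicate edges into a single weighted edge.
        return edge, sum(counts)

    # Tiny in-memory driver standing in for Hadoop's shuffle between map and reduce.
    docs = {1: "intel releases graphbuilder", 2: "graphbuilder constructs large graphs"}
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for edge, count in map_edges(doc_id, text):
            grouped[edge].append(count)

    graph = dict(reduce_edges(edge, counts) for edge, counts in grouped.items())
    # `graph` maps (word, word) pairs to weights and could now be partitioned
    # across workers, e.g. by hash(edge) % number_of_partitions.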

The researchers estimate that GraphBuilder can help big data platforms analyse data as much as 50 times faster than the conventional MapReduce system.

The project is one of many research efforts dedicated to improving the performance of big data analysis platforms. Last month, researchers from the University of California Berkeley showcased a pair of technologies dubbed ‘Spark’ and ‘Shark’ which promise to dramatically improve the performance of the Apache Hive big data system.

The big data market has been suffering from a general lack of qualified analysts and developers, say vendors. Companies have sought to help bridge the gap by extending training efforts and partnerships with universities.

Ever heard of Cloudant?

Cloudant was founded in Cambridge, Massachusetts in 2008 by three MIT physicists who at the time were moving multi-petabyte data sets around from the Large Hadron Collider. Frustrated by the available tools for managing and analyzing Big Data in their research, the founders built a distributed, fault-tolerant, globally scalable data layer on top of Apache CouchDB.

The service has grown since then. The team now manages and serves mobile and web app data on behalf of thousands of developers and hundreds of customers, delivering it to their users around the world.


The Cloudant Data Layer collects, stores, analyzes and distributes application data across a global network of secure, high-performance data centers, delivering low-latency, non-stop data access to users no matter where they’re located.

Features

Cloudant enables advanced features like full-text search, replication, off-line computing, mobile sync, geo-location, and federated analytics. A RESTful API and support for standards like JSON and MapReduce make Cloudant easy to use; there are never schema or data migrations to slow you down.
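
As a quick illustration of that API style, here is a hedged Python sketch of creating a database and storing a JSON document over Cloudant's CouchDB-compatible HTTP interface. The account, credentials, database and document names are placeholders, not a real deployment.

    import requests

    # Placeholder account and credentials; Cloudant speaks a CouchDB-compatible
    # HTTP API, so documents are plain JSON exchanged over REST.
    BASE = "https://ACCOUNT.cloudant.com"
    AUTH = ("USERNAME", "PASSWORD")

    # Create a database, then store a schemaless JSON document in it.
    requests.put(BASE + "/movies", auth=AUTH)
    doc = {"title": "Tron", "year": 1982, "genres": ["sci-fi"]}
    resp = requests.put(BASE + "/movies/tron-1982", json=doc, auth=AUTH)
    print(resp.json())    # e.g. {'ok': True, 'id': 'tron-1982', 'rev': '1-...'}

    # Read it back; no schema or migration was ever declared.
    print(requests.get(BASE + "/movies/tron-1982", auth=AUTH).json())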

GLOBAL DATA NETWORK

  • Global network of servers
  • Built-in replication and syncing
  • Pushes data closer to users
  • Built-in failover/disaster recovery

NOSQL

  • Schemaless development for JSON & other doc types
  • MapReduce for indexing & data access
  • Replication of apps, indexes & data
  • RESTful API
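
The MapReduce indexing mentioned above is done through CouchDB-style design documents: you define a map function (and optionally a reduce), and the database maintains the resulting index incrementally as documents change. A sketch, reusing the placeholder account and database from the previous example:

    import requests

    BASE = "https://ACCOUNT.cloudant.com"   # placeholder account, as above
    AUTH = ("USERNAME", "PASSWORD")

    # A design document whose map function (written in JavaScript, as the
    # database expects) emits year -> title; "_count" is a built-in reduce.
    design = {
        "views": {
            "by_year": {
                "map": "function (doc) { if (doc.year) { emit(doc.year, doc.title); } }",
                "reduce": "_count",
            }
        }
    }
    requests.put(BASE + "/movies/_design/movies", json=design, auth=AUTH)

    # Query the resulting index through the same RESTful API.
    resp = requests.get(BASE + "/movies/_design/movies/_view/by_year",
                        params={"reduce": "false"}, auth=AUTH)
    print(resp.json()["rows"])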

DATA LAYER AS A SERVICE

FAULT TOLERANCE

  • Clustering in a ring (a la Amazon Dynamo)
  • Built-in distributed Erlang
  • Masterless
  • Code sent to data, node-local, data-parallel
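
The ring-based, masterless clustering mentioned above is the Dynamo idea: hash each document onto a ring of nodes and store its replicas on the next few distinct nodes clockwise, so any node can compute where data lives without consulting a master. The toy Python sketch below shows only that placement rule; Cloudant's actual clustering layer is written in Erlang and is not reproduced here.

    import bisect
    import hashlib

    class HashRing:
        """Toy consistent-hash ring (Dynamo-style), for illustration only."""

        def __init__(self, nodes, replicas=3):
            self.replicas = replicas
            self.ring = sorted((self._hash(n), n) for n in nodes)

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def nodes_for(self, doc_id):
            # Walk clockwise from the document's position on the ring and pick
            # the next `replicas` distinct nodes; there is no master to consult.
            start = bisect.bisect(self.ring, (self._hash(doc_id), ""))
            owners, i = [], start
            while len(owners) < min(self.replicas, len(self.ring)):
                node = self.ring[i % len(self.ring)][1]
                if node not in owners:
                    owners.append(node)
                i += 1
            return owners

    ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
    print(ring.nodes_for("tron-1982"))   # e.g. ['node-c', 'node-d', 'node-a']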

BUILT-IN ANALYTICS

  • Push OLAP-style workflows into the database
  • Based on incremental, chainable MapReduce
  • Multi-language support
  • Attachment analytics

DISTRIBUTED, FULL-TEXT SEARCH

  • Based on Lucene libraries
  • Support for custom indexers
  • Learn more about Cloudant search

NGDATA publishes Big Data whitepaper

Consumer-centric companies such as banks have more data about their consumers than ever, but relatively little intelligence about them. The world is increasingly interconnected, instrumented and intelligent, and in this new world the velocity, volume, and variety of data being created are unprecedented. As the amount of data created about each consumer grows, the percentage of that data that banks and retailers can actually process is falling fast.

In this whitepaper, you will learn:

  • New revenue opportunities that banks can realize by embracing Big Data
  • Challenges banks are facing in getting a single view of the consumer
  • Key banking use cases: Mobile Wallet and Fraud Detection
  • How interactive Big Data management can help you change the game

Download this whitepaper to learn how banks can leverage Big Data to transform their business, know their customers better, realize new revenue opportunities, and detect fraud.

The NGDATA Team


LinkedIn open sources Helix

LinkedIn has announced the release of Helix, an open source cluster management system.


What is Helix?

The LinkedIn infrastructure stack consists of a number of different distributed systems, each specialized to solve a particular problem. These include online data storage systems (Voldemort, Espresso), messaging systems (Kafka), a change data capture system (Databus), a distributed system that provides search as a service (SeaS), and a distributed graph engine.

Although each system serves a different purpose, they share a common set of requirements:

  • Resource management: The resources in the system (such as the database and indexes) must be divided among nodes in the cluster.
  • Fault tolerance: Node failures are unavoidable in any distributed system. However, the system as a whole must continue to be available in the presence of such failures, without losing data.
  • Elasticity: As workload grows, clusters must be able to grow to accommodate the increased demand.
  • Monitoring: The cluster must be monitored for node failures as well as other health metrics, such as load imbalance and SLA misses.

Rather than forcing each system to reinvent the wheel, we decided to build Helix, a cluster management framework that solves these common problems. This allows each distributed system to focus on its distinguishing features, while leaving Helix to take care of cluster management functions.

Helix provides significant leverage beyond just code reuse. At scale, the operational cost of management, monitoring and recovery in these systems far outstrips their single node complexity. A generalized cluster management framework provides a unified way of operating these otherwise diverse systems, leading to operational ease.
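
Helix's real interface is a set of Java state models and callbacks coordinated through ZooKeeper, which is not reproduced here. But the heart of the job can be sketched in a few lines of Python: compute an assignment of partition replicas to the live nodes, and recompute it when membership changes. The round-robin placement below is an illustrative stand-in, not Helix's actual algorithm.

    import itertools

    def assign_partitions(num_partitions, nodes, replicas=2):
        # Spread each partition's replicas over `replicas` distinct live nodes;
        # treat the first owner as the "master" replica.
        node_cycle = itertools.cycle(nodes)
        assignment = {}
        for p in range(num_partitions):
            owners = []
            while len(owners) < min(replicas, len(nodes)):
                node = next(node_cycle)
                if node not in owners:
                    owners.append(node)
            assignment[p] = owners
        return assignment

    live_nodes = ["node1", "node2", "node3"]
    state = assign_partitions(8, live_nodes)

    # On a node failure, a cluster manager recomputes the mapping over the
    # survivors and promotes a surviving replica of each affected partition.
    live_nodes.remove("node2")
    state = assign_partitions(8, live_nodes)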

Helix at LinkedIn

Helix has been under development at LinkedIn since April 2011. Currently, it is used in production in three different systems:

    • Espresso: Espresso is a distributed, timeline consistent, scalable document store that supports local secondary indexing and local transactions. Espresso runs on a number of storage node servers that store and index data and answer queries. Espresso databases are horizontally partitioned across multiple nodes, with each partition having a specified number of replicas. Espresso designates one replica of each partition as master (which accepts writes) and the rest as slaves; only one master may exist for each partition at any time. Helix manages the partition assignment, cluster-wide monitoring, and mastership transitions during planned upgrades and unplanned failure. Upon failure of the master, a slave replica is promoted to be the new master.

 

    • Databus: Databus is a change data capture (CDC) system that provides a common pipeline for transporting events from LinkedIn primary databases to caches, indexes and other applications such as Search and Graph that need to process the change events. Databus deploys a cluster of relays that pull the change log from multiple databases and let consumers subscribe to the change log stream. Each Databus relay connects to one or more database servers and hosts a certain subset of databases (and partitions) from those database servers, depending on the assignment from Helix.

 

  • SeaS (Search as a Service): LinkedIn’s Search-as-a-service lets other applications define custom indexes on a chosen dataset and then makes those indexes searchable via a service API. The index service runs on a cluster of machines. The index is broken into partitions and each partition has a configured number of replicas. Each new indexing service gets assigned to a set of servers, and the partition replicas must be evenly distributed across those servers. When indexes are bootstrapped, the search service uses snapshots of the data source to create new index partitions. Helix manages the assignment of index partitions to servers. Helix also limits the number of concurrent bootstraps in the system, as bootstrapping is an expensive process.

Try it out

We invite you to download and try out Helix. In the past year, we have seen significant adoption of, and contributions to, Helix by multiple teams at LinkedIn. By open sourcing Helix, we intend to grow our contributor base significantly and invite interested developers to participate.

We will also be presenting a paper on Helix at the upcoming SOCC (ACM Symposium on Cloud Computing) in San Jose, CA, on October 15th, 2012.

IBM PureData System

Big data is the core of your new enterprise application architecture. In the broader evolutionary picture, analytics and transactions will share a common big data infrastructure, encompassing storage, processing, memory, networking and other resources. More often than not, these workloads will run on distinct performance-optimized integrated systems, but will interoperate through a common architectural backbone.


Deploying a big-data infrastructure that does justice to both analytic and transactional applications can be challenging, especially when you lack platforms that are optimized to handle each type of workload. But the situation is improving. A key milestone in the evolution of big data toward agile support for analytics-optimized transactions comes today, October 9, 2012, with the release of IBM PureData System: a new family of workload-specific, hardware/software expert integrated systems for both analytics and transactions. IBM has launched new workload-optimized systems for transactions (IBM PureData System for Transactions), for data warehousing and advanced analytics (IBM PureData System for Analytics), and for real-time business intelligence, online analytical processing and text analytics (IBM PureData System for Operational Analytics).

What are the common design principles that all of the PureData System platforms embody, and which they share with other PureSystems solutions? They all incorporate the following core features:

  • Patterns of expertise for built-in solution best practices: PureData Systems incorporate integrated expertise patterns, which represent encapsulations of best practices drawn from the time-proven practical know-how of myriad data and analytics deployments. PureData Systems are built upon pre-defined, preconfigured, pre-optimized solution architectures. This enables them to support repeatable deployments of analytics and transactional computing with full lifecycle management, monitoring, security and so forth.
  • Scale-in, out and up capabilities: PureData Systems support both the “scale-out” and “scale-up” approaches to capacity growth, also known as “horizontal” and “vertical” scaling, respectively. They also incorporate “scale-in” architectures, which allow you to add workloads and boost performance within existing densely configured nodes. You can execute dynamic, unpredictable workloads with linear performance gains while making the most efficient use of existing server capacity. And you can significantly scale your big data storage, application software and compute resources per square foot of precious data-center space.
  • Cloud-ready deployment: PureData Systems provide workload-optimized hardware/software nodes that are building blocks for big-data clouds. As repeatable nodes, they support cloud architectures that scale on all three “Vs” of the big data universe (volume, velocity and variety) and may be deployed into any high-level cloud topology (centralized, hub-and-spoke, federated, etc.), either on your premises or in the data center of whatever cloud, hosting, or outsourcing vendor you choose.
  • Clean-slate designs for optimal performance: PureData Systems incorporate “clean-slate design” principles. These allow us to optimize and innovate in the internal design of each new integrated solution, improving performance, scalability, resiliency and so forth without being constrained by the artifacts of older platforms. When we think about the insides of our boxes, we’re always thinking outside the box.
  • Integrated management for maximum administrator productivity: PureData Systems incorporate unified management tooling and expertise patterns to enable a low lifecycle cost of ownership and high administrator productivity. The tooling automates and facilitates the work of human administrators overseeing a wide range of workload management, troubleshooting and administration tasks over the solutions’ useful lives. As workload-optimized systems, these solutions embed integrated expertise patterns that automate and optimize this administrative work.

Taken together, these principles enable the PureData platforms to realize fast business value, reduce total cost of ownership, and support maximum scalability and performance on a wide range of analytics and transactional workloads. These same principles are also the architectural backbone for the recently released IBM PureApplication Systems and IBM PureFlex Systems platforms.


Learn more about IBM PureData System

MongoDB on Windows Azure v0.6 has been released

Version 0.6 of MongoDB for Windows Azure has been released. This release includes many bug fixes and enhancements; the highlights are a new manager web application and support for backups.

A list of all the issues (features and bugs) that were part of the release can be found at https://jira.mongodb.org/secure/ReleaseNote.jspa?projectId=10482&version=11383

Big Data – Key Figures

Back to basics: facts and key figures about data:
  • Bad data or poor data quality costs US businesses $600 billion annually.
  • 247 billion e-mail messages are sent each day… up to 80% of them are spam.
  • Poor data or a “lack of understanding of the data” is cited as the #1 reason for overrunning project costs.
  • 70% of data is created by individuals – but enterprises are responsible for storing and managing 80% of it. (source)
  • Projected annual growth in the volume of data generated is 40-60 per cent, while media-intensive sectors, including financial services, will see year-on-year data growth rates of over 120 per cent.
  • Every hour, enough information is consumed by internet traffic to fill 7 million DVDs.  Side by side, they’d scale Mount Everest 95 times.
  • The volume of data that businesses collect is exploding: in 15 of the US economy’s 17 sectors, for example, companies with upward of 1,000 employees store, on average, more information than the Library of Congress does (source).
  • 48 hours’ worth of video is posted on YouTube every hour of every day (source).
  • Every month 30 billion pieces of content are shared on Facebook (source).
  • By 2020 the production of data will be 44 times what we produced in 2009. (source)
  • If an average Fortune 1000 company increased the usability of its data by just 10%, it could expect an increase of over $2 billion in revenue. (Source: InsightSquared infographic)

Facebook – Everything is interesting to us

Facebook is collecting a lot of data. Every time you click a notification, visit a page, upload a photo, or check out a friend’s link, you’re generating data for the company to track. You can multiply that by 950 million people and you have a lot of information to deal with.

Here are some of the stats the company provided Wednesday to demonstrate just how big Facebook’s data really is:

  • 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments)
  • 2.7 billion Likes per day
  • 300 million photos uploaded per day
  • 100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS) clusters
  • 105 terabytes of data scanned every 30 minutes via Hive, Facebook’s SQL-like query layer on top of Hadoop
  • 70,000 queries executed on these databases per day
  • 500+ terabytes of new data ingested into the databases every day

“If you aren’t taking advantage of big data, then you don’t have big data, you have just a pile of data,” said Jay Parikh, VP of infrastructure at Facebook on Wednesday. “Everything is interesting to us.”


Dead simple design with Reddit's database

Reddit’s database has two tables

Steve Huffman talks about Reddit’s approach to data storage in a High Scalability post from 2010. I was surprised to learn that they only have two tables in their database.

Lesson: Don’t worry about the schema.

[Reddit] used to spend a lot of time worrying about the database, keeping everything nice and normalized. You shouldn’t have to worry about the database. Schema updates are very slow when you get bigger. Adding a column to 10 million rows takes locks and doesn’t work. They used replication for backup and for scaling. Schema updates and maintaining replication are a pain. They would have to restart replication and could go a day without backups. Deployments are a pain because you have to orchestrate how new software and new database upgrades happen together.

Instead, they keep a Thing Table and a Data Table. Everything in Reddit is a Thing: users, links, comments, subreddits, awards, etc. Things keep common attributes like up/down votes, a type, and a creation date. The Data table has three columns: thing id, key, value. There’s a row for every attribute: title, url, author, spam votes, etc. When they add new features, they don’t have to worry about the database anymore. They don’t have to add new tables for new things or worry about upgrades. It’s easier for development, deployment, and maintenance.
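
For concreteness, here is a small SQLite sketch of that two-table layout. The column names are illustrative guesses, not Reddit's actual schema.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE thing (
        id         INTEGER PRIMARY KEY,
        type       TEXT NOT NULL,               -- 'link', 'comment', 'user', ...
        ups        INTEGER DEFAULT 0,
        downs      INTEGER DEFAULT 0,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE data (
        thing_id INTEGER NOT NULL,
        key      TEXT NOT NULL,
        value    TEXT,
        PRIMARY KEY (thing_id, key)
    );
    """)

    # A new attribute is just another row in `data`; no ALTER TABLE needed.
    conn.execute("INSERT INTO thing (id, type) VALUES (1, 'link')")
    conn.executemany(
        "INSERT INTO data (thing_id, key, value) VALUES (?, ?, ?)",
        [(1, "title", "Dead simple design"),
         (1, "url", "http://example.com"),
         (1, "author", "someuser")],
    )

    # Reassembling a Thing means reading its key/value rows back.
    attrs = dict(conn.execute("SELECT key, value FROM data WHERE thing_id = ?", (1,)))
    print(attrs)   # {'title': 'Dead simple design', 'url': ..., 'author': ...}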

The price is that you can’t use cool relational features. There are no joins in the database, and you must manually enforce consistency. Having no joins means it’s really easy to distribute data to different machines. You don’t have to worry about foreign keys or doing joins, or about how to split the data up. It has worked out really well; worries about using a relational database are a thing of the past.

This fits with a piece I read the other day about how MongoDB has high adoption for small projects because it lets you just start storing things, without worrying about what the schema or indexes need to be. Reddit’s approach lets them easily add more data to existing objects, without the pain of schema updates or database pivots. Of course, your mileage is going to vary, and you should think closely about your data model and what relationships you need.