Probability, The Analysis of Data

Probability, The Analysis of Data – Volume 1

is a free book, available online, that provides educational material in the area of data analysis.

http://www.theanalysisofdata.com/probability/0_2.html

  • The project features comprehensive coverage of all relevant disciplines including probability, statistics, computing, and machine learning.
  • The content is almost self-contained and includes mathematical prerequisites and basic computing concepts.
  • The R programming language is used to demonstrate the contents. Full code is available, facilitating reproducibility of experiments and letting readers experiment with variations of the code.
  • The presentation is mathematically rigorous, and includes derivations and proofs in most cases.
  • HTML versions are freely available on the website http://theanalysisofdata.com. Hardcopies are available at affordable prices.

Handling "schema" change in production

I have often heard and read about this situation: you start a brand new application based on a NoSQL datastore, everything goes fine so far, you’re almost happy, but all of a sudden you hit a critical point: you need to change the “schema” of your application while you’re already live and running in production.

At this point you must ask yourself whether your amount of data (i.e. the document count) is small enough that you can write a small conversion program and run a batch process to update all the documents in bulk.

Unfortunately it won’t always turn out this way; sometimes, given the large amount of data you’re dealing with, performing bulk batch updates isn’t feasible because of the time required and the impact on performance.

In that case you should consider a lazy update approach: in your application, check whether a document is in the ‘previous schema’ when you read it in, and upgrade it when you write it out again.
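As a minimal sketch of what this lazy update could look like with MongoDB and pymongo (the collection name, the schema_version field, and the conversion logic below are hypothetical examples, not tied to any particular application):

from pymongo import MongoClient

# Hypothetical collection and version field, for illustration only.
client = MongoClient()
users = client.mydb.users
CURRENT_VERSION = 2

def upgrade(doc):
    """Convert a document from the previous schema to the current one."""
    if doc is not None and doc.get("schema_version", 1) < CURRENT_VERSION:
        # Example conversion: split a single 'name' field into two fields.
        first, _, last = doc.pop("name", "").partition(" ")
        doc["first_name"], doc["last_name"] = first, last
        doc["schema_version"] = CURRENT_VERSION
    return doc

def load_user(user_id):
    # Read path: upgrade in memory only, so reads stay cheap.
    return upgrade(users.find_one({"_id": user_id}))

def save_user(doc):
    # Write path: the upgraded document replaces the old one, so
    # documents migrate lazily as the application touches them.
    users.replace_one({"_id": doc["_id"]}, upgrade(doc))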

Over time this will migrate documents from the ‘previous schema’ to the new one, though you may end up with documents that are rarely accessed and therefore remain in the ‘previous schema’. You then wait until the number of documents still in the ‘previous schema’ is small enough that you can run batch jobs to update the remainder.

During this conversion, be very careful with any process that operates over multiple documents; this is the downside of the approach, as those processes may need to be rewritten as well, or at the very least carefully reviewed.

Dead simple design with Reddit's database

Reddit’s database has two tables

Steve Huffman talks about Reddit’s approach to data storage in a High Scalability post from 2010. I was surprised to learn that they only have two tables in their database.

Lesson: Don’t worry about the schema.

[Reddit] used to spend a lot of time worrying about the database, keeping everything nice and normalized. You shouldn’t have to worry about the database. Schema updates are very slow when you get bigger. Adding a column to 10 million rows takes locks and doesn’t work. They used replication for backup and for scaling. Schema updates and maintaining replication is a pain. They would have to restart replication and could go a day without backups. Deployments are a pain because you have to orchestrate how new software and new database upgrades happen together.

Instead, they keep a Thing Table and a Data Table. Everything in Reddit is a Thing: users, links, comments, subreddits, awards, etc. Things keep common attributes like up/down votes, a type, and creation date. The Data table has three columns: thing id, key, value. There’s a row for every attribute. There’s a row for title, url, author, spam votes, etc. When they add new features they don’t have to worry about the database anymore. They don’t have to add new tables for new things or worry about upgrades. Easier for development, deployment, maintenance.
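As a rough sketch of that layout (a minimal SQLite example in Python; the table and column names are guesses based on the description above, not Reddit’s actual schema):

import sqlite3

conn = sqlite3.connect(":memory:")

# One row per "Thing" (user, link, comment, ...), holding only the
# attributes every Thing shares.
conn.execute("""
    CREATE TABLE thing (
        id INTEGER PRIMARY KEY,
        type TEXT,              -- 'user', 'link', 'comment', ...
        ups INTEGER DEFAULT 0,
        downs INTEGER DEFAULT 0,
        created_at TEXT
    )
""")

# One row per attribute: adding a feature means inserting rows with a
# new key, never running ALTER TABLE on millions of rows.
conn.execute("""
    CREATE TABLE data (
        thing_id INTEGER,
        key TEXT,
        value TEXT
    )
""")

# Storing a link and its attributes.
conn.execute("INSERT INTO thing (id, type, created_at) VALUES (1, 'link', '2010-07-01')")
conn.executemany(
    "INSERT INTO data (thing_id, key, value) VALUES (?, ?, ?)",
    [(1, "title", "Reddit's database has two tables"),
     (1, "url", "http://example.com/post"),
     (1, "author", "steve")],
)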

The price is you can’t use cool relational features. There are no joins in the database and you must manually enforce consistency. No joins means it’s really easy to distribute data to different machines. You don’t have to worry about foreign keys, doing joins, or how to split the data up. It worked out really well. Worries about using a relational database are a thing of the past.

This fits with a piece I read the other day about how MongoDB has high adoption for small projects because it lets you just start storing things, without worrying about what the schema or indexes need to be. Reddit’s approach lets them easily add more data to existing objects, without the pain of schema updates or database pivots. Of course, your mileage is going to vary, and you should think closely about your data model and what relationships you need.

Berkeley "big data" class for free

You can get free access to a two-day “big data” class from Berkeley.

It will be live-streamed online, as explained on their website: http://ampcamp.berkeley.edu/

 

The first UC Berkeley AMP Camp will be hosted in Berkeley (and online) August 21-22, 2012, brought to you by the AMPLab, featuring hands-on tutorials teaching Big Data analysis using the AMPLab software stack, including Spark, Shark, and Mesos. These tools help accelerate Hadoop and other popular data management platforms.

The AMPLab works at the intersection of machine learning, cloud computing, and crowdsourcing; integrating Algorithms, Machines, and People (AMP) to make sense of Big Data, and we want to share our expertise with you! Attendees will learn to solve Big Data problems using components of the Berkeley Data Analytics System (BDAS) and cutting edge machine learning algorithms.

The AMP Camp curriculum includes:

  • Attendees will analyze real data on EC2 with Spark, Shark, and Mesos, which enable interactive queries and iterative jobs up to 30x faster than Hadoop MapReduce (see the sketch at the end of this post)
  • Scalable machine learning algorithms
  • Crowdsourcing answers to questions that can’t be answered by computers alone
  • Case-studies presented by active BDAS users

Registration for in-person attendance is sold out, but we encourage you to click the button below to register for our FREE online live streaming of the event Aug 21-22. The live stream includes all talks, hands-on exercises, and walk-throughs using Spark, Shark, and Mesos on real data. If you miss any live or in-person event, you can return to this site to find archives of all materials.
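To give a flavour of why Spark speeds up interactive and iterative work, here is a minimal, hypothetical PySpark sketch: the data is loaded once, cached in cluster memory, and then queried repeatedly without re-reading it from disk. The file path and queries are invented for illustration and are not part of the AMP Camp material.

from pyspark import SparkContext

sc = SparkContext("local[*]", "amp-camp-sketch")

# Load a (hypothetical) set of log files once and keep it in memory.
logs = sc.textFile("s3n://my-bucket/access-logs/*.gz").cache()

# Subsequent queries reuse the cached RDD instead of re-reading from
# disk, which is where the speed-up over plain MapReduce comes from.
errors = logs.filter(lambda line: " 500 " in line)
print(errors.count())
print(errors.filter(lambda line: "/checkout" in line).count())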

Introduction to graph database by Neo4J

Great introduction to Graph Databases provided by Neo4J:

This video demonstrates how graph databases fit within the NOSQL space, and where they are most appropriately used. In this session you will learn:

  • Overview of NOSQL
  • Why graphs matter
  • Overview of Neo4j
  • Use cases for graph databases

Big Data university

Big Data University is an online educational site run by new and experienced Hadoop, Big Data, and DB2 users who want to learn, contribute course materials, or look for job opportunities.

The site includes free and fee-based courses delivered by experienced professionals and teachers.

Big Data University is hosted in the cloud and is run by a group of enthusiasts from around the world. They use the Moodle 2 course management system, enabled to run on DB2.

http://bigdatauniversity.com/courses/

Harvard releases metadata for 12 million library books into the public domain

Harvard University has today put into the public domain (CC0) full bibliographic information about virtually all the 12M works in its 73 libraries. The metadata, in the standard MARC21 format, is available for bulk download from Harvard. The University also provided the data to the Digital Public Library of America’s prototype platform for programmatic access via an API.

Official press release

Harvard’s Open Metadata policy

API details

MapReduce for the masses

Great initiative from CommonCrawl.org, called “MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl”.

I’ve duplicated their original blog post hereafter, from: http://commoncrawl.org/mapreduce-for-the-masses/

 

Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information in the Amazon cloud, a net total of 5 billion crawled pages. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.

When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.

With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?

This is the very question we hope to answer with this blog post, and the example we’ll use to demonstrate how is a riff on the canonical Hadoop Hello World program, a simple word counter, but the twist is that we’ll be running it against the Internet.
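If you want a feel for the shape of that canonical word counter before opening the Java project below, here is a minimal sketch using Hadoop Streaming with Python. It is not the HelloWorld code from the repository, just an illustration of the same map/reduce pattern, and the streaming invocation mentioned in the comments is indicative only.

#!/usr/bin/env python
# Word-count mapper and reducer for Hadoop Streaming (illustrative only).
# The same file can be passed as both -mapper "wordcount.py map" and
# -reducer "wordcount.py reduce" to the hadoop-streaming JAR.
import sys
from itertools import groupby

def mapper():
    # Emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word.lower())

def reducer():
    # Hadoop sorts mapper output by key, so equal words arrive together.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (word, sum(int(count) for _, count in group)))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()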

When you’ve got a taste of what’s possible when open source meets open data, we’d like to whet your appetite by asking you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!

Ready to get started?  Watch our screencast and follow along below:

Step 1 – Install Git and Eclipse

We first need to install a few important tools to get started:

Eclipse (for writing Hadoop code)

How to install (Windows and OS X):

Download the “Eclipse IDE for Java developers” installer package located at:

http://www.eclipse.org/downloads/

How to install (Linux):

Run the following command in a terminal:

RHEL/Fedora

 # sudo yum install eclipse

Ubuntu/Debian

 # sudo apt-get install eclipse

Git (for retrieving our sample application)

How to install (Windows)

Install the latest .EXE from:

http://code.google.com/p/msysgit/downloads/list

How to install (OS X)

Install the appropriate .DMG from:

http://code.google.com/p/git-osx-installer/downloads/list

How to install (Linux)

Run the following command in a terminal:

RHEL/Fedora

# sudo yum install git

Ubuntu/Debian

# sudo apt-get install git

Step 2 – Check out the code and compile the HelloWorld JAR

Now that you’ve installed the packages you need to play with our code, run the following command from a terminal/command prompt to pull down the code:

# git clone git://github.com/ssalevan/cc-helloworld.git

Next, start Eclipse.  Open the File menu then select “Project” from the “New” menu.  Open the “Java” folder and select “Java Project from Existing Ant Buildfile”.  Click Browse, then locate the folder containing the code you just checked out (if you didn’t change the directory when you opened the terminal, it should be in your home directory) and select the “build.xml” file.  Eclipse will find the right targets, and tick the “Link to the buildfile in the file system” box, as this will enable you to share the edits you make to it in Eclipse with git.

We now need to tell Eclipse how to build our JAR, so right click on the base project folder (by default it’s named “Hello World”) and select “Properties” from the menu that appears.  Navigate to the Builders tab in the left hand panel of the Properties window, then click “New”.  Select “Ant Builder” from the dialog which appears, then click OK.

To configure our new Ant builder, we need to specify three pieces of information here: where the buildfile is located, where the root directory of the project is, and which ant build target we wish to execute.  To set the buildfile, click the “Browse File System” button under the “Buildfile:” field, and find the build.xml file which you found earlier.  To set the root directory, click the “Browse File System” button under the “Base Directory:” field, and select the folder into which you checked out our code.  To specify the target, enter “dist” without the quotes into the “Arguments” field.  Click OK and close the Properties window.

Finally, right click on the base project folder and select “Build Project”, and Ant will assemble a JAR, ready for use in Elastic MapReduce.

Step 3 – Get an Amazon Web Services account (if you don’t have one already) and find your security credentials

If you don’t already have an account with Amazon Web Services, you can sign up for one at the following URL:

https://aws-portal.amazon.com/gp/aws/developer/registration/index.html

Once you’ve registered, visit the following page and copy down your Access Key ID and Secret Access Key:

https://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key

This information can be used by any Amazon Web Services client to authorize things that cost money, so be sure to keep this information in a safe place.

Step 4 – Upload the HelloWorld JAR to Amazon S3

Uploading the JAR we just built to Amazon S3 is a lot simpler than it sounds. First, visit the following URL:

https://console.aws.amazon.com/s3/home

Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane, then click the “Upload” button, and select the JAR you just built. It should be located here:

<your checkout dir>/dist/lib/HelloWorld.jar

Step 5 – Create an Elastic MapReduce job based on your new JAR

Now that the JAR is uploaded into S3, all we need to do is to point Elastic MapReduce to it, and as it so happens, that’s pretty easy to do too! Visit the following URL:

https://console.aws.amazon.com/elasticmapreduce/home

and click the “Create New Job Flow” button. Give your new flow a name, and tick the “Run your own application” box. Select “Custom JAR” from the “Choose a Job Type” menu and click the “Continue” button.

The next field in the wizard will ask you which JAR to use and what command-line arguments to pass to it. Add the following location:

s3n://<your bucket name>/HelloWorld.jar

then add the following arguments to it:

org.commoncrawl.tutorial.HelloWorld <your aws secret key id> <your aws secret key> 2010/01/07/18/1262876244253_18.arc.gz s3n://<your bucket name>/helloworld-out

CommonCrawl stores its crawl information as GZipped ARC-formatted files (http://www.archive.org/web/researcher/ArcFileFormat.php), and each one is indexed using the following strategy:

/YYYY/MM/DD/the hour that the crawler ran in 24-hour format/*.arc.gz
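As a quick illustration of that naming scheme, here is a small Python sketch that builds the directory prefix for a given crawl hour (the timestamped file name itself, e.g. 1262876244253_18.arc.gz, is assigned by the crawler and is not derived here):

from datetime import datetime

# The crawl hour used in this tutorial: 6:00 PM on January 7th, 2010.
crawl_hour = datetime(2010, 1, 7, 18)
prefix = crawl_hour.strftime("%Y/%m/%d/%H")
print(prefix)  # 2010/01/07/18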

Thus, by passing these arguments to the JAR we uploaded, we’re telling Hadoop to:

1. Run the main() method in our HelloWorld class (located at org.commoncrawl.tutorial.HelloWorld)

2. Log into Amazon S3 with your AWS access codes

3. Count all the words taken from a chunk of what the web crawler downloaded at 6:00PM on January 7th, 2010

4. Output the results as a series of CSV files into your Amazon S3 bucket (in a directory called helloworld-out)

Edit 12/21/11: Updated to use directory prefix notation instead of glob notation (thanks Petar!)

If you prefer to run against a larger subset of the crawl, you can use directory prefix notation to specify a more inclusive set of data. For instance:

2010/01/07/18 – All files from this particular crawler run (6PM, January 7th 2010)

2010/ - All crawl files from 2010

Don’t worry about the remaining fields for now; just accept the default values. If you’re offered the opportunity to enable debugging, I recommend doing so to be able to see your job in action. Once you’ve clicked through them all, click the “Create Job Flow” button and your Hadoop job will be sent to the Amazon cloud.

Step 6 – Watch the show

Now just wait and watch as your job runs through the Hadoop flow; you can look for errors by using the Debug button. Within about 10 minutes, your job will be complete. You can view the results in the S3 Browser panel of the AWS console. If you download these files and load them into a text editor, you can see what came out of the job. You can take this sort of data and add it into a database, or create a new Hadoop OutputFormat to export into XML which you can render into HTML with XSLT; the possibilities are pretty much endless.

Step 7 – Start playing!

If you find something cool in your adventures and want to share it with us, we’ll feature it on our site if we think it’s cool too. To submit a remix, push your codebase to GitHub or Gitorious and send a message to our user group about it: we promise we’ll look at it.

IETF working on a convention for HTTP access to JSON resources

An Internet-Draft is in the works for “A Convention for HTTP Access to JSON Resources”.

Abstract

This document codifies a convention for accessing JSON representations of resources via HTTP.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

 

http://tools.ietf.org/html/draft-pbryan-http-json-resource-01
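For readers unfamiliar with the idea, here is a minimal Python sketch of fetching a JSON representation of a resource over HTTP. The URL is a placeholder, and the precise conventions (URI structure, methods, error handling) are exactly what the draft codifies; this only shows the general pattern.

import json
import urllib.request

# Placeholder URL for a JSON resource.
req = urllib.request.Request(
    "http://example.com/widgets/42",
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    widget = json.loads(resp.read().decode("utf-8"))
print(widget)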

Set Up a Hadoop Cluster with Mongo @Discovering stuff

Great blog post from Artem Yankov on how to set up a Hadoop cluster with MongoDB.

It is a step-by-step guide that shows how to:

  • Create your own AMI with custom settings (Hadoop and mongo-hadoop installed)
  • Launch a Hadoop cluster on EC2
  • Add more nodes to the cluster
  • Run the jobs

Access the blog post