Choosing a Shard Key

Choosing a shard key can be difficult, and the factors involved largely depend on your use case.

In fact, there is no such thing as a perfect shard key; there are design tradeoffs inherent in every decision. This presentation goes through those tradeoffs, as well as the different types of shard keys available in MongoDB, such as hashed and compound shard keys.
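As a rough illustration of one such tradeoff, here is a toy Python sketch of hashed versus ranged shard-key routing. The routing functions and chunk boundaries are invented for illustration and are not MongoDB's actual implementation:

```python
import hashlib

def hashed_shard(key, num_shards):
    # Hash-based routing: spreads monotonically increasing keys
    # (timestamps, counters) evenly, but gives up cheap range scans.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

def ranged_shard(key, upper_bounds):
    # Range-based routing: upper_bounds are the exclusive upper bounds
    # of each chunk. Adjacent keys stay together, which makes range
    # queries cheap but can hot-spot a single shard on inserts.
    for shard, upper in enumerate(upper_bounds):
        if key < upper:
            return shard
    return len(upper_bounds)  # final, open-ended chunk

# Nine sequential inserts all land on one shard under range routing...
range_targets = [ranged_shard(k, [100, 200, 300]) for k in range(90, 99)]
# ...while hash routing scatters them across the cluster.
hash_targets = [hashed_shard(k, 4) for k in range(90, 99)]
```

The hot-spot behavior of `range_targets` is exactly the insert-scalability problem a hashed shard key trades away range-query locality to avoid.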

LevelDB: a fast and lightweight key/value database library by Google

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.


  • Keys and values are arbitrary byte arrays.
  • Data is stored sorted by key.
  • Callers can provide a custom comparison function to override the sort order.
  • The basic operations are Put(key,value), Get(key), and Delete(key).
  • Multiple changes can be made in one atomic batch.
  • Users can create a transient snapshot to get a consistent view of data.
  • Forward and backward iteration is supported over the data.
  • Data is automatically compressed using the Snappy compression library.
  • External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
  • Detailed documentation about how to use the library is included with the source code.


  • This is not a SQL database. It does not have a relational data model, it does not support SQL queries, and it has no support for indexes.
  • Only a single process (possibly multi-threaded) can access a particular database at a time.
  • There is no client-server support built in to the library. An application that needs such support will have to wrap its own server around the library.
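To make the contract above concrete, here is a toy in-memory Python model of the interface just described. This is not the real library, just a sketch of byte keys/values, key-sorted iteration, atomic batches, and snapshots:

```python
class ToyLevelDB:
    """Toy in-memory model of the LevelDB interface (illustration only)."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes):
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)

    def delete(self, key: bytes):
        self._data.pop(key, None)

    def write(self, batch):
        # All operations in a batch are applied atomically: stage
        # everything, then swap in the result in one step.
        staged = dict(self._data)
        for op, key, value in batch:
            if op == "put":
                staged[key] = value
            else:
                staged.pop(key, None)
        self._data = staged

    def snapshot(self):
        # A transient, consistent view: later writes are not visible.
        return dict(self._data)

    def iterate(self, reverse=False):
        # Data comes back sorted by key, forward or backward.
        for key in sorted(self._data, reverse=reverse):
            yield key, self._data[key]

db = ToyLevelDB()
db.put(b"b", b"2")
db.put(b"a", b"1")
snap = db.snapshot()
db.write([("put", b"c", b"3"), ("delete", b"a", None)])
keys = [k for k, _ in db.iterate()]
```

In the real library, sorted iteration, batches, and snapshots come from the log-structured storage engine rather than a Python dict, but the observable behavior sketched here matches the bullet points above.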


Here is a performance report (with explanations) from the run of the included db_bench program. The results are somewhat noisy, but should be enough to get a ballpark performance estimate.


We use a database with a million entries. Each entry has a 16 byte key, and a 100 byte value. Values used by the benchmark compress to about half their original size.

   LevelDB:    version 1.1
   Date:       Sun May  1 12:11:26 2011
   CPU:        4 x Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
   CPUCache:   4096 KB
   Keys:       16 bytes each
   Values:     100 bytes each (50 bytes after compression)
   Entries:    1000000
   Raw Size:   110.6 MB (estimated)
   File Size:  62.9 MB (estimated)
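The estimated sizes follow directly from the entry counts and sizes above; a quick sanity check (assuming 1 MB = 2^20 bytes):

```python
entries = 1_000_000
key_bytes, value_bytes = 16, 100
compressed_value_bytes = 50   # values compress to ~half their size

# Raw size: key + uncompressed value per entry.
raw_mb = entries * (key_bytes + value_bytes) / 2**20
# File size: key + compressed value per entry (ignores index overhead).
file_mb = entries * (key_bytes + compressed_value_bytes) / 2**20
```

Both figures come out to the report's 110.6 MB and 62.9 MB, so the estimates are simply entry counts times per-entry sizes.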


Write performance

The “fill” benchmarks create a brand new database, in either sequential or random order. The “fillsync” benchmark flushes data from the operating system to the disk after every operation; the other write operations leave the data sitting in the operating system buffer cache for a while. The “overwrite” benchmark does random writes that update existing keys in the database.


   fillseq      :       1.765 micros/op;   62.7 MB/s
   fillsync     :     268.409 micros/op;    0.4 MB/s (10000 ops)
   fillrandom   :       2.460 micros/op;   45.0 MB/s
   overwrite    :       2.380 micros/op;   46.5 MB/s


Each “op” above corresponds to a write of a single key/value pair; that is, the random write benchmark runs at approximately 400,000 writes per second.
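The MB/s figures can be reproduced from the micros/op numbers with a small helper, assuming 116 bytes of raw data per op (the 16-byte key plus 100-byte value from the setup above):

```python
def throughput(micros_per_op, bytes_per_op=116):
    # 116 bytes = 16-byte key + 100-byte value, per the benchmark setup.
    ops_per_sec = 1_000_000 / micros_per_op
    mb_per_sec = ops_per_sec * bytes_per_op / 2**20
    return ops_per_sec, mb_per_sec

ops, mb = throughput(2.460)  # the fillrandom line
```

This reproduces fillrandom's ~406,000 ops/s and 45.0 MB/s, confirming the report's arithmetic.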

Each “fillsync” operation costs much less (0.3 millisecond) than a disk seek (typically 10 milliseconds). We suspect that this is because the hard disk itself is buffering the update in its memory and responding before the data has been written to the platter. This may or may not be safe based on whether or not the hard disk has enough power to save its memory in the event of a power failure.

Read performance

We list the performance of reading sequentially in both the forward and reverse direction, and also the performance of a random lookup. Note that the database created by the benchmark is quite small. Therefore the report characterizes the performance of leveldb when the working set fits in memory. The cost of reading a piece of data that is not present in the operating system buffer cache will be dominated by the one or two disk seeks needed to fetch the data from disk. Write performance will be mostly unaffected by whether or not the working set fits in memory.


   readrandom   :      16.677 micros/op;  (approximately 60,000 reads per second)
   readseq      :       0.476 micros/op;  232.3 MB/s
   readreverse  :       0.724 micros/op;  152.9 MB/s


LevelDB compacts its underlying storage data in the background to improve read performance. The results listed above were done immediately after a lot of random writes. The results after compactions (which are usually triggered automatically) are better.


   readrandom   :      11.602 micros/op;  (approximately 85,000 reads per second)
   readseq      :       0.423 micros/op;  261.8 MB/s
   readreverse  :       0.663 micros/op;  166.9 MB/s


Some of the high cost of reads comes from repeated decompression of blocks read from disk. If we supply enough cache to leveldb so it can hold the uncompressed blocks in memory, the read performance improves again:

   readrandom   :       9.775 micros/op;  (approximately 100,000 reads per second before compaction)
   readrandom   :       5.215 micros/op;  (approximately 190,000 reads per second after compaction)
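The effect of caching uncompressed blocks can be mimicked with a simple memoizing read path. In this toy sketch (the block store and its sizes are invented), a second read of the same block skips decompression entirely:

```python
import zlib
from functools import lru_cache

# Invented stand-in for the on-disk block store: compressed blocks
# keyed by block number.
BLOCKS = {i: zlib.compress((f"block-{i} " * 100).encode()) for i in range(8)}

@lru_cache(maxsize=4)  # plays the role of the uncompressed-block cache
def read_block(block_no):
    # On a miss we pay for decompression; on a hit we do not.
    return zlib.decompress(BLOCKS[block_no])

first = read_block(3)   # miss: decompresses
second = read_block(3)  # hit: served straight from the cache
```

The real cache is sized in bytes rather than entries and lives inside the library, but the win is the same: repeated reads of hot blocks stop paying the decompression cost.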

"NoSQL, no security?"

“NoSQL, no security?” is a presentation from AppSec USA 2012, Austin, Texas.

Presented by Will Urbanski (@willurbanski)

NGDATA Publishes Big Data Whitepaper

Consumer-centric companies such as banks have more data about their consumers than ever, but relatively little intelligence about them. The world is increasingly interconnected, instrumented, and intelligent, and in this new world the velocity, volume, and variety of data being created is unprecedented. As the amount of data created about a consumer grows, the percentage of that data that banks and retailers can process is going down fast.

In this whitepaper, you will learn:

  • New revenue opportunities that banks can realize by embracing Big Data
  • Challenges banks face in getting a single view of the consumer
  • Key banking use cases: Mobile Wallet and Fraud Detection
  • How interactive Big Data management can help you change the game

Download this whitepaper to learn how banks can leverage Big Data to transform their business, know their customers better, realize new revenue opportunities, and detect fraud.



Switching from the Relational to the Graph model

One of the main points of resistance among RDBMS users to moving to a NoSQL product is the complexity of the model: OK, NoSQL products are great for Big Data and big scale, but what about the model?


There is a fundamental technical difference between a traditional RDBMS and a graph database: a graph database stores each relationship as a physical link to the record on the other side, assigned when the edge is created, while an RDBMS computes the relationship every time you query the database.
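The difference can be sketched in a few lines of Python. The data and names are made up, but the contrast is the point: a join recomputed per query versus direct links stored on the record.

```python
# Relational style: the relationship is recomputed at query time by
# scanning/joining on foreign keys.
people = {1: "Ada", 2: "Linus", 3: "Grace"}
follows_rows = [(1, 2), (1, 3), (2, 3)]  # (follower_id, followee_id)

def followees_relational(person_id):
    # Each query re-scans the join table.
    return [people[b] for a, b in follows_rows if a == person_id]

# Graph style: each edge is a direct physical link stored on the record,
# assigned when the edge is created, so traversal is pointer chasing.
class Node:
    def __init__(self, name):
        self.name = name
        self.follows = []  # direct links to neighboring records

ada, linus, grace = Node("Ada"), Node("Linus"), Node("Grace")
ada.follows += [linus, grace]
linus.follows.append(grace)

def followees_graph(node):
    return [n.name for n in node.follows]
```

Both queries return the same answer, but the relational version's cost grows with the size of the join table, while the graph version's cost depends only on the number of edges on the record being traversed.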


White paper – Evolving Role of the Data Warehouse in the Era of Big Data

InfoWorld has published a great white paper: “The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics” (Ralph Kimball)

In this white paper, we describe the rapidly evolving landscape for designing an enterprise data warehouse (EDW) to support business analytics in the era of “big data.” We describe the scope and challenges of building and evolving a very stable and successful EDW architecture to meet new business requirements. These include extreme integration, semi- and un-structured data sources, petabytes of behavioral and image data accessed through MapReduce/Hadoop as well as massively parallel relational databases, and then structuring the EDW to support advanced analytics. This paper provides detailed guidance for designing and administering the necessary processes for deployment. This white paper has been written in response to a lack of specific guidance in the industry as to how the EDW needs to respond to the big data analytics challenge, and what necessary design elements are needed to support these new requirements.


“The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics” – a Kimball Group white paper by Ralph Kimball




Google's F1 – distributed scalable RDBMS


Google has moved its advertising services from MySQL to a new database, created in-house, called F1. The new system combines the best of NoSQL and SQL approaches.


According to Google Research, many of the services that are critical to Google’s ad business have historically been backed by MySQL, but Google has recently migrated several of these services to F1, a new RDBMS developed at Google. The team at Google Research says that F1 gives the benefits of NoSQL systems (scalability, fault tolerance, transparent sharding, and cost benefits) with the ease of use and transactional support of an RDBMS.


Google Research has developed F1 to provide relational database features such as a parallel SQL query engine and transactions on a highly distributed storage system that scales on standard hardware.





The store is dynamically sharded, supports replication across data centers while keeping transactions consistent, and can deal with data center outages without losing data. The downside of keeping the transactions consistent means F1 has higher write latencies compared to MySQL, so the team restructured the database schemas and redeveloped the applications so the effect of the increased latency is mainly hidden from external users. Because F1 is distributed, Google says it scales easily and can support much higher throughput for batch workloads than a traditional database.


The database is sharded by customer, and the applications are optimized using shard awareness. When more power is needed, the database can grow by adding shards. The use of shards in this way has some drawbacks, including the difficulties of rebalancing shards, and the fact you can’t carry out cross-shard transactions or joins.
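A minimal sketch of customer-based sharding with a directory illustrates both the easy part (adding shards for more power) and the hard parts mentioned above (rebalancing, cross-shard operations). All names here are invented for illustration, not F1's actual mechanism:

```python
# Hypothetical directory-based sharding by customer.
shard_map = {}          # customer_id -> shard (the "directory")
shards = ["s0", "s1"]

def assign(customer_id):
    # Sticky assignment: a customer stays on its shard once placed,
    # so all of that customer's data (and transactions) stay local.
    if customer_id not in shard_map:
        shard_map[customer_id] = shards[len(shard_map) % len(shards)]
    return shard_map[customer_id]

def add_shard(name):
    # Growing capacity is easy: new customers can land on the new
    # shard. Existing customers are NOT moved, which is exactly why
    # rebalancing is the hard part.
    shards.append(name)

a, b = assign("cust-a"), assign("cust-b")
add_shard("s2")
c = assign("cust-c")
# A transaction touching cust-a and cust-c would span two shards —
# the cross-shard case the article says is not supported.
```

Shard-aware applications use `assign` (or its real equivalent) to route every query, which is what makes single-customer work fast and cross-customer work awkward.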






F1 has been co-developed with a new lower-level storage system called Spanner. This is described as a descendant of Google’s Bigtable, and as the successor to Megastore. Megastore is the transactional indexed record manager built by Google on top of its BigTable NoSQL datastore. Spanner offers synchronous cross-datacenter replication (with Paxos, the algorithm for fault tolerant distributed systems). It provides snapshot reads, and does multiple reads followed by a single atomic write to ensure transaction consistency.


F1 is based on sharded Spanner servers, and can handle parallel reads with SQL or MapReduce. Google has deployed it using five replicas spread across the country to survive regional disasters. Reads are much slower than in MySQL, taking between 5 and 10 ms.


The SQL parallel query engine was developed from scratch to hide the remote procedure call (RPC) latency and to allow parallel and batch execution. The latency is dealt with by using a single read phase and banning serial reads, though you can carry out asynchronous reads in parallel. Writes are buffered at the client, and sent as one RPC. Object relational mapping calls are also handled carefully to avoid those that are problematic to F1.


The research paper on F1, presented at SIGMOD 2012, cites serial reads and for loops that carry out one query per iteration as particular things to avoid, saying that while these hurt performance in all databases, they are disastrous on F1. In view of this, the client library is described as a very lightweight ORM – it doesn’t really have the “R”. It never uses relational or traversal joins, and all objects are loaded explicitly.




Git as a NoSQL Database

We all know that Git is amazing for storing code. It is fast, reliable, flexible, and it keeps our project history nuzzled safely in its object database while we sleep soundly at night.

But what about storing more than code? Why not data? Much flexibility is gained by ditching traditional databases, but at what cost?

More information about the speaker, Brandon Keepers

MIT: Looking At The World Through Twitter Data

Great article from MIT

My friend Kang, also an MIT CS undergrad, and I started playing with some data from Twitter a little while ago. I wrote this post to give a summary of some of the challenges we faced, some things we learned along the way, and some of our results so far. I hope they’ll show to you, as they did to us, how valuable social data can be.

Scaling With Limited Resources

Thanks to AWS, companies these days are not as hard to scale as they used to be. However, college dorm-room projects still are. One of the less important things a company has to worry about is paying for its servers. That’s not the case for us, though, since we’ve been short on money and pretty greedy with data.

Here are some rough numbers about the volume of the data that we analyzed and got our results from.

User data: ~ 70 GB
Tweets: > 1 TB and growing
Analysis results: ~ 300 GB

> 10 billion rows in our databases.

Given the fact that we use almost all of this data to run experiments everyday, there was no way we could possibly afford putting it up on Amazon on a student budget. So we had to take a trip down to the closest hardware store and put together two desktops. That’s what we’re still using now.

We did lots of research about choosing the right database. Since we are only working on two nodes and are mainly bottlenecked by insertion speeds, we decided to go with good old MySQL. All other solutions were too slow on a couple nodes, or were too restrictive for running diverse queries. We wanted flexibility so we could experiment more easily.

Dealing with the I/O limitations on a single node is not easy. We had to use all kinds of different tricks to get around our limitations: SSD caching, bulk insertions, MySQL partitioning, dumping to files, extensive use of bitmaps, and the list goes on.

If you have advice or questions regarding all that fun stuff, we’re all ears. : )

Now on to the more interesting stuff, our results.

Word-based Signal Detection

Our first step towards signal and event detection was counting single words. We counted the number of occurrences of every word in our tweets during every hour. Sort of like Google’s Ngrams, but for tweets.
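The hourly word counts can be sketched as a two-level counter; the tweet tuples below are made-up stand-ins for the real data:

```python
from collections import Counter, defaultdict

# Invented (hour, text) pairs standing in for the real tweet stream.
tweets = [
    ("2012-01-24T21", "great speech obama sotu"),
    ("2012-01-24T21", "watching obama tonight"),
    ("2012-01-24T22", "going to sleep"),
]

counts = defaultdict(Counter)  # hour -> word -> occurrences
for hour, text in tweets:
    counts[hour].update(text.split())
```

At this scale a nested counter is trivial; over a terabyte of tweets the same aggregation has to be pushed into bulk-inserted, partitioned tables, which is where the MySQL tricks above come in.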

Make sure you play around with some of our experimental results here. The times are EST/EDT. *

Here are some cool things we found from looking at the counts. If you also find anything interesting from looking at different words, please share it with us!

Daily and Weekly Fluctuations

If you look for a very common word like ‘a’ to see an estimate of the total volume of tweets, you clearly see a daily and weekly pattern. Tweeting peaks at around 7 pm PST (10 pm EST) and hits a low at around 3 am PST every day. There’s also generally less tweeting during Fridays and Saturdays, probably because people have better things to do with their lives than to tweet!


A side note:

We’ve tried to focus on English speaking tweeters within the States. Note that the percentage of tweets containing ‘a’ also fluctuates during the day, which is surprising at first. But, this is because non-English tweets that we have discarded are much more frequent during the night in our time zone, and they often don’t contain the word ‘a’ as often as English tweets do.


I’m a night owl myself and I had always been curious to know at exactly what time the average person goes to sleep or at least thinks about it! I looked for the words “sleep”, “sleeping”, and “bed”. You can do this yourself, but the only problem you’ll see is that not all the tweets have the same time zones. To solve this issue, we isolated several million tweets which had users who had set their locations to Boston or Cambridge, MA. Then, we created a histogram of their average sleeping hours. Here’s the result:


It seems like the average Bostonian sleeps at around midnight! Of course, that’s probably not the average everywhere. After all, a fourth of our city is nocturnal college students!

You can look at all kinds of words relating to recurring events like ‘lunch’, ‘class’, ‘work’, ‘hungry’ and whatever you can imagine. I promise you, you’ll be fascinated.

Here are some suggestions:
Coke, Valentine, Hugo and other oscars-related words, IPO.
(please suggest other interesting things I should add to this list)

I’m Obsessed With Linguistics

As we were looking at different words, we noticed that the words Monday, Tuesday, etc. show very interesting weekly patterns. They reflect a signal that has its peak on the respective day, as you’d expect, and which rises as you get closer to that day. This means that people have more anticipation for days that are closer, more or less linearly. But if you pay closer attention, you’ll see that the day immediately before the search term corresponds to a clear valley in the curve. This points to a very interesting linguistic phenomenon: in English, we never refer to the next day by the name of that weekday, and instead use the word ‘tomorrow’.


On a Wednesday, people don’t say ‘Thursday’


We tried to find events that we thought would have a strong reflection in the Twitter sphere. ‘superbowl’, ‘sopa’, and ‘goldman’ were pretty interesting. Here are the graphs for those three, which you can also recreate yourself.

Tweets about ‘Superbowl’ during each hour

Tweets about ‘Sopa’ during each hour

Tweets about ‘Goldman Sachs’ during each hour

We’ll post more about our attempt to dissect exactly what happened on Twitter as these events progressed. In the Goldman Sachs case, for example, the peak happens on the day of the controversial public exit of a GS employee, which was covered in the NYTimes. The earliest news release was at 7 am GMT, which coincides with the first signs of a rise in our signal.

Politics and Public Sentiment

If you query the word Obama, this is what you’ll see:


When we first saw this spike, we were very suspicious. The spike seemed way too prominent to be associated with an event (~25 times the average signal amplitude). But guess what: the spike was at 9 PM on Jan 24th, when the State of the Union speech happened!

We were curious to see some of the 250K sample tweets containing ‘obama’ from that hour. Here are a few of them, along with some self-declared descriptions from the users:

ok. the obama / giffords embrace made me choke up a little.
A teacher from Killeen, Texas. ~200 followers.

I love that Obama starts out with a tribute to our military. #SOTU
A liker of progressive politics from Utah. ~100 followers.

Great Speech Obama #SOTU
A CEO from NYC. ~700 followers.

There were both positive and negative tweets. But we wanted to know whether the tweets were positive or negative because that’s what really matters in a context like this. Here are some results and an explanation of how our sentiment analysis works in general.

Sentiment Analysis

Our sentiment analysis is done by training our model on several thousand manually classified sample tweets. It reaches very good prediction accuracy according to our tests, as I’ll explain below.
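As an illustration of the general approach (not the authors' actual model), here is a toy naive Bayes sentiment classifier trained on a handful of made-up labeled tweets; the real system used several thousand manually classified samples:

```python
import math
from collections import Counter

# Invented labeled training tweets standing in for the real corpus.
train = [
    ("great speech tonight", "pos"),
    ("love this so much", "pos"),
    ("terrible awful night", "neg"),
    ("so sad and angry", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    word_counts[label].update(text.split())

def classify(text):
    vocab = len(set(word_counts["pos"]) | set(word_counts["neg"]))
    scores = {}
    for label, wc in word_counts.items():
        total = sum(wc.values())
        # Sum of log-probabilities with add-one smoothing.
        scores[label] = sum(
            math.log((wc[w] + 1) / (total + vocab)) for w in text.split()
        )
    return max(scores, key=scores.get)
```

Scoring millions of tweets with a model like this and averaging per day is what produces the sentiment curves discussed below.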

The graph below shows the normalized sentiments of several different sets of tweets for each day during a 3 month period.

The two graphs here are sentiments of millions of independent randomly chosen tweets from our set. The fact that they follow each other so closely is the important achievement of our system. It means that the signal to noise ratio is so high that the sentiment is clearly measurable.

You can also observe a weekly periodicity in the general sentiment. Interestingly, it shows that people have happier and more positive tweets during the weekends compared to the rest of the week! In addition, the signal acts sort of unusually around January 1st and February 14th!

The graphs below, on the other hand, are indicative of the sentiments of tweets in which some combination of keywords related to ‘economy’ or ‘energy’ were talked about. As you can see, the patterns in the graphs are fairly stable other than at a single day, January 24th, where the sentiment significantly drops. That’s when Obama’s state of the union speech was, and it looks like his speech triggered a lot of negative tweets related to energy and the economy.

We were curious to see whether the sentiment was only negative for this bag of tweets (those that contain energy, economy), or if tweets about obama during the state of the union speech were negative in general. Here are our results of running sentiment analysis on all the data containing the word ‘obama’ in the past 4 months.

Here, the blue curve is the average sentiment of tweets for each day and the red curve shows the amount of variance in the sentiment. (If it’s high, then there were more happy and sad tweets and when it’s low, tweets were more neutral)

The graph clearly shows that there was some heated debate about Obama on exactly January 24th. On the same day, the sentiment has dropped and so we can tell that there was more negative tweeting than there was positive. It looks like people weren’t too happy during the #sotu speech.

As we looked at the graph, it was also hard to ignore the other peak in variance (polarity) that appears around the 15th to 17th of January. It seems like sentiment about Obama fell and rose rapidly within only a few days in that time span. This is likely a result of tweets about SOPA/PIPA and Obama’s disagreement with the bill, which happened during those days.

The Growth of Twitter

Twitter has had periods of slow and rapid growth since its inception on March 21st of 2006 up until now. We tried to capture the growth of Twitter, since its very first user until earlier this year. Here is the result showing the number of people joining Twitter during every hour since day one:

As you can see, there are some abnormally large numbers of people joining during specific hours throughout Twitter’s lifetime. One is apparently in April 2009, when they released their search feature. Another is probably when they rolled out the then-new Twitter for everyone.

Summing up…

As we started looking at the data for the first time, we were absolutely blown away by all the cool insights you can extract from it. It makes you wonder why people aren’t doing these sorts of things more often with all this cheap data and computing power that they have these days.

This was our first blog post and we hope you liked what you saw. You’re awesome if you’re still reading. Stay tuned for more and please give us any feedback you may have!

*The count results aren’t perfectly accurate. There is a general upward trend because Twitter deletes history beyond 3,200 tweets. There is also a discontinuity around Feb 11th, which is because of a temporary glitch we had.

Couchbase Survey Shows Accelerated Adoption of NoSQL in 2012

Couchbase today announced the results of an industry survey conducted in December that shows growing adoption of NoSQL in 2012. According to the survey, the majority of the more than 1,300 respondents will fund NoSQL projects in the coming year, saying the technology is becoming more important or critical to their company’s daily operations. Respondents also indicated that the lack of flexibility/rigid schemas associated with relational technology was a primary driver toward NoSQL adoption.

You can read the results of the survey here, as well as some surprises from the survey at this page

NoSQL 2012 Survey Highlights

Key data points from the Couchbase NoSQL survey include:

  • Nearly half of the more than 1,300 respondents indicated they have funded NoSQL projects in the first half of this year. In companies with more than 250 developers, nearly 70% will fund NoSQL projects over the course of 2012.
  • 49% cited rigid schemas as the primary driver for their migration from relational to NoSQL database technology. Lack of scalability and high latency/low performance also ranked highly among the reasons given for migrating to NoSQL (see chart below for more details).
  • 40% overall say that NoSQL is very important or critical to their daily operations, with another 37% indicating it is becoming more important.



Surprises from the Survey

Language mix. A common theme in the results was what one could interpret as the “mainstreaming” of NoSQL database technology. The languages being used to build applications atop NoSQL database technology, while they include a variety of more progressive choices, are dominated by the mundane: Java and C#. And while we’ve had a lot of anecdotal interest in a pure C driver for Couchbase (which we now have, by the way), only 2.1% of the respondents indicated it was the “most widely used” language for application development in their environment, behind Java, C#, PHP, Ruby, Python and Perl (in order).

Schema management is the #1 pain driving NoSQL adoption. So I’ll admit that I wasn’t actually surprised by this one, because I’d already been surprised by it earlier. Two years ago if you had asked me what the biggest need we were addressing was, I would have said it was the need for a “scale-out” solution at the data layer versus the “scale-up” nature of the relational model. That users wanted a database that scaled like their application tier – just throw more cheap servers behind a load balancer as capacity needs increase. While that is still clearly important, the survey results confirmed what I’d been hearing (to my initial surprise) from users: the flexibility to store whatever you want in the database and to change your mind, without the requirement to declare or manage a schema, is more important.