Reveal your 2012 NoSQL plans and win an iPad 2

Couchbase offers a chance to win an iPad 2 if you reveal your 2012 NoSQL plans.

Sword of the data, a step behind [part 2]

This article follows “Sword of the data, a brief History [part 1]”, available here.



The Sword of the data, part #2

A step behind "the data issue" 



1/A horizon of new possibilities

We’re living in a new era, one that can probably be called the big data era. In one decade, data has driven revolutionary changes in computing and the Internet: new opportunities for generating revenue, more efficient business processes, and the ability for any organization, whatever its size, to operate worldwide in real time.


2/Data-centric  enterprise

As organizations become more data-centric every day, they are capturing data at deeper levels of detail and keeping more history than ever before, with volumes still growing at an unprecedented rate. Data is now the most important and valuable component of modern applications, websites, and the business intelligence organizations use for decision-making.


3/In this new era, “everything lives or dies strictly by the sword of data”

When competing for the same ground, the keys to success will be tied to the way you handle your data:

  • Safety: information security has become a requirement
  • Quality, integrity, consistency: erroneous or conflicting data has a major cost to a business’s bottom line, impacting customers, reputation and revenue
  • Accessibility: easy search, and the number of clicks required to reach a given piece of data, make a difference
  • Availability: downtime now has a cost
  • Velocity: poor performance reduces productivity



Sword of the data, a brief History [part 1]

The Sword of the data, part #1
A Brief History of the last decade: "making things possible and making them easy"


1/First came the plummeting price of hard drive space: storage costs have decreased steadily over the last 30 years.


2/Second came the 2001 financial market crash, the so-called dot-com bubble, which nevertheless left the world with cheap and widespread wired and wireless communication networks. Within a few years, most of the world got connected to the Internet.



3/Third came software innovation, driven by the brand-new challenges faced by Internet companies.

Google, Amazon, Facebook, Twitter and LinkedIn all had to solve the same technical problem. They needed to handle huge volumes of data, at a scale never reached before. Managing and processing those very large datasets in order to deliver results in (almost) real time became the key challenge for leading the competition.

No existing software was available at the time to solve those problems. Perhaps surprisingly, all of those companies took the same path: each concluded its only choice was to internally develop new pieces of software to handle very large data sets on a distributed architecture (meaning hundreds of computers communicating over a network to achieve a common goal).




4/At last, the innovation was made public and open source


Last but not least, most of the software developed was open-sourced, such as Hadoop, Cassandra (Facebook) and Voldemort (LinkedIn). Google’s MapReduce patent became free to use, and Amazon opened its architecture to third parties. Innovation was made easily accessible to the public.

CouchDB 1.1.1 has been released

CouchDB 1.1.1 has been released; it brings the following changes:

  • Support SpiderMonkey 1.8.5
  • Add configurable maximum to the number of bytes returned by _log.
  • Allow CommonJS modules to be an empty string.
  • Bump minimum Erlang version to R13B02.
  • Do not run deleted validate_doc_update functions.
  • ETags for views include current sequence if include_docs=true.
  • Fix bug where duplicates can appear in _changes feed.
  • Fix bug where update handlers break after conflict resolution.
  • Fix bug with _replicator where including “filter” could crash couch.
  • Fix crashes when compacting large views.
  • Fix file descriptor leak in _log
  • Fix missing revisions in _changes?style=all_docs.
  • Improve handling of compaction at max_dbs_open limit.
  • JSONP responses now send “text/javascript” for Content-Type.
  • Link to ICU 4.2 on Windows.
  • Permit forward slashes in path to update functions.
  • Reap couchjs processes that hit reduce_overflow error.
  • Status code can be specified in update handlers.
  • Support provides() in show functions.
  • _view_cleanup when ddoc has no views now removes all index files.
  • max_replication_retry_count now supports “infinity”.
  • Fix replication crash when source database has a document with empty ID.
  • Fix deadlock when assigning couchjs processes to serve requests.
  • Fixes to the document multipart PUT API.
  • Fixes regarding file descriptor leaks for databases with views.

Attacking and Defending NoSQL

One of the first papers on NoSQL security. It’s a great presentation by Bryan Sullivan at the RSA conference; it introduces the main security issues of NoSQL solutions, such as:

  • NoSQL injection: just like SQL injection, but manipulating the JSON string instead of the SQL query
  • Authentication: unsupported or discouraged in NoSQL solutions, which is a big issue when combined with a REST API
  • SSJS injection, aka server-side JavaScript injection
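As a rough illustration of the first point (a sketch, not taken from the talk), NoSQL injection works like SQL injection: concatenating untrusted input into a JSON query string lets an attacker smuggle in operators, while building the query as a data structure and serializing it does not. The `"$or"` payload below is a hypothetical example.

```python
import json

def unsafe_query(username):
    # VULNERABLE: untrusted input is concatenated straight into the JSON string.
    return '{"username": "' + username + '"}'

def safe_query(username):
    # Safe: build the query as a data structure and let the encoder escape it.
    return json.dumps({"username": username})

# Hypothetical payload that smuggles a "$or" operator into the query.
payload = 'admin", "$or": [{}], "x": "x'

injected = json.loads(unsafe_query(payload))  # attacker-controlled "$or" key appears
clean = json.loads(safe_query(payload))       # just {"username": <raw payload string>}
```

The unsafe version parses into a query with an attacker-injected `"$or"` condition; the safe version keeps the whole payload as an inert string value.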


Should I use MongoDB or CouchDB (or Redis)?

Copied from an interesting post on Google+:


Riyad Kalla  –  “Should I use MongoDB or CouchDB (or Redis)?”
I see this question asked a lot online. Fortunately, once you get familiar enough with each of the NoSQL solutions and their ins and outs, strengths and weaknesses, it becomes much clearer when you would use one over the other. From the outside, so many of them look the same, especially Mongo and Couch. Below I will try to break down the big tie-breakers that will help you decide between them.

[**] Querying – If you need the ability to dynamically query your data like SQL, MongoDB provides a query syntax that will feel very familiar to you. CouchDB is getting a query language in the form of UNQL in the next year or so, but it is very much under development, and we have no knowledge yet of the impact it will have on view generation and query speed, so I cannot recommend it yet.
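To make the “query document” idea concrete: the operator names below (`$gt`, `$lt`) are real MongoDB syntax, but the matcher itself is a toy, in-memory stand-in for illustration, not the actual driver.

```python
def matches(doc, query):
    """Toy matcher for a tiny subset of MongoDB's query-document syntax."""
    for field, cond in query.items():
        if isinstance(cond, dict):
            # Operator form, e.g. {"age": {"$gt": 21}}
            for op, val in cond.items():
                if op == "$gt" and not doc.get(field, float("-inf")) > val:
                    return False
                if op == "$lt" and not doc.get(field, float("inf")) < val:
                    return False
        elif doc.get(field) != cond:
            # Exact-match form, e.g. {"name": "ann"}
            return False
    return True

users = [{"name": "ann", "age": 35}, {"name": "bob", "age": 19}]
adults = [u for u in users if matches(u, {"age": {"$gt": 21}})]
```

In real MongoDB the same query document is passed to `find()` and evaluated server-side against indexes; the point here is only that queries are plain data, built dynamically at runtime.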

[**] Master-Slave Replication ONLY – MongoDB provides (great) support for master-slave replication across the members of what they call a “replica set”. Unfortunately you can only write to the master in the set and read from all.

If you need multiple masters in a Mongo environment, you have to set up sharding in addition to replica sets and each shard will be its own replica set with the ability to write to each master in each set. Unfortunately this leads to much more complex setups and you cannot have every server have a full copy of the data set (which can be handy/critical for some geographically dispersed systems – like a CDN or DNS service).

[**] Read Performance – Mongo employs a custom binary protocol (and format) providing reads at least an order of magnitude faster than CouchDB at the moment. There is work in the CouchDB community to try and add binary format support in addition to JSON, but it will still be communicated over HTTP.

[**] Provides speed-oriented operations like upserts and update-in-place mechanics in the database.

[**] Master-Master Replication – Because of the append-only style of commits Couch uses, every modification to the DB is considered a revision, making conflicts during replication much less likely and allowing for some awesome master-master replication, or what Cassandra calls a “ring” of servers all bi-directionally replicating to each other. It can even look more like a fully connected graph of replication rules.

[**] Reliability of the actual data store backing the DB. Because CouchDB records any changes as a “revision” to a document and appends them to the DB file on disk, the file can be copied or snapshotted at any time even while the DB is running and you don’t have to worry about corruption. It is a really resilient method of storage.

[**] Replication also supports filtering or selective replication by way of filters that live inside the receiving server and help it decide if it wants a doc or not from another server’s changes stream (very cool).

Using an EC2 deployment as an example, you can have a US-WEST db replicate to US-EAST, but only have it replicate items that meet some criteria, like the most-read stories of the day, so your east-coast mirror is just a cache of the most important stories that are likely getting hit from that region, and you leave the long-tail (less popular) stories just on the west coast server.

Another example of this is say you use CouchDB to store data about your Android game in the app store that everyone around the world plays. Say you have 5 million registered users, but only 100k of them play regularly. Say that you want to duplicate the most active accounts to 10 other servers around the globe, so the users that play all the time get really fast response times when they login and update scores or something. You could have a big CouchDB setup on the west coast, and then much smaller/cheaper ones spread out across the world in a few disparate VPS servers that all use a filtered replication from your west coast master to only duplicate the most active players.
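The mechanics of the examples above can be sketched as follows. In real CouchDB, the filter is a JavaScript function stored in a design document and applied to the source’s `_changes` feed; this Python stand-in just mimics the idea on a simplified feed, and the `reads_today` criterion is hypothetical.

```python
def replicate(changes, filter_fn):
    """Apply a replication filter to a (simplified) changes feed:
    only docs the filter accepts reach the target server."""
    return [doc for doc in changes if filter_fn(doc)]

changes = [
    {"_id": "story-1", "reads_today": 50000},
    {"_id": "story-2", "reads_today": 12},
]

# Hypothetical criterion: only replicate today's most-read stories eastward.
most_read = lambda doc: doc["reads_today"] >= 10000
east_coast_copy = replicate(changes, most_read)
```

The west-coast master keeps everything; the east-coast mirror only ever sees the documents that pass the filter.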

[**] Mobile platform support. CouchDB actually has installs for iOS and Android. When you combine the ability to run Couch on your mobile devices AND have it bidirectionally sync back to a master DB when it is online and just hold the results when it is offline, it is an awesome combination.

[**] Queries are written using map-reduce functions. If you are coming from SQL this is a really odd paradigm at first, but it will click relatively quickly and when it does you’ll see some beauty to it. These are some of the best slides describing the map-reduce functionality in couch I’ve read:
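A minimal sketch of that paradigm (real Couch views are JavaScript functions compiled into persistent indexes; this in-process Python mirror only shows the map-emit / group-reduce shape):

```python
from collections import defaultdict

def build_view(docs, map_fn, reduce_fn=None):
    """Toy CouchDB-style view: map emits (key, value) pairs per doc,
    reduce (optionally) folds the emitted values for each key."""
    rows = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            rows[key].append(value)
    if reduce_fn is None:
        return dict(rows)
    return {k: reduce_fn(vs) for k, vs in rows.items()}

docs = [
    {"type": "order", "customer": "ann", "total": 10},
    {"type": "order", "customer": "bob", "total": 7},
    {"type": "order", "customer": "ann", "total": 5},
]

# map: emit(customer, total); reduce: sum  ->  total spent per customer
totals = build_view(
    docs,
    map_fn=lambda d: [(d["customer"], d["total"])] if d["type"] == "order" else [],
    reduce_fn=sum,
)
```

Coming from SQL, this is roughly `SELECT customer, SUM(total) ... GROUP BY customer`, except the map and reduce steps are explicit functions you write.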

[**] Every mutation to the data in the database is considered a “revision” and creates a duplicate of the doc. This is excellent for redundancy and conflict resolution but makes the data store bigger on disk. Compaction is what removes these old revisions.
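The revision-plus-compaction behavior can be sketched like this (a toy append-only store, not Couch’s actual on-disk format):

```python
class TinyRevStore:
    """Append-only store: every write appends a new revision of a doc.
    Compaction rewrites the log, keeping only each doc's latest revision."""

    def __init__(self):
        self.log = []  # list of (doc_id, rev_number, body) in write order

    def put(self, doc_id, body):
        rev = 1 + sum(1 for d, _, _ in self.log if d == doc_id)
        self.log.append((doc_id, rev, body))  # old revisions stay on "disk"

    def get(self, doc_id):
        # The latest revision is the last matching entry in the log.
        for d, _, body in reversed(self.log):
            if d == doc_id:
                return body

    def compact(self):
        latest = {}
        for entry in self.log:
            latest[entry[0]] = entry  # later entries overwrite earlier ones
        self.log = list(latest.values())

store = TinyRevStore()
store.put("a", {"v": 1})
store.put("a", {"v": 2})  # the {"v": 1} revision is still in the log
store.compact()           # ...until compaction drops it
```

Before compaction the log holds both revisions of `"a"` (bigger on disk, but great for snapshots and conflict resolution); after compaction only the latest survives.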

[**] HTTP REST JSON interaction only. No binary protocol (yet – vote for support), everything is callable and testable from the command line with curl or your browser. Very easy to work with.

These are the biggest features. I would say if you need dynamic query support or raw speed is critical, then Mongo. If you need master-master replication, or sexy client-server replication for a mobile app that syncs with a master server every time it comes online, then you have to pick Couch. At root, these are the sort of “deal maker/breaker” divides.

Fortunately, once you have your requirements well defined, selecting the right NoSQL data store becomes really easy. If your data set isn’t that stringent in its requirements, then you have to look at secondary features of the data stores to see if one speaks to you… for example, do you like that CouchDB only speaks in raw JSON and HTTP, which makes it trivial to test/interact with from the command line? Maybe you hate the idea of compaction or map-reduce, ok then use Mongo. Maybe you fundamentally like the design of one over the other, etc.

If anyone has questions about Redis or others, let me know and I’ll expand on how those fit into this picture.

You hear Redis described a lot as a “data structure” server, and if you are like me this meant nothing to you for the longest time until you sat down and actually used Redis, then it probably clicked.

If you have a problem you are trying to solve that would be solved really elegantly with a List, Set, Hash or Sorted Set you need to take a look at Redis. The query model is simple, but there are a ton of operations on each of the data structure types that allow you to use them in really robust ways… like checking the union between two sets, or getting the members of a hash value or using the Redis server itself as a sub/pub traffic router (yea, it supports that!)
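The set operations mentioned above map directly onto Redis commands like `SUNION` and `SINTER`. Since a real example would need a running Redis server, here is an in-process Python mirror of the semantics:

```python
# In-process mirror of Redis set commands (no server involved):
# SADD would build these sets; SUNION / SINTER combine them.
followers_of_a = {"ann", "bob", "eve"}
followers_of_b = {"bob", "dan"}

sunion = followers_of_a | followers_of_b   # ~ SUNION followers:a followers:b
sinter = followers_of_a & followers_of_b   # ~ SINTER followers:a followers:b
```

With a real client you would issue the same operations as commands and Redis would compute them server-side, which is the point: the data structure and its operations live in the database, not in your application code.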

Redis is not a DB in the classic sense. If you are trying to decide between MySQL and MongoDB because of the dynamic query model, Redis is not the right choice. In order to map your data to the simple name/value structure in Redis and its simplified query approach, you are going to spend a significant amount of time mentally modeling that data inside of Redis, which typically requires a lot of denormalization.

If you are trying to deploy a robust caching solution for your app and memcache is too simple, Redis is probably perfect.

If you are working on a queuing system or a messaging system… chances are Redis is probably perfect.

If you have a jobs server that grinds through prioritized jobs and you have any number of nodes in your network submitting work to the job server constantly along with a priority, look no further, Redis is the perfect fit. Not only can you use simple Sorted Sets for this, but the performance will be fantastic (binary protocol) AND you get the added win of the database-esque features Redis has like append only logging and flushing changes to disk as well as replication if you ever grew your jobs server beyond a single node.
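The sorted-set job pattern is: `ZADD` each job with its priority as the score, then repeatedly pop the lowest-score member. As a server-free sketch of the same pattern (using Python’s `heapq` in place of Redis, so the class and method names are illustrative):

```python
import heapq

class TinyJobQueue:
    """Mirrors the Redis sorted-set job pattern: add jobs with a numeric
    priority as the score, then pop the lowest-score job first."""

    def __init__(self):
        self._heap = []

    def submit(self, priority, job):
        # ~ ZADD jobs <priority> <job>
        heapq.heappush(self._heap, (priority, job))

    def next_job(self):
        # ~ pop the lowest-score member of the sorted set
        return heapq.heappop(self._heap)[1]

q = TinyJobQueue()
q.submit(5, "rebuild-index")
q.submit(1, "charge-card")
q.submit(3, "send-email")
```

Any number of producers can `submit` work; the worker always grinds through the highest-priority (lowest-score) job next, exactly the behavior described above.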

NOTE: Redis clustering has been in the works for a long time now. Salvatore has been doing some awesome work with it and it is getting close to launch, so if you have a particularly huge node distribution and want a robust/fast processing cluster based on Redis, you should be able to set that up soon.

Redis has some of these more classic database features, but it is not targeted at competing with Mongo or Couch or MySQL. It really is a robust, in-memory data structure server AND if you happen to need very well defined data structures to solve your problem, then Redis is the tool you want. If your data is inherently “document” like in nature, sticking it in a document store like Mongo or CouchDB may just make a lot more mental-model sense for you.

NOTE: I wouldn’t underestimate how important mentally understanding your data model is. Look at Redis, and if you are having a hard time mapping your data to a List, Set or Hash, you probably shouldn’t use it. Also note that you WILL use many more data structures than you might realize, and it may feel odd… for example a twitter app may have a single user as a hash, and then they might have lists of all the people they follow and all the people who follow them — you would need this duplication to make querying in either direction fast. This was one of the hardest or most unnatural things I experienced when trying to solve more “classic DB” problems with Redis, and what helped me decide when to use it and when not to use it.
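The follow/follower duplication mentioned in the note looks like this in miniature (plain Python sets standing in for Redis sets; the structure, not the client API, is the point):

```python
# Denormalized follow graph: two sets per user, so both
# "who does X follow?" and "who follows X?" are direct lookups.
following = {}   # user -> set of users they follow
followers = {}   # user -> set of users who follow them

def follow(who, whom):
    following.setdefault(who, set()).add(whom)
    followers.setdefault(whom, set()).add(who)  # duplicated on purpose

follow("ann", "bob")
follow("eve", "bob")
```

In a normalized SQL schema you would store each follow edge once and query it from either side; here every write updates both sets so neither direction ever needs a scan.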

I would say that Redis complements most systems (whether it is a classic SQL or NoSQL deployment) wonderfully in most cases, in a caching capacity or queue capacity.

If you are working on a simple application that just needs to store and retrieve simple values really quickly, then Redis is a perfectly valid fit. For example, if you were working on a high-performance algorithmic trading system and you were pulling ticker prices out of a firehose and needed to store them at an insane rate so they could be processed, Redis is exactly the kind of datastore you would want to turn to for that — definitely not Mongo, Couch or MySQL.

IMPORTANT: Redis does not have a great solution for operating in environments where the data set is bigger than RAM, and as of right now, solutions for “data bigger than RAM” have been abandoned. For the longest time this was one of the gripes with Redis, and Salvatore (@antirez) solved this problem with the VM approach. This was quickly deprecated in favor of the diskstore approach after some reported failures and unhappiness with how VM was behaving.

Last I read (a month ago?) was that Salvatore wasn’t entirely happy with how diskstore turned out either and that attempts should really be made to keep Redis’s data set entirely in memory when possible for the best experience.

I say this not because it is impossible, but just so you are aware of the “preferred” way to use Redis: if you have a 100GB data set and were thinking of throwing it on a machine with 4GB of RAM, you probably don’t want to do that. It may not perform or behave like you want it to.



The universally compatible format specification for binary JSON.

Facts and stats, MongoDB trend

Although NoSQL databases like Hadoop (Apache Foundation), Redis (VMware), Cassandra (developed and used by Facebook) or CouchDB have gotten a lot of media attention lately, MongoDB appears to be the product to catch in this emerging market.

Some searches on Google Trends and Job Trends for various NoSQL solutions point to MongoDB leading those trends.

SourceForge, Disney and Craigslist are all using MongoDB; check the full adopter list here:


Google Trends result for “MongoDB”


Job Trends result for various NoSQL solutions

Job Trends result for “MongoDB”


Kanapes IDE for CouchDB

Kanapes (καναπες – Greek for couch) is an Integrated Development Environment (IDE) designed to help you with your daily CouchDB routines.

If you are familiar with applications like Microsoft Visual Studio, Kanapes IDE will be the easiest thing ever for you, since you are already used to that kind of IDE.

Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API



CouchDB now hosted in Git

Apache CouchDB is now hosted in Git.



WSJ about NoSql recent funding

Venture Capital Dispatch, a blog hosted on WSJ, recently posted the following article about the recent MongoDB, CouchDB and Neo4j funding rounds.


More Corporate Customers Are Saying Yes To NoSQL

By Scott Denne

NoSQL databases promise to solve some of the most pressing problems with traditional database management systems, but so far they’ve been used sparsely by companies willing to pay for the software. That’s starting to change, and several start-ups in the space have recently raised capital as they seek paying customers coping with an explosion in data flowing through their websites.

Since August, 10gen, Couchbase and, this week, DataStax and Neo Technology have announced new funding, and some of them have cited early success in landing big corporate customers.

“Starting in January this year, enterprises started using us. In 2010, they really weren’t,” said 10gen Chief Executive Dwight Merriman.

NoSQL databases, built to be more flexible and scalable across multiple servers when compared with relational databases like Oracle or MySQL, were designed around innovations made by pioneering Web companies like Facebook and Google, and have been embraced by other Web companies looking to follow their lead.

The good news for NoSQL companies, all of which provide support around open-source products, is that they had a strong base of users early on. The bad news is that Web companies are notorious for not paying for open-source software. To be successful, NoSQL start-ups must transition their businesses beyond Web customers.

None of them will disclose specific numbers, but most of these companies say they’re seeing more demand from traditional businesses for their products.

Today, Neo Technology announced it has raised $10.6 million in its first major venture round to cope with the increased demand, and yesterday DataStax raised $11 million in its Series B round to support the launch of some monitoring and management tools that it hopes will make its offering more attractive to enterprise customers.

Last month Couchbase raised $14 million as it said it was seeing growth beyond its base of Web start-ups and into the Web properties of major corporations.

Enterprise adoption is important because “the database market is over $20 billion and the majority of that is from enterprises,” said Luis Robles of Sequoia Capital, the firm that led 10gen’s $20 million round earlier this month.