FieldDB is available

FieldDB is a free, modular, open source project developed collectively by field linguists and software developers to make an expandable, user-friendly app which can be used to collect, search and share your data, both online and offline. It is fundamentally an app written in 100% JavaScript which runs entirely client side, backed by a NoSQL database (we are currently using CouchDB and its offline browser wrapper PouchDB alpha). It connects to a number of web services so that users can perform tasks which require the internet/cloud (i.e., syncing data between devices and users, sharing data publicly, running CPU-intensive processes to analyze/extract/search audio/video/text). While the app was designed for “field linguists,” it can be used by anyone collecting text data or collecting highly structured data where the fields on each data point require encryption or customization from user to user, and where the schema of the data is expected to evolve over the course of data collection while in the “field.”

FieldDB beta was officially launched in English and Spanish on August 1st, 2012 in Patzun, Guatemala as an app for field linguists.

More information about FieldDB is available here:

Spark, the Open Source Future of Big Data

Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much more quickly than with disk-based systems like Hadoop MapReduce.

To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark is open source under a BSD license, so download it to check it out.


UnQLite is an embeddable NoSQL (key/value store and document store) database engine. Unlike most other NoSQL databases, UnQLite does not have a separate server process. UnQLite reads and writes directly to ordinary disk files. A complete database with multiple collections is contained in a single disk file. The database file format is cross-platform; you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures. UnQLite features include:

More information on the official website:

LevelDB, a fast and lightweight key/value database library by Google

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.


  • Keys and values are arbitrary byte arrays.
  • Data is stored sorted by key.
  • Callers can provide a custom comparison function to override the sort order.
  • The basic operations are Put(key,value), Get(key), Delete(key).
  • Multiple changes can be made in one atomic batch.
  • Users can create a transient snapshot to get a consistent view of data.
  • Forward and backward iteration is supported over the data.
  • Data is automatically compressed using the Snappy compression library.
  • External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
  • Detailed documentation about how to use the library is included with the source code.
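
To make the feature list above more concrete, here is a minimal in-memory sketch of an ordered key/value store in Python. It is a toy model only, not LevelDB's API or storage format: the class name OrderedKV and its method names are made up for illustration, and it does no compression or disk I/O. It shows sorted storage, the three basic operations, an all-or-nothing batch, a snapshot, and forward/backward iteration.

```python
import bisect


class OrderedKV:
    """Toy in-memory model of an ordered key/value store (not LevelDB itself)."""

    def __init__(self):
        self._keys = []   # all keys, kept in sorted order
        self._data = {}   # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)

    def delete(self, key: bytes) -> None:
        if key in self._data:
            del self._data[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def write_batch(self, ops) -> None:
        """Apply ("put", key, value) / ("delete", key) tuples as one unit."""
        # Validate everything first so a malformed batch changes nothing.
        for op in ops:
            if op[0] not in ("put", "delete"):
                raise ValueError(f"unknown op: {op[0]!r}")
        for op in ops:
            if op[0] == "put":
                self.put(op[1], op[2])
            else:
                self.delete(op[1])

    def snapshot(self) -> "OrderedKV":
        """Return a frozen copy that later writes do not affect."""
        snap = OrderedKV()
        snap._keys = list(self._keys)
        snap._data = dict(self._data)
        return snap

    def scan(self, reverse: bool = False):
        """Yield (key, value) pairs in key order, forward or backward."""
        keys = list(reversed(self._keys)) if reverse else list(self._keys)
        for key in keys:
            yield key, self._data[key]


db = OrderedKV()
db.put(b"b", b"2")
db.put(b"a", b"1")
snap = db.snapshot()
db.write_batch([("put", b"c", b"3"), ("delete", b"a")])
print(list(db.scan()))               # live view, sorted by key
print(list(snap.scan(reverse=True))) # snapshot still sees the old state
```

A real store would copy nothing for a snapshot; the full copy here is just the simplest way to show the "consistent view" idea.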


  • This is not a SQL database. It does not have a relational data model, it does not support SQL queries, and it has no support for indexes.
  • Only a single process (possibly multi-threaded) can access a particular database at a time.
  • There is no client-server support built in to the library. An application that needs such support will have to wrap its own server around the library.


Here is a performance report (with explanations) from the run of the included db_bench program. The results are somewhat noisy, but should be enough to get a ballpark performance estimate.


We use a database with a million entries. Each entry has a 16 byte key, and a 100 byte value. Values used by the benchmark compress to about half their original size.

   LevelDB:    version 1.1
   Date:       Sun May  1 12:11:26 2011
   CPU:        4 x Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
   CPUCache:   4096 KB
   Keys:       16 bytes each
   Values:     100 bytes each (50 bytes after compression)
   Entries:    1000000
   Raw Size:   110.6 MB (estimated)
   File Size:  62.9 MB (estimated)


Write performance

The “fill” benchmarks create a brand new database, in either sequential, or random order. The “fillsync” benchmark flushes data from the operating system to the disk after every operation; the other write operations leave the data sitting in the operating system buffer cache for a while. The “overwrite” benchmark does random writes that update existing keys in the database.


   fillseq      :       1.765 micros/op;   62.7 MB/s
   fillsync     :     268.409 micros/op;    0.4 MB/s (10000 ops)
   fillrandom   :       2.460 micros/op;   45.0 MB/s
   overwrite    :       2.380 micros/op;   46.5 MB/s


Each “op” above corresponds to a write of a single key/value pair. I.e., a random write benchmark goes at approximately 400,000 writes per second.
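
As a quick check on these figures, the conversion from micros/op to operations per second and MB/s can be sketched in a few lines of Python. The function names are ours, not part of db_bench, and we assume each op moves one 16-byte key plus one 100-byte value and that "MB" means 2^20 bytes, which is what reproduces the numbers in the report.

```python
def ops_per_second(micros_per_op: float) -> float:
    """Convert a per-operation latency in microseconds to operations per second."""
    return 1_000_000 / micros_per_op


def throughput_mb_s(micros_per_op: float, key_bytes: int = 16, value_bytes: int = 100) -> float:
    """Uncompressed key+value bytes moved per second, in units of 2**20 bytes."""
    bytes_per_op = key_bytes + value_bytes
    return bytes_per_op * ops_per_second(micros_per_op) / (1024 * 1024)


print(int(ops_per_second(2.460)))         # fillrandom: 406504 writes per second
print(round(throughput_mb_s(1.765), 1))   # fillseq:    62.7
print(round(throughput_mb_s(2.460), 1))   # fillrandom: 45.0
```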

Each “fillsync” operation costs much less (0.3 millisecond) than a disk seek (typically 10 milliseconds). We suspect that this is because the hard disk itself is buffering the update in its memory and responding before the data has been written to the platter. This may or may not be safe based on whether or not the hard disk has enough power to save its memory in the event of a power failure.

Read performance

We list the performance of reading sequentially in both the forward and reverse direction, and also the performance of a random lookup. Note that the database created by the benchmark is quite small. Therefore the report characterizes the performance of leveldb when the working set fits in memory. The cost of reading a piece of data that is not present in the operating system buffer cache will be dominated by the one or two disk seeks needed to fetch the data from disk. Write performance will be mostly unaffected by whether or not the working set fits in memory.


   readrandom   :      16.677 micros/op;  (approximately 60,000 reads per second)
   readseq      :       0.476 micros/op;  232.3 MB/s
   readreverse  :       0.724 micros/op;  152.9 MB/s


LevelDB compacts its underlying storage data in the background to improve read performance. The results listed above were done immediately after a lot of random writes. The results after compactions (which are usually triggered automatically) are better.


   readrandom   :      11.602 micros/op;  (approximately 85,000 reads per second)
   readseq      :       0.423 micros/op;  261.8 MB/s
   readreverse  :       0.663 micros/op;  166.9 MB/s


Some of the high cost of reads comes from repeated decompression of blocks read from disk. If we supply enough cache to leveldb so that it can hold the uncompressed blocks in memory, the read performance improves again:

   readrandom   :       9.775 micros/op;  (approximately 100,000 reads per second before compaction)
   readrandom   :       5.215 micros/op;  (approximately 190,000 reads per second after compaction)

Twitter's fatcache available on GitHub

fatcache is memcache on SSD. Think of fatcache as a cache for your big data.


There are two ways to think of SSDs in system design. One is to think of SSD as an extension of disk, where it plays the role of making disks fast, and the other is to think of it as an extension of memory, where it plays the role of making memory fat. The latter makes sense when persistence (non-volatility) is unnecessary and data is accessed over the network. Even though memory is a thousand times faster than SSD, network-connected SSD-backed memory makes sense if we design the system in a way that network latencies dominate over the SSD latencies by a large factor.

To understand why network connected SSD makes sense, it is important to understand the role distributed memory plays in large-scale web architecture. In recent years, terabyte-scale, distributed, in-memory caches have become a fundamental building block of any web architecture. In-memory indexes, hash tables, key-value stores and caches are increasingly incorporated for scaling throughput and reducing latency of persistent storage systems. However, power consumption, operational complexity and single node DRAM cost make horizontally scaling this architecture challenging. The current cost of DRAM per server increases dramatically beyond approximately 150 GB, and power cost scales similarly as DRAM density increases.

Fatcache extends a volatile, in-memory cache by incorporating SSD-backed storage.

SSD-backed memory presents a viable alternative for applications with large workloads that need to maintain high hit rate for high performance. SSDs have higher capacity per dollar and lower power consumption per byte, without degrading random read latency beyond network latency.

Fatcache achieves performance comparable to an in-memory cache by focusing on two design criteria:

  • Minimize disk reads on cache hit
  • Eliminate small, random disk writes

The latter is important due to SSDs’ unique write characteristics. Writes and in-place updates to SSDs degrade performance due to an erase-and-rewrite penalty and garbage collection of dead blocks. Fatcache batches small writes to obtain consistent performance and increased disk lifetime.

SSD reads happen at a page-size granularity, usually 4 KB. Single page read access times are approximately 50 to 70 usec and a single commodity SSD can sustain nearly 40K read IOPS at a 4 KB page size. 70 usec read latency dictates that disk latency will overtake typical network latency after a small number of reads. Fatcache reduces disk reads by maintaining an in-memory index for all on-disk data.
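
The two design criteria can be illustrated with a small Python sketch. This is not fatcache's implementation: the class SlabCache, its counters and the 4096-byte default are assumptions made for the example. Small writes are collected in a memory buffer and written out in one large sequential write, and an in-memory index answers misses without touching the disk and serves a hit on flushed data with a single read.

```python
class SlabCache:
    """Toy cache: values live in a file, the index lives in memory."""

    def __init__(self, path: str, slab_size: int = 4096):
        self._file = open(path, "w+b")
        self._slab_size = slab_size
        self._buffer = bytearray()   # pending values, not yet on disk
        self._buffer_base = 0        # file offset where the buffer will land
        self._index = {}             # key -> (offset, length)
        self.disk_writes = 0
        self.disk_reads = 0

    def set(self, key: str, value: bytes) -> None:
        # Append only: an overwritten key leaves dead space behind
        # (no garbage collection in this sketch).
        offset = self._buffer_base + len(self._buffer)
        self._buffer += value
        self._index[key] = (offset, len(value))
        if len(self._buffer) >= self._slab_size:
            self.flush()

    def flush(self) -> None:
        """Write all buffered values in one sequential disk write."""
        if not self._buffer:
            return
        self._file.seek(self._buffer_base)
        self._file.write(self._buffer)
        self._file.flush()
        self._buffer_base += len(self._buffer)
        self._buffer.clear()
        self.disk_writes += 1

    def get(self, key: str):
        entry = self._index.get(key)
        if entry is None:
            return None   # miss is answered from memory, no disk read
        offset, length = entry
        if offset >= self._buffer_base:
            # Value has not been flushed yet; serve it from the buffer.
            start = offset - self._buffer_base
            return bytes(self._buffer[start:start + length])
        self._file.seek(offset)
        self.disk_reads += 1   # exactly one read per hit on flushed data
        return self._file.read(length)

    def close(self) -> None:
        self.flush()
        self._file.close()
```

With 100-byte values and the default slab size, 100 calls to set() result in only two disk writes; the remaining values wait in the buffer until the next flush.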

MoSQL live-replicating from MongoDB to PostgreSQL

MoSQL is a tool Stripe developed for live-replicating data from a MongoDB database into a PostgreSQL database. With MoSQL, you can run applications against a MongoDB database, but also maintain a live-updated mirror of your data in PostgreSQL, ready for querying with the full power of SQL.



Here at Stripe, we use a number of different database technologies for both internal- and external-facing services. Over time, we’ve found ourselves with growing amounts of data in MongoDB that we would like to be able to analyze using SQL. MongoDB is great for a lot of reasons, but it’s hard to beat SQL for easy ad-hoc data aggregation and analysis, especially since virtually every developer or analyst already knows it.

An obvious solution is to periodically dump your MongoDB database and re-import into PostgreSQL, perhaps using mongoexport. We experimented with this approach, but found ourselves frustrated with the ever-growing time it took to do a full refresh. Even if most of your analyses can tolerate a day or two of delay, occasionally you want to ask ad-hoc questions about “what happened last night?”, and it’s frustrating to have to wait on a huge dump/load refresh to do that. In response, we built MoSQL, enabling us to keep a real-time SQL mirror of our Mongo data.

MoSQL does an initial import of your MongoDB collections into a PostgreSQL database, and then continues running, applying any changes to the MongoDB server in near-real-time to the PostgreSQL mirror. The replication works by tailing the MongoDB oplog, in essentially the same way Mongo’s own replication works.
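
To give a feel for what applying oplog entries to a SQL mirror involves, here is a simplified Python sketch. It is not MoSQL's code (MoSQL is a Ruby gem): it uses an in-memory SQLite database as a stand-in for PostgreSQL, handles only insert, `$set` update and delete entries, and the function names and the `test.things` namespace are assumptions for illustration. The entry shape (`op`, `ns`, `o`, `o2`) follows the general layout of MongoDB oplog documents.

```python
import json
import sqlite3

COLUMNS = ("_id", "x", "j")   # typed columns; everything else goes to _extra_props


def make_mirror() -> sqlite3.Connection:
    """Create the SQL-side table (SQLite stands in for PostgreSQL here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE things (_id TEXT PRIMARY KEY, x INTEGER, j INTEGER, _extra_props TEXT)"
    )
    return conn


def to_row(doc: dict) -> tuple:
    extra = {k: v for k, v in doc.items() if k not in COLUMNS}
    return (doc["_id"], doc.get("x"), doc.get("j"), json.dumps(extra))


def row_to_doc(row: tuple) -> dict:
    _id, x, j, extra = row
    doc = {"_id": _id, **json.loads(extra)}
    if x is not None:
        doc["x"] = x
    if j is not None:
        doc["j"] = j
    return doc


def upsert(conn: sqlite3.Connection, doc: dict) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO things (_id, x, j, _extra_props) VALUES (?, ?, ?, ?)",
        to_row(doc),
    )


def apply_op(conn: sqlite3.Connection, entry: dict) -> None:
    """Apply one oplog-style entry to the mirror table."""
    op = entry["op"]
    if op == "i":                                   # insert
        upsert(conn, entry["o"])
    elif op == "u":                                 # update (only $set handled)
        _id = entry["o2"]["_id"]
        row = conn.execute(
            "SELECT _id, x, j, _extra_props FROM things WHERE _id = ?", (_id,)
        ).fetchone()
        doc = row_to_doc(row) if row else {"_id": _id}
        doc.update(entry["o"].get("$set", {}))
        upsert(conn, doc)
    elif op == "d":                                 # delete
        conn.execute("DELETE FROM things WHERE _id = ?", (entry["o"]["_id"],))


oplog = [
    {"op": "i", "ns": "test.things", "o": {"_id": "a1", "name": "mongo"}},
    {"op": "i", "ns": "test.things", "o": {"_id": "a2", "x": 4, "j": 1}},
    {"op": "u", "ns": "test.things", "o2": {"_id": "a2"}, "o": {"$set": {"j": 2}}},
    {"op": "d", "ns": "test.things", "o": {"_id": "a1"}},
]
mirror = make_mirror()
for entry in oplog:
    apply_op(mirror, entry)
print(mirror.execute("SELECT * FROM things").fetchall())
```

A real tailer would also have to remember its position in the oplog so it can resume after a restart; that part is left out here.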


MoSQL can be installed like any other gem:


$ gem install mosql


To use MoSQL, you’ll need to create a collection map which maps your MongoDB objects to a SQL schema. We’ll use the collection from the MongoDB tutorial as an example. A possible collection map for that collection would look like:

      - _id: TEXT
      - x: INTEGER
      - j: INTEGER
     :table: things
     :extra_props: true

Save that file as collections.yaml, start a local mongod and postgres, and run:


$ mosql --collections collections.yaml


Now, run through the MongoDB tutorial, and then open a psql shell. You’ll find all your Mongo data now available in SQL form:

postgres=# select * from things limit 5;
           _id            | x | j |   _extra_props
 50f445b65c46a32ca8c84a5d |   |   | {"name":"mongo"}
 50f445df5c46a32ca8c84a5e | 3 |   | {}
 50f445e75c46a32ca8c84a5f | 4 | 1 | {}
 50f445e75c46a32ca8c84a60 | 4 | 2 | {}
 50f445e75c46a32ca8c84a61 | 4 | 3 | {}
(5 rows)

mosql will continue running, syncing any further changes you make into Postgres.

For more documentation and usage information, see the README.


MoSQL comes from a general philosophy of preferring real-time, continuously-updating solutions to periodic batch jobs.


MoSQL is built on top of mongoriver, a general library for MongoDB oplog tailing that we developed. Along with the MoSQL release, we have also released mongoriver as open source today. If you find yourself wanting to write your own MongoDB tailer, to monitor updates to your data in near-realtime, check it out.

Probability, The Analysis of Data

Probability, The Analysis of Data – Volume 1 is a free book available online; it provides educational material in the area of data analysis.

  • The project features comprehensive coverage of all relevant disciplines including probability, statistics, computing, and machine learning.
  • The content is almost self-contained and includes mathematical prerequisites and basic computing concepts.
  • The R programming language is used to demonstrate the contents. Full code is available, facilitating reproducibility of experiments and letting readers experiment with variations of the code.
  • The presentation is mathematically rigorous, and includes derivations and proofs in most cases.
  • HTML versions are freely available on the website. Hard copies are available at affordable prices.

DuckDuckGo serves 1 million searches a day

An interview with Gabriel Weinberg, founder of Duck Duck Go and general all-around startup guru, has been published on what DDG’s architecture looks like in 2012. You will find details on how they use memcached, PostgreSQL and many other great pieces of software to serve 1 million searches a day!

Can’t searching the Open Web provide all this data? Not really. This is structured data with semantics. Not an HTML page. You need a search engine that’s capable of categorizing, mapping, merging, filtering, prioritizing, searching, formatting, and disambiguating richer data sets and you can’t do that with a keyword search. You need the kind of smarts DDG has built into their search engine. One problem of course is now that data has become valuable many grown-ups don’t want to share anymore.



The full article on

Gartner predicts strong Hadoop adoption for Business Intelligence and Analytics



Gartner Says Business Intelligence and Analytics Need to Scale Up to Support Explosive Growth in Data Sources

Analysts to Discuss Growth in Data Sources at Gartner Business Intelligence and Analytics Summits 2013, February 5-7 in Barcelona, February 25-26 in Sydney and March 18-20 in Grapevine, Texas


Business intelligence (BI) and analytics need to scale up to support the robust growth in data sources, according to the latest predictions from Gartner, Inc. Business intelligence leaders must embrace a broadening range of information assets to help their organizations.

“New business insights and improved decision making with greater finesse are the key benefits achievable from turning more data into actionable insights, whether that data is from an increasing array of data sources from within or outside of the organization,” said Daniel Yuen, research director at Gartner. “Different technology vendors, especially niche vendors, are rushing into the market, providing organizations with the ability to tap into this wider information base in order to make sounder strategic and prompter operational decisions.”

Gartner outlined three key predictions for BI teams to consider when planning for the future:

By 2015, 65 percent of packaged analytic applications with advanced analytics will come embedded with Hadoop.

Organizations realize the strength that Hadoop-powered analysis brings to big data programs, particularly for analyzing poorly structured data, text, behavior analysis and time-based queries. While IT organizations conduct trials over the next few years, especially with Hadoop-enabled database management system (DBMS) products and appliances, application providers will go one step further and embed purpose-built, Hadoop-based analysis functions within packaged applications. The trend is most noticeable so far with cloud-based packaged application offerings, and this will continue.

“Organizations with the people and processes to benefit from new insights will gain a competitive advantage as having the technology packaged reduces operational costs and IT skills requirements, and speeds up the time to value,” said Bill Gassman, research director at Gartner. “Technology providers will benefit by offering a more competitive product that delivers task-specific analytics directly to the intended role, and avoids a competitive situation with internally developed resources.”

By 2016, 70 percent of leading BI vendors will have incorporated natural-language and spoken-word capabilities.

BI/analytics vendors continue to be slow in providing language- and voice-enabled applications. In their rush to port their applications to mobile and tablet devices, BI vendors have tended to focus only on adapting their traditional BI point-and-click and drag-and-drop user interfaces to touch-based interfaces. Over the next few years, BI vendors are expected to start playing a quick game of catch-up with the virtual personal assistant market. Initially, BI vendors will enable basic voice commands for their standard interfaces, followed by natural language processing of spoken or text input into SQL queries. Ultimately, “personal analytic assistants” will emerge that understand user context, offer two-way dialogue, and (ideally) maintain a conversational thread.

“Many of these technologies can and will underpin these voice-enabled analytic capabilities, rather than BI vendors or enterprises themselves developing them outright,” said Douglas Laney, research vice president at Gartner.

By 2015, more than 30 percent of analytics projects will deliver insights based on structured and unstructured data.

Business analytics have largely been focused on tools, technologies and approaches for accessing, managing, storing, modeling and optimizing for analysis of structured data. This is changing as organizations strive to gain insights from new and diverse data sources. The potential business value of harnessing and acting upon insights from these new and previously untapped sources of data, coupled with the significant market hype around big data, has fueled new product development to deal with data variety across existing information management stack vendors and has spurred the entry of a flood of new approaches for relating, correlating, managing, storing and finding insights in varied data.

“Organizations are exploring and combining insights from their vast internal repositories of content — such as text and emails and (increasingly) video and audio — in addition to externally generated content such as the exploding volume of social media, video feeds, and others, into existing and new analytic processes and use cases,” said Rita Sallam, research vice president at Gartner. “Correlating, analyzing, presenting and embedding insights from structured and unstructured information together enables organizations to better personalize the customer experience and exploit new opportunities for growth, efficiencies, differentiation, innovation and even new business models.”

More detailed analysis is available in the report “Predicts 2013: Business Intelligence and Analytics Need to Scale Up to Support Explosive Growth in Data Sources.” The report is available on Gartner’s website at

Additional information and analysis on data sources will be discussed at the Gartner Business Intelligence & Analytics Summit 2013 taking place February 5-7 in Barcelona, February 25-26 in Sydney and March 18-20 in Grapevine, Texas. The Gartner BI & Analytics Summit is specifically designed to drive organizations toward analytics excellence by exploring the latest trends in BI and analytics and examining how the two disciplines relate to one another. Gartner analysts will discuss how the Nexus of Forces will impact BI and analytics, and share best practices for developing and managing successful mobile BI, analytics and master data management initiatives.

Processing data with Drake

Introducing ‘Drake’, a “Make for Data”

We call this tool Drake, and today we are excited to share Drake with the world, as an open source project. It is written in Clojure.

Drake is a text-based command line data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs. Drake automatically resolves dependencies and provides a rich set of options for controlling the workflow. It supports multiple inputs and outputs and has HDFS support built in.
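
The “Make for Data” idea of steps with declared inputs and outputs can be sketched in a few lines of Python. This is not Drake's workflow syntax or its Clojure implementation; the names Step and build are hypothetical, and the staleness rule (an output is missing, an output is older than an input, or an upstream step just ran) is the usual make-style rule, assumed here for illustration.

```python
import os
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    outputs: List[str]          # files this step produces
    inputs: List[str]           # files this step reads
    run: Callable[[], None]     # the command to execute


def is_stale(step: Step) -> bool:
    """A step must run if an output is missing or older than an existing input."""
    if not all(os.path.exists(p) for p in step.outputs):
        return True
    oldest_output = min(os.path.getmtime(p) for p in step.outputs)
    return any(
        os.path.getmtime(p) > oldest_output
        for p in step.inputs
        if os.path.exists(p)
    )


def build(steps: List[Step]) -> List[str]:
    """Run stale steps in dependency order; return the first output of each step run."""
    producer = {out: step for step in steps for out in step.outputs}
    visited, rebuilt, executed = set(), set(), []

    def visit(step: Step) -> None:
        if id(step) in visited:   # note: no cycle detection in this sketch
            return
        visited.add(id(step))
        for inp in step.inputs:   # make sure upstream steps are handled first
            if inp in producer:
                visit(producer[inp])
        if is_stale(step) or any(inp in rebuilt for inp in step.inputs):
            step.run()
            rebuilt.update(step.outputs)
            executed.append(step.outputs[0])

    for step in steps:
        visit(step)
    return executed
```

Calling build() twice in a row on an unchanged set of files runs every step the first time and nothing the second time, which is the repeatability benefit in miniature.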

We use Drake at Factual on various internal projects. It serves as a primary way to define, run, and manage data workflow. Some core benefits we’ve seen:
    • Non-programmers can run Drake and fully manage a workflow
    • Encourages repeatability of the overall data building process
    • Encourages consistent organization (e.g., where supporting scripts live, and how they’re run)
    • Precise control over steps (for more effective testing, debugging, etc.)
    • Unifies different tools in a single workflow (shell commands, Ruby, Python, Clojure, pushing data to production, etc.)

Drake official blog: