YaJUG: Cassandra

Thursday 2 October 2014, 18:00 to 20:00  –  Luxembourg City, Luxembourg

18:00: Introduction to Cassandra – Duy Hai DOAN (DataStax)

Cassandra is the column-oriented NoSQL database behind large companies such as Netflix, Sony Entertainment, Apple …
A first session gives a general presentation of Cassandra and its architecture. The second session covers the data model and modeling best practices: how to move from the SQL world to the NoSQL world with Cassandra.

19:10: Tooling for Cassandra – Michaël Figuière (DataStax)

A presentation of the tools that help Java developers work efficiently with Cassandra.

Big Data top paying skills

According to KDnuggets, Big Data skills led the list of top-paying technical skills (six-figure salaries) in 2013.

The study focuses on technology professionals in the U.S. who received raises over the last year (2013).

Average U.S. tech salaries increased nearly three percent to $87,811 in 2013, up from $85,619 the previous year. Technology professionals understand they can easily find ways to grow their career in 2014, with two-thirds of respondents (65%) confident in finding a new, better position. That overwhelming confidence, matched with declining salary satisfaction (54%, down from 57%), will keep tech-powered companies on edge about their retention strategies.

Companies are willing to pay hefty amounts to professionals with Big Data skills.


According to a report released on Jan 29, 2014, the average salary for a professional with knowledge of and experience in the R programming language was $115,531 in 2013.

Other Big Data skills such as NoSQL, MapReduce, Cassandra, Pig, Hadoop, and MongoDB are among the ten top-paying skills.

 

Source: kdnuggets

Cassandra 2.0, just big

In five years, Apache Cassandra has grown into one of the most widely used NoSQL databases in the world and serves as the backbone for some of today’s most popular applications, including Facebook, Netflix, and Twitter.


The newly announced Cassandra 2.0 includes several new features. Perhaps the biggest change is that “Cassandra 2.0 makes it easier than ever for developers to migrate from relational databases and become productive quickly.”

New features and improvements include:

  • Lightweight transactions, which make operations linearizable, similar to the serializable isolation level offered by relational databases, and so prevent conflicts between concurrent requests
  • Triggers, which enable pushing performance-critical code close to the data it deals with, and simplify integration with event-driven frameworks like Storm
  • CQL enhancements such as cursors and improved index support
  • Improved compaction, keeping read performance from deteriorating under heavy write load
  • Eager retries to avoid query timeouts by sending redundant requests to other replicas if too much time elapses on the original request
  • Custom Thrift server implementation based on LMAX Disruptor that achieves lower message processing latencies and better throughput with flexible buffer allocation strategies
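
As an illustration of the first bullet, the sketch below shows the compare-and-set semantics a lightweight transaction provides: a write is applied only if a condition on the current state holds, so exactly one of several concurrent writers succeeds. It is an in-memory toy guarded by a single lock, not Cassandra's implementation (the real feature has to coordinate replicas across the cluster), and the names `ToyRegister`, `insert_if_not_exists`, `update_if`, and `race_to_register` are hypothetical.

```python
import threading


class ToyRegister:
    """In-memory stand-in for a table of rows; NOT Cassandra's implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}

    def insert_if_not_exists(self, key, value):
        """Write only if the key is absent; return True when applied."""
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = value
            return True

    def update_if(self, key, expected, new_value):
        """Write only if the current value equals `expected`."""
        with self._lock:
            if self._rows.get(key) != expected:
                return False
            self._rows[key] = new_value
            return True

    def get(self, key):
        with self._lock:
            return self._rows.get(key)


def race_to_register(store, key, contenders):
    """Start one thread per contender; return the names whose insert applied."""
    winners = []
    winners_lock = threading.Lock()

    def attempt(name):
        if store.insert_if_not_exists(key, name):
            with winners_lock:
                winners.append(name)

    threads = [threading.Thread(target=attempt, args=(n,)) for n in contenders]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

However the threads interleave, only one contender's conditional insert is applied; the others see `False` and can react, instead of silently overwriting each other.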

Official website

Official announcement

Download

Cassandra 2.0.0-beta1 has been released

The latest development release, 2.0.0-beta1, is now available for download.

Full list of changes:

  • Removed on-heap row cache (CASSANDRA-5348)
  • use nanotime consistently for node-local timeouts (CASSANDRA-5581)
  • Avoid unnecessary second pass on name-based queries (CASSANDRA-5577)
  • Experimental triggers (CASSANDRA-1311)
  • JEMalloc support for off-heap allocation (CASSANDRA-3997)
  • Single-pass compaction (CASSANDRA-4180)
  • Removed token range bisection (CASSANDRA-5518)
  • Removed compatibility with pre-1.2.5 sstables and network messages (CASSANDRA-5511)
  • removed PBSPredictor (CASSANDRA-5455)
  • CAS support (CASSANDRA-5062, 5441, 5442, 5443, 5619, 5667)
  • Leveled compaction performs size-tiered compactions in L0 (CASSANDRA-5371, 5439)
  • Add yaml network topology snitch for mixed ec2/other envs (CASSANDRA-5339)
  • Log when a node is down longer than the hint window (CASSANDRA-4554)
  • Optimize tombstone creation for ExpiringColumns (CASSANDRA-4917)
  • Improve LeveledScanner work estimation (CASSANDRA-5250, 5407)
  • Replace compaction lock with runWithCompactionsDisabled (CASSANDRA-3430)
  • Change Message IDs to ints (CASSANDRA-5307)
  • Move sstable level information into the Stats component, removing the need for a separate Manifest file (CASSANDRA-4872)
  • avoid serializing to byte[] on commitlog append (CASSANDRA-5199)
  • make index_interval configurable per columnfamily (CASSANDRA-3961, CASSANDRA-5650)
  • add default_time_to_live (CASSANDRA-3974)
  • add memtable_flush_period_in_ms (CASSANDRA-4237)
  • replace supercolumns internally by composites (CASSANDRA-3237, 5123)
  • upgrade thrift to 0.9.0 (CASSANDRA-3719)
  • drop unnecessary keyspace parameter from user-defined compaction API (CASSANDRA-5139)
  • more robust solution to incomplete compactions + counters (CASSANDRA-5151)
  • Change order of directory searching for c*.in.sh (CASSANDRA-3983)
  • Add tool to reset SSTable compaction level for LCS (CASSANDRA-5271)
  • Allow custom configuration loader (CASSANDRA-5045)
  • Remove memory emergency pressure valve logic (CASSANDRA-3534)
  • Reduce request latency with eager retry (CASSANDRA-4705)
  • cqlsh: Remove ASSUME command (CASSANDRA-5331)
  • Rebuild BF when loading sstables if bloom_filter_fp_chance has changed since compaction (CASSANDRA-5015)
  • remove row-level bloom filters (CASSANDRA-4885)
  • Change Kernel Page Cache skipping into row preheating (disabled by default) (CASSANDRA-4937)
  • Improve repair by deciding on a gcBefore before sending out TreeRequests (CASSANDRA-4932)
  • Add an official way to disable compactions (CASSANDRA-5074)
  • Reenable ALTER TABLE DROP with new semantics (CASSANDRA-3919)
  • Add binary protocol versioning (CASSANDRA-5436)
  • Swap THshaServer for TThreadedSelectorServer (CASSANDRA-5530)
  • Add alias support to SELECT statement (CASSANDRA-5075)
  • Don’t create empty RowMutations in CommitLogReplayer (CASSANDRA-5541)
  • Use range tombstones when dropping cfs/columns from schema (CASSANDRA-5579)
  • cqlsh: drop CQL2/CQL3-beta support (CASSANDRA-5585)
  • Track max/min column names in sstables to be able to optimize slice queries (CASSANDRA-5514, CASSANDRA-5595, CASSANDRA-5600)
  • Binary protocol: allow batching already prepared statements (CASSANDRA-4693)
  • Allow preparing timestamp, ttl and limit in CQL3 queries (CASSANDRA-4450)
  • Support native link w/o JNA in Java7 (CASSANDRA-3734)
  • Use SASL authentication in binary protocol v2 (CASSANDRA-5545)
  • Replace Thrift HsHa with LMAX Disruptor based implementation (CASSANDRA-5582)
  • cqlsh: Add row count to SELECT output (CASSANDRA-5636)
  • Include a timestamp with all read commands to determine column expiration (CASSANDRA-5149)
  • Streaming 2.0 (CASSANDRA-5286, 5699)
  • Conditional create/drop ks/table/index statements in CQL3 (CASSANDRA-2737)
  • more pre-table creation property validation (CASSANDRA-5693)
  • Redesign repair messages (CASSANDRA-5426)
  • Fix ALTER RENAME post-5125 (CASSANDRA-5702)
  • Disallow renaming a 2ndary indexed column (CASSANDRA-5705)
  • Rename Table to Keyspace (CASSANDRA-5613)
  • Ensure changing column_index_size_in_kb on different nodes don’t corrupt the sstable (CASSANDRA-5454)
  • Move resultset type information into prepare, not execute (CASSANDRA-5649)
  • Auto paging in binary protocol (CASSANDRA-4415, 5714)
  • Don’t tie client side use of AbstractType to JDBC (CASSANDRA-4495)
  • Adds new TimestampType to replace DateType (CASSANDRA-5723, CASSANDRA-5729)

 

Cassandra 1.1.11 has been released

http://cassandra.apache.org/download/

This is a maintenance/bug fix release[1] on the 1.1 series. As always,
please pay careful attention to the release notes[2] and let us know[3]
right away if you have any problems.

[1]: http://goo.gl/QfZlg (CHANGES.txt)
[2]: http://goo.gl/O55QF (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA
[4]: http://goo.gl/KbiRm (CHEC)

Most popular data management systems

According to the DB-Engines ranking:

 

April 2013
Rank  Last Month  DBMS                  Database Model     Score    Changes
1.    1.          Oracle                Relational DBMS    1560.59  +27.20
2.    3.          MySQL                 Relational DBMS    1342.45  +47.24
3.    2.          Microsoft SQL Server  Relational DBMS    1278.15  -40.21
4.    4.          PostgreSQL            Relational DBMS    174.09   -3.07
5.    5.          Microsoft Access      Relational DBMS    161.40   -8.77
6.    6.          DB2                   Relational DBMS    155.02   -4.31
7.    7.          MongoDB               Document store     129.75   +5.52
8.    9.          SQLite                Relational DBMS    88.94    +5.68
9.    8.          Sybase                Relational DBMS    80.16    -5.25
10.   10.         Solr                  Search engine      46.15    +2.99
11.               Teradata              Relational DBMS    44.93
12.   11.         Cassandra             Wide column store  38.57    +2.21
13.   12.         Redis                 Key-value store    35.58    +3.15
14.   13.         Memcached             Key-value store    24.80    -0.17
15.   14.         Informix              Relational DBMS    24.00    +0.10
16.   15.         HBase                 Wide column store  21.84    +1.40
17.   16.         CouchDB               Document store     18.72    +0.42
18.   17.         Firebird              Relational DBMS    12.24    -1.54
19.               Netezza               Relational DBMS    11.14
20.   18.         Sphinx                Search engine      9.55     +0.09
21.   19.         Neo4j                 Graph DBMS         8.34     +0.90
22.   21.         Elasticsearch         Search engine      8.31     +1.56
23.   22.         Riak                  Key-value store    7.20     +1.10

Cassandra 1.1.10 has been released

 

Download Cassandra 1.1.10 at http://cassandra.apache.org/download

It includes the following bug fixes:
  • fix saved key cache not loading at startup (CASSANDRA-5166)
  • fix ConcurrentModificationException in getBootstrapSource (CASSANDRA-5170)
  • fix sstable maxtimestamp for row deletes and pre-1.1.1 sstables (CASSANDRA-5153)
  • fix start key/end token validation for wide row iteration (CASSANDRA-5168)
  • add ConfigHelper support for Thrift frame and max message sizes (CASSANDRA-5188)
  • fix nodetool repair not fail on node down (CASSANDRA-5203)
  • always collect tombstone hints (CASSANDRA-5068)
  • Fix thread growth on node removal (CASSANDRA-5175)
  • Fix error when sourcing file in cqlsh (CASSANDRA-5235)
  • Make Ec2Region’s datacenter name configurable (CASSANDRA-5155)

Cassandra performance review

 

Original article available here

 

Four years ago, well before starting DataStax, I evaluated the then-current crop of distributed databases and explained why I chose Cassandra. In a lot of ways, Cassandra was the least mature of the options, but I chose to take a long view and wanted to work on a project that got the fundamentals right; things like documentation and distributed tests could come later.

 

2012 saw that validated in a big way, as the most comprehensive NoSQL benchmark to date was published at the VLDB conference by researchers at the University of Toronto. They concluded,

In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from 1 to 12 nodes.

As a sample, here are the throughput results from the mixed reads, writes, and (sequential) scans:

I encourage you to take a few minutes to skim the full results.

There are both architectural and implementation reasons for Cassandra’s dominating performance here. Let’s get down into the weeds and see what those are.

Architecture

Cassandra incorporates a number of architectural best practices that affect performance. None are unique to Cassandra, but Cassandra is the only NoSQL system that incorporates all of them.

Fully distributed: Every Cassandra machine handles a proportionate share of every activity in the system. There are no special cases like the HDFS namenode or MongoDB mongos that require special treatment or special hardware to avoid becoming a bottleneck. And with every node the same, Cassandra is far simpler to install and operate, which has long-term implications for troubleshooting.

Log-structured storage engine: A log-structured engine that avoids overwrites to turn updates into sequential i/o is essential both on hard disks (HDD) and solid-state disks (SSD). On HDD, because the seek penalty is so high; on SSD, to avoid write amplification and disk failure. This is why you see MongoDB performance go through the floor as the dataset size exceeds RAM.
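
To make the log-structured idea concrete, here is a minimal sketch under simplifying assumptions: Python lists stand in for files on disk, and the class name `ToyLogStructuredStore` and its `memtable_limit` parameter are hypothetical. Every write is appended to a log and kept in an in-memory table; when that table fills up it is written out as a new immutable, sorted segment, so existing data is never overwritten in place.

```python
class ToyLogStructuredStore:
    """Toy log-structured store: appends only, never overwrites in place."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # recent writes, held in memory
        self.segments = []     # immutable sorted runs, oldest first

    def put(self, key, value):
        self.commit_log.append((key, value))   # sequential append
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one new immutable, sorted segment."""
        if self.memtable:
            self.segments.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment wins
            for k, v in segment:
                if k == key:
                    return v
        return None
```

The price of never overwriting is that a read may have to look in several places, which is why such engines merge segments in the background (compaction).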

Tight integration with its storage engine: Voldemort and Riak support pluggable storage engines, which both limits them to a lowest-common-denominator of key/value pairs, and limits the optimizations that can be done with the distributed replication engine.

Locally-managed storage: HBase has an integrated, log-structured storage engine, but relies on HDFS for replication instead of managing storage locally. This means HBase is architecturally incapable of supporting Cassandra-style optimizations like putting the commitlog on a separate disk, or mixing SSD and HDD in a single cluster with appropriate data pinned to each.

Implementation

An architecture is only as good as its implementation. For the first years after Cassandra’s open-sourcing as an Apache project, every release was a learning experience. 0.3, 0.4, 0.5, 0.6, each attracted a new wave of users that exposed some previously unimportant weakness. Today, we estimate there are over a thousand production deployments of Cassandra, the most for any scalable database. Some are listed here. To paraphrase ESR, “With enough eyes, all performance problems are obvious.”

What are some implementation details relevant to performance? Let’s have a look at some of the options.

MongoDB

MongoDB can be a great alternative to MySQL, but it’s not really appropriate for the scale-out applications targeted by Cassandra. Still, as early members of the NoSQL category, the two do draw comparisons.

One important limitation in MongoDB is database-level locking. That is, only one writer may modify a given database at a time. Support for collection-level (a set of documents, analogous to a relational table) locking is planned. With either database- or collection-level locking, other writers or readers are locked out. Even a small number of writes can produce stalls in read performance.

Cassandra uses advanced concurrent structures to provide row-level isolation without locking. Cassandra even eliminated the need for row-level locks for index updates in the recent 1.2 release.

A more subtle MongoDB limitation is that when adding or updating a field in a document, the entire document must be re-written. If you pre-allocate space for each document, you can avoid the associated fragmentation, but even with pre-allocation updating your document gets slower as it grows.

Cassandra’s storage engine only appends updated data; it never has to re-write or re-read existing data. Thus, updates to a Cassandra row or partition stay fast as your dataset grows.
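
The contrast with rewriting a whole document can be sketched as follows. This is an assumption-laden toy, not Cassandra's code: updates are appended as individual timestamped cells and merged only when the row is read, with the newest timestamp winning per column. The function names `append_update` and `read_row` are hypothetical.

```python
def append_update(log, row_key, column, value, timestamp):
    """Record an update by appending one cell; nothing existing is touched."""
    log.append((row_key, column, timestamp, value))


def read_row(log, row_key):
    """Merge cells at read time: the highest timestamp wins per column."""
    newest = {}
    for key, column, timestamp, value in log:
        if key != row_key:
            continue
        if column not in newest or timestamp > newest[column][0]:
            newest[column] = (timestamp, value)
    return {column: value for column, (timestamp, value) in newest.items()}
```

For example, changing one column of a row appends a single cell, however many other columns the row already has:

```python
log = []
append_update(log, "u1", "name", "Ann", 1)
append_update(log, "u1", "email", "ann@example.com", 2)
append_update(log, "u1", "email", "ann@example.org", 3)
print(read_row(log, "u1"))
```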

Riak

Riak presents a document-based data model to the end user, but under the hood it maps everything to a key/value storage API. Thus, like MongoDB, updating any field in a document requires rewriting the whole thing.

However, Riak does emphasize the use of log-structured storage engines. Both the default BitCask backend and LevelDB are log-structured. Riak increasingly emphasizes LevelDB since BitCask does not support scan operations (which are required for indexes), but this brings its own set of problems.

LevelDB is a log-structured storage engine with a different approach to compaction than the one introduced by Bigtable. LevelDB trades more compaction i/o for less i/o at read time, which can be a good tradeoff for many workloads, but not all. Cassandra added support for leveldb-style compaction about a year ago.

LevelDB itself is designed to be an embedded database for the likes of Chrome, and clear growing pains are evident when pressed into service as a multi-user backend for Riak. (A LevelDB configuration for Voldemort also exists.) Basho cites “one stall every 2 hours for 10 to 30 seconds”, “cases that can still cause [compaction] infinite loops,” and no way to create snapshots or backups as of the recently released Riak 1.2.

HBase

HBase’s storage engine is the most similar to Cassandra’s; both drew on Bigtable’s design early on.

But despite a later start, Cassandra’s storage engine is far ahead of HBase’s today, in large part because building on HDFS instead of locally-managed storage makes everything harder for HBase. Cassandra added online snapshots almost four years ago; HBase still has a long way to go.

HDFS also makes SSD support problematic for HBase, which is becoming increasingly relevant as SSD price/performance improves. Cassandra has excellent SSD support and even support for mixed SSD and HDD within the same cluster, with data pinned to the medium that makes the most sense for it.

Other differences that may not show up at benchmark time, but you would definitely notice in production:

HBase can’t delete data during minor compactions — you have to rewrite all the data in a region to reclaim disk space. Cassandra has deleted tombstones during minor compactions for over two years.

While you are running that major compaction, HBase gives you no way to throttle it and limit its impact on your application workload. Cassandra introduced this two years ago and continues to improve it. Dealing with local storage also lets Cassandra avoid polluting the page cache with sequential scans from compaction.

Compaction might seem like bookkeeping details, but it does impact the rest of the system. HBase limits you to two or three column families because of compaction and flushing limitations, forcing you to do sub-optimal things to your data model as a workaround.
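
The tombstone handling mentioned above can be sketched like this. In a log-structured engine a delete is itself a write: a marker (tombstone) that shadows older data until a compaction merges the segments and physically drops both. The code is a simplified illustration with hypothetical names (`TOMBSTONE`, `compact`, `gc_before`); in particular it assumes the segments being merged are the only ones that can contain a given key, which a real engine has to verify before discarding a tombstone.

```python
TOMBSTONE = object()  # marker value meaning "this key was deleted"


def compact(segments, gc_before):
    """Merge segments into one, keeping only the newest cell per key.

    Each segment is a dict: key -> (timestamp, value).
    Tombstones older than `gc_before` are dropped along with the data
    they shadow; newer tombstones are kept so the delete is not lost.
    """
    merged = {}
    for segment in segments:
        for key, (timestamp, value) in segment.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return {
        key: (timestamp, value)
        for key, (timestamp, value) in merged.items()
        if not (value is TOMBSTONE and timestamp < gc_before)
    }
```

Merging only a few segments at a time reclaims the space of deleted rows without rewriting the entire dataset.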

Cassandra

I honestly think Cassandra is one to two years ahead of the competition, but I’m under no illusions that Cassandra itself is perfect. We have plenty of improvements to make still; from the recently released Cassandra 1.2 to our ticket backlog, there is no shortage of work to do.

Here are some of the areas I’d like to see Cassandra improve this year:

If working on an industry-leading, open-source database doing cutting edge performance work on the JVM sounds interesting to you, please get in touch.

Cassandra 1.2.1 has been released

Cassandra 1.2.1 has been released and is available for download at http://cassandra.apache.org/download

Changes included in this new version:
  • stream undelivered hints on decommission (CASSANDRA-5128)
  • GossipingPropertyFileSnitch loads saved dc/rack info if needed (CASSANDRA-5133)
  • drain should flush system CFs too (CASSANDRA-4446)
  • add inter_dc_tcp_nodelay setting (CASSANDRA-5148)
  • re-allow wrapping ranges for start_token/end_token range pairing (CASSANDRA-5106)
  • fix validation compaction of empty rows (CASSANDRA-5136)
  • nodetool methods to enable/disable hint storage/delivery (CASSANDRA-4750)
  • disallow bloom filter false positive chance of 0 (CASSANDRA-5013)
  • add threadpool size adjustment methods to JMXEnabledThreadPoolExecutor and CompactionManagerMBean (CASSANDRA-5044)
  • fix hinting for dropped local writes (CASSANDRA-4753)
  • off-heap cache doesn’t need mutable column container (CASSANDRA-5057)
  • apply disk_failure_policy to bad disks on initial directory creation (CASSANDRA-4847)
  • Optimize name-based queries to use ArrayBackedSortedColumns (CASSANDRA-5043)
  • Fall back to old manifest if most recent is unparseable (CASSANDRA-5041)
  • pool [Compressed]RandomAccessReader objects on the partitioned read path (CASSANDRA-4942)
  • Add debug logging to list filenames processed by Directories.migrateFile method (CASSANDRA-4939)
  • Expose black-listed directories via JMX (CASSANDRA-4848)
  • Log compaction merge counts (CASSANDRA-4894)
  • Minimize byte array allocation by AbstractData{Input,Output} (CASSANDRA-5090)
  • Add SSL support for the binary protocol (CASSANDRA-5031)
  • Allow non-schema system ks modification for shuffle to work (CASSANDRA-5097)
  • cqlsh: Add default limit to SELECT statements (CASSANDRA-4972)
  • cqlsh: fix DESCRIBE for 1.1 cfs in CQL3 (CASSANDRA-5101)
  • Correctly gossip with nodes >= 1.1.7 (CASSANDRA-5102)
  • Ensure CL guarantees on digest mismatch (CASSANDRA-5113)
  • Validate correctly selects on composite partition key (CASSANDRA-5122)
  • Fix exception when adding collection (CASSANDRA-5117)
  • Handle states for non-vnode clusters correctly (CASSANDRA-5127)
  • Refuse unrecognized replication and compaction strategy options (CASSANDRA-4795)
  • Pick the correct value validator in sstable2json for cql3 tables (CASSANDRA-5134)
  • Validate login for describe_keyspace, describe_keyspaces and set_keyspace (CASSANDRA-5144)
  • Fix inserting empty maps (CASSANDRA-5141)
  • Don’t remove tokens from System table for node we know (CASSANDRA-5121)
  • fix streaming progress report for compressed files (CASSANDRA-5130)
  • Coverage analysis for low-CL queries (CASSANDRA-4858)
  • Stop interpreting dates as valid timeUUID value (CASSANDRA-4936)
  • Adds E notation for floating point numbers (CASSANDRA-4927)
  • Detect (and warn) unintentional use of the cql2 thrift methods when cql3 was intended (CASSANDRA-5172)

InfoWorld Technology of the Year 2013

InfoWorld just published its Technology of the Year Award winners, and several well-known NoSQL solutions are among them:

  • Apache Hadoop
  • Apache Cassandra
  • Couchbase Server

http://www.infoworld.com/slideshow/80986/infoworlds-2013-technology-of-the-year-award-winners-210419#slide1