LinkedIn open source Helix

LinkedIn has announce the release of Helix, an open source cluster management system.


What is Helix?

The LinkedIn infrastructure stack consists of a number of different distributed systems, each specialized to solve a particular problem. This includes online data storage systems (Voldemort, Espresso), messaging systems (Kafka), change data capture system (Databus), a distributed system that provides search as a service (SeaS), and a distributed graph engine.

Although each system services a different purpose, they share a common set of requirements:

  • Resource management: The resources in the system (such as the database and indexes) must be divided among nodes in the cluster.
  • Fault tolerance: Node failures are unavoidable in any distributed system. However, the system as whole must continue to be available in the presence of such failures, without losing data.
  • Elasticity: As workload grows, clusters must be able to grow to accommodate the increased demand.
  • Monitoring: The cluster must be monitored for node failures as well as other health metrics, such as load imbalance and SLA misses.

Rather than forcing each system to reinvent the wheel, we decided to build Helix, a cluster management framework that solves these common problems. This allows each Distributed System to focus on its distinguishing features, while leaving Helix to take care of cluster management functions.

Helix provides significant leverage beyond just code reuse. At scale, the operational cost of management, monitoring and recovery in these systems far outstrips their single node complexity. A generalized cluster management framework provides a unified way of operating these otherwise diverse systems, leading to operational ease.

Helix at Linkedin

Helix has been under development at LinkedIn since April 2011. Currently, it is used in production in three different systems:

    • Espresso: Espresso is a distributed, timeline consistent, scalable document store that supports local secondary indexing and local transactions. Espresso runs on a number of storage node servers that store and index data and answer queries. Espresso databases are horizontally partitioned across multiple nodes, with each partition having a specified number of replicas. Espresso designates one replica of each partition as master (which accepts writes) and the rest as slaves; only one master may exist for each partition at any time. Helix manages the partition assignment, cluster-wide monitoring, and mastership transitions during planned upgrades and unplanned failure. Upon failure of the master, a slave replica is promoted to be the new master.


    • Databus: Databus is a change data capture (CDC) system that provides a common pipeline for transporting events from LinkedIn primary databases to caches, indexes and other applications such as Search and Graph that need to process the change events. Databus deploys a cluster of relays that pull the change log from multiple databases and let consumers subscribe to the change log stream. Each Databus relay connects to one or more database servers and hosts a certain subset of databases (and partitions) from those database servers, depending on the assignment from Helix.


  • SeaS (Search as a Service): LinkedIn’s Search-as-a-service lets other applications define custom indexes on a chosen dataset and then makes those indexes searchable via a service API. The index service runs on a cluster of machines. The index is broken into partitions and each partition has a configured number of replicas. Each new indexing service gets assigned to a set of servers, and the partition replicas must be evenly distributed across those servers. When indexes are bootstrapped, the search service uses snapshots of the data source to create new index partitions. Helix manages the assignment of index partitions to servers. Helix also limits the number of concurrent bootstraps in the system, as bootstrapping is an expensive process.

Try it out

We invite you to download and try out Helix. In the past year, we have had significant adoption and contributions to Helix by multiple teams at Linkedin. By open sourcing Helix, we intend to grow our contributor base significantly and invite interested developers to participate.

We will also be presenting a paper on Helix at the upcoming SOCC (ACM Symposium on Cloud Computing) at San Jose, CA on Oct 15th, 2012.

Qualys – Lessons Learned from Cracking LinkedIn Passwords

Singular approach from Qualys, reagarding the LinkedIn password leak. Using a patched version of John the Ripper (brute force password craker), they successfully found complex passwords in almost no time by iterating over a password file filled by newly found password from latest iteration. Ultimately cracking password as complex as this one: “lsw4linkedin”


Original source:

Like everyone this week, I learned about a huge file of password hashes that had been leaked by hackers. The 120MB zip file contained 6,458,020 SHA-1 hashes of passwords for end-user accounts.


At first, everyone was talking about a quick way to check if their password had been leaked. This simple Linux command line:


echo -n MyPassword | shasum | cut -c6-40 


allows the user to create a SHA-1 sum of his password and take the 6th through 40th characters of the result. (See note below*). Then the user could easily search the 120MB file to see if his hash was present in the file. If it was, then of course his password had been leaked and his account associated with that password was at risk.


John the Ripper


But when the OpenWall community released a patch to run John The Ripper on the leaked file, it caught my attention.  It has been a long time since I have run John The Ripper, and I decided to install this new, community-enhanced “jumbo” version and apply the LinkedIn patch.


John the Ripper attempts to crack SHA-1 hashes of passwords by iterating on this process: 1. guess a password, 2. generate its SHA-1 hash, and 3. check if the generated hash matches a hash in the 120MB file. When it finds a match, then it knows it has a legitimate password.  John the Ripper iterates in a very smart way, using word files (a.k.a. dictionary attack) and rules for word modifications, to make good guesses. It also has an incremental mode that can try any possible passwords (allowing you to define the set of passwords based on the length or the nature of the password, with numeric, uppercase, or special characters), but this becomes very compute-intensive for long passwords and large character sets.


The fact that the file of hashed passwords was not salted helps a lot.  As an aside, even if they were salted, you could concentrate the cracking session to crack the easiest passwords first using the “single” mode of John the Ripper. But this works best with additional user information like a GECOS, which was not available in this case, at least to the public. So the difficulty would be much greater for salted hashes.




In my case, I have an old machine with no GPU and no rainbow table, so I decided to use good old dictionaries and rules.


I ran the default john command that just launches a small set of rules (like append/prepend 1 to every word, etc.) on a small default password dictionary of less than 4000 words. It then switches to incremental mode based on statistical analysis of known password structures, which helps it try the more likely passwords first. The result was quite impressive because after 4 hours I had approximately 900K passwords already cracked.


But then, as it got to the point were it was trying less and less likely passwords and therefore found matches more slowly, I decided to stop it and run a series of old dictionaries I had: from default common password lists (16KB of data) to words of every existing language (40MB of data). It was very efficient and found 500K more passwords in less than an hour, for a total of 1.4M passwords.


Even though my dictionaries were 10 years old and didn’t contain newer words like “linkedin”, it appeared that some cracking rules, by reversing strings or removing some vowels could guess new slang words from already cracked passwords.


And as I had just acquired 1.4M valid passwords, I believed that using these newly discovered passwords as a dictionary I could find more. It worked and the rules applied to the already cracked passwords produced 550K new ones. I ran a second iteration using the 550K passwords from the first iteration as a dictionary, and found 22K more. I iterated in this manner a total of ten times.


It is interesting to see that the most elaborate passwords found in the 3rd or 4th iteration of this kind of recursive dictionary cracking were related to the word linkedin most of the time:


If I tried to match the word linkedin slightly modified (reversed or with ‘1’ or ‘!’ instead of ‘i’ like in l1nked1n):


  • In the first iteration, 558 passwords found in the 554,404 (0.1%) are related to the ‘Linkedin’ string;
  • In the second iteration, 3248 out of 22,688 (14%) are related to the ‘Linkedin’ string;
  • Third iteration: 1,733 out of 3,682 (47%);
  • Fourth iteration: 539 out of 917 (59%);
  • Fifth iteration: 217 out of 330 (66%);
  • Sixth iteration: 119 out of 152 (78%);
  • Seventh iteration: 40 out of 51 (78%);
  • And so on through the tenth iteration.


An example of what I found on the 7th pass is:  m0c.nideknil


Another example is: lsw4linkedin, which was found on the tenth pass. To illustrate how the rules work for modifying words in the dictionary, below is the actual set of modifications used to get from the dictionary entry ‘pwlink’ to the successfully cracked password ‘lsw4linkedin’ over the ten iterations:


  1. pwdlink from pwlink with the rule “insert d in 3rd position”
  2. pwd4link from pwdlink with the rule “insert 4 in 4th position”
  3. pwd4linked from pwd4link with the rule “append ed”
  4. pw4linked from pwd4linked with the rule “remove 3rd char”
  5. pw4linkedin from pw4linked with the rule “append in”
  6. mpw4linkedin from pw4linkedin with the rule “prepend m”
  7. mw4linkedin  from mpw4linkedin with the rule “remove second character”
  8. smw4linkedin from mw4linkedin with the rule “prepend s”
  9. sw4linkedin from smw4linkedin with the rule “remove second character”
  10. lsw4linkedin from sw4linkedin with the rule “prepend l”


This is the deepest password found, i.e. the only one obtained in the last iteration.


This clearly shows that no matter how elaborate a password you choose, as long as it is based on words and rules, even if there are many words and many rules, it will probably be cracked. The fact is that on a huge file like the LinkedIn leak, every password you find can help you to get another one. That is because human-created passwords are not random, and programs like John the Ripper and dictionary attacks can use patterns, either already known or discovered in the password hash file, to greatly reduce the time needed to crack them.


Password Management


Thus, it is highly recommended to use a strong random password generator that is known to be actually random.


It is funny to note that a very old version of a command line tool called “mkpasswd” produced passwords based on a bad random salt and was generating only 32768 different passwords ( ), this was reported and fixed 10 years ago, but I was still able to recover 140 passwords in the leaked file that had been generated by this vulnerable version of mkpasswd.


Evidence indicates that the hacker who made this leak public was most likely trying to get cracked passwords from an online community, a kind of crowdsource cracking. Since he probably possesses the list of logins as well, you might want to change your passwords in other accounts if you think he can access them with the information he has. Note that if you have unique passwords created with simple rules, you might change them as well. For example, if your password for LinkedIn is MyPW4Linkedin, a malicious cracker might guess that MyPW4Facebook might be your Facebook password.


It is also recommended to change your password if your username can be guessed from it, because every password cracker on the planet is currently playing with this password file.


The author of John the Ripper, Solar Designer, did a great presentation on the past, present and future of password security. Although the security industry has put a lot of work into making good hash functions (and there’s still more work to do), I believe that poorly chosen passwords are a concern. Maybe we should demand that our browsers (using secured storage as in Firefox Manager) or 3rd-party single-sign-on providers create easier solutions to help us resist the temptation of using simple passwords and re-using the same passwords with simple variations.


* Note: The hashes in the 120MB file sometimes had their five first characters rewritten with 0.  If we look at the 6th to 40th characters, we can even find duplicates of these substrings in the file meaning the first five characters have been used for some unknown purpose: is it LinkedIn that stores user information here? is it the initial attacker that tagged a set of account to compromise? This is unknown.


6.5M LinkedIn Passwords Leaked Online

Time has come for password change on LinkedIn …. as millions of passwords are posted on a Russian hacker website.

Few details so far:

  • The data leaked is a file of SHA1 hashes
  • The data have been first posted on a Russian hacker website
  • There are 3,521,180 hashes that begin with 00000. Probably marked to cracked(reversed user’s password).
  • The file does not contain duplicates. LinkedIn claims a user base of 161m. This file contains 6.4m unique password hashes. That’s 25 users per hash. Given the large amount of password reuse and poor password choices it is not improbable that this is the complete password file.


SenseiDB 1.0.0 has been released

LinkedIn Engineering has released as open source SenseiDB, a distributed, semi-structured database. SenseiDB is the technology behind the search infrastructure in LinkedIn and powers the LinkedIn homepage, LinkedIn Signal and other search related features (e.g. people/company search). SenseiDB was developed in-house for the needs of the company and is now released as open source under the Search, Network, Analytics project umbrella.

SenseiDB is a NoSQL database focused on high update rates and complex semi-structured search queries. Users familiar with Lucene and Solr will recognize a lot of concepts behind SenseiDB. SenseiDB is deployed in clusters of multiple nodes, where each node can contain N shards of data. Nodes are managed via Apache Zookeeper which keeps the current configuration and transmits any changes (e.g. topology modifications) to the whole group of nodes. A SenseiDB cluster also requires a schema which defines the data model that will be used.

Getting data into the SenseiDB cluster happens only via Gateways (there is no “INSERT” method). Each cluster is connected to a single gateway. This is one of the critical points to understand, since SenseiDB does not handle itself Atomicity and Isolation. Those should be enforced externally at the gateway level. The gateway must make sure that the data stream behaves in an expected manner. Built-in gateways are:

Custom Gateways can also be implemented by the application developer. An example gateway is provided which gets its data from Twitter updates.

With the input datastream in place feeding data into the cluster, SenseiDB allows for faceted querying according to the defined schema. A REST APIis offered for this purpose that can be accessed by any HTTP client. This API is inspired by ElasticSearch’s Query DSL. SenseiDB also comes with wrappers for this API in Java and Python, with a Ruby version to follow soon.

Finally SenseiDB offers BQL (or Browse Query Language) as an alternative method of querying. BQL is an SQL-like language (containing only SELECT statements at the moment) which can be used to query a SenseiDB in a more convenient way. A graphical web console is provided as part of the cluster installation for inspecting and debugging BQL queries.

For more extensive information, see the documentation, the Javadocs and the Wiki. The source code is hosted on GitHub.

SenseiDB is there

LinkedIn pushed out another NoSQL solution: SenseiDB 1.0 has been released as i used to say.

SenseiDB is Open-source, distributed, realtime, semi-structured database

Some Features:

  • Full-text search
  • Fast realtime updates
  • Structured and faceted search
  • Fast key-value lookup
  • High performing under concurrent heavy update and query volumes
  • Hadoop integration


Hadoop integration

Is done through a fast Map-Reduce job taking data from Hadoop and batch build indexes given a Sensei schema and sharding strategy.

The following diagram illustrates this indexing process:

Voldemort 0.9 released

The announce of the new 0.9 version has been posted on their official mailing list by Roshan Sumbaly.

He also take time to develop all those changes(history,usability) into a blog post available here

Official annoucement:

Hello all,

We are finally ready to do an open release of 0.90 version of Voldemort.
I’ve cut off the final release candidate and uploaded the latest build to

I’m going to wait till 25th of July for everyone to play around and let us
know if there are any obvious issues. We do realize that this release is
massive and have tried our best to document everything you need to know here

Here are some links to get everyone started –
a) Instructions related to upgrading from 0.81 –
b) Source branch –
c) What all has changed –
d) Updated site – It contains the latest Javadocs ( ), all the new configuration
parameters ( ) and our the
most important change – a new logo.
e) And if you really don’t have time and need a quick summary –…


Do feel free to drop us a message.Thanks,


LinkedIn open sourced Kafka

LinkedIn bring a great contribution to open source and NoSQL community with Voldemort.

They also open sourced a couple of other exciting projects, and now the open sourced another project named Kafka.

Its a distributed publish-subscribe messaging system. It is designed to support the following

  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
  • Support for parallel data load into Hadoop.

Kafka is aimed at providing a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) are a key ingredient in many of the social feature on the modern web. This data is typically handled by “logging” and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is a viable solution to providing logging data to an offline analysis system like Hadoop, but is very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.

Find out more details on Kafka website