Details on stability improvements in the Hypertable 0.9.5.0 pre-release have been posted on the blog: http://blog.hypertable.com/
We recently announced the Hypertable 0.9.5.0 pre-release. Even though we’ve labelled it as a “pre” release, it is one of the biggest and most important Hypertable releases to date. Among other things, it includes a complete re-write of the Master to fix some known stability problems. It represents a significant amount of work, as can be seen from the following code change statistics:
- 512 files changed
- 30,633 line insertions
- 14,354 line deletions
The following describes problems that existed in prior releases and how they were solved, and highlights other stability improvements included in the 0.9.5.0 pre-release.
Duplicate range load. In prior releases, when a RangeServer decided to give up a range (e.g. after a split), it would inform the Master by calling the Master::move_range() method and then record the move in its meta log (RSML). Unfortunately, this logic contained a race condition. If the RangeServer called Master::move_range() but died before it could record the move in the RSML, and the Master was then stopped (e.g. a sysadmin restart of the system), all record of the move was lost. When the RangeServer came back up, it would re-attempt the move, causing the range to be loaded by two different RangeServers. With the introduction of the Master MetaLog (MML) and a two-phase Master::move_range() operation, this problem has been resolved.
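The race above can be reproduced in a few lines. The following is a toy Python simulation, not Hypertable's C++ code: ToyMaster, acknowledge_move() and simulate() are hypothetical names, and a plain dict stands in for the Master MetaLog. It only assumes that the fix works by persisting the move on the Master side before acting on it, so that a retried move_range() becomes a no-op.

```python
import itertools


class ToyMaster:
    """Toy stand-in for the Master; not Hypertable's actual implementation."""

    def __init__(self, servers, metalog=None):
        self._next_server = itertools.cycle(servers)  # round-robin destination choice
        # A dict supplied by the caller survives "restarts" (stand-in for the MML).
        # Without one, the state lives only in this object and dies with it.
        self.moves = metalog if metalog is not None else {}

    def move_range(self, range_id, loads):
        """Phase 1: record the move, then load the range on a destination."""
        if range_id in self.moves:           # move already known: replay is a no-op
            return self.moves[range_id]
        dest = next(self._next_server)
        self.moves[range_id] = dest          # recorded before the load is issued
        loads.append((range_id, dest))       # stand-in for loading the range on dest
        return dest

    def acknowledge_move(self, range_id):
        """Phase 2 (hypothetical name): the source has logged the move; retire the entry."""
        self.moves.pop(range_id, None)


def simulate(durable):
    """Replay the crash sequence and return every (range, server) load that happened."""
    metalog = {} if durable else None
    loads = []
    master = ToyMaster(["rs1", "rs2"], metalog)
    master.move_range("range-A", loads)          # source dies before writing its RSML
    master = ToyMaster(["rs2", "rs1"], metalog)  # Master is restarted
    master.move_range("range-A", loads)          # source comes back and retries the move
    master.acknowledge_move("range-A")           # source finally logs the move
    return loads
```

Calling simulate(False) yields two loads of the same range on two different servers, which is the duplicate range load described above, while simulate(True) yields a single load because the retried call finds the move already recorded.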
Overlapping ranges. In prior releases, the Master would ask a RangeServer to load a range by calling the RangeServer::load_range() method and would rely on the ALREADY_LOADED response code to handle situations where the acknowledgement was lost (e.g. the RangeServer or Master died at an inopportune moment) and the RangeServer::load_range() call was re-issued. This logic also contained a race condition. When a range was loaded and the acknowledgement was lost, the loaded range could split before the Master re-attempted to load it. When the RangeServer::load_range() call was re-issued, the RangeServer happily loaded the range again because, due to the split, the original range was no longer in its live set. With the introduction of a two-phase load range operation, this problem has been resolved.
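The same kind of toy simulation illustrates the overlapping-range race. Again this is a Python sketch under stated assumptions, not Hypertable's code: ToyRangeServer, acknowledge_load(), has_overlap() and simulate_load() are hypothetical names, ranges are modelled as half-open (start, end) tuples, and the two-phase load is assumed to keep a range in a "pending" state, where it cannot split, until the Master acknowledges the load.

```python
class ToyRangeServer:
    """Toy stand-in for a RangeServer; not Hypertable's actual implementation."""

    def __init__(self, two_phase):
        self.two_phase = two_phase
        self.live = set()      # ranges being served, as (start, end) tuples
        self.pending = set()   # loaded, but not yet acknowledged by the Master

    def load_range(self, rng):
        if rng in self.live or rng in self.pending:
            return "ALREADY_LOADED"
        (self.pending if self.two_phase else self.live).add(rng)
        return "OK"

    def acknowledge_load(self, rng):
        """Phase 2 (hypothetical name): the Master confirms, the range goes live."""
        self.pending.discard(rng)
        self.live.add(rng)

    def split(self, rng, mid):
        if rng not in self.live:   # assumption: pending ranges are not allowed to split
            return False
        start, end = rng
        self.live.remove(rng)
        self.live |= {(start, mid), (mid, end)}
        return True

    def served(self):
        return sorted(self.live | self.pending)


def has_overlap(ranges):
    """True if any two neighbouring ranges in sorted order share keys."""
    ranges = sorted(ranges)
    return any(a_end > b_start
               for (_, a_end), (b_start, _) in zip(ranges, ranges[1:]))


def simulate_load(two_phase):
    rs = ToyRangeServer(two_phase)
    rs.load_range(("a", "z"))          # acknowledgement to the Master is lost
    rs.split(("a", "z"), "m")          # the range tries to split before the retry
    retry = rs.load_range(("a", "z"))  # Master re-issues the load
    return retry, has_overlap(rs.served())
```

With two_phase=False the retried load returns "OK" and the server ends up serving ("a", "z") alongside both halves of the split. With two_phase=True the split is refused while the load is unacknowledged, so the retry gets ALREADY_LOADED and no overlap arises.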