Steve Jobs

“Many companies forget what it means to make great products. After initial success, sales and marketing people take over and the product people eventually make their way out.”

 

Definition of the term “technical debt”

The term “technical debt” is used to describe the small mistakes that accumulate in your code base as the application grows and requirements change. You end up with a big tangled mess if you are not very careful about how you make those changes. If you’ve ever worked on an application where you were afraid to make changes for fear of breaking something, then you have run into technical debt.

It’s very easy to build up technical debt: you put a quick hack in because you feel like you are under pressure to get a task accomplished. It takes courage to push back and say “no, damnit, I need time to fix this particular problem correctly!”. I understand, and it’s okay. We can hug it out if it will make you feel better.

Technical debt can be dealt with via ruthless refactoring and wrapping your application in tests that poke and prod at the edge cases that your application deals with.
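As a rough illustration, below is a minimal sketch, in Python, of what “poking at the edge cases” can look like; the parse_price helper and its rules are made up for the example, but this is the kind of test that makes later refactoring far less scary.

import unittest


def parse_price(raw):
    """Hypothetical helper: turn a user-supplied price string into cents."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    if not cleaned:
        raise ValueError("empty price")
    return round(float(cleaned) * 100)


class ParsePriceEdgeCases(unittest.TestCase):
    def test_plain_value(self):
        self.assertEqual(parse_price("19.99"), 1999)

    def test_currency_symbol_and_thousands_separator(self):
        self.assertEqual(parse_price("$1,299.50"), 129950)

    def test_whitespace_only_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_price("   ")


if __name__ == "__main__":
    unittest.main()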

Solving your performance issues when dealing with big data

  1. Foresee and understand your performance issue

When dealing with big data, you will face performance problems with even the most simple and basic operations as soon as the processing requires the whole data set to be analyzed.

This is the case, for instance, when:

  • You aggregate data to deliver summary statistics: actions such as “count”, “min”, “avg”, etc.
  • You need to sort your data

With this in mind, you can easily and quickly anticipate issues in advance and start thinking about how to solve them.
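As a rough sketch of the aggregation case: count, min, max and avg can be computed in a single streaming pass, so the whole data set never has to sit in memory. The file name and the one-value-per-line format below are assumptions for the example.

def streaming_summary(path):
    """One pass over the data: no sorting, nothing kept in memory."""
    count, total = 0, 0.0
    minimum, maximum = float("inf"), float("-inf")
    with open(path) as lines:
        for line in lines:
            value = float(line)
            count += 1
            total += value
            minimum = min(minimum, value)
            maximum = max(maximum, value)
    avg = total / count if count else None
    return {"count": count, "min": minimum, "max": maximum, "avg": avg}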

  2. Solving the performance issue using technical tools

Compression is often a key solution to many performance issues, because it relies on CPU speed, which is currently always faster than disk I/O and network I/O. Compression therefore speeds up disk access and data transfer over the network, and can even allow you to keep the reduced data in memory.
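A minimal Python sketch of that trade-off, using the standard zlib module (the record layout is made up for the example): a little CPU buys a much smaller payload to write or transfer.

import json
import zlib

records = [{"id": i, "value": i % 7} for i in range(10_000)]
raw = json.dumps(records).encode("utf-8")
compressed = zlib.compress(raw, 6)   # spend CPU here...

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")

# ...to save disk and network I/O there; reading it back is the reverse.
restored = json.loads(zlib.decompress(compressed))
assert restored == records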

Statistics can often be applied to speed up your algorithms, and they are not necessarily complex: maintaining value ranges (min, max) or value distributions may shorten your processing path.
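For instance, here is a small Python sketch of min/max pruning, where a cheap per-block (min, max) summary lets a range query skip whole blocks; the block layout is an assumption for the example.

blocks = [
    {"summary": (0, 99),    "values": list(range(0, 100))},
    {"summary": (100, 199), "values": list(range(100, 200))},
    {"summary": (200, 299), "values": list(range(200, 300))},
]

def find_in_range(blocks, low, high):
    hits = []
    for block in blocks:
        lo, hi = block["summary"]
        if hi < low or lo > high:
            continue  # the cheap statistics let us skip the whole block
        hits.extend(v for v in block["values"] if low <= v <= high)
    return hits

print(len(find_in_range(blocks, 150, 160)))  # only the middle block is scanned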

Caching applies to deterministic results that are produced by processes independent of the data, or that are based on data which rarely changes and which you can assume will not change during your processing time.
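A minimal Python sketch using functools.lru_cache; the lookup function is hypothetical, standing in for any expensive, deterministic computation over data that will not change during the run.

from functools import lru_cache

@lru_cache(maxsize=None)
def country_for_ip_prefix(prefix):
    # Imagine an expensive lookup against a reference table that rarely changes.
    print(f"expensive lookup for {prefix}")
    return {"82.112": "FR", "17.0": "US"}.get(prefix, "??")

country_for_ip_prefix("82.112")  # computed once
country_for_ip_prefix("82.112")  # served from the cache, no second lookup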

Avoid data type conversions, because they always consume resources.

Balance the load, parallelize the processing, and use MapReduce :)
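A tiny map/reduce-style word count in Python, parallelized with multiprocessing.Pool, just to sketch the idea of spreading the “map” work over several workers and merging the partial results.

from collections import Counter
from multiprocessing import Pool


def map_count(chunk):
    return Counter(chunk.split())


def reduce_counts(counters):
    total = Counter()
    for partial in counters:
        total.update(partial)
    return total


if __name__ == "__main__":
    chunks = ["big data big", "data big deal", "deal with data"]
    with Pool(processes=3) as pool:
        partials = pool.map(map_count, chunks)   # the "map" step, in parallel
    print(reduce_counts(partials))               # the "reduce" step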

 

  3. Solving the performance issue by giving up or scaling back

We tend to reject such an approach, but sometimes it is a good exercise to step back and review why we do the things we do.

Can I approximate without significantly altering the result?

Can I use a representative data sample instead of the whole data set?

At the very least, do not avoid thinking this way: ask yourself whether solving an easier problem or accepting an approximate result might ultimately bring you very close to the solution.
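A small Python sketch of the sampling idea: estimate an average from a 1% random sample and compare it with the exact value (the data here is synthetic).

import random

random.seed(42)
population = [random.gauss(50, 10) for _ in range(200_000)]

exact = sum(population) / len(population)
sample = random.sample(population, 2_000)        # 1% of the data
estimate = sum(sample) / len(sample)

print(f"exact={exact:.3f} estimate={estimate:.3f} error={abs(exact - estimate):.3f}")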

 

Little thoughts – when and why can data size become a problem?

Small question: when and why can data size become a problem?
And here comes a small answer (simply little thoughts, as I said).
Data size might be a problem for the following reasons:
  • Storage: not enough space to store your data, although this concern tends to disappear nowadays.
  • Performance: when you are required to process all the data, and especially when results must be delivered in real time.
  • Privacy: not so much a technical issue, but you can quickly run afoul of the law or end up creating a valuable data set for attackers.

Riak overview

Riak 1.0 is expected to be revealed and released soon by Basho Technologies, the creators of Riak.
Quick overview:
  • Riak is, along with Cassandra and Voldemort, an implementation of Amazon’s Dynamo.
  • Meaning it is a highly available, key-value structured storage system, aka a distributed data store.
  • Riak also implements built-in MapReduce with native support for the JavaScript and Erlang programming languages.
  • Client drivers are available for Erlang, Python, Java, PHP, JavaScript, and Ruby.
  • Riak is written in Erlang.
  • Riak comes in two flavors: open source and enterprise. Enterprise is a superset of the open source version with a few features added.
  • No master node: all nodes in a Riak cluster are equal and each node is fully capable of serving any client request (made possible by using consistent hashing to distribute data around the cluster – see the sketch below).
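Here is a minimal, illustrative Python sketch of consistent hashing used to spread keys across equal nodes; the node names and ring parameters are made up, and this is not Riak’s actual implementation.

import bisect
import hashlib


def ring_position(value):
    # Hash a string to a position on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Each node owns several virtual points on the ring for better balance.
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key):
        # A key lives on the first node found clockwise from its position.
        idx = bisect.bisect(self._positions, ring_position(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["riak1", "riak2", "riak3"])
print(ring.node_for("user:1234"))  # any node can compute where the key lives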



Definition of the term “Big Data”

Big Data is a name for new data-analysis technologies, as well as a movement to develop real-world uses for these capabilities, and it holds big promise. The last paradigm shift in the practice was managing the burdens of the information explosion, including volumes of data that made manual review impractical, expensive, and less effective than necessary. With Big Data tools, the focus turns from managing the burden of large amounts of information to leveraging its value.

Difficulties include:

  • capture
  • storage
  • search
  • sharing
  • analytics
  • visualizing

This trend continues because of the benefits of working with larger and larger datasets, allowing analysts to “spot business trends, prevent diseases, and combat crime”.

Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.

Preferred areas of application include:

  • meteorology
  • genomics
  • biological research
  • Internet search
  • finance

 

One current feature of big data is the difficulty of working with it using relational databases and desktop statistics/visualization packages, requiring instead “massively parallel software running on tens, hundreds, or even thousands of servers”. The NoSQL trend aims to provide solutions for organizations facing such new challenges.

 

Definition of the term “NewSQL”

We talk often about various types of NoSQL databases – document stores like Apache CouchDB, graph databases like Neo4j and BigTable clones like HBase. But we also occasionally talk about various attempts to improve the tried and true relational database model – projects like Drizzle, HandlerSocket, RethinkDB (coverage), TokuTek and VoltDB.

The 451 Group dubs these “NewSQL” databases. In a blog post, 451 analyst Matthew Aslett explores this burgeoning category of database and adds several to our growing list of projects.

On the definition of NewSQL, Aslett writes:

“NewSQL” is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ‘ScalableSQL’ to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term ‘NewSQL’ in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.

 

In addition to the ones mentioned above, Aslett cites the following as NewSQL vendors:

What do you think? Is this a helpful new term, or just more buzz wordism?

 

State of the Linking Open Data cloud

The State of the LOD Cloud got an update to version 0.2 as of 03/28/2011.

It provides updated statistics about the structure and content of the LOD cloud. It also analyzes the extent to which LOD data sources implement nine best practices that are either recommended by the W3C or have emerged within the LOD community.

Linked Open Data star scheme by example

Tim Berners-Lee suggested a 5-star deployment scheme for Linked Open Data and Ed Summers provided a nice rendering of it. In the following, examples are given for each level. The example data used throughout is ‘the temperature forecast for Galway, Ireland for the next 3 days’:

★ make your stuff available on the Web (whatever format) under an open license
★★ make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to identify things, so that people can point at your stuff
★★★★★ link your data to other data to provide context
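As a rough illustration, here is a small Python sketch that takes the 3-star CSV version of the (made-up) Galway forecast and rewrites it as a 4/5-star snippet that identifies things with URIs and links out to another data set; the URIs and vocabulary are illustrative assumptions, not recommended terms.

csv_row = "Galway,2011-03-30,12\n"  # 3 stars: open, structured, non-proprietary

def row_to_turtle(place, date, temperature_c):
    # 4-5 stars: things get URIs, and the place links out to DBpedia for context.
    subject = f'<http://example.org/forecast/{place.lower()}/{date}>'
    return "\n".join([
        f'{subject} <http://example.org/vocab#place> <http://dbpedia.org/resource/Galway> ;',
        f'    <http://example.org/vocab#date> "{date}" ;',
        f'    <http://example.org/vocab#celsius> {temperature_c} .',
    ])

place, date, temp = csv_row.strip().split(",")
print(row_to_turtle(place, date, int(temp)))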


Definition of the term “Gutmann method”

The delete function in most operating systems simply marks the space occupied by the file as reusable, without immediately removing any of its contents. This makes it possible to recover the data.

 

The Gutmann method is an algorithm for securely erasing the contents of computer hard drives.

 

About Peter Gutmann: he is a professor at the University of Auckland in New Zealand, specialising in network security.

He has published a number of well-known papers including:

  • Secure Deletion of Data from Magnetic and Solid-State Memory, a classic paper used as a reference by many disk-wiping utilities.
  • A Cost Analysis of Windows Vista Content Protection, with the rather memorable “executive executive summary”:

The Vista Content Protection specification could very well constitute the longest suicide note in history.

  • Software Generation of Practically Strong Random Numbers

 

 

Technical overview

One standard way to recover data that has been overwritten on a hard drive is to capture and process the analog signal obtained from the drive’s read/write head prior to this analog signal being digitized. This analog signal will be close to an ideal digital signal, but the differences will reveal important information. By calculating the ideal digital signal and then subtracting it from the actual analog signal, it is possible to amplify the signal remaining after subtraction and use it to determine what had previously been written on the disk.

For example:

Analog signal:        +11.1  -8.9  +9.1 -11.1 +10.9  -9.1
Ideal Digital signal: +10.0 -10.0 +10.0 -10.0 +10.0 -10.0
Difference:            +1.1  +1.1  -0.9  -1.1  +0.9  +0.9
Previous signal:      +11    +11   -9   -11    +9    +9

This can then be done again to see the previous data written:

Recovered signal:     +11    +11   -9   -11    +9    +9
Ideal Digital signal: +10.0 +10.0 -10.0 -10.0 +10.0 +10.0
Difference:            +1    +1    +1    -1    -1    -1
Previous signal:      +10   +10   +10   -10   -10   -10
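A small Python sketch reproducing the arithmetic above; the ×10 gain is simply the scale implied by the worked example.

def recover_previous(measured, ideal, gain=10.0):
    # Subtract the ideal digital signal, then amplify the residue.
    residue = [m - i for m, i in zip(measured, ideal)]
    return [round(r * gain) for r in residue]

analog = [11.1, -8.9, 9.1, -11.1, 10.9, -9.1]
ideal  = [10.0, -10.0, 10.0, -10.0, 10.0, -10.0]
previous = recover_previous(analog, ideal)
print(previous)  # [11, 11, -9, -11, 9, 9]

# Repeat with the recovered layer to peel back one more generation.
ideal2 = [10.0 if v > 0 else -10.0 for v in previous]
print(recover_previous(previous, ideal2))  # [10, 10, 10, -10, -10, -10]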

However, even when the disk is overwritten repeatedly with random data, it is theoretically possible to recover the previous signal. The permittivity of a medium changes with the frequency of the magnetic field. This means that a lower-frequency field will penetrate deeper into the magnetic material on the drive than a high-frequency one. So a low-frequency signal will, in theory, still be detectable even after it has been overwritten hundreds of times by a high-frequency signal.

The patterns used are designed to apply alternating magnetic fields of various frequencies and various phases to the drive surface, and thereby approximate degaussing the material below the surface of the drive.

 

 

Definition of the term “Entry Consistency”

Entry Consistency

  • With release consistency, all local updates are propagated to other processors during the release of a shared variable.
  • With entry consistency, each shared variable is associated with a synchronization variable.
  • When acquiring the synchronization variable, the most recent values of its associated shared variables are fetched.

Note: Where release consistency affects all shared variables, entry consistency affects only those shared variables associated with a synchronization variable.
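As a rough illustration of the acquire-fetches-latest idea (not a real distributed shared memory implementation), here is a small Python sketch in which a shared variable is paired with its own synchronization variable; the “home node” is simulated by a plain dictionary.

import threading


class GuardedShared:
    """A shared variable plus its associated synchronization variable."""

    def __init__(self, fetch_latest, write_back):
        self._sync = threading.Lock()      # the synchronization variable
        self._fetch_latest = fetch_latest  # pulls the current value from its owner
        self._write_back = write_back      # propagates the update on release
        self.value = None

    def __enter__(self):
        self._sync.acquire()
        self.value = self._fetch_latest()  # entry consistency: refresh on acquire
        return self

    def __exit__(self, *exc):
        self._write_back(self.value)       # only this variable is propagated
        self._sync.release()


store = {"counter": 0}  # stand-in for the variable's home node
counter = GuardedShared(lambda: store["counter"],
                        lambda v: store.update(counter=v))

with counter as c:
    c.value += 1
print(store["counter"])  # 1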

Question: What would be a convenient way of making entry consistency more or less transparent to programmers?