Solving your performance issue when dealing with big data

  1. Foresee and understand your performance issue

When dealing with big data, you will face performance problems  with the most simple and basic operation as soon as the processing require the whole data sets to be analyzed.

It is the case for instance when:

  • You aggregate data, to deliver summary statistics: action such as “count”,”min”,”avg” etc…
  • You need to sort your data

This in mind, you can easily and quickly anticipated issues in advance and start thinking about solving the problem.

  1. Solving the performance issue using technical tools

Compression is often a key solution to many performance issue as its require CPU speed which is currently always faster than i/o disk and i/o networks, so compression allow to speed up disk access, data transfer over network and eventually allow to  keep reduced data in memory.

Statistics can often apply to fasten your algorithm and are not necessarily complex, maintaining values range (min,max) or values distribution might fasten your processing resolution path.

Caching, deterministic result are provided by process independant from the data or based on data which rarely changed and forwhich you can assume they won’t change during your process time

Avoid data type conversions, because it’s always resources consuming

Balance then loadparalellized the processing and use map reduce :)


  1. Solving the performance issue by giving-up or resigning

We tend to refuse such approach, but sometimes it is a good exercise to go back and review why we do the things we do.

Can i approximate without altering significantly the result ?

Can i use a representative data sample instead of the whole data ?

At least do not avoid to think this way, wondering if solving an easier problem or looking for approximate result can’t finally bring you very close to the solution.


Leave a Reply

You must be logged in to post a comment.