- Foresee and understand your performance issues
When dealing with big data, you will face performance problems even with the simplest, most basic operations as soon as the processing requires the whole data set to be analyzed.
This is the case, for instance, when:
- You aggregate data to deliver summary statistics: actions such as "count", "min", "avg", etc.
- You need to sort your data
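To illustrate the aggregation case above, here is a minimal sketch (the data source is a plain Python iterable, purely for illustration) of computing count, min, and avg in one streaming pass: the whole data set must still be read, but exactly once.

```python
def summarize(values):
    """Return (count, minimum, average) after one full scan."""
    count, total, minimum = 0, 0, None
    for v in values:
        count += 1
        total += v
        if minimum is None or v < minimum:
            minimum = v
    avg = total / count if count else None
    return count, minimum, avg

count, minimum, avg = summarize([42, 7, 13, 99, 7])
print(count, minimum, avg)  # 5 7 33.6
```

However large the input, this keeps only three running values in memory, which is why a single-pass design matters at big-data scale.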
With this in mind, you can easily and quickly anticipate issues and start thinking about how to solve them.
- Solving performance issues using technical tools
Compression is often a key solution to many performance issues, as it relies on CPU speed, which is currently always faster than disk and network I/O. Compression therefore speeds up disk access and data transfer over the network, and can eventually allow you to keep the reduced data in memory.
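The CPU-for-I/O trade described above can be sketched with the standard-library zlib module: a few CPU cycles shrink the payload before it ever touches the disk or the network.

```python
import zlib

# Repetitive data (logs, CSV exports...) compresses very well.
payload = b"2024-01-01 INFO request served in 12ms\n" * 10_000

compressed = zlib.compress(payload, level=6)  # CPU work here...
restored = zlib.decompress(compressed)        # ...saves I/O everywhere else

assert restored == payload
print(len(payload), len(compressed))
```

The compression level is a knob: higher levels spend more CPU for smaller output, so the right setting depends on whether your bottleneck is the processor or the wire.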
Statistics can often be applied to speed up your algorithm, and they are not necessarily complex: maintaining a value range (min, max) or a value distribution might shorten your processing resolution path.
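A hedged sketch of the (min, max) idea: keep a per-block value range so a lookup can skip whole blocks that cannot contain a match, instead of scanning every row. The block size and data below are illustrative.

```python
def build_zone_map(rows, block_size=4):
    """Split rows into blocks, each annotated with its (min, max) range."""
    blocks = []
    for i in range(0, len(rows), block_size):
        block = rows[i:i + block_size]
        blocks.append((min(block), max(block), block))
    return blocks

def find(blocks, target):
    """Scan only the blocks whose range could contain the target."""
    hits = []
    for lo, hi, block in blocks:
        if lo <= target <= hi:
            hits.extend(v for v in block if v == target)
    return hits

blocks = build_zone_map([1, 2, 3, 4, 50, 51, 52, 53])
print(find(blocks, 51))  # [51] -- the first block is never scanned
```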
Caching applies when deterministic results are produced by a process independent of the data, or based on data which rarely changes and for which you can assume it won't change during your processing time.
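A minimal sketch of that caching condition, assuming the lookup is deterministic and its reference data will not change while the job runs: `functools.lru_cache` memoizes the expensive call, so only distinct inputs hit the slow path. The lookup table here is hypothetical.

```python
from functools import lru_cache

CALLS = {"n": 0}  # instrumentation, to show how often the slow path runs

@lru_cache(maxsize=None)
def classify(ip_prefix):
    CALLS["n"] += 1  # stands in for a slow, deterministic lookup
    return {"10": "internal", "192": "internal"}.get(ip_prefix, "external")

for prefix in ["10", "192", "10", "10", "8"]:
    classify(prefix)

print(CALLS["n"])  # 3 -- five calls, but only three distinct inputs
```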
Avoid data type conversions, because they always consume resources.
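An illustrative sketch of that point: parse each field once when the record is first read, instead of re-converting the same string on every pass over the data.

```python
rows = [{"price": "19.90"}, {"price": "5.25"}] * 1000

# Convert once, at ingestion time...
parsed = [float(r["price"]) for r in rows]

# ...so every later pass reuses the numeric value directly,
# instead of calling float() again in each aggregation.
total = sum(parsed)
maximum = max(parsed)
print(round(total, 2), maximum)
```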
Balance the load, parallelize the processing, and use map-reduce.
- Solving performance issues by giving up or stepping back
We tend to refuse such an approach, but sometimes it is a good exercise to step back and review why we do the things we do.
Can I approximate without significantly altering the result?
Can I use a representative data sample instead of the whole data set?
At the very least, do not avoid thinking this way: ask yourself whether solving an easier problem, or accepting an approximate result, might in the end bring you very close to the solution.
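The sampling question above can be sketched as follows: estimate the average from a random sample instead of the full data set. This is illustrative, not a statistical guarantee; the acceptable error depends on the sample size and the data's variance.

```python
import random

random.seed(42)  # reproducible data for the example
population = [random.gauss(100, 15) for _ in range(1_000_000)]

# Read 1% of the data instead of all of it.
sample = random.sample(population, 10_000)
estimate = sum(sample) / len(sample)
exact = sum(population) / len(population)

print(round(exact, 1), round(estimate, 1))  # very close, at ~1% of the cost
```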