“Many companies forget what it means to make great products. After initial success, sales and marketing people take over and the product people eventually make their way out.”
The term “technical debt” is used to describe the small mistakes that are made in your code base as the application grows and requirements change. You end up with a big tangled mess if you are not very careful about how you make those changes. If you’ve ever worked on an application where you were afraid to make changes for fear of breaking something, then you have run into technical debt.
It’s very easy to build up technical debt: you put a quick hack in because you feel like you are under pressure to get a task accomplished. It takes courage to push back and say “no, damnit, I need time to fix this particular problem correctly!”. I understand, and it’s okay. We can hug it out if it will make you feel better.
Technical debt can be dealt with via ruthless refactoring and by wrapping your application in tests that poke and prod at the edge cases it has to handle.
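As a minimal sketch of what “poking at the edge cases” can look like, here is a hypothetical unit test in Python; the `parse_price` function and its behaviour are invented purely for illustration:

```python
# A tiny, hypothetical regression test: pin down the edge cases
# (empty input, whitespace, garbage) so refactoring stays safe.
import unittest

def parse_price(text):
    """Parse a price string like ' 19.99 ' into cents; None if invalid."""
    text = text.strip()
    if not text:
        return None
    try:
        return int(round(float(text) * 100))
    except ValueError:
        return None

class ParsePriceEdgeCases(unittest.TestCase):
    def test_empty_and_whitespace(self):
        self.assertIsNone(parse_price(""))
        self.assertIsNone(parse_price("   "))

    def test_garbage_input(self):
        self.assertIsNone(parse_price("abc"))

    def test_normal_value(self):
        self.assertEqual(parse_price(" 19.99 "), 1999)

if __name__ == "__main__":
    unittest.main()
```

Once tests like these pin down the edge cases, the refactoring itself becomes far less frightening.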
When dealing with big data, you will face performance problems with even the simplest and most basic operations as soon as the processing requires the whole data set to be analyzed.
This is the case, for instance, when:
With this in mind, you can easily anticipate issues in advance and start thinking about how to solve them.
Compression is often a key solution to many performance issues: it relies on CPU speed, which is still typically faster than disk and network I/O, so it speeds up disk access and data transfer over the network, and may even allow you to keep the reduced data in memory.
Statistics can often be applied to speed up your algorithm, and they are not necessarily complex: maintaining value ranges (min, max) or value distributions might shorten your processing path.
Caching: cache deterministic results, that is, results produced by processes independent of the data, or based on data which rarely changes and which you can assume will not change during your processing time.
Avoid data type conversions, because they always consume resources.
Balance the load, parallelize the processing, and use map-reduce; a minimal sketch follows below.
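Here is the promised sketch: a toy map-reduce word count using only the Python standard library. The input file name, chunk size, and worker count are assumptions; a real job would typically run on a framework such as Hadoop or Spark.

```python
# Minimal map-reduce sketch: count words in parallel chunks, then merge.
from collections import Counter
from multiprocessing import Pool

def map_count(lines):
    """Map step: count words in one chunk of lines."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter

def chunk(seq, n):
    """Split a list into roughly n equal chunks."""
    size = max(1, len(seq) // n)
    return [seq[i:i + size] for i in range(0, len(seq), size)]

if __name__ == "__main__":
    with open("data.txt") as f:            # assumed input file
        lines = f.readlines()
    with Pool(4) as pool:
        partials = pool.map(map_count, chunk(lines, 4))
    total = sum(partials, Counter())        # reduce step: merge partial counts
    print(total.most_common(10))
```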
We tend to resist such an approach, but sometimes it is a good exercise to go back and review why we do the things we do.
Can I approximate without significantly altering the result?
Can I use a representative data sample instead of the whole data set?
At the very least, do not avoid thinking this way: ask whether solving an easier problem or settling for an approximate result might ultimately bring you very close to the solution; a toy sketch follows below.
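As a toy illustration of that trade-off, the following sketch estimates an average from a 1% random sample instead of scanning everything; the data is synthetic and the numbers are only indicative.

```python
# Approximate instead of exact: estimate the mean from a random sample.
import random

population = [random.gauss(100, 15) for _ in range(1_000_000)]  # synthetic data

exact = sum(population) / len(population)       # full scan of the data set
sample = random.sample(population, 10_000)      # 1% representative sample
approx = sum(sample) / len(sample)

print(f"exact={exact:.2f} approx={approx:.2f} error={abs(exact - approx):.2f}")
```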
Riak 1.0 is expected to be released soon by Basho Technologies, the creators of Riak.
Big Data, a name for new data-analysis technologies as well as a movement to develop real-world uses for these capabilities, holds big promise. Managing the burdens of the information explosion, including volumes of data that made manual review impractical, expensive, and less effective than necessary, was the last paradigm shift in the practice. With Big Data tools, the focus turns from managing the burden of large amounts of information to leveraging its value.
This trend continues because of the benefits of working with larger and larger datasets, allowing analysts to “spot business trends, prevent diseases, combat crime”.
Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.
Preferred areas of application include:
One current feature of big data is the difficulty of working with it using relational databases and desktop statistics/visualization packages, requiring instead “massively parallel software running on tens, hundreds, or even thousands of servers”. The NoSQL trend aims to provide solutions for organizations facing such new challenges.
We talk often about various types of NoSQL databases – document stores like Apache CouchDB, graph databases like Neo4j and BigTable clones like HBase. But we also occasionally talk about various attempts to improve the tried and true relational database model – projects like Drizzle, HandlerSocket, RethinkDB, TokuTek and VoltDB.
On the definition of NewSQL, Aslett writes:
“NewSQL” is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ‘ScalableSQL’ to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term ‘NewSQL’ in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.
In addition to the ones mentioned above, Aslett cites the following as NewSQL vendors:
What do you think? Is this a helpful new term, or just more buzz wordism?
The State of the LOD Cloud got an update to version 0.2 as of 03/28/2011.
It provides updated statistics about the structure and content of the LOD cloud. It also analyzes the extent to which LOD data sources implement nine best practices that are either recommended by the W3C or have emerged within the LOD community.
Tim Berners-Lee suggested a 5-star deployment scheme for Linked Open Data and Ed Summers provided a nice rendering of it. In the following, examples are given for each level. The example data used throughout is ‘the temperature forecast for Galway, Ireland for the next 3 days’:
| Rating | Requirement |
|---|---|
| ★ | make your stuff available on the Web (whatever format) under an open license |
| ★★ | make it available as structured data (e.g., Excel instead of image scan of a table) |
| ★★★ | use non-proprietary formats (e.g., CSV instead of Excel) |
| ★★★★ | use URIs to identify things, so that people can point at your stuff |
| ★★★★★ | link your data to other data to provide context |
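To make the fourth and fifth stars concrete, here is a small sketch using the rdflib Python library; the forecast vocabulary and example.org URIs are invented for illustration, while the DBpedia URI is a real external resource one might link to.

```python
# 4-star: use URIs to identify things; 5-star: link out to other data sets.
# Sketch only; the forecast vocabulary below is made up for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDFS

EX = Namespace("http://example.org/forecast/")      # hypothetical namespace

g = Graph()
galway = URIRef("http://example.org/place/Galway")  # a URI identifying the place

g.add((galway, RDFS.label, Literal("Galway, Ireland")))
g.add((galway, EX.forecastTemperature, Literal("12")))   # toy data point
# The fifth star: link to an external dataset (here DBpedia) for context.
g.add((galway, OWL.sameAs, URIRef("http://dbpedia.org/resource/Galway")))

print(g.serialize(format="turtle"))
```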
The delete function in most operating systems simply marks the space occupied by the file as reusable, without immediately removing any of its contents. This makes it possible to recover the data.
The Gutmann method is an algorithm for securely erasing the contents of computer hard drives.
Peter Gutmann is a professor at the University of Auckland in New Zealand, specialising in network security.
He has published a number of well-known papers including:
The Vista Content Protection specification could very well constitute the longest suicide note in history.
One standard way to recover data that has been overwritten on a hard drive is to capture and process the analog signal obtained from the drive’s read/write head prior to this analog signal being digitized. This analog signal will be close to an ideal digital signal, but the differences will reveal important information. By calculating the ideal digital signal and then subtracting it from the actual analog signal, it is possible to amplify the signal remaining after subtraction and use it to determine what had previously been written on the disk.
Analog signal: +11.1 -8.9 +9.1 -11.1 +10.9 -9.1
Ideal digital signal: +10.0 -10.0 +10.0 -10.0 +10.0 -10.0
Difference: +1.1 +1.1 -0.9 -1.1 +0.9 +0.9
Previous signal: +11 +11 -9 -11 +9 +9
This can then be done again to see the previous data written:
Recovered signal: +11 +11 -9 -11 +9 +9
Ideal digital signal: +10.0 +10.0 -10.0 -10.0 +10.0 +10.0
Difference: +1 +1 +1 -1 -1 -1
Previous signal: +10 +10 +10 -10 -10 -10
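The subtraction and amplification step from the first table above can be reproduced in a few lines; this is only the arithmetic of the worked example, not a real recovery tool.

```python
# Reproduce the worked example: subtract the ideal digital signal from the
# analog reading, then amplify the residue to estimate the previous layer.
analog = [11.1, -8.9, 9.1, -11.1, 10.9, -9.1]
ideal  = [10.0, -10.0, 10.0, -10.0, 10.0, -10.0]

difference = [a - i for a, i in zip(analog, ideal)]
previous   = [round(d * 10) for d in difference]   # scale residue back up

print(difference)   # [1.1, 1.1, -0.9, -1.1, 0.9, 0.9] (approximately)
print(previous)     # [11, 11, -9, -11, 9, 9]
```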
However, even when overwriting the disk repeatedly with random data, it is theoretically possible to recover the previous signal. The permittivity of a medium changes with the frequency of the magnetic field. This means that a lower frequency field will penetrate deeper into the magnetic material on the drive than a high frequency one. So a low frequency signal will, in theory, still be detectable even after it has been overwritten hundreds of times by a high frequency signal.
The patterns used are designed to apply alternating magnetic fields of various frequencies and various phases to the drive surface, and thereby approximate degaussing the material below the surface of the drive.
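As a grossly simplified sketch of the multi-pass overwrite idea (random passes only, not Gutmann’s actual 35 patterns, and with no protection against copies kept by the filesystem or an SSD’s wear levelling):

```python
# Simplified multi-pass overwrite of a single file; not the real Gutmann
# pattern set, and journaling filesystems or SSDs may keep copies elsewhere.
import os

def overwrite_file(path, passes=3):
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))   # random pass instead of fixed patterns
            f.flush()
            os.fsync(f.fileno())        # push this pass out to the device
    os.remove(path)

# overwrite_file("secret.txt")  # hypothetical usage
```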
Note: Where release consistency affects all shared variables, entry consistency affects only those shared variables associated with a synchronization variable.
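A single-process analogy may help picture the difference: under entry consistency, each shared item is guarded by its own synchronization variable, so acquiring one lock says nothing about the others. This is only an analogy, not a distributed implementation.

```python
# Single-process analogy of entry consistency: every shared variable has its
# own synchronization variable (a lock); acquiring it only protects that item.
import threading

class Guarded:
    """A shared variable paired with its own synchronization variable."""
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()   # the associated synchronization variable

x = Guarded(0)
y = Guarded(0)

def update_x():
    with x.lock:          # entering x's critical section says nothing about y
        x.value += 1

threading.Thread(target=update_x).start()
```

Under release consistency, by contrast, a single acquire/release pair would cover all shared variables at once.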
Question: What would be a convenient way of making entry consistency more or less transparent to programmers?