Data story

Can BigData reduce fraud?

Too much data can pose security concerns, and it can become overwhelming to manage. Verizon, in its latest Data Breach Investigations Report, finds most organizations get overwhelmed with too much data.

Chris Novak, who works in Verizon’s investigative response unit, says most organizations struggle to collect the right data and properly store it. “They don’t necessarily know where they have data … and how it’s being handled,” he says.

Like all organizations, financial institutions struggle with data. But many banks outsource data management to help ensure the data they collect is, in theory, protected and properly managed.

Can Data Reduce Fraud?

Here’s the question: Could institutions take advantage of their data to support fraud prevention? Experts at credit reporting bureau Experian say yes.

Experian is pushing ID theft management in a new way: to help banks prevent and detect fraud. Keir Breitenfeld, director of product management within Experian’s decision analytics team, says banking institutions are doing a better job of capturing data.

“Institutions are saying, ‘We have to have a more enterprise-level approach,’” he says. “They know they need to warehouse data, so they can bring channels together, from a cost perspective and customer experience perspective.”

But the residual effect is that banks have a lot more data at their fingertips to track accountholders, rather than just accounts, for fraud.

The ability to capture data and warehouse it has improved so much that credit bureaus now have the ability to provide customized scores for individual accountholders.

So, the more data banks can leverage about new accountholders, in particular, the better their chances are of detecting fraud.

If banks routinely compare the data they collect about customers with information credit bureaus store, they could improve their fraud detection rates on new accounts by 20 percent or more, Breitenfeld says.

“If you can monitor accounts after they are opened, you can better detect fraud.”

Gartner hype cycle reviewed

For fun only, Gartner’s “hype cycle” reviewed:

OrientDB 1.0 has been released

Congratulations to the OrientDB team: release 1.0 is finally out!

List of major changes
  • new Multi-Master Replication architecture
  • new Object Database interface that uses run-time enhancement. It now handles lazy loading and is lighter and faster than before
  • new OTraverse class to traverse graphs via Java API using a stack-free approach
  • Data segments: added support for multiple ones and create/drop commands
  • new ODocument.undo() to revert local changes
  • new Server Side Scripting support
  • Query: new context variables
  • Console: new check database command
  • Studio: improved Graph management
  • Improved OSGi support
  • Fixed more than 40 bugs

Apache HBase 0.94 has been released

Apache HBase 0.94.0 has been released and can be downloaded here

This is the first major release since the January 22nd HBase 0.92 release.

The HBase 0.94.0 release focuses mainly on performance enhancements and new features, and it also includes several major bug fixes.

Performance Related JIRAs

Below are a few of the important performance related JIRAs:

  • Read Caching improvements: HDFS stores data in one block file and its corresponding metadata (checksum) in another block file. This means that every read into the HBase block cache may consume up to two disk ops, one to the data file and one to the checksum file. HBASE-5074: “Support checksums in HBase block cache” adds a block-level checksum in the HFile itself in order to avoid one disk op, boosting read performance. This feature is enabled by default.
  • Seek optimizations: Until now, if there were several StoreFiles for a column family in a region, HBase would seek in each of these files and merge the results, even if the row/column being looked up was in the most recent file. HBASE-4465: “Lazy Seek optimization of StoreFile Scanners” optimizes scanner reads to read the most recent StoreFile first by lazily seeking the StoreFiles. This is achieved by introducing a fake KeyValue with its timestamp equal to the maximum timestamp present in the particular StoreFile. Thus, a disk seek is avoided until the KeyValueScanner for a StoreFile bubbles up to the top of the heap, implying a need to do a real read operation. This should provide a significant read performance boost, especially for IncrementColumnValue operations, where only the latest value matters. This feature is enabled by default.
  • Write to WAL optimizations: HBase write throughput is upper-bounded by the write rate of the WAL, where the log is replicated to a number of datanodes, depending on the replication factor. HBASE-4608: “HLog Compression” adds a custom dictionary-based compression of HLogs for faster replication on HDFS datanodes, thus improving the overall write rate for HBase. This feature is considered experimental and is off by default.
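To make the inline-checksum idea from the read caching item more concrete, here is a minimal Python sketch. It is not HBase's actual HFile layout: the function names and the 4-byte CRC32 trailer are assumptions chosen for illustration. The point is only that when a block carries its own checksum, one read returns both the data and what is needed to verify it.

```python
import struct
import zlib

CHECKSUM_BYTES = 4  # hypothetical trailer size: one big-endian CRC32


def pack_block(payload: bytes) -> bytes:
    """Return the payload followed by its CRC32 trailer."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack(">I", crc)


def block_is_valid(block: bytes) -> bool:
    """Check the trailer against a freshly computed CRC32 of the payload."""
    if len(block) < CHECKSUM_BYTES:
        return False
    payload, trailer = block[:-CHECKSUM_BYTES], block[-CHECKSUM_BYTES:]
    (expected,) = struct.unpack(">I", trailer)
    return (zlib.crc32(payload) & 0xFFFFFFFF) == expected


def unpack_block(block: bytes) -> bytes:
    """Return the verified payload, or raise ValueError on corruption."""
    if not block_is_valid(block):
        raise ValueError("checksum mismatch or truncated block")
    return block[:-CHECKSUM_BYTES]
```

A reader would call `unpack_block` on the bytes obtained from a single read, with no second lookup in a separate checksum file.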

New Feature Related JIRAs

Here is a list of some of the important JIRAs related to adding new features:

  • More powerful first aid box: The previous HBck tool did a good job of fixing inconsistencies related to region assignments but lacked some basic features like fixing orphaned regions, region holes, overlapping regions, etc. HBASE-5128: “Uber hbck” adds these missing features to the first aid box.
  • Simplified Region Sizing: Deciding on a region size is always tricky, as it depends on a number of dynamic parameters such as data size, cluster size, workload, etc. HBASE-4365: “Heuristic for Region size” adds a heuristic that increases the split size threshold of a table region as the data grows, thus limiting the number of region splits.
  • Smarter transaction semantics: Though HBase supports single-row transactions, if there are a number of updates (Puts/Deletes) to an individual row, it will lock the row for each of these operations. HBASE-3584: “Atomic Put & Delete in a single transaction” enhances the HBase single-row locking semantics by allowing Puts and Deletes on a row to be executed in a single call. This feature is on by default.
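As a rough illustration of the region sizing heuristic, the Python sketch below grows the split threshold with the number of regions a table already has, capped by a configured maximum. The quadratic growth rule and the default sizes (a 128 MB flush size and a 10 GB cap) are assumptions made for this example; the release notes summarized above do not state the exact formula.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def split_threshold(region_count: int,
                    flush_size: int = 128 * MB,
                    max_file_size: int = 10 * GB) -> int:
    """Return the store size in bytes at which a region would split.

    Assumed rule: the threshold grows with the square of the number of
    regions, and never exceeds max_file_size.
    """
    if region_count < 1:
        raise ValueError("region_count must be at least 1")
    return min(max_file_size, flush_size * region_count ** 2)
```

Under these assumptions a new table splits early (at 128 MB with one region), which spreads load quickly, while a table with many regions splits only at the configured cap.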

This major release has a number of new features and bug fixes: a total of 397 resolved JIRAs, including 140 enhancements and 180 bug fixes. It is compatible with 0.92. This opens up a window of opportunity to backport some of the cool features to CDH4, which is based on the 0.92 branch.

 

 

How to reduce the cost of data?

“How can you control the cost of data growth?”

Is your company an “Information Hoarder”?  If so, answer the five questions below and you’ll be able to reduce the cost of your data.

1 – Do you need the data that you have?

The first step is to determine whether you need all of the data in your production systems. You might be surprised by the answer. Often, up to 85% of data in production systems is old and not used. Information lifecycle management software will manage the lifecycle (i.e., the expiry date) of your information, as well as archive it to lower-cost storage alternatives. The cost savings are immediate and significant.

2 – Do you know the expiry date for your data?

All data has an expiry date.  Or at least it should.  Most organizations do not determine the expiry date for their data.  The result?  They keep data indefinitely.  There’s a simple rule on the TV show Hoarders – if you haven’t worn an item of clothing in the past year, throw it away.  The same rule applies to data.  If you haven’t accessed it recently, consider deleting or archiving it.
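A retention rule of this kind can be sketched in a few lines of Python. The one-year archive threshold follows the rule of thumb above; the seven-year deletion threshold and the function name are assumptions for illustration, and real retention periods depend on legal and business requirements.

```python
from datetime import date


def classify(last_accessed: date, today: date,
             archive_after_days: int = 365,
             delete_after_days: int = 7 * 365) -> str:
    """Return 'keep', 'archive' or 'delete' based on days since last access."""
    age_days = (today - last_accessed).days
    if age_days >= delete_after_days:
        return "delete"
    if age_days >= archive_after_days:
        return "archive"
    return "keep"
```

Running such a rule over the last-access dates of records gives a first estimate of how much production data is a candidate for cheaper storage.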

3 – What is the cost of managing your data?

Every database has a cost.  Do you know the cost of yours?  There are several published performance benchmarks for relational database software.  The performance of the database is directly correlated to cost.  Migrating to another database may be easier than you think, as some vendors have invested in portability and migration capabilities.

4 – The quality of data carries a cost

The poorer the quality, the higher the cost. Seems like an obvious relationship, doesn’t it? But it is often overlooked. Cleaning address data, for example, reduces downstream costs such as returned mail and can qualify you for postage discounts. Aside from the direct cost savings, consider the indirect cost of inefficiency. How much time do your employees spend investigating and correcting data errors?

5 – Unifying fragmented data reduces cost

In the typical organization, many data entities are fragmented across dozens of systems.  Customers, products, locations, suppliers, to name just a few.  The fragmentation drives a cost that is not easy to detect.  It’s the cost of your employees manually searching for data in multiple systems, the cost of duplicating data entry in multiple systems, and the cost of the inevitable errors that result from fragmentation and duplication of effort.  Unifying data into one master data system often has far reaching cost savings implications.

There are many ways to reduce the cost of your data.  If you are looking for “the low hanging fruit” and a fast ROI, then take a look at the data you have today and determine whether you need all of it.  Likely the answer will be “no”, and information lifecycle management software will help you realize an immediate cost savings.

 

 

Great article from David Corrigan, posted on his blog:

http://corrigandavid.wordpress.com/2012/05/11/questions-from-the-market-how-can-i-reduce-the-cost-of-data/

Neo4j 1.8.M02 has been released

Neo4j 1.8.M02 – The Strong, Silent Type

Our longer arcs of development are chunked up into short-term stories, which arrive with notable features at differing points. Uniquely, at this 1.8.M02 merge point, the changes are all of the strong, silent type: under-the-hood improvements, stage-setting additions, or simple issue-correcting.

This is a solid, trustworthy release for anyone staying up-to-date with the leading edge of development. Go download Neo4j 1.8.M02 to keep up. If you’re on Heroku, you can simply specify the version when provisioning a new database:
# heroku addons:add neo4j --neo4j-version 1.8.M02
More information available here

Definition of the term “Kalman filter”

According to Wikipedia:

The Kalman filter, also known as linear quadratic estimation (LQE), is an algorithm which uses a series of measurements observed over time, containing noise (random variations) and other inaccuracies, and produces estimates of unknown variables that tend to be more precise than those that would be based on a single measurement alone. More formally, the Kalman filter operates recursively on streams of noisy input data to produce a statistically optimal estimate of the underlying system state. The filter is named for Rudolf (Rudy) E. Kálmán, one of the primary developers of its theory.

The Kalman filter has numerous applications in technology. A common application is for guidance, navigation and control of vehicles, particularly aircraft and spacecraft. Furthermore, the Kalman filter is a widely applied concept in time series econometrics.

The algorithm works in a two-step process: in the prediction step, the Kalman filter produces estimates of the current state variables, along with their uncertainties. Once the outcome of the next measurement (necessarily corrupted with some amount of error, including random noise) is observed, these estimates are updated using a weighted average, with more weight being given to estimates with higher certainty. Because of the algorithm’s recursive nature, it can run in real time using only the present input measurements and the previously calculated state; no additional past information is required.
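The two-step process can be sketched for the simplest case, a single scalar state that is assumed to stay roughly constant (a random-walk model). This is a minimal Python illustration under that assumption, not a general implementation; the function name, parameter names and default initial values are chosen for the example.

```python
def kalman_1d(measurements, process_var, measurement_var,
              initial_estimate=0.0, initial_var=1.0):
    """Scalar Kalman filter for a state modeled as x_k = x_{k-1} + noise.

    Returns the list of state estimates, one per measurement.
    """
    estimate, var = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Prediction step: the state is assumed unchanged, so the estimate
        # carries over and only its uncertainty grows.
        var += process_var
        # Update step: blend prediction and measurement. The gain is the
        # weight given to the measurement; it is larger when the prediction
        # is more uncertain relative to the measurement noise.
        gain = var / (var + measurement_var)
        estimate += gain * (z - estimate)
        var *= (1.0 - gain)
        estimates.append(estimate)
    return estimates
```

Only the previous estimate and its variance are kept between iterations, which is the recursive property described above: no history of past measurements is needed.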

From a theoretical standpoint, the main assumption of the Kalman filter is that the underlying system is a linear dynamical system and that all error terms and measurements have a Gaussian distribution (often a multivariate Gaussian distribution). Extensions and generalizations to the method have also been developed, such as the Extended Kalman Filter and the Unscented Kalman filter which work on nonlinear systems. The underlying model is a Bayesian model similar to a hidden Markov model but where the state space of the latent variables is continuous and where all latent and observed variables have Gaussian distributions.

 

 

Hadoop in a Microsoft environment

Microsoft announced a partnership with Hortonworks last year to bring Hadoop to Windows Server and Windows Azure. Microsoft’s vision revolves around making Hadoop and related Big Data tools trivially accessible to the regular IT end user, and to this end it integrates with SQL Server Analysis and Reporting Services as well as Excel PowerPivot.

Here are some resources to use Hadoop in a Microsoft environment

 

White House is Spending Big Money on Big Data

Story from forbes.com

It’s typical in an election year to see an administration spend money on new initiatives. A new cost-cutting initiative unveiled back in March has generally gone unnoticed by the mainstream technology media. Called the “Big Data Research and Development Initiative”, the program is focused on improving the U.S. Federal government’s ability to extract knowledge and insights from large and complex collections of digital data. The initiative promises to help solve some of the Nation’s most pressing challenges.

 

The program includes several federal agencies, including NSF, HHS/NIH, DOE, DOD, DARPA and USGS, which pledge more than $200 million in new commitments that they promise will greatly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data.

In a statement, Dr. John P. Holdren, Assistant to the President and Director of the White House Office of Science and Technology Policy, said: “In the same way that past Federal investments in information-technology R&D led to dramatic advances in supercomputing and the creation of the Internet, the initiative we are launching today promises to transform our ability to use Big Data for scientific discovery, environmental and biomedical research, education, and national security.”

 

 

One of the more interesting aspects of this project is the use of public cloud infrastructure, as in cloud computing services provided by private industry. Confusing, I know. A great example of this plan in action is the National Institutes of Health, which announced that the world’s largest set of data on human genetic variation – produced by the international 1000 Genomes Project – is now freely available on the Amazon Web Services (AWS) cloud. At 200 terabytes – the equivalent of 16 million file cabinets filled with text, or more than 30,000 standard DVDs – the current 1000 Genomes Project data set is a prime example of big data, where data sets become so massive that few researchers have the computing power to make best use of them. AWS is storing the 1000 Genomes Project as a publicly available data set for free, and researchers will pay only for the computing services that they use.

According to a recent article on genengnews.com, this is part of a larger strategy to cut the number of federal data centers from the current 3,133 by “at least 1,200” by 2015, a roughly 40% cutback with $5 billion in savings. This also extends the work started with the administration’s Cloud First Policy, outlined last year as part of the White House’s Federal Cloud Computing Strategy.

In a world that is more dependent on data than ever before, the stakes are high and so is the money. It will be interesting to follow this initiative over the coming months.

 

 

 

Reverse engineering the origin of NoSQL

For fun only!

 
