European Commission will adopt measures for an open data strategy

The European Commission will adopt an Open Data Strategy on 29 November: a set of measures aimed at increasing government transparency and creating a €32 billion a year market for public data. The measures include a modification of the existing Directive on the re-use of public sector information and deployment measures such as the creation of open data portals at European level.

The Open Data Strategy will be adopted by the European Commission on 29 November. The strategy proposed by the Commission will consist of a package of measures, including regulatory measures such as a modification of the existing Directive on the re-use of public sector information, and deployment measures such as the creation of open data portals at European level. The Strategy was already proposed in November 2010.


5 billion pages available through Amazon’s S3 service

Common Crawl produces and maintains a repository of web crawl data that is openly accessible to everyone. The crawl currently covers 5 billion pages and the repository includes valuable metadata. The crawl data is stored on Amazon’s S3 service, allowing it to be bulk downloaded as well as directly accessed for map-reduce processing in EC2. This makes wholesale extraction, transformation, and analysis of web data cheap and easy. Small startups or even individuals can now access high-quality crawl data that was previously only available to large search engine corporations.

For more information, please see the following pages: Processing Pipeline and Accessing the Data.

Additional free and public data sets

Following the first post on free and public data sets, I am still wondering what the easiest way would be to maintain such a list, including details such as format, licensing, update frequency, last update, number of records, etc.

But while awaiting a great solution (which might already exist!), I am still sharing:

You may like the data hub initiative:

Looking for free and public data sets?

Here is a common need: building a data sample or stress-testing your solution against a high volume of data. This post delivers some links to help you find more resources on data.


General reference and data source index:


The potential of big data

Big data has the potential to become the next frontier for innovation, competition and profit.

Great idea from for the following study and graph:



MongoDB, OpenStreetMap and a Little Demo

Original article from–OpenStreetMap-and-a-Little-Demo

I had an idea. Before I realized how bad the idea was, I did stumble upon the open geographic database provided by OpenStreetMap. I have a pretty good feeling that everyone else has known about this for years, but for me it was like my first Christmas.

I was interested in finding worldwide points of interest, and I quickly found the OpenStreetMap database. The complete database is available as a 16GB compressed XML file (which comes in at around 250GB uncompressed), which is updated daily by generous contributors. Thankfully, you can find mirrors that have partitioned the data in some meaningful way (like by major cities).

For our needs, the data is made up of a few important elements. The first is a node, which has a longitude, latitude and an id. A node has zero or more tag child-elements, which are key-value pairs of metadata. There’s also a way element which references multiple node elements. You see, in my naive mind a point of interest like a building would be represented by a single node. However, from a mapping point of view, it’s really a polygon made up of multiple nodes. A way can also have zero or more tags.
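To make the node, tag and way elements concrete, here is a minimal Python sketch that reads them with the standard library. The sample document is hand-written for illustration (the ids, coordinates and tag values are made up), and the function names are hypothetical; this is not the C# importer mentioned in this post.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the OSM XML layout. All ids, coordinates and
# tag values below are made up for illustration.
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="53.0" lon="-9.0">
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="53.0" lon="-9.1"/>
  <node id="3" lat="53.1" lon="-9.1"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <nd ref="3"/>
    <tag k="building" v="yes"/>
  </way>
</osm>"""


def read_tags(element):
    """Collect the zero or more <tag k=... v=...> children into a dict."""
    return {tag.get("k"): tag.get("v") for tag in element.findall("tag")}


def parse_osm(xml_text):
    """Return (nodes, ways) dictionaries keyed by element id."""
    root = ET.fromstring(xml_text)
    nodes = {}
    for node in root.findall("node"):
        nodes[node.get("id")] = {
            "lat": float(node.get("lat")),
            "lon": float(node.get("lon")),
            "tags": read_tags(node),
        }
    ways = {}
    for way in root.findall("way"):
        ways[way.get("id")] = {
            # A way only references nodes by id through its <nd ref=...> children.
            "refs": [nd.get("ref") for nd in way.findall("nd")],
            "tags": read_tags(way),
        }
    return nodes, ways


nodes, ways = parse_osm(SAMPLE_OSM)
```

Loading the whole document at once only works for small extracts. For anything close to the full dump, a streaming parser such as `xml.etree.ElementTree.iterparse` would be the more realistic choice.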

Now ever since I wrote the MongoDB Geospatial tutorial, I’ve had an itch to try more real-world stuff with MongoDB’s geo capabilities. This database seemed like an ideal candidate. The first thing I did was download a bunch of city-dumps from a mirror and started writing a C# importer (github). I wasn’t actually interested in polygons, so I calculated the centroid of any way and converted it into a node. Most of the time the result was quite good. The importer’s readme has more information.
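Collapsing a way into a single point can be sketched as follows. This is an assumption about the approach, not the importer's actual code: it takes the plain average of the way's node coordinates, which is simpler than an area-weighted polygon centroid and may differ from what the importer does. The function name and the sample square are hypothetical.

```python
def way_centroid(coords):
    """Average a list of (lon, lat) tuples into a single (lon, lat) point.

    Assumption: a plain vertex average is good enough for small shapes
    such as buildings. It is not an area-weighted polygon centroid.
    """
    if not coords:
        raise ValueError("a way needs at least one node")
    # Closed ways list their first node again at the end; drop the
    # repeat so that corner is not counted twice.
    if len(coords) > 1 and coords[0] == coords[-1]:
        coords = coords[:-1]
    lon = sum(c[0] for c in coords) / len(coords)
    lat = sum(c[1] for c in coords) / len(coords)
    return (lon, lat)


# A made-up closed square, given as (lon, lat) pairs.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (0.0, 0.0)]
center = way_centroid(square)
```

For irregular or very large polygons the vertex average can land off-center, which may be one reason the result is only good "most of the time".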

Next, I wrote a little Sinatra app and did the obvious thing using the Google Maps API. You can also find the source for this on github.

I’ve put up a demo at

I also extracted the data for each city and made it available, so that you can play with it yourself. It’s available (you can read about OpenStreetMap’s licensing here). The data is meant to be easily imported into MongoDB using its mongoimport command. Download a city, extract it, and do the following:

mongoimport -d pots -c tags PATH_TO_TAGS.json
mongoimport -d pots -c nodes PATH_TO_COUNTRY.json

If mongoimport isn’t in your PATH, you’ll need to use the full path. Also, the tags.json file is the same for all cities – so you only need to import it once. Finally, connect to MongoDB, type use pots and then create a 2D index on the loc field: db.nodes.ensureIndex({loc: '2d'})
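Once the 2D index exists, a nearest-first lookup on the loc field might be built like this. It is a sketch under assumptions: that loc holds a [longitude, latitude] pair (the usual order for MongoDB's legacy coordinate pairs, but not confirmed for this data set), and the helper name is hypothetical. The code only builds the filter document; with a driver such as pymongo it would be passed to something like `db.nodes.find(query).limit(10)`.

```python
def near_query(lon, lat):
    """Build a filter that sorts documents by distance from (lon, lat).

    Assumes the indexed field is named "loc" and stores [lon, lat].
    """
    if not -180 <= lon <= 180:
        raise ValueError("longitude must be between -180 and 180")
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be between -90 and 90")
    # $near needs a geospatial index on the field, such as the 2D index
    # created above, and returns the closest documents first.
    return {"loc": {"$near": [lon, lat]}}


# Made-up coordinates, used only to show the shape of the filter.
query = near_query(-9.05, 53.27)
```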

Different cities have different amounts of data. I left everything in and you can see there’s quite a bit of information. Given that MongoDB supports composite indexes, it’d be trivial to provide additional node filtering.

And that’s why, when people ask me “What did you do this weekend?”, I can say “I parsed a 250GB XML file” (because, yes, I did download it and I did *try* to import it).



First came the hardware, second the software and third is the age of data

Value was first in the hardware, then in the software, and now it seems “The age of data is upon us”, as Redmonk’s Stephen O’Grady declared at the Open Source Business Conference.

A great article, available here, summarizes O’Grady’s words:

Mainly, it summarizes the timeline as follows:

  1. The first stage, epitomized by IBM, held that the money was in the hardware and software was just an adjunct.
  2. Stage two, fired off by Microsoft, contended the money is in the software.
  3. Google epitomizes the third stage, where the money is not in the software, but software is a differentiator. “Google came up at a time when a lot of folks were building the Internet on the backs of some very expensive hardware and software. Google uses commodity hardware, free — meaning no-cost — software, and focuses on what it can do better than its competitors with that software.”

Wondering what the fourth stage could be? It might be Facebook and Twitter. “Now, software is not even differentiating; it’s the value of the data. Facebook and Twitter monetize their data in different ways.”

Watchdog blackmailed by hacker: names, home addresses and passwords leaked

South Korea’s financial watchdog launched an investigation into the leak of the personal information of 420,000 customers from South Korea’s Hyundai Capital, the consumer finance unit of Hyundai Motor Group.

The company, whose president returned to South Korea from an overseas trip earlier in the day, also began its own probe into the leak, which prompted the firm Friday to ask its 2 million customers to change their passwords to prevent further leaks.

The Seoul-based company, which specializes in personal loans, home mortgages and auto financing, said this week it was blackmailed by an unnamed hacker demanding money in return for not releasing the data.

The company, which stressed that key data required for financial transactions was not leaked, said names and home addresses of as many as 420,000 of its roughly 2 million customers were stolen. It remains unconfirmed whether their mobile phone numbers or e-mail addresses were disclosed as well.

“Investigators will be dispatched to look into the cause of the breach, the possibility of additional leaks and the contents of stolen information,” an official said.

Police said Sunday that a hacker likely used servers in the Philippines and Brazil.


More information from Reuters

Data 2.0 conference counting 47 data-centric startups

Here is an incomplete list (as of March 18th 2011) of the data-centric companies that will be participating in the Data 2.0 Conference as speakers, sponsors, or pitch day data startups. I’ve excluded later-stage data companies like Google or comScore since it is difficult to define a large company as being “data-centric”.


State of the Linking Open Data cloud

The State of the LOD Cloud document has been updated to version 0.2 as of 03/28/2011.

It provides updated statistics about the structure and content of the LOD cloud. It also analyzes the extent to which LOD data sources implement nine best practices that are either recommended by W3C or have emerged within the LOD community.

Linked Open Data star scheme by example

Tim Berners-Lee suggested a 5-star deployment scheme for Linked Open Data and Ed Summers provided a nice rendering of it. In the following, examples are given for each level. The example data used throughout is ‘the temperature forecast for Galway, Ireland for the next 3 days’:

★ make your stuff available on the Web (whatever format) under an open license
★★ make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to identify things, so that people can point at your stuff
★★★★★ link your data to other data to provide context