Google will soon be worth more than Apple

Google is now worth more than Microsoft and will soon be worth more than Apple

GOOG:US  858.800 USD  +13.080  (+1.55%)

Google Glass is getting hyped and trashed all at the same time, and it’s not even here yet. Meanwhile, Android’s marketplace dominance and Google’s nicely executed moves into mobile ads are contributing to the valuation. And of course, Microsoft is suffering as its tablet/smartphone offerings flounder and the PC business it dominates shrinks.

It’s an interesting exercise to think through which company could be the next to reach the quarter-trillion-dollar valuation mark that both Google and Microsoft recently shot past. Could Oracle, Cisco, or Intel reach that plateau in the far future (or perhaps not too far into the future, as in later this year)? Each has been worth that much before.

LevelDB, a fast and lightweight key/value database library by Google


https://code.google.com/p/leveldb/

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

Features

  • Keys and values are arbitrary byte arrays.
  • Data is stored sorted by key.
  • Callers can provide a custom comparison function to override the sort order.
  • The basic operations are Put(key,value), Get(key), and Delete(key); a short sketch follows this list.
  • Multiple changes can be made in one atomic batch.
  • Users can create a transient snapshot to get a consistent view of data.
  • Forward and backward iteration is supported over the data.
  • Data is automatically compressed using the Snappy compression library.
  • External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
  • Detailed documentation about how to use the library is included with the source code.
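
To make these features concrete, here is a minimal sketch using the library’s public C++ API; the database path, keys, and values are arbitrary placeholders, and error handling is reduced to a single assert:

   #include <cassert>
   #include <iostream>
   #include <string>

   #include "leveldb/db.h"
   #include "leveldb/write_batch.h"

   int main() {
     // Open (or create) a database at an arbitrary path.
     leveldb::DB* db;
     leveldb::Options options;
     options.create_if_missing = true;
     leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
     assert(status.ok());

     // Basic operations: Put(key,value), Get(key), Delete(key).
     db->Put(leveldb::WriteOptions(), "key1", "value1");
     std::string value;
     status = db->Get(leveldb::ReadOptions(), "key1", &value);
     db->Delete(leveldb::WriteOptions(), "key1");

     // Multiple changes applied in one atomic batch.
     leveldb::WriteBatch batch;
     batch.Put("key2", "value2");
     batch.Put("key3", "value3");
     db->Write(leveldb::WriteOptions(), &batch);

     // A transient snapshot gives a consistent view of the data.
     leveldb::ReadOptions snapshot_opts;
     snapshot_opts.snapshot = db->GetSnapshot();
     // ...reads issued with snapshot_opts see the state as of GetSnapshot()...
     db->ReleaseSnapshot(snapshot_opts.snapshot);

     // Forward iteration; keys always come back in sorted order.
     leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
     for (it->SeekToFirst(); it->Valid(); it->Next()) {
       std::cout << it->key().ToString() << " -> " << it->value().ToString() << "\n";
     }
     delete it;

     delete db;
     return 0;
   }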

Limitations

  • This is not a SQL database. It does not have a relational data model, it does not support SQL queries, and it has no support for indexes.
  • Only a single process (possibly multi-threaded) can access a particular database at a time.
  • There is no client-server support built in to the library. An application that needs such support will have to wrap its own server around the library.

Performance

Here is a performance report (with explanations) from the run of the included db_bench program. The results are somewhat noisy, but should be enough to get a ballpark performance estimate.

Setup

We use a database with a million entries. Each entry has a 16-byte key and a 100-byte value. Values used by the benchmark compress to about half their original size.

   LevelDB:    version 1.1
   Date:       Sun May  1 12:11:26 2011
   CPU:        4 x Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
   CPUCache:   4096 KB
   Keys:       16 bytes each
   Values:     100 bytes each (50 bytes after compression)
   Entries:    1000000
   Raw Size:   110.6 MB (estimated)
   File Size:  62.9 MB (estimated)
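
A quick back-of-the-envelope check of those estimates: 1,000,000 entries × (16 + 100) bytes ≈ 116,000,000 bytes, i.e. the 110.6 MB (in 2^20-byte units) reported above; with values compressing to roughly 50 bytes each, 1,000,000 × (16 + 50) bytes ≈ 66,000,000 bytes, matching the estimated 62.9 MB file size.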

 

Write performance

The “fill” benchmarks create a brand new database, in either sequential, or random order. The “fillsync” benchmark flushes data from the operating system to the disk after every operation; the other write operations leave the data sitting in the operating system buffer cache for a while. The “overwrite” benchmark does random writes that update existing keys in the database.

 

   fillseq      :       1.765 micros/op;   62.7 MB/s
   fillsync     :     268.409 micros/op;    0.4 MB/s (10000 ops)
   fillrandom   :       2.460 micros/op;   45.0 MB/s
   overwrite    :       2.380 micros/op;   46.5 MB/s

 

Each “op” above corresponds to a write of a single key/value pair. I.e., a random write benchmark goes at approximately 400,000 writes per second.
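
As a sanity check on those figures: each op writes a 16-byte key plus a 100-byte value, so fillseq at 1.765 micros/op moves about 116 bytes / 1.765 µs ≈ 65.7 million bytes per second, i.e. the 62.7 MB/s reported above in 2^20-byte units, and fillrandom at 2.460 micros/op works out to 1 / 2.46 µs ≈ 406,000 writes per second, the “approximately 400,000” mentioned above.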

Each “fillsync” operation costs much less (0.3 millisecond) than a disk seek (typically 10 milliseconds). We suspect that this is because the hard disk itself is buffering the update in its memory and responding before the data has been written to the platter. This may or may not be safe based on whether or not the hard disk has enough power to save its memory in the event of a power failure.

Read performance

We list the performance of reading sequentially in both the forward and reverse direction, and also the performance of a random lookup. Note that the database created by the benchmark is quite small. Therefore the report characterizes the performance of leveldb when the working set fits in memory. The cost of reading a piece of data that is not present in the operating system buffer cache will be dominated by the one or two disk seeks needed to fetch the data from disk. Write performance will be mostly unaffected by whether or not the working set fits in memory.
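
To put those seeks in perspective: at a typical 10 ms per seek, a read that misses the operating system cache and needs one or two seeks is limited to roughly 50–100 reads per second, several orders of magnitude below the in-memory figures that follow.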

 

   readrandom   :      16.677 micros/op;  (approximately 60,000 reads per second)
   readseq      :       0.476 micros/op;  232.3 MB/s
   readreverse  :       0.724 micros/op;  152.9 MB/s

 

LevelDB compacts its underlying storage data in the background to improve read performance. The results listed above were done immediately after a lot of random writes. The results after compactions (which are usually triggered automatically) are better.

 

   readrandom   :      11.602 micros/op;  (approximately 85,000 reads per second)
   readseq      :       0.423 micros/op;  261.8 MB/s
   readreverse  :       0.663 micros/op;  166.9 MB/s

 

Some of the high cost of reads comes from repeated decompression of blocks read from disk. If we supply enough cache to leveldb so that it can hold the uncompressed blocks in memory, the read performance improves again:

   readrandom   :       9.775 micros/op;  (approximately 100,000 reads per second before compaction)
   readrandom   :       5.215 micros/op;  (approximately 190,000 reads per second after compaction)

MongoDB Sharding Visualizer

A nice visualization tool from 10gen Labs.

 

The visualizer is a Google Chrome app that provides an intuitive overview of a sharded cluster. This project provides an alternative to the printShardingStatus() utility function available in the MongoDB shell.

Features

The visualizer provides two different perspectives of the cluster’s state.

The collections view is a grid where each rectangle represents a collection. Each rectangle’s area is proportional to that collection’s size relative to the other collections in the cluster. Inside each rectangle a pie chart shows the distribution of that collection’s chunks over all the shards in the cluster.

The shards view is a bar graph where each bar represents a shard and each segment inside the shard represents a collection. The size of each segment is relative to the other collections on that shard.

Additionally, the slider underneath each view allows rewinding the state of the cluster, so you can select and view the state of the cluster at a specific time.

Installation

To install the plugin, download and unzip the source code from 10gen labs. In Google Chrome, go to Preferences > Extensions, enable Developer Mode, and click “Load unpacked extension…”. When prompted, select the “plugin” directory. Then, open a new tab in Chrome and navigate to the Apps page and launch the visualizer.

Feedback

The source code is available at https://github.com/10gen-labs/shard-viz.

90% of the existing data was created in the last two years

And a few more facts about BigData:

  • Wal-Mart handles more than a million customer transactions every hour.
  • Facebook hosts more than 50 billion photos.
  • Google has set up thousands of servers in huge warehouses to process searches.
  • 90% of the data that exists today was created within the last two years.

 

It is a pattern of growth driven by rapid and relentless trends such as the rise of social networks, video, and the Web. The explosive surge in data has created unprecedented challenges, particularly for organizations struggling to keep on top of their most critical missions while trying to get visibility into, and actionable business intelligence out of, that data.

That’s because big data causes big problems for companies, as well as for our economy and national security. Look no further than the financial crisis. Near the end of 2008, when the global financial system stood at the brink of collapse, the CEO of a global banking giant during a conference call with analysts was repeatedly asked to quantify the volume of mortgage-backed security holdings on the bank’s books. Despite the bank’s having spent a whopping $37 billion on IT operations over the previous 10 years, his best response was a sheepish: “I don’t have that information.”

Had regulators and big banks been able to accurately assess their exposure to subprime lending, we might have dampened the recession and saved the housing market from its biggest fall in 30 years.

Data generated through social media tools

Few figures about data generated through social media:

  • People send more than 144.8 billion email messages a day.
  • People and brands on Twitter send more than 340 million tweets a day.
  • People on Facebook share more than 684,000 bits of content a day.
  • People upload 72 hours (259,200 seconds) of new video to YouTube a minute.
  • Consumers spend $272,000 on Web shopping a day.
  • Google receives over 2 million search queries a minute.
  • Apple receives around 47,000 app downloads a minute.
  • Brands receive more than 34,000 Facebook ‘likes’ a minute.
  • Tumblr blog owners publish 27,000 new posts a minute.
  • Instagram photographers share 3,600 new photos a minute.
  • Flickr photographers upload 3,125 new photos a minute.
  • People perform over 2,000 Foursquare check-ins a minute.
  • Individuals and organizations launch 571 new websites a minute.
  • WordPress bloggers publish close to 350 new blog posts a minute.
  • The Mobile Web receives 217 new participants a minute.
    (The most updated numbers are available from the sites themselves.)

Google is definitely in the Big Data business

At the Google I/O developer conference in San Francisco last month, Google introduced a slew of new products, including Google Compute Engine. What wasn’t talked about was Google’s big data play. Google is definitely in the Big Data business, and it will be the 800-lb gorilla in the space.

Google has two major things going for it: 1) Google has an amazing infrastructure and network inside its core operations; 2) Google owns lots of data; let’s just say about 90% of the world’s data, including information and people.

Google’s infrastructure strength and direction with big data will shape not only applications but the enterprise business.  Why?  Because Google can provide infrastructure and data to anyone who wants it.

Watch out for Google, because soon it will be competing with everyone in the enterprise, including the big boys like EMC/Greenplum, IBM/Netezza, HP, and Microsoft.

David Floyer, Chief Technology Officer and head of research at Wikibon.org, wrote a great research paper today called Google and VMware Provide Virtualization of Hadoop and Big Data.   David addressed the Google (and VMware) angle in that piece.

If you’re interested in what Google is doing in Big Data you have to read the Wikibon research.

http://wikibon.org/wiki/v/Google_and_VMware_Provide_Virtualization_of_Hadoop_and_Big_Data

Google Compute Engine Review (source: Wikibon.org)

At the 2012 Google I/O conference, Google announced Compute Engine. It provides 700,000 virtual cores that users can spin up and tear down very rapidly for big data applications in general, and MapReduce and Hadoop in particular, all without setting up any data center infrastructure. The service works with the Google Cloud Storage service to provide the data; the data is encrypted at rest. It is a different service from Google App Engine, but complementary.

Compute Engine uses the KVM hypervisor on top of the Linux operating system. In discussions with Wikibon, Google pointed out the improvements that they had made to the open source KVM code to improve performance and security in a multi-core multi-thread Intel CPU environment. This allows virtual cores (one thread, one core) to be used as the building block for spinning up very efficient virtual machines.

To help with data ingestion, Google is offering access to the full resources of Google’s private networks. This enables ingested data to be moved across the network at very high speed and on a large scale, and allows replication to a specific data center. The location(s) can be defined, allowing compliance with country- or region-specific requirements to retain data in country. If the user can bring the data cost-effectively and with sufficient bandwidth to a Google Edge, the Google network services will take over.

The Google Hadoop service can utilize the MapR framework in a similar way to the MapR service for Amazon. This provides improved availability and management components. John Schroeder, CEO and founder of MapR, presented a demonstration running TeraSort on a 5,024-core Hadoop cluster with 1,256 disks on the Google Compute Engine service. It completed in 1 minute 20 seconds, at a total cost of $16. He compared this with a 1,460-physical-server environment with over 11,000 cores, which would take months to set up and would cost over $5 million.

As a demonstration this was impressive. Of course, TeraSort is a highly CPU-intensive workload that can be effectively parallelized and utilizes cores very efficiently. Other benchmark results that include more I/O-intensive use of Google Cloud Storage are necessary to confirm that the service is of universal value.

Wikibon also discussed whether Google would provide other data services to allow joining of corporate data with other Google-derived and provided datasets. Google indicated that they understood the potential value of this service and understood that other service providers were offering these services (e.g., Microsoft Azure). Wikibon expects that data services of this type will be introduced by Google.

There is no doubt that Google is seriously addressing the big data market, and wanting to compete seriously in the enterprise space. The Google network services, data replication services and encryption services reflect this drive to compete strongly with Amazon.

 

Article from John Furrier @ SiliconANGLE

Google's F1 – distributed scalable RDBMS

 

Google has moved its advertising services from MySQL to a new database, created in-house, called F1. The new system combines the best of NoSQL and SQL approaches.

 

According to Google Research, many of the services that are critical to Google’s ad business have historically been backed by MySQL, but Google has recently migrated several of these services to F1, a new RDBMS developed at Google. The team at Google Research says that F1 gives the benefits of NoSQL systems (scalability, fault tolerance, transparent sharding, and cost benefits) with the ease of use and transactional support of an RDBMS.

 

Google Research has developed F1 to provide relational database features such as a parallel SQL query engine and transactions on a highly distributed storage system that scales on standard hardware.

The store is dynamically sharded, supports replication across data centers while keeping transactions consistent, and can deal with data center outages without losing data. The downside of keeping transactions consistent is that F1 has higher write latencies than MySQL, so the team restructured the database schemas and redeveloped the applications to largely hide the effect of the increased latency from external users. Because F1 is distributed, Google says it scales easily and can support much higher throughput for batch workloads than a traditional database.

 

The database is sharded by customer, and the applications are optimized using shard awareness. When more power is needed, the database can grow by adding shards. The use of shards in this way has some drawbacks, including the difficulties of rebalancing shards, and the fact you can’t carry out cross-shard transactions or joins.

F1 has been co-developed with a new lower-level storage system called Spanner. This is described as a descendant of Google’s Bigtable and as the successor to Megastore, the transactional indexed record manager built by Google on top of its Bigtable NoSQL datastore. Spanner offers synchronous cross-datacenter replication (using Paxos, a consensus algorithm for fault-tolerant distributed systems). It provides snapshot reads, and does multiple reads followed by a single atomic write to ensure transaction consistency.

 

F1 is based on sharded Spanner servers and can handle parallel reads with SQL or MapReduce. Google has deployed it using five replicas spread across the country to survive regional disasters. Reads are much slower than in MySQL, taking between 5 and 10 ms.

 

The SQL parallel query engine was developed from scratch to hide the remote procedure call (RPC) latency and to allow parallel and batch execution. The latency is dealt with by using a single read phase and banning serial reads, though you can carry out asynchronous reads in parallel. Writes are buffered at the client, and sent as one RPC. Object relational mapping calls are also handled carefully to avoid those that are problematic to F1.

 

The research paper on F1, presented at SIGMOD 2012, cites serial reads and for loops that carry out one query per iteration as particular things to avoid, saying that while these hurt performance in all databases, they are disastrous on F1. In view of this, the client library is described as a very lightweight ORM – it doesn’t really have the “R”: it never uses relational or traversal joins, and all objects are loaded explicitly.

Unicode on its way to triumph

According to Google, Unicode is now used for 60% of web content, following an 800 percent increase in “market share” since 2006.

Read the full blog post from Google.

Google now offering docs for Takeout

Google, through its Data Liberation Blog, has just announced that it is now offering docs for Takeout.

So you can now export them along with everything else on the Google Takeout menu.

Choose to download all of the Docs that you own through Takeout in any of the formats mentioned above. We’re making it more convenient for you to retrieve your information however you want — you can even Takeout just your docs if you’d like. Lastly, be sure to click on the new “Configure” menu if you’d like to choose different formats for your documents.

Tenzing – Hive done the Google way

Impressive it is: Hive, done the Google way.

Google has published a paper describing the architecture and implementation of Tenzing, a query engine built on top of MapReduce.

Abstract

Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility. Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data. In this paper, we describe the architecture and implementation of Tenzing, and present benchmarks of typical analytical queries.
Citation: “Tenzing A SQL Implementation On The MapReduce Framework”, Biswapesh Chattopadhyay, Liang Lin, Weiran Liu, Sagar Mittal, Prathyusha Aragonda, Vera Lychagina, Younghee Kwon, Michael Wong, Proceedings of the VLDB Endowment, vol. 4 (2011), pp. 1318-1327.