Local Secondary Indexes for Amazon DynamoDB

The Amazon Web Services Blog just announced a new feature: you can now create local secondary indexes for Amazon DynamoDB tables. These indexes give you the power to query your tables in new ways and can also increase retrieval efficiency.

What’s a Local Secondary Index?
The local secondary index model builds on DynamoDB’s existing key model.

Until today, you had to select one of the following two primary key options when creating a table:

  • Hash – A strongly typed (string, number, or binary) value that uniquely identifies each item in a particular table. DynamoDB allows you to retrieve items by their hash keys.
  • Hash + Range – A pair of strongly typed values that collectively form a unique identifier for each item in a particular table. DynamoDB supports range queries that allow you to retrieve some or all of the items that match the hash portion of the primary key.

With today’s release we are extending the Hash + Range option with support for up to five local secondary indexes per table. Like the primary key, the indexes must be defined when you create the table. Each index references a non-primary attribute, and enables efficient retrieval using a combination of the hash key and the specified secondary key.

You can also choose to project some or all of the table’s other attributes into a secondary index. DynamoDB will automatically retrieve attribute values from the table or from the index as required. Projecting a particular attribute will improve retrieval speed and lessen the amount of provisioned throughput consumed, but will require additional storage space. Items within a secondary index are stored physically close to each other and in sorted order for fast query performance.

Show Me an Example!
Let’s say that you need to access a table containing information about heads of state. You create a table like this, with Country as the hash key and PresNumber as the range key:

Country       | PresNumber | Name              | VP               | Party                 | Age | YearsInOffice
United States | 1          | George Washington | John Adams       | None                  | 57  | 6.34
United States | 2          | John Adams        | Thomas Jefferson | Federalist            | 61  | 4
United States | 3          | Thomas Jefferson  | Aaron Burr       | Democratic-Republican | 57  | 8
United States | 4          | James Madison     | George Clinton   | Democratic-Republican | 57  | 8
United States | 5          | James Monroe      | Daniel Tompkins  | Democratic-Republican | 58  | 8

With this schema, you can retrieve heads of state using a country and the ordinal number for the presidency. Now, let’s say that you want to query by Age (upon taking office) as well. You would create the table with a local secondary index (Age) like this:

Country       | Age | PresNumber | Name              | VP               | Party                 | YearsInOffice
United States | 57  | 1          | George Washington | John Adams       | None                  | 6.34
United States | 61  | 2          | John Adams        | Thomas Jefferson | Federalist            | 4
United States | 57  | 3          | Thomas Jefferson  | Aaron Burr       | Democratic-Republican | 8
United States | 57  | 4          | James Madison     | George Clinton   | Democratic-Republican | 8
United States | 58  | 5          | James Monroe      | Daniel Tompkins  | Democratic-Republican | 8

How Do I Create and Use a Local Secondary Index?
As I noted earlier, you must create your local secondary indexes when you create the DynamoDB table; you can define them in the AWS Management Console as part of the table creation process.

DynamoDB’s existing Query API now supports the use of local secondary indexes. Your call must specify the table, the name of the index, the attributes you want returned, and any query conditions that you want to apply. We have examples in Java, PHP, and .NET / C#.
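If you prefer to drive this from code, here is a minimal sketch using boto3, the AWS SDK for Python (a hedged illustration, not one of the official examples linked above); the table name, index name, and capacity numbers simply follow the heads-of-state example:

import boto3

dynamodb = boto3.client("dynamodb")

# Create the table; the local secondary index must be declared here, at creation time.
dynamodb.create_table(
    TableName="HeadsOfState",
    AttributeDefinitions=[
        {"AttributeName": "Country", "AttributeType": "S"},
        {"AttributeName": "PresNumber", "AttributeType": "N"},
        {"AttributeName": "Age", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "Country", "KeyType": "HASH"},
        {"AttributeName": "PresNumber", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "AgeIndex",
            "KeySchema": [
                {"AttributeName": "Country", "KeyType": "HASH"},
                {"AttributeName": "Age", "KeyType": "RANGE"},
            ],
            # Project Name into the index so queries on it never have to touch the table.
            "Projection": {"ProjectionType": "INCLUDE", "NonKeyAttributes": ["Name"]},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
dynamodb.get_waiter("table_exists").wait(TableName="HeadsOfState")

# Query the index: US presidents who were 57 or younger when taking office.
response = dynamodb.query(
    TableName="HeadsOfState",
    IndexName="AgeIndex",
    KeyConditionExpression="#c = :c AND #a <= :a",
    ExpressionAttributeNames={"#c": "Country", "#a": "Age", "#n": "Name"},
    ExpressionAttributeValues={":c": {"S": "United States"}, ":a": {"N": "57"}},
    ProjectionExpression="#n, #a",
)
for item in response["Items"]:
    print(item["Name"]["S"], item["Age"]["N"])

Because Name is projected into AgeIndex, this query is served entirely from the index and never reads the base table.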

Costs and Provisioned Throughput
Let’s talk about the implications of local secondary indexes on the DynamoDB cost structure.

Every secondary index means more work for DynamoDB. When you add, delete, or replace items in a table that has local secondary indexes, DynamoDB will use additional write capacity units to update the relevant indexes.

When you query a table that has one or more local secondary indexes, you need to consider two distinct cases:

  • For queries that use index keys and projected attributes, DynamoDB will read from the index instead of from the table and will compute the number of read capacity units accordingly. This can result in lower costs if there are fewer attributes in the index than in the table.
  • For index queries that read non-projected attributes, DynamoDB will need to read both the table and the index. This will consume additional read capacity units.
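As a rough illustration of the first case, here is a back-of-the-envelope sketch in Python of the read-capacity arithmetic for an index-only query; the 4 KB read unit and the halving for eventually consistent reads are standard DynamoDB figures, while the item sizes are invented for the example (see the DynamoDB documentation for the exact accounting, especially when non-projected attributes are fetched):

import math

# One read capacity unit covers one strongly consistent read per second
# of up to 4 KB; an eventually consistent read costs half as much.
READ_UNIT_BYTES = 4 * 1024

def query_read_units(item_sizes_bytes, strongly_consistent=True):
    # For a Query, DynamoDB sums the sizes of all items read and rounds
    # the total up to the next 4 KB boundary.
    units = math.ceil(sum(item_sizes_bytes) / READ_UNIT_BYTES)
    return units if strongly_consistent else units / 2

# 100 slim index entries of ~200 bytes each versus 100 full 3 KB table items:
print(query_read_units([200] * 100))       # 5 read capacity units
print(query_read_units([3 * 1024] * 100))  # 75 read capacity units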

 

Amazon Announces New Data Warehousing Product

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift offers you fast query performance when analyzing virtually any size data set using the same SQL-based tools and business intelligence applications you use today. With a few clicks in the AWS Management Console, you can launch a Redshift cluster, starting with a few hundred gigabytes of data and scaling to a petabyte or more, for under $1,000 per terabyte per year.
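If you would rather script the launch than click through the console, a minimal sketch with boto3, the AWS SDK for Python, might look like this; the cluster identifier, node type, database name, and credentials are placeholders rather than values from the announcement:

import boto3

redshift = boto3.client("redshift")

# Launch a small single-node cluster; you can resize it later with
# modify_cluster instead of rebuilding it.
redshift.create_cluster(
    ClusterIdentifier="demo-warehouse",    # hypothetical name
    ClusterType="single-node",
    NodeType="dc2.large",                  # pick a node type that fits your data
    DBName="analytics",
    MasterUsername="admin",
    MasterUserPassword="ChangeMe1234",     # use a real secret in practice
)

# Once the cluster is available, any PostgreSQL-compatible SQL client or
# BI tool can connect to its endpoint and run standard SQL.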

Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse cluster, from provisioning capacity to monitoring and backing up the cluster, to applying patches and upgrades. Scaling a cluster to improve performance or increase capacity is simple and incurs no downtime. The service continuously monitors the health of the cluster and automatically replaces any component, if needed. By automating these labor-intensive tasks, Amazon Redshift enables you to spend your time focusing on your data and business insights.

Amazon Redshift is designed for developers or businesses that require the full features and capabilities of a relational data warehouse. It is certified by Jaspersoft and MicroStrategy, with additional business intelligence tools coming soon.

Starting today, you can sign up for an invitation for the limited preview of Amazon Redshift by filling out the form to the right. We’ll contact you by email with instructions on how to get started as soon as we can.

 

http://aws.amazon.com/redshift/

Amazon Glacier

Today Amazon announced its new service, “Glacier”, which provides extremely low-cost storage: $0.01 per GB per month.

 

Storing gigabytes and terabytes of data for years sounds exciting at first, but keep in mind that it could turn out to be an expensive deal: we have only just emerged from an era of kilobytes, and history shows that data storage costs keep falling.

 

Amazon Glacier

Amazon Glacier is an extremely low-cost storage service that provides secure and durable storage for data archiving and backup. In order to keep costs low, Amazon Glacier is optimized for data that is infrequently accessed and for which retrieval times of several hours are suitable. With Amazon Glacier, customers can reliably store large or small amounts of data for as little as $0.01 per gigabyte per month, a significant savings compared to on-premises solutions.

Companies typically over-pay for data archiving. First, they’re forced to make an expensive upfront payment for their archiving solution (which does not include the ongoing cost for operational expenses such as power, facilities, staffing, and maintenance). Second, since companies have to guess what their capacity requirements will be, they understandably over-provision to make sure they have enough capacity for data redundancy and unexpected growth. This set of circumstances results in under-utilized capacity and wasted money. With Amazon Glacier, you pay only for what you use. Amazon Glacier changes the game for data archiving and backup as you pay nothing upfront, pay a very low price for storage, and can scale your usage up or down as needed, while AWS handles all of the operational heavy lifting required to do data retention well. It only takes a few clicks in the AWS Management Console to set up Amazon Glacier and then you can upload any amount of data you choose. 
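To make that last sentence concrete, here is a minimal boto3 (AWS SDK for Python) sketch of creating a vault and uploading an archive; the vault name and file are made up for illustration:

import boto3

glacier = boto3.client("glacier")

# Create a vault; accountId="-" means "the account that owns these credentials".
glacier.create_vault(accountId="-", vaultName="backup-vault")

# Upload an archive into the vault.
with open("backup-2013-04.tar.gz", "rb") as f:  # hypothetical archive file
    archive = glacier.upload_archive(
        accountId="-",
        vaultName="backup-vault",
        archiveDescription="Monthly backup",
        body=f,
    )

# Retrieval is asynchronous: start a job now, collect the data hours later.
glacier.initiate_job(
    accountId="-",
    vaultName="backup-vault",
    jobParameters={"Type": "archive-retrieval", "ArchiveId": archive["archiveId"]},
)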

http://aws.amazon.com/glacier/

Google is definitely in the Big Data business

At the Google I/O developer conference in San Francisco last month, Google introduced a slew of new products, including Google Compute Engine. What wasn’t talked about as much was Google’s big data play. Google is definitely in the Big Data business, and it will be the 800 lb gorilla in the space.

Google has two major things going for it: 1) an amazing infrastructure and network inside its core operations; 2) an enormous amount of data (let’s just say roughly 90% of the world’s data, covering both information and people).

Google’s infrastructure strength and direction with big data will shape not only applications but also the enterprise business. Why? Because Google can provide infrastructure and data to anyone who wants it.

Watch out for Google, because soon it will be competing with everyone in the enterprise, including big players like EMC/Greenplum, IBM/Netezza, HP, and Microsoft.

David Floyer, Chief Technology Officer and head of research at Wikibon.org, wrote a great research paper today called Google and VMware Provide Virtualization of Hadoop and Big Data.   David addressed the Google (and VMware) angle in that piece.

If you’re interested in what Google is doing in Big Data you have to read the Wikibon research.

http://wikibon.org/wiki/v/Google_and_VMware_Provide_Virtualization_of_Hadoop_and_Big_Data

Google Compute Engine Review (source: Wikibon.org)

At the 2012 Google I/O conference, Google announced Compute Engine. It provides 700,000 virtual cores that users can spin up and tear down very rapidly for big data applications in general, and MapReduce and Hadoop in particular, all without setting up any data center infrastructure. The service works with the Google Cloud Storage service to provide the data; the data is encrypted at rest. It is a different service than Google App Engine, but complementary.

Compute Engine uses the KVM hypervisor on top of the Linux operating system. In discussions with Wikibon, Google pointed out the improvements that they had made to the open source KVM code to improve performance and security in a multi-core multi-thread Intel CPU environment. This allows virtual cores (one thread, one core) to be used as the building block for spinning up very efficient virtual machines.

To help with data ingestion, Google is offering access to the full resources of Google’s private networks. This makes it possible to move ingested data across the network at very high speed, and allows replication to a specific data center. The location(s) can be defined, allowing compliance with specific country or regional requirements to retain data in-country. If the user can bring the data cost-effectively and with sufficient bandwidth to a Google edge location, the Google network services will take over.

The Google Hadoop service can utilize the MapR framework in a similar way to the MapR service for Amazon, which provides improved availability and management components. John Schroeder, CEO and founder of MapR, presented a demonstration running Terasort on a 5,024-core Hadoop cluster with 1,256 disks on the Google Compute Engine service. The run completed in 1 minute 20 seconds, at a total cost of $16. He compared this with a 1,460-server physical environment with over 11,000 cores, which would take months to set up and would cost over $5 million.

As a demonstration this was impressive. Of course, Terasort is a highly CPU-intensive workload that can be effectively parallelized and utilizes cores very efficiently. Other benchmark results that include more IO-intensive use of Google Cloud Storage are needed to confirm that the service is of universal value.

Wikibon also discussed whether Google would provide other data services to allow joining of corporate data with other Google-derived and provided datasets. Google indicated that they understood the potential value of this service and understood that other service providers were offering these services (e.g., Microsoft Azure). Wikibon expects that data services of this type will be introduced by Google.

There is no doubt that Google is seriously addressing the big data market and wants to compete seriously in the enterprise space. The Google network services, data replication services, and encryption services reflect this drive to compete strongly with Amazon.

 

Article from JOHN FURRIER  @ SiliconANGLE

 

 

Amazon DynamoDB – Internet-Scale Data Storage the NoSQL Way

Amazon did it today: they announced Amazon DynamoDB – Internet-Scale Data Storage the NoSQL Way.

“We want you to think big, to dream big dreams, and to envision (and then build) data-intensive applications that can scale from zero users up to tens or hundreds of millions of users before you know it. We want you to succeed, and we don’t want your database to get in the way. Focus on your app and on building a user base, and leave the driving to us.”

Sound good?

DynamoDB
Today, Amazon is introducing Amazon DynamoDB, an Internet-scale NoSQL database service. Built from the ground up to be efficient, scalable, and highly reliable, DynamoDB lets people store as much data as they want and access it as often as they’d like, with predictable performance delivered by solid state drives (SSDs).

DynamoDB is:

  • Scalable
  • Fast
  • Flexible
  • Low Cost
  • Easy to deploy with its own wizard setup

 

As part of the AWS Free Usage Tier, you get 100 MB of free storage, 5 writes per second, and 10 strongly consistent reads per second.
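To give a feel for the programming model, here is a minimal sketch using boto3, the AWS SDK for Python; the table and attribute names are hypothetical and assume a table with a hash key of UserId already exists:

import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")  # hypothetical table keyed on UserId

# Write an item, then read it back with a strongly consistent read.
users.put_item(Item={"UserId": "alice", "SignupYear": 2012, "Plan": "free"})
item = users.get_item(Key={"UserId": "alice"}, ConsistentRead=True)["Item"]
print(item)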

All details are available on the Amazon Web Services Blog:

http://aws.typepad.com/aws/2012/01/amazon-dynamodb-internet-scale-data-storage-the-nosql-way.html