PostgreSQL introduced jsonb support

Binary JSON

PostgreSQL has introduced jsonb, a diamond in the crown of PostgreSQL 9.4. It is based on an elegant hash-based GIN operator class whose performance with the containment operator competes with MongoDB.
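A minimal sketch of how this might look (the table and column names here are illustrative, not taken from the feature announcement):

CREATE TABLE docs (doc jsonb);

-- jsonb_path_ops is the hash-based GIN operator class mentioned above
CREATE INDEX idx_docs_doc ON docs USING gin (doc jsonb_path_ops);

-- the containment operator @> can use this index
SELECT doc FROM docs WHERE doc @> '{"tags": ["postgresql"]}';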

Feature documentation: http://www.postgresql.org/docs/devel/static/datatype-json.html

Feature story: http://obartunov.livejournal.com/177247.html

PostgreSQL 9.2 introduced a built-in JSON data type

PostgreSQL 9.2 introduced a new JSON-related feature: a built-in data type. You can now store JSON fields directly in your database without an external format checker, since validation is performed directly inside the Postgres core.


 
Built-in JSON data type.
 
Like the XML data type, we simply store JSON data as text, after checking
that it is valid. More complex operations such as canonicalization and
comparison may come later, but this is enough for now.
 
There are a few open issues here, such as whether we should attempt to
detect UTF-8 surrogate pairs represented as \uXXXX\uYYYY, but this gets
the basic framework in place.

A couple of system functions were also added later to output row or array data directly as JSON.


Add array_to_json and row_to_json functions.
 
Also move the escape_json function from explain.c to json.c where it
seems to belong.
 
Andrew Dunstan, Reviewed by Abhijit Menon-Sen.
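For instance, these functions can be applied to literal values (a quick sketch; the results follow the behavior described in the documentation):

SELECT row_to_json(row(1,'foo'));
     row_to_json
---------------------
 {"f1":1,"f2":"foo"}

SELECT array_to_json('{{1,5},{99,100}}'::int[]);
  array_to_json
------------------
 [[1,5],[99,100]]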

What the Postgres core actually does with JSON fields is store them as text fields (so the maximum size is 1 GB); on top of that, a format check is performed directly in core.

In case of a format error you will receive the following error message:

ERROR: invalid input syntax for type json
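For example, with a hypothetical table (the names below are purely illustrative):

CREATE TABLE demo_json (data json);

-- valid JSON is accepted
INSERT INTO demo_json VALUES ('{"ok": true}');

-- malformed JSON is rejected by the type's input function
INSERT INTO demo_json VALUES ('{"ok": }');
ERROR: invalid input syntax for type json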

 

PostgreSQL 9.2 has been released

PostgreSQL 9.2 has been released and is HUGE: covering indexes, vertical scalability, kd-trees, JSON objects, better concurrent DDL. The next release of PostgreSQL is planned to be 9.3; the tentative schedule puts its release in the third quarter of 2013.

 

 

The PostgreSQL Global Development Group announces PostgreSQL 9.2, the latest release of the leader in open source databases. Since the beta release was announced in May, developers and vendors have praised it as a leap forward in performance, scalability and flexibility. Users are expected to switch to this version in record numbers.

“PostgreSQL 9.2 will ship with native JSON support, covering indexes, replication and performance improvements, and many more features. We are eagerly awaiting this release and will make it available in Early Access as soon as it’s released by the PostgreSQL community,” said Ines Sombra, Lead Data Engineer, Engine Yard.


Major new features

 

Index-only scans

In PostgreSQL, indexes have no “visibility” information. This means that when you access a record by its index, PostgreSQL has to visit the actual tuple in the table to be sure it is visible to you: the tuple the index points to may simply be an old version of the record you are looking for.

This can be a very big performance problem: the index is mostly ordered, so accessing its entries is quite efficient, while the table records may be scattered all over the place (that's one reason why PostgreSQL has a CLUSTER command, but that's another story). In 9.2, PostgreSQL will use an “Index Only Scan” when possible, and not access the record itself if it doesn't need to.

There is still no visibility information in the index. So in order to do this, PostgreSQL uses the visibility map, which tells it whether the whole content of a (usually) 8K page is visible to all transactions or not. When the index record points to a tuple contained in an “all visible” page, PostgreSQL won't have to access the tuple: it can build the result directly from the index. Of course, all the columns requested by the query must be in the index.

The visibility map is maintained by VACUUM (it sets the visible bit), and by the backends doing SQL work (they unset the visible bit).

If the data has only been read (not modified) since the last VACUUM, the pages are all-visible and the index-only scan feature can improve performance.

Here is an example.

CREATE TABLE demo_ios (col1 float, col2 float, col3 text);

In this table, we’ll put random data, in order to have “scattered” data. We’ll put 100 million records, to have a big recordset, and have it not fit in memory (that’s a 4GB-ram machine). This is an ideal case, made for this demo. The gains won’t be that big in real life.

INSERT INTO demo_ios SELECT generate_series(1,100000000), random(), 'mynotsolongstring';

 

SELECT pg_size_pretty(pg_total_relation_size('demo_ios'));

 pg_size_pretty
----------------
 6512 MB

Let’s pretend that the query is this:

SELECT col1,col2 FROM demo_ios where col2 BETWEEN 0.01 AND 0.02

In order to use an index only scan on this query, we need an index on col2,col1 (col2 first, as it is used in the WHERE clause).

CREATE index idx_demo_ios on demo_ios(col2,col1);

We vacuum the table so that the visibility map is up to date:

VACUUM demo_ios;

All the timings you'll see below were taken with cold OS and PostgreSQL caches (that's where the gains are, as the purpose of Index Only Scans is to reduce I/O).

Let’s first try without Index Only Scans:

SET enable_indexonlyscan to off;

 

EXPLAIN (analyze,buffers) select col1,col2 FROM demo_ios where col2 between 0.01 and 0.02;

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on demo_ios  (cost=25643.01..916484.44 rows=993633 width=16) (actual time=763.391..362963.899 rows=1000392 loops=1)
   Recheck Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))
   Rows Removed by Index Recheck: 68098621
   Buffers: shared hit=2 read=587779
   ->  Bitmap Index Scan on idx_demo_ios  (cost=0.00..25394.60 rows=993633 width=0) (actual time=759.011..759.011 rows=1000392 loops=1)
         Index Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))
         Buffers: shared hit=2 read=3835
 Total runtime: 364390.127 ms

With Index Only Scans:

explain (analyze,buffers) select col1,col2 from demo_ios where col2 between 0.01 and 0.02;

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using idx_demo_ios on demo_ios  (cost=0.00..35330.93 rows=993633 width=16) (actual time=58.100..3250.589 rows=1000392 loops=1)
   Index Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))
   Heap Fetches: 0
   Buffers: shared hit=923073 read=3848
 Total runtime: 4297.405 ms

 

As nothing is free, there are a few things to keep in mind:

  • Adding indexes for index only scans obviously adds indexes to your table, so updates will be slower.
  • You will index columns that weren't indexed before, so there will be fewer opportunities for HOT updates.
  • Gains will probably be smaller in real-life situations, especially when data is changed between VACUUMs.

This required making visibility map changes crash-safe, so visibility map bit changes are now WAL-logged.

Replication improvements

Streaming Replication becomes more polished with this release.

One of the main remaining gripes about streaming replication is that all the slaves have to be connected to the same, unique master, consuming its resources. Moreover, in case of a failover, it can be complicated to reconnect all the remaining slaves to the newly promoted master unless one uses a tool like repmgr.

With 9.2, a standby can also send replication changes, allowing cascading replication.

Let’s build this. We start with an already working 9.2 database.

We set it up for replication:

postgresql.conf:

wal_level=hot_standby #(could be archive too)

max_wal_senders=5

hot_standby=on

You'll probably also want to activate archiving in production; it won't be done here.
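If you did, it could look something like this in postgresql.conf (the archive destination below is only a placeholder):

archive_mode = on

archive_command = 'cp %p /path/to/archive/%f'   # placeholder destination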

pg_hba.conf (do not use trust in production):

host   replication replication_user          0.0.0.0/0                      md5

Create the user:

create user replication_user replication password 'secret';

Clone the cluster:

pg_basebackup -h localhost -U replication_user -D data2

Password:

We have a brand new cluster in the data2 directory. We’ll change the port so that it can start (postgresql.conf), as both clusters are running on the same machine:

port=5433

We add a recovery.conf to tell it how to stream from the master database:

standby_mode = on

primary_conninfo = 'host=localhost port=5432 user=replication_user password=secret'

pg_ctl -D data2 start

server starting

LOG:  database system was interrupted; last known up at 2012-07-03 17:58:09 CEST

LOG:  creating missing WAL directory “pg_xlog/archive_status”

LOG:  entering standby mode

LOG:  streaming replication successfully connected to primary

LOG:  redo starts at 0/9D000020

LOG:  consistent recovery state reached at 0/9D0000B8

LOG:  database system is ready to accept read only connections

Now, let’s add a second slave, which will use this slave:

pg_basebackup -h localhost -U replication_user -D data3 -p 5433

Password:

We edit data3’s postgresql.conf to change the port:

port=5434

We modify the recovery.conf to stream from the slave:

standby_mode = on

primary_conninfo = 'host=localhost port=5433 user=replication_user password=secret'

We start the third cluster:

pg_ctl -D data3 start

server starting

LOG:  database system was interrupted while in recovery at log time 2012-07-03 17:58:09 CEST

HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

LOG:  creating missing WAL directory “pg_xlog/archive_status”

LOG:  entering standby mode

LOG:  streaming replication successfully connected to primary

LOG:  redo starts at 0/9D000020

LOG:  consistent recovery state reached at 0/9E000000

LOG:  database system is ready to accept read only connections

Now, everything modified on the master cluster gets streamed to the first slave, and from there to the second slave. This second replication has to be monitored from the first slave (the master knows nothing about it).
As you may have noticed from the example, pg_basebackup now works from slaves.
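For instance (a quick sketch), the cascaded standby shows up in pg_stat_replication on the first slave, while the master only lists its own direct standby:

-- run this on the first slave (port 5433)
SELECT application_name, client_addr, state FROM pg_stat_replication;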

There is another use case that wasn't covered: what if a user doesn't care for having a full-fledged slave, and only wants to stream the WAL files to another location, to benefit from the reduced data loss without the burden of maintaining a slave?

pg_receivexlog is provided just for this purpose: it pretends to be a PostgreSQL slave, but only stores the log files as they are streamed, in a directory:

pg_receivexlog -D /tmp/new_logs -h localhost -U replication_user

will connect to the master (or a slave), and start creating files:

ls /tmp/new_logs/

00000001000000000000009E.partial

Files are of the segment size, so they can be used for a normal recovery of the database. It’s the same as an archive command, but with a much smaller granularity.

Remember to rename the last segment to remove the .partial suffix before using it with a PITR restore or any other operation.
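As a hedged sketch, a restore using this directory could point its restore_command at it in recovery.conf (the path reuses the example directory above and is only illustrative):

restore_command = 'cp /tmp/new_logs/%f "%p"'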

The synchronous_commit parameter has a new value: remote_write. It can be used when there is a synchronous slave (synchronous_standby_names is set), meaning that the master doesn’t have to wait for the slave to have written the data to disk, only for the slave to have acknowledged the data. With this set, data is protected from a crash on the master, but could still be lost if the slave crashed at the same time (i.e. before having written the in flight data to disk). As this is a quite remote possibility, and the performance improvement will be large, some people will be interested in this compromise.
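On the master, this could look something like the following in postgresql.conf (the standby name is a placeholder):

synchronous_standby_names = 'standby1'   # placeholder name

synchronous_commit = remote_write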

JSON datatype

The JSON datatype is meant for storing JSON-structured data. It will validate that the input JSON string is correct JSON:

=# SELECT '{"username":"john","posts":121,"emailaddress":"john@nowhere.com"}'::json;
                                json
--------------------------------------------------------------------
 {"username":"john","posts":121,"emailaddress":"john@nowhere.com"}
(1 row)

 

=# SELECT '{"username","posts":121,"emailaddress":"john@nowhere.com"}'::json;
ERROR:  invalid input syntax for type json at character 8
DETAIL:  Expected ":", but found ",".
CONTEXT:  JSON data, line 1: {"username",...
STATEMENT:  SELECT '{"username","posts":121,"emailaddress":"john@nowhere.com"}'::json;
ERROR:  invalid input syntax for type json
LINE 1: SELECT '{"username","posts":121,"emailaddress":"john@nowhere...
                ^
DETAIL:  Expected ":", but found ",".
CONTEXT:  JSON data, line 1: {"username",...

You can also convert a row type to JSON:

=# SELECT * FROM demo ;
 username | posts |    emailaddress
----------+-------+---------------------
 john     |   121 | john@nowhere.com
 mickael  |   215 | mickael@nowhere.com
(2 rows)

=# SELECT row_to_json(demo) FROM demo;
                               row_to_json
-------------------------------------------------------------------------
 {"username":"john","posts":121,"emailaddress":"john@nowhere.com"}
 {"username":"mickael","posts":215,"emailaddress":"mickael@nowhere.com"}
(2 rows)

Or an array type:

 

=# select array_to_json(array_agg(demo)) from demo;
                                                        array_to_json
---------------------------------------------------------------------------------------------------------------------------
 [{"username":"john","posts":121,"emailaddress":"john@nowhere.com"},{"username":"mickael","posts":215,"emailaddress":"mickael@nowhere.com"}]
(1 row)

Range Types

Range types are used to store a range of data of a given type. There are a few pre-defined types. They are integer (int4range), bigint (int8range), numeric (numrange), timestamp without timezone (tsrange), timestamp with timezone (tstzrange), and date (daterange).

Ranges can be made of continuous (numeric, timestamp…) or discrete (integer, date…) data types. They can be open (the bound isn’t part of the range) or closed (the bound is part of the range). A bound can also be infinite.
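A quick sketch of bounds (the values are arbitrary): lower_inc/upper_inc report whether a bound is included, and omitting a bound makes it infinite.

SELECT lower_inc('(1000,2000]'::numrange), upper_inc('(1000,2000]'::numrange);
 lower_inc | upper_inc
-----------+-----------
 f         | t
(1 row)

SELECT '[2012-01-01,)'::daterange;
   daterange
---------------
 [2012-01-01,)
(1 row)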

Without these datatypes, most people solve the range problems by using two columns in a table. These range types are much more powerful, as you can use many operators on them.

Here is the intersection between the (1000,2000] (open, closed) and [1000,1200] (closed, closed) numeric ranges:

SELECT '(1000,2000]'::numrange * '[1000,1200]'::numrange;
  ?column?
-------------
 (1000,1200]
(1 row)

So you can query on things like: «give me all ranges that intersect this»:

=# SELECT * from test_range ;
                       period
-----------------------------------------------------
 ["2012-01-01 00:00:00+01","2012-01-02 12:00:00+01"]
 ["2012-01-01 00:00:00+01","2012-03-01 00:00:00+01"]
 ["2008-01-01 00:00:00+01","2015-01-01 00:00:00+01"]
(3 rows)

=# SELECT * FROM test_range WHERE period && '[2012-01-03 00:00:00,2012-01-03 12:00:00]';
                       period
-----------------------------------------------------
 ["2012-01-01 00:00:00+01","2012-03-01 00:00:00+01"]
 ["2008-01-01 00:00:00+01","2015-01-01 00:00:00+01"]
(2 rows)

This query could use an index defined like this:

=# CREATE INDEX idx_test_range on test_range USING gist (period);

You can also use these range data types to define exclusion constraints:

CREATE EXTENSION btree_gist ;

CREATE TABLE reservation (room_id int, period tstzrange);

ALTER TABLE reservation ADD  EXCLUDE USING GIST (room_id WITH =, period WITH &&);

This means that now it is forbidden to have two records in this table where room_id is equal and period overlaps. The extension btree_gist is required to create a GiST index on room_id (it’s an integer, it is usually indexed with a btree index).

=# INSERT INTO reservation VALUES (1,'(2012-08-23 14:00:00,2012-08-23 15:00:00)');
INSERT 0 1
=# INSERT INTO reservation VALUES (2,'(2012-08-23 14:00:00,2012-08-23 15:00:00)');
INSERT 0 1
=# INSERT INTO reservation VALUES (1,'(2012-08-23 14:45:00,2012-08-23 15:15:00)');
ERROR:  conflicting key value violates exclusion constraint "reservation_room_id_period_excl"
DETAIL:  Key (room_id, period)=(1, ("2012-08-23 14:45:00+02","2012-08-23 15:15:00+02")) conflicts with existing key (room_id, period)=(1, ("2012-08-23 14:00:00+02","2012-08-23 15:00:00+02")).
STATEMENT:  INSERT INTO reservation VALUES (1,'(2012-08-23 14:45:00,2012-08-23 15:15:00)');

One can also declare new range types.
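For example (a sketch in the spirit of the documentation), a range type over float8:

CREATE TYPE floatrange AS RANGE (
    subtype = float8,
    subtype_diff = float8mi
);

SELECT '[1.234, 5.678]'::floatrange;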

Performance improvements

This version brings performance improvements in a very large range of areas (non-exhaustive list):

  • The most visible will probably be the Index Only Scans, which have already been introduced in this document.
  • The lock contention of several big locks has been significantly reduced, leading to better multi-processor scalability, mostly for machines with over 32 cores.
  • The performance of in-memory sorts has been improved by up to 25% in some situations, with certain specialized sort functions introduced.
  • An idle PostgreSQL server now makes fewer wakeups, leading to lower power consumption. This is especially useful in virtualized and embedded environments.
  • COPY has been improved; it generates less WAL volume and takes fewer locks on a table's pages.
  • Statistics are collected on array contents, allowing for better estimations of selectivity on array operations.
  • Text-to-anytype concatenation and the quote_literal/quote_nullable functions are no longer volatile, enabling better optimization in some cases.
  • The system can now track I/O durations (see the sketch just after this list).
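A quick hedged sketch of that last point: the new track_io_timing parameter exposes I/O times, for example through EXPLAIN (ANALYZE, BUFFERS) and pg_stat_database (the table name reuses the demo table from earlier):

SET track_io_timing = on;

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM demo_ios;
-- the plan output now includes an "I/O Timings" line when blocks are read from disk

SELECT blk_read_time, blk_write_time FROM pg_stat_database WHERE datname = current_database();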

 

RJSON: compress JSON to JSON

RJSON converts any JSON data collection into a more compact recursive form. The compressed data is still JSON and can be parsed with JSON.parse. RJSON can compress not only homogeneous collections but also any data set with a free structure.

RJSON is a single-pass stream compressor: it extracts data schemas from the document, assigns each schema a unique number, and uses this number instead of repeating the same property names again and again.

Below you can see the same document in both formats.

 

JSON:
{ "id": 7, "tags": ["programming", "javascript"], "users": [ {"first": "Homer", "last": "Simpson"}, {"first": "Hank", "last": "Hill"}, {"first": "Peter", "last": "Griffin"} ], "books": [ {"title": "JavaScript", "author": "Flanagan", "year": 2006}, {"title": "Cascading Style Sheets", "author": "Meyer", "year": 2004} ] }
RJSON:
{ "id": 7, "tags": ["programming", "javascript"], "users": [ {"first": "Homer", "last": "Simpson"}, [2, "Hank", "Hill", "Peter", "Griffin"] ], "books": [ {"title": "JavaScript", "author": "Flanagan", "year": 2006}, [3, "Cascading Style Sheets", "Meyer", 2004] ] }

 

 

When RJSON finds a new object schema (a unique combination of property names), it outputs the object as-is and assigns it a new index starting from 1 (0 is reserved for numeric arrays). Subsequent objects with the same schema are encoded as arrays (first the schema's index, then the property values). Several consecutive objects with the same schema are merged into the same array, so the schema index is stored only once for them. The schemas themselves aren't stored in the document: the unpacker indexes new schemas in exactly the same way as the packer does, so both packing and unpacking share an identical implicit in-memory index.

You can download the source code from https://github.com/dogada/RJSON or try the RJSON demo, where you can convert any JSON data into the RJSON format, decode the result, and check that it matches the original JSON data.

RJSON allows you to:

  • reduce JSON data size and network traffic when gzip isn't available. For example, in-browser 3D-modeling tools like the Mydeco 3D-planner may process and send to the server megabytes of JSON data;
  • analyze large collections of JSON data without unpacking the whole dataset. RJSON data is still JSON data, so it can be traversed and analyzed after parsing, and fully unpacked only if a document meets some conditions.

The above JSON vs. RJSON example is based on the data structure from JSON DB: a compressed JSON format. The JSONDB concept is implemented in JSONH, the JSON Homogeneous Collections Compressor. RJSON provides a level of data compression similar to JSONH's, but RJSON isn't limited to homogeneous collections.

Recently I also found CJSON; it uses an approach similar to RJSON's, but with explicit schemas, and it wraps compressed objects into other objects with empty keys. RJSON instead tries to preserve the original structure of the document, uses an implicit index of data schemas, encodes compressed objects as arrays, and merges homogeneous sequences of objects into a single array.

In my opinion RJSON combines the best features of JSONH and CJSON and introduces a beautiful implicit index of data schemas.

The idea for this algorithm came to me about five or six years ago, when I was working on a J2ME XHTML-rendering engine for mobile phone browsers, where every extra kilobyte matters. When I realized that XML is like violence, I started to like JSON and adapted the original idea to the JSON format.

The code is available under the Simplified BSD License. Feel free to compress the world.

JSONP Enables cross-domain requests to any JSON API

JSONP or “JSON with padding” is a complement to the base JSON data format. It provides a method to request data from a server in a different domain, something prohibited by typical web browsers because of the Same origin policy.


Easy: JSONP

<script>
  function myCallback(data){
    alert("it works! " + data.awesome);
  }
</script>
<script src="http://jsonp.jit.su/?callback=myCallback&url=http://jsonview.com/example.json"></script>
http://jsonp.jit.su/?callback=&url= 

Easier: Cross-domain AJAX

<script>
  $.get('http://jsonp.jit.su/?url=http://jsonview.com/example.json', function(data){
    alert("pretty awesome, eh? " + data.awesome);
  });
</script>

http://jsonp.jit.su/

About JSON genesis

Video from  IEEE Computing Conversations

Interview with Douglas Crockford about the development of JavaScript Object Notation (JSON)

 

  • Crockford is likeably humble about the origins of JSON. Rather than claiming he invented JSON, he instead says he discovered it:
“I don't claim to have invented it, because it already existed in nature. I just saw it, recognized the value of it, gave it a name, and a description, and showed its benefits. I don't claim to be the only person to have discovered it.”

 

  • Crockford tried very hard to strip unnecessary stuff from JSON so it stood a better chance of being language-independent. When confronted with pushback about JSON not being a “standard”, Crockford registered json.org, put up a specification documenting the data format, and declared it a standard.

 

  • Crockford wanted something that made his life easier. He needed JSON when building an application where a client written in JavaScript had to communicate with a server written in Java. He wanted something where the data serialization matched the data structures available to both programming language environments.

 

IETF working on a convention for HTTP access to JSON resources

An IETF Internet-Draft is in progress: A Convention for HTTP Access to JSON Resources.

Abstract

This document codifies a convention for accessing JSON representations of resources via HTTP.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

 

http://tools.ietf.org/html/draft-pbryan-http-json-resource-01