PostgreSQL 9.2 has been released and is HUGE: covering indexes, vertical scalability, kd-trees, JSON objects, better concurrent DDL. The next release of PostgreSQL is planned to be the 9.3 release, a tentative schedule for this version has a release in the third quarter of 2013.
The PostgreSQL Global Development Group announces PostgreSQL 9.2, the latest release of the leader in open source databases. Since the beta release was announced in May, developers and vendors have praised it as a leap forward in performance, scalability and flexibility. Users are expected to switch to this version in record numbers.
“PostgreSQL 9.2 will ship with native JSON support, covering indexes, replication and performance improvements, and many more features. We are eagerly awaiting this release and will make it available in Early Access as soon as it’s released by the PostgreSQL community,” said Ines Sombra, Lead Data Engineer, Engine Yard.
Links
Major new features
Index-only scans
In PostgreSQL, indexes have no “visibility” information. It means that when you access a record by its index, PostgreSQL has to visit the real tuple in the table to be sure it is visible to you: the tuple the index points to may simply be an old version of the record you are looking for.
It can be a very big performance problem: the index is mostly ordered, so accessing its records is quite efficient, while the records may be scattered all over the place (that’s a reason why PostgreSQL has a cluster command, but that’s another story). In 9.2, PostgreSQL will use an “Index Only Scan” when possible, and not access the record itself if it doesn’t need to.
There is still no visibility information in the index. So in order to do this, PostgreSQL uses the visibility map (visibility map) , which tells it whether the whole content of a (usually) 8K page is visible to all transactions or not. When the index record points to a tuple contained in an «all visible» page, PostgreSQL won’t have to access the tuple, it will be able to build it directly from the index. Of course, all the columns requested by the query must be in the index.
The visibility map is maintained by VACUUM (it sets the visible bit), and by the backends doing SQL work (they unset the visible bit).
If the data has been read only since the last VACUUM then the data is All Visible and the index only scan feature can improve performance.
Here is an example.
CREATE TABLE demo_ios (col1 float, col2 float, col3 text);
In this table, we’ll put random data, in order to have “scattered” data. We’ll put 100 million records, to have a big recordset, and have it not fit in memory (that’s a 4GB-ram machine). This is an ideal case, made for this demo. The gains won’t be that big in real life.
INSERT INTO demo_ios SELECT generate_series(1,100000000),random(), ‘mynotsolongstring’;
SELECT pg_size_pretty(pg_total_relation_size(‘demo_ios’));
pg_size_pretty
—————-
6512 MB
Let’s pretend that the query is this:
SELECT col1,col2 FROM demo_ios where col2 BETWEEN 0.01 AND 0.02
In order to use an index only scan on this query, we need an index on col2,col1 (col2 first, as it is used in the WHERE clause).
CREATE index idx_demo_ios on demo_ios(col2,col1);
We vacuum the table, so that the visibility map to be up-to-date:
VACUUM demo_ios;
All the timing you’ll see below are done on a cold OS and PostgreSQL cache (that’s where the gains are, as the purpose on Index Only Scans is to reduce I/O).
Let’s first try without Index Only Scans:
SET enable_indexonlyscan to off;
EXPLAIN (analyze,buffers) select col1,col2 FROM demo_ios where col2 between 0.01 and 0.02;
QUERY PLAN
—————————————————————————————————————————————-
Bitmap Heap Scan on demo_ios (cost=25643.01..916484.44 rows=993633 width=16) (actual time=763.391..362963.899 rows=1000392 loops=1)
Recheck Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))
Rows Removed by Index Recheck: 68098621
Buffers: shared hit=2 read=587779
-> Bitmap Index Scan on idx_demo_ios (cost=0.00..25394.60 rows=993633 width=0) (actual time=759.011..759.011 rows=1000392 loops=1)
Index Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))
Buffers: shared hit=2 read=3835
Total runtime: 364390.127 ms
With Index Only Scans:
explain (analyze,buffers) select col1,col2 from demo_ios where col2 between 0.01 and 0.02;
QUERY PLAN
———————————————————————————————————————————————–
Index Only Scan using idx_demo_ios on demo_ios (cost=0.00..35330.93 rows=993633 width=16) (actual time=58.100..3250.589 rows=1000392 loops=1)
Index Cond: ((col2 >= 0.01::double precision) AND (col2 <= 0.02::double precision))
Heap Fetches: 0
Buffers: shared hit=923073 read=3848
Total runtime: 4297.405 ms
As nothing is free, there are a few things to keep in mind:
- Adding indexes for index only scans obviously adds indexes to your table. So updates will be slower.
- You will index columns that weren’t indexed before. So there will be less opportunities for HOT updates.
- Gains will probably be smaller in real life situations, especially when data is changed between VACUUMs
This required making visibility map changes crash-safe, so visibility map bit changes are now WAL-logged.
Replication improvements
Streaming Replication becomes more polished with this release.
One of the main remaining gripes about streaming replication is that all the slaves have to be connected to the same and unique master, consuming its resources. Moreover, in case of a failover, it could be complicated to reconnect all the remaining slaves to the newly promoted master, if one is not using a tool like repmgr.
With 9.2, a standby can also send replication changes, allowing cascading replication.
Let’s build this. We start with an already working 9.2 database.
We set it up for replication:
postgresql.conf:
wal_level=hot_standby #(could be archive too)
max_wal_senders=5
hot_standby=on
You’ll probably also want to activate archiving in production, it won’t be done here.
pg_hba.conf (do not use trust in production):
host replication replication_user 0.0.0.0/0 md5
Create the user:
create user replication_user replication password ‘secret’;
Clone the cluster:
pg_basebackup -h localhost -U replication_user -D data2
Password:
We have a brand new cluster in the data2 directory. We’ll change the port so that it can start (postgresql.conf), as both clusters are running on the same machine:
port=5433
We add a recovery.conf to tell it how to stream from the master database:
standby_mode = on
primary_conninfo = ‘host=localhost port=5432 user=replication_user password=secret’
pg_ctl -D data2 start
server starting
LOG: database system was interrupted; last known up at 2012-07-03 17:58:09 CEST
LOG: creating missing WAL directory “pg_xlog/archive_status”
LOG: entering standby mode
LOG: streaming replication successfully connected to primary
LOG: redo starts at 0/9D000020
LOG: consistent recovery state reached at 0/9D0000B8
LOG: database system is ready to accept read only connections
Now, let’s add a second slave, which will use this slave:
pg_basebackup -h localhost -U replication_user -D data3 -p 5433
Password:
We edit data3′s postgresql.conf to change the port:
port=5434
We modify the recovery.conf to stream from the slave:
standby_mode = on
primary_conninfo = ‘host=localhost port=5433 user=replication_user password=secret’ # e.g. ‘host=localhost port=5432′
We start the third cluster:
pg_ctl -D data3 start
server starting
LOG: database system was interrupted while in recovery at log time 2012-07-03 17:58:09 CEST
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: creating missing WAL directory “pg_xlog/archive_status”
LOG: entering standby mode
LOG: streaming replication successfully connected to primary
LOG: redo starts at 0/9D000020
LOG: consistent recovery state reached at 0/9E000000
LOG: database system is ready to accept read only connections
Now, everything modified on the master cluster get streamed to the first slave, and from there to the second slave. This second replication has to be monitored from the first slave (the master knows nothing about it).
As you may have noticed from the example, pg_basebackup now works from slaves.
There is another use case that wasn’t covered: what if a user didn’t care for having a full fledged slave, and only wanted to stream the WAL files to another location, to benefit from the reduced data loss without the burden of maintaining a slave ?
pg_receivexlog is provided just for this purpose: it pretends to be a PostgreSQL slave, but only stores the log files as they are streamed, in a directory:
pg_receivexlog -D /tmp/new_logs -h localhost -U replication_user
will connect to the master (or a slave), and start creating files:
ls /tmp/new_logs/
00000001000000000000009E.partial
Files are of the segment size, so they can be used for a normal recovery of the database. It’s the same as an archive command, but with a much smaller granularity.
Remember to rename the last segment to remove the .partial suffix before using it with a PITR restore or any other operation.
The synchronous_commit parameter has a new value: remote_write. It can be used when there is a synchronous slave (synchronous_standby_names is set), meaning that the master doesn’t have to wait for the slave to have written the data to disk, only for the slave to have acknowledged the data. With this set, data is protected from a crash on the master, but could still be lost if the slave crashed at the same time (i.e. before having written the in flight data to disk). As this is a quite remote possibility, and the performance improvement will be large, some people will be interested in this compromise.
JSON datatype
The JSON datatype is meant for storing JSON-structured data. It will validate that the input JSON string is correct JSON:
=# SELECT ‘{“username”:”john”,”posts”:121,”emailaddress”:”john@nowhere.com”}’::json;
json
——————————————————————-
{“username”:”john”,”posts”:121,”emailaddress”:”john@nowhere.com”}
(1 row)
=# SELECT ‘{“username”,”posts”:121,”emailaddress”:”john@nowhere.com”}’::json;
ERROR: invalid input syntax for type json at character 8
DETAIL: Expected “:”, but found “,”.
CONTEXT: JSON data, line 1: {“username”,…
STATEMENT: SELECT ‘{“username”,”posts”:121,”emailaddress”:”john@nowhere.com”}’::json;
ERROR: invalid input syntax for type json
LINE 1: SELECT ‘{“username”,”posts”:121,”emailaddress”:”john@nowhere…
^
DETAIL: Expected “:”, but found “,”.
CONTEXT: JSON data, line 1: {“username”,…
You can also convert a row type to JSON:
=#SELECT * FROM demo ;
username | posts | emailaddress
———-+——-+———————
john | 121 | john@nowhere.com
mickael | 215 | mickael@nowhere.com
(2 rows)
=# SELECT row_to_json(demo) FROM demo;
row_to_json
————————————————————————-
{“username”:”john”,”posts”:121,”emailaddress”:”john@nowhere.com”}
{“username”:”mickael”,”posts”:215,”emailaddress”:”mickael@nowhere.com”}
(2 rows)
Or an array type:
=# select array_to_json(array_agg(demo)) from demo;
array_to_json
———————————————————————————————————————————————
[{"username":"john","posts":121,"emailaddress":"john@nowhere.com"},{"username":"mickael","posts":215,"emailaddress":"mickael@nowhere.com"}]
(1 row)
Range Types
Range types are used to store a range of data of a given type. There are a few pre-defined types. They are integer (int4range), bigint (int8range), numeric (numrange), timestamp without timezone (tsrange), timestamp with timezone (tstzrange), and date (daterange).
Ranges can be made of continuous (numeric, timestamp…) or discrete (integer, date…) data types. They can be open (the bound isn’t part of the range) or closed (the bound is part of the range). A bound can also be infinite.
Without these datatypes, most people solve the range problems by using two columns in a table. These range types are much more powerful, as you can use many operators on them.
Here is the intersection between then 1000(open)-2000(closed) and 1000(closed)-1200(closed) numeric range:
SELECT ‘(1000,2000]’::numrange * ‘[1000,1200]‘::numrange;
?column?
————-
(1000,1200]
(1 row)
So you can query on things like: «give me all ranges that intersect this»:
=# SELECT * from test_range ;
period
—————————————————–
["2012-01-01 00:00:00+01","2012-01-02 12:00:00+01"]
["2012-01-01 00:00:00+01","2012-03-01 00:00:00+01"]
["2008-01-01 00:00:00+01","2015-01-01 00:00:00+01"]
(3 rows)
=# SELECT * FROM test_range WHERE period && ‘[2012-01-03 00:00:00,2012-01-03 12:00:00]‘;
period
—————————————————–
["2012-01-01 00:00:00+01","2012-03-01 00:00:00+01"]
["2008-01-01 00:00:00+01","2015-01-01 00:00:00+01"]
(2 rows)
This query could use an index defined like this:
=# CREATE INDEX idx_test_range on test_range USING gist (period);
You can also use these range data types to define exclusion constraints:
CREATE EXTENSION btree_gist ;
CREATE TABLE reservation (room_id int, period tstzrange);
ALTER TABLE reservation ADD EXCLUDE USING GIST (room_id WITH =, period WITH &&);
This means that now it is forbidden to have two records in this table where room_id is equal and period overlaps. The extension btree_gist is required to create a GiST index on room_id (it’s an integer, it is usually indexed with a btree index).
=# INSERT INTO reservation VALUES (1,’(2012-08-23 14:00:00,2012-08-23 15:00:00)’);
INSERT 0 1
=# INSERT INTO reservation VALUES (2,’(2012-08-23 14:00:00,2012-08-23 15:00:00)’);
INSERT 0 1
=# INSERT INTO reservation VALUES (1,’(2012-08-23 14:45:00,2012-08-23 15:15:00)’);
ERROR: conflicting key value violates exclusion constraint “reservation_room_id_period_excl”
DETAIL: Key (room_id, period)=(1, (“2012-08-23 14:45:00+02″,”2012-08-23 15:15:00+02″))
conflicts with existing key (room_id, period)=(1, (“2012-08-23 14:00:00+02″,”2012-08-23 15:00:00+02″)).
STATEMENT: INSERT INTO reservation VALUES (1,’(2012-08-23 14:45:00,2012-08-23 15:15:00)’);
One can also declare new range types.
Performance improvements
This version has performance improvements on a very large range of domains (non-exaustive):
- The most visible will probably be the Index Only Scans, which has already been introduced in this document.
- The lock contention of several big locks has been significantly reduced, leading to better multi-processor scalability, for machines with over 32 cores mostly.
- The performance of in-memory sorts has been improved by up to 25% in some situations, with certain specialized sort functions introduced.
- An idle PostgreSQL server now makes less wakeups, leading to lower power consumption. This is especially useful on virtualized and embedded environments.
- COPY has been improved, it will generate less WAL volume and fewer locks of a table’s pages.
- Statistics are collected on array contents, allowing for better estimations of selectivity on array operations.
- Text-to-anytype concatenation and quote_literal/quote_nullable functions are not volatile any more, enabling better optimization in some cases
- The system can now track IO durations