Postgres-XC 1.0.4 Documentation | ||||
---|---|---|---|---|
Prev | Fast Backward | Chapter 44. Overview of Postgres-XC Internals | Fast Forward | Next |
Note: The following description applies only to Postgres-XC
Note: The following description applies both to Postgres-XC and PostgreSQL if not described explicitly.
In PostgreSQL, each transaction is given unique ID called transaction ID (or XID). XID is given in ascending order to distinguish which transaction is older/newer. [1] When a transaction tries to read a tuple, [2] each tuple has a set of XIDs to indicate transactions which created and deleted the tuple. So if the target tuple is created by an active transaction, it is not committed or aborted and the transaction should ignore such tuple. In such way (in practice, this is done by versup module in PostgreSQL core), if we give each transaction a unique transaction Id throughout the system and maintain snapshot what transaction is active, not only in a single server but transaction in all the servers, we can maintain global consistent visibility of each tuple even when a server accepts new statement from other transactions running on the other server.
These information is stored in "xmin" and "xmax" fields of each row of table. When we INSERT rows, XID of inserting transaction is recorded at xmin field. When we update rows of tables (with UPDATE or DELETE statement), PostgreSQL does not simply overwrite the old rows. Instead, PostgreSQL "marks" the old rows as "deleted" by writing updating transaction's XID to xmax field. In the case of UPDATE (just like INSERT), new rows are created whose xmin field is "marked" with XIDs of the creating transaction.
These "xmin" and "xmax" are used to determine which row is visible to a transaction. To do this, PostgreSQL needs a data to indicate what transactions are running, which is called the "snapshot".
If the creating transaction is not running, visibility of each row depends upon the fact if the creating transaction was committed or aborted. Suppose a row of a table which was created by some transaction and is not deleted yet. If the creating transaction is running, such row is visible to the transaction which created the row, but not visible to other transactions. If the creating transaction is not running and was committed the row is visible. If the transaction was aborted, this row is not visible.
Therefore, PostgreSQL needs two kinds of information to determine "which transaction is running" and "if an old transaction was committed or aborted."
The former information is obtained as "snapshot." PostgreSQL maintains the latter information as "CLOG."
PostgreSQL uses all these information to determine which row is visible to a given transaction.
Note: The following description applies only to Postgres-XC
In Postgres-XC, the following features of transaction management and visibility checking were picked up:
Assigning XID globally to transactions (GXID, Global Transaction ID). This can be done globally to identify each Transactions in the system.
Providing snapshot. GTM collects all the transaction's status (running, committed, aborted etc.) to provide snapshot globally (global snapshot). Please note that global snapshot includes GXID initiated by other Coordinators or Datanodes. This is needed because some older transaction may visit new server after a while. In this case, if GXID of such a transaction is not included in the snapshot, this transaction may be regarded as "old enough" and uncommitted rows may be read. If GXID of such transaction is included in the snapshot from the beginning, such inconsistency does not take place.
To do this, Postgres-XC introduced a dedicated component called GTM (Global Transaction Manager). GTM runs on one of the servers and provide unique and ordered transaction id to each transaction running on Postgres-XC servers. Because this is globally unique ID, we call this GXID (Global Transaction Id).
GTM receives GXID request from transactions and provide GXID. It also keep track of all the transactions when it started and finished to generate snapshot used to control each tuple visibility. Because snapshot here is also global property, it is called Global Snapshot.
As long as each transaction runs with GXID and Global Snapshot, it can maintain consistent visibility throughout the system and it is safe to run transactions in parallel in any servers. On the other hand, a transaction, composed of multiple statements, can be executed using multiple servers maintaining database consistency.
GTM provides Global Transaction Id to each transaction and keeps track of the status of all the transactions, whether it is running, committed or aborted, to calculate global snapshot to maintain tuple visibility.
For this purpose, each transaction reports when it starts and ends, as well as when it issues PREPARE command in two-phase commit protocol.
Each transaction requests snapshot according to the transaction isolation level as done in PostgreSQL. If the transaction isolation level is "read committed", then transaction will request a snapshot for each statement. If it is "serializable" transaction will request a snapshot at the beginning of transaction and reuse it thought the transaction.
Note: The following description applies only to Postgres-XC
Because GTM can be regarded as "serializing" all the transaction processing, people may think that GTM can be a performance bottleneck.
In fact, GTM can limit the whole scalability. GTM should not be used in very slow network environment such as wide area network. GTM architecture is intended to be used with Gigabit local network. We encourage to install Postgres-XC with local Gigabit network with minimum latency, that is, use as fewer switches involved in the connection among GTM, Coordinator and Datanodes.
Primitive GTM implementation can be done as follows:
Coordinator backend is provided with GTM client library to obtain GXID and snapshot and to report the transaction status.
GTM opens a port to accept connection from each Coordinator and Datanode backend. When GTM accepts a connection, it creates a thread (GTM Thread) to handle request to GTM from the connected Coordinator backend.
GTM Thread receives each request, record it and sends GXID, snapshot and other response to the Coordinator backend.
They are repeated until the Coordinator backend requests disconnect.
You may have been noticed that each transaction is issuing request to GTM so frequently and we can collect them into single block of requests in each Coordinator to reduce the amount of interaction, as GTM-Proxy.
In this configuration, each Coordinator and Datanode backend does not connect to GTM directly. Instead, we have GTM Proxy between GTM and Coordinator backend to group multiple requests and responses. GTM Proxy, like GTM explained in the previous sections, accepts connection from the Coordinator backend. However, it does not create new thread. The following paragraphs explains how GTM Proxy is initialized and how it handles requests from Coordinator backends.
GTM Proxy, as well as GTM, is initialized as follows:
GTM starts up normally, but now can accept connections from GTM proxies.
GTM Proxy starts up. GTM Proxy creates GTM Proxy Threads. Each GTM Proxy Threads connect to the GTM in advance. The number of GTM Proxy Threads can be specified at the startup. Typical number of threads is one or two so it can save the number of connections between GTM and Coordinators.
GTM Main Thread waits for the request connection from each backend.
When each Coordinator backend requests for connection, Proxy Main Thread assigns a GTM Proxy Thread to handle request. Therefore, one GTM Proxy Thread handles multiple Coordinator backends. If a Coordinator has one hundred Coordinator backends and one GTM Proxy Thread, this thread takes care of one hundred Coordinator backend.
Then GTM Proxy Thread scans all the requests from Coordinator backend. If Coordinator is more busy, it is expected to capture more requests in a single scan. Therefore, the proxy can group many requests into single block of requests, to reduce the number of interaction between GTM and the Coordinator.
Furthermore, in a single scan, we may have multiple request for snapshots. Because these requests can be regarded as received at the same time, we can represent multiple snapshots with single one. This will reduce the amount of data which GTM provides.
Note: The following description applies only to Postgres-XC
Coordinator handles SQL statements from applications and determine which Datanode should be involved and generates local SQL statements for each Datanode. In the most simplest case, if single Datanode is involved, the Coordinator simply proxies incoming statement to the Datanode. In more complicated case, for example, if the target Datanode cannot be determined, then the Coordinator generates local statements for each Datanode, collects the result to materialize at the Coordinator for further handling. In this case, the Coordinator will try to optimize the plan by
Pushdown WHERE clause to Datanodes,
Pushdown joins to Datanodes,
Pushdown projection (column list in SELECT clause),
Pushdown ORDER BY clause, as well as other clauses.
If a transaction is involved by more than one Datanodes and/or Coordinators, the Coordinator will handle the transaction with two-phase commit protocol internally.
In the case of aggregate functions, Postgres-XC introduced new function collection function between existing transition function and finalize function. Collection function runs on the Coordinator to collect all the intermediate results from involved Datanodes. For details, see Section 34.10 and CREATE AGGREGATE.
In the case of reading replicated tables, Coordinator can choose any Datanode to read. The most efficient way is to select one running in the same hardware or virtual machine. This is called preferred Datanode and can be specified by a GUC local to each Coordinator.
On the other hand, in the case of writing replicated tables, all the Coordinators choose the same Datanode to begin with to avoid update conflicts. This is called primary Datanode.
Coordinators also take care of DDL statements. Because DDL statements handles system catalogs, which are replicated in all the Coordinators and Datanodes, they are proxied to all the Coordinators and Datanodes. To synchronize the catalog update in all the nodes, the Coordinator handles DDL with two-phase commit protocol internally.
Note: The following description applies only to Postgres-XC
While Coordinators handle cluster-wide SQL statements, Datanodes take care of just local issues. In this sense, Datanodes are essentially PostgreSQL servers except that transaction management information is obtained from GTM, as well as other global value.
The number of connection between Coordinator and Datanode may increase from time to time. This may leave unused connection and waste system resources. Repeating real connect and disconnect requires Datanode backend initialization which increases latency and also wastes system resources.
For example, as in the case of GTM, if each Coordinator has one hundred connections to applications and we have ten Coordinators, after a while, each Coordinator may have connection to each data node. It means that each Coordinator backend has ten connections to Coordinators and each Coordinator has one thousand (10 x 10) connections to Coordinators.
Because we consume much more resources for locks and other control information per backend and only a few of such connection is active at a given time, it is not a good idea to hold such unused connection between Coordinator and Datanode.
To improve this, Postgres-XC is equipped with connection pooler between Coordinator and Datanode. When a Coordinator backend requires connection to a Datanode, the pooler looks for appropriate connection from the pool. If there's an available one, the pooler assigns it to the Coordinator backend. When the connection is no longer needed, the Coordinator backend returns the connection to the pooler. Pooler does not disconnect the connection. It keeps the connection to the pool for later reuse, keeping Datanode backend running.
[1] | More precisely, XID is 32bit integer. When XID reaches the max value, it wraps around to the lowest value (3, as to the latest definition). PostgreSQL has a means to handle this, as well as Postgres-XC. For simplicity, it will not be described in this document. |
[2] | This description is somewhat simplified for explanation. You will find the precise rule in tqual.c file in PostgreSQL's source code. |