PostgreSQL/Indices

Relational databases systems store huge amounts of data. Their value only becomes apparent when individual pieces can be retrieved fast enough. E.g., a naive query for a specific telephone number in a phone book with 100 million entries has to read on average 50 million entries. Fortunately, smart algorithms reduce the number of necessary read operations dramatically. A binary search will reduce them in the given example to maximal 27. Using smart algorithms is much more efficient than utilizing faster hardware - especially for huge numbers.

In our case of databases, the implementation of such algorithms is based on additional structures which repeat parts of the original data in their specific way. They are called indices and, of course, they come with some overhead. They occupy space on disc and in RAM; they generate additional effort, e.g., for sorting, and whenever the original data changes they must be maintained accordingly.

As mentioned, their primary purpose and biggest advantage is the acceleration of queries - with regards to identifying rows as well as sorting the resulting set of rows. Besides this, they support some constraints like uniqueness.

If indices exist, it's not sure that the system uses them. Because the system optimizes queries before it executes them, it sometimes decides to ignore an existing index and perform a full table scan instead. This may occur if the table is very small or if the selectiveness of the retrieved value is very low and will return a huge percentage of the existing rows.

PostgreSQL offers some extension mechanisms. Among other things, it is possible to add new data types to the system and integrate them into the existing index types. Beyond that, it's possible to develop application-specific operators to meet the needs of specialized applications, e.g., classification of pictures or music, clustering of arbitrary objects, detection of patterns in stock prices, ... . GIN, BRIN , GiST , and SP-GiST offer an interface (some kind of a template) which allows implementing index assisted domain-specific actions. The technique is called access method. Only B-Tree and Hash are conventional indices without such an extension mechanism.

B-Tree
B-Tree (Balanced Tree) is the default index type. It is suitable for use cases where numbers or short strings are often part of the  clause. Possible operators are the usual arithmetic operators: <, <=, =, >, >=.

Read more

GIN
GIN (Generalized Inverted Index) supports data types that are divisible into smaller components, e.g., elements of an array, words of a text document, or properties of a JSON object. We call them compound data types. In opposite to B-Trees, GIN does not generate a single index entry for the complete value but one index entry for each individual component.

Useable operators in a  clause depend on the data type:
 * Arrays:  (is contained in),   (contains),   (equal), and   (overlaps / has some common elements).
 * Text queries (lexems):  (contains).
 * JSON:  (JSON object field with the given key),   (JSON object field with the given key, as text).

Read more

BRIN
BRIN (Block-Range Index) is a structure that accelerates queries on tables that contain a huge number of rows (> millions) and where the rows occur in a certain physical order within the data file. Typical use cases are such where a column contains a timestamp or a generated sequence number that seldomly or never change over time, e.g.: IoT data, computed values, sensor output, log information.

The correlation between the physical order of rows in the data file and their content in the column of interest arises from the sequence of  commands and growing column values: later  s have to contain equal or higher values. It is possible that this correlation gets lost over time by later  commands. In this case, the benefit of BRIN may get lost.

The power of BRIN results from the fact that it needs only very little space. Typical BRIN sizes for a table with hundreds of millions of rows are some kB, which easily fits into RAM. All other index types need much more space, 25 - 50% of the table size is not unusual.

Read more

Hash
PostgreSQL uses two fundamental strategies to implement Hash indices. First, a hash function maps column values of any type and length to a hash value of a fixed size of 32 bit. Such hash values together with the TIDs of their originating rows are the basic bricks for the Hash index. Second, an elaborated algorithm ensures that the size of the index file grows smoothly (that is, in a small amount of pages at one point in time) when additional index entries occur. Hence, it's an extendible hash.

To save space, the hash index doesn't store the original column value but only the computed hash value. This has some implications. The sort order of the computed hash values haven't any relation to the sort order of the original values. Therefore this index type can support only the  operator, but none of the other comparison operators like   or. Additionally, there is the danger of duplicates. Two different column values can create the same hash value. This is unavoidable because there are many more possible column values (any length) than possible hash values (fixed size). Thus, after reading the row according to the found TID, it's necessary to re-evaluate the column value from the heap.

Read more

GiST and SP-GiST
GiST stands for Generalized Search Tree and implements - similar to B-Tree - a balanced tree structure. This is useful for all kinds of B-Tree and R-Tree structures. Some PostgreSQL extensions use them, e.g.:, hstore (key/value pairs), intarray (array of Integers), ltree ('Labels' like 'World.Countries.Europe.Russia'), pg_trgm (trigram matching), ... .

SP-Gist stands for Space-Partitioned Generalized Search Tree and implements non-balanced tree structures, mainly for object types that contain similar or equal object types. This is useful for quad-trees, k-d trees, radix trees, ... .

Bloom
The above-mentioned index types and index access methods are an integral part of every PostgreSQL installation.

An additional index access method is bloom. Bloom must be explicitly installed using PG's extension mechanism (you have to run  once before you can use this index type). This extension implements a Bloom filter that offers an advantage over B-trees in cases where the first columns of a multicolumn index are not specified in the  condition of an SQL statement.