Entire website open-sourced at: https://github.com/grokit/website_grokit_ca/. Feel free to make pull requests if you find mistakes!
Link to paper: ACM website for citations, full pdf.
SQL databases are typically row-oriented. These row-stores (RS) are well suited for applications such as customer relationship management, but ill-suited for online analytical processing (OLAP) such as data mining.
This paper compares the performance of RS and column-stores (CS) databases. The contributions are:
The main findings are:
Those results suggest that one should pick a database system that is well suited for the expected workload.
However, the reason RS optimizations did not yield good result does not mean that it is not possible to build a RS database that can have similar performances under OLAP workload given CS-like optimizations. The authors claim that current RS databases are not coded in a way that can take advantage of those optimizations.
Section 4 discusses optimizations that can be introduced to a RS database to mimic a CS database.
Although those optimization seem like a good idea, the authors show that none of them perform particularly well.
Vertical partitioning: entities are split in tables, one table per attribute.
Since entities are not necessarily stored in order (hence the need for IAM), each attribute table stores the value alongside its position id (~= primary key of row the attribute belong to).
For example, the name column of the example RS in 1.2.2 becomes a list of (position, value) tuples:
(2, Mary) (1, Paul) (n, Jeanne)
When doing a query on m columns, m-1 joins on the record ID must be done. In order to speed-up joins, the authors tried to either cluster-index each table or use hash joins.
Leave the data as in a RS, but create an index for every column.
Denormalize the data in tuples which fit the predicates that are often run on the database.
This section reviews optimization that a CS can use. Most of those optimizations hinge on the idea that in CS, you can do thing faster since you do not need to read / process unrelated data (that you necessarily have to skip over if processing row-by-row).
Data in a column has less variance than data across rows (less entropy), it can therefore be compressed more efficiently. The main gain from compression is reduced I/O, rather than saving disk space. One must take care to select appropriate compression algorithms where speed of decompression is optimized at the cost of saving disk space. Another saving from compression is that some predicates can operate directly on (smaller) compressed data (think of binary comparison for example).
Materialization is the creation of a tuple that represents information required to complete an operation, which can be any of the attribute from any of the tables (attribute in dimension tables must be fetched with a join operation).
To complete a query, it is necessary to fetch the information from the SELECT statement in addition to all information necessary to run all the predicates.
For example, in a typical SQL query:
SELECT <column_1, column_2, ..., column_n> FROM <table_1, table_2, ..., table_n> WHERE <predicate_1> AND <predicate_2> [...] AND <predicate_n>
... all predicates must return true for the engine to have to return the data from the SELECT line to the user. This means that it can skip fetching some of the data if a predicate is false. Therefore, the data is fetched in a pipeline join in order of predicate selectivity.
This means that the SQL engine will progressively build a tuple of data as it processes the predicates. When it needs data from another table, it does an implicit join, applies the predicate, if the predicate return true it appends the data to the tuple and keep going.
This process is called early materialization because the tuple is materialized from the moment the first predicate is run. If all predicates are run and the last one return false, the partially built tuple is simply discarded.
Assuming CS (data in a column is stored sequentially), it is very efficient to apply predicates column-by-column instead of row-by-row. Since in a CS, entities in a column are at a fixed offset, that allows to keep a simple binary mask (bit at offset i represents delegate on record at offset i) for all entities that are true for a predicate. Doing a binary AND for the bitmasks of all predicates yields the entities that pass all predicates. After this is done, the final output n-tuple can be materialized.
This has the potential to be faster than early materialization since:
Is a RS, some of the fields for a row might not be needed, but they need to be read anyways. Using CS / late materialization, no superfluous data is read.
In a RS, the data is stored in row-order, which means that after the original seek (which is the slowest operation in a read -- sequential read is extremely fast), only a bit of data can be read (the column required to run the predicate for the row) before having to seek again.
Since after doing the AND of all the columns it is possible to skip all rows for which one of the predicate is false. In RS / early materialization, the evaluation of predicate is done in a pipelined fashion, which means that if 9 out of 10 predicates are true and the 10th is false, a 9-tuple is constructed and thrown away. A 9-bit mask is necessarily smaller than a 9-tuple.
The last section of 5.2 covered block iteration and why it is faster. It is worth mentioning that CS can also take advantage of the fact that column data will either be all fixed width, or all variable width. This means that column that are fixed width can be processed much faster. In RS, if any of the column of an record is of variable width, the whole record become variable width and that does away with the possible optimizations.
Their innovation if a form of late materialization that they call invisible join with an added optimization that result in less out of order accesses.
In late materialization, when the output n-tuple is being constructed, only one of the column will likely be in sorted order. Because of locality of reference, reading the columns out-of-order is relatively slow.