Joining tables in Google Cloud Spanner poses is challenging due to its distributed setup. At OTTO, where Google Cloud Spanner is integral to multiple services, achieving optimal join performance is crucial to ensure a great customer experience.
Spanner differs from typical SQL databases – it's highly distributed, requiring careful consideration when crafting join queries. Additionally, it doesn't fully adhere to the SQL standard. This article explores the unique aspects of joins in Spanner and offers insights into optimizing their performance.
Spanner is a distributed database system. This architecture has implications for all executed queries, especially those covering multiple tables, such as joins.
Firstly, everything in Spanner can be thought of as a network call. Simplified: when you select columns, Spanner makes a network call to the server where the table resides, executes the query, and then transfers the data to you. Thus, if you join two tables, there's a high likelihood that those tables are stored on different servers. This scenario can result in a significant number of network calls. Given that networks are inherently unreliable, this can directly impact performance. You can gauge this by examining your query's execution plan. If it contains distributed operators, Spanner has to make several network calls to execute it. The fewer the network calls, the better the performance.
Let's examine how to minimize distributed operators in the first place.
Every time you use a secondary index and query for columns that are not in the index, Spanner does a back-join to get the missing columns. You can think of the index as a table that has the indexed column as a Key and the primary key of the original table as a Column. Spanner then uses a join between the index table and the original table to return the query result. If you only think about the index, this is counterintuitive.
There are three simple ways to mitigate the problem:
Every solution has its advantages and disadvantages. The first solution must be done on the application side but is very flexible. You do not have to alter the index at all.
The second solution – my preferred one – uses more disk space as the columns are effectively duplicated.
The third way has the same downsides as the second one, yet for each insert, the index must be updated on multiple columns, which is often slower than just storing the columns.
If you anticipate that certain tables will be frequently joined, for example, customers and shipping addresses, consider using interleaved tables in Spanner. When two tables are interleaved, Spanner stores them closely together, reducing the need for network calls.
However, this approach is only viable for genuine parent-child relationships. If you interleave table A and B, you cannot insert into table B without a corresponding parent in table A; thus, interleaved tables aren't suitable for every situation.
As this article focuses on the general join use case, we will not delve deeper into the topic of interleaved tables. For more details, please refer to the official documentation.
Most of the time, we do not think about what happens if we type something like:
select A.*, B.Foo from A join B on A.B_ID = B.ID
And to be honest, this is great because it reduces the cognitive load while using a database. However, as mentioned above, if you need a performant query, you have to invest some time and figure out what works best.
In the default scenario, Spanner’s query optimizer will choose the appropriate join type for your query, based on statistics collected in the background. However, if you add a new join, maybe on new tables, the optimizer does not have enough knowledge and you may want to enforce one join method to keep the performance up. You can do this with join hints:
select A.*, B.Foo from A join@{JOIN_METHOD=apply_join} B on A.B_ID = B.ID
We will take a closer look at different join types in Spanner right now. For illustrative purposes, we will use this join on the given schema:
SELECT c.FirstName, o.ProductName
FROM Customers c
JOIN Orders o ON c.CustomerID = o.CustomerID;
CREATE TABLE Customers (
CustomerID INT64 NOT NULL,
FirstName STRING(255),
Email STRING(255)
) PRIMARY KEY (CustomerID);
CREATE TABLE Orders (
OrderID INT64 NOT NULL,
CustomerID INT64 NOT NULL,
ProductName STRING(255)
) PRIMARY KEY (OrderID);
Hash Joins are a good choice if you want to join two large tables. A Hash Join runs in two phases.
First, it takes the “build” table (the table that has fewer rows, in our case, it's Customers) and creates an in-memory hash table where the hash key is derived from the join. In our scenario, the hash key will be based on CustomerID.
After that, the other table (the “probe” table, Orders in our case) is scanned. For each row in Orders, the hash key of CustomerID is calculated and then searched in the Hash Table built previously. If there is a match, the row will be in the result set.
You can enforce a Hash Join in Spanner in two ways:
select A.*, B.Foo from A hash join B on A.B_ID = B.ID
select A.*, B.Foo from A join@{JOIN_METHOD=hash_join} B on A.B_ID = B.ID
You can provide additional join hints for Hash Joins:
Hash Joins come with their advantages and disadvantages.
Advantages
Disadvantages
You can improve the performance of a Hash Join by reducing the size of the build table with an appropriate WHERE condition. Adding an index to the columns in the WHERE condition will speed up the creation of the build table, but does not affect the underlying join.
Hash Joins are useful for large ad hoc joins and data analysis but may not be the best choice to fetch the orders for a single customer.
Apply Joins can also be optimized for larger tables. If the right side has an index on the join column, Spanner only needs to loop over the left side and can use the index to efficiently fetch the rows from the right side. When the left side is narrowed down via an appropriate WHERE condition (ideally backed up by an index), Apply Joins can efficiently handle larger tables. In some scenarios, they can be more efficient than Hash Joins, especially when a small subset of both tables is required.
You can enforce an Apply Join via:
select A.*, B.Foo from A join@{JOIN_METHOD=apply_join} B on A.B_ID = B.ID
A Merge Join takes both sides of the join and sorts them by the join columns. Spanner then reads the beginning of both sorted tables and compares them based on the join condition. If they match, the merged rows are added to the result set. If the join column of one table is smaller than the other, that row is skipped, and the next row from that table is read. If you need another mental model, think of Python's zip function, which works similarly (yet different).
A Merge Join is efficient if the data is already sorted, or can be quickly sorted using an index on the join columns.
Similar to Hash Joins, a Merge Join performs a full table scan on both tables. Therefore, the downsides of both join types are alike. The performance comes from the fact that neither hashing nor looping is needed with the sorted tables. If you only need a small subset of both tables, consider an Apply Join first.
It's worth noting that a Merge Join is never choosen by the query optimizer by default, so you'll need to enforce it with:
select A.*, B.Foo from A join@{JOIN_METHOD=merge_join} B on A.B_ID = B.ID
In day to day work, Merge Joins are not likely to be your first choice, but it is important to keep in mind that they exist.
A Push Broadcast Hash Join acts as a distributed version of a Hash Join. Although the merge logic remains the same as the standard Hash Koin, in this method, the build side is serialized into a batch and dispatched to all remote servers holding data from the probe side. Once received, the batch is deserialized and the Hash Join logic takes over. So, when is this approach preferable to a regular Hash Join?
Push Broadcast Hash Joins facilitate parallel processing of the join, potentially resulting in swifter execution times. They tend to be more efficient when the build side is significantly smaller than the probe side. However, if both sides are of comparable size, transmitting the hash table to all the other servers might introduce a network overhead that isn't counteracted by faster query execution. Another consideration is data distribution. For instance, if 80% of the probe side data resides on one server and the remaining 20% is dispersed across multiple servers, the advantages might be negligible.
It's evident that this join type caters to specific scenarios, much like the merge join. Consequently, it isn't selected by the query optimizer by default. To enable it, use:
select A.*, B.Foo from A join@{JOIN_METHOD=push_broadcast_hash_join} B on A.B_ID = B.ID
Now that we got to know the different join methods in Spanner, let us summarize how to improve join performance.
Want to become part of the team?
We have received your feedback.