
Commit 7828ab3

tglsfdc authored and Commitfest Bot committed
Improve hash join's handling of tuples with null join keys.
In a plain join, we can just summarily discard an input tuple with null join key(s), since it cannot match anything from the other side of the join (assuming a strict join operator). However, if the tuple comes from the outer side of an outer join then we have to emit it with null-extension of the other side. Up to now, hash joins did that by inserting the tuple into the hash table as though it were a normal tuple. This is unnecessarily inefficient though, since the required processing is far simpler than for a potentially-matchable tuple. Worse, if there are a lot of such tuples they will bloat the hash bucket they go into, possibly causing useless repeated attempts to split that bucket or increase the number of batches. We have a report of a large join vainly creating many thousands of batches when faced with such input.

This patch improves the situation by keeping such tuples out of the hash table altogether, instead pushing them into a separate tuplestore from which we return them later. (One might consider trying to return them immediately; but that would require substantial refactoring, and it doesn't work anyway for the case where we rescan an unmodified hash table.)

This works even in parallel hash joins, because whichever worker reads a null-keyed tuple can just return it; there's no need for consultation with other workers. Thus the tuplestores are local storage even in a parallel join.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3061845.1746486714@sss.pgh.pa.us
1 parent 06761b6 commit 7828ab3
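
The routing described in the commit message can be illustrated with a toy model. This is a minimal sketch, not the patch's code: `ToyTuple`, `ToyBuildState`, `toy_hash`, and `toy_build_insert` are hypothetical names, the fixed-size array stands in for a real tuplestore, and per-bucket counters stand in for bucket chains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUCKETS   8
#define TOY_MAX_STORED 64

/* Hypothetical tuple: one nullable join key plus a payload. */
typedef struct ToyTuple {
    bool    key_is_null;
    int32_t key;
    int     payload;
} ToyTuple;

/* Hypothetical build-side state. */
typedef struct ToyBuildState {
    size_t   bucket_count[TOY_NBUCKETS];  /* stands in for bucket chains */
    ToyTuple null_store[TOY_MAX_STORED];  /* stands in for the tuplestore */
    size_t   n_null;
    size_t   n_hashed;
    size_t   n_discarded;
    bool     keep_null_tuples;            /* true when null-extension is needed */
} ToyBuildState;

/* Illustrative multiplicative hash; not PostgreSQL's hash function. */
static uint32_t toy_hash(int32_t key)
{
    return (uint32_t) key * 2654435761u;
}

/* Route one input tuple: hash table, side store, or discard. */
static void toy_build_insert(ToyBuildState *st, const ToyTuple *tup)
{
    if (!tup->key_is_null) {
        /* normal case: potentially matchable, goes into the hash table */
        st->bucket_count[toy_hash(tup->key) % TOY_NBUCKETS]++;
        st->n_hashed++;
    } else if (st->keep_null_tuples) {
        /* cannot match, but must be emitted later with null-extension */
        assert(st->n_null < TOY_MAX_STORED);
        st->null_store[st->n_null++] = *tup;
    } else {
        /* cannot match and need not be emitted: drop it */
        st->n_discarded++;
    }
}
```

With this shape, any number of null-keyed tuples leaves every bucket untouched, which is the property the commit message relies on to avoid pointless bucket splits and batch increases.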

13 files changed: 381 additions, 71 deletions


src/backend/executor/execExpr.c

Lines changed: 12 additions & 10 deletions
@@ -4274,25 +4274,27 @@ ExecBuildHash32FromAttrs(TupleDesc desc, const TupleTableSlotOps *ops,
  * 'hash_exprs'. When multiple expressions are present, the hash values
  * returned by each hash function are combined to produce a single hash value.
  *
+ * If any hash_expr yields NULL and the corresponding hash operator is strict,
+ * the created ExprState will return NULL. (If the operator is not strict,
+ * we treat NULL values as having a hash value of zero. The hash functions
+ * themselves are always treated as strict.)
+ *
  * desc: tuple descriptor for the to-be-hashed expressions
  * ops: TupleTableSlotOps for the TupleDesc
  * hashfunc_oids: Oid for each hash function to call, one for each 'hash_expr'
- * collations: collation to use when calling the hash function.
- * hash_expr: list of expressions to hash the value of
- * opstrict: array corresponding to the 'hashfunc_oids' to store op_strict()
+ * collations: collation to use when calling the hash function
+ * hash_exprs: list of expressions to hash the value of
+ * opstrict: strictness flag for each hash function's comparison operator
  * parent: PlanState node that the 'hash_exprs' will be evaluated at
  * init_value: Normally 0, but can be set to other values to seed the hash
  * with some other value. Using non-zero is slightly less efficient but can
  * be useful.
- * keep_nulls: if true, evaluation of the returned ExprState will abort early
- * returning NULL if the given hash function is strict and the Datum to hash
- * is null. When set to false, any NULL input Datums are skipped.
  */
 ExprState *
 ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
                     const Oid *hashfunc_oids, const List *collations,
                     const List *hash_exprs, const bool *opstrict,
-                    PlanState *parent, uint32 init_value, bool keep_nulls)
+                    PlanState *parent, uint32 init_value)
 {
     ExprState *state = makeNode(ExprState);
     ExprEvalStep scratch = {0};
@@ -4369,8 +4371,8 @@ ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
         fmgr_info(funcid, finfo);

         /*
-         * Build the steps to evaluate the hash function's argument have it so
-         * the value of that is stored in the 0th argument of the hash func.
+         * Build the steps to evaluate the hash function's argument, placing
+         * the value in the 0th argument of the hash func.
          */
         ExecInitExprRec(expr,
                         state,
@@ -4405,7 +4407,7 @@ ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
         scratch.d.hashdatum.fcinfo_data = fcinfo;
         scratch.d.hashdatum.fn_addr = finfo->fn_addr;

-        scratch.opcode = opstrict[i] && !keep_nulls ? strict_opcode : opcode;
+        scratch.opcode = opstrict[i] ? strict_opcode : opcode;
         scratch.d.hashdatum.jumpdone = -1;

         ExprEvalPushStep(state, &scratch);

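The NULL rule stated in the revised header comment of ExecBuildHash32Expr can be modeled outside the expression machinery. The following is a rough sketch under stated assumptions: `ToyHashKey`, `toy_rotl1`, and `toy_hash32` are hypothetical names, the per-key hash values are supplied precomputed, and the rotate-then-XOR combining step is an illustrative choice rather than something shown in this diff.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-key input: nullness, precomputed hash, operator strictness. */
typedef struct ToyHashKey {
    bool     is_null;
    uint32_t hash;       /* ignored when is_null is true */
    bool     op_strict;  /* plays the role of opstrict[i] */
} ToyHashKey;

/* Rotate left by one bit (assumed combining step, for illustration only). */
static uint32_t toy_rotl1(uint32_t v)
{
    return (v << 1) | (v >> 31);
}

/*
 * Combine per-key hashes into one value.  Returns false to model
 * "the ExprState returns NULL": a NULL key under a strict operator.
 * A NULL key under a non-strict operator contributes a hash of zero.
 */
static bool toy_hash32(const ToyHashKey *keys, size_t nkeys,
                       uint32_t init_value, uint32_t *result)
{
    uint32_t h = init_value;

    for (size_t i = 0; i < nkeys; i++) {
        if (keys[i].is_null && keys[i].op_strict)
            return false;           /* whole result is NULL */
        h = toy_rotl1(h);
        if (!keys[i].is_null)
            h ^= keys[i].hash;      /* NULL acts as XOR with zero */
    }
    *result = h;
    return true;
}
```

A caller in this model would treat a `false` return the way the build loops treat `isnull`: the tuple has no usable hash value and is either set aside or discarded.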
src/backend/executor/nodeHash.c

Lines changed: 55 additions & 11 deletions
@@ -153,8 +153,11 @@ MultiExecPrivateHash(HashState *node)
     econtext = node->ps.ps_ExprContext;

     /*
-     * Get all tuples from the node below the Hash node and insert into the
-     * hash table (or temp files).
+     * Get all tuples from the node below the Hash node and insert the
+     * potentially-matchable ones into the hash table (or temp files). Tuples
+     * that can't possibly match because they have null join keys are dumped
+     * into a separate tuplestore, or just summarily discarded if we don't
+     * need to emit them with null-extension.
      */
     for (;;)
     {
@@ -174,6 +177,7 @@ MultiExecPrivateHash(HashState *node)

         if (!isnull)
         {
+            /* normal case with a non-null join key */
             uint32 hashvalue = DatumGetUInt32(hashdatum);
             int bucketNumber;

@@ -192,6 +196,14 @@ MultiExecPrivateHash(HashState *node)
             }
             hashtable->totalTuples += 1;
         }
+        else if (node->keep_null_tuples)
+        {
+            /* null join key, but we must save tuple to be emitted later */
+            if (node->null_tuple_store == NULL)
+                node->null_tuple_store = ExecHashBuildNullTupleStore(hashtable);
+            tuplestore_puttupleslot(node->null_tuple_store, slot);
+        }
+        /* else we can discard the tuple immediately */
     }

     /* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
@@ -222,7 +234,6 @@ MultiExecParallelHash(HashState *node)
     HashJoinTable hashtable;
     TupleTableSlot *slot;
     ExprContext *econtext;
-    uint32 hashvalue;
     Barrier *build_barrier;
     int i;

@@ -282,6 +293,7 @@ MultiExecParallelHash(HashState *node)
             for (;;)
             {
                 bool isnull;
+                uint32 hashvalue;

                 slot = ExecProcNode(outerNode);
                 if (TupIsNull(slot))
@@ -295,8 +307,19 @@ MultiExecParallelHash(HashState *node)
                                                      &isnull));

                 if (!isnull)
+                {
+                    /* normal case with a non-null join key */
                     ExecParallelHashTableInsert(hashtable, slot, hashvalue);
-                hashtable->partialTuples++;
+                    hashtable->partialTuples++;
+                }
+                else if (node->keep_null_tuples)
+                {
+                    /* null join key, but save tuple to be emitted later */
+                    if (node->null_tuple_store == NULL)
+                        node->null_tuple_store = ExecHashBuildNullTupleStore(hashtable);
+                    tuplestore_puttupleslot(node->null_tuple_store, slot);
+                }
+                /* else we can discard the tuple immediately */
             }

             /*
@@ -404,14 +427,10 @@ ExecInitHash(Hash *node, EState *estate, int eflags)

     Assert(node->plan.qual == NIL);

-    /*
-     * Delay initialization of hash_expr until ExecInitHashJoin(). We cannot
-     * build the ExprState here as we don't yet know the join type we're going
-     * to be hashing values for and we need to know that before calling
-     * ExecBuildHash32Expr as the keep_nulls parameter depends on the join
-     * type.
-     */
+    /* these fields will be filled by ExecInitHashJoin() */
     hashstate->hash_expr = NULL;
+    hashstate->null_tuple_store = NULL;
+    hashstate->keep_null_tuples = false;

     return hashstate;
 }
@@ -2753,6 +2772,31 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
     }
 }

+/*
+ * Build a tuplestore suitable for holding null-keyed input tuples.
+ * (This function doesn't care whether it's for outer or inner tuples.)
+ *
+ * Note that in a parallel hash join, each worker has its own tuplestore(s)
+ * for these. There's no need to interact with other workers to decide
+ * what to do with them. So they're always in private storage.
+ */
+Tuplestorestate *
+ExecHashBuildNullTupleStore(HashJoinTable hashtable)
+{
+    Tuplestorestate *tstore;
+    MemoryContext oldcxt;
+
+    /*
+     * We keep the tuplestore in the hashCxt to ensure it won't go away too
+     * soon. Size it at work_mem/16 so that it doesn't bloat the node's space
+     * consumption too much.
+     */
+    oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
+    tstore = tuplestore_begin_heap(false, false, work_mem / 16);
+    MemoryContextSwitchTo(oldcxt);
+    return tstore;
+}
+
 /*
  * Reserve space in the DSM segment for instrumentation data.
  */

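The work_mem/16 sizing in ExecHashBuildNullTupleStore can be made concrete with a small model. This is only a sketch: `ToyStore`, `toy_store_begin`, and `toy_store_put` are hypothetical, work_mem is assumed to be expressed in kilobytes, and the model simply counts tuples that no longer fit the in-memory budget instead of reproducing how a real tuplestore moves its contents to disk.

```c
#include <stddef.h>

/* Hypothetical stand-in for a tuplestore with an in-memory budget. */
typedef struct ToyStore {
    size_t budget_bytes;  /* in-memory allowance */
    size_t used_bytes;    /* bytes currently held in memory */
    size_t in_memory;     /* tuples that fit within the budget */
    size_t spilled;       /* tuples that would have gone to disk */
} ToyStore;

/* Give the store one sixteenth of work_mem (assumed to be in kB). */
static ToyStore toy_store_begin(int work_mem_kb)
{
    ToyStore s = {0};

    s.budget_bytes = (size_t) (work_mem_kb / 16) * 1024;
    return s;
}

/* Account for one tuple of the given size. */
static void toy_store_put(ToyStore *s, size_t tuple_bytes)
{
    if (s->used_bytes + tuple_bytes <= s->budget_bytes) {
        s->used_bytes += tuple_bytes;
        s->in_memory++;
    } else {
        /* simplification: just count it as spilled */
        s->spilled++;
    }
}
```

For example, with work_mem at 4096 kB the toy budget is 256 kB, so the side store claims only a small slice of the node's memory while the hash table keeps the rest.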