I have a table with production data, Prod. There are two fields in Prod, A and B, that I use as keys (both of which are VARCHARs). I have another table, Stage, that I want to import into Prod. However, before I import Stage, I want to check whether Stage has rows that are already in Prod. Any duplicate rows are to be excluded from the import.

The problem I have is as follows:

When I run a query such as

SELECT A, B 
FROM Stage 
WHERE A || B NOT IN (
    SELECT A || B 
    FROM Prod
)

I expect that I will receive a list of all non-duplicate (new) entries. However, I receive no results.

Furthermore, when I run

SELECT A, B 
FROM Stage 
WHERE A || B IN (
    SELECT A || B 
    FROM Prod
)

where the only difference is changing NOT IN to IN, only a subset of the table is returned, whereas I would expect the entire table.

I know the issue has something to do with the concatenation (||) operator because when I run

SELECT A 
FROM Stage 
WHERE A NOT IN (
    SELECT A FROM Prod
)

rows are returned and the IN version of the query returns the remaining rows.

Does anyone have any thoughts?

  • What are the A and B types in Stage and Prod? Commented Mar 27, 2014 at 19:30
  • You should use where (a,b) not in (select a,b ...). The concatenation can lead to errors because abc could mean a,bc or ab,c. If a or b can be null, then not in will not return anything if there is at least one row where one of them is null. Commented Mar 27, 2014 at 19:39
  • @ClodoaldoNeto A and B are VARCHAR in production as well Commented Mar 27, 2014 at 20:21
  • @a_horse_with_no_name That works perfectly. Thank you so much. If you want to write it up as answer, I would be happy to mark it as the correct answer. Commented Mar 27, 2014 at 20:24

1 Answer

Your statement has two problems:

First: string concatenation will not work as you expect, because the comparison cannot distinguish between the tuples ('a','bc') and ('ab','c'): both produce the same concatenated value, 'abc'.
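To see the ambiguity directly, here is a minimal sketch that compares the two pairs both ways. It assumes PostgreSQL (the ::text casts are PostgreSQL syntax), and the column aliases are only illustrative names.

```sql
-- Two different (a, b) pairs collapse into the same string once concatenated,
-- while a tuple comparison still tells them apart.
SELECT 'a'::text || 'bc'::text                             AS first_pair,           -- 'abc'
       'ab'::text || 'c'::text                             AS second_pair,          -- 'abc'
       ('a'::text || 'bc'::text) = ('ab'::text || 'c'::text) AS concatenation_equal, -- true
       ('a'::text, 'bc'::text) = ('ab'::text, 'c'::text)   AS tuple_equal;          -- false
```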

Using a real tuple comparison is the correct way:

where (a,b) not in (select a,b ...)

Now to the second problem:

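A NOT IN comparison where the "comparison list" contains NULL never evaluates to true: it is false when a match is found and "unknown" otherwise, because any comparison with null yields "unknown", so the database cannot reliably decide whether the value from the "left hand side" is in that list or not. A WHERE clause only keeps rows for which the condition is true, so the result is empty. In your query, A || B is NULL whenever A or B is NULL, so a single such row in Prod is enough to make the NOT IN version return nothing.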
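The behaviour can be checked without any tables. The following is a small sketch, again assuming PostgreSQL for the ::text casts; the literal values and aliases are made up for illustration.

```sql
-- NOT IN against a list that contains NULL is never true.
SELECT 'x'::text NOT IN ('y', NULL) AS not_in_result,  -- NULL (unknown); a WHERE clause would drop the row
       'x'::text IN ('x', NULL)     AS in_result,      -- true; a match is found despite the NULL
       'a'::text || NULL::text      AS concatenated;   -- NULL; concatenating with NULL yields NULL
```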

You wrote that SELECT A FROM Stage WHERE A NOT IN (SELECT A FROM Prod) returns rows, which means that there are no null values in prod.a, but apparently there are some in prod.b.

If you want to ignore the null values you can use something like this:

select a,b
from stage 
where (a,b) not in (select a,b 
                    from prod
                    where b is not null);
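As a self-contained sketch of that query, the version below replaces the real tables with inline sample data, so the rows in prod and stage are invented purely for illustration. It assumes PostgreSQL.

```sql
-- Hypothetical sample data standing in for the real tables.
WITH prod(a, b) AS (
    VALUES ('k1'::text, 'v1'::text),
           ('k3', NULL)             -- a key row with a NULL in b
),
stage(a, b) AS (
    VALUES ('k1'::text, 'v1'::text),  -- already in prod
           ('k3', 'v3'),              -- same a as the NULL row in prod
           ('k4', 'v4')               -- entirely new
)
SELECT s.a, s.b
FROM stage s
WHERE (s.a, s.b) NOT IN (SELECT p.a, p.b
                         FROM prod p
                         WHERE p.b IS NOT NULL)
ORDER BY s.a;
-- returns ('k3', 'v3') and ('k4', 'v4')
```

Without the IS NOT NULL filter, the ('k3', 'v3') row would be dropped, because its comparison against ('k3', NULL) is unknown rather than false.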

Another option would be to treat null as something else, e.g. an empty string (note that null and '' then compare as equal):

select a,b
from stage 
where (a,coalesce(b, '')) not in (select a, coalesce(b, '')
                                  from prod);

This problem does not occur when using the in operator, so

select a,b
from stage 
where (a,b) in (select a,b 
                from prod);

is safe to use, even with null values.

However, if you use those two columns as "keys", you shouldn't allow null values in them in the first place.

Btw: this is not something specific to Postgres, this is how SQL in general works.
