-1

I am trying to do in Postgres the equivalent of Pandas outer merge, in order to outer merge two tables.

Table df_1 contains these data:

enter image description here

Table df_2 contains these data:

enter image description here

So Table df_1 has one extra column (random_id) than df_2. Also, job_id 1711418 and worker_id 45430 exist in both df_1 and df_2.

If I use the "outer merge" method in Pandas: df_1.merge(df_2, on=['job_id', 'worker_id'], how='outer'), I would get the ideal result as shown below:

enter image description here

However, I wasn't able to find a SQL (postgres) equivalent of that Pandas outer merge method.

I've tried running the following query:

select *
from df_1
full outer join df_2
   on df_1.job_id = df_2.job_id 
  and df_1.worker_id = df_2.worker_id

However, the result generated 2 extra duplicate columns called job_id_duplicate_column_name_1 and worker_id_duplicate_column_name_1 which I don't want: enter image description here

Could anyone help suggest a query method that can achieve the same as outer merge method in Pandas?

Stanleyrr
  • 858
  • 3
  • 12
  • 31
  • 2
    Sample data is better presented as [formatted text](https://meta.stackoverflow.com/a/251362). See [here](https://meta.stackexchange.com/questions/81852) for some tips on how to create nice looking tables. –  Mar 02 '21 at 06:08
  • Default Pandas merge/join is a "natural" SQL join. Read the documentation before you use functionality. – philipxy Mar 02 '21 at 07:03
  • 1
    Does this answer your question? [Difference between natural join and inner join](https://stackoverflow.com/questions/8696383/difference-between-natural-join-and-inner-join) – philipxy Mar 02 '21 at 07:08
  • @philipxy, natural join doesn't work in my case because it only joins on matching records. In my use case there are some records that are distinct between df_1 and df_2 and I want to preserve all of them in the result. – Stanleyrr Mar 02 '21 at 16:48
  • I meant natural full join. I missed that the Q&A I found was only for natural inner vs plain inner rather than for natural vs plain for both inner & outer. Your special case of tables means you could do a natural right join. Also you'd need to map NULLs to NaN. Between the special case & NaN you might want to use UNION etc as in the accepted answer. I was focusing on natural vs plain; I'm not certain at this moment that that's the result you actually want. PS Please [use text, not images/links, for text--including tables & ERDs](https://meta.stackoverflow.com/q/285551/3404097). – philipxy Mar 02 '21 at 17:40
  • `df_1.merge(df_2, on=[LIST], how='outer')` is like SQL `df_1 full join df_2 using (LIST)`; when LIST is all common columns that's `df_1 natural full join df_2`. Only one copy of each USING or common column is returned. (Unmatched rows extended by NULLs.) – philipxy Mar 02 '21 at 19:44

1 Answers1

2

This looks like a UNION to me, not a JOIN:

select job_id, worker_id, random_id
from df_1
union all
select job_id, worker_id, 'NaN'
from df_2 d2
where not exists (select *
                  from df_1 d1
                  where d1.job_id = d2.job_id
                    and d1.worker_id = d2.worker_id)

The second part of the union returns all rows from df_2 that don't exist in df_1

An alternative would be to use EXCEPT to get the rows from df_2 that don't exist in df_1

select job_id, worker_id, random_id
from df_1
union all
select job_id, worker_id, 'NaN'
from 
(
  select job_id, worker_id
  from df_2 d2
  except
  select job_id, worker_id
  from df_1
) t

Online example

  • Thanks @a_horse_with_no_name! Your suggestion works as intended. I am gonna take some time to learn more about this approach. – Stanleyrr Mar 02 '21 at 16:50