Python, dataframe sql join

Question

I have two python functions that query a database directly. Is there a way to join these 2 functions within python?

I want to do a couple of joins not really sure how to do that in python.

Query 1:

def query1(businessDate):
    con = pyodbc.connect(r'DSN='+'Stack',autocommit=True)  
    print('working')
    #businessDate = r"'2019-03-13'"
    #remember business date should be entered like "'2019-03-13'"

    sql = f"""

     SELECT 
       iddate, 
       businessdate, 
       stack, identifier
     FROM stackoverflow
     where stack is not null 
     and businessdate = {businessDate}

    """

    df_stack = pd.read_sql(sql,con)

    con.close()

    return(df_stack)

query 2:

    def superuser(businessDate):  
    con = pyodbc.connect(r'DSN='+'super',autocommit=True)  
    print('working')
    #remember business date should be entered like "'2019-03-13'"

    sql = f"""
    SELECT 
      iddate,
      businessdate,
      stack, identifier
    FROM superuser
    WHERE stack is not null 
    and businessdate = {businessDate}


    """

    df_super = pd.read_sql(sql,con)

    con.close()

    return(df_super)

I'd want to do a left outer join table 1 with table 2 on identifier, stack, iddate and businessdate

Trying:

def testjoin():
    con = pyodbc.connect(r'DSN='+'Stack',autocommit=True)  
    print('working') 

    pd.merge(df_stack,df_super, on = ['identifier','stack','iddate'])

    df_test = pd.read_sql(sql,con)

    con.close()

    return(df_test)

trying 2:

def testjoin():
    con = pyodbc.connect(r'DSN='+'Stack',autocommit=True)  
    print('working') 

    df_stack= query1("'2019-03-13'")
    df_super= superuser("'2019-03-13'")

    pd.merge(df_stack,df_super, on = ['identifier','stack','iddate'])

    df_test = pd.read_sql(sql,con)

    con.close()

    return(df_test)

getting error name 'sql' is not defined'

ramazanbozkir · Answer 1 · 2019-06-21T20:24:05.897

3

Left Outer Join

SELECT *
FROM df_stack
LEFT OUTER JOIN df_super
    ON df_stack.stack = df_super.stack
    ON df_stack.identifier= df_super.identifier 
    ON df_stack.iddate = df_super.iddate
    ON df_stack.businessdate = df_super.businessdate;

pd.merge(df_stack,df_super,
         on=['iddate','businessdate', 'stack', 'identifier'], 
         how='left')

edited Jun 21 '19 at 20:24

answered May 07 '19 at 12:56

ramazanbozkir

161
3

when I call my testjoin function I get df_stack not defined. what could I be doing wrong? `NameError: name 'df_stack' is not defined` – excelguy May 10 '19 at 21:09
Unless `df_stack` is a global variable it won't be in the scope of the `testjoin()` function unless you pass it as a parameter – Matthew Barlowe May 10 '19 at 21:18
For PEP8 and the sake of my eyes, please don't use `\\` for line continuation when it is not required. (not required within parenthesis, brackets and braces) :) – FabienP May 10 '19 at 21:25
1

@excelguy try, replacing `df_stack` with `query1()` and `df_super` with `query2()` inside the `testjoin` function definition. – fabianegli May 11 '19 at 11:19
@MatthewBarlowe How can I do this? I have all these functions in the same .py file. – excelguy May 13 '19 at 13:29
@fabianegli i've tried sigh, getting this error: `'type {left}'.format(left=type(left))) ValueError: can not merge DataFrame with instance of type ` – excelguy May 13 '19 at 13:30
@excelguy in the code you have above in your `testjoin()` you don't define `df_stack` before you call it in the function judging by your code the simplest way to do so would be to put `df_stack = query1(BusinessDate)` somewhere before you call `df_stack` in the `testjoin()` function – Matthew Barlowe May 13 '19 at 14:42
Thanks @MatthewBarlowe , getting closer here. Now getting a "name `'sql'` is not defined" however I have it defined in the functions i'm calling.. shouldnt it be defined because of that?. See the above edit I made. – excelguy May 13 '19 at 14:59
@excelguy yes but those definitions only exist inside those functions what you can do is define it again inside the `testjoin` function or define it outside all the functions in the script itself which is called defining it globally which means all your functions have access to it or pass it as a parameter to the function . Please read about [variable scope in Python](https://matthew-brett.github.io/teaching/global_scope.html) – Matthew Barlowe May 13 '19 at 15:04
@MatthewBarlowe I understand, I dont have my sql statements outside (or above) my functions, so it cant be called in. What is a more clever way to call this in? I use the variable name sql twice in the functions im calling. – excelguy May 13 '19 at 19:18

score 1 · Accepted Answer · answered May 13 '19 at 19:41

Ok I'm going to post this as an answer instead of the comments as there are several ways which to do what you are asking. sql is not defined as I noted in the comment because it is outside the scope of the function testjoin().

One way to is to treat the SQL string as a global variable and then access it inside the function.

sql = '''
  SELECT 
    iddate, 
    businessdate, 
    stack, identifier
  FROM stackoverflow
  where stack is not null 
  and businessdate = {businessDate}
'''

def testjoin():
    con = pyodbc.connect(r'DSN='+'Stack',autocommit=True)  
    print('working') 

    df_stack= query1("'2019-03-13'")
    df_super= superuser("'2019-03-13'")

    pd.merge(df_stack,df_super, on = ['identifier','stack','iddate'])

    df_test = pd.read_sql(sql.format(businessDate="'2019-03-14'"),con)

    con.close()

    return(df_test)

The reason I used .format() instead of an f string is that an f string requires the variable to be declared at the time the f string is created. Which if you did not have businessdate as a variable it would be an error. .format() allows you to place the variable in the string and then change it's value whenever you want. I would do it this way if the main part of your query isn't going to change that much and you just need to filter by date.

The other way would to build the string outside the function and then pass it in as a parameter

businessdate = "'2019-03-13'"
sql = f'''
  SELECT 
    iddate, 
    businessdate, 
    stack, identifier
  FROM stackoverflow
  where stack is not null 
  and businessdate = {businessDate}
'''

def testjoin(sql_string):
    con = pyodbc.connect(r'DSN='+'Stack',autocommit=True)  
    print('working') 

    df_stack= query1("'2019-03-13'")
    df_super= superuser("'2019-03-13'")

    pd.merge(df_stack,df_super, on = ['identifier','stack','iddate'])

    df_test = pd.read_sql(sql_string,con)

    con.close()

    return(df_test)

test_df = testjoin(sql)

You could also continue building the string inside each function as well but given the Don't Repeat Yourself paradigm in coding, and the fact you are already building it in two other functions it would be best to avoid that.

ah i see now, thanks. But what if I have 2 sql queries? Would it be best to do your 2nd options and make the sql strings into 2 parameters? — excelguy, May 13 '19 at 20:37
It really depends. There's no hard and fast rule about this that I know of someone can correct me if they do know. But ultimately it boils down to how you are going to use it in the code. Normally globals are seen as [somewhat bad in OOP programming](https://stackoverflow.com/questions/1263954/is-global-constants-an-anti-pattern) for reasons beyond the scope of this question. It's just something that comes with time and experience with coding and reading other peoples code — Matthew Barlowe, May 13 '19 at 22:18
thanks for all your help matthew, I am having trouble calling 2 Sql queries, as I am not sure how to call this in the `df_test` variable. would I make 2 variables? But then which one would I return? Also a random question what does `test_df = testjoin(sql)` do? — excelguy, May 15 '19 at 13:07
`test_df = testjoin(sql)` assigns the return value of the `testjoin()` function to a variable so you can use it at a later time in the script. If you need to call two queries then you would need to have two separate SQL strings and call `pd.read_sql` two times as well — Matthew Barlowe, May 15 '19 at 15:41

Python, dataframe sql join

2 Answers2