
I am trying to determine the fastest way to fetch data from MySQL into Pandas. So far, I have tried three different approaches:

Approach 1: Using pymysql and modifying the field type conversions (inspired by Fastest way to load numeric data into python/pandas/numpy array from MySQL)

import pymysql
from pymysql.converters import conversions
from pymysql.constants import FIELD_TYPE

# Return DECIMAL columns as float instead of decimal.Decimal
conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float

conn = pymysql.connect(host=host, port=port, user=user, passwd=passwd, db=db)

Approach 2: Using MySQLdb

import MySQLdb
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE

# Same DECIMAL-to-float conversion as above
conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float

conn = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd, db=db)

Approach 3: Using SQLAlchemy

import sqlalchemy as SQL

# SQLAlchemy engine on top of the MySQLdb driver
engine = SQL.create_engine('mysql+mysqldb://{0}:{1}@{2}:{3}/{4}'.format(user, passwd, host, port, db))
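
In each case the table is then read into a DataFrame; a minimal sketch of the fetch step (the table name is a placeholder, not from the original post):

import pandas as pd

# Works with a raw DB-API connection (Approaches 1 and 2) or an SQLAlchemy engine (Approach 3)
df = pd.read_sql('SELECT * FROM my_table', con=conn)  # or con=engine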

Approach 2 is the fastest of the three, taking an average of 4 seconds to fetch my table. However, fetching the same table takes only 2 seconds in MySQL Workbench. How can I shave off these 2 extra seconds? Does anyone know of an alternative way to accomplish this?
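
As the comments below point out, a fair comparison with Workbench should separate connection time from fetch time; a sketch of such a benchmark (the timing harness and table name are illustrative, not part of the original measurements):

import time
import MySQLdb
import pandas as pd

t0 = time.perf_counter()
conn = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd, db=db)
t1 = time.perf_counter()
df = pd.read_sql('SELECT * FROM my_table', con=conn)
t2 = time.perf_counter()
print('connect: {:.2f}s, fetch: {:.2f}s'.format(t1 - t0, t2 - t1))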

Sandeep Silwal
  • What do you mean by "However, fetching the table only takes 2 seconds on MySQL."? Have you connected to MySQL already? – a_guest Jun 19 '17 at 16:26
  • Sorry, I mean when I run SELECT * FROM table in MySQL Workbench, it only takes 2 seconds to fetch the table – Sandeep Silwal Jun 19 '17 at 16:48
  • Well this implies that you are already connected to the database. When benchmarking the different methods have you excluded the time required for connecting to the database? – a_guest Jun 19 '17 at 16:54
  • Ah, no, all the times I recorded also include the time required to connect to the database. Since I am only fetching a massive table once, I want to account for the connection time. However, I realize this makes the comparison with MySQL Workbench unfair, since Workbench is already connected to the database – Sandeep Silwal Jun 19 '17 at 17:00
  • @SandeepSilwal did you figure out the fastest way? – dingx Dec 10 '19 at 07:51
  • This was 3 years ago... I don't remember – Sandeep Silwal Dec 10 '19 at 14:53

2 Answers


You can use the ConnectorX library, which is written in Rust and is about 10 times faster than pandas' own readers. It fetches the data from the database and fills the DataFrame directly.
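
A minimal sketch, assuming MySQL credentials like those in the question (the connection-string values and table name are placeholders):

import connectorx as cx

# ConnectorX reads the result set straight into a DataFrame, skipping row-by-row Python conversion
df = cx.read_sql('mysql://user:password@host:3306/db', 'SELECT * FROM my_table')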

DrMorteza

You may find answers using a dedicated library such as peewee, or the pandas function pd.read_sql_query. To use pd.read_sql_query:

from sqlalchemy import create_engine
import pandas as pd

MyEngine = create_engine('[YourDatabase]://[User]:[Pass]@[Host]/[DatabaseName]', echo=True)
df = pd.read_sql_query('SELECT * FROM [TableName]', con=MyEngine)

Also, for uploading data from a DataFrame to SQL:

df.to_sql('[TableName]', MyEngine, if_exists='append', index=False)

You must pass if_exists='append' if the table already exists, because it defaults to 'fail' and will raise an error otherwise. You can also pass 'replace' if you want to drop the existing table and write a new one.

For data integrity's sake, it's nice to use DataFrames for uploads and downloads, since pandas handles the data types well. Depending on the size of your upload, it should be reasonably quick too.

If you want to go a step further, peewee queries may make uploads faster, although I have not personally benchmarked it. Peewee is an ORM library, like SQLAlchemy, that I found very easy and expressive to develop with; you can use DataFrames with it as well. Just skim the documentation: you construct a query and then convert it to a DataFrame, like this:

MyQuery = [TableName].select().where([TableName.column] == "value")
df = pd.DataFrame(list(MyQuery.dicts()))
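
A self-contained sketch with a hypothetical Price model (the model, fields, and credentials are illustrative, not from the original answer):

import pandas as pd
from peewee import MySQLDatabase, Model, CharField, FloatField

# Hypothetical connection and schema; adjust to match your own table
db = MySQLDatabase('mydb', host='localhost', port=3306, user='user', password='pass')

class Price(Model):
    symbol = CharField()
    value = FloatField()

    class Meta:
        database = db

# Run the query, materialize the rows as dicts, and build the DataFrame
query = Price.select().where(Price.symbol == 'AAPL')
df = pd.DataFrame(list(query.dicts()))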

Hope this helps.

Azarias