6

I'm doing big batch inserts into an SQLite3 database and I'm trying to get a sense for what sort of performance I should be expecting versus what I'm actually seeing.

My table looks like this:

cursor.execute(
    """CREATE TABLE tweets(
        tweet_hash TEXT PRIMARY KEY ON CONFLICT REPLACE,
        tweet_id INTEGER,
        tweet_text TEXT)"""
)

and my inserts look like this:

cursor.executemany("INSERT INTO tweets VALUES (?, ?, ?)", to_write)

where to_write is a list of tuples.

Currently, with about 12 million rows in the database, inserting 50,000 rows takes me around 16 minutes, running on a 2008 MacBook.

Does this sound reasonable, or is there something gross happening?

cmyr
  • 2,235
  • 21
  • 30
  • Have you tried inserting multiple values in each query? Maybe not all of them at once, since you would have to create a very large string, but maybe 100 each? In my experience this improves insertion speed a lot. – Gustav Larsson Jul 25 '13 at 20:02
  • @GustavLarsson: `executemany` is supposed to take advantage of preparing the query and streaming in the per-row columns *already*. – Martijn Pieters Jul 25 '13 at 20:03
  • @GustavLarsson: How were you proposing creating multiple inserts? `execute()` and `executemany()` only allow for *one* statement. – Martijn Pieters Jul 25 '13 at 20:04
  • 1
    i think something like `for i in range(0,len(my_list),100):execute_many(qry,my_list[i:i+100])` at a guess – Joran Beasley Jul 25 '13 at 20:31
  • Personally, I've never used `executemany`, so I don't know exactly how it works or what optimization it gives you. What I was thinking of was doing something like `"INSERT INTO tweets VALUES (1, 2, 3), (4, 5, 6)"` (in newer versions of sqlite3, though there are workarounds for older ones), which I know is much faster than separate inserts. Maybe this method can be combined with `executemany` to make it buffered? – Gustav Larsson Jul 25 '13 at 21:01
  • 2
    Does `executemany` use transactions? If not, sqlite will internally wrap each and every insert statement with an implicit transaction, which can cause a *huge* performance hit on bulk inserts. See these for a little more info: http://www.sqlite.org/faq.html#q19 or http://stackoverflow.com/questions/3852068/sqlite-insert-very-slow – rutter Jul 25 '13 at 21:24
  • 16 minutes is a lot slower than I would expect. I think there is a good chance that @rutter's comments are correct and each insert is being executed within an implicit transaction. I would store a large .sql file (or just a large string) with all of the inserts wrapped in a transaction (as discussed in the article above) and see how long that takes. – ChrisProsser Jul 25 '13 at 22:25
  • For some follow-up: I was never really able to satisfactorily resolve this problem. I did get better results using smaller transaction sizes (around 1k entries) but still had performance problems, and ended up switching to gdbm, which is working better for this project. – cmyr Sep 16 '13 at 22:30
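The multi-row `VALUES` form Gustav Larsson describes can be combined with chunking along the lines Joran Beasley suggests. A minimal sketch (assumptions: SQLite 3.7.11+ for multi-row `VALUES`, an in-memory database for illustration, and an arbitrary chunk size of 100):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database file
conn.execute(
    """CREATE TABLE tweets(
        tweet_hash TEXT PRIMARY KEY ON CONFLICT REPLACE,
        tweet_id INTEGER,
        tweet_text TEXT)"""
)

rows = [(str(i), i, "tweet %d" % i) for i in range(250)]  # sample data

CHUNK = 100  # rows per INSERT statement; tune as needed
for start in range(0, len(rows), CHUNK):
    chunk = rows[start:start + CHUNK]
    # Build "(?, ?, ?), (?, ?, ?), ..." for this chunk's row count.
    placeholders = ", ".join(["(?, ?, ?)"] * len(chunk))
    # Flatten the tuples into one parameter list.
    flat = [value for row in chunk for value in row]
    conn.execute("INSERT INTO tweets VALUES " + placeholders, flat)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0])  # 250
```

Note that SQLite caps the number of bound parameters per statement (999 by default in older builds), which is another reason to keep the chunk size modest.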

1 Answer

8

As I understand it, the main cause of the poor performance is the time spent committing many individual SQLite transactions. What can you do?

Drop the indexes, then

PRAGMA synchronous = OFF (or NORMAL)

Insert in blocks of N rows (tune N; try N = 5000 to start). Before inserting a block, do

BEGIN TRANSACTION

and after inserting it, do

COMMIT

See also http://www.sqlite.org/faq.html#q19
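Put together, the recipe above might look like this. This is a sketch, not your exact code: it assumes the same `tweets` schema and a `to_write` list of 3-tuples, uses an in-memory database for illustration, and leans on the fact that the sqlite3 `Connection` context manager commits on success (so each block of N rows lands in one transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # point this at your real file instead
conn.execute(
    """CREATE TABLE tweets(
        tweet_hash TEXT PRIMARY KEY ON CONFLICT REPLACE,
        tweet_id INTEGER,
        tweet_text TEXT)"""
)

# Trade durability for speed during the bulk load.
conn.execute("PRAGMA synchronous = OFF")

N = 5000  # rows per transaction; tune this for your data


def bulk_insert(conn, rows):
    """Insert rows in blocks of N, one transaction per block."""
    for start in range(0, len(rows), N):
        with conn:  # commits on clean exit, rolls back on error
            conn.executemany(
                "INSERT INTO tweets VALUES (?, ?, ?)",
                rows[start:start + N],
            )


to_write = [(str(i), i, "tweet %d" % i) for i in range(12500)]
bulk_insert(conn, to_write)
print(conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0])  # 12500
```

With 50,000 rows that is 10 commits instead of (potentially) 50,000, which is where the bulk of the savings comes from.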

drastega
  • 1,581
  • 5
  • 30
  • 42