Performance issue when using multiple threads with sqlite3

Question

I am writing a program that generates hashes for files in all subdirectories and then puts them in a database or prints them to standard output: https://github.com/cherrry9/dedup

In the latest commit, I added an option to build my program with multiple threads (the THREADS macro).

Here are some benchmarks that I did:

$ test() { /usr/bin/time -p ./dedup / -v 0 -c 2048 -e "/\(proc\|sys\|dev\|run\)"; }
$ make clean all THREADS=1 && test
real 8.03
user 4.34
sys 4.55
$ make clean all THREADS=4 && test
real 3.94
user 7.66
sys 7.42

As you can see, the version compiled with THREADS=4 was about twice as fast.

Now I will use the second positional argument to specify an sqlite3 database:

$ test() { /usr/bin/time -p ./dedup / test.db -v 0 -c 2048 -e "/\(proc\|sys\|dev\|run\)"; }
$ make clean all THREADS=1 && test
real 20.40
user 7.58
sys 7.29
$ rm test.db
$ make clean all THREADS=4 && test
real 21.86
user 17.17
sys 18.15

The version compiled with THREADS=4 was slower than the version compiled with THREADS=1!

When I pass the second argument, this code in dedup.c is executed to insert the hashes into the database:

if (sql != NULL && sql_insert(sql, entry->fpath, hash) != 0) {
// ...

sql_insert uses transactions to prevent sqlite from writing to the database every time I call INSERT.

int
sql_insert(SQL *sql, const char *filename, char unsigned hash[])
{
    int errcode;

    pthread_mutex_lock(&sql->mtx);
    sqlite3_bind_text(sql->stmt, 1, filename, -1, NULL);
    sqlite3_bind_blob(sql->stmt, 2, hash, SHA256_LENGTH, NULL);

    sqlite3_step(sql->stmt);
    SQL_TRY(sqlite3_reset(sql->stmt));

    if (++sql->insertc >= INSERT_LIM) {
        SQL_TRY(sqlite3_exec(sql->database, "COMMIT;BEGIN", NULL, NULL, NULL));
        sql->insertc = 0;
    }

    pthread_mutex_unlock(&sql->mtx);
    return 0;
}

This fragment is executed for every processed file and for some reason it's blocking all threads in my program.

So here's my question: how can I prevent sqlite from blocking threads and degrading the performance of my program?

Here is an explanation of dedup's options, in case you wonder what the test function is doing:

1st positional argument - directory to generate hashes for
2nd positional argument - path to the database that sqlite3 will use
-v level  - verbose level (0 means print only errors)
-c nbytes - read nbytes from each file
-e regex  - exclude directories that match regex

I'm using serialized mode in sqlite3.

@cherrrry9: Welcome to SO! I peeked at your linked code while writing my answer; however, you are expected to include the relevant code in the question itself rather than linking to off-site resources; otherwise your future questions might get closed or downvoted.
– Yakov Galka, Commented Dec 27, 2021 at 19:51

Yakov Galka · Accepted Answer · 2021-12-27 20:11:35Z

1

It seems that all your threads use the same database connection and statement objects. Therefore you have a race condition (even in the SERIALIZED threading mode), as multiple threads are binding, stepping, and resetting the same statement. Serialized mode only makes each individual API call atomic; it does not keep one thread's bind, step, and reset sequence from interleaving with another's. Asking 'why is it slow' is irrelevant until you fix this problem.

You should wrap the body of sql_insert in a mutex to guarantee that at most one thread accesses the database connection at a time:

int
sql_insert(SQL *sql, const char *filename, char unsigned hash[])
{
    pthread_mutex_lock(&sql->mutex);
    // ... actual insert and exec code ...
    pthread_mutex_unlock(&sql->mutex);
    return 0;
}

Then add and initialize that mutex in your SQL structure with pthread_mutex_init.

You'll see a performance boost if your bottleneck is indeed the computation of SHA-256 rather than writing to the database. Otherwise the overhead of this mutex should be negligible, and the number of threads will not have a significant effect on the run time.

edited Dec 27, 2021 at 20:11

answered Dec 27, 2021 at 19:45

Yakov Galka


8 Comments

cherrrry9 Over a year ago

Sorry, but I was wrong. I checked it again using sqlite3_db_mutex, which returned a valid non-NULL address, so I am actually using serialized mode.

Yakov Galka Over a year ago

@cherrrry9: as I said, even with SERIALIZED you still have a race condition. Use the multi-thread mode and put your own mutex around the entirety of sql_insert.

cherrrry9 Over a year ago

I added this mutex and the strange entries that didn't match the actual files on my filesystem disappeared from the generated database, thanks! But the performance is still the same...

cherrrry9 Over a year ago

So the bottleneck is writing to the database as you wrote.

Yakov Galka Over a year ago

@cherrrry9: I think you misunderstood my comment; I meant that a four-thread version running about 1 s slower than a one-thread version is a rather reasonable outcome if you take into account that the bottleneck is the serialized SQLite code. As for the way to "properly use sqlite with multithreading": it is, unfortunately, to serialize all writes to SQLite with a mutex. SQLite's view of a database is inherently single-threaded (for writes); it goes as far as serializing all writes between multiple processes using a file lock.
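One common way to arrange that serialization, different from the mutex in the answer and shown only as a sketch, is to let the hashing threads push finished records onto a queue and have a single writer thread drain it, so that only the writer ever touches the SQLite connection. All names below (`Record`, `Queue`, `queue_push`, `queue_pop`, `queue_close`, `QUEUE_CAP`, `PATH_LEN`) are hypothetical, and the writer's calls into SQLite are left out:

```c
#include <pthread.h>
#include <string.h>

#define QUEUE_CAP     64   /* assumed capacity */
#define PATH_LEN      1024 /* assumed maximum path length */
#define SHA256_LENGTH 32

typedef struct {
    char          path[PATH_LEN];
    unsigned char hash[SHA256_LENGTH];
} Record;

typedef struct {
    Record          items[QUEUE_CAP];
    size_t          head, count;
    int             closed; /* set once no more records will arrive */
    pthread_mutex_t mtx;
    pthread_cond_t  not_empty, not_full;
} Queue;

void
queue_init(Queue *q)
{
    q->head = q->count = 0;
    q->closed = 0;
    pthread_mutex_init(&q->mtx, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Called by hashing threads; blocks only while the queue is full. */
void
queue_push(Queue *q, const char *path, const unsigned char *hash)
{
    pthread_mutex_lock(&q->mtx);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->mtx);
    Record *r = &q->items[(q->head + q->count) % QUEUE_CAP];
    strncpy(r->path, path, PATH_LEN - 1);
    r->path[PATH_LEN - 1] = '\0';
    memcpy(r->hash, hash, SHA256_LENGTH);
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->mtx);
}

/* Called once all hashing threads have finished. */
void
queue_close(Queue *q)
{
    pthread_mutex_lock(&q->mtx);
    q->closed = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->mtx);
}

/* Called by the single writer thread. Returns 1 and fills *out,
 * or 0 once the queue is closed and fully drained. */
int
queue_pop(Queue *q, Record *out)
{
    int ok = 0;

    pthread_mutex_lock(&q->mtx);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->mtx);
    if (q->count > 0) {
        *out = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        ok = 1;
    }
    pthread_mutex_unlock(&q->mtx);
    return ok;
}
```

The writer thread would loop on queue_pop and pass each record's path and hash to sql_insert, which then needs no mutex of its own. This does not make SQLite itself any faster; it only keeps the hashing threads from waiting on each insert while the queue has room.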
