3

I am writing a program that generates hashes for files in all subdirectories and then puts them in a database or prints them to standard output: https://github.com/cherrry9/dedup

In the latest commit, I added option for my program to use multiple threads (THREADS macro).

Here are some benchmarks that I did:

$ test() { /usr/bin/time -p ./dedup / -v 0 -c 2048 -e "/\(proc\|sys\|dev\|run\)"; }
$ make clean all THREADS=1 test
real 8.03
user 4.34
sys 4.55
$ make clean all THREADS=4 && test
real 3.94
user 7.66
sys 7.42

As you can the version compiled with THREADS=4 was 2 times faster.

Now I will use the second positional argument to specify sqlite3 database:

$ test() { /usr/bin/time -p ./dedup / test.db -v 0 -c 2048 -e "/\(proc\|sys\|dev\|run\)"; }
$ make clean all THREADS=1 && ​test
real 20.40
user 7.58
sys 7.29
$ rm test.db
$ make clean all THREADS=4 && ​test
real 21.86
user 17.17
sys 18.15

Version compiled with THREADS=4 was slower than version that used THREADS=1!

When I used second argument, in dedup.c was executed this code that inserted hashes to database:

if (sql != NULL && sql_insert(sql, entry->fpath, hash) != 0) {
// ...

sql_insert uses transactions to prevent sqlite from writing to database every time I call INSERT.

int
sql_insert(SQL *sql, const char *filename, char unsigned hash[])
{
    int errcode;

    pthread_mutex_lock(&sql->mtx);
    sqlite3_bind_text(sql->stmt, 1, filename, -1, NULL);
    sqlite3_bind_blob(sql->stmt, 2, hash, SHA256_LENGTH, NULL);

    sqlite3_step(sql->stmt);
    SQL_TRY(sqlite3_reset(sql->stmt));

    if (++sql->insertc >= INSERT_LIM) {
        SQL_TRY(sqlite3_exec(sql->database, "COMMIT;BEGIN", NULL, NULL, NULL));
        sql->insertc = 0;
    }

    pthread_mutex_unlock(&sql->mtx);
    return 0;
}

This fragment is executed for every processed file and for some reason it's blocking all threads in my program.

And here's my question, how can i prevent sqlite from blocking threads and degrading the performance of my program?

Here is dedup options explanation if you wonder what test function is doing:

1th positional argument - directory to use to generate hashes
2th positional argument - path to databases which will be used by sqlite3
-v level  - verbose level (0 means print only errors)
-c nbytes - read nbytes from each file
-e regex  - exclude directories that match regex

I'm using serialized mode in sqlite3.

2
  • 2
    What SQLite threading mode are you using? Commented Dec 27, 2021 at 18:28
  • 1
    @cherrrry9: Weclome to SO! I peeked at your linked code while writing my answer; however, it is expected that you include the relevant code in your question rather than linking to off-site resources; otherwise your future questions might get closed or downvoted. Commented Dec 27, 2021 at 19:51

1 Answer 1

1

It seems that all your threads use the same database connection and statement objects. Therefore you have a race-condition (even in SERIALIZED threading model), as multiple threads are binding, stepping, and resetting the same statement. Asking 'why is it slow' becomes irrelevant until you fix this problem.

Instead you should wrap your sql_insert with a mutex to guarantee that at most one thread is accessing the database connection:

int
sql_insert(SQL *sql, const char *filename, char unsigned hash[])
{
    pthread_mutex_lock(&sql->mutex);
    // ... actual insert and exec code ...
    pthread_mutex_unlock(&sql->mutex);
    return 0;
}

Then add and initialize that mutex in your SQL structure with pthread_mutex_init.

You'll see the performance boost if your bottleneck is indeed the computation of SHA-256 rather than writing into the database. Otherwise the overhead of this mutex should be negligible and the number of threads will not have a significant effect of the run-time.

Sign up to request clarification or add additional context in comments.

8 Comments

Sorry but I was wrong, I checked it again using sqlite_db_mutex which returned valid not null address, so I actually use serialized mode.
@cherrrry9: as I said; even with SERIALIZED you still have a race condition. Use multi-threaded and have your own mutex around the entirety of sql_insert.
I added this mutex and the strange entries that didn't match the actual files on my filesystem disappeared from the generated database, thanks! But the performance is still the same...
So the bottleneck is writing to the database as you wrote.
@cherrrry9: I think you misunderstood my comment; I meant that a four thread version running 1s slower than a one thread version is a rather reasonable outcome if you take into account that the bottleneck is the serialized SQLite's code. As for the way to "properly use sqlite with multithreading" is, unfortunately, to serialize all writes to SQLite with a mutex. SQLite's view of a database is inherently single-threaded (for writes); it goes as far as serializing all writes between multiple processes using a file lock.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.