Hamming Distance for Hybrid Search in SQLite

(notnotp.com)

66 points | by enz 3 days ago ago

11 comments

  • stevesimmons 6 hours ago ago

    USearch has a sqlite extension that supports various metrics on including Hamming distance on standard sqlite BLOB columns. It gets similar performance and is very convenient.

    (There's also an indexed variant that does faster lookups, but it uses a special virtual table layout that constrains the types of the other columns in the table.)

    See https://github.com/unum-cloud/USearch. pip-installable for Python users.

  • jonatron 7 hours ago ago

    You could first calculate the distance of the first n bits (eg: 64, one popcountll) as a first pass, then calculate the full distance for candidates over a threshold from the first pass. It makes it approximate, but depending on the application it can be worth it.

    • mbreese 6 hours ago ago

      I was thinking of something similar — instead of just two passes, couldn’t you also store different quantized values? If you have thousands of documents, you could narrow it down to a handful with a few bit-wise Hamming comparisons before doing a full cosine similarity on just the rest. If you hand more than one bitmap stored, you’d have fewer comparisons at each step too.

      Would this work?

  • stephenheron 6 hours ago ago

    I've had good success in using this: https://github.com/sqliteai/sqlite-vector if you want something a bit more "off the shelf" if you are using SQLite.

    • 0cf8612b2e1e 6 hours ago ago

      Ooh that does look nice. However, that license is a deal breaker for me.

      • newusertoday 5 hours ago ago

        you can try this https://github.com/asg017/sqlite-vec its apache license

      • marcobambini 6 hours ago ago

        Author here... it is free for open-source projects.

        What kind of license would you like more?

        • simonw 2 hours ago ago

          One that lets me use it in my open source projects without then preventing other people from using my open source projects in their closed source projects.

          Using your library currently completely disrupts the licensing situation for my own work.

  • andai 6 hours ago ago

    Has anyone tried keyword expansion in this context?

    I had the idea of making a "poor man's embeddings" for document similarity. You want two documents to match even if they share no keywords, as long as their keywords are closely related, right? That seems like a very solvable problem.

  • woadwarrior01 3 hours ago ago

    There's also the recently released zvec[1], the tagline for which is: The SQLite of Vector Databases.

    [1]: https://github.com/alibaba/zvec

  • esafak 5 hours ago ago

    Are today's models any good at helping write postgres or sqlite extensions?