We Developed a Rule Database

(github.com)

1 point | by rockeetterark 6 hours ago

1 comment

  • rockeetterark 6 hours ago

    Databases are commonly used to store data and search it by 'conditions', e.g. `select * from ... where <conditions>`.

    On the other hand, in many cases we need the reverse: store MANY 'conditions' and search the stored conditions by 'data' (a toy sketch of this reverse search follows the list below). Typical cases:

    * online advertising: advertisers define many conditions over user profiles, contexts, etc.
    * content filtering: forbidden words combined in very complex boolean expressions
    * risk control and alerting conditions
    * data cleaning and curation
    * auto labeling
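    To make the reverse direction concrete, here is a deliberately naive, index-free sketch in Python; the rule names and document shape are hypothetical, and the linear scan over all rules is exactly what a real engine avoids:

    ```python
    # Naive "reverse search": store many conditions, query them with a document.
    # All names here are illustrative; the linear scan is what RuleDB-style
    # engines avoid by indexing the conditions themselves.
    from typing import Callable

    Doc = dict  # e.g. {"text": "...", "age": 25, "interests": [...]}

    rules: dict[str, Callable[[Doc], bool]] = {
        "young_movie_fan": lambda d: 20 <= d.get("age", 0) <= 28
                                     and "movie" in d.get("interests", ()),
        "forbidden_words": lambda d: "badword" in d.get("text", ""),
    }

    def percolate(doc: Doc) -> list[str]:
        """Return the ids of every stored condition the document satisfies."""
        return [rid for rid, pred in rules.items() if pred(doc)]

    print(percolate({"age": 25, "interests": ["movie"], "text": "hello"}))
    # -> ['young_movie_fan']
    ```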

    Such businesses often use the Elasticsearch percolator (or its underlying Lucene Monitor), or the more recent Tantivy percolator (Tantivy is Lucene's Rust alternative).
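    For comparison, this is roughly how the Elasticsearch percolator is used (a sketch with the official Python client; the index name, field names, and queries are made up):

    ```python
    # Sketch of Elasticsearch percolator usage; index/field names are made up.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # 1. Create an index whose "query" field stores queries instead of data.
    es.indices.create(index="rules", mappings={
        "properties": {
            "query": {"type": "percolator"},
            "message": {"type": "text"},
        },
    })

    # 2. Store a condition (a query) as a document.
    es.index(index="rules", id="1",
             document={"query": {"match": {"message": "brown fox"}}},
             refresh=True)

    # 3. Search the stored conditions with a concrete document.
    hits = es.search(index="rules", query={
        "percolate": {"field": "query",
                      "document": {"message": "a quick brown fox"}},
    })
    print([h["_id"] for h in hits["hits"]["hits"]])  # -> ['1']
    ```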

    Now we have developed RuleDB for this use case; our enterprise users report that RuleDB is more than 1000x faster than Elasticsearch (excluding RPC/network overhead).

    RuleDB has two parts: 1. the compiler and 2. the runtime.

    1. The compiler compiles rule code written in the RuleDB DSL into binaries:

      * AC (Aho-Corasick) automaton binaries: for plain literal 'words'
    
      * **Multi**-regex engine: our **extended regex** supports regular-language algebra (not/and/or/concat/non-greedy ops, ...); the many different extended regexes from many rules are compiled into one DFA, and the number of supported regexes is 100x that of Hyperscan (a toy union sketch follows this list)
    
      * Cost-based recall optimization: for example, in `A and B and C` the least frequent of A, B, C is selected as the recall term; this is applied recursively (see the pipeline sketch after this list)
    
      * The rule verification VM: the AC automaton and regex engine only scan the text for atoms; those atoms are then combined with boolean expressions, e.g. `a near/3 quick near/4 "brown fox" and not(lazy near/+2 dog or diligent near/2 wolf)`
         * The VM code is highly optimized by the compiler
    
    2. Applications use the RuleDB API to call the runtime library:

      * load the compiled binaries (just mmap)
      * scan + verify
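    RuleDB's extended-regex algebra is proprietary, but the 'many regexes, one pass' idea behind a multi-regex engine can be imitated in plain Python with named alternation (a much weaker stand-in for a combined DFA; the patterns below are made up):

    ```python
    # Weak stand-in for a multi-regex engine: merge several patterns into one
    # compiled regex via named groups, so a single pass reports which of the
    # original patterns fired. (A real engine such as Hyperscan, or RuleDB's,
    # compiles them into one DFA instead.)
    import re

    patterns = {
        "price": r"\$\d+(?:\.\d{2})?",
        "email": r"[\w.]+@[\w.]+",
        "phone": r"\d{3}-\d{4}",
    }

    # (?P<name>...) lets us recover which sub-pattern matched.
    combined = re.compile("|".join(f"(?P<{n}>{p})" for n, p in patterns.items()))

    text = "contact bob@example.com, office 555-1234, budget $1200"
    for m in combined.finditer(text):
        print(m.lastgroup, m.group())
    # email bob@example.com
    # phone 555-1234
    # price $1200
    ```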
    
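    To make the scan + verify split and the recall optimization concrete, here is a toy end-to-end pipeline in Python; it is my reconstruction of the idea, not RuleDB's code. The frequency table is invented, the rules are flat conjunctions (real rules are nested, so the recall choice is recursive), and the Aho-Corasick scan uses the `pyahocorasick` package:

    ```python
    # Toy pipeline: cost-based recall term selection + one shared AC scan
    # over all rules' recall terms + full verification of recalled rules only.
    import ahocorasick  # pip install pyahocorasick

    # Hypothetical document frequencies serving as the cost model.
    FREQ = {"a": 9000, "quick": 300, "fox": 40, "dog": 500, "wolf": 7}

    # Each rule here is a flat conjunction: every atom must occur in the doc.
    RULES = {
        "r1": ["a", "quick", "fox"],
        "r2": ["a", "dog"],
        "r3": ["quick", "wolf"],
    }

    def recall_term(atoms):
        """Cost-based recall: pick the least frequent atom of the conjunction."""
        return min(atoms, key=lambda t: FREQ.get(t, 0))

    # Build ONE automaton over the recall terms of ALL rules.
    ac = ahocorasick.Automaton()
    for rule_id, atoms in RULES.items():
        term = recall_term(atoms)
        ids = ac.get(term, []) if ac.exists(term) else []
        ac.add_word(term, ids + [rule_id])  # rules may share a recall term
    ac.make_automaton()

    def search(doc: str):
        # Scan: one pass over the document recalls candidate rules.
        candidates = {rid for _, ids in ac.iter(doc) for rid in ids}
        # Verify: check the full conjunction for the candidates only.
        return sorted(r for r in candidates if all(a in doc for a in RULES[r]))

    print(search("a quick brown fox jumps over the lazy dog"))
    # -> ['r1', 'r2']   ('wolf' never occurs, so r3 is never even recalled)
    ```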
    Advanced features:

      * numbers: integral range `{i{3000,5000}}`, real-number range `{r{2.71828,3.14159265}}`; numbers are saved and matched as text, so there is no width limit
    
      * composite index: `gender[1] and age[23] and income{i{3000,5000}}` will be compiled into a composite index 'i-gender-age-income'
    
      * multiple dimensions: the typical case is geofencing, e.g. `longitude{r{116.2418,116.2441}} and latitude{r{39.5424,39.5450}}`; RuleDB natively supports search in any number of dimensions
    
      * such numeric and text expressions can be mixed: `(america or china) and gender[1] and age[23] and income{i{3000,5000}}` (a toy evaluator follows this list)
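    Since the DSL is compiled rather than interpreted, the following is only a readability aid: the mixed rule above, rewritten as a plain Python predicate over a flat profile whose field names I am assuming:

    ```python
    # Illustrative semantics of the mixed rule; RuleDB compiles such rules
    # into indexes instead of evaluating them like this.
    def rule(profile: dict) -> bool:
        text = profile.get("text", "")
        return (
            ("america" in text or "china" in text)        # (america or china)
            and profile.get("gender") == 1                # gender[1]  exact value
            and profile.get("age") == 23                  # age[23]    exact value
            and 3000 <= profile.get("income", 0) <= 5000  # income{i{3000,5000}}
        )

    print(rule({"text": "in china", "gender": 1, "age": 23, "income": 4200}))  # True
    print(rule({"text": "in china", "gender": 1, "age": 23, "income": 9000}))  # False
    ```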
    
    A complex rule:

    ```
    gender{{1}} and age{i{20,28}} and income{i{18000,23000}}
    and longitude{r{116.2418,116.2441}} and latitude{r{39.5424,39.5450}}
    and interesting{{movie|food|sport}}
    and books{{The Red and the Black|The Great Gatsby}}
    and a near/3 quick near/4 "brown fox"
    and not(lazy near/+2 dog or diligent near/2 wolf)
    ```
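    For intuition, a profile/document that would satisfy this rule might look like the following (the field names and flat shape are my assumption):

    ```python
    # A hypothetical profile that satisfies every clause of the rule above.
    matching_doc = {
        "gender": 1,
        "age": 25,                    # in {i{20,28}}
        "income": 20000,              # in {i{18000,23000}}
        "longitude": 116.2430,        # in {r{116.2418,116.2441}}
        "latitude": 39.5432,          # in {r{39.5424,39.5450}}
        "interesting": "movie",       # one of movie|food|sport
        "books": "The Great Gatsby",  # one of the listed titles
        # Satisfies the proximity part: 'a' within 3 words of 'quick',
        # 'quick' within 4 words of "brown fox", and no lazy/dog or
        # diligent/wolf pair near each other.
        "text": "a very quick brown fox",
    }
    ```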

    Our enterprise users have real-world single rules larger than 50KB; a 2MB rule source file is fully compiled in 150 milliseconds, and a query that uses a 10KB document to search 70,000 rules (matching 20 of them) takes just 300us (microseconds, not milliseconds)!

    ---- I'm also the author of ToplingDB, which I posted several months ago: https://news.ycombinator.com/item?id=44432322