How does MongoDB handle document length in a text index and text score?

Scoring is based on the number of stemmed matches, but there is also a built-in coefficient that adjusts the score for matches relative to the total field length (with stop words removed). If your longer text includes more words relevant to a query, this adds to the score. Longer text that does not match the query lowers the score.

Excerpt from the MongoDB 3.2 source code on GitHub (src/mongo/db/fts/fts_spec.cpp):

    for (ScoreHelperMap::const_iterator i = terms.begin(); i != terms.end(); ++i) {
        const string& term = i->first;
        const ScoreHelperStruct& data = i->second;

        // in order to adjust weights as a function of term count as it
        // relates to total field length. ie. is this the only word or
        // a frequently occuring term? or does it only show up once in
        // a long block of text?

        double coeff = (0.5 * data.count / numTokens) + 0.5;

        // if term is identical to the raw form of the
        // field (untokenized) give it a small boost.
        double adjustment = 1;
        if (raw.size() == term.length() && raw.equalCaseInsensitive(term))
            adjustment += 0.1;

        double& score = (*docScores)[term];
        score += (weight * data.freq * coeff * adjustment);
        verify(score <= MAX_WEIGHT);
    }
}

Set up some test data to see the effect of the length coefficient on a very simple example:

db.articles.insert([
    { headline: "Rock" },
    { headline: "Rocks" },
    { headline: "Rock paper" },
    { headline: "Rock paper scissors" },
])

db.articles.createIndex({ "headline": "text"})

db.articles.find(
    { $text: { $search: "rock" }},
    { _id:0, headline:1, score: { $meta: "textScore" }}
).sort({ score: { $meta: "textScore" }})

Annotated results:

// Exact match of raw term to indexed field
// Coefficient is 1, plus 0.1 bonus for identical match of raw term
{
  "headline": "Rock",
  "score": 1.1
}

// Match of stemmed term to indexed field ("rocks" stems to "rock")
// Coefficient is 1
{
  "headline": "Rocks",
  "score": 1
}

// Two terms, one matching
// Coefficient is 0.75: (0.5 * 1 match / 2 terms) + 0.5
{
  "headline": "Rock paper",
  "score": 0.75
}

// Three terms, one matching
// Coefficient is about 0.67: (0.5 * 1 match / 3 terms) + 0.5
{
  "headline": "Rock paper scissors",
  "score": 0.6666666666666666
}
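
To check the arithmetic by hand, here is a minimal JavaScript sketch of the per-term calculation from the C++ excerpt. The helper name `termScore` and its parameter names are hypothetical and not part of MongoDB. It assumes the default field weight of 1 and that `data.freq` is 1 for a term that occurs once in the field. The excerpt does not show how `data.freq` is accumulated, but a value of 1 is consistent with the scores observed above.

```javascript
// Hypothetical helper: mirrors the arithmetic of the C++ excerpt
// for a single term in a single indexed field.
function termScore({ count, numTokens, freq = 1, weight = 1, exactMatch = false }) {
    // Length coefficient: ranges from just above 0.5 (rare term in a long
    // field) up to 1 (the term is the only token in the field).
    const coeff = (0.5 * count / numTokens) + 0.5;

    // Small boost when the raw field value is identical to the term.
    let adjustment = 1;
    if (exactMatch) adjustment += 0.1;

    // Assumption: freq defaults to 1 (term occurs once), weight to 1.
    return weight * freq * coeff * adjustment;
}

// The four test headlines, with token counts after tokenizing.
const headlines = [
    { headline: "Rock", numTokens: 1, exactMatch: true },
    { headline: "Rocks", numTokens: 1, exactMatch: false },
    { headline: "Rock paper", numTokens: 2, exactMatch: false },
    { headline: "Rock paper scissors", numTokens: 3, exactMatch: false },
];

for (const h of headlines) {
    const score = termScore({ count: 1, numTokens: h.numTokens, exactMatch: h.exactMatch });
    console.log(h.headline, score);
}
```

Under these assumptions the printed values should line up with the `textScore` results above. Passing a larger `numTokens` shows how the coefficient approaches 0.5 as unmatched text grows.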