This file provides documentation for CirrusSearch configuration variables.
It should be updated each time a new configuration parameter is added or changed.
== Configuration ==
; $wgCirrusSearchServers
Default:
unset
$wgCirrusSearchServers provides a straightforward method for
configuring the typical use case: a single elasticsearch cluster for
all circumstances. The value is a list of hostnames in the cluster
to connect to.
When set, the following configuration is ignored:
wgCirrusSearchClusters
wgCirrusSearchDefaultCluster
wgCirrusSearchWriteClusters
wgCirrusSearchReplicaGroup
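A minimal sketch of the single-cluster case described above. The hostnames are placeholders, not real servers:
// Hypothetical hostnames; replace with the nodes of your own cluster.
$wgCirrusSearchServers = [ 'es01.example.org', 'es02.example.org' ];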
; $wgCirrusSearchDefaultCluster
Default:
$wgCirrusSearchDefaultCluster = 'default';
Default cluster for read operations. This refers to the cluster group
from $wgCirrusSearchClusters. When running multiple clusters this
should be pointed to the closest cluster, and can be pointed at an
alternate cluster during downtime.
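As a sketch of the failover case described above, assuming two clusters named 'dc-foo' and 'dc-bar' are defined in $wgCirrusSearchClusters (the names are illustrative):
// Normal operation: read from the closest cluster.
$wgCirrusSearchDefaultCluster = 'dc-foo';
// During downtime of dc-foo, point reads at the alternate cluster instead:
// $wgCirrusSearchDefaultCluster = 'dc-bar';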
; $wgCirrusSearchClusters
Default:
$wgCirrusSearchClusters = [
'default' => [ 'localhost' ],
];
Each key is the name of an elasticsearch cluster. The value is
a list of addresses to connect to. If no port is specified it
defaults to 9200.
All writes will be processed in all configured cluster groups by the
ElasticaWrite job, unless $wgCirrusSearchWriteClusters is configured
(see below).
This list of addresses can additionally contain 'replica' and
'group' keys for controlling multi-cluster operations. By default
'replica' takes the value of the array key and 'group' is set
to 'default'. For more information see docs/multi_cluster.txt.
Example:
$wgCirrusSearchClusters = [
'dc-foo' => [ 'es01.foo.local', 'es02.foo.local' ],
'dc-bar' => [ 'es01.bar.local', 'es02.bar.local' ],
];
A non-standard elasticsearch port can also be defined.
Example:
$wgCirrusSearchClusters = [
'default' => [
[ 'host' => '127.0.0.1', 'port' => 1234 ],
],
];
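The 'replica' and 'group' keys mentioned above sit alongside the addresses in each cluster's list. A sketch, assuming one replica split into two groups named 'a' and 'b' (all names and hosts here are hypothetical; docs/multi_cluster.txt remains the authoritative description of the layout):
$wgCirrusSearchClusters = [
// Hypothetical cluster names, replica name, group names and hosts.
'dc-foo-a' => [ 'replica' => 'dc-foo', 'group' => 'a', 'es01.foo.local' ],
'dc-foo-b' => [ 'replica' => 'dc-foo', 'group' => 'b', 'es02.foo.local' ],
];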
; $wgCirrusSearchManagedClusters
Default:
$wgCirrusSearchManagedClusters = null;
List of clusters, from $wgCirrusSearchClusters, where CirrusSearch is responsible
for managing indices. CirrusSearch will refuse to perform maintenance operations
on unlisted clusters. When null all known clusters are used.
; $wgCirrusSearchWriteClusters
Default:
$wgCirrusSearchWriteClusters = null;
List of clusters that can be used for writing. Must be a subset of
cluster groups from $wgCirrusSearchClusters. By default or when set
to null, all configured cluster groups are available for writing.
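For instance, assuming clusters named 'dc-foo' and 'dc-bar' are defined in $wgCirrusSearchClusters (illustrative names), writes could be limited to one of them:
// Only dc-foo receives writes; dc-bar stays configured but is not written to.
$wgCirrusSearchWriteClusters = [ 'dc-foo' ];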
; $wgCirrusSearchPrivateClusters
Default:
$wgCirrusSearchPrivateClusters = null;
List of cluster names that are allowed to contain private indices. This
provides an additional list on top of $wgCirrusSearchWriteClusters for the
archive index which should not be written to clusters that will be publicly
readable. When set to the default value of null all clusters are allowed to
contain private data.
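A sketch, assuming a cluster named 'dc-bar' is publicly readable and a cluster named 'dc-foo' is not (both names are illustrative):
// Only dc-foo may hold private indices such as the archive index.
$wgCirrusSearchPrivateClusters = [ 'dc-foo' ];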
; $wgCirrusSearchReplicaGroup
Default:
$wgCirrusSearchReplicaGroup = 'default';
Replica group the current wiki belongs to. This can be either a
string for a constant assignment, or a configuration array specifying
a strategy for choosing the replica group. This should not be changed
except in advanced multi-wiki configurations. For more information
see docs/multi_cluster.txt.
; $wgCirrusSearchCrossClusterSearch
Default:
$wgCirrusSearchCrossClusterSearch = false;
When true, search queries will have their index name prepended with an
elasticsearch cross-cluster-search identifier if the indices reside on a
cluster group separate from the host wiki. This only applies to full text
search queries, as they are the only ones that support cross-wiki search.
; $wgCirrusSearchConnectionAttempts
Default:
$wgCirrusSearchConnectionAttempts = 1;
How many times to attempt connecting to a given server.
If you're behind LVS and everything looks like one server,
you may want to reattempt 2 or 3 times.
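For example, a wiki behind a load balancer might retry a few times. The value 3 is only an illustration taken from the range suggested above:
$wgCirrusSearchConnectionAttempts = 3;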
; $wgCirrusSearchShardCount
Default:
$wgCirrusSearchShardCount = [ 'content' => 1, 'general' => 1, 'titlesuggest' => 1 ];
Number of shards for each index.
You can also set this setting for each cluster:
$wgCirrusSearchShardCount = array(
'cluster1' => array( 'content' => 2, 'general' => 2 ),
'cluster2' => array( 'content' => 3, 'general' => 3 ),
);
; $wgCirrusSearchReplicas
Default:
$wgCirrusSearchReplicas = '0-2';
Number of replicas Elasticsearch can expand or contract to. This allows for
easy development and deployment on a single node (0 replicas) while still
scaling up to higher levels of replication. If you need more redundancy you could
adjust this to '0-10' or '0-all' or even 'false' (string, not boolean) to
disable the behavior entirely. The default should be fine for most people.
You can also specify this as an array of index type to replica count. If you
do then you must specify all index types. For example:
$wgCirrusSearchReplicas = array( 'content' => '0-3', 'general' => '0-2' );
You can also set this setting for each cluster:
$wgCirrusSearchReplicas = array(
'cluster1' => array( 'content' => '0-1', 'general' => '0-2' ),
'cluster2' => array( 'content' => '0-2', 'general' => '0-3' ),
);
; $wgCirrusSearchMaxShardsPerNode
Default:
$wgCirrusSearchMaxShardsPerNode = [];
Number of shards allowed on the same elasticsearch node, per index type.
Set this to 1 to prevent two shards from the same high traffic index from being allocated
onto the same node.
You can also set this setting for each cluster:
$wgCirrusSearchMaxShardsPerNode = [
'cluster1' => [ 'content' => 1 ],
'cluster2' => [ 'content' => 'unlimited' ],
];
Example:
$wgCirrusSearchMaxShardsPerNode[ 'content' ] = 1;
; $wgCirrusSearchSlowSearch
Default:
$wgCirrusSearchSlowSearch = 10.0;
How many seconds must a search of Elasticsearch take before we consider it
slow? The default value is 10 seconds, which should be fine for catching the rare
truly abusive queries. Use Elasticsearch's own query logging for more granular
logs that don't contain user information.
; $wgCirrusSearchUseExperimentalHighlighter
Default:
$wgCirrusSearchUseExperimentalHighlighter = false;
Should CirrusSearch attempt to use the "experimental" highlighter? It is an
Elasticsearch plugin that should produce better snippets for search results.
Installation instructions are here: https://github.com/wikimedia/search-highlighter
If you have the highlighter installed you can switch this on and off so long
as you don't rebuild the index while $wgCirrusSearchOptimizeIndexForExperimentalHighlighter is true.
Setting it to true without the highlighter installed will break search.
; $wgCirrusSearchOptimizeIndexForExperimentalHighlighter
Default:
$wgCirrusSearchOptimizeIndexForExperimentalHighlighter = false;
Should CirrusSearch optimize the index for the experimental highlighter.
This will speed up indexing, save a ton of space, and speed up highlighting
slightly. This only takes effect if you rebuild the index. The downside is
that you can no longer switch $wgCirrusSearchUseExperimentalHighlighter on
and off - it has to stay on.
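A sketch of the two highlighter settings used together, assuming the highlighter plugin is already installed on every Elasticsearch node:
// Use the experimental highlighter for snippets.
$wgCirrusSearchUseExperimentalHighlighter = true;
// Takes effect only after the index is rebuilt; once rebuilt, the setting above must stay true.
$wgCirrusSearchOptimizeIndexForExperimentalHighlighter = true;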
; $wgCirrusSearchWikimediaExtraPlugin
Default:
$wgCirrusSearchWikimediaExtraPlugin = [];
Should CirrusSearch try to use the wikimedia/extra plugin? An empty array
means don't use it at all.
Here is an example to enable faster regex matching:
$wgCirrusSearchWikimediaExtraPlugin[ 'regex' ] =
array( 'build', 'use', 'max_inspect' => 10000 );
The 'build' value instructs Cirrus to build the index required to speed up
regex queries. The 'use' value instructs Cirrus to use it to power regular
expression queries. If 'use' is added before the index is rebuilt with
'build' in the array then regex will fail to find anything. The value of
the 'max_inspect' key is the maximum number of pages to recheck the regex
against. It's optional and defaults to 10000, which seems like a reasonable
compromise to keep regexes fast while still producing good results.
This turns on noop-detection for updates and is compatible with
wikimedia-extra versions 1.3.1, 1.4.2, 1.5.0, and greater:
$wgCirrusSearchWikimediaExtraPlugin[ 'super_detect_noop' ] = true;
Configure field specific handlers for the noop script.
$wgCirrusSearchWikimediaExtraPlugin[ 'super_detect_noop_handlers' ] = [
'labels' => 'equals',
];
This turns on document level noop-detection for updates based on revision
ids and is compatible with wikimedia-extra versions 2.3.4.1 and greater:
$wgCirrusSearchWikimediaExtraPlugin[ 'documentVersion' ] = true;
Allows the use of Lucene tokenizers to activate the phrase rescore.
This avoids relying on the presence of spaces (which does not
work for languages written without spaces). Available since version 5.1.2:
$wgCirrusSearchWikimediaExtraPlugin['token_count_router'] = true;
Allows the use of term_freq token filter and query. Available since
version 5.5.2.7 of the plugin.
$wgCirrusSearchWikimediaExtraPlugin['term_freq'] = true;
; $wgCirrusSearchEnableRegex
Default:
$wgCirrusSearchEnableRegex = true;
Should CirrusSearch try to support regular expressions with insource:?
These can be really expensive, but mostly ok, especially if you have the
extra plugin installed. Sometimes they still cause issues though.
; $wgCirrusSearchRegexMaxDeterminizedStates
Default:
$wgCirrusSearchRegexMaxDeterminizedStates = 20000;
Maximum complexity of regexes. Raising this will allow more complex
regexes to use the memory that they need to compile in Elasticsearch. The
default allows reasonably complex regexes and doesn't use too much memory.
; $wgCirrusSearchQueryStringMaxDeterminizedStates
Default:
$wgCirrusSearchQueryStringMaxDeterminizedStates = null;
Maximum complexity of wildcard queries. Raising this value will allow
more wildcards in search terms. 500 will allow about 20 wildcards.
Setting a high value here can cause the cluster to consume a lot of memory
when compiling complex wildcard queries.
This setting requires elasticsearch 1.4+.
With elasticsearch 1.4+ if this setting is disabled the default value is
10000.
With elasticsearch 1.3 this setting must be disabled.
Example:
$wgCirrusSearchQueryStringMaxDeterminizedStates = 500;
; $wgCirrusSearchNamespaceMappings
Default:
$wgCirrusSearchNamespaceMappings = [];
By default, Cirrus will organize pages into one of two indexes (general or
content) based on whether a page is in a content namespace. This should
suffice for most wikis. This setting allows individual namespaces to be
mapped to specific index suffixes. The keys are the namespace number, and
the value is a string name of what index suffix to use. Changing this setting
requires a full reindex (not in-place) of the wiki. If this setting contains
any values then the index names must also exist in $wgCirrusSearchShardCount.
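As an illustration, the following sketch sends the File namespace to its own index. The 'file' suffix is a made-up name; the shard count entry is added because, as noted above, any suffix used here must also exist in $wgCirrusSearchShardCount:
// Hypothetical suffix 'file' for the File namespace; requires a full reindex.
$wgCirrusSearchNamespaceMappings[ NS_FILE ] = 'file';
$wgCirrusSearchShardCount[ 'file' ] = 1;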
; $wgCirrusSearchExtraIndexes
Default:
$wgCirrusSearchExtraIndexes = [];
Extra indexes (if any) you want to search, and for what namespaces?
The key should be the local namespace, with the value being an array of one
or more indexes that should be searched as well for that namespace.
NOTE: This setting makes no attempts to ensure compatibility across
multiple indexes, and basically assumes everyone's using a CirrusSearch
index that's more or less the same. Most notably, we can't guarantee
that namespaces match up; so you should only use this for core namespaces
or other times you can be sure that namespace IDs match 1-to-1.
NOTE Part Two: Adding an index here causes Cirrus to spawn jobs to
update that other index, trying to set the local_sites_with_dupe field. This
is used to filter duplicates that appear on the remote index. This is always
done by a job, even when run from forceSearchIndex.php. If you add an image
to your wiki after it is already in the extra search index, you'll see duplicate
results until the job is done.
NOTE Part Three: Removing an index from here will stop generating update
jobs, but jobs already enqueued will run to completion.
NOTE Part Four: When using a multi cluster (wgCirrusSearchReplicaGroup) setup
you can prefix with the remote cross cluster name.
Example:
$wgCirrusSearchExtraIndexes = [
NS_FILE => [ 'other_index' ],
];
; $wgCirrusSearchExtraIndexBoostTemplates
Default:
$wgCirrusSearchExtraIndexBoostTemplates = [];
Template boosts to apply to extra index queries. This is pretty much a complete
hack, but gets the job done. Top level is a map from the extra index added by
$wgCirrusSearchExtraIndexes to a configuration map. That configuration map must
contain a 'wiki' entry with the same value as the 'wiki' field in the documents,
and a 'boosts' entry containing a map from template name to boost weight.
Example:
$wgCirrusSearchExtraIndexBoostTemplates = [
'commonswiki_file' => [
'wiki' => 'commonswiki',
'boosts' => [
'Template:Valued image' => 1.75,
'Template:Assessments' => 1.75,
],
],
];
; $wgCirrusSearchUpdateShardTimeout
Default:
$wgCirrusSearchUpdateShardTimeout = '1ms';
Shard timeout for index operations. This is the amount of time
Elasticsearch will wait around for an offline primary shard. Currently this
is just used in page updates and not deletes. It is defined in
Elasticsearch's time format which is a string containing a number and then a
unit which is one of d (days), m (minutes), h (hours), ms (milliseconds) or
w (weeks). Cirrus defaults to a very tiny value to prevent job executors
from waiting around a long time for Elasticsearch. Instead, the job will
fail and be retried later.
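For instance, to let updates wait longer for an offline primary shard, at the cost of tying up job executors (the one minute value is arbitrary):
$wgCirrusSearchUpdateShardTimeout = '1m';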
; $wgCirrusSearchClientSideUpdateTimeout
Default:
$wgCirrusSearchClientSideUpdateTimeout = 120;
Client side timeout, in seconds, for non-maintenance index and delete
operations. Set it long enough to account for operations that may be
delayed on the Elasticsearch node.
; $wgCirrusSearchClientSideConnectTimeout
Default:
$wgCirrusSearchClientSideConnectTimeout = 5;
Client side timeout when initializing connections.
Useful to fail fast if elasticsearch is unreachable.
Set to 0 to use Elastica defaults (300 sec).
You can also set this setting for each cluster:
$wgCirrusSearchClientSideConnectTimeout = array(
'cluster1' => 10,
'cluster2' => 5,
);
; $wgCirrusSearchSearchShardTimeout
Default:
$wgCirrusSearchSearchShardTimeout = [
'default' => '20s',
'regex' => '120s',
];
The amount of time Elasticsearch will wait for search shard actions before
giving up on them and returning the results from the other shards. Defaults
to 20s for regular searches which is about twice the slowest queries we see.
Some shard actions are capable of returning partial results and others are
just ignored. Regexes default to 120 seconds because they are known to be
slow at this point.
; $wgCirrusSearchClientSideSearchTimeout
Default:
$wgCirrusSearchClientSideSearchTimeout = [
'default' => 40,
'regex' => 240,
];
Client side timeout for searches in seconds. Best to keep this double the
shard timeout to give Elasticsearch a chance to timeout the shards and return
partial results.
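A sketch of raising the regular search timeout while keeping the client side value at double the shard timeout, as recommended above. The 30s figure is arbitrary; the regex values repeat the defaults:
$wgCirrusSearchSearchShardTimeout = [
'default' => '30s',
'regex' => '120s',
];
$wgCirrusSearchClientSideSearchTimeout = [
'default' => 60, // 2 x 30s
'regex' => 240, // 2 x 120s
];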
; $wgCirrusSearchMaintenanceTimeout
Default:
$wgCirrusSearchMaintenanceTimeout = 3600;
Client side timeout for maintenance operations. We can't disable the timeout
altogether so we set it to one hour for really long running operations
like optimize.
; $wgCirrusSearchPrefixSearchStartsWithAnyWord
Default:
$wgCirrusSearchPrefixSearchStartsWithAnyWord = false;
Is it ok if the prefix starts on any word in the title or just the first word?
Defaults to false (first word only) because that is the Wikipedia behavior and so
what we expect users to expect. Does not affect the prefix: search filter or
url parameter - that always starts with the first word. false -> true will break
prefix searching until an in place reindex is complete. true -> false is fine
any time and you can then go false -> true if you haven't run an in place reindex
since the change.
; $wgCirrusSearchPhraseSlop
Default:
$wgCirrusSearchPhraseSlop = [ 'precise' => 0, 'default' => 0, 'boost' => 1 ];
Phrase slop is how many words not searched for can be in the phrase and it'll still
match. If I search for "like yellow candy" then phraseSlop of 0 won't match "like
brownish yellow candy" but phraseSlop of 1 will. The 'precise' key is for matching
quoted text. The 'default' key is for matching quoted text that ends in a ~.
The 'boost' key is used for the phrase rescore that boosts phrase matches on queries
that don't already contain phrases.
; $wgCirrusSearchPhraseRescoreBoost
Default:
$wgCirrusSearchPhraseRescoreBoost = 10.0;
If the search doesn't include any phrases (delimited by quotes) then we try wrapping
the whole thing in quotes because sometimes that can turn up better results. This is
the boost that we give such matches. Set this less than or equal to 1.0 to turn off
this feature.
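For example, to turn the automatic phrase rescore off entirely:
// Any value less than or equal to 1.0 disables the feature.
$wgCirrusSearchPhraseRescoreBoost = 1.0;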
; $wgCirrusSearchPhraseRescoreWindowSize
Default:
$wgCirrusSearchPhraseRescoreWindowSize = 512;
Number of documents per shard for which automatic phrase matches are performed if it
is enabled.
; $wgCirrusSearchFunctionRescoreWindowSize
Default:
$wgCirrusSearchFunctionRescoreWindowSize = 8192;
Number of documents per shard for which function scoring is applied. This is stuff
like incoming links boost, prefer-recent decay, and boost-templates.
; $wgCirrusSearchMoreAccurateScoringMode
Default:
$wgCirrusSearchMoreAccurateScoringMode = true;
If true CirrusSearch asks Elasticsearch to perform searches using a mode that should
produce more accurate results at the cost of performance. See this for more info:
; $wgCirrusSearchFallbackProfile
Default:
$wgCirrusSearchFallbackProfile = 'phrase_suggest_and_language_detection';
Configure fallback methods.
Responsible for displaying the "Did you mean" suggestion and/or
rewriting the query to increase the chance of displaying some results.
; $wgCirrusSearchFallbackProfiles
Default:
$wgCirrusSearchFallbackProfiles = [];
Additional fallback profiles
(see profiles/FallbackProfiles.config.php)
; $wgCirrusSearchEnablePhraseSuggest
Default:
$wgCirrusSearchEnablePhraseSuggest = true;
Should the phrase suggester (did you mean) be enabled?
; $wgCirrusSearchPhraseSuggestProfiles
Default:
$wgCirrusSearchPhraseSuggestProfiles = [];
Set additional phrase suggester profiles
(see profiles/PhraseSuggesterProfiles.config.php)
; $wgCirrusSearchInterwikiHTTPTimeout
Default:
$wgCirrusSearchInterwikiHTTPTimeout = 10;
Read timeout (in seconds) for HTTP requests made to another wiki's API.
; $wgCirrusSearchInterwikiHTTPConnectTimeout
Default:
$wgCirrusSearchInterwikiHTTPConnectTimeout = 5;
Connection timeout (in seconds) for HTTP requests made to another wiki's API.
; $wgCirrusSearchPhraseSuggestReverseField
Default:
$wgCirrusSearchPhraseSuggestReverseField = [
'build' => false,
'use' => false,
];
Use a reverse field to build the did you mean suggestions.
This is useful to work around the prefix length limitation: by working with a reverse
field we can suggest corrections for typos that appear in the first 2 characters of the word.
For example, suggesting "search" when the user types "saerch" is possible with the reverse field.
Set build to true and reindex before setting use to true.
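The two-step rollout can be sketched as follows:
// Step 1: build the reverse field, then reindex.
$wgCirrusSearchPhraseSuggestReverseField = [
'build' => true,
'use' => false,
];
// Step 2: once the reindex has completed, switch 'use' on as well.
// $wgCirrusSearchPhraseSuggestReverseField = [ 'build' => true, 'use' => true ];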
; $wgCirrusSearchPhraseSuggestUseText
Default:
$wgCirrusSearchPhraseSuggestUseText = false;
Look for suggestions in the article text?
An in place reindex is needed after any changes to this value.
; $wgCirrusSearchPhraseSuggestUseOpeningText
Default:
$wgCirrusSearchPhraseSuggestUseOpeningText = false;
Look for suggestions in the article opening text?
An in place reindex is needed after any changes to this value.
; $wgCirrusSearchAllowLeadingWildcard
Default:
$wgCirrusSearchAllowLeadingWildcard = true;
Allow leading wildcard queries.
Searching for terms that have a leading ? or * can be very slow. Turn this off to
disable it. Terms with leading wildcards will have the wildcard escaped.
; $wgCirrusSearchIndexedRedirects
Default:
$wgCirrusSearchIndexedRedirects = 1024;
Maximum number of redirects per target page to index.
; $wgCirrusSearchIndexFieldsToCleanup
Default:
$wgCirrusSearchIndexFieldsToCleanup = [];
List of strings identifying the fields to remove from the index when the next in-place re-index is run.
; $wgCirrusSearchIndexWeightedTagsPrefixMap
Default:
$wgCirrusSearchIndexWeightedTagsPrefixMap = [];
Map of weighted tag prefix replacements, mapping old (key) to new (value) prefixes.
Example:
$wgCirrusSearchIndexWeightedTagsPrefixMap = [ "old.prefix" => "new.prefix" ];
; $wgCirrusSearchLinkedArticlesToUpdate
Default:
$wgCirrusSearchLinkedArticlesToUpdate = 25;
Maximum number of newly linked articles to update when an article changes.
; $wgCirrusSearchUnlinkedArticlesToUpdate
Default:
$wgCirrusSearchUnlinkedArticlesToUpdate = 25;
Maximum number of newly unlinked articles to update when an article changes.
; $wgCirrusSearchSimilarityProfile
Default:
$wgCirrusSearchSimilarityProfile = 'classic';
Configure the similarity module.
See profile/SimilarityProfiles.php for more details.
; $wgCirrusSearchWeights
Default:
$wgCirrusSearchWeights = [
'title' => 20,
'redirect' => 15,
'category' => 8,
'heading' => 5,
'opening_text' => 3,
'text' => 1,
'auxiliary_text' => 0.5,
'file_text' => 0.5,
];
Weight of fields. Changes to this require an in place reindex to take effect.
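To change a single weight while keeping the other defaults, assign to one key. The value 25 is arbitrary:
// Takes effect only after an in place reindex.
$wgCirrusSearchWeights[ 'title' ] = 25;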
; $wgCirrusSearchPrefixWeights
Default:
$wgCirrusSearchPrefixWeights = [
'title' => 10,
'redirect' => 1,
'title_asciifolding' => 7,
'redirect_asciifolding' => 0.7,
];
Weight of fields in prefix search. It is safe to change these at any time.
; $wgCirrusSearchBoostOpening
Default:
$wgCirrusSearchBoostOpening = 'first_heading';
The method Cirrus will use to extract the opening section of the text. Valid values are:
* first_heading - Wikipedia style. Grab the text before the first heading (h1-h6) tag.
* none - Do not extract opening text and do not search it.
; $wgCirrusSearchNearMatchWeight
Default:
$wgCirrusSearchNearMatchWeight = 2;
Weight of fields that match via "near_match" which is ordered.
; $wgCirrusSearchStemmedWeight
Default:
$wgCirrusSearchStemmedWeight = 0.5;
Weight of stemmed fields relative to unstemmed. Meaning if searching for ,