Creating a synonyms file

Solr provides the facility to configure the synonyms that are used for querying textual data. In i2 Analyze, you can use this option to apply a customized list of synonyms at query time. If synonyms are not accounted for, search results can be less relevant when the keywords that you search for are present in alternative forms in your index.

About this task

The synonyms file is the part of the Solr configuration that accounts for the presence of synonyms in your data. For example, your data might contain the words “bag”, “handbag”, “pocketbook”, and “purse” for the concept “bag”. Someone who searches is likely to enter only one of these words, but to expect results for all four. To meet that expectation, you can create a customized synonyms file that accommodates the variations that are specific to your data. The words that are most useful in a synonyms list depend on the content of your data. You can also mix languages in a list, which might be useful in some contexts, for example for names: 'George, Γεώργιος, Jorge'.
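
As a conceptual illustration only, the following Python sketch shows what a search user expects from a synonym list: a search for any one word in a group should match all the words in that group. The function names are hypothetical, and the code does not represent how Solr or i2 Analyze implements synonym expansion.

```python
def build_synonym_index(groups):
    """Map every term to the full set of terms in its synonym group."""
    index = {}
    for group in groups:
        members = set(group)
        for term in group:
            index.setdefault(term, set()).update(members)
    return index


def expand(term, index):
    """Return the term together with its synonyms, or the term alone if it has none."""
    return sorted(index.get(term, {term}))


# Synonym groups taken from the examples in this topic.
groups = [
    ["bag", "handbag", "pocketbook", "purse"],
    ["George", "Γεώργιος", "Jorge"],
]
index = build_synonym_index(groups)

print(expand("purse", index))   # all four words for the concept "bag"
print(expand("wallet", index))  # no synonyms defined, so only the term itself
```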

The default synonyms file and synonyms list are in US English. The synonyms files that are associated by default with each supported language are supplied in the toolkit\configuration\solr directory.

To customize the alternative terms that are used in search operations for your data, you can create files that contain different terms from those in the supplied synonyms files.

The customized file must adhere to the following guidelines:
  • The file must be UTF-8 encoded.
  • The terms in the file must match the terms that the Solr analyzer chain produces before the synonym filter is applied.
  • If multiple forms of a word exist, you must specify all the forms for synonym matching to work on each form.
  • Words from Latin script languages, for example French or Italian, must be specified without diacritics. For example, use the following substitutions:
    • a instead of á
    • c instead of ç
  • Arabic and Hebrew words must be specified exactly as they are written.
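
The diacritics guideline can be applied to a list of terms before you add them to the file. The following Python sketch is one possible approach, not part of i2 Analyze: the function name strip_latin_diacritics is hypothetical, and the code assumes that only characters whose base letter is Latin should lose their diacritics, so that Arabic, Hebrew, and other scripts are left exactly as written.

```python
import unicodedata


def strip_latin_diacritics(term):
    """Remove diacritics from Latin-script letters; leave other scripts untouched."""
    out = []
    last_base_is_latin = False
    for ch in term:
        if unicodedata.combining(ch):
            # A standalone combining mark: drop it only when it follows a Latin letter.
            if not last_base_is_latin:
                out.append(ch)
            continue
        # Canonical decomposition puts the base letter first, e.g. "á" -> "a" + U+0301.
        base = unicodedata.normalize("NFD", ch)[0]
        last_base_is_latin = unicodedata.name(base, "").startswith("LATIN")
        out.append(base if last_base_is_latin else ch)
    return "".join(out)


print(strip_latin_diacritics("façade"))
print(strip_latin_diacritics("Γεώργιος"))
```

Letters that have no canonical decomposition, such as "ø", pass through unchanged in this sketch, so review the output against the terms that your analyzer chain actually produces.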

Procedure

  1. Create a text file that defines synonyms in the required Solr format.

    For more information, see https://lucene.apache.org/core/8_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SolrSynonymParser.html.

    Note:
    1. You cannot search for multi-word terms. However, if you have data that contains the terms "USA" and "United States of America", you can search for "USA" and use a synonym to ensure a match with "United States of America".
    2. You can provide synonyms for terms that include punctuation. However, a search on such a term might not work correctly. This unexpected result occurs because a filter is applied before the synonyms. For example, "Mary-Ann" becomes "Mary,Ann", and synonyms are then expanded from "Mary" and "Ann", not from "Mary-Ann" or "Maryann".
  2. Save the file with a .txt extension, for example custom-synonyms.txt.
  3. Complete the instructions in Configuring the Solr index to deploy with your synonyms file.
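
Steps 1 and 2 can also be scripted. The following Python sketch is a minimal example under stated assumptions: it writes groups of equivalent synonyms as comma-separated lines, which is one of the forms that the Solr synonym format accepts, and it saves the result as a UTF-8 encoded .txt file. The function names and the comment line are hypothetical, and the terms are the examples from this topic; replace them with terms that match the output of your analyzer chain.

```python
from pathlib import Path


def format_synonym_line(terms):
    """Return one line of equivalent synonyms, for example 'bag, handbag'."""
    cleaned = [term.strip() for term in terms]
    if len(cleaned) < 2 or not all(cleaned):
        raise ValueError("a synonym group needs at least two non-empty terms")
    return ", ".join(cleaned)


def write_synonyms_file(path, lines):
    """Write the lines to a .txt file, always encoded as UTF-8."""
    path = Path(path)
    if path.suffix != ".txt":
        raise ValueError("the synonyms file must have a .txt extension")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


lines = [
    "# Hypothetical custom synonyms; lines that start with # are comments",
    format_synonym_line(["bag", "handbag", "pocketbook", "purse"]),
    format_synonym_line(["George", "Γεώργιος", "Jorge"]),
    format_synonym_line(["USA", "United States of America"]),
]
```

To create the file, call write_synonyms_file("custom-synonyms.txt", lines), and then continue with step 3.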