Elasticsearch Synonym Analyzer using ElasticsearchCRUD

This article shows how to setup and use a custom analyzer in Elasticsearch using ElasticsearchCRUD. An analyzer with a custom Synonym Token Filter is created and added to the index. If you search for any of the synonyms, you will find all the hits for all the possible texts.

Code: https://github.com/damienbod/ElasticsearchSynonymAnalyzer

Other Tutorials:

Part 1: ElasticsearchCRUD introduction
Part 2: MVC application search with simple documents using autocomplete, jQuery and jTable
Part 3: MVC Elasticsearch CRUD with nested documents
Part 4: Data Transfer from MS SQL Server using Entity Framework to Elasticsearch
Part 5: MVC Elasticsearch with child, parent documents
Part 6: MVC application with Entity Framework and Elasticsearch
Part 7: Live Reindex in Elasticsearch
Part 8: CSV export using Elasticsearch and Web API
Part 9: Elasticsearch Parent, Child, Grandchild Documents and Routing
Part 10: Elasticsearch Type mappings with ElasticsearchCRUD
Part 11: Elasticsearch Synonym Analyzer using ElasticsearchCRUD
Part 12: Using Elasticsearch German Analyzer
Part 13: MVC google maps search using Elasticsearch
Part 14: Search Queries and Filters with ElasticsearchCRUD
Part 15: Elasticsearch Bulk Insert
Part 16: Elasticsearch Aggregations With ElasticsearchCRUD
Part 17: Searching Multiple Indices and Types in Elasticsearch
Part 18: MVC searching with Elasticsearch Highlighting
Part 19: Index Warmers with ElasticsearchCRUD

Creating a custom Synonym Analyzer

Creating an index with custom analyzers, filters or tokenizers is very easy to do in ElasticsearchCRUD. Strongly typed class configuration is available for all types as well as const definitions for all the default possibilities. In the example below, a SynonymTokenFilter is created and added to the custom analyzer. This is then included in the index mappings along with other configurations.

new IndexDefinition
{
	IndexSettings =
	{
		Analysis = new Analysis
		{
			Analyzer =
			{
				Analyzers = new List<AnalyzerBase>
				{
					new CustomAnalyzer("john_analyzer")
					{
						Tokenizer = DefaultTokenizers.Whitespace,
						Filter = new List<string> {DefaultTokenFilters.Lowercase, "john_synonym"}
					}
				}
			},
			Filters =
			{
				CustomFilters = new List<AnalysisFilterBase>
				{
					new SynonymTokenFilter("john_synonym")
					{
						Synonyms = new List<string>
						{
							"sean  => john, sean, séan",
							"séan => john, sean, séan",
							"johny => john",
						}
					}
				}
			}
		},
		NumberOfShards = 3,
		NumberOfReplicas = 1
	},
};

The index settings is created as follows in Elasticsearch:

http://localhost:9200/_settings

analyzersEs_01

Creating the mapping for anaylzed and non-analyzed

Now a mapping is created for the member type using the Member class to map this. The Name property uses the custom analyzer and also saves the original text in a non-analyzed field using the Fields configuration.

public class Member
{
	public long Id { get; set; }

	[ElasticsearchString(Index = StringIndex.analyzed, Analyzer="john_analyzer", Fields = typeof(FieldDataDefinition))]
	public string Name { get; set; }

	public string FamilyName { get; set; }

	public string Info { get; set; }
}


public class FieldDataDefinition
{
	[ElasticsearchString(Index=StringIndex.not_analyzed)]
	public string Raw { get; set; }
}

The index with mappings and settings is created using the following method:

_context.IndexCreate<Member>(indexDefinition);

This is then created as follows:

http://localhost:9200/_mapping

analyzersEs_02

Adding some data

Now some data can be added to the index using a bulk insert.

public void CreateSomeMembers()
{
	var jm = new Member {Id = 1, FamilyName = "Moore", Info = "In the club since 1976", Name = "John"};
	_context.AddUpdateDocument(jm, jm.Id);
	var jj = new Member { Id = 2, FamilyName = "Jones", Info = "A great help for the background staff", Name = "Johny" };
	_context.AddUpdateDocument(jj, jj.Id);
	var pm = new Member { Id = 3, FamilyName = "Murphy", Info = "Likes to take control", Name = "Paul" };
	_context.AddUpdateDocument(pm, pm.Id);
	var sm = new Member { Id = 4, FamilyName = "McGurk", Info = "Fresh and fit", Name = "Séan" };
	_context.AddUpdateDocument(sm, sm.Id);
	var sob = new Member { Id = 5, FamilyName = "O'Brien", Info = "Not much use, bit of a problem", Name = "Sean" };
	_context.AddUpdateDocument(sob, sob.Id);
	var tmc = new Member { Id = 5, FamilyName = "McCauley", Info = "Couldn't ask for anyone better", Name = "Tadhg" };
	_context.AddUpdateDocument(tmc, tmc.Id);

	_context.SaveChanges();
}

Search using the analyzed data

The data can be searched for using the query match as we want to search the analyzed fields. If you search for sean, séan, Sean, Séan, John, Johny, all john results will be found.

//{
//  "query": {
//		"match": {"name": "sean"}
//	 }
//  }
//}
public SearchResult<Member> Search(string name)
{
	var query = "{ \"query\": { \"match\": {\"name\": \""+ name + "\"} }  } }";
	return _context.Search<Member>(query).PayloadResult;
}

The request is sent as follows:

POST http://localhost:9200/members/member/_search HTTP/1.1

Content-Type: application/json
Host: localhost:9200
Content-Length: 46
Expect: 100-continue
Connection: Keep-Alive

{ "query": { "match": {"name": "Johny"} }  } }

You can also inspect the analyzed tokens using the _analyze API

http://localhost:9200/members/_analyze?&analyzer=john_analyzer

Links:

https://www.nuget.org/packages/ElasticsearchCRUD/

http://www.elasticsearch.org/

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/using-synonyms.html

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/multi-word-synonyms.html

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/synonyms.html

http://stackoverflow.com/questions/18405779/how-to-configure-the-synonyms-path-in-elasticsearch

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html

https://fuzzyquery.wordpress.com/2013/08/15/example-shingle-filter-mapping/

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/using-stopwords.html

https://blog.codecentric.de/en/2014/05/elasticsearch-indexing-performance-cheatsheet/


# Analyze text: "the <b>quick</b> bröwn <img src="fox"/> &quot;jumped&quot;"
curl -XPUT 'http://127.0.0.1:9200/foo/&#39; -d '
{
"index" : {
"analysis" : {
"analyzer" : {
"test_1" : {
"char_filter" : [
"html_strip"
],
"tokenizer" : "standard"
},
"test_2" : {
"filter" : [
"standard",
"lowercase",
"stop",
"asciifolding"
],
"char_filter" : [
"html_strip"
],
"tokenizer" : "standard"
}
}
}
}
}
'
curl -XGET 'http://127.0.0.1:9200/foo/_analyze?format=text&text=the+%3Cb%3Equick%3C%2Fb%3E+br%C3%B6wn+%3Cimg+src%3D%22fox%22%2F%3E+%26quot%3Bjumped%26quot%3B&analyzer=standard&#39;
# "tokens" : "[b:5->6:<ALPHANUM>]
#
# 3:
# [quick:7->12:<ALPHANUM>]
#
# 4:
# [b:14->15:<ALPHANUM>]
#
# 5:
# [bröwn:17->22:<ALPHANUM>]
#
# 6:
# [img:24->27:<ALPHANUM>]
#
# 7:
# [src:28->31:<ALPHANUM>]
#
# 8:
# [fox:33->36:<ALPHANUM>]
#
# 9:
# [quot:41->45:<ALPHANUM>]
#
# 10:
# [jumped&quot:46->57:<COMPANY>]
# "
# }
curl -XGET 'http://127.0.0.1:9200/foo/_analyze?format=text&text=the+%3Cb%3Equick%3C%2Fb%3E+br%C3%B6wn+%3Cimg+src%3D%22fox%22%2F%3E+%26quot%3Bjumped%26quot%3B&analyzer=test_1&#39;
# {
# "tokens" : "[the:0->3:<ALPHANUM>]
#
# 2:
# [quick:7->12:<ALPHANUM>]
#
# 3:
# [bröwn:17->22:<ALPHANUM>]
#
# 4:
# [jumped:46->52:<ALPHANUM>]
# "
# }
curl -XGET 'http://127.0.0.1:9200/foo/_analyze?format=text&text=the+%3Cb%3Equick%3C%2Fb%3E+br%C3%B6wn+%3Cimg+src%3D%22fox%22%2F%3E+%26quot%3Bjumped%26quot%3B&analyzer=test_2&#39;
# {
# "tokens" : "[quick:7->12:<ALPHANUM>]
#
# 3:
# [brown:17->22:<ALPHANUM>]
#
# 4:
# [jumped:46->52:<ALPHANUM>]
# "
# }

view raw

html_strip.sh

hosted with ❤ by GitHub

http://bundlr.com/clips/51f89017841e986de80003a7

One comment

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.