Live Reindex in Elasticsearch

This article demonstrates how to do a live reindex in Elasticsearch using ElasticsearchCRUD. The reindex uses scan and scroll to fetch the data and then writes it to a new index using bulk inserts. The reindex supports alias mapping, which makes it possible to do the reindex live.

Code: https://github.com/damienbod/LiveReindexInElasticsearch

Other tutorials:

Part 1: ElasticsearchCRUD introduction
Part 2: MVC application search with simple documents using autocomplete, jQuery and jTable
Part 3: MVC Elasticsearch CRUD with nested documents
Part 4: Data Transfer from MS SQL Server using Entity Framework to Elasticsearch
Part 5: MVC Elasticsearch with child, parent documents
Part 6: MVC application with Entity Framework and Elasticsearch
Part 7: Live Reindex in Elasticsearch
Part 8: CSV export using Elasticsearch and Web API
Part 9: Elasticsearch Parent, Child, Grandchild Documents and Routing
Part 10: Elasticsearch Type mappings with ElasticsearchCRUD
Part 11: Elasticsearch Synonym Analyzer using ElasticsearchCRUD
Part 12: Using Elasticsearch German Analyzer
Part 13: MVC google maps search using Elasticsearch
Part 14: Search Queries and Filters with ElasticsearchCRUD
Part 15: Elasticsearch Bulk Insert
Part 16: Elasticsearch Aggregations With ElasticsearchCRUD
Part 17: Searching Multiple Indices and Types in Elasticsearch
Part 18: MVC searching with Elasticsearch Highlighting
Part 19: Index Warmers with ElasticsearchCRUD

Pre: Setting up the document search engine and index

AdventureWorks2012 is used to fill the search engine with data. It can be downloaded here.

The code for adding a person index from the Entity Person can be downloaded here.
This code creates a new index “persons_v1”. This is then mapped to “persons” using an alias.

public void CreatePersonAliasForPersonV1Mapping(string alias)
{
	using (var context = new ElasticsearchContext("http://localhost:9200/", _elasticsearchMappingResolver))
	{
		context.AliasCreateForIndex(alias, _elasticsearchMappingResolver.GetElasticSearchMapping(typeof(Person)).GetIndexForType(typeof(Person)));
	}
}

Now the index and the alias are set up. Elasticsearch is ready for a live reindex.

This can be viewed here: http://localhost:9200/_aliases or http://localhost:9200/_cat/aliases

{
    "persons_v1": {
        "aliases": {
            "persons": {}
        }
    }
}
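To check the alias from code rather than in the browser, the response of the _aliases endpoint can be parsed directly. The following is only a minimal sketch and not part of ElasticsearchCRUD: the class AliasInspector and its method are hypothetical names, and it assumes System.Text.Json is available in your project.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper for illustration, not part of ElasticsearchCRUD.
public static class AliasInspector
{
    // Returns the names of all indices that carry the given alias,
    // based on the JSON body returned by GET /_aliases.
    public static List<string> GetIndicesForAlias(string aliasesJson, string alias)
    {
        var indices = new List<string>();
        using (JsonDocument document = JsonDocument.Parse(aliasesJson))
        {
            foreach (JsonProperty index in document.RootElement.EnumerateObject())
            {
                // Skip anything that does not look like { "aliases": { ... } }.
                if (index.Value.ValueKind != JsonValueKind.Object)
                {
                    continue;
                }

                if (index.Value.TryGetProperty("aliases", out JsonElement aliases)
                    && aliases.ValueKind == JsonValueKind.Object
                    && aliases.TryGetProperty(alias, out JsonElement _))
                {
                    indices.Add(index.Name);
                }
            }
        }
        return indices;
    }
}
```

For the response shown above, calling GetIndicesForAlias with the alias "persons" returns a list containing only "persons_v1".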

STEP 1: CREATE NEW INDEX persons_v2 from INDEX persons_v1

A new index is created from the old index. This can be executed live: while the reindex is running, documents can still be added or updated. Deleted documents, however, are not detected by the reindex. Because of this, you should mark documents with a bool deleted field instead of deleting them.
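Because hard deletes are invisible to the reindex, one option is to flag a document as deleted and save it like any other update. The following is a minimal sketch under that assumption; PersonDocument is a hypothetical, trimmed-down stand-in for the real document type, and SoftDelete is not part of ElasticsearchCRUD.

```csharp
using System;

// Hypothetical, trimmed-down document type used only for this sketch.
public class PersonDocument
{
    public int BusinessEntityID { get; set; }
    public DateTime ModifiedDate { get; set; }
    public bool Deleted { get; set; }
}

public static class SoftDelete
{
    // Flags the document as deleted instead of removing it from the index.
    // ModifiedDate is updated as well, so a range query on the modified
    // date picks up the change like any other update.
    public static PersonDocument MarkDeleted(PersonDocument person, DateTime utcNow)
    {
        person.Deleted = true;
        person.ModifiedDate = utcNow;
        return person;
    }
}
```

The flagged document would then be saved with a normal update, and searches would need to filter out documents where the deleted field is true.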

The reindex requires the index and type of the old index as well as the index and type of the new index. If you are using a parent/child document index, you have to repeat this step for the child documents. Parent/child document reindexing is not supported until ElasticsearchCRUD version 1.0.15.

// - This timestamp is usually DateTime.UtcNow.
// - It is required so that all the documents which were updated during the reindex can be found
DateTime beginDateTime = DateTime.UtcNow;

var reindex = new ElasticsearchCrudReindex<Person, PersonV2>(
	new IndexTypeDescription("persons_v1", "person"), 
	new IndexTypeDescription("persons_v2", "person"), 
	"http://localhost:9200");

The reindex can be configured as required. The method uses scan and scroll. You can change the default settings and allow more documents in each request and response. You should not make it too big because you don’t want to be sending 500MB with each request and response.

The ScanAndScrollConfiguration defines how long the scroll is kept open for each request, including the time unit, and how many documents are fetched per shard. In the code underneath, 5s is defined, and up to 1000 documents are fetched from each shard if there is enough time.
For example, if you have 5 shards in your index, enough time is configured and there are enough documents, the configuration underneath fetches 5000 documents with each request.

The console trace provider is used to display the progress.

reindex.ScanAndScrollConfiguration = new ScanAndScrollConfiguration(new TimeUnitSecond(5), 1000);
reindex.TraceProvider = new ConsoleTraceProvider(TraceEventType.Information);
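As a rough check of how large each scroll response can get with these settings, the upper bound is the per-shard size multiplied by the number of shards. The helper below is a small sketch with made-up names, and the average document size is something you would have to estimate for your own data.

```csharp
public static class ScrollSizing
{
    // Upper bound on the documents returned per scroll request when the
    // size setting applies to each shard, as described for scan and scroll.
    public static int MaxDocumentsPerScroll(int sizePerShard, int shards)
    {
        return checked(sizePerShard * shards);
    }

    // Rough payload estimate in bytes for one scroll response, given an
    // average document size that the caller has to supply.
    public static long EstimatedResponseBytes(int sizePerShard, int shards, long averageDocumentBytes)
    {
        return checked((long)MaxDocumentsPerScroll(sizePerShard, shards) * averageDocumentBytes);
    }
}
```

With 1000 documents per shard and 5 shards, MaxDocumentsPerScroll returns 5000. If a document averages 2048 bytes, EstimatedResponseBytes gives 10,240,000 bytes, roughly 10 MB per response.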

The Reindex method itself requires two functions and the JSON content for the _search query. A less than range query is used to select all documents modified before the defined DateTime. If you require different query logic, you can define it as needed. You can use a match_all query if you are not worried about updates during the reindex.

The reindex also requires your conversion method. This is the reason for doing the reindex. In this example, a deleted bool field is added to the document. The other function is used to define the document _id.

reindex.Reindex(
	PersonReindexConfiguration.BuildSearchModifiedDateTimeLessThan(beginDateTime), 
	PersonReindexConfiguration.GetKeyMethod, 
	PersonReindexConfiguration.CreatePersonV2FromPerson);
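If updates during the reindex are not a concern, the JSON content passed as the first argument could simply be a match_all query. A minimal sketch follows; the name BuildSearchMatchAll is an assumption for illustration and is not taken from the example project.

```csharp
public static class MatchAllQuery
{
    // Builds the JSON body for a _search request that selects every document.
    public static string BuildSearchMatchAll()
    {
        return "{ \"query\": { \"match_all\": {} } }";
    }
}
```

Keep in mind that with a match_all query the later step that picks up documents updated during the reindex has nothing to distinguish old from new changes, so it only fits indexes that are not written to while the reindex runs.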

The JSON content builder method BuildSearchModifiedDateTimeLessThan builds the JSON query. This is a very primitive implementation; you could do this much more elegantly if required.

public static string BuildSearchModifiedDateTimeLessThan(DateTime dateTimeUtc)
{
	return BuildSearchRange("lt", "modifieddate", dateTimeUtc);
}

//{
//   "query" :  {
//	   "range": {  "modifieddate": { "lt":   "2003-12-29T00:00:00"  } }
//	}
//}
private static string BuildSearchRange(string lessThanOrGreaterThan, string updatePropertyName, DateTime dateTimeUtc)
{
	string isoDateTime = dateTimeUtc.ToString("s");
	var buildJson = new StringBuilder();
	buildJson.AppendLine("{");
	buildJson.AppendLine("\"query\": {");
	buildJson.AppendLine("\"range\": {  \"" + updatePropertyName + "\": { \"" + lessThanOrGreaterThan + "\":   \"" + isoDateTime + "\"  } }");
	buildJson.AppendLine("}");
	buildJson.AppendLine("}");

	return buildJson.ToString();
}
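One way to make the builder less fragile is to let a JSON serializer produce the string, so that quoting and escaping are handled by the library. The following sketch assumes System.Text.Json is available; the class name RangeQueryBuilder is hypothetical, and this is an alternative to the StringBuilder version above, not the code used in the example project.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

public static class RangeQueryBuilder
{
    // Builds { "query": { "range": { <field>: { <operator>: <ISO date> } } } }
    // by serializing nested dictionaries instead of concatenating strings.
    public static string BuildSearchRange(string rangeOperator, string fieldName, DateTime dateTimeUtc)
    {
        // "s" is the sortable ISO 8601 pattern, e.g. 2003-12-29T00:00:00.
        string isoDateTime = dateTimeUtc.ToString("s", CultureInfo.InvariantCulture);

        var body = new Dictionary<string, object>
        {
            ["query"] = new Dictionary<string, object>
            {
                ["range"] = new Dictionary<string, object>
                {
                    [fieldName] = new Dictionary<string, string>
                    {
                        [rangeOperator] = isoDateTime
                    }
                }
            }
        };

        return JsonSerializer.Serialize(body);
    }
}
```

Calling RangeQueryBuilder.BuildSearchRange("lt", "modifieddate", beginDateTime) yields a query with the same structure as the one shown in the comment above, just without the extra whitespace.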

The conversion method converts the old document type to the new document type.

public static PersonV2 CreatePersonV2FromPerson(Person item)
{
	return new PersonV2
	{
		BusinessEntityID = item.BusinessEntityID,
		PersonType = item.PersonType,
		NameStyle = item.NameStyle,
		Title = item.Title,
		FirstName = item.FirstName,
		MiddleName = item.MiddleName,
		LastName = item.LastName,
		Suffix = item.Suffix,
		EmailPromotion = item.EmailPromotion,
		AdditionalContactInfo = item.AdditionalContactInfo,
		Demographics = item.Demographics,
		rowguid = item.rowguid,
		ModifiedDate = item.ModifiedDate,
		Deleted = false
	};
}

The key method returns the value which is used as the document _id.
public static object GetKeyMethod(Person person)
{
	return person.BusinessEntityID;
}

STEP 2: REPLACE ALIAS persons TO INDEX persons_v2

Now the alias is switched from the old index to the new index.

// ---------------------------------------------------------
// STEP 2: REPLACE ALIAS persons TO INDEX persons_v2 
// ---------------------------------------------------------
reindex.SwitchAliasfromOldToNewIndex("persons");

The alias now points to the persons_v2 index. This can be checked here: http://localhost:9200/_aliases

{
    "persons_v1": {
        "aliases": {}
    },
    "persons_v2": {
        "aliases": {
            "persons": {}
        }
    }
}
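In Elasticsearch itself, an alias can be moved atomically by sending a remove action and an add action in a single request to the _aliases endpoint. The sketch below only builds that request body. It is not necessarily how SwitchAliasfromOldToNewIndex is implemented, the class and method names are hypothetical, and System.Text.Json is assumed to be available.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class AliasSwitch
{
    // Builds the body for POST /_aliases which removes the alias from the
    // old index and adds it to the new index in one atomic operation.
    public static string BuildSwitchAliasBody(string alias, string oldIndex, string newIndex)
    {
        var body = new Dictionary<string, object>
        {
            ["actions"] = new object[]
            {
                new Dictionary<string, object>
                {
                    ["remove"] = new Dictionary<string, string>
                    {
                        ["index"] = oldIndex,
                        ["alias"] = alias
                    }
                },
                new Dictionary<string, object>
                {
                    ["add"] = new Dictionary<string, string>
                    {
                        ["index"] = newIndex,
                        ["alias"] = alias
                    }
                }
            }
        };

        return JsonSerializer.Serialize(body);
    }
}
```

Because both actions are part of one request, searches against the alias never see a moment in which it points to no index or to both.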

STEP 3: NOW GET ALL THE DOCUMENTS WHICH WERE UPDATED WHILE REINDEXING AND REINDEX

Now that the new index is up and running, all the documents which were updated in the old index while the reindexing took place are reindexed as well. This uses a greater than range query and returns all documents with a modified date later than the begin DateTime.
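The builder for this second query is not shown in the article; presumably it mirrors the less than version. The sketch below is an assumption along those lines, with hypothetical class and method names. Note that it deliberately uses "gte" rather than "gt", so that a document modified at exactly the begin timestamp, which the "lt" query skips, is not missed by both passes.

```csharp
using System;
using System.Globalization;

public static class GreaterThanQuery
{
    // Assumed counterpart of BuildSearchModifiedDateTimeLessThan: selects
    // all documents modified at or after the given timestamp.
    public static string BuildSearchModifiedDateTimeGreaterThanOrEqual(DateTime dateTimeUtc)
    {
        // "s" is the sortable ISO 8601 pattern, e.g. 2003-12-29T00:00:00.
        string isoDateTime = dateTimeUtc.ToString("s", CultureInfo.InvariantCulture);

        return "{ \"query\": { \"range\": { \"modifieddate\": { \"gte\": \""
            + isoDateTime
            + "\" } } } }";
    }
}
```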

If the same document has been updated again in the new index in the meantime, it will be overwritten. Decide whether this matters for your data and adjust the search query accordingly.

// ---------------------------------------------------------
// STEP 3: NOW GET ALL THE DOCUMENTS WHICH WERE UPDATED WHILE REINDEXING
// ---------------------------------------------------------
// NOTE: if the document is updated again in the meantime, it will be overwritten with this method.
// If required, you must check the update timestamp of the item in the new index!
reindex.Reindex(
	PersonReindexConfiguration.BuildSearchModifiedDateTimeGreaterThan(beginDateTime), 
	PersonReindexConfiguration.GetKeyMethod, 
	PersonReindexConfiguration.CreatePersonV2FromPerson);
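The timestamp check mentioned in the note could look roughly like the following. This is a hypothetical sketch and not something ElasticsearchCRUD provides: it only covers the comparison, and how the modified date of the existing document in the new index is fetched is left open.

```csharp
using System;

public static class OverwriteGuard
{
    // Returns true when the copy from the old index should be written to
    // the new index: either the new index has no such document yet (null),
    // or the old copy is strictly newer than the one already there.
    public static bool ShouldOverwrite(DateTime oldIndexModifiedDate, DateTime? newIndexModifiedDate)
    {
        return !newIndexModifiedDate.HasValue
            || oldIndexModifiedDate > newIndexModifiedDate.Value;
    }
}
```

A document which was updated in persons_v2 after the alias switch then keeps its newer content, because the older copy from persons_v1 is skipped.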

When you run the application, you can view the progress in the console:
[Image: reindex_el_01]

Conclusion
Live reindex is a great feature in Elasticsearch and separates the boys from the men when it comes to search engines or NoSQL databases. Reindexing of parent/child document indexes will be supported in ElasticsearchCRUD 1.0.15.

Links:

https://www.nuget.org/packages/ElasticsearchCRUD/

http://obtao.com/blog/2014/03/elasticsearch-symfony-export-scan-scroll-functions/

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html

https://found.no/foundation/elasticsearch-top-down/

http://exploringelasticsearch.com/overview.html
