I'm using this script to bulk update docs in my index.
I need to update a field of a doc in Elasticsearch and add the count of that doc in a list inside python code. The weight field contains the count of the doc in a dataset. The dataset needs to be updated from time to time.So the count of each document must be updated too. hashed_ids is a list of document ids that are in the new batch of data. the weight of matched id must be increased by the count of that id in hashed_ids.
For example let say a doc with id=d1b145716ce1b04ea53d1ede9875e05a and weight=5 is already present in index. and also the string d1b145716ce1b04ea53d1ede9875e05a is repeated three times in the hashed_ids so the update_with_query query will match the doc in database. I need to add 3 to 5 and have 8 as final weight.
The code below works for it but it is too slow and from time to time I get time out error.
hashed_ids = [hashlib.md5(doc.encode('utf-8')).hexdigest() for doc in shingles]
update_by_query_body =
{
"query":{
"terms": {
"id":["id1","id2"]
}
},
"script":{
"source":"long weightToAdd = params.hashed_ids.stream().filter(idFromList -> ctx._source.id.equals(idFromList)).count(); ctx._source.weight += weightToAdd;",
"params":{
"hashed_ids":["id1","id1","id1","id2"]
}
}
}