In a movie database, I store the ratings (0 to 5 stars) that users give to each movie. I have the following document structure indexed in Elasticsearch (version 1.2.2):

"_index": "my_index",
"_type": "film",
"_id": "6629",
"_source": {
  "id": "6629",
  "title": "Fight Club",
  "ratings" : [
    { "user_id" : 1234, "rating_value" : 3 },
    { "user_id" : 4567, "rating_value" : 2 },
    { "user_id" : 7890, "rating_value" : 1 }
    .....
  ]
}

"_index": "my_index",
"_type": "film",
"_id": "6630",
"_source": {
  "id": "6630",
  "title": "Pulp Fiction",
  "ratings" : [
    { "user_id" : 1234, "rating_value" : 1 },
    { "user_id" : 7654, "rating_value" : 2 },
    { "user_id" : 4321, "rating_value" : 5 }
    .....
  ]
}

etc ...

My goal is to get, in a single search, all the movies rated by a given user (let's say user 1234), along with the corresponding rating_value.

If I run the following search:

GET my_index/film/_search
{
  "query": {
    "match": {
      "ratings.user_id": "1234"
    }
  }
}

I get the whole document for every matched movie. I then have to parse the entire ratings array to find out which element matched my query and which rating_value is associated with user_id 1234.
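
For reference, the client-side parsing step described above could look roughly like the following Python sketch. It assumes the search response has already been decoded into a dict with the usual hits.hits[]._source layout. The function name ratings_for_user and the sample data are made up for illustration.

```python
def ratings_for_user(response, user_id):
    """Return {title: rating_value} for every hit that holds a rating by user_id."""
    result = {}
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        for rating in source.get("ratings", []):
            if rating.get("user_id") == user_id:
                result[source.get("title")] = rating.get("rating_value")
                break  # assumption: at most one rating per user per movie
    return result


# Hypothetical, trimmed response shaped like the documents in the question.
sample = {
    "hits": {
        "hits": [
            {"_id": "6629", "_source": {"id": "6629", "title": "Fight Club", "ratings": [
                {"user_id": 1234, "rating_value": 3},
                {"user_id": 4567, "rating_value": 2},
            ]}},
            {"_id": "6630", "_source": {"id": "6630", "title": "Pulp Fiction", "ratings": [
                {"user_id": 1234, "rating_value": 1},
                {"user_id": 7654, "rating_value": 2},
            ]}},
        ]
    }
}

print(ratings_for_user(sample, 1234))
```

This works, but every matched document still travels over the wire with its complete ratings array, which is exactly what I would like to avoid.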

Ideally, I would like the result of this query to be:

"hits": [
  {
    "_index": "my_index",
    "_type": "film",
    "_id": "6629",
    "_source": {
      "id": "6629",
      "title": "Fight Club",
      "ratings" : [
        { "user_id" : 1234, "rating_value" : 3 } // <= only the row that matches the query
      ]
    }
  },
  {
    "_index": "my_index",
    "_type": "film",
    "_id": "6630",
    "_source": {
      "id": "6630",
      "title": "Pulp Fiction",
      "ratings" : [
        { "user_id" : 1234, "rating_value" : 1 } // <= only the row that matches the query
      ]
    }
  }
]

Thanks in advance

  • You can't obtain the ideal result you want as the _source will always be the exact same JSON as the one you indexed. However, you can use aggregations to get the information you want. Commented Aug 13, 2014 at 11:29

2 Answers

I managed to retrieve the values using aggregations, as stated in my previous comment.

Here is how I did it.

First, the mapping I used:

PUT test/movie/_mapping
{
  "properties": {
    "title":{
      "type": "string",
      "index": "not_analyzed"
    },
    "ratings": {
      "type": "nested"
    }
  }
}

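I chose not to analyze the title (with "not_analyzed" it is still indexed, but as a single exact term), so the terms aggregation groups on the full title. You could instead use the fields attribute to keep an analyzed title alongside a not_analyzed "raw" sub-field.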

Then, the movies I indexed:

PUT test/movie/6629
{
  "title": "Fight Club",
  "ratings" : [
    { "user_id" : 1234, "rating_value" : 3 },
    { "user_id" : 4567, "rating_value" : 2 },
    { "user_id" : 7890, "rating_value" : 1 }
  ]
}


PUT test/movie/4456
{
  "title": "Jumanji",
  "ratings" : [
    { "user_id" : 1234, "rating_value" : 4 },
    { "user_id" : 4567, "rating_value" : 3 },
    { "user_id" : 4630, "rating_value" : 5 }
  ]
}

PUT test/movie/6547
{
  "title": "Hook",
  "ratings" : [
    { "user_id" : 1234, "rating_value" : 4 },
    { "user_id" : 7890, "rating_value" : 1 }
  ]
}

The aggregation query is:

GET test/movie/_search
{
  "aggs": {
    "by_movie": {
      "terms": {
        "field": "title"
      },
      "aggs": {
        "ratings_by_user": {
          "nested": {
            "path": "ratings"
          },
          "aggs": {
            "for_user_1234": {
              "filter": {
                "term": {
                  "ratings.user_id": "1234"
                }
              },
              "aggs": {
                "rating_value": {
                  "terms": {
                    "field": "ratings.rating_value"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
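
If you need this for arbitrary users, the request body can be generated rather than written by hand. Below is a minimal Python sketch that builds the same aggregation as a plain dict. The function name build_rating_aggs and the movie_field parameter are my own additions for illustration, and sending the body to Elasticsearch is left to whatever HTTP client you use.

```python
import json


def build_rating_aggs(user_id, movie_field="title"):
    """Build the aggregation body shown above for an arbitrary user.

    movie_field is the not_analyzed field used to group movies
    (assumption: it exists in the mapping).
    """
    filter_name = "for_user_{}".format(user_id)
    return {
        "aggs": {
            "by_movie": {
                "terms": {"field": movie_field},
                "aggs": {
                    "ratings_by_user": {
                        # Step into the nested "ratings" objects.
                        "nested": {"path": "ratings"},
                        "aggs": {
                            filter_name: {
                                # Keep only the ratings given by this user.
                                "filter": {"term": {"ratings.user_id": user_id}},
                                "aggs": {
                                    "rating_value": {
                                        "terms": {"field": "ratings.rating_value"}
                                    }
                                },
                            }
                        },
                    }
                },
            }
        }
    }


print(json.dumps(build_rating_aggs(1234), indent=2))
```

The printed JSON can be pasted as the body of the GET test/movie/_search request.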

Finally, here is the output produced when executing this query against the previous documents:

"aggregations": {
  "by_movie": {
     "buckets": [
        {
           "key": "Fight Club",
           "doc_count": 1,
           "ratings_by_user": {
              "doc_count": 3,
              "for_user_1234": {
                 "doc_count": 1,
                 "rating_value": {
                    "buckets": [
                       {
                          "key": 3,
                          "key_as_string": "3",
                          "doc_count": 1
                       }
                    ]
                 }
              }
           }
        },
        {
           "key": "Hook",
           "doc_count": 1,
           "ratings_by_user": {
              "doc_count": 2,
              "for_user_1234": {
                 "doc_count": 1,
                 "rating_value": {
                    "buckets": [
                       {
                          "key": 4,
                          "key_as_string": "4",
                          "doc_count": 1
                       }
                    ]
                 }
              }
           }
        },
        {
           "key": "Jumanji",
           "doc_count": 1,
           "ratings_by_user": {
              "doc_count": 3,
              "for_user_1234": {
                 "doc_count": 1,
                 "rating_value": {
                    "buckets": [
                       {
                          "key": 4,
                          "key_as_string": "4",
                          "doc_count": 1
                       }
                    ]
                 }
              }
           }
        }
     ]
  }
}

It's a bit tedious because of the nested syntax, but you'll be able to retrieve the rating of the provided user (here, 1234) for each movie.
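
To make the response easier to consume, it can be flattened on the client. The following Python sketch assumes the "aggregations" part of the response has been decoded into a dict shaped like the output above. The function name ratings_from_aggs is hypothetical, and the sample is a trimmed copy of that output.

```python
def ratings_from_aggs(aggregations, filter_name):
    """Flatten the nested aggregation output into {movie_key: rating_value}."""
    ratings = {}
    for movie in aggregations["by_movie"]["buckets"]:
        user_part = movie["ratings_by_user"][filter_name]
        buckets = user_part["rating_value"]["buckets"]
        if buckets:  # skip movies with no rating bucket for this user
            ratings[movie["key"]] = buckets[0]["key"]
    return ratings


def _movie_bucket(title, rating):
    """Helper that builds one trimmed bucket like those in the output above."""
    return {
        "key": title,
        "ratings_by_user": {
            "for_user_1234": {
                "doc_count": 1,
                "rating_value": {"buckets": [{"key": rating, "doc_count": 1}]},
            }
        },
    }


sample_aggs = {
    "by_movie": {
        "buckets": [
            _movie_bucket("Fight Club", 3),
            _movie_bucket("Hook", 4),
            _movie_bucket("Jumanji", 4),
        ]
    }
}

print(ratings_from_aggs(sample_aggs, "for_user_1234"))
```

The guard on empty bucket lists is there because the query has no top-level filter, so I would expect movies the user never rated to show up with an empty "rating_value" bucket list.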

Hope this helps!

3 Comments

Thanks a lot for your reply. It works fine, but I still face 2 issues: 1/ I cannot aggregate by movie id instead of movie title. 2/ I cannot set an offset ('from' in ES) for the aggregated results I get. For the moment I'm trying to solve those 2 issues, but thanks for pointing me down this path.
1) This is because the id isn't indexed. You can solve this by adding an "id" field to your mapping, indexed as "not_analyzed". Then, update the "by_movie" aggregation to use the "id" field.
2) According to the documentation, it seems that the from/size parameters are applied to the search hits, not to the aggregations.

Store ratings as nested documents (or children), then you will be able to query them individually.

A good explanation for the difference between nested documents and children can be found here: http://www.spacevatican.org/2012/6/3/fun-with-elasticsearch-s-children-and-nested-documents/

3 Comments

Thanks for your reply. Actually, they're already stored as nested documents, but the query still returns the whole "parent" (i.e. film) document.
(Or did you imply that I should use the "Parent & Child" structure instead of nested documents?)
Ah, then you can make a nested query, which will give you individual elements. The only issue there would be getting the parent properties (i.e. the movie name) as well. From that point of view, a parent-child structure is easier to work with.
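
A rough sketch of what such a nested query body could look like, built as a Python dict. As far as I know, a plain nested query still returns the whole parent document. The inner_hits option used here is an assumption on my side: it only exists in Elasticsearch 1.5 and later, so it would not work on the 1.2.2 version from the question. The function name build_nested_rating_query is made up for illustration.

```python
import json


def build_nested_rating_query(user_id):
    """Build a nested query that matches movies rated by user_id."""
    return {
        # Return only the parent fields we care about, not the full ratings array.
        "_source": ["id", "title"],
        "query": {
            "nested": {
                "path": "ratings",
                "query": {"term": {"ratings.user_id": user_id}},
                # Assumption: requires Elasticsearch 1.5+; asks for the
                # matching nested rating objects alongside each hit.
                "inner_hits": {},
            }
        },
    }


print(json.dumps(build_nested_rating_query(1234), indent=2))
```

On a version without inner_hits, dropping that key leaves a valid nested query, but the rating would again have to be picked out of the returned document on the client.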
