
We have billions of records indexed in an ES cluster. Each document contains fields such as account id, transaction id, user name and so on (plus a few free-text string fields).

My application queries ES based on user search parameters (e.g. return transactions for user 'A' between dates X and Y, plus some other filters), and I want to export the response data to a CSV/Excel file.
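To make the kind of search concrete, here is a minimal sketch of how such a request body could be built. The field names `user_name` and `transaction_date` are assumptions for illustration (not the real mapping), and the `term` filters assume those fields are indexed as exact-match keywords rather than analyzed text.

```python
from datetime import date


def build_export_query(user_name, start, end, extra_filters=None):
    """Build a bool/filter query body for 'user X between two dates'.

    Field names are hypothetical; adapt them to the actual index mapping.
    """
    filters = [
        # Exact match on the user (assumes a keyword-type field).
        {"term": {"user_name": user_name}},
        # Inclusive date range.
        {"range": {"transaction_date": {
            "gte": start.isoformat(),
            "lte": end.isoformat(),
        }}},
    ]
    # Any additional exact-match filters chosen by the user.
    for field, value in (extra_filters or {}).items():
        filters.append({"term": {field: value}})
    # Filter context skips relevance scoring, which an export does not need.
    return {"query": {"bool": {"filter": filters}}}


query = build_export_query("A", date(2020, 1, 1), date(2020, 1, 31),
                           {"account_id": "42"})
```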

For my use case, the number of documents returned from ES might be in the hundreds of thousands or even millions. My question is: what are the various ways to export a "large" amount of data from ES?

These are "real-time" requests, not batch processing (i.e. the requesting user is waiting for the export file to be created).

I read about pagination (from/size) and the scroll approach, but I'm not sure these are the best ways to export a large dataset from ES. (The from/size approach is capped at 10K results, if I read it correctly, and the scroll option is not recommended for real-time use cases.)

I would like to hear from experts.

1 Answer

If your users need to export a large quantity of data, you need to educate them not to expect that export to be done in real-time (for the sake of the well-being of your other users and your systems).

That's definitely a batch processing job. The user triggers the export via your UI, and some process then wakes up and does it asynchronously. When it is done, you notify the user that the export is available for download at some location, or you send the file via email.
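A minimal sketch of that trigger-then-notify flow, using only the Python standard library. The `ExportJobs` class and its method names are hypothetical, and a real deployment would more likely use a persistent job queue than an in-process thread pool; this only illustrates the shape of the idea.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class ExportJobs:
    """In-memory registry of background export jobs (illustrative only)."""

    def __init__(self, max_workers=2):
        # A small pool caps how many exports hit the cluster at once.
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def submit(self, export_fn, *args):
        """Start an export in the background and return a job id at once."""
        job_id = uuid.uuid4().hex
        self._futures[job_id] = self._pool.submit(export_fn, *args)
        return job_id

    def status(self, job_id):
        """Return 'running', 'done' or 'failed' for a known job id."""
        future = self._futures[job_id]
        if not future.done():
            return "running"
        return "failed" if future.exception() else "done"

    def result(self, job_id, timeout=None):
        """Block until the export finishes and return whatever it returned."""
        return self._futures[job_id].result(timeout=timeout)
```

The UI handler would call `submit(...)` and hand the job id back to the user straight away; a later status check (or a notification sent from inside the export function itself) tells the user where to download the file.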

Just to name an example, when you want to export your data from Twitter, you trigger a request and you'll be notified later (even if you have just a few tweets in your account) that your data has been exported.

If you decide to proceed that way, then nothing prevents you from using the scan/scroll approach anymore.
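As a rough sketch of what a scroll-based export loop could look like: the function below takes a `client` object that is assumed to expose `search`, `scroll` and `clear_scroll` methods shaped like those of the official Python Elasticsearch client. The exact keyword arguments vary between client versions, so check them against the version you use. The function name, index and field names are made up for illustration.

```python
import csv


def export_scroll_to_csv(client, index, query, fieldnames, out,
                         page_size=1000, keep_alive="2m"):
    """Stream every hit matching `query` into a CSV file-like object.

    `client` is assumed to behave like an Elasticsearch client; argument
    names here are an assumption and may differ in your client version.
    Returns the number of rows written.
    """
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()

    # The first request opens the scroll context and returns the first page.
    resp = client.search(index=index, body=query,
                         scroll=keep_alive, size=page_size)
    scroll_id = resp["_scroll_id"]
    total = 0
    try:
        # An empty page means the scroll is exhausted.
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                writer.writerow(hit["_source"])
                total += 1
            # Fetch the next page and renew the keep-alive.
            resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp["_scroll_id"]
    finally:
        # Release the server-side scroll context as soon as possible.
        client.clear_scroll(scroll_id=scroll_id)
    return total
```

Writing rows as each page arrives keeps memory use flat regardless of how many documents match, which matters once the result set reaches millions of rows.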


3 Comments

That makes sense! Do you think ES is the right technology for this use case, then? Our users are not expecting results in a few seconds, but they definitely need something in under 2-3 minutes.
The answer is "it depends". Downloading millions of documents from ES will work, but it will put some burden on your cluster. If many users do that at the same time, you run the risk of making your cluster unstable and slowing down normal search requests from your other users. You need to load test your solution and scale your cluster appropriately before going to production with it. But I have seen this in place many times; it just requires careful consideration when scaling your cluster.
OK, maybe let me ask this as a separate, more generic question (I will post it as a new question): which other NoSQL backend should be considered for this use case? My data is more like relational data. I've used Hadoop with Pig extensively in the past, but a Hadoop job will not meet a 2-3 minute SLA, and I'm not very experienced with other NoSQL stores. Requirements include mass export of raw data and some dynamic aggregations on a few of the fields.
