Elasticsearch query results export to csv/excel file

Question

We have billions of records indexed in an ES cluster. Each document contains fields such as account id, transaction id, user name and so on (plus a few free-text string fields).

My application queries ES based on user search parameters (e.g. return transactions for user 'A' between dates X and Y, plus some other filters), and I want to store/export the response data to a CSV/Excel file.

For my use case, the number of documents returned from ES might be in the hundreds of thousands or even millions. My question is: what are the various ways to export a "large" amount of data from ES?

These are "real-time" requests, not batch processing (i.e. the requesting user is waiting for the exported file to be created).

I have read about pagination (size/from) and the scroll approach, but I am not sure whether these are the best ways to export a large dataset from ES. (If I read it correctly, the size/from approach has a maximum of 10K, and the scroll option is not really recommended for real-time use cases.)

I would like to hear from the experts.

Val · Accepted Answer · 2016-12-21 05:01:02Z

If your users need to export a large quantity of data, you need to educate them not to expect that export to be done in real-time (for the sake of the well-being of your other users and your systems).

That's definitely a batch processing job. The user triggers the export via your UI, and some process then wakes up and runs it asynchronously. When it is done, you notify the user that the export is available for download at some location, or you send the file via email.
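To make the trigger/notify flow concrete, here is a minimal sketch using only the Python standard library. Everything named here (`ExportJobs`, `run_export`, `notify`) is hypothetical and stands in for whatever job queue and notification channel you actually use; a production setup would more likely rely on a persistent queue than on an in-process thread pool.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class ExportJobs:
    """Runs exports in a small background pool and reports when each one ends."""

    def __init__(self, run_export, max_workers=2):
        # run_export(params) -> location of the finished file (hypothetical callable).
        # max_workers caps how many exports can run against the cluster at once.
        self._run_export = run_export
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}
        self._ids = itertools.count(1)
        self._lock = threading.Lock()

    def submit(self, params, notify):
        """Queue an export and return a job id immediately, without waiting."""
        future = self._pool.submit(self._run_export, params)
        with self._lock:
            job_id = next(self._ids)
            self._futures[job_id] = future

        def _done(finished, job_id=job_id):
            # Called once the export has finished or failed.
            error = finished.exception()
            notify(job_id, None if error else finished.result(), error)

        future.add_done_callback(_done)
        return job_id

    def status(self, job_id):
        """Return 'pending', 'done' or 'failed' for a previously submitted job."""
        future = self._futures[job_id]
        if not future.done():
            return "pending"
        return "failed" if future.exception() else "done"
```

The UI handler would call `submit(...)` and return the job id right away; `notify` is where the email or "your file is ready" message would be sent.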

Just to name an example, when you want to export your data from Twitter, you trigger a request and you'll be notified later (even if you have just a few tweets in your account) that your data has been exported.

If you decide to proceed that way, then nothing prevents you anymore from using the scan/scroll approach.
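As an illustration of what the scroll loop in such a background job could look like, here is a minimal sketch that streams hits into a CSV file. The two callables `first_page` and `next_page` are hypothetical wrappers around whichever client you use; the sketch only assumes the usual response shape with `_scroll_id` and `hits.hits[*]._source`.

```python
import csv


def export_hits_to_csv(first_page, next_page, fields, out):
    """Write every hit of a scrolled search to `out` as CSV and return the row count.

    first_page()          -> response of the initial search with scrolling enabled
    next_page(scroll_id)  -> response of the follow-up scroll request
    Both are hypothetical callables supplied by the caller.
    """
    # Fields missing from a document become empty cells; unlisted fields are dropped.
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    rows = 0
    page = first_page()
    # An empty batch of hits signals that the scroll is exhausted.
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            writer.writerow(hit["_source"])
            rows += 1
        page = next_page(page["_scroll_id"])
    return rows
```

With the official Python client, `first_page` would typically wrap a `search` call that passes a `scroll` timeout and `next_page` would wrap `scroll`; that mapping is an assumption about your setup, not something the job depends on. Because rows are written as each page arrives, memory use stays bounded by the page size rather than by the total result count.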


3 Comments

Rushik Over a year ago

That makes sense! Do you think ES is the right technology for this use case, then? Our users are not expecting results in a few seconds, but they definitely need something in under 2-3 minutes.

Val Over a year ago

The answer is "it depends". Downloading millions of documents from ES will work, but it will put some burden on your cluster. If many users do that at the same time, you run the risk of making your cluster unstable and slowing down normal search requests from your other users. You need to load test your solution and scale your cluster appropriately before going to production with it. I have seen this in place many times, though; it just requires careful consideration when scaling your cluster.

Rushik Over a year ago

OK, maybe I should ask this as a separate, more generic question (I will post it as a new one): which other NoSQL backends should be considered for this use case? My data is more like relational data. I have used Hadoop with Pig extensively in the past, but a Hadoop job will not meet a 2-3 minute SLA, and I am not very experienced with other NoSQL stores. The requirements include mass export of raw data and some dynamic aggregations on a few of the fields.
