python elasticsearch bulk index datatype

Question

I am using the following code to create an index and load data into Elasticsearch:

import csv

from elasticsearch import helpers, Elasticsearch

# Connect to the local Elasticsearch node
es = Elasticsearch('localhost:9200')
index_name = 'wordcloud_data'
with open('./csv-data/' + index_name + '.csv') as f:
    # Each CSV row becomes one document; every value is read as a string
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index=index_name, doc_type='my-type')

print("done")

My CSV data is as follows

date,word_data,word_count
2017-06-17,luxury vehicle,11
2017-06-17,signifies acceptance,17
2017-06-17,agency imposed,16
2017-06-17,customer appreciation,11

The data loads fine, but the datatype is not accurate. How do I force word_count to be mapped as integer and not text? Notice how it figures out the date type. Is there a way for it to figure out the int datatype automatically, or by passing some parameter?

Also, how do I increase ignore_above, or remove it for some of the fields if I want to, so that there is basically no limit on the number of characters? This is the mapping I currently get:

{
  "wordcloud_data" : {
    "mappings" : {
      "my-type" : {
        "properties" : {
          "date" : {
            "type" : "date"
          },
          "word_count" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "word_data" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

drdaeman · Accepted Answer · 2020-01-12 18:30:11Z

5

You need to create a mapping that would describe field types.

With the elasticsearch-py client this can be done using the es.indices.put_mapping or es.indices.create methods, by passing a JSON document that describes the mappings, as shown in this SO answer. It would be something like this:

es.indices.put_mapping(
    index="wordcloud_data",
    doc_type="my-type",
    body={
        "properties": {  
            "date": {"type":"date"},
            "word_data": {"type": "text"},
            "word_count": {"type": "integer"}
        }
    }
)
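The question also asks about ignore_above, and about data that has already been loaded. A minimal sketch follows, assuming an older cluster and client that still use doc types (as the question does), and assuming the client still accepts the `ignore` keyword argument. The mapping of an existing field generally cannot be changed from text to integer in place, so the sketch drops the index and creates it again with an explicit mapping. Leaving ignore_above out of the explicit keyword sub-field removes the 256-character cap that dynamic mapping adds; set it to a larger number instead if you only want to raise the cap. The helper names build_index_body and recreate_index are made up for this illustration.

```python
DOC_TYPE = "my-type"  # same type name as in the question


def build_index_body(doc_type=DOC_TYPE):
    """Return an index-creation body with explicit field types."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    "date": {"type": "date"},
                    "word_count": {"type": "integer"},
                    "word_data": {
                        "type": "text",
                        "fields": {
                            # No "ignore_above" here, so the 256 cap that
                            # dynamic mapping adds is not applied. Add
                            # "ignore_above": <n> to set your own cap.
                            "keyword": {"type": "keyword"}
                        },
                    },
                }
            }
        }
    }


def recreate_index(es, index_name, body):
    """Drop the index if it exists, then create it with the given body.

    `es` is an elasticsearch.Elasticsearch client. This deletes all
    documents in the index, so the CSV has to be loaded again afterwards.
    """
    es.indices.delete(index=index_name, ignore=[404])
    es.indices.create(index=index_name, body=body)


# Usage (needs a running cluster):
# recreate_index(es, "wordcloud_data", build_index_body())
```

After that, running the bulk load from the question again should index word_count as an integer.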

However, I'd suggest taking a look at the elasticsearch-dsl package, which provides a much nicer declarative API for describing things. It would be something along these lines (untested):

import csv

from elasticsearch.helpers import bulk
from elasticsearch_dsl import DocType, Date, Integer, Text
from elasticsearch_dsl.connections import connections

index_name = "wordcloud_data"

# Register the default connection used by DocType and by bulk() below
connections.create_connection(hosts=["localhost"])

class WordCloud(DocType):
    # The field types declared here become the index mapping
    word_data = Text()
    word_count = Integer()
    date = Date()

    class Index:
        name = index_name
        doc_type = "my_type"   # If you need it to be called so

# Create the index and its mapping in Elasticsearch
WordCloud.init()
with open("./csv-data/%s.csv" % index_name) as f:
    reader = csv.DictReader(f)
    bulk(
        connections.get_connection(),
        # to_dict(True) includes metadata such as the target index
        (WordCloud(**row).to_dict(True) for row in reader)
    )

Please note, I haven't tried the code I've posted - just written it. Don't have an ES server at hand to test. There could be some small mistakes or typos there (please point out if there are), but the general idea should be correct.

edited Jan 12, 2020 at 18:30

answered Jun 22, 2017 at 10:28

drdaeman


3 Comments

Naresh MG Over a year ago

Thanks 🙏 i will try this out and let you know

Naresh MG Over a year ago

I just changed the field order to match the order in the file (not sure if it matters), and everything seems to work fine from what I can tell. Here is one of the documents. Is the integer type supposed to be stored with double quotes, as in "word_count" : "12"? { "_index" : "wordss", "_type" : "my-type", "_id" : "AVzS4_2-UW5hFY6GiWVj", "_score" : 1.0, "_source" : { "word_date" : "2017-06-17T00:00:00", "word_count" : "12", "word_data" : "cell phone" } }

drdaeman Over a year ago

@NareshMG No, it should be stored and returned as a number, not a string (strings would be accepted on input, but coerced to the type mapping defines). It could be that you need to drop existing data completely (drop the index) and re-create it anew. Just defining a mapping doesn't update already existing data. If you didn't do so, you'll have mixed-type data in your DB.
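On the quoted "12" discussed in the comments above: as far as I know, _source hands back the JSON exactly as it was submitted, while the mapping only controls how the value is indexed. csv.DictReader yields every value as a string, so the quotes can persist even with an integer mapping in place. A minimal sketch of one way around it is to convert the values in Python before indexing. The helper name typed_rows and the in-memory sample are made up for this illustration; the column names come from the question's CSV.

```python
import csv
import io


def typed_rows(reader):
    """Yield CSV rows with word_count converted from str to int."""
    for row in reader:
        row["word_count"] = int(row["word_count"])
        yield row


# Small in-memory sample standing in for the real CSV file
sample = io.StringIO(
    "date,word_data,word_count\n"
    "2017-06-17,luxury vehicle,11\n"
    "2017-06-17,signifies acceptance,17\n"
)
rows = list(typed_rows(csv.DictReader(sample)))
print(type(rows[0]["word_count"]).__name__)
```

In the question's loader, typed_rows(reader) could be passed to helpers.bulk in place of reader, so that the documents are sent with numeric word_count values.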

Premkumar chalmeti · Accepted Answer · 2020-01-10 11:43:20Z

1

Thanks, @drdaeman's solution worked for me. It is worth mentioning one thing about elasticsearch-dsl 6+, though:

class Meta:
    index = "wordcloud_data"
    doc_type = "my-type"

This snippet raises a "cannot write to wildcard index" exception. Change it to the following:

class Index:
    name = 'wordcloud_data'
    doc_type = 'my_type'

answered Jan 10, 2020 at 11:43

Premkumar chalmeti

