
I have code that works perfectly, but it uses too much memory.

Essentially, this code takes one input file (call it the index; it is 2-column, tab-separated) and searches a second input file (call it the data; it is 4-column, tab-separated) for a matching term in the 1st column, which it then replaces with the corresponding information from the index file.

An example of the index is:

amphibian   anm|art|art|art|art
anaconda    anm
aardvark    anm

An example of the data is:

amphibian-n is  green   10
anaconda-n  is  green   2
anaconda-n  eats    mice    1
aardvark-n  eats    plants  1

Thus, when replacing the value in Col 1 of data with the corresponding information from Index, the results are as follows:

anm-n   is  green
art-n   is  green
anm-n   eats    mice
anm-n   eats    plants

I divided the code into steps because the idea is to calculate, for each replaced item, the average of the values (Col 4 in data) over Cols 2 and 3 of the data file. Step 2 takes each slot-filler pair in the data file and sums its values, and that total is used in Step 3.

The desired results are the following:

anm second  hello   1.0
anm eats    plants  1.0
anm first   heador  0.333333333333
art first   heador  0.666666666667

I open the same input file several times (i.e., 3 times) in Steps 1, 2 and 3 because I need to create several dictionaries that have to be built in a certain order. However, the bottleneck is definitely between Steps 2 and 3. If I remove the function in Step 2, I can process the entire file (13 GB of RAM in approx. 30 minutes), but the necessary addition of Step 2 consumes all memory before Step 3 even begins.

Is there a way to optimize how many times I open the same input file?

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import division
from collections import defaultdict

import datetime

print "starting:",
print datetime.datetime.now()

mapping = dict()

with open('input-map', "rb") as oSenseFile:
    for line in oSenseFile:
        uLine = unicode(line, "utf8")
        concept, conceptClass = uLine.split()
        if len(concept) > 2:  
                mapping[concept + '-n'] = conceptClass


print "- step 1:",
print datetime.datetime.now()

lemmas = set()

with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemma = uLine.split()[0]
        if mapping.has_key(lemma):
            lemmas.add(lemma)

print "- step 2:",
print datetime.datetime.now()


featFreqs = defaultdict(lambda: defaultdict(float))

with open('input-data', "rb") as oIndexFile:            
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        featFreqs[slot][filler] += int(freq)


print "- step 3:",
print datetime.datetime.now()

classFreqs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
     
with open('input-data', "rb") as oIndexFile:            
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        if lemmaTAR in lemmas:
            senses = mapping[lemmaTAR].split(u'|')
            for sense in senses:
                classFreqs[sense][slot][filler] += (int(freq) / len(senses)) / featFreqs[slot][filler]
        else:
            pass

print "- step 4:",
print datetime.datetime.now()
                
with open('output', 'wb') as oOutFile:
    for sense in sorted(classFreqs):
                for slot in classFreqs[sense]:
                        for fill in classFreqs[sense][slot]:
                                outstring = '\t'.join([sense, slot, fill,\
                                                       str(classFreqs[sense][slot][fill])])
                                oOutFile.write(outstring.encode("utf8") + '\n')

Any suggestions on how to optimize this code to process large text files (e.g., >4GB)?

  • Why does anaconda become art in the example? The index maps it to anm.  Commented Mar 2, 2014 at 10:20
  • anaconda does not become art; you are referring to amphibian, which is mapped to art. The example demonstrates that for each possible mapping, the information in Cols 2 and 3 is repeated.  Commented Mar 2, 2014 at 12:26
  • The example is still not quite clear, but anyway, perhaps you should use a database.  Commented Mar 2, 2014 at 19:30
  • I noticed the same question on stackoverflow.com, which has an accepted answer already.  Commented Mar 3, 2014 at 13:56

1 Answer


Don't use Python 2 any more; the rest of this answer will assume Python 3 without diving too much into the syntax. Most of the Unicode stuff needs to go away; see codecs for standard encoding names.

The desired results are the following

Are they really? hello doesn't appear in your sample input at all.

I open the same input file many times (i.e. 3 times) in Steps 1, 2 and 3 because I need to create several dictionaries that need to be created in a certain order

Don't do that. Just open it once and seek to the beginning as necessary.
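
For example, the second and third passes can reuse the same file object (a minimal sketch using the step functions defined further down):

with open('input-data', encoding='latin_1') as o_index_file:
    lemmas = step_1(mapping=mapping, o_index_file=o_index_file)  # first pass
    o_index_file.seek(0)  # rewind instead of reopening
    feat_freqs = step_2(o_index_file=o_index_file)               # second pass, same handle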

Your steps should be converted into functions.

Rather than printing datetime.now(), make a logger with an asctime field.

In Python 3 you should not be opening those files as rb; instead pass the appropriate encoding and open them in text mode.

Write a main function responsible for opening and closing files, and pass those files into subroutines.

featFreqs = defaultdict(lambda: defaultdict(float)) is not a good idea, because you only ever add integers; use defaultdict(int) instead.

The indentation in step 4 is wild. That needs to be fixed up, and you should keep references to the intermediate dictionary levels instead of re-indexing from the top on every access.

Yes, there are ways (that I don't demonstrate) where the file processing is partitioned to reduce memory burden. The tricky part becomes indexing into parts of a map that are not currently in memory. One approach is to produce a database (SQLite, possibly) that is well-indexed; it will have reasonable caching characteristics and can be gigantic without ruining your RAM during queries.
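
As a rough illustration of the database route (a sketch only, with invented table and column names, not something the code below uses), the data file could be loaded once into an indexed SQLite table and the per-(slot, filler) totals queried on demand instead of held in a dictionary:

import sqlite3

def build_db(data_path: str, db_path: str) -> None:
    con = sqlite3.connect(db_path)
    con.execute('CREATE TABLE IF NOT EXISTS data'
                ' (lemma TEXT, slot TEXT, filler TEXT, freq INTEGER)')
    with open(data_path, encoding='latin_1') as f:
        # one row per line of the data file
        con.executemany('INSERT INTO data VALUES (?, ?, ?, ?)',
                        (line.split() for line in f))
    # index the lookup key so later queries do not scan the whole table
    con.execute('CREATE INDEX IF NOT EXISTS ix_slot_filler ON data (slot, filler)')
    con.commit()
    con.close()

def feat_freq(con: sqlite3.Connection, slot: str, filler: str) -> int:
    # on-demand replacement for featFreqs[slot][filler]
    (total,) = con.execute('SELECT SUM(freq) FROM data WHERE slot = ? AND filler = ?',
                           (slot, filler)).fetchone()
    return total or 0

With that layout, step 2 no longer has to keep every slot-filler pair in memory at once; the trade-off is one query per lookup in step 3.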

All together,

#!/usr/bin/env python3
import logging
import typing
from collections import defaultdict

type FreqDict = defaultdict[str, defaultdict[str, int]]
type ClassDict = defaultdict[str, defaultdict[str, defaultdict[str, float]]]


def setup_logger() -> logging.Logger:
    logging.basicConfig(
        level=logging.INFO, format='%(asctime)s %(message)s',
    )
    return logging.getLogger('indexer')


def start(o_sense_file: typing.TextIO) -> dict[str, str]:
    mapping: dict[str, str] = {}

    for line in o_sense_file:
        concept, concept_class = line.split()
        if len(concept) > 2:
            mapping[concept + '-n'] = concept_class

    return mapping


def step_1(mapping: dict[str, str], o_index_file: typing.TextIO) -> set[str]:
    lemmas = set()

    for line in o_index_file:
        lemma = line.split()[0]
        if lemma in mapping:
            lemmas.add(lemma)

    return lemmas


def step_2(o_index_file: typing.TextIO) -> FreqDict:
    feat_freqs = defaultdict(lambda: defaultdict(int))

    for line in o_index_file:
        lemmaTAR, slot, filler, freq = line.split()
        feat_freqs[slot][filler] += int(freq)

    return feat_freqs


def step_3(
    o_index_file: typing.TextIO, mapping: dict[str, str],
    lemmas: set[str], feat_freqs: FreqDict,
) -> ClassDict:
    class_freqs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

    for line in o_index_file:
        lemmaTAR, slot, filler, freq = line.split()
        if lemmaTAR in lemmas:
            senses = mapping[lemmaTAR].split('|')
            for sense in senses:
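                # split the frequency evenly across the lemma's senses, then
                # normalise by the corpus-wide total for this (slot, filler) pair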
                class_freqs[sense][slot][filler] += int(freq) / len(senses) / feat_freqs[slot][filler]

    return class_freqs


def step_4(o_out_file: typing.TextIO, class_freqs: ClassDict) -> None:
    for sense in sorted(class_freqs.keys()):
        by_sense = class_freqs[sense]
        for slot, freqs in by_sense.items():
            for fill, freq in freqs.items():
                o_out_file.write(f'{sense}\t{slot}\t{fill}\t{freq}\n')


def main():
    logger.info('Starting')
    with open('input-map', encoding='utf_8') as o_sense_file:
        mapping = start(o_sense_file)

    with open('input-data', encoding='latin_1') as o_index_file:
        logger.info('Step 1')
        lemmas = step_1(mapping=mapping, o_index_file=o_index_file)

        logger.info('Step 2')
        o_index_file.seek(0)
        feat_freqs = step_2(o_index_file=o_index_file)

        logger.info('Step 3')
        o_index_file.seek(0)
        class_freqs = step_3(
            mapping=mapping, o_index_file=o_index_file, lemmas=lemmas, feat_freqs=feat_freqs,
        )

    logger.info('Step 4')
    with open('output', mode='w', encoding='utf_8') as o_out_file:
        step_4(o_out_file=o_out_file, class_freqs=class_freqs)


if __name__ == '__main__':
    logger = setup_logger()
    main()

Console output:

2025-01-11 00:06:06,813 Starting
2025-01-11 00:06:06,816 Step 1
2025-01-11 00:06:06,816 Step 2
2025-01-11 00:06:06,816 Step 3
2025-01-11 00:06:06,816 Step 4

Output file:

anm is  green   0.3333333333333333
anm eats    mice    1.0
anm eats    plants  1.0
art is  green   0.6666666666666666
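
Working through the sample data by hand confirms these numbers: featFreqs['is']['green'] is 10 + 2 = 12; amphibian-n lists five senses (anm, art, art, art, art), so its frequency of 10 contributes (10/5)/12 ≈ 0.167 per listed sense, i.e. 0.167 to anm and 4 × 0.167 ≈ 0.667 to art, while anaconda-n (anm only) adds (2/1)/12 ≈ 0.167, giving anm is green ≈ 0.333. Each eats row involves a single sense and a featFreqs total equal to its own frequency, hence 1.0.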
