
I have code that works perfectly, but it uses too much memory.

Essentially, this code takes one input file (call it the index; it is 2-column, tab-separated) and searches a second input file (call it the data; it is 4-column, tab-separated) for a matching term in the 1st column, which it then replaces with the corresponding information from the index file.

An example of the index is:

amphibian   anm|art|art|art|art
anaconda    anm
aardvark    anm

An example of the data is:

amphibian-n is  green   10
anaconda-n  is  green   2
anaconda-n  eats    mice    1
aardvark-n  eats    plants  1

Thus, when replacing the value in Col 1 of data with the corresponding information from Index, the results are as follows:

anm-n   is  green
art-n   is  green
anm-n   eats    mice
anm-n   eats    plants

I divided the code into steps because the idea is to calculate, for each replaced item, the average of the values (Col 4 in data) over Cols 2 and 3 of the data file. Step 2 takes each slot-filler pair in the data file and sums its values, and that total is used in Step 3.

The desired results are the following:

anm second  hello   1.0
anm eats    plants  1.0
anm first   heador  0.333333333333
art first   heador  0.666666666667

I open the same input file several times (i.e., 3 times) in Steps 1, 2 and 3 because I need to create several dictionaries that have to be built in a certain order. However, the bottleneck is definitely between Steps 2 and 3. If I remove the function in Step 2, I can process the entire file (13 GB of RAM in approx. 30 minutes), but the necessary addition of Step 2 consumes all memory before Step 3 even begins.

Is there a way to optimize how many times I open the same input file?

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import division
from collections import defaultdict

import datetime

print "starting:",
print datetime.datetime.now()

mapping = dict()

with open('input-map', "rb") as oSenseFile:
    for line in oSenseFile:
        uLine = unicode(line, "utf8")
        concept, conceptClass = uLine.split()
        if len(concept) > 2:  
                mapping[concept + '-n'] = conceptClass


print "- step 1:",
print datetime.datetime.now()

lemmas = set()

with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemma = uLine.split()[0]
        if mapping.has_key(lemma):
            lemmas.add(lemma)

print "- step 2:",
print datetime.datetime.now()


featFreqs = defaultdict(lambda: defaultdict(float))

with open('input-data', "rb") as oIndexFile:            
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        featFreqs[slot][filler] += int(freq)


print "- step 3:",
print datetime.datetime.now()

classFreqs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
     
with open('input-data', "rb") as oIndexFile:            
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        if lemmaTAR in lemmas:
            senses = mapping[lemmaTAR].split(u'|')
            for sense in senses:
                classFreqs[sense][slot][filler] += (int(freq) / len(senses)) / featFreqs[slot][filler]
        else:
            pass

print "- step 4:",
print datetime.datetime.now()
                
with open('output', 'wb') as oOutFile:
    for sense in sorted(classFreqs):
                for slot in classFreqs[sense]:
                        for fill in classFreqs[sense][slot]:
                                outstring = '\t'.join([sense, slot, fill,\
                                                       str(classFreqs[sense][slot][fill])])
                                oOutFile.write(outstring.encode("utf8") + '\n')

Any suggestions on how to optimize this code to process large text files (e.g., >4GB)?

  • Why does anaconda become art in the example? The index maps it to anm.  Commented Mar 2, 2014 at 10:20
  • anaconda does not become art; you are referring to amphibian, which is mapped to art. The example demonstrates that for each possible mapping, the information in Cols 2 and 3 is repeated.  Commented Mar 2, 2014 at 12:26
  • The example is still not quite clear, but anyway, perhaps you should use a database.  Commented Mar 2, 2014 at 19:30
  • I noticed the same question on stackoverflow.com, which has an accepted answer already.  Commented Mar 3, 2014 at 13:56

1 Answer


Don't use Python 2 any more; the rest of this answer will assume Python 3 without diving too much into the syntax. Most of the Unicode stuff needs to go away; see codecs for standard encoding names.

The desired results are the following

Are they really? hello doesn't appear in your sample input at all.

I open the same input file many times (i.e. 3 times) in Steps 1, 2 and 3 because I need to create several dictionaries that need to be created in a certain order

Don't do that. Just open it once and seek to the beginning as necessary.
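
For example, the second and third passes can reuse the same file object (a minimal sketch using the step functions defined further down):

with open('input-data', encoding='latin_1') as o_index_file:
    lemmas = step_1(mapping=mapping, o_index_file=o_index_file)  # first pass
    o_index_file.seek(0)  # rewind instead of reopening
    feat_freqs = step_2(o_index_file=o_index_file)               # second pass, same handle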

Your steps should be converted into functions.

Rather than printing datetime.now(), make a logger with an asctime field.

In Python 3 you should not be opening those files as rb; instead pass the appropriate encoding and open them in text mode.

Write a main function responsible for opening and closing files, and pass those files into subroutines.

featFreqs = defaultdict(lambda: defaultdict(float)) is not a good idea, because you only ever add integers; use defaultdict(int) instead.

The indentation in step 4 is wild. That needs to be fixed up, and you should keep references to the intermediate dictionary levels instead of re-indexing from the top on every access.

Yes, there are ways (that I don't demonstrate) where the file processing is partitioned to reduce memory burden. The tricky part becomes indexing into parts of a map that are not currently in memory. One approach is to produce a database (SQLite, possibly) that is well-indexed; it will have reasonable caching characteristics and can be gigantic without ruining your RAM during queries.
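
As a rough illustration of the database route (a sketch only, with invented table and column names, not something the code below uses), the data file could be loaded once into an indexed SQLite table and the per-(slot, filler) totals queried on demand instead of held in a dictionary:

import sqlite3

def build_db(data_path: str, db_path: str) -> None:
    con = sqlite3.connect(db_path)
    con.execute('CREATE TABLE IF NOT EXISTS data'
                ' (lemma TEXT, slot TEXT, filler TEXT, freq INTEGER)')
    with open(data_path, encoding='latin_1') as f:
        # one row per line of the data file
        con.executemany('INSERT INTO data VALUES (?, ?, ?, ?)',
                        (line.split() for line in f))
    # index the lookup key so later queries do not scan the whole table
    con.execute('CREATE INDEX IF NOT EXISTS ix_slot_filler ON data (slot, filler)')
    con.commit()
    con.close()

def feat_freq(con: sqlite3.Connection, slot: str, filler: str) -> int:
    # on-demand replacement for featFreqs[slot][filler]
    (total,) = con.execute('SELECT SUM(freq) FROM data WHERE slot = ? AND filler = ?',
                           (slot, filler)).fetchone()
    return total or 0

With that layout, step 2 no longer has to keep every slot-filler pair in memory at once; the trade-off is one query per lookup in step 3.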

All together,

#!/usr/bin/env python3
import logging
import typing
from collections import defaultdict

type FreqDict = defaultdict[str, defaultdict[str, int]]
type ClassDict = defaultdict[str, defaultdict[str, defaultdict[str, float]]]


def setup_logger() -> logging.Logger:
    logging.basicConfig(
        level=logging.INFO, format='%(asctime)s %(message)s',
    )
    return logging.getLogger('indexer')


def start(o_sense_file: typing.TextIO) -> dict[str, str]:
    mapping: dict[str, str] = {}

    for line in o_sense_file:
        concept, concept_class = line.split()
        if len(concept) > 2:
            mapping[concept + '-n'] = concept_class

    return mapping


def step_1(mapping: dict[str, str], o_index_file: typing.TextIO) -> set[str]:
    lemmas = set()

    for line in o_index_file:
        lemma = line.split()[0]
        if lemma in mapping:
            lemmas.add(lemma)

    return lemmas


def step_2(o_index_file: typing.TextIO) -> FreqDict:
    feat_freqs = defaultdict(lambda: defaultdict(int))

    for line in o_index_file:
        lemmaTAR, slot, filler, freq = line.split()
        feat_freqs[slot][filler] += int(freq)

    return feat_freqs


def step_3(
    o_index_file: typing.TextIO, mapping: dict[str, str],
    lemmas: set[str], feat_freqs: FreqDict,
) -> ClassDict:
    class_freqs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

    for line in o_index_file:
        lemmaTAR, slot, filler, freq = line.split()
        if lemmaTAR in lemmas:
            senses = mapping[lemmaTAR].split('|')
            for sense in senses:
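                # split the frequency evenly across the lemma's senses, then
                # normalise by the corpus-wide total for this (slot, filler) pair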
                class_freqs[sense][slot][filler] += int(freq) / len(senses) / feat_freqs[slot][filler]

    return class_freqs


def step_4(o_out_file: typing.TextIO, class_freqs: ClassDict) -> None:
    for sense in sorted(class_freqs.keys()):
        by_sense = class_freqs[sense]
        for slot, freqs in by_sense.items():
            for fill, freq in freqs.items():
                o_out_file.write(f'{sense}\t{slot}\t{fill}\t{freq}\n')


def main():
    logger.info('Starting')
    with open('input-map', encoding='utf_8') as o_sense_file:
        mapping = start(o_sense_file)

    with open('input-data', encoding='latin_1') as o_index_file:
        logger.info('Step 1')
        lemmas = step_1(mapping=mapping, o_index_file=o_index_file)

        logger.info('Step 2')
        o_index_file.seek(0)
        feat_freqs = step_2(o_index_file=o_index_file)

        logger.info('Step 3')
        o_index_file.seek(0)
        class_freqs = step_3(
            mapping=mapping, o_index_file=o_index_file, lemmas=lemmas, feat_freqs=feat_freqs,
        )

    logger.info('Step 4')
    with open('output', mode='w', encoding='utf_8') as o_out_file:
        step_4(o_out_file=o_out_file, class_freqs=class_freqs)


if __name__ == '__main__':
    logger = setup_logger()
    main()

Console output:

2025-01-11 00:06:06,813 Starting
2025-01-11 00:06:06,816 Step 1
2025-01-11 00:06:06,816 Step 2
2025-01-11 00:06:06,816 Step 3
2025-01-11 00:06:06,816 Step 4

Output file:

anm is  green   0.3333333333333333
anm eats    mice    1.0
anm eats    plants  1.0
art is  green   0.6666666666666666
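
Working through the sample data by hand confirms these numbers: featFreqs['is']['green'] is 10 + 2 = 12; amphibian-n lists five senses (anm, art, art, art, art), so its frequency of 10 contributes (10/5)/12 ≈ 0.167 per listed sense, i.e. 0.167 to anm and 4 × 0.167 ≈ 0.667 to art, while anaconda-n (anm only) adds (2/1)/12 ≈ 0.167, giving anm is green ≈ 0.333. Each eats row involves a single sense and a featFreqs total equal to its own frequency, hence 1.0.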
