Extract columns from multiple text files with Python

Question

I have a folder with 5 text files in it pertaining to various sites.

The file names are formatted like this:

Rockspring_18_SW.417712.WRFc36.ET.2000-2050.txt

Rockspring_18_SW.417712.WRFc36.RAIN.2000-2050.txt

WICA.399347.WRFc36.ET.2000-2050.txt

WICA.399347.WRFc36.RAIN.2000-2050.txt

So, basically, the file name follows the format (site name).(site number).(WRFc36).(some variable).(2000-2050).txt

Each of these text files has a similar format to it with no header row: Year Month Day Value (consisting of ~18500 rows in each text file)

I want Python to search for matching file names (where site name and site number match), take the first through third columns of data from one of the files, and write them to a new text file. I also want to copy the 4th column from each variable file for a site (rain, ET, etc.) and write those columns to the new file in a particular order.

I know how to grab data using the csv module (defining a new dialect for a space delimiter) from ALL files and print it to a new text file, but I'm not sure how to automate the creation of a new file for each site name/number and make sure my variables come out in the right order.

The output I want to use is one text file (not 5) for each site with the following format (year, month, day, variable1, variable2, variable3, variable4, variable5) for ~18500 rows...

I'm sure I'm overlooking something really simple here... this seems like it would be pretty rudimentary, but any help would be greatly appreciated!

Update

========

I have updated the code to reflect the comments below.
http://codepad.org/3mQEM75e

from collections import defaultdict
import glob
import csv

#Create dictionary of lists--   [A] = [Afilename1, Afilename2, Afilename3...]
#                               [B] = [Bfilename1, Bfilename2, Bfilename3...] 
def get_site_files():
    sites = defaultdict(list)
    #to start, I have a bunch of files in this format ---
    #"site name(unique)"."site num(unique)"."WRFc36"."Variable(5 for each site name)"."2000-2050"
    for fname in glob.glob("*.txt"):
        #split name at every instance of "."
        parts = fname.split(".")
        #check to make sure i only use the proper files-- having 6 parts to name and having WRFc36 as 3rd part
        if len(parts)==6 and parts[2]=='WRFc36':
            #Make sure site name is the full unique identifier, the first and second "parts"
            sites[parts[0]+"."+parts[1]].append(fname)
    return sites

#hardcode the variables for method 2, below
Var=["TAVE","RAIN","SMOIS_INST","ET","SFROFF"]

def main():
    for site_name, files in get_site_files().iteritems():
        print "Working on *****"+site_name+"*****"
####Method 1- I'd like to not hardcode in my variables (as in method 2), so I can use this script in other applications.
        for filename in files:
            reader = csv.reader(open(filename, "rb"))
            WriteFile = csv.writer(open("XX_"+site_name+"_combined.txt","wb"))
            for row in reader:
                row = reader.next()
####Method 2 works (mostly), but skips a LOT of random lines of first file, and doesn't utilize the functionality built into my dictionary of lists...            
##        reader0 = csv.reader(open(site_name+".WRFc36."+Var[0]+".2000-2050.txt", "rb"))    #I'd like to copy ALL columns from the first file
##        reader1 = csv.reader(open(site_name+".WRFc36."+Var[1]+".2000-2050.txt", "rb"))    #    and just the fourth column from all the rest of the files
##        reader2 = csv.reader(open(site_name+".WRFc36."+Var[2]+".2000-2050.txt", "rb"))    #    (the columns 1-3 are the same for all files)
##        reader3 = csv.reader(open(site_name+".WRFc36."+Var[3]+".2000-2050.txt", "rb"))
##        reader4 = csv.reader(open(site_name+".WRFc36."+Var[4]+".2000-2050.txt", "rb"))
##        WriteFile = csv.writer(open("XX_"+site_name+"_COMBINED.txt", "wb"))               #creates new command to write a text file
##
##        for row in reader0:
##            row  = reader0.next()
##            row1 = reader1.next()
##            row2 = reader2.next()
##            row3 = reader3.next()
##            row4 = reader4.next()
##            WriteFile.writerow(row + row1 + row2 + row3 + row4)
##        print "***finished with site***"

if __name__=="__main__":
    main()

Hugh Bothwell · Accepted Answer · 2012-08-13 23:57:00Z

2

Here's an easier way to iterate through your files, grouped by site.

from collections import defaultdict
import glob

def get_site_files():
    sites = defaultdict(list)
    for fname in glob.glob('*.txt'):
        parts = fname.split('.')
        if len(parts)==6 and parts[2]=='WRFc36':
            sites[parts[0]].append(fname)
    return sites

def main():
    for site,files in get_site_files().iteritems():
        # you need to better explain what you are trying to do here!
        print site, files

if __name__=="__main__":
    main()

I still don't understand your cutting and pasting columns - you need to more clearly explain what you are trying to accomplish.
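Going by the clarification in the comments below (dates from one file, the value column from every file, one output file per site), the grouping above could be extended along the lines of this sketch. It is only one possible approach and not part of the original answer: `group_by_site` and `combine_site` are hypothetical names, the site key is assumed to be name plus number as in the question's own code, columns are assumed to be whitespace-separated, and the output name reuses the question's `XX_<site>_combined.txt` pattern. It sticks to plain string splitting so that it should run on both Python 2.7 and Python 3.

```python
from collections import defaultdict
import glob
import os


def group_by_site(pattern="*.txt"):
    """Map 'site_name.site_number' -> {variable: path} (hypothetical helper)."""
    sites = defaultdict(dict)
    for path in glob.glob(pattern):
        parts = os.path.basename(path).split(".")
        # Same filter as the answer: six dot-separated parts, WRFc36 third.
        if len(parts) == 6 and parts[2] == "WRFc36":
            sites[parts[0] + "." + parts[1]][parts[3]] = path
    return sites


def combine_site(var_files, out_path, order=None):
    """Write year, month, day once, then the 4th column of each variable file."""
    variables = order or sorted(var_files)  # alphabetical unless an order is given
    handles = [open(var_files[v]) for v in variables]
    try:
        with open(out_path, "w") as out:
            # zip() advances every file by exactly one line per iteration.
            for lines in zip(*handles):
                rows = [line.split() for line in lines]
                if any(len(r) < 4 for r in rows):
                    continue  # assumption: skip blank or short lines
                out.write(" ".join(rows[0][:3] + [r[3] for r in rows]) + "\n")
    finally:
        for handle in handles:
            handle.close()


def main():
    for site, var_files in group_by_site().items():
        combine_site(var_files, "XX_" + site + "_combined.txt")
```

Calling `main()` from the folder that holds the data files would write one combined file per site. Passing `order=["TAVE", "RAIN", "SMOIS_INST", "ET", "SFROFF"]` to `combine_site` fixes the column order; leaving it out falls back to alphabetical order, which avoids hardcoding the variable names.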


5 Comments

TheGeoEngineer Over a year ago

I put some new code at codepad.org/3mQEM75e reflecting your schema above. As far as cutting and pasting columns: I have several study sites, and each study site has 5 text files (one for each of 5 variables). So, for 5 study sites, I would have 25 text files. Each text file's columns are formatted the same: Year Month Day VariableValue. I want to copy the dates from one file, and just the variable values from all other files for each study site, so that each study site ends up with only one text file with columns formatted: Year Month Day Var1 Var2 Var3 Var4 Var5.

hochl Over a year ago

and don't forget that in this case glob.iglob('*.txt') will create an iterator and avoid creating a list of values.

TheGeoEngineer Over a year ago

@hochl I suppose glob.iglob would be simpler if I used method 2 (see code here codepad.org/3mQEM75e), but I'd like to use method 1... glob.glob works for both, though--- say, how do I paste my updated code into my original question? I couldn't figure that out so i provided that link to it (codepad.org/3mQEM75e).

hochl Over a year ago

Not sure what you mean, but you can edit your question and that's about it. Linking to code might degrade the value of your post in the future if the link vanishes, so it is generally better to include the relevant code directly in your post.

TheGeoEngineer Over a year ago

@hochl Got it! I have edited my question with the code now, rather than with the possibly-dead-in-future link. Any ideas why I would be missing random lines in method 2 as posted in code above?
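On the missing lines in method 2: the likely cause is that `for row in reader0:` already fetches a row on each pass, and `row = reader0.next()` inside the loop then throws that row away and fetches the following one. The first file therefore advances two lines per iteration while the other readers advance one, so every other line of the first file is dropped and the rows drift out of step. The sketch below reproduces the effect on made-up sample lines and shows one way around it with `zip()`. The function names and sample data are illustrative only, and `next(reader)` is used in place of `reader.next()` so the code also runs on Python 3.

```python
import csv

# Made-up sample lines in the question's "Year Month Day Value" layout.
sample = ["2000 1 1 0.5", "2000 1 2 0.6", "2000 1 3 0.7", "2000 1 4 0.8"]


def every_other_row(lines):
    """Reproduce the method 2 pattern: a for loop plus an extra next()."""
    reader = csv.reader(lines, delimiter=" ")
    kept = []
    for row in reader:             # the loop has already fetched a row...
        row = next(reader, None)   # ...which is discarded in favour of the next one
        if row is not None:
            kept.append(row)
    return kept


def all_rows_in_lockstep(first, *others):
    """Advance every reader exactly once per iteration by using zip()."""
    readers = [csv.reader(src, delimiter=" ") for src in (first,) + others]
    # All columns from the first source, only the 4th column from the rest.
    return [rows[0] + [r[3] for r in rows[1:]] for rows in zip(*readers)]
```

With the four sample lines, `every_other_row(sample)` keeps only the second and fourth rows. Dropping the extra `next()` call, or iterating over `zip()` of all the readers, keeps every row.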

mjgpy3 · Accepted Answer · 2012-08-13 21:41:06Z

1

As far as getting the filenames goes I would use something like the following:

import os

# Gets a list of all file names that end in .txt
# ON *nix
file_names = os.popen('ls *.txt').read().split('\n')

# ON Windows
file_names = os.popen('dir /b *.txt').read().split('\n')

Then to get the elements normally separated by periods, use:

# For some file_name in file_names
file_name.split('.')

Then you can proceed to comparisons and extract the desired columns (by using open(file_name, 'r') or your CSV parser)
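As a rough sketch of those steps: `clean_names` splits command output on newlines and drops empty entries, and `site_key` pulls the pieces needed for comparison out of one file name. Both names are hypothetical. `list_txt_files` is a swapped-in alternative that uses `os.listdir` and `fnmatch` instead of shelling out to `ls` or `dir`, so it is not the method shown above, only a portable stand-in.

```python
import fnmatch
import os


def clean_names(raw_output):
    """Split command output on newlines and drop empty strings."""
    return [name.strip() for name in raw_output.split("\n") if name.strip()]


def list_txt_files(folder="."):
    """Portable stand-in for `ls *.txt` / `dir /b *.txt` (not the answer's method)."""
    return sorted(n for n in os.listdir(folder) if fnmatch.fnmatch(n, "*.txt"))


def site_key(file_name):
    """Return (site name, site number, variable) from a dot-separated file name."""
    parts = file_name.split(".")
    return parts[0], parts[1], parts[3]
```

Two files belong to the same site when the first two elements of `site_key` match; the third element identifies the variable.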

Michael G.


2 Comments

mjgpy3 Over a year ago

You will also want to remove '' (the empty string) from the list of file names.

TheGeoEngineer Over a year ago

What do you think about this code I wrote (codepad.org/3mQEM75e)? It doesn't use your code, but maybe you have some insight into this version?
