0

I have a large text file containing both strings and numbers, as shown below. I want to read only the numbers, drop the rows that have just 3 columns, and write the rest into an m-by-n matrix. Could someone tell me the best way to manipulate such files in Python?

My file is something like:

# Chunk-averaged data for fix Dens and group ave
# Timestep Number-of-chunks Total-count
# Chunk Coord1 Ncount density/number
4010000 14 1500
  1 4.323 138.758 0.00167105
  2 12.969 121.755 0.00146629
  3 21.615 127.7 0.00153788
  4 30.261 131.682 0.00158584
  5 38.907 127.525 0.00153578
  6 47.553 136.322 0.00164172
  7 56.199 118.014 0.00142124
  8 64.845 125.842 0.00151551
  9 73.491 120.684 0.00145339
  10 82.137 132.282 0.00159306
  11 90.783 121.567 0.00146402
  12 99.429 97.869 0.00117863
  13 108.075 0 0
  14 116.721 0 0......
4
  • Use Regex for extracting numbers! Commented Jun 28, 2018 at 14:51
  • Is it just that header line that only has three numbers, or do lines like that reoccur? If the former, just open the file, skip the first four lines, then have numpy read the rest. If the latter, just have numpy read the whole thing with nan fill and then select the lines where none of the columns are nan. Commented Jun 28, 2018 at 14:53
  • Read line by line: if the line contains a letter, skip it; otherwise convert it to a list. If the list has only 4 elements (3 columns and one index column), skip it; otherwise add it to a DataFrame. Commented Jun 28, 2018 at 14:58
  • @ᴀʀᴍᴀɴ I think regex would be vastly overkill! There are great methods from numpy :) Commented Jun 28, 2018 at 22:25

3 Answers

2

You haven't specified what exactly you meant by matrix, so here is a solution that will turn your text file into a 2d list, making each number individually accessible.

It checks that the first item in a given row is a number, and that there are 4 items in the row, in which case it will append that line as 4 separate numbers to the 2d list mat. If you want to access any number in mat, you can use mat[i][j].

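with open("test.txt") as f:
    content = f.readlines()

content = [x.strip() for x in content]
mat = []

for line in content:
    s = line.split()  # split on any run of whitespace
    # keep only rows with exactly 4 fields whose first field is an integer index
    if len(s) == 4 and s[0].isdigit():
        mat.append([float(x) for x in s])  # store numbers, not strings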
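
As a small illustration of the mat[i][j] access, here is a sketch that assumes the rows of mat already hold floats. The three rows are copied from the sample in the question, and the names ncount, columns and coords are only illustrative.

```python
# Assumed input: a 2d list whose rows hold four floats each
# (first three data rows of the sample in the question).
mat = [
    [1.0, 4.323, 138.758, 0.00167105],
    [2.0, 12.969, 121.755, 0.00146629],
    [3.0, 21.615, 127.7, 0.00153788],
]

# Single element: second row, third column (indices are zero-based).
ncount = mat[1][2]

# zip(*mat) transposes the rows, which gives access by column instead.
columns = [list(col) for col in zip(*mat)]
coords = columns[1]  # the Coord1 column

print(ncount)
print(coords)
```

If column access is the common case, converting mat to a NumPy array may be more convenient than transposing by hand.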

2

With a copy-n-paste of your sample to txt:

In [350]: np.genfromtxt(txt.splitlines(), invalid_raise=False)
/usr/local/bin/ipython3:1: ConversionWarning: Some errors were detected !
    Line #2 (got 4 columns instead of 3)
    Line #3 (got 4 columns instead of 3)
  ....
  #!/usr/bin/python3
Out[350]: array([4.01e+06, 1.40e+01, 1.50e+03])

That read the first non-comment line, and took that as the standard. Skipping that, I can read all the lines:

In [351]: np.genfromtxt(txt.splitlines(), invalid_raise=False,skip_header=4)
Out[351]: 
array([[1.00000e+00, 4.32300e+00, 1.38758e+02, 1.67105e-03],
       [2.00000e+00, 1.29690e+01, 1.21755e+02, 1.46629e-03],
       [3.00000e+00, 2.16150e+01, 1.27700e+02, 1.53788e-03],
       [4.00000e+00, 3.02610e+01, 1.31682e+02, 1.58584e-03],
       [5.00000e+00, 3.89070e+01, 1.27525e+02, 1.53578e-03],
       [6.00000e+00, 4.75530e+01, 1.36322e+02, 1.64172e-03],
       [7.00000e+00, 5.61990e+01, 1.18014e+02, 1.42124e-03],
       [8.00000e+00, 6.48450e+01, 1.25842e+02, 1.51551e-03],
       [9.00000e+00, 7.34910e+01, 1.20684e+02, 1.45339e-03],
       [1.00000e+01, 8.21370e+01, 1.32282e+02, 1.59306e-03],
       [1.10000e+01, 9.07830e+01, 1.21567e+02, 1.46402e-03],
       [1.20000e+01, 9.94290e+01, 9.78690e+01, 1.17863e-03],
       [1.30000e+01, 1.08075e+02, 0.00000e+00, 0.00000e+00],
       [1.40000e+01, 1.16721e+02, 0.00000e+00, 0.00000e+00]])

Actually in this case all the rest have the required 4. If I truncate the last 2 lines, I get the warning, but it still reads the other lines.

Filtering the lines before passing them to genfromtxt is another option. genfromtxt accepts any input that feeds it lines - a file, a list of strings, or a function that reads and filters a file.
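
A minimal sketch of that filtering idea, assuming the 3-column header line reoccurs between blocks. The helper name four_column_lines is hypothetical, and the second block in the sample (timestep 4020000 and its row) is invented purely for illustration; with a real file you would pass the open file object instead of io.StringIO.

```python
import io

import numpy as np


def four_column_lines(lines):
    """Yield only non-comment lines with exactly four whitespace-separated fields."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment lines
        if len(stripped.split()) == 4:
            yield stripped


# Stand-in for the file; the second block is made-up data.
sample = """# Chunk-averaged data for fix Dens and group ave
# Timestep Number-of-chunks Total-count
# Chunk Coord1 Ncount density/number
4010000 14 1500
  1 4.323 138.758 0.00167105
  2 12.969 121.755 0.00146629
4020000 14 1500
  1 4.323 130.1 0.0015
"""

# genfromtxt consumes the generator line by line, so the
# 3-column header lines never reach the parser.
data = np.genfromtxt(four_column_lines(io.StringIO(sample)))
print(data.shape)
```

Because the 3-column lines are dropped before parsing, neither invalid_raise=False nor skip_header is needed here.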


0

For this task you need to iterate over the lines of the file, split each line with str.split(), and use a regular expression to check its contents:

import re  # used to check that a line contains only numbers

matrix = []  # the numbers end up here, one list per row

with open('myfile.txt') as f:
    for line in f:  # iterate over the file line by line
        # skip lines containing anything other than digits, dots, and whitespace
        if not re.fullmatch(r'[\s0-9.]+', line):
            continue

        # split on whitespace; this ignores leading spaces and the trailing newline
        list_of_items = line.split()
        if len(list_of_items) <= 3:  # rows with 3 or fewer columns are left out
            continue

        # convert the strings to floats and add them as a new row
        matrix.append([float(an_item) for an_item in list_of_items])

I tried to make code and comments speak, let me know if anything is unclear
