Read text files with special format in python

Question

I want to transform a folder of text documents into the following format:

texts = ['text of document 1', 'text of document 2', 'text of document 3',...]

in order to apply text mining methods.

So far my code is the following:

import os
file= "*.txt"
path = "C:\\"
texts=[]

for files in os.listdir(path):
     with open(path + files) as f:
         for x in f:
             texts.append(x)

Unfortunately, the outcome differs from the wanted one:

texts = ['line 1 of document 1', 'line 2 of document 1', …]

What am I doing wrong? Can anybody suggest an improvement to my code?

So you want to read all the txt files in a folder and store their content in a list?
– DirtyBit, Commented Mar 19, 2019 at 7:32
Yes, I already used f.read(), but then the list has empty entries: texts = ['','','',...]
– Nils_Denter, Commented Mar 19, 2019 at 7:32

Tim Pietzcker · Accepted Answer · 2019-03-19 07:34:18Z

3

for line in file: (or in your case, for x in f:) iterates over the lines in a file, so your loop appends each line as a separate list entry.

Use the .read() method instead. That will read the entire file into a string:

for files in os.listdir(path):
     with open(path + files) as f:
         texts.append(f.read())
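
To see the difference between the two approaches without touching the file system, here is a minimal sketch that uses io.StringIO as a stand-in for an open text file. The sample content is made up for illustration.

```python
import io

# io.StringIO stands in for an open text file; the content is made up.
sample = "line 1 of document 1\nline 2 of document 1\n"

# Iterating over a file object yields one string per line
# (each still ending in its newline character).
lines = [x for x in io.StringIO(sample)]

# .read() returns the entire content as a single string.
whole = io.StringIO(sample).read()

print(len(lines))       # number of list entries when iterating
print(whole == sample)  # .read() gives back the full text in one piece
```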

Edit: I just saw your comment about empty entries. If your directory contains empty files, you can prevent them from being added:

for files in os.listdir(path):
     with open(path + files) as f:
         contents = f.read()
         if contents.strip(): # also skips files that contain only whitespace
             texts.append(contents) # a second f.read() here would return ''
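
As an alternative sketch, the same idea can be written with pathlib so that only files matching a pattern such as "*.txt" are read, which the unused file variable in the question seems to aim for. The function name read_texts, its parameters, and the UTF-8 encoding are assumptions for illustration, not part of the original code.

```python
from pathlib import Path


def read_texts(folder, pattern="*.txt", encoding="utf-8"):
    """Return the content of each matching file as one string per file.

    The name, the pattern default, and the encoding are hypothetical
    choices for this sketch; adjust them to your data.
    """
    texts = []
    # sorted() makes the order of the documents reproducible
    for file_path in sorted(Path(folder).glob(pattern)):
        if not file_path.is_file():
            continue  # skip directories whose names match the pattern
        contents = file_path.read_text(encoding=encoding)
        if contents.strip():  # skip empty or whitespace-only files
            texts.append(contents)
    return texts
```

With the folder from the question, this could be called as texts = read_texts("C:\\").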

edited Mar 19, 2019 at 7:34

answered Mar 19, 2019 at 7:32

