Python file and text processing

Question

So I am new to Python and I want to do the following.

I have a file with a bunch of sentences that looks like this:

- [frank bora three](noun) [go](action) level [three hundred sixty](value)
- [jack blad four](noun) [stay](action) level [two hundred eleven](value)

I want to be able to produce a file that looks like this:

text:'frank bora three', entityType:'noun'
text:'jack blad four', entityType:'noun'   
text:'go', entityType:'action'    
text:'stay', entityType:'action'
text:'three hundred sixty', entityType:'value'
text:'two hundred eleven', entityType:'value'

What I need is to remove the leading hyphen, treat whatever is between each pair of square brackets as the text, and use whatever is in the round brackets that follow it as its entityType. Some words may not be inside any brackets, and those should be ignored.

Approach: The first thing I tried was to put all the sentences in an array:

import re
with open('new_file.txt') as f1:
    lines = f1.readlines()
array_length = len(lines)
for i in range(array_length):
    lines[i]=re.sub(r"\b/-\w+", "", lines[i])
print (lines[0])

After that I tried to remove the hyphen using re, but it's not working for me: the hyphens were still there when I printed the array.

I hope my question is clear.

Thank you in advance,

Please post the re code that you tried, and in what way it did not work. This is the real crux of your question.
– PaulMcG, Commented Mar 9, 2020 at 14:38
Add this important info by editing your question - people don't always scan comments for additional info.
– PaulMcG, Commented Mar 9, 2020 at 14:47

neutrino_logic · Accepted Answer · 2020-03-09 16:31:26Z

1

It's often easier, when parsing a complex string like this, to use a two-stage approach. First split each string on the closing parenthesis (here foo is one line of the file):

temp = foo.split(')')[0:3]

For the first string, this gives a list of substrings:

temp = ['[frank bora three](noun', ' [go](action', ' level [three hundred sixty](value']
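A minimal sketch of this first stage, assuming foo comes from one raw line of the file (the name raw_line is made up for illustration). The leading hyphen is dropped with str.lstrip, which is a swapped-in alternative to the re.sub attempt in the question, and the empty piece left after the last ')' is filtered out instead of relying on the [0:3] slice:

```python
# One raw line as it would come out of readlines(); raw_line is a hypothetical name.
raw_line = "- [frank bora three](noun) [go](action) level [three hundred sixty](value)\n"

# Drop the trailing newline, then strip leading '-' and ' ' characters.
foo = raw_line.strip().lstrip('- ')

# split(')') leaves an empty string after the final ')', so keep only
# the non-empty pieces rather than hard-coding three entities per line.
temp = [piece for piece in foo.split(')') if piece.strip()]
print(temp)
```

Filtering the pieces rather than slicing means a line with two or four bracketed entities would still be handled, though that is an assumption about the data and not something the question states.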

Now we can write simpler regexes to pull out the desired text from each substring:

import re

re_text = re.compile(r'\[.+\]')    # bracketed text, e.g. '[go]'
re_entity = re.compile(r'\(.+')    # everything from '(' onward, e.g. '(action'
mytext = []
myentities = []
for target in temp:
    mytext.append(re_text.search(target).group().strip('[]'))
    myentities.append(re_entity.search(target).group().strip('()'))

So now you have two lists:

mytext = ['frank bora three', 'go', 'three hundred sixty']
myentities = ['noun', 'action', 'value']

Zip them together and make a new list of tuple pairs:

result = list(zip(mytext, myentities))

which looks like this:

[('frank bora three', 'noun'),
 ('go', 'action'),
 ('three hundred sixty', 'value')]

And now you can format these pairs into strings. (To group the strings as in your desired output, build a list of them and sort it by the last word, which is the entity type, before writing it to a file.)
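The formatting and grouping step could be sketched as below. This is only one way to do it: the pairs list is typed in by hand for both sample lines, and the grouping keeps entity types in the order they first appear (noun, action, value), because a plain alphabetical sort on the entity type would put action first and not match the output shown in the question.

```python
# Pairs as produced by the zip step, written out here for both sample lines.
pairs = [
    ('frank bora three', 'noun'), ('go', 'action'), ('three hundred sixty', 'value'),
    ('jack blad four', 'noun'), ('stay', 'action'), ('two hundred eleven', 'value'),
]

# Record the order in which each entity type first appears.
first_seen = {}
for _, entity in pairs:
    first_seen.setdefault(entity, len(first_seen))

# sorted() is stable, so pairs with the same entity type keep their file order.
grouped = sorted(pairs, key=lambda pair: first_seen[pair[1]])

output_lines = [
    "text:'{}', entityType:'{}'".format(text, entity) for text, entity in grouped
]
print('\n'.join(output_lines))
```

To write the result to a file instead of printing it, the joined string can be passed to the write method of a file opened in 'w' mode.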

edited Mar 9, 2020 at 16:31

answered Mar 9, 2020 at 15:08

neutrino_logic


1 Comment

neutrino_logic Over a year ago

Just noticed I had a typo in that list(zip) statement, now fixed

Fabian · Answer · 2020-03-09 14:58:37Z

1

You don't really need a regex:

Just string split between the brackets :)

s = "- [frank bora three]asdasd(noun) [go](action) level [three hundred sixty](value)"

print(s[s.find("[")+1:s.find("]")]) # text inside the first []
print(s[s.find("(")+1:s.find(")")]) # entity type inside the first ()

Now you need to read in your file, split it into lines, and loop over them:

stringfile = """- [frank bora three](noun) [go](action) level [three hundred sixty](value)
- [jack blad four](noun) [stay](action) level [two hundred eleven](value)"""


for s in stringfile.splitlines():
    text = s[s.find("[")+1:s.find("]")]
    noun = s[s.find("(")+1:s.find(")")]

    print(text)
    print(noun)
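The loop above only picks up the first bracket pair on each line. A possible extension of the same find-based idea, passing a start offset to str.find so that every pair on the line is visited, is sketched here. The function name extract_pairs is made up for illustration, and the sketch assumes every [text] is eventually followed by an (entity) on the same line:

```python
def extract_pairs(line):
    """Return every (text, entity_type) pair found in one line."""
    pairs = []
    pos = 0
    while True:
        open_sq = line.find('[', pos)
        if open_sq == -1:
            break  # no more bracketed text on this line
        close_sq = line.find(']', open_sq)
        open_par = line.find('(', close_sq)
        close_par = line.find(')', open_par)
        if -1 in (close_sq, open_par, close_par):
            break  # malformed remainder, stop instead of guessing
        pairs.append((line[open_sq + 1:close_sq], line[open_par + 1:close_par]))
        pos = close_par + 1  # continue searching after this pair
    return pairs


stringfile = """- [frank bora three](noun) [go](action) level [three hundred sixty](value)
- [jack blad four](noun) [stay](action) level [two hundred eleven](value)"""

for s in stringfile.splitlines():
    print(extract_pairs(s))
```

Words outside the brackets, such as level, and the leading hyphen are skipped automatically because the search always jumps to the next '['.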

answered Mar 9, 2020 at 14:58

Fabian


1 Comment

Imane Over a year ago

Thank you for your answer. I have accepted the other one because in my example there are entities other than noun (action, value, ...), but yours still answers the problem.
