Python text extraction

Question

I'm working on text extraction with Python, but the output is not what I want.

I have a text file containing information like this:

FN Clarivate Analytics Web of Science
VR 1.0

PT J

AU Chen, G

   Gully, SM

   Whiteman, JA

   Kilcullen, RN

AF Chen, G

   Gully, SM

   Whiteman, JA

   Kilcullen, RN

TI Examination of relationships among trait-like individual differences,

   state-like individual differences, and learning performance

SO JOURNAL OF APPLIED PSYCHOLOGY

CT 13th Annual Conference of the

   Society-for-Industrial-and-Organizational-Psychology

CY APR 24-26, 1998

CL DALLAS, TEXAS

SP Soc Ind & Org Psychol

RI Gully, Stanley/D-1302-2012

OI Gully, Stanley/0000-0003-4037-3883

SN 0021-9010

PD DEC

PY 2000

VL 85

IS 6

BP 835

EP 847

DI 10.1037//0021-9010.85.6.835

UT WOS:000165745400001

PM 11125649

ER

and when I use my code like this

import random
import sys

filepath = "data\jap_2000-2001-plain.txt"

with open(filepath) as f:
    articles = f.read().strip().split("\n")

articles_list = []

author = ""
title = ""
year = ""
doi = ""

for article in articles:
    if "AU" in article:
        author = article.split("#")[-1]
    if "TI" in article:
        title = article.split("#")[-1]
    if "PY" in article:
        year = article.split("#")[-1]
    if "DI" in article:
        doi = article.split("#")[-1]
    if article == "ER#":
        articles_list.append("{}, {}, {}, https://doi.org/{}".format(author, title, year, doi))
print("Oh hello sir, how many articles do you like to get?")
amount = input()

random_articles = random.sample(articles_list, k = int(amount))


for i in random_articles:
    print(i)
    print("\n")

exit = input('Please enter exit to exit: \n')
if exit in ['exit','Exit']:
    print("Goodbye sir!")
    sys.exit()

The extraction does not include data that appears after a line break. If I run this code, the output looks like "AU Chen, G" and does not include the other names; the same happens with the title, and so on.

My output looks like:

Chen, G. Examination of relationships among trait, 2000, doi.dx.10.1037//0021-9010.85.6.835

The desired output should be:

Chen, G., Gully, SM., Whiteman, JA., Kilcullen, RN., 2000, Examination of relationships among trait-like individual differences, state-like individual differences, and learning performance, doi.dx.10.1037//0021-9010.85.6.835

but the extraction only includes the first line of each field.

Any suggestions?

Could you make your example file a little more concise, and be specific about what your intended output should be?
– Jordan Singer, Commented Feb 4, 2019 at 16:32
The example file is a .txt export from a search engine; it contains data extracted from articles. The desired output should be: Chen, G., Gully, SM., Whiteman, JA., Kilcullen, RN., 2000, Examination of relationships among trait-like individual differences, state-like individual differences, and learning performance, doi.dx.10.1037//0021-9010.85.6.835, but the extraction only includes the first line of each field.
– André Kalmendal, Commented Feb 4, 2019 at 16:34
I get that, but it's hard for us to parse it by eye. If you can simplify it for us, and replace the "yadayada" with the true expected output, that would be great.
– Jordan Singer, Commented Feb 4, 2019 at 16:34

Ryan Widmaier · Accepted Answer · 2019-02-04 17:32:03Z

You need to track what section you are in as you are parsing the file. There are cleaner ways to write the state machine, but as a quick and simple example, you could do something like below.

Basically, add all the lines for each section to a list for that section, then combine the lists and do whatever you need at the end. Note: I didn't test this; it is just pseudocode to show you the general idea.

authors = []
title = []
section = None

for line in articles:
    # Skip blank lines; keep leading spaces so continuation lines stay recognizable
    if not line.strip():
        continue

    # A tag in columns 0-1 starts a new section; select the right list to add to
    if line.startswith("AU "):
        section = authors
    elif line.startswith("TI "):
        section = title
    elif not line.startswith("  "):
        # Any other tag (AF, SO, ...): add more branches here, or ignore its text
        section = None

    # Add the text after the 3-character tag column to the current section
    text = line[3:].strip()
    if text and section is not None:
        section.append(text)

authors_str = ', '.join(authors)
title_str = ' '.join(title)
print(authors_str, title_str)

edited Feb 4, 2019 at 17:32

answered Feb 4, 2019 at 16:37

Ryan Widmaier


2 Comments

André Kalmendal Over a year ago

With this code the output is empty at the moment. It seems like it can't find the lines starting with "AU" in the text file.

Ryan Widmaier Over a year ago

The section check needs to be "section is not None". I updated it, but as I stated in the description, it's intended as pseudocode to give you the general idea. You will need to adapt it to fully implement your case.
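
The empty output reported above is consistent with a truthiness test on the section list. That is an assumption, since the pre-edit code is not shown here. In Python an empty list is falsy, so a check such as `if line and section:` can never append the first item, whereas `section is not None` only rules out the case where no section has been selected yet. A minimal sketch with made-up variable names:

```python
authors = []          # empty list: falsy, but not None
section = authors     # pretend the parser has just selected the authors section
line = "Chen, G"

# Truthiness check: skipped, because an empty list evaluates to False.
if line and section:
    section.append(line)
truthy_result = list(authors)

# Identity check: passes, because the list exists even though it is empty.
if line and section is not None:
    section.append(line)
identity_result = list(authors)

print(truthy_result)    # []
print(identity_result)  # ['Chen, G']
```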

aghast · Accepted Answer · 2019-02-04 18:44:47Z

Initial Understanding

Based on your example, I believe that:

The text is provided in lines.
The example text appears to have too many newlines, possibly an artifact of it being migrated from DOS/Windows? If so, either CRLF processing is needed, or alternate lines should be ignored.
The lines are divided into sections.
Each section is delimited by a two-letter uppercase tag in columns 0,1 at the first line in the section, and continues until the start of a new section.
Each line has either a tag or 2 blank spaces, followed by a blank space, in columns 0-2.
The artificial section delimited by tag ER marks the end-of-record.
The ER section contains no usable text.

It may also be the case that:

Records are begun by the FN tag.
Any text encountered outside of a FN / ER pair can be ignored.
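
The assumptions above can be sketched as a small line classifier. The helper name `classify_line`, its return labels, and the sample string are illustrative only; the sketch assumes the tag occupies columns 0-1 and the text starts at column 3:

```python
def classify_line(line):
    """Return (kind, tag, text) for one line of the export.

    kind is 'blank', 'continuation' or 'tag' (hypothetical labels).
    """
    line = line.rstrip("\r\n")          # tolerate DOS/Windows line endings
    if not line.strip():
        return ("blank", None, "")
    tag, text = line[:2], line[3:].strip()
    if tag == "  ":
        # Two leading blanks: this line continues the previous section
        return ("continuation", None, text)
    return ("tag", tag, text)


sample = "AU Chen, G\r\n\r\n   Gully, SM\r\nER\r\n"
# str.splitlines() splits on \n, \r\n and \r alike
kinds = [classify_line(raw) for raw in sample.splitlines()]
for item in kinds:
    print(item)
```

Blank lines are reported separately, so the caller can skip them whether they come from CRLF conversion or from the export itself.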

Suggested Design

If this is true, I recommend you write a text processor using that logic:

Read lines.
Handle CR/LF processing; or skip alternate lines; or "don't worry the real text doesn't have these line breaks"?
Use a state machine with an unknown number of states, the initial state being ER.
Special rule: Ignore text in the ER state until a FN line is encountered.
General rule: when a tag is seen, end the previous state and begin a new state named after the seen tag. Any accumulated text is added to the record.
If no tag is seen, accumulate text in the previous tag.
Special rule: when the ER state is entered, add the accumulated record to the list of accumulated records.

At the end of this process, you will have a list of records, having various accumulated tags. You may then process the tags in various ways.

Something like this:

from warnings import warn

Debug = True

def read_lines_from(file):
    """Read and split lines from file. This is a separate function, instead
       of just using file.readlines(), in case extra work is needed like
       dos-to-unix conversion inside a unix environment.
    """
    with open(file) as f:
        text = f.read()
        lines = text.split('\n')

    return lines

def parse_file(file):
    """Parse file in format given by 
        https://stackoverflow.com/questions/54520331
    """
    lines = read_lines_from(file)
    state = 'ER'
    records = []
    current = None

    for line_no, line in enumerate(lines):
        tag, rest = line[:2], line[3:]

        if Debug:
            print(F"State: {state}, Tag: {tag}, Rest: {rest}")

        # Skip empty lines
        if tag == '':
            if Debug:
                print(F"Skip empty line at {line_no}")
            continue

        if tag == '  ':
            # Append text, except in ER state.
            if state != 'ER':
                if Debug:
                    print(F"Append text to {state}: {rest}")
                current[state].append(rest)
            continue

        # Found a tag. Process it.

        if tag == 'ER':
            if Debug:
                print("Tag 'ER'. Completed record:")
                print(current)

            # Guard against an 'ER' seen without a preceding 'FN'.
            if current is not None:
                records.append(current)
            current = None
            state = tag
            continue

        if tag == 'FN':
            if state != 'ER':
                warn(F"Found 'FN' tag without previous 'ER' at line {line_no}")
                if len(current.keys()):
                    warn(F"Previous record (FN:{current['FN']}) discarded.")

            if Debug:
                print("Tag 'FN'. Create empty record.")

            current = {}

        # All tags except ER get this:
        if current is None:
            # Special rule: ignore tagged text outside a FN/ER pair.
            continue

        if Debug:
            print(F"Tag '{tag}'. Create list with rest: {rest}")

        current[tag] = [rest]
        state = tag

    return records

if __name__ == '__main__':
    records = parse_file('input.txt')
    print('Records =', records)
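
As one possible way to process the tags afterwards, the sketch below joins the accumulated lists into a single citation line. The helper `format_record` is hypothetical, the field order is only a guess at what the question asks for, and the `https://doi.org/` prefix is taken from the question's code. The example record is written by hand in the shape `parse_file()` is expected to return (tag mapped to a list of strings):

```python
def format_record(record):
    """Build one citation line from a parsed record (tag -> list of strings)."""
    authors = ", ".join(record.get("AU", []))
    title = " ".join(record.get("TI", []))
    year = " ".join(record.get("PY", []))
    doi = "".join(record.get("DI", []))
    return f"{authors}, {year}, {title}, https://doi.org/{doi}"


# Hand-written record, not real parser output
example = {
    "AU": ["Chen, G", "Gully, SM", "Whiteman, JA", "Kilcullen, RN"],
    "TI": ["Examination of relationships among trait-like individual differences,",
           "state-like individual differences, and learning performance"],
    "PY": ["2000"],
    "DI": ["10.1037//0021-9010.85.6.835"],
}
citation = format_record(example)
print(citation)
```

Using `record.get(tag, [])` keeps the formatter from failing on records that lack one of the fields.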

To my small knowledge in coding and the language of python, this looks correct to me!
