I want to parse 10-K files (firms' annual financial statements). An example from Apple can be found here (look for the .txt file). I was reading this research paper (see pages 30-31) on how to parse these files. Step one is described as removing all ASCII-encoded segments, and that is what I want to figure out how to do.

I see several questions on Stack Overflow about removing non-ASCII characters, but this is different. The ASCII-encoded segments are all document segments with <TYPE> tags of GRAPHIC, ZIP, EXCEL and PDF, and I want to delete them.

So if I load a txt file as follows:

with open('F:\\file.txt', 'r') as fil:
    x = fil.read()

How can I remove all ASCII Encoded segments from this txt file? To remove HTML tags, I use the procedure here, but what about ASCII Encoded segments?

  • What is an "ASCII tag"? Commented Nov 5, 2014 at 7:30
  • @IgnacioVazquez-Abrams Sorry I updated my question. I didn't mean tags like in HTML tags. Commented Nov 5, 2014 at 7:32
  • Please give an example! Commented Nov 5, 2014 at 7:33
  • @Plug4: "ASCII encoded segment" is not a known term on Google, so you're going to need to explain in a lot more detail exactly what you're talking about, what you're trying to do, and why. Commented Nov 5, 2014 at 7:34
  • It will turn it from a valid whatever file into potentially no longer a valid whatever file, somewhat depending on the whatever specification. If all you care about is extracting things that look like human-readable text, this should not matter. Commented Nov 5, 2014 at 8:32

1 Answer

If I understand you correctly, the format you are processing is somehow related to the SEC EDGAR process.

I have not taken the time to look it up formally. Perhaps you should.

From inspecting the Apple statement you link to, it looks like you want to replace anything matching the regular expression <DOCUMENT>\s*<TYPE>(?:GRAPHIC|ZIP|EXCEL|PDF).*?</DOCUMENT> with an empty string.
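As a rough sketch of that substitution in Python: the pattern is the one quoted above, compiled with re.DOTALL because `.` does not match newlines by default and these segments span many lines. The function name strip_encoded_segments and the BINARY_TYPES tuple are made up for illustration, and the sketch assumes every segment is closed by a </DOCUMENT> tag.

```python
import re

# Segment types treated as "ASCII-encoded" in the question.
BINARY_TYPES = ("GRAPHIC", "ZIP", "EXCEL", "PDF")

# Same pattern as in the text, built from the tuple above.
# re.DOTALL lets ".*?" run across line breaks; the non-greedy
# quantifier stops at the first </DOCUMENT> after the <TYPE> tag.
SEGMENT_RE = re.compile(
    r"<DOCUMENT>\s*<TYPE>(?:" + "|".join(BINARY_TYPES) + r").*?</DOCUMENT>",
    flags=re.DOTALL,
)


def strip_encoded_segments(text):
    """Return text with every matching <DOCUMENT> block removed."""
    return SEGMENT_RE.sub("", text)
```

With the file contents already read into x, as in the question, this would be called as x = strip_encoded_segments(x).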

Disclaimer: A proper implementation would use an XML parser and extract the elements you want, instead of attempting to lexically zap things you don't want. This should not be hard in lxml.

I first thought this was XBRL but it's not. Attempting to parse it with ETree throws an exception because the close tags for some elements (including <TYPE>) appear to be optional. The best way forward would be to find out what format this is (the EDGAR site has several specifications; one of them, perhaps?) and locate a proper DTD, then proceed from there.

Once you have that sorted out, you want to see how to remove elements with XPath and perhaps how to use regex in (lxml) XPath. Then probably reimplement the other extractions you have already done using XML and XPath.

4 Comments

The standard library also contains XML parsers; they can be useful too.
@tripleee Ah I see. So I should work with the XBRL files instead? The reason I am somewhat hesitant about the XBRL files is that there are years where only txt files are available. For instance sec.gov/Archives/edgar/data/320193/…. My objective is to grab the section "MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS" in all of these files. I will have to work harder to get this!
You can probably use the .txt files but you need a proper understanding of what format they are in. There are good reasons to not do "hit and run" regex extractions of well-defined XML formats, but if it's a quick one-off, maybe that's what you want to do in the end. However, all things counted, it's not really more work (modulo the learning curve) to do it properly, and the end result will be a lot more understandable, robust, and well-defined.
Looking at the txt files in more detail, I realize that I could extract all the sections I want if I write code that reads line by line and keeps all the text that appears between "Item 7." and "Item 8.". Then on the remaining text I can apply regex extractions and strip the HTML tags. Now how to do that, I will have to think! Thanks for all the help
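One possible sketch of that line-by-line idea is below. The function name extract_item7 and both patterns are assumptions for illustration: they expect the headings to start a line and to be written as "Item 7." and "Item 8." (any case). A filing's table of contents may contain the same headings, in which case the first match would be the table-of-contents entry rather than the section itself, so treat this as a starting point only.

```python
import re

# Hypothetical heading patterns: "Item 7." / "Item 8." at the start of a line.
# The literal dot keeps "Item 7A." from being treated as the start or end marker.
START_RE = re.compile(r"^\s*Item\s+7\.", re.IGNORECASE)
END_RE = re.compile(r"^\s*Item\s+8\.", re.IGNORECASE)


def extract_item7(lines):
    """Return the text from the first "Item 7." line up to, but not
    including, the next "Item 8." line."""
    kept = []
    inside = False
    for line in lines:
        if not inside and START_RE.match(line):
            inside = True          # reached the start heading
        elif inside and END_RE.match(line):
            break                  # reached the end heading; stop reading
        if inside:
            kept.append(line)
    return "".join(kept)
```

Because an open file object iterates line by line, the function can be given the file directly, for example extract_item7(fil) inside a with open(...) as fil: block.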
