Remove ASCII-encoded binary blobs from .txt files

Question

I want to parse 10-K files (firms' annual financial statements). An example, Apple's, can be found here (look for the .txt file). I was reading this research paper (see pages 30-31) on how to parse these files. Step one is described as removing all ASCII-encoded segments, and that is what I am trying to figure out how to do.

I see several questions on Stack Overflow about removing non-ASCII characters, but this is different. ASCII-encoded segments are all document segments with <TYPE> tags of GRAPHIC, ZIP, EXCEL and PDF, and I want to delete them.

So if I load a .txt file as follows:

# Read the whole filing into one string; the with block closes the file.
with open('F:\\file.txt', 'r') as fil:
    x = fil.read()

How can I remove all ASCII-encoded segments from this .txt file? To remove HTML tags, I use the procedure here, but what about ASCII-encoded segments?

@IgnacioVazquez-Abrams Sorry, I updated my question. I didn't mean tags as in HTML tags.
– Plug4, Commented Nov 5, 2014 at 7:32
@Plug4: "ASCII encoded segment" is not a known term on Google, so you're going to need to explain in a lot more detail exactly what you're talking about, what you're trying to do, and why.
– John Zwinck, Commented Nov 5, 2014 at 7:34
It will turn it from a valid whatever file into potentially no longer a valid whatever file, somewhat depending on the whatever specification. If all you care about is extracting things that look like human-readable text, this should not matter.
– tripleee, Commented Nov 5, 2014 at 8:32

Community · Accepted Answer · 2017-05-23 11:57:46Z

If I understand you correctly, the format you are processing is somehow related to the SEC EDGAR process.

I have not taken the time to look it up formally. Perhaps you should.

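From inspecting the Apple statement you link to, it looks like you want to replace anything matching the regular expression <DOCUMENT>\s*<TYPE>(?:GRAPHIC|ZIP|EXCEL|PDF).*?</DOCUMENT> with an empty string. The segments span many lines, so the pattern needs the DOTALL flag for .*? to match across line breaks.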
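A minimal sketch of that substitution in Python follows. The pattern is the one given above; the function name strip_binary_segments is made up for illustration, and the assumption that every binary segment is wrapped in a <DOCUMENT> ... </DOCUMENT> pair comes only from inspecting the linked filing, not from a specification.

```python
import re

# Pattern from the answer. re.DOTALL lets '.' match newlines, which is
# needed because each encoded segment spans many lines.
BINARY_DOC_RE = re.compile(
    r"<DOCUMENT>\s*<TYPE>(?:GRAPHIC|ZIP|EXCEL|PDF).*?</DOCUMENT>",
    re.DOTALL,
)


def strip_binary_segments(text):
    """Return text with GRAPHIC/ZIP/EXCEL/PDF document segments removed.

    Hypothetical helper: it assumes each such segment starts with
    <DOCUMENT>, followed by its <TYPE> line, and ends at the next
    </DOCUMENT>.
    """
    return BINARY_DOC_RE.sub("", text)
```

With the variable from the question, this would be called as x = strip_binary_segments(x). The non-greedy .*? stops at the first </DOCUMENT> after the match starts, so text documents that follow a binary one are left alone.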

Disclaimer: A proper implementation would use an XML parser and extract the elements you want, instead of attempting to lexically zap things you don't want. This should not be hard in lxml.

I first thought this was XBRL, but it's not. Attempting to parse it with ETree throws an exception because the close tags for some elements (including <TYPE>) appear to be optional. The best way forward would be to find out what format this is (the EDGAR site has several specifications; one of them, perhaps?) and locate a proper DTD, then proceed from there.
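The parse failure is easy to reproduce with the standard library's xml.etree.ElementTree. The fragment below is a made-up stand-in that imitates the layout of the filing (a <TYPE> tag that is opened but never closed), not real filing content, and parses_as_xml is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up fragment imitating the filing layout: <TYPE> has no close tag.
SAMPLE = "<DOCUMENT>\n<TYPE>10-K\n<TEXT>Some filing text</TEXT>\n</DOCUMENT>"


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Raised here because </DOCUMENT> arrives while <TYPE> is still open.
        return False
    return True


print(parses_as_xml(SAMPLE))
```

Because a strict XML parser rejects the unclosed tag, either a format-aware parser or the lexical approach is needed for these files.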

Once you have that sorted out, you want to see how to remove elements with XPath and perhaps how to use regex in (lxml) XPath. Then probably reimplement the other extractions you have already done using XML and XPath.

edited May 23, 2017 at 11:57

answered Nov 5, 2014 at 7:56

tripleee

4 Comments

Eric O. Lebigot Over a year ago

The standard library also contains XML parsers, which can be useful too.

Plug4 Over a year ago

@tripleee Ah, I see. So I should work with the XBRL files instead? The reason I am somewhat hesitant about the XBRL files is that there are years where only .txt files are available. For instance sec.gov/Archives/edgar/data/320193/…. My objective is to grab the section "MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS" in all of these files. I will have to work harder to get this!

tripleee Over a year ago

You can probably use the .txt files but you need a proper understanding of what format they are in. There are good reasons to not do "hit and run" regex extractions of well-defined XML formats, but if it's a quick one-off, maybe that's what you want to do in the end. However, all things counted, it's not really more work (modulo the learning curve) to do it properly, and the end result will be a lot more understandable, robust, and well-defined.

Plug4 Over a year ago

Looking at the .txt files in more detail, I realize that I could extract the section I want by writing code that reads the file line by line and keeps all the text that appears between "Item 7." and "Item 8.". I can then apply regex extractions and strip HTML tags on what remains. Now, how to do that, I will have to think! Thanks for all the help.
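The line-by-line idea in the last comment could be sketched roughly as below. Several things here are assumptions rather than anything stated in the thread: the marker patterns, the name extract_between, and the choice to keep the longest run, which is a guess at how to skip the short "Item 7." to "Item 8." span that a table of contents would produce.

```python
import re

# Assumed marker patterns: a line that begins with "Item 7." or "Item 8.".
# "Item 7A." does not match START_RE because of the letter after the 7.
START_RE = re.compile(r"^\s*Item\s+7\.", re.IGNORECASE)
END_RE = re.compile(r"^\s*Item\s+8\.", re.IGNORECASE)


def extract_between(lines, start_re=START_RE, end_re=END_RE):
    """Return the longest run of lines from a start marker up to,
    but not including, the next end marker.

    A run still open at the end of the input is discarded.
    """
    best = []
    current = None  # None means we are outside a candidate section
    for line in lines:
        if current is None:
            if start_re.match(line):
                current = [line]
        elif end_re.match(line):
            if len(current) > len(best):
                best = current
            current = None
        else:
            current.append(line)
    return best
```

It would be called with the filing split into lines, for example extract_between(x.splitlines()), after the binary segments and HTML tags have been dealt with. Real filings vary in how the item headings are written, so the patterns would likely need adjusting.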
