0

I am trying to create a Rails app that takes in RSS feeds and displays news stories to a page.

When I fetch the summary for an article, I save it to a string. However, the summary for every story ends with a lot of unnecessary markup. For example:

The Miami Dolphins have suspended a defensive lineman after he allegedly touched women and then took an "aggressive fighting stance" when police attempted to arrest him, according to a probable cause affidavit.<div class="feedflare">
<a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=4W6duenqKrY:kemJFf3BScg:V_sGLiPBpWU" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=4W6duenqKrY:kemJFf3BScg:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/4W6duenqKrY" height="1" width="1"/>

The only line I want to keep in this is:

The Miami Dolphins have suspended a defensive lineman after he allegedly touched women and then took an "aggressive fighting stance" when police attempted to arrest him, according to a probable cause affidavit.

I want to remove everything from the <div class="feedflare"> tag onward, including the tag itself.

I'm stumped on how to do this. If someone could suggest a Ruby string method or a regular expression that I can use, I would greatly appreciate it. I have been stuck on this for quite a while, since I'm a novice at Ruby and regex.

3 Comments
  • Try to use Nokogiri to parse this HTML. Commented Oct 6, 2014 at 21:28
  • Parse it how? I already have this information stored in a string. I just want to get rid of all the markup from the feedflare tag onward. And how would I even do this using Nokogiri? Thanks for the quick response; I would appreciate any other insight you could shed on this problem. Commented Oct 6, 2014 at 21:31
  • Is this in Rails by chance? Commented Oct 6, 2014 at 21:33

2 Answers

2

You tagged this with Rails, so I'm assuming that's what you're using. Rails ships with a great sanitization helper (in Rails 4.2 and later the class is named Rails::Html::FullSanitizer):

[6] pry(main)> HTML::FullSanitizer.new.sanitize('The Miami Dolphins have suspended a defensive lineman after he allegedly touched women and then took an "aggressive fighting stance" when police attempted to arrest him, according to a probable cause affidavit.<div class="feedflare">')
=> "The Miami Dolphins have suspended a defensive lineman after he allegedly touched women and then took an \"aggressive fighting stance\" when police attempted to arrest him, according to a probable cause affidavit."

There are several other helpers that can help with this; take a look at strip_links and strip_tags as well.
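If you only need to drop the FeedBurner block rather than strip every tag, a plain-Ruby cut at the marker may be enough. This is a minimal sketch, not part of the Rails helpers: `cut_at_feedflare` and `MARKER` are hypothetical names, the sample string is shortened with a placeholder URL, and it assumes the marker always appears exactly as in the feed shown in the question.

```ruby
# Hypothetical helper: keep only the text before the FeedBurner block.
# MARKER is assumed to appear exactly as in the sample feed.
MARKER = '<div class="feedflare">'

def cut_at_feedflare(summary)
  # String#partition returns [before, separator, after]; when the marker
  # is absent, the whole string comes back as the first element.
  summary.partition(MARKER).first.strip
end

# Shortened stand-in for a real summary; the URL and image are placeholders.
summary = 'Dolphins suspend lineman.<div class="feedflare">' \
          "\n<a href=\"http://example.com\"><img src=\"x.gif\"></img></a>\n</div>"

puts cut_at_feedflare(summary)  # prints: Dolphins suspend lineman.
```

Because it looks for one literal marker instead of matching tags, a stray `<` in the story text is left alone, and a summary without the marker comes back unchanged.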


0

After a closer look, I realized this simple regular expression is enough. Take a look:

a=%Q[The Miami Dolphins have suspended a defensive lineman after he allegedly touched women and then took an "aggressive fighting stance" when police attempted to arrest him, according to a probable cause affidavit.<div class="feedflare">
<a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=4W6duenqKrY:kemJFf3BScg:V_sGLiPBpWU" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=4W6duenqKrY:kemJFf3BScg:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/4W6duenqKrY" height="1" width="1"/>]

     a.gsub(/\<[^\>]+\>/m, "")
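One detail worth checking with this approach: the tags are removed, but the newlines and spaces that sat between them are not, so the result ends with leftover whitespace. The sketch below uses a shortened stand-in for the summary (the URLs and file names are placeholders) and adds `strip`. It also drops the backslashes and the `/m` flag, which should not be needed here because `<` and `>` are not special in a Ruby regex and a negated character class like `[^>]` already matches newlines.

```ruby
# Shortened stand-in for the feed summary; URLs and file names are placeholders.
sample = %Q[Story text.<div class="feedflare">
<a href="http://example.com/a"><img src="a.gif"></img></a> <a href="http://example.com/b"><img src="b.gif"></img></a>
</div><img src="pixel.gif" height="1" width="1"/>]

# Remove every tag; the whitespace between the tags is left behind.
stripped = sample.gsub(/<[^>]+>/, "")

p stripped        # => "Story text.\n \n"
p stripped.strip  # => "Story text."
```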

1 Comment

It kinda works. It has the potential to greedily find way too much or be tricked by a less-than in the desired content. '1<2 <p>foo</p>'.sub(/\<[^\>]+\>/m, "") # => "1foo</p>"
