Ruby regex or string manipulation

Question

I am trying to create a Rails app that takes in RSS feeds and displays news stories to a page.

When I get the summary for an article, I save it to a string. However, the summary for every story ends with a lot of unnecessary markup. For example:

The Miami Dolphins have suspended a defensive lineman after he allegedly touched women and then took an "aggressive fighting stance" when police attempted to arrest him, according to a probable cause affidavit.<div class="feedflare">
<a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=4W6duenqKrY:kemJFf3BScg:V_sGLiPBpWU" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=4W6duenqKrY:kemJFf3BScg:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/4W6duenqKrY" height="1" width="1"/>

The only line I want to keep in this is:

The Miami Dolphins have suspended a defensive lineman after he allegedly touched women and then took an "aggressive fighting stance" when police attempted to arrest him, according to a probable cause affidavit.

I want to remove everything from the <div class="feedflare"> tag onward, including the tag itself.

I am stumped on how to do this. If someone could provide a Ruby string manipulation method or a regular expression that does this, I would greatly appreciate it. I have been stuck on this for quite a while, since I'm a novice at Ruby and regex.

Parse it how? I already have this information stored in a string. I just want to get rid of all the markup after and including the feedflare tag. And how would I even do this using Nokogiri? Thanks for the quick response; I would appreciate any other insight you could shed on this problem.
– Abhas Arya, Commented Oct 6, 2014 at 21:31

Anthony · Accepted Answer · 2014-10-06 21:40:04Z

2

You tagged rails, so I'm assuming that's what you're using. Rails comes with a great built-in sanitization helper:

[6] pry(main)> HTML::FullSanitizer.new.sanitize('The Miami Dolphins have suspended a defensive lineman after he allegedly touched women and then took an "aggressive fighting stance" when police attempted to arrest him, according to a probable cause affidavit.<div class="feedflare">')
=> "The Miami Dolphins have suspended a defensive lineman after he allegedly touched women and then took an \"aggressive fighting stance\" when police attempted to arrest him, according to a probable cause affidavit."

There are a variety of methods to help with this; take a look at strip_links and strip_tags as well.

Zini · Accepted Answer · 2014-10-06 21:39:48Z

0

I took a closer look and realized that this simple regular expression is enough. Take a look:

a=%Q[The Miami Dolphins have suspended a defensive lineman after he allegedly touched women and then took an "aggressive fighting stance" when police attempted to arrest him, according to a probable cause affidavit.<div class="feedflare">
<a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=4W6duenqKrY:kemJFf3BScg:V_sGLiPBpWU" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"></img></a> <a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=4W6duenqKrY:kemJFf3BScg:gIN9vFwOqvQ"><img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=4W6duenqKrY:kemJFf3BScg:gIN9vFwOqvQ" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/4W6duenqKrY" height="1" width="1"/>]

     a.gsub(/\<[^\>]+\>/m, "")
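To see what this substitution leaves behind, here is a minimal, self-contained sketch on a shortened, made-up summary. The `sample` string and its `example.com` URLs are placeholders rather than real feed data, and the backslash escapes in the pattern are dropped because `<` and `>` need no escaping in a Ruby regex. The newlines between the removed tags survive, so a trailing `strip` is probably wanted as well:

```ruby
# Shortened stand-in for a feed summary (placeholder URLs, not real feed data).
sample = %Q[Story text.<div class="feedflare">\n<a href="http://example.com/a"><img src="http://example.com/b" border="0"></img></a>\n</div><img src="http://example.com/c" height="1" width="1"/>]

# Remove every <...> tag; text between tags (including newlines) is kept.
without_tags = sample.gsub(/<[^>]+>/, "")
p without_tags   # => "Story text.\n\n"

# Trim the whitespace that the removed markup leaves behind.
cleaned = without_tags.strip
p cleaned        # => "Story text."
```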

1 Comment

the Tin Man, over a year ago

It kinda works. It has the potential to greedily find way too much or be tricked by a less-than in the desired content. '1<2 <p>foo</p>'.sub(/\<[^\>]+\>/m, "") # => "1foo</p>"
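The pitfall in the comment above can be sidestepped when the unwanted markup always begins with the same literal marker, as it does in the question. The following is only a sketch under that assumption: `strip_feedflare` and `FEEDFLARE_MARKER` are hypothetical names, and the sample string is made up. It cuts the string at the marker with `String#partition` instead of matching tags with a regex, so a literal `<` in the story text is left alone:

```ruby
# Assumption: the unwanted markup always starts with this exact opening tag.
FEEDFLARE_MARKER = '<div class="feedflare">'

# Hypothetical helper: keep only the text before the first feedflare <div>.
def strip_feedflare(summary)
  # String#partition returns [before, separator, after]. When the marker is
  # absent, "before" is the whole string, so the summary comes back unchanged
  # apart from trailing whitespace.
  before, _marker, _after = summary.partition(FEEDFLARE_MARKER)
  before.rstrip
end

# Made-up summary containing a literal "<" in the text we want to keep.
summary = %Q[Price 1<2 today.<div class="feedflare">\n<a href="http://example.com/"><img src="http://example.com/x" border="0"></img></a>\n</div>]
puts strip_feedflare(summary)   # prints: Price 1<2 today.
```

If the feed ever changes the attribute order or quoting of that tag, the marker will no longer match and the summary will pass through untouched, which is arguably a safer failure mode than deleting too much.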
