
I am trying to pull specific URLs from a webpage based on a CSS selector. I can pull the first one, but I am having difficulty prepending the full domain to the URL and getting more than one URL.

I have tried urljoin and urlparse and run into many issues; I keep getting global name errors with urljoin.
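
For reference, here is roughly what I was attempting with the standard library (a minimal sketch; on Python 2.7, urljoin lives in the urlparse module, so a global NameError like this often just means a missing import):

from urlparse import urljoin  # Python 2.7; in Python 3 this moved to urllib.parse

base = 'http://www.pdga.com/videos/'
relative = '/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'

# Joins the relative path onto the base URL's scheme and domain
print urljoin(base, relative)
# http://www.pdga.com/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman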

Is there a simpler way of doing this?


I am using CentOS 6.5 and Python 2.7.5.

The code below yields the first URL, but without the http://www.pdga.com domain joined to it inline.

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"  # Name of the spider, required value

    start_urls = ["http://www.pdga.com/videos/"]

    # Entry point for the spider
    def parse(self, response):
        SET_SELECTOR = 'tbody'
        for brickset in response.css(SET_SELECTOR):

            HTML_SELECTOR = 'td.views-field.views-field-title a ::attr(href)'
            yield {
                'http://www.pdga.com': brickset.css(HTML_SELECTOR).extract()[0]
            }

Current Output

http://www.pdga.com
/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman

Expected Output

A full list of URLs, each joined with the domain and without any breaks.

I do not have enough reputation points to post a couple of examples.

2 Answers


In order to get absolute URLs from relative links, you can use Scrapy's response.urljoin() method and rewrite your code like this:

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)

        # If page contains link to next page extract link and parse
        next_page = response.xpath('//a[contains(., "next")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_page(self, response):
        # Extract the embedded video URL and strip off the query string
        link = response.xpath('//iframe/@src').extract_first()
        yield {
            'you_tube_link': 'http:' + link.split('?')[0]
        }

# To save the links in CSV format, run in the console: scrapy crawl pdgavideos -o links.csv
# http://www.youtube.com/embed/tYBF-BaqVJ8
# http://www.youtube.com/embed/_H0hBBc1Azg
# http://www.youtube.com/embed/HRbKFRCqCos
# http://www.youtube.com/embed/yz3D1sXQkKk
# http://www.youtube.com/embed/W7kuKe2aQ_c

4 Comments

Thank you both Tiny.D and vold for your quick responses! This is exactly what I was looking to achieve. vold: am I able to output the data without the word 'link' or anything else displayed before the results?
You are welcome. As @Tiny.D already pointed out, Scrapy must return either a new Request, an Item, or a dictionary. If you want to simply output the URL strings in the console, you would be better off using requests with the bs4 or lxml parsers (see the sketch after these comments).
@Thomas I edited my answer to provide more desired output.
Thank you soooooo much vold. If I could give you more points, then I would =D
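
For completeness, that alternative might look something like this (a minimal sketch, assuming the same table markup as the question; the selector is carried over from the question and untested here):

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.pdga.com/videos/')
soup = BeautifulSoup(response.text, 'lxml')

# Print each absolute URL directly, with no dict key in front of it
for a in soup.select('td.views-field-title a'):
    print 'http://www.pdga.com' + a['href']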

Your code returns a dictionary, and that is why the output breaks across lines:

{'http://www.pdga.com': u'/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'}

What you could do is yield the dictionary like this:

yield {
    'href_link':'http://www.pdga.com'+brickset.css(HTML_SELECTOR).extract()[0]
}

This will give you a new dict whose value is the full href with no break:

{'href_link': u'http://www.pdga.com/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'}
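
Putting it together, the whole spider might look something like this (a sketch that keeps your original selector but loops over every matching link, so you get the full list rather than just the first one):

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        # Iterate over every matching link instead of taking only index [0]
        for href in response.css('td.views-field.views-field-title a ::attr(href)').extract():
            # response.urljoin(href), as in the other answer, is the more robust option
            yield {'href_link': 'http://www.pdga.com' + href}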

Note: a Spider callback must return a Request, BaseItem, dict, or None; see the documentation for the parse function.

