
I am trying to pull specific URLs from a webpage based on a CSS selector. I can pull the first one, but I am having difficulty prepending the full domain to the URL and getting more than one URL.

I have tried urljoin and urlparse and run into many issues; I keep getting global name errors with urljoin.
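
For reference, here is roughly what I was attempting with the standard library (a minimal sketch; on Python 2.7, urljoin lives in the urlparse module, so a global NameError like this often just means a missing import):

from urlparse import urljoin  # Python 2.7; in Python 3 this moved to urllib.parse

base = 'http://www.pdga.com/videos/'
relative = '/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'

# Joins the relative path onto the base URL's scheme and domain
print urljoin(base, relative)
# http://www.pdga.com/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman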

Is there a simpler way of doing this?


I am using CentOS 6.5 and Python 2.7.5.

The code below yields the first URL, but without the http://www.pdga.com domain joined to it inline.

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"  # Name of the spider, required value

    start_urls = ["http://www.pdga.com/videos/"]

    # Entry point for the spider
    def parse(self, response):
        SET_SELECTOR = 'tbody'
        for brickset in response.css(SET_SELECTOR):

            HTML_SELECTOR = 'td.views-field.views-field-title a ::attr(href)'
            yield {
                'http://www.pdga.com': brickset.css(HTML_SELECTOR).extract()[0]
            }

Current Output

http://www.pdga.com
/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman

Expected Output

A full list of URLs, each joined with the domain and without any breaks.

I do not have enough reputation points to post a couple of examples.

2 Answers


In order to get absolute URLs from relative links, you can use Scrapy's response.urljoin() method and rewrite your code like this:

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)

        # If page contains link to next page extract link and parse
        next_page = response.xpath('//a[contains(., "next")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_page(self, response):
        # Extract the embedded video URL and strip off the query string
        link = response.xpath('//iframe/@src').extract_first()
        yield {
            'you_tube_link': 'http:' + link.split('?')[0]
        }

# To save the links in CSV format, run in the console: scrapy crawl pdgavideos -o links.csv
# http://www.youtube.com/embed/tYBF-BaqVJ8
# http://www.youtube.com/embed/_H0hBBc1Azg
# http://www.youtube.com/embed/HRbKFRCqCos
# http://www.youtube.com/embed/yz3D1sXQkKk
# http://www.youtube.com/embed/W7kuKe2aQ_c

4 Comments

Thank you both Tiny.D and vold for your quick responses! This is exactly what I was looking to achieve. vold: am I able to output the data without the word 'link' or anything else displayed before the results?
You are welcome. As @Tiny.D already pointed out, Scrapy must return either a new Request, an Item, or a dictionary. If you want to simply output the URL strings in the console, you would be better off using requests with the bs4 or lxml parsers (see the sketch after these comments).
@Thomas I edited my answer to provide more desired output.
Thank you soooooo much vold. If I could give you more points, then I would =D
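
For completeness, that alternative might look something like this (a minimal sketch, assuming the same table markup as the question; the selector is carried over from the question and untested here):

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.pdga.com/videos/')
soup = BeautifulSoup(response.text, 'lxml')

# Print each absolute URL directly, with no dict key in front of it
for a in soup.select('td.views-field-title a'):
    print 'http://www.pdga.com' + a['href']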

Your code returns a dictionary, and that is why the output breaks across lines:

{'http://www.pdga.com': u'/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'}

What you could do is yield the dictionary like this:

yield {
    'href_link':'http://www.pdga.com'+brickset.css(HTML_SELECTOR).extract()[0]
}

This will give you a new dict whose value is the full href with no break:

{'href_link': u'http://www.pdga.com/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'}
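
Putting it together, the whole spider might look something like this (a sketch that keeps your original selector but loops over every matching link, so you get the full list rather than just the first one):

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        # Iterate over every matching link instead of taking only index [0]
        for href in response.css('td.views-field.views-field-title a ::attr(href)').extract():
            # response.urljoin(href), as in the other answer, is the more robust option
            yield {'href_link': 'http://www.pdga.com' + href}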

Note: a Spider callback must return a Request, BaseItem, dict, or None; see the documentation for the parse function.

