C# .NET: Scraping dynamic (JS) websites

Question

After hours of failed attempts, I am asking here. I need to scrape a dynamically generated webpage (built with Vue.js; I would prefer not to share the link).

I have tried multiple approaches (1, 2, 3). None of them works on this webpage.

The most promising solution was using Selenium and PhantomJS. I tried it like this and I'm not sure why it's not even working for Google:

private void button1_Click(object sender, EventArgs e) {
    PhantomJSDriverService service = PhantomJSDriverService.CreateDefaultService();
    service.IgnoreSslErrors = true;
    service.LoadImages = false;
    service.ProxyType = "none";

    var driver = new PhantomJSDriver(service); // I also tried: new PhantomJSDriver();
    driver.Manage().Timeouts().PageLoad = TimeSpan.FromSeconds(10);
    driver.Url = "https://google.com";
    driver.Navigate();

    var source = driver.PageSource;
    textBox1.AppendText(source);
}

This did not work.

I also tried a WebBrowser control, but the page never fully loads.

(EDIT: I found out that WebBrowser just instantiates IE. When I open the target website in standalone IE, the page also never loads completely, so it makes sense to see the same behaviour inside the WebBrowser control. Because of this, I think I am bound to Selenium and PhantomJS.)

Surely this shouldn't be so complicated. How do I do it properly?

The second option seems pretty easy. Embed a browser control, point it to the URL, and take the EmbededBrowser HtmlContent.
– Ross Bush, Commented Jun 18, 2018 at 14:13
I tried that too! The webpage never loads completely (I guess the JS isn't evaluated, because crucial parts of the page's HTML are missing from the source). Maybe I need to configure the browser control somehow?
– c0dehunter, Commented Jun 18, 2018 at 14:15
Maybe some properties are blocking it. Would an ad blocker cause that spinning?
– Ross Bush, Commented Jun 18, 2018 at 14:21
PhantomJS has been deprecated for a while; try using Firefox or Chrome in headless mode.
– j4nw, Commented Jun 18, 2018 at 17:50

ProgrammingLlama · Accepted Answer · 2018-10-24 01:18:09Z

If you need to scrape a website, you can use the ScrapySharp scraping framework. You can add it to a project as a NuGet package: https://www.nuget.org/packages/ScrapySharp/

Install-Package ScrapySharp -Version 2.6.2

It has many useful properties for accessing different elements on the page. For example, to get the entire HTML of the page you can use the following:

    using System;
    using HtmlAgilityPack;
    using ScrapySharp.Network;

    // Fetch the page over HTTP and parse the returned markup.
    ScrapingBrowser browser = new ScrapingBrowser();
    WebPage pageResult = browser.NavigateToPage(new Uri("http://www.example-site.com"));
    HtmlNode rawHtml = pageResult.Html;
    Console.WriteLine(rawHtml.InnerHtml);
    Console.ReadLine();

answered Jun 18, 2018 at 18:45 by ashish; edited Oct 24, 2018 at 1:18 by ProgrammingLlama

3 Comments

Craig Over a year ago

This will not wait until the JavaScript has finished executing.

Joan Vilariño Over a year ago

Any solution for dynamic pages? I get the same result as with new HttpClient().GetStringAsync(url);

Henrique Miranda Over a year ago

ScrapySharp wraps HtmlAgilityPack. It won't execute JavaScript code, so it is not suited for scraping dynamic content.
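
Following j4nw's comment above about replacing the deprecated PhantomJS with a browser in headless mode, a minimal sketch might look like the following. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages, a local Chrome install with a matching chromedriver available, a placeholder URL, and a hypothetical `#app` root element that the Vue.js page renders into. None of these details come from the question, so adjust them to the real page.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

class HeadlessScrapeSketch
{
    static void Main()
    {
        // Run Chrome without a visible window.
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        // 'using' disposes the driver, which also closes the browser process.
        using (var driver = new ChromeDriver(options))
        {
            // Placeholder URL (assumption); replace with the target page.
            driver.Navigate().GoToUrl("https://example.com");

            // Poll for up to 10 seconds until the hypothetical "#app" element exists,
            // as a stand-in for "the page's JavaScript has rendered something".
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.CssSelector("#app")).Count > 0);

            // PageSource returns the browser's current serialization of the page,
            // which should include script-generated markup.
            string html = driver.PageSource;
            Console.WriteLine(html);
        }
    }
}
```

The explicit wait is what addresses Craig's objection: rather than reading the source right after navigation, the code waits for an element that only exists once the scripts have run. Which element to wait for depends on the page, so the selector here is only illustrative.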
