
I'm using Python to scrape data from a Japanese website that is available in both English and Japanese.

The problem is that I get the data I need, but in the wrong language (the URL is identical for both languages). I inspected the HTML of each version and saw that the 'lang' attribute of the html element differs:

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">

<html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja" class="">

Here is the code I used:

import requests
import lxml.html as lh
import pandas as pd

url='https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
col = []
i = 0

for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print("{}".format(name))
    col.append((name,[]))

At this point I get the header row of the table from the page, but in the Japanese version. I'm new to Python and to web scraping. Is there a way to get the data in English? Pointers to existing examples, templates, or other resources would also be welcome.

Thanks in advance!

  • Welcome to SO. Have you tried setting a cookie? The English version of the website sets one with Set-Cookie: SFCM01LANG=en; Commented Oct 18, 2020 at 19:16

1 Answer

I visited the website you linked. For the English version it sets a cookie: open the Network tab of your browser's developer tools and look at the response headers for the request to https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0. You will see:
Set-Cookie: SFCM01LANG=en; Max-Age=63072000; Expires=Tue, 18-Oct-2022 19:14:29 GMT; Path=/

So I simply sent that cookie with the request. Change your code snippet to this:

import requests
import lxml.html as lh
import pandas as pd

url='https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0'
page = requests.get(url, cookies={'SFCM01LANG':'en'})
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
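
If you would rather not retype the cookie by hand, here is a minimal sketch that turns a Set-Cookie value copied from the Network tab into the dict that requests expects. It uses only the standard-library http.cookies module; the helper name cookies_from_set_cookie is made up for illustration, and the header string is the one quoted above.

```python
from http.cookies import SimpleCookie


def cookies_from_set_cookie(header_value: str) -> dict:
    """Turn one Set-Cookie header value into a {name: value} dict.

    Attributes such as Max-Age, Expires and Path are parsed by
    SimpleCookie but dropped here, because requests only needs
    the cookie name and value.
    """
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


# Header value copied from the browser's Network tab (see above).
raw = "SFCM01LANG=en; Max-Age=63072000; Expires=Tue, 18-Oct-2022 19:14:29 GMT; Path=/"
lang_cookie = cookies_from_set_cookie(raw)
print(lang_cookie)
```

The resulting dict can then be passed as requests.get(url, cookies=lang_cookie), which should behave the same as the hard-coded {'SFCM01LANG':'en'} in the snippet above.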

2 Comments

Can you please explain how you found that language parameter? How would that apply to other websites?
This is already written in the answer: look at the headers for Request URL: https://data.j-league.or.jp/SFMS01/search?team_ids=33&home_away_select=0 in the Network tab.
