
Going through the AWS Glue docs I can't see any mention of how to connect to a Postgres RDS via a Glue job of "Python shell" type. I've set up an RDS connection in AWS Glue and verified I can connect to my RDS instance. Also, when creating the Python job I can see my connection, and I've added it to the script.

How do I use the connection which I've added to the Glue job to run some raw SQL?

Thanks in advance,

  • Did you have any luck with it? Commented Jun 12, 2019 at 14:06

1 Answer


There are two possible ways to access data from RDS in Glue ETL (Spark):

1st Option:

  • Create a Glue connection on top of the RDS instance.
  • Create a Glue crawler on top of the connection created in the first step.
  • Run the crawler to populate the Glue Data Catalog with a database and table pointing to the RDS tables.
  • Create a dynamic frame in the Glue ETL script using the newly created database and table in the Glue Data Catalog.

Code sample:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
# Read the RDS table through the Glue Data Catalog database/table created by the crawler
DyF = glueContext.create_dynamic_frame.from_catalog(database="{{database}}", table_name="{{table_name}}")
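
If the goal is to run raw SQL once the data has been loaded, the DynamicFrame can be converted to a Spark DataFrame and queried with Spark SQL. A minimal sketch, assuming the glueContext and DyF variables from the sample above (the view name is arbitrary); note the SQL runs in Spark over the loaded data, not inside Postgres:

# Convert the DynamicFrame to a Spark DataFrame and expose it to Spark SQL
df = DyF.toDF()
df.createOrReplaceTempView("my_rds_table")  # arbitrary view name

spark = glueContext.spark_session
spark.sql("SELECT count(*) FROM my_rds_table").show()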

2nd Option:

Create a DataFrame using Spark's JDBC reader:

url = "jdbc:postgresql://<rds_host_name>/<database_name>"
properties = {
"user" : "<username>",
"password" : "<password>"
}
df = spark.read.jdbc(url=url, table="<schema.table>", properties=properties)
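
If the aim is to push a raw SELECT down to Postgres rather than read the whole table, Spark's JDBC reader also accepts a parenthesized, aliased subquery in place of a table name. A sketch assuming the same url and properties as above; the query itself is only an illustration:

# Run an arbitrary SELECT on the Postgres side and read the result back as a DataFrame
raw_sql = "(SELECT id, name FROM <schema.table> WHERE id > 100) AS src"
df_filtered = spark.read.jdbc(url=url, table=raw_sql, properties=properties)
df_filtered.show()

This only covers SELECT statements; DDL such as CREATE cannot be executed through the JDBC reader.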

Notes:

  • You will need to pass the Postgres JDBC driver jar to Spark in order to create the DataFrame through the JDBC reader (see the sketch below).
  • I have tried the first method on Glue ETL and the second method on a Python shell (dev endpoint).
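
One way to make the driver available is to point Spark at a downloaded driver jar when the SparkSession is built. A sketch only; the jar path and version below are placeholders, and on a dev endpoint the jar can instead be passed with spark-submit --jars as mentioned in the comments (in a Glue Spark job it can be supplied through the job's dependent JARs path / --extra-jars setting):

from pyspark.sql import SparkSession

# Placeholder path: point this at wherever the Postgres JDBC jar was downloaded
spark = (
    SparkSession.builder
    .config("spark.jars", "/path/to/postgresql-42.2.x.jar")
    .getOrCreate()
)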

4 Comments

I want to be able to execute raw SQL queries, such as CREATE .... From my understanding, that's not possible in the above case. :/
@Harsh "You will need to pass the Postgres JDBC driver jar" - how would I do this?
@t_warsop: You will need to SSH to the dev endpoint, download the Postgres JDBC jar, and pass it with your spark-submit command. I couldn't figure out a better way for dev endpoints.
@mcm: You can use Spark's SQLContext to execute the CREATE command: sqlContext.sql(query).
