
I am facing this issue when creating a dataframe with an Azure Synapse SQL dedicated pool as the data source. Some of the columns have numerical names such as "240". I used the synapsesql connector in Scala and then pulled the dataframe into a PySpark dataframe using spark.sql. Even though I can print the schema of the dataframe without any problems, trying to select any of the columns with numerical names produces an error.

The error has something to do with empty aliases that correspond to column names with special characters. I have not been able to figure out whether this is a Spark issue or whether it has something to do with Synapse Analytics as the data source.

%%spark
val df = spark.read.
    option(Constants.SERVER, "db.sql.azuresynapse.net").
    synapsesql("DWH.table")

df.createOrReplaceTempView("table")

df_p = spark.sql("SELECT * FROM table")
df_p.select('240').show()
df_p.printSchema()

I understand that I should use backticks when working with column names that contain illegal characters, but the following produces the same error:

df_p = spark.sql("SELECT * FROM table")
df_p.select('`240`').show()
df_p.printSchema()

The error produced:

Py4JJavaError: An error occurred while calling o204.showString.
: com.microsoft.spark.sqlanalytics.exception.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: An object or column name is missing or empty. For SELECT INTO statements, verify each column has a name. For other statements, look for empty alias names. Aliases defined as "" or [] are not allowed. Change the alias to a valid name.

Could someone let me know why I end up with the error?

Thank you!

1 Answer


For the Read API, token-based authentication to a dedicated SQL pool outside of the workspace is currently not supported by the connector. SQL authentication is required:

val df = spark.read.
    option(Constants.SERVER, "samplews.database.windows.net").
    option(Constants.USER, <SQLServer Login UserName>).
    option(Constants.PASSWORD, <SQLServer Login Password>).
    synapsesql("<DBName>.<Schema>.<TableName>")

Create a temp view from the dataframe in PySpark, then run a Scala cell in the PySpark notebook using magics:

pyspark_df.createOrReplaceTempView("pysparkdftemptable")

%%spark
val scala_df = spark.sqlContext.sql("select * from pysparkdftemptable")

Then you can select the column using select() and show():

df_p.select("240").show()
df_p.printSchema()

If you are still facing the error, restart the session and try again.

The same select works in my test session (screenshot omitted).
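If selecting the digit-only column still fails with the empty-alias error, one possible workaround is to rename such columns right after loading so that any SQL the connector generates uses valid aliases. This is a minimal, untested sketch; the col_ prefix and the col_240 name are made up for illustration:

# Prefix digit-only column names (e.g. "240" -> "col_240"); all other columns keep their names
df_renamed = df_p.toDF(*[f"col_{c}" if c.isdigit() else c for c in df_p.columns])

df_renamed.select("col_240").show()
df_renamed.printSchema()

toDF takes the full list of column names, so only the digit-only columns are renamed.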


2 Comments

Thank you for your answer! I think this does not directly address the problem. I have no trouble reading the data into a dataframe from the SQL pool with a Scala cell in the PySpark notebook. My dataframe also has columns without any special characters, and operations with those columns work as expected. The example provided works in my notebook, so I guess this has something to do with how Spark interprets these particular column names in this case.
Thank you for the update. As per my understanding of the question, you weren't able to select the numerical column names, but yes, I will look more into how Spark interprets these particular column names in this case.
