Please pardon my newbie-ness in this thread. I've had to figure out on my own how to create a test environment for a Spark server my company is using to handle data. I've jumped through various forums and have had to install Python itself using Scoop via PowerShell, set up PyCharm and Spark itself, install and configure Hadoop, and probably a few other things. It's been a whirlwind of a few days.
So I've successfully done all the things listed above and am down to just testing out some scripting work. Here's the code:
Python:
import pyodbc
import pandas as pd
import numpy as np
import pyspark
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql.functions import *
appName = "PySpark SQL Server Example - via ODBC"
master = "local"
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession
###########################################################################################################
# File Location and pulls
# Update the file location where noted below
# Update the query to connect to the proper sheet in Excel and to select any fields needed
# to properly create the rule. Should help with runtime.
###########################################################################################################
conn_str = (
    r'DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'
    r'DBQ=C:\Users\DASaba\OneDrive - Ruffalo Noel Levitz\Desktop\DSC Load Files\filedata_model.xlsx;'  # Update file name and location here
    r'ReadOnly=0'
)
cnxn = pyodbc.connect(conn_str, autocommit=True)
sql = pd.read_sql_query('''select [_ID], [S_ID], [H_ID] from [EMSearchVendor$]''', cnxn)
# Fill NaN with empty strings in the non-numeric columns so Spark
# doesn't trip over mixed-type columns when inferring the schema
for col in sql.columns:
    if ((sql[col].dtypes != np.int64) &
            (sql[col].dtypes != np.float64)):
        sql[col] = sql[col].fillna('')
inboundDF = spark.createDataFrame(sql)
#inboundDF.printSchema()
newDF = inboundDF.withColumn("H_ID", col("H_ID"))
newDF.show()
Please pardon any stupidity in the first part of the code. I'm sure there's a better way to handle reading in the data set and putting it into a Spark dataframe, but this works(?) well enough for now.
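For what it's worth, here's the kind of simpler read-in I've been meaning to try instead of the ODBC route. This is just a sketch under my own assumptions: that pandas' read_excel (with an Excel engine like openpyxl installed) can open the same workbook, and that the sheet behind the [EMSearchVendor$] table is literally named 'EMSearchVendor':
Python:
import pandas as pd

# Hypothetical alternative: skip pyodbc and let pandas read the workbook directly
df = pd.read_excel(
    r'C:\Users\DASaba\OneDrive - Ruffalo Noel Levitz\Desktop\DSC Load Files\filedata_model.xlsx',
    sheet_name='EMSearchVendor',
    usecols=['_ID', 'S_ID', 'H_ID'],
)
# Same NaN cleanup as above, but only on the object (string) columns
df = df.fillna({c: '' for c in df.columns if df[c].dtype == object})
inboundDF = spark.createDataFrame(df)  # 'spark' is the session created earlier
But that's for later; the ODBC version above does load the data.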
The problem is with this line of code:
Python:
newDF = inboundDF.withColumn("H_ID", col("H_ID"))
I'm getting "TypeError: 'str' object is not callable" when I run this.
Right now all this should do is... basically nothing. It's shoving the H_ID column back into itself. I had something mildly more complex in there, but it was giving the same error. I figured something I did was causing this to bomb, but now that it's stripped down to nothing and I'm still getting the error, I'm at a loss.
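To sanity-check my expectations, here's a minimal standalone sketch of the no-op I thought that line would perform (toy data and column names, nothing from my real file):
Python:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Toy session and dataframe just to demonstrate the intended no-op
spark = SparkSession.builder.master("local").appName("withColumn-noop").getOrCreate()
toyDF = spark.createDataFrame([(1, "a"), (2, "b")], ["H_ID", "S_ID"])

# Overwrite H_ID with itself -- output should be identical to the input
toyDF.withColumn("H_ID", col("H_ID")).show()
As far as I can tell, that's exactly what my line does, just with real data behind it.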
I'm sure it's something dumb. I've checked various other forums to figure out if I'm doing something syntactically incorrect, but have found nothing definitive. Appreciate your patience and assistance!