Big Data Machine Learning with SparkML — Part 1 — Simple Setup
In this blog I would like to show the basics of setting up Spark in Google Colab and running a simple Linear Regression model using Spark ML.
Before we jump in, I would like to take a minute to explain why another ML library is worth learning. With TensorFlow, PyTorch, and scikit-learn already available, why do we need one more? One simple answer: “Spark”.
We all know and love Spark for making big data processing easy, especially with Spark SQL, so wouldn’t it be better if we could use the same distributed framework for ML too? We can, using Spark ML. For most simple tasks, not having to install anything extra and Spark’s ability to run in a distributed, parallel way make it worth a look.
Before we dive in, we need to install Spark in Colab. I am specifically talking about PySpark. Here are the steps to do this in Colab.
1. Install OpenJDK 8 on Colab
!sudo apt install openjdk-8-jdk
2. Pip install PySpark
!pip install pyspark
3. Set the JAVA_HOME environment variable
import os
java8_location= '/usr/lib/jvm/java-8-openjdk-amd64'
os.environ['JAVA_HOME'] = java8_location
4. Create a SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql import DataFrame

spark = SparkSession.builder.appName('Final') \
    .config("spark.driver.memory", "10g") \
    .config("spark.executor.memory", "10g") \
    .getOrCreate()
5. Import the required Spark ML modules
#Spark ML imports
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
With the above steps we have Spark set up on Google Colab and are ready for our machine learning tasks.
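If you want a quick sanity check that the session actually works, something like the following (an optional check, not part of the original steps) should print the Spark version and run a tiny job:

#Optional sanity check: confirm the SparkSession is live and can run a small job
print(spark.version)
spark.range(5).show()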
For the ML task, I want to keep this first part simple and run a plain Linear Regression model. We are going to use the Boston Housing dataset from Kaggle. I have the dataset downloaded as a CSV file, so this is how we load the data into Spark.
data = spark.read.csv('/content/drive/My Drive/train.csv',header=True,inferSchema=True)
Once we have the data in Spark we can check whether it loaded correctly by looking at the schema.
data.printSchema()
We see that all of our data is loaded correctly, at least as far as the data types go.
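Beyond the schema, a couple of optional checks can confirm the data itself came through, for example the row count and a peek at the first few rows:

#Optional: row count and the first few rows of the loaded data
print(data.count())
data.show(5)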
Now coming to the Spark ML specific part: Spark ML expects all the feature values of a row to be assembled into a single vector column, conventionally called ‘features’. To do this we use the VectorAssembler we imported earlier.
#All the features columns as a list
features_cols = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX',
                 'PTRATIO','B','LSTAT']

#Vector Assembler to put these all in one column
VA = VectorAssembler(inputCols= features_cols,outputCol='features')
This creates a column called ‘features’ that puts together all the features.
VA_Boston = VA.transform(data)
VA_Boston.show()
This shows all the columns including the newly created ‘features’ column.
Since we only need the features column and the target column, ‘MEDV’, we can use the select function to keep just those two.
VA_Boston_final = VA_Boston.select(['features','MEDV'])
Then we do the regular train & test split. I used 70/30 split for this.
train,test = VA_Boston_final.randomSplit([0.70,0.30])
Finally, the Linear Regression model is instantiated with labelCol set to ‘MEDV’. Then we fit the model on the train split.
lr = LinearRegression(labelCol='MEDV')
lr_model = lr.fit(train)
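Once fitted, the model exposes the learned coefficients and intercept, plus a training summary with metrics on the train split. This is optional inspection, not required for the evaluation below:

#Optional: inspect the fitted model
print(lr_model.coefficients)  #one weight per entry in the 'features' vector
print(lr_model.intercept)
print(lr_model.summary.rootMeanSquaredError)  #metrics on the train split
print(lr_model.summary.r2)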
Now we can evaluate the model performance on the test data.
test_results = lr_model.evaluate(test)
rmse = test_results.rootMeanSquaredError
r2 = test_results.r2
print(f'RMSE = {rmse}, r2 = {r2} ')
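If you also want row-level predictions rather than just the aggregate metrics, you can transform the test set; Spark ML adds a ‘prediction’ column by default:

#Optional: per-row predictions on the test split
predictions = lr_model.transform(test)
predictions.select('MEDV','prediction').show(5)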
We get an RMSE of 4 and an R2 of 73.67%. Please do note this is literally out-of-the-box model performance, without any EDA or scaling of the features. We can certainly improve on this, and that will be the topic of my next blog.
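Just to give a rough idea of what feature scaling could look like, here is a sketch (not the full treatment saved for the next post) that standardizes the assembled features with StandardScaler and fits the same model inside a Pipeline; the ‘scaled_features’ column name is just illustrative:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler

#Sketch: scale the 'features' vector, then fit Linear Regression on the scaled column
scaler = StandardScaler(inputCol='features', outputCol='scaled_features', withMean=True, withStd=True)
lr_scaled = LinearRegression(featuresCol='scaled_features', labelCol='MEDV')
pipeline = Pipeline(stages=[scaler, lr_scaled])
pipeline_model = pipeline.fit(train)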
Happy reading !!!