Big Data Machine Learning with SparkML — Part 2 — Ensemble Methods

Sailaja Karra
Nov 18, 2020

In my previous blog I showed how to set up Apache Spark in Google Colab and use Spark ML to run distributed machine learning. In this blog I am going to discuss how to improve the performance of Spark machine learning models using ensemble methods.

Before we jump in, here is a quick take on what ensemble methods do: we train several trees and then combine their predictions to arrive at the final result. The diagram below shows exactly how this is done.
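To make the idea concrete, here is a minimal plain-Python sketch of combining trees by averaging. This is not Spark code; the three "trees" are hypothetical stand-in functions, not learned regression trees:

```python
# Toy illustration: three stand-in "trees", each a simple rule.
# A real forest would use learned regression trees instead.
def tree_a(x):
    return 2.0 * x          # roughly right

def tree_b(x):
    return 2.0 * x + 1.0    # biased high

def tree_c(x):
    return 2.0 * x - 1.0    # biased low

def ensemble_predict(x, trees):
    # The ensemble prediction is the mean of the individual predictions,
    # so the individual biases partly cancel out.
    return sum(t(x) for t in trees) / len(trees)

print(ensemble_predict(3.0, [tree_a, tree_b, tree_c]))  # averages 6.0, 7.0, 5.0
```

Averaging many imperfect trees tends to cancel out their individual errors, which is the intuition behind a random forest.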

In Spark ML we have two main ensemble algorithms: Gradient Boosted Trees (GBT) and Random Forests. Even though both are tree-based models, we will see later that the Random Forest model not only trains faster (its trees are trained in parallel, whereas GBT trains one tree at a time) but can also improve accuracy, as it is less prone to overfitting.
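Why can't GBT trees be trained in parallel? Because each tree is fit to the residuals left by the trees before it. Here is a hypothetical plain-Python sketch of that sequential dependency (not the Spark implementation; the "tree" is just a constant predictor fit to the mean):

```python
targets = [3.0, 5.0, 7.0]

def fit_stump(values):
    # Stand-in for fitting a regression tree: just predict the mean.
    m = sum(values) / len(values)
    return lambda _x: m

# Boosting: each stage is fit to the residuals left by the previous stages,
# so stage n cannot start until stage n-1 is finished.
stages, residuals = [], targets[:]
for _ in range(3):
    stump = fit_stump(residuals)
    stages.append(stump)
    residuals = [r - stump(None) for r in residuals]

# The boosted prediction is the sum of all stages' outputs.
boosted = sum(s(None) for s in stages)
print(boosted)
```

In a random forest, by contrast, each tree is fit independently to a bootstrap sample of the data, so all trees can be trained at the same time.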

Using the same Boston housing dataset from last week, we will run both of these models.

This is what we had last week in our training data.

train.show()

We can now take the ‘features’ column we built using VectorAssembler and use that. But we cannot feed it to these models directly; we first have to convert the rows into LabeledPoint objects. We do that using the following code.

from pyspark.mllib.regression import LabeledPoint

rdd = train.rdd.map(lambda row: LabeledPoint(row['MEDV'],
                                             row['features'].toArray()))
rdd_test = test.rdd.map(lambda row: row['features'].toArray())

Let us first look at GradientBoostedTrees.

First we import the required libraries

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel

Second we fit the model to our training data

gbt_model = GradientBoostedTrees.trainRegressor(
    rdd, categoricalFeaturesInfo={}, numIterations=100)

We run our predictions

gbt_predictions = gbt_model.predict(rdd_test)
gbt_labelsAndPredictions = test.rdd.map(
    lambda row: row.MEDV).zip(gbt_predictions)

Finally we calculate the RMSE for these predictions

import math

gbt_testMSE = gbt_labelsAndPredictions.map(
    lambda row: (row[0] - row[1]) * (row[0] - row[1])).sum() \
    / float(test.rdd.count())
print('Test Root Mean Squared Error = ' + str(math.sqrt(gbt_testMSE)))
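The RMSE computed above is just the standard formula: the square root of the mean of the squared label/prediction differences. A quick plain-Python check on made-up pairs (hypothetical numbers, not from the Boston data) makes the formula easy to verify:

```python
import math

# Hypothetical (label, prediction) pairs, standing in for the zipped RDD.
labels_and_preds = [(3.0, 2.5), (0.5, 0.0), (2.0, 2.0), (7.0, 8.0)]

# Mean squared error: average of squared differences.
mse = sum((y - p) ** 2 for y, p in labels_and_preds) / len(labels_and_preds)

# Root mean squared error: square root of the MSE.
rmse = math.sqrt(mse)
print(rmse)
```

This mirrors what the Spark code does, with the list comprehension playing the role of the `map(...).sum()` over the RDD.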

Now let's try the same thing with the RandomForest model.

from pyspark.mllib.tree import RandomForest

model_RandomForest = RandomForest.trainRegressor(
    rdd, categoricalFeaturesInfo={},
    numTrees=30, featureSubsetStrategy="auto",
    impurity='variance', maxDepth=4, maxBins=32)

Once the model is trained on the training rdd we can use this for predictions.

predictions = model_RandomForest.predict(rdd_test)

Here are the final metrics

rf_labelsAndPredictions = test.rdd.map(
    lambda row: row.MEDV).zip(predictions)
rf_testMSE = rf_labelsAndPredictions.map(
    lambda row: (row[0] - row[1]) * (row[0] - row[1])).sum() \
    / float(test.rdd.count())
print('Test Root Mean Squared Error = ' + str(math.sqrt(rf_testMSE)))

As you can see, this gives us a much better RMSE than the GBT model, even though we trained the Random Forest with only 30 trees.

Hope you enjoyed reading about ensemble methods in Spark.

Happy reading!
