Large-Scale Time Series Forecasting in Apache Spark

Date:

Accurately forecasting power demand is important for securing energy supply. Time series forecasting methods and other machine learning algorithms can be used to create energy forecasts. We have developed a forecasting framework based on multi-model approach at customer account level. The framework uses a wide range of algorithms (e.g. GLM, ElasticNet, Seasonal ARIMA-X, Decision Tree, Random Forest and Gradient Boosting Machine). Models are pre-trained on AWS EMR cluster using Spark/SparklyR. The process is run at massively parallel scale (>3000 vCores). Once the model training algorithm has completed, the model objects are persisted on AWS S3 so that they can be reused at a later date. To trigger a forecast, the deploy pipeline will load the pre-trained model object from S3 and create a forecast based on the prevailing inputs. The output is stored as partitioned parquet files on S3, which can be converted into table view through AWS Athena.

Enterprise Application of the R Language

Slides can be downloaded here.