Getting Ready for Sitecore Cortex: A Sitecore Architect's Intro to Machine Learning in Sitecore
2017-11-10
Sitecore Symposium introduced us to Sitecore Cortex, the new machine learning (ML) component coming in version 9.1. This will be yet another shift for Sitecore developers and architects, as they will need to interact with an even broader cross-functional team. In this post, I cover a VERY high-level introduction to the machine learning development process and some tips on tooling.
Introduction to the Machine Learning Development Process
ML is a tool that helps you streamline processes and drive results, but it is just a tool. What is more important to understand is the process of deploying and running an ML experiment.
Step 1: Frame your problem
This starts out as a business goal or statement anchored in the realities of the problem context. Critical to the success of this first step is involving people with domain expertise to avoid false assumptions. You need to establish a baseline so you can evaluate whether there is an opportunity. Finally, you need to reframe the problem in the context of ML (more on that in a later post).
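One simple way to establish a baseline for a classification-style problem is to ask how well you would do by always predicting the most common outcome. The sketch below is in Python purely for illustration (Cortex itself is described as using R and CNTK), and the function name and the "converted" / "not_converted" labels are made up for this example.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Hypothetical outcomes: 2 visitors converted, 8 did not.
labels = ["converted"] * 2 + ["not_converted"] * 8
baseline_label, baseline_accuracy = majority_baseline_accuracy(labels)
print(baseline_label, baseline_accuracy)  # not_converted 0.8
```

Any model that cannot beat this number on the same data is probably not worth deploying, which is one way to judge whether there is an opportunity at all.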
Step 2: Data handling
Can you get the data and is it good data? Some technical help is often required to retrieve and format data, but you will still want to involve staff with domain expertise to evaluate the quality of the data.
Once satisfied with your data, you will need to divide your data into sets for training, development and verification of your model.
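As a rough illustration of that split, here is a minimal Python sketch (not R, and not anything Cortex-specific). The 70/15/15 proportions, the fixed seed and the function name are assumptions chosen for the example; the right proportions depend on how much data you have.

```python
import random


def split_dataset(rows, train_frac=0.7, dev_frac=0.15, seed=42):
    """Shuffle rows and split them into training, development and verification sets."""
    if train_frac <= 0 or dev_frac < 0 or train_frac + dev_frac >= 1:
        raise ValueError("fractions must be positive and leave room for verification")
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    verify = shuffled[n_train + n_dev:]
    return train, dev, verify


# Stand-in data: 100 row identifiers.
train, dev, verify = split_dataset(range(100))
print(len(train), len(dev), len(verify))  # 70 15 15
```

Shuffling before splitting matters when the source data is ordered (for example by date), although for time-dependent problems a chronological split may be more appropriate.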
Step 3: Modeling
Here comes the “fun” part. Cortex will leverage R and CNTK to produce ML models. This is where knowledge of statistics and machine learning approaches becomes key. As a developer, you will likely be helping data scientists working in R/CNTK integrate their work into Cortex. They may also need your help as they iterate over the model and try different approaches to feature engineering.
Step 4: Application (Launch)
Ideally, you launch the ML solution to a problem with an A/B test. This is the only true test of performance. Learn. Repeat.
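To make the A/B test concrete, one common way to compare two variants is a two-proportion z-test on their conversion rates. The Python sketch below is an illustration under that assumption, not a description of how Cortex or Sitecore evaluates tests; the function name and the visitor counts are invented for the example.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("cannot test when the pooled rate is 0 or 1")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: control (A) vs. the ML-driven variant (B).
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be chance alone, but sample size and test duration should be decided before the test starts rather than after peeking at results.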
Because you will be using Cortex APIs and may need to write code to deploy ML models, you will need to account for that in your DevOps and deployment strategies.
Tools
Cortex will use the R language (https://www.r-project.org/about.html) and CNTK (https://github.com/Microsoft/CNTK) as the underlying tools to express the ML models.
There are lots of tutorials online that cover both, so here is a short rundown of what you need to set up to experiment with R.
- R Tools for Visual Studio (RTVS) extends the VS IDE to include full R support. Download it from https://www.visualstudio.com/vs/rtvs/. The open source alternative is RStudio at https://www.rstudio.com/.
- Microsoft has created R Server. I suspect we will see this in Sitecore, so I’ve installed it as well. R Tools for Visual Studio will auto-connect to it. The major benefit of the server is increased performance. Download it from https://docs.microsoft.com/en-us/machine-learning-server/install/machine-learning-server-windows-install.
- R samples for Visual Studio and R Server can be downloaded from this GitHub repo: https://github.com/Microsoft/RTVS-docs
Once you have the above up and running, you can explore the examples. The flight delay prediction example is a good place to start. It is a simple classification problem that attempts to determine if a flight will be on time or not.
If you would like to look directly at the example, you can find it on GitHub.
The R plugin for Visual Studio includes IntelliSense, debugging and everything you might expect from an IDE. Help for R functions is also included (a nice feature since I have not looked at R since my graduate degree). Just put the cursor on the command and hit F1.
One final warning before you jump into this: the CPU and RAM utilization for these examples can be hefty. My i7 was running at 50-90% CPU for many of these examples, and RAM usage peaked at 6 GB.