1. Differences Caused by the Learning Algorithm
You can get different results when you run the same algorithm on the same data due to the nature of the learning algorithm.
This is the most likely reason that you’re reading this tutorial.
You run the same code on the same dataset and get a model that makes different predictions or has different performance each time, and you suspect it's a bug. Am I right?
It’s not a bug, it’s a feature.
Some machine learning algorithms are deterministic, just like most of the programming you're used to. That means that when the algorithm is given the same dataset, it learns the same model every time. Examples include linear regression and logistic regression.
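As a minimal sketch of deterministic behavior (using scikit-learn and a synthetic dataset, both chosen here purely for illustration), repeated fits of a linear regression on the same data recover exactly the same model:

```python
# A sketch of a deterministic learner: repeated fits on the same
# data recover exactly the same coefficients every time.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# one fixed synthetic dataset
X, y = make_regression(n_samples=100, n_features=5, random_state=1)

for run in range(3):
    model = LinearRegression().fit(X, y)
    # the printed coefficient is identical on every run
    print(f"run {run}: first coefficient = {model.coef_[0]:.10f}")
```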
Some algorithms are not deterministic; instead, they are stochastic. This means that their behavior incorporates elements of randomness.
Stochastic does not mean random. Stochastic machine learning algorithms are not learning a random model; they are learning a model conditional on the historical data you have provided. Rather, the specific small decisions made by the algorithm during the learning process can vary randomly.
The impact is that each time the stochastic machine learning algorithm is run on the same data, it learns a slightly different model. In turn, the model may make slightly different predictions, and when evaluated using error or accuracy, may have a slightly different performance.
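You can see this directly with a sketch like the following (again assuming scikit-learn; the neural network and synthetic dataset are arbitrary choices for illustration). Fitting the same stochastic algorithm on the same data several times gives a slightly different score each run:

```python
# A sketch of a stochastic learner: the random initial weights
# differ on each run, so the learned model and its score vary.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# one fixed synthetic dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

for run in range(3):
    model = MLPClassifier(max_iter=500)  # no random_state set
    model.fit(X, y)
    # training accuracy differs slightly between runs
    print(f"run {run}: training accuracy = {model.score(X, y):.4f}")
```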
Adding randomness to some of the decisions made by an algorithm can improve performance on hard problems. Learning a supervised learning mapping function with a limited sample of data from the domain is a very hard problem.
2. Differences Caused by the Evaluation Procedure
The two most common evaluation procedures are a train-test split and k-fold cross-validation.
A train-test split involves randomly assigning each row either to the set used to train the model or to the set used to evaluate it, according to a predefined split size.
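A minimal sketch with scikit-learn's train_test_split (the dataset is a synthetic stand-in):

```python
# A sketch of a train-test split: each call assigns rows at random,
# so repeated calls produce different splits unless random_state is set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=1)

# hold back 33 percent of the rows for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape)
```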
The k-fold cross-validation procedure involves dividing a dataset into k non-overlapping partitions and using one fold as the test set and all other folds as the training set. A model is fit on the training set and evaluated on the holdout fold, and this process is repeated k times, giving each fold an opportunity to be used as the holdout fold.
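A corresponding sketch with scikit-learn's KFold and cross_val_score (the logistic regression and synthetic dataset are again illustrative choices):

```python
# A sketch of 10-fold cross-validation: with shuffle=True the
# assignment of rows to folds is random, so scores can change
# from run to run unless a seed is fixed.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=1)

kfold = KFold(n_splits=10, shuffle=True)
scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)
print(f"mean accuracy: {scores.mean():.4f}")
```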
Both of these model evaluation procedures are stochastic.
Again, this does not mean that they are random; it means that small decisions made in the process involve randomness, specifically the choice of which rows are assigned to a given subset of the data.
This use of randomness is a feature, not a bug.
The use of randomness, in this case, allows the resampling to approximate an estimate of model performance that is independent of the specific data sample drawn from the domain. This approximation is biased because we only have a small sample of data to work with rather than the complete set of possible observations.
Performance estimates provide an idea of the expected or average capability of the model when making predictions in the domain on data not seen during training, regardless (at least ideally) of the specific rows of data used to train or test the model.
3. Differences Caused by the Development Environment
This includes:
- Differences in the system architecture, e.g. CPU or GPU.
- Differences in the operating system, e.g. macOS or Linux.
- Differences in the underlying math libraries, e.g. LAPACK or BLAS.
- Differences in the Python version, e.g. 3.6 or 3.7.
- Differences in the library version, e.g. scikit-learn 1.3.1 or 1.3.2.
- …
Machine learning algorithms are a type of numerical computation.
This means that they typically involve a lot of math with floating point values. Differences in aspects such as the system architecture and operating system can result in differences in rounding errors, which can compound across the many calculations performed to give very different results.
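As a tiny, self-contained illustration of such rounding behavior (this is standard IEEE 754 floating point, not specific to any one environment):

```python
# Floating point addition is not associative: the same numbers
# grouped differently round to different results, and such tiny
# discrepancies can compound over millions of operations.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a)       # 0.6000000000000001
print(b)       # 0.6
print(a == b)  # False
```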
Additionally, differences in library versions can mean bug fixes and changes in functionality, which can also lead to different results.
This also explains why you will get different results for the same algorithm on the same machine when it is implemented in different languages, such as R and Python. Small differences in the implementation and/or in the underlying math libraries used will cause differences in the resulting model and in the predictions made by that model.
Honestly, the effect is often very small in practice (at least in my limited experience) as long as major software versions are a good or close enough match.