I have been hit by this issue multiple times, going back to the early years when individual algorithms such as C4.5 or RIPPER were distributed as C source code. When integrated platforms such as Weka (and later scikit-learn) arrived, their re-implementations of the old algorithms (e.g., JRip) never reproduced the behavior of the original C programs.
Why? In a few cases, while migrating a production system, I tracked down the differences for specific datasets. They boiled down to:
- Different default parameters.
- Different capabilities for automatically handling undefined (missing) values.
- Feature representation issues (e.g., whether set-valued or categorical features can be handled directly).
- More obscure issues (floating-point precision, etc.).
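As a minimal sketch of the last point, the plain-Python example below (with a hypothetical threshold and feature value, not taken from any specific toolkit) shows how storing the same value at 32-bit precision in one implementation but 64-bit in another can flip a split or rule test:

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Hypothetical split threshold learned by a tree/rule inducer.
threshold = 0.1

# The same feature value, stored at different precisions by two toolkits.
value64 = 0.1
value32 = to_float32(0.1)  # slightly larger than the float64 0.1

# The split test "value <= threshold" disagrees between the two.
print(value64 <= threshold)  # True
print(value32 <= threshold)  # False
```

The same sample can therefore be routed down different branches by two implementations that are, on paper, running the same algorithm with the same learned model.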
A systematic study of the behavioral differences between toolkits would clearly advance the state of the art for data science practitioners.