Enterprise AI company Indico wants to give back to the open-source community that it says has helped its technology develop, with the release of this week’s highlighted open-source project. Indico’s Enso Python library is an open-source codebase designed to standardize the way transfer learning techniques for training natural language processing models are tested.
Transfer learning, which reuses knowledge gained from prior machine-learning tasks to speed up later ones, has proven successful in computer vision and image classification, greatly reducing the number of images needed for subsequent identification tasks. For natural language processing, however, Indico says the technique remains largely unproven.
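To make the idea concrete, here is a minimal sketch of transfer learning for text classification: a pre-trained "source" model turns documents into feature vectors, and a small "target" classifier is trained on only a handful of labeled examples. The featurizer and data below are placeholders for illustration only; this is not Enso's or Indico's code.

```python
# Conceptual sketch of transfer learning for text (placeholder featurizer and data).
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(docs):
    """Stand-in for a frozen, pre-trained source model (e.g. pre-trained embeddings)."""
    rng = np.random.RandomState(0)
    return rng.randn(len(docs), 300)  # fake 300-dimensional document features

train_docs = ["great service", "terrible product", "loved it", "would not recommend"]
train_labels = [1, 0, 1, 0]

features = featurize(train_docs)                         # knowledge "transferred" from the source task
clf = LogisticRegression().fit(features, train_labels)   # cheap target model, trained on very few labels
print(clf.predict(featurize(["really enjoyed this"])))
```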
“The Enso project is focused on addressing a core set of interrelated problems that underlie these limitations,” the company wrote in the release announcement. Problems include:
- “A lack of academic reproducibility. Due to the use of custom datasets and variations in coding practices, it is difficult to determine whether a new methodology is truly effective.
- Weak baseline benchmarks that limit general applicability. It is important to evaluate new methods on a broad range of datasets to determine whether or not a new approach represents a substantial improvement over alternatives.
- ‘Overfitting’ to specific datasets. Many of the models used for benchmarking are tied to specific datasets, making it too difficult to take a model trained for one domain and train it on another.”
To counter these problems, Indico has designed Enso to adhere to a deliberately laid-out workflow. The steps, as listed on the project’s GitHub (and sketched as a simple benchmarking loop after the list), are:
- All examples in the dataset are “featurized” via a pre-trained source model (python -m enso.featurize)
- Re-represented data is separated into train and test sets
- A fixed number of examples from the train set is selected to use as training data via the selected sampling strategy
- The training data subset is optionally over- or under-sampled to account for variation in class balance
- A target model is trained using the featurized training examples as inputs (python -m enso.experiment)
- The target model is benchmarked on all featurized test examples
- The process is repeated for all combinations of featurizers, dataset sizes, target model architectures, etc.
- Results are visualized and manually inspected (python -m enso.visualize)
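Taken together, the workflow amounts to a benchmarking loop: featurize once, then sweep over training-set sizes and target models while scoring each combination on a held-out test set. The sketch below is a rough conceptual illustration of that loop using placeholder featurizers, models, and data; it is not Enso's actual code or API, and the optional class-rebalancing step is omitted for brevity.

```python
# Rough conceptual sketch of the Enso-style benchmarking loop (placeholders throughout).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
docs = ["doc %d" % i for i in range(1000)]
labels = rng.randint(0, 2, size=1000)

featurizers = {
    "random_300d": lambda d: rng.randn(len(d), 300),   # stand-in for a pre-trained source model
}
target_models = {"logreg": LogisticRegression, "svm": LinearSVC}
train_sizes = [50, 100, 200, 500]

results = []
for feat_name, featurize in featurizers.items():
    X = featurize(docs)                                    # 1. featurize every example
    X_train, X_test, y_train, y_test = train_test_split(   # 2. split into train and test sets
        X, labels, test_size=0.3, random_state=0)
    for size in train_sizes:                               # 3. fixed number of training examples
        idx = rng.choice(len(X_train), size=size, replace=False)
        for model_name, Model in target_models.items():
            clf = Model().fit(X_train[idx], y_train[idx])          # 5. train the target model
            acc = accuracy_score(y_test, clf.predict(X_test))      # 6. benchmark on the test set
            results.append((feat_name, size, model_name, acc))     # 7. repeat for all combinations

for row in results:                                        # 8. inspect (or visualize) the results
    print(row)
```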
“Measuring how well methods perform as the amount of training data increases is critical,” said Madison May, Indico machine learning architect and co-founder. “In real-life examples, we often need to select for methods that perform well with only a few hundred labeled training examples. By providing a standard interface for benchmarking, we believe Enso can facilitate the development of more generalized models that have greater value to a broader base of users.”
Indico Enso is compatible with Python 3.4 or higher, and more detailed documentation can be found on its GitHub page.