MindsDB at PyTorch Ecosystem Day

Join our AI Research team at Pytorch Ecosystem Day. We will have 2 cool presentations (posters) – more details below.

Date: April 21, 2021

Event details: https://pytorchecosystemday.fbreg.com/

Pytorch via SQL commands: A flexible, modular AutoML framework that democratizes ML for database users

Pytorch enables building models with complex inputs and outputs, including time-series data, text and audiovisual data. However, such models require expertise and time to build, often spent on tedious tasks like cleaning the data or transforming it into a format that is expected by the models.

Thus, pre-trained models are often used as-is when a researcher wants to experiment only with a specific facet of a problem. See, as examples, FastAI’s work into optimizers, schedulers, and gradual training through pre-trained residual models, or NLP projects with Hugging Face models as their backbone.

We think that, for many of these problems, we can automatically generate a “good enough” model and data-processing pipeline from just the raw data and the endpoint. To address this situation, we are developing MindsDB, an open-source, PyTorch-based ML platform that works inside databases via SQL commands. It is built with a modular approach, and in this talk we are going to focus on Lightwood, the stand-alone core component that performs machine learning automation on top of the PyTorch framework.

Lightwood automates model building into 5 stages: (1) classifying each feature into a “data type”, (2) running statistical analyses on each column of a dataset, (3) fitting multiple models to normalize, tokenize, and generate embeddings for each feature, (4) deploying the embeddings to fit a final estimator, and (5) running an analysis on the final ensemble to evaluate it and generate a confidence model. It can generate quick “baseline” models to benchmark performance for any custom encoder representation of a data type and can also serve as scaffolding for investigating new hypotheses (architectures, optimizers, loss-functions, hyperparameters, etc).

We aim to present our benchmarks covering wide swaths of problem types and illustrate how Lightwood can be useful for researchers and engineers through a hands-on demo.

Model agnostic confidence estimation with conformal predictors for AutoML

Many domains leverage the extraordinary predictive performance of machine learning algorithms. However, there is an increasing need for transparency of these models in order to justify deploying them in applied settings. Developing trustworthy models is a great challenge, as they are usually optimized for accuracy, relegating the fit between the true and predicted distributions to the background [1]. This concept of obtaining predicted probability estimates that match the true likelihood is also known as calibration.

Contemporary ML models generally exhibit poor calibration. There are several methods that aim at producing calibrated ML models [2, 3]. Inductive conformal prediction (ICP) is a simple yet powerful framework to achieve this, offering strong guarantees about the error rates of any machine learning model [4]. ICP provides confidence scores and turns any point prediction into a prediction region through nonconformity measures, which indicate the degree of inherent strangeness a data point presents when compared to a calibration data split.

In this work, we discuss the integration of ICP with MindsDB –an open source AutoML framework– successfully replacing its existing quantile loss approach for confidence estimation capabilities.

Our contribution is threefold. First, we present a study on the effect of a “self-aware” neural network normalizer in the width of predicted region sizes (also known as efficiency) when compared to an unnormalized baseline. Our benchmarks consider results for over 30 datasets of varied domains with both categorical and numerical targets. Second, we propose an algorithm to dynamically determine the confidence level based on a target size for the predicted region, effectively prioritizing efficiency over a minimum error rate. Finally, we showcase the results of a nonconformity measure specifically tailored for small datasets.


[1] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K.Q. (2017). On Calibration of Modern Neural Networks. ArXiv, abs/1706.04599.

[2] Naeini, M., Cooper, G., & Hauskrecht, M. (2015). Obtaining Well Calibrated Probabilities Using Bayesian Binning. Proceedings of the AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, 2015, 2901-2907 .

[3] Maddox, W., Garipov, T., Izmailov, P., Vetrov, D., & Wilson, A. (2019). A Simple Baseline for Bayesian Uncertainty in Deep Learning. NeurIPS.

[4] Papadopoulos, H., Vovk, V., & Gammerman, A. (2007). Conformal Prediction with Neural Networks. 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), 2, 388-395.