Why Your Database Needs a Machine Learning Brain

The past 10-15 years have seen organizations put vast resources into creating databases that let them understand their business better, spot trends earlier, and manage tasks more effectively.

Indeed, a whole industry has now grown up around these databases, not just with database companies like ClickHouse, DataStax, MariaDB, MongoDB, MySQL, PostgreSQL, SingleStore, or Snowflake, but also with a swathe of companies developing business intelligence (BI) tools like Tableau to draw insight from the data housed in them.

These databases have traditionally been great at using historical data to spot trends, but forecasting (or rather, accurate forecasting) has been more elusive. Artificial intelligence changes this: as machine learning capabilities improve, it is becoming possible to make far more accurate predictions – in some cases, hour-by-hour business predictions.

Consequently, AI adoption is accelerating, especially in the wake of the Covid-19 pandemic, and according to PwC, most of the companies that have fully embraced AI are already reporting major benefits.

What predictions are possible?

Databases now collect and hold information from virtually every function in a business and organizations are turning to ML to use these data more effectively. Indeed, recent announcements on ML have come from organizations as disparate as Vancouver’s bus company TransLink, which used it to improve arrival time predictions and warn of potentially crowded buses; and the Munich Leukaemia Laboratory, where researchers are using it to predict if gene variants might be benign or pathogenic.

From a business intelligence point of view, ML can be used, for example, in retail to optimize promotional displays, just-in-time stock control, and staffing levels; in energy production to predict demand and outages; or in finance for better credit scoring and risk analysis.

A good example of how organizations can use ML's predictive capabilities on their existing data can be seen in a dataset we recently presented, using data from New York City taxis and their payment system app from Creative Mobile Technologies (CMT).

This is a hugely complex system: the distribution of fares varies not only throughout the day for a single taxi vendor, but also between the taxi vendors themselves. Adding to the complexity, there are multiple vendors, each with its own time series.

Fig 1: How temporal dynamics vary for each group of data – using NYC Taxi data

However, once this data was cleaned, it was possible to take the historical data in the database and, with a SQL query and MindsDB, train a multivariate time-series predictor that accurately forecast demand seven hours ahead, using just three variables: vendor, pickup time, and taxi fare.
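As an illustrative sketch of that step, a time-series predictor of this kind can be defined in MindsDB's SQL syntax along the lines below; the integration, table, and column names (nyc_data, taxi_trips, vendor_id, pickup_datetime, fare_amount) are placeholders for the purposes of the example, not the exact schema used in the demo:

```sql
-- Hypothetical sketch: train a multivariate time-series predictor in MindsDB.
-- Integration, table, and column names are illustrative placeholders.
CREATE PREDICTOR nyc_taxi_demand
FROM nyc_data (                      -- 'nyc_data' = the connected database integration
    SELECT vendor_id, pickup_datetime, fare_amount
    FROM taxi_trips
)
PREDICT fare_amount                  -- target column to forecast
ORDER BY pickup_datetime             -- temporal ordering of each series
GROUP BY vendor_id                   -- one series per taxi vendor
WINDOW 24                            -- look back over the previous 24 rows per vendor
HORIZON 7;                           -- forecast 7 steps (hours) ahead
```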

Fig 2: NYC Taxi Company fare predictions – MindsDB forecast (blue), vs reality (yellow)

As we see, it takes about 10 predictions before a forecast mirrors reality, with very little deviation after the first 15 predictions, allowing for better allocation of taxis and drivers at specific times and in specific sectors of the city.

So databases need a brain – where is the best place to put it?

As we can see, with the addition of ML, the information in databases can be used to make very accurate predictions, and these can serve a huge array of business applications, from predicting customer behavior to improving employee retention and industrial processes.

And that gives us two options: export the data to the brain, or import the brain to the data.

Currently, most ML systems export the data housed in a database using a series of steps similar to those below:

  1. Extract data

  2. Prep it (for example, turning it into a flat file)

  3. Load it into the BI tool

  4. Export the data from the BI tool to the ML extension

  5. Create a model

  6. Train the model

  7. Run predictions via the AutoML extension

  8. Load those predictions back into the BI tool

  9. Prepare visualization in the BI tool

This method is not ideal. It not only takes time, but also requires a considerable amount of extraction, transformation, and loading of data from one system to another, which can be challenging, particularly when dealing with the complexities of highly sensitive data such as in financial services, retail, manufacturing, or healthcare.

Indeed, one small-scale survey by CrowdFlower found that 80% of data scientists' time was taken up by data prep, and three-fourths of data scientists consider this prep the least enjoyable part of the job.

By keeping the ML at the database level, you're able to eliminate several of the most time-consuming steps. In doing so, you ensure sensitive data can be analyzed within the governance model of the database, while at the same time reducing the timeline of the project and cutting points of potential failure.

Furthermore, by placing ML at the data layer, you can use it for experimentation and simple hypothesis testing without each test becoming a mini-project that requires time and resources to be signed off. This means you can try things on the fly, increasing not only the amount of insight but also the agility of your business planning.

By integrating ML models as virtual database tables that sit alongside common BI tools, even large datasets can be queried with simple SQL statements. This technology incorporates a predictive layer into the database, allowing anyone trained in SQL to solve even complex problems involving time-series, regression, or classification models. In essence, this approach 'democratizes' access to predictive, data-driven experiences.
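To make the 'virtual table' idea concrete, here is a hedged sketch of how a trained predictor might be queried with a plain SELECT, joining the source table to the model; again, the model, integration, and column names (nyc_taxi_demand, nyc_data.taxi_trips, and so on) are assumed placeholders carried over from the earlier example:

```sql
-- Hypothetical sketch: query the trained predictor like an ordinary table.
-- Joining the source data to the model returns forecasts for the period
-- after the latest timestamp in the data.
SELECT t.vendor_id,
       m.pickup_datetime,
       m.fare_amount AS predicted_fare
FROM nyc_data.taxi_trips AS t
JOIN mindsdb.nyc_taxi_demand AS m
WHERE t.pickup_datetime > LATEST   -- forecast beyond the last observed point
AND   t.vendor_id = 'CMT';         -- restrict to a single vendor's series
```

Because the forecast comes back as just another query result, a BI tool can consume it like any other table, which is what removes the export-and-re-import loop described in the steps above.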

Adding trust alongside the predictions

Even with the smartest database, there is more to applying ML technology than the machine's prediction alone. Nuance is needed: those using such predictions must interpret them and drive reliable business outcomes.

Optimization tends to happen when models are paired with the human decision-making process. However, even then models can still show significant biases, and research has found that a model's output can also introduce cognitive bias in the human.

A critical aspect, therefore, is being able to understand the model and to trust its accuracy and value.

To help business analysts understand why the ML model made certain predictions, it's best to deploy an ML tool that generates predictions with visualizations and explainable AI (XAI) features. This not only builds the needed trust, but also gives the analysts charged with interpreting the results a chance to quickly see whether data cleanliness issues or human bias might be skewing the model's output.

So, does your database need a brain?

Absolutely. And while ML has traditionally been kept separate from the data layer, this is changing. Your database houses a rich history of virtually every vital part of your business, and by using ML in the database it is becoming simpler to forecast what that data will look like in the future, running queries with little more than standard database commands.

Read more about MindsDB on mindsdb.com, make sure to star ⭐ MindsDB on GitHub, and connect with the community on Slack!