We previously went over how to run a dataset using MindsDB. Now, let’s run through a quick introduction to MindsDB’s most user-friendly product, MindsDB Scout, a graphical user interface that sits on top of MindsDB and the MindsDB server.
You can download and install Scout for free and follow along by working on your own machine.
If you prefer to follow along visually, you can watch the video below:
Once Scout is installed, we have two options: 1.) to install MindsDB (MindsDB itself + MindsDB server) locally, or 2.) to connect to a remote MindsDB Server.
Once connected, you’ll have three main tabs: Datasource, Predictors and Query.
The first thing we need to do is upload a datasource, some data with which MindsDB can train a predictive model and run data analysis.
Simply go to the Datasource tab and click “Upload” if you want to add a local file or “Add from URL” if you have a URL pointing to your file. This file can be a JSON, CSV, TSV, XLSX or various other column-based data formats. If you are unsure whether your data format is supported, try uploading it. If it doesn’t work, try converting it to one of the previously mentioned formats.
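If a file does not upload, converting it to CSV is usually a short script. The following is a minimal sketch using only Python's standard library; the function name `delimited_to_csv` and the pipe-delimited sample are made up for illustration and are not part of Scout or MindsDB.

```python
import csv
import io


def delimited_to_csv(text: str, delimiter: str) -> str:
    """Re-encode delimiter-separated text (e.g. pipe-separated) as CSV."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    # The csv writer quotes any field that itself contains a comma.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Hypothetical pipe-separated export that we want to upload as CSV.
sample = "rooms|location|price\n2|good, central|1500\n"
converted = delimited_to_csv(sample, "|")
print(converted)
```

For real files you would read the input from disk and write `converted` to a `.csv` file before uploading it.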
Scout should also support S3, MariaDB, MySQL and Postgres in the near future, since MindsDB already does.
After we’ve uploaded the dataset, we can take a look at it using the “Preview” function.
And we can also tell MindsDB to run some statistical data analysis on it using the “Analyze Datasource” button.
This analysis should give you insight into the type and value distribution of the data in each column, as well as a quality score for each column (see the big circle with “Overall Data Quality”). You will see warnings about the various sub-scores which build up to the larger data quality score. These warnings are not absolute statements about your data; they are just things that our summary analysis reveals and that you might want to take a look at.
A low score could be considered anything below 5. If you have time, I’d recommend reviewing columns with a low score and figuring out whether the statistical analysis has indeed caught something problematic that you didn’t expect.
For more information on the individual metrics we analyze, feel free to read our documentation on the subject.
Next, we want to tell MindsDB to train the actual machine learning model. For this we will need:
- A dataset
- One or more columns from that dataset we want to predict
In the “Predictors” tab, click on the “Train New” button in the lower right and you should be brought to a view where you have to input those two options, as well as a name for the predictor (any name will do, just make sure you don’t create two predictors with the same name).
Now click train and MindsDB will start a process of extracting the data, analyzing it, cleaning it and transforming it to a standardized format, using it to train a machine learning model and running a black-box analysis on said machine learning model.
This might take a while, especially for a larger dataset, so don’t worry. Go ahead and make yourself a cup of tea.
Once the Predictor is trained, you can click the “Preview” column in order to get some of the insights from that black-box analysis I mentioned.
The first question you might have is “What is a black-box analysis?” The short answer is that MindsDB will split your data into 3 groups:
- Train (80% of the data)
- Test (10% of the data)
- Validation (10% of the data)
The “Validation” data is never seen by the machine learning backend, thus we assume that the model behavior based on this data is similar to how the model would behave had you passed on some “real” data that you wanted to predict (where you don’t already know the value of the prediction column).
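To make the split concrete, here is a minimal sketch of an 80/10/10 split in plain Python. It only illustrates the proportions; the shuffling, the fixed seed and the function name `split_rows` are assumptions for the example, not a description of how MindsDB assigns rows internally.

```python
import random


def split_rows(rows, seed=0):
    """Shuffle rows and split them 80/10/10 into train, test and validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n * 8 // 10          # first 80% of the shuffled rows
    test_end = train_end + n // 10   # next 10%
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]


train, test, validation = split_rows(range(1000))
print(len(train), len(test), len(validation))  # 800 100 100
```

The important property is that the validation rows never take part in training, so accuracy measured on them approximates accuracy on data the model has not seen.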
For a simple example, we use the validation dataset to determine the accuracy you see above (The big circle with “66.1% Accurate”).
For a more complex example, look below at the “Column importance” tab. Here we try poking the model with various combinations of missing columns in order to determine the importance of each column (note: some of the machine learning models we use can also directly yield insights about the column importance, but for the purposes of this article that’s a bit too complex to get into).
The column importance can range from 0 to 10, with 0 meaning that the column is basically useless and 10 meaning that the column’s predictive ability is so good that you could use this column alone and get the exact same accuracy.
Accordingly, most of the time scores will probably range from 2 to 8. You should consider throwing out columns with an importance score at the low end (say, between 0 and 3), or at least taking a second look at their data quality.
Obviously, this isn’t an absolute statement about the causality in your data, or even about the potential correlations in your data. It just tells you what’s important to your model.
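The idea of "poking the model with missing columns" can be sketched with a toy example. Everything below is an assumption made for illustration: the lookup-table "model", the rule of dropping one column at a time, and the way the accuracy drop is scaled to 0 to 10 are stand-ins, not MindsDB's actual scoring.

```python
from collections import Counter, defaultdict


def train_lookup(rows, columns, target):
    """Toy model: remember the most common target for each combination of values."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[tuple(row[c] for c in columns)][row[target]] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    fallback = Counter(row[target] for row in rows).most_common(1)[0][0]

    def predict(row):
        return table.get(tuple(row[c] for c in columns), fallback)

    return predict


def accuracy(predict, rows, target):
    return sum(predict(row) == row[target] for row in rows) / len(rows)


def column_importance(train, validation, columns, target):
    """Score each column by the accuracy lost when it is withheld (0 to 10)."""
    full = accuracy(train_lookup(train, columns, target), validation, target)
    # Baseline: a model with no input columns always predicts the majority value.
    baseline = accuracy(train_lookup(train, [], target), validation, target)
    span = full - baseline
    scores = {}
    for col in columns:
        rest = [c for c in columns if c != col]
        reduced = accuracy(train_lookup(train, rest, target), validation, target)
        drop = min(max(full - reduced, 0.0), span)
        scores[col] = round(10 * drop / span, 1) if span > 0 else 0.0
    return scores


# Made-up data: the price depends only on "rooms"; "noise" carries no signal.
combos = [{"rooms": r, "noise": n, "price": "high" if r == 3 else "low"}
          for r in (1, 2, 3) for n in (0, 1)]
train, validation = combos * 5, combos
scores = column_importance(train, validation, ["rooms", "noise"], "price")
print(scores)  # {'rooms': 10.0, 'noise': 0.0}
```

In this toy dataset "rooms" alone reproduces the full accuracy, so it lands at 10, while "noise" can be removed with no loss and lands at 0.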
To give a somewhat famous example, given a dataset of people aged 5 to 16, and a goal to predict intelligence, we might have 3 factors:
- Various SNPs
- Various environmental factors (school, parent’s education, grades… etc)
- Shoe size
Any model is likely to find the strongest correlation between shoe size and intelligence, not because shoe size indicates intelligence in any way, but because shoe size is a proxy for age and age in the 5 to 16 cohort is a good proxy for intelligence.
So, if your model told you shoe size is the most important column of the bunch, you shouldn’t jump to the conclusion “Obviously, people have been wrong on both sides of the nurture vs nature debate, we need to implement a feet-size growth program ASAP to make everyone a genius”. Rather, you should try to figure out “why might the model think that?”. In some cases, it might be that the model indeed found a causal relationship. In other cases, it might be that the model found an unexplainable correlation that can be useful. In further cases, you might have just added a confounding variable. MindsDB is an expert on machine learning, but that question is one machine learning can’t answer. That’s where we leave things up to domain experts.
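The shoe size effect is easy to reproduce with synthetic data. The sketch below invents all of its numbers (the growth rates, the noise levels and the "score" scale are purely illustrative): both shoe size and score are generated from age alone, yet they end up strongly correlated with each other until age is held fixed.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def simulate(per_age=200, seed=0):
    """Synthetic children aged 5 to 16; shoe size and score depend only on age."""
    rng = random.Random(seed)
    rows = []
    for age in range(5, 17):
        for _ in range(per_age):
            shoe = 20 + 1.5 * age + rng.gauss(0, 1)   # made-up growth curve
            score = 10 * age + rng.gauss(0, 5)        # made-up test score
            rows.append((age, shoe, score))
    return rows


rows = simulate()
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Hold the confounder fixed: look only at ten-year-olds.
ten = [r for r in rows if r[0] == 10]
within = pearson([r[1] for r in ten], [r[2] for r in ten])

print(f"all ages: {overall:.2f}, age 10 only: {within:.2f}")
```

Across all ages the correlation is close to 1, while within a single age group it hovers around 0, which is the signature of a confounding variable rather than a causal link.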
There are two other things you can see in this overview at the moment: a confusion matrix based on the validation dataset and a plot of accuracy based on what value the predicted column had plus a histogram of the occurrence of said value in your dataset.
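For readers unfamiliar with confusion matrices, here is a minimal sketch of how one can be computed from validation results. The labels and predictions are made up, and the function is a generic illustration rather than the code Scout uses.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return the sorted labels and a matrix with rows=actual, columns=predicted."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Hypothetical validation rows: what really happened vs. what the model said.
actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "no", "yes"]

labels, matrix = confusion_matrix(actual, predicted)
print(labels)   # ['no', 'yes']
print(matrix)   # [[2, 0], [1, 2]]

# Accuracy is the diagonal (correct predictions) divided by the total.
correct = sum(matrix[i][i] for i in range(len(labels)))
print(correct / len(actual))  # 0.8
```

Each row shows how the rows with a given true value were predicted, so off-diagonal cells reveal which values the model tends to confuse with each other.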
Finally, go to the “Query” tab or click “Query” on your Predictor in order to get some actual predictions from your model.
This part is pretty straightforward: pick a model, choose what you want to predict (selected by default if the model only predicts a single column), input some data for the columns you want to be used and hit run.
The more interesting bit comes in what MindsDB does besides predicting a value.
Together with your prediction, you will get a confidence in that prediction–which comes from the black-box analysis we ran–potentially combined with insights that the model itself determines during training.
You will also see some plain-text information about the prediction, which explains why the confidence is such and what you might be able to do to increase that confidence.
Finally, you will see another column importance score, this time for how important the column is in determining the confidence (i.e., would we be less or more confident in this prediction had a certain column not been used).
This shouldn’t be confused with the overall column importance score we saw earlier since this score is specific to the prediction. A column might overall not be that important, but be critically important for the specific combination of values that yield a given prediction, or vice versa.
That about does it for this Overview of MindsDB Scout. Keep in mind that this GUI is not meant to handle the training of production-quality models (yet). Rather, it’s a tool that gives you a quick start with MindsDB if you don’t want to write code and it’s a tool that we can use to analyze a more sophisticated model once it’s been trained.
In a production setting, the GUI might be used to analyze an already-trained model or be given to employees that need to make a small number of predictions on the spot (e.g., “Should I lend this guy $5,000?”, “Does MindsDB think, based on a given image, that I should change this particular component of a car?”).
However, I think it’s a nice introduction to see what MindsDB can do. You can use a sample of your dataset (up to about 100,000 rows) to train a model in the GUI and see what it can do. Afterwards, if you like it but need better accuracy or want to run a larger dataset, you can always go into the lower levels of the MindsDB stack (all of MindsDB is made with user-friendliness in mind, so don’t worry too much).
If you want to see how the GUI would be used end-to-end in order to do this with an actual dataset, I suggest you check out this video. However, the best thing you can do is download Scout, install it and see what you can do with it.