Let’s walk through running a dataset through MindsDB end to end, including the minimal data pre-processing that MindsDB doesn’t do for you and the standard way to evaluate a machine learning model’s performance.
For the purposes of this article I will be using a “standard” dataset called “German Credit”. The purpose of this dataset is to predict whether someone’s credit class is either good or bad based on 20 attributes such as: installment rate, job, purpose and credit history.
If you prefer to follow along visually, you can watch the video below:
The dataset can be downloaded from here.
First, let’s download it and create the following directory hierarchy for the project:
Next, we’ll install mindsdb and scipy: `pip install --user mindsdb scipy` (note: you need to use Python 3’s pip, which might be aliased as `pip3` on certain operating systems).
The Data Processing
Next, we’ll have to process the data. We’ll do this inside a file called `pre_processing.py`. There are a few steps we need to take:
1. Turn the Data From arff Into a Pandas DataFrame
This is done because a pandas DataFrame is easier to work with in Python than an arff file.
Use scipy’s `loadarff` to load the data into a tuple. The first member is a numpy array of rows and the second member is an object containing metadata such as the column names.
Next, iterate through the rows in order to do some cleanup where necessary. In this case `loadarff` loads string columns as numpy bytes objects rather than Python strings and wraps them in `'`, so we’ll have to decode them to strings and remove the surrounding `'` in order to get a better representation of the original data.
Finally, using the column names, we’ll turn our dataset into a pandas DataFrame.
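The three steps above might be sketched as follows. This is a minimal sketch rather than the original `pre_processing.py`: the helper name `arff_to_dataframe` and the tiny in-memory ARFF sample are made up for illustration, and in practice you would pass the path of the downloaded file instead.

```python
import io

import pandas as pd
from scipy.io.arff import loadarff

# Tiny made-up ARFF sample standing in for the real German Credit file.
SAMPLE_ARFF = """@relation credit_sample
@attribute duration numeric
@attribute purpose {radio,car}
@attribute class {good,bad}
@data
6,radio,good
48,car,bad
"""


def arff_to_dataframe(arff_source):
    """Load an ARFF file (path or file-like object) into a pandas DataFrame."""
    data, meta = loadarff(arff_source)
    rows = []
    for record in data:
        row = []
        for value in record.tolist():
            # Nominal/string values arrive as bytes; decode them and drop any
            # surrounding single quotes (a no-op when there are none).
            if isinstance(value, bytes):
                value = value.decode("utf-8").strip("'")
            row.append(value)
        rows.append(row)
    # meta.names() returns the column names in file order.
    return pd.DataFrame(rows, columns=meta.names())


df = arff_to_dataframe(io.StringIO(SAMPLE_ARFF))
print(df)
```

Numeric attributes come back as floats, so only the bytes values need any cleanup.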
2. Train and Test Split
It’s good practice to split a dataset into two parts: one is used for training our machine learning model (in this case the one built by MindsDB) and the other is used to test its accuracy. We call these the “train” and “test” datasets.
A good train/test split could be something like 80/20, which I’ll be doing here. The more data we feed the model, the better its accuracy tends to be, but we want to be left with a significant amount of data to test our model. What “significant” means depends on the specific domain you work in, the problem and the size of your data.
Before doing that, we’ll shuffle the data around. This is not necessary for this particular example, but it’s a good general practice since otherwise the ordering of your data might result in an uneven split of certain features between the training and testing datasets.
*This is not the case with certain time-dependent datasets where it can be ideal to train on older data and test on newer data*
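A shuffle followed by an 80/20 split could look roughly like this with pandas. It is a sketch under assumptions: the function name `shuffle_and_split`, the fixed seed and the toy DataFrame are illustrative and not taken from the original script.

```python
import pandas as pd


def shuffle_and_split(df, train_fraction=0.8, seed=42):
    """Shuffle the rows, then cut the frame into train and test parts."""
    # sample(frac=1) returns every row in random order; the seed makes the
    # shuffle repeatable between runs.
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    split_at = int(len(shuffled) * train_fraction)
    return shuffled.iloc[:split_at], shuffled.iloc[split_at:]


# Toy data standing in for the processed German Credit DataFrame.
toy = pd.DataFrame({"duration": range(10), "class": ["good", "bad"] * 5})
train, test = shuffle_and_split(toy)
print(len(train), len(test))  # 8 2
```

Each part can then be written out with `DataFrame.to_csv(path, index=False)`, using whatever file names you chose for the processed data.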
Now try running `pre_processing.py`. If all goes well, you should see the two new csv files in your `processed_data` directory.
Next, we’ll create a file called `train.py` in which we’ll add the MindsDB training code:
That’s it. That’s all you need to train a MindsDB model. The only required arguments are `to_predict`, which indicates the name of the column to be predicted (or key, in case you are using a JSON file), and `from_data`, which indicates the location of the data. By default, `from_data` can be a structured data file (xlsx, json, csv, tsv, etc.) or a pandas DataFrame. MindsDB also supports advanced data sources that can get data from stores such as S3, MariaDB, MySQL or Postgres. More on that here.
To understand the optional arguments, it helps to understand the “phases” through which MindsDB goes:
- First, we have a “data preparation” phase composed of extracting the data and arranging it in a format with which MindsDB can work easily.
- Second, we have the “data analysis” phase, in which MindsDB makes some statistical inferences about your data, which you can use to evaluate the quality of the data and figure out how to improve it.
- Third, there is a “data transformation” phase where, based on the insight from the analysis, the data is changed.
- Fourth, there is the training phase (we call it the “model interface”) where the machine learning backend does the heavy lifting and produces a predictive model.
- Fifth, there is a “model analysis” phase, where MindsDB runs some black-box analysis on the model to determine things such as the importance of each column and the confidence of our prediction.
- See this video for a closer look at what it produces.
Let’s look at the optional arguments passed here since you might find yourself using them rather often:
- stop_training_in_x_seconds: Tells MindsDB to stop training the model after this many seconds. The total time MindsDB takes to run might be up to twice the value you specify here.
- backend: The machine learning backend used to train the model. This can be a string equal to `'lightwood'` (default) or `'ludwig'`. Lightwood is the open source machine learning backend developed by us at MindsDB; Ludwig is an alternative backend developed by Uber’s ML team. You can also pass an object representing your own custom machine learning backend (for more on that, see this example).
- sample_margin_of_error: Essentially, if this argument has a value bigger than 0, MindsDB will not run the data analysis phase on all the data, but will select a sample instead. I suggest setting this somewhere between 0 and 0.1. The bigger this value is, the quicker MindsDB will analyze your data; however, this doesn’t affect the actual training time.
- equal_accuracy_for_all_output_categories: When you have unbalanced target variable values, this will treat all of them as equally important when training the model. To see more information about this and an example, see this section of the docs. In this example, if our dataset had 200 rows with a “bad” credit class and 20,000 rows with a “good” credit class, MindsDB would reach 99% accuracy if it predicted “good” every time and would never bother to teach itself how to predict “bad”. However, if we care about predicting “good” and “bad” with the same accuracy, this argument tells MindsDB to judge the model by the combined accuracy on the two classes rather than by the overall accuracy.
- use_gpu: Defaults to False. Set it to True if you have a GPU; this will speed up model training a lot in most situations. Note that the default learning backend (lightwood) only works with relatively new (2016+) Nvidia GPUs. (This is not because we have some bias towards Nvidia or recommend their hardware, but because most underlying libraries we use don’t offer support for other manufacturers.)
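To make the arguments above concrete, here is a hedged sketch of how a `train.py` might pass them. The argument names come from this article; everything else is an assumption: the `mindsdb.Predictor(name=...).learn(...)` call shape reflects the MindsDB Python API of the version this article was written for, and the predictor name, the target column `class`, the csv file name and all the values are illustrative.

```python
# Illustrative values only; tune them for your own data and hardware.
LEARN_KWARGS = {
    "to_predict": "class",                    # assumed name of the target column
    "from_data": "processed_data/train.csv",  # assumed name of the training file
    "stop_training_in_x_seconds": 300,        # total run may take up to ~2x this
    "backend": "lightwood",                   # or "ludwig"
    "sample_margin_of_error": 0.02,           # > 0 means analyze a sample only
    "equal_accuracy_for_all_output_categories": True,
    "use_gpu": False,
}


def train_model(learn_kwargs=None, name="german_credit"):
    """Train a predictor with the settings above (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without mindsdb installed.
    import mindsdb

    predictor = mindsdb.Predictor(name=name)  # assumed constructor signature
    predictor.learn(**(learn_kwargs or LEARN_KWARGS))
    return predictor
```

Calling `train_model()` would start training. Keeping the settings in one dictionary makes it easy to try a different backend or time budget without touching the call itself.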
Testing Our Model
Finally, we’ll add some evaluation code to `train.py`, which we use to figure out how well the model is actually performing:
First, tell MindsDB to predict for our testing dataset and extract the predicted values from the result:
Second, get a list of the “real” values and compare the two using a balanced accuracy score. This way the accuracy is not computed as the overall accuracy, but rather as the average of the accuracy for predicting a good credit class and the accuracy for predicting a bad credit class.
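To see why a balanced score matters, here is a small dependency-free sketch using the imbalanced example from earlier (200 “bad” rows, 20,000 “good” rows). It assumes the usual definition of balanced accuracy, the mean of the per-class accuracies, which is also what scikit-learn’s `balanced_accuracy_score` computes; the function below is an illustration, not the article’s evaluation code.

```python
def balanced_accuracy(real, predicted):
    """Mean of the per-class accuracies, so every class counts equally."""
    per_class = []
    for label in sorted(set(real)):
        indices = [i for i, value in enumerate(real) if value == label]
        correct = sum(1 for i in indices if predicted[i] == label)
        per_class.append(correct / len(indices))
    return sum(per_class) / len(per_class)


real = ["good"] * 20000 + ["bad"] * 200
always_good = ["good"] * len(real)  # a "model" that never predicts "bad"

overall = sum(r == p for r, p in zip(real, always_good)) / len(real)
print(round(overall, 4))                     # 0.9901
print(balanced_accuracy(real, always_good))  # 0.5
```

The overall accuracy looks excellent, while the balanced score shows that the model is no better than a coin flip at telling the two classes apart.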
If we want to better visualize the output, we can also print a confusion matrix:
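A confusion matrix simply counts how often each real class was predicted as each class. The sketch below builds one with the standard library on made-up labels; scikit-learn’s `confusion_matrix` produces the same kind of table, and the helper name here is hypothetical.

```python
from collections import Counter


def confusion_matrix_counts(real, predicted, labels):
    """Rows are real classes, columns are predicted classes."""
    counts = Counter(zip(real, predicted))
    return [[counts[(r, p)] for p in labels] for r in labels]


# Made-up labels for illustration.
real = ["good", "good", "good", "bad", "bad"]
predicted = ["good", "good", "bad", "bad", "good"]
labels = ["good", "bad"]

matrix = confusion_matrix_counts(real, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
# good [2, 1]
# bad [1, 1]
```

The diagonal holds the correct predictions; everything off the diagonal is a mistake, and the row tells you which class was misjudged.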
Now run `train.py` and see how it works for yourself. If something breaks, try retracing your steps through this article. If it still doesn’t work, feel free to report it on our GitHub project.