Data scientists use a range of techniques to extract meaning from data. The process includes identifying data analytics problems, analyzing the data to find patterns, building data and predictive models, and delivering predictive analytics. To do this work, data scientists draw on tools and methods from statistics, computer science, data engineering, and machine learning.
The lifecycle of a data science project includes all of the steps above. The process typically begins with questions that help you understand the objectives of the business you're working for (or with) well enough to see how data can be used to reach them. At the end of the cycle, you should be able to present your findings visually. Successfully completing each step, and achieving full competency as a data scientist, requires expertise and skills in a few different areas.
Having a tool that can guide you through the areas where you have less expertise is very helpful. In this tutorial, we'll cover each step of the process and show how even non-technical users can do data science using MindsDB Scout.
Step 1: Business Understanding
First, you'll need to collect information from all the relevant sources. After you've collected this information, you'll need to define the question you are going to answer. Defining the question is one of the most important parts of the process. By "question" we mean:
- What do you want to predict or estimate?
- How can you ensure data quality?
- Which statistical analysis techniques will you need to apply?
- Which model is most appropriate for the data?
Once you have the data structured and know what you want to predict, you can easily use MindsDB Scout to gather the other answers.
Before installing MindsDB, make sure you have Python 3.6 or higher; this is the only requirement. You can download MindsDB from MindsDB's product page. It works on the most widely used operating systems, including Microsoft Windows, Linux, and macOS.
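You can check the Python requirement from a script before installing. A minimal sketch (the helper function here is our own, not part of MindsDB):

```python
import sys

# MindsDB requires Python 3.6 or newer.
MIN_VERSION = (3, 6)

def python_is_supported(version=sys.version_info[:2]):
    """Return True when the given (major, minor) version meets the requirement."""
    return version >= MIN_VERSION

if __name__ == "__main__":
    print("Python is supported:", python_is_supported())
```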
The data that we will use for this example is the Heart Disease dataset from UCI. Our analysis will use MindsDB as an explainable automated machine learning framework to predict the presence of heart disease in patients. Note that you can follow along with this tutorial using other datasets.
Steps 2-3: Data Mining and Cleaning
The quality of the data is the most important factor in the final analysis. Noisy, corrupted, incomplete, or inconsistent data can distort the results you obtain from it.
What is data mining?
The simplest answer to this question is that it is a process to turn raw data into useful information.
The raw data can come from different sources, such as an API, databases, surveys, blogs, or social media. In our example, the Heart Disease UCI dataset, the data comes from four databases: the Hungarian Institute of Cardiology; the University Hospital in Zurich; the University Hospital in Basel, Switzerland; and the V.A. Medical Center in Long Beach together with the Cleveland Clinic Foundation. Note that the names and social security numbers of the patients were removed from the data.
What is data cleaning?
Most of the time, to ensure that data analysis is accurate, you'll need to clean up the data you'll be using. The cleanup process often includes organizing the gathered information and removing bad or incomplete records. The Heart Disease UCI database contains around 76 attributes, but all published experiments use a subset of fourteen of them.
In our case, we don't need to do any data cleaning, because the dataset we use is already well structured in tabular form.
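For datasets that do need cleaning, a minimal pass might select the relevant columns and drop incomplete rows. The sketch below assumes a generic list-of-dicts representation of the rows (not MindsDB's internal format), and uses `"?"` as a missing-value marker, as the raw UCI files do:

```python
def clean_rows(rows, required_columns):
    """Keep only the required columns and drop rows with missing values."""
    cleaned = []
    for row in rows:
        subset = {col: row.get(col) for col in required_columns}
        # Discard the row if any required value is missing or empty.
        if all(value not in (None, "", "?") for value in subset.values()):
            cleaned.append(subset)
    return cleaned

rows = [
    {"age": 63, "chol": 233, "notes": "ok"},
    {"age": 41, "chol": "?", "notes": "missing cholesterol"},
]
print(clean_rows(rows, ["age", "chol"]))  # only the first row survives
```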
Overall, the most important takeaway from these steps is that "better data beats good algorithms," or, in other words, "garbage in, garbage out."
Step 4: Data Exploration
Let’s run MindsDB Scout and upload the data. The main focus in the data exploration step is to identify and explain biases, confirm the completeness and correctness of the data, and determine possible relationships between elements and other data quality issues.
- Start MindsDB Scout and click Connect to MindsDB.
- Add the path to your local Python installation and click on Install.
- It’ll take around 5 minutes to install all of the MindsDB dependencies. After successful installation, click on Connect.
Now, you are ready to upload the dataset and start the data analysis. Let's go to the Datasources section.
The Datasources section is, as the name indicates, the section where the data for training will be uploaded, reviewed and analyzed for quality.
- To upload the dataset, choose Upload.
- Select the Heart Disease Dataset from your local file system. Note that MindsDB Scout allows URL uploads also. Next, click on Upload.
Now the first step is done: you have successfully uploaded the data to MindsDB Scout. Note that the data is uploaded to your local server. One of the advantages of MindsDB Scout is that the data stays accessible only to you, on your local system, so you don't need to upload it to the cloud or share it with an external API.
Now, let’s preview the data.
In the Datasource Preview section, there is a table representation of the dataset where you can easily preview and search the input values. The full metadata description of the Heart Disease dataset columns is:
- “age” – age in years
- “sex” – 1 = male and 0 = female
- “cp” – chest pain type (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
- “trestbps” – resting blood pressure
- “chol” – serum cholesterol in mg/dl
- “fbs” – fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- “restecg” – resting electrocardiographic results (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
- “thalach” – maximum heart rate achieved
- “exang” – exercise induced angina (0 = no, 1 = yes)
- “oldpeak” – ST depression induced by exercise relative to rest
- “slope” – the slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
- “ca” – number of major vessels (0-3) colored by fluoroscopy
- “thal” – thalassemia (3 = normal, 6 = fixed defect, 7 = reversible defect)
- “target” – the predicted attribute, diagnosis of heart disease
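The categorical codes above can be captured in a small lookup so that query inputs can be sanity-checked before you send them to the model. The codes below come from the dataset description; the helper function itself is a hypothetical illustration, not part of MindsDB:

```python
# Allowed codes for some of the categorical columns described above.
CATEGORICAL_CODES = {
    "sex": {0, 1},
    "cp": {0, 1, 2, 3},
    "fbs": {0, 1},
    "restecg": {0, 1, 2},
    "exang": {0, 1},
    "slope": {1, 2, 3},
    "thal": {3, 6, 7},
}

def invalid_fields(row):
    """Return the names of categorical fields whose code is out of range."""
    return [col for col, codes in CATEGORICAL_CODES.items()
            if col in row and row[col] not in codes]

print(invalid_fields({"sex": 0, "cp": 5}))  # ['cp']
```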
After uploading the data and being able to preview it, we can say the data is ready to be used. The next step will be to examine the data and do the data analysis.
Imagine reading a much bigger table than the one in our example, with thousands of rows and columns full of numbers or timestamp values. There is no way to get statistical insight into the data just by looking at that table.
Most people are better at interpreting shapes, colors, scores, and data dimensions. That's what MindsDB Scout enables, and it's how the tool makes the data exploration step much easier. Let's open the Datasource Quality analysis section.
What we can see here is a quality score for each column (input) and the variability in the dataset. MindsDB looks at every column, analyzes the data, works out the quality score, and shows suggestions and warnings about the data.
Let’s click on a specific column, e.g., age. Now, data quality statistics will be displayed for the selected column.
The insights that MindsDB provides for each column are:
- Overall data quality score (a higher number means better quality, and vice versa)
- Consistency of the data (how usable the data is, whether many values are missing, etc.)
- Value duplication (a higher score means the data doesn't contain many duplicates)
- Variability (how spread out the data is, i.e., how much the values vary)
- Value distribution (a low score indicates that some values differ significantly from the other observations)
Also, the bar chart shows the occurrences of this variable in the dataset.
For the additional scores that MindsDB provides, read about its column-metrics.
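MindsDB computes its scores internally, but two of the underlying ideas, value duplication and variability, can be illustrated with plain Python. The formulas below are simple stand-ins (a duplicate ratio and a coefficient of variation), not MindsDB's actual metrics:

```python
import statistics

def duplication_ratio(values):
    """Fraction of values that repeat an earlier value (0 means all unique)."""
    return 1 - len(set(values)) / len(values)

def coefficient_of_variation(values):
    """Standard deviation relative to the mean: a rough measure of spread."""
    return statistics.pstdev(values) / statistics.mean(values)

ages = [29, 45, 45, 61, 70]
print(round(duplication_ratio(ages), 2))         # 0.2 (one repeat out of five)
print(round(coefficient_of_variation(ages), 2))  # 0.28
```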
What can you learn from the data insights?
If you check the fbs and exang column metrics, you will notice that they have very poor quality scores. The exang column's variability score is 3.5/10, which means it is unevenly distributed. The redundancy score for exang is 3/10, which means the data in this column adds little value for making predictions. MindsDB also warns that there is a high correlation between this column and the fbs column. This could indicate corrupted data, but it is up to domain experts to decide whether to exclude one of these columns to improve model accuracy.
The age and chol (serum cholesterol) columns have good quality scores. In both columns, the consistency is 8/10, which means the data is quite usable. The factors behind the good consistency score are the type distribution score (MindsDB can easily determine the data types; in this case, chol values are numbers), the empty cells score (here, no cells have missing data), and the value duplication score.
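The correlation warning MindsDB raises can be reproduced by hand with the standard Pearson formula. The sketch below uses small made-up binary columns for illustration, not the real fbs/exang data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two binary columns that mostly agree correlate strongly.
col_a = [0, 0, 1, 1, 0, 1]
col_b = [0, 0, 1, 1, 0, 0]
print(round(pearson(col_a, col_b), 2))  # 0.71
```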
Steps 5 – 6: Feature Engineering and Predictive Modeling
What is Feature Engineering?
Algorithms require features (input data) to work properly, and well-chosen features improve the performance of the algorithm and the accuracy of the model. Feature engineering turns raw data inputs into features that the algorithm can understand. In this process, MindsDB selects the most relevant features to use for model construction.
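One classic feature-engineering transformation, which automated frameworks like MindsDB apply for you, is one-hot encoding a categorical column such as cp (chest pain type) into indicator features a model can consume:

```python
def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicator features."""
    return [1 if value == c else 0 for c in categories]

cp_categories = [0, 1, 2, 3]  # chest pain types from the dataset description
print(one_hot(2, cp_categories))  # [0, 0, 1, 0]
```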
What is Predictive Modeling?
Predictive modeling, often referred to as machine learning or predictive analysis, uses statistics to predict outcomes. The model is made up of a number of predictors: variables that are likely to influence future results.
How does MindsDB help here?
This is the part where we don't need to do much work and can leave the automated machine learning to MindsDB. Go over to the Predictors section, choose Train New Predictor, and select Advanced Mode.
Note that Predictor, in MindsDB's terminology, means the machine learning model.
Let’s quickly go through what all of the available options in the Predictor Advanced mode mean:
- From – the datasource used for training. In this example, it's HeartDiseaseData.
- Predictor name – the name of the Predictor.
- Select columns to be predicted – the target column(s) to be predicted. In this example, the target column.
- Sample margin of error – the amount of random sampling error to allow.
- Stop training after – stop training the model after n seconds.
- Backend – the machine learning backend used to train the model (Lightwood or Ludwig).
- Use GPU – train the model on CPU or GPU. The default is CPU.
- Select columns to be removed for training – excludes the selected columns from the training data.
From the example dataset that we are using, we want to predict the presence of heart disease in the patient. Following the metadata description, our target variable will be Target (diagnosis of heart disease).
The required options for training a Predictor are the From value, the Predictor name, and the columns to be predicted. After filling out all of the required values, click Train.
Now MindsDB will extract, analyze, and transform the data, train the machine learning model, and run a black-box analysis on it. That's why training takes some time, especially on large datasets.
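Part of what happens under the hood is partitioning the dataset before training. A common way to split (the exact ratios and shuffling MindsDB uses may differ; this is a generic sketch) looks like:

```python
import random

def split_dataset(rows, train=0.8, validation=0.1, seed=42):
    """Shuffle and split rows into train / validation / test partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * validation)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```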
After the training status changes to Complete, click the Preview button to get additional insights about the Predictor results.
What can you learn from the Predictor results?
The Predictor results dashboard provides insights related to the trained model.
How accurate is the model?
The first section shows the accuracy of the model and how the dataset was split and used. The dataset was divided into training, validation, and test data: the model was built from the training and test data, and the validation data was used to validate it. The two pie charts visualize the accuracy of the model. The first chart shows the training accuracy, which in this case is 99%; the model is very accurate on the example data it was created from. The second chart shows that the model predicted correctly 97% of the time on data it had not seen, which is also quite accurate. Since there isn't a large difference between the train and test accuracy, we can conclude that the model generalizes well.
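This train-versus-test comparison can be reduced to a simple sanity check. The 5% threshold below is an arbitrary illustration of the idea, not a MindsDB rule:

```python
def looks_overfit(train_acc, test_acc, max_gap=0.05):
    """Flag a model whose train accuracy far exceeds its test accuracy."""
    return (train_acc - test_acc) > max_gap

print(looks_overfit(0.99, 0.97))  # False: a 2% gap is within tolerance
print(looks_overfit(0.99, 0.70))  # True: the model memorized the training data
```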
In the second section, you can see the importance of each column. The score ranges from 0 to 10: a score close to zero means the column is of little use to the model, and the higher the number, the more important the column is.
The third section shows the confusion matrix, which describes the performance of the model on the test data. The confusion matrix for this example is quite simple because we have only two possible predicted classes (a binary classifier): 0 and 1. By hovering over the table values, you can get percentage insights comparing the model's predicted value with the actual test value. For example, the first column shows that the value 0 was correctly predicted by the model 94% of the time.
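A binary confusion matrix like this is straightforward to compute from raw predictions. A minimal sketch with illustrative labels (not the real model output):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes=(0, 1)):
    """Counts of (actual, predicted) pairs as a nested dict: matrix[actual][pred]."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in classes} for a in classes}

actual    = [0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1, 0]
print(confusion_matrix(actual, predicted))
# {0: {0: 2, 1: 1}, 1: {0: 1, 1: 3}}
```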
So far, we’ve previewed the quality of the data, trained the model and we know that the model is pretty good. Now let’s try to query it and get predictions back.
Step 7: Predictive Analytics
Go over to the Query tab and click New Query. In the New Query pop-up, you can add the data you want to make predictions for.
Since we are trying to find the presence of heart disease, let's imagine a healthy patient. What would the ideal characteristics be, among those we already have in the data?
Someone who is young and has low cholesterol, normal fasting blood sugar, a normal maximum heart rate, and so on.
Now, add all of this information in the query:
- age – 25 (young)
- sex – 0 (female)
- chol – 160 (total cholesterol; normal is less than 200 mg/dL)
- fbs – 0 (fasting blood sugar less than 120 mg/dl)
- thal – 3 (thalassemia, normal)
- thalach – 170 (maximum heart rate achieved; the maximum is estimated by subtracting age from 220, so 195 for this patient)
- exang – 0 (no exercise-induced angina)
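The query values can also be assembled programmatically, and the rule of thumb used above for maximum heart rate (220 minus age) is easy to encode. The dictionary below is just an illustration of the query data, not a MindsDB API call:

```python
def max_heart_rate(age):
    """Rule-of-thumb maximum heart rate: 220 minus age in years."""
    return 220 - age

# Query data for a hypothetical healthy 25-year-old patient.
query = {
    "age": 25,
    "sex": 0,       # female
    "chol": 160,
    "fbs": 0,
    "thal": 3,
    "thalach": 170,  # below the estimated maximum for this age
    "exang": 0,
}
print(max_heart_rate(query["age"]))  # 195
```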
After adding the data, click on Run Query. The results will be displayed in the Query section.
The MindsDB model is 93% confident that our patient shows no presence of heart disease. You can create additional examples by adding data for unhealthy patients and checking MindsDB's predictions for them.
We have seen how easy it is to go through the data analysis process using MindsDB Scout. It helps with the steps that normally require general data science expertise (feature engineering, data encoding, machine learning, data decoding, and interpreting results).
If you follow along with this tutorial using your own data, we would be happy to hear how MindsDB Scout has been useful to you. Feel free to join our community and share your experiences.