Why every data scientist using pandas needs Modin — Bringing SQL to Dataframes
Table of contents
While recently speaking with a data scientist friend from the RiseLab in Berkeley who primarily operates in the pandas API using Modin. She mentioned that she was trying to solve a problem for a client who required her to write a query in SQL that she would normally write in pandas. As she was looking on StackOverflow, she noticed that there is a whole world of questions around “What is the SQL equivalent of this pandas query?” and vice-versa.
She also noted that sharing code and notebooks with other data scientists in her company was difficult when they were not comfortable with the pandas API. If her colleague had a follow-up question, they would either need to go back and forth with questions and answers or schedule a meeting to figure it out in-person with one person at the keyboard. Her colleagues could not just run queries in their preferred language without rewriting the entire notebook.
Data Scientists are increasingly required to do and learn more, but tools have largely lagged supporting all of these new requirements. Moving between languages and computing environments is expensive and costs data scientists hours of productivity every week.
To improve data science productivity, MindsDB has teamed up with Modin to bring SQL to distributed Modin Dataframes. Now you can run SQL alongside the pandas API without copying or going through your disk. What this means is that you can now have a SQL solution that you can seamlessly scale horizontally and vertically, by leveraging the incredible power of Ray.
Presenting modin.sql
Here is a summary by example:
Imagine you have data about reviews for apps in the google store, this information is in two tables, one for the store apps and another one for the reviews, that you can join by the app column, lets start with the apps table:
Imagine that you want to quickly select from ‘gstore_apps_df’ the columns App, Category, and Rating, where Price is ‘0’.
In SQL, this looks like this:
"SELECT App,Category,Rating FROM gstore_apps WHERE Price = '0' "
However, for many of us, the solution to many of these problems often starts like this.
In the end, you get something like this:
Which makes sense if Pandas is your way of doing these tasks. But, for those of you that know some SQL, we want to introduce an in-memory SQL engine that operates on Dataframes, so you can have more options when it comes to using the incredible power of distributed dataframes of Modin.
The function to access that engine is called “modin.experimental.sql.query”
The in-memory SQL engine for data-frames, allows you to run complex queries. You can in a very explicit SQL statement, perform operations such as joins and aggregations.
Let’s bring the other table (reviews) to illustrate the powers of SQL on Dataframes:
Imagine that you want to know what are the best-reviewed app categories where there is little subjectivity: Get the top 10 app categories ranked by best average ‘sentiment_polarity’ where the average ‘sentiment_subjectivity’ is less than 0.5.
Since ‘Category’ is on the gstore_apps_df and sentiment_polarity is on gstore_reviews_df this requires that we join the two tables, and operate averages on that join.
You can solve this by doing it all in one single SQL query:
Or, you can bring the best of doing this in python and do it in parts (it’s up to you), but we believe that this certainly gives you more powers than if you were to do this in say Redshift.
The crazy thing here is that if you have a cluster or even a computer with more than one core, you can write SQL and Modin will run those queries in a distributed and optimized way. You can think of Modin + SQL as an Open-source alternative to Snowflake.
In our next article, we would like to present some benchmarks of running SQL on Distributed Modin Dataframes vs some SQL databases and Data-lakes. Thanks to our friends at Intel, that have provided us with some fancy computers where we can run Modin on many cores with lots of Memory.
In the meantime, you can check out, our Notebook with more examples and ideas https://github.com/mindsdb/dfsql/blob/stable/testdrive.ipynb
Special thanks to the Modin and MindsDB team, Boris Tseitlin, the Rise Lab UC Berkeley, and Intel.