Turning My Excess Data into a Teaching Tool

Tim McAleer
7 min read · Feb 4, 2021

The Project

Data science projects can often be described as using 15 pounds of potatoes to make one plate of french fries. Creating a model is a process of iteration, of constant creation and retooling until you have the ideal combination of parameters to meet your goal. The work of model creation is very front-loaded: from exploring your data to engineering features to building and analyzing models, it can take a lot of work to arrive at one specific model.

All those potatoes, one plate of fries

My recent project used Natural Language Processing to analyze Tweets and build a model that predicts positive or negative sentiment from the words in the Tweets. An important step in model creation is to define the scope of the process. My partner, Mark Patterson, and I inspected the data and narrowed the project to experimenting with:

  • Count Vectorizing our words or using Term Frequency-Inverse Document Frequency
  • Lemmatizing our data or not
  • Using SMOTE on our data to deal with class balance or not
  • Three hand-picked modeling methods (Logistic Regression, Random Forest, or Multinomial Naive Bayes)

The nature of the project meant we cared most about recall on the negative class: using the text of Tweets to identify words with negative connotations, find which aspects of the products make consumers unhappy, and come up with suggested fixes. To do this, we took a scientific approach: run the model, record the results, and compare. Options for creating a model are multiplicative; with the above options, there are a total of 24 models producing 24 different results (2 × 2 × 2 × 3). This meant running 24 models and creating a spreadsheet to organize our results for quick reference.
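As a sketch, that option grid can be enumerated with `itertools.product`; the option names below are illustrative stand-ins, not the project's actual variable names:

```python
from itertools import product

# Hypothetical option lists mirroring the project's four decision points.
vectorizers = ["count", "tfidf"]
lemmatize = [True, False]
use_smote = [True, False]
models = ["logistic_regression", "random_forest", "multinomial_nb"]

# Every combination of choices is one model run to record in the spreadsheet.
combos = list(product(vectorizers, lemmatize, use_smote, models))
print(len(combos))  # 2 x 2 x 2 x 3 = 24
```

Iterating over `combos` is what turns 24 separate experiments into one loop with one results table.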

Our best-performing model turned out to use the works: TF-IDF, lemmatizing, SMOTE, and Multinomial Naive Bayes. It had the third-best overall recall score and a high overall accuracy to go with it. One interesting note: every combination of parameters that included SMOTE ended up favoring Multinomial Naive Bayes.
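A minimal sketch of that winning combination in scikit-learn, using toy stand-in data rather than the project's Tweets. Lemmatization and SMOTE are omitted here; SMOTE would normally slot in between the vectorizer and the model via imbalanced-learn's pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the real project used labeled Tweets.
texts = ["love this phone", "battery dies fast", "great screen", "terrible battery"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# TF-IDF features feeding Multinomial Naive Bayes, as in the best combination.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(texts, labels)
print(pipe.predict(["battery is terrible"]))
```

With real data you would score this with recall on the negative class, since that was the metric that mattered for the project.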

But What About the Other 23 Models?

I put a lot of hard work into developing that spreadsheet of statistics. Do I really have to throw that all away? No, not when an idea strikes me.

I decided to turn all that data we produced into a program one could use for exploring all the different combinations of models. The goal was to make it interactive and allow the user to make decisions about what to do with the database. It took some planning, but let’s go through it step by step.

Making Data Interactive

Step 1: Decide where the program starts.
Several steps were involved in getting the data into shape to be modeled: checking for missing values and duplicates, encoding labels for the target value (users’ feelings), dropping the rows with values we wouldn’t use (neither positive nor negative), and tokenizing the text. These steps all take some level of personal intervention, would be hard to functionalize, and make no difference to the end results, so the general EDA was finished and saved before the program begins.
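A rough sketch of that pre-program cleanup in pandas; the column names and label strings here are assumptions for illustration, not the project's actual schema:

```python
import pandas as pd

# Hypothetical raw Tweet data with sentiment labels.
df = pd.DataFrame({
    "text": ["I love it", "so buggy", "no opinion", "works great"],
    "sentiment": ["positive", "negative", "neutral", "positive"],
})

# Drop duplicates, keep only positive/negative rows,
# then encode the remaining labels as 1/0 for modeling.
df = df.drop_duplicates()
df = df[df["sentiment"].isin(["positive", "negative"])].copy()
df["target"] = (df["sentiment"] == "positive").astype(int)
print(df[["text", "target"]])
```

Saving the cleaned frame to disk at this point is what lets the interactive program start from a known-good state.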

Step 2: Functionalize the working parts.
For a clean, reproducible program, it’s easiest to turn as many of the steps along the way as possible into defined functions and call them when needed. All the decision points, from choosing Count Vectorization or TF-IDF to picking a model, should be broken down into their components and given their own functions. Then, when the user makes a decision, it’s as easy as calling a pre-defined function.
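One common way to wire this up is a dispatch table mapping each choice to a factory; the names below are illustrative, not the project's actual functions:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

# One entry per decision point; the user's answer selects the factory.
VECTORIZERS = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
MODELS = {
    "logistic regression": LogisticRegression,
    "random forest": RandomForestClassifier,
    "multinomial nb": MultinomialNB,
}

def build(vectorizer_choice, model_choice):
    """Instantiate the components the user picked."""
    return VECTORIZERS[vectorizer_choice](), MODELS[model_choice]()

vec, model = build("tfidf", "multinomial nb")
print(type(vec).__name__, type(model).__name__)
```

With this layout, adding a new option is just another dictionary entry rather than another branch in the main flow.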

Functionalize the moving parts

Step 3: Create the decision points.
Here it helps to be organized. We print questions for the user to answer and save their inputs as answers. Testing starts to matter here: text formatting often requires editing, and proper spacing keeps the prompts clean and readable.
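A small sketch of such a decision point; making the input source injectable (the hypothetical `reader` parameter below) keeps the prompt easy to test without typing at it:

```python
def ask(question, options, reader=input):
    """Print a question and loop until the answer matches one of the options."""
    prompt = f"{question} ({'/'.join(options)}): "
    while True:
        answer = reader(prompt).strip().lower()
        if answer in options:
            return answer
        print(f"Please choose one of: {', '.join(options)}")

# In the real program `reader` defaults to input(); a stub stands in here.
choice = ask("Count Vectorize or TF-IDF?", ["count", "tfidf"],
             reader=lambda _: "tfidf")
print(choice)
```

The validation loop also handles stray capitalization and whitespace, which is exactly the kind of formatting issue that shows up once real users start typing.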

Step 4: Write the code.
This step requires the most creativity. We have some variables in the form of the user responses to questions, but what do we use them for? What order does the code need to be run in? What variables do our functions create and how are they used by other functions? What other variables are needed, and how do we define and manipulate them to do what we want? The important factor to remember is what we want the end result to look like, and how we get there.

Step 5: Testing, testing, testing.
The bigger your code gets, the more complex it is. The more complex, the greater chance for error. The best way to ensure a working program is to test frequently. There are many natural break points in writing your code and at each one you want to make sure everything works like it should up to that point. Stopping to test your code will save time in the end, as waiting until the code seems complete can create a tangled knot of errors that takes excessive time to untangle.
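For example, a few cheap assertions at each natural break point catch a broken stage before the next one compounds it; this is a generic sketch, not the project's actual tests:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in data for the smoke test.
texts = ["good screen", "bad battery", "good battery"]
labels = [1, 0, 1]

# Break point 1: data is consistent before vectorizing.
assert len(texts) == len(labels), "text/label mismatch"

# Break point 2: vectorization produced one row per document.
X = CountVectorizer().fit_transform(texts)
assert X.shape[0] == len(texts), "lost documents during vectorization"

print("all break points passed")
```

Running a check like this after each stage is far cheaper than debugging the whole pipeline at once.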

Step 6: Review your project so far.
Keep the end goal in mind, but keep a critical eye. Is it doing what you want it to do? Can anything be improved? Are there any new features that could improve the work you’ve done? This is also the step where you go back to earlier steps and reexamine and rework your code.

My Modeling Project

The first time I reached step 6, I had what could be described as a Minimum Viable Product: I asked the user four questions and returned the statistics and confusion matrix of the modeling approach the user had assembled. But was that enough? It still felt a little lacking. My first addition was a recommendation system. I looked at my spreadsheet of model statistics and broke it down into the eight combinations of options that come before choosing a model. For each combination, I picked my favorite working model. Now, after the user answers the three questions about manipulating the data, I insert one more choice: would they like to select a model, or would they like to see the results of the model I’ve picked out?

This turned out to be a good bit of work. A new variable, a list, was needed to keep track of the user’s selections. Three lists of lists were created, one for each model, and the program adds a check: it takes the list built from the user’s choices and finds the list of lists that contains it, each one corresponding to the model of my choosing. Finally, I added an initial question giving the user the option to skip the decisions entirely and just see my chosen model.
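An illustrative reconstruction of that lookup. The specific combination-to-model assignments below are invented, except that every SMOTE combination maps to Multinomial Naive Bayes, as noted earlier:

```python
# Each model's "recommended for" combinations are a list of choice-lists.
logreg_combos = [["count", "lemma", "no smote"]]
forest_combos = [["count", "no lemma", "no smote"],
                 ["tfidf", "no lemma", "no smote"]]
nb_combos = [["count", "lemma", "smote"], ["count", "no lemma", "smote"],
             ["tfidf", "lemma", "smote"], ["tfidf", "no lemma", "smote"],
             ["tfidf", "lemma", "no smote"]]

def recommend(choices):
    """Find which model's list of lists contains the user's choice list."""
    if choices in logreg_combos:
        return "Logistic Regression"
    if choices in forest_combos:
        return "Random Forest"
    if choices in nb_combos:
        return "Multinomial Naive Bayes"
    return None

print(recommend(["tfidf", "lemma", "smote"]))
```

The three lists together cover all eight data-prep combinations, so every possible choice list lands on exactly one recommendation.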

Upon reviewing my program again, something else felt missing. My questions must all seem like nonsense to someone who isn’t steeped in Natural Language Processing; most people’s first response to “Would you like to Count Vectorize or use TF-IDF?” would be pure confusion. And for those in the know, a little refresher wouldn’t hurt. So a new option was included for each choice: a simple little explanation of what that choice does. This turned out to be more difficult than expected and created much more work. A new variable was needed to create a loop that returns to the question after the explanation is shown, and the recommendation system had to be limited so it wouldn’t record the explanation options, or it would break the whole system.
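A sketch of that loop, with a hypothetical `ask_with_help` function: answering "explain" prints the explanation and returns to the question, without the explanation choice ever being recorded as an answer:

```python
EXPLANATION = ("Count Vectorizing counts raw word occurrences; TF-IDF "
               "down-weights words that appear in many documents.")

def ask_with_help(question, options, reader=input):
    """Loop back to the question after printing an explanation."""
    while True:
        answer = reader(f"{question} ({'/'.join(options + ['explain'])}): ")
        answer = answer.strip().lower()
        if answer == "explain":
            print(EXPLANATION)
            continue  # return to the question instead of recording "explain"
        if answer in options:
            return answer

# Simulated session: user asks for the explanation, then answers.
replies = iter(["explain", "tfidf"])
print(ask_with_help("Count Vectorize or TF-IDF?", ["count", "tfidf"],
                    reader=lambda _: next(replies)))
```

Because only answers from `options` are ever returned, the recommendation lookup never sees the explanation option and the choice lists stay intact.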

With brief explanations

I have several ideas for expanding the program. Functionalizing the data exploration and cleaning would be a big step, bringing the program closer to being able to take in any dataset rather than just the data I imported and cleaned. But it’s important to keep the goal of the project in mind, and I am very satisfied with where it ended.

Overall, I’m very pleased with my program. I turned what would normally be hard-earned but discarded work into a platform for exploring data modeling and how each choice affects the final product. It became an educational tool, helping people like me who are learning Natural Language Processing see its effects in real time, with short explanations of what’s happening along the way. Finally, it gets to become a blog post, further spreading what I learned. I hope it helps everyone who takes the time to read it.

Special thanks to Mark Patterson, for all the hard work as partner in the NLP project and Yish Lim, for guidance and patience.
