Applying Data Science to MailChimp & Mass Communications Data — Exploring, Visualizing, Modeling Key Performance Metrics
From my undergraduate studies in psychology and public policy, through my experiences developing a nonprofit, ecology-oriented retreat center, interpreting research and its associated data has been key to my decision-making process, informing actionable policies on several campaigns aimed at empowering people to live more fulfilling, healthy, meaningful lives. To me, it is important that we acquire the knowledge that enables us to live in ways that genuinely satisfy and empower us, while making room for others to do the same.
So here I am, with that driving philosophy in mind, tinkering with a set of email newsletters sent to over 1,500 followers of my nonprofit group, one of my ongoing efforts to connect people to novel experiences and encourage creative collaboration.
The problem statement I sought to address in this project applies my nonprofit's communications data to a more generalizable question — How can Natural Language Processing (NLP) and other data science techniques be applied to mass email communications, specifically to generate statistical models that offer insight into improving outreach metrics, in particular how subject titles impact maximizing open rates and minimizing bounced emails?
This post picks up from my prior blog post on preparing Mailchimp data into a clean, concise dataset for visualization, analysis, and modeling. If you follow along, a non-Python user can figure out how to transform disjointed Mailchimp data into a clean, concise dataframe, and those familiar with Python and the Pandas library could easily apply this exercise to other datasets. One caveat worth mentioning: there is almost certainly better code to be written that extracts the metrics we want in our database directly via Mailchimp's APIs. The documentation can be found here, and I'll be exploring this further in the future and updating this post accordingly. Having created a complete, secure dataset using the steps in my prior blog post, you can explore, visualize, and model along with my work on my GitHub here.
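For a taste of what that API route could look like, here is a minimal, untested sketch using the requests library against Mailchimp's v3.0 reports endpoint. The API key is a placeholder, and the exact fields you pull may differ from what your analysis needs —
import requests

# Sketch: pull campaign reports straight from Mailchimp's v3.0 API.
# 'YOUR_API_KEY-us19' is a placeholder; the data center ('us19' here)
# is whatever follows the dash in your own key.
API_KEY = 'YOUR_API_KEY-us19'
dc = API_KEY.split('-')[-1]
url = 'https://{}.api.mailchimp.com/3.0/reports'.format(dc)

# Mailchimp accepts HTTP Basic auth: any username, with the key as the password
response = requests.get(url, auth=('anystring', API_KEY), params={'count': 100})
for report in response.json()['reports']:
    print(report['subject_line'], report['opens']['open_rate'], report['bounces'])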
Exploring and Visualizing Our Data
We can start off having some fun by generating a WordCloud! The quickest, most obvious thing to do is to explore our Pandas DataFrame; there's plenty of documentation on the ins and outs for those new to Pandas here. You can sort the data with various methods including .head(), .tail(), .min(), .max() and so on, along with the .sort_values() method, to get an idea of the extremes, averages, sums, etc. in your data. But you can almost as easily generate a WordCloud by downloading this library, following the readme file, and running
pip install wordcloud
then
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
This goes along with all the other libraries I imported at the beginning of the Python notebook. After loading the prerequisite libraries stated in the documentation, you must put your subject title terms into a single string, accomplished with this basic for loop —
Subject_string = ''
for d in Complete_secure_df["Subject"]:
    Subject_string += d + ' '  # trailing space keeps adjacent subjects from running together
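As an aside, the same string can be built in a single line, which handles the spacing for you —
Subject_string = ' '.join(Complete_secure_df["Subject"])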
Now your string of subject words, "Subject_string" (or whatever you choose to name it), can be fed into this code, which filters out extraneous terms like "and" and "the" using NLTK's English "stopwords" list —
from nltk.corpus import stopwords  # supplies the English stopword list used below

wordcloud = WordCloud(stopwords=stopwords.words('english'), max_font_size=50, max_words=80, background_color="black").generate(Subject_string)
plt.figure(figsize=(16, 16))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Subject Headlines wordcloud")
plt.show()
Then BOOM! Your WordCloud should appear, something like the header image I chose from my data, or this one I generated for a subreddit classification project!
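If you'd like to keep the image outside the notebook, say for a blog header, the WordCloud object can also write itself straight to disk —
wordcloud.to_file("subject_wordcloud.png")  # saves the rendered cloud as a PNG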
Visualizations
The quickest way to ascertain relationships between your variables is applying the .corr() Pandas method to the dataset, which computes the correlations of all variable pairs in an XY grid. It'd be even more helpful if we assigned some colors to that grid! Just plug the .corr() output into a visualization made via Seaborn, shaped with matplotlib (the first line solely enlarges our heatmap for readability) —
plt.subplots(figsize=(14, 14))
sns.heatmap(Complete_secure_df.corr(),
            cmap='coolwarm',
            annot=True).set_title("Our Heatmap!")
Brilliant! Most find the colors super helpful in spotting relationships, and the color spectrum can be adjusted by setting 'cmap' to a different palette. Depending on your data, this could be overwhelming; maybe you just want to know the relationship of every other variable to the dependent variable of interest in your experiment. If so, we tweak the code to specify that's the relationship we wish to visualize —
sns.heatmap(Complete_secure_df.corr()[['Successful Deliveries']].sort_values('Successful Deliveries'),
            annot=True).set_title("Relationship of Successful Deliveries with Metrics")
And out pops something like this! Again, ‘plt.subplots(figsize = (number, number))’ can adjust the size of this heatmap as needed.
You can explore the rest of my Python notebook to see the other visualizations I created, predominantly with Seaborn. It is a super easy and powerful visualization library, accessible even to beginner Python users and non-coders who are willing to explore the documentation.
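For instance, a single Seaborn line will draw a scatterplot for every pair of numeric columns at once, a quick way to eyeball the relationships the heatmap summarized —
# pairplot automatically skips non-numeric columns like 'Subject'
sns.pairplot(Complete_secure_df)
plt.show()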
Modeling Our Data
You may see my efforts at generating predictive models for my data here, though for my more ambitious goal of applying Natural Language Processing to this work, I must admit I haven't yet achieved what I was hoping for. Applying NLP to email campaigns in this manner requires far more data than the ~200 words in my subject titles, which was not enough to surface significant associations between words and email metrics. This may prove more fruitful if I include the words from each email body, by sheer volume of data. My effort at classifying whether an email bounced or not using NLP on the email titles fared a bit better, but it still stands that NLP simply requires a larger bag of words. With a dataset this size, you're better off relating your other metrics to one another in regression or classification models.
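Still, to make that bounce-classification attempt concrete, here is a stripped-down sketch of the bag-of-words approach. Note that the 'Bounces' column name is an assumption here, so substitute whichever bounce metric your dataframe carries —
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# bag-of-words features from the subject lines
X = CountVectorizer(stop_words='english').fit_transform(Complete_secure_df['Subject'])
# hypothetical binary target: did the campaign bounce at all?
y = (Complete_secure_df['Bounces'] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_train, y_train), model.score(X_test, y_test))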
For those with some more Python experience, or an interest in taking machine learning head on, I leave you with this powerful pair of functions, which can be modified to conduct all sorts of train-test-split-oriented machine learning models! These are the basic, stripped-down versions of the code included in my modeling notebook. You can alter the train-test split, the transformers, the statistical outputs, and the model methods you wish to use.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np

def regression_machine(estimator, X, y):
    # X and y are your feature matrix and continuous target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    # fit the scaler on the training set only, then apply it to both splits
    ss = StandardScaler()
    ss.fit(X_train)
    X_train_scaled = ss.transform(X_train)
    X_test_scaled = ss.transform(X_test)
    est = estimator()  # estimator is passed as a class, not an instance
    est.fit(X_train_scaled, y_train)
    train_score = est.score(X_train_scaled, y_train)  # training R^2
    r_sq = est.score(X_test_scaled, y_test)           # testing R^2
    preds = est.predict(X_test_scaled)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(train_score)
    return "our R Squared is {} and our RMSE is {}".format(r_sq, rmse)
Or, for classification-oriented models, where your y variable is discrete rather than continuous —
def classification_machine(estimator, X, y):
    # X and y are your feature matrix and discrete (class label) target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    ss = StandardScaler()
    ss.fit(X_train)
    X_train_scaled = ss.transform(X_train)
    X_test_scaled = ss.transform(X_test)
    est = estimator()  # estimator is passed as a class, not an instance
    est.fit(X_train_scaled, y_train)
    # for classifiers, .score() returns accuracy rather than R^2
    train_acc = est.score(X_train_scaled, y_train)
    test_acc = est.score(X_test_scaled, y_test)
    print(train_acc, test_acc)
    return "our training accuracy is {} while our testing accuracy is {}".format(train_acc, test_acc)
If any of this compels you to discuss it further with me, or you feel like you could use some help applying anything I spoke of to your own project, don't hesitate to reach me at elikhtig@gmail.com!