Wrangling MailChimp Audience Data — transforming disjointed metrics into a concise, clean dataset ready for analysis!

Organizing a bunch of unstructured data into a structured form may seem like a challenge without understanding how to automate and streamline the process. The purpose of this post is to inform Mailchimp campaign makers on how to turn the messy, disjointed csv’s Mailchimp offers freely, into a clean, concise .csv dataset with the Python Pandas library, along with Numpy, without having to install plugins or purchase any service upgrades or external software. It’s helpful if you’re familiar with Python, but even if you have No Experience with Python, so long as you’re willing to spend 30 minutes to download Python 3, Jupyter Notebooks, and the Pandas and Numpy library, ideally through the Anaconda Distribution mentioned in the Jupyter link. Doing this, then using “pip install numpy” then “pip install pandas” to connect these libraries, should enable you to follow along with my project’s code hosted Here, on my Github repository. For further info and context on this project, refer to the attached ReadMe file.


By following along, you can utilize my email and social media analytics project, to learn how to combine metrics for each user into a row alongside the campaign’s overall metrics, and generate powerful, flexible analytics and visualizations with various other Python libraries. Again, even if you know very little about Python, you can use what I have done with my project so far as a platform to run a couple of visualization library commands and make some visualizations for your own data, as well as anonymize your data, creating a randomized key associated with each subscribers email, that you can keep in a secure place, so you too can share this data publicly without infringing upon your audience’s privacy.

Some background on what made me write this code — I have a set of campaigns sent to over 1,500 followers over the past 6 months for my nonprofit collaborative arts group, Lightsource. I realized that by not enabling Google Analytics for our first few communications, I had short-changed myself and my organization by losing this opportunity to collect live analytics on individual subscribers that were easy for a nontechnical person to navigate, as these email campaigns were deployed, sent, and read. Our metrics of interest are opens, clicks, conversions (or subscriptions to a service we provide), unsubscribes, delivery rate/bounces, and revenue generation. Going forward, it’s in your interest to connect an analytics platform to your marketing campaign, and there’s pretty clear documentation on how to do so with MailChimp and Google Analytics here!

Collecting Data

The very first part is easy — fortunately, the CSV for the Overall Campaign Data is already available as a clean, concise, data frame. Download it from “Download All Reports” in your Reports section in your interface, and there you have it! The hard part is, as I’ve stated, the disjointed metrics for each audience member, usually just one column for each metrics. The pandas library, fortunately, enables us to merge them together. Download each of the different metrics for each campaign by going to “View Reports” and clicking under activity, which should look something like this :

click on each “activity”, your metrics, and download them via “Export as csv”, organize the related csv’s along with your Python notebook file for the campaign into folders separated by campaign. For each campaign, after importing Pandas,you will load your data as indicated in the next step, replacing my local csv directories with the directories of your csv’s.

Anonymization Initiation

After saving all emails as a set, the numpy library is imported, to perform a method that generates a numeric email_id that references a specific audience member, that will remain in your dataset after all the sensitive data is purged. The original index is then deleted, so it cannot be used by someone who has access to your original CSV’s provided by Mailchimp to trace information back to specific users, which could constitute a breach in their privacy — something we all know is a worst case scenario in data management from the likes of Cambridge Analytica and Equifax! Your job as the newly designated systems security person as of right now is to hide the key attached to this sensitive info in a SAFE PLACE inaccessible to persons who haven’t been authorized to have access to all this data.


The following steps are now where you take all your data, and shape it in a way that it can all be put together! For example, your No Opens metric is fit together with the Opens metric into one column, where No Opens has an Opens value imputed of “0”. Since for a specific campaign, a subscriber cannot both Open and Not Open an email, there are no duplicates. Follow along for all our other metrics. Your campaign data metrics are all merged together, each row for each of your subscribers gets a campaign ID number unique to this campaign, and the email_id is put in place of all your sensitive data. Your secured data frame is saved to a CSV, and there you have it, your cleaned, anonymized data frame for this email campaign!

Rinsing and Repeating

You repeat this exact process, in another folder for each of your email campaigns, to achieve the same results. Then, you will

Bring It All Together

Under All_Campaigns_Cleaned is the Python file Dataframe_Cleaning_All, where a similar process to the above is performed upon the email lists created in the Anonymization Initiation step. These dataframes are combined with the clean Overall Campaign Data Mailchimp graciously gives us in full without having to go through this tedious ordeal. Load in your email lists in place of the local directories I have, along with your Overall Campaign Data, which will be attached to each of the anonymized audience members individual data concerning that campaign, and, Finally, saved in your Complete_Secure_Dataset.csv

Exploratory Data Analysis & Modeling

You can peak at and work from what I did in my Completed_Data_EDA, see some of my methods for navigating this sparkly new dataset you have and check out the plentiful documentation for each of the libraries I use to modify it to your own needs. The vision behind this project was to conduct Natural Language Processing on my subject headers and email bodies, to determine if certain words, phrases, and writing conventions were associated with the metrics we examined — You can click here for the next blog piece about visualizing and modeling your data!

I welcome anyone with questions on my process, those who need a bit of guidance, or who otherwise wish to discuss further, to contact me diffusesingularity@gmail.com in the meantime.

Operating at the intersection of Technology, Ecology, and Arts

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store