Data Science - This Is How We Do It
Although the amount of data collected grows every day, the amount of useful information contained in that data is comparatively small (Fig. 1), and a human is usually required in the loop to ask the right questions and perform data science [1,2]. At Bunchball, we have the biggest gamification data warehouse, containing "actions taken," "missions completed," "leaderboards viewed," and more, all collected from hundreds of our clients. At Bunchball Labs, we use this vast amount of data to extract meaningful information and to understand and predict human behavior.
There are a number of guidelines on the Internet about how to perform data science [3,4]. Below is what I have learned by playing in the field for a number of years (with a special focus on gamification). Fig. 2 summarizes the steps described below.
1. Understanding product, client and users: In a B2B SaaS company such as Bunchball, we need to understand not only our product but also how each client has applied our product to their use-case. At Bunchball Labs, we continuously interact with account managers and digital strategists to understand each client’s use-case and the behavior each client is striving to drive. We do not use a cookie-cutter model, since each of our clients is different. Once we have a good understanding of the client’s use-case, we move on and begin exploring the data.
2. Exploratory data analysis: This is a kind of sanity check for the data. Before we explore a particular data set, we first check whether it contains enough information to draw statistically significant conclusions. We call this the exploratory data analysis phase. At Bunchball Labs, before we take a deep dive into the data, we check whether enough challenges and actions have been completed in the client's gamification experience. Other common items on the exploratory data analysis checklist are:
a. Calculating both the mean and median of activities: It is important to compute not only the total and mean of the activities but also the median. Actions on the web usually follow a long-tail distribution. Fig. 4 shows the histogram of Salesforce Chatter usage for an enterprise gamification use case. The x-axis is the number of users and the y-axis is the frequency of Salesforce Chatter posts. The median for the graph is 24 and the mean is 63, almost 3 times the median. So, calculating both the mean and the median helps us get a sense of the data distribution and detect the presence of outliers.
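The gap between mean and median is easy to check in a few lines of Python. The sketch below uses made-up per-user post counts, chosen only so that the summary matches the median of 24 and mean of 63 quoted above; it is not the client data behind Fig. 4, and the name summarize is just for illustration.

```python
import statistics

# Made-up per-user post counts (NOT real client data). A couple of heavy
# posters create the long tail that pulls the mean above the median.
posts_per_user = [2, 5, 8, 12, 20, 24, 24, 30, 41, 55, 120, 415]


def summarize(counts):
    """Return the mean, the median, and their ratio for a list of counts."""
    mean = statistics.mean(counts)
    median = statistics.median(counts)
    return {"mean": mean, "median": median, "mean_to_median": mean / median}


summary = summarize(posts_per_user)
print(summary)
```

A ratio well above 1 is a quick hint that a few very active users dominate the total, so the mean alone would overstate what a typical user does.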
b. Clustering: Clustering the data along different dimensions can be a good precursor to asking specific questions of the data. Clustering also helps you understand whether there is any natural grouping within the data. Fig. 5 shows k-means clustering, computed in R, for a client's Enterprise (Salesforce) gamification experience with the number of clusters (denoted by k) equal to 3. The features here are the different types of challenges users take on in Salesforce, and the goal is to understand whether there are natural groups. The greater the separation between the different colors of the bubbles, the better the discrimination along those features. In Fig. 5, we see good discrimination along certain dimensions and poor discrimination along others. Fig. 5 is just one way to view the results; there are many other ways to analyze a clustering.
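For readers who want to try the idea without R, here is a minimal k-means sketch in plain Python with k = 3. It is a stand-in, not the R analysis behind Fig. 5. The user vectors are invented (each tuple counts three hypothetical challenge types), and the deterministic farthest-point initialization is my own choice to keep the example reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centroids(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids


def kmeans(points, k, max_iter=50):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # nothing moved, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Invented users: counts of three hypothetical challenge types per user.
users = [
    (9, 1, 0), (8, 2, 1), (10, 0, 1),
    (1, 9, 0), (0, 8, 2), (2, 10, 1),
    (1, 0, 9), (0, 2, 8), (1, 1, 10),
]
labels, centroids = kmeans(users, k=3)
print(labels)
print(centroids)
```

On real data the groups are rarely this clean, so it is worth trying several values of k and inspecting the clusters along different feature pairs, as in Fig. 5.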
3. Forming hypotheses: After performing exploratory data analysis, we start forming hypotheses. For example, some of the questions we turn into hypotheses are:
- Which first few actions do users with higher usage take?
- Do the users who take more video actions also engage in more social activity?
- Are people who receive more comments/likes on Salesforce Chatter also better performers in their jobs?
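A question like the second one can be given a first, rough check with a correlation coefficient. The sketch below computes Pearson's r by hand on invented per-user counts; the numbers and variable names are assumptions for illustration, not client data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented per-user counts (one entry per user, same user order in both lists).
video_actions = [0, 1, 2, 4, 7, 10]
social_actions = [1, 0, 3, 5, 6, 12]

r = pearson(video_actions, social_actions)
print(f"Pearson r = {r:.3f}")
```

A high r only says the two activities move together. It does not show that one drives the other, and on long-tailed data a rank-based measure may be a more robust choice.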
4. Data massaging and reformatting: Data is usually stored by an engineer in a data warehouse and is rarely in the format we want. So pulling the data out and massaging it into a format our algorithm can accept is an important step. I use Python scripts to do this, but any processing tool can be used.
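As an illustration of this step, the sketch below turns a raw export of user actions into per-user category counts and writes them as sparse LibSVM-style lines (a label followed by index:value pairs with 1-based indices). The CSV layout, column names, user IDs and labels are all hypothetical; a real warehouse export will look different.

```python
import csv
import io
from collections import Counter, defaultdict

# Feature order; position in this list (1-based) becomes the feature index.
CATEGORIES = ["social", "video", "game", "mobile", "registration"]

# Hypothetical raw export: one row per action a user took.
RAW_EXPORT = """user_id,category
u1,registration
u1,social
u1,social
u2,registration
u2,video
"""


def count_actions(csv_text):
    """Aggregate raw action rows into a per-user Counter of categories."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["user_id"]][row["category"]] += 1
    return counts


def to_libsvm_line(label, counter):
    """Format one user as 'label index:value ...', skipping zero features."""
    feats = [
        f"{i}:{counter[name]}"
        for i, name in enumerate(CATEGORIES, start=1)
        if counter[name]
    ]
    return " ".join([f"{label:+d}"] + feats)


counts = count_actions(RAW_EXPORT)
# Labels here are placeholders; in practice they come from a separate
# definition of which users count as long-term (+1) or short-term (-1).
print(to_libsvm_line(+1, counts["u1"]))
print(to_libsvm_line(-1, counts["u2"]))
```

Keeping the category order in one list makes it harder for the feature indices to drift between the export script and whatever reads the file later.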
5. Running experiments: “Running experiments” is the meat of the whole process, but it actually takes the least amount of time if you are not writing much of your own code. Packages like R, LibSVM, etc. already exist in a stable form and are used by a large number of researchers.
We used LibSVM to identify the characteristics of long-term vs. short-term users for the gamification program of one of our clients. We grouped the user actions into five categories: social, video, game, mobile and registration. We treated the problem of identifying long-term vs. short-term users as a classification problem, with long-term users belonging to class +1 and short-term users belonging to class -1, and with each user's first 10 actions as the input. We found that the most desirable first actions for a user are Social, Video, Game, Mobile and Registration (in order of importance). This order was determined by the value of each coefficient in the SVM model, as shown in Table 1. A sample SVM model with separating hyperplanes is shown in Fig. 6.
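To make the idea of ranking features by SVM coefficient concrete, here is a small self-contained sketch. It is not LibSVM and does not reproduce Table 1: it trains a linear SVM from scratch with full-batch subgradient descent on the hinge loss, so that it runs without external packages. The data is invented, and encoding each user as the count of each category among their first 10 actions is my assumption about the input format.

```python
CATEGORIES = ["social", "video", "game", "mobile", "registration"]

# Invented data: each row counts the categories of a user's first 10 actions.
LONG_TERM = [[6, 3, 0, 0, 1], [5, 4, 0, 0, 1], [7, 2, 0, 0, 1], [5, 3, 1, 0, 1]]
SHORT_TERM = [[0, 0, 5, 4, 1], [0, 1, 4, 4, 1], [1, 0, 4, 4, 1], [0, 0, 3, 6, 1]]

samples = LONG_TERM + SHORT_TERM
labels = [+1] * len(LONG_TERM) + [-1] * len(SHORT_TERM)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(samples, labels, lr=0.01, lam=0.01, epochs=200):
    """Subgradient descent on L2-regularized hinge loss (no bias term)."""
    n, dim = len(samples), len(samples[0])
    w = [0.0] * dim
    for _ in range(epochs):
        # Gradient of the regularizer (lam / 2) * ||w||^2.
        grad = [lam * wi for wi in w]
        for x, y in zip(samples, labels):
            if y * dot(w, x) < 1:  # inside the margin or misclassified
                for d in range(dim):
                    grad[d] -= y * x[d] / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def predict(w, x):
    return 1 if dot(w, x) > 0 else -1


w = train_linear_svm(samples, labels)
weights = dict(zip(CATEGORIES, w))
accuracy = sum(predict(w, x) == y for x, y in zip(samples, labels)) / len(samples)

# Rank categories from most long-term-leaning to most short-term-leaning.
ranking = sorted(CATEGORIES, key=lambda c: weights[c], reverse=True)
print("training accuracy:", accuracy)
for name in ranking:
    print(f"{name:>12}: {weights[name]:+.3f}")
```

A positive coefficient pushes a user toward the long-term class and a negative one toward the short-term class, so the sorted list plays the same role as the coefficient ordering described above. Coefficients are only comparable like this when the features are on similar scales, which holds here because every feature is a count out of 10 actions.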
6. Communicating your results: Once we have the results we were looking for, we communicate them to the Product Managers, Engineers and Strategists, because these people bring our ideas to life. For example, the SVM results we obtained in Step 5 were used to design the next phase of digital strategy for our clients.
7. Iterating (Improve or die): Lastly, we make it a point to learn from our current data models and keep improving on them.
One question you might ask after reading this is: "How can such a system be scaled?" Well, you can only scale a system once you have a good understanding of it. Work done without understanding is very likely to fail. Our strategy at the Labs is to understand first and scale next. Also, once we identify clients with similar use-cases, we run the same experiments on them.
"A journey of a thousand miles begins with a single step." We have taken the first step towards building a highly personalized gamification experience for our clients and are exploring new frontiers every day.