Sales Data Analysis using Dataiku DSS

Overview

Dataiku Data Science Studio (DSS), a complete data science software platform, is used to explore, prototype, build, and deliver data products. It significantly reduces the time taken by data scientists, data analysts, and data engineers to perform data loading, data cleaning, data preparation, data integration, and data transformation when building powerful predictive applications.

It makes it easy to explore data and perform data cleansing. It supports dataset connections such as Filesystem, FTP, HTTP, SSH, SFTP, cloud storage (S3), PostgreSQL, MySQL, Hadoop (HDFS), Oracle, MS SQL Server, analytic SQL databases (Vertica, Greenplum, Redshift, Teradata, and Exadata), and NoSQL stores (MongoDB, Cassandra, and Elasticsearch).

In this blog, let us discuss data cleansing, data transformation, and data visualization of a financial company’s sales data using Dataiku DSS.

Prerequisites

Download and install Dataiku DSS version 4.0.4 on Ubuntu from here.

Importing Dataset

To import a dataset into Dataiku DSS, perform the following:

  • Open Dataiku DSS.
  • Create a new Project.
  • Click Add New Dataset and then click Add a File to upload a new dataset.
  • Choose the required Filesystem and click Preview to view the added file.
    The dataset looks similar to the one below:

[Image: preview of the imported sales dataset]

The storage type and meaning of each column are automatically detected from the column’s content, where the “meaning” is a rich semantic type. For example, DSS automatically detects that a column contains email IDs and sets its meaning to “E-mail address”.

Data Cleansing

Analyze the meaning of each column to explore the data and perform data cleansing.

For example, the E-mail address column has Valid, Invalid, and Empty values, as shown in the diagram below:

[Image: analysis of the E-mail address column showing valid, invalid, and empty values]

Apply a filter to remove invalid email IDs.

For example, the Price column has both plain integer values and values containing commas (,), as shown in the diagram below:

[Image: analysis of the Price column]

Apply a filter to remove the values containing commas, as shown in the diagram below:

[Image: filter removing Price values that contain commas]
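
For readers who prefer to see the equivalent logic as code, here is a minimal R sketch of the two cleansing steps, assuming the sales data has been exported to a CSV file with Email and Price columns (the file name and column names are assumptions, not part of the original flow):

  # Read the exported sales data (file and column names assumed)
  sales <- read.csv("sales_data.csv", stringsAsFactors = FALSE)

  # Keep only rows whose Email matches a simple address pattern
  valid_email <- !is.na(sales$Email) &
    grepl("^[^@ ]+@[^@ ]+\\.[^@ ]+$", sales$Email)
  sales <- sales[valid_email, ]

  # Drop rows whose Price contains a comma; alternatively, strip the comma
  # with gsub(",", "", sales$Price) before converting to numeric
  sales <- sales[!grepl(",", sales$Price), ]
  sales$Price <- as.numeric(sales$Price)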

Data Transformation

Data Preparation Recipes

This recipe provides processors for filtering and flagging rows, managing dates, sampling, and geographic processing.

To prepare data, perform the following:

  • Parse and format date columns.
  • Calculate the difference between the account-created date and the last-login date to compute dormant days.
  • Convert currency values into the required currency type, for example, INR into dollars.
  • Filter out unwanted columns by name.
  • Concatenate two column values with a delimiter, as shown in the diagram below:

[Image: concatenating two column values with a delimiter]

  • Create a GeoPoint by giving latitude and longitude as input, as shown in the diagram below:

[Image: creating a GeoPoint from latitude and longitude]

You can also extract latitude and longitude from the given GeoPoint.
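
The same preparation steps can be sketched in R with lubridate; the column names, date format, and exchange rate below are assumptions for illustration only:

  # Rough R equivalent of the preparation steps
  library(lubridate)

  sales$Account_Created <- mdy_hm(sales$Account_Created)  # parse date columns
  sales$Last_Login      <- mdy_hm(sales$Last_Login)

  # Dormant days = difference between last login and account creation
  sales$Dormant_Days <- as.numeric(difftime(sales$Last_Login,
                                            sales$Account_Created,
                                            units = "days"))

  # Convert currency, e.g. INR into dollars (rate assumed)
  inr_to_usd <- 0.015
  sales$Price_USD <- sales$Price * inr_to_usd

  # Drop an unwanted column by name
  sales$Name <- NULL

  # Concatenate two columns with a delimiter
  sales$City_State <- paste(sales$City, sales$State, sep = ", ")

  # Build a GeoPoint-style string from latitude and longitude
  sales$GeoPoint <- sprintf("POINT(%f %f)", sales$Longitude, sales$Latitude)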

Visual Recipes

Visual recipes are used to create new datasets by transforming existing datasets.

Filter Recipe

This recipe is used to filter invalid rows/cells, filter rows/cells on a date range, numerical range, or value, and filter rows/cells with a formula; matching rows can be either removed or flagged. Here, records that have not been accessed for a long period are filtered out, as shown in the diagram below:

[Image: filter recipe removing records not accessed for a long period]
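
A comparable filter in plain R might look like the following, where the two-year cut-off and the Last_Login column are assumptions:

  # Keep only records accessed within roughly the last two years
  cutoff <- Sys.Date() - 730
  recent_sales <- sales[as.Date(sales$Last_Login) >= cutoff, ]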

Split Recipe

This recipe is used to split the rows of one dataset into several other datasets based on defined rules. The dataset split on the state column, with the value “Ireland” dropped, is shown in the diagram below:

[Image: split recipe with the state “Ireland” dropped]
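
In R, the same split rule could be sketched with split(), again assuming a State column:

  # Split rows into one data frame per state and drop the "Ireland" subset
  by_state <- split(sales, sales$State)
  by_state[["Ireland"]] <- NULL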

Grouping – Aggregating Data Recipe

This recipe allows you to perform aggregations on any dataset and is equivalent to a SQL “group by” statement. It offers visual tools to set up aggregations and post-filters. The rows aggregated by product, with the count and the distinct counts of state and country, are shown in the diagram below:

[Image: grouping recipe aggregating rows by product]

The rows after applying a post-filter requiring state_count to be at least 100 are shown in the diagram below:

[Image: grouped rows after the state_count >= 100 post-filter]
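
A rough dplyr sketch of the same aggregation follows; the column names and the exact aggregates are assumptions:

  library(dplyr)

  product_summary <- sales %>%
    group_by(Product) %>%
    summarise(row_count     = n(),
              state_count   = n_distinct(State),
              country_count = n_distinct(Country)) %>%
    filter(state_count >= 100)   # post-filter: state_count at least 100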

Joining Datasets Recipe

To join two datasets, perform the following:

  • In the “Join” section of the recipe, click the “Add input” button to add a join.
  • Select the two datasets to be joined.
  • Select Join Type and choose the appropriate join type, such as “Inner Join”, “Outer Join”, or “Left Join”, as shown in the diagram below:

[Image: choosing the join type]

  • Click Conditions to add join conditions.
    The inner join based on Transaction_ID and Product is shown in the diagram below:

[Image: inner join conditions on Transaction_ID and Product]

  • After completing the join definition, go to the “Selected Columns” section of the recipe and select the required columns from each dataset.

[Image: selecting the required columns from each dataset]

The Original Price and Profit columns calculated using formulas are shown in the diagram below:

[Image: Original Price and Profit columns calculated with formulas]
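
A dplyr sketch of the join and the formula columns is shown below; the input dataset names, the selected columns, and both formulas are assumptions for illustration:

  library(dplyr)

  joined <- transactions %>%
    inner_join(products, by = c("Transaction_ID", "Product")) %>%
    select(Transaction_ID, Product, Country, State, Price) %>%
    mutate(Original_Price = Price / 1.1,            # e.g. price before a 10% markup
           Profit         = Price - Original_Price)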

Stacking Datasets Recipe

This recipe merges several datasets into one dataset and is the equivalent of a SQL “union all” statement.
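
In R terms, stacking is simply binding rows of datasets that share the same schema (the dataset names here are assumed):

  library(dplyr)
  stacked <- bind_rows(sales_2016, sales_2017)   # equivalent of UNION ALL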

Data Visualization

The built datasets can be visualized as charts in the dashboard. The following charts were created:

  • Average of Transaction ID Count by Payment Type
  • Profit by Country
  • Product Count by Country
  • Profit by Country and Product
  • Average of Profit by Year
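
For comparison, the “Profit by Country” chart could be reproduced outside DSS with ggplot2, assuming the joined dataset from the previous step with Country and Profit columns:

  library(dplyr)
  library(ggplot2)

  joined %>%
    group_by(Country) %>%
    summarise(total_profit = sum(Profit, na.rm = TRUE)) %>%
    ggplot(aes(x = reorder(Country, -total_profit), y = total_profit)) +
    geom_col() +
    labs(x = "Country", y = "Profit", title = "Profit by Country") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))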

Here is the flow created:

[Image: the complete flow created in Dataiku DSS]


Call Detail Record Analysis – K-means Clustering with R

Overview

A Call Detail Record (CDR) is the information captured by telecom companies during a customer’s call, SMS, and Internet activity. Combined with customer demographics, this information provides greater insight into the customer’s needs. Most telecom companies use CDR information for fraud detection by clustering user profiles, for reducing customer churn based on usage activity, and for targeting profitable customers using RFM analysis.

In this blog, we will discuss clustering of customer activities across the 24 hours of the day using the unsupervised K-means clustering algorithm. It is used to understand customer segments with respect to their usage by hour.

For example, a customer segment with high activity may generate more revenue, while a segment with high activity during night hours might indicate fraud.

Data Description

A daily activity file from the Dandelion API is used as the data source; it contains CDR records generated by the Telecom Italia cellular network over the city of Milano. The daily CDR activity file contains information for 10,000 square grids about SMS in and out, call in and out, and Internet activity. The structure of the dataset is as follows:

[Image: structure of the CDR dataset]

As the full file has five million records, a subset containing activity information for 500 square IDs is used for this use case.

Data Source Features Description

The dataset contains eight numerical features: SMS in and out activity, call in and out activity, Internet traffic activity, the square grid ID where the activity happened, the country code, and a timestamp indicating when the activity started.

[Image: description of the data source features]

Data Pre-processing

Data pre-processing involves data cleansing, data type conversion, and wrangling.

To pre-process data, perform the following steps:

  • Convert the square ID and the country code into factor columns as part of type conversion.
  • Derive new fields such as “activity start date” and “activity hour” from the “time interval” field.
  • Find the total activity, which is the sum of SMS in and out activity, call in and out activity, and Internet traffic activity.

[Image: calculating total activity]

  • Create the new derived fields mentioned above.

[Image: creating the derived date and hour fields]
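
A minimal R sketch of these pre-processing steps is shown below; the file name and column names follow the public Telecom Italia data layout and should be treated as assumptions:

  library(dplyr)
  library(lubridate)

  # Read one daily activity file (tab-separated, no header; names assumed)
  cdr <- read.csv("sms-call-internet-mi-2013-11-01.txt", sep = "\t",
                  header = FALSE,
                  col.names = c("square_id", "time_interval", "country_code",
                                "sms_in", "sms_out", "call_in", "call_out",
                                "internet_traffic"))

  cdr <- cdr %>%
    mutate(square_id      = as.factor(square_id),      # type conversion
           country_code   = as.factor(country_code),
           # time_interval is assumed to be a millisecond epoch timestamp
           activity_start = as_datetime(time_interval / 1000),
           activity_date  = as.Date(activity_start),
           activity_hour  = hour(activity_start),
           total_activity = rowSums(cbind(sms_in, sms_out, call_in,
                                          call_out, internet_traffic),
                                    na.rm = TRUE))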

CDR Exploratory Data Analysis (EDA)

Exploratory Data Analysis is the process of analyzing the data visually. It involves outlier detection, anomaly detection, missing value detection, aggregation of values, and the production of meaningful insights.

The plot for “Total activity by activity hours” is as follows:

[Image: plot of total activity by activity hour]

From the above plot, it is evident that most of the activity happened at hour 23 and very little activity happened at hour 06.

The plot for “Top 25 square grids by total activity” is as follows:

[Image: plot of the top 25 square grids by total activity]

From the above plot, it is evident that most of the activity happened in square grid ID 147.

The plot for “Top 10 countries by total activity” is as follows:

[Image: plot of the top 10 countries by total activity]

From the above plot, it is evident that the country code 39 has the highest activity.
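
The first of these plots can be sketched with ggplot2 from the pre-processed data frame above:

  library(dplyr)
  library(ggplot2)

  cdr %>%
    group_by(activity_hour) %>%
    summarise(total = sum(total_activity)) %>%
    ggplot(aes(x = factor(activity_hour), y = total)) +
    geom_col() +
    labs(x = "Activity hour", y = "Total activity",
         title = "Total activity by activity hours")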

Call Detail Record Clustering

K-means clustering is a popular unsupervised clustering algorithm used to find patterns in data. Here, K-means is applied to “total activity” and “activity hour” to find the usage pattern with respect to the activity hours.

The elbow method is used to find the optimal number of clusters for the K-means algorithm.

[Image: elbow plot of SSE against the number of clusters]

Looking at the above plot, the Sum of Squared Errors (SSE) decreases with only minimal change after 10 clusters, and there is no unexpected increase in the error distance. So, the best number of clusters for K-means on this dataset is 10.
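
A sketch of the elbow method and the final model fit is shown below; scaling the two features and the nstart value are choices made here, not details from the original post:

  set.seed(123)
  features <- scale(cdr[, c("activity_hour", "total_activity")])

  # Elbow method: total within-cluster sum of squares for k = 1..15
  wss <- sapply(1:15, function(k) {
    kmeans(features, centers = k, nstart = 20)$tot.withinss
  })
  plot(1:15, wss, type = "b",
       xlab = "Number of clusters k",
       ylab = "Total within-cluster sum of squares")

  # Fit the final model with the chosen k = 10
  cdr_kmeans <- kmeans(features, centers = 10, nstart = 20)
  cdr$cluster <- factor(cdr_kmeans$cluster)
  cdr_kmeans$centers   # cluster centers (in scaled units)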

The summary of the CDR K-means model and the centers calculated for each cluster are as follows:

[Image: K-means model summary and cluster centers]

The heat map of cluster, activity hour, and total activity is as follows:

[Image: heat map of total activity by cluster and activity hour]

From the above plot, it is evident that clusters 1, 7, and 9 have activity across all 24 hours and are the higher revenue-generating clusters. Clusters 1, 5, 7, 9, and 10 have activity during the night hours. Cluster 5 has activity from hour 11.5 to 17.
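
Such a heat map can be drawn from the clustered data frame with geom_tile(); this is only a sketch of one possible rendering:

  library(dplyr)
  library(ggplot2)

  cdr %>%
    group_by(cluster, activity_hour) %>%
    summarise(total = sum(total_activity), .groups = "drop") %>%
    ggplot(aes(x = activity_hour, y = cluster, fill = total)) +
    geom_tile() +
    labs(x = "Activity hour", y = "Cluster", fill = "Total activity")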

To read the original post, click here.

