Chapter 1. dCube Introduction

1.1 Executive Summary

DataVisor, a leading company in fraud detection and prevention, extends its fraud solutions with dCube, a platform for customizable modeling. This document guides you through the Manual Modeling workflow on the dCube platform and explains the algorithmic logic behind the model.

1.2 Unsupervised Machine Learning Algorithm Overview

DataVisor's patented and proprietary Unsupervised Machine Learning (UML) algorithm is the highlight of dCube modeling.

Common algorithms for fraud detection include clustering techniques, which explore relationships and connectivity among users in the input data, and anomaly detection techniques, which identify and mitigate outliers. dCube's UML goes beyond these common approaches by combining advanced clustering and graph analysis techniques with configurable outlier filtering. Unlike techniques such as k-means clustering, DataVisor's proprietary UML algorithm scales linearly with the number of data points and features, making it ideal for detecting large coordinated patterns. UML assesses cluster patterns, trends, and hyper-specific characteristics while minimizing outliers, with both granularity and scalability.

The following section explains the UML conceptually.

1.2.1 Data Organization

Before running the UML algorithm, the system organizes your raw data into user-based entities with their associated events and valid timestamps. The data is arranged so that users can ultimately be classified as fraudulent or not.

Example 

User ID | cookie   | card   | Application date | ip
--------+----------+--------+------------------+--------
user01  | cookie01 | card05 | 01022019         | 1.1.3.4
user01  | cookie02 | card04 | 01012019         | 1.1.3.5
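
Conceptually, this organization step can be sketched in a few lines of pandas. This is an illustration only, assuming raw events arrive as one row per event with the columns from the example above; the actual ingestion is handled by the dCube platform.

    import pandas as pd

    # Hypothetical raw events, one row per event, as in the example above.
    events = pd.DataFrame([
        {"user_id": "user01", "cookie": "cookie01", "card": "card05",
         "application_date": "01022019", "ip": "1.1.3.4"},
        {"user_id": "user01", "cookie": "cookie02", "card": "card04",
         "application_date": "01012019", "ip": "1.1.3.5"},
    ])

    # Organize events into user-based entities: each user aggregates the
    # feature values observed across all of that user's events.
    users = events.groupby("user_id").agg(set)
    print(users.loc["user01", "cookie"])  # {'cookie01', 'cookie02'}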

 

1.2.2 Feature Selection and Subspace

Once the data is organized at the user level, feature subspaces are formed with different features as dimensions, and users are plotted in each subspace based on their feature values.

Although the system automatically creates feature subspaces, you can control which subspaces are formed and require the system to include specific features through feature selection.

During feature selection, you can add configurations that set either individual features (High Priority) or feature combinations (Combine Features) as must-have dimensions. These must-have dimensions form the seed of a subspace, and the remaining features for the subspace are automatically determined to make it higher dimensional. If a high-priority configuration is set, every subspace must contain at least one dimension from the must-have (High Priority) feature list when it is formed.

Example

Feature selection: Select 10 features, such as F1, F2, F3, F4...F10.

High Priority features: Select {F2} as a high-priority feature.

Combine Features: {F6, F7} are combined and set as a high-priority feature by default.

In the background, every subspace must have {F2} or {F6, F7} as a dimension. Combinations of feature dimensions are then used to create subspaces. For example, feature subspaces such as subspace1 {F1, F2, F3}, subspace2 {F2, F3, F4}, subspace3 {F1, F2, F4, F5}, and subspace4 {F1, F6, F7} each contain at least one of the high-priority features. High-priority features can be individual features or feature combinations. In this case, the subspace {F1, F3, F4} will not be created, since neither of the high-priority features is present. Likewise, the feature subspace {F1, F6, F8} will not be created, since F6 is high priority only in combination with F7, not by itself. The sketch below illustrates this constraint.
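
As an illustration of the constraint only (the actual subspace construction in dCube is automatic and proprietary), the following Python sketch enumerates three-dimensional candidate subspaces and keeps those that satisfy the high-priority configuration from this example:

    from itertools import combinations

    features = [f"F{i}" for i in range(1, 11)]    # F1..F10
    high_priority = [{"F2"}, {"F6", "F7"}]        # {F2} OR the combined {F6, F7}

    def satisfies_high_priority(subspace):
        # A subspace qualifies if it contains F2, or both F6 and F7.
        return any(hp.issubset(subspace) for hp in high_priority)

    subspaces = [set(c) for c in combinations(features, 3)
                 if satisfies_high_priority(set(c))]

    print({"F1", "F2", "F3"} in subspaces)  # True: contains F2
    print({"F1", "F6", "F7"} in subspaces)  # True: contains F6 and F7 together
    print({"F1", "F3", "F4"} in subspaces)  # False: no high-priority feature
    print({"F1", "F6", "F8"} in subspaces)  # False: F6 without F7 does not count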

What Makes a Good High Priority or Combination Feature?

Entities (categorical features) that can identify where a pattern comes from are good candidates (e.g., IP, device, and receiver accounts for fraud use cases). Sometimes a concatenation of several entities captures a pattern better (e.g., IP range together with timestamp), making it a good candidate for a feature combination.
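
For instance, a combined entity might concatenate an IP's /24 prefix with an hourly time bucket, so that users active from the same subnet in the same hour share one value. The helper below is hypothetical and only illustrates the idea:

    from datetime import datetime

    def ip_time_feature(ip: str, event_time: datetime) -> str:
        # Hypothetical combined feature: /24 IP prefix + hour bucket.
        ip_prefix = ".".join(ip.split(".")[:3])
        return f"{ip_prefix}_{event_time.strftime('%Y%m%d%H')}"

    print(ip_time_feature("1.1.3.4", datetime(2019, 1, 2, 14)))
    # 1.1.3_2019010214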

1.2.3 Clustering

Pairwise Linkage Clustering Technique

Users are clustered with a proprietary algorithm built on a pairwise linkage function. After all users have been plotted in a feature subspace, they are clustered based on how close they are in that subspace according to their feature values. The pairwise linkage function reveals the interconnectivity of users within a cluster: the linkage between two users in a feature subspace is determined by the set of features they share and the values they share. In general, more shared features result in a higher linkage value. For a specific feature, if two users share a value that is globally rare, then the linkage contribution from that feature is higher.

You can configure which specific values within a feature are included or excluded when determining linkages. This is especially impactful for the pairwise linkage function, since the distribution of feature values changes significantly after including or excluding specific default values.
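
The exact linkage function is proprietary, but the sketch below shows one plausible form consistent with the description above: each shared feature value adds a rarity weight (the negative log of the value's global frequency), and excluded default values contribute nothing. All names and weights here are assumptions for illustration.

    import math
    from collections import Counter

    def pairwise_linkage(user_a, user_b, global_counts, total_users,
                         excluded=()):
        """Illustrative linkage: sum of rarity weights over shared values."""
        linkage = 0.0
        for feature, value in user_a.items():
            if user_b.get(feature) == value and value not in excluded:
                # Globally rare shared values contribute more.
                freq = global_counts[feature][value] / total_users
                linkage += -math.log(freq)
        return linkage

    users = [
        {"ip": "1.1.3.4", "card": "card05"},
        {"ip": "1.1.3.4", "card": "card05"},
        {"ip": "0.0.0.0", "card": "card99"},
        {"ip": "0.0.0.0", "card": "card42"},
    ]
    counts = {f: Counter(u[f] for u in users) for f in ("ip", "card")}
    print(pairwise_linkage(users[0], users[1], counts, len(users)))   # ~1.39
    print(pairwise_linkage(users[2], users[3], counts, len(users),
                           excluded={"0.0.0.0"}))  # 0.0: shared default ignored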

Filter Clusters

After the clusters are formed based on pairwise linkage, you have the option of filtering clusters based on size (minimum number of users per cluster).

Additionally, outlier filtering is supported at both the user level and the cluster level: users within a cluster are filtered out if they are very different from the rest of the users in that cluster, as illustrated in the sketch below.


NOTE: Filtered clusters are not part of the final model output.
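
A minimal sketch of both filters, assuming clusters are plain lists of users and using a user's average linkage to the rest of the cluster as the outlier criterion (the criterion dCube actually applies is internal to the platform):

    def filter_clusters(clusters, min_size, outlier_threshold, linkage):
        """clusters: list of user lists; linkage: a pairwise linkage function."""
        kept = []
        for cluster in clusters:
            if len(cluster) < 2:
                continue  # a single user cannot form a cluster
            # User-level outlier filtering: drop users whose average linkage
            # to the rest of the cluster falls below the threshold.
            members = [u for u in cluster
                       if sum(linkage(u, v) for v in cluster if v is not u)
                          / (len(cluster) - 1) >= outlier_threshold]
            # Cluster-level size filtering: clusters below the minimum size
            # are discarded and excluded from the final model output.
            if len(members) >= min_size:
                kept.append(members)
        return kept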

Score Clusters

The Cluster Suspicious Index (CSI) is a scoring method that assesses a cluster's risk on a scale from 0 to 1; a score closer to 1 indicates higher suspicion. Each cluster's score is a weighted sum over all feature contributions, scaled by cluster size. You can configure a share threshold (X%) for each feature: a feature counts toward the final score only if more than X% of the users in the cluster share a value for that feature. Oftentimes, a feature has several default values that should not contribute to user linkage or the cluster score; you can handle these with the option to include or exclude keyword values for each feature.
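
The exact CSI formula is proprietary; the sketch below is one plausible reading of the description above, in which a feature contributes its weight only when more than X% of the cluster shares a single value (excluded keyword values are skipped), and the weighted sum is scaled by cluster size and squashed into the 0-1 range. The weights and the squashing function are assumptions.

    import math
    from collections import Counter

    def cluster_score(cluster, features, weights, share_threshold=0.5,
                      excluded=()):
        """Illustrative CSI-style score in [0, 1] for a cluster of users."""
        total = 0.0
        for feature in features:
            values = Counter(u[feature] for u in cluster
                             if u[feature] not in excluded)
            if not values:
                continue
            top_value, top_count = values.most_common(1)[0]
            # The feature counts only if more than X% of users share a value.
            if top_count / len(cluster) > share_threshold:
                total += weights.get(feature, 1.0)
        # Scale by cluster size and squash into [0, 1]; a cluster with no
        # qualifying features scores 0 and is automatically filtered.
        return 1.0 - math.exp(-total * math.log(1 + len(cluster)))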

Sometimes there are correlated features in a subspace; you can discount or ignore the contribution of a correlated feature by specifying its feature family. DataVisor's auto UML has a built-in algorithm that detects and handles correlated features. Default correlations are shown as pre-configurations, and you can edit them or add your own based on your data. Within a feature family, the parent feature carries higher importance during scoring.

The correlation configuration also optimizes the feature subspace. A cluster with a score of 0 will be automatically filtered.

Numerical Features in UML

Since the UML determines how similar users are, categorical features make values easy to compare. To better utilize a numerical feature in modeling, you can preprocess it into a categorical feature by bucketing its values into different sets based on business logic or risk factors. For example, a numerical feature can be bucketed into ranges such as 0-1000, 1000-10000, etc., so that all users within a range are compared instead of only users with one exact value.
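
A minimal sketch of this preprocessing, assuming a transaction-amount feature and hypothetical bucket boundaries chosen from business logic:

    import pandas as pd

    # Hypothetical numerical feature: transaction amount per user.
    amounts = pd.Series([250, 999, 4500, 12000], name="amount")

    # Bucket boundaries encode business logic or risk tiers, not fixed rules.
    buckets = pd.cut(amounts,
                     bins=[0, 1000, 10000, float("inf")],
                     labels=["0-1000", "1000-10000", "10000+"])
    print(buckets.tolist())  # ['0-1000', '0-1000', '1000-10000', '10000+']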

 
