The following section will explore the fundamental components of the UML manual modeling with dCube.
4.1 Model Creation
When your dataset preparation and feature creation are complete, you can start to create a model by clicking on the MODEL MANAGEMENT button on the homepage.
On the Model Management page, click on Create New Model, and you will be redirected to the create model page.
Or you can click on the Create Models button on the navigation bar, which will redirect you to the create model page as well.
On the Create Model page, you will begin by selecting the data set.
4.1.1 Dataset Selection
You can choose two datasets on the Create Model page. One is the mandatory main dataset, on which the Feature Platform performs feature calculation; the other is the optional history dataset. The history dataset is used only as “warm-up” data, so that velocity features for the main dataset can be computed on top of the historical data.
By separating the dataset into the main dataset and historical dataset, the feature computation speed can be greatly improved. Also, the downstream detection algorithm will only be applied on the main dataset with additional calculated features.
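The warm-up idea can be illustrated with a small sketch. It is not dCube's implementation: the function name `velocity_counts`, the `(entity, timestamp)` event layout, and the window semantics are all assumptions made for illustration. A velocity feature is computed only for main-dataset events, while history events merely extend the look-back window.

```python
from collections import defaultdict

def velocity_counts(main_events, history_events, window):
    """For each main event, count earlier events of the same entity that
    fall inside the look-back window. History events are used only as
    warm-up data; no feature value is produced for them (assumed behavior)."""
    by_entity = defaultdict(list)
    for entity, ts in history_events + main_events:
        by_entity[entity].append(ts)
    counts = []
    for entity, ts in main_events:
        prior = [t for t in by_entity[entity] if ts - window <= t < ts]
        counts.append(len(prior))
    return counts

# Hypothetical events: (entity, timestamp)
history = [("device_1", 1), ("device_1", 5)]
main = [("device_1", 8), ("device_2", 9), ("device_1", 10)]

with_history = velocity_counts(main, history, window=10)
without_history = velocity_counts(main, [], window=10)
print(with_history, without_history)
```

Without the history dataset, the first main events of `device_1` would start from a count of zero, which understates their velocity.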
4.1.2 Model Type Selection
After selecting the corresponding dataset(s) on the Create Model page, you can type in model details and select one of the three model types to continue.
You can create UML Auto Lite, UML Auto, and UML Manual models as needed. The UML Auto model types are better suited to social scenarios, in which the system automatically configures everything and detects suspicious clusters. For other scenarios, we recommend using the UML Manual model type. You can name the new model and add a brief description during the model review.
Upon selecting the model type, users must fill in two mandatory fields according to the dataset:
- Entity ID: Entity ID is a field that you would like to aggregate events by. In other words, it is the entity used to represent a node in a cluster.
- Event Type: A dataset usually contains several event types (e.g. registration, login, transaction, etc.). If your dataset consists of a single event type (and therefore lacks an event type field), you can go back to Feature Engineering and create such a feature with simple coding.
Each field can be selected via the drop-down button. The Label function allows users to upload label data through the original data field. Labels can be used for model evaluation later on.
Now, we will briefly introduce the three model types below:
- UML Auto Lite: UML Auto Lite will automatically build a lightweight version of the model based on DataVisor's intelligent recommendation features and configurations. The fraud distribution in the dataset helps facilitate subsequent fine-tuning.
Advantages: Users can immediately click the Run UML AUTO LITE button after filling in the two mandatory fields. This model is a fully automatic system with a short runtime. The following picture shows the process of UML AUTO LITE after clicking on the Run button. The process of the model can be viewed from the process query on screen.
- UML Auto: UML Auto adds feature selection functionality on top of the UML Auto Lite model. Once the required features have been manually added, unsupervised cluster analysis is performed to generate model results.
After filling in the two mandatory fields, click on the PROCEED TO FEATURE SELECTION button, and you will be redirected to the model configuration pages, starting with the feature selection page, which is explained in more detail in Section 4.3. In general, the more custom features you add during feature selection, the more clusters UML Auto may capture.
Advantages: With semi-managed modeling, you control feature selection and can customize feature weights. When weights are left empty, the system computes them automatically. The weight of a feature can be set according to its importance in detecting suspicious users. We usually suggest a weight between 0 and 5. For example, we suggest a weight of 4.0 for the device_id feature and a weight of 3.0 for the IP address feature.
The following picture shows the process of UML AUTO after clicking on the Run Model button. You will be redirected to the Create Model page where you can click on the START button to start the UML AUTO model. The process of the model can be viewed from the process query on screen.
- UML Manual: In contrast to the previous automatic models, UML Manual models can be fully configured by the user (e.g. feature selection, high priority feature selection, feature configurations such as including or excluding values, the correlation between features, feature share threshold, minimum cluster size, etc.).
Note that there are two Quick Start Options: copying settings from an existing model, which is introduced in more detail in Section 5.3, and initializing settings from a model template, which is covered in Section 4.2.
After filling in the two mandatory fields, click on the Next button, and you will be redirected to the model configuration pages, starting with the feature selection page. Similar to UML Auto, this part will be explained in more detail in Section 4.3.
Once you are done with feature selection, proceed to the next part of model configuration using the PROCEED TO MANUAL TUNING button in the top right corner. Section 4.5 introduces the remaining parts of model configuration.
After you are satisfied with the model configuration, on the Preview Configuration page, click on the RUN MODEL button on the top right corner.
You will be redirected to the Create Model page where you can click on the START button to start the UML MANUAL model. The process of the model can be viewed from the process query on screen.
4.2 Model Template Workflow
Model templates are pre-built UML templates for specific use cases, designed to help users build a good UML model quickly. After selecting UML Manual, you can select a model template from the Quick Start section.
Then, you can select the model template based on your fraud use case from the drop-down. After selecting the model template, click on the Next button to proceed.
You will be redirected to the Mapping Fields page.
After the contents are loaded successfully, you can start mapping data fields in your dataset to the DataVisor Fields required by the model template. For the best result, we recommend mapping as many fields as you can. The system will use the fields you have mapped to generate an initial model. Click on the Next button on the top right of the page to proceed.
After you click on the Next button, the model template will be loaded within a few minutes and will suggest some features based on our domain knowledge. You may continue to add features and/or adjust model configurations, as introduced in the remainder of this Section.
4.3 Feature Selection
Feature selection is applicable to both UML Auto and UML Manual.
Initially, no features are selected. Users can click on the ADD FEATURES button shown below to start the feature selection process.
4.3.1 Feature Types
There are four different types of features:
- Normal Features
- High priority Features
- Combined Features
- Cluster level Features (advanced features)
Of these four types, UML Auto uses only the first, while UML Manual uses all four.
4.3.2 Feature Selection Process
The feature selection processes of UML Auto and UML Manual differ in some respects.
Normal Feature Selection
Selecting normal features defines the dimensions of the feature subspace. You can remove a selected feature by disabling it in the list.
Use the ADD FEATURES button on the feature selection page to add more features. The Add Feature to the Model window will then pop up with all the available features that may be selected.
You can search or filter the features by name or by type by using the search bar on top right. You can also select or unselect all features within one or more types by enabling or disabling the option beside the type name.
You can select or unselect a feature by toggling its switch on or off.
In most cases, users can select all raw features by toggling the button in the “Raw Feature” row and then clicking the Advanced Custom Feature row to expand and select derived features to be included in a model. Click on ADD to confirm your selection. After this step, users can further add or adjust features by clicking on the ADD FEATURES button.
UML Auto vs. UML Manual
You may start to run the model at this point for UML Auto. However, for UML Manual, the high priority features must be selected after overall feature selection is complete. Meanwhile, the features that you have selected may be configured in several ways.
For UML Manual, when your normal features have been selected, a Smart Recommendation window will appear at the top of the page. You may choose either to Run Smart Recommendations or Skip This Step.
If you are developing your first manual model, we suggest running the smart recommendation step to obtain and adopt the system’s recommended configurations after feature selection. You can then make more refined adjustments based on your own understanding of the use case and business. You may run the smart recommendation step (refer to Section 4.4 for details) to get the system’s recommended high priority features, or skip this step and configure high priority features manually.
A high priority feature serves as the fundamental element in a feature subspace. A high priority feature may be a single feature or a combined feature. To create a combined feature (up to 3 features in total), click on the three dots icon and then select Combine feature with.
High Priority Feature Selection
After either running or skipping the smart recommendation, a new column in the feature table named “Set as High Priority” will appear. These are must-have dimensions in the feature space. The high priority list can have at most 20 features. You can choose to enable or disable the high priority option in the table.
Combined Features Selection
Combined features are high priority features by default and are counted as one feature each in the maximum limit of 20 high priority features. You can select up to 3 features to generate 1 combined feature.
1. Click on the three dots icon on the right to combine features.
2. Select features to combine and click SAVE.
3. Enable or disable a combined feature. When you disable a combined feature, the feature combination will be deleted and must be recreated for future use.
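The two limits stated above (at most 20 high priority features, with each combined feature counted once and built from up to 3 features) can be expressed as a simple check. This is only a sketch: the function `validate_high_priority` and the tuple representation of a combined feature are hypothetical and are not part of the product.

```python
MAX_HIGH_PRIORITY = 20   # limit stated in the text
MAX_COMBINED_PARTS = 3   # limit stated in the text

def validate_high_priority(features):
    """Check a high priority list against the two documented limits.

    Each item is a feature name (str) or a tuple of names standing for one
    combined feature (hypothetical representation). Returns a list of
    error messages; an empty list means the selection is valid."""
    errors = []
    if len(features) > MAX_HIGH_PRIORITY:
        errors.append(
            f"{len(features)} high priority features exceed the limit of {MAX_HIGH_PRIORITY}"
        )
    for item in features:
        if isinstance(item, tuple) and not 2 <= len(item) <= MAX_COMBINED_PARTS:
            errors.append(
                f"combined feature {item} must contain 2 to {MAX_COMBINED_PARTS} features"
            )
    return errors

# A combined feature counts as one entry in the high priority list.
ok_selection = ["device_id", ("ip", "os_version")]
bad_selection = [("ip", "os_version", "city", "device_brand")]
print(validate_high_priority(ok_selection))
print(validate_high_priority(bad_selection))
```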
Remember to click the Save button beside the ADD FEATURES button when you leave the page, so that you can proceed with your current feature configurations next time. The same process applies for later configuration pages. As long as you click on Save, you may leave the current page and come back later.
After high priority/combined feature selection, you can add advanced cluster level features at the bottom of the page. Cluster level features are advanced features designed to capture more obfuscated fraud patterns. You may choose from two types of cluster level features: cluster match features and cluster deep learning features. Click on either type to expand it.
Cluster Match Features
Cluster match feature is a fuzzy match feature that returns true or false for each user based on its similarity against other users in a cluster. It takes value true for a user that shares the feature of choice with at least N other users in the current cluster. If most users (default 80%) have match true, this feature will contribute toward scoring. Because it’s computationally expensive, you can configure up to 10 match features. You can click Add Feature to configure such features in three steps.
1. Select Feature. First, select a source feature, e.g. device_id, from the drop-down list.
2. Name. Then, give this match feature a name. The default value is the source feature followed by _cluster_match, e.g. device_id_cluster_match.
3. Configure N. Finally, define the shared entities of interest for a group. An entity shared by at least N other users is considered a shared entity. By default, N is 1.
Cluster 1 (N=1)
|User|Devices|Cluster match|
|User 1|device 1, device 2|true|
|User 2|device 1|true|
|User 3|device 3|false|
|User 4|device 2|true|
Cluster 2 (N=2)
|User|Devices|Cluster match|
|User 1|device 1, device 2|true|
|User 2|device 1|true|
|User 3|device 3|false|
|User 4|device 2|false|
|User 6|device 1|true|
If you want to configure the share threshold for this feature, please refer to Section 4.5.3 for details.
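The cluster match rule described in this subsection can be reproduced with a few lines of code. This is a minimal sketch rather than the product's implementation: the function names are hypothetical, and treating the 80% default as an inclusive threshold is an assumption. The example data is the two clusters from the tables in this subsection.

```python
def cluster_match(user_values, n=1):
    """Return {user: bool}. A user matches when at least one of its values
    is also held by at least `n` other users in the same cluster."""
    result = {}
    for user, values in user_values.items():
        result[user] = any(
            sum(
                1
                for other, other_values in user_values.items()
                if other != user and value in other_values
            ) >= n
            for value in values
        )
    return result

def contributes_to_scoring(matches, min_share=0.8):
    """Assumed rule: the feature contributes when the share of matching
    users reaches `min_share` (the text gives 80% as the default)."""
    return sum(matches.values()) / len(matches) >= min_share

cluster_1 = {
    "User 1": {"device 1", "device 2"},
    "User 2": {"device 1"},
    "User 3": {"device 3"},
    "User 4": {"device 2"},
}
cluster_2 = dict(cluster_1, **{"User 6": {"device 1"}})

matches_1 = cluster_match(cluster_1, n=1)
matches_2 = cluster_match(cluster_2, n=2)
print(matches_1)
print(matches_2)
```

In Cluster 2, User 4 no longer matches because device 2 is shared with only one other user, which is below N=2.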
Cluster Deep Learning Features
You can add deep learning features on this page. A deep learning feature loads a pre-trained deep learning model to predict whether names/emails within a group exhibit suspicious patterns. You can click Add Feature to configure such features.
- Select Model. Choose the model you want to apply from the drop-down list. There are separate deep learning models trained for email / email_prefix / full name / nickname / cn nickname.
- Select Feature. Select the corresponding source naming feature, e.g. email or user_name, from the drop-down list.
- Name. Give this feature a name. The default value is the source feature followed by _dl_pattern, e.g. email_dl_pattern.
- Min/Max Cluster Size. Then configure the min cluster size and max cluster size for this feature. It defaults to min=5 and max=100. It means that the system only detects suspicious naming patterns within a cluster whose size is in the range of 5-100.
Similarly, if you want to configure the share threshold for this feature, please refer to Section 4.5.3 for details.
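The settings of a cluster deep learning feature can be summarized as a small configuration record. The class below is purely illustrative: `DeepLearningFeatureConfig` and its field names are hypothetical, and treating the 5-100 size range as inclusive at both ends is an assumption. Only the default name suffix and the default sizes come from the text.

```python
from dataclasses import dataclass

@dataclass
class DeepLearningFeatureConfig:
    model: str                   # e.g. "email" or "nickname"
    source_feature: str          # e.g. "email" or "user_name"
    name: str = ""               # defaults to source feature + "_dl_pattern"
    min_cluster_size: int = 5    # default stated in the text
    max_cluster_size: int = 100  # default stated in the text

    def __post_init__(self):
        if not self.name:
            self.name = f"{self.source_feature}_dl_pattern"

    def applies_to(self, cluster_size):
        """Assumed inclusive range check on the cluster size."""
        return self.min_cluster_size <= cluster_size <= self.max_cluster_size

config = DeepLearningFeatureConfig(model="email", source_feature="email")
print(config.name, config.applies_to(50), config.applies_to(101))
```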
Post feature selection
After feature selection, you will enter a series of manual modeling configuration steps: Refine Feature Values, Define Correlations, Influence Score, and Filter Clusters. Click PROCEED TO MANUAL TUNING to complete each step. Please refer to Section 4.5 for details.
If you exit the model configuration pages and later want to continue working on an in-progress model configuration, go to the Model Management home page, click on the three dots icon, and select View Model Config. The system will bring you to the Feature Selection page with all your previously saved configurations loaded.
4.4 Running Smart Recommendations
This is an optional step in the manual model process. As suggested in the tooltips, running this job performs the following actions:
- Generate new dataset and feature distributions based on selected features
- Suggest high priority features based on feature importance
- Suggest correlated features and keywords to exclude (see details in the next section)
After you run smart recommendations, the system runs a job in the background to generate the recommendation information. Depending on the volume of data, this process may take 20 minutes or more.
You can wait for the job to finish or continue to configure features and proceed to the manual tuning step. When the job finishes, suggested high priority features will be tagged with yellow flags on the left side. You can click the ACCEPT button to enable all recommended features as high priority.
The smart recommendation step also computes your feature distribution. You can also check the feature distribution by clicking the multi-colored bar icon to gain a better sense of the feature values and decide whether to include a feature in the model or tag it as high priority.
After you have done all the configurations in the feature selection page, you can click the Proceed to Manual Tuning button in the top right corner to proceed to the next step.
4.5.1 Include or Exclude Feature Values
Exclude Certain Values
Smart recommendation suggests feature values that can be excluded from use. For example, it may suggest excluding the idfa value “00000000-0000-0000-0000-000000000000”, as it is likely a default value. Clicking on the value or on Add All excludes it. You may add more values to exclude. Excluded values do not participate in linkage calculation or cluster scoring.
Note that a red alert icon beside a feature name (a cluster level feature or a Not Detect feature) means that some actions are frozen for that feature. Hovering over the icon shows a tooltip.
Add recommendation: You can choose to Add All or click on specific values to add them to the configuration.
Add a specific value: You can also manually type each value as part of the configuration.
Include Certain Values
Similarly, you can use the include function to allow only specified values in clustering. A feature can be set to either exclude or include certain values, but not both. In the INCLUDE KEYWORDS tab, a warning sign beside a feature indicates that exclude values have already been configured for it.
Exclude Global High Share Values
An easier way to automatically exclude global common values is through configuring the “Global Share Threshold” column. Values with high global share are likely to be default values.
Feature values whose global shares are higher than this threshold will automatically be excluded. For example, suppose “10” is the most popular OS version with a global share of 0.12. By setting the global share threshold to 0.1, OS version value “10” will be excluded from modeling.
Adjust the slider or enter a numerical value into the box to set the thresholds for each of the selected features. The value entered should be within the range [0, 1] inclusive.
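The effect of the global share threshold can be sketched as follows. This is an illustration under stated assumptions, not the platform's code: `values_to_exclude` is a hypothetical helper, and the global share of a value is assumed to be its frequency divided by the total number of observed values. The data reproduces the OS version example above.

```python
from collections import Counter

def values_to_exclude(values, global_share_threshold):
    """Return the values whose global share is strictly higher than the
    threshold. Share is assumed to be count / total observations."""
    counts = Counter(values)
    total = len(values)
    return {v for v, c in counts.items() if c / total > global_share_threshold}

# Hypothetical OS version column: "10" has a global share of 0.12.
os_versions = ["10"] * 12 + ["9"] * 8 + [f"other_{i}" for i in range(80)]

excluded = values_to_exclude(os_versions, global_share_threshold=0.1)
print(excluded)
```

Here only "10" is excluded; "9" has a share of 0.08 and stays in the model.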
Based on the smart recommendation, default correlations are provided, and feature families are determined. The item in the blue box reflects the parent correlation that has precedence over the child correlation in the yellow box.
An example for a family feature is shown below.
The address feature is the parent correlation of the child features IP address, city, and cell area. These child features all provide details about location and are all related to the parent address feature. The cell area is the area code, which reveals the approximate region. The IP address reveals the location of the device. The city reveals the town of residence for a given address. In this context, address is the parent feature because it provides the most granularity with regard to identifying the exact location of a user.
Click on ADD ALL to add the recommended feature family.
Modify the feature family by adding or deleting features.
Create a new feature family.
All selected features are provided here to set the similarity threshold. By default, all values are set to 80%. A lower threshold value encourages a feature to participate more in cluster scoring. An extreme case: 100% means that only if all users inside a cluster share the same feature value will this feature be used for scoring.
Adjust the slider or enter a numerical value into the box. The value entered should be within the range [0, 1] inclusive.
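One way to read the similarity threshold is as a minimum share of users in a cluster that hold the same feature value. The sketch below follows that reading. It is an assumption, not the documented algorithm: in particular, measuring the share by the most common value and using an inclusive comparison are choices made for illustration, and both function names are hypothetical.

```python
from collections import Counter

def feature_share(values):
    """Fraction of users in a cluster holding the most common value
    (assumed definition of the share)."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)

def used_for_scoring(values, similarity_threshold=0.8):
    """The feature participates in scoring when its share reaches the
    threshold; 0.8 is the default stated in the text."""
    return feature_share(values) >= similarity_threshold

device_ids = ["a", "a", "a", "a", "b"]   # 4 of 5 users share one device
print(used_for_scoring(device_ids))        # default 80%
print(used_for_scoring(device_ids, 1.0))   # extreme case: all users must share
```

Lowering the threshold lets the same feature participate in more clusters, which matches the note that a lower value encourages participation in scoring.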
Cluster Size: The cluster size represents the minimum number of users per cluster. The system provides a default of 6.
The recommended cluster size is a minimum of 4-5 users per cluster. While a larger cluster size may detect larger groups, smaller groups may be overlooked, resulting in more false negatives and compromising the recall. Based on the cluster size, you have an option to filter the clusters from the review. Note that this option is also available at the time of post-filtering to help you with the base model and create iterations on top of the base model by using post filtering.
The minimum number of selected features to match within a cluster helps measure the similarity of users within the group. The system provides a default of 1 feature.
Enter a numerical value to set the cluster size and minimum number of selected features.
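The two settings combine into a simple filter, sketched below. The function `keep_cluster` is hypothetical. The defaults of 6 users and 1 feature come from the text, while treating both values as inclusive minimums is an assumption.

```python
def keep_cluster(cluster_size, matched_features,
                 min_cluster_size=6, min_selected_features=1):
    """Keep a cluster only if it has enough users and enough selected
    features matching within the cluster (both treated as minimums)."""
    return (
        cluster_size >= min_cluster_size
        and len(matched_features) >= min_selected_features
    )

print(keep_cluster(6, ["device_id"]))        # meets both defaults
print(keep_cluster(5, ["device_id", "ip"]))  # too few users
print(keep_cluster(10, []))                  # no matching selected feature
```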
Users who are not similar to the rest of the cluster may be outliers, so filtering them out usually boosts model precision. We measure this similarity by the proportion of features that a user shares with the rest of the cluster. If this proportion is lower than the user outlier threshold, the user is treated as an outlier and filtered out. If the proportion of outliers in the cluster is greater than the cluster outlier threshold, the whole cluster is filtered out.
Adjust the slider or enter a numerical value into the box. The value entered should be in the range [0, 1] inclusive.
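The two outlier thresholds can be illustrated with a short sketch. It rests on assumptions: a feature is counted as shared when at least one other user in the cluster has the same value, and both function names are hypothetical. The product may compute the proportion differently.

```python
def shared_proportion(user, cluster):
    """Fraction of the user's features whose value also appears for at
    least one other user in the cluster (assumed similarity measure)."""
    features = cluster[user]
    shared = sum(
        1
        for name, value in features.items()
        if any(
            other != user and other_features.get(name) == value
            for other, other_features in cluster.items()
        )
    )
    return shared / len(features)

def filter_outliers(cluster, user_outlier_threshold, cluster_outlier_threshold):
    """Return the kept users, or None when the whole cluster is filtered out."""
    outliers = {
        u for u in cluster
        if shared_proportion(u, cluster) < user_outlier_threshold
    }
    if len(outliers) / len(cluster) > cluster_outlier_threshold:
        return None
    return sorted(set(cluster) - outliers)

cluster = {
    "u1": {"device_id": "d1", "ip": "i1"},
    "u2": {"device_id": "d1", "ip": "i1"},
    "u3": {"device_id": "d1", "ip": "i1"},
    "u4": {"device_id": "d9", "ip": "i9"},  # shares nothing with the others
}
print(filter_outliers(cluster, 0.5, 0.5))  # u4 is removed, cluster kept
print(filter_outliers(cluster, 0.5, 0.2))  # 25% outliers > 20%, cluster dropped
```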
Feature family setting
Feature family configuration is helpful when you want to discount selected features from the same family to reduce false positives. Selected features from the same feature family will only be counted as one toward the “number of selected features required” in the Filter Clusters config page. In the example below, we set device_brand and device_id as a feature family, so they will be counted as 1 rather than 2 selected features.
Configure a feature family by adding features one by one into the same family.
Then click on Add feature family to create another one.
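The counting rule for feature families can be sketched as follows. The helper `count_selected_features` is hypothetical. It counts each matched feature once, except that all matched features belonging to the same family together count as one, as described above for device_brand and device_id.

```python
def count_selected_features(matched_features, families):
    """Count matched features toward the 'number of selected features
    required', treating every feature family as a single feature."""
    family_of = {
        feature: index
        for index, family in enumerate(families)
        for feature in family
    }
    units = {
        ("family", family_of[f]) if f in family_of else ("feature", f)
        for f in matched_features
    }
    return len(units)

families = [{"device_brand", "device_id"}]
print(count_selected_features(["device_brand", "device_id"], families))        # counted as 1
print(count_selected_features(["device_brand", "device_id", "ip"], families))  # counted as 2
```

With a requirement of 2 selected features, a cluster that matches only on device_brand and device_id would therefore not qualify.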
Per feature cluster size config
This config is useful for selected features that are weaker than others. When a feature is less accurate than the others, setting a larger group size threshold for it helps improve accuracy. For example, the email_name_not_match_with_real_name feature might be noisy, so you can configure its cluster size independently. Even if the global cluster size is 3, setting the feature cluster size to 10 means that this feature is enabled only in clusters with size > 10. If no config is given, all features use the default cluster size from the “minimum users per cluster” step. You can configure a special cluster size for more than one feature.
- Select a feature from drop down, then input the cluster size requirement for the feature.
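The per feature cluster size rule can be sketched as a simple lookup. The function `enabled_features` is hypothetical. The strict comparison follows the statement above that a feature with cluster size 10 is enabled only in clusters with size > 10, and features without a specific setting are assumed to stay enabled for any cluster that already passed the global minimum.

```python
def enabled_features(cluster_size, selected_features, feature_cluster_sizes):
    """Return the selected features that are enabled for a cluster of the
    given size. Features without a per feature setting are always enabled."""
    return [
        f for f in selected_features
        if f not in feature_cluster_sizes
        or cluster_size > feature_cluster_sizes[f]
    ]

selected = ["device_id", "email_name_not_match_with_real_name"]
per_feature = {"email_name_not_match_with_real_name": 10}

print(enabled_features(5, selected, per_feature))   # noisy feature disabled
print(enabled_features(11, selected, per_feature))  # both features enabled
```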
Not detect feature config
In some scenarios, a group of users sharing the same feature value indicates less risk rather than more. For example, a group of users may share the same device and IP because they are from the same family and share the same last_name. For those cases, you can configure one or more features that are used to drop groups when sharing happens. In other words, a high share of such a feature indicates a legitimate cluster.
- Select a feature from a drop down list.
- Configure share threshold. If the share proportion of the feature within a cluster is higher than the share threshold, the cluster will be dropped (not detected).
- Add values to include. By default all values are applicable. However, you can specify a list of values to be used for dropping (similar to the include keyword).
Once a feature is configured as NOT DETECT, it can no longer serve as a detection feature. There is a red alert icon beside the feature name in the Refine Feature Values page and Influence Score page. All actions will be frozen for such features unless you remove the NOT DETECT setting.
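The NOT DETECT rule can be sketched as follows. This is an illustration only: `drop_as_legitimate` is a hypothetical name, the share is assumed to be the fraction of users in the cluster holding the most common eligible value, and the optional include list is applied before the share is measured.

```python
from collections import Counter

def drop_as_legitimate(values, share_threshold, include_values=None):
    """Return True when the cluster should be dropped because the share of
    a NOT DETECT feature is higher than the share threshold.

    `values` holds one feature value per user in the cluster. When
    `include_values` is given, only those values can trigger a drop."""
    eligible = [
        v for v in values
        if include_values is None or v in include_values
    ]
    if not eligible:
        return False
    top_count = Counter(eligible).most_common(1)[0][1]
    return top_count / len(values) > share_threshold

last_names = ["smith", "smith", "smith", "smith", "lee"]
print(drop_as_legitimate(last_names, 0.7))                          # 0.8 > 0.7
print(drop_as_legitimate(last_names, 0.9))                          # 0.8 is not > 0.9
print(drop_as_legitimate(last_names, 0.7, include_values={"lee"}))  # only "lee" counts
```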
At any point during the model step, you can save the configuration by clicking the SAVE button on the top right of the page.
4.5.6 Preview Configuration
After finishing all configurations, you can preview the settings before running the model by clicking on Preview Configuration.
The Feature Configuration section includes information about high priority features and selected features, including the excluded or included values and the exclude share threshold.
The Influencing Scores section displays information about the feature family and similarity scores.
The Filtering Clusters section displays information about the cluster filtering settings.