Chapter 3. Feature Platform

3.1 Feature Platform Overview

Feature Platform is a dCube module for feature engineering, one of the most challenging components of machine learning fraud detection. The main function of the Feature Platform is to create new features. You can use out-of-the-box features provided by Feature Platform, create custom features, and explore the distribution of each feature. For advanced settings, users can create their own feature templates, functions, and packages.

Feature Platform can be reached from the top menu via Feature Platform -> Get Started or through the Feature Platform link on the dCube homepage.

The Feature Platform dashboard shows the number of features that exist in the system. USER CREATED FEATURES are derived features created by users. Once created, these features are stored in the system and can be applied to any dataset and used in any model.

Through this page, users can create new features or calculate all features in a dataset. The IMPORT DATA section connects back to Data Studio.

3.2 Create Feature

To create new custom features, click Create Feature on the Feature Platform dashboard page. Users can create features using the out-of-the-box feature operators or by writing code.

In terms of workflow, we suggest users first explore the feature packages to check whether there are out-of-the-box features that can be used directly for their use case. After generating those out-of-the-box features, users can add more features through the Create Feature button.

3.2.1 Generate Out-of-the-Box Features through Feature Packages

Feature Platform provides out-of-the-box features for specific use cases in feature packages. Users can generate many features defined for a use case in a single batch. To generate these features, choose a package and then map the required field names to the DataVisor (DV) fields of the package.

Feature packages are located in the middle section of the Feature Platform dashboard page.

Currently, there are 9 feature packages:

  • Content Abuse
  • Transaction Fraud
  • Fake Accounts
  • General
  • Application Fraud
  • Anti-money Laundering
  • ATO (account takeover)
  • Promotion Abuse
  • GIN

Except for the General and GIN packages, the packages are fraud-case specific. The Content Abuse, Fake Accounts, and Promotion Abuse packages are mainly for social media use cases; the Transaction Fraud, Application Fraud, Anti-money Laundering, and ATO packages are for financial fraud. There are common features across different packages. Most of these common features apply to a wide variety of fraud cases, such as IP and time-related features, and they are also grouped together in the General package. Users can select the specific features that fit their use case.

GIN Package (May require additional purchase)

To enhance detection efforts and enrich decision-making, DataVisor leverages its Global Intelligence Network (GIN), which consists of anonymized non-PII data from over 4 billion protected accounts and 800 billion events across the globe. The GIN contains rich information on digital data such as IP address subnets, prefixes, proxies and data centers, user agent strings, device types, operating systems, email address domains, and more. Information from the GIN feeds into machine learning algorithms to further improve overall detection.

A few examples of GIN features are:

  • GIN_DEVICE_TYPE_IP_COUNTRY_ALLUSER_COUNT
  • GIN_EMAIL_DOMAIN_IP_COUNTRY_SUSPICIOUS_USER_COUNT
  • GIN_EMAIL_DOMAIN_IP_COUNTRY_BADUSER_COUNT
  • GIN_PHONE_PREFIX_BADUSER_RATE
  • GIN_IP_FIRST_SEEN

GIN features can be used on their own or to create new features. For example, to create a boolean feature that indicates whether an IP subnet is likely to be fraudulent, you can combine the GIN features “gin_ip20_baduser_rate” and “gin_ip20_alluser_count”. The former is the fraction of users accessing from an IP/20 subnet who are fraudulent, and the latter is the total number of users accessing from that subnet. You can then create a boolean feature using the Create by Coding option (Section 3.2.4) to implement the following logic:

When an IP subnet has more than 10 total users and more than 50% of these users are fraudulent, the boolean feature returns “True”; otherwise, it returns “False.”
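
A minimal Python sketch of that logic (illustrative only: in the Create by Coding editor the GIN values would be referenced with the “$featurename” convention from Section 3.2.4, shown here as plain variables with example values):

  # Example values; in the editor these would come from
  # "$gin_ip20_alluser_count" and "$gin_ip20_baduser_rate".
  gin_ip20_alluser_count = 25   # total users seen on the IP/20 subnet
  gin_ip20_baduser_rate = 0.6   # fraction of those users flagged as fraudulent

  # True when the subnet has enough traffic and a majority of it is bad.
  is_suspicious_subnet = (gin_ip20_alluser_count > 10
                          and gin_ip20_baduser_rate > 0.5)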

To check the list of packages and features for each package, click Manage Packages, then click the package name.

To get features from a package, click on the Get More button of the selected package to navigate to the feature package page.

Under each package, click on the field name. All the out-of-box features related to that field will be displayed.

To use any of the features, select the field from your dataset that matches the DataVisor Field. For example, if the email field in the dataset is “email_address,” then you need to choose “email_address” to tag EMAIL. If a feature requires two or more fields, you need to make sure all the tags are matched to the corresponding raw data field names. Feature dependency is shown at the end of each line. For example, because the AMOUNT-related features require both USER_ID and AMOUNT fields, both the AMOUNT and USER_ID fields must be tagged to generate the AMOUNT-related features.
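
Conceptually, this tagging step defines a mapping from DataVisor fields to your dataset’s own column names. A hypothetical example, assuming a dataset whose columns are named email_address, user_id, and txn_amount:

  # Hypothetical mapping; the column names depend on your dataset.
  # AMOUNT-related features need both AMOUNT and USER_ID tagged.
  field_mapping = {
      "EMAIL": "email_address",
      "USER_ID": "user_id",
      "AMOUNT": "txn_amount",
  }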

Not all fields of a package need to be tagged; only fields that are relevant to a use case need to be selected. For example, if there is no PHONE field in the raw data, there is no need to match that field or generate PHONE-related features. After matching all necessary fields, click the Get Features button at the bottom, and all related features for each field will be generated.

An error message will pop up if any features cannot be generated. Click on the See More button for more error details.

3.2.2 Create Regular Features

The Feature Platform provides the flexibility to create new features through either the UI or coding. Through the UI, users can create features directly using the out-of-the-box operator functions. Currently there are 4 categories of functions to choose from:

  • Aggregation
  • Generic
  • Attribute-specific
  • GIN (Global Intelligence Network)

Regular features that require minimal customization are those created from the Generic, Attribute-specific, and GIN functions.

  • Generic functions are simple mathematical and string operations.
  • Attribute-specific functions are tailored to certain attributes in the data (e.g. IP, email address, name).
  • GIN functions query the Global Intelligence Network using entity values.

Each operator comes with a short description explaining how it works, and the UI then guides users through selecting parameters. The feature creation process is similar for all three function categories.

Example: Concatenate two strings

Once you choose a function, an explanation and examples will appear on the right side of the page.

The next section then guides users through choosing parameters for the function and naming the new feature.
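
As an illustration, a string-concatenation feature produces a result along these lines (the input fields first_name and last_name are hypothetical; in the UI they would be selected as the operator’s parameters):

  # Sketch of what a string-concatenation feature computes.
  first_name = "Jane"
  last_name = "Doe"
  full_name = first_name + "_" + last_name   # -> "Jane_Doe"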

After feature creation, the new features will show up in the feature list under the Status bar as “Draft.” Users can edit and test the draft features as needed.

In the feature list page, users can check the feature information by clicking on the feature name. The pop-up window provides a quick look at how the feature was defined. Users can choose to edit or test the feature through this page.

Users can access more options by clicking on the “…” button at the end of each line, where users can also Copy, See Dependency, or Delete features.

The See Dependency function is useful to check how this feature is linked with other features.  

If a feature is listed as “Published” under the Status bar, users cannot delete it and are only allowed to edit its description. We will cover the feature test and publish process later. A general suggestion is not to publish any feature until it has been thoroughly tested and its correctness confirmed.

3.2.3 Create Velocity Features

Velocity features are based on Aggregation functions and can be used for time-series analysis. You aggregate a specific attribute (the data collection attribute) for each entity (aggregated by) under specific conditions; this is called an accumulator or aggregator. An accumulator can be reused with different functions such as count, distinct count, and more. You can also specify the time window over which to apply the function and an optional offset (start time).

Example: Compute the total transaction amount of each client in the last 7 days.

  • Function: sum
  • Data Collection Attribute: amount
  • Aggregated By: client_id
  • Condition: event_type = transaction
  • Window: 7 days
  • Offset: 0

1. Choose the function “sum.”

2. Define parameters in the aggregator. The system generates a default Aggregator Name for each new aggregator, such as client_id_123. We recommend changing it to a more descriptive name, along with a brief description, for easier lookup in the future. Reusing an existing aggregator is preferred whenever possible, as it makes feature calculation faster. Conditions are optional in this section, but users can leverage conditions to create more powerful features. Only events that match the specified conditions will be included in the feature calculation.

3. Define the time window. Choose the time scale on the right side of the bar, then select the start and end points of the time window. Users can also choose the time directly from the drop-down menu under the bar. Select Exclude Now to exclude the current time point from the time window.

Users can also set a time offset if the aggregation needs to be done for a time window in the past (e.g. from -30 days to -7 days). Below is a diagram explaining the concept of time window and offset.

4. Name the feature and save.
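
To make the semantics concrete, below is a minimal Python sketch of what this velocity feature computes over a list of events, including the optional offset from step 3. The event layout and names are illustrative, not the platform’s internal implementation:

  from datetime import datetime, timedelta

  def velocity_sum(events, client_id, now, window_days=7, offset_days=0):
      # The window covers [now - offset - window, now - offset],
      # matching the time window/offset concept from step 3.
      end = now - timedelta(days=offset_days)
      start = end - timedelta(days=window_days)
      return sum(
          e["amount"]
          for e in events
          if e["client_id"] == client_id
          and e["event_type"] == "transaction"   # the aggregator condition
          and start <= e["timestamp"] <= end
      )

  events = [
      {"client_id": "c1", "event_type": "transaction",
       "amount": 120.0, "timestamp": datetime(2023, 5, 1)},
      {"client_id": "c1", "event_type": "login",
       "amount": 0.0, "timestamp": datetime(2023, 5, 2)},
  ]
  print(velocity_sum(events, "c1", now=datetime(2023, 5, 3)))  # 120.0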

  Create a Velocity Feature Using an Existing Aggregator

If an aggregator already exists, users can reuse it to create similar velocity features, which saves time and makes feature calculation more efficient. By changing the function, time window, or offset, a set of similar velocity features can be created.

Example 

Following the above example of the total transaction amount in the past 7 days, users can change the function to calculate the total number of transactions for each client in the past 7 days. Start from Create Feature, choose the Count function, and click Existing Aggregator. In the pop-up window, choose the target aggregator (“amount_per_client” in this case) and click Next.

Then, the existing aggregator will be imported.  

Continue to select the time window/offset and proceed to name the feature as previously described.

Caveats: Currently, Feature Platform does not support nesting of velocity features: the input feature of a velocity feature cannot be another velocity feature. This will be supported in the near future.

Feature Platform gives users the flexibility to aggregate data along different dimensions, creating features that are useful for downstream modeling. More examples of velocity features include (the distinct-count case is sketched after this list):

  • Median or standard deviation of the transaction amount of each client in the past 3 months
  • Number of transactions from an IP address in the last 24 hours
  • Number of devices used to log in to the same account in the past 7 days
  • Number of emails linked to the same phone number in the past 180 days
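
A sketch of the distinct-count case (number of devices used to log in to the same account in the past 7 days), using the same illustrative event layout as the earlier sketch:

  from datetime import datetime, timedelta

  def distinct_devices(events, account_id, now, window_days=7):
      # Distinct count: unique devices with a login event for this
      # account inside the window [now - window, now].
      start = now - timedelta(days=window_days)
      return len({
          e["device_id"]
          for e in events
          if e["account_id"] == account_id
          and e["event_type"] == "login"
          and start <= e["timestamp"] <= now
      })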

3.2.4 Create Features by Coding

dCube also supports creating features by coding. In the Use Coding tab, there is a small editing window that currently supports Java, Python, and SQL.

The default page shows some example code to help users understand the format. To refer to a feature in the feature list, use “$featurename” to retrieve its value.

Example: Transaction IP, which is the IP taken only from transaction events
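
A minimal sketch of that example, assuming the editor’s Python mode and the “$featurename” convention (here the raw fields would be referenced as $ip and $event_type; the exact editor syntax may differ):

  # Transaction IP: keep the IP only when the event is a transaction.
  # In the editor, these values would come from "$ip" and
  # "$event_type"; plain variables are used here.
  ip = "203.0.113.7"
  event_type = "transaction"
  transaction_ip = ip if event_type == "transaction" else None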

3.2.5 Testing Features

Once a feature is created, users can backtest it using a validated dataset. This step is important for checking whether the new feature works as expected. If any error occurs, users can edit the feature definition or adjust parameters. To test a feature, users can either click Test after naming the feature on the Create New Feature page or click the feature on the Feature List page.

On the Test Feature page, users first choose which dataset to use. Click the sample dataset name, and all validated datasets will be shown in the list. Choose a dataset that contains the proper data for testing purposes.

After choosing a dataset, all available data files for the dataset will be shown on the page, and users need to choose one for testing. You can specify the number of records to use for testing, from 500 to 10,000. Local mode is suggested for backtesting.

Click Test after completing the setup above. When the calculation is complete, a results table will appear in the lower half of the page. The feature being tested is displayed in the first column, and the remaining columns are all the raw fields in the dataset. Users can select fewer columns through Edit Columns or sort specific columns for easier viewing.

3.2.6 Publish Features

After the new feature is well tested and confirmed, users can publish it through the feature list page. Features can be published individually by changing the status to “Published,” or users can batch-select several features and click the Publish button at the top of the page. For features that depend on another feature, the root feature must be published first. All published features are available on the Feature List page. Only published custom features can be used for modeling (Section 4).

3.3 Calculate Features (Optional)

After feature creation, testing, and publication, you may continue with dCube modeling using these features (refer to Section 4.1). Alternatively, if you would like to apply these features to an entire dataset to obtain a new dataset, you can use Calculate Feature on the Feature Platform home page. Calculate Feature processes all records in the selected dataset to calculate the values of the derived features.

As with Test Feature, choose the dataset you want to use. All datasets validated through Data Studio are listed under Database, and users can also use datasets in the Cloud.

After choosing the dataset, select all files that need to be used for the calculation and also select the Replay Mode. The Feature Platform supports three Replay Modes:

  • Local -> run on a single DataVisor machine
  • Distribute_Internal -> use the DataVisor cluster
  • Distribute_External -> launch an external cluster

After setting up the data source, choose which features need to be calculated, name the task, and then click Run.

Calculation progress and results can be viewed through View Tasks. 
