Chapter 2 - Create Features

Features are data attributes or properties that describe certain dimensions or characteristics of a dataset, with which we can perform analysis (e.g., using machine learning models for prediction) or perform operations (e.g., using rules to take actions) in order to achieve tasks such as fraud detection. With Feature Platform, you can create new features from your dataset in the following three ways:

  • Create features from off-the-shelf feature packages
  • Create features by UI or code snippets from scratch
  • Create features by importing features from another production environment

Feature Platform differentiates features into two types:

  • Aggregation Features: any features that contain built-in time series information
  • Non-Aggregation Features: all other features

In the following sections, we demonstrate how to create features step by step with the DataVisor Feature Platform.

2.1 Create Features from Feature Packages

DataVisor Feature Platform includes an extensive library of sophisticated feature packages to help users solve specific risk and fraud use cases, as shown below.

 

For each use case, commonly used features can be derived directly from DataVisor Feature Packages by mapping your input data fields to DataVisor recommended fields, for your convenience and efficiency. Think of each Feature Package as a set of carefully engineered features ready to support different use cases such as transaction fraud detection or account takeover detection.

After you map the input data fields to DataVisor’s recommended fields, you instantly create the corresponding features that are associated with these recommended fields. The more input data fields you can map, the more derived features you will get. For example, as shown in the screenshot below, the Transaction Fraud Feature Package has ten recommended fields to detect transaction fraud. By mapping your data fields to all ten recommended fields, you will be able to create all of the package’s associated features immediately.

Below is an example to guide you through mapping your data field to a DataVisor recommended field.

 

Example:

Let’s assume your input dataset has a field called “ip_address,” which corresponds to the IP address for a given event. You want to be able to leverage all of DataVisor’s IP address related features.

To do so, you can perform the following steps:

  1. Find the Transaction Fraud package in the Start with a package panel and click Get More.
  2. Under the DataVisor Fields column, find IP. You can select the arrowhead in the left column to see all the features that you will receive once you successfully map your input data to this field.

  3. Under the Client Features column for the IP row, select the input data field name “ip_address” (the raw input field is a feature by default, too).

  4. After the mapping, the # of New Mappings will increase by one, and the number of New Features will also increase depending on how many new features can be created by adding that mapping. For example, after mapping the raw input data field “ip_address” to the DataVisor recommended field IP, “ip_is_from_data_center”, “city_jumper”, and “ip_prefix_20” are the 3 new features created, as they only require the IP field.

 

To check which DataVisor recommended fields need to be mapped in order to derive a feature listed in the first column, as shown below, you can also check the third column on the right, which lists the corresponding DataVisor recommended fields that are required to generate that feature. For example, in order to generate the “transaction_count_per_ip_prefix_20_last_60_day” feature (the first feature listed below IP), in addition to mapping the IP field, you also need to map one of your input data fields to the TRANSACTION_ID field. If you don’t have a data field that can represent a unique transaction ID, the “transaction_count_per_ip_prefix_20_last_60_day” feature won’t be generated.

Note: After mapping the “ip_address” field to “IP”, only TRANSACTION_ID needs to be mapped to create “transaction_count_per_ip_prefix_20_last_60_day” because “ip_prefix_20” was already created as a feature.

 

  5. If you successfully mapped your input data fields to the recommended DataVisor Fields, you will see a blue GET FEATURES button when you scroll to the bottom of the page. Click this GET FEATURES button to move to the next step.

Note: If you haven’t mapped anything yet, or your data fields cannot be mapped to the DataVisor Fields, the GET FEATURES button will not appear at the bottom of the page.

 

Of course, the more data fields you can map, the more features you will automatically derive from DataVisor’s feature packages. (Please refer to Section 6: Package for more detailed information about DataVisor feature packages.)

 

2.2 Create Features from Scratch

If the features from a package are insufficient for your business needs, you can easily Create Feature From Scratch in the Feature Platform by programming or using the UI. DataVisor offers these two options to give you greater control over your feature engineering process.

To do so, go to the Feature Platform Dashboard shown below. The two highlighted boxes show you how to access the Create Feature page. You can either click Features → Create Feature from the drop-down toolbar at the top, or click the blue CREATE FEATURE button in the middle of the page to begin creating your own features from scratch.

 

 

Once you get to the Create New Feature screen as shown below, you can create new features either by clicking USE UI or USE CODING to program in Java, Python, or SQL.

 

 

2.2.1 Use UI to Create New Features

Creating a custom feature by using the UI involves the following steps:

  1. Select the Operator Category and the Operator Function
  2. Select the Input Parameter Type for the input data fields used to derive the feature
  3. Enter the Feature Name, Description, and Tags
  4. Test your feature (optional)
  5. Click CREATE FEATURE to finish

The highlighted boxes in the image below correspond to the steps described above.

 

 

In the following sections, we will explain each of these steps in more depth and provide examples to guide you through creating your first feature.

 

2.2.1.1 Operator (Function) Overview

 

In Feature Platform, an Operator or Function refers to pre-defined logic or a transformation that can be applied to raw data fields or input features in order to generate a new feature. For example, a simple function might add two columns of your data together to compute their sum, return the prefix of an email address, or determine the country where an IP address originates from.
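Conceptually, you can think of an operator as a small function applied to each event. As an illustrative sketch only (this is not DataVisor’s implementation), a simple “sum” style operator might look like this in Java:

public class SumOperatorSketch {
    // Adds two numeric columns of an event, e.g., amount plus fee.
    static double sum(double a, double b) {
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(sum(100.0, 2.5)); // 102.5 for that event
    }
}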

Operators (Functions) are classified into four categories: Aggregation, Generic, Region Specific, and Attribute Specific.

  • Aggregation: Operators that are applied over a temporal dimension to derive the feature value
      • Example: “COLLECT” aggregates all values within a time frame and saves them into a list.
  • Generic: Operators that perform arithmetic or logical operations as well as simple text manipulation
      • Example: “AND_DV_DEFAULT” has two inputs (“condition_1”, “condition_2”); if both conditions are satisfied, it returns TRUE; otherwise, it returns FALSE.
  • Region Specific: Operators designed to support region-related transformations
      • Example: “GET_LOCATION_BY_CN_PHONE” retrieves the city name from a China phone number.
  • Attribute Specific: Operators that are intended for specific attribute fields
      • Example: “EMAIL_PREFIX” extracts the string before the “@” sign in an email address.

From this classification, we can see that non-aggregation functions are those that do not depend on an aggregation of values over a time interval.

More details about DataVisor operators or functions can be found in Section 4: Functions.

Now that we have introduced the concept of operators, we will show you how to create both Non-Aggregation (Generic, Region Specific, and Attribute Specific) and Aggregation features step by step in the following sections.

 

2.2.1.2 Create a Non-Aggregation Feature 

 

As the name suggests, when you use the UI to create a custom non-aggregation feature, you will need to select a non-aggregation operator function. To illustrate the process, let’s walk through an example of creating a non-aggregation custom feature.

Example:

Your dataset has a data field named “email” that corresponds to the email address of a user. In this case, you want to create a non-aggregation feature that corresponds to the prefix (the substring preceding the “@” sign) of the email address and call this new feature “EMAIL_PREFIX_EXAMPLE”.

The first step is to select a non-aggregation function that can be applied to a specific field. You can narrow down the list of operator functions by type (e.g., ATTRIBUTE SPECIFIC in this case) if you know what kind of operator you are looking for. Alternatively, you can click EXPLORE ALL OPERATOR FUNCTIONS to search for and enable the specific operator you want to use. In this case, “EMAIL_PREFIX” is selected.

 

 

Once you select an operator function (“EMAIL_PREFIX” in this example), you will be shown a Return Type (in this case, a String), so that you know the data type of your function’s output.

 

 

After you select your Operator Function, Feature Platform will automatically display the Enter Parameters window. You can choose between the following two options:

  1. Select the Input Parameter type as a specific feature from the drop-down list
  • Example: the “email” feature
  • Note: Here the input is a raw data field or an input feature. When the input changes, the output feature value will change accordingly.
  2. Select the Input Parameter type as a constant
  • Example: manually type an email address (e.g., test@example.com)
  • Note: In this case, the input is a fixed value, and thus the returned output will always be the same constant “test” for all rows in your dataset.
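To make the difference between the two options concrete, here is a small illustrative Java sketch (a stand-in for the EMAIL_PREFIX operator, not the platform’s actual implementation):

public class ParameterTypeSketch {
    // Stand-in for EMAIL_PREFIX: returns the substring before the "@" sign.
    static String emailPrefix(String email) {
        int at = email.indexOf('@');
        return at >= 0 ? email.substring(0, at) : email;
    }

    public static void main(String[] args) {
        // Option 1: the input is a feature, so the output varies per row.
        System.out.println(emailPrefix("alice@example.com")); // alice
        System.out.println(emailPrefix("bob@example.com"));   // bob
        // Option 2: the input is the constant "test@example.com",
        // so the output is "test" for every row.
        System.out.println(emailPrefix("test@example.com"));  // test
    }
}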

 

 

Next, the Additional Details window at the bottom allows you to enter a Feature Name, add a Description of the newly created feature, and add appropriate Tags associated with the new feature. Adding a description and tags allows you to search for the newly created feature in your Feature List more easily.

Finally, you can either click CREATE FEATURE to add the new feature to your Feature List for use in modeling or rules, or you can click TEST to test the new feature with your data before you finish creating it.

 

 

2.2.1.3 Create an Aggregation Feature 

Aggregation Features are advanced, customizable behavioral features that aggregate events on a specific entity over a chosen time interval, allowing us to do time series analysis and answer complex questions, such as the velocity of certain types of activities. An aggregation feature requires an Aggregator, by which the input data will be grouped and analyzed. To some extent, an Aggregator is DataVisor’s equivalent of SQL’s GROUP BY clause, except it is a lot more powerful, particularly across the time dimension.

We also use an example to illustrate how to create an aggregation feature in this section.

Example:

Suppose your dataset has an IP field indicating the IP address of each customer request, and you want to know how many unique billing addresses were associated with the IP 64.237.37.122 across all past customer requests. Since this number could be very large across the entire dataset, we want to limit our search to the last 10 days using a sliding window over time.

This example requires an aggregator operation called “distinct count”. This is a built-in operator function under the Aggregation category that is ready for Feature Platform users to use.

Creating an aggregation feature through the UI is quite similar to creating a non-aggregation feature. We have outlined the steps below:

  1. In the Create New Feature page, select the Aggregation category under Select Operator Function to narrow down the selection of available functions.
  2. Select the aggregation function “distinct count”, which displays Return Type: number, the data type of the new feature. In this example, for every IP address within the selected time range and dataset, a single number will count the distinct billing addresses.

  3. In this case, just selecting the function is not enough to build the operator. We also need to specify how we plan to use the “distinct count” function, which data fields it applies to, and how long a window it covers in order to derive the new feature. Together, these dimensions define the Operator, also called the “Aggregator”. If you don’t already have an existing aggregator that can solve your problem, select New Aggregator.

Since we want to calculate a distinct count of billing addresses per IP address, you can specify the Data Collection Attribute as “billing_address” and the Aggregated By value as “ip”.

For future use and easier editing, you can name your aggregator under Aggregator Name and provide a description under Aggregator Description. This functionality allows you to reuse and recall aggregators conveniently in the future.

 

  4. We can further refine the newly created aggregator by adding conditions. Below the Choose an existing aggregator or create a new aggregator segment is the Add Conditions window.

In our example, we do not want to calculate the distinct count of billing addresses for all IPs; instead, we care about only one specific IP, say “64.237.37.122”. To achieve this goal, the Add Conditions section can be used to introduce filters based on specific attributes. This functionality is largely similar to the WHERE clause in SQL.

Under the Add Conditions section, select Click to Add Criteria. Set the attribute to “ip”, set the operator to “STR_EQ” (string equals), and choose the option for Constant Value, where we can enter the IP address “64.237.37.122”.

In our example, one condition is sufficient. For more complex cases, you can click +ADD CONDITION again to add more conditions, and click +ADD BLOCK to chain conditions together with AND and OR blocks. In this way, multiple AND/OR logic blocks can be used to support more complicated conditions.
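Conceptually, chained blocks evaluate like a nested boolean expression. For instance, one block with two AND-ed conditions, joined by OR to a second block, would behave like the following illustrative Java sketch (the country and proxy conditions here are hypothetical, added only to show the structure):

static boolean conditionsMet(String ip, String country, boolean isProxy) {
    // Block 1: two conditions combined with AND inside one block
    boolean block1 = ip.equals("64.237.37.122") && country.equals("US");
    // Block 2: a single condition in a second block
    boolean block2 = isProxy;
    // Blocks chained together with OR
    return block1 || block2;
}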

If you decide to delete a condition, select the trash symbol (with an X) directly to the right of your condition. Selecting the trash symbol at the bottom right of the block (with no X) will delete the entire block, so please be cautious when using this functionality.

 

  5. Next, we must Select time period for data aggregation to finish creating the aggregation feature. Feature Platform allows you to specify time units as days, hours, minutes, and even seconds. The selected time period can range from a minimum of one second to a maximum of 180 days. In real-time integration, if you want to skip very recent data when calculating a feature, you can select Exclude Now to exclude the most recent data. In our example, we want to select the last ten days, as shown in the figure below.

There are two ways to do so:

  • Use the graphic sliding ruler to mark the start and end time of the desired aggregation window
  • Manually specify the desired time period under the Start Day and End Day drop-down inputs.
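Putting the aggregator, the condition, and the time window together, the resulting feature computes something conceptually similar to the following self-contained Java sketch (illustrative only; the event fields and data are hypothetical, not DataVisor’s internal implementation):

import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class DistinctCountSketch {
    // Hypothetical event record for illustration.
    record Event(String ip, String billingAddress, Instant time) {}

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("64.237.37.122", "12 Main St", Instant.now().minus(Duration.ofDays(2))),
            new Event("64.237.37.122", "99 Oak Ave", Instant.now().minus(Duration.ofDays(5))),
            new Event("64.237.37.122", "12 Main St", Instant.now().minus(Duration.ofDays(30))), // outside window
            new Event("10.0.0.1",      "77 Pine Rd", Instant.now().minus(Duration.ofDays(1)))   // other IP
        );

        Instant windowStart = Instant.now().minus(Duration.ofDays(10)); // last 10 days

        // Condition (like SQL WHERE) + time window + distinct count over
        // the Data Collection Attribute (billing_address).
        long distinctBillingAddresses = events.stream()
            .filter(e -> e.ip().equals("64.237.37.122"))  // Add Conditions filter
            .filter(e -> e.time().isAfter(windowStart))   // time period for aggregation
            .map(Event::billingAddress)
            .distinct()
            .count();

        System.out.println(distinctBillingAddresses); // 2
    }
}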

  6. Finally, in the Additional Details window, you can name this feature under Feature Name and provide a short description under Description for easier future recall. Feature Platform also offers a Tags option where you can add custom keyword tags for your convenience when searching for the feature in the future.

  7. Click CREATE FEATURE to finish creating this feature. Just like with non-aggregation features, you also have the option to click TEST to test on a sample of your data. Please refer to Section 4.2: Test Features for more information about testing features.

2.2.2 Use Coding to Create New Features

While DataVisor has an extensive feature library with many ready-to-use features and powerful UIs to facilitate creating custom features from scratch, you may sometimes need to create custom features that cannot be built straightforwardly using operators.

Advanced users may choose to Use Coding to create custom features directly — Java, Python, and SQL are supported in the DataVisor Feature Platform.

When you select Use Coding under Create New Feature, you will be taken to a console in which code can be written to generate a feature. First, select a language from the dropdown (Java, Python, and SQL are currently supported). Then write the code snippet in the coding console. The code snippet needs to end with a return statement that returns the value of your new feature.

 

 

In your code snippets, you may need to reference data fields from the input datasets or refer to other features that were already created. To do so, use the $ symbol in front of your data field name or feature name in your code. When you type the “$” sign, a dropdown menu will appear to help you select the relevant features or data fields.

Note: At the moment, libraries for Python and Java are not accessible through the coding console (e.g., “Math” for Java or “NumPy” for Python). However, you will have access to a set of DataVisor utility functions that may be helpful for generating features.

Example:

Just as we created a feature to extract the prefix of an email address using the DataVisor UI, we will now demonstrate how to do the same using code.

  1. Select Use Coding under Create New Feature to turn on the coding console.
  2. Write a code snippet in Python or Java to extract the prefix of an email address, using $email to refer to the email field in your dataset. Examples of scripts in both languages are shown below.

 

 
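For reference, a minimal Java version of such a snippet might look like the sketch below, written in the coding console style (we assume $email resolves to a String; the null check is a defensive assumption):

// Return the substring before the "@" sign of the email field.
String email = $email;
if (email == null || email.indexOf('@') < 0) {
    return ""; // no valid email address: fall back to an empty prefix
}
return email.substring(0, email.indexOf('@'));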

The coding console comes pre-configured with certain IDE-like features, including autocomplete and syntax highlighting. For example, default String functionality is available in Java. Similarly, any comments made in the code will be preserved if you come back to view this feature.

  3. Next, you will need to complete the Additional Details section, much like you would when creating a feature using the UI. You must name your feature under Feature Name and specify a return type under Return Type.
  • We currently support the following Return Types:
  • Integer
  • Float
  • String
  • Boolean
  • Set
  • Dict
  • List
  4. While the Description and Tags fields are optional, we still suggest adding a description and tags for easier searching in the future.

 

 

  5. Before you finalize the feature creation, you have the option to TEST this feature. If you choose to TEST your feature, you will be directed to the same Test Feature page as when you test features created by the UI. Please refer to Section 4.2: Test Features for more information about testing your features.
  6. Whether you choose to test it by clicking TEST or wish to finish the process by clicking CREATE FEATURE, if there are errors in the code, an error message will appear. You can edit your code to fix any errors. In the example below, the error message indicates that the variable “full_email” is misspelled as “full_emal”.

 

 

  7. Click CREATE FEATURE to finish creating the new feature and add it to your Feature List, where you can access it later for modeling or creating rules.

2.3 Create Features from Import

DataVisor Feature Platform also allows you to import and export features across different environments using the Feature Configuration option for convenience.

The Feature Configuration option is located at the bottom right corner of the Feature Platform Dashboard in the Advanced Options section.

 

 

To create features by importing a JSON feature configuration exported previously, just select Import and choose the JSON feature configuration that you created earlier. If the import process is successful, you will see an Import Successful message as shown below. You can then proceed directly to View Features and see all the features that have been imported into Feature Platform.

 

2.4 Create Advanced Features Using Datasource / Blacklist / Whitelist

2.4.1 Datasources

Data sources are effectively tables that can be queried in Feature Platform in order to create more advanced features.

There are many examples where a data source is useful:

  • A user can store a table of suspicious IPs and query against it to see if a newly-received event contains one of those IPs
  • Additional tables like currency rates could be stored as a data source so that they’re easily query-able for features
  • Rules could be created based on a whitelisted field value

2.4.1.1 Create a Datasource

Datasources can be created from the Feature Platform UI. In order to create a datasource, you will need to know the schema of its fields.

  1. Navigate to the Datasource List page by going to Data and Features → Datasource on the top menu. The resulting page should be the following:

  2. Select Create Datasource to begin making a new datasource, and you will reach the following configuration page:

  3. Use the Add Field button to add as many fields as needed. For each field, you will need to specify:
      1. Its name
      2. The type of the field. We currently support:
          • String
          • Float
          • Boolean
          • Double
          • Long
          • Integer
      3. Partition Key - This denotes whether or not the field will be used to query the datasource. Select True here if you would like to query the datasource using this value, and False otherwise.
      4. Cluster Key - This denotes whether the ordering of the rows within a specific partition key depends on this field. Select True here if you want this field to contribute to the ordering of the rows when querying the datasource, and False otherwise. (See the example schema after these steps.)
  4. Use the trashcan icon on the right to delete an erroneous field, and use the right-most symbol (the stripes) to arrange the ordering of the fields by dragging and dropping.
  5. Name your datasource at the top (under Datasource Name) and select whether you want the datasource to be locally or globally available. The default is Local, and we recommend using Local unless specified otherwise.
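For illustration, a hypothetical datasource of suspicious IPs might be configured as follows (the field names and key choices here are made up for this example):

Field Name     Type     Partition Key   Cluster Key
ip             String   True            False
first_seen     Long     False           True
reason         String   False           False

Here, marking ip as the partition key makes the datasource query-able by IP, while marking first_seen as a cluster key orders each IP’s rows by when they were first seen.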

2.4.1.2 View an Existing Datasource

Once created, a datasource is not editable, but you can view the schema for future reference.

  1. Select the “...” button in the right-most column, then select View; the following page opens:

You can see each field along with its additional configurations, which helps you query this datasource or use it in Feature Platform.

 

2.4.1.3 Query a Datasource

 

Without creating a feature, Feature Platform supports standalone querying of a given datasource so that you can look up specific entries.

  1. On the Datasource main page, click the “...” button of the datasource of interest, then select Records; the following page opens:

  2. Any field that was marked as a partition key in the datasource will show up here as query-able. Simply type a value into the Search Value space, and the Search button will be unlocked. You can then search, and all the records that match your query will be displayed.

2.4.1.4 Add to a Datasource

 

While a datasource’s schema is not editable after creation, values can always be added into the datasource.

  1. Load the same page as in the above section (Datasource List → Datasource Records) and click on Add Records to reach the following page:

 

  2. You can add records in one of two ways:
      • To add just a small number of records, simply type the information into the fields in the UI (in this case, we have only two fields called entity_type and entity_value, but other datasources may have many more). Once you’re done adding all of them, just click Add and these records will be inserted into the datasource database.
      • To add a bulk set of records to the datasource, select Import File at the top right, and you will be taken to an Import page. You can upload a dataset whose schema matches the datasource schema exactly, and all the records will be added to the datasource.
  3. Once you are done adding records, you can return to the Records page (noted earlier) and query your new data to make sure it has been added successfully.

2.4.1.5 Use a Datasource in Feature Platform

 

Now that a datasource has been created, we are ready to use it in Feature Platform. Navigate to the main Feature Platform menu and select Create Feature to get started.

We currently support querying a database (like the whitelist or blacklist) in Java or SQL:

SQL

Let’s start with a simple feature that will query the blacklist datasource and find out whether a specific IP (the one in our event) is in that blacklist.

The blacklist datasource has two columns: entity_type and entity_value. entity_type is the field name, while entity_value is the blacklisted value for that field.

In SQL, the feature script would look like this:

select entity_value from blacklist where entity_type = "ip" and entity_value = "$ip"

 

This script will query the blacklist datasource and filter for records where the field name is ip and the field value is your event’s IP field value (using the “$” allows us to reference this).

There are two cases here:

  • If there is a record in the blacklist datasource matching your entity_type and entity_value, the script will return the value itself (e.g. 168.120.1.1)
  • If there is no record in the blacklist datasource matching the specified field and value, the script will return null.
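Because the SQL script can only return the matched value or null, any post-processing requires a secondary feature (see the SQL vs Java section below). For example, a small follow-up Java feature could turn the result into a boolean; here we assume the SQL feature above was saved under the hypothetical name ip_in_blacklist:

// Hypothetical secondary Java feature: true if the SQL feature found a match.
return $ip_in_blacklist != null;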

Java

We can also accomplish this task using a single Java feature:

String ipBlacklistVal = $blacklist("ip_constant", "ip", "entity_value", "");

return ipBlacklistVal;

In this case, we are querying the blacklist as a feature and supplying parameters that can help us get the IP out, in the same way as before.

  • ip_constant: A blacklist or whitelist could have many different entities, and we need to know which entity we are querying. We must first create a constant feature (above, we call it ip_constant) that simply returns the entity type (see the sketch after this list). This ensures that when we query the blacklist, we are going to the right entity_type. Also, this parameter must be a String.
  • ip: This is the field name in our dataset that we want to query. If you have multiple such fields to query, rather than creating many SQL features, each query is just one line of code within the same Java feature.
  • entity_value: The name of the column in the blacklist that has the values (so that we know what to return). For most cases, this will be the constant entity_value.
  • “”: We require this argument to specify the default return value if the value cannot be found in the blacklist. So here, ipBlacklistVal will be an empty string if the IP can’t be found in the blacklist.
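For reference, the ip_constant feature mentioned above is itself just a one-line Java feature in the coding console, sketched here:

// Constant feature "ip_constant": always returns the entity type to look up.
return "ip";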

Since this is in Java, we are able to further modify ipBlacklistVal if we want, whereas in SQL we are constrained to simply returning the feature value as-is. Please see the SQL vs Java section for more information.

 

2.4.1.6 Special Datasources: Blacklist / Whitelist

 

The blacklist and whitelist are simply specific types of datasources that have been created for user convenience. Using such lists is an easy way to ensure that users and accounts that need to be caught or filtered out are handled accordingly.

These two datasources function exactly the same as any other datasource, except that for these lists, the schema is hard coded. In both cases, there are exactly two columns: entity_type and entity_value. Any entries added to these datasources must have exactly these two columns, and no others.
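For example, a few rows of the blacklist might look like the following (the IP value comes from the earlier SQL example; the email row is purely illustrative):

entity_type     entity_value
ip              168.120.1.1
email           spam_account@example.com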

In order to make blacklists and whitelists more accessible to the user, we have a separate page under the Feature Platform tab for their use. Go to Data And Features → Blacklist/Whitelist to get started.

 

 

Here, you can query either the blacklist or the whitelist in the same way as before, as well as Add Records to add individual or bulk records to either datasource.

 

 

If doing a bulk addition of records, please remember to ensure the schema remains identical; otherwise, errors will be shown in the console.

2.4.1.7 SQL vs Java

When deciding whether to use SQL or Java to query a datasource, please consider the following:

SQL

  • + Easy to understand syntax
  • + Less error-prone, since there are fewer parameters into the feature script
  • - Cannot do any modification beyond simply returning the feature value
  • - Need secondary features to do further post-processing

Java

  • + All processing can be done within 1 feature
  • + Multiple queries to the database can occur within one feature
  • - More complicated syntax
  • - More error-prone
  • - Requires using a constant feature before querying

In general, we recommend that users use the SQL version unless there is a pressing need for Java, since SQL is more convenient and less likely to require a lot of manual debugging.

 
