Chapter 2. Data Studio

In Data Studio, you can create a dataset by importing data. Once a dataset is created, you can map its fields, determine the data type for each field, and finally validate the dataset. A validated dataset can be used in Feature Platform for feature testing, as well as in Model Management for model building. Data Studio also allows you to create and manage resource data (datasources), which serve as lookup tables when creating more powerful features. Resource data is introduced toward the end of this chapter.

2.1 Import Data

There are three ways to access the import data page.

1. Click on Data Studio in the top menu bar, then click on Import Data.

2. Click on the IMPORT button on the Data Studio home page.

3. Click on the IMPORT DATA button on the Feature Platform home page.

You may import files from your local machine or from a custom cloud location. We support common file formats such as CSV, JSON, and TXT, as well as GZ-compressed files.

2.1.1 Supported File Formats

dCube currently supports files in the following formats:

  • JSON (preferred)
  • CSV
  • TXT
  • GZ

Raw JSON files should contain one event per line. It is OK if different lines have different data fields. Below is an example of data in the acceptable JSON format:

{"event_type":"registration", "user_id":"ABXD1232WER", "event_time":"1423160520015",

"time_zone":"UTC+0200", "IP":"113.245.239.117", "occurring_latitude":"16.99943733215332",

"occurring_longitude":"115.3338623046875", "device_client":"android",

"device_id":"bf23de564f7b7285613570a2e257d897", "device_version":"263"}

For CSV format, we support both single- and multi-character delimiters. Each file must have a header row, and all files in a single upload must share the same header and the same delimiter. Make sure you select or enter a delimiter that matches your data; otherwise, the upload will fail.
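
For illustration only, a minimal comma-delimited CSV upload might look like the following (the field names and values are hypothetical, borrowed from the JSON sample above):

event_type,user_id,event_time,IP
registration,ABXD1232WER,1423160520015,113.245.239.117
login,ABXD1232WER,1423161520015,113.245.239.117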

2.1.2 Upload Data

You have two options for importing the data files:

1. Upload Data From Local

Click on the Browse Files button to choose the file(s) to import. The system infers the file type (JSON or CSV) from the file extension, so for GZ files you need to select the format manually. The maximum size for a single upload is 10 GB.

You can enter a name and description for your dataset, then click on SUBMIT. Note that you cannot reuse the name of a previously uploaded dataset. The system will display the message “Updating is in progress, please stay on this page until completed”. Once uploading finishes, you will be directed to the View Data page shown below.

2. Upload From Custom Cloud

You can also pull data files that have already been uploaded to the cloud. Click on Custom Cloud on the left to obtain the root path that hosts the files on the cloud. Typically, the root path is the DATASET folder under the main bucket. First-level folders and individual files under the root path can be used as input files for preparing a new dataset; deeper subfolders are not supported yet.

Click on LOAD FILES to list the available files and select from them. Make sure that all files selected during the same import share the same format.

Similarly, make sure the file format is correctly selected, then name your dataset and click on SUBMIT.

2.2 View Data

At this stage, your dataset is in the “uploaded” status. The View Data page provides a basic view of the data: the total data size, file names, dataset name, and uploader’s name are shown in the table. The system samples evenly from each of your files, up to a maximum of 30,000 records in total. The preview table shows this sampled data and lets you scroll up and down (as well as left and right) for a quick sanity check. You may also search for particular rows by typing in the search bar, or click on Edit Columns to display only selected feature columns. Once verified, click on the NEXT button to proceed to mapping and validation.

2.3 Map and Validate Fields

You are now on the field mapping page, which allows you to map each raw field to an event attribute. In Datavisor terminology, event attributes are event-level features that serve as the building blocks for derived features. An event attribute has a name and a data type. By mapping a raw field to an existing event attribute, you will be able to use existing derived features that are constructed from that event attribute. However, if the event attribute has existing dependencies (e.g., it is used by features or other datasets), you will not be able to change its data type. You may click on SEE DEPENDENCY for more information.

The inferred type column is based on an automatic data type inference algorithm. To prevent feature computation errors, event attributes whose inferred data type is String cannot be set to a numerical data type and must remain String. Another case requiring correction is when the event attribute name conflicts with Datavisor internal names; you may rename your event attribute if this happens.

In some scenarios, most of your data fields may look numerical (e.g., birth year, user ID, bank ID) but you want to map all of them as String to save editing time. You may click on “Set type to string for ALL fields” to achieve this. To revert, just click on “Reset type to inferred type for ALL fields”. One caveat: for an event attribute with an existing dependency, the system cannot change its data type, so you have to rename the event attribute in order to change its type.

You may click on SAVE MAPPING before leaving this page and come back later for more edits. Once saved, your dataset will have the Mapped status. When you are done with all edits, click on the VALIDATE button to start the validation process.

2.4 Validate Data

Data type validation and time field validation are mandatory (unless there is no time field at all in the dataset, in which case we treat all events as having the same time of 0). Select your event time field from the dropdown and provide its format. The system will try to auto-infer a time format from the available sample data. However, auto-inference may not work if you select a field that is unrelated to time, or if your data does not have a unified, acceptable time format. Note that time validation is required for dCube modeling, so please make sure your dataset has a unified time field.

For a dataset to be successfully validated, at least 80% of the data records need to pass each field’s data type validation. In addition, for a dataset with a time field, at least 80% of the data records need to have a valid timestamp. Data type and time validation are processed in memory; expect them to finish within 10 minutes for a dataset under 10 GB and within 1-2 hours for a dataset over 100 GB. During this process, your dataset will have the Validating status.

Generating the Data Distribution Report is optional. If you select the Data Distribution Report option, the system will ask you for an event sample rate; events will be randomly sampled for the distribution calculation. For larger datasets, it is recommended that you use a smaller sample rate. You may further provide the event type field in your dataset. For a social dataset, the event type usually takes values such as “registration”, “login”, “message”, etc.; for a financial dataset, values may be “application”, “login”, “transaction”, “feedback”, etc. With an event time field, the distribution report will plot a time series curve for each event type; without an event type field, the time series curve will be plotted over all events. Because this calculation is optional, the Validation Successful status does not depend on its success. As soon as data type and time validation are complete, the system will show the validation status on the Data Studio home page. If the distribution job is still running, you will see a spinning wheel such as the one below.

2.5 Validation Results

After the mandatory validation finishes, you can review its status on the Data Studio home page. If both the data types and the time field are valid for at least 80% of the dataset, it will receive a Validation Successful status. If any data type or the time field is invalid for more than 20% of the records, it will be marked as Validation Failed. You may click directly on the Status button to view the validation result. Alternatively, click on the … button on the right, then select View Validation Results.

The data type validation result provides a few metrics for each event attribute: total count, empty count, invalid count, and valid count. The latter three should sum to the total count. Only the invalid count is used toward the acceptance criteria. You may toggle between Number view and Percentage view in the top right corner. Hovering over each metric provides sorting options. When an event attribute contains invalid values, the system will show some examples next to the event attribute name.

Time validation results are simpler to review. They show the time field name and the time format used for validation, as well as the total event count, validated event count, and validated percentage. If there are data records with incorrect times, several examples will be shown on the right.

The data distribution result is organized into two sections. The top bar chart shows the total event count per event type, as well as for all events. The time series chart below shows the number of events per day for each event type, as well as for all events.

The major section below these two charts provides the top 20 most frequent values for each event attribute in the dataset. You may toggle between # and % in the upper right corner for absolute count and percentage views. The N/A keyword corresponds to events where the event attribute is missing. For example, the second graph below indicates that about 49% of events do not have the device_model field.

2.6 Transform Data (WIP)

Data transformation lets you create a new dataset by transforming an existing dataset according to custom logic written in Java code. Common use cases where you may need to transform your dataset include the following (see the example sketch after this list):

  • Events in your dataset do not have the same time field, e.g. different event types have different event time fields.

  • All events have the same time field. However, their time formats are different.

  • Data quality issues in a numerical field. For example, you have an amount field set to Double, but more than 20% of the events have an amount value that cannot be parsed as a Double. You may leverage data transformation to resolve such data errors.

  • Your dataset does not contain an event type field. Although such a field is not required in Data Studio and Feature Platform, it is required later in Model Management for UML model building. For example, if your dataset contains a single event type (e.g., only registration data), you may add an event type field with the constant value “registration”.
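
The exact Java transformation interface is not shown in this section, so the following is only a minimal sketch of the kind of logic involved. It assumes each event is passed to your code as a Map<String, String>; the field names (event_time, event_type) and the time format are examples and should be adjusted to your actual dataset and to the transformation template provided in the UI.

import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Map;

// Illustrative sketch only: the actual entry point is defined by the Data Studio template.
public class ExampleTransform {
    // Unify the time format and add a constant event type.
    public static Map<String, String> transform(Map<String, String> event) {
        String raw = event.get("event_time");
        // If the time is not already an epoch value, parse a "yyyy-MM-dd HH:mm:ss"
        // string and rewrite it as epoch milliseconds so all events share one format.
        if (raw != null && !raw.matches("\\d+")) {
            LocalDateTime dt = LocalDateTime.parse(raw, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
            event.put("event_time", String.valueOf(dt.toInstant(ZoneOffset.UTC).toEpochMilli()));
        }
        // Add a constant event_type for datasets that contain a single event type.
        event.putIfAbsent("event_type", "registration");
        return event;
    }
}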

Test Run

Before creating the new dataset based on the transformation logic above, you have the option to run a test using "Test Run."

Once you are satisfied with the new dataset, you can click on Save Changes as New Dataset to create the new dataset.

2.7 Resource Data

Unlike the datasets we have discussed so far, resource data is a separate type of data that facilitates feature creation. Resource data serves as lookup tables for query-like features. In Datavisor’s terminology, we refer to resource data as datasources.

Different data sources may serve different purposes. For example,

  • Suspicious IP: this data source maintains records of suspicious IPs, and a feature can query this table to see whether the current event matches any of those IPs.
  • Currency rate: this data source has two columns, currency and its rate. A feature can fetch the rate for an event’s currency type and then compute the amount value in USD.

Because blacklist and whitelist are very commonly used data sources, the system creates them by default. You may create other data sources yourself.

2.7.1 Create Data Source

Create a new data source by defining its data schema. From the top menu bar, choose Data Studio and then click on Resource Data. You will see the default data sources, as shown below.

Click on CREATE DATASOURCE and start by providing a name for your data source, e.g., “CurrencyRate”. Note that spaces are not allowed. Then click on ADD FIELD to add fields according to your data schema. In the currency rate example, we need a currency field of String type and a rate field of Double type. Select True for Partition Key if the field will serve as a query field. In our example, we need to query the currency field to obtain its rate, so the currency field should have Partition Key set to True and the rate field should be set to False. Finish by clicking on CONFIRM.
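
For the currency rate example, the resulting schema would look like this:

Field        Type      Partition Key
currency     String    True
rate         Double    False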

See below for detailed explanation for each configuration.

1. Use the Add Field button to add as many fields as needed. For each field, you will need to specify:

       a. Its name

       b. The type of the field. We currently support

              i. String

              ii. Float

              iii. Boolean

              iv. Double

              v. Long

              vi. Integer

       c. Partition Key - This field is used to denote whether or not the field will be used to query the datasource. Select True here if you would like to query the datasource using this value, and False otherwise.

       d. Cluster Key - This field is used to denote whether the ordering of the rows within a specific partition key depends on this field. Select True here if you want this field to contribute to the ordering of the rows when querying the datasource, and False otherwise.

2. Use the trashcan icon on the right to delete an erroneous field, and use the right-most symbol (the stripes) to arrange the ordering of the fields by dragging and dropping.

Once a data source has been created, you can add records to it. Find your data source and click on the … on the right, then select Records.  

Now you are in the Datasource Records Management page. Click on ADD RECORDS.

Here you can add records in one of two ways:

  • To add just a small number of records, you may add them manually in the UI. Click ADD and the records will be inserted into the datasource.
  • To add records in bulk, click on IMPORT FILE at the top right, and you will be taken to an Import page. You can upload a file whose schema matches the datasource schema exactly, and all of its records will be added to the datasource (see the example file below).
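
For example, continuing the CurrencyRate data source from Section 2.7.1, an import file could look like the following (this assumes a CSV file with a header row matching the datasource fields; the rate values are illustrative only):

currency,rate
USD,1.0
EUR,1.08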

Once you are done adding records, you can return to the previous page and query your new data to make sure it has been added successfully.

 

2.7.2 Use Data Source

Now that a data source has been created, we are ready to construct a feature to query from this data source. Navigate to the main Feature Platform menu and select Create Feature to get started.

We currently support querying a database (like the whitelist or blacklist) in Java or SQL.

SQL: Click on USE CODING and choose SQL as the language. The feature script takes the form SELECT some field FROM your data source table WHERE the partition key equals some query value. In the example below, assuming the event has an “amount_currency” field, the script queries the CurrencyRate data source to return the rate of amount_currency. Here, the $ sign is used to reference an existing field or feature. If the currency can be found in the CurrencyRate table (e.g., amount_currency is “USD” for an event), the corresponding rate will be returned (e.g., 1.0). Otherwise, the return value will be null.
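
Following the same pattern as the blacklist SQL example later in this chapter, the script for this example would look roughly like this (using the CurrencyRate schema from Section 2.7.1; the exact syntax may differ slightly from the screenshot):

select rate from CurrencyRate where currency = "$amount_currency"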

Java: We can also accomplish this task by the following feature script:

Double rate = $CurrencyRate("amount_currency", "rate", "0.0");

return rate;

This internal function, named with the $ sign plus the data source name, takes n arguments. The last two arguments are the datasource field name to return and its default value if no record can be fetched. The first n-2 arguments must exactly match the partition keys of your data source, in the same order. In our example, there is only one partition key, currency, so we query it using the “amount_currency” field, assuming this field is available in events.

Notice that, unlike the SQL syntax, we do not need the $ sign to reference the amount_currency field here.

2.7.3 Default Data Source: blacklist and whitelist

Due to their popularity, the blacklist and whitelist data sources are created by default for your convenience. Using such lists is an easy way to ensure that users and accounts that need to be caught or filtered out are handled accordingly.

The datasource schema is predefined for blacklist and whitelist. In both cases, there are exactly two columns, entity_type and entity_value, and both are of String type. entity_type is the field name, while entity_value is the blacklisted value for that field. Any records added to these datasources must have exactly these two columns. Here are some example records you may add to these tables:

entity_type     entity_value
ip              1.2.3.4
email           hello@world.com
device          0123456789

Let’s start with a simple feature that will query the blacklist datasource and determine whether a specific IP (the one in our event) is in that blacklist.

In SQL, the feature script would look like this:

select entity_value from blacklist where entity_type = "ip" and entity_value = "$ip"

This script will query the blacklist datasource with the criteria that entity_type is equal to “ip” and entity_value is equal to your event’s ip field value (using the “$” allows us to reference this).

In Java, the feature script would look like this:

String ipBlacklistVal = $blacklist("ip_constant", "ip", "entity_value", "");

return ipBlacklistVal;

In this case, we are querying the blacklist as a feature and supplying parameters that can help us get the IP out, in the same way as before.

  • ip_constant: A blacklist or whitelist can contain many different entities, so we need to specify which entity we are querying. We must first create a constant feature (here called ip_constant) that simply returns the entity type value (for example, return "ip";). This ensures that when we query the blacklist, we are looking at the right entity_type. This parameter must be a String.
  • ip: The field name in the dataset whose value we want to query. If you have multiple such fields to query, each query is just one line of code within the same feature, rather than requiring a separate SQL feature for each.
  • entity_value: The name of the column in the blacklist that holds the values to return. In most cases, this will be the constant entity_value.
  • "": This argument specifies the default return value when the value cannot be found in the blacklist. Here, ipBlacklistVal will be an empty string if the IP is not found in the blacklist.

Since this is in Java, we can do further processing on ipBlacklistVal if we want, whereas in SQL we are constrained to simply returning that value as-is. See the SQL vs Java comparison below for more information.
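
For instance (an illustrative sketch, not a prescribed pattern), the same feature could post-process the lookup result and return a boolean instead of the raw value:

String ipBlacklistVal = $blacklist("ip_constant", "ip", "entity_value", "");
// Return true if the event's IP was found in the blacklist, false otherwise.
return !ipBlacklistVal.isEmpty();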

SQL vs Java

When deciding whether to use SQL or Java to query a datasource, please consider the following:

SQL

  • + Easy to understand syntax
  • + Less error-prone, since there are fewer parameters passed to the feature script
  • - Cannot do any modification beyond simply returning the feature value
  • - Need secondary features to do further post-processing

Java

  • + All processing can be done within 1 feature
  • + Multiple queries to the database can occur within one feature
  • - More complicated syntax
  • - More error-prone
  • - Requires using a constant feature before querying

In general, we recommend using the SQL version unless there is a pressing need for Java, since SQL is more convenient and less likely to require a lot of manual debugging.
