Datasets Overview

A Dataset is a platform entity that represents data stored in our data lake. Datasets can be added to the platform in three ways:

  1. Users can upload CSV or Parquet files through the user interface.
  2. Users can generate datasets in the Query Editor.
  3. The backend automatically generates some types of datasets when modeling workflows are run.

We allow users to tag datasets in the UI for organizational purposes; the backend will also set these tags automatically based on the type of dataset created.

Dataset Types

Train: A full training dataset will include individual_id, yr, mo, label, treatment (for intervention training datasets), and all features used to train the model. Train datasets are used in custom dataset models.

Predict: Contains the same information as a Train dataset, except that the label column is not present. Predict datasets are used in custom dataset models.

Population: Similar to Train and Predict datasets, but generated by the API model population query builder for point-and-click models.

Cohort: A dataset representing a cohort of patients. This will include individual_id, yr (optional), and mo (optional).

  • If the cohort dataset contains only individual_id, every individual in the cohort will potentially be included in every time window of a model.
  • If the cohort dataset also contains yr and mo, the yr and mo values will be matched to the "anchor date" (the first month after the end of the evidence period) of each time window of a model.
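
The two cohort cases above can be illustrated with a small sketch. This is not the platform's implementation: the function names and the row layout (a list of dicts) are hypothetical, and the sketch assumes that a cohort row without yr and mo is simply eligible for every time window.

```python
def anchor_date(evidence_end_yr, evidence_end_mo):
    """Return (yr, mo) for the first month after the evidence period ends."""
    if evidence_end_mo == 12:
        return evidence_end_yr + 1, 1
    return evidence_end_yr, evidence_end_mo + 1


def cohort_for_window(cohort_rows, evidence_end_yr, evidence_end_mo):
    """Return the individual_ids eligible for one time window.

    cohort_rows is a list of dicts with 'individual_id' and, optionally,
    'yr' and 'mo' (hypothetical in-memory layout for illustration).
    """
    anchor = anchor_date(evidence_end_yr, evidence_end_mo)
    selected = []
    for row in cohort_rows:
        if "yr" in row and "mo" in row:
            # Dated cohort row: keep it only if it matches the anchor date.
            if (row["yr"], row["mo"]) == anchor:
                selected.append(row["individual_id"])
        else:
            # Undated cohort row: assumed eligible for every window.
            selected.append(row["individual_id"])
    return selected
```

For example, with an evidence period ending in December 2023, the anchor date is January 2024, so only dated cohort rows with yr 2024 and mo 1 would be selected for that window.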

Outcome: A dataset representing outcomes for a cohort of patients. This will include individual_id, yr, mo, and label.

Intervention: A record of to whom an intervention was applied, and when. It includes one row for every application, with columns individual_id, yr, and mo.
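
Before uploading a file, it can help to check its columns against the type descriptions above. The following is a minimal sketch of such a check, not a platform API: REQUIRED_COLUMNS and missing_columns are hypothetical names, feature columns are not checked because they vary by model, and Population is omitted because those datasets are generated by the backend.

```python
# Required columns per dataset type, taken from the descriptions above.
# Feature columns are model-specific and therefore not listed here.
REQUIRED_COLUMNS = {
    "train": {"individual_id", "yr", "mo", "label"},
    "predict": {"individual_id", "yr", "mo"},
    "cohort": {"individual_id"},  # yr and mo are optional
    "outcome": {"individual_id", "yr", "mo", "label"},
    "intervention": {"individual_id", "yr", "mo"},
}


def missing_columns(dataset_type, columns, intervention_model=False):
    """Return a sorted list of required columns absent from `columns`."""
    required = set(REQUIRED_COLUMNS[dataset_type])
    # Assumption: only Train datasets for intervention models are checked
    # for the treatment column; adjust if Predict datasets need it too.
    if intervention_model and dataset_type == "train":
        required.add("treatment")
    return sorted(required - set(columns))
```

An empty list means every required column is present; for instance, passing the columns individual_id, yr, and mo for an "outcome" dataset would report label as missing.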

Creating a Custom Dataset

  1. On the top navigation bar, hover over the "Data" button to the right of the logo

  2. In the dropdown menu that opens, select "Datasets" to navigate to the Datasets page

  3. Select the "Create Dataset" button at the top left of the page; a pop-up will open

  4. Click the "Select type" box and choose the type of dataset you are uploading (Training, Prediction, Cohort, Outcome, Intervention) from the dropdown menu (see Dataset Types above for the required columns of each type)

  5. Name your dataset

  6. (Optional) Write a description of your dataset

  7. Bring your CSV or Parquet file into the platform in one of two ways:

    – Click the box labeled "Click or drag CSV or Apache Parquet to this area to upload file" and select the file in the file browser that opens

    – Drag the file from your computer into that same box

  8. When the required fields are completed, click the "Create" button to create your dataset
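
As an illustration of preparing a file for the upload steps above, the sketch below builds the text of a small Outcome CSV with Python's standard csv module. The helper name, the example IDs, and the 0/1 label encoding are assumptions made for illustration; the platform's expected label encoding is not specified here.

```python
import csv
import io

# Column order for an Outcome dataset, per the Dataset Types section.
OUTCOME_COLUMNS = ["individual_id", "yr", "mo", "label"]


def outcome_csv_text(rows):
    """Serialize a list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=OUTCOME_COLUMNS, lineterminator="\n"
    )
    writer.writeheader()
    # DictWriter raises ValueError if a row contains an unexpected column.
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical example rows; label values 0/1 are an assumption.
example_rows = [
    {"individual_id": "p001", "yr": 2023, "mo": 4, "label": 1},
    {"individual_id": "p002", "yr": 2023, "mo": 4, "label": 0},
]
csv_text = outcome_csv_text(example_rows)
```

The resulting text can be written to a .csv file and then selected or dragged into the upload box.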


Generating a Dataset with the Query Builder

Data Queries are platform entities that represent a query against the data. Queries are often used to generate new datasets. The platform backend uses queries to generate the Population, Cohort, Outcome, and Intervention datasets described above, so for any system-generated dataset a user can inspect the underlying query that produced it.

The platform provides a query editor UI that allows users to develop and execute queries against our data lake. Whenever a user executes a query, the platform creates a new Data Query object, and the history of queries can be seen in the UI. Users can also generate a dataset from a query, and that dataset can then be used in modeling workflows.

To generate a dataset using the query builder:

  1. On the top navigation bar, hover over the "Data" button to the right of the logo

  2. In the dropdown menu that opens, select "Query" to navigate to the query interface