
Data for feature store exercise – Vertex AI Feature Store

For this exercise, data is downloaded from Kaggle (the link is provided below); the dataset is listed under the CC0: Public Domain license. The data contains information about employee promotions. Since we are not building a model from this data, only 5 attributes (employee ID, education, gender, number of trainings, and age) and only 50 samples are considered.

https://www.kaggle.com/datasets/arashnic/hr-ana

The feature_store_input bucket is created under us-central1 (single region) and the CSV file is uploaded to the bucket as shown in Figure 9.2:

Figure 9.2: Data stored in cloud storage
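If you want to reproduce this subset yourself, the following is a minimal pandas sketch. The file name and column names (employee_id, education, gender, no_of_trainings, age) are assumed from the Kaggle download and may differ slightly:

import pandas as pd

# File and column names assumed from the Kaggle HR file; adjust if they differ
df = pd.read_csv('train.csv')
subset = df[['employee_id', 'education', 'gender', 'no_of_trainings', 'age']].head(50)

# Save the subset locally before uploading it to the GCS bucket
subset.to_csv('feature_store_input.csv', index=False)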

Working on feature store using GUI

Before we can ingest data into the feature store, we need to create a feature store, an entity type, and features. Feature store resources can be created using the GUI or Python code (a short Python sketch is given at the end of this walkthrough). Follow the below-mentioned steps to create the feature store resources using the GUI:

Step 1: Opening of feature store.

The landing page of the Vertex AI is shown in Figure 9.3:

Figure 9.3: Landing page of Vertex AI

  1. Click Feature Store to open.

Step 2: Landing page of feature store

Feature stores are region specific; the landing page provides information on the feature store under a specific region as shown in Figure 9.4:

Figure 9.4: Landing page of feature store

  1. Click CREATE FEATURESTORE.

The region needs to be selected in this step (the region cannot be changed after this step).

Step 3: Creation of feature store

Follow the steps mentioned in Figure 9.5 to create a feature store:

Figure 9.5: Creation of feature store

  1. Provide a name for the feature store.
  2. Enable Online Serving if the features need to be made available for low-latency online serving.
  3. Since the volume of data is small, select Fixed Nodes and provide the value of 1.
  4. Click on CREATE.

Step 4: Feature store created successfully

Once the feature store is created it will be displayed on the landing page as shown in Figure 9.6:

Figure 9.6: Feature store created and listed on the landing page

  1. Newly created feature store.
  2. Click on Create Entity Type to create one.

Step 5: Creation of entity type

Follow the steps shown in Figure 9.7 to create an entity type:

Figure 9.7: Creation of entity type

  1. Select the Region under which the feature store is created.
  2. All the feature stores under the region will be listed; select the newly created Featurestore.
  3. Provide Entity type name.
  4. Write the Description for the entity type.
  5. Enable Feature monitoring if the features need to be monitored.
  6. Monitoring covers both the feature store and individual features: featurestore monitoring tracks CPU utilization, storage, and serving latency, while feature monitoring tracks changes in feature value distribution.
  7. Click CREATE.
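The same resources can also be created programmatically. The following is a minimal sketch using the Vertex AI Python SDK (google-cloud-aiplatform); the project, region, and resource IDs are placeholders, and exact parameter names may vary slightly across SDK versions:

from google.cloud import aiplatform

# Placeholder project and region; use your own values
aiplatform.init(project='your-project-id', location='us-central1')

# Create the feature store with one fixed node for online serving
fs = aiplatform.Featurestore.create(
    featurestore_id='employee_featurestore',
    online_store_fixed_node_count=1,
)

# Create an entity type under the feature store
employee = fs.create_entity_type(
    entity_type_id='employee',
    description='Employee promotion attributes',
)

# Create features under the entity type
employee.create_feature(feature_id='education', value_type='STRING')
employee.create_feature(feature_id='age', value_type='INT64')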

Advantages of feature store – Vertex AI Feature Store

These are the advantages of feature store:

  • Extend features company-wide: Feature stores let you easily share features for training or serving, so different projects and use cases do not need to re-engineer them. Managing and delivering features from a central repository preserves consistency throughout your business and prevents redundant effort, especially for high-value features.

Vertex AI Feature Store lets people find and reuse features using search and filtering. View feature metadata to assess quality and usage. For instance, you may check feature coverage and feature value distribution.

  • Serving at scale: Online predictions require low-latency feature serving, which Vertex AI Feature Store manages for you. It automatically builds and scales the low-latency serving infrastructure, so you create features and delegate serving them to the platform. This lets data scientists create new features without worrying about how they are deployed (see the online-read sketch after this list).
  • Reduce training-serving skew: Training-serving skew happens when the feature data distribution you see in production differs from the one used to train your model. This skew causes disparities between a model’s training and production performance. Vertex AI Feature Store addresses training-serving skew in the following ways:
    • Vertex AI Feature Store ensures that feature values are ingested once and reused for both training and serving. Without a feature store, training and serving may follow distinct code paths, in which case the feature values used for training and serving can differ.
    • Vertex AI Feature Store offers historical (point-in-time) lookups for training. By retrieving only the feature values that were available before the prediction time, these lookups reduce data leakage.
  • Identify drift: Vertex AI Feature Store helps you detect drift in feature data distributions by monitoring how feature values are distributed over time. When drift on a feature becomes significant, retrain the models that consume it.
  • Retention: Vertex AI Feature Store retains feature values for the configured retention period. This limit is based on the timestamp of the feature values, not on the date and time the values were imported. Values whose timestamps exceed the limit are scheduled for deletion by Vertex AI Feature Store.
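As referenced under Serving at scale above, the following is a minimal sketch of a low-latency online read using the Vertex AI Python SDK; the entity ID, feature IDs, and resource IDs are placeholders, and parameter names may vary by SDK version:

from google.cloud import aiplatform

aiplatform.init(project='your-project-id', location='us-central1')

# Look up the entity type created earlier (placeholder IDs)
employee = aiplatform.featurestore.EntityType(
    entity_type_name='employee',
    featurestore_id='employee_featurestore',
)

# Online read of selected features for a single entity (returns a DataFrame)
df = employee.read(entity_ids=['emp_001'], feature_ids=['education', 'age'])
print(df)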

Disadvantages of feature store

A feature store adds overhead, which can make data science work more complicated, especially for smaller projects. If a business has numerous small data sets, a feature store may add more complexity than value. Feature stores are also ineffective when the data is so diverse that no standard modeling approach applies, and reusing features built on separate data sources and metadata is difficult. Additionally, a feature store might not be the ideal choice when the features are not time-dependent or when features are needed only for batch predictions.

Knowing Vertex AI feature store – Vertex AI Feature Store

Introduction

After learning about the pipelines of the platform, we will move to the feature store of GCP. In this chapter, we will start with an understanding of the feature store and its advantages, followed by hands-on work with the feature store.

Structure

In this chapter, we will cover the following topics:

  • Knowing Vertex AI feature store
  • Hierarchy of feature store
  • Advantages of feature store
  • Disadvantages of feature store
  • Working on feature store using GUI
  • Working on feature store using Python
  • Deleting resources
  • Best practices for Feature store

Objectives

By the end of this chapter, users will have a good idea about the feature store, when to use it, and how to employ it with the web console of GCP and Python.

Knowing Vertex AI feature store

Vertex AI Feature Store is a centralized repository for managing and delivering machine learning features. To speed up the process of creating and delivering high-quality ML applications, many organizations are turning to centralized feature stores to facilitate the sharing, discovery, and re-use of ML features at scale.

The storage and processing power, as well as other components of the backend infrastructure, are handled by Vertex AI Feature Store, making it a fully managed solution. As a result of this strategy, data scientists may ignore the difficulties associated with delivering features into production and instead concentrate on the feature computation logic.

The feature store in Vertex AI is an integral aspect of the overall system. Use Vertex AI Feature Store on its own or include it in your existing Vertex AI workflows. For instance, the Vertex AI Feature Store may be queried for information to be used in the training of custom or AutoML models.
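As an illustration of querying the feature store for training data, the following is a hedged sketch of a point-in-time batch retrieval into BigQuery using the Vertex AI Python SDK; all URIs, table names, and feature IDs below are placeholders, and parameter names may differ across SDK versions:

from google.cloud import aiplatform

aiplatform.init(project='your-project-id', location='us-central1')

# Placeholder feature store ID
fs = aiplatform.Featurestore(featurestore_name='employee_featurestore')

# Batch-serve selected features to a BigQuery table for model training;
# read_instances_uri points to a table of entity IDs and timestamps (placeholder URIs)
fs.batch_serve_to_bq(
    bq_destination_output_uri='bq://your-project-id.your_dataset.training_features',
    serving_feature_ids={'employee': ['education', 'age', 'no_of_trainings']},
    read_instances_uri='bq://your-project-id.your_dataset.read_instances',
)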

Hierarchy of feature store

A feature store holds collections of entities for each entity type. Fields like the entity ID, a timestamp, and a series of features (feature 1, feature 2, and so on) are defined for each entity type. The hierarchy of the feature store is described in Figure 9.1 (and mirrored in the resource names sketched after the following list):

Figure 9.1: Hierarchy of feature store

  • Feature store: A top-level container for entity types, features, and their values.
  • Entity type: A collection of semantically related features (real or virtual).
  • Entity: An instance of the entity type.
  • Feature: A measurable property or attribute of an entity type.
  • Feature values: These contain values of the features at a specific point in time.
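The hierarchy is also reflected in the fully qualified resource names used by the API; the IDs below are placeholders only:

# Placeholder project, region, and resource IDs
PROJECT = 'your-project-id'
REGION = 'us-central1'

# Feature store -> entity type -> feature
featurestore = f'projects/{PROJECT}/locations/{REGION}/featurestores/employee_featurestore'
entity_type = f'{featurestore}/entityTypes/employee'
feature = f'{entity_type}/features/education'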

Importing packages – Pipelines using TensorFlow Extended-1

Step 6: Importing packages
Run the following lines of code to import the required packages:
# TensorFlow and TensorFlow Transform
import tensorflow as tf
import tensorflow_transform as tft
from tensorflow import keras
from tensorflow_transform.tf_metadata import schema_utils

# TFX components and interactive orchestration
from tfx import v1 as tfx
from tfx_bsl.public import tfxio
from tfx.components.base import executor_spec
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

# Schema proto and standard utilities
from tensorflow_metadata.proto.v0 import schema_pb2
import os
from typing import List

Step 7: Understanding a few TFX components from code
Even before jumping into pipeline creation and execution, let us try to understand how a few of the TFX components can be used individually and analyze the output of those components. Let us start with the ExampleGen component.
ExampleGen: Run the following code in a new cell:
# Create an interactive context to run TFX components inside the notebook
context_in = InteractiveContext()

# Read the CSV data from the GCS input location defined earlier
example_gen_csv = tfx.components.CsvExampleGen(input_base=INPUT_DATA_DIR)
context_in.run(example_gen_csv)

The ExampleGen component can read data from various sources and formats such as CSV files, TFRecord files, and BigQuery. An interactive widget displaying the results of ExampleGen will appear in the notebook once the run is complete, as shown in Figure 8.6. ExampleGen generates two types of artifacts, known as training and evaluation examples; by default, it divides the data into two-thirds for the training set and one-third for the evaluation set. The location where these artifacts are stored can also be viewed as shown:

Figure 8.6: Output for example_gen component
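The artifact location and split names mentioned above can also be inspected programmatically inside the interactive context; a short sketch:

# Fetch the Examples artifact produced by ExampleGen and print its details
examples_artifact = example_gen_csv.outputs['examples'].get()[0]
print('Artifact URI :', examples_artifact.uri)
print('Split names  :', examples_artifact.split_names)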
StatisticsGen: Run the following code in a new cell:
gen_statistics = tfx.components.StatisticsGen(examples=example_gen_csv.outputs['examples'])
context_in.run(gen_statistics)
context_in.show(gen_statistics.outputs['statistics'])

Statistics over your dataset are computed using the StatisticsGen component. These statistics provide a quick overview of your data, including details such as shape, features, and value distribution. You will use the output from ExampleGen as input to compute statistics about the data. An interactive widget displaying the statistics of the training and evaluation datasets separately appears once the run is complete, as shown in Figure 8.7:

Figure 8.7: Output for statistics_gen component
SchemaGen: Run the following code in a new cell:
gen_schema = tfx.components.SchemaGen(statistics=gen_statistics.outputs['statistics'])
context_in.run(gen_schema)
context_in.show(gen_schema.outputs['schema'])

From the statistics, the SchemaGen component will generate a schema for your data. A schema is simply a data definition: it defines the data features’ types, expected properties, bounds, and so on. The output of SchemaGen is shown in Figure 8.8:

Figure 8.8: Output for schema_gen component
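If you later curate the generated schema by hand (for example, tightening value ranges), TFX lets you import the curated file back into the pipeline. A hedged sketch, assuming the curated schema has been saved to the hypothetical path schema/schema.pbtxt:

# Import a manually curated schema instead of the auto-generated one
# (schema/schema.pbtxt is a hypothetical path)
schema_importer = tfx.components.ImportSchemaGen(schema_file='schema/schema.pbtxt')
context_in.run(schema_importer)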
ExampleValidator: Run the following code in a new cell:
stats_validate = tfx.components.ExampleValidator(statistics=gen_statistics.outputs['statistics'], schema=gen_schema.outputs['schema'])
context_in.run(stats_validate)
context_in.show(stats_validate.outputs['anomalies'])

Based on the defined schema, this component validates your data and detects anomalies. In production, it can be used to validate any new data that enters your pipeline. It can detect drift and skew in new data, unexpected data types, and new columns that were not present in the schema. The output of ExampleValidator is shown in Figure 8.9:

Figure 8.9: Output of example validator component

Data for pipeline building – Pipelines using TensorFlow Extended

For this exercise, data is downloaded from Kaggle (the link is provided below); the dataset is listed under the CC0: Public Domain license. The data contains various measurements from an EEG, and the state of the eye is captured via camera: 1 indicates a closed eye and 0 indicates an open eye.
https://www.kaggle.com/datasets/robikscube/eye-state-classification-eeg-dataset
The tfx_pipeline_input_data bucket is created under us-central1 (single region) and the CSV file is uploaded to the bucket as shown in Figure 8.2:

Figure 8.2: Data in GCS for pipeline construction
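To sanity-check the data before building the pipeline, a quick pandas sketch can be used; the file name and the eyeDetection column name are assumptions based on the Kaggle dataset and may differ in your download:

import pandas as pd

# File and column names assumed from the Kaggle download; adjust if they differ
eeg = pd.read_csv('EEG_Eye_State_Classification.csv')
print(eeg.shape)
# 1 = closed eye, 0 = open eye
print(eeg['eyeDetection'].value_counts())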
Pipeline code walk through
A workbench needs to be created to run the pipeline code. Follow the steps mentioned in the chapter Vertex AI workbench & custom model training for creation of the workbench (choose TensorFlow Enterprise | TensorFlow Enterprise 2.9 | Without GPUs; refer to Figure 8.3 for reference. All other steps are the same as mentioned in the Vertex AI workbench and custom model training chapter).

Figure 8.3: Workbench creation using TensorFlow enterprise
Step 1: Create Python notebook file
Once the workbench is created, open JupyterLab and follow the steps mentioned in Figure 8.4 to create a Python notebook file:

Figure 8.4: New launcher window

  1. Click New launcher.
  2. Double-click on the Python 3 Notebook to create one.

From Step 2 onwards, run the code in separate cells.
Step 2: Package installation
Run the following commands to install the TFX (with Kubeflow Pipelines support), Apache Beam, and python-snappy packages (it will take a few minutes to install them):
USER_FLAG = "--user"
!pip install {USER_FLAG} --upgrade "tfx[kfp]<2"
!pip install {USER_FLAG} apache-beam[interactive]
!pip install python-snappy

Step 3: Kernel restart
Type the following commands in the next cell to restart the kernel (users can restart the kernel from the GUI as well):
import os
import IPython

# Shut down the kernel so the newly installed packages are picked up on restart
if not os.getenv(""):
    IPython.Application.instance().kernel.do_shutdown(True)

Step 4: Verify packages are installed
Run the following lines of code to check whether the packages are installed (if the packages are not installed properly, try upgrading the pip package before installing the tfx and kfp packages):
import snappy
import warnings
warnings.filterwarnings('ignore')

import tensorflow as tf
from tfx import v1 as tfx
import kfp

# Print package versions to confirm the installation
print('TensorFlow version:', tf.__version__)
print('TFX version:', tfx.__version__)
print('KFP version:', kfp.__version__)

If the packages are installed properly, you should see the versions of the TensorFlow, TensorFlow Extended, and Kubeflow Pipelines packages as shown in Figure 8.5:

Figure 8.5: Packages installed successfully
Step 5: Setting up the project and other variables
Run the following lines of code in a new cell to set the project to the current one and to define variables that store paths used for multiple purposes:
PROJECT_ID = "vertex-ai-gcp-1"
!gcloud config set project {PROJECT_ID}
BUCKET_NAME = "tfx_pipeline_demo"
NAME_PIPELINE = "tfx-pipeline"
ROOT_PIPELINE = f'gs://{BUCKET_NAME}/root/{NAME_PIPELINE}'
MODULE_FOLDER = f'gs://{BUCKET_NAME}/module/{NAME_PIPELINE}'
OUTPUT_MODEL_DIR = f'gs://{BUCKET_NAME}/output_model/{NAME_PIPELINE}'
INPUT_DATA_DIR = 'gs://tfx_pipeline_input_data'

ROOT_PIPELINE is used to store the artifacts of the pipeline, MODULE_FOLDER stores the .py file for the trainer component, OUTPUT_MODEL_DIR stores the trained model, and INPUT_DATA_DIR is the GCS location where the input data resides.
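If the artifact bucket defined in BUCKET_NAME does not exist yet, it can be created from the notebook; the region is assumed here to be us-central1 to match the input bucket:

# Create the pipeline artifact bucket (region assumed; skip if the bucket already exists)
!gsutil mb -l us-central1 gs://{BUCKET_NAME}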