Rank Top Drivers

In this example we ask VerbaGPT to give us a ranked list of variables that drive a target variable’s value. In this example we use the famous (in data modeling) Abalone dataset 1.

Video:

User question: Rank the top 5 variables that explain the variation in Rings

VerbaGPT response:

Cost of query: $0.00

       Feature      F-value
6  ShellWeight  2712.724757
1     Diameter  2058.519849
2       Height  1882.482838
0       Length  1875.169489
3  WholeWeight  1722.067593

VerbaGPT got it right. It used an embedding framework to find the relevant database, table and columns in Db_Test_01*. The table d_Abalone has a column Rings that indicates the age of Abalones. Other columns are Abalone features that may relate to age.

In addition to producing the result above, VerbaGPT also provides the user with the following (editable) code. If the user chooses to, they can edit the code before hitting execute button, or simply hit execute without editing the code to produce the results shown above.

VerbaGPT Message Window

Here is a code snippet that you can use to rank the top 5 variables that explain the variation in Rings in your MS SQL database:

import pandas as pd
from sklearn.feature_selection import f_regression

# Load data from the d_Abalone table into a pandas DataFrame
query = "SELECT * FROM d_Abalone"
data = pd.read_sql(query, conn)

# One-hot-encode the Sex column
data = pd.get_dummies(data, columns=['Sex'])

# Define the target variable (Rings) and the feature variables
y = data['Rings']
X = data.drop('Rings', axis=1)

# Perform an F-test to rank the feature variables by their ability to explain the variation in Rings
f_values, p_values = f_regression(X, y)
f_test_results = pd.DataFrame({'Feature': X.columns, 'F-value': f_values})
f_test_results.sort_values('F-value', ascending=False, inplace=True)

# Print the top 5 feature variables
print(f_test_results.head(5))

This code loads data from the `d_Abalone` table into a pandas DataFrame and one-hot-encodes the `Sex` column. It then defines the target variable (`Rings`) and the feature variables (all other columns). An F-test is performed to rank the feature variables by their ability to explain the variation in `Rings`. Finally, the top 5 feature variables are printed.

This code looks great and is a perfect base for further modeling. It’s amazing that a simple question, without any jargon, can produce a complete answer with fairly complex statistical modeling (for a business scenario).

*Db_Test_01 is a testing database on our system.

  1. Abalone Dataset

Posted

in

, ,

by

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *