In this example we ask VerbaGPT to give us the top 5 models to predict a variable.
User question: Give me the top 5 models to predict Rings. use r2 as performance metric
VerbaGPT response:
Cost of query: $0.00
Random Forest: 0.5727
Gradient Boosting: 0.5616
Linear Regression: 0.5390
XGBoost: 0.5213
Decision Tree: 0.1661
This is truly remarkable. Not only did VerbaGPT find the relevant database from a plain-language query (thanks to its embedding framework), it also applied fairly sophisticated statistical modeling code to answer a user question that contained no technical jargon.
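The "embedding framework" mentioned above is not described in detail here, but the general idea behind this kind of retrieval can be sketched simply: embed a description of each table, embed the user's query, and pick the table with the highest cosine similarity. The vectors and table names below are illustrative placeholders, not VerbaGPT's actual implementation.

```python
import math

# Hypothetical embeddings of each table's description; in practice these
# would come from an embedding model, not hand-written vectors.
table_embeddings = {
    "d_Abalone": [0.9, 0.1, 0.2],
    "d_Sales": [0.1, 0.8, 0.3],
    "d_Employees": [0.2, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_table(query_vec):
    # Return the table whose description embedding is closest to the query.
    return max(table_embeddings, key=lambda t: cosine(table_embeddings[t], query_vec))

# Pretend this vector is the embedding of "predict Rings".
print(best_table([0.85, 0.15, 0.25]))  # d_Abalone
```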
In addition to producing the result above, VerbaGPT also provides the user with the following (editable) code. The user can edit the code before hitting the execute button, or simply hit execute without editing to produce the results shown above.
VerbaGPT Message Window
Here is a code snippet that you can use to get the top 5 models to predict Rings in your MS SQL database using Python:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

# Read data from the database (`conn` is an open connection to the MS SQL database)
query = "SELECT * FROM d_Abalone"
data = pd.read_sql(query, conn)

# One-hot-encode the 'Sex' column
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
sex_encoded = encoder.fit_transform(data[['Sex']])
data = data.drop('Sex', axis=1)
data[['Sex_F', 'Sex_I', 'Sex_M']] = sex_encoded

# Split data into features/target, then into training and testing sets
X = data.drop('Rings', axis=1)
y = data['Rings']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(),
    'Random Forest': RandomForestRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'XGBoost': XGBRegressor()
}

# Train each model and record its R^2 score on the test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = r2_score(y_test, y_pred)

# Print the top 5 models by R^2 score, best first
top_models = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:5]
for name, score in top_models:
    print(f'{name}: {score:.4f}')
```
This code loads the data from the `d_Abalone` table in your MS SQL database into a pandas DataFrame. It then one-hot-encodes the `Sex` column and splits the data into features and target. The data is further split into training and testing sets. The code then creates a dictionary of models to evaluate and scores each one using R², as the user requested. The results are sorted by R² in descending order and the top 5 models are printed.
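One detail the generated snippet leaves implicit is the `conn` object, which must be an open connection to the MS SQL database before `pd.read_sql` is called. A minimal sketch of how that connection might be created with pyodbc follows; the post does not show VerbaGPT's actual connection setup, and the driver, server, and database names here are placeholders to adapt to your environment.

```python
import pyodbc

# Placeholder connection details; substitute your own server and database.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;"
    "DATABASE=TestDB;"
    "Trusted_Connection=yes;"
)
conn = pyodbc.connect(conn_str)
```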
This code looks great and is a jumping-off point for further modeling. It is remarkable that a simple question, free of jargon, can produce a complete answer backed by statistical modeling in seconds.
*Db_Test_01 is a testing database on our system.