ChatGPT as a Python Programming Assistant

[Image: Original image by Sigmund on Unsplash | Edits by author]

ChatGPT from OpenAI is... you already know. You've read enough about it, and introductions are no longer needed. If you happen to not know what ChatGPT is, have a look at this first, and then come on back to continue.

Conversations with ChatGPT can be entertaining, even enlightening. But let's have a look at how helpful it can actually be.

Let's see what ChatGPT can do when it comes to writing some code from specifications that we provide. We will start simple and build from there.


I figured that a good place to start would be a data processing script of some sort. Since I intend to have ChatGPT write some machine learning scripts afterwards using a variety of Python libraries, starting with data preparation seemed reasonable.

The Goal


First, I wanted to make some attempts at loading a dataset; ensuring all dataset values are numeric values, or converting them to be such; checking for missing values; and splitting the dataset into train and test sets.

The Prompt


Write a Python script that does the following:
- loads a dataset from csv on disk to Pandas DataFrame
- ensures that dataset is entirely numeric; converts string values to numeric and tracks this conversion in a dictionary to reference later
- checks for missing values in the DataFrame and imputes missing values for any record that has one missing value; entirely drops a record that has more than one missing value
- splits the dataset into test and training sets

The Response

import pandas as pd

# Load dataset from csv to DataFrame
df = pd.read_csv('dataset.csv')

# Initialize dictionary to track string to numeric conversions
conversions = {}

# Convert string values to numeric and track conversions in dictionary
for col in df.columns:
    if df[col].dtype == object:
        conversions[col] = {val: i for i, val in enumerate(df[col].unique())}
        df[col] = df[col].map(conversions[col])

# Check for missing values
missing_values = df.isna().sum()

# Impute missing values for records with one missing value
for col in missing_values[missing_values == 1].index:
    df[col].fillna(df[col].median(), inplace=True)

# Drop records with more than one missing value
df.dropna(thresh=len(df.columns) - 1, inplace=True)

# Split dataset into test and training sets
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2)

Note: the script loads the dataset from a CSV file named "dataset.csv" and splits the DataFrame into training and test sets with an 80/20 ratio. If you want to use a different file name or ratio, change the path/ratio accordingly.

The Analysis


ChatGPT produced a useful script. It decided on a dataset name, as well as a train/test split percentage, and noted both of these points and how to change them if desired. Overall, mission accomplished.
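
One small follow-up worth making ourselves: the script stores the string-to-numeric mappings in the conversions dictionary "to reference later," but never demonstrates that later reference. Here is a minimal sketch, not part of ChatGPT's output, of how those mappings could be inverted to recover the original string values (the function decode_column and the column name are my own, hypothetical):

def decode_column(df, col, conversions):
    # Invert the {string: code} mapping recorded during conversion
    inverse = {code: val for val, code in conversions[col].items()}
    # Map the numeric codes back to their original strings
    return df[col].map(inverse)

# Example usage, assuming 'city' was a string column in the original CSV
original_city_values = decode_column(df, 'city', conversions)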

The Goal


Now that we have a script that does something, let's refactor it a little. A cornerstone of good coding is the use of functions to logically organize your code, facilitate reuse, and ensure you don't repeat yourself.

The Prompt


Rewrite the above script to use a series of functions that make sense

The Response

import pandas as pd
from sklearn.model_selection import train_test_split

def load_and_convert_data(file_path):
    # Load dataset from csv to DataFrame
    df = pd.read_csv(file_path)

    # Initialize dictionary to track string to numeric conversions
    conversions = {}

    # Convert string values to numeric and track conversions in dictionary
    for col in df.columns:
        if df[col].dtype == object:
            conversions[col] = {val: i for i, val in enumerate(df[col].unique())}
            df[col] = df[col].map(conversions[col])
    return df, conversions

def handle_missing_values(df):
    # Check for missing values
    missing_values = df.isna().sum()

    # Impute missing values for records with one missing value
    for col in missing_values[missing_values == 1].index:
        df[col].fillna(df[col].median(), inplace=True)

    # Drop records with more than one missing value
    df.dropna(thresh=len(df.columns) - 1, inplace=True)
    return df

def split_data(df, test_size):
    return train_test_split(df, test_size=test_size)

# Use the functions
file_path = "dataset.csv"
df, conversions = load_and_convert_data(file_path)
df = handle_missing_values(df)
train_df, test_df = split_data(df, 0.2)

Here, I have used three functions:

1. load_and_convert_data(file_path) loads the CSV dataset and converts the string values to numeric.
2. handle_missing_values(df) handles the missing values in the dataset.
3. split_data(df, test_size) splits the dataset into test and training sets.

The main script uses these functions in order to load, convert and clean the dataset, then split it into test and train sets.

The path of the dataset and the test ratio are inputs for these functions.
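
Since reuse was one of our motivations for refactoring, it's worth noting that the same three functions can now drive a second pipeline without duplicating any logic. A quick hypothetical example (the file name and split ratio here are my own, not ChatGPT's):

# Reusing the same functions on a different dataset with a 70/30 split
df2, conversions2 = load_and_convert_data("another_dataset.csv")
df2 = handle_missing_values(df2)
train_df2, test_df2 = split_data(df2, 0.3)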

The Analysis


If you look at the first script output by ChatGPT, you can probably see some rather obvious lines you would draw to separate the functionality into functions. I certainly did, and ChatGPT did exactly what I expected it to do in this regard. Again, a competent script has been generated.

The Goal


OK, our script looks a little better now that it's using some functions for organization. Next, let's extend it slightly so that the user can specify a few parameters when running the script.

The Prompt


Rewrite the above program to solicit the following from the command line in the form of arguments, and use them in the program:
- the path/name of the dataset csv file
- the train/test split size
- the threshold number of missing values per record, above which the entire record is dropped instead of imputing the missing values

The Response

import pandas as pd
import argparse
from sklearn.model_selection import train_test_split

def load_and_convert_data(file_path):
    # Load dataset from csv to DataFrame
    df = pd.read_csv(file_path)

    # Initialize dictionary to track string to numeric conversions
    conversions = {}

    # Convert string values to numeric and track conversions in dictionary
    for col in df.columns:
        if df[col].dtype == object:
            conversions[col] = {val: i for i, val in enumerate(df[col].unique())}
            df[col] = df[col].map(conversions[col])
    return df, conversions

def handle_missing_values(df, threshold):
    # Check for missing values
    missing_values = df.isna().sum()

    # Impute missing values for records with one missing value
    for col in missing_values[missing_values == 1].index:
        df[col].fillna(df[col].median(), inplace=True)

    # Drop records with more than threshold missing values
    df.dropna(thresh=len(df.columns) - threshold, inplace=True)
    return df

def split_data(df, test_size):
    return train_test_split(df, test_size=test_size)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("file_path", help="path/name of the dataset csv file")
    parser.add_argument("test_size", help="train/test split size", type=float)
    parser.add_argument("threshold", help="threshold for the number of missing values per record", type=int)
    args = parser.parse_args()

    # Use the functions
    df, conversions = load_and_convert_data(args.file_path)
    df = handle_missing_values(df, args.threshold)
    train_df, test_df = split_data(df, args.test_size)

In this version, I have used the argparse module to take the inputs of file path, test/train split size, and threshold for the number of missing values per record as command-line arguments. The user will have to run the script with the correct arguments, for example:


python script.py dataset.csv 0.2 1

This will run the script using the "dataset.csv" file, a test_size of 0.2 and a threshold of 1 for the missing values.

The main script uses these arguments in the functions to load, convert and clean the dataset, then split it into test and train sets.

It's also worth mentioning that the script should be run from the command line, rather than from within an IDE.
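
One refinement worth considering, purely my own sketch rather than anything ChatGPT produced, is to turn the positional arguments into optional flags with defaults, so the script can still be run with no arguments at all:

# Sketch: optional flags with sensible defaults (the flag names are my own)
parser = argparse.ArgumentParser()
parser.add_argument("--file_path", default="dataset.csv",
                    help="path/name of the dataset csv file")
parser.add_argument("--test_size", type=float, default=0.2,
                    help="train/test split size")
parser.add_argument("--threshold", type=int, default=1,
                    help="missing values per record before dropping")
args = parser.parse_args()

With that change, python script.py alone would use the defaults, while something like python script.py --test_size 0.3 would override just the split ratio.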

The Analysis


Honestly, not much analysis is needed here, given that ChatGPT explained what it did quite well. I'll also note something here that I hadn't previously, and give ChatGPT a pat on the back for its commenting, which is expressive and concise.

The Goal


Now let's say we wanted to make this code "production" ready (humor me). Let's get ChatGPT to wrap these functions in a Streamlit app.

The Prompt


Rewrite the above program as a Streamlit app, allowing for the user to provide the same arguments as in the command line args above

The Response

import pandas as pd
import streamlit as st
from sklearn.model_selection import train_test_split

def load_and_convert_data(file_path):
    # Load dataset from csv to DataFrame
    df = pd.read_csv(file_path)

    # Initialize dictionary to track string to numeric conversions
    conversions = {}

    # Convert string values to numeric and track conversions in dictionary
    for col in df.columns:
        if df[col].dtype == object:
            conversions[col] = {val: i for i, val in enumerate(df[col].unique())}
            df[col] = df[col].map(conversions[col])
    return df, conversions

def handle_missing_values(df, threshold):
    # Check for missing values
    missing_values = df.isna().sum()

    # Impute missing values for records with one missing value
    for col in missing_values[missing_values == 1].index:
        df[col].fillna(df[col].median(), inplace=True)

    # Drop records with more than threshold missing values
    df.dropna(thresh=len(df.columns) - threshold, inplace=True)
    return df

def split_data(df, test_size):
    return train_test_split(df, test_size=test_size)

def main():
    st.set_page_config(page_title="Data Preprocessing", page_icon=":guardsman:", layout="wide")
    st.title("Data Preprocessing")

    file_path = st.text_input("Enter the path/name of the dataset csv file: ")
    test_size = st.number_input("Enter the train/test split size (decimal between 0 and 1): ", step=0.01, value=0.2)
    threshold = st.number_input("Enter the threshold for the number of missing values per record: ", step=1, value=1)

    if st.button("Process Data"):
        df, conversions = load_and_convert_data(file_path)
        df = handle_missing_values(df, threshold)
        train_df, test_df = split_data(df, test_size)
        st.success("Data preprocessing completed!")

if __name__ == '__main__':
    main()

This version is a Streamlit app that allows the user to provide the same arguments as command-line arguments. It uses the Streamlit library to create an interactive web app that allows the user to input the file path, test/train split size, and threshold for the number of missing values per record. The user can then click the "Process Data" button to run the script and preprocess the data. It uses the same functions as the previous version of the script to load, convert, and clean the dataset, then split it into test and train sets.

To run the script, you need to have Streamlit installed; you can install it with the command pip install streamlit in your terminal. Once installed, run the script with the command streamlit run script.py.
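
If I were to nudge the app a little further, one natural tweak, again my own sketch rather than ChatGPT's, would be to swap the text input for Streamlit's file uploader, so the user can upload the CSV directly instead of typing a path:

# Sketch: st.file_uploader returns a file-like object that
# pd.read_csv accepts directly, so no file path is needed
uploaded_file = st.file_uploader("Upload the dataset CSV", type="csv")
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)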

The Analysis


It did what it was supposed to do. ChatGPT took our functioning... functions and built a simple Streamlit web app with them.

[Image: The Streamlit web app created with the script generated by ChatGPT]


ChatGPT did a perfectly competent job of creating, refactoring, extending, and adding a wrapper around a simple data preprocessing Python script we wanted it to write for us, from the specifications that we provided. While it wasn't the most complex of programs by a long shot, it was a step in the direction of having ChatGPT be genuinely useful by acting as a programming assistant. The results hit the mark at each step.

Next time, we will build a less trivial program with some increased complexity and additional requirements. I'm excited to see how well ChatGPT performs under pressure. Hopefully you are too.



Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.

