Resolving KeyError Issues in Python Pandas for Excel Data Analysis

Getting Started with Excel Automation: Building on Our Foundation

In our previous post, we took the initial steps toward automating Excel by completing the following tasks:

  • Creating Sample Data: Developed example datasets to work with.
  • Generating Split Sales Files: Divided the sales data into three separate Excel files.
  • Preparing Incomplete Data for Cleaning: Created datasets that required cleaning and preprocessing.
  • Setting Up Sales Data for Analysis: Organized the sales data files to be ready for in-depth analysis.

Additionally, we implemented some fundamental automation scripts, including:

  • Merging Excel Files: Combined multiple Excel files into a single dataset.
  • Data Cleaning Functions: Automated the process of cleaning and preparing data for analysis.
  • Automated Monthly Reporting: Created scripts to generate monthly reports without manual intervention.

Enhancing Our Automation with Advanced Analysis

Building on this solid foundation, our goal for this installment is to introduce more practical analytical features. Before diving into adding these new capabilities, let’s ensure we have everything prepared.

table of contents

Prerequisites

Before adding analysis functionalities, make sure you have the following ready:

  • Complete Python Scripts from the Previous Session: Ensure you have all the Python scripts we developed earlier.
  • Generated Excel Files:
    • sales_data_original.xlsx – The primary sales data file.
    • messy_data.xlsx – The dataset prepared for cleaning.
    • Split files located within the excel_files folder.
  • Additional Python Packages Needed:
    • Install SciPy by running:
      pip install scipy
    • We will explain the purpose of this package shortly.

Folder Structure

Organize your working directory as follows to keep everything tidy:

working_directory/
├── excel_files/
│   ├── sales_data_1.xlsx
│   ├── sales_data_2.xlsx
│   └── sales_data_3.xlsx
├── sales_data_original.xlsx
└── messy_data.xlsx

With these preparations in place, we’re ready to move forward and add smarter analytical capabilities to our Excel automation project.

Installing the New Package

Up until now, our automation scripts have handled basic operations with Excel files. This time, to perform more advanced analyses, we’ll be adding a new tool to our toolkit.

pip install scipy

What Does This Command Add?

SciPy, pronounced “Sigh-pie,” stands for Scientific Python. It’s a powerful library that offers a range of functionalities, including:

  • Statistical Calculations
    • Beyond basic computations like averages and medians, SciPy can detect outliers and perform more sophisticated statistical analyses.
  • Data Analysis
    • Quantifying trends in sales fluctuations
    • Discovering patterns within the data
    • Performing calculations for forecasting

In simple terms, SciPy allows you to execute complex calculations effortlessly, which might be cumbersome to handle manually in Excel.

For Example:

  • “Is the sales growth this month unusually high compared to last month?”
  • “Does the sales pattern of this product correlate with seasonal changes?”

With SciPy, conducting such analyses becomes straightforward. Let’s explore how we can leverage this tool to perform specific analyses in the next steps.

Enhancing Our Analysis Capabilities

By now, we’ve mastered the basics of merging and organizing Excel files. Additionally, we’ve installed SciPy for statistical analysis. Building on this foundation, we’ll gradually incorporate more intelligent analytical features.

Let’s start by creating a function that automatically identifies key insights from our sales data. Here’s the code we’ll be working with:

import pandas as pd
import numpy as np
from scipy import stats

def discover_insights(df):
    """
    Function to extract key insights from sales data
    """
    insights = []
    
    # 1. Monthly Sales Aggregation
    monthly_sales = df.groupby('Year-Month')['Sales Amount'].sum()
    
    # 2. Identifying the Best Sales Month
    best_month = monthly_sales.idxmax()
    best_month_sales = monthly_sales.max()
    insights.append(
        f"Best Sales Month: {best_month} "
        f"(Sales: ¥{best_month_sales:,.0f})"
    )
    
    return insights

# Main Execution
if __name__ == "__main__":
    # Loading the sample data
    df = pd.read_excel("sales_data_original.xlsx")
    
    # Running the analysis
    found_insights = discover_insights(df)
    
    # Displaying the results
    print("\n===== Analysis Results =====")
    for insight in found_insights:
        print(f"• {insight}")

Code Explanation

Let’s break down the code:

  • Grouping by ‘Year-Month’:
    • The sales data is grouped by the ‘Year-Month’ column, aggregating the ‘Sales Amount’ for each period.
  • Identifying the Top Performing Month:
    • idxmax() finds the month with the highest sales.
    • max() retrieves the sales amount for that month.
    • The insight is then appended to the insights list in a readable format.

When you run this script, it will output something like:

===== Analysis Results =====
• Best Sales Month: 2023-07 (Sales: ¥12,345,678)

This marks the first step in adding analytical capabilities to our automation. In the next installment, we’ll delve deeper into enhancing these analyses with additional functionalities.

Handling Potential Errors

If you encounter errors while running the script, here are some common issues and their solutions:

  • ModuleNotFoundError: No module named ‘scipy’
    • Ensure that SciPy is installed by running pip install scipy.
  • FileNotFoundError: [Errno 2] No such file or directory
    • Verify that sales_data_original.xlsx is located in the same directory as your Python script.

By addressing these issues, you can ensure smooth execution of your analysis scripts.


With these enhancements, we’ve established a solid foundation for intelligent data analysis within our Excel automation project. Stay tuned for the next steps, where we’ll expand these functionalities even further.

Save and Execute the Script

Save the above code as smart_analysis.py and run it. This will automatically identify the month with the highest sales and display the corresponding amount.

Understanding the Code

Let’s take a closer look at the code:

  • Grouping by ‘Year-Month’:
    The sales data is grouped by the ‘Year-Month’ column. For example, all sales from January 2023 are aggregated into a single entry.
    monthly_sales = df.groupby('Year-Month')['Sales Amount'].sum()
  • Summing ‘Sales Amount’:
    After grouping, the script calculates the total sales for each month by summing the ‘Sales Amount’ column.
    monthly_sales = df.groupby('Year-Month')['Sales Amount'].sum()
  • Using idxmax() and max():
    • idxmax(): Identifies the month with the highest sales value.max(): Retrieves the sales amount for that peak month.

    best_month = monthly_sales.idxmax()
    best_month_sales = monthly_sales.max()
    insights.append(
    f"Best Sales Month: {best_month} (Sales: ¥{best_month_sales:,.0f})"
    )

When you execute this script, you will receive an output similar to:

===== Analysis Results =====
• Best Sales Month: July 2023 (Sales: ¥12,345,678)

How Does This Work?

This output signifies the first step in enhancing our analysis capabilities by automatically determining the most successful month in terms of sales.

Troubleshooting: Handling Common Errors

If you encounter any issues while running the script, here are some common errors and their solutions:

  • ModuleNotFoundError: No module named ‘scipy’
    • Cause: SciPy isn’t installed in your Python environment.
    • Solution: Install SciPy by executing the following command in your terminal:
      pip install scipy
  • FileNotFoundError: [Errno 2] No such file or directory
    • Cause: The script cannot locate the sales_data_original.xlsx file.
    • Solution: Ensure that sales_data_original.xlsx is placed in the same directory as your Python script.
  • Example Error Traceback:
    If you mistakenly alter the code, you might encounter an error like this:
    C:/excel/smart_analysis.py
    Traceback (most recent call last):
    File "C:/excel/smart_analysis.py", line 30, in <module>
    found_insights = discover_insights(df)
    File "C:/excel/smart_analysis.py", line 12, in discover_insights
    monthly_sales = df.groupby('Year-Month')['Sales Amount'].sum()
    File "C:/Users/minok/AppData/Local/Programs/Python/Python310/lib/site-packages/pandas/core/frame.py", line 8872, in groupby
    return DataFrameGroupBy(
    File "C:/Users/minok/AppData/Local/Programs/Python/Python310/lib/site-packages/pandas/core/groupby/groupby.py", line 1273, in __init__ grouper, exclusions, obj = get_grouper(
    File "C:/Users/minok/AppData/Local/Programs/Python/Python310/lib/site-packages/pandas/core/groupby/grouper.py", line 1009, in get_grouper raise KeyError(gpr)
    KeyError: 'Year-Month'
  • Explanation:
    The error KeyError: 'Year-Month' indicates that the script is trying to access a column named ‘Year-Month’ which doesn’t exist in your dataset. Here’s how to fix it:
  1. Verify Column Names:
    Add the following lines to your script to print out the available columns and the first few rows of your data:
    # Check available columns
    print("Available Columns:")
    print(df.columns.tolist())

    # Display the first 5 rows
    print("\nFirst 5 Rows of Data:")
    print(df.head())
  2. Update Column References:
    If your dataset uses different column names (e.g., 'Year-Month' and 'Sales Amount'), update the groupby line accordingly:

    monthly_sales = df.groupby('Year-Month')['Sales Amount'].sum()

By ensuring that the column names in your script match those in your Excel file, you can resolve the KeyError and successfully run your analysis.

Next Steps

With the basic analysis functionality in place, you’re now equipped to move forward. In the next installment, we’ll expand our analysis by adding more sophisticated features, enabling deeper insights into your sales data.

Debugging Excel Analysis Scripts: Common Errors and How to Address Them

While running our Excel analysis script, we encountered the following error:
KeyError: 'Year-Month'

This message indicates that the column ("Year-Month") the script is trying to reference does not exist in the dataset.


Checking What Caused the Error

First, let’s confirm the actual structure of the data. Try running the following code:

import pandas as pd
import numpy as np
from scipy import stats

# Load the data and examine its contents
df = pd.read_excel("sales_data_original.xlsx")

# Display the list of column names
print("=== List of Columns ===")
print(df.columns.tolist())

# Show the first five rows
print("\n=== First 5 Rows ===")
print(df.head())

Executing this snippet reveals whether the column is truly named "Year-Month" or something else. If the column name differs, that’s likely why you’re getting a KeyError.

0 2023-01-01    2023-01      Electronics         46360     Rep 2
1 2023-01-01    2023-01        Household         75873     Rep 2
2 2023-01-01    2023-01      Electronics         78725     Rep 5
3 2023-01-02    2023-01       Stationery         86548     Rep 4
4 2023-01-02    2023-01          Apparel         19688     Rep 5

How to Resolve the Error

Depending on what you find when inspecting your data, make one of the following changes. Below is a completed example:

import pandas as pd
import numpy as np
from scipy import stats

def discover_insights(df):
    """
    Function to identify key points in the sales data
    """
    insights = []
    
    # Aggregate monthly sales
    # Compatible with English column names
    monthly_sales = df.groupby('Year-Month')['Sales Amount'].sum()
    
    # Determine which month had the highest sales
    best_month = monthly_sales.idxmax()
    best_month_sales = monthly_sales.max()
    insights.append(
        f"Month with the highest sales: {best_month} "
        f"(Sales: ¥{best_month_sales:,.0f})"
    )
    
    return insights

# Main process
if __name__ == "__main__":
    # Load sample data
    print("Loading data...")
    df = pd.read_excel("sales_data_original.xlsx")
    
    # Check the column names
    print("\nAvailable Columns:")
    print(df.columns.tolist())
    
    # Run the analysis
    print("\nStarting analysis...")
    found_insights = discover_insights(df)
    
    # Display the results
    print("\n===== Analysis Results =====")
    for insight in found_insights:
        print(f"- {insight}")

We start by examining the data’s structure, then proceed with our calculations.


If an Error Occurs

If you still see errors, confirm the following:

  1. Are the Excel file’s column names correct?
  2. Is the data being read properly?
  3. Do the required columns exist in the file?

Once these points are verified, the KeyError: 'Year-Month' problem should be resolved.

With this fix in place, we’ve established a foundation for basic analysis features. In the next installment, we’ll dive deeper into refining our analysis.

If you like this article, please
Follow !

Please share if you like it!
table of contents