BLOG
How non-tech teams can clean raw data from Redshift, BigQuery, and Snowflake
Increasingly, companies are moving to data warehouses as their source of truth for all operational data. For finance teams at companies that use a data warehouse, this data can be critical for budgeting, reconciliation, forecasting, and reporting.
Unfortunately, none of the data warehouse providers make it easy for non-tech teams to use them. This guide is an actionable roadmap to help you understand your company’s data architecture, even if you have no experience with coding or IT.
The initial steps will require someone with access to the data warehouse and SQL knowledge, ideally someone from the engineering or analytics team. I’ll refer to this person as “the analyst” going forward.
💡 A note on data security
In this tutorial, you might need to send information to external companies like OpenAI. We’ll only be working with database schemas, NOT actual user data, so regulations like GDPR shouldn’t apply. However, you should still check with your compliance or InfoSec team before sending any proprietary information to third parties.
Step 1: Understanding the Data Architecture
The first step to cleaning data is to make sure you understand what data is available. CRMs, HRIS, and payroll solutions make this fairly straightforward and have standard data formats that are well supported by other applications.
Data warehouses, however, are unique to each company, so the steps to get the data you need will vary depending on the warehouse your company uses and how the engineering team has set it up.
Snowflake/BigQuery
For both Snowflake and BigQuery, you’ll need your analyst to export the Information_Schema.Columns view and the Information_Schema.Tables view. The former includes all the metadata on each column in the data warehouse, including its name and which table it belongs to. The latter has metadata on each table, including its name and which database and schema it belongs to.
You can share the following links with your analyst:
Here’s an example of the output file for Information_Schema.Tables:
And an example of Information_Schema.Columns:
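If your analyst would rather script the export than download it from the Snowflake console, a rough Python sketch using the snowflake-connector-python package might look like the one below; the connection details are placeholders. The same idea works for BigQuery, where the INFORMATION_SCHEMA views are qualified by dataset or region and are usually queried through the google-cloud-bigquery client or the console.
import csv
import snowflake.connector  # assumes the snowflake-connector-python package is installed

# Placeholder connection details; your analyst would fill these in.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    database="your_database",
)
cur = conn.cursor()
# Run once for INFORMATION_SCHEMA.COLUMNS and once for INFORMATION_SCHEMA.TABLES.
cur.execute("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
with open("information_schema_columns.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row with column names
    writer.writerows(cur.fetchall())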
Redshift
You’ll need your analyst to export the PG_TABLE_DEF system catalog table. This export will list each column from every table that’s accessible to the analyst, along with its name, its data type, and which table it belongs to.
You can share the following link with your analyst:
Here’s an example of the output file for PG_TABLE_DEF:
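If your analyst wants to script this export, here’s a rough Python sketch using psycopg2, the standard PostgreSQL driver, which also works against Redshift. The connection details are placeholders, and note that PG_TABLE_DEF only lists tables in the current search_path.
import csv
import psycopg2  # standard PostgreSQL driver; Redshift speaks the PostgreSQL protocol

# Placeholder connection details; your analyst would fill these in.
conn = psycopg2.connect(
    host="your-cluster.your-region.redshift.amazonaws.com",
    port=5439,
    dbname="your_db",
    user="your_user",
    password="your_password",
)
cur = conn.cursor()
# PG_TABLE_DEF only lists tables in the current search_path.
cur.execute("SELECT * FROM pg_table_def WHERE schemaname NOT IN ('pg_catalog', 'information_schema')")
with open("pg_table_def.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur.fetchall())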
Determining which tables and columns are relevant
Now you’ll have a list of every database, table, and column in the data warehouse. Depending on your company, this could be a very long list! If you already know which tables and columns are relevant (or can figure it out yourself), you can ask your analyst to export only those tables in the next step. Skip the next few steps and go straight to the section on “Running the Query”.
If you do not yet know which databases and tables you need, the next step walks through how to efficiently filter the schema down to the columns you need.
Step 2: Getting the Data
Once you have a file with all the database, table, and column names, the next step is to figure out which ones are relevant to you.
Your analyst might know all the tables available, but not which ones are relevant to your purposes. In this case, it’s useful to look through the schema yourself to find ones that might be relevant.
Using AI to find relevant tables and columns
In some cases, the data schema of your company will be very complex. Instead of looking through it line by line, you can also upload the file to ChatGPT and ask it to list tables that will be relevant to your purposes.
Example Prompt:
Here is the schema for a [Snowflake/Redshift/BigQuery] instance. I am trying to find all customer orders for the month of April, including the date they were paid, the amount of the order, and the date the order was placed. Which tables and columns might be relevant for my query?
The schema should be uploaded as a plain text (.txt) file, which will be easiest for ChatGPT to process.
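Since most warehouse exports are CSVs, you can often just rename the file’s extension to .txt. If the raw export is hard to read, a short Python script like this sketch collapses it into one line per table; the TABLE_NAME and COLUMN_NAME headers are assumptions, so match them to whatever your export actually uses.
import csv
from collections import defaultdict

# Assumes the export has TABLE_NAME and COLUMN_NAME headers; adjust to your file.
tables = defaultdict(list)
with open("information_schema_columns.csv") as src:
    for row in csv.DictReader(src):
        tables[row["TABLE_NAME"]].append(row["COLUMN_NAME"])

with open("schema.txt", "w") as out:
    for table, columns in tables.items():
        out.write(f"{table}: {', '.join(columns)}\n")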
If you cannot use ChatGPT, you can read through the schema yourself to understand what data is available in the data warehouse.
Running the Query
Once you know which tables and columns are relevant, ask your analyst to run a query to get the data from those tables. I recommend getting the raw data instead of a version that’s already been cleaned up, as your analyst might not have the context you do and may unnecessarily filter out information that is actually relevant.
Questions to Ask
Using a table of customer orders as an example, some helpful questions to ask include:
Which tables store the data on customer orders?
Are there multiple tables that store order data?
What are the differences in how order data is piped into each table?
Do they vary based on the source of the data? (e.g. one is from Stripe and one is from product metrics)
Do they vary based on timing? (e.g. one table updates daily but another updates weekly)
Step 3: Cleaning the Data
Once you have the raw data, the next step is to turn it into a format that is compatible with your models.
Exports from data warehouses typically come as CSV files. Sometimes, the data is clean and ready to use in your models. More often, some cleaning is required because the dataset is too large to process in Excel, too granular to use directly in your models, or has a lot of noise that needs to be removed.
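If the file is too large to open in Excel, a quick way to get a feel for it is a short Python preview script. This is a sketch that assumes Python and the pandas library are installed; the file name is a placeholder.
import pandas as pd

df = pd.read_csv("orders_export.csv")  # placeholder file name
print(df.shape)              # number of rows and columns
print(df.columns.tolist())   # column names
print(df.head(10))           # first ten rows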
Types of Data Cleaning
Scrubbing invalid rows
Figuring out what constitutes an “invalid” row is unique to each company. You’ll need to work with your counterparts in whichever team generates the data to determine the signals for rows that are invalid. For example, does the engineering team use fake accounts for product testing that also generate data in this table? Those will need to be filtered out. Which columns indicate whether the data belongs to a test user?
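If the answer is something like a flag column, the actual filter is simple. Here’s a sketch in Python with pandas, where is_test_user and the file names are hypothetical.
import pandas as pd

df = pd.read_csv("orders_export.csv")
# Keep only rows that are not flagged as test users (hypothetical column name).
clean = df[df["is_test_user"] == False]
clean.to_csv("orders_no_test_users.csv", index=False)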
Removing duplicate rows
For duplicates, you’ll need to determine which columns are unique identifiers for a row. The primary key column (usually id) is not reliable because it’s guaranteed to be unique per row, so even if the same order appears twice in an orders table, the two rows will have different IDs. For orders, two rows that have the same invoice_number could indicate a duplicate. You’ll also need to figure out which columns have information that will tell you which record is the original.
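For example, pandas can quickly show you every row that shares an invoice_number with another row so you can see which record looks like the original. This is a sketch; date_paid and the file name are placeholders.
import pandas as pd

df = pd.read_csv("orders_export.csv", parse_dates=["date_paid"])
# Show every row whose invoice_number appears more than once, oldest payment first.
dupes = df[df.duplicated(subset="invoice_number", keep=False)]
dupes = dupes.sort_values(["invoice_number", "date_paid"])
print(f"{len(dupes)} rows share an invoice_number with another row")
print(dupes.head(20))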
Aggregating rows
In some cases, the data will be exported as single transactions, and you need a summary of transactions for a period.
Aggregating rows is straightforward, and in most cases can be done in either SQL or Excel.
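If you’d rather script it, the same aggregation is a few lines of Python with pandas. This sketch sums a hypothetical amount column by calendar month of a hypothetical order_date column.
import pandas as pd

df = pd.read_csv("orders_export.csv", parse_dates=["order_date"])
# Sum the order amounts by calendar month (column names are placeholders).
monthly = df.groupby(df["order_date"].dt.to_period("M"))["amount"].sum()
monthly.to_csv("monthly_totals.csv")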
Transforming data
In some cases, custom business logic requires more advanced transformations that are difficult to do in either SQL or Excel.
For example:
Converting payments in different currencies to USD
Factoring in discounts to products
Calculating accruals based on data from systems that record payment date but not service date
These are often the most time consuming parts of the data cleaning process, and where using ChatGPT can be most helpful.
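As a sketch of the first example above, converting payments to USD only takes a few lines once you have exchange rates. The currency and amount columns and the hard-coded rates here are placeholders; in practice the rates would come from your ERP or an FX data source.
import pandas as pd

# Hard-coded example rates; replace with rates from your ERP or FX data source.
usd_rates = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}

df = pd.read_csv("orders_export.csv")
# Look up each row's rate by its currency code and convert the amount to USD.
df["amount_usd"] = df["amount"] * df["currency"].map(usd_rates)
df.to_csv("orders_usd.csv", index=False)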
Data Cleaning Option 1: Ask your analyst
Once it’s clear how to tell if a row is a duplicate, your analyst can easily modify their SQL query to filter them out.
For aggregations, this is more complex. Some simple aggregations (like summing a series of numeric values) can be done in SQL, but Excel is generally more flexible.
Data Cleaning Option 2: Use ChatGPT to generate code
ChatGPT is a great tool for cleaning data when the dataset is too large for Excel and the transformation is too complex for SQL.
Since you’re working with real data, you’ll want to avoid sending it to ChatGPT unless your company has an SLA with Microsoft or OpenAI. Instead, you can upload all the column names in the dataset to ChatGPT and ask it to write a Python script to do what you need.
Example Prompt:
I've uploaded a list of column names for a dataset in CSV format, representing customer orders. Write me a Python script that finds duplicate rows where the invoice_number is the same. Remove all duplicate rows except the one where date_paid is the earliest date. Write the result to a CSV file in the current directory. The script should be executable directly from the command line, taking in the file path as the only argument.
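The exact code ChatGPT returns will vary, but for a prompt like this it should look roughly like the sketch below, which assumes the pandas library is installed and that the columns are named exactly invoice_number and date_paid.
import os
import sys

import pandas as pd


def main():
    # The input file path is the only command line argument.
    input_path = sys.argv[1]
    df = pd.read_csv(input_path, parse_dates=["date_paid"])

    # Sort so the earliest payment comes first, then keep one row per invoice_number.
    deduped = df.sort_values("date_paid").drop_duplicates(subset="invoice_number", keep="first")

    # Write the result to a CSV file in the current directory.
    output_path = os.path.join(os.getcwd(), "cleaned_" + os.path.basename(input_path))
    deduped.to_csv(output_path, index=False)
    print(f"Kept {len(deduped)} of {len(df)} rows; wrote {output_path}")


if __name__ == "__main__":
    main()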
Once you get the code back you can run it yourself on the dataset and inspect the results.
Open a new text file on your desktop
Copy the Python code into the text file and save it as “data_cleaning_script.txt”
Change the file extension from .txt to .py
If you’re on Windows:
Press Win + R to open the Run dialog.
Type cmd and press Enter to open the command prompt.
If you’re on MacOS:
Hit CMD+Space to open up Spotlight Search.
Search for “Terminal” and hit enter to open a terminal.
Type python and add a space. Don’t hit enter yet.
Drag and drop data_cleaning_script.py from the folder into the terminal, then hit space again.
Drag and drop your dataset into the terminal.
The final command should look something like this:
python /Users/johndoe/Downloads/data_cleaning_script.py /Users/johndoe/Downloads/dataset.csv
Hit enter to run the script
The cleaned dataset should now be in the same folder as data_cleaning_script.py.
ChatGPT is not always correct, so this method is not guaranteed to work. If you run into issues, you can send the script to your analyst or an engineer to review, or email me at jason@finic.ai and I'd be happy to help you figure it out.
Data Cleaning Option 3: Process in Excel
Cleaning data directly in Excel is a common option. However, removing duplicates with custom logic (for example, keeping only the earliest record for each invoice_number) requires helper columns with formulas like COUNTIFS or nested =IF statements, which are slow to recalculate and only practical on smaller datasets. Aggregations run faster, and Excel is a great tool for them.
Performing Multiple Data Transformations
Once you’ve done one transformation, you can repeat step 3 as many times as you need to get to the final result, with the output of one step as the input of the next.
Step 4: Import into Excel
No instructions needed here! Now that you have a fresh dataset that’s cleaned up, import it into Excel (or whatever BI tool you use) and get modelling.
Finic Makes Data Cleaning Simple
At Finic, we’ve built the first no-code data orchestration platform for finance teams. What that means is finance teams can drag-and-drop data transformations to build custom data workflows.
This frees up finance teams from doing low-value data cleaning so they can focus more on strategic initiatives that drive the business forward.
Interested in a demo? Email jason@finic.ai to learn more.
Jason Fan
Cofounder and CEO at Finic