Edit: It is imperative to mention that this analysis of the dominant operators is actually wrong: SNCF does not operate in Norway. Always double-check your work before it is published, regardless of the author. The conclusion that the exploration phase benefits greatly still stands; I will fix the rest later.
One frequent issue we hear from our customers, new and old alike, is how to do, or even just start, the ‘data analysis’ process. If we have 50,000 rows across 10 columns, where do we start? A common strategy is to plot everything, or to plot whatever seems interesting. This is a creative and at times methodical process, and that is what makes it hard. Fortunately, it is easier to tell good questions from bad ones than to come up with them in the first place, and this is where having a robot spit out many potential questions can be quite useful.
Do note that the version of ChatGPT I am using is paid. The original dataset is from GitHub - trainline-eu/stations: List of stations and associated metadata. You can also view the full conversation.
Since this article is about supercharging our workflow, I had ChatGPT summarize my interactions with it into a draft post, reviewed that draft, had it rewritten (twice), and only then made my own modifications to the final draft, which is what you will read below.
While this is not my style of writing, I think it is plainly good writing, and it accurately represents the sequence of steps I took to produce our chart.
First, the final result:
Our journey started with a big dataset. It was about train stations in Europe. It had over 64,000 records with 75 different types of information. This included things like the name of the station, its location, and details about different train operators.
The first thing we needed to do was understand our data. ChatGPT helped us load the dataset. But there was a problem. The data wasn’t arranged in the usual way with commas. It used semicolons instead. ChatGPT spotted this and quickly fixed the problem.
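The fix for the delimiter problem is a one-argument change in pandas. Here is a minimal sketch with a hypothetical two-row stand-in for the stations file (the column names are assumptions, but the real file does use `;` as its separator):

```python
import io
import pandas as pd

# Hypothetical miniature of the stations file, using ';' like the real one.
raw = "id;name;country\n1;Gare du Nord;FR\n2;Oslo S;NO\n"

# A naive read assumes commas, so every field lands in a single column.
naive = pd.read_csv(io.StringIO(raw))
print(naive.shape[1])  # 1

# Telling pandas about the separator parses it correctly.
fixed = pd.read_csv(io.StringIO(raw), sep=";")
print(fixed.shape[1])  # 3
```

A tell-tale sign of the wrong delimiter is a DataFrame with exactly one column whose name contains the entire header row.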
Once the data was loaded properly, ChatGPT gave us a good summary. It showed us how many different station names there were, where the stations were located, and how many countries and time zones were included.
A big part of working with data is asking the right questions. ChatGPT helped us come up with some interesting questions to explore. For example:
- Where are most of the stations located?
- Which station names are most common?
- How many stations have a ‘parent’ station, and which parent stations have the most ‘child’ stations?
- Which train operators run the most stations, and are some operators more common in certain countries?
ChatGPT also created bar charts (not in everviz, but in the ChatGPT app) showing how stations were spread across countries and how many were main stations or city stations. This was helpful for getting an early bird's-eye view of what we were dealing with.
These questions gave us a roadmap for exploring our data in more detail.
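The series behind a stations-per-country bar chart like the ones ChatGPT drew is a single `value_counts` call. A sketch with hypothetical sample rows:

```python
import io
import pandas as pd

# Hypothetical sample; the real charts used the full 64,000+ records.
raw = "name;country\nGare du Nord;FR\nGare de Lyon;FR\nBerlin Hbf;DE\n"
df = pd.read_csv(io.StringIO(raw), sep=";")

# Stations per country — df["country"].value_counts().plot.bar()
# would render this as a bar chart with matplotlib.
counts = df["country"].value_counts()
print(counts["FR"])  # 2
print(counts["DE"])  # 1
```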
I chose to focus on which train operators ran the most stations per country. The initial question I asked did not quite match the data we had, but ChatGPT discovered this itself and course-corrected automatically:
> I apologize for the misunderstanding. The columns representing the railway operators are not numerical but boolean (true/false), indicating whether a particular operator services the station. We cannot directly find the operator with the most stations per country as the dataset currently stands.
>
> However, we can transform the data to count the number of stations each operator services in each country. This will allow us to find the dominant operator for each country. Let’s proceed with this approach.
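The transformation ChatGPT describes boils down to summing the boolean operator columns per country and taking the column with the largest count. A minimal sketch, assuming hypothetical operator columns `db_id` and `sncf_id` (the real file has many more):

```python
import pandas as pd

# Hypothetical slice of the data: one boolean column per operator.
df = pd.DataFrame({
    "country": ["DE", "DE", "FR"],
    "db_id":   [True, True, False],
    "sncf_id": [False, False, True],
})

# Summing booleans per country counts the stations each operator services...
counts = df.groupby("country")[["db_id", "sncf_id"]].sum()

# ...and the column with the largest count is that country's dominant operator.
dominant = counts.idxmax(axis=1)
print(dominant["DE"])  # db_id
print(dominant["FR"])  # sncf_id
```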
Final task: Create a new dataset that showed the most common operator for each country.
ChatGPT handled this task smoothly. It created a new dataset with the country name, a number representing the operator (for the category color in everviz), and a label with the operator’s name. To do this, ChatGPT had to create a map from country codes to country names, because our original data didn’t include the names.
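The final dataset described here needs two small lookups: country code to display name, and operator name to a stable number for everviz's category colors. A sketch under assumed names, starting from a hypothetical dominant-operator table:

```python
import pandas as pd

# Hypothetical result of the previous step: dominant operator per country.
dominant = pd.DataFrame({
    "country": ["DE", "FR"],
    "operator": ["db_id", "sncf_id"],
})

# Map ISO country codes to names (assumed subset of the full lookup table)...
country_names = {"DE": "Germany", "FR": "France"}
dominant["country_name"] = dominant["country"].map(country_names)

# ...and assign each operator a stable numeric id for the chart's colors.
codes = {op: i for i, op in enumerate(sorted(dominant["operator"].unique()))}
dominant["operator_id"] = dominant["operator"].map(codes)

print(dominant["country_name"].tolist())  # ['Germany', 'France']
print(dominant["operator_id"].tolist())   # [0, 1]
```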
Our journey showed us how ChatGPT and the Code Interpreter can make working with big datasets easier. They helped us understand our data, ask good questions, create helpful charts, and prepare a new, useful dataset. These tools can make working with data quicker and more efficient, letting us focus on what the data is telling us.