My CityLab Diaries

Cyclistic Bike-Share Analysis

Google Data Analytics Certificate - Case Study

How does a bike-share navigate speedy success?

Project Overview:

This project was completed as part of the Google Data Analytics Certificate. Cyclistic is a fictional bike-share company in Chicago. The company’s bike-share program offers 5800 bikes and 600 docking stations across the city.

The company’s financial analysis concluded that annual members bring much more profit than the casual riders. The goal for this project was to analyze Cyclistic’s 2019 bike-share data to understand how annual members and casual riders use the service differently.

Data Source:

For the purpose of this case study, a public dataset was provided by Motivate International Inc. The dataset is made available under their license for educational and analytical use.

Summary of Analysis:

I decided to use R for my analysis. The quarterly bike-share trip files for 2019 were downloaded and saved in CSV format ( data ). The dataset is split into four quarterly CSV files (Q1–Q4). Each file contains trip-level records with variables such as trip ID, bike ID, start/end time, station information, user type, gender, and birth year.

Inspecting the data: The dataset is published by an official operator (Motivate International Inc), making it reliable and, as operational data, it reflects actual usage. However, it does not capture external factors such as weather, socioeconomic differences, or riders who don’t disclose their gender/birth year. The provided data are raw trip records for the entire 12 months of 2019. There were no privacy issues since Personally identifiable information (PII) is not included, only anonymized trip‑level data is provided.

Data Integrity: Several steps were taken to ensure data’s integrity

Checked for missing values, empty rows and removed invalid records (e.g., trips less than 1 minute or more than 24 hours).
Standardized column names across quarters: Each quarter had different column names, which required standardization. I also standardized user type values (Subscriber → member, Customer → casual).
Converted date/time fields to proper formats(as_datetime).
Calculated ride durations in seconds, derived new variables (ride date, month, day of week, hour).
Cleaned gender values (standardized to “Male”, “Female”, “Other”, “not_specified”).
Converted birth year to numeric and calculated rider age & age groups.
Removed duplicate rides based on trip ID and start time.
Confirmed that the date range consistently covered January–December 2019.

Results After Cleaning:The total number of rides prior to cleaning was 3,818,004 and the final combined dataset consisted of 3,816,142 rows (only 1,862 rows removed, which is around 0.05%).

Missing values check: No missing values in trip ID, times, stations, or user type. Birth year missing for about 537,796 records (age also missing accordingly).

Final dataset columns: trip_id, start_time, end_time, bike_id, start_station_name, start_station_id, end_station_name, end_station_id, user_type, gender, birthyear, quarter, ride_length_seconds, ride_date, ride_year, ride_month, ride_day_of_week, ride_hour, age, age_group.

Business Question: How do annual members and casual riders use Cyclistic bikes differently? The following represent usage patterns and key insights:

Total rides:

Members: Annual subscribers took 2,936,859 rides which is about 77% of all rides in the dataset.
Casual riders: Non‑subscribers took 879,283 rides (23%).

Ride duration:

Members: shorter average and median trip lengths.
Casual riders: significantly longer trips on average.

Key Insight (1): Members dominate overall ridership volume, but casual riders tend to take longer trips, reflecting different usage motivations (commuting vs leisure).

Daily Usage Patterns:

Members:Consistent usage across weekdays with peak rides during morning and evening commute hours.
Casual riders:Higher ridership on weekends, but Longer average trip durations compared to members.

Monthly Usage Patterns:

Both groups have strong seasonal trends with the highest ridership in summer months (June–August) and lowest ridership in winter months (December–February).

Key Insight (2): Seasonal weather impacts both groups similarly, but casual riders show sharper peaks in summer, aligning with recreational use.

Hourly Usage Patterns:

Members:Clear peaks during commuting hours (7–9 AM and 4–6 PM).
Casual riders:Usage spread more evenly throughout the day with afternoon peaks (12–4 PM) are most prominent.

Key Insight (3): Members’ usage reflects workday commuting, while casual riders’ usage reflects flexible, leisure‑oriented trips.

Summary of Findings:

Members: High volume, short trips, weekday commuting patterns.
Casual riders: Lower volume, longer trips, weekend and afternoon peaks.
Business Implication: These differences highlight opportunities to convert casual riders into members by emphasizing the value of membership for frequent leisure use, while continuing to support members’ commuting needs.

The insights reveal clear behavioral differences: members are commuters, casual riders are leisure users.This helps Cyclistic design targeted marketing strategies:

Key Recommendations:

Encourage casual riders to convert to membership by highlighting cost savings for frequent weekend use.
Promote membership benefits aligned with commuting convenience.
Seasonal and hourly trends inform operational decisions:
- Bike availability: Add more bikes to stations during summer months when demand is high.
- Station placement: Ensure popular leisure areas (parks, waterfronts) are well‑stocked in summer.
- Promotions: Offer winter discounts or “cold‑weather challenges” to encourage off‑season use.

R Code Snippet: Quarter 1 Data Cleaning

Click to expand R code


# STEP 2: CLEAN AND STANDARDIZE QUARTER 1 DATA
# =============================================================================

cat("\nSTEP 2: Cleaning Quarter 1 data (January-March)...\n")

# First, let's see what columns we have in Q1
cat("Original columns in Q1:", paste(names(Q1), collapse = ", "), "\n")

quarter_1_cleaned <- Q1 %>%
  # Clean column names (makes them consistent and lowercase)
  clean_names() %>%
  # Remove completely empty rows (where all values are NA)
  dplyr::filter(!if_all(everything(), is.na)) %>%
  # Rename columns to standard names
  rename(
    trip_id = trip_id,
    bike_id = bikeid,
    start_time = start_time,
    end_time = end_time,
    trip_duration = tripduration,
    start_station_name = from_station_name,
    start_station_id = from_station_id,
    end_station_name = to_station_name,
    end_station_id = to_station_id,
    user_type = usertype,
    gender = gender,
    birthyear = birthyear
  ) %>%
  # Add a column to track which quarter this data came from
  mutate(quarter = "Q1")

cat("✓ Quarter 1 cleaned:", nrow(quarter_1_cleaned), "rows remaining\n")

# Remove the original Q1 data to free up memory
rm(Q1)
gc() # Garbage collection to free memory

R Code Snippet: Comprehensive data cleaning on combined dataset & quality check

Click to expand R code



# STEP 7: COMPREHENSIVE DATA CLEANING ON COMBINED DATASET
# =============================================================================

cat("\nSTEP 7: Comprehensive cleaning of combined dataset...\n")

#  Record initial row count
initial_row_count <- nrow(all_trips_combined)

#  Clean and transform the data
all_trips_2019 <- all_trips_combined %>%
  mutate(
    # Step 7.1: Standardize user type values
    user_type = case_when(
      user_type == "Subscriber" ~ "member",
      user_type == "Customer" ~ "casual",
      TRUE ~ user_type
    ),
    
    # Step 7.2: Convert to proper date/time formats
    start_time = as_datetime(start_time),
    end_time = as_datetime(end_time),
    
    # Step 7.3: Calculate ride duration in seconds
    ride_length_seconds = as.numeric(difftime(end_time, start_time, units = "secs")),
    
    # Step 7.4: Add useful date and time components for analysis
    ride_date = as.Date(start_time),
    ride_year = year(start_time),
    ride_month = month(start_time, label = TRUE),
    ride_day_of_week = wday(start_time, label = TRUE),
    ride_hour = hour(start_time),
    
    # Step 7.5: Clean gender column
    gender = case_when(
      is.na(gender) ~ "not_specified",
      gender == "" ~ "not_specified",
      tolower(gender) == "male" ~ "Male",
      tolower(gender) == "female" ~ "Female",
      TRUE ~ "Other"
    ),
    
    # Step 7.6: Clean birthyear and calculate age
    birthyear = as.numeric(birthyear),
    age = 2019 - birthyear,
    age_group = case_when(
      age < 20 ~ "Under 20",
      age >= 20 & age < 30 ~ "20-29",
      age >= 30 & age < 40 ~ "30-39",
      age >= 40 & age < 50 ~ "40-49",
      age >= 50 & age < 60 ~ "50-59",
      age >= 60 ~ "60 and over",
      is.na(age) ~ "Unknown",
      TRUE ~ "Unknown"
    )
  ) %>%
  # Step 7.7: Remove invalid records
  dplyr::filter(
    ride_length_seconds > 60,
    ride_length_seconds < 86400,
    !is.na(start_time),
    !is.na(end_time),
    !is.na(user_type),
    start_time < end_time,
    user_type %in% c("member", "casual")
  ) %>%
  # Step 7.8: Remove duplicate rides
  distinct(trip_id, start_time, .keep_all = TRUE)

# Step 3: Record final row count and calculate rows removed
final_row_count <- nrow(all_trips_2019)
rows_removed <- initial_row_count - final_row_count

# Step 4: Print summary
cat("✓ Comprehensive cleaning completed\n")
cat("Rows removed during cleaning:", rows_removed, "(", round(rows_removed / initial_row_count * 100, 1), "%)\n")
cat("Final clean dataset:", final_row_count, "rows\n")

#Reorder the colomns in the dataset
all_trips_2019 <- all_trips_2019 %>%
  select(
    trip_id,
    start_time,
    end_time,
    bike_id,
    start_station_name,
    start_station_id,
    end_station_name,
    end_station_id,
    user_type,
    gender,
    birthyear,
    quarter,
    ride_length_seconds,
    ride_date,
    ride_year,
    ride_month,
    ride_day_of_week,
    ride_hour, 
    age,
    age_group
  )

# View the dataset 
View(all_trips_2019) 

# STEP 8: CHECK DATA QUALITY
# =============================================================================

cat("\nSTEP 8: Checking data quality...\n")

# Check for missing values in each column
missing_data_summary <- sapply(all_trips_2019, function(x) sum(is.na(x)))
cat("Missing values by column:\n")
print(missing_data_summary)

# Check member vs casual distribution
user_type_summary <- all_trips_2019 %>%
  count(user_type) %>%
  mutate(percentage = n / sum(n) * 100)

cat("\nUser type distribution:\n")
print(user_type_summary)

# Check date range
date_range <- range(all_trips_2019$ride_date)
cat("\nDate range of data:", as.character(date_range[1]), "to", as.character(date_range[2]), "\n")

R Code Snippet: Analyzing differences between annual members and casual riders

Click to expand R code



# ANALYSIS FOCUSED ON BUSINESS TASK
# =============================================================================

cat("\nAnalyzing differences between annual members and casual riders...\n")

# 2.1 Overall comparison
overall_comparison <- all_trips_2019 %>%
  group_by(user_type) %>%
  summarise(
    total_rides = n(),
    percent_of_total = (n() / nrow(all_trips_2019)) * 100,
    avg_duration_min = mean(ride_length_seconds) / 60,
    median_duration_min = median(ride_length_seconds) / 60,
    .groups = 'drop'
  )

cat("OVERALL COMPARISON:\n")
print(overall_comparison)


# 2.2 Daily usage patterns
daily_usage <- all_trips_2019 %>%
  group_by(user_type, ride_day_of_week) %>%
  summarise(
    number_of_rides = n(),
    avg_duration_min = mean(ride_length_seconds) / 60,
    .groups = 'drop'
  )

cat("daily_usage:\n")
print(daily_usage)

# 2.3 Monthly usage patterns
monthly_usage <- all_trips_2019 %>%
  group_by(user_type, ride_month) %>%
  summarise(
    number_of_rides = n(),
    .groups = 'drop'
  )

cat("monthly_usage:\n")
print(monthly_usage)

# 2.4 Hourly usage patterns
hourly_usage <- all_trips_2019 %>%
  group_by(user_type, ride_hour) %>%
  summarise(
    number_of_rides = n(),
    .groups = 'drop'
  )

cat("hourly_usage:\n")
print(hourly_usage)

R Code Snippet:Visualizations

Click to expand R code



# VISUALIZATIONS FOR BUSINESS TASK
# =============================================================================

cat("\nCreating visualizations for business report...\n")

# Set clean theme
theme_set(theme_minimal())

# Visualization 1: Number of rides by day of week
plot1 <- daily_usage %>%
  ggplot(aes(x = ride_day_of_week, y = number_of_rides, fill = user_type)) +
  geom_col(position = "dodge") +
  labs(
    title = "Number of Rides by Day of Week",
    subtitle = "Comparison between annual members and casual riders",
    x = "Day of Week",
    y = "Number of Rides",
    fill = "User Type"
  ) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("casual" = "orange", "member" = "blue"))

print(plot1)

# Visualization 2: Average ride duration by day of week
plot2 <- daily_usage %>%
  ggplot(aes(x = ride_day_of_week, y = avg_duration_min, fill = user_type)) +
  geom_col(position = "dodge") +
  labs(
    title = "Average Ride Duration by Day of Week",
    subtitle = "Casual riders take longer trips on average",
    x = "Day of Week",
    y = "Average Duration (minutes)",
    fill = "User Type"
  ) +
  scale_fill_manual(values = c("casual" = "orange", "member" = "blue"))

print(plot2)

# Visualization 3: Monthly usage patterns
plot3 <- monthly_usage %>%
  ggplot(aes(x = ride_month, y = number_of_rides, color = user_type, group = user_type)) +
  geom_line(size = 1) +
  labs(
    title = "Monthly Ridership Patterns in 2019",
    subtitle = "Seasonal trends for members and casual riders",
    x = "Month",
    y = "Number of Rides",
    color = "User Type"
  ) +
  scale_y_continuous(labels = comma) +
  scale_color_manual(values = c("casual" = "orange", "member" = "blue")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(plot3)

# Visualization 4: Hourly usage patterns
plot4 <- hourly_usage %>%
  ggplot(aes(x = ride_hour, y = number_of_rides, color = user_type)) +
  geom_line(size = 1) +
  labs(
    title = "Hourly Usage Patterns",
    subtitle = "Members show commute peaks, casual riders have afternoon peaks",
    x = "Hour of Day",
    y = "Number of Rides",
    color = "User Type"
  ) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(breaks = seq(0, 23, by = 3)) +
  scale_color_manual(values = c("casual" = "orange", "member" = "blue"))

print(plot4)