<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Martin Tran’s Tech Blog | Data Science, Machine Learning & Python Tutorials]]></title><description><![CDATA[Explore practical guides, tutorials, and insights on data science, machine learning, Python programming, and more—authored by Martin Tran.]]></description><link>https://blog.mtran.me</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1748806506806/38980de6-75a7-4711-b227-eab11fd8a703.png</url><title>Martin Tran’s Tech Blog | Data Science, Machine Learning &amp; Python Tutorials</title><link>https://blog.mtran.me</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 17:56:50 GMT</lastBuildDate><atom:link href="https://blog.mtran.me/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Stitching Images with Harris Corners and SIFT: A Homework Retrospective]]></title><description><![CDATA[This project came from a computer vision class where we built an image stitching pipeline using Harris corners and SIFT descriptors. 
I’m keeping the original code mostly intact to show how I approached it at the time, but I’ve added some thoughts bel...]]></description><link>https://blog.mtran.me/stitching-images-with-harris-corners-and-sift-a-homework-retrospective</link><guid isPermaLink="true">https://blog.mtran.me/stitching-images-with-harris-corners-and-sift-a-homework-retrospective</guid><category><![CDATA[image stitching]]></category><category><![CDATA[harris corners]]></category><category><![CDATA[Python]]></category><category><![CDATA[sift]]></category><category><![CDATA[Computer Vision]]></category><dc:creator><![CDATA[Martin Tran]]></dc:creator><pubDate>Sun, 01 Jun 2025 20:34:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748808683440/359e0f60-c5ab-459a-b83a-31b1cf622274.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This project came from a computer vision class where we built an image stitching pipeline using Harris corners and SIFT descriptors. I’m keeping the original code mostly intact to show how I approached it at the time, but I’ve added some thoughts below on what I’d do differently now.</p>
<p>The goal was to align and blend multiple overlapping images into a panorama, using classic computer vision techniques (no deep learning). This post goes over what I implemented, what I’ve learned since, and what I’d change if I revisited it today.</p>
<hr />
<h2 id="heading-what-this-project-does">What This Project Does</h2>
<ul>
<li><p>Detects corners in each image using Harris corner detection</p>
</li>
<li><p>Computes SIFT descriptors around those corners</p>
</li>
<li><p>Matches features between image pairs</p>
</li>
<li><p>Estimates the homography using RANSAC</p>
</li>
<li><p>Warps and blends images into a stitched result</p>
</li>
<li><p>Supports stitching multiple images in a sequence</p>
</li>
</ul>
<p>Here’s what the stitched output looks like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748808944035/edce3ff3-a7fc-449f-b74f-0f9ad4d9e42b.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-what-i-did-original-approach">What I Did (Original Approach)</h2>
<p>At the time, my focus was on getting the pipeline working and matching the spec. I handled corner detection with <code>cv2.cornerHarris</code>, filtered out low responses, then extracted SIFT descriptors.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> cv2
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

A_gray = cv2.cvtColor(A, cv2.COLOR_BGR2GRAY)
<span class="hljs-comment"># Using Harris corner detector on grayscale images A and B</span>
A_corners = cv2.cornerHarris(A_gray, blockSize=<span class="hljs-number">2</span>, ksize=<span class="hljs-number">3</span>, k=<span class="hljs-number">0.05</span>)
corner_locs = np.argwhere(A_corners &gt; <span class="hljs-number">0.01</span> * A_corners.max())[:, ::-<span class="hljs-number">1</span>]  <span class="hljs-comment"># (x, y) above threshold</span>
sift = cv2.SIFT_create()
radius = <span class="hljs-number">5</span>  <span class="hljs-comment"># keypoint size handed to SIFT (illustrative value)</span>
keypoints = [cv2.KeyPoint(<span class="hljs-built_in">float</span>(x), <span class="hljs-built_in">float</span>(y), radius) <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> corner_locs]
_, descriptors = sift.compute(A_gray, keypoints)
</code></pre>
<p>I matched descriptors using OpenCV’s brute-force matcher with cross-checking, took the top 200 matches, and then used RANSAC to compute a homography.</p>
<pre><code class="lang-python">bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=<span class="hljs-literal">True</span>)
matches = bf.match(desc1, desc2)
matches = sorted(matches, key=<span class="hljs-keyword">lambda</span> x: x.distance)[:<span class="hljs-number">200</span>]  <span class="hljs-comment"># keep the 200 closest matches</span>

<span class="hljs-comment"># Pull the matched coordinates out of each keypoint list</span>
src_pts = np.float32([kp1[m.queryIdx].pt <span class="hljs-keyword">for</span> m <span class="hljs-keyword">in</span> matches]).reshape(-<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>)
dst_pts = np.float32([kp2[m.trainIdx].pt <span class="hljs-keyword">for</span> m <span class="hljs-keyword">in</span> matches]).reshape(-<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>)

<span class="hljs-comment"># Estimate homography from matched keypoints using RANSAC</span>
<span class="hljs-comment"># (0.3 px is a tight reprojection threshold; 3-5 px is more typical)</span>
H, mask = cv2.findHomography(dst_pts, src_pts, cv2.RANSAC, <span class="hljs-number">0.3</span>)
</code></pre>
<p>After computing the homography, I warped one image into the other's frame and blended them together. For multiple images, I ran all pairwise combinations, selected the pairs with the most inliers, and stitched them in that order. It worked, though it was a bit rigid.</p>
<pre><code class="lang-python">warped = cv2.warpPerspective(new_img, H, canvas_size)
mask = (warped &gt; <span class="hljs-number">0</span>)
panorama[mask] = warped[mask]
</code></pre>
<hr />
<h2 id="heading-what-id-do-differently-now">What I’d Do Differently Now</h2>
<p>Since submitting this assignment, I’ve learned a few things that would make the whole pipeline more reliable and scalable. If I were to revisit this, here’s how I’d improve it:</p>
<h3 id="heading-preprocessing">Preprocessing</h3>
<ul>
<li><p>I didn’t smooth the grayscale image before running Harris detection, which can cause noisy responses. I’d apply a Gaussian blur (σ=2) beforehand now.</p>
</li>
<li><p>Harris should run on <code>float32</code>, but I passed in the raw grayscale output.</p>
</li>
</ul>
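<p>Putting both fixes together, the preprocessing step would look something like this (a minimal sketch; the synthetic image is just a stand-in for a real photo):</p>
<pre><code class="lang-python">import cv2
import numpy as np

img = np.zeros((40, 40, 3), dtype=np.uint8)  # stand-in for a real BGR photo
img[10:30, 10:30] = 255

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (0, 0), sigmaX=2)  # smooth first (sigma = 2)
corners = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.05)
</code></pre>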
<h3 id="heading-keypoints-and-descriptors">Keypoints and Descriptors</h3>
<ul>
<li><p>Instead of manually detecting corners and computing descriptors, I’d just use SIFT’s <code>detectAndCompute()</code> which handles both in a consistent way.</p>
</li>
<li><p>It’s more robust to scale and rotation than manually sampling Harris corners.</p>
</li>
</ul>
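<p>With that change, detection and description collapse into one call; a sketch (again with a synthetic image standing in for a real frame):</p>
<pre><code class="lang-python">import cv2
import numpy as np

gray = np.zeros((100, 100), dtype=np.uint8)
gray[30:70, 30:70] = 255  # a bright square gives SIFT something to find

sift = cv2.SIFT_create()
# Keypoints carry scale and orientation, so no manual Harris filtering is needed
keypoints, descriptors = sift.detectAndCompute(gray, None)
</code></pre>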
<h3 id="heading-matching">Matching</h3>
<ul>
<li><p>Rather than using brute-force matching with cross-checking and slicing the top 200 matches, I’d switch to KNN matching with Lowe’s ratio test. It does a better job at filtering out bad matches.</p>
</li>
<li><p>I’d also add a sanity check to make sure there are at least 4 good matches before trying to compute a homography.</p>
</li>
</ul>
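<p>A sketch of the revised matching step, with the ratio test and the minimum-match check folded into one helper (the helper name is mine, not from the original code):</p>
<pre><code class="lang-python">import cv2
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.75):
    """KNN match, keeping only matches that pass Lowe's ratio test."""
    bf = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in bf.knnMatch(desc1, desc2, k=2)
            if np.less(m.distance, ratio * n.distance)]
    if min(len(good), 4) != 4:  # i.e. fewer than the 4 correspondences findHomography needs
        raise ValueError("not enough good matches for a homography")
    return good
</code></pre>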
<h3 id="heading-multi-image-stitching">Multi-Image Stitching</h3>
<ul>
<li><p>Instead of planning out the full stitching sequence at the start, I’d do it recursively—merge the best pair, then re-run matching with the merged image and the rest.</p>
</li>
<li><p>That would be more adaptive and flexible, especially for larger sets of images.</p>
</li>
</ul>
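<p>The merge-then-rematch idea can be sketched without any OpenCV. Here <code>score</code> and <code>merge</code> are hypothetical stand-ins for inlier counting and the actual warping/blending:</p>
<pre><code class="lang-python">def stitch_all(images, score, merge):
    """Repeatedly merge the best-scoring pair, re-scoring against the merged result."""
    images = list(images)
    for _ in range(len(images) - 1):  # n images need n - 1 merges
        pairs = [(i, j) for i in range(len(images)) for j in range(i + 1, len(images))]
        i, j = max(pairs, key=lambda p: score(images[p[0]], images[p[1]]))
        merged = merge(images[i], images[j])
        images = [im for k, im in enumerate(images) if k not in (i, j)] + [merged]
    return images[0]

# Toy example: "images" are sets of feature ids, score = shared features
panorama = stitch_all([{1, 2, 3}, {3, 4, 5}, {5, 6}],
                      score=lambda a, b: len(a.intersection(b)),
                      merge=lambda a, b: a.union(b))
</code></pre>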
<h3 id="heading-blending">Blending</h3>
<ul>
<li>Right now, I just overwrite the overlap with the warped image. It works, but it creates hard seams. I’d do linear blending in the overlapping region or look into OpenCV’s multi-band blending for smoother transitions.</li>
</ul>
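<p>A minimal linear-blend sketch in pure NumPy; <code>base</code> and <code>warped</code> are same-size float images, and a constant 50/50 weight stands in for a proper distance-based ramp:</p>
<pre><code class="lang-python">import numpy as np

def blend(base, warped, alpha=0.5):
    """Average the overlap instead of overwriting it; keep lone pixels as-is."""
    base_mask = (base.sum(axis=-1) != 0)[..., None]
    warp_mask = (warped.sum(axis=-1) != 0)[..., None]
    overlap = np.logical_and(base_mask, warp_mask)
    out = np.where(base_mask, base, warped)  # whichever image has content
    out = np.where(overlap, alpha * base + (1 - alpha) * warped, out)
    return out
</code></pre>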
<hr />
<h2 id="heading-why-im-keeping-it-as-is">Why I’m Keeping It As-Is</h2>
<p>Even though I know how to make it better now, I’m keeping this version mostly untouched for two reasons:</p>
<ol>
<li><p>It shows where I was at the time. I think it’s important to document progress honestly.</p>
</li>
<li><p>The code still works, and I learned a lot from writing it—even if I’d approach it differently today.</p>
</li>
</ol>
<p>There’s value in being able to look back at your work and recognize areas for improvement. That’s part of the learning process.</p>
<hr />
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>This was one of the first projects where I had to take a full vision pipeline from scratch to output. It’s far from perfect, but it got me thinking critically about matching, transformation, and how to go from theory to implementation.</p>
<p>If you want to check out the code or try it yourself, here’s the GitHub repo:<br /><a target="_blank" href="https://github.com/USTranM/image-stitching-harris-sift">GitHub - image-stitching-harris-sift</a></p>
]]></content:encoded></item><item><title><![CDATA[Visualizing NYC Taxi Trends with Kepler.gl]]></title><description><![CDATA[In my last post, I walked through how to clean and prepare NYC taxi data with Python — merging multiple years of trip data, resolving zone IDs into boroughs, and engineering useful time features like hour, day_of_week, and timestamp. This time, I’m t...]]></description><link>https://blog.mtran.me/visualizing-nyc-taxi-trends-with-keplergl</link><guid isPermaLink="true">https://blog.mtran.me/visualizing-nyc-taxi-trends-with-keplergl</guid><category><![CDATA[nyc tlc]]></category><category><![CDATA[data visualization]]></category><category><![CDATA[GIS]]></category><category><![CDATA[Keplergl]]></category><dc:creator><![CDATA[Martin Tran]]></dc:creator><pubDate>Thu, 29 May 2025 19:25:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748546098237/20233237-23c9-44bd-ad40-fa8532fba39d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my <a target="_blank" href="https://blog.mtran.me/preprocessing-nyc-taxi-trip-data-with-python">last post</a>, I walked through how to clean and prepare NYC taxi data with Python — merging multiple years of trip data, resolving zone IDs into boroughs, and engineering useful time features like <code>hour</code>, <code>day_of_week</code>, and <code>timestamp</code>. This time, I’m taking that cleaned dataset (<code>kepler_zone_heatmap_detailed.csv</code>) and building a visual story with it using <a target="_blank" href="http://Kepler.gl">Kepler.gl</a>.</p>
<p>Here’s how I set it up.</p>
<hr />
<h2 id="heading-1-whats-in-the-data">1. What’s in the Data?</h2>
<p>This dataset includes hourly pickup counts grouped by NYC taxi zones, already joined with centroid coordinates. Each row represents a zone snapshot, and columns include:</p>
<ul>
<li><p><code>Latitude</code>, <code>Longitude</code>: Coordinates for the zone center</p>
</li>
<li><p><code>trip_count</code>: Number of pickups at that location and time</p>
</li>
<li><p><code>hour</code>, <code>day_of_week</code>, <code>day</code>, <code>day_of_year</code>: Temporal granularity for filtering</p>
<ul>
<li><code>day</code> is the day of the month (e.g. comparing traffic on December 25th versus December 1st)</li>
</ul>
</li>
<li><p><code>timestamp</code>: Rounded datetime used for animation/slider</p>
</li>
<li><p><code>Zone</code>, <code>Borough</code>: For labeling or aggregation later</p>
</li>
</ul>
<p>This isn’t raw GPS data — it’s zone-level aggregation, which makes visualization both faster and more interpretable.</p>
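<p>Before uploading, it can be worth a quick local check that the export has the columns Kepler needs (a small sketch; the column list mirrors the export from the previous post):</p>
<pre><code class="lang-python">import pandas as pd

EXPECTED = {"timestamp", "Latitude", "Longitude", "trip_count",
            "hour", "day", "day_of_year", "day_of_week"}

def missing_columns(df):
    """Return the expected Kepler columns that are absent from the frame."""
    return EXPECTED - set(df.columns)

# e.g. missing_columns(pd.read_csv("kepler_zone_heatmap_detailed.csv"))
</code></pre>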
<hr />
<h2 id="heading-2-uploading-to-keplerglhttpkeplergl">2. Uploading to <a target="_blank" href="http://Kepler.gl">Kepler.gl</a></h2>
<ul>
<li><p>Go to <a target="_blank" href="https://kepler.gl/demo">kepler.gl/demo</a></p>
</li>
<li><p>Upload <code>kepler_zone_heatmap_detailed.csv</code> (browse for it or drag it into the pop-up dialog)</p>
</li>
<li><p>In my experience, <a target="_blank" href="http://Kepler.gl">Kepler.gl</a> does <strong>not</strong> auto-detect <code>Latitude</code> and <code>Longitude</code>. You’ll have to manually confirm or set those in the layer config.</p>
</li>
</ul>
<hr />
<h2 id="heading-3-creating-the-heatmap-point-layer-style">3. Creating the Heatmap (Point Layer Style)</h2>
<p>While Kepler offers multiple layer types, I found better clarity using a <strong>Point Layer</strong> instead of a Heatmap. Here's how I configured it:</p>
<ul>
<li><p><strong>Layer Type</strong>: Point</p>
</li>
<li><p><strong>Fill Color</strong>: Set based on <code>trip_count</code></p>
</li>
<li><p><strong>Color Gradient</strong>: I changed the scale from <strong>white to red</strong> — this gave better contrast for spotting high-demand zones</p>
</li>
<li><p><strong>Radius</strong>: Adjust for visibility based on zoom level; I kept the default of 10</p>
</li>
</ul>
<p>The result makes it much easier to distinguish heavily trafficked areas like Midtown Manhattan or JFK Airport during rush hour.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748551185156/3c76f7b8-02b0-45bf-85b8-76bf2622bbf8.png" alt="Sundays at 10:00 PM" class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-4-enabling-time-playback">4. Enabling Time Playback</h2>
<p>To get temporal playback:</p>
<ul>
<li><p>Go to the <strong>Filters</strong> tab</p>
</li>
<li><p>Add a filter on <code>timestamp</code></p>
</li>
<li><p>Click on the <strong>"time playback"</strong> button to enable animation</p>
</li>
<li><p>You can also add filters for <code>day_of_week</code> or <code>hour</code> to explore weekday vs weekend patterns</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748545875309/3f92519e-842d-4453-8d62-8be09a9ff476.png" alt="Including all the filters makes it too granular, so you will have to play around with the combinations." class="image--center mx-auto" /></p>
<p>This let me visualize how demand shifts over time — morning rush in Midtown, late-night clusters in nightlife zones, etc.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748551136749/94705cf7-18c2-4dc5-bcba-3f80eeabe4c0.png" alt="Sundays at 4:00 AM" class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-5-getting-a-sharper-export">5. Getting a Sharper Export</h2>
<p>To export a high-quality map snapshot:</p>
<ol>
<li><p>Click the <strong>three-dot menu &gt; Export Image</strong></p>
</li>
<li><p>Choose resolution (I used 2x for clarity) and aspect ratio (16:9 works great for blog banners)</p>
</li>
</ol>
<p>If that’s not to your liking, you can always take a screenshot instead. I prefer screenshots (the images above are screenshots) since you can zoom into the map before capturing it.</p>
<p>The downside of a static snapshot is that it’s harder to tell which points have changed over time. Kepler.gl can also export the map as an HTML file that other users can open directly, but with my dataset the resulting file was about 300 MB, which is too large for most web viewers.</p>
<hr />
<h2 id="heading-6-whats-next">6. What’s Next?</h2>
<p>While this Kepler.gl setup is rudimentary, it gives a clear visual of when and where taxi demand spikes. I want to take it a step further by combining this with <strong>trip profitability (fare + tips)</strong> and <strong>estimated travel time or congestion</strong> to recommend optimal pickup zones at any given hour. Think of it as a driver-facing AI assistant.</p>
<p>More on that in the next post.</p>
<hr />
<p>Want to explore this yourself? The dataset’s light enough to run locally, and tools like Kepler.gl make this type of exploration a breeze without needing any frontend code.</p>
<p>Let me know what visual trends you notice — especially if you find hidden hotspots I missed.</p>
]]></content:encoded></item><item><title><![CDATA[Preprocessing NYC Taxi Trip Data with Python 🗽🚕]]></title><description><![CDATA[tags: [Python, Data Analysis, Pandas, Data Visualization, NYC]
Image credits: https://www.nyc.gov/site/tlc/about/about-tlc.page

Overview
The NYC Taxi & Limousine Commission (TLC) publishes one of the richest urban mobility datasets in the world, con...]]></description><link>https://blog.mtran.me/preprocessing-nyc-taxi-trip-data-with-python</link><guid isPermaLink="true">https://blog.mtran.me/preprocessing-nyc-taxi-trip-data-with-python</guid><category><![CDATA[Python]]></category><category><![CDATA[data visualization]]></category><category><![CDATA[nyc]]></category><category><![CDATA[pandas]]></category><category><![CDATA[data analysis]]></category><dc:creator><![CDATA[Martin Tran]]></dc:creator><pubDate>Sun, 25 May 2025 21:42:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748546199133/1f89ffc6-1ad2-40f5-8bef-88ebc2dba3bc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[
<p>Image credits: <a target="_blank" href="https://www.nyc.gov/site/tlc/about/about-tlc.page">https://www.nyc.gov/site/tlc/about/about-tlc.page</a></p>
<hr />
<h2 id="heading-overview">Overview</h2>
<p>The NYC Taxi &amp; Limousine Commission (TLC) publishes one of the richest urban mobility datasets in the world, containing millions of taxi trips taken across New York City. Analyzing this data offers insights into urban transportation patterns, congestion hotspots, and rider behavior. In this post, we’ll begin preparing the data for machine learning tasks by merging metadata, visualizing demand, and exporting spatial features.</p>
<hr />
<h2 id="heading-the-dataset">The Dataset</h2>
<p>I used publicly available data from the <a target="_blank" href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page">NYC Taxi &amp; Limousine Commission (TLC)</a>. Here's a glimpse at what the data looks like:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

df = pd.read_parquet(<span class="hljs-string">"yellow_tripdata_2020-01.parquet"</span>)
df.head()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td>VendorID</td><td>tpep_pickup_datetime</td><td>tpep_dropoff_datetime</td><td>passenger_count</td><td>trip_distance</td><td>RatecodeID</td><td>store_and_fwd_flag</td><td>PULocationID</td><td>DOLocationID</td><td>payment_type</td><td>fare_amount</td><td>extra</td><td>mta_tax</td><td>tip_amount</td><td>tolls_amount</td><td>improvement_surcharge</td><td>total_amount</td><td>congestion_surcharge</td><td>airport_fee</td></tr>
</thead>
<tbody>
<tr>
<td>1</td><td>2020-01-01 00:28:15</td><td>2020-01-01 00:33:03</td><td>1.0</td><td>1.2</td><td>1.0</td><td>N</td><td>238</td><td>239</td><td>1</td><td>6.0</td><td>3.0</td><td>0.5</td><td>1.47</td><td>0.0</td><td>0.3</td><td>11.27</td><td>2.5</td><td>None</td></tr>
<tr>
<td>1</td><td>2020-01-01 00:35:39</td><td>2020-01-01 00:43:04</td><td>1.0</td><td>1.2</td><td>1.0</td><td>N</td><td>239</td><td>238</td><td>1</td><td>7.0</td><td>3.0</td><td>0.5</td><td>1.50</td><td>0.0</td><td>0.3</td><td>12.30</td><td>2.5</td><td>None</td></tr>
<tr>
<td>1</td><td>2020-01-01 00:47:41</td><td>2020-01-01 00:53:52</td><td>1.0</td><td>0.6</td><td>1.0</td><td>N</td><td>238</td><td>238</td><td>1</td><td>6.0</td><td>3.0</td><td>0.5</td><td>1.00</td><td>0.0</td><td>0.3</td><td>10.80</td><td>2.5</td><td>None</td></tr>
<tr>
<td>1</td><td>2020-01-01 00:55:23</td><td>2020-01-01 01:00:14</td><td>1.0</td><td>0.8</td><td>1.0</td><td>N</td><td>238</td><td>151</td><td>1</td><td>5.5</td><td>0.5</td><td>0.5</td><td>1.36</td><td>0.0</td><td>0.3</td><td>8.16</td><td>0.0</td><td>None</td></tr>
<tr>
<td>2</td><td>2020-01-01 00:01:58</td><td>2020-01-01 00:04:16</td><td>1.0</td><td>0.0</td><td>1.0</td><td>N</td><td>193</td><td>193</td><td>2</td><td>3.5</td><td>0.5</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>4.80</td><td>0.0</td><td>None</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-what-do-the-colors-mean">What Do the Colors Mean?</h2>
<h4 id="heading-yellow-taxis">💛 Yellow Taxis</h4>
<ul>
<li><p><strong>Traditional medallion taxis</strong></p>
</li>
<li><p>Can pick up riders anywhere via <strong>street hail</strong></p>
</li>
<li><p>Mostly operate in <strong>Manhattan and airports</strong></p>
</li>
<li><p>Rich data: trip times, distances, fares, tips, payment types</p>
</li>
</ul>
<h4 id="heading-green-taxis">💚 Green Taxis</h4>
<ul>
<li><p>Also called <strong>Boro Taxis</strong></p>
</li>
<li><p>Focus on <strong>outer boroughs</strong> and <strong>Upper Manhattan</strong></p>
</li>
<li><p><strong>Street hail only allowed outside Midtown/Downtown Manhattan</strong></p>
</li>
<li><p>Data structure nearly identical to Yellow Taxis</p>
</li>
</ul>
<h4 id="heading-fhv-for-hire-vehicles">🚘 FHV (For-Hire Vehicles)</h4>
<ul>
<li><p>Covers <strong>Uber, Lyft, Via</strong>, and other pre-arranged rides</p>
</li>
<li><p>Cannot be hailed from the street</p>
</li>
<li><p><strong>Less granular data</strong> due to privacy: usually lacks fare and tip amounts</p>
</li>
</ul>
<h4 id="heading-high-volume-fhv">🚐 High Volume FHV</h4>
<ul>
<li><p>Since 2018, high-volume data includes:</p>
<ul>
<li><p>Trip miles, time, shared ride flag</p>
</li>
<li><p>Pickups by zone</p>
</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-step-1-loading-lookup-metadata">Step 1: Loading Lookup Metadata</h2>
<p>We started by loading the <strong>taxi zone lookup table</strong>, which maps <code>LocationID</code> values to boroughs and zones:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> dask.dataframe <span class="hljs-keyword">as</span> dd
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

lookup_table = pd.read_csv(<span class="hljs-string">'./taxi_zones/taxi_zone_lookup.csv'</span>, header=<span class="hljs-number">0</span>)
lookup_table.columns = [<span class="hljs-string">"LocationID"</span>, <span class="hljs-string">"Borough"</span>, <span class="hljs-string">"Zone"</span>, <span class="hljs-string">"service_zone"</span>]
lookup_table = lookup_table.fillna(<span class="hljs-string">"N/A"</span>)
lookup_table = dd.from_pandas(lookup_table, npartitions=<span class="hljs-number">1</span>)
</code></pre>
<p>This mapping is essential for geographic analysis and later visualizations or aggregations by borough or zone.</p>
<p>Note: <code>dask.dataframe</code> isn’t strictly necessary, but its wildcard support is convenient here because the data is split into one file per month.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Load all Parquet files into Dask</span>
df = dd.read_parquet(<span class="hljs-string">"./data/*/yellow_tripdata_*.parquet"</span>)
</code></pre>
<hr />
<h2 id="heading-step-2-preparing-spatial-coordinates">Step 2: Preparing Spatial Coordinates</h2>
<p>In tandem with zone information, we worked with <strong>centroid coordinates</strong> for each taxi zone:</p>
<pre><code class="lang-python">centroids = pd.read_csv(<span class="hljs-string">'./taxi_zones/taxi_zone_centroids.csv'</span>)
</code></pre>
<p>The centroid data provides latitude and longitude per <code>LocationID</code>, useful for plotting pickup and drop-off locations or calculating spatial features.</p>
<p>We also included optional (but commented out) GeoPandas logic to:</p>
<ul>
<li><p>Read in the official NYC taxi zone shapefile</p>
</li>
<li><p>Calculate geometric centroids</p>
</li>
<li><p>Convert the CRS to WGS84 (EPSG:4326)</p>
</li>
<li><p>Save those as CSV for reuse</p>
</li>
</ul>
<p>This step is crucial for anyone wanting to enhance ML features with geographic context.</p>
<hr />
<h2 id="heading-step-3-merging-data-for-contextual-features">Step 3: Merging Data for Contextual Features</h2>
<p>To tie the lookup and centroid data together, we merged them into one enriched table on <code>LocationID</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Merge trip data with zone lookup</span>
df = df.merge(lookup_table, left_on=<span class="hljs-string">"PULocationID"</span>, right_on=<span class="hljs-string">"LocationID"</span>, how=<span class="hljs-string">"left"</span>)
df = df.rename(columns={<span class="hljs-string">"Borough"</span>: <span class="hljs-string">"pickup_borough"</span>, <span class="hljs-string">"Zone"</span>: <span class="hljs-string">"pickup_zone"</span>, <span class="hljs-string">"service_zone"</span>: <span class="hljs-string">"pickup_service_zone"</span>})
df = df.drop(columns=[<span class="hljs-string">"LocationID"</span>])  <span class="hljs-comment"># Remove duplicate LocationID column</span>

df = df.merge(lookup_table, left_on=<span class="hljs-string">"DOLocationID"</span>, right_on=<span class="hljs-string">"LocationID"</span>, how=<span class="hljs-string">"left"</span>)
df = df.rename(columns={<span class="hljs-string">"Borough"</span>: <span class="hljs-string">"dropoff_borough"</span>, <span class="hljs-string">"Zone"</span>: <span class="hljs-string">"dropoff_zone"</span>, <span class="hljs-string">"service_zone"</span>: <span class="hljs-string">"dropoff_service_zone"</span>})
df = df.drop(columns=[<span class="hljs-string">"LocationID"</span>])  <span class="hljs-comment"># Remove duplicate LocationID column</span>

<span class="hljs-comment"># Merge centroids with trip data (ensure `PULocationID` is matched)</span>
df = df.merge(centroids, left_on=<span class="hljs-string">"PULocationID"</span>, right_on=<span class="hljs-string">"LocationID"</span>, how=<span class="hljs-string">"left"</span>)
df = df.drop(columns=[<span class="hljs-string">"LocationID"</span>]) <span class="hljs-comment"># Remove duplicate LocationID column</span>

df[<span class="hljs-string">"hour"</span>] = df[<span class="hljs-string">"tpep_pickup_datetime"</span>].dt.hour
</code></pre>
<p>This combined dataset now contains:</p>
<ul>
<li><p>Zone and borough names</p>
</li>
<li><p>Latitude/Longitude of zone centroids</p>
</li>
</ul>
<p>Such a table becomes critical when analyzing pickup or drop-off zones in the raw trip data.</p>
<hr />
<h2 id="heading-step-4-understanding-dfcompute">Step 4: Understanding <code>df.compute()</code></h2>
<p>Since we loaded the trip data with Dask, we needed to convert the lazy Dask DataFrame into a real in-memory Pandas DataFrame using <code>.compute()</code>:</p>
<pre><code class="lang-python">df = df.compute()  <span class="hljs-comment"># assign the result; compute() doesn't modify df in place</span>
</code></pre>
<p>This command triggers Dask to execute any queued operations and return the full Pandas DataFrame. It’s necessary when you want to:</p>
<ul>
<li><p>Perform operations that require full materialization (like merging with other Pandas data)</p>
</li>
<li><p>Preview results or export data</p>
</li>
<li><p>Move from parallel to single-machine processing when the data size is manageable</p>
</li>
</ul>
<p>Using <code>.compute()</code> strategically allows us to benefit from Dask’s scalability without sacrificing compatibility with core Python tools.</p>
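<p>To make the lazy/eager split concrete, here’s a tiny runnable illustration (toy data standing in for the trip records):</p>
<pre><code class="lang-python">import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"PULocationID": [1, 1, 2], "fare_amount": [5.0, 7.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

lazy = ddf.groupby("PULocationID")["fare_amount"].mean()  # nothing executed yet
result = lazy.compute()  # runs the task graph, returns a pandas Series
</code></pre>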
<hr />
<h2 id="heading-step-5-computing-basic-zone-statistics">Step 5: Computing Basic Zone Statistics</h2>
<p>We then computed some basic counts of how often each zone appeared:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Count pickups per zone (the full pipeline also groups by the time features</span>
<span class="hljs-comment"># exported later: hour, day, day_of_week, ...)</span>
zone_counts = df.groupby(<span class="hljs-string">"PULocationID"</span>).size().reset_index(name=<span class="hljs-string">"trip_count"</span>)

<span class="hljs-comment"># Merge zone counts with coordinates</span>
zone_counts = zone_counts.merge(
    centroids,
    left_on=<span class="hljs-string">"PULocationID"</span>,
    right_on=<span class="hljs-string">"LocationID"</span>,
    how=<span class="hljs-string">"left"</span>
)
</code></pre>
<p>These summaries provide insight into high-traffic areas, which can later be used to:</p>
<ul>
<li><p>Create features for popular zones</p>
</li>
<li><p>Weight zones based on frequency</p>
</li>
<li><p>Visualize spatial activity levels</p>
</li>
</ul>
<hr />
<h2 id="heading-step-6-creating-visualization-file-for-keplergl">Step 6: Creating Visualization File for Kepler.gl</h2>
<p>Once the data is merged and enriched with lat/lon coordinates, it's ready to be exported and visualized in a tool like <strong>Kepler.gl</strong>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Export to CSV for Kepler</span>
zone_counts[[
    <span class="hljs-string">"timestamp"</span>, <span class="hljs-string">"Latitude"</span>, <span class="hljs-string">"Longitude"</span>, <span class="hljs-string">"trip_count"</span>,
    <span class="hljs-string">"hour"</span>, <span class="hljs-string">"day"</span>, <span class="hljs-string">"day_of_year"</span>, <span class="hljs-string">"day_of_week"</span>
]].to_csv(<span class="hljs-string">"kepler_zone_heatmap_detailed.csv"</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<p>With Kepler.gl, we can:</p>
<ul>
<li><p>Plot pickup/drop-off zone centers</p>
</li>
<li><p>Visualize heatmaps of high-traffic zones</p>
</li>
<li><p>Animate trip volume over time (in future posts)</p>
</li>
</ul>
<p>This step is especially useful for communicating insights spatially and uncovering regional patterns before modeling.</p>
<hr />
<h2 id="heading-step-7-uncovering-demand-patterns-by-hour-and-borough">Step 7: Uncovering Demand Patterns by Hour and Borough</h2>
<p>To better understand rider behavior, we visualized taxi demand over time and across boroughs.</p>
<p><strong>Demand by Hour:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

df[<span class="hljs-string">'hour'</span>] = df[<span class="hljs-string">'tpep_pickup_datetime'</span>].dt.hour
hourly_demand = df.groupby(<span class="hljs-string">'hour'</span>).size()

plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>))
plt.plot(hourly_demand.index, hourly_demand.values, marker=<span class="hljs-string">'o'</span>)
plt.xlabel(<span class="hljs-string">"Hour of Day"</span>)
plt.ylabel(<span class="hljs-string">"Number of Trips"</span>)
plt.title(<span class="hljs-string">"NYC Taxi Demand by Hour"</span>)
plt.grid()
plt.show()
</code></pre>
<p>This line plot shows daily fluctuations in demand, highlighting rush hours and late-night surges.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748209110038/0196dfbd-3d9b-4d3d-8bd5-48dd25a6facb.png" alt class="image--center mx-auto" /></p>
<p><strong>Demand by Borough:</strong></p>
<pre><code class="lang-python">borough_counts = df[<span class="hljs-string">"pickup_borough"</span>].value_counts().drop(<span class="hljs-string">'N/A'</span>)

plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>))
borough_counts.plot(kind=<span class="hljs-string">"bar"</span>, color=<span class="hljs-string">"orange"</span>)
plt.xlabel(<span class="hljs-string">"Borough"</span>)
plt.ylabel(<span class="hljs-string">"Number of Trips"</span>)
plt.title(<span class="hljs-string">"NYC Taxi Demand by Borough"</span>)
plt.xticks(rotation=<span class="hljs-number">45</span>)
plt.show()
</code></pre>
<p>This bar chart reveals which boroughs are most active for pickups. Both analyses provide valuable context when building time-based or location-based features.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748209117171/1ac7b853-13c8-4880-a122-1681179cc540.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-next-steps-feature-engineering-amp-ml">Next Steps: Feature Engineering &amp; ML</h2>
<h3 id="heading-modeling-possibilities">Modeling Possibilities</h3>
<p>With zone metadata and demand patterns now accessible, we can engineer features like:</p>
<ul>
<li><p><code>pickup_hour</code>, <code>pickup_zone</code>, <code>dropoff_zone</code> (categorical features)</p>
</li>
<li><p>Trip distance (from raw data)</p>
</li>
<li><p>Average tip by zone or hour (aggregated features)</p>
</li>
<li><p>Popularity rank of pickup zones</p>
</li>
</ul>
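<p>As a sketch of one aggregated feature, assuming a pandas DataFrame shaped like the merged table from Step 3 (the new column name is illustrative):</p>
<pre><code class="lang-python">import pandas as pd

def add_avg_tip_feature(df):
    """Attach the mean tip per (pickup zone, hour) as a new column."""
    df = df.copy()
    df["pickup_hour"] = df["tpep_pickup_datetime"].dt.hour
    df["avg_tip_zone_hour"] = (
        df.groupby(["pickup_zone", "pickup_hour"])["tip_amount"].transform("mean")
    )
    return df
</code></pre>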
<p>These features could feed into models predicting:</p>
<ul>
<li><p>Fare amount (regression)</p>
</li>
<li><p>Trip duration (regression)</p>
</li>
<li><p>Demand forecasting (time series or classification)</p>
</li>
</ul>
<p>Stay tuned for the next post where we'll start visualizing the data with Kepler.gl.</p>
]]></content:encoded></item></channel></rss>