<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Martin Tran’s Tech Blog | Data Science, Machine Learning & Python Tutorials]]></title><description><![CDATA[Explore practical guides, tutorials, and insights on data science, machine learning, Python programming, and more—authored by Martin Tran.]]></description><link>https://blog.mtran.me</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1748806506806/38980de6-75a7-4711-b227-eab11fd8a703.png</url><title>Martin Tran’s Tech Blog | Data Science, Machine Learning &amp; Python Tutorials</title><link>https://blog.mtran.me</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 17:56:50 GMT</lastBuildDate><atom:link href="https://blog.mtran.me/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Stitching Images with Harris Corners and SIFT: A Homework Retrospective]]></title><description><![CDATA[This project came from a computer vision class where we built an image stitching pipeline using Harris corners and SIFT descriptors. 
I’m keeping the original code mostly intact to show how I approached it at the time, but I’ve added some thoughts bel...]]></description><link>https://blog.mtran.me/stitching-images-with-harris-corners-and-sift-a-homework-retrospective</link><guid isPermaLink="true">https://blog.mtran.me/stitching-images-with-harris-corners-and-sift-a-homework-retrospective</guid><category><![CDATA[image stitching]]></category><category><![CDATA[harris corners]]></category><category><![CDATA[Python]]></category><category><![CDATA[sift]]></category><category><![CDATA[Computer Vision]]></category><dc:creator><![CDATA[Martin Tran]]></dc:creator><pubDate>Sun, 01 Jun 2025 20:34:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748808683440/359e0f60-c5ab-459a-b83a-31b1cf622274.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This project came from a computer vision class where we built an image stitching pipeline using Harris corners and SIFT descriptors. I’m keeping the original code mostly intact to show how I approached it at the time, but I’ve added some thoughts below on what I’d do differently now.</p>
<p>The goal was to align and blend multiple overlapping images into a panorama, using classic computer vision techniques (no deep learning). This post goes over what I implemented, what I’ve learned since, and what I’d change if I revisited it today.</p>
<hr />
<h2 id="heading-what-this-project-does">What This Project Does</h2>
<ul>
<li><p>Detects corners in each image using Harris corner detection</p>
</li>
<li><p>Computes SIFT descriptors around those corners</p>
</li>
<li><p>Matches features between image pairs</p>
</li>
<li><p>Estimates the homography using RANSAC</p>
</li>
<li><p>Warps and blends images into a stitched result</p>
</li>
<li><p>Supports stitching multiple images in a sequence</p>
</li>
</ul>
<p>Here’s what the stitched output looks like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748808944035/edce3ff3-a7fc-449f-b74f-0f9ad4d9e42b.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-what-i-did-original-approach">What I Did (Original Approach)</h2>
<p>At the time, my focus was on getting the pipeline working and matching the spec. I handled corner detection with <code>cv2.cornerHarris</code>, filtered out low responses, then extracted SIFT descriptors.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> cv2
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

A_gray = cv2.cvtColor(A, cv2.COLOR_BGR2GRAY)
<span class="hljs-comment"># Using Harris corner detector on grayscale images A and B</span>
A_corners = cv2.cornerHarris(A_gray, blockSize=<span class="hljs-number">2</span>, ksize=<span class="hljs-number">3</span>, k=<span class="hljs-number">0.05</span>)
corner_locs = np.argwhere(A_corners &gt; <span class="hljs-number">0.01</span> * A_corners.max())[:, ::-<span class="hljs-number">1</span>]  <span class="hljs-comment"># (x, y) above threshold</span>
sift = cv2.SIFT_create()
radius = <span class="hljs-number">5</span>  <span class="hljs-comment"># keypoint size handed to SIFT (illustrative value)</span>
keypoints = [cv2.KeyPoint(<span class="hljs-built_in">float</span>(x), <span class="hljs-built_in">float</span>(y), radius) <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> corner_locs]
_, descriptors = sift.compute(A_gray, keypoints)
</code></pre>
<p>I matched descriptors using OpenCV’s brute-force matcher with cross-checking, took the top 200 matches, and then used RANSAC to compute a homography.</p>
<pre><code class="lang-python">bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=<span class="hljs-literal">True</span>)
matches = bf.match(desc1, desc2)
matches = sorted(matches, key=<span class="hljs-keyword">lambda</span> x: x.distance)[:<span class="hljs-number">200</span>]  <span class="hljs-comment"># keep the 200 closest matches</span>

<span class="hljs-comment"># Pull the matched coordinates out of each keypoint list</span>
src_pts = np.float32([kp1[m.queryIdx].pt <span class="hljs-keyword">for</span> m <span class="hljs-keyword">in</span> matches]).reshape(-<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>)
dst_pts = np.float32([kp2[m.trainIdx].pt <span class="hljs-keyword">for</span> m <span class="hljs-keyword">in</span> matches]).reshape(-<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>)

<span class="hljs-comment"># Estimate homography from matched keypoints using RANSAC</span>
<span class="hljs-comment"># (0.3 px is a tight reprojection threshold; 3-5 px is more typical)</span>
H, mask = cv2.findHomography(dst_pts, src_pts, cv2.RANSAC, <span class="hljs-number">0.3</span>)
</code></pre>
<p>After computing the homography, I warped one image into the other's frame and blended them together. For multiple images, I ran all pairwise combinations, selected the pairs with the most inliers, and stitched them in that order. It worked, though it was a bit rigid.</p>
<pre><code class="lang-python">warped = cv2.warpPerspective(new_img, H, canvas_size)
mask = (warped &gt; <span class="hljs-number">0</span>)
panorama[mask] = warped[mask]
</code></pre>
<hr />
<h2 id="heading-what-id-do-differently-now">What I’d Do Differently Now</h2>
<p>Since submitting this assignment, I’ve learned a few things that would make the whole pipeline more reliable and scalable. If I were to revisit this, here’s how I’d improve it:</p>
<h3 id="heading-preprocessing">Preprocessing</h3>
<ul>
<li><p>I didn’t smooth the grayscale image before running Harris detection, which can cause noisy responses. I’d apply a Gaussian blur (σ=2) beforehand now.</p>
</li>
<li><p>Harris should run on <code>float32</code>, but I passed in the raw grayscale output.</p>
</li>
</ul>
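<p>Putting both fixes together, the preprocessing step would look something like this (a minimal sketch; the synthetic image is just a stand-in for a real photo):</p>
<pre><code class="lang-python">import cv2
import numpy as np

img = np.zeros((40, 40, 3), dtype=np.uint8)  # stand-in for a real BGR photo
img[10:30, 10:30] = 255

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (0, 0), sigmaX=2)  # smooth first (sigma = 2)
corners = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.05)
</code></pre>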
<h3 id="heading-keypoints-and-descriptors">Keypoints and Descriptors</h3>
<ul>
<li><p>Instead of manually detecting corners and computing descriptors, I’d just use SIFT’s <code>detectAndCompute()</code> which handles both in a consistent way.</p>
</li>
<li><p>It’s more robust to scale and rotation than manually sampling Harris corners.</p>
</li>
</ul>
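<p>With that change, detection and description collapse into one call; a sketch (again with a synthetic image standing in for a real frame):</p>
<pre><code class="lang-python">import cv2
import numpy as np

gray = np.zeros((100, 100), dtype=np.uint8)
gray[30:70, 30:70] = 255  # a bright square gives SIFT something to find

sift = cv2.SIFT_create()
# Keypoints carry scale and orientation, so no manual Harris filtering is needed
keypoints, descriptors = sift.detectAndCompute(gray, None)
</code></pre>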
<h3 id="heading-matching">Matching</h3>
<ul>
<li><p>Rather than using brute-force matching with cross-checking and slicing the top 200 matches, I’d switch to KNN matching with Lowe’s ratio test. It does a better job at filtering out bad matches.</p>
</li>
<li><p>I’d also add a sanity check to make sure there are at least 4 good matches before trying to compute a homography.</p>
</li>
</ul>
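<p>A sketch of the revised matching step, with the ratio test and the minimum-match check folded into one helper (the helper name is mine, not from the original code):</p>
<pre><code class="lang-python">import cv2
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.75):
    """KNN match, keeping only matches that pass Lowe's ratio test."""
    bf = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in bf.knnMatch(desc1, desc2, k=2)
            if np.less(m.distance, ratio * n.distance)]
    if min(len(good), 4) != 4:  # i.e. fewer than the 4 correspondences findHomography needs
        raise ValueError("not enough good matches for a homography")
    return good
</code></pre>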
<h3 id="heading-multi-image-stitching">Multi-Image Stitching</h3>
<ul>
<li><p>Instead of planning out the full stitching sequence at the start, I’d do it recursively—merge the best pair, then re-run matching with the merged image and the rest.</p>
</li>
<li><p>That would be more adaptive and flexible, especially for larger sets of images.</p>
</li>
</ul>
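<p>The merge-then-rematch idea can be sketched without any OpenCV. Here <code>score</code> and <code>merge</code> are hypothetical stand-ins for inlier counting and the actual warping/blending:</p>
<pre><code class="lang-python">def stitch_all(images, score, merge):
    """Repeatedly merge the best-scoring pair, re-scoring against the merged result."""
    images = list(images)
    for _ in range(len(images) - 1):  # n images need n - 1 merges
        pairs = [(i, j) for i in range(len(images)) for j in range(i + 1, len(images))]
        i, j = max(pairs, key=lambda p: score(images[p[0]], images[p[1]]))
        merged = merge(images[i], images[j])
        images = [im for k, im in enumerate(images) if k not in (i, j)] + [merged]
    return images[0]

# Toy example: "images" are sets of feature ids, score = shared features
panorama = stitch_all([{1, 2, 3}, {3, 4, 5}, {5, 6}],
                      score=lambda a, b: len(a.intersection(b)),
                      merge=lambda a, b: a.union(b))
</code></pre>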
<h3 id="heading-blending">Blending</h3>
<ul>
<li>Right now, I just overwrite the overlap with the warped image. It works, but it creates hard seams. I’d do linear blending in the overlapping region or look into OpenCV’s multi-band blending for smoother transitions.</li>
</ul>
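<p>A minimal linear-blend sketch in pure NumPy; <code>base</code> and <code>warped</code> are same-size float images, and a constant 50/50 weight stands in for a proper distance-based ramp:</p>
<pre><code class="lang-python">import numpy as np

def blend(base, warped, alpha=0.5):
    """Average the overlap instead of overwriting it; keep lone pixels as-is."""
    base_mask = (base.sum(axis=-1) != 0)[..., None]
    warp_mask = (warped.sum(axis=-1) != 0)[..., None]
    overlap = np.logical_and(base_mask, warp_mask)
    out = np.where(base_mask, base, warped)  # whichever image has content
    out = np.where(overlap, alpha * base + (1 - alpha) * warped, out)
    return out
</code></pre>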
<hr />
<h2 id="heading-why-im-keeping-it-as-is">Why I’m Keeping It As-Is</h2>
<p>Even though I know how to make it better now, I’m keeping this version mostly untouched for two reasons:</p>
<ol>
<li><p>It shows where I was at the time. I think it’s important to document progress honestly.</p>
</li>
<li><p>The code still works, and I learned a lot from writing it—even if I’d approach it differently today.</p>
</li>
</ol>
<p>There’s value in being able to look back at your work and recognize areas for improvement. That’s part of the learning process.</p>
<hr />
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>This was one of the first projects where I had to take a full vision pipeline from scratch to output. It’s far from perfect, but it got me thinking critically about matching, transformation, and how to go from theory to implementation.</p>
<p>If you want to check out the code or try it yourself, here’s the GitHub repo:<br /><a target="_blank" href="https://github.com/USTranM/image-stitching-harris-sift">GitHub - image-stitching-harris-sift</a></p>
]]></content:encoded></item><item><title><![CDATA[Visualizing NYC Taxi Trends with Kepler.gl]]></title><description><![CDATA[In my last post, I walked through how to clean and prepare NYC taxi data with Python — merging multiple years of trip data, resolving zone IDs into boroughs, and engineering useful time features like hour, day_of_week, and timestamp. This time, I’m t...]]></description><link>https://blog.mtran.me/visualizing-nyc-taxi-trends-with-keplergl</link><guid isPermaLink="true">https://blog.mtran.me/visualizing-nyc-taxi-trends-with-keplergl</guid><category><![CDATA[nyc tlc]]></category><category><![CDATA[data visualization]]></category><category><![CDATA[GIS]]></category><category><![CDATA[Keplergl]]></category><dc:creator><![CDATA[Martin Tran]]></dc:creator><pubDate>Thu, 29 May 2025 19:25:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748546098237/20233237-23c9-44bd-ad40-fa8532fba39d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my <a target="_blank" href="https://blog.mtran.me/preprocessing-nyc-taxi-trip-data-with-python">last post</a>, I walked through how to clean and prepare NYC taxi data with Python — merging multiple years of trip data, resolving zone IDs into boroughs, and engineering useful time features like <code>hour</code>, <code>day_of_week</code>, and <code>timestamp</code>. This time, I’m taking that cleaned dataset (<code>kepler_zone_heatmap_detailed.csv</code>) and building a visual story with it using <a target="_blank" href="http://Kepler.gl">Kepler.gl</a>.</p>
<p>Here’s how I set it up.</p>
<hr />
<h2 id="heading-1-whats-in-the-data">1. What’s in the Data?</h2>
<p>This dataset includes hourly pickup counts grouped by NYC taxi zones, already joined with centroid coordinates. Each row represents a zone snapshot, and columns include:</p>
<ul>
<li><p><code>Latitude</code>, <code>Longitude</code>: Coordinates for the zone center</p>
</li>
<li><p><code>trip_count</code>: Number of pickups at that location and time</p>
</li>
<li><p><code>hour</code>, <code>day_of_week</code>, <code>day</code>, <code>day_of_year</code>: Temporal granularity for filtering</p>
<ul>
<li><code>day</code> is the day of the month (e.g. comparing traffic on December 25th versus December 1st)</li>
</ul>
</li>
<li><p><code>timestamp</code>: Rounded datetime used for animation/slider</p>
</li>
<li><p><code>Zone</code>, <code>Borough</code>: For labeling or aggregation later</p>
</li>
</ul>
<p>This isn’t raw GPS data — it’s zone-level aggregation, which makes visualization both faster and more interpretable.</p>
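<p>Before uploading, it can be worth a quick local check that the export has the columns Kepler needs (a small sketch; the column list mirrors the export from the previous post):</p>
<pre><code class="lang-python">import pandas as pd

EXPECTED = {"timestamp", "Latitude", "Longitude", "trip_count",
            "hour", "day", "day_of_year", "day_of_week"}

def missing_columns(df):
    """Return the expected Kepler columns that are absent from the frame."""
    return EXPECTED - set(df.columns)

# e.g. missing_columns(pd.read_csv("kepler_zone_heatmap_detailed.csv"))
</code></pre>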
<hr />
<h2 id="heading-2-uploading-to-keplerglhttpkeplergl">2. Uploading to <a target="_blank" href="http://Kepler.gl">Kepler.gl</a></h2>
<ul>
<li><p>Go to <a target="_blank" href="https://kepler.gl/demo">kepler.gl/demo</a></p>
</li>
<li><p>Upload <code>kepler_zone_heatmap_detailed.csv</code> (browse for it or drag it into the pop-up dialog)</p>
</li>
<li><p>In my experience, <a target="_blank" href="http://Kepler.gl">Kepler.gl</a> does <strong>not</strong> auto-detect <code>Latitude</code> and <code>Longitude</code>. You’ll have to manually confirm or set those in the layer config.</p>
</li>
</ul>
<hr />
<h2 id="heading-3-creating-the-heatmap-point-layer-style">3. Creating the Heatmap (Point Layer Style)</h2>
<p>While Kepler offers multiple layer types, I found better clarity using a <strong>Point Layer</strong> instead of a Heatmap. Here's how I configured it:</p>
<ul>
<li><p><strong>Layer Type</strong>: Point</p>
</li>
<li><p><strong>Fill Color</strong>: Set based on <code>trip_count</code></p>
</li>
<li><p><strong>Color Gradient</strong>: I changed the scale from <strong>white to red</strong> — this gave better contrast for spotting high-demand zones</p>
</li>
<li><p><strong>Radius</strong>: Adjust for visibility based on zoom level; I kept the default of 10</p>
</li>
</ul>
<p>The result makes it much easier to distinguish heavily trafficked areas like Midtown Manhattan or JFK Airport during rush hour.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748551185156/3c76f7b8-02b0-45bf-85b8-76bf2622bbf8.png" alt="Sundays at 10:00 PM" class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-4-enabling-time-playback">4. Enabling Time Playback</h2>
<p>To get temporal playback:</p>
<ul>
<li><p>Go to the <strong>Filters</strong> tab</p>
</li>
<li><p>Add a filter on <code>timestamp</code></p>
</li>
<li><p>Click on the <strong>"time playback"</strong> button to enable animation</p>
</li>
<li><p>You can also add filters for <code>day_of_week</code> or <code>hour</code> to explore weekday vs weekend patterns</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748545875309/3f92519e-842d-4453-8d62-8be09a9ff476.png" alt="Including all the filters makes it too granular, so you will have to play around with the combinations." class="image--center mx-auto" /></p>
<p>This let me visualize how demand shifts over time — morning rush in Midtown, late-night clusters in nightlife zones, etc.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748551136749/94705cf7-18c2-4dc5-bcba-3f80eeabe4c0.png" alt="Sundays at 4:00 AM" class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-5-getting-a-sharper-export">5. Getting a Sharper Export</h2>
<p>To export a high-quality map snapshot:</p>
<ol>
<li><p>Click the <strong>three-dot menu &gt; Export Image</strong></p>
</li>
<li><p>Choose resolution (I used 2x for clarity) and aspect ratio (16:9 works great for blog banners)</p>
</li>
</ol>
<p>If that’s not to your liking, you can always take a screenshot instead. I prefer screenshots (the images above are screenshots) since you can zoom into the map before capturing it.</p>
<p>The downside of a static snapshot is that it’s harder to tell which points have changed over time. Kepler.gl can also export the map as an HTML file that other users can open directly, but with my dataset the resulting file was about 300 MB, which is too large for most web viewers.</p>
<hr />
<h2 id="heading-6-whats-next">6. What’s Next?</h2>
<p>While this Kepler.gl setup is rudimentary, it gives a clear visual of when and where taxi demand spikes. I want to take it a step further by combining this with <strong>trip profitability (fare + tips)</strong> and <strong>estimated travel time or congestion</strong> to recommend optimal pickup zones at any given hour. Think of it as a driver-facing AI assistant.</p>
<p>More on that in the next post.</p>
<hr />
<p>Want to explore this yourself? The dataset’s light enough to run locally, and tools like Kepler.gl make this type of exploration a breeze without needing any frontend code.</p>
<p>Let me know what visual trends you notice — especially if you find hidden hotspots I missed.</p>
]]></content:encoded></item><item><title><![CDATA[Preprocessing NYC Taxi Trip Data with Python 🗽🚕]]></title><description><![CDATA[tags: [Python, Data Analysis, Pandas, Data Visualization, NYC]
Image credits: https://www.nyc.gov/site/tlc/about/about-tlc.page

Overview
The NYC Taxi & Limousine Commission (TLC) publishes one of the richest urban mobility datasets in the world, con...]]></description><link>https://blog.mtran.me/preprocessing-nyc-taxi-trip-data-with-python</link><guid isPermaLink="true">https://blog.mtran.me/preprocessing-nyc-taxi-trip-data-with-python</guid><category><![CDATA[Python]]></category><category><![CDATA[data visualization]]></category><category><![CDATA[nyc]]></category><category><![CDATA[pandas]]></category><category><![CDATA[data analysis]]></category><dc:creator><![CDATA[Martin Tran]]></dc:creator><pubDate>Sun, 25 May 2025 21:42:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748546199133/1f89ffc6-1ad2-40f5-8bef-88ebc2dba3bc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[
<p>Image credits: <a target="_blank" href="https://www.nyc.gov/site/tlc/about/about-tlc.page">https://www.nyc.gov/site/tlc/about/about-tlc.page</a></p>
<hr />
<h2 id="heading-overview">Overview</h2>
<p>The NYC Taxi &amp; Limousine Commission (TLC) publishes one of the richest urban mobility datasets in the world, containing millions of taxi trips taken across New York City. Analyzing this data offers insights into urban transportation patterns, congestion hotspots, and rider behavior. In this post, we’ll begin preparing the data for machine learning tasks by merging metadata, visualizing demand, and exporting spatial features.</p>
<hr />
<h2 id="heading-the-dataset">The Dataset</h2>
<p>I used publicly available data from the <a target="_blank" href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page">NYC Taxi &amp; Limousine Commission (TLC)</a>. Here's a glimpse at what the data looks like:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

df = pd.read_parquet(<span class="hljs-string">"yellow_tripdata_2020-01.parquet"</span>)
df.head()
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td>VendorID</td><td>tpep_pickup_datetime</td><td>tpep_dropoff_datetime</td><td>passenger_count</td><td>trip_distance</td><td>RatecodeID</td><td>store_and_fwd_flag</td><td>PULocationID</td><td>DOLocationID</td><td>payment_type</td><td>fare_amount</td><td>extra</td><td>mta_tax</td><td>tip_amount</td><td>tolls_amount</td><td>improvement_surcharge</td><td>total_amount</td><td>congestion_surcharge</td><td>airport_fee</td></tr>
</thead>
<tbody>
<tr>
<td>1</td><td>2020-01-01 00:28:15</td><td>2020-01-01 00:33:03</td><td>1.0</td><td>1.2</td><td>1.0</td><td>N</td><td>238</td><td>239</td><td>1</td><td>6.0</td><td>3.0</td><td>0.5</td><td>1.47</td><td>0.0</td><td>0.3</td><td>11.27</td><td>2.5</td><td>None</td></tr>
<tr>
<td>1</td><td>2020-01-01 00:35:39</td><td>2020-01-01 00:43:04</td><td>1.0</td><td>1.2</td><td>1.0</td><td>N</td><td>239</td><td>238</td><td>1</td><td>7.0</td><td>3.0</td><td>0.5</td><td>1.50</td><td>0.0</td><td>0.3</td><td>12.30</td><td>2.5</td><td>None</td></tr>
<tr>
<td>1</td><td>2020-01-01 00:47:41</td><td>2020-01-01 00:53:52</td><td>1.0</td><td>0.6</td><td>1.0</td><td>N</td><td>238</td><td>238</td><td>1</td><td>6.0</td><td>3.0</td><td>0.5</td><td>1.00</td><td>0.0</td><td>0.3</td><td>10.80</td><td>2.5</td><td>None</td></tr>
<tr>
<td>1</td><td>2020-01-01 00:55:23</td><td>2020-01-01 01:00:14</td><td>1.0</td><td>0.8</td><td>1.0</td><td>N</td><td>238</td><td>151</td><td>1</td><td>5.5</td><td>0.5</td><td>0.5</td><td>1.36</td><td>0.0</td><td>0.3</td><td>8.16</td><td>0.0</td><td>None</td></tr>
<tr>
<td>2</td><td>2020-01-01 00:01:58</td><td>2020-01-01 00:04:16</td><td>1.0</td><td>0.0</td><td>1.0</td><td>N</td><td>193</td><td>193</td><td>2</td><td>3.5</td><td>0.5</td><td>0.5</td><td>0.00</td><td>0.0</td><td>0.3</td><td>4.80</td><td>0.0</td><td>None</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-what-do-the-colors-mean">What Do the Colors Mean?</h2>
<h4 id="heading-yellow-taxis">💛 Yellow Taxis</h4>
<ul>
<li><p><strong>Traditional medallion taxis</strong></p>
</li>
<li><p>Can pick up riders anywhere via <strong>street hail</strong></p>
</li>
<li><p>Mostly operate in <strong>Manhattan and airports</strong></p>
</li>
<li><p>Rich data: trip times, distances, fares, tips, payment types</p>
</li>
</ul>
<h4 id="heading-green-taxis">💚 Green Taxis</h4>
<ul>
<li><p>Also called <strong>Boro Taxis</strong></p>
</li>
<li><p>Focus on <strong>outer boroughs</strong> and <strong>Upper Manhattan</strong></p>
</li>
<li><p><strong>Street hail only allowed outside Midtown/Downtown Manhattan</strong></p>
</li>
<li><p>Data structure nearly identical to Yellow Taxis</p>
</li>
</ul>
<h4 id="heading-fhv-for-hire-vehicles">🚘 FHV (For-Hire Vehicles)</h4>
<ul>
<li><p>Covers <strong>Uber, Lyft, Via</strong>, and other pre-arranged rides</p>
</li>
<li><p>Cannot be hailed from the street</p>
</li>
<li><p><strong>Less granular data</strong> due to privacy: usually lacks fare and tip amounts</p>
</li>
</ul>
<h4 id="heading-high-volume-fhv">🚐 High Volume FHV</h4>
<ul>
<li><p>Since 2018, high-volume data includes:</p>
<ul>
<li><p>Trip miles, time, shared ride flag</p>
</li>
<li><p>Pickups by zone</p>
</li>
</ul>
</li>
</ul>
<hr />
<h2 id="heading-step-1-loading-lookup-metadata">Step 1: Loading Lookup Metadata</h2>
<p>We started by loading the <strong>taxi zone lookup table</strong>, which maps <code>LocationID</code> values to boroughs and zones:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> dask.dataframe <span class="hljs-keyword">as</span> dd
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

lookup_table = pd.read_csv(<span class="hljs-string">'./taxi_zones/taxi_zone_lookup.csv'</span>, header=<span class="hljs-number">0</span>)
lookup_table.columns = [<span class="hljs-string">"LocationID"</span>, <span class="hljs-string">"Borough"</span>, <span class="hljs-string">"Zone"</span>, <span class="hljs-string">"service_zone"</span>]
lookup_table = lookup_table.fillna(<span class="hljs-string">"N/A"</span>)
lookup_table = dd.from_pandas(lookup_table, npartitions=<span class="hljs-number">1</span>)
</code></pre>
<p>This mapping is essential for geographic analysis and later visualizations or aggregations by borough or zone.</p>
<p>Note: <code>dask.dataframe</code> isn’t strictly necessary, but its wildcard support is convenient here because the data is split into one file per month.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Load all Parquet files into Dask</span>
df = dd.read_parquet(<span class="hljs-string">"./data/*/yellow_tripdata_*.parquet"</span>)
</code></pre>
<hr />
<h2 id="heading-step-2-preparing-spatial-coordinates">Step 2: Preparing Spatial Coordinates</h2>
<p>In tandem with zone information, we worked with <strong>centroid coordinates</strong> for each taxi zone:</p>
<pre><code class="lang-python">centroids = pd.read_csv(<span class="hljs-string">'./taxi_zones/taxi_zone_centroids.csv'</span>)
</code></pre>
<p>The centroid data provides latitude and longitude per <code>LocationID</code>, useful for plotting pickup and drop-off locations or calculating spatial features.</p>
<p>We also included optional (but commented out) GeoPandas logic to:</p>
<ul>
<li><p>Read in the official NYC taxi zone shapefile</p>
</li>
<li><p>Calculate geometric centroids</p>
</li>
<li><p>Convert the CRS to WGS84 (EPSG:4326)</p>
</li>
<li><p>Save those as CSV for reuse</p>
</li>
</ul>
<p>This step is crucial for anyone wanting to enhance ML features with geographic context.</p>
<hr />
<h2 id="heading-step-3-merging-data-for-contextual-features">Step 3: Merging Data for Contextual Features</h2>
<p>To tie the lookup and centroid data together, we merged them into one enriched table on <code>LocationID</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Merge trip data with zone lookup</span>
df = df.merge(lookup_table, left_on=<span class="hljs-string">"PULocationID"</span>, right_on=<span class="hljs-string">"LocationID"</span>, how=<span class="hljs-string">"left"</span>)
df = df.rename(columns={<span class="hljs-string">"Borough"</span>: <span class="hljs-string">"pickup_borough"</span>, <span class="hljs-string">"Zone"</span>: <span class="hljs-string">"pickup_zone"</span>, <span class="hljs-string">"service_zone"</span>: <span class="hljs-string">"pickup_service_zone"</span>})
df = df.drop(columns=[<span class="hljs-string">"LocationID"</span>])  <span class="hljs-comment"># Remove duplicate LocationID column</span>

df = df.merge(lookup_table, left_on=<span class="hljs-string">"DOLocationID"</span>, right_on=<span class="hljs-string">"LocationID"</span>, how=<span class="hljs-string">"left"</span>)
df = df.rename(columns={<span class="hljs-string">"Borough"</span>: <span class="hljs-string">"dropoff_borough"</span>, <span class="hljs-string">"Zone"</span>: <span class="hljs-string">"dropoff_zone"</span>, <span class="hljs-string">"service_zone"</span>: <span class="hljs-string">"dropoff_service_zone"</span>})
df = df.drop(columns=[<span class="hljs-string">"LocationID"</span>])  <span class="hljs-comment"># Remove duplicate LocationID column</span>

<span class="hljs-comment"># Merge centroids with trip data (ensure `PULocationID` is matched)</span>
df = df.merge(centroids, left_on=<span class="hljs-string">"PULocationID"</span>, right_on=<span class="hljs-string">"LocationID"</span>, how=<span class="hljs-string">"left"</span>)
df = df.drop(columns=[<span class="hljs-string">"LocationID"</span>]) <span class="hljs-comment"># Remove duplicate LocationID column</span>

df[<span class="hljs-string">"hour"</span>] = df[<span class="hljs-string">"tpep_pickup_datetime"</span>].dt.hour
</code></pre>
<p>This combined dataset now contains:</p>
<ul>
<li><p>Zone and borough names</p>
</li>
<li><p>Latitude/Longitude of zone centroids</p>
</li>
</ul>
<p>Such a table becomes critical when analyzing pickup or drop-off zones in the raw trip data.</p>
<hr />
<h2 id="heading-step-4-understanding-dfcompute">Step 4: Understanding <code>df.compute()</code></h2>
<p>Since we loaded the trip data with Dask, we needed to convert the lazy Dask DataFrame into a real in-memory Pandas DataFrame using <code>.compute()</code>:</p>
<pre><code class="lang-python">df = df.compute()  <span class="hljs-comment"># assign the result; compute() doesn't modify df in place</span>
</code></pre>
<p>This command triggers Dask to execute any queued operations and return the full Pandas DataFrame. It’s necessary when you want to:</p>
<ul>
<li><p>Perform operations that require full materialization (like merging with other Pandas data)</p>
</li>
<li><p>Preview results or export data</p>
</li>
<li><p>Move from parallel to single-machine processing when the data size is manageable</p>
</li>
</ul>
<p>Using <code>.compute()</code> strategically allows us to benefit from Dask’s scalability without sacrificing compatibility with core Python tools.</p>
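<p>To make the lazy/eager split concrete, here’s a tiny runnable illustration (toy data standing in for the trip records):</p>
<pre><code class="lang-python">import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"PULocationID": [1, 1, 2], "fare_amount": [5.0, 7.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

lazy = ddf.groupby("PULocationID")["fare_amount"].mean()  # nothing executed yet
result = lazy.compute()  # runs the task graph, returns a pandas Series
</code></pre>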
<hr />
<h2 id="heading-step-5-computing-basic-zone-statistics">Step 5: Computing Basic Zone Statistics</h2>
<p>We then computed some basic counts of how often each zone appeared:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Count pickups per zone (the full pipeline also groups by the time features</span>
<span class="hljs-comment"># exported later: hour, day, day_of_week, ...)</span>
zone_counts = df.groupby(<span class="hljs-string">"PULocationID"</span>).size().reset_index(name=<span class="hljs-string">"trip_count"</span>)

<span class="hljs-comment"># Merge zone counts with coordinates</span>
zone_counts = zone_counts.merge(
    centroids,
    left_on=<span class="hljs-string">"PULocationID"</span>,
    right_on=<span class="hljs-string">"LocationID"</span>,
    how=<span class="hljs-string">"left"</span>
)
</code></pre>
<p>These summaries provide insight into high-traffic areas, which can later be used to:</p>
<ul>
<li><p>Create features for popular zones</p>
</li>
<li><p>Weight zones based on frequency</p>
</li>
<li><p>Visualize spatial activity levels</p>
</li>
</ul>
<hr />
<h2 id="heading-step-6-creating-visualization-file-for-keplergl">Step 6: Creating Visualization File for Kepler.gl</h2>
<p>Once the data is merged and enriched with lat/lon coordinates, it's ready to be exported and visualized in a tool like <strong>Kepler.gl</strong>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Export to CSV for Kepler</span>
zone_counts[[
    <span class="hljs-string">"timestamp"</span>, <span class="hljs-string">"Latitude"</span>, <span class="hljs-string">"Longitude"</span>, <span class="hljs-string">"trip_count"</span>,
    <span class="hljs-string">"hour"</span>, <span class="hljs-string">"day"</span>, <span class="hljs-string">"day_of_year"</span>, <span class="hljs-string">"day_of_week"</span>
]].to_csv(<span class="hljs-string">"kepler_zone_heatmap_detailed.csv"</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<p>With Kepler.gl, we can:</p>
<ul>
<li><p>Plot pickup/drop-off zone centers</p>
</li>
<li><p>Visualize heatmaps of high-traffic zones</p>
</li>
<li><p>Animate trip volume over time (in future posts)</p>
</li>
</ul>
<p>This step is especially useful for communicating insights spatially and uncovering regional patterns before modeling.</p>
<hr />
<h2 id="heading-step-7-uncovering-demand-patterns-by-hour-and-borough">Step 7: Uncovering Demand Patterns by Hour and Borough</h2>
<p>To better understand rider behavior, we visualized taxi demand over time and across boroughs.</p>
<p><strong>Demand by Hour:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

df[<span class="hljs-string">'hour'</span>] = df[<span class="hljs-string">'tpep_pickup_datetime'</span>].dt.hour
hourly_demand = df.groupby(<span class="hljs-string">'hour'</span>).size()

plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>))
plt.plot(hourly_demand.index, hourly_demand.values, marker=<span class="hljs-string">'o'</span>)
plt.xlabel(<span class="hljs-string">"Hour of Day"</span>)
plt.ylabel(<span class="hljs-string">"Number of Trips"</span>)
plt.title(<span class="hljs-string">"NYC Taxi Demand by Hour"</span>)
plt.grid()
plt.show()
</code></pre>
<p>This line plot shows daily fluctuations in demand, highlighting rush hours and late-night surges.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748209110038/0196dfbd-3d9b-4d3d-8bd5-48dd25a6facb.png" alt class="image--center mx-auto" /></p>
<p><strong>Demand by Borough:</strong></p>
<pre><code class="lang-python">borough_counts = df[<span class="hljs-string">"pickup_borough"</span>].value_counts().drop(<span class="hljs-string">'N/A'</span>)

plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">5</span>))
borough_counts.plot(kind=<span class="hljs-string">"bar"</span>, color=<span class="hljs-string">"orange"</span>)
plt.xlabel(<span class="hljs-string">"Borough"</span>)
plt.ylabel(<span class="hljs-string">"Number of Trips"</span>)
plt.title(<span class="hljs-string">"NYC Taxi Demand by Borough"</span>)
plt.xticks(rotation=<span class="hljs-number">45</span>)
plt.show()
</code></pre>
<p>This bar chart reveals which boroughs are most active for pickups. Both analyses provide valuable context when building time-based or location-based features.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748209117171/1ac7b853-13c8-4880-a122-1681179cc540.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-next-steps-feature-engineering-amp-ml">Next Steps: Feature Engineering &amp; ML</h2>
<h3 id="heading-modeling-possibilities">Modeling Possibilities</h3>
<p>With zone metadata and demand patterns now accessible, we can engineer features like:</p>
<ul>
<li><p><code>pickup_hour</code>, <code>pickup_zone</code>, <code>dropoff_zone</code> (categorical features)</p>
</li>
<li><p>Trip distance (from raw data)</p>
</li>
<li><p>Average tip by zone or hour (aggregated features)</p>
</li>
<li><p>Popularity rank of pickup zones</p>
</li>
</ul>
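<p>As a sketch of one aggregated feature, assuming a pandas DataFrame shaped like the merged table from Step 3 (the new column name is illustrative):</p>
<pre><code class="lang-python">import pandas as pd

def add_avg_tip_feature(df):
    """Attach the mean tip per (pickup zone, hour) as a new column."""
    df = df.copy()
    df["pickup_hour"] = df["tpep_pickup_datetime"].dt.hour
    df["avg_tip_zone_hour"] = (
        df.groupby(["pickup_zone", "pickup_hour"])["tip_amount"].transform("mean")
    )
    return df
</code></pre>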
<p>These features could feed into models predicting:</p>
<ul>
<li><p>Fare amount (regression)</p>
</li>
<li><p>Trip duration (regression)</p>
</li>
<li><p>Demand forecasting (time series or classification)</p>
</li>
</ul>
<p>Stay tuned for the next post where we'll start visualizing the data with Kepler.gl.</p>
]]></content:encoded></item></channel></rss>