Site Chart: A Concurrent Web Crawler in Go

Problem: Real-world web crawling faces complex hurdles: infinite loops from malformed URLs, memory leaks from unmanaged goroutines, and outright blocking by target servers when aggressive, impolite request rates overwhelm them.

Action: Developed Site Chart, a CLI tool built around a semaphore-based worker pool. Implemented a custom ticker-based throttler to cap requests per second (RPS) and a URL normalization engine, ensuring system stability and responsible data fetching.
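
These two mechanisms combine naturally in Go: a buffered channel serves as the counting semaphore and a time.Ticker meters out request slots. Below is a minimal sketch of that pattern, not Site Chart's actual implementation; crawl, maxWorkers, and the hard-coded RPS value are illustrative placeholders.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// crawl stands in for the real fetch-and-parse step (hypothetical).
func crawl(url string) {
	fmt.Println("fetching", url)
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}

	const maxWorkers = 2                        // illustrative pool size
	sem := make(chan struct{}, maxWorkers)      // buffered channel acts as a counting semaphore
	throttle := time.NewTicker(time.Second / 5) // ~5 requests per second
	defer throttle.Stop()

	var wg sync.WaitGroup
	for _, u := range urls {
		<-throttle.C      // wait for the next RPS slot
		sem <- struct{}{} // acquire a worker slot; blocks while the pool is saturated
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			crawl(url)
		}(u)
	}
	wg.Wait()
}
```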

Result: Site Chart is a robust, high-speed crawler and link validator featuring:

  • Throttled Concurrency: Balances high-speed link validation against server-side rate-limiting requirements (the pattern sketched above).
  • Normalization Engine: Custom package that resolves relative paths and strips fragments, preventing duplicate crawls and infinite loops (see the normalization sketch after this list).
  • Hermetic Testing: Used Go's httptest to simulate 404 errors, timeouts, and redirects in a controlled, offline environment (example after this list).
  • 12-Factor Configuration: Fully configurable via YAML files, environment variables, or CLI flags through the Viper/Cobra ecosystem (see the configuration sketch below).
  • Production Safety: Achieved 90%+ test coverage in the parsing engine and ran the full suite under Go's race detector with zero data races reported.
  • Unix Integration: Optimized for DevOps pipelines with support for Unix piping and structured YAML output.
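
The normalization engine's core job is mapping the many spellings of a link to one canonical key. A minimal sketch using the standard net/url package follows; the real package's rules (trailing slashes, query ordering, and the like) may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalize resolves a possibly-relative href against its base page and
// strips the fragment, so "/about", "about", and "about#team" all map to
// the same key. A sketch; the actual engine may apply more rules.
func normalize(base *url.URL, href string) (string, error) {
	u, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	resolved := base.ResolveReference(u)
	resolved.Fragment = ""                         // fragments never change the fetched document
	resolved.Host = strings.ToLower(resolved.Host) // hostnames are case-insensitive
	return resolved.String(), nil
}

func main() {
	base, _ := url.Parse("https://example.com/docs/")
	for _, href := range []string{"intro", "/about#team", "https://EXAMPLE.com/about"} {
		n, _ := normalize(base, href)
		fmt.Println(n)
	}
}
```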
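The hermetic-testing approach relies on httptest spinning up an in-process HTTP server, so error paths can be exercised without touching the network. The sketch below shows the shape of such a test, assuming a hypothetical test name; Site Chart's own tests may assert different behavior.

```go
package crawler_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// TestStatusHandling exercises HTTP error handling against a local,
// fully controlled server instead of a live site.
func TestStatusHandling(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Path {
		case "/missing":
			http.NotFound(w, r) // simulate a broken link
		case "/old":
			http.Redirect(w, r, "/new", http.StatusMovedPermanently)
		default:
			w.Write([]byte("<html><body>ok</body></html>"))
		}
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/missing")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNotFound {
		t.Fatalf("want 404, got %d", resp.StatusCode)
	}
}
```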
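The 12-factor layering (flag over environment variable over config file over default) is what Viper/Cobra provide out of the box. A minimal sketch follows; the "rps" flag name and SITECHART env prefix are illustrative, not necessarily the tool's actual configuration surface.

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
	"github.com/spf13/viper"
)

func main() {
	root := &cobra.Command{
		Use: "sitechart [url]",
		Run: func(cmd *cobra.Command, args []string) {
			// Viper resolves precedence: CLI flag > env var > config file > default.
			fmt.Println("rps:", viper.GetInt("rps"))
		},
	}

	// "rps" is an illustrative flag name, not necessarily Site Chart's real one.
	root.Flags().Int("rps", 10, "maximum requests per second")
	cobra.CheckErr(viper.BindPFlag("rps", root.Flags().Lookup("rps")))

	viper.SetEnvPrefix("SITECHART") // e.g. SITECHART_RPS=5
	viper.AutomaticEnv()

	viper.SetConfigName("config") // looks for config.yaml in the working directory
	viper.SetConfigType("yaml")
	viper.AddConfigPath(".")
	_ = viper.ReadInConfig() // a missing config file is fine; flags/env still apply

	cobra.CheckErr(root.Execute())
}
```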
