Site Chart: A Concurrent Web Crawler in Go
Problem: Real-world web crawling faces several hurdles: infinite loops from malformed URLs, memory leaks from unmanaged goroutines, and outright blocking when aggressive, "impolite" request rates overwhelm target servers.
Action: Developed Site Chart, a CLI tool built around a semaphore-based worker pool. Implemented a custom ticker-based throttler to cap requests per second (RPS) and a URL normalization engine that keeps the crawl stable and the fetching responsible.
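A minimal sketch of that concurrency model, assuming a buffered channel as the counting semaphore and a `time.Ticker` as the throttle; `crawl`, `maxWorkers`, and `rps` are illustrative names rather than Site Chart's actual API:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// crawl fetches each URL with bounded concurrency (a buffered channel
// acting as a counting semaphore) and a global requests-per-second cap
// (a ticker). rps must be > 0.
func crawl(urls []string, maxWorkers, rps int) {
	sem := make(chan struct{}, maxWorkers)
	ticker := time.NewTicker(time.Second / time.Duration(rps))
	defer ticker.Stop()

	var wg sync.WaitGroup
	for _, u := range urls {
		<-ticker.C        // pace dispatch: at most rps requests started per second
		sem <- struct{}{} // acquire a worker slot; blocks while the pool is full
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			resp, err := http.Get(u)
			if err != nil {
				fmt.Printf("error  %s: %v\n", u, err)
				return
			}
			resp.Body.Close()
			fmt.Printf("%d  %s\n", resp.StatusCode, u)
		}(u)
	}
	wg.Wait()
}

func main() {
	crawl([]string{"https://example.com/"}, 4, 2)
}
```

Taking a tick before dispatching each goroutine caps the global request rate even when many worker slots are free.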
Result: Site Chart - a robust, high-speed crawler and link validator featuring:
- Throttled Concurrency: Precisely balances high-speed link validation with server-side rate-limiting requirements.
- Normalization Engine: Custom package that resolves relative paths and strips fragments, preventing duplicate crawls and infinite loops (first sketch after this list).
- Hermetic Testing: Used Go's `httptest` to simulate 404 errors, timeouts, and redirects in a controlled, offline environment (second sketch below).
- 12-Factor Configuration: Fully configurable via YAML, environment variables, or CLI flags through the Viper/Cobra ecosystem (third sketch below).
- Production Safety: Achieved 90%+ test coverage in the parsing engine and ran the full suite under Go's race detector (`go test -race`) with zero reported data races.
- Unix Integration: Optimized for DevOps pipelines with support for Unix piping and structured YAML output (final sketch below).
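The normalization engine's job is mapping equivalent URLs to one canonical key so the frontier never revisits a page. A minimal sketch built on the standard `net/url` package; the exact rules Site Chart applies may differ:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalize resolves href against base, drops the fragment, and
// lowercases scheme and host so equivalent URLs map to one key.
func normalize(base *url.URL, href string) (string, error) {
	u, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	abs := base.ResolveReference(u) // handles relative paths like "../about"
	abs.Fragment = ""               // "#section" links point at the same page
	abs.Scheme = strings.ToLower(abs.Scheme)
	abs.Host = strings.ToLower(abs.Host)
	return abs.String(), nil
}

func main() {
	base, _ := url.Parse("https://example.com/docs/")
	for _, href := range []string{"../about", "guide#intro", "guide"} {
		n, _ := normalize(base, href)
		fmt.Println(n)
	}
}
```

With base `https://example.com/docs/`, both `guide` and `guide#intro` normalize to `https://example.com/docs/guide`, so fragment links no longer trigger duplicate crawls.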
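Hermetic testing with `httptest` stands up a real HTTP server in-process, so broken links and redirects can be reproduced without touching the network. A sketch assuming the crawler's fetch logic can be pointed at an arbitrary base URL:

```go
package crawler_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// An in-process server stands in for the live site; in the real suite
// the crawler's fetch function would be pointed at srv.URL.
func TestStatusHandling(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Path {
		case "/missing":
			http.NotFound(w, r) // simulate a broken link
		case "/moved":
			http.Redirect(w, r, "/", http.StatusMovedPermanently)
		default:
			w.Write([]byte("ok"))
		}
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/missing")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNotFound {
		t.Fatalf("want 404, got %d", resp.StatusCode)
	}
}
```

The same suite runs unchanged under the race detector via `go test -race ./...`.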
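Layered 12-factor configuration in the Cobra/Viper ecosystem usually means binding each CLI flag into Viper, so one key resolves from a flag, an environment variable, or a YAML file, with an explicitly set flag winning over env and file values. A sketch with illustrative names (`sitechart`, `SITECHART_RPS`, `config.yaml`), not the tool's real surface:

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
	"github.com/spf13/viper"
)

func main() {
	cmd := &cobra.Command{
		Use: "sitechart [url]",
		RunE: func(cmd *cobra.Command, args []string) error {
			fmt.Println("rps =", viper.GetInt("rps")) // resolved from flag, env, or file
			return nil
		},
	}
	cmd.Flags().Int("rps", 5, "maximum requests per second")
	viper.BindPFlag("rps", cmd.Flags().Lookup("rps")) // flag overrides env/file when set

	viper.SetEnvPrefix("sitechart") // e.g. SITECHART_RPS=10 sitechart ...
	viper.AutomaticEnv()

	viper.SetConfigName("config") // ./config.yaml, e.g. "rps: 8"
	viper.SetConfigType("yaml")
	viper.AddConfigPath(".")
	_ = viper.ReadInConfig() // the config file is optional

	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```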
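Structured YAML on stdout is what lets the tool compose with standard Unix tooling. A sketch using `gopkg.in/yaml.v3` and a hypothetical `Result` record; the real output schema may differ:

```go
package main

import (
	"log"
	"os"

	"gopkg.in/yaml.v3"
)

// Result is a hypothetical per-link record emitted by the crawler.
type Result struct {
	URL    string `yaml:"url"`
	Status int    `yaml:"status"`
}

func main() {
	results := []Result{
		{URL: "https://example.com/", Status: 200},
		{URL: "https://example.com/old", Status: 404},
	}
	enc := yaml.NewEncoder(os.Stdout) // structured output for downstream tools
	if err := enc.Encode(results); err != nil {
		log.Fatal(err)
	}
	enc.Close()
}
```

A pipeline consumer can then filter the stream, e.g. `sitechart https://example.com | grep -B1 'status: 404'`.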