We are expanding our engineering team. We're looking for someone who is excited about our non-profit, open data mission and proficient with Python, ideally with some Java as well. Proficiency with cloud systems such as Spark/PySpark is required. You should also be willing to learn the rest: crawling, parsing, indexing, etc.
The Common Crawl Foundation maintains a 17-year-old, 8-petabyte crawl and archive of the web. Its open dataset has been cited in nearly 10,000 research papers and is the most-used dataset in the AWS Open Data program. The organization is also very active in the open source community.