We are expanding our engineering team and looking for people who are excited about our non-profit, open data mission. Candidates should be proficient with Python, ideally with some Java as well, and proficient with cloud systems such as Spark/PySpark. Our current team is composed of engineers who do some data science and data scientists who do some engineering. We are focused on improving our crawl, building new data products, and using those products to improve the crawl further.
The Common Crawl Foundation maintains a 17-year-old, 8-petabyte crawl and archive of the web. Its open dataset has been cited in nearly 10,000 research papers and is the most-used dataset in the AWS Open Data program. The organization is also very active in the open source community.