The Common Crawl Foundation has a 17-year-old, 8 petabyte crawl & archive of the web. Their open dataset has been cited in nearly 10,000 research papers and is the most-used dataset in the AWS Open Data program. The organization is also very active in the open source community.