Web Scraping Policy

Web scraping is a contentious issue in research. Fair use permits many uses of data gleaned from the Internet, but it is generally applied to information gathered by people, not to programmatic machine scraping. That distinction makes brute-force scraping an issue separate from fair use.

When you scrape, you are using not just the source’s data but also its servers and bandwidth, possibly in a way the source does not approve of. This can lead to IP blacklisting and even legal action. Please tread carefully, as your actions could negatively affect others.

ⓘ If in doubt, or if you need more authoritative guidance, please contact the Harvard Office of the Vice Provost for Research.

Scraping on the FASRC Cluster

If your research requires you to scrape content from the web using the FASRC cluster, please review the following requirements and guidelines. They apply to data you acquire using the cluster, not to data you bring to the cluster to process. It’s also good practice to let us know before you begin any scraping process.

Permission

Data on the Internet should not be programmatically (or ‘brute-force’) scraped using the cluster, even for academic research purposes, unless:

A) The source provides an API for this purpose and any requirements they impose have been met.

B) The source explicitly permits programmatic scraping in their terms of service or other public notice.

C) The source is the United States government and the data in question was generated with public funds and is publicly available without encumbrance. Even then, the site must not be scraped using brute-force means if an API is provided.

D) The source has given you explicit permission in writing or via a secondary document spelling out that permission.

Data cannot be programmatically scraped if the source has explicitly forbidden scraping in their terms of service and written permission to do so cannot be obtained. In that case, you should investigate other options for acquiring this or similar data.

Throttling and Blacklisting

Avoid scraping content from websites using highly parallelized processes, even with unfettered permission from the source. Doing so risks having the cluster’s, or even the university’s, IP range blacklisted, which could disrupt other network and cluster users. Please ensure your processes pull data at a reasonable rate unless you have explicit written approval from the data source to download more aggressively, along with assurance that doing so will not lead to blacklisting by them or their upstream provider.
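As an illustration of pulling data at a reasonable rate, the following is a minimal Python sketch that fetches pages one at a time with a fixed pause between requests. It is not an official FASRC tool. The names Throttle and fetch_all are hypothetical, and the two-second interval in the usage note is an arbitrary assumption, not a rate endorsed by this policy or by any data source. Follow whatever limits the source itself publishes.

```python
import time
import urllib.request


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable so the delay logic can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until min_interval_s has elapsed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def _default_fetch(url):
    """Download one URL with the standard library and return its bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_all(urls, throttle, fetch=_default_fetch):
    """Fetch URLs sequentially (no parallelism), pausing before each request."""
    results = {}
    for url in urls:
        throttle.wait()
        results[url] = fetch(url)
    return results
```

For example, fetch_all(urls, Throttle(2.0)) would issue at most one request every two seconds from a single process. Running many such jobs in parallel would defeat the purpose, since the source sees the combined request rate.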

Related:

US Data.gov Data Harvesting Information

Archive.org Archive Scraping

 

© The President and Fellows of Harvard College
Except where otherwise noted, this content is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.