Web Scraping Policy

Web scraping is a contentious issue within research. While it is true that fair use provides for many uses of data gleaned from the Internet, in general this is applied to human information gathering, not programmatic machine scraping. That distinction makes the act of brute-force scraping an issue separate from fair use.

As a representative of Harvard, you are using not just the source’s data but also its servers, bandwidth, and other resources, possibly in a way the source does not approve of. This can lead to IP blacklisting and even legal action. Please tread carefully, as your actions could negatively affect others.

If in doubt or in need of more authoritative guidance, please contact the Harvard Office of the General Counsel or the Office of the Vice Provost for Research.

If you are scraping for the purpose of training a generative AI (GAI) model, contact the Harvard Office of the General Counsel or the Office of the Vice Provost for Research.

Please be aware that merely being involved in academic pursuits does not exempt you from the usage policies of social media and other Internet platforms like Facebook, Twitter, etc.

Sensitive Data

If the data you are acquiring is considered sensitive or confidential, or contains human data, you will need to have it reviewed for compliance before placing it on the FASRC cluster. If in doubt, err on the side of caution and contact the Office of the Vice Provost for Research.

Scraping data for use on the FASRC Cluster

If your research requires you to scrape content from the web, please review the following guidelines and suggestions.

We highly discourage using the cluster itself to scrape data. Due to its size and ease of parallelization of processes, the cluster is easily weaponized and your actions could have consequences for other researchers. Please seek another avenue for data acquisition first.

You should contact FASRC before commencing any scraping activity using the FASRC cluster.

It is highly preferable that you do the scraping elsewhere and then bring the data to the FASRC cluster for processing. If the data is sensitive, confidential, or contains human data, or if its status is unclear, then this is a requirement. See ‘Sensitive Data’ above.

Also, if you are scraping for the purpose of training a GAI model or LLM, you should respect the site’s policies on this practice (these may be posted on the site, contained in a robots.txt file, or explicitly stated in its ToS). Even if you are doing the scraping manually, you should consider yourself the same as a bot: if a site excludes GAI/AI bots, that exclusion also applies to you. Merely being an academic does not exempt you from following the wishes of a site and/or its members; your exfiltrated data could end up in other models, thereby nullifying the source’s right to exclusivity/ownership. Please contact the Harvard Office of the General Counsel or the Office of the Vice Provost for Research for further guidance.
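As an illustration of the robots.txt part of that check, here is a minimal sketch using Python’s standard urllib.robotparser module. The robots.txt content, the user-agent names (ExampleAIBot, my-research-crawler), and the example.org URLs are all hypothetical and are inlined so the sketch runs without touching any real site. It is not a complete compliance check: a site’s ToS or other posted notices may impose restrictions that robots.txt does not express.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Both agent names are made up for illustration.
    for agent in ("ExampleAIBot", "my-research-crawler"):
        allowed = is_allowed(SAMPLE_ROBOTS_TXT, agent, "https://example.org/articles/1")
        print(f"{agent}: allowed={allowed}")
```

Against a live site, one would typically call the parser’s set_url() and read() methods to load the site’s own robots.txt instead of parsing an inline string. A result of "allowed" here only means robots.txt does not object; it does not replace reading the ToS or obtaining permission.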

Source Permission

If you are in doubt or have questions, please contact the Harvard Office of the Vice Provost for Research.

Data on the Internet should not be programmatically (or ‘brute-force’) scraped using FASRC computing resources, even for academic research purposes, unless FASRC has given permission to proceed using the cluster or some system tied to the cluster, and:

A) The source provides an API for this purpose and any requirements they impose have been met.

B) The source allows/does not prohibit scraping in their terms of service or other public notice.

C) The source is the United States government and the data in question was generated with public funds and is publicly available without encumbrance. Even then, the site must not be scraped by brute-force means if an API is provided.

D) The source has given you explicit permission in writing or via a secondary document spelling out that permission.

E) The source does not exclude/forbid your use-case, such as GAI or LLM training.

Data cannot be programmatically scraped using FASRC computing resources if the source has explicitly forbidden scraping in their terms of service and written permission to do so cannot be obtained. In such a case, you should investigate other options for acquiring this or similar data.

Throttling and Blacklisting

Scraping content from websites using highly parallelized processes, even with unfettered permission from the source, should be avoided. Doing so risks having the cluster’s, or even the university’s, IP range blacklisted, which could have an undesirable effect on other network and cluster users. Please ensure your processes pull data at a reasonable rate unless you have explicit written approval from the data source to download more aggressively, along with assurance that this will not lead to blacklisting by them or their upstream provider.
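One way to keep to a reasonable rate is to fetch pages one at a time with a fixed minimum gap between requests. The sketch below shows that idea; it is an assumption-laden illustration, not an FASRC-endorsed tool. The function names (fetch_politely, fetch_with_urllib) are hypothetical, and the 5-second default is an arbitrary placeholder, since this policy does not define a numeric rate. Use whatever limit the source documents or has agreed to in writing.

```python
import time
import urllib.request
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    min_interval_s: float = 5.0,  # placeholder value, not a policy figure
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Tuple[str, T]]:
    """Fetch URLs sequentially, starting requests at least min_interval_s apart."""
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever remains of the minimum interval.
            remaining = min_interval_s - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        yield url, fetch(url)


def fetch_with_urllib(url: str) -> bytes:
    """Example fetch function using only the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()
```

A caller would iterate over fetch_politely(urls, fetch_with_urllib) with a list of URLs the source has permitted. The loop is deliberately sequential: running many copies of it in parallel would defeat the purpose of the delay.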

Harvard Office of the Vice Provost for Research

US Data.gov Data Harvesting Information

Archive.org Scraping
