Web Scraping Policy
THIS DOCUMENT IS UNDER REVIEW (SEPT 2022)
Web scraping is a contentious issue within research. While it is true that fair use provides for many uses of data gleaned from the Internet, in general this is applied to human information gathering, not programmatic machine scraping. That distinction makes the act of brute-force scraping an issue separate from fair use.
You, as a representative of Harvard, are not just using the source’s data, but also their servers, bandwidth, etc., in a way the source may not approve. This can lead to IP blacklisting and even legal action, so please tread carefully: your actions could negatively affect others.
Please be aware that merely being an academic does not exempt you from the usage policies of social media and other Internet platforms like Facebook, Twitter, etc.
If the data you are acquiring is considered sensitive or confidential, or contains human data, you will need to have this data reviewed for compliance before placing it on the FASRC cluster. If in doubt, you should always err on the side of caution and contact the Office of the Vice Provost for Research.
- Research Data at Harvard
- Office of the General Counsel
- Harvard Office of the Vice Provost for Research
- HRDSP Applications Summary with Order of Reviews
- Harvard Research Data Security Policy site
Scraping data for use on the FASRC Cluster
If your research requires you to scrape content from the web, please review the following guidelines and suggestions.
We strongly discourage using the cluster itself to scrape data, though we recognize there will be a small number of exceptions. Given its size and how easily processes can be parallelized, the cluster can effectively be weaponized, and your actions could have consequences for other researchers. Please consider other avenues of acquisition first.
You should contact FASRC before commencing any scraping activity using the FASRC cluster.
It is highly preferable that you do the scraping elsewhere and then bring the data to the FASRC cluster for processing. If the data is sensitive or confidential, contains human data, or its status is unclear, this is a requirement. See ‘Sensitive Data’ above.
If you are in doubt or have questions, please contact the Harvard Office of the Vice Provost for Research.
Data on the Internet should not be programmatically (or ‘brute-force’) scraped using FASRC computing resources, even for academic research purposes, unless FASRC has given permission to proceed using the cluster or some system tied to the cluster, and at least one of the following holds:
A) The source provides an API for this purpose and any requirements they impose have been met.
B) The source allows/does not prohibit scraping in their terms of service or other public notice.
C) The source is the United States government and the data in question was generated with public funds and is publicly available without encumbrance. Further, the site must not be scraped by brute-force means if an API is provided.
D) The source has given you explicit permission in writing or via a secondary document spelling out that permission.
Data cannot be programmatically scraped using FASRC computing resources if the source has explicitly forbidden scraping in their terms of service and written permission to do so cannot be obtained. In such a case, you should investigate other options for acquiring this or similar data.
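Before writing to a source for permission, it is worth checking whether the site already states its scraping rules in a robots.txt file. The sketch below uses Python's standard-library robots.txt parser against a hypothetical robots.txt (the rules shown are invented for illustration, not from any real site):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Hypothetical robots.txt content; a real check would load the
# live file with rp.set_url(".../robots.txt") and rp.read().
rp.parse("""User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines())

# Paths under /private/ are disallowed for all user agents.
print(rp.can_fetch("*", "https://example.org/private/page"))  # False
print(rp.can_fetch("*", "https://example.org/public/page"))   # True
# The site asks crawlers to wait 10 seconds between requests.
print(rp.crawl_delay("*"))                                    # 10
```

Note that robots.txt is advisory, not a grant of permission: a site's terms of service still govern, and a Disallow rule is a clear signal to stop.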
Throttling and Blacklisting
Scraping content from websites using highly parallelized processes, even with unfettered permission from the source, should be avoided. Doing so runs the risk of having the cluster’s, or even the university’s, IP range blacklisted, which could have an undesirable effect on other network and cluster users. Please ensure your processes pull data at a reasonable rate unless you have explicit written approval from the data source to download more aggressively, along with assurance that this will not lead to blacklisting by them or their upstream provider.
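The advice above amounts to fetching sequentially with a pause between requests rather than in parallel. A minimal sketch, in which `fetch` and the delay value are placeholders you would adapt to your own tooling:

```python
import time

def polite_fetch(urls, fetch, delay_seconds=10.0):
    """Fetch each URL sequentially, sleeping between requests.

    `fetch` is a placeholder for whatever download call you use
    (e.g. requests.get); the pause keeps the request rate modest
    so the scrape does not resemble a denial-of-service attack.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # throttle: wait between requests
        results.append(fetch(url))
    return results

# Illustration with a stub fetch function and a short delay:
pages = polite_fetch(["page1", "page2"],
                     lambda u: f"<html>{u}</html>",
                     delay_seconds=0.1)
```

A single throttled process like this is far less likely to trigger blacklisting than many parallel jobs, which is exactly why scraping from the cluster's parallel environment is discouraged.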