FindDataLab

Legal Web Scraping for Legal Purposes

6 Steps to Healthy Web Scraping
Web scraping is a fast and easy way to extract data from the web. How does it work? It is an automated process that uses a bot or a web crawler to access pages over the HTTP protocol or through a web browser. The target data is stored in a central local database or a spreadsheet and is later retrieved for analysis. Web scraping is a tool that can be applied to many different business processes. Through web scraping, you can easily gather information for brand monitoring and market research (such as visitor statistics, product details, and customers' email addresses). Web scraping is also actively used to train artificial intelligence models and to collect information for scientific research.
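As a minimal sketch of the process described above, the snippet below parses a product page with Python's standard-library html.parser and collects product names into a list. The page content here is a hypothetical inlined string; in a real pipeline the HTML would come from an HTTP request, and the extracted records would go into a database or spreadsheet.

```python
from html.parser import HTMLParser

# Hypothetical product page; in practice this HTML would be fetched over HTTP.
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Red Mug</span><span class="price">$4.99</span></div>
  <div class="product"><span class="name">Blue Mug</span><span class="price">$5.49</span></div>
</body></html>
"""

class ProductParser(HTMLParser):
    """Collects the text of every <span class="name"> element."""
    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "name") in attrs:
            self.in_name = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_name = False

    def handle_data(self, data):
        if self.in_name:
            self.names.append(data.strip())

parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.names)  # the "target data" ready for storage or analysis
```

Third-party libraries such as Beautiful Soup or Scrapy do the same job with far less boilerplate, but the standard library is enough to show the idea.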

Is web scraping legal? This question is raised by corporations, programmers, lawyers, and users of aggregated data, not only at public and private meetings but also in court. Scraping data from the web does indeed have ethical, legal, and technical limitations. For instance, extracting and publishing personal data without the written consent of its owners will quickly lead to sanctions. However, web scraping is legal when it is done for legal purposes. We have compiled a six-step guide to healthy web scraping for you.

Step 1. Define the legal purpose of data collection
Step 2. Make sure that you want to get publicly available information
Step 3. Check copyrights
Step 4. Set optimal web scraping rate
Step 5. Direct your web scrapers along a path similar to a search engine
Step 6. Identify your web scrapers

Step 1. Define the legal purpose of data collection

Before you start web scraping, you should decide what information you want to receive, from which sites, and in what format. The next step is to decide how you plan to use the collected data. Two important questions should be answered here: 1. Are you going to publish the data, or use it only for yourself or your company's needs? 2. Could the information extracted from websites cause any damage to the data owners, whether reputational, financial, or criminal?

If the data is extracted for your personal use and analysis, then it is legal and ethical. But if you are going to use it as your own content and publish it on your website without any attribution to the original data owners, then it runs completely against the interests of the data creators, and it is neither ethical nor legal. So, if you plan to publish the scraped data, you should make a download request to the data owners or do some background research on the website's policies, as well as on the data you are going to scrape. Remember that scraping information about individuals without their knowledge could infringe personal data protection laws. Moreover, damaging data or using a computer to gain access to data without proper authorization is a criminal offense.



Step 2. Make sure that you want to get publicly available information

Although most of the data published by websites is intended for public consumption and is legal to copy, you had better double-check the site's policies for your own safety. You can legally use web scraping to access and acquire public, authorized data. Make sure that the pages you need do not contain personal data. Web scraping can generally be done without asking the data owner's permission, as long as it does not violate the website's terms of service. Each website has Terms of Service (ToS); you can easily find that document at the bottom of the home page and check that there is no direct prohibition on scraping. If a website states in its ToS that data collection is not allowed, you risk being fined for web scraping, because it is done without the owner's permission. Also be prepared for the fact that some information on the sites you need may be secured behind digital obstacles (a username, password, or access code); you cannot collect that data either.

And of course, you may scrape your own website without any doubts.

Step 3. Check copyrights

Besides the ToS, all websites have copyright details, which web scraping users should respect as well. Before copying any content, make sure that the information you are about to extract is not copyrighted, including the rights to text, images, databases, and trademarks. Avoid republishing scraped data or any dataset without verifying the data license, or without written permission from the copyright holder. If some data is not allowed to be used for commercial purposes under copyright law, you should steer clear of it. However, if the scraped data is a creative work, then usually only the way or format in which it is presented is copyrighted. So if you scrape 'facts' from the work, modify them, and present them in an original way, that is legal.



Step 4. Set optimal web scraping rate

Let's turn to the technical limitations of healthy web scraping. Data scrapers can put heavy loads on a website's servers by requesting data far more often than a human would. You should take care to keep the scraping rate optimal and avoid affecting the performance and bandwidth of the web server in any way, because if that happens, most web servers will simply block your IP automatically, preventing further access to their pages. Respect the crawl-delay setting provided in robots.txt; if there is none, use a conservative scraping rate, for example, one request per 10-15 seconds.
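That rate-limiting policy can be sketched with Python's standard urllib.robotparser, which understands the Crawl-delay directive. The bot name and robots.txt content below are hypothetical; normally you would fetch robots.txt from the target site and sleep for the returned number of seconds between requests.

```python
import urllib.robotparser

DEFAULT_DELAY = 15.0  # conservative fallback: one request per 15 seconds

def polite_delay(robots_txt: str, user_agent: str) -> float:
    """Seconds to wait between requests: honour the site's Crawl-delay
    directive if present, otherwise fall back to a conservative default."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY

# Hypothetical robots.txt; normally fetched from https://example.com/robots.txt
ROBOTS = """User-agent: *
Crawl-delay: 10
"""

print(polite_delay(ROBOTS, "mybot"))  # delay taken from robots.txt: 10.0
print(polite_delay("", "mybot"))      # no directive -> conservative 15.0
```

In the scraping loop itself, a call to time.sleep(polite_delay(...)) after each request keeps the load on the server close to what a human visitor would generate.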



Step 5. Direct your web scrapers along a path similar to a search engine

One more important aspect of a healthy web scraping process is the way you reach the site and search for the information you need. Well-known coders and US lawyers recommend using scrapers that access website data as a visitor would, following paths similar to a search engine's. Moreover, this can be done without registering as a user or explicitly accepting any terms. So a legal scraper may scan and copy any public information that is available to a regular user, but it cannot, for example, damage the site's code, break through secured digital obstacles, or interfere with normal website operation in any way.
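A search-engine-style traversal can be illustrated as a breadth-first walk that starts at the public home page and follows links, visiting each page once. The "site" below is a hypothetical in-memory link graph; a real crawler would fetch each page over HTTP and extract the links from its HTML.

```python
from collections import deque

# Hypothetical site: each public page maps to the links it contains.
SITE = {
    "/": ["/products", "/about"],
    "/products": ["/products/mug", "/about"],
    "/about": [],
    "/products/mug": ["/"],
}

def crawl(start: str) -> list:
    """Breadth-first traversal from the home page, visiting each page once,
    the way a search engine discovers publicly linked content."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)          # a real crawler would fetch and parse here
        for link in SITE.get(page, []):
            if link not in seen:    # never revisit a page
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))  # ['/', '/products', '/about', '/products/mug']
```

Note what the crawler never does: it reaches only pages that are publicly linked, exactly like a regular visitor clicking through the site, and it touches nothing behind a login or other digital obstacle.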



Step 6. Identify your web scrapers

Website administrators have to understand what is going on if you do not want to be blocked. So be respectful and identify your web scraper with a legitimate user agent string. Create a page that explains what you are doing and why, point out your organization's name (if you are scraping for one), and add a link back to that page in your user agent string as well. Legitimate bots abide by a site's robots.txt file, which lists the pages a bot is permitted to access and those it cannot. If the ToS or robots.txt prevents you from scraping, you should ask the site owner for written permission before doing anything else.
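Under those guidelines, a scraper might announce itself with a descriptive user agent string and check robots.txt before every fetch. Everything below (the bot name, the contact URL, and the robots.txt content) is hypothetical; urllib.robotparser from the Python standard library does the rule matching.

```python
import urllib.robotparser

# A descriptive, hypothetical user agent: bot name, version, and a link to a
# page explaining who you are and why you are scraping.
USER_AGENT = "ExampleResearchBot/1.0 (+https://example.org/bot-info)"

# Hypothetical robots.txt; normally fetched from the target site.
ROBOTS = """User-agent: *
Disallow: /private/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

def allowed(url: str) -> bool:
    """A legitimate bot consults robots.txt and skips disallowed pages."""
    return rp.can_fetch(USER_AGENT, url)

print(allowed("https://example.com/products"))   # public page -> True
print(allowed("https://example.com/private/x"))  # disallowed  -> False
```

When making the actual requests, the same USER_AGENT string would be sent in the User-Agent HTTP header, so administrators can see who is crawling and reach the explanation page if they have questions.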


Web scraping is a valuable and cheap tool for businesses in the globally competitive market. However, web scraping should be done with respect and responsibility toward data owners and site administrators. By following our six-step guide to healthy web scraping, you can avoid many problems and protect yourself.
But if you are uncertain about the legality of your scraping, or you have no opportunity to prepare for it properly, do not be afraid to contact professional providers of web scraping services. These companies are always careful not to break any rules or laws, because in the case of a violation, it is the data scraping company that will suffer the consequences. Furthermore, as experienced companies, they are more aware of what is right and wrong and will tell you whether web crawling is legal for the data you want. And, significantly, such providers usually have the technical resources for fast and legal data collection.

October 1, 2019