Step 1. Define the legal purpose of data collection
Before you start web scraping, decide what information you want to collect, from which sites, and in what format. Next, decide how you plan to use the collected data. Two important questions should be answered here: 1. Are you going to publish the data, or use it only for yourself or your company's needs? 2. Could extracting information from websites cause any damage to the data owners, whether reputational, financial, or legal?
If the data is extracted for your personal use and analysis, it is generally legal and ethical. But if you publish it on your website as your own content, without any attribution to the original data owners, that works directly against the interests of the data creators and is neither ethical nor legal. So, if you plan to publish the scraped data, request permission from the data owners, or do some background research into the website's policies and the data you are going to scrape. Remember that scraping information about individuals without their knowledge could infringe personal data protection laws. Moreover, damaging data, or using a computer to gain access to data without proper authorization, is a criminal offense.
Step 2. Make sure that the information you want is publicly available
Although the data most websites publish is intended for public consumption and is legal to copy, you should double-check the site's policies for your own safety. You can legally use web scraping to access and acquire public, authorized data. Make sure that the information on the sites you need does not contain personal data. Web scraping can generally be done without asking the data owner's permission, as long as it does not violate the website's terms of service. Each website has Terms of Service (ToS); you can usually find the document linked at the bottom of the home page, and you should check that it contains no direct prohibition on scraping. If a website's ToS states that data collection is not allowed, you risk being fined for web scraping, because it is done without the owner's permission. Also be aware that some information on the sites you need may be secured behind digital barriers (a username, password, or access code); you cannot collect that data either.
And of course, you may scrape your own website without any doubts.
Step 3. Check copyrights
Besides the ToS, all websites have copyright details, which web scrapers should respect as well. Before copying any content, make sure that the information you are about to extract is not copyrighted, including the rights to text, images, databases, and trademarks. Avoid republishing scraped data, or any dataset, without verifying the data license or obtaining written permission from the copyright holder. If certain data may not be used for commercial purposes under copyright law, steer clear of it. However, if the scraped data is a creative work, then usually only the way or format in which it is presented is copyrighted. So, if you scrape the 'facts' from the work, modify them, and present them in an original way, that is legal.
Step 4. Set an optimal web scraping rate
Now let's turn to the technical limits of healthy web scraping. Data scrapers can put a heavy load on a website's servers by requesting data far more often than a human would. You should keep your scraping rate at an optimal level and avoid affecting the performance and bandwidth of the web server in any way; otherwise, most web servers will simply block your IP automatically, preventing further access to their pages. Respect the crawl-delay setting provided in robots.txt; if there is none, use a conservative scraping rate, for example one request per 10-15 seconds.
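The crawl-delay logic above can be sketched with Python's standard library. This is a minimal example, not a complete scraper: the bot name and the robots.txt content are hypothetical, and in practice you would fetch robots.txt from the target site itself.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content, for illustration only. In practice,
# download it from https://<target-site>/robots.txt before scraping.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

USER_AGENT = "example-scraper"  # hypothetical bot name

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Honor the site's Crawl-delay directive if present; otherwise fall
# back to a conservative default of one request every 15 seconds.
delay = parser.crawl_delay(USER_AGENT) or 15

def polite_fetch(url: str) -> None:
    """Placeholder for a real fetch: waits before every request so the
    scraper never exceeds the allowed request rate."""
    time.sleep(delay)
    # ... perform the actual HTTP request here ...
```

Here `delay` ends up as 10 seconds, taken from the `Crawl-delay` line; if the site omitted that directive, the conservative 15-second fallback would apply.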
Step 5. Direct your web scrapers along a path similar to a search engine's
One more important aspect of a healthy web scraping process is how you reach the site and search for the information you need. Well-known coders and US lawyers recommend using scrapers that access website data as a visitor would, following paths similar to a search engine's. Better still, this can be done without registering as a user or explicitly accepting any terms. A legal scraper may scan and copy any public information available to a regular user, but it may not, for example, damage the site's code, break through secured digital barriers, or interfere with normal website operation in any way.
Step 6. Identify your web scrapers
Website administrators need to understand what is going on if you do not want to be blocked. So be respectful and identify your web scraper with a legitimate user agent string. Create a page that explains what you are doing and why, state your organization's name (if you are scraping for one), and include a link back to that page in your user agent string. Legitimate bots abide by a site's robots.txt file, which lists the pages a bot is permitted to access and those it is not. If the ToS or robots.txt prevents you from scraping, ask the site owner for written permission before doing anything else.
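Both pieces of advice above can be combined in a short sketch: identify the bot with a descriptive user agent string (the bot name and info URL below are hypothetical) and check robots.txt before every fetch.

```python
import urllib.request
import urllib.robotparser

# Hypothetical identifying user agent: bot name, version, and a link
# to a page explaining who you are and why you are scraping.
USER_AGENT = "ExampleScraperBot/1.0 (+https://example.com/bot-info)"

# Hypothetical robots.txt; fetch the real one from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def fetch_if_allowed(url: str):
    """Fetch a page only if robots.txt permits it, always identifying
    the bot with its user agent string."""
    if not parser.can_fetch(USER_AGENT, url):
        # Disallowed: ask the site owner for written permission instead.
        return None
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    return urllib.request.urlopen(req)
```

With this robots.txt, `can_fetch` rejects anything under `/private/` while allowing the rest of the site, so the scraper stays within the boundaries the administrator published.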
Web scraping is a valuable and cheap tool for businesses in a globally competitive market. However, it should be done with respect and responsibility toward data owners and site administrators. By following our six-step guide to healthy web scraping, you can avoid many problems and protect yourself.