Non-standard R&D project: an automated solution for a CAPTCHA-protected service
Challenge
The client came to us needing to scrape public data from an online resource to enhance their marketing activities.
The list of individuals whose data should be enriched is conveniently organized in a Google spreadsheet. The target service checks for "human" traits with a CAPTCHA every time it is accessed. For such websites, collecting the data manually is not nearly as challenging as reaching it with a bot.
The typical scraping problems became the main challenges of the project:
- Region-specific restrictions and VPN blocking: the website is available only from within the US and Canada
- Bot detection by:
- Rate limiting per IP
- CAPTCHA protection (hCaptcha)
- Request headers analysis
- Residential IPs only
- Cost-effectiveness
Jason Reavis
Chief Operating Officer / Partner at J.D. Hawkins & Associates
Implex is a great partner. The program was well-built and does not require continual maintenance. They did a great job managing the project and providing a summary of each meeting. They managed their tasks well, delivered work on time, and promptly responded to our needs. Their professionalism and ability to deliver on their promises stood out.
Solution
There are several techniques and external services that allow you to tackle each of these problems. We used two CAPTCHA-solving services, which helped our traffic be identified as human so we could get the data. The data was then put into a spreadsheet in a usable format.
The customer fills the data into a Google spreadsheet in the agreed format. The scraping process is as easy as clicking one button inside the spreadsheet. Technically, the button triggers AWS Lambda through Apps Script; Lambda schedules an ECS task which scrapes the data and emails the result to everyone listed in the input spreadsheet. Currently there is no database connected.
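The Lambda side of that flow can be sketched as follows. This is a simplified illustration, not the project's actual code: the cluster, task definition, subnet, and container names are hypothetical placeholders, and the real handler would also validate its input:

```python
import json

# Illustrative placeholders, not the project's real AWS resources.
ECS_CLUSTER = "scraper-cluster"
TASK_DEFINITION = "scraper-task"
SUBNETS = ["subnet-0example"]

def build_run_task_params(sheet_id: str, recipients: list[str]) -> dict:
    """Parameters for ecs.run_task: which task to start, and how the
    input-spreadsheet id and notification list reach the container."""
    return {
        "cluster": ECS_CLUSTER,
        "taskDefinition": TASK_DEFINITION,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": SUBNETS, "assignPublicIp": "ENABLED"}
        },
        "overrides": {
            "containerOverrides": [{
                "name": "scraper",
                "environment": [
                    {"name": "SHEET_ID", "value": sheet_id},
                    {"name": "RECIPIENTS", "value": ",".join(recipients)},
                ],
            }]
        },
    }

def lambda_handler(event, context):
    """Entry point invoked from the spreadsheet button via Apps Script."""
    import boto3  # imported lazily so the module loads without the AWS SDK

    body = json.loads(event.get("body", "{}"))
    params = build_run_task_params(body["sheet_id"], body.get("recipients", []))
    boto3.client("ecs").run_task(**params)
    return {"statusCode": 202, "body": json.dumps({"status": "scrape scheduled"})}
```

Scheduling a long-running ECS task from Lambda keeps the Lambda invocation short while the scrape itself can run for as long as it needs.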
Because CAPTCHA-solving solutions are quite slow and unstable, we paid particular attention to redundancy and fallback mechanisms, which significantly improved the final success rate of the scraping.
The tools and technologies we used included Google Sheets with Apps Script, AWS Lambda, Amazon ECS, residential proxies, and two third-party CAPTCHA-solving services.
Results
We automated the process of searching for specific individuals on a website.
As a result, Implex has built a custom data scraping solution for a healthcare consulting company.
The solution supports the client's marketing efforts: the data enrichment process, previously done manually, was accelerated by tens of times.
The project addressed the site's security measures against web scrapers and bots. Web scraping improved data collection while reducing the manual workload.
Key results:
- Automation of manual work, saving each employee about 25 hours per month, i.e. about 125 hours/month for a team of 5
- The ability for the entire team to use the tool collectively
- Avoided unnecessary expenses and created a system compatible with the client's preferred tools, such as Google Sheets
- A self-managed system that operates without supervision, reports errors in plain language the client understands, and lets them fix those errors themselves
- Cost-effectiveness: the system paid for itself within 3-4 months