Scraping solution for J.D.H.

Non-standard R&D project: automated solution for CAPTCHA service

Implex built custom software to scrape data from the internet in support of the client's marketing efforts. Web scraping is a technique for extracting data from websites. However, some websites are well protected, which makes scraping more difficult. This case study presents the solution we developed to overcome common problems encountered while scraping more demanding websites. These include blocking of IPs that belong to public clouds and VPNs, CAPTCHA requirements, rate limiting per IP address, and region-specific restrictions.
Industry: Recruitment and consulting
Region: USA
Client since: 2023
Our client operates as a consulting agency, specializing in healthcare consulting and recruitment. They offer a premium level of customer service, focusing on permanent placements for professionals such as Physicians, Dentists, Advanced Practitioners, and Executive Leadership roles.
8-12 MINUTES
needed to scrape a file with 100 records
SERVERLESS ARCHITECTURE
which keeps costs down
100% CCPA COMPLIANT
and able to target any city in the USA and Canada
Services and expertise

Challenge

The client came to us with a need to scrape public data from an online resource to enhance their marketing activities.

The list of individuals whose data should be enriched is kept in a Google spreadsheet file. The target service uses a CAPTCHA to check for "human" traits each time it is accessed. For such websites, getting the data manually is far less challenging than getting it with a bot.

The typical scraping problems became the main challenges of the project:

  • Region-specific restrictions and VPN blocking: the website is available only from within the US and Canada
  • Bot detection by:
    • Rate limiting per IP
    • CAPTCHA protection (hCaptcha type)
    • Request header analysis
    • Accepting residential IPs only
  • Cost-effectiveness
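
To illustrate the per-IP rate limiting challenge, here is a minimal sketch of how requests could be spread across several proxy IPs so that each one stays under a limit. It is not the code used in the project: the class name `PerProxyThrottle`, the proxy labels, and the 10-second interval are hypothetical, and the real limits of the target website are not stated in this case study.

```python
import time
from typing import Callable, Dict, List, Tuple


class PerProxyThrottle:
    """Spaces out requests so each proxy IP stays under a per-IP rate limit.

    Hypothetical helper for illustration only.
    """

    def __init__(
        self,
        proxies: List[str],
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._min_interval_s = min_interval_s
        self._clock = clock
        # Earliest time each proxy may be used again; 0.0 means "right away".
        self._ready_at: Dict[str, float] = {proxy: 0.0 for proxy in proxies}

    def acquire(self) -> Tuple[str, float]:
        """Pick the proxy that becomes available soonest.

        Returns the proxy and the number of seconds the caller should
        wait before sending a request through it.
        """
        now = self._clock()
        proxy = min(self._ready_at, key=self._ready_at.get)
        wait = max(0.0, self._ready_at[proxy] - now)
        # Reserve the slot: this proxy is free again one interval after use.
        self._ready_at[proxy] = now + wait + self._min_interval_s
        return proxy, wait


if __name__ == "__main__":
    # Assumed values: two residential proxies, one request per 10 s each.
    throttle = PerProxyThrottle(["proxy-a", "proxy-b"], min_interval_s=10.0)
    for _ in range(4):
        proxy, wait = throttle.acquire()
        print(f"use {proxy} after waiting {wait:.1f}s")
```

The caller sleeps for the returned wait time before sending the request. Adding more proxies raises total throughput without raising the request rate seen from any single IP.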
Jason Reavis, Chief Operating Officer / Partner at J.D. Hawkins & Associates

Implex is a great partner. The program was well-built and does not require continual maintenance. They did a great job managing the project and providing a summary of each meeting. They managed their tasks well, delivered work on time, and promptly responded to our needs. Their professionalism and ability to deliver on their promises stood out.

Solution

There are several techniques and external services that help tackle each of these problems. We used two CAPTCHA services, which helped our scraper be identified as human and get the data. The data was then written to a spreadsheet in a usable format.

The customer fills in the input data in a Google spreadsheet file, using the agreed format. Starting the scraping process is as easy as clicking one button inside the spreadsheet. Technically, the button triggers AWS Lambda through Apps Script. Lambda schedules an ECS task, which scrapes the data and sends an email with the result to everyone listed in the input spreadsheet. Currently there is no DB connected.
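
The Lambda step of this flow can be sketched as follows. This is a minimal illustration, not the project's code. It assumes the Apps Script button sends an HTTP request with a JSON body containing a `spreadsheetId` field, and that the ECS task runs on Fargate. The cluster, task definition, container, and subnet names are placeholders. The `run_task` call and its parameters are part of the boto3 ECS client.

```python
import json
from typing import Any, Dict, Optional

# Placeholder resource names (assumptions, not the project's real resources).
CLUSTER = "scraper-cluster"
TASK_DEFINITION = "scraper-task"
CONTAINER_NAME = "scraper"
SUBNETS = ["subnet-00000000"]


def build_run_task_request(spreadsheet_id: str) -> Dict[str, Any]:
    """Build the keyword arguments for the ECS RunTask call."""
    return {
        "cluster": CLUSTER,
        "taskDefinition": TASK_DEFINITION,
        "launchType": "FARGATE",  # assumption: the case study only says "serverless"
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": SUBNETS, "assignPublicIp": "ENABLED"}
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": CONTAINER_NAME,
                    # Tell the scraper container which spreadsheet to process.
                    "environment": [
                        {"name": "SPREADSHEET_ID", "value": spreadsheet_id}
                    ],
                }
            ]
        },
    }


def handler(
    event: Dict[str, Any], context: Any, ecs_client: Optional[Any] = None
) -> Dict[str, Any]:
    """Lambda entry point: validate the request and schedule one ECS task."""
    body = json.loads(event.get("body") or "{}")
    spreadsheet_id = body.get("spreadsheetId")
    if not spreadsheet_id:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "spreadsheetId is required"}),
        }
    if ecs_client is None:
        import boto3  # imported lazily so the sketch can be tested without AWS

        ecs_client = boto3.client("ecs")
    response = ecs_client.run_task(**build_run_task_request(spreadsheet_id))
    task_arns = [task["taskArn"] for task in response.get("tasks", [])]
    # 202: the scrape runs asynchronously and the result arrives by email.
    return {"statusCode": 202, "body": json.dumps({"taskArns": task_arns})}
```

Returning right after the task is scheduled keeps the Lambda invocation short. The long-running scrape happens inside the ECS task, which matches the flow described above, where results are delivered by email rather than in the HTTP response.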

Because CAPTCHA-solving solutions are quite slow and unstable, we paid special attention to redundancy and fallback mechanisms, which significantly improved the final success rate of the scraping.
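
The fallback idea can be sketched as trying each CAPTCHA service in turn, with a few attempts each, before giving up. This is a simplified illustration under stated assumptions. The function `solve_with_fallback`, the `Solver` callable type, and the retry counts are hypothetical, and the real services' APIs are not shown here.

```python
import time
from typing import Callable, List, Optional, Sequence

# A solver takes a site key and returns a token, or None if it failed.
# Hypothetical interface; real CAPTCHA services expose their own APIs.
Solver = Callable[[str], Optional[str]]


class CaptchaUnsolvedError(RuntimeError):
    """Raised when every solver has exhausted its attempts."""


def solve_with_fallback(
    site_key: str,
    solvers: Sequence[Solver],
    attempts_per_solver: int = 2,
    backoff_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each solver in order and return the first token obtained."""
    failures: List[str] = []
    for index, solver in enumerate(solvers):
        for attempt in range(1, attempts_per_solver + 1):
            try:
                token = solver(site_key)
            except Exception as exc:  # timeouts and service errors are expected
                failures.append(f"solver {index}, attempt {attempt}: {exc}")
                token = None
            else:
                if not token:
                    failures.append(f"solver {index}, attempt {attempt}: no token")
            if token:
                return token
            if backoff_s > 0:
                # Wait a little longer after each failed attempt.
                sleep(backoff_s * attempt)
    raise CaptchaUnsolvedError("; ".join(failures))
```

With two services configured, a failure of the first one only costs its retry budget before the second one takes over, so a single unstable provider does not fail the whole scraping run.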

https://d1by0hsgj87y7x.cloudfront.net/use-cases/J.D.Hawkins/picturesDesktopUrl/CasesJDHawkins-Inner1-1280px.png


Results

We automated the process of searching for specific individuals on a website.

As a result, Implex has built a custom data scraping solution for a healthcare consulting company.

The solution aimed to support the client's marketing efforts. We accelerated the data enrichment process, which was previously done manually, by tens of times.

The project addressed concerns related to the site's security, particularly in terms of web scrapers and bots. Web scraping, while enhancing data collection, also reduced the manual workload. 

Key results:

  • Automation of manual work, saving each employee about 25 hours per month, that is, about 125 hours per month for 5 people
  • The ability for the entire team to use the tool collectively
  • Avoided unnecessary expenses and created a system compatible with the client's preferred tools, such as Google Sheets
  • A self-managed system that operates without supervision, reports errors in language the client understands, and lets them fix those errors
  • Cost-effectiveness: the system paid for itself within 3-4 months
