The CROWler


A complete, open-source web crawling and indexing solution for many different uses.

In Summary

The CROWler is a long-running project of mine that has gone through multiple “incarnations”: the very first one, years ago, in C++, then a port to Rust, and now the latest (and hopefully last) incarnation in Go.

  1. In Summary
  2. What is it?
  3. Some of the Features
  4. What problem does it solve?
  5. What type of architecture does it present?
  6. How much can I scale it?
  7. Where do I get it?
  8. What are the minimal requirements to run this thing?
  9. Is there an official Documentation?
  10. How can I contribute to this project?
  11. Where does the peculiar name come from?
  12. Future plans
  13. Using the CROWler as a development framework
    1. Automated Competitive Intelligence Gathering
    2. Global News Aggregator and Sentiment Analyzer
    3. Advanced Web Security and Vulnerability Scanner
    4. Automated E-Commerce Market Analysis and Trend Prediction
    5. Personalized Content Delivery System
  14. And more!
  15. Project’s Updates

What is it?

The CROWler is an open-source, feature-rich web crawler designed with a unique philosophy at its core: to be as gentle and low-noise as possible. In other words, The CROWler tries to stand out by ensuring minimal impact on the websites it crawls while maximizing convenience for its users.

Additionally, the system tries to behave like a human would, which helps reduce its noise even further.

It is also equipped with an API, providing a streamlined interface for data queries. This feature ensures easy integration and access to indexed data for various applications.
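
For instance, fetching indexed data can be as simple as an HTTP call. The following Go sketch is illustrative only: the endpoint path, the query parameter, and the response fields are assumptions, not the official API contract (which is described in the documentation linked further below).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// searchResult uses hypothetical field names; the real API schema may differ.
type searchResult struct {
	URL     string `json:"url"`
	Title   string `json:"title"`
	Summary string `json:"summary"`
}

func main() {
	// Assumed local API endpoint and query parameter, for illustration only.
	base := "http://localhost:8080/v1/search"
	q := url.Values{"q": {"open source web crawler"}}

	resp, err := http.Get(base + "?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var results []searchResult
	if err := json.NewDecoder(resp.Body).Decode(&results); err != nil {
		panic(err)
	}
	for _, r := range results {
		fmt.Printf("%s - %s\n", r.Title, r.URL)
	}
}
```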

The CROWler is designed to be micro-services based, so it can be easily deployed in a containerized environment. By default, it comes with Docker/Docker Compose scripts that make the initial deployment a matter of a few minutes on most platforms.

Some of the Features

  • Low-noise: The CROWler is designed to be as gentle as possible when crawling websites. It respects robots.txt, and it is designed to appear as close as possible to a human user to the websites it crawls.
  • Customizable Crawling: Tailor your crawling experience like never before. Specify URLs and configure individual crawling parameters to fit your precise needs. Whether it’s a single page or an expansive domain, The CROWler adapts to your scope with unmatched flexibility.
  • Scope Variability: Define your crawling boundaries with precision. Choose from:
    • Singular URL Crawling
    • Domain-wide Crawling (combining L3, L2, and L1 domains)
    • L2 and L1 Domain Crawling
    • L1 Domain Crawling (e.g., everything within “.com”)
    • Full Recursive Crawling, venturing beyond initial boundaries to explore connected URLs
  • Advanced Detection Capabilities: Discover a wealth of information with features that go beyond basic crawling:
    • URL and Content Discovery
    • Page Content, Metadata, and more
    • Keywords Analysis and Language Detection
    • Insightful HTTP Headers, Network Info, WHOIS, DNS, and Geo-localization Data
  • Sophisticated Ruleset: To leverage rules-based activities and logic customization (see the illustrative sketch after this list), The CROWler offers:
    • Scraping rules: To extract precisely what you need from websites; these rules can also automatically transform HTML into JSON.
    • Actions rules: To interact with websites in a more dynamic way
    • Detection rules: To identify specific patterns or elements on a page, technologies used, vulnerabilities, etc.
    • Crawling rules: To define how the crawler should behave in different situations (for instance both recursive and non-recursive crawling, fuzzing, etc.)
  • Powerful Search Engine: Utilize an API-driven search engine equipped with dorking capabilities and comprehensive content search, opening new avenues for data analysis and insight.
  • AI Integration: From its ruleset architecture to the way it works, the CROWler has been designed to integrate with Artificial Intelligence tools. There is also a freely available ChatGPT service to help users generate, maintain and debug rules and configurations, and even ask questions about the CROWler (see the support link in the “Where do I get it?” section below).
  • 3rd party network and security tools integration: to help discover vulnerabilities, misconfigurations, and many other problems.
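
To make the ruleset idea a bit more concrete, here is a small illustrative sketch in Go that models a hypothetical scraping rule as data and prints it as YAML. The type and field names are invented for this example and do not reflect the CROWler’s actual ruleset schema, which is defined in the official documentation.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// scrapingRule is an invented type for this sketch; it is NOT the
// CROWler's actual ruleset schema (see the official documentation).
type scrapingRule struct {
	Name       string            `yaml:"name"`
	URLPattern string            `yaml:"url_pattern"`
	Selectors  map[string]string `yaml:"selectors"` // output field -> CSS selector
}

func main() {
	rule := scrapingRule{
		Name:       "product-page",
		URLPattern: `^https://example\.com/products/.*`,
		Selectors: map[string]string{
			"title": "h1.product-title",
			"price": "span.price",
		},
	}

	// Print the rule as YAML, i.e. "rules as data" that an engine can load.
	out, err := yaml.Marshal(rule)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```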

What problem does it solve?

The CROWler is designed to solve a set of problems around web crawling, content discovery, and data extraction. It is designed to crawl websites in a respectful and efficient way. It can also crawl private networks and intranets, so you can use it to build your own personal or company search engine.

If, for example, you wish to have your own search engine, organized around what you value, instead of having companies read your data and invade your privacy whenever you search, clutter your results with sponsored links, or let results be hijacked by insane SEO practices, then the CROWler is for you. Oh, and no ads 😉

On top of that, it can also be used as the “base” for a more complex cybersecurity tool, as it can gather information about a website, its network, its owners, and the services it exposes, detect network vulnerabilities, and collect the site’s scripts (for later linting and security assessment), etc.

Given that it can also extract information, it can be used to create knowledge bases with references to the sources, or to build a database of information about a specific topic.

Obviously, it can also be used for keyword analysis, language detection, etc., but that is something every crawler can be used for. In any case, all the “classic” features are either implemented or being implemented.

What type of architecture does it present?

Here is a diagram to help you visualize a typical deployment of the CROWler:

How much can I scale it?

The CROWler uses minimal resources to crawl a single website. Its architecture allows users to scale it both horizontally and vertically.

To scale a single CROWler engine vertically, use config.yaml to configure it to use more workers; to scale horizontally, you can simply spin up more CROWler engines and/or CROWler VDIs (Virtual Desktop Images) and point the engine at them all via config.yaml.
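
Purely as an illustration of that kind of tuning, here is how such settings could be modelled and loaded in Go. The keys shown (workers, vdi_endpoints) are hypothetical placeholders, not the CROWler’s actual config.yaml options; refer to the official documentation for the real names.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// engineScaling uses placeholder key names; the real config.yaml options
// are described in the official CROWler documentation.
type engineScaling struct {
	Workers      int      `yaml:"workers"`       // vertical scaling: more workers per engine
	VDIEndpoints []string `yaml:"vdi_endpoints"` // horizontal scaling: more VDIs to drive
}

func main() {
	raw := []byte(`
workers: 8
vdi_endpoints:
  - http://vdi-1:4444
  - http://vdi-2:4444
`)

	var cfg engineScaling
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("workers=%d, VDIs=%v\n", cfg.Workers, cfg.VDIEndpoints)
}
```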

The CROWler engine and the VDI auto-scale, so all you need to do is spin up more instances. There is no limit to the number of Engines and VDIs you can run.

The API already comes with secure headers, rate-limiting and protection in place, so you can either increase the allowed number of simultaneous connections or relax the rate-limiting parameters (both via config.yaml), or spin up multiple instances of the API and put them behind a load balancer.

To scale up the CROWler DB, you can use one of the many techniques available to scale up PostgreSQL. Here is a useful article about this: https://pgdash.io/blog/horizontally-scaling-postgresql.html

Also, if you do not need to collect everything the CROWler can detect and collect, don’t forget that you can tailor what gets detected and extracted from a website. In other words, if you do not need certain information, disable the corresponding features; this will also increase performance.

Where do I get it?

Sources are available in my GitHub repository here:

https://github.com/pzaino/thecrowler

If you need help writing, debugging, or reviewing your rules, there is a new OpenAI ChatGPT support service here: The CROWler Rules Support

What are the minimal requirements to run this thing?

The CROWler is designed to run on anything from Raspberry Pi-class hardware all the way up to cloud services. I have deployed it at home using 3 Raspberry Pis.

It can be built locally and deployed on both x86_64 and ARM64 platforms. For your convenience, it comes with Dockerfiles and Docker Compose scripts, so you can quickly build the micro-services deployment and have it running in a few minutes just by configuring 3 environment variables.

Is there an official Documentation?

Yes, have a look at this link for it; it also contains diagrams of the architecture of both the CROWler and its database.

How can I contribute to this project?

Check out this link, where you’ll find all the info on how to contribute, and this link for the contributors’ code of conduct.

Where does the peculiar name come from?

The CROWler kinda resembles “the crawler”, which is obviously what it is! Besides that, the name’s origins come from two things I enjoy:

  • In Norse mythology, Odin used crows (mostly translated as ravens, but a raven doesn’t sound like a crawler!):
    • Huginn and Muninn: Huginn (thought) and Muninn (memory or mind). These birds fly across the world (Midgard) and bring information back to Odin, allowing him to stay informed about events in the world. This represents Odin’s pursuit of knowledge and wisdom. Sounds familiar? 😉
  • In the world of Beers:
    • A Crowler is a 32oz aluminum can used to contain handcrafted beer! (yeah, not the big corporations’ stuff). Now, how cool is that! It has an airtight, spill-proof seal made to keep carbonation in, which in turn keeps light out and keeps your beer fresher for longer! In Cybersecurity, we are pretty famous for enjoying this type of beverage, so there was NO doubt: I had to call one of my projects the Crowler! 😀

The CROW in CROWler is spelled in capitals because originally (mostly only in the C++ version) it meant: Content discovery Robot Over the Web. I no longer use this acronym, but I have kept the capital spelling to differentiate the CROWler crawler from the Crowler beer can. So feel free to spell it however you prefer.

The name is pronounced as:

CROW: Pronounced as /kroʊ/, rhymes with “know” or “snow.”

ler: The latter part is pronounced as /lər/, similar to the ending of “crawler” or “tumbler.”

Putting it all together, it sounds like “thuh KROH-lər”.

Future plans

Development has shifted over the years from C++ to Rust, and has now landed on porting the entire solution to Go, so porting activity continues while, at the same time, the existing components keep being improved. Please make sure you always get the latest release from the main branch to get the most out of this project.

Other plans for the future include:
– Improving the already powerful ruleset architecture and allowing integration with more tools on the scene.
– Further enhancing AI integration. I want to build the easiest solution for the job while not compromising on features and robustness.
– Move the network portions of HTTP and Net Information detection to the VDI (Virtual Desktop Image) or a dedicated image.
– Improve human user simulation

Using the CROWler as a development framework

The CROWler can be used in a multitude of scenarios. Below are some examples, with details on which CROWler modules should be used as the base for each specific type of solution:

1. Automated Competitive Intelligence Gathering

  • Objective: Monitor competitors’ websites for changes in product offerings, pricing, marketing strategies, and customer feedback.
  • Components:
    • Detection Rules: Identify and extract product details, prices, promotional banners, and customer reviews.
    • Scraping Rules: Collect detailed information on product changes, new product launches, price adjustments, and customer sentiment analysis.
    • Action Rules: Automatically interact with dynamic content like price calculators or promotional pop-ups.
    • Integration: Use APIs to integrate collected data with a BI tool for real-time analysis and reporting (see the sketch after this list).
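
As a sketch of the Integration step above, the snippet below periodically pulls fresh results from a CROWler API instance and forwards them to a BI tool’s ingestion webhook. The endpoint path and the webhook URL are hypothetical; the snippet only illustrates the general API-to-BI forwarding pattern.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

// forwardOnce pulls the latest results from the crawler API and pushes the
// JSON payload to a BI tool's ingestion webhook. Both URLs are placeholders.
func forwardOnce(crowlerAPI, biWebhook string) error {
	resp, err := http.Get(crowlerAPI)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}

	postResp, err := http.Post(biWebhook, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	postResp.Body.Close()
	return nil
}

func main() {
	crowlerAPI := "http://localhost:8080/v1/search?q=competitor+products" // assumed endpoint
	biWebhook := "https://bi.example.com/ingest"                          // placeholder webhook

	// Poll periodically for near-real-time reporting.
	for {
		if err := forwardOnce(crowlerAPI, biWebhook); err != nil {
			fmt.Println("forward failed:", err)
		}
		time.Sleep(15 * time.Minute)
	}
}
```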

2. Global News Aggregator and Sentiment Analyzer

  • Objective: Aggregate news articles from various sources worldwide and perform sentiment analysis to gauge public opinion on specific topics.
  • Components:
    • Crawling Rules: Define rules for crawling news websites, blogs, and social media platforms.
    • Detection Rules: Identify and categorize articles based on predefined topics or keywords.
    • Scraping Rules: Extract headlines, article content, author details, and publication dates.
    • Post-Processing: Integrate with a sentiment analysis API to evaluate the sentiment of each article (see the sketch after this list).
    • Database: Store and index the collected data in a searchable format for further analysis.
    • Visualization: Use a dashboard tool to visualize trends, sentiment scores, and geographical distribution of news articles.
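
As an illustration of the Post-Processing step in this scenario, the snippet below hands extracted article text to a sentiment-analysis service. The scoring endpoint and the request/response shapes are hypothetical stand-ins for whatever sentiment API you choose.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type article struct {
	Headline string
	Content  string
}

// sentimentResponse assumes the service returns a single numeric score.
type sentimentResponse struct {
	Score float64 `json:"score"` // e.g. -1.0 (negative) .. +1.0 (positive)
}

func scoreSentiment(apiURL string, a article) (float64, error) {
	payload, err := json.Marshal(map[string]string{"text": a.Content})
	if err != nil {
		return 0, err
	}
	resp, err := http.Post(apiURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var out sentimentResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return out.Score, nil
}

func main() {
	a := article{Headline: "Example headline", Content: "Example article body..."}
	score, err := scoreSentiment("https://sentiment.example.com/v1/score", a) // placeholder URL
	if err != nil {
		fmt.Println("scoring failed:", err)
		return
	}
	fmt.Printf("%q -> sentiment %.2f\n", a.Headline, score)
}
```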

3. Advanced Web Security and Vulnerability Scanner

  • Objective: Perform comprehensive security assessments on websites to identify potential vulnerabilities and security misconfigurations.
  • Components:
    • Service Scout Integration: Utilize service discovery features to detect open ports and running services, as well as network vulnerabilities.
    • Crawling Rules: Simulate various attack vectors like SQL injection, XSS, and CSRF.
    • Detection Rules: Identify web technologies, frameworks, and versions in use, as well as XSS, CSRF, CSS, and JavaScript vulnerabilities.
    • Scraping Rules: Extract and analyze configuration files, scripts, and other potentially sensitive information.
    • Post-Processing: Use third-party vulnerability databases and APIs to cross-reference detected vulnerabilities, as well as AI-based live data analysis and enrichment.
    • Reporting: Generate detailed security assessment reports with recommendations for remediation.

4. Automated E-Commerce Market Analysis and Trend Prediction

  • Objective: Continuously monitor and analyze e-commerce websites to track product trends, consumer behavior, and market dynamics.
  • Components:
    • Crawling Rules: Define rules for crawling various e-commerce platforms like Amazon, eBay, and Alibaba.
    • Detection Rules: Identify popular products, emerging trends, and seasonal changes.
    • Scraping Rules: Extract product listings, prices, stock levels, customer reviews, and ratings.
    • Post-Processing: Integrate with machine learning models to predict future trends and consumer behavior.
    • Visualization: Create dashboards to visualize market trends, price fluctuations, and consumer sentiment.

5. Personalized Content Delivery System

  • Objective: Create a system that personalizes content delivery based on user behavior and preferences.
  • Components:
    • User Behavior Tracking: Use action rules to track user interactions on websites.
    • Crawling and Scraping Rules: Collect content from various sources tailored to individual user preferences.
    • Detection Rules: Categorize and tag content based on user interests and behavior.
    • Machine Learning Integration: Analyze user data to recommend personalized content.
    • Delivery System: Implement a content delivery system that dynamically updates based on real-time user interactions.

And more!

Here is a pseudo-knowledge-graph diagram with some more details on what the CROWler collects and stores (with some degree of abstraction), to give an idea of how the collected information can be correlated. Keep in mind that every “details” field is actually a fully extensible JSON document (and each JSON tag is indexed and searchable). To extend it, one develops rulesets and plugins for the CROWler.

Project’s Updates

2024-06-27
– New alpha release 0.9.4 available!
– This release introduces Database schema 1.4
– More details on all the new features and fixes here:
https://paolozaino.wordpress.com/2024/06/27/new-alpha-release-v0-9-4-for-the-crowler-golang-port/

2024-05-20
– New Database Schema version 1.3 released (beta testers please rebuild the containers!)
– New Cluster-friendly Status report in the logs
– New concurrent crawling algorithm released (for when a VDI is shared among multiple Sources being crawled). It also works well when “mixing and matching” various numbers of VDIs with various numbers of Sources to crawl from a single Engine (CEI).

2024-05-01
– As of today, all previous implementations in C++ and the previously ongoing port to Rust are no longer supported, and their repositories have been frozen. Please use only the new Go port, as this is the only maintained release.
– At the moment the port is still in active development, so please be patient. Alpha testers are most welcome.
– You can find the new Go port here: https://github.com/pzaino/thecrowler


