Micro Web Crawler in PHP & Manticore

Yo! Micro Web Crawler in PHP & Manticore

The next generation of the YGGo! project, with the goal of reducing server requirements and simplifying deployment.

The index model has changed to a distributed cluster model, oriented toward aggregating search results from different instances through the API.

The codebase is kept as minimal as possible.

Implementation

The engine is written in PHP and uses Manticore as the backend.

The default build is inspired by and adapted for the Yggdrasil ecosystem, but it can also be used to build your own search project.

Components

  • CLI tools for index operations
  • JS-less frontend for building a search web portal
  • API tools for making the search index distributed

Features

  • MIME-based crawler with flexible filter settings
  • Page snap history with support for local and remote mirrors

Documentation

CLI

Index
Init

Create initial index

php src/cli/index/init.php [reset]
  • reset - optional, reset existing index
Document
Add
php src/cli/document/add.php URL
  • URL - add new URL to the crawl queue
Crawl
php src/cli/document/crawl.php
Search
php src/cli/document/search.php '@title "*"' [limit]
  • query - required, search query (for example '@title "*"')
  • limit - optional search results limit
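
The commands above can be chained into a first-run workflow. The following is a minimal sketch, assuming it is run from the repository root. The `run` helper, the `DRY_RUN` variable, the example URL and the result limit of 10 are illustrative placeholders and not part of the project.

```shell
#!/bin/sh
# Sketch of a first-run workflow built from the documented CLI commands.
# Assumptions (not from the project): the `run` helper, the DRY_RUN variable,
# the example URL and the result limit are illustrative only.
set -eu

# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        # Print the command; note that shell quoting is not preserved here.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

workflow() {
    run php src/cli/index/init.php                          # create the initial index
    run php src/cli/document/add.php "http://example.com/"  # queue a URL (placeholder)
    run php src/cli/document/crawl.php                      # run the crawler
    run php src/cli/document/search.php '@title "*"' 10     # query with a limit of 10
}

workflow
```

Setting `DRY_RUN=0` executes the commands instead of printing them. To rebuild an existing index, the first step would be `php src/cli/index/init.php reset`.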