Micro Web Crawler in PHP & Manticore

Yo! Micro Web Crawler in PHP & Manticore

Next generation of the YGGo! project, with the goal of reducing server requirements and simplifying deployment.

The index model has changed to a distributed cluster model, oriented toward aggregating search results from different instances through the API.
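To illustrate what aggregating results from several instances could look like, here is a minimal sketch. It assumes each instance has already returned an array of rows with `url`, `title` and `score` keys. The function name `mergeInstanceResults` and the row shape are hypothetical and are not taken from the project's API. It also assumes scores from different instances are comparable, which may not hold in practice.

```php
<?php
declare(strict_types=1);

/**
 * Merge result sets returned by several search instances.
 * Hypothetical helper: the row shape (url, title, score) is an assumption.
 *
 * @param array<int, array<int, array{url: string, title: string, score: float}>> $resultSets
 * @return array<int, array{url: string, title: string, score: float}>
 */
function mergeInstanceResults(array $resultSets, int $limit = 10): array
{
    $byUrl = [];

    foreach ($resultSets as $rows) {
        foreach ($rows as $row) {
            $url = $row['url'];
            // Keep only the best-scored copy of a URL seen on several instances
            if (!isset($byUrl[$url]) || $row['score'] > $byUrl[$url]['score']) {
                $byUrl[$url] = $row;
            }
        }
    }

    $merged = array_values($byUrl);

    // Highest score first
    usort($merged, fn(array $a, array $b): int => $b['score'] <=> $a['score']);

    return array_slice($merged, 0, $limit);
}
```

Deduplicating by URL before sorting keeps the merged list short when the same page is indexed by more than one instance.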

The codebase is kept as minimal as possible.

Implementation

The engine is written in PHP and uses Manticore as the backend.

The default build is inspired by and adapted for the Yggdrasil ecosystem, but it can also be used to build your own search project.

Components

  • CLI tools for index operations
  • JS-less frontend for building a search web portal
  • API tools for distributing the search index

Features

  • MIME-based crawler with flexible filter settings
  • Page snap history with support for local and remote mirrors
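As a rough idea of how a MIME-based filter might work, the sketch below checks a `Content-Type` header value against a list of allowed patterns. The function name `isMimeAllowed` and the wildcard convention (`text/*`) are assumptions made for illustration; the project's real filter settings may be structured differently.

```php
<?php
declare(strict_types=1);

/**
 * Check a Content-Type header value against a list of allowed MIME patterns.
 * Hypothetical helper; patterns such as "text/*" are an assumed convention.
 *
 * @param string[] $allowed
 */
function isMimeAllowed(string $contentType, array $allowed): bool
{
    // Drop parameters such as "; charset=utf-8" and normalize case
    $mime = strtolower(trim(explode(';', $contentType, 2)[0]));

    if ($mime === '') {
        return false;
    }

    foreach ($allowed as $pattern) {
        // fnmatch() gives shell-style wildcards, so "text/*" matches "text/html"
        if (fnmatch(strtolower($pattern), $mime)) {
            return true;
        }
    }

    return false;
}
```

A crawler could call this right after receiving the response headers and skip the body download when the type is not allowed.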

Documentation

CLI

Index
Init

Create initial index

php src/cli/index/init.php [reset]
  • reset - optional, reset existing index
Document
Add
php src/cli/document/add.php URL
  • URL - add new URL to the crawl queue
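Before a URL reaches the crawl queue it usually needs some cleanup. The sketch below shows one possible approach: trim the input and accept only `http` and `https` URLs that have a host. The function name `normalizeQueueUrl` and the scheme restriction are assumptions, not the behavior of `add.php` itself.

```php
<?php
declare(strict_types=1);

/**
 * Normalize a URL given on the command line before it is queued.
 * Hypothetical helper: returns null when the input is not an http(s) URL.
 */
function normalizeQueueUrl(string $raw): ?string
{
    // Remove surrounding whitespace and line breaks left by copy-paste
    $url = trim($raw);

    $parts = parse_url($url);

    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return null;
    }

    // Only web pages are of interest here (assumption)
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return null;
    }

    return $url;
}
```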
Crawl
php src/cli/document/crawl.php
Search
php src/cli/document/search.php '@title "*"' [limit]
  • query - required, search query (for example '@title "*"')
  • limit - optional search results limit
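The required query and optional limit described above could be read from the command line roughly as follows. This is only a sketch: the function name `parseSearchArgs` and the default limit of 10 are assumptions, and the real `search.php` may handle its arguments differently.

```php
<?php
declare(strict_types=1);

/**
 * Read the search arguments from an argv-style array.
 * Hypothetical helper; the default limit of 10 is an assumption.
 *
 * @param string[] $argv
 * @return array{query: string, limit: int}
 */
function parseSearchArgs(array $argv, int $defaultLimit = 10): array
{
    // $argv[0] is the script name, so the query is expected at index 1
    $query = trim($argv[1] ?? '');

    if ($query === '') {
        throw new InvalidArgumentException('query is required');
    }

    // The limit is optional; fall back to the default when it is
    // missing or not a positive integer
    $limit = $defaultLimit;
    if (isset($argv[2])) {
        $parsed = filter_var($argv[2], FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]);
        if ($parsed !== false) {
            $limit = $parsed;
        }
    }

    return ['query' => $query, 'limit' => $limit];
}
```

Quoting the query in the shell, as in the command above, keeps it in a single `$argv` entry even when it contains spaces or double quotes.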