Yo!

Micro Web Crawler in PHP & Manticore

Yo! is a super-thin client-server crawler built on Manticore full-text search.
It is compatible with different networks and includes flexible settings, history snaps, CLI tools, and a UI for the Gemini Protocol.

To use the HTTP version, check out the main branch.

Features

  • MIME-based crawler with flexible filter settings: regular expressions, selectors, external links, etc.
  • Page snap history with support for local and remote mirrors (including the FTP protocol)
  • CLI tools for index administration and crontab tasks
  • Gemini Protocol UI (coming soon)

Components

Install

Environment

Debian
  • wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
  • dpkg -i manticore-repo.noarch.deb
  • apt update
  • apt install git composer manticore manticore-extra memcached php-fpm php-mbstring php-memcached

Yo uses Manticore as its primary database. If your server is prone to unexpected power loss, change the default binlog flush strategy to binlog_flush = 1.
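To confirm which flush strategy is configured before and after the change, the Manticore configuration file can be inspected. The following is a minimal sketch and not part of Yo: the helper name binlog_flush_value is hypothetical, and the config path in the example comment is an assumption that depends on how Manticore was installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Yo): print the binlog_flush value found in
# a Manticore config file, or "unset" when the directive is absent.
binlog_flush_value() {
    conf="$1"
    # Take the last uncommented "binlog_flush = N" line and keep only the number.
    value=$(sed -n 's/^[[:space:]]*binlog_flush[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$conf" | tail -n 1)
    echo "${value:-unset}"
}

# Example (the path is an assumption; adjust it to your installation):
#   binlog_flush_value /etc/manticoresearch/manticore.conf
```

If the helper prints unset, the directive is missing and Manticore uses its default strategy. In that case binlog_flush = 1 would go into the searchd section of the config, followed by a service restart.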

Deployment

  • git clone https://github.com/YGGverse/Yo.git
  • cd Yo
  • git checkout gemini
  • composer update

Development

  • git clone https://github.com/YGGverse/Yo.git
  • cd Yo
  • git checkout gemini
  • git checkout -b pr-branch
  • git commit -m 'new fix'
  • git push --set-upstream origin pr-branch

Update

  • cd Yo
  • git pull
  • composer update

Init

  • cp example/config.json config.json
  • php src/cli/index/init.php

Usage

  • php src/cli/document/add.php URL
  • php src/cli/document/crawl.php
  • php src/cli/document/search.php '*'
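When several URLs need to be queued at once, the add and crawl commands above can be wrapped in a small script. This is a sketch under stated assumptions: the seed file format (one URL per line, "#" for comments) and the function names seed_urls and queue_and_crawl are hypothetical and not part of Yo. Only the add.php and crawl.php commands come from this README, and the script is meant to be run from the repository root.

```shell
#!/bin/sh
# Print the URLs from a seed file, skipping blank lines and "#" comments.
seed_urls() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Queue every seed URL, then run one crawl pass.
queue_and_crawl() {
    seed_urls "$1" | while IFS= read -r url; do
        # </dev/null keeps php from consuming the URL list on stdin.
        php src/cli/document/add.php "$url" < /dev/null \
            || echo "failed to queue: $url" >&2
    done
    php src/cli/document/crawl.php
}

# Example: queue_and_crawl seeds.txt
```

A failed add.php call is reported on stderr and the loop continues, so one bad URL does not block the rest of the list.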

Gemini UI

Coming soon..

Documentation

CLI

Index

Init

Create initial index

php src/cli/index/init.php [reset]
  • reset - optional, reset existing index
Alter

Change existing index

php src/cli/index/alter.php {operation} {column} {type}
  • operation - operation name, supported values: add|drop
  • column - target column name
  • type - target column type, supported values: text|integer
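Because alter.php changes the index schema, it can help to check the arguments against the supported values before running it. The wrapper below is a minimal sketch, not part of Yo: the function names valid_alter_args and yo_alter are hypothetical, and the column name in the example is made up. It assumes all three arguments are always passed, as in the syntax shown above.

```shell
#!/bin/sh
# Return 0 when the arguments match the values documented above.
valid_alter_args() {
    [ "$#" -eq 3 ] || return 1
    case "$1" in add|drop) ;; *) return 1 ;; esac
    [ -n "$2" ] || return 1
    case "$3" in text|integer) ;; *) return 1 ;; esac
}

# Validate first, then hand the arguments to the CLI unchanged.
yo_alter() {
    if ! valid_alter_args "$@"; then
        echo "usage: yo_alter add|drop COLUMN text|integer" >&2
        return 2
    fi
    php src/cli/index/alter.php "$1" "$2" "$3"
}

# Example ("description" is a made-up column name), run from the repository root:
#   yo_alter add description text
```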

Document

Add
php src/cli/document/add.php URL
  • URL - the URL to add to the crawl queue
Crawl
php src/cli/document/crawl.php
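If crawl.php is scheduled as a crontab task, a slow crawl could still be running when the next one starts. The sketch below shows one way to skip overlapping runs with a lock directory. It is an assumption about how the task might be scheduled, not something Yo provides: the function name run_locked and the lock path are hypothetical.

```shell
#!/bin/sh
# Run a command only when no other run holds the lock.
# mkdir is atomic, so two overlapping cron runs cannot both acquire it.
run_locked() {
    lock_dir="$1"
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous run still active, skipping" >&2
        return 0
    fi
    status=0
    "$@" || status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Example (the lock path is an assumption), run from the repository root:
#   run_locked /tmp/yo-crawl.lock php src/cli/document/crawl.php
```

If a run is killed before it releases the lock, the directory stays behind and has to be removed by hand before the next crawl can start.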
Clean

Optimize the index and apply new configuration rules

php src/cli/document/clean.php [limit]
  • limit - integer, number of documents to process per queue
Search
php src/cli/document/search.php '@title "*"' [limit]
  • query - required, full-text search query (the example above searches the title field)
  • limit - optional, maximum number of search results

Backup

Logical

SQL text dumps can be useful for public index distribution, but they require more computing resources.

Read more

Physical

Physical backups are better suited to infrastructure administration and include the original data binaries.

Read more

Instances

Coming soon..