
Yo!

Micro Web Crawler in PHP & Manticore

Yo! is a thin layer on top of the Manticore search server that extends the official manticoresearch-php client with CLI tools and a simple JS-less web UI.

Features

  • MIME-based crawler with flexible filters: regular expressions, selectors, external links, etc.
  • Page snap history with support for local and remote mirrors (including the FTP protocol)
  • CLI tools for index administration and crontab tasks
  • JS-less frontend for running a local or public search web portal
  • API tools for distributing the search index

Components

Install

The application requires Manticore, Composer and PHP.

Deployment

The project is under development; use the dev-main branch:

  • composer create-project yggverse/yo:dev-main

Development

  • git clone https://github.com/YGGverse/Yo.git
  • cd Yo
  • composer update
  • git checkout -b pr-branch

Init

  • cp example/config.json config.json
  • php src/cli/index/init.php

Usage

  • php src/cli/document/add.php URL
  • php src/cli/document/crawl.php
  • php src/cli/document/search.php '*'
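The add and crawl steps above can be combined when several start URLs need to be queued at once. The sketch below is only an illustration: `yo_seed` and the `YO_PHP` override are hypothetical names that are not part of Yo, and the function assumes it is run from the project root.

```shell
#!/bin/sh
# Hypothetical helper (not part of Yo): queue every URL read from stdin,
# then run a single crawl pass.
# YO_PHP is an assumed override for the PHP binary (defaults to "php").
# Run from the project root so the relative script paths resolve.
yo_seed() {
    while IFS= read -r url; do
        # Skip blank lines and comment lines starting with '#'
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the URL list
        "${YO_PHP:-php}" src/cli/document/add.php "$url" < /dev/null || return 1
    done
    "${YO_PHP:-php}" src/cli/document/crawl.php
}
```

For example, `yo_seed < urls.txt` would queue each line of a (hypothetical) `urls.txt` file and then start the crawler.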

Web UI

  1. cd src/webui
  2. php -S 127.0.0.1:8080
  3. open http://127.0.0.1:8080 in browser

Documentation

CLI

Index

Init

Create the initial index

php src/cli/index/init.php [reset]
  • reset - optional, reset existing index
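Because `reset` acts on an existing index, a confirmation prompt can guard against running it by accident. This is a minimal sketch that assumes `reset` discards the current index data; `yo_init_reset` and `YO_PHP` are hypothetical names, not part of Yo.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Yo): ask for confirmation before
# resetting the index.
# YO_PHP is an assumed override for the PHP binary (defaults to "php").
# Run from the project root so the relative script path resolves.
yo_init_reset() {
    printf 'Reset the existing index? Type "yes" to continue: ' >&2
    IFS= read -r answer || answer=''
    if [ "$answer" != "yes" ]; then
        echo "aborted" >&2
        return 1
    fi
    "${YO_PHP:-php}" src/cli/index/init.php reset
}
```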
Alter

Change an existing index

php src/cli/index/alter.php {operation} {column} {type}
  • operation - operation name, supported values: add|drop
  • column - target column name
  • type - target column type, supported values: text|integer
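The supported values listed above can be checked before the script is called, so that a typo fails early. The following is a sketch under assumptions: `yo_alter` and `YO_PHP` are hypothetical names, and all three arguments are passed through exactly as in the documented signature.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Yo): validate arguments against the
# values documented for alter.php before calling it.
# YO_PHP is an assumed override for the PHP binary (defaults to "php").
# Run from the project root so the relative script path resolves.
yo_alter() {
    operation=${1:-}
    column=${2:-}
    column_type=${3:-}

    case "$operation" in
        add|drop) ;;
        *) echo "operation must be add or drop" >&2; return 2 ;;
    esac

    if [ -z "$column" ]; then
        echo "column name is required" >&2
        return 2
    fi

    case "$column_type" in
        text|integer) ;;
        *) echo "type must be text or integer" >&2; return 2 ;;
    esac

    "${YO_PHP:-php}" src/cli/index/alter.php "$operation" "$column" "$column_type"
}
```

For example, `yo_alter add lang text` would add a text column, where `lang` is an illustrative column name only.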

Document

Add
php src/cli/document/add.php URL
  • URL - the URL to add to the crawl queue
Crawl
php src/cli/document/crawl.php
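When the crawler is scheduled from crontab, a new run may start before the previous one has finished. One way to avoid overlapping runs is a simple lock, sketched below. `yo_crawl_locked`, `YO_PHP` and `YO_LOCK_DIR` are hypothetical names, and the default lock path is an assumption.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Yo): run the crawler only if no other
# run currently holds the lock.
# YO_PHP is an assumed override for the PHP binary (defaults to "php").
# YO_LOCK_DIR is an assumed lock location (defaults to /tmp/yo-crawl.lock).
# Run from the project root so the relative script path resolves.
yo_crawl_locked() {
    lock_dir="${YO_LOCK_DIR:-/tmp/yo-crawl.lock}"

    # mkdir is atomic: it fails if another run already created the directory
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "crawl already running, skipping" >&2
        return 0
    fi

    "${YO_PHP:-php}" src/cli/document/crawl.php
    status=$?

    # Release the lock and report the crawler's exit status
    rmdir "$lock_dir"
    return "$status"
}
```

Note that this sketch does not clean up after a killed process: a stale lock directory would have to be removed by hand.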
Clean

Optimize the index and apply new configuration rules

php src/cli/document/clean.php [limit]
  • limit - optional, integer, number of documents per queue
Search
php src/cli/document/search.php '@title "*"' [limit]
  • query - required, search query, e.g. '@title "*"'
  • limit - optional search results limit
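The example query above restricts the match to the title field. A small wrapper can build that query from a plain search term, which keeps the nested quoting in one place. This is only a sketch: `yo_search_title` and `YO_PHP` are hypothetical names, and terms containing double quotes are rejected rather than escaped.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Yo): search document titles for a term.
# YO_PHP is an assumed override for the PHP binary (defaults to "php").
# Run from the project root so the relative script path resolves.
yo_search_title() {
    term=${1:-}
    limit=${2:-}

    if [ -z "$term" ]; then
        echo "search term is required" >&2
        return 2
    fi

    # Keep the sketch simple: refuse terms that would break the quoting
    case "$term" in
        *'"'*) echo "double quotes are not supported in this sketch" >&2; return 2 ;;
    esac

    # Same form as the documented example: @title "..."
    query="@title \"$term\""

    if [ -n "$limit" ]; then
        "${YO_PHP:-php}" src/cli/document/search.php "$query" "$limit"
    else
        "${YO_PHP:-php}" src/cli/document/search.php "$query"
    fi
}
```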
Migration
YGGo

Import the index from a YGGo database

php src/cli/yggo/import.php 'host' 'port' 'user' 'password' 'database' [unique=off] [start=0] [limit=100]

Arguments (the source DB connection fields are required):

  • host
  • port
  • user
  • password
  • database
  • unique - optional, check for unique URL (takes more time)
  • start - optional, offset to start queue
  • limit - optional, limit queue
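A large source database could be imported in batches by advancing `start` in steps of `limit`. The sketch below rests on several assumptions that the text above does not confirm: the optional arguments are passed positionally as plain values (`off`, an offset, a batch size), `start` is a row offset, and the total row count is known in advance. `yo_yggo_import_batches` and `YO_PHP` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper (not part of Yo): import a YGGo database in batches.
# Assumptions: optional arguments are positional plain values, and "start"
# is a row offset.
# YO_PHP is an assumed override for the PHP binary (defaults to "php").
# Arguments: host port user password database total_rows [batch_size=100]
yo_yggo_import_batches() {
    host=${1:-}
    port=${2:-}
    user=${3:-}
    password=${4:-}
    database=${5:-}
    total=${6:-}
    batch=${7:-100}

    # Both counters must be integers; the batch size must be positive
    [ "$total" -ge 0 ] 2>/dev/null || { echo "invalid total" >&2; return 2; }
    [ "$batch" -gt 0 ] 2>/dev/null || { echo "invalid batch size" >&2; return 2; }

    start=0
    while [ "$start" -lt "$total" ]; do
        "${YO_PHP:-php}" src/cli/yggo/import.php \
            "$host" "$port" "$user" "$password" "$database" \
            off "$start" "$batch" || return 1
        start=$((start + batch))
    done
}
```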

Backup

Logical

SQL text dumps can be useful for public index distribution, but they require more computing resources.

Read more

Physical

Physical backups are better suited to infrastructure administration and include the original data binaries.

Read more

Instances

Yggdrasil

  • http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yo/ - IPv6 0200::/7 addresses only | index

Alfis DNS

  • http://yo.ygg - .ygg domain zone search only | index
  • http://ygg.yo.index - alias of http://yo.ygg | index

*.yo.index is reserved for domain-oriented instances, e.g. .btn, .conf, .mirror; feel free to request an address.