Mirror of https://github.com/YGGverse/Yo.git, synced 2025-01-13 00:08:09 +00:00
Yo!
Micro Web Crawler in PHP & Manticore
Yo! is a thin layer on top of the Manticore search server. It extends the official manticoresearch-php client with CLI tools and a simple JS-less web UI.
Features
- MIME-based crawler with flexible filtering by regular expressions, selectors, external links, etc.
- Page snap history with support for local and remote mirrors (including FTP)
- CLI tools for index administration and crontab tasks
- JS-less frontend for running a local or public search web portal
- API tools for distributing the search index
Components
- Manticore Server
- PHP library for Manticore
- Symfony DOM crawler
- Symfony CSS selector
- FTP client for snap mirrors
- Hostname ident icons
- Bootstrap icons
Install
- Install manticore, composer and php
- Grab the latest Yo version
git clone https://github.com/YGGverse/Yo.git
- Run composer update inside the project directory
- Copy and customize the config file
cp example/config.json config.json
- Make sure the storage folder is writable
- Run the index initialization script
php src/cli/index/init.php
- Announce a new URL
php src/cli/document/add.php URL
- Run the crawler to grab the data
php src/cli/document/crawl.php
- Test search results
php src/cli/document/search.php '*'
Web UI
cd src/webui
php -S 127.0.0.1:8080
- open 127.0.0.1:8080 in a browser
Documentation
CLI
Index
Init
Create initial index
php src/cli/index/init.php [reset]
reset - optional, reset the existing index
Document
Add
php src/cli/document/add.php URL
URL - new URL to add to the crawl queue
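When many URLs need to be announced at once, the add command can be wrapped in a small loop. The following is a minimal sketch, not part of Yo: the function name `add_urls`, the `YO_PHP` variable and the one-URL-per-line list file are assumptions made for illustration. It is meant to be run from the project directory.

```shell
#!/bin/sh
# Queue every URL listed in a text file (one URL per line).
# YO_PHP and add_urls are illustrative names, not part of Yo.
YO_PHP="${YO_PHP:-php}"   # PHP binary to use

add_urls() {
    # $1 = path to the URL list; blank lines and '#' comments are skipped
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # </dev/null keeps the command from consuming the list on stdin
        "$YO_PHP" src/cli/document/add.php "$url" </dev/null
    done < "$1"
}
```

For example, `add_urls seeds.txt` would call `add.php` once for each URL in a hypothetical `seeds.txt`.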
Crawl
php src/cli/document/crawl.php
Clean
php src/cli/document/clean.php
- remove url duplicates
- optimize the index
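Since the CLI tools are intended for crontab tasks, the crawl and clean commands could be chained in one wrapper. This is only a sketch under assumptions: the names `crawl_and_clean`, `YO_PHP` and `YO_LOCK`, the lock directory, and the schedule in the trailing comment are all hypothetical and not taken from Yo. The lock is there so that a slow crawl is not started a second time by the next cron tick.

```shell
#!/bin/sh
# Crawl, then clean, while refusing to overlap with a previous run.
# YO_PHP, YO_LOCK and crawl_and_clean are illustrative names, not part of Yo.
YO_PHP="${YO_PHP:-php}"
YO_LOCK="${YO_LOCK:-/tmp/yo-crawl.lock}"

crawl_and_clean() {
    # mkdir either creates the directory or fails, so it works as a simple lock
    if ! mkdir "$YO_LOCK" 2>/dev/null; then
        echo "yo: previous run still active, skipping" >&2
        return 0
    fi
    status=0
    "$YO_PHP" src/cli/document/crawl.php || status=$?
    # Only clean when the crawl finished without an error
    if [ "$status" -eq 0 ]; then
        "$YO_PHP" src/cli/document/clean.php || status=$?
    fi
    rmdir "$YO_LOCK"
    return "$status"
}

# Hypothetical crontab entry (path, file name and schedule are placeholders):
# */30 * * * * cd /path/to/Yo && . ./yo-cron.sh && crawl_and_clean
```

If a run is killed before it reaches `rmdir`, the lock directory stays behind and has to be removed by hand.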
Search
php src/cli/document/search.php '@title "*"' [limit]
query - required
limit - optional search results limit
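The query is passed to the script as a single argument, so it needs quoting to stop the shell from expanding `*` or splitting a multi-word query. A small wrapper makes this harder to get wrong. This is a sketch only: `yo_search` and `YO_PHP` are hypothetical names, and the `@title` example assumes the index has a `title` field, as the command above suggests.

```shell
#!/bin/sh
# Thin wrapper around the search command that keeps the query quoted.
# YO_PHP and yo_search are illustrative names, not part of Yo.
YO_PHP="${YO_PHP:-php}"

yo_search() {
    # $1 = query (required), $2 = results limit (optional)
    if [ -z "${1:-}" ]; then
        echo "usage: yo_search QUERY [LIMIT]" >&2
        return 2
    fi
    # ${2:+"$2"} adds the limit argument only when one was given
    "$YO_PHP" src/cli/document/search.php "$1" ${2:+"$2"}
}
```

Possible calls, run from the project directory, are `yo_search '*'` for everything and `yo_search '@title yggdrasil' 10` for at most ten documents matching in the title field.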
Migration
YGGo
Import index from YGGo database
php src/cli/yggo/import.php 'host' 'port' 'user' 'password' 'database' [unique=off] [start=0] [limit=100]
Source DB fields required:
host
port
user
password
database
unique
unique - optional, check for unique URL (takes more time)
start - optional, offset to start the queue from
limit - optional, limit the queue
Backup
Logical
SQL text dumps can be useful for public index distribution, but they require more computing resources.
Physical
Physical backups are better suited to infrastructure administration and include the original data binaries.
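The two backup styles can be sketched with the tools Manticore provides: `mysqldump` over the MySQL protocol for a logical dump, and `manticore-backup` for a physical copy. The host, port 9306 and the database name `manticore` below are assumed defaults and should be checked against the local Manticore configuration. The helper names and the `DRY_RUN` switch are hypothetical and not part of Yo.

```shell
#!/bin/sh
# Sketch of logical and physical backups of a Manticore instance.
# run, backup_logical, backup_physical and DRY_RUN are illustrative names.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

backup_logical() {
    # $1 = output .sql file; assumes the MySQL protocol listener on 127.0.0.1:9306
    run mysqldump -h 127.0.0.1 -P 9306 --result-file="$1" manticore
}

backup_physical() {
    # $1 = target directory for the copied index files
    run manticore-backup --backup-dir="$1"
}
```

With the default `DRY_RUN=1` the functions only print what they would run, for example `backup_logical /tmp/yo.sql`. Set `DRY_RUN=0` once the printed commands match the local setup.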
Instances
Yggdrasil
http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yo/ - IPv6 0200::/7 addresses only | index
Alfis DNS
http://yo.ygg - .ygg domain zone search only | index
http://ygg.yo.index - alias of http://yo.ygg | index
*.yo.index is reserved for domain-oriented instances, e.g. .btn, .conf, .mirror - feel free to request the address