mirror of
https://github.com/YGGverse/Yo.git
synced 2025-01-13 16:27:54 +00:00
Micro Web Crawler in PHP & Manticore
Yo!
Micro Web Crawler in PHP & Manticore
Yo! is a super-thin client-server crawler based on Manticore full-text search.
It is compatible with different networks and includes flexible settings, page snapshot history, CLI tools and a UI for the Gemini protocol.
To use the HTTP version, please check out the main branch!
Features
- MIME-based crawler with flexible filter settings: regular expressions, selectors, external links, etc.
- Page snapshot history with support for local and remote mirrors (including FTP)
- CLI tools for index administration and crontab tasks
- Gemini Protocol UI (coming soon)
Components
Install
Environment
Debian
wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
dpkg -i manticore-repo.noarch.deb
apt update
apt install git composer manticore manticore-extra php-fpm php-mbstring
The Yo! search engine uses Manticore as its primary database. If your server is prone to sudden power loss,
change the default binlog flush strategy to binlog_flush = 1
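To check whether the stricter flush strategy is actually in place, the config file can be inspected from the shell. The sketch below is only an illustration: the function name `binlog_flush_value` is hypothetical, and the default path `/etc/manticoresearch/manticore.conf` is an assumption about a typical Debian package install, so adjust it to your setup. In Manticore, `binlog_flush` is a directive of the `searchd` section.

```shell
# Report the binlog_flush value configured in a Manticore config file.
# Prints "unset" when the directive is absent (Manticore then uses its default).
# The default path below is an assumption about a Debian package install.
binlog_flush_value() {
    conf="${1:-/etc/manticoresearch/manticore.conf}"
    # Take the last uncommented "binlog_flush = N" line, if any.
    value=$(sed -n 's/^[[:space:]]*binlog_flush[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1)
    echo "${value:-unset}"
}

# Example (not executed here):
# binlog_flush_value /etc/manticoresearch/manticore.conf
```

If the function prints anything other than 1, add or edit the directive in the `searchd` section and restart the Manticore service.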
Deployment
git clone https://github.com/YGGverse/Yo.git
cd Yo
git checkout gemini
composer update
Development
git clone https://github.com/YGGverse/Yo.git
cd Yo
git checkout gemini
git checkout -b pr-branch
git commit -m 'new fix'
git push -u origin pr-branch
Update
cd Yo
git pull
composer update
Init
cp example/config.json config.json
php src/cli/index/init.php
Usage
php src/cli/document/add.php URL
php src/cli/document/crawl.php
php src/cli/document/search.php '*'
Gemini UI
Coming soon..
Documentation
CLI
Index
Init
Create initial index
php src/cli/index/init.php [reset]
reset
- optional, reset existing index
Alter
Change existing index
php src/cli/index/alter.php {operation} {column} {type}
operation
- operation name, supported values: add | drop
column
- target column name
type
- target column type, supported values: text | integer
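Because a mistyped operation or type would only be caught once the script runs, a thin wrapper can reject unsupported values up front. This is a sketch, not part of Yo!: the function name `alter_index`, the `YO_ALTER_CMD` override and the column name `mime` in the example are all hypothetical, and it assumes all three arguments are always passed, as in the synopsis above.

```shell
# Validate arguments against the documented values, then call alter.php.
# YO_ALTER_CMD is a hypothetical override, handy for dry runs (e.g. "echo").
alter_index() {
    operation="$1"
    column="$2"
    col_type="$3"
    case "$operation" in
        add|drop) ;;
        *) echo "unsupported operation: $operation" >&2; return 2 ;;
    esac
    if [ -z "$column" ]; then
        echo "column name is required" >&2
        return 2
    fi
    case "$col_type" in
        text|integer) ;;
        *) echo "unsupported type: $col_type" >&2; return 2 ;;
    esac
    ${YO_ALTER_CMD:-php src/cli/index/alter.php} "$operation" "$column" "$col_type"
}

# Example (not executed here; "mime" is an illustrative column name):
# alter_index add mime text
```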
Document
Add
php src/cli/document/add.php URL
URL
- add new URL to the crawl queue
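Since add.php takes one URL per call, seeding the queue from a list is a matter of looping over it. The following is a minimal sketch under stated assumptions: the function name `queue_seeds`, the `YO_ADD_CMD` override and the seed file format (one URL per line, `#` for comments) are hypothetical conventions, not something Yo! defines.

```shell
# Queue every non-empty, non-comment line of a seed file via add.php.
# YO_ADD_CMD is a hypothetical override, handy for dry runs (e.g. "echo").
queue_seeds() {
    seed_file="$1"
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the rest of the list.
        ${YO_ADD_CMD:-php src/cli/document/add.php} "$url" < /dev/null
    done < "$seed_file"
}

# Example (not executed here; seeds.txt is an illustrative file name):
# queue_seeds seeds.txt
```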
Crawl
php src/cli/document/crawl.php
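When the crawler is started from crontab, a long run may still be in progress when the next one is due. One way to avoid overlapping runs is a simple lock around the command. The sketch below uses a `mkdir`-based lock; the function name `run_locked` and the lock path in the example are hypothetical, and whether crawl.php already guards against parallel runs is not stated in this document.

```shell
# Run a command only if no other run holds the lock directory.
# mkdir is atomic, so two concurrent callers cannot both acquire the lock.
run_locked() {
    lock_dir="$1"
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "already running, skipping" >&2
        return 1
    fi
    "$@"
    status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Example (not executed here; the lock path is illustrative):
# run_locked /tmp/yo-crawl.lock php src/cli/document/crawl.php
```

If the process is killed before it reaches `rmdir`, the lock directory stays behind and has to be removed by hand.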
Clean
Optimize the index and apply new configuration rules
php src/cli/document/clean.php [limit]
limit
- integer, number of documents per queue
Search
php src/cli/document/search.php '@title "*"' [limit]
query
- required
limit
- optional search results limit
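The query has to reach search.php as a single shell argument, which is why the example above wraps it in single quotes. As an illustration, a small helper can build a field-restricted query and pass it on correctly quoted. The names `search_field` and `YO_SEARCH_CMD` are hypothetical, and the helper does not escape double quotes inside the phrase.

```shell
# Build a query of the form @field "phrase" and pass it as one argument.
# YO_SEARCH_CMD is a hypothetical override, handy for dry runs (e.g. "echo").
search_field() {
    field="$1"
    phrase="$2"
    limit="${3:-}"
    query="@${field} \"${phrase}\""
    if [ -n "$limit" ]; then
        ${YO_SEARCH_CMD:-php src/cli/document/search.php} "$query" "$limit"
    else
        ${YO_SEARCH_CMD:-php src/cli/document/search.php} "$query"
    fi
}

# Example (not executed here):
# search_field title gemini 10
```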
Backup
Logical
SQL text dumps can be useful for public index distribution, but require more computing resources.
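A logical dump can be taken with the standard `mysqldump` client over Manticore's MySQL-protocol listener. The sketch below assumes the listener is on 127.0.0.1:9306 (the usual default) and uses `manticore` as the database name, as in the Manticore documentation. The function name `logical_backup` and the `MYSQLDUMP`, `MANTICORE_HOST` and `MANTICORE_PORT` variables are hypothetical.

```shell
# Write an SQL text dump of the Manticore tables to the given file.
# Host and port defaults are assumptions; override them via the environment.
logical_backup() {
    out_file="$1"
    ${MYSQLDUMP:-mysqldump} \
        -h "${MANTICORE_HOST:-127.0.0.1}" \
        -P "${MANTICORE_PORT:-9306}" \
        manticore > "$out_file"
}

# Example (not executed here; the file name is illustrative):
# logical_backup yo-index.sql
```

The resulting file is plain SQL, so it can be compressed and published, and loaded on another instance with a MySQL client connected to the same port.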
Physical
Physical backups are better suited to infrastructure administration and include the original data binaries.
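For a physical backup, Manticore provides the `manticore-backup` tool, which should already be present if the manticore-extra package from the Install section is installed. The sketch below only wraps its `--backup-dir` option; the function name `physical_backup`, the `MANTICORE_BACKUP` override and the target path in the example are hypothetical.

```shell
# Copy the Manticore data files into the given directory.
# MANTICORE_BACKUP is a hypothetical override, handy for dry runs (e.g. "echo").
physical_backup() {
    backup_dir="$1"
    mkdir -p "$backup_dir" || return 1
    ${MANTICORE_BACKUP:-manticore-backup} --backup-dir="$backup_dir"
}

# Example (not executed here; the path is illustrative):
# physical_backup /var/backups/yo
```

The tool usually needs to run as root or as the user that owns the Manticore data directory, since it reads the table files directly.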
Instances
Coming soon..