# Yo! Micro Web Crawler in PHP & Manticore

Yo! is a super thin client-server crawler based on [Manticore](https://github.com/manticoresoftware) full-text search.\
Compatible with different networks, it includes flexible settings, history snaps, CLI tools and a UI for [Gemini Protocol](https://geminiprotocol.net).

To use the `HTTP` version, please check out the [main branch](https://github.com/YGGverse/Yo)!

## Features

* MIME-based crawler with flexible filter settings by regular expressions, selectors, external links, etc.
* Page snap history with local and remote mirror support (including the FTP protocol)
* CLI tools for index administration and crontab tasks
* Gemini Protocol UI (coming soon)

## Components

* [Manticore Server](https://github.com/manticoresoftware/manticoresearch)
* [PHP library for Manticore](https://github.com/manticoresoftware/manticoresearch-php)
* [PHP library for Gemini Protocol](https://github.com/YGGverse/gemini-php)
* [PHP library for Network operations](https://github.com/YGGverse/net-php)
* [FTP client for snap mirrors](https://github.com/YGGverse/ftp-php)

### Install

#### Environment

##### Debian

* `wget https://repo.manticoresearch.com/manticore-repo.noarch.deb`
* `dpkg -i manticore-repo.noarch.deb`
* `apt update`
* `apt install git composer manticore manticore-extra memcached php-fpm php-mbstring php-memcached`

Yo search engine uses Manticore as the primary database.
If your server is sensitive to power loss, change the default [binlog flush strategy](https://manual.manticoresearch.com/Logging/Binary_logging#Binary-flushing-strategies) to `binlog_flush = 1`.

#### Deployment

* `git clone https://github.com/YGGverse/Yo.git`
* `cd Yo`
* `git checkout gemini`
* `composer update`

#### Development

* `git clone https://github.com/YGGverse/Yo.git`
* `cd Yo`
* `git checkout gemini`
* `git checkout -b pr-branch`
* `git commit -m 'new fix'`
* `git push`

#### Update

* `cd Yo`
* `git pull`
* `composer update`

#### Init

* `cp example/config.json config.json`
* `php src/cli/index/init.php`

#### Usage

* `php src/cli/document/add.php URL`
* `php src/cli/document/crawl.php`
* `php src/cli/document/search.php '*'`

#### Gemini UI

Coming soon..

## Documentation

### CLI

#### Index

##### Init

Create the initial index

```
php src/cli/index/init.php [reset]
```

* `reset` - optional, reset the existing index

##### Alter

Change the existing index

```
php src/cli/index/alter.php {operation} {column} {type}
```

* `operation` - operation name, supported values: `add`|`drop`
* `column` - target column name
* `type` - target column type, supported values: `text`|`integer`

#### Document

##### Add

```
php src/cli/document/add.php URL
```

* `URL` - add new URL to the crawl queue

##### Crawl

```
php src/cli/document/crawl.php
```

##### Clean

Optimize the index and apply new configuration rules

```
php src/cli/document/clean.php [limit]
```

* `limit` - integer, documents quantity per queue

##### Search

```
php src/cli/document/search.php '@title "*"' [limit]
```

* `query` - required
* `limit` - optional search results limit

### Backup

#### Logical

SQL text dumps can be useful for public index distribution, but require more computing resources.
[Read more](https://manual.manticoresearch.com/Securing_and_compacting_a_table/Backup_and_restore#Backup-and-restore-with-mysqldump)

#### Physical

Better for infrastructure administration and includes original data binaries.
[Read more](https://manual.manticoresearch.com/Securing_and_compacting_a_table/Backup_and_restore#Using-manticore-backup-command-line-tool)

## Instances

Coming soon..
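
## Example

A minimal sketch, not part of Yo itself: the `add.php` command documented in the CLI section queues one URL per call, so seeding the crawler from a list needs a small loop. The function name `queue_urls` and the `PHP_BIN` and `YO_DIR` variables are assumptions made for illustration; only the `src/cli/document/add.php` path comes from this README.

```shell
#!/bin/sh
# Hypothetical helper variables (not defined by Yo):
#   PHP_BIN - PHP interpreter to run, defaults to "php"
#   YO_DIR  - path to the cloned Yo repository, defaults to the current directory
PHP_BIN="${PHP_BIN:-php}"
YO_DIR="${YO_DIR:-.}"

# queue_urls FILE
# Reads FILE line by line and passes every URL to add.php.
# Blank lines and lines starting with "#" are skipped.
queue_urls() {
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the rest of FILE
        "$PHP_BIN" "$YO_DIR/src/cli/document/add.php" "$url" < /dev/null || return 1
    done < "$1"
}
```

For instance, `queue_urls urls.txt && php src/cli/document/crawl.php` would queue every listed address and then run the crawler once; `urls.txt` is a hypothetical file with one URL per line.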