# Yo!
Micro Web Crawler in PHP & Manticore

Yo! is a super-thin client-server crawler built on [Manticore](https://github.com/manticoresoftware) full-text search.\
It is compatible with different networks and includes flexible settings, history snaps, CLI tools and a UI for the [Gemini Protocol](https://geminiprotocol.net).

To use the `HTTP` version, please check out the [main branch](https://github.com/YGGverse/Yo)!
## Features
* MIME-based crawler with flexible filtering by regular expressions, selectors, external links, etc.
* Page snap history with support for local and remote mirrors (including FTP)
* CLI tools for index administration and crontab tasks
* Gemini Protocol UI (coming soon)
## Components
* [Manticore Server](https://github.com/manticoresoftware/manticoresearch)
* [PHP library for Manticore](https://github.com/manticoresoftware/manticoresearch-php)
* [FTP client for snap mirrors](https://github.com/YGGverse/ftp-php)
### Install
#### Environment
##### Debian
* `wget https://repo.manticoresearch.com/manticore-repo.noarch.deb`
* `dpkg -i manticore-repo.noarch.deb`
* `apt update`
* `apt install git composer manticore manticore-extra php-fpm php-mbstring`

Yo uses Manticore as its primary database. If your server is vulnerable to unexpected power loss,
change the default [binlog flush strategy](https://manual.manticoresearch.com/Logging/Binary_logging#Binary-flushing-strategies) to `binlog_flush = 1`.
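As a quick sanity check after editing the configuration, the sketch below prints the `binlog_flush` value found in a config file. The helper name `binlog_flush_value` is hypothetical, and the path shown in the usage comment is an assumption about the Debian package layout, so adjust it to your installation.

```shell
#!/bin/sh
# Hypothetical helper: print the binlog_flush value set in a Manticore
# config file, or "unset" when no uncommented setting is found.
binlog_flush_value() {
    conf="$1"
    # Match lines like "    binlog_flush = 1". Commented lines start with '#'
    # and therefore do not match. If several lines match, the last is printed.
    value=$(sed -n 's/^[[:space:]]*binlog_flush[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$conf" | tail -n 1)
    echo "${value:-unset}"
}

# Usage (the path is an assumption, adjust to your installation):
#   binlog_flush_value /etc/manticoresearch/manticore.conf
```

If the helper prints `unset`, the server is running with its default flush strategy.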
#### Deployment
* `git clone https://github.com/YGGverse/Yo.git`
* `cd Yo`
* `git checkout gemini`
* `composer update`
#### Development
* `git clone https://github.com/YGGverse/Yo.git`
* `cd Yo`
* `git checkout gemini`
* `git checkout -b pr-branch`
* `git commit -m 'new fix'`
* `git push -u origin pr-branch`
#### Update
* `cd Yo`
* `git pull`
* `composer update`
#### Init
* `cp example/config.json config.json`
* `php src/cli/index/init.php`
#### Usage
* `php src/cli/document/add.php URL`
* `php src/cli/document/crawl.php`
* `php src/cli/document/search.php '*'`
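To seed the queue with more than one URL, the `add.php` call above can be wrapped in a loop. This is a minimal sketch, not part of the project: the helper name `add_urls` and the file name `seeds.txt` are hypothetical, and it assumes the commands are run from the repository root.

```shell
#!/bin/sh
# Hypothetical helper: queue every URL listed in a text file (one per line).
# Blank lines and lines starting with '#' are skipped.
# Run it from the repository root so the relative script path resolves.
add_urls() {
    list="$1"
    # The "|| [ -n ... ]" part handles a last line without a trailing newline
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # </dev/null keeps php from reading the rest of the list on stdin
        php src/cli/document/add.php "$url" </dev/null
    done < "$list"
}

# Usage (seeds.txt is a hypothetical file name):
#   add_urls seeds.txt && php src/cli/document/crawl.php
```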
#### Gemini UI
Coming soon..
## Documentation
### CLI
#### Index
##### Init
Create initial index
```
php src/cli/index/init.php [reset]
```
* `reset` - optional, reset existing index
##### Alter
Change existing index
```
php src/cli/index/alter.php {operation} {column} {type}
```
* `operation` - operation name, supported values: `add`|`drop`
* `column` - target column name
* `type` - target column type, supported values: `text`|`integer`
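Because a schema change is easy to mistype, a thin wrapper can reject unsupported values before `alter.php` runs. This is only a sketch: the function name `yo_alter` and the column name `lang` are hypothetical, and it assumes all three arguments are always passed, as in the synopsis above.

```shell
#!/bin/sh
# Hypothetical wrapper: validate arguments against the values documented
# above before calling alter.php. Run it from the repository root.
yo_alter() {
    if [ "$#" -ne 3 ]; then
        echo "usage: yo_alter {add|drop} {column} {text|integer}" >&2
        return 2
    fi
    case "$1" in
        add|drop) ;;
        *) echo "unsupported operation: $1" >&2; return 2 ;;
    esac
    case "$3" in
        text|integer) ;;
        *) echo "unsupported type: $3" >&2; return 2 ;;
    esac
    php src/cli/index/alter.php "$1" "$2" "$3"
}

# Usage ("lang" is a hypothetical column name):
#   yo_alter add lang text
```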
#### Document
##### Add
```
php src/cli/document/add.php URL
```
* `URL` - URL to add to the crawl queue
##### Crawl
```
php src/cli/document/crawl.php
```
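When the crawler is scheduled from crontab, a long run may still be active when the next one starts. The sketch below is one possible guard, not something the project ships: `run_locked` is a hypothetical helper that uses an atomic `mkdir` as a lock, and the paths in the usage comment are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if no other run holds the lock.
# mkdir is atomic, so only one caller can create the lock directory.
run_locked() {
    lock_dir="$1"
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "lock $lock_dir is held, skipping" >&2
        return 1
    fi
    "$@"
    status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Usage from a cron-invoked script (paths are assumptions):
#   cd /path/to/Yo && run_locked /tmp/yo-crawl.lock php src/cli/document/crawl.php
```

If a run is killed before it reaches `rmdir`, the lock directory stays behind and has to be removed by hand.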
##### Clean
Optimize the index and apply new configuration rules
```
php src/cli/document/clean.php [limit]
```
* `limit` - optional integer, number of documents per queue
##### Search
```
php src/cli/document/search.php '@title "*"' [limit]
```
* `query` - required, search query; quote it so the shell passes it as a single argument
* `limit` - optional search results limit
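The query usually contains characters the shell treats specially (`*`, `"`, `@`), so quoting matters. The sketch below is a hypothetical wrapper, `yo_search`, that passes the query through as one argument and adds the limit only when it is given. It assumes the command is run from the repository root.

```shell
#!/bin/sh
# Hypothetical wrapper around search.php. Run it from the repository root.
# $1 - query, passed as a single argument; $2 - optional results limit.
yo_search() {
    if [ "$#" -lt 1 ]; then
        echo "usage: yo_search QUERY [LIMIT]" >&2
        return 2
    fi
    if [ -n "${2:-}" ]; then
        php src/cli/document/search.php "$1" "$2"
    else
        php src/cli/document/search.php "$1"
    fi
}

# Usage: single quotes stop the shell from expanding * and from
# stripping the inner double quotes.
#   yo_search '@title "*"' 10
```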
### Backup
#### Logical
SQL text dumps can be useful for public index distribution but require more computing resources.
[Read more](https://manual.manticoresearch.com/Securing_and_compacting_a_table/Backup_and_restore#Backup-and-restore-with-mysqldump)
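A logical dump might be scripted as below. This is a sketch under assumptions: `yo_dump`, `YO_DB_HOST` and `YO_DB_PORT` are hypothetical names, the host and port must match your MySQL-protocol listener (9306 is the usual Manticore default), and `manticore` is the database name used in the linked manual's `mysqldump` example.

```shell
#!/bin/sh
# Hypothetical helper: dump the index over Manticore's MySQL-protocol
# listener into a dated SQL file and print the file name.
# Host and port are assumptions; override them through the environment.
yo_dump() {
    out_dir="${1:-.}"
    host="${YO_DB_HOST:-127.0.0.1}"
    port="${YO_DB_PORT:-9306}"
    file="$out_dir/yo-$(date +%Y-%m-%d).sql"
    mysqldump -h "$host" -P "$port" manticore > "$file" && echo "$file"
}

# Usage:
#   yo_dump /var/backups
# Restore, following the same manual page:
#   mysql -h 127.0.0.1 -P 9306 < /var/backups/yo-YYYY-MM-DD.sql
```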
#### Physical
Better suited to infrastructure administration; includes the original data binaries.
[Read more](https://manual.manticoresearch.com/Securing_and_compacting_a_table/Backup_and_restore#Using-manticore-backup-command-line-tool)
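A physical backup with the `manticore-backup` tool from the linked manual could be wrapped like this. The helper name `yo_backup` and the target directory in the usage comment are hypothetical. The sketch assumes the tool is on `PATH` and is run by a user allowed to read the Manticore data files.

```shell
#!/bin/sh
# Hypothetical helper: create a physical backup in the given directory
# using the manticore-backup tool described in the linked manual.
yo_backup() {
    backup_dir="${1:-}"
    if [ -z "$backup_dir" ]; then
        echo "usage: yo_backup BACKUP_DIR" >&2
        return 2
    fi
    # Make sure the target directory exists before the tool runs
    mkdir -p "$backup_dir" || return 1
    manticore-backup --backup-dir="$backup_dir"
}

# Usage (the directory is an assumption):
#   yo_backup /var/backups/yo
```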
## Instances
Coming soon..