# Yo!
Micro Web Crawler in PHP & Manticore

Yo! is a thin layer on top of the Manticore search server. It extends the official [manticoresearch-php](https://github.com/manticoresoftware/manticoresearch-php) client with CLI tools and a simple JS-less web UI.
## Features
* MIME-based crawler with flexible filters (regular expressions, selectors, external links, etc.)
* Page snap history with support for local and remote mirrors (including FTP)
* CLI tools for index administration and crontab tasks
* JS-less frontend for running a local or public search portal
* API tools for distributing the search index
## Components
* [Manticore Server](https://github.com/manticoresoftware/manticoresearch)
* [PHP library for Manticore](https://github.com/manticoresoftware/manticoresearch-php)
* [Symfony DOM crawler](https://github.com/symfony/dom-crawler)
* [Symfony CSS selector](https://github.com/symfony/css-selector)
* [FTP client for snap mirrors](https://github.com/YGGverse/ftp-php)
* [Hostname ident icons](https://github.com/dmester/jdenticon-php)
* [Bootstrap icons](https://icons.getbootstrap.com/)
### Install
1. Install `manticore`, `composer` and `php`
2. Get the latest `Yo` version: `git clone https://github.com/YGGverse/Yo.git`
3. Run `composer update` inside the project directory
4. Copy and customize the config file: `cp example/config.json config.json`
5. Make sure the `storage` folder is writable
6. Initialize the indexes: `php src/cli/index/init.php`
7. Add a URL to the crawl queue: `php src/cli/document/add.php URL`
8. Run the crawler to fetch the data: `php src/cli/document/crawl.php`
9. Test the search results: `php src/cli/document/search.php '*'`
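
Steps 6 to 9 can be chained into one helper for a first run. The following is only a sketch: the function name `yo_bootstrap` and the `PHP_BIN` variable are hypothetical and not part of Yo. It assumes the function is called from the project directory after steps 1 to 5 are done.

```shell
# PHP_BIN is an assumption of this sketch, not a Yo setting.
# Set PHP_BIN=echo to print the commands instead of running them.
PHP_BIN="${PHP_BIN:-php}"

# yo_bootstrap URL
# Hypothetical helper: initializes the indexes, queues URL,
# crawls it, then runs a test search. Stops at the first failing step.
yo_bootstrap() {
    seed_url="$1"
    [ -n "$seed_url" ] || { echo "usage: yo_bootstrap URL" >&2; return 1; }

    "$PHP_BIN" src/cli/index/init.php &&
    "$PHP_BIN" src/cli/document/add.php "$seed_url" &&
    "$PHP_BIN" src/cli/document/crawl.php &&
    "$PHP_BIN" src/cli/document/search.php '*'
}
```

For example, `PHP_BIN=echo; yo_bootstrap http://example.com/` previews the four commands without touching the index.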
#### Web UI
1. `cd src/webui`
2. `php -S 127.0.0.1:8080`
3. Open `http://127.0.0.1:8080` in a browser
## Documentation
### CLI
#### Index
##### Init
Create initial index
```
php src/cli/index/init.php [reset]
```
* `reset` - optional, reset existing index
#### Document
##### Add
```
php src/cli/document/add.php URL
```
* `URL` - required, new URL to add to the crawl queue
##### Crawl
```
php src/cli/document/crawl.php
```
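
When the crawler is scheduled from crontab, a slow run can overlap with the next one. A minimal sketch of one way to guard against that uses `mkdir` as an atomic lock. The function name `run_locked`, the file name `cron-lock.sh`, the lock path and the schedule are all assumptions of this example, not part of Yo.

```shell
# run_locked LOCK_DIR COMMAND [ARGS...]
# Runs COMMAND only if LOCK_DIR can be created. mkdir is atomic,
# so two concurrent callers cannot both acquire the lock.
run_locked() {
    lock_dir="$1"
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "skipped: previous run still holds $lock_dir" >&2
        return 0
    fi
    "$@"
    status=$?
    rmdir "$lock_dir"   # release the lock whether COMMAND succeeded or not
    return "$status"
}

# Example crontab entry (hypothetical path, file name and schedule):
# * * * * * cd /path/to/Yo && . ./cron-lock.sh && run_locked /tmp/yo-crawl.lock php src/cli/document/crawl.php
```

If a run is killed before `rmdir`, the lock directory stays behind and has to be removed by hand.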
##### Clean
```
php src/cli/document/clean.php
```
* removes `url` duplicates
* optimizes the index
##### Search
```
php src/cli/document/search.php '@title "*"' [limit]
```
* `query` - required, search query such as `'@title "*"'`
* `limit` - optional, maximum number of search results
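
The query in the synopsis above uses the `@title` field operator, so the query string needs two levels of quoting: one for the shell and one for the phrase. As a sketch, a small wrapper can build it. The function name `yo_search_title`, the `PHP_BIN` variable and the fallback limit of 10 are assumptions of this example, not part of Yo.

```shell
# PHP_BIN is an assumption of this sketch, not a Yo setting.
# Set PHP_BIN=echo to print the command instead of running it.
PHP_BIN="${PHP_BIN:-php}"

# yo_search_title PHRASE [LIMIT]
# Hypothetical wrapper: searches the `title` field only by passing
# @title "PHRASE" as the query argument.
yo_search_title() {
    phrase="$1"
    limit="${2:-10}"   # 10 is this sketch's fallback, not the tool's default
    "$PHP_BIN" src/cli/document/search.php "@title \"$phrase\"" "$limit"
}
```

For example, `yo_search_title yggdrasil 5` passes `@title "yggdrasil"` as the query and `5` as the limit. A phrase that itself contains a double quote is not escaped by this sketch.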
##### Migration
###### YGGo
Import index from YGGo database
```
php src/cli/yggo/import.php 'host' 'port' 'user' 'password' 'database' [unique=off] [start=0] [limit=100]
```
Source DB fields required:
* `host`
* `port`
* `user`
* `password`
* `database`
* `unique` - optional, check that each URL is unique (takes more time), default `off`
* `start` - optional, queue offset to start from, default `0`
* `limit` - optional, queue limit, default `100`
## Instances
### Yggdrasil
* `http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yo/`