# Yo! Micro Web Crawler in PHP & Manticore
Next generation of the [YGGo!](https://github.com/YGGverse/YGGo) project, with the goal of reducing server requirements and simplifying deployment.

- The index model has changed to a distributed cluster model, oriented toward aggregating search results from network instances through the API
- The data exchange model has been refactored to drop all internal key dependencies
- Snaps now use tar.gz compression to reduce storage requirements, and still support remote mirrors, including FTP
- Minimalism everywhere
## Implementation
The engine is written in PHP 8 and uses [Manticore](https://github.com/manticoresoftware) as the backend.

The default build is adapted for [Yggdrasil](https://github.com/yggdrasil-network), but it can also be used to build an Internet search portal.
## Components
* CLI tools for index operations
* JS-less frontend for building a web search portal
* API tools for distributing the search index
## Features
* MIME-based crawler with flexible filter settings
* Page snap history with support for local and remote mirrors
### Install
1. Install `composer`, `php` and `manticore`
2. Grab the latest `Yo` version: `git clone https://github.com/YGGverse/Yo.git`
3. Run `composer update` inside the project directory
4. Copy and customize the config file: `cp example/config.json config.json`
5. Make sure the `storage` folder is writable
6. Run the index init script: `php src/cli/index/init.php`
7. Add a new URL: `php src/cli/document/add.php URL`
8. Run the crawler: `php src/cli/document/crawl.php`
9. Get search results: `php src/cli/document/search.php '*'`
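
Before running the init script, the prerequisites from steps 1, 4 and 5 can be checked with a small shell sketch. The helper names `missing_commands` and `check_project` are hypothetical and not part of Yo. Manticore is not checked here, because how it is installed and started varies between systems.

```shell
#!/bin/sh
# Preflight sketch for the install steps above (hypothetical helpers, not part of Yo).

# Print the name of every given command that is missing from PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Check that config.json exists and that storage is a writable directory.
# $1: project directory (defaults to the current directory)
check_project() {
    dir="${1:-.}"
    rc=0
    if [ ! -f "$dir/config.json" ]; then
        echo "missing: config.json (copy it from example/config.json)"
        rc=1
    fi
    if [ ! -d "$dir/storage" ] || [ ! -w "$dir/storage" ]; then
        echo "not writable: storage"
        rc=1
    fi
    return "$rc"
}
```

Run `missing_commands php composer` and `check_project .` from the project directory. Empty output from both means the checked prerequisites are in place.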
#### Web UI
1. `cd src/webui`
2. `php -S 127.0.0.1:8080`
3. Open `http://127.0.0.1:8080` in your browser
## Documentation
### CLI
#### Index
##### Init
Create initial index
```
php src/cli/index/init.php [reset]
```
* `reset` - optional, reset existing index
#### Document
##### Add
```
php src/cli/document/add.php URL
```
* `URL` - add new URL to the crawl queue
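
Since `add.php` takes one URL per call, queueing many URLs is a matter of looping over a list. The following is a minimal sketch, assuming a plain text file with one URL per line. The `add_urls` function and the `YO_PHP` variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Yo): queue every URL listed in a text file.
# YO_PHP overrides the PHP binary, e.g. YO_PHP=echo for a dry run.
YO_PHP="${YO_PHP:-php}"

add_urls() {
    # $1: path to a file with one URL per line.
    # Blank lines and lines starting with # are skipped.
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the rest of the list.
        $YO_PHP src/cli/document/add.php "$url" < /dev/null
    done < "$1"
}
```

Call it from the project directory as `add_urls urls.txt`, then run the crawler as usual.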
##### Crawl
```
php src/cli/document/crawl.php
```
##### Search
```
php src/cli/document/search.php '@title "*"' [limit]
```
* `query` - required, the search query (`'@title "*"'` in the example above)
* `limit` - optional, maximum number of search results
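
Queries often contain `*` and double quotes, so they need to reach `search.php` as a single, unexpanded argument. The sketch below shows one way to run several queries in a row with safe quoting. The `search_all` function, the `YO_PHP` and `YO_LIMIT` variables, and the default limit of 10 are assumptions made for illustration, not part of Yo.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Yo): run several searches in a row.
# YO_PHP overrides the PHP binary; YO_LIMIT is the per-query result limit
# (the default of 10 is an arbitrary choice for this sketch).
YO_PHP="${YO_PHP:-php}"
YO_LIMIT="${YO_LIMIT:-10}"

search_all() {
    for query in "$@"; do
        # "$query" is quoted so the shell passes it as one argument and
        # neither expands * nor strips the inner double quotes.
        $YO_PHP src/cli/document/search.php "$query" "$YO_LIMIT"
    done
}
```

For example, `search_all '*' '@title "*"'` runs both queries from this README. Each query is wrapped in single quotes on the command line for the same reason.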