# Yo! Micro Web Crawler in PHP & Manticore

Next generation of the [YGGo!](https://github.com/YGGverse/YGGo) project, with the goal of reducing server requirements and simplifying deployment.

- The index model is now a distributed cluster, designed to aggregate search results from different instances through the API
- The data exchange model was refactored to drop all primary key dependencies
- Snaps now use tar.gz compression to reduce storage requirements, and still support remote mirrors, including FTP
- The codebase follows minimalist principles throughout

## Implementation

The engine is written in PHP and uses [Manticore](https://github.com/manticoresoftware) as the backend.

The default build is inspired by and adapted for the [Yggdrasil](https://github.com/yggdrasil-network) ecosystem, but it can be used to build your own search project.

## Components

* CLI tools for index operations
* JS-less frontend for building a search web portal
* API tools for making the search index distributed

## Features

* MIME-based crawler with flexible filter settings
* Page snap history with local and remote mirror support

## Install

1. Install `composer`, `php` and `manticore`
2. Grab the latest `Yo` version: `git clone https://github.com/YGGverse/Yo.git`
3. Run `composer update` inside the project directory
4. Check `src/config.json` for any customizations
5. Make sure the `storage` folder is writable
6. Run the index init script: `php src/cli/index/init.php`
7. [Start crawling!](https://github.com/YGGverse/Yo#documentation)
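
Steps 2 to 6 can be sketched as a small shell script. This is a minimal sketch under a few assumptions, not an official installer: the `run` helper, the `yo_install` function and the `DRY_RUN` variable are hypothetical names introduced here, the `storage` folder is assumed to sit in the project root, and `chmod -R u+w` is only one possible way to make it writable. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Hypothetical install sketch; nothing here ships with Yo.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it in the current shell.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

yo_install() {
    run git clone https://github.com/YGGverse/Yo.git &&
    run cd Yo &&
    run composer update &&
    # Step 4 (reviewing src/config.json) is manual and not scripted here.
    # Assumption: storage lives in the project root and the current user runs PHP.
    run chmod -R u+w storage &&
    run php src/cli/index/init.php
}

yo_install
```

Running it with `DRY_RUN=0` stops at the first failing step, because the commands are chained with `&&`.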
## Documentation
### CLI
#### Index
##### Init
Create the initial index:
```
php src/cli/index/init.php [reset]
```
* `reset` - optional, reset existing index
#### Document
##### Add
```
php src/cli/document/add.php URL
```
* `URL` - URL to add to the crawl queue
##### Crawl
```
php src/cli/document/crawl.php
```
##### Search
```
php src/cli/document/search.php '@title "*"' [limit]
```
* `query` - required, search query (`'@title "*"'` in the example above)
* `limit` - optional, maximum number of search results
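
One way to chain the document commands above is to queue a list of URLs and then run the crawler. The sketch below is a hypothetical wrapper, not part of Yo: `yo_seed_and_crawl` and the `PHP_BIN` variable are names invented for this example, and it assumes it is run from the project root so that the relative `src/cli/...` paths resolve.

```shell
#!/bin/sh
# Hypothetical helper: queue every URL read from stdin, then run the crawler.
# PHP_BIN is an assumption of this sketch; it selects the PHP binary to use.
yo_seed_and_crawl() {
    php_bin="${PHP_BIN:-php}"
    while IFS= read -r url; do
        # Skip blank lines and lines starting with '#'.
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the remaining URL list.
        "$php_bin" src/cli/document/add.php "$url" </dev/null || return 1
    done
    "$php_bin" src/cli/document/crawl.php
}
```

For example, `printf '%s\n' 'http://example.com/' | yo_seed_and_crawl` queues one placeholder URL and starts the crawler; the results can then be checked with the search command shown above.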