Micro Web Crawler in PHP & Manticore

Yo!

Micro Web Crawler in PHP & Manticore

Yo! is a super thin layer for the Manticore search server that extends the official manticoresearch-php client with CLI tools and a simple JS-less web UI.

Features

  • MIME-based crawler with flexible filtering by regular expressions, selectors, external links, etc.
  • Page snap history with support for local and remote mirrors (including FTP)
  • CLI tools for index administration and crontab tasks
  • JS-less frontend for running a local or public search web portal
  • API tools for making the search index distributed

Components

Install

  1. Install manticore, composer and php
  2. Grab the latest Yo version: git clone https://github.com/YGGverse/Yo.git
  3. Run composer update inside the project directory
  4. Copy and customize the config file: cp example/config.json config.json
  5. Make sure the storage folder is writable
  6. Run the index initialization script: php src/cli/index/init.php
  7. Announce a new URL: php src/cli/document/add.php URL
  8. Run the crawler to grab the data: php src/cli/document/crawl.php
  9. Test the search results: php src/cli/document/search.php '*'
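
The install steps above can be collected into one script. The following is a minimal sketch, not part of the project: the `run` and `yo_install` helpers and the `DRY_RUN` variable are hypothetical names, the seed URL is a placeholder, and it assumes manticore, composer and php are already installed (step 1). Step 5 is left out because the storage folder path depends on your config. By default the commands are only printed, so the sequence can be reviewed before anything is executed.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed,
# unless DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 2-4 and 6-9 of the install list, stopping at the first failure.
# $1: first URL to queue (placeholder value if omitted).
yo_install() {
    seed_url="${1:-http://example.com/}"
    run git clone https://github.com/YGGverse/Yo.git &&
    run cd Yo &&
    run composer update &&
    run cp example/config.json config.json &&
    run php src/cli/index/init.php &&
    run php src/cli/document/add.php "$seed_url" &&
    run php src/cli/document/crawl.php &&
    run php src/cli/document/search.php '*'
}

# Example: yo_install 'http://example.com/'
```

Calling `yo_install` with `DRY_RUN=0` would run the commands for real; check that the storage folder is writable before the crawl step.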

Web UI

  1. cd src/webui
  2. php -S 127.0.0.1:8080
  3. Open http://127.0.0.1:8080 in a browser
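
The same steps can be wrapped in a small helper when the address or port needs to change. This is only a sketch: `yo_webui` and `YO_PHP` are hypothetical names, and it assumes the function is called from the project root.

```shell
# PHP binary to use; override YO_PHP to point at a specific build.
YO_PHP="${YO_PHP:-php}"

# Start PHP's built-in server from src/webui, as in the steps above.
# $1: listen address (default 127.0.0.1), $2: port (default 8080).
yo_webui() {
    host="${1:-127.0.0.1}"
    port="${2:-8080}"
    # The subshell keeps the caller's working directory unchanged.
    ( cd src/webui && "$YO_PHP" -S "$host:$port" )
}

# Example: yo_webui 127.0.0.1 8080
```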

Documentation

CLI

Index

Init

Create the initial index

php src/cli/index/init.php [reset]
  • reset - optional, reset the existing index

Document

Add
php src/cli/document/add.php URL
  • URL - required, URL to add to the crawl queue
Crawl
php src/cli/document/crawl.php
Clean
php src/cli/document/clean.php
  • remove URL duplicates
  • optimize the index
Search
php src/cli/document/search.php '@title "*"' [limit]
  • query - required, search query (for example '@title "*"')
  • limit - optional, search results limit
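
The document commands combine naturally into a batch workflow: queue a list of URLs, then crawl and clean in one pass, which is the kind of job a crontab task could run. The sketch below assumes it is run from the project root; `yo_add_from_file`, `yo_cycle` and `YO_PHP` are hypothetical names, and the one-URL-per-line file format is an assumption of this example, not something Yo defines.

```shell
# PHP binary to use; override YO_PHP to point at a specific build.
YO_PHP="${YO_PHP:-php}"

# Queue every URL listed in a file ($1), one per line.
# Empty lines and lines starting with '#' are skipped.
yo_add_from_file() {
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # </dev/null keeps the command from consuming the URL list.
        "$YO_PHP" src/cli/document/add.php "$url" </dev/null || return 1
    done < "$1"
}

# One maintenance pass: crawl the queue, then clean the index.
yo_cycle() {
    "$YO_PHP" src/cli/document/crawl.php &&
    "$YO_PHP" src/cli/document/clean.php
}

# Example: yo_add_from_file urls.txt && yo_cycle
```

Running the clean step only after a successful crawl keeps a failed crawl visible in the exit status instead of masking it.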
Migration
YGGo

Import index from YGGo database

php src/cli/yggo/import.php 'host' 'port' 'user' 'password' 'database' [unique=off] [start=0] [limit=100]

Arguments (the first five are the source DB connection settings and are required):

  • host
  • port
  • user
  • password
  • database
  • unique - optional, check for unique URLs (takes more time), default off
  • start - optional, queue offset to start from, default 0
  • limit - optional, queue limit, default 100
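
Because the import takes five positional connection arguments, a thin wrapper can make the call less error-prone and keep credentials out of shell history. This is a sketch under assumptions: the `YGGO_*` environment variable names, `yo_import_yggo` and `YO_PHP` are hypothetical, and the optional arguments are omitted so the defaults shown above (unique=off, start=0, limit=100) apply.

```shell
# PHP binary to use; override YO_PHP to point at a specific build.
YO_PHP="${YO_PHP:-php}"

# Run the YGGo import with connection settings taken from the environment.
yo_import_yggo() {
    # Abort with a message if any required setting is missing.
    : "${YGGO_HOST:?set YGGO_HOST}" "${YGGO_PORT:?set YGGO_PORT}"
    : "${YGGO_USER:?set YGGO_USER}" "${YGGO_PASSWORD:?set YGGO_PASSWORD}"
    : "${YGGO_DATABASE:?set YGGO_DATABASE}"
    # Positional order follows the usage line: host port user password database.
    "$YO_PHP" src/cli/yggo/import.php \
        "$YGGO_HOST" "$YGGO_PORT" "$YGGO_USER" \
        "$YGGO_PASSWORD" "$YGGO_DATABASE"
}

# Example: export the five YGGO_* variables, then call yo_import_yggo
```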

Instances

Yggdrasil

  • http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yo/