YGGverse/YGGo: YGGo! Distributed Web Search Engine - YGGo

Written by inspiration to explore Yggdrasil ecosystem, because of last YaCy node there was discontinued. This engine also could be useful for crawling regular websites, small business resources, local networks.

The project goal - simple interface, clear architecture and lightweight server requirement.

Overview

https://github.com/YGGverse/YGGo/tree/main/media

Online instances

http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yggo
http://94.140.114.241/yggo/

Requirements

php8^
php-dom
php-pdo
php-curl
php-gd
php-mysql
sphinxsearch

Installation

The web root dir is /public
Deploy the database using MySQL Workbench project presented in the /database folder
Install Sphinx Search Server
Configuration examples are placed at /config folder
Make sure /storage folder is writable
Set up the /crontab scripts by following example

JSON API

Build third party applications / index distribution.

Could be enabled or disabled by API_ENABLED option

Address

/api.php

Search

Returns search results.

Could be enabled or disabled by API_SEARCH_ENABLED option

Request attributes

GET action=search  - required
GET query={string} - optional, search request, empty if not provided
GET type={string}  - optional, search type, image|default or empty
GET page={int}     - optional, search results page, 1 if not provided
GET mode=SphinxQL  - optional, enable extended SphinxQL syntax

Hosts distribution

Returns node hosts collected with fields provided in API_HOSTS_FIELDS option.

Could be enabled or disabled by API_HOSTS_ENABLED option

Request attributes

GET action=hosts - required

Application manifest

Returns node information.

Could be enabled or disabled by API_MANIFEST_ENABLED option

Request attributes

GET action=manifest - required

Search textual filtering

Default constructions

operator OR:

hello | world

operator MAYBE:

hello MAYBE world

operator NOT:

hello -world

strict order operator (aka operator "before"):

aaa << bbb << ccc

exact form modifier:

raining =cats and =dogs

field-start and field-end modifier:

^hello world$

keyword IDF boost modifier:

boosted^1.234 boostedfieldend$^1.234

Extended syntax

https://sphinxsearch.com/docs/current.html#extended-syntax

Could be enabled with following attributes

GET m=SphinxQL

Roadmap

Basic features

Web pages full text ranking search
Images search with safe proxy preview support
Extended syntax support
Flexible settings compatible with IPv4/IPv6 networks

UI

CSS only, JS-less interface
Unique ident icons for sites without favicons
Results with found matches highlight
Content genre tabs (#1)
The time machine feature by content history cache preview
Link clicks counter

API

Index API
- Manifest
- Search
  - Pages
  - Images
- Hosts
- Pages
- Images
Context advertising API

Crawler

Auto crawl links by regular expression rules
Robots.txt / robots meta tags support (#2)
Specific rules configuration for every host
Deprecated index auto cleaner
Auto stop crawling on disk quota reached
Transactions support to prevent data loss on queue failures
Distributed index crawling between YGGo nodes trough manifest API
MIME Content-type crawler settings
Indexing new sites homepage in higher priority
Redirect codes extended processing
Palette image index / filter
Crawl queue balancer, that depends of CPU available

Other

Administrative panel for useful index moderation
Deployment tools

Contributions

Please make a new branch of master|sqliteway tree for each patch in your fork before create PR

git checkout master
git checkout -b my-pr-branch-name

Donate to contributors

@d47081: BTC | DOGE | Support our server by order Linux VPS

License

Engine sources MIT License
Home page animation by alvarotrigo

Feedback

Please, feel free to share your ideas and bug reports here or use sources for your own implementations.

Have a good time.