Project archived. Please visit Yo!, the next generation of the YGGo project, based on Manticore Search.
# YGGo! - Distributed Web Search Engine
Written out of inspiration to explore the Yggdrasil ecosystem. The engine can be useful for crawling regular websites, small business resources, and local networks.

The project goals: a simple interface, clear architecture, and lightweight server requirements.
## Overview
https://github.com/YGGverse/YGGo/tree/main/media
## Online instances
http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yggo/
## Database snaps

- 17-09-2023: http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yggtracker/en/torrent/15
## Requirements

```
php8^
php-dom
php-xml
php-pdo
php-curl
php-gd
php-mbstring
php-zip
php-mysql
php-memcached
memcached
sphinxsearch
```
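On a Debian-like system these could be installed roughly as follows (a sketch only; package names and availability vary by distribution and PHP version, and `sphinxsearch` may have to be installed from the Sphinx site on newer releases):

```
# Debian/Ubuntu sketch - package names vary by distribution and PHP version
sudo apt install php php-xml php-curl php-gd php-mbstring php-zip \
  php-mysql php-memcached memcached sphinxsearch
```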
## Installation

```
git clone https://github.com/YGGverse/YGGo.git
cd YGGo
composer install
```
## Setup

- Server configuration examples are in `/example/environment`
- The web root dir is `/src/public`
- Deploy the database using the MySQL Workbench project in the `/database` folder
- Install Sphinx Search Server
- Configuration examples are in the `/config` folder
- Make sure the `/src/storage/cache`, `/src/storage/tmp`, `/src/storage/snap` folders are writable
- Set up `/src/crontab` by following the example (see the sketch after this list)
- To start the crawler, add at least one initial URL using the search form or CLI
## JSON API

Build third-party applications / distribute the index.

Can be enabled or disabled with the `API_ENABLED` option.

### Address

```
/api.php
```
### Search

Returns search results.

Can be enabled or disabled with the `API_SEARCH_ENABLED` option.

#### Request attributes

```
GET action=search - required
GET query={string} - optional, search request, empty if not provided
GET type={string} - optional, filter by one of the available MIME types, empty if not provided
GET page={int} - optional, search results page, 1 if not provided
GET mode=SphinxQL - optional, enable extended SphinxQL syntax
```
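For example, a search request could look like this (a sketch assuming an instance reachable at `127.0.0.1`; substitute your own host and path):

```
# hypothetical host - replace with your instance address
curl 'http://127.0.0.1/api.php?action=search&query=yggdrasil&page=1'
```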
### Hosts distribution

Returns the collected hosts, with the fields provided in the `API_HOSTS_FIELDS` option.

Can be enabled or disabled with the `API_HOSTS_ENABLED` option.

#### Request attributes

```
GET action=hosts - required
```
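A matching request sketch, using the same hypothetical host as above:

```
curl 'http://127.0.0.1/api.php?action=hosts'
```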
### Application manifest

Returns node information for other nodes that share the same `CRAWL_MANIFEST_API_VERSION` and `DEFAULT_HOST_URL_REGEXP` conditions.

Can be enabled or disabled with the `API_MANIFEST_ENABLED` option.

#### Request attributes

```
GET action=manifest - required
```
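And a sketch for the manifest endpoint, again with a hypothetical host:

```
curl 'http://127.0.0.1/api.php?action=manifest'
```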
## Search textual filtering

### Default constructions

- word prefix: `yg*`
- operator OR: `hello | world`
- operator MAYBE: `hello MAYBE world`
- operator NOT: `hello -world`
- strict order operator (aka operator "before"): `aaa << bbb << ccc`
- exact form modifier: `raining =cats and =dogs`
- field-start and field-end modifiers: `^hello world$`
- keyword IDF boost modifier: `boosted^1.234 boostedfieldend$^1.234`
### Extended syntax

https://sphinxsearch.com/docs/current.html#extended-syntax

Can be enabled with the following attribute:

```
GET m=SphinxQL
```
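For example, an OR query with extended syntax could be sent like this (a sketch with a hypothetical host, using the `mode=SphinxQL` attribute documented for the search action above; operators such as `|` must be URL-encoded):

```
# 'hello | world' with the space and pipe URL-encoded
curl 'http://127.0.0.1/api.php?action=search&query=hello%20%7C%20world&mode=SphinxQL'
```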
## Roadmap

### Basic features

- Web pages full-text ranking search
  - Sphinx
- Unlimited content MIME crawling
- Flexible settings compatible with IPv4/IPv6 networks
- Extended search syntax support
- Compressed page history snaps with multi-provider storage sync
  - Local (unlimited locations)
  - Remote FTP (unlimited mirrors)
- Privacy-oriented download counting, traffic controls
### UI

- CSS-only, JS-less interface
- Unique host identicons
- Content MIME tabs (#1)
- Page index explorer
  - Meta
  - Snaps history
  - Referrers
- Top hosts page
- Safe media preview
- Results with found matches highlighted
- Time machine feature based on content snaps history
### API

- Index API
  - Manifest
  - Search
  - Hosts
  - Snaps
- Context advertising API
### Crawler

- Auto crawl links by regular expression rules
  - Pages
  - Manifests
- Robots.txt / robots meta tag support (#2)
  - Specific rules configuration for every host
- Auto stop crawling when the disk quota is reached
- Transactions support to prevent data loss on queue failures
- Distributed index crawling between YGGo nodes through the manifest API
- MIME Content-Type settings
  - Ban links that do not match the conditions, to prevent extra requests
- Debug log
- Index homepages and shorter URIs with higher priority
- Collect target location links when a page redirect is available
- Collect referrer pages (including redirects)
- URL aliasing support in PR calculation
- Host page DOM element collection by CSS selectors
  - Custom settings for each host
- XML feed support
  - Sitemap
  - RSS
  - Atom
- Palette image index / filter
- Crawl queue balancer that depends on available CPU
- Networks integration
### Cleaner

- Banned pages reset by timeout
- DB tables optimization
### CLI

*The CLI interface is still under construction; use it at your own risk!*

- help
- db
  - optimize
- crontab
  - crawl
  - clean
- hostSetting
  - get
  - set
  - list
  - delete
  - flush
- hostPage
  - add
  - rank
    - reindex
- hostPageSnap
  - repair
    - db
    - fs
  - reindex
  - truncate
  - repair
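A hypothetical invocation sketch; the CLI entry point path here is an assumption, so check the `/src` folder of your checkout for the actual script:

```
# assumed entry point - verify against your checkout
php src/cli/yggo.php help
```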
### Other

- Administrative panel for index moderation
- Deployment tools
- Testing
- Documentation
## Contributions

Please create a new branch from the main|sqliteway tree for each patch in your fork before creating a PR:

```
git checkout main
git checkout -b my-pr-branch-name
```

See also: SQLite tree
Donate to contributors
## License

- Engine sources: MIT License
- Home page animation by alvarotrigo
- CLI logo by patorjk.com
- Transliteration by php-translit
- Identicons by jdenticon
## Feedback
Feel free to share your ideas and bug reports!