YGGo/README.md at 6550eb310f13dcf9de1b996e80faaf217b36291f - YGGo

YGGo! Distributed Web Search Engine



YGGo! - Open Source Web Search Engine

Written to explore the Yggdrasil ecosystem after the last YaCy node there was discontinued. The engine may also be useful for crawling regular websites, small business resources, and local networks.

The project goal is a simple interface, a clear architecture, and lightweight server requirements.

Online examples

http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yggo
http://94.140.114.241/yggo/

Screenshots

https://github.com/YGGverse/YGGo/tree/main/media

License

Engine sources: MIT License
Home page animation by alvarotrigo

Requirements

php8^
php-dom
php-pdo
php-curl
php-gd
php-mysql
sphinxsearch

Installation

The webroot directory is /public.
The single configuration file is shipped as /config/app.php.txt; configure it and rename it to /config/app.php.
By design, the script automatically generates the database structure in the /storage folder (which is also a good place for other variable and temporary data such as logs). Make sure the storage folder is writable.
Set up the /crontab/crawler.php script to run every minute; the right interval mostly depends on the configuration and the volume of the target network. No debug output is implemented yet, so silence it by redirecting to /dev/null.
The script has no MVC model because it is very simple: it is just 2 files, with everything else encapsulated in the /library classes.
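
The configuration and storage steps above can be sketched as a small shell helper. This is only an illustration under assumptions: the function name yggo_prepare is hypothetical and not part of YGGo, it copies the sample configuration instead of renaming it so the template stays available, and it only sets permissions for the owning user, so the web server account may need different ownership on your system.

```shell
# Hypothetical helper (not part of YGGo): prepare a checkout located at "$1".
yggo_prepare() {
    root="$1"

    # Create config/app.php from the shipped sample, unless it already exists.
    # Copying instead of renaming is a choice made here to keep the template.
    if [ ! -f "$root/config/app.php" ]; then
        cp "$root/config/app.php.txt" "$root/config/app.php" || return 1
    fi

    # Ensure the storage folder exists and is writable by its owner.
    mkdir -p "$root/storage" || return 1
    chmod u+rwx "$root/storage"
}
```

After running `yggo_prepare /path/to/YGGo`, the values in config/app.php still have to be edited by hand, and the web server still has to be pointed at the public directory.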

Configuration

Crontab

0 * * * * indexer --all --rotate

0 0 * * * cd /YGGo/crontab && php cleaner.php > /dev/null 2>&1
* * * * * cd /YGGo/crontab && php crawler.php > /dev/null 2>&1
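
The entries above assume the project lives in /YGGo. As a rough sketch for other locations, the following shell function prints the same three entries for an arbitrary install path; the function name yggo_crontab and the example path are hypothetical.

```shell
# Hypothetical helper (not part of YGGo): print the crontab entries
# for a checkout located at "$1".
yggo_crontab() {
    root="$1"
    # Unquoted heredoc delimiter: $root is expanded, everything else is literal.
    cat <<EOF
0 * * * * indexer --all --rotate
0 0 * * * cd $root/crontab && php cleaner.php > /dev/null 2>&1
* * * * * cd $root/crontab && php crawler.php > /dev/null 2>&1
EOF
}
```

For example, `yggo_crontab /var/www/YGGo` prints the entries so they can be reviewed and pasted into `crontab -e`. Piping the output straight into `crontab -` would replace the current user's whole crontab, so that is best avoided on a shared account.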

Roadmap / ideas

Web pages full text ranking search
Add search results pagination
Add robots.txt support (Issue #2)
Improve Yggdrasil link detection, add .ygg domain zone support
Show a page description based on the cached content dump when the website description tag is not available, and add search-condition highlights
Images search (basically implemented but requires testing and some performance optimization)
Index cleaner
Crawl queue balancer that depends on available CPU
Implement a smart queue algorithm that indexes the homepages of new sites with higher priority
Implement automatic database backup when the crawl process completes
Add transactions to prevent data loss on DB crashes
Distributed index data sharing between nodes through a service API
An idea: generate unique gravatars for sites without favicons, because they are easier to identify than IPv6 addresses
An idea: add visitor counters, like in the good old days?

Contributions

Please create a new branch from master for each patch in your fork before opening a PR

git checkout master
git checkout -b my-pr-branch-name

Donate to contributors

@d47081: BTC | DOGE | Support our server by ordering a Linux VPS

Feedback

Please feel free to share your ideas and bug reports here, or use the sources for your own implementations.

Have a good time.
