YGGo! Distributed Web Search Engine

YGGo! - Open Source Web Search Engine

The project is dedicated to the defenders of the city of Bakhmut

Written out of a desire to explore the Yggdrasil ecosystem, after the last YaCy node there was discontinued. The engine could also be useful for crawling regular websites, small business resources, and local networks.

The project goals are a simple interface, a clear architecture, and lightweight server requirements.

Overview

Home page

https://github.com/YGGverse/YGGo/tree/main/media

Online instances

License

Requirements

php 8 or newer
php-dom
php-pdo
php-curl
php-gd
php-mysql
sphinxsearch

Installation

  • The webroot directory is /public.
  • The single configuration file is located at /config/app.php.txt; it needs to be configured and renamed to /config/app.php.
  • By design, the script automatically generates the database structure in the /storage folder (which is also a good place for other variable and temporary data, such as logs). Make sure the storage folder is writable.
  • Set up the /crontab/crawler.php script to run every minute. The right interval depends mostly on the configuration and the size of the target network. No debug output is implemented yet, so silence the script by redirecting its output to /dev/null.
  • The script has no MVC model because it is very simple: it is just 2 files, and everything else is encapsulated in the /library classes.
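The checklist above can be verified with a small pre-flight script. This is only a sketch, not part of YGGo: the function name `preflight` and the messages are made up for illustration, and it assumes the install root contains the `public`, `config` and `storage` folders named in the steps above.

```python
import os
from pathlib import Path


def preflight(root: str) -> list[str]:
    """Return a list of problems found under the install root (hypothetical helper)."""
    base = Path(root)
    problems = []

    # The web server is expected to serve the /public directory.
    if not (base / "public").is_dir():
        problems.append("missing webroot directory: public")

    # config/app.php.txt must be configured and renamed to config/app.php.
    if not (base / "config" / "app.php").is_file():
        problems.append("config/app.php not found (rename config/app.php.txt)")

    # The storage folder must exist and be writable by the current user.
    storage = base / "storage"
    if not storage.is_dir():
        problems.append("missing storage directory")
    elif not os.access(storage, os.W_OK):
        problems.append("storage directory is not writable")

    return problems
```

For example, `preflight("/YGGo")` returns an empty list when all three checks pass. Note that the check runs as the current user, which may differ from the web server or cron user.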

Configuration

Crontab
@reboot searchd
@reboot indexer --all --rotate

0 * * * * indexer --all --rotate

0 0 * * * cd /YGGo/crontab && php cleaner.php > /dev/null 2>&1
* * * * * cd /YGGo/crontab && php crawler.php > /dev/null 2>&1

JSON API

For building third-party applications and index distribution.

Can be enabled or disabled with the API_ENABLED option.

Address
/api.php

Search API

Returns search results.

Can be enabled or disabled with the API_SEARCH_ENABLED option.

Request attributes
GET action=search  - required
GET query={string} - optional, search request, empty if not provided
GET page={int}     - optional, search results page, 1 if not provided
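
A minimal sketch of building a Search API request from the attributes above. The base address `http://localhost/api.php` is an assumption, so replace it with the address of your own node; `search_url` is a hypothetical helper name, not part of YGGo.

```python
from urllib.parse import urlencode

# Assumed node address for illustration only.
API_URL = "http://localhost/api.php"


def search_url(query: str = "", page: int = 1, api_url: str = API_URL) -> str:
    """Build a Search API URL from the documented GET attributes."""
    # action=search is required; query and page fall back to the documented defaults.
    params = {"action": "search", "query": query, "page": page}
    return f"{api_url}?{urlencode(params)}"


print(search_url("yggdrasil network", page=2))
# → http://localhost/api.php?action=search&query=yggdrasil+network&page=2
```

`urlencode` takes care of escaping the query string, so spaces and non-ASCII characters in the search request are safe to pass.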

Hosts distribution API

Returns the hosts collected by the node, with the fields listed in the API_HOSTS_FIELDS option.

Can be enabled or disabled with the API_HOSTS_ENABLED option.

Request attributes
GET action=hosts - required
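
A hedged sketch of requesting the hosts list from another node. The helper names `hosts_url` and `fetch_hosts` are hypothetical, and the node address is whatever you pass in. The response layout is not described here, so the decoded JSON is returned as-is, without assuming any field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def hosts_url(api_url: str) -> str:
    """Build a Hosts distribution API URL (action=hosts is the only attribute)."""
    return f"{api_url}?{urlencode({'action': 'hosts'})}"


def fetch_hosts(api_url: str, timeout: float = 10.0):
    """Request the hosts list and return the decoded JSON document unchanged."""
    with urlopen(hosts_url(api_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_hosts("http://localhost/api.php")` would query a local node, assuming the API and the hosts action are enabled in its configuration.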

Roadmap / ideas

  • Full-text ranked search of web pages
  • Search results pagination
  • Add robots.txt support (Issue #2)
  • Improve Yggdrasil link detection, add .ygg domain zone support
  • Show a page description based on the cached content dump when the website description tag is not available, add search condition highlights
  • Images search (basically implemented but requires testing and some performance optimization)
  • Index cleaner
  • Crawl queue balancer that depends on the available CPU
  • Implement a smart queue algorithm that indexes the homepages of new sites with higher priority
  • Implement automatic database backup when the crawl process completes
  • Add transactions to prevent data loss on DB crashes
  • JSON API
  • Distributed index data sharing between the nodes through the service API
  • Idea: generate unique gravatars for sites without favicons, because they are easier to identify than IPv6 addresses
  • Idea: add visitor counters, like in the good old times?

Contributions

Please create a new branch from the master or sqliteway tree for each patch in your fork before creating a PR

git checkout master
git checkout -b my-pr-branch-name

See also: SQLite tree

Donate to contributors

Feedback

Please feel free to share your ideas and bug reports here, or use the sources for your own implementations.

Have a good time.