YGGo! Distributed Web Search Engine

YGGo! - Distributed & Open Source Web Search Engine

This project is dedicated to the defenders of the city of Bakhmut

Written out of a desire to explore the Yggdrasil ecosystem, after the last YaCy node there was discontinued. The engine can also be useful for crawling regular websites, small business resources, and local networks.

The project's goals are a simple interface, a clear architecture, and lightweight server requirements.

Overview

Home page

https://github.com/YGGverse/YGGo/tree/main/media

Online instances

Requirements

php 8 or newer
php-dom
php-pdo
php-curl
php-gd
php-mysql
sphinxsearch

Installation

  • The web root directory is /public
  • Deploy the database using the MySQL Workbench project provided in the /database folder
  • Install Sphinx Search Server
  • Configuration examples are located in the /config folder
  • Make sure the /storage folder is writable
  • Set up the /crontab scripts by following the example
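
The writability requirement for /storage can be checked before the crawler is started. The following is a minimal sketch, not part of YGGo itself: the helper name `is_writable` and the relative path `storage` are assumptions for illustration. It should be run as the same system user as the web server and the cron jobs, since that is the user that needs write access.

```python
import tempfile


def is_writable(directory):
    """Return True if a temporary file can be created inside `directory`."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory as well as missing permissions.
        return False


if __name__ == "__main__":
    # "storage" is assumed to be relative to the repository root.
    print("storage writable:", is_writable("storage"))
```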

JSON API

Use the API to build third-party applications or to distribute the index.

Can be enabled or disabled with the API_ENABLED option

Address
/api.php

Search

Returns search results.

Can be enabled or disabled with the API_SEARCH_ENABLED option

Request attributes
GET action=search  - required
GET query={string} - optional, search request, empty if not provided
GET type={string}  - optional, search type, image|default or empty
GET page={int}     - optional, search results page, 1 if not provided
GET mode=SphinxQL  - optional, enable extended SphinxQL syntax
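
The request attributes above can be assembled into a URL as follows. This is a minimal sketch under stated assumptions: the function name `build_search_url` and the base address `http://localhost` are hypothetical, and only the parameter names come from the list above.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query="", search_type="", page=1, sphinxql=False):
    """Build a GET URL for the search action of /api.php."""
    params = {"action": "search"}        # required
    if query:
        params["query"] = query          # optional, empty if not provided
    if search_type:
        params["type"] = search_type     # optional, e.g. "image"
    if page != 1:
        params["page"] = page            # optional, 1 if not provided
    if sphinxql:
        params["mode"] = "SphinxQL"      # optional, extended syntax
    return f"{base_url.rstrip('/')}/api.php?{urlencode(params)}"


# The base address is a placeholder for a real node address.
print(build_search_url("http://localhost", query="hello world", search_type="image", page=2))
```

Optional attributes are left out of the URL when they hold their default value, so the node falls back to the defaults described above.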

Hosts distribution

Returns the hosts collected by the node, with the fields listed in the API_HOSTS_FIELDS option.

Can be enabled or disabled with the API_HOSTS_ENABLED option

Request attributes
GET action=hosts - required

Application manifest

Returns node information.

Can be enabled or disabled with the API_MANIFEST_ENABLED option

Request attributes
GET action=manifest - required
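
The hosts and manifest actions take no further attributes, so a client only has to set `action`. The sketch below is an illustration, not the project's own client: the helper names `api_url` and `fetch_json` are hypothetical, the node address is a placeholder, and the response is assumed to be a JSON document whose fields are not described in this README.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def api_url(base_url, action):
    """Build the /api.php URL for the hosts or manifest action."""
    if action not in ("hosts", "manifest"):
        raise ValueError(f"unsupported action: {action}")
    return f"{base_url.rstrip('/')}/api.php?{urlencode({'action': action})}"


def fetch_json(url, opener=urlopen, timeout=10):
    """Request `url` and decode the body as JSON.

    `opener` can be replaced in tests to avoid network access.
    """
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder address; no request is sent here.
    print(api_url("http://localhost", "manifest"))
```

To query a running node, call `fetch_json(api_url(node_address, "hosts"))`. If the corresponding API_*_ENABLED option is turned off on the node, the request is expected to be rejected, so the caller should be prepared for an error response.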

Search textual filtering

Default constructions
operator OR:

hello | world

operator MAYBE:

hello MAYBE world

operator NOT:

hello -world

strict order operator (aka operator "before"):

aaa << bbb << ccc

exact form modifier:

raining =cats and =dogs

field-start and field-end modifier:

^hello world$

keyword IDF boost modifier:

boosted^1.234 boostedfieldend$^1.234
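
Queries that use these operators contain characters such as `|`, `<`, `^` and `$`, which need percent-encoding when the query is passed as a GET attribute. The following sketch shows one way to compose and encode such queries. The helper names (`any_of`, `in_order`, `exclude`, `exact`) are hypothetical and only concatenate strings in the forms listed above.

```python
from urllib.parse import quote_plus


def any_of(*terms):
    """Join terms with the OR operator."""
    return " | ".join(terms)


def in_order(*terms):
    """Join terms with the strict order operator."""
    return " << ".join(terms)


def exclude(query, term):
    """Append a NOT term to an existing query."""
    return f"{query} -{term}"


def exact(term):
    """Apply the exact form modifier to a term."""
    return f"={term}"


query = exclude(any_of("hello", "world"), "spam")
print(query)              # hello | world -spam
print(quote_plus(query))  # hello+%7C+world+-spam
```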

Extended syntax

https://sphinxsearch.com/docs/current.html#extended-syntax

Can be enabled with the following attribute

GET m=SphinxQL

Roadmap

  • Web pages full text ranking search
  • Search results pagination
  • Add robots.txt support (Issue #2)
  • Page descriptions with highlighted matches
  • Images search
  • Index cleaner
  • Crawl queue balancer that depends on available CPU
  • Index the home pages of new sites with higher priority
  • Add transactions to prevent data loss on DB crashes
  • JSON API
  • Distributed index data sharing between the nodes through the service API
  • Unique gravatars for sites without favicons, because they are easier to identify than IPv6 addresses
  • Link click counter, through an internal stats redirect controller
  • Time machine feature: preview of cached content history

Contributions

Please create a new branch from the master or sqliteway tree for each patch in your fork before opening a PR

git checkout master
git checkout -b my-pr-branch-name

See also: SQLite tree

Donate to contributors

License

Feedback

Feel free to share your ideas and bug reports here, or to use the sources for your own implementations.

Have a good time.