YGGo! - Distributed & Open Source Web Search Engine

The project is dedicated to the defenders of the city of Bakhmut

Written out of inspiration to explore the Yggdrasil ecosystem, after the last YaCy node there was discontinued. This engine could also be useful for crawling regular websites, small business resources, and local networks.

The project goal is a simple interface, clear architecture, and lightweight server requirements.

Overview

Home page

https://github.com/YGGverse/YGGo/tree/main/media

Online instances

Requirements

php8^
php-dom
php-pdo
php-curl
php-gd
php-mbstring
php-zip
php-mysql
sphinxsearch

Installation

  • The web root dir is /public
  • Deploy the database using the MySQL Workbench project presented in the /database folder
  • Install Sphinx Search Server and MEGAcmd (when remote snaps are enabled)
  • Configuration examples are presented in the /config folder
  • Make sure the /storage/cache, /storage/tmp, /storage/snap folders are writable
  • Set up /crontab following the provided example

JSON API

Build third-party applications / distribute the index.

Can be enabled or disabled with the API_ENABLED option

Address
/api.php

Search

Returns search results.

Can be enabled or disabled with the API_SEARCH_ENABLED option

Request attributes
GET action=search  - required
GET query={string} - optional, search request, empty if not provided
GET type={string}  - optional, filter by one of the available MIME types, empty if not provided
GET page={int}     - optional, search results page, 1 if not provided
GET mode=SphinxQL  - optional, enable extended SphinxQL syntax
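
For example, a minimal Python sketch of a search request; the base URL here is hypothetical, and the response is printed as-is because the exact JSON structure depends on the node configuration:

import json
import urllib.parse
import urllib.request

# Hypothetical instance address - replace with a real YGGo node URL
BASE_URL = 'http://127.0.0.1/yggo/api.php'

params = {
    'action': 'search',     # required
    'query':  'yggdrasil',  # optional search phrase
    'page':   1,            # optional, defaults to 1
}

with urllib.request.urlopen(BASE_URL + '?' + urllib.parse.urlencode(params)) as response:
    data = json.loads(response.read().decode('utf-8'))

# Pretty-print whatever the API returned
print(json.dumps(data, indent=2, ensure_ascii=False))
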
Hosts distribution

Returns collected hosts with the fields defined by the API_HOSTS_FIELDS option.

Can be enabled or disabled with the API_HOSTS_ENABLED option

Request attributes
GET action=hosts - required
Application manifest

Returns node information for other nodes that share the same CRAWL_MANIFEST_API_VERSION and CRAWL_URL_REGEXP conditions.

Can be enabled or disabled with the API_MANIFEST_ENABLED option

Request attributes
GET action=manifest - required
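
As a sketch of how another node might consume this, the manifest can be fetched and its reported settings compared with the local configuration before exchanging index data; the remote address below is hypothetical:

import json
import urllib.request

# Hypothetical remote node address
REMOTE_API = 'http://remote-node.example/api.php?action=manifest'

with urllib.request.urlopen(REMOTE_API) as response:
    manifest = json.loads(response.read().decode('utf-8'))

# Print the manifest as returned; the CRAWL_MANIFEST_API_VERSION and
# CRAWL_URL_REGEXP values it reports can then be checked against local settings
print(json.dumps(manifest, indent=2, ensure_ascii=False))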

Search textual filtering

Default constructions
word prefix:

yg*

operator OR:

hello | world

operator MAYBE:

hello MAYBE world

operator NOT:

hello -world

strict order operator (aka operator "before"):

aaa << bbb << ccc

exact form modifier:

raining =cats and =dogs

field-start and field-end modifier:

^hello world$

keyword IDF boost modifier:

boosted^1.234 boostedfieldend$^1.234

Extended syntax

https://sphinxsearch.com/docs/current.html#extended-syntax

Can be enabled with the following attribute

GET m=SphinxQL
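
A hedged sketch of sending an extended-syntax query through the JSON API, using the mode attribute documented in the search section above (this section also refers to it as m); the base URL is hypothetical:

import json
import urllib.parse
import urllib.request

BASE_URL = 'http://127.0.0.1/yggo/api.php'  # hypothetical instance address

params = {
    'action': 'search',
    'query':  'hello | world -spam',  # OR and NOT operators from the list above
    'mode':   'SphinxQL',             # enable extended SphinxQL syntax
}

with urllib.request.urlopen(BASE_URL + '?' + urllib.parse.urlencode(params)) as response:
    print(json.dumps(json.loads(response.read().decode('utf-8')), indent=2, ensure_ascii=False))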

Roadmap

Basic features
  • Web pages full text ranking search
    • Sphinx
  • Unlimited content MIME crawling
  • Flexible settings compatible with IPv4/IPv6 networks
  • Extended search syntax support
  • Compressed page history snaps with multi-provider storage sync
    • Local
    • Remote
      • MEGAcmd/FTP
      • Yggdrasil over NAT
    • Privacy-oriented downloads counting, traffic controls
UI
  • CSS only, JS-less interface
  • Unique host ident icons
  • Content genre tabs (#1)
  • Page index explorer
    • Meta
    • Snaps history
    • Referrers
  • Safe media preview
  • Search results with matched terms highlighted
  • Time machine feature based on the content snaps history
API
  • Index API
    • Manifest
    • Search
    • Hosts
    • Snaps
  • Context advertising API
Crawler
  • Auto crawl links by regular expression rules
    • Pages
    • Manifests
  • Robots.txt / robots meta tags support (#2)
  • Specific rules configuration for every host
  • Auto-stop crawling when the disk quota is reached
  • Transactions support to prevent data loss on queue failures
  • Distributed index crawling between YGGo nodes through the manifest API
  • MIME Content-type settings
  • Ban links that do not match the conditions to prevent extra requests
  • Debug log
  • Index homepages and shorter URIs with higher priority
  • Collect target location links when a page redirect is available
  • Host page DOM elements collecting by CSS selectors
    • Custom settings for each host
  • XML Feeds support
    • Sitemap
    • RSS
    • Atom
  • Palette image index / filter
  • Crawl queue balancer that depends on the available CPU
Cleaner
  • Deprecated DB items auto deletion / host settings update
    • Pages
    • Snaps
      • Snap downloads
      • Missing snap file relations
    • Manifests
    • Logs
      • Crawler
      • Cleaner
  • Banned resources reset by timeout
  • DB tables optimization
  • Debug log
CLI
  • help
  • hostPageDom
    • generate
    • truncate
  • hostPage
    • add
Other
  • Administrative panel for index moderation
  • Deployment tools
  • Testing
  • Documentation

Contributions

Please make a new branch from the master|sqliteway tree for each patch in your fork before creating a PR

git checkout master
git checkout -b my-pr-branch-name

See also: SQLite tree

Donate to contributors

License

See also

Feedback

Please feel free to share your ideas and bug reports here, or use the sources for your own implementations.

Have a good time!