YGGo! - Distributed Web Search Engine
Written to explore the Yggdrasil ecosystem. The engine can also be useful for crawling regular websites, small business resources and local networks.
The project goals are a simple interface, a clear architecture and lightweight server requirements.
Overview
https://github.com/YGGverse/YGGo/tree/main/media
Online instances
- http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yggo
Database snaps
- 17-09-2023
Requirements
php8^
php-dom
php-xml
php-pdo
php-curl
php-gd
php-mbstring
php-zip
php-mysql
php-memcached
memcached
sphinxsearch
Installation
git clone https://github.com/YGGverse/YGGo.git
cd YGGo
composer install
Setup
- Server configuration: /example/environment
- The web root dir is /src/public
- Deploy the database using the MySQL Workbench project presented in the /database folder
- Install Sphinx Search Server
- Configuration examples are presented in the /config folder
- Make sure the /src/storage/cache, /src/storage/tmp, /src/storage/snap folders are writable
- Set up the /src/crontab by following the example
- To start the crawler, add at least one initial URL using the search form or CLI
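The requirement that the storage folders be writable can be checked before the first crawl. The following is a minimal sketch in Python, not part of YGGo: the helper name `unwritable_dirs` is hypothetical, and it assumes the repository root is passed in as `root`.

```python
import os

# Storage folders that the setup notes require to be writable,
# given relative to the repository root.
STORAGE_DIRS = ("src/storage/cache", "src/storage/tmp", "src/storage/snap")


def unwritable_dirs(root):
    """Return the storage folders under `root` that are missing or not writable."""
    problems = []
    for rel in STORAGE_DIRS:
        path = os.path.join(root, rel)
        # os.access checks permissions for the current user only, so the
        # result is meaningful when run as the web server / cron user.
        if not (os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)):
            problems.append(rel)
    return problems
```

Calling `unwritable_dirs("/path/to/YGGo")` returns an empty list when all three folders exist and are writable for the current user.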
JSON API
Build third-party applications and distribute the index.
Can be enabled or disabled with the API_ENABLED option.
Address
/api.php
Search
Returns search results.
Can be enabled or disabled with the API_SEARCH_ENABLED option.
Request attributes
GET action=search - required
GET query={string} - optional, search request, empty if not provided
GET type={string} - optional, filter results by one of the available MIME types, empty if not provided
GET page={int} - optional, search results page, 1 if not provided
GET mode=SphinxQL - optional, enable extended SphinxQL syntax
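A search request can be assembled from the attributes above. This is a minimal sketch in Python: the base address `API_URL` and the helper name `search_url` are assumptions for illustration, and the structure of the JSON response is not described here, so the sketch stops at building the URL.

```python
from urllib.parse import urlencode

# Hypothetical base address of a YGGo node; replace it with a real instance.
API_URL = "http://example.test/api.php"


def search_url(query="", mime_type="", page=1, sphinxql=False):
    """Build a GET URL for the search action from the documented attributes."""
    params = {"action": "search"}  # the only required attribute
    if query:
        params["query"] = query
    if mime_type:
        params["type"] = mime_type
    if page != 1:
        params["page"] = page
    if sphinxql:
        params["mode"] = "SphinxQL"  # extended SphinxQL syntax
    # urlencode percent-encodes values and writes spaces as "+".
    return API_URL + "?" + urlencode(params)
```

For example, `search_url("yggdrasil network", page=2)` produces `http://example.test/api.php?action=search&query=yggdrasil+network&page=2`.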
Hosts distribution
Returns the collected hosts with the fields listed in the API_HOSTS_FIELDS option.
Can be enabled or disabled with the API_HOSTS_ENABLED option.
Request attributes
GET action=hosts - required
Application manifest
Returns node information for other nodes that have the same CRAWL_MANIFEST_API_VERSION and DEFAULT_HOST_URL_REGEXP conditions.
Can be enabled or disabled with the API_MANIFEST_ENABLED option.
Request attributes
GET action=manifest - required
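All three actions share one address and differ only in the `action` attribute, so a client can use a single helper. The sketch below is an illustration under assumptions: `API_URL`, `api_url` and `fetch_json` are hypothetical names, the node address is a placeholder, and the fields of the returned JSON are not specified in this document.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base address of a YGGo node; replace it with a real instance.
API_URL = "http://example.test/api.php"

# Actions documented in the JSON API section.
ACTIONS = ("search", "hosts", "manifest")


def api_url(action, **params):
    """Build a GET URL for one of the documented API actions."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return API_URL + "?" + urlencode({"action": action, **params})


def fetch_json(url, timeout=10):
    """Request the URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With a reachable node, `fetch_json(api_url("manifest"))` or `fetch_json(api_url("hosts"))` would return the decoded response. What a node answers when the corresponding option is disabled is not covered here.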
Search textual filtering
Default constructions
word prefix:
yg*
operator OR:
hello | world
operator MAYBE:
hello MAYBE world
operator NOT:
hello -world
strict order operator (aka operator "before"):
aaa << bbb << ccc
exact form modifier:
raining =cats and =dogs
field-start and field-end modifier:
^hello world$
keyword IDF boost modifier:
boosted^1.234 boostedfieldend$^1.234
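Several of these constructions contain characters that are reserved or unsafe in URLs (`|`, `<`, `=`, `^`, `$`, `*`), so they should be percent-encoded when sent in a GET attribute. A minimal sketch in Python, using only the standard library; the helper name `encode_query` is hypothetical:

```python
from urllib.parse import quote


def encode_query(query):
    """Percent-encode a search expression for use as a GET attribute value."""
    # safe="" also encodes "/" so the whole expression stays in one value.
    return quote(query, safe="")


# Operators taken from the list above.
for expression in ("yg*", "hello | world", "aaa << bbb << ccc", "^hello world$"):
    print(expression, "->", encode_query(expression))
```

Under this encoding `hello | world` becomes `hello%20%7C%20world` and `^hello world$` becomes `%5Ehello%20world%24`.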
Extended syntax
https://sphinxsearch.com/docs/current.html#extended-syntax
Can be enabled with the following attribute
GET m=SphinxQL
Roadmap
Basic features
- Web pages full text ranking search
- Sphinx
- Unlimited content MIME crawling
- Flexible settings compatible with IPv4/IPv6 networks
- Extended search syntax support
- Compressed page history snaps with multi-provider storage sync
- Local (unlimited locations)
- Remote FTP (unlimited mirrors)
- Privacy-oriented downloads counting, traffic controls
UI
- CSS only, JS-less interface
- Unique host ident icons
- Content MIME tabs (#1)
- Page index explorer
- Meta
- Snaps history
- Referrers
- Top hosts page
- Safe media preview
- Results with found matches highlight
- The time machine feature by content snaps history
API
- Index API
- Manifest
- Search
- Hosts
- Snaps
- Context advertising API
Crawler
- Auto crawl links by regular expression rules
- Pages
- Manifests
- Robots.txt / robots meta tags support (#2)
- Specific rules configuration for every host
- Auto stop crawling on disk quota reached
- Transactions support to prevent data loss on queue failures
- Distributed index crawling between YGGo nodes through manifest API
- MIME Content-type settings
- Ban links that do not match the crawl conditions to prevent extra requests
- Debug log
- Index homepages and shorter URI with higher priority
- Collect target location links when a page redirect is available
- Collect referrer pages (including redirects)
- URL aliasing support on PR calculation
- Host page DOM elements collecting by CSS selectors
- Custom settings for each host
- XML Feeds support
- Sitemap
- RSS
- Atom
- Palette image index / filter
- Crawl queue balancer that depends on the available CPU
- Networks integration
Cleaner
- Banned pages reset by timeout
- DB tables optimization
CLI
*The CLI is still under construction, use it at your own risk!
- help
- db
- optimize
- crontab
- crawl
- clean
- hostSetting
- get
- set
- list
- delete
- flush
- hostPage
- add
- rank
- reindex
- hostPageSnap
- repair
- db
- fs
- reindex
- truncate
- repair
Other
- Administrative panel for useful index moderation
- Deployment tools
- Testing
- Documentation
Contributions
Please create a new branch from the main|sqliteway tree for each patch in your fork before opening a PR
git checkout main
git checkout -b my-pr-branch-name
See also: SQLite tree
Donate to contributors
License
- Engine sources: MIT License
- Home page animation by alvarotrigo
- CLI logo by patorjk.com
- Identicons by jdenticon
Feedback
Feel free to share your ideas and bug reports!