YGGo! - Distributed & Open Source Web Search Engine
This project is dedicated to the defenders of the city of Bakhmut
Written out of a desire to explore the Yggdrasil ecosystem after the last YaCy node there was discontinued. The engine can also be useful for crawling regular websites, small business resources, and local networks.
The project goals are a simple interface, a clear architecture, and lightweight server requirements.
Overview
https://github.com/YGGverse/YGGo/tree/main/media
Online instances
- http://[201:23b4:991a:634d:8359:4521:5576:15b7]/yggo
- http://94.140.114.241/yggo/
Requirements
php8^
php-dom
php-xml
php-pdo
php-curl
php-gd
php-mbstring
php-zip
php-mysql
php-memcached
memcached
sphinxsearch
Installation
- The web root dir is /public
- Deploy the database using the MySQL Workbench project presented in the /database folder
- Install Sphinx Search Server
- Configuration examples are presented in the /config folder
- Make sure the /storage/cache, /storage/tmp, /storage/snap folders are writable
- Set up the /crontab by following the example
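The writable-folder requirement can be checked before the first crawl. The following is a minimal sketch, not part of YGGo itself: the function name and the idea of passing the project root as an argument are assumptions, and only the three folder names come from the list above.

```python
import os

# Folder names are taken from the installation notes;
# the project root location is an assumption supplied by the caller.
REQUIRED_WRITABLE = ("storage/cache", "storage/tmp", "storage/snap")


def find_unwritable(project_root):
    """Return the required folders that are missing or not writable."""
    problems = []
    for relative in REQUIRED_WRITABLE:
        path = os.path.join(project_root, relative)
        # A folder passes only if it exists and the current user may write to it.
        if not (os.path.isdir(path) and os.access(path, os.W_OK)):
            problems.append(path)
    return problems
```

Calling find_unwritable with the path of the checkout returns an empty list when all three folders are ready. Run it as the same user that the web server and crontab tasks use, since write permission depends on the calling user.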
JSON API
Build third-party applications or distribute the index.
Can be enabled or disabled with the API_ENABLED option.
Address
/api.php
Search
Returns search results.
Can be enabled or disabled with the API_SEARCH_ENABLED option.
Request attributes
GET action=search - required
GET query={string} - optional, search request, empty if not provided
GET type={string} - optional, filter by one of the available MIME types, empty if not provided
GET page={int} - optional, search results page, 1 if not provided
GET mode=SphinxQL - optional, enable extended SphinxQL syntax
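The request attributes above map directly onto a query string. Here is a minimal sketch of composing such a URL in Python. The function name, its parameter names, and the localhost address are hypothetical; only the attribute names (action, query, type, page, mode) come from the list above.

```python
from urllib.parse import urlencode


def build_search_url(api_url, query="", mime_type="", page=1, sphinxql=False):
    """Compose a GET URL for action=search from the documented attributes."""
    params = {"action": "search", "query": query, "page": page}
    if mime_type:
        # Optional MIME filter; omitted when empty.
        params["type"] = mime_type
    if sphinxql:
        # Optional switch for the extended SphinxQL syntax.
        params["mode"] = "SphinxQL"
    # urlencode percent-encodes operators such as "|" and turns spaces into "+".
    return api_url + "?" + urlencode(params)


# Hypothetical node address; replace it with the address of a real instance.
print(build_search_url("http://localhost/api.php", query="hello | world", page=2))
```

Building the URL separately from sending it keeps the encoding step easy to inspect, which helps when a query contains search operators.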
Hosts distribution
Returns the collected hosts with the fields listed in the API_HOSTS_FIELDS option.
Can be enabled or disabled with the API_HOSTS_ENABLED option.
Request attributes
GET action=hosts - required
Application manifest
Returns node information for other nodes that have the same CRAWL_MANIFEST_API_VERSION and CRAWL_URL_REGEXP conditions.
Can be enabled or disabled with the API_MANIFEST_ENABLED option.
Request attributes
GET action=manifest - required
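The hosts and manifest actions take no attributes beyond action, so one small helper can serve both. This is a sketch under assumptions: the function name, the opener parameter, and the timeout are hypothetical, and the structure of the returned JSON is not described here, so the helper returns whatever the node sends without interpreting it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_action(api_url, action, opener=urlopen, timeout=10):
    """Request api.php with the given action and return the decoded JSON."""
    if action not in ("hosts", "manifest"):
        raise ValueError("unsupported action: " + action)
    url = api_url + "?" + urlencode({"action": action})
    # The opener is injectable so the helper can be exercised without a network.
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, fetch_action("http://localhost/api.php", "manifest") would query a node at a hypothetical local address. A node with the corresponding API option disabled may answer differently, so callers should be prepared for errors.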
Search textual filtering
Default constructions
word prefix:
yg*
operator OR:
hello | world
operator MAYBE:
hello MAYBE world
operator NOT:
hello -world
strict order operator (aka operator "before"):
aaa << bbb << ccc
exact form modifier:
raining =cats and =dogs
field-start and field-end modifier:
^hello world$
keyword IDF boost modifier:
boosted^1.234 boostedfieldend$^1.234
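When queries are generated by a program rather than typed, the operators above can be assembled with a few string helpers. The helper names below are hypothetical and only cover some of the constructions. They produce plain query strings and do no escaping of special characters inside the terms.

```python
def prefix(word):
    """Word prefix: matches words starting with the given letters."""
    return word + "*"


def any_of(*terms):
    """Operator OR between all terms."""
    return " | ".join(terms)


def maybe(left, right):
    """Operator MAYBE between two terms."""
    return left + " MAYBE " + right


def exclude(query, term):
    """Operator NOT: append a term that must not appear."""
    return query + " -" + term


def strict_order(*terms):
    """Strict order operator: terms must appear in this order."""
    return " << ".join(terms)


print(strict_order("aaa", "bbb", "ccc"))  # aaa << bbb << ccc
```

The resulting string still has to be URL-encoded before it is placed in the query attribute of a search request.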
Extended syntax
https://sphinxsearch.com/docs/current.html#extended-syntax
Can be enabled with the following attribute
GET m=SphinxQL
Roadmap
Basic features
- Web pages full text ranking search
- Sphinx
- Unlimited content MIME crawling
- Flexible settings compatible with IPv4/IPv6 networks
- Extended search syntax support
- Compressed page history snaps with multi-provider storage sync
- Local (unlimited locations)
- Remote FTP (unlimited mirrors)
- Privacy-oriented downloads counting, traffic controls
UI
- CSS only, JS-less interface
- Unique host ident icons
- Content MIME tabs (#1)
- Page index explorer
- Meta
- Snaps history
- Referrers
- Top hosts page
- Safe media preview
- Results with found matches highlight
- The time machine feature by content snaps history
API
- Index API
- Manifest
- Search
- Hosts
- Snaps
- Context advertising API
Crawler
- Auto crawl links by regular expression rules
- Pages
- Manifests
- Robots.txt / robots meta tags support (#2)
- Specific rules configuration for every host
- Auto stop crawling on disk quota reached
- Transactions support to prevent data loss on queue failures
- Distributed index crawling between YGGo nodes through manifest API
- MIME Content-type settings
- Ban links that do not match the crawl conditions to prevent extra requests
- Debug log
- Index homepages and shorter URI with higher priority
- Collect target location links when a page redirect is available
- Collect referrer pages (including redirects)
- Aliasing page URL with ending slash
- Host page DOM elements collecting by CSS selectors
- Custom settings for each host
- XML Feeds support
- Sitemap
- RSS
- Atom
- Palette image index / filter
- Crawl queue balancer that depends on the available CPU
Cleaner
- Deprecated DB items auto deletion / host settings update
- Pages
- Snaps
- Snap downloads
- Manifests
- Logs
- Crawler
- Cleaner
- Banned resources reset by timeout
- DB tables optimization
- Debug log
CLI
- help
- crontab
- crawl
- clean
- hostPageSnap
- repair (not tested)
- sync DB-FS relations
- FTP
- localhost
- delete FS missed in the DB
- FTP
- localhost
- sync DB-FS relations
- truncate
- repair (not tested)
- hostPageDom
- generate
- truncate
- hostPage
- add
Other
- Administrative panel for useful index moderation
- Deployment tools
- Testing
- Documentation
Contributions
Please create a new branch from the main|sqliteway tree for each patch in your fork before creating a PR
git checkout main
git checkout -b my-pr-branch-name
See also: SQLite tree
Donate to contributors
License
- Engine sources MIT License
- HTML parser simple_html_dom
- Home page animation by alvarotrigo
- CLI logo by patorjk.com
See also
Feedback
Feel free to share your ideas and bug reports here, or use the sources for your own implementations.
Have a good time!