104 Commits (547cd6717b53abc6763856db2d23f687c8a3c056)

Author SHA1 Message Date
ghost d186fff48f skip curl download on response data size reached 2 years ago
ghost d7a5f7ef84 remove content filter, snap the raw data 2 years ago
ghost 23ead4e12c update page / image description models, implement history snap crawling 2 years ago
ghost 0e9d29675f implement host page description history crawling 2 years ago
ghost 6371def666 fix attributes passing 2 years ago
ghost 32d0f390d3 update http code and mime type on page/image ban event 2 years ago
ghost 84dcecf50b add svg images support, fix mime validation 2 years ago
ghost bf1eeb332c fix page/image mime content type detection 2 years ago
ghost 25b6bce2ec add crawler/cleaner logs 2 years ago
ghost dcdc2c50ad update debug string names 2 years ago
ghost ea04220de3 add curl requests debug 2 years ago
ghost 1aba060d34 fix variable name 2 years ago
ghost fdd18de373 remove abstraction 2 years ago
ghost 6c41dd5831 fix ban time update / count affected rows only 2 years ago
ghost 20514c455f add banned items counters 2 years ago
ghost b6605b9132 implement ban feature with timeout for unreachable resources to prevent extra HTTP requests 2 years ago
ghost 702a14b634 add mime content type crawling #1 2 years ago
ghost 0bd95d7f4d fix comments 2 years ago
ghost f88d2ee9ff implement MIME content-type crawler filter 2 years ago
ghost 5999fb3a73 add distributed hosts crawling using yggo nodes manifest 2 years ago
ghost 5297e6e918 fix condition error 2 years ago
ghost 0cc712f24e fix variable definition 2 years ago
ghost d4f66c83e7 fix image crawling errors 2 years ago
ghost baa8b0d2f0 fix data type formatting 2 years ago
ghost 79878d17fe add crawler / proxy user agent settings 2 years ago
ghost 9ed8411d2f add image queue crawler 2 years ago
ghost d905e33b4f update host images info on search requests 2 years ago
ghost 0741a3e9ef implement image crawler 2 years ago
ghost 1ee2ac4f0b add yggo:manifest namespace 2 years ago
ghost f8e0a50db6 add manifest url filter 2 years ago
ghost 6d8f4f4882 create manifests registry 2 years ago
ghost eb3e70a7b7 fix robots.txt conditions 2 years ago
ghost a5f5541395 skip robots:noindex page without extra actions 2 years ago
ghost e418ddcd32 fix data type 2 years ago
ghost 11aa404807 add metaYggo field index 2 years ago
ghost 5875dd58c9 fix PR update condition 2 years ago
ghost 8671fc4bde implement page ranking 2 years ago
ghost 5936fa9a30 fix quota check condition 2 years ago
ghost 8dbb4a06af add disk quota validation 2 years ago
ghost dfbc6132c9 fix robots:noindex condition, add robots:nofollow attribute support 2 years ago
ghost 5c8d299a4a add meta:robots tag support #2 2 years ago
ghost 0484d43482 fix trim path levels in the relative links 2 years ago
ghost df6f2a1869 implement CRAWL_ROBOTS_POSTFIX_RULES configuration #5 2 years ago
ghost b3c668706b trim path levels in the relative links 2 years ago
ghost 71a3e7dd0e skip x-raw-image links crawl 2 years ago
ghost 9b9d40a97c skip javascript/mailto links index 2 years ago
ghost 2a843449e0 add process locked notice to the debug output 2 years ago
ghost ce509ec0a8 remove debug row 2 years ago
ghost 2495a2bbc7 implement MySQL/Sphinx data model #3, add basic robots.txt support #2 2 years ago
ghost 79663c84db add CRAWL_META_ONLY option 2 years ago