Commit Graph

121 Commits

Author SHA1 Message Date
ghost
1969707eeb integrate optional MEGA/cmd snap storage 2023-05-14 19:41:20 +03:00
ghost
50c9066f62 add tables optimization to the cron/cleaner task 2023-05-14 02:39:32 +03:00
ghost
0d19004e86 make local snap storage optimization 2023-05-14 01:45:55 +03:00
ghost
2f7d99079d implement local snaps 2023-05-13 10:15:07 +03:00
ghost
d98b8f5c94 remove hostPageToHostPage.quantity field because it counts duplicates incorrectly on reindex 2023-05-13 06:30:40 +03:00
ghost
eeeb3dceac implement index explorer 2023-05-13 05:54:15 +03:00
ghost
377b519a2c implement host page info mode 2023-05-13 03:51:34 +03:00
ghost
371670fadf add media referrers info 2023-05-13 03:01:00 +03:00
ghost
4486bdc215 show mime type options that match search results only 2023-05-10 20:37:05 +03:00
ghost
307ebcf0b1 add page description on title | description | keywords not empty, remove deprecated constructions 2023-05-10 19:35:01 +03:00
ghost
7c5ba050b2 fix media crawling 2023-05-10 18:35:18 +03:00
ghost
0fed16621a fix mime content type update 2023-05-10 14:47:33 +03:00
ghost
db0e66c846 refactor to mime-based content index #1 2023-05-10 12:47:36 +03:00
ghost
0ffcee1efb fix image description updates timing 2023-05-09 15:53:21 +03:00
ghost
2c5ca1b630 fix image description duplicate 2023-05-09 15:23:32 +03:00
ghost
28bf526d53 add host nsfw settings 2023-05-09 13:26:19 +03:00
ghost
8ce0324e94 convert page data to string 2023-05-09 12:52:07 +03:00
ghost
dfca5570c6 remove unused construction 2023-05-09 12:10:42 +03:00
ghost
d186fff48f skip curl download on response data size reached 2023-05-09 10:21:37 +03:00
ghost
ef4de6b245 fix image search page errors 2023-05-09 08:53:33 +03:00
ghost
23ead4e12c update page / image description models, implement history snap crawling 2023-05-09 08:19:49 +03:00
ghost
0e9d29675f implement host page description history crawling 2023-05-09 01:29:32 +03:00
ghost
32d0f390d3 update http code and mime type on page/image ban event 2023-05-08 14:13:53 +03:00
ghost
8fbd7f3516 count totals using sphinx index instead of database 2023-05-08 12:28:49 +03:00
ghost
25b6bce2ec add crawler/cleaner logs 2023-05-08 11:04:59 +03:00
ghost
ea04220de3 add curl requests debug 2023-05-08 08:27:21 +03:00
ghost
6c41dd5831 fix ban time update / count affected rows only 2023-05-06 10:11:25 +03:00
ghost
b6605b9132 implement not reachable resources ban feature with timeout to prevent extra http requests 2023-05-06 08:45:37 +03:00
ghost
702a14b634 add mime content type crawling #1 2023-05-06 07:25:54 +03:00
ghost
f88d2ee9ff implement MIME content-type crawler filter 2023-05-05 21:25:57 +03:00
ghost
bed5d3f149 fix offset out of bounds error 2023-05-05 15:16:36 +03:00
ghost
5999fb3a73 add distributed hosts crawling using yggo nodes manifest 2023-05-05 05:26:53 +03:00
ghost
f0b2eb1613 show images total instead of pages in placeholder on image search page 2023-05-05 01:42:44 +03:00
ghost
297563d4a5 display related pages in priority to the unique host by rank, rand() order 2023-05-04 10:53:37 +03:00
ghost
34b7291228 add related to image hostpages limit 2023-05-04 10:17:47 +03:00
ghost
adc791f378 fix updateTime init 2023-05-04 10:11:13 +03:00
ghost
d4f66c83e7 fix image crawling errors 2023-05-04 08:51:45 +03:00
ghost
baa8b0d2f0 fix data type formatting 2023-05-04 07:58:07 +03:00
ghost
79878d17fe add crawler / proxy user agent settings 2023-05-04 07:38:22 +03:00
ghost
73f212e3d7 set crawler queue order priority to item rank, rand() 2023-05-04 06:55:05 +03:00
ghost
9ed8411d2f add image queue crawler 2023-05-04 06:45:04 +03:00
ghost
d905e33b4f update host images info on search requests 2023-05-04 06:12:51 +03:00
ghost
68581960a3 add image.data field 2023-05-04 05:19:29 +03:00
ghost
100d12c6ab update curl library constructor 2023-05-04 04:55:26 +03:00
ghost
250e20bbcd remove separator 2023-05-04 04:19:38 +03:00
ghost
6b18202588 implement proxied image search #1 2023-05-04 03:48:57 +03:00
ghost
0741a3e9ef implement image crawler 2023-05-04 01:04:39 +03:00
ghost
6d8f4f4882 create manifests registry 2023-05-03 09:22:14 +03:00
ghost
0bd765064b implement extended search mode support #9 2023-05-01 20:09:28 +03:00
ghost
84fd82f294 fix replacement typo #9 2023-05-01 19:03:14 +03:00
ghost
d40b914983 add new chars quoting #9 2023-05-01 18:58:03 +03:00
ghost
f7807cf43e add extended syntax filter to prevent sphinxql query error #9 2023-05-01 18:39:46 +03:00
ghost
a5f5541395 skip robots:noindex page without extra actions 2023-04-29 08:58:48 +03:00
ghost
11aa404807 add metaYggo field index 2023-04-25 21:10:59 +03:00
ghost
8671fc4bde implement page ranking 2023-04-25 16:54:01 +03:00
ghost
fcee7f62ef fix max_matches error 2023-04-23 09:29:24 +03:00
ghost
9916fb701f implement basic api 2023-04-23 03:01:51 +03:00
ghost
e6b1e8029c add missed regex replacement rule 2023-04-10 03:18:50 +03:00
ghost
5c8d299a4a add meta:robots tag support #2 2023-04-09 03:28:31 +03:00
ghost
8e8d89db0e implement database cleaner 2023-04-09 00:06:28 +03:00
ghost
df6f2a1869 implement CRAWL_ROBOTS_POSTFIX_RULES configuration #5 2023-04-08 22:28:31 +03:00
ghost
2495a2bbc7 implement MySQL/Sphinx data model #3, add basic robots.txt support #2 2023-04-07 04:04:24 +03:00
ghost
c9cd38f6ac update variable names #2 2023-04-04 01:38:32 +03:00
ghost
ed2d4047b4 implement robots.txt library #2 2023-04-04 00:27:32 +03:00
ghost
e7e4bb686c fix curl exec double call 2023-04-03 04:47:31 +03:00
ghost
ff95df72c1 implement hostname identicons 2023-04-03 01:30:09 +03:00
ghost
4ea01bf8b4 implement search results pagination 2023-04-02 23:36:35 +03:00
ghost
04dbbc3adf make url/src column ukeys digital by using crc32 2023-04-02 18:56:56 +03:00
ghost
b218b8bbc3 make url/src columns unique keys, add insert/ignore construction 2023-04-02 18:09:44 +03:00
ghost
d5f33ad643 add crawl in queue notification 2023-04-02 01:30:50 +03:00
ghost
72985eaf9e initial commit 2023-04-01 19:29:39 +03:00