
81 Commits

Author SHA1 Message Date
ghost
f3475035c2 show page size in explorer view, hide not available data 2023-06-13 23:20:22 +03:00
ghost
ab78e17ca8 add hostPage.size collection 2023-06-13 12:45:12 +03:00
ghost
7892784f5c add httpCode column to hostPageSnapDownload table 2023-06-12 13:34:25 +03:00
ghost
edec590e09 fix MAYBE filter in the default search mode 2023-06-06 00:36:13 +03:00
ghost
e1fb7f8c17 change query separators to the MAYBE operator in default search mode 2023-06-05 23:33:07 +03:00
ghost
0af5d165d3 remove logCrawler column not in use 2023-06-05 22:06:55 +03:00
ghost
4fa33afe40 prevent infinite connection when streaming resources are detected 2023-06-04 17:02:32 +03:00
ghost
345c59b5f4 collect target location links when a page redirect is available 2023-06-04 14:58:33 +03:00
ghost
f49076bb0c index homepages and shorter URL with higher priority 2023-06-04 11:38:56 +03:00
ghost
81f7ea1e1e implement multi-storage snap downloads 2023-05-15 09:18:18 +03:00
ghost
1969707eeb integrate optional MEGA/cmd snap storage 2023-05-14 19:41:20 +03:00
ghost
50c9066f62 add tables optimization to the cron/cleaner task 2023-05-14 02:39:32 +03:00
ghost
0d19004e86 make local snap storage optimization 2023-05-14 01:45:55 +03:00
ghost
2f7d99079d implement local snaps 2023-05-13 10:15:07 +03:00
ghost
d98b8f5c94 remove hostPageToHostPage.quantity field because it counts duplicates incorrectly on reindex 2023-05-13 06:30:40 +03:00
ghost
eeeb3dceac implement index explorer 2023-05-13 05:54:15 +03:00
ghost
377b519a2c implement host page info mode 2023-05-13 03:51:34 +03:00
ghost
371670fadf add media referrers info 2023-05-13 03:01:00 +03:00
ghost
4486bdc215 show mime type options that match search results only 2023-05-10 20:37:05 +03:00
ghost
307ebcf0b1 add page description on title | description | keywords not empty, remove deprecated constructions 2023-05-10 19:35:01 +03:00
ghost
7c5ba050b2 fix media crawling 2023-05-10 18:35:18 +03:00
ghost
0fed16621a fix mime content type update 2023-05-10 14:47:33 +03:00
ghost
db0e66c846 refactor to mime-based content index #1 2023-05-10 12:47:36 +03:00
ghost
0ffcee1efb fix image description updates timing 2023-05-09 15:53:21 +03:00
ghost
2c5ca1b630 fix image description duplicate 2023-05-09 15:23:32 +03:00
ghost
28bf526d53 add host nsfw settings 2023-05-09 13:26:19 +03:00
ghost
8ce0324e94 convert page data to string 2023-05-09 12:52:07 +03:00
ghost
dfca5570c6 remove unused construction 2023-05-09 12:10:42 +03:00
ghost
d186fff48f skip curl download on response data size reached 2023-05-09 10:21:37 +03:00
ghost
ef4de6b245 fix image search page errors 2023-05-09 08:53:33 +03:00
ghost
23ead4e12c update page / image description models, implement history snap crawling 2023-05-09 08:19:49 +03:00
ghost
0e9d29675f implement host page description history crawling 2023-05-09 01:29:32 +03:00
ghost
32d0f390d3 update http code and mime type on page/image ban event 2023-05-08 14:13:53 +03:00
ghost
8fbd7f3516 count totals using sphinx index instead of database 2023-05-08 12:28:49 +03:00
ghost
25b6bce2ec add crawler/cleaner logs 2023-05-08 11:04:59 +03:00
ghost
ea04220de3 add curl requests debug 2023-05-08 08:27:21 +03:00
ghost
6c41dd5831 fix ban time update / count affected rows only 2023-05-06 10:11:25 +03:00
ghost
b6605b9132 implement ban feature for unreachable resources, with timeout to prevent extra http requests 2023-05-06 08:45:37 +03:00
ghost
702a14b634 add mime content type crawling #1 2023-05-06 07:25:54 +03:00
ghost
f88d2ee9ff implement MIME content-type crawler filter 2023-05-05 21:25:57 +03:00
ghost
bed5d3f149 fix offset out of bounds error 2023-05-05 15:16:36 +03:00
ghost
5999fb3a73 add distributed hosts crawling using yggo nodes manifest 2023-05-05 05:26:53 +03:00
ghost
f0b2eb1613 show images total instead of pages in placeholder on image search page 2023-05-05 01:42:44 +03:00
ghost
297563d4a5 display related pages in priority to the unique host by rank, rand() order 2023-05-04 10:53:37 +03:00
ghost
34b7291228 add related to image hostpages limit 2023-05-04 10:17:47 +03:00
ghost
adc791f378 fix updateTime init 2023-05-04 10:11:13 +03:00
ghost
d4f66c83e7 fix image crawling errors 2023-05-04 08:51:45 +03:00
ghost
baa8b0d2f0 fix data type formatting 2023-05-04 07:58:07 +03:00
ghost
79878d17fe add crawler / proxy user agent settings 2023-05-04 07:38:22 +03:00
ghost
73f212e3d7 set crawler queue order priority to item rank, rand() 2023-05-04 06:55:05 +03:00