Commit Graph

26 Commits

Author  SHA1  Message  Date
ghost  5346b13602  implement custom hostPageDom elements index  2023-06-25 22:10:47 +03:00
ghost  dc2d971ba0  clean up banned pages extra data  2023-06-16 16:53:14 +03:00
ghost  acba2816e2  remove transaction from tables optimization case  2023-06-13 17:45:02 +03:00
ghost  b2cf9fc6a5  do table optimization in separated transaction  2023-06-13 16:51:16 +03:00
ghost  5d7f2bf68c  fix snap foreign keys deletion  2023-06-04 13:39:47 +03:00
ghost  62a4f33b53  load missed dependency  2023-06-04 12:27:20 +03:00
ghost  45c4f7b7b0  add database optimization settings  2023-05-29 22:13:41 +03:00
ghost  81f7ea1e1e  implement multi-storage snap downloads  2023-05-15 09:18:18 +03:00
ghost  1969707eeb  integrate optional MEGA/cmd snap storage  2023-05-14 19:41:20 +03:00
ghost  50c9066f62  add tables optimization to the cron/cleaner task  2023-05-14 02:39:32 +03:00
ghost  0d19004e86  make local snap storage optimization  2023-05-14 01:45:55 +03:00
ghost  2f7d99079d  implement local snaps  2023-05-13 10:15:07 +03:00
ghost  db0e66c846  refactor to mime-based content index #1  2023-05-10 12:47:36 +03:00
ghost  d186fff48f  skip curl download on response data size reached  2023-05-09 10:21:37 +03:00
ghost  23ead4e12c  update page / image description models, implement history snap crawling  2023-05-09 08:19:49 +03:00
ghost  0e9d29675f  implement host page description history crawling  2023-05-09 01:29:32 +03:00
ghost  25b6bce2ec  add crawler/cleaner logs  2023-05-08 11:04:59 +03:00
ghost  dcdc2c50ad  update debug string names  2023-05-08 08:31:34 +03:00
ghost  ea04220de3  add curl requests debug  2023-05-08 08:27:21 +03:00
ghost  b6605b9132  implement not reachable resources ban feature with timeout to prevent extra http requests  2023-05-06 08:45:37 +03:00
ghost  f88d2ee9ff  implement MIME content-type crawler filter  2023-05-05 21:25:57 +03:00
ghost  5999fb3a73  add distributed hosts crawling using yggo nodes manifest  2023-05-05 05:26:53 +03:00
ghost  79878d17fe  add crawler / proxy user agent settings  2023-05-04 07:38:22 +03:00
ghost  0741a3e9ef  implement image crawler  2023-05-04 01:04:39 +03:00
ghost  eb3e70a7b7  fix robots.txt conditions  2023-05-03 04:17:58 +03:00
ghost  8e8d89db0e  implement database cleaner  2023-04-09 00:06:28 +03:00