Commit Graph

37 Commits

Author SHA1 Message Date
ghost
eccb7ea241 refactor hostPageDom tables, add multiple selectors and children values support 2023-08-17 18:32:48 +03:00
ghost
2b49ff5f6a move hostPageDescription.data field data to hostPageDom.value 2023-08-16 23:25:45 +03:00
ghost
d024ffd770 implement unlimited settings customization for each host 2023-08-05 19:06:39 +03:00
ghost
71724ae33f refactor manifest crawling 2023-08-04 09:00:03 +03:00
ghost
b24d31f360 refactor cleaner, delegate tasks to crawler, init hostSetting table 2023-08-03 15:25:38 +03:00
ghost
3e3b7ee2ef optimize snaps, delete unused constructions 2023-07-30 19:09:41 +03:00
ghost
712d67f6bf implement unlimited snap storage mirrors, delete megaCMD integration 2023-07-29 14:37:01 +03:00
ghost
1dd0a8ee2c make page rank procedural, optimize performance 2023-07-28 12:49:43 +03:00
ghost
5346b13602 implement custom hostPageDom elements index 2023-06-25 22:10:47 +03:00
ghost
0949d7f871 set default encoding 2023-06-14 02:20:09 +03:00
ghost
ab78e17ca8 add hostPage.size collection 2023-06-13 12:45:12 +03:00
ghost
7892784f5c add httpCode column to hostPageSnapDownload table 2023-06-12 13:34:25 +03:00
ghost
0af5d165d3 remove logCrawler column not in use 2023-06-05 22:06:55 +03:00
ghost
81f7ea1e1e implement multi-storage snap downloads 2023-05-15 09:18:18 +03:00
ghost
1969707eeb integrate optional MEGA/cmd snap storage 2023-05-14 19:41:20 +03:00
ghost
0d19004e86 optimize local snap storage 2023-05-14 01:45:55 +03:00
ghost
2f7d99079d implement local snaps 2023-05-13 10:15:07 +03:00
ghost
d98b8f5c94 remove hostPageToHostPage.quantity field because it counts duplicates incorrectly on reindex 2023-05-13 06:30:40 +03:00
ghost
db0e66c846 refactor to mime-based content index #1 2023-05-10 12:47:36 +03:00
ghost
2c5ca1b630 fix image description duplicate 2023-05-09 15:23:32 +03:00
ghost
1c7cca1446 fix UNIQUE index relation 2023-05-09 14:10:08 +03:00
ghost
28bf526d53 add host nsfw settings 2023-05-09 13:26:19 +03:00
ghost
23ead4e12c update page / image description models, implement history snap crawling 2023-05-09 08:19:49 +03:00
ghost
0e9d29675f implement host page description history crawling 2023-05-09 01:29:32 +03:00
ghost
25b6bce2ec add crawler/cleaner logs 2023-05-08 11:04:59 +03:00
ghost
b6605b9132 implement ban feature with timeout for unreachable resources to prevent extra HTTP requests 2023-05-06 08:45:37 +03:00
ghost
702a14b634 add mime content type crawling #1 2023-05-06 07:25:54 +03:00
ghost
5999fb3a73 add distributed hosts crawling using yggo nodes manifest 2023-05-05 05:26:53 +03:00
ghost
d4f66c83e7 fix image crawling errors 2023-05-04 08:51:45 +03:00
ghost
68581960a3 add image.data field 2023-05-04 05:19:29 +03:00
ghost
0741a3e9ef implement image crawler 2023-05-04 01:04:39 +03:00
ghost
78931ebc74 normalize host image description storage 2023-05-03 21:52:00 +03:00
ghost
db617f9939 refactor image storage model 2023-05-03 21:27:15 +03:00
ghost
6d8f4f4882 create manifests registry 2023-05-03 09:22:14 +03:00
ghost
11aa404807 add metaYggo field index 2023-04-25 21:10:59 +03:00
ghost
df6f2a1869 implement CRAWL_ROBOTS_POSTFIX_RULES configuration #5 2023-04-08 22:28:31 +03:00
ghost
2495a2bbc7 implement MySQL/Sphinx data model #3, add basic robots.txt support #2 2023-04-07 04:04:24 +03:00