33 Commits

Author  SHA1  Message  Date
ghost  d4f66c83e7  fix image crawling errors  2023-05-04 08:51:45 +03:00
ghost  baa8b0d2f0  fix data type formatting  2023-05-04 07:58:07 +03:00
ghost  79878d17fe  add crawler / proxy user agent settings  2023-05-04 07:38:22 +03:00
ghost  9ed8411d2f  add image queue crawler  2023-05-04 06:45:04 +03:00
ghost  d905e33b4f  update host images info on search requests  2023-05-04 06:12:51 +03:00
ghost  0741a3e9ef  implement image crawler  2023-05-04 01:04:39 +03:00
ghost  1ee2ac4f0b  add yggo:manifest namespace  2023-05-03 09:38:58 +03:00
ghost  f8e0a50db6  add manifest url filter  2023-05-03 09:26:48 +03:00
ghost  6d8f4f4882  create manifests registry  2023-05-03 09:22:14 +03:00
ghost  eb3e70a7b7  fix robots.txt conditions  2023-05-03 04:17:58 +03:00
ghost  a5f5541395  skip robots:noindex page without extra actions  2023-04-29 08:58:48 +03:00
ghost  e418ddcd32  fix data type  2023-04-25 21:20:35 +03:00
ghost  11aa404807  add metaYggo field index  2023-04-25 21:10:59 +03:00
ghost  5875dd58c9  fix PR update condition  2023-04-25 18:19:22 +03:00
ghost  8671fc4bde  implement page ranking  2023-04-25 16:54:01 +03:00
ghost  5936fa9a30  fix quota check condition  2023-04-23 04:31:32 +03:00
ghost  8dbb4a06af  add disk quota validation  2023-04-23 04:05:00 +03:00
ghost  dfbc6132c9  fix robots:noindex condition, add robots:nofollow attribute support  2023-04-09 15:25:15 +03:00
ghost  5c8d299a4a  add meta:robots tag support #2  2023-04-09 03:28:31 +03:00
ghost  8e8d89db0e  implement database cleaner  2023-04-09 00:06:28 +03:00
ghost  0484d43482  fix trim path levels in the relative links  2023-04-08 23:52:46 +03:00
ghost  df6f2a1869  implement CRAWL_ROBOTS_POSTFIX_RULES configuration #5  2023-04-08 22:28:31 +03:00
ghost  b3c668706b  trim path levels in the relative links  2023-04-08 19:14:04 +03:00
ghost  71a3e7dd0e  skip x-raw-image links crawl  2023-04-08 19:11:12 +03:00
ghost  9b9d40a97c  skip javascript/mailto links index  2023-04-07 05:19:32 +03:00
ghost  2a843449e0  add process locked notice to the debug output  2023-04-07 04:58:56 +03:00
ghost  ce509ec0a8  remove debug row  2023-04-07 04:39:25 +03:00
ghost  2495a2bbc7  implement MySQL/Sphinx data model #3, add basical robots.txt support #2  2023-04-07 04:04:24 +03:00
ghost  79663c84db  add CRAWL_META_ONLY option  2023-04-03 03:07:54 +03:00
ghost  04dbbc3adf  make url/src column ukeys digital by using crc32  2023-04-02 18:56:56 +03:00
ghost  b218b8bbc3  make url/src columns unique keys, add insert/ignore construction  2023-04-02 18:09:44 +03:00
ghost  1485983b3a  lock multi-thread execution  2023-04-02 00:27:33 +03:00
ghost  72985eaf9e  initial commit  2023-04-01 19:29:39 +03:00