Commit Graph

87 Commits

Author SHA1 Message Date
ghost
d96abb8ea8 ban host page on encoding not detected 2023-06-16 13:23:52 +03:00
ghost
d2469e9adc fix meta variables overwrite 2023-06-14 02:53:14 +03:00
ghost
1d5d5ead5d fix DomDocument initialization without encoding provided 2023-06-14 02:20:00 +03:00
ghost
8a747de341 fix HTML/multimedia content detection 2023-06-13 23:09:44 +03:00
ghost
93c6067fd9 fix host page mime detection 2023-06-13 22:29:28 +03:00
ghost
80d3912bc7 allow x-raw-image links 2023-06-13 20:26:17 +03:00
ghost
b23f550a1b skip magnet links 2023-06-13 20:25:37 +03:00
ghost
ab78e17ca8 add hostPage.size collection 2023-06-13 12:45:12 +03:00
ghost
0af5d165d3 remove logCrawler column not in use 2023-06-05 22:06:55 +03:00
ghost
4b16b41440 make transaction for each item in crawl queue 2023-06-05 22:01:22 +03:00
ghost
b585b16d31 fix datatype error detection 2023-06-05 21:02:18 +03:00
ghost
c5e25d17fb prevent page ban when its MIME is in the whitelist, only skip the steps below (make multimedia/streaming resources visible in search results) 2023-06-04 17:44:09 +03:00
ghost
4fa33afe40 prevent infinite connection when streaming resources are detected 2023-06-04 17:02:32 +03:00
ghost
345c59b5f4 collect target location links when a page redirect is available 2023-06-04 14:58:33 +03:00
ghost
242e0abd86 ban pages on data type error codes only 2023-06-04 13:10:32 +03:00
ghost
512bd56056 ban page that throws the error and stalls the crawl queue 2023-06-04 12:04:41 +03:00
ghost
81f7ea1e1e implement multi-storage snap downloads 2023-05-15 09:18:18 +03:00
ghost
1969707eeb integrate optional MEGA/cmd snap storage 2023-05-14 19:41:20 +03:00
ghost
bd99dcb023 add leading zero to mkdir access code 2023-05-14 05:43:03 +03:00
ghost
48664f0caf fix zip close, loop break condition 2023-05-14 04:33:35 +03:00
ghost
0d19004e86 make local snap storage optimization 2023-05-14 01:45:55 +03:00
ghost
efc66d5dab update local snap storage paths 2023-05-13 11:06:40 +03:00
ghost
2f7d99079d implement local snaps 2023-05-13 10:15:07 +03:00
ghost
9477d87b2e change strpos to stripos 2023-05-13 01:28:50 +03:00
ghost
28e8bcf8d7 add audio/video media crawl support 2023-05-13 01:23:09 +03:00
ghost
307ebcf0b1 add page description on title | description | keywords not empty, remove deprecated constructions 2023-05-10 19:35:01 +03:00
ghost
7c5ba050b2 fix media crawling 2023-05-10 18:35:18 +03:00
ghost
0fed16621a fix mime content type update 2023-05-10 14:47:33 +03:00
ghost
db0e66c846 refactor to mime-based content index #1 2023-05-10 12:47:36 +03:00
ghost
0ffcee1efb fix image description updates timing 2023-05-09 15:53:21 +03:00
ghost
2c5ca1b630 fix image description duplicate 2023-05-09 15:23:32 +03:00
ghost
28bf526d53 add host nsfw settings 2023-05-09 13:26:19 +03:00
ghost
8ce0324e94 convert page data to string 2023-05-09 12:52:07 +03:00
ghost
d186fff48f skip curl download on response data size reached 2023-05-09 10:21:37 +03:00
ghost
d7a5f7ef84 remove content filter, snap the raw data 2023-05-09 09:02:17 +03:00
ghost
23ead4e12c update page / image description models, implement history snap crawling 2023-05-09 08:19:49 +03:00
ghost
0e9d29675f implement host page description history crawling 2023-05-09 01:29:32 +03:00
ghost
6371def666 fix attributes passing 2023-05-08 17:52:17 +03:00
ghost
32d0f390d3 update http code and mime type on page/image ban event 2023-05-08 14:13:53 +03:00
ghost
84dcecf50b add svg images support, fix mime validation 2023-05-08 13:12:16 +03:00
ghost
bf1eeb332c fix page/image mime content type detection 2023-05-08 12:10:57 +03:00
ghost
25b6bce2ec add crawler/cleaner logs 2023-05-08 11:04:59 +03:00
ghost
dcdc2c50ad update debug string names 2023-05-08 08:31:34 +03:00
ghost
ea04220de3 add curl requests debug 2023-05-08 08:27:21 +03:00
ghost
1aba060d34 fix variable name 2023-05-08 07:23:50 +03:00
ghost
fdd18de373 remove abstraction 2023-05-06 14:03:43 +03:00
ghost
6c41dd5831 fix ban time update / count affected rows only 2023-05-06 10:11:25 +03:00
ghost
20514c455f add banned items counters 2023-05-06 08:50:41 +03:00
ghost
b6605b9132 implement not reachable resources ban feature with timeout to prevent extra http requests 2023-05-06 08:45:37 +03:00
ghost
702a14b634 add mime content type crawling #1 2023-05-06 07:25:54 +03:00