Installing HTTrack on Linux and using the httrack command line

March 4, 2013

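Original content. If you repost it, please credit the source: https://www.myzhenai.com/thread-15442-1-1.html https://www.myzhenai.com.cn/post/948.html
HTTrack (https://www.httrack.com/) is a website mirroring program: it downloads a site's structure to your local machine. Some people also call it a website ripper or website copier.
With HTTrack you can download the pages of a website on the Internet to your local computer. By default, HTTrack organizes the downloaded pages according to the original site's relative-link structure.
HTTrack downloads sites with a web crawler. For sites that have a robots.txt, the program will not mirror the site completely under its default settings unless you lift that restriction when you run it. HTTrack can follow basic JavaScript links, as well as links inside Applets or Flash, but it cannot handle complex links (those built with functions and expressions) or server-side image maps.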
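As a rough illustration of the robots.txt point, the sketch below only prints the command lines it would run, so it can be tried without HTTrack installed. The `-sN` option is taken from HTTrack's own help text (`0=never`, `2=always`, which is the default). The URL and output directory are placeholder assumptions, and `robots_cmd` is a hypothetical helper name, not part of HTTrack.

```shell
#!/bin/sh
# Hypothetical helper: print an httrack command line with a given robots.txt policy.
# $1 = URL, $2 = output directory, $3 = value for -sN (0 = never follow, 2 = always follow)
robots_cmd() {
    printf 'httrack %s -O %s -s%s\n' "$1" "$2" "$3"
}

# Default behaviour: robots.txt is honoured, so disallowed paths are skipped.
robots_cmd "https://www.example.com/" "/tmp/example-mirror" 2

# Restriction lifted: robots.txt rules are ignored for this run.
robots_cmd "https://www.example.com/" "/tmp/example-mirror" 0
```

To actually run one of the printed lines, copy it to the prompt once httrack is installed.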
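My system is CentOS 6.3, so I installed the Linux version of HTTrack. Oddly, there was no GUI after installation. I am not sure whether I was missing some Qt library, but it works fine from the command line. Installation is very simple.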

#wget https://download.httrack.com/httrack-3.46.1.tar.gz
#tar -xzvf httrack-3.46.1.tar.gz
#cd httrack-3.46.1
#./configure
#make
#make install
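The same build steps can be wrapped in a small script that stops at the first failing step. This is only a sketch under a few assumptions: `wget`, `tar`, and a C toolchain are already installed, the function names `httrack_tarball_url` and `install_httrack` are made up for illustration, and `tar -xzf` is used instead of `-xzvf` simply to keep the output short.

```shell
#!/bin/sh
# Version used in this article.
HTTRACK_VERSION="3.46.1"

# Hypothetical helper: build the download URL for a given version.
httrack_tarball_url() {
    printf 'https://download.httrack.com/httrack-%s.tar.gz\n' "$1"
}

# Hypothetical helper: download, unpack, configure, build and install.
# Each step runs only if the previous one succeeded. The subshell keeps
# the `cd` from changing the caller's working directory.
install_httrack() {
    version="$1"
    url=$(httrack_tarball_url "$version")
    (
        wget "$url" &&
        tar -xzf "httrack-$version.tar.gz" &&
        cd "httrack-$version" &&
        ./configure &&
        make &&
        make install
    )
}

# Uncomment to run (as root, because of `make install`):
# install_httrack "$HTTRACK_VERSION"
```

The install call is left commented out so that loading the script does not download or install anything by accident.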

 

If no error messages have appeared by this point, the installation is complete.
Running httrack from the command line

#httrack

 

Welcome to HTTrack Website Copier (Offline Browser) 3.46+libhtsjava.so.2
Copyright (C) Xavier Roche and other contributors
To see the option list, enter a blank line or try httrack --help

Enter project name :
//Enter a project name; the program automatically creates a directory with that name

Base path (return=/root/websites/) :
//Local save path and directory; enter a path on the local machine.

Enter URLs (separated by commas or blank spaces) :
//URL of the site to fetch; here we use https://www.dnspod.cn/ as the demonstration

Action:
(enter) 1 Mirror Web Site(s)
2 Mirror Web Site(s) with Wizard
3 Just Get Files Indicated
4 Mirror ALL links in URLs (Multiple Mirror)
5 Test Links In URLs (Bookmark Test)
0 Quit
:
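//Mirror mode options. In short:
//1 mirrors the site(s); this is the default if you just press Enter
//2 mirrors the site(s) in wizard mode, which asks questions along the way
//3 only fetches the files named in the URLs
//4 mirrors all links found in the given URLs (multiple mirror)
//5 tests the links in the URLs (bookmark test)
//0 quits
//I chose 2 here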

Proxy (return=none) :
//Whether to use a proxy. I don't use one, so I just pressed Enter.

You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
Wildcards (return=none) :
//Wildcard filters for the download; I just pressed Enter

You can define additional options, such as recurse level (-r), separed by blank spaces
To see the option list, type help
Additional options (return=none) :
//Fetch options; type help to see the detailed parameters. Here I just pressed Enter

---> Wizard command line: httrack https://www.dnspod.cn/ -W -O "/home/RucLinux/DNSPod" -%v

Ready to launch the mirror? (Y/n) :
//Type Y and press Enter

WARNING! You are running this program as root!
It might be a good idea to use the -%U option to change the userid:
Example: -%U smith

Mirror launched on Mon, 04 Mar 2013 02:35:02 by HTTrack Website Copier/3.46+libhtsjava.so.2 [XR&CO'2010]
mirroring https://www.dnspod.cn/ with the wizard help..
Done.
Thanks for using HTTrack!
*
//Done
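The answers given at the prompts end up as one command line, so the same mirror can in principle be started without the interactive questions. The sketch below assembles such a command and prints it rather than running it. It assumes that leaving out `-W` gives the default non-interactive mirror mode (`-w` is marked as the default in the help text), and that `-rN` and scan rules such as `-*.gif` are passed as shown at the "Wildcards" and "Additional options" prompts. `build_mirror_cmd` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: assemble a non-interactive httrack command line.
# $1 = URL, $2 = output directory, $3 = mirror depth for -rN (optional),
# remaining arguments = scan rules such as '-*.gif' (optional).
build_mirror_cmd() {
    url="$1"
    dest="$2"
    depth="${3:-}"
    # Drop the positional arguments already consumed; whatever is left are scan rules.
    if [ "$#" -gt 3 ]; then shift 3; else shift "$#"; fi

    cmd="httrack $url -O \"$dest\" -%v"
    if [ -n "$depth" ]; then
        cmd="$cmd -r$depth"
    fi
    for rule in "$@"; do
        # Quote each rule so the shell does not expand the wildcard itself.
        cmd="$cmd '$rule'"
    done
    printf '%s\n' "$cmd"
}

# Same site and path as the walkthrough, without the wizard.
build_mirror_cmd "https://www.dnspod.cn/" "/home/RucLinux/DNSPod"

# Same again, limited to depth 3 and skipping GIF files.
build_mirror_cmd "https://www.dnspod.cn/" "/home/RucLinux/DNSPod" 3 '-*.gif'
```

Printing the command first makes it easy to check the quoting before anything is downloaded.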

Alternate download location for httrack: https://ruclinux.googlecode.com/files/httrack-3.46.1.tar.gz
Full session transcript

[RucLinux@localhost ~]$ su root
Password:
[root@localhost RucLinux]# httrack

Welcome to HTTrack Website Copier (Offline Browser) 3.46+libhtsjava.so.2
Copyright (C) Xavier Roche and other contributors
To see the option list, enter a blank line or try httrack --help

Enter project name :DNSPod

Base path (return=/root/websites/) :/home/RucLinux/

Enter URLs (separated by commas or blank spaces) :https://www.dnspod.cn/

Action:
(enter)	1	Mirror Web Site(s)
	2	Mirror Web Site(s) with Wizard
	3	Just Get Files Indicated
	4	Mirror ALL links in URLs (Multiple Mirror)
	5	Test Links In URLs (Bookmark Test)
	0	Quit
: 2

Proxy (return=none) :

You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
Wildcards (return=none) :

You can define additional options, such as recurse level (-r), separed by blank spaces
To see the option list, type help
Additional options (return=none) :help

General options:
  O  path for mirror/logfiles+cache (-O path_mirror[,path_cache_and_logfiles]) (--path )
 %O  chroot path to, must be r00t (-%O root_path) (--chroot )

Action options:
  w *mirror web sites (--mirror)
  W  mirror web sites, semi-automatic (asks questions) (--mirror-wizard)
  g  just get files (saved in the current directory) (--get-files)
  i  continue an interrupted mirror using the cache (--continue)
  Y   mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)

Proxy options:
  P  proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy )
 %f *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])
 %b  use this local hostname to make/send requests (-%b hostname) (--bind )

Limits options:
  rN set the mirror depth to N (* r9999) (--depth[=N])
 %eN set the external links depth to N (* %e0) (--ext-depth[=N])
  mN maximum file length for a non-html file (--max-files[=N])
  mN,N2 maximum file length for non html (N) and html (N2)
  MN maximum overall size that can be uploaded/scanned (--max-size[=N])
  EN maximum mirror time in seconds (60=1 minute, 3600=1 hour) (--max-time[=N])
  AN maximum transfer rate in bytes/seconds (1000=1KB/s max) (--max-rate[=N])
 %cN maximum number of connections/seconds (*%c10) (--connection-per-second[=N])
  GN pause transfer if N bytes reached, and wait until lock file is deleted (--max-pause[=N])
 %mN maximum mms stream download time in seconds (60=1 minute, 3600=1 hour) (--max-mms-time[=N])

Flow control:
  cN number of multiple connections (*c8) (--sockets[=N])
  TN timeout, number of seconds after a non-responding link is shutdown (--timeout)
  RN number of retries, in case of timeout or non-fatal errors (*R1) (--retries[=N])
  JN traffic jam control, minimum transfert rate (bytes/seconds) tolerated for a link (--min-rate[=N])
  HN host is abandonned if: 0=never, 1=timeout, 2=slow, 3=timeout or slow (--host-control[=N])

Links options:
 %P *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
  n  get non-html files 'near' an html file (ex: an image located outside) (--near)
  t  test all URLs (even forbidden ones) (--test)
 %L  add all URL located in this text file (one URL per line) (--list )
 %S  add all scan rules located in this text file (one scan rule per line) (--urllist )

Build options:
  NN structure type (0 *original structure, 1+: see below) (--structure[=N])
     or user defined structure (-N "%h%p/%n%q.%t")
 %N  delayed type check, don't make any link test but wait for files download to start instead (experimental) (%N0 don't use, %N1 use for unknown extensions, * %N2 always use)
 %D  cached delayed type check, don't wait for remote type during updates, to speedup them (%D0 wait, * %D1 don't wait) (--cached-delayed-type-check)
 %M  generate a RFC MIME-encapsulated full-archive (.mht) (--mime-html)
  LN long names (L1 *long names / L0 8-3 conversion / L2 ISO9660 compatible) (--long-names[=N])
  KN keep original links (e.g. https://www.adr/link) (K0 *relative link, K absolute links, K4 original links, K3 absolute URI links, K5 transparent proxy link) (--keep-links[=N])
  x  replace external html links by error pages (--replace-external)
 %x  do not include any password for external password protected websites (%x0 include) (--disable-passwords)
 %q *include query string for local files (useless, for information purpose only) (%q0 don't include) (--include-query-string)
  o *generate output html file in case of error (404..) (o0 don't generate) (--generate-errors)
  X *purge old files after update (X0 keep delete) (--purge-old[=N])
 %p  preserve html files 'as is' (identical to '-K4 -%F ""') (--preserve)
 %T  links conversion to UTF-8 (--utf8-conversion)

Spider options:
  bN accept cookies in cookies.txt (0=do not accept,* 1=accept) (--cookies[=N])
  u  check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (--check-type[=N])
  j *parse Java Classes (j0 don't parse, bitmask: |1 parse default, |2 don't parse .class |4 don't parse .js |8 don't be aggressive) (--parse-java[=N])
  sN follow robots.txt and meta robots tags (0=never,1=sometimes,* 2=always, 3=always (even strict rules)) (--robots[=N])
 %h  force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
 %k  use keep-alive if possible, greately reducing latency for small files and test requests (%k0 don't use) (--keep-alive)
 %B  tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
 %s  update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
 %u  url hacks: various hacks to limit duplicate URLs (strip //, www.foo.com==foo.com..) (--urlhack)
 %A  assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip) (--assume )
     shortcut: '--assume standard' is equivalent to -%A php2 php3 php4 php cgi asp jsp pl cfm nsf=text/html
     can also be used to force a specific file type: --assume foo.cgi=text/html
 @iN internet protocol (0=both ipv6+ipv4, 4=ipv4 only, 6=ipv6 only) (--protocol[=N])
 %w  disable a specific external mime module (-%w htsswf -%w htsjava) (--disable-module )

Browser ID:
  F  user-agent field sent in HTTP headers (-F "user-agent name") (--user-agent )
 %R  default referer field sent in HTTP headers (--referer )
 %E  from email address sent in HTTP headers (--from )
 %F  footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer )
 %l  preffered language (-%l "fr, en, jp, *" (--language )

Log, index, cache
  C  create/use a cache for updates and retries (C0 no cache,C1 cache is prioritary,* C2 test update before) (--cache[=N])
  k  store all files in cache (not useful if files on disk) (--store-all-in-cache)
 %n  do not re-download locally erased files (--do-not-recatch)
 %v  display on screen filenames downloaded (in realtime) - * %v1 short version - %v2 full animation (--display)
  Q  no log - quiet mode (--do-not-log)
  q  no questions - quiet mode (--quiet)
  z  log - extra infos (--extra-log)
  Z  log - debug (--debug-log)
  v  log on screen (--verbose)
  f *log in files (--file-log)
  f2 one single log file (--single-log)
  I *make an index (I0 don't make) (--index)
 %i  make a top index for a project folder (* %i0 don't make) (--build-top-index)
 %I  make an searchable index for this mirror (* %I0 don't make) (--search-index)

Expert options:
  pN priority mode: (* p3) (--priority[=N])
      p0 just scan, don't save anything (for checking links)
      p1 save only html files
      p2 save only non html files
     *p3 save all files
      p7 get html files before, then treat other files
  S  stay on the same directory (--stay-on-same-dir)
  D *can only go down into subdirs (--can-go-down)
  U  can only go to upper directories (--can-go-up)
  B  can both go up&down into the directory structure (--can-go-up-and-down)
  a *stay on the same address (--stay-on-same-address)
  d  stay on the same principal domain (--stay-on-same-domain)
  l  stay on the same TLD (eg: .com) (--stay-on-same-tld)
  e  go everywhere on the web (--go-everywhere)
 %H  debug HTTP headers in logfile (--debug-headers)

Guru options: (do NOT use if possible)
 #X *use optimized engine (limited memory boundary checks) (--fast-engine)
 #0  filter test (-#0 '*.gif' 'www.bar.com/foo.gif') (--debug-testfilters )
 #1  simplify test (-#1 ./foo/bar/../foobar)
 #2  type test (-#2 /foo/bar.php)
 #C  cache list (-#C '*.com/spider*.gif' (--debug-cache )
 #R  cache repair (damaged cache) (--repair-cache)
 #d  debug parser (--debug-parsing)
 #E  extract new.zip cache meta-data in meta.zip
 #f  always flush log files (--advanced-flushlogs)
 #FN maximum number of filters (--advanced-maxfilters[=N])
 #h  version info (--version)
 #K  scan stdin (debug) (--debug-scanstdin)
 #L  maximum number of links (-#L1000000) (--advanced-maxlinks)
 #p  display ugly progress information (--advanced-progressinfo)
 #P  catch URL (--catch-url)
 #R  old FTP routines (debug) (--repair-cache)
 #T  generate transfer ops. log every minutes (--debug-xfrstats)
 #u  wait time (--advanced-wait)
 #Z  generate transfer rate statictics every minutes (--debug-ratestats)
 #!  execute a shell command (-#! "echo hello") (--exec )

Dangerous options: (do NOT use unless you exactly know what you are doing)
 %!  bypass built-in security limits aimed to avoid bandwith abuses (bandwidth, simultaneous connections) (--disable-security-limits)
     IMPORTANT NOTE: DANGEROUS OPTION, ONLY SUITABLE FOR EXPERTS
                     USE IT WITH EXTREME CARE

Command-line specific options:
  V execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd )
 %U run the engine with another id when called as root (-%U smith) (--user )
 %W use an external library function as a wrapper (-%W myfoo.so[,myparameters]) (--callback )

Details: Option N
  N0 Site-structure (default)
  N1 HTML in web/, images/other files in web/images/
  N2 HTML in web/HTML, images/other in web/images
  N3 HTML in web/,  images/other in web/
  N4 HTML in web/, images/other in web/xxx, where xxx is the file extension (all gif will be placed onto web/gif, for example)
  N5 Images/other in web/xxx and HTML in web/HTML
  N99 All files in web/, with random names (gadget !)
  N100 Site-structure, without www.domain.xxx/
  N101 Identical to N1 exept that "web" is replaced by the site's name
  N102 Identical to N2 exept that "web" is replaced by the site's name
  N103 Identical to N3 exept that "web" is replaced by the site's name
  N104 Identical to N4 exept that "web" is replaced by the site's name
  N105 Identical to N5 exept that "web" is replaced by the site's name
  N199 Identical to N99 exept that "web" is replaced by the site's name
  N1001 Identical to N1 exept that there is no "web" directory
  N1002 Identical to N2 exept that there is no "web" directory
  N1003 Identical to N3 exept that there is no "web" directory (option set for g option)
  N1004 Identical to N4 exept that there is no "web" directory
  N1005 Identical to N5 exept that there is no "web" directory
  N1099 Identical to N99 exept that there is no "web" directory
Details: User-defined option N
  '%n' Name of file without file type (ex: image)
  '%N' Name of file, including file type (ex: image.gif)
  '%t' File type (ex: gif)
  '%p' Path [without ending /] (ex: /someimages)
  '%h' Host name (ex: www.someweb.com)
  '%M' URL MD5 (128 bits, 32 ascii bytes)
  '%Q' query string MD5 (128 bits, 32 ascii bytes)
  '%k' full query string
  '%r' protocol name (ex: http)
  '%q' small query string MD5 (16 bits, 4 ascii bytes)
     '%s?' Short name version (ex: %sN)
  '%[param]' param variable in query string
  '%[param:before:after:empty:notfound]' advanced variable extraction
Details: User-defined option N and advanced variable extraction
   %[param:before:after:empty:notfound]
   param : parameter name
   before : string to prepend if the parameter was found
   after : string to append if the parameter was found
   notfound : string replacement if the parameter could not be found
   empty : string replacement if the parameter was empty
   all fields, except the first one (the parameter name), can be empty

Details: Option K
  K0  foo.cgi?q=45  ->  foo4B54.html?q=45 (relative URI, default)
  K                 ->  https://www.foobar.com/folder/foo.cgi?q=45 (absolute URL) (--keep-links[=N])
  K3                ->  /folder/foo.cgi?q=45 (absolute URI)
  K4                ->  foo.cgi?q=45 (original URL)
  K5                ->  https://www.foobar.com/folder/foo4B54.html?q=45 (transparent proxy URL)

Shortcuts:
--mirror       *make a mirror of site(s) (default)
--get           get the files indicated, do not seek other URLs (-qg)
--list     add all URL located in this text file (-%L)
--mirrorlinks   mirror all links in 1st level pages (-Y)
--testlinks     test links in pages (-r1p0C0I0t)
--spider        spider site(s), to test links: reports Errors & Warnings (-p0C0I0t)
--testsite      identical to --spider
--skeleton      make a mirror, but gets only html files (-p1)
--update              update a mirror, without confirmation (-iC2)
--continue            continue a mirror, without confirmation (-iC1)

--catchurl            create a temporary proxy to capture an URL or a form post URL
--clean               erase cache & log files

--http10              force http/1.0 requests (-%h)

Details: Option %W: External callbacks prototypes
see htsdefines.h

example: httrack www.someweb.com/bob/
means:   mirror site www.someweb.com/bob/ and only this site

example: httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*
means:   mirror the two sites together (with shared links) and accept any .jpg files on .com sites

example: httrack www.someweb.com/bob/bobby.html +* -r6
means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web

example: httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
runs the spider on www.someweb.com/bob/bobby.html using a proxy

example: httrack --update
updates a mirror in the current folder

example: httrack
will bring you to the interactive mode

example: httrack --continue
continues a mirror in the current folder

HTTrack version 3.46 (compiled Mar  3 2013)
Copyright (C) Xavier Roche and other contributors

You can define additional options, such as recurse level (-r), separed by blank spaces
To see the option list, type help
Additional options (return=none) :

---> Wizard command line: httrack https://www.dnspod.cn/ -W -O "/home/RucLinux/DNSPod"  -%v  

Ready to launch the mirror? (Y/n) :y

WARNING! You are running this program as root!
It might be a good idea to use the -%U option to change the userid:
Example: -%U smith

Mirror launched on Mon, 04 Mar 2013 02:35:02 by HTTrack Website Copier/3.46+libhtsjava.so.2 [XR&CO'2010]
mirroring https://www.dnspod.cn/ with the wizard help..
Done.
Thanks for using HTTrack!
* 
[root@localhost RucLinux]# 
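Two things from this session lend themselves to a short follow-up sketch: the root warning, which suggests `-%U`, and the `--update` shortcut from the help text, which updates a mirror in the current folder. The script below only prints the command it would run. Combining `--update` with `-%U` on one command line is an assumption on my part, the user name is whatever unprivileged account you prefer, and `userid_option` and `update_cmd` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: print the -%U option only when the given uid is root (0).
# $1 = numeric uid, $2 = unprivileged user name to switch to
userid_option() {
    if [ "$1" -eq 0 ]; then
        printf '%s %s\n' '-%U' "$2"
    fi
}

# Hypothetical helper: print the command for refreshing an existing mirror.
# $1 = project directory, $2 = numeric uid, $3 = user name for -%U
update_cmd() {
    opt=$(userid_option "$2" "$3")
    # ${opt:+ $opt} adds a leading space and the option only when opt is non-empty.
    printf 'cd "%s" && httrack --update%s\n' "$1" "${opt:+ $opt}"
}

# Uses the real uid of whoever runs the script.
update_cmd "/home/RucLinux/DNSPod" "$(id -u)" "RucLinux"
```

Run as a normal user, this prints the plain `--update` command; run as root, it appends the `-%U` option.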


 

