Scrapy crawler framework
Scrapy is a crawler framework written in Python. Its code architecture borrows ideas from Django, and it is flexible and powerful.

Below is an outline of the official documentation (Scrapy Documentation, Release 1.3.0), followed by its opening "Getting help" and "First steps" chapters.

Contents:

1 Getting help
2 First steps
    2.1 Scrapy at a glance
    2.2 Installation guide
    2.3 Scrapy Tutorial
    2.4 Examples
3 Basic concepts
    3.1 Command line tool
    3.2 Spiders
    3.3 Selectors
    3.4 Items
    3.5 Item Loaders
    3.6 Scrapy shell
    3.7 Item Pipeline
    3.8 Feed exports
    3.9 Requests and Responses
    3.10 Link Extractors
    3.11 Settings
    3.12 Exceptions
4 Built-in services
    4.1 Logging
    4.2 Stats Collection
    4.3 Sending e-mail
    4.4 Telnet Console
    4.5 Web Service
5 Solving specific problems
    5.1 Frequently Asked Questions
    5.2 Debugging Spiders
    5.3 Spiders Contracts
    5.4 Common Practices
    5.5 Broad Crawls
    5.6 Using Firefox for scraping
    5.7 Using Firebug for scraping
    5.8 Debugging memory leaks
    5.9 Downloading and processing files and images
    5.10 Deploying Spiders
    5.11 AutoThrottle extension
    5.12 Benchmarking
    5.13 Jobs: pausing and resuming crawls
6 Extending Scrapy
    6.1 Architecture overview
    6.2 Downloader Middleware
    6.3 Spider Middleware
    6.4 Extensions
    6.5 Core API
    6.6 Signals
    6.7 Item Exporters
7 All the rest
    7.1 Release notes
    7.2 Contributing to Scrapy
    7.3 Versioning and API Stability
Python Module Index

This documentation contains everything you need to know about Scrapy.

CHAPTER 1 Getting help

Having trouble? We'd like to help!

- Try the FAQ - it's got answers to some common questions.
- Looking for specific information? Try the genindex or modindex.
- Ask or search questions in StackOverflow using the scrapy tag.
- Search for information in the archives of the scrapy-users mailing list, or post a question.
- Ask a question in the #scrapy IRC channel.
- Report bugs with Scrapy in our issue tracker.

CHAPTER 2 First steps

2.1 Scrapy at a glance

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

2.1.1 Walk-through of an example spider

In order to show you what Scrapy brings to the table, we'll walk you through an example of a Scrapy Spider using the simplest way to run a spider.

Here's the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/tag/humor/',
        ]

        def parse(self, response):
            # Extract the text and author of every quote on the page.
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.xpath('span/small/text()').extract_first(),
                }

            # Follow the pagination link, if there is one.
            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

Put this in a text file, name it something like quotes_spider.py and run the spider using the runspider command:

    scrapy runspider quotes_spider.py -o quotes.json
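If you would rather start the crawl from a plain Python script instead of the command line, Scrapy also exposes a CrawlerProcess class for that (the documentation covers it under Common Practices). A minimal sketch, assuming the spider above was saved as quotes_spider.py and using the 1.3-era FEED_FORMAT/FEED_URI settings to write the same quotes.json file:

    from scrapy.crawler import CrawlerProcess

    from quotes_spider import QuotesSpider  # the spider class defined above

    process = CrawlerProcess(settings={
        'FEED_FORMAT': 'json',    # feed settings as named in Scrapy 1.3; newer releases use FEEDS
        'FEED_URI': 'quotes.json',
    })
    process.crawl(QuotesSpider)
    process.start()  # blocks until the crawl is finished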
When the runspider command finishes, you will have a quotes.json file containing a list of the quotes in JSON format, with text and author, looking like this (reformatted here for better readability):

    [{
        "author": "Jane Austen",
        "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
    },
    {
        "author": "Groucho Marx",
        "text": "\u201cOutside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.\u201d"
    },
    {
        "author": "Steve Martin",
        "text": "\u201cA day without sunshine is like, you know, night.\u201d"
    },
    ...]

What just happened?

When you ran the command scrapy runspider quotes_spider.py, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page and schedule another request using the same parse method as callback.

Here you notice one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn't need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.

While this enables you to do very fast crawls (sending multiple concurrent requests at the same time, in a fault-tolerant way), Scrapy also gives you control over the politeness of the crawl through a few settings. You can do things like setting a download delay between each request, limiting the amount of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure these out automatically (a small sketch of these settings appears at the end of this post).

Note: this example uses feed exports to generate the JSON file; you can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example). You can also write an item pipeline to store the items in a database.

2.1.2 What else?

You've seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:

- Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
- An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.
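That interactive shell is the quickest way to test the selectors used in the spider above. A rough sketch of such a session follows; it assumes Scrapy is installed and the site is reachable, and the expressions simply mirror the ones in quotes_spider.py:

    # In a terminal, open a shell against the page:
    #     scrapy shell 'http://quotes.toscrape.com/tag/humor/'
    # Inside the shell, a `response` object is already available:

    response.css('div.quote')                                             # the quote blocks on the page
    response.css('div.quote span.text::text').extract_first()            # text of the first quote
    response.xpath('//div[@class="quote"]/span/small/text()').extract()  # author names
    response.css('li.next a::attr(href)').extract_first()                # link to the next page, or None

extract_first() and extract() are the 1.3-era method names; later Scrapy releases also provide get() and getall() as equivalents.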
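Finally, here is the settings sketch promised above. It shows one way the politeness knobs mentioned in the overview (download delay, per-domain concurrency, auto-throttling) could be attached directly to the example spider through its custom_settings attribute; the spider name and the values are just illustrative placeholders:

    import scrapy


    class PoliteQuotesSpider(scrapy.Spider):
        # A variant of the quotes spider above with explicit politeness settings.
        name = 'polite_quotes'
        start_urls = ['http://quotes.toscrape.com/tag/humor/']

        # Per-spider overrides of the crawl-politeness settings.
        custom_settings = {
            'DOWNLOAD_DELAY': 1.0,                # wait about a second between requests
            'CONCURRENT_REQUESTS_PER_DOMAIN': 4,  # cap parallel requests to a single domain
            'AUTOTHROTTLE_ENABLED': True,         # let the AutoThrottle extension adjust delays
        }

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.xpath('span/small/text()').extract_first(),
                }

The same three keys can equally be set project-wide in settings.py; custom_settings simply scopes them to this one spider.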