-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathatom.xml
87 lines (55 loc) · 47.7 KB
/
atom.xml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Blog(robot9)</title>
<link href="https://robot9.github.io/atom.xml" rel="self"/>
<link href="https://robot9.github.io/"/>
<updated>2021-03-18T03:29:14.127Z</updated>
<id>https://robot9.github.io/</id>
<author>
<name>robot9</name>
</author>
<generator uri="https://hexo.io/">Hexo</generator>
<entry>
<title>Trending Apps: 从头开始分析流行应用 - 采集篇 - 环境设置 - 2</title>
<link href="https://robot9.github.io/2021/03/18/Trending-Apps-%E4%BB%8E%E5%A4%B4%E5%BC%80%E5%A7%8B%E5%88%86%E6%9E%90%E6%B5%81%E8%A1%8C%E5%BA%94%E7%94%A8-%E9%87%87%E9%9B%86%E7%AF%87-%E7%8E%AF%E5%A2%83%E8%AE%BE%E7%BD%AE-2/"/>
<id>https://robot9.github.io/2021/03/18/Trending-Apps-%E4%BB%8E%E5%A4%B4%E5%BC%80%E5%A7%8B%E5%88%86%E6%9E%90%E6%B5%81%E8%A1%8C%E5%BA%94%E7%94%A8-%E9%87%87%E9%9B%86%E7%AF%87-%E7%8E%AF%E5%A2%83%E8%AE%BE%E7%BD%AE-2/</id>
<published>2021-03-18T00:22:00.000Z</published>
<updated>2021-03-18T03:29:14.127Z</updated>
<content type="html"><![CDATA[<p>Link to previous chapter: <a href="/2021/03/17/Trending-Apps-%E4%BB%8E%E5%A4%B4%E5%BC%80%E5%A7%8B%E5%88%86%E6%9E%90%E6%B5%81%E8%A1%8C%E5%BA%94%E7%94%A8-%E9%87%87%E9%9B%86%E7%AF%87-%E7%8E%AF%E5%A2%83%E8%AE%BE%E7%BD%AE-1/">环境设置-1</a></p><h3 id="2-开始第一个-Scarpy-项目"><a href="#2-开始第一个-Scarpy-项目" class="headerlink" title="2. 开始第一个 Scarpy 项目"></a>2. 开始第一个 Scarpy 项目</h3><h4 id="2-1-创建项目"><a href="#2-1-创建项目" class="headerlink" title="2.1 创建项目"></a>2.1 创建项目</h4><p>因为我们本地并没有scrapy的executable,需要启动一个ad-hoc的容器并在容器里面操作, 进入容器:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">> docker-compose run --rm scrapy</span><br></pre></td></tr></table></figure><p>现在开始创建我们的project</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">> cd /app</span><br><span class="line">> scrapy startproject app_trend</span><br><span class="line">> cd app_trend</span><br></pre></td></tr></table></figure><p>这样子会生成一个 <code>app_trend</code> folder, 其中主要的文件目录是:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">tutorial/</span><br><span class="line"> scrapy.cfg # deploy configuration file</span><br><span class="line"> tutorial/ # project's Python module, you'll import your code from here</span><br><span class="line"> __init__.py</span><br><span class="line"> items.py # project items definition file</span><br><span class="line"> middlewares.py # project middlewares file</span><br><span class="line"> pipelines.py # project pipelines file</span><br><span class="line"> settings.py # project settings file</span><br><span class="line"> spiders/ # !!! a directory where you'll later put your spiders</span><br><span class="line"> __init__.py</span><br><span class="line"> </span><br></pre></td></tr></table></figure><p>我们将会在 <code>spiders/</code> folder 里面编写我们自己的 spider</p><span id="more"></span><h4 id="2-2-第一个-spider"><a href="#2-2-第一个-spider" class="headerlink" title="2.2 第一个 spider"></a>2.2 第一个 spider</h4><p>Filename: <code>spiders/top50_spider.py</code></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line">import scrapy</span><br><span class="line"></span><br><span class="line">class Top50Spider(scrapy.Spider):</span><br><span class="line"> name = "top50"</span><br><span class="line"></span><br><span class="line"> def start_requests(self):</span><br><span class="line"> urls = [</span><br><span class="line"> 'https://www.appannie.com/en/apps/ios/top/'</span><br><span class="line"> ]</span><br><span class="line"> for url in urls:</span><br><span class="line"> yield scrapy.Request(</span><br><span class="line"> url=url, callback=self.parse,</span><br><span class="line"> headers={</span><br><span class="line"> "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",</span><br><span class="line"> "accept-language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,zh-TW;q=0.5,ja;q=0.4",</span><br><span class="line"> "accept-encoding": "gzip, deflate, br",</span><br><span class="line"> "cache-control": "max-age=0",</span><br><span class="line"> "sec-fetch-dest": "document",</span><br><span class="line"> "sec-fetch-mode": "navigate",</span><br><span class="line"> "sec-fetch-site": "none",</span><br><span class="line"> "sec-fetch-user": "?1",</span><br><span class="line"> "upgrade-insecure-requests": "1",</span><br><span class="line"> "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36 Edg/89.0.774.50",</span><br><span class="line"> })</span><br><span class="line"></span><br><span class="line"> def parse(self, response):</span><br><span class="line"> for ranking in response.xpath('//div[count(a[contains(@href, "en/apps")])>=40]'):</span><br><span class="line"> category = ranking.css("h4::text").get()</span><br><span class="line"> for app in ranking.xpath('.//a[contains(@href, "en/apps")]'):</span><br><span class="line"> items = app.xpath('.//p/text()').getall()</span><br><span class="line"> yield {</span><br><span class="line"> 'category': category,</span><br><span class="line"> 'rank': items[0],</span><br><span class="line"> 'name': items[1],</span><br><span class="line"> 'company': items[2],</span><br><span class="line"> }</span><br></pre></td></tr></table></figure><div class="tip"> 这里面特别加上了一些headers,特别是`user-agent`是为了防止服务器认为我们是机器人而直接返回503 Error</div><h4 id="2-2-1-网页分析:xpath-和-css-selector"><a href="#2-2-1-网页分析:xpath-和-css-selector" class="headerlink" title="2.2.1 网页分析:xpath 和 css selector"></a>2.2.1 网页分析:<code>xpath</code> 和 <code>css</code> selector</h4><p>Scrapy默认提供了非常强大的<code>css</code>和<code>xpath</code> selector, 参考 <a href="https://www.w3schools.com/xml/xpath_syntax.asp">W3Schools</a><br>已知页面的structure:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><html></span><br><span class="line"> <body></span><br><span class="line"> <div> // Top Free App</span><br><span class="line"> <h4>Free</h4></span><br><span class="line"> <a href="en/apps/..."><p>app_rank</p><p>name</p><p>company</p></a></span><br><span class="line"> ...</span><br><span class="line"> </div></span><br><span class="line"> <div> // Top Paid App</span><br><span class="line"> <h4>Free</h4></span><br><span class="line"> <a href="en/apps/..."><p>app_rank</p><p>name</p><p>company</p></a></span><br><span class="line"> ...</span><br><span class="line"> </div></span><br><span class="line"> <div> // Top Grossing App</span><br><span class="line"> <h4>Free</h4></span><br><span class="line"> <a href="en/apps/..."><p>app_rank</p><p>name</p><p>company</p></a></span><br><span class="line"> ...</span><br><span class="line"> </div></span><br><span class="line"> </body></span><br><span class="line"></html></span><br></pre></td></tr></table></figure><p>其中每个<code><div></code>里面会列出类别以及包含50个app的rank<br>这里我们用到了四个:</p><ol><li><code>response.xpath('//div[count(a[contains(@href, "en/apps")])>=40]')</code>: 找到一个div并且里面有超过40个link,每个link都指向 <code>en/apps/...</code></li><li><code>css("h4::text")</code>: 提取 <code>h4</code> 里面的文字</li><li><code>xpath('.//a[contains(@href, "en/apps")]')</code>: 提取链接</li><li><code>xpath('.//p/text()')</code>: 找到每个段落</li></ol><h4 id="2-3-运行"><a href="#2-3-运行" class="headerlink" title="2.3 运行"></a>2.3 运行</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">> scrapy crawl top50</span><br></pre></td></tr></table></figure><p>可以看到我们能够拿到150个结果了</p><h4 id="2-4-输出-items"><a href="#2-4-输出-items" class="headerlink" title="2.4 输出 items"></a>2.4 输出 <code>items</code></h4><p>既然已经能够生成需要的数据,那么我们可以使用 <code>Pipeline</code> 来把处理过的内容输出到磁盘上</p><p>Filename: <code>app_rank/pipelines.py</code></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line">from datetime import datetime</span><br><span class="line">from itemadapter import ItemAdapter</span><br><span class="line">from scrapy.exporters import JsonLinesItemExporter</span><br><span class="line"></span><br><span class="line">class AppTrendPipeline:</span><br><span class="line"> def __init__(self):</span><br><span class="line"> self.files = {}</span><br><span class="line"></span><br><span class="line"> def open_spider(self, spider):</span><br><span class="line"> import os</span><br><span class="line"> print(os.getcwd())</span><br><span class="line"> spider.logger.info("Current path: %s", os.getcwd())</span><br><span class="line"> now = datetime.now()</span><br><span class="line"> file = open('json/{}.json'.format(now.strftime("%d_%m_%Y_%H_%M_%S")), 'w+b')</span><br><span class="line"> self.files[spider] = file</span><br><span class="line"> self.exporter = JsonLinesItemExporter(file)</span><br><span class="line"> self.exporter.start_exporting()</span><br><span class="line"></span><br><span class="line"> def close_spider(self, spider):</span><br><span class="line"> self.exporter.finish_exporting()</span><br><span class="line"> file = self.files.pop(spider)</span><br><span class="line"> file.close()</span><br><span class="line"></span><br><span class="line"> def process_item(self, item, spider):</span><br><span class="line"> self.exporter.export_item(item)</span><br><span class="line"> return item</span><br></pre></td></tr></table></figure><p>我们使用了 <code>JsonLinesItemExporter</code> 将每一个Item输出到了本地的 <code>.json</code> 文件中:</p><p><img src="/static/2021_03_17/2021_03_17_app_trend_list.png"></p><h4 id="2-5-通过网页控制台来运行我们的spider"><a href="#2-5-通过网页控制台来运行我们的spider" class="headerlink" title="2.5 通过网页控制台来运行我们的spider"></a>2.5 通过网页控制台来运行我们的spider</h4><p>访问 <a href="http://localhost/">http://localhost</a> 应该就能看到控制台,以及我们增加了一个同为localhost的worker</p><p><img src="/static/2021_03_17/2021_03_17_scrapyweb.png"></p><ol><li>进入 屏幕左侧 <code>Deploy Project</code> 选项, 选择我们创建的 <code>app_trend</code> 项目,并且选择 <code>Package & Deploy</code></li><li>进入 屏幕左侧 <code>Run Spider</code> 选项, 选择我们的项目和spider,点击 <code>Check CMD</code> 生成命令,这里也可以选择计划任务设置定时执行</li><li>点击 <code>Run Spider</code> 按钮</li></ol><p>第一个task开始运行,选择 <code>Jobs</code> 可以看到现在的jobs,点击第一个task的 <code>Stats</code>已经完成,我们抓取到了150个条目</p><p><img src="/static/2021_03_17/2021_03_17_first_task.png"></p>]]></content>
<summary type="html"><p>Link to previous chapter: <a href="/2021/03/17/Trending-Apps-%E4%BB%8E%E5%A4%B4%E5%BC%80%E5%A7%8B%E5%88%86%E6%9E%90%E6%B5%81%E8%A1%8C%E5%BA%94%E7%94%A8-%E9%87%87%E9%9B%86%E7%AF%87-%E7%8E%AF%E5%A2%83%E8%AE%BE%E7%BD%AE-1/">环境设置-1</a></p>
<h3 id="2-开始第一个-Scarpy-项目"><a href="#2-开始第一个-Scarpy-项目" class="headerlink" title="2. 开始第一个 Scarpy 项目"></a>2. 开始第一个 Scarpy 项目</h3><h4 id="2-1-创建项目"><a href="#2-1-创建项目" class="headerlink" title="2.1 创建项目"></a>2.1 创建项目</h4><p>因为我们本地并没有scrapy的executable,需要启动一个ad-hoc的容器并在容器里面操作, 进入容器:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">&gt; docker-compose run --rm scrapy</span><br></pre></td></tr></table></figure>
<p>现在开始创建我们的project</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">&gt; cd &#x2F;app</span><br><span class="line">&gt; scrapy startproject app_trend</span><br><span class="line">&gt; cd app_trend</span><br></pre></td></tr></table></figure>
<p>这样子会生成一个 <code>app_trend</code> folder, 其中主要的文件目录是:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">tutorial&#x2F;</span><br><span class="line"> scrapy.cfg # deploy configuration file</span><br><span class="line"> tutorial&#x2F; # project&#39;s Python module, you&#39;ll import your code from here</span><br><span class="line"> __init__.py</span><br><span class="line"> items.py # project items definition file</span><br><span class="line"> middlewares.py # project middlewares file</span><br><span class="line"> pipelines.py # project pipelines file</span><br><span class="line"> settings.py # project settings file</span><br><span class="line"> spiders&#x2F; # !!! a directory where you&#39;ll later put your spiders</span><br><span class="line"> __init__.py</span><br><span class="line"> </span><br></pre></td></tr></table></figure>
<p>我们将会在 <code>spiders/</code> folder 里面编写我们自己的 spider</p></summary>
<category term="app_trend" scheme="https://robot9.github.io/categories/app-trend/"/>
<category term="docker" scheme="https://robot9.github.io/tags/docker/"/>
<category term="scrapy" scheme="https://robot9.github.io/tags/scrapy/"/>
<category term="scrapyweb" scheme="https://robot9.github.io/tags/scrapyweb/"/>
</entry>
<entry>
<title>Trending Apps: 从头开始分析流行应用 - 采集篇 - 环境设置 - 1</title>
<link href="https://robot9.github.io/2021/03/17/Trending-Apps-%E4%BB%8E%E5%A4%B4%E5%BC%80%E5%A7%8B%E5%88%86%E6%9E%90%E6%B5%81%E8%A1%8C%E5%BA%94%E7%94%A8-%E9%87%87%E9%9B%86%E7%AF%87-%E7%8E%AF%E5%A2%83%E8%AE%BE%E7%BD%AE-1/"/>
<id>https://robot9.github.io/2021/03/17/Trending-Apps-%E4%BB%8E%E5%A4%B4%E5%BC%80%E5%A7%8B%E5%88%86%E6%9E%90%E6%B5%81%E8%A1%8C%E5%BA%94%E7%94%A8-%E9%87%87%E9%9B%86%E7%AF%87-%E7%8E%AF%E5%A2%83%E8%AE%BE%E7%BD%AE-1/</id>
<published>2021-03-17T21:20:00.000Z</published>
<updated>2021-03-18T03:29:50.972Z</updated>
<content type="html"><![CDATA[<h2 id="项目目标"><a href="#项目目标" class="headerlink" title="项目目标"></a>项目目标</h2><p>收集最新的iOS App排名,并分析流行趋势</p><h2 id="采集篇"><a href="#采集篇" class="headerlink" title="采集篇"></a>采集篇</h2><h3 id="使用到的tools"><a href="#使用到的tools" class="headerlink" title="使用到的tools"></a>使用到的tools</h3><ol><li><a href="https://scrapy.org/">scrapy</a>: 基于python的网页采集框架</li><li><a href="https://github.com/my8100/scrapydweb">scrapydweb</a>: 用于 Scrapyd 集群管理的 web 应用,支持 Scrapy 日志分析和可视化。</li><li><a href="https://www.docker.com/">docker</a>: 多服务容器管理</li></ol><h3 id="1-创建docker-instance"><a href="#1-创建docker-instance" class="headerlink" title="1. 创建docker instance"></a>1. 创建docker instance</h3><h4 id="1-1-准备工作:"><a href="#1-1-准备工作:" class="headerlink" title="1.1 准备工作:"></a>1.1 准备工作:</h4><ol><li>一台linux服务器</li><li>安装 docker 以及 <a href="https://docs.docker.com/compose/install/">docker-compose</a> 工具</li></ol><h4 id="1-2-文件目录"><a href="#1-2-文件目录" class="headerlink" title="1.2 文件目录"></a>1.2 文件目录</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">app_trend</span><br><span class="line"> /code/ # 爬虫的python code放这儿</span><br><span class="line"> /scrapy_web/ # scrapydweb 的 config,logs 以及 build file</span><br><span class="line"> /app/</span><br><span class="line"> # scrapydweb 的 config 文件 </span><br><span class="line"> # 用来override https://github.com/my8100/scrapydweb/blob/master/scrapydweb/default_settings.py</span><br><span class="line"> /scrapydweb_settings_v10.py </span><br><span class="line"> /logs/</span><br><span class="line"> /data/</span><br><span class="line"> /Dockerfile</span><br><span class="line"> /scrapyd/</span><br><span class="line"> /scrapyd.conf</span><br><span class="line"> /Dockerfile</span><br><span class="line"> /data/</span><br><span class="line"> # 远程调用scrapyd的任务的output会在这</span><br><span class="line"> /code/</span><br><span class="line"> # 自定义启动scrapyd的脚本</span><br><span class="line"> /entrypoint.sh</span><br></pre></td></tr></table></figure><span id="more"></span><h3 id="1-2-创建-scrapydweb-镜像"><a href="#1-2-创建-scrapydweb-镜像" class="headerlink" title="1.2 创建 scrapydweb 镜像"></a>1.2 创建 scrapydweb 镜像</h3><p>Filename: <code>scrapy_web/Dockerfile</code></p><details><summary>[展开文件]</summary> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">FROM python:3.8-slim</span><br><span class="line"></span><br><span class="line">WORKDIR /app</span><br><span class="line"></span><br><span class="line">EXPOSE 5000</span><br><span class="line"> RUN apt-get update && \ </span><br><span class="line"> apt-get install -y git && \</span><br><span class="line"> pip3 install -U git+https://github.com/my8100/scrapydweb.git && \</span><br><span class="line"> apt-get remove -y git</span><br><span class="line"># 通过这个来override一些dependency的version</span><br><span class="line"># RUN pip3 install SQLAlchemy==1.3.23 --upgrade</span><br><span class="line">CMD scrapydweb </span><br></pre></td></tr></table></figure></details><h3 id="1-3-创建-scrapyd-镜像"><a href="#1-3-创建-scrapyd-镜像" class="headerlink" title="1.3 创建 scrapyd 镜像"></a>1.3 创建 scrapyd 镜像</h3><p>[启动logparser和scrapyd] Filename: <code>scrapyd/code/entrypoint.sh</code></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">#!/bin/bash</span><br><span class="line">logparser -dir /var/lib/scrapyd/logs -t 10 --delete_json_files & scrapyd</span><br></pre></td></tr></table></figure><p>[创建镜像] Filename: <code>scrapyd/Dockerfile</code></p><details><summary>[展开文件]</summary> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br></pre></td><td class="code"><pre><span class="line">FROM debian:buster</span><br><span class="line">MAINTAINER EasyPi Software Foundation</span><br><span class="line"></span><br><span class="line">ENV SCRAPY_VERSION=2.4.1</span><br><span class="line">ENV SCRAPYD_VERSION=1.2.1</span><br><span class="line">ENV PILLOW_VERSION=8.1.0</span><br><span class="line"></span><br><span class="line">RUN set -xe \</span><br><span class="line"> && apt-get update \</span><br><span class="line"> && apt-get install -y autoconf \</span><br><span class="line"> build-essential \</span><br><span class="line"> curl \</span><br><span class="line"> git \</span><br><span class="line"> libffi-dev \</span><br><span class="line"> libssl-dev \</span><br><span class="line"> libtool \</span><br><span class="line"> libxml2 \</span><br><span class="line"> libxml2-dev \</span><br><span class="line"> libxslt1.1 \</span><br><span class="line"> libxslt1-dev \</span><br><span class="line"> python3 \</span><br><span class="line"> python3-dev \</span><br><span class="line"> python3-distutils \</span><br><span class="line"> vim-tiny \</span><br><span class="line"> && apt-get install -y libtiff5 \</span><br><span class="line"> libtiff5-dev \</span><br><span class="line"> libfreetype6-dev \</span><br><span class="line"> libjpeg62-turbo \</span><br><span class="line"> libjpeg62-turbo-dev \</span><br><span class="line"> liblcms2-2 \</span><br><span class="line"> liblcms2-dev \</span><br><span class="line"> libwebp6 \</span><br><span class="line"> libwebp-dev \</span><br><span class="line"> zlib1g \</span><br><span class="line"> zlib1g-dev \</span><br><span class="line"> && curl -sSL https://bootstrap.pypa.io/get-pip.py | python3 \</span><br><span class="line"> && pip install git+https://github.com/scrapy/scrapy.git@$SCRAPY_VERSION \</span><br><span class="line"> git+https://github.com/scrapy/scrapyd.git@$SCRAPYD_VERSION \</span><br><span class="line"> git+https://github.com/scrapy/scrapyd-client.git \</span><br><span class="line"> git+https://github.com/scrapinghub/scrapy-splash.git \</span><br><span class="line"> git+https://github.com/scrapinghub/scrapyrt.git \</span><br><span class="line"> git+https://github.com/python-pillow/Pillow.git@$PILLOW_VERSION \</span><br><span class="line"> && pip install logparser \</span><br><span class="line"> && curl -sSL https://github.com/scrapy/scrapy/raw/master/extras/scrapy_bash_completion -o /etc/bash_completion.d/scrapy_bash_completion \</span><br><span class="line"> && echo 'source /etc/bash_completion.d/scrapy_bash_completion' >> /root/.bashrc \</span><br><span class="line"> && apt-get purge -y --auto-remove autoconf \</span><br><span class="line"> build-essential \</span><br><span class="line"> curl \</span><br><span class="line"> libffi-dev \</span><br><span class="line"> libssl-dev \</span><br><span class="line"> libtool \</span><br><span class="line"> libxml2-dev \</span><br><span class="line"> libxslt1-dev \</span><br><span class="line"> python3-dev \</span><br><span class="line"> && apt-get purge -y --auto-remove libtiff5-dev \</span><br><span class="line"> libfreetype6-dev \</span><br><span class="line"> libjpeg62-turbo-dev \</span><br><span class="line"> liblcms2-dev \</span><br><span class="line"> libwebp-dev \</span><br><span class="line"> zlib1g-dev \</span><br><span class="line"> && rm -rf /var/lib/apt/lists/*</span><br><span class="line"></span><br><span class="line">EXPOSE 6800</span><br><span class="line">VOLUME ["/code"]</span><br><span class="line">WORKDIR /code</span><br><span class="line">RUN ["chmod", "777", "entrypoint.sh"]</span><br><span class="line">ENTRYPOINT ["./entrypoint.sh"]</span><br></pre></td></tr></table></figure></details><p>[scrapyd的设置文件] Filename: <code>scrapyd/scrapyd.conf</code></p><details><summary>[展开文件]</summary> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line">[scrapyd]</span><br><span class="line">eggs_dir = /var/lib/scrapyd/eggs</span><br><span class="line">logs_dir = /var/lib/scrapyd/logs</span><br><span class="line">items_dir = /var/lib/scrapyd/items</span><br><span class="line">dbs_dir = /var/lib/scrapyd/dbs</span><br><span class="line">jobs_to_keep = 5</span><br><span class="line">max_proc = 0</span><br><span class="line">max_proc_per_cpu = 4</span><br><span class="line">finished_to_keep = 100</span><br><span class="line">poll_interval = 5</span><br><span class="line">#设置成0.0.0.0来允许外网访问</span><br><span class="line">bind_address = 0.0.0.0</span><br><span class="line">http_port = 6800</span><br><span class="line">debug = off</span><br><span class="line">runner = scrapyd.runner</span><br><span class="line">application = scrapyd.app.application</span><br><span class="line">launcher = scrapyd.launcher.Launcher</span><br><span class="line"></span><br><span class="line">[services]</span><br><span class="line">schedule.json = scrapyd.webservice.Schedule</span><br><span class="line">cancel.json = scrapyd.webservice.Cancel</span><br><span class="line">addversion.json = scrapyd.webservice.AddVersion</span><br><span class="line">listprojects.json = scrapyd.webservice.ListProjects</span><br><span class="line">listversions.json = scrapyd.webservice.ListVersions</span><br><span class="line">listspiders.json = scrapyd.webservice.ListSpiders</span><br><span class="line">delproject.json = scrapyd.webservice.DeleteProject</span><br><span class="line">delversion.json = scrapyd.webservice.DeleteVersion</span><br><span class="line">listjobs.json = scrapyd.webservice.ListJobs</span><br><span class="line">daemonstatus.json = scrapyd.webservice.DaemonStatus</span><br></pre></td></tr></table></figure></details><h4 id="1-4-编写-docker-compose-文件来定义-container"><a href="#1-4-编写-docker-compose-文件来定义-container" class="headerlink" title="1.4 编写 docker-compose 文件来定义 container"></a>1.4 编写 docker-compose 文件来定义 container</h4><p>Filename: <code>docker-compose.yml</code></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br></pre></td><td class="code"><pre><span class="line">scrapy:</span><br><span class="line"> image: vimagick/scrapyd:py3</span><br><span class="line"> command: bash</span><br><span class="line"> volumes:</span><br><span class="line"> - ./code:/code</span><br><span class="line"> working_dir: /code</span><br><span class="line"> restart: unless-stopped</span><br><span class="line"></span><br><span class="line">scrapy_web:</span><br><span class="line"> container_name: scrapy_web</span><br><span class="line"> restart: unless-stopped</span><br><span class="line"> build: ./scrapy_web/</span><br><span class="line"> ports:</span><br><span class="line"> - "80:80"</span><br><span class="line"> expose:</span><br><span class="line"> - "80"</span><br><span class="line"> volumes:</span><br><span class="line"> - ./scrapy_web/app:/app</span><br><span class="line"> - ./scrapy_web/logs:/logs</span><br><span class="line"> - ./scrapy_web/data:/data</span><br><span class="line"> - ./code:/code</span><br><span class="line"> environment:</span><br><span class="line"> - PASSWORD</span><br><span class="line"> - USERNAME</span><br><span class="line"> # 填入本机IP 或者其他运行了爬虫的机器</span><br><span class="line"> - SCRAPYD_SERVER_1=[Your_IP]:6800</span><br><span class="line"> - PORT=80</span><br><span class="line"> - DATA_PATH=/data</span><br><span class="line"> depends_on:</span><br><span class="line"> - scrapyd</span><br><span class="line"></span><br><span class="line">scrapyd:</span><br><span class="line"> container_name: scrapyd</span><br><span class="line"> build: ./scrapyd</span><br><span class="line"> ports:</span><br><span class="line"> - "6800:6800"</span><br><span class="line"> volumes:</span><br><span class="line"> - ./scrapyd/scrapyd.conf:/etc/scrapyd/scrapyd.conf</span><br><span class="line"> - ./scrapyd/data:/var/lib/scrapyd</span><br><span class="line"> - ./scrapyd/code:/code</span><br><span class="line"> restart: unless-stopped</span><br></pre></td></tr></table></figure><h4 id="1-5-启动-docker-容器"><a href="#1-5-启动-docker-容器" class="headerlink" title="1.5 启动 docker 容器"></a>1.5 启动 docker 容器</h4> <figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose up -d</span><br></pre></td></tr></table></figure><p> 现在可以通过80端口来访问啦</p><h4 id="1-6-编写-scrapy-项目"><a href="#1-6-编写-scrapy-项目" class="headerlink" title="1.6 编写 scrapy 项目"></a>1.6 编写 scrapy 项目</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker-compose run --rm scrapy</span><br></pre></td></tr></table></figure><p>会启动一个设置好scrapy的容器 在里面可以直接用 <code>scrapy startproject tutorial</code> 并测试你的spider啦<br>而与此同时这个容器内的 <code>/code</code> 实际上对应的就是外部的 <code>code</code> folder,所以任何文件操作都会保存在这并能从外部访问</p>]]></content>
<summary type="html"><h2 id="项目目标"><a href="#项目目标" class="headerlink" title="项目目标"></a>项目目标</h2><p>收集最新的iOS App排名,并分析流行趋势</p>
<h2 id="采集篇"><a href="#采集篇" class="headerlink" title="采集篇"></a>采集篇</h2><h3 id="使用到的tools"><a href="#使用到的tools" class="headerlink" title="使用到的tools"></a>使用到的tools</h3><ol>
<li><a href="https://scrapy.org/">scrapy</a>: 基于python的网页采集框架</li>
<li><a href="https://github.com/my8100/scrapydweb">scrapydweb</a>: 用于 Scrapyd 集群管理的 web 应用,支持 Scrapy 日志分析和可视化。</li>
<li><a href="https://www.docker.com/">docker</a>: 多服务容器管理</li>
</ol>
<h3 id="1-创建docker-instance"><a href="#1-创建docker-instance" class="headerlink" title="1. 创建docker instance"></a>1. 创建docker instance</h3><h4 id="1-1-准备工作:"><a href="#1-1-准备工作:" class="headerlink" title="1.1 准备工作:"></a>1.1 准备工作:</h4><ol>
<li>一台linux服务器</li>
<li>安装 docker 以及 <a href="https://docs.docker.com/compose/install/">docker-compose</a> 工具</li>
</ol>
<h4 id="1-2-文件目录"><a href="#1-2-文件目录" class="headerlink" title="1.2 文件目录"></a>1.2 文件目录</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">app_trend</span><br><span class="line"> &#x2F;code&#x2F; # 爬虫的python code放这儿</span><br><span class="line"> &#x2F;scrapy_web&#x2F; # scrapydweb 的 config,logs 以及 build file</span><br><span class="line"> &#x2F;app&#x2F;</span><br><span class="line"> # scrapydweb 的 config 文件 </span><br><span class="line"> # 用来override https:&#x2F;&#x2F;github.com&#x2F;my8100&#x2F;scrapydweb&#x2F;blob&#x2F;master&#x2F;scrapydweb&#x2F;default_settings.py</span><br><span class="line"> &#x2F;scrapydweb_settings_v10.py </span><br><span class="line"> &#x2F;logs&#x2F;</span><br><span class="line"> &#x2F;data&#x2F;</span><br><span class="line"> &#x2F;Dockerfile</span><br><span class="line"> &#x2F;scrapyd&#x2F;</span><br><span class="line"> &#x2F;scrapyd.conf</span><br><span class="line"> &#x2F;Dockerfile</span><br><span class="line"> &#x2F;data&#x2F;</span><br><span class="line"> # 远程调用scrapyd的任务的output会在这</span><br><span class="line"> &#x2F;code&#x2F;</span><br><span class="line"> # 自定义启动scrapyd的脚本</span><br><span class="line"> &#x2F;entrypoint.sh</span><br></pre></td></tr></table></figure></summary>
<category term="app_trend" scheme="https://robot9.github.io/categories/app-trend/"/>
<category term="docker" scheme="https://robot9.github.io/tags/docker/"/>
<category term="scrapy" scheme="https://robot9.github.io/tags/scrapy/"/>
<category term="scrapyweb" scheme="https://robot9.github.io/tags/scrapyweb/"/>
</entry>
</feed>