Add files via upload
hope-data-science authored Nov 7, 2024
1 parent bfb5090 commit 3c9cb59
Showing 1 changed file with 59 additions and 22 deletions.
81 changes: 59 additions & 22 deletions docs/化整为零:对文件进行批处理.html
@@ -247,10 +247,11 @@ <h2 id="toc-title">Table of contents</h2>
<ul class="collapse">
<li><a href="#环境配置与数据生成" id="toc-环境配置与数据生成" class="nav-link" data-scroll-target="#环境配置与数据生成"><span class="header-section-number">8.4.1</span> Environment setup and data generation</a></li>
<li><a href="#数据的保存" id="toc-数据的保存" class="nav-link" data-scroll-target="#数据的保存"><span class="header-section-number">8.4.2</span> Saving the data</a></li>
<li><a href="#文件的压缩" id="toc-文件的压缩" class="nav-link" data-scroll-target="#文件的压缩"><span class="header-section-number">8.4.3</span> Compressing files</a></li>
<li><a href="#文件的移动" id="toc-文件的移动" class="nav-link" data-scroll-target="#文件的移动"><span class="header-section-number">8.4.4</span> Moving files</a></li>
<li><a href="#保存为excel文件" id="toc-保存为excel文件" class="nav-link" data-scroll-target="#保存为excel文件"><span class="header-section-number">8.4.5</span> Saving as an Excel file</a></li>
<li><a href="#文件的删除" id="toc-文件的删除" class="nav-link" data-scroll-target="#文件的删除"><span class="header-section-number">8.4.6</span> Deleting files</a></li>
<li><a href="#文件批量读取" id="toc-文件批量读取" class="nav-link" data-scroll-target="#文件批量读取"><span class="header-section-number">8.4.3</span> Reading files in batch</a></li>
<li><a href="#文件的压缩" id="toc-文件的压缩" class="nav-link" data-scroll-target="#文件的压缩"><span class="header-section-number">8.4.4</span> Compressing files</a></li>
<li><a href="#文件的移动" id="toc-文件的移动" class="nav-link" data-scroll-target="#文件的移动"><span class="header-section-number">8.4.5</span> Moving files</a></li>
<li><a href="#保存为excel文件" id="toc-保存为excel文件" class="nav-link" data-scroll-target="#保存为excel文件"><span class="header-section-number">8.4.6</span> Saving as an Excel file</a></li>
<li><a href="#文件的删除" id="toc-文件的删除" class="nav-link" data-scroll-target="#文件的删除"><span class="header-section-number">8.4.7</span> Deleting files</a></li>
</ul></li>
<li><a href="#小结" id="toc-小结" class="nav-link" data-scroll-target="#小结"><span class="header-section-number">8.5</span> Summary</a></li>
<li><a href="#练习" id="toc-练习" class="nav-link" data-scroll-target="#练习"><span class="header-section-number">8.6</span> Exercises</a></li>
@@ -338,11 +339,11 @@ <h3 data-number="8.4.1" class="anchored" data-anchor-id="环境配置与数据
</div>
<p>Next, we generate the data set:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>nr_of_rows <span class="ot">&lt;-</span> <span class="fl">1e7</span></span>
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>nr_of_rows <span class="ot">&lt;-</span> <span class="fl">1e7</span> <span class="co"># increase the row count if desired, e.g. 5e7 or 1e8</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>df <span class="ot">&lt;-</span> <span class="fu">data.table</span>(</span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a> <span class="at">Logical =</span> <span class="fu">sample</span>(<span class="fu">c</span>(<span class="cn">TRUE</span>, <span class="cn">FALSE</span>, <span class="cn">NA</span>), <span class="at">prob =</span> <span class="fu">c</span>(<span class="fl">0.85</span>, <span class="fl">0.1</span>, <span class="fl">0.05</span>), nr_of_rows, <span class="at">replace =</span> <span class="cn">TRUE</span>),</span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a> <span class="at">Integer =</span> <span class="fu">sample</span>(<span class="dv">1</span>L<span class="sc">:</span><span class="dv">100</span>L, nr_of_rows, <span class="at">replace =</span> <span class="cn">TRUE</span>),</span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a> <span class="at">Integer =</span> <span class="fu">sample</span>(<span class="dv">1</span>L<span class="sc">:</span><span class="dv">100</span>L, nr_of_rows, <span class="at">replace =</span> <span class="cn">TRUE</span>), <span class="co"># increase the number of files if desired, e.g. 1:1e3 or 1:1e4</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a> <span class="at">Real =</span> <span class="fu">sample</span>(<span class="fu">sample</span>(<span class="dv">1</span><span class="sc">:</span><span class="dv">10000</span>, <span class="dv">20</span>) <span class="sc">/</span> <span class="dv">100</span>, nr_of_rows, <span class="at">replace =</span> <span class="cn">TRUE</span>),</span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a> <span class="at">Factor =</span> <span class="fu">as.factor</span>(<span class="fu">sample</span>(<span class="fu">labels</span>(UScitiesD), nr_of_rows, <span class="at">replace =</span> <span class="cn">TRUE</span>))</span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a> )</span>
@@ -400,40 +401,76 @@ <h3 data-number="8.4.2" class="anchored" data-anchor-id="数据的保存"><span
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a> })</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
</section>
<section id="文件的压缩" class="level3" data-number="8.4.3">
<h3 data-number="8.4.3" class="anchored" data-anchor-id="文件的压缩"><span class="header-section-number">8.4.3</span> Compressing files</h3>
<section id="文件批量读取" class="level3" data-number="8.4.3">
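<h3 data-number="8.4.3" class="anchored" data-anchor-id="文件批量读取"><span class="header-section-number">8.4.3</span> Reading files in batch</h3>
<p>To read every file in the csv folder and combine them into a single data frame, we can use the <code>read_csv</code> function from the readr package, which accepts a vector of paths in readr 2.0.0 and later:</p>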
<div class="cell">
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="fu">dir_ls</span>(<span class="st">"temp/csv"</span>) <span class="ot">-&gt;</span> all_csv_paths</span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="fu">read_csv</span>(all_csv_paths,<span class="at">id =</span> <span class="st">"file_path"</span>) <span class="ot">-&gt;</span> all_data</span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>all_data</span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a><span class="co"># # A tibble: 10,000,000 × 5</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="co"># file_path Logical Integer Real Factor </span></span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a><span class="co"># &lt;chr&gt; &lt;lgl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; </span></span>
<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a><span class="co"># 1 temp/csv/1.csv TRUE 1 54.6 Denver </span></span>
<span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a><span class="co"># 2 temp/csv/1.csv TRUE 1 52.2 Houston </span></span>
<span id="cb7-10"><a href="#cb7-10" aria-hidden="true" tabindex="-1"></a><span class="co"># 3 temp/csv/1.csv TRUE 1 35.8 SanFrancisco</span></span>
<span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a><span class="co"># 4 temp/csv/1.csv TRUE 1 79.5 Houston </span></span>
<span id="cb7-12"><a href="#cb7-12" aria-hidden="true" tabindex="-1"></a><span class="co"># 5 temp/csv/1.csv TRUE 1 92.2 LosAngeles </span></span>
<span id="cb7-13"><a href="#cb7-13" aria-hidden="true" tabindex="-1"></a><span class="co"># 6 temp/csv/1.csv FALSE 1 53.8 Atlanta </span></span>
<span id="cb7-14"><a href="#cb7-14" aria-hidden="true" tabindex="-1"></a><span class="co"># 7 temp/csv/1.csv TRUE 1 78.4 Miami </span></span>
<span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a><span class="co"># 8 temp/csv/1.csv TRUE 1 18.2 Atlanta </span></span>
<span id="cb7-16"><a href="#cb7-16" aria-hidden="true" tabindex="-1"></a><span class="co"># 9 temp/csv/1.csv TRUE 1 8.83 Denver </span></span>
<span id="cb7-17"><a href="#cb7-17" aria-hidden="true" tabindex="-1"></a><span class="co"># 10 temp/csv/1.csv FALSE 1 49.8 Houston </span></span>
<span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a><span class="co"># # ℹ 9,999,990 more rows</span></span>
<span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a><span class="co"># # ℹ Use `print(n = ...)` to see more rows</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
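<p>The returned <code>all_data</code> holds the combined contents of all files, and its <code>file_path</code> column records the path of the file each row came from.</p>
<p>That said, <code>fread</code> from the data.table package is generally among the fastest CSV readers available in R, so when performance matters we can do the following instead:</p>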
<div class="cell">
<div class="sourceCode cell-code" id="cb8"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="fu">map_dfr</span>(all_csv_paths,fread) <span class="ot">-&gt;</span> all_data2</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
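<p>Here we use <code>map_dfr</code> from the purrr package, which applies <code>fread</code> to each path in turn and binds the resulting data frames together by row (in purrr 1.0.0 and later it is superseded by <code>map</code> followed by <code>list_rbind</code>). The call above does not record the file names; if they are needed, we can do this:</p>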
<div class="cell">
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="fu">lapply</span>(all_csv_paths,\(x) <span class="fu">fread</span>(x)[,<span class="at">file_path:=</span>x]) <span class="sc">%&gt;%</span> </span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">rbindlist</span>() <span class="ot">-&gt;</span> all_data3</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>Here we use base R's <code>lapply</code> function, and then data.table's <code>rbindlist</code> function to combine the resulting list of data frames into one.</p>
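<p>As a variation on the <code>lapply</code> approach, the following is a minimal sketch that lets <code>rbindlist</code> add the path column itself through its <code>idcol</code> argument. The folder name <code>csv_demo</code> and the tiny tables are assumptions made up for illustration so that the snippet runs on its own; they are not the chapter's data.</p>

```r
library(data.table)

# Hypothetical demo folder; the chapter itself reads from "temp/csv".
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Write three small CSV files so the example is self-contained.
for (i in 1:3) {
  fwrite(data.table(Integer = i, Real = c(1.5, 2.5)),
         file.path(demo_dir, paste0(i, ".csv")))
}

paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Naming the list by path lets rbindlist() copy each name into `idcol`.
all_data <- rbindlist(lapply(setNames(paths, paths), fread),
                      idcol = "file_path")
print(all_data)
```

<p>Because the id column is filled in during the bind, no per-file <code>:=</code> step is needed; whether this is faster than the version above likely depends on the data.</p>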
</section>
<section id="文件的压缩" class="level3" data-number="8.4.4">
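<h3 data-number="8.4.4" class="anchored" data-anchor-id="文件的压缩"><span class="header-section-number">8.4.4</span> Compressing files</h3>
<p>In this part we pack and compress the data generated earlier. The Excel output is already a single file, so it needs no further archiving. We first pack all the csv files into a zip file:</p>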
<div class="cell">
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="fu">archive_write_dir</span>(<span class="at">archive =</span> <span class="st">"temp/csv.zip"</span>,<span class="at">dir =</span> <span class="st">"temp/csv"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode cell-code" id="cb10"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="fu">archive_write_dir</span>(<span class="at">archive =</span> <span class="st">"temp/csv.zip"</span>,<span class="at">dir =</span> <span class="st">"temp/csv"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>Next, we pack all the fst files into a tar file:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb8"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="fu">archive_write_dir</span>(<span class="at">archive =</span> <span class="st">"temp/fst.tar"</span>,<span class="at">dir =</span> <span class="st">"temp/fst"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="fu">archive_write_dir</span>(<span class="at">archive =</span> <span class="st">"temp/fst.tar"</span>,<span class="at">dir =</span> <span class="st">"temp/fst"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
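<p>To check what an archive contains, or to unpack it again, the archive package also offers <code>archive</code> and <code>archive_extract</code>. The sketch below is a hedged illustration on a throwaway folder; the names <code>archive_demo</code> and <code>archive_demo_out</code> are assumptions, not paths used in the chapter.</p>

```r
library(archive)

# Hypothetical demo folder with two tiny csv files.
demo_dir <- file.path(tempdir(), "archive_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("a,b\n1,2", file.path(demo_dir, "1.csv"))
writeLines("a,b\n3,4", file.path(demo_dir, "2.csv"))

# Pack the folder, as in the chapter.
zip_path <- file.path(tempdir(), "archive_demo.zip")
archive_write_dir(archive = zip_path, dir = demo_dir)

# List the entries stored in the archive (one row per entry).
contents <- archive(zip_path)
print(contents)

# Unpack into a separate folder to confirm the round trip.
out_dir <- file.path(tempdir(), "archive_demo_out")
dir.create(out_dir, showWarnings = FALSE)
archive_extract(zip_path, dir = out_dir)
extracted <- list.files(out_dir, recursive = TRUE)
```

<p>Listing the entries before moving or deleting the original folder is a cheap way to confirm that nothing was left out.</p>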
</section>
<section id="文件的移动" class="level3" data-number="8.4.4">
<h3 data-number="8.4.4" class="anchored" data-anchor-id="文件的移动"><span class="header-section-number">8.4.4</span> Moving files</h3>
<section id="文件的移动" class="level3" data-number="8.4.5">
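<h3 data-number="8.4.5" class="anchored" data-anchor-id="文件的移动"><span class="header-section-number">8.4.5</span> Moving files</h3>
<p>A single archive is usually faster to move than a folder of many small files. Let us try it: we create a dest folder inside temp, move the csv folder and the csv archive into it in turn, and time each move:</p>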
<div class="cell">
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="fu">dir_create</span>(<span class="st">"temp/dest"</span>)</span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="fu">pst</span>(<span class="fu">file_move</span>(<span class="st">"temp/csv"</span>,<span class="st">"temp/dest/csv"</span>))</span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="fu">pst</span>(<span class="fu">file_move</span>(<span class="st">"temp/csv.zip"</span>,<span class="st">"temp/dest/csv.zip"</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode cell-code" id="cb12"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="fu">dir_create</span>(<span class="st">"temp/dest"</span>)</span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="fu">pst</span>(<span class="fu">file_move</span>(<span class="st">"temp/csv"</span>,<span class="st">"temp/dest/csv"</span>))</span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="fu">pst</span>(<span class="fu">file_move</span>(<span class="st">"temp/csv.zip"</span>,<span class="st">"temp/dest/csv.zip"</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
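<p>If the difference is not noticeable, try increasing the number of files. Moves within a single file system are often plain renames, so the gap can be small.</p>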
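<p>Where a move stays on one disk, a copy can stand in for a move across volumes, since such a move has to transfer every file. The sketch below is an illustration under that assumption, using only base R; the folder names and the count of 200 files are made up, and the timings will vary from machine to machine.</p>

```r
# Hypothetical source folder with many small files.
src_dir <- file.path(tempdir(), "move_demo_src")
dir.create(src_dir, showWarnings = FALSE)
for (i in 1:200) {
  writeLines(as.character(i), file.path(src_dir, paste0(i, ".txt")))
}

# Bundle the folder into one tar file with base R's internal tar.
tar_path <- file.path(tempdir(), "move_demo.tar")
old_wd <- setwd(tempdir())
tar(tar_path, files = "move_demo_src", tar = "internal")
setwd(old_wd)

dest_dir <- file.path(tempdir(), "move_demo_dest")
dir.create(dest_dir, showWarnings = FALSE)

# Time copying the folder file by file versus copying the single archive.
time_dir <- system.time(file.copy(src_dir, dest_dir, recursive = TRUE))
time_tar <- system.time(file.copy(tar_path, dest_dir))

cat("folder :", time_dir[["elapsed"]], "s\n")
cat("archive:", time_tar[["elapsed"]], "s\n")
```

<p>With only a few hundred tiny files both timings may round to zero; raising the file count should make the per-file overhead easier to see.</p>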
</section>
<section id="保存为excel文件" class="level3" data-number="8.4.5">
<h3 data-number="8.4.5" class="anchored" data-anchor-id="保存为excel文件"><span class="header-section-number">8.4.5</span> Saving as an Excel file</h3>
<section id="保存为excel文件" class="level3" data-number="8.4.6">
<h3 data-number="8.4.6" class="anchored" data-anchor-id="保存为excel文件"><span class="header-section-number">8.4.6</span> Saving as an Excel file</h3>
<p>In this step we store files 1 to 3 from the fst folder in a single Excel file (named "1-3.xlsx"), writing each file to its own worksheet:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb10"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="fu">map</span>(<span class="fu">path</span>(<span class="st">"temp"</span>,<span class="st">"fst"</span>,<span class="dv">1</span><span class="sc">:</span><span class="dv">3</span>,<span class="at">ext =</span> <span class="st">"fst"</span>),import_fst) <span class="sc">%&gt;%</span> </span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">write_xlsx</span>(<span class="fu">path</span>(<span class="st">"temp"</span>,<span class="st">"1-3"</span>,<span class="at">ext =</span> <span class="st">"xlsx"</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode cell-code" id="cb13"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="fu">map</span>(<span class="fu">path</span>(<span class="st">"temp"</span>,<span class="st">"fst"</span>,<span class="dv">1</span><span class="sc">:</span><span class="dv">3</span>,<span class="at">ext =</span> <span class="st">"fst"</span>),import_fst) <span class="sc">%&gt;%</span> </span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">write_xlsx</span>(<span class="fu">path</span>(<span class="st">"temp"</span>,<span class="st">"1-3"</span>,<span class="at">ext =</span> <span class="st">"xlsx"</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
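<p>When the list passed to <code>write_xlsx</code> is named, the names are used as the worksheet names. The sketch below illustrates this with three made-up data frames; it assumes that <code>write_xlsx</code> comes from the writexl package and uses <code>excel_sheets</code> from readxl only to inspect the result. The sheet names and the output file name are hypothetical.</p>

```r
library(writexl)
library(readxl)

# Three small stand-ins for the chapter's fst files.
sheets <- list(
  file_1 = data.frame(Integer = 1L, Real = c(1.5, 2.5)),
  file_2 = data.frame(Integer = 2L, Real = c(3.5, 4.5)),
  file_3 = data.frame(Integer = 3L, Real = c(5.5, 6.5))
)

# Each list element becomes one worksheet, named after the element.
xlsx_path <- file.path(tempdir(), "1-3_demo.xlsx")
write_xlsx(sheets, xlsx_path)

# Read back the worksheet names to confirm.
sheet_names <- excel_sheets(xlsx_path)
print(sheet_names)
```

<p>Naming the list up front makes the worksheets easier to tell apart than default sheet names would.</p>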
</section>
<section id="文件的删除" class="level3" data-number="8.4.6">
<h3 data-number="8.4.6" class="anchored" data-anchor-id="文件的删除"><span class="header-section-number">8.4.6</span> Deleting files</h3>
<section id="文件的删除" class="level3" data-number="8.4.7">
<h3 data-number="8.4.7" class="anchored" data-anchor-id="文件的删除"><span class="header-section-number">8.4.7</span> Deleting files</h3>
<p>Finally, we delete the temp folder:</p>
<div class="cell">
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="fu">file_delete</span>(<span class="st">"temp"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="sourceCode cell-code" id="cb14"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="fu">file_delete</span>(<span class="st">"temp"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
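<p>Since deleting a path that does not exist raises an error in fs, a script that may be rerun can guard the call. The following is a minimal sketch of that pattern on a throwaway folder; the name <code>delete_demo</code> is an assumption for illustration.</p>

```r
library(fs)

# Hypothetical folder with one nested file.
demo_dir <- path(tempdir(), "delete_demo")
dir_create(path(demo_dir, "csv"))
file_create(path(demo_dir, "csv", "1.csv"))

# Delete the folder and everything inside it, but only if it is there.
if (dir_exists(demo_dir)) {
  dir_delete(demo_dir)
}

still_there <- unname(dir_exists(demo_dir))
print(still_there)
```

<p>The guard keeps the cleanup step from failing when the folder has already been removed.</p>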
</section>
</section>