Page: 1
Better CSV processing
with Ruby 2.6
Kouhei Sutou/Kazuma Furuhashi
ClearCode Inc./Speee, Inc.
RubyKaigi 2019
2019-04-19
Better CSV processing with Ruby 2.6
Powered by Rabbit 3.0.0
Page: 2
Ad: Silver sponsor
Page: 3
Ad: Cafe sponsor
Page: 4
Kouhei Sutou
✓ The president of ClearCode Inc.
✓ A new maintainer of the csv library
✓ The founder of the Red Data Tools project
  ✓ A project that provides data processing tools for Ruby
Page: 5
Kazuma Furuhashi
✓ A member of Asakusa.rb / Red Data Tools
✓ Working at Speee, Inc.
Page: 6
csv in Ruby 2.6 (1)
Faster CSV parsing
Page: 7
Unquoted CSV
AAAAA,AAAAA,AAAAA
...
2.5         2.6         Faster?
432.0i/s    764.9i/s    1.77x
Page: 8
Quoted CSV
"AAAAA","AAAAA","AAAAA"
...
2.5         2.6         Faster?
274.1i/s    534.5i/s    1.95x
Page: 9
Quoted separator CSV (1)
",AAAAA",",AAAAA",",AAAAA"
...
2.5         2.6         Faster?
211.0i/s    330.0i/s    1.56x
Page: 10
Quoted separator CSV (2)
"AAAAA\r\n","AAAAA\r\n","AAAAA\r\n"
...
2.5         2.6         Faster?
118.7i/s    325.6i/s    2.74x
Page: 11
Quoted CSVs
                2.5            2.6
Just quoted     274.1i/s       554.5i/s
Include sep1    211.0i/s       330.0i/s
Include sep2    118.0i/s       325.6i/s
(Note)          (Slow down)    (Still fast)
Note: "Just quoted" on 2.6 is optimized
Page: 12
Multibyte CSV
あああああ,あああああ,あああああ
...
2.5         2.6         Faster?
371.2i/s    626.6i/s    1.69x
Page: 13
csv in Ruby 2.6 (2)
Faster CSV writing
Page: 14
CSV.generate_line
n_rows.times do
  CSV.generate_line(fields)
end
2.5         2.6         Faster?
284.4i/s    684.2i/s    2.41x
Page: 15
CSV#<<
output = StringIO.new
csv = CSV.new(output)
n_rows.times {csv << fields}
2.5          2.6          Faster?
2891.4i/s    4824.1i/s    1.67x
Page: 16
CSV.generate_line vs. CSV#<<
                 2.5          2.6
generate_line    284.4i/s     684.2i/s
<<               2891.4i/s    4824.1i/s
Use << for multiple writes
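A minimal sketch of the two writing styles compared above, showing that they produce the same output and differ only in how often a CSV object is set up. The helper name write_rows_both_ways and the sample fields are assumptions made for illustration; they are not from the talk.

```ruby
require "csv"
require "stringio"

# Hypothetical helper: writes the same rows in both styles and returns
# the two resulting strings so they can be compared.
def write_rows_both_ways(fields, n_rows)
  # Style 1: one CSV.generate_line call per row.
  by_generate_line = +""
  n_rows.times do
    by_generate_line << CSV.generate_line(fields)
  end

  # Style 2: create one CSV object and append every row with <<.
  output = StringIO.new
  csv = CSV.new(output)
  n_rows.times { csv << fields }

  [by_generate_line, output.string]
end
```

Calling write_rows_both_ways(["AAAAA", "BB,BB"], 3) should return two equal strings, so switching to << for multiple rows changes the cost, not the result.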
Page: 17
csv in Ruby 2.6 (3)
New CSV parser for further improvements
Page: 18
Benchmark with KEN_ALL.CSV
01101,"060 ","0600000","ホッカイドウ","サッポロシチュウオウク",...
...(124257 lines)...
47382,"90718","9071801","オキナワケン","ヤエヤマグンヨナグニチョウ",...
Zip code data in Japan
https://www.post.japanpost.jp/zipcode/download.html
Page: 19
KEN_ALL.CSV statistics
Size            11.7MiB
# of columns    15
# of rows       124259
Encoding        CP932
Page: 20
Parsing KEN_ALL.CSV
CSV.foreach("KEN_ALL.CSV",
            "r:cp932") do |row|
end
2.5      2.6      Faster?
1.17s    0.79s    1.48x
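A small runnable sketch of the same CSV.foreach call against a tiny CP932 file, for readers without KEN_ALL.CSV at hand. The helper names and the one-line sample data are assumptions for illustration only. The sample uses full-width katakana where the real file differs.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: collects every row of a CP932-encoded CSV file.
def read_cp932_rows(path)
  rows = []
  # "r:cp932" declares the file's external encoding; nothing is transcoded.
  CSV.foreach(path, "r:cp932") do |row|
    rows << row
  end
  rows
end

# Hypothetical helper: writes a one-line CP932 sample file and returns its path.
def write_cp932_sample
  file = Tempfile.create(["ken_all_sample", ".csv"])
  file.binmode
  file.write(%Q[01101,"0600000","ホッカイドウ"\n].encode("CP932"))
  file.close
  file.path
end
```

The fields come back tagged as CP932 (Windows-31J) strings; call encode("UTF-8") on a field when UTF-8 text is needed.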
Page: 21
Fastest parsing in pure Ruby
input.each_line(chomp: true) do |line|
  line.split(",", -1) do |column|
  end
end
Limitation: the input must not contain quotes
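A minimal sketch of the split-based approach as a complete method, assuming quote-free input. The method name parse_unquoted is hypothetical.

```ruby
require "stringio"

# Hypothetical helper: parses quote-free CSV with String#split only.
def parse_unquoted(input)
  rows = []
  input.each_line(chomp: true) do |line|
    # The -1 limit keeps trailing empty columns instead of dropping them.
    rows << line.split(",", -1)
  end
  rows
end
```

For example, parse_unquoted(StringIO.new("x,y,\n")) keeps the empty third column, whereas "x,y,".split(",") without the limit would drop it.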
Page: 22
KEN_ALL.CSV without quote
01101,060 ,0600000,ホッカイドウ,サッポロシチュウオウク,...
...(124257 lines)...
47382,90718,9071801,オキナワケン,ヤエヤマグンヨナグニチョウ,...
Page: 23
Optimized no quote CSV parsing
CSV.foreach("KEN_ALL_NO_QUOTE.CSV",
            "r:cp932",
            quote_char: nil) {|row|}
split    2.6      Faster?
0.32s    0.37s    0.86x (almost the same)
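A short sketch of what quote_char: nil means in practice, using CSV.parse on an in-memory string instead of the file. It assumes a csv version that accepts quote_char: nil (the version discussed in this talk or later). The helper name is hypothetical.

```ruby
require "csv"

# Hypothetical helper: parses text with quote handling switched off.
def parse_without_quotes(text)
  CSV.parse(text, quote_char: nil)
end
```

With quoting disabled, a double quote is just another character: parse_without_quotes("a,b\"c,d\n") keeps the quote inside the second field rather than treating it as the start of a quoted value.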
Page: 24
Summary: Performance
✓ Parsing: 1.5x-3x faster
  ✓ Up to the "split" level by using an option
✓ Writing: 1.5x-2.5x faster
  ✓ Use CSV#<< rather than CSV.generate_line
Page: 25
How to improve performance (1)
Complex quote
Page: 26
Complex quote
"AA""AAA"
"AA,AAA"
"AA\rAAA"
"AA\nAAA"
Page: 27
Use StringScanner
✓ String#split is very fast
✓ String#split is too naive for complex quotes
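To illustrate why plain splitting is too naive here, this sketch puts String#split and CSV.parse_line side by side on one line containing a quoted separator and an escaped quote. The helper name split_and_parse is hypothetical.

```ruby
require "csv"

# Hypothetical helper: shows what naive splitting and csv each make of a line.
def split_and_parse(line)
  {
    split: line.chomp.split(",", -1), # cuts at every comma, quoted or not
    csv: CSV.parse_line(line),        # honors quotes and "" escapes
  }
end
```

For the line "AA,AAA","BB""BBB",CC the split result has four pieces with quote characters still attached, while csv returns the three intended fields.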
Page: 28
2.5 uses String#split
in_extended_column = false # "...\n..." case
@input.each_line do |line|
  line.split(",", -1).each do |part|
    if in_extended_column
      # ...
    elsif part.start_with?('"')
      if part.end_with?('"')
        row << part.gsub('""', '"') # "...""..." case
      else
        in_extended_column = true
      end
      # ...
Page: 29
split: Complex quote
Just quoted     274.1i/s
Include sep1    211.0i/s
Include sep2    118.0i/s
Slows down
Page: 30
2.6 uses StringScanner
row = []
until @scanner.eos?
  value = parse_column_value
  if @scanner.scan(/,/)
    row << value
  elsif @scanner.scan(/\n/)
    row << value
    yield(row)
    row = []
  end
end
Page: 31
parse_column_value
def parse_column_value
  parse_unquoted_column_value ||
    parse_quoted_column_value
end
Composable components
Page: 32
parse_unquoted_column_value
def parse_unquoted_column_value
  @scanner.scan(/[^,"\r\n]+/)
end
Page: 33
parse_quoted_column_value
def parse_quoted_column_value
  # Not quoted
  return nil unless @scanner.scan(/"/)
  # Quoted case ...
end
Page: 34
Parse methods can be composed
def parse_column_value
  parse_unquoted_column_value ||
    parse_quoted_column_value
end
Easy to maintain
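The fragments above can be assembled into a small self-contained parser. This is only a teaching sketch under stated assumptions, not the csv library's actual implementation: the class name TinyCSVParser, the quoted-value loop, the end-of-input handling, and the error messages are all hypothetical additions.

```ruby
require "strscan"

# Hypothetical teaching class; not the csv library's parser.
class TinyCSVParser
  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  # Yields each row as an array; empty columns become nil.
  def each_row
    row = []
    until @scanner.eos?
      value = parse_column_value
      if @scanner.scan(/,/)
        row << value
      elsif @scanner.scan(/\r\n|\n/) || @scanner.eos?
        row << value
        yield(row)
        row = []
      else
        raise ArgumentError, "unexpected input at #{@scanner.pos}"
      end
    end
  end

  private

  # Composition: try the cheap unquoted case first, then the quoted one.
  def parse_column_value
    parse_unquoted_column_value || parse_quoted_column_value
  end

  def parse_unquoted_column_value
    @scanner.scan(/[^,"\r\n]+/)
  end

  def parse_quoted_column_value
    return nil unless @scanner.scan(/"/)
    value = +""
    loop do
      value << (@scanner.scan(/[^"]+/) || "")
      if @scanner.scan(/""/)
        value << '"'          # "" inside quotes is one literal quote
      elsif @scanner.scan(/"/)
        return value          # closing quote
      else
        raise ArgumentError, "unclosed quote"
      end
    end
  end
end
```

Because each parse_* method either consumes input and returns a value or returns nil without consuming anything, new cases can be added to parse_column_value by chaining another method with ||.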
Page: 35
Point (1)
✓ Use StringScanner for complex cases
✓ StringScanner for complex cases:
  ✓ Easy to maintain
  ✓ No performance regression
Page: 36
StringScanner: Complex quote
Just quoted     554.5i/s
Include sep1    330.0i/s
Include sep2    325.6i/s
No slowdown...?
Page: 37
How to improve performance (2)
Simple case
Page: 38
Simple case
AAAAA
"AAAAA"
Page: 39
Use String#split
StringScanner is slow for simple cases
Page: 40
Fallback to StringScanner impl.
def parse_by_split(&block)
  @input.each_line do |line|
    if complex?(line)
      return parse_by_string_scanner(&block)
    else
      yield(line.split(","))
    end
  end
end
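A runnable sketch of the same fast-path-then-fallback idea. Several things here are assumptions rather than the talk's code: the method name parse_with_fallback, the "contains a double quote" test standing in for complex?, and the use of CSV.parse as the robust path. Unlike the slide, this version restarts from the beginning of the text when it falls back.

```ruby
require "csv"

# Hypothetical helper: String#split fast path with a robust fallback.
def parse_with_fallback(text)
  rows = []
  text.each_line(chomp: true) do |line|
    # Any quote means splitting cannot be trusted: redo everything robustly.
    return CSV.parse(text) if line.include?('"')
    rows << line.split(",", -1)
  end
  rows
end
```

One difference to keep in mind: the split path returns "" for empty columns while CSV.parse returns nil, so a real implementation would need to normalize the two.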
Page: 41
Quoted CSVs
                StringScanner    split + StringScanner
Just quoted     311.7i/s         523.4i/s
Include sep1    312.9i/s         309.8i/s
Include sep2    311.3i/s         312.6i/s
Page: 42
Point (2)
✓ First try the optimized version
✓ Fall back to the robust version when complexity is detected
Page: 43
How to improve performance (3)
loop do
↓
while true
Page: 44
loop vs. while
How      Throughput
loop     377i/s
while    401i/s
Page: 45
Point (3)
✓ while doesn't create a new scope
✓ Normally, you can use loop
  ✓ Normally, loop isn't a bottleneck
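A small sketch of the scope difference mentioned above: loop takes a block, so a variable first assigned inside it is block-local, while while is plain syntax and shares the surrounding scope. The method and variable names are hypothetical.

```ruby
# Hypothetical demo method: reports which loop-local variables survive.
def loop_scope_demo
  i = 0
  while true
    seen_in_while = i   # no block here, so this is a method-level local
    i += 1
    break if i == 3
  end

  j = 0
  loop do
    seen_in_loop = j    # first assigned inside the block: block-local
    j += 1
    break if j == 3
  end

  # defined? returns nil because seen_in_loop does not exist out here.
  loop_local = defined?(seen_in_loop)

  { while_value: seen_in_while, loop_local_visible: !loop_local.nil? }
end
```

The extra block scope is part of why loop costs slightly more per iteration, but as the slide says, that rarely matters outside a hot inner loop.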
Page: 46
How to improve performance (4)
Lazy
Page: 47
A CSV object is both a parser and a writer
✓ 2.5: Always initializes everything
✓ 2.6: Initializes each part only when it's needed
Page: 48
Write performance
                 2.5          2.6          Faster?
generate_line    284.4i/s     684.2i/s     2.41x
<<               2891.4i/s    4824.1i/s    1.67x
Page: 49
How to initialize lazily
def parser
  @parser ||= Parser.new(...)
end
def writer
  @writer ||= Writer.new(...)
end
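A self-contained sketch of the ||= pattern above that makes the laziness observable. The class LazyCSV, its created log, and the placeholder objects are hypothetical stand-ins for the real Parser and Writer classes, whose constructor arguments the slide elides.

```ruby
# Hypothetical stand-in for a class that owns a parser and a writer.
class LazyCSV
  attr_reader :created

  def initialize
    @created = [] # records which parts were actually built
  end

  def parser
    @parser ||= build(:parser)
  end

  def writer
    @writer ||= build(:writer)
  end

  private

  # Placeholder for the real (expensive) construction.
  def build(kind)
    @created << kind
    Object.new
  end
end
```

A write-only user who calls only writer never pays for the parser, and repeated calls reuse the memoized object.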
Page: 50
Point (4)
✓ Do only what is needed
✓ One class for one feature
Page: 51
New features by new parser
✓ Add support for \" escape
✓ Add strip: option
Page: 52
\" escape
\"でのエスケープ
CSV.parse(%Q["a""bc","a\\"bc"],
          liberal_parsing: {backslash_quote: true})
# [["a\\"bc", "a\\"bc"]]
Page: 53
strip:
CSV.parse(%Q[ abc , " abc"], strip: true)
# [["abc", " abc"]]
CSV.parse(%Q[abca,abc], strip: "a")
# [["bc", "bc"]]
Page: 54
csv in Ruby 2.6 (4)
Keep backward compatibility
Page: 55
How to keep backward compatibility
✓ Reconstruct test suites
✓ Add benchmark suites
Page: 56
Test
✓ Important to detect incompatibilities
✓ Must be easy to maintain
  ✓ To keep developing
Page: 57
Easy to maintain
✓ Each test is easy to understand
✓ Each test is easy to run on its own
  ✓ Focusing on a failed case makes debugging easy
Page: 58
Benchmark
✓ Important to detect performance regression bugs
Page: 59
benchmark_driver gem
Fully-featured benchmark driver for Ruby 3x3
Page: 60
benchmark_driver gem in csv
✓ YAML input is easy to use
✓ Can compare multiple gem versions
  ✓ To detect performance regression
Page: 61
Benchmark for each gem version
Page: 62
csv/benchmark/
✓ convert_nil.yaml
✓ parse{,_liberal_parsing}.yaml
✓ parse_{quote_char_nil,strip}.yaml
✓ read.yaml, shift.yaml, write.yaml
Page: 63
Benchmark as a starting point
✓ Join csv development!
  ✓ Adding a new benchmark is a good start
  ✓ We'll focus on improving performance for benchmark cases
Page: 64
How to use improved csv?
gem install csv
Page: 65
csv in Ruby 2.5
Default gemified
Page: 66
Default gem
✓ Can use it just by require
✓ Can use it without an entry in Gemfile
  (but then you get the older version bundled with your Ruby)
✓ Can upgrade it by gem
Page: 67
How to use improved csv?
gem install csv
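After installing the gem, one way to confirm which csv was actually loaded is to look at CSV::VERSION. The helper name below is hypothetical. In a Bundler-managed project, the upgraded gem is only picked up if the Gemfile lists it (gem "csv"), since without an entry the copy bundled with Ruby is used.

```ruby
require "csv"

# Hypothetical helper: reports the version of the csv library in use.
def loaded_csv_version
  CSV::VERSION
end
```

Printing loaded_csv_version before and after the upgrade is a quick check that the improved version is the one being required.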
Page: 68
Future
Faster
Page: 69
Improve String#split
Accept " " as a normal separator
Page: 70
split(" ") works like awk
" a  b c".split(" ", -1)
# => ["a", "b", "c"]
" a  b c".split(/ /, -1)
# => ["", "a", "", "b", "c"]
Page: 71
String#split in csv
if @column_separator == " "
  line.split(/ /, -1)
else
  line.split(@column_separator, -1)
end
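The branch above wrapped into a standalone method so it can be tried directly. The method name split_columns and passing the separator as an argument (rather than reading @column_separator) are assumptions for illustration.

```ruby
# Hypothetical helper: splits one line on the given column separator.
def split_columns(line, column_separator)
  if column_separator == " "
    # A " " string would trigger awk-style splitting, so use a regexp.
    line.split(/ /, -1)
  else
    line.split(column_separator, -1)
  end
end
```

With a space separator, split_columns(" a  b c", " ") preserves the empty columns that awk-style splitting would silently drop.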
Page: 72
split(string) vs. split(regexp)
regexp       string        Faster?
344448i/s    3161117i/s    9.18x
See also [Feature:15771]
Page: 73
Improve StringScanner#scan
Accept String as pattern
Page: 74
scan(string) vs. scan(regexp)
regexp         string         Faster?
14712660i/s    18421631i/s    1.25x
See also ruby/strscan#4
Page: 75
Faster KEN_ALL.CSV parsing (1)
              Elapsed
csv           0.791s
FastestCSV    0.141s
Page: 76
Faster KEN_ALL.CSV parsing (2)
              Encoding    Elapsed
csv           CP932       0.791s
FastestCSV    CP932       0.141s
csv           UTF-8       1.345s
FastestCSV    UTF-8       0.713s
Page: 77
Faster KEN_ALL.CSV parsing (3)
                Encoding    Elapsed
FastestCSV      UTF-8       0.713s
Python          UTF-8       0.208s
Apache Arrow    UTF-8       0.145s
Page: 78
Further work
✓ Improve transcoding performance of Ruby
✓ Improve simple case parse performance by implementing a parser in C
✓ Improve performance of REXML as well as csv
Page: 79
Join us!
✓ Red Data Tools: https://red-data-tools.github.io/
  ✓ RubyData Workshop: Today 14:20-15:30
  ✓ Code Party: Today 19:00-21:00
  ✓ After Hack: Sun. 10:30-17:30
Page: 80
Join us!!
✓ OSS Gate: https://oss-gate.github.io/
  ✓ Provides a "gate" to join OSS development
  ✓ Both ClearCode and Speee are sponsors
✓ OSS Gate Fukuoka: https://oss-gate-fukuoka.connpass.com/