Rabbit Slide Show

Better CSV processing with Ruby 2.6

2019-04-19

Description

csv, one of the standard libraries, in Ruby 2.6 has many improvements:

* Default gemified
* Faster CSV parsing
* Faster CSV writing
* Clean new CSV parser implementation for further improvements
* Reconstructed test suites for further improvements
* Benchmark suites for further performance improvements

These improvements are done without breaking backward compatibility. This talk describes details of these improvements by a new csv maintainer.

Text

Page: 1

Better CSV processing
with Ruby 2.6
Kouhei Sutou/Kazuma Furuhashi
ClearCode Inc./Speee, Inc.
RubyKaigi 2019
2019-04-19
Better CSV processingwith Ruby 2.6
Powered by Rabbit 3.0.0

Page: 2

Ad: Silver sponsor

Page: 3

Ad: Cafe sponsor

Page: 4

Kouhei Sutou
✓ The president of ClearCode Inc.
クリアコードの社長
✓ A new maintainer of the csv library
csvライブラリーの新メンテナー
✓ The founder of Red Data Tools project
Red Data Toolsプロジェクトの立ち上げ人
✓ Provides data processing tools for Ruby
Ruby用のデータ処理ツールを提供するプロジェクト

Page: 5

Kazuma Furuhashi
✓ A member of Asakusa.rb / Red Data Tools
Asakusa.rb/Red Data Toolsメンバー
✓ Working at Speee, Inc.
Speeeで働いている

Page: 6

csv in Ruby 2.6 (1)
Ruby 2.6のcsv(1)
Faster CSV parsing
CSVパースの高速化

Page: 7

Unquoted CSV
クォートなしのCSV
AAAAA,AAAAA,AAAAA
...
2.5       2.6       Faster?
432.0i/s  764.9i/s  1.77x

Page: 8

Quoted CSV
クォートありのCSV
"AAAAA","AAAAA","AAAAA"
...
2.5       2.6       Faster?
274.1i/s  534.5i/s  1.95x

Page: 9

Quoted separator CSV (1)
区切り文字をクォートしているCSV(1)
",AAAAA",",AAAAA",",AAAAA"
...
2.5       2.6       Faster?
211.0i/s  330.0i/s  1.56x

Page: 10

Quoted separator CSV (2)
区切り文字をクォートしているCSV(2)
"AAAAA\r\n","AAAAA\r\n","AAAAA\r\n"
...
2.5       2.6       Faster?
118.7i/s  325.6i/s  2.74x

Page: 11

Quoted CSVs
クォートありのCSV
              2.5          2.6
Just quoted   274.1i/s     534.5i/s
Include sep1  211.0i/s     330.0i/s
Include sep2  118.7i/s     325.6i/s
(Note)        (Slow down)  (Still fast)
Note: "Just quoted" on 2.6 is optimized

Page: 12

Multibyte CSV
マルチバイトのCSV
あああああ,あああああ,あああああ
...
2.5       2.6       Faster?
371.2i/s  626.6i/s  1.69x

Page: 13

csv in Ruby 2.6 (2)
Ruby 2.6のcsv(2)
Faster CSV writing
CSV書き出しの高速化

Page: 14

CSV.generate_line
n_rows.times do
  CSV.generate_line(fields)
end

2.5       2.6       Faster?
284.4i/s  684.2i/s  2.41x

Page: 15

CSV#<<
output = StringIO.new
csv = CSV.new(output)
n_rows.times {csv << fields}

2.5        2.6        Faster?
2891.4i/s  4824.1i/s  1.67x

Page: 16

CSV.generate_line vs. CSV#<<
               2.5        2.6
generate_line  284.4i/s   684.2i/s
<<             2891.4i/s  4824.1i/s
Use << for multiple writes
複数行書き出すときは<<を使うこと
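The two writing styles compared above can be sketched side by side. This is a minimal illustration, not the benchmark from the talk: the `fields` row and the `n_rows` count are made-up values. Both styles produce the same output; the difference is that `CSV.generate_line` sets up a new CSV object on every call, while `CSV#<<` reuses one.

```ruby
require "csv"
require "stringio"

fields = ["AAAAA", "BB,BB", "CC\"CC"] # illustrative row, not from the talk
n_rows = 3                            # illustrative count

# Style 1: one CSV.generate_line call per row.
by_generate_line = +""
n_rows.times do
  by_generate_line << CSV.generate_line(fields)
end

# Style 2: one CSV object, rows appended with <<.
output = StringIO.new
csv = CSV.new(output)
n_rows.times { csv << fields }
by_append = output.string

puts by_append
```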

Page: 17

csv in Ruby 2.6 (3)
Ruby 2.6のcsv(3)
New CSV parser for further improvements
さらなる改良のための新しいCSVパーサー

Page: 18

Benchmark with KEN_ALL.CSV
KEN_ALL.CSVでのベンチマーク
01101,"060 ","0600000","ホッカイドウ","サッポロシチュウオウク",...
...(124257 lines)...
47382,"90718","9071801","オキナワケン","ヤエヤマグンヨナグニチョウ",...
Zip code data in Japan
日本の郵便番号データ
https://www.post.japanpost.jp/zipcode/download.html

Page: 19

KEN_ALL.CSV statistics
KEN_ALL.CSVの統計情報
Size(サイズ): 11.7MiB
# of columns(列数): 15
# of rows(行数): 124259
Encoding(エンコーディング): CP932

Page: 20

Parsing KEN_ALL.CSV
KEN_ALL.CSVのパース
CSV.foreach("KEN_ALL.CSV",
            "r:cp932") do |row|
end

2.5    2.6    Faster?
1.17s  0.79s  1.48x

Page: 21

Fastest parsing in pure Ruby
Ruby実装での最速のパース方法
input.each_line(chomp: true) do |line|
  line.split(",", -1) do |column|
  end
end
Limitation: No quote
制限:クォートがないこと

Page: 22

KEN_ALL.CSV without quote
クォートなしのKEN_ALL.CSV
01101,060 ,0600000,ホッカイドウ,サッポロシチュウオウク,...
...(124257 lines)...
47382,90718,9071801,オキナワケン,ヤエヤマグンヨナグニチョウ,...

Page: 23

Optimized no quote CSV parsing
最適化したクォートなしCSVのパース方法
CSV.foreach("KEN_ALL_NO_QUOTE.CSV",
            "r:cp932",
            quote_char: nil) {|row|}

split  2.6    Faster?
0.32s  0.37s  0.86x (almost the same/同等)
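A small sketch of the two approaches compared above, using an in-memory string instead of KEN_ALL_NO_QUOTE.CSV. The sample data is made up for illustration and contains no empty fields, and it assumes a csv version that accepts `quote_char: nil`.

```ruby
require "csv"

# Illustrative unquoted data; the real KEN_ALL file is not reproduced here.
data = "01101,060  ,0600000\n47382,90718,9071801\n"

# Plain String#split: only valid when no field is quoted.
rows_by_split = data.each_line(chomp: true).map do |line|
  line.split(",", -1)
end

# csv with quoting disabled via quote_char: nil.
rows_by_csv = CSV.parse(data, quote_char: nil)

p rows_by_split
p rows_by_csv
```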

Page: 24

Summary: Performance
まとめ:性能
✓ Parsing: 1.5x-3x faster
パース:1.5x-3x高速
✓ Up to "split"-level speed by using an option
オプションを指定すると最大で「split」レベルまで高速化可能
✓ Writing: 1.5x-2.5x faster
書き出し:1.5x-2.5x高速
✓ Use CSV#<< rather than CSV.generate_line
CSV.generate_lineよりもCSV#<<を使うこと

Page: 25

How to improve performance (1)
速度改善方法(1)
Complex quote
複雑なクォート

Page: 26

Complex quote
複雑なクォート
"AA""AAA"
"AA,AAA"
"AA\rAAA"
"AA\nAAA"
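To make the quoting cases above concrete, here is a short sketch that feeds three of them to `CSV.parse_line` and shows what a naive `String#split` does with a quoted separator. The variable names are chosen for this illustration only.

```ruby
require "csv"

# Quoting cases from the slide, written as Ruby strings.
escaped_quote    = %Q["AA""AAA"]  # "" inside quotes is one literal quote
quoted_separator = %Q["AA,AAA"]   # a comma inside quotes is data
quoted_newline   = %Q["AA\nAAA"]  # a newline inside quotes is data

csv_escaped   = CSV.parse_line(escaped_quote)
csv_separator = CSV.parse_line(quoted_separator)
csv_newline   = CSV.parse_line(quoted_newline)

# A naive split cuts the quoted field in half at the comma.
naive = quoted_separator.split(",", -1)

p csv_escaped, csv_separator, csv_newline, naive
```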

Page: 27

Use StringScanner
StringScannerを使う
✓ String#split is very fast
String#splitは高速
✓ String#split is too naive for complex quotes
String#splitは複雑なクォートを処理するには単純過ぎる

Page: 28

2.5 uses String#split
in_extended_column = false # "...\n..." case
@input.each_line do |line|
  line.split(",", -1).each do |part|
    if in_extended_column
      # ...
    elsif part.start_with?('"')
      if part.end_with?('"')
        row << part.gsub('""', '"') # "...""..." case
      else
        in_extended_column = true
      end
      # ...

Page: 29

split: Complex quote
Just quoted   274.1i/s
Include sep1  211.0i/s
Include sep2  118.7i/s
Slow down
遅くなる

Page: 30

2.6 uses StringScanner
row = []
until @scanner.eos?
  value = parse_column_value
  if @scanner.scan(/,/)
    row << value
  elsif @scanner.scan(/\n/)
    row << value
    yield(row)
    row = []
  end
end

Page: 31

parse_column_value
def parse_column_value
  parse_unquoted_column_value ||
    parse_quoted_column_value
end
Composable components
部品を組み合わせられる

Page: 32

parse_unquoted_column_value
def parse_unquoted_column_value
  @scanner.scan(/[^,"\r\n]+/)
end

Page: 33

parse_quoted_column_value
def parse_quoted_column_value
  # Not quoted
  return nil unless @scanner.scan(/"/)
  # Quoted case ...
end

Page: 34

Parse methods can be composed
パースメソッドを組み合わせられる
def parse_column_value
  parse_unquoted_column_value ||
    parse_quoted_column_value
end
Easy to maintain
メンテナンスしやすい
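The pieces shown on the last few slides can be assembled into a small runnable parser. This is a sketch for illustration only, not the csv gem's actual implementation: the class name `TinyCSVParser`, the `each_row` method, and the error messages are hypothetical, and the quoted-value branch (elided on the slide) is filled in with one straightforward approach.

```ruby
require "strscan"

# Hypothetical minimal parser; not the csv gem's real code.
class TinyCSVParser
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  def each_row
    row = []
    until @scanner.eos?
      value = parse_column_value
      if @scanner.scan(/,/)
        row << value
      elsif @scanner.scan(/\r?\n/) || @scanner.eos?
        row << value
        yield(row)
        row = []
      else
        raise "malformed CSV at byte #{@scanner.pos}"
      end
    end
  end

  private

  # Small parse methods composed with ||.
  def parse_column_value
    parse_unquoted_column_value || parse_quoted_column_value
  end

  def parse_unquoted_column_value
    @scanner.scan(/[^,"\r\n]+/)
  end

  def parse_quoted_column_value
    return nil unless @scanner.scan(/"/) # not quoted
    value = +""
    loop do
      value << @scanner.scan(/[^"]*/)
      raise "unclosed quote" unless @scanner.scan(/"/)
      # A doubled quote is an escaped quote; anything else ends the value.
      break unless @scanner.scan(/"/)
      value << '"'
    end
    value
  end
end

rows = []
TinyCSVParser.new(%Q[a,"b,c","d""e"\n1,"x\ny",\n]).each_row { |row| rows << row }
p rows
```

Quoted separators, quoted newlines, and doubled quotes all go through the same scanner, so no special state such as in_extended_column is needed.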

Page: 35

Point (1)
ポイント(1)
✓ Use StringScanner for complex case
複雑なケースにはStringScannerを使う
✓ StringScanner for complex case:
複雑なケースにStringScannerを使うと:
✓ Easy to maintain
メンテナンスしやすい
✓ No performance regression
性能が劣化しない

Page: 36

StringScanner: Complex quote
Just quoted   534.5i/s
Include sep1  330.0i/s
Include sep2  325.6i/s
No slow down...?
遅くなっていない。。。?

Page: 37

How to improve performance (2)
速度改善方法(2)
Simple case
単純なケース

Page: 38

Simple case
単純なケース
AAAAA
"AAAAA"

Page: 39

Use String#split
String#splitを使う
StringScanner is slow for simple cases
StringScannerは単純なケースでは遅い

Page: 40

Fallback to StringScanner impl.
StringScanner実装にフォールバック
def parse_by_split(&block)
  @input.each_line do |line|
    if complex?(line)
      return parse_by_string_scanner(&block)
    else
      yield(line.split(","))
    end
  end
end

Page: 41

Quoted CSVs
クォートありのCSV
              StringScanner  split + StringScanner
Just quoted   311.7i/s       523.4i/s
Include sep1  312.9i/s       309.8i/s
Include sep2  311.3i/s       312.6i/s

Page: 42

Point (2)
ポイント(2)
✓ First try optimized version
まず最適化バージョンを試す
✓ Fall back to robust version when complexity is detected
複雑だとわかったらちゃんとしたバージョンにフォールバック
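The "optimized first, robust on demand" idea can be sketched as follows. The helper name `parse_with_fallback` is hypothetical, and this sketch uses `CSV.parse` itself as the robust version and "the line contains a quote" as the complexity check; the csv gem's real logic differs. The sample inputs avoid empty fields, where the two paths would return different values ("" versus nil).

```ruby
require "csv"

# Hypothetical helper for illustration; not the csv gem's real code.
def parse_with_fallback(input)
  rows = []
  offset = 0
  input.each_line do |line|
    if line.include?('"')
      # Complexity detected: hand the rest of the input to the robust parser.
      return rows + CSV.parse(input.byteslice(offset..-1))
    end
    # Simple line: the fast path.
    rows << line.chomp.split(",", -1)
    offset += line.bytesize
  end
  rows
end

simple_rows  = parse_with_fallback("a,b\nc,d\n")
complex_rows = parse_with_fallback(%Q[a,b\n"c,1","d\ne"\nf,g\n])

p simple_rows
p complex_rows
```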

Page: 43

How to improve performance (3)
速度改善方法(3)
loop do
↓
while true

Page: 44

loop vs. while
How    Throughput
loop   377i/s
while  401i/s

Page: 45

Point (3)
ポイント(3)
✓ while doesn't create a new scope
whileは新しいスコープを作らない
✓ Normally, you can use loop
ふつうはloopでよい
✓ Normally, loop isn't a bottleneck
ふつうはloopがボトルネックにはならない
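A rough way to observe the difference on your own machine, assuming nothing beyond the standard benchmark library. The method names and the iteration count are made up for this sketch, and the timings will not match the numbers from the talk.

```ruby
require "benchmark"

n_iterations = 1_000_000 # illustrative count

def count_with_loop(n)
  i = 0
  loop do # the body is a block, invoked once per iteration
    break if i >= n
    i += 1
  end
  i
end

def count_with_while(n)
  i = 0
  while true # plain syntax, no block
    break if i >= n
    i += 1
  end
  i
end

loop_time  = Benchmark.realtime { count_with_loop(n_iterations) }
while_time = Benchmark.realtime { count_with_while(n_iterations) }
puts format("loop: %.3fs, while: %.3fs", loop_time, while_time)
```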

Page: 46

How to improve performance (4)
速度改善方法(4)
Lazy
遅延

Page: 47

CSV object is parser and writer
CSVオブジェクトは読み書きできる
✓ 2.5: Always initializes everything
2.5:常にすべてを初期化
✓ 2.6: Initializes when it's needed
2.6:必要になったら初期化

Page: 48

Write performance
               2.5        2.6        Faster?
generate_line  284.4i/s   684.2i/s   2.41x
<<             2891.4i/s  4824.1i/s  1.67x

Page: 49

How to initialize lazily
初期化を遅延する方法
def parser
  @parser ||= Parser.new(...)
end

def writer
  @writer ||= Writer.new(...)
end

Page: 50

Point (4)
ポイント(4)
✓ Do only needed things
必要なことだけする
✓ One class for one feature
機能ごとにクラスを分ける
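A self-contained sketch of the lazy initialization pattern with one class per feature. `DemoParser`, `DemoWriter`, `LazyCSV`, and the `initialized` helper are hypothetical names for this illustration; they are not the csv gem's internals.

```ruby
# Hypothetical stand-ins for the real parser and writer classes.
class DemoParser
  def initialize(io)
    @io = io
  end
end

class DemoWriter
  def initialize(io)
    @io = io
  end
end

class LazyCSV
  def initialize(io)
    @io = io # cheap: neither parser nor writer is built yet
  end

  # Built on first use, then reused.
  def parser
    @parser ||= DemoParser.new(@io)
  end

  def writer
    @writer ||= DemoWriter.new(@io)
  end

  # Demonstration helper: which parts have been built so far.
  def initialized
    instance_variables - [:@io]
  end
end

csv = LazyCSV.new($stdout)
before = csv.initialized # nothing built yet
csv.writer
after = csv.initialized  # only the writer exists

p before
p after
```

A write-only use never pays for parser setup, and a read-only use never pays for writer setup.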

Page: 51

New features by new parser
新しいパーサーによる新機能
✓ Add support for \" escape
\"でのエスケープをサポート
✓ Add strip: option
strip:オプションを追加

Page: 52

\" escape
\"でのエスケープ
CSV.parse(%Q["a""bc","a\\"bc"],
          liberal_parsing: {backslash_quote: true})
# [["a\"bc", "a\"bc"]]

Page: 53

strip:
strip:
CSV.parse(%Q[ abc , " abc"], strip: true)
# [["abc", " abc"]]
CSV.parse(%Q[abca,abc], strip: "a")
# [["bc", "bc"]]

Page: 54

csv in Ruby 2.6 (4)
Ruby 2.6のcsv(4)
Keep backward compatibility
互換性を維持

Page: 55

How to keep backward compat.
互換性を維持する方法
✓ Reconstruct test suites
テストを整理
✓ Add benchmark suites
ベンチマークを追加

Page: 56

Test
テスト
✓ Important to detect incompat.
非互換の検出のために重要
✓ Must be easy to maintain
メンテナンスしやすくしておくべき
✓ To keep developing
継続的な開発するため

Page: 57

Easy to maintain
メンテナンスしやすい状態
✓ Easy to understand each test
各テストを理解しやすい
✓ Easy to run each test
各テストを個別に実行しやすい
✓ Debugging is easy when you can focus on a failed case
失敗したケースに集中できるとデバッグしやすい

Page: 58

Benchmark
ベンチマーク
✓ Important to detect performance regression bugs
性能劣化バグを検出するために重要

Page: 59

benchmark_driver gem
Fully-featured benchmark driver for Ruby 3x3
Ruby 3x3のために必要な機能が揃っているベンチマークツール

Page: 60

benchmark_driver gem in csv
✓ YAML input is easy to use
YAMLによる入力が便利
✓ Can compare multiple gem versions
複数のgemのバージョンで比較可能
✓ To detect performance regression
性能劣化を検出するため

Page: 61

Benchmark for each gem version
gemのバージョン毎のベンチマーク

Page: 62

csv/benchmark/
✓ convert_nil.yaml
✓ parse{,_liberal_parsing}.yaml
✓ parse_{quote_char_nil,strip}.yaml
✓ read.yaml, shift.yaml, write.yaml

Page: 63

Benchmark as a starting point
出発点としてのベンチマーク
✓ Join csv development!
csvの開発に参加しよう!
✓ Adding a new benchmark is a good start
ベンチマークの追加から始めるのはどう?
✓ We'll focus on improving performance for benchmark cases
ベンチマークが整備されているケースの性能改善に注力するよ

Page: 64

How to use improved csv?
改良されたcsvを使う方法
gem install csv

Page: 65

csv in Ruby 2.5
Ruby 2.5のcsv
Default gemified
デフォルトgem化

Page: 66

Default gem
デフォルトgem
✓ Can use it just by require
requireするだけで使える
✓ Can use it without an entry in Gemfile
(But then you get the version bundled with your Ruby)
Gemfileに書かなくても使えるけど古い
✓ Can upgrade it by gem
gemでアップグレードできる

Page: 67

How to use improved csv?
改良されたcsvを使う方法
gem install csv
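After upgrading, it can be worth confirming which csv version is actually loaded. A minimal check, assuming only that the csv library defines `CSV::VERSION`; in a Bundler project you would also add `gem "csv"` to the Gemfile so the upgraded gem is used instead of the bundled one.

```ruby
require "csv"

# CSV::VERSION reports the version of the csv library that was loaded.
loaded_version = CSV::VERSION
puts "csv #{loaded_version}"
```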

Page: 68

Future
今後の話
Faster
さらに速く

Page: 69

Improve String#split
String#splitを改良
Accept " " as normal separator
" "をただの区切り文字として扱う

Page: 70

split(" ") works like awk
split(" ")はawkのように動く
" a  b c".split(" ", -1)
# => ["a", "b", "c"]
" a  b c".split(/ /, -1)
# => ["", "a", "", "b", "c"]

Page: 71

String#split in csv
csvでのString#split
if @column_separator == " "
  line.split(/ /, -1)
else
  line.split(@column_separator, -1)
end

Page: 72

split(string) vs. split(regexp)
regexp     string      Faster?
344448i/s  3161117i/s  9.18x
See also [Feature #15771]

Page: 73

Improve StringScanner#scan
StringScanner#scanを改良
Accept String as pattern
Stringもパターンとして使えるようにする

Page: 74

scan(string) vs. scan(regexp)
regexp       string       Faster?
14712660i/s  18421631i/s  1.25x
See also ruby/strscan#4
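A short sketch of what a String pattern looks like in use. It assumes a strscan version that already includes the change proposed in ruby/strscan#4; on older versions the String call raises an error instead.

```ruby
require "strscan"

scanner = StringScanner.new("a,b")

first  = scanner.scan(/[^,]+/) # Regexp pattern
comma  = scanner.scan(",")     # String pattern (assumes ruby/strscan#4 is available)
second = scanner.scan(/[^,]+/)

p [first, comma, second]
```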

Page: 75

Faster KEN_ALL.CSV parsing (1)
より速いKEN_ALL.CSVのパース(1)
            Elapsed
csv         0.791s
FastestCSV  0.141s

Page: 76

Faster KEN_ALL.CSV parsing (2)
より速いKEN_ALL.CSVのパース(2)
            Encoding  Elapsed
csv         CP932     0.791s
FastestCSV  CP932     0.141s
csv         UTF-8     1.345s
FastestCSV  UTF-8     0.713s

Page: 77

Faster KEN_ALL.CSV parsing (3)
より速いKEN_ALL.CSVのパース(3)
              Encoding  Elapsed
FastestCSV    UTF-8     0.713s
Python        UTF-8     0.208s
Apache Arrow  UTF-8     0.145s

Page: 78

Further work
今後の改善案
✓ Improve transcoding performance of Ruby
Rubyのエンコーディング変換処理の高速化
✓ Improve simple case parse performance by implementing parser in C
シンプルなケース用のパーサーをCで実装して高速化
✓ Improve perf. of REXML as well as csv
csvのようにREXMLも高速化

Page: 79

Join us!
一緒に開発しようぜ!
✓ Red Data Tools:
https://red-data-tools.github.io/
✓ RubyData Workshop: Today 14:20-15:30
✓ Code Party: Today 19:00-21:00
✓ After Hack: Sun. 10:30-17:30

Page: 80

Join us!!
一緒に開発しようぜ!!
✓ OSS Gate: https://oss-gate.github.io/
✓ provides a "gate" to join OSS development
OSSの開発に参加する「入り口」を提供する取り組み
✓ Both ClearCode and Speee are sponsors
クリアコードもSpeeeもスポンサー
✓ OSS Gate Fukuoka:
https://oss-gate-fukuoka.connpass.com/
