Page: 1
Improvement of REXML and
speed up using
StringScanner
NAITOH Jun
MedPeer, Inc.
RubyKaigi 2025
2025-04-17
Improvement of REXML and speed up using StringScanner
Powered by Rabbit 3.0.5
Page: 2
NAITOH Jun
✓ @naitoh (GitHub and X (Twitter))
✓ A new maintainer of the rexml library
✓ Red Data Tools project member
✓ redmine.tokyo staff
✓ Working at MedPeer, Inc.
Page: 3
MedPeer
Page: 4
https://tech.medpeer.co.jp/
https://medpeer.co.jp/recruit/
Page: 5
rexml in Ruby 3.4 (1)
✓ Faster XML parsing
✓ rexml 3.4.0 (bundled with Ruby 3.4.0) is up to 50% faster than rexml 3.2.6 (bundled with Ruby 3.3.0).
Page: 6
Benchmark target file
<?xml version="1.0"?>
<root>
<child id0="0" id1="0" />
:
<child id0="4999" id1="4999" />
</root>
The benchmark parses this XML, which has 5000 child nodes.
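The benchmark input can be reproduced with a few lines of Ruby. This is a minimal sketch, not the script used for the talk: the helper name `build_benchmark_xml` is hypothetical, and only the DOM parser is exercised here.

```ruby
require 'rexml/document'

# Hypothetical helper: builds a <root> element with `count` <child>
# elements, each carrying id0/id1 attributes as on the slide.
def build_benchmark_xml(count = 5000)
  lines = ['<?xml version="1.0"?>', '<root>']
  count.times { |i| lines << %(<child id0="#{i}" id1="#{i}" />) }
  lines << '</root>'
  lines.join("\n")
end

xml = build_benchmark_xml
doc = REXML::Document.new(xml)   # DOM parse of the whole document
child_count = doc.root.elements.size
puts child_count                 # => 5000
```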
Page: 7
Ruby 3.4, YJIT disabled
REXML Performance benchmarks
[Chart: i/s by rexml version (3.2.6, 3.4.0) for the pull, stream, sax2 and dom parsers]

parser   rexml 3.2.6   rexml 3.4.0   Faster?
dom      18.19 i/s     18.84 i/s     1.03x
sax2     25.82 i/s     28.59 i/s     1.11x
pull     29.94 i/s     33.08 i/s     1.10x
stream   28.11 i/s     32.62 i/s     1.16x
Page: 8
Ruby 3.4, YJIT enabled
REXML Performance benchmarks
[Chart: i/s by rexml version (3.2.6, 3.4.0) for the pull, stream, sax2 and dom parsers, all with YJIT]

parser   rexml 3.2.6   rexml 3.4.0   Faster?
dom      25.90 i/s     31.58 i/s     1.22x
sax2     36.57 i/s     50.67 i/s     1.39x
pull     42.71 i/s     59.88 i/s     1.40x
stream   39.02 i/s     57.25 i/s     1.47x
Page: 9
rexml in Ruby 3.4 (2)
✓ Faster XML parsing
✓ rexml 3.4.1 is up to 60% faster than rexml 3.2.6 (bundled with Ruby 3.3.0).
Page: 10
Ruby 3.4, YJIT disabled
REXML Performance benchmarks
[Chart: i/s by rexml version (3.2.6, 3.4.0, 3.4.1) for the pull, stream, sax2 and dom parsers]

parser   rexml 3.2.6   rexml 3.4.0   rexml 3.4.1   Faster? (3.4.1 vs 3.2.6)
dom      18.19 i/s     18.84 i/s     19.47 i/s     1.07x
sax2     25.82 i/s     28.59 i/s     30.21 i/s     1.16x
pull     29.94 i/s     33.08 i/s     34.37 i/s     1.14x
stream   28.11 i/s     32.62 i/s     34.25 i/s     1.21x
Page: 11
Ruby 3.4, YJIT enabled
REXML Performance benchmarks
[Chart: i/s by rexml version (3.2.6, 3.4.0, 3.4.1) for the pull, stream, sax2 and dom parsers, all with YJIT]

parser   rexml 3.2.6   rexml 3.4.0   rexml 3.4.1   Faster? (3.4.1 vs 3.2.6)
dom      25.90 i/s     31.58 i/s     33.93 i/s     1.31x
sax2     36.57 i/s     50.67 i/s     53.18 i/s     1.45x
pull     42.71 i/s     59.88 i/s     65.89 i/s     1.54x
stream   39.02 i/s     57.25 i/s     61.87 i/s     1.58x
Page: 12
Ruby 3.4 rexml 3.4.1
REXML Performance benchmarks
[Chart: i/s by rexml version (3.2.6, 3.4.0, 3.4.1) for the pull, stream, sax2 and dom parsers, with and without YJIT]
Up to 60% faster.
Page: 13
RubyKaigi 2024 LT
✓ Up to 60% faster than rexml 3.2.6.
Page: 14
The released version, rexml 3.2.7, had become slower.
Page: 15
Ruby 3.4 rexml 3.2.7
REXML Performance benchmarks
[Chart: i/s by rexml version (3.2.6, RubyKaigi2024LT, 3.2.7, 3.2.9, 3.4.0, 3.4.1) for the pull, stream, sax2 and dom parsers, with and without YJIT]
It was slow in rexml 3.2.7.
Page: 16
Parsing had slowed down because of the changes made to deal with a CVE.
CVE-2024-35176: DoS vulnerability in REXML
Page: 17
Make REXML Fast Again.
Page: 18
What's REXML?
✓ REXML is a standard XML library for Ruby (a bundled gem), implemented in pure Ruby.
Page: 19
What's XML?
<?xml version="1.0"?>
<svg xmlns="http://www.w3.org/2000/svg">
  <rect x="0" y="0" width="100" height="60" fill="#ddd" />
  <polygon points="50 10, 70 30, 50 50, 30 30" fill="#99f" />
</svg>
✓ Written with tags, like HTML.
✓ A highly extensible data exchange format.
Page: 20
Examples of XML-based formats
✓ SVG, DOCX, XLSX, MathML, PubMed XML, CIM-XML, etc.
✓ Complex expressions are possible.
Page: 21
XML character encoding
✓ Support for UTF-8 and UTF-16 is required in XML.
✓ Other character encodings can be used by declaring them in the XML encoding declaration (optional).
Page: 22
How REXML processes XML (1)
✓ REXML supports UTF-8 and UTF-16.
✓ It reads and parses XML files tag by tag.
Page: 23
How REXML processes XML (2)
✓ Input in a non-UTF-8 encoding is converted to UTF-8 when it is read.
✓ Other encodings can generally be used if Ruby supports them.
Page: 24
About REXML's parsers
✓ REXML has DOM, SAX2, Pull and Stream parsers.
Page: 25
Types of REXML parsers
✓ Tree API
  ✓ DOM (Document Object Model) parser
✓ Streaming API
  ✓ Callback type (passive)
    ✓ SAX2 (Simple API for XML) parser
    ✓ Stream parser
  ✓ Pull type (active)
    ✓ Pull parser
Page: 26
Features of the DOM parser
✓ DOM-style XML parser.
✓ Treats XML as a tree structure.
✓ Easy to use, because any node can be accessed.
✓ Keeps the parsing results in memory.
✓ Memory-inefficient and slow for large XML.
Page: 27
Features of the SAX2 parser (1)
✓ Stream-style parser with an API equivalent to SAX2.
✓ Processes the file sequentially from the beginning, line by line.
✓ You register event listeners, and callbacks run event by event.
Page: 28
Features of the SAX2 parser (2)
✓ Does not keep the parsing results in memory.
✓ Memory-efficient, and fast even with large XML.
✓ Harder to use, because arbitrary nodes cannot be accessed.
Page: 29
Features of the Stream parser
✓ Basically the same as the SAX2 parser.
✓ Lightweight and fast, because its functionality is limited.
Page: 30
Features of the Pull parser
✓ Stream-style parser that processes one line at a time.
✓ With the Pull parser, the caller reads and processes each event itself.
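To make the four parser styles concrete, here is a small sketch that counts `<child>` elements with each of them. The sample document, the `ChildCounter` class and the `count_with_*` method names are assumptions made for illustration; the REXML classes themselves are the library's public parsers.

```ruby
require 'rexml/document'
require 'rexml/parsers/sax2parser'
require 'rexml/parsers/pullparser'
require 'rexml/streamlistener'

SAMPLE_XML = '<root><child id0="0"/><child id0="1"/></root>'

# Tree API (DOM): build the whole tree, then count root's child elements.
def count_with_dom(xml)
  REXML::Document.new(xml).root.elements.size
end

# Streaming API, callback type: SAX2 calls the block for every start tag.
def count_with_sax2(xml)
  count = 0
  parser = REXML::Parsers::SAX2Parser.new(xml)
  parser.listen(:start_element) do |_uri, localname, _qname, _attributes|
    count += 1 if localname == 'child'
  end
  parser.parse
  count
end

# Streaming API, callback type: the Stream parser calls listener methods.
class ChildCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, _attributes)
    @count += 1 if name == 'child'
  end
end

def count_with_stream(xml)
  listener = ChildCounter.new
  REXML::Document.parse_stream(xml, listener)
  listener.count
end

# Streaming API, pull type: the caller asks for the next event itself.
def count_with_pull(xml)
  count = 0
  parser = REXML::Parsers::PullParser.new(xml)
  while parser.has_next?
    event = parser.pull
    count += 1 if event.start_element? && event[0] == 'child'
  end
  count
end

puts count_with_dom(SAMPLE_XML)
puts count_with_sax2(SAMPLE_XML)
puts count_with_stream(SAMPLE_XML)
puts count_with_pull(SAMPLE_XML)
```

All four calls should report the same count. Only the DOM version keeps the parsed tree in memory.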
Page: 31
Motivation (1)
✓ I am the author of the RBPDF gem, a PDF export library.
✓ RBPDF is used by Redmine's PDF export feature.
✓ I would like RBPDF to support PDF output of SVG images using REXML, which is easy to install.
Page: 32
Motivation (2)
✓ REXML is slower than C extension gems.
✓ I would like to improve REXML's performance.
Page: 33
Improvements
✓ Up to 50% faster between rexml 3.2.6 (bundled with Ruby 3.3.0) and rexml 3.4.0 (bundled with Ruby 3.4.0).
✓ Up to 60% faster between rexml 3.2.6 and rexml 3.4.1.
Page: 34
How?
"Improving CSV Processing with Ruby 2.6" at RubyKaigi 2019
https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2019/
Page: 35
What is StringScanner? (1)
✓ StringScanner is a sequential string scanner.
  ✓ Sequential String Scanning
  ✓ Regular Expression Matching
  ✓ State Management
Page: 36
StringScanner Features (1)
✓ Sequential String Scanning
  ✓ It processes a string from the beginning, moving forward as it matches patterns.
✓ Regular Expression Matching
✓ State Management
Page: 37
StringScanner Features (2)
✓ Sequential String Scanning
✓ Regular Expression Matching
  ✓ Use scan(), scan_until(), etc. to look for specific patterns.
✓ State Management
Page: 38
StringScanner Features (3)
✓ Sequential String Scanning
✓ Regular Expression Matching
✓ State Management
  ✓ pos (the current position) and rest (the rest of the string) show how far you have parsed.
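The three features can be seen together in a short session. This is an illustrative sketch only; the sample string and variable names are chosen for the example.

```ruby
require 'strscan'

s = StringScanner.new('test string')

word = s.scan(/\w+/)          # matches at the pointer and advances it
pos_after_word = s.pos        # byte offset of the pointer
rest_after_word = s.rest      # the part that has not been scanned yet

miss = s.scan(/\w+/)          # the next character is a space: no match,
pos_after_miss = s.pos        # and the pointer stays where it was

upto = s.scan_until(/str/)    # searches forward, returns text up to the match
finished = s.eos?             # false while "ing" is still left

p word, pos_after_word, rest_after_word   # "test", 4, " string"
p miss, pos_after_miss                    # nil, 4
p upto, s.pos, s.rest, finished           # " str", 8, "ing", false
```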
Page: 39
What is StringScanner? (2)
✓ Available since Ruby 1.8, and a default gem since Ruby 2.5.
✓ Uses Ruby's regular expression engine (Onigmo).
✓ The ReDoS countermeasures implemented in Ruby 3.2 also apply.
Page: 40
StringScanner is fast. (1)
/\A\w+/.match('test string') #=> #<MatchData "test">
s = StringScanner.new('test string')
s.check(/\w+/) #=> "test"
s.check('test') #=> "test"

Method                        Speed        vs Regexp#match
Regexp#match                  5.675M i/s
StringScanner#check(Regexp)   8.800M i/s   1.55x faster
StringScanner#check(String)   10.27M i/s   1.81x faster
Page: 41
StringScanner is fast. (2)

Method                        Speed        vs Regexp#match
Regexp#match                  5.675M i/s
StringScanner#check(Regexp)   8.800M i/s   1.55x faster
StringScanner#check(String)   10.27M i/s   1.81x faster

✓ StringScanner#check(String) is fast because it uses memcmp() for comparison.
Page: 42
StringScanner is fast. (3)

Method                        Speed        vs Regexp#match
Regexp#match                  5.675M i/s
StringScanner#check(Regexp)   8.800M i/s   1.55x faster

✓ Regexp#match creates and returns a MatchData object.
✓ StringScanner#check copies the matched string and returns it.
Page: 43
Without match object generation, performance is equivalent.

Method                 Speed
Regexp#match?          14.71M i/s
StringScanner#match?   14.28M i/s (1.03x slower)

✓ Regexp#match? returns a boolean and StringScanner#match? returns the match length (or nil); neither creates a match object.
✓ Performance is equivalent, since both only run the regular expression.
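The difference between these calls is what they hand back. The sketch below only inspects the return values; it measures nothing, and the variable names are chosen for the example.

```ruby
require 'strscan'

text = 'test string'
s = StringScanner.new(text)

md      = /\A\w+/.match(text)   # allocates a MatchData object
checked = s.check(/\w+/)        # allocates a String copy of the match
flag    = /\A\w+/.match?(text)  # true or false, no match object
length  = s.match?(/\w+/)       # Integer match length (nil on failure)

p md.class, md[0]               # MatchData, "test"
p checked, flag, length         # "test", true, 4
p s.pos                         # 0: check and match? do not move the pointer
```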
Page: 44
rexml in Ruby 3.4 (2)
Use StringScanner#scan instead of Regexp#match to parse XML.
Page: 45
Example of Regexp class use
✓ Try each regular expression in turn until one matches.
word = (/\A<[^>]*>/um).match('<!-- foo -->') # Get Tag
case word[0]
when /\A<\?xml\s/u # <= Not Match
when /\A<!--/u
  /<!--(.*?)-->/um.match(word[0])[1] #=> " foo "
when /\A<!DOCTYPE\s/um
end
Page: 46
Example of Regexp class use
✓ Try each regular expression in turn until one matches.
word = (/\A<[^>]*>/um).match('<!-- foo -->') # Get Tag
case word[0]
when /\A<\?xml\s/u
when /\A<!--/u # <= Match
  /<!--(.*?)-->/um.match(word[0])[1] #=> " foo "
when /\A<!DOCTYPE\s/um
end
Page: 47
Example of StringScanner use (1)
✓ Parses the string sequentially from the beginning.
✓ The process is easy to follow.
Page: 48
Example of StringScanner use (1)
s = StringScanner.new('<!-- foo -->')
s.pos = 0 # for Benchmark
if s.scan("<?") # <= Not Match
elsif s.scan("<!")
  if s.scan("--")
    s.scan(/(.*?)-->/um) and s[1] #=> " foo "
  elsif s.scan("DOCTYPE")
  end
end
[Diagram: the characters of '<!-- foo -->'; the scan pointer is still at the start]
Page: 49
Example of StringScanner use (1)
s = StringScanner.new('<!-- foo -->')
s.pos = 0 # for Benchmark
if s.scan("<?")
elsif s.scan("<!") # <= Match
  if s.scan("--")
    s.scan(/(.*?)-->/um) and s[1] #=> " foo "
  elsif s.scan("DOCTYPE")
  end
end
[Diagram: the characters of '<!-- foo -->'; the scan pointer has advanced past "<!"]
Page: 50
Example of StringScanner use (1)
s = StringScanner.new('<!-- foo -->')
s.pos = 0 # for Benchmark
if s.scan("<?")
elsif s.scan("<!")
  if s.scan("--") # <= Match
    s.scan(/(.*?)-->/um) and s[1] #=> " foo "
  elsif s.scan("DOCTYPE")
  end
end
[Diagram: the characters of '<!-- foo -->'; the scan pointer has advanced past "<!--"]
Page: 51
Example of StringScanner use (1)
s = StringScanner.new('<!-- foo -->')
s.pos = 0 # for Benchmark
if s.scan("<?")
elsif s.scan("<!")
  if s.scan("--")
    s.scan(/(.*?)-->/um) and s[1] #=> " foo "
  elsif s.scan("DOCTYPE")
  end
end
[Diagram: the characters of '<!-- foo -->'; the scan pointer has advanced past " foo -->" to the end]
Page: 52
Benchmark result

Method               Speed
Regexp#match         1.210M i/s
StringScanner#scan   2.225M i/s (1.84x faster)

✓ Using StringScanner is 1.8 times faster.
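A comparison like this can be approximated with Ruby's standard Benchmark module. The harness below is a rough sketch, not the one used for the numbers above: the method names and the iteration count are assumptions, and the printed timings will differ by machine and Ruby version.

```ruby
require 'benchmark'
require 'strscan'

COMMENT = '<!-- foo -->'

# Regexp version: grab the tag first, then test patterns one by one.
def comment_with_regexp(xml)
  tag = /\A<[^>]*>/um.match(xml)
  return nil unless tag

  case tag[0]
  when /\A<\?xml\s/u then nil
  when /\A<!--/u then /<!--(.*?)-->/um.match(tag[0])[1]
  end
end

# StringScanner version: consume the input from the front.
def comment_with_scanner(scanner)
  scanner.pos = 0 # rewind so the same scanner can be reused
  if scanner.scan('<?')
    nil
  elsif scanner.scan('<!')
    scanner[1] if scanner.scan('--') && scanner.scan(/(.*?)-->/um)
  end
end

scanner = StringScanner.new(COMMENT)
iterations = 100_000
Benchmark.bm(20) do |x|
  x.report('Regexp#match') { iterations.times { comment_with_regexp(COMMENT) } }
  x.report('StringScanner#scan') { iterations.times { comment_with_scanner(scanner) } }
end
```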
Page: 53
rexml in Ruby 3.4 (3)
Use StringScanner#scan_until instead of Regexp#match to parse XML.
s = StringScanner.new('test string')
s.scan_until(/str/) # => "test str"
Page: 54
Regexp Class
xml = "<![CDATA[#{'a'*n}]]>"
md = xml.match(/\A<!\[CDATA\[(.*?)\]\]>/um)
md[1]
=> "aaa.."

Method         n=10        n=100
Regexp#match   2856k i/s   976k i/s
Page: 55
StringScanner#check(Regexp)
s = StringScanner.new("<![CDATA[#{'a'*n}]]>")
s.check(/<!\[CDATA\[(.*?)\]\]>/um) and s[1]
=> "aaa.."

Method           n=10                       n=100
Regexp#match     2856k i/s                  976k i/s
#scan (Regexp)   4146k i/s (1.45x faster)   1137k i/s (1.16x faster)

✓ Using StringScanner is up to 1.45 times faster.
Page: 56
StringScanner#scan_until(String)
s = StringScanner.new("<![CDATA[#{'a'*n}]]>")
s.pos = 0 and s.skip("<!") and s.skip("[CDATA[") \
  and s.scan_until("]]>").chomp!("]]>") #=> "aaa.."

Method                 n=10                       n=100
Regexp#match           2856k i/s                  976k i/s
#scan (Regexp)         4146k i/s (1.45x faster)   1137k i/s (1.16x faster)
#scan_until (String)   4513k i/s (1.58x faster)   3574k i/s (3.66x faster)

✓ Using scan_until(String) is up to 3.6 times faster, and it degrades less as the input grows.
Page: 57
rexml in Ruby 3.4 (3)
✓ Using string matching (instead of regular expressions) is faster.
✓ Even when the matched strings are long, performance degradation is minimized.
Page: 58
StringScanner#scan_until(String)
✓ strscan#106: string matching is supported in strscan 3.1.2 (bundled with Ruby 3.4.0).
✓ StringScanner#scan_until(String) is fast because it uses rb_memsearch() to search.
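Because the String form of scan_until needs a recent strscan, code that must also run on older Rubies needs a fallback. The sketch below is one possible shape, assuming that older versions raise TypeError for a String argument; the method name `cdata_text` is hypothetical and this is not REXML's actual implementation.

```ruby
require 'strscan'

# Hypothetical helper: returns the text inside a CDATA section at the
# start of `xml`, or nil if there is no complete CDATA section.
def cdata_text(xml)
  s = StringScanner.new(xml)
  return nil unless s.skip('<![CDATA[')

  matched =
    begin
      s.scan_until(']]>')     # String pattern (needs a recent strscan)
    rescue TypeError
      s.scan_until(/\]\]>/)   # assumed fallback: Regexp on older strscan
    end
  matched&.delete_suffix(']]>')
end

p cdata_text('<![CDATA[aaa]]>')   # "aaa"
p cdata_text('<root/>')           # nil
```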
Page: 59
rexml in Ruby 3.4 (4)
Suppress generation of unnecessary String objects.
Page: 60
Generating a String object (copying data) takes time
s = StringScanner.new(' ')
s.check(/\s+/um) #=> " "
s.match?(/\s+/um) #=> 1
✓ The matched string is not needed when skipping whitespace.
Page: 61
Benchmark result
✓ Faster if no match string is needed.

Method (/\s+/um)   Speed
check              6686K i/s
match?             8925K i/s (1.33x faster)
Page: 62
String matching is faster. (1)
s = StringScanner.new('<![CDATA[')
if s.check(/<!/um) .. #=> "<!"
if s.match?(/<!/um) .. #=> 2
if s.check('<!') .. #=> "<!"
if s.match?('<!') .. #=> 2
✓ Branching does not need the matched string.
Page: 63
String matching is faster. (2)
✓ Faster if no match string is needed.
✓ String matching is faster.

Method   /<!/um                      '<!'
check    10525K i/s                  14450K i/s (1.37x faster)
match?   18731K i/s (1.78x faster)   31900K i/s (3.03x faster)
Page: 64
Avoid generating String object (1)
s = StringScanner.new("'test'>")
s.check(/(['"])/) #=> "'"
✓ To read an attribute value, we need to determine whether it is delimited by double or single quotes.
Page: 65
Avoid generating String object (2)
s = StringScanner.new("'test'>")
case s.peek_byte
when 34 then '"' # '"'.ord
when 39 then "'" # "'".ord
end #=> "'"
✓ If the single byte is returned as a number, no String object is created.
Page: 66
Benchmark result
✓ Faster without object generation.

Method           Speed
with object      10385k i/s
without object   32000K i/s (3.08x faster)
Page: 67
StringScanner#peek_byte
✓ A new method added in strscan 3.1.2 (bundled with Ruby 3.4.0).
✓ Following the talk "Fast Route Parsing in Rails" (by Aaron Patterson), REXML was also made faster using peek_byte.
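Since peek_byte only exists in newer strscan releases, a portable version has to check for it. The following is a sketch under that assumption; `next_byte` and `quote_kind` are hypothetical names, and the fallback reads the byte from the underlying string instead.

```ruby
require 'strscan'

DOUBLE_QUOTE = 34 # '"'.ord
SINGLE_QUOTE = 39 # "'".ord

# Hypothetical helper: the byte at the scan pointer as an Integer,
# or nil at the end of the string. No String is allocated either way.
def next_byte(scanner)
  if scanner.respond_to?(:peek_byte)
    scanner.peek_byte
  else
    scanner.string.getbyte(scanner.pos) # fallback for older strscan
  end
end

# Tells which quote character opens the attribute value, if any.
def quote_kind(scanner)
  case next_byte(scanner)
  when DOUBLE_QUOTE then :double
  when SINGLE_QUOTE then :single
  end
end

p quote_kind(StringScanner.new("'test'>"))   # :single
p quote_kind(StringScanner.new('"test">'))   # :double
```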
Page: 68
rexml in Ruby 3.4 (5)
Use constants or memoization for slow processes.
Page: 69
Regexp.escape is slow, so use a constant.
REG = /#{Regexp.escape("'")}/
s = StringScanner.new("test'>")
s.check_until(REG)

Method             Speed
not use Constant   1393K i/s
use Constant       9660K i/s (6.93x faster)
Page: 70
Encoding of the read delimiter
✓ REXML reads XML using ">" as the read delimiter.
✓ In UTF-16, ">" is 2 bytes.
[Diagram: the characters "<", "f", "o", "o", ">"]
Page: 71
Memoize the encoding result of the read delimiter.
✓ Each XML instance has exactly one character encoding.
term = '>'
@term_encord ||= term.encode('UTF-16BE')
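The memoization idea can be shown in isolation. This is a minimal sketch, not REXML's source: the class `DelimiterEncoder` and its method names are hypothetical, and it assumes one encoder object per document, so the encoding never changes.

```ruby
# Hypothetical class: encodes each read delimiter once per document
# and reuses the encoded String on every later read.
class DelimiterEncoder
  def initialize(encoding)
    @encoding = encoding
    @encoded = {}
  end

  def encoded(term)
    @encoded[term] ||= term.encode(@encoding)
  end
end

encoder = DelimiterEncoder.new('UTF-16BE')
first = encoder.encoded('>')
second = encoder.encoded('>')

p first.bytes            # [0, 62]: ">" takes 2 bytes in UTF-16BE
p first.equal?(second)   # true: the same object is reused
```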
Page: 72
Made REXML Fast Again!
REXML Performance benchmarks
[Chart: i/s by rexml version (3.2.6, RubyKaigi2024LT, 3.2.7, 3.2.9, 3.4.0, 3.4.1) for the pull, stream, sax2 and dom parsers, with and without YJIT]
Page: 73
Other improvements
✓ Enhanced invalid XML check.
✓ Unified the processing results across the REXML parsers (DOM/SAX2/Pull/Stream).
Page: 74
Enhanced invalid XML check (1)
<root1></root1>
<root2></root2>
✓ Multiple root tags.
Page: 75
Enhanced invalid XML check (2)
foo
<root></root>
✓ String before the root tag.
✓ rexml#212: If the BOM is broken, it is interpreted as a string, so the error can be detected.
Page: 76
Enhanced invalid XML check (3)
<root></root>
bar
✓ String after the root tag.
Page: 77
Enhanced invalid XML check (4)
429 Too Many Requests
✓ String with no root tag.
✓ If an error string is returned instead of XML, the XML parser can handle the error.
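The four cases above can be tried against an installed REXML with a small script. This is only a sketch: the helper `parse_error` is hypothetical, and which inputs are rejected depends on the rexml version, so no expected output is shown.

```ruby
require 'rexml/document'

# Hypothetical helper: returns the first line of the parse error,
# or nil when REXML accepts the input.
def parse_error(xml)
  REXML::Document.new(xml)
  nil
rescue REXML::ParseException => e
  e.message.lines.first.to_s.strip
end

samples = {
  'multiple root tags'     => "<root1></root1>\n<root2></root2>",
  'string before the root' => "foo\n<root></root>",
  'string after the root'  => "<root></root>\nbar",
  'no root tag'            => '429 Too Many Requests'
}

samples.each do |label, xml|
  puts "#{label}: #{parse_error(xml) || 'accepted'}"
end
```

Rescuing REXML::ParseException in one place lets an application treat a non-XML error response the same way as any other malformed document.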
Page: 78
Unification of processing results for each REXML parser (1)
✓ REXML has DOM, SAX2, Pull and Stream parsers, and their processing procedures differ.
✓ Their parsing results also differed, so this was fixed.
Page: 79
Unification of processing results for each REXML parser (2)
✓ There were also differences in which errors each parser handled.
✓ Some of the security measures were implemented only in the DOM parser.
✓ CVE-2024-41946: DoS vulnerability in REXML. Fixed in REXML 3.3.3.
Page: 80
Summary: Performance (1)
✓ Use StringScanner instead of the Regexp class to parse XML.
✓ Using string matching is faster.
✓ Even when the matched strings are long, performance degradation is minimized.
Page: 81
Summary: Performance (2)
✓ Avoiding the generation of String objects is faster.
✓ Use constants or memoization for slow processes.
✓ After eliminating bottlenecks, YJIT makes it even faster.
Page: 82
Summary: Other improvements
✓ Enhanced invalid XML check.
✓ Unified the processing results across the REXML parsers (DOM/SAX2/Pull/Stream).
Page: 83
Further work (1)
✓ Further optimization with new StringScanner features.
✓ Use StringScanner#scan_until(String) and StringScanner#peek_byte more.
Page: 84
Further work (2)
✓ Use StringScanner's new features in other gems.
✓ The enhanced StringScanner may also make CSV faster.