September 13, 2023 - Posts by @uaa@social.mikutter.hachune.net


22:14:50 SASANO Takayoshi @uaa@social.mikutter.hachune.net

It's libkkc that is crashing, though, so it seems fair to say fcitx5 is just getting dragged down with it.

22:11:55 SASANO Takayoshi @uaa@social.mikutter.hachune.net

Hmm, even after narrowing it down to occurrence frequency 1800, fcitx5 still dies with a segmentation fault.

21:59:44 SASANO Takayoshi @uaa@social.mikutter.hachune.net

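This is just a gut feeling with nothing to back it up, but an occurrence frequency of around 1700 to 1800 looks about right. Given the worry that it might stop working because of memory shortage or the like, 1800 is probably the sensible choice.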

21:50:05 SASANO Takayoshi @uaa@social.mikutter.hachune.net

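Minimum occurrence frequency 1700
ngram 1= 55203
ngram 2= 665837
ngram 3= 2289672
Minimum occurrence frequency 1600
ngram 1= 56791
ngram 2= 698848
ngram 3= 2460089

ngram=1 is the word count itself, so I don't want to cut that, but I also can't have the 2-grams and 3-grams built from those words exploding (then again, cutting those lowers the "smartness" too). If even occurrence frequency 1000 is too heavy, then restricting to higher frequencies really does seem to be the only option, but...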

21:46:11 SASANO Takayoshi @uaa@social.mikutter.hachune.net

Minimum occurrence frequency 2000
ngram 1= 51095
ngram 2= 585849
ngram 3= 1977021
Minimum occurrence frequency 1900
ngram 1= 52359
ngram 2= 609753
ngram 3= 2046937
Minimum occurrence frequency 1800
ngram 1= 53745
ngram 2= 636476
ngram 3= 2176045
Minimum occurrence frequency 1750
ngram 1= 54440
ngram 2= 650833
ngram 3= 2231599
Minimum occurrence frequency 1500
ngram 1= 58509
ngram 2= 735145
ngram 3= 2619441
Minimum occurrence frequency 1250
ngram 1= 63540
ngram 2= 847794
ngram 3= 3118175
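
The "ngram N= count" lines above have the shape of the header that opens an ARPA language-model file. As a rough sketch only, assuming that header layout, counts like these could be collected from a file as follows. The function name `read_arpa_counts` is made up for illustration and is not part of libkkc or IRSTLM.

```python
import re

# Matches header lines such as "ngram 1= 53745" or "ngram 1=118333".
NGRAM_LINE = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)\s*$")


def read_arpa_counts(lines):
    """Return {order: count} from the header lines of an ARPA file.

    Hypothetical helper: stops at the first "\\N-grams:" section marker
    so the body of the file is never scanned.
    """
    counts = {}
    for line in lines:
        text = line.strip()
        match = NGRAM_LINE.match(text)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif text.endswith("-grams:"):
            break  # header is over, the n-gram entries start here
    return counts


header = ["\\data\\", "ngram 1= 53745", "ngram 2= 636476", "ngram 3= 2176045",
          "", "\\1-grams:"]
counts_1800 = read_arpa_counts(header)
```

With the counts in a dictionary, comparing thresholds (for example the 3-gram to 1-gram ratio) becomes a one-line calculation.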

21:09:00 SASANO Takayoshi @uaa@social.mikutter.hachune.net

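The original data has
ngram 1=118333
ngram 2=775414
ngram 3=1777469
and what I prepared has
ngram 1= 70292
ngram 2= 1009331
ngram 3= 3901618
The word count has gone down, but the number of combinations is abnormally large (it eats so much memory that the dictionary data can't be built unless you are root), so maybe that is what made it scream.
There is no dataset smaller than the one with occurrence frequency 1000 or more, so I may have to tweak the code and restrict (filter) it myself.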
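
A minimal sketch of that kind of filter, under the assumption (not stated in the posts) that the count data is one n-gram per line in the form "words<TAB>count". The function name `filter_by_count` and the input layout are illustrative only.

```python
def filter_by_count(lines, min_count):
    """Keep only n-grams whose occurrence count is at least min_count.

    Assumed input: "w1 w2 ...<TAB>count" per line. Lines that do not
    fit that shape are skipped rather than raising.
    """
    kept = []
    for line in lines:
        ngram, sep, count_text = line.rstrip("\n").rpartition("\t")
        if not sep:
            continue  # no tab, so there is no count field
        try:
            count = int(count_text)
        except ValueError:
            continue  # count field is not an integer
        if count >= min_count:
            kept.append((ngram, count))
    return kept


sample = ["a b\t1800", "c d\t999", "bad line", "e\t2000\n"]
kept_1800 = filter_by_count(sample, 1800)
```

Raising `min_count` shrinks all three n-gram orders at once, which matches the trade-off between memory use and vocabulary size discussed in these posts.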

21:05:48 SASANO Takayoshi @uaa@social.mikutter.hachune.net

It doesn't work! fcitx5 dumps core and dies!

20:42:59 SASANO Takayoshi @uaa@social.mikutter.hachune.net

If the maintainer has sharp instincts, they should notice what I'm trying to do behind that PR, since I went out of my way to name IRSTLM as the tool.

20:41:45 SASANO Takayoshi @uaa@social.mikutter.hachune.net

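Given that it uses (marisa-)trie, the dictionary was always going to be structured that way...

2-gram: must be a pair made of words defined in the 1-grams
3-gram: must be a word pair defined in the 2-grams + a word defined in the 1-grams

That said, N-grams thrown together from whatever language resources are lying around sometimes get cut off by <unk> and the like, so they quite casually fail this condition... I suppose you could say they contain noise, and that noise is what upsets sortlm.py.

That's my understanding.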
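
The condition above can be sketched as a small check. This is only an illustration of the rule as described in the post, not what sortlm.py or the PR actually does; the function name `filter_consistent` and the tuple representation of n-grams are assumptions.

```python
def filter_consistent(unigrams, bigrams, trigrams):
    """Drop 2-grams and 3-grams that break the prefix condition.

    unigrams: iterable of words
    bigrams:  iterable of (w1, w2) tuples
    trigrams: iterable of (w1, w2, w3) tuples
    """
    vocab = set(unigrams)

    # A 2-gram is kept only if both words are defined as 1-grams.
    kept_bigrams = []
    for w1, w2 in bigrams:
        if w1 in vocab and w2 in vocab:
            kept_bigrams.append((w1, w2))

    # A 3-gram is kept only if its first two words form a kept 2-gram
    # and its last word is defined as a 1-gram.
    known_pairs = set(kept_bigrams)
    kept_trigrams = []
    for w1, w2, w3 in trigrams:
        if (w1, w2) in known_pairs and w3 in vocab:
            kept_trigrams.append((w1, w2, w3))

    return kept_bigrams, kept_trigrams


uni = ["a", "b", "c"]
bi = [("a", "b"), ("b", "<unk>"), ("b", "c")]
tri = [("a", "b", "c"), ("b", "<unk>", "a"), ("a", "c", "b")]
ok_bi, ok_tri = filter_consistent(uni, bi, tri)
```

In this toy input the entries involving `<unk>` play the role of the "noise": they are dropped, and so is the 3-gram whose leading pair was never defined as a 2-gram.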

20:39:04 SASANO Takayoshi @uaa@social.mikutter.hachune.net

For now, the sortlm.py fix is done. It can also process arpa files converted with IRSTLM, so what's left is to actually feed this to libkkc and see what happens. https://github.com/ueno/libkkc/pull/46

https://github.com/ueno/libkkc/pull/46

improvement for accept generic arpa language-model data by jg1uaa · Pull Request #46 · ueno/libkkc

07:45:40 SASANO Takayoshi @uaa@social.mikutter.hachune.net

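Oh, right: I was working from this page https://note.nkmk.me/python-list-comprehension/ and got burned by leaving out the part corresponding to the initial squares = [] (from its example), and when append() didn't work I switched to extend() and that was completely wrong (in the end I used append() and either added or removed a pair of ()). So there I was, fuming "I don't get Python syntax 💢💢💢".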

https://note.nkmk.me/python-list-comprehension/

How to use Python list comprehensions | note.nkmk.me
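
For reference, a small sketch of the pitfalls mentioned in this post. The post does not say exactly what the failing code looked like, so the variable names and values below are purely illustrative.

```python
# Building a list with a plain loop: the list must exist before append().
squares = []                    # omitting this line raises NameError on the first append()
for i in range(5):
    squares.append(i ** 2)      # same result as [i ** 2 for i in range(5)]

# append() adds its single argument as ONE element.
with_append = []
with_append.append(("a", "b"))  # the inner () builds a tuple

# extend() iterates over its argument and adds each item separately.
with_extend = []
with_extend.extend(("a", "b"))

# append() takes exactly one argument, so dropping the inner () fails.
try:
    [].append("a", "b")
    append_error = None
except TypeError as exc:
    append_error = type(exc).__name__
```

After this runs, `with_append` holds one tuple while `with_extend` holds two separate strings, which is the usual reason a swap between the two methods silently changes the shape of the data.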

07:43:58 SASANO Takayoshi @uaa@social.mikutter.hachune.net

2023-09-13 07:43:34 Post by redbrick@HyZERO3強制解約済み redbrick@social.mikutter.hachune.net

This account is not set to public on notestock.

07:40:48 SASANO Takayoshi @uaa@social.mikutter.hachune.net

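For now I managed to rewrite the part that was written as a comprehension into a version with no comprehension plus error checking, but making someone who doesn't know Python do that... I don't even know the difference between append() and extend()...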
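
A rough sketch of that kind of rewrite, assuming a hypothetical input of "logprob<TAB>words" lines. This is not the actual sortlm.py code; the function name `parse_rows_checked` and the line layout are invented for illustration.

```python
def parse_rows_checked(lines):
    """Loop-based stand-in for a comprehension, with error checks.

    Without checks this could be written as
        [(float(f[0]), f[1]) for f in (l.split("\t") for l in lines)]
    but that form would raise on the first malformed line.
    """
    rows = []    # must be created explicitly before the loop
    errors = []  # (line number, raw line) for everything rejected
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            errors.append((lineno, line))
            continue
        try:
            logprob = float(fields[0])
        except ValueError:
            errors.append((lineno, line))
            continue
        rows.append((logprob, fields[1]))
    return rows, errors


good_rows, bad_rows = parse_rows_checked(["-1.5\tfoo", "bad", "x\tbar"])
```

The explicit loop is longer, but each rejected line can be recorded or reported individually, which a single comprehension expression does not easily allow.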

07:21:36 SASANO Takayoshi @uaa@social.mikutter.hachune.net

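What is it with Python comprehensions... I get that they let you write things neatly, but when you need to add error checking there's just nothing you can do with them...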
