Only English-language senses in JMdict contain part-of-speech tags.
This info is displayed to users in definition tags and also used
for deinflecting verbs and adjectives during term lookups.
The old version of Yomichan-Import took the PoS tags from the final
sense in the English version of an entry and applied them to every
sense of every other language. For example, 川・かわ has two senses in
English JMdict: a noun sense and a suffix sense. Therefore every sense
of 川・かわ in every other language was tagged only as a suffix.
Instead, I suggest gathering all distinct PoS tags from each English
entry and applying them all to each non-English sense. Every
non-English sense of 川・かわ will therefore be tagged as both a noun
and a suffix.
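
A minimal sketch of the suggested approach, not the actual
Yomichan-Import code. The `sense` type and the function names are
hypothetical, and the language codes "eng" and "ger" are assumed for
illustration. The idea is to collect each distinct PoS tag once, in
order of first appearance, and then copy the list onto every
non-English sense.

```go
package main

import "fmt"

// sense is a hypothetical, simplified stand-in for a JMdict sense.
type sense struct {
	Lang string   // language code, e.g. "eng" or "ger" (assumed values)
	PoS  []string // part-of-speech tags; only English senses carry them
}

// distinctEnglishPoS returns every distinct PoS tag found on the English
// senses of one entry, in order of first appearance.
func distinctEnglishPoS(senses []sense) []string {
	seen := make(map[string]bool)
	var tags []string
	for _, s := range senses {
		if s.Lang != "eng" {
			continue
		}
		for _, tag := range s.PoS {
			if !seen[tag] {
				seen[tag] = true
				tags = append(tags, tag)
			}
		}
	}
	return tags
}

// applyPoS gives every non-English sense its own copy of the collected tags.
func applyPoS(senses []sense) {
	tags := distinctEnglishPoS(senses)
	for i := range senses {
		if senses[i].Lang != "eng" {
			senses[i].PoS = append([]string(nil), tags...)
		}
	}
}

func main() {
	// 川・かわ: a noun sense and a suffix sense in English,
	// plus one non-English sense that has no tags of its own.
	kawa := []sense{
		{Lang: "eng", PoS: []string{"n"}},
		{Lang: "eng", PoS: []string{"suf"}},
		{Lang: "ger"},
	}
	applyPoS(kawa)
	fmt.Println(kawa[2].PoS) // prints [n suf]
}
```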
Require `-language=english_extra` to produce the complete version of
the new JMdict dictionary file.
If and when we determine that all the new features are ready to be
included in the dictionary by default, we can remove this logic.
This commit ensures that terms are grouped among their entries of
origin and displayed in correct sequential order in Yomichan's default
result grouping mode, "Group term-reading pairs."
If a headword appears in multiple entries, then each entry needs a
corresponding "forms" term in the output dictionary.
For example, 軽卒 is the only headword in entry 2275730, but 軽卒 also
appears as an irregular form in entry 1252910. If a "forms" term is
not included for the former entry, then it will appear that 軽卒 is
irregular for all senses in the output dictionary.
If a term has a frequency tag, it should rank higher in search
results than a match that does not have a tag.
For example, a search for 素性 should return すじょう rather than
そせい, because the former has a "news" frequency tag.
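
A rough sketch of this ranking rule, assuming results are ordered by a
per-term integer score. The `term` type, `score`, and `rank` are
hypothetical names for illustration, not the real implementation. A
stable sort keeps the original order among terms with equal scores.

```go
package main

import (
	"fmt"
	"sort"
)

// term is a hypothetical, simplified search result.
type term struct {
	Expression string
	Reading    string
	FreqTags   []string // e.g. "news"; empty when there is no frequency tag
}

// score is an assumed scoring rule: a term with any frequency tag
// outranks a term with none.
func score(t term) int {
	if len(t.FreqTags) > 0 {
		return 1
	}
	return 0
}

// rank sorts terms by descending score, preserving the existing order
// of terms whose scores are equal.
func rank(terms []term) {
	sort.SliceStable(terms, func(i, j int) bool {
		return score(terms[i]) > score(terms[j])
	})
}

func main() {
	results := []term{
		{Expression: "素性", Reading: "そせい"},
		{Expression: "素性", Reading: "すじょう", FreqTags: []string{"news"}},
	}
	rank(results)
	fmt.Println(results[0].Reading) // prints すじょう
}
```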
This allows a user to install the English version and another version
without cluttering their setup with duplicated information.
If a user doesn't want to use the English version, they can get the
"search" and "forms" terms by installing the separate jmdict_forms
file.
Now you can search for totally useful everyday words like 瘟㾮日
and 多羅吒干𤚥 :^).
The characters that remain either don't exist in Unicode or are very
difficult to find. Also, a couple of terms seem unsearchable in qolibri,
so I couldn't check what the characters are supposed to be.
Any questionable choice was marked with FIXME. This will make it easy to
replace some characters with their images later, if that's something
we want to support.
* The FIXMEs that show the missing-glyph symbol should all be the
correct characters; they are just not commonly covered by fonts.
* The くの字点 choices are to try and imitate the daijirin
experience(TM). Probably the worst use of image fonts I've seen. Those
characters should never appear in horizontal text. They should have
just been replaced with the text that was supposed to be repeated.
* The 漢文訓読 characters in '{}' are technically the Unicode-specified
characters for those glyphs; however, they just look like their
full-size variants. I surrounded them with '{}' so the examples that use
them are still readable.
* The other FIXMEs should be self-explanatory. Search for the term in
qolibri and look at what they used to see why they are questionable.