Commit Graph

301 Commits

Author SHA1 Message Date
Alexei Yatskov
f4da17e228
Merge pull request #41 from stephenmk/master
New version of JMnedict (the proper name dictionary)
2023-02-05 09:57:17 -08:00
stephenmk
ecf22da5a3
Improve readability of publication date functions 2023-02-04 01:42:08 -06:00
stephenmk
a9d85dc720
Simplify string -> runes conversion 2023-02-03 22:07:41 -06:00
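For reference, a string can be turned into runes in Go with a direct conversion. The sketch below is only an illustration; the helper name `toRunes` is hypothetical, and whether the commit uses exactly this form is an assumption.

```go
package main

import "fmt"

// toRunes is a hypothetical helper: it converts a UTF-8 string into
// a slice of Unicode code points using Go's built-in conversion.
func toRunes(s string) []rune {
	return []rune(s)
}

func main() {
	word := "川・かわ"
	// len(word) counts bytes; len(toRunes(word)) counts code points.
	fmt.Println(len(word), len(toRunes(word)))
}
```

Working on runes rather than bytes matters for Japanese text, where each character occupies several bytes in UTF-8.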
stephenmk
70611a51c4
Fix typo 2023-02-03 15:51:52 -06:00
stephenmk
dffbec6337
Designate more JMnedict category tags 2023-02-02 20:15:28 -06:00
stephenmk
5755b79341
Use cached part-of-speech values 2023-02-02 15:50:57 -06:00
stephenmk
7bff70b71c
JMdict: Ensure part-of-speech info is added in non-English versions
Only English-language senses in JMdict contain part-of-speech tags.
This info is displayed to users in definition tags and also used
for deinflecting verbs and adjectives during term lookups.

The old version of Yomichan-Import took the PoS tags from the final
sense in the English version of an entry and applied them to every
sense of every other language. For example, 川・かわ has two senses in
English JMdict: a noun sense and a suffix sense. Therefore every sense
of 川・かわ in every other language was tagged as a suffix.

Instead, I suggest gathering all distinct PoS tags from each English
entry and applying them all to each non-English sense. Every
non-English sense of 川・かわ will therefore be tagged as both a noun
and suffix.
2023-02-02 10:44:16 -06:00
stephenmk
19d6d0bb43
Rename some jmdict functions 2023-02-01 19:14:37 -06:00
stephenmk
3b420f8b6c
Use library implementation of Contains function 2023-02-01 18:57:35 -06:00
stephenmk
8281301869
New JMnedict version 2023-02-01 18:55:03 -06:00
stephenmk
b826dbf264
Add verification logic for date entry in JMdict
Very old versions of JMdict and unofficial versions are unlikely to
have the publication date entry at the end of the file.
2023-01-30 13:26:26 -06:00
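A check of the kind described in commit b826dbf264 might look something like the sketch below. It assumes the publication date appears as a `YYYY-MM-DD` string somewhere in the text of the file's final entry; that format, the sample text, and the function name `publicationDate` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// datePattern matches an ISO-style date such as 2023-01-29.
var datePattern = regexp.MustCompile(`\d{4}-\d{2}-\d{2}`)

// publicationDate is a hypothetical helper. It looks for a date in the
// text of the last entry and reports whether a valid one was found, so
// callers can fall back to a default for old or unofficial files.
func publicationDate(lastEntryText string) (string, bool) {
	match := datePattern.FindString(lastEntryText)
	if match == "" {
		return "", false
	}
	// Reject strings that look like dates but are not valid ones.
	if _, err := time.Parse("2006-01-02", match); err != nil {
		return "", false
	}
	return match, true
}

func main() {
	// The sample text is invented for this example.
	date, ok := publicationDate("Creation Date: 2023-01-29")
	fmt.Println(date, ok)
}
```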
Alexei Yatskov
74de4ce9e5
Merge pull request #40 from stephenmk/master
New version of JMdict for Yomichan
2023-01-29 22:30:04 -08:00
stephenmk
0b328e1e07
Add support for undocumented frequency and information tags
Custom dictionary files using the JMdict XML format may contain
nonstandard frequency and information tags.
2023-01-29 22:34:13 -06:00
stephenmk
aab031972c
Simplify declaration of constants 2023-01-29 20:06:46 -06:00
stephenmk
8b4b899959
Hide new JMdict structured content features behind "extra" option
Require `-language=english_extra` to produce the complete version of
the new JMdict dictionary file.

If and when we determine that all the new features are ready to be
included in the dictionary by default, we can remove this logic.
2023-01-29 14:06:50 -06:00
stephenmk
abbe183145
Simplify logic for index.json struct 2023-01-28 18:39:08 -06:00
stephenmk
184dd45dbc
Use snake_case in filenames 2023-01-28 18:17:06 -06:00
stephenmk
517ef3d052
Fix bug in term score assignments
This commit ensures that terms are grouped among their entries of
origin and displayed in correct sequential order in Yomichan's default
result grouping mode, "Group term-reading pairs."
2023-01-27 19:09:12 -06:00
stephenmk
7bd967915c
Add "forms" term in special circumstances
If a headword appears in multiple entries, then each entry needs a
corresponding "forms" term in the output dictionary.

For example, 軽卒 is the only headword in entry 2275730, but 軽卒 also
appears as an irregular form in entry 1252910. If a "forms" term is
not included for the former entry, then it will appear that 軽卒 is
irregular for all senses in the output dictionary.
2023-01-25 18:26:47 -06:00
stephenmk
406067eedd
Include entity tags in standalone forms dictionary 2023-01-24 13:02:50 -06:00
stephenmk
96358e3eb5
Fix function parameter
Sense numbers start at 1, not 0
2023-01-24 08:55:24 -06:00
stephenmk
ef1e74447d
Include term tags and scores in standalone forms dictionary 2023-01-23 23:52:42 -06:00
stephenmk
d606f729cf
Use secondary frequency tags in term score calculation
If a term has a frequency tag, it should return higher in search
results than a match which does not have a tag.

For example, a search for 素性 should return すじょう rather than
そせい, because the former has a "news" frequency tag.
2023-01-23 14:13:22 -06:00
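The ranking rule from commit d606f729cf can be illustrated with a small sketch. The `term` struct, the one-point bonus, and the function names are hypothetical; the project's actual score calculation is likely more involved.

```go
package main

import (
	"fmt"
	"sort"
)

// term is a hypothetical, simplified search result.
type term struct {
	Reading  string
	FreqTags []string // frequency tags attached to this reading, if any
}

// score gives a term with at least one frequency tag a higher score
// than a term with none. The value 1 is an arbitrary choice.
func score(t term) int {
	if len(t.FreqTags) > 0 {
		return 1
	}
	return 0
}

// rank orders terms by descending score, keeping the original order
// among terms that score the same.
func rank(terms []term) {
	sort.SliceStable(terms, func(i, j int) bool {
		return score(terms[i]) > score(terms[j])
	})
}

func main() {
	// Two readings of 素性; the tag value "news" is a placeholder.
	results := []term{
		{Reading: "そせい"},
		{Reading: "すじょう", FreqTags: []string{"news"}},
	}
	rank(results)
	fmt.Println(results[0].Reading)
}
```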
stephenmk
6726c5245b
Rename variables for consistency 2023-01-23 14:09:50 -06:00
stephenmk
d8a3b420ee
Exclude "search" and "forms" terms from non-English dictionaries
This allows a user to install the English version and another version
without cluttering their setup with duplicated information.

If a user doesn't want to use the English version, they can get the
"search" and "forms" terms by installing the separate jmdict_forms
file.
2023-01-22 17:55:27 -06:00
stephenmk
8451803bfd
Update copyright 2023-01-22 15:00:13 -06:00
stephenmk
972dc6c4e9
Update dictionary build script 2023-01-22 14:40:39 -06:00
stephenmk
abc28bb19d
Add new JMdict version 2023-01-22 14:37:18 -06:00
stephenmk
73fb992865
Add intersection and union functions for string arrays 2023-01-22 14:32:45 -06:00
stephenmk
56f9895967
Add struct for handling index.json data 2023-01-22 14:27:02 -06:00
stephenmk
853d0b33dc
Use empty interface type for dictionary glossaries
Necessary for structured content support
2023-01-22 14:14:33 -06:00
Alexei Yatskov
9222417bfd
Merge pull request #37 from toasted-nutbread/update-vs-rules
Update how suru verb rules are detected
2022-08-20 11:52:32 -07:00
toasted-nutbread
77d5d2debd Update how suru verb rules are detected 2022-08-14 15:35:20 -04:00
2168659243 Fix import path 2022-08-07 09:38:50 -07:00
Alexei Yatskov
b5d6095c06
Merge pull request #36 from 0x766F6964/update_daijisen
Update daijisen
2022-08-01 19:34:03 -07:00
Randy Palamar
5b8481e5bf remove duplicate newlines in definitions
this prevents entries from having empty lines, which are particularly
annoying when using the popup dictionary in yomichan
2022-07-28 20:38:02 -06:00
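One way to collapse repeated newlines in a definition is a single regular expression replacement, sketched below. The function name `collapseNewlines` is hypothetical, and the commit may use a different method.

```go
package main

import (
	"fmt"
	"regexp"
)

// blankLines matches any run of two or more consecutive newlines.
var blankLines = regexp.MustCompile(`\n{2,}`)

// collapseNewlines replaces each run of consecutive newlines with a
// single newline, so the definition contains no empty lines.
func collapseNewlines(definition string) string {
	return blankLines.ReplaceAllString(definition, "\n")
}

func main() {
	fmt.Printf("%q\n", collapseNewlines("first\n\n\nsecond\nthird"))
}
```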
Randy Palamar
94326126d3 update the daijisen regexps
this also fixes #5

the method used is a bit hacky but it works
2022-07-28 20:27:29 -06:00
Randy Palamar
8bc7ffdb36 add newlines to characters indicating sub-definitions
this will cause some things to be displayed incorrectly but overall
makes daijisen much more readable.
2022-07-28 20:25:35 -06:00
Randy Palamar
65df67b085 map most of daijisen
the remaining glyphs don't exist in unicode, usually because they are
normally displayed using HTML or MathJax type things
2022-07-28 20:20:48 -06:00
Alexei Yatskov
57280ea5fd
Merge pull request #35 from univerio/shougakukan2
Add support for 小学館 中日・日中 統合辞書 第2版 EPWING
2022-07-14 21:18:53 -07:00
75207654d9 Update README 2022-07-14 14:24:32 -07:00
1fdf4f2998 Switch to foosoft.net for packages 2022-07-03 20:59:33 -07:00
Jack Zhou
c918a6bb5d Implement shougakukan2 2022-05-16 21:39:11 -07:00
Alex Yatskov
a4af996222
Merge pull request #31 from 0x766F6964/add_font_mappings
finish mapping most of daijirin
2022-02-05 18:23:22 -08:00
d61c1e0df6 Readme consistency 2022-02-05 18:22:07 -08:00
6b3aaf3886 Update readme 2022-02-05 18:20:31 -08:00
e16da37017 Update README 2021-12-15 18:06:35 -08:00
e9849380ea Add links 2021-12-14 20:32:29 -08:00
fc7fd48748 Add site metadata 2021-12-14 20:27:16 -08:00
Randy Palamar
6224b4c21f finish mapping most of daijirin
Now you can search for totally useful every day words like 瘟㾮日
and 多羅吒干𤚥 :^).

The characters that remain either don't exist in unicode or are very
difficult to find. Also a couple terms seem unsearchable in qolibri so
I couldn't check what the characters are supposed to be.

Any questionable choice was marked with FIXME. This will make it easy to
replace some characters with their images if it's something
that we want to support in the future.

* The FIXMEs with the missing font symbol should all be the correct
  character (not commonly covered by fonts)

* The くの字点 choices are to try and imitate the daijirin
  experience(TM). Probably the worst use of image fonts I've seen. Those
  characters should never appear in horizontal text. They should have
  just been replaced with the text that was supposed to be repeated.

* The 漢文訓読 characters in '{}' are technically the unicode-specified
  characters for those glyphs; however, they just look like their
  full-size variants. I surrounded them with '{}' so the examples that
  use them are still readable.

* The other FIXMEs should be self explanatory. Search the term in qolibri
  and look at what they used to see why they are questionable.
2021-06-17 07:56:14 -06:00