Are interlinear ruby annotation codepoints actually used for their intended purpose anywhere?

> And what are you supposed to do when you encounter one?

Don't display them, or display some symbolic representation. You probably shouldn't make ruby happen here; if your text is intended to be rendered correctly, use a markup language.

Unicode is ultimately a system for describing text. Not all stored text is intended to be rendered. This is why it has lacuna characters and similar things. So when you come across some text using ruby, or some text with an unencodable glyph, what do you do? You use ruby annotations or IDS (ideographic description sequences) respectively. It lets you preserve the nature of the text without losing info.

(Ruby is inside Unicode instead of being completely deferred to markup because it is used often enough in Japanese text, especially whenever an irregular kanji, one outside the "common" list, is used. You're supposed to use markup if you actually want it rendered, but if you just want to store the text of a manuscript you can use ruby annotations.)

Can you give an example of text in the wild that uses interlinear ruby annotation codepoints? Because I searched the Common Crawl for them, and every occurrence of U+FFF9 through U+FFFB seems to have been an accident that has nothing to do with Japanese.

Note that I didn't actually ask you about rendering. I care from the point of view of the base level of natural language processing. Some decisions that have nothing to do with rendering are:

What do you do when you feed text containing ruby characters to a Japanese word segmenter (which is not going to be okay with crazy Unicode control characters, even those intended for Japanese)?

Could they appear in the middle of a phrase you would reasonably search for? Should that phrase then be searchable without the ruby? Should the contents of the ruby also be searchable?

Seeing how ruby codepoints are actually used would help to decide how to process them. But as far as I can tell, they're not actually used (markup is used instead, quite reasonably). So I'm surprised that your answer is a flat "Yes".

> Can you give an example of text in the wild that uses interlinear ruby annotation codepoints?

Sadly, no :( You may have luck scraping Wikibooks or some other source of PDFs or plaintext. In general you won't find interlinear annotations on the web because HTML has a better way of dealing with ruby.