Regular expressions for Mishnaic tractates
Competing transliteration conventions (or the lack of any) and dialectal differences make it hard to collect every way a Hebrew word might be written in Latin characters. This makes it hard to find Hebrew terms in English sources with a search engine, and hard for a piece of software to work out what someone means when they enter a string of text. For example, each biblical book name can be written in a number of ways, and my BibRef handles this by simply storing a list of alternative names and abbreviations.
Another way of identifying an entered string with one of many options is with regular expressions. As such, I have attempted below to devise regular expressions to match all expected spellings for each tractate (masechet, masekhet, maseches, meseches, etc.) of the Mishnah. Please note that this is only a draft: I expect to improve the regular expressions, and feedback is much appreciated.
Using this as a background study, it may be possible to automate the building of regular expressions for Hebrew words (with vowels given), although many of the expressions below also cover irregularities that would be hard to incorporate into such a builder. Going the other way, one could expand an expression into a list of every alternative spelling of a word, which could then be fed to a search engine to make searches for these Hebrew words comprehensive. (Edit: the expressions below currently overgenerate far too much and would probably be inappropriate for that task.)
For those without the technical background, the following features are used in the regular expressions below:
- [abcd] means any of a, b, c, or d.
- (ab|cd|ef) means any of ab, cd, or ef.
- ab?c means either abc or ac (i.e. b is optional).
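A minimal Python sketch of these three features, using two patterns from the table below. It assumes the whole entered string must match and that case is ignored; both are my assumptions here, and the helper name `matches` is only for illustration.

```python
import re

def matches(pattern, text):
    """True if the whole of `text` matches `pattern`, ignoring case."""
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

# [ao] accepts exactly one of the listed characters.
print(matches(r"N[ao]zir", "Nazir"))   # True
print(matches(r"N[ao]zir", "Nozir"))   # True
print(matches(r"N[ao]zir", "Nezir"))   # False

# (s|t|th) accepts any one of the alternatives.
print(matches(r"Midd?o[iy]?(s|t|th)", "Middoth"))  # True

# d? and [iy]? are optional, so the shorter spelling also matches.
print(matches(r"Midd?o[iy]?(s|t|th)", "Midos"))    # True
```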
For the purposes of this task, the expressions below may overgenerate: they may match strings that nobody would actually write. That is okay as long as they don’t overlap, i.e. no string matches the expressions for two different tractates. I also hope they don’t undergenerate too much.
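The no-overlap requirement can be checked mechanically. The sketch below is only an illustration: it takes a hand-picked subset of the patterns, a made-up list of sample spellings, and hypothetical helper names, and reports any spelling claimed by more than one pattern. Whole-string, case-insensitive matching is again an assumption.

```python
import re

# A small subset of the patterns from the table, keyed by tractate name.
PATTERNS = {
    "Makkot": r"Ma[ck][ck]?o[iy]?(s|t|th)",
    "Middot": r"Midd?o[iy]?(s|t|th)",
    "Niddah": r"Nidd?[ao]h?",
    "Nedarim": r"N[e']?d[ao]rim",
}

def tractates_matching(text, patterns=PATTERNS):
    """Names of all tractates whose pattern matches the whole of `text`."""
    return [name for name, pattern in patterns.items()
            if re.fullmatch(pattern, text, re.IGNORECASE)]

def find_overlaps(samples, patterns=PATTERNS):
    """Map each sample spelling matched by more than one pattern to those names."""
    overlaps = {}
    for spelling in samples:
        hits = tractates_matching(spelling, patterns)
        if len(hits) > 1:
            overlaps[spelling] = hits
    return overlaps

# Illustrative spellings only; a real check would need a much larger sample.
SAMPLES = ["Makos", "Midos", "Nida", "Nedorim"]
print(find_overlaps(SAMPLES))  # {}
```

An empty result only shows that these particular samples are unambiguous; it does not prove that the patterns can never overlap.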
| זרעים | ||
|---|---|---|
| ברכות | Berakhot | B[e']?r[ao][ck]h?o[iy]?(s|t|th) |
| פאה | Pe’ah | Pe’?ah? |
| דמאי | Demai | D[e']?mai |
| כלאיים | Kil’ayim | Ki?l’?[ao]yim |
| שביעית | Shevi’it | (Sh|š|Š)[e']?[bv]i’?i(s|t|th) |
| תרומות | Terumot | T[e']?rumo[iy]?(s|t|th) |
| מעשרות | Ma’aserot | Ma’?asei?ro[iy]?(s|t|th) |
| מעשר שני | Ma’aser Sheni | Ma’?asei?r (Sh|š|Š)(e|ei|ey|ay|ai)ni |
| חלה | Hallah | C?[ḤHh]all?[ao]h? |
| ערלה | Orlah | ‘?Orl[ao]h? |
| ביכורים | Bikkurim | Bi[ck][ck]?urim |
| מועד | ||
| שבת | Shabbat | (Sh|š|Š|S)abb?[ao](s|t|th) |
| עירובין | Eruvin | ‘?(e|ei|ey|ay|ai)ru[vb]in |
| פסחים | Pesahim | P[e']?s[ao]c?[ḥh]im |
| שקלים | Shekalim | (Sh|š|Š)[e']?[ḳkq][ao]lim |
| יומא | Yoma | Y(o[iy]?|u)ma |
| סוכה | Sukkah | Su[kc][kc]?[ao]h? |
| ביצה | Beitzah | B(e|ei|ey|ay|ai)t?[zsẓṣ][ao]h? |
| ראש השנה | Rosh Hashanah | Ro[iy]?(sh|š|Š) ha(sh|š|Š)?-?(sh|š|Š)[ao]nn?[ao]h |
| תענית | Ta’anit | Ta’?[ay]ni(s|t|th) |
| מגילה | Megillah | M[e']?gill?[ao]h? |
| מועד קטן | Mo’ed Katan | Mo’?(e|ë|ei|ey|ay|ai)d [KḲQ][ao][ṭt][ao]n |
| חגיגה | Hagigah | C?[ḤHh]agg?igg?[ao]h? |
| נשים | ||
| יבמות | Yevamot | Y[e']?[bv][ao]mo[iy]?(s|t|th) |
| כתובות | Ketubot | K[e']?(s|t|th)u[vb]o[iy]?(s|t|th) |
| נדרים | Nedarim | N[e']?d[ao]rim |
| נזיר | Nazir | N[ao]zir |
| סוטה | Sotah | So[iy]?[tṭ][ao]h? |
| גיטין | Gittin | Gi[tṭ][tṭ]?in |
| קידושין | Kiddushin | [Ḳkq]idd?u(sh|š|Š)in |
| נזיקין | ||
| בבא קמא | Bava Kamma | B[ao][bv][ao]h? [KḲQ]amm?[ao]h? |
| בבא מציעא | Bava Metzia | B[ao][bv][ao]h? Met?[zsẓṣ]i’?[ao]h? |
| בבא בתרא | Bava Batra | B[ao][bv][ao]h? Ba(s|t|th)r[ao]h? |
| סנהדרין | Sanhedrin | Sanhedrin |
| מכות | Makkot | Ma[ck][ck]?o[iy]?(s|t|th) |
| שבועות | Shevu’ot | (Sh|š|Š)[e']?[bv]u’?o[iy]?(s|t|th) |
| עדויות | Eduyot | ‘?(e|ei|ey|ay|ai)duyy?o[iy]?(s|t|th) |
| עבודה זרה | Avodah Zarah | ‘?A[vb]o[iy]?d[ao]h? Z[ao]r[ao]h? |
| אבות | Avot | (Pir[kḳ]ei? )?A[vb]o[iy]?(s|t|th) |
| הוריות | Horayot | Hor[ao]yo[iy]?(s|t|th) |
| קודשים | ||
| זבחים | Zevahim | Z[e']?[bv][ao]c?[hḥ]im |
| מנחות | Menahot | M[e']?n[ao]c?[ḥh]o[iy]?(s|t|th) |
| חולין | Hullin | C?[ḤHh]ull?in |
| בכורות | Bekhorot | B[e']?[ck]h?o[iy]?ro[iy]?(s|t|th) |
| ערכין | Arakhin | ‘?[ae]ra?[kc]h?in |
| תמורה | Temurah | T[e']?mur[ao]h |
| כרתות | Keritot | K[e']?ri(s|t|th)(o[iy]?|u)(s|t|th) |
| מעילה | Me’ilah | M(e|ei|ey|ay|ai)?’?[iï]l[ao]h? |
| תמיד | Tamid | T[ao]mid |
| מידות | Middot | Midd?o[iy]?(s|t|th) |
| קינים | Kinnim | [Ḳkq]inn?im |
| טהרות | ||
| כלים | Keilim | K(e|ei|ey|ay|ai)lim |
| אוהלות | Oholot | Oh[ao]lo[iy]?(s|t|th) |
| נגעים | Nega’im | N[e']?g[ao]’?im |
| פרה | Parah | P[ao]r[ao]h? |
| טהרות | Tohorot | [ṬT][ao]h[ao]ro[iy]?(s|t|th) |
| מקוות | Mikva’ot | Mi[ḳkq][vw][ao]’?o[iy]?(s|t|th) |
| נידה | Niddah | Nidd?[ao]h? |
| מכשירין | Makhshirin | Ma[ck]h?(sh|š)irin |
| זבים | Zavim | Z[ao][bv]im |
| טבול יום | Tevul Yom | [ṬT][e']?[bv]ul Yo[iy]?m |
| ידיים | Yadayim | Y[ao]d[ao]yim |
| עוקצים | Uktzim | ‘?U[ḳkq]t?[zsẓṣ]i[mn] |
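As a rough sketch of how the table might be used for recognition, the Python below looks for tractate names inside a longer piece of text. Only four rows are included, and the function name, the word-boundary wrapping and the case-insensitive flag are my assumptions rather than anything the expressions themselves require.

```python
import re

# Four rows copied from the table; a full version would include every tractate.
TRACTATES = {
    "Berakhot": r"B[e']?r[ao][ck]h?o[iy]?(s|t|th)",
    "Shabbat": r"(Sh|š|Š|S)abb?[ao](s|t|th)",
    "Bava Kamma": r"B[ao][bv][ao]h? [KḲQ]amm?[ao]h?",
    "Sukkah": r"Su[kc][kc]?[ao]h?",
}

# Wrap each pattern in word boundaries so it cannot match inside a longer word.
COMPILED = {
    name: re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)
    for name, pattern in TRACTATES.items()
}

def find_tractates(text):
    """Canonical names of the tractates mentioned in `text`, in table order."""
    return [name for name, regex in COMPILED.items() if regex.search(text)]

print(find_tractates("See Brachos 2a and Baba Kama 83b"))
# ['Berakhot', 'Bava Kamma']
```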
To take your first example, how would you explain that Berakhot cannot be spelled Beracot? By writing “[ck]h?” you imply that it can. This isn’t me picking holes in your work, just trying to understand exactly HOW it works. I like it: did it take you a long time?
Comment by Simon Holloway — 29 October, 2007 @ 7:41 pm
That’s what I mean when I say it overgenerates. At the moment, I intend to use it for recognition rather than for generation of strings (and even for generation it won’t overgenerate too much), so it was easiest to just write out something simple. And sometimes c is used for כ (as in the common Succot). I also currently allow CḤallah.
It didn’t take that long. These are relatively simple regular expressions. I only had to think about some of the possible issues, and occasionally check whether a particular spelling is used online. Something like Erchin was an unexpected case to handle. I still don’t know whether people write Yaddayim or Yadayyim; I’ve excluded both, but maybe I should include them, since overgenerating in this case is better than undergenerating.
So to give some examples, ברכות can be any of:
[Berachos, Brachos, Berochos, Brochos, Berakhos, Brakhos, Berokhos, Brokhos, Beracos, Bracos, Berocos, Brocos, Berakos, Brakos, Berokos, Brokos, Berachot, Brachot, Berochot, Brochot, Berakhot, Brakhot, Berokhot, Brokhot, Beracot, Bracot, Berocot, Brocot, Berakot, Brakot, Berokot, Brokot]
Many of these are unrealistic because they mix conventions and dialects that no single writer is likely to combine, but they’re at least all-encompassing, I hope.
Or ערובין is:
['eruvin, eruvin, 'eiruvin, eiruvin, 'eyruvin, eyruvin, 'ayruvin, ayruvin, 'airuvin, airuvin, 'erubin, erubin, 'eirubin, eirubin, 'eyrubin, eyrubin, 'ayrubin, ayrubin, 'airubin, airubin]
בבא מציעא returns the most with 256 alternatives (and that’s without allowing no e in Metzia)! מועד קטן closely follows with 240.
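One way lists and counts like these could be produced is sketched below in Python. The `expand` function is a hypothetical helper, not a general regex engine: it only understands the subset of syntax used in the table (literal characters, `[...]`, `(...|...)` and `?`).

```python
def expand(pattern):
    """List every string matched by a pattern made of literals, [...], (...|...) and ?."""
    strings, pos = _alternation(pattern, 0)
    if pos != len(pattern):
        raise ValueError(f"unexpected {pattern[pos]!r} at position {pos}")
    return strings

def _alternation(p, i):
    """Parse branch|branch|... up to ')' or the end; return (strings, next index)."""
    out, i = _sequence(p, i)
    while i < len(p) and p[i] == "|":
        more, i = _sequence(p, i + 1)
        out = out + more
    return out, i

def _sequence(p, i):
    """Parse consecutive atoms, each optionally followed by '?'."""
    results = [""]
    while i < len(p) and p[i] not in "|)":
        if p[i] == "[":                      # character class: one of the listed chars
            end = p.index("]", i)
            atom, i = list(p[i + 1:end]), end + 1
        elif p[i] == "(":                    # group containing alternatives
            atom, i = _alternation(p, i + 1)
            if i >= len(p) or p[i] != ")":
                raise ValueError("unclosed '('")
            i += 1
        else:                                # plain literal character
            atom, i = [p[i]], i + 1
        if i < len(p) and p[i] == "?":       # optional: also allow the empty string
            atom, i = atom + [""], i + 1
        results = [r + a for r in results for a in atom]
    return results, i

print(expand("N[ao]zir"))                              # ['Nazir', 'Nozir']
print(len(expand("'?(e|ei|ey|ay|ai)ru[vb]in")))        # 20
print(len(expand("B[e']?r[ao][ck]h?o[iy]?(s|t|th)")))  # 216
```

The count of 20 agrees with the list above for the Eruvin pattern (written here with a straight apostrophe); the Berakhot count is for the expression as it currently stands in the table.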
Comment by Joel — 29 October, 2007 @ 8:48 pm
Oh no! I’ve forgotten all about ת as th, and shewa as '!
Now it’s:
[Berachos, B'rachos, Brachos, Berochos, B'rochos, Brochos, Berakhos, B'rakhos, Brakhos, Berokhos, B'rokhos, Brokhos, Beracos, B'racos, Bracos, Berocos, B'rocos, Brocos, Berakos, B'rakos, Brakos, Berokos, B'rokos, Brokos, Berachot, B'rachot, Brachot, Berochot, B'rochot, Brochot, Berakhot, B'rakhot, Brakhot, Berokhot, B'rokhot, Brokhot, Beracot, B'racot, Bracot, Berocot, B'rocot, Brocot, Berakot, B'rakot, Brakot, Berokot, B'rokot, Brokot, Berachoth, B'rachoth, Brachoth, Berochoth, B'rochoth, Brochoth, Berakhoth, B'rakhoth, Brakhoth, Berokhoth, B'rokhoth, Brokhoth, Beracoth, B'racoth, Bracoth, Berocoth, B'rocoth, Brocoth, Berakoth, B'rakoth, Brakoth, Berokoth, B'rokoth, Brokoth]
Comment by Joel — 29 October, 2007 @ 8:52 pm
And what do you know? Masseceth Beracoth. But it’s excused because it’s Latin.
Comment by Joel — 29 October, 2007 @ 9:03 pm
I’m looking through Jastrow’s abbreviation list, and not only is it horrendously inconsistent, but it’s not in alphabetical order!
Comment by Joel — 29 October, 2007 @ 9:26 pm
Sounds like a lot of work to me. Well done.
(And why do I suddenly feel like a berocca?)
Comment by Simon Holloway — 30 October, 2007 @ 3:17 pm