i’ve always struggled with the concept of regular (or of context-free) language, not in the way that i don’t know how to construct one that i know satisfies the definition, but that of deciding, how do i know whether a language is regular or not? and how do i arbitrarily come up with a language that’s not regular? they will tell you about the pumping lemma, but there’s just not much good explanation about it when i tried to search around.

background: regular expression

let’s know what’s a regular language first. i would like to put it simply, something represented by a regular expression. maybe the one “grep” comes with. the one in many programming languages. let’s start by looking at some. (i’ll use the common notation in programming language world here, that is, | *.)

suppose you want to describe a word, but not very precisely; let’s say, you want to look for both regular and normal. then you want to type in regular|normal (or normal|regular, it’s mostly symmetric in real world). the | represents an alternative, meaning to describe both the word on the left and on the right.

now you want to describe both regularity and regularly. in the same logic you want to write regularity|regularly, which will work just as fine. but you notice you typed regular twice, and that’s not efficient. so in these cases, you may use a regular expression like regular(ity|ly), to describe the common prefixes and different suffixes. to take a step further, you may also take out the last y here for regular(it|l)y. three describe the same set of words.

maybe you want to include something like familiarity in your description too. and for consistency you want familiarly too. in this case you might want to write (regul|famili)ar(it|l)y. this expression describes the four words mentioned. if you want to describe similar- in here too, just say (regul|famili|simil)ar(it|l)y, or maybe ((regu|simi)l|famili)ar(it|l)y. these describe the six words. say for some reason you want to forgo with the -ly but add in arity (i hope arly is a word now), you can write (()|regul|famili|simil)arity for it. () is the regexp for empty input, which is sometimes written as epsilon, $\varepsilon$.

and now you want to describe looooools. for this we can use the kleene star *. we might write something like lo(o*)l, o* means arbitrarily many o in series. we write an extra o before that because, l(o*)l describes ll too, the repetition can be zero times. if we want to describe lololololol now, we write lo((lo)*)l. now if you like a freestyle llloololllooollol, describe it with l(l|o)*l.

and regular expressions are just basically the above, but they usually appear with some shorthand notations too. notice we wrote o(o*) lo((lo)*), in which we want to describe a repetition of one or more times instead of zero or more. so let’s write A+ := A(A*). if we only want to describe a short enough lulz, we write something like lo{1-5}l, meaning o repeated for any of 1 to 5 times. this is expands to l(o|oo|ooo|oooo|ooooo)l. write A? := ()|A. and there’s a lot more interesting extensions, but we can always decide to stick to only * | and sticking two regexps together.

now there’s actually a less known non-shorthand regexp, $\empty$, that describes absolutely nothing, not even the empty string. it’s probably less useful in real-world situations, but we need it in order to describe all regular languages.

i hope this is enough to give us a grasp of what a regular expression is, and what it describes (we’ll call it a regular language). there are a ton of tips and tricks, deeper into the topic of what is a regular language and why the regular expressions can describe exactly all the regular languages for a given alphabet, automaton and computability theory, all very fascinating stuff out there. for the purpose of this article we’ll conclude here, with a somewhat-formal definition of a regular expression.

we call it a regular expression iff exactly one of the following holds:

  • it is empty-set $\empty$;
  • it is empty-string $\varepsilon$;
  • it is one element of the alphabet, a b x y etc;
  • it is an alternation, of the form A | B;
  • it is a concatenation, of the form AB;
  • it is a kleene star expression, of the form A*.

matching a string to a regexp

so now we know a regexp describes a set of strings. that means you can tell if (at least for some) strings belongs to this set or not. let’s see how we can tell by defining a relation “match” for a regexp and a string. should we talk about a string, i will use ‘++’ to represent string concatenation from now on.

matches is the smallest relation that satisfies:

  • empty string matches empty-string.
  • singleton string of only x matches the one element of the alphabet x.
  • if s matches A, it matches A|B.
  • if s matches B, it matches A|B.
  • if s1 matches A and s2 matches B, s1 ++ s2 matches AB.
  • empty string matches A*.
  • if sh matches A and st matches A*, sh ++ st matches A*.
  • (and you can see because this is the smallest relation, nothing matches empty-set).

this relation is inductive, so we can prove things about this relation by induction directly on the relation, that is, case-analysis with different witnesses of the fact that a string matches a regexp, which is super handy. better yet we can show this relation is actually decidable, by developing an algorithm for the decision and proving its correctness. it’s kind of a labor, involving you resolving by some enumeration and thus guessing and backtracking, and taking care of ambiguity. so is why people generally just find whatever regexp library, sporadically selling their souls to grep and sed. may all rest peacefully in the regexp heaven.

anyways we don’t really need this to be decidable for the later reasoning, just wanted to let you know this really is the regexp we’re familiar with, albeit in a computational relation form.

now there’s still a relatively trivial way to treat this relation as a meta-algorithm, so say you want to prove or disprove s matches R, just see which rule will give the correct s or R, and try to prove premises of the rule. if there are multiple choices, then it involves guessing, which you can use your galaxy brain to just use one of them, or just like the real algorithm, try and backtrack. eventually it’s either proven or you find a contradiction entails (which means s does not match R).

suppose we want to show familiarly matches (regul|famili)ar(it|l)y. top level of this expression is of form AB so we invoke its sole rule, (from oracle) setting s1 = familiar, s2 = ly, A = (regul|famili)ar and B = (it|l)y. now we prove s1 matches A and s2 matches B. let’s go with s2 matches B here first. again by the same logic we split s2 into l ++ y, and show that (s2 = ) y matches (B =) y, by applying the singleton rule. now prove l matches it|l. we invoke the second rule for alternation. now we see s2 matches B. go back to s1 matches A, it’s similar.

pumping lemma (for regular languages)

we should first know what it means if a language is regular or not. so if you have a language, that is, a (very possibly infinite) collection of strings (sentences) over the alphabet (maybe words), saying it is regular is equal to saying, there’s some regular expression out there, which describes exactly this language. and if a language is so, we can ask it for its corresponding regexp.

now let’s gradually work out what the pumping lemma really is. first of all, it is a property that every regular language has. so we can assert a language has the pumping lemma, or it doesn’t have the pumping lemma; and the conclusion is going to be that if a language is regular, it has the pumping lemma.

this is useful, because it allows us to show a language doesn’t have the pumping lemma, thus it isn’t a regular language.


detour

please note this is called a contraposition, which is (very) different from proof by contradiction (reductio ad absurdum), which many explanations end up claiming. the logic is $ (\neg A)\rightarrow (\neg B) \leftrightarrow B \rightarrow A $ which has an intuitionist derivation, unlike RAA which is only classical. this is important because, imagine if we are reasoning about some relation around regular languages, say for some parameter x, if A(x) is regular then B(x) is regular too. we might do a case analysis on x, and maybe for some x A(x) doesn’t have a pumping lemma, then A(x) isn’t regular, refuting the need to prove B(x) is regular in this case. consider if we have such a theorem, we can specify x and A (and regexp of A(x)), to get a regexp for B(x). if we have to use a classical derivation here, then the process of running A(x) through the theorem might just not terminate sometimes, then we don’t always get a B(x). (which I just happened to randomly feel that this would be super useful someday!)

also please note that this doesn’t exclude the possibility of some language having the pumping lemma but isn’t really regular. this only denies that some language is regular, and it’s usually hard to deny, relative to proving some language is regular, which instead only involves you coming up with a regular expression of it. just remember you can’t use this to prove a language is regular.


so, let’s ponder a little bit around the features of a regular language, what do they all have in common?

a finite language is necessarily regular because, for a given finite language, you can just join every sentence together with |, and it’s exactly this language. and on the other way around, if a regular expression only involves | and concatenation, it’s necessarily finite, such that you can kind of expand it like it’s a polynomial, splitting any (A|B)C into AC|BC, and you will eventually get a long list of sentences joined with |, which is the exact list of every sentence in this language.

as soon as we introduce * into our regexp, we come around to dealing with an infinite language. any regular infinite language is so because they have at least one * in them. you can factor and expand | and concatenation and some * if you need, but the language is still infinite and you still will have a * somewhere in there. as you can imagine, a regular infinite language is and in a way only is made so with the kleene star, the lemma of our interest will definitely talk about a star a lot.

the idea basically is that, if there’s some * somewhere in your language, then some sentence is going to use that * to match, which means a part of that sentence r matches some A* with a certain p in it matching A. and if p matches A*, p ++ r also will. and p ++ r ++ r also will. and so you can just shove more and more r into this hole. trivially, you can repeat the o in loooool as many times as you want.

let’s take a look at our earlier lolol language. so for example we know lolololol is in this language, that it matches the given regular expression lo((lo)*)l. and you can see this “used” the (lo)* to match. so we can break the sentence apart like this: lo(lo)lolol, and shove as many lo into there as we want, and it still matches the regular expression. we can also break as lolo(lo)lol and this still works. we can even break this into lol(ol)olol and you see we can shove arbitrarily many ol in there too in this case. this is what “pumping” means, that you can repeatedly insert something back in.

but sometimes you can’t always do this pumping even if it’s a string in a regular infinite language, simply because if we take lmao matching lmao|lo((lo)*)l, you can’t really find any part of lmao that you can pump. this suggests there’s some condition a match must satisfy in order for it to be pumped. and the common way to put is that we can only pump a sentence in this language long enough, and lmao is too short lmao, deal with it.

pumping constant

so how long do you want then? as you can imagine this depends on the regular expression itself. we could also just say eh we don’t really know, just that not forall lengths, not some string longer but not pumpable, as is the usual semantic of “sufficiently (large|long|many)”. but it’s also just a kind of easy-to-calculate number, and it helps to know what this number actually means. we call this shortest pumpable length of a language its pumping constant.

the pumping constant of a regular expression is:

  • if it’s empty-set, 1
  • if it’s empty-string, 1
  • if it’s one element of alphabet, 2
  • if it’s A|B or AB, result is summing pumping constant of A and pumping constant of B up
  • if it’s a kleene star A*, result is just pumping constant of A

it might look a bit arbitrary and thus scary, but here’s the breakdown: we’re basically denying to pump the finite stuff by setting the pumping constant to be just 1 too large: empty-string can only have a length of 0, singleton can only have a length of 1 etc. and for A|B and AB, they both produce finite language if both A and B are finite. in this case if we just sum them up, the result is larger than the maximum length A|B or AB can reach, we also bar them from being pumped. finally, the number of A* you see, if we have a string as long as A could represent, means we can pump it and it’ll still be in A*. this basically means the number of characters you need to look for until you find a part that actually uses the *, since each of the other cases says you need to look past the entire string, which means it’s impossible to arrive at anything else.

deduction

we got our condition figured out now, let’s see how to put them all together and get a complete proposition that is our lemma:

given any regexp R, any sentence s that matches R, if the given s is longer than the pumping constant of R, then we can give a partition of s, as in there exists a b c, such that a ++ b ++ c = s and s2 isn’t empty, and satisfies that a ++ (any time of repeated b) ++ c also matches R.

Proof by induction on the given match:

  • if the match is because of empty-string, pumping constant is 1 but the length of s is 0, premise “s is longer than pumping constant of R” is unsatisfiable, we can refuse to prove this case.

  • if the match is because of a singleton char, it’s similar to the above.

  • if the match is because R is some A|B and s is matching A, we have the induction hypothesis, s has pumping lemma in A: if s is longer than the pumping constant of A, there exists sa1 sa2 sa3, sa1 ++ sa2 ++ sa3 = s, sa2 isn’t empty, and we can pump sa2.

    we need to show s has pumping lemma in A|B. recall s is longer than pumping constant of A|B. since pumping constant of A|B is that of A + B, we know s is also longer than the pumping constant of A. so we use this fact together with induction hypothesis to get our sa1 sa2 sa3, and it’s exactly the partition we’re looking for. basically, you figure out how to pump “s matches A”, the induction hypothesis, instead.

  • if the match is because R is some A|B and s is matching B, it’s similar to the case above.

  • if the match is because R is some AB and s is s1 ++ s2, s1 is matching A and s2 is matching B, we have the two induction hypothesis: if s1 is longer than the pumping constant of A, we can partition s1 and pump; similar for s2 and B.

    we show s1 ++ s2 has pumping lemma in AB. we know s1 ++ s2 is longer than the pumping constant of AB, and it’s that of A + B, and the length of s1 ++ s2 is that of s1 + s2. either s1 is longer than pumping constant of A or s2 is longer than B. in either case, we can use one of the induction hypotheses to get the partition and property desired.

  • if the match is because R is some A* and s is empty, we need to prove from the premise of s is longer than the pumping constant of A*. notice pumping constant is always >= 1, leading to an unsatisfiable premise.

  • finally, if the match is because R is some A* and s is some s1 ++ s2, induction hypotheses are if s1 is longer than A, we can partition A to pump; same with if s2 is longer than A (because pumping constant A* = A). we prove from the premise of that, s1 ++ s2 is longer than pumping constant of A, to show we can partition s1 ++ s2 and pump.

    s1 is either empty or not.

    • if it’s empty, that means s1 ++ s2 is s2, thus from the premise s2 is longer than pumping constant of A, and we invoke induction hypothesis.
    • if it’s not, we directly give out the partition letting a = empty string, b = s1, c = s2. a ++ b ++ c = s1 ++ s2 = s. b = s1 is not empty. any times of repeated s1 followed with s2 still matches A* in this case (this can be derived from a trivial induction on the times repeated).

conclusion

the above derivation actually still lacks a part: that the pumping constant actually has even more power, the partition we need to give should also satisfy that length of a ++ b is shorter or equal to the pumping constant of the given R. so the above is called the weak pumping lemma. the strong pumping lemma also asserts you can find this pumpable part in the first pumping-constant number of characters. but as you can see, it’s not that hard to keep track of exactly which a and b we chose, and show they are indeed in the required range. the basic idea is still that the pumping constant is indicating how long we must look for until we find a *.

let’s see some language that isn’t regular. the simplest example is a (a*)(b*) but only if there is a same amount of a and b, usually noted like $\{a^nb^n\,|\,n \in\N\}$. we show it’s always impossible to pump in this, for that any partition you come up will either pump a segment of all a or all b, which will cause imbalance of the number of as and bs; or if you pump a segment of some ...aaabb..., then obviously this doesn’t match our specification (a*)(b*). since we can’t pump this language, it’s impossible to come up with a regular expression to describe this language.

and there’s actually a pumping lemma for context-free languages (the above example is one) too. it’s just slightly more complicated but the idea is very similar, you can always partition a sentence, into 5 parts this time, a(b)c(d)e, where it satisfies that all the a(bbbb)c(dddd)e is still in the language. b and d get pumped the same number of times, and they correspond in a way that the first b is corresponding to the last d, such that if b...d matches two different rules, that b and that d can only choose the same rule.

i wish i possess formal linguistic prowess, but i don’t even have much idea about the formalization of BNF forms. alas let’s see an example, just like the last one, but this time it’s $\{a^nb^nc^n\,|\,n \in\N\}$. by the exact same logic, if you choose only two segments, the other one mismatch in number; and if you choose the boundary it doesn’t match at all. thus this can’t be pumped, and isn’t a context-free language.

this article was written because i wish i had someone explained to me when i was still a child. i feel like the wikipedia pages still suck to this day, and even if they’re getting better, they always involve automatons like if automatons are the hotcakes. and was this chance here given to me by @HoshinoTented, for she asked me for help in doing an exercise in Software Foundations, Vol.1 lf, IndProp.v, which asked for exactly the derivation above. actually making it feels really relieving.