Sunday, 15 September 2013

Get start location of capturing group within regex pattern

Basically, I want to find the index of the first occurrence of any of the
substrings "ABC", "DEF", or "GHI", so long as it occurs at an interval
of three (that is, it starts at index 0, 3, 6, and so on). The regex that I
wrote to match this pattern is:
regex = compile("(?:[a-zA-Z]{3})*?(ABC|DEF|GHI)")
The *? ensures that I get the first match, since it's non-greedy. I'm
using a capturing group since I assume that is the only way to get the
index of the substring that I'm actually looking for. I don't care where
the match itself starts, just where the capturing group starts. The
...{3}... mandates that the pattern occur at an interval of 3, i.e.:
example_1 = "BNDABCDJML"
example_2 = "JKMJABCKME"
example_1 would match since "ABC" occurs at position 3 but example_2 would
not match since "ABC" occurs at position 4.
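A quick check of the two examples, as a minimal sketch assuming Python's standard re module. The "import re" line and the "re." prefixes are additions here; the question calls compile and match directly.

```python
import re

# Same pattern as above: lazily skip whole 3-letter blocks, then capture the target.
regex = re.compile(r"(?:[a-zA-Z]{3})*?(ABC|DEF|GHI)")

example_1 = "BNDABCDJML"
example_2 = "JKMJABCKME"

# match() anchors at index 0, so the group can only start at 0, 3, 6, ...
match_1 = regex.match(example_1)
match_2 = regex.match(example_2)

print(match_1 is not None)  # True: "ABC" starts at index 3
print(match_2 is not None)  # False: "ABC" starts at index 4
```

Note that this depends on match() being anchored at the start of the string. search() would be free to begin at index 1 and would therefore accept example_2.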
Ideally, given the string:
text = "STCABCFFC"
this matches, but if I simply get the start of the match, it will give me
0, since that's the beginning index of the match, whereas what I want is 3.
I'd like to do this:
print match(regex, text).group(1).start()
but, of course, this doesn't work, since start() is not a method of
strings, and the string returned by group(1) is independent of text
anyway. I can't simply search text for the substring from the capturing
group, because that won't guarantee that the occurrence follows the regex
pattern (only occurring at intervals of 3). Perhaps I'm overlooking
something; I don't write much Python, so forgive me if this is a trivial
question.
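One approach that appears to fit, sketched under the assumption that the standard re module is in use: match objects accept an optional group number in start(), so the offset of the capturing group can be read from the match object itself rather than from the extracted string. The variable names below are illustrative only.

```python
import re

regex = re.compile(r"(?:[a-zA-Z]{3})*?(ABC|DEF|GHI)")
text = "STCABCFFC"

m = regex.match(text)
if m is not None:
    whole_start = m.start()   # start of the entire match
    group_start = m.start(1)  # start of capturing group 1, as an index into text
    print(whole_start)        # 0
    print(group_start)        # 3
```

m.end(1) and m.span(1) work the same way if the end offset of the group is needed as well.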
