I understand how to turn a class into an iterator by implementing `__next__` and `__iter__`. There are loads of resources online on how to do this, such as https://www.w3schools.com/python/python_iterators.asp
My question is more about when you should not make a class an iterator. In particular, what if a class iterates over a collection that is:
- Not predetermined even after the iterator is made
- Not generated within the iterator
- Not iterating over a fixed index
Even if the class has a “next” function, would it be a good or a bad “iterator”?
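For concreteness, here is a minimal sketch of a class that meets all three conditions. The names `LiveKeyCursor`, `source`, and `keys` are hypothetical and are not taken from the project; the cursor remembers the last key it returned instead of an index, and re-reads an externally owned set on every call.

```python
class LiveKeyCursor:
    """Iterate, in sorted order, over a set of keys that other code may mutate.

    The cursor stores the last key it returned (not an index), so keys added
    or removed between calls are picked up or skipped on the next call.
    """

    def __init__(self, source):
        self._source = source      # owned and mutated elsewhere
        self._last = None          # last key handed out; None before the first call
        self._exhausted = False    # latched once StopIteration has been raised

    def __iter__(self):
        return self

    def __next__(self):
        if self._exhausted:
            raise StopIteration
        # Re-read the live collection on every step.
        candidates = [k for k in self._source if self._last is None or k > self._last]
        if not candidates:
            self._exhausted = True
            raise StopIteration
        self._last = min(candidates)
        return self._last


keys = {"2024-01-10", "2024-01-12"}
cursor = LiveKeyCursor(keys)
first = next(cursor)       # smallest key present at this moment
keys.add("2024-01-11")     # added after the iterator was created
rest = list(cursor)        # the new key is still visited, in order
```

The Python documentation on iterator types states that once `__next__` raises StopIteration it must keep doing so on later calls, which is why the sketch latches an `_exhausted` flag instead of consulting the live set again after it has run dry.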
A more specific example, and the underlying reason I ask this question, would be a class that, given a path into a file system, can retrieve the “next logical” directory within that path (including its subpaths) each time you call its next function. For efficiency, it does not store a physical collection of all the directories. Something like what I currently have, called DirectoryIterator, below:

    import logging
    import os
    from pathlib import Path
    from typing import Optional, Union

    from repoManager.Models import ImagePromptDirectory
    from repoManager.utils import generate_file_name
    from utils.pathingUtils import get_reverse_sorted_directory_by_name, get_next_file_index_from_reverse_sorted, DIRECTION


    class DirectoryIterator:
        """
        An iterator that simplifies traversal of the file system. Its primary mode of traversal is going to the
        logical "next" prompt directory in the system.

        When created, the DirectoryIterator first makes a snapshot of the date directories and orders them by most
        recent. This snapshot can become stale, so users of this DirectoryIterator should make a new DirectoryIterator
        to get a more recent snapshot of dates.

        Logically the repo directory structure can look as follows:

        ${pathToDirectories}/
            2024-01-11/
                15:05:06.713451_Sad rat/
                    1.png
                    2.png
                16:55:31.897695_Mom yelling/
                    1.png

        Whatever the "pathToDirectories" location is, there will be directories there named in ISO 8601 date format.
        Within each date directory there is a directory per prompt, prefixed by the ISO 8601 time used to generate it.
        Within each timePrompt directory the images generated for that prompt are stored.

        When saved to the file system the directories won't be sorted, as they are actually stored in the order of the
        file system's hash system. That is why a "snapshot" of the dates is used: it avoids having to sort all dates
        each time they are loaded from the file system.

        The DirectoryIterator has internal pointers to where it is in the directory structure. These pointers start
        at the very first entry in the directory structure, depending on the direction provided (default is forward).
        The pointers can be adjusted with a provided startingDirectory arg. The startingDirectory does not need to be
        a physical directory in the system; if it is not, this DirectoryIterator will iterate to the next logical
        entry after the provided starting directory.
        """

        def __init__(self, pathToDirectories: Union[str, Path], startingDirectory: Optional[ImagePromptDirectory] = None):
            self.pathToDirectories = Path(pathToDirectories)
            self.sortedDateDirectories = get_reverse_sorted_directory_by_name(self.pathToDirectories)
            if startingDirectory is not None:
                # If a directory is provided (even if it doesn't exist), start from it.
                startingPromptWithTime = generate_file_name(startingDirectory.time, startingDirectory.prompt)
                self.currentDate = startingDirectory.date
                self.currentTimePromptDirectories = get_reverse_sorted_directory_by_name(self.pathToDirectories/self.currentDate)
                self.currentTimePrompt = startingPromptWithTime
            else:
                self.currentDate = None
                self.currentTimePromptDirectories = []
                self.currentTimePrompt = None

        def _rewind_to_right(self):
            self.currentDate = "9999-99-99"  # I would be blessed if I lived to see this be an error
            self.currentTimePromptDirectories = []
            self.currentTimePrompt = ""

        def _rewind_to_left(self):
            self.currentDate = "0000-00-00"
            self.currentTimePromptDirectories = []
            self.currentTimePrompt = ""

        def _rewind_if_necessary(self, direction: DIRECTION = DIRECTION.FORWARD):
            # Only rewind when the iterator is not currently pointing at anything.
            if self.get_current_image_prompt_directory() is None:
                if direction is DIRECTION.FORWARD:
                    self._rewind_to_right()
                else:
                    self._rewind_to_left()

        def get_current_image_prompt_directory(self) -> Optional[ImagePromptDirectory]:
            """
            Gets the current image prompt directory that this iterator is pointing to.

            Returns
            -------
            ImagePromptDirectory
                A directory model that could represent a physical entry. However, the physical entry
                could have been deleted externally since the last time this iterator took a snapshot of the
                time_prompts directories for a given date.
            """
            logging.debug('Getting the current directory')
            if (
                self.currentTimePrompt is not None and
                self.pathToDirectories is not None and
                self.currentDate is not None
            ):
                logging.debug(f'self.currentTimePrompt {self.currentTimePrompt}')
                # Split on the first underscore only, so prompts containing "_" stay intact.
                time, prompt = self.currentTimePrompt.split("_", 1)
                return ImagePromptDirectory(
                    prompt=prompt,
                    time=time,
                    # Get only the repo name, not the full absolute path of the repo.
                    repo=os.path.basename(os.path.normpath(self.pathToDirectories)),
                    date=self.currentDate,
                )
            return None

        def _iterate_time_prompt(self, direction: DIRECTION = DIRECTION.FORWARD) -> Optional[ImagePromptDirectory]:
            # Attempt to get the next time prompt within the current date directory.
            nextTimePromptIndex = get_next_file_index_from_reverse_sorted(
                fileName=self.currentTimePrompt,
                reverseSortedFiles=self.currentTimePromptDirectories,
                direction=direction,
            )
            if nextTimePromptIndex is not None:
                nextTimePrompt = self.currentTimePromptDirectories[nextTimePromptIndex]
                self.currentTimePrompt = nextTimePrompt
                return self.get_current_image_prompt_directory()
            self.currentTimePrompt = None
            return None

        def _iterate_date(self, direction: DIRECTION = DIRECTION.FORWARD) -> Optional[ImagePromptDirectory]:
            # Attempt to move to the next date directory and point at its first time prompt.
            nextDateIndex = get_next_file_index_from_reverse_sorted(
                fileName=self.currentDate,
                reverseSortedFiles=self.sortedDateDirectories,
                direction=direction,
            )
            if nextDateIndex is not None:
                nextDate = self.sortedDateDirectories[nextDateIndex]
                nextTimePromptDirectories = get_reverse_sorted_directory_by_name(self.pathToDirectories/nextDate)
                self.currentDate = nextDate
                self.currentTimePromptDirectories = nextTimePromptDirectories
                if not nextTimePromptDirectories:
                    # Empty date directory: skip it instead of raising IndexError; the caller keeps looping.
                    self.currentTimePrompt = None
                    return None
                startIndexOfNextDateFolder = 0 if direction is not DIRECTION.BACKWARD else len(nextTimePromptDirectories)-1
                self.currentTimePrompt = nextTimePromptDirectories[startIndexOfNextDateFolder]
                return self.get_current_image_prompt_directory()
            self.currentDate = None
            self.currentTimePromptDirectories = []
            self.currentTimePrompt = None
            return None

        def get_next_time_prompt_directories(self, direction: DIRECTION = DIRECTION.FORWARD) -> Optional[ImagePromptDirectory]:
            """
            Gets the logical next image prompt directory after the one this iterator is currently pointing to.
            Returns None if all image prompts have been exhausted.

            Not guaranteed to be 100% accurate in scenarios where the physical file system has been externally
            modified.

            Parameters
            ----------
            direction: (DIRECTION):
                The direction in which to get the next prompt time directory. Supports "forward" or "backward".
                Default is to go forward.

            Returns
            -------
            ImagePromptDirectory
                A directory model that could represent the next logical physical entry. However, the physical entry
                could have been deleted externally since the last time this iterator took a snapshot of the
                time_prompts directories for a given date.
            """
            logging.debug(f'Getting next prompt with direction {direction}')
            self._rewind_if_necessary(direction=direction)
            # Get the next time prompt if the current date directory has any left.
            candidate_time_prompt_directory = self._iterate_time_prompt(direction)
            # Iterate through date directories until a time prompt directory is found or all date directories are exhausted.
            while self.currentDate is not None and candidate_time_prompt_directory is None:
                candidate_time_prompt_directory = self._iterate_date(direction)
            return candidate_time_prompt_directory

This comes from a project I am making at https://github.com/CoryBond/PAIID/blob/main/src/repoManager/DirectoryIterator.py. Its “next” method is effectively get_next_time_prompt_directories. Apologies for the documentation having awkward wording and typos.
With just a few modifications, such as:
- Removing direction from the “next” method and moving it to the constructor
- Raising StopIteration when the directories are exhausted
- Adding `__iter__` and `__next__` methods, or replacing existing methods with them
This can become a true Python iterator. However, its underlying behavior, especially the way it takes snapshots of subdirectories as an efficiency measure, makes me think that naming it or treating it like a generator/iterator is not Pythonic.
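The modifications listed above can be sketched as a thin adapter around any method that returns None when exhausted. This is only an illustration: `ScannerIterator` and `FakeScanner` are hypothetical names, and `FakeScanner` merely stands in for a class like DirectoryIterator so that the example runs on its own.

```python
class ScannerIterator:
    """Adapt a 'returns None when done' method to the iterator protocol."""

    def __init__(self, get_next):
        # get_next is a zero-argument callable; any direction argument
        # would be bound before it is passed in.
        self._get_next = get_next
        self._done = False

    def __iter__(self):
        return self

    def __next__(self):
        if self._done:
            raise StopIteration
        item = self._get_next()
        if item is None:
            # Latch exhaustion so later calls keep raising StopIteration.
            self._done = True
            raise StopIteration
        return item


class FakeScanner:
    """Hypothetical stand-in for a scanner-style class."""

    def __init__(self, items):
        self._items = list(items)

    def get_next(self):
        return self._items.pop(0) if self._items else None


# Explicit adapter class.
explicit = list(ScannerIterator(FakeScanner(["a", "b"]).get_next))

# The built-in two-argument form iter(callable, sentinel) does the same job.
builtin = list(iter(FakeScanner(["a", "b"]).get_next, None))
```

With the real class, the direction could be bound up front with `functools.partial` before handing the method to either form, which mirrors the idea of moving direction out of the “next” call.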
Maybe this class is more of a “scanner” than an “iterator”?
I think what you’re looking for is called `glob`.
@MarkRansom My question is more specifically about whether the bullet-point conditions I listed in the question make a class a bad fit for the iterator protocol. I posted the DirectoryIterator example because it was the class that made me think more deeply about this topic. How DirectoryIterator is implemented (using os, glob, path, etc.) isn’t as important as the behavior of its “next” function. As for glob, it might be useful if I want to support filtering of partitions in the future, but I have to weigh it against just filtering the output of os.scandir.
As I was reading more of the os documentation for scandir, I noticed some key behaviors of the returned iterator that might answer my question. In particular, scandir:
- Returns an iterator of os.DirEntry objects
- Leaves it unspecified whether an entry is included for a file that is added to or removed from the directory after the iterator is created
That last bullet point is behavior I thought might disqualify a class from being a true iterator, but I guess that isn’t the case.
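A small check of the first point, as a sketch that assumes a throwaway temporary directory and made-up date-style folder names: the object returned by os.scandir is its own iterator, yields os.DirEntry objects lazily, and stays exhausted once consumed.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical date-named directories, only for this demonstration.
    for name in ("2024-01-10", "2024-01-11"):
        os.mkdir(os.path.join(root, name))

    with os.scandir(root) as entries:
        # Iterator protocol: __iter__ returns the object itself.
        is_own_iterator = iter(entries) is entries
        # Entries arrive in arbitrary order, so sort the names explicitly.
        names = sorted(entry.name for entry in entries if entry.is_dir())
        # Once exhausted, the iterator yields nothing further.
        exhausted = next(entries, None) is None
```

The second point cannot be demonstrated reliably, because the documentation deliberately leaves the outcome unspecified; code that needs a stable view has to take its own snapshot, for example by sorting the entries into a list as above.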