[GMG-Devel] Reprocessing thoughts

mediagoblin-devel

From:	Christopher Allan Webber
Subject:	[GMG-Devel] Reprocessing thoughts
Date:	Wed, 24 Jul 2013 17:06:59 -0500
User-agent:	mu4e 0.9.9.5-dev5; emacs 24.1.50.1
Hey all,

Following closely on the heels of turning media types into plugins,
there's some other stuff that's been frequently requested regarding
media types: reprocessing.

Some old wiki pages to catch people up:
 - Here's how things more or less are now:
     http://wiki.mediagoblin.org/Processing

 - About media types generally:
     http://wiki.mediagoblin.org/Media_Types

 - And a collection of thoughts we've collected over time about
   reprocessing (the topic at hand):
     http://wiki.mediagoblin.org/Feature_Ideas/Reprocessing

The gist of it is that people want to be able to reprocess media that
failed to process correctly after an initial upload.  Or maybe just one
step of it (say, copying up to the remote storage) failed and should be
tried again.

Alternately, maybe someone wants to re-transcode some of their media to
another format.  Maybe vp9 is out now, and they want to support that.
Or maybe they want to resize their images to be 1024 pixels wide as
opposed to 800 pixels wide.  How do you do that?

My thoughts are below, but they're very jumbled... just scratchy notes.
Do your best to follow along... the next step from here on out is
probably actually jumping in and implementing.  The tl;dr version is:

 - Providing a list of "ProcessingActions" (each bound to a media type)
   that can also tell whether a media entry is "eligible" for them (i.e.,
   it's not eligible for initial_processing if the initial processing
   already went through, and it's not eligible for "resize" if
   initial_processing didn't go through).  These also provide
   instructions on how they're carried out.
 - Initially, accessible interface-wise only from
   ./bin/gmg subcommands... later, accessible from the web interface
   to administrators
 - Splitting commonly used parts into shared functions.  There may be
   some patterns here; what the patterns are exactly isn't clear yet,
   but I think it will become clearer upon implementation.  A
   ProcessingAction is a bit like carrying out a recipe with a series of
   steps, and multiple recipes may contain the same process with more or
   less the same prerequisite pieces (browning onions generally requires
   first chopping onions) even if they don't share *all* the same
   steps.
 - How do Celery subtasks fit into this?  I'm not totally
   sure... though I tend to think that a processing action is an all or
   nothing event.  Certain things may allow for retrying (copying up an
   image to a remote storage) but for the most part there's not that
   much of it needed, and copying around an open workbench might not
   really be worth it, and it's all going to run as one big chunk at
   once anyway.  If a processing fails, it can be reattempted later if
   the conditions are fixed, but it starts over from step 0.
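
A minimal sketch of the eligibility idea from the first bullet.
Everything here is an assumption made up for illustration: the
FakeMedia stand-in, its "state" values, and the function names are not
existing MediaGoblin API.

```python
# Hypothetical sketch: none of these names exist in MediaGoblin today.
from collections import namedtuple

# Stand-in for a MediaEntry; the "state" values are assumed.
FakeMedia = namedtuple("FakeMedia", ["id", "media_type", "state"])


def eligible_for_initial_processing(media):
    # Only makes sense if initial processing has not already succeeded.
    return media.state != "processed"


def eligible_for_resize(media):
    # Resizing needs a successfully processed image to start from.
    return media.media_type == "image" and media.state == "processed"


def available_actions(media):
    """Return the names of the actions this media is eligible for."""
    checks = [
        ("initial_processing", eligible_for_initial_processing),
        ("resize", eligible_for_resize),
    ]
    return [name for name, check in checks if check(media)]
```

With this, a processed image would only be offered "resize", and a
failed upload would only be offered "initial_processing".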

A lot more scattered thoughts below.  Mostly a stream-of-consciousness
braindump.  If anyone wants to talk about this on IRC I'm happy to do
it.  In the meanwhile, hopefully it's helpful as some sort of reference.


Braindump below
===============

*** Thinking this through / braindump

 - Failed upload (pre-initial-upload-success)
 - Changing a file (post-initial-upload-success)
   - Transcoding to different format
   - Resizing an image
   - Possibly reprocessing something spec

 - Under what circumstances/mechanisms is a reprocessing operation started?
   - Administrator presses button to re-submit failed upload
   - Administrator runs ./bin/gmg command

 - Does it use Celery?  In what ways?
   - What about from ./bin/gmg commands?
   - What about from the web interface?

 - How do you add reprocessing operations, and how to "check for
   eligibility"?
   
   Eligibility should probably be determined on a per-media-type,
   per-operation basis
   
 - ProcessingState possibly as a way of handling some of this
   (see mediagoblin/processing/__init__.py)

 - How do operations supply "options", and how are those options conveyed/set?
   
   Would be easy enough for ./bin/gmg commands... a lot harder for web
   interface

 - Do all reprocessing operations assume operating on a single file?

   If so that might make things a lot easier...

 - What if each step had logic that knew how to set up the next step
   and provide a "request" depending on the environment?
   
   ... isn't a request at that point a Celery task though?  What might
   ProcessingState do that a Celery task wouldn't currently encompass?

 - How to determine what file to use?
   Easy enough if we have the original... what if we don't?
   
   It should be up to the reprocessor to see if it has that available.


Example ./bin/gmg session:

: $ ./bin/gmg reprocess available 4315
: Available reprocessing actions for 4315 (video):
:  - initial_processing
: 
: $ ./bin/gmg reprocess available 1566
: Available reprocessing actions for 1566 (video):
:  - transcode
:  - generate_thumbnail
: 
: $ ./bin/gmg reprocess available 33
: Available reprocessing actions for 33 (image):
:  - resize
: 
: $ ./bin/gmg reprocess help video:transcode
: Transcode a video into some kind of format, something something
:
:   Options:
:     --size: blah blah
:     --format: blah blah
:
: # reprocess a single thingy
: $ ./bin/gmg reprocess run --eager transcode 33 --size 300x480 --format webm-vp9
: 
: # reprocess all available videos
: $ ./bin/gmg reprocess bulk_run --eager --type video transcode --size 300x480 --format webm-vp9
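
The "run" and "available" invocations in the session above could be
parsed roughly like this.  This is only a sketch using the standard
argparse module; the parser layout is an assumption, and hard-coding
--size and --format on "run" is a simplification (in a real design
each action would presumably declare its own options).

```python
import argparse


def build_parser():
    """Build a hypothetical parser for `gmg reprocess ...`."""
    parser = argparse.ArgumentParser(prog="gmg reprocess")
    subparsers = parser.add_subparsers(dest="subcommand")

    # gmg reprocess available <media_id>
    available = subparsers.add_parser("available")
    available.add_argument("media_id", type=int)

    # gmg reprocess run [--eager] <action> <media_id> [options]
    run = subparsers.add_parser("run")
    run.add_argument("--eager", action="store_true")
    run.add_argument("action")
    run.add_argument("media_id", type=int)
    # Assumed, hard-coded action options (see the note above).
    run.add_argument("--size")
    run.add_argument("--format")
    return parser
```

Calling build_parser().parse_args(...) with the arguments of the "run"
line from the session gives a namespace with action, media_id, size
and format filled in.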

So media types need to supply some kind of

  ProcessingAction

Which itself may be a celery task.

Additionally, some activities may chain together ProcessingActions.

Even more additionally!  What about files that need to be "copied down
locally", but already have been?  Eg, copying an image/video to the
workbench.  How to address this?

Here's another question: should we be using Celery subtasks?  Assuming
we do, and assuming Celery isn't running in always-eager mode, doesn't
that mean that if a task is broken into multiple parts, a new workbench
has to be constructed for each subtask?  How to handle this aspect of
subtasks?  Are subtasks really the right thing?

I'm not sure we want subtasks if the above is true.  And right now, I
think that if a part of processing fails, it's okay to redo the whole
thing of reprocessing.

So it seems we might want to use celery tasks, but not necessarily
celery subtasks... it might actually just be that we have a general
"processing" handler task, which is passed an argument saying which
processing handler to use.
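
A rough sketch of that "one general handler" shape, written as plain
Python so it runs without Celery.  The registry, the decorator and the
handler below are all hypothetical names; in a Celery setup,
run_processing would be the single function registered as a task.

```python
# Hypothetical sketch; the registry and all names below are assumptions.
PROCESSING_HANDLERS = {}


def register_handler(media_type, name):
    """Decorator: file a handler under (media_type, name)."""
    def decorator(func):
        PROCESSING_HANDLERS[(media_type, name)] = func
        return func
    return decorator


@register_handler("video", "transcode")
def transcode_video(media_id, **options):
    # A real handler would do the work; this one just reports it.
    return "transcoded %s to %s" % (media_id, options.get("format"))


def run_processing(media_type, name, media_id, **options):
    """The one general entry point: look up the handler and call it."""
    try:
        handler = PROCESSING_HANDLERS[(media_type, name)]
    except KeyError:
        raise ValueError(
            "No handler %r for media type %r" % (name, media_type))
    return handler(media_id, **options)
```

Keying the registry on the media type as well as the action name is
one way to keep actions bound to a media type.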

What if each processing action was a recipe, and it had a set of steps
that needed to be followed... and certain steps have prerequisites
that other steps have already been called?
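
As a sketch of the recipe idea: a recipe could be an ordered list of
step names, with each step declaring which steps must already have
run.  The step names and the prerequisite table are invented for
illustration only.

```python
# Hypothetical sketch: step names and prerequisites are made up.
STEP_PREREQUISITES = {
    "copy_to_workbench": [],
    "transcode": ["copy_to_workbench"],
    "generate_thumbnail": ["copy_to_workbench"],
    "store_public": ["transcode"],
}


def check_recipe(recipe):
    """Raise ValueError if a step runs before one of its prerequisites."""
    done = set()
    for step in recipe:
        missing = [p for p in STEP_PREREQUISITES[step] if p not in done]
        if missing:
            raise ValueError(
                "%s requires %s first" % (step, ", ".join(missing)))
        done.add(step)
    return True
```

Two recipes can then share "copy_to_workbench" without sharing all of
their other steps.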

Reprocessing in the sense of "failed initial processing" is then just
"run the entire initial processing task again, back to step 1".

I guess we could use subtasks, and even pickle things, or something,
but not sure why.

What about ProcessingState?  Is it still useful?
It does provide a bunch of the expected logic re: getting queued
filename, copy_original, store_public, etc.

I think we might figure that out at the time we code such a thing.


So what are things that a ProcessingAction would provide?

#+BEGIN_SRC python
  class ProcessingAction(object):
      """Hypothetical base class; not defined anywhere yet."""


  class MyProcessingAction(ProcessingAction):
      def check_eligible(self, media):
          """
          Check to see if this MediaEntry is eligible for processing
          via this action
          """
          pass

      def process(self, media):
          """
          Actually process this media entry
          """
          pass
#+END_SRC

... okay, but that doesn't handle several things:

 - registering the processing action as applicable to a specific
   media type
 - the actual "steps" involved in the processing.  Are the steps
   implemented as general functions?  Are they actually themselves
   ProcessingActions?

 - I think they should be simple functions maybe?  But it would be
   good to have them separated out, like

#+BEGIN_SRC python
#######################################
# PROCESSING STEPS
#   shared by various ProcessingActions
#######################################

#######################################
# PROCESSING ACTIONS
#######################################
#+END_SRC
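
Filled in, that layout might look something like the following.  This
is only a sketch under assumptions: the "context" dict, the workbench
path and every function name are hypothetical.  It also shows one
possible answer to the "already copied down locally" question, namely
that the copy step simply skips itself if the file is already there.

```python
#######################################
# PROCESSING STEPS (hypothetical)
#   shared by various ProcessingActions
#######################################

def copy_to_workbench(context):
    # Skip the copy if an earlier step already brought the file down.
    if "local_path" not in context:
        context["local_path"] = "/tmp/workbench/" + context["filename"]
        context["log"].append("copied")


def resize_image(context, width):
    copy_to_workbench(context)  # prerequisite step
    context["log"].append("resized to %d" % width)


def make_thumbnail(context):
    copy_to_workbench(context)  # prerequisite step, no-op the second time
    context["log"].append("thumbnail")


#######################################
# PROCESSING ACTIONS (hypothetical)
#######################################

def resize_action(filename, width):
    """Run the steps of a resize from start to finish."""
    context = {"filename": filename, "log": []}
    resize_image(context, width)
    make_thumbnail(context)
    return context["log"]
```

Here the steps are plain functions, and the action is just the thing
that strings them together in order.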

What about retrying something if it fails, like a failed upload?

Subtasks would really help there...
I think we'd have to start writing the processing steps and actions
out, then at that point we can more easily see if they should be
separate tools or not.
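
For comparison, retrying a single flaky step does not strictly need
subtasks; a plain retry loop inside the one big task is another
option.  The helper below is a hypothetical sketch (Celery also has
its own task retry mechanism, which this deliberately does not use).

```python
import time


def retry_step(step, attempts=3, delay=0.0):
    """Call step() until it succeeds or the attempts run out.

    Only IOError is treated as retryable here; that choice is an
    assumption for the sake of the example.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return step()
        except IOError as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error
```

Something like retry_step(upload_to_remote_storage), with a
hypothetical upload function, would retry just the upload while the
rest of the action still runs as one chunk.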