Note: For all of the time related escape sequences, a header with the key "timestamp" must exist among the headers of the event (unless hdfs.useLocalTimeStamp is set to true). One way to add this automatically is to use the TimestampInterceptor.

Name | Default | Description
channel | – |
type | – | The component type name, needs to be hdfs
hdfs.path | – | HDFS directory path (eg hdfs://namenode/flume/webdata/)
hdfs.filePrefix | FlumeData | Name prefixed to files created by Flume in hdfs directory
hdfs.fileSuffix | – | Suffix to append to file (eg .avro - NOTE: period is not automatically added)
hdfs.inUsePrefix | – | Prefix that is used for temporal files that flume actively writes into
hdfs.inUseSuffix | .tmp | Suffix that is used for temporal files that flume actively writes into
hdfs.rollInterval | 30 | Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollSize | 1024 | File size to trigger roll, in bytes (0 = never roll based on file size)
hdfs.rollCount | 10 | Number of events written to file before it rolled (0 = never roll based on number of events)
hdfs.idleTimeout | 0 | Timeout after which inactive files get closed (0 = disable automatic closing of idle files)
hdfs.batchSize | 100 | Number of events written to file before it is flushed to HDFS
hdfs.codeC | – | Compression codec, one of the following: gzip, bzip2, lzo, lzop, snappy
hdfs.fileType | SequenceFile | File format: currently SequenceFile, DataStream or CompressedStream. (1) DataStream
This interceptor inserts into the event headers the time in millis at which it processes the event. This interceptor inserts a header with key timestamp whose value is the relevant timestamp. This interceptor can preserve an existing timestamp if it is already present in the configuration.

Property Name | Default | Description
type | – | The component type name, has to be timestamp or the FQCN
preserveExisting | false | If the timestamp already exists, should it be preserved - true or false

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

Host Interceptor

This interceptor inserts the hostname or IP address of the host that this agent is running on. It inserts a header with key host (or a configured key) whose value is the hostname or IP address of the host, based on configuration.

Property Name | Default | Description
type | – | The component type name, has to be host
preserveExisting | false | If the host header already exists, should it be preserved - true or false
useIP | true | Use the IP address if true, else use hostname
hostHeader | host | The header key to be used

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = hostname

Static Interceptor

Static interceptor allows user to append a static header with a static value to all events.
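As an illustration in the same style as the interceptor examples above, a minimal sketch wiring the static interceptor into a source; the key and value settings are the interceptor's configurable header name and header value, and the names chosen here are arbitrary:

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK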
morphlineFile | – |
morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file
batchSize | 1000 | The maximum number of events to take per flume transaction
batchDurationMillis | 1000 | The maximum duration per flume transaction (ms). The transaction commits after this duration or when batchSize is exceeded, whichever comes first.
handlerClass | org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl | The FQCN of a class implementing org.apache.flume.sink.solr.morphline.MorphlineHandler

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
a1.sinks.k1.channel = c1
a1.sinks.k1.morphlineFile = /etc/flume-ng/conf/morphline.conf
# a1.sinks.k1.morphlineId = morphline1
# a1.sinks.k1.batchSize = 1000
# a1.sinks.k1.batchDurationMillis = 1000

ElasticSearchSink

This sink writes data to an elasticsearch cluster. By default, events will be written so that the Kibana graphical interface can display them - just as if logstash wrote them.

The elasticsearch and lucene-core jars required for your environment must be placed in the lib directory of the Apache Flume installation. Elasticsearch requires that the major version of the client JAR match that of the server and that both are running the same minor version of the JVM. SerializationExceptions will appear if this is incorrect. To select the required version, first determine the version of elasticsearch the target cluster is running.
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Warning: The problem with ExecSource and other asynchronous sources is that the source can not guarantee that if there is a failure to put the event into the Channel the client knows about it. In such cases, the data will be lost. As a for instance, one of the most commonly requested features is the tail -F [file]-like use case where an application writes to a log file on disk and Flume tails the file, sending each line as an event. While this is possible, there's an obvious problem: what happens if the channel fills up and Flume can't send an event? Flume has no way of indicating to the application writing the log file that it needs to retain the log or that the event hasn't been sent, for some reason. If this doesn't make sense, you need only know this: Your application can never guarantee data has been received when using a unidirectional asynchronous interface such as ExecSource! As an extension of this warning - and to be completely clear - there is absolutely zero guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling Directory Source or direct integration with Flume via the SDK.

Note: You can use ExecSource to emulate TailSource from Flume 0.9x (flume og). Just use the unix command tail -F /full/path/to/your/file. Parameter -F is better in this case than -f as it will also follow file rotation.
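With that caveat in mind, a minimal tail -F configuration looks along these lines (the log path here is hypothetical):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1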
e.g. America/Los_Angeles
hdfs.useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.
hdfs.closeTries | 0 | Number of times the sink must try to close a file. If set to 1, this sink will not re-try a failed close (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to close the file until the file is eventually closed (there is no limit on the number of times it would try).
hdfs.retryInterval | 180 | Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ".tmp" extension.
serializer | TEXT | Other possible options include avro_event or the fully qualified class name of an implementation of the EventSerializer.Builder interface.
serializer.* | – |

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

The above configuration will round down the timestamp to the last 10th minute.
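The roll triggers described earlier combine in the same way; as an illustrative sketch (not from the guide), the following rolls purely on file size by disabling the time-based and count-based triggers, with the path and the 128 MB figure chosen arbitrarily:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/
# Roll only on size: disable the time and event-count triggers
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0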
{
  "typeName1.componentName1" : {"metric1" : "metricValue1", "metric2" : "metricValue2"},
  "typeName2.componentName2" : {"metric3" : "metricValue3", "metric4" : "metricValue4"}
}

Here is an example:

{
  "CHANNEL.fileChannel" : {
    "EventPutSuccessCount" : "468085",
    "Type" : "CHANNEL",
    "StopTime" : "0",
    "EventPutAttemptCount" : "468086",
    "ChannelSize" : "233428",
    "StartTime" : "1344882233070",
    "EventTakeSuccessCount" : "458200",
    "ChannelCapacity" : "600000",
    "EventTakeAttemptCount" : "458288"
  },
  "CHANNEL.memChannel" : {
    "EventPutSuccessCount" : "22948908",
    "Type" : "CHANNEL",
    "StopTime" : "0",
    "EventPutAttemptCount" : "22948908",
    "ChannelSize" : "5",
    "StartTime" : "1344882209413",
    "EventTakeSuccessCount" : "22948900",
    "ChannelCapacity" : "100",
    "EventTakeAttemptCount" : "22948908"
  }
}

Property Name | Default | Description
type | – | The component type name, has to be http
port | 41414 | The port to start the server on

We can start Flume with JSON reporting support as follows:

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34545

Metrics will then be available at the http://<hostname>:<port>/metrics webpage. Custom components can report metrics as mentioned in the Ganglia section above.

Custom Reporting

It is possible to report metrics to other systems by writing servers that do the reporting. Any reporting class has to implement the interface org.apache.flume.instrumentation.MonitorService. Such a class can be used the same way the GangliaServer is used for reporting. They can poll the platform MBean server for the MBeans' metrics.
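As a quick sanity check that the JSON endpoint is serving, any HTTP client will do; for example, with the port configured above and the agent running locally:

$ curl http://localhost:34545/metrics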
Component Interface:

org.apache.flume.Sink
org.apache.flume.ChannelSelector
org.apache.flume.SinkProcessor
org.apache.flume.interceptor.Interceptor
org.apache.flume.channel.file.encryption.KeyProvider$Builder
org.apache.flume.channel.file.encryption.CipherProvider
org.apache.flume.serialization.EventSerializer$Builder

Alias Conventions

Type Alias:

avro, netcat, seq, exec, syslogtcp, multiport_syslogtcp, syslogudp, spooldir, http, thrift, jms, null, logger, avro, hdfs, hbase, asynchbase, elasticsearch, file_roll, irc, thrift, replicating, multiplexing, default, failover, load_balance, timestamp, host, static, regex_filter, regex_extractor, jceksfile, aesctrnopadding, text, avro_event

Implementation Class:

org.example.MyChannel, org.apache.flume.source.AvroSource, org.apache.flume.source.NetcatSource, org.apache.flume.source.SequenceGeneratorSource, org.apache.flume.source.ExecSource, o
this source will return a HTTP status of 400. If the channel is full, or the source is unable to append events to the channel, the source will return a HTTP 503 - Temporarily unavailable status.

All events sent in one post request are considered to be one batch and inserted into the channel in one transaction.

Property Name | Default | Description
type | – | The component type name, needs to be http
port | – | The port the source should bind to
bind | 0.0.0.0 | The hostname or IP address to listen on
handler | org.apache.flume.source.http.JSONHandler | The FQCN of the handler class
handler.* | – | Config parameters for the handler
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |
enableSSL | false | Set the property true to enable SSL. HTTP Source does not support SSLv3.
excludeProtocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded.
keystore | – | Location of the keystore, including keystore file name
keystorePassword | – | Keystore password

For example, a http source for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler
a1.sources.r1.handler.nickname = random props

JSONHandler

A handler is provided out of the box which can handle events represented in JSON format, and supports UTF-
8, UTF-16 and UTF-32 character sets. The handler accepts an array of events (even if there is only one event, the event has to be sent in an array) and converts them to a Flume event based on the encoding specified in the request. If no encoding is specified, UTF-8 is assumed. The JSON handler supports UTF-8, UTF-16 and UTF-32. Events are represented as follows:

[{
  "headers" : {
    "timestamp" : "434324343",
    "host" : "random_host.example.com"
  },
  "body" : "random_body"
},
{
  "headers" : {
    "namenode" : "namenode.example.com",
    "datanode" : "random_datanode.example.com"
  },
  "body" : "really_random_body"
}]

To set the charset, the request must have the content type specified as application/json;charset=UTF-8 (replace UTF-8 with UTF-16 or UTF-32 as required).

One way to create an event in the format expected by this handler is to use JSONEvent provided in the Flume SDK and use Google Gson to create the JSON string using the Gson#toJson(Object, Type) method. The type token to pass as the 2nd argument of this method for list of events can be created by:

Type type = new TypeToken<List<JSONEvent>>() {}.getType();

BlobHandler

By default HTTPSource splits JSON input into Flume events. As an alternative, BlobHandler is a handler for HTTPSource that returns an event that contains the request parameters as well as the Binary Large Object (BLOB) uploaded with this request, for example a PDF or JPG file. Note that this approach is not suitable for very large objects because it buffers up the entire BLOB in RAM.
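Returning to the JSON handler, a one-event batch can be posted to the source from the command line; the host, port, and payload below are made up for illustration:

$ curl -X POST -H 'Content-Type: application/json;charset=UTF-8' \
    -d '[{"headers": {"host": "host1"}, "body": "hello flume"}]' \
    http://localhost:5140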
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

The selector will attempt to write to the required channels first and will fail the transaction if even one of these channels fails to consume the events. The transaction is reattempted on all of the channels. Once all required channels have consumed the events, then the selector will attempt to write to the optional channels. A failure by any of the optional channels to consume the event is simply ignored and not retried.

If there is an overlap between the optional channels and required channels for a specific header, the channel is considered to be required, and a failure in the channel will cause the entire set of required channels to be retried. For instance, in the above example, for the header "CA" mem-channel-1 is considered to be a required channel even though it is marked both as required and optional, and a failure to write to this channel will cause that event to be retried on all channels configured for the selector.

Note that if a header does not have any required channels, then the event will be written to the default channels and will be attempted to be written to
-alias key-1 -keyalg AES -keysize 128 -validity 9000 \
  -keystore src/test/resources/test.keystore -storetype jceks \
  -storepass keyStorePassword

a1.channels.c1.encryption.activeKey = key-0
a1.channels.c1.encryption.cipherProvider = AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider = key-provider-0
a1.channels.c1.encryption.keyProvider = JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
a1.channels.c1.encryption.keyProvider.keys = key-0

Let's say you have aged key-0 out and new files should be encrypted with key-1:

a1.channels.c1.encryption.activeKey = key-1
a1.channels.c1.encryption.cipherProvider = AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider = JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
a1.channels.c1.encryption.keyProvider.keys = key-0 key-1

The same scenario as above, however key-0 has its own password:

a1.channels.c1.encryption.activeKey = key-1
a1.channels.c1.encryption.cipherProvider = AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider = JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
available; this is preferable over an auto-generated UUID because it enables subsequent updates and deletes of the event in data stores, using said well-known application-level key.

Property Name | Default | Description
type | – | The component type name has to be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
headerName | id | The name of the Flume header to modify
preserveExisting | true | If the UUID header already exists, should it be preserved - true or false
prefix | "" | The prefix string constant to prepend to each generated UUID

Morphline Interceptor

This interceptor filters the events through a morphline configuration file that defines a chain of transformation commands that pipe records from one command to another. For example the morphline can ignore certain events or alter or insert certain event headers via regular expression based pattern matching, or it can auto-detect and set a MIME type via Apache Tika on events that are intercepted. For example, this kind of packet sniffing can be used for content based dynamic routing in a Flume topology. MorphlineInterceptor can also help to implement dynamic routing to multiple Apache Solr collections (e.g. for multi-tenancy).

Currently, there is a restriction in that the morphline of an interceptor must not generate more than one output record for each input event. This interceptor is not intended for heavy duty ETL processing - if you need this consider moving ETL processing from the Flume Source to a Flume Sink.
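Returning to the UUID interceptor above, an illustrative sketch wiring it into a source; the component names are arbitrary and the settings simply restate the table's defaults:

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.r1.interceptors.i1.headerName = id
a1.sources.r1.interceptors.i1.preserveExisting = true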
Reporting classes can poll the platform MBean server for the MBeans' metrics. For example, an HTTP monitoring service called HTTPReporting could be used as follows:

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=com.example.reporting.HTTPReporting -Dflume.monitoring.no

Property Name | Default | Description
type | – | The component type name, has to be FQCN

Reporting metrics from custom components

Any custom flume components should inherit from the org.apache.flume.instrumentation.MonitoredCounterGroup class. The class should then provide getter methods for each of the metrics it exposes. See the code below. The MonitoredCounterGroup expects a list of attributes whose metrics are exposed by this class. As of now, this class only supports exposing metrics as long values.

public class SinkCounter extends MonitoredCounterGroup implements SinkCounterMBean {

  private static final String COUNTER_CONNECTION_CREATED =
      "sink.connection.creation.count";

  private static final String COUNTER_CONNECTION_CLOSED =
      "sink.connection.closed.count";

  private static final String COUNTER_CONNECTION_FAILED =
      "sink.connection.failed.count";

  private static final String COUNTER_BATCH_EMPTY =
      "sink.batch.empty";

  private static final String COUNTER_BATCH_UNDERFLOW =
      "sink.batch.underflow";

  private static final String COUNTER_BATCH_COMPLETE =
      "sink.batch.complete";

  private static final String COUNTER_EVENT_DRAIN_ATTEMPT =
      "sink.event.drain.attempt";
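The listing above breaks off inside the counter-name constants. To round out the pattern, a minimal sketch of the getter and increment methods such a class provides; it assumes MonitoredCounterGroup's protected get/increment helpers keyed by counter name, and the method names mirror the real SinkCounter but the snippet is abridged, not verbatim:

  // Getter exposed through the SinkCounterMBean interface.
  public long getConnectionCreatedCount() {
    return get(COUNTER_CONNECTION_CREATED);
  }

  // Increment helper a sink calls whenever it opens a connection.
  public long incrementConnectionCreatedCount() {
    return increment(COUNTER_CONNECTION_CREATED);
  }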
cases that stream raw data into HDFS (via the HdfsSink) and simultaneously extract, transform and load the same data into Solr (via MorphlineSolrSink). In particular, this sink can process arbitrary heterogeneous raw data from disparate data sources and turn it into a data model that is useful to Search applications.

The ETL functionality is customizable using a morphline configuration file that defines a chain of transformation commands that pipe event records from one command to another.

Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with streams of generic records, including arbitrary binary payloads. A morphline command is a bit like a Flume Interceptor. Morphlines can be embedded into Hadoop components such as Flume.

Commands to parse and transform a set of standard data formats such as log files, Avro, CSV, Text, HTML, XML, PDF, Word, Excel, etc. are provided out of the box, and additional custom commands and parsers for additional data formats can be added as morphline plugins. Any kind of data format can be indexed, any Solr documents for any kind of Solr schema can be generated, and any custom ETL logic can be registered and executed.

Morphlines manipulate continuous streams of records. The data model can be described as follows: A record is a set of named fields where each field has an ordered list of one or more values. A value can be any Java Object. That is, a record is essentially a hash table.
flow over multiple sinks. It maintains an indexed list of active sinks on which the load must be distributed. The implementation supports distributing load using either round_robin or random selection mechanisms. The choice of selection mechanism defaults to round_robin, but can be overridden via configuration. Custom selection mechanisms are supported via custom classes that inherit from AbstractSinkSelector.

When invoked, this selector picks the next sink using its configured selection mechanism and invokes it. For round_robin and random, in case the selected sink fails to deliver the event, the processor picks the next available sink via its configured selection mechanism. This implementation does not blacklist the failing sink and instead continues to optimistically attempt every available sink. If all sink invocations result in failure, the selector propagates the failure to the sink runner.

If backoff is enabled, the sink processor will blacklist sinks that fail, removing them from selection for a given timeout. When the timeout ends, if the sink is still unresponsive the timeout is increased exponentially to avoid potentially getting stuck in long waits on unresponsive sinks. With this disabled, in round-robin all the failed sink's load will be passed to the next sink in line and thus not evenly balanced.

Required properties are in bold.

Property Name | Default | Description
processor.sinks | – | Space-separated list of sinks that are participating in the group
groups using a specified regular expression and appends the match groups as headers on the event. It also supports pluggable serializers for formatting the match groups before adding them as event headers.

Property Name | Default | Description
type | – | The component type name has to be regex_extractor
regex | – | Regular expression for matching against events
serializers | – | Space-separated list of serializers for mapping matches to header names and serializing their values (see example below). Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer, org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
serializers.<s1>.type | default | Must be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer
serializers.<s1>.name | – |
serializers.* | – | Serializer-specific properties

The serializers are used to map the matches to a header name and a formatted header value; by default, you only need to specify the header name and the default org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer will be used. This serializer simply maps the matches to the specified header name and passes the value through as it was extracted by the regex.
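To make the mapping concrete, a small sketch (the regex and header names are invented for illustration): if an event body contained 1:2:3.4foobar5, the configuration below would add the headers one=1, two=2 and three=3 using the default pass-through serializer:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three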
a1.sinks.k1.indexType = bar_type
a1.sinks.k1.clusterName = foobar_cluster
a1.sinks.k1.batchSize = 500
a1.sinks.k1.ttl = 5d
a1.sinks.k1.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer
a1.sinks.k1.channel = c1

Kite Dataset Sink (experimental)

Warning: This sink is experimental and may change between minor versions of Flume. Use at your own risk.

Experimental sink that writes events to a Kite Dataset. This sink will deserialize the body of each incoming event and store the resulting record in a Kite Dataset. It determines the target Dataset by opening a repository URI, kite.repo.uri, and loading a Dataset by name, kite.dataset.name.

The only supported serialization is avro, and the record schema must be passed in the event headers, using either flume.avro.schema.literal with the JSON schema representation or flume.avro.schema.url with a URL where the schema may be found (hdfs:/... URIs are supported). This is compatible with the Log4jAppender flume client and the spooling directory source's Avro deserializer using deserializer.schemaType = LITERAL.

Note 1: The flume.avro.schema.hash header is not supported. Note 2: In some cases, file rolling may occur slightly after the roll interval has been exceeded. However, this delay will not exceed 5 seconds. In most cases, the delay is negligible.

Property Name | Default | Description
channel | – |
type | – | Must be org.apache.flume.sink.kite.DatasetSink
kite.repo.uri | – | URI of the repository to open
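A minimal configuration sketch, assuming a dataset named "events" in an HDFS-backed repository; the URIs and names are hypothetical:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
a1.sinks.k1.channel = c1
a1.sinks.k1.kite.repo.uri = repo:hdfs://namenode:8020/data
a1.sinks.k1.kite.dataset.name = events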
is turned into a Flume event and sent via the connected channel. Required properties are in bold.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be netcat
bind | – | Host name or IP address to bind to
port | – | Port # to bind to
max-line-length | 512 | Max line length per event body (in bytes)
ack-every-event | true | Respond with an "OK" for every event received
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1

Sequence Generator Source

A simple sequence generator that continuously generates events with a counter that starts from 0 and increments by 1. Useful mainly for testing. Required properties are in bold.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be seq
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |
batchSize | 1 |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.channels = c1

Syslog Sources

Reads syslog data and generates Flume events. The UDP source treats an entire message as a single event.
its log file and stop processing.

To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to log file names when they are moved into the spooling directory.

Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent with the guarantees offered by other Flume components.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be spooldir
spoolDir | – | The directory from which to read files
fileSuffix | .COMPLETED | Suffix to append to completely ingested files
deletePolicy | never | When to delete completed files: never or immediate
fileHeader | false | Whether to add a header storing the absolute path filename
fileHeaderKey | file | Header key to use when appending absolute path filename to event header
basenameHeader | false | Whether to add a header storing the basename of the file
basenameHeaderKey | basename | Header key to use when appending basename of file to event header
ignorePattern | ^$ | Regular expression specifying which files to ignore (skip)
trackerDir | .flumespool | Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
consumeOrder | oldest | In which order files in the spooling directory will be consumed: oldest, youngest and random. In case of oldest and youngest, the last modified
a1.channels.c1.encryption.keyProvider.keys = key-0 key-1
a1.channels.c1.encryption.keyProvider.keys.key-0.passwordFile = /path/to/key-0.password

Spillable Memory Channel

The events are stored in an in-memory queue and on disk. The in-memory queue serves as the primary store and the disk as overflow. The disk store is managed using an embedded File channel. When the in-memory queue is full, additional incoming events are stored in the file channel. This channel is ideal for flows that need the high throughput of a memory channel during normal operation, but at the same time need the larger capacity of the file channel for better tolerance of intermittent sink-side outages or drops in drain rates. The throughput will reduce approximately to file channel speeds during such abnormal situations. In case of an agent crash or restart, only the events stored on disk are recovered when the agent comes online. This channel is currently experimental and not recommended for use in production.

Required properties are in bold. Please refer to the file channel for additional required properties.

Property Name | Default | Description
type | – | The component type name, needs to be SPILLABLEMEMORY
memoryCapacity | 10000 | Maximum number of events stored in the memory queue. To disable use of the in-memory queue, set this to zero.
overflowCapacity | 100000000 | Maximum number of events stored in overflow disk (i.e. File channel). To disable use of overflow, set this to zero.
overflowTimeout | 3 | The number of seconds to wait before enabling disk overflow when memory fills up.
the default channels, if no required channels are specified. If no channels are designated as default and there are no required, the selector will attempt to write the events to the optional channels. Any failures are simply ignored in that case.

Flume Sources

Avro Source

Listens on Avro port and receives events from external Avro client streams. When paired with the built-in Avro Sink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties are in bold.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be avro
bind | – | hostname or IP address to listen on
port | – | Port # to bind to
threads | – | Maximum number of worker threads to spawn
selector.type | – |
selector.* | – |
interceptors | – | Space-separated list of interceptors
interceptors.* | – |
compression-type | none | This can be "none" or "deflate". The compression-type must match the compression-type of the matching AvroSink.
ssl | false | Set this to true to enable SSL encryption. You must also specify a "keystore" and a "keystore-password".
keystore | – | This is the path to a Java keystore file. Required for SSL.
keystore-password | – | The password for the Java keystore. Required for SSL.
keystore-type | JKS | The type of the Java keystore. This can be "JKS" or "PKCS12".
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
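For instance, an Avro source for agent named a1, listening on all interfaces on port 4141 (the port is chosen arbitrarily here):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141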
processor.type | default | The component type name, needs to be load_balance
processor.backoff | false | Should failed sinks be backed off exponentially.
processor.selector | round_robin | Selection mechanism. Must be either round_robin, random or the FQCN of a custom class that inherits from AbstractSinkSelector
processor.selector.maxTimeOut | 30000 | Used by backoff selectors to limit exponential backoff (in milliseconds)

Example for agent named a1:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

Custom Sink Processor

Custom sink processors are not supported at the moment.

Event Serializers

The file_roll sink and the hdfs sink both support the EventSerializer interface. Details of the EventSerializers that ship with Flume are provided below.

Body Text Serializer

Alias: text. This serializer writes the body of the event to an output stream without any transformation or modification. The event headers are ignored. Configuration options are as follows:

Property Name | Default | Description
appendNewline | true | Whether a newline will be appended to each event at write time. The default of true assumes that events do not contain newlines, for legacy reasons.

Example for agent named a1:

a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume
a1.sinks.k1.sink.serializer = text
time of the files will be used to compare the files. In case of a tie, the file with the smallest lexicographical order will be consumed first. In case of random, any file will be picked randomly. When using oldest and youngest, the whole directory will be scanned to pick the oldest/youngest file, which might be slow if there are a large number of files, while using random may cause old files to be consumed very late if new files keep coming in the spooling directory.
maxBackoff | 4000 | The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, up to the value specified by this parameter.
batchSize | 100 | Granularity at which to batch transfer to the channel
inputCharset | UTF-8 | Character set used by deserializers that treat the input file as text
decodeErrorPolicy | FAIL | What to do when we see a non-decodable character in the input file. FAIL: Throw an exception and fail to parse the file. REPLACE: Replace the unparseable character with the "replacement character" char, typically Unicode U+FFFD. IGNORE: Drop the unparseable character sequence.
deserializer | LINE | Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
deserializer.* | – | Varies per event deserializer
bufferMaxLines
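Pulling the main options together, a typical spooling directory source might look like this; the directory path is illustrative:

a1.channels = ch-1
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true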
using a shell script called flume-ng which is located in the bin directory of the Flume distribution. You need to specify the agent name, the config directory, and the config file on the command line:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

Now the agent will start running the source and sinks configured in the given properties file.

A simple example

Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file might define several named agents.
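Assuming the file above is saved as example.conf, the agent can be launched along these lines; the -Dflume.root.logger override simply surfaces the logger sink's output on the console:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

From a second terminal you can then telnet localhost 44444 and type a line of text; it should appear in the agent's console output as a logged event.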
will not compress the output file, and please don't set codeC. (2) CompressedStream requires setting hdfs.codeC with an available codeC.

hdfs.maxOpenFiles | 5000 | Allow only this number of open files. If this number is exceeded, the oldest file is closed.
hdfs.minBlockReplicas | – | Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath.
hdfs.writeFormat | – | Format for sequence file records. One of "Text" or "Writable" (the default).
hdfs.callTimeout | 10000 | Number of milliseconds allowed for HDFS operations, such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring.
hdfs.threadsPoolSize | 10 | Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
hdfs.rollTimerPoolSize | 1 | Number of threads per HDFS sink for scheduling timed file rolling
hdfs.kerberosPrincipal | – | Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab | – | Kerberos keytab for accessing secure HDFS
hdfs.proxyUser | – |
hdfs.round | false | Should the timestamp be rounded down (if true, affects all time based escape sequences except %t)
hdfs.roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time
hdfs.roundUnit | second | The unit of the round down value - second, minute or hour
hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path,
FS sink Kerberos-related options, if the underlying HDFS is running in secure mode; please refer to the HDFS Sink section.

Monitoring

Monitoring in Flume is still a work in progress. Changes can happen very often. Several Flume components report metrics to the JMX platform MBean server. These metrics can be queried using JConsole.

Ganglia Reporting

Flume can also report these metrics to Ganglia 3 or Ganglia 3.1 metanodes. To report metrics to Ganglia, a flume agent must be started with this support. The Flume agent has to be started by passing in the following parameters as system properties prefixed by flume.monitoring., and can be specified in flume-env.sh:

Property Name | Default | Description
type | – | The component type name, has to be ganglia
hosts | – | Comma-separated list of hostname:port of Ganglia servers
pollFrequency | 60 | Time, in seconds, between consecutive reporting to the Ganglia server
isGanglia3 | false | Ganglia server version is 3. By default, Flume sends in Ganglia 3.1 format.

We can start Flume with Ganglia support as follows:

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=com.example:1234,com.exa

JSON Reporting

Flume can also report metrics in a JSON format. To enable reporting in JSON format, Flume hosts a Web server on a configurable port. Flume reports metrics in the following JSON format:

{
  "typeName1.componentName1" : { ... },
  "typeName2.componentName2" : { ... }
}
Flume 1.5.2 User Guide

Introduction

Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.

Apache Flume is a top level project at the Apache Software Foundation.

There are currently two release code lines available, versions 0.9.x and 1.x.

Documentation for the 0.9.x track is available at the Flume 0.9.x User Guide.

This documentation applies to the 1.4.x track.

New and existing users are encouraged to use the 1.x releases so as to leverage the performance improvements and configuration flexibilities available in the latest architecture.

System Requirements

1. Java Runtime Environment - Java 1.6 or later (Java 1.7 recommended)
2. Memory - Sufficient memory for configurations used by sources, channels or sinks
3. Disk Space - Sufficient disk space for configurations used by channels or sinks
4. Directory Permissions - Read/Write permissions for directories used by agent

Architecture

Data flow model

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes.
Flume Sink, e.g. to a MorphlineSolrSink.

Required properties are in bold.

Property Name | Default | Description
type | – | The component type name has to be org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
morphlineFile | – | The relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf
morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file

Sample flume.conf file:

a1.sources.avroSrc.interceptors = morphlineinterceptor
a1.sources.avroSrc.interceptors.morphlineinterceptor.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineFile = /etc/flume-ng/conf/morphline.conf
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineId = morphline1

Regex Filtering Interceptor

This interceptor filters events selectively by interpreting the event body as text and matching the text against a configured regular expression. The supplied regular expression can be used to include events or exclude events.

Property Name | Default | Description
type | – | The component type name has to be regex_filter
regex | ".*" | Regular expression for matching against events
excludeEvents | false | If true, regex determines events to exclude; otherwise regex determines events to include.

Regex Extractor Interceptor

This interceptor extracts regex match groups
While Flume ships with many out-of-the-box sources, channels, sinks, serializers, and the like, many implementations exist which ship separately from Flume.

While it has always been possible to include custom Flume components by adding their jars to the FLUME_CLASSPATH variable in the flume-env.sh file, Flume now supports a special directory called plugins.d which automatically picks up plugins that are packaged in a specific format. This allows for easier management of plugin packaging issues as well as simpler debugging and troubleshooting of several classes of issues, especially library dependency conflicts.

The plugins.d directory

The plugins.d directory is located at $FLUME_HOME/plugins.d. At startup time, the flume-ng start script looks in the plugins.d directory for plugins that conform to the below format and includes them in proper paths when starting up java.

Directory layout for plugins

Each plugin (subdirectory) within plugins.d can have up to three sub-directories:

1. lib - the plugin's jar(s)
2. libext - the plugin's dependency jar(s)
3. native - any required native libraries, such as .so files

Example of two plugins within the plugins.d directory:

plugins.d/
plugins.d/custom-source-1/
plugins.d/custom-source-1/lib/my-source.jar
plugins.d/custom-source-1/libext/spring-core-2.5.6.jar
plugins.d/custom-source-2/
plugins.d/custom-source-2/lib/custom.jar
plugins.d/custom-source-2/native/gettext.so

Data ingestion

Flume supports a number of mechanisms to ingest data from external sources.
a persistent storage that's backed by a database. The JDBC channel currently supports embedded Derby. This is a durable channel that's ideal for flows where recoverability is important. Required properties are in bold.

Property Name | Default | Description
type | – | The component type name, needs to be jdbc
db.type | DERBY | Database vendor
driver.class | org.apache.derby.jdbc.EmbeddedDriver |
driver.url | (constructed from other properties) |
db.username | "sa" |
db.password | – |
connection.properties.file | – |
create.schema | true |
create.index | true |
create.foreignkey | true |
transaction.isolation | "READ_COMMITTED" |
maximum.connections | 10 |
maximum.capacity | 0 (unlimited) |
sysprop.* | – |
sysprop.user.home | – |

Example for agent named a1:

a1.channels = c1
a1.channels.c1.type = jdbc

File Channel

Required properties are in bold.

Property Name | Default
type | –
checkpointDir | ~/.flume/file-channel/checkpoint
useDualCheckpoints | false
backupCheckpointDir | –
dataDirs | ~/.flume/file-channel/data
transactionCapacity | 10000
checkpointInterval | 30000
maxFileSize | 2146435071
minimumRequiredSpace | 524288000
capacity | 1000000
keep-alive | 3
use-log-replay-v1 | false
use-fast-replay | false
encryption.activeKey | –
encryption.cipherProvider | –
encryption.keyProvider | –
encryption.keyProvider.keyStoreFile | –
encryption.keyProvider.keyStorePasswordFile | –
encryption.keyProvider.keys | –
encryption.keyProvider.keys.*.passwordFile | –
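In the same style as the JDBC example above, a minimal durable file channel can be sketched like this; the checkpoint and data directories are illustrative and should live on disks the agent can write to:

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data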
ample, a multiport syslog TCP source for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.channels = c1
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.portHeader = port

Syslog UDP Source

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be syslogudp
host | – | Host name or IP address to bind to
port | – | Port # to bind to
keepFields | false | Setting this to true will preserve the Priority, Timestamp and Hostname in the body of the event.
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

For example, a syslog UDP source for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

HTTP Source

A source which accepts Flume Events by HTTP POST and GET. GET should be used for experimentation only. HTTP requests are converted into flume events by a pluggable "handler" which must implement the HTTPSourceHandler interface. This handler takes a HttpServletRequest and returns a list of flume events. All events handled from one Http request are committed to the channel in one transaction, thus allowing for increased efficiency on channels like the file channel. If the handler throws an exception,
capacity | 100 | The maximum number of events stored in the channel
transactionCapacity | 100 | The maximum number of events the channel will take from a source or give to a sink per transaction
keep-alive | 3 | Timeout in seconds for adding or removing an event
byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity | see description | Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source), then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.

Example for agent named a1:

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

JDBC Channel

The events are stored in
can be - for instance, it cannot be larger than what you can store in memory or on disk on a single machine - but in practice flume events can be everything from textual log entries to image files. The key property of an event is that events are generated in a continuous, streaming fashion. If your data is not regularly generated (i.e. you are trying to do a single bulk load of data into a Hadoop cluster) then Flume will still work, but it is probably overkill for your situation. Flume likes relatively stable topologies. Your topologies do not need to be immutable, because Flume can deal with changes in topology without losing data and can also tolerate periodic reconfiguration due to fail-over or provisioning. It probably won't work well if you plan to change topologies every day, because reconfiguration takes some thought and overhead.

Flow reliability in Flume

The reliability of a Flume flow depends on several factors. By adjusting these factors, you can achieve a wide array of reliability options with Flume:

What type of channel you use. Flume has both durable channels (those which will persist data to disk) and non-durable channels (those which will lose data if a machine fails). Durable channels use disk-based storage, and data stored in such channels will persist across machine restarts or non-disk-related failures.

Whether your channels are sufficiently provisioned for the workload. Channels in Flume act as buffers at various hops. These buffers
channel-2

# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

The selector checks for a header called "State". If the value is "CA" then it's sent to mem-channel-1, if it's "AZ" then it goes to file-channel-2, or if it's "NY" then both. If the "State" header is not set or doesn't match any of the three, then it goes to mem-channel-1, which is designated as "default".

The selector also supports optional channels. To specify optional channels for a header, the config parameter "optional" is used in the following way:

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping
channel is full, then it will reset and retry from the most recent Avro container file sync point. To reduce potential event duplication in such a failure scenario, write sync markers more frequently in your Avro input files.

Property Name | Default | Description
deserializer.schemaType | HASH | How the schema is represented. By default, or when the value HASH is specified, the Avro schema is hashed and the hash is stored in every event in the event header "flume.avro.schema.hash". If LITERAL is specified, the JSON-encoded schema itself is stored in every event in the event header "flume.avro.schema.literal". Using LITERAL mode is relatively inefficient compared to HASH mode.

BlobDeserializer

This deserializer reads a Binary Large Object (BLOB) per event, typically one BLOB per file, for example a PDF or JPG file. Note that this approach is not suitable for very large objects because the entire BLOB is buffered in RAM.

Property Name | Default | Description
deserializer | – | The FQCN of this class: org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
deserializer.maxBlobLength | 100000000 | The maximum number of bytes to read and buffer for a given request

NetCat Source

A netcat-like source that listens on a given port and turns each line of text into an event. Acts like nc -k -l [host] [port]. In other words, it opens a specified port and listens for data. The expectation is that the supplied data is newline-separated text. Each line of text
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = thrift
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

IRC Sink

The IRC sink takes messages from the attached channel and relays those to configured IRC destinations. Required properties are in bold.

Property Name | Default | Description
channel | – |
type | – | The component type name, needs to be irc
hostname | – | The hostname or IP address to connect to
port | 6667 | The port number of the remote host to connect to
nick | – | Nick name
user | – | User name
password | – | User password
chan | – | channel
name | – |
splitlines | – | (boolean)
splitchars | \n | line separator (if you were to enter the default value into the config file, then you would need to escape the backslash, like this: "\\n")

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = irc
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = irc.yourdomain.com
a1.sinks.k1.nick = flume
a1.sinks.k1.chan = #flume

File Roll Sink

Stores events on the local filesystem. Required properties are in bold.

Property Name | Default | Description
channel | – |
type | – | The component type name, needs to be file_roll
sink.directory | – | The directory where files will be stored
sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file.
sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of the EventSerializer.Builder interface
The TCP sources create a new event for each string of characters separated by a newline ("\n").

Required properties are in bold.

Syslog TCP Source

The original, tried-and-true syslog TCP source.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be syslogtcp
host | – | Host name or IP address to bind to
port | – | Port # to bind to
eventSize | 2500 | Maximum size of a single event line, in bytes
keepFields | false | Setting this to true will preserve the Priority, Timestamp and Hostname in the body of the event.
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

For example, a syslog TCP source for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

Multiport Syslog TCP Source

This is a newer, faster, multi-port capable version of the Syslog TCP source. Note that the ports configuration setting has replaced port. Multi-port capability means that it can listen on many ports at once in an efficient manner. This source uses the Apache Mina library to do that. Provides support for RFC-3164 and many common RFC-5424 formatted messages. Also provides the capability to configure the character set used on a per-port basis.

Property Name | Default | Description
overflowTimeout | 3 | The number of seconds to wait before enabling disk overflow when memory fills up
byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity | see description | Maximum bytes of memory allowed as a sum of all events in the memory queue. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source), then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.
avgEventSize | 500 | Estimated average size of events, in bytes, going into the channel
<file channel properties> | see file channel | Any file channel property with the exception of "keep-alive" and "capacity" can be used. The keep-alive of the file channel is managed by the Spillable Memory Channel. Use "overflowCapacity" to set the File channel's capacity.

The in-memory queue is considered full if either the memoryCapacity or byteCapacity limit is reached.

Example for agent named a1:
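A sketch consistent with the properties above; the capacities and overflow directories are illustrative, not prescriptive:

a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.byteCapacity = 800000
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data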
The type of the custom channel is its FQCN. Required properties are in bold:

Property Name | Default | Description
type | – | The component type name, needs to be a FQCN

Example for agent named a1:

a1.channels = c1
a1.channels.c1.type = org.example.MyChannel

Flume Channel Selectors

If the type is not specified, then it defaults to "replicating".

Replicating Channel Selector (default)

Required properties are in bold:

Property Name | Default | Description
selector.type | replicating | The component type name, needs to be replicating
selector.optional | – | Set of channels to be marked as optional

Example for agent named a1 and its source called r1:

a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3

In the above configuration, c3 is an optional channel. Failure to write to c3 is simply ignored. Since c1 and c2 are not marked optional, failure to write to those channels will cause the transaction to fail.

Multiplexing Channel Selector

Required properties are in bold:

Property Name | Default | Description
selector.type | replicating | The component type name, needs to be multiplexing
selector.header | flume.selector.header |
selector.default | – |
selector.mapping.* | – |

Example for agent named a1 and its source called r1:

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
ElasticSearchSink
hostNames | – | Comma-separated list of hostname:port; if the port is not present the default port '9300' will be used
indexName | flume | The name of the index which the date will be appended to. Example: 'flume' -> 'flume-yyyy-MM-dd'
indexType | logs | The type to index the document to, defaults to 'log'
clusterName | elasticsearch | Name of the ElasticSearch cluster to connect to
batchSize | 100 | Number of events to be written per txn.
ttl | – | TTL in days; when set, will cause the expired documents to be deleted automatically. If not set, documents will never be automatically deleted. TTL is accepted both in the earlier form of integer only, e.g. a1.sinks.k1.ttl = 5, and also with a qualifier ms (millisecond), s (second), m (minute), h (hour), d (day) and w (week). Example: a1.sinks.k1.ttl = 5d will set TTL to 5 days. Follow http://www.elasticsearch.org/guide/reference/mapping/ttl-field/ for more information.
serializer | org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer | The ElasticSearchIndexRequestBuilderFactory or ElasticSearchEventSerializer to use. Implementations of either class are accepted but ElasticSearchIndexRequestBuilderFactory is preferred.
serializer.* | – | Properties to be passed to the serializer.

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = elasticsearch
a1.sinks.k1.hostNames = 127.0.0.1:9200,127.0.0.2:9300
a1.sinks.k1.indexName = foo_index
a1.sinks.k1.
without having to restart the agent.
compression-type | none | This can be 'none' or 'deflate'. The compression-type must match the compression-type of the matching AvroSource.
compression-level | 6 | The level of compression to compress the event. 0 = no compression and 1-9 is compression. The higher the number, the more compression.
ssl | false | Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a 'truststore', 'truststore-password', 'truststore-type', and specify whether to 'trust-all-certs'.
trust-all-certs | false | If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and 'listen in' on the encrypted connection.
truststore | – | The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source's SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically 'jssecacerts' or 'cacerts' in the Oracle JRE) will be used.
truststore-password | – | The password for the specified truststore.
truststore-type | JKS | The type of the Java truststore. This can be 'JKS' or another supported Java truststore type.
exclude-protocols | SSLv2Hello | Space-separated list of SSL/TLS protocols to exclude
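For instance, a minimal sketch of an SSL-enabled Avro sink using these options (the truststore path and password are hypothetical):

a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
a1.sinks.k1.ssl = true
# hypothetical truststore location and password
a1.sinks.k1.truststore = /path/to/truststore.jks
a1.sinks.k1.truststore-password = changeit
a1.sinks.k1.truststore-type = JKS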
AvroLegacySource
host | – | The hostname or IP address to bind to
port | – | The port # to listen on
selector.type | – | replicating or multiplexing
selector.* | replicating | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.avroLegacy.AvroLegacySource
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.bind = 6666
a1.sources.r1.channels = c1

Thrift Legacy Source

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be org.apache.flume.source.thriftLegacy.ThriftLegacySource
host | – | The hostname or IP address to bind to
port | – | The port # to listen on
selector.type | – | replicating or multiplexing
selector.* | replicating | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.thriftLegacy.ThriftLegacySource
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.bind = 6666
a1.sources.r1.channels = c1

Custom Source

A custom source is your own implementation of the Source interface. A custom source's class and its dependencies must be included in the agent's classpath when starting the Flume agent. The type of the custom source is its FQCN.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be your FQCN
batchSize | 100 |

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume

Null Sink

Discards all events it receives from the channel. Required properties are in bold:

Property Name | Default | Description
channel | – |
type | – | The component type name, needs to be null
batchSize | 100 |

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = null
a1.sinks.k1.channel = c1

HBaseSinks

HBaseSink

This sink writes data to HBase. The HBase configuration is picked up from the first hbase-site.xml encountered in the classpath. A class implementing HbaseEventSerializer, which is specified by the configuration, is used to convert the events into HBase puts and/or increments. These puts and increments are then written to HBase. This sink provides the same consistency guarantees as HBase, which is currently row-wise atomicity. In the event of HBase failing to write certain events, the sink will replay all events in that transaction.

The HBaseSink supports writing data to secure HBase. To write to secure HBase, the user the agent is running as must have write permissions to the table the sink is configured to write to. The principal and keytab to use to authenticate against the KDC can be specified in the configuration. The hbase-site.xml in the Flume agent's classpath must have authentication set to kerberos. For details on how to do this, please refer to the HBase documentation.
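As an illustration, a hedged sketch of an HBaseSink configured for secure HBase using the kerberosPrincipal and kerberosKeytab properties described in this section (the principal and keytab path are hypothetical):

a1.sinks.k1.type = hbase
a1.sinks.k1.channel = c1
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
# hypothetical Kerberos identity; hbase-site.xml must have authentication set to kerberos
a1.sinks.k1.kerberosPrincipal = flume/_HOST@EXAMPLE.COM
a1.sinks.k1.kerberosKeytab = /etc/security/keytabs/flume.keytab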
bufferMaxLines | – | (Obsolete) This option is now ignored.
bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example for an agent named agent-1:

agent-1.channels = ch-1
agent-1.sources = src-1
agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.channels = ch-1
agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
agent-1.sources.src-1.fileHeader = true

Twitter 1% firehose Source (experimental)

Warning: This source is highly experimental and may change between minor versions of Flume. Use at your own risk.

Experimental source that connects via Streaming API to the 1% sample Twitter firehose, continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink. Requires the consumer and access tokens and secrets of a Twitter developer account. Required properties are in bold:

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be org.apache.flume.source.twitter.TwitterSource
consumerKey | – | OAuth consumer key
consumerSecret | – | OAuth consumer secret
accessToken | – | OAuth access token
accessTokenSecret | – | OAuth token secret
maxBatchSize | 1000 | Maximum number of twitter messages to put in a single batch
maxBatchDurationMillis | 1000 | Maximum number of milliseconds to wait before closing a batch
The flow using the file channel or another stable channel will resume processing events where it left off. If the agent can't be restarted on the same hardware, then there is an option to migrate the database to other hardware and set up a new Flume agent that can resume processing the events saved in the db. The database HA features can be leveraged to move the Flume agent to another host.

Compatibility

HDFS

Currently Flume supports HDFS 0.20.2 and 0.23.

AVRO

TBD

Additional version requirements

TBD

Tracing

TBD

More Sample Configs

TBD

Component Summary

Component Interface | Type Alias | Implementation Class
org.apache.flume.Channel | memory | org.apache.flume.channel.MemoryChannel
org.apache.flume.Channel | jdbc | org.apache.flume.channel.jdbc.JdbcChannel
org.apache.flume.Channel | file | org.apache.flume.channel.file.FileChannel
org.apache.flume.Channel | – | org.apache.flume.channel.PseudoTxnMemoryChannel

[The Interface column for the remaining rows of this table continues with Source and Sink entries; the matching implementation classes appear later in this summary.]
a number of first-tier agents with an avro sink, all pointing to an avro source of a single agent (again you could use the thrift sources/sinks/clients in such a scenario). This source on the second-tier agent consolidates the received events into a single channel which is consumed by a sink to its final destination.

Multiplexing the flow

Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.

[Figure: a source from agent "foo" fanning out to three channels]

The above example shows a source from agent "foo" fanning out the flow to three different channels. This fan out can be replicating or multiplexing. In case of replicating flow, each event is sent to all three channels. For the multiplexing case, an event is delivered to a subset of available channels when an event's attribute matches a preconfigured value. For example, if an event attribute called "txnType" is set to "customer", then it should go to channel1 and channel3; if it's "vendor" then it should go to channel2, otherwise channel3. The mapping can be set in the agent's configuration file.

Configuration

As mentioned in the earlier section, Flume agent configuration is read from a file that resembles a Java property file format with hierarchical property settings.

Defining the flow

To define the flow within a single agent, you need to link the sources and sinks via a channel. You need to list the sources, sinks and channels for the given agent, and then point the source and sink to a channel.
and backup routes (fail-over) for failed hops.

Reliability

The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of the events. The sources and sinks encapsulate in a transaction the storage/retrieval, respectively, of the events placed in or provided by a transaction provided by the channel. This ensures that the set of events are reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.

Recoverability

The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel which is backed by the local file system. There's also a memory channel which simply stores the events in an in-memory queue, which is faster, but any events still left in the memory channel when an agent process dies can't be recovered.

Setup

Setting up an agent

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format.
have a fixed capacity, and once that capacity is full you will create back-pressure on earlier points in the flow. If this pressure propagates to the source of the flow, Flume will become unavailable and may lose data.

Whether you use redundant topologies: Flume lets you replicate flows across redundant topologies. This can provide a very easy source of fault tolerance and one which overcomes both disk and machine failures.

The best way to think about reliability in a Flume topology is to consider various failure scenarios and their outcomes. What happens if a disk fails? What happens if a machine fails? What happens if your terminal sink (e.g. HDFS) goes down for some time and you have back pressure? The space of possible designs is huge, but the underlying questions you need to ask are just a handful.

Flume topology design

The first step in designing a Flume topology is to enumerate all sources and destinations (terminal sinks) for your data. These will define the edge points of your topology. The next consideration is whether to introduce intermediate aggregation tiers or event routing. If you are collecting data from a large number of sources, it can be helpful to aggregate the data in order to simplify ingestion at the terminal sink. An aggregation tier can also smooth out burstiness from sources or unavailability at sinks, by acting as a buffer. If you are routing data between different locations, you may also want to split flows at various points.
The list of events returned by one interceptor is passed to the next interceptor in the chain. Interceptors can modify or drop events. If an interceptor needs to drop events, it just does not return that event in the list that it returns. If it is to drop all events, then it simply returns an empty list. Interceptors are named components; here is an example of how they are created through configuration:

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sinks.k1.filePrefix = FlumeData.%{CollectorHost}.%Y-%m-%d
a1.sinks.k1.channel = c1

Note that the interceptor builders are passed to the type config parameter. The interceptors are themselves configurable and can be passed configuration values just like they are passed to any other configurable component. In the above example, events are passed to the HostInterceptor first, and the events returned by the HostInterceptor are then passed along to the TimestampInterceptor. You can specify either the fully qualified class name (FQCN) or the alias timestamp. If you have multiple collectors writing to the same HDFS path, then you could also use the HostInterceptor.

Timestamp Interceptor
kerberosPrincipal | – | Kerberos user principal for accessing secure HBase
kerberosKeytab | – | Kerberos keytab for accessing secure HBase

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1

AsyncHBaseSink

This sink writes data to HBase using an asynchronous model. A class implementing AsyncHbaseEventSerializer, which is specified by the configuration, is used to convert the events into HBase puts and/or increments. These puts and increments are then written to HBase. This sink uses the Asynchbase API to write to HBase. This sink provides the same consistency guarantees as HBase, which is currently row-wise atomicity. In the event of HBase failing to write certain events, the sink will replay all events in that transaction. The type is the FQCN: org.apache.flume.sink.hbase.AsyncHBaseSink. Required properties are in bold:

Property Name | Default | Description
channel | – |
type | – | The component type name, needs to be asynchbase
table | – | The name of the table in HBase to write to
zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml
znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml
columnFamily | – | The column family in HBase to write to
Sample log4j.properties file configured to use Avro serialization:

#...
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = example.com
log4j.appender.flume.Port = 41414
log4j.appender.flume.AvroReflectionEnabled = true
log4j.appender.flume.AvroSchemaUrl = hdfs://namenode/path/to/schema.avsc

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

Load Balancing Log4J Appender

Appends Log4j events to a list of flume agents' avro sources. A client using this appender must have the flume-ng-sdk in the classpath (eg, flume-ng-sdk-1.5.2.jar). This appender supports a round-robin and random scheme for performing the load balancing. It also supports a configurable backoff timeout so that down agents are removed temporarily from the set of hosts. Required properties are in bold:

Property Name | Default | Description
Hosts | – | A space-separated list of host:port at which Flume (through an AvroSource) is listening for events
Selector | ROUND_ROBIN | Selection mechanism. Must be either ROUND_ROBIN, RANDOM or the custom FQCN of a class that inherits from LoadBalancingSelector.
MaxBackoff | – | A long value representing the maximum amount of time in milliseconds the load balancing client will back off from a node that has failed to consume an event. Defaults to no backoff.
UnsafeMode | false | If true, the appender will not throw exceptions on failure to send the events.
AvroReflectionEnabled | false | Use Avro Reflection to serialize Log4j events.
AvroSchemaUrl | – | A URL from which the Avro schema can be retrieved.
For convenience, two serializers are provided with Flume. The SimpleHbaseEventSerializer (org.apache.flume.sink.hbase.SimpleHbaseEventSerializer) writes the event body as-is to HBase, and optionally increments a column in HBase. This is primarily an example implementation. The RegexHbaseEventSerializer (org.apache.flume.sink.hbase.RegexHbaseEventSerializer) breaks the event body based on the given regex and writes each part into different columns.

The type is the FQCN: org.apache.flume.sink.hbase.HBaseSink. Required properties are in bold:

Property Name | Default | Description
channel | – |
type | – | The component type name, needs to be hbase
table | – | The name of the table in HBase to write to
columnFamily | – | The column family in HBase to write to
zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml
znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml
batchSize | 100 | Number of events to be written per txn.
coalesceIncrements | false | Should the sink coalesce multiple increments to a cell per batch. This might give better performance if there are multiple increments to a limited number of cells.
serializer | org.apache.flume.sink.hbase.SimpleHbaseEventSerializer | Default increment column = "iCol", payload column = "pCol".
serializer.* | – | Properties to be passed to the serializer.
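For example, a sketch of an HBase sink using the RegexHbaseEventSerializer to split an event body into two columns; the regex and column names here are hypothetical, passed through as serializer.* properties:

a1.sinks.k1.type = hbase
a1.sinks.k1.channel = c1
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
# hypothetical regex splitting the body on the first colon into two columns
a1.sinks.k1.serializer.regex = (.*):(.*)
a1.sinks.k1.serializer.colNames = key,value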
The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.

The following are the escape sequences supported:

Alias | Description
%{host} | Substitute value of event header named "host". Arbitrary header names are supported.
%t | Unix time in milliseconds
%a | locale's short weekday name (Mon, Tue, ...)
%A | locale's full weekday name (Monday, Tuesday, ...)
%b | locale's short month name (Jan, Feb, ...)
%B | locale's long month name (January, February, ...)
%c | locale's date and time (Thu Mar 3 23:05:25 2005)
%d | day of month (01)
%D | date; same as %m/%d/%y
%H | hour (00..23)
%I | hour (01..12)
%j | day of year (001..366)
%k | hour (0..23)
%m | month (01..12)
%M | minute (00..59)
%p | locale's equivalent of am or pm
%s | seconds since 1970-01-01 00:00:00 UTC
%S | second (00..59)
%y | last two digits of year (00..99)
%Y | year (2010)
%z | +hhmm numeric timezone (for example, -0400)

The file in use will have the name mangled to include ".tmp" at the end. Once the file is closed, this extension is removed. This allows excluding partially complete files in the directory. Required properties are in bold.
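As an illustration of these escape sequences, a minimal sketch of an HDFS sink path that buckets events by host and by day (the path layout is only an example; a timestamp header must be present, or hdfs.useLocalTimeStamp set to true):

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# one directory per originating host and per day
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/%{host}/%y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events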
might define several named agents; when a given Flume process is launched a flag is passed telling it which named agent to manifest.

Given this configuration file, we can start Flume as follows:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the console and we go without a custom environment script.

From a separate terminal, we can then telnet port 44444 and send Flume an event:

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK

The original Flume terminal will output the event in a log message:

12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D Hello world! }

Congratulations - you've successfully configured and deployed a Flume agent! Subsequent sections cover agent configuration in much more detail.

Installing third-party plugins

Flume has a fully plugin-based architecture.
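For reference, a sketch of the plugins.d layout used for installing such plugins (the plugin name and file names are hypothetical; each plugin directory may contain lib, libext and native subdirectories):

plugins.d/
plugins.d/custom-source-1.0/
plugins.d/custom-source-1.0/lib/my-source.jar
plugins.d/custom-source-1.0/libext/spring-core-2.5.6.jar
plugins.d/custom-source-1.0/native/gettext.so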
Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data flows.

Configuring individual components

Each component (source, sink or channel) in the flow has a name, type, and set of properties that are specific to the type and instantiation. For example, an Avro source needs a hostname (or IP address) and a port number to receive data from. A memory channel can have a max queue size ("capacity"), and an HDFS sink needs to know the file system URI, path to create files, frequency of file rotation ("hdfs.rollInterval") etc. All such attributes of a component need to be set in the properties file of the hosting Flume agent.

Wiring the pieces together

The agent needs to know what individual components to load and how they are connected in order to constitute the flow. This is done by listing the names of each of the sources, sinks and channels in the agent, and then specifying the connecting channel for each sink and source. For example, an agent flows events from an Avro source called avroWeb to an HDFS sink hdfs-cluster1 via a file channel called file-channel. The configuration file will contain names of these components and file-channel as a shared channel for both the avroWeb source and the hdfs-cluster1 sink.

Starting an agent

An agent is started using a shell script called flume-ng which is located in the bin directory of the Flume distribution.
polling for the config file even if the config file is not found at the expected location. Otherwise, the Flume agent will terminate if the config doesn't exist at the expected location. No property value is needed when setting this property (eg, just specifying -Dflume.called.from.service is enough).

Property: flume.called.from.service

Flume periodically polls (every 30 seconds) for changes to the specified config file. A Flume agent loads a new configuration from the config file if either an existing file is polled for the first time, or if an existing file's modification date has changed since the last time it was polled. Renaming or moving a file does not change its modification time. When a Flume agent polls a non-existent file, then one of two things happens:

1. When the agent polls a non-existent config file for the first time, the agent behaves according to the flume.called.from.service property. If the property is set, then the agent will continue polling (always at the same period - every 30 seconds). If the property is not set, then the agent immediately terminates.

2. When the agent polls a non-existent config file and this is not the first time the file is polled, the agent makes no config changes for this polling period. The agent continues polling rather than terminating.

Log4J Appender

Appends Log4j events to a flume agent's avro source. A client using this appender must have the flume-ng-sdk in the classpath (eg, flume-ng-sdk-1.5.2.jar).
Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.twitter.TwitterSource
a1.sources.r1.channels = c1
a1.sources.r1.consumerKey = YOUR_TWITTER_CONSUMER_KEY
a1.sources.r1.consumerSecret = YOUR_TWITTER_CONSUMER_SECRET
a1.sources.r1.accessToken = YOUR_TWITTER_ACCESS_TOKEN
a1.sources.r1.accessTokenSecret = YOUR_TWITTER_ACCESS_TOKEN_SECRET
a1.sources.r1.maxBatchSize = 10
a1.sources.r1.maxBatchDurationMillis = 200

Event Deserializers

The following event deserializers ship with Flume.

LINE

This deserializer generates one event per line of text input.

Property Name | Default | Description
deserializer.maxLineLength | 2048 | Maximum number of characters to include in a single event. If a line exceeds this length, it is truncated, and the remaining characters on the line will appear in a subsequent event.
deserializer.outputCharset | UTF-8 | Charset to use for encoding events put into the channel.

AVRO

This deserializer is able to read an Avro container file, and it generates one event per Avro record in the file. Each event is annotated with a header that indicates the schema used. The body of the event is the binary Avro record data, not including the schema or the rest of the container file elements.

Note that if the spool directory source must retry putting one of these events onto a channel (for example, because the
a1.sinks.k1.sink.serializer = text
a1.sinks.k1.sink.serializer.appendNewline = false

Avro Event Serializer

Alias: avro_event. This interceptor serializes Flume events into an Avro container file. The schema used is the same schema used for Flume events in the Avro RPC mechanism. This serializer inherits from the AbstractAvroEventSerializer class. Configuration options are as follows:

Property Name | Default | Description
syncIntervalBytes | 2048000 | Avro sync interval, in approximate bytes.
compressionCodec | null | Avro compression codec. For supported codecs, see Avro's CodecFactory docs.

Example for agent named a1:

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.serializer = avro_event
a1.sinks.k1.serializer.compressionCodec = snappy

Flume Interceptors

Flume has the capability to modify/drop events in flight. This is done with the help of interceptors. Interceptors are classes that implement the org.apache.flume.interceptor.Interceptor interface. An interceptor can modify or even drop events based on any criteria chosen by the developer of the interceptor. Flume supports chaining of interceptors. This is made possible by specifying the list of interceptor builder class names in the configuration. Interceptors are specified as a whitespace-separated list in the source configuration. The order in which the interceptors are specified is the order in which they are invoked.
will also follow file rotation.

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

The 'shell' config is used to invoke the 'command' through a command shell (such as Bash or Powershell). The 'command' is passed as an argument to 'shell' for execution. This allows the 'command' to use features from the shell such as wildcards, back ticks, pipes, loops, conditionals etc. In the absence of the 'shell' config, the 'command' will be invoked directly. Common values for 'shell': '/bin/sh -c', '/bin/ksh -c', 'cmd /c', 'powershell -Command', etc.

agent_foo.sources.tailsource-1.type = exec
agent_foo.sources.tailsource-1.shell = /bin/bash -c
agent_foo.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done

JMS Source

JMS Source reads messages from a JMS destination such as a queue or topic. Being a JMS application it should work with any JMS provider but has only been tested with ActiveMQ. The JMS source provides configurable batch size, message selector, user/pass, and message-to-flume-event converter. Note that the vendor-provided JMS jars should be included in the Flume classpath using the plugins.d directory (preferred), -classpath on the command line, or via the FLUME_CLASSPATH variable in flume-env.sh.

Required properties are in bold:

Property Name | Default | Description
channels | – |
The multiplexing selector has a further set of properties to bifurcate the flow. This requires specifying a mapping of an event attribute to a set of channels. The selector checks for each configured attribute in the event header. If it matches the specified value, then that event is sent to all the channels mapped to that value. If there's no match, then the event is sent to the set of channels configured as default:

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...
<Agent>.sources.<Source1>.selector.default = <Channel2>

The mapping allows overlapping the channels for each value.

The following example has a single flow that is multiplexed to two paths. The agent named agent_foo has a single avro source and two channels linked to two sinks:

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2
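Continuing this sketch, the selector configuration for the source might look as follows (the header name and mapped values are hypothetical, following the mapping template above):

# hypothetical multiplexing selector on a 'State' header
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1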
Required properties are in bold:

Property Name | Default | Description
Hostname | – | The hostname on which a remote Flume agent is running with an avro source.
Port | – | The port at which the remote Flume agent's avro source is listening.
UnsafeMode | false | If true, the appender will not throw exceptions on failure to send the events.
AvroReflectionEnabled | false | Use Avro Reflection to serialize Log4j events. (Do not use when users log strings.)
AvroSchemaUrl | – | A URL from which the Avro schema can be retrieved.

Sample log4j.properties file:

#...
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = example.com
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = true

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

By default each event is converted to a string by calling toString(), or by using the Log4j layout, if specified.

If the event is an instance of org.apache.avro.generic.GenericRecord or org.apache.avro.specific.SpecificRecord, or if the property AvroReflectionEnabled is set to true, then the event will be serialized using Avro serialization.

Serializing every event with its Avro schema is inefficient, so it is good practice to provide a schema URL from which the schema can be retrieved by the downstream sink, typically the HDFS sink. If AvroSchemaUrl is not specified, then the schema will be included as a Flume header.
batchSize | 100 | Number of events to be written per txn.
coalesceIncrements | false | Should the sink coalesce multiple increments to a cell per batch. This might give better performance if there are multiple increments to a limited number of cells.
timeout | 60000 | The length of time (in milliseconds) the sink waits for acks from hbase for all events in a transaction.
serializer | org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer |
serializer.* | – | Properties to be passed to the serializer.

Note that this sink takes the Zookeeper Quorum and parent znode information in the configuration. The Zookeeper Quorum and parent node configuration may be specified in the flume configuration file. Alternatively, these configuration values are taken from the first hbase-site.xml file in the classpath.

If these are not provided in the configuration, then the sink will read this information from the first hbase-site.xml file in the classpath.

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = asynchbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
a1.sinks.k1.channel = c1

MorphlineSolrSink

This sink extracts data from Flume events, transforms it, and loads it in near-real-time into Apache Solr servers, which in turn serve queries to end users or search applications.

This sink is well suited for use cases that stream raw data into HDFS (via the HDFS sink) and simultaneously extract, transform and load the same data into Solr (via MorphlineSolrSink).
written out to a ByteArrayOutputStream wrapped in an ObjectOutputStream, and the resulting array is copied to the body of the FlumeEvent.

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE

Spooling Directory Source

This source lets you ingest data by placing files to be ingested into a "spooling" directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

1. If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
2. If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
channel. This is done in the same hierarchical namespace fashion where you set the component type and other values for the properties specific to each component:

# properties for sources
<Agent>.sources.<Source>.<someProperty> = <someValue>

# properties for channels
<Agent>.channels.<Channel>.<someProperty> = <someValue>

# properties for sinks
<Agent>.sinks.<Sink>.<someProperty> = <someValue>

The property "type" needs to be set for each component for Flume to understand what kind of object it needs to be. Each source, sink and channel type has its own set of properties required for it to function as intended. All those need to be set as needed. In the previous example, we have a flow from avro-AppSrv-source to hdfs-Cluster1-sink through the memory channel mem-channel-1. Here's an example that shows configuration of each of those components:

agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1

# set channel for sources, sinks

# properties of avro-AppSrv-source
agent_foo.sources.avro-AppSrv-source.type = avro
agent_foo.sources.avro-AppSrv-source.bind = localhost
agent_foo.sources.avro-AppSrv-source.port = 10000

# properties of mem-channel-1
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100

# properties of hdfs-Cluster1-sink
Sample log4j.properties file configured using defaults:

#...
log4j.appender.out2 = org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.out2.Hosts = localhost:25430 localhost:25431

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

Sample log4j.properties file configured using RANDOM load balancing:

#...
log4j.appender.out2 = org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.out2.Hosts = localhost:25430 localhost:25431
log4j.appender.out2.Selector = RANDOM

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

Sample log4j.properties file configured using backoff:

#...
log4j.appender.out2 = org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.out2.Hosts = localhost:25430 localhost:25431 localhost:25432
log4j.appender.out2.Selector = ROUND_ROBIN
log4j.appender.out2.MaxBackoff = 30000

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

Security

The HDFS sink supports Kerberos authentication; configuring the HD
a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.byteCapacity = 800000
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

To disable the use of the in-memory queue and function like a file channel:

a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 0
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

To disable the use of overflow disk and function purely as an in-memory channel:

a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 100000
a1.channels.c1.overflowCapacity = 0

Pseudo Transaction Channel

Warning: The Pseudo Transaction Channel is only for unit testing purposes and is NOT meant for production use.

Required properties are in bold:

Property Name | Default | Description
type | – | The component type name, needs to be org.apache.flume.channel.PseudoTxnMemoryChannel
capacity | 50 | The max number of events stored in the channel
keep-alive | 3 | Timeout in seconds for adding or removing an event

Custom Channel

A custom channel is your own implementation of the Channel interface. A custom channel's class and its dependencies must be included in the agent's classpath when starting the Flume agent.
This creates sub-topologies which may themselves include aggregation points.

Sizing a Flume deployment

Once you have an idea of what your topology will look like, the next question is how much hardware and networking capacity is needed. This starts by quantifying how much data you generate. That is not always a simple task! Most data streams are bursty (for instance, due to diurnal patterns) and potentially unpredictable. A good starting point is to think about the maximum throughput you'll have in each tier of the topology, both in terms of events per second and bytes per second. Once you know the required throughput of a given tier, you can calculate a lower bound on how many nodes you require for that tier. To determine attainable throughput, it's best to experiment with Flume on your hardware, using synthetic or sampled event data. In general, disk-based channels should get 10's of MB/s and memory-based channels should get 100's of MB/s or more. Performance will vary widely, however, depending on hardware and operating environment.

Sizing aggregate throughput gives you a lower bound on the number of nodes you will need for each tier. There are several reasons to have additional nodes, such as increased redundancy and better ability to absorb bursts in load.

Troubleshooting

Handling agent failures

If the Flume agent goes down, then all the flows hosted on that agent are aborted. Once the agent is restarted, then the flow will resume.
optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

[Figure: a web server delivering events to a Flume agent (source, channel, sink) which stores them in HDFS]

A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift RPC Client or Thrift clients written in any language generated from the Flume thrift protocol. When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it's consumed by a Flume sink. The file channel is one example - it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.

Complex flows

Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4

Custom Channel Selector

A custom channel selector is your own implementation of the ChannelSelector interface. A custom channel selector's class and its dependencies must be included in the agent's classpath when starting the Flume agent. The type of the custom channel selector is its FQCN.

Property Name | Default | Description
selector.type | – | The component type name, needs to be your FQCN

Example for agent named a1 and its source called r1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.selector.type = org.example.MyChannelSelector

Flume Sink Processors

Sink groups allow users to group multiple sinks into one entity. Sink processors can be used to provide load balancing capabilities over all sinks inside the group or to achieve fail-over from one sink to another in case of temporal failure.

Required properties are in bold:

Property Name | Default | Description
sinks | – | Space-separated list of sinks that are participating in the group
processor.type | default | The component type name, needs to be default, failover or load_balance

Example for agent named a1:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance

Default Sink Processor

Default sink processor accepts only a single sink. The user is not forced to create a processor (sink group) for single sinks. Instead the user can follow the source - channel - sink pattern explained earlier in this guide.
org.apache.flume.channel.file.encryption.AESCTRNoPaddingProvider
org.example.MyCipherProvider
org.apache.flume.serialization.BodyTextEventSerializer$Builder
org.apache.flume.serialization.FlumeEventAvroEventSerializer$Builder
org.example.MyEventSerializer$Builder

These conventions for alias names are used in the component-specific examples above, to keep the names short and consistent across all examples:

Alias Name | Alias Type
a | agent
c | channel
r | source
k | sink
g | sink group
i | interceptor
key | key
h | host
s | serializer
To setup a multi-tier flow, you need to have an avro/thrift sink of the first hop pointing to an avro/thrift source of the next hop. This will result in the first Flume agent forwarding events to the next Flume agent. For example, if you are periodically sending files (1 file per event) using an avro client to a local Flume agent, then this local agent can forward it to another agent that has the storage mounted.

Weblog agent config:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = avro-forward-sink
agent_foo.channels = file-channel

# define the flow
agent_foo.sources.avro-AppSrv-source.channels = file-channel
agent_foo.sinks.avro-forward-sink.channel = file-channel

# avro sink properties
agent_foo.sinks.avro-forward-sink.type = avro
agent_foo.sinks.avro-forward-sink.hostname = 10.1.1.100
agent_foo.sinks.avro-forward-sink.port = 10000

# configure other pieces

HDFS agent config:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-collection-source
agent_foo.sinks = hdfs-sink
agent_foo.channels = mem-channel

# define the flow
agent_foo.sources.avro-collection-source.channels = mem-channel
agent_foo.sinks.hdfs-sink.channel = mem-channel

# avro source properties
agent_foo.sources.avro-collection-source.type = avro
agent_foo.sources.avro-collection-source.bind = 10.1.1.100
agent_foo.sources.avro-collection-source.port = 10000

# configure other pieces
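Assuming these two configurations are saved as weblog.config and hdfs.config (hypothetical file names), the two agents could be launched along these lines:

$ bin/flume-ng agent --conf conf --conf-file weblog.config --name agent_foo
$ bin/flume-ng agent --conf conf --conf-file hdfs.config --name agent_foo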
selector.type | – | replicating or multiplexing
selector.* | replicating | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.example.MySource
a1.sources.r1.channels = c1

Scribe Source

Scribe is another type of ingest system. To adopt an existing Scribe ingest system, Flume should use the ScribeSource, which is based on Thrift with a compatible transfer protocol. For deployment of Scribe please follow the guide from Facebook. Required properties are in bold:

Property Name | Default | Description
type | – | The component type name, needs to be org.apache.flume.source.scribe.ScribeSource
port | 1499 | Port that Scribe should connect to
workerThreads | 5 | Number of handler threads in Thrift
selector.type | – |
selector.* | – |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.scribe.ScribeSource
a1.sources.r1.port = 1463
a1.sources.r1.workerThreads = 5
a1.sources.r1.channels = c1

Flume Sinks

HDFS Sink

This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events. It also buckets/partitions data by attributes like timestamp or machine where the event originated.
specified.
ipFilter | false | Set this to true to enable ipFiltering for netty
ipFilter.rules | – | Define N netty ipFilter pattern rules with this config.

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

Example of ipFilter rules:

ipFilter.rules defines N netty ipFilters separated by a comma. A pattern rule must be in this format: <'allow' or 'deny'>:<'ip' or 'name' for computer name>:<pattern>, i.e. allow/deny:ip/name:pattern

example: ipFilter.rules = allow:ip:127.*,allow:name:localhost,deny:ip:*

Note that the first rule to match will apply, as the examples below show for a client on the localhost.

This will allow the client on localhost and deny clients from any other ip: "allow:name:localhost,deny:ip:*"

This will deny the client on localhost and allow clients from any other ip: "deny:name:localhost,allow:ip:*"

Thrift Source

Listens on a Thrift port and receives events from external Thrift client streams. When paired with the built-in ThriftSink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties are in bold:

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be thrift
bind | – | hostname or IP address to listen on
port | – | Port # to bind to
threads | – | Maximum number of worker threads to spawn
selector.type | – |
selector.* | – |
interceptors | – | Space-separated list of interceptors
interceptors.* | – |
supports a number of mechanisms to ingest data from external sources.

RPC

An Avro client included in the Flume distribution can send a given file to a Flume Avro source using the avro RPC mechanism:

$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10

The above command will send the contents of /usr/logs/log.10 to the Flume source listening on that port.

Executing commands

There's an exec source that executes a given command and consumes the output - a single 'line' of output, i.e. text followed by carriage return ('\r') or line feed ('\n') or both together.

Note: Flume does not support tail as a source. One can wrap the tail command in an exec source to stream the file.

Network streams

Flume supports the following mechanisms to read data from popular log stream types, such as:

1. Avro
2. Thrift
3. Syslog
4. Netcat

Setting multi-agent flow

[Figure: two agents, "foo" and "bar", connected sink-to-source]

In order to flow the data across multiple agents or hops, the sink of the previous agent and the source of the current hop need to be avro type, with the sink pointing to the hostname (or IP address) and port of the source.

Consolidation

A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers sent to a dozen agents that write to an HDFS cluster.

[Figure: many first-tier agents consolidating into a single second-tier agent]

This can be achieved in Flume by configuring
append a static header with a static value to all events.

The current implementation does not allow specifying multiple headers at one time. Instead the user might chain multiple static interceptors, each defining one static header.

Property Name | Default | Description
type | – | The component type name, has to be static
preserveExisting | true | If the configured header already exists, should it be preserved - true or false
key | key | Name of header that should be created
value | value | Static value that should be created

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK

UUID Interceptor

This interceptor sets a universally unique identifier on all events that are intercepted. An example UUID is b5755073-77a9-43c1-8fad-b7a586fc1b97, which represents a 128-bit value.

Consider using UUIDInterceptor to automatically assign a UUID to an event if no application-level unique key for the event is available. It can be important to assign UUIDs to events as soon as they enter the Flume network; that is, in the first Flume Source of the flow. This enables subsequent deduplication of events in the face of replication and redelivery in a Flume network that is designed for high availability and high performance. If an application-level key is
channels | – |
type | – | The component type name, needs to be multiport_syslogtcp
host | – | Host name or IP address to bind to.
ports | – | Space-separated list (one or more) of ports to bind to.
eventSize | 2500 | Maximum size of a single event line, in bytes.
keepFields | false | Setting this to true will preserve the Priority, Timestamp and Hostname in the body of the event.
portHeader | – | If specified, the port number will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the incoming port.
charset.default | UTF-8 | Default character set used while parsing syslog events into strings.
charset.port.<port> | – | Character set is configurable on a per-port basis.
batchSize | 100 | Maximum number of events to attempt to process per request loop. Using the default is usually fine.
readBufferSize | 1024 | Size of the internal Mina read buffer. Provided for performance tuning. Using the default is usually fine.
numProcessors | (auto-detected) | Number of processors available on the system for use while processing messages. Default is to auto-detect # of CPUs using the Java Runtime API. Mina will spawn 2 request-processing threads per detected CPU, which is often reasonable.
selector.type | replicating | replicating, multiplexing, or custom
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors.
interceptors.* | – |

For example, a multiport syslog TCP source for agent named a1 is sketched below.
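A minimal sketch, assuming three arbitrarily chosen ports:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.channels = c1
a1.sources.r1.host = 0.0.0.0
# port numbers are illustrative
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.portHeader = port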
db.type | DERBY | Database vendor, needs to be DERBY
driver.class | – | Class for vendor's JDBC driver
driver.url | – | JDBC connection URL
db.username | – | User id for db connection
db.password | – | Password for db connection
connection.properties.file | – | JDBC connection property file path
create.schema | – | If true, then creates db schema if not there
create.index | – | Create indexes to speed up lookups
transaction.isolation | – | Isolation level for db session: READ_UNCOMMITTED, READ_COMMITTED, SERIALIZABLE, REPEATABLE_READ
maximum.connections | – | Max connections allowed to db
maximum.capacity | – | Max number of events in the channel
sysprop.* | – | DB vendor specific properties
sysprop.user.home | – | Home path to store embedded Derby database

type | – | The component type name, needs to be file
checkpointDir | – | The directory where the checkpoint file will be stored
useDualCheckpoints | – | Backup the checkpoint. If this is set to true, backupCheckpointDir must be set
backupCheckpointDir | – | The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory
dataDirs | – | Comma-separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance
transactionCapacity | – | The maximum size of transaction supported by the channel
checkpointInterval | – | Amount of time (in millis) between checkpoints
maxFileSize | – | Max size (in bytes) of a single log file
minimumRequiredSpace | – | Minimum required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value
capacity | – | Maximum capacity of the channel
keep-alive | – | Amount of time (in sec) to wait for a put operation
use-log-replay-v1 | – | Expert: Use old replay logic
use-fast-replay | – | Expert: Replay without using queue
encryption.activeKey | – | Key name used to encrypt new data
encryption.cipherProvider | – | Cipher provider type, supported
org.apache.flume.source.SyslogTcpSource
org.apache.flume.source.MultiportSyslogTCPSource
org.apache.flume.source.SyslogUDPSource
org.apache.flume.source.SpoolDirectorySource
org.apache.flume.source.http.HTTPSource
org.apache.flume.source.ThriftSource
org.apache.flume.source.jms.JMSSource
org.apache.flume.source.avroLegacy.AvroLegacySource
org.apache.flume.source.thriftLegacy.ThriftLegacySource
org.example.MySource
org.apache.flume.sink.NullSink
org.apache.flume.sink.LoggerSink
org.apache.flume.sink.AvroSink
org.apache.flume.sink.hdfs.HDFSEventSink
org.apache.flume.sink.hbase.HBaseSink
org.apache.flume.sink.hbase.AsyncHBaseSink
org.apache.flume.sink.elasticsearch.ElasticSearchSink
org.apache.flume.sink.RollingFileSink
org.apache.flume.sink.irc.IRCSink
org.apache.flume.sink.ThriftSink
org.example.MySink
org.apache.flume.channel.ReplicatingChannelSelector
org.apache.flume.channel.MultiplexingChannelSelector
org.example.MyChannelSelector
org.apache.flume.sink.DefaultSinkProcessor
org.apache.flume.sink.FailoverSinkProcessor
org.apache.flume.sink.LoadBalancingSinkProcessor
org.apache.flume.interceptor.TimestampInterceptor$Builder
org.apache.flume.interceptor.HostInterceptor$Builder
org.apache.flume.interceptor.StaticInterceptor$Builder
org.apache.flume.interceptor.RegexFilteringInterceptor$Builder
org.apache.flume.interceptor.RegexExtractorInterceptor$Builder
org.apache.flume.channel.file.encryption.JCEFileKeyProvider
org.example.MyKeyPr
roperties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata

Adding multiple flows in an agent

A single Flume agent can contain several independent flows. You can list multiple sources, sinks and channels in a config. These components can be linked to form multiple flows:

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

Then you can link the sources and sinks to their corresponding channels (for sources) or channel (for sinks) to set up two different flows. For example, if you need to set up two flows in an agent, one going from an external avro client to external HDFS and another from the output of a tail to an avro sink, then here's a config to do that:

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

Configuring a multi agent flow
s. Here we link the avro-forward sink from the weblog agent to the avro collection source of the hdfs agent. This will result in the events coming from the external appserver source eventually getting stored in HDFS.

Fan out flow

As discussed in the previous section, Flume supports fanning out the flow from one source to multiple channels. There are two modes of fan out: replicating and multiplexing. In the replicating flow, the event is sent to all the configured channels. In the case of multiplexing, the event is sent to only a subset of qualifying channels. To fan out the flow, one needs to specify a list of channels for a source and the policy for fanning it out. This is done by adding a channel "selector" that can be replicating or multiplexing. Then further specify the selection rules if it's a multiplexer. If you don't specify a selector, then by default it's replicating:

# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>

# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>

<Agent>.sources.<Source1>.selector.type = replicating

The mu
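For the multiplexing case, the selector routes on an event header. In the same placeholder notation (the header name and mapped values are stand-ins):

<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.default = <Channel2>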
sh table where each hash table entry contains a String key and a list of Java Objects as values. (The implementation uses Guava's ArrayListMultimap, which is a ListMultimap.) Note that a field can have multiple values, and any two records need not use common field names.

This sink fills the body of the Flume event into the _attachment_body field of the morphline record, as well as copies the headers of the Flume event into record fields of the same name. The commands can then act on this data.

Routing to a SolrCloud cluster is supported to improve scalability. Indexing load can be spread across a large number of MorphlineSolrSinks for improved scalability. Indexing load can be replicated across multiple MorphlineSolrSinks for high availability, for example using Flume features such as the Load balancing Sink Processor. MorphlineInterceptor can also help to implement dynamic routing to multiple Solr collections (e.g. for multi-tenancy).

The morphline and solr jars required for your environment must be placed in the lib directory of the Apache Flume installation.

The type is the FQCN: org.apache.flume.sink.solr.morphline.MorphlineSolrSink

Required properties are in bold.

Property Name  Default  Description
channel        –
type           –        The component type name, needs to be org.apache.flume.sink.solr.morphline.MorphlineSolrSink
morphlineFile  –        The relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf
kite.repo.uri           –    URI of the repository to open
kite.dataset.name       –    Name of the Dataset where records will be written
kite.batchSize          100  Number of records to process in each batch
kite.rollInterval       30   Maximum wait time (seconds) before data files are released
auth.kerberosPrincipal  –    Kerberos user principal for secure authentication to HDFS
auth.kerberosKeytab     –    Kerberos keytab location (local FS) for the principal
auth.proxyUser          –    The effective user for HDFS actions, if different from the kerberos principal

Custom Sink

A custom sink is your own implementation of the Sink interface. A custom sink's class and its dependencies must be included in the agent's classpath when starting the Flume agent. The type of the custom sink is its FQCN. Required properties are in bold.

Property Name  Default  Description
channel        –
type           –        The component type name, needs to be your FQCN

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.example.MySink
a1.sinks.k1.channel = c1

Flume Channels

Channels are the repositories where events are staged on an agent. Sources add events and sinks remove them.

Memory Channel

The events are stored in an in-memory queue with configurable max size. It's ideal for flows that need higher throughput and are prepared to lose the staged data in the event of agent failures. Required properties are in bold.

Property Name  Default  Description
type           –        The component type name, needs to be memory
capacity       100      The maximum number of events stored in the channel
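A minimal memory channel declaration for an agent named a1 might look as follows; the capacity values are illustrative, and transactionCapacity is a further property of this channel not shown above:

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000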
st determine the version of elasticsearch and the JVM version the target cluster is running. Then select an elasticsearch client library which matches the major version: a 0.19.x client can talk to a 0.19.x cluster; 0.20.x can talk to 0.20.x; and 0.90.x can talk to 0.90.x. Once the elasticsearch version has been determined, read the pom.xml file to determine the correct lucene-core JAR version to use. The Flume agent which is running the ElasticSearchSink should also match the JVM the target cluster is running, down to the minor version.

Events will be written to a new index every day. The name will be <indexName>-yyyy-MM-dd, where <indexName> is the indexName parameter. The sink will start writing to a new index at midnight UTC.

Events are serialized for elasticsearch by the ElasticSearchLogStashEventSerializer by default. This behaviour can be overridden with the serializer parameter. This parameter accepts implementations of org.apache.flume.sink.elasticsearch.ElasticSearchEventSerializer or org.apache.flume.sink.elasticsearch.ElasticSearchIndexRequestBuilderFactory. Implementing ElasticSearchEventSerializer is deprecated in favour of the more powerful ElasticSearchIndexRequestBuilderFactory.

The type is the FQCN: org.apache.flume.sink.elasticsearch.ElasticSearchSink

Required properties are in bold.

Property Name  Default  Description
channel        –
type           –        The component type name, needs to be org.apache.flume.sink.elasticsearch.ElasticSearchSink
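A representative ElasticSearchSink configuration might look as follows; the hostNames, indexName, clusterName and batchSize properties are drawn from this sink's full property list (not all of it is shown above), and the values are placeholders:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
a1.sinks.k1.hostNames = 127.0.0.1:9200,127.0.0.2:9300
a1.sinks.k1.indexName = foo_index
a1.sinks.k1.clusterName = foobar_cluster
a1.sinks.k1.batchSize = 500
a1.sinks.k1.channel = c1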
t suitable for very large objects because it buffers up the entire BLOB in RAM.

Property Name          Default    Description
handler                –          The FQCN of this class: org.apache.flume.sink.solr.morphline.BlobHandler
handler.maxBlobLength  100000000  The maximum number of bytes to read and buffer for a given request

Legacy Sources

The legacy sources allow a Flume 1.x agent to receive events from Flume 0.9.4 agents. A legacy source accepts events in the Flume 0.9.4 format, converts them to the Flume 1.0 format, and stores them in the connected channel. The 0.9.4 event properties like timestamp, pri, host, nanos, etc. get converted to 1.x event header attributes. The legacy source supports both Avro and Thrift RPC connections. To use this bridge between two Flume versions, you need to start a Flume 1.x agent with the avroLegacy or thriftLegacy source. The 0.9.4 agent should have the agent Sink pointing to the host/port of the 1.x agent.

Note: The reliability semantics of Flume 1.x are different from those of Flume 0.9.x. The E2E or DFO mode of a Flume 0.9.x agent will not be supported by the legacy source. The only supported 0.9.x mode is best effort, though the reliability setting of the 1.x flow will be applicable to the events once they are saved into the Flume 1.x channel by the legacy source.

Required properties are in bold.

Avro Legacy Source

Property Name  Default  Description
channels       –
type           –        The component type name, needs to be org.apache.flume.source.avroLegacy.AvroLegacySource
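A sketch of an Avro legacy source for an agent named a1; the host and port properties are assumed from this source's property table (which is cut off above), and the values are placeholders:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.avroLegacy.AvroLegacySource
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1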
interceptors    –    Space-separated list of interceptors
interceptors.*  –

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

Exec Source

Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless the property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results, whereas date will probably not: the former two commands produce streams of data, whereas the latter produces a single event and exits.

Required properties are in bold.

Property Name    Default      Description
channels         –
type             –            The component type name, needs to be exec
command          –            The command to execute
shell            –            A shell invocation used to run the command, e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
restartThrottle  10000        Amount of time (in millis) to wait before attempting a restart
restart          false        Whether the executed cmd should be restarted if it dies
logStdErr        false        Whether the command's stderr should be logged
batchSize        20           The max number of lines to read and send to the channel at a time
selector.type    replicating  replicating or multiplexing
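Following the tail -F pattern described above, a typical exec source for an agent named a1 looks like this (the log path is a placeholder):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1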
  private static final String COUNTER_EVENT_DRAIN_ATTEMPT =
      "sink.event.drain.attempt";
  private static final String COUNTER_EVENT_DRAIN_SUCCESS =
      "sink.event.drain.sucess";

  private static final String[] ATTRIBUTES = {
      COUNTER_CONNECTION_CREATED, COUNTER_CONNECTION_CLOSED,
      COUNTER_CONNECTION_FAILED, COUNTER_BATCH_EMPTY,
      COUNTER_BATCH_UNDERFLOW, COUNTER_BATCH_COMPLETE,
      COUNTER_EVENT_DRAIN_ATTEMPT, COUNTER_EVENT_DRAIN_SUCCESS
  };

  public SinkCounter(String name) {
      super(MonitoredCounterGroup.Type.SINK, name, ATTRIBUTES);
  }

  @Override
  public long getConnectionCreatedCount() {
      return get(COUNTER_CONNECTION_CREATED);
  }

  public long incrementConnectionCreatedCount() {
      return increment(COUNTER_CONNECTION_CREATED);
  }

Topology Design Considerations

Flume is very flexible and allows a large range of possible deployment scenarios. If you plan to use Flume in a large, production deployment, it is prudent to spend some time thinking about how to express your problem in terms of a Flume topology. This section covers a few considerations.

Is Flume a good fit for your problem?

If you need to ingest textual log data into Hadoop/HDFS then Flume is the right fit for your problem, full stop. For other use cases, here are some guidelines:

Flume is designed to transport and ingest regularly-generated event data over relatively stable, potentially complex topologies. The notion of "event data" is very broadly defined. To Flume, an event is just a generic blob of bytes. There are some limitations on how large an event can be.
type                   –        The component type name, needs to be jms
initialContextFactory  –        Initial Context Factory, e.g. org.apache.activemq.jndi.ActiveMQInitialContextFactory
connectionFactory      –        The JNDI name the connection factory should appear as
providerURL            –        The JMS provider URL
destinationName        –        Destination name
destinationType        –        Destination type (queue or topic)
messageSelector        –        Message selector to use when creating the consumer
userName               –        Username for the destination/provider
passwordFile           –        File containing the password for the destination/provider
batchSize              100      Number of messages to consume in one batch
converter.type         DEFAULT  Class to use to convert messages to flume events. See below.
converter.*            –        Converter properties
converter.charset      UTF-8    Default converter only. Charset to use when converting JMS TextMessages to byte arrays.

Converter

The JMS source allows pluggable converters, though it's likely the default converter will work for most purposes. The default converter is able to convert Bytes, Text, and Object messages to FlumeEvents. In all cases, the properties in the message are added as headers to the FlumeEvent.

BytesMessage:

Bytes of message are copied to the body of the FlumeEvent. Cannot convert more than 2GB of data per message.

TextMessage:

Text of message is converted to a byte array and copied to the body of the FlumeEvent. The default converter uses UTF-8 by default, but this is configurable.

ObjectMessage:

Object is written out to a ByteArrayOutputStream wrapped in an ObjectOutputStream, and the resulting byte array is copied to the body of the FlumeEvent.
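Putting the table above together, a JMS source for an agent named a1 might be configured like this (the broker URL, connection factory and destination names are placeholders):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE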
encryption.keyProvider                       –  Key provider type, supported types: JCEKSFILE
encryption.keyProvider.keyStoreFile          –  Path to the keystore file
encryption.keyProvider.keyStorePasswordFile  –  Path to the keystore password file
encryption.keyProvider.keys                  –  List of all keys (e.g. history of the activeKey setting)
encryption.keyProvider.keys.*.passwordFile   –  Path to the optional key password file

Note: By default the File Channel uses paths for checkpoint and data directories that are within the user home, as specified above. As a result, if you have more than one File Channel instance active within the agent, only one will be able to lock the directories, causing the other channel's initialization to fail. It is therefore necessary that you provide explicit paths to all the configured channels, preferably on different disks. Furthermore, as the file channel will sync to disk after every commit, coupling it with a sink/source that batches events together may be necessary to provide good performance where multiple disks are not available for checkpoint and data directories.

Example for agent named a1:

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

Encryption

Below are a few sample configurations.

Generating a key with a password separate from the key store password:

keytool -genseckey -alias key-0 -keypass keyPassword -keyalg AES \
  -keysize 128 -validity 9000 -keystore test.keystore \
  -storetype jceks -storepass keyStorePassword

Generating a key with the password the same as the key store password:

keytool -genseckey -alias key-1 -keyalg AES -keysize 128 -validity 9000 \
  -keystore test.keystore -storetype jceks -storepass keyStorePassword
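A generated key can then be referenced from the channel's encryption properties. A sketch using the key-0 alias from the keytool commands above (the keystore paths are placeholders):

a1.channels.c1.encryption.activeKey = key-0
a1.channels.c1.encryption.cipherProvider = AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider = JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
a1.channels.c1.encryption.keyProvider.keys = key-0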
u need to list the sources, sinks and channels for the given agent, and then point the source and sink to a channel. A source instance can specify multiple channels, but a sink instance can only specify one channel. The format is as follows:

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>

For example, an agent named agent_foo is reading data from an external avro client and sending it to HDFS via a memory channel. The config file weblog.config could look like:

# list the sources, sinks and channels for the agent
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1

# set channel for source
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1

# set channel for sink
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1

This will make the events flow from avro-appserver-src-1 to hdfs-sink-1 through the memory channel mem-channel-1. When the agent is started with weblog.config as its config file, it will instantiate that flow.

Configuring individual components

After defining the flow, you need to set the properties of each source, sink and channel.
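For instance, giving the avro source from the example above its type and listening address might look like this (the bind address and port are placeholders):

agent_foo.sources.avro-appserver-src-1.type = avro
agent_foo.sources.avro-appserver-src-1.bind = localhost
agent_foo.sources.avro-appserver-src-1.port = 10000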
urce - channel - sink pattern that was explained above in this user guide.

Failover Sink Processor

Failover Sink Processor maintains a prioritized list of sinks, guaranteeing that so long as one is available, events will be processed (delivered).

The failover mechanism works by relegating failed sinks to a pool where they are assigned a cool-down period (increasing with sequential failures) before they are retried. Once a sink successfully sends an event, it is restored to the live pool.

To configure, set a sink group's processor to failover and set priorities for all individual sinks. All specified priorities must be unique. Furthermore, an upper limit to failover time can be set (in milliseconds) using the maxpenalty property.

Required properties are in bold.

Property Name                  Default  Description
sinks                          –        Space-separated list of sinks that are participating in the group
processor.type                 default  The component type name, needs to be failover
processor.priority.<sinkName>  –        <sinkName> must be one of the sink instances associated with the current sink group
processor.maxpenalty           30000    (in millis)

Example for agent named a1:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

Load balancing Sink Processor

Load balancing sink processor provides the ability to load balance flow over multiple sinks.
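A representative load-balancing group, for contrast with the failover example above (the backoff and selector settings are illustrative):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random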
ute. For example, an event with timestamp 11:54:34 AM, June 12, 2012 will cause the hdfs path to become /flume/events/2012-06-12/1150/00.

Logger Sink

Logs events at INFO level. Typically useful for testing/debugging purposes. Required properties are in bold.

Property Name  Default  Description
channel        –
type           –        The component type name, needs to be logger

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Avro Sink

This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname/port pair. The events are taken from the configured Channel in batches of the configured batch size. Required properties are in bold.

Property Name              Default  Description
channel                    –
type                       –        The component type name, needs to be avro
hostname                   –        The hostname or IP address to connect to
port                       –        The port # to connect to
batch-size                 100      Number of events to batch together for send
connect-timeout            20000    Amount of time (ms) to allow for the first (handshake) request
request-timeout            20000    Amount of time (ms) to allow for requests after the first
reset-connection-interval  none     Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load balancer when new hosts are added without having to restart the agent.
v3
maxIoWorkers  2 * the number of available processors in the machine  The maximum number of I/O worker threads. This is configured on the NettyAvroRpcClient NioClientSocketChannelFactory.

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

Thrift Sink

This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Thrift events and sent to the configured hostname/port pair. The events are taken from the configured Channel in batches of the configured batch size. Required properties are in bold.

Property Name              Default  Description
channel                    –
type                       –        The component type name, needs to be thrift
hostname                   –        The hostname or IP address to connect to
port                       –        The port # to connect to
batch-size                 100      Number of events to batch together for send
connect-timeout            20000    Amount of time (ms) to allow for the first (handshake) request
request-timeout            20000    Amount of time (ms) to allow for requests after the first
connection-reset-interval  none     Amount of time (s) before the connection to the next hop is reset. This will force the Thrift Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load balancer when new hosts are added without having to restart the agent.

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = thrift
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
y the regex. You can plug custom serializer implementations into the extractor using the fully qualified class name (FQCN) to format the matches in any way you like.

Example 1:

If the Flume event body contained 1:2:3.4foobar5 and the following configuration was used:

a1.sources.r1.interceptors.i1.regex = (\d):(\d):(\d)
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three

The extracted event will contain the same body, but the following headers will have been added: one=>1, two=>2, three=>3

Example 2:

If the Flume event body contained 2012-10-18 18:47:57,614 some log line and the following configuration was used:

a1.sources.r1.interceptors.i1.regex = ^(?:\n)?(\d\d\d\d-\d\d-\d\d\s\d\d:\d\d)
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
a1.sources.r1.interceptors.i1.serializers.s1.name = timestamp
a1.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm

The extracted event will contain the same body, but the following header will have been added: timestamp=>1350611220000

Flume Properties

Property Name              Default  Description
flume.called.from.service  –        If this property is specified then the Flume agent will continue polling for the config file even if the config file is not found at the expected location.