Note: For all of the time related escape sequences, a header with the key "timestamp" must exist among the headers of the event (unless hdfs.useLocalTimeStamp is set to true). One way to add this automatically is to use the TimestampInterceptor.

Name | Default | Description
channel | – |
type | – | The component type name, needs to be hdfs
hdfs.path | – | HDFS directory path (eg hdfs://namenode/flume/webdata/)
hdfs.filePrefix | FlumeData | Name prefixed to files created by Flume in hdfs directory
hdfs.fileSuffix | – | Suffix to append to file (eg .avro - NOTE: period is not automatically added)
hdfs.inUsePrefix | – | Prefix that is used for temporal files that flume actively writes into
hdfs.inUseSuffix | .tmp | Suffix that is used for temporal files that flume actively writes into
hdfs.rollInterval | 30 | Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollSize | 1024 | File size to trigger roll, in bytes (0 = never roll based on file size)
hdfs.rollCount | 10 | Number of events written to file before it rolled (0 = never roll based on number of events)
hdfs.idleTimeout | 0 | Timeout after which inactive files get closed (0 = disable automatic closing of idle files)
hdfs.batchSize | 100 | Number of events written to file before it is flushed to HDFS
hdfs.codeC | – | Compression codec, one of the following: gzip, bzip2, lzo, lzop, snappy
hdfs.fileType | SequenceFile | File format: currently SequenceFile, DataStream or CompressedStream. (1) DataStream
This interceptor inserts into the event headers the time in millis at which it processes the event. This interceptor inserts a header with key timestamp whose value is the relevant timestamp. This interceptor can preserve an existing timestamp if it is already present in the configuration.

Property Name | Default | Description
type | – | The component type name, has to be timestamp or the FQCN
preserveExisting | false | If the timestamp already exists, should it be preserved - true or false

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

Host Interceptor

This interceptor inserts the hostname or IP address of the host that this agent is running on. It inserts a header with key host (or a configured key) whose value is the hostname or IP address of the host, based on configuration.

Property Name | Default | Description
type | – | The component type name, has to be host
preserveExisting | false | If the host header already exists, should it be preserved - true or false
useIP | true | Use the IP address if true, else use hostname
hostHeader | host | The header key to be used

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = hostname

Static Interceptor

Static interceptor allows user to append a static header with a static value to all events.
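As an illustration in the same style as the interceptor examples above, a minimal sketch wiring the static interceptor into a source; the key and value settings are the interceptor's configurable header name and header value, and the names chosen here are arbitrary:

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK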
morphlineFile | – |
morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file
batchSize | 1000 | The maximum number of events to take per flume transaction
batchDurationMillis | 1000 | The maximum duration per flume transaction (ms). The transaction commits after this duration or when batchSize is exceeded, whichever comes first.
handlerClass | org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl | The FQCN of a class implementing org.apache.flume.sink.solr.morphline.MorphlineHandler

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
a1.sinks.k1.channel = c1
a1.sinks.k1.morphlineFile = /etc/flume-ng/conf/morphline.conf
# a1.sinks.k1.morphlineId = morphline1
# a1.sinks.k1.batchSize = 1000
# a1.sinks.k1.batchDurationMillis = 1000

ElasticSearchSink

This sink writes data to an elasticsearch cluster. By default, events will be written so that the Kibana graphical interface can display them - just as if logstash wrote them.

The elasticsearch and lucene-core jars required for your environment must be placed in the lib directory of the Apache Flume installation. Elasticsearch requires that the major version of the client JAR match that of the server and that both are running the same minor version of the JVM. SerializationExceptions will appear if this is incorrect. To select the required version, first determine the version of elasticsearch the target cluster is running.
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Warning: The problem with ExecSource and other asynchronous sources is that the source can not guarantee that if there is a failure to put the event into the Channel the client knows about it. In such cases, the data will be lost. As a for instance, one of the most commonly requested features is the tail -F [file]-like use case where an application writes to a log file on disk and Flume tails the file, sending each line as an event. While this is possible, there's an obvious problem: what happens if the channel fills up and Flume can't send an event? Flume has no way of indicating to the application writing the log file that it needs to retain the log or that the event hasn't been sent, for some reason. If this doesn't make sense, you need only know this: Your application can never guarantee data has been received when using a unidirectional asynchronous interface such as ExecSource! As an extension of this warning - and to be completely clear - there is absolutely zero guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling Directory Source or direct integration with Flume via the SDK.

Note: You can use ExecSource to emulate TailSource from Flume 0.9x (flume og). Just use the unix command tail -F /full/path/to/your/file. Parameter -F is better in this case than -f as it will also follow file rotation.
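With that caveat in mind, a minimal tail -F configuration looks along these lines (the log path here is hypothetical):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1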
e.g. America/Los_Angeles
hdfs.useLocalTimeStamp | false | Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.
hdfs.closeTries | 0 | Number of times the sink must try to close a file. If set to 1, this sink will not re-try a failed close (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to close the file until the file is eventually closed (there is no limit on the number of times it would try).
hdfs.retryInterval | 180 | Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ".tmp" extension.
serializer | TEXT | Other possible options include avro_event or the fully qualified class name of an implementation of the EventSerializer.Builder interface.
serializer.* | – |

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

The above configuration will round down the timestamp to the last 10th minute.
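The roll triggers described earlier combine in the same way; as an illustrative sketch (not from the guide), the following rolls purely on file size by disabling the time-based and count-based triggers, with the path and the 128 MB figure chosen arbitrarily:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/
# Roll only on size: disable the time and event-count triggers
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0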
{
  "typeName1.componentName1" : {"metric1" : "metricValue1", "metric2" : "metricValue2"},
  "typeName2.componentName2" : {"metric3" : "metricValue3", "metric4" : "metricValue4"}
}

Here is an example:

{
  "CHANNEL.fileChannel" : {
    "EventPutSuccessCount" : "468085",
    "Type" : "CHANNEL",
    "StopTime" : "0",
    "EventPutAttemptCount" : "468086",
    "ChannelSize" : "233428",
    "StartTime" : "1344882233070",
    "EventTakeSuccessCount" : "458200",
    "ChannelCapacity" : "600000",
    "EventTakeAttemptCount" : "458288"
  },
  "CHANNEL.memChannel" : {
    "EventPutSuccessCount" : "22948908",
    "Type" : "CHANNEL",
    "StopTime" : "0",
    "EventPutAttemptCount" : "22948908",
    "ChannelSize" : "5",
    "StartTime" : "1344882209413",
    "EventTakeSuccessCount" : "22948900",
    "ChannelCapacity" : "100",
    "EventTakeAttemptCount" : "22948908"
  }
}

Property Name | Default | Description
type | – | The component type name, has to be http
port | 41414 | The port to start the server on

We can start Flume with JSON reporting support as follows:

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34545

Metrics will then be available at the http://<hostname>:<port>/metrics webpage. Custom components can report metrics as mentioned in the Ganglia section above.

Custom Reporting

It is possible to report metrics to other systems by writing servers that do the reporting. Any reporting class has to implement the interface org.apache.flume.instrumentation.MonitorService. Such a class can be used the same way the GangliaServer is used for reporting. They can poll the platform MBean server for the MBeans' metrics.
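As a quick sanity check that the JSON endpoint is serving, any HTTP client will do; for example, with the port configured above and the agent running locally:

$ curl http://localhost:34545/metrics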
Component Interface:

org.apache.flume.Sink
org.apache.flume.ChannelSelector
org.apache.flume.SinkProcessor
org.apache.flume.interceptor.Interceptor
org.apache.flume.channel.file.encryption.KeyProvider$Builder
org.apache.flume.channel.file.encryption.CipherProvider
org.apache.flume.serialization.EventSerializer$Builder

Alias Conventions

Type Alias:

avro, netcat, seq, exec, syslogtcp, multiport_syslogtcp, syslogudp, spooldir, http, thrift, jms, null, logger, avro, hdfs, hbase, asynchbase, elasticsearch, file_roll, irc, thrift, replicating, multiplexing, default, failover, load_balance, timestamp, host, static, regex_filter, regex_extractor, jceksfile, aesctrnopadding, text, avro_event

Implementation Class:

org.example.MyChannel, org.apache.flume.source.AvroSource, org.apache.flume.source.NetcatSource, org.apache.flume.source.SequenceGeneratorSource, org.apache.flume.source.ExecSource, o
this source will return a HTTP status of 400. If the channel is full, or the source is unable to append events to the channel, the source will return a HTTP 503 - Temporarily unavailable status.

All events sent in one post request are considered to be one batch and inserted into the channel in one transaction.

Property Name | Default | Description
type | – | The component type name, needs to be http
port | – | The port the source should bind to
bind | 0.0.0.0 | The hostname or IP address to listen on
handler | org.apache.flume.source.http.JSONHandler | The FQCN of the handler class
handler.* | – | Config parameters for the handler
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |
enableSSL | false | Set the property true to enable SSL. HTTP Source does not support SSLv3.
excludeProtocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded.
keystore | – | Location of the keystore, including keystore file name
keystorePassword | – | Keystore password

For example, a http source for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler
a1.sources.r1.handler.nickname = random props

JSONHandler

A handler is provided out of the box which can handle events represented in JSON format, and supports UTF-
8, UTF-16 and UTF-32 character sets. The handler accepts an array of events (even if there is only one event, the event has to be sent in an array) and converts them to a Flume event based on the encoding specified in the request. If no encoding is specified, UTF-8 is assumed. The JSON handler supports UTF-8, UTF-16 and UTF-32. Events are represented as follows:

[{
  "headers" : {
    "timestamp" : "434324343",
    "host" : "random_host.example.com"
  },
  "body" : "random_body"
},
{
  "headers" : {
    "namenode" : "namenode.example.com",
    "datanode" : "random_datanode.example.com"
  },
  "body" : "really_random_body"
}]

To set the charset, the request must have the content type specified as application/json;charset=UTF-8 (replace UTF-8 with UTF-16 or UTF-32 as required).

One way to create an event in the format expected by this handler is to use JSONEvent provided in the Flume SDK and use Google Gson to create the JSON string using the Gson#toJson(Object, Type) method. The type token to pass as the 2nd argument of this method for list of events can be created by:

Type type = new TypeToken<List<JSONEvent>>() {}.getType();

BlobHandler

By default HTTPSource splits JSON input into Flume events. As an alternative, BlobHandler is a handler for HTTPSource that returns an event that contains the request parameters as well as the Binary Large Object (BLOB) uploaded with this request, for example a PDF or JPG file. Note that this approach is not suitable for very large objects because it buffers up the entire BLOB in RAM.
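Returning to the JSON handler, a one-event batch can be posted to the source from the command line; the host, port, and payload below are made up for illustration:

$ curl -X POST -H 'Content-Type: application/json;charset=UTF-8' \
    -d '[{"headers": {"host": "host1"}, "body": "hello flume"}]' \
    http://localhost:5140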
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

The selector will attempt to write to the required channels first and will fail the transaction if even one of these channels fails to consume the events. The transaction is reattempted on all of the channels. Once all required channels have consumed the events, then the selector will attempt to write to the optional channels. A failure by any of the optional channels to consume the event is simply ignored and not retried.

If there is an overlap between the optional channels and required channels for a specific header, the channel is considered to be required, and a failure in the channel will cause the entire set of required channels to be retried. For instance, in the above example, for the header "CA" mem-channel-1 is considered to be a required channel even though it is marked both as required and optional, and a failure to write to this channel will cause that event to be retried on all channels configured for the selector.

Note that if a header does not have any required channels, then the event will be written to the default channels and will be attempted to be written to
-alias key-1 -keyalg AES -keysize 128 -validity 9000 \
  -keystore src/test/resources/test.keystore -storetype jceks \
  -storepass keyStorePassword

a1.channels.c1.encryption.activeKey = key-0
a1.channels.c1.encryption.cipherProvider = AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider = key-provider-0
a1.channels.c1.encryption.keyProvider = JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
a1.channels.c1.encryption.keyProvider.keys = key-0

Let's say you have aged key-0 out and new files should be encrypted with key-1:

a1.channels.c1.encryption.activeKey = key-1
a1.channels.c1.encryption.cipherProvider = AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider = JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
a1.channels.c1.encryption.keyProvider.keys = key-0 key-1

The same scenario as above, however key-0 has its own password:

a1.channels.c1.encryption.activeKey = key-1
a1.channels.c1.encryption.cipherProvider = AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider = JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
available; this is preferable over an auto-generated UUID because it enables subsequent updates and deletes of the event in data stores, using said well-known application-level key.

Property Name | Default | Description
type | – | The component type name has to be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
headerName | id | The name of the Flume header to modify
preserveExisting | true | If the UUID header already exists, should it be preserved - true or false
prefix | "" | The prefix string constant to prepend to each generated UUID

Morphline Interceptor

This interceptor filters the events through a morphline configuration file that defines a chain of transformation commands that pipe records from one command to another. For example the morphline can ignore certain events or alter or insert certain event headers via regular expression based pattern matching, or it can auto-detect and set a MIME type via Apache Tika on events that are intercepted. For example, this kind of packet sniffing can be used for content based dynamic routing in a Flume topology. MorphlineInterceptor can also help to implement dynamic routing to multiple Apache Solr collections (e.g. for multi-tenancy).

Currently, there is a restriction in that the morphline of an interceptor must not generate more than one output record for each input event. This interceptor is not intended for heavy duty ETL processing - if you need this consider moving ETL processing from the Flume Source to a Flume Sink.
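Returning to the UUID interceptor above, an illustrative sketch wiring it into a source; the component names are arbitrary and the settings simply restate the table's defaults:

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
a1.sources.r1.interceptors.i1.headerName = id
a1.sources.r1.interceptors.i1.preserveExisting = true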
Reporting classes can poll the platform MBean server for the MBeans' metrics. For example, an HTTP monitoring service called HTTPReporting could be used as follows:

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=com.example.reporting.HTTPReporting -Dflume.monitoring.no

Property Name | Default | Description
type | – | The component type name, has to be FQCN

Reporting metrics from custom components

Any custom flume components should inherit from the org.apache.flume.instrumentation.MonitoredCounterGroup class. The class should then provide getter methods for each of the metrics it exposes. See the code below. The MonitoredCounterGroup expects a list of attributes whose metrics are exposed by this class. As of now, this class only supports exposing metrics as long values.

public class SinkCounter extends MonitoredCounterGroup implements SinkCounterMBean {

  private static final String COUNTER_CONNECTION_CREATED =
      "sink.connection.creation.count";

  private static final String COUNTER_CONNECTION_CLOSED =
      "sink.connection.closed.count";

  private static final String COUNTER_CONNECTION_FAILED =
      "sink.connection.failed.count";

  private static final String COUNTER_BATCH_EMPTY =
      "sink.batch.empty";

  private static final String COUNTER_BATCH_UNDERFLOW =
      "sink.batch.underflow";

  private static final String COUNTER_BATCH_COMPLETE =
      "sink.batch.complete";

  private static final String COUNTER_EVENT_DRAIN_ATTEMPT =
      "sink.event.drain.attempt";
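The listing above breaks off inside the counter-name constants. To round out the pattern, a minimal sketch of the getter and increment methods such a class provides; it assumes MonitoredCounterGroup's protected get/increment helpers keyed by counter name, and the method names mirror the real SinkCounter but the snippet is abridged, not verbatim:

  // Getter exposed through the SinkCounterMBean interface.
  public long getConnectionCreatedCount() {
    return get(COUNTER_CONNECTION_CREATED);
  }

  // Increment helper a sink calls whenever it opens a connection.
  public long incrementConnectionCreatedCount() {
    return increment(COUNTER_CONNECTION_CREATED);
  }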
cases that stream raw data into HDFS (via the HdfsSink) and simultaneously extract, transform and load the same data into Solr (via MorphlineSolrSink). In particular, this sink can process arbitrary heterogeneous raw data from disparate data sources and turn it into a data model that is useful to Search applications.

The ETL functionality is customizable using a morphline configuration file that defines a chain of transformation commands that pipe event records from one command to another.

Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with streams of generic records, including arbitrary binary payloads. A morphline command is a bit like a Flume Interceptor. Morphlines can be embedded into Hadoop components such as Flume.

Commands to parse and transform a set of standard data formats such as log files, Avro, CSV, Text, HTML, XML, PDF, Word, Excel, etc. are provided out of the box, and additional custom commands and parsers for additional data formats can be added as morphline plugins. Any kind of data format can be indexed, any Solr documents for any kind of Solr schema can be generated, and any custom ETL logic can be registered and executed.

Morphlines manipulate continuous streams of records. The data model can be described as follows: A record is a set of named fields where each field has an ordered list of one or more values. A value can be any Java Object. That is, a record is essentially a hash table.
flow over multiple sinks. It maintains an indexed list of active sinks on which the load must be distributed. The implementation supports distributing load using either round_robin or random selection mechanisms. The choice of selection mechanism defaults to round_robin, but can be overridden via configuration. Custom selection mechanisms are supported via custom classes that inherit from AbstractSinkSelector.

When invoked, this selector picks the next sink using its configured selection mechanism and invokes it. For round_robin and random, in case the selected sink fails to deliver the event, the processor picks the next available sink via its configured selection mechanism. This implementation does not blacklist the failing sink and instead continues to optimistically attempt every available sink. If all sink invocations result in failure, the selector propagates the failure to the sink runner.

If backoff is enabled, the sink processor will blacklist sinks that fail, removing them from selection for a given timeout. When the timeout ends, if the sink is still unresponsive the timeout is increased exponentially to avoid potentially getting stuck in long waits on unresponsive sinks. With this disabled, in round-robin all the failed sink's load will be passed to the next sink in line and thus not evenly balanced.

Required properties are in bold.

Property Name | Default | Description
processor.sinks | – | Space-separated list of sinks that are participating in the group
groups using a specified regular expression and appends the match groups as headers on the event. It also supports pluggable serializers for formatting the match groups before adding them as event headers.

Property Name | Default | Description
type | – | The component type name has to be regex_extractor
regex | – | Regular expression for matching against events
serializers | – | Space-separated list of serializers for mapping matches to header names and serializing their values (see example below). Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer, org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
serializers.<s1>.type | default | Must be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer
serializers.<s1>.name | – |
serializers.* | – | Serializer-specific properties

The serializers are used to map the matches to a header name and a formatted header value; by default, you only need to specify the header name and the default org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer will be used. This serializer simply maps the matches to the specified header name and passes the value through as it was extracted by the regex.
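To make the mapping concrete, a small sketch (the regex and header names are invented for illustration): if an event body contained 1:2:3.4foobar5, the configuration below would add the headers one=1, two=2 and three=3 using the default pass-through serializer:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three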
a1.sinks.k1.indexType = bar_type
a1.sinks.k1.clusterName = foobar_cluster
a1.sinks.k1.batchSize = 500
a1.sinks.k1.ttl = 5d
a1.sinks.k1.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer
a1.sinks.k1.channel = c1

Kite Dataset Sink (experimental)

Warning: This sink is experimental and may change between minor versions of Flume. Use at your own risk.

Experimental sink that writes events to a Kite Dataset. This sink will deserialize the body of each incoming event and store the resulting record in a Kite Dataset. It determines the target Dataset by opening a repository URI, kite.repo.uri, and loading a Dataset by name, kite.dataset.name.

The only supported serialization is avro, and the record schema must be passed in the event headers, using either flume.avro.schema.literal with the JSON schema representation or flume.avro.schema.url with a URL where the schema may be found (hdfs:/... URIs are supported). This is compatible with the Log4jAppender flume client and the spooling directory source's Avro deserializer using deserializer.schemaType = LITERAL.

Note 1: The flume.avro.schema.hash header is not supported. Note 2: In some cases, file rolling may occur slightly after the roll interval has been exceeded. However, this delay will not exceed 5 seconds. In most cases, the delay is negligible.

Property Name | Default | Description
channel | – |
type | – | Must be org.apache.flume.sink.kite.DatasetSink
kite.repo.uri | – | URI of the repository to open
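A minimal configuration sketch, assuming a dataset named "events" in an HDFS-backed repository; the URIs and names are hypothetical:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
a1.sinks.k1.channel = c1
a1.sinks.k1.kite.repo.uri = repo:hdfs://namenode:8020/data
a1.sinks.k1.kite.dataset.name = events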
is turned into a Flume event and sent via the connected channel. Required properties are in bold.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be netcat
bind | – | Host name or IP address to bind to
port | – | Port # to bind to
max-line-length | 512 | Max line length per event body (in bytes)
ack-every-event | true | Respond with an "OK" for every event received
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1

Sequence Generator Source

A simple sequence generator that continuously generates events with a counter that starts from 0 and increments by 1. Useful mainly for testing. Required properties are in bold.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be seq
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |
batchSize | 1 |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.channels = c1

Syslog Sources

Reads syslog data and generates Flume events. The UDP source treats an entire message as a single event.
its log file and stop processing.

To avoid the above issues, it may be useful to add a unique identifier (such as a timestamp) to log file names when they are moved into the spooling directory.

Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent with the guarantees offered by other Flume components.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be spooldir
spoolDir | – | The directory from which to read files
fileSuffix | .COMPLETED | Suffix to append to completely ingested files
deletePolicy | never | When to delete completed files: never or immediate
fileHeader | false | Whether to add a header storing the absolute path filename
fileHeaderKey | file | Header key to use when appending absolute path filename to event header
basenameHeader | false | Whether to add a header storing the basename of the file
basenameHeaderKey | basename | Header key to use when appending basename of file to event header
ignorePattern | ^$ | Regular expression specifying which files to ignore (skip)
trackerDir | .flumespool | Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
consumeOrder | oldest | In which order files in the spooling directory will be consumed: oldest, youngest and random. In case of oldest and youngest, the last modified
a1.channels.c1.encryption.keyProvider.keys = key-0 key-1
a1.channels.c1.encryption.keyProvider.keys.key-0.passwordFile = /path/to/key-0.password

Spillable Memory Channel

The events are stored in an in-memory queue and on disk. The in-memory queue serves as the primary store and the disk as overflow. The disk store is managed using an embedded File channel. When the in-memory queue is full, additional incoming events are stored in the file channel. This channel is ideal for flows that need the high throughput of a memory channel during normal operation, but at the same time need the larger capacity of the file channel for better tolerance of intermittent sink-side outages or drops in drain rates. The throughput will reduce approximately to file channel speeds during such abnormal situations. In case of an agent crash or restart, only the events stored on disk are recovered when the agent comes online. This channel is currently experimental and not recommended for use in production.

Required properties are in bold. Please refer to the file channel for additional required properties.

Property Name | Default | Description
type | – | The component type name, needs to be SPILLABLEMEMORY
memoryCapacity | 10000 | Maximum number of events stored in the memory queue. To disable use of the in-memory queue, set this to zero.
overflowCapacity | 100000000 | Maximum number of events stored in overflow disk (i.e. File channel). To disable use of overflow, set this to zero.
overflowTimeout | 3 | The number of seconds to wait before enabling disk overflow when memory fills up.
the default channels, if no required channels are specified. If no channels are designated as default and there are no required, the selector will attempt to write the events to the optional channels. Any failures are simply ignored in that case.

Flume Sources

Avro Source

Listens on Avro port and receives events from external Avro client streams. When paired with the built-in Avro Sink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties are in bold.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be avro
bind | – | hostname or IP address to listen on
port | – | Port # to bind to
threads | – | Maximum number of worker threads to spawn
selector.type | – |
selector.* | – |
interceptors | – | Space-separated list of interceptors
interceptors.* | – |
compression-type | none | This can be "none" or "deflate". The compression-type must match the compression-type of the matching AvroSink.
ssl | false | Set this to true to enable SSL encryption. You must also specify a "keystore" and a "keystore-password".
keystore | – | This is the path to a Java keystore file. Required for SSL.
keystore-password | – | The password for the Java keystore. Required for SSL.
keystore-type | JKS | The type of the Java keystore. This can be "JKS" or "PKCS12".
exclude-protocols | SSLv3 | Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
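For instance, an Avro source for agent named a1, listening on all interfaces on port 4141 (the port is chosen arbitrarily here):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141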
processor.type | default | The component type name, needs to be load_balance
processor.backoff | false | Should failed sinks be backed off exponentially.
processor.selector | round_robin | Selection mechanism. Must be either round_robin, random or the FQCN of a custom class that inherits from AbstractSinkSelector
processor.selector.maxTimeOut | 30000 | Used by backoff selectors to limit exponential backoff (in milliseconds)

Example for agent named a1:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

Custom Sink Processor

Custom sink processors are not supported at the moment.

Event Serializers

The file_roll sink and the hdfs sink both support the EventSerializer interface. Details of the EventSerializers that ship with Flume are provided below.

Body Text Serializer

Alias: text. This serializer writes the body of the event to an output stream without any transformation or modification. The event headers are ignored. Configuration options are as follows:

Property Name | Default | Description
appendNewline | true | Whether a newline will be appended to each event at write time. The default of true assumes that events do not contain newlines, for legacy reasons.

Example for agent named a1:

a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume
a1.sinks.k1.sink.serializer = text
time of the files will be used to compare the files. In case of a tie, the file with the smallest lexicographical order will be consumed first. In case of random, any file will be picked randomly. When using oldest and youngest, the whole directory will be scanned to pick the oldest/youngest file, which might be slow if there are a large number of files, while using random may cause old files to be consumed very late if new files keep coming in the spooling directory.
maxBackoff | 4000 | The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, up to the value specified by this parameter.
batchSize | 100 | Granularity at which to batch transfer to the channel
inputCharset | UTF-8 | Character set used by deserializers that treat the input file as text
decodeErrorPolicy | FAIL | What to do when we see a non-decodable character in the input file. FAIL: Throw an exception and fail to parse the file. REPLACE: Replace the unparseable character with the "replacement character" char, typically Unicode U+FFFD. IGNORE: Drop the unparseable character sequence.
deserializer | LINE | Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
deserializer.* | – | Varies per event deserializer
bufferMaxLines
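Pulling the main options together, a typical spooling directory source might look like this; the directory path is illustrative:

a1.channels = ch-1
a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true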
using a shell script called flume-ng which is located in the bin directory of the Flume distribution. You need to specify the agent name, the config directory, and the config file on the command line:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

Now the agent will start running the source and sinks configured in the given properties file.

A simple example

Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file might define several named agents.
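Assuming the file above is saved as example.conf, the agent can be launched along these lines; the -Dflume.root.logger override simply surfaces the logger sink's output on the console:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

From a second terminal you can then telnet localhost 44444 and type a line of text; it should appear in the agent's console output as a logged event.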
will not compress the output file, and please don't set codeC. (2) CompressedStream requires setting hdfs.codeC with an available codeC.

hdfs.maxOpenFiles | 5000 | Allow only this number of open files. If this number is exceeded, the oldest file is closed.
hdfs.minBlockReplicas | – | Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath.
hdfs.writeFormat | – | Format for sequence file records. One of "Text" or "Writable" (the default).
hdfs.callTimeout | 10000 | Number of milliseconds allowed for HDFS operations, such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring.
hdfs.threadsPoolSize | 10 | Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
hdfs.rollTimerPoolSize | 1 | Number of threads per HDFS sink for scheduling timed file rolling
hdfs.kerberosPrincipal | – | Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab | – | Kerberos keytab for accessing secure HDFS
hdfs.proxyUser | – |
hdfs.round | false | Should the timestamp be rounded down (if true, affects all time based escape sequences except %t)
hdfs.roundValue | 1 | Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time
hdfs.roundUnit | second | The unit of the round down value - second, minute or hour
hdfs.timeZone | Local Time | Name of the timezone that should be used for resolving the directory path,
FS sink Kerberos-related options, if the underlying HDFS is running in secure mode; please refer to the HDFS Sink section.

Monitoring

Monitoring in Flume is still a work in progress. Changes can happen very often. Several Flume components report metrics to the JMX platform MBean server. These metrics can be queried using JConsole.

Ganglia Reporting

Flume can also report these metrics to Ganglia 3 or Ganglia 3.1 metanodes. To report metrics to Ganglia, a flume agent must be started with this support. The Flume agent has to be started by passing in the following parameters as system properties prefixed by flume.monitoring., and can be specified in flume-env.sh:

Property Name | Default | Description
type | – | The component type name, has to be ganglia
hosts | – | Comma-separated list of hostname:port of Ganglia servers
pollFrequency | 60 | Time, in seconds, between consecutive reporting to the Ganglia server
isGanglia3 | false | Ganglia server version is 3. By default, Flume sends in Ganglia 3.1 format.

We can start Flume with Ganglia support as follows:

$ bin/flume-ng agent --conf-file example.conf --name a1 -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=com.example:1234,com.exa

JSON Reporting

Flume can also report metrics in a JSON format. To enable reporting in JSON format, Flume hosts a Web server on a configurable port. Flume reports metrics in the following JSON format:

{
  "typeName1.componentName1" : { ... },
  "typeName2.componentName2" : { ... }
}
Flume 1.5.2 User Guide

Introduction

Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.

Apache Flume is a top level project at the Apache Software Foundation.

There are currently two release code lines available, versions 0.9.x and 1.x.

Documentation for the 0.9.x track is available at the Flume 0.9.x User Guide.

This documentation applies to the 1.4.x track.

New and existing users are encouraged to use the 1.x releases so as to leverage the performance improvements and configuration flexibilities available in the latest architecture.

System Requirements

1. Java Runtime Environment - Java 1.6 or later (Java 1.7 recommended)
2. Memory - Sufficient memory for configurations used by sources, channels or sinks
3. Disk Space - Sufficient disk space for configurations used by channels or sinks
4. Directory Permissions - Read/Write permissions for directories used by agent

Architecture

Data flow model

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes.
Flume Sink, e.g. to a MorphlineSolrSink.

Required properties are in bold.

Property Name | Default | Description
type | – | The component type name has to be org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
morphlineFile | – | The relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf
morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file

Sample flume.conf file:

a1.sources.avroSrc.interceptors = morphlineinterceptor
a1.sources.avroSrc.interceptors.morphlineinterceptor.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineFile = /etc/flume-ng/conf/morphline.conf
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineId = morphline1

Regex Filtering Interceptor

This interceptor filters events selectively by interpreting the event body as text and matching the text against a configured regular expression. The supplied regular expression can be used to include events or exclude events.

Property Name | Default | Description
type | – | The component type name has to be regex_filter
regex | ".*" | Regular expression for matching against events
excludeEvents | false | If true, regex determines events to exclude; otherwise regex determines events to include.

Regex Extractor Interceptor

This interceptor extracts regex match groups
While Flume ships with many out-of-the-box sources, channels, sinks, serializers, and the like, many implementations exist which ship separately from Flume.

While it has always been possible to include custom Flume components by adding their jars to the FLUME_CLASSPATH variable in the flume-env.sh file, Flume now supports a special directory called plugins.d which automatically picks up plugins that are packaged in a specific format. This allows for easier management of plugin packaging issues as well as simpler debugging and troubleshooting of several classes of issues, especially library dependency conflicts.

The plugins.d directory

The plugins.d directory is located at $FLUME_HOME/plugins.d. At startup time, the flume-ng start script looks in the plugins.d directory for plugins that conform to the below format and includes them in proper paths when starting up java.

Directory layout for plugins

Each plugin (subdirectory) within plugins.d can have up to three sub-directories:

1. lib - the plugin's jar(s)
2. libext - the plugin's dependency jar(s)
3. native - any required native libraries, such as .so files

Example of two plugins within the plugins.d directory:

plugins.d/
plugins.d/custom-source-1/
plugins.d/custom-source-1/lib/my-source.jar
plugins.d/custom-source-1/libext/spring-core-2.5.6.jar
plugins.d/custom-source-2/
plugins.d/custom-source-2/lib/custom.jar
plugins.d/custom-source-2/native/gettext.so

Data ingestion

Flume supports a number of mechanisms to ingest data from external sources.
a persistent storage that's backed by a database. The JDBC channel currently supports embedded Derby. This is a durable channel that's ideal for flows where recoverability is important. Required properties are in bold.

Property Name | Default | Description
type | – | The component type name, needs to be jdbc
db.type | DERBY | Database vendor
driver.class | org.apache.derby.jdbc.EmbeddedDriver |
driver.url | (constructed from other properties) |
db.username | "sa" |
db.password | – |
connection.properties.file | – |
create.schema | true |
create.index | true |
create.foreignkey | true |
transaction.isolation | "READ_COMMITTED" |
maximum.connections | 10 |
maximum.capacity | 0 (unlimited) |
sysprop.* | – |
sysprop.user.home | – |

Example for agent named a1:

a1.channels = c1
a1.channels.c1.type = jdbc

File Channel

Required properties are in bold.

Property Name | Default
type | –
checkpointDir | ~/.flume/file-channel/checkpoint
useDualCheckpoints | false
backupCheckpointDir | –
dataDirs | ~/.flume/file-channel/data
transactionCapacity | 10000
checkpointInterval | 30000
maxFileSize | 2146435071
minimumRequiredSpace | 524288000
capacity | 1000000
keep-alive | 3
use-log-replay-v1 | false
use-fast-replay | false
encryption.activeKey | –
encryption.cipherProvider | –
encryption.keyProvider | –
encryption.keyProvider.keyStoreFile | –
encryption.keyProvider.keyStorePasswordFile | –
encryption.keyProvider.keys | –
encryption.keyProvider.keys.*.passwordFile | –
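In the same style as the JDBC example above, a minimal durable file channel can be sketched like this; the checkpoint and data directories are illustrative and should live on disks the agent can write to:

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data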
ample, a multiport syslog TCP source for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.channels = c1
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.portHeader = port

Syslog UDP Source

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be syslogudp
host | – | Host name or IP address to bind to
port | – | Port # to bind to
keepFields | false | Setting this to true will preserve the Priority, Timestamp and Hostname in the body of the event.
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

For example, a syslog UDP source for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

HTTP Source

A source which accepts Flume Events by HTTP POST and GET. GET should be used for experimentation only. HTTP requests are converted into flume events by a pluggable "handler" which must implement the HTTPSourceHandler interface. This handler takes a HttpServletRequest and returns a list of flume events. All events handled from one Http request are committed to the channel in one transaction, thus allowing for increased efficiency on channels like the file channel. If the handler throws an exception,
capacity | 100 | The maximum number of events stored in the channel
transactionCapacity | 100 | The maximum number of events the channel will take from a source or give to a sink per transaction
keep-alive | 3 | Timeout in seconds for adding or removing an event
byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity | see description | Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source), then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.

Example for agent named a1:

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

JDBC Channel

The events are stored in
can be - for instance, it cannot be larger than what you can store in memory or on disk on a single machine - but in practice flume events can be everything from textual log entries to image files. The key property of an event is that events are generated in a continuous, streaming fashion. If your data is not regularly generated (i.e. you are trying to do a single bulk load of data into a Hadoop cluster) then Flume will still work, but it is probably overkill for your situation. Flume likes relatively stable topologies. Your topologies do not need to be immutable, because Flume can deal with changes in topology without losing data and can also tolerate periodic reconfiguration due to fail-over or provisioning. It probably won't work well if you plan to change topologies every day, because reconfiguration takes some thought and overhead.

Flow reliability in Flume

The reliability of a Flume flow depends on several factors. By adjusting these factors, you can achieve a wide array of reliability options with Flume:

What type of channel you use. Flume has both durable channels (those which will persist data to disk) and non-durable channels (those which will lose data if a machine fails). Durable channels use disk-based storage, and data stored in such channels will persist across machine restarts or non-disk-related failures.

Whether your channels are sufficiently provisioned for the workload. Channels in Flume act as buffers at various hops. These buffers
channel-2

# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

The selector checks for a header called "State". If the value is "CA" then it's sent to mem-channel-1, if it's "AZ" then it goes to file-channel-2, or if it's "NY" then both. If the "State" header is not set or doesn't match any of the three, then it goes to mem-channel-1, which is designated as "default".

The selector also supports optional channels. To specify optional channels for a header, the config parameter "optional" is used in the following way:

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping
channel is full, then it will reset and retry from the most recent Avro container file sync point. To reduce potential event duplication in such a failure scenario, write sync markers more frequently in your Avro input files.

Property Name | Default | Description
deserializer.schemaType | HASH | How the schema is represented. By default, or when the value HASH is specified, the Avro schema is hashed and the hash is stored in every event in the event header "flume.avro.schema.hash". If LITERAL is specified, the JSON-encoded schema itself is stored in every event in the event header "flume.avro.schema.literal". Using LITERAL mode is relatively inefficient compared to HASH mode.

BlobDeserializer

This deserializer reads a Binary Large Object (BLOB) per event, typically one BLOB per file, for example a PDF or JPG file. Note that this approach is not suitable for very large objects because the entire BLOB is buffered in RAM.

Property Name | Default | Description
deserializer | – | The FQCN of this class: org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
deserializer.maxBlobLength | 100000000 | The maximum number of bytes to read and buffer for a given request

NetCat Source

A netcat-like source that listens on a given port and turns each line of text into an event. Acts like nc -k -l [host] [port]. In other words, it opens a specified port and listens for data. The expectation is that the supplied data is newline-separated text. Each line of text
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = thrift
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

IRC Sink

The IRC sink takes messages from the attached channel and relays those to configured IRC destinations. Required properties are in bold.

Property Name | Default | Description
channel | – |
type | – | The component type name, needs to be irc
hostname | – | The hostname or IP address to connect to
port | 6667 | The port number of the remote host to connect to
nick | – | Nick name
user | – | User name
password | – | User password
chan | – | channel
name | – |
splitlines | – | (boolean)
splitchars | \n | line separator (if you were to enter the default value into the config file, then you would need to escape the backslash, like this: "\\n")

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = irc
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = irc.yourdomain.com
a1.sinks.k1.nick = flume
a1.sinks.k1.chan = #flume

File Roll Sink

Stores events on the local filesystem. Required properties are in bold.

Property Name | Default | Description
channel | – |
type | – | The component type name, needs to be file_roll
sink.directory | – | The directory where files will be stored
sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file.
sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of the EventSerializer.Builder interface
The TCP sources create a new event for each string of characters separated by a newline ("\n").

Required properties are in bold.

Syslog TCP Source

The original, tried-and-true syslog TCP source.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be syslogtcp
host | – | Host name or IP address to bind to
port | – | Port # to bind to
eventSize | 2500 | Maximum size of a single event line, in bytes
keepFields | false | Setting this to true will preserve the Priority, Timestamp and Hostname in the body of the event.
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

For example, a syslog TCP source for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

Multiport Syslog TCP Source

This is a newer, faster, multi-port capable version of the Syslog TCP source. Note that the ports configuration setting has replaced port. Multi-port capability means that it can listen on many ports at once in an efficient manner. This source uses the Apache Mina library to do that. Provides support for RFC-3164 and many common RFC-5424 formatted messages. Also provides the capability to configure the character set used on a per-port basis.

Property Name | Default | Description
overflowTimeout | 3 | The number of seconds to wait before enabling disk overflow when memory fills up
byteCapacityBufferPercentage | 20 | Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity | see description | Maximum bytes of memory allowed as a sum of all events in the memory queue. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source), then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.
avgEventSize | 500 | Estimated average size of events, in bytes, going into the channel
<file channel properties> | see file channel | Any file channel property with the exception of "keep-alive" and "capacity" can be used. The keep-alive of the file channel is managed by the Spillable Memory Channel. Use "overflowCapacity" to set the File channel's capacity.

The in-memory queue is considered full if either the memoryCapacity or byteCapacity limit is reached.

Example for agent named a1:
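A sketch consistent with the properties above; the capacities and overflow directories are illustrative, not prescriptive:

a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.byteCapacity = 800000
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data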
The type of the custom channel is its FQCN. Required properties are in bold:

Property Name | Default | Description
type | – | The component type name, needs to be a FQCN

Example for agent named a1:

a1.channels = c1
a1.channels.c1.type = org.example.MyChannel

Flume Channel Selectors

If the type is not specified, then it defaults to "replicating".

Replicating Channel Selector (default)

Required properties are in bold:

Property Name | Default | Description
selector.type | replicating | The component type name, needs to be replicating
selector.optional | – | Set of channels to be marked as optional

Example for agent named a1 and its source called r1:

a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3

In the above configuration, c3 is an optional channel. Failure to write to c3 is simply ignored. Since c1 and c2 are not marked optional, failure to write to those channels will cause the transaction to fail.

Multiplexing Channel Selector

Required properties are in bold:

Property Name | Default | Description
selector.type | replicating | The component type name, needs to be multiplexing
selector.header | flume.selector.header |
selector.default | – |
selector.mapping.* | – |

Example for agent named a1 and its source called r1:

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
ElasticSearchSink
hostNames | – | Comma-separated list of hostname:port; if the port is not present the default port '9300' will be used
indexName | flume | The name of the index which the date will be appended to. Example: 'flume' -> 'flume-yyyy-MM-dd'
indexType | logs | The type to index the document to, defaults to 'log'
clusterName | elasticsearch | Name of the ElasticSearch cluster to connect to
batchSize | 100 | Number of events to be written per txn.
ttl | – | TTL in days; when set, will cause the expired documents to be deleted automatically. If not set, documents will never be automatically deleted. TTL is accepted both in the earlier form of integer only, e.g. a1.sinks.k1.ttl = 5, and also with a qualifier ms (millisecond), s (second), m (minute), h (hour), d (day) and w (week). Example: a1.sinks.k1.ttl = 5d will set TTL to 5 days. Follow http://www.elasticsearch.org/guide/reference/mapping/ttl-field/ for more information.
serializer | org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer | The ElasticSearchIndexRequestBuilderFactory or ElasticSearchEventSerializer to use. Implementations of either class are accepted but ElasticSearchIndexRequestBuilderFactory is preferred.
serializer.* | – | Properties to be passed to the serializer.

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = elasticsearch
a1.sinks.k1.hostNames = 127.0.0.1:9200,127.0.0.2:9300
a1.sinks.k1.indexName = foo_index
a1.sinks.k1.
without having to restart the agent.
compression-type | none | This can be 'none' or 'deflate'. The compression-type must match the compression-type of the matching AvroSource.
compression-level | 6 | The level of compression to compress the event. 0 = no compression and 1-9 is compression. The higher the number, the more compression.
ssl | false | Set to true to enable SSL for this AvroSink. When configuring SSL, you can optionally set a 'truststore', 'truststore-password', 'truststore-type', and specify whether to 'trust-all-certs'.
trust-all-certs | false | If this is set to true, SSL server certificates for remote servers (Avro Sources) will not be checked. This should NOT be used in production because it makes it easier for an attacker to execute a man-in-the-middle attack and 'listen in' on the encrypted connection.
truststore | – | The path to a custom Java truststore file. Flume uses the certificate authority information in this file to determine whether the remote Avro Source's SSL authentication credentials should be trusted. If not specified, the default Java JSSE certificate authority files (typically 'jssecacerts' or 'cacerts' in the Oracle JRE) will be used.
truststore-password | – | The password for the specified truststore.
truststore-type | JKS | The type of the Java truststore. This can be 'JKS' or another supported Java truststore type.
exclude-protocols | SSLv2Hello | Space-separated list of SSL/TLS protocols to exclude
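For instance, a minimal sketch of an SSL-enabled Avro sink using these options (the truststore path and password are hypothetical):

a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
a1.sinks.k1.ssl = true
# hypothetical truststore location and password
a1.sinks.k1.truststore = /path/to/truststore.jks
a1.sinks.k1.truststore-password = changeit
a1.sinks.k1.truststore-type = JKS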
AvroLegacySource
host | – | The hostname or IP address to bind to
port | – | The port # to listen on
selector.type | – | replicating or multiplexing
selector.* | replicating | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.avroLegacy.AvroLegacySource
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.bind = 6666
a1.sources.r1.channels = c1

Thrift Legacy Source

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be org.apache.flume.source.thriftLegacy.ThriftLegacySource
host | – | The hostname or IP address to bind to
port | – | The port # to listen on
selector.type | – | replicating or multiplexing
selector.* | replicating | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.thriftLegacy.ThriftLegacySource
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.bind = 6666
a1.sources.r1.channels = c1

Custom Source

A custom source is your own implementation of the Source interface. A custom source's class and its dependencies must be included in the agent's classpath when starting the Flume agent. The type of the custom source is its FQCN.

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be your FQCN
batchSize | 100 |

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume

Null Sink

Discards all events it receives from the channel. Required properties are in bold:

Property Name | Default | Description
channel | – |
type | – | The component type name, needs to be null
batchSize | 100 |

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = null
a1.sinks.k1.channel = c1

HBaseSinks

HBaseSink

This sink writes data to HBase. The HBase configuration is picked up from the first hbase-site.xml encountered in the classpath. A class implementing HbaseEventSerializer, which is specified by the configuration, is used to convert the events into HBase puts and/or increments. These puts and increments are then written to HBase. This sink provides the same consistency guarantees as HBase, which is currently row-wise atomicity. In the event of HBase failing to write certain events, the sink will replay all events in that transaction.

The HBaseSink supports writing data to secure HBase. To write to secure HBase, the user the agent is running as must have write permissions to the table the sink is configured to write to. The principal and keytab to use to authenticate against the KDC can be specified in the configuration. The hbase-site.xml in the Flume agent's classpath must have authentication set to kerberos. For details on how to do this, please refer to the HBase documentation.
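As an illustration, a hedged sketch of an HBaseSink configured for secure HBase using the kerberosPrincipal and kerberosKeytab properties described in this section (the principal and keytab path are hypothetical):

a1.sinks.k1.type = hbase
a1.sinks.k1.channel = c1
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
# hypothetical Kerberos identity; hbase-site.xml must have authentication set to kerberos
a1.sinks.k1.kerberosPrincipal = flume/_HOST@EXAMPLE.COM
a1.sinks.k1.kerberosKeytab = /etc/security/keytabs/flume.keytab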
bufferMaxLines | – | (Obsolete) This option is now ignored.
bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.
selector.type | replicating | replicating or multiplexing
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example for an agent named agent-1:

agent-1.channels = ch-1
agent-1.sources = src-1
agent-1.sources.src-1.type = spooldir
agent-1.sources.src-1.channels = ch-1
agent-1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
agent-1.sources.src-1.fileHeader = true

Twitter 1% firehose Source (experimental)

Warning: This source is highly experimental and may change between minor versions of Flume. Use at your own risk.

Experimental source that connects via Streaming API to the 1% sample Twitter firehose, continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink. Requires the consumer and access tokens and secrets of a Twitter developer account. Required properties are in bold:

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be org.apache.flume.source.twitter.TwitterSource
consumerKey | – | OAuth consumer key
consumerSecret | – | OAuth consumer secret
accessToken | – | OAuth access token
accessTokenSecret | – | OAuth token secret
maxBatchSize | 1000 | Maximum number of twitter messages to put in a single batch
maxBatchDurationMillis | 1000 | Maximum number of milliseconds to wait before closing a batch
The flow using the file channel or another stable channel will resume processing events where it left off. If the agent can't be restarted on the same hardware, then there is an option to migrate the database to other hardware and set up a new Flume agent that can resume processing the events saved in the db. The database HA features can be leveraged to move the Flume agent to another host.

Compatibility

HDFS

Currently Flume supports HDFS 0.20.2 and 0.23.

AVRO

TBD

Additional version requirements

TBD

Tracing

TBD

More Sample Configs

TBD

Component Summary

Component Interface | Type Alias | Implementation Class
org.apache.flume.Channel | memory | org.apache.flume.channel.MemoryChannel
org.apache.flume.Channel | jdbc | org.apache.flume.channel.jdbc.JdbcChannel
org.apache.flume.Channel | file | org.apache.flume.channel.file.FileChannel
org.apache.flume.Channel | – | org.apache.flume.channel.PseudoTxnMemoryChannel

[The Interface column for the remaining rows of this table continues with Source and Sink entries; the matching implementation classes appear later in this summary.]
a number of first-tier agents with an avro sink, all pointing to an avro source of a single agent (again you could use the thrift sources/sinks/clients in such a scenario). This source on the second-tier agent consolidates the received events into a single channel which is consumed by a sink to its final destination.

Multiplexing the flow

Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.

[Figure: a source from agent "foo" fanning out to three channels]

The above example shows a source from agent "foo" fanning out the flow to three different channels. This fan out can be replicating or multiplexing. In case of replicating flow, each event is sent to all three channels. For the multiplexing case, an event is delivered to a subset of available channels when an event's attribute matches a preconfigured value. For example, if an event attribute called "txnType" is set to "customer", then it should go to channel1 and channel3; if it's "vendor" then it should go to channel2, otherwise channel3. The mapping can be set in the agent's configuration file.

Configuration

As mentioned in the earlier section, Flume agent configuration is read from a file that resembles a Java property file format with hierarchical property settings.

Defining the flow

To define the flow within a single agent, you need to link the sources and sinks via a channel. You need to list the sources, sinks and channels for the given agent, and then point the source and sink to a channel.
and backup routes (fail-over) for failed hops.

Reliability

The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of the events. The sources and sinks encapsulate in a transaction the storage/retrieval, respectively, of the events placed in or provided by a transaction provided by the channel. This ensures that the set of events are reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.

Recoverability

The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel which is backed by the local file system. There's also a memory channel which simply stores the events in an in-memory queue, which is faster, but any events still left in the memory channel when an agent process dies can't be recovered.

Setup

Setting up an agent

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format.
have a fixed capacity, and once that capacity is full you will create back-pressure on earlier points in the flow. If this pressure propagates to the source of the flow, Flume will become unavailable and may lose data.

Whether you use redundant topologies: Flume lets you replicate flows across redundant topologies. This can provide a very easy source of fault tolerance and one which overcomes both disk and machine failures.

The best way to think about reliability in a Flume topology is to consider various failure scenarios and their outcomes. What happens if a disk fails? What happens if a machine fails? What happens if your terminal sink (e.g. HDFS) goes down for some time and you have back pressure? The space of possible designs is huge, but the underlying questions you need to ask are just a handful.

Flume topology design

The first step in designing a Flume topology is to enumerate all sources and destinations (terminal sinks) for your data. These will define the edge points of your topology. The next consideration is whether to introduce intermediate aggregation tiers or event routing. If you are collecting data from a large number of sources, it can be helpful to aggregate the data in order to simplify ingestion at the terminal sink. An aggregation tier can also smooth out burstiness from sources or unavailability at sinks, by acting as a buffer. If you are routing data between different locations, you may also want to split flows at various points.
The list of events returned by one interceptor is passed to the next interceptor in the chain. Interceptors can modify or drop events. If an interceptor needs to drop events, it just does not return that event in the list that it returns. If it is to drop all events, then it simply returns an empty list. Interceptors are named components; here is an example of how they are created through configuration:

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sinks.k1.filePrefix = FlumeData.%{CollectorHost}.%Y-%m-%d
a1.sinks.k1.channel = c1

Note that the interceptor builders are passed to the type config parameter. The interceptors are themselves configurable and can be passed configuration values just like they are passed to any other configurable component. In the above example, events are passed to the HostInterceptor first, and the events returned by the HostInterceptor are then passed along to the TimestampInterceptor. You can specify either the fully qualified class name (FQCN) or the alias timestamp. If you have multiple collectors writing to the same HDFS path, then you could also use the HostInterceptor.

Timestamp Interceptor
kerberosPrincipal | – | Kerberos user principal for accessing secure HBase
kerberosKeytab | – | Kerberos keytab for accessing secure HBase

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1

AsyncHBaseSink

This sink writes data to HBase using an asynchronous model. A class implementing AsyncHbaseEventSerializer, which is specified by the configuration, is used to convert the events into HBase puts and/or increments. These puts and increments are then written to HBase. This sink uses the Asynchbase API to write to HBase. This sink provides the same consistency guarantees as HBase, which is currently row-wise atomicity. In the event of HBase failing to write certain events, the sink will replay all events in that transaction. The type is the FQCN: org.apache.flume.sink.hbase.AsyncHBaseSink. Required properties are in bold:

Property Name | Default | Description
channel | – |
type | – | The component type name, needs to be asynchbase
table | – | The name of the table in HBase to write to
zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml
znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml
columnFamily | – | The column family in HBase to write to
Sample log4j.properties file configured to use Avro serialization:

#...
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = example.com
log4j.appender.flume.Port = 41414
log4j.appender.flume.AvroReflectionEnabled = true
log4j.appender.flume.AvroSchemaUrl = hdfs://namenode/path/to/schema.avsc

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

Load Balancing Log4J Appender

Appends Log4j events to a list of flume agents' avro sources. A client using this appender must have the flume-ng-sdk in the classpath (eg, flume-ng-sdk-1.5.2.jar). This appender supports a round-robin and random scheme for performing the load balancing. It also supports a configurable backoff timeout so that down agents are removed temporarily from the set of hosts. Required properties are in bold:

Property Name | Default | Description
Hosts | – | A space-separated list of host:port at which Flume (through an AvroSource) is listening for events
Selector | ROUND_ROBIN | Selection mechanism. Must be either ROUND_ROBIN, RANDOM or the custom FQCN of a class that inherits from LoadBalancingSelector.
MaxBackoff | – | A long value representing the maximum amount of time in milliseconds the load balancing client will back off from a node that has failed to consume an event. Defaults to no backoff.
UnsafeMode | false | If true, the appender will not throw exceptions on failure to send the events.
AvroReflectionEnabled | false | Use Avro Reflection to serialize Log4j events.
AvroSchemaUrl | – | A URL from which the Avro schema can be retrieved.
For convenience, two serializers are provided with Flume. The SimpleHbaseEventSerializer (org.apache.flume.sink.hbase.SimpleHbaseEventSerializer) writes the event body as-is to HBase, and optionally increments a column in HBase. This is primarily an example implementation. The RegexHbaseEventSerializer (org.apache.flume.sink.hbase.RegexHbaseEventSerializer) breaks the event body based on the given regex and writes each part into different columns.

The type is the FQCN: org.apache.flume.sink.hbase.HBaseSink. Required properties are in bold:

Property Name | Default | Description
channel | – |
type | – | The component type name, needs to be hbase
table | – | The name of the table in HBase to write to
columnFamily | – | The column family in HBase to write to
zookeeperQuorum | – | The quorum spec. This is the value for the property hbase.zookeeper.quorum in hbase-site.xml
znodeParent | /hbase | The base path for the znode for the -ROOT- region. Value of zookeeper.znode.parent in hbase-site.xml
batchSize | 100 | Number of events to be written per txn.
coalesceIncrements | false | Should the sink coalesce multiple increments to a cell per batch. This might give better performance if there are multiple increments to a limited number of cells.
serializer | org.apache.flume.sink.hbase.SimpleHbaseEventSerializer | Default increment column = "iCol", payload column = "pCol".
serializer.* | – | Properties to be passed to the serializer.
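For example, a sketch of an HBase sink using the RegexHbaseEventSerializer to split an event body into two columns; the regex and column names here are hypothetical, passed through as serializer.* properties:

a1.sinks.k1.type = hbase
a1.sinks.k1.channel = c1
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
# hypothetical regex splitting the body on the first colon into two columns
a1.sinks.k1.serializer.regex = (.*):(.*)
a1.sinks.k1.serializer.colNames = key,value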
The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events. Using this sink requires hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster. Note that a version of Hadoop that supports the sync() call is required.

The following are the escape sequences supported:

Alias | Description
%{host} | Substitute value of event header named "host". Arbitrary header names are supported.
%t | Unix time in milliseconds
%a | locale's short weekday name (Mon, Tue, ...)
%A | locale's full weekday name (Monday, Tuesday, ...)
%b | locale's short month name (Jan, Feb, ...)
%B | locale's long month name (January, February, ...)
%c | locale's date and time (Thu Mar 3 23:05:25 2005)
%d | day of month (01)
%D | date; same as %m/%d/%y
%H | hour (00..23)
%I | hour (01..12)
%j | day of year (001..366)
%k | hour (0..23)
%m | month (01..12)
%M | minute (00..59)
%p | locale's equivalent of am or pm
%s | seconds since 1970-01-01 00:00:00 UTC
%S | second (00..59)
%y | last two digits of year (00..99)
%Y | year (2010)
%z | +hhmm numeric timezone (for example, -0400)

The file in use will have the name mangled to include ".tmp" at the end. Once the file is closed, this extension is removed. This allows excluding partially complete files in the directory. Required properties are in bold.
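As an illustration of these escape sequences, a minimal sketch of an HDFS sink path that buckets events by host and by day (the path layout is only an example; a timestamp header must be present, or hdfs.useLocalTimeStamp set to true):

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# one directory per originating host and per day
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/%{host}/%y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events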
might define several named agents; when a given Flume process is launched a flag is passed telling it which named agent to manifest.

Given this configuration file, we can start Flume as follows:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the console and we go without a custom environment script.

From a separate terminal, we can then telnet port 44444 and send Flume an event:

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK

The original Flume terminal will output the event in a log message:

12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D Hello world! }

Congratulations - you've successfully configured and deployed a Flume agent! Subsequent sections cover agent configuration in much more detail.

Installing third-party plugins

Flume has a fully plugin-based architecture.
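For reference, a sketch of the plugins.d layout used for installing such plugins (the plugin name and file names are hypothetical; each plugin directory may contain lib, libext and native subdirectories):

plugins.d/
plugins.d/custom-source-1.0/
plugins.d/custom-source-1.0/lib/my-source.jar
plugins.d/custom-source-1.0/libext/spring-core-2.5.6.jar
plugins.d/custom-source-1.0/native/gettext.so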
Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data flows.

Configuring individual components

Each component (source, sink or channel) in the flow has a name, type, and set of properties that are specific to the type and instantiation. For example, an Avro source needs a hostname (or IP address) and a port number to receive data from. A memory channel can have a max queue size ("capacity"), and an HDFS sink needs to know the file system URI, path to create files, frequency of file rotation ("hdfs.rollInterval") etc. All such attributes of a component need to be set in the properties file of the hosting Flume agent.

Wiring the pieces together

The agent needs to know what individual components to load and how they are connected in order to constitute the flow. This is done by listing the names of each of the sources, sinks and channels in the agent, and then specifying the connecting channel for each sink and source. For example, an agent flows events from an Avro source called avroWeb to an HDFS sink hdfs-cluster1 via a file channel called file-channel. The configuration file will contain names of these components and file-channel as a shared channel for both the avroWeb source and the hdfs-cluster1 sink.

Starting an agent

An agent is started using a shell script called flume-ng which is located in the bin directory of the Flume distribution.
polling for the config file even if the config file is not found at the expected location. Otherwise, the Flume agent will terminate if the config doesn't exist at the expected location. No property value is needed when setting this property (eg, just specifying -Dflume.called.from.service is enough).

Property: flume.called.from.service

Flume periodically polls (every 30 seconds) for changes to the specified config file. A Flume agent loads a new configuration from the config file if either an existing file is polled for the first time, or if an existing file's modification date has changed since the last time it was polled. Renaming or moving a file does not change its modification time. When a Flume agent polls a non-existent file, then one of two things happens:

1. When the agent polls a non-existent config file for the first time, the agent behaves according to the flume.called.from.service property. If the property is set, then the agent will continue polling (always at the same period - every 30 seconds). If the property is not set, then the agent immediately terminates.

2. When the agent polls a non-existent config file and this is not the first time the file is polled, the agent makes no config changes for this polling period. The agent continues polling rather than terminating.

Log4J Appender

Appends Log4j events to a flume agent's avro source. A client using this appender must have the flume-ng-sdk in the classpath (eg, flume-ng-sdk-1.5.2.jar).
Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.twitter.TwitterSource
a1.sources.r1.channels = c1
a1.sources.r1.consumerKey = YOUR_TWITTER_CONSUMER_KEY
a1.sources.r1.consumerSecret = YOUR_TWITTER_CONSUMER_SECRET
a1.sources.r1.accessToken = YOUR_TWITTER_ACCESS_TOKEN
a1.sources.r1.accessTokenSecret = YOUR_TWITTER_ACCESS_TOKEN_SECRET
a1.sources.r1.maxBatchSize = 10
a1.sources.r1.maxBatchDurationMillis = 200

Event Deserializers

The following event deserializers ship with Flume.

LINE

This deserializer generates one event per line of text input.

Property Name | Default | Description
deserializer.maxLineLength | 2048 | Maximum number of characters to include in a single event. If a line exceeds this length, it is truncated, and the remaining characters on the line will appear in a subsequent event.
deserializer.outputCharset | UTF-8 | Charset to use for encoding events put into the channel.

AVRO

This deserializer is able to read an Avro container file, and it generates one event per Avro record in the file. Each event is annotated with a header that indicates the schema used. The body of the event is the binary Avro record data, not including the schema or the rest of the container file elements.

Note that if the spool directory source must retry putting one of these events onto a channel (for example, because the
a1.sinks.k1.sink.serializer = text
a1.sinks.k1.sink.serializer.appendNewline = false

Avro Event Serializer

Alias: avro_event. This interceptor serializes Flume events into an Avro container file. The schema used is the same schema used for Flume events in the Avro RPC mechanism. This serializer inherits from the AbstractAvroEventSerializer class. Configuration options are as follows:

Property Name | Default | Description
syncIntervalBytes | 2048000 | Avro sync interval, in approximate bytes.
compressionCodec | null | Avro compression codec. For supported codecs, see Avro's CodecFactory docs.

Example for agent named a1:

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.serializer = avro_event
a1.sinks.k1.serializer.compressionCodec = snappy

Flume Interceptors

Flume has the capability to modify/drop events in flight. This is done with the help of interceptors. Interceptors are classes that implement the org.apache.flume.interceptor.Interceptor interface. An interceptor can modify or even drop events based on any criteria chosen by the developer of the interceptor. Flume supports chaining of interceptors. This is made possible by specifying the list of interceptor builder class names in the configuration. Interceptors are specified as a whitespace-separated list in the source configuration. The order in which the interceptors are specified is the order in which they are invoked.
will also follow file rotation.

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

The 'shell' config is used to invoke the 'command' through a command shell (such as Bash or Powershell). The 'command' is passed as an argument to 'shell' for execution. This allows the 'command' to use features from the shell such as wildcards, back ticks, pipes, loops, conditionals etc. In the absence of the 'shell' config, the 'command' will be invoked directly. Common values for 'shell': '/bin/sh -c', '/bin/ksh -c', 'cmd /c', 'powershell -Command', etc.

agent_foo.sources.tailsource-1.type = exec
agent_foo.sources.tailsource-1.shell = /bin/bash -c
agent_foo.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done

JMS Source

JMS Source reads messages from a JMS destination such as a queue or topic. Being a JMS application it should work with any JMS provider but has only been tested with ActiveMQ. The JMS source provides configurable batch size, message selector, user/pass, and message-to-flume-event converter. Note that the vendor-provided JMS jars should be included in the Flume classpath using the plugins.d directory (preferred), -classpath on the command line, or via the FLUME_CLASSPATH variable in flume-env.sh.

Required properties are in bold:

Property Name | Default | Description
channels | – |
The multiplexing selector has a further set of properties to bifurcate the flow. This requires specifying a mapping of an event attribute to a set of channels. The selector checks for each configured attribute in the event header. If it matches the specified value, then that event is sent to all the channels mapped to that value. If there's no match, then the event is sent to the set of channels configured as default:

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...
<Agent>.sources.<Source1>.selector.default = <Channel2>

The mapping allows overlapping the channels for each value.

The following example has a single flow that is multiplexed to two paths. The agent named agent_foo has a single avro source and two channels linked to two sinks:

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2
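Continuing this sketch, the selector configuration for the source might look as follows (the header name and mapped values are hypothetical, following the mapping template above):

# hypothetical multiplexing selector on a 'State' header
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1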
Required properties are in bold:

Property Name | Default | Description
Hostname | – | The hostname on which a remote Flume agent is running with an avro source.
Port | – | The port at which the remote Flume agent's avro source is listening.
UnsafeMode | false | If true, the appender will not throw exceptions on failure to send the events.
AvroReflectionEnabled | false | Use Avro Reflection to serialize Log4j events. (Do not use when users log strings.)
AvroSchemaUrl | – | A URL from which the Avro schema can be retrieved.

Sample log4j.properties file:

#...
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = example.com
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = true

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

By default each event is converted to a string by calling toString(), or by using the Log4j layout, if specified.

If the event is an instance of org.apache.avro.generic.GenericRecord or org.apache.avro.specific.SpecificRecord, or if the property AvroReflectionEnabled is set to true, then the event will be serialized using Avro serialization.

Serializing every event with its Avro schema is inefficient, so it is good practice to provide a schema URL from which the schema can be retrieved by the downstream sink, typically the HDFS sink. If AvroSchemaUrl is not specified, then the schema will be included as a Flume header.
batchSize | 100 | Number of events to be written per txn.
coalesceIncrements | false | Should the sink coalesce multiple increments to a cell per batch. This might give better performance if there are multiple increments to a limited number of cells.
timeout | 60000 | The length of time (in milliseconds) the sink waits for acks from hbase for all events in a transaction.
serializer | org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer |
serializer.* | – | Properties to be passed to the serializer.

Note that this sink takes the Zookeeper Quorum and parent znode information in the configuration. The Zookeeper Quorum and parent node configuration may be specified in the flume configuration file. Alternatively, these configuration values are taken from the first hbase-site.xml file in the classpath.

If these are not provided in the configuration, then the sink will read this information from the first hbase-site.xml file in the classpath.

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = asynchbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
a1.sinks.k1.channel = c1

MorphlineSolrSink

This sink extracts data from Flume events, transforms it, and loads it in near-real-time into Apache Solr servers, which in turn serve queries to end users or search applications.

This sink is well suited for use cases that stream raw data into HDFS (via the HDFS sink) and simultaneously extract, transform and load the same data into Solr (via MorphlineSolrSink).
written out to a ByteArrayOutputStream wrapped in an ObjectOutputStream, and the resulting array is copied to the body of the FlumeEvent.

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE

Spooling Directory Source

This source lets you ingest data by placing files to be ingested into a "spooling" directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

1. If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
2. If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
channel. This is done in the same hierarchical namespace fashion where you set the component type and other values for the properties specific to each component:

# properties for sources
<Agent>.sources.<Source>.<someProperty> = <someValue>

# properties for channels
<Agent>.channels.<Channel>.<someProperty> = <someValue>

# properties for sinks
<Agent>.sinks.<Sink>.<someProperty> = <someValue>

The property "type" needs to be set for each component for Flume to understand what kind of object it needs to be. Each source, sink and channel type has its own set of properties required for it to function as intended. All those need to be set as needed. In the previous example, we have a flow from avro-AppSrv-source to hdfs-Cluster1-sink through the memory channel mem-channel-1. Here's an example that shows configuration of each of those components:

agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1

# set channel for sources, sinks

# properties of avro-AppSrv-source
agent_foo.sources.avro-AppSrv-source.type = avro
agent_foo.sources.avro-AppSrv-source.bind = localhost
agent_foo.sources.avro-AppSrv-source.port = 10000

# properties of mem-channel-1
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100

# properties of hdfs-Cluster1-sink
Sample log4j.properties file configured using defaults:

#...
log4j.appender.out2 = org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.out2.Hosts = localhost:25430 localhost:25431

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

Sample log4j.properties file configured using RANDOM load balancing:

#...
log4j.appender.out2 = org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.out2.Hosts = localhost:25430 localhost:25431
log4j.appender.out2.Selector = RANDOM

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

Sample log4j.properties file configured using backoff:

#...
log4j.appender.out2 = org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
log4j.appender.out2.Hosts = localhost:25430 localhost:25431 localhost:25432
log4j.appender.out2.Selector = ROUND_ROBIN
log4j.appender.out2.MaxBackoff = 30000

# configure a class's logger to output to the flume appender
log4j.logger.org.example.MyClass = DEBUG,flume
#...

Security

The HDFS sink supports Kerberos authentication; configuring the HD
a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.byteCapacity = 800000
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

To disable the use of the in-memory queue and function like a file channel:

a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 0
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

To disable the use of overflow disk and function purely as an in-memory channel:

a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 100000
a1.channels.c1.overflowCapacity = 0

Pseudo Transaction Channel

Warning: The Pseudo Transaction Channel is only for unit testing purposes and is NOT meant for production use.

Required properties are in bold:

Property Name | Default | Description
type | – | The component type name, needs to be org.apache.flume.channel.PseudoTxnMemoryChannel
capacity | 50 | The max number of events stored in the channel
keep-alive | 3 | Timeout in seconds for adding or removing an event

Custom Channel

A custom channel is your own implementation of the Channel interface. A custom channel's class and its dependencies must be included in the agent's classpath when starting the Flume agent.
This creates sub-topologies which may themselves include aggregation points.

Sizing a Flume deployment

Once you have an idea of what your topology will look like, the next question is how much hardware and networking capacity is needed. This starts by quantifying how much data you generate. That is not always a simple task! Most data streams are bursty (for instance, due to diurnal patterns) and potentially unpredictable. A good starting point is to think about the maximum throughput you'll have in each tier of the topology, both in terms of events per second and bytes per second. Once you know the required throughput of a given tier, you can calculate a lower bound on how many nodes you require for that tier. To determine attainable throughput, it's best to experiment with Flume on your hardware, using synthetic or sampled event data. In general, disk-based channels should get 10's of MB/s and memory-based channels should get 100's of MB/s or more. Performance will vary widely, however, depending on hardware and operating environment.

Sizing aggregate throughput gives you a lower bound on the number of nodes you will need for each tier. There are several reasons to have additional nodes, such as increased redundancy and better ability to absorb bursts in load.

Troubleshooting

Handling agent failures

If the Flume agent goes down, then all the flows hosted on that agent are aborted. Once the agent is restarted, then the flow will resume.
optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

[Figure: a web server delivering events to a Flume agent (source, channel, sink) which stores them in HDFS]

A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift RPC Client or Thrift clients written in any language generated from the Flume thrift protocol. When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it's consumed by a Flume sink. The file channel is one example - it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.

Complex flows

Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4

Custom Channel Selector

A custom channel selector is your own implementation of the ChannelSelector interface. A custom channel selector's class and its dependencies must be included in the agent's classpath when starting the Flume agent. The type of the custom channel selector is its FQCN.

Property Name | Default | Description
selector.type | – | The component type name, needs to be your FQCN

Example for agent named a1 and its source called r1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.selector.type = org.example.MyChannelSelector

Flume Sink Processors

Sink groups allow users to group multiple sinks into one entity. Sink processors can be used to provide load balancing capabilities over all sinks inside the group or to achieve fail-over from one sink to another in case of temporal failure.

Required properties are in bold:

Property Name | Default | Description
sinks | – | Space-separated list of sinks that are participating in the group
processor.type | default | The component type name, needs to be default, failover or load_balance

Example for agent named a1:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance

Default Sink Processor

Default sink processor accepts only a single sink. The user is not forced to create a processor (sink group) for single sinks. Instead the user can follow the source - channel - sink pattern explained earlier in this guide.
org.apache.flume.channel.file.encryption.AESCTRNoPaddingProvider
org.example.MyCipherProvider
org.apache.flume.serialization.BodyTextEventSerializer$Builder
org.apache.flume.serialization.FlumeEventAvroEventSerializer$Builder
org.example.MyEventSerializer$Builder

These conventions for alias names are used in the component-specific examples above, to keep the names short and consistent across all examples:

Alias Name | Alias Type
a | agent
c | channel
r | source
k | sink
g | sink group
i | interceptor
key | key
h | host
s | serializer
To setup a multi-tier flow, you need to have an avro/thrift sink of the first hop pointing to an avro/thrift source of the next hop. This will result in the first Flume agent forwarding events to the next Flume agent. For example, if you are periodically sending files (1 file per event) using an avro client to a local Flume agent, then this local agent can forward it to another agent that has the storage mounted.

Weblog agent config:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = avro-forward-sink
agent_foo.channels = file-channel

# define the flow
agent_foo.sources.avro-AppSrv-source.channels = file-channel
agent_foo.sinks.avro-forward-sink.channel = file-channel

# avro sink properties
agent_foo.sinks.avro-forward-sink.type = avro
agent_foo.sinks.avro-forward-sink.hostname = 10.1.1.100
agent_foo.sinks.avro-forward-sink.port = 10000

# configure other pieces

HDFS agent config:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-collection-source
agent_foo.sinks = hdfs-sink
agent_foo.channels = mem-channel

# define the flow
agent_foo.sources.avro-collection-source.channels = mem-channel
agent_foo.sinks.hdfs-sink.channel = mem-channel

# avro source properties
agent_foo.sources.avro-collection-source.type = avro
agent_foo.sources.avro-collection-source.bind = 10.1.1.100
agent_foo.sources.avro-collection-source.port = 10000

# configure other pieces
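Assuming these two configurations are saved as weblog.config and hdfs.config (hypothetical file names), the two agents could be launched along these lines:

$ bin/flume-ng agent --conf conf --conf-file weblog.config --name agent_foo
$ bin/flume-ng agent --conf conf --conf-file hdfs.config --name agent_foo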
selector.type | – | replicating or multiplexing
selector.* | replicating | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors
interceptors.* | – |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.example.MySource
a1.sources.r1.channels = c1

Scribe Source

Scribe is another type of ingest system. To adopt an existing Scribe ingest system, Flume should use the ScribeSource, which is based on Thrift with a compatible transfer protocol. For deployment of Scribe please follow the guide from Facebook. Required properties are in bold:

Property Name | Default | Description
type | – | The component type name, needs to be org.apache.flume.source.scribe.ScribeSource
port | 1499 | Port that Scribe should connect to
workerThreads | 5 | Number of handler threads in Thrift
selector.type | – |
selector.* | – |

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.scribe.ScribeSource
a1.sources.r1.port = 1463
a1.sources.r1.workerThreads = 5
a1.sources.r1.channels = c1

Flume Sinks

HDFS Sink

This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events. It also buckets/partitions data by attributes like timestamp or machine where the event originated.
specified.
ipFilter | false | Set this to true to enable ipFiltering for netty
ipFilter.rules | – | Define N netty ipFilter pattern rules with this config.

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

Example of ipFilter rules:

ipFilter.rules defines N netty ipFilters separated by a comma. A pattern rule must be in this format: <'allow' or 'deny'>:<'ip' or 'name' for computer name>:<pattern>, i.e. allow/deny:ip/name:pattern

example: ipFilter.rules = allow:ip:127.*,allow:name:localhost,deny:ip:*

Note that the first rule to match will apply, as the examples below show for a client on the localhost.

This will allow the client on localhost and deny clients from any other ip: "allow:name:localhost,deny:ip:*"

This will deny the client on localhost and allow clients from any other ip: "deny:name:localhost,allow:ip:*"

Thrift Source

Listens on a Thrift port and receives events from external Thrift client streams. When paired with the built-in ThriftSink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties are in bold:

Property Name | Default | Description
channels | – |
type | – | The component type name, needs to be thrift
bind | – | hostname or IP address to listen on
port | – | Port # to bind to
threads | – | Maximum number of worker threads to spawn
selector.type | – |
selector.* | – |
interceptors | – | Space-separated list of interceptors
interceptors.* | – |
supports a number of mechanisms to ingest data from external sources.

RPC

An Avro client included in the Flume distribution can send a given file to a Flume Avro source using the avro RPC mechanism:

$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10

The above command will send the contents of /usr/logs/log.10 to the Flume source listening on that port.

Executing commands

There's an exec source that executes a given command and consumes the output - a single 'line' of output, i.e. text followed by carriage return ('\r') or line feed ('\n') or both together.

Note: Flume does not support tail as a source. One can wrap the tail command in an exec source to stream the file.

Network streams

Flume supports the following mechanisms to read data from popular log stream types, such as:

1. Avro
2. Thrift
3. Syslog
4. Netcat

Setting multi-agent flow

[Figure: two agents, "foo" and "bar", connected sink-to-source]

In order to flow the data across multiple agents or hops, the sink of the previous agent and the source of the current hop need to be avro type, with the sink pointing to the hostname (or IP address) and port of the source.

Consolidation

A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers sent to a dozen agents that write to an HDFS cluster.

[Figure: many first-tier agents consolidating into a single second-tier agent]

This can be achieved in Flume by configuring
append a static header with a static value to all events.

The current implementation does not allow specifying multiple headers at one time. Instead the user might chain multiple static interceptors, each defining one static header.

Property Name | Default | Description
type | – | The component type name, has to be static
preserveExisting | true | If the configured header already exists, should it be preserved - true or false
key | key | Name of header that should be created
value | value | Static value that should be created

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK

UUID Interceptor

This interceptor sets a universally unique identifier on all events that are intercepted. An example UUID is b5755073-77a9-43c1-8fad-b7a586fc1b97, which represents a 128-bit value.

Consider using UUIDInterceptor to automatically assign a UUID to an event if no application-level unique key for the event is available. It can be important to assign UUIDs to events as soon as they enter the Flume network; that is, in the first Flume Source of the flow. This enables subsequent deduplication of events in the face of replication and redelivery in a Flume network that is designed for high availability and high performance. If an application-level key is
channels | – |
type | – | The component type name, needs to be multiport_syslogtcp
host | – | Host name or IP address to bind to.
ports | – | Space-separated list (one or more) of ports to bind to.
eventSize | 2500 | Maximum size of a single event line, in bytes.
keepFields | false | Setting this to true will preserve the Priority, Timestamp and Hostname in the body of the event.
portHeader | – | If specified, the port number will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the incoming port.
charset.default | UTF-8 | Default character set used while parsing syslog events into strings.
charset.port.<port> | – | Character set is configurable on a per-port basis.
batchSize | 100 | Maximum number of events to attempt to process per request loop. Using the default is usually fine.
readBufferSize | 1024 | Size of the internal Mina read buffer. Provided for performance tuning. Using the default is usually fine.
numProcessors | (auto-detected) | Number of processors available on the system for use while processing messages. Default is to auto-detect # of CPUs using the Java Runtime API. Mina will spawn 2 request-processing threads per detected CPU, which is often reasonable.
selector.type | replicating | replicating, multiplexing, or custom
selector.* | – | Depends on the selector.type value
interceptors | – | Space-separated list of interceptors.
interceptors.* | – |

For example, a multiport syslog TCP source for agent named a1 is sketched below.
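A minimal sketch, assuming three arbitrarily chosen ports:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.channels = c1
a1.sources.r1.host = 0.0.0.0
# port numbers are illustrative
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.portHeader = port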
db.type | DERBY | Database vendor, needs to be DERBY
driver.class | – | Class for vendor's JDBC driver
driver.url | – | JDBC connection URL
db.username | – | User id for db connection
db.password | – | Password for db connection
connection.properties.file | – | JDBC connection property file path
create.schema | – | If true, then creates db schema if not there
create.index | – | Create indexes to speed up lookups
transaction.isolation | – | Isolation level for db session: READ_UNCOMMITTED, READ_COMMITTED, SERIALIZABLE, REPEATABLE_READ
maximum.connections | – | Max connections allowed to db
maximum.capacity | – | Max number of events in the channel
sysprop.* | – | DB vendor specific properties
sysprop.user.home | – | Home path to store embedded Derby database

type | – | The component type name, needs to be file
checkpointDir | – | The directory where the checkpoint file will be stored
useDualCheckpoints | – | Backup the checkpoint. If this is set to true, backupCheckpointDir must be set
backupCheckpointDir | – | The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory
dataDirs | – | Comma-separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance
transactionCapacity | – | The maximum size of transaction supported by the channel
checkpointInterval | – | Amount of time (in millis) between checkpoints
maxFileSize | – | Max size (in bytes) of a single log file
minimumRequiredSpace | – | Minimum required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value
capacity | – | Maximum capacity of the channel
keep-alive | – | Amount of time (in sec) to wait for a put operation
use-log-replay-v1 | – | Expert: Use old replay logic
use-fast-replay | – | Expert: Replay without using queue
encryption.activeKey | – | Key name used to encrypt new data
encryption.cipherProvider | – | Cipher provider type, supported
org.apache.flume.source.SyslogTcpSource
org.apache.flume.source.MultiportSyslogTCPSource
org.apache.flume.source.SyslogUDPSource
org.apache.flume.source.SpoolDirectorySource
org.apache.flume.source.http.HTTPSource
org.apache.flume.source.ThriftSource
org.apache.flume.source.jms.JMSSource
org.apache.flume.source.avroLegacy.AvroLegacySource
org.apache.flume.source.thriftLegacy.ThriftLegacySource
org.example.MySource
org.apache.flume.sink.NullSink
org.apache.flume.sink.LoggerSink
org.apache.flume.sink.AvroSink
org.apache.flume.sink.hdfs.HDFSEventSink
org.apache.flume.sink.hbase.HBaseSink
org.apache.flume.sink.hbase.AsyncHBaseSink
org.apache.flume.sink.elasticsearch.ElasticSearchSink
org.apache.flume.sink.RollingFileSink
org.apache.flume.sink.irc.IRCSink
org.apache.flume.sink.ThriftSink
org.example.MySink
org.apache.flume.channel.ReplicatingChannelSelector
org.apache.flume.channel.MultiplexingChannelSelector
org.example.MyChannelSelector
org.apache.flume.sink.DefaultSinkProcessor
org.apache.flume.sink.FailoverSinkProcessor
org.apache.flume.sink.LoadBalancingSinkProcessor
org.apache.flume.interceptor.TimestampInterceptor$Builder
org.apache.flume.interceptor.HostInterceptor$Builder
org.apache.flume.interceptor.StaticInterceptor$Builder
org.apache.flume.interceptor.RegexFilteringInterceptor$Builder
org.apache.flume.interceptor.RegexExtractorInterceptor$Builder
org.apache.flume.channel.file.encryption.JCEFileKeyProvider
org.example.MyKeyPr
roperties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata

Adding multiple flows in an agent

A single Flume agent can contain several independent flows. You can list multiple sources, sinks and channels in a config. These components can be linked to form multiple flows:

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

Then you can link the sources and sinks to their corresponding channels (for sources) or channel (for sinks) to set up two different flows. For example, if you need to set up two flows in an agent, one going from an external avro client to external HDFS and another from the output of a tail to an avro sink, then here's a config to do that:

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

Configuring a multi agent flow
s. Here we link the avro-forward sink from the weblog agent to the avro collection source of the hdfs agent. This will result in the events coming from the external appserver source eventually getting stored in HDFS.

Fan out flow

As discussed in the previous section, Flume supports fanning out the flow from one source to multiple channels. There are two modes of fan out: replicating and multiplexing. In the replicating flow, the event is sent to all the configured channels. In the case of multiplexing, the event is sent to only a subset of qualifying channels. To fan out the flow, one needs to specify a list of channels for a source and the policy for fanning it out. This is done by adding a channel "selector" that can be replicating or multiplexing. Then further specify the selection rules if it's a multiplexer. If you don't specify a selector, then by default it's replicating:

# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>

# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>

<Agent>.sources.<Source1>.selector.type = replicating

The mu
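For the multiplexing case, the selector routes on an event header. In the same placeholder notation (the header name and mapped values are stand-ins):

<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.default = <Channel2>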
sh table where each hash table entry contains a String key and a list of Java Objects as values. (The implementation uses Guava's ArrayListMultimap, which is a ListMultimap.) Note that a field can have multiple values, and any two records need not use common field names.

This sink fills the body of the Flume event into the _attachment_body field of the morphline record, as well as copies the headers of the Flume event into record fields of the same name. The commands can then act on this data.

Routing to a SolrCloud cluster is supported to improve scalability. Indexing load can be spread across a large number of MorphlineSolrSinks for improved scalability. Indexing load can be replicated across multiple MorphlineSolrSinks for high availability, for example using Flume features such as the Load balancing Sink Processor. MorphlineInterceptor can also help to implement dynamic routing to multiple Solr collections (e.g. for multi-tenancy).

The morphline and solr jars required for your environment must be placed in the lib directory of the Apache Flume installation.

The type is the FQCN: org.apache.flume.sink.solr.morphline.MorphlineSolrSink

Required properties are in bold.

Property Name  Default  Description
channel        –
type           –        The component type name, needs to be org.apache.flume.sink.solr.morphline.MorphlineSolrSink
morphlineFile  –        The relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf
kite.repo.uri           –    URI of the repository to open
kite.dataset.name       –    Name of the Dataset where records will be written
kite.batchSize          100  Number of records to process in each batch
kite.rollInterval       30   Maximum wait time (seconds) before data files are released
auth.kerberosPrincipal  –    Kerberos user principal for secure authentication to HDFS
auth.kerberosKeytab     –    Kerberos keytab location (local FS) for the principal
auth.proxyUser          –    The effective user for HDFS actions, if different from the kerberos principal

Custom Sink

A custom sink is your own implementation of the Sink interface. A custom sink's class and its dependencies must be included in the agent's classpath when starting the Flume agent. The type of the custom sink is its FQCN. Required properties are in bold.

Property Name  Default  Description
channel        –
type           –        The component type name, needs to be your FQCN

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.example.MySink
a1.sinks.k1.channel = c1

Flume Channels

Channels are the repositories where events are staged on an agent. Sources add events and sinks remove them.

Memory Channel

The events are stored in an in-memory queue with configurable max size. It's ideal for flows that need higher throughput and are prepared to lose the staged data in the event of agent failures. Required properties are in bold.

Property Name  Default  Description
type           –        The component type name, needs to be memory
capacity       100      The maximum number of events stored in the channel
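A minimal memory channel declaration for an agent named a1 might look as follows; the capacity values are illustrative, and transactionCapacity is a further property of this channel not shown above:

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000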
st determine the version of elasticsearch and the JVM version the target cluster is running. Then select an elasticsearch client library which matches the major version: a 0.19.x client can talk to a 0.19.x cluster; 0.20.x can talk to 0.20.x; and 0.90.x can talk to 0.90.x. Once the elasticsearch version has been determined, read the pom.xml file to determine the correct lucene-core JAR version to use. The Flume agent which is running the ElasticSearchSink should also match the JVM the target cluster is running, down to the minor version.

Events will be written to a new index every day. The name will be <indexName>-yyyy-MM-dd, where <indexName> is the indexName parameter. The sink will start writing to a new index at midnight UTC.

Events are serialized for elasticsearch by the ElasticSearchLogStashEventSerializer by default. This behaviour can be overridden with the serializer parameter. This parameter accepts implementations of org.apache.flume.sink.elasticsearch.ElasticSearchEventSerializer or org.apache.flume.sink.elasticsearch.ElasticSearchIndexRequestBuilderFactory. Implementing ElasticSearchEventSerializer is deprecated in favour of the more powerful ElasticSearchIndexRequestBuilderFactory.

The type is the FQCN: org.apache.flume.sink.elasticsearch.ElasticSearchSink

Required properties are in bold.

Property Name  Default  Description
channel        –
type           –        The component type name, needs to be org.apache.flume.sink.elasticsearch.ElasticSearchSink
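A representative ElasticSearchSink configuration might look as follows; the hostNames, indexName, clusterName and batchSize properties are drawn from this sink's full property list (not all of it is shown above), and the values are placeholders:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
a1.sinks.k1.hostNames = 127.0.0.1:9200,127.0.0.2:9300
a1.sinks.k1.indexName = foo_index
a1.sinks.k1.clusterName = foobar_cluster
a1.sinks.k1.batchSize = 500
a1.sinks.k1.channel = c1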
t suitable for very large objects because it buffers up the entire BLOB in RAM.

Property Name          Default    Description
handler                –          The FQCN of this class: org.apache.flume.sink.solr.morphline.BlobHandler
handler.maxBlobLength  100000000  The maximum number of bytes to read and buffer for a given request

Legacy Sources

The legacy sources allow a Flume 1.x agent to receive events from Flume 0.9.4 agents. A legacy source accepts events in the Flume 0.9.4 format, converts them to the Flume 1.0 format, and stores them in the connected channel. The 0.9.4 event properties like timestamp, pri, host, nanos, etc. get converted to 1.x event header attributes. The legacy source supports both Avro and Thrift RPC connections. To use this bridge between two Flume versions, you need to start a Flume 1.x agent with the avroLegacy or thriftLegacy source. The 0.9.4 agent should have the agent Sink pointing to the host/port of the 1.x agent.

Note: The reliability semantics of Flume 1.x are different from those of Flume 0.9.x. The E2E or DFO mode of a Flume 0.9.x agent will not be supported by the legacy source. The only supported 0.9.x mode is best effort, though the reliability setting of the 1.x flow will be applicable to the events once they are saved into the Flume 1.x channel by the legacy source.

Required properties are in bold.

Avro Legacy Source

Property Name  Default  Description
channels       –
type           –        The component type name, needs to be org.apache.flume.source.avroLegacy.AvroLegacySource
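A sketch of an Avro legacy source for an agent named a1; the host and port properties are assumed from this source's property table (which is cut off above), and the values are placeholders:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = org.apache.flume.source.avroLegacy.AvroLegacySource
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1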
interceptors    –    Space-separated list of interceptors
interceptors.*  –

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

Exec Source

Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless the property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means configurations such as cat [named pipe] or tail -F [file] are going to produce the desired results, whereas date will probably not: the former two commands produce streams of data, whereas the latter produces a single event and exits.

Required properties are in bold.

Property Name    Default      Description
channels         –
type             –            The component type name, needs to be exec
command          –            The command to execute
shell            –            A shell invocation used to run the command, e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
restartThrottle  10000        Amount of time (in millis) to wait before attempting a restart
restart          false        Whether the executed cmd should be restarted if it dies
logStdErr        false        Whether the command's stderr should be logged
batchSize        20           The max number of lines to read and send to the channel at a time
selector.type    replicating  replicating or multiplexing
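Following the tail -F pattern described above, a typical exec source for an agent named a1 looks like this (the log path is a placeholder):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1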
  private static final String COUNTER_EVENT_DRAIN_ATTEMPT =
      "sink.event.drain.attempt";
  private static final String COUNTER_EVENT_DRAIN_SUCCESS =
      "sink.event.drain.sucess";

  private static final String[] ATTRIBUTES = {
      COUNTER_CONNECTION_CREATED, COUNTER_CONNECTION_CLOSED,
      COUNTER_CONNECTION_FAILED, COUNTER_BATCH_EMPTY,
      COUNTER_BATCH_UNDERFLOW, COUNTER_BATCH_COMPLETE,
      COUNTER_EVENT_DRAIN_ATTEMPT, COUNTER_EVENT_DRAIN_SUCCESS
  };

  public SinkCounter(String name) {
      super(MonitoredCounterGroup.Type.SINK, name, ATTRIBUTES);
  }

  @Override
  public long getConnectionCreatedCount() {
      return get(COUNTER_CONNECTION_CREATED);
  }

  public long incrementConnectionCreatedCount() {
      return increment(COUNTER_CONNECTION_CREATED);
  }

Topology Design Considerations

Flume is very flexible and allows a large range of possible deployment scenarios. If you plan to use Flume in a large, production deployment, it is prudent to spend some time thinking about how to express your problem in terms of a Flume topology. This section covers a few considerations.

Is Flume a good fit for your problem?

If you need to ingest textual log data into Hadoop/HDFS then Flume is the right fit for your problem, full stop. For other use cases, here are some guidelines:

Flume is designed to transport and ingest regularly-generated event data over relatively stable, potentially complex topologies. The notion of "event data" is very broadly defined. To Flume, an event is just a generic blob of bytes. There are some limitations on how large an event can be.
type                   –        The component type name, needs to be jms
initialContextFactory  –        Initial Context Factory, e.g. org.apache.activemq.jndi.ActiveMQInitialContextFactory
connectionFactory      –        The JNDI name the connection factory should appear as
providerURL            –        The JMS provider URL
destinationName        –        Destination name
destinationType        –        Destination type (queue or topic)
messageSelector        –        Message selector to use when creating the consumer
userName               –        Username for the destination/provider
passwordFile           –        File containing the password for the destination/provider
batchSize              100      Number of messages to consume in one batch
converter.type         DEFAULT  Class to use to convert messages to flume events. See below.
converter.*            –        Converter properties
converter.charset      UTF-8    Default converter only. Charset to use when converting JMS TextMessages to byte arrays.

Converter

The JMS source allows pluggable converters, though it's likely the default converter will work for most purposes. The default converter is able to convert Bytes, Text, and Object messages to FlumeEvents. In all cases, the properties in the message are added as headers to the FlumeEvent.

BytesMessage:

Bytes of message are copied to the body of the FlumeEvent. Cannot convert more than 2GB of data per message.

TextMessage:

Text of message is converted to a byte array and copied to the body of the FlumeEvent. The default converter uses UTF-8 by default, but this is configurable.

ObjectMessage:

Object is written out to a ByteArrayOutputStream wrapped in an ObjectOutputStream, and the resulting byte array is copied to the body of the FlumeEvent.
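Putting the table above together, a JMS source for an agent named a1 might be configured like this (the broker URL, connection factory and destination names are placeholders):

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE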
encryption.keyProvider                       –  Key provider type, supported types: JCEKSFILE
encryption.keyProvider.keyStoreFile          –  Path to the keystore file
encryption.keyProvider.keyStorePasswordFile  –  Path to the keystore password file
encryption.keyProvider.keys                  –  List of all keys (e.g. history of the activeKey setting)
encryption.keyProvider.keys.*.passwordFile   –  Path to the optional key password file

Note: By default the File Channel uses paths for checkpoint and data directories that are within the user home, as specified above. As a result, if you have more than one File Channel instance active within the agent, only one will be able to lock the directories, causing the other channel's initialization to fail. It is therefore necessary that you provide explicit paths to all the configured channels, preferably on different disks. Furthermore, as the file channel will sync to disk after every commit, coupling it with a sink/source that batches events together may be necessary to provide good performance where multiple disks are not available for checkpoint and data directories.

Example for agent named a1:

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

Encryption

Below are a few sample configurations.

Generating a key with a password separate from the key store password:

keytool -genseckey -alias key-0 -keypass keyPassword -keyalg AES \
  -keysize 128 -validity 9000 -keystore test.keystore \
  -storetype jceks -storepass keyStorePassword

Generating a key with the password the same as the key store password:

keytool -genseckey -alias key-1 -keyalg AES -keysize 128 -validity 9000 \
  -keystore test.keystore -storetype jceks -storepass keyStorePassword
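A generated key can then be referenced from the channel's encryption properties. A sketch using the key-0 alias from the keytool commands above (the keystore paths are placeholders):

a1.channels.c1.encryption.activeKey = key-0
a1.channels.c1.encryption.cipherProvider = AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider = JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
a1.channels.c1.encryption.keyProvider.keys = key-0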
u need to list the sources, sinks and channels for the given agent, and then point the source and sink to a channel. A source instance can specify multiple channels, but a sink instance can only specify one channel. The format is as follows:

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>

For example, an agent named agent_foo is reading data from an external avro client and sending it to HDFS via a memory channel. The config file weblog.config could look like:

# list the sources, sinks and channels for the agent
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1

# set channel for source
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1

# set channel for sink
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1

This will make the events flow from avro-appserver-src-1 to hdfs-sink-1 through the memory channel mem-channel-1. When the agent is started with weblog.config as its config file, it will instantiate that flow.

Configuring individual components

After defining the flow, you need to set the properties of each source, sink and channel.
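For instance, giving the avro source from the example above its type and listening address might look like this (the bind address and port are placeholders):

agent_foo.sources.avro-appserver-src-1.type = avro
agent_foo.sources.avro-appserver-src-1.bind = localhost
agent_foo.sources.avro-appserver-src-1.port = 10000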
urce - channel - sink pattern that was explained above in this user guide.

Failover Sink Processor

Failover Sink Processor maintains a prioritized list of sinks, guaranteeing that so long as one is available, events will be processed (delivered).

The failover mechanism works by relegating failed sinks to a pool where they are assigned a cool-down period (increasing with sequential failures) before they are retried. Once a sink successfully sends an event, it is restored to the live pool.

To configure, set a sink group's processor to failover and set priorities for all individual sinks. All specified priorities must be unique. Furthermore, an upper limit to failover time can be set (in milliseconds) using the maxpenalty property.

Required properties are in bold.

Property Name                  Default  Description
sinks                          –        Space-separated list of sinks that are participating in the group
processor.type                 default  The component type name, needs to be failover
processor.priority.<sinkName>  –        <sinkName> must be one of the sink instances associated with the current sink group
processor.maxpenalty           30000    (in millis)

Example for agent named a1:

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

Load balancing Sink Processor

Load balancing sink processor provides the ability to load balance flow over multiple sinks.
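A representative load-balancing group, for contrast with the failover example above (the backoff and selector settings are illustrative):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random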
ute. For example, an event with timestamp 11:54:34 AM, June 12, 2012 will cause the hdfs path to become /flume/events/2012-06-12/1150/00.

Logger Sink

Logs events at INFO level. Typically useful for testing/debugging purposes. Required properties are in bold.

Property Name  Default  Description
channel        –
type           –        The component type name, needs to be logger

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Avro Sink

This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname/port pair. The events are taken from the configured Channel in batches of the configured batch size. Required properties are in bold.

Property Name              Default  Description
channel                    –
type                       –        The component type name, needs to be avro
hostname                   –        The hostname or IP address to connect to
port                       –        The port # to connect to
batch-size                 100      Number of events to batch together for send
connect-timeout            20000    Amount of time (ms) to allow for the first (handshake) request
request-timeout            20000    Amount of time (ms) to allow for requests after the first
reset-connection-interval  none     Amount of time (s) before the connection to the next hop is reset. This will force the Avro Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load balancer when new hosts are added without having to restart the agent.
v3
maxIoWorkers  2 * the number of available processors in the machine  The maximum number of I/O worker threads. This is configured on the NettyAvroRpcClient NioClientSocketChannelFactory.

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

Thrift Sink

This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Thrift events and sent to the configured hostname/port pair. The events are taken from the configured Channel in batches of the configured batch size. Required properties are in bold.

Property Name              Default  Description
channel                    –
type                       –        The component type name, needs to be thrift
hostname                   –        The hostname or IP address to connect to
port                       –        The port # to connect to
batch-size                 100      Number of events to batch together for send
connect-timeout            20000    Amount of time (ms) to allow for the first (handshake) request
request-timeout            20000    Amount of time (ms) to allow for requests after the first
connection-reset-interval  none     Amount of time (s) before the connection to the next hop is reset. This will force the Thrift Sink to reconnect to the next hop. This will allow the sink to connect to hosts behind a hardware load balancer when new hosts are added without having to restart the agent.

Example for agent named a1:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = thrift
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
y the regex. You can plug custom serializer implementations into the extractor using the fully qualified class name (FQCN) to format the matches in any way you like.

Example 1:

If the Flume event body contained 1:2:3.4foobar5 and the following configuration was used:

a1.sources.r1.interceptors.i1.regex = (\d):(\d):(\d)
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three

The extracted event will contain the same body, but the following headers will have been added: one=>1, two=>2, three=>3

Example 2:

If the Flume event body contained 2012-10-18 18:47:57,614 some log line and the following configuration was used:

a1.sources.r1.interceptors.i1.regex = ^(?:\n)?(\d\d\d\d-\d\d-\d\d\s\d\d:\d\d)
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
a1.sources.r1.interceptors.i1.serializers.s1.name = timestamp
a1.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm

The extracted event will contain the same body, but the following header will have been added: timestamp=>1350611220000

Flume Properties

Property Name              Default  Description
flume.called.from.service  –        If this property is specified then the Flume agent will continue polling for the config file even if the config file is not found at the expected location.